# Collaborative filtering practice

In this homework you will test different collaborative filtering (CF) approaches on famous Movielens dataset.

In class we implemented item2item CF, so this time let's use **user2user** approach.

## Task 0: Dataset (5 points)

We had this code in class, so you need to put it here and run.

Split dataset to train and validation parts.

Don't forget to encode users and items from 0 to maximum!

In [None]:
import os

if not (os.path.exists("recsys.zip") or os.path.exists("recsys")):
    !wget https://github.com/nzhinusoftcm/review-on-collaborative-filtering/raw/master/recsys.zip
    !unzip recsys.zip

--2024-04-14 16:00:00--  https://github.com/nzhinusoftcm/review-on-collaborative-filtering/raw/master/recsys.zip
Resolving github.com (github.com)... 140.82.116.4
Connecting to github.com (github.com)|140.82.116.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/nzhinusoftcm/review-on-collaborative-filtering/master/recsys.zip [following]
--2024-04-14 16:00:00--  https://raw.githubusercontent.com/nzhinusoftcm/review-on-collaborative-filtering/master/recsys.zip
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15312323 (15M) [application/zip]
Saving to: ‘recsys.zip’


2024-04-14 16:00:01 (183 MB/s) - ‘recsys.zip’ saved [15312323/15312323]

Archive:  recsys.zip
   creating: recsys/
  inflating: recsy

In [None]:
from collections import defaultdict
import random

from functools import lru_cache
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
import tqdm.notebook

from recsys.datasets import ml1m, ml100k

In [None]:
ratings, movies = ml100k.load()

Download data 100.2%
Successfully downloaded ml-100k.zip 4924029 bytes.
Unzipping the ml-100k.zip zip file ...


Unnamed: 0,userid,itemid,rating
0,1,1,5
1,1,2,3
2,1,3,4
3,1,4,3
4,1,5,3


In [386]:
def encode_ratings(ratings):
  item_encoder = LabelEncoder().fit(sorted(ratings['itemid'].unique()))
  user_encoder = LabelEncoder().fit(sorted(ratings['userid'].unique()))
  ratings.itemid = item_encoder.transform(ratings.itemid)
  ratings.userid = user_encoder.transform(ratings.userid)
  return ratings, user_encoder, item_encoder


def split_train_valid(dataset, coef=0.3):
  if coef == 0:
    return dataset.groupby("userid")[["itemid", "rating"]].apply(lambda x: x.values.tolist()).to_dict(), None
  num_of_samples = round(coef * dataset.shape[0])
  test = dataset.groupby('userid', group_keys=False).apply(lambda x: x.sample(round(coef*len(x))))
  train = dataset.iloc[~test.index]
  test = test.groupby("userid")[["itemid", "rating"]].apply(lambda x: x.values.tolist()).to_dict()
  train = train.groupby("userid")[["itemid", "rating"]].apply(lambda x: x.values.tolist()).to_dict()
  return train, test


In [254]:
ratings, uencoder, iencoder = encode_ratings(ratings)

## Task 1: Similarities (5 points each)

You need to implement 3 similarity functions:
1. Dot product (intersection)
1. Jaccard index (intersection over union)
1. Pearson correlation

In [255]:
def sim_dot(left, right) -> float:
    '''Dot product similarity for users

    Args:
        left: first user ratings
        right: second user ratings

    Retruns:
        Similarity score for this pair
    '''
    intersection = left.keys() & right.keys()
    return sum([left[k] * right[k] for k in intersection])

In [None]:
def sim_jacc(left, right) -> float:
    '''Jaccard index similarity for users

    Args:
        left: first user ratings
        right: second user ratings

    Retruns:
        Similarity score for this pair
    '''
    intersection = left.keys() & right.keys()
    union = left.keys() | right.keys()
    return len(intersection)/len(union)

In [None]:
def sim_pearson(left, right) -> float:
    '''Pearson correlation similarity for users

    Args:
        left: first user ratings
        right: second user ratings

    Retruns:
        Similarity score for this pair
    '''
    l_mean = np.mean(np.array(list(left.values())))
    r_mean = np.mean(np.array(list(right.values())))
    l_cor = {k: v - l_mean for k, v in left.items()}
    r_cor = {k: v - r_mean for k, v in right.items()}
    numerator = sim_dot(l_cor, r_cor)
    denominator = np.sqrt(sim_dot(l_cor, l_cor)) * np.sqrt(sim_dot(r_cor, r_cor))
    coef = numerator / denominator
    if np.isnan(coef):
      return 0
    return coef

## Task 2: Collaborative filtering algorithm (5 points each)

Now you have several options to use similarities for ratings prediction:
1. Simple averaging
1. Mean corrected averaging

In [399]:
class UserBasedCollaborativeFilter:
    def __init__(self, sim_fn):
        self.sim_fn = sim_fn
        self.feedbacks_train = None
        self.feedbacks_valid = None
        self.sim_matrix = None
        self.users = None
        self.items = None

    def calc_sim_matrix(self, feedbacks, prod: bool = False):
        '''Fills matrix of user similarities

        Args:
            feedbacks: numpy array with ratings
        '''
        self.items = set(np.unique(feedbacks.itemid))
        coef = 0.3
        if prod:
          coef = 0
        self.feedbacks_train, self.feedbacks_valid = split_train_valid(feedbacks, coef=coef)
        self.feedbacks_ratings = defaultdict()
        self.users = self.feedbacks_train.keys()
        n_users = len(self.users)
        sim_matrix = np.full((n_users, n_users), np.nan)
        np.fill_diagonal(sim_matrix, 1)
        with tqdm.notebook.tqdm(total=n_users * (n_users - 1) // 2) as pbar:
          for i in range(n_users):
            u_i_dct = {k: v for k, v in self.feedbacks_train[i]}
            for j in range(i + 1, n_users):
                u_j_dct = {k: v for k, v in self.feedbacks_train[j]}
                sim = self.sim_fn(u_i_dct, u_j_dct)
                sim_matrix[i, j] = sim
                sim_matrix[j, i] = sim
                pbar.update()
        self.sim_matrix = sim_matrix

    def recommend(self, user: int, n: int, prod: bool = False):
        '''Computes most relevant unseen items for the user

        Args:
            user: user_id for which to provide recommendations
            n: how many items to return
        '''
        if not prod:
          user_unseen = list(map(lambda x: x[0], self.feedbacks_valid[user]))
        else:
          user_items = set(map(lambda x: x[0], self.feedbacks_train[user]))
          user_unseen = set(self.items) - user_items
        recs = {}

        numerator = 0
        denominator = 0
        for item in user_unseen:
          for other_user in self.users:
            if other_user == user:
              continue
            other_ratings = {k: v for k, v in self.feedbacks_train[other_user]}
            if item not in other_ratings.keys():
              continue
            numerator += self.sim_matrix[user, other_user] * other_ratings[item]
            denominator += np.abs(self.sim_matrix[user, other_user])
            recs[item] = 0
            if denominator != 0:
                recs[item] = numerator / denominator
        sorted_by_rating = sorted(recs.items(), key=lambda i: i[1], reverse=True)

        return sorted_by_rating[:n]

In [400]:
class UserBasedMeanCorrectedCollaborativeFilter:
    def __init__(self, sim_fn):
        self.sim_fn = sim_fn
        self.feedbacks_train = None
        self.feedbacks_valid = None
        self.sim_matrix = None
        self.users = None
        self.items = None

    def calc_sim_matrix(self, feedbacks, prod: bool = False):
        '''Fills matrix of user similarities

        Args:
            feedbacks: numpy array with ratings
        '''
        self.items = set(np.unique(feedbacks.itemid))
        coef = 0.3
        if prod:
          coef = 0
        self.feedbacks_train, self.feedbacks_valid = split_train_valid(feedbacks, coef=coef)
        self.feedbacks_ratings = defaultdict()
        self.users = self.feedbacks_train.keys()
        n_users = len(self.users)
        sim_matrix = np.full((n_users, n_users), np.nan)
        np.fill_diagonal(sim_matrix, 1)
        with tqdm.notebook.tqdm(total=n_users * (n_users - 1) // 2) as pbar:
          for i in range(n_users):
            u_i_dct = {k: v for k, v in self.feedbacks_train[i]}
            for j in range(i + 1, n_users):
                u_j_dct = {k: v for k, v in self.feedbacks_train[j]}
                sim = self.sim_fn(u_i_dct, u_j_dct)
                sim_matrix[i, j] = sim
                sim_matrix[j, i] = sim
                pbar.update()
        self.sim_matrix = sim_matrix

    def recommend(self, user: int, n: int, prod: bool = False):
        '''Computes most relevant unseen items for the user

        Args:
            user: user_id for which to provide recommendations
            n: how many items to return
        '''
        user_ratings = list(map(lambda x: x[1], self.feedbacks_train[user]))
        if not prod:
          user_unseen = list(map(lambda x: x[0], self.feedbacks_valid[user]))
        else:
          user_items = set(map(lambda x: x[0], self.feedbacks_train[user]))
          user_unseen = set(self.items) - user_items
        user_avg_rating = sum(user_ratings) / len(user_ratings)
        recs = {}

        numerator = 0
        denominator = 0
        for item in user_unseen:
          for other_user in self.users:
            if other_user == user:
              continue
            other_ratings = {k: v for k, v in self.feedbacks_train[other_user]}
            if item not in other_ratings.keys():
              continue
            other_user_avg_rating = sum(other_ratings.values()) / len(other_ratings.values())
            numerator += self.sim_matrix[user, other_user] * (other_ratings[item] - other_user_avg_rating)
            denominator += np.abs(self.sim_matrix[user, other_user])
            recs[item] = 0
            if denominator != 0:
                recs[item] = user_avg_rating + numerator / denominator
        sorted_by_rating = sorted(recs.items(), key=lambda i: i[1], reverse=True)

        return sorted_by_rating[:n]

This way you have got 6 different recommendation methods (each of two CF can be used with 3 similarity score).

## Task 3: Apply models

1. For all 6 possible algorithm variations train it and compute recomendations for validation part. (10 points)
2. Show that your implementation is relevant by computing metrics. Compare algorithms. (15 points)

In [264]:
def calc_metrics(df_merged):
  df_merged['MAE'] = (df_merged['value_true'] - df_merged['value_recs']).abs()
  df_merged['MSE'] = (df_merged['value_true'] - df_merged['value_recs']) ** 2
  print(f"MAE  - {df_merged['MAE'].mean():.4f}")
  print(f"MSE  - {df_merged['MSE'].mean():.4f}")
  print(f"RMSE - {np.sqrt(df_merged['MSE'].mean()):.4f}")


def precision_at_k(df_merged, k=10):
  df_merged['rank_recs'] = df_merged['value_recs'].groupby(['user_id']).rank(ascending=False, method='first')
  df_merged['rank_recs'] = df_merged['rank_recs'].astype(int)
  df_merged['rank_true'] = df_merged['value_true'].groupby(['user_id']).rank(ascending=False, method='first')
  df_merged['rank_true'] = df_merged['rank_true'].astype(int)
  df_merged_p = df_merged.query(f'rank_true <= {k}').copy()
  df_merged_p.loc[df_merged_p.index, f'hit@{k}'] = df_merged_p['rank_recs'] <= k
  df_merged_p.loc[df_merged_p.index, f'hit@{k}/{k}'] = df_merged_p[f'hit@{k}'] / k
  df_prec2 = df_merged_p.groupby(level=0)[f'hit@{k}/{k}'].sum()
  print(f'Precision@{k} - {df_prec2.mean()}')


def get_df_merged(filt, valid):
  users_sample = list(valid.keys())
  df_merged = []
  with tqdm.notebook.tqdm(total=len(users_sample)) as pbar:
    for u in users_sample:
      n_items = len(valid[u])
      check = filt.recommend(u, n_items)
      df1 = pd.DataFrame(check)
      df2 = pd.DataFrame(valid[u])
      for_metrics = pd.merge(df1, df2, on=[0])
      for_metrics['user'] = u
      for_metrics.columns = ['item_id', 'value_recs', 'value_true', 'user_id']
      df_merged.append(for_metrics)
      pbar.update()

  df_merged = pd.concat(df_merged)
  df_merged.set_index(['user_id', 'item_id'], inplace=True)
  return df_merged

In [265]:
filt_dot = UserBasedCollaborativeFilter(sim_dot)
filt_dot.calc_sim_matrix(ratings)
df_merged = get_df_merged(filt_dot, filt_dot.feedbacks_valid)
calc_metrics(df_merged)
precision_at_k(df_merged, k=5)

  0%|          | 0/444153 [00:00<?, ?it/s]

  0%|          | 0/943 [00:00<?, ?it/s]

MAE  - 0.8929
MSE  - 1.2530
RMSE - 1.1194
Precision@5 - 0.5276776246023341


In [266]:
filt_jacc = UserBasedCollaborativeFilter(sim_jacc)
filt_jacc.calc_sim_matrix(ratings)
df_merged = get_df_merged(filt_jacc, filt_jacc.feedbacks_valid)
calc_metrics(df_merged)
precision_at_k(df_merged, k=5)

  0%|          | 0/444153 [00:00<?, ?it/s]

  0%|          | 0/943 [00:00<?, ?it/s]

MAE  - 0.9018
MSE  - 1.2531
RMSE - 1.1194
Precision@5 - 0.5268292682926832


In [267]:
filt_pearson = UserBasedCollaborativeFilter(sim_pearson)
filt_pearson.calc_sim_matrix(ratings)
df_merged = get_df_merged(filt_pearson, filt_pearson.feedbacks_valid)
calc_metrics(df_merged)
precision_at_k(df_merged, k=5)

  0%|          | 0/444153 [00:00<?, ?it/s]

  0%|          | 0/943 [00:00<?, ?it/s]

MAE  - 2.3468
MSE  - 7.0886
RMSE - 2.6625
Precision@5 - 0.5196182396606578


In [268]:
filt_dot = UserBasedMeanCorrectedCollaborativeFilter(sim_dot)
filt_dot.calc_sim_matrix(ratings)
df_merged = get_df_merged(filt_dot, filt_dot.feedbacks_valid)
calc_metrics(df_merged)
precision_at_k(df_merged, k=5)

  0%|          | 0/444153 [00:00<?, ?it/s]

  0%|          | 0/943 [00:00<?, ?it/s]

MAE  - 0.8345
MSE  - 1.1255
RMSE - 1.0609
Precision@5 - 0.5372216330858971


In [269]:
filt_jacc = UserBasedMeanCorrectedCollaborativeFilter(sim_jacc)
filt_jacc.calc_sim_matrix(ratings)
df_merged = get_df_merged(filt_jacc, filt_jacc.feedbacks_valid)
calc_metrics(df_merged)
precision_at_k(df_merged, k=5)

  0%|          | 0/444153 [00:00<?, ?it/s]

  0%|          | 0/943 [00:00<?, ?it/s]

MAE  - 0.8298
MSE  - 1.1110
RMSE - 1.0541
Precision@5 - 0.5297985153764588


In [270]:
filt_pearson = UserBasedMeanCorrectedCollaborativeFilter(sim_pearson)
filt_pearson.calc_sim_matrix(ratings)
df_merged = get_df_merged(filt_pearson, filt_pearson.feedbacks_valid)
calc_metrics(df_merged)
precision_at_k(df_merged, k=5)

  0%|          | 0/444153 [00:00<?, ?it/s]

  0%|          | 0/943 [00:00<?, ?it/s]

MAE  - 0.8218
MSE  - 1.1002
RMSE - 1.0489
Precision@5 - 0.5444326617179224


# Task 4: Your favorite films

1. Choose from 10 to 50 films rated by you (you can export it from IMDB or kinopoisk) which are presented in Movielens dataset. </br> Print them in human readable form (5 points)

In [403]:
my_id = ratings['userid'].max() + 1
me = [
  [my_id, 256, 5],
  [my_id, 71, 5],
  [my_id, 1482, 5],
  [my_id, 68, 5],
  [my_id, 312, 5],
  [my_id, 126, 4],
  [my_id, 271, 5],
  [my_id, 251, 5],
  [my_id, 21, 5],
  [my_id, 143, 5],
  [my_id, 225, 5],
  [my_id, 471, 5],
  [my_id, 1090, 5],
  [my_id, 63, 5],
  [my_id, 0, 5],
  [my_id, 779, 4],
]
for _, film, r in me:
  print(f'{movies.title.loc[film]} - {r}')

Men in Black (1997) - 5
Mask, The (1994) - 5
Man in the Iron Mask, The (1998) - 5
Forrest Gump (1994) - 5
Titanic (1997) - 5
Godfather, The (1972) - 4
Good Will Hunting (1997) - 5
Lost World: Jurassic Park, The (1997) - 5
Braveheart (1995) - 5
Die Hard (1988) - 5
Die Hard 2 (1990) - 5
Dragonheart (1996) - 5
Pete's Dragon (1977) - 5
Shawshank Redemption, The (1994) - 5
Toy Story (1995) - 5
Dumb & Dumber (1994) - 4


2. Compute top 10 recomendations based on this films for each of 6 methods implemented. Print them in human readable from (5 points)

In [405]:
my_df = pd.DataFrame(me, columns = ['userid', 'itemid', 'rating'])
new_ratings = pd.concat([ratings, my_df])

me_filt = UserBasedCollaborativeFilter(sim_dot)
me_filt.calc_sim_matrix(new_ratings, prod=True)

results = me_filt.recommend(my_id, 10, prod=True)
for film, r in results:
  print(f'{movies.title.loc[film]} - {r}')

  0%|          | 0/445096 [00:00<?, ?it/s]

Usual Suspects, The (1995) - 3.8245147032972553
Postino, Il (1994) - 3.802553285153527
Mr. Holland's Opus (1995) - 3.799659613432331
Mighty Aphrodite (1995) - 3.7966776917964484
French Twist (Gazon maudit) (1995) - 3.793742162149773
M*A*S*H (1970) - 3.7731563197361173
When Harry Met Sally... (1989) - 3.7724291476359113
Room with a View, A (1986) - 3.7723060705560583
Indiana Jones and the Last Crusade (1989) - 3.772028515791831
Unbearable Lightness of Being, The (1988) - 3.7718604708709433


In [406]:
my_df = pd.DataFrame(me, columns = ['userid', 'itemid', 'rating'])
new_ratings = pd.concat([ratings, my_df])

me_filt = UserBasedCollaborativeFilter(sim_jacc)
me_filt.calc_sim_matrix(new_ratings, prod=True)

results = me_filt.recommend(my_id, 10, prod=True)
for film, r in results:
  print(f'{movies.title.loc[film]} - {r}')

  0%|          | 0/445096 [00:00<?, ?it/s]

Usual Suspects, The (1995) - 3.797198355679894
Mr. Holland's Opus (1995) - 3.7810118831595774
M*A*S*H (1970) - 3.7804319220048597
Indiana Jones and the Last Crusade (1989) - 3.7798823916081514
Room with a View, A (1986) - 3.779842316383059
Unbearable Lightness of Being, The (1988) - 3.779315162901992
When Harry Met Sally... (1989) - 3.7789524757137727
Pink Floyd - The Wall (1982) - 3.7783378001341688
Field of Dreams (1989) - 3.7778826822338107
This Is Spinal Tap (1984) - 3.777475359578058


In [407]:
my_df = pd.DataFrame(me, columns = ['userid', 'itemid', 'rating'])
new_ratings = pd.concat([ratings, my_df])

me_filt = UserBasedCollaborativeFilter(sim_pearson)
me_filt.calc_sim_matrix(new_ratings, prod=True)

results = me_filt.recommend(my_id, 10, prod=True)
for film, r in results:
  print(f'{movies.title.loc[film]} - {r}')

  0%|          | 0/445096 [00:00<?, ?it/s]

GoldenEye (1995) - -0.03342452029583302
Children of the Corn: The Gathering (1996) - -0.25765044609217075
E.T. the Extra-Terrestrial (1982) - -0.25770219114867765
Transformers: The Movie, The (1986) - -0.258117127841073
Aladdin and the King of Thieves (1996) - -0.2587168976362757
Bob Roberts (1992) - -0.2588031138109205
Alice in Wonderland (1951) - -0.2588284272392072
Mary Poppins (1964) - -0.2589565533074165
William Shakespeare's Romeo and Juliet (1996) - -0.25971770601230143
Cinderella (1950) - -0.2598078365093563


In [408]:
my_df = pd.DataFrame(me, columns = ['userid', 'itemid', 'rating'])
new_ratings = pd.concat([ratings, my_df])

me_filt = UserBasedMeanCorrectedCollaborativeFilter(sim_dot)
me_filt.calc_sim_matrix(new_ratings, prod=True)

results = me_filt.recommend(my_id, 10, prod=True)
for film, r in results:
  print(f'{movies.title.loc[film]} - {r}')

  0%|          | 0/445096 [00:00<?, ?it/s]

Usual Suspects, The (1995) - 5.1350223749771375
Postino, Il (1994) - 5.109998286228083
Mr. Holland's Opus (1995) - 5.106939909026941
Mighty Aphrodite (1995) - 5.106168086747484
French Twist (Gazon maudit) (1995) - 5.1021119102576025
M*A*S*H (1970) - 5.087305141539157
Room with a View, A (1986) - 5.0862925576000695
Indiana Jones and the Last Crusade (1989) - 5.0861100422552346
Unbearable Lightness of Being, The (1988) - 5.0860903224203815
When Harry Met Sally... (1989) - 5.085965078300546


In [409]:
my_df = pd.DataFrame(me, columns = ['userid', 'itemid', 'rating'])
new_ratings = pd.concat([ratings, my_df])

me_filt = UserBasedMeanCorrectedCollaborativeFilter(sim_jacc)
me_filt.calc_sim_matrix(new_ratings, prod=True)

results = me_filt.recommend(my_id, 10, prod=True)
for film, r in results:
  print(f'{movies.title.loc[film]} - {r}')

  0%|          | 0/445096 [00:00<?, ?it/s]

Usual Suspects, The (1995) - 5.085523853569544
M*A*S*H (1970) - 5.0684298455270484
Indiana Jones and the Last Crusade (1989) - 5.067876296701475
Room with a View, A (1986) - 5.067564082947592
Unbearable Lightness of Being, The (1988) - 5.067282385451366
When Harry Met Sally... (1989) - 5.066338496549018
Pink Floyd - The Wall (1982) - 5.066011586837895
Mr. Holland's Opus (1995) - 5.065926198647207
This Is Spinal Tap (1984) - 5.065701188912684
Field of Dreams (1989) - 5.065502765116842


In [410]:
my_df = pd.DataFrame(me, columns = ['userid', 'itemid', 'rating'])
new_ratings = pd.concat([ratings, my_df])

me_filt = UserBasedMeanCorrectedCollaborativeFilter(sim_pearson)
me_filt.calc_sim_matrix(new_ratings, prod=True)

results = me_filt.recommend(my_id, 10, prod=True)
for film, r in results:
  print(f'{movies.title.loc[film]} - {r}')

  0%|          | 0/445096 [00:00<?, ?it/s]

Four Rooms (1995) - 5.078817158599242
GoldenEye (1995) - 5.056998968834785
Copycat (1995) - 4.971484827200381
Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) - 4.969481367329959
Get Shorty (1995) - 4.965126426433848
Twelve Monkeys (1995) - 4.932785529226061
Babe (1995) - 4.908893270984166
Gone with the Wind (1939) - 4.888860110030094
Spitfire Grill, The (1996) - 4.888840296428844
Kansas City (1996) - 4.888795180178124


3. Rate films that was recommended in previous step (by title, description, trailer). For each algorithm compute metrics based on ratings you put. Was recommedations different? Which set of recomendations you like the most?

<Your ratings and conclusions here>

# Task 5: Conclusion (10 points)

Compare all methods based on both dataset (metrics) and your personal recomendations.

Which algorithm is the best? Why?

What differences in algorithms have you noted?

Algorithms with mean correction show better performance with all present metrics (MAE, MSE, RMSE, Precision at 5) compared to algorithms without mean correction. On one hand, for algorithms without mean correction, using dot and jacc similarity, it is not clear which one is better, because metrics are quite similar, but pearson correlation is definetely worse them, on another hand, with mean correction it is clear that similarity with dot shows the worst results by all metrics, jacc is a bit better and pearson is best. \\
Overall, the algorithm with mean correction and pearson similarity shows best results.