In [0]:
import pandas as pd
import numpy as np
import os

#Collaborative filtering#
In this notebook we are going to perform a colaborative filtering recommender with the small MovieLens dataset. It will recommnend based on the similarity between users

Download the dataset from the webpage and extract it.

In [0]:
small_dataset_url = 'http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'

In [0]:
import urllib.request
datasets_path = ''
small_dataset_path = os.path.join(datasets_path, 'ml-latest-small.zip')

small_f = urllib.request.urlretrieve(small_dataset_url, small_dataset_path)

In [0]:
import zipfile

with zipfile.ZipFile(small_dataset_path, "r") as z:
    z.extractall(datasets_path)

This is the dataset used for the collaborative filtering, it contains the rating that a user has given for a movie.

In [0]:
df_ratings = pd.read_csv('ml-latest-small/ratings.csv')

In [0]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


timestamp feature is not needed

In [0]:
df_ratings = df_ratings.drop(['timestamp'], axis = 1)

In [0]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


By means of optimization, we check that all Id are contigous and they do not have empty spaces, as python starts indexing by 0 we also rest 1 to all Ids.

In [0]:
df_ratings.userId.nunique()

610

In [0]:
df_ratings.userId.max()

610

In [0]:
df_ratings.userId = df_ratings.userId - 1

In [0]:
df_movies = pd.read_csv('ml-latest-small/movies.csv')

In [0]:
df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


We merge by movieId to know in every case of the movie the rating is referring to.

In [0]:
df_ratings = df_ratings.merge(df_movies, on = 'movieId')

In [0]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,title,genres
0,0,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,4,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,6,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,14,1,2.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,16,1,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


In [0]:
len(df_ratings)

100836

In [0]:
df_ratings = df_ratings.drop(['genres'], axis = 1)

In [0]:
df_ratings.movieId.nunique()

9724

In [0]:
df_ratings.movieId.max()

193609

In [0]:
#take unique indexes
unique_movie_ids = set(df_ratings.movieId.values) 
#empty dictionary where the key is the movie and the value is the new index
dic_index_movie = {} 
new_index = 0

for movie in unique_movie_ids:
    dic_index_movie[movie] = new_index # relate movie to the new index
    new_index += 1 
    
df_ratings['movieId_new'] = df_ratings.apply(lambda row: dic_index_movie[row.movieId], axis=1) # generate new column with new index

In [0]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,title,movieId_new
0,0,1,4.0,Toy Story (1995),0
1,4,1,4.0,Toy Story (1995),0
2,6,1,4.5,Toy Story (1995),0
3,14,1,2.5,Toy Story (1995),0
4,16,1,4.5,Toy Story (1995),0


N is the number of users and M is the number of movies

In [0]:
N = df_ratings.userId.nunique() + 1
M = df_ratings.movieId.nunique() + 1

Divide data in training(80%) and test(20%):

In [0]:
from sklearn.utils import shuffle

In [0]:
df_ratings = shuffle(df_ratings) #mix randomly
cutoff = int(0.8*len(df_ratings)) 
df_train = df_ratings.iloc[:cutoff]  #80% for training
df_test = df_ratings.iloc[cutoff:] #20% for testing

In [0]:
df_train.userId.nunique()

610

#Collaborative filtering#

We are gonna stablish a maximum number of neighbors and a minimum threshold to be a neighbor.

In [0]:
K = 25 #max neighbors
th = 5 #threshold for a neighbor

The lists neigbors, average and std are going to be filled with the data obtained in the training and we will predict based on the information in this lists.

In [0]:
neighbors = [] #neighbors per user
average = [] # mean rating per user
std = [] # standard desviation per user and movie

Training the recommender. It will calculate the mean for every user and the deviation for each movie. It will look  the whole dataset who are the closests neighbors and sort it by the weight. The weight takes into account the mean of every one and the deviation and it does not work directly with the rating in the common movies to compensate the difference between users at the time of rating a movie and their subjective criteria.

In [0]:
from sortedcontainers import SortedList
#for each user
for i in range(N): 
    #search if the k users closer to i
    df_user_movie = df_train.loc[df_train['userId'] == i] # movies rated by the user i
    avg_i_rating = df_user_movie['rating'].mean() #calculate the mean of the user
    df_user_movie['deviation'] = df_user_movie['rating'] - avg_i_rating; # add the deviation
    sigma_i = np.sqrt(df_user_movie['deviation'].dot(df_user_movie['deviation'])) #Pearson coefficient denominator

    
    deviations = dict(zip(df_user_movie.movieId_new, df_user_movie.deviation)) # dictionary that contains the deviation per movie

    average.append(avg_i_rating) #save the averages
    std.append(deviations) #save the deviations

    sl = SortedList() #sorted list to store the weights

    for j in range(N): #search for every user different from i the closest neighbours.
        if j != i: 

            df_user_movie_neighb = df_train.loc[df_train['userId'] == j] # movies rated by user j
            common_movies = pd.merge(df_user_movie, df_user_movie_neighb, how='inner', on=['movieId_new']) # look to the common movies i and j has

            #if they have more movies in common than the threshold, we proceed:
            if len(common_movies) > th:
                avg_j_rating = df_user_movie_neighb['rating'].mean()
                common_movies['deviation_y'] = common_movies['rating_y'] - avg_j_rating;
                df_user_movie_neighb['deviation'] = df_user_movie_neighb['rating'] - avg_i_rating;
                sigma_j = np.sqrt(df_user_movie_neighb['deviation'].dot(df_user_movie_neighb['deviation'])) #Pearson coefficient denominator
                
                # calculate correlation coefficient
                numerator = common_movies['deviation'].dot(common_movies['deviation_y'])
                #calculate the weight
                w_ij = numerator / (sigma_i * sigma_j)

                sl.add((-w_ij, j)) #add th weight to the sorted list in ascending order
                if len(sl) > K: # If the list > neighbor limit, we eliminate the least important weight
                    del sl[-1]

    neighbors.append(sl) #store neighbors
    print(i,"/",N-1)    


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  import sys
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


0 / 610
1 / 610
2 / 610
3 / 610
4 / 610
5 / 610
6 / 610
7 / 610
8 / 610
9 / 610
10 / 610
11 / 610
12 / 610
13 / 610
14 / 610
15 / 610
16 / 610
17 / 610
18 / 610
19 / 610
20 / 610
21 / 610
22 / 610
23 / 610
24 / 610
25 / 610
26 / 610
27 / 610
28 / 610
29 / 610
30 / 610
31 / 610
32 / 610
33 / 610
34 / 610
35 / 610
36 / 610
37 / 610
38 / 610
39 / 610
40 / 610
41 / 610
42 / 610
43 / 610
44 / 610
45 / 610
46 / 610
47 / 610
48 / 610
49 / 610
50 / 610
51 / 610




52 / 610
53 / 610
54 / 610
55 / 610
56 / 610
57 / 610
58 / 610
59 / 610
60 / 610
61 / 610
62 / 610
63 / 610
64 / 610
65 / 610
66 / 610
67 / 610
68 / 610
69 / 610
70 / 610
71 / 610
72 / 610
73 / 610
74 / 610
75 / 610
76 / 610
77 / 610
78 / 610
79 / 610
80 / 610
81 / 610
82 / 610
83 / 610
84 / 610
85 / 610
86 / 610
87 / 610
88 / 610
89 / 610
90 / 610
91 / 610
92 / 610
93 / 610
94 / 610
95 / 610
96 / 610
97 / 610
98 / 610
99 / 610
100 / 610
101 / 610
102 / 610
103 / 610
104 / 610
105 / 610
106 / 610
107 / 610
108 / 610
109 / 610
110 / 610
111 / 610
112 / 610
113 / 610
114 / 610
115 / 610
116 / 610
117 / 610
118 / 610
119 / 610
120 / 610
121 / 610
122 / 610
123 / 610
124 / 610
125 / 610
126 / 610
127 / 610
128 / 610
129 / 610
130 / 610
131 / 610
132 / 610
133 / 610
134 / 610
135 / 610
136 / 610
137 / 610
138 / 610
139 / 610
140 / 610
141 / 610
142 / 610
143 / 610
144 / 610
145 / 610
146 / 610
147 / 610
148 / 610
149 / 610
150 / 610
151 / 610
152 / 610
153 / 610
154 / 610
155 / 610
156 / 61

In [0]:
def predict(i, m):
   
    numerator = 0 # sum of the product of the weights and deviations
    den = 0 # sum of the absolute values of the weights
    for neigh in neighbors[i]:
        neg_w =neigh[0]
        j = neigh[1]
        std_j = std[j]
        try:
        # If the neighbor has rated the movie:
            numerator += -neg_w * std_j[m] # we add the weight in negative value
            den += abs(neg_w)
        except KeyError:
        #When the neighbor has not rated the movie launch exception.
            pass

    if den == 0:
        pred = average[i] #We predict as our average rating in this case, 
    else:
        pred = numerator / den + average[i]
    pred = min(5, pred) # to bound the prediction score between 0 and 5
    pred = max(0.5, pred)
    return pred

Now we can predict al the ratings to check if our algorithm is capable to obatin good results

In [0]:
pred_train = []
label_train = []

In [0]:
for row in df_train.iterrows(): # we iterate through the whole dataset of training
    pred = predict(row[1]['userId'], row[1]['movieId_new'])

    #store pred and label
    pred_train.append(pred)
    label_train.append(row[1]['rating'])
    print(row[1]['title'] + ' pred: ' + str(pred) + ' label: '+ str(row[1]['rating']))

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Searching for Bobby Fischer (1993) pred: 3.6181982198470974 label: 2.5
Metropolis (2001) pred: 4.119433411555541 label: 4.0
Tank Girl (1995) pred: 2.5434587655680527 label: 4.0
Blue Velvet (1986) pred: 4.4369083133939755 label: 4.0
Star Wars: Episode V - The Empire Strikes Back (1980) pred: 4.093794567616522 label: 3.0
Shadow of the Vampire (2000) pred: 2.8364347980786335 label: 1.5
Wedding Crashers (2005) pred: 3.546669754514661 label: 3.0
Anchorman: The Legend of Ron Burgundy (2004) pred: 3.350075719136717 label: 4.0
One Flew Over the Cuckoo's Nest (1975) pred: 3.9264875914524984 label: 4.0
DiG! (2004) pred: 3.8813915135455526 label: 4.0
Wonder Woman (2017) pred: 3.085333164952602 label: 2.0
The Devil's Advocate (1997) pred: 3.710928595518941 label: 3.0
Batman Begins (2005) pred: 3.2778776879849185 label: 4.0
Kill Bill: Vol. 1 (2003) pred: 4.511211899534957 label: 5.0
Time Bandits (1981) pred: 4.071387459528027 label: 4

In [0]:
pred_test = []
label_test = []

In [0]:
for row in df_test.iterrows():

    pred = predict(row[1]['userId'], row[1]['movieId_new'])

    #store pred and label
    pred_test.append(pred)
    label_test.append(row[1]['rating'])
    print(row[1]['title'] + ' pred: ' + str(pred) + ' label: '+ str(row[1]['rating']))

[1;30;43mStreaming output truncated to the last 5000 lines.[0m
Hudsucker Proxy, The (1994) pred: 2.8970163689453745 label: 3.0
Thor: The Dark World (2013) pred: 4.342127183552993 label: 3.5
No Country for Old Men (2007) pred: 3.701218965155735 label: 3.0
Live Free or Die Hard (2007) pred: 3.008176865013523 label: 4.5
Creature Comforts (1989) pred: 4.329655496535292 label: 4.0
Screamers (1995) pred: 3.457468015185601 label: 5.0
Dr. Dolittle 2 (2001) pred: 2.6620776128468813 label: 2.0
Outbreak (1995) pred: 3.223394185990169 label: 3.0
Last King of Scotland, The (2006) pred: 3.5915381631308914 label: 3.0
Family Stone, The (2005) pred: 3.3875530410183874 label: 2.0
Die Hard: With a Vengeance (1995) pred: 4.146675700935941 label: 5.0
Perfect Murder, A (1998) pred: 3.1841003593226382 label: 2.0
City of God (Cidade de Deus) (2002) pred: 4.278827228786941 label: 5.0
Cliffhanger (1993) pred: 3.5197692464284924 label: 4.0
Cop Land (1997) pred: 5 label: 3.5
Star Wars: Episode IV - A New Hope (

MSE: mean squared error. Metric that tells by how much our algotithm predicts wrong a rating.

In [0]:
def mse(p, t):
    p = np.array(p)
    t = np.array(t)
    return np.mean((p - t)**2)

We can say that the results are pretty decent, in the testing the error is below 1 point, which given that we have used the small dataset and our implementation has not involved any deep network, just a formula, we conclude that this algorithm it is a good option as a first approach to the problem.



In [0]:
print('Train mse:', mse(pred_train, label_train))
print('Test mse:', mse(pred_test, label_test))

Train mse: 0.694063960495097
Test mse: 0.8650730403186344
