## In this project, we want to build up movie recommender based on 1) User-based Collaborative Filtering method; 2) Item-based Collaborative method, and 3) Matrix Factorization method. We use data from movieLens, which has the detailed rating for movies by users. The target is to generate accurate movie ratings within the test set. I use mean-squared error as evaluation metric.

In [233]:
# Load the package
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [234]:
# Load the data
# file: movies
import types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.
client_a84041a6902847f291e985bc45c84367 = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='t6mzym1IR5gszI5BXI0AP3byDq5ooTRi42BRhh3YX8eP',
    ibm_auth_endpoint="https://iam.ng.bluemix.net/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3-api.us-geo.objectstorage.service.networklayer.com')

body = client_a84041a6902847f291e985bc45c84367.get_object(Bucket='airliquidedsprojectmovierecommend-donotdelete-pr-x01ke2zaci5dpf',Key='movies.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

movies_df = pd.read_csv(body)
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [235]:
# load the data
# file: movie ratings
body = client_a84041a6902847f291e985bc45c84367.get_object(Bucket='airliquidedsprojectmovierecommend-donotdelete-pr-x01ke2zaci5dpf',Key='ratings.csv')['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

ratings_df = pd.read_csv(body)
# For the sake of computation speed, I only kept 100 movies
ratings_df = ratings_df[ratings_df['movieId'] <= 100]
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [236]:
print(movies_df.shape)
print(ratings_df.shape)

(9742, 3)
(3206, 4)


Each movie has a unique ID, a title with its release year along with it and several different genres in the same field. Let's remove the year from the title column and place it into its own one.

Let's remove the year from the title column by using pandas' replace function and store in a new year column.


In [237]:
#Using regular expressions to find a year stored between parentheses
#We specify the parantheses so we don't conflict with movies that have years in their titles
movies_df['year'] = movies_df.title.str.extract('(\(\d\d\d\d\))',expand=False)
#Removing the parentheses
movies_df['year'] = movies_df.year.str.extract('(\d\d\d\d)',expand=False)
#Removing the years from the 'title' column
movies_df['title'] = movies_df.title.str.replace('(\(\d\d\d\d\))', '')
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].apply(lambda x: x.strip())
movies_df = movies_df.drop('genres', axis=1)

In [238]:
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [239]:
#Drop removes a specified row or column from a dataframe
ratings_df = ratings_df.drop('timestamp', 1)
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [240]:
#Calculate the number of users, pin down training set and test set
ratings_by_id = ratings_df.groupby('userId').count()
ratings_by_id.head()

Unnamed: 0_level_0,movieId,rating
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,6,6
3,1,1
4,6,6
5,7,7
6,46,46


In [241]:
ratings_df['rating'].unique()

array([4. , 5. , 3. , 0.5, 2. , 1. , 4.5, 2.5, 3.5, 1.5])

In [242]:
#Pin down training set and test set
userId_list = ratings_by_id.index

In [243]:
userId_list

Int64Index([  1,   3,   4,   5,   6,   7,   8,   9,  11,  12,
            ...
            601, 602, 603, 604, 605, 606, 607, 608, 609, 610],
           dtype='int64', name='userId', length=503)

In [244]:
# Train Test Split: For each user, I split the movies into train and test (10% randomly selected movies belong to test set, the remaining in the training set)
import random
random.seed(2019)
TEST_df = pd.DataFrame(columns=ratings_df.columns)
TRAIN_df = pd.DataFrame(columns=ratings_df.columns)
for user_id in list(userId_list):
    if ratings_by_id['movieId'][id] > 1:
        df_PORTION = ratings_df[ratings_df['userId']==user_id] # Get the portion of the data with the corresponding user id
        # Generate a random number and determine which rows are used as training data
        test_movieId = random.choices(list(df_PORTION['movieId']),k=int(len(df_PORTION)*0.1)+1)
        msk_test = df_PORTION['movieId'].isin(test_movieId)
        msk_train = (1-msk_test)>0
        TEST_df = TEST_df.append(df_PORTION[msk_test])
        TRAIN_df = TRAIN_df.append(df_PORTION[msk_train])    

In [245]:
# Check the shape.
print('The shape of train set is: ', TRAIN_df.shape)
print('The shape of test set is: ', TEST_df.shape)

The shape of train set is:  (2567, 3)
The shape of test set is:  (639, 3)


## 1. Now, we generate user-based recommendations. For every user in the test set, we want to use the train set to 1) pin down similar users using the train set; 2) use the "weights" to generate predicted moving ratings for movies in the test set of that user. 3) Compare the predicted movie ratings with the actual movie ratings and compute (R)MSE.

In [246]:
ind=0
error = []
ele = []
for i in TEST_df['userId'].unique():
    # Check whether the user is in the train set
    if i in (list(TRAIN_df['userId'])):
        # Pin down the set of users who have seen at least one movie seen by this active user
        inputMovies = TRAIN_df[TRAIN_df['userId']==i]
        userSubset = TRAIN_df[TRAIN_df['movieId'].isin(TRAIN_df[TRAIN_df['userId']==i]['movieId'])]
        userSubsetGroup = userSubset.groupby(['userId'])
        userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)
        
        # Then, we compute similarity score (pearson correlation) among the subgroup of users
        pearsonCorrelationDict = {}

        #For every user group in our subset
        for name, group in userSubsetGroup:
        #Let's start by sorting the input and current user group so the values aren't mixed up later on
            group = group.sort_values(by='movieId')
            inputMovies = inputMovies.sort_values(by='movieId')
        #Get the N for the formula
            nRatings = len(group)
        #Get the review scores for the movies that they both have in common
            temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
        #And then store them in a temporary buffer variable in a list format to facilitate future calculations
            tempRatingList = temp_df['rating'].tolist()
        #Let's also put the current user group reviews in a list format
            tempGroupList = group['rating'].tolist()
        #Now let's calculate the pearson correlation between two users, so called, x and y
            Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
            Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
            Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
        #If the denominator is different than zero, then divide, else, 0 correlation.
            if Sxx != 0 and Syy != 0:
                pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
            else:
                pearsonCorrelationDict[name] = 0
        
        #Genereate dataframe with pearson correlation info
        pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
        pearsonDF.columns = ['similarityIndex']
        pearsonDF['userId'] = pearsonDF.index
        pearsonDF.index = range(len(pearsonDF))
        
        # Merge this similarity information with the 
        topUsersRating=topUsers.merge(TRAIN_df, left_on='userId', right_on='userId', how='inner')
        #Multiplies the similarity by the user's ratings
        topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
        tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
        tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
        recommendation_df = pd.DataFrame()
        # predicted rating
        recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
        recommendation_df['movieId'] = tempTopUsersRating.index
        recommendation_df = recommendation_df.reset_index(drop=True)
        # actual rating
        temp_check = TEST_df[TEST_df['userId']==i]
        testMovies = temp_check.merge(recommendation_df, left_on='movieId', right_on='movieId', how='inner')
        # error
        error.append(sum((testMovies['rating'] - testMovies['weighted average recommendation score'])**2))
        ele.append(len(testMovies))
        
        
        print(ind, ' ', error[ind], ' ', ele[ind])
        ind+=1

0   0.3718304410766845   1
1   1.149687559614638   1
2   0.25661998491121235   1
3   14.159290906813702   5
4   1.149687559614638   1
5   0.04905225652940399   1
6   0.1780824861199088   1
7   0.001165440397407868   1
8   1.993369165326437   1
9   0.15227242101091576   1
10   0.008292187578687646   1
11   0.17209992010760694   1
12   7.313017295359003   3
13   0.16112283605570818   1
14   0.009721744806932199   1
15   0.2756053986736531   1
16   0.24694334909231036   1
17   0.1780824861199088   1
18   0.8315019033905151   1
19   0.312803249123465   2
20   0.7813130517815673   1
21   0.7033976853233153   2
22   1.7744043867366104   2
23   0.0010451267167263239   1
24   2.2697734479388276   1
25   0.3718304410766845   1
26   0.3718304410766845   1
27   1.5449989949038363   2
28   0.3493536370438024   1
29   2.5588192445405693   2
30   4.857399512114071   2
31   0.8396198173600008   2
32   0.06533671592500848   2
33   1.1840328556468278   1
34   0.5541205072190946   1
35   0.8315019033905

In [247]:
print('RMSE = ', sum(error)/sum(ele))

RMSE =  0.9243726711558072


## 2 Next, we want to do item-based CF.

In [251]:
# calculate adjusted ratings based on training data: For item-based CF, the normalization is important as different users 
# have heterogeneous rating scales.
TRAIN_df_mean= TRAIN_df.groupby(['movieId'], as_index = False, sort = False).mean().rename(columns = {'rating': 'rating_mean'})[['movieId','rating_mean']]
adjusted_TRAIN_df = pd.merge(TRAIN_df,TRAIN_df_mean,on = 'movieId', how = 'left', sort = False)
adjusted_TRAIN_df['rating_adjusted']=adjusted_TRAIN_df['rating']-adjusted_TRAIN_df['rating_mean']
# replace 0 adjusted rating values to 1*e-8 in order to avoid 0 denominator
adjusted_TRAIN_df.loc[adjusted_TRAIN_df['rating_adjusted'] == 0, 'rating_adjusted'] = 1e-8

In [254]:
adjusted_TRAIN_df.head()

Unnamed: 0,userId,movieId,rating,rating_mean,rating_adjusted
0,1,1,4.0,3.912903,0.087097
1,1,3,4.0,3.261364,0.738636
2,1,6,4.0,3.948052,0.051948
3,1,47,5.0,3.982993,1.017007
4,1,70,3.0,3.522222,-0.522222


In [257]:
# function of building the item-to-item weight matrix: 
def build_w_matrix(df):
   
    # define weight matrix
    w_matrix_columns = ['movie_1','movie_2','weight']
    w_matrix = pd.DataFrame(columns=w_matrix_columns)
    
    distinct_movies = np.unique(df['movieId']) # get the unique movies
    
    i=0
    for movie_1 in distinct_movies:
        
        if i%10==0:
            print(i)
        # extract all users who rated movie_1
        
        user_data = df[df['movieId'] == movie_1]
        distinct_users = np.unique(user_data['userId'])
        
        # record the ratings for users who rated both movie_1 and movie_2
        record_row_columns = ['userId', 'movie_1', 'movie_2', 'rating_adjusted_1', 'rating_adjusted_2']
        record_movie_1_2 = pd.DataFrame(columns=record_row_columns)
        
        # for each customer C WHO RATED movie_1
        for c_userid in distinct_users:
            # the customer's rating for movie_1
            c_movie_1_rating = user_data[user_data['userId'] == c_userid]['rating_adjusted'].iloc[0]
            # extract movies rated by the customer excluding movie_1
            c_user_data = df[(df['userId'] == c_userid) & (df['movieId'] != movie_1)]
            c_distinct_movies = np.unique(c_user_data['movieId'])

            # for each movie rated by customer C as movie=2
            for movie_2 in c_distinct_movies:
                # the customer's rating for movie_2
                c_movie_2_rating = c_user_data[c_user_data['movieId'] == movie_2]['rating_adjusted'].iloc[0]
                record_row = pd.Series([c_userid, movie_1, movie_2, c_movie_1_rating, c_movie_2_rating], index=record_row_columns)
                record_movie_1_2 = record_movie_1_2.append(record_row, ignore_index=True)
                
        # calculate the similarity values between movie_1 and the above recorded movies
        distinct_movie_2 = np.unique(record_movie_1_2['movie_2'])
        # for each movie 2
        for movie_2 in distinct_movie_2:
            print('calculate weight movie_1 %d, movie_2 %d' % (movie_1, movie_2))
            paired_movie_1_2 = record_movie_1_2[record_movie_1_2['movie_2'] == movie_2]
            sim_value_numerator = (paired_movie_1_2['rating_adjusted_1'] * paired_movie_1_2['rating_adjusted_2']).sum()
            sim_value_denominator = np.sqrt(np.square(paired_movie_1_2['rating_adjusted_1']).sum()) * np.sqrt(np.square(paired_movie_1_2['rating_adjusted_2']).sum())
            sim_value_denominator = sim_value_denominator if sim_value_denominator != 0 else 1e-8
            sim_value = sim_value_numerator / sim_value_denominator
            w_matrix = w_matrix.append(pd.Series([movie_1, movie_2, sim_value], index=w_matrix_columns), ignore_index=True)

    i=i+1
    
    return w_matrix

In [258]:
w_matrix=build_w_matrix(adjusted_TRAIN_df)

0
calculate weight movie_1 1, movie_2 2
calculate weight movie_1 1, movie_2 3
calculate weight movie_1 1, movie_2 4
calculate weight movie_1 1, movie_2 5
calculate weight movie_1 1, movie_2 6
calculate weight movie_1 1, movie_2 7
calculate weight movie_1 1, movie_2 8
calculate weight movie_1 1, movie_2 9
calculate weight movie_1 1, movie_2 10
calculate weight movie_1 1, movie_2 11
calculate weight movie_1 1, movie_2 12
calculate weight movie_1 1, movie_2 13
calculate weight movie_1 1, movie_2 14
calculate weight movie_1 1, movie_2 15
calculate weight movie_1 1, movie_2 16
calculate weight movie_1 1, movie_2 17
calculate weight movie_1 1, movie_2 18
calculate weight movie_1 1, movie_2 19
calculate weight movie_1 1, movie_2 20
calculate weight movie_1 1, movie_2 21
calculate weight movie_1 1, movie_2 22
calculate weight movie_1 1, movie_2 23
calculate weight movie_1 1, movie_2 24
calculate weight movie_1 1, movie_2 25
calculate weight movie_1 1, movie_2 26
calculate weight movie_1 1, mov

In [259]:
w_matrix

Unnamed: 0,movie_1,movie_2,weight
0,1.0,2.0,0.433435
1,1.0,3.0,0.534481
2,1.0,4.0,0.408718
3,1.0,5.0,0.268786
4,1.0,6.0,0.008770
5,1.0,7.0,0.064662
6,1.0,8.0,0.950236
7,1.0,9.0,-0.108718
8,1.0,10.0,-0.051464
9,1.0,11.0,0.019629


In [291]:
# calculate the predicted ratings
def predict(userId, movieId, w_matrix, df, df_mean):
    # fix missing mean rating which was caused by no ratings for the given movie
    # mean_rating exists for movieId
    if df_mean[df_mean['movieId'] == movieId].shape[0] > 0:
        mean_rating = df_mean[df_mean['movieId'] == movieId]['rating_mean'].iloc[0]
    # mean_rating does not exist for movieId(which may be caused by no ratings for the movie)
    else:
        mean_rating = 2.5

    # calculate the rating of the given movie by the given user
    user_other_ratings = df[df['userId'] == userId]
    user_distinct_movies = np.unique(user_other_ratings['movieId'])
    sum_weighted_other_ratings = 0
    sum_weghts = 0
    for movie_j in user_distinct_movies:
        if df_mean[df_mean['movieId'] == movie_j].shape[0] > 0:
            rating_mean_j = df_mean[df_mean['movieId'] == movie_j]['rating_mean'].iloc[0]
        else:
            rating_mean_j = 2.5
        
        # only calculate the weighted values when the weight between movie_1 and movie_2 exists in weight matrix
        w_movie_1_2 = w_matrix[(w_matrix['movie_1'] == movieId) & (w_matrix['movie_2'] == movie_j)]
        if w_movie_1_2.shape[0] > 0:
            user_rating_j = user_other_ratings[user_other_ratings['movieId']==movie_j]
            sum_weighted_other_ratings += (user_rating_j['rating'].iloc[0] - rating_mean_j) * w_movie_1_2['weight'].iloc[0]
            sum_weghts += np.abs(w_movie_1_2['weight'].iloc[0])

   # if sum_weights is 0 (which may be because of no ratings from new users), use the mean ratings
    if sum_weghts == 0:
        predicted_rating = mean_rating
   # sum_weights is bigger than 0
    else:
        predicted_rating = mean_rating + sum_weighted_other_ratings/sum_weghts

        
    return predicted_rating

In [300]:
# for each data point in test set, we generate predicted moving rate and compute prediction error
ind=0
error = []
ind =0
for i in TEST_df['userId'].unique():
    for j in list(TEST_df[TEST_df['userId']==i]['movieId']):
        actual_rating = TEST_df[(TEST_df['userId']==i) & (TEST_df['movieId']==j)]['rating']
        temp = (predict(i, j, w_matrix, adjusted_TRAIN_df, TRAIN_df_mean)-actual_rating)**2
        error.append(temp.values)
        print(ind)
        ind+=1

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
27

In [299]:
print('Item-Based CF MSE = ', np.mean(error))

Item-Based CF MSE =  1.0284160948675094


### Item-Based CF did worse than User-Based CF

## 3. Finally, we want to use Matrix Factorization Method

In [335]:
R_df = TRAIN_df.pivot(index = 'userId', columns ='movieId', values = 'rating').fillna(0)
R_df.head()
for i in range(1,101):
    if '{}'.format(i) in list(R_df.columns):
        print('no need to add column')
    
    else:
            R_df.loc[:,i] = 0
    


In [336]:
R_df.shape

(386, 100)

In [337]:
R = R_df.as_matrix()
user_ratings_mean = np.mean(R, axis = 1)
R_demeaned = R - user_ratings_mean.reshape(-1, 1)

  if __name__ == '__main__':


In [338]:
R_demeaned.shape

(386, 100)

In [339]:
# We use Singular Value Decomposition as the matrix factorization method
from scipy.sparse.linalg import svds
U, sigma, Vt = svds(R_demeaned, k = 50) # assume there are 50 latent variables (this part can  is indeed hyperparameter, due to time constraint, i'm not tuning it)


In [None]:
ind=0
error = []
ind =0
for i in TEST_df['userId'].unique():
    for j in list(TEST_df[TEST_df['userId']==i]['movieId']):
        actual_rating = TEST_df[(TEST_df['userId']==i) & (TEST_df['movieId']==j)]['rating']
        temp = U[i]*
        error.append(temp.values)
        print(ind)
        ind+=1