In [1]:
from jupyterthemes import get_themes
import jupyterthemes as jt
from jupyterthemes.stylefx import set_nb_theme
set_nb_theme('monokai')

<hr>

<a id="ref2"></a>
# Preprocessing

First, let's get all of the imports out of the way:

In [262]:
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Now let's read each file into its own DataFrame:

In [263]:
movies_df = pd.read_csv('C:/Users/ngcph/Desktop/Recommender-System-master/ml-latest-small/movies.csv')
ratings_df = pd.read_csv('C:/Users/ngcph/Desktop/Recommender-System-master/ml-latest-small/ratings.csv')

Let's also take a peek at how each of them is organized:

In [264]:
movies_df.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


So each movie has a unique ID, a title that includes its release year (and may contain Unicode characters), and several genres packed into a single field. Let's move the year out of the __title__ column and into a new __year__ column by using the handy [extract](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html#pandas.Series.str.extract) function that pandas has, and then remove it from the title with pandas' replace function.

In [265]:
#Using regular expressions to find a year stored between parentheses
#We specify the parentheses so we don't conflict with movies that have years in their titles
#Raw strings (r'...') keep the backslashes from being treated as string escapes
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
#Removing the parentheses
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
#Removing the years from the 'title' column (regex=True is needed in recent pandas versions)
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Applying the strip function to get rid of any trailing whitespace characters that may have appeared
movies_df['title'] = movies_df['title'].str.strip()
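
As a quick illustration of why the parentheses matter, here is a minimal sketch on a few toy titles (the titles are illustrative and not taken from the dataset). It uses a single capture group for the digits, which is a slightly more compact variant of the two-step extraction above, not the notebook's exact code.

```python
import pandas as pd

# Toy titles (illustrative only); the first one has a year inside the title itself,
# the last one has no year at all
titles = pd.Series(["2001: A Space Odyssey (1968)", "Toy Story (1995)", "Untitled Film"])

# Capture only the four digits that sit inside parentheses
years = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the parenthesised year, then trim leftover whitespace
clean_titles = titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(years.tolist())
print(clean_titles.tolist())
```

Titles without a parenthesised year end up with a missing value in the year column, so it can be worth checking `movies_df['year'].isna().sum()` after the extraction.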



Let's look at the result!

In [266]:
movies_df.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


In [267]:
len(movies_df)

9125

With that, let's also drop the genres column since we won't need it for this particular recommendation system.

In [268]:
#Dropping the genres column
movies_df = movies_df.drop(columns='genres')

  


Here's the final movies dataframe:

In [269]:
movies_df.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


<br>

Next, let's look at the ratings dataframe.

In [270]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [271]:
len(ratings_df)

100004

Every row in the ratings dataframe pairs one user ID with one movie ID, together with the rating the user gave and a timestamp showing when they reviewed it. We won't be needing the timestamp column, so let's drop it to save on memory.

In [272]:
#Drop removes a specified row or column from a dataframe; columns= makes the axis explicit
ratings_df = ratings_df.drop(columns='timestamp')

  


Here's what the final ratings DataFrame looks like:

In [273]:
ratings_df.head()

Unnamed: 0,userId,movieId,rating
0,1,31,2.5
1,1,1029,3.0
2,1,1061,3.0
3,1,1129,2.0
4,1,1172,4.0


In [274]:
ratings_df.groupby(["userId"]).count()

Unnamed: 0_level_0,movieId,rating
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,20,20
2,76,76
3,51,51
4,204,204
5,100,100
...,...,...
667,68,68
668,20,20
669,37,37
670,31,31


<hr>

<a id="ref3"></a>
# Collaborative Filtering


**__Collaborative Filtering__** <br>
__User-User Filtering__

The process for creating a user-based recommendation system is as follows:
- Select a user along with the movies that user has watched
- Based on the user's ratings of those movies, find the top X neighbours
- Get the record of watched movies for each neighbour
- Calculate a similarity score using some formula
- Recommend the items with the highest score
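
The steps above can be sketched end to end on a toy ratings table. This is a minimal sketch under stated assumptions: the data is made up, `recommend` and `top_x` are hypothetical names, similarity is computed with `numpy.corrcoef`, and only neighbours with a positive similarity are kept (a choice made for this sketch, not something the notebook does).

```python
import numpy as np
import pandas as pd

# Toy ratings table (made-up values, same columns as ratings_df)
ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
    "movieId": [10, 20, 30, 40, 10, 20, 30, 50, 10, 20, 30],
    "rating":  [5.0, 1.0, 4.0, 5.0, 1.0, 5.0, 2.0, 4.0, 4.0, 2.0, 5.0],
})
# Ratings of the input user, indexed by movieId (hypothetical values)
target = pd.Series({10: 5.0, 20: 1.5, 30: 4.5})

def recommend(ratings, target, top_x=2):
    # Keep only ratings for movies the input user has also rated
    shared = ratings[ratings["movieId"].isin(target.index)]
    # Similarity of every other user to the input user
    sims = {}
    for user_id, group in shared.groupby("userId"):
        if len(group) < 2:
            continue  # correlation is undefined for a single shared movie
        theirs = group.set_index("movieId")["rating"]
        mine = target.loc[theirs.index]
        if theirs.std() == 0 or mine.std() == 0:
            continue  # constant ratings give a zero denominator
        sims[user_id] = np.corrcoef(mine, theirs)[0, 1]
    sims = pd.Series(sims, dtype=float)
    # Top X neighbours (assumption of this sketch: positive similarity only)
    neighbours = sims[sims > 0].nlargest(top_x)
    # Neighbours' ratings for movies the input user has not rated yet
    unseen = ratings["userId"].isin(neighbours.index) & ~ratings["movieId"].isin(target.index)
    cand = ratings[unseen].copy()
    cand["weight"] = cand["userId"].map(neighbours)
    cand["weighted"] = cand["weight"] * cand["rating"]
    sums = cand.groupby("movieId")[["weight", "weighted"]].sum()
    # Similarity-weighted average rating per movie, best first
    return (sums["weighted"] / sums["weight"]).sort_values(ascending=False)

recs = recommend(ratings, target)
print(recs)
```

On this toy data, user 2 rates the shared movies in the opposite direction to the input user, so movie 50 is never suggested, and only movie 40 (rated by user 1) is returned.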


In [324]:
# Movies rated by the input user
userInput = [
            {'title':'Toy Story', 'rating':5},
            {'title':'Jumanji', 'rating':1.5},
            {'title':'Red Rock West', 'rating':2},
            {'title':"Balto", 'rating':5},
            {'title':'Lion King, The', 'rating':4.5}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Toy Story,5.0
1,Jumanji,1.5
2,Red Rock West,2.0
3,Balto,5.0
4,"Lion King, The",4.5


#### Add movieId to input user


In [325]:
movies_df['title'].isin(inputMovies['title'].tolist())

0        True
1        True
2       False
3       False
4       False
        ...  
9120    False
9121    False
9122    False
9123    False
9124    False
Name: title, Length: 9125, dtype: bool

In [326]:
#Filtering out the movies by title
inputId = movies_df[movies_df['title'].isin(inputMovies['title'].tolist())]
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns='year')
#Final input dataframe
inputMovies

  


Unnamed: 0,movieId,title,rating
0,1,Toy Story,5.0
1,2,Jumanji,1.5
2,13,Balto,5.0
3,364,"Lion King, The",4.5
4,373,Red Rock West,2.0
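
The merge above relies on pandas picking the shared `title` column as the join key. A minimal sketch of a more explicit variant follows, using toy frames with illustrative values. It names the key with `on=`, and uses `how="left"` with `indicator=True` so that input titles missing from the catalogue are visible instead of being silently dropped.

```python
import pandas as pd

# Toy catalogue and input (illustrative values only)
catalogue = pd.DataFrame({"movieId": [1, 2, 13], "title": ["Toy Story", "Jumanji", "Balto"]})
wanted = pd.DataFrame({
    "title": ["Toy Story", "Balto", "Not In Catalogue"],
    "rating": [5.0, 5.0, 3.0],
})

# Name the join key explicitly and keep every input row so misses stay visible
merged = wanted.merge(catalogue, on="title", how="left", indicator=True)

# Titles that found no movieId in the catalogue
missing = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(missing)
```

Because the year was stripped from the titles earlier, two different films that share a name would both match the same input row, so it may also be worth checking the merged frame for duplicated titles.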


#### The users who have seen the same movies


In [333]:
#Filtering out users that have watched movies that the input has watched and storing it
userSubset = ratings_df[ratings_df['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head(10)

Unnamed: 0,userId,movieId,rating
59,2,364,3.0
161,4,364,5.0
360,5,364,4.0
495,7,1,3.0
518,7,364,3.0
699,9,1,4.0
889,13,1,5.0
962,15,1,2.0
963,15,2,2.0
1056,15,364,4.0


In [335]:
len(userSubset)

580

In [334]:
userSubset.groupby(["userId"]).count()

Unnamed: 0_level_0,movieId,rating
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
2,1,1
4,1,1
5,1,1
7,2,2
9,1,1
...,...,...
663,1,1
664,1,1
665,2,2
670,1,1


In [336]:
userSubset.groupby("movieId").count()

Unnamed: 0_level_0,userId,rating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,247,247
2,107,107
13,8,8
364,200,200
373,18,18


We now group up the rows by user ID.

In [337]:
#Groupby creates several sub dataframes, each holding the rows that share the same value in the given column
#Passing the column name as a string (not a one-element list) keeps the group keys as plain user IDs
userSubsetGroup = userSubset.groupby('userId')
userSubsetGroup.count()

Unnamed: 0_level_0,movieId,rating
userId,Unnamed: 1_level_1,Unnamed: 2_level_1
2,1,1
4,1,1
5,1,1
7,2,2
9,1,1
...,...,...
663,1,1
664,1,1
665,2,2
670,1,1


Let's look at a few of the users, starting with the one with userId=48:

In [338]:
userSubsetGroup.get_group(48)

Unnamed: 0,userId,movieId,rating
7375,48,1,4.0
7376,48,2,3.5
7389,48,364,4.0


In [339]:
userSubsetGroup.get_group(73)

Unnamed: 0,userId,movieId,rating
10214,73,1,5.0
10215,73,2,2.5
10296,73,364,5.0


In [340]:
userSubsetGroup.get_group(402)

Unnamed: 0,userId,movieId,rating
55493,402,1,2.0
55495,402,13,4.5
55502,402,364,4.0


In [341]:
userSubsetGroup.get_group(213)

Unnamed: 0,userId,movieId,rating
29266,213,1,3.0
29267,213,2,3.0
29269,213,13,2.5
29313,213,364,2.5


In [342]:
userSubsetGroup.get_group(262)

Unnamed: 0,userId,movieId,rating
35983,262,1,2.5
35984,262,2,2.0
36010,262,364,2.0


Let's also sort these groups so that the users who share the most movies with the input come first. Since we won't go through every single user, this makes sure the ones we do keep are those with the most ratings to compare against.

In [343]:
#Sorting it so users with the most movies in common with the input will have priority
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)
len(userSubsetGroup)

346
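
To see the overlap counts directly, the following minimal sketch uses a toy stand-in for `userSubset` (made-up rows; the name `subset` is hypothetical). It counts shared movies per user with `groupby(...).size()` and shows that this gives the same ordering as sorting the groups by length.

```python
import pandas as pd

# Toy stand-in for userSubset (made-up rows)
subset = pd.DataFrame({
    "userId":  [7, 7, 9, 15, 15, 15],
    "movieId": [1, 364, 1, 1, 2, 364],
    "rating":  [3.0, 3.0, 4.0, 2.0, 2.0, 4.0],
})

# Number of shared movies per user, largest overlap first
overlap = subset.groupby("userId").size().sort_values(ascending=False)
print(overlap)

# Same ordering as sorting the (userId, group) pairs by group length
ordered = sorted(subset.groupby("userId"), key=lambda pair: len(pair[1]), reverse=True)
print([user_id for user_id, _ in ordered])
```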

Now let's look at the first three users:

In [344]:
userSubsetGroup[0:3]
# type(userSubsetGroup[0])

[(564,
         userId  movieId  rating
  82824     564        1     4.0
  82825     564        2     4.0
  82833     564       13     4.0
  83034     564      364     4.0
  83042     564      373     5.0),
 (15,
        userId  movieId  rating
  962       15        1     2.0
  963       15        2     2.0
  1056      15      364     4.0
  1061      15      373     3.5),
 (19,
        userId  movieId  rating
  3105      19        1     3.0
  3106      19        2     3.0
  3219      19      364     3.0
  3226      19      373     4.0)]

In [345]:
userSubsetGroup

[(564,
         userId  movieId  rating
  82824     564        1     4.0
  82825     564        2     4.0
  82833     564       13     4.0
  83034     564      364     4.0
  83042     564      373     5.0),
 (15,
        userId  movieId  rating
  962       15        1     2.0
  963       15        2     2.0
  1056      15      364     4.0
  1061      15      373     3.5),
 (19,
        userId  movieId  rating
  3105      19        1     3.0
  3106      19        2     3.0
  3219      19      364     3.0
  3226      19      373     4.0),
 (119,
         userId  movieId  rating
  17701     119        1     2.0
  17702     119        2     3.0
  17753     119      364     3.0
  17756     119      373     4.0),
 (213,
         userId  movieId  rating
  29266     213        1     3.0
  29267     213        2     3.0
  29269     213       13     2.5
  29313     213      364     2.5),
 (287,
         userId  movieId  rating
  39293     287        1     5.0
  39294     287        2     5.0
  3

In [346]:
userSubsetGroup[-3:]

[(664,
         userId  movieId  rating
  98740     664        1     3.5),
 (670,
         userId  movieId  rating
  99858     670        1     4.0),
 (671,
         userId  movieId  rating
  99889     671        1     5.0)]

#### Similarity of users to input user

Pearson correlation 

![alt text](https://www.wallstreetmojo.com/wp-content/uploads/2019/07/Pearson-Correlation-Coefficient-Formula1.jpg "Pearson Correlation")

The values given by the formula range from r = -1 to r = 1, where 1 means a perfect positive linear correlation between the two sets of ratings and -1 means a perfect negative one.

In our case, a value close to 1 means that the two users have similar tastes, while a value close to -1 means the opposite.
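
For reference, the coefficient can be written in the sum-of-squares form that the `Sxx`, `Syy` and `Sxy` variables in the code below correspond to. Here $x_i$ are the input user's ratings and $y_i$ the other user's ratings for the $n$ movies they have in common; this notation is chosen to match the code and is algebraically equivalent to the usual definition of Pearson's $r$.

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

$$
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}
$$

When $S_{xx}$ or $S_{yy}$ is zero (for example, when a user gave every shared movie the same rating, or only one movie is shared), $r$ is undefined.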

In [359]:
userSubsetGroup = userSubsetGroup[0:100]

Now we calculate the Pearson correlation between the input user and each user in the subset group, and store it in a dictionary, where the key is the user ID and the value is the coefficient.


In [360]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    # print("N Ratings:", nRatings)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
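
As a sanity check, the hand-written formula can be compared with `numpy.corrcoef`. The sketch below wraps the same arithmetic in a hypothetical helper named `pearson` and feeds it the ratings shown earlier for the input user and user 73 on movies 1, 2 and 364.

```python
from math import sqrt

import numpy as np

def pearson(x, y):
    """Pearson correlation via the sum-of-squares form; returns 0.0 when undefined."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0  # same fallback as the loop above
    return sxy / sqrt(sxx * syy)

# Input user's ratings vs. user 73's ratings for movies 1, 2 and 364
x = [5.0, 1.5, 4.5]
y = [5.0, 2.5, 5.0]

r = pearson(x, y)
print(r, np.corrcoef(x, y)[0, 1])
```

Both values come out at roughly 0.9912, which agrees with the coefficient stored for user 73 in the dictionary.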


In [361]:
pearsonCorrelationDict.items()

dict_items([(564, -0.5229763603684897), (15, 0.1151023138790327), (19, -0.4745789978762495), (119, -0.6974858324629157), (213, -0.5144957554275265), (287, -0.5144957554275265), (306, -0.6644105970267493), (518, 0.6859943405700353), (595, 0.9801960588196068), (30, 0.9244734516419051), (48, 0.9912407071619398), (69, 0.9912407071619305), (73, 0.9912407071619302), (92, 0.13206763594884358), (126, 0.9244734516419051), (128, 0.9912407071619259), (134, -0.8934051474415661), (150, 0), (165, -0.13206763594884358), (176, -0.6099942813304177), (182, -0.9244734516419051), (185, 0.13206763594884358), (187, 0.8934051474415661), (200, -0.9912407071619291), (212, 0.38124642583151175), (241, 0), (262, 0.6099942813304209), (268, 0.9912407071619305), (285, 0.6099942813304209), (292, 0.9912407071619398), (313, 0.7924058156930616), (324, -0.3812464258315123), (353, 0.6099942813304188), (355, 0.38124642583151347), (382, 0.6099942813304144), (396, 0.9244734516419051), (402, -0.32732683535398394), (418, 0.857

In [362]:
len(pearsonCorrelationDict.items())

100

In [363]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head(10)

Unnamed: 0,similarityIndex,userId
0,-0.522976,564
1,0.115102,15
2,-0.474579,19
3,-0.697486,119
4,-0.514496,213
5,-0.514496,287
6,-0.664411,306
7,0.685994,518
8,0.980196,595
9,0.924473,30


#### The top x similar users to input user
Now let's get the top 15 users that are most similar to the input.

In [377]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:15]
topUsers

Unnamed: 0,similarityIndex,userId
99,1.0,192
93,1.0,168
77,1.0,88
81,1.0,99
84,1.0,124
72,1.0,77
71,1.0,72
70,1.0,68
89,1.0,151
90,1.0,157


Now, let's start recommending movies to the input user.

#### Rating of selected users to all movies

In [365]:
topUsersRating=topUsers.merge(ratings_df, left_on='userId', right_on='userId', how='inner')
topUsersRating.head(10)

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.0,192,2,4.0
1,1.0,192,19,3.0
2,1.0,192,34,5.0
3,1.0,192,39,3.0
4,1.0,192,44,3.0
5,1.0,192,47,3.0
6,1.0,192,48,3.0
7,1.0,192,62,5.0
8,1.0,192,158,3.0
9,1.0,192,160,3.0


In [366]:
len(topUsersRating)

7058

In [367]:
topUsersRating

Unnamed: 0,similarityIndex,userId,movieId,rating
0,1.000000,192,2,4.0
1,1.000000,192,19,3.0
2,1.000000,192,34,5.0
3,1.000000,192,39,3.0
4,1.000000,192,44,3.0
...,...,...,...,...
7053,0.991241,73,158238,4.0
7054,0.991241,73,159462,3.0
7055,0.991241,73,159858,3.5
7056,0.991241,73,161594,3.0


Now all we need to do is multiply each movie rating by its weight (the similarity index of the user who gave it), sum the weighted ratings for each movie, and divide that sum by the sum of the weights.

In [388]:
#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head(30)

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,192,2,4.0,4.0
1,1.0,192,19,3.0,3.0
2,1.0,192,34,5.0,5.0
3,1.0,192,39,3.0,3.0
4,1.0,192,44,3.0,3.0
5,1.0,192,47,3.0,3.0
6,1.0,192,48,3.0,3.0
7,1.0,192,62,5.0,5.0
8,1.0,192,158,3.0,3.0
9,1.0,192,160,3.0,3.0


In [391]:
topUsersRating.groupby(['movieId']).count()

Unnamed: 0_level_0,similarityIndex,userId,rating,weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,17,17,17,17
2,17,17,17,17
3,1,1,1,1
4,1,1,1,1
5,3,3,3,3
...,...,...,...,...
158238,1,1,1,1
159462,1,1,1,1
159858,1,1,1,1
161594,1,1,1,1


In [369]:
#Sums the similarity indexes and weighted ratings of the top users after grouping by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

Unnamed: 0_level_0,sum_similarityIndex,sum_weightedRating
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,16.92387,74.142923
2,16.92387,51.764526
3,0.993944,2.48486
4,1.0,3.0
5,2.973722,12.886129


In [370]:
len(tempTopUsersRating)

2993

In [392]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head(10)

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4.380967,1
2,3.05867,2
3,2.5,3
4,3.0,4
5,4.333333,5
6,3.5,6
7,4.0,7
10,3.445452,10
11,3.570549,11
12,2.0,12
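
The weighted average can be checked by hand on a few toy numbers. In this minimal sketch (made-up similarities and ratings, with the notebook's column names reused), movie 1 is rated by two neighbours and movie 2 by only one.

```python
import pandas as pd

# Hypothetical neighbours: two rate movie 1, one rates movie 2
toy = pd.DataFrame({
    "movieId":         [1, 1, 2],
    "similarityIndex": [1.0, 0.5, 0.5],
    "rating":          [4.0, 2.0, 3.0],
})

# Weight each rating by the similarity of the user who gave it
toy["weightedRating"] = toy["similarityIndex"] * toy["rating"]

# Per movie: sum of weights and sum of weighted ratings
sums = toy.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()

# Movie 1: (1.0*4.0 + 0.5*2.0) / (1.0 + 0.5); movie 2: (0.5*3.0) / 0.5
score = sums["weightedRating"] / sums["similarityIndex"]
print(score)
```

Note that a movie rated by a single neighbour simply inherits that neighbour's rating as its score, whatever the similarity. This is likely why movies with very few ratings can reach a perfect 5.0; requiring a minimum number of neighbour ratings per movie is one possible way to dampen that effect.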


Now let's sort it and see the top-scoring movies that the algorithm recommended!

In [395]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(50)

Unnamed: 0_level_0,weighted average recommendation score,movieId
movieId,Unnamed: 1_level_1,Unnamed: 2_level_1
7802,5.0,7802
67504,5.0,67504
83359,5.0,83359
83318,5.0,83318
5791,5.0,5791
76173,5.0,76173
5244,5.0,5244
4822,5.0,4822
215,5.0,215
2297,5.0,2297


In [373]:
len(recommendation_df)

2993

In [374]:
movies_df.loc[movies_df['movieId'].isin(recommendation_df.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
1818,2297,What Dreams May Come,1998
1820,2301,History of the World: Part I,1981
3769,4822,Max Keeble's Big Move,2001
4023,5244,Shogun Assassin,1980
4313,5791,Frida,2002
5237,7802,"Warriors, The",1979
7194,67504,Land of Silence and Darkness (Land des Schweig...,1971
7505,76173,Micmacs (Micmacs à tire-larigot),2009
7710,83318,"Goat, The",1921
7714,83359,"Play House, The",1921
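
Filtering `movies_df` with `isin` returns the matching titles in catalogue order, so the ranking by score is lost. A minimal sketch of one way to keep the ranking is shown below, using toy frames (the scores are made up; the IDs and titles are taken from the table above): sort by score first, then attach titles with a left merge, which preserves the order of the left frame.

```python
import pandas as pd

# Toy stand-ins: made-up scores, titles copied from the table above
scores = pd.DataFrame({"movieId": [7802, 2297, 5791], "score": [5.0, 4.8, 4.5]})
movies = pd.DataFrame({
    "movieId": [2297, 5791, 7802],
    "title": ["What Dreams May Come", "Frida", "Warriors, The"],
})

# Sort by score, then look up the titles; how="left" keeps the left frame's row order
ranked = scores.sort_values("score", ascending=False).merge(movies, on="movieId", how="left")
print(ranked[["title", "score"]])
```

In the notebook itself, `recommendation_df` carries `movieId` both as its index name and as a column, so it would probably need a `reset_index(drop=True)` before a merge like this one.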


### Advantages and Disadvantages of Collaborative Filtering

##### Advantages
* Takes other users' ratings into consideration
* Doesn't need to study or extract information from the recommended item
* Adapts to the user's interests which might change over time

##### Disadvantages
* Approximation function can be slow
* There might be only a small number of users to approximate from
* Privacy issues when trying to learn the user's preferences