#Movie Recommendation System (Content-Based Filtering | Collaborative Filtering)

In [1]:
!wget  http://files.grouplens.org/datasets/movielens/ml-25m.zip 

--2021-02-05 06:14:57--  http://files.grouplens.org/datasets/movielens/ml-25m.zip
Resolving files.grouplens.org (files.grouplens.org)... 128.101.65.152
Connecting to files.grouplens.org (files.grouplens.org)|128.101.65.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261978986 (250M) [application/zip]
Saving to: ‘ml-25m.zip.1’


2021-02-05 06:15:01 (71.7 MB/s) - ‘ml-25m.zip.1’ saved [261978986/261978986]



In [2]:
!unzip /content/ml-25m.zip

Archive:  /content/ml-25m.zip
replace ml-25m/tags.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ml-25m/tags.csv         
replace ml-25m/links.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ml-25m/links.csv        
replace ml-25m/README.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ml-25m/README.txt       
replace ml-25m/ratings.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ml-25m/ratings.csv      
replace ml-25m/genome-tags.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ml-25m/genome-tags.csv  
replace ml-25m/genome-scores.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ml-25m/genome-scores.csv  
replace ml-25m/movies.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: ml-25m/movies.csv       


In [3]:
#Import the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from math import sqrt

In [4]:
# #Read the csv file
df1 = pd.read_csv("/content/ml-25m/movies.csv")
df2 = pd.read_csv("/content/ml-25m/ratings.csv")

In [5]:
# #shape of the data sets
print("df1 shape movies.csv is: ",df1.shape)
print("df2 shape ratings.csv is: ",df2.shape)

df1 shape movies.csv is:  (62423, 3)
df2 shape ratings.csv is:  (25000095, 4)


In [6]:
df1.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [7]:
df2.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


#Preprocessing

In [8]:
#Split the genres string on the '|' character
df1['genres']=df1.genres.apply(lambda x:x.split('|'))
df1.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),"[Adventure, Animation, Children, Comedy, Fantasy]"
1,2,Jumanji (1995),"[Adventure, Children, Fantasy]"
2,3,Grumpier Old Men (1995),"[Comedy, Romance]"
3,4,Waiting to Exhale (1995),"[Comedy, Drama, Romance]"
4,5,Father of the Bride Part II (1995),[Comedy]


In [9]:
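#Remove the year from the title and store it in a separate column
#Raw strings keep the backslashes in the patterns from being read as string escapes
df1['year'] = df1.title.str.extract(r'(\(\d\d\d\d\))',expand=False)
df1['year']=df1.year.str.extract(r'(\d\d\d\d)',expand=False)
#regex=True is required on pandas 2.0+, where str.replace treats the pattern as a literal string by default
df1['title']=df1.title.str.replace(r'(\(\d\d\d\d\))','',regex=True)
df1['title']=df1.title.apply(lambda x:x.strip()) 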
df1.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995
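
The two `extract` calls can also be folded into a single capture group. Here is a minimal sketch on a made-up sample of titles (the `titles` Series is an assumption for illustration, not part of the notebook's data flow). The capture group sits inside the escaped parentheses, so it returns the four digits directly, a year that is part of the title itself is left alone, and a title without a year yields NaN:

```python
import pandas as pd

# Hypothetical sample titles in the same "Title (year)" layout as movies.csv
titles = pd.Series([
    "Toy Story (1995)",
    "2001: A Space Odyssey (1968)",
    "Untitled Project",
])

# Capture only the digits that sit between literal parentheses
year = titles.str.extract(r"\((\d{4})\)", expand=False)

# Remove the parenthesised year, then trim leftover whitespace
clean_title = titles.str.replace(r"\(\d{4}\)", "", regex=True).str.strip()

print(pd.DataFrame({"title": clean_title, "year": year}))
```

The year stays a string here, as it does in the cell above; converting it to a number would be a separate step.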


In [10]:
df1.shape

(62423, 4)

Keeping genres in a list isn't convenient for the content-based recommendation technique, so we will use One Hot Encoding to convert each list of genres into a vector with one column per possible genre. Numeric operations such as the dot product used later need the categorical genre data in this form. Each genre column contains either 1 or 0: 1 shows that a movie has that genre and 0 shows that it doesn't. Let's store the encoded dataframe in a separate variable so that df1 itself stays compact.

In [11]:
#Copying the df1 dataframe into a new one so the one-hot genre columns don't clutter df1
moviesWithGenres_df = df1.copy()

#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in df1.iterrows():
    for genre in row['genres']:
        moviesWithGenres_df.at[index, genre] = 1
#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
moviesWithGenres_df = moviesWithGenres_df.fillna(0)
moviesWithGenres_df.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
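
The row-by-row loop above is easy to follow, but it visits all 62,423 movies one at a time. As a rough sketch of an alternative (the three-movie `toy` frame is made up for illustration), pandas can build the same kind of 0/1 genre columns in one call with `Series.str.get_dummies`, applied to the original pipe-separated genres string rather than to the list:

```python
import pandas as pd

# Hypothetical three-movie sample in the same layout as movies.csv
toy = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story", "Jumanji", "Grumpier Old Men"],
    "genres": [
        "Adventure|Animation|Children",
        "Adventure|Children",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre, built directly from the '|'-separated string
genre_flags = toy["genres"].str.get_dummies(sep="|")

# Attach the indicator columns to the movie metadata
toy_with_genres = pd.concat([toy[["movieId", "title"]], genre_flags], axis=1)
print(toy_with_genres)
```

Unlike the loop, this produces integer columns and no NaN values, so no `fillna(0)` step is needed.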


#Ratings data frame

In [12]:
df2.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


Every row in the ratings dataframe records one user's rating of one movie, along with a timestamp showing when they rated it. We won't be needing the timestamp column, so let's drop it to save on memory.

In [13]:
#Drop removes a specified row or column from a dataframe
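#The columns keyword works across pandas versions; the positional axis form drop('timestamp', 1) fails on pandas 2.0+
df2 = df2.drop(columns='timestamp')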
df2.head()

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
3,1,665,5.0
4,1,899,3.5


#Content-Based filtering

Let's begin by creating an input user to recommend movies to:

Notice: To add more movies, simply add more elements to userInput. Feel free to add more! Just be sure to capitalise each title exactly as it appears in the dataset, and if a movie starts with "The", like "The Matrix", then write it like this: 'Matrix, The'

In [14]:
userInput = [
            {'title':'Toy Story', 'rating':5},
            {'title':'Jumanji', 'rating':3.5},
            {'title':'Grumpier Old Men', 'rating':3},
            {'title':'Bad Poems', 'rating':1},
            {'title':'A Girl Thing', 'rating':4.5},
            {'title':"Women of Devil's Island", 'rating':2}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Toy Story,5.0
1,Jumanji,3.5
2,Grumpier Old Men,3.0
3,Bad Poems,1.0
4,A Girl Thing,4.5
5,Women of Devil's Island,2.0


Add movieId to input user

With the input complete, let's extract the input movies' IDs from the movies dataframe and add them to it.

We can achieve this by first selecting the rows that contain the input movies' titles and then merging this subset with the input dataframe. We also drop unnecessary columns from the input to save memory.


In [15]:
#Filtering out the movies by title
inputId = df1[df1['title'].isin(inputMovies['title'].tolist())]
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns=['genres', 'year'])
#Final input dataframe
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,5.0
1,2,Jumanji,3.5
2,3,Grumpier Old Men,3.0
3,209163,Bad Poems,1.0
4,209169,A Girl Thing,4.5
5,209171,Women of Devil's Island,2.0


We're going to start by learning the input's preferences, so let's get the subset of movies that the input has watched from the Dataframe containing genres defined with binary values.


In [16]:
#Filtering out the movies from the input
userMovies = moviesWithGenres_df[moviesWithGenres_df['movieId'].isin(inputMovies['movieId'].tolist())]
userMovies

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62420,209163,Bad Poems,"[Comedy, Drama]",2018,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
62421,209169,A Girl Thing,[(no genres listed)],2001,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
62422,209171,Women of Devil's Island,"[Action, Adventure, Drama]",1962,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We'll only need the actual genre table, so let's clean this up a bit by resetting the index and dropping the movieId, title, genres and year columns

In [17]:
#Resetting the index to avoid future issues
userMovies = userMovies.reset_index(drop=True)
#Dropping unnecessary columns to save memory and to avoid issues
userGenreTable = userMovies.drop(columns=['movieId', 'title', 'genres', 'year'])
userGenreTable

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
5,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [18]:
inputMovies['rating']

0    5.0
1    3.5
2    3.0
3    1.0
4    4.5
5    2.0
Name: rating, dtype: float64

In [19]:
#Dot product to get weights
userProfile = userGenreTable.transpose().dot(inputMovies['rating'])
#The user profile
userProfile

Adventure             10.5
Animation              5.0
Children               8.5
Comedy                 9.0
Fantasy                8.5
Romance                3.0
Drama                  3.0
Action                 2.0
Crime                  0.0
Thriller               0.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 0.0
IMAX                   0.0
Documentary            0.0
War                    0.0
Musical                0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     4.5
dtype: float64

Now we have a weight for each of the user's genre preferences. This is known as the User Profile. Using this, we can recommend movies that satisfy the user's preferences.
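
To see what the dot product does, here is a small sketch with made-up numbers (the `genres` table and `ratings` values are assumptions for illustration). Each genre's weight is simply the sum of the ratings of the movies that carry that genre:

```python
import pandas as pd

# Hypothetical genre table: three rated movies (rows) by three genres (columns)
genres = pd.DataFrame({
    "Adventure": [1, 1, 0],
    "Children":  [1, 1, 0],
    "Comedy":    [1, 0, 1],
})

# The user's ratings for the same three movies, in the same row order
ratings = pd.Series([5.0, 3.5, 3.0])

# Transpose so genres become rows, then weight each movie column by its rating
profile = genres.T.dot(ratings)
print(profile)
```

Adventure and Children each collect 5.0 + 3.5 = 8.5, while Comedy collects 5.0 + 3.0 = 8.0. The row order of the genre table has to match the order of the ratings, which is why the index was reset above.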

Let's start by extracting the genre table from the original dataframe:


In [20]:
#Now let's get the genres of every movie in our original dataframe
genreTable = moviesWithGenres_df.set_index(moviesWithGenres_df['movieId'])
#And drop the unnecessary information
genreTable = genreTable.drop(columns=['movieId', 'title', 'genres', 'year'])
genreTable.head()

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,Documentary,War,Musical,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [21]:
genreTable.shape

(62423, 20)

With the input's profile and the complete list of movies and their genres in hand, we're going to take the weighted average of every movie based on the input profile and recommend the top twenty movies that most satisfy it.


In [22]:
#Multiply the genres by the weights and then take the weighted average
recommendationTable_df = ((genreTable*userProfile).sum(axis=1))/(userProfile.sum())
recommendationTable_df.head()

movieId
1    0.768519
2    0.509259
3    0.222222
4    0.277778
5    0.166667
dtype: float64
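
The scores above can be checked by hand. The user profile sums to 54, and Toy Story carries Adventure, Animation, Children, Comedy and Fantasy, so its score is (10.5 + 5 + 8.5 + 9 + 8.5) / 54 = 41.5 / 54 ≈ 0.7685. The same calculation on a made-up two-movie example (the `profile` and `candidates` values below are assumptions for illustration) looks like this:

```python
import pandas as pd

# Hypothetical user profile: one weight per genre
profile = pd.Series({"Adventure": 8.5, "Children": 8.5, "Comedy": 8.0})

# Hypothetical candidate movies with 0/1 genre flags, indexed by a made-up id
candidates = pd.DataFrame(
    {"Adventure": [1, 0], "Children": [1, 0], "Comedy": [0, 1]},
    index=pd.Index([101, 102], name="movieId"),
)

# Multiply each genre flag by its weight, sum per movie, normalise by total weight
scores = (candidates * profile).sum(axis=1) / profile.sum()
print(scores.sort_values(ascending=False))
```

Movie 101 scores 17 / 25 = 0.68 and movie 102 scores 8 / 25 = 0.32. Because the score only depends on genre flags, every movie with the same genre combination gets exactly the same score.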

#Recommendation system Content Based Filtering

In [23]:
#Sort our recommendations in descending order
recommendationTable_df = recommendationTable_df.sort_values(ascending=False)
#Just a peek at the values
recommendationTable_df.head()

movieId
26093     0.879630
134853    0.824074
4306      0.824074
56152     0.824074
84637     0.824074
dtype: float64

In [24]:
#The final recommendation table
df1.loc[df1['movieId'].isin(recommendationTable_df.head(20).keys())]

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
3653,3754,"Adventures of Rocky and Bullwinkle, The","[Adventure, Animation, Children, Comedy, Fantasy]",2000
3912,4016,"Emperor's New Groove, The","[Adventure, Animation, Children, Comedy, Fantasy]",2000
4201,4306,Shrek,"[Adventure, Animation, Children, Comedy, Fanta...",2001
8571,26093,"Wonderful World of the Brothers Grimm, The","[Adventure, Animation, Children, Comedy, Drama...",1962
8748,26340,"Twelve Tasks of Asterix, The (Les douze travau...","[Action, Adventure, Animation, Children, Comed...",1976
9949,33463,DuckTales: The Movie - Treasure of the Lost Lamp,"[Adventure, Animation, Children, Comedy, Fantasy]",1990
10189,36397,Valiant,"[Adventure, Animation, Children, Comedy, Fanta...",2005
11480,51939,TMNT (Teenage Mutant Ninja Turtles),"[Action, Adventure, Animation, Children, Comed...",2007
11967,56152,Enchanted,"[Adventure, Animation, Children, Comedy, Fanta...",2007


The table above shows movies from the top 20 recommendations. Because the rows are selected with isin, they appear in df1 order rather than sorted by score.

#Collaborative Filtering
1) User-based collaborative filtering
2) Item-based collaborative filtering

#First implementation: user-based collaborative filtering

Read Data

In [25]:
# #Read the csv file
df1 = pd.read_csv("/content/ml-25m/movies.csv")
df2 = pd.read_csv("/content/ml-25m/ratings.csv")

In [26]:
df1.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


#Preprocessing the Data 

In [27]:
#Using regular expressions to find a year stored between parentheses
#We specify the parentheses so we don't conflict with movies that have years in their titles
df1['year'] = df1.title.str.extract(r'(\(\d\d\d\d\))',expand=False)
#Removing the parentheses
df1['year'] = df1.year.str.extract(r'(\d\d\d\d)',expand=False)
#Removing the years from the 'title' column (regex=True is required on pandas 2.0+)
df1['title'] = df1.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
#Applying the strip function to get rid of any ending whitespace characters that may have appeared
df1['title'] = df1['title'].apply(lambda x: x.strip())

In [28]:
df1.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995
1,2,Jumanji,Adventure|Children|Fantasy,1995
2,3,Grumpier Old Men,Comedy|Romance,1995
3,4,Waiting to Exhale,Comedy|Drama|Romance,1995
4,5,Father of the Bride Part II,Comedy,1995


In [29]:
df1.keys()

Index(['movieId', 'title', 'genres', 'year'], dtype='object')

In [30]:
#Dropping the genres column
df1.drop(['genres'],axis=1, inplace=True)

In [31]:
df1.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [32]:
#The ratings data set
df2.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,296,5.0,1147880044
1,1,306,3.5,1147868817
2,1,307,5.0,1147868828
3,1,665,5.0,1147878820
4,1,899,3.5,1147868510


In [33]:
#Drop the 'timestamp' column
df2.drop(['timestamp'],axis=1,inplace=True)

In [34]:
#Here is the final rating data set
df2.head()

Unnamed: 0,userId,movieId,rating
0,1,296,5.0
1,1,306,3.5
2,1,307,5.0
3,1,665,5.0
4,1,899,3.5


Let's begin by creating an input user to recommend movies to:

Notice: To add more movies, simply add more elements to userInput. Feel free to add more! Just be sure to capitalise each title exactly as it appears in the dataset, and if a movie starts with "The", like "The Matrix", then write it like this: 'Matrix, The'.


#Test User

In [35]:
userInput = [
            {'title':'Toy Story', 'rating':5},
            {'title':'Jumanji', 'rating':3.5},
            {'title':'Grumpier Old Men', 'rating':3},
            {'title':'Bad Poems', 'rating':1},
            {'title':'A Girl Thing', 'rating':4.5},
            {'title':"Women of Devil's Island", 'rating':2}
         ] 
inputMovies = pd.DataFrame(userInput)
inputMovies

Unnamed: 0,title,rating
0,Toy Story,5.0
1,Jumanji,3.5
2,Grumpier Old Men,3.0
3,Bad Poems,1.0
4,A Girl Thing,4.5
5,Women of Devil's Island,2.0


Add movieId to input user

With the input complete, let's extract the input movies' IDs from the movies dataframe and add them to it.

We can achieve this by first selecting the rows that contain the input movies' titles and then merging this subset with the input dataframe. We also drop unnecessary columns from the input to save memory.

In [36]:
#Filtering out the movies by title
inputId = df1[df1['title'].isin(inputMovies['title'].tolist())]
#Then merging it so we can get the movieId. It's implicitly merging it by title.
inputMovies = pd.merge(inputId, inputMovies)
#Dropping information we won't use from the input dataframe
inputMovies = inputMovies.drop(columns='year')
#Final input dataframe
#If a movie you added in above isn't here, then it might not be in the original 
#dataframe or it might be spelled differently, please check capitalisation.
inputMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,5.0
1,2,Jumanji,3.5
2,3,Grumpier Old Men,3.0
3,209163,Bad Poems,1.0
4,209169,A Girl Thing,4.5
5,209171,Women of Devil's Island,2.0


Users who have seen the same movies

With the movie IDs in our input, we can now get the subset of users that have watched and rated the movies in our input.


In [37]:
#Filtering out users that have watched movies that the input has watched and storing it
userSubset = df2[df2['movieId'].isin(inputMovies['movieId'].tolist())]
userSubset.head()

Unnamed: 0,userId,movieId,rating
70,2,1,3.5
254,3,1,4.0
910,4,1,3.0
1152,5,1,4.0
1304,8,1,4.0


We now group up the rows by user ID

In [38]:
#Groupby creates several sub dataframes where they all have the same value in the column specified as the parameter
#Passing the column name as a string (not a list) keeps the group keys as plain user IDs on pandas 2.0+
userSubsetGroup = userSubset.groupby('userId')

Let's look at one of the users, e.g. the one with userId=1304

In [39]:
userSubsetGroup.get_group(1304)

Unnamed: 0,userId,movieId,rating
183426,1304,1,5.0


Let's also sort these groups so that the users who share the most movies with the input come first. Since we won't go through every single user, this keeps the ones whose ratings overlap most with the input.

In [40]:
#Sorting so that users with the most movies in common with the input come first
userSubsetGroup = sorted(userSubsetGroup,  key=lambda x: len(x[1]), reverse=True)

In [41]:
#Keep only the first 5 users (those sharing the most movies with the input)
userSubsetGroup = userSubsetGroup[0:5]
userSubsetGroup

[(12,       userId  movieId  rating
  1714      12        1     4.0
  1715      12        2     2.0
  1716      12        3     2.0), (125,        userId  movieId  rating
  16018     125        1     4.0
  16019     125        2     2.0
  16020     125        3     4.0), (187,        userId  movieId  rating
  23893     187        1     3.5
  23894     187        2     3.5
  23895     187        3     3.0), (226,        userId  movieId  rating
  29312     226        1     3.0
  29313     226        2     2.0
  29314     226        3     2.5), (230,        userId  movieId  rating
  30451     230        1     5.0
  30452     230        2     4.0
  30453     230        3     3.0)]
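
Sorting the whole list of groups works, but it builds one small dataframe per user first. As a hedged sketch of another way to pick the users with the most overlap (the `user_subset` sample and the cut-off of 2 are made up for illustration), `value_counts` on the userId column gives the number of shared movies per user directly, already sorted from most to fewest:

```python
import pandas as pd

# Hypothetical slice of the ratings table, already filtered to the input's movies
user_subset = pd.DataFrame({
    "userId":  [2, 12, 12, 12, 125, 125],
    "movieId": [1, 1, 2, 3, 1, 2],
    "rating":  [3.5, 4.0, 2.0, 2.0, 4.0, 2.0],
})

# Number of input movies each user has rated, largest first
overlap = user_subset["userId"].value_counts()

# Keep the ids of the 2 users with the most movies in common
top_ids = overlap.head(2).index

# Restrict the ratings to those users only
top_ratings = user_subset[user_subset["userId"].isin(top_ids)]
print(top_ratings)
```

The result is a single dataframe rather than a list of (userId, group) pairs, so the correlation loop would need a `groupby('userId')` on `top_ratings` to iterate over it in the same way.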

#Find Similar Users
Similarity of users to the input user using the Pearson Correlation

In [42]:
#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in userSubsetGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    inputMovies = inputMovies.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    temp_df = inputMovies[inputMovies['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = temp_df['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0

In [43]:
pearsonCorrelationDict.items()

dict_items([(12, 0.9707253433941508), (125, 0.27735009811261385), (187, 0.6933752452815377), (226, 0.7205766921228924), (230, 0.9607689228305233)])
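
As a sanity check, the same formula can be run by hand on user 12, who rated movies 1, 2 and 3 as 4.0, 2.0 and 2.0, against the input user's 5.0, 3.5 and 3.0. The sketch below wraps the Sxx, Syy and Sxy terms from the loop in a helper (the function name `pearson` is ours, not part of the notebook) and compares the result with NumPy's `corrcoef`:

```python
from math import sqrt

import numpy as np


def pearson(x, y):
    """Pearson correlation of two equal-length rating lists, 0 if undefined."""
    n = len(x)
    sxx = sum(i ** 2 for i in x) - sum(x) ** 2 / n
    syy = sum(j ** 2 for j in y) - sum(y) ** 2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    # A user who gives every movie the same rating has zero variance
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


input_ratings = [5.0, 3.5, 3.0]  # input user's ratings for movies 1, 2, 3
user_12 = [4.0, 2.0, 2.0]        # user 12's ratings for the same movies

r = pearson(input_ratings, user_12)
r_numpy = np.corrcoef(input_ratings, user_12)[0, 1]
print(r, r_numpy)
```

Both values agree with the 0.9707 stored for user 12 above. Note that each of these correlations rests on only three shared movies, so the coefficients are quite sensitive to a single rating.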

In [44]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.970725,12
1,0.27735,125
2,0.693375,187
3,0.720577,226
4,0.960769,230


The top x users most similar to the input user

Now let's get up to 50 of the users most similar to the input. Since only 5 users were kept above, the slice returns all 5, sorted by similarity.


In [45]:
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:50]
topUsers.head()

Unnamed: 0,similarityIndex,userId
0,0.970725,12
4,0.960769,230
3,0.720577,226
2,0.693375,187
1,0.27735,125


Now, let's start recommending movies to the input user.

Ratings of the selected users for all movies

We're going to do this by taking the weighted average of the ratings of the movies, using the Pearson Correlation as the weight. But to do this, we first need to get the movies watched by the users in our pearsonDF from the ratings dataframe and then store their correlation in a new column called "similarityIndex". This is achieved below by merging these two tables.


#Recommendations

In [46]:
topUsersRating=topUsers.merge(df2, left_on='userId', right_on='userId', how='inner')
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating
0,0.970725,12,1,4.0
1,0.970725,12,2,2.0
2,0.970725,12,3,2.0
3,0.970725,12,7,3.0
4,0.970725,12,10,3.0


Now all we need to do is multiply each movie rating by its weight (the similarity index), then sum up the weighted ratings and divide by the sum of the weights.

We can do this by multiplying two columns, then grouping the dataframe by movieId, and then dividing two columns.

The result combines the opinions of all similar users into one score per candidate movie for the input user:


In [47]:
#Multiplies the similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,0.970725,12,1,4.0,3.882901
1,0.970725,12,2,2.0,1.941451
2,0.970725,12,3,2.0,1.941451
3,0.970725,12,7,3.0,2.912176
4,0.970725,12,10,3.0,2.912176


In [48]:
#Applies a sum to the topUsersRating rows after grouping them by movieId
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,3.622796,14.38469
2,3.622796,10.207193
3,3.622796,9.814725
4,0.960769,2.882307
5,0.960769,2.882307


In [49]:
#Creates an empty dataframe
recommendation_df = pd.DataFrame()
#Now we take the weighted average
recommendation_df['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
recommendation_df['movieId'] = tempTopUsersRating.index
recommendation_df.head()

movieId,weighted average recommendation score,movieId
1,3.970604,1
2,2.81749,2
3,2.709157,3
4,3.0,4
5,3.0,5
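
The three cells above boil down to one weighted average per movie. Here is a compact sketch of the same steps on made-up numbers (the user ids, ratings and similarity values below are assumptions for illustration):

```python
import pandas as pd

# Hypothetical ratings from two similar users, with their similarity to the input
ratings = pd.DataFrame({
    "userId":          [12, 12, 230, 230],
    "movieId":         [1, 2, 1, 2],
    "rating":          [4.0, 2.0, 5.0, 4.0],
    "similarityIndex": [0.9, 0.9, 0.6, 0.6],
})

# Weight each rating by how similar its author is to the input user
ratings["weightedRating"] = ratings["similarityIndex"] * ratings["rating"]

# Per movie: total weight and total weighted rating
sums = ratings.groupby("movieId")[["similarityIndex", "weightedRating"]].sum()

# Weighted average = sum of weighted ratings / sum of weights
scores = sums["weightedRating"] / sums["similarityIndex"]
print(scores)
```

Movie 1 scores (0.9 * 4.0 + 0.6 * 5.0) / 1.5 = 4.4 and movie 2 scores (0.9 * 2.0 + 0.6 * 4.0) / 1.5 = 2.8. When only one similar user has rated a movie, the weight cancels out and the score is just that user's rating, which is what happens for movies 4 and 5 in the table above.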


In [50]:
recommendation_df = recommendation_df.sort_values(by='weighted average recommendation score', ascending=False)
recommendation_df.head(10)

movieId,weighted average recommendation score,movieId
176601,5.0,176601
54190,5.0,54190
51935,5.0,51935
1243,5.0,1243
48997,5.0,48997
86377,5.0,86377
1276,5.0,1276
46347,5.0,46347
1298,5.0,1298
88129,5.0,88129


In [51]:
df1.loc[df1['movieId'].isin(recommendation_df.head(20)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
1210,1243,Rosencrantz and Guildenstern Are Dead,1990.0
1243,1276,Cool Hand Luke,1967.0
1264,1298,Pink Floyd: The Wall,1982.0
8067,8781,"Manchurian Candidate, The",2004.0
9411,27904,"Scanner Darkly, A",2006.0
9604,31696,Constantine,2005.0
9945,33437,Unleashed (Danny the Dog),2005.0
10257,37830,Final Fantasy VII: Advent Children,2004.0
10263,37857,MirrorMask,2005.0
10679,44191,V for Vendetta,2006.0


The table above shows movies from the top 20 that the algorithm recommended!
