<h1>Movie Recommendation System</h1>

Goal: recommend items to a user based on information gathered about that user.
Two approaches are covered: Content-Based Filtering and Collaborative Filtering.

@author: Mariana R.Barros

<h2>Dataset</h2>  

MovieLens Dataset. 
It consists of 105339 ratings given to 10329 movies.
Files: movies.csv and ratings.csv.
https://drive.google.com/file/d/1Dn1BZD3YxgBQJSIjbfNnmCFlDW2jdQGD/view


In [1]:
#Libraries
import pandas as pd
from math import sqrt
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
#Reading the data
df_movies = pd.read_csv('movies.csv')
df_ratings = pd.read_csv('ratings.csv')

df_movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [3]:
df_ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,16,4.0,1217897793
1,1,24,1.5,1217895807
2,1,32,4.0,1217896246
3,1,47,4.0,1217896556
4,1,50,4.0,1217896523


<h5>Cleaning dataset</h5>

In [4]:
#Removing the timestamp from the df_ratings dataframe
df_ratings = df_ratings.drop(columns='timestamp')
df_ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,16,4.0
1,1,24,1.5
2,1,32,4.0
3,1,47,4.0
4,1,50,4.0


In [5]:
#Working on the df_movies dataframe
#Creating a "year" column and removing this information from the title column.
#Raw strings keep the regex escapes intact; regex=True is needed because recent pandas versions treat the pattern as a literal string by default.
df_movies['year'] = df_movies.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
df_movies['year'] = df_movies.year.str.extract(r'(\d\d\d\d)', expand=False)
df_movies['title'] = df_movies.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
df_movies['title'] = df_movies['title'].str.strip()


#Creating a list of genres from the genres column.
#The genres are separated by '|', so the column is split on that symbol.
df_movies['genres'] = df_movies.genres.str.split('|')
df_movies.head()

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995
2,3,Grumpier Old Men,"[Comedy, Romance]",1995
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995
4,5,Father of the Bride Part II,[Comedy],1995
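
The title/year split can be tried on a few hand-written rows before running it on the full file. This is a minimal sketch, assuming titles follow the `Title (YYYY)` pattern; the `sample` dataframe and its rows are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical sample rows; the last one has no year on purpose
sample = pd.DataFrame({'title': ['Toy Story (1995)', 'Jumanji (1995)', 'Untitled Movie']})

# Capture the four digits inside the parentheses; rows without a year become NaN
sample['year'] = sample['title'].str.extract(r'\((\d{4})\)', expand=False)
# Remove the "(YYYY)" part and trim the leftover whitespace
sample['title'] = sample['title'].str.replace(r'\(\d{4}\)', '', regex=True).str.strip()

print(sample)
```

Titles without a year keep their text and get a missing value in `year`, which is worth checking for before using the column.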


<h5> Constructing the movie genre matrix</h5>
Converting each movie's list of genres into a vector with one column per possible genre.

1: the movie has the genre;
0: the movie does not have the genre.

In [6]:
#Copying the movie dataframe into a new one, so the genre columns are added to the copy and df_movies stays unchanged.
df_moviegenre = df_movies.copy()

#For every row in the dataframe, iterate through the list of genres and place a 1 into the corresponding column
for index, row in df_movies.iterrows():
    for genre in row['genres']:
        df_moviegenre.at[index, genre] = 1
#Filling in the NaN values with 0 to show that a movie doesn't have that column's genre
df_moviegenre = df_moviegenre.fillna(0)
df_moviegenre.head()

Unnamed: 0,movieId,title,genres,year,Adventure,Animation,Children,Comedy,Fantasy,Romance,...,Horror,Mystery,Sci-Fi,IMAX,War,Musical,Documentary,Western,Film-Noir,(no genres listed)
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995,1.0,1.0,1.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,"[Adventure, Children, Fantasy]",1995,1.0,0.0,1.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,"[Comedy, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,4,Waiting to Exhale,"[Comedy, Drama, Romance]",1995,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,[Comedy],1995,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
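
The same 0/1 genre matrix can also be built without a row-by-row loop. The sketch below is one possible alternative, not the approach used in this notebook; the `toy` dataframe is a hypothetical three-row stand-in for `df_movies`. Note that the columns come out in alphabetical order rather than in order of first appearance.

```python
import pandas as pd

# Hypothetical stand-in for df_movies after the genres column has been split into lists
toy = pd.DataFrame({
    'movieId': [1, 2, 3],
    'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men'],
    'genres': [['Adventure', 'Animation', 'Children', 'Comedy', 'Fantasy'],
               ['Adventure', 'Children', 'Fantasy'],
               ['Comedy', 'Romance']],
})

# One row per (movie, genre) pair, keeping the original row index
exploded = toy['genres'].explode()
# One indicator column per genre, then collapse back to one row per movie
flags = pd.get_dummies(exploded).groupby(level=0).max().astype(int)
# Attach the indicator columns to the movie information
toy_genres = pd.concat([toy, flags], axis=1)

print(toy_genres.drop(columns='genres'))
```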


<h2>Content-Based recommendation system</h2>  

Discovers the user's favourite aspects of an item and recommends other items that share those aspects.


<h5> Preparing the data </h5>

In [7]:
#Creating a target user to recommend movies to, based on their ratings of movies they have watched.
TargetUserRatings = [
            {'title':'Breakfast Club, The', 'rating':5},
            {'title':'Toy Story', 'rating':3.5},
            {'title':'Jumanji', 'rating':2},
            {'title':"Pulp Fiction", 'rating':5},
            {'title':'Akira', 'rating':4.5}
         ] 
TargetUserMovies = pd.DataFrame(TargetUserRatings)
TargetUserMovies

Unnamed: 0,title,rating
0,"Breakfast Club, The",5.0
1,Toy Story,3.5
2,Jumanji,2.0
3,Pulp Fiction,5.0
4,Akira,4.5


In [8]:
#Adding the movieId to the TargetUserMovies dataframe

#Filtering movies by title
TargetUserRatings = df_movies[df_movies['title'].isin(TargetUserMovies['title'].tolist())]
#Merging TargetUserRatings and TargetUserMovies to obtain the movieId in TargetUserMovies.
TargetUserMovies = pd.merge(TargetUserRatings, TargetUserMovies)
#Dropping information that is not going to be used.
TargetUserMovies = TargetUserMovies.drop(columns=['genres', 'year'])
TargetUserMovies

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


In [9]:
#Filtering out the movies from the input
userMoviesgenres = df_moviegenre[df_moviegenre['movieId'].isin(TargetUserRatings['movieId'].tolist())]
#Resetting the index 
userMoviesgenres = userMoviesgenres.reset_index(drop=True)

#Dropping unnecessary columns
userMoviesgenres_ = userMoviesgenres.drop(columns=['movieId', 'title', 'genres', 'year'])
userMoviesgenres_

Unnamed: 0,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,War,Musical,Documentary,Western,Film-Noir,(no genres listed)
0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


<h5> Creating the weighted genre matrix </h5>

This matrix turns each genre into a weight.

It is built by multiplying the user's ratings (TargetUserMovies['rating']) by the user's genre table (userMoviesgenres_) and then summing the resulting table by column.
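
The multiply-and-sum step described above can be sketched on a small example. The three movies, ratings and genre flags below are taken from the target user's tables in this notebook, restricted to three genres; the variable names `ratings`, `genres` and `profile` are chosen here for illustration only.

```python
import pandas as pd

titles = ['Toy Story', 'Jumanji', 'Akira']
# Ratings the target user gave to three of the movies
ratings = pd.Series([3.5, 2.0, 4.5], index=titles)
# Genre flags of those movies, restricted to three genres
genres = pd.DataFrame(
    {'Adventure': [1, 1, 1], 'Animation': [1, 0, 1], 'Children': [1, 1, 0]},
    index=titles,
)

# Each genre weight is the sum of the ratings of the movies that have that genre
profile = genres.T.dot(ratings)
print(profile)
```

Adventure, for instance, collects 3.5 + 2.0 + 4.5 = 10.0, because all three movies carry that genre.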

In [10]:
TargetUserMovies['rating']

0    3.5
1    2.0
2    5.0
3    4.5
4    5.0
Name: rating, dtype: float64

In [11]:
#Dot product to get the weights
userProfile = userMoviesgenres_.transpose().dot(TargetUserMovies['rating'])
#The user profile: contains the weights for the user's preferences
userProfile

Adventure             10.0
Animation              8.0
Children               5.5
Comedy                13.5
Fantasy                5.5
Romance                0.0
Drama                 10.0
Action                 4.5
Crime                  5.0
Thriller               5.0
Horror                 0.0
Mystery                0.0
Sci-Fi                 4.5
IMAX                   0.0
War                    0.0
Musical                0.0
Documentary            0.0
Western                0.0
Film-Noir              0.0
(no genres listed)     0.0
dtype: float64

<h3>Recommending movies to the user based on their preferences:</h3>

In [12]:
#Constructing a genre table with the movies' genres from the original dataframe
genre = df_moviegenre.set_index(df_moviegenre['movieId'])
#Dropping unnecessary information
genre = genre.drop(columns=['movieId', 'title', 'genres', 'year'])
genre.head()

movieId,Adventure,Animation,Children,Comedy,Fantasy,Romance,Drama,Action,Crime,Thriller,Horror,Mystery,Sci-Fi,IMAX,War,Musical,Documentary,Western,Film-Noir,(no genres listed)
1,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
print(genre.shape)

(10329, 20)


Taking the weighted average of each movie's genres, using the user's profile as the weights:

In [14]:
#Multiplying the genres by the weights and then taking the weighted average
recommendation_df = ((genre*userProfile).sum(axis=1))/(userProfile.sum())
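
To see what this score means, here is a small sketch with two made-up movies. The movie ids 101 and 102 and the three-genre `catalog` are hypothetical; the weights in `profile` reuse three values from the user profile above.

```python
import pandas as pd

# Three genre weights taken from the user profile
profile = pd.Series({'Adventure': 10.0, 'Comedy': 13.5, 'Romance': 0.0})
# Hypothetical catalogue: 101 is an Adventure/Comedy, 102 is a Comedy/Romance
catalog = pd.DataFrame(
    {'Adventure': [1, 0], 'Comedy': [1, 1], 'Romance': [0, 1]},
    index=pd.Index([101, 102], name='movieId'),
)

# Sum the weights of the genres each movie has, then divide by the total weight
scores = (catalog * profile).sum(axis=1) / profile.sum()
print(scores)
```

A movie that covers every genre the user has weight on scores 1.0; movie 102 only picks up the Comedy weight, so it scores 13.5 / 23.5.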

<h6> Recommending the top 30 movies that most satisfy the target user: </h6>

In [15]:
#Sorting the recommendations in descending order
recommendation_df = recommendation_df.sort_values(ascending=False)
#The final recommendation table
df_movies.loc[df_movies['movieId'].isin(recommendation_df.head(30).keys())]

Unnamed: 0,movieId,title,genres,year
0,1,Toy Story,"[Adventure, Animation, Children, Comedy, Fantasy]",1995
489,546,Super Mario Bros.,"[Action, Adventure, Children, Comedy, Fantasy,...",1993
580,673,Space Jam,"[Adventure, Animation, Children, Comedy, Fanta...",1996
1480,1907,Mulan,"[Adventure, Animation, Children, Comedy, Drama...",1998
2388,2987,Who Framed Roger Rabbit?,"[Adventure, Animation, Children, Comedy, Crime...",1988
2496,3114,Toy Story 2,"[Adventure, Animation, Children, Comedy, Fantasy]",1999
3166,4016,"Emperor's New Groove, The","[Adventure, Animation, Children, Comedy, Fantasy]",2000
3696,4719,Osmosis Jones,"[Action, Animation, Comedy, Crime, Drama, Roma...",2001
3853,4956,"Stunt Man, The","[Action, Adventure, Comedy, Drama, Romance, Th...",1980
4304,5657,Flashback,"[Action, Adventure, Comedy, Crime, Drama]",1990
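
Filtering with `isin` returns the rows in the order of `df_movies`, not in order of score, so the ranking is lost in the table above. One possible way to keep it is to merge the sorted scores with the movie information. This is a sketch on made-up data; `movies`, `scores` and `ranked` are hypothetical names.

```python
import pandas as pd

# Hypothetical movie table and recommendation scores indexed by movieId
movies = pd.DataFrame({'movieId': [1, 2, 3], 'title': ['A', 'B', 'C']})
scores = pd.Series([0.2, 0.9, 0.5],
                   index=pd.Index([1, 2, 3], name='movieId'), name='score')

# Sort by score, keep the best two, and turn the index into a regular column
top = scores.sort_values(ascending=False).head(2).reset_index()
# A left merge keeps the row order of the left table, i.e. the ranking
ranked = top.merge(movies, on='movieId', how='left')
print(ranked)
```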


<h2>Collaborative Filtering</h2>

The variant used here is also known as User-User Filtering.
It uses other users with similar preferences to recommend items.

Here, we will use the __Pearson Correlation Function__ to find similar users.

In [16]:
#Dropping the genres column
df_movies2 = df_movies.drop(columns='genres')
df_movies2.head()

Unnamed: 0,movieId,title,year
0,1,Toy Story,1995
1,2,Jumanji,1995
2,3,Grumpier Old Men,1995
3,4,Waiting to Exhale,1995
4,5,Father of the Bride Part II,1995


In [17]:
#Adding the movieId to the TargetUserMovies2 dataframe

#Filtering movies by title
TargetUserRatings2 = df_movies2[df_movies2['title'].isin(TargetUserMovies['title'].tolist())]
#Merging TargetUserRatings2 and TargetUserMovies to obtain the movieId in TargetUserMovies2.
TargetUserMovies2 = pd.merge(TargetUserRatings2, TargetUserMovies)
#Dropping information that is not going to be used.
TargetUserMovies2 = TargetUserMovies2.drop(columns='year')
TargetUserMovies2

Unnamed: 0,movieId,title,rating
0,1,Toy Story,3.5
1,2,Jumanji,2.0
2,296,Pulp Fiction,5.0
3,1274,Akira,4.5
4,1968,"Breakfast Club, The",5.0


<h6>Users that saw the same movies</h6>

In [18]:
#Filtering users that watched the same movies as the target user 
Subset_users = df_ratings[df_ratings['movieId'].isin(TargetUserMovies2['movieId'].tolist())]
Subset_users.head()

Unnamed: 0,userId,movieId,rating
15,1,296,4.0
113,2,1,5.0
166,3,296,5.0
220,4,296,4.0
339,5,1,4.0


In [19]:
Subset_users.shape

(816, 3)

In [20]:
#Grouping by userId (a plain column name keeps the group keys as scalar user ids)
Subset_usersGroup = Subset_users.groupby('userId')

In [21]:
#Taking a look at a specific user: userID=50
Subset_usersGroup.get_group(50)

Unnamed: 0,userId,movieId,rating
4513,50,1,4.0
4524,50,296,2.0
4610,50,1968,2.0


In [22]:
#Sorting the groups so that users who share the most movies with the target user come first
Subset_usersGroup = sorted(Subset_usersGroup,  key=lambda x: len(x[1]), reverse=True)

In [23]:
#Looking at the first users
Subset_usersGroup[0:2]

[(62,
        userId  movieId  rating
  5535      62        1     2.0
  5536      62        2     1.5
  5604      62      296     5.0
  5760      62     1274     3.5
  5857      62     1968     1.0),
 (122,
         userId  movieId  rating
  14732     122        1     5.0
  14733     122        2     3.0
  14753     122      296     5.0
  14819     122     1274     5.0
  14859     122     1968     5.0)]

<h6> Similarity of users to input user</h6>

Comparing users to the target user in order to find the most similar ones.

The similarity of each user is measured with the Pearson Correlation Coefficient.

It varies from r = -1 to r = 1, where 1 means a perfect positive correlation and -1 a perfect negative correlation.
I.e., r = 1 means the two users have similar tastes and r = -1 means they have opposite tastes.
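
Written out in sums-of-squares form, with x the target user's ratings, y the other user's ratings and n the number of movies they have in common, the coefficient is:

$$
S_{xx} = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}, \qquad
S_{yy} = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}, \qquad
S_{xy} = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}
$$

$$
r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}
$$

When either S_xx or S_yy is zero (one of the users gave the same rating to every common movie), r is undefined; the code in this notebook stores 0 in that case.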

In [24]:
#Keeping only the first 100 users to avoid iterating through the entire set.
Subset_usersGroup = Subset_usersGroup[0:100]

In [25]:
#Calculating the Pearson Correlation between the target user and subset group.

#Store the Pearson Correlation in a dictionary, where the key is the user Id and the value is the coefficient
pearsonCorrelationDict = {}

#For every user group in our subset
for name, group in Subset_usersGroup:
    #Let's start by sorting the input and current user group so the values aren't mixed up later on
    group = group.sort_values(by='movieId')
    TargetUserMovies2 = TargetUserMovies2.sort_values(by='movieId')
    #Get the N for the formula
    nRatings = len(group)
    #Get the review scores for the movies that they both have in common
    df_temp = TargetUserMovies2[TargetUserMovies2['movieId'].isin(group['movieId'].tolist())]
    #And then store them in a temporary buffer variable in a list format to facilitate future calculations
    tempRatingList = df_temp['rating'].tolist()
    #Let's also put the current user group reviews in a list format
    tempGroupList = group['rating'].tolist()
    #Now let's calculate the pearson correlation between two users, so called, x and y
    Sxx = sum([i**2 for i in tempRatingList]) - pow(sum(tempRatingList),2)/float(nRatings)
    Syy = sum([i**2 for i in tempGroupList]) - pow(sum(tempGroupList),2)/float(nRatings)
    Sxy = sum( i*j for i, j in zip(tempRatingList, tempGroupList)) - sum(tempRatingList)*sum(tempGroupList)/float(nRatings)
    
    #If the denominator is different than zero, then divide, else, 0 correlation.
    if Sxx != 0 and Syy != 0:
        pearsonCorrelationDict[name] = Sxy/sqrt(Sxx*Syy)
    else:
        pearsonCorrelationDict[name] = 0
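
As a sanity check, the same formula can be applied to one user by hand and compared with NumPy. The sketch below uses the target user's ratings and the ratings of user 122 shown earlier, both ordered by movieId; the helper name `pearson` is introduced here for illustration and is not part of the notebook.

```python
from math import sqrt
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equally long rating lists (0.0 if undefined)."""
    n = len(x)
    sxx = sum(i**2 for i in x) - sum(x)**2 / n
    syy = sum(j**2 for j in y) - sum(y)**2 / n
    sxy = sum(i * j for i, j in zip(x, y)) - sum(x) * sum(y) / n
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

# Ratings for movieId 1, 2, 296, 1274, 1968
target = [3.5, 2.0, 5.0, 4.5, 5.0]
user_122 = [5.0, 3.0, 5.0, 5.0, 5.0]

r = pearson(target, user_122)
r_numpy = np.corrcoef(target, user_122)[0, 1]
print(r, r_numpy)
```

Both values should agree with the 0.877 reported for user 122 in the dictionary printed in the next cell.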


In [26]:
pearsonCorrelationDict.items()

dict_items([(62, 0.44965838938680786), (122, 0.8770580193070289), (224, 0.29417420270727607), (409, 0.8056292332943623), (451, 0.6564386345361464), (461, 0.11720180773462363), (567, 0.8320502943378437), (590, 0.6933752452815365), (607, 0.6020183016345586), (627, 0.7240929464269849), (38, 0.5222329678670935), (88, 0.0), (109, -0.47140452079103173), (128, 0.8866206949335731), (164, 0.4923659639173309), (176, 0.592156525463792), (177, 0.0), (192, -0.51425947722658), (220, 0.30151134457776363), (232, 0.8703882797784892), (250, 0.899228803025897), (310, 0.9438798074485389), (328, -0.7302967433402214), (354, 0.8181818181818182), (358, 0.6446583712203042), (387, 0.45454545454545453), (410, 0.8528028654224418), (413, 0.899228803025897), (493, 0.5222329678670935), (501, 0.4082482904638631), (555, 0.8017837257372732), (560, 0), (603, -0.3779644730092272), (628, 0.8703882797784892), (632, 0.2075143391598224), (668, 0.5222329678670935), (29, 0), (31, -0.8660254037844402), (32, 0.8660254037844402),

In [27]:
pearsonDF = pd.DataFrame.from_dict(pearsonCorrelationDict, orient='index')
pearsonDF.columns = ['similarityIndex']
pearsonDF['userId'] = pearsonDF.index
pearsonDF.index = range(len(pearsonDF))
pearsonDF.head()

Unnamed: 0,similarityIndex,userId
0,0.449658,62
1,0.877058,122
2,0.294174,224
3,0.805629,409
4,0.656439,451


<h6> The top similar users to input user </h6>


In [28]:
#Getting the 30 top users that are most similar to the target user.
topUsers=pearsonDF.sort_values(by='similarityIndex', ascending=False)[0:30]
topUsers.head()

Unnamed: 0,similarityIndex,userId
50,1.0,158
53,1.0,213
94,0.987829,615
55,0.987829,228
39,0.944911,44


<h6> Recommending movies to the target user</h6>


Recommendation: taking the weighted average of the movies' ratings, using the Pearson Correlation as the weight.
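
The weighted average can be sketched on a tiny made-up example: two similar users, two movies. All ids, similarities and ratings in `neighbours` are hypothetical; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical ratings from two similar users (user 1: similarity 1.0, user 2: 0.5)
neighbours = pd.DataFrame({
    'userId': [1, 1, 2],
    'similarityIndex': [1.0, 1.0, 0.5],
    'movieId': [10, 20, 10],
    'rating': [4.0, 3.0, 2.0],
})

# Weight each rating by the similarity of the user who gave it
neighbours['weightedRating'] = neighbours['similarityIndex'] * neighbours['rating']
# Per movie: total similarity and total weighted rating
sums = neighbours.groupby('movieId')[['similarityIndex', 'weightedRating']].sum()
# Weighted average = sum of weighted ratings / sum of similarities
score = sums['weightedRating'] / sums['similarityIndex']
print(score)
```

Movie 10 scores (4.0 + 1.0) / 1.5, about 3.33. Movie 20 was rated by a single user, so its score is simply that user's rating, 3.0. This may explain why movies rated by only one similar user can reach a score of 5.0.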

In [30]:
topUsersRating=topUsers.merge(df_ratings, left_on='userId', right_on='userId', how='inner')
#Multiplying similarity by the user's ratings
topUsersRating['weightedRating'] = topUsersRating['similarityIndex']*topUsersRating['rating']
topUsersRating.head()

Unnamed: 0,similarityIndex,userId,movieId,rating,weightedRating
0,1.0,158,1,4.5,4.5
1,1.0,158,2,4.0,4.0
2,1.0,158,6,4.5,4.5
3,1.0,158,10,4.0,4.0
4,1.0,158,32,4.0,4.0


In [31]:
#Grouping by movieId and summing the similarity and weighted rating columns
tempTopUsersRating = topUsersRating.groupby('movieId').sum()[['similarityIndex','weightedRating']]
tempTopUsersRating.columns = ['sum_similarityIndex','sum_weightedRating']
tempTopUsersRating.head()

movieId,sum_similarityIndex,sum_weightedRating
1,22.543606,83.623455
2,21.117584,64.678869
3,2.994871,9.522418
4,0.755929,2.267787
5,2.765254,7.896534


In [32]:
#Creates an empty dataframe
df_recommendation2 = pd.DataFrame()
#Now we take the weighted average
df_recommendation2['weighted average recommendation score'] = tempTopUsersRating['sum_weightedRating']/tempTopUsersRating['sum_similarityIndex']
df_recommendation2['movieId'] = tempTopUsersRating.index
df_recommendation2.head()

movieId,weighted average recommendation score,movieId
1,3.709409,1
2,3.062797,2
3,3.179576,3
4,3.0,4
5,2.855627,5


In [33]:
#Sorting to see the top 10 movies recommended
df_recommendation2 = df_recommendation2.sort_values(by='weighted average recommendation score', ascending=False)
df_recommendation2.head(10)

movieId,weighted average recommendation score,movieId
26131,5.0,26131
1946,5.0,1946
230,5.0,230
469,5.0,469
233,5.0,233
1925,5.0,1925
1927,5.0,1927
1942,5.0,1942
4970,5.0,4970
897,5.0,897


In [34]:
df_movies2.loc[df_movies2['movieId'].isin(df_recommendation2.head(10)['movieId'].tolist())]

Unnamed: 0,movieId,title,year
201,230,Dolores Claiborne,1995
204,233,Exotica,1994
416,469,"House of the Spirits, The",1993
718,897,For Whom the Bell Tolls,1943
1495,1925,Wings,1927
1496,1927,All Quiet on the Western Front,1930
1508,1942,All the King's Men,1949
1512,1946,Marty,1955
3864,4970,"Blue Angel, The (Blaue Engel, Der)",1930
6060,26131,"Battle of Algiers, The (La battaglia di Algeri)",1966
