# Basic Recommender System

This basic recommender system offers generalized recommendations to every user based on the popularity and ratings of the movies. It does not offer personalised recommendations and is purely based on a chart of top movies.

In [1]:
#Import libraries
import pandas as pd
import numpy as np
import ast
import nltk
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from surprise import accuracy, Dataset, SVD, Reader, NMF, KNNBasic
from surprise.model_selection import train_test_split, KFold, GridSearchCV

#Suppress Warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
#load data
metadata = pd.read_csv('../Data/movies_metadata_clean.csv')

We will use the IMDB weighted rating formula to come up with our recommendations. Mathematically, it is represented by:<br>

$$WR = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where,<br>

v is the number of votes for the movie <br>
m is the minimum number of votes required to be listed in the chart<br>
R is the average rating of the movie<br>
C is the mean vote across the whole report (here, the whole dataset)<br>

IMDB uses this formula to come up with their top 250 movies chart. ([Source](https://www.quora.com/How-does-IMDbs-rating-system-work))

In [3]:
#here we calculate the mean vote
vote_averages = metadata[metadata['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.244878649855876

In [4]:
#here we set the minimum number of votes required to the 90th percentile of vote counts
vote_counts = metadata[metadata['vote_count'].notnull()]['vote_count'].astype('int')
m = vote_counts.quantile(0.90)
m

160.0

In [5]:
#set the requirement of vote count to be more than min vote and what columns we want to see in our top chart
top_chart = metadata[(metadata['vote_count'] >= m) & (metadata['vote_count'].notnull()) & (metadata['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
top_chart['vote_count'] = top_chart['vote_count'].astype('int')
top_chart['vote_average'] = top_chart['vote_average'].astype('int')
top_chart.shape

(4553, 6)

In [6]:
#reset index
top_chart.reset_index(inplace=True, drop=True)

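A movie needs at least 160 votes on TMDB (the 90th percentile of vote counts) to qualify for our top movies chart. The mean rating of 5.24/10 is not a threshold; it is the value C that the formula pulls low-vote movies towards. In this case, 4553 movies qualify.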
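To make the formula concrete, here is a small sketch that recomputes the weighted rating by hand. The values of m and C are the ones printed above; the vote counts and (integer-cast) averages for Inception and Dilwale Dulhania Le Jayenge are taken from the top-10 chart shown further down. The function name `weighted_rating_demo` and the low-vote example are illustrative assumptions, not part of the notebook.

```python
def weighted_rating_demo(v, R, m, C):
    """IMDB-style weighted rating: a blend of the movie's own rating R
    and the global mean C, weighted by how many votes the movie has."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 160.0                 # 90th percentile of vote counts (printed above)
C = 5.244878649855876     # mean of the integer-cast vote averages (printed above)

# Values as they appear in the notebook's top chart
wr_inception = weighted_rating_demo(v=14075, R=8, m=m, C=C)
wr_ddlj = weighted_rating_demo(v=661, R=9, m=m, C=C)

# Hypothetical movie with very few votes: its high rating is pulled towards C
wr_few_votes = weighted_rating_demo(v=10, R=9, m=m, C=C)

print(round(wr_inception, 6), round(wr_ddlj, 6), round(wr_few_votes, 6))
```

The first two results should agree with the `wr` column of the chart, while the hypothetical 10-vote movie ends up much closer to C than to its own rating of 9. This is why the formula favours movies whose high ratings are backed by many votes.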

In [7]:
#function to calculate the weighted rating
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [8]:
#apply it to new column
top_chart['wr'] = top_chart.apply(weighted_rating, axis=1)

In [9]:
#sort movies according to descending wr value and only selecting the top 250 movies
top_chart = top_chart.sort_values('wr', ascending=False).head(250)

In [10]:
#view top 10
top_chart.head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
1965,Dilwale Dulhania Le Jayenge,1995,661,9,34.457024,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",8.268186
2816,Inception,2010,14075,8,29.108149,"[{'id': 28, 'name': 'Action'}, {'id': 53, 'nam...",7.969033
2397,The Dark Knight,2008,12269,8,123.167259,"[{'id': 18, 'name': 'Drama'}, {'id': 28, 'name...",7.964533
3637,Interstellar,2014,11187,8,32.213481,"[{'id': 12, 'name': 'Adventure'}, {'id': 18, '...",7.961151
916,Fight Club,1999,9678,8,63.869599,"[{'id': 18, 'name': 'Drama'}]",7.955192
1338,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",7.951301
91,Pulp Fiction,1994,8670,8,140.950236,"[{'id': 53, 'name': 'Thriller'}, {'id': 80, 'n...",7.950077
101,The Shawshank Redemption,1994,8358,8,51.645403,"[{'id': 18, 'name': 'Drama'}, {'id': 80, 'name...",7.948248
1654,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",7.947434
115,Forrest Gump,1994,8147,8,48.307194,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",7.946934


This chart is based on the weighted rating (wr) alone. We can create another chart of the top movies within each genre.

In [11]:
#convert genres from list of dictionary to a list of strings
metadata['genres'] = metadata['genres'].fillna('[]').apply(ast.literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [12]:
#create multiple entries for each genre of the movie as one row

#create a series based on genre, reset index level 1 to ensure multiple entries belongs to same record.
genre = metadata.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
#give name to the column
genre.name = 'genre'
#drop the genres column in metadata and join the genre series to the metadata df
genre_metadata = metadata.drop('genres', axis=1).join(genre)
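As an aside, the same one-row-per-genre reshaping can probably be written more compactly with `DataFrame.explode` (available in pandas 0.25 and later). The sketch below uses a made-up three-movie frame called `toy`, not the real metadata, so the titles and genres are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the metadata df: 'genres' already holds lists of strings
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "genres": [["Comedy", "Drama"], ["Action"], []],
})

# explode() repeats each row once per list element and keeps the original index;
# an empty list becomes a single row with NaN, like the left join in the cell above
exploded = toy.explode("genres").rename(columns={"genres": "genre"})
print(exploded)
```

Rows that share an index value belong to the same movie, which is the same convention the `genre` series above relies on.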

In [13]:
#take a look at the genre series, multiple entries for one record in same index #
genre

0        Animation
0           Comedy
0           Family
1        Adventure
1          Fantasy
           ...    
45445       Family
45446        Drama
45447       Action
45447        Drama
45447     Thriller
Name: genre, Length: 91062, dtype: object

In [14]:
#check for null values
genre.isnull().sum()

0

In [15]:
#take a look at the genre_metadata df
genre_metadata

Unnamed: 0,id,overview,tagline,title,popularity,release_date,vote_average,vote_count,year,genre
0,862,"Led by Woody, Andy's toys live happily in his ...",,Toy Story,21.946943,1995-10-30,7.7,5415.0,1995,Animation
0,862,"Led by Woody, Andy's toys live happily in his ...",,Toy Story,21.946943,1995-10-30,7.7,5415.0,1995,Comedy
0,862,"Led by Woody, Andy's toys live happily in his ...",,Toy Story,21.946943,1995-10-30,7.7,5415.0,1995,Family
1,8844,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,Jumanji,17.015539,1995-12-15,6.9,2413.0,1995,Adventure
1,8844,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,Jumanji,17.015539,1995-12-15,6.9,2413.0,1995,Fantasy
...,...,...,...,...,...,...,...,...,...,...
45447,67758,"When one of her hits goes wrong, a professiona...",A deadly game of wits.,Betrayal,0.903007,2003-08-01,3.8,6.0,2003,Action
45447,67758,"When one of her hits goes wrong, a professiona...",A deadly game of wits.,Betrayal,0.903007,2003-08-01,3.8,6.0,2003,Drama
45447,67758,"When one of her hits goes wrong, a professiona...",A deadly game of wits.,Betrayal,0.903007,2003-08-01,3.8,6.0,2003,Thriller
45448,227506,"In a small town live two brothers, one a minis...",,Satan Triumphant,0.003503,1917-10-21,0.0,0.0,1917,


In [16]:
#build function on top genre chart and keep percentile at 0.90

def top_genre_chart(genre, percentile=0.90):
    df = genre_metadata[genre_metadata['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    top_genre_chart = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    top_genre_chart['vote_count'] = top_genre_chart['vote_count'].astype('int')
    top_genre_chart['vote_average'] = top_genre_chart['vote_average'].astype('int')
    
    top_genre_chart['wr'] = top_genre_chart.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    top_genre_chart = top_genre_chart.sort_values('wr', ascending=False).head(250)
    
    return top_genre_chart

In [17]:
top_genre_chart('Action').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
15476,Inception,2010,14075,8,29.108149,7.904507
12478,The Dark Knight,2008,12269,8,123.167259,7.890993
4862,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,7.851764
6999,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,7.840439
5813,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,7.828962
256,Star Wars,1977,6778,8,42.149697,7.808658
1154,The Empire Strikes Back,1980,5998,8,19.470959,7.78566
4134,Scarface,1983,3017,8,11.299673,7.603562
9427,Oldboy,2003,2000,8,10.616859,7.441761
1909,Seven Samurai,1954,892,8,15.01777,6.994782


We created another chart of the top movies within a genre, ranked by their wr. This is the most basic way of building a recommender system. However, these recommendations are not personalised according to user preference. We will look at content-based filtering and collaborative filtering next.

# Content Based Filtering

This filtering recommends movies to a user based on their properties: a movie is suggested if its properties are similar to those of a movie the user likes.

In [18]:
#we have already loaded the movies metadata file as metadata
#recall that we have converted the genres column into a list in the metadata df

#load keywords data
keywords = pd.read_csv("../Data/keywords_clean.csv")

#load credits data
credits = pd.read_csv("../Data/credits_clean.csv")

In [19]:
#finding 80th percentile of votes to be 50

metadata['vote_count'].quantile(0.80)

50.0

In [20]:
#we will select movies with at least 50 votes (subset of full dataset)

metadata_cb = metadata[metadata['vote_count']>=50]

For the metadata dataset, we will be using title, overview, tagline and genres columns.

In [21]:
metadata_cb = metadata_cb[['id','title','overview','tagline', 'genres']]

In [22]:
metadata_cb

Unnamed: 0,id,title,overview,tagline,genres
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",,"[Animation, Comedy, Family]"
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,"[Adventure, Fantasy, Family]"
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...,"[Romance, Comedy]"
4,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...,[Comedy]
5,949,Heat,"Obsessive master thief, Neil McCauley leads a ...",A Los Angeles Crime Saga,"[Action, Crime, Drama, Thriller]"
...,...,...,...,...,...
45323,430365,With Open Arms,Jean-Étienne Fougerole is an intellectual bohe...,Thanks for the invitation!,[Comedy]
45327,248705,The Visitors: Bastille Day,"Stuck in the corridors of time, Godefroy de Mo...",,[Comedy]
45332,44918,Titanic 2,On the 100th anniversary of the original voyag...,"100 years later, lightning strikes twice","[Action, Adventure, Thriller]"
45421,455661,In a Heartbeat,A closeted boy runs the risk of being outed by...,The Heart Wants What The Heart Wants,"[Family, Animation, Romance, Comedy]"


In [23]:
#create a duplicate title column as we will need it later 
metadata_cb['title_copy'] = metadata_cb['title'].copy()

In [24]:
#will be using both columns in our recommender for the keywords dataset
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [25]:
#will be using the id and cast column for the credits dataset
credits = credits[['id','cast']]
credits.head()

Unnamed: 0,id,cast
0,862,"[{'cast_id': 14, 'character': 'Woody (voice)',..."
1,8844,"[{'cast_id': 1, 'character': 'Alan Parrish', '..."
2,15602,"[{'cast_id': 2, 'character': 'Max Goldman', 'c..."
3,31357,"[{'cast_id': 1, 'character': ""Savannah 'Vannah..."
4,11862,"[{'cast_id': 1, 'character': 'George Banks', '..."


We will combine all 3 dfs into one df via the id column.

In [26]:
#check dtype of id
keywords['id'].dtypes

dtype('int64')

In [27]:
#check dtype of id
credits['id'].dtypes

dtype('int64')

In [28]:
#check dtype of id
metadata_cb['id'].dtypes

dtype('int64')

In [29]:
# Merge metadata and keywords df
df = pd.merge(metadata_cb, keywords, on='id', how='left')

# Reset the index
df.reset_index(inplace=True, drop=True)

In [30]:
# Merge with credits df
df = pd.merge(df, credits, on='id', how='left')

# Reset the index
df.reset_index(inplace=True, drop=True)

In [31]:
df.head()

Unnamed: 0,id,title,overview,tagline,genres,title_copy,keywords,cast
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",,"[Animation, Comedy, Family]",Toy Story,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...","[{'cast_id': 14, 'character': 'Woody (voice)',..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,"[Adventure, Fantasy, Family]",Jumanji,"[{'id': 10090, 'name': 'board game'}, {'id': 1...","[{'cast_id': 1, 'character': 'Alan Parrish', '..."
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...,"[Romance, Comedy]",Grumpier Old Men,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392...","[{'cast_id': 2, 'character': 'Max Goldman', 'c..."
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...,[Comedy],Father of the Bride Part II,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...","[{'cast_id': 1, 'character': 'George Banks', '..."
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ...",A Los Angeles Crime Saga,"[Action, Crime, Drama, Thriller]",Heat,"[{'id': 642, 'name': 'robbery'}, {'id': 703, '...","[{'cast_id': 25, 'character': 'Lt. Vincent Han..."


In [32]:
#genres is already in a list
#to remove spaces in between genre types(eg: sci fi to scifi) and make it a string
df['genres'] = df['genres'].apply(lambda x: ' '.join([i.replace(" ","") for i in x]))

df['genres']

0               Animation Comedy Family
1              Adventure Fantasy Family
2                        Romance Comedy
3                                Comedy
4           Action Crime Drama Thriller
                     ...               
9146                             Comedy
9147                             Comedy
9148          Action Adventure Thriller
9149    Family Animation Romance Comedy
9150                             Comedy
Name: genres, Length: 9151, dtype: object

In [33]:
#convert keywords from list of dictionary to a list of strings
df['keywords'] = df['keywords'].fillna('[]').apply(ast.literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [34]:
#remove empty spaces and remove spaces in each keyword
df['keywords'] = df['keywords'].apply(lambda x: ' '.join([i.replace(" ",'') for i in x]))

In [35]:
df['keywords']

0       jealousy toy boy friendship friends rivalry bo...
1       boardgame disappearance basedonchildren'sbook ...
2          fishing bestfriend duringcreditsstinger oldmen
3       baby midlifecrisis confidence aging daughter m...
4       robbery detective bank obsession chase shootin...
                              ...                        
9146                                                     
9147                  nazis castle timetravel robespierre
9148                                             suspense
9149                             love teenager lgbt short
9150                                       militaryschool
Name: keywords, Length: 9151, dtype: object

In [36]:
#convert cast from list of dictionary to a list of strings
df['cast'] = df['cast'].fillna('[]').apply(ast.literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [37]:
#remove the empty spaces and remove spaces in each cast name
df['cast'] = df['cast'].apply(lambda x: ' '.join([i.replace(" ",'') for i in x]))

In [38]:
df['cast']

0       TomHanks TimAllen DonRickles JimVarney Wallace...
1       RobinWilliams JonathanHyde KirstenDunst Bradle...
2       WalterMatthau JackLemmon Ann-Margret SophiaLor...
3       SteveMartin DianeKeaton MartinShort KimberlyWi...
4       AlPacino RobertDeNiro ValKilmer JonVoight TomS...
                              ...                        
9146    ChristianClavier AryAbittan ElsaZylberstein Cy...
9147    JeanReno ChristianClavier FranckDubosc KarinVi...
9148    ShaneVanDyke MarieWestbrook BruceDavison Brook...
9149                                                     
9150    HilaryDuff ChristyCarlsonRomano GaryCole Shawn...
Name: cast, Length: 9151, dtype: object

In [39]:
#fill null with ' '
df['tagline'] = df['tagline'].fillna(' ')

In [40]:
#no more null values
df['tagline'].isnull().sum()

0

In [41]:
#all columns are now ready to be merged
df

Unnamed: 0,id,title,overview,tagline,genres,title_copy,keywords,cast
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ...",,Animation Comedy Family,Toy Story,jealousy toy boy friendship friends rivalry bo...,TomHanks TimAllen DonRickles JimVarney Wallace...
1,8844,Jumanji,When siblings Judy and Peter discover an encha...,Roll the dice and unleash the excitement!,Adventure Fantasy Family,Jumanji,boardgame disappearance basedonchildren'sbook ...,RobinWilliams JonathanHyde KirstenDunst Bradle...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...,Still Yelling. Still Fighting. Still Ready for...,Romance Comedy,Grumpier Old Men,fishing bestfriend duringcreditsstinger oldmen,WalterMatthau JackLemmon Ann-Margret SophiaLor...
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...,Just When His World Is Back To Normal... He's ...,Comedy,Father of the Bride Part II,baby midlifecrisis confidence aging daughter m...,SteveMartin DianeKeaton MartinShort KimberlyWi...
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ...",A Los Angeles Crime Saga,Action Crime Drama Thriller,Heat,robbery detective bank obsession chase shootin...,AlPacino RobertDeNiro ValKilmer JonVoight TomS...
...,...,...,...,...,...,...,...,...
9146,430365,With Open Arms,Jean-Étienne Fougerole is an intellectual bohe...,Thanks for the invitation!,Comedy,With Open Arms,,ChristianClavier AryAbittan ElsaZylberstein Cy...
9147,248705,The Visitors: Bastille Day,"Stuck in the corridors of time, Godefroy de Mo...",,Comedy,The Visitors: Bastille Day,nazis castle timetravel robespierre,JeanReno ChristianClavier FranckDubosc KarinVi...
9148,44918,Titanic 2,On the 100th anniversary of the original voyag...,"100 years later, lightning strikes twice",Action Adventure Thriller,Titanic 2,suspense,ShaneVanDyke MarieWestbrook BruceDavison Brook...
9149,455661,In a Heartbeat,A closeted boy runs the risk of being outed by...,The Heart Wants What The Heart Wants,Family Animation Romance Comedy,In a Heartbeat,love teenager lgbt short,


In [42]:
#we shall now merge all columns together into one column called 'tags'
df['tags'] = df['overview'] + ' ' + df['tagline'] +  ' ' + df['genres'] +  ' ' + df['title_copy'] + ' ' + df['keywords'] + ' ' + df['cast']

In [43]:
#remove all other columns
df.drop(columns=['overview','tagline','genres','title_copy','keywords','cast'], inplace=True)

In [44]:
#to remove these null rows since tags column is empty

df.isnull().sum()

id        0
title     0
tags     36
dtype: int64

In [45]:
df.drop(df[df['tags'].isnull()].index, inplace=True)

In [46]:
#to remove duplicate rows as well

df.duplicated().sum()

9

In [47]:
df.drop_duplicates(inplace=True)

In [48]:
#reset index after all the dropping of null/duplicate rows
df.reset_index(inplace=True, drop=True)

In [49]:
df

Unnamed: 0,id,title,tags
0,862,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,8844,Jumanji,When siblings Judy and Peter discover an encha...
2,15602,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,11862,Father of the Bride Part II,Just when George Banks has recovered from his ...
4,949,Heat,"Obsessive master thief, Neil McCauley leads a ..."
...,...,...,...
9101,430365,With Open Arms,Jean-Étienne Fougerole is an intellectual bohe...
9102,248705,The Visitors: Bastille Day,"Stuck in the corridors of time, Godefroy de Mo..."
9103,44918,Titanic 2,On the 100th anniversary of the original voyag...
9104,455661,In a Heartbeat,A closeted boy runs the risk of being outed by...


We will now do some cleaning on the words in the tags column.

In [50]:
#drop spaces if there is two consecutive spaces 
#since empty list or null values were replaced with ' '

df['tags']= df['tags'].apply(lambda x: " ".join(x.split()))

In [51]:
#store character only if it is not a punctuation
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

df['tags']= df['tags'].apply(remove_punctuations)
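A possible alternative to the loop above is `str.translate`, which removes all punctuation characters in a single pass over the string. This is only a sketch: the name `remove_punctuations_fast` and the sample sentence are made up for illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuations_fast(text):
    """Return text with all characters in string.punctuation removed."""
    return text.translate(PUNCT_TABLE)

sample = "Led by Woody, Andy's toys live happily!"
cleaned = remove_punctuations_fast(sample)
print(cleaned)
```

Note that the apostrophe is also punctuation, so "Andy's" becomes "Andys", which is what we see in the cleaned tags later on.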

In [52]:
#lowercase the letters in the words

df['tags']= df['tags'].apply(lambda x: x.lower())

In [53]:
#create function to drop stopwords and lemmatize the remaining words
#then strip digits and non-ASCII characters

def clean_text(text):
    #define the lemmatizer and the English stopword list
    wn = nltk.WordNetLemmatizer()
    stopword = nltk.corpus.stopwords.words('english')

    # \W matches any non-word character (for ASCII text, equivalent to [^a-zA-Z0-9_]), which includes whitespace
    # the + treats a run of consecutive separators as a single split point
    tokens = re.split(r'\W+', text)
    
    # apply lemmatization and stopword exclusion within the same step
    text = " ".join([wn.lemmatize(word) for word in tokens if word not in stopword])
    
    #remove digits (a token such as '100th' becomes 'th')
    text = re.sub(r'\d+', "", text)
    
    #remove non-ASCII characters (accented letters such as 'é' are dropped, the rest of the word is kept)
    text = "".join([char for char in text if char.isascii()])
    
    return text

#apply function to tags column
df['tags']= df['tags'].apply(clean_text)

In [54]:
df

Unnamed: 0,id,title,tags
0,862,Toy Story,led woody andys toy live happily room andys bi...
1,8844,Jumanji,sibling judy peter discover enchanted board ga...
2,15602,Grumpier Old Men,family wedding reignites ancient feud nextdoor...
3,11862,Father of the Bride Part II,george bank recovered daughter wedding receive...
4,949,Heat,obsessive master thief neil mccauley lead topn...
...,...,...,...
9101,430365,With Open Arms,jeantienne fougerole intellectual bohemian rel...
9102,248705,The Visitors: Bastille Day,stuck corridor time godefroy de montmirail fai...
9103,44918,Titanic 2,th anniversary original voyage modern luxury l...
9104,455661,In a Heartbeat,closeted boy run risk outed heart pop chest ch...


In [55]:
#9106 rows in total
df.shape

(9106, 3)

We will now convert the words into vectors via TF-IDF so that the machine can understand the text inputs.
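To give an idea of what TF-IDF computes, here is a minimal pure-Python sketch on three made-up documents. It assumes the defaults of scikit-learn's `TfidfVectorizer` (raw term counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalisation of each row); the toy documents and helper names are illustrative only.

```python
import math
from collections import Counter

docs = ["toy friendship rivalry", "boardgame jungle adventure", "toy adventure"]
tokenized = [d.split() for d in docs]

# Vocabulary: every distinct term, in a fixed (alphabetical) order
vocab = sorted({term for doc in tokenized for term in doc})
n_docs = len(docs)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in tokenized for term in set(doc))

# Smoothed inverse document frequency: rarer terms get a larger weight
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Term count times idf for each vocabulary term, scaled to unit length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

vectors = [tfidf_vector(doc) for doc in tokenized]
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

Terms that occur in only one document (such as "jungle") receive a higher idf than terms shared by several documents (such as "toy"), so distinctive words contribute more to the similarity between movies.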

In [56]:
# Initialize a tfidf object
#max_features=13000 keeps only the 13000 most frequent terms (roughly the top 10% of the vocabulary)
tfidf_vect = TfidfVectorizer(max_features=13000)

# Transform the data
tfidf = tfidf_vect.fit_transform(df['tags'].values)

#print shape
print(tfidf.shape)

#save to df
tfidf = pd.DataFrame(tfidf.toarray())
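#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement
tfidf.columns = tfidf_vect.get_feature_names_out()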
tfidf

(9106, 13000)


Unnamed: 0,aamirkhan,aaron,aaronabrams,aarondouglas,aaroneckhart,aaronhimelstein,aaronlustig,aaronpaul,aaronstanford,aarontaylorjohnson,...,zoekazan,zoesaldana,zoeydeutch,zokravitz,zombie,zombieapocalypse,zone,zoo,zooeydeschanel,zulayhenao
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9101,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9103,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9104,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We will use cosine similarity to make recommendations.
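As a reminder of what the similarity score means, the sketch below computes the cosine of the angle between two vectors by hand and ranks a few made-up movies against a query. The vectors and titles ("Movie A", "Movie B", "Movie C") are hypothetical; in the notebook the vectors are the TF-IDF rows.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical feature vectors for three movies
toy_vectors = {
    "Movie A": [1.0, 1.0, 0.0],
    "Movie B": [1.0, 0.5, 0.0],
    "Movie C": [0.0, 0.0, 1.0],
}

def most_similar(title):
    """Rank every other movie by its cosine similarity to the given title."""
    query = toy_vectors[title]
    scores = [(other, cosine(query, vec))
              for other, vec in toy_vectors.items() if other != title]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranking = most_similar("Movie A")
print(ranking)
```

"Movie B" shares features with "Movie A" and is ranked first, while "Movie C" shares none and scores 0. The score depends only on the direction of the vectors, not their length.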

In [57]:
similarity = cosine_similarity(tfidf)

#create function to recommend movies based on cosine similarity

def recommendation(movie_title):
    id_of_movie = df[df['title']==movie_title].index[0]
    distances = similarity[id_of_movie]
    movie_list = sorted(list(enumerate(distances)), reverse=True, key=lambda x:x[1])[1:10]
    
    for i in movie_list:
        print(df.iloc[i[0]].title, i[1])

During EDA, we noted that drama, comedy and thriller are the top 3 genres of movies. We will take a look at the recommendations for two movies: Minions (Comedy) and The Dark Knight (Thriller and Drama).

In [58]:
recommendation('Minions')

Mower Minions 0.4132663656955075
Minions: Orientation Day 0.3619024153480107
Banana 0.2726926284720872
Despicable Me 2 0.25126831109042813
Despicable Me 0.21450225603596204
Despicable Me 3 0.20373447635405717
Stuart Little 3: Call of the Wild 0.12831003995272408
One Hundred and One Dalmatians 0.1177490070710192
Stuart Little 2 0.11528787750371525


In [59]:
recommendation('The Dark Knight')

The Dark Knight Rises 0.30656164913885514
Batman Begins 0.2734739349154689
Batman Returns 0.2696719706306492
Batman Forever 0.25746940494643594
Batman: Under the Red Hood 0.2294065040485468
Batman: The Killing Joke 0.22844990347536065
Batman Beyond: Return of the Joker 0.2203864803427848
Batman: The Dark Knight Returns, Part 2 0.21794405127810568
Batman 0.2167996215908805


The recommendations are dominated by other Minions or Batman movies. Content-based filtering uses only the features of the movies and the cosine similarity between them, so it tends to suggest titles from the same franchise. However, users may want to watch movies outside the Minions or Batman series. We will next see how collaborative filtering can help to make better recommendations.

# Collaborative Filtering

Collaborative filtering can be further divided into model-based and memory-based approaches. The model-based approach fits a model of user-item interactions, in which user and item representations are learnt from the interaction matrix. The memory-based approach relies purely on similarities between users or items, computed from the observed interactions.

## Memory-Based

### User-Based

In the user-based recommendation method, we compute similarities between users, fetch the most similar users using the KNN algorithm, and recommend to a user the movies that those similar users liked.
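The idea can be sketched in a few lines of plain Python. The example below assumes a basic KNN scheme in which the predicted rating is the similarity-weighted average of the ratings given by the k most similar users who rated the movie, with cosine similarity computed over the movies both users have rated. The users, movies and ratings in `toy_ratings` are made up, and this is not the Surprise implementation itself.

```python
import math

# Hypothetical ratings: user -> {movie: rating}
toy_ratings = {
    "alice": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "bob":   {"m1": 4.0, "m2": 2.0, "m3": 5.0, "m4": 4.0},
    "carol": {"m1": 1.0, "m2": 5.0, "m4": 2.0},
}

def cosine_common(a, b):
    """Cosine similarity restricted to the movies both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict_user_based(user, item, k=2):
    """Similarity-weighted average over the k nearest users who rated the item."""
    neighbours = [
        (cosine_common(toy_ratings[user], their_ratings), their_ratings[item])
        for other, their_ratings in toy_ratings.items()
        if other != user and item in their_ratings
    ]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top_k = neighbours[:k]
    total_sim = sum(sim for sim, _ in top_k)
    if total_sim == 0:
        return None  # no usable neighbours
    return sum(sim * rating for sim, rating in top_k) / total_sim

estimate = predict_user_based("alice", "m4")
print(estimate)
```

Alice's tastes are closer to Bob's than to Carol's, so the estimate for "m4" lands between their two ratings but nearer to Bob's 4.0.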

In [60]:
#load ratings_small dataset
ratings = pd.read_csv("../Data/ratings_small.csv")

In [61]:
#take a look at the df
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [62]:
#we will select movies with at least 50 votes, 80th percentile of votes
metadata_cf = metadata[metadata['vote_count']>=50][['id','title']]

#to select IDs of movies with vote count more than 50 in metadata_cf
movie_ids = [int(x) for x in metadata_cf['id'].values]

#select ratings of movies with more than 50 vote counts
ratings = ratings[ratings['movieId'].isin(movie_ids)]

#to reset index
ratings.reset_index(inplace=True, drop=True)

#view first 5 rows
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1371,2.5,1260759135
1,1,2105,4.0,1260759139
2,1,2294,2.0,1260759108
3,2,17,5.0,835355681
4,2,62,3.0,835355749


In [63]:
#30581 records selected
ratings.shape

(30581, 4)

In [64]:
#Declaring the similarity options.
sim_options = {'name': 'cosine',
               'user_based': True}

#initialize a surprise reader object
reader = Reader(line_format='user item rating', sep=',', rating_scale=(0,5), skip_lines=1)

#load the data
data = Dataset.load_from_df(ratings[['userId','movieId','rating']], reader=reader)

#Retrieve the trainset.
trainset = data.build_full_trainset()

# Build a KNN algorithm, and train it to find similar users
sim_user = KNNBasic(sim_options=sim_options, verbose=False, random_state=42)
sim_user.fit(trainset)

<surprise.prediction_algorithms.knns.KNNBasic at 0x7fc31204dee0>

In [65]:
#create function to get recommendations by specifying user_id, top n movies and algo
#only movies the user has not rated yet are considered
#the algo is used to estimate the user's rating for each of those movies

def get_recommendations(user_id, top_n, algo):
    
    # creating an empty list to store (movie title, estimated rating) pairs
    recommendations = []
    
    # creating a user-movie interactions matrix (NaN where the user has not rated the movie)
    user_movie_interactions_matrix = ratings.pivot(index='userId', columns='movieId', values='rating')
    
    # extracting the movie ids which user_id has not rated yet
    non_interacted_movies = user_movie_interactions_matrix.loc[user_id][user_movie_interactions_matrix.loc[user_id].isnull()].index.tolist()
    
    # looping through each of the movie ids which user_id has not rated yet
    for item_id in non_interacted_movies:
        
        # predicting the rating this user would give to the movie
        est = algo.predict(user_id, item_id).est
        
        # looking up the movie title and appending it with the predicted rating
        movie_name = metadata_cf[metadata_cf['id']==item_id]['title'].values[0]
        recommendations.append((movie_name, est))

    # sorting the predicted ratings in descending order
    recommendations.sort(key=lambda x: x[1], reverse=True)

    return recommendations[:top_n]

In [66]:
get_recommendations(671,10,sim_user)

[('The Wizard', 5),
 ('Rio Bravo', 5),
 ('The Celebration', 5),
 ('Spider-Man 3', 5),
 ('A Streetcar Named Desire', 5),
 ('Gentlemen Prefer Blondes', 5),
 ('The Evil Dead', 5),
 ('JFK', 5),
 ("Singin' in the Rain", 5),
 ("Frank Herbert's Dune", 5)]

### Item-Based

In the item-based recommendation method, we compute similarities between items (movies), fetch the most similar movies using the KNN algorithm, and recommend to a user the movies that are similar to the ones they already like.

In [67]:
#Declaring the similarity options.
#set user_based as false
sim_options = {'name': 'cosine',
               'user_based': False}

# Build a KNN algorithm, and train it to find similar items
sim_item = KNNBasic(sim_options=sim_options, verbose=False, random_state=42)
sim_item.fit(trainset)

<surprise.prediction_algorithms.knns.KNNBasic at 0x7fc2fd436970>

In [68]:
get_recommendations(671,10,sim_item)

[('Hard Candy', 5),
 ('Visitor Q', 5),
 ('The Protector', 4.666666666666667),
 ('Shaun of the Dead', 4.571428571428571),
 ('The Silence of the Lambs', 4.503228000162119),
 ("Singin' in the Rain", 4.5),
 ("Hearts of Darkness: A Filmmaker's Apocalypse", 4.5),
 ('Sense and Sensibility', 4.5),
 ("The Hitchhiker's Guide to the Galaxy", 4.5),
 ('Daisies', 4.4375)]

We see here that the memory-based collaborative filtering method can estimate which movies the user will like, using the KNN algorithm with cosine similarity. The rating is estimated for each movie the user has not yet watched, and the movie is recommended if its estimated rating is high. However, the RMSE scores are quite high. We will now look at the model-based approach to see if other models achieve lower RMSE values.
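For reference, RMSE (root mean squared error) is the square root of the average squared difference between the true and the estimated ratings, so lower is better. A minimal sketch on three made-up (true, estimated) pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical true and estimated ratings
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 5.0)]
score = rmse(pairs)
print(round(score, 4))
```

Because the errors are squared before averaging, a single estimate that is off by one full star weighs four times as much as one that is off by half a star.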

## Model-Based

For the model-based recommender system, we will use the Surprise library with two matrix factorization methods: SVD and NMF (Non-negative Matrix Factorization).
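To show roughly what matrix factorization does, the sketch below learns a small vector of latent factors for each user and each movie with stochastic gradient descent, so that the global mean plus their dot product approximates the observed ratings. It is a simplified illustration on made-up data: it leaves out the user and item bias terms, and the learning rate, regularisation and number of epochs are arbitrary assumptions rather than the library's settings.

```python
import math
import random

random.seed(0)

# Hypothetical (user, movie, rating) triples
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items, n_factors = 3, 3, 2
mu = sum(r for _, _, r in toy) / len(toy)   # global mean rating

# Small random starting values for the user factors P and movie factors Q
P = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
Q = [[random.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

def predict(u, i):
    """Estimated rating: global mean plus the dot product of the factor vectors."""
    return mu + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def train_rmse():
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in toy) / len(toy))

rmse_before = train_rmse()

lr, reg = 0.02, 0.02   # learning rate and L2 regularisation strength
for _ in range(500):
    for u, i, r in toy:
        err = r - predict(u, i)
        for f in range(n_factors):
            pf, qf = P[u][f], Q[i][f]
            # gradient step on the regularised squared error
            P[u][f] += lr * (err * qf - reg * pf)
            Q[i][f] += lr * (err * pf - reg * qf)

rmse_after = train_rmse()
print(round(rmse_before, 4), round(rmse_after, 4))
```

Once the factors are learnt, `predict` can be called for any user-movie pair, including pairs with no observed rating, which is what makes this approach usable for recommendations.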

### SVD

In [69]:
#build trainset and testset object using train_test_split
trainset, testset = train_test_split(data, test_size=0.25, random_state=42)

# Initialize model for svd
svd = SVD(random_state=42)

# Train the algorithm on the trainset, and predict ratings for the testset
svd.fit(trainset)
predictions = svd.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.8979


0.8978931665639476

With 5-fold cross-validation, the RMSE scores are fairly similar across the folds.

In [70]:
# define a cross-validation iterator
kf = KFold(n_splits=5, random_state=42)

for trainset, testset in kf.split(data):

    # train and test algorithm.
    svd.fit(trainset)
    predictions = svd.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions)

RMSE: 0.8961
RMSE: 0.8882
RMSE: 0.8994
RMSE: 0.8910
RMSE: 0.9004


Using GridSearchCV to tune the hyperparameters improves the RMSE by about 0.02, to roughly 0.876.

In [71]:
param_grid = {"n_epochs": [50, 80], "lr_all": [0.010, 0.012], "reg_all": [0.12, 0.14]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=5)

gs.fit(data)

# best RMSE score
print(gs.best_score["rmse"])

# combination of parameters that gave the best RMSE score
print(gs.best_params["rmse"])

0.8760241357445813
{'n_epochs': 80, 'lr_all': 0.012, 'reg_all': 0.12}
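
A grid search tries every combination of the listed values and cross-validates each one. The sketch below only enumerates the candidates for the grid used above, to show how much work the search involves. It does not train any model.

```python
from itertools import product

param_grid = {"n_epochs": [50, 80], "lr_all": [0.010, 0.012], "reg_all": [0.12, 0.14]}

# Every combination of the listed values is one candidate model
names = list(param_grid)
combos = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5                       # cv=5 in the grid search above
n_fits = len(combos) * n_folds    # each candidate is fitted once per fold

print(len(combos), "candidates,", n_fits, "model fits")
for combo in combos:
    print(combo)
```

The best `n_epochs` and `lr_all` reported above both sit at the upper edge of their ranges, so a wider grid may be worth exploring.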


### NMF

In [72]:
# Initialize model for nmf
nmf = NMF(random_state=42)

# Train the algorithm on the trainset, and predict ratings for the testset
# (trainset and testset now hold the last fold from the KFold loop above)
nmf.fit(trainset)
predictions = nmf.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.9574


0.9573906031530087

With 5-fold cross-validation, the RMSE scores are again fairly similar across the folds.

In [73]:
# define a cross-validation iterator
kf = KFold(n_splits=5, random_state=42)

for trainset, testset in kf.split(data):

    # train and test algorithm.
    nmf.fit(trainset)
    predictions = nmf.test(testset)

    # Compute and print Root Mean Squared Error
    accuracy.rmse(predictions)

RMSE: 0.9449
RMSE: 0.9413
RMSE: 0.9459
RMSE: 0.9381
RMSE: 0.9574
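
The per-fold scores printed for the two models can be summarised by their mean and standard deviation. The sketch below copies the fold RMSEs reported above into two lists and uses only the standard library.

```python
from statistics import mean, stdev

# Fold RMSEs as printed by the two KFold loops above
svd_folds = [0.8961, 0.8882, 0.8994, 0.8910, 0.9004]
nmf_folds = [0.9449, 0.9413, 0.9459, 0.9381, 0.9574]

for name, folds in (("SVD", svd_folds), ("NMF", nmf_folds)):
    print(f"{name}: mean RMSE {mean(folds):.4f}, std {stdev(folds):.4f}")

# True when the worst SVD fold still beats the best NMF fold
svd_always_better = max(svd_folds) < min(nmf_folds)
print(svd_always_better)
```

The mean RMSE is about 0.895 for SVD and about 0.946 for NMF, and the worst SVD fold is still better than the best NMF fold.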


Comparing SVD and NMF, SVD has the lower RMSE score, so we will use SVD to generate our recommendations.

In [74]:
get_recommendations(671,10,svd)

[('Nell', 4.561367353331784),
 ('Flags of Our Fathers', 4.505289018375288),
 ('Fools Rush In', 4.485137634498578),
 ('Shriek If You Know What I Did Last Friday the Thirteenth',
  4.480267721603876),
 ('Sleepless in Seattle', 4.470329127881765),
 ('Galaxy Quest', 4.462592937627489),
 ('Cool Hand Luke', 4.461669090179023),
 ('Birdman of Alcatraz', 4.412927255554211),
 ('Edward Scissorhands', 4.408618345245424),
 ('The Thomas Crown Affair', 4.4085953121641435)]

### Deep Neural Network

We will use the fastai library to build our Deep Neural Network model.

*Refer to the Deep Neural Network.ipynb file for the code, as Google Colab was used to run it.

The best RMSE score for the DNN is 0.87, which matches the RMSE score of 0.87 for SVD. The DNN and SVD models therefore show the best RMSE scores for making predictions in our recommender system.

# Conclusion

In our project, we analyse basic, content-based and collaborative filtering movie recommender systems.

The basic movie recommender system uses a weighted rating formula to produce a chart of top movies, either overall or within each genre. However, this type of recommender system is not tailored to individual user preferences.

Content-based filtering uses the attributes of the movies a user is fond of to come up with recommendations for that user. In our project, we utilise the title, overview, tagline, genres, plot keywords and cast, and apply TF-IDF and cosine similarity to determine which movies users will like. However, it suffers from the cold-start problem, it does not take other users' preferences into account, and it is usually not obvious which characteristics of an item the user likes or dislikes.

In collaborative filtering, all users are taken into consideration, and people with similar tastes and preferences are used to suggest new and specific products to the primary user. In our project, we utilised memory-based and model-based methods. The model-based methods SVD and DNN performed best, with the lowest RMSE score of 0.87. However, collaborative filtering has issues with data sparsity when there is insufficient historical data, and with scalability, since performance degrades as the amount of data grows.

Our recommendation is to build a hybrid recommender system combining both content-based and collaborative filtering.