# Movie Recommender System

This is the second part of my Springboard Capstone Project on Movie Data Analysis and Recommendation Systems. In my first notebook, I attempted to narrate the story of film by performing an extensive exploratory data analysis on movie metadata collected from TMDB. I also built two minimalist predictive models to predict movie revenue and movie success, and visualised which features influence each output.

In this notebook, I will attempt to implement a few recommendation algorithms (content based, popularity based and collaborative filtering) and try to build an ensemble of these models to come up with our final recommendation system. We have two MovieLens datasets to work with: a full dataset and a small one.

We will build our Simple Recommender using movies from the full dataset, whereas all personalised recommender systems will use the small dataset (because my computing power is very limited). As a first step, let us build our simple recommender system.

In [2]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD

import warnings; warnings.simplefilter('ignore')

## 1. Simple Recommender System

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user.

The implementation of this model is extremely trivial. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

In [5]:
md=pd.read_csv('../data/movies_metadata1.csv')

In [6]:
md.head()

Unnamed: 0,adult,backdrop_path,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,/lxD5ak7BOoinRNehOCA85CQ8ubr.jpg,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,...,1995-10-30,373554033,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.9,12976
1,False,/y7SQmjlB42VvYyRIFQXLQ4ZYrn.jpg,"{'id': 495527, 'name': 'Jumanji Collection', '...",65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",http://www.sonypictures.com/movies/jumanji/,8844,tt0113497,en,Jumanji,...,1995-12-15,262797249,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,7.2,7583
2,False,/1J4Z7VhdAgtdd97nCxY7dcBpjGT.jpg,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,...,1995-12-22,0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.6,212
3,False,/yibpm3qFap62p92GL2mP71cevS9.jpg,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,...,1995-12-22,81452156,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.2,78
4,False,/5rPY0WtseHhtSMZt8kxfgU2rsZp.jpg,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,...,1995-12-08,76578911,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,6.2,435


In [7]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I use the TMDB Ratings to come up with our Top Movies Chart. I will use IMDB's weighted rating formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,

- $v$ is the number of votes for the movie
- $m$ is the minimum votes required to be listed in the chart
- $R$ is the average rating of the movie
- $C$ is the mean vote across the whole report

The next step is to determine an appropriate value for $m$, the minimum votes required to be listed in the chart. We will use the 95th percentile as our cutoff. In other words, for a movie to feature in the charts, it must have more votes than at least 95% of the movies in the list.

I will build our overall Top 250 Chart and will define a function to build charts for a particular genre. Let's begin!

In [8]:
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.499821914259811

In [9]:
m = vote_counts.quantile(0.95)
m

824.0

In [10]:
# `x != np.nan` is always True, so missing dates are detected with pd.notnull instead
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [11]:
qualified = md[(md['vote_count'] >= m) & (md['vote_count'].notnull()) & (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(3091, 6)

Therefore, to qualify for the chart, a movie must have at least 824 votes on TMDB. We also see that the average rating for a movie on TMDB is about 5.5 on a scale of 10 (computed after casting the ratings to integers, which truncates them). In total, 3091 movies qualify for our chart.
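
To make the effect of the formula concrete, here is a minimal sketch of the weighted rating as a standalone function. The name `imdb_weighted_rating` and the sample vote counts are illustrative assumptions; `m = 824` and `C = 5.5` are rounded from the values computed above.

```python
def imdb_weighted_rating(v, R, m, C):
    """Blend a movie's own average R with the global mean C.

    v: number of votes for the movie
    m: minimum votes required to be listed in the chart
    """
    return (v / (v + m)) * R + (m / (v + m)) * C


m, C = 824, 5.5  # rounded from the values computed above

# A movie sitting exactly at the cutoff is pulled halfway towards C.
at_cutoff = imdb_weighted_rating(v=824, R=8, m=m, C=C)

# A movie with many more votes keeps a score close to its own average.
many_votes = imdb_weighted_rating(v=20000, R=8, m=m, C=C)

print(at_cutoff, round(many_votes, 3))
```

The fewer votes a movie has relative to `m`, the more its score shrinks towards the global mean, which is what keeps sparsely voted movies from topping the chart.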

In [12]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [13]:
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [14]:
qualified = qualified.sort_values('wr', ascending=False).head(250)

### Top Movies

In [16]:
qualified.head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
14909,Inception,2010,27268,8,41.479,"[Action, Science Fiction, Adventure]",7.926664
21059,Interstellar,2014,23833,8,65.838,"[Adventure, Drama, Science Fiction]",7.916448
12227,The Dark Knight,2008,23385,8,71.553,"[Drama, Action, Crime, Thriller]",7.914902
2908,Fight Club,1999,20162,8,46.674,[Drama],7.901832
24872,Avengers: Infinity War,2018,19819,8,186.028,"[Adventure, Action, Science Fiction]",7.900201
352,Pulp Fiction,1994,19623,8,54.985,"[Thriller, Crime]",7.899245
18932,Django Unchained,2012,19162,8,46.684,"[Drama, Western]",7.896921
411,Forrest Gump,1994,18956,8,42.077,"[Comedy, Drama, Romance]",7.895847
2523,The Matrix,1999,17814,8,49.458,"[Action, Science Fiction]",7.889465
4925,The Lord of the Rings: The Fellowship of the Ring,2001,17795,8,51.27,"[Adventure, Fantasy, Action]",7.889352


We see that three Christopher Nolan films, **Inception, Interstellar and The Dark Knight**, occur at the very top of our chart. The chart also indicates a strong bias of TMDB users towards particular genres and directors.

Let us now construct our function that builds charts for particular genres. For this, we will relax our default cutoff to the 85th percentile instead of the 95th.

In [17]:
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)

In [18]:
def build_chart(genre, percentile=0.85):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)
    
    return qualified

Let us see our method in action by displaying the top 10 Action and Romance movies (Romance almost didn't feature at all in our generic top chart despite being one of the most popular movie genres).

### Top Action Movies

In [20]:
build_chart('Action').head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
14909,Inception,2010,27268,8,41.479,7.959071
12227,The Dark Knight,2008,23385,8,71.553,7.952395
24872,Avengers: Infinity War,2018,19819,8,186.028,7.944009
2523,The Matrix,1999,17814,8,49.458,7.937853
4925,The Lord of the Rings: The Fellowship of the Ring,2001,17795,8,51.27,7.937788
7062,The Lord of the Rings: The Return of the King,2003,16318,8,54.239,7.932299
13247,Inglourious Basterds,2009,15448,8,26.686,7.928588
5875,The Lord of the Rings: The Two Towers,2002,15362,8,48.255,7.928199
24873,Avengers: Endgame,2019,15204,8,175.451,7.927473
317,Star Wars,1977,14373,8,54.567,7.9234


### Top Romance Movies

In [21]:
build_chart('Romance').head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
411,Forrest Gump,1994,18956,8,42.077,7.978308
7270,Eternal Sunshine of the Spotless Mind,2004,9996,8,25.54,7.959207
44123,Call Me by Your Name,2017,7943,8,53.529,7.948896
41708,Your Name.,2016,6406,8,99.039,7.936967
10419,Pride & Prejudice,2005,5030,8,32.841,7.920317
51728,"Love, Simon",2018,4469,8,31.054,7.910699
937,Vertigo,1958,3567,8,27.631,7.889197
57600,Five Feet Apart,2019,3379,8,47.498,7.883342
946,Casablanca,1942,3342,8,15.752,7.882116
954,Gone with the Wind,1939,2663,8,18.239,7.853952


## 2.Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top Movies chart, s/he probably wouldn't like most of the movies. Even if s/he were to go one step further and look at our charts by genre, s/he still wouldn't be getting the best recommendations.

To personalise our recommendations more, I am going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked. Since we will be using movie metadata (or content) to build this engine, this is also known as Content Based Filtering.

I will build two Content Based Recommenders based on:

- Movie Overviews and Taglines
- Movie Cast, Crew, Keywords and Genre

Also, as mentioned in the introduction, I will be using a subset of all the movies available to us because of the limited computing power available to me.



In [30]:
links=pd.read_csv('../data/links_small.csv')
links=links[links['tmdbId'].notnull()]['tmdbId'].astype('int')

In [31]:
md['id']=md['id'].astype('int')

In [32]:
smd = md[md['id'].isin(links)]
smd.shape

(9449, 26)

### 2.1 Movie Description Based Recommender
Let us first try to build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge our machine's performance so this will have to be done qualitatively.

In [33]:
smd['tagline'] = smd['tagline'].fillna('')
smd['description'] = smd['overview'] + smd['tagline']
smd['description'] = smd['description'].fillna('')

In [34]:
# min_df=1 keeps every term; an integer min_df must be at least 1 in recent scikit-learn releases
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [35]:
tfidf_matrix.shape

(9449, 263918)

**Cosine Similarity**
I will use cosine similarity to compute a numeric score that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \, \|y\|}$

Since the TF-IDF Vectorizer returns vectors normalised to unit length, the dot product directly gives us the cosine similarity score. Therefore, we will use sklearn's linear_kernel instead of cosine_similarity since it is much faster.
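
A minimal pure-Python sketch of this identity, assuming rows are scaled to unit length (which `TfidfVectorizer` does by default with `norm='l2'`). The vectors and helper names are made up for illustration and are not real TF-IDF rows.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def l2_normalise(x):
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]


# Made-up term weights for two documents (illustrative only).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 1.0, 1.0]

unit_a, unit_b = l2_normalise(doc_a), l2_normalise(doc_b)

# After normalisation the plain dot product is the cosine similarity.
print(cosine(doc_a, doc_b), dot(unit_a, unit_b))
```

Skipping the norm computation for every pair is where the speed-up comes from when the matrix is already normalised.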

In [36]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [37]:
cosine_sim[0]

array([1.        , 0.00757092, 0.        , ..., 0.01218673, 0.00471348,
       0.        ])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [38]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [39]:
def get_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:31]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.

In [44]:
get_recommendations('The Matrix').head(10)

6121                      Pulse
2455                  Supernova
1226    Speed 2: Cruise Control
7123                     Avatar
8132           The Zero Theorem
9259                Kid's Story
199                     Hackers
579       Hellraiser: Bloodline
8274                   Who Am I
5665              The Animatrix
Name: title, dtype: object

In [45]:
get_recommendations('The Dark Knight').head(10)

7637           The Dark Knight Rises
183                   Batman Forever
1113                  Batman Returns
565                           Batman
9170           The Lego Batman Movie
9075        Batman: The Killing Joke
2469    Batman: Mask of the Phantasm
2577                             JFK
5903                   Batman Begins
5317                 To End All Wars
Name: title, dtype: object

We see that for The Dark Knight, our system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. But unfortunately, that is all this system can do at the moment. This is not of much use to most people as it doesn't take into consideration very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie. Someone who liked The Dark Knight probably likes it more because of Nolan and would hate Batman Forever and every other substandard movie in the Batman franchise.

Therefore, we are going to use much more suggestive metadata than Overview and Tagline. In the next subsection, we will build a more sophisticated recommender that takes genre, keywords, cast and crew into consideration.

### 2.2 Metadata Based Recommender

In [54]:
credits = pd.read_csv('../data/credits_metadata1.csv')
keywords = pd.read_csv('../data/keywords_metadata1.csv')

In [55]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [57]:
md.shape

(61768, 26)

In [58]:
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [60]:
smd = md[md['id'].isin(links)]
smd.shape

(9533, 29)

We now have our cast, crew, genres and keywords, all in one dataframe. Let us wrangle this a little more using the following intuitions:

- **Crew**: From the crew, we will only pick the director as our feature since the others don't contribute that much to the feel of the movie.
- **Cast**: Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 3 actors that appear in the credits list.

In [61]:
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)
smd['cast_size'] = smd['cast'].apply(lambda x: len(x))
smd['crew_size'] = smd['crew'].apply(lambda x: len(x))

In [62]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [63]:
smd['director'] = smd['crew'].apply(get_director)

In [64]:
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smd['cast'] = smd['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [65]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

What I plan on doing is creating a metadata dump for every movie which consists of genres, director, main actors and keywords. I then use a Count Vectorizer to create our count matrix as we did in the Description Recommender. The remaining steps are similar to what we did earlier: we calculate the cosine similarities and return movies that are most similar.

These are the steps I follow in preparing my genres and credits data:

- Strip spaces and convert to lowercase for all our features, so that each name becomes a single token.
- Mention the director twice to give it more weight relative to the entire cast.
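
The two steps above can be sketched on a single record. The helper names `clean` and `build_soup` and the sample movie are hypothetical; the sketch leaves out keyword stemming and only illustrates the token handling.

```python
def clean(name):
    """Lowercase a name and remove spaces so it becomes a single token."""
    return name.replace(" ", "").lower()


def build_soup(keywords, cast, director, genres):
    """Join all metadata tokens into one string, listing the director twice."""
    tokens = [clean(k) for k in keywords]
    tokens += [clean(c) for c in cast[:3]]   # keep at most the top 3 actors
    tokens += [clean(director)] * 2          # repeated to add weight
    tokens += genres                         # genres are appended unchanged
    return " ".join(tokens)


# Hypothetical record, for illustration only.
soup = build_soup(
    keywords=["time travel", "robot"],
    cast=["Jane Doe", "John Smith", "Alex Kim", "Extra Actor"],
    director="Sam Lee",
    genres=["Action", "Science Fiction"],
)
print(soup)
```

Removing spaces means two people who share a first name no longer share a token, so only an exact match on the full name counts towards similarity.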

In [66]:
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [67]:
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x,x])

**Keywords**
We will do a small amount of pre-processing of our keywords before putting them to any use. As a first step, we calculate the frequency counts of every keyword that appears in the dataset.

In [68]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [69]:
s = s.value_counts()
s[:5]

based on novel or book    708
woman director            577
murder                    464
duringcreditsstinger      341
new york city             339
Name: keyword, dtype: int64

Keywords occur in frequencies ranging from 1 to 708. We do not have any use for keywords that occur only once. Therefore, these can be safely removed. Finally, we will convert every word to its stem so that words such as Dogs and Dog are considered the same.
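
The frequency filter can be sketched with `collections.Counter` on toy data. The keyword lists are made up, and stemming is left out because it needs NLTK; this only shows how keywords that occur once are dropped.

```python
from collections import Counter

# Made-up keyword lists, one per movie (illustrative only).
movie_keywords = [
    ["murder", "new york city", "detective"],
    ["murder", "revenge"],
    ["new york city", "romance"],
]

# Count how often each keyword appears across all movies.
counts = Counter(k for kws in movie_keywords for k in kws)

# Keep only keywords seen more than once.
frequent = {k for k, n in counts.items() if n > 1}
filtered = [[k for k in kws if k in frequent] for kws in movie_keywords]
print(filtered)
```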

In [70]:
s = s[s > 1]

In [71]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [72]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [73]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [74]:
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [75]:
# min_df=1 keeps every term; an integer min_df must be at least 1 in recent scikit-learn releases
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=1, stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [76]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [77]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

We will reuse the get_recommendations function that we had written earlier. Since our cosine similarity scores have changed, we expect it to give us different (and probably better) results. Let us check for The Dark Knight again and see what recommendations I get this time around.

In [78]:
get_recommendations('The Dark Knight').head(10)

5977               Batman Begins
7731       The Dark Knight Rises
6384                The Prestige
2010                   Following
8634     Kidnapping Mr. Heineken
9165    Batman: The Killing Joke
5686                    Thursday
8123                  Kick-Ass 2
1240              Batman & Robin
1890                 The General
Name: title, dtype: object

I am much more satisfied with the results I get this time around. The recommendations seem to have recognized other Christopher Nolan movies (due to the high weightage given to director) and put them as top recommendations. I enjoyed watching The Dark Knight as well as some of the other ones in the list including Batman Begins, The Prestige and The Dark Knight Rises.

We can of course experiment on this engine by trying out different weights for our features (directors, actors, genres), limiting the number of keywords that can be used in the soup, weighing genres based on their frequency, only showing movies with the same languages, etc.

In [79]:

get_recommendations('Pulp Fiction').head(10)

4558                              S.W.A.T.
8595                     The Hateful Eight
1361                          Jackie Brown
896                         Reservoir Dogs
5033                     Kill Bill: Vol. 2
2479    National Lampoon's Loaded Weapon 1
137                           Nick of Time
4376                                 Basic
7618                                 Drive
4713                     Kill Bill: Vol. 1
Name: title, dtype: object

### 2.3 Improved Recommender Based on Popularity

One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that Batman & Robin shares many characters with The Dark Knight, but it was a terrible movie that shouldn't be recommended to anyone.

Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

I will take the top 25 movies based on similarity scores and compute the 60th percentile of their vote counts. Movies with fewer votes than this cutoff $m$ are dropped, and the remaining ones are ranked by their weighted rating using IMDB's formula, like we did in the Simple Recommender section.
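
A minimal sketch of this filter-then-rank idea in plain Python. The candidate list and helper names are made up, and here the local cutoff and mean are passed to the formula explicitly. The quantile uses linear interpolation, which is the pandas default.

```python
def quantile(values, q):
    """q-th quantile with linear interpolation between order statistics."""
    xs = sorted(values)
    pos = q * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def weighted(v, R, m, C):
    return (v / (v + m)) * R + (m / (v + m)) * C


# Made-up candidates: (title, vote_count, vote_average).
candidates = [
    ("M1", 100, 9.0),
    ("M2", 5000, 7.0),
    ("M3", 12000, 8.0),
    ("M4", 300, 4.0),
    ("M5", 8000, 5.0),
    ("M6", 9000, 6.0),
]

m = quantile([v for _, v, _ in candidates], 0.60)
C = sum(r for _, _, r in candidates) / len(candidates)

# Drop candidates below the vote cutoff, then sort by weighted rating.
qualified = [c for c in candidates if c[1] >= m]
ranked = sorted(qualified, key=lambda c: weighted(c[1], c[2], m, C), reverse=True)
print(m, [title for title, _, _ in ranked])
```

Note that the cutoff only looks at vote counts: a heavily voted movie with a poor rating still qualifies and is merely ranked lower.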

In [80]:
def improved_recommendations(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
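    # weighted_rating reads the global m and C from the Simple Recommender section,
    # not the local m and C computed above.
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)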
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [81]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
6384,The Prestige,10590,8,2006,7.819507
7731,The Dark Knight Rises,16707,7,2012,6.929488
5977,Batman Begins,15004,7,2005,6.921901
7304,Kick-Ass,8850,7,2010,6.87222
7177,Law Abiding Citizen,3258,7,2009,6.697171
8123,Kick-Ass 2,4542,6,2013,5.923193
1127,Batman Returns,4191,6,1992,5.917817
4320,Daredevil,3515,5,2003,5.094919
8799,Batman v Superman: Dawn of Justice,13884,5,2016,5.028002
1240,Batman & Robin,3413,4,1997,4.291681



Unfortunately, Batman & Robin does not disappear from our recommendation list. This is probably because it is rated a 4, which is only slightly below average on TMDB. It certainly doesn't deserve a 4 when amazing movies like The Dark Knight Rises have only a 7. However, there is not much we can do about this. Therefore, we will conclude our Content Based Recommender section here.

## 3. Collaborative Filtering

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who s/he is.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.
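
To illustrate the idea (not the SVD algorithm used in this section), here is a minimal user-based sketch in plain Python: the rating a user would give to an unseen movie is estimated as a similarity-weighted average of other users' ratings. The ratings, user names, helper functions and the inverse-distance similarity are all made-up choices for illustration.

```python
# Made-up ratings: user -> {movie: rating}.
ratings = {
    "me":    {"movie_a": 5.0, "movie_b": 4.0},
    "alice": {"movie_a": 5.0, "movie_b": 4.0, "movie_c": 2.0},
    "bob":   {"movie_a": 1.0, "movie_b": 2.0, "movie_c": 5.0},
}


def similarity(u, v):
    """Inverse of the mean absolute rating difference on shared movies."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    mean_diff = sum(abs(ratings[u][m] - ratings[v][m]) for m in common) / len(common)
    return 1.0 / (1.0 + mean_diff)


def predict(user, movie):
    """Similarity-weighted average of other users' ratings for the movie."""
    neighbours = [u for u in ratings if u != user and movie in ratings[u]]
    weights = [similarity(user, u) for u in neighbours]
    total = sum(weights)
    if total == 0:
        return None  # nobody comparable has rated this movie
    return sum(w * ratings[u][movie] for w, u in zip(weights, neighbours)) / total


print(predict("me", "movie_c"))
```

Because "me" agrees with "alice" on every shared movie, her low rating for `movie_c` dominates the estimate even though "bob" loved it.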

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the **Surprise** library, which provides powerful algorithms like **Singular Value Decomposition (SVD)** to minimise RMSE (Root Mean Square Error) and give great recommendations.

In [82]:
reader = Reader()

In [83]:
ratings=pd.read_csv('../data/ratings_small.csv')

In [84]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


In [88]:
data=Dataset.load_from_df(ratings[['userId','movieId','rating']],reader)
data

<surprise.dataset.DatasetAutoFolds at 0x2448e1194a8>

In [105]:
import surprise
svd=SVD()
param_grid={'lr_all':[0.001,0.1],'reg_all':[0.1,0.5]}
surprise_cv=surprise.model_selection.GridSearchCV(SVD,param_grid,measures=['RMSE', 'MAE'],cv=5)
surprise_cv.fit(data)
print(surprise_cv.best_params['rmse'])

{'lr_all': 0.1, 'reg_all': 0.1}


In [106]:
from surprise.model_selection import cross_validate
svd=SVD(lr_all=0.1,reg_all=0.1)
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8752  0.8756  0.8738  0.8734  0.8804  0.8757  0.0025  
MAE (testset)     0.6716  0.6696  0.6751  0.6722  0.6741  0.6725  0.0019  
Fit time          15.07   14.96   14.14   15.07   14.88   14.82   0.35    
Test time         0.44    0.74    0.48    0.48    0.38    0.51    0.12    


{'test_rmse': array([0.87517185, 0.87558928, 0.87377793, 0.87343165, 0.88041334]),
 'test_mae': array([0.67159601, 0.66963641, 0.67506118, 0.67219113, 0.67408292]),
 'fit_time': (15.065889120101929,
  14.961552619934082,
  14.137922286987305,
  15.067900896072388,
  14.883702516555786),
 'test_time': (0.43599557876586914,
  0.7429759502410889,
  0.48368215560913086,
  0.4807579517364502,
  0.38198184967041016)}

We get a mean Root Mean Square Error of about 0.876, which is more than good enough for our case. Let us now train on the full dataset and arrive at predictions.
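
For reference, RMSE and MAE can be computed from pairs of true and predicted ratings as sketched here. The rating pairs are made up and are not taken from the cross-validation run.

```python
import math

# Made-up (true rating, predicted rating) pairs.
pairs = [(4.0, 3.5), (5.0, 4.0), (2.0, 3.0), (3.0, 3.0)]

errors = [true - pred for true, pred in pairs]

# RMSE squares the errors, so large misses weigh more than in MAE.
rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
mae = sum(abs(e) for e in errors) / len(errors)

print(round(rmse, 4), round(mae, 4))
```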

In [108]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x2448d681e80>

Let's pick user 10 and check the ratings s/he has given.

In [110]:
ratings[ratings['userId'] == 10]

Unnamed: 0,userId,movieId,rating,timestamp
1119,10,296,1.0,1455303387
1120,10,356,3.5,1455301685
1121,10,588,4.0,1455306173
1122,10,597,3.5,1455357645
1123,10,912,4.0,1455302254
...,...,...,...,...
1254,10,119145,1.0,1455302650
1255,10,129428,3.5,1455357384
1256,10,136020,5.0,1455302192
1257,10,137595,4.0,1455356898


In [111]:
svd.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=3.855988082512109, details={'was_impossible': False})

For user 1 and the movie with ID 302, we get an estimated rating of 3.856. One startling feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have rated the movie.
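
Surprise's SVD estimates a rating as the global mean plus a user bias, an item bias and the dot product of learned user and item factor vectors. A minimal sketch of that prediction rule with made-up parameters (none of these numbers come from the fitted model):

```python
# Made-up parameters for one user and one movie (illustrative only).
global_mean = 3.5
user_bias = 0.2          # this user rates slightly above average
item_bias = -0.1         # this movie is rated slightly below average
user_factors = [0.5, -0.2]
item_factors = [0.4, 0.1]

# The interaction term captures how well the movie matches the user's tastes.
interaction = sum(p * q for p, q in zip(user_factors, item_factors))
estimate = global_mean + user_bias + item_bias + interaction

# Keep the estimate inside the 1-5 scale that Reader() assumes by default.
estimate = min(5.0, max(1.0, estimate))
print(round(estimate, 2))
```

Nothing in this rule refers to titles, genres or cast: the biases and factors are learned from the rating matrix alone.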

## 4. Hybrid Recommender

In this section, I will try to build a simple hybrid recommender that brings together techniques we have implemented in the content based and collaborative filter based engines. This is how it will work:

- **Input**: User ID and the Title of a Movie
- **Output**: Similar movies sorted on the basis of expected ratings by that particular user.
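
The flow can be sketched in plain Python as follows. The similarity scores, the `lookup` function and its values are hypothetical stand-ins for `cosine_sim` and `svd.predict`.

```python
def hybrid_sketch(user_id, similar, predicted_rating, top_n=3):
    """Re-rank content-based candidates by a user's predicted rating.

    similar: list of (movie, similarity) pairs for the query movie
    predicted_rating: callable (user_id, movie) -> estimated rating
    """
    # Step 1: keep the most similar movies (content-based stage).
    candidates = sorted(similar, key=lambda p: p[1], reverse=True)[:top_n]
    # Step 2: order them by the user's estimated rating (collaborative stage).
    scored = [(movie, predicted_rating(user_id, movie)) for movie, _ in candidates]
    return sorted(scored, key=lambda p: p[1], reverse=True)


# Hypothetical data, for illustration only.
similar = [("A", 0.9), ("B", 0.8), ("C", 0.7), ("D", 0.1)]
estimates = {
    (1, "A"): 3.0, (1, "B"): 4.5, (1, "C"): 4.0,
    (2, "A"): 4.8, (2, "B"): 2.5, (2, "C"): 3.9,
}


def lookup(user_id, movie):
    return estimates[(user_id, movie)]


print(hybrid_sketch(1, similar, lookup))
print(hybrid_sketch(2, similar, lookup))
```

Both users get the same three candidates, since those depend only on the query movie, but in a different order, since the order depends on each user's estimated ratings.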

In [113]:
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):  # NaN or non-numeric values cannot be converted
        return np.nan

In [114]:
id_map = pd.read_csv('../data/links_small.csv')[['movieId', 'tmdbId']]
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')

In [115]:
indices_map = id_map.set_index('id')

In [116]:

def hybrid(userId, title):
    idx = indices[title]
    tmdbId = id_map.loc[title]['id']
    movie_id = id_map.loc[title]['movieId']
    
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]
    
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']]
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [117]:
hybrid(1, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
563,Terminator 2: Judgment Day,8533,8.0,1991,280,4.961209
1006,The Terminator,8698,7.6,1984,218,4.877801
8524,Star Wars: The Force Awakens,14989,7.4,2015,140607,4.743601
8281,X-Men: Days of Future Past,11772,7.5,2014,127585,4.739453
969,Aliens,6340,7.9,1986,679,4.729334
8533,Black Panther,16086,7.4,2018,284054,4.684835
7412,Alice in Wonderland,21,6.1,1933,25694,4.465827
8753,Star Trek Beyond,4984,6.7,2016,188927,4.459703
7138,Green Lantern: First Flight,224,6.6,2009,17445,4.415512
8059,Star Trek Into Darkness,7097,7.3,2013,54138,4.339972


In [118]:
hybrid(500, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
969,Aliens,6340,7.9,1986,679,4.397864
563,Terminator 2: Judgment Day,8533,8.0,1991,280,4.384852
8524,Star Wars: The Force Awakens,14989,7.4,2015,140607,4.110229
8281,X-Men: Days of Future Past,11772,7.5,2014,127585,3.977752
1006,The Terminator,8698,7.6,1984,218,3.904374
7412,Alice in Wonderland,21,6.1,1933,25694,3.860915
8059,Star Trek Into Darkness,7097,7.3,2013,54138,3.698867
8533,Black Panther,16086,7.4,2018,284054,3.637653
7138,Green Lantern: First Flight,224,6.6,2009,17445,3.587362
4394,Ghosts of the Abyss,87,7.0,2003,24982,3.574494



We see that for our hybrid recommender, we get different recommendations for different users although the movie is the same. Hence, our recommendations are more personalized and tailored towards particular users.

### Conclusion
In this notebook, I have built 4 different recommendation engines based on different ideas and algorithms. They are as follows:

- **Simple Recommender**: This system used overall TMDB Vote Count and Vote Averages to build Top Movies Charts, in general and for a specific genre. The IMDB Weighted Rating System was used to calculate ratings on which the sorting was finally performed.
- **Content Based Recommender**: We built two content based engines: one that took movie overviews and taglines as input, and another that took metadata such as cast, crew, genre and keywords to come up with predictions. We also devised a simple filter to give greater preference to movies with more votes and higher ratings.
- **Collaborative Filtering**: We used the powerful Surprise library to build a collaborative filter based on singular value decomposition. The RMSE obtained was less than 1 and the engine gave estimated ratings for a given user and movie.
- **Hybrid Engine**: We brought together ideas from content based and collaborative filtering to build an engine that gave movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.