<h1> Movie Recommender System </h1>

Recommendation systems are a type of information filtering system: they improve the quality of search results by surfacing items that are more relevant to the search query or related to the user's search history.  
Amazon uses them to suggest products to customers, YouTube uses them to decide which video to play next on autoplay, and Facebook uses them to recommend pages to like and people to follow. Moreover, companies like Netflix and Spotify depend heavily on the effectiveness of their recommendation engines for their business and success.

![](https://miro.medium.com/max/3220/1*ZrT6TIWpH8OgWDXGN_WqpQ.png)

There are basically three types of recommender systems:

- **Demographic Filtering** offers generalized recommendations to every user, based on movie popularity and/or genre. The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience.


- **Content Based Filtering** suggests similar items based on a particular item. This system uses item metadata (for movies: genre, director, description, actors, etc.) to make these recommendations. The general idea behind these recommender systems is that if a person liked a particular item, he or she will also like an item that is similar to it.


- **Collaborative Filtering** matches persons with similar interests and provides recommendations based on this matching. Collaborative filters do not require item metadata, unlike their content-based counterparts.

In this notebook, I will attempt to implement the above recommendation algorithms and then try to build an ensemble of these models to come up with our final hybrid recommendation system.   
We will start by building the Simple Recommender using movies from the **Full Dataset** (comprising 25 million ratings for 45,000 movies), whereas the more complex recommender systems will make use of the smaller dataset (consisting of 100k ratings for 9000 movies).

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet

import warnings
warnings.simplefilter('ignore')

## Simple Recommender  
The implementation of this model is straightforward. All we have to do is sort our movies based on ratings and popularity and display the top movies of our list. As an added step, we can pass in a genre argument to get the top movies of a particular genre.

In [108]:
mdata = pd.read_csv(r'C:\Users\Pundeer\Desktop\Data Science\Kaggle\Recommender System\movies_metadata.csv')
mdata.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [109]:
mdata['genres'] = mdata['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
mdata.genres

0         [Animation, Comedy, Family]
1        [Adventure, Fantasy, Family]
2                   [Romance, Comedy]
3            [Comedy, Drama, Romance]
4                            [Comedy]
                     ...             
45461                 [Drama, Family]
45462                         [Drama]
45463       [Action, Drama, Thriller]
45464                              []
45465                              []
Name: genres, Length: 45466, dtype: object

We will use IMDB's *weighted rating* formula to construct **Top Movies Chart**. Mathematically, it is represented as follows:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report
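
As a quick sanity check on the formula, the sketch below plugs in the figures for The Shawshank Redemption from the chart in this section (8358 votes, average 8.5) together with the values of *m* and *C* that this notebook computes (434 and roughly 5.618). The function name `weighted_rating_value` and the second, low-vote movie are illustrative assumptions, not part of the dataset.

```python
def weighted_rating_value(v, R, m, C):
    """IMDB-style weighted rating: a vote-weighted blend of R and C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m = 434.0                # minimum votes (95th percentile in this notebook)
C = 5.618207215134184    # mean vote_average across all movies

# Many votes: the score stays close to the movie's own average.
popular = weighted_rating_value(v=8358, R=8.5, m=m, C=C)

# Hypothetical movie with the same average but only 10 votes:
# the score is pulled strongly towards the global mean C.
obscure = weighted_rating_value(v=10, R=8.5, m=m, C=C)

print(round(popular, 3), round(obscure, 3))
```

The more votes a movie has relative to *m*, the more its own average *R* dominates; with few votes the score shrinks towards *C*.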

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use the **95th percentile** as our cutoff. In other words, for a movie to feature in the charts, it must have more votes than at least 95% of the movies in the list.
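
To make the percentile cutoff concrete, here is a minimal sketch in plain Python of a quantile with linear interpolation, which is the default behaviour of pandas' `Series.quantile`. The helper `quantile_linear` and the vote counts are hypothetical and only serve as an illustration.

```python
def quantile_linear(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    data = sorted(values)
    pos = (len(data) - 1) * q
    lower = int(pos)
    upper = min(lower + 1, len(data) - 1)
    frac = pos - lower
    return data[lower] + frac * (data[upper] - data[lower])

# Hypothetical vote counts for ten movies (not taken from the dataset).
vote_counts = [3, 5, 8, 12, 20, 35, 60, 150, 900, 4000]

# The cutoff m is the 95th percentile of the vote counts.
m_toy = quantile_linear(vote_counts, 0.95)

# Only movies with at least m_toy votes qualify for the chart.
qualified_counts = [v for v in vote_counts if v >= m_toy]
```

With such a skewed distribution, a high percentile leaves only the most-voted movies in the chart, which is exactly the effect we want.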

In [9]:
m = mdata[mdata.vote_count.notnull()].vote_count.astype('int').quantile(0.95)
m

434.0

In [10]:
C = mdata[mdata.vote_average.notnull()].vote_average.mean()
C

5.618207215134184

In [110]:
# Extract the release year; unparseable dates (NaT) become NaN
mdata['year'] = pd.to_datetime(mdata.release_date, errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [12]:
qualified = mdata[(mdata['vote_count'] >= m) & (mdata['vote_count'].notnull()) & 
                  (mdata['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified.shape

(2274, 6)

Therefore, to be considered for the top 250 chart, a movie has to have at least 434 votes. We also see that the average rating for a movie is 5.618 out of 10.   
2274 movies qualify to be on our chart.

In [13]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [14]:
qualified['ratings'] = qualified.apply(weighted_rating, axis=1)

#### Top movies

In [15]:
qualified = qualified.sort_values('ratings', ascending=False).head(250)
qualified.head(20)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,ratings
314,The Shawshank Redemption,1994,8358,8.5,51.6454,"[Drama, Crime]",8.357746
834,The Godfather,1972,6024,8.5,41.1093,"[Drama, Crime]",8.306334
12481,The Dark Knight,2008,12269,8.3,123.167,"[Drama, Action, Crime, Thriller]",8.208376
2843,Fight Club,1999,9678,8.3,63.8696,[Drama],8.184899
292,Pulp Fiction,1994,8670,8.3,140.95,"[Thriller, Crime]",8.172155
351,Forrest Gump,1994,8147,8.2,48.3072,"[Comedy, Drama, Romance]",8.069421
522,Schindler's List,1993,4436,8.3,41.7251,"[Drama, History, War]",8.061007
23673,Whiplash,2014,4376,8.3,64.3,[Drama],8.058025
5481,Spirited Away,2001,3968,8.3,41.0489,"[Fantasy, Adventure, Animation, Family]",8.035598
1154,The Empire Strikes Back,1980,5998,8.2,19.471,"[Adventure, Action, Science Fiction]",8.025793


Let us now construct a function that builds charts for particular genres. For this, we will relax our default cutoff to the 85th percentile instead of the 95th.

In [16]:
s = mdata.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_df = mdata.drop('genres', axis=1).join(s)
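
The cell above reshapes the data so that every movie appears once per genre, which is what lets us filter on a single genre afterwards. A plain-Python sketch of the same idea is shown here; the three movies and the variable names are hypothetical.

```python
# Hypothetical miniature of the movies table: title plus a list of genres.
movies = [
    {"title": "Toy Story", "genres": ["Animation", "Comedy", "Family"]},
    {"title": "Jumanji", "genres": ["Adventure", "Fantasy", "Family"]},
    {"title": "Untitled", "genres": []},
]

# Build one row per (movie, genre) pair.
rows = []
for movie in movies:
    # Movies without genres are kept with a missing genre, mirroring a left join.
    genres = movie["genres"] or [None]
    for genre in genres:
        rows.append({"title": movie["title"], "genre": genre})

# Filtering on one genre is now a simple equality test.
family = [r["title"] for r in rows if r["genre"] == "Family"]
```

In recent pandas versions, `DataFrame.explode` may be a more direct way to get the same long format.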

In [17]:
def build_chart(genre, percentile=0.85):
    df = gen_df[gen_df['genre'] == genre]
    m = df[df.vote_count.notnull()].vote_count.astype('int').quantile(percentile)
    C = df[df.vote_average.notnull()]['vote_average'].mean()

    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & 
                   (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    
    qualified['ratings'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average'])
                                                       + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('ratings', ascending=False).head(250)    
    return qualified

Let us see our method in action by displaying the top 10 Science Fiction movies.

In [18]:
build_chart('Science Fiction').head(10)

Unnamed: 0,title,year,vote_count,vote_average,popularity,ratings
1154,The Empire Strikes Back,1980,5998,8.2,19.471,8.071052
15480,Inception,2010,14075,8.1,29.1081,8.045563
22879,Interstellar,2014,11187,8.1,32.2135,8.031857
256,Star Wars,1977,6778,8.1,42.1497,7.98931
1225,Back to the Future,1985,6239,8.0,25.7785,7.884509
23753,Guardians of the Galaxy,2014,10014,7.9,53.2916,7.8296
2458,The Matrix,1999,9079,7.9,33.3663,7.82257
1163,A Clockwork Orange,1971,3432,8.0,17.1126,7.797258
1167,Return of the Jedi,1983,4763,7.9,14.5861,7.756348
1171,Alien,1979,4564,7.9,23.3774,7.750451


We see here that all three films of the original Star Wars trilogy made it into the top 10 science fiction movies of all time.

## Content Based Recommender

The recommender we built in the previous section suffers from some severe limitations. For one, it gives the same recommendations to everyone, regardless of the user's personal taste. To personalise our recommendations more, we are going to build an engine that computes similarity between movies based on certain metrics and suggests movies that are most similar to a particular movie that a user liked.  
Since we will be using movie metadata (or content) to build this engine, this is also known as **Content Based Filtering.**  
We will build two Content Based Recommenders based on:
- Movie Overviews and Taglines
- Movie Cast, Crew, Keywords and Genre  

We will be using the smaller dataset here, due to limited computational resources.

In [111]:
# Drop three rows with malformed entries so that 'id' can be cast to int below
mdata = mdata.drop([19730, 29503, 35587])

In [112]:
links_small = pd.read_csv(r'C:\Users\Pundeer\Desktop\Data Science\Kaggle\Recommender System\links_small.csv')
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
mdata.id = mdata.id.astype('int')
small_df = mdata[mdata['id'].isin(links_small)]
small_df.shape

(9099, 25)

We have 9099 movies available in our small movies metadata dataset, which is about 5 times smaller than our original dataset of 45,000 movies.

#### Cosine Similarity

We will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x \cdot y^\intercal}{\|x\| \cdot \|y\|}$

Since we are using the TF-IDF Vectorizer, whose output rows are L2-normalised by default, calculating the dot product will directly give us the cosine similarity score. Therefore, we will use sklearn's **linear_kernel** instead of cosine_similarity since it is much faster.
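
The shortcut works because `TfidfVectorizer` scales every row to unit length by default (`norm='l2'`), so the denominator of the cosine formula is 1. The following sketch checks this on two hypothetical vectors in plain Python; the helper names are illustrative.

```python
import math

def dot(x, y):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine similarity as defined in the formula above."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

def l2_normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(dot(x, x))
    return [a / norm for a in x]

# Two hypothetical term-weight vectors.
x = [1.0, 2.0, 0.0, 3.0]
y = [2.0, 0.0, 1.0, 1.0]

x_unit, y_unit = l2_normalize(x), l2_normalize(y)

# For unit-length vectors the dot product alone equals the cosine similarity.
print(cosine(x, y), dot(x_unit, y_unit))
```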

### 1. Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines.

In [113]:
small_df.tagline = small_df.tagline.fillna('')
small_df.overview = small_df.overview.fillna('')
small_df['description'] = small_df.overview + small_df.tagline

In [114]:
# min_df=1 keeps every term (an integer min_df must be at least 1 in recent scikit-learn versions)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(small_df['description'])
tfidf_matrix.shape

(9099, 268124)

In [115]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [116]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

In [117]:
small_df = small_df.reset_index()
titles = small_df['title']
indices = pd.Series(small_df.index, index=small_df['title'])

In [118]:
def get_recommendations(title, cosine_sim):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:16]
    movie_indices = [i[0] for i in sim_scores]
    return titles.iloc[movie_indices]
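
The function above boils down to "sort one row of the similarity matrix and keep the best matches". Here is a minimal sketch of that step on a hypothetical five-item row. Note one deliberate difference: the sketch excludes the query item by index, whereas the notebook's function skips the first sorted entry on the assumption that the movie is most similar to itself.

```python
def top_similar(sim_row, own_index, k=3):
    """Return indices of the k most similar items, excluding the item itself."""
    scored = [(i, s) for i, s in enumerate(sim_row) if i != own_index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:k]]

titles_toy = ["A", "B", "C", "D", "E"]

# Hypothetical similarity row for item "A" (index 0).
sim_row = [1.0, 0.2, 0.9, 0.0, 0.5]

recommended = [titles_toy[i] for i in top_similar(sim_row, own_index=0, k=3)]
```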

Let us now try to get the top recommendations for a few movies and see how good they are.

In [119]:
get_recommendations('The Godfather', cosine_sim)

973      The Godfather: Part II
8387                 The Family
3509                       Made
4196         Johnny Dangerously
29               Shanghai Triad
5667                       Fury
2412             American Movie
1582    The Godfather: Part III
4221                    8 Women
2159              Summer of Sam
618                     Thinner
3609              Harlem Nights
8816              Run All Night
3288          Jaws: The Revenge
2192           The Color Purple
Name: title, dtype: object

In [120]:
get_recommendations('The Dark Knight', cosine_sim)

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
6144                              Batman Begins
7933         Sherlock Holmes: A Game of Shadows
5511                            To End All Wars
4489                                      Q & A
7344                        Law Abiding Citizen
Name: title, dtype: object

We see that for The Dark Knight, our system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. But unfortunately, that is all this system can do at the moment. Someone who liked The Dark Knight probably liked it more because of its director Christopher Nolan, and might not enjoy Batman Forever or every other substandard movie in the Batman franchise.

### 2. Genres, Cast and Keywords Based Recommender

To build our standard metadata based content recommender, we will need to merge our current dataset with the credits and the keywords datasets.

In [121]:
credits = pd.read_csv(r'C:\Users\Pundeer\Desktop\Data Science\Kaggle\Recommender System\credits.csv')
keywords = pd.read_csv(r'C:\Users\Pundeer\Desktop\Data Science\Kaggle\Recommender System\keywords.csv')

In [122]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')

In [123]:
md1 = mdata.merge(credits, on='id')
md1 = md1.merge(keywords, on='id')

In [124]:
small_df= md1[md1['id'].isin(links_small)]
small_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,year,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,1995,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,1995,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,1995,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,1995,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


We now have our cast, crew, genres and keywords, all in one dataframe. Let us wrangle this a little more using the following intuitions:

1. **Crew:** We will only pick the director as our feature
2. **Cast:** We only select the major characters and their respective actors. Arbitrarily, we will choose the top 3 actors that appear in the credits list. 
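
The two rules above can be sketched on a single hypothetical record as follows. The raw strings imitate the stringified lists found in the credits file, but the names in them are made up, and `get_director` is an illustrative helper.

```python
from ast import literal_eval

# Hypothetical raw strings in the style of the credits file.
crew_raw = "[{'job': 'Producer', 'name': 'P. Example'}, {'job': 'Director', 'name': 'D. Example'}]"
cast_raw = "[{'name': 'Actor One'}, {'name': 'Actor Two'}, {'name': 'Actor Three'}, {'name': 'Actor Four'}]"

def get_director(crew):
    """Return the first crew member whose job is 'Director', if any."""
    for member in crew:
        if member.get('job') == 'Director':
            return member['name']
    return None

# literal_eval turns the stringified lists back into Python objects.
crew = literal_eval(crew_raw)
cast = literal_eval(cast_raw)

director_name = get_director(crew)

# Slicing with [:3] also works when fewer than three actors are listed.
top_cast = [member['name'] for member in cast][:3]
```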

In [125]:
small_df.shape

(9219, 28)

In [126]:
# Wrangling crew column
def director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

small_df['director'] = small_df['crew'].apply(literal_eval).apply(lambda x: director(x))

In [127]:
# Wrangling cast column
small_df['cast'] = small_df['cast'].apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [128]:
small_df['cast'] = small_df['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [129]:
# Wrangling keywords
small_df['keywords'] = small_df['keywords'].apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

The approach to building the recommender is to create a metadata dump for every movie which consists of **genres, director, main actors and keywords.** I then use a **Count Vectorizer** to create our count matrix, in place of the TF-IDF matrix used in the Description Recommender. The remaining steps are similar to what we did earlier: we calculate the cosine similarities and return movies that are most similar.  

These are the steps I follow in the preparation of my genres and credits data:
1. **Strip Spaces and Convert to Lowercase** for all our features. This way, our model will not confuse **Johnny Depp** with **Johnny Galecki.** 
2. **Mention Director 3 times** to give it more weight relative to the entire cast.
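
A small sketch of why these two steps matter, in plain Python. The helper `normalize` is an illustrative assumption and not the notebook's own code.

```python
def normalize(name):
    """Lowercase a name and remove spaces so it becomes a single token."""
    return name.replace(" ", "").lower()

cast_toy = ["Johnny Depp", "Johnny Galecki"]
tokens = [normalize(n) for n in cast_toy]

# Without normalization, a word-level vectorizer would see a shared token "johnny".
shared_before = set("Johnny Depp".lower().split()) & set("Johnny Galecki".lower().split())

# After normalization the two actors no longer share any token.
shared_after = {tokens[0]} & {tokens[1]}

# Repeating the director token three times triples its count in the soup.
director_tokens = [normalize("Christopher Nolan")] * 3
```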

In [130]:
small_df['cast'] = small_df['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])
small_df['director'] = small_df['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
small_df['director'] = small_df['director'].apply(lambda x: [x, x, x])

In [132]:
# Further processing Keywords
kw = small_df.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
kw.name = 'keyword'
kw = kw.value_counts()
kw = kw[kw > 1]
kw

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
                       ... 
public murder             2
teenage life              2
sign                      2
motorcycle chase          2
practical joke            2
Name: keyword, Length: 6709, dtype: int64

In [133]:
def filter_keywords(keys):
    words = []
    for i in keys:
        if i in kw:
            words.append(i)
    return words

In [134]:
small_df.keywords = small_df['keywords'].apply(lambda x: filter_keywords(x))
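
The keyword cells above count how often each keyword occurs and keep only those that appear more than once, since a keyword attached to a single movie cannot link it to any other movie. A minimal sketch of the same filtering with `collections.Counter`, on hypothetical keyword lists:

```python
from collections import Counter

# Hypothetical keyword lists for four movies.
keyword_lists = [
    ["murder", "independent film"],
    ["murder", "toy"],
    ["independent film", "sign"],
    ["murder"],
]

# Count every keyword across all movies.
counts = Counter(k for keys in keyword_lists for k in keys)

# Keep only keywords that occur more than once.
frequent = {k for k, n in counts.items() if n > 1}

# Drop the rare keywords from every movie's list.
filtered = [[k for k in keys if k in frequent] for keys in keyword_lists]
```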

In [135]:
stemmer = SnowballStemmer('english')
small_df['keywords'] = small_df['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
small_df['keywords'] = small_df['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

We are now in a position to create our "metadata soup", which is a string that contains all the metadata that we want to feed to our vectorizer   
(namely actors, director, genres and keywords).  

The next steps are the same as what we did with our description based recommender. One important difference is that we use the **CountVectorizer() instead of TF-IDF**. This is because we do not want to down-weight the presence of an actor/director if he or she has acted in or directed relatively more movies.
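
To see how raw counts and the repeated director token interact, here is a plain-Python sketch of cosine similarity on token counts. The soups are hypothetical, and unlike the notebook's vectorizer (which also counts bigrams) the sketch only counts single tokens.

```python
import math
from collections import Counter

def cosine_counts(a, b):
    """Cosine similarity between two bags of tokens, using raw counts."""
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b)

# Hypothetical soups: two actors, the director three times, one genre.
soup_a = "actorone actortwo directorx directorx directorx drama"
soup_b = "actorthree actorfour directorx directorx directorx thriller"  # same director
soup_c = "actorone actorfive directory directory directory comedy"      # one shared actor

same_director = cosine_counts(soup_a, soup_b)
same_actor = cosine_counts(soup_a, soup_c)
```

Sharing the director contributes 3 × 3 = 9 to the dot product, whereas a shared actor contributes only 1, so `same_director` comes out far larger than `same_actor`.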

In [136]:
small_df['soup'] = small_df['cast'] + small_df['director'] + small_df['genres'] + small_df['keywords']
small_df['soup'] = small_df['soup'].apply(lambda x: ' '.join(x))

In [137]:
small_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,year,cast,crew,keywords,director,soup
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Toy Story,False,7.7,5415.0,1995,"[tomhanks, timallen, donrickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousi, toy, boy, friendship, friend, rival...","[johnlasseter, johnlasseter, johnlasseter]",tomhanks timallen donrickles johnlasseter john...
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Jumanji,False,6.9,2413.0,1995,"[robinwilliams, jonathanhyde, kirstendunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[boardgam, disappear, basedonchildren'sbook, n...","[joejohnston, joejohnston, joejohnston]",robinwilliams jonathanhyde kirstendunst joejoh...
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Grumpier Old Men,False,6.5,92.0,1995,"[waltermatthau, jacklemmon, ann-margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fish, bestfriend, duringcreditssting]","[howarddeutch, howarddeutch, howarddeutch]",waltermatthau jacklemmon ann-margret howarddeu...
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Waiting to Exhale,False,6.1,34.0,1995,"[whitneyhouston, angelabassett, lorettadevine]","[{'credit_id': '52fe44779251416c91011acb', 'de...","[basedonnovel, interracialrelationship, single...","[forestwhitaker, forestwhitaker, forestwhitaker]",whitneyhouston angelabassett lorettadevine for...
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Father of the Bride Part II,False,5.7,173.0,1995,"[stevemartin, dianekeaton, martinshort]","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[babi, midlifecrisi, confid, age, daughter, mo...","[charlesshyer, charlesshyer, charlesshyer]",stevemartin dianekeaton martinshort charlesshy...


In [138]:
# min_df=1 keeps every term (an integer min_df must be at least 1 in recent scikit-learn versions)
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(small_df['soup'])

In [139]:
cos_sim = cosine_similarity(count_matrix, count_matrix)

In [140]:
small_df = small_df.reset_index()
titles = small_df['title']
indices = pd.Series(small_df.index, index=small_df['title'])

We can now reuse our get_recommendations() function. Since the similarity matrix we pass to it has changed, we expect to get different and probably better results.

In [141]:
get_recommendations('The Dark Knight', cos_sim)

8031                 The Dark Knight Rises
6218                         Batman Begins
6623                          The Prestige
2085                             Following
7648                             Inception
4145                              Insomnia
3381                               Memento
8613                          Interstellar
7659            Batman: Under the Red Hood
1134                        Batman Returns
8927               Kidnapping Mr. Heineken
5943                              Thursday
1260                        Batman & Robin
9024    Batman v Superman: Dawn of Justice
4021                  The Long Good Friday
Name: title, dtype: object

The recommendations seem to have recognized other Christopher Nolan movies (due to the high weight given to the director) and put them as top recommendations.

#### Popularity and Ratings

One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that **Batman & Robin** shares many characters with **The Dark Knight**, but it was a terrible movie that shouldn't be recommended to anyone.  
Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response. We will calculate the weighted rating of each movie using IMDB's formula like we did in the Simple Recommender section.

In [142]:
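def improved_recommendation(title, cosine_sim):   
    idx = indices[title]
    # Take the 50 most similar movies, skipping the movie itself at position 0
    sim_score = list(enumerate(cosine_sim[idx]))
    sim_score = sorted(sim_score, key = lambda x:x[1], reverse = True)[1:51]
    movie_index = [i[0] for i in sim_score]
    
    movies = small_df.iloc[movie_index][['title', 'vote_average', 'vote_count', 'genres', 'cast', 'year']]
    # Vote-count cutoff (75th percentile) and mean rating among these 50 candidates
    m = movies[movies.vote_count.notnull()]['vote_count'].quantile(0.75)
    C = movies[movies.vote_average.notnull()]['vote_average'].mean()
    
    # .copy() so that adding the 'Rating' column does not write into a slice of `movies`
    qualified = movies[(movies.vote_count >= m) & (movies.vote_count.notnull()) & (movies.vote_average.notnull())].copy()
    # Note: weighted_rating reads the module-level m and C from the Simple Recommender section,
    # so the local C is unused and the local m only acts as the qualification cutoff
    qualified['Rating'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('Rating', ascending=False)[['title','Rating', 'genres', 'cast', 'year']]
    return qualified.head(10)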


In [143]:
improved_recommendation('The Dark Knight', cos_sim)

Unnamed: 0,title,Rating,genres,cast,year
7648,Inception,8.025763,"[Action, Thriller, Science Fiction, Mystery, A...","[leonardodicaprio, josephgordon-levitt, ellenp...",2010
8613,Interstellar,8.007315,"[Adventure, Drama, Science Fiction]","[matthewmcconaughey, jessicachastain, annehath...",2014
3381,Memento,7.86595,"[Mystery, Thriller]","[guypearce, carrie-annemoss, joepantoliano]",2000
6623,The Prestige,7.790919,"[Drama, Mystery, Thriller]","[hughjackman, christianbale, michaelcaine]",2006
8031,The Dark Knight Rises,7.511303,"[Action, Crime, Drama, Thriller]","[christianbale, michaelcaine, garyoldman]",2012
6218,Batman Begins,7.397206,"[Action, Crime, Drama]","[christianbale, michaelcaine, liamneeson]",2005
7583,Kick-Ass,6.975874,"[Action, Crime]","[aarontaylor-johnson, chloëgracemoretz, christ...",2010
524,Batman,6.767469,"[Fantasy, Action]","[jacknicholson, michaelkeaton, kimbasinger]",1989
1134,Batman Returns,6.400889,"[Action, Fantasy]","[michaelkeaton, dannydevito, michellepfeiffer]",1992
8467,Kick-Ass 2,6.190772,"[Action, Adventure, Crime]","[aarontaylor-johnson, chloëgracemoretz, christ...",2013


In [144]:
improved_recommendation('Iron Man', cos_sim)

Unnamed: 0,title,Rating,genres,cast,year
8712,Guardians of the Galaxy,7.805216,"[Action, Science Fiction, Adventure]","[chrispratt, zoesaldana, davebautista]",2014
8626,Captain America: The Winter Soldier,7.463801,"[Action, Adventure, Science Fiction]","[chrisevans, samuell.jackson, scarlettjohansson]",2014
8658,X-Men: Days of Future Past,7.376051,"[Action, Adventure, Fantasy, Science Fiction]","[hughjackman, jamesmcavoy, michaelfassbender]",2014
7969,The Avengers,7.337808,"[Science Fiction, Action, Adventure]","[robertdowneyjr., chrisevans, markruffalo]",2012
8871,Deadpool,7.334897,"[Action, Adventure, Comedy]","[ryanreynolds, morenabaccarin, edskrein]",2016
8868,Avengers: Age of Ultron,7.200586,"[Action, Adventure, Science Fiction]","[robertdowneyjr., chrishemsworth, markruffalo]",2015
8872,Captain America: Civil War,7.018554,"[Adventure, Action, Science Fiction]","[chrisevans, robertdowneyjr., scarlettjohansson]",2016
8869,Ant-Man,6.907211,"[Science Fiction, Action, Adventure]","[paulrudd, michaeldouglas, evangelinelilly]",2015
8392,Iron Man 3,6.745349,"[Action, Adventure, Science Fiction]","[robertdowneyjr., gwynethpaltrow, doncheadle]",2013
7923,Captain America: The First Avenger,6.543993,"[Action, Adventure, Science Fiction]","[chrisevans, hugoweaving, tommyleejones]",2011


We see that our recommender has been successful in capturing more information due to more metadata and has given us better recommendations.

## Collaborative Filtering Based Recommender

Our content based engine still suffers from some severe limitations. It is only capable of suggesting movies which are close to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.  
Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of the user's identity.

Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service those users have used/experienced but I have not. It comes in two basic types:

*  **User Based Collaborative Filtering** These systems recommend products to a user that similar users have liked. For measuring the similarity between two users we can use either Pearson correlation or cosine similarity.
* **Item Based Collaborative Filtering** Instead of measuring the similarity between users, item-based CF recommends items based on their similarity to the items that the target user has rated. Likewise, the similarity can be computed with Pearson correlation or cosine similarity.
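
As an illustration of user-to-user similarity, the sketch below computes the Pearson correlation between two users over the movies both have rated. The users, their ratings and the fallback value of 0.0 for too little overlap are assumptions made for this example.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0  # assumption: treat too little overlap as "no evidence"
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who gives identical ratings has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratings: movie id -> rating.
alice = {1: 5.0, 2: 3.0, 3: 4.0, 4: 1.0}
bob = {1: 4.0, 2: 2.0, 3: 3.0, 5: 5.0}     # same pattern as alice, one point lower
carol = {1: 1.0, 2: 3.0, 3: 2.0, 4: 5.0}   # the mirror image of alice's ratings

alice_bob = pearson(alice, bob)
alice_carol = pearson(alice, carol)
```

Because Pearson correlation subtracts each user's mean, a generous and a strict rater with the same preferences (alice and bob here) still count as similar.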

We will use the **Surprise library**, which provides powerful algorithms such as **Singular Value Decomposition (SVD)** to minimise RMSE (Root Mean Square Error) and give great recommendations.
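
As far as the Surprise documentation describes it, its `SVD` algorithm predicts a rating as the global mean plus a user bias, an item bias and the dot product of a user and an item factor vector, and fits these by stochastic gradient descent on a regularised squared error. The sketch below is a heavily simplified plain-Python version of that idea, not Surprise's implementation; the function names, hyperparameters and ratings are all assumptions.

```python
import math
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + p_u . q_i by stochastic gradient descent."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter along the error, shrunk by regularisation.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return predict

def rmse(predict, ratings):
    """Root mean square error of a predictor on (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Hypothetical (user, item, rating) triples.
toy_ratings = [
    (1, 10, 5.0), (1, 11, 4.0), (1, 12, 1.0),
    (2, 10, 4.0), (2, 11, 5.0), (2, 13, 2.0),
    (3, 12, 5.0), (3, 13, 4.0), (3, 10, 1.0),
]

# Baseline: always predict the global mean rating.
global_mean = sum(r for _, _, r in toy_ratings) / len(toy_ratings)
baseline = rmse(lambda u, i: global_mean, toy_ratings)

model = train_mf(toy_ratings)
fitted = rmse(model, toy_ratings)
```

On this toy data the fitted model should beat the global-mean baseline on the training ratings. That says nothing about generalisation, which is why the notebook relies on cross-validation instead.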

In [173]:
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

reader = Reader()

In [146]:
ratings = pd.read_csv(r'C:\Users\Pundeer\Desktop\Data Science\Kaggle\Recommender System\ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [147]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [148]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8850  0.9024  0.9021  0.8949  0.8969  0.8963  0.0063  
MAE (testset)     0.6827  0.6966  0.6929  0.6888  0.6910  0.6904  0.0046  
Fit time          17.82   17.38   18.26   16.89   16.90   17.45   0.53    
Test time         0.67    0.74    0.61    1.29    0.63    0.79    0.25    


{'test_rmse': array([0.88498903, 0.90241925, 0.90210751, 0.89487029, 0.89691861]),
 'test_mae': array([0.68272646, 0.6966014 , 0.69286256, 0.6887626 , 0.690971  ]),
 'fit_time': (17.82364249229431,
  17.381481409072876,
  18.264177083969116,
  16.890681266784668,
  16.903923511505127),
 'test_time': (0.6683306694030762,
  0.7448601722717285,
  0.6126194000244141,
  1.2896075248718262,
  0.6325645446777344)}

We get a mean **Root Mean Square Error** of 0.8963, which is good enough for our case. Let us now train on the full dataset and arrive at predictions.
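
For reference, the two metrics reported by `cross_validate` can be computed by hand as below. The `actual` and `predicted` lists are made-up values for illustration; the `fold_rmse` list copies the five per-fold RMSE values from the output table above.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error: penalises large misses more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of a miss, in rating points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical true ratings and model estimates
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]
print(rmse(actual, predicted))  # 0.75
print(mae(actual, predicted))   # 0.625

# Per-fold RMSE values from the cross-validation table
fold_rmse = [0.8850, 0.9024, 0.9021, 0.8949, 0.8969]
print(round(sum(fold_rmse) / len(fold_rmse), 4))  # 0.8963
```

An RMSE of about 0.9 means that the estimates are typically off by a little under one point on the rating scale.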

In [149]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x24a29e0ec18>

Let us pick the user with user ID 100 and check the ratings she/he has given.

In [150]:
ratings[ratings['userId'] == 100]

Unnamed: 0,userId,movieId,rating,timestamp
15273,100,1,4.0,854193977
15274,100,3,4.0,854194024
15275,100,6,3.0,854194023
15276,100,7,3.0,854194024
15277,100,25,4.0,854193977
15278,100,32,5.0,854193977
15279,100,52,3.0,854194056
15280,100,62,3.0,854193977
15281,100,86,3.0,854194208
15282,100,88,2.0,854194208


In [172]:
svd.predict(100, 302)

Prediction(uid=100, iid=302, r_ui=None, est=3.6266232075173943, details={'was_impossible': False})

For the movie with ID 302, we get an estimated rating of 3.6266. One notable feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have rated the movie.
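
A single estimate becomes a recommendation list once we score every movie the user has not rated yet and keep the best ones. The helper below is a sketch, not part of the notebook: `top_n_for_user` is a hypothetical name, and it takes the rating estimator as a plain function so that it can be tried without a trained model. With the model fitted above, one could pass `lambda u, i: svd.predict(u, i).est`.

```python
def top_n_for_user(user_id, candidate_ids, seen_ids, predict, n=10):
    """Return the n unseen items with the highest estimated rating for user_id."""
    unseen = [i for i in candidate_ids if i not in seen_ids]
    scored = [(i, predict(user_id, i)) for i in unseen]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Hypothetical estimates standing in for a trained model
stub_estimates = {1: 3.6, 2: 2.5, 3: 4.2, 4: 3.9}
already_rated = {2}

recommendations = top_n_for_user(
    user_id=100,
    candidate_ids=stub_estimates.keys(),
    seen_ids=already_rated,
    predict=lambda user, item: stub_estimates[item],
    n=2,
)
print(recommendations)  # [(3, 4.2), (4, 3.9)]
```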

## Hybrid Recommender

In this section, we will try to build a simple hybrid recommender that brings together techniques we have implemented in the content based and collaborative filter based engines. This is how it will work:

* **Input:** User ID and the Title of a Movie
* **Output:** Similar movies sorted on the basis of expected ratings by that particular user

In [167]:
id_map = pd.read_csv(r'C:\Users\Pundeer\Desktop\Data Science\Kaggle\Recommender System\links_small.csv')
id_map = id_map[['movieId', 'tmdbId']].dropna()
id_map['tmdbId'] = id_map['tmdbId'].astype('int')
id_map.columns = ['movieId', 'id']
id_map = id_map.merge(small_df[['title', 'id']], on='id').set_index('title')

In [168]:
indices_map = id_map.set_index('id')

In [169]:
def hybrid(userId, title, cosine_sim):
    # Row of `title` in the content-based similarity matrix
    idx = indices[title]

    # Shortlist the 50 most similar movies, skipping the movie itself at position 0
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)[1:51]
    movie_indices = [i[0] for i in sim_scores]

    # Explicit copy, so that adding a column below cannot affect small_df
    movies = small_df.iloc[movie_indices][['title', 'genres', 'vote_count', 'vote_average', 'year', 'id']].copy()
    # Map each TMDB id to its movieId and ask the SVD model for this user's estimated rating
    movies['est_rating'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    # Re-rank the shortlist by estimated rating and keep the top 10
    movies = movies.sort_values('est_rating', ascending=False)
    return movies.head(10)

In [170]:
hybrid(1, 'Avatar', cos_sim)

Unnamed: 0,title,genres,vote_count,vote_average,year,id,est_rating
522,Terminator 2: Judgment Day,"[Action, Thriller, Science Fiction]",4274.0,7.7,1991,280,3.126748
2834,Predator,"[Science Fiction, Action, Adventure, Thriller]",2129.0,7.3,1987,106,3.047741
974,Aliens,"[Horror, Action, Thriller, Science Fiction]",3282.0,7.7,1986,679,3.037944
8401,Star Trek Into Darkness,"[Action, Adventure, Science Fiction]",4479.0,7.4,2013,54138,3.010345
922,The Abyss,"[Adventure, Action, Thriller, Science Fiction]",822.0,7.1,1989,2756,2.987448
1011,The Terminator,"[Action, Thriller, Science Fiction]",4208.0,7.4,1984,218,2.924104
1668,Return from Witch Mountain,"[Adventure, Fantasy, Science Fiction, Family]",38.0,5.6,1978,14822,2.915279
3999,Vampire Hunter D: Bloodlust,"[Action, Adventure, Animation, Fantasy, Horror...",92.0,7.0,2000,15999,2.887549
8865,Star Wars: The Force Awakens,"[Action, Adventure, Science Fiction, Fantasy]",7993.0,7.5,2015,140607,2.872679
8658,X-Men: Days of Future Past,"[Action, Adventure, Fantasy, Science Fiction]",6155.0,7.5,2014,127585,2.801807


In [171]:
hybrid(100, 'Avatar', cos_sim)

Unnamed: 0,title,genres,vote_count,vote_average,year,id,est_rating
1011,The Terminator,"[Action, Thriller, Science Fiction]",4208.0,7.4,1984,218,3.87485
974,Aliens,"[Horror, Action, Thriller, Science Fiction]",3282.0,7.7,1986,679,3.80062
522,Terminator 2: Judgment Day,"[Action, Thriller, Science Fiction]",4274.0,7.7,1991,280,3.753275
1241,The Fifth Element,"[Adventure, Fantasy, Action, Thriller, Science...",3962.0,7.3,1997,18,3.684955
8401,Star Trek Into Darkness,"[Action, Adventure, Science Fiction]",4479.0,7.4,2013,54138,3.643864
3999,Vampire Hunter D: Bloodlust,"[Action, Adventure, Animation, Fantasy, Horror...",92.0,7.0,2000,15999,3.635952
3013,Titan A.E.,"[Animation, Action, Science Fiction, Family, A...",320.0,6.3,2000,7450,3.594047
8865,Star Wars: The Force Awakens,"[Action, Adventure, Science Fiction, Fantasy]",7993.0,7.5,2015,140607,3.592127
9004,Suicide Squad,"[Action, Adventure, Crime, Fantasy, Science Fi...",7717.0,5.9,2016,297761,3.579047
8658,X-Men: Days of Future Past,"[Action, Adventure, Fantasy, Science Fiction]",6155.0,7.5,2014,127585,3.526547


We see that for our hybrid recommender, we get different recommendations for different users although the movie is the same. Hence, our recommendations are more personalized and tailored towards particular users.
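
The two-stage idea behind `hybrid` can be seen in isolation with a toy example: first shortlist by content similarity, then re-rank the shortlist by the rating estimated for the specific user. Everything in this sketch is hypothetical (the titles, the similarity scores, the estimated ratings, and the name `hybrid_rank`); it only mirrors the structure of the function above.

```python
def hybrid_rank(user_id, seed_title, similarity, predict, shortlist_size=3, n=2):
    """Shortlist by content similarity, then order by the user's estimated rating."""
    # Stage 1: the titles most similar to the seed movie
    neighbours = sorted(similarity[seed_title].items(),
                        key=lambda kv: kv[1], reverse=True)[:shortlist_size]
    # Stage 2: re-rank the shortlist with the collaborative estimates
    rescored = [(title, predict(user_id, title)) for title, _ in neighbours]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return [title for title, _ in rescored[:n]]

# Hypothetical content similarity to the seed movie
similarity = {"Seed": {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.1}}

# Hypothetical estimated ratings per (user, title)
estimates = {
    ("u1", "A"): 2.0, ("u1", "B"): 4.5, ("u1", "C"): 3.0, ("u1", "D"): 5.0,
    ("u2", "A"): 4.0, ("u2", "B"): 1.5, ("u2", "C"): 3.5, ("u2", "D"): 5.0,
}

def predict(user, title):
    return estimates[(user, title)]

print(hybrid_rank("u1", "Seed", similarity, predict))  # ['B', 'C']
print(hybrid_rank("u2", "Seed", similarity, predict))  # ['A', 'C']
```

Title "D" has the highest estimated rating for both users but never appears, because it does not make the content-based shortlist. The shortlist keeps results related to the queried movie, while the re-ranking makes the order depend on the user.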

## Conclusion

In this notebook, I have built 4 different recommendation engines based on different ideas and algorithms. They are as follows:

1. **Simple Recommender:** This system used overall IMDB Vote Count and Vote Averages to build Top Movies Charts, in general and for a specific genre. The IMDB Weighted Rating System was used to calculate ratings on which the sorting was finally performed.
2. **Content Based Recommender:** We built two content based engines: one that took movie overviews and taglines as input, and another that took metadata such as cast, crew, genre and keywords to come up with predictions. We also devised a simple filter to give greater preference to movies with more votes and higher ratings.
3. **Collaborative Filtering:** We used the Surprise library to build a collaborative filter based on singular value decomposition. The RMSE obtained was less than 1, and the engine gave estimated ratings for a given user and movie.
4. **Hybrid Engine:** We brought together ideas from content based and collaborative filtering to build an engine that gave movie suggestions to a particular user based on the estimated ratings that it had internally calculated for that user.
