# Movies Recommender System

![](http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg)

I will be  implementing a few recommendation algorithms (content based, popularity based) and try to build  final recommendation system.

With us, we have two MovieLens datasets.
* **The Full Dataset:** Consists of 26,000,000 ratings and 750,000 tag applications applied to 45,000 movies by 270,000 users. Includes tag genome data with 12 million relevance scores across 1,100 tags.
* **The Small Dataset:** Comprises of 100,000 ratings and 1,300 tag applications applied to 9,000 movies by 700 users.

Firstly I will build a Simple Recommender using movies from the *Full Dataset* 
Then I will implement The Content Based recommender systems will make use of the small dataset (due to the computing power I possess being very limited).

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
import warnings; warnings.simplefilter('ignore')

## Simple Recommender

The Simple Recommender offers generalized recommendations to every user based on movie popularity and (sometimes) genre. The basic idea behind this recommender is that movies that are more popular and more critically acclaimed will have a higher probability of being liked by the average audience. This model does not give personalized recommendations based on the user. 

In [2]:
df = pd. read_csv('movies_metadata.csv')
df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [3]:
df['genres'] = df['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

I will use IMDB's *weighted rating* formula to construct my chart. Mathematically, it is represented as follows:

Weighted Rating (WR) = $(\frac{v}{v + m} . R) + (\frac{m}{v + m} . C)$


where,
* *v* is the number of votes for the movie
* *m* is the minimum votes required to be listed in the chart
* *R* is the average rating of the movie
* *C* is the mean vote across the whole report

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. for a movie to feature in the charts, it must have more votes than at least 95% of the movies in the list( i.e we will use 95 percentile system to get the minimum votes required to be listed in chart).

In [4]:
vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
C = vote_averages.mean()
C

5.244896612406511

In [5]:
m = vote_counts.quantile(0.95)
m

434.0

In [6]:
df['year'] = pd.to_datetime(df['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if x != np.nan else np.nan)

In [7]:
qualified_movies = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]
qualified_movies['vote_count'] = qualified_movies['vote_count'].astype('int')
qualified_movies['vote_average'] = qualified_movies['vote_average'].astype('int')
qualified_movies.shape

(2274, 6)

the minimum votes required to be listed in the chart is 434.0.

the mean vote across the whole report is 5.244896612406511

In [8]:
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [9]:
qualified_movies['weighted_rating'] = qualified_movies.apply(weighted_rating, axis=1)

In [10]:
qualified_movies = qualified_movies.sort_values('weighted_rating', ascending=False)

# Top Movies

In [11]:
qualified_movies.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,weighted_rating
15480,Inception,2010,14075,8,29.108149,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167259,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.213481,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.869599,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.950236,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.645403,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.307194,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,"[Adventure, Fantasy, Action]",7.851924


# Top Movies Based on Genres

In [12]:
new_df = df.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
new_df.name = 'genre'
gen_df = df.drop('genres', axis=1).join(new_df)

In [13]:
def genrebasedrec(genre, percentile=0.95):
    df = gen_df[gen_df['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)
    
    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    
    qualified['weighted_rating'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('weighted_rating', ascending=False).head(250)
    
    return qualified

# Top Romance Movies

In [14]:
genrebasedrec('Romance').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_rating
351,Forrest Gump,1994,8147,8,48.307194,7.86986
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457024,7.582757
876,Vertigo,1958,1162,8,18.20822,7.298862
40251,Your Name.,2016,1030,8,34.461252,7.235471
883,Some Like It Hot,1959,835,8,11.845107,7.117619
1132,Cinema Paradiso,1988,834,8,14.177005,7.116921
19901,Paperman,2012,734,8,7.198633,7.041055
37863,Sing Street,2016,669,8,10.672862,6.984338
1639,Titanic,1997,7770,7,26.88907,6.916316
19731,Silver Linings Playbook,2012,4840,7,14.488111,6.869789


# Top Action Movies

In [15]:
genrebasedrec('Action').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,weighted_rating
15480,Inception,2010,14075,8,29.108149,7.736766
12481,The Dark Knight,2008,12269,8,123.167259,7.702099
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,7.604772
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,7.577552
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,7.550351
256,Star Wars,1977,6778,8,42.149697,7.503157
1154,The Empire Strikes Back,1980,5998,8,19.470959,7.451086
4135,Scarface,1983,3017,8,11.299673,7.084315
9430,Oldboy,2003,2000,8,10.616859,6.813948
14551,Avatar,2009,12114,7,185.070892,6.805225


# Content Based Recommender

Why we need the content based recommender?why simple recommendor system was not enough?
 
Simple Recommender System  gives the same recommendation to everyone, regardless of the user's personal taste. If a person who loves romantic movies (and hates action) were to look at our Top 15 Chart, s/he wouldn't probably like most of the movies. If someone look at our charts by genre, he/she wouldn't still be getting the best recommendations.

To personalise our recommendations more, we will do **Content Based Filtering.** and we will try to improve it further so that we have better recommendations.

We will use the small dataset provided 

In [16]:
new_df = pd.read_csv('links_small.csv')
new_df = new_df[new_df['tmdbId'].notnull()]['tmdbId'].astype('int')

In [17]:
#deleting the rows with bad format data
df = df.drop([19730, 29503, 35587])

In [18]:
#Check Notebook for how and why I got these indices.
df['id'] = df['id'].astype('int')

In [19]:
small_data = df[df['id'].isin(new_df)]
small_data.shape

(9099, 25)

### Movie Description Based Recommender

Let us first try to build a recommender using movie descriptions and taglines.

In [20]:
small_data['tagline'] = small_data['tagline'].fillna('')
small_data['description'] = small_data['overview'] + small_data['tagline']
small_data['description'] = small_data['description'].fillna('')

In [21]:
tf = TfidfVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
tfidf_matrix = tf.fit_transform(small_data['description'])

In [22]:
tfidf_matrix.shape

(9099, 268124)

#### Cosine Similarity

We will be using the Cosine Similarity to calculate a numeric quantity that denotes the similarity between two movies. Mathematically, it is defined as follows:

$cosine(x,y) = \frac{x. y^\intercal}{||x||.||y||} $

Since we have used the TF-IDF Vectorizer, calculating the Dot Product will directly give us the Cosine Similarity Score. Therefore, we will use sklearn's **linear_kernel** instead of cosine_similarities since it is much faster.

In [23]:
similarity = linear_kernel(tfidf_matrix, tfidf_matrix)

In [24]:
similarity

array([[1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
        0.        ],
       [0.00680476, 1.        , 0.01531062, ..., 0.00357057, 0.00762326,
        0.        ],
       [0.        , 0.01531062, 1.        , ..., 0.        , 0.00286535,
        0.00472155],
       ...,
       [0.        , 0.00357057, 0.        , ..., 1.        , 0.07811616,
        0.        ],
       [0.00344913, 0.00762326, 0.00286535, ..., 0.07811616, 1.        ,
        0.        ],
       [0.        , 0.        , 0.00472155, ..., 0.        , 0.        ,
        1.        ]])

In [25]:
small_data = small_data.reset_index()
titles = small_data['title']
indices = pd.Series(small_data.index, index=small_data['title'])

In [26]:
def get_recommendations(title):
    idx = indices[title]
    similarity_scores = list(enumerate(similarity[idx]))
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)
    similarity_scores = similarity_scores[1:31]
    movie_indices = [i[0] for i in similarity_scores]
    return titles.iloc[movie_indices]

In [27]:
get_recommendations('The Family').head(10)

7391    Did You Hear About the Morgans?
692                       The Godfather
4665    Charlie's Angels: Full Throttle
2642               The Whole Nine Yards
8103                       Bachelorette
1774                     My Blue Heaven
647                              Eraser
2738                        Cool as Ice
3509                               Made
4196                 Johnny Dangerously
Name: title, dtype: object

In [28]:
get_recommendations('Batman Forever').head(10)

7931                      The Dark Knight Rises
2579               Batman: Mask of the Phantasm
6900                            The Dark Knight
6144                              Batman Begins
8165    Batman: The Dark Knight Returns, Part 1
524                                      Batman
1240                             Batman & Robin
1113                             Batman Returns
7565                 Batman: Under the Red Hood
7901                           Batman: Year One
Name: title, dtype: object

We see that for **Batman Forever**, this recommendation system is able to identify it as a Batman film and subsequently recommend other Batman films as its top recommendations. But unfortunately, that is all this system can do at the moment. This is not of much use to most people as it doesn't take into considerations very important features such as cast, crew, director and genre, which determine the rating and the popularity of a movie.

Someone who liked **The Dark Knight** probably likes it more because of Nolan and would hate **Batman Forever** and every other substandard movie in the Batman Franchise.

### Metadata Based Recommender

we are going to use much more suggestive metadata than **Overview** and **Tagline**. metadata_based recommender will take **genre**, **keywords**, **cast** and **crew** into consideration.

To build metadata based content recommender, we will need to merge our current dataset with the crew and the keyword datasets. Let us prepare this data as our first step.

In [29]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

In [30]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
df['id'] = df['id'].astype('int')

In [31]:
df.shape

(45463, 25)

In [32]:
df = df.merge(credits, on='id')
df = df.merge(keywords, on='id')

In [33]:
small_data1 = df[df['id'].isin(new_df)]
small_data1.shape

(9219, 28)

We now have our cast, crew, genres and credits, all in one dataframe.

1. **Crew:** From the crew, we will only pick the director as our feature since the others don't contribute that much to the *feel* of the movie.
2. **Cast:** Choosing Cast is a little more tricky. Lesser known actors and minor roles do not really affect people's opinion of a movie. Therefore, we must only select the major characters and their respective actors. Arbitrarily we will choose the top 3 actors that appear in the credits list. 

In [34]:
small_data1['cast'] = small_data1['cast'].apply(literal_eval)
small_data1['crew'] = small_data1['crew'].apply(literal_eval)
small_data1['keywords'] = small_data1['keywords'].apply(literal_eval)
small_data1['cast_size'] = small_data1['cast'].apply(lambda x: len(x))
small_data1['crew_size'] = small_data1['crew'].apply(lambda x: len(x))

In [35]:
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [36]:
small_data1['director'] = small_data1['crew'].apply(get_director)

In [37]:
small_data1['cast'] = small_data1['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
small_data1['cast'] = small_data1['cast'].apply(lambda x: x[:3] if len(x) >=3 else x)

In [38]:
small_data1['keywords'] = small_data1['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

We will be creating a metadata dump(combination) for every movie which consists of **genres, director, main actors and keywords.** I then use a **Count Vectorizer** to create our count matrix as we did in the Description Recommender. The remaining steps are similar to what we did earlier: we calculate the cosine similarities and return movies that are most similar.

1. **Strip Spaces and Convert to Lowercase** from all  features. This way, engine will not confuse between **Sam Wilson** and **Sam Jones.** 
2. We will Mention Director 2 times to give it more weight relative to the entire cast.

In [39]:
small_data1['cast'] = small_data1['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [40]:
small_data1['director'] = small_data1['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
small_data1['director'] = small_data1['director'].apply(lambda x: [x,x])

#### Keywords

We will do a small amount of pre-processing of our keywords before putting them to any use. As a first step, we will calculate the frequenct counts of every keyword that appears in the dataset.

In [41]:
s = small_data1.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [42]:
s = s.value_counts()
s

independent film         610
woman director           550
murder                   399
duringcreditsstinger     327
based on novel           318
                        ... 
summer cottage             1
kitchen sink realism       1
motion picture studio      1
driving in the rain        1
toyko                      1
Name: keyword, Length: 12940, dtype: int64

Keywords occur in frequencies ranging from 1 to 2170. We do not have any use for keywords that occur only once. Therefore, these can be safely removed. 

Finally, we will convert every word to its stem so that words such as *Dogs* and *Dog* are considered the same.

In [43]:
s = s[s >= 2]

In [44]:
stemmer = SnowballStemmer('english')
stemmer.stem('dogs')

'dog'

In [45]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [46]:
small_data1['keywords'] = small_data1['keywords'].apply(filter_keywords)
small_data1['keywords'] = small_data1['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
small_data1['keywords'] = small_data1['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [47]:
small_data1['combination'] = small_data1['keywords'] + small_data1['cast'] + small_data1['director'] + small_data1['genres']
small_data1['combination'] = small_data1['combination'].apply(lambda x: ' '.join(x))

In [48]:
from tmdbv3api import TMDb
import json
import requests
tmdb = TMDb()
tmdb.api_key = '68b8a37c9ca19b233cc057643bfbb9eb'
from tmdbv3api import Movie
tmdb_movie = Movie()
def get_poster(x):
    response = requests.get('https://api.themoviedb.org/3/movie/{}?api_key={}'.format(x,tmdb.api_key))
    if response.status_code==200:
        data_json = response.json()
        if data_json['poster_path']:
            poster_str = 'https://image.tmdb.org/t/p/w500'+data_json['poster_path']
            return poster_str
        else:
            return 'static/default.jpg'
    return 'static/default.jpg'

In [49]:
count = CountVectorizer(analyzer='word',ngram_range=(1, 2),min_df=0, stop_words='english')
count_matrix = count.fit_transform(small_data1['combination'])

In [50]:
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [51]:
small_data1 = small_data1.reset_index()
titles = small_data1['title']
indices = pd.Series(small_data1.index, index=small_data1['title'])

In [52]:
small_data1['poster']=np.nan
for i in range(2000):
    small_data1['poster'][i]=get_poster(small_data1['id'][i])

ContentDecodingError: ('Received response with content-encoding: gzip, but failed to decode it.', error('Error -3 while decompressing data: incorrect header check'))

In [None]:
for i in range(2000,4000):
    small_data1['poster'][i]=get_poster(small_data1['id'][i])

In [None]:
for i in range(4000,7000):
    small_data1['poster'][i]=get_poster(small_data1['id'][i])

In [None]:
for i in range(7000,9219):
    small_data1['poster'][i]=get_poster(small_data1['id'][i])

In [None]:
def get_recommendations1(title):
    idx = indices[title]
    sim_scores = list(enumerate(cosine_sim[idx]))
    try:
        sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    except:
        sim_scores = sorted(sim_scores, key=lambda x: x[1][1], reverse=True)  
    sim_scores = sim_scores[1:11]
    movie_indices = [i[0] for i in sim_scores]
    tit = small_data1['title'].iloc[movie_indices]
    dat = small_data1['release_date'].iloc[movie_indices]
    rating = small_data1['vote_average'].iloc[movie_indices]
    moviedetails=small_data1['overview'].iloc[movie_indices]
    movietypes=small_data1['keywords'].iloc[movie_indices]
    movieid=small_data1['id'].iloc[movie_indices]
    org_title=small_data1['original_title'].iloc[movie_indices]
    poster=small_data1['poster'].iloc[movie_indices]
    
    
    return_df = pd.DataFrame(columns=['Title','Year'])
    return_df['Title'] = tit
    return_df['Year'] = dat
    return_df['Ratings'] = rating
    return_df['Overview']=moviedetails
    return_df['Types']=movietypes
    return_df['ID']=movieid
    return_df['org_title']=org_title
    return_df['poster'] =poster
    sorted_df = return_df.sort_values(by=['Ratings'], ascending=False)
    return sorted_df

In [None]:
get_recommendations1("The Dark Knight")

In [None]:
small_data1.to_csv('final_movies_data.csv',index=False)

#  We find this recommender system quite good and will be using to to recommend movies on our web application.