<a href="https://colab.research.google.com/github/guddushah/Movie-Recommender-System-Dissertation/blob/main/Movie_recommender_systems.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Algorithms:

1. Simple Recommender System based on movie ratings
2. Content-Based
   * (i) Movie Overviews and Taglines: description based
   * (ii) Movie Cast, Crew, Keywords and Genre: metadata based
3. Popularity-Based
4. Collaborative Filtering
5. Ensemble method


![](http://labs.criteo.com/wp-content/uploads/2017/08/CustomersWhoBought3.jpg)

In [23]:
!pip install scikit-surprise



In [24]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from ast import literal_eval
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics.pairwise import linear_kernel, cosine_similarity
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import wordnet
from surprise import Reader, Dataset, SVD
from surprise.model_selection import cross_validate

import warnings; warnings.simplefilter('ignore')

In [25]:
import os
import zipfile
os.environ['KAGGLE_CONFIG_DIR']="/content/"

!kaggle datasets download rounakbanik/the-movies-dataset

the-movies-dataset.zip: Skipping, found more recently modified local copy (use --force to force download)


In [26]:
! unzip "/content/the-movies-dataset.zip"

Archive:  /content/the-movies-dataset.zip
  inflating: credits.csv             
  inflating: keywords.csv            
  inflating: links.csv               
  inflating: links_small.csv         
  inflating: movies_metadata.csv     
  inflating: ratings.csv             
  inflating: ratings_small.csv       


# **Analysing Data**

In [27]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:
pd.set_option('display.max_colwidth', 50)

In [29]:
md = pd.read_csv('movies_metadata.csv')
md.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [30]:
md.head().transpose()

Unnamed: 0,0,1,2,3,4
adult,False,False,False,False,False
belongs_to_collection,"{'id': 10194, 'name': 'Toy Story Collection', ...",,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",,"{'id': 96871, 'name': 'Father of the Bride Col..."
budget,30000000,65000000,0,16000000,0
genres,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...","[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...","[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...","[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...","[{'id': 35, 'name': 'Comedy'}]"
homepage,http://toystory.disney.com/toy-story,,,,
id,862,8844,15602,31357,11862
imdb_id,tt0114709,tt0113497,tt0113228,tt0114885,tt0113041
original_language,en,en,en,en,en
original_title,Toy Story,Jumanji,Grumpier Old Men,Waiting to Exhale,Father of the Bride Part II
overview,"Led by Woody, Andy's toys live happily in his ...",When siblings Judy and Peter discover an encha...,A family wedding reignites the ancient feud be...,"Cheated on, mistreated and stepped on, the wom...",Just when George Banks has recovered from his ...


**Understanding the Dataset**

The dataset on Kaggle was originally obtained through the TMDB API. Its movies correspond to those listed in the MovieLens Latest Full Dataset, which comprises 26 million ratings on 45,000 movies from 270,000 users.

In [31]:
#displaying columns present in the dataset
md.columns

Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count'],
      dtype='object')

In [32]:
#displaying number of rows and columns in the dataset
md.shape

(45466, 24)

In [33]:
#Checking for missing values
md.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  45466 non-null  object 
 1   belongs_to_collection  4494 non-null   object 
 2   budget                 45466 non-null  object 
 3   genres                 45466 non-null  object 
 4   homepage               7782 non-null   object 
 5   id                     45466 non-null  object 
 6   imdb_id                45449 non-null  object 
 7   original_language      45455 non-null  object 
 8   original_title         45466 non-null  object 
 9   overview               44512 non-null  object 
 10  popularity             45461 non-null  object 
 11  poster_path            45080 non-null  object 
 12  production_companies   45463 non-null  object 
 13  production_countries   45463 non-null  object 
 14  release_date           45379 non-null  object 
 15  re

There are a total of 45,466 movies with 24 features. Most features have very few NaN values; the exceptions are belongs_to_collection, homepage and tagline.

Further preprocessing is performed separately for each recommender system.

## Simple Recommender

This recommender gives the same general recommendations to every user, based on movie popularity and genre. No personalised recommendations are made.

Idea: popular, well-rated movies are more likely to be liked by a large audience.

Implementation: sort the movies by rating and popularity, then display the top of the list.

Top movies can also be recommended per genre.


In [34]:
md['genres'] = md['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
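
As a small illustration of what the cell above does to a single value, the sketch below parses one stringified list of dictionaries and keeps only the genre names. The sample string and the helper name `extract_names` are assumptions made for this example; only `literal_eval` comes from the notebook.

```python
from ast import literal_eval

# Hypothetical raw value, shaped like the `genres` strings shown in md.head()
raw_genres = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

def extract_names(raw):
    """Parse a stringified list of dicts and keep only each 'name' value."""
    parsed = literal_eval(raw)
    # Anything that is not a list (for example a stray number) yields no genres
    return [item['name'] for item in parsed] if isinstance(parsed, list) else []

genre_names = extract_names(raw_genres)
print(genre_names)
```

An empty list string (`'[]'`, the fill value used for missing genres) simply produces an empty list of names.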

IMDB's *weighted rating* formula is used to construct the Top Movies Chart. The formula is given as:

Weighted Rating (WR) = $\left(\frac{v}{v + m} \cdot R\right) + \left(\frac{m}{v + m} \cdot C\right)$

where,
* *v* = number of votes for the movie
* *m* = minimum votes required to be listed in the chart
* *R* = average rating of the movie
* *C* = mean vote across the whole report

The next step is to determine an appropriate value for *m*, the minimum votes required to be listed in the chart. We will use the **95th percentile** as our cutoff. In other words, for a movie to feature in the charts, it must have at least as many votes as 95% of the movies in the list.
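
To make the percentile cutoff concrete, here is a minimal sketch on made-up data. The series `toy_votes` is hypothetical (100 movies with 1 to 100 votes); it only shows how `Series.quantile(0.95)` yields a threshold that roughly the top 5% of movies reach.

```python
import pandas as pd

# Hypothetical vote counts: 100 movies with 1, 2, ..., 100 votes
toy_votes = pd.Series(range(1, 101))

# Linearly interpolated 95th percentile of the vote counts
cutoff = toy_votes.quantile(0.95)

# Movies whose vote count reaches the cutoff
n_qualified = int((toy_votes >= cutoff).sum())
print(cutoff, n_qualified)
```

On this toy series the cutoff falls between 95 and 96 votes, so only the five most-voted movies qualify.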


In [35]:
#getting the value of v (number of votes) from dataset
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

#getting the value of R (average rating of the movie) from dataset
#note: astype('int') truncates the rating, e.g. 7.7 becomes 7
vote_averages = md[md['vote_average'].notnull()]['vote_average'].astype('int')

#Calculating the value of mean votes C
C = vote_averages.mean()
C

5.244896612406511

In [36]:
#Calculating the value of m which is minimum votes required to be listed in the chart
m = vote_counts.quantile(0.95)
m

434.0

In [37]:
#fetching year from the release date and adding it as separate col in dataframe
#(pd.notnull is used because a comparison with np.nan via != is always True)
md['year'] = pd.to_datetime(md['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)

In [38]:
#Creating a dataframe with useful columns only
#Title, Year, Vote Count, Vote Average, Popularity, Genres
#Extracting information about movies where vote count and vote average are not null
# and vote count is at least the value of m = 434 (here)
qualified = md[(md['vote_count'] >= m) &
               (md['vote_count'].notnull()) &
               (md['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity', 'genres']]

#Converting vote count and vote average values from float to int (vote_average is truncated, not rounded)
qualified['vote_count'] = qualified['vote_count'].astype('int')
qualified['vote_average'] = qualified['vote_average'].astype('int')
qualified.shape

(2274, 6)

**Observation:**

To be considered for the chart, a movie must have at least **434 votes** on TMDB. The mean rating for a movie on TMDB, computed after the ratings are truncated to integers, is about **5.24** on a scale of 10. **2274** movies qualify to be on our chart.
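
As a worked check of the weighted rating formula, the sketch below plugs in the values reported in the cells above (m = 434, C ≈ 5.2449) for a movie with 14,075 votes and a truncated rating of 8. The function name `weighted_rating_value` is an assumption for this example; it takes m and C as explicit arguments instead of reading them from globals.

```python
def weighted_rating_value(v, R, m, C):
    """IMDB-style weighted rating: blends the movie's own rating R with the global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

m_reported = 434.0               # 95th percentile of vote counts reported above
C_reported = 5.244896612406511   # mean (truncated) rating reported above

# A heavily voted movie stays close to its own rating
wr_many_votes = weighted_rating_value(14075, 8, m_reported, C_reported)

# A movie with exactly m votes lands halfway between R and C
wr_at_cutoff = weighted_rating_value(434, 8, m_reported, C_reported)

print(round(wr_many_votes, 6), round(wr_at_cutoff, 6))
```

The more votes a movie has relative to m, the less its score is pulled towards the global mean C.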

In [39]:
#Defining function for calculating IMDB's weighted rating
def weighted_rating(x):
    v = x['vote_count']
    R = x['vote_average']
    return (v/(v+m) * R) + (m/(m+v) * C)

In [40]:
#Calculating and adding IMDB weighted rating in dataframe
qualified['wr'] = qualified.apply(weighted_rating, axis=1)

In [41]:
#Sorting the values for movies according to IMDB weighted rating
#Fetching only Top 250 movies
qualified = qualified.sort_values('wr', ascending=False).head(250)

### Top Movies

In [42]:
#Displaying Top 15 Movies
qualified.head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,genres,wr
15480,Inception,2010,14075,8,29.108149,"[Action, Thriller, Science Fiction, Mystery, A...",7.917588
12481,The Dark Knight,2008,12269,8,123.167259,"[Drama, Action, Crime, Thriller]",7.905871
22879,Interstellar,2014,11187,8,32.213481,"[Adventure, Drama, Science Fiction]",7.897107
2843,Fight Club,1999,9678,8,63.869599,[Drama],7.881753
4863,The Lord of the Rings: The Fellowship of the Ring,2001,8892,8,32.070725,"[Adventure, Fantasy, Action]",7.871787
292,Pulp Fiction,1994,8670,8,140.950236,"[Thriller, Crime]",7.86866
314,The Shawshank Redemption,1994,8358,8,51.645403,"[Drama, Crime]",7.864
7000,The Lord of the Rings: The Return of the King,2003,8226,8,29.324358,"[Adventure, Fantasy, Action]",7.861927
351,Forrest Gump,1994,8147,8,48.307194,"[Comedy, Drama, Romance]",7.860656
5814,The Lord of the Rings: The Two Towers,2002,7641,8,29.423537,"[Adventure, Fantasy, Action]",7.851924


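Three Christopher Nolan films, **Inception**, **The Dark Knight** and **Interstellar**, occur at the very top of the chart. Because vote_average was truncated to an integer, many films share a rating of 8 and are then ordered mainly by vote count.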


In [43]:
#Generating record for each movie for each genre in new dataframe
s = md.apply(lambda x: pd.Series(x['genres']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'genre'
gen_md = md.drop('genres', axis=1).join(s)
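
The same "one row per movie per genre" reshaping can probably be expressed with `DataFrame.explode`, which avoids building a `pd.Series` per row. The sketch below is an alternative shown on a hypothetical three-row frame (`toy`), not the notebook's own method; like the stack-and-join approach, it leaves a NaN genre for movies with an empty genre list.

```python
import pandas as pd

# Hypothetical miniature of the metadata frame
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres': [['Comedy', 'Drama'], ['Action'], []],
})

# One row per (movie, genre); an empty list becomes a single NaN row
toy_gen = toy.explode('genres').rename(columns={'genres': 'genre'})
print(toy_gen)
```

The original index is repeated for each genre of a movie, matching the repeated index labels seen in the genre charts.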

In [44]:
gen_md.shape

(93548, 25)

In [45]:
#Different categories/genres of movies
gen_md.genre.unique()

array(['Animation', 'Comedy', 'Family', 'Adventure', 'Fantasy', 'Romance',
       'Drama', 'Action', 'Crime', 'Thriller', 'Horror', 'History',
       'Science Fiction', 'Mystery', 'War', 'Foreign', nan, 'Music',
       'Documentary', 'Western', 'TV Movie', 'Carousel Productions',
       'Vision View Entertainment', 'Telescene Film Group Productions',
       'Aniplex', 'GoHands', 'BROSTA TV',
       'Mardock Scramble Production Committee', 'Sentai Filmworks',
       'Odyssey Media', 'Pulser Productions', 'Rogue State', 'The Cartel'],
      dtype=object)

In [46]:
#Displaying Top 250 movies belonging to the specified genre
#by using IMDB weighted ratings for each set of movies belonging to similar genres
def build_chart(genre, percentile=0.95):
    df = gen_md[gen_md['genre'] == genre]
    vote_counts = df[df['vote_count'].notnull()]['vote_count'].astype('int')
    vote_averages = df[df['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(percentile)

    qualified = df[(df['vote_count'] >= m) & (df['vote_count'].notnull()) & (df['vote_average'].notnull())][['title', 'year', 'vote_count', 'vote_average', 'popularity']]
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')

    qualified['wr'] = qualified.apply(lambda x: (x['vote_count']/(x['vote_count']+m) * x['vote_average']) + (m/(m+x['vote_count']) * C), axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(250)

    return qualified



### Top 15 Movies for some genres

In [47]:
build_chart('Comedy').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
351,Forrest Gump,1994,8147,8,48.307194,7.843994
1225,Back to the Future,1985,6239,8,25.778509,7.79985
18465,The Intouchables,2011,5410,8,16.086919,7.771794
22841,The Grand Budapest Hotel,2014,4644,8,14.442048,7.737837
2211,Life Is Beautiful,1997,3643,8,39.39497,7.674556
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457024,7.408901
732,Dr. Strangelove or: How I Learned to Stop Worr...,1964,1472,8,9.80398,7.316986
3342,Modern Times,1936,881,8,8.159556,7.025524
883,Some Like It Hot,1959,835,8,11.845107,6.992045
26564,Deadpool,2016,11444,7,187.860492,6.929222


In [48]:
build_chart('Romance').head(15)

Unnamed: 0,title,year,vote_count,vote_average,popularity,wr
351,Forrest Gump,1994,8147,8,48.307194,7.86986
10309,Dilwale Dulhania Le Jayenge,1995,661,9,34.457024,7.582757
876,Vertigo,1958,1162,8,18.20822,7.298862
40251,Your Name.,2016,1030,8,34.461252,7.235471
883,Some Like It Hot,1959,835,8,11.845107,7.117619
1132,Cinema Paradiso,1988,834,8,14.177005,7.116921
19901,Paperman,2012,734,8,7.198633,7.041055
37863,Sing Street,2016,669,8,10.672862,6.984338
1639,Titanic,1997,7770,7,26.88907,6.916316
19731,Silver Linings Playbook,2012,4840,7,14.488111,6.869789


## Content Based Recommender

The Simple Recommender has a clear limitation: it recommends the same movies to everyone, irrespective of a user's personal interests.

Idea: generate personalised recommendations by building an engine that computes the similarity between movies based on certain metrics and suggests the movies most similar to one that a user liked.

Since we use movie metadata (or content) to build this engine, the approach is also known as **Content Based Filtering.**

Two Content Based Recommenders based on:
* Movie Overviews and Taglines
* Movie Cast, Crew, Keywords and Genre

To build this recommender, a subset of all the available movies is used because of limited computing power.

**Preparing Small Dataset**

In [49]:
links_small = pd.read_csv('links_small.csv')
links_small.head()

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0
3,4,114885,31357.0
4,5,113041,11862.0


In [50]:
#Fetching non-null tmdbId values
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype('int')
links_small.head()

0      862
1     8844
2    15602
3    31357
4    11862
Name: tmdbId, dtype: int64

In [51]:
#Dropping these three records as their id values are not valid integers
md = md.drop([19730, 29503, 35587])

In [52]:
md['id'] = md['id'].astype('int')

In [53]:
#fetching metadata about the movies present in subset
#(.copy() so that columns can be added later without modifying a view of md)
smd = md[md['id'].isin(links_small)].copy()
smd.shape

(9099, 25)

We have **9099** movies available in our small movies metadata dataset, which is about 5 times smaller than our original dataset of roughly 45,000 movies.

### Movie Description Based Recommender

We build a recommender using movie descriptions and taglines. We do not have a quantitative metric to judge the recommender's performance, so it has to be evaluated qualitatively.

In [54]:
#Filling null values of tagline
smd['tagline'] = smd['tagline'].fillna('')
#Adding description column by concatenating overview and tagline column
smd['description'] = smd['overview'] + smd['tagline']
#Filling null values of description
smd['description'] = smd['description'].fillna('')

In [55]:
#Creating TF-IDF Matrix using description of movies
#min_df=1 keeps every term (recent scikit-learn versions reject the integer 0)
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
tfidf_matrix = tf.fit_transform(smd['description'])

In [56]:
tfidf_matrix.shape

(9099, 268124)

#### Cosine Similarity
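
The cell below calls `linear_kernel` rather than `cosine_similarity`. This works because `TfidfVectorizer` L2-normalises each row by default (`norm='l2'`), and for unit-length vectors the plain dot product equals the cosine similarity. The sketch illustrates that identity with NumPy on a small made-up matrix `X`; it is an illustration of the idea, not the notebook's code.

```python
import numpy as np

# Hypothetical term-weight matrix: 3 documents, 3 terms
X = np.array([[1.0, 2.0, 0.0],
              [0.0, 1.0, 1.0],
              [3.0, 0.0, 4.0]])

# Scale every row to unit L2 length, as TF-IDF output already is by default
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Dot products of unit rows (what a linear kernel computes on normalised input)
dot_products = X_unit @ X_unit.T

# Cosine similarity computed from its definition on the unnormalised rows
norms = np.linalg.norm(X, axis=1)
cosine = (X @ X.T) / np.outer(norms, norms)

print(np.allclose(dot_products, cosine))
```

Skipping the extra normalisation step is what makes the dot-product route cheaper on an already normalised TF-IDF matrix.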



In [57]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [58]:
cosine_sim[0]

array([1.        , 0.00680476, 0.        , ..., 0.        , 0.00344913,
       0.        ])

We now have a pairwise cosine similarity matrix for all the movies in our dataset. The next step is to write a function that returns the 30 most similar movies based on the cosine similarity score.

In [59]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

In [60]:
indices

title
Toy Story                                                0
Jumanji                                                  1
Grumpier Old Men                                         2
Waiting to Exhale                                        3
Father of the Bride Part II                              4
                                                      ... 
Shin Godzilla                                         9094
The Beatles: Eight Days a Week - The Touring Years    9095
Pokémon: Spell of the Unknown                         9096
Pokémon 4Ever: Celebi - Voice of the Forest           9097
Force Majeure                                         9098
Length: 9099, dtype: int64

In [61]:
#Generating Recommendations
def get_recommendations(title):
    #Getting the index of the title of the movie
    idx = indices[title]
    print("Index of the movie is: ",idx)

    #Extracting cosine similarity of the given movie with all the remaining movies
    sim_scores = list(enumerate(cosine_sim[idx]))
    print(sim_scores)

    #Sorting Similarity Score in descending order
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    print('Sorted Similarity Score in descending order:')
    print(sim_scores)

    #Extracting Top 30 similarity score
    sim_scores = sim_scores[1:31]

    #Extracting movie names for the matched indices
    movie_indices = [i[0] for i in sim_scores]
    print('Recommended Movie indices:')
    print(movie_indices)

    return titles.iloc[movie_indices]

We're all set. Let us now try and get the top recommendations for a few movies and see how good the recommendations are.
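
The core of the lookup is easier to see on a toy similarity row. The sketch below is a simplified variant, with hypothetical names (`top_n_similar`, `toy_sim_row`): it excludes the query movie by its index instead of assuming it is the first entry after sorting, and it omits the debugging prints.

```python
def top_n_similar(sim_row, idx, n=3):
    """Return the positions of the n highest scores in sim_row, excluding idx itself."""
    scored = [(i, score) for i, score in enumerate(sim_row) if i != idx]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [i for i, _ in scored[:n]]

# Hypothetical similarity of movie 1 to five movies (its own score is 1.0)
toy_sim_row = [0.10, 1.00, 0.40, 0.00, 0.25]

top = top_n_similar(toy_sim_row, idx=1, n=3)
print(top)
```

The returned positions would then be passed to `titles.iloc[...]` to obtain movie names.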

In [62]:
get_recommendations('Jumanji').head(10)

Index of the movie is:  1
[(0, 0.006804755671748422), (1, 1.0000000000000007), (2, 0.015310620291169728), (3, 0.0), (4, 0.0022368099402477032), (5, 0.014747265950384272), (6, 0.0), (7, 0.0), (8, 0.03313705135498922), (9, 0.0), (10, 0.004282524058444696), (11, 0.0), (12, 0.0), (13, 0.004048799784885859), (14, 0.004143369506587468), (15, 0.0), (16, 0.0), (17, 0.007494114259123057), (18, 0.0), (19, 0.0), (20, 0.0), (21, 0.0), (22, 0.001702176056020182), (23, 0.0), (24, 0.0), (25, 0.006001303866813543), (26, 0.007834123900175035), (27, 0.0024136787615748217), (28, 0.0), (29, 0.0), (30, 0.0), (31, 0.014982782214356552), (32, 0.0016392353355782388), (33, 0.0015213134697954999), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.0), (39, 0.0), (40, 0.0), (41, 0.0), (42, 0.009429799536455586), (43, 0.0), (44, 0.0), (45, 0.005232515853105469), (46, 0.0020420853293543584), (47, 0.0), (48, 0.006252261588711543), (49, 0.0), (50, 0.010085204776655761), (51, 0.0), (52, 0.0), (53, 0.0), (54, 0.0), (5

8889                       Pixels
8608      Guardians of the Galaxy
6392                   Stay Alive
8154               Wreck-It Ralph
3196           Dungeons & Dragons
8670                        Ouija
5356     Night of the Living Dead
8211             Would You Rather
6323                Grandma's Boy
4082    The Giant Spider Invasion
Name: title, dtype: object

In [63]:
get_recommendations('The Dark Knight').head(10)

Index of the movie is:  6900
[(0, 0.0), (1, 0.007774131635450035), (2, 0.0), (3, 0.0), (4, 0.0), (5, 0.004610245611341961), (6, 0.0), (7, 0.0), (8, 0.003941701813242567), (9, 0.0), (10, 0.0), (11, 0.015033901723582844), (12, 0.0), (13, 0.0), (14, 0.0), (15, 0.0), (16, 0.0), (17, 0.0), (18, 0.003060523962865731), (19, 0.0), (20, 0.0), (21, 0.0), (22, 0.0), (23, 0.00554416072159223), (24, 0.0), (25, 0.0), (26, 0.0), (27, 0.0), (28, 0.0), (29, 0.005685807213192394), (30, 0.0), (31, 0.0), (32, 0.0026321591707230527), (33, 0.002447686357407317), (34, 0.0), (35, 0.0), (36, 0.0), (37, 0.0), (38, 0.009209792148487802), (39, 0.0), (40, 0.0031659362621153796), (41, 0.0), (42, 0.009708064101583662), (43, 0.0), (44, 0.0026621220185520022), (45, 0.0), (46, 0.003556791935000713), (47, 0.0), (48, 0.00807551441490781), (49, 0.0), (50, 0.004042728575225387), (51, 0.003068606774275635), (52, 0.0), (53, 0.004286712312209404), (54, 0.0), (55, 0.0), (56, 0.0), (57, 0.0), (58, 0.0), (59, 0.00665539905261634

7931                      The Dark Knight Rises
132                              Batman Forever
1113                             Batman Returns
8227    Batman: The Dark Knight Returns, Part 2
7565                 Batman: Under the Red Hood
524                                      Batman
7901                           Batman: Year One
2579               Batman: Mask of the Phantasm
2696                                        JFK
8165    Batman: The Dark Knight Returns, Part 1
Name: title, dtype: object

### Metadata Based Recommender

To build our metadata based content recommender, we need to merge our current dataset with the credits (cast and crew) and keywords datasets. Let us prepare this data as our first step.

In [64]:
credits = pd.read_csv('credits.csv')
keywords = pd.read_csv('keywords.csv')

In [65]:
credits.head()

Unnamed: 0,cast,crew,id
0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...",862
1,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...",8844
2,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...",15602
3,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...",31357
4,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...",11862


In [66]:
keywords.head()

Unnamed: 0,id,keywords
0,862,"[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,8844,"[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,15602,"[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,31357,"[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,11862,"[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


In [67]:
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
md['id'] = md['id'].astype('int')

In [68]:
md.shape

(45463, 25)

In [69]:
#merging cast, crew, and keywords with the original dataset
md = md.merge(credits, on='id')
md = md.merge(keywords, on='id')

In [70]:
#Creating a subset (.copy() so that new columns do not write into a view of md)
smd = md[md['id'].isin(links_small)].copy()
smd.shape

(9219, 28)

In [71]:
#Parsing the stringified cast, crew and keywords columns into Python lists
#(the director and the top 3 actors are picked from them in later cells)
smd['cast'] = smd['cast'].apply(literal_eval)
smd['crew'] = smd['crew'].apply(literal_eval)
smd['keywords'] = smd['keywords'].apply(literal_eval)

#Calculating Cast and Crew size and storing in dataframe
smd['cast_size'] = smd['cast'].apply(len)
smd['crew_size'] = smd['crew'].apply(len)


In [72]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,title,video,vote_average,vote_count,year,cast,crew,keywords,cast_size,crew_size
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Toy Story,False,7.7,5415.0,1995,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,...",13,106
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Jumanji,False,6.9,2413.0,1995,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1...",26,16
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Grumpier Old Men,False,6.5,92.0,1995,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392...",7,4
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Waiting to Exhale,False,6.1,34.0,1995,"[{'cast_id': 1, 'character': 'Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':...",10,10
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Father of the Bride Part II,False,5.7,173.0,1995,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n...",12,7


In [73]:
#Fetching name of director
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan
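
A quick sanity check of this lookup on a made-up crew list is sketched below. The crew entries are hypothetical, and the function is restated with `math.nan` so the example runs on its own. It returns the first crew member whose job is 'Director', so co-directors after the first are ignored.

```python
import math

def get_director(crew):
    """Return the name of the first crew member with job 'Director', or NaN if none."""
    for member in crew:
        if member.get('job') == 'Director':
            return member['name']
    return math.nan

# Hypothetical crew list in the same shape as the parsed `crew` column
toy_crew = [
    {'job': 'Screenplay', 'name': 'Jane Writer'},
    {'job': 'Director', 'name': 'John Director'},
    {'job': 'Director', 'name': 'Second Director'},
]

print(get_director(toy_crew))
print(get_director([]))
```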

In [74]:
#Adding director name in dataframe
smd['director'] = smd['crew'].apply(get_director)

In [75]:
#Fetching Top 3 cast actors for each movie
smd['cast'] = smd['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
#Slicing already handles lists with fewer than 3 names
smd['cast'] = smd['cast'].apply(lambda x: x[:3])

In [76]:
smd['keywords'] = smd['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])

In [77]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,video,vote_average,vote_count,year,cast,crew,keywords,cast_size,crew_size,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,False,7.7,5415.0,1995,"[Tom Hanks, Tim Allen, Don Rickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy, friendship, friends, riva...",13,106,John Lasseter
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,False,6.9,2413.0,1995,"[Robin Williams, Jonathan Hyde, Kirsten Dunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[board game, disappearance, based on children'...",26,16,Joe Johnston
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,False,6.5,92.0,1995,"[Walter Matthau, Jack Lemmon, Ann-Margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fishing, best friend, duringcreditsstinger, o...",7,4,Howard Deutch
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,False,6.1,34.0,1995,"[Whitney Houston, Angela Bassett, Loretta Devine]","[{'credit_id': '52fe44779251416c91011acb', 'de...","[based on novel, interracial relationship, sin...",10,10,Forest Whitaker
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,False,5.7,173.0,1995,"[Steve Martin, Diane Keaton, Martin Short]","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[baby, midlife crisis, confidence, aging, daug...",12,7,Charles Shyer


In [78]:
#Stripping spaces and converting to lower case
smd['cast'] = smd['cast'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [79]:
#Stripping spaces and converting to lower case
#Repeating director name thrice to give more weight to the feature
smd['director'] = smd['director'].astype('str').apply(lambda x: str.lower(x.replace(" ", "")))
smd['director'] = smd['director'].apply(lambda x: [x, x, x])

In [80]:
smd.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,video,vote_average,vote_count,year,cast,crew,keywords,cast_size,crew_size,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,False,7.7,5415.0,1995,"[tomhanks, timallen, donrickles]","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy, friendship, friends, riva...",13,106,"[johnlasseter, johnlasseter, johnlasseter]"
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,False,6.9,2413.0,1995,"[robinwilliams, jonathanhyde, kirstendunst]","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[board game, disappearance, based on children'...",26,16,"[joejohnston, joejohnston, joejohnston]"
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,False,6.5,92.0,1995,"[waltermatthau, jacklemmon, ann-margret]","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fishing, best friend, duringcreditsstinger, o...",7,4,"[howarddeutch, howarddeutch, howarddeutch]"
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,False,6.1,34.0,1995,"[whitneyhouston, angelabassett, lorettadevine]","[{'credit_id': '52fe44779251416c91011acb', 'de...","[based on novel, interracial relationship, sin...",10,10,"[forestwhitaker, forestwhitaker, forestwhitaker]"
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,False,5.7,173.0,1995,"[stevemartin, dianekeaton, martinshort]","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[baby, midlife crisis, confidence, aging, daug...",12,7,"[charlesshyer, charlesshyer, charlesshyer]"


#### Keywords

In [81]:
s = smd.apply(lambda x: pd.Series(x['keywords']),axis=1).stack().reset_index(level=1, drop=True)
s.name = 'keyword'

In [82]:
#Calculating frequency of each keyword
s = s.value_counts()
s[:5]

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
Name: keyword, dtype: int64

In [83]:
#Removing keywords that occur only once
s = s[s > 1]

In [84]:
#Stemming keywords
stemmer = SnowballStemmer('english')

In [85]:
def filter_keywords(x):
    words = []
    for i in x:
        if i in s:
            words.append(i)
    return words

In [86]:
smd['keywords'] = smd['keywords'].apply(filter_keywords)
smd['keywords'] = smd['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])
smd['keywords'] = smd['keywords'].apply(lambda x: [str.lower(i.replace(" ", "")) for i in x])

In [87]:
#Concatenating keywords, Cast, Director, and Genres and adding a column in dataframe
smd['soup'] = smd['keywords'] + smd['cast'] + smd['director'] + smd['genres']
smd['soup'] = smd['soup'].apply(lambda x: ' '.join(x))

In [88]:
smd.soup

0        jealousi toy boy friendship friend rivalri boy...
1        boardgam disappear basedonchildren'sbook newho...
2        fish bestfriend duringcreditssting waltermatth...
3        basedonnovel interracialrelationship singlemot...
4        babi midlifecrisi confid age daughter motherda...
                               ...                        
40952    friendship sidneypoitier wendycrewson jayo.san...
41172    bollywood akshaykumar ileanad'cruz eshagupta t...
41225    bollywood hrithikroshan poojahegde kabirbedi a...
41391    monster godzilla giantmonst destruct kaiju hir...
41669    music documentari paulmccartney ringostarr joh...
Name: soup, Length: 9219, dtype: object

In [89]:
#min_df=1 keeps every term; an integer min_df of 0 is rejected by recent scikit-learn releases
count = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')
count_matrix = count.fit_transform(smd['soup'])

In [90]:
#Calculating Cosine Similarity
cosine_sim = cosine_similarity(count_matrix, count_matrix)

In [91]:
smd = smd.reset_index()
titles = smd['title']
indices = pd.Series(smd.index, index=smd['title'])

We will reuse the `get_recommendations` function that we wrote earlier. Since our cosine similarity scores have changed, we expect it to give us different (and probably better) results.

In [92]:
get_recommendations('The Dark Knight').head(10)

Index of the movie is:  6981
[(0, 0.0), (1, 0.0), (2, 0.0), (3, 0.020965696734438367), (4, 0.0), (5, 0.08481041912882635), ...]

8031         The Dark Knight Rises
6218                 Batman Begins
6623                  The Prestige
2085                     Following
7648                     Inception
4145                      Insomnia
3381                       Memento
8613                  Interstellar
7659    Batman: Under the Red Hood
1134                Batman Returns
Name: title, dtype: object

The recommendations seem to have recognized other Christopher Nolan movies (due to the high weight given to the director, whose name appears three times in the soup) and put them as top recommendations.

We can of course experiment with this engine by trying different weights for our features (directors, actors, genres), limiting the number of keywords used in the soup, weighting genres by their frequency, only showing movies in the same language, etc.
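
As a rough illustration of how feature weights shift the similarity scores, the sketch below repeats the director token a configurable number of times before computing cosine similarity on token counts. It is a simplified stand-in, not the notebook's pipeline: it uses unigrams only (the `CountVectorizer` above also counts bigrams), and `build_soup`, `cosine`, `director_weight` and the two toy movies are hypothetical names and data invented for this example.

```python
from collections import Counter
from math import sqrt

def build_soup(keywords, cast, director, genres, director_weight=3):
    """Concatenate tokens, repeating the director to give it more weight."""
    return keywords + cast + [director] * director_weight + genres

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (unigram counts)."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Two made-up movies that share only a director and a genre
movie_a = dict(keywords=["hero", "city"], cast=["actora", "actorb"],
               director="dir1", genres=["action"])
movie_b = dict(keywords=["dream", "heist"], cast=["actorc", "actord"],
               director="dir1", genres=["action"])

sim_w1 = cosine(build_soup(**movie_a, director_weight=1),
                build_soup(**movie_b, director_weight=1))
sim_w3 = cosine(build_soup(**movie_a, director_weight=3),
                build_soup(**movie_b, director_weight=3))
print(f"director x1: {sim_w1:.3f}, director x3: {sim_w3:.3f}")
```

With the director counted three times, the shared director dominates the dot product, which is consistent with same-director movies rising to the top of the list.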

In [93]:
get_recommendations('Mean Girls').head(10)

Index of the movie is:  5207
[(0, 0.024419314525275217), (1, 0.0), (2, 0.05751973085430508), (3, 0.027066598098038335), (4, 0.02564102564102564), (5, 0.0), ...]

3319               Head Over Heels
4763                 Freaky Friday
1329              The House of Yes
6277              Just Like Heaven
7905         Mr. Popper's Penguins
7332    Ghosts of Girlfriends Past
6959     The Spiderwick Chronicles
8883                      The DUFF
6698         It's a Boy Girl Thing
7377       I Love You, Beth Cooper
Name: title, dtype: object

#### Popularity and Ratings

One thing that we notice about our recommendation system is that it recommends movies regardless of ratings and popularity. It is true that **Batman & Robin** shares many characters with **The Dark Knight**, but it was a terrible movie that shouldn't be recommended to anyone.

Therefore, we will add a mechanism to remove bad movies and return movies which are popular and have had a good critical response.

I will take the top 25 movies based on similarity scores and compute the 60th percentile of their vote counts. Then, using this as the value of $m$, we will calculate the weighted rating of each movie using IMDB's formula, as we did in the Simple Recommender section.
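
For reference, the commonly quoted form of IMDB's weighted rating is $WR = \frac{v}{v+m} R + \frac{m}{v+m} C$, where $v$ is a movie's vote count, $R$ its average rating, $m$ the vote-count cutoff and $C$ the mean rating across the pool. The sketch below is a minimal standalone version with explicit arguments. It is an assumption about the shape of the formula, not a copy of the `weighted_rating` function defined earlier in the notebook, and the values of `m` and `C` are made up for illustration.

```python
def weighted_rating_explicit(v, R, m, C):
    """Shrink a movie's average R toward the pool mean C when v is small relative to m."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical cutoff and pool mean, chosen only for this example
m, C = 500, 6.0

many_votes = weighted_rating_explicit(10000, 8.0, m, C)
few_votes = weighted_rating_explicit(50, 8.0, m, C)
print(f"10000 votes: {many_votes:.3f}, 50 votes: {few_votes:.3f}")
```

Both toy movies have the same average of 8.0, but the one with only 50 votes is pulled much closer to the pool mean, which is the behaviour the cutoff $m$ is meant to control.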

In [94]:
def improved_recommendations(title):
    idx = indices[title]
    #Ranking all movies by similarity and keeping the 25 closest (position 0 is normally the movie itself)
    sim_scores = list(enumerate(cosine_sim[idx]))
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    sim_scores = sim_scores[1:26]
    movie_indices = [i[0] for i in sim_scores]

    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype('int')
    #astype('int') truncates the averages, e.g. 7.7 becomes 7
    vote_averages = movies[movies['vote_average'].notnull()]['vote_average'].astype('int')
    C = vote_averages.mean()
    m = vote_counts.quantile(0.60)

    #.copy() avoids pandas' SettingWithCopyWarning when columns are assigned below
    qualified = movies[(movies['vote_count'] >= m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    qualified['vote_count'] = qualified['vote_count'].astype('int')
    qualified['vote_average'] = qualified['vote_average'].astype('int')
    #weighted_rating comes from the Simple Recommender section; if it reads m and C as globals,
    #the local m and C above only set the vote-count cutoff
    qualified['wr'] = qualified.apply(weighted_rating, axis=1)
    qualified = qualified.sort_values('wr', ascending=False).head(10)
    return qualified

In [95]:
improved_recommendations('The Dark Knight')

Unnamed: 0,title,vote_count,vote_average,year,wr
7648,Inception,14075,8,2010,7.917588
8613,Interstellar,11187,8,2014,7.897107
6623,The Prestige,4510,8,2006,7.758148
3381,Memento,4168,8,2000,7.740175
8031,The Dark Knight Rises,9263,7,2012,6.921448
6218,Batman Begins,7511,7,2005,6.904127
1134,Batman Returns,1706,6,1992,5.846862
132,Batman Forever,1529,5,1995,5.054144
9024,Batman v Superman: Dawn of Justice,7189,5,2016,5.013943
1260,Batman & Robin,1447,4,1997,4.287233


In [96]:
improved_recommendations('Mean Girls')

Unnamed: 0,title,vote_count,vote_average,year,wr
1547,The Breakfast Club,2189,7,1985,6.709602
390,Dazed and Confused,588,7,1993,6.254682
8883,The DUFF,1372,6,2015,5.818541
3712,The Princess Diaries,1063,6,2001,5.781086
4763,Freaky Friday,919,6,2003,5.757786
6277,Just Like Heaven,595,6,2005,5.681521
6959,The Spiderwick Chronicles,593,6,2008,5.680901
7494,American Pie Presents: The Book of Love,454,5,2009,5.11969
7332,Ghosts of Girlfriends Past,716,5,2009,5.092422
7905,Mr. Popper's Penguins,775,5,2011,5.087912


Unfortunately, **Batman & Robin** does not disappear from our recommendation list. This is probably due to the fact that it is rated a 4, which is only slightly below average on TMDB. It certainly doesn't deserve a 4 when an amazing movie like **The Dark Knight Rises** has only a 7. However, there is not much we can do about this. Therefore, we will conclude our Content Based Recommender section here and come back to it when we build a hybrid engine.

## Collaborative Filtering

Our content based engine suffers from some severe limitations. It is only capable of suggesting movies which are *close* to a certain movie. That is, it is not capable of capturing tastes and providing recommendations across genres.

Also, the engine that we built is not really personal in that it doesn't capture the personal tastes and biases of a user. Anyone querying our engine for recommendations based on a movie will receive the same recommendations for that movie, regardless of who they are.

Therefore, in this section, we will use a technique called **Collaborative Filtering** to make recommendations to movie watchers. Collaborative Filtering is based on the idea that users similar to me can be used to predict how much I will like a particular product or service that those users have used or experienced but I have not.

I will not be implementing Collaborative Filtering from scratch. Instead, I will use the **Surprise** library, which provides powerful algorithms such as **Singular Value Decomposition (SVD)** to minimise RMSE (Root Mean Square Error) and give great recommendations.
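
To give a feel for what such a model learns, here is a minimal pure-Python sketch of biased matrix factorization trained by stochastic gradient descent, using the prediction rule $\hat{r}_{ui} = \mu + b_u + b_i + q_i^T p_u$ that Surprise's documentation describes for its `SVD` algorithm. It is an illustration under simplifying assumptions, not Surprise's implementation: `train_mf`, `predict`, `rmse`, the hyperparameter values and the toy ratings are all hypothetical, and the reported error is measured on the training data rather than on held-out folds.

```python
import math
import random

def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit mu, b_u, b_i, p_u, q_i by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)
    bu = {u: 0.0 for u in users}
    bi = {i: 0.0 for i in items}
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i])))
            # Regularized gradient steps on the biases and the latent factors
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q

def predict(model, u, i):
    """Estimate a rating; unknown users or items fall back to the known terms."""
    mu, bu, bi, p, q = model
    est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    if u in p and i in q:
        est += sum(a * b for a, b in zip(p[u], q[i]))
    return est

def rmse(model, ratings):
    """Root mean square error of the model on the given triples."""
    return math.sqrt(sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Made-up (userId, movieId, rating) triples
toy_ratings = [(1, 10, 4.0), (1, 20, 2.0), (2, 10, 5.0), (2, 30, 3.0),
               (3, 20, 1.0), (3, 30, 4.0), (4, 10, 4.5), (4, 20, 2.5)]
model = train_mf(toy_ratings)
baseline = math.sqrt(sum((r - model[0]) ** 2 for _, _, r in toy_ratings) / len(toy_ratings))
print(f"mean-only RMSE {baseline:.3f}, fitted RMSE {rmse(model, toy_ratings):.3f}")
```

Note that the model only ever sees user IDs, item IDs and ratings; nothing about a movie's content enters the computation.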

In [97]:
reader = Reader()

In [98]:
ratings = pd.read_csv('ratings_small.csv')
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205


In [99]:
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

In [100]:
svd = SVD()
cross_validate(svd, data, measures=['RMSE', 'MAE'], cv=5)

{'test_rmse': array([0.90073501, 0.89448937, 0.9049408 , 0.89268918, 0.89220217]),
 'test_mae': array([0.69335656, 0.68755091, 0.69752472, 0.68566193, 0.68692905]),
 'fit_time': (1.513676404953003,
  1.5152676105499268,
  1.512507438659668,
  1.5073163509368896,
  2.195732593536377),
 'test_time': (0.3627948760986328,
  0.1257319450378418,
  0.12222456932067871,
  0.1670088768005371,
  0.46442127227783203)}

Let us pick user 1 and check their ratings.

In [101]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [102]:
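#Predicting estimated rating for userId=1, movieId=302 (the third argument, 3, is only a reference true rating)
#svd was last fitted inside cross_validate, on the training split of the final fold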
svd.predict(1, 302, 3)

Prediction(uid=1, iid=302, r_ui=3, est=2.8416963863910296, details={'was_impossible': False})

For the movie with ID 302, we get an estimated rating of **2.842**. One startling feature of this recommender system is that it doesn't care what the movie is (or what it contains). It works purely on the basis of an assigned movie ID and tries to predict ratings based on how other users have rated the movie.

## Hybrid Recommender


In this section, I will try to build a simple hybrid recommender that brings together techniques we have implemented in the content based and collaborative filter based engines. This is how it will work:

* **Input:** User ID and the Title of a Movie
* **Output:** Similar movies sorted on the basis of expected ratings by that particular user.
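
The input/output contract above can be sketched independently of pandas and Surprise: take the titles already selected by content similarity and re-order them by a per-user rating estimate. Everything in this sketch is hypothetical (`hybrid_rank`, `toy_predict`, the titles and the estimates); in the notebook the estimates come from the fitted `svd` model.

```python
def hybrid_rank(user_id, similar_titles, predict_rating, top_n=10):
    """Re-rank content-based candidates by the rating predicted for this user."""
    scored = [(title, predict_rating(user_id, title)) for title in similar_titles]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Made-up per-user estimates standing in for a collaborative filtering model
toy_estimates = {
    (1, "Movie A"): 3.3, (1, "Movie B"): 2.9, (1, "Movie C"): 3.1,
    (500, "Movie A"): 3.0, (500, "Movie B"): 3.4, (500, "Movie C"): 3.2,
}

def toy_predict(user_id, title):
    return toy_estimates[(user_id, title)]

similar = ["Movie A", "Movie B", "Movie C"]  # candidates from content similarity
ranking_user_1 = hybrid_rank(1, similar, toy_predict)
ranking_user_500 = hybrid_rank(500, similar, toy_predict)
print(ranking_user_1)
print(ranking_user_500)
```

The candidate set is the same for both users; only the ordering changes, because the estimates are user-specific.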

In [103]:
def convert_int(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return np.nan

In [104]:
id_map = pd.read_csv('links_small.csv')[['movieId', 'tmdbId']]

In [105]:
id_map.head()

Unnamed: 0,movieId,tmdbId
0,1,862.0
1,2,8844.0
2,3,15602.0
3,4,31357.0
4,5,11862.0


In [106]:
id_map['tmdbId'] = id_map['tmdbId'].apply(convert_int)
#Renaming columns
id_map.columns = ['movieId', 'id']
#Fetching title for the given ids from original dataframe
id_map = id_map.merge(smd[['title', 'id']], on='id').set_index('title')

In [107]:
id_map.head()

Unnamed: 0_level_0,movieId,id
title,Unnamed: 1_level_1,Unnamed: 2_level_1
Toy Story,1,862.0
Jumanji,2,8844.0
Grumpier Old Men,3,15602.0
Waiting to Exhale,4,31357.0
Father of the Bride Part II,5,11862.0


In [108]:
indices_map = id_map.set_index('id')
indices_map.head()

Unnamed: 0_level_0,movieId
id,Unnamed: 1_level_1
862.0,1
8844.0,2
15602.0,3
31357.0,4
11862.0,5


In [109]:
#building hybrid recommender
def hybrid(userId, title):
    #fetching index of the title
    idx = indices[title]
    #fetching tmdbId and movieId of the query movie (kept for reference; not used below)
    tmdbId = id_map.loc[title]['id']
    movie_id = id_map.loc[title]['movieId']

    #Calculating Similarity
    sim_scores = list(enumerate(cosine_sim[int(idx)]))
    #Sorting by cosine similarity, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    #Keeping the 25 most similar movies (position 0 is normally the movie itself)
    sim_scores = sim_scores[1:26]
    #fetching movie indices of top 25 movies
    movie_indices = [i[0] for i in sim_scores]

    #Fetching metadata of the top 25 movies (.copy() avoids SettingWithCopyWarning when adding 'est')
    movies = smd.iloc[movie_indices][['title', 'vote_count', 'vote_average', 'year', 'id']].copy()

    #collaborative filtering
    #predicting estimated ratings using SVD (TMDB id -> MovieLens movieId via indices_map)
    movies['est'] = movies['id'].apply(lambda x: svd.predict(userId, indices_map.loc[x]['movieId']).est)
    movies = movies.sort_values('est', ascending=False)
    return movies.head(10)

In [110]:
#Passing userid and title of the movie
hybrid(1, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,3.31258
1011,The Terminator,4208.0,7.4,1984,218,3.238351
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,3.130171
974,Aliens,3282.0,7.7,1986,679,3.114932
2014,Fantastic Planet,140.0,7.6,1973,16306,3.004852
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,2.986482
1621,Darby O'Gill and the Little People,35.0,6.7,1959,18887,2.934674
4347,Piranha Part Two: The Spawning,41.0,3.9,1981,31646,2.926077
1668,Return from Witch Mountain,38.0,5.6,1978,14822,2.912715
922,The Abyss,822.0,7.1,1989,2756,2.863365


In [111]:
#Passing userid and title of the movie
hybrid(500, 'Avatar')

Unnamed: 0,title,vote_count,vote_average,year,id,est
1376,Titanic,7770.0,7.5,1997,597,3.372622
8658,X-Men: Days of Future Past,6155.0,7.5,2014,127585,3.366378
2014,Fantastic Planet,140.0,7.6,1973,16306,3.243116
1621,Darby O'Gill and the Little People,35.0,6.7,1959,18887,3.211043
831,Escape to Witch Mountain,60.0,6.5,1975,14821,3.202397
8401,Star Trek Into Darkness,4479.0,7.4,2013,54138,3.137998
522,Terminator 2: Judgment Day,4274.0,7.7,1991,280,3.077256
1011,The Terminator,4208.0,7.4,1984,218,2.989782
4347,Piranha Part Two: The Spawning,41.0,3.9,1981,31646,2.944287
922,The Abyss,822.0,7.1,1989,2756,2.904818


We see that for our hybrid recommender, we get different recommendations for different users although the movie is the same. Hence, our recommendations are more personalized and tailored towards particular users.