# Content Based Recommender

**This notebook outputs a table of 10 movies similar to the movie named by the user. Similarity takes into account the genres, selected keywords, the cast and the director, and the final list is ranked by weighted rating.**

We begin by importing the necessary tools and modules :

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
from ast import literal_eval
import warnings; warnings.simplefilter("ignore")

Reading the "movies_metadata" CSV file :

In [2]:
smr = pd.read_csv("movies_metadata.csv")
smr.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


As in the "Simple Movie Recommender" notebook, we preprocess the "genres" column, turning each stringified list of dictionaries into a plain list of genre names :

In [3]:
smr['genres'] = smr['genres'].fillna('[]').apply(literal_eval).apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])
smr['genres'].head(5)

0     [Animation, Comedy, Family]
1    [Adventure, Fantasy, Family]
2               [Romance, Comedy]
3        [Comedy, Drama, Romance]
4                        [Comedy]
Name: genres, dtype: object
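
To make the one-liner above concrete, here is a minimal sketch of what it does to a single cell. The sample string is illustrative (written in the same shape as the raw "genres" values), and `genre_names` is a hypothetical helper name that does not appear in the notebook.

```python
from ast import literal_eval

def genre_names(raw):
    """Parse a stringified list of dicts and keep only the 'name' values."""
    # Anything that is not a string (e.g. a missing value) is treated as an empty list
    parsed = literal_eval(raw) if isinstance(raw, str) else []
    return [item['name'] for item in parsed] if isinstance(parsed, list) else []

# Illustrative cell content, not copied from the dataset
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

print(genre_names(sample))  # ['Animation', 'Comedy']
print(genre_names(None))    # []
```

`literal_eval` only evaluates Python literals (lists, dicts, strings, numbers), which makes it a safer choice than `eval` for text read from a CSV file.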

Reading the "links_small" CSV file :

In [4]:
links_small = pd.read_csv("links_small.csv")
links_small.sample(5)

Unnamed: 0,movieId,imdbId,tmdbId
6672,52328,448134,1272.0
8003,92751,1700258,53174.0
2338,2916,100802,861.0
846,1044,117791,41843.0
1571,2009,70723,12101.0


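Dropping the rows with NaN in the "tmdbId" column and converting the remaining ids to the int data type, which reduces `links_small` to that single column. Three rows of "smr" that contain invalid data are dropped as well :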

In [5]:
links_small = links_small[links_small['tmdbId'].notnull()]['tmdbId'].astype(int)
smr = smr.drop([19730, 29503, 35587])  # invalid data points

Reading the "credits" and "keywords" CSV files into tables, then converting the "id" column of each table (and of "smr") to the int data type :

In [6]:
credits = pd.read_csv("credits.csv")
keywords = pd.read_csv("keywords.csv")
keywords['id'] = keywords['id'].astype(int)
credits['id'] = credits['id'].astype(int)
smr['id'] = smr['id'].astype(int)

Merging the "credits" and "keywords" tables with the "smr" table :

In [7]:
smr = smr.merge(credits, on='id')
smr = smr.merge(keywords, on='id')

Creating a new table, "cbr", that keeps only the rows of "smr" whose id appears in "links_small" :

In [8]:
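cbr = smr[smr['id'].isin(links_small)].copy()  # .copy() makes "cbr" an independent table, so later column assignments do not act on a slice of "smr"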
cbr.head(5)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,spoken_languages,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[Animation, Comedy, Family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0,"[{'cast_id': 14, 'character': 'Woody (voice)',...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[{'id': 931, 'name': 'jealousy'}, {'id': 4290,..."
1,False,,65000000,"[Adventure, Fantasy, Family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[{'cast_id': 1, 'character': 'Alan Parrish', '...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[{'id': 10090, 'name': 'board game'}, {'id': 1..."
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[Romance, Comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[{'cast_id': 2, 'character': 'Max Goldman', 'c...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[{'id': 1495, 'name': 'fishing'}, {'id': 12392..."
3,False,,16000000,"[Comedy, Drama, Romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[{'cast_id': 1, 'character': ""Savannah 'Vannah...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[{'id': 818, 'name': 'based on novel'}, {'id':..."
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[Comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[{'cast_id': 1, 'character': 'George Banks', '...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[{'id': 1009, 'name': 'baby'}, {'id': 1599, 'n..."


Applying `literal_eval()` from Python's `ast` module to the "cast", "crew" and "keywords" columns, which turns each stringified list into an actual Python list :

In [9]:
cbr['cast'] = cbr['cast'].apply(literal_eval)
cbr['crew'] = cbr['crew'].apply(literal_eval)
cbr['keywords'] = cbr['keywords'].apply(literal_eval)

Storing the number of cast and crew members in two new columns, "cast_size" and "crew_size" :

In [10]:
cbr['cast_size'] = cbr['cast'].apply(len)
cbr['crew_size'] = cbr['crew'].apply(len)

Creating a function to get the name of the Director :

In [11]:
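def director_name(crew):
    # "crew" is the parsed list of crew-member dictionaries for one movie
    for member in crew:
        if member['job'] == 'Director':
            return member['name']  # first director listed
    return np.nan  # no director credited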

Creating a new column "director", which stores the name of the director of the corresponding movie :

In [12]:
cbr['director'] = cbr['crew'].apply(director_name)
cbr['director']

0             John Lasseter
1              Joe Johnston
2             Howard Deutch
3           Forest Whitaker
4             Charles Shyer
                ...        
40952        Gregg Champion
41172     Tinu Suresh Desai
41225    Ashutosh Gowariker
41391          Hideaki Anno
41669            Ron Howard
Name: director, Length: 9219, dtype: object
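
The lookup above can be tried on a single hand-written crew list. The sketch below is self-contained and uses made-up names; `first_director` is a hypothetical stand-in for the notebook's function, and it returns `None` rather than `np.nan` so that it needs no imports.

```python
def first_director(crew):
    """Return the name of the first crew member whose job is 'Director', else None."""
    for member in crew:
        if member.get('job') == 'Director':
            return member['name']
    return None  # the notebook's version returns np.nan here

# Made-up crew list in the same shape as one parsed "crew" cell
crew = [
    {'job': 'Producer', 'name': 'Person A'},
    {'job': 'Director', 'name': 'Person B'},
    {'job': 'Director', 'name': 'Person C'},
]

print(first_director(crew))  # Person B
print(first_director([]))    # None
```

Note that only the first director is returned, so a co-directed movie is represented by a single name.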

Preprocessing the "cast" and "keywords" columns :

In [13]:
cbr['cast'] = cbr['cast'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])  # keeping only the names
cbr['cast'] = cbr['cast'].apply(lambda x: x[:3])  # keeping at most the first 3 cast members
cbr['keywords'] = cbr['keywords'].apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else [])  # keeping only the names
cbr['cast']

0                      [Tom Hanks, Tim Allen, Don Rickles]
1           [Robin Williams, Jonathan Hyde, Kirsten Dunst]
2               [Walter Matthau, Jack Lemmon, Ann-Margret]
3        [Whitney Houston, Angela Bassett, Loretta Devine]
4               [Steve Martin, Diane Keaton, Martin Short]
                               ...                        
40952      [Sidney Poitier, Wendy Crewson, Jay O. Sanders]
41172            [Akshay Kumar, Ileana D'Cruz, Esha Gupta]
41225            [Hrithik Roshan, Pooja Hegde, Kabir Bedi]
41391    [Hiroki Hasegawa, Yutaka Takenouchi, Satomi Is...
41669           [Paul McCartney, Ringo Starr, John Lennon]
Name: cast, Length: 9219, dtype: object

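Removing the spaces from the contents of the "cast" and "director" columns and converting them to lowercase, so that each name becomes a single token (for example, "Tom Hanks" becomes "tomhanks") :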

In [14]:
cbr['cast'] = cbr['cast'].apply(lambda x: [i.replace(" ", "").lower() for i in x])
cbr['director'] = cbr['director'].astype('str').apply(lambda x: x.replace(" ", "").lower())  # a missing director (NaN) becomes the string 'nan'

Listing the director twice, so that this feature carries twice the weight of a single cast member or keyword in the token counts :

In [15]:
cbr['director'] = cbr['director'].apply(lambda x: [x, x])

Calculating the frequency count of every keyword that appears in the dataset :

In [16]:
frequency = cbr.apply(lambda x: pd.Series(x['keywords']), axis=1).stack().reset_index(level=1, drop=True)
frequency.name = 'keyword'
frequency = frequency.value_counts()
frequency

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
                       ... 
secret rites              1
plato                     1
shangri la                1
comedy team               1
hospital bed              1
Name: keyword, Length: 12940, dtype: int64

Removing the keywords that occur only once :

In [17]:
frequency = frequency[frequency > 1]
frequency

independent film        610
woman director          550
murder                  399
duringcreditsstinger    327
based on novel          318
                       ... 
flight attendant          2
sale of soul              2
exhibit                   2
step parents              2
famous                    2
Name: keyword, Length: 6709, dtype: int64
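
The counting and filtering steps can be sketched without pandas. The example below uses `collections.Counter` on made-up keyword lists; it illustrates the idea and is not the notebook's implementation.

```python
from collections import Counter

# Made-up keyword lists, one per movie
keyword_lists = [
    ['murder', 'independent film'],
    ['murder', 'plato'],
    ['independent film', 'murder'],
]

# Count how often each keyword appears across all movies
counts = Counter(kw for kws in keyword_lists for kw in kws)

# Keep only the keywords that occur more than once
frequent = {kw for kw, n in counts.items() if n > 1}

# Drop the rare keywords from every movie's list
filtered = [[kw for kw in kws if kw in frequent] for kws in keyword_lists]

print(counts.most_common())  # [('murder', 3), ('independent film', 2), ('plato', 1)]
print(filtered)              # [['murder', 'independent film'], ['murder'], ['independent film', 'murder']]
```

A keyword that appears in a single movie cannot link that movie to any other, so removing it shrinks the vocabulary without losing similarity information.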

Creating a Snowball Stemmer to stem words :

In [18]:
stemmer = SnowballStemmer('english')

Creating a function that keeps only the keywords present in the filtered "frequency" table :

In [19]:
def frequent_keywords(col):
    words = []
    for i in col:
        if i in frequency:
            words.append(i)
    return words

Preprocessing the "keywords" column :

In [20]:
cbr['keywords'] = cbr['keywords'].apply(frequent_keywords)  # eliminating words which are not present in the "frequency" table
cbr['keywords'] = cbr['keywords'].apply(lambda x: [stemmer.stem(i) for i in x])  # stemming the remaining words
cbr['keywords'] = cbr['keywords'].apply(lambda x: [i.replace(" ", "").lower() for i in x])  # removing spaces and converting to lowercase
cbr['keywords']

0        [jealousi, toy, boy, friendship, friend, rival...
1        [boardgam, disappear, basedonchildren'sbook, n...
2                   [fish, bestfriend, duringcreditssting]
3        [basedonnovel, interracialrelationship, single...
4        [babi, midlifecrisi, confid, age, daughter, mo...
                               ...                        
40952                                         [friendship]
41172                                          [bollywood]
41225                                          [bollywood]
41391     [monster, godzilla, giantmonst, destruct, kaiju]
41669                                 [music, documentari]
Name: keywords, Length: 9219, dtype: object

Creating a new column "mix" that joins the keywords, cast, director and genres of each movie into a single space-separated string, ready to be vectorized :

In [21]:
cbr['mix'] = cbr['keywords'] + cbr['cast'] + cbr['director'] + cbr['genres']
cbr['mix'] = cbr['mix'].apply(lambda x: ' '.join(x))
cbr['mix']

0        jealousi toy boy friendship friend rivalri boy...
1        boardgam disappear basedonchildren'sbook newho...
2        fish bestfriend duringcreditssting waltermatth...
3        basedonnovel interracialrelationship singlemot...
4        babi midlifecrisi confid age daughter motherda...
                               ...                        
40952    friendship sidneypoitier wendycrewson jayo.san...
41172    bollywood akshaykumar ileanad'cruz eshagupta t...
41225    bollywood hrithikroshan poojahegde kabirbedi a...
41391    monster godzilla giantmonst destruct kaiju hir...
41669    music documentari paulmccartney ringostarr joh...
Name: mix, Length: 9219, dtype: object

We will use scikit-learn's `cosine_similarity()` function to find similar movies. Cosine similarity is the cosine of the angle between two vectors: for count vectors it is 1 when they point in the same direction and 0 when they have no token in common. Here the vectors hold token counts (single words and pairs of adjacent words), which we generate by applying scikit-learn's `CountVectorizer().fit_transform()` to the "mix" column of our "cbr" table :

In [22]:
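# min_df=1 keeps every term, as the original min_df=0 did; newer scikit-learn versions may reject an integer min_df below 1
cv = CountVectorizer(analyzer='word', ngram_range=(1, 2), min_df=1, stop_words='english')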
count_matrix = cv.fit_transform(cbr['mix'])
cosim = cosine_similarity(count_matrix, count_matrix)
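
To see what the similarity score measures, here is a from-scratch sketch on three made-up "mix" strings. It counts single tokens only (the notebook also counts pairs of adjacent tokens) and does not use scikit-learn; `cosine` and the token names are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two token-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up "mix" strings; the director token appears twice, as in the notebook
movie_a = Counter("space robot directorone directorone".split())
movie_b = Counter("heist robot directorone directorone".split())  # same director as A
movie_c = Counter("space robot directortwo directortwo".split())  # same keywords as A

print(round(cosine(movie_a, movie_b), 3))  # 0.833
print(round(cosine(movie_a, movie_c), 3))  # 0.333
```

Because the director token is counted twice, sharing a director (A and B) outweighs sharing both keywords (A and C) in this toy example.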

Resetting the index for the "cbr" table, and mapping the titles of the movies with the new indices :

In [23]:
cbr = cbr.reset_index()
titles = cbr['title']
indices = pd.Series(cbr.index, index=cbr['title'])
indices

title
Toy Story                                                0
Jumanji                                                  1
Grumpier Old Men                                         2
Waiting to Exhale                                        3
Father of the Bride Part II                              4
                                                      ... 
The Last Brickmaker in America                        9214
Rustom                                                9215
Mohenjo Daro                                          9216
Shin Godzilla                                         9217
The Beatles: Eight Days a Week - The Touring Years    9218
Length: 9219, dtype: int64

IMDb's weighted rating formula is as follows :

$$\text{Weighted Rating} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where

v = number of votes for the movie<br>
m = minimum number of votes required for a movie to enter the Top Charts<br>
R = average rating of the movie<br>
C = mean rating across all movies<br>
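
A short worked example shows how the formula behaves. The numbers are hypothetical and chosen for easy arithmetic, and `imdb_weighted_rating` is an illustrative name, not the function used in this notebook.

```python
def imdb_weighted_rating(v, m, R, C):
    """Blend a movie's own rating R with the global mean rating C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical inputs: threshold m = 500 votes, global mean rating C = 5.0
many_votes = imdb_weighted_rating(v=1000, m=500, R=8.0, C=5.0)
few_votes = imdb_weighted_rating(v=10, m=500, R=8.0, C=5.0)

print(round(many_votes, 2))  # 7.0  (2/3 of 8.0 plus 1/3 of 5.0)
print(round(few_votes, 2))   # 5.06 (pulled almost entirely toward the mean)
```

Both movies have the same average rating, but the one with few votes is pulled toward the global mean, which keeps sparsely rated movies from dominating the ranking.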

We will now create a function that applies IMDb's weighted rating formula to our dataset :

In [24]:
vote_counts = smr[smr['vote_count'].notnull()]['vote_count'].astype(int)
vote_averages = smr[smr['vote_average'].notnull()]['vote_average'].astype(int)  # note: astype(int) truncates, e.g. 7.7 -> 7

mean_votes = vote_averages.mean()  # C: the mean rating across all movies (a rating, despite the name)
min_votes_req_for_charts = vote_counts.quantile(0.95)  # m
# 0.95 here means that a movie needs at least as many votes as 95% of all movies to be featured in the Top Charts.

def weighted_rating(x):
    num_of_votes = x['vote_count']  # v
    average_rating = x['vote_average']  # R
    return (num_of_votes/(num_of_votes + min_votes_req_for_charts)*average_rating) + (min_votes_req_for_charts/(min_votes_req_for_charts + num_of_votes)*mean_votes)

Creating a function that recommends movies similar to the given input name :

In [25]:
def recommendations(name):
    try:
        idx = indices[name]
    except KeyError:
        print(f"Your query, '{name}', is not in our dataset. Try entering the exact name of the movie, or simply enter a different movie.")
        return None  # stop here: there is no index to look up
    if isinstance(idx, pd.Series):  # several movies share this title, so use the first match
        idx = idx.iloc[0]
    scores = sorted(enumerate(cosim[idx]), key=lambda x: x[1], reverse=True)
    scores = scores[1:26]  # skip the top hit (normally the movie itself) and keep the next 25
    mov_indices = [i[0] for i in scores]
    movies = cbr.iloc[mov_indices][['title', 'vote_count', 'vote_average']]
    vote_counts = movies[movies['vote_count'].notnull()]['vote_count'].astype(int)
    m = vote_counts.quantile(0.60)  # keep candidates at or above the 60th percentile of vote counts
    recommendation = movies[(movies['vote_count']>=m) & (movies['vote_count'].notnull()) & (movies['vote_average'].notnull())].copy()
    recommendation['vote_count'] = recommendation['vote_count'].astype(int)
    recommendation['vote_average'] = recommendation['vote_average'].astype(int)  # truncates, e.g. 7.7 -> 7
    recommendation['weighted_rating'] = recommendation.apply(weighted_rating, axis=1)
    recommendation = recommendation.sort_values('weighted_rating', ascending=False).head(10)
    return recommendation
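
The candidate-selection step of this function (sort one row of the similarity matrix, skip the best hit, keep the next few) can be sketched on its own. The similarity row below is made up, and `top_similar` is a hypothetical helper name.

```python
def top_similar(sim_row, k):
    """Return (index, score) pairs for the k highest scores, skipping the single best hit."""
    ranked = sorted(enumerate(sim_row), key=lambda pair: pair[1], reverse=True)
    return ranked[1:k + 1]

# Made-up similarity row for movie 0 (its similarity to itself is 1.0)
row = [1.0, 0.2, 0.8, 0.5, 0.0]

print(top_similar(row, 3))  # [(2, 0.8), (3, 0.5), (1, 0.2)]
```

Skipping position 0 assumes that the query movie ranks first. If another movie has an identical feature string, it also scores 1.0, and Python's stable sort then places whichever of the two has the lower index first.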

**Testing out the output :**

In [26]:
recommendations("Iron Man")

Unnamed: 0,title,vote_count,vote_average,weighted_rating
7969,The Avengers,12000,7,6.939754
8871,Deadpool,11444,7,6.936932
8712,Guardians of the Galaxy,10014,7,6.928293
8872,Captain America: Civil War,7462,7,6.90509
8868,Avengers: Age of Ultron,6908,7,6.89792
8869,Ant-Man,6029,7,6.884017
8392,Iron Man 3,8951,6,5.965491
7923,Captain America: The First Avenger,7174,6,5.957422
7600,Iron Man 2,6969,6,5.956241
7861,Thor,6678,6,5.954448


In [27]:
recommendations("Interstellar")

Unnamed: 0,title,vote_count,vote_average,weighted_rating
7648,Inception,14075,8,7.919065
8983,The Martian,7442,7,6.904849
8477,About Time,2140,7,6.708166
129,Apollo 13,1637,7,6.636977
1274,Contact,1338,7,6.575409
2043,Planet of the Apes,958,7,6.458746
8384,Oblivion,4862,6,5.938802
8726,The Giver,1859,6,5.858339
8936,Midnight Special,705,6,5.713669
8854,Terminator Genisys,3677,5,5.024731


In [28]:
recommendations("Harry Potter and the Philosopher's Stone")

Unnamed: 0,title,vote_count,vote_average,weighted_rating
7921,Harry Potter and the Deathly Hallows: Part 2,6141,7,6.885995
5452,Harry Potter and the Prisoner of Azkaban,6037,7,6.884161
4366,Harry Potter and the Chamber of Secrets,5966,7,6.882874
6354,Harry Potter and the Goblet of Fire,5758,7,6.878934
7742,Harry Potter and the Deathly Hallows: Part 1,5708,7,6.877947
6801,Harry Potter and the Order of the Phoenix,5633,7,6.876435
7345,Harry Potter and the Half-Blood Prince,5435,7,6.87226
519,Home Alone,2487,7,6.742942
2388,Home Alone 2: Lost in New York,2459,6,5.887811
8996,Pixels,2564,5,5.03394


In [29]:
recommendations("The Matrix")

Unnamed: 0,title,vote_count,vote_average,weighted_rating
1011,The Terminator,4208,7,6.83843
5544,"I, Robot",3889,6,5.924999
4651,The Matrix Reloaded,3500,6,5.917566
4928,The Matrix Revolutions,3155,6,5.909622
7764,TRON: Legacy,2895,6,5.902544
5636,Resident Evil: Apocalypse,1286,6,5.810898
7424,Surrogates,1219,5,5.061707
4739,Terminator 3: Rise of the Machines,2177,5,5.038988
7296,Terminator Salvation,2496,5,5.03473
8854,Terminator Genisys,3677,5,5.024731


In [30]:
recommendations("The Dark Knight")

Unnamed: 0,title,vote_count,vote_average,weighted_rating
6623,The Prestige,4510,8,7.762198
8031,The Dark Knight Rises,9263,7,6.922734
6218,Batman Begins,7511,7,6.905676
7659,Batman: Under the Red Hood,459,7,6.15322
2085,Following,363,7,6.050059
1134,Batman Returns,1706,6,5.848168
7561,Harry Brown,351,6,5.583049
8026,Bullet to the Head,490,5,5.11087
9024,Batman v Superman: Dawn of Justice,7189,5,5.013324
1260,Batman & Robin,1447,4,4.281221
