## Movie recommender system with document similarity ##

Movie recommender systems can be implemented in three ways,

1. Simple-rule based recommenders - Based on some specific global metrics and thresholds like movie popularity,global ratings etc.
2. Content based  recommenders - Based on providing similar entities. Content meta data can be used such as movie title, description, cast, director etc. 
3. Collaborative filtering recommenders - Based on past ratings of different users and specific items. We are going to predict ratings and recommendations and also we don't need meta data.

Here, we are going to use Content based recommenders.

Import the required libraries

In [164]:
import pandas as pd 
import nltk
from nltk.tokenize import word_tokenize
import re
import numpy as np 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.summarization.bm25 import get_bm25_weights

Load tmdb dataset

In [79]:
df = pd.read_csv('./tmdb_dataset/tmdb_5000_movies.csv')
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


Let us explore dataset

In [80]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

We do not want all variables for content based recommenders. We will select some variables from dataset.

In [81]:
df = df[['overview','popularity','tagline','title','genres']]
df.tagline.fillna(' ',inplace=True)
# Create a new variable "Description" by combaining tagline and overview
df['description'] = df['tagline'].map(str)+' '+ df['overview']
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   overview     4800 non-null   object 
 1   popularity   4800 non-null   float64
 2   tagline      4800 non-null   object 
 3   title        4800 non-null   object 
 4   genres       4800 non-null   object 
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


In [82]:
df.head()

Unnamed: 0,overview,popularity,tagline,title,genres,description
0,"In the 22nd century, a paraplegic Marine is di...",150.437577,Enter the World of Pandora.,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Enter the World of Pandora. In the 22nd centur...
1,"Captain Barbossa, long believed to be dead, ha...",139.082615,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","At the end of the world, the adventure begins...."
2,A cryptic message from Bond’s past sends him o...,107.376788,A Plan No One Escapes,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",A Plan No One Escapes A cryptic message from B...
3,Following the death of District Attorney Harve...,112.31295,The Legend Ends,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",The Legend Ends Following the death of Distric...
4,"John Carter is a war-weary, former military ca...",43.926995,"Lost in our world, found in another.",John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","Lost in our world, found in another. John Cart..."


In [83]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   overview     4800 non-null   object 
 1   popularity   4800 non-null   float64
 2   tagline      4800 non-null   object 
 3   title        4800 non-null   object 
 4   genres       4800 non-null   object 
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


Text preprocessing

In [86]:
nltk.download('punkt')
#removing stop words
stop_words = nltk.corpus.stopwords.words('english')

# Normalization of each document
def normalize_doc(doc):
    # lowercase and remove special characters\whitespace
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

norm_corpus = np.vectorize(normalize_doc)

[nltk_data] Downloading package punkt to /home/csuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [87]:
norm_corpus = norm_corpus(list(df['description']))
len(norm_corpus)

4800

Extract td-idf features 

In [91]:
tf = TfidfVectorizer(ngram_range=(1,2),min_df=2)
tfidf_matrix = tf.fit_transform(norm_corpus)
tfidf_matrix.shape

(4800, 20464)

Cosine similarity for Pairwise Document Similarity

In [95]:
cs = cosine_similarity(tfidf_matrix)
cs_df = pd.DataFrame(cs)
cs_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.010754,0.0,0.019123,0.028828,0.025023,0.0,0.026646,0.0,0.007459,...,0.009749,0.0,0.023452,0.033714,0.0,0.0,0.0,0.006926,0.0,0.0
1,0.010754,1.0,0.011891,0.0,0.041623,0.0,0.014755,0.027122,0.034707,0.007617,...,0.009956,0.0,0.004818,0.0,0.0,0.012593,0.0,0.022392,0.013724,0.0
2,0.0,0.011891,1.0,0.0,0.0,0.0,0.0,0.022242,0.015862,0.004893,...,0.042617,0.0,0.0,0.0,0.01654,0.0,0.0,0.011682,0.0,0.004047
3,0.019123,0.0,0.0,1.0,0.008793,0.0,0.016185,0.023172,0.027467,0.07364,...,0.0,0.0,0.009667,0.0,0.0,0.0,0.0,0.028355,0.021785,0.02806
4,0.028828,0.041623,0.0,0.008793,1.0,0.0,0.023211,0.028676,0.0,0.023547,...,0.0148,0.0,0.0,0.0,0.0,0.01076,0.0,0.010514,0.0,0.0


In [97]:
#Movies list
movies_list = df['title'].values
movies_list,movies_list.shape

(array(['Avatar', "Pirates of the Caribbean: At World's End", 'Spectre',
        ..., 'Signed, Sealed, Delivered', 'Shanghai Calling',
        'My Date with Drew'], dtype=object),
 (4800,))

Find top similar movies 

In [124]:
# find movie id
movie_idx = np.where(movies_list=='Apollo 18')[0][0]
movie_idx

3603

In [125]:
# Find similar movies for given movie_idx
movie_simis = cs_df.iloc[movie_idx].values
movie_simis

array([0.14777147, 0.01529667, 0.02089563, ..., 0.00722744, 0.00902633,
       0.01666397])

In [126]:
# Get top five movie similar ids
sim_movie_idx = np.argsort(-movie_simis)[1:6]
sim_movie_idx

array([ 311,  847, 1275,    0, 4246])

In [127]:
# Get top five movies
sim_movies = movies_list[sim_movie_idx]
sim_movies

array(['The Adventures of Pluto Nash', 'Semi-Pro', 'Sunshine', 'Avatar',
       'The Lords of Salem'], dtype=object)

Get list of popular movies


In [129]:
pop_mvs = df.sort_values(by='popularity',ascending=False)
pop_mvs

Unnamed: 0,overview,popularity,tagline,title,genres,description
546,"Minions Stuart, Kevin and Bob are recruited by...",875.581305,"Before Gru, they had a history of bad bosses",Minions,"[{""id"": 10751, ""name"": ""Family""}, {""id"": 16, ""...","Before Gru, they had a history of bad bosses M..."
95,Interstellar chronicles the adventures of a gr...,724.247784,Mankind was born on Earth. It was never meant ...,Interstellar,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 18, ""...",Mankind was born on Earth. It was never meant ...
788,Deadpool tells the origin story of former Spec...,514.569956,Witness the beginning of a happy ending,Deadpool,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Witness the beginning of a happy ending Deadpo...
94,"Light years from Earth, 26 years after being a...",481.098624,All heroes start somewhere.,Guardians of the Galaxy,"[{""id"": 28, ""name"": ""Action""}, {""id"": 878, ""na...",All heroes start somewhere. Light years from E...
127,An apocalyptic story set in the furthest reach...,434.278564,What a Lovely Day.,Mad Max: Fury Road,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",What a Lovely Day. An apocalyptic story set in...
...,...,...,...,...,...,...
4625,A Broadway producer puts on a play with a Devi...,0.001389,The hot spot where Satan's waitin'.,Midnight Cabaret,"[{""id"": 27, ""name"": ""Horror""}]",The hot spot where Satan's waitin'. A Broadway...
4118,"Raju, a waiter, is in love with the famous TV ...",0.001186,,Hum To Mohabbat Karega,[],"Raju, a waiter, is in love with the famous T..."
4727,A hitchhiker named Martel Gordone gets in a fi...,0.001117,"There's only one way out, and 100 fools stand ...",Penitentiary,"[{""id"": 28, ""name"": ""Action""}, {""id"": 18, ""nam...","There's only one way out, and 100 fools stand ..."
3361,A man who is having an affair with a married w...,0.000372,Don't you dare go in there!,Alien Zone,"[{""id"": 27, ""name"": ""Horror""}, {""id"": 28, ""nam...",Don't you dare go in there! A man who is havin...


In [161]:
#Select movies based on popularity scores
pop_mvs = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World',
'Pirates of the Caribbean: The Curse of the Black Pearl',
'Dawn of the Planet of the Apes', 'Terminator Genisys', 'Captain America: Civil War', 'The Dark Knight', 'The Martian','Batman v Superman: Dawn of Justice', 'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring', 'Harry Potter and the Chamber of Secrets', 'The Hobbit: The Battle of the Five Armies', 'Iron Man']

Create function for recommander system


In [159]:
def movie_recommander(movie_title, movies=movies_list, mov_sim=movie_simis):
    # Get movie id
    movie_idx = np.where(movies==movie_title)[0][0]
    # Find similar movies for given movie_idx
    movie_simis = cs_df.iloc[movie_idx].values 
    # Get top five movie similar ids
    sim_movie_idx = np.argsort(-movie_simis)[1:6]
    # Get top five movies
    sim_movies = movies_list[sim_movie_idx]
    return sim_movies


Get popular movies similarities

In [162]:
for movie in pop_mvs:
    print('Movie:', movie)
    print('Top 5 recommanded movies', movie_recommander(movie_title=movie))
    print()


Movie: Minions
Top 5 recommanded movies ['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Superman'
 'Rise of the Guardians']

Movie: Interstellar
Top 5 recommanded movies ['Gattaca' 'Space Cowboys' 'Space Pirate Captain Harlock'
 'Starship Troopers' 'Final Destination 2']

Movie: Deadpool
Top 5 recommanded movies ['Silent Trigger' 'Underworld: Evolution' 'Mars Attacks!' 'Bronson'
 'Shaft']

Movie: Jurassic World
Top 5 recommanded movies ['Jurassic Park' 'The Lost World: Jurassic Park' 'The Nut Job'
 "National Lampoon's Vacation" 'Vacation']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommanded movies ["Pirates of the Caribbean: Dead Man's Chest" 'The Pirate'
 'Pirates of the Caribbean: On Stranger Tides'
 'The Pirates! In an Adventure with Scientists!' 'Joyful Noise']

Movie: Dawn of the Planet of the Apes
Top 5 recommanded movies ['Battle for the Planet of the Apes' 'Groove' 'The Other End of the Line'
 'Chicago Overcoa

Okapi BM25 Ranking for Pairwise Document Similarity

It is quite popular algorithm like page rank used to retrieve information and search engines.


In [166]:
# Tokenize the corpus
 norm_corpus_tokens = np.array([nltk.word_tokenize(doc) for doc in norm_corpus ])
 norm_corpus_tokens[:4]

array([list(['enter', 'world', 'pandora', 'nd', 'century', 'paraplegic', 'marine', 'dispatched', 'moon', 'pandora', 'unique', 'mission', 'becomes', 'torn', 'following', 'orders', 'protecting', 'alien', 'civilization']),
       list(['end', 'world', 'adventure', 'begins', 'captain', 'barbossa', 'long', 'believed', 'dead', 'come', 'back', 'life', 'headed', 'edge', 'earth', 'turner', 'elizabeth', 'swann', 'nothing', 'quite', 'seems']),
       list(['plan', 'one', 'escapes', 'cryptic', 'message', 'bonds', 'past', 'sends', 'trail', 'uncover', 'sinister', 'organization', 'battles', 'political', 'forces', 'keep', 'secret', 'service', 'alive', 'bond', 'peels', 'back', 'layers', 'deceit', 'reveal', 'terrible', 'truth', 'behind', 'spectre']),
       list(['legend', 'ends', 'following', 'death', 'district', 'attorney', 'harvey', 'dent', 'batman', 'assumes', 'responsibility', 'dents', 'crimes', 'protect', 'late', 'attorneys', 'reputation', 'subsequently', 'hunted', 'gotham', 'city', 'police', 'dep

In [168]:
bm25 = get_bm25_weights(norm_corpus_tokens)
# movie similarities
bm25_df = pd.DataFrame(bm25)
bm25_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,133.236216,2.344334,0.0,3.861253,5.541154,4.618482,0.0,4.647223,0.0,1.792351,...,2.382959,0.0,3.635955,5.076981,0.0,0.0,0.0,1.687774,0.0,0.0
1,2.422878,111.138242,2.649992,0.0,7.078071,0.0,2.707911,5.956817,5.037028,1.792351,...,2.382959,0.0,1.110701,0.0,0.0,2.821835,0.0,4.723948,3.026945,0.0
2,0.0,2.993616,149.788109,0.0,0.0,0.0,0.0,4.681828,4.0582,1.4333,...,7.748826,0.0,0.0,0.0,3.550886,0.0,0.0,2.92165,0.0,1.4333
3,5.60769,0.0,0.0,221.106887,3.192287,0.0,5.892955,7.055147,6.3883,18.938899,...,0.0,0.0,2.985986,0.0,0.0,0.0,0.0,8.377811,7.000089,7.446661
4,8.388436,10.002182,0.0,2.787245,181.520289,0.0,5.605415,7.507656,0.0,6.802209,...,4.765918,0.0,0.0,0.0,0.0,3.102177,0.0,3.375549,0.0,0.0


Get popular movies similarities

In [169]:
for movie in pop_mvs:
    print('Movie:', movie)
    print('Top 5 recommanded movies', movie_recommander(movie_title=movie,mov_sim= bm25_df))
    print()

Movie: Minions
Top 5 recommanded movies ['Despicable Me 2' 'Despicable Me'
 'Teenage Mutant Ninja Turtles: Out of the Shadows' 'Superman'
 'Rise of the Guardians']

Movie: Interstellar
Top 5 recommanded movies ['Gattaca' 'Space Cowboys' 'Space Pirate Captain Harlock'
 'Starship Troopers' 'Final Destination 2']

Movie: Deadpool
Top 5 recommanded movies ['Silent Trigger' 'Underworld: Evolution' 'Mars Attacks!' 'Bronson'
 'Shaft']

Movie: Jurassic World
Top 5 recommanded movies ['Jurassic Park' 'The Lost World: Jurassic Park' 'The Nut Job'
 "National Lampoon's Vacation" 'Vacation']

Movie: Pirates of the Caribbean: The Curse of the Black Pearl
Top 5 recommanded movies ["Pirates of the Caribbean: Dead Man's Chest" 'The Pirate'
 'Pirates of the Caribbean: On Stranger Tides'
 'The Pirates! In an Adventure with Scientists!' 'Joyful Noise']

Movie: Dawn of the Planet of the Apes
Top 5 recommanded movies ['Battle for the Planet of the Apes' 'Groove' 'The Other End of the Line'
 'Chicago Overcoa