## Movie recommender system with document similarity ##

Cluster movies using the following clustering algorithms:

1. K-Means Clustering
2. Affinity propagation
3. Agglomerative Hierarchical clustering



Import the required libraries

In [51]:
import pandas as pd 
import nltk
from nltk.tokenize import word_tokenize
import re
import numpy as np 
from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.cluster import KMeans
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import AffinityPropagation

Load the TMDB dataset

In [5]:
df = pd.read_csv('./tmdb_dataset/tmdb_5000_movies.csv')
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124


Let us explore the dataset

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

We do not need every column for a content-based recommender, so we keep only a few of them.

In [7]:
# .copy() makes the subset an independent frame, avoiding SettingWithCopyWarning below
df = df[['overview','popularity','tagline','title','genres']].copy()
df['tagline'] = df['tagline'].fillna(' ')
# Create a new variable "description" by combining tagline and overview
df['description'] = df['tagline'].map(str) + ' ' + df['overview']
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   overview     4800 non-null   object 
 1   popularity   4800 non-null   float64
 2   tagline      4800 non-null   object 
 3   title        4800 non-null   object 
 4   genres       4800 non-null   object 
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


In [8]:
df.head()

Unnamed: 0,overview,popularity,tagline,title,genres,description
0,"In the 22nd century, a paraplegic Marine is di...",150.437577,Enter the World of Pandora.,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Enter the World of Pandora. In the 22nd centur...
1,"Captain Barbossa, long believed to be dead, ha...",139.082615,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","At the end of the world, the adventure begins...."
2,A cryptic message from Bond’s past sends him o...,107.376788,A Plan No One Escapes,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",A Plan No One Escapes A cryptic message from B...
3,Following the death of District Attorney Harve...,112.31295,The Legend Ends,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",The Legend Ends Following the death of Distric...
4,"John Carter is a war-weary, former military ca...",43.926995,"Lost in our world, found in another.",John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","Lost in our world, found in another. John Cart..."
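The `genres` column holds JSON-encoded strings rather than Python lists. A minimal sketch of extracting the genre names, assuming each cell is a valid JSON array of objects with a `name` key; the helper `genre_names` and the sample string are illustrative and not part of the notebook:

```python
import json

def genre_names(raw):
    """Return the list of genre names from one JSON-encoded `genres` cell."""
    return [entry["name"] for entry in json.loads(raw)]

# Illustrative sample in the same shape as the dataset's `genres` column.
sample = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'
names = genre_names(sample)
print(names)
```

On the DataFrame this could be applied with `df['genres'].map(genre_names)`.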


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   overview     4800 non-null   object 
 1   popularity   4800 non-null   float64
 2   tagline      4800 non-null   object 
 3   title        4800 non-null   object 
 4   genres       4800 non-null   object 
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


Text preprocessing

In [10]:
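nltk.download('punkt')
# the stop-word list is a separate NLTK resource and must be downloaded as well
nltk.download('stopwords')
# English stop words to filter out
stop_words = nltk.corpus.stopwords.words('english')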

# Normalization of each document
def normalize_doc(doc):
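    # remove special characters, then lowercase and strip whitespace
    # (flags must be passed by keyword: the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I|re.A)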
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = nltk.word_tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

norm_corpus = np.vectorize(normalize_doc)

[nltk_data] Downloading package punkt to /home/csuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
norm_corpus = norm_corpus(list(df['description']))
len(norm_corpus)

4800
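To see what the normalization step does to a single description, here is a dependency-free sketch of the same pipeline. It swaps NLTK's tokenizer and stop-word list for `str.split` and a tiny hand-written stop-word set (both are assumptions made so the snippet runs without downloads), and it also illustrates why flags should be passed to `re.sub` by keyword:

```python
import re

# Tiny illustrative stop-word set; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "of", "a", "in", "is", "to"}

def normalize_text(doc):
    """Strip non-letters, lowercase, and drop stop words (simplified sketch)."""
    doc = re.sub(r"[^a-zA-Z\s]", "", doc, flags=re.A)
    tokens = doc.lower().split()  # whitespace split stands in for word_tokenize
    return " ".join(t for t in tokens if t not in STOP_WORDS)

cleaned = normalize_text("Enter the World of Pandora. In the 22nd century...")
print(cleaned)

# Pitfall: the fourth positional parameter of re.sub is `count`, not `flags`,
# and re.I has the integer value 2. Passing re.I positionally therefore caps
# the number of substitutions at two, which the keyword form makes explicit.
limited = re.sub(r"\d", "", "a1b2c3d4", count=int(re.I))
print(limited)
```

The second call removes only the first two digits, which is the silent behavior a positional `re.I` would trigger.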

Extract bag-of-words features (unigram and bigram counts from CountVectorizer, not TF-IDF weights)

In [17]:
cv = CountVectorizer(ngram_range=(1,2), min_df=10, max_df=0.8, stop_words=stop_words)
cv_matrix = cv.fit_transform(norm_corpus)
cv_matrix.shape

(4800, 2986)
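The cell above produces raw term counts. For comparison, TF-IDF down-weights terms that occur in many documents. A minimal pure-Python sketch on a made-up three-document corpus, assuming the smoothed IDF that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization; the corpus and the function name are illustrative:

```python
import math
from collections import Counter

docs = ["space war hero", "space love story", "love story drama"]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return an L2-normalized {term: weight} mapping for one document."""
    counts = Counter(tokens)
    weights = {
        t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
        for t, c in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf(tokenized[0])
print(first)
```

In the first document, "war" and "hero" appear nowhere else and so receive a higher weight than "space", which also occurs in the second document. With raw counts all three terms would weigh the same.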

K-Means Clustering

In [24]:
km = KMeans(n_clusters=5, max_iter=10000,n_init=60,random_state=34 )
km.fit_transform(cv_matrix)

array([[7.28010989, 3.6519611 , 3.9817426 , 3.98029933, 3.75893985],
       [7.54983444, 4.31573155, 4.59718735, 4.32598568, 4.44311375],
       [7.93725393, 5.21784641, 5.24621038, 5.23742411, 5.06494748],
       ...,
       [8.24621125, 5.59092354, 5.81154542, 5.81162711, 5.68566806],
       [8.71779789, 6.12085998, 5.65894112, 6.12946221, 6.04096558],
       [8.60232527, 5.98247229, 5.78325812, 5.95921267, 5.86495327]])

In [25]:
df['kmeans_cluster'] = km.labels_
df.head(10)

Viewing the distribution of movies across the clusters 

In [27]:
Counter(km.labels_)

Counter({1: 523, 4: 2992, 2: 561, 3: 723, 0: 1})

Let us look at ten movies from each cluster, ordered by popularity:

In [29]:
# note: the sort is ascending, so head(10) keeps the ten lowest-popularity titles per cluster
movie_clusters = (df[['title','kmeans_cluster','popularity']].sort_values(by=['kmeans_cluster','popularity']).groupby('kmeans_cluster').head(10))
movie_clusters = movie_clusters.copy(deep=True)
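The cell above sorts both columns in ascending order. To rank each cluster from most to least popular instead, the sort direction can be set per column. A sketch on a tiny made-up frame (titles, cluster labels and scores are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "cluster": [0, 0, 0, 1, 1],
    "popularity": [3.0, 9.0, 5.0, 1.0, 7.0],
})

# Sort clusters ascending but popularity descending, then keep the top 2 per cluster.
top = (toy.sort_values(by=["cluster", "popularity"], ascending=[True, False])
          .groupby("cluster")
          .head(2))
print(top["title"].tolist())
```

`groupby(...).head(n)` keeps the row order of the sorted frame, so each cluster's rows come out from highest to lowest popularity.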

In [43]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_names = cv.get_feature_names_out()
top_nfeatures = 13
order_centroids = km.cluster_centers_.argsort()[:,::-1]

# get key features and movies for each cluster
num_clusters = 5
for cluster_num in range(num_clusters):
    key_features = [feature_names[index] for index in order_centroids[cluster_num,:top_nfeatures]]
    movies = movie_clusters[movie_clusters['kmeans_cluster'] == cluster_num]['title'].values.tolist()
    print('Cluster '+str(cluster_num+1))
    print('key features',key_features)
    print('Popular Movies',movies)
    print('*'*107)

Cluster 1
key features ['childhood', 'friend', 'journey', 'hot', 'lives', 'friends', 'apart', 'discover', 'fourth', 'three friends', 'attempting', 'call', 'three']
Popular Movies ['Without a Paddle']
***********************************************************************************************************
Cluster 2
key features ['world', 'war', 'story', 'young', 'world war', 'man', 'find', 'new', 'must', 'save', 'time', 'love', 'evil']
Popular Movies ["Anderson's Cross", '8 Days', 'Antarctic Edge: 70° South', 'Proud', 'Sharkskin', 'On The Downlow', 'Rise of the Entrepreneur: The Search for a Better Way', "A Beginner's Guide to Snuff", 'Broken Vessels', 'Heroes of Dirt']
***********************************************************************************************************
Cluster 3
key features ['new', 'york', 'new york', 'city', 'young', 'family', 'love', 'man', 'york city', 'years', 'friends', 'find', 'must']
Popular Movies ['Hav Plenty', 'Four Single Fathers', 'An American in H
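
The `argsort()[:, ::-1]` idiom used above ranks, for every centroid, the feature indices from largest to smallest weight. A small sketch with a made-up vocabulary and centroid matrix (both illustrative):

```python
import numpy as np

# Hypothetical vocabulary and two cluster centroids (rows) over four features.
feature_names = np.array(["war", "love", "city", "friend"])
centroids = np.array([
    [0.9, 0.1, 0.3, 0.0],
    [0.2, 0.8, 0.1, 0.6],
])

# argsort sorts ascending; reversing the columns gives descending order.
order = centroids.argsort()[:, ::-1]
top_terms = [feature_names[row[:2]].tolist() for row in order]
print(top_terms)
```

Each inner list holds the two highest-weighted terms of one centroid, which is how the "key features" lists are built.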

Clustering by cosine similarities

In [45]:
cosine_sim_features = cosine_similarity(cv_matrix)
km = KMeans(n_clusters=5, max_iter=10000,n_init=60,random_state=34 )
km.fit_transform(cosine_sim_features)

array([[1.97530605, 3.08274383, 2.40553603, 2.96363438, 2.84530743],
       [2.65833149, 2.48319555, 3.04908495, 3.22593932, 3.18335888],
       [2.61583361, 2.88562468, 2.09989486, 2.7989796 , 2.73614097],
       ...,
       [2.1603031 , 2.86633778, 2.40104916, 2.79209865, 2.72830537],
       [3.33286562, 3.41100297, 3.20298224, 2.10378656, 3.42878436],
       [2.90525791, 2.94061459, 2.58016964, 2.40753059, 2.90487785]])
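
Cosine similarity compares two count vectors by the angle between them, so document length does not matter. A minimal pure-Python sketch of the quantity that `cosine_similarity` computes for one pair of documents; the two sample strings and the helper name are made up:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc1 = Counter("world war hero saves world".split())
doc2 = Counter("hero saves city".split())
sim = cosine(doc1, doc2)
print(round(sim, 3))
```

Applied to every pair of the 4800 documents, this yields the 4800 x 4800 matrix that K-Means receives as its input features in the cell above.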

In [46]:
df['kmeans_cluster'] = km.labels_
df.head()

Unnamed: 0,overview,popularity,tagline,title,genres,description,kmeans_cluster
0,"In the 22nd century, a paraplegic Marine is di...",150.437577,Enter the World of Pandora.,Avatar,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",Enter the World of Pandora. In the 22nd centur...,0
1,"Captain Barbossa, long believed to be dead, ha...",139.082615,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...","At the end of the world, the adventure begins....",1
2,A cryptic message from Bond’s past sends him o...,107.376788,A Plan No One Escapes,Spectre,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",A Plan No One Escapes A cryptic message from B...,2
3,Following the death of District Attorney Harve...,112.31295,The Legend Ends,The Dark Knight Rises,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",The Legend Ends Following the death of Distric...,3
4,"John Carter is a war-weary, former military ca...",43.926995,"Lost in our world, found in another.",John Carter,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...","Lost in our world, found in another. John Cart...",0


Viewing the distribution of movies across the clusters

In [47]:
Counter(km.labels_)

Counter({0: 448, 1: 718, 2: 2633, 3: 525, 4: 476})

Let us look at ten movies from each cluster, ordered by popularity:

In [48]:
movie_clusters = (df[['title','kmeans_cluster','popularity']].sort_values(by=['kmeans_cluster','popularity']).groupby('kmeans_cluster').head(10))
movie_clusters = movie_clusters.copy(deep=True)

In [50]:
# get movies for each cluster
num_clusters = 5
for cluster_num in range(num_clusters):
    movies = movie_clusters[movie_clusters['kmeans_cluster'] == cluster_num]['title'].values.tolist()
    print('Cluster '+str(cluster_num+1))
    print('Popular Movies',movies)
    print('*'*107)

Cluster 1
Popular Movies ["Anderson's Cross", '8 Days', 'Antarctic Edge: 70° South', 'Proud', 'Sharkskin', 'On The Downlow', 'Rise of the Entrepreneur: The Search for a Better Way', "A Beginner's Guide to Snuff", 'Broken Vessels', 'Journey from the Fall']
***********************************************************************************************************
Cluster 2
Popular Movies ['Smiling Fish & Goat On Fire', 'Butterfly Girl', 'Quinceañera', 'Theresa Is a Mother', 'Manito', 'Dolphins and Whales: Tribes of the Ocean', 'Dry Spell', 'To Save A Life', 'Bran Nue Dae', 'Ayurveda: Art of Being']
***********************************************************************************************************
Cluster 3
Popular Movies ['Alien Zone', 'Penitentiary', 'Midnight Cabaret', 'Down & Out With The Dolls', 'The Work and The Story', 'Fabled', "The Legend of God's Gun", 'The Young Unknowns', 'Short Cut to Nirvana: Kumbh Mela', 'The Blood of My Brother: A Story of Death in Iraq']
*********

Affinity propagation clustering, which avoids having to specify the number of clusters in advance.

In [52]:
ap = AffinityPropagation(max_iter=1000)
ap.fit(cosine_sim_features)
res = Counter(ap.labels_)
res.most_common(10)

[(181, 1355),
 (180, 97),
 (156, 87),
 (52, 73),
 (109, 56),
 (77, 56),
 (24, 48),
 (14, 48),
 (22, 45),
 (169, 44)]
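
The cell above passes the cosine-similarity matrix as if it were a feature matrix, so with the default settings Affinity Propagation measures negative squared Euclidean distance between its rows. scikit-learn also accepts a similarity matrix directly through `affinity='precomputed'`. A hedged sketch of that alternative on toy data; the points are made up, this is not what the notebook ran, and results on the movie data would differ:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics.pairwise import cosine_similarity

# Toy count vectors: two groups pointing in clearly different directions.
X = np.array([
    [5, 1, 0], [4, 1, 0], [6, 2, 0],
    [0, 1, 5], [0, 2, 6], [0, 1, 4],
])
sim = cosine_similarity(X)

# affinity='precomputed' tells the estimator that `sim` already holds similarities.
ap_toy = AffinityPropagation(affinity="precomputed", max_iter=1000, random_state=0)
labels = ap_toy.fit_predict(sim)
print(labels)
```

The number of clusters still depends on the `preference` parameter, which defaults to the median of the input similarities.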

In [53]:
df['ap_cluster'] = ap.labels_
filtered_clusters = [item[0] for item in res.most_common(10)]
filtered_df = df[df['ap_cluster'].isin(filtered_clusters)]
movie_clusters = (df[['title','ap_cluster','popularity']].sort_values(by=['ap_cluster','popularity']).groupby('ap_cluster').head(10))
movie_clusters = movie_clusters.copy(deep=True)

In [55]:
# get movies for each cluster
# note: this loops over labels 0..9, not over the labels stored in filtered_clusters
for cluster_num in range(len(filtered_clusters)):
    movies = movie_clusters[movie_clusters['ap_cluster'] == cluster_num]['title'].values.tolist()
    print('Cluster '+str(cluster_num+1))
    print('Popular Movies',movies)
    print('*'*107)

Cluster 1
Popular Movies ['My Date with Drew', 'Swimfan', 'She Wore a Yellow Ribbon', 'Friday the 13th Part 2', 'Martha Marcy May Marlene', 'Ride Along', 'Miss Congeniality', 'Tangled', 'Lilo & Stitch', 'Hancock']
***********************************************************************************************************
Cluster 2
Popular Movies ['Blue Car', 'Miracle', 'The Abyss', 'Team America: World Police', 'Enchanted', 'Independence Day: Resurgence', 'Coraline', 'The 5th Wave', 'Pitch Perfect 2', 'The Avengers']
***********************************************************************************************************
Cluster 3
Popular Movies ['Samantha: An American Girl Holiday', 'Fido', 'The Brothers Bloom', "A Turtle's Tale: Sammy's Adventures", 'Ponyo', 'Apocalypto', 'The Last Witch Hunter', 'The Hunger Games', 'Star Trek Into Darkness', 'World War Z']
***********************************************************************************************************
Cluster 4
Popular M