<a href="https://colab.research.google.com/github/jasonpark9001/NLP/blob/main/Movie_Recommender_System_with_FastText_Embeddings.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommender System with FastText Embeddings

__We will be building a movie recommendation system here where based on data\metadata pertaining to different movies, we try and recommend similar movies of interest!__


## Steps:
- Load Data
-Veiw some of rows of dataframe 
-Create 'Description' column
-Text Preprocessing(normalization of document)
-Use gensim to train a FastText model on the processed corpus
-Create Movie Recommender


## Load Data


In [14]:
import pandas as pd

df = pd.read_csv('tmdb_5000_movies.csv.gz', compression='gzip')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4803 entries, 0 to 4802
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   budget                4803 non-null   int64  
 1   genres                4803 non-null   object 
 2   homepage              1712 non-null   object 
 3   id                    4803 non-null   int64  
 4   keywords              4803 non-null   object 
 5   original_language     4803 non-null   object 
 6   original_title        4803 non-null   object 
 7   overview              4800 non-null   object 
 8   popularity            4803 non-null   float64
 9   production_companies  4803 non-null   object 
 10  production_countries  4803 non-null   object 
 11  release_date          4802 non-null   object 
 12  revenue               4803 non-null   int64  
 13  runtime               4801 non-null   float64
 14  spoken_languages      4803 non-null   object 
 15  status               

### View top few rows of the dataframe 

In [15]:
df.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


In [16]:
column_names = ['title', 'tagline', 'overview', 'genres', 'popularity']
df = df[column_names]
df.tagline.fillna('',inplace= True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return self._update_inplace(result)


In [17]:
df.head()

Unnamed: 0,title,tagline,overview,genres,popularity
0,Avatar,Enter the World of Pandora.,"In the 22nd century, a paraplegic Marine is di...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",150.437577
1,Pirates of the Caribbean: At World's End,"At the end of the world, the adventure begins.","Captain Barbossa, long believed to be dead, ha...","[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",139.082615
2,Spectre,A Plan No One Escapes,A cryptic message from Bond’s past sends him o...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",107.376788
3,The Dark Knight Rises,The Legend Ends,Following the death of District Attorney Harve...,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",112.31295
4,John Carter,"Lost in our world, found in another.","John Carter is a war-weary, former military ca...","[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",43.926995


### Merge text from tagline column with text from overview column 

So the new description columns will have both information of "tagline" and "overveiw" column.

In [18]:
df['description'] =df['tagline'].map(str)+ ' '+ df['overview'].map(str)

In [21]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4800 entries, 0 to 4802
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   title        4800 non-null   object 
 1   tagline      4800 non-null   object 
 2   overview     4800 non-null   object 
 3   genres       4800 non-null   object 
 4   popularity   4800 non-null   float64
 5   description  4800 non-null   object 
dtypes: float64(1), object(5)
memory usage: 262.5+ KB


## Text Preprocessing

- prepare the text colunmns for analysis


In [22]:
import nltk
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [23]:
import re
import numpy as np
stop_words = nltk.corpus.stopwords.words('english')

In [32]:
#create a function which normalize the documents

def normalize_document(doc):
    # remove special characters\whitespaces, ignore case
    doc = re.sub(r'[^a-zA-Z/s]', ' ', doc,  flags= re.I|re.A)

    # lower case  
    doc = doc.lower()

    # remove whitespaces
    doc = doc.strip()

    # tokenize document
    tokens = nltk.word_tokenize(doc)

    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]

    # re-create/merge sentences from filtered content
    doc = ' '.join(filtered_tokens)
    return doc

In [33]:
#create the normalized corpus
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(list(df['description']))

In [34]:
#check the numbe rof corpus
len(norm_corpus)

4800

In [35]:
norm_corpus

array(['enter world pandora nd century paraplegic marine dispatched moon pandora unique mission becomes torn following orders protecting alien civilization',
       'end world adventure begins captain barbossa long believed dead come back life headed edge earth turner elizabeth swann nothing quite seems',
       'plan one escapes cryptic message bond past sends trail uncover sinister organization battles political forces keep secret service alive bond peels back layers deceit reveal terrible truth behind spectre',
       ...,
       'signed sealed delivered introduces dedicated quartet civil servants dead letter office u postal system transform elite team lost mail detectives determination deliver seemingly undeliverable takes post office unpredictable world letters packages past save lives solve crimes reunite old loves change futures arriving late always miraculously time',
       'new yorker shanghai ambitious new york attorney sam sent shanghai assignment immediately stumbles legal

###  Use ``gensim`` to train a FastText model on the normalized corpus 



- the embedding size to be 300
- context to be around 30
- min word count to be 2 
- use a skipgram model
- iterations can be 50 



In [30]:
import logging
logging.basicConfig(format = '%(asctime)s : %(levelname)s : %(message)s', level= logging.INFO)

In [None]:
from gensim.models import FastText

# iterate normalized corpus and split
tokenized_docs = [nltk.word_tokenize(doc) for doc in norm_corpus]

# Set values for various parameters
feature_size = 300   # Set Word embedding dimensionality 
window_context = 30  # Set Context window size                                                                                  
min_word_count = 2   # Set Minimum word count                    
sg = 1              # set skip-gram model flag

# train FastText model
ft_model = FastText(tokenized_docs, size=feature_size, window= window_context, min_count = min_word_count, sg=sg, iter= 50)

In [60]:
ft_model

<gensim.models.fasttext.FastText at 0x7f359d94bad0>

In [64]:
def average_W2V_Vectorizer(corpus, model, num_features):
    vocab = set(model.wv.index2word) #one can access its keyed vectors via the model.wv attributes. 
  
  
  
    def average_W2V_Vector(words, model, vocab, num_features ):
          feature_vector = np.zeros((num_features,),dtype="float64" )
          n_word = 0
          for word in words:
                  if word in vocab:
                            n_word = n_word+1
                            feature_vector = np.add(feature_vector, model[word])
      
                  if n_word:
                            feature_vector = np.divide(feature_vector, n_word)
          return feature_vector

    features= [average_W2V_Vector(tokenized_sentence, model, vocab, num_features) for tokenized_sentence in corpus]
    return np.array(features)

In [65]:
ft_vec_doc = average_W2V_Vectorizer(corpus = tokenized_docs , model = ft_model, num_features = feature_size )
ft_vec_doc.shape

  if sys.path[0] == '':


(4800, 300)

In [72]:
vec_df = pd.DataFrame(ft_vec_doc)

## Get Movie Recommendations

Use a content based recommendation system to find similar movies based on the movie's description
- **Cosign similarity** is used.

In [67]:
from sklearn.metrics.pairwise import cosine_similarity

In [73]:
doc_sim = cosine_similarity(vec_df.values)
doc_sim_df = pd.DataFrame(doc_sim)
doc_sim_df.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,4790,4791,4792,4793,4794,4795,4796,4797,4798,4799
0,1.0,0.092244,0.258557,0.076623,0.157003,0.17905,0.135307,0.236105,0.171401,0.085183,...,0.189982,0.116467,0.103358,0.131448,0.222977,0.177339,0.156279,0.161117,0.14501,0.13522
1,0.092244,1.0,0.135482,0.15166,0.099395,0.195379,0.178484,0.097387,0.100567,0.092886,...,0.058422,0.147661,0.138236,0.140639,0.150708,0.126868,0.099938,0.060208,0.249185,0.106623
2,0.258557,0.135482,1.0,0.314557,0.218783,0.171528,0.187098,0.220585,0.254631,0.269043,...,0.202046,0.157564,0.105428,0.127249,0.240188,0.200722,0.17442,0.269616,0.209187,0.153909
3,0.076623,0.15166,0.314557,1.0,0.172601,0.109631,0.190052,0.063325,0.044828,0.20173,...,0.149527,0.062245,0.043369,0.121631,0.251044,0.076216,0.075151,0.146943,0.233765,0.07463
4,0.157003,0.099395,0.218783,0.172601,1.0,0.08245,0.14606,0.113062,0.102454,0.177824,...,0.12239,0.02574,0.122461,0.148873,0.173147,0.130495,0.082466,0.048878,0.194673,0.098041


### Get a list of Movie titles, Movie title and it's index
-Get the ID for the movie **Minions**

In [74]:
#movie ID
movies_list = df['title']
movies_list

0                                         Avatar
1       Pirates of the Caribbean: At World's End
2                                        Spectre
3                          The Dark Knight Rises
4                                    John Carter
                          ...                   
4798                                 El Mariachi
4799                                   Newlyweds
4800                   Signed, Sealed, Delivered
4801                            Shanghai Calling
4802                           My Date with Drew
Name: title, Length: 4800, dtype: object

546

In [84]:
##movie ID
movie_idx_minions = np.where(movies_list == 'Minions')[0][0]
movie_idx_minions

546

## Extract row of the movie 'Minions' in the dataframe


In [100]:
movie_similarities = doc_sim_df.iloc[546].values
movie_similarities

array([0.20479107, 0.09085604, 0.23326263, ..., 0.16425708, 0.0909234 ,
       0.06744803])

##Get top 5 movie names by using top 5 movie IDs

In [109]:
similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
similar_movie_idxs

array([ 813,  109, 1424,  785, 1822])

In [110]:
similar_movie_name = movies_list[similar_movie_idxs]
similar_movie_name

813                                              Superman
109     The Chronicles of Narnia: The Voyage of the Da...
1424                                           Concussion
785                                        Beyond Borders
1822                                    Forbidden Kingdom
Name: title, dtype: object

##Movie Recommender
-Based on the previous steps, we can build our own movie recommender system.

In [123]:
def movie_recommender(movie_title, movies=movies_list, doc_sims=None):
    # find movie id
    movie_idx = np.where(movies_list == movie_title)[0][0]

    # get movie similarities. 
    
    movie_similarities = doc_sims.iloc[movie_idx].values
    
    # get top 5 similar movie IDs
    similar_movie_idxs = np.argsort(-movie_similarities)[1:6]
    
    # get top 5 movies
    similar_movies = movies_list[similar_movie_idxs]
    
    # return the top 5 movies
    return similar_movies

In [135]:
#try the system with movie,'superman'
similar_movie_super = movie_recommender(movie_title = 'Superman', movies = movies_list, doc_sims= doc_sim_df)

In [136]:
similar_movie_super

546                                               Minions
109     The Chronicles of Narnia: The Voyage of the Da...
1424                                           Concussion
391                                             Enchanted
785                                        Beyond Borders
Name: title, dtype: object

In [124]:
popular_movies = ['Minions', 'Interstellar', 'Deadpool', 'Jurassic World', 'Pirates of the Caribbean: The Curse of the Black Pearl',
              'Dawn of the Planet of the Apes', 'The Hunger Games: Mockingjay - Part 1', 'Terminator Genisys', 
              'Captain America: Civil War', 'The Dark Knight', 'The Martian', 'Batman v Superman: Dawn of Justice', 
              'Pulp Fiction', 'The Godfather', 'The Shawshank Redemption', 'The Lord of the Rings: The Fellowship of the Ring',  
              'Harry Potter and the Chamber of Secrets', 'Star Wars', 'The Hobbit: The Battle of the Five Armies',
              'Iron Man']

In [137]:
for movie in popular_movies:
    print('Movie:', movie)
    
    print('Top 5 recommended Movies:', movie_recommender(movie_title=movie, movies=movies_list, doc_sims=doc_sim_df))
    print()

Movie: Minions
Top 5 recommended Movies: 813                                              Superman
109     The Chronicles of Narnia: The Voyage of the Da...
1424                                           Concussion
785                                        Beyond Borders
1822                                    Forbidden Kingdom
Name: title, dtype: object

Movie: Interstellar
Top 5 recommended Movies: 1317         White Squall
4509         Love Letters
1529        Out of Africa
1269    Raise the Titanic
2901        5 Days of War
Name: title, dtype: object

Movie: Deadpool
Top 5 recommended Movies: 4516    Kingdom of the Spiders
3896                  Sinister
4773                    Clerks
3243               Brown Sugar
2549        Where the Heart Is
Name: title, dtype: object

Movie: Jurassic World
Top 5 recommended Movies: 3458                   Duel in the Sun
3021                   Invasion U.S.A.
348     Ice Age: Dawn of the Dinosaurs
1165       Back to the Future Part III
236     