## Plot description-based recommender
Our plot description-based recommender will take in a movie title as an argument and
output a list of movies that are most similar based on their plots. These are the steps we are
going to perform in building this model:
1. Obtain the data required to build the model
2. Create TF-IDF vectors for the plot description (or overview) of every movie
3. Compute the cosine similarity score between every pair of movies
4. Write the recommender function that takes in a movie title as an argument and outputs the movies most similar to it based on their plots

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
import numpy as np
import pandas as pd

### Preparing the data

In [3]:
local_data_path = "../data/"
drive_data_path = "/content/drive/MyDrive/Hands-On Recommendation Systems with Python/data/"

In [4]:
orig_df = pd.read_csv(drive_data_path + "movies_metadata.csv", low_memory = False)
orig_df.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [5]:
df = pd.read_csv(drive_data_path + "metadata_clean.csv")
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995


In [6]:
df['overview'], df['id'] = orig_df["overview"], orig_df["id"]
df.head()

Unnamed: 0,title,genres,runtime,vote_average,vote_count,year,overview,id
0,Toy Story,"['Animation', 'Comedy', 'Family']",81.0,7.7,5415.0,1995,"Led by Woody, Andy's toys live happily in his ...",862
1,Jumanji,"['Adventure', 'Fantasy', 'Family']",104.0,6.9,2413.0,1995,When siblings Judy and Peter discover an encha...,8844
2,Grumpier Old Men,"['Romance', 'Comedy']",101.0,6.5,92.0,1995,A family wedding reignites the ancient feud be...,15602
3,Waiting to Exhale,"['Comedy', 'Drama', 'Romance']",127.0,6.1,34.0,1995,"Cheated on, mistreated and stepped on, the wom...",31357
4,Father of the Bride Part II,['Comedy'],106.0,5.7,173.0,1995,Just when George Banks has recovered from his ...,11862


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   title         45460 non-null  object 
 1   genres        45466 non-null  object 
 2   runtime       45203 non-null  float64
 3   vote_average  45460 non-null  float64
 4   vote_count    45460 non-null  float64
 5   year          45466 non-null  int64  
 6   overview      44512 non-null  object 
 7   id            45466 non-null  object 
dtypes: float64(3), int64(1), object(4)
memory usage: 2.8+ MB


### Text Encoding using TF-IDF

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
no_of_movies_to_consider = 1000

In [21]:
df["overview"][:no_of_movies_to_consider].shape

(1000,)

In [22]:
tfidf = TfidfVectorizer(stop_words="english")
df["overview"] = df["overview"].fillna('')
tfidf_matrix = tfidf.fit_transform(df["overview"][:no_of_movies_to_consider])
type(tfidf_matrix), tfidf_matrix.shape

(scipy.sparse._csr.csr_matrix, (1000, 9397))

### Computing the cosine similarity score

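- `TfidfVectorizer` L2-normalizes each row by default (`norm="l2"`), so every non-empty plot vector has a magnitude of 1. An overview that was filled with an empty string becomes an all-zero vector instead.
- The cosine similarity of two unit-length vectors equals their dot product, so our work is reduced to computing the much simpler and computationally cheaper dot product.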
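
The equivalence between the dot product and cosine similarity can be checked on a small scale. The following is a minimal sketch on a made-up three-document corpus (the `docs` list and the `toy_matrix` name are illustrative assumptions, not part of the dataset), relying on the default `norm="l2"` setting of `TfidfVectorizer`:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus, used only for illustration.
docs = [
    "a lion cub becomes king",
    "toys come to life when the boy leaves",
    "a young lion flees the kingdom",
]

toy_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)

# With the default norm="l2", every non-empty row has unit Euclidean length.
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
row_norms = np.asarray(row_norms).ravel()

# For unit-length rows, the plain dot product equals the cosine similarity.
dot_products = linear_kernel(toy_matrix, toy_matrix)
cosines = cosine_similarity(toy_matrix, toy_matrix)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot_products, cosines))
```

Both checks should print `True`. Rows that come from empty overviews are all zeros, so their dot product with every other row is simply 0.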

In [23]:
# import linear kernel to compute the dot product
from sklearn.metrics.pairwise import linear_kernel

In [24]:
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [41]:
type(cosine_sim), cosine_sim.shape

(numpy.ndarray, (1000, 1000))

In [38]:
import h5py

dataset_name = 'cosine similarity'
with h5py.File('cosine_sim.h5', 'w') as hf:
    hf.create_dataset(dataset_name, data=cosine_sim)


### Building the recommender function

We will perform the following steps in building the recommender function:
1. Declare the title of the movie as an argument.
2. Obtain the index of the movie from the `indices` reverse mapping.
3. Get the list of cosine similarity scores between that movie and all movies using `cosine_sim`. Convert this into a list of tuples where the first element is the position and the second is the similarity score.
4. Sort this list of tuples by cosine similarity score, in descending order.
5. Skip the first element, since it is the similarity of the movie with itself (the movie most similar to a particular movie is the movie itself), and keep the next 10 elements.
6. Return the titles corresponding to the positions of those 10 elements.
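
Steps 3 to 5 can be traced by hand on a tiny example. This is only a sketch: the 4x4 matrix `toy_sim` and the titles `A` to `D` are made up, and it keeps the next two entries rather than the next 10.

```python
import numpy as np

# Hypothetical 4x4 similarity matrix for four made-up movies.
toy_sim = np.array([
    [1.0, 0.2, 0.7, 0.0],
    [0.2, 1.0, 0.1, 0.4],
    [0.7, 0.1, 1.0, 0.3],
    [0.0, 0.4, 0.3, 1.0],
])
toy_titles = ["A", "B", "C", "D"]

idx = 0  # position of the query movie "A"

# Step 3: pair each position with its similarity to the query movie.
sim_scores = list(enumerate(toy_sim[idx]))
# Step 4: sort by score, highest first.
sim_scores = sorted(sim_scores, key=lambda pair: pair[1], reverse=True)
# Step 5: skip the query movie itself, then keep the next two entries.
top = sim_scores[1:3]

recommended = [toy_titles[position] for position, _ in top]
print(recommended)  # ['C', 'B']
```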

In [47]:
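# Reverse mapping from title to row position. Note that drop_duplicates() compares
# the values (the row positions), so repeated titles remain in the index.
indices = pd.Series(df.index, index = df["title"]).drop_duplicates()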
print(indices[:5])
print(type(indices), indices.shape)

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64
<class 'pandas.core.series.Series'> (45466,)
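
The mapping still has 45466 entries because `Series.drop_duplicates()` removes repeated values, and the values here are unique row positions. If one entry per title is wanted, one possible approach is to filter on the index instead. The sketch below uses a made-up `toy_df` and an assumed `n_rows` (the number of rows covered by the similarity matrix), and it is not what the notebook itself does:

```python
import pandas as pd

# Hypothetical toy frame with a repeated title, for illustration only.
toy_df = pd.DataFrame({"title": ["Heat", "Sabrina", "Heat", "Casino"]})
n_rows = 3  # assumed number of rows covered by the similarity matrix

# Values are unique row positions, so drop_duplicates() removes nothing.
naive = pd.Series(toy_df.index, index=toy_df["title"]).drop_duplicates()

# Restrict to the covered rows, then keep the first row for each title.
subset = toy_df.iloc[:n_rows]
toy_indices = pd.Series(subset.index, index=subset["title"])
toy_indices = toy_indices[~toy_indices.index.duplicated(keep="first")]

print(len(naive))           # all four rows survive
print(toy_indices["Heat"])  # a single integer position
```

Keeping the first occurrence is an arbitrary choice. Distinguishing films that share a title would need an extra key such as the release year.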


In [None]:
import pickle

In [50]:
# save the pickle file
with open('indices.pkl', 'wb') as f:
    pickle.dump(indices, f)


In [53]:
with open('indices.pkl', 'rb') as f:
    indices = pickle.load(f)

In [54]:
print(indices[:5])
print(type(indices), indices.shape)

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64
<class 'pandas.core.series.Series'> (45466,)


In [56]:
dataset_name = 'cosine similarity'
with h5py.File('cosine_sim.h5', 'r') as hf:
    cosine_sim = hf[dataset_name][:]

type(cosine_sim), cosine_sim.shape

(numpy.ndarray, (1000, 1000))

In [57]:
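def content_recommender(title, cosine_sim=cosine_sim, df=df, indices=indices):
    # Look up the row position of the movie from its title
    idx = indices[title]
    # Pair each position with its similarity to the given movie, e.g.
    # sim_scores = [(0, 1.0000000000000002), (1, 0.015706569646092478), (2, 0.0), ...]
    sim_scores = list(enumerate(cosine_sim[idx]))
    # Sort the pairs by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    # Skip the first entry (the movie itself) and keep the next 10
    sim_scores = sim_scores[1:11]
    # Return the titles at the selected positions
    movie_indices = [i[0] for i in sim_scores]
    return df['title'].iloc[movie_indices]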

In [58]:
content_recommender('The Lion King')

892              The Wizard of Oz
42                    Restoration
515     Robin Hood: Men in Tights
89     The Journey of August King
643                   DragonHeart
110           Rumble in the Bronx
697                       Flipper
526             The Secret Garden
197            The Tie That Binds
55        Kids of the Round Table
Name: title, dtype: object
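
Because `cosine_sim` covers only the first 1000 overviews while `indices` maps all 45466 titles, looking up a movie beyond row 999 raises an `IndexError`, and a repeated title makes `indices[title]` return a Series rather than a single position. A guarded variant might look like the sketch below. The `safe_recommender` function, its parameters, and the toy data are hypothetical, and it drops the query movie by position instead of assuming it sorts first:

```python
import numpy as np
import pandas as pd

def safe_recommender(title, sim, titles, mapping, top_n=10):
    """Hypothetical guarded variant: returns an empty Series for titles
    that are unknown or fall outside the similarity matrix."""
    if title not in mapping.index:
        return pd.Series(dtype=object)
    idx = mapping[title]
    # A repeated title yields a Series of positions; use the first one.
    if isinstance(idx, pd.Series):
        idx = idx.iloc[0]
    if idx >= sim.shape[0]:
        return pd.Series(dtype=object)
    # Stable sort on negated scores = descending order, ties by position.
    order = np.argsort(-sim[idx], kind="stable")
    order = order[order != idx][:top_n]  # drop the query movie explicitly
    return titles.iloc[order]

# Toy data, made up for illustration.
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_mapping = pd.Series(toy_titles.index, index=toy_titles)
toy_sim = np.array([
    [1.0, 0.2, 0.7],
    [0.2, 1.0, 0.1],
    [0.7, 0.1, 1.0],
])  # covers only the first three titles

print(list(safe_recommender("A", toy_sim, toy_titles, toy_mapping, top_n=2)))
print(len(safe_recommender("D", toy_sim, toy_titles, toy_mapping)))
print(len(safe_recommender("Z", toy_sim, toy_titles, toy_mapping)))
```

Returning an empty Series is one design choice among several. Raising a descriptive `KeyError` would be just as reasonable if callers should be told explicitly that a title is not covered.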