### Simple Recommender System in Python

We all love watching movies! There are some movies we like and some we don't, and most people prefer movies of a similar genre. For example, 2001: A Space Odyssey and Close Encounters of the Third Kind are both movies based on aliens coming to Earth. We could conclude from intuition alone that both fall into the same genre, but that's no fun in a data science context.

The rapid growth of data collection has led to a new era of information. Data is being used to create more efficient systems, and this is where recommendation systems come into play. Recommendation systems are a type of information filtering system: they improve the quality of search results and provide items that are more relevant to the search item.

In this notebook, we will quantify the similarity of movies based on their plot summaries available on IMDb.

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

### Importing the dataset

In [2]:
# Read in the movie data
movies_df = pd.read_csv('movies_plot.csv',encoding='UTF-8')
# Display the data
display(movies_df.head())

Unnamed: 0,original_title,overview
0,Toy Story,"Led by Woody, Andy's toys live happily in his ..."
1,Jumanji,When siblings Judy and Peter discover an encha...
2,Grumpier Old Men,A family wedding reignites the ancient feud be...
3,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom..."
4,Father of the Bride Part II,Just when George Banks has recovered from his ...


### Cleaning dataset

In [3]:
print(movies_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32269 entries, 0 to 32268
Data columns (total 2 columns):
original_title    32269 non-null object
overview          32200 non-null object
dtypes: object(2)
memory usage: 504.3+ KB
None


Using the `info` method we get a brief description of our dataset, which helps us understand the data we are working with. The dataset we imported contains two columns and 32,269 rows, and the overview column has 69 missing values (32,269 entries, of which 32,200 are non-null).

In [4]:
# Replacing NaN with an empty string
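# (assigning the result back avoids the chained in-place call, which recent pandas versions warn about)
movies_df['overview'] = movies_df['overview'].fillna('')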

# Changing column names
movies_df.rename(columns={"original_title": "title", "overview": "plot"}, inplace=True)

### Creating TfidfVectorizer

We'll compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each movie. The term frequency of a word measures how often it appears in a document, while the inverse document frequency reduces the weight of a word that appears in many documents. A word therefore scores highly when it is frequent in one plot but rare across the rest of the collection.
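To make the two factors concrete, here is a minimal sketch that computes a textbook TF-IDF score by hand. The three-sentence corpus and the helper names `tf`, `idf` and `tf_idf` are made up for illustration, and the formula used is the plain `tf * log(N / df)` variant, which is an assumption rather than the exact weighting any particular library applies.

```python
import math

# Toy corpus (illustrative only, not taken from movies_plot.csv)
docs = [
    "alien ship lands on earth",
    "alien invasion of earth",
    "toys come to life",
]
tokenized = [doc.split() for doc in docs]

def tf(term, tokens):
    # Share of the document's tokens that are `term`
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    # log(N / df); assumes the term occurs in at least one document
    df = sum(term in tokens for tokens in corpus)
    return math.log(len(corpus) / df)

def tf_idf(term, tokens, corpus):
    return tf(term, tokens) * idf(term, corpus)

# "alien" occurs in two of three documents, "ship" in only one
alien_score = tf_idf("alien", tokenized[0], tokenized)
ship_score = tf_idf("ship", tokenized[0], tokenized)
print(alien_score, ship_score)
```

Both words appear once in the first sentence, yet "ship" receives the higher score because it is rarer across the corpus. With its default settings, scikit-learn uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from this hand calculation even though the ranking idea is the same.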

Scikit-learn gives us a built-in TfidfVectorizer class that produces the TF-IDF matrix in a couple of lines.

In [5]:
# Defining a TF-IDF Vectorizer Object
tfidf = TfidfVectorizer(max_df=0.7,stop_words='english')
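Both arguments passed to the vectorizer above shrink the vocabulary: `stop_words='english'` drops scikit-learn's built-in list of common English words, and `max_df=0.7` ignores any term that appears in more than 70% of the documents. The sketch below shows the effect on a made-up four-line corpus; the sentences and the names `toy_docs`, `no_cap` and `capped` are illustrative assumptions, not part of the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up corpus for illustration; "movie" occurs in 3 of 4 documents (75%)
toy_docs = [
    "the movie about aliens",
    "the movie about toys",
    "the movie about mafia",
    "a story of aliens",
]

# Stop words removed, no cap on document frequency
no_cap = TfidfVectorizer(stop_words='english').fit(toy_docs)
# Same, but terms found in more than 70% of documents are ignored as well
capped = TfidfVectorizer(max_df=0.7, stop_words='english').fit(toy_docs)

print(sorted(no_cap.vocabulary_))  # still contains 'movie'
print(sorted(capped.vocabulary_))  # 'movie' has been dropped
```

On the real plots, the cap plays the same role for words that are not in the stop-word list but are still too common to tell movies apart.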

In [6]:
# Constructing the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies_df['plot'])

In [7]:
# Output the shape of tfidf_matrix
tfidf_matrix.shape

(32269, 59531)

We see that 59,531 features (distinct terms) were used to describe the 32,269 movies in our dataset.

### Cosine similarity score

Cosine similarity measures how similar two non-zero vectors are by taking the cosine of the angle between them: vectors pointing in the same direction score 1, and vectors with no terms in common score 0. We will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on those scores.
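As a small illustration of the definition, the sketch below computes the cosine of the angle between two vectors as their dot product divided by the product of their lengths. The helper name `cosine` and the three example vectors are assumptions made up for this example.

```python
import numpy as np

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice the length
c = np.array([0.0, 0.0, 3.0])  # shares no non-zero component with a

print(cosine(a, b))  # approximately 1.0 (length does not matter)
print(cosine(a, c))  # 0.0 (nothing in common)
```

Because TF-IDF weights are never negative, the scores for our plots always fall between 0 and 1.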

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

# Computing the cosine similarity matrix
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
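A practical note on the cell above: the result is a dense 32,269 x 32,269 matrix, which at 8 bytes per float64 value comes to roughly 8.3 GB. If that does not fit in memory, one possible alternative is to compute the scores for a single movie on demand. The sketch below is an assumption-laden illustration, not part of the original notebook: the helper `top_k_similar` and the toy plots are hypothetical, and it relies on `TfidfVectorizer` L2-normalizing each row by default, so that the plain dot product from `linear_kernel` equals the cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def top_k_similar(matrix, idx, k=5):
    """Return (row_indices, scores) of the k rows most similar to row idx."""
    # One row against all rows: a 1 x n result instead of an n x n matrix
    scores = linear_kernel(matrix[idx:idx + 1], matrix).ravel()
    scores[idx] = -1.0  # exclude the query row itself
    top = np.argsort(scores)[::-1][:k]
    return top, scores[top]

# Toy plots standing in for movies_df['plot'] (illustrative data only)
plots = [
    "a mafia family fights rival crime families",
    "the mafia family expands its crime empire",
    "toys come alive when the boy leaves the room",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(plots)

top, scores = top_k_similar(toy_matrix, idx=0, k=2)
print(top, scores)
```

The trade-off is that every lookup repeats one sparse matrix product, which is usually cheap compared with holding the full matrix in memory.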

### Function to find most similar movies

We are going to define a function that takes a movie title as input and outputs the 5 most similar movies. For this we first need a reverse mapping from movie titles to DataFrame indices. In other words, we need a mechanism to identify the index of a movie in our DataFrame, given its title.

In [9]:
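indices = pd.Series(movies_df.index, index=movies_df['title'])
# Keep the first row for each title; drop_duplicates() would compare the index values, not the titles
indices = indices[~indices.index.duplicated(keep='first')]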
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64
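One detail of this mapping is worth checking: if two movies share a title, looking that title up returns a Series of positions rather than a single integer, which would break the row lookup in the function below. Calling `drop_duplicates()` on the Series does not help, because it compares the values (the row positions, which are always unique) and not the titles in the index. A minimal sketch on a hypothetical three-row frame (`toy_df`, `lookup` and `unique_lookup` are made-up names):

```python
import pandas as pd

# Hypothetical toy frame with a repeated title (not taken from the dataset)
toy_df = pd.DataFrame({"title": ["Batman", "Thor", "Batman"]})

lookup = pd.Series(toy_df.index, index=toy_df["title"])

# drop_duplicates() compares the values 0, 1, 2, which are all unique,
# so both "Batman" entries survive
print(len(lookup.drop_duplicates()))  # 3

# Filtering on the index keeps only the first row for each title
unique_lookup = lookup[~lookup.index.duplicated(keep="first")]
print(unique_lookup["Batman"])  # 0
```

Keeping the first occurrence is a simple choice; which of several same-titled movies should win is a judgment call that depends on the dataset.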

In [10]:
# Function that takes in movie title as input and outputs most similar movies
def get_similar(title, cosine_sim=cosine_sim):
    
    # Get the index of the movie that matches the title
    idx = indices[title]
    
    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)
    
    # Keep the 5 most similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:6]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # Get the similarity score
    similarity = [i[1] for i in sim_scores]
    
    # Return the top 5 most similar movies as a DataFrame
    df = pd.DataFrame(movies_df['title'].iloc[movie_indices])
    df['similarity'] = similarity
    
    return df.reset_index(drop=True)

In [11]:
get_similar('The Godfather')

Unnamed: 0,title,similarity
0,The Godfather: Part II,0.460882
1,The Godfather Trilogy: 1972-1990,0.348765
2,The Godfather: Part III,0.171986
3,Blood Ties,0.158799
4,Household Saints,0.144341


In [12]:
get_similar('The Dark Knight Rises')

Unnamed: 0,title,similarity
0,The Dark Knight,0.315895
1,Batman Forever,0.310626
2,Batman Returns,0.284586
3,Batman: Under the Red Hood,0.277394
4,Batman,0.266956


In [13]:
get_similar('Thor')

Unnamed: 0,title,similarity
0,Thor: The Dark World,0.363348
1,Team Thor,0.353527
2,I Am Thor,0.279237
3,Thor: Tales of Asgard,0.266488
4,Almighty Thor,0.259372


#### We can now determine the similarity between movies based on their plots! 