In [1]:
# Import pandas and NumPy
import pandas as pd
import numpy as np

# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Load Movies Metadata
metadata = pd.read_csv('data/movies_metadata.csv', low_memory=False)


In [2]:
metadata.head(3)


| | adult | belongs_to_collection | budget | genres | homepage | id | imdb_id | original_language | original_title | overview | ... | release_date | revenue | runtime | spoken_languages | status | tagline | title | video | vote_average | vote_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | {'id': 10194, 'name': 'Toy Story Collection', ... | 30000000 | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | http://toystory.disney.com/toy-story | 862 | tt0114709 | en | Toy Story | Led by Woody, Andy's toys live happily in his ... | ... | 1995-10-30 | 373554033.0 | 81.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | | Toy Story | False | 7.7 | 5415.0 |
| 1 | False | | 65000000 | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | | 8844 | tt0113497 | en | Jumanji | When siblings Judy and Peter discover an encha... | ... | 1995-12-15 | 262797249.0 | 104.0 | [{'iso_639_1': 'en', 'name': 'English'}, {'iso... | Released | Roll the dice and unleash the excitement! | Jumanji | False | 6.9 | 2413.0 |
| 2 | False | {'id': 119050, 'name': 'Grumpy Old Men Collect... | 0 | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | | 15602 | tt0113228 | en | Grumpier Old Men | A family wedding reignites the ancient feud be... | ... | 1995-12-22 | 0.0 | 101.0 | [{'iso_639_1': 'en', 'name': 'English'}] | Released | Still Yelling. Still Fighting. Still Ready for... | Grumpier Old Men | False | 6.5 | 92.0 |


We are trying to build a clone of IMDB's Top 250, so you will use its weighted rating formula as your metric/score. Mathematically, it is represented as follows:

$$\text{WR} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

where,

v is the number of votes for the movie;

m is the minimum votes required to be listed in the chart;

R is the average rating of the movie; and

C is the mean vote across the whole dataset.

You already have the values for v (vote_count) and R (vote_average) for each movie in the dataset. C can also be calculated directly from this data.

In [3]:
# Calculate C
C = metadata['vote_average'].mean()
print("The average rating of a movie on IMDB is around %0.3f, on a scale of 10."%C)

The average rating of a movie on IMDB is around 5.618, on a scale of 10.


In [4]:
# Calculate the minimum number of votes required to be in the chart, m
m = metadata['vote_count'].quantile(0.90)
print(m)

160.0


In [5]:
# Filter all qualified movies into a new DataFrame
q_movies = metadata.loc[metadata['vote_count'] >= m].copy()
q_movies.shape

(4555, 24)
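
To make the cutoff step concrete, here is a minimal sketch on made-up data. The frame `toy` and the names `m_toy` and `qualified` are hypothetical and not part of the dataset; the only assumption is pandas' default linear interpolation for `quantile`.

```python
import pandas as pd

# Hypothetical vote counts for ten movies (not taken from the dataset)
toy = pd.DataFrame({"vote_count": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]})

# 90th percentile of the vote counts (linear interpolation by default)
m_toy = toy["vote_count"].quantile(0.90)

# Keep only the rows whose vote count is at or above the cutoff
qualified = toy.loc[toy["vote_count"] >= m_toy]

print(m_toy)
print(len(qualified))
```

On the real data the same filter keeps 4,555 of the 45,466 movies, roughly the top 10% by vote count.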

In [6]:
# Function that computes the weighted rating of each movie
def weighted_rating(x, m=m, C=C):
    v = x["vote_count"]
    R = x["vote_average"]

    # Calculation based on the IMDB formula
    return (v / (v + m)) * R + (m / (v + m)) * C

In [7]:
# Define a new feature 'score' and calculate its value with `weighted_rating()`
q_movies["score"] = q_movies.apply(weighted_rating, axis=1)

In [8]:
# Sort movies based on the score calculated above
q_movies = q_movies.sort_values('score', ascending=False)

q_movies[['title', 'vote_count', 'vote_average', 'score']].head(10)

| | title | vote_count | vote_average | score |
|---|---|---|---|---|
| 314 | The Shawshank Redemption | 8358.0 | 8.5 | 8.445869 |
| 834 | The Godfather | 6024.0 | 8.5 | 8.425439 |
| 10309 | Dilwale Dulhania Le Jayenge | 661.0 | 9.1 | 8.421453 |
| 12481 | The Dark Knight | 12269.0 | 8.3 | 8.265477 |
| 2843 | Fight Club | 9678.0 | 8.3 | 8.256385 |
| 292 | Pulp Fiction | 8670.0 | 8.3 | 8.251406 |
| 522 | Schindler's List | 4436.0 | 8.3 | 8.206639 |
| 23673 | Whiplash | 4376.0 | 8.3 | 8.205404 |
| 5481 | Spirited Away | 3968.0 | 8.3 | 8.196055 |
| 2211 | Life Is Beautiful | 3643.0 | 8.3 | 8.187171 |
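
As a sanity check, the scores in the chart can be reproduced by hand from the weighted rating formula. The sketch below is a standalone variant that takes plain numbers instead of a DataFrame row; `weighted_rating_scalar`, `C_rounded` and `m_rounded` are names introduced here for illustration, and the constants are the rounded values printed earlier, so the results should match the `score` column only approximately.

```python
# Rounded values printed earlier in this notebook (assumed, not recomputed)
C_rounded = 5.618   # mean vote_average across the dataset
m_rounded = 160.0   # 90th-percentile vote_count


def weighted_rating_scalar(v, R, m=m_rounded, C=C_rounded):
    # Blend the movie's own rating R with the global mean C,
    # weighting R more heavily as the vote count v grows
    return (v / (v + m)) * R + (m / (v + m)) * C


# vote_count and vote_average copied from the chart above
shawshank = weighted_rating_scalar(v=8358, R=8.5)
ddlj = weighted_rating_scalar(v=661, R=9.1)

print(round(shawshank, 3), round(ddlj, 3))
```

This also shows why Dilwale Dulhania Le Jayenge ranks below The Shawshank Redemption despite its higher average: with only 661 votes, its rating is pulled more strongly toward the global mean.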


## Content-Based Recommender in Python

**Plot Description Based Recommender**

In this section, you will try to build a system that recommends movies that are similar to a particular movie. More specifically, you will compute pairwise similarity scores for all movies based on their plot descriptions and recommend movies based on that similarity score.

In [9]:
#Print plot overviews of the first 5 movies.
metadata["overview"].head()

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
3    Cheated on, mistreated and stepped on, the wom...
4    Just when George Banks has recovered from his ...
Name: overview, dtype: object

In [10]:
# Define a TF-IDF vectorizer object. Remove all English stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words="english")

#Replace NaN with an empty string
metadata["overview"] = metadata["overview"].fillna('')



In [11]:
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata["overview"])

tfidf_matrix.shape

(45466, 75827)

After stop-word removal, 75,827 distinct terms describe the 45,466 movies in the dataset.
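
To see what the TF-IDF matrix looks like on a small scale, here is a minimal sketch with three made-up overviews. The names `toy_overviews`, `toy_tfidf` and `toy_matrix` are hypothetical; the vectorizer settings mirror the ones used above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical overviews (not taken from the dataset)
toy_overviews = [
    "The toys live happily in the room",
    "A family wedding reignites the feud",
    "The toys start a feud",
]

toy_tfidf = TfidfVectorizer(stop_words="english")
toy_matrix = toy_tfidf.fit_transform(toy_overviews)

# One row per overview, one column per term that survived stop-word removal
print(toy_matrix.shape)
print(sorted(toy_tfidf.vocabulary_))

# Each row is L2-normalized by default, so its Euclidean norm is 1
row_norms = np.sqrt(np.asarray(toy_matrix.multiply(toy_matrix).sum(axis=1))).ravel()
print(row_norms)
```

The unit-length rows matter for the next step: they are what allows a plain dot product to stand in for cosine similarity.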

With this matrix in hand, you can now compute a similarity score. There are several candidates for this, such as the Euclidean, the Pearson and the cosine similarity scores. Different scores work well in different scenarios, and it is often a good idea to experiment with different metrics.

Here, cosine similarity is used to calculate a numeric quantity that denotes the similarity between two movies.

$$\text{cosine}(x, y) = \frac{x \cdot y^\intercal}{\|x\| \cdot \|y\|}$$

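**Because `TfidfVectorizer` L2-normalizes each row by default, the dot product of two TF-IDF vectors already equals their cosine similarity, so sklearn's `linear_kernel()` is used instead of `cosine_similarity()` since it is faster.**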
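
The equivalence between the two functions can be checked on a small example. This is a sketch on made-up text, assuming the vectorizer's default `norm="l2"` setting; `docs`, `X`, `dot` and `cos` are names introduced here.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical overviews (not taken from the dataset)
docs = [
    "toys live happily in the room",
    "family wedding reignites the feud",
    "toys start a feud",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)

dot = linear_kernel(X, X)       # plain dot products between rows
cos = cosine_similarity(X, X)   # dot products divided by the row norms

# On unit-length rows the two matrices coincide
same = np.allclose(dot, cos)
print(same)
```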

In [12]:
# Import linear kernel

from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

In [13]:
# Construct a reverse map from movie title to row index, keeping the first row for each duplicated title
indices = pd.Series(metadata.index, index=metadata["title"])
indices = indices[~indices.index.duplicated(keep="first")]

In [14]:
indices.head()

title
Toy Story                      0
Jumanji                        1
Grumpier Old Men               2
Waiting to Exhale              3
Father of the Bride Part II    4
dtype: int64

Next, define your recommendation function. It follows these steps:

1. Get the index of the movie given its title.

2. Get the list of cosine similarity scores for that particular movie with all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.

3. Sort the aforementioned list of tuples based on the similarity scores; that is, the second element.

4. Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).

5. Return the titles corresponding to the indices of the top elements.
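
The steps above can be sketched on a toy example before applying them to the full dataset. Everything here is hypothetical: the four titles, the hand-written similarity matrix `toy_sim`, and the helper `top_similar`, which returns the top `k` matches instead of the top 10.

```python
import numpy as np
import pandas as pd

# Hypothetical titles and a hand-written symmetric similarity matrix
toy_titles = pd.Series(["A", "B", "C", "D"])
toy_sim = np.array([
    [1.0, 0.2, 0.8, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.8, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])

# Reverse map from title to row position
toy_indices = pd.Series(toy_titles.index, index=toy_titles)


def top_similar(title, k=2):
    # Step 1: look up the row position of the title
    idx = toy_indices[title]
    # Step 2: pair every position with its similarity to that row
    scores = list(enumerate(toy_sim[idx]))
    # Step 3: sort by similarity, highest first
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Step 4: drop the first entry (the title itself) and keep the next k
    scores = scores[1:k + 1]
    # Step 5: map the positions back to titles
    return toy_titles.iloc[[i for i, _ in scores]].tolist()


result = top_similar("A")
print(result)
```

Dropping the first entry assumes the movie itself sorts first. That can fail when another movie has an identical overview, since both would then tie at the maximum similarity.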


In [15]:
# Function that takes a movie title as input and returns the most similar movies
def get_recommendation(title, cosine_sim=cosine_sim):
    try:
        # Get the index of the movie that matches the title
        idx = indices[title]
    except KeyError:
        print("No recommendations found.")
        return None

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (the movie itself) and keep the next 10
    sim_scores = sim_scores[1:11]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the 10 most similar movies
    print("You should watch these movies also:")
    return metadata["title"].iloc[movie_indices]

In [38]:
get_recommendation("Unbroken")

You should watch these movies also:


5211     Triumph of the Spirit
4786                    Midway
36042       Fires on the Plain
11049                  Running
160                  Desperado
44042                Tracktown
19035          The Peach Thief
4225            Uncommon Valor
35557       Cannonball Wedlock
42019                  Rangoon
Name: title, dtype: object