# Part 1: Content-Based Movie Recommender for TMDb 5000 Movie Dataset
Katherine Huerta

## Content Based Filtering
Content-based filtering suggests items (movies) similar to an item (movie) of interest. Content-based recommenders use item metadata (e.g., director, actors, genre, plot) to identify recommended movies. The underlying assumption is that a person who enjoyed a particular movie will also enjoy movies similar to it.

![](content_based_image.png)

## Import packages and load data
Using the [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata) from Kaggle.
This dataset contains metadata on approximately 5,000 movies from TMDb, which makes it well suited to a content-based recommendation system. Two data files are loaded into my Jupyter notebook environment:
1. tmdb_5000_credits.csv
2. tmdb_5000_movies.csv

The first dataset will be labeled df1, and it contains the following:
* movie_id - unique identifier for each movie
* cast - name of lead and supporting actors
* crew - name of director, editor, composer, writer, and more.

The second dataset will be named df2, and it contains the following:
* budget - the budget with which the movie was made
* genres - genres of movie (action, comedy, thriller, and more)
* id - movie identifier, matching movie_id in the first dataset
* keywords - keywords or tags related to movie
* original_language - language in which movie was made
* original_title - movie title before translation or adaptation
* overview - brief description of movie
* popularity - numeric quantity specifying movie popularity
* production_companies - the production house of movie
* production_countries - country in which movie was produced
* release_date - date of movie release
* revenue - worldwide revenue generated by movie
* runtime - running time of the movie in minutes
* status - "Released" or "Rumored"
* tagline - movie's tagline
* title - title of movie
* vote_average - average rating the movie received
* vote_count - count of votes received

In [1]:
# Import packages
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sklearn.metrics.pairwise import cosine_similarity
from ast import literal_eval

In [2]:
# Load data
df1 = pd.read_csv('tmdb_5000_credits.csv')
df2 = pd.read_csv('tmdb_5000_movies.csv')

In [3]:
# Join both datasets on the 'id' column
df1.columns = ['id','title','cast','crew']
df = df2.merge(df1, on = 'id')

# Look at data
df.head()

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,...,runtime,spoken_languages,status,tagline,title_x,vote_average,vote_count,title_y,cast,crew
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...",...,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Avatar,"[{""cast_id"": 242, ""character"": ""Jake Sully"", ""...","[{""credit_id"": ""52fe48009251416c750aca23"", ""de..."
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...",...,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Pirates of the Caribbean: At World's End,"[{""cast_id"": 4, ""character"": ""Captain Jack Spa...","[{""credit_id"": ""52fe4232c3a36847f800b579"", ""de..."
2,245000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.sonypictures.com/movies/spectre/,206647,"[{""id"": 470, ""name"": ""spy""}, {""id"": 818, ""name...",en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...",...,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Spectre,"[{""cast_id"": 1, ""character"": ""James Bond"", ""cr...","[{""credit_id"": ""54805967c3a36829b5002c41"", ""de..."
3,250000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 80, ""nam...",http://www.thedarkknightrises.com/,49026,"[{""id"": 849, ""name"": ""dc comics""}, {""id"": 853,...",en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...",...,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,The Dark Knight Rises,"[{""cast_id"": 2, ""character"": ""Bruce Wayne / Ba...","[{""credit_id"": ""52fe4781c3a36847f81398c3"", ""de..."
4,260000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://movies.disney.com/john-carter,49529,"[{""id"": 818, ""name"": ""based on novel""}, {""id"":...",en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]",...,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,John Carter,"[{""cast_id"": 5, ""character"": ""John Carter"", ""c...","[{""credit_id"": ""52fe479ac3a36847f813eaa3"", ""de..."


## Content Based Recommender 1
### Plot description based recommender

* Compute pairwise similarity scores for all movies based on their plot descriptions.
* Recommend movies based on cosine similarity score.
* Plot description is provided in the _overview_ feature of the dataset.

In [4]:
df['overview'].head(3)

0    In the 22nd century, a paraplegic Marine is di...
1    Captain Barbossa, long believed to be dead, ha...
2    A cryptic message from Bond’s past sends him o...
Name: overview, dtype: object

The next step is to convert each plot description into a word vector.

1) Compute Term-Frequency-Inverse Document Frequency (TF-IDF) vectors for each overview.
   * TF = relative frequency of a word in a document (term instances / total instances); IDF = log(number of documents / number of documents containing the term).
   * Overall importance of _each_ word to the document they appeared in = TF * IDF
   
2) Result is a matrix.
   * Each column represents a word in the plot description vocabulary (all words that appear in at least one document)
   * Each row represents a movie
   * This matrix is used to **reduce** the importance of words that occur frequently in plot overviews (ex: 'The'). Thus, their significance in computing the final similarity score is reduced.
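
The weighting described above can be sketched by hand on a toy corpus. This is a minimal illustration of the plain TF * IDF formula from the list, not of scikit-learn's exact computation: TfidfVectorizer by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ. The three sentences and the `tf_idf` helper are made up for this example.

```python
import math
from collections import Counter

# Toy "overviews" (hypothetical, for illustration only)
docs = [
    "a marine explores a distant moon",
    "a pirate sails to a distant island",
    "a spy chases a secret organisation",
]

def tf_idf(documents):
    """Return one {term: weight} dict per document using TF * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for term in ("a", "distant", "marine"):
    print(term, round(weights[0][term], 4))
```

The word "a" appears in every document, so its IDF is log(1) = 0 and it contributes nothing, while "marine", which appears in only one document, receives the largest weight of the three.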
    
Use the **scikit-learn TfidfVectorizer** class to build the TF-IDF matrix.

In [5]:
# Define TF-IDF Vectorizer object, removing all English stop words
# ex. 'the', 'a', etc.
tfidf = TfidfVectorizer(stop_words = 'english')

# Replace NaN with empty string
df['overview'] = df['overview'].fillna('')

# Generate the TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(df['overview'])

# Check matrix shape
tfidf_matrix.shape

(4803, 20978)

The shape reveals that there are **20,978 different words** used to **describe 4,803 movies** in the dataset.

The next step is to compute the **similarity score** using the matrix.
**Cosine similarity scores** are used here: they give a numeric quantity that denotes the similarity between two movies, independent of vector magnitude. The calculation is fast, which is useful for larger datasets.

**Summary:** Because TfidfVectorizer normalizes each vector to unit length by default, the dot product of two TF-IDF vectors directly yields their cosine similarity score.

Use **sklearn linear_kernel()** rather than cosine_similarity(), as suggested by [Ibtesam Ahmed](https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system), because it is faster.
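
To see why a plain dot product is enough here, the sketch below checks on two small vectors that the dot product of L2-normalized vectors equals their cosine similarity. It uses only the standard library; the vectors and helper names (`l2_normalize`, `dot`, `cosine`) are hypothetical and stand in for rows of the TF-IDF matrix.

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(vec):
    """Scale a vector so that its Euclidean length is 1."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the lengths."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Two made-up vectors standing in for TF-IDF rows
u = [3.0, 0.0, 4.0]
v = [1.0, 2.0, 2.0]

cos_uv = cosine(u, v)
dot_normalized = dot(l2_normalize(u), l2_normalize(v))

print(round(cos_uv, 4), round(dot_normalized, 4))
```

Both values agree, which is why the division by vector lengths can be skipped when the rows are already unit length.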

In [6]:
# Calculate cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix) # dot product of tfidf matrix against itself.

### Overall goal: Function that takes movie title as input, and gives output as a list of 10 most similar movies

To do this, we need to do the following:

1) Make a reverse map of movie titles and DataFrame indices.
   * Mechanism to identify movie index in metadata DataFrame given a title.

2) Define recommendation function:
   * Acquire: 
       * index of movie given its title
       * list of cosine similarity scores for that movie with all movies
       * convert it to a list of tuples where the first element is the position and the second is the similarity score
   * Sort list of tuples based on sim scores
   * Get top 10 elements of list - ignore first element because it refers to itself.
   * Return respective titles 
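
The steps above can be sketched on a toy similarity matrix. The titles and scores are made up for illustration, and `title_to_index` and `top_similar` are hypothetical names, not part of the notebook. One small difference from the outline: instead of dropping the first sorted element, this sketch filters out the query movie by its index, which gives the same result whenever the movie is its own best match.

```python
# Made-up titles and a symmetric similarity matrix (1.0 on the diagonal)
titles = ["Movie A", "Movie B", "Movie C", "Movie D"]
sim = [
    [1.0, 0.2, 0.7, 0.1],
    [0.2, 1.0, 0.3, 0.6],
    [0.7, 0.3, 1.0, 0.4],
    [0.1, 0.6, 0.4, 1.0],
]

# Step 1: reverse map from title to row position
title_to_index = {title: i for i, title in enumerate(titles)}

def top_similar(title, n=2):
    """Return the n titles most similar to `title`, excluding the title itself."""
    idx = title_to_index[title]
    # (position, score) tuples for this movie against every movie
    scored = list(enumerate(sim[idx]))
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    # Skip the query movie, then keep the top n positions
    top = [i for i, _ in scored if i != idx][:n]
    return [titles[i] for i in top]

print(top_similar("Movie A"))
print(top_similar("Movie D", n=3))
```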
    

In [7]:
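# Reverse map of movie titles to DataFrame indices
# Keep the first row per title so each lookup returns a single index
indices = pd.Series(df.index, index = df['title_x'])
indices = indices[~indices.index.duplicated(keep = 'first')]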

In [8]:
# Function where movie title is the input and returns most similar movies

def get_recommendations(title, cosine_sim = cosine_sim):
    # get index of movie that matches movie title
    idx = indices[title]
    
    # get pairwise similarity scores of all movies with movie of interest
    sim_scores = list(enumerate(cosine_sim[idx]))
    
    # sort movies based on similarity scores
    sim_scores = sorted(sim_scores, key = lambda x: x[1], reverse = True)
    
    # get scores of top 10 most similar movies
    sim_scores = sim_scores[1:11]
    
    # acquire movie indices
    movie_indices = [i[0] for i in sim_scores]
    
    # return top 10 most similar movies
    return df['title_x'].iloc[movie_indices]

In [10]:
# Try it out!
# Plot-description based recommender
get_recommendations('The Princess Bride')

2912               Star Wars
917           Into the Woods
1874             August Rush
4191        Against the Wild
4046               Show Boat
1668             Miss Potter
762           Mercury Rising
3076      The House of Mirth
391                Enchanted
4489    Escape from Tomorrow
Name: title_x, dtype: object

## Credits, Genres and Keywords Based Recommender
**Increase the quality of the recommender with better metadata**

Building a recommender based on metadata: 
* 3 top actors 
* director
* related genres
* movie plot keywords.

**From cast, crew and keyword features** - need to extract 3 most important actors, the director and keywords associated with that movie. 

Currently, these features are stored as stringified lists, so they need to be converted into a safe and usable structure.
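
As a small illustration of that conversion, the sketch below parses one stringified list of dictionaries with `ast.literal_eval`, which only evaluates Python literals and does not execute arbitrary code. The `raw` string is a shortened, hand-written stand-in for a genres cell, not an exact value copied from the dataset.

```python
from ast import literal_eval

# Shortened stand-in for one cell of the genres column
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# Before parsing, the value is just a string
print(type(raw).__name__)

# literal_eval turns the text into a real list of dicts
parsed = literal_eval(raw)
print(type(parsed).__name__, type(parsed[0]).__name__)

# Once parsed, the names can be pulled out with a list comprehension
names = [entry["name"] for entry in parsed]
print(names)
```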

In [10]:
# Parse string type features into objects using ast literal_eval

features = ['cast','crew','keywords','genres'] # identifying features to convert to python
for feature in features:
    df[feature] = df[feature].apply(literal_eval)

In [11]:
# Extract required information from each feature
# Get director's name from crew feature.
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan    # If the director is not listed, will return NaN

In [12]:
# Function that returns the first three names in a list (or the whole list if it has three or fewer)

def get_list(x):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # check if more than 3 elements exist.
        if len(names) > 3: # if yes,
            names = names[:3] # return only first 3 elements. If not, will still return entire list.
        return names
    return [] # return empty list in case of missing/malformed data

In [13]:
# Define new director, cast, genres, and keywords features
# that are in an appropriate format.

df['director'] = df['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    df[feature] = df[feature].apply(get_list)
    
# Show new features of first two films
df[['title_x','cast','director','keywords','genres']].head(2)

Unnamed: 0,title_x,cast,director,keywords,genres
0,Avatar,"[Sam Worthington, Zoe Saldana, Sigourney Weaver]",James Cameron,"[culture clash, future, space war]","[Action, Adventure, Fantasy]"
1,Pirates of the Caribbean: At World's End,"[Johnny Depp, Orlando Bloom, Keira Knightley]",Gore Verbinski,"[ocean, drug abuse, exotic island]","[Adventure, Fantasy, Action]"


In [15]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ","")) for i in x]
    else:
        # check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ",""))
        else:
            return ''

# Apply clean_data function to features
features = ['cast','keywords','director','genres']

for feature in features:
    df[feature] = df[feature].apply(clean_data)

Make a **"metadata soup"** - a string that contains all the metadata we want to feed to the vectorizer (top actors, director, keywords, genres).

In [16]:
def create_soup(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])
df['soup'] = df.apply(create_soup, axis = 1)
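
To make the result concrete, here is a minimal sketch of the cleaning and soup steps applied to a single movie held in a plain dictionary, using the Avatar values shown in the table above. The helpers `clean` and `make_soup` are simplified stand-ins for `clean_data` and `create_soup`, written without pandas. Removing the spaces presumably keeps multi-word names together as one token, so that two different people who share a first name are not counted as a match.

```python
def clean(value):
    """Lower-case and remove spaces from a string or from each item of a list."""
    if isinstance(value, list):
        return [item.replace(" ", "").lower() for item in value]
    if isinstance(value, str):
        return value.replace(" ", "").lower()
    return ""  # e.g. a missing director

def make_soup(row):
    """Join keywords, cast, director and genres into one space-separated string."""
    return " ".join(row["keywords"] + row["cast"] + [row["director"]] + row["genres"])

# Values taken from the first row displayed earlier
movie = {
    "keywords": ["culture clash", "future", "space war"],
    "cast": ["Sam Worthington", "Zoe Saldana", "Sigourney Weaver"],
    "director": "James Cameron",
    "genres": ["Action", "Adventure", "Fantasy"],
}

cleaned = {key: clean(value) for key, value in movie.items()}
soup = make_soup(cleaned)
print(soup)
```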

**Next step - same as plot description based recommender**

This time, use CountVectorizer() instead of TF-IDF, because we don't want to down-weight an actor or director simply because they have acted in or directed relatively more movies.
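
The idea of comparing raw token counts can be sketched without scikit-learn. The helper below is a simplified stand-in for CountVectorizer followed by cosine_similarity: it splits on whitespace only, whereas the real vectorizer also lower-cases, drops stop words, and ignores one-character tokens. The soups and names (`film_1`, `janedoe`, and so on) are invented for illustration.

```python
import math
from collections import Counter

def count_cosine(soup_a, soup_b):
    """Cosine similarity between the raw token counts of two soup strings."""
    a, b = Counter(soup_a.split()), Counter(soup_b.split())
    # Counter returns 0 for tokens missing from b, so only shared tokens add up
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty soup is not similar to anything
    return dot / (norm_a * norm_b)

# Invented soups: film_1 and film_2 share an actor, a director and two tags
soups = {
    "film_1": "magic wizard janedoe johnroe fantasy adventure",
    "film_2": "magic school janedoe johnroe fantasy family",
    "film_3": "heist city maryroe alexpoe crime thriller",
}

print(round(count_cosine(soups["film_1"], soups["film_2"]), 4))
print(round(count_cosine(soups["film_1"], soups["film_3"]), 4))
```

Every shared token counts equally here, so a prolific actor or director contributes as much to the score as a rare keyword.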

In [17]:
# make count matrix
count = CountVectorizer(stop_words = 'english')
count_matrix = count.fit_transform(df['soup'])

# Calculate cosine sim matrix based on count matrix 
cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

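# Reset index of main DataFrame and construct reverse mapping like earlier
df = df.reset_index()
indices = pd.Series(df.index, index = df['title_x'])
# Keep the first row per title so each lookup returns a single index
indices = indices[~indices.index.duplicated(keep = 'first')]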

Now we can reuse the get_recommendations() function by passing the new cosine_sim2 matrix as the second argument.

In [18]:
# Keywords, cast, director, genre based recommender
get_recommendations('The Princess Bride', cosine_sim2)

1322                                        City of Ember
1306                          Dragon Nest: Warriors' Dawn
1764                                         Return to Oz
294                                                  Epic
3670                                      Running Forever
8                  Harry Potter and the Half-Blood Prince
15               The Chronicles of Narnia: Prince Caspian
32                                    Alice in Wonderland
37                             Oz: The Great and Powerful
63      The Chronicles of Narnia: The Lion, the Witch ...
Name: title_x, dtype: object

In [19]:
get_recommendations('Spirited Away', cosine_sim2)

1987                        Howl's Moving Castle
8         Harry Potter and the Half-Blood Prince
37                    Oz: The Great and Powerful
197     Harry Potter and the Philosopher's Stone
276      Harry Potter and the Chamber of Secrets
1984                   The Thief and the Cobbler
2114                        Return to Never Land
2247                           Princess Mononoke
3364                                     Warlock
1277                                       Delgo
Name: title_x, dtype: object

## This concludes the content-based recommender, where you can input a movie you enjoyed, and get the top 10 recommendations based on your movie's similarity to others!

### Sources:
1. [TMDB 5000 Movie Dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata) Metadata on ~5,000 movies from TMDb
2. [Ibtesam Ahmed](https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system) Getting Started with a Movie Recommendation System
3. Image link for introduction: [Emma Grimaldi](https://towardsdatascience.com/how-to-build-from-scratch-a-content-based-movie-recommender-with-natural-language-processing-25ad400eb243)

## Try it out!
### Fill in your favorite movie, and see what you get:
**Plot Description-based Movie Recommender**

get_recommendations('Insert Movie Here')

**Credits, Genres and Keywords Based Recommender** 

get_recommendations('Insert Movie Here', cosine_sim2)

In [None]:
#Plot Description Recommender
get_recommendations('')

In [None]:
# Credits, Genres, and Keywords Recommender
get_recommendations('', cosine_sim2)