# Content-Based Recommender System Using The Movies Dataset

This project attempts to showcase a simple content-based recommender system, using the publicly available datasets from [MovieLens](http://movielens.org). For simplicity's sake, we will be using the MovieLens Datasets from Kaggle which can be found [here](https://www.kaggle.com/rounakbanik/the-movies-dataset).

In this project, we will focus on building a recommender system based on the *ratings*, *genres*, *credits* and *keywords* of the movies.

In [42]:
# Import required libraries
import numpy as np
import pandas as pd
from ast import literal_eval
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Read required dataset
df_movies = pd.read_csv("./Data/movies_metadata.csv", low_memory=False)
df_credits = pd.read_csv("./Data/credits.csv", low_memory=False)
df_keywords = pd.read_csv("./Data/keywords.csv", low_memory=False)

### Preprocess And Clean Dataset

Perform basic formatting and merging of datasets

In [3]:
# Remove entries in movies dataset with mixed columns
df_movies = df_movies[(df_movies['id'].str.isnumeric() == True)]

# Remove entries with missing values
df_movies = df_movies[(df_movies['title'].isna() == False)]

In [4]:
# Format Datasets 'ID' 
df_movies['id'] = df_movies['id'].astype('int')
df_credits['id'] = df_credits['id'].astype('int')
df_keywords['id'] = df_keywords['id'].astype('int')

In [5]:
# Merge Movies, Credits and Keywords Dataset
df_main = pd.merge(df_movies, df_credits, how = 'inner', on = "id")
df_main = pd.merge(df_main, df_keywords, how = 'inner', on = 'id')
df_main.shape

(46624, 27)

In [6]:
df_main.columns


Index(['adult', 'belongs_to_collection', 'budget', 'genres', 'homepage', 'id',
       'imdb_id', 'original_language', 'original_title', 'overview',
       'popularity', 'poster_path', 'production_companies',
       'production_countries', 'release_date', 'revenue', 'runtime',
       'spoken_languages', 'status', 'tagline', 'title', 'video',
       'vote_average', 'vote_count', 'cast', 'crew', 'keywords'],
      dtype='object')

Define helper functions for feature extractions

In [7]:
# Extract 'Director' from 'Crew'
def get_director(x):
    for i in x:
        if i['job'] == 'Director':
            return i['name']
        else:
            return np.nan

In [8]:
# Extract Top Elements from features
def get_toplist(x, y):
    if isinstance(x, list):
        names = [i['name'] for i in x]
        
        if len(names) > y:
            names = names[:y]
        return names
    else:
        return []

Extract features from embedded lists in columns

In [9]:
# Select column names to extract features
features = ['cast', 'crew', 'keywords', 'genres']

for feature in features:
    df_main[feature] = df_main[feature].apply(literal_eval)

In [10]:
# Extract Director from 'crew' column
df_main['director'] = df_main['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']

for feature in features:
    df_main[feature] = df_main[feature].apply(get_toplist, y =5)

For each individual elements in each feature, we will strip all white space and convert to lowercase. Each elements will then be dumped into a "pool", which will be used as the basis of calculating the similarities/ dissimilarities between movies.

In [12]:
# Define helper function to clean elements
def clean_elements(x):
    
    # for feature columns with embedded lists (cast, genres, keywords)
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # for feature columns with str (director)
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [13]:
# Clean all elements in feature columns

features = ['director', 'cast', 'keywords', 'genres']

for feature in features:
    df_main[feature] = df_main[feature].apply(clean_elements)

In [14]:
df_main.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,status,tagline,title,video,vote_average,vote_count,cast,crew,keywords,director
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[animation, comedy, family]",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,Released,,Toy Story,False,7.7,5415.0,"[tomhanks, timallen, donrickles, jimvarney, wa...","[{'credit_id': '52fe4284c3a36847f8024f49', 'de...","[jealousy, toy, boy, friendship, friends]",johnlasseter
1,False,,65000000,"[adventure, fantasy, family]",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,"[robinwilliams, jonathanhyde, kirstendunst, br...","[{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de...","[boardgame, disappearance, basedonchildren'sbo...",
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[romance, comedy]",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,"[waltermatthau, jacklemmon, ann-margret, sophi...","[{'credit_id': '52fe466a9251416c75077a89', 'de...","[fishing, bestfriend, duringcreditsstinger, ol...",howarddeutch
3,False,,16000000,"[comedy, drama, romance]",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,"[whitneyhouston, angelabassett, lorettadevine,...","[{'credit_id': '52fe44779251416c91011acb', 'de...","[basedonnovel, interracialrelationship, single...",forestwhitaker
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,[comedy],,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,"[stevemartin, dianekeaton, martinshort, kimber...","[{'credit_id': '52fe44959251416c75039ed7', 'de...","[baby, midlifecrisis, confidence, aging, daugh...",


The final step in preprocessing is to create a "pool" consisting of all elements of all features for each movie.

In [15]:
# Define Helper Function To "Pool" the elements together
def create_pool(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [16]:
# Pool features 
df_main['pool'] = df_main.apply(create_pool, axis = 1)

In [18]:
# Select only required columns
df_main = df_main[['id', 'title', 'vote_average', 'vote_count','pool']]
df_main.head()

Unnamed: 0,id,title,vote_average,vote_count,pool
0,862,Toy Story,7.7,5415.0,jealousy toy boy friendship friends tomhanks t...
1,8844,Jumanji,6.9,2413.0,boardgame disappearance basedonchildren'sbook ...
2,15602,Grumpier Old Men,6.5,92.0,fishing bestfriend duringcreditsstinger oldmen...
3,31357,Waiting to Exhale,6.1,34.0,basedonnovel interracialrelationship singlemot...
4,11862,Father of the Bride Part II,5.7,173.0,baby midlifecrisis confidence aging daughter s...


### Compute Weighted Ratings

We will use the IMDB Weighted Ratings Formula to calculate the rating scores for the movies. The formula used is as below:

$${WR} = \frac{n}{n+m}\cdot R  +  \frac{m}{n+m}\cdot C$$

Where:
- **n** is the number of votes for the movie;
- **m** is a hyperparameter indicating the minimum votes required to be considered;
- **R** is the average rating for the movie; and
- **C** is the mean vote across all movies.

In [35]:
# Define weighted ratings formula
def weighted_ratings(x, m=m, C=C):
    n = x['vote_count']
    R = x['vote_average']
    WR = n/(n+m)*R + m/(n+m)*C
    return WR

In [36]:
m = df_main['vote_count'].quantile(0.70)
C = df_main['vote_average'].mean()
df_main['WR'] = df_main.apply(weighted_ratings, axis = 1)

Filter movies with the minimum vote count **m**.

In [45]:
df_main = df_main[df_main['vote_count']>= m]

### Create Similarity Matrix

In [46]:
# Create Count Matrix 
count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df_main['pool'])
count_matrix.shape

(14046, 37657)

In [47]:
# Create Similarity Matrix
similarity_matrix = cosine_similarity(count_matrix, count_matrix)

In [63]:
# Trim Dataset for Implementator Function
df_final = df_main[['title', 'WR']].reset_index()

In [80]:
df_main.sort_values(by = ['WR'], ascending = False)

Unnamed: 0,id,title,vote_average,vote_count,pool,WR
10397,19404,Dilwale Dulhania Le Jayenge,9.1,661.0,musical shahrukhkhan kajol amrishpuri anupamkh...,8.972876
314,278,The Shawshank Redemption,8.5,8358.0,prison corruption policebrutality prisoncell d...,8.491387
841,238,The Godfather,8.5,6024.0,italy loveatfirstsight lossoffather patriarch ...,8.488063
41414,372058,Your Name.,8.5,1030.0,supernatural romance school starcrossedlovers ...,8.431558
40249,192040,Planet Earth,8.8,176.0,miniseries greatcinematpgraphy davidattenborou...,8.403449
...,...,...,...,...,...,...
8690,22293,Manos: The Hands of Fate,2.0,56.0,fire gun drive sacrifice flashlight haroldp.wa...,3.114731
13710,14164,Dragonball Evolution,2.9,475.0,karate superhero revenge dragon duringcreditss...,3.035586
17898,40016,Birdemic: Shock and Terror,2.1,69.0,birdattack naturerunamok alanbagh whitneymoore...,3.033970
44554,271404,Beyond Skyline,0.0,30.0,invasion sequel alien frankgrillo bojananovako...,2.550785


The last step is to define the implementator function **get_recommendation**

In [105]:
# Define Implementor Function get_recommendations
def get_recommendations(title, data = df_final, sim_matrix = similarity_matrix, top_n = 10):

    # Get the index of the movie that matches the title
    idx = data[data['title'] == title].index[0]

    # Get the pairwsie similarity scores of all movies with that movie
    sim_scores = list(enumerate(sim_matrix[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 30 most similar movies
    sim_scores = sim_scores[1:21]
    
    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar movies with highest ratings
    return data[['title', 'WR']].iloc[movie_indices].sort_values(by = ['WR'], ascending = False).head(top_n)

### Implement The Recommender System

Now feel free to use the implementor function to get recommendations for any movies.

In [116]:
get_recommendations('The Phantom of the Opera')

Unnamed: 0,title,WR
4902,The Phantom Carriage,7.150455
7858,The Cremator,7.067081
4241,The Hunchback of Notre Dame,6.757512
4913,The Man Who Laughs,6.61567
3979,Code Unknown,6.516586
10864,The Big Shave,6.452683
10925,Hansel & Gretel,6.428665
12356,Marguerite,6.242346
4935,Camille,6.239494
10476,When Animals Dream,5.92309
