# Recommender Model Intro

Recommender systems are very popular applications today, used in many sectors to personalize the service provided to customers. The basic idea behind these models is to predict the "rating" or "preference" that a user would give to an item.

Almost every major tech company has applied recommender systems in some form. Amazon suggests products to customers, YouTube selects which video to play next on autoplay, and Facebook recommends pages to like and people to follow. For companies like Netflix and Spotify, recommendations are at the core of the business model, and their success revolves around the power of those recommendations.

<br><br>
Types of recommenders:

1. **Simple recommenders** offer generalized recommendations based on popularity and/or genre. The basic idea is that items that are more popular and critically acclaimed have a higher probability of being liked by the average audience. The IMDB Top 250 is an example of this system.
2. **Content recommenders** suggest items similar to a particular item. They use item metadata to make these recommendations: for films, the genre, director, description, actors, synopsis, etc.; for music, the singer, writer, genre, etc. The general idea is that if someone liked a particular item, the same person will also like an item that is similar to it.
3. **Collaborative filtering recommenders** try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Unlike their content-based counterparts, collaborative filters do not require item metadata.

<br><br>
Common steps across the recommenders include:
1. Decide on the **metric or score** on which to rate the items.
2. **Calculate scores** for the existing items.
3. **Sort the items** based on the score and output the top results.

A nice dataset to work with is: https://www.kaggle.com/rounakbanik/the-movies-dataset


In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext watermark
%watermark -v -m -p numpy,pandas,skmultilearn -g

import os
import sys
import re
from tqdm import tqdm
import yaml
import watermark
from math import floor
from pprint import pprint as pp
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import pandas_profiling
from pandas.plotting import register_matplotlib_converters    # for pandas_profiling

register_matplotlib_converters()                              # for pandas_profiling
sys.path.append(os.pardir)

CPython 3.7.3
IPython 7.5.0

numpy 1.16.4
pandas 0.24.2
skmultilearn unknown

compiler   : GCC 7.3.0
system     : Linux
release    : 5.0.0-27-generic
machine    : x86_64
processor  : x86_64
CPU cores  : 8
interpreter: 64bit
Git hash   : c987530a06d5dc92c5ad22fef7532c74dadb8a1c


<br>
As always, we first load the data and inspect which types of variables are available in the dataset.

In [2]:
metadata = pd.read_csv('./data/raw/movies_metadata.csv', low_memory=False)

In [3]:
metadata.head(3)

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0


In [4]:
metadata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45466 entries, 0 to 45465
Data columns (total 24 columns):
adult                    45466 non-null object
belongs_to_collection    4494 non-null object
budget                   45466 non-null object
genres                   45466 non-null object
homepage                 7782 non-null object
id                       45466 non-null object
imdb_id                  45449 non-null object
original_language        45455 non-null object
original_title           45466 non-null object
overview                 44512 non-null object
popularity               45461 non-null object
poster_path              45080 non-null object
production_companies     45463 non-null object
production_countries     45463 non-null object
release_date             45379 non-null object
revenue                  45460 non-null float64
runtime                  45203 non-null float64
spoken_languages         45460 non-null object
status                   45379 non-null object

In [5]:
metadata.describe()

Unnamed: 0,revenue,runtime,vote_average,vote_count
count,45460.0,45203.0,45460.0,45460.0
mean,11209350.0,94.128199,5.618207,109.897338
std,64332250.0,38.40781,1.924216,491.310374
min,0.0,0.0,0.0,0.0
25%,0.0,85.0,5.0,3.0
50%,0.0,95.0,6.0,10.0
75%,0.0,107.0,6.8,34.0
max,2787965000.0,1256.0,10.0,14075.0


In [6]:
metadata.profile_report()



In [7]:
ratings_sm = pd.read_csv("./data/raw/ratings_small.csv", low_memory=False)

In [8]:
ratings_sm.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182


In [9]:
ratings_sm.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100004 entries, 0 to 100003
Data columns (total 4 columns):
userId       100004 non-null int64
movieId      100004 non-null int64
rating       100004 non-null float64
timestamp    100004 non-null int64
dtypes: float64(1), int64(3)
memory usage: 3.1 MB


In [10]:
ratings_sm.describe()

Unnamed: 0,userId,movieId,rating,timestamp
count,100004.0,100004.0,100004.0,100004.0
mean,347.01131,12548.664363,3.543608,1129639000.0
std,195.163838,26369.198969,1.058064,191685800.0
min,1.0,1.0,0.5,789652000.0
25%,182.0,1028.0,3.0,965847800.0
50%,367.0,2406.5,4.0,1110422000.0
75%,520.0,5418.0,4.0,1296192000.0
max,671.0,163949.0,5.0,1476641000.0


## Most Popular

For this calculation the most obvious scores to rank by are the movie's average rating in the metadata or an average of the individual user ratings. Both come with a few caveats.

For example, a plain average ignores how many votes contributed to it, so a movie rated 9 by 20 voters would rank above a movie rated 8.5 by 1000 voters (vote_average in the metadata runs from 0 to 10, while the user ratings in ratings_small.csv run from 0.5 to 5).

Basically, it tends to favor items with a small number of voters and extremely high ratings. As the number of voters increases, the rating regularizes and the final score becomes more representative of the true quality or value of the item.

To compensate, we can create a weighted rating that penalizes items with very few votes. Following the IMDB formula, the resulting expression is:

\begin{equation*}
\text{Weighted Rating (WR)} = \left( \frac{v}{v + m} \cdot AR \right) + \left( \frac{m}{v + m} \cdot MV \right)
\end{equation*}

Parameters:
- v is the number of votes for the movie
- m is the minimum votes required to be listed in the chart
- AR is the average rating of the movie
- MV is the mean vote across the whole report
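
As a quick check of how the formula behaves, here is a minimal sketch that evaluates it for two made-up movies. The function name `weighted_rating_value` and the two example movies are illustrative only; the values m = 434 and MV = 5.618 are the ones this notebook derives further down for the 95th percentile.

```python
def weighted_rating_value(v, ar, m, mv):
    """IMDB-style weighted rating for a single item."""
    return (v / (v + m)) * ar + (m / (v + m)) * mv

M = 434      # minimum votes (95th percentile cut-off, computed later in the notebook)
MV = 5.618   # mean vote across the dataset (computed later in the notebook)

# Two hypothetical movies: high average with few votes vs. lower average with many votes
few_votes = weighted_rating_value(v=20, ar=9.0, m=M, mv=MV)
many_votes = weighted_rating_value(v=1000, ar=8.5, m=M, mv=MV)

print(f"9.0 from 20 votes   -> {few_votes:.2f}")
print(f"8.5 from 1000 votes -> {many_votes:.2f}")
```

With only a few votes the score is pulled strongly toward the global mean MV, so the movie with 1000 votes ends up ranked higher even though its raw average is lower.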

<br><br>
And now we can begin to calculate these parameters from the data available.

First, the mean vote across the full dataset is simply the mean of the vote_average column:

In [11]:
MV = metadata['vote_average'].mean()
print(f"Mean Vote: {MV:.3f}")

Mean Vote: 5.618


To calculate the minimum number of votes we can use a specific percentile as the cut-off, for example the 90th or the 95th. This way we only consider items that have at least as many votes as 90% (or 95%) of the items.

In [12]:
m = metadata['vote_count'].quantile(0.90)
print(f"90th percentile cut off: {int(m)}")

m = metadata['vote_count'].quantile(0.95)
print(f"95th percentile cut off: {int(m)}")

90th percentile cut off: 160
95th percentile cut off: 434


As expected the 95th percentile will be more restrictive requiring items with at least 434 votes.

Now we filter the movies that comply with the requirement of the 95th percentile:

In [13]:
selected_movies = metadata.loc[metadata['vote_count'] >= m].copy()
selected_movies.shape

(2274, 24)

From these 2274 movies we calculate the metric for each one by using the Weighted Rating that we have previously defined.

In [14]:
def weighted_rating(x, m=m, mean_vote=MV):
    n_votes = x['vote_count']
    vote_av = x['vote_average']
    # Calculation based on the IMDB formula
    return (n_votes/(n_votes + m) * vote_av) + (m/(m + n_votes) * mean_vote)

In [15]:
# Apply Weighted Rating
selected_movies['score'] = selected_movies.apply(weighted_rating, axis=1)

# Sort movies based on the score
selected_movies = selected_movies.sort_values('score', ascending=False)

# Print the top 10 movies
TOPN = 10
selected_movies[['title', 'vote_count', 'vote_average', 'score']].head(TOPN)

Unnamed: 0,title,vote_count,vote_average,score
314,The Shawshank Redemption,8358.0,8.5,8.357746
834,The Godfather,6024.0,8.5,8.306334
12481,The Dark Knight,12269.0,8.3,8.208376
2843,Fight Club,9678.0,8.3,8.184899
292,Pulp Fiction,8670.0,8.3,8.172155
351,Forrest Gump,8147.0,8.2,8.069421
522,Schindler's List,4436.0,8.3,8.061007
23673,Whiplash,4376.0,8.3,8.058025
5481,Spirited Away,3968.0,8.3,8.035598
1154,The Empire Strikes Back,5998.0,8.2,8.025793


## Content Recommender

This model builds recommendations based on similarity across items. More specifically, we have to compute pairwise similarity scores for all items based on specific features. In this dataset the feature we have available is the plot description.

This data includes a lot of words that are not useful for the comparison, like stop words, so we need to clean it first and then translate it into word vectors, which are what we will use for the similarity calculations.

**Term Frequency-Inverse Document Frequency (TF-IDF)**, calculated for each document, provides us with a matrix where each row is a document and each column is a different word (one that appears in at least one document). The TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which it occurs. This reduces the importance of words that occur frequently across plot overviews and, therefore, their weight in the final similarity score.
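
To make the weighting concrete, here is a hand-rolled sketch on three made-up sentences. It assumes the smoothed IDF variant ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default; the real vectorizer also tokenizes more carefully, can remove stop words, and normalizes each row, none of which is reproduced here.

```python
import math
from collections import Counter

# Three made-up "overviews", tokenized by whitespace for simplicity
docs = [
    "toy cowboy rescues toy friends",
    "children play a magic board game",
    "old friends feud over a wedding",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each word appears
df = Counter(word for tokens in tokenized for word in set(tokens))

# Smoothed inverse document frequency (assumed variant, see the text above)
idf = {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in df.items()}

# One row per document, one column per vocabulary word
vocabulary = sorted(df)
tfidf_rows = []
for tokens in tokenized:
    counts = Counter(tokens)
    tfidf_rows.append([counts[word] * idf[word] for word in vocabulary])

print(len(tfidf_rows), "documents x", len(vocabulary), "words")
print(f"idf('toy') = {idf['toy']:.3f}, idf('friends') = {idf['friends']:.3f}")
```

The word "friends" appears in two documents and therefore gets a lower IDF than "toy", which appears in only one.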

Fortunately, scikit-learn provides a built-in TfidfVectorizer class that produces the TF-IDF matrix in a couple of lines.

In [16]:
# Example of the data...
metadata['overview'].head(3)

0    Led by Woody, Andy's toys live happily in his ...
1    When siblings Judy and Peter discover an encha...
2    A family wedding reignites the ancient feud be...
Name: overview, dtype: object

In [17]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

# Define a TF-IDF Vectorizer Object and remove all english stop words such as 'the', 'a'
tfidf = TfidfVectorizer(stop_words='english')

# Replace NaN with an empty string
metadata['overview'] = metadata['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(metadata['overview'])
tfidf_matrix.shape

(45466, 75827)

As a result, the vocabulary contains over 75K different words used to describe the 45K movies in the dataset.

Now we proceed to compute the similarity score. Typical similarity metrics include:
- **euclidean**
- **Pearson**
- **cosine**

Different scores work well in different scenarios and it is often a good idea to experiment with different metrics.

For example, the cosine similarity score is independent of magnitude and is relatively easy and fast to calculate. It is given by the following expression:


\begin{equation*}
\text{cosine}(x, y) = \frac{x \cdot y^T}{\|x\| \cdot \|y\|}
\end{equation*}

Because TfidfVectorizer L2-normalizes each row by default, every document vector has unit length, so the dot product of two rows is already their cosine similarity. We can therefore use sklearn's linear_kernel() instead of cosine_similarity(), since it is faster.
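
The equivalence between the dot product and the cosine similarity holds for unit-length vectors, which is what TF-IDF rows are under the vectorizer's default L2 normalization. A small NumPy check with two made-up vectors:

```python
import numpy as np

# Two arbitrary example vectors
x = np.array([1.0, 2.0, 0.0])
y = np.array([2.0, 1.0, 1.0])

# Cosine similarity straight from the definition
cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

# Scale both vectors to unit length, then take a plain dot product
x_unit = x / np.linalg.norm(x)
y_unit = y / np.linalg.norm(y)
dot_of_units = x_unit @ y_unit

print(np.isclose(cosine, dot_of_units))  # prints True
```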

In [19]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

MemoryError: 
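
The MemoryError is expected at this size: a dense 45466 x 45466 matrix of float64 values needs roughly 16.5 GB. One way around it, sketched here on a toy corpus, is to compute only the row of similarities for the requested item when it is needed. The helper name `top_similar` and the four example sentences are hypothetical and not part of the original notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Made-up overviews standing in for metadata['overview']
docs = [
    "toys come alive when humans leave",
    "a cowboy doll and toys go on an adventure",
    "two old men feud over a wedding",
    "a magic board game traps children",
]
toy_matrix = TfidfVectorizer(stop_words='english').fit_transform(docs)

def top_similar(matrix, idx, topn=10):
    """Return (row position, score) pairs for the topn rows most similar to row idx."""
    # Similarity of one row against all rows: shape (1, n_rows), flattened to (n_rows,)
    scores = linear_kernel(matrix[idx:idx + 1], matrix).ravel()
    scores[idx] = -1.0  # exclude the item itself
    order = np.argsort(scores)[::-1][:topn]
    return [(int(i), float(scores[i])) for i in order]

print(top_similar(toy_matrix, 0, topn=2))
```

Each call only allocates one row of scores, so memory grows with the number of movies rather than with its square, at the cost of recomputing the row on every request.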

The recommender will receive an item (movie title) as input and it will return a list of the 10 most similar items. For this we need a reverse mapping of movie titles and DataFrame indices.

The steps to follow in the recommender will be the following:

1. Get the index of the movie given its title.
2. Get the list of cosine similarity scores for that particular movie against all movies. Convert it into a list of tuples where the first element is its position and the second is the similarity score.
3. Sort the aforementioned list of tuples based on the similarity scores.
4. Get the top 10 elements of this list. Ignore the first element as it refers to self (the movie most similar to a particular movie is the movie itself).
5. Return the titles corresponding to the indices of the top elements.

In [None]:
# Reverse map of movie titles to DataFrame indices, keeping the first entry for repeated titles
indices = pd.Series(metadata.index, index=metadata['title'])
indices = indices[~indices.index.duplicated(keep='first')]

In [None]:
def get_recommendations(title, cosine_sim=cosine_sim, topn=10):
    """Takes in movie title as input and outputs most similar movies"""
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the topn most similar movies (skipping the movie itself)
    sim_scores = sim_scores[1:(topn+1)]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the topn most similar movies
    return metadata['title'].iloc[movie_indices]

In [None]:
get_recommendations('The Shawshank Redemption')

As always, the more we increase the quality of the data used the better will be the results we obtain.

The previous similarity calculation misses very important information such as the actors, the director, the genres and the movie plot keywords. Part of this data is not available in the current dataset, so we will load extra data.

In [None]:
# Load keywords and credits
credits = pd.read_csv('./data/raw/credits.csv')
keywords = pd.read_csv('./data/raw/keywords.csv')

# Remove rows with bad IDs.
bad_ids =  [idx for idx, k in zip(metadata.index, metadata['id'].values) if not k.isdigit()]
metadata = metadata.drop(bad_ids)

# Convert IDs to int. Required for merging
keywords['id'] = keywords['id'].astype('int')
credits['id'] = credits['id'].astype('int')
metadata['id'] = metadata['id'].astype('int')

# Merge keywords and credits into your main metadata dataframe
metadata = metadata.merge(credits, on='id')
metadata = metadata.merge(keywords, on='id')

From the new features (cast, crew and keywords) we extract the most important actors, the director and the keywords. Right now the data is in the form of "stringified" lists, so first of all we transform them into usable lists.
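
As a small illustration of what parsing a "stringified" list means, the sketch below feeds a string shaped like the genres column to `ast.literal_eval`. The string is an illustrative value built from genre entries that appear in the sample rows above, not an actual row of the dataset.

```python
from ast import literal_eval

# A string that merely looks like a Python list of dicts
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 12, 'name': 'Adventure'}]"

# literal_eval only evaluates Python literals, so it is safer than eval()
parsed = literal_eval(raw)

print(type(raw).__name__, '->', type(parsed).__name__)
print([entry['name'] for entry in parsed])
```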

In [None]:
# Example of data...
metadata['cast'][0][:99]

In [None]:
# Parse the stringified features into their corresponding python objects
from ast import literal_eval

features = ['cast', 'crew', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(literal_eval)

Now we extract the required information from each feature.

In [None]:
def get_director(x):
    """Get the director's name from the crew feature. If director is not listed, return NaN"""
    for i in x:
        if i['job'] == 'Director':
            return i['name']
    return np.nan

In [None]:
def get_list(x, topn=3):
    """Return the names of the first topn elements, or of the whole list if it is shorter."""
    if isinstance(x, list):
        names = [i['name'] for i in x]
        # Keep only the first topn names when the list is longer than that
        if len(names) > topn:
            names = names[:topn]
        return names

    #Return empty list in case of missing/malformed data
    return []

In [None]:
# Define new director, cast, genres and keywords features that are in a suitable form.
metadata['director'] = metadata['crew'].apply(get_director)

features = ['cast', 'keywords', 'genres']
for feature in features:
    metadata[feature] = metadata[feature].apply(get_list)

We convert everything to lowercase and strip the spaces from names so that the same entity is always written the same way. Stripping spaces also keeps the vectorizer from counting the "Marlon" of "Marlon Brando" and "Marlon Jackson" as the same token: after this step the two actors are distinct to the vectorizer.
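
The effect of removing spaces can be seen in a tiny sketch. It uses a plain whitespace split as a simplified stand-in for the vectorizer's tokenizer (an assumption made for illustration), and the helper names `tokens` and `tokens_cleaned` are hypothetical.

```python
def tokens(names):
    # Whitespace split as a simplified stand-in for the vectorizer's tokenizer
    return set(' '.join(names).lower().split())

def tokens_cleaned(names):
    # Lowercase and remove inner spaces so each full name becomes a single token
    return set(name.lower().replace(' ', '') for name in names)

movie_a = ['Marlon Brando']
movie_b = ['Marlon Jackson']

print(tokens(movie_a) & tokens(movie_b))                  # shared token: 'marlon'
print(tokens_cleaned(movie_a) & tokens_cleaned(movie_b))  # no shared tokens
```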

In [None]:
# Function to convert all strings to lower case and strip names of spaces
def clean_data(x):
    if isinstance(x, list):
        return [str.lower(i.replace(" ", "")) for i in x]
    else:
        # Check if director exists. If not, return empty string
        if isinstance(x, str):
            return str.lower(x.replace(" ", ""))
        else:
            return ''

In [None]:
# Apply clean_data function to your features.
features = ['cast', 'keywords', 'director', 'genres']

for feature in features:
    metadata[feature] = metadata[feature].apply(clean_data)

In [None]:
def create_combination(x):
    return ' '.join(x['keywords']) + ' ' + ' '.join(x['cast']) + \
           ' ' + x['director'] + ' ' + ' '.join(x['genres'])

In [None]:
metadata['combination'] = metadata.apply(create_combination, axis=1)

The next steps are the same as for the previous recommender, with one important difference: we now use CountVectorizer() instead of TF-IDF. We don't want to down-weight an actor or director just because he or she has appeared in or directed relatively more movies; that would make little intuitive sense.

In [None]:
# Import CountVectorizer and create the count matrix
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(metadata['combination'])

In [88]:
# Compute the Cosine Similarity matrix based on the count_matrix
from sklearn.metrics.pairwise import cosine_similarity

cosine_sim2 = cosine_similarity(count_matrix, count_matrix)

MemoryError: 

In [None]:
# Reset index of your main DataFrame and construct reverse mapping as before
metadata = metadata.reset_index()
indices = pd.Series(metadata.index, index=metadata['title'])

In [None]:
get_recommendations('The Shawshank Redemption', cosine_sim2)

Because it uses more metadata, this recommender captures more information about each movie and should give better recommendations, although there are many ways to use the available information to improve them further.

Some suggestions:

- Popularity filter: this recommender would take the list of the 30 most similar movies, calculate the weighted ratings (using the IMDB formula from above), sort movies based on this rating and return the top 10 movies.
- Other crew members: other crew member names, such as screenwriters and producers, could also be included.
- Increasing the weight of the director: to give more weight to the director, his or her name could be repeated several times in the combined string (the 'combination' column) to increase the similarity scores of movies with the same director.
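
Two of these suggestions can be sketched briefly. The function names `rerank_by_popularity` and `create_weighted_combination`, the `director_weight` parameter, and the example data are all hypothetical; the scoring reuses the IMDB-style weighted rating from the "Most Popular" section with m = 434 and a mean vote of 5.618.

```python
import pandas as pd

def rerank_by_popularity(candidates, m, mean_vote, topn=10):
    """Re-rank similar movies by IMDB-style weighted rating and keep the topn."""
    v = candidates['vote_count']
    ar = candidates['vote_average']
    ranked = candidates.assign(score=(v / (v + m)) * ar + (m / (v + m)) * mean_vote)
    return ranked.sort_values('score', ascending=False).head(topn)

def create_weighted_combination(row, director_weight=3):
    """Build the combined string, repeating the director director_weight times."""
    directors = [row['director']] * director_weight
    return ' '.join(row['keywords'] + row['cast'] + directors + row['genres'])

# Made-up candidates standing in for the 30 most similar movies
candidates = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'vote_count': [10, 1000, 5000],
    'vote_average': [9.5, 8.0, 7.0],
})
top = rerank_by_popularity(candidates, m=434, mean_vote=5.618, topn=3)
print(top['title'].tolist())

# Illustrative row in the cleaned format produced by clean_data
row = {'keywords': ['prison'], 'cast': ['timrobbins'],
       'director': 'frankdarabont', 'genres': ['drama']}
print(create_weighted_combination(row))
```

Movie A has the highest raw average but only 10 votes, so the re-ranking places it last; the repeated director token raises the count-based similarity between movies that share a director.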

## Collaborative Filtering Recommenders