# Content-Based Recommender Example

## Helpful Links:
 * [This blog post](http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html) describes where / when to use Content-Based models over other ones, and provides a really simple example of how to build one ([code](https://github.com/groveco/content-engine/blob/master/engines.py)).
 * [Another tutorial](https://www.datacamp.com/community/tutorials/recommender-systems-python) for building content-based recommender systems. Goes into more detail on other features you can use besides TF/IDF description similarity ([code](https://github.com/rounakbanik/movies/blob/master/movies_recommender.ipynb)).
 * [The MovieLens dataset](http://files.grouplens.org/datasets/movielens/ml-latest-small.zip) used in this example can be downloaded here. An explanation of the dataset can be found [here](http://files.grouplens.org/datasets/movielens/ml-latest-small-README.html).
 * [The TMDB dataset](https://www.kaggle.com/tmdb/tmdb-movie-metadata/downloads/tmdb_5000_movies.csv/2) used in this example can be downloaded here.

## Notes:
 * Content-based recommendation algorithms aren't very useful in most cases, because customers typically do this kind of filtering themselves. For example, if a customer reads "Chamber of Secrets", recommending "Prisoner of Azkaban" doesn't add much: they almost certainly know about it already.
   - Thus, Collaborative Filtering, which predicts new content a customer might enjoy, is better at generating incremental sales (sales the customer wouldn't have made without the recommendation algorithm).
 * However, Collaborative Filtering (and most other models) requires pre-existing data on the user in question in order to make accurate predictions.
   - Content-based recommendation relies only on the products that are available, so its predictive accuracy is unchanged for brand-new customers.

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import json
import os

from IPython.display import display
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

In [2]:
# GETTING STARTED:
# Download both of the datasets in the links above,
# and put the TMDB dataset CSV inside the folder
# containing the MovieLens dataset CSVs

# Set this to where your dataset folder is located
# For example, mine is at ./data/ml-latest-small/
DATA_DIR = os.path.join('..', 'data', 'ml-latest-small')

RATINGS_CSV = os.path.join(DATA_DIR, "ratings.csv")
MOVIES_CSV = os.path.join(DATA_DIR, "movies.csv")
LINKS_CSV = os.path.join(DATA_DIR, "links.csv")
TMDB_CSV = os.path.join(DATA_DIR, "tmdb.csv")

ratings_raw_df = pd.read_csv(RATINGS_CSV)
movies_raw_df = pd.read_csv(MOVIES_CSV)
links_df = pd.read_csv(LINKS_CSV)
tmdb_df = pd.read_csv(TMDB_CSV)

print("Ratings DF Sample (5 of %d):" % len(ratings_raw_df))
display(ratings_raw_df.head())

Ratings DF Sample (5 of 100004):


| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [3]:
# Merge the movies and TMDB dataframes
links_df['tmdbId'] = links_df['tmdbId'].fillna(0).astype('int')
movies_df = movies_raw_df.merge(links_df, on='movieId')
movies_df = movies_df.merge(tmdb_df, left_on='tmdbId', right_on='id')

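# Fix up some of the columns
# Join overview and tagline with a space so the last word of one
# does not run into the first word of the other
movies_df['description'] = movies_df['overview'].fillna('') + ' ' + \
        movies_df['tagline'].fillna('')
# The keywords column holds a JSON list of objects; keep only the names
movies_df['keywords'] = movies_df['keywords'].apply(
        lambda x: [y['name'] for y in json.loads(x)])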
movies_df['genres'] = movies_df['genres_x'].apply(
        lambda x: x.lower().split('|'))
movies_df['title'] = movies_df['title_y']

# Drop ratings for movies that have no metadata in movies_df
movie_ids = list(movies_df['movieId'])
ratings_df = ratings_raw_df[ratings_raw_df['movieId'].isin(movie_ids)]
ratings_df = ratings_df.reset_index(drop=True)

# Drop the columns we don't want
movies_df = movies_df.drop(columns=['title_x', 'title_y', 'genres_x',
        'genres_y', 'id', 'imdbId', 'budget', 'homepage',
        'original_language', 'original_title', 'overview', 'popularity',
        'production_companies', 'production_countries', 'release_date',
        'revenue', 'spoken_languages', 'status', 'tagline', 'tmdbId',
        'vote_average', 'vote_count'])
ratings_df = ratings_df.drop(columns=['timestamp'])

print("Filtered Ratings Sample (5 of %d):" % len(ratings_df))
display(ratings_df.head())

print("\nMerged Movies DF Sample (5 of %d):" % len(movies_df))
display(movies_df.head())

Filtered Ratings Sample (5 of 66947):


| | userId | movieId | rating |
|---|---|---|---|
| 0 | 1 | 1061 | 3.0 |
| 1 | 1 | 1129 | 2.0 |
| 2 | 1 | 1263 | 2.0 |
| 3 | 1 | 1293 | 2.0 |
| 4 | 1 | 1339 | 3.5 |



Merged Movies DF Sample (5 of 3404):


| | movieId | keywords | runtime | description | genres | title |
|---|---|---|---|---|---|---|
| 0 | 1 | [jealousy, toy, boy, friendship, friends, riva... | 81.0 | Led by Woody, Andy's toys live happily in his ... | [adventure, animation, children, comedy, fantasy] | Toy Story |
| 1 | 10 | [cuba, falsely accused, secret identity, compu... | 130.0 | James Bond must unmask the mysterious head of ... | [action, adventure, thriller] | GoldenEye |
| 2 | 11 | [white house, usa president, new love, widower... | 106.0 | Widowed U.S. president Andrew Shepherd, one of... | [comedy, drama, romance] | The American President |
| 3 | 14 | [usa president, presidential election, waterga... | 192.0 | An all-star cast powers this epic look at Amer... | [drama] | Nixon |
| 4 | 15 | [exotic island, treasure, map, ship, scalp, pi... | 119.0 | Morgan Adams and her slave, William Shaw, are ... | [action, adventure, romance] | Cutthroat Island |
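
The merge in the cell above is an inner join, so any movie without a matching TMDB row is dropped. That is why the ratings shrink from 100004 to 66947 rows. The sketch below shows one way to see which rows would be lost, and how the keyword JSON strings become lists of names. The toy rows and IDs are made up for illustration; only the column names `movieId`, `tmdbId`, `id` and `keywords` come from the notebook.

```python
import json
import pandas as pd

# Toy stand-ins for the merged movies/links frame and the TMDB CSV.
# All rows and IDs here are hypothetical.
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "tmdbId": [862, 0, 949],  # 0 marks a missing tmdbId, as after fillna(0)
})
tmdb = pd.DataFrame({
    "id": [862, 949],
    "keywords": [
        '[{"id": 931, "name": "jealousy"}, {"id": 4290, "name": "toy"}]',
        '[]',
    ],
})

# A left merge with indicator=True labels each row "both" or "left_only",
# so movies without TMDB metadata can be listed before they are dropped.
checked = movies.merge(tmdb, left_on="tmdbId", right_on="id",
                       how="left", indicator=True)
unmatched = checked.loc[checked["_merge"] == "left_only", "movieId"].tolist()

# Keep the matched rows (what the inner merge would return) and parse
# each JSON string into a plain list of keyword names.
matched = checked[checked["_merge"] == "both"].copy()
matched["keywords"] = matched["keywords"].apply(
    lambda s: [kw["name"] for kw in json.loads(s)])

print("Movies without TMDB metadata:", unmatched)
print(matched[["movieId", "keywords"]])
```

Running the same check on the real `movies_df` merge would show how many MovieLens titles have no entry in the TMDB file.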


In [4]:
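# Create a TF-IDF matrix of the 1-, 2-, and 3-gram word phrases in
# the descriptions. Ignore English stop words (e.g. 'the').
# min_df=1 (the default) keeps every term; an integer 0 may be
# rejected by newer scikit-learn releases.
TF_IDF = TfidfVectorizer(analyzer='word', ngram_range=(1, 3),
        min_df=1, stop_words='english')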
tfidf_matrix = TF_IDF.fit_transform(movies_df['description'])

# Calculate the cosine similarity between the movies
# using the TF-IDF matrix. TfidfVectorizer L2-normalizes each
# row by default, so the plain dot product from linear_kernel
# equals the cosine similarity and is faster to compute.
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

titles = movies_df['title']
indices = pd.Series(movies_df.index, index=movies_df['title'])

def get_recommendations(title, n):
    """
    Returns the n movies most similar to `title` as (title, score)
    tuples, using the precomputed cosine similarity matrix.
    Assumes `title` appears exactly once in movies_df.
    """
    idx = indices[title]
    similarity_scores = list(enumerate(cosine_sim[idx]))
    # Sort by score, highest first, then skip position 0, which is
    # normally the movie itself (similarity 1.0 with itself)
    similarity_scores = sorted(similarity_scores,
            key=lambda x: x[1], reverse=True)[1:n+1]
    results = [(titles[x[0]], x[1]) for x in similarity_scores]
    return results

def pprint_recs(recs):
    print("\n".join(["(%f) %s" % (x[1], x[0]) for x in recs]))

# Sanity Check: Get some recs for popular movies
print("The Godfather Recommendations:")
pprint_recs(get_recommendations('The Godfather', 10))
print("\nThe Dark Knight Recommendations:")
pprint_recs(get_recommendations('The Dark Knight', 10))

The Godfather Recommendations:
(0.147459) The Godfather: Part II
(0.046813) Made
(0.032958) Summer of Sam
(0.031096) Thinner
(0.030020) Run All Night
(0.029734) The Godfather: Part III
(0.028901) The Color Purple
(0.028754) 8 Women
(0.027666) Machete
(0.027361) Jaws: The Revenge

The Dark Knight Recommendations:
(0.132618) The Dark Knight Rises
(0.093304) Batman Forever
(0.066215) Batman Returns
(0.054294) Batman: The Dark Knight Returns, Part 2
(0.052912) Batman
(0.041014) JFK
(0.034657) Batman Begins
(0.033545) Sherlock Holmes: A Game of Shadows
(0.033214) Law Abiding Citizen
(0.027241) Batman v Superman: Dawn of Justice
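
The code comments above say `linear_kernel` is equivalent to cosine similarity here. A small sketch of why, on three made-up descriptions (not taken from the dataset): `TfidfVectorizer` L2-normalizes each row by default, so every row has length 1 and the dot product is already the cosine.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical descriptions, written only for this check.
docs = [
    "a mob boss hands the family business to his son",
    "the son of a mob boss expands the family business",
    "a masked vigilante protects the city from a criminal",
]

# norm="l2" is the TfidfVectorizer default, so each row is a unit vector.
tfidf = TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(docs)
row_norms = np.sqrt(matrix.multiply(matrix).sum(axis=1))

# Dot product vs. explicit cosine similarity on the same matrix.
dot = linear_kernel(matrix, matrix)
cos = cosine_similarity(matrix, matrix)
same = np.allclose(dot, cos)

print("Rows are unit length:", np.allclose(row_norms, 1.0))
print("linear_kernel matches cosine_similarity:", same)
print("Similarity of doc 0 to docs 1 and 2:", dot[0, 1], dot[0, 2])
```

If the vectorizer were created with `norm=None`, the two functions would no longer agree, and `cosine_similarity` would be the one to use.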
