# Introduction

Recommender systems are among the most popular applications of data science today. They are used to predict the *rating* or *preference* that a user would give to an item. Amazon uses them to suggest products to customers. YouTube uses them to decide which video to play next on autoplay.

There are also popular recommender systems for domains like restaurants and movies. Recommender systems have also been developed for research articles, experts, collaborators, and financial services.

Recommender systems can be classified primarily into 3 types:

- <u>Simple recommenders</u>: Offer generalized recommendations to every user, based on movie popularity and/or genre.  The basic idea behind this system is that movies that are more popular and critically acclaimed will have a higher probability of being liked by the average audience. For example, IMDB Top 250.

- <u> Content-based recommenders</u>: These recommenders suggest items similar to a particular item. They use item metadata (for movies: genre, director, actors, and so on) to make these recommendations. The general idea is that if a person likes a particular item, they will also like items that are similar to it. To make such recommendations, the system draws on the metadata of items the user has engaged with in the past. For example, YouTube suggests new videos that you might watch based on your history.

- <u> Collaborative filtering</u>: These systems are widely used. They try to predict the rating or preference that a user would give an item, based on the past ratings and preferences of other users. Unlike content-based recommenders, collaborative filtering systems do not require item metadata.

## Dataset

The dataset files contain metadata for 9742 movies listed in the [`MovieLens Dataset`](https://grouplens.org/datasets/movielens/). The dataset consists of movies released on or before September 2018. The dataset captures feature points like cast, crew, TMDB vote counts and vote averages.

This dataset consists of the following files:

* *movies.csv*: Each line of this file after the header row represents one movie, and has the following format:

*****
    movieId,title,genres
*****

Genres are a pipe-separated list. Some common genres are Action, Adventure, Animation, Comedy, and Crime.
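
Because the genres arrive as one pipe-separated string, they usually need to be split before any per-genre analysis. The following is a minimal sketch on two toy rows in the `movies.csv` format; the names `sample`, `genre_list`, and `genre_counts` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy rows in the same format as movies.csv
sample = pd.DataFrame({
    "movieId": [1, 2],
    "title": ["Toy Story (1995)", "Jumanji (1995)"],
    "genres": ["Adventure|Animation|Children|Comedy|Fantasy",
               "Adventure|Children|Fantasy"],
})

# Split the pipe-separated string into a Python list per movie
sample["genre_list"] = sample["genres"].str.split("|")

# One row per (movie, genre) pair, then count how often each genre occurs
genre_counts = sample.explode("genre_list")["genre_list"].value_counts()
print(genre_counts)
```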

* *links.csv*: This file contains the TMDB and IMDB IDs of all the movies featured in the `MovieLens Dataset`.

* *ratings.csv*: This file contains 100836 ratings across 9742 movies from 610 users. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

In [1]:
# Import libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style="darkgrid", palette="icefire")

import warnings
warnings.filterwarnings("ignore")

In [2]:
import os
# Raw string, so the backslashes in the Windows path are not treated as escapes
os.chdir(r'D:\Teaching\Python-Tutorial\data\ml-latest-small')
os.getcwd()

'D:\\Teaching\\Python-Tutorial\\data\\ml-latest-small'

In [3]:
# Load data
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [4]:
ratings.head(3)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224


The `ratings` DataFrame contains the IDs of the movies but not their titles. We'll need movie names for the movies we're recommending. We can merge the above two DataFrames, based on the column `movieId`.

In [5]:
metadata = pd.merge(movies, ratings, on="movieId")
metadata.head(3)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,964982703
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,847434962
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,1106635946


Let's compute the average rating of each movie, along with the number of ratings it received. To do so, we group the dataset by movie title and then calculate the mean and the count of the ratings for each title.

In [6]:
vote_average = pd.DataFrame(metadata.groupby('title')['rating'].mean())
vote_count = pd.DataFrame(metadata.groupby('title')['rating'].count())

vote_average.head()

title,rating
'71 (2014),4.0
'Hellboy': The Seeds of Creation (2004),4.0
'Round Midnight (1986),3.5
'Salem's Lot (2004),5.0
'Til There Was You (1997),4.0


In [7]:
# Add ratings and count
d_movies = pd.merge(movies, vote_average, on='title', how='left')
d_movies = pd.merge(d_movies, vote_count, on='title', how='left')

# Rename columns to vote_average and vote_count
d_movies = d_movies.rename(columns={'rating_x' : 'vote_average', 'rating_y': 'vote_count'})

d_movies.head()

Unnamed: 0,movieId,title,genres,vote_average,vote_count
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,215.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,110.0
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,52.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,7.0
4,5,Father of the Bride Part II (1995),Comedy,3.071429,49.0


Let us join the `tags` column as well. Each tag is typically a single word or short phrase. The meaning, value and purpose of a particular tag is determined by the user. Note that some movies in our DataFrame have no tags.

In [8]:
tags = pd.read_csv('tags.csv')

tags_df = pd.DataFrame(tags.groupby('movieId')['tag'].apply(lambda x: '{}'.format('|'.join(x))))

d_movies = pd.merge(d_movies, tags_df, on='movieId', how='left')
d_movies.head()

Unnamed: 0,movieId,title,genres,vote_average,vote_count,tag
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,215.0,pixar|pixar|fun
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,110.0,fantasy|magic board game|Robin Williams|game
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,52.0,moldy|old
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,7.0,
4,5,Father of the Bride Part II (1995),Comedy,3.071429,49.0,pregnancy|remake


Finally, let us also join the TMDB and IMDB IDs so as to generate the complete matrix.

In [9]:
links = pd.read_csv('links.csv')

d_movies = pd.merge(d_movies, links, on='movieId', how='left')
d_movies.head()

Unnamed: 0,movieId,title,genres,vote_average,vote_count,tag,imdbId,tmdbId
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,3.92093,215.0,pixar|pixar|fun,114709,862.0
1,2,Jumanji (1995),Adventure|Children|Fantasy,3.431818,110.0,fantasy|magic board game|Robin Williams|game,113497,8844.0
2,3,Grumpier Old Men (1995),Comedy|Romance,3.259615,52.0,moldy|old,113228,15602.0
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,2.357143,7.0,,114885,31357.0
4,5,Father of the Bride Part II (1995),Comedy,3.071429,49.0,pregnancy|remake,113041,11862.0


## Simple Recommender System

To compute the popularity of a movie *fairly*, we should calculate its weighted rating score. This score takes into account both the average rating and the number of votes a movie has accumulated. Such a score ensures that a movie with a high average rating from 100k voters scores higher than a movie with the same rating from only 100 voters.

Mathematically, the weighted rating score is formulated as:

$$
\mathbf{S} = \left( \frac{v}{v + m} \cdot \mathbf{R} \right) + \left( \frac{m}{v + m} \cdot \mathbf{C} \right) 
$$

where,
* $v$: number of votes for the movie (column: `vote_count`),
* $m$: minimum number of votes required to be listed in the chart,
* $\mathbf{R}$: average rating of the movie (column: `vote_average`),
* $\mathbf{C}$: mean vote across all movies.
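
To see how the formula behaves, here is a small sketch that scores two movies with the same average rating but very different vote counts. The helper `weighted_rating` and the values of `m_demo` and `C_demo` are hypothetical, chosen only for illustration.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating S for v votes, average rating R, threshold m, global mean C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Hypothetical values, chosen only for illustration
m_demo, C_demo = 50, 3.0

popular = weighted_rating(v=1000, R=4.5, m=m_demo, C=C_demo)
obscure = weighted_rating(v=10, R=4.5, m=m_demo, C=C_demo)

print(round(popular, 3))  # 4.429
print(round(obscure, 3))  # 3.25
```

With few votes, the score is pulled toward the global mean $\mathbf{C}$; with many votes, it stays close to the movie's own average $\mathbf{R}$.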

The value of $m$ acts as a threshold: movies with fewer votes than $m$ are removed from the chart. For our case, let us set this threshold to the $90^{th}$ percentile of the vote counts. In other words, for a movie to be featured in the chart, it must have at least as many votes as 90% of the movies in the list.
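
The percentile cutoff can be tried on a toy example first. This sketch assumes ten hypothetical vote counts; `votes`, `m_demo`, and `qualified` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Hypothetical vote counts for ten movies
votes = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# 90th percentile (pandas interpolates linearly between data points by default)
m_demo = votes.quantile(0.90)

# Movies that reach the threshold
qualified = votes[votes >= m_demo]
print(m_demo, len(qualified))
```

Here the threshold falls between 9 and 10 votes, so only one of the ten movies qualifies, which is roughly the top 10%.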

In [10]:
# Calculate mean of vote_average column, C
C = d_movies['vote_average'].mean()
C

3.262512882635387

In [11]:
# Min number of votes required to be in the chart, m
m = d_movies['vote_count'].quantile(0.90)
m

27.0

Refine the `d_movies` DataFrame based on these metrics.

In [12]:
# Filter first, then copy, so that t_movies is an independent DataFrame
t_movies = d_movies.loc[d_movies['vote_count'] >= m].copy()

print(d_movies.shape)
print(t_movies.shape)

(9742, 8)
(978, 8)


In [13]:
978./9742

0.10039006364196264

From the above output, about 10% of the movies have at least 27 votes and qualify to be on this list.

Next, let us calculate the weighted rating for each qualified movie.

In [14]:
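def weighted_score(x, m=m, C=C):
    """Weighted rating of one movie (a DataFrame row), per the formula above."""
    v = x['vote_count']      # number of votes for the movie
    R = x['vote_average']    # average rating of the movie

    return (v/(v+m) * R) + (m/(v+m) * C)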

In [15]:
t_movies['score'] = t_movies.apply(weighted_score, axis=1)

t_movies.sort_values('score',ascending=False).head(n=10)

Unnamed: 0,movieId,title,genres,vote_average,vote_count,tag,imdbId,tmdbId,score
277,318,"Shawshank Redemption, The (1994)",Crime|Drama,4.429022,317.0,prison|Stephen King|wrongful imprisonment|Morg...,111161,278.0,4.337465
659,858,"Godfather, The (1972)",Crime|Drama,4.289062,192.0,Mafia,68646,238.0,4.162502
2226,2959,Fight Club (1999),Action|Crime|Drama|Thriller,4.272936,218.0,dark comedy|psychology|thought-provoking|twist...,137523,550.0,4.161583
224,260,Star Wars: Episode IV - A New Hope (1977),Action|Adventure|Sci-Fi,4.231076,251.0,classic|space action|action|sci-fi|EPIC|great ...,76759,11.0,4.137007
46,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,4.237745,204.0,mindfuck|suspense|thriller|tricky|twist ending...,114814,629.0,4.123757
257,296,Pulp Fiction (1994),Comedy|Crime|Drama|Thriller,4.197068,307.0,good dialogue|great soundtrack|non-linear|cult...,110912,680.0,4.121521
461,527,Schindler's List (1993),Drama|War,4.225,220.0,moving|thought-provoking|Holocaust|based on a ...,108052,424.0,4.119789
1939,2571,"Matrix, The (1999)",Action|Sci-Fi|Thriller,4.192446,278.0,martial arts|sci-fi|alternate universe|philoso...,133093,603.0,4.110124
898,1196,Star Wars: Episode V - The Empire Strikes Back...,Action|Adventure|Sci-Fi,4.21564,211.0,I am your father|space|space opera|classic|Geo...,80684,1891.0,4.107512
314,356,Forrest Gump (1994),Comedy|Drama|Romance|War,4.164134,329.0,shrimp|Vietnam|bubba gump shrimp|lieutenant da...,109830,13.0,4.095752
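
As a sanity check, the top score in the table above can be reproduced by hand from the formula. The sketch below copies the printed values of `C`, `m`, `vote_count`, and `vote_average`; since the displayed average is rounded to six decimals, the result should agree only approximately.

```python
# Values copied from the outputs above
C_val = 3.262512882635387   # mean of vote_average
m_val = 27.0                # 90th percentile of vote_count
v, R = 317.0, 4.429022      # "Shawshank Redemption, The (1994)"

# Weighted rating: blend of the movie's own average and the global mean
score = (v / (v + m_val)) * R + (m_val / (v + m_val)) * C_val
print(round(score, 4))
```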


## Collaborative Filtering

There are two types of collaborative filtering:

1. User-based filtering
2. Item-based filtering


**User-based filtering**


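This approach recommends items liked by users whose ratings resemble those of the target user. It is often harder to scale because the number of users grows rapidly, and making recommendations for a new user with few ratings is harder.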

**Item-based filtering**

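This approach recommends items whose rating patterns resemble those of items the user already liked. It is usually preferred because movies don't change much. We can rerun the model once a week, unlike the user-based approach, where the model has to be rerun frequently.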

In this notebook, we will look at the item-based filtering method.

In [33]:
d_movies = ratings.pivot(index="movieId", columns="userId", values="rating")

# Fill missing rating with 0s
d_movies.fillna(0,inplace=True)

d_movies.head()

movieId \ userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0


As before, we reduce the matrix by applying some filters to qualify the movies and users for this dataset.

- To qualify, a movie must have been rated by more than 10 users.
- To qualify, a user must have rated more than 30 movies.

In [34]:
no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')

t_movies = d_movies.loc[no_user_voted[no_user_voted > 10].index, no_movies_voted[no_movies_voted > 30].index]
print(t_movies.shape)
t_movies.values

(2121, 498)


array([[4. , 0. , 0. , ..., 2.5, 3. , 5. ],
       [0. , 0. , 0. , ..., 2. , 0. , 0. ],
       [4. , 0. , 0. , ..., 2. , 0. , 0. ],
       ...,
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ],
       [0. , 0. , 0. , ..., 0. , 0. , 0. ]])

In [35]:
t_movies

movieId \ userId,1,3,4,5,6,7,8,9,10,11,...,601,602,603,604,605,606,607,608,609,610
1,4.0,0.0,0.0,4.0,0.0,4.5,0.0,0.0,0.0,0.0,...,4.0,0.0,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,0.0,...,0.0,4.0,0.0,5.0,3.5,0.0,0.0,2.0,0.0,0.0
3,4.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0
5,0.0,0.0,0.0,0.0,5.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0
6,4.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,5.0,...,0.0,3.0,4.0,3.0,0.0,0.0,0.0,0.0,0.0,5.0
7,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,2.5,0.0,0.0,0.0,0.0
9,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.0,0.0,0.0,3.0,0.0,2.0,0.0,0.0,3.0,...,0.0,3.0,0.0,0.0,0.0,0.0,0.0,4.0,4.0,0.0
11,0.0,0.0,0.0,0.0,4.0,0.0,4.0,0.0,0.0,0.0,...,0.0,3.0,0.0,0.0,0.0,2.5,3.0,0.0,0.0,0.0
12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Let us compute the sparsity of this matrix.

In [36]:
sparsity = 1.0 - ( np.count_nonzero(t_movies.values) / float(t_movies.size) )
sparsity

0.9269430385379329

So our matrix is about 93% sparse. This is a common scenario for recommendation systems, since each user rates only a small fraction of the items. To work more efficiently with sparse matrices, we shall use the `csr_matrix` class from `scipy.sparse`.
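
To illustrate what the compressed sparse row (CSR) format stores, here is a small sketch on a hypothetical 3 x 4 rating matrix. The names `dense`, `sparse`, and `sparsity_demo` are illustrative only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings: 0 means "not rated"
dense = np.array([
    [4.0, 0.0, 0.0, 5.0],
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
])

sparse = csr_matrix(dense)

# Fraction of zero entries, computed the same way as above
sparsity_demo = 1.0 - np.count_nonzero(dense) / float(dense.size)

print(sparse.nnz)       # number of stored (nonzero) entries
print(sparse.data)      # the stored values, row by row
print(sparse.indices)   # column index of each stored value
print(sparsity_demo)
```

Only the three nonzero ratings are stored, together with their column positions, which is why the format pays off when most entries are zero.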

In [37]:
%%time

from scipy.sparse import csr_matrix

t_csr = csr_matrix(t_movies.values)
t_movies.reset_index(inplace=True)

Wall time: 57.6 ms


To compute movie recommendations, we find the nearest neighbors of a movie, using *cosine similarity* between the movies' rating vectors. For this, we use the `NearestNeighbors` class from scikit-learn.
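
The idea behind a brute-force nearest-neighbor search with the cosine metric can be sketched in plain NumPy. This is not the scikit-learn implementation, only an illustration on a hypothetical 4 x 4 matrix; the helper `cosine_distances_from` is an assumed name. Cosine distance is taken as 1 minus cosine similarity.

```python
import numpy as np

# Hypothetical movie-by-user rating matrix (rows are movies)
X = np.array([
    [5.0, 4.0, 0.0, 1.0],   # movie 0
    [4.0, 5.0, 0.0, 1.0],   # movie 1
    [0.0, 0.0, 5.0, 4.0],   # movie 2
    [1.0, 0.0, 4.0, 5.0],   # movie 3
])

def cosine_distances_from(X, i):
    """Cosine distance (1 - cosine similarity) from row i to every row of X."""
    norms = np.linalg.norm(X, axis=1)
    return 1.0 - (X @ X[i]) / (norms * norms[i])

dist = cosine_distances_from(X, 0)
order = np.argsort(dist)      # nearest first; position 0 is the movie itself
neighbors = order[1:3]        # two nearest neighbors of movie 0
print(neighbors, np.round(dist[neighbors], 3))
```

Movie 1 is rated almost identically to movie 0 by the same users, so it comes out as the closest neighbor.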

In [38]:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=5, n_jobs=1)

knn.fit(t_csr)

NearestNeighbors(algorithm='brute', leaf_size=30, metric='cosine',
         metric_params=None, n_jobs=1, n_neighbors=5, p=2, radius=1.0)

In [39]:
def get_movie_recommendation(movie_name, n_recommendations=10):
    # Find titles containing the query string (literal match, not a regex)
    movie_list = movies[movies['title'].str.contains(movie_name, regex=False)]
    if len(movie_list) == 0:
        print('No movie found with this name {}'.format(movie_name))
        return None

    # Use the first match and locate its row in the filtered rating matrix
    movie_id = movie_list.iloc[0]['movieId']
    matches = t_movies[t_movies['movieId'] == movie_id].index
    if len(matches) == 0:
        print('Not enough ratings to recommend for {}'.format(movie_name))
        return None
    movie_idx = matches[0]

    # Ask for one extra neighbor, because the nearest one is the movie itself
    distances, indices = knn.kneighbors(t_csr[movie_idx], n_neighbors=n_recommendations + 1)

    # Sort by distance, drop the query movie, and reverse the order,
    # so the closest match appears last in the result
    rec_movie_indices = sorted(zip(indices.squeeze().tolist(), distances.squeeze().tolist()),
                               key=lambda x: x[1])[:0:-1]

    recommend_frame = []
    for row_idx, distance in rec_movie_indices:
        rec_id = t_movies.iloc[row_idx]['movieId']
        title = movies.loc[movies['movieId'] == rec_id, 'title'].values[0]
        recommend_frame.append({'Title': title, 'Distance': distance})

    return pd.DataFrame(recommend_frame, index=range(1, n_recommendations + 1))

In [40]:
get_movie_recommendation('Star Wars')

Unnamed: 0,Distance,Title
1,0.386201,"Godfather, The (1972)"
2,0.381139,"Terminator, The (1984)"
3,0.380529,Terminator 2: Judgment Day (1991)
4,0.357655,Star Wars: Episode I - The Phantom Menace (1999)
5,0.347852,Back to the Future (1985)
6,0.346389,Indiana Jones and the Last Crusade (1989)
7,0.3215,"Matrix, The (1999)"
8,0.276924,Raiders of the Lost Ark (Indiana Jones and the...
9,0.185999,Star Wars: Episode VI - Return of the Jedi (1983)
10,0.148017,Star Wars: Episode V - The Empire Strikes Back...
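
Note that the table above lists the closest matches last, since a smaller cosine distance means a more similar movie. If a closest-first ordering is preferred, the result can be sorted by `Distance`. The sketch below uses a few rows copied from the output above; `recs` and `closest_first` are illustrative names.

```python
import pandas as pd

# A few rows copied from the output above
recs = pd.DataFrame({
    "Distance": [0.386201, 0.321500, 0.148017],
    "Title": ["Godfather, The (1972)",
              "Matrix, The (1999)",
              "Star Wars: Episode V - The Empire Strikes Back..."],
})

# Smaller cosine distance means more similar, so sort ascending
closest_first = recs.sort_values("Distance").reset_index(drop=True)
print(closest_first)
```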
