# Introduction

Recommender systems are among the most popular applications of data science today. They are used to predict the *rating* or *preference* that a user would give to an item. Amazon uses them to suggest products to customers. YouTube uses them at a large scale to suggest videos based on your history and to decide which video to play next on autoplay.

There are also popular recommender systems for domains like restaurants and movies. Recommender systems have also been developed to explore research articles, experts, collaborators and financial services.

Recommender systems can be classified primarily into 3 types:

- <u>Simple recommenders</u>: Offer generalized recommendations to every user, based on movie popularity and/or genre. The basic idea is that movies that are more popular and critically acclaimed have a higher probability of being liked by the average audience. An example is the IMDB Top 250.

- <u>Content-based recommenders</u>: These recommenders suggest items that are similar to a particular item. They use item metadata (for movies: genre, director, actors, etc.) to make these recommendations. The general idea is that if a person likes a particular item, they will also like items that are similar to it. To find such items, the system uses the metadata of items the user has liked in the past. For example, YouTube suggests new videos based on your watch history.

- <u>Collaborative filtering</u>: These systems are widely used. They try to predict the rating or preference that a user would give an item based on the past ratings and preferences of other users. Unlike content-based recommenders, collaborative filtering does not require item metadata.

## Dataset

The dataset files contain metadata for 9742 movies listed in the [`MovieLens Dataset`](https://grouplens.org/datasets/movielens/). The dataset consists of movies released on or before September 2018. The files used in this notebook provide movie titles and genres, user ratings, user-applied tags, and the corresponding IMDB and TMDB IDs.

This dataset consists of the following files:

* *movies.csv*: Each line of this file after the header row represents one movie, and has the following format:

*****
    movieId,title,genres
*****

Genres are a pipe-separated list. Some common genres are: Action, Adventure, Animation, Comedy, Crime, etc.

* *links.csv*: This file contains the TMDB and IMDB IDs of all the movies featured in the `MovieLens Dataset`.

* *ratings.csv*: This file contains 100836 ratings across 9742 movies from 610 users. Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).
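
Because `genres` is stored as a single pipe-separated string, it usually has to be split before it can serve as a feature. The sketch below shows one possible way to do that on a few made-up rows in the `movies.csv` layout. The titles and the names `toy_movies`, `genre_lists` and `genre_flags` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Toy rows in the movies.csv layout (made-up values, not read from the file)
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A (1995)", "Movie B (1995)", "Movie C (1995)"],
    "genres": ["Adventure|Animation|Comedy", "Adventure|Fantasy", "Comedy"],
})

# One list of genre names per movie
genre_lists = toy_movies["genres"].str.split("|")

# One 0/1 indicator column per genre, columns in alphabetical order
genre_flags = toy_movies["genres"].str.get_dummies(sep="|")

print(genre_lists.tolist())
print(genre_flags)
```

The indicator form is convenient when genres are used for filtering or similarity, while the list form is easier to read.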

In [None]:
# Import libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style="darkgrid", palette="icefire")

import warnings
warnings.filterwarnings("ignore")

In [None]:
import os
# Raw string so that the backslashes in the Windows path are not read as escape sequences
os.chdir(r'D:\Teaching\Python-Tutorial\data\ml-latest-small')
os.getcwd()

In [None]:
# Load data
movies = pd.read_csv('movies.csv')
ratings = pd.read_csv('ratings.csv')

movies.head(3)

In [None]:
ratings.head(3)

The `ratings` DataFrame contains the IDs of the movies but not their titles. We'll need movie names for the movies we're recommending. We can merge the above two DataFrames, based on the column `movieId`.
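
To make the merge step concrete, here is a minimal sketch on toy data. The frames `toy_movies` and `toy_ratings` and their values are made up for illustration. It assumes the default inner join of `pd.merge`, which means a movie without any ratings does not appear in the merged result.

```python
import pandas as pd

# Made-up stand-ins for the movies and ratings DataFrames
toy_movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Movie A", "Movie B", "Movie C"],
})
toy_ratings = pd.DataFrame({
    "userId": [10, 11, 10],
    "movieId": [1, 1, 2],
    "rating": [4.0, 5.0, 3.5],
})

# Default how="inner": Movie C has no ratings, so it drops out
toy_merged = pd.merge(toy_movies, toy_ratings, on="movieId")
print(toy_merged)

# Mean rating and number of ratings per title
toy_stats = toy_merged.groupby("title")["rating"].agg(["mean", "count"])
print(toy_stats)
```

Each rating row now carries its movie title, which is what makes grouping by title possible.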

In [None]:
metadata = pd.merge(movies, ratings, on="movieId")
metadata.head(3)

Let's compute the average rating of each movie, along with the number of ratings it received. To do so, we can group the merged dataset by movie title and then calculate the mean and the count of the ratings for each movie.

In [None]:
vote_average = pd.DataFrame(metadata.groupby('title')['rating'].mean())
vote_count = pd.DataFrame(metadata.groupby('title')['rating'].count())

vote_average.head()

In [None]:
# Add ratings and count
d_movies = pd.merge(movies, vote_average, on='title', how='left')
d_movies = pd.merge(d_movies, vote_count, on='title', how='left')

# Rename columns to vote_average and vote_count
d_movies = d_movies.rename(columns={'rating_x' : 'vote_average', 'rating_y': 'vote_count'})

d_movies.head()

Let us join the user-applied tags as well. Each tag is typically a single word or short phrase. The meaning, value and purpose of a particular tag are determined by the user. Note that some movies in our DataFrame have no tags.

In [None]:
tags = pd.read_csv('tags.csv')

# Combine all tags of a movie into one pipe-separated string
tags_df = pd.DataFrame(tags.groupby('movieId')['tag'].apply(lambda x: '|'.join(x)))

d_movies = pd.merge(d_movies, tags_df, on='movieId', how='left')
d_movies.head()

Finally, let us also join the TMDB and IMDB IDs so as to generate the complete matrix.

In [None]:
links = pd.read_csv('links.csv')

d_movies = pd.merge(d_movies, links, on='movieId', how='left')
d_movies.head()

## Simple Recommender System

To rank movies *fairly* by popularity, we should calculate a weighted rating score for each movie. This score takes into account both the average rating and the number of votes a movie has accumulated. Such a score ensures that a movie with a 4.5 rating from 100k voters gets a higher score than a movie with the same rating from only 100 voters.

Mathematically, the weighted rating score is formulated as:

$$
\mathbf{S} = \left( \frac{v}{v + m} \cdot \mathbf{R} \right) + \left( \frac{m}{v + m} \cdot \mathbf{C} \right) 
$$

where,
* $v$: number of votes for a movie (column: `vote_count`),
* $m$: minimum number of votes required to be listed in the chart,
* $\mathbf{R}$: average rating of the movie (column: `vote_average`),
* $\mathbf{C}$: mean vote across all movies.

The threshold $m$ filters out movies whose number of votes falls below it. For our case, let us set $m$ to the $90^{th}$ percentile of `vote_count`. In other words, for a movie to be featured in the chart, it must have at least as many votes as 90% of the movies in the list.
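
A small worked example may help show how the formula behaves. The numbers below are illustrative assumptions, not values from the dataset, and `weighted_rating` is a hypothetical helper name. Two movies share the same average rating but differ in vote count.

```python
def weighted_rating(v, R, m, C):
    """Weighted rating S = v/(v+m) * R + m/(v+m) * C."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# Illustrative numbers only (not taken from the dataset)
C_toy = 3.0   # mean rating across all movies
m_toy = 27    # minimum number of votes required

many_votes = weighted_rating(v=1000, R=4.5, m=m_toy, C=C_toy)
few_votes = weighted_rating(v=30, R=4.5, m=m_toy, C=C_toy)

print(round(many_votes, 3))  # stays close to R
print(round(few_votes, 3))   # pulled toward C
```

With many votes the weight $\frac{v}{v+m}$ is close to 1, so the score stays near $\mathbf{R}$. With few votes the score is pulled toward the global mean $\mathbf{C}$.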

In [None]:
# Calculate mean of vote_average column, C
C = d_movies['vote_average'].mean()
C

In [None]:
# Min number of votes required to be in the chart, m
m = d_movies['vote_count'].quantile(0.90)
m

Refine the `d_movies` DataFrame based on these metrics.

In [None]:
t_movies = d_movies.copy().loc[d_movies['vote_count'] >= m]

print(d_movies.shape)
print(t_movies.shape)

In [None]:
978./9742

From the above output, around 10% of the movies have a vote count of at least 27 and therefore qualify to be on this list.

Next, let us calculate the weighted rating for each qualified movie.

In [None]:
def weighted_score(x, m=m, C=C):
    """Return the weighted rating score for one movie (one DataFrame row)."""
    v = x['vote_count']      # number of votes for the movie
    R = x['vote_average']    # average rating of the movie

    return (v/(v+m) * R) + (m/(v+m) * C)

In [None]:
t_movies['score'] = t_movies.apply(weighted_score, axis=1)

t_movies.sort_values('score',ascending=False).head(n=10)
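
As an alternative to a row-wise `apply`, the same score can be computed on whole columns at once, which tends to be faster on large frames. This is a minimal sketch on made-up data. The frame `toy` and the values `m_toy` and `C_toy` are hypothetical and only mirror the column names used above.

```python
import pandas as pd

# Made-up movies with the same column names as the notebook
toy = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "vote_count": [200, 40, 30],
    "vote_average": [4.0, 4.8, 2.5],
})
m_toy, C_toy = 30, 3.5  # hypothetical vote threshold and global mean rating

votes = toy["vote_count"]
avg = toy["vote_average"]

# Column arithmetic applies the formula to every row in one step
toy["score"] = (votes / (votes + m_toy)) * avg + (m_toy / (votes + m_toy)) * C_toy

print(toy.sort_values("score", ascending=False))
```

Both approaches implement the same formula, so the choice is mainly about readability versus speed.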

## Collaborative Filtering

There are two types of collaborative filtering:

1. User-based filtering
2. Item-based filtering


**User-based filtering**

User-based filtering recommends items that users with similar rating histories have liked. This approach is often harder to scale because the number of users grows rapidly, and making recommendations for a new user is harder.

**Item-based filtering**

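Item-based filtering recommends items that are similar to a given item, where similarity is computed from the ratings the items have received. This approach is usually preferred since the set of movies doesn't change much. We can rerun this model once a week, unlike the user-based approach, where we have to rerun the model frequently.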

In this notebook, we will look at the item-based filtering method.

In [None]:
d_movies = ratings.pivot(index="movieId", columns="userId", values="rating")

# Fill missing rating with 0s
d_movies.fillna(0,inplace=True)

d_movies.head()

As before, we must trim the matrix by applying some filters that qualify movies and users for this dataset.

- To qualify, a movie must have been rated by more than 10 users.
- To qualify, a user must have rated more than 50 movies.
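
The pivot and the two filters can be sketched on a tiny ratings table. Everything here is made up for illustration, including the names prefixed with `toy_` and the thresholds `min_users` and `min_movies`, which are set to 1 so that the effect is visible on six ratings.

```python
import pandas as pd

# Made-up ratings in the ratings.csv layout
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 20, 10],
    "rating":  [4.0, 3.0, 5.0, 2.0, 4.5, 3.5],
})

# Rows are movies, columns are users; unrated cells become 0
toy_matrix = toy_ratings.pivot(index="movieId", columns="userId", values="rating").fillna(0)

# Ratings received per movie and ratings given per user
ratings_per_movie = toy_ratings.groupby("movieId")["rating"].count()
ratings_per_user = toy_ratings.groupby("userId")["rating"].count()

# Hypothetical toy thresholds, chosen only for this example
min_users, min_movies = 1, 1
keep_movies = ratings_per_movie[ratings_per_movie > min_users].index
keep_users = ratings_per_user[ratings_per_user > min_movies].index

# Keep only the qualifying rows (movies) and columns (users)
toy_filtered = toy_matrix.loc[keep_movies, keep_users]
print(toy_filtered)
```

Movie 30 and user 3 each have a single rating, so both are dropped and a 2 x 2 matrix remains.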

In [None]:
no_user_voted = ratings.groupby('movieId')['rating'].agg('count')
no_movies_voted = ratings.groupby('userId')['rating'].agg('count')

t_movies = d_movies.loc[no_user_voted[no_user_voted > 10].index, no_movies_voted[no_movies_voted > 50].index]
print(t_movies.shape)
t_movies.values

In [None]:
d_movies

Let us compute the sparsity of this matrix.

In [None]:
sparsity = 1.0 - ( np.count_nonzero(t_movies.values) / float(t_movies.size) )
sparsity

So our matrix is 90% sparse. This is a common scenario for recommendation systems, where each user rates only a small fraction of the available items. To work more efficiently with sparse matrices, we shall use the `csr_matrix` class from `scipy.sparse`.
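
To illustrate what the compressed sparse row (CSR) format stores, here is a minimal sketch on a small made-up matrix. The array `dense` and the names `toy_sparsity` and `toy_csr` are assumptions for this example only.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up 3 x 4 rating matrix, 0 means "not rated"
dense = np.array([
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 3.5, 0.0],
    [0.0, 5.0, 0.0, 0.0],
])

# Fraction of entries that are zero
toy_sparsity = 1.0 - np.count_nonzero(dense) / dense.size

toy_csr = csr_matrix(dense)

print(toy_sparsity)     # fraction of zero entries
print(toy_csr.nnz)      # number of stored (nonzero) values
print(toy_csr.data)     # the stored values, row by row
print(toy_csr.indices)  # column index of each stored value
```

Only the nonzero values and their positions are kept, which is why the format saves memory when most entries are zero.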

In [None]:
%%time

from scipy.sparse import csr_matrix

t_csr = csr_matrix(t_movies.values)
t_movies.reset_index(inplace=True)

To compute movie recommendations, we find the nearest neighbors of a movie under *cosine distance* (one minus the *cosine similarity*) between rating vectors. For this, we use the `NearestNeighbors` class from scikit-learn.
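
The following sketch checks, on a tiny made-up matrix, that the distances returned by `NearestNeighbors` with `metric='cosine'` agree with cosine distance computed by hand. The matrix `X`, the helper `cosine_distance` and the name `toy_knn` are hypothetical and used only for this illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows are movies, columns are users (made-up ratings, 0 = not rated)
X = np.array([
    [5.0, 4.0, 0.0, 0.0],
    [4.0, 5.0, 0.0, 1.0],
    [0.0, 0.0, 5.0, 4.0],
])

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

toy_knn = NearestNeighbors(metric="cosine", algorithm="brute", n_neighbors=2)
toy_knn.fit(X)

# Query with movie 0; its nearest neighbor is itself, followed by movie 1
distances, indices = toy_knn.kneighbors(X[[0]], n_neighbors=2)

print(indices)
print(distances)
print(cosine_distance(X[0], X[1]))
```

Movies 0 and 1 were rated similarly by the same users, so their distance is small, while movie 2 shares no raters with movie 0 and sits at distance 1.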

In [None]:
from sklearn.neighbors import NearestNeighbors

knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20, n_jobs=-1)

knn.fit(t_csr)

In [None]:
def get_movie_recommendation(movie_name, n_recommendations=1):
    # Find movies whose title contains the query string
    movie_list = movies[movies['title'].str.contains(movie_name)]
    if len(movie_list) == 0:
        print('No movie found with this name {}'.format(movie_name))
        return None

    # Map the first match to its row position in the filtered matrix
    movie_id = movie_list.iloc[0]['movieId']
    matches = t_movies[t_movies['movieId'] == movie_id].index
    if len(matches) == 0:
        # The movie was removed by the rating-count filters
        print('Not enough ratings to recommend for {}'.format(movie_list.iloc[0]['title']))
        return None
    movie_idx = matches[0]

    # Ask for one extra neighbor because the nearest neighbor is the movie itself
    distances, indices = knn.kneighbors(t_csr[movie_idx], n_neighbors=n_recommendations+1)
    # Sort by distance, drop the query movie (first entry) and reverse the order
    rec_movie_indices = sorted(zip(indices.squeeze().tolist(), distances.squeeze().tolist()),
                               key=lambda x: x[1])[:0:-1]

    recommend_frame = []
    for row_idx, distance in rec_movie_indices:
        rec_id = t_movies.iloc[row_idx]['movieId']
        title = movies.loc[movies['movieId'] == rec_id, 'title'].iloc[0]
        recommend_frame.append({'Title': title, 'Distance': distance})

    return pd.DataFrame(recommend_frame, index=range(1, n_recommendations+1))

In [None]:
get_movie_recommendation('Godfather')