# AI Clinique #15 : Recommender Systems

- __Date__: 09-12-2021
- __Presenter__: Nicolas Clavel
- __Datasets__: For this hands-on, we will be using the following open source datasets
    - Movie Lens Dataset accessible here: https://grouplens.org/datasets/movielens/latest/
    - The movie database: https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv
- __Packages__: pip install -r requirements.txt
- __Documentation__:
    - Interesting Github: https://github.com/rposhala/Recommender-System-on-MovieLens-dataset#content-based-recommender-system
    - Scikit-surprise: http://surprise.readthedocs.io/en/stable/getting_started.html
    - Matrix Factorization from scratch: https://towardsdatascience.com/recommendation-system-matrix-factorization-d61978660b4b
    - Content-based filtering Kaggle: https://www.kaggle.com/ibtesama/getting-started-with-a-movie-recommendation-system/notebook
- __Citation__:  
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1-19:19. https://doi.org/10.1145/2827872

## Recommender Systems
The objective of a Recommender System is to __recommend relevant items to users__, based on their preferences and consumption history.  
We see recommendation systems all around us. These systems personalize our web experience, telling us what to buy (Amazon), which movies to watch (Netflix), whom to be friends with (Facebook), which songs to listen to (Spotify), etc.  
Recommender systems typically produce a list of recommendations, and there are a few ways in which this can be done.  
Two of the most popular ways are __collaborative filtering__ and __content-based filtering__.

### Table of contents
- __1. Presentation of the Movie Lens dataset__
- __2. Collaborative filtering__
- __3. Content based filtering__
- __4. Simple recommender system__

#### Imports

In [None]:
import numpy as np
import pandas as pd
from surprise import SVD, NMF, KNNBasic, Reader, Dataset, accuracy
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split
import matplotlib
import seaborn as sns
from sklearn.neighbors import NearestNeighbors
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from IPython.display import Image
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import linear_kernel

## 1. Presentation of the Movie Lens dataset

#### Movies file

In [None]:
# Load movies
movies = pd.read_csv('../input_data/ml-latest-small/movies.csv', low_memory=False)

# Figures
print(f'Nb of rows in the movies file: {len(movies)}')
print(f'Columns of the movies file: {movies.columns.values}')

# Print the first three rows
movies.head(3)

#### Ratings

In [None]:
# Load ratings
ratings = pd.read_csv('../input_data/ml-latest-small/ratings.csv', low_memory=False)

# Figures
print(f'Nb of rows in the ratings file: {len(ratings)}')
print(f'Columns of the ratings file: {ratings.columns.values}')
print(f'Min ratings: {min(ratings["rating"])} Max ratings: {max(ratings["rating"])}')
print(f'Nb of movies: {len(ratings["movieId"].unique())}')
print(f'Nb of users: {len(ratings["userId"].unique())}')

# Print the first three rows
ratings.head(3)

#### Check for NaN values

In [None]:
print(f'Nb of NaN values in userId: {ratings["userId"].isnull().sum()}')
print(f'Nb of NaN values in movieId: {ratings["movieId"].isnull().sum()}')
print(f'Nb of NaN values in rating: {ratings["rating"].isnull().sum()}')

#### Tags

In [None]:
# Load tags
tags = pd.read_csv('../input_data/ml-latest-small/tags.csv', low_memory=False)

# Figures
print(f'Nb of rows in the tags file: {len(tags)}')
print(f'Columns of the tags file: {tags.columns.values}')

# Print the first three rows
tags.head(3)

#### Data visualization

In [None]:
sns.histplot(data=ratings, x="rating", binwidth=0.5)

#### Number of ratings by movies

In [None]:
df = ratings[['movieId','userId']].groupby(['movieId']).agg(['count']).sort_values(('userId','count'),ascending=False)
 
plt.figure(figsize=(10,4))
sns.set_style("darkgrid")
sns.lineplot(data=df[('userId', 'count')].values)
plt.title("Number of ratings by movie")
plt.xlabel("Movies (sorted by number of ratings)")
plt.ylabel("Number of ratings");

#### Number of ratings by user

In [None]:
df = ratings[['movieId','userId']].groupby(['userId']).agg(['count']).sort_values(('movieId','count'),ascending=False)
 
plt.figure(figsize=(10,4))
sns.set_style("darkgrid")
sns.lineplot(data=df[('movieId', 'count')].values)
plt.title("Rating frequency of users")
plt.xlabel("Users (sorted by number of ratings)")
plt.ylabel("Number of ratings");

#### User-Item interaction matrix
For visualization, restricted to the 15 most active users and the 15 most rated movies

In [None]:
top = 15
g = ratings.groupby('userId')['rating'].count()
topg = g.sort_values(ascending = False)[:top]

i = ratings.groupby('movieId')['rating'].count()
topi = i.sort_values(ascending = False)[:top]

# Get the ratings of the top users on the top items
join_top_users = ratings.join(topg, on='userId', how = 'inner', rsuffix='_r')
join_top_movies_and_users = join_top_users.join(topi, on='movieId', how = 'inner', rsuffix = '_r')

pd.crosstab(join_top_movies_and_users.userId, join_top_movies_and_users.movieId,
            join_top_movies_and_users.rating, aggfunc=np.mean)

#### Measure of sparsity (fraction of empty cells)

In [None]:
unique_movies = len(ratings["movieId"].unique())
unique_users = len(ratings["userId"].unique())
total_ratings = unique_users * unique_movies
rating_present = ratings.shape[0]

ratings_not_provided = total_ratings - rating_present 

print("Sparsity of the user-item matrix (fraction of empty cells):")
print(ratings_not_provided / total_ratings)

#### User-item matrix for the top 500 users and top 1000 movies
This restriction limits the sparsity of the matrix (for collaborative filtering)

In [None]:
top_users = 500
g = ratings.groupby('userId')['rating'].count()
topg = g.sort_values(ascending = False)[:top_users]

top_movies = 1000
i = ratings.groupby('movieId')['rating'].count()
topi = i.sort_values(ascending = False)[:top_movies]

# Get the ratings of the top users on the top items
join_top_users = ratings.join(topg, on='userId', how = 'inner', rsuffix='_r')
join_top_movies_and_users = join_top_users.join(topi, on='movieId', how = 'inner', rsuffix = '_r')

user_movie_matrix = pd.crosstab(join_top_movies_and_users.userId, join_top_movies_and_users.movieId,
                                join_top_movies_and_users.rating, aggfunc=np.mean)

In [None]:
user_movie_matrix.iloc[0:5]

In [None]:
print('Nb of users:')
print(len(user_movie_matrix))

print('Nb of movies:')
print(len(user_movie_matrix.columns))

In [None]:
print('Sparsity:')
print(user_movie_matrix.isna().sum().sum() / float(len(user_movie_matrix) * len(user_movie_matrix.columns)))

## 2. Collaborative filtering
__Collaborative filtering__ is based on the assumption that people who agreed in the past will agree in the future, and that they will like similar kinds of items as they liked in the past.  
It uses __similarities between users' behaviours__ to provide recommendations; no domain knowledge or item features are required.  
There are two types of collaborative filtering:
- __Memory based__
- __Model based__  

The key difference is that the memory-based approach __does not learn any parameters__ using gradient descent (or any other optimization algorithm); it computes recommendations directly from the stored ratings.

### 2.1. Matrix Factorization (Model based)
__Matrix Factorization__ refers to methods that decompose the rating matrix into a product of lower-dimensional matrices for collaborative filtering.  
The __user-item interaction matrix__ lists __users and items in rows and columns__, respectively.  
The __rating of user i on movie j__ is located in __cell (i, j)__ (the cell is empty if no rating exists yet).  
Documentation: https://developers.google.com/machine-learning/recommendation/collaborative/matrix  
Matrix factorization from scratch: https://towardsdatascience.com/recommendation-system-matrix-factorization-d61978660b4b  
Scikit-surprise doc: https://surprise.readthedocs.io/en/stable/matrix_factorization.html
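
To make the decomposition concrete, here is a minimal NumPy sketch of matrix factorization trained by stochastic gradient descent on the observed cells only. It is not the Surprise implementation: the toy matrix `R`, the function name `factorize`, and the hyperparameters (`n_factors`, `lr`, `reg`, `n_epochs`) are assumptions chosen for illustration.

```python
import numpy as np

def factorize(R, n_factors=2, lr=0.01, reg=0.02, n_epochs=2000, seed=0):
    """Approximate R (NaN = missing rating) by P @ Q.T, using SGD on observed cells."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))  # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))  # item latent factors
    observed = [(u, i) for u in range(n_users) for i in range(n_items)
                if not np.isnan(R[u, i])]
    for _ in range(n_epochs):
        for u, i in observed:
            err = R[u, i] - P[u] @ Q[i]          # prediction error on this cell
            p_u = P[u].copy()                    # keep the old value for the Q update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q

# Toy user-item matrix (5 users x 4 movies), values are made up
R = np.array([
    [5.0, 3.0, np.nan, 1.0],
    [4.0, np.nan, np.nan, 1.0],
    [1.0, 1.0, np.nan, 5.0],
    [1.0, np.nan, np.nan, 4.0],
    [np.nan, 1.0, 5.0, 4.0],
])

P, Q = factorize(R)
R_hat = P @ Q.T                                  # dense matrix: every cell gets an estimate
mask = ~np.isnan(R)
rmse_train = np.sqrt(np.mean((R[mask] - R_hat[mask]) ** 2))
print(np.round(R_hat, 2))
print(f"RMSE on observed cells: {rmse_train:.3f}")
```

The empty cells of `R` are filled in `R_hat` by the product of the learned user and item factors, which is what makes the factorization usable for recommendation.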

In [None]:
Image(filename='../input_data/matrix_facto_illustration.png')

#### Data preparation

In [None]:
df_ratings = ratings[['movieId', 'userId', 'rating']]

# The Reader class is used to parse a file containing ratings.
reader = Reader(rating_scale=(0.5, 5.0))

# The columns must correspond to userId, itemId and ratings (in that order).
dataset_ratings = Dataset.load_from_df(df_ratings[['userId', 'movieId', 'rating']], reader)

# Split dataset between train and test set
train, test = train_test_split(dataset_ratings, test_size=.20, random_state=2)
# In effect, some cells of the user-item matrix are removed and put in the test set

#### NMF: Non-negative Matrix Factorization
Documentation: https://en.wikipedia.org/wiki/Non-negative_matrix_factorization

In [None]:
# Number of latent factors
n_factors=13

# NMF model
nmf = NMF(n_factors=n_factors)

# Train the algorithm on the train set, and predict ratings for the test set
nmf.fit(train)
preds = nmf.test(test)

# Then compute RMSE
accuracy.rmse(preds)

# To dataframe
df_preds = pd.DataFrame(preds)

In [None]:
df_preds.iloc[:10]

#### Make a prediction on a user and movie

In [None]:
uid = 1  # raw user id (as in the ratings file)
iid = 2  # raw item id (as in the ratings file)

# get a prediction for specific users and items.
pred = nmf.predict(uid, iid, verbose=True)  # we can also pass the real value if it is filled

In [None]:
pred = nmf.predict(uid, iid, r_ui=0.5, verbose=True) 

#### First conclusion:
- The error (RMSE) looks reasonable for ratings on a 0.5 to 5 scale
- But how do we choose the number of factors? => Using a grid search with cross-validation
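
As a reminder of what the RMSE measures, the sketch below computes it by hand: the square root of the mean squared difference between true and estimated ratings. The two arrays are made-up values for illustration, not outputs of the model above.

```python
import numpy as np

# Hypothetical true ratings and model estimates, for illustration only
true_ratings = np.array([4.0, 3.5, 5.0, 2.0])
estimates = np.array([3.5, 3.0, 4.0, 3.0])

# Root of the mean squared error
rmse = np.sqrt(np.mean((true_ratings - estimates) ** 2))
print(f"RMSE: {rmse:.4f}")
```

In this notebook, the same computation on the `r_ui` and `est` columns of `df_preds` should match the value printed by `accuracy.rmse(preds)`.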

#### Hyperparameter tuning

In [None]:
# Grid search over the number of latent factors, with 3-fold cross-validation
param_grid = {'n_factors': [12, 13, 14]}
gs_nmf = GridSearchCV(NMF, param_grid, measures=['rmse'], cv=3)

gs_nmf.fit(dataset_ratings)

# best RMSE score
print(gs_nmf.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs_nmf.best_params['rmse'])

#### SVD: Singular Value Decomposition

In [None]:
# SVD model
svd = SVD()

# Train the algorithm on the train set, and predict ratings for the test set
svd.fit(train)
preds_svd = svd.test(test)

# Then compute RMSE
accuracy.rmse(preds_svd)

# To dataframe
df_preds_svd = pd.DataFrame(preds_svd)

### 2.2. k Nearest Neighbour
__K-nearest neighbors__ finds the k most similar items to a particular instance based on a given distance metric.  
It can be used for classification (voting of the k-nearest neighbors) or regression (average values of the k-nearest neighbors).  
In this model, __cosine similarity__ is used as the metric.
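
A small worked example may help to see what the cosine metric captures. The sketch below compares three hypothetical users (the names and ratings are made up), with 0 standing for "not rated" as in the zero-filled matrix used in this section.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical ratings of three users on four movies (0 = not rated)
alice = np.array([5.0, 4.0, 0.0, 1.0])
bob = np.array([4.0, 5.0, 0.0, 0.0])
carol = np.array([0.0, 1.0, 5.0, 4.0])

sim_ab = cosine_similarity(alice, bob)
sim_ac = cosine_similarity(alice, carol)
print(f"alice-bob: {sim_ab:.3f}, alice-carol: {sim_ac:.3f}")

# scikit-learn's 'cosine' metric is a distance, i.e. 1 - similarity
print(f"cosine distance alice-bob: {1 - sim_ab:.3f}")
```

Users who rate the same movies highly point in similar directions and get a similarity close to 1, so their cosine distance is close to 0 and they come out as nearest neighbors.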

In [None]:
Image(filename='../input_data/knn.png')

In [None]:
n_neighbors = 20
metric = 'cosine'
model_knn = NearestNeighbors(metric=metric, n_neighbors=n_neighbors, n_jobs=-1)
index_user_to_predict_k_movies = 0 # we are going to predict for the first user
top_k_movies = 5

In [None]:
# Fill nan values in empty scores (it does not impact cosine)
user_movie_matrix_filled = user_movie_matrix.fillna(0)

# train knn
model_knn.fit(user_movie_matrix_filled)

# Get similar users distances and indexes
user_to_predict_k_movies = user_movie_matrix_filled.iloc[index_user_to_predict_k_movies,:].values.reshape(1,-1)
distances, indices_similar_users = model_knn.kneighbors(user_to_predict_k_movies)
distances = distances.flatten()
indices_similar_users = indices_similar_users.flatten()

In [None]:
# Select the similar users (NaN = no score, ignored when averaging)
# kneighbors returns row positions, so select with iloc rather than matching on userId labels
similar_users = user_movie_matrix.iloc[indices_similar_users]

In [None]:
movies_scores_similar_users = np.nanmean(similar_users, axis=0) # mean score per movie, ignoring NaN
movies_scores_similar_users = np.nan_to_num(movies_scores_similar_users) # movies with only NaN get a score of 0

In [None]:
movies_scores_similar_users.shape

In [None]:
# Column positions of the top-k scores, best first.
# argsort keeps the original positions, unlike repeated argmax + np.delete, which shifts them.
top_positions = np.argsort(movies_scores_similar_users)[::-1][:top_k_movies]

# Map the column positions back to movieIds
top_movies = user_movie_matrix.columns[top_positions].tolist()

In [None]:
top_movies
# We still need to remove the movies that the first user has already rated; this filter can be applied when selecting the top k.
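
One possible way to drop the movies already rated by the target user is sketched below. The arrays `scores` and `user_ratings` are made-up stand-ins: in the notebook they would correspond to the mean scores of the similar users and to the target user's row of `user_movie_matrix`.

```python
import numpy as np

# Hypothetical mean scores from similar users, one entry per movie column
scores = np.array([4.5, 3.0, 5.0, 4.0, 2.5])
# Hypothetical ratings of the target user (NaN = not rated yet)
user_ratings = np.array([5.0, np.nan, np.nan, 4.0, np.nan])

top_k = 2
# Already rated movies get a score of -inf so they can never be selected
candidate_scores = np.where(np.isnan(user_ratings), scores, -np.inf)

# Positions of the k best remaining movies, best first
top_positions = np.argsort(candidate_scores)[::-1][:top_k]
print(top_positions)
```

The positions returned are column positions, so they still have to be mapped back to movieIds through the matrix columns.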

#### Pros
- __No domain knowledge necessary__: It does not need any information about the movies (genres, author...), nor any "understanding" of the movie itself 
- __Serendipity__: The user can __discover new interests__ (because the approach is not feature-based)

#### Cons
- __Cold start__: For a new user or item, there isn't enough data to make accurate recommendations. 
- __Scalability__: There are millions of users and products in many of the environments in which these systems make recommendations. Thus, a large amount of computation power is often necessary to calculate recommendations.
- __Sparsity__: The number of items sold on major e-commerce sites is extremely large. The most active users will only have rated a small subset of the overall database. Thus, even the most popular items have very few ratings.

## 3. Content-based Filtering
__Content-Based Filtering__ produces item recommendations based on the characteristics of items and/or users.  
In these types of systems, the __descriptive attributes of items/users are used__ to make recommendations. The term “content” refers to these descriptions.

In [None]:
movies.iloc[20:40]

#### TF-IDF: Term Frequency-Inverse Document Frequency

We need to __convert the text of each cell__ into a __numerical representation__. We will use __Term Frequency-Inverse Document Frequency (TF-IDF)__ vectors, first on the genres and later on each overview.

__Term Frequency__ is the __relative frequency of a word in a document__ (so here, a cell) and is given as (term instances / total instances). __Inverse Document Frequency__ measures how rare the term is across documents, given as log(number of documents / number of documents containing the term). The overall importance of each word to the documents in which it appears is equal to TF * IDF.  
  
For the term Drama (for instance):
- Cell with Crime|Drama => TF = 1/2 = 0.5
- Log ratio of cells containing Drama => IDF = log(1000 total cells / 200 cells with Drama) = log(5)

TF x IDF = 0.5 x log(5) ≈ 0.35 (with a base-10 logarithm)
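
The Drama example can be checked numerically. The sketch below uses the textbook definition with the figures quoted in the example (1000 cells, 200 of them containing Drama) and shows the result for both a base-10 and a natural logarithm, since the base changes the value.

```python
import math

# Figures taken from the Drama example
tf = 1 / 2                   # "Drama" is one of the two terms in "Crime|Drama"
n_cells = 1000               # total number of cells (documents)
n_cells_with_term = 200      # cells containing "Drama"

idf_log10 = math.log10(n_cells / n_cells_with_term)
idf_ln = math.log(n_cells / n_cells_with_term)

print(f"TF x IDF (base-10 log): {tf * idf_log10:.3f}")
print(f"TF x IDF (natural log): {tf * idf_ln:.3f}")
```

Note that scikit-learn's `TfidfVectorizer` does not use this exact formula by default: it applies a smoothed natural-log IDF and L2-normalizes each row, so the values stored in `tfidf_matrix` will differ from this hand computation.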

In [None]:
tfidf = TfidfVectorizer(stop_words='english')
movies['genres'] = movies['genres'].apply(lambda x: x.replace('|', ' ').replace('-', ''))

# tfidf matrix
tfidf_matrix = tfidf.fit_transform(movies['genres'])

In [None]:
tfidf_matrix.shape

(9742, 21) means that 21 different words are used to describe the 9742 movies.

In [None]:
tfidf.get_feature_names_out()

In [None]:
len(tfidf.get_feature_names_out())

In [None]:
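# Compute cosine similarity (TF-IDF rows are L2-normalized by default, so the dot product is the cosine)
cosin_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

# Map each movie title to its row index, keeping the first row when a title appears more than once
index_of_movies = pd.Series(movies.index, index=movies['title'])
index_of_movies = index_of_movies[~index_of_movies.index.duplicated(keep='first')]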
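
The reason a plain dot product (`linear_kernel`) can stand in for the cosine similarity is that the TF-IDF rows have unit length. The NumPy sketch below checks this on three made-up movies described by three genre terms; the matrix `X` is a hypothetical stand-in for `tfidf_matrix`.

```python
import numpy as np

# Hypothetical TF-IDF-like rows for three movies over three genre terms
X = np.array([
    [1.0, 1.0, 0.0],   # movie A: crime, drama
    [1.0, 1.0, 0.0],   # movie B: same genres as A
    [0.0, 1.0, 1.0],   # movie C: drama, romance
])

# L2-normalize each row, as TfidfVectorizer does by default (norm='l2')
X_norm = X / np.linalg.norm(X, axis=1, keepdims=True)

# On unit-length rows, the dot product is the cosine similarity
dot_products = X_norm @ X_norm.T

# Explicit cosine similarity on the raw rows, for comparison
norms = np.linalg.norm(X, axis=1)
cosines = (X @ X.T) / np.outer(norms, norms)

print(np.round(dot_products, 3))
```

Movies A and B share exactly the same genres, so their similarity is 1: with genres alone, such movies cannot be ranked against each other.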

In [None]:
def get_recommendations(title, cosin_sim=cosin_sim, top_k=10):
    # Row index of the requested movie
    idx_of_title = index_of_movies[title]

    # (movie index, similarity) pairs, excluding the requested movie itself
    similarity_scores = [(i, s) for i, s in enumerate(cosin_sim[idx_of_title]) if i != idx_of_title]

    # Sort the movie indices by decreasing similarity score
    similarity_scores = sorted(similarity_scores, key=lambda x: x[1], reverse=True)

    # Keep the top k
    movies_idx = [i[0] for i in similarity_scores[:top_k]]

    return movies.iloc[movies_idx]

#### Make recommendation

In [None]:
get_recommendations(title='Dangerous Minds (1995)', cosin_sim=cosin_sim)

This is not very discriminating, as all the movies with the same genres have the same similarity score...  
Let's try with another dataset with more information.

#### Content-based filtering based on movie overview description
https://www.kaggle.com/tmdb/tmdb-movie-metadata?select=tmdb_5000_movies.csv

In [None]:
movies_lmdb = pd.read_csv('../input_data/tmdb_5000_movies.csv')
print('Nb of rows in the TMDB movies file:')
print(len(movies_lmdb))

In [None]:
movies_lmdb.iloc[0:3]

Let's perform content-based filtering on the overview field (a brief description of the movie)

In [None]:
movies_lmdb['overview'].head(5)

In [None]:
movies_lmdb['overview'].iloc[1]

In [None]:
tfidf = TfidfVectorizer(stop_words='english') # Principal Component Analysis PCA 20978 => 10 dimensions + 30 

# Replace NaN with an empty string
movies_lmdb['overview'] = movies_lmdb['overview'].fillna('')

# Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(movies_lmdb['overview'])

# Output the shape of tfidf_matrix
tfidf_matrix.shape

print(f'Nb of movies: {tfidf_matrix.shape[0]}  Nb of text features: {tfidf_matrix.shape[1]}')

In [None]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix) #=> 40 dimensions

In [None]:
cosine_sim.shape

In [None]:
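# Map each movie title to its row index, keeping the first row when a title appears more than once
indices = pd.Series(movies_lmdb.index, index=movies_lmdb['title'])
indices = indices[~indices.index.duplicated(keep='first')]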

In [None]:
# Function that takes a movie title as input and outputs the most similar movies
def get_recommendations(title, cosine_sim=cosine_sim, top_k_movies=10):
    # Get the index of the movie that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all movies with that movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the movies based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the top k similar movies, skipping the first entry (the movie itself)
    sim_scores = sim_scores[1:(top_k_movies+1)]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the top k most similar movies
    return movies_lmdb['title'].iloc[movie_indices]

In [None]:
get_recommendations('The Dark Knight Rises')

In [None]:
get_recommendations('Avatar')