

# Recommendations with surprise

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Use-the-built-in-movielens-100k-dataset" data-toc-modified-id="Use-the-built-in-movielens-100k-dataset-1">Use the built-in movielens-100k dataset</a></span></li><li><span><a href="#Load-the-movielens-100k-dataset-from-disk" data-toc-modified-id="Load-the-movielens-100k-dataset-from-disk-2">Load the movielens-100k dataset from disk</a></span><ul class="toc-item"><li><span><a href="#Instantiate-the-algorithm" data-toc-modified-id="Instantiate-the-algorithm-2.1">Instantiate the algorithm</a></span></li><li><span><a href="#Extract-the-model-parameters" data-toc-modified-id="Extract-the-model-parameters-2.2">Extract the model parameters</a></span></li><li><span><a href="#Evaluate-the-model:" data-toc-modified-id="Evaluate-the-model:-2.3">Evaluate the model:</a></span></li><li><span><a href="#Put-the-predictions-in-a-dataframe" data-toc-modified-id="Put-the-predictions-in-a-dataframe-2.4">Put the predictions in a dataframe</a></span></li><li><span><a href="#Correlations-between-predicted-and-true-ratings" data-toc-modified-id="Correlations-between-predicted-and-true-ratings-2.5">Correlations between predicted and true ratings</a></span></li></ul></li><li><span><a href="#Cross-validation,-train-test-split-and-grid-search" data-toc-modified-id="Cross-validation,-train-test-split-and-grid-search-3">Cross validation, train-test split and grid search</a></span></li><li><span><a href="#Slope-One" data-toc-modified-id="Slope-One-4">Slope One</a></span></li><li><span><a href="#KNN-with-Means" data-toc-modified-id="KNN-with-Means-5">KNN with Means</a></span></li><li><span><a href="#Precision@k-and-Recall@k" data-toc-modified-id="Precision@k-and-Recall@k-6">Precision@k and Recall@k</a></span></li><li><span><a href="#Top-n-predictions" data-toc-modified-id="Top-n-predictions-7">Top-n predictions</a></span><ul class="toc-item"><li><span><a href="#Coverage" 
data-toc-modified-id="Coverage-7.1">Coverage</a></span></li><li><span><a href="#Novelty" data-toc-modified-id="Novelty-7.2">Novelty</a></span></li><li><span><a href="#Evaluate-the-similarity-of-the-top-k-predictions-between-all-pairs-of-users" data-toc-modified-id="Evaluate-the-similarity-of-the-top-k-predictions-between-all-pairs-of-users-7.3">Evaluate the similarity of the top-k predictions between all pairs of users</a></span></li><li><span><a href="#Content-data" data-toc-modified-id="Content-data-7.4">Content data</a></span></li></ul></li></ul></div>

In this lab we use the [surprise package](https://surprise.readthedocs.io/en/stable/index.html), a Python library for building and evaluating recommender systems.

`conda install -c conda-forge scikit-surprise`

First we need some data. We load the built-in MovieLens 100k dataset, a widely used benchmark of movie ratings.
It has to be downloaded the first time it is used.

In [1]:
#from surprise import Dataset
# Load the movielens-100k dataset (download it if needed),
#data = Dataset.load_builtin('ml-100k')

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [3]:
# load surprise
import surprise as sur

## Use the built-in movielens-100k dataset

In [None]:
# Load the movielens-100k dataset (download it if needed),
data = sur.Dataset.load_builtin('ml-100k')

Dataset ml-100k could not be found. Do you want to download it? [Y/n] 

In [None]:
# We'll use the famous SVD algorithm.
algo = sur.SVD()

In [None]:
# Run 5-fold cross-validation and print results
sur.model_selection.cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5)

## Load the movielens-100k dataset from disk

The command above loads the data in an already prepared form. Since other datasets will not come prepared like this, we instead load the raw ratings file from disk and build the dataset ourselves.

In [None]:
df_data = pd.read_csv(
    '~/.surprise_data/ml-100k/ml-100k/u.data', sep='\t', header=None)
df_data.columns = ['user_id', 'item_id', 'rating', 'timestamp']
df_data.head()

In [None]:
df_data.rating.describe()

The `Reader` class tells surprise how to interpret the ratings. When loading from a dataframe, only the rating scale has to be specified; the columns themselves must be passed in the order user, item, rating.

In [None]:
reader = sur.Reader(rating_scale=(1, 5))

In [None]:
# The columns must correspond to user id, item id and ratings (in that order).
data_1 = sur.Dataset.load_from_df(
    df_data[['user_id', 'item_id', 'rating']], reader)

### Instantiate the algorithm

In [None]:
algo = sur.SVD(random_state=1,
               biased=True,  # isolate biases
               reg_all=0.2,  # use regularisation (the same for all)
               n_epochs=20,  # number of epochs for stochastic gradient descent search
               n_factors=100  # number of factors to retain in SVD
               )

# we have to build a training set from the data
trainset_full = data_1.build_full_trainset()
# fit the model
algo.fit(trainset_full)

# we prepare a test set from the training set
trainsetfull_build = trainset_full.build_testset()
# obtain the predictions
predictions_full = algo.test(trainsetfull_build)
# evaluate the predictions
print(sur.accuracy.rmse(predictions_full, verbose=False))

### Extract the model parameters

In [None]:
mu = algo.default_prediction()
bu = algo.bu
bi = algo.bi
pu = algo.pu
qi = algo.qi
puqi = pu.dot(qi.T)

> Note that internally surprise uses other (inner) indices for users and items than in the original data.
> The original ones are the raw indices. There are functions to translate between the two.
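
The idea behind the two kinds of indices can be illustrated without surprise. The sketch below is a toy, assuming inner ids are consecutive integers handed out in order of first appearance; the helper `build_id_maps` and the sample ids are hypothetical and not part of the surprise API. In the notebook itself, the trainset methods `to_inner_uid`, `to_inner_iid`, `to_raw_uid` and `to_raw_iid` perform the translation.

```python
def build_id_maps(raw_ids):
    """Assign consecutive integer (inner) ids to raw ids in order of first appearance."""
    raw_to_inner = {}
    for raw in raw_ids:
        if raw not in raw_to_inner:
            raw_to_inner[raw] = len(raw_to_inner)
    # the reverse mapping translates inner ids back to raw ids
    inner_to_raw = {inner: raw for raw, inner in raw_to_inner.items()}
    return raw_to_inner, inner_to_raw


# Hypothetical raw user ids as they might appear in a ratings file
raw_user_ids = [196, 186, 22, 196, 244, 22]
raw_to_inner, inner_to_raw = build_id_maps(raw_user_ids)

print(raw_to_inner)     # {196: 0, 186: 1, 22: 2, 244: 3}
print(inner_to_raw[2])  # 22
```

Model parameters such as `bu` or `pu` are arrays indexed by inner id, which is why the raw ids of a prediction have to be translated before looking them up.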

In [None]:
# check that we can reconstruct the predictions using the parameters
i = 10
print(predictions_full[i])
print()
uid = predictions_full[i].uid
iid = predictions_full[i].iid
u_inner = trainset_full.to_inner_uid(uid)
i_inner = trainset_full.to_inner_iid(iid)

pred_calc = mu + bu[u_inner] + bi[i_inner] + puqi[u_inner, i_inner]
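# est is clipped to the rating scale by default, so clip the manual value as well
print('Results agree:', np.isclose(
    predictions_full[i].est, np.clip(pred_calc, 1, 5)))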
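
The reconstruction above uses the biased matrix-factorisation formula: estimate = `mu + bu[u] + bi[i] + pu[u] · qi[i]`. As a minimal NumPy sketch with made-up parameters (the array sizes and random values are assumptions for illustration, not fitted values), the same computation can be carried out for all user-item pairs at once. Since surprise clips `est` to the rating scale by default, the sketch clips as well.

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 6, 3  # toy sizes, chosen for illustration

mu = 3.5                                               # global mean rating
bu = rng.normal(scale=0.3, size=n_users)               # user biases
bi = rng.normal(scale=0.3, size=n_items)               # item biases
pu = rng.normal(scale=0.1, size=(n_users, n_factors))  # user factors
qi = rng.normal(scale=0.1, size=(n_items, n_factors))  # item factors

# r_hat[u, i] = mu + bu[u] + bi[i] + pu[u] . qi[i], via broadcasting
r_hat = mu + bu[:, None] + bi[None, :] + pu @ qi.T

# restrict the estimates to the 1-5 rating scale
r_hat_clipped = np.clip(r_hat, 1, 5)

# a single entry agrees with the element-wise formula
u, i = 2, 4
single = mu + bu[u] + bi[i] + pu[u].dot(qi[i])
print(np.isclose(r_hat[u, i], single))
```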

### Evaluate the model:

In [None]:
sur.accuracy.rmse(predictions_full)
sur.accuracy.mae(predictions_full);

### Put the predictions in a dataframe

In [None]:
df_pred = pd.DataFrame([(x.r_ui, x.est) for x in predictions_full],
                       columns=['Rating', 'Predicted'])

In [None]:
# reconstruct RMSE
np.sqrt(((df_pred['Rating'] - df_pred['Predicted'])**2).mean())

In [None]:
# reconstruct MAE
(df_pred['Rating'] - df_pred['Predicted']).abs().mean()
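
To make the two error measures concrete, here is a small worked example on four made-up rating pairs (the numbers are purely illustrative). RMSE is the square root of the mean squared error, MAE the mean absolute error.

```python
import numpy as np

true_ratings = np.array([4.0, 3.0, 5.0, 1.0])
predicted = np.array([3.5, 3.0, 4.0, 2.0])

errors = true_ratings - predicted       # [0.5, 0.0, 1.0, -1.0]
rmse = np.sqrt(np.mean(errors ** 2))    # sqrt(2.25 / 4) = 0.75
mae = np.mean(np.abs(errors))           # 2.5 / 4 = 0.625

print(rmse, mae)
```

Because the errors are squared before averaging, RMSE penalises the two errors of size 1 more heavily than MAE does, so RMSE is never smaller than MAE.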

### Correlations between predicted and true ratings

In [None]:
df_pred.corr(method='pearson')

In [None]:
df_pred.corr(method='spearman')

In [None]:
df_pred.corr(method='kendall')

## Cross validation, train-test split and grid search

Example from https://surprise.readthedocs.io/en/stable/FAQ.html?highlight=raw_ratings

In [None]:
import random

raw_ratings = data_1.raw_ratings
# seed the stdlib generator used by random.shuffle so the split is reproducible
random.seed(1)
# shuffle the ratings in place
random.shuffle(raw_ratings)

# A = 90% of the data, B = 10% of the data
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

print(len(A_raw_ratings))
print(len(B_raw_ratings))

data_1.raw_ratings = A_raw_ratings  # data_1 is now the set A

In [None]:
len(data_1.raw_ratings)

In [None]:
algo = sur.SVD(random_state=1)

In [None]:
cv_results = sur.model_selection.cross_validate(
    algo, data_1, measures=['RMSE', 'MAE'], cv=5)
pd.DataFrame(cv_results)

In [None]:
# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = sur.model_selection.GridSearchCV(sur.SVD,
                                               param_grid,
                                               measures=['rmse'],
                                               cv=3,
                                               refit=True)
grid_search.fit(data_1)

algo = grid_search.best_estimator['rmse']

# retrain on the whole set A
trainset = data_1.build_full_trainset()
algo.fit(trainset)

# Compute score on training set
trainset_build = trainset.build_testset()
predictions_train = algo.test(trainset_build)
print('Training score ', end='   ')
sur.accuracy.rmse(predictions_train)

# Compute score on rated test set
testset = data_1.construct_testset(B_raw_ratings)  # testset is now the set B
predictions_test = algo.test(testset)
print('Test score (rated items) ', end=' ')
sur.accuracy.rmse(predictions_test)

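# Predict on the anti-test set: all user-item pairs without a rating in set A.
# There are no true ratings here (r_ui is filled with the global mean by default),
# so this RMSE only measures how far the predictions lie from the global mean.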
no_ratings = trainset.build_anti_testset()
predictions_no_ratings = algo.test(no_ratings)
print('Test score (unrated items) ', end='   ')
sur.accuracy.rmse(predictions_no_ratings, verbose=False)
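
To see how much work the grid search in the cell above does, the parameter grid can be expanded by hand. This is only a sketch of the enumeration using `itertools.product`, under the assumption that every combination is tried on every fold; it does not reproduce how surprise runs the search internally.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
cv = 3  # number of folds, as in the grid search above

names = list(param_grid)
# every combination of one value per parameter
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

for params in combinations:
    print(params)

# each combination is fitted once per fold (the final refit is not counted)
n_fits = len(combinations) * cv
print('model fits during the search:', n_fits)
```

The number of fits grows multiplicatively with every parameter added to the grid, which is worth keeping in mind before widening the search.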

In [None]:
print(len(trainset_build), len(testset), len(no_ratings))

In [None]:
print(predictions_train[0])
print(predictions_test[0])
print(predictions_no_ratings[0])
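
The third set is the least familiar one: the anti-test set consists of all user-item pairs that have no rating in the training set. The toy sketch below builds such a set by hand on made-up ratings, filling the unknown rating with the global mean as `build_anti_testset` does by default. It is an illustration of the concept, not surprise's implementation.

```python
# Toy training ratings as (user, item, rating); all values are made up
ratings = [('u1', 'a', 4.0), ('u1', 'b', 2.0), ('u2', 'a', 5.0), ('u3', 'c', 3.0)]

users = sorted({u for u, _, _ in ratings})
items = sorted({i for _, i, _ in ratings})
rated = {(u, i) for u, i, _ in ratings}
global_mean = sum(r for _, _, r in ratings) / len(ratings)

# every (user, item) pair without a known rating, filled with the global mean
anti_testset = [(u, i, global_mean)
                for u in users for i in items if (u, i) not in rated]

print(len(users) * len(items), len(rated), len(anti_testset))  # 9 4 5
print(anti_testset[0])  # ('u1', 'c', 3.5)
```

The sizes add up: rated pairs plus anti-test pairs cover the whole user-item grid, which is why the anti-test set is so much larger than the training set on sparse data.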

In [None]:
# extract model parameters
mu = algo.default_prediction()
print(f'Training set mean: {mu:.6}')
bu = algo.bu
bi = algo.bi
pu = algo.pu
qi = algo.qi
puqi = pu.dot(qi.T)

In [None]:
# reconstruct predictions
i = 10
print(predictions_train[i])
print()
uid = predictions_train[i].uid
iid = predictions_train[i].iid
u_inner = trainset.to_inner_uid(uid)
i_inner = trainset.to_inner_iid(iid)

pred_calc = mu + bu[u_inner] + bi[i_inner] + puqi[u_inner, i_inner]
print('Results agree:', predictions_train[i].est - pred_calc)

## Slope One

Repeat the same steps (fit, predict, evaluate) with the Slope One model.

In [None]:
algo = sur.SlopeOne()
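
For intuition about what Slope One computes, here is a toy pure-Python sketch of the usual formulation: the prediction is the user's mean rating plus the average deviation between the target item and the items the user has rated. The data and the helper names `average_deviation` and `slope_one_predict` are hypothetical; this is not the `sur.SlopeOne` implementation.

```python
# Toy ratings {user: {item: rating}}; values are made up
ratings = {
    'A': {'x': 5.0, 'y': 3.0},
    'B': {'x': 4.0, 'y': 2.0},
    'C': {'y': 2.0},
}


def average_deviation(ratings, i, j):
    """Mean of r_ui - r_uj over users who rated both i and j (None if there are none)."""
    diffs = [r[i] - r[j] for r in ratings.values() if i in r and j in r]
    return sum(diffs) / len(diffs) if diffs else None


def slope_one_predict(ratings, user, item):
    """User mean plus the average deviation of `item` from the user's rated items."""
    user_ratings = ratings[user]
    user_mean = sum(user_ratings.values()) / len(user_ratings)
    devs = [average_deviation(ratings, item, j) for j in user_ratings]
    devs = [d for d in devs if d is not None]
    if not devs:
        return user_mean  # no co-rated items: fall back to the user mean
    return user_mean + sum(devs) / len(devs)


# users A and B rate x two points higher than y on average,
# so C (who gave y a 2) is predicted to give x a 4
print(slope_one_predict(ratings, 'C', 'x'))  # 4.0
```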

## KNN with Means

Repeat the same steps with the kNN with means model.

In [None]:
algo = sur.KNNWithMeans()

## Precision@k and Recall@k

Obtain precision@k and recall@k following the [example](https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-compute-precision-k-and-recall-k).
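
As a starting point, here is a sketch in the spirit of the linked FAQ example. It assumes that an item counts as relevant if its true rating is at least `threshold` and as recommended if its estimated rating is at least `threshold` and it is among the user's top k. It works on plain `(uid, iid, r_ui, est, details)` tuples, so surprise's `Prediction` objects can be passed in as well; the toy predictions are made up.

```python
from collections import defaultdict


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return per-user precision@k and recall@k for (uid, iid, r_ui, est, details) tuples."""
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions, recalls = {}, {}
    for uid, user_ratings in user_est_true.items():
        # rank this user's items by estimated rating, best first
        user_ratings.sort(key=lambda pair: pair[0], reverse=True)
        top_k = user_ratings[:k]

        n_rel = sum(true_r >= threshold for _, true_r in user_ratings)
        n_rec_k = sum(est >= threshold for est, _ in top_k)
        n_rel_and_rec_k = sum(true_r >= threshold and est >= threshold
                              for est, true_r in top_k)

        # share of recommended items that are relevant / of relevant items recommended
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k else 0
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel else 0
    return precisions, recalls


toy_predictions = [
    ('u1', 'a', 5.0, 4.5, {}),
    ('u1', 'b', 2.0, 4.0, {}),
    ('u1', 'c', 4.0, 3.0, {}),
    ('u1', 'd', 4.0, 3.8, {}),
]
precisions, recalls = precision_recall_at_k(toy_predictions, k=2, threshold=3.5)
print(precisions['u1'], recalls['u1'])
```

In the toy data the top two items by estimate are `a` and `b`. Only `a` is relevant, so precision@2 is 1/2, and since three items are relevant in total, recall@2 is 1/3.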

## Top-n predictions

Obtain the n top-ranked predictions for each user following the [example](https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-top-n-recommendations-for-each-user).
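
A sketch of the grouping and ranking step, modelled on the linked FAQ example, could look as follows. It again works on plain `(uid, iid, r_ui, est, details)` tuples; the toy predictions are made up. In the notebook the input would be the predictions on the anti-test set.

```python
from collections import defaultdict


def get_top_n(predictions, n=10):
    """Map each user id to its n (item id, estimate) pairs with the highest estimate."""
    top_n = defaultdict(list)
    for uid, iid, _, est, _ in predictions:
        top_n[uid].append((iid, est))

    # sort each user's candidates by estimate and keep the n best
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda pair: pair[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n


toy_predictions = [
    ('u1', 'a', None, 4.2, {}),
    ('u1', 'b', None, 3.1, {}),
    ('u1', 'c', None, 4.8, {}),
    ('u2', 'a', None, 2.5, {}),
    ('u2', 'c', None, 3.9, {}),
]
top_n = get_top_n(toy_predictions, n=2)
for uid, recs in top_n.items():
    print(uid, [iid for iid, _ in recs])
```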

### Coverage

Calculate the coverage of the top-ranked recommendations, i.e. the share of all items that appear in at least one user's top-n list.
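
A minimal sketch, assuming catalogue coverage is the quantity of interest and using made-up top-n lists and a made-up catalogue:

```python
# Hypothetical top-n lists (item ids only) and item catalogue
top_n = {'u1': ['c', 'a'], 'u2': ['c', 'a'], 'u3': ['b', 'c']}
catalogue = ['a', 'b', 'c', 'd', 'e']

# distinct items that were recommended to at least one user
recommended = {item for items in top_n.values() for item in items}

coverage = len(recommended) / len(catalogue)
print(sorted(recommended), coverage)  # ['a', 'b', 'c'] 0.6
```

A low value indicates that the recommender concentrates on a small part of the catalogue, typically the popular items.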

### Novelty

Calculate the novelty of the top-ranked recommendations, i.e. how unpopular (rarely rated) the recommended items are.
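
Novelty has several definitions in the literature. The sketch below assumes one common choice, the mean self-information of the recommended items, where an item's popularity is the fraction of users who rated it. All counts and lists are made up.

```python
import math

n_users = 8
# hypothetical number of users who rated each item in the training data
n_raters = {'a': 8, 'b': 4, 'c': 2, 'd': 1}
top_n = {'u1': ['a', 'b'], 'u2': ['c', 'd']}


def self_information(item):
    """-log2 of the item's popularity: rarely rated items score higher."""
    return -math.log2(n_raters[item] / n_users)


# average over each user's list, then over users
novelty_per_user = {uid: sum(self_information(i) for i in items) / len(items)
                    for uid, items in top_n.items()}
novelty = sum(novelty_per_user.values()) / len(novelty_per_user)

print(novelty_per_user, novelty)
```

User `u1` receives the popular items `a` and `b` (self-information 0 and 1), user `u2` the rare items `c` and `d` (2 and 3), so `u2`'s list is the more novel one.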

### Evaluate the similarity of the top-k predictions between all pairs of users

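Form a binary user-item matrix in which a one marks a movie that is among the top recommendations for that user.
Use scipy's `pdist` function to calculate the pairwise distances between all rows; with a metric such as Jaccard, the similarity is one minus the distance.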



In [None]:
from scipy import sparse
from scipy.spatial.distance import pdist
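
A small sketch of this step on made-up top-n lists is shown below. The Jaccard metric is an assumption (other metrics accepted by `pdist` would work too), and a dense boolean array is used since `pdist` expects dense input.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical top-n lists (item ids only)
top_n = {'u1': ['a', 'b'], 'u2': ['a', 'c'], 'u3': ['a', 'b']}

users = sorted(top_n)
items = sorted({item for recs in top_n.values() for item in recs})
item_index = {item: col for col, item in enumerate(items)}

# binary user-item matrix: True where the item is in the user's top-n list
matrix = np.zeros((len(users), len(items)), dtype=bool)
for row, user in enumerate(users):
    for item in top_n[user]:
        matrix[row, item_index[item]] = True

# pdist returns the condensed distances; squareform expands them to a full matrix
distances = squareform(pdist(matrix, metric='jaccard'))
similarities = 1 - distances

print(np.round(similarities, 3))
```

Here `u1` and `u3` receive identical lists (similarity 1), while `u1` and `u2` share one of three distinct items (similarity 1/3).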

### Content data

Now work with the further data files containing content information. They can be found in 

`.surprise_data/ml-100k/ml-100k/u.item`

`.surprise_data/ml-100k/ml-100k/u.user`

Use the movie data to evaluate how similar the recommended films are in terms of genre.


Translate the recommended movie ids into movie titles.

In [None]:
df_users = pd.read_csv(
    '~/.surprise_data/ml-100k/ml-100k/u.user', sep='|', header=None)
df_users.columns = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
df_users.head()

In [None]:
df_items = pd.read_csv('~/.surprise_data/ml-100k/ml-100k/u.item',
                       sep='|', header=None, encoding='latin-1')
df_items.columns = ['movie_id', 'movie_title', 'release_date', 'video_release_date',
                    'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation',
                    'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
                    'FilmNoir', 'Horror', 'Musical', 'Mystery', 'Romance', 'SciFi',
                    'Thriller', 'War', 'Western']
df_items.head()
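
With the item table loaded, translating ids into titles and comparing genres comes down to index lookups. The sketch below uses a three-row stand-in called `df_items_toy` with a subset of the columns defined above. Its contents, the recommended ids and the choice of Jaccard similarity on the genre flags are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for df_items with a few of its columns
df_items_toy = pd.DataFrame({
    'movie_id': [1, 2, 3],
    'movie_title': ['Toy Story (1995)', 'GoldenEye (1995)', 'Four Rooms (1995)'],
    'Action': [0, 1, 0],
    'Animation': [1, 0, 0],
    'Comedy': [1, 0, 0],
    'Thriller': [0, 1, 1],
})
indexed = df_items_toy.set_index('movie_id')

# translate recommended movie ids into titles, keeping the ranking order
recommended_ids = [3, 1]
recommended_titles = indexed.loc[recommended_ids, 'movie_title'].tolist()
print(recommended_titles)

# genre similarity of two movies: shared genres / genres of either movie
genre_cols = ['Action', 'Animation', 'Comedy', 'Thriller']
a, b = indexed.loc[2, genre_cols], indexed.loc[3, genre_cols]
shared = int(((a == 1) & (b == 1)).sum())
either = int(((a == 1) | (b == 1)).sum())
genre_similarity = shared / either
print(genre_similarity)
```

Note that the ids in `u.item` are raw ids, so inner ids taken from a trainset would first have to be converted with `to_raw_iid`.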