<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# Recommendations with surprise

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Use-the-built-in-movielens-100k-dataset" data-toc-modified-id="Use-the-built-in-movielens-100k-dataset-1">Use the built-in movielens-100k dataset</a></span></li><li><span><a href="#Load-the-movielens-100k-dataset-from-disk" data-toc-modified-id="Load-the-movielens-100k-dataset-from-disk-2">Load the movielens-100k dataset from disk</a></span><ul class="toc-item"><li><span><a href="#Instantiate-the-algorithm" data-toc-modified-id="Instantiate-the-algorithm-2.1">Instantiate the algorithm</a></span></li><li><span><a href="#Extract-the-model-parameters" data-toc-modified-id="Extract-the-model-parameters-2.2">Extract the model parameters</a></span></li><li><span><a href="#Evaluate-the-model:" data-toc-modified-id="Evaluate-the-model:-2.3">Evaluate the model:</a></span></li><li><span><a href="#Put-the-predictions-in-a-dataframe" data-toc-modified-id="Put-the-predictions-in-a-dataframe-2.4">Put the predictions in a dataframe</a></span></li><li><span><a href="#Correlations-between-predicted-and-true-ratings" data-toc-modified-id="Correlations-between-predicted-and-true-ratings-2.5">Correlations between predicted and true ratings</a></span></li></ul></li><li><span><a href="#Cross-validation,-train-test-split-and-grid-search" data-toc-modified-id="Cross-validation,-train-test-split-and-grid-search-3">Cross validation, train-test split and grid search</a></span></li><li><span><a href="#Slope-One" data-toc-modified-id="Slope-One-4">Slope One</a></span></li><li><span><a href="#KNN-with-Means" data-toc-modified-id="KNN-with-Means-5">KNN with Means</a></span></li><li><span><a href="#Precision@k-and-Recall@k" data-toc-modified-id="Precision@k-and-Recall@k-6">Precision@k and Recall@k</a></span></li><li><span><a href="#Top-n-predictions" data-toc-modified-id="Top-n-predictions-7">Top-n predictions</a></span><ul class="toc-item"><li><span><a href="#Coverage" 
data-toc-modified-id="Coverage-7.1">Coverage</a></span></li><li><span><a href="#Novelty" data-toc-modified-id="Novelty-7.2">Novelty</a></span></li><li><span><a href="#Evaluate-the-similarity-of-the-top-k-predictions-between-all-pairs-of-users" data-toc-modified-id="Evaluate-the-similarity-of-the-top-k-predictions-between-all-pairs-of-users-7.3">Evaluate the similarity of the top-k predictions between all pairs of users</a></span></li><li><span><a href="#Content-data" data-toc-modified-id="Content-data-7.4">Content data</a></span></li></ul></li></ul></div>

In this lab we will make use of the [surprise package](https://surprise.readthedocs.io/en/stable/index.html), a package dedicated to recommendation systems.

`conda install -c conda-forge scikit-surprise`

First we need some data. Surprise ships with a loader for the MovieLens 100k dataset, a well-known collection of movie ratings; the built-in loader downloads the data on first use.

In [1]:
#from surprise import Dataset
# Load the movielens-100k dataset (download it if needed),
#data = Dataset.load_builtin('ml-100k')

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use('ggplot')
sns.set(font_scale=1.5)
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

In [3]:
# load surprise
import surprise as sur

## Use the built-in movielens-100k dataset

In [4]:
# Load the movielens-100k dataset (download it if needed),
#data = sur.Dataset.load_builtin('ml-100k')

In [5]:
# We'll use the famous SVD algorithm.
algo = sur.SVD()

In [6]:
# Run 5-fold cross-validation and print results
#sur.model_selection.cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5)

## Load the movielens-100k dataset from disk

The command above loads the data in an already prepared form. Other data sources are rarely that convenient, so we will instead load the raw file from disk and prepare it ourselves.

In [7]:
pwd

'/Users/paxton615/GA/DSI9-lessons/week10/day5_recommendation_systems/surprise-lab'

In [8]:
df_data = pd.read_csv(
    'ml-100k/u.data', sep='\t', header=None)
df_data.columns = ['user_id', 'item_id', 'rating', 'timestamp']
df_data.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [9]:
df_data.rating.describe()

count    100000.000000
mean          3.529860
std           1.125674
min           1.000000
25%           3.000000
50%           4.000000
75%           4.000000
max           5.000000
Name: rating, dtype: float64

The `Reader` object specifies the rating scale. When loading from a dataframe, the columns themselves identify users, items and ratings, so they must be passed in exactly that order.

In [10]:
reader = sur.Reader(rating_scale=(1, 5))

In [11]:
# The columns must correspond to user id, item id and ratings (in that order).
data_1 = sur.Dataset.load_from_df(
    df_data[['user_id', 'item_id', 'rating']], reader)

### Instantiate the algorithm

In [12]:
algo = sur.SVD(random_state=1,
               biased=True,  # isolate biases
               reg_all=0.2,  # use regularisation (the same for all)
               n_epochs=20,  # number of epochs for stochastic gradient descent search
               n_factors=100  # number of factors to retain in SVD; they rebuild a complete rating matrix, which differs from the original one
               )

# we have to build a training set from the data
trainset_full = data_1.build_full_trainset()
# fit the model
algo.fit(trainset_full)

# we prepare a test set from the training set; in this example the test set and the training set are the SAME!
trainsetfull_build = trainset_full.build_testset()
# obtain the predictions
predictions_full = algo.test(trainsetfull_build)
# evaluate the predictions
print(sur.accuracy.rmse(predictions_full, verbose=False))

0.9167802882204997


### Extract the model parameters

In [13]:
mu = algo.default_prediction()
bu = algo.bu
bi = algo.bi
# the terms above form the baseline
# the terms below form the correction part
# combined, they give the prediction
pu = algo.pu
qi = algo.qi
puqi = pu.dot(qi.T)

> Note that internally surprise uses other (inner) indices for users and items than in the original data.
> The original ones are the raw indices. There are functions to translate between the two.
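
To make the distinction concrete, here is a minimal sketch of how such a mapping could be built. It is a toy illustration, not surprise's actual code; the names `raw2inner_uid`, `raw2inner_iid` and `inner2raw_uid` are made up for this example. In surprise itself the translation is done with trainset methods such as `to_inner_uid` and `to_inner_iid`, which are used further down.

```python
# Toy illustration of raw-to-inner id mapping (hypothetical names, not surprise's code).
raw_ratings = [(196, 242, 3.0), (186, 302, 3.0), (196, 302, 4.0)]

raw2inner_uid, raw2inner_iid = {}, {}
for uid, iid, _ in raw_ratings:
    # assign the next free integer the first time a raw id is seen
    raw2inner_uid.setdefault(uid, len(raw2inner_uid))
    raw2inner_iid.setdefault(iid, len(raw2inner_iid))

# reverse lookup: inner id back to raw id
inner2raw_uid = {inner: raw for raw, inner in raw2inner_uid.items()}

print(raw2inner_uid)      # {196: 0, 186: 1}
print(inner2raw_uid[1])   # 186
```

Inner ids are consecutive integers starting at zero, which is why they can index arrays such as `bu`, `bi`, `pu` and `qi` directly.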

In [14]:
# check that we can reconstruct the predictions using the parameters
# Recovering the original user id and item id from the transformed matrix is the tricky part.
i = 10
print(predictions_full[i])
print()
uid = predictions_full[i].uid
iid = predictions_full[i].iid
u_inner = trainset_full.to_inner_uid(uid)
i_inner = trainset_full.to_inner_iid(iid)

pred_calc = mu + bu[u_inner] + bi[i_inner] + puqi[u_inner, i_inner]
print('Results agree:', predictions_full[i].est - pred_calc)
# In the output, r_ui is the true rating and est is the predicted rating; the final recommendations are ranked by est.

user: 196        item: 580        r_ui = 2.00   est = 3.34   {'was_impossible': False}

Results agree: 0.0
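
The reconstruction above uses the biased matrix factorisation formula: global mean plus user bias plus item bias plus the dot product of user and item factors. The sketch below repeats it on made-up numbers (two users, three items, two factors) so that every term can be checked by hand. All values are invented for illustration.

```python
import numpy as np

mu = 3.5                                   # global mean rating
bu = np.array([0.2, -0.1])                 # user biases, one per user
bi = np.array([0.3, 0.0, -0.4])            # item biases, one per item
pu = np.array([[0.5, 1.0],
               [1.0, 0.0]])                # user factors, shape (2, 2)
qi = np.array([[0.2, 0.1],
               [0.0, 0.5],
               [1.0, 1.0]])                # item factors, shape (3, 2)

# full matrix of predictions: mu + b_u + b_i + p_u . q_i
r_hat = mu + bu[:, None] + bi[None, :] + pu @ qi.T

print(r_hat.shape)            # (2, 3)
print(round(r_hat[0, 2], 6))  # 3.5 + 0.2 - 0.4 + (0.5*1.0 + 1.0*1.0) = 4.8
```

Note that surprise additionally clips each estimate to the rating scale, which this sketch leaves out.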


### Evaluate the model:

In [15]:
sur.accuracy.rmse(predictions_full)
sur.accuracy.mae(predictions_full);

RMSE: 0.9168
MAE:  0.7305


### Put the predictions in a dataframe

In [16]:
df_pred = pd.DataFrame([(x.r_ui, x.est) for x in predictions_full],
                       columns=['Rating', 'Predicted'])

In [17]:
# reconstruct RMSE
np.sqrt(df_pred.apply(lambda x: (x['Rating']-x['Predicted'])**2, axis=1).mean())

0.9167802882205027

In [18]:
# reconstruct MAE
df_pred.apply(lambda x: abs(x['Rating']-x['Predicted']), axis=1).mean()

0.7305196298517427

### Correlations between predicted and true ratings

In [19]:
df_pred.corr(method='pearson')  # a correlation of 1 between true and predicted ratings would indicate a very good prediction

Unnamed: 0,Rating,Predicted
Rating,1.0,0.591973
Predicted,0.591973,1.0


In [20]:
df_pred.corr(method='spearman')

Unnamed: 0,Rating,Predicted
Rating,1.0,0.576486
Predicted,0.576486,1.0


In [21]:
df_pred.corr(method='kendall')

Unnamed: 0,Rating,Predicted
Rating,1.0,0.451749
Predicted,0.451749,1.0
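
The three coefficients answer slightly different questions: Pearson measures linear agreement, while Spearman and Kendall only look at the ordering. A small sketch on invented numbers shows the difference, using the fact that Spearman's coefficient is Pearson's coefficient computed on ranks. The helper `ranks` is hypothetical and assumes there are no ties.

```python
import numpy as np

def ranks(x):
    # rank (0 .. n-1) of each entry; valid only when there are no ties
    return np.argsort(np.argsort(x))

true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
pred = true ** 3          # monotone but nonlinear in the true values

pearson = np.corrcoef(true, pred)[0, 1]
spearman = np.corrcoef(ranks(true), ranks(pred))[0, 1]
print(round(pearson, 3), round(spearman, 3))   # Pearson below 1, Spearman exactly 1
```

Real rating data contains many ties, which pandas handles with average ranks, so the numbers above are best read as a comparison of ordering quality rather than of rating accuracy.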


## Cross validation, train-test split and grid search

Example from https://surprise.readthedocs.io/en/stable/FAQ.html?highlight=raw_ratings

In [22]:
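import random  # the split here is slightly different: training and test data both come only from the known ratings

raw_ratings = data_1.raw_ratings
random.seed(1)  # random.shuffle uses Python's random module, so np.random.seed would not make it reproducible
# shuffle ratings if you want
random.shuffle(raw_ratings)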

# A = 90% of the data, B = 10% of the data
threshold = int(.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

print(len(A_raw_ratings))
print(len(B_raw_ratings))

data_1.raw_ratings = A_raw_ratings  # data is now the set A: 90000 ratings for training, 10000 held out for testing

90000
10000


In [23]:
len(data_1.raw_ratings)

90000

In [24]:
algo = sur.SVD(random_state=1)

In [25]:
cv_results = sur.model_selection.cross_validate(
    algo, data_1, measures=['RMSE', 'MAE'], cv=5)  # report RMSE and MAE for each fold
pd.DataFrame(cv_results)

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
0,0.947086,0.747943,3.321255,0.163562
1,0.942171,0.743496,3.53573,0.090619
2,0.938736,0.741271,3.519796,0.084338
3,0.936509,0.735815,3.471327,0.15605
4,0.940757,0.741228,3.47417,0.082865
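
Averaging the five `test_rmse` values above gives roughly 0.941. To show what a 5-fold split does, here is a minimal sketch in plain Python. The function `kfold_indices` is hypothetical and only illustrates the idea; surprise's own splitter may partition the ratings differently in detail.

```python
import random

def kfold_indices(n, k, seed=1):
    """Yield (train_idx, test_idx) pairs for k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    for f in range(k):
        test = idx[f::k]                   # every k-th shuffled index
        test_set = set(test)
        train = [i for i in idx if i not in test_set]
        yield train, test

for train, test in kfold_indices(n=10, k=5):
    print(len(train), len(test))           # 8 2 on each of the five lines
```

Every rating lands in the test fold exactly once, so each fold's score is computed on data the model did not see during that fit.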


In [26]:
# Select your best algo with grid search.
print('Grid Search...')
param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}
grid_search = sur.model_selection.GridSearchCV(sur.SVD,
                                               param_grid,
                                               measures=['rmse'],
                                               cv=3,
                                               refit=True)
grid_search.fit(data_1)

algo = grid_search.best_estimator['rmse']  # the estimator with the best RMSE

# now we produce the final predictions

# retrain on the whole set A
trainset = data_1.build_full_trainset()
algo.fit(trainset)

# Compute score on training set
trainset_build = trainset.build_testset()
predictions_train = algo.test(trainset_build)
print('Training score ', end='   ')
sur.accuracy.rmse(predictions_train)

# Compute score on rated test set
testset = data_1.construct_testset(B_raw_ratings)  # testset is now the set B
predictions_test = algo.test(testset)
print('Test score (rated items) ', end=' ')
sur.accuracy.rmse(predictions_test)

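# Compute score on unrated data
# The anti-test set contains the user-item pairs without a rating. Surprise fills
# r_ui with the global training mean there, so this RMSE only measures how far the
# predictions are from that mean; it is not a true test error.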
no_ratings = trainset.build_anti_testset()
predictions_no_ratings = algo.test(no_ratings)
print('Test score (unrated items) ', end='   ')
sur.accuracy.rmse(predictions_no_ratings, verbose=False)

Grid Search...
Training score    RMSE: 0.8383
Test score (rated items)  RMSE: 0.9475
Test score (unrated items)    

0.5204609111000074
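
The grid search tries every combination of the values in `param_grid`. The sketch below only enumerates those combinations with `itertools.product` to show how many models get fitted; it does not call surprise, and the variable `combos` is introduced for illustration.

```python
from itertools import product

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005]}

# every combination of the listed values, as keyword-argument dicts
combos = [dict(zip(param_grid, values))
          for values in product(*param_grid.values())]

for params in combos:
    print(params)
print(len(combos), 'candidates x 3 folds =', len(combos) * 3, 'fits')
```

With `cv=3` this gives 12 fits, plus one more for the refit on the whole of set A. After fitting, `grid_search.best_params['rmse']` and `grid_search.best_score['rmse']` report which combination won.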

In [27]:
print(len(trainset_build), len(testset), len(no_ratings))

90000 10000 1478209
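
The size of the anti-test set is the number of user-item pairs minus the number of known ratings. The figure above would be consistent with 943 users and 1663 items in set A (943 × 1663 − 90000 = 1478209); those two counts are an inference from the numbers, not something the notebook prints. A toy sketch of the construction, with invented ids:

```python
train = [("u1", "i1", 4.0), ("u1", "i2", 2.0), ("u2", "i1", 3.0)]

users = sorted({u for u, _, _ in train})
items = sorted({i for _, i, _ in train})
rated = {(u, i) for u, i, _ in train}
global_mean = sum(r for _, _, r in train) / len(train)

# every user-item pair without a rating, with the global mean as placeholder
anti_testset = [(u, i, global_mean)
                for u in users for i in items if (u, i) not in rated]

print(anti_testset)                            # [('u2', 'i2', 3.0)]
print(len(users) * len(items) - len(train))    # 1
```

Because the "true" rating in these triples is only a placeholder, accuracy metrics on the anti-test set say little about prediction quality.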


In [28]:
print(predictions_train[0])
print(predictions_test[0])
print(predictions_no_ratings[0])

user: 673        item: 326        r_ui = 4.00   est = 3.59   {'was_impossible': False}
user: 877        item: 86         r_ui = 4.00   est = 4.20   {'was_impossible': False}
user: 673        item: 531        r_ui = 3.53   est = 3.93   {'was_impossible': False}


In [29]:
# extract model parameters
mu = algo.default_prediction()
print(f'Training set mean: {mu:.6}')
bu = algo.bu
bi = algo.bi
pu = algo.pu
qi = algo.qi
puqi = pu.dot(qi.T)

Training set mean: 3.5311


In [30]:
# reconstruct predictions
i = 10
print(predictions_train[i])
print()
uid = predictions_train[i].uid
iid = predictions_train[i].iid
u_inner = trainset.to_inner_uid(uid)
i_inner = trainset.to_inner_iid(iid)

pred_calc = mu + bu[u_inner] + bi[i_inner] + puqi[u_inner, i_inner]
print('Results agree:', predictions_train[i].est - pred_calc)

user: 673        item: 302        r_ui = 3.00   est = 4.26   {'was_impossible': False}

Results agree: -8.881784197001252e-16


## Slope One

Repeat the same steps with the slope one model.
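
Before running it, it may help to see what Slope One computes. The sketch below is a plain-Python toy version on invented ratings: the prediction for a user and an item is the user's mean rating plus the average deviation between that item and the items the user has rated. The names `freq`, `dev_sum` and `predict` are chosen for this example and are not surprise's API.

```python
from collections import defaultdict

ratings = {                      # toy data: user -> {item: rating}
    'ann': {'A': 5.0, 'B': 3.0},
    'bob': {'A': 4.0, 'B': 2.0, 'C': 3.0},
    'cat': {'B': 4.0, 'C': 5.0},
}

freq = defaultdict(int)          # number of users who rated both i and j
dev_sum = defaultdict(float)     # summed rating differences r_ui - r_uj
for user_ratings in ratings.values():
    for i, r_i in user_ratings.items():
        for j, r_j in user_ratings.items():
            freq[i, j] += 1
            dev_sum[i, j] += r_i - r_j

def predict(user, item):
    """User mean plus the average deviation of `item` from the user's items."""
    own = ratings[user]
    mean_u = sum(own.values()) / len(own)
    rel = [j for j in own if freq[item, j] > 0]   # items sharing raters with `item`
    if not rel:
        return mean_u
    return mean_u + sum(dev_sum[item, j] / freq[item, j] for j in rel) / len(rel)

print(predict('ann', 'C'))   # 4.0
```

Here item C is rated one point below A and one point above B on average, so the two deviations cancel and the prediction equals Ann's mean of 4.0.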

In [31]:
algo = sur.SlopeOne()

In [51]:
df_data.head()

Unnamed: 0,user_id,item_id,rating,timestamp
0,196,242,3,881250949
1,186,302,3,891717742
2,22,377,1,878887116
3,244,51,2,880606923
4,166,346,1,886397596


In [33]:
algo.fit(trainset_full)

<surprise.prediction_algorithms.slope_one.SlopeOne at 0x1a2d25e400>

In [34]:
trainsetfull_build = trainset_full.build_testset()

In [35]:
predictions_full = algo.test(trainsetfull_build)

In [36]:
print(sur.accuracy.rmse(predictions_full, verbose=False))

0.8580831572042164


In [37]:
# The sur.SVD score earlier was 0.916

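In [41]:
# Slope One has no bias or latent-factor parameters: after fitting, algo.bu and
# algo.bi are None and there are no pu/qi attributes, so the SVD formula fails here.
# Its prediction is the user's mean rating plus the average deviation dev(i, j)
# over the items j the user rated that share at least one rater with item i.
# The attribute names below follow surprise's SlopeOne implementation; treat them
# as an assumption if your version differs.
mu_u = algo.user_mean   # per-user mean ratings (inner user ids)
dev = algo.dev          # average pairwise item deviations (inner item ids)
freq = algo.freq        # number of users who rated both items

In [43]:
i = 10
print(predictions_full[i])

user: 196        item: 580        r_ui = 2.00   est = 3.13   {'was_impossible': False}


In [44]:
print()
uid = predictions_full[i].uid
iid = predictions_full[i].iid
u_inner = trainset_full.to_inner_uid(uid)
i_inner = trainset_full.to_inner_iid(iid)

In [45]:
# items rated by this user that have co-raters with the target item
rated = [j for j, _ in trainset_full.ur[u_inner] if freq[i_inner, j] > 0]
pred_calc = mu_u[u_inner] + sum(dev[i_inner, j] for j in rated) / len(rated)
print('Results agree:', predictions_full[i].est - pred_calc)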

In [46]:
sur.accuracy.rmse(predictions_full)

RMSE: 0.8581


0.8580831572042164

In [47]:
sur.accuracy.mae(predictions_full)

MAE:  0.6741


0.6741447038245749

In [48]:
df_pred = pd.DataFrame([(x.r_ui, x.est) for x in predictions_full], columns=["rating",'predicted'])

In [50]:
df_pred.head()

Unnamed: 0,rating,predicted
0,3.0,3.893535
1,4.0,3.411323
2,4.0,3.63626
3,3.0,4.186504
4,5.0,3.929603


In [53]:
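np.sqrt(df_pred.apply(lambda x: (x['rating']-x['predicted'])**2, axis=1).mean())

0.8580831572042114

In [56]:
# reconstruct MAE: mean absolute error, without a square root; should match the MAE of about 0.6741 reported above
df_pred.apply(lambda x: abs(x['rating']-x['predicted']), axis=1).mean()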

In [57]:
df_pred.corr(method="pearson")

Unnamed: 0,rating,predicted
rating,1.0,0.649139
predicted,0.649139,1.0


In [58]:
df_pred.corr(method='spearman')

Unnamed: 0,rating,predicted
rating,1.0,0.62644
predicted,0.62644,1.0


In [59]:
df_pred.corr(method='kendall')

Unnamed: 0,rating,predicted
rating,1.0,0.49604
predicted,0.49604,1.0


In [60]:
import random

In [62]:
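# note: data_1.raw_ratings already holds only set A (90000 ratings) from the
# earlier split, so splitting again leaves 81000 ratings for cross-validation
raw_ratings = data_1.raw_ratings
random.seed(1)  # seed Python's random module, which random.shuffle uses
random.shuffle(raw_ratings)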

In [70]:
threshold = int(.9*len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

In [68]:
len(A_raw_ratings)

81000

In [69]:
len(raw_ratings)

90000

In [71]:
data_1.raw_ratings = A_raw_ratings

In [73]:
algo = sur.SlopeOne()

In [75]:
cv_results = sur.model_selection.cross_validate(algo, data_1, measures=['RMSE','MAE'], cv=5)
pd.DataFrame(cv_results)

Unnamed: 0,test_rmse,test_mae,fit_time,test_time
0,0.946581,0.744435,0.501297,1.738791
1,0.951776,0.747722,0.477933,1.640819
2,0.953754,0.7497,0.47266,1.665711
3,0.954782,0.748588,0.501011,1.666841
4,0.950317,0.747693,0.478571,1.612125


## KNN with Means

Repeat the same steps with the kNN with means model.

In [None]:
#algo = sur.KNNWithMeans()
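
As orientation for this exercise, here is a toy sketch of the idea behind kNN with means: start from the user's own mean rating and add a similarity-weighted average of how far the neighbours' ratings of the item lie from their own means. The ratings and the similarities in `sim_to_ann` are invented; in surprise the similarities are computed from the data and configured through the `k` and `sim_options` arguments of `sur.KNNWithMeans`.

```python
ratings = {                      # toy data: user -> {item: rating}
    'ann': {'A': 5.0, 'B': 3.0},
    'bob': {'A': 4.0, 'B': 2.0, 'C': 3.0},
    'cat': {'A': 2.0, 'B': 4.0, 'C': 5.0},
}
# hypothetical user-user similarities to 'ann' (normally computed from data)
sim_to_ann = {'bob': 0.9, 'cat': 0.1}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def knn_with_means(user, item, sims, k=2):
    # the k most similar users who have rated the item
    neighbours = sorted((v for v in sims if item in ratings[v]),
                        key=sims.get, reverse=True)[:k]
    num = sum(sims[v] * (ratings[v][item] - mean(ratings[v].values()))
              for v in neighbours)
    den = sum(sims[v] for v in neighbours)
    mu_u = mean(ratings[user].values())
    return mu_u + num / den if den > 0 else mu_u

prediction = knn_with_means('ann', 'C', sim_to_ann)
print(round(prediction, 4))
```

Centering on each user's mean compensates for users who rate systematically high or low, which plain kNN does not do.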

## Precision@k and Recall@k

Obtain precision@k and recall@k following the [example](https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-compute-precision-k-and-recall-k).
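
As a reminder of what the two metrics measure, here is a minimal sketch for a single user, assuming an item counts as relevant when its true rating reaches a threshold and as recommended when its estimated rating does. The function name, the threshold of 3.5 and the sample pairs are choices made for this illustration.

```python
def precision_recall_at_k(user_ratings, k=10, threshold=3.5):
    """user_ratings: list of (estimated, true) rating pairs for one user."""
    top_k = sorted(user_ratings, key=lambda x: x[0], reverse=True)[:k]
    n_rel = sum(true >= threshold for _, true in user_ratings)
    n_rec_k = sum(est >= threshold for est, _ in top_k)
    n_rel_and_rec_k = sum(est >= threshold and true >= threshold
                          for est, true in top_k)
    # fraction of recommended items that are relevant
    precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0.0
    # fraction of relevant items that are recommended
    recall = n_rel_and_rec_k / n_rel if n_rel else 0.0
    return precision, recall

pairs = [(4.5, 5.0), (4.0, 3.0), (3.8, 4.0), (2.0, 4.0), (1.5, 1.0)]
precision, recall = precision_recall_at_k(pairs, k=3)
print(round(precision, 3), round(recall, 3))   # 0.667 0.667
```

In the lab, the pairs for each user would come from the `est` and `r_ui` fields of the test-set predictions, and the per-user values are then averaged.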

## Top-n predictions

Obtain the n top-ranked predictions for each user following the [example](https://surprise.readthedocs.io/en/stable/FAQ.html#how-to-get-the-top-n-recommendations-for-each-user).
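
The core of that step is grouping predictions by user and keeping the n highest estimates. A minimal sketch on plain `(uid, iid, est)` tuples, with invented values:

```python
from collections import defaultdict

def get_top_n(predictions, n=10):
    """predictions: iterable of (uid, iid, est) -> {uid: [(iid, est), ...]}."""
    top_n = defaultdict(list)
    for uid, iid, est in predictions:
        top_n[uid].append((iid, est))
    for uid, user_items in top_n.items():
        # highest estimated rating first, then keep the first n
        user_items.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_items[:n]
    return dict(top_n)

preds = [(1, 'A', 3.2), (1, 'B', 4.7), (1, 'C', 4.1), (2, 'A', 2.5), (2, 'C', 3.9)]
top_n = get_top_n(preds, n=2)
print(top_n)   # {1: [('B', 4.7), ('C', 4.1)], 2: [('C', 3.9), ('A', 2.5)]}
```

With surprise predictions, the tuples could be produced as `(p.uid, p.iid, p.est) for p in predictions_no_ratings`, since recommendations should come from items the user has not rated yet.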

### Coverage

Calculate the coverage of the top-ranked recommendations.
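
The lab does not fix a definition, so the sketch below assumes one common choice: catalogue coverage, the fraction of all items that appear in at least one user's top list. The function name and the toy data are hypothetical.

```python
def catalogue_coverage(top_n, all_items):
    """Fraction of catalogue items that appear in at least one top-n list."""
    catalogue = set(all_items)
    recommended = {iid for items in top_n.values() for iid in items}
    return len(recommended & catalogue) / len(catalogue)

top_n = {1: ['B', 'C'], 2: ['C', 'A'], 3: ['B', 'C']}
coverage = catalogue_coverage(top_n, all_items=['A', 'B', 'C', 'D', 'E'])
print(coverage)   # 0.6
```

A low value indicates that the recommender keeps suggesting the same few items to everyone.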

### Novelty

Calculate the novelty of the top-ranked recommendations.
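
Novelty also has several definitions. The sketch below assumes one of them: the mean self-information of the recommended items, where an item's popularity is the fraction of users who have rated it, so rarely rated items score higher. Names and numbers are invented for illustration.

```python
import math

def mean_novelty(top_n, item_popularity):
    """Average -log2(popularity) over all recommended items."""
    scores = [-math.log2(item_popularity[iid])
              for items in top_n.values() for iid in items]
    return sum(scores) / len(scores)

popularity = {'A': 0.5, 'B': 0.25, 'C': 0.125}   # fraction of users who rated each item
top_n = {1: ['A', 'B'], 2: ['B', 'C']}
novelty = mean_novelty(top_n, popularity)
print(novelty)   # (1 + 2 + 2 + 3) / 4 = 2.0
```

In the lab, the popularity values could be derived from the rating counts per `item_id` in `df_data`.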

### Evaluate the similarity of the top-k predictions between all pairs of users

Form a user-item matrix with ones indicating the top movies recommended to each user.
Use scipy's `pdist` function to calculate the pairwise distances between all rows; for metrics such as Jaccard, the similarity is one minus the distance.



In [None]:
from scipy import sparse
from scipy.spatial.distance import pdist
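
A minimal sketch of this step on a tiny invented matrix, assuming the Jaccard metric is the one of interest (the lab does not prescribe a metric). `squareform` turns the condensed output of `pdist` into a square user-by-user matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# rows = users, columns = items; True marks an item in the user's top list
top_matrix = np.array([[1, 1, 0, 0],
                       [1, 0, 1, 0],
                       [1, 1, 0, 0]], dtype=bool)

dist = squareform(pdist(top_matrix, metric='jaccard'))   # pairwise distances
similarity = 1 - dist                                     # Jaccard similarity
print(similarity.round(2))
```

Users 0 and 2 receive identical lists and get similarity 1, while users 0 and 1 share one of three distinct items and get 1/3. For the full dataset a sparse matrix saves memory, but `pdist` itself expects a dense array.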

### Content data

Now work with the further data files containing content information. They can be found in 

`.surprise_data/ml-100k/ml-100k/u.item`

`.surprise_data/ml-100k/ml-100k/u.user`

Take the movie data into account to evaluate the similarity of the recommended films regarding genre. 


Translate the recommended movie ids into movie titles.

In [None]:
df_users = pd.read_csv(
    '~/.surprise_data/ml-100k/ml-100k/u.user', sep='|', header=None)  # default surprise download location; adjust the path if your copy lives elsewhere
df_users.columns = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
df_users.head()

In [None]:
df_items = pd.read_csv('~/.surprise_data/ml-100k/ml-100k/u.item',  # default surprise download location
                       sep='|', header=None, encoding='latin')
df_items.columns = ['movie_id', 'movie_title', 'release_date', 'video_release_date',
                    'IMDb_URL', 'unknown', 'Action', 'Adventure', 'Animation',
                    'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
                    'FilmNoir', 'Horror', 'Musical', 'Mystery', 'Romance', 'SciFi',
                    'Thriller', 'War', 'Western']
df_items.head()
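
Translating recommended ids into titles is then a lookup against the item table. The sketch below uses a small stand-in dataframe with placeholder titles and a hypothetical `top_n` dictionary; with the real data, `df_items` would take the place of `items`.

```python
import pandas as pd

# toy stand-in for df_items; the titles are placeholders
items = pd.DataFrame({'movie_id': [1, 2, 3],
                      'movie_title': ['Film A', 'Film B', 'Film C']})
id_to_title = items.set_index('movie_id')['movie_title']

top_n = {196: [3, 1], 186: [2]}      # hypothetical user -> recommended movie ids
titles = {uid: id_to_title.loc[iids].tolist() for uid, iids in top_n.items()}
print(titles)   # {196: ['Film C', 'Film A'], 186: ['Film B']}
```

The same indexing trick works for the genre columns: selecting the rows of the recommended ids yields a 0/1 genre matrix whose rows can be compared with `pdist`.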