# Building and Testing Recommender Systems With Surprise

In [14]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
# on_bad_lines='skip' (pandas >= 1.3) replaces error_bad_lines=False, which was removed in pandas 2.0
user = pd.read_csv(r'C:\Users\delchain_default\Documents\GitHub\Python-Notes\Machine Learning\Recommender System (Advanced)\BX-Users.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
user.columns = ['userID', 'Location', 'Age']

rating = pd.read_csv(r'C:\Users\delchain_default\Documents\GitHub\Python-Notes\Machine Learning\Recommender System (Advanced)\BX-Book-Ratings.csv', sep=';', on_bad_lines='skip', encoding="latin-1")
rating.columns = ['userID', 'ISBN', 'bookRating']

df = pd.merge(user, rating, on='userID', how='inner')
df.drop(['Location', 'Age'], axis=1, inplace=True)
df.head()

Unnamed: 0,userID,ISBN,bookRating
0,2,195153448,0
1,7,34542252,0
2,8,2005018,5
3,8,60973129,0
4,8,374157065,0


In [19]:

df.groupby('userID')['bookRating'].count().reset_index().sort_values('bookRating', ascending=False)[:10]

Unnamed: 0,userID,bookRating
4213,11676,13602
74815,198711,7550
58113,153662,6109
37356,98391,5891
13576,35859,5850
80185,212898,4785
105111,278418,4533
28884,76352,3367
42037,110973,3100
88584,235105,3067


Most of the users gave fewer than 5 ratings, and very few users gave many ratings, although the most productive user has given 13,602 ratings.

The number of ratings per book and the number of ratings per user share the same distribution: both decay exponentially.

To reduce the dimensionality of the dataset, we will filter out rarely rated books and rarely rating users.

In [20]:

min_book_ratings = 50
filter_books = df['ISBN'].value_counts() > min_book_ratings
filter_books = filter_books[filter_books].index.tolist()

min_user_ratings = 50
filter_users = df['userID'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

df_new = df[(df['ISBN'].isin(filter_books)) & (df['userID'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(df.shape))
print('The new data frame shape:\t{}'.format(df_new.shape))

The original data frame shape:	(1149780, 3)
The new data frame shape:	(140516, 3)


## EDA

## Surprise

To load a dataset from the above pandas data frame, we will use the load_from_df() method. We will also need a Reader object, whose rating_scale parameter must be specified. The data frame must have three columns, corresponding to the user IDs, the item IDs, and the ratings, in this order. Each row thus corresponds to a given rating.

With the Surprise library, we will benchmark the following algorithms:

**Basic algorithms**

<ins>NormalPredictor:</ins> 

* The NormalPredictor algorithm predicts a random rating drawn from the distribution of the training set, which is assumed to be normal. It is one of the most basic algorithms and does very little work.
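
As a rough illustration of the idea behind NormalPredictor, the sketch below fits a mean and standard deviation to a handful of made-up training ratings, then draws a prediction from that normal distribution and clips it to the rating scale. The function names (`fit_normal`, `predict_random`) and the sample ratings are hypothetical; this is a simplified sketch and not Surprise's own code.

```python
import random
import statistics

def fit_normal(train_ratings):
    """Estimate the mean and population standard deviation of the training ratings."""
    mu = statistics.fmean(train_ratings)
    sigma = statistics.pstdev(train_ratings)
    return mu, sigma

def predict_random(mu, sigma, rating_scale=(1, 10), rng=random):
    """Draw a rating from N(mu, sigma^2) and clip it to the rating scale."""
    low, high = rating_scale
    return min(high, max(low, rng.gauss(mu, sigma)))

# Hypothetical training ratings, for illustration only
train_ratings = [5, 7, 8, 10, 6, 9, 3, 8]
mu, sigma = fit_normal(train_ratings)

rng = random.Random(0)  # seeded so the draws are repeatable
samples = [predict_random(mu, sigma, rng=rng) for _ in range(5)]
print(mu, sigma, samples)
```

Because the prediction ignores both the user and the item, this kind of predictor is mainly useful as a floor against which the other algorithms can be compared.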

<ins>BaselineOnly</ins> 

* The BaselineOnly algorithm predicts the baseline estimate for a given user and item.

**k-NN algorithms**

* KNNBasic: a basic collaborative filtering algorithm.

* KNNWithMeans: a basic collaborative filtering algorithm that takes into account the mean ratings of each user.

* KNNWithZScore: a basic collaborative filtering algorithm that takes into account the z-score normalization of each user.

* KNNBaseline: a basic collaborative filtering algorithm that takes into account a baseline rating.
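
To make the difference between the k-NN variants above more concrete, here is a minimal sketch of how a user-based KNNBasic-style estimate and a KNNWithMeans-style estimate could aggregate the neighbors' ratings. The function names and the neighbor values are hypothetical, and the sketch assumes the similarities are positive and the neighbors have already been selected.

```python
def knn_basic(neighbors):
    """Similarity-weighted average of the neighbors' ratings.

    neighbors: list of (similarity, rating) pairs for the nearest users
    who rated the target item.
    """
    total_sim = sum(sim for sim, _ in neighbors)
    return sum(sim * r for sim, r in neighbors) / total_sim

def knn_with_means(user_mean, neighbors):
    """Like knn_basic, but averages each neighbor's deviation from their own mean.

    neighbors: list of (similarity, rating, neighbor_mean) triples.
    """
    total_sim = sum(sim for sim, _, _ in neighbors)
    deviation = sum(sim * (r - mean) for sim, r, mean in neighbors) / total_sim
    return user_mean + deviation

# Hypothetical neighbors of a target user, for illustration only
basic = knn_basic([(0.5, 8.0), (0.5, 6.0)])
centered = knn_with_means(5.0, [(0.5, 8.0, 7.0), (0.5, 6.0, 7.0)])
print(basic, centered)
```

Centering on each user's mean lets a generally harsh rater and a generally generous rater still count as similar, which is the motivation for the mean-based and z-score variants.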

**Matrix Factorization-based algorithms**

* SVD: The SVD algorithm, when baselines are not used, is equivalent to Probabilistic Matrix Factorization (http://papers.nips.cc/paper/3208-probabilistic-matrix-factorization.pdf).

* SVDpp: The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.

* NMF: NMF is a collaborative filtering algorithm based on Non-negative Matrix Factorization. It is very similar to SVD.

* Slope One: Slope One is a straightforward implementation of the SlopeOne algorithm (https://arxiv.org/abs/cs/0702144).

* Co-clustering: Co-clustering is a collaborative filtering algorithm based on co-clustering (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.113.6458&rep=rep1&type=pdf).
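
For the SVD-style models in the list above, the estimate is usually written as the global mean plus a user bias, an item bias, and the dot product of a user factor vector and an item factor vector. The sketch below shows that prediction rule and a single stochastic gradient step on one rating. All names and numbers are hypothetical, and the learning rate and regularization values are illustrative assumptions rather than tuned settings.

```python
def predict(mu, b_u, b_i, p_u, q_i):
    """Biased matrix-factorization estimate: mu + b_u + b_i + q_i . p_u."""
    return mu + b_u + b_i + sum(p * q for p, q in zip(p_u, q_i))

def sgd_step(r_ui, mu, b_u, b_i, p_u, q_i, lr=0.005, reg=0.02):
    """One stochastic-gradient update on a single observed rating r_ui."""
    err = r_ui - predict(mu, b_u, b_i, p_u, q_i)
    new_b_u = b_u + lr * (err - reg * b_u)
    new_b_i = b_i + lr * (err - reg * b_i)
    # Both factor updates use the old values of p_u and q_i
    new_p_u = [p + lr * (err * q - reg * p) for p, q in zip(p_u, q_i)]
    new_q_i = [q + lr * (err * p - reg * q) for p, q in zip(p_u, q_i)]
    return new_b_u, new_b_i, new_p_u, new_q_i

# Hypothetical parameters and one observed rating of 8
mu, b_u, b_i = 5.0, 0.5, -0.5
p_u, q_i = [0.1, 0.2], [0.3, 0.4]

before = predict(mu, b_u, b_i, p_u, q_i)
b_u, b_i, p_u, q_i = sgd_step(8.0, mu, b_u, b_i, p_u, q_i)
after = predict(mu, b_u, b_i, p_u, q_i)
print(before, after)
```

After the update, the estimate moves slightly toward the observed rating; training repeats this step over every rating for several epochs.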

**We use RMSE (root mean squared error) as our accuracy metric for the predictions.**
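
For reference, RMSE is the square root of the average squared difference between true and estimated ratings. A minimal sketch, with a hypothetical `rmse` helper and made-up rating pairs:

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical (true, estimated) pairs, for illustration only
pairs = [(8, 7), (5, 5), (10, 6)]
score = rmse(pairs)
print(score)
```

Squaring the errors means that a single large miss, such as the last pair, weighs much more heavily than several small ones.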

In [27]:
from surprise import Reader
from surprise import Dataset

from surprise.model_selection import cross_validate
from surprise import NormalPredictor
from surprise import KNNBasic
from surprise import KNNWithMeans
from surprise import KNNWithZScore
from surprise import KNNBaseline
from surprise import SVD
from surprise import BaselineOnly
from surprise import SVDpp
from surprise import NMF
from surprise import SlopeOne
from surprise import CoClustering

from surprise.accuracy import rmse
from surprise import accuracy
from surprise.model_selection import train_test_split

In [28]:
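# Note: the data also contains ratings of 0, which fall outside this scale;
# Surprise clips its estimates to the rating_scale given here.
reader = Reader(rating_scale=(1, 10))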
data = Dataset.load_from_df(df_new[['userID', 'ISBN', 'bookRating']], reader)

In [51]:
benchmark = []

# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NormalPredictor(), KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]:   # NMF()
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Average each metric over the folds and append the algorithm's class name
    # (pd.concat replaces Series.append, which was removed in pandas 2.0)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    tmp = pd.concat([tmp, pd.Series([algorithm.__class__.__name__], index=['Algorithm'])])
    benchmark.append(tmp)
       

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


In [52]:

surprise_results = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')
surprise_results
    

Algorithm,test_rmse,fit_time,test_time
BaselineOnly,3.387319,0.208773,0.226054
SlopeOne,3.466256,0.529251,2.782891
CoClustering,3.470497,1.868986,0.198137
KNNWithMeans,3.482147,0.546209,3.650566
KNNBaseline,3.491368,0.705521,4.447764
KNNWithZScore,3.507124,0.639282,4.075442
SVD,3.532282,3.863326,0.292883
KNNBasic,3.683223,0.52061,3.317125
SVDpp,3.763535,89.937813,3.317793
NormalPredictor,4.596632,0.106714,0.262963


In [33]:

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


{'test_rmse': array([3.38497216, 3.37720166, 3.38152191]),
 'fit_time': (0.11073589324951172, 0.121673583984375, 0.14358234405517578),
 'test_time': (0.21841096878051758, 0.14162850379943848, 0.21645379066467285)}

## Train and Predict

The BaselineOnly algorithm gave us the best RMSE. Therefore, we will train and predict with BaselineOnly, using Alternating Least Squares (ALS) to estimate the baselines.
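
To give a feel for what ALS does here, the following sketch estimates a bias for each user and each item by alternately fixing one set of biases and solving for the other, with a regularization term in the denominator. It is a simplified pure-Python sketch, not Surprise's implementation; the function names and the toy ratings are hypothetical, and the default arguments simply mirror the `bsl_options` values used in this notebook.

```python
from collections import defaultdict

def als_baselines(ratings, n_epochs=5, reg_u=12, reg_i=5):
    """Estimate user and item biases by alternating least squares.

    ratings: list of (user, item, rating) triples.
    Returns (global_mean, user_bias, item_bias).
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    b_u = {u: 0.0 for u in by_user}
    b_i = {i: 0.0 for i in by_item}
    for _ in range(n_epochs):
        # Fix the user biases and solve for each item bias
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
        # Fix the item biases and solve for each user bias
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (reg_u + len(rated))
    return mu, b_u, b_i

def baseline_estimate(mu, b_u, b_i, user, item):
    """mu + b_u + b_i, using a zero bias for unseen users or items."""
    return mu + b_u.get(user, 0.0) + b_i.get(item, 0.0)

# Hypothetical ratings, for illustration only
ratings = [("u1", "a", 9), ("u1", "b", 7), ("u2", "a", 5), ("u2", "b", 3)]
mu, b_u, b_i = als_baselines(ratings)
print(mu, baseline_estimate(mu, b_u, b_i, "u1", "a"))
```

Larger `reg_u` and `reg_i` values shrink the biases toward zero, so users and items with only a few ratings stay close to the global mean.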

In [54]:
print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
cross_validate(algo, data, measures=['RMSE'], cv=3, verbose=False)

Using ALS
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...


{'test_rmse': array([3.3843159 , 3.38601587, 3.3808941 ]),
 'fit_time': (0.13463568687438965, 0.13064908981323242, 0.12367057800292969),
 'test_time': (0.24634432792663574, 0.20940780639648438, 0.13959670066833496)}

We use train_test_split() to sample a trainset and a testset of given sizes, and use RMSE as the accuracy metric. We’ll then use the fit() method, which trains the algorithm on the trainset, and the test() method, which returns the predictions made on the testset.

In [55]:
trainset, testset = train_test_split(data, test_size=0.25)
algo = BaselineOnly(bsl_options=bsl_options)
predictions = algo.fit(trainset).test(testset)
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 3.3785


3.3784989384983968

To inspect our predictions in detail, we are going to build a pandas data frame with all the predictions. The following code was largely taken from this notebook.

In [56]:
trainset = algo.trainset
print(algo.__class__.__name__)

BaselineOnly




def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)

In [58]:

df.head()

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
0,182993,312278586,9.0,3.173244,{'was_impossible': False},51,177,5.826756
1,123981,671034057,0.0,1.132647,{'was_impossible': False},228,32,1.132647
2,171445,316776963,9.0,3.641491,{'was_impossible': False},21,137,5.358509
3,31008,345350499,0.0,3.38004,{'was_impossible': False},5,84,3.38004
4,157273,440224624,0.0,1.190933,{'was_impossible': False},135,76,1.190933


In [59]:

best_predictions = df.sort_values(by='err')[:10]
worst_predictions = df.sort_values(by='err')[-10:]

In [60]:
best_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
7366,40943,0312288115,1.0,1.0,{'was_impossible': False},140,31,0.0
26326,7158,074343627X,1.0,1.0,{'was_impossible': False},101,73,0.0
25184,114988,0971880107,1.0,1.0,{'was_impossible': False},89,644,0.0
20276,132492,0971880107,1.0,1.0,{'was_impossible': False},0,644,0.0
3095,4131,0971880107,1.0,1.0,{'was_impossible': False},2,644,0.0
12705,233911,0345427637,1.0,1.0,{'was_impossible': False},36,89,0.0
23311,233911,0345298349,1.0,1.0,{'was_impossible': False},36,41,0.0
7257,41667,0971880107,1.0,1.0,{'was_impossible': False},11,644,0.0
20477,123883,0345370775,5.0,5.000181,{'was_impossible': False},68,206,0.000181
12156,23902,0156005891,7.0,7.000385,{'was_impossible': False},56,38,0.000385


In [61]:
worst_predictions

Unnamed: 0,uid,iid,rui,est,details,Iu,Ui,err
7283,245963,0425130711,10.0,1.0,{'was_impossible': False},136,43,9.0
19171,184299,0441627404,10.0,1.0,{'was_impossible': False},167,34,9.0
28174,160819,0451204948,10.0,1.0,{'was_impossible': False},93,29,9.0
7114,200674,0425141233,10.0,1.0,{'was_impossible': False},139,33,9.0
34792,217375,0553569031,10.0,1.0,{'was_impossible': False},241,51,9.0
19296,36606,0743224574,10.0,1.0,{'was_impossible': False},242,63,9.0
4093,108285,044022165X,10.0,1.0,{'was_impossible': False},56,163,9.0
26526,109901,0618002227,0.0,9.011367,{'was_impossible': False},32,35,9.011367
30824,31826,0439139597,0.0,10.0,{'was_impossible': False},83,86,10.0
22922,31826,0439064864,0.0,10.0,{'was_impossible': False},83,80,10.0


The best predictions are not lucky guesses. In the table above, Ui ranges from 31 to 644, which is not small, meaning that a significant number of users have rated the target book.

The worst predictions look pretty surprising. Let's look in more detail at ISBN "055358264X": the book was rated by 47 users, and user "26544" rated it 10, while our BaselineOnly algorithm predicted 0.


In [62]:
df_new.loc[df_new['ISBN'] == '055358264X']['bookRating'].describe()

count    60.000000
mean      1.283333
std       2.969287
min       0.000000
25%       0.000000
50%       0.000000
75%       0.000000
max      10.000000
Name: bookRating, dtype: float64

In [63]:

import matplotlib.pyplot as plt
%matplotlib notebook

df_new.loc[df_new['ISBN'] == '055358264X']['bookRating'].hist()
plt.xlabel('rating')
plt.ylabel('Number of ratings')
plt.title('Number of ratings book ISBN 055358264X has received')
plt.show();

[Figure: histogram titled "Number of ratings book ISBN 055358264X has received"; x-axis: rating, y-axis: Number of ratings]

It turns out that most of the ratings this book received were 0; in other words, most of the users in the data rated this book 0, and only very few rated it 10. The same holds for the other predictions in the “worst predictions” list. It seems that, for each of these predictions, the user is some kind of outlier.