Recommendation systems are a collection of algorithms that recommend items to users based on information collected from those users. These systems have become ubiquitous and are commonly seen in online stores, movie databases, and job finders. In this notebook, we will explore recommendation systems based on collaborative filtering and implement them using the <b>Surprise</b> package.

Collaborative filtering, also referred to as social filtering, filters information by using the recommendations of other people. A user-item matrix is built, in which each cell holds the rating a user gave to an item. Normally this matrix is sparse, i.e. most of the cells are empty. The goal of any recommendation system is to find similarities among users and items and to recommend items that a user has a high probability of liking, given those similarities.
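
As a minimal sketch of the matrix described above, the snippet below builds a toy user-item matrix as a dict of dicts and measures how sparse it is. The user ids, item ids, and ratings are made up for illustration and are not taken from ratings.dat.

```python
# Toy ratings as (user, item, rating) triples; the values are illustrative only.
ratings = [
    ("u1", "i1", 5.0),
    ("u1", "i3", 3.0),
    ("u2", "i1", 4.0),
    ("u3", "i2", 2.0),
    ("u3", "i3", 4.0),
]

# Store only the observed cells, which suits a sparse matrix.
user_item = {}
for user, item, rating in ratings:
    user_item.setdefault(user, {})[item] = rating

users = sorted(user_item)
items = sorted({item for _, item, _ in ratings})

# Sparsity = fraction of user-item cells that hold no rating.
total_cells = len(users) * len(items)
sparsity = 1 - len(ratings) / total_cells

for user in users:
    # Missing cells show up as None.
    print(user, [user_item[user].get(item) for item in items])
print(f"sparsity = {sparsity:.2f}")
```

Real rating data is usually far sparser than this toy matrix, which is why a dict of dicts (or a sparse matrix type) is preferable to a dense table.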
 
Similarities between users and items can be assessed using several similarity measures, such as correlation, cosine similarity, the Jaccard index, and Hamming distance. The measures most commonly used in a recommendation engine are cosine similarity and the Jaccard index.
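
To make two of these measures concrete, here is a small sketch in plain Python. It assumes each user is represented as a dict mapping item ids to ratings, and it computes cosine similarity over co-rated items only; both choices are assumptions of this sketch, and the users `alice` and `bob` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two rating dicts, over co-rated items only."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def jaccard_index(a, b):
    """Jaccard index on the sets of rated items (rating values are ignored)."""
    union = set(a) | set(b)
    if not union:
        return 0.0
    return len(set(a) & set(b)) / len(union)

# Hypothetical users with made-up ratings.
alice = {"i1": 5.0, "i2": 3.0, "i3": 4.0}
bob = {"i1": 4.0, "i3": 5.0, "i4": 2.0}

print(f"cosine  = {cosine_similarity(alice, bob):.4f}")
print(f"jaccard = {jaccard_index(alice, bob):.4f}")
```

Cosine similarity uses the rating values, while the Jaccard index only looks at which items were rated, so it is better suited to implicit feedback such as clicks or purchases.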

There are two types of collaborative filtering:
1. User-based collaborative filtering - based on users' neighborhood
2. Item-based collaborative filtering - based on items' similarity
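
The difference between the two types comes down to which vectors are compared. The sketch below, using a hypothetical `transpose` helper and made-up ratings, shows that the same data can be viewed per user or per item.

```python
def transpose(user_item):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    item_user = {}
    for user, row in user_item.items():
        for item, rating in row.items():
            item_user.setdefault(item, {})[user] = rating
    return item_user

# Made-up ratings for illustration.
user_item = {
    "u1": {"i1": 5.0, "i2": 3.0},
    "u2": {"i1": 4.0, "i3": 2.0},
}
item_user = transpose(user_item)

# User-based filtering compares rows of user_item (one vector per user).
# Item-based filtering compares rows of item_user (one vector per item).
print(item_user["i1"])
```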

Collaborative filtering can be implemented with two methods:
1. Neighborhood method
2. Matrix factorization

In this tutorial, we will implement user-based collaborative filtering with the neighborhood method (KNNBasic) and then apply matrix factorization (SVD) to the same data.

In [4]:
!pip install scikit-surprise



In [0]:
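from surprise import Dataset
# evaluate and print_perf belong to the older Surprise API used throughout this notebook;
# newer releases replace them with surprise.model_selection.cross_validate.
from surprise import evaluate, print_perf
from surprise import Reader
from surprise import KNNBasic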

In [2]:
from google.colab import files

uploaded = files.upload()

Saving ratings.dat to ratings.dat


In [3]:
data_file_path = "ratings.dat"
data_file_path

'ratings.dat'

The Reader class is used to parse a file containing ratings.

Such a file is assumed to specify only one rating per line, and each line needs to respect the following structure:

user ; item ; rating ; [timestamp]

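Our file has no timestamp column, which is fine because the timestamp is optional. The fields in ratings.dat are separated by '|', so we pass sep='|' to the Reader.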
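
To illustrate the line structure, here is a rough sketch of how one such line could be split into its fields. The function `parse_rating_line` is hypothetical and is not part of Surprise; it only mimics the `(user, item, rating, timestamp)` tuples that appear in `data.raw_ratings` further down, and it assumes the '|' separator used in the Reader call of this notebook.

```python
def parse_rating_line(line, sep="|"):
    """Split one 'user<sep>item<sep>rating[<sep>timestamp]' line into a tuple."""
    fields = [field.strip() for field in line.strip().split(sep)]
    if len(fields) not in (3, 4):
        raise ValueError(f"expected 3 or 4 fields, got {len(fields)}: {line!r}")
    user, item = fields[0], fields[1]  # raw ids stay strings
    rating = float(fields[2])
    timestamp = fields[3] if len(fields) == 4 else None
    return user, item, rating, timestamp

print(parse_rating_line("521|1|5"))
```

Raw user and item ids stay strings, which is why the prediction cells later convert the ids with `str(...)`.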

In [0]:
reader_object = Reader(line_format='user item rating', sep='|', rating_scale=(1, 5), skip_lines=0)
data = Dataset.load_from_file(data_file_path, reader=reader_object)

In [6]:
data

<surprise.dataset.DatasetAutoFolds at 0x7fbff248e080>

In [7]:
type(data)

surprise.dataset.DatasetAutoFolds

In [21]:
# View the first few rows loaded from the ratings.dat file
data.raw_ratings[0:10]

[('521', '1', 5.0, None),
 ('365', '1', 5.0, None),
 ('309', '1', 5.0, None),
 ('1176', '7', 5.0, None),
 ('97', '1', 5.0, None),
 ('1587', '33', 5.0, None),
 ('882', '4', 5.0, None),
 ('263', '1', 5.0, None),
 ('601', '2', 5.0, None),
 ('781', '3', 5.0, None)]

In [0]:
# We need to split the Data into five folds to perform cross validation
data.split(n_folds=5)
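
For intuition, the sketch below shows what a five-fold split does conceptually: the ratings are shuffled, cut into five parts, and each part serves once as the test set while the other four are used for training. The function `k_fold_indices` and its `seed` parameter are hypothetical and only illustrate the idea; they do not reproduce how Surprise assigns ratings to folds.

```python
import random

def k_fold_indices(n_ratings, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_ratings))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal parts.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

for fold_number, (train, test) in enumerate(k_fold_indices(10), start=1):
    print(f"Fold {fold_number}: {len(train)} train ratings, {len(test)} test ratings")
```

Every rating ends up in exactly one test set, so the per-fold errors reported below together cover the whole dataset.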

**KNNBasic (Neighborhood Method)**

**KNNBasic**
A basic collaborative filtering algorithm. Whether it is user-based or item-based depends on the 'user_based' field of the 'sim_options' parameter.

Args:

    k(int): The (max) number of neighbors to take into account for
        aggregation. **Default is 40.**
    
    min_k(int): The minimum number of neighbors to take into account for
        aggregation. If there are not enough neighbors, the prediction is
        set to the global mean of all ratings. **Default is 1.**
    
    sim_options(dict): A dictionary of options for the similarity
        measure.
    
    verbose(bool): Whether to print trace messages of bias estimation,
        similarity, etc.  Default is True.
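
The roles of `k` and `min_k` can be sketched as follows. This is a simplified illustration of a similarity-weighted neighborhood prediction, not Surprise's actual code: `knn_predict`, its arguments, and the toy values are all assumptions of this sketch.

```python
import heapq

def knn_predict(sims_to_user, item_ratings, global_mean, k=40, min_k=1):
    """Predict a rating as the similarity-weighted mean of neighbor ratings.

    sims_to_user: {neighbor: similarity to the target user}
    item_ratings: {neighbor: rating that neighbor gave the target item}
    """
    # Only neighbors who rated the item can contribute.
    candidates = [(sims_to_user[v], r) for v, r in item_ratings.items()
                  if v in sims_to_user]
    # Keep at most k neighbors, the most similar first.
    nearest = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    used = [(sim, r) for sim, r in nearest if sim > 0]
    # Too few usable neighbors: fall back to the global mean.
    if len(used) < min_k:
        return global_mean
    return sum(sim * r for sim, r in used) / sum(sim for sim, _ in used)

# Made-up similarities and ratings for three neighbors.
sims = {"a": 0.9, "b": 0.5, "c": 0.1}
item_ratings = {"a": 5.0, "b": 3.0, "c": 1.0}

print(knn_predict(sims, item_ratings, global_mean=3.5, k=2))
print(knn_predict(sims, item_ratings, global_mean=3.5, k=2, min_k=3))
```

In the second call only two neighbors are available while `min_k` asks for three, so the sketch returns the global mean instead of a neighborhood estimate.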

In [0]:
# We'll use the well-known user-based collaborative filtering algorithm.
# The Surprise package offers several other algorithms as well.

similarity_params = {'name': 'cosine',
                     'user_based': True  # compute  similarities between users
                    }
algo = KNNBasic(sim_options=similarity_params)

In [11]:
# Evaluate performances of our algorithm on the dataset.

perf = evaluate(algo, data, measures=['RMSE', 'MAE'])

print_perf(perf)



Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9682
MAE:  0.6732
------------
Fold 2
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9611
MAE:  0.6645
------------
Fold 3
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8866
MAE:  0.6117
------------
Fold 4
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9352
MAE:  0.6417
------------
Fold 5
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0443
MAE:  0.6970
------------
------------
Mean RMSE: 0.9591
Mean MAE : 0.6576
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9682  0.9611  0.8866  0.9352  1.0443  0.9591  
MAE     0.6732  0.6645  0.6117  0.6417  0.6970  0.6576  
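
For reference, the two error measures reported above can be computed as in the sketch below. The `pairs` list holds made-up (true rating, estimated rating) values; the `fold_rmse` list copies the five per-fold RMSE values from the output above to show that the reported mean is their plain average.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# Made-up predictions: two small errors and one large error.
pairs = [(5.0, 4.5), (3.0, 3.5), (4.0, 2.0)]
print(f"RMSE = {rmse(pairs):.4f}")
print(f"MAE  = {mae(pairs):.4f}")

# Per-fold RMSE values copied from the KNNBasic output above.
fold_rmse = [0.9682, 0.9611, 0.8866, 0.9352, 1.0443]
print(f"Mean RMSE = {sum(fold_rmse) / len(fold_rmse):.4f}")
```

Because RMSE squares each error before averaging, a single large error raises it more than it raises MAE, so RMSE is never smaller than MAE on the same predictions.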


In [12]:
from surprise import GridSearch
param_grid = {'min_k': [1,2,3,4,5], 'k': [35,36,37,38,39,40]}
grid_search = GridSearch(KNNBasic, param_grid, measures=['RMSE', 'MAE'])
grid_search.evaluate(data)



Running grid search for the following parameter combinations:
{'min_k': 1, 'k': 35}
{'min_k': 1, 'k': 36}
{'min_k': 1, 'k': 37}
{'min_k': 1, 'k': 38}
{'min_k': 1, 'k': 39}
{'min_k': 1, 'k': 40}
{'min_k': 2, 'k': 35}
{'min_k': 2, 'k': 36}
{'min_k': 2, 'k': 37}
{'min_k': 2, 'k': 38}
{'min_k': 2, 'k': 39}
{'min_k': 2, 'k': 40}
{'min_k': 3, 'k': 35}
{'min_k': 3, 'k': 36}
{'min_k': 3, 'k': 37}
{'min_k': 3, 'k': 38}
{'min_k': 3, 'k': 39}
{'min_k': 3, 'k': 40}
{'min_k': 4, 'k': 35}
{'min_k': 4, 'k': 36}
{'min_k': 4, 'k': 37}
{'min_k': 4, 'k': 38}
{'min_k': 4, 'k': 39}
{'min_k': 4, 'k': 40}
{'min_k': 5, 'k': 35}
{'min_k': 5, 'k': 36}
{'min_k': 5, 'k': 37}
{'min_k': 5, 'k': 38}
{'min_k': 5, 'k': 39}
{'min_k': 5, 'k': 40}
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing simil
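
The list of parameter combinations printed above is the Cartesian product of the value lists in `param_grid`. A minimal sketch of that enumeration follows; `toy_score` is a made-up stand-in for "mean error over the folds" and has nothing to do with the real scores.

```python
from itertools import product

param_grid = {'min_k': [1, 2, 3, 4, 5], 'k': [35, 36, 37, 38, 39, 40]}

# One dict per combination, in the same order as the log above.
names = list(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

print(len(combinations))
print(combinations[0])
print(combinations[-1])

def toy_score(params):
    # Hypothetical score, used only to show how the best combination is picked.
    return params['k'] + params['min_k']

best_params = min(combinations, key=toy_score)
print(best_params)
```

Each combination is cross validated separately, so the cost of a grid search grows with the product of the list lengths: 5 x 6 = 30 runs here.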

In [13]:
# Inspect the results of the grid search

# best RMSE score in the model
print("Best RMSE Score is:{!r}".format(grid_search.best_score['RMSE']))


# combination of parameters that gave the best RMSE score
print("Parameter to achieve Best RMSE Score is:{!r}".format(grid_search.best_params['RMSE']))


# best MAE score
print("Best MAE Score is :{!r}".format(grid_search.best_score['MAE']))


# combination of parameters that gave the best MAE score
print("Parameter to achieve Best MAE Score is:{!r}".format(grid_search.best_params['MAE']))


Best RMSE Score is:0.9459235245253034
Parameter to achieve Best RMSE Score is:{'min_k': 3, 'k': 35}
Best MAE Score is :0.6500376414597726
Parameter to achieve Best MAE Score is:{'min_k': 3, 'k': 35}


In [15]:
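# Re-run the KNNBasic model with the best parameters found in Grid Search.
# Note: the grid search log above reports the msd similarity, because no
# sim_options were passed to it, while this cell uses cosine similarity.

similarity_params = {'name': 'cosine',
                     'user_based': True  # compute similarities between users
                    }
algo = KNNBasic(sim_options=similarity_params, min_k=3, k=35)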
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)



Evaluating RMSE, MAE of algorithm KNNBasic.

------------
Fold 1
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9518
MAE:  0.6656
------------
Fold 2
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9276
MAE:  0.6520
------------
Fold 3
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.8886
MAE:  0.6153
------------
Fold 4
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 0.9180
MAE:  0.6382
------------
Fold 5
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.0403
MAE:  0.6877
------------
------------
Mean RMSE: 0.9453
Mean MAE : 0.6517
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.9518  0.9276  0.8886  0.9180  1.0403  0.9453  
MAE     0.6656  0.6520  0.6153  0.6382  0.6877  0.6517  


In [16]:
# Make prediction for a user and movie id
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 196        item: 302        r_ui = 4.00   est = 4.51   {'was_impossible': True, 'reason': 'User and/or item is unkown.'}


In [18]:
# Make prediction for a user and movie id
uid = str(699)  # raw user id (as in the ratings file). They are **strings**!
iid = str(208)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)

user: 699        item: 208        r_ui = None   est = 1.45   {'actual_k': 3, 'was_impossible': False}


**SVD (Matrix Factorization method)**
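
Before running the library version, it may help to see the idea behind this kind of matrix factorization in a few lines. The sketch below assumes the common formulation in which a rating is estimated as global mean + user bias + item bias + dot product of a user factor vector and an item factor vector, fitted with stochastic gradient descent. The functions `train_svd` and `predict`, the toy ratings, and the hyperparameter values are assumptions of this sketch and are chosen for the toy data, not taken from Surprise.

```python
import math
import random

def predict(model, u, i):
    """Estimate = global mean + user bias + item bias + factor dot product."""
    mu, bu, bi, p, q = model
    return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

def train_svd(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Fit biases and latent factors with stochastic gradient descent."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean
    bu = {u: 0.0 for u, _, _ in ratings}               # user biases
    bi = {i: 0.0 for _, i, _ in ratings}               # item biases
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in bu}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in bi}
    model = (mu, bu, bi, p, q)
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(model, u, i)
            # Move each parameter against the gradient of the regularized squared error.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return model

def train_rmse(estimate, ratings):
    """RMSE of an estimate function on the given ratings."""
    return math.sqrt(sum((r - estimate(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Made-up ratings for illustration.
ratings = [("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0),
           ("u2", "i3", 1.0), ("u3", "i2", 2.0), ("u3", "i3", 1.0)]

model = train_svd(ratings)
baseline_rmse = train_rmse(lambda u, i: model[0], ratings)  # always predict the mean
trained_rmse = train_rmse(lambda u, i: predict(model, u, i), ratings)
print(f"baseline RMSE = {baseline_rmse:.4f}, trained RMSE = {trained_rmse:.4f}")
```

The parameters `n_factors`, `n_epochs`, `lr`, and `reg` play the same roles as `n_factors`, `n_epochs`, `lr_all`, and `reg_all` in the grid search further down.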

In [0]:
from surprise import SVD
from surprise.model_selection import cross_validate  # imported, but the cells below still use evaluate()

In [25]:
#Initialize SVD algorithm with default params
algo = SVD()
algo

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7fbff2429550>

In [26]:
# Evaluate performances of our algorithm on the dataset.
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)



Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8168
MAE:  0.4952
------------
Fold 2
RMSE: 0.7917
MAE:  0.4870
------------
Fold 3
RMSE: 0.7533
MAE:  0.4616
------------
Fold 4
RMSE: 0.7820
MAE:  0.4744
------------
Fold 5
RMSE: 0.9212
MAE:  0.5265
------------
------------
Mean RMSE: 0.8130
Mean MAE : 0.4890
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.8168  0.7917  0.7533  0.7820  0.9212  0.8130  
MAE     0.4952  0.4870  0.4616  0.4744  0.5265  0.4890  


In [32]:
param_grid = {'n_factors': [50,100,150], 'n_epochs': [10,20,30,40,50], 'lr_all': [0.001, 0.005, 0.01, 0.05], 'reg_all': [0.01,0.02,0.03,0.04,0.05]}
grid_search = GridSearch(SVD, param_grid, measures=['RMSE', 'MAE'])
grid_search.evaluate(data)



Running grid search for the following parameter combinations:
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.01}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.02}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.03}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.04}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.001, 'reg_all': 0.05}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.01}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.02}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.03}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.04}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.05}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.01, 'reg_all': 0.01}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.01, 'reg_all': 0.02}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.01, 'reg_all': 0.03}
{'n_factors': 50, 'n_epochs': 10, 'lr_all': 0.01, 'reg_al

In [33]:
# Inspect the results of the grid search

# best RMSE score in the model
print("Best RMSE Score is:{!r}".format(grid_search.best_score['RMSE']))


# combination of parameters that gave the best RMSE score
print("Parameter to achieve Best RMSE Score is:{!r}".format(grid_search.best_params['RMSE']))


# best MAE score
print("Best MAE Score is :{!r}".format(grid_search.best_score['MAE']))


# combination of parameters that gave the best MAE score
print("Parameter to achieve Best MAE Score is:{!r}".format(grid_search.best_params['MAE']))


Best RMSE Score is:0.8036188912814561
Parameter to achieve Best RMSE Score is:{'n_factors': 50, 'n_epochs': 20, 'lr_all': 0.01, 'reg_all': 0.05}
Best MAE Score is :0.4824774952732297
Parameter to achieve Best MAE Score is:{'n_factors': 100, 'n_epochs': 30, 'lr_all': 0.01, 'reg_all': 0.01}


In [34]:
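# Use the parameter combination that gave the best MAE in the grid search
# (the best RMSE came from a different combination; see the output above).
algo = SVD(n_factors=100, n_epochs=30, lr_all=0.01, reg_all=0.01)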
perf = evaluate(algo, data, measures=['RMSE', 'MAE'])
print_perf(perf)



Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.8249
MAE:  0.4976
------------
Fold 2
RMSE: 0.7901
MAE:  0.4863
------------
Fold 3
RMSE: 0.7497
MAE:  0.4589
------------
Fold 4
RMSE: 0.7742
MAE:  0.4638
------------
Fold 5
RMSE: 0.9096
MAE:  0.5231
------------
------------
Mean RMSE: 0.8097
Mean MAE : 0.4859
------------
------------
        Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    
RMSE    0.8249  0.7901  0.7497  0.7742  0.9096  0.8097  
MAE     0.4976  0.4863  0.4589  0.4638  0.5231  0.4859  


In [35]:
# Make prediction for a user and movie id
uid = str(699)  # raw user id (as in the ratings file). They are **strings**!
iid = str(208)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, verbose=True)

user: 699        item: 208        r_ui = None   est = 1.58   {'was_impossible': False}


**Other algorithms**

1. random_pred.NormalPredictor --> Algorithm predicting a random rating based on the distribution of the training set, which is assumed to be normal.
2. baseline_only.BaselineOnly --> Algorithm predicting the baseline estimate for given user and item.
3. knns.KNNBasic --> A basic collaborative filtering algorithm.
4. knns.KNNWithMeans --> A basic collaborative filtering algorithm, taking into account the mean ratings of each user.
5. knns.KNNWithZScore --> A basic collaborative filtering algorithm, taking into account the z-score normalization of each user.
6. knns.KNNBaseline --> A basic collaborative filtering algorithm taking into account a baseline rating.
7. matrix_factorization.SVD --> The famous SVD algorithm, as popularized by Simon Funk during the Netflix Prize.
8. matrix_factorization.SVDpp --> The SVD++ algorithm, an extension of SVD taking into account implicit ratings.
9. matrix_factorization.NMF --> A collaborative filtering algorithm based on Non-negative Matrix Factorization.
10. slope_one.SlopeOne --> A simple yet accurate collaborative filtering algorithm.
11. co_clustering.CoClustering --> A collaborative filtering algorithm based on co-clustering.

