<div style="width: 100%; clear: both;">
<div style="float: left; width: 50%;">
<img src="http://www.uoc.edu/portal/_resources/common/imatges/marca_UOC/UOC_Masterbrand.jpg" align="left">
</div>
<div style="float: right; width: 50%;">
<p style="margin: 0; padding-top: 22px; text-align:right;">22.418 · Aprenentatge automàtic</p>
<p style="margin: 0; text-align:right;">Grau en Ciència de Dades Aplicada</p>
<p style="margin: 0; text-align:right; padding-bottom: 100px;">Estudis de Informàtica, Multimèdia i Telecomunicació</p>
</div>
</div>
<div style="width:100%;">&nbsp;</div>

# User Based Collaborative filtering

In this notebook, we will run user-based and item-based collaborative filtering algorithms on the MovieLens dataset. The original notebook from the Surprise library can be found at the following link: <br>
https://github.com/NicolasHug/Surprise/blob/master/examples/notebooks/KNNBasic_analysis.ipynb

## Imports

In [1]:
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading https://files.pythonhosted.org/packages/97/37/5d334adaf5ddd65da99fc65f6507e0e4599d092ba048f4302fe8775619e8/scikit-surprise-1.1.1.tar.gz (11.8MB)
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp36-cp36m-linux_x86_64.whl size=1618277 sha256=2032e3621e6b5552b0a2778154fc9eb8090bedadce9ae83a93070e908eb3d6e1
  Stored in directory: /root/.cache/pip/wheels/78/9c/3d/41b419c9d2aff5b6e2b4c0fc8d25c538202834058f9ed110d0
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [2]:
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

# Standard library
import os
import pickle
from collections import defaultdict

# Third-party
import pandas as pd

# Surprise: algorithm, data loading, persistence and evaluation helpers
from surprise import Dataset, KNNWithMeans, Reader, accuracy, dump
from surprise.accuracy import rmse
from surprise.model_selection import train_test_split

## Load the dataset


Let's load the MovieLens-100K dataset and split it into train and test sets as we learnt in the surprise_introduction/2_train_test_split notebook:

In [3]:
data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=.25)

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k


## Train


The options for the similarity measures are defined in the Surprise documentation at https://surprise.readthedocs.io/en/stable/prediction_algorithms.html#similarity-measures-configuration:


Many algorithms use a similarity measure to estimate a rating. The way they can be configured is done in a similar fashion as for baseline ratings: you just need to pass a sim_options argument at the creation of an algorithm. This argument is a dictionary with the following (all optional) keys:

* 'name': The name of the similarity to use, as defined in the similarities module. Default is 'MSD'.
* 'user_based': Whether similarities will be computed between users or between items. This has a huge impact on the performance of a prediction algorithm. Default is True.
* 'min_support': The minimum number of common items (when 'user_based' is 'True') or minimum number of common users (when 'user_based' is 'False') for the similarity not to be zero. Simply put, if $|I_{uv}| < \text{min\_support}$ then $\text{sim}(u, v) = 0$. The same goes for items.
* 'shrinkage': Shrinkage parameter to apply (only relevant for pearson_baseline similarity). Default is 100.
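To make the 'min_support' rule concrete, here is a minimal pure-Python sketch of a Pearson similarity between two users that returns 0 when they share too few items. It is an illustration only, not Surprise's implementation; the function name `pearson_sim` and the toy ratings for `alice` and `bob` are assumptions made up for this example.

```python
from math import sqrt


def pearson_sim(ratings_u, ratings_v, min_support=1):
    """Pearson correlation over commonly rated items (illustrative sketch).

    ratings_u and ratings_v map item ids to ratings. Returns 0.0 when the
    two users share fewer than min_support items.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < min_support:
        return 0.0  # |I_uv| < min_support  =>  sim(u, v) = 0

    xs = [ratings_u[i] for i in common]
    ys = [ratings_v[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = (sqrt(sum((x - mean_x) ** 2 for x in xs))
           * sqrt(sum((y - mean_y) ** 2 for y in ys)))
    return num / den if den else 0.0


# Hypothetical toy ratings: the two users share three items (m1, m2, m3).
alice = {'m1': 5, 'm2': 3, 'm3': 4}
bob = {'m1': 4, 'm2': 2, 'm3': 3, 'm4': 5}

sim_default = pearson_sim(alice, bob)                 # enough common items
sim_strict = pearson_sim(alice, bob, min_support=4)   # too few common items
print(sim_default, sim_strict)
```

With three common items the two toy users are almost perfectly correlated, while requiring four common items forces the similarity to zero even though their tastes agree.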


We are interested in user-based similarities and use the Pearson measure:

In [13]:
sim_options = {'name': 'pearson', 
               'user_based': True  # compute  similarities between users
               }
algo = KNNWithMeans(sim_options=sim_options)

Let's train the algorithm:

In [14]:
algo.fit(trainset)                     


Computing the pearson similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f8c2b677a90>

## Save model
Let's save the model as we did in the 6_save_load_models.ipynb notebook from surprise_introduction:


In [15]:
file_name = os.path.expanduser('~/dump_file')
dump.dump(file_name, algo=algo)

## Load the model

In [16]:
_, loaded_algo = dump.load(file_name)

# Predictions

In [17]:
predictions = loaded_algo.test(testset)


Let's define the function that receives the list of predictions and returns the highest-ranked ones for each user. <br>
We defined it in the 4_get_top_n_recommendations.ipynb notebook from surprise_introduction:


In [19]:
def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user.
            Default is 10.

    Returns:
    A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


With get_top_n we obtain the top n predictions for all users at once. <br>
Once all the predictions have been computed with the test method (here, loaded_algo.test), we can use the function defined above to get the n best predictions for each user:


In [10]:
top_n = get_top_n(predictions, n=10)

We can get the predictions for user 196 with the following instruction (raw user ids are strings, hence the quotes):

In [11]:
top_n["196"]

[('251', 4.407892723166968),
 ('655', 4.291990333327616),
 ('173', 4.176166826197281),
 ('70', 4.075916901496879),
 ('692', 3.906719551648593),
 ('381', 3.7695731135023838),
 ('257', 3.743816336866086),
 ('1118', 3.643927397120928),
 ('762', 3.636030195896745),
 ('1022', 3.4843963678264736)]
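One detail worth illustrating: get_top_n returns a defaultdict, so looking up a key that does not exist (for example the integer 196 instead of the string "196") does not raise an error; it silently returns an empty list. The sketch below reproduces this behaviour with a small stand-in dictionary. The name `top_n_demo` and its two rounded entries are hypothetical, used only for illustration.

```python
from collections import defaultdict

# Stand-in for the dictionary returned by get_top_n (hypothetical toy values).
top_n_demo = defaultdict(list)
top_n_demo["196"] = [("251", 4.41), ("655", 4.29)]

# Raw ids are strings, so the string key finds the recommendations.
string_lookup = top_n_demo["196"]

# .get() does not trigger the default factory: a missing key gives None.
safe_int_lookup = top_n_demo.get(196)

# Indexing with the integer key returns [] and also inserts the key 196.
silent_int_lookup = top_n_demo[196]

print(string_lookup)
print(safe_int_lookup)
print(silent_int_lookup)
```

Using `.get()` or an explicit `in` check is a simple way to notice a mistyped user id instead of receiving an empty recommendation list.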

## Accuracy measures

In [12]:
print("accuracy measures:")
accuracy.rmse(predictions)
accuracy.mse(predictions)
accuracy.mae(predictions)
accuracy.fcp(predictions)



accuracy measures:
RMSE: 0.9545
MSE: 0.9110
MAE:  0.7469
FCP:  0.7109


0.710885562155365

# Test with different parameters

Once we have a working example, we can explore new parameters. Try to run the experiments again changing:
* Similarity measure: try different values for the 'name' parameter: 'pearson', 'cosine', 'msd'
* user_based: try user-based (user_based=True) and item-based (user_based=False) similarities
* min_support: does the performance change if we set a minimum number of common items/users to ensure better prediction models?

Which configuration obtains better results? Do the accuracy measures vary drastically?
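One possible way to organise these experiments is to build every combination of options up front and evaluate each one with the same train/test split. The sketch below is only a starting point: the `min_support` values (1 and 5), the `configs` list and the helper `evaluate` are assumptions introduced here, and `evaluate` expects the `trainset` and `testset` objects created earlier in this notebook.

```python
from itertools import product

names = ['pearson', 'cosine', 'msd']
user_based_options = [True, False]
min_supports = [1, 5]  # hypothetical values, adjust as needed

# One sim_options dictionary per combination of the three parameters.
configs = [
    {'name': name, 'user_based': user_based, 'min_support': min_support}
    for name, user_based, min_support
    in product(names, user_based_options, min_supports)
]


def evaluate(sim_options, trainset, testset):
    """Train KNNWithMeans with the given options and return its test RMSE."""
    # Imported here so the grid above can be built without Surprise loaded.
    from surprise import KNNWithMeans, accuracy

    algo = KNNWithMeans(sim_options=sim_options, verbose=False)
    algo.fit(trainset)
    predictions = algo.test(testset)
    return accuracy.rmse(predictions, verbose=False)


print(len(configs), "configurations to try")
```

In the notebook, a loop such as `for cfg in configs: print(cfg, evaluate(cfg, trainset, testset))` would then report one RMSE per configuration; other measures can be collected the same way.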