<a href="https://colab.research.google.com/github/hkbu-kennycheng/comp7240/blob/main/lab1_collaborative_filtering_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab1: Collaborative Filtering (CF) methods

There are various libraries for doing CF in Python. In this lab, we use the Surprise library to go through several CF techniques, including the **user-based method**, the **item-based method**, **centered k-NN** and **co-clustering**.

Let's install Surprise first by running `pip install surprise` in the terminal (or `conda install -c conda-forge scikit-surprise` if you use conda). With pip, the installation is complete when you see `Successfully installed scikit-surprise-1.1.1 surprise-0.1`.


In [None]:
!pip install surprise

Collecting surprise
  Downloading surprise-0.1-py2.py3-none-any.whl (1.8 kB)
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 11.1 MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1619397 sha256=63bd38c8efcd65e9dd226b3ecb28a78afc787ebcb10b108087ffaad4e94208e4
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise, surprise
Successfully installed scikit-surprise-1.1.1 surprise-0.1


## Dataset: MovieLens 100K

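Before running any experiments, we need a dataset. We use MovieLens 100K, which contains 100,000 ratings (on a 1 to 5 scale) from 943 users on 1,682 movies.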

![](https://url2img-web.herokuapp.com/aHR0cHM6Ly9maWxlcy5ncm91cGxlbnMub3JnL2RhdGFzZXRzL21vdmllbGVucy9tbC0xMDBrLVJFQURNRS50eHQ=)

We can load the dataset with `surprise.Dataset.load_builtin`, then split it into a training set and a test set with `surprise.model_selection.train_test_split`.

In [None]:
from surprise import Dataset
from surprise.model_selection import train_test_split

# Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k', prompt=False)

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k
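To make the split step concrete, here is a minimal pure-Python sketch of a random 75/25 split over `(user id, item id, rating)` tuples. The helper `split_ratings` and the toy data are hypothetical and only illustrate the idea; Surprise's `train_test_split` returns a `Trainset` object and a list of test tuples rather than two plain lists.

```python
import random

def split_ratings(ratings, test_size=0.25, seed=0):
    """Shuffle (user, item, rating) tuples and split off a test fraction."""
    shuffled = list(ratings)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

# Hypothetical toy ratings in the same (user id, item id, rating) form
toy_ratings = [(str(u), str(i), float((u * i) % 5 + 1))
               for u in range(1, 5) for i in range(1, 6)]

train, test = split_ratings(toy_ratings, test_size=0.25)
print(len(train), len(test))
```

Every rating ends up in exactly one of the two lists, so the test ratings are never seen during training.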


Let's look at the first 10 records in the test set to get a better understanding of the data.

`('11', '100', 4.0)` means that user id `11` gave a rating of `4.0` to movie id `100`.


In [None]:
testset[:10]

[('11', '100', 4.0),
 ('255', '443', 1.0),
 ('343', '297', 5.0),
 ('846', '1210', 2.0),
 ('293', '162', 3.0),
 ('82', '411', 3.0),
 ('828', '269', 4.0),
 ('182', '471', 4.0),
 ('439', '100', 3.0),
 ('533', '412', 1.0)]

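We can also inspect the training set. `trainset.ur` maps each inner user id to a list of `(inner item id, rating)` tuples, so the cell below shows the first 10 ratings of the user with inner id `0`.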

In [None]:
trainset.ur[0][:10]

## User-based vs item-based


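**User-based** collaborative filtering finds users similar to the target user, based on the ratings they give in a rating matrix, and predicts the target user's rating for an item from the ratings of those similar users.

On the other hand, **item-based** collaborative filtering finds items similar to the target item, based on the ratings they receive, and predicts a user's rating from that user's own ratings of the similar items.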

### Rating matrix

|Rating|item 1|item 2|item 3|item 4|item 5| ... |
|------|------|------|------|------|------|-----|
|user 1| 5    |      | 4    | 1    |      |     |
|user 2|      | 3    |      | 3    |      |     |
|user 3|      | 4    | 4    | 1    |      |     |
|user 4| 4    | 4    | 5    |      |      |     |
|user 5| 2    | 4    |      | 5    | 2    |     |
|...   |      |      |      |      |      |     |
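As an illustration of how user similarity can be read off the rating matrix, the sketch below computes cosine similarity between two users over the items both have rated. The dictionary layout and the helper `cosine_sim` are assumptions made for this example; Surprise computes its similarity matrix internally.

```python
from math import sqrt

# The example rating matrix above, stored as {user: {item: rating}}
ratings = {
    "user 1": {"item 1": 5, "item 3": 4, "item 4": 1},
    "user 2": {"item 2": 3, "item 4": 3},
    "user 3": {"item 2": 4, "item 3": 4, "item 4": 1},
    "user 4": {"item 1": 4, "item 2": 4, "item 3": 5},
    "user 5": {"item 1": 2, "item 2": 4, "item 4": 5, "item 5": 2},
}

def cosine_sim(a, b):
    """Cosine similarity restricted to the items rated by both users."""
    common = set(a) & set(b)
    if not common:
        return 0.0  # no co-rated items: treat as no similarity
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

sim_13 = cosine_sim(ratings["user 1"], ratings["user 3"])
sim_15 = cosine_sim(ratings["user 1"], ratings["user 5"])
print(round(sim_13, 3), round(sim_15, 3))
```

User 1 and user 3 gave identical ratings to the two items they share, so they come out as more similar than user 1 and user 5, who disagree on item 1 and item 4.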

# User-based method using Centered K-Nearest Neighbours (KNN)

In [None]:
from surprise import KNNWithMeans

# To use user-based cosine similarity
algo = KNNWithMeans(sim_options={
    "name": "cosine",
    "user_based": True,  # Compute similarities between users
})
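Centered k-NN predicts a rating as the target user's mean rating plus a similarity-weighted average of how far each neighbour's rating deviates from that neighbour's own mean. The sketch below applies this idea to the toy rating matrix from the table above. The helper names (`cosine_sim`, `predict_centered_knn`) are hypothetical, and details such as neighbour filtering and tie-breaking are simplified compared with `KNNWithMeans`.

```python
from math import sqrt

ratings = {
    "user 1": {"item 1": 5, "item 3": 4, "item 4": 1},
    "user 2": {"item 2": 3, "item 4": 3},
    "user 3": {"item 2": 4, "item 3": 4, "item 4": 1},
    "user 4": {"item 1": 4, "item 2": 4, "item 3": 5},
    "user 5": {"item 1": 2, "item 2": 4, "item 4": 5, "item 5": 2},
}

def cosine_sim(a, b):
    """Cosine similarity over the items rated by both users."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def predict_centered_knn(user, item, k=3):
    """Predict a rating from the k most similar users who rated the item."""
    mu_u = mean(ratings[user].values())
    # Candidate neighbours: other users who have rated the target item
    candidates = [
        (cosine_sim(ratings[user], row), other)
        for other, row in ratings.items()
        if other != user and item in row
    ]
    neighbours = sorted(candidates, reverse=True)[:k]
    # Weighted sum of each neighbour's deviation from their own mean
    num = sum(s * (ratings[v][item] - mean(ratings[v].values()))
              for s, v in neighbours)
    den = sum(s for s, _ in neighbours)
    if den == 0:
        return mu_u  # no usable neighbours: fall back to the user's mean
    return mu_u + num / den

pred = predict_centered_knn("user 1", "item 2", k=3)
print(round(pred, 2))
```

Centering on each user's mean compensates for users who rate systematically high or low, which plain weighted averaging of raw ratings does not.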


### Evaluate with 5-fold cross-validation

It is easy to evaluate an algorithm with cross-validation using `cross_validate` from `surprise.model_selection`. It supports the following accuracy metrics:

- Root Mean Squared Error (RMSE)
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)
- Fraction of Concordant Pairs (FCP)
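For reference, the first three metrics can be computed by hand from pairs of true and estimated ratings. This is a minimal sketch with hypothetical toy values, not Surprise's own implementation (which lives in `surprise.accuracy`); FCP is omitted because it compares pairs of items per user rather than individual errors.

```python
from math import sqrt

def mse(pairs):
    """Mean of squared errors over (true, estimated) rating pairs."""
    return sum((r - est) ** 2 for r, est in pairs) / len(pairs)

def rmse(pairs):
    """Square root of the MSE, in the same units as the ratings."""
    return sqrt(mse(pairs))

def mae(pairs):
    """Mean of absolute errors."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)

# Hypothetical (true rating, estimated rating) pairs
pairs = [(4.0, 3.0), (2.0, 2.0), (5.0, 4.0), (3.0, 5.0)]
print(mse(pairs), rmse(pairs), mae(pairs))
```

Because RMSE squares the errors before averaging, a single large error (the last pair here) raises it more than it raises MAE.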

In [None]:
from surprise.model_selection import cross_validate

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9490  0.9351  0.9492  0.9442  0.9364  0.9428  0.0060  
MAE (testset)     0.7444  0.7355  0.7446  0.7427  0.7390  0.7412  0.0035  
Fit time          1.84    1.75    1.90    1.78    1.73    1.80    0.06    
Test time         4.66    4.68    4.73    4.58    4.50    4.63    0.08    


{'fit_time': (1.8381211757659912,
  1.7521154880523682,
  1.901658535003662,
  1.782780647277832,
  1.7349183559417725),
 'test_mae': array([0.74440984, 0.73547189, 0.74463689, 0.74268236, 0.73902804]),
 'test_rmse': array([0.94901234, 0.93507251, 0.94921895, 0.94420391, 0.93641868]),
 'test_time': (4.662696361541748,
  4.677043437957764,
  4.7342612743377686,
  4.578236103057861,
  4.501827239990234)}
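The 5-fold procedure behind these numbers can be sketched as follows: the ratings are shuffled and cut into five folds, and each fold serves once as the test set while the other four are used for training. The generator `k_fold_indices` is a hypothetical helper written for this illustration; Surprise handles the splitting internally.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n samples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

splits = list(k_fold_indices(23, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each rating is tested exactly once, and the per-fold scores are then averaged, which is what the `Mean` and `Std` columns in the table above summarise.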

# Item-based method using KNN

In [None]:
# To use item-based cosine similarity
algo = KNNWithMeans(sim_options={
    "name": "cosine",
    "user_based": False,  # Compute similarities between items
})
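Setting `user_based` to `False` means similarities are computed between columns of the rating matrix instead of rows. As a rough sketch of that idea, the example below transposes the toy rating matrix from the table earlier in this lab and compares two items over the users who rated both. The helper `cosine_sim` and the data layout are assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

ratings = {
    "user 1": {"item 1": 5, "item 3": 4, "item 4": 1},
    "user 2": {"item 2": 3, "item 4": 3},
    "user 3": {"item 2": 4, "item 3": 4, "item 4": 1},
    "user 4": {"item 1": 4, "item 2": 4, "item 3": 5},
    "user 5": {"item 1": 2, "item 2": 4, "item 4": 5, "item 5": 2},
}

# Transpose {user: {item: rating}} into {item: {user: rating}}
by_item = defaultdict(dict)
for user, row in ratings.items():
    for item, r in row.items():
        by_item[item][user] = r

def cosine_sim(a, b):
    """Cosine similarity over the users who rated both items."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm_a = sqrt(sum(a[u] ** 2 for u in common))
    norm_b = sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b)

sim_23 = cosine_sim(by_item["item 2"], by_item["item 3"])
sim_35 = cosine_sim(by_item["item 3"], by_item["item 5"])
print(round(sim_23, 3), sim_35)
```

Item 3 and item 5 share no raters in the toy matrix, so this sketch assigns them a similarity of zero.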

## Evaluate with 5-fold cross-validation

In [None]:
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# CoClustering


In [None]:
from surprise import CoClustering

algo = CoClustering()
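As a rough guide to what `CoClustering` computes: according to the Surprise documentation, users and items are each assigned to clusters, and every (user cluster, item cluster) pair forms a co-cluster. A rating is then predicted from cluster averages. The notation below follows that documentation and is not defined elsewhere in this lab:

$$
\hat{r}_{ui} = \overline{C_{ui}} + (\mu_u - \overline{C_u}) + (\mu_i - \overline{C_i})
$$

Here $\overline{C_{ui}}$ is the average rating of the co-cluster containing user $u$ and item $i$, $\overline{C_u}$ and $\overline{C_i}$ are the average ratings of the user's cluster and the item's cluster, and $\mu_u$ and $\mu_i$ are the mean ratings of the user and the item.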

## Train the algorithm on the trainset

In [None]:
algo.fit(trainset)

<surprise.prediction_algorithms.co_clustering.CoClustering at 0x7f79819efe90>

## Make prediction on testset

In [None]:
predictions = algo.test(testset)

## Compute RMSE for predictions

In [None]:
from surprise import accuracy

accuracy.rmse(predictions)

RMSE: 0.9658


0.9658082516421524

## Make predictions for specific users and items

In [None]:
uid = str(196)  # raw user id (as in the ratings file). They are **strings**!
iid = str(302)  # raw item id (as in the ratings file). They are **strings**!

# get a prediction for specific users and items.
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

user: 196        item: 302        r_ui = 4.00   est = 4.38   {'was_impossible': False}
