<a href="https://colab.research.google.com/github/hkbu-kennycheng/comp7240/blob/main/lab2_content_based_methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 2: content-based methods

In this lab we cover two more CF techniques: the baseline method and factorization machines (FM). We also switch to a new dataset, which demonstrates how to import data from a CSV file in `surprise`.

Let's install `surprise` with the `pip` command.

In [2]:
!pip install scikit-surprise scipy



The following code cell enables rendering of math equations in Google Colab.

In [3]:
if 'google.colab' in str(get_ipython()):
    
    from sympy import init_printing
    from sympy.printing import latex

    def colab_LaTeX_printer(exp, **options):  
        from google.colab.output._publish import javascript 

        url_ = "https://colab.research.google.com/static/mathjax/MathJax.js?"
        cfg_ = "config=TeX-MML-AM_HTMLorMML" # "config=default"

        javascript(url=url_+cfg_)

        return latex(exp, **options)

    init_printing(use_latex="mathjax", latex_printer=colab_LaTeX_printer) 

# Dataset: CiaoDVD

Let's consider the CiaoDVD dataset.

![](https://url2img-web.herokuapp.com/aHR0cHM6Ly9ndW9ndWliaW5nLmdpdGh1Yi5pby9saWJyZWMvZGF0YXNldHMuaHRtbCNjaWFvZHZk)

In [4]:
!curl https://guoguibing.github.io/librec/datasets/CiaoDVD.zip > CiaoDVD.zip
!unzip CiaoDVD.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5678k  100 5678k    0     0  7280k      0 --:--:-- --:--:-- --:--:-- 7280k
Archive:  CiaoDVD.zip
replace movie-ratings.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: All
  inflating: movie-ratings.txt       
  inflating: readme.txt              
  inflating: review-ratings.txt      
  inflating: trusts.txt              


In [5]:
!head -n 20 readme.txt

Data Collection Duration: 2013-11 --- 2013-12
Author: Guibing Guo

Dataset Name: CiaoDVDs

1. Ciao movie ratings format:  
    1) File: movie-ratings.txt (size: 72,665 --> 72.7K)
    2) Columns: userID, movieID, genreID, reviewID, movieRating, date

2. Ciao review ratings format: 
    1) File: review-ratings.txt (size: 1,625,480 --> 1.6M)
    2) Columns: userID, reviewID, reviewRating
    3) Note: There are users who do not provide movie ratings but provide review ratings.
    
3. Ciao user trusts fromat:
    1) File: trusts.txt (size: 40,133 --> 40K)
    2) Columns: trustorID, trusteeID, trustValue
    3) Note: There are users who may not provide neither movie rating nor review ratings. 



In [6]:
!head review-ratings.txt

4064,21,3
931,27,4
1869,41,3
44,17,3
9370,26,4
2355,33,3
17616,40,4
17198,29,2
7802,35,4
6567,19,3


## Import data from csv file

Let's work on the data in `review-ratings.txt`.

In [10]:
from surprise import Dataset
from surprise import Reader

# path to dataset file
file_path = 'review-ratings.txt'

# the user, item and rating fields are separated by a comma
reader = Reader(sep=',')

data = Dataset.load_from_file(file_path, reader=reader)
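
To make the file format concrete, here is a minimal sketch of what reading such a file amounts to, using only the standard library. It is not how `surprise` is implemented; `parse_ratings` is a hypothetical helper, and the sample text is the first three lines printed by `head review-ratings.txt` above. Each line is split on the separator into a user ID, an item ID (here a review ID) and a rating.

```python
import csv
import io

# First three lines of review-ratings.txt, as printed by `head` above.
sample = "4064,21,3\n931,27,4\n1869,41,3\n"

def parse_ratings(text, sep=","):
    """Return a list of (user_id, item_id, rating) tuples.

    Hypothetical helper: IDs stay as strings, the rating becomes a float.
    """
    rows = []
    for user_id, item_id, rating in csv.reader(io.StringIO(text), delimiter=sep):
        rows.append((user_id, item_id, float(rating)))
    return rows

ratings = parse_ratings(sample)
print(ratings)
```

In this sketch the IDs are deliberately left as strings, since they are labels rather than quantities.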

## Split training set and testing set

In [11]:
from surprise.model_selection import train_test_split

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

# Baseline method

In [9]:
from surprise import BaselineOnly
from surprise import accuracy

algo = BaselineOnly()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

Estimating biases using als...
RMSE: 0.4387


0.43868088206653477
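
The log line "Estimating biases using als..." refers to a baseline estimate of the form global mean + user bias + item bias, fitted by alternating least squares. The following is a simplified sketch of that idea in plain Python, not the `surprise` implementation. The function names are hypothetical, and the default values for `n_epochs`, `reg_u` and `reg_i` are assumptions chosen for illustration.

```python
from collections import defaultdict

def fit_baseline(ratings, n_epochs=10, reg_u=15.0, reg_i=10.0):
    """Estimate the global mean, user biases and item biases.

    ratings: list of (user, item, rating) tuples.
    Item and user biases are updated in turn, each shrunk by a
    regularization term (reg_i, reg_u: assumed values).
    """
    mu = sum(r for _, _, r in ratings) / len(ratings)
    by_user, by_item = defaultdict(list), defaultdict(list)
    for u, i, r in ratings:
        by_user[u].append((i, r))
        by_item[i].append((u, r))

    b_u, b_i = defaultdict(float), defaultdict(float)
    for _ in range(n_epochs):
        for i, rated in by_item.items():
            b_i[i] = sum(r - mu - b_u[u] for u, r in rated) / (reg_i + len(rated))
        for u, rated in by_user.items():
            b_u[u] = sum(r - mu - b_i[i] for i, r in rated) / (reg_u + len(rated))
    return mu, b_u, b_i

def predict(mu, b_u, b_i, u, i):
    # Unknown users or items contribute a bias of 0.
    return mu + b_u.get(u, 0.0) + b_i.get(i, 0.0)

# Toy data: user "a" rates higher than user "b", item "x" higher than "y".
toy = [("a", "x", 5.0), ("a", "y", 4.0), ("b", "x", 3.0), ("b", "y", 2.0)]
mu, b_u, b_i = fit_baseline(toy)
print(predict(mu, b_u, b_i, "a", "x"))  # above the global mean of 3.5
print(predict(mu, b_u, b_i, "b", "y"))  # below the global mean of 3.5
```

Because the model has no interaction between a specific user and a specific item, it serves as a reference point that more expressive models are expected to beat.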

## Evaluation

In [10]:
from surprise.model_selection import cross_validate

# We can now use this dataset as we please, e.g. calling cross_validate
cross_validate(algo, data, verbose=True)

Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.4321  0.4336  0.4363  0.4340  0.4346  0.4341  0.0014  
MAE (testset)     0.2849  0.2847  0.2863  0.2856  0.2854  0.2854  0.0006  
Fit time          10.48   11.86   11.74   11.97   11.76   11.56   0.55    
Test time         3.98    3.05    3.05    3.02    3.98    3.42    0.46    


{'fit_time': (10.476726293563843,
  11.859905004501343,
  11.738009214401245,
  11.969250202178955,
  11.758033514022827),
 'test_mae': array([0.28490151, 0.28466501, 0.28626753, 0.28556507, 0.28541539]),
 'test_rmse': array([0.43211072, 0.43364079, 0.43626314, 0.43403554, 0.43461696]),
 'test_time': (3.9818809032440186,
  3.048750400543213,
  3.0481350421905518,
  3.0211844444274902,
  3.975337505340576)}
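
The two error measures in the table above can be written out in a few lines. This is a plain-Python sketch for illustration; the `rmse` and `mae` functions and the sample pairs are made up for the example and are not part of `surprise`.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimate) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimate) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Made-up (true rating, estimated rating) pairs.
pairs = [(3.0, 2.5), (4.0, 4.0), (2.0, 3.0)]
print("RMSE:", rmse(pairs))
print("MAE: ", mae(pairs))
```

RMSE squares each error before averaging, so it penalizes large errors more heavily than MAE and is never smaller than MAE on the same predictions. This matches the table, where RMSE is higher than MAE in every fold.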

# Factorization Machines (FM) with FastFM

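A factorization machine combines linear regression with matrix factorization: it has a global bias and one weight per feature, plus pairwise feature interactions modeled through inner products of *k*-dimensional factor vectors.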

\begin{align}
\hat{y}(\mathbf{x}) = w_{0} + \sum_{i=1}^{n} w_{i} x_{i} + \frac{1}{2} \sum_{f=1}^{k} \left( \left( \sum_{i=1}^{n} v_{i,f} x_{i} \right)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_{i}^2 \right)
\end{align}
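
The last term of the equation above is a rearrangement of the pairwise interaction sum $\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j$ that can be evaluated in $O(kn)$ instead of $O(kn^2)$. The sketch below checks numerically that both forms give the same prediction. It uses plain Python lists and random parameters; `fm_naive` and `fm_fast` are hypothetical names for this illustration, not fastFM functions.

```python
import itertools
import random

def fm_naive(x, w0, w, V):
    """O(k * n^2): sum <v_i, v_j> * x_i * x_j over all pairs i < j."""
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for i, j in itertools.combinations(range(n), 2):
        dot = sum(V[i][f] * V[j][f] for f in range(k))
        pairwise += dot * x[i] * x[j]
    return linear + pairwise

def fm_fast(x, w0, w, V):
    """O(k * n): the reformulated equation."""
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))          # (sum v_if x_i)
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))  # sum v_if^2 x_i^2
        pairwise += s * s - s_sq
    return linear + 0.5 * pairwise

random.seed(0)
n, k = 6, 3
x = [random.random() for _ in range(n)]
w0 = random.random()
w = [random.random() for _ in range(n)]
V = [[random.random() for _ in range(k)] for _ in range(n)]

print(fm_naive(x, w0, w, V), fm_fast(x, w0, w, V))
```

The two printed values agree up to floating-point rounding.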

## Building fastFM from github source

fastFM currently does not support Python 3.8+, so building from source is required for newer versions of Python. Let's clone and build it!

In [11]:
%%time

!apt-get install python-dev libopenblas-dev
!git clone --recursive http://github.com/ibayer/fastFM.git
!pip install -r fastFM/requirements.txt
!cd fastFM && make
#!cd fastFM && make TARGET=NEHALEM
!cd fastFM && pip install .

# For Python 3.7 or older, you could instead simply run the following command.
#!pip install fastFM

Reading package lists... Done
Building dependency tree       
Reading state information... Done
python-dev is already the newest version (2.7.15~rc1-1).
libopenblas-dev is already the newest version (0.2.20+ds-4).
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.
Cloning into 'fastFM'...
remote: Enumerating objects: 1894, done.
remote: Counting objects: 100% (26/26), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 1894 (delta 12), reused 20 (delta 9), pack-reused 1868
Receiving objects: 100% (1894/1894), 4.77 MiB | 26.98 MiB/s, done.
Resolving deltas: 100% (1202/1202), done.
Submodule 'fastFM-core' (https://github.com/ibayer/fastFM-core.git) registered for path 'fastFM-core'
Cloning into '/content/fastFM/fastFM-core'...
remote: Enumerating objects: 520, done.        
remote: Total 520 (delta 0), reused 0 (delta 0), pack-reused 520        
Receiving objects: 100% (520/520), 136.42 KiB | 6.50 MiB/s, done.
Resolving deltas: 100% (330/330), do

## Wrapping FastFM in surprise

The `fm_dataset` method transforms the dataset into the following format.

|               | u<sub>1</sub> | u<sub>2</sub> | u<sub>3</sub> | i<sub>1</sub> | i<sub>2</sub> | i<sub>3</sub> | y |
|---------------|---|---|---|---|---|---|---|
| x<sub>1</sub> | 0 | 1 | 0 | 0 | 1 | 0 | 2 |
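| x<sub>2</sub> | 1 | 0 | 0 | 0 | 0 | 1 | 4 |
| x<sub>3</sub> | 0 | 0 | 1 | 1 | 0 | 0 | 3 |
| x<sub>4</sub> | 0 | 0 | 1 | 0 | 0 | 1 | 4 |

where each row *x* is one rating, the *u* columns one-hot encode the user, the *i* columns one-hot encode the item, and *y* is the rating value. Every row therefore has exactly one user column and one item column set to 1.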
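
As a small illustration of this encoding, the sketch below builds the feature rows as plain Python lists. `one_hot_rows` is a hypothetical helper and the input is a single made-up rating. It places the item block directly after the user block, whereas the wrapper below indexes items from the last column backwards; the two layouts differ only in column order.

```python
def one_hot_rows(ratings, n_users, n_items):
    """Encode (user_index, item_index, rating) triples as FM rows.

    Indices are zero-based. Returns (X, y), where each row of X has
    exactly two entries set to 1: one user column and one item column.
    """
    X, y = [], []
    for u, i, r in ratings:
        row = [0] * (n_users + n_items)
        row[u] = 1              # user block: columns 0 .. n_users - 1
        row[n_users + i] = 1    # item block: columns n_users .. end
        X.append(row)
        y.append(r)
    return X, y

# User u2 (index 1) gives item i2 (index 1) a rating of 2,
# which reproduces the first row of the table.
X, y = one_hot_rows([(1, 1, 2.0)], n_users=3, n_items=3)
print(X[0], y[0])
```

With many users and items almost every entry is 0, which is why the wrapper stores the matrix in a sparse format.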

In [29]:
from surprise import AlgoBase
from surprise.prediction_algorithms.predictions import PredictionImpossible

from scipy.sparse import lil_matrix
from fastFM import als
import numpy as np

class FactorizationMachine(AlgoBase):
    """Wrap fastFM's ALS regression model as a surprise algorithm."""

    def __init__(self, task='regression', n_factors=10, n_epochs=10,
                 lr=0.1, reg_coef=0.01,
                 reg_factors=0.01, random_state=1234, verbose=False):

        self.n_factors = n_factors
        self.n_epochs = n_epochs
        # Stored for reference only: the model below uses fixed settings
        # and ignores lr, reg_coef, reg_factors, random_state and verbose.
        self.lr = lr
        self.reg_coef = reg_coef
        self.reg_factors = reg_factors
        self.random_state = random_state
        self.verbose = verbose

        if task == 'regression':
            self.model = als.FMRegression(n_iter=self.n_epochs, init_stdev=0.1,
                                          rank=self.n_factors,
                                          l2_reg_w=0.1, l2_reg_V=0.5)
        else:
            # Without this, self.model would be undefined and fit() would fail later.
            raise ValueError("Only task='regression' is supported.")

        AlgoBase.__init__(self)

    def fm_dataset(self, trainset):
        # One column per user plus one column per item.
        self.n_features = trainset.n_users + trainset.n_items
        x = lil_matrix((trainset.n_ratings, self.n_features))
        y = np.zeros(trainset.n_ratings)
        for index, (u, i, r) in enumerate(trainset.all_ratings()):
            # Users are indexed from the first column, items from the last.
            x[index, u] = x[index, -1 - i] = 1
            y[index] = r
        # fastFM is happy with csc_matrix
        return x.asformat('csc'), y

    def fit(self, trainset):
        AlgoBase.fit(self, trainset)

        X_train, y_train = self.fm_dataset(trainset)
        self.model.fit(X_train, y_train)

        return self

    def estimate(self, u, i):
        # u and i are inner ids; unknown users or items cannot be encoded.
        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')

        # Build a one-row feature matrix with the same layout as fm_dataset.
        x = lil_matrix((1, self.n_features))
        x[0, u] = x[0, -1 - i] = 1

        # The prediction is rounded to the nearest integer rating.
        return round(self.model.predict(x.asformat('csc'))[0])

## Train the model

In [30]:
%%time
algo = FactorizationMachine()
algo.fit(trainset)

CPU times: user 25.3 s, sys: 385 ms, total: 25.7 s
Wall time: 25.7 s


## Let's evaluate the FastFM wrapper

In [31]:
%%time
from surprise import accuracy

predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.4452
CPU times: user 4min 25s, sys: 3.39 s, total: 4min 28s
Wall time: 4min 29s


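As you can see, the FM wrapper does not improve on the baseline method with these settings: its RMSE of 0.4452 is slightly higher than the baseline's 0.4387, and prediction takes much longer.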

### With 5-fold cross-validation

In [None]:
%%time
from surprise.model_selection import cross_validate
cross_validate(algo, data, verbose=True)