<a href="https://colab.research.google.com/github/hkbu-kennycheng/comp7240/blob/main/lab3_matrix_factorization_based_methods_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3: matrix factorization based methods I

In this lab, we experiment with the Singular Value Decomposition (SVD) and Probabilistic Matrix Factorization (PMF) algorithms. Let's start by installing the `surprise` library.

In [76]:
!pip install surprise



# Dataset: [Amazon Review Data](https://nijianmo.github.io/amazon/index.html)

Let's take a look at today's dataset.

![](https://url2img-web.herokuapp.com/aHR0cHM6Ly9uaWppYW5tby5naXRodWIuaW8vYW1hem9uL2luZGV4Lmh0bWwjbWFpbg==)

The whole dataset is quite big, but there is a small sample with only Digital Music product reviews in JSON format (one review per line). Let's download it with the `curl` command and decompress it with `zcat`.



In [77]:
!curl http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Digital_Music_5.json.gz | zcat > data.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18.5M  100 18.5M    0     0  16.0M      0  0:00:01  0:00:01 --:--:-- 16.0M
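
If `zcat` is not available (for example on Windows), the compressed file can also be read directly from Python. The following is a minimal sketch using only the standard library; the helper name `iter_reviews` and the file name `data.json.gz` are assumptions for illustration and are not part of the lab.

```python
import gzip
import json

def iter_reviews(path):
    """Yield [reviewerID, asin, overall] for each JSON line in a gzipped file."""
    # 'rt' opens the compressed stream as text, so each line is one JSON record
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            r = json.loads(line)
            yield [r['reviewerID'], r['asin'], r['overall']]
```

Assuming the archive was saved as `data.json.gz`, `reviews = list(iter_reviews('data.json.gz'))` would build the same kind of list as the loading cell in this lab.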


We can use the `json` library to load the file into a list, keeping only `reviewerID`, `asin` and `overall`.
- `reviewerID` is the user ID in string format.
- `asin` is a unique identifier of a particular product.
- `overall` is the rating given to the product by the user.

In [78]:
import json

reviews = []
with open('data.json', 'r') as f:
    # the file holds one JSON object per line
    for line in f:
        r = json.loads(line)
        # keep only the user ID, the product ID and the rating
        reviews.append([r['reviewerID'], r['asin'], r['overall']])

In [79]:
reviews[0]

['A2TYZ821XXK2YZ', '3426958910', 5.0]

Since `reviewerID` and `asin` are strings, we convert them to numeric IDs before passing them to the algorithm.

In [80]:
import pandas as pd

df = pd.DataFrame(reviews)

# build a dictionary of reviewerID to numeric ID by index
reviewerIDs = df[0].unique()
reviewerIDdict = dict(zip(reviewerIDs, range(len(reviewerIDs))))

# build a dictionary of asin to numeric ID by index
asins = df[1].unique()
asinDict = dict(zip(asins, range(len(asins))))

# replace reviewerID and asin with their numeric IDs using the dictionaries
# (map does one dictionary lookup per row and gives the same result as replace here)
df[0] = df[0].map(reviewerIDdict)
df[1] = df[1].map(asinDict)

df

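| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0 | 0 | 5.0 |
| 1 | 1 | 0 | 5.0 |
| 2 | 2 | 0 | 5.0 |
| 3 | 3 | 0 | 4.0 |
| 4 | 4 | 1 | 5.0 |
| ... | ... | ... | ... |
| 169776 | 2598 | 9682 | 5.0 |
| 169777 | 8493 | 9682 | 5.0 |
| 169778 | 15667 | 9682 | 5.0 |
| 169779 | 16411 | 9682 | 5.0 |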
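
The dictionary-based encoding above assigns each distinct ID the index of its first appearance. The sketch below shows the same idea in plain Python on a few made-up records (the IDs are hypothetical, not taken from the dataset) and also keeps an inverse mapping, so that numeric IDs can be translated back to the original strings. The helper name `encode_ids` is an assumption.

```python
def encode_ids(values):
    """Map each distinct value to an integer in order of first appearance."""
    mapping = {}
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping)
    return mapping

# hypothetical (reviewerID, asin, rating) records, for illustration only
toy_reviews = [
    ['userA', 'item1', 5.0],
    ['userB', 'item1', 4.0],
    ['userA', 'item2', 3.0],
]

user_to_idx = encode_ids(r[0] for r in toy_reviews)
item_to_idx = encode_ids(r[1] for r in toy_reviews)

# replace the string IDs with their numeric IDs
encoded = [[user_to_idx[u], item_to_idx[i], rating] for u, i, rating in toy_reviews]

# inverse mapping: numeric ID back to the original string
idx_to_user = {idx: u for u, idx in user_to_idx.items()}
```

Keeping the inverse mapping around is useful later, when a predicted rating for a numeric user or item ID has to be reported in terms of the original `reviewerID` or `asin`.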


After converting to numeric values, we can wrap the data frame as a `Dataset` using `load_from_df`, which expects the columns in the order user, item, rating.

In [81]:
from surprise import Dataset
from surprise import Reader

# build the reader object by specifying rating scale
reader = Reader(rating_scale=(1, 5))

# load the data from the pandas data frame
data = Dataset.load_from_df(df, reader=reader)

Finally, split it into a training set and a test set.

In [82]:
from surprise.model_selection import train_test_split

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)
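
Conceptually, a random split with `test_size=.25` shuffles the ratings and holds out a quarter of them. The following plain-Python sketch illustrates that idea only; it is not Surprise's implementation, and the function name `simple_split` and the fixed seed are assumptions.

```python
import random

def simple_split(ratings, test_size=0.25, seed=0):
    """Shuffle a copy of the ratings and hold out a fraction as the test set."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # the first n_test shuffled ratings become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

# 20 made-up (user, item, rating) triples, for illustration only
toy = [(u, i, 5.0) for u in range(4) for i in range(5)]
train, test = simple_split(toy)
```

In the lab itself, passing `random_state` to Surprise's `train_test_split` plays the role of the seed here and makes the split reproducible between runs.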

# Singular Value Decomposition (SVD)
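
As a reminder of what is being fitted: according to the Surprise documentation, `SVD` predicts a rating as the global mean plus a user bias, an item bias and the dot product of the item and user latent factors. The symbols below follow that documentation ($\mu$ is the global mean, $b_u$ and $b_i$ are the biases, $p_u$ and $q_i$ are the factor vectors).

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^\top p_u
$$

The parameters are learned by stochastic gradient descent on the regularized squared error over the training ratings:

$$
\sum_{r_{ui} \in R_{\text{train}}} \left(r_{ui} - \hat{r}_{ui}\right)^2 + \lambda \left(b_i^2 + b_u^2 + \lVert q_i \rVert^2 + \lVert p_u \rVert^2\right)
$$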


In [83]:
from surprise import SVD
from surprise import accuracy

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE: 0.5676


0.567603021219035
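
`accuracy.rmse` reports the root mean squared error between the true and the estimated ratings. A minimal sketch of that formula in plain Python, on made-up (true, estimated) pairs; the function name `rmse` and the toy values are assumptions for illustration.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# made-up ratings, for illustration only
toy_pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 2.0), (4.0, 4.0)]
print(rmse(toy_pairs))
```

With Surprise, each element of `predictions` carries the true rating in `r_ui` and the estimate in `est`, so the same function could be applied to `[(p.r_ui, p.est) for p in predictions]`.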

## Evaluation

In [84]:
from surprise.model_selection import cross_validate

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE', 'FCP'], cv=5, verbose=True)

Evaluating RMSE, MAE, FCP of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.5521  0.5653  0.5631  0.5711  0.5610  0.5625  0.0062  
MAE (testset)     0.3325  0.3361  0.3359  0.3372  0.3358  0.3355  0.0016  
FCP (testset)     0.6367  0.6140  0.6208  0.6204  0.6083  0.6200  0.0095  
Fit time          9.31    9.16    9.26    9.06    9.05    9.17    0.10    
Test time         0.38    0.38    0.37    0.59    0.35    0.41    0.09    


{'fit_time': (9.310689210891724,
  9.160857677459717,
  9.262882471084595,
  9.061443567276001,
  9.052412033081055),
 'test_fcp': array([0.63674277, 0.61402927, 0.62084638, 0.62035843, 0.60826246]),
 'test_mae': array([0.33247687, 0.33605355, 0.33591178, 0.337153  , 0.3358271 ]),
 'test_rmse': array([0.5520808 , 0.56529935, 0.56311369, 0.57109399, 0.5610484 ]),
 'test_time': (0.3773348331451416,
  0.3759145736694336,
  0.3693201541900635,
  0.5929265022277832,
  0.34999585151672363)}
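
With `cv=5`, the ratings are partitioned into five folds, and each fold serves once as the test set while the other four are used for training; the `Mean` and `Std` columns summarize the five runs. The sketch below shows the partitioning idea on indices in plain Python. It is an illustration, not Surprise's `KFold` implementation, and `kfold_indices` is a hypothetical helper.

```python
import random

def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every index is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_splits):
        # every n_splits-th shuffled index, starting at offset k, forms fold k
        test_idx = indices[k::n_splits]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

folds = list(kfold_indices(10, n_splits=5))
```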

# Probabilistic Matrix Factorization (PMF)

In [None]:
# Setting biased=False makes the algorithm equivalent to PMF.
algo = SVD(biased=False)

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)
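
With `biased=False`, the global mean $\mu$ and the bias terms $b_u$ and $b_i$ that the default model adds are dropped. Following the Surprise documentation, the prediction then reduces to the dot product of the item factors $q_i$ and the user factors $p_u$, which is the form the documentation describes as equivalent to PMF:

$$
\hat{r}_{ui} = q_i^\top p_u
$$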

## Evaluation

In [86]:
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE', 'FCP'], cv=5, verbose=True)

Evaluating RMSE, MAE, FCP of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    2.9344  2.8770  2.9019  2.8985  2.8868  2.8997  0.0195  
MAE (testset)     2.5828  2.5208  2.5463  2.5431  2.5308  2.5448  0.0211  
FCP (testset)     0.6380  0.5903  0.5982  0.5911  0.5888  0.6013  0.0186  
Fit time          9.02    9.11    9.15    9.00    9.14    9.08    0.06    
Test time         0.30    0.31    0.30    0.53    0.32    0.35    0.09    


{'fit_time': (9.024967193603516,
  9.106263637542725,
  9.14866018295288,
  8.995039224624634,
  9.138448476791382),
 'test_fcp': array([0.63798342, 0.59031328, 0.5981757 , 0.59114249, 0.5888278 ]),
 'test_mae': array([2.58278798, 2.52084052, 2.54625864, 2.54309248, 2.53080891]),
 'test_rmse': array([2.93440061, 2.87699287, 2.90190824, 2.89851383, 2.88678293]),
 'test_time': (0.3035256862640381,
  0.3144717216491699,
  0.3042325973510742,
  0.5273528099060059,
  0.31767892837524414)}