<a href="https://colab.research.google.com/github/hkbu-kennycheng/comp7240/blob/main/lab3_matrix_factorization_based_methods_I.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 3: matrix factorization based methods I

In this lab, we will experiment with the Singular Value Decomposition (SVD) and Probabilistic Matrix Factorization (PMF) algorithms. Let's start by installing the `surprise` library.

In [54]:
!pip install surprise



# Dataset: [Amazon Review Data](https://nijianmo.github.io/amazon/index.html)

Let's take a look at today's dataset.

![](https://url2img-web.herokuapp.com/aHR0cHM6Ly9uaWppYW5tby5naXRodWIuaW8vYW1hem9uL2luZGV4Lmh0bWwjbWFpbg==)

The whole dataset is quite big, but there is a smaller sample containing only Digital Music product reviews in JSON format. Let's download it with `curl` and decompress it with `zcat`.



In [55]:
!curl http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Digital_Music_5.json.gz | zcat > data.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 18.5M  100 18.5M    0     0  15.8M      0  0:00:01  0:00:01 --:--:-- 15.8M


We can use the `json` library to load the file into a list, keeping only `reviewerID`, `asin` and `overall` from each review.
- `reviewerID` is the user ID in string format.
- `asin` is a unique identifier of a particular product.
- `overall` is the rating the user gave to the product.

In [56]:
import json

# collect [reviewerID, asin, overall] triples, one per review
reviews = []
with open('data.json', 'r') as f:
    # the file holds one JSON object per line
    for line in f:
        r = json.loads(line)
        reviews.append([r['reviewerID'], r['asin'], r['overall']])

In [57]:
reviews[0]

['A2TYZ821XXK2YZ', '3426958910', 5.0]

Since `reviewerID` and `asin` are strings, we convert them to numeric IDs before passing the data to the algorithm.

In [None]:
import pandas as pd

df = pd.DataFrame(reviews)

# build a dictionary of reviewerID to numeric ID by index
reviewerIDs = df[0].unique()
reviewerIDdict = dict(zip(reviewerIDs, range(len(reviewerIDs))))

# build a dictionary of asin to numeric ID by index
asins = df[1].unique()
asinDict = dict(zip(asins, range(len(asins))))

# map reviewerID and asin to numeric values using the dictionaries
# (Series.map is much faster than replace for large lookup tables)
df[0] = df[0].map(reviewerIDdict)
df[1] = df[1].map(asinDict)

df
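
To make the mapping concrete, here is a minimal sketch on a few made-up rows. The helper `encode_ids` and the toy IDs are hypothetical names used only for illustration; they are not part of pandas or Surprise. The helper numbers each distinct value in order of first appearance, which is the same numbering the dictionaries above produce.

```python
def encode_ids(values):
    """Map each distinct value to an integer, numbered by first appearance."""
    mapping = {}
    encoded = []
    for v in values:
        # setdefault assigns the next free integer the first time v is seen
        encoded.append(mapping.setdefault(v, len(mapping)))
    return encoded, mapping

# Toy rows in the same [reviewerID, asin, overall] layout (values are made up)
toy_reviews = [
    ['userA', 'item1', 5.0],
    ['userB', 'item1', 4.0],
    ['userA', 'item2', 3.0],
]

user_codes, user_map = encode_ids(r[0] for r in toy_reviews)
item_codes, item_map = encode_ids(r[1] for r in toy_reviews)
print(user_codes)  # [0, 1, 0]
print(item_codes)  # [0, 0, 1]
```

Keeping the mapping dictionaries around is useful later, since they let you translate numeric IDs in predictions back to the original `reviewerID` and `asin` strings.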

After converting the IDs to numeric values, we can wrap the data frame as a `Dataset` using `load_from_df`.

In [None]:
from surprise import Dataset
from surprise import Reader

# build the reader object by specifying rating scale
reader = Reader(rating_scale=(1, 5))

# load the data from the pandas data frame (columns must be ordered user, item, rating)
data = Dataset.load_from_df(df, reader=reader)

Finally, split it into a training set and a test set.

In [None]:
from surprise.model_selection import train_test_split

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)
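
As a rough picture of what a random hold-out split does, the sketch below shuffles a list of ratings and sets a fraction aside. It is a simplified illustration and not Surprise's implementation; `simple_split`, its `seed` parameter, and the toy ratings are assumptions made for this example.

```python
import random

def simple_split(ratings, test_size=0.25, seed=0):
    """Shuffle a copy of the ratings and hold out a fraction for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# 20 made-up (user, item, rating) triples
toy_ratings = [(u, i, 3.0) for u in range(4) for i in range(5)]
train, test = simple_split(toy_ratings)
print(len(train), len(test))  # 15 5
```

Surprise's own `train_test_split` also accepts a `random_state` argument if you want the real split to be reproducible between runs.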

# Singular Value Decomposition (SVD)


In [None]:
from surprise import SVD
from surprise import accuracy

# We'll use the famous SVD algorithm.
algo = SVD()

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)
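
It helps to know what `SVD` does under the hood. According to the Surprise documentation, it is not a classical matrix SVD: it learns a biased factor model, predicting a rating as the global mean plus a user bias, an item bias, and the dot product of a user and an item factor vector, trained by stochastic gradient descent. The sketch below is a simplified pure-Python illustration of that idea, not Surprise's code; `train_biased_mf`, the toy ratings, and the hyperparameter values are assumptions chosen for this example.

```python
import math
import random

def train_biased_mf(ratings, n_users, n_items, n_factors=2,
                    n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """SGD on regularized squared error for r_hat = mu + b_u + b_i + p_u . q_i."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = [0.0] * n_users                               # user biases
    bi = [0.0] * n_items                               # item biases
    # small random factor vectors for users (p) and items (q)
    p = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

    def predict(u, i):
        dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
        return mu + bu[u] + bi[i] + dot

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # gradient steps with L2 regularization
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

def rmse(predict, ratings):
    """Root mean squared error of predict over (user, item, rating) triples."""
    return math.sqrt(sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings))

# Made-up ratings: (user, item, rating)
toy = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
predict = train_biased_mf(toy, n_users=3, n_items=3)
print(round(rmse(predict, toy), 3))  # training RMSE on the toy data
```

Note that this prints the error on the training ratings, which is optimistic; the `accuracy.rmse(predictions)` call above measures error on the held-out test set instead.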

## Evaluation

In [None]:
from surprise.model_selection import cross_validate

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE', 'FCP'], cv=5, verbose=True)
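
For intuition about what is being reported, the sketch below shows a simple way to form k folds and how RMSE and MAE are computed from (true, estimated) rating pairs. It is an illustration under assumed names (`kfold`, `rmse`, `mae`, and the toy numbers are made up for this example) and does not reproduce Surprise's fold assignment. FCP, the third measure, stands for Fraction of Concordant Pairs and is not covered here.

```python
import math
import random

def kfold(items, k=5, seed=0):
    """Yield (train, test) lists; each item lands in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    for fold in range(k):
        test = shuffled[fold::k]  # every k-th item, starting at offset `fold`
        train = [x for j, x in enumerate(shuffled) if j % k != fold]
        yield train, test

def rmse(pairs):
    """Root mean squared error over (true, estimated) pairs."""
    return math.sqrt(sum((t - e) ** 2 for t, e in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, estimated) pairs."""
    return sum(abs(t - e) for t, e in pairs) / len(pairs)

# Made-up (true rating, estimated rating) pairs
pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0)]
print(rmse(pairs))  # sqrt((1 + 0 + 4) / 3)
print(mae(pairs))   # (1 + 0 + 2) / 3

# Each of the 10 toy items appears in exactly one of the 5 test folds
for train, test in kfold(range(10), k=5):
    print(len(train), len(test))
```

RMSE penalizes large errors more heavily than MAE because the errors are squared before averaging, which is why the two numbers can rank models differently.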

# Probabilistic Matrix Factorization (PMF)

In [None]:
# Setting biased=False drops the bias terms, which makes SVD equivalent to PMF.
algo = SVD(biased=False)

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)
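
The effect of the `biased` flag is easiest to see on a single prediction. Per the Surprise documentation, the biased model predicts the global mean plus user and item biases plus the dot product of the factor vectors, while the unbiased (PMF) variant uses the dot product alone. The sketch below is a hypothetical helper, `predict_rating`, with made-up numbers; it is not part of Surprise.

```python
def predict_rating(p_u, q_i, mu=0.0, b_u=0.0, b_i=0.0, biased=True):
    """Predict one rating from user factors p_u and item factors q_i."""
    dot = sum(pf * qf for pf, qf in zip(p_u, q_i))
    if biased:
        # biased model: global mean + user bias + item bias + interaction
        return mu + b_u + b_i + dot
    # unbiased (PMF-style) model: interaction term only
    return dot

# Made-up parameters for one user and one item
p_u = [0.5, 0.2]
q_i = [0.4, 1.0]

with_bias = predict_rating(p_u, q_i, mu=3.5, b_u=0.2, b_i=-0.1, biased=True)
without_bias = predict_rating(p_u, q_i, mu=3.5, b_u=0.2, b_i=-0.1, biased=False)
print(with_bias)     # approximately 4.0
print(without_bias)  # approximately 0.4
```

With the same factor vectors the two predictions differ a lot, because in the unbiased model the factors alone have to account for the overall rating level. In practice the two variants are trained separately and learn different factors.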

## Evaluation

In [None]:
# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE', 'MAE', 'FCP'], cv=5, verbose=True)