# Music rating prediction

In [1]:
import json
from tqdm.notebook import tqdm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline

DATA_JSON = "Digital_Music.json"

with open(DATA_JSON, 'r') as infile:
    entries = [json.loads(entry) for entry in tqdm(infile.read().strip().split('\n'))]

raw = pd.DataFrame(entries)

raw_df = raw.sort_values('unixReviewTime').drop_duplicates(['reviewerID', 'asin'], keep='last')
raw_df = raw_df[['reviewerID', 'asin', 'overall', 'unixReviewTime']]

  0%|          | 0/1584082 [00:00<?, ?it/s]

## Data preprocessing

1. Remove "vague" elements to improve collaborative filtering. A "vague" element is a user or an item that took part in too few interactions, i.e. has too few corresponding non-zero entries in the utility matrix.
2. Split the data into train and test "by user" as suggested in [the book](https://www.manning.com/books/practical-recommender-systems), section 9.8.1:
> The last option we’ll look at doesn’t divide the users between test and training sets.
Instead, you’ll divide each user’s ratings between a training set and a test set. The ratings will be divided by taking the first n ratings in the training set and the rest in the
testing set.

*Notes*
* We sacrifice a significant share of the items and users, namely those with too few ratings.
* The split relies on timestamps. In the train split each user keeps their first $n$ reviews. The test split holds all of the user's subsequent reviews, i.e. the $(n+1)$-th, $(n+2)$-th, $\ldots$. Users with at most $n$ reviews are present only in the train split.
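
The `preprocessing` module itself is not shown in this notebook, so the block below is only a sketch of what the two steps could look like, not the actual implementation. The function names with the `_sketch` suffix, the default column names, and the toy data are assumptions made for illustration. The pruning has to be repeated because dropping a sparse item can push a user below the threshold, and vice versa.

```python
import pandas as pd


def drop_vague_elements_sketch(df, min_ratings, user_col="reviewerID", item_col="asin"):
    """Hypothetical helper: repeatedly drop users and items with < min_ratings ratings."""
    while True:
        # Number of ratings of the user / item that each row belongs to.
        user_counts = df[user_col].map(df[user_col].value_counts())
        item_counts = df[item_col].map(df[item_col].value_counts())
        keep = (user_counts >= min_ratings) & (item_counts >= min_ratings)
        if keep.all():
            return df
        df = df[keep]


def split_by_user_sketch(df, train_ratings_num, user_col="reviewerID",
                         time_col="unixReviewTime"):
    """Hypothetical helper: first n ratings of each user go to train, the rest to test."""
    ordered = df.sort_values(time_col, kind="stable")
    # Position of every rating within its user's chronological history (0, 1, 2, ...).
    rank = ordered.groupby(user_col).cumcount()
    return ordered[rank < train_ratings_num], ordered[rank >= train_ratings_num]


# Toy data: removing item i4 makes user u3 vague, which in turn makes item i3 vague.
toy = pd.DataFrame({
    "reviewerID": ["u1", "u1", "u1", "u2", "u2", "u3", "u3"],
    "asin": ["i1", "i2", "i3", "i1", "i2", "i3", "i4"],
    "overall": [5, 4, 3, 4, 5, 2, 1],
    "unixReviewTime": [1, 2, 3, 1, 2, 5, 6],
})

pruned = drop_vague_elements_sketch(toy, min_ratings=2)
train_part, test_part = split_by_user_sketch(pruned, train_ratings_num=1)
```

On the toy data the pruning needs three passes before it stabilises, which mirrors the multi-iteration log printed by the real `drop_vague_elements` call.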

In [2]:
from preprocessing import drop_vague_elements, split_by_user

# we could use a less restrictive value of min_ratings, e.g. 2 or 3,
# but that would leave too much data for the KNN methods,
# which try to build a |U|x|U| similarity matrix
df = drop_vague_elements(raw_df, min_ratings=5)

train_df, val_test_df = split_by_user(df, train_ratings_num=7)

mask = np.random.rand(len(val_test_df)) < 0.5
val_df = val_test_df[mask]
test_df = val_test_df[~mask]

print()
total = len(df)
train_share = len(train_df) / total
val_share = len(val_df) / total
test_share = len(test_df) / total

print(f"train/val/test ratio is {train_share:.0%}/{val_share:.0%}/{test_share:.0%}")

iteration 0
# of vague users: 797597
# of vague items: 192724
iteration 1
# of vague users: 21543
# of vague items: 5038
iteration 2
# of vague users: 2244
# of vague items: 900
iteration 3
# of vague users: 522
# of vague items: 221
iteration 4
# of vague users: 127
# of vague items: 41
iteration 5
# of vague users: 32
# of vague items: 14
iteration 6
# of vague users: 10
# of vague items: 6
iteration 7
# of vague users: 2
# of vague items: 1
iteration 8
# of vague users: 0
# of vague items: 0
what's left:
- 8.1% of ratings
- 2.2% of unique items
- 1.5% of unique users

train/val/test ratio is 63%/19%/19%


In [3]:
from surprise import Dataset
from surprise import Reader

surprise_format = ['reviewerID', 'asin', 'overall']

reader = Reader(rating_scale=(1, 5))
train = Dataset.load_from_df(train_df[surprise_format], reader)
val = Dataset.load_from_df(val_df[surprise_format], reader)
test = Dataset.load_from_df(test_df[surprise_format], reader)

trainset = train.build_full_trainset()
valset = val.build_full_trainset().build_testset()
testset = test.build_full_trainset().build_testset()

## Model selection

In [4]:
from surprise import accuracy
from surprise.model_selection import GridSearchCV
from surprise import SVD, NormalPredictor

### Normal Predictor

#### Idea
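1. Model the distribution of ratings with a Gaussian $\mathcal{N}(\hat{\mu}, \hat{\sigma}^2)$ whose mean and variance are estimated from the training set.
1. Predict each rating by randomly sampling it from $\mathcal{N}$, ignoring both the user and the item.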
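
The two steps above can be sketched with plain NumPy. This is an illustration of the idea rather than Surprise's code: the toy ratings, the fixed seed, and the helper name `predict_random` are assumptions, and clipping the samples to the 1 to 5 rating scale is a choice made here to keep predictions valid.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, only to make the sketch repeatable

# Toy training ratings (illustrative values, not taken from the dataset).
train_ratings = np.array([5, 4, 5, 3, 5, 4, 1, 5], dtype=float)

# Step 1: maximum-likelihood estimates of the Gaussian parameters.
mu_hat = train_ratings.mean()
sigma_hat = train_ratings.std()  # ddof=0, i.e. the MLE of the standard deviation


def predict_random(n, low=1.0, high=5.0):
    """Step 2: sample n ratings from N(mu_hat, sigma_hat^2), clipped to the scale."""
    return np.clip(rng.normal(mu_hat, sigma_hat, size=n), low, high)


preds = predict_random(1000)
```

Because the samples do not depend on the user or the item, this predictor is useful mainly as a lower bound that any real model should beat.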

In [5]:
algo = NormalPredictor()
algo.fit(trainset)

predictions = algo.test(valset)
_ = accuracy.rmse(predictions)

RMSE: 0.9081


### SVD

In [6]:
%%time

param_grid = {
    'n_factors': [10],
    'n_epochs': [40],
    'lr_all': [0.01],
    'reg_all': [0.1],
}
gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=4, n_jobs=-1)
gs.fit(train)

print(gs.best_params['rmse'])

{'n_factors': 10, 'n_epochs': 40, 'lr_all': 0.01, 'reg_all': 0.1}
CPU times: user 1.14 s, sys: 525 ms, total: 1.67 s
Wall time: 6.21 s


In [7]:
algo = SVD(n_factors=10, n_epochs=40, lr_all=0.01, reg_all=0.1)
algo.fit(trainset)

predictions = algo.test(valset)
_ = accuracy.rmse(predictions)

RMSE: 0.6039


### KNN algorithms

All KNN algorithms boil down to taking a weighted average of ratings. To dive into the details, let's first introduce some notation.

$N_u^k(i)$ - the $k$ nearest neighbours of item $i$ among the items rated by user $u$

$N_i^k(u)$ - the $k$ nearest neighbours of user $u$ among the users who rated item $i$

Now we can write down how the KNN approach predicts a rating in its simplest form (`KNNBasic`):
$$r_{ui} = \frac{\sum_{v \in N_i^k(u)} \textrm{sim}(u, v) \cdot r_{vi}}{\sum_{v \in N_i^k(u)} \textrm{sim}(u, v)}$$

$$\textrm{or alternatively}$$

$$r_{ui} = \frac{\sum_{j \in N_u^k(i)} \textrm{sim}(i, j) \cdot r_{uj}}{\sum_{j \in N_u^k(i)} \textrm{sim}(i, j)}$$

Here $\textrm{sim}(u, v)$ measures how similar the utility matrix rows of users $u$ and $v$ are. Likewise, $\textrm{sim}(i, j)$ measures how similar the columns of items $i$ and $j$ are.
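
As a worked instance of the user-based formula, here is a minimal NumPy sketch on a toy utility matrix. The matrix, the helper names `cosine_sim` and `predict_user_based`, and the choice of cosine similarity over co-rated items are assumptions for illustration; it is not how Surprise is implemented internally.

```python
import numpy as np

# Toy utility matrix: rows are users, columns are items, 0 means "not rated".
R = np.array([
    [5, 4, 0],
    [4, 5, 2],
    [1, 2, 5],
], dtype=float)


def cosine_sim(a, b):
    """Cosine similarity of two users, computed over the items both have rated."""
    both = (a > 0) & (b > 0)
    if not both.any():
        return 0.0
    return float(a[both] @ b[both] / (np.linalg.norm(a[both]) * np.linalg.norm(b[both])))


def predict_user_based(R, u, i, k=2):
    """Similarity-weighted average of the ratings that u's k nearest neighbours gave to i."""
    candidates = [
        (cosine_sim(R[u], R[v]), R[v, i])
        for v in range(R.shape[0])
        if v != u and R[v, i] > 0  # only users who actually rated item i
    ]
    neighbours = sorted(candidates, key=lambda pair: pair[0], reverse=True)[:k]
    sims = np.array([s for s, _ in neighbours])
    ratings = np.array([r for _, r in neighbours])
    if sims.sum() <= 0:
        return None  # no usable neighbours
    return float(sims @ ratings / sims.sum())


pred = predict_user_based(R, u=0, i=2)
```

User 0 is more similar to user 1 (who rated the item 2) than to user 2 (who rated it 5), so the prediction lands slightly below the plain average of 3.5.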

Further modifications of `KNNBasic` - `KNNWithMeans` and `KNNWithZScore` - address the problem of "inherently kind" and "inherently angry" reviewers by subtracting means from the ratings being weighted. In addition, `KNNWithZScore` takes into account the variance of a user's (or an item's) ratings.
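
For reference, the user-based versions of these two variants can be written in the notation above, where $\mu_u$ and $\sigma_u$ denote the mean and the standard deviation of the ratings given by user $u$ (these two symbols are introduced here and do not appear elsewhere in the notebook). `KNNWithMeans`:

$$r_{ui} = \mu_u + \frac{\sum_{v \in N_i^k(u)} \textrm{sim}(u, v) \cdot (r_{vi} - \mu_v)}{\sum_{v \in N_i^k(u)} \textrm{sim}(u, v)}$$

`KNNWithZScore`:

$$r_{ui} = \mu_u + \sigma_u \frac{\sum_{v \in N_i^k(u)} \textrm{sim}(u, v) \cdot (r_{vi} - \mu_v) / \sigma_v}{\sum_{v \in N_i^k(u)} \textrm{sim}(u, v)}$$

The item-based versions follow by swapping the roles of users and items, exactly as in the two `KNNBasic` formulas.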

Finally, instead of subtracting means for centering, one may subtract the baseline estimates described in [a paper by Yehuda Koren](https://courses.ischool.berkeley.edu/i290-dm/s11/SECURE/a1-koren.pdf). This is implemented in `KNNBaseline`. **NB**: it is advised to use the `pearson_baseline` similarity with `KNNBaseline`.
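
To make "baseline" concrete: the baseline estimate for a pair $(u, i)$ has the form $b_{ui} = \mu + b_u + b_i$, where $\mu$ is the global mean rating and $b_u$, $b_i$ are user and item offsets. The block below is a rough sketch of fitting these offsets with stochastic gradient descent on toy data. The function name `fit_baselines_sgd`, the toy ratings, and the hyperparameter values are assumptions, and the loop is a simplified illustration rather than the code Surprise runs for `'method': 'sgd'`.

```python
import numpy as np


def fit_baselines_sgd(ratings, n_users, n_items, lr=0.01, reg=0.01, n_epochs=20):
    """Fit b_ui = mu + b_u + b_i to (user, item, rating) triples with plain SGD."""
    mu = float(np.mean([r for _, _, r in ratings]))
    b_u = np.zeros(n_users)
    b_i = np.zeros(n_items)
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i])
            # Gradient step on the squared error with L2 regularisation.
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
    return mu, b_u, b_i


# Toy data: user 0 is "kind", user 1 is "angry"; item 0 is liked more than item 1.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 2.0), (1, 1, 1.0)]
mu, b_u, b_i = fit_baselines_sgd(ratings, n_users=2, n_items=2, lr=0.05, n_epochs=200)
```

After fitting, the kind user gets a positive offset and the angry user a negative one, which is the effect that mean-centering approximates in `KNNWithMeans`.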

#### Test with default parameters

In [8]:
from surprise import KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore
import time

knn_algos = [KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore]

results = []
for KNNAlgo in tqdm(knn_algos):
    algo = KNNAlgo(verbose=False)
    
    start = time.time()
    algo.fit(trainset)
    elapsed_time = time.time() - start

    predictions = algo.test(valset)
    rmse = accuracy.rmse(predictions, verbose=False)
    
    results.append({
        'algo': KNNAlgo.__name__,
        'rmse': round(rmse, 4),
        'fit_time_sec': round(elapsed_time, 1),
    })

pd.DataFrame(results).set_index('algo').sort_values(['rmse' , 'fit_time_sec'])


| algo | rmse | fit_time_sec |
|---|---|---|
| KNNWithMeans | 0.6238 | 4.5 |
| KNNWithZScore | 0.6265 | 8.6 |
| KNNBaseline | 0.6895 | 4.0 |
| KNNBasic | 0.7787 | 4.2 |


Now let's properly tune `KNNBaseline`.

In [9]:
%%time

param_grid = {
    'bsl_options': {
        'method': ['sgd'], 
        'learning_rate': [0.01], 
        'reg': [0.01], 
        'n_epochs': [20],
    },
    'sim_options': {
        'name': ['pearson'],
        'user_based': [False],
    },
    'verbose': [False]
}

gs = GridSearchCV(KNNBaseline, param_grid, measures=['rmse'])
gs.fit(train)

print(gs.best_params['rmse'])

{'bsl_options': {'method': 'sgd', 'learning_rate': 0.01, 'reg': 0.01, 'n_epochs': 20}, 'sim_options': {'name': 'pearson', 'user_based': False}, 'verbose': False}
CPU times: user 15 s, sys: 7.29 s, total: 22.3 s
Wall time: 22.4 s


In [10]:
algo = KNNBaseline(
    bsl_options={'method': 'sgd', 'learning_rate': 0.01, 'reg': 0.01, 'n_epochs': 20}, 
    sim_options={'name': 'pearson', 'user_based': False}, 
    verbose=False
)
algo.fit(trainset)

predictions = algo.test(valset)
_ = accuracy.rmse(predictions)

RMSE: 0.6055


### Test the best recommender

According to my experiments, SVD and `KNNBaseline` demonstrate comparable performance on the validation split (RMSE 0.6039 vs. 0.6055), with SVD slightly ahead, so SVD is the model evaluated on the test split.

#### Test split

In [11]:
algo = SVD(n_factors=10, n_epochs=40, lr_all=0.01, reg_all=0.1)
algo.fit(trainset)

predictions = algo.test(testset)
_ = accuracy.rmse(predictions)

RMSE: 0.5885
