This file contains test results for five models: Baseline, MF-SGD, MF-ALS, NNMF, and KNN (user-based and item-based).

# 1. Baseline

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pickle
import time

from helpers import load_data
from utils import split_data

from baselines import *
from matrix_factorization import matrix_factorization_sgd, write_sgd_prediction, matrix_factorization_als, write_als_prediction

%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
# process the data and transform it into a sparse matrix
path_dataset = "./data/data_train.csv"
ratings = load_data(path_dataset)

In [67]:
# split the data into train and test sets and store them as pickles
_, train, test = split_data(ratings, 10, verbose=True)
with open('./data/pickle/train.pickle', 'wb') as file:
    pickle.dump(train, file)
with open('./data/pickle/test.pickle', 'wb') as file:
    pickle.dump(test, file)

Shape of original ratings : (10000, 1000)
Shape of valid ratings (and of train and test data) : (9990, 999)
Total number of nonzero elements in original data : 1176952
Total number of nonzero elements in train data : 1068523
Total number of nonzero elements in test data : 108350


In [53]:
# This function converts a sparse matrix into an n*3 array of (row, column, rating), with 1-based indices
# It makes the blending part easier to implement
def toarray(matrix):
    nnz_row, nnz_col = matrix.nonzero()
    rat = matrix[matrix.nonzero()].toarray().reshape((len(nnz_col),1))
    nnz_row += 1
    nnz_col += 1
    return np.concatenate((nnz_row.reshape(len(nnz_col),1), nnz_col.reshape(len(nnz_col),1), rat), axis=1)

test_matrix = toarray(test)
test_matrix

array([[1.00e+00, 8.40e+01, 4.00e+00],
       [1.00e+00, 4.39e+02, 4.00e+00],
       [2.00e+00, 4.80e+01, 4.00e+00],
       ...,
       [9.99e+03, 7.36e+02, 3.00e+00],
       [9.99e+03, 9.06e+02, 4.00e+00],
       [9.99e+03, 9.85e+02, 1.00e+00]])

In [65]:
import pandas as pd

def savepred(a, name):
    # save the predictions as a CSV file called `name` under ./data/
    a = pd.DataFrame(a)
    a.to_csv("./data/" + name)

In this part, we use the global mean, the user mean and the item mean as baseline models and measure their error. These simple models are not expected to perform well.
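The three baselines can be sketched in a few lines. This is a minimal illustration, not the code in `baselines.py`: the helper names `rmse`, `axis_means` and `baseline_rmse` are hypothetical, zeros are assumed to mean "no rating", and whether rows correspond to items or users depends on how the rating matrix is oriented.

```python
import numpy as np
import scipy.sparse as sp


def rmse(predictions, targets):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((predictions - targets) ** 2)))


def axis_means(train, axis, fallback):
    """Mean of the stored ratings per row (axis=1) or per column (axis=0).

    Rows or columns without any rating fall back to `fallback`.
    """
    sums = np.asarray(train.sum(axis=axis)).ravel()
    counts = train.getnnz(axis=axis)
    means = np.full(sums.shape, fallback, dtype=float)
    np.divide(sums, counts, out=means, where=counts > 0)
    return means


def baseline_rmse(train, test):
    """RMSE of the global, row and column mean predictors on the test ratings."""
    train = sp.csr_matrix(train, dtype=float)
    train.eliminate_zeros()  # zeros are treated as missing ratings (assumption)
    test = sp.coo_matrix(test, dtype=float)

    global_mean = train.data.mean()
    row_means = axis_means(train, 1, global_mean)
    col_means = axis_means(train, 0, global_mean)

    targets = test.data
    return {
        "global": rmse(np.full_like(targets, global_mean), targets),
        "row": rmse(row_means[test.row], targets),
        "col": rmse(col_means[test.col], targets),
    }
```

Each predictor is fitted on the training ratings only and evaluated on the nonzero entries of the test matrix, which is the same train/test logic as the timed cells in this section.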

## 1.1 Global Mean

In [33]:
# test the global mean baseline
start_time = time.time()
global_mean_rmse = global_mean_test(ratings, min_num_ratings=10)
print("--- %s seconds ---" % (time.time() - start_time))
print('Global mean RMSE : {}'.format(global_mean_rmse))

--- 25.113199949264526 seconds ---
Global mean RMSE : 1.1183506557779523


In [75]:
# compute a test prediction for blending
blend_GlbMean = np.ones((test_matrix.shape[0], 1)) * global_mean(test)

## 1.2 User Mean

In [34]:
start_time = time.time()
user_mean_rmse = user_mean_test(ratings, min_num_ratings=10)
print("--- %s seconds ---" % (time.time() - start_time))
print('User mean RMSE : {}'.format(user_mean_rmse))

--- 166.68256092071533 seconds ---
User mean RMSE : 1.0289888944873853


In [74]:
# compute a test prediction for blending
blend_UserMean = compute_user_means(test)
# blend_UserMean

## 1.3 Item Mean

In [35]:
start_time = time.time()
item_mean_rmse = item_mean_test(ratings, min_num_ratings=10)
print("--- %s seconds ---" % (time.time() - start_time))
print('Item mean RMSE : {}'.format(item_mean_rmse))

--- 31.795614004135132 seconds ---
Item mean RMSE : 1.0938352842783858


In [73]:
# compute a test prediction for blending
blend_ItemMean = compute_item_means(test)
# blend_ItemMean

# 2. Matrix Factorization

## 2.1 SGD

To tune the hyper-parameters for matrix factorisation with SGD, we use 5-fold cross-validation combined with a grid search. The selected values are an item penalisation coefficient $\lambda_{it} = 0.25$, a user penalisation coefficient $\lambda_{us} = 0.01$, and a number of latent features $k=20$. SGD is run for 50 epochs, after which the change between iterations is small enough to be ignored. The running time of this method is 2067s.
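For reference, the SGD update behind this kind of factorisation can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation in `matrix_factorization.py`: the functions `mf_sgd` and `rmse` are hypothetical, ratings are given as `(item, user, rating)` triples with 0-based indices, and the factors are initialised with small Gaussian noise.

```python
import numpy as np


def mf_sgd(triples, n_items, n_users, num_features=20, gamma=0.012,
           lambda_user=0.01, lambda_item=0.25, num_epochs=50, seed=0):
    """Factorise ratings into item and user factors with regularised SGD."""
    rng = np.random.default_rng(seed)
    item_f = rng.normal(scale=0.1, size=(n_items, num_features))
    user_f = rng.normal(scale=0.1, size=(n_users, num_features))

    for _ in range(num_epochs):
        # Visit the observed ratings in a fresh random order every epoch.
        for idx in rng.permutation(len(triples)):
            i, u, r = triples[idx]
            err = r - item_f[i] @ user_f[u]
            item_old = item_f[i].copy()
            # Gradient step on the squared error plus the L2 penalties.
            item_f[i] += gamma * (err * user_f[u] - lambda_item * item_f[i])
            user_f[u] += gamma * (err * item_old - lambda_user * user_f[u])
    return item_f, user_f


def rmse(triples, item_f, user_f):
    """RMSE of the factorisation on a list of (item, user, rating) triples."""
    errors = [r - item_f[i] @ user_f[u] for i, u, r in triples]
    return float(np.sqrt(np.mean(np.square(errors))))
```

The default arguments mirror the values passed to `matrix_factorization_sgd` in the cell below; everything else is an assumption made for the sketch.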

In [4]:
start_time = time.time()
train_rmse, test_rmse, _, _ = matrix_factorization_sgd(train, test, gamma=0.012, verbose=True, 
                                                       lambda_user=0.01, lambda_item=0.25,
                                                       num_epochs=50, num_features=20)
print("--- %s seconds ---" % (time.time() - start_time))
train_rmse, test_rmse

Learning the matrix factorization using SGD...
Final RMSE on train data: 0.9912955138845942
Final RMSE on test data: 1.0001463113122229.
--- 2067.5607390403748 seconds ---


(0.9912955138845942, 1.0001463113122229)

## 2.2 ALS

As with SGD, matrix factorisation with ALS is tuned by grid search. With the number of latent features fixed at $k=20$, the value found in an initial coarse grid search, most ALS results are better than those of SGD. Its parameters are therefore explored more precisely with a finer grid. The grid search plot shows the most accurate predictions as the brightest area. The best-tuned model has an item penalisation coefficient $\lambda_{it} = 0.575$ and a user penalisation coefficient $\lambda_{us} = 0.014$. The model is trained until the improvement between iterations is negligible ($10^{-6}$). The running time of this method is 1847s.
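The alternating updates can be sketched as below. This is a minimal dense illustration, not the implementation in `matrix_factorization.py`: `als_step` and `mf_als` are hypothetical names, zeros are assumed to mean "no rating", and scaling the penalty by the number of observed ratings per row is an assumption of this sketch.

```python
import numpy as np


def als_step(R, mask, fixed, lam):
    """Solve one ridge regression per row of R while `fixed` stays constant."""
    k = fixed.shape[1]
    out = np.zeros((R.shape[0], k))
    for i in range(R.shape[0]):
        obs = mask[i]                      # columns rated in row i
        F = fixed[obs]                     # factors of those columns
        A = F.T @ F + lam * max(int(obs.sum()), 1) * np.eye(k)
        out[i] = np.linalg.solve(A, F.T @ R[i, obs])
    return out


def mf_als(R, num_features, lambda_user, lambda_item,
           stop_criterion=1e-5, max_iter=100, seed=0):
    """Alternate between item and user updates until the RMSE stops changing."""
    mask = R != 0
    rng = np.random.default_rng(seed)
    user_f = rng.normal(scale=0.1, size=(R.shape[1], num_features))
    item_f = np.zeros((R.shape[0], num_features))
    prev, err = np.inf, np.inf
    for _ in range(max_iter):
        item_f = als_step(R, mask, user_f, lambda_item)
        user_f = als_step(R.T, mask.T, item_f, lambda_user)
        err = float(np.sqrt(np.mean((R - item_f @ user_f.T)[mask] ** 2)))
        if abs(prev - err) < stop_criterion:
            break
        prev = err
    return item_f, user_f, err
```

Unlike SGD, each half-step has a closed-form solution, so there is no learning rate to tune; only the two penalties and the stopping threshold remain.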

In [80]:
start_time = time.time()
train_rmse, test_rmse, _, _ = matrix_factorization_als(train, test, verbose=True, stop_criterion=0.00001,
                                                       lambda_user=0.14, lambda_item=0.575, num_features=20)
print("--- %s seconds ---" % (time.time() - start_time))
print(train_rmse, test_rmse)

Learning the matrix factorization using ALS...
Final RMSE on train data: 0.9082974241119364
Final RMSE on test data: 0.983983639
--- 1847.3403561115265 seconds ---
0.9082974241119364 0.983983639


# 3. NNMF section

Neural network matrix factorization (NNMF) dominates standard low-rank techniques on a suite of benchmarks, but is dominated by some recent proposals that take advantage of graph features. Given the vast range of architectures, activation functions, regularizers, and optimization techniques that could be used within the NNMF framework, it seems likely that the true potential of the approach has yet to be reached.

This method was presented by Gintare Karolina Dziugaite and Daniel M. Roy. The neural network contains three layers with 50 units each.
After tuning the hyper-parameters, we set `lamda=1.4841`, `D=40`, `D_prim=60`.
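To make the architecture concrete, here is a rough forward-pass sketch in NumPy. It is not the TensorFlow code in the `nnmf` package: the class `NNMFSketch` is hypothetical, sigmoid activations and random initialisation are assumptions, and training (including the regularisation weight) is left out. The network input for a (user, item) pair concatenates the two `D`-dimensional latent vectors with the element-wise product of the two `D_prime`-dimensional ones.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class NNMFSketch:
    """Untrained NNMF-style model: latent features fed through a small MLP."""

    def __init__(self, n_users, n_items, D=40, D_prime=60,
                 hidden=50, n_layers=3, seed=0):
        rng = np.random.default_rng(seed)
        self.U = rng.normal(scale=0.1, size=(n_users, D))
        self.V = rng.normal(scale=0.1, size=(n_items, D))
        self.U_prime = rng.normal(scale=0.1, size=(n_users, D_prime))
        self.V_prime = rng.normal(scale=0.1, size=(n_items, D_prime))
        # Input size: user features + item features + element-wise product.
        sizes = [2 * D + D_prime] + [hidden] * n_layers + [1]
        self.weights = [rng.normal(scale=0.1, size=(a, b))
                        for a, b in zip(sizes[:-1], sizes[1:])]
        self.biases = [np.zeros(b) for b in sizes[1:]]

    def predict(self, users, items):
        """Predicted ratings for aligned arrays of user and item indices."""
        x = np.concatenate(
            [self.U[users], self.V[items],
             self.U_prime[users] * self.V_prime[items]], axis=1)
        for W, b in zip(self.weights[:-1], self.biases[:-1]):
            x = sigmoid(x @ W + b)
        # Linear output layer with a single unit.
        return (x @ self.weights[-1] + self.biases[-1]).ravel()
```

With `D=40` and `D_prime=60`, the first layer receives 140 inputs per (user, item) pair.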

In [60]:
import nnmf.nnmf 
import nnmf.predict
import nnmf.split_data  
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [61]:
# split into train and test sets using our default settings
nnmf.split_data.split_nnmf()

Reading in data
number of items: 10000, number of users: 1000
Data subsets:
Train: 953330
Validation: 105926
Test: 117696


In [52]:
#training nnmf (test_ratio=0.1)
#if you want to see the process, set verbose=True
nnmf.nnmf.do_nnmf(mode='train', verbose=False)

Building network & initializing variables
Reading in data
[start] Train error: 95769.312500, Train RMSE: 1.823475; Valid RMSE: 1.820590
Early stopping (0.9900772571563721 vs. 0.9904701709747314)...
Loading best checkpointed model
./model/nnmf.ckpt
INFO:tensorflow:Restoring parameters from ./model/nnmf.ckpt
Final train RMSE: 0.958625078201294
Final test RMSE: 0.9892654418945312


# 4. KNN section

KNN computes a rating prediction as a weighted sum of the ratings of other users or items, with weights given by a similarity metric (the runs below use mean squared difference, `msd`). The algorithm is implemented in the Python Surprise library. Its `min_k` (int) parameter is the minimum number of neighbors to take into account for aggregation; if there are not enough neighbors, the prediction is set to the global mean of all ratings. The default is 1.
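The weighted sum and the `min_k` fallback can be illustrated with a small user-based sketch. It is not Surprise's implementation: `msd_similarity` and `knn_predict` are hypothetical helpers, missing ratings are assumed to be stored as `NaN` in a dense array, and the similarity follows the MSD definition 1 / (msd + 1).

```python
import heapq

import numpy as np


def msd_similarity(ratings):
    """Pairwise user similarity 1 / (msd + 1) over commonly rated items."""
    n_users = ratings.shape[0]
    sims = np.zeros((n_users, n_users))
    for u in range(n_users):
        for v in range(u + 1, n_users):
            common = ~np.isnan(ratings[u]) & ~np.isnan(ratings[v])
            if common.any():
                msd = np.mean((ratings[u, common] - ratings[v, common]) ** 2)
                sims[u, v] = sims[v, u] = 1.0 / (msd + 1.0)
    return sims


def knn_predict(ratings, sims, user, item, k=40, min_k=1):
    """Similarity-weighted average of the k nearest users who rated `item`."""
    candidates = [(sims[user, v], ratings[v, item])
                  for v in range(ratings.shape[0])
                  if v != user and not np.isnan(ratings[v, item])]
    nearest = heapq.nlargest(k, candidates, key=lambda pair: pair[0])
    nearest = [(s, r) for s, r in nearest if s > 0]
    if len(nearest) < max(min_k, 1):
        # Not enough neighbors: fall back to the global mean rating.
        return float(np.nanmean(ratings))
    total_weight = sum(s for s, _ in nearest)
    return float(sum(s * r for s, r in nearest) / total_weight)
```

The item-based variant is the same computation applied to the transposed rating matrix, which corresponds to setting `user_based` to `False` in `sim_options`.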

In [23]:
import os

from surprise import SVD, BaselineOnly, Dataset, KNNBasic, Reader, accuracy
from surprise.model_selection import cross_validate, train_test_split

In [25]:
# read the data and create train and test sets with the Surprise library
file_path_train = os.path.expanduser('./data/mov_kaggle.all')
reader = Reader(line_format='user item rating', sep='\t')
data = Dataset.load_from_file(file_path_train, reader=reader)
trainset, testset = train_test_split(data, test_size=.1)   #test ratio

## 4.1 User-based

In [26]:
sim_options = {'name': 'msd',
               'user_based': True  # compute similarities between users
               }
algo = KNNBasic(k = 80, sim_options=sim_options)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0197


1.019656231639403

## 4.2 Item-based

In [27]:
sim_options = {'name': 'msd',
               'user_based': False  # compute similarities between items
               }
algo = KNNBasic(k = 20, sim_options=sim_options)
algo.fit(trainset)
predictions = algo.test(testset)
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 1.0235


1.0234965094325155