# Modelling Exploration

In this notebook I explore some modelling with the Surprise library, aiming to minimise the RMSE of the predicted ratings.

In [1]:
# imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from surprise import Reader, Dataset, accuracy
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV, train_test_split

parent_dir = '../../../'

In [2]:
ratings = pd.read_csv(parent_dir + 'data/mod_ratings_lc', index_col = 0)

In [3]:
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [14]:
ratings.shape

(100836, 3)

In [11]:
len(ratings.userId.unique())

610

In [12]:
len(ratings.movieId.unique())

9724

In [37]:
# create the surprise reader: ratings.csv has a header row and ratings from 0.5 to 5
reader = Reader(line_format = 'user item rating timestamp', 
                sep = ',', skip_lines = 1, rating_scale = (.5,5))
data = Dataset.load_from_file(parent_dir + 'data/ratings.csv', reader)

In [38]:
# train test split
train_set, test_set = train_test_split(data, test_size = 0.25, random_state = 15)

In [39]:
train_set.n_ratings

75627

In [40]:
len(test_set)

25209

Sanity check:  75,627 + 25,209 = 100,836

In [41]:
train_set.n_items

8829

In [42]:
train_set.n_users

610

In [43]:
train_set

<surprise.trainset.Trainset at 0x1a1d20a940>

## Model 1:  SVD

First I'll cross-validate an SVD model on the whole data set, then fit it on the train set and score it on the test set:

### Crossval:

In [44]:
svd = SVD(random_state = 15)

In [46]:
cross_validate(svd, data, cv = 5, verbose = True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8727  0.8760  0.8712  0.8769  0.8753  0.8744  0.0021  
MAE (testset)     0.6709  0.6738  0.6696  0.6725  0.6734  0.6721  0.0016  
Fit time          5.15    5.14    5.18    5.11    5.08    5.13    0.03    
Test time         0.15    0.14    0.13    0.32    0.14    0.18    0.07    


{'test_rmse': array([0.87268478, 0.87602912, 0.8712265 , 0.87685051, 0.87531311]),
 'test_mae': array([0.67086101, 0.6738413 , 0.6696423 , 0.67252738, 0.67339982]),
 'fit_time': (5.146215200424194,
  5.138158798217773,
  5.176810026168823,
  5.106670141220093,
  5.076992988586426),
 'test_time': (0.14764189720153809,
  0.14020895957946777,
  0.13470101356506348,
  0.3244760036468506,
  0.13528990745544434)}
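
The Mean and Std columns in the table can be reproduced from the per-fold values in the returned dict. A minimal sketch using only the standard library, with the five `test_rmse` values copied from the output above; it assumes the table reports a population standard deviation, which is what matches the printed 0.0021 (a sample standard deviation would round to 0.0024):

```python
from statistics import mean, pstdev

# Per-fold RMSE values copied from the cross_validate output above.
fold_rmse = [0.87268478, 0.87602912, 0.8712265, 0.87685051, 0.87531311]

mean_rmse = mean(fold_rmse)
# Population standard deviation (divide by n, not n - 1).
std_rmse = pstdev(fold_rmse)

print(f"mean RMSE: {mean_rmse:.4f}, std: {std_rmse:.4f}")
```

The spread across folds is small compared with the mean, so the single train/test score further down can reasonably be compared against it.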

### Train test split:

In [47]:
svd.fit(train_set)
preds = svd.test(test_set)

In [48]:
accuracy.rmse(preds)

RMSE: 0.8750


0.8750037791289396

## Model 2:  SVD with gridsearch

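`GridSearchCV.fit` expects a `Dataset` and builds its own cross-validation folds, so passing it the `Trainset` raises `AttributeError: 'Trainset' object has no attribute 'raw_ratings'`. I don't know how to run the search on just the train set, so the search here uses the whole data set:

In [49]:
param_grid = {'n_factors':[50,100,150],
              'n_epochs':[20,30],
              'lr_all':[0.005,0.01],
              'reg_all':[0.02,0.1]}
gs_svd = GridSearchCV(SVD, param_grid = param_grid, 
                      measures = ['rmse'], n_jobs=-1)
# fit takes the Dataset, not the Trainset
gs_svd.fit(data)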
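
One way to keep the grid search away from the held-out ratings is to split the data set's raw ratings by hand, search on one part, and score the best model on the other. The sketch below mirrors the held-out-data recipe in the Surprise FAQ. The helper names `split_raw_ratings` and `grid_search_with_holdout` are hypothetical, the 25% holdout and seed 15 are assumptions chosen to match the split above, and the second function has not been run against this data:

```python
import random

def split_raw_ratings(raw_ratings, holdout_fraction=0.25, seed=15):
    """Shuffle a copy of the raw ratings and return (search, holdout) lists."""
    shuffled = list(raw_ratings)
    random.Random(seed).shuffle(shuffled)
    cut = int(round((1 - holdout_fraction) * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

def grid_search_with_holdout(data, param_grid, holdout_fraction=0.25, seed=15):
    """Tune SVD on part of `data`, then score the best model on the rest."""
    # Imported here so the splitting helper works without surprise installed.
    from surprise import SVD, accuracy
    from surprise.model_selection import GridSearchCV

    search_raw, holdout_raw = split_raw_ratings(
        data.raw_ratings, holdout_fraction, seed)
    data.raw_ratings = search_raw  # note: this modifies `data` in place

    gs = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=5)
    gs.fit(data)  # cross-validates on the search part only

    # Refit the best configuration on the whole search part.
    best_algo = gs.best_estimator['rmse']
    best_algo.fit(data.build_full_trainset())

    # Score on ratings the search never saw.
    holdout_set = data.construct_testset(holdout_raw)
    return gs.best_params['rmse'], accuracy.rmse(best_algo.test(holdout_set))

# Toy check of the splitting step: 8 made-up (user, item, rating, timestamp) rows.
toy = [(str(u), str(i), 4.0, None) for u in range(2) for i in range(4)]
search_part, holdout_part = split_raw_ratings(toy)
```

Since `data.raw_ratings` is overwritten, it is safer to reload the data set before reusing `data` in later cells.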

In [None]:
best_params = gs_svd.best_params

In [None]:
best_params

In [None]:
gs_svd.best_score

In [35]:
factors = best_params['rmse']['n_factors']
epochs = best_params['rmse']['n_epochs']
lr_all = best_params['rmse']['lr_all']
reg_all = best_params['rmse']['reg_all']

## Model 3:  KNN Basic

In [52]:
knn_basic = KNNBasic(sim_options = {'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)
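
The `pearson` option with `user_based` set to `True` scores two users by how strongly their ratings correlate on the movies both have rated. A minimal sketch of that idea in plain Python; it is an illustration rather than the library's implementation, the two users and their ratings are made up, and returning 0.0 when the correlation is undefined is an assumption of this sketch:

```python
import math

def pearson_similarity(ratings_u, ratings_v):
    """Pearson correlation over co-rated items; 0.0 when it is undefined."""
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < 2:
        return 0.0
    xs = [ratings_u[item] for item in common]
    ys = [ratings_v[item] for item in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user with constant ratings has no defined correlation
    return cov / math.sqrt(var_x * var_y)

# Hypothetical users keyed by movieId; they share movies 1, 6 and 47.
user_a = {"1": 4.0, "3": 4.0, "6": 5.0, "47": 2.0}
user_b = {"1": 5.0, "6": 5.0, "47": 1.0, "50": 3.0}
sim_ab = pearson_similarity(user_a, user_b)
```

Only the three shared movies enter the calculation, so users with little overlap get similarities based on very few ratings.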

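Still can't get the train set to work with `cross_validate`: like `GridSearchCV`, it expects a `Dataset` and builds its own folds, so a `Trainset` only works with the `fit`/`test` route used for Model 1. 🤬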