# Modelling Exploring

In this notebook I explore some modelling with the surprise library.  I aim to optimise the RMSE score.  

In [9]:
# imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from surprise import Reader, Dataset, accuracy
from surprise.model_selection import cross_validate
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic, KNNBaseline
from surprise.model_selection import GridSearchCV, train_test_split

parent_dir = '../../../'

In [10]:
ratings = pd.read_csv(parent_dir + 'data/mod_ratings_lc', index_col = 0)

In [11]:
ratings.head()

Unnamed: 0,userId,movieId,rating
0,1,1,4.0
1,1,3,4.0
2,1,6,4.0
3,1,47,5.0
4,1,50,5.0


In [12]:
ratings.shape

(100836, 3)

In [13]:
len(ratings.userId.unique())

610

In [14]:
len(ratings.movieId.unique())

9724

In [19]:
# create surprise reader:
reader = Reader(line_format = 'user item rating timestamp', 
                sep = ',', skip_lines = 1, rating_scale = (.5,5))
data = Dataset.load_from_file(parent_dir + 'data/mod_ratings_lc', reader)

In [20]:
# train test split
train_set, test_set = train_test_split(data, test_size = 0.25, random_state = 15)

In [21]:
train_set.n_ratings

75627

In [22]:
len(test_set)

25209

Sanity check:  75,627 + 25,209 = 100,836

In [23]:
train_set.n_items

610

In [24]:
train_set.n_users

75627

In [25]:
train_set

<surprise.trainset.Trainset at 0x10df454a8>

## Model 1:  SVD

First, I'll do a cross validation on the whole data set using an SVD model and then I'll do a regular fit on the train data:

### Crossval:

In [26]:
svd = SVD(random_state = 15)

In [None]:
cross_validate(svd, data, cv = 5, verbose = True)

### Train test split:

In [27]:
svd.fit(train_set)
preds = svd.test(test_set)

In [None]:
accuracy.rmse(preds)

In [None]:
preds[0][3] #this is the estimated rating for the first user in our test set...

In [None]:
test_set[:3]

In [None]:
preds[:3]

## Model 2:  SVD with gridsearch

Don't know how to do this just on the train set

In [None]:
param_grid = {'n_factors':[50,100,150],
              'n_epochs':[20,30],
              'lr_all':[0.005,0.01],
              'reg_all':[0.02,0.1]}
gs_svd = GridSearchCV(SVD, param_grid = param_grid, 
                      measures = ['rmse'], n_jobs=-1)
gs_svd.fit(data)

In [None]:
best_params = gs_svd.best_params

In [None]:
best_params

In [None]:
gs_svd.best_score

In [None]:
factors = best_params['rmse']['n_factors']
epochs = best_params['rmse']['n_epochs']
lr_all = best_params['rmse']['lr_all']
reg_all = best_params['rmse']['reg_all']

## Model 3:  KNN Basic:

In [None]:
knn_basic = KNNBasic(sim_options = {'name':'pearson', 'user_based':True})
cv_knn_basic = cross_validate(knn_basic, data, n_jobs=-1)

Still can't seem to get train set to work in these models..... 🤬

## Testing deployment

I know these models are no where near ready, but I'm just going to pickle one to test with deployment to the web app:

In [28]:
import pickle

##dump the model into a file
with open(parent_dir + "model_files/svd_model.bin", 'wb') as f_out:
    pickle.dump(svd, f_out) 
    f_out.close() 

A test input to our model (for a new user) will be in this form:

In [65]:
test_set[:3]

[('563', '114762', 3.5), ('448', '4487', 1.0), ('525', '104272', 4.0)]

So a list of tuples with the ordered values: (`userId`, `movieId`, `rating`)

In [75]:
# create some random test data:
new_user = [('1000', '1', 2), ('1000', '3', 5), 
            ('1000', '6', 2.5), ('1000', '47', 4.5), 
            ('1000', '50', 3)]

In [73]:
1005 in ratings.userId.unique()

False

In [74]:
ratings.movieId.unique()[:5]

array([ 1,  3,  6, 47, 50])

In [76]:
new_user_preds = svd.test(new_user)

In [77]:
new_user_preds

[Prediction(uid='1000', iid='1', r_ui=2, est=3.933169282742589, details={'was_impossible': False}),
 Prediction(uid='1000', iid='3', r_ui=5, est=3.2270686333226184, details={'was_impossible': False}),
 Prediction(uid='1000', iid='6', r_ui=2.5, est=4.0074483776005625, details={'was_impossible': False}),
 Prediction(uid='1000', iid='47', r_ui=4.5, est=3.9736428436987516, details={'was_impossible': False}),
 Prediction(uid='1000', iid='50', r_ui=3, est=4.191024997695057, details={'was_impossible': False})]