# Model Iterations

In this notebook I continue to iterate on models and try out new collaborative filtering models.

I aim to create models that will be able to make movie recommendations based on the highest predicted ratings for a user.  

In [5]:
# imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

parent_dir = '../../../'

In [82]:
from surprise import Dataset, Reader
from surprise import accuracy
from surprise.model_selection import train_test_split, cross_validate, GridSearchCV
from surprise.prediction_algorithms import SVD, SVDpp, SlopeOne, NMF, NormalPredictor, KNNBaseline, KNNBasic, KNNWithMeans, KNNWithZScore, BaselineOnly, CoClustering

In [2]:
# plot parameters
plt.rcParams['axes.labelsize'] = 20
plt.rcParams['axes.titlesize'] = 25
plt.rcParams['xtick.labelsize'] = 15
plt.rcParams['ytick.labelsize'] = 15
plt.rcParams['axes.edgecolor'] = 'black'
plt.rcParams['axes.facecolor'] = 'white' # white or EAEAF2

In [7]:
# load joined dataframe:
df = pd.read_csv(parent_dir + 'data/joined_dfs_lc', index_col = 0)
df.head()

movieId,userId,rating,title,genres,imdbId,tmdbId
1,1,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,5,4.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,7,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,15,2.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0
1,17,4.5,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,114709,862.0


Let's keep movieId as a column

In [10]:
df.reset_index(inplace = True)

In [12]:
df.shape

(100836, 7)

## Num. Ratings per Movie

Let's look again at the number of ratings per movie to get an idea of the long tail problem and where our potential threshold should be:

In [15]:
num_ratings = df.groupby('movieId').count().drop(['userId', 'title', 'genres', 'imdbId', 'tmdbId'], axis = 1)

In [16]:
num_ratings.head()

movieId,rating
1,215
2,110
3,52
4,7
5,49


In [18]:
sorted_num_ratings = num_ratings.sort_values(by = 'rating', axis = 0, ascending = False)

In [19]:
sorted_num_ratings.head()

movieId,rating
356,329
318,317
296,307
593,279
2571,278


Let's first see how many movies only have 1 rating:

In [42]:
len(sorted_num_ratings[sorted_num_ratings['rating'] == 1])

3446

Let's see how many movies have fewer than 50 ratings:

In [34]:
len(sorted_num_ratings[sorted_num_ratings['rating'] < 50])

9274

In [23]:
len(sorted_num_ratings)

9724

In [35]:
len(sorted_num_ratings[sorted_num_ratings['rating'] >= 50])

450

So only 450 movies have 50 or more ratings, while 9274 movies have fewer than 50.  A cutoff of 50 therefore seems too harsh, as it would remove the vast majority of movies.  My first thought was that 5 might be a more reasonable number to investigate, on the assumption that people are likely to rate about 5 movies **(actually this reasoning is wrong: the count here is the number of ratings a movie received, not the number a user gave.  A movie with only 5 ratings is probably unpopular, so maybe the threshold should be higher...)**.  There might also not be many people who go around rating lots of movies.

Let's see how many movies have fewer than 5 ratings:

In [45]:
len(sorted_num_ratings[sorted_num_ratings['rating'] < 5])

6074

In [46]:
len(sorted_num_ratings[sorted_num_ratings['rating'] >= 5])

3650

There are 6074 movies with fewer than 5 ratings.  Removing these movies will reduce the dimensionality of the dataset and help avoid memory errors.

On second thought, I think a movie should have at least 10 ratings:

In [68]:
len(sorted_num_ratings[sorted_num_ratings['rating'] < 10])

7455

In [71]:
len(sorted_num_ratings[sorted_num_ratings['rating'] >= 10])

2269

I don't want to cut the number of movies much below this, so I'm going to use 10 as the minimum number of ratings a movie must have.
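The threshold comparisons above can be gathered into one small helper. This is a minimal sketch on a toy table: the `movies_kept` function and the toy data are hypothetical, and only the `movieId` column name is taken from the notebook.

```python
import pandas as pd

# Toy ratings table (made-up data, same column names as df).
toy = pd.DataFrame({
    'movieId': [1, 1, 1, 2, 2, 3],
    'userId':  [1, 2, 3, 1, 2, 1],
    'rating':  [4.0, 3.5, 5.0, 2.0, 3.0, 4.5],
})

def movies_kept(ratings, thresholds):
    """Map each threshold to the number of movies with at least that many ratings."""
    counts = ratings['movieId'].value_counts()
    return {t: int((counts >= t).sum()) for t in thresholds}

kept = movies_kept(toy, thresholds=[1, 2, 3])
print(kept)
```

Calling something like `movies_kept(df, [5, 10, 50])` on the real data would be expected to reproduce the counts found above in a single step.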

In [72]:
min_ratings = 10
filter_movies = df['movieId'].value_counts() >= min_ratings
filter_movies = filter_movies[filter_movies].index.tolist()

In [73]:
len(filter_movies)

2269

## Num. Ratings per User

Let's now look at how many ratings each user gives:

In [60]:
user_ratings = df.groupby('userId').count().drop(['movieId', 'title', 'genres', 'imdbId', 'tmdbId'], axis = 1)

In [61]:
user_ratings.head()

userId,rating
1,232
2,29
3,39
4,216
5,44


In [62]:
sorted_user_ratings = user_ratings.sort_values(by = 'rating', axis = 0, ascending = False)

In [66]:
sorted_user_ratings.head()

userId,rating
414,2698
599,2478
474,2108
448,1864
274,1346


Let's first see how many users only gave 1 rating:

In [64]:
len(sorted_user_ratings[sorted_user_ratings['rating'] == 1])

0

In [67]:
sorted_user_ratings.min()

rating    20
dtype: int64

So the minimum number of ratings a user gave is 20.  This seems like a reasonable number so we won't filter this down. 

## Update df with filtered down movies:

In [74]:
new_df = df[df['movieId'].isin(filter_movies)]
new_df.shape

(81116, 7)

So this removed 19,720 rows (about 20% of the ratings), leaving 81,116.
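As a quick check on the size of the cut, the shapes and counts reported above can be compared directly. The numbers below are copied from the earlier outputs; the variable names are only illustrative.

```python
rows_before = 100836   # df.shape[0]
rows_after = 81116     # new_df.shape[0]
movies_before = 9724   # len(sorted_num_ratings)
movies_after = 2269    # len(filter_movies)

rows_removed = rows_before - rows_after
share_rows_removed = rows_removed / rows_before
share_movies_kept = movies_after / movies_before

print(rows_removed)                  # 19720
print(f"{share_rows_removed:.1%}")   # 19.6%
print(f"{share_movies_kept:.1%}")    # 23.3%
```

Keeping roughly 23% of the movies retains roughly 80% of the ratings, which is the long tail at work: most of the dropped movies had very few ratings each.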

## Modelling

In [75]:
new_df.rating.unique()

array([4. , 4.5, 2.5, 3.5, 3. , 5. , 0.5, 2. , 1.5, 1. ])

In [77]:
reader = Reader(rating_scale=(0.5, 5))
data = Dataset.load_from_df(new_df[['userId', 'movieId', 'rating']], reader)

### Try a few models:

In [80]:
benchmark = []

# Iterate over all algorithms
for model in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), 
              KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), 
              BaselineOnly(), CoClustering()]:
    
    # Perform cross validation
    results = cross_validate(model, data, measures = ['RMSE'], 
                             cv = 5, verbose = True)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so assign the class name directly
    tmp['Algorithm'] = type(model).__name__
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')  

Evaluating RMSE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8483  0.8506  0.8458  0.8466  0.8523  0.8487  0.0024  
Fit time          4.04    4.01    4.03    4.01    3.97    4.01    0.02    
Test time         0.12    0.12    0.36    0.12    0.12    0.17    0.09    
Evaluating RMSE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8413  0.8429  0.8224  0.8442  0.8292  0.8360  0.0086  
Fit time          261.87  263.45  380.39  263.19  260.68  285.91  47.25   
Test time         5.56    5.36    5.42    5.38    5.32    5.41    0.08    
Evaluating RMSE of algorithm SlopeOne on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8545  0.8455  0.8535  0.8591  0.8643  0.8554  0.0063  
Fit time          1.51    1.56    1.52    1.56    1.52    1.53    0.02    
Test time         4.7

Algorithm,test_rmse,fit_time,test_time
SVDpp,0.835975,285.914514,5.408826
KNNBaseline,0.841942,0.21781,2.240907
SVD,0.848743,4.014225,0.166481
BaselineOnly,0.850838,0.163921,0.084332
KNNWithZScore,0.853852,0.17819,1.84416
KNNWithMeans,0.85458,0.120352,1.681041
SlopeOne,0.855404,1.534131,308.995272
NMF,0.879023,14.465833,0.323211
CoClustering,0.898842,1.387536,0.156611
KNNBasic,0.905502,0.102885,1.56907
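
For reference, the `test_rmse` column is the root-mean-square error between predicted and actual ratings, so lower is better and the units are rating points. The sketch below computes it by hand on made-up ratings; the `rmse` helper is a hand-rolled illustration, not Surprise's own implementation.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences of ratings."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up ratings, for illustration only.
actual = [4.0, 3.0, 5.0, 2.0]
predicted = [3.5, 3.0, 4.0, 3.0]

score = rmse(actual, predicted)
print(score)  # 0.75
```

Read this way, the best model in the table (SVDpp at about 0.836) is off by a little under one star on a typical prediction.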


In [92]:
param_grid = {'n_factors':[50, 100, 150],
              'n_epochs':[5, 20, 30],
              'lr_all':[0.005, 0.01],
              'reg_all':[0.02, 0.1]}

# GridSearchCV expects the algorithm class itself, not an instance
gs_svdpp = GridSearchCV(SVDpp, param_grid = param_grid, cv = 3,
                        measures = ['rmse'], n_jobs=-1)
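
Before running a grid search it helps to know how many fits it implies, since the search tries every combination of the listed values on every fold. A minimal sketch that counts them for the grid above, where `n_folds` mirrors the `cv = 3` argument:

```python
from itertools import product

param_grid = {'n_factors': [50, 100, 150],
              'n_epochs': [5, 20, 30],
              'lr_all': [0.005, 0.01],
              'reg_all': [0.02, 0.1]}

# Every combination of one value per parameter.
combinations = list(product(*param_grid.values()))

n_folds = 3
n_fits = len(combinations) * n_folds

print(len(combinations))  # 36
print(n_fits)             # 108
```

With SVDpp taking several minutes per fit in the benchmark above, 108 fits would be expensive, so a smaller grid might be worth considering.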

In [90]:
svdpp = SVDpp(n_factors= 100, n_epochs = 20, lr_all = 0.005, reg_all = 0.02)

In [91]:
svdpp.fit(data)

AttributeError: 'DatasetAutoFolds' object has no attribute 'global_mean'

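I couldn't figure out what was going on here at first.  The cause is that `fit` expects a `Trainset`, whereas `data` is a `Dataset` (a `DatasetAutoFolds` object), which has no `global_mean` attribute.  Building a trainset first with `data.build_full_trainset()` and passing that to `svdpp.fit` should resolve the error.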