# Time to get down to some modeling

In [6]:
# imports 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

from surprise import Dataset, Reader
from surprise import accuracy

from surprise.model_selection import train_test_split, cross_validate

from surprise.prediction_algorithms import SVD
from surprise.prediction_algorithms import SVDpp
from surprise.prediction_algorithms import SlopeOne
from surprise.prediction_algorithms import NMF
from surprise.prediction_algorithms import NormalPredictor
from surprise.prediction_algorithms import KNNBaseline
from surprise.prediction_algorithms import KNNBasic
from surprise.prediction_algorithms import KNNWithMeans
from surprise.prediction_algorithms import KNNWithZScore
from surprise.prediction_algorithms import BaselineOnly
from surprise.prediction_algorithms import CoClustering

Read in the joined dataframe

In [7]:
df = pd.read_csv('../../../data/joined_dfs_lc')

### Start FSM

In [8]:
# instantiate the Reader and the rating scale
reader = Reader(rating_scale=(0, 5))

# load the dataset 
data = Dataset.load_from_df(df[['userId', 'movieId', 'rating']], reader)

# sample random trainset and testset
trainset, testset = train_test_split(data, test_size=.25, random_state=15)

#### Find the best algorithm to use

Research led me to an article by Susan Li, who provided a method for testing a variety of algorithms at once to determine the best option.

I'm going to iterate over all the algorithms to see which one returns the lowest RMSE value.
This one will take a while...

In [9]:
# thank you to Susan Li for this helpful code
benchmark = []
# Iterate over all algorithms
for algorithm in [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), 
                  KNNBaseline(), KNNBasic(), KNNWithMeans(), KNNWithZScore(), 
                  BaselineOnly(), CoClustering()]:
    # Perform cross validation
    results = cross_validate(algorithm, data, measures=['RMSE'], cv=3, verbose=False)
    
    # Get results & append algorithm name
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)
    # Series.append was removed in pandas 2.0, so use pd.concat instead
    name = str(algorithm).split(' ')[0].split('.')[-1]
    tmp = pd.concat([tmp, pd.Series([name], index=['Algorithm'])])
    benchmark.append(tmp)
    
pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')    

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Estimating biases using als...


| Algorithm | test_rmse | fit_time | test_time |
|---|---|---|---|
| SVDpp | 0.868762 | 559.601321 | 16.622181 |
| BaselineOnly | 0.876303 | 0.289106 | 0.393521 |
| SVD | 0.879365 | 6.514188 | 0.398633 |
| KNNBaseline | 0.882199 | 0.373788 | 4.272242 |
| KNNWithMeans | 0.905891 | 0.164547 | 3.516022 |
| KNNWithZScore | 0.908567 | 0.228852 | 4.10917 |
| SlopeOne | 0.910002 | 5.571923 | 14.180134 |
| NMF | 0.934876 | 7.253887 | 0.266966 |
| CoClustering | 0.956119 | 2.915526 | 0.34749 |
| KNNBasic | 0.960331 | 0.151439 | 3.221791 |


RESULT: SVDpp has the lowest RMSE... and the longest fit and test times (hooray).
I'm going to start with that as my model.

    The SVDpp algorithm is an extension of SVD that takes into account implicit ratings.
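
For reference, this is the prediction rule as the Surprise documentation gives it for SVD++. The symbols come from that documentation, not from this notebook:

```latex
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{T}\left(p_u + |I_u|^{-\frac{1}{2}} \sum_{j \in I_u} y_j\right)
```

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, $p_u$ and $q_i$ are the user and item factor vectors, $I_u$ is the set of items rated by user $u$, and the $y_j$ are the extra item factors that capture the implicit signal. "Implicit" here means the fact that a user rated an item at all, regardless of the rating value.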

In [11]:
# Let's pick the algorithm and run the first model on its own
algo = SVDpp(random_state=15)

# Train the algorithm on the trainset, and predict ratings for the testset
algo.fit(trainset)
predictions = algo.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

# Run 5-fold cross-validation and print results
cross_validate(algo, data, measures=['RMSE'], cv=5, verbose=True)


RMSE: 0.8621


KeyboardInterrupt: 

It definitely takes a while to run, especially with the cross-validation. I will probably leave that out of every iteration for the sake of time.
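
The cost comes from refitting: with `cv=5` the model is fit five times, each time on roughly 80% of the ratings. Below is a minimal standard-library sketch of that split. The helper name `kfold_indices` is made up for illustration, and unlike Surprise's own splitter (which may shuffle the ratings first) this one keeps them in order:

```python
def kfold_indices(n_ratings, n_splits):
    """Yield (train_idx, test_idx) index lists for an unshuffled k-fold split."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_ratings // n_splits + (1 if i < n_ratings % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_ratings))
        yield train_idx, test_idx
        start += size


# Toy example: 10 ratings, 5 folds -> 5 separate model fits
folds = list(kfold_indices(n_ratings=10, n_splits=5))
n_fits = len(folds)
for train_idx, test_idx in folds:
    print(f"train on {len(train_idx)} ratings, test on {len(test_idx)}")
```

Given the roughly 560 seconds per SVDpp fit in the benchmark table above, every extra fold is expensive, which is why skipping cross-validation on intermediate iterations is a reasonable trade-off.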

Before I go on, just a quick test to see that it is working as we want it to.
Let's get a rating prediction for a user.

In [None]:
algo.predict(2,60756)

OK. Time to iterate.

Note: I've duplicated this notebook and named the copy '04_dcm_iteration'.
I don't know how much time it may buy me, but I am going to run iterations in both notebooks with staggered start times. Even numbers in notebook 04, odd numbers in this notebook.

### Iteration 3
increased n_factors to 50 and regularization to ~~0.005~~ 0.05

_NOTE: a typo and a poor choice to copy and paste led to decreasing the regularization when I intended to increase it. The model continued to improve nonetheless._

In [None]:
# Let's tune
algo3 = SVDpp(n_factors=50, reg_all=0.05, verbose=True, random_state=15)

# Train the algorithm on the trainset, and predict ratings for the testset
algo3.fit(trainset)
predictions = algo3.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

In [None]:
accuracy.mae(predictions)

Both the RMSE and MAE scores are getting smaller bit by bit...
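
As a reminder of what the two scores measure, here is a minimal standard-library sketch. The toy ratings are invented for illustration, and the helpers `rmse` and `mae` are stand-ins, not the `surprise.accuracy` functions used above:

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Made-up (true, estimated) ratings, not taken from the notebook's data
toy_pairs = [(4.0, 3.5), (3.0, 3.0), (5.0, 4.0), (2.0, 3.0)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
print(toy_rmse, toy_mae)
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE does. It is never smaller than MAE on the same predictions.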

### Iteration 5
adding an adjusted learning rate of 0.01
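
As a rough guide to what `lr_all` and `reg_all` control: the Surprise documentation gives SGD updates of the following form for its SVD model and describes SVD++ as trained the same way. Only the user-bias update is shown, with $\gamma$ standing for the learning rate and $\lambda$ for the regularization term (symbols from that documentation, not from this notebook):

```latex
e_{ui} = r_{ui} - \hat{r}_{ui}, \qquad b_u \leftarrow b_u + \gamma \left(e_{ui} - \lambda\, b_u\right)
```

A larger $\gamma$ takes bigger steps on every update, while a larger $\lambda$ pulls the parameters harder toward zero.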

In [None]:
# Let's tune
algo5 = SVDpp(n_factors=50, reg_all=0.05, lr_all=0.01, verbose=False, random_state=15)

# Train the algorithm on the trainset, and predict ratings for the testset
algo5.fit(trainset)
predictions = algo5.test(testset)

# Then compute RMSE
accuracy.rmse(predictions)

RMSE of .8551... we were hoping to get to .86 so this is a little bonus.

We need to test the prototype app, so I am going to pickle this and we will try it with the interface.

I do want to make sure to cross-validate first.

In [None]:
# Run 5-fold cross-validation and print results
cross_validate(algo5, data, measures=['RMSE'], cv=5, verbose=False)

Now I'm going to pickle this so we can test the prototype app.

In [None]:
import pickle

# the with-block closes the file automatically, so no explicit close() is needed
with open("../../../model_files/SVDpp.bin", 'wb') as f_out:
    pickle.dump(algo5, f_out)
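
On the app side, the model can be read back with `pickle.load`. This is a minimal sketch of the round trip. It uses a plain dict as a stand-in for the fitted `algo5` object and a temporary directory instead of the real `model_files` path, both of which are assumptions made so the example runs on its own:

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; the notebook pickles the SVDpp object `algo5`
stand_in_model = {"name": "SVDpp", "n_factors": 50, "reg_all": 0.05, "lr_all": 0.01}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "SVDpp.bin")  # hypothetical location

    # Write the object in binary mode, as the notebook does
    with open(model_path, "wb") as f_out:
        pickle.dump(stand_in_model, f_out)

    # Read it back, also in binary mode
    with open(model_path, "rb") as f_in:
        loaded_model = pickle.load(f_in)

print(loaded_model)
```

Unpickling the real model requires the `surprise` package to be installed in the app's environment, since pickle stores a reference to the class rather than its code.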

Taking a closer look at this model's performance.

In [None]:
# thank you again to Susan Li for this helpful code

def get_Iu(uid):
    """ return the number of items rated by given user
    args: 
      uid: the id of the user
    returns: 
      the number of items rated by the user
    """
    try:
        return len(trainset.ur[trainset.to_inner_uid(uid)])
    except ValueError: # user was not part of the trainset
        return 0
    
def get_Ui(iid):
    """ return number of users that have rated given item
    args:
      iid: the raw id of the item
    returns:
      the number of users that have rated the item.
    """
    try: 
        return len(trainset.ir[trainset.to_inner_iid(iid)])
    except ValueError:
        return 0
    
df = pd.DataFrame(predictions, columns=['uid', 'iid', 'rui', 'est', 'details'])
df['Iu'] = df.uid.apply(get_Iu)
df['Ui'] = df.iid.apply(get_Ui)
df['err'] = abs(df.est - df.rui)
best_predictions = df.sort_values(by='err')[:100]
worst_predictions = df.sort_values(by='err')[-100:]

pd.set_option('display.max_rows', 100)


In [None]:
best_predictions

In [None]:
worst_predictions