# Model Selection #
The scikit-surprise library provides many prediction algorithms, so each one is cross-validated on the same data and the results are compared to choose the best option.

In [23]:
import pandas as pd
import os
from pathlib import Path
import numpy as np
from surprise import Reader, Dataset
from surprise import SVD, SVDpp, KNNBasic, KNNBaseline, KNNWithMeans, KNNWithZScore
from surprise import SlopeOne, NMF, NormalPredictor, BaselineOnly, CoClustering
from surprise.model_selection import cross_validate
from tqdm import tqdm

In [4]:
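# Build the path with pathlib so the backslashes are not read as escape sequences
X_path = str(Path('..') / 'data' / 'processed' / 'X.csv')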

In [2]:
reader = Reader(line_format = u'user item rating', sep = ',', rating_scale = (0,5), skip_lines = 1)

In [6]:
data = Dataset.load_from_file(X_path, reader = reader)

In [28]:
benchmark = []

algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(),
              KNNBasic(), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]

for algorithm in algorithms:

    results = cross_validate(algorithm, data, measures = ['RMSE', 'MAE'], cv = 5, verbose = False)

    # Convert the per-fold results to a DataFrame and average over the folds
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)

    # Record the algorithm's class name alongside its mean scores
    tmp['Algorithm'] = type(algorithm).__name__

    # Append this Series to the benchmark list
    benchmark.append(tmp)

benchmark_df = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.

In [29]:
benchmark_df

| Algorithm | test_rmse | test_mae | fit_time (s) | test_time (s) |
|---|---|---|---|---|
| SVD | 1.09154 | 0.795432 | 0.303526 | 0.030927 |
| SVDpp | 1.093559 | 0.791556 | 2.428785 | 0.358788 |
| BaselineOnly | 1.093857 | 0.80979 | 0.042815 | 0.043734 |
| KNNBaseline | 1.136521 | 0.825407 | 0.066027 | 0.229763 |
| KNNWithZScore | 1.212853 | 0.881861 | 0.052409 | 0.213598 |
| KNNWithMeans | 1.213262 | 0.890666 | 0.031016 | 0.18745 |
| SlopeOne | 1.237489 | 0.915619 | 0.807065 | 0.217745 |
| CoClustering | 1.254014 | 0.923053 | 1.246289 | 0.027436 |
| NMF | 1.288964 | 0.969873 | 0.765779 | 0.026858 |
| KNNBasic | 1.319909 | 0.967012 | 0.023344 | 0.168466 |
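
For reference, the two error columns are the root mean squared error (RMSE) and the mean absolute error (MAE) between true and predicted ratings. The cell below is a minimal plain-Python sketch of both metrics, not the library's implementation; the `rmse` and `mae` helpers and the four ratings are hypothetical values chosen only for illustration.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error: squaring penalizes large misses more heavily."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(actual, predicted):
    """Mean absolute error: every miss counts in proportion to its size."""
    absolute_errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(absolute_errors) / len(absolute_errors)


# Hypothetical ratings on the 0-5 scale (not taken from X.csv)
actual = [4.0, 3.0, 5.0, 1.0]
predicted = [3.0, 3.0, 4.0, 3.0]

print(rmse(actual, predicted))
print(mae(actual, predicted))
```

Because RMSE squares each error before averaging, the single two-point miss in this example pulls RMSE above MAE, which is the same pattern seen in every row of the table.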


Three models have a test RMSE of about 1.09: SVD, SVDpp, and BaselineOnly. SVDpp takes roughly eight times longer to fit than SVD (2.43 s vs. 0.30 s) without improving RMSE.
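
One way to make "about 1.09" explicit is to shortlist every algorithm whose RMSE falls within a small tolerance of the best score and then order the shortlist by fit time. The cell below is a sketch of that idea using the figures from the table above; the tolerance of 0.005 and the `scores` dictionary are assumptions made for illustration, not part of the original analysis.

```python
# (test_rmse, fit_time in seconds) copied from the benchmark table
scores = {
    "SVD": (1.09154, 0.303526),
    "SVDpp": (1.093559, 2.428785),
    "BaselineOnly": (1.093857, 0.042815),
    "KNNBaseline": (1.136521, 0.066027),
    "KNNWithZScore": (1.212853, 0.052409),
    "KNNWithMeans": (1.213262, 0.031016),
    "SlopeOne": (1.237489, 0.807065),
    "CoClustering": (1.254014, 1.246289),
    "NMF": (1.288964, 0.765779),
    "KNNBasic": (1.319909, 0.023344),
}

TOLERANCE = 0.005  # assumed cutoff for "practically tied" RMSE values

best_rmse = min(rmse for rmse, _ in scores.values())

# Keep algorithms within the tolerance of the best RMSE, cheapest fit first
shortlist = sorted(
    (name for name, (rmse, _) in scores.items() if rmse - best_rmse <= TOLERANCE),
    key=lambda name: scores[name][1],
)

print(shortlist)
```

With this tolerance the shortlist contains the same three models named above, ordered from fastest to slowest fit.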