# Model Selection #
The scikit-surprise library offers many prediction algorithms, so the first task is to benchmark them all and choose the best performer.

In [1]:
import pandas as pd
import os
import sys
from pathlib import Path
import numpy as np
import sqlite3
import surprise
from surprise import Reader, Dataset, accuracy
from surprise import SVD, SVDpp, KNNBasic, KNNBaseline, KNNWithMeans, KNNWithZScore
from surprise import SlopeOne, NMF, NormalPredictor, BaselineOnly, CoClustering
from surprise.model_selection import cross_validate, GridSearchCV
from tqdm import tqdm
import random
print("System version: {}".format(sys.version))
print("Surprise version: {}".format(surprise.__version__))

System version: 3.9.18 (main, Sep 11 2023, 14:09:26) [MSC v.1916 64 bit (AMD64)]
Surprise version: 1.1.3


## Benchmarking Algorithms ##
The first step is to evaluate the baseline performance of each algorithm with its default settings. I suspect KNNWithZScore may perform well, because it normalizes each user's ratings and users vary widely in how they approach the rating system.

In [2]:
X_path = os.path.join('..', 'data', 'processed', 'X.csv')  # avoids unescaped backslashes in the path

In [3]:
# Tell the Reader what format the data takes (one header line, then user,item,rating)
reader = Reader(line_format='user item rating', sep=',', rating_scale=(0, 5), skip_lines=1)

In [4]:
# Read in the data with the Reader
data = Dataset.load_from_file(X_path, reader = reader)

In [5]:
benchmark = []

algorithms = [SVD(), SVDpp(), SlopeOne(), NMF(), NormalPredictor(), KNNBaseline(),
              KNNBasic(sim_options={'user_based': True}), KNNWithMeans(), KNNWithZScore(), BaselineOnly(), CoClustering()]

for algorithm in algorithms:

    # Evaluate each algorithm with 5-fold cross-validation
    results = cross_validate(algorithm, data, measures=['RMSE', 'MAE'], cv=5, verbose=False)

    # Average each metric across the folds (this gives a Series)
    tmp = pd.DataFrame.from_dict(results).mean(axis=0)

    # Record the algorithm's class name alongside its scores
    tmp['Algorithm'] = type(algorithm).__name__

    # Append this Series to the benchmark list
    benchmark.append(tmp)

benchmark_df = pd.DataFrame(benchmark).set_index('Algorithm').sort_values('test_rmse')

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.

In [6]:
benchmark_df

| Algorithm | test_rmse | test_mae | fit_time | test_time |
|---|---|---|---|---|
| SVD | 1.033372 | 0.686941 | 0.280643 | 0.023931 |
| SVDpp | 1.035129 | 0.683557 | 1.986468 | 0.126673 |
| BaselineOnly | 1.036723 | 0.702795 | 0.044282 | 0.017364 |
| KNNBaseline | 1.073824 | 0.716127 | 0.046459 | 0.05047 |
| KNNWithZScore | 1.25272 | 0.856389 | 0.02613 | 0.042762 |
| KNNWithMeans | 1.256886 | 0.862333 | 0.013171 | 0.037687 |
| SlopeOne | 1.275415 | 0.884449 | 2.830489 | 0.081975 |
| KNNBasic | 1.289185 | 0.882756 | 0.006184 | 0.046493 |
| NMF | 1.29931 | 0.920541 | 0.893476 | 0.02194 |
| CoClustering | 1.299851 | 0.900942 | 1.331904 | 0.017159 |
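
The table shows only the mean of each metric over the five folds, and the top three algorithms differ by less than 0.004 RMSE. A rough way to judge whether such a gap matters is to compare it with the spread across folds. The sketch below uses hypothetical per-fold values (the real ones would come from `results['test_rmse']` for each algorithm), and `within_noise` is an informal heuristic of my own, not a statistical test.

```python
from statistics import mean, stdev

# Hypothetical per-fold RMSE values, for illustration only. In the notebook
# they would be results['test_rmse'] as returned by cross_validate.
fold_rmse = {
    'model_a': [1.02, 1.04, 1.03, 1.05, 1.01],
    'model_b': [1.03, 1.03, 1.04, 1.04, 1.03],
}


def summarize(scores):
    """Return {name: (mean, sample standard deviation)} for each model."""
    return {name: (mean(values), stdev(values)) for name, values in scores.items()}


def within_noise(summary, a, b):
    """True if the gap between two means is smaller than the larger fold spread."""
    (mean_a, sd_a), (mean_b, sd_b) = summary[a], summary[b]
    return abs(mean_a - mean_b) < max(sd_a, sd_b)


summary = summarize(fold_rmse)
for name, (avg, spread) in summary.items():
    print(f'{name}: mean RMSE {avg:.4f}, std {spread:.4f}')
print('Difference within fold noise:', within_noise(summary, 'model_a', 'model_b'))
```

If the gap between two algorithms is smaller than the fold-to-fold spread, secondary criteria such as fit time are a reasonable tie-breaker.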


### Result ###
Three algorithms reach a test RMSE between 1.03 and 1.04: SVD, SVDpp, and BaselineOnly. SVDpp has a slightly lower MAE than SVD (0.684 vs. 0.687), but it takes about seven times longer to fit (1.99 s vs. 0.28 s), which is not worth the small gain. I'll move forward with the SVD algorithm.
I am surprised that KNNWithZScore (RMSE 1.25) did so much worse than SVD (RMSE 1.03).
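
SVDpp can score a lower MAE than SVD while scoring a higher RMSE because the two metrics weight errors differently: RMSE squares each error, so a few large misses cost more. The following is a minimal sketch of both formulas on hypothetical (true rating, estimate) pairs; the function names and data are illustrative and are not taken from Surprise.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)


# Hypothetical ratings: two exact hits, one miss by 1 star, one miss by 2 stars.
pairs = [(5.0, 4.0), (3.0, 3.0), (1.0, 3.0), (4.0, 4.0)]

print(f'RMSE: {rmse(pairs):.4f}')  # the 2-star miss dominates
print(f'MAE:  {mae(pairs):.4f}')
```

Here the single 2-star miss accounts for most of the RMSE, while the MAE treats it as only twice as bad as the 1-star miss.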

## Hyperparameter Tuning for SVD ##
It's time to see if adjusting any of the SVD parameters can improve performance further.

In [7]:
seed = 14
random.seed(seed)
np.random.seed(seed)

# Shuffle in place so the 90/10 split below is random rather than file-ordered
raw_ratings = data.raw_ratings
random.shuffle(raw_ratings)

# Use 90% for training
threshold = int(0.9 * len(raw_ratings))

train_raw_ratings = raw_ratings[:threshold]
test_raw_ratings = raw_ratings[threshold:]

data.raw_ratings = train_raw_ratings
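
The cell above shuffles `data.raw_ratings` in place and relies on the global random seed. As an alternative sketch, the same 90/10 split can be written as a small function with its own seeded generator, which leaves the input list untouched. The helper name `split_ratings` and the sample tuples are hypothetical; the tuples only mimic the `(user, item, rating, timestamp)` shape of Surprise's raw ratings.

```python
import random


def split_ratings(raw_ratings, train_fraction=0.9, seed=14):
    """Return (train, test) lists from a seeded shuffle of a copy of raw_ratings."""
    rng = random.Random(seed)      # private generator, global state untouched
    shuffled = list(raw_ratings)   # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    threshold = int(train_fraction * len(shuffled))
    return shuffled[:threshold], shuffled[threshold:]


# Hypothetical ratings shaped like (user_id, item_id, rating, timestamp).
ratings = [(f'u{i % 7}', f'b{i}', float(i % 5 + 1), None) for i in range(20)]

train, test = split_ratings(ratings)
print(len(train), 'training ratings,', len(test), 'held-out ratings')
```

Calling the function again with the same seed reproduces the same split, which makes the held-out evaluation repeatable.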

In [8]:
param_grid = {
    'n_factors' : [10, 100, 500],
    'n_epochs': [5, 20, 50],
    'lr_all': [0.001, 0.005, 0.02],
    'reg_all': [0.005, 0.02, 0.1],
    'random_state': [14]
    }

In [9]:
gs = GridSearchCV(
    algo_class = SVD,
    param_grid = param_grid,
    n_jobs = -1,
    joblib_verbose = 5)

In [10]:
gs.fit(data)

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    1.2s
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed:    3.4s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   10.4s
[Parallel(n_jobs=-1)]: Done 280 tasks      | elapsed:   28.9s
[Parallel(n_jobs=-1)]: Done 405 out of 405 | elapsed:  1.3min finished
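
The 405 tasks in the log can be checked against the grid: every combination of parameter values is fitted once per cross-validation fold. The sketch below copies `param_grid` from the cell above and assumes `GridSearchCV` was left at its default of 5-fold cross-validation, since no `cv` argument was passed.

```python
from itertools import product

# Copied from the notebook's param_grid.
param_grid = {
    'n_factors': [10, 100, 500],
    'n_epochs': [5, 20, 50],
    'lr_all': [0.001, 0.005, 0.02],
    'reg_all': [0.005, 0.02, 0.1],
    'random_state': [14],
}

n_folds = 5  # assumption: the default cv of GridSearchCV in scikit-surprise

# One tuple per parameter combination, in the order of the dict's keys.
combinations = list(product(*param_grid.values()))
total_fits = len(combinations) * n_folds

print(len(combinations), 'combinations x', n_folds, 'folds =', total_fits, 'fits')
```

This gives 81 combinations and 405 fits, matching the joblib output. Adding a fourth value to each of the four tuned parameters would raise the count to 1,280 fits, so the grid grows quickly.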


In [11]:
print(gs.best_score['rmse'])

# combination of parameters that gave the best RMSE score
print(gs.best_params['rmse'])

1.0340966565269754
{'n_factors': 10, 'n_epochs': 50, 'lr_all': 0.001, 'reg_all': 0.005, 'random_state': 14}


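Relative to the SVD defaults, the best configuration found by the grid search raises the number of epochs (20 to 50) and lowers the number of factors (100 to 10), the learning rate (0.005 to 0.001), and the regularization (0.02 to 0.005). Its cross-validated RMSE of 1.0341 is essentially the same as the 1.0334 from the default benchmark, so any gain from tuning is marginal. The two figures are not strictly comparable, because the grid search saw only the 90% training split.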
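
One way to sanity-check a grid search result is to see whether the best values sit at the edge of the searched range, which would suggest the optimum may lie outside the grid. The sketch below copies the tuned parameters and the reported best values from the cells above; the helper `edge_of_grid` is a hypothetical name, not part of Surprise.

```python
# Tuned parameters and best values copied from the notebook output.
param_grid = {
    'n_factors': [10, 100, 500],
    'n_epochs': [5, 20, 50],
    'lr_all': [0.001, 0.005, 0.02],
    'reg_all': [0.005, 0.02, 0.1],
}
best_params = {'n_factors': 10, 'n_epochs': 50, 'lr_all': 0.001, 'reg_all': 0.005}


def edge_of_grid(grid, best):
    """Map each parameter to 'lower edge', 'upper edge', or 'interior'."""
    report = {}
    for name, values in grid.items():
        if best[name] == min(values):
            report[name] = 'lower edge'
        elif best[name] == max(values):
            report[name] = 'upper edge'
        else:
            report[name] = 'interior'
    return report


edges = edge_of_grid(param_grid, best_params)
for name, position in edges.items():
    print(f'{name}: {best_params[name]} ({position})')
```

All four tuned parameters land on an edge of their range, so a follow-up search with smaller `n_factors`, `lr_all`, and `reg_all` values and more epochs might be worth trying.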

In [13]:
# Best algorithm
best_SVD = SVD(n_factors = 10, n_epochs = 50,  lr_all = 0.001, reg_all = 0.005, random_state = 14)

In [14]:
# Retrain on the whole 90% training split (data.raw_ratings was replaced above)
trainset = data.build_full_trainset()
best_SVD.fit(trainset);

In [15]:
# Evaluate training set accuracy
predictions = best_SVD.test(trainset.build_testset())
print('Biased accuracy on trainset:', end='   ')
accuracy.rmse(predictions);

Biased accuracy on trainset:   RMSE: 0.9601


In [16]:
# Evaluate testing set accuracy
testset = data.construct_testset(test_raw_ratings)
predictions = best_SVD.test(testset)
print('Unbiased accuracy on testset:', end=' ')
accuracy.rmse(predictions);

Unbiased accuracy on testset: RMSE: 1.0230


### Results ###
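I've picked the algorithm and its parameters: the tuned SVD scores an RMSE of 0.9601 on the training data and 1.0230 on the held-out 10%. Next, I'll write a function that builds this model from the user rating data, so that a new user's ratings can be added and predictions returned to them.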

## Examining Predictions for a User ##
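Out of curiosity, I want to see which books score highest for user ID 1. Note that `predictions` covers only the held-out test set, so these are books the user has already rated (each one carries a true rating `r_ui`), not new recommendations.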

In [17]:
# Collect all test-set predictions for user ID 1 (raw IDs read from file are strings)
user_predictions = [pred for pred in predictions if pred.uid == '1']

In [18]:
# Sort the predictions in descending order by estimated rating and return the top 5
user_predictions.sort(key=lambda x: x.est, reverse=True)
top_recommendations = user_predictions[:5]

# Display the recommendations
for pred in top_recommendations:
    print(f'Book ID: {pred.iid}, Estimated Rating: {pred.est:.2f}')

Book ID: 7, Estimated Rating: 4.42
Book ID: 89, Estimated Rating: 4.33
Book ID: 41, Estimated Rating: 4.31
Book ID: 75, Estimated Rating: 4.31
Book ID: 65, Estimated Rating: 4.31


In [19]:
# Viewing the recommendation format
top_recommendations

[Prediction(uid='1', iid='7', r_ui=5.0, est=4.4243673318579315, details={'was_impossible': False}),
 Prediction(uid='1', iid='89', r_ui=5.0, est=4.3304354828217395, details={'was_impossible': False}),
 Prediction(uid='1', iid='41', r_ui=4.0, est=4.310893500701217, details={'was_impossible': False}),
 Prediction(uid='1', iid='75', r_ui=3.0, est=4.310893500701217, details={'was_impossible': False}),
 Prediction(uid='1', iid='65', r_ui=5.0, est=4.310893500701217, details={'was_impossible': False})]

In [20]:
# Returning a list of only book_ids for database lookup
[pred.iid for pred in user_predictions[:5]]

['7', '89', '41', '75', '65']
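
The list above is drawn from books user 1 has already rated. To recommend new books, the model would instead need to score the books the user has not rated. The following is a minimal sketch under stated assumptions: `top_n_unrated` is a hypothetical helper, and `estimate` is any callable that returns a predicted rating. With the fitted model it could be `lambda uid, iid: best_SVD.predict(uid, iid).est`; a stub with made-up scores stands in here so the block runs on its own.

```python
import heapq


def top_n_unrated(estimate, user_id, all_item_ids, rated_item_ids, n=5):
    """Return the n highest-scoring (item_id, estimate) pairs among unrated items."""
    rated = set(rated_item_ids)
    candidates = (iid for iid in all_item_ids if iid not in rated)
    scored = ((estimate(user_id, iid), iid) for iid in candidates)
    best = heapq.nlargest(n, scored)  # sorts by estimate, ties broken by item ID
    return [(iid, est) for est, iid in best]


# Stub scores for illustration only; these are not model output.
fake_scores = {'7': 4.4, '89': 4.3, '41': 4.1, '12': 3.9, '30': 4.8}


def fake_estimate(user_id, item_id):
    """Stand-in for a fitted model's prediction."""
    return fake_scores[item_id]


recommendations = top_n_unrated(
    fake_estimate,
    user_id='1',
    all_item_ids=fake_scores.keys(),
    rated_item_ids=['7', '89'],  # books the user has already rated
    n=2,
)
print(recommendations)
```

Books '7' and '89' are skipped because the user has already rated them, so only the remaining books compete for the top spots.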