In [38]:
import pandas as pd
import os

from surprise import SVD, KNNBasic, Reader
from surprise.model_selection import cross_validate, GridSearchCV
from pathlib import Path

In [14]:
dir = Path(os.getcwd())
merged_data = pd.read_csv(dir/"merged_preferences.csv")

In [16]:
merged_data['pref'] = merged_data.pref.replace(False, 0)
merged_data['pref'] = merged_data.pref.replace(True, 1)

Setting the data up to create a user-user recommendation algorithm, where recommendations are made for a user based on their preference similarities to other users. The surprise scikit requires data to be loaded into it's DataSet class to be formatted for model building and prediction. 

In [41]:
user_collab_data = merged_data.iloc[:,0:3]
reader = Reader(rating_scale=(0,1))
ds = Dataset.load_from_df(merged_data[['user_id', 'item_id', 'pref']], reader)

First, testing the Singular Value Decomposition system on the data using grid search cross-validation. K-fold cross validation methods split the data into K equal sized datasets, then run through K iterations of training a model on K-1 combined sets before testing on the remaining set. This reduces bias and variance in the model, helping to prevent a model from overfitting or underfitting to the data. On top of this cross-validation, the Grid Search takes discrete parameter values to be applied to a model and tests each combination of parameters to find the set with the best predictive power. 

The merged preferences have around 3,000 samples, which is a fairly large data set. A 5-fold cross validation will be used on the data, which will leave 20% of the data to be tested on for each fold. The parameters that will be tuned through the Grid Search CV are the number of iterations to reduce the overall error of the model (n_epochs), the learning rate which determines the speed of error minimization (lr_all), and rate of regularization which can prevent model overfitting (lr_all).

In [60]:
grid_search_params = {'n_epochs':[10, 20, 30], 'lr_all': [0.005, 0.010], 'reg_all': [0.01, .02, .03]}
svd_grid = GridSearchCV(SVD, grid_search_params, measures=['rmse', 'mae'], cv=5)
svd_grid.fit(ds)

In [57]:
print(svd_grid.best_score['rmse'])
print(svd_grid.best_params['rmse'])

0.45328891403238397
{'n_epochs': 30, 'lr_all': 0.005, 'reg_all': 0.03}


The SCD system provided some fairly low estimations overall, with the best values at 45%. We could try to continue refining this model by refining grid search parameters basic on the best score model, but let's see if the K-nearest neighbor algorithm has better results first. This is based on users, but can also be based on comparing artwork.

In [86]:
grid_search_KNN_params = {'k': [10, 20, 30, 40, 50, 60], 'min_k': [2, 3, 4, 5]}
knn_grid = GridSearchCV(KNNBasic, grid_search_KNN_params, measures=['rmse', 'mae'], cv=5)
knn_grid.fit(ds)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


In [88]:
print(knn_grid.best_score['rmse'])
print(knn_grid.best_params['rmse'])


0.46552028160868586
{'k': 30, 'min_k': 5}


Trying out a KNN model that focuses on the artwork instead of the users, but the results are very similar.

In [97]:
algo = KNNBasic(k=30, min_k=5, sim_options= {'name': 'msd', 'user_based': 'False'})
cross_validate(algo, ds, measures=['rmse'], cv=5)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


{'test_rmse': array([0.45469037, 0.45723294, 0.48516967, 0.46913452, 0.46141158]),
 'fit_time': (0.04447507858276367,
  0.026464223861694336,
  0.01691579818725586,
  0.018942832946777344,
  0.024947166442871094),
 'test_time': (0.04649662971496582,
  0.05006575584411621,
  0.0490872859954834,
  0.04748249053955078,
  0.046279191970825195)}