# Gridsearch 3

In this notebook we'll use grid search to find the best model hyperparameters.

We'll be using the rated_listens4 file that we prepared in Create_ratings3.

The number of songs in rated_listens4 has been reduced by a large factor; it contains only 1M songs.

In [None]:
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
# imports
import requests
import pandas as pd
import numpy as np

from surprise import Reader, Dataset
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic, KNNBaseline, CoClustering
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split
from surprise.similarities import cosine, msd, pearson

from google.colab import drive 

In [None]:
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [None]:
# load in df
rated_listens=pd.read_csv('/drive/My Drive/Colab Notebooks/rated_listens_10k.csv')
rated_listens.drop(columns=['Unnamed: 0'], inplace=True)
rated_listens.head()

    user_name  song_no  rating
0  Svarthjelm       11      10
1  Svarthjelm       34      10
2  Svarthjelm       35      10
3    metabrew       45       2
4    metabrew       47       7


In [None]:
len(rated_listens)

1279949

We'll use the DataFrame restricted to 10k songs to tune our model.

In [None]:
# Read the values in as a Surprise dataset.
# load_from_df expects the columns in the order: user, item, rating.
reader = Reader(rating_scale=(1, 10))
data = Dataset.load_from_df(rated_listens, reader)

In [None]:
# examine data
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  2458 

Number of items:  10000


In [None]:
type(data)

surprise.dataset.DatasetAutoFolds

In [None]:
# Split into train and test set
trainset, testset = train_test_split(data, test_size=0.2)

In [None]:
import multiprocessing
n_cpus = multiprocessing.cpu_count()
n_cpus

4

In [None]:
# Grid search 1 with SVD: coarse grid over n_factors and reg_all
params = {'n_factors': [20, 50],
          'reg_all': [0.02, 0.05],
          'verbose': [True]
          }
gs_svd = GridSearchCV(SVD, param_grid=params, n_jobs=n_cpus, joblib_verbose=5)

In [None]:
gs_svd.fit(data)
# 10min

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  4.3min
[Parallel(n_jobs=4)]: Done  18 out of  20 | elapsed:  8.5min remaining:   56.4s
[Parallel(n_jobs=4)]: Done  20 out of  20 | elapsed:  9.2min finished
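The "20 out of 20" task count in the log above is consistent with one fit per parameter combination per cross-validation fold. The sketch below reproduces that arithmetic. It assumes 5 folds, which is inferred from the task count rather than set anywhere in this notebook, and `count_fits` is a hypothetical helper, not part of Surprise.

```python
from itertools import product

def count_fits(param_grid, n_folds):
    """Number of model fits in an exhaustive grid search: combinations x folds."""
    n_combinations = len(list(product(*param_grid.values())))
    return n_combinations * n_folds

# Same grid as the search above; 'verbose' has one value, so it adds no combinations.
params = {'n_factors': [20, 50],
          'reg_all': [0.02, 0.05],
          'verbose': [True]}

n_fits = count_fits(params, n_folds=5)  # assumption: 5-fold cross-validation
print(n_fits)
```

Running the same count before launching a larger grid gives a rough idea of how the runtime will scale.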


In [None]:
# print out optimal parameters for SVD after GridSearch

gs_svd.best_params

# {'rmse': {'n_factors': 50, 'reg_all': 0.05, 'verbose': True},
#  'mae': {'n_factors': 50, 'reg_all': 0.05, 'verbose': True}}

{'rmse': {'n_factors': 50, 'reg_all': 0.05, 'verbose': True},
 'mae': {'n_factors': 50, 'reg_all': 0.05, 'verbose': True}}

In [None]:
# print out best score

gs_svd.best_score

# {'rmse': 2.55926115381978, 'mae': 1.8421055481499127}

{'rmse': 2.55926115381978, 'mae': 1.8421055481499127}

In [None]:
# Grid search 2 with SVD: shift the grid upward from the best values of search 1
params = {'n_factors': [50, 75, 100],
          'reg_all': [0.05, 0.08, 0.13],
          'verbose': [True]
          }
gs_svd = GridSearchCV(SVD, param_grid=params, n_jobs=n_cpus, joblib_verbose=5)

In [None]:
gs_svd.fit(data)
# 30min

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  5.7min
[Parallel(n_jobs=4)]: Done  45 out of  45 | elapsed: 27.1min finished


In [None]:
# print out optimal parameters for SVD after GridSearch
gs_svd.best_params

# {'rmse': {'n_factors': 100, 'reg_all': 0.13, 'verbose': True},
#  'mae': {'n_factors': 100, 'reg_all': 0.08, 'verbose': True}}

{'rmse': {'n_factors': 100, 'reg_all': 0.13, 'verbose': True},
 'mae': {'n_factors': 100, 'reg_all': 0.08, 'verbose': True}}

In [None]:
# print out best score
gs_svd.best_score

# {'rmse': 2.4204849915579407, 'mae': 1.7961771001043474}

{'rmse': 2.4204849915579407, 'mae': 1.7961771001043474}

In [None]:
# Grid search 3 with SVD: larger n_factors, reg_all narrowed around 0.1
params = {'n_factors': [100, 125, 150],
          'reg_all': [0.09, 0.1, 0.15],
          'verbose': [True]
          }
gs_svd = GridSearchCV(SVD, param_grid=params, n_jobs=n_cpus, joblib_verbose=5)

In [None]:
gs_svd.fit(data)
# 40min

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  8.4min
[Parallel(n_jobs=4)]: Done  45 out of  45 | elapsed: 35.8min finished


In [None]:
# print out optimal parameters for SVD after GridSearch
gs_svd.best_params

# {'rmse': {'n_factors': 150, 'reg_all': 0.1, 'verbose': True},
#  'mae': {'n_factors': 150, 'reg_all': 0.09, 'verbose': True}}

{'rmse': {'n_factors': 150, 'reg_all': 0.1, 'verbose': True},
 'mae': {'n_factors': 150, 'reg_all': 0.09, 'verbose': True}}

In [None]:
# print out best score
gs_svd.best_score

# {'rmse': 2.409023054350329, 'mae': 1.783678972924296}

{'rmse': 2.409023054350329, 'mae': 1.783678972924296}
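In the third search the best `n_factors` (150) is once again the largest candidate in the grid, as it was in the first two searches, so the optimum may lie beyond the range tested. A minimal sketch of a check for that situation follows; `params_on_grid_edge` is a hypothetical helper written for illustration, not a Surprise function.

```python
def params_on_grid_edge(param_grid, best_params):
    """Return the parameters whose best value is the smallest or largest candidate."""
    on_edge = []
    for name, candidates in param_grid.items():
        if len(candidates) < 2:
            continue  # single-valued entries such as 'verbose' tell us nothing
        if best_params[name] in (min(candidates), max(candidates)):
            on_edge.append(name)
    return on_edge

# Grid and best RMSE parameters from the third search above.
grid_3 = {'n_factors': [100, 125, 150],
          'reg_all': [0.09, 0.1, 0.15],
          'verbose': [True]}
best_rmse_3 = {'n_factors': 150, 'reg_all': 0.1, 'verbose': True}

edge_params = params_on_grid_edge(grid_3, best_rmse_3)
print(edge_params)
```

A parameter flagged here is a candidate for a wider range in a follow-up search, at the cost of longer runtimes.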

We'll also run the search on the augmented 10k-song DataFrame to see whether the augmented ratings yield a more accurate model.

Note that this data uses a different rating scale (1 to 5 rather than 1 to 10), so its RMSE and MAE are not directly comparable with the scores above and have to be read relative to that scale.

In [36]:
# load in df
rated_listens=pd.read_csv('/drive/My Drive/Colab Notebooks/rated_listens_10k_aug.csv')
rated_listens.drop(columns=['Unnamed: 0'], inplace=True)
# rated_listens.head()

# read in values as Surprise dataset 
reader = Reader(rating_scale=(1,5))
data = Dataset.load_from_df(rated_listens, reader)

# examine data
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

# Split into train and test set
trainset, testset = train_test_split(data, test_size=0.2)

Number of users:  2458 

Number of items:  10000


In [37]:
rated_listens.head()

    user_name  song_no  rating2
0  Svarthjelm       11        5
1  Svarthjelm       34        5
2  Svarthjelm       35        5
3    metabrew       45        2
4    metabrew       47        4


In [38]:
# Grid search 4 with SVD (augmented ratings): same grid as search 3
params = {'n_factors': [100, 125, 150],
          'reg_all': [0.09, 0.1, 0.15],
          'verbose': [True]
          }
gs_svd = GridSearchCV(SVD, param_grid=params, n_jobs=n_cpus, joblib_verbose=5)

gs_svd.fit(data)
# 40min

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 tasks      | elapsed:  8.2min
[Parallel(n_jobs=4)]: Done  45 out of  45 | elapsed: 35.5min finished


In [39]:
# print out optimal parameters for SVD after GridSearch
gs_svd.best_params

{'rmse': {'n_factors': 150, 'reg_all': 0.09, 'verbose': True},
 'mae': {'n_factors': 150, 'reg_all': 0.09, 'verbose': True}}

In [40]:
# print out best score
gs_svd.best_score

{'rmse': 1.1994807366448408, 'mae': 0.9563704500132048}
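Because the two datasets use different rating scales, one rough way to put the scores side by side is to divide each error by the width of its scale. The sketch below does this for the best RMSE from search 3 (scale 1 to 10) and the augmented search (scale 1 to 5). Range normalization is an assumption made here for illustration and is only one of several possible conventions; `normalized_error` is a hypothetical helper.

```python
def normalized_error(error, rating_min, rating_max):
    """Express an error as a fraction of the rating scale's width."""
    return error / (rating_max - rating_min)

# Best RMSE values reported above, with the rating scale passed to each Reader.
original_nrmse = normalized_error(2.409023054350329, 1, 10)
augmented_nrmse = normalized_error(1.1994807366448408, 1, 5)

print(round(original_nrmse, 3), round(augmented_nrmse, 3))
```

Under this normalization the original ratings come out at roughly 0.27 of the scale and the augmented ratings at roughly 0.30, so the lower raw RMSE of the augmented model does not by itself show that it is more accurate.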

With the condensed list of the top 10k songs, we were able to grid search SVD and find a better model than with the location-based models.