# Gridsearch

In this notebook we'll run grid searches to find the best-performing model.

We'll use the rated_listens3 file that we prepared in Create_ratings2.

In [None]:
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 4.4 MB/s
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633981 sha256=dde4c1798c2280399bf1917c01d8b11ccdc35c5002e30ef375b4d03fd413b128
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [None]:
# imports
import requests
import pandas as pd
import numpy as np

from surprise import Reader, Dataset
from surprise.prediction_algorithms import SVD, KNNWithMeans, KNNBasic, KNNBaseline, CoClustering
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split
from surprise.similarities import cosine, msd, pearson

from google.colab import drive 

In [None]:
drive.mount('/drive')

Mounted at /drive


In [None]:
# load in df
rated_listens=pd.read_csv('/drive/My Drive/Colab Notebooks/rated_listens3.csv')
rated_listens.drop(columns=['Unnamed: 0', 'listen_no'], inplace=True)
rated_listens.head()

,user_name,song_no,rating
0,Skeebadoo,0,3
1,Skeebadoo,2,2
2,Svarthjelm,4,10
3,Svarthjelm,6,2
4,Svarthjelm,7,9


In [None]:
# read in values as Surprise dataset 
reader = Reader(rating_scale=(2,10))
data = Dataset.load_from_df(rated_listens, reader)

In [None]:
# examine data
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

Number of users:  2472 

Number of items:  2205253


In [None]:
type(data)

surprise.dataset.DatasetAutoFolds

In [None]:
# Split into train and test set
trainset, testset = train_test_split(data, test_size=0.2)

In [None]:
# determining the optimal algorithm parameters with GridSearchCV
# KNNWithMeans
# Item based

# sim_options = {
#     "name": ["cosine", "pearson"],
#     "min_support": [4, 5],
#     "user_based": [False],
# }

# param_grid = {"sim_options": sim_options}

# gs_KNNm = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse"], cv=5, joblib_verbose=2)

In [None]:
# gs_KNNm.fit(data)

Item-based recommendations aren't workable here because there are simply too many items. I tried many variations, but every one ran out of memory (25+ GB).

We'll go with user-based for now, as there are far fewer users.
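
A rough back-of-the-envelope estimate shows why the item-based search runs out of memory. The sketch below assumes the similarity matrix is stored as a dense n x n array of 8-byte floats (an assumption about Surprise's internals, and it ignores any extra working arrays), and it uses the user and item counts printed above.

```python
# Rough memory estimate for a dense n x n similarity matrix.
# Assumption: one 8-byte float per entry; real usage may be higher.
BYTES_PER_FLOAT = 8


def similarity_matrix_bytes(n: int) -> int:
    """Bytes needed for a dense n x n matrix of 8-byte floats."""
    return n * n * BYTES_PER_FLOAT


n_users = 2472       # from dataset.n_users above
n_items = 2205253    # from dataset.n_items above

user_mib = similarity_matrix_bytes(n_users) / 2**20
item_gib = similarity_matrix_bytes(n_items) / 2**30

print(f"user-user matrix: {user_mib:,.1f} MiB")
print(f"item-item matrix: {item_gib:,.0f} GiB")
```

Under that assumption the user-user matrix needs a few dozen MiB, while the item-item matrix would need tens of thousands of GiB, far beyond the roughly 25 GB available.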

In [None]:
# determining the optimal algorithm parameters with GridSearchCV
# Memory based
# KNNWithMeans
# User based

# sim_options = {
#     "name": ["msd", "cosine", "pearson"],
#     "min_support": [3, 4, 5],
#     "user_based": [True],
# }

# param_grid = {"sim_options": sim_options}

# gs_KNNm = GridSearchCV(KNNWithMeans, param_grid, cv=5, joblib_verbose=5)

In [None]:
# gs_KNNm.fit(data)

# gridsearch took 42.6 min

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   55.8s remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  1.9min remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.8min remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.7min remaining:    0.0s


Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Don

[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed: 42.6min finished


In [None]:
# print out optimal parameters for KNNWithMeans after GridSearch

# gs_KNNm.best_params

# {'rmse': {'sim_options': {'name': 'pearson',
#    'min_support': 3,
#    'user_based': True}},
#  'mae': {'sim_options': {'name': 'pearson',
#    'min_support': 3,
#    'user_based': True}}}

{'rmse': {'sim_options': {'name': 'pearson',
   'min_support': 3,
   'user_based': True}},
 'mae': {'sim_options': {'name': 'pearson',
   'min_support': 3,
   'user_based': True}}}

In [None]:
# gs_KNNm.best_score

# {'rmse': 2.6819474792754847, 'mae': 2.1008990680893596}

{'rmse': 2.6819474792754847, 'mae': 2.1008990680893596}
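
For reference, the two scores in best_score can be computed by hand. The sketch below uses made-up ratings and predictions (not taken from this dataset) to show how RMSE and MAE are defined; the function names are only for illustration, and Surprise provides its own versions in surprise.accuracy.

```python
import math


def rmse(true_ratings, predictions):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(true_ratings, predictions):
    """Mean absolute error between two equal-length sequences."""
    absolute_errors = [abs(t - p) for t, p in zip(true_ratings, predictions)]
    return sum(absolute_errors) / len(absolute_errors)


# Made-up example values on the 2-10 rating scale.
true_ratings = [3, 5, 8, 10]
predictions = [4, 4, 6, 10]

example_rmse = rmse(true_ratings, predictions)
example_mae = mae(true_ratings, predictions)
print(example_rmse, example_mae)
```

Because RMSE squares each error before averaging, large misses weigh more heavily than in MAE, which is why the RMSE here (about 2.68) sits above the MAE (about 2.10).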

In [None]:
# determining the optimal algorithm parameters with GridSearchCV
# Memory based
# KNNWithMeans #2
# User based

sim_options = {
    "name": ["pearson"],
    "min_support": [1, 2, 3],
    "user_based": [True],
}

param_grid = {"sim_options": sim_options}

gs_KNNm = GridSearchCV(KNNWithMeans, param_grid, cv=5, joblib_verbose=5)

In [None]:
gs_KNNm.fit(data)

# gridsearch took 14min

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   55.4s remaining:    0.0s


Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  2.0min remaining:    0.0s


Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  2.9min remaining:    0.0s


Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  3.8min remaining:    0.0s


Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed: 13.8min finished


In [None]:
# print out optimal parameters for KNNm after GridSearch

gs_KNNm.best_params

# {'rmse': {'sim_options': {'name': 'pearson',
#    'min_support': 1,
#    'user_based': True}},
#  'mae': {'sim_options': {'name': 'pearson',
#    'min_support': 1,
#    'user_based': True}}}

{'rmse': {'sim_options': {'name': 'pearson',
   'min_support': 1,
   'user_based': True}},
 'mae': {'sim_options': {'name': 'pearson',
   'min_support': 1,
   'user_based': True}}}

In [None]:
gs_KNNm.best_score

# {'rmse': 2.6820072769006944, 'mae': 2.1003798289171227}

{'rmse': 2.6820072769006944, 'mae': 2.1003798289171227}

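With KNNWithMeans the best cross-validated scores are an RMSE of about 2.682 and an MAE of about 2.100.
Optimal params: pearson similarity, min_support of 1, user based. The first search (best min_support of 3) scored almost the same (RMSE about 2.682, MAE about 2.101), so min_support makes little difference here.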


In [None]:
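# determining the optimal algorithm parameters with GridSearchCV
# Memory based
# KNNBaseline
# User based
# NOTE: k and min_k are KNNBaseline arguments rather than similarity options;
# nested under sim_options they are most likely ignored, leaving the defaults in use.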

sim_options = {
    "name": ["pearson"],
    "k": [20, 40, 60],
    "min_k": [1, 2, 3],
    "user_based": [True],
}

param_grid = {"sim_options": sim_options}

gs_KNNb = GridSearchCV(KNNBaseline, param_grid, cv=5, joblib_verbose=5)

In [None]:
gs_KNNb.fit(data)

# 71 min

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.7min remaining:    0.0s


Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.2min remaining:    0.0s


Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  4.8min remaining:    0.0s


Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  6.7min remaining:    0.0s


Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als...
Computing the pearson similarity matrix...
Done computing similarity matrix.
Estimating biases using als.

[Parallel(n_jobs=1)]: Done  45 out of  45 | elapsed: 71.6min finished


In [None]:
gs_KNNb.best_params

# {'rmse': {'sim_options': {'name': 'pearson',
#    'k': 20,
#    'min_k': 1,
#    'user_based': True}},
#  'mae': {'sim_options': {'name': 'pearson',
#    'k': 20,
#    'min_k': 1,
#    'user_based': True}}}

{'rmse': {'sim_options': {'name': 'pearson',
   'k': 20,
   'min_k': 1,
   'user_based': True}},
 'mae': {'sim_options': {'name': 'pearson',
   'k': 20,
   'min_k': 1,
   'user_based': True}}}

In [None]:
# gs_KNNb.best_score

# {'rmse': 2.611438296260585, 'mae': 1.999188842719161}
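
One thing to double-check in the KNNBaseline search above: as far as we can tell, k and min_k are arguments of the KNN algorithms themselves rather than similarity options, so nesting them under sim_options most likely means every combination ran with the default values. A sketch of how the grid could be laid out instead is below. This variant has not been run here, so no scores are claimed for it.

```python
# Sketch of an alternative KNNBaseline grid (not run in this notebook).
# Assumption: k and min_k are read from the top level of param_grid,
# while sim_options only carries the similarity settings.
sim_options = {
    "name": ["pearson"],
    "user_based": [True],
}

param_grid = {
    "k": [20, 40, 60],
    "min_k": [1, 2, 3],
    "sim_options": sim_options,
}

# The search itself would then be set up as before:
# gs_KNNb = GridSearchCV(KNNBaseline, param_grid, cv=5, joblib_verbose=5)
```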

In [None]:
# determining the optimal algorithm parameters with GridSearchCV
# Co-clustering

# param_grid = {
#     "n_cltr_u": [2, 3, 4],
#     "n_cltr_i": [2, 3, 4],
#     "n_epochs": [15, 20, 25],
#     "verbose": [True],
# }

# gs_CC = GridSearchCV(CoClustering, param_grid, cv=5, joblib_verbose=5)

In [None]:
# gs_CC.fit(data)

# still running after 3 hours with original param grid
# need to condense params for results
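
A quick estimate suggests why the original grid was impractical. The sketch below counts the fits implied by the grid above and multiplies by roughly 4.6 minutes per fit, the figure the condensed run reports for its first fit further down. Treating that as a constant per-fit cost is an assumption, since the time will vary with the cluster counts and n_epochs; original_grid and the other names are only for illustration.

```python
from itertools import product

# The original co-clustering grid from the cell above.
original_grid = {
    "n_cltr_u": [2, 3, 4],
    "n_cltr_i": [2, 3, 4],
    "n_epochs": [15, 20, 25],
    "verbose": [True],
}
cv_folds = 5
minutes_per_fit = 4.6  # assumed constant; taken from the condensed run's log

# Every combination of values is fitted once per fold.
n_combinations = len(list(product(*original_grid.values())))
n_fits = n_combinations * cv_folds
est_hours = n_fits * minutes_per_fit / 60

print(n_combinations, n_fits, round(est_hours, 1))
```

That comes to 27 combinations and 135 fits, or on the order of 10 hours of sequential work, which fits with the run not finishing after 3 hours.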

In [None]:
# determining the optimal algorithm parameters with GridSearchCV
# Co-clustering 2

param_grid = {
    "n_cltr_u": [3],
    "n_cltr_i": [3, 4],
    "n_epochs": [20],
    "verbose": [True],
}

gs_CC = GridSearchCV(CoClustering, param_grid, cv=5, joblib_verbose=5)

In [None]:
gs_CC.fit(data)

# 47 min

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.6min remaining:    0.0s


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  9.3min remaining:    0.0s


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed: 13.8min remaining:    0.0s


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed: 18.4min remaining:    0.0s


Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing

[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed: 46.2min finished


In [None]:
gs_CC.best_params

# {'rmse': {'n_cltr_u': 3, 'n_cltr_i': 3, 'n_epochs': 20, 'verbose': True},
#  'mae': {'n_cltr_u': 3, 'n_cltr_i': 3, 'n_epochs': 20, 'verbose': True}}

{'rmse': {'n_cltr_u': 3, 'n_cltr_i': 3, 'n_epochs': 20, 'verbose': True},
 'mae': {'n_cltr_u': 3, 'n_cltr_i': 3, 'n_epochs': 20, 'verbose': True}}

In [None]:
gs_CC.best_score

# {'rmse': 2.7669258949064854, 'mae': 2.159563660866292}

{'rmse': 2.7669258949064854, 'mae': 2.159563660866292}
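
Collecting the best cross-validated scores recorded so far makes the comparison easier. The values below are copied from the outputs above and rounded to four decimals (for KNNWithMeans, from the second search); the variable names are only for illustration.

```python
# Best 5-fold CV scores copied from the grid search outputs above.
results = {
    "KNNWithMeans": {"rmse": 2.6820, "mae": 2.1004},
    "KNNBaseline": {"rmse": 2.6114, "mae": 1.9992},
    "CoClustering": {"rmse": 2.7669, "mae": 2.1596},
}

# Sort from lowest (best) to highest RMSE.
ranked = sorted(results.items(), key=lambda item: item[1]["rmse"])
for name, scores in ranked:
    print(f"{name:<13} RMSE={scores['rmse']:.4f}  MAE={scores['mae']:.4f}")
```

KNNBaseline comes out ahead on both metrics, with CoClustering last.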

Next we'll try matrix factorization (SVD), although this may run out of memory again.


In [None]:
# Perform a gridsearch with SVD
# params = {'n_factors': [20, 50],
#          'reg_all': [0.02, 0.05],
#           'verbose': [True]
#           }
# gs_svd = GridSearchCV(SVD, param_grid=params, n_jobs=3, joblib_verbose=5)

In [None]:
# gs_svd.fit(data)

[Parallel(n_jobs=3)]: Using backend LokyBackend with 3 concurrent workers.


The matrix factorization search ran out of memory again.

To use matrix factorization we'll need to restrict the data to only the top x songs, as there are too many items for the RAM we have available (25 GB).
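
One way to do that filtering is sketched below. It uses a small made-up list of (user_name, song_no, rating) rows rather than the real data, and keep_top_songs and top_n are hypothetical names; the cutoff itself would need to be tuned to what fits in memory.

```python
from collections import Counter


def keep_top_songs(rows, top_n):
    """Keep only rows whose song is among the top_n most-rated songs."""
    counts = Counter(song_no for _, song_no, _ in rows)
    top_songs = {song for song, _ in counts.most_common(top_n)}
    return [row for row in rows if row[1] in top_songs]


# Made-up rows in (user_name, song_no, rating) order.
rows = [
    ("user_a", 0, 3),
    ("user_a", 2, 2),
    ("user_b", 0, 10),
    ("user_b", 7, 9),
    ("user_c", 0, 5),
    ("user_c", 2, 8),
]

filtered = keep_top_songs(rows, top_n=2)
print(len(filtered))
```

On the real DataFrame the same idea could be expressed with pandas, for example by taking rated_listens['song_no'].value_counts().nlargest(top_n).index and filtering with isin.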