# Final Model

Using the metrics from our grid search, we'll build our best model here and report its final metrics.

In [1]:
!pip install scikit-surprise

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
     |████████████████████████████████| 11.8 MB 5.1 MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... done
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1633983 sha256=d521e729605e04220f9b55e34d9e277fe9cafa87a17194584c62042d00c78618
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


In [35]:
# imports
import requests
import pandas as pd
import numpy as np
import pickle

from surprise import Reader, Dataset, accuracy, dump
from surprise.prediction_algorithms import SVD, KNNBaseline
from surprise.model_selection import train_test_split

import os

from google.colab import drive
drive.mount('/drive')

Drive already mounted at /drive; to attempt to forcibly remount, call drive.mount("/drive", force_remount=True).


In [3]:
# Creating final model from top 1M songs:
# load in df
rated_listens=pd.read_csv('/drive/My Drive/Colab Notebooks/rated_listens4.csv')
rated_listens.drop(columns=['Unnamed: 0'], inplace=True)

# read in values as Surprise dataset 
reader = Reader(rating_scale=(1,10))
data = Dataset.load_from_df(rated_listens, reader)

# examine data
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

# Number of users:  2598 

# Number of items:  1000000

# Split into train and test set
trainset, testset = train_test_split(data, test_size=0.2)

Number of users:  2598 

Number of items:  1000000


In [4]:
len(rated_listens)

# 8288481

8288481

In [5]:
# create final model
svd_model = SVD(n_factors= 150, reg_all=0.1)
svd_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f19824e3810>

In [6]:
# get final model metrics
predictions = svd_model.test(testset)
print(accuracy.rmse(predictions))

# RMSE: 2.4213
# 2.4213068360998915

RMSE: 2.4213
2.4213068360998915
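
For reference, the RMSE reported above is the square root of the mean squared difference between true and estimated ratings. A minimal pure-Python sketch of that formula follows; the `rmse` helper and the sample pairs are illustrative and not part of Surprise. Note that `accuracy.rmse` prints the rounded value itself and also returns the full float, which is why wrapping it in `print` shows the number twice.

```python
import math

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    if not pairs:
        raise ValueError("need at least one prediction")
    squared_errors = [(true - est) ** 2 for true, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative ratings on the notebook's 1-10 scale (made-up numbers).
sample = [(4, 2), (6, 8)]
print(rmse(sample))  # 2.0
```

An RMSE of about 2.4 on a 1-10 scale therefore means the typical prediction is off by roughly two and a half rating points.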


In [17]:
dump.dump(file_name="/drive/My Drive/Colab Notebooks/large_svd.pkl",
          algo=svd_model,
          verbose=1)

The dump has been saved as file /drive/My Drive/Colab Notebooks/large_svd.pkl


In [31]:
import joblib
# target path: /drive/My Drive/Colab Notebooks/large_model_test.pkl
# save
joblib.dump(svd_model, "/drive/My Drive/Colab Notebooks/large_model_test.pkl")

['/drive/My Drive/Colab Notebooks/large_model_test.pkl']

Nice, we have a model trained on the top 1M songs! We've also saved it for later use.
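
As a rough sketch of how the saved file could be read back later: Surprise's `dump.load` returns a `(predictions, algo)` tuple, so the algorithm is the second element. The `load_saved_model` helper is a hypothetical wrapper, and the user and item ids in the usage comment are placeholders that would need to match raw ids from the CSV.

```python
import os

def load_saved_model(path):
    """Return the algorithm stored by surprise.dump.dump at `path`."""
    if not os.path.exists(path):
        raise FileNotFoundError(path)
    # Imported lazily so the path check works even without Surprise installed.
    from surprise import dump
    _predictions, algo = dump.load(path)
    return algo

# Usage (assumes the file written above exists; ids are placeholders):
# algo = load_saved_model("/drive/My Drive/Colab Notebooks/large_svd.pkl")
# print(algo.predict("some_user_id", "some_song_id").est)
```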

As a last try, we'll run the model on all the data to see if it works.

In [None]:
# load in full df to see if we can model or if we're limited to 1M songs
rated_listens=pd.read_csv('/drive/My Drive/Colab Notebooks/rated_listens_all.csv')
rated_listens.drop(columns=['Unnamed: 0'], inplace=True)

# read in values as Surprise dataset 
reader = Reader(rating_scale=(1,10))
data = Dataset.load_from_df(rated_listens, reader)

# examine data
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

# Split into train and test set
trainset, testset = train_test_split(data, test_size=0.2)

Number of users:  2643 

Number of items:  4660353


In [None]:
len(rated_listens)

12819261

In [None]:
# create final model(2)
# svd_model = SVD(n_factors= 150, reg_all=0.1)
# svd_model.fit(trainset)

Well, training on the full data set still won't run, but I'm happy with using the top 1M songs.

Although the SVD model has the best metrics, it's unfortunately too large for a lightweight deployment.

We'll need to go back to our model with 10k songs to see if that model is a more appropriate size.
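
A back-of-the-envelope estimate helps show why the 1M-song SVD model is heavy. The sketch below assumes the learned parameters are one factor vector and one bias per user and per item, stored as 8-byte floats. The `svd_param_bytes` helper is hypothetical, it uses the counts printed for the full data set rather than the 80% training split, and it ignores anything else stored in the saved object (such as the trainset), so the file on disk can differ.

```python
def svd_param_bytes(n_users, n_items, n_factors, bytes_per_value=8):
    """Bytes held by user/item factor matrices plus user/item bias vectors."""
    n_values = (n_users + n_items) * (n_factors + 1)
    return n_values * bytes_per_value

# Counts printed earlier for the top-1M data set, with n_factors=150.
size = svd_param_bytes(n_users=2598, n_items=1_000_000, n_factors=150)
print(f"{size / 1e9:.2f} GB")  # 1.21 GB
```

Under these assumptions the item factors dominate: the item count, not the user count, drives the size.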

In [24]:
# Creating small final model from top 10k songs:
# load in df
rated_listens=pd.read_csv('/drive/My Drive/Colab Notebooks/rated_listens_10k.csv')
rated_listens.drop(columns=['Unnamed: 0'], inplace=True)

# read in values as Surprise dataset 
reader = Reader(rating_scale=(1,10))
data = Dataset.load_from_df(rated_listens, reader)

# examine data
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

# Number of users:  2458

# Number of items:  10000

# Split into train and test set
trainset, testset = train_test_split(data, test_size=0.2)

Number of users:  2458 

Number of items:  10000


In [25]:
len(rated_listens)

# 1279949

1279949

In [26]:
# create small SVD final model
svd_small_model = SVD(n_factors= 150, reg_all=0.1)
svd_small_model.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f1858d55a10>

In [27]:
# get final model metrics
predictions = svd_small_model.test(testset)
print(accuracy.rmse(predictions))

# RMSE: 2.4106
# 2.410617976943369

RMSE: 2.4106
2.410617976943369


In [38]:
dump.dump(file_name="/drive/My Drive/Colab Notebooks/small_svd.pkl",
          algo=svd_small_model,
          verbose=1)

The dump has been saved as file /drive/My Drive/Colab Notebooks/small_svd.pkl


Unfortunately, this model is the same size. After much digging, I wasn't able to find a way to reduce the model's size, so I'll use a more lightweight model for now just to have something to deploy.

I'll use KNNBaseline, which was the second-best model after SVD.
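
One way to compare the saved artifacts directly is to read their sizes from disk. This is a minimal sketch using only the standard library; `file_sizes_mb` is a hypothetical helper, and the paths are the ones passed to `dump.dump` earlier in this notebook.

```python
import os

def file_sizes_mb(paths):
    """Map each existing path to its size in megabytes (10**6 bytes)."""
    return {p: os.path.getsize(p) / 1e6 for p in paths if os.path.exists(p)}

# Paths used by the dump.dump calls in this notebook.
candidates = [
    "/drive/My Drive/Colab Notebooks/large_svd.pkl",
    "/drive/My Drive/Colab Notebooks/small_svd.pkl",
]
for path, mb in file_sizes_mb(candidates).items():
    print(f"{path}: {mb:.1f} MB")
```

Paths that do not exist are skipped silently, so the loop prints nothing outside the Colab environment.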

In [33]:
# Creating final model from top 1M songs:
# load in df
rated_listens=pd.read_csv('/drive/My Drive/Colab Notebooks/rated_listens4.csv')
rated_listens.drop(columns=['Unnamed: 0'], inplace=True)

# read in values as Surprise dataset 
reader = Reader(rating_scale=(1,10))
data = Dataset.load_from_df(rated_listens, reader)

# examine data
dataset = data.build_full_trainset()
print('Number of users: ', dataset.n_users, '\n')
print('Number of items: ', dataset.n_items)

# Number of users:  2598 

# Number of items:  1000000

# Split into train and test set
trainset, testset = train_test_split(data, test_size=0.2)

Number of users:  2598 

Number of items:  1000000


In [36]:
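# create lightweight model
# Similarity options go in sim_options (e.g. sim_options={'name': 'pearson'});
# without it, Surprise uses the default MSD similarity, as the log below shows.
KNNb_model = KNNBaseline(k=20, min_k=1)
KNNb_model.fit(trainset)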

Estimating biases using als...
Computing the msd similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNBaseline at 0x7f197d56d750>

In [37]:
# get final model metrics
predictions = KNNb_model.test(testset)
print(accuracy.rmse(predictions))

# RMSE: 2.5762
# 2.576234903219563

RMSE: 2.5762
2.576234903219563


In [39]:
dump.dump(file_name="/drive/My Drive/Colab Notebooks/KNNb_model.pkl",
          algo=KNNb_model,
          verbose=1)

The dump has been saved as file /drive/My Drive/Colab Notebooks/KNNb_model.pkl
