In this notebook, we try another matrix factorisation algorithm, non-negative matrix factorisation (NMF), as implemented in the scikit-surprise (Surprise) library, and compare its metrics against those of SVD.
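
As a brief reminder of what the algorithm does, the prediction rule for Surprise's `NMF` in its default (unbiased) form can be written as below. The symbols $p_u$, $q_i$ and $k$ follow the usual matrix factorisation notation and are not defined elsewhere in this notebook; this is a summary sketch, not a full derivation of the optimisation procedure.

$$
\hat{r}_{ui} = q_i^{\top} p_u = \sum_{f=1}^{k} q_{if}\, p_{uf}, \qquad p_{uf} \ge 0,\; q_{if} \ge 0
$$

Here $p_u$ and $q_i$ are the $k$-dimensional user and item factor vectors (with $k$ set by `n_factors`), and the non-negativity constraint on the factors is what distinguishes NMF from SVD.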

## Contents:
- [Loading of Libraries](#Loading-of-Libraries) 
- [Loading of Datasets & Preprocessing](#Loading-of-Datasets-&-Preprocessing)
- [Alternative Model - NMF Algorithm](#Alternative-Model---NMF-Algorithm)
- [Model Tuning using GridSearch](#Model-Tuning-using-GridSearch)

## Loading of Libraries

In [40]:
import numpy as np
import pandas as pd

# imports from surprise
from surprise import accuracy, Dataset, Reader, NMF
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from collections import defaultdict

from sklearn import preprocessing # import label encoder

import difflib # helpers for computing deltas
import random

## Loading of Datasets & Preprocessing

In [None]:
# Loading of dataset
recsys_df = pd.read_csv('./instagram-dataset/recsys_df_name.csv')
print(recsys_df.shape)
recsys_df.head()

(177607, 7)


Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd
0,4519805.0,1178180.0,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB
1,259484700.0,1178180.0,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB
2,6364797000.0,1178180.0,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB
3,221389400.0,857670431.0,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB
4,624306600.0,857670431.0,2019-05-30 07:56:50,3,green park,"London, United Kingdom",GB


In [None]:
# LabelEncoder maps each distinct label to an integer code
label_encoder = preprocessing.LabelEncoder()

In [None]:
# Encode labels in column 'name' and 'profile_id'
recsys_df['new_location_id']= label_encoder.fit_transform(recsys_df['name'])
recsys_df['new_profile_id']= label_encoder.fit_transform(recsys_df['profile_id'])
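
The encoding step above can be illustrated with a minimal pure-Python sketch. It assumes, as `LabelEncoder` does, that codes are assigned by the sorted order of the distinct values; the helper `fit_label_mapping` and the three sample names are hypothetical and only for illustration.

```python
def fit_label_mapping(values):
    """Map each distinct value to its rank in sorted order (hypothetical helper)."""
    return {value: code for code, value in enumerate(sorted(set(values)))}


# Tiny illustrative sample, not the real dataset
names = ["la famiglia", "la famiglia", "green park"]

name_to_id = fit_label_mapping(names)
encoded = [name_to_id[name] for name in names]

# Keep the reverse mapping so encoded ids can be turned back into names later
id_to_name = {code: value for value, code in name_to_id.items()}

print(encoded)
print(id_to_name)
```

Note that the notebook reuses a single `label_encoder` for both columns, so after the second `fit_transform` it only remembers the `profile_id` classes. If location codes need to be decoded back to names later, keeping a separate encoder (or a reverse mapping as in the sketch) per column is one option.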

In [None]:
# Check new location and profile labels
recsys_df.head()

Unnamed: 0,profile_id,location_id,cts,sentiment_pred,name,city,cd,new_location_id,new_profile_id
0,4519805.0,1178180.0,2016-06-09 22:13:32,3,la famiglia,"London, United Kingdom",GB,3433,3168
1,259484700.0,1178180.0,2019-05-30 23:17:22,3,la famiglia,"London, United Kingdom",GB,3433,36386
2,6364797000.0,1178180.0,2019-05-26 15:27:27,3,la famiglia,"London, United Kingdom",GB,3433,99075
3,221389400.0,857670431.0,2019-05-30 21:41:15,2,green park,"London, United Kingdom",GB,2646,31024
4,624306600.0,857670431.0,2019-05-30 07:56:50,3,green park,"London, United Kingdom",GB,2646,55307


## Alternative Model - NMF Algorithm

In [None]:
# Instantiate reader
reader = Reader(rating_scale=(1, 3))

In [None]:
# Instantiate dataset 
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(recsys_df[["new_profile_id", "new_location_id", "sentiment_pred"]], reader)

In [None]:
# Cross-validate an NMF model using three-fold cross-validation
nmf = NMF(verbose=True)
cross_validate(nmf, data, measures=['RMSE','MSE', 'MAE'], cv=3, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Processing epoch 30
Processing epoch 31
Processing epoch 32
Processing epoch 33
Processing epoch 34
Processing epoch 35
Processing epoch 36
Processing epoch 37
Processing epoch 38
Processing epoch 39
Processing epoch 40
Processing epoch 41
Processing epoch 42
Processing epoch 43
Processing epoch 44
Processing epoch 45
Processing epoch 46
Processing epoch 47
Processing epoch 48
Processing epoch 49
Processing

{'test_rmse': array([0.65179366, 0.64862407, 0.65062449]),
 'test_mse': array([0.42483497, 0.42071319, 0.42331223]),
 'test_mae': array([0.52579703, 0.52361752, 0.5239038 ]),
 'fit_time': (10.282396793365479, 10.743699073791504, 10.930076837539673),
 'test_time': (0.7529940605163574, 0.8212952613830566, 0.7095677852630615)}
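
To summarise the cross-validation output as a single test score per metric, the per-fold values can be averaged. The sketch below copies the fold scores printed above into a plain dictionary (the names `cv_results` and `fold_means` are illustrative) rather than re-running `cross_validate`.

```python
from statistics import mean

# Per-fold test scores copied from the cross_validate output above
cv_results = {
    "test_rmse": [0.65179366, 0.64862407, 0.65062449],
    "test_mse": [0.42483497, 0.42071319, 0.42331223],
    "test_mae": [0.52579703, 0.52361752, 0.5239038],
}

# Average over the three folds and round to four decimal places
fold_means = {name: round(mean(scores), 4) for name, scores in cv_results.items()}

for name, value in fold_means.items():
    print(f"{name}: {value:.4f}")
```

These averages are the test values used when comparing against the training metrics.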

In [None]:
# Finding training metrics rmse
trainset = data.build_full_trainset()
nmf.fit(trainset)

testset = trainset.build_testset()
predictions = nmf.test(testset)
accuracy.rmse(predictions, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Processing epoch 30
Processing epoch 31
Processing epoch 32
Processing epoch 33
Processing epoch 34
Processing epoch 35
Processing epoch 36
Processing epoch 37
Processing epoch 38
Processing epoch 39
Processing epoch 40
Processing epoch 41
Processing epoch 42
Processing epoch 43
Processing epoch 44
Processing epoch 45
Processing epoch 46
Processing epoch 47
Processing epoch 48
Processing epoch 49
RMSE: 0.25

0.25631863894160173

In [None]:
# Finding training metrics mse
trainset = data.build_full_trainset()
nmf.fit(trainset)

testset = trainset.build_testset()
predictions = nmf.test(testset)
accuracy.mse(predictions, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Processing epoch 30
Processing epoch 31
Processing epoch 32
Processing epoch 33
Processing epoch 34
Processing epoch 35
Processing epoch 36
Processing epoch 37
Processing epoch 38
Processing epoch 39
Processing epoch 40
Processing epoch 41
Processing epoch 42
Processing epoch 43
Processing epoch 44
Processing epoch 45
Processing epoch 46
Processing epoch 47
Processing epoch 48
Processing epoch 49
MSE: 0.065

0.06573681194153039

In [None]:
# Finding training metrics mae
trainset = data.build_full_trainset()
nmf.fit(trainset)

testset = trainset.build_testset()
predictions = nmf.test(testset)
accuracy.mae(predictions, verbose=True)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Processing epoch 30
Processing epoch 31
Processing epoch 32
Processing epoch 33
Processing epoch 34
Processing epoch 35
Processing epoch 36
Processing epoch 37
Processing epoch 38
Processing epoch 39
Processing epoch 40
Processing epoch 41
Processing epoch 42
Processing epoch 43
Processing epoch 44
Processing epoch 45
Processing epoch 46
Processing epoch 47
Processing epoch 48
Processing epoch 49
MAE:  0.10

0.10408279533085263

- The performance of the NMF model is as follows:

|Metric|train rmse|test rmse|train mse|test mse|train mae|test mae|
|---|---|---|---|---|---|---|
|NMF|0.2563|0.6503|0.0657|0.4230|0.1041|0.5244|

- The train metrics are significantly lower than the test metrics (for example, a train RMSE of 0.2563 against a test RMSE of 0.6503), so the model is overfitting.
- We will use GridSearch to look for better hyperparameters.

## Model Tuning using GridSearch

#### GridSearch1

In [52]:
# Search over the following values of hyperparameters:
# Number of factors: 10, 15
# Number of epochs: 5, 10
# Regularization term for users: 0.08, 0.1, 0.12
# Regularization term for items: 0.08, 0.1, 0.12

param_grid = {"n_factors": [10, 15],
              "n_epochs": [5, 10], 
              "reg_pu": [0.08, 0.1, 0.12],
              "reg_qi": [0.08, 0.1, 0.12]}

In [42]:
# Instantiate GridSearchCV using cv=3
gs = GridSearchCV(NMF, param_grid, measures=['RMSE','MSE', 'MAE'], cv=3)
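
To get a rough sense of the cost of this search before running it, the number of model fits can be counted. This is a minimal sketch that assumes every combination in the grid is fitted once per fold and that no refit on the full data is requested; `n_folds` simply mirrors the `cv=3` argument above.

```python
from itertools import product

param_grid = {"n_factors": [10, 15],
              "n_epochs": [5, 10],
              "reg_pu": [0.08, 0.1, 0.12],
              "reg_qi": [0.08, 0.1, 0.12]}
n_folds = 3  # mirrors cv=3 in the GridSearchCV call

# Every combination of hyperparameter values is one candidate model
combinations = list(product(*param_grid.values()))
n_fits = len(combinations) * n_folds

print(f"{len(combinations)} combinations x {n_folds} folds = {n_fits} fits")
```

Adding one more value to any list multiplies the number of candidates, so the grid is best kept small.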

In [43]:
%%time
# Fit GridSearch to training data
gs.fit(data)

CPU times: user 3min 43s, sys: 2.8 s, total: 3min 46s
Wall time: 3min 51s


In [44]:
# Print metric score and combination of parameters that gave the best metric score
for metric in ['rmse','mse', 'mae']:
    print(f'Test {metric}: {gs.best_score[metric]}')
    print(f'Test best params: {gs.best_params[metric]}')

Test rmse: 0.6447029546428399
Test best params: {'n_factors': 15, 'n_epochs': 10, 'reg_pu': 0.12, 'reg_qi': 0.12}
Test mse: 0.4156436002251542
Test best params: {'n_factors': 15, 'n_epochs': 10, 'reg_pu': 0.12, 'reg_qi': 0.12}
Test mae: 0.5050483799970888
Test best params: {'n_factors': 15, 'n_epochs': 10, 'reg_pu': 0.08, 'reg_qi': 0.08}


In [45]:
# Finding training metrics rmse
algo1 = gs.best_estimator['rmse']
trainset = data.build_full_trainset()
algo1.fit(trainset)

testset = trainset.build_testset()
predictions = algo1.test(testset)

accuracy.rmse(predictions, verbose=True)

RMSE: 0.3138


0.31377034733328735

In [46]:
# Finding training metrics mse
algo2 = gs.best_estimator['mse']
trainset = data.build_full_trainset()
algo2.fit(trainset)

testset = trainset.build_testset()
predictions = algo2.test(testset)

accuracy.mse(predictions, verbose=True)

MSE: 0.0997


0.09972388959384498

In [47]:
# Finding training metrics mae
algo3 = gs.best_estimator['mae']
trainset = data.build_full_trainset()
algo3.fit(trainset)

testset = trainset.build_testset()
predictions = algo3.test(testset)

accuracy.mae(predictions, verbose=True)

MAE:  0.2117


0.21171283922039708

- The results after GridSearch1 are as follows:

|Metric|train rmse|test rmse|train mse|test mse|train mae|test mae|
|---|---|---|---|---|---|---|
|NMF|0.2563|0.6503|0.0657|0.4230|0.1041|0.5244|
|GridSearch1|0.3138|0.6447|0.0997|0.4156|0.2117|0.5050|
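
One simple way to quantify overfitting from the table above is the gap between test and train error for each metric. The sketch below hard-codes the table values; the names `results` and `gaps` are illustrative.

```python
# (train, test) pairs copied from the table above
results = {
    "NMF": {"rmse": (0.2563, 0.6503), "mse": (0.0657, 0.4230), "mae": (0.1041, 0.5244)},
    "GridSearch1": {"rmse": (0.3138, 0.6447), "mse": (0.0997, 0.4156), "mae": (0.2117, 0.5050)},
}

# Gap = test error minus train error; a larger gap suggests more overfitting
gaps = {
    model: {metric: round(test - train, 4) for metric, (train, test) in metrics.items()}
    for model, metrics in results.items()
}

for model, model_gaps in gaps.items():
    print(model, model_gaps)
```

A smaller gap for the tuned model on every metric would indicate reduced, though not eliminated, overfitting.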

- The test metrics (rmse, mse and mae) have improved slightly compared to the original NMF model, while the train metrics have increased.
- The gap between train and test metrics has narrowed, but the model is still overfitting, since the train metrics remain significantly lower than the test metrics.
- Compared to the SVD algorithm, the NMF model overfits more.
- Thus, we will revert to the SVD algorithm for our recommender system.
- Next, we will use MLflow to track the performance of the SVD model from the previous notebook. The code is contained in the [`mlflow`](../mlflow/) folder.
- In addition, we will also attempt to fit a matrix factorization neural network in the next notebook.