# Recommender Systems, Surprise (library), SVD, SVD++, NMF

Take the `movielens` dataset and build a matrix factorization model. In the `surprise` library, it is called SVD. Select the best parameters using cross-validation, experiment with other algorithms (SVD++, NMF), and choose the one that is optimal.
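
As background, the `SVD` algorithm in `surprise` predicts a rating as the global mean plus a user bias, an item bias, and the dot product of the user and item latent factor vectors. The sketch below illustrates that prediction rule in plain Python. The function name `predict_rating` and all numeric values are hypothetical and chosen for illustration; they are not taken from the library's code or from a trained model.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i):
    """Biased matrix factorization prediction: mu + b_u + b_i + q_i . p_u."""
    dot = sum(p * q for p, q in zip(p_u, q_i))
    return mu + b_u + b_i + dot


# Hypothetical values for one user/movie pair (not from a trained model).
mu = 3.5            # global mean rating
b_u = 0.25          # this user rates slightly above average
b_i = -0.5          # this movie is rated slightly below average
p_u = [0.5, -1.0]   # user latent factors
q_i = [1.0, 0.25]   # item latent factors

estimate = predict_rating(mu, b_u, b_i, p_u, q_i)
print(estimate)
```

SVD++ extends the user term with factors derived from implicit feedback (which items a user has rated), and NMF keeps the user and item factors non-negative.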

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from surprise import Dataset, Reader, SVD, SVDpp, NMF, accuracy
from surprise.model_selection import cross_validate, GridSearchCV, train_test_split

In [2]:
# Load the "movielens" dataset.
data = Dataset.load_builtin('ml-100k')

In [3]:
# Let's see what data the dataset contains.
df = pd.DataFrame(data.raw_ratings, columns=['user_id', 'movie_id', 'rating', 'timestamp'])
print(df.head(10))

  user_id movie_id  rating  timestamp
0     196      242     3.0  881250949
1     186      302     3.0  891717742
2      22      377     1.0  878887116
3     244       51     2.0  880606923
4     166      346     1.0  886397596
5     298      474     4.0  884182806
6     115      265     2.0  881171488
7     253      465     5.0  891628467
8     305      451     3.0  886324817
9       6       86     3.0  883603013


Let's build a model with each of three algorithms, `SVD`, `SVD++`, and `NMF`, and use 5-fold cross-validation to estimate the `RMSE` and `MAE` metrics for each.
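
Both metrics measure the gap between true and predicted ratings: RMSE squares the errors before averaging, so large misses weigh more, while MAE averages the absolute errors. The following is a minimal sketch of the two formulas in plain Python. The helper names `rmse` and `mae` and the rating pairs are assumptions made for illustration; they are not the `surprise.accuracy` implementation.

```python
import math


def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((r - est) ** 2 for r, est in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(r - est) for r, est in pairs) / len(pairs)


# Hypothetical (true rating, predicted rating) pairs, for illustration only.
pairs = [(4.0, 3.0), (3.0, 3.0), (5.0, 4.0), (1.0, 3.0)]

print(f"RMSE: {rmse(pairs):.4f}")
print(f"MAE:  {mae(pairs):.4f}")
```

In this toy example the single error of 2 stars pushes RMSE above MAE, which is why the two metrics can rank models differently.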

In [4]:
# SVD
algo = SVD()
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9390  0.9363  0.9356  0.9378  0.9336  0.9365  0.0019  
MAE (testset)     0.7425  0.7390  0.7363  0.7364  0.7353  0.7379  0.0026  
Fit time          0.89    0.89    0.91    0.89    0.90    0.90    0.01    
Test time         0.11    0.16    0.10    0.11    0.10    0.12    0.02    


{'test_rmse': array([0.93896676, 0.93633029, 0.93556503, 0.93784516, 0.93357806]),
 'test_mae': array([0.74254483, 0.73901385, 0.73628676, 0.73636909, 0.73529554]),
 'fit_time': (0.887467622756958,
  0.8944242000579834,
  0.9050407409667969,
  0.8865628242492676,
  0.9034526348114014),
 'test_time': (0.11262035369873047,
  0.16237854957580566,
  0.10328793525695801,
  0.1114358901977539,
  0.1048281192779541)}
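
The `Mean` and `Std` columns of the table can be recomputed from the per-fold values in the returned dictionary. The sketch below does this for the SVD RMSE values printed above. The use of the population standard deviation (dividing by the number of folds) is inferred from the printed numbers, not taken from the library documentation.

```python
import statistics

# Per-fold RMSE values copied from the SVD output above.
fold_rmse = [0.93896676, 0.93633029, 0.93556503, 0.93784516, 0.93357806]

mean_rmse = statistics.mean(fold_rmse)
# The printed Std is reproduced by the population standard deviation.
std_rmse = statistics.pstdev(fold_rmse)

print(f"Mean: {mean_rmse:.4f}, Std: {std_rmse:.4f}")
```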

In [5]:
# SVD++
algo_svdpp = SVDpp()
cross_validate(algo_svdpp, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm SVDpp on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9227  0.9123  0.9244  0.9248  0.9190  0.9206  0.0046  
MAE (testset)     0.7227  0.7169  0.7258  0.7232  0.7236  0.7224  0.0030  
Fit time          19.95   19.99   20.07   19.92   19.99   19.98   0.05    
Test time         3.14    3.16    3.17    3.13    3.12    3.14    0.02    


{'test_rmse': array([0.92271674, 0.91230469, 0.92437204, 0.92476651, 0.91897532]),
 'test_mae': array([0.72268506, 0.71691472, 0.72584466, 0.72318087, 0.72362248]),
 'fit_time': (19.95340394973755,
  19.99152636528015,
  20.067801237106323,
  19.9155113697052,
  19.99117946624756),
 'test_time': (3.1406874656677246,
  3.1610875129699707,
  3.1685688495635986,
  3.1338250637054443,
  3.1179184913635254)}

In [6]:
# NMF
algo_nmf = NMF()
cross_validate(algo_nmf, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Evaluating RMSE, MAE of algorithm NMF on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9654  0.9670  0.9681  0.9520  0.9634  0.9632  0.0058  
MAE (testset)     0.7590  0.7590  0.7616  0.7481  0.7566  0.7569  0.0046  
Fit time          1.67    1.86    1.85    1.61    1.62    1.72    0.11    
Test time         0.08    0.18    0.09    0.08    0.08    0.10    0.04    


{'test_rmse': array([0.96542282, 0.96704694, 0.9681035 , 0.95198336, 0.96343812]),
 'test_mae': array([0.75899955, 0.75904029, 0.76155225, 0.7481255 , 0.75660774]),
 'fit_time': (1.6699450016021729,
  1.859358310699463,
  1.85313081741333,
  1.6117832660675049,
  1.6186320781707764),
 'test_time': (0.08220338821411133,
  0.17684221267700195,
  0.09241461753845215,
  0.08104872703552246,
  0.08226251602172852)}

We obtained RMSE and MAE estimates for all three algorithms. The following conclusions can be drawn:

- **SVD Algorithm**: It has an average RMSE of 0.9365 and an average MAE of 0.7379. It was the fastest to train (about 0.9 s per fold) and took about 0.12 s per fold to test.

- **SVD++ Algorithm**: It showed better results than SVD, with an average RMSE of 0.9206 and an average MAE of 0.7224, indicating higher prediction accuracy. However, training (about 20 s per fold) and testing (about 3.1 s per fold) take more than 20 times longer than for SVD, which is an important consideration for large datasets or limited computational resources.

- **NMF Algorithm**: It has the highest error values among the three algorithms, with an average RMSE of 0.9632 and an average MAE of 0.7569, indicating lower prediction accuracy than SVD and SVD++. Its testing time is similar to that of SVD (about 0.10 s per fold), while training takes roughly twice as long (about 1.7 s per fold).

*In conclusion, the SVD++ algorithm seems to be the best choice here: it is the most accurate, and the dataset is small enough (100,000 ratings) for its longer training time to be acceptable. With a larger dataset, such as one with 1 million ratings, the SVD algorithm would likely be more suitable due to its lower computational demands.*
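
To put the cost difference in numbers, the short sketch below divides each algorithm's mean fit and test time by the SVD baseline. The timings are copied from the cross-validation tables above; the dictionary layout is only an illustrative choice.

```python
# Mean fit and test times in seconds, copied from the tables above.
mean_times = {
    "SVD":   {"fit": 0.90,  "test": 0.12},
    "SVD++": {"fit": 19.98, "test": 3.14},
    "NMF":   {"fit": 1.72,  "test": 0.10},
}

baseline = mean_times["SVD"]
ratios = {}
for name, t in mean_times.items():
    # Ratio > 1 means slower than SVD, ratio < 1 means faster.
    ratios[name] = {
        "fit": t["fit"] / baseline["fit"],
        "test": t["test"] / baseline["test"],
    }
    print(f"{name}: fit x{ratios[name]['fit']:.1f}, test x{ratios[name]['test']:.1f} relative to SVD")
```

These ratios apply to this machine and this dataset only; they are a rough guide to relative cost, not a benchmark.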

***

**Let's apply the GridSearchCV method to find the optimal hyperparameters.**

In [15]:
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0.4, 0.6]}
gs = GridSearchCV(SVDpp, param_grid, measures=["rmse", "mae"], cv=3)

gs.fit(data)

print(f"RMSE for SVD++: {gs.best_score['rmse']}")
print(gs.best_params['rmse'])
print(f"MAE for SVD++: {gs.best_score['mae']}")
print(gs.best_params['mae'])

algo_rmse_svd_pp = SVDpp(**gs.best_params['rmse'])
algo_mae_svd_pp = SVDpp(**gs.best_params['mae'])

RMSE for SVD++: 0.9636194713584136
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
MAE for SVD++: 0.772301661207014
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
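
An exhaustive grid search evaluates every combination of the listed values, once per cross-validation fold. The sketch below enumerates the combinations for the grid used above with `itertools.product`, to show how many model fits the search implies. It is an illustration of the idea and does not reproduce the internals of `GridSearchCV`.

```python
from itertools import product

# Same grid and fold count as in the SVD++ search above.
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0.4, 0.6]}
cv = 3

# Build one dict per combination, taking one value for each parameter.
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(f"{len(combinations)} combinations x {cv} folds = {len(combinations) * cv} fits")
for combo in combinations[:3]:
    print(combo)
```

By the same count, the NMF grid used later (2 x 3 x 3 x 3 values) has 54 combinations, or 162 fits with 3 folds.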


The grid search selected the same hyperparameters for both metrics. To check the accuracy of the tuned model, we split the dataset into training and test sets in an 80:20 ratio, fit the model on the training set, and evaluate it on the test set.

In [16]:
trainset, testset = train_test_split(data, test_size=0.2)

In [17]:
# Training the model on the training set.
algo_rmse_svd_pp.fit(trainset)
algo_mae_svd_pp.fit(trainset)

# Making predictions on the test set.
predictions_rmse = algo_rmse_svd_pp.test(testset)
predictions_mae = algo_mae_svd_pp.test(testset)

# Evaluation of the SVD++ model.
accuracy_rmse_svd_pp = accuracy.rmse(predictions_rmse)
accuracy_mae_svd_pp = accuracy.mae(predictions_mae)

RMSE: 0.9562
MAE:  0.7664


In [10]:
# Finding the optimal hyperparameters for the SVD algorithm.
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005], "reg_all": [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)

gs.fit(data)

print(f"RMSE for SVD: {gs.best_score['rmse']}")
print(gs.best_params['rmse'])
print(f"MAE for SVD: {gs.best_score['mae']}")
print(gs.best_params['mae'])

algo_rmse_svd = SVD(**gs.best_params['rmse'])
algo_mae_svd = SVD(**gs.best_params['mae'])

RMSE for SVD: 0.9643222690994252
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}
MAE for SVD: 0.7729356554441926
{'n_epochs': 10, 'lr_all': 0.005, 'reg_all': 0.4}


In [11]:
trainset, testset = train_test_split(data, test_size=0.2)

algo_rmse_svd.fit(trainset)
algo_mae_svd.fit(trainset)

predictions_rmse = algo_rmse_svd.test(testset)
predictions_mae = algo_mae_svd.test(testset)

accuracy_rmse_svd = accuracy.rmse(predictions_rmse)
accuracy_mae_svd = accuracy.mae(predictions_mae)

RMSE: 0.9584
MAE:  0.7686


In [12]:
# Finding the optimal hyperparameters for the NMF algorithm.
param_grid = {"n_epochs": [5, 10], "n_factors": [15, 50, 100], "reg_pu": [0.05, 0.1, 0.15], "reg_qi": [0.05, 0.1, 0.15]}
gs = GridSearchCV(NMF, param_grid, measures=["rmse", "mae"], cv=3)

gs.fit(data)

print(f"RMSE for NMF: {gs.best_score['rmse']}")
print(gs.best_params['rmse'])
print(f"MAE for NMF: {gs.best_score['mae']}")
print(gs.best_params['mae'])

algo_rmse_nmf = NMF(**gs.best_params['rmse'])
algo_mae_nmf = NMF(**gs.best_params['mae'])

RMSE for NMF: 0.9671994564588698
{'n_epochs': 10, 'n_factors': 15, 'reg_pu': 0.15, 'reg_qi': 0.15}
MAE for NMF: 0.760068430826902
{'n_epochs': 10, 'n_factors': 15, 'reg_pu': 0.1, 'reg_qi': 0.15}


In [13]:
trainset, testset = train_test_split(data, test_size=0.2)

algo_rmse_nmf.fit(trainset)
algo_mae_nmf.fit(trainset)

predictions_rmse = algo_rmse_nmf.test(testset)
predictions_mae = algo_mae_nmf.test(testset)

accuracy_rmse_nmf = accuracy.rmse(predictions_rmse)
accuracy_mae_nmf = accuracy.mae(predictions_mae)

RMSE: 0.9613
MAE:  0.7547


In [18]:
# Comparison:
print(f"SVD: RMSE - {accuracy_rmse_svd}, MAE - {accuracy_mae_svd}")
print(f"SVD++: RMSE - {accuracy_rmse_svd_pp}, MAE - {accuracy_mae_svd_pp}")
print(f"NMF: RMSE - {accuracy_rmse_nmf}, MAE - {accuracy_mae_nmf}")

SVD: RMSE - 0.9583904462140694, MAE - 0.7685710921496915
SVD++: RMSE - 0.9561868993977636, MAE - 0.7663619030161828
NMF: RMSE - 0.9613348970239839, MAE - 0.7546872291049601


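On the hold-out sets, the SVD++ algorithm showed the best RMSE (0.9562), while NMF had the best MAE (0.7547) and SVD++ came second (0.7664). Each model was evaluated on its own random 80:20 split, so small differences between them may come from the split rather than from the algorithm. The results also depend on the values in `param_grid`: with the selected parameters, SVD++ reached an RMSE of 0.9636 in the 3-fold grid search, which is worse than the 0.9206 obtained with default parameters in 5-fold cross-validation. This suggests that the grid (at most 10 epochs, `reg_all` of 0.4 or 0.6) does not cover the best-performing region.

In the 5-fold cross-validation with default parameters, SVD++ was also the best, with the lowest RMSE (0.9206) and MAE (0.7224).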
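
The three hold-out evaluations above each call `train_test_split` without a seed, so every model is scored on a different random 20% of the data. One way to make the comparison like-for-like is to fix the seed so that all models share one split; `surprise`'s `train_test_split` accepts a `random_state` argument for this purpose. The sketch below illustrates the idea in plain Python. The function `split_80_20` and the toy ratings are hypothetical and are not part of the notebook or the library.

```python
import random


def split_80_20(ratings, test_size=0.2, seed=42):
    """Shuffle with a fixed seed and split into (train, test) lists."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical (user_id, movie_id, rating) rows, for illustration only.
ratings = [(str(u), str(m), float((u + m) % 5 + 1)) for u in range(10) for m in range(10)]

train_a, test_a = split_80_20(ratings)
train_b, test_b = split_80_20(ratings)

print(len(train_a), len(test_a))
print(test_a == test_b)  # same seed, same test set
```

With a shared test set, differences in RMSE and MAE between the algorithms reflect the models rather than the sampling.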