# Book Recommendations

The goal of the project is to build a recommender system that suggests relevant books based on a reader's interests.

## Collaborative filtering using the `surprise` library

## Model training and evaluation

Train the models that performed best in the cross-validation experiments and evaluate them on the held-out test set.

In [47]:
# For reproducible experiments

import random
import numpy as np

my_seed = 42
random.seed(my_seed)
np.random.seed(my_seed)

In [48]:
# Import algorithms from Surprise

from surprise import NormalPredictor, SVDpp, SlopeOne, KNNBaseline

## 1. Load and prepare the data

Create the Surprise datasets for the training and test data:

In [24]:
path_train = '../data/ratings_train.csv'
path_test = '../data/ratings_test.csv'

min_rating = 1
max_rating = 5

In [43]:
from modules.data_prep import load_data_surprise

train_data = load_data_surprise(path_train, min_rating, max_rating)
test_data = load_data_surprise(path_test, min_rating, max_rating)

Build a Surprise training set:

In [44]:
from modules.data_prep import build_trainset_surprise

trainset = build_trainset_surprise(train_data)

Build a Surprise test set:

In [45]:
from modules.data_prep import build_testset_surprise

testset = build_testset_surprise(test_data)

In [7]:
# Number of test examples

len(testset)

896472

## 2. Train the shortlisted models and evaluate on the test set

### 2.1 Normal Predictor (random predictions) as a baseline model

Fit the algorithm and compute training error:

In [9]:
from modules.modeling import fit_surprise_algorithm

normal_predictor = fit_surprise_algorithm(NormalPredictor(), trainset)

MAE:  1.0499
RMSE: 1.3230
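
For context, Surprise documents `NormalPredictor` as drawing each prediction from a normal distribution whose mean and standard deviation are estimated from the training ratings. The following is a minimal pure-Python sketch of that idea, not the library's code: the helper names and toy ratings are illustrative, and the clipping bounds are an assumption that mirrors `min_rating` and `max_rating` above.

```python
import random
import statistics

def fit_normal(ratings):
    """Estimate the mean and population standard deviation of the ratings."""
    return statistics.fmean(ratings), statistics.pstdev(ratings)

def predict_random(mu, sigma, rng, low=1, high=5):
    """Draw one rating from N(mu, sigma^2) and clip it to the rating scale."""
    return min(high, max(low, rng.gauss(mu, sigma)))

toy_ratings = [5, 4, 4, 3, 5, 2, 4]  # made-up training ratings
mu, sigma = fit_normal(toy_ratings)

rng = random.Random(42)  # local generator, so the sketch is reproducible
toy_predictions = [predict_random(mu, sigma, rng) for _ in range(5)]
```

Because these predictions ignore the user and the item, their error gives a floor that any useful model should beat.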


Evaluate on the test set:

In [10]:
from modules.modeling import evaluate_performance

np_predictions_test = evaluate_performance(normal_predictor, testset)

MAE:  1.0505
RMSE: 1.3238
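
`evaluate_performance` is a project helper whose source is not shown here. The sketch below only illustrates the standard definitions of the two metrics it reports, using made-up (true rating, estimated rating) pairs; the function names are hypothetical.

```python
import math

def mae(pairs):
    """Mean absolute error over (true, estimated) rating pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

def rmse(pairs):
    """Root mean squared error over (true, estimated) rating pairs."""
    return math.sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

toy_pairs = [(4.0, 3.5), (5.0, 4.0), (2.0, 3.0), (3.0, 3.0)]  # made-up data
toy_mae = mae(toy_pairs)    # (0.5 + 1 + 1 + 0) / 4 = 0.625
toy_rmse = rmse(toy_pairs)  # sqrt((0.25 + 1 + 1 + 0) / 4) = 0.75
```

RMSE squares the errors before averaging, so it penalizes large misses more heavily and is never smaller than MAE, which matches every pair of values reported in this notebook.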


Serialize the model:

In [11]:
from surprise import dump

dump.dump('../models/normal_predictor', algo=normal_predictor, verbose=1)

The dump has been saved as file ../models/normal_predictor


In [12]:
# Save the algorithm along with test predictions

dump.dump('../models/normal_predictor_pred_test', algo=normal_predictor, 
          predictions=np_predictions_test, verbose=1)

The dump has been saved as file ../models/normal_predictor_pred_test


### 2.2 Slope One

Fit the algorithm and compute training error:

In [13]:
slope_one = fit_surprise_algorithm(SlopeOne(), trainset)

MAE:  0.5847
RMSE: 0.7582
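
As a rough illustration of what Slope One computes, the sketch below follows the prediction rule given in the Surprise documentation: the user's mean rating plus the average deviation between the target item and the items the user has already rated. It is a toy re-implementation on made-up data, not the library's code; the function names and the ratings dictionary are hypothetical.

```python
from collections import defaultdict

def fit_slope_one(ratings):
    """ratings maps user -> {item: rating}. Returns (dev, user_means)."""
    diff_sum = defaultdict(float)
    count = defaultdict(int)
    for items in ratings.values():
        for i, r_i in items.items():
            for j, r_j in items.items():
                if i != j:
                    diff_sum[(i, j)] += r_i - r_j
                    count[(i, j)] += 1
    # dev[(i, j)]: average of (rating of i - rating of j) over users who rated both
    dev = {pair: diff_sum[pair] / count[pair] for pair in count}
    user_means = {u: sum(items.values()) / len(items) for u, items in ratings.items()}
    return dev, user_means

def predict_slope_one(user, item, ratings, dev, user_means):
    """User mean plus the average deviation of `item` from the user's rated items."""
    relevant = [j for j in ratings[user] if (item, j) in dev]
    if not relevant:
        return user_means[user]  # no co-rated items: fall back to the user mean
    return user_means[user] + sum(dev[(item, j)] for j in relevant) / len(relevant)

toy_ratings = {  # made-up data
    "ann": {"A": 5, "B": 3},
    "bob": {"A": 4, "B": 2, "C": 4},
    "cat": {"B": 4, "C": 5},
}
dev, user_means = fit_slope_one(toy_ratings)
ann_on_c = predict_slope_one("ann", "C", toy_ratings, dev, user_means)
# dev(C, A) = 0, dev(C, B) = 1.5, mean(ann) = 4, so the estimate is 4 + 0.75 = 4.75
```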


Evaluate on the test set:

In [14]:
so_predictions_test = evaluate_performance(slope_one, testset)

MAE:  0.6583
RMSE: 0.8460


Save the model:

In [15]:
dump.dump('../models/slope_one', algo=slope_one, verbose=1)

The dump has been saved as file ../models/slope_one


In [16]:
# Save with test predictions

dump.dump('../models/slope_one_pred_test', algo=slope_one, 
          predictions=so_predictions_test, verbose=1)

The dump has been saved as file ../models/slope_one_pred_test


### 2.3 SVD++

Fit the algorithm with the following hyperparameters and compute the training error:
- n_factors=8
- n_epochs=15
- lr_all=0.01
- reg_all=0.02

In [17]:
svdpp = fit_surprise_algorithm(SVDpp(n_factors=8, n_epochs=15, lr_all=0.01, reg_all=0.02), 
                               trainset)

MAE:  0.5853
RMSE: 0.7573
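
For orientation, the SVD++ prediction rule as stated in the Surprise documentation is reproduced below. With the settings above, the factor vectors $p_u$, $q_i$ and $y_j$ have length `n_factors=8`; `lr_all` and `reg_all` set the learning rate and the regularization term for all parameters, and `n_epochs` is the number of SGD passes over the training data.

$$
\hat{r}_{ui} = \mu + b_u + b_i + q_i^{\top}\Bigl(p_u + |I_u|^{-1/2} \sum_{j \in I_u} y_j\Bigr)
$$

Here $\mu$ is the global mean rating, $b_u$ and $b_i$ are the user and item biases, and $I_u$ is the set of items rated by user $u$. The $y_j$ terms capture implicit feedback: the fact that a user rated an item at all, regardless of the rating value.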


Evaluate on the test set:

In [18]:
svdpp_predictions_test = evaluate_performance(svdpp, testset)

MAE:  0.6293
RMSE: 0.8161


Save the model:

In [19]:
dump.dump('../models/svdpp', algo=svdpp, verbose=1)

The dump has been saved as file ../models/svdpp


In [20]:
# Save together with test predictions

dump.dump('../models/svdpp_pred_test', algo=svdpp, 
          predictions=svdpp_predictions_test, verbose=1)

The dump has been saved as file ../models/svdpp_pred_test


In [25]:
# Mean rating for test data = 3.92

mean_rating_test = np.mean([row[2] for row in testset])
mean_rating_test

3.9198658742269696

### 2.4 k-NN Baseline: item-based

Fit the algorithm and compute training error:

In [49]:
from modules.modeling import fit_surprise_algorithm

# Similarity measure configuration
sim_options = {'name': 'pearson_baseline',  # recommended in the surprise docs
               'user_based': False,  # compute similarities between items
               'min_support': 10}  # min number of common users for the similarity not to be zero

# Fit the algorithm
knn_baseline_items = fit_surprise_algorithm(KNNBaseline(sim_options=sim_options), 
                                            trainset)

Estimating biases using als...
Computing the pearson_baseline similarity matrix...
Done computing similarity matrix.
MAE:  0.3311
RMSE: 0.4450
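
The effect of `min_support` can be illustrated with a simplified sketch. It uses a plain Pearson correlation between two items over their common users, whereas the `pearson_baseline` measure configured above centers ratings on baseline estimates and applies shrinkage, so the numbers will differ from what Surprise computes. The function name and toy ratings are hypothetical.

```python
import math

def item_pearson(ratings_i, ratings_j, min_support=10):
    """Pearson correlation of two items over the users who rated both.

    ratings_i and ratings_j map user -> rating. The similarity is forced to
    zero when fewer than `min_support` users rated both items.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if len(common) < min_support:
        return 0.0
    mean_i = sum(ratings_i[u] for u in common) / len(common)
    mean_j = sum(ratings_j[u] for u in common) / len(common)
    num = sum((ratings_i[u] - mean_i) * (ratings_j[u] - mean_j) for u in common)
    den = math.sqrt(
        sum((ratings_i[u] - mean_i) ** 2 for u in common)
        * sum((ratings_j[u] - mean_j) ** 2 for u in common)
    )
    return num / den if den else 0.0

item_x = {"u1": 5, "u2": 3, "u3": 4}            # made-up ratings
item_y = {"u1": 4, "u2": 2, "u3": 3, "u4": 5}   # made-up ratings

sim_low_support = item_pearson(item_x, item_y, min_support=3)    # 3 common users: computed
sim_high_support = item_pearson(item_x, item_y, min_support=10)  # too few common users
```

Requiring a minimum number of common users guards against high similarities that rest on only a handful of co-ratings.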


Evaluate on the test set:

In [50]:
from modules.modeling import evaluate_performance

knn_predictions_test = evaluate_performance(knn_baseline_items, testset)

MAE:  0.6041
RMSE: 0.7974


Save the model:

In [51]:
from surprise import dump

dump.dump('../models/knn_baseline_items', algo=knn_baseline_items, verbose=1)

The dump has been saved as file ../models/knn_baseline_items


In [52]:
# Save with test predictions

dump.dump('../models/knn_baseline_items_pred_test', algo=knn_baseline_items, 
          predictions=knn_predictions_test, verbose=1)

The dump has been saved as file ../models/knn_baseline_items_pred_test


### Conclusions

- Slope One (MAE=0.6583), SVD++ (MAE=0.6293) and item-based k-NN Baseline (MAE=0.6041) give fairly close results on the test set.
- The k-NN Baseline model performed best: its test MAE is 42% lower than that of the Normal Predictor baseline (MAE=1.0505).
- On average, k-NN Baseline predictions are off by about 0.60 rating points, which is roughly 15% of the mean test rating (3.92).
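
The two percentages quoted in the conclusions can be rechecked from the test metrics reported earlier in the notebook. The variable names are illustrative; the values are copied from the cell outputs.

```python
mae_normal = 1.0505      # Normal Predictor, test MAE
mae_knn = 0.6041         # k-NN Baseline (item-based), test MAE
mean_rating_test = 3.92  # mean rating in the test data

# Relative reduction in MAE compared to the random baseline
relative_reduction = (mae_normal - mae_knn) / mae_normal

# MAE expressed as a fraction of the mean test rating
relative_error = mae_knn / mean_rating_test

print(f"MAE reduction vs. baseline: {relative_reduction:.1%}")
print(f"MAE relative to mean rating: {relative_error:.1%}")
```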


### Next steps

- Build a demo app that uses the k-NN Baseline or SVD++ model to return book recommendations for a given user ID.
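
One possible starting point for such a demo is a helper that ranks predicted ratings for a single user. The sketch below assumes predictions are available as plain `(user_id, book_id, estimated_rating)` tuples; that layout, the function name and the toy data are all hypothetical, and a real app would first need to score books the user has not yet rated.

```python
def top_n_for_user(predictions, user_id, n=3):
    """Return the n highest-scored (book_id, estimate) pairs for one user."""
    scored = [(book_id, est) for uid, book_id, est in predictions if uid == user_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

toy_predictions = [  # made-up estimates
    ("u1", "b1", 4.2),
    ("u1", "b2", 3.1),
    ("u1", "b3", 4.8),
    ("u2", "b1", 2.5),
]
recommended = top_n_for_user(toy_predictions, "u1", n=2)
```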