## Ensemble model

### Basic Ensemble Techniques
The idea of an ensemble model is to stack a set of base recommendation models and use a weighted sum of their predictions as the new prediction, so that the weaknesses of any single model can be offset by the others. Two weighting schemes are used:
- Averaging
- Weighted Average (Train weights with linear regression and ElasticNet regression)
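
As a hedged formalization (the notation is introduced here for illustration and does not come from the project code), assume the ensemble is a linear combination of $K$ base predictions without an intercept term. The ElasticNet penalty is written in the form used by scikit-learn, with $\alpha$ the overall strength and $\rho$ the L1 ratio; whether the project uses exactly this setup is an assumption.

$$
\hat{r}_{ui} = \sum_{k=1}^{K} w_k \, \hat{r}^{(k)}_{ui}
$$

$$
\text{Averaging: } w_k = \frac{1}{K}
$$

$$
\text{Linear regression: } \min_{w} \; \frac{1}{2n} \sum_{(u,i)} \Big( r_{ui} - \sum_{k=1}^{K} w_k \, \hat{r}^{(k)}_{ui} \Big)^2
$$

$$
\text{ElasticNet: } \min_{w} \; \frac{1}{2n} \sum_{(u,i)} \Big( r_{ui} - \sum_{k=1}^{K} w_k \, \hat{r}^{(k)}_{ui} \Big)^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

Here $r_{ui}$ is the observed rating of user $u$ for item $i$, $\hat{r}^{(k)}_{ui}$ is the prediction of base model $k$, and $n$ is the number of held-out ratings used to fit the weights.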

#### Steps:
1. Tune the hyperparameters of the base models:
    - Neighborhood-based model (KNN)
    - SVD
    - BaselineOnly
    - Co-clustering
2. Train the base models using the tuned hyperparameters
3. Run a regression to learn the weights, or simply use the average
4. Stack models together with weights (average or trained weights)
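
Steps 3 and 4 can be sketched with NumPy alone. This is a minimal illustration on made-up numbers, not the project's `StackingModel`: the arrays `y_true` and `base_preds` are hypothetical held-out ratings and base-model predictions, and fitting the regression without an intercept is an assumption.

```python
import numpy as np

# Hypothetical held-out ratings and the predictions of three base models.
y_true = np.array([5.0, 3.0, 4.0, 1.0, 2.0, 4.0])
base_preds = np.array([
    [4.6, 3.4, 3.9, 1.8, 2.5, 3.8],  # e.g. a baseline model
    [4.2, 3.1, 4.3, 2.2, 2.1, 4.4],  # e.g. a matrix-factorization model
    [3.9, 3.6, 3.5, 2.9, 3.0, 3.6],  # e.g. a neighborhood model
]).T  # shape: (n_ratings, n_models)


def rmse(y, y_hat):
    """Root mean squared error between two 1-D arrays."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


n_models = base_preds.shape[1]

# Step 3a: plain average, every model gets the same weight.
avg_weights = np.full(n_models, 1.0 / n_models)

# Step 3b: weights learned by ordinary least squares (no intercept, an assumption).
lr_weights, *_ = np.linalg.lstsq(base_preds, y_true, rcond=None)

# Step 4: stack the base predictions with each set of weights.
avg_pred = base_preds @ avg_weights
lr_pred = base_preds @ lr_weights

print("average weights:", avg_weights, "RMSE:", rmse(y_true, avg_pred))
print("learned weights:", lr_weights, "RMSE:", rmse(y_true, lr_pred))
```

On the data used to fit them, least-squares weights cannot do worse than the plain average, because the average is one of the candidate weight vectors. That guarantee does not carry over to unseen ratings, so both variants still need to be compared on a separate test set.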

#### Results:

##### Accuracy metric: RMSE

Base Models: 1.13 - 1.33

Ensemble with average weights: 1.32

Ensemble with weights learned from linear regression: 1.32

Ensemble with weights learned from ElasticNet: 1.39

##### Accuracy metric: MAE

Base Models: 0.9 - 1

Ensemble with average weights: 1.05

Ensemble with weights learned from linear regression: 1.03

Ensemble with weights learned from ElasticNet: 1.11
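
For reference, the two metrics reported above can be computed as follows. This is a standard-library sketch on made-up numbers; the notebook itself relies on `surprise.accuracy.rmse` and `surprise.accuracy.mae`.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def mae(y_true, y_pred):
    """Mean absolute error: mean of the absolute differences."""
    absolute_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return sum(absolute_errors) / len(absolute_errors)


# Made-up ratings for illustration only.
actual = [5, 3, 4, 1]
predicted = [4, 3, 2, 2]

print(rmse(actual, predicted))
print(mae(actual, predicted))
```

Because RMSE squares each error before averaging, a few large misses raise it more than they raise MAE, and RMSE is never smaller than MAE on the same predictions.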


### Advanced Ensemble Techniques
- Stacking



In [1]:
%load_ext autoreload
%autoreload 2

In [1]:
import pandas as pd
from source.utils import train_test_split_feature
from surprise import Reader, Dataset
from surprise import BaselineOnly, SVD, KNNBasic
from surprise.prediction_algorithms.co_clustering import CoClustering
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import accuracy
from surprise.model_selection import split

In [2]:
feature = pd.read_csv("data/feature.csv")
cols = [
        "user_id",
        "business_id",
        "review_stars",
        "review_date"
        ]
selected_feature = feature[cols]
train_set, test_set = train_test_split_feature(selected_feature.copy())
full_set = selected_feature[["user_id",'business_id',"review_stars"]]
reader = Reader(rating_scale=(1, 5))
trainset = Dataset.load_from_df(train_set[['user_id', 'business_id', 'review_stars']], reader)
testset = Dataset.load_from_df(test_set[['user_id', 'business_id', 'review_stars']], reader)
fullset = Dataset.load_from_df(full_set[['user_id', 'business_id', 'review_stars']], reader)

In [4]:
# 1. find the best hyperparameters for BaselineOnly
param_grid = {'bsl_options': {'method': ['als','sgd'],
                              'n_epochs': [10, 25], 
                              'reg_u': [3, 5],
                              'reg_i': [3, 5]}
             }
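grid_search = GridSearchCV(BaselineOnly, param_grid, measures=['rmse'], cv=3)
# `data` is not defined in the cells shown; assumed here to be the training Dataset built above
data = trainset
grid_search.fit(data)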
print(grid_search.best_score['rmse'])
print(grid_search.best_params['rmse'])

Estimating biases using als...
Estimating biases using sgd...
Estimati
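
A quick way to see how much work this grid search does is to expand the grid by hand. The sketch below uses only the standard library; `expand_grid` is a hypothetical helper written for this illustration, not part of Surprise, and `bsl_grid` copies the `bsl_options` values from the cell above.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of a flat {name: [values]} grid as a list of dicts."""
    names = list(grid)
    value_lists = [grid[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Same candidate values as the `bsl_options` entry of the grid above.
bsl_grid = {
    'method': ['als', 'sgd'],
    'n_epochs': [10, 25],
    'reg_u': [3, 5],
    'reg_i': [3, 5],
}

combos = expand_grid(bsl_grid)
cv = 3  # number of cross-validation folds used above
n_fits = len(combos) * cv

print(len(combos), "combinations,", n_fits, "model fits")
```

With `cv=3`, the 16 combinations lead to 48 model fits. If the Surprise documentation is read correctly, `reg_u` and `reg_i` apply to the ALS method only (SGD uses `reg` and `learning_rate`), so the SGD half of this grid may be re-fitting identical configurations.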

In [13]:
# 2. find the best hyperparameters for SVD
param_grid = {'n_epochs': [25, 40], 'lr_all': [0.01, 0.02],
              'reg_all': [0.2]}
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)
print(grid_search.best_score['rmse'])
print(grid_search.best_params['rmse'])

1.2783104771201215
{'n_epochs': 25, 'lr_all': 0.01, 'reg_all': 0.2}


In [12]:
# 3. find the best hyperparameters for co-clustering
param_grid = {'n_epochs': [3, 5], 'n_cltr_u': [3, 5],
              'n_cltr_i': [3, 5]}
grid_search = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)
print(grid_search.best_score['rmse'])
print(grid_search.best_params['rmse'])

1.4121145162238307
{'n_epochs': 3, 'n_cltr_u': 3, 'n_cltr_i': 3}


In [3]:
from source.stacking_recommender import StackingModel

In [4]:
# Stacking model with weights trained by linear regression without regularization
stacking_model_LR = StackingModel(fullset)
stacking_model_LR.fit(trainset, retrain=True, retrain_split_num=2)
stacking_model_LR_pred_train = stacking_model_LR.test(trainset.build_full_trainset().build_testset())
print(f'The RMSE for train is {accuracy.rmse(stacking_model_LR_pred_train)}')
print(f'The MAE for train is {accuracy.mae(stacking_model_LR_pred_train)}')
# Evaluate on the held-out test set
stacking_model_LR_pred = stacking_model_LR.test(testset.build_full_trainset().build_testset())
print(f'The RMSE for test is {accuracy.rmse(stacking_model_LR_pred)}')
print(f'The MAE for test is {accuracy.mae(stacking_model_LR_pred)}')

**************** Start retraining models ******************
*************** Retraining: baselineonly *****************
Estimating biases using als...
RMSE: 1.1351
MAE:  0.8981
Estimating biases using als...
RMSE: 1.1366
MAE:  0.8989
*************** Retraining: svd *****************
RMSE: 1.1400
MAE:  0.9022
RMSE: 1.1395
MAE:  0.9019
*************** Retraining: coClustering *****************
RMSE: 1.2573
MAE:  0.9406
RMSE: 1.2625
MAE:  0.9440
*************** Retraining: knn *****************
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.3274
MAE:  0.9958
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.3270
MAE:  0.9959
******************** Models retrained *********************
*** Starting tuning hyperparameter for stacking weights ***
RMSE: 0.9818
The RMSE for train is 0.9818120839642133
MAE:  0.7275
The MAE for train is 0.7274999461539339
RMSE: 1.3159
The RMSE for test is 1.3158572499547387
MAE:  1.0278
The MA

In [4]:
# Stacking model with weights trained by ElasticNet
stacking_model_EN = StackingModel(fullset)
stacking_model_EN.fit(trainset, ensemble_method="LR:ElasticNet", retrain=True, retrain_split_num=2)
stacking_model_EN_pred_train = stacking_model_EN.test(trainset.build_full_trainset().build_testset())
print(f'The RMSE for train is {accuracy.rmse(stacking_model_EN_pred_train)}')
print(f'The MAE for train is {accuracy.mae(stacking_model_EN_pred_train)}')
# Evaluate on the held-out test set
stacking_model_EN_pred = stacking_model_EN.test(testset.build_full_trainset().build_testset())
print(f'The RMSE for test is {accuracy.rmse(stacking_model_EN_pred)}')
print(f'The MAE for test is {accuracy.mae(stacking_model_EN_pred)}')

**************** Start retraining models ******************
*************** Retraining: baselineonly *****************
Estimating biases using als...
RMSE: 1.1362
MAE:  0.8981
Estimating biases using als...
RMSE: 1.1359
MAE:  0.8990
*************** Retraining: svd *****************
RMSE: 1.1395
MAE:  0.9019
RMSE: 1.1401
MAE:  0.9020
*************** Retraining: coClustering *****************
RMSE: 1.2601
MAE:  0.9421
RMSE: 1.2556
MAE:  0.9388
*************** Retraining: knn *****************
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.3263
MAE:  0.9948
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.3272
MAE:  0.9951
******************** Models retrained *********************
*** Starting tuning hyperparameter for stacking weights ***
RMSE: 1.2209
The RMSE for train is 1.2208682531140647
MAE:  0.9279
The MAE for train is 0.9279317916043214
RMSE: 1.3932
The RMSE for test is 1.3932336243705696
MAE:  1.1132
The MA

In [5]:
# Stacking Model with average weights
stacking_model_AVG = StackingModel(fullset)
stacking_model_AVG.fit(trainset, ensemble_method="Average", retrain=True, retrain_split_num=2)
stacking_model_AVG_pred_train = stacking_model_AVG.test(trainset.build_full_trainset().build_testset())
print(f'The RMSE for train is {accuracy.rmse(stacking_model_AVG_pred_train)}')
print(f'The MAE for train is {accuracy.mae(stacking_model_AVG_pred_train)}')
# Evaluate on the held-out test set
stacking_model_AVG_pred = stacking_model_AVG.test(testset.build_full_trainset().build_testset())
print(f'The RMSE for test is {accuracy.rmse(stacking_model_AVG_pred)}')
print(f'The MAE for test is {accuracy.mae(stacking_model_AVG_pred)}')

**************** Start retraining models ******************
*************** Retraining: baselineonly *****************
Estimating biases using als...
RMSE: 1.1356
MAE:  0.8985
Estimating biases using als...
RMSE: 1.1368
MAE:  0.8989
*************** Retraining: svd *****************
RMSE: 1.1396
MAE:  0.9018
RMSE: 1.1397
MAE:  0.9020
*************** Retraining: coClustering *****************
RMSE: 1.2547
MAE:  0.9387
RMSE: 1.2595
MAE:  0.9417
*************** Retraining: knn *****************
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.3265
MAE:  0.9943
Computing the cosine similarity matrix...
Done computing similarity matrix.
RMSE: 1.3256
MAE:  0.9947
******************** Models retrained *********************
RMSE: 1.0233
The RMSE for train is 1.0232740640563724
MAE:  0.7885
The MAE for train is 0.7885227426778852
RMSE: 1.3172
The RMSE for test is 1.3171515246557777
MAE:  1.0517
The MAE for test is 1.0516577517405332
