## Ensemble model


The idea of an ensemble model is to stack a set of base recommendation models and use a weighted sum of their predictions as the final prediction, so that the weaknesses of any single model are partly offset by the others.

### Steps:
1. Tune the hyperparameters of the base models:
    - Neighborhood-based model (KNN)
    - SVD
    - BaselineOnly
    - Co-clustering
2. Train base models using the tuned hyperparameters
3. Stack models together with weights
4. Run linear regression to learn the weights
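
Steps 3 and 4 can be illustrated with a minimal sketch. It does not reproduce the `StackingModel` class used later in this notebook, whose internals are not shown here. The function names, the intercept term, and the synthetic "base predictions" are all assumptions made for illustration.

```python
import numpy as np

def fit_stacking_weights(base_preds, ratings):
    """Least-squares weights (plus an intercept) for combining base predictions.

    base_preds: array of shape (n_ratings, n_models), one column per base model.
    ratings:    array of shape (n_ratings,), the true ratings.
    """
    # Prepend a column of ones so the regression can also learn an intercept.
    X = np.column_stack([np.ones(len(ratings)), base_preds])
    coef, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    return coef  # coef[0] is the intercept, coef[1:] are the model weights

def stacked_predict(base_preds, coef):
    """Weighted sum of the base predictions plus the intercept."""
    return coef[0] + base_preds @ coef[1:]

def rmse(pred, truth):
    return float(np.sqrt(np.mean((pred - truth) ** 2)))

# Synthetic stand-ins for the base models' predictions (not Yelp data).
rng = np.random.default_rng(0)
true_ratings = rng.integers(1, 6, size=1000).astype(float)
base_preds = np.column_stack([
    true_ratings + rng.normal(0, 1.3, size=1000),
    true_ratings + rng.normal(0, 1.3, size=1000),
    true_ratings + rng.normal(0, 1.5, size=1000),
])

coef = fit_stacking_weights(base_preds, true_ratings)
stacked = stacked_predict(base_preds, coef)
base_rmses = [rmse(base_preds[:, k], true_ratings) for k in range(3)]
stacked_rmse = rmse(stacked, true_ratings)
```

On the data used to fit the weights, the stacked RMSE can never exceed the best base RMSE, because each base model alone is one possible weighting. That guarantee does not carry over to unseen ratings, which is why the weights are best judged on held-out data.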

### Results:
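Accuracy metric: RMSE (lower is better)
Stacking model (0.69) vs. base models (1.29 - 1.51)

Note that the stacking RMSE in the final cell is computed on the same ratings the model was fit on, so it is an in-sample score and likely overstates the improvement over the base models.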
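
For reference, RMSE over a set $T$ of rated (user, item) pairs is conventionally defined as follows. The symbols are notation introduced here, not taken from the notebook: $r_{ui}$ is the true rating and $\hat{r}_{ui}$ the predicted one.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{|T|} \sum_{(u,i) \in T} \left(\hat{r}_{ui} - r_{ui}\right)^2}
$$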

In [8]:
# load the data
import pandas as pd
import json
from tqdm import tqdm
import os
from surprise import Reader, Dataset
from surprise.model_selection import train_test_split
from surprise import NormalPredictor
from surprise import BaselineOnly, SVD, KNNBasic
from surprise.prediction_algorithms.co_clustering import CoClustering
from surprise.model_selection import cross_validate
from surprise.model_selection import GridSearchCV
from surprise import accuracy
from surprise.model_selection import split


DATA_FOLDER = 'yelp_dataset'
review_datafile = os.path.join(DATA_FOLDER, "review.json")

# count lines first so tqdm can show progress; the file is closed afterwards
with open(review_datafile) as f:
    line_count = sum(1 for _ in f)

# review.json holds one JSON object per line
user_ids, business_ids, stars, dates = [], [], [], []
with open(review_datafile) as f:
    for line in tqdm(f, total=line_count):
        blob = json.loads(line)
        user_ids.append(blob["user_id"])
        business_ids.append(blob["business_id"])
        stars.append(blob["stars"])
        dates.append(blob["date"])
ratings = pd.DataFrame(
    {"user_id": user_ids, "business_id": business_ids, "rating": stars, "date": dates}
)
# users with at least 5 reviews (computed here but not used in the cells shown)
user_counts = ratings["user_id"].value_counts()
active_users = user_counts.loc[user_counts >= 5].index.tolist()

100%|██████████| 6685900/6685900 [00:55<00:00, 121218.92it/s]


In [9]:
# take the first 1M ratings (a head slice, not a random sample)
ratings_sample = ratings.iloc[:1_000_000]

In [10]:
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(ratings_sample[['user_id', 'business_id', 'rating']], reader)

In [4]:
# 1. find the best hyperparameters for BaselineOnly
param_grid = {'bsl_options': {'method': ['als','sgd'],
                              'n_epochs': [10, 25], 
                              'reg_u': [3, 5],
                              'reg_i': [3, 5]}
             }
grid_search = GridSearchCV(BaselineOnly, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)
print(grid_search.best_score['rmse'])
print(grid_search.best_params['rmse'])

Estimating biases using als...
Estimating biases using sgd...
[repeated log lines omitted; output truncated]

In [13]:
# 2. find the best hyperparameters for SVD
param_grid = {'n_epochs': [25, 40], 'lr_all': [0.01, 0.02],
              'reg_all': [0.2]}
grid_search = GridSearchCV(SVD, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)
print(grid_search.best_score['rmse'])
print(grid_search.best_params['rmse'])

1.2783104771201215
{'n_epochs': 25, 'lr_all': 0.01, 'reg_all': 0.2}


In [12]:
# 3. find the best hyperparameters for co-clustering
param_grid = {'n_epochs': [3, 5], 'n_cltr_u': [3, 5],
              'n_cltr_i': [3, 5]}
grid_search = GridSearchCV(CoClustering, param_grid, measures=['rmse'], cv=3)
grid_search.fit(data)
print(grid_search.best_score['rmse'])
print(grid_search.best_params['rmse'])

1.4121145162238307
{'n_epochs': 3, 'n_cltr_u': 3, 'n_cltr_i': 3}


In [11]:
from source.stacking_recommender import StackingModel

In [12]:
stacking_model = StackingModel(data)

In [13]:
stacking_model.fit(data, retrain=True, retrain_split_num=10)

**************** Start retraining models ******************
*************** Retraining: baselineonly *****************
Estimating biases using als...
RMSE: 1.2862
Estimating biases using als...
RMSE: 1.2883
Estimating biases using als...
RMSE: 1.2930
Estimating biases using als...
RMSE: 1.2895
Estimating biases using als...
RMSE: 1.2891
Estimating biases using als...
RMSE: 1.2854
Estimating biases using als...
RMSE: 1.2909
Estimating biases using als...
RMSE: 1.2906
Estimating biases using als...
RMSE: 1.2869
Estimating biases using als...
RMSE: 1.2934
*************** Retraining: svd *****************
RMSE: 1.2951
RMSE: 1.2942
RMSE: 1.2880
RMSE: 1.2912
RMSE: 1.3016
RMSE: 1.2900
RMSE: 1.2938
RMSE: 1.2967
RMSE: 1.2912
RMSE: 1.2953
*************** Retraining: coClustering *****************
RMSE: 1.4794
RMSE: 1.4815
RMSE: 1.4791
RMSE: 1.4834
RMSE: 1.4718
RMSE: 1.4802
RMSE: 1.4792
RMSE: 1.4741
RMSE: 1.4804
RMSE: 1.4769
*************** Retraining: knn *****************
Computing the cosine s

In [14]:
# NOTE: this testset is built from the full training set, so the RMSE below is an in-sample score
stacking_model_pred = stacking_model.test(data.build_full_trainset().build_testset())
accuracy.rmse(stacking_model_pred)

RMSE: 0.6910


0.6910324929352037
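
The score above is measured on ratings the model has already seen. A hedged sketch of a held-out evaluation follows. `holdout_split` is a hypothetical helper, the toy rows stand in for the Yelp ratings, and the commented continuation assumes that `StackingModel.test` accepts a plain list of `(user_id, business_id, rating)` tuples, the same shape that `build_testset()` returns.

```python
import random

def holdout_split(rows, holdout_frac=0.2, seed=0):
    """Shuffle (user_id, business_id, rating) rows and split off a holdout set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_holdout = int(len(rows) * holdout_frac)
    return rows[n_holdout:], rows[:n_holdout]

# Toy rows standing in for ratings_sample[['user_id', 'business_id', 'rating']].
rows = [(f"u{i % 7}", f"b{i % 11}", float(1 + i % 5)) for i in range(100)]
train_rows, holdout_rows = holdout_split(rows)

# Hypothetical continuation with the notebook's objects (not run here):
#   train_df = pd.DataFrame(train_rows, columns=['user_id', 'business_id', 'rating'])
#   train_data = Dataset.load_from_df(train_df, reader)
#   stacking_model = StackingModel(train_data)
#   stacking_model.fit(train_data, retrain=True, retrain_split_num=10)
#   accuracy.rmse(stacking_model.test(holdout_rows))
```

With a random split, some holdout users or businesses may not appear in the training rows at all, so the held-out RMSE also reflects how each base model handles unseen IDs.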