### Ensemble Modelling
**(STAGE 2 Ensemble)**: Average the outputs of the weak learners and use the average as a new feature to train the strong learners. <br>
**(STAGE 3 Ensemble)**: Use the outputs of the Stage 2 strong learners to train a final 'meta-learner' model.

In [1]:
import pandas as pd
import numpy as np
import pickle
import seaborn as sns
import time

from sklearn import metrics
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC
from xgboost.sklearn import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from scipy.stats import randint, uniform

In [4]:
features = pd.read_csv('../data/preprocessed-train.csv', index_col = 'bookingID')

In [5]:
y = pd.read_csv('../data/ori_labels.csv', index_col = 'bookingID')

In [6]:
label = pd.merge(features.reset_index(), y, on='bookingID')['label']

In [7]:
# load weak learners
with open("../model_weights/mlp.dat", "rb") as f:  
    mlp  = pickle.load(f)

with open("../model_weights/naivebayes.dat", "rb") as f:  
    naivebayes  = pickle.load(f)

with open("../model_weights/rf.dat", "rb") as f:  
    rf  = pickle.load(f)
    
with open('../model_weights/model-config.pkl', 'rb') as f:  
    model_config = pickle.load(f)

#### Average predictions from weak learners

In [10]:
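# positive-class probability from each weak learner
temp1 = mlp.predict_proba(features[model_config['mlp']['col_names']])[:, 1]
temp2 = naivebayes.predict_proba(features[model_config['naivebayes']['col_names']])[:, 1]
# unweighted mean of the two probabilities (rf is loaded above but is not part of this average)
avg = (temp1 + temp2) / 2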
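As a minimal sketch of the averaging step on synthetic data: the two classifiers below (`GaussianNB` and a small `LogisticRegression`) and the `*_demo` names are stand-ins chosen for illustration, not the pickled models loaded above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Synthetic binary-classification data (illustrative only).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Two stand-in weak learners.
weak_a = GaussianNB().fit(X_demo, y_demo)
weak_b = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Column 1 of predict_proba is the probability of the positive class.
p_a = weak_a.predict_proba(X_demo)[:, 1]
p_b = weak_b.predict_proba(X_demo)[:, 1]

# Unweighted mean: one new feature value per row.
avg_demo = (p_a + p_b) / 2
print(avg_demo.shape)
```

Because both inputs are probabilities, the average also stays in the range 0 to 1 and can be appended to the feature table as a single extra column.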

***

### Stage 2: Train Strong Learner

In [12]:
features['avg'] = avg

In [13]:
# hyperparameter search spaces for the three strong learners
params_logistic = {
    'solver' : ['liblinear', 'saga'],
    'C' : [1e-3, 1e-2, 0.1, 1, 10, 100]
}

params_lda = {
    'solver': ['svd', 'lsqr'],
    'tol': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
}

params_xgboost = {
    'max_depth': randint(1,6),
    'min_child_weight': randint(0,6),
    'subsample': uniform(loc=0.6, scale=0.4),
    'colsample_bytree': uniform(loc=0.6, scale=0.4),
    'gamma': [i/10.0 for i in range(0,5)],
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
}
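To make the `scipy.stats` distributions above concrete, the following sketch draws samples from two of them and prints the observed range. The variable names are illustrative and not part of the notebook.

```python
from scipy.stats import randint, uniform

# randint(low, high) draws integers from low up to high - 1.
depth_samples = randint(1, 6).rvs(size=1000, random_state=0)

# uniform(loc, scale) draws floats from the interval [loc, loc + scale].
subsample_samples = uniform(loc=0.6, scale=0.4).rvs(size=1000, random_state=0)

print(depth_samples.min(), depth_samples.max())
print(subsample_samples.min(), subsample_samples.max())
```

So `max_depth` is searched over the integers 1 to 5, and `subsample` and `colsample_bytree` over the interval from 0.6 to 1.0.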

In [14]:
# no `iid` argument: it was removed in scikit-learn 0.24 (fold scores are averaged unweighted)
rand_search_logistic = RandomizedSearchCV(estimator=LogisticRegression(),
                                          param_distributions=params_logistic,
                                          scoring='roc_auc',
                                          n_iter=30,
                                          cv=5,
                                          n_jobs=-1,
                                          verbose=True)

%time rand_search_logistic.fit(features[model_config['logistic']['col_names'] + ['avg']], label)
rand_search_logistic.best_params_, rand_search_logistic.best_score_

Fitting 5 folds for each of 12 candidates, totalling 60 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   52.0s
[Parallel(n_jobs=-1)]: Done  60 out of  60 | elapsed:  1.2min finished


CPU times: user 4.42 s, sys: 96 ms, total: 4.52 s
Wall time: 1min 13s




({'C': 100, 'solver': 'liblinear'}, 0.7912019246889356)
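The log reports 12 candidates even though `n_iter=30`. When every entry in the search space is a plain list, scikit-learn samples without replacement and caps the number of candidates at the size of the grid (2 solvers × 6 values of `C`). A small sketch using `ParameterSampler` directly, with `params_demo` as an illustrative copy of the logistic search space:

```python
import warnings
from sklearn.model_selection import ParameterGrid, ParameterSampler

params_demo = {
    'solver': ['liblinear', 'saga'],
    'C': [1e-3, 1e-2, 0.1, 1, 10, 100],
}

# Total number of distinct combinations in the grid.
grid_size = len(ParameterGrid(params_demo))

# Asking for more iterations than the grid holds triggers a warning
# and the sampler yields each combination once.
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    sampled = list(ParameterSampler(params_demo, n_iter=30, random_state=0))

print(grid_size, len(sampled))
```

For list-only search spaces like this one, the randomized search therefore behaves like an exhaustive grid search.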

In [15]:
# no `iid` argument: it was removed in scikit-learn 0.24 (fold scores are averaged unweighted)
rand_search_lda = RandomizedSearchCV(estimator=LinearDiscriminantAnalysis(),
                                     param_distributions=params_lda,
                                     scoring='roc_auc',
                                     n_iter=30,
                                     cv=5,
                                     n_jobs=-1,
                                     verbose=True)

%time rand_search_lda.fit(features[model_config['lda']['col_names'] + ['avg']], label)
rand_search_lda.best_params_, rand_search_lda.best_score_

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:    5.4s


CPU times: user 532 ms, sys: 52 ms, total: 584 ms
Wall time: 6.05 s


[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:    5.8s finished


({'solver': 'svd', 'tol': 1e-06}, 0.8122928250684665)

In [16]:
# no `iid` argument: it was removed in scikit-learn 0.24 (fold scores are averaged unweighted)
rand_search_xgboost = RandomizedSearchCV(estimator=XGBClassifier(),
                                         param_distributions=params_xgboost,
                                         scoring='roc_auc',
                                         n_iter=30,
                                         cv=5,
                                         n_jobs=-1,
                                         verbose=True)

%time rand_search_xgboost.fit(features[model_config['xgboost']['col_names'] + ['avg']], label)
rand_search_xgboost.best_params_, rand_search_xgboost.best_score_

Fitting 5 folds for each of 30 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:  4.2min finished


CPU times: user 5.92 s, sys: 132 ms, total: 6.06 s
Wall time: 4min 16s


({'colsample_bytree': 0.8930552389188486,
  'gamma': 0.2,
  'max_depth': 5,
  'min_child_weight': 3,
  'reg_alpha': 1e-05,
  'subsample': 0.7471291925293697},
 0.9110989009814426)

##### Retrain on the full dataset using the best hyperparameters

In [17]:
logistic = LogisticRegression(**rand_search_logistic.best_params_, max_iter=500)
logistic.fit(features[model_config['logistic']['col_names'] + ['avg']], label)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

In [18]:
lda = LinearDiscriminantAnalysis(**rand_search_lda.best_params_)
lda.fit(features[model_config['lda']['col_names'] + ['avg']], label)

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=1e-06)

In [19]:
xgboost = XGBClassifier(
    n_estimators = 100,   # not part of the search above; fixed by hand
    learning_rate = 0.1,  # not part of the search above; fixed by hand
    **rand_search_xgboost.best_params_,
    n_jobs = -1)

xgboost.fit(features[model_config['xgboost']['col_names'] + ['avg']], label)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8930552389188486,
              gamma=0.2, learning_rate=0.1, max_delta_step=0, max_depth=5,
              min_child_weight=3, missing=None, n_estimators=100, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=1e-05, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.7471291925293697, verbosity=1)

#### Save Strong Learner Models

In [20]:
with open('../model_weights/lda_rf.dat', 'wb') as f:  
    pickle.dump(lda, f)
    
with open('../model_weights/xgboost_rf.dat', 'wb') as f:  
    pickle.dump(xgboost, f)
    
with open('../model_weights/logistic_rf.dat', 'wb') as f:  
    pickle.dump(logistic, f)

***

### Stage 3: Train Meta Learner Using Output From Strong Learners

#### Output From Strong Learners

In [21]:
temp3 = logistic.predict_proba(features[model_config['logistic']['col_names'] + ['avg']] )[:, 1]
temp4 = lda.predict_proba(features[model_config['lda']['col_names'] + ['avg']])[:, 1]
temp5 = xgboost.predict_proba(features[model_config['xgboost']['col_names'] + ['avg']])[:, 1]

In [None]:
# check if overfitted
print(metrics.roc_auc_score(label, temp3))
print(metrics.roc_auc_score(label, temp4))
print(metrics.roc_auc_score(label, temp5))

In [23]:
df = pd.DataFrame([])
df['logistic'] = temp3
df['lda'] = temp4
df['xgboost'] = temp5
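The three columns above are predictions made on the same rows the strong learners were fitted on, so the meta-learner may receive optimistically accurate inputs and its cross-validation score may be inflated as a result. One common alternative is to build the meta-features from out-of-fold predictions. A minimal sketch on synthetic data with stand-in estimators (none of the names below come from the notebook):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic data standing in for the real feature table.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Stand-in strong learners.
base_models = {
    'logistic': LogisticRegression(max_iter=1000),
    'lda': LinearDiscriminantAnalysis(),
}

# Each row's prediction comes from a model that did not see that row
# during fitting, because cross_val_predict refits on the other folds.
oof = pd.DataFrame({
    name: cross_val_predict(model, X_demo, y_demo, cv=5,
                            method='predict_proba')[:, 1]
    for name, model in base_models.items()
})
print(oof.shape)
```

A table built this way could be passed to the Stage 3 search in place of in-sample predictions; whether it changes the result here would need to be checked on the real data.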

In [24]:
# no `iid` argument: it was removed in scikit-learn 0.24 (fold scores are averaged unweighted)
rand_search_xgboost2 = RandomizedSearchCV(estimator=XGBClassifier(),
                                          param_distributions=params_xgboost,
                                          scoring='roc_auc',
                                          n_iter=30,
                                          cv=5,
                                          n_jobs=-1,
                                          verbose=True)

%time rand_search_xgboost2.fit(df, label)
rand_search_xgboost2.best_params_, rand_search_xgboost2.best_score_

Fitting 5 folds for each of 30 candidates, totalling 150 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  46 tasks      | elapsed:   17.4s
[Parallel(n_jobs=-1)]: Done 150 out of 150 | elapsed:  1.0min finished


CPU times: user 1.87 s, sys: 72 ms, total: 1.94 s
Wall time: 1min 1s


({'colsample_bytree': 0.6928004622825488,
  'gamma': 0.0,
  'max_depth': 4,
  'min_child_weight': 0,
  'reg_alpha': 0.1,
  'subsample': 0.7219384230404164},
 0.9606306735501423)

In [25]:
xgboost2 = XGBClassifier(
    n_estimators = 100,    # not part of the search above; fixed by hand
    learning_rate = 0.01,  # not part of the search above; fixed by hand
    **rand_search_xgboost2.best_params_,  # best parameters from the Stage 3 search
    n_jobs = -1)

xgboost2.fit(df, label)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.6928004622825488,
              gamma=0.0, learning_rate=0.01, max_delta_step=0, max_depth=4,
              min_child_weight=0, missing=None, n_estimators=100, n_jobs=-1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0.1, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=0.7219384230404164, verbosity=1)

In [26]:
with open('../model_weights/meta_rf.dat', 'wb') as f:  
    pickle.dump(xgboost2, f)
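At prediction time the saved models have to be applied in the same order as they were trained. The sketch below shows that order with a hypothetical helper, `predict_stack`, and toy scikit-learn models. As a simplification, every model here uses the same feature columns, whereas the notebook selects a per-model column list from `model_config`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB


def predict_stack(X, weak_models, strong_models, meta_model):
    """Hypothetical helper: run the three stages in order, return P(label = 1)."""
    # Stage 1: average the weak learners' positive-class probabilities.
    avg = np.mean([m.predict_proba(X)[:, 1] for m in weak_models], axis=0)
    # Stage 2: strong learners see the original features plus the average.
    X_aug = np.column_stack([X, avg])
    strong_out = np.column_stack(
        [m.predict_proba(X_aug)[:, 1] for m in strong_models])
    # Stage 3: the meta-learner sees only the strong learners' outputs.
    return meta_model.predict_proba(strong_out)[:, 1]


# Toy training run so the helper has fitted models to work with.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)
weak = [GaussianNB().fit(X_demo, y_demo),
        LogisticRegression(max_iter=1000).fit(X_demo, y_demo)]
avg_train = np.mean([m.predict_proba(X_demo)[:, 1] for m in weak], axis=0)
X_aug_train = np.column_stack([X_demo, avg_train])
strong = [LogisticRegression(max_iter=1000).fit(X_aug_train, y_demo),
          LinearDiscriminantAnalysis().fit(X_aug_train, y_demo)]
meta_X = np.column_stack([m.predict_proba(X_aug_train)[:, 1] for m in strong])
meta = LogisticRegression(max_iter=1000).fit(meta_X, y_demo)

scores = predict_stack(X_demo[:5], weak, strong, meta)
print(scores.shape)
```

With the real models, the same three steps would apply after loading each `.dat` file with `pickle.load`.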