### Ensemble Modelling
**(Initial Stage)**: Identify strong and weak learners using cross-validation. <br>
**(STAGE 1 Ensemble)**: Check correlation between predictions made by initial stage models. Choose only models with low correlation as candidate models. <br>
**(STAGE 2 Ensemble)**: Train the weak learners, average their outputs, and use the averaged output as a new feature to train the strong learners. <br>
**(STAGE 3 Ensemble)**: Use the outputs of the Stage 2 strong learners to train a final 'meta-learner' model.
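
The Stage 1 correlation check can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the data is synthetic (`make_classification`), the three candidate models are an arbitrary subset, and the names `candidates`, `oof_preds` and `pred_corr` are assumptions introduced here. It collects out-of-fold predicted probabilities for each model and computes their pairwise Pearson correlation.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption: NOT the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Arbitrary subset of candidate models, for illustration only
candidates = {
    'logistic': LogisticRegression(max_iter=500),
    'naivebayes': GaussianNB(),
    'rf': RandomForestClassifier(n_estimators=50, random_state=0),
}

# Out-of-fold predicted probability of the positive class for each model
oof_preds = pd.DataFrame({
    name: cross_val_predict(est, X_demo, y_demo, cv=5, method='predict_proba')[:, 1]
    for name, est in candidates.items()
})

# Pairwise Pearson correlation between the models' predictions
pred_corr = oof_preds.corr()
print(pred_corr.round(2))
```

Pairs with a low off-diagonal correlation tend to make different errors, which is why they are the more useful candidates to combine. What counts as "low" is a judgment call that this sketch does not settle.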

***

In [1]:
import pandas as pd
import numpy as np
import pickle
import time
import build_feature
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost.sklearn import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from scipy.stats import randint, uniform

In [2]:
# preprocessed-train.csv is output from build_feature() function
features = pd.read_csv('data/preprocessed-train.csv', index_col = 'bookingID')

In [3]:
y = pd.read_csv('data/ori_labels.csv', index_col = 'bookingID')

### (Initial Stage): Identify Strong and Weak Learners Using Cross-validation

In [4]:
# models that we want to test
model = {
    'logistic': LogisticRegression(max_iter=500),
    'lda': LinearDiscriminantAnalysis(), 
    'svc': SVC(kernel='rbf'),
    'naivebayes': GaussianNB(),
    'rf': RandomForestClassifier(n_estimators=100),
    'xgboost': XGBClassifier(),
    'mlp': MLPClassifier(max_iter=500)
}

In [5]:
# hyperparameters grid for models that we want to test
model_params = {
    'logistic': {
        'solver' : ['liblinear', 'saga'],
        'C' : [1e-3, 1e-2, 0.1, 1, 10, 100]
    },
    'lda': {
        'solver': ['svd', 'lsqr'],
        'tol': [1e-6, 1e-5, 1e-4, 1e-3, 1e-2]
    },
    'svc': {
        'gamma': [0.1, 1, 10, 100],
        'C': [0.1, 1, 10, 100, 1000]
    },
    'naivebayes': {
        'var_smoothing': [1e-11, 1e-10, 1e-09, 1e-08, 1e-7]
    },
    'rf': {
        'max_depth': randint(10, 100),
        # 'auto' was equivalent to 'sqrt' for classifiers and was removed in scikit-learn 1.3
        'max_features': ['sqrt', 'log2'],
        'min_samples_leaf': randint(1, 4),
        'min_samples_split': randint(2, 10),
        'bootstrap': [True, False]
    },
    'xgboost': {
        'max_depth': randint(1,6),
        'min_child_weight': randint(0,6),
        'subsample': uniform(loc=0.6, scale=0.4),
        'colsample_bytree': uniform(loc=0.6, scale=0.4),
        'gamma': [i/10.0 for i in range(0,5)],
        'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100]
    }, 
    'mlp': {
        'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
        'activation': ['tanh', 'relu'],
        'solver': ['sgd', 'adam'],
        'alpha': [0.0001, 0.05],
        'learning_rate': ['constant','adaptive']
    }
}
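
The grid above mixes plain lists with `scipy.stats` distributions. A small sketch of how a randomized search draws from each kind is given below. It uses scikit-learn's `ParameterSampler`; `demo_space` is a hypothetical reduced copy of the `xgboost` entry, introduced only for this illustration.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical reduced search space, copied from the 'xgboost' entry
demo_space = {
    'max_depth': randint(1, 6),                # integers 1..5 (upper bound is exclusive)
    'subsample': uniform(loc=0.6, scale=0.4),  # floats in [loc, loc + scale] = [0.6, 1.0]
    'reg_alpha': [1e-5, 1e-2, 0.1, 1, 100],    # list: one entry is picked uniformly at random
}

# Draw 5 random configurations, as a search with n_iter=5 would
samples = list(ParameterSampler(demo_space, n_iter=5, random_state=0))
for params in samples:
    print(params)
```

Note that `uniform(loc, scale)` is parameterized by start and width, not by lower and upper bound, so `uniform(loc=0.6, scale=0.4)` covers 0.6 to 1.0.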

#### Determine Strong / Weak Learners

In [6]:
model_config = {}

col_sample = 0.7
n_init = 5

start_time = time.time()

# for every model
for key in model:
    print("Currently computing for", key)
    
    best_match = {}
    
    # repeat the random column sampling (n_init) times
    for i in range(n_init):
        print('Iteration', i + 1)
        # randomly sample a fraction (col_sample) of the columns
        sample_feat = features.sample(frac=col_sample, axis=1)
        sample_label = pd.merge(sample_feat.reset_index(), y, on='bookingID')['label']
        
        # randomized search through the hyperparameters grid
        # note: the `iid` argument was removed in scikit-learn 0.24, so it is not passed here
        rand_search = RandomizedSearchCV(estimator=model[key],
                                         param_distributions=model_params[key],
                                         scoring='roc_auc',
                                         n_iter=5,
                                         cv=5,
                                         n_jobs=-1)
        rand_search.fit(sample_feat, sample_label)
        
        # keep the hyperparameters and columns of the best-scoring run so far
        if i == 0 or rand_search.best_score_ > best_match['roc']:
            best_match['model'] = key
            best_match['col_names'] = sample_feat.columns
            best_match['hyperparams'] = rand_search.best_params_
            best_match['roc'] = rand_search.best_score_
    
    model_config[key] = best_match
    
print('Training Done! Time Used:', time.time() - start_time)

Currently computing for lda
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Currently computing for logistic
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Currently computing for mlp
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Currently computing for xgboost
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Currently computing for rf
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Currently computing for svc
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Currently computing for naivebayes
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
Training Done! Time Used: 5241.812428474426


Training Time: 87 minutes
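
Once `model_config` is filled, the models can be ranked by their best cross-validated ROC AUC to separate stronger from weaker learners. The sketch below assumes the `model_config` structure built above. `rank_models` and `example_config` are hypothetical names, and the scores in `example_config` are invented placeholders, not results from this notebook.

```python
import pandas as pd

# Placeholder entries with the same keys as model_config.
# The 'roc' values are INVENTED for illustration, not real results.
example_config = {
    'model_a': {'model': 'model_a', 'col_names': [], 'hyperparams': {}, 'roc': 0.70},
    'model_b': {'model': 'model_b', 'col_names': [], 'hyperparams': {}, 'roc': 0.75},
    'model_c': {'model': 'model_c', 'col_names': [], 'hyperparams': {}, 'roc': 0.60},
}

def rank_models(config):
    """Return a DataFrame of models sorted by best cross-validated ROC AUC."""
    rows = [{'model': name, 'roc': entry['roc']} for name, entry in config.items()]
    return (pd.DataFrame(rows)
            .sort_values('roc', ascending=False)
            .reset_index(drop=True))

ranking = rank_models(example_config)
print(ranking)
```

Calling `rank_models(model_config)` would give the same table for the real search results. Where to draw the line between strong and weak learners is left open here.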

***