## Boosting Algorithms

1. AdaBoost
<br>
The core principle of AdaBoost is to fit a sequence of weak learners (i.e. models that are only slightly better than random guessing, such as small decision trees) on repeatedly modified versions of the data, where each round increases the weights of the training examples that the previous learners misclassified. The predictions are then combined through a weighted majority vote (or sum) to produce the final prediction. AdaBoost can be used for both classification and regression problems.
<br><br>
2. Gradient Tree Boosting
<br>
Gradient Tree Boosting, or Gradient Boosted Regression Trees (GBRT), is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems.
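
The two boosting methods above can be sketched side by side on a small synthetic dataset. This is a minimal illustration only: the data from `make_classification` and the names `X_demo`, `y_demo`, `boosters`, `boost_scores` and `staged_acc` are assumptions made for this example and are not part of the notebook's pipeline or its results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: an assumption for illustration, not the notebook's dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25,
                                          random_state=0, stratify=y_demo)

boosters = {
    # AdaBoost's default weak learner is a depth-1 decision tree (a stump)
    "ada": AdaBoostClassifier(n_estimators=50, random_state=0),
    # max_depth=1 makes each gradient-boosted tree a stump as well
    "gtb": GradientBoostingClassifier(n_estimators=50, max_depth=1, random_state=0),
}

boost_scores = {}
for name, model in boosters.items():
    model.fit(X_tr, y_tr)
    boost_scores[name] = model.score(X_te, y_te)
    print("%s held-out accuracy: %.3f" % (name, boost_scores[name]))

# staged_predict yields the ensemble's prediction after each boosting round,
# so this list traces held-out accuracy as weak learners are added.
staged_acc = [(pred == y_te).mean() for pred in boosters["ada"].staged_predict(X_te)]
print("AdaBoost accuracy after first and last round: %.3f, %.3f"
      % (staged_acc[0], staged_acc[-1]))
```

Comparing the first and last entries of `staged_acc` gives a rough feel for how much the combined vote improves on a single weak learner; the exact numbers depend on the synthetic data and the scikit-learn version.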

## Random Forests

1. Random Forests
<br>
Each tree in the ensemble is built from a sample drawn with replacement (i.e. a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the chosen split is no longer the best split among all the features. Instead, it is the best split among a random subset of the features.
<br><br>
2. Extremely Randomized Trees
<br>
In extremely randomized trees, randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.
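
The difference between the two forest variants can be sketched with scikit-learn's two estimators on a small synthetic dataset. This is an illustrative sketch under stated assumptions: the data from `make_classification` and the names `X_demo`, `y_demo`, `forests` and `forest_scores` are made up for this example and do not reproduce the notebook's data or scores. Note that in scikit-learn `RandomForestClassifier` uses bootstrap samples by default, while `ExtraTreesClassifier` by default trains each tree on the whole training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: an assumption for illustration, not the notebook's dataset.
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)

forests = {
    # Best threshold per candidate feature, searched over a random feature subset
    "random": RandomForestClassifier(n_estimators=50, max_features="sqrt",
                                     random_state=0),
    # Random threshold per candidate feature; the best of those random splits is kept
    "extreme": ExtraTreesClassifier(n_estimators=50, max_features="sqrt",
                                    random_state=0),
}

forest_scores = {}
for name, model in forests.items():
    # 5-fold cross-validated accuracy (the classifier's default scorer)
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=5)
    forest_scores[name] = fold_scores.mean()
    print("%s: %.3f +/- %.3f (bootstrap=%s)"
          % (name, fold_scores.mean(), fold_scores.std(), model.bootstrap))
```

Setting `max_features="sqrt"` explicitly keeps the size of the random feature subset the same for both models, so the remaining difference is how thresholds are chosen and whether bootstrap sampling is used.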

In [9]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, classification_report, f1_score, confusion_matrix

In [10]:
# Input data before feature selection
input_data_before_fs = pd.read_csv('processed_train.csv', index_col=0)

# Input data after feature selection
input_data_after_fs = pd.read_csv('processed_train_after_feature.csv', index_col=0)

# Upsampling without feature selection

# Upsampling with feature selection

# Downsampling without feature selection

# Downsampling with feature selection


# List of all the input data
input_all = {
    "normal_before_fs" : input_data_before_fs,
#     "normal_after_fs" : input_data_after_fs
}

In [19]:
# Functions
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.preprocessing import LabelBinarizer

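def multiclass_roc_auc_score(y_test, y_pred, average="macro"):
    """ROC AUC for multiclass labels, computed on one-hot encoded hard predictions."""
    # One-hot encode true and predicted labels so roc_auc_score can average per class
    lb = LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_auc_score(y_test, y_pred, average=average)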

def metric_consolidation(input_all, classifier, method="cross_validation"):
    """Print cross-validated accuracy, weighted F1 and ROC AUC for each input dataset."""
    metrics = {'accuracy': 'accuracy',
               'f1_weighted': 'f1_weighted',
               'roc_auc': make_scorer(multiclass_roc_auc_score, average='weighted')
              }

    for input_name, input_data in input_all.items():
        # split the data (the held-out x_test / y_test are not used below)
        x_train, x_test, y_train, y_test = preprocessing(input_data)

        # fit the classifier to the training data; cross_validate refits clones
        # of it on each fold, so this fit does not affect the scores below
        classifier.fit(x_train, y_train)

        # apply all metrics to the classifier for cross_validation
        if method == "cross_validation":
            scores = tenfold(classifier, x_train, y_train, metric=metrics)
            print("Metrics for %s: \n" % input_name)
            for metric in metrics:
                test_score_name = "test_" + metric
                test_score = scores[test_score_name]
                print("%s Test Score: %f +/- %f" % (metric, test_score.mean(),
                                                    test_score.std()))
            print("\n")
            
def preprocessing(data):
    """Select the typed feature columns and return a stratified 75/25 train/test split."""
    # Split the columns by variable type: boolean, continuous (float64), categorical (int64)
    bool_var = list(data.select_dtypes(['bool']))
    cont_var = list(data.select_dtypes(['float64']))
    cat_var = list(data.select_dtypes(['int64']))

    # Input data uses all variable types except the ID details
    final_input_data = data[cat_var + cont_var + bool_var]

    x = final_input_data.loc[:, final_input_data.columns != 'Target'].values
    y = final_input_data['Target'].values
    y = y.astype('int')

    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                        random_state=100, stratify=y)

    return x_train, x_test, y_train, y_test

from sklearn.model_selection import StratifiedKFold, cross_validate

def tenfold(model, x, y, metric='accuracy'):
    """Run stratified 10-fold cross-validation and return the cross_validate scores dict."""
    kfold = StratifiedKFold(n_splits=10, random_state=100, shuffle=True)
    scores = cross_validate(model, x, y, cv=kfold, scoring=metric,
                            return_train_score=True)
    return scores

# accuracy_mean = scores['test_score'].mean()
# accuracy_std = scores['train_score'].std()

In [11]:
# AdaBoost

from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)

In [12]:
# Gradient Tree Boosting

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0)

In [13]:
# Random Forests

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)

In [14]:
# Extremely Randomized Trees

from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                           min_samples_split=2, random_state=0)

In [22]:
# Boosting Parameters
import warnings
warnings.filterwarnings('ignore')

learning_rate_values = [0.1, 0.5, 1]
n_estimators_values = [50, 100, 150]
method_values = ['ada', 'gtb']

for method in method_values:
    for learning_rate in learning_rate_values:
        for n_estimators in n_estimators_values:
            
            if method == 'ada':
                clf = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate,
                                         random_state=0)
            elif method == 'gtb':
                clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=learning_rate,
                                                 random_state=0)

            print("For Boosting with: \n method: %s \n learning rate: %s \n n_estimators: %s \n"
                  % (method, learning_rate, n_estimators))
            metric_consolidation(input_all, clf)

For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 50 

Metrics for normal_before_fs: 

accuracy Test Score: 0.673432 +/- 0.010978
f1_weighted Test Score: 0.567380 +/- 0.015882
roc_auc Test Score: 0.553819 +/- 0.016133




For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 100 

Metrics for normal_before_fs: 

accuracy Test Score: 0.682820 +/- 0.014255
f1_weighted Test Score: 0.593405 +/- 0.016860
roc_auc Test Score: 0.592154 +/- 0.017451




For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 150 

Metrics for normal_before_fs: 

accuracy Test Score: 0.681036 +/- 0.014907
f1_weighted Test Score: 0.597380 +/- 0.016394
roc_auc Test Score: 0.601856 +/- 0.016990




For Boosting with: 
 method: ada 
 learning rate: 0.5 
 n_estimators: 50 

Metrics for normal_before_fs: 

accuracy Test Score: 0.674774 +/- 0.014374
f1_weighted Test Score: 0.607183 +/- 0.018601
roc_auc Test Score: 0.617471 +/- 0.016075




For Boosting with: 
 method: a

In [29]:
# Random Forest Parameters
import warnings
warnings.filterwarnings('ignore')

n_estimators_values = [10, 50, 100] # Number of trees in the forest
method_values = ['random', 'extreme']

for method in method_values:
    for n_estimators in n_estimators_values:

        if method == 'random':
            clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
        elif method == 'extreme':
            clf = ExtraTreesClassifier(n_estimators=n_estimators, random_state=0)

        print ("For Boosting with: \n method: %s \n n_estimators: %s \n"
              %(method, n_estimators))
        metric_consolidation(input_all, clf)

For Boosting with: 
 method: random 
 n_estimators: 10 

Metrics for normal_before_fs: 

accuracy Test Score: 0.660936 +/- 0.021159
f1_weighted Test Score: 0.623904 +/- 0.026006
roc_auc Test Score: 0.642747 +/- 0.019939


For Boosting with: 
 method: random 
 n_estimators: 50 

Metrics for normal_before_fs: 

accuracy Test Score: 0.682365 +/- 0.013944
f1_weighted Test Score: 0.617516 +/- 0.021433
roc_auc Test Score: 0.622716 +/- 0.017814


For Boosting with: 
 method: random 
 n_estimators: 100 

Metrics for normal_before_fs: 

accuracy Test Score: 0.685118 +/- 0.017376
f1_weighted Test Score: 0.616281 +/- 0.025718
roc_auc Test Score: 0.617975 +/- 0.017786


For Boosting with: 
 method: extreme 
 n_estimators: 10 

Metrics for normal_before_fs: 

accuracy Test Score: 0.638449 +/- 0.021202
f1_weighted Test Score: 0.600485 +/- 0.019625
roc_auc Test Score: 0.621277 +/- 0.020917


For Boosting with: 
 method: extreme 
 n_estimators: 50 

Metrics for normal_before_fs: 

accuracy Test Score: