## Boosting Algorithms

1. AdaBoost
<br>
The core principle of AdaBoost is to fit a sequence of weak learners (i.e. models that are only slightly better than random guessing, such as small decision trees) on repeatedly reweighted versions of the data, so that each new learner concentrates on the samples the previous ones misclassified. The predictions are then combined through a weighted majority vote (or sum) to produce the final prediction. AdaBoost can be used for both classification and regression problems.
<br><br>
2. Gradient Tree Boosting
<br>
Generalization of boosting to arbitrary differentiable loss functions. Gradient Boosted Regression Trees (GBRT) is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems.
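
The reweighting and weighted vote described for AdaBoost can be sketched roughly as follows. This is a minimal illustration of discrete AdaBoost for binary labels coded as -1/+1, not scikit-learn's implementation. The helper names `fit_adaboost` and `predict_adaboost`, the synthetic data, and the choice of 25 rounds are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier


def fit_adaboost(X, y, n_rounds=25):
    """Discrete AdaBoost sketch for labels in {-1, +1} (hypothetical helper)."""
    weights = np.full(X.shape[0], 1.0 / X.shape[0])  # uniform sample weights
    learners, alphas = [], []
    for _ in range(n_rounds):
        # Weak learner: a depth-1 tree fitted to the reweighted data
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=weights)
        pred = stump.predict(X)
        # Weighted error, clipped to keep the logarithm finite
        err = np.clip(weights[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this learner
        # Increase the weight of misclassified samples, decrease the rest
        weights = weights * np.exp(-alpha * y * pred)
        weights = weights / weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas


def predict_adaboost(learners, alphas, X):
    """Weighted majority vote over the weak learners."""
    votes = sum(a * m.predict(X) for a, m in zip(alphas, learners))
    return np.where(votes >= 0, 1, -1)


# Synthetic binary problem (assumption: not the notebook's data)
X_toy, y01 = make_classification(n_samples=500, n_features=10, random_state=0)
y_toy = 2 * y01 - 1  # recode {0, 1} as {-1, +1}

learners, alphas = fit_adaboost(X_toy, y_toy)
ensemble_acc = np.mean(predict_adaboost(learners, alphas, X_toy) == y_toy)
stump_acc = np.mean(learners[0].predict(X_toy) == y_toy)
print("first stump: %.3f, ensemble: %.3f" % (stump_acc, ensemble_acc))
```

The first stump is fitted with uniform weights, so comparing its training accuracy with the ensemble's gives a rough sense of what the later, reweighted rounds contribute.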

## Random Forests

1. Random Forests
<br>
Each tree in the ensemble is built from a sample drawn with replacement (i.e. a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the split that is chosen is no longer the best split among all the features. Instead, it is the best split among a random subset of the features.
<br><br>
2. Extremely Randomized Trees
<br>
Randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.
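
The difference between the two split rules can be sketched at the level of a single node. The code below is only an illustration under simplifying assumptions: Gini impurity as the criterion, one uniformly drawn threshold per candidate feature for the randomized variant, and hypothetical helper names (`gini`, `split_impurity`, `choose_split`). It is not how scikit-learn implements its trees.

```python
import numpy as np


def gini(y):
    """Gini impurity of an integer label vector."""
    if y.size == 0:
        return 0.0
    p = np.bincount(y) / y.size
    return 1.0 - np.sum(p ** 2)


def split_impurity(col, y, threshold):
    """Size-weighted Gini impurity after splitting on col <= threshold."""
    left = col <= threshold
    n_left, n_right = left.sum(), (~left).sum()
    return (n_left * gini(y[left]) + n_right * gini(y[~left])) / y.size


def choose_split(X, y, rng, max_features, randomized_thresholds):
    """Return (impurity, feature, threshold) for one node (hypothetical helper)."""
    # Both methods only look at a random subset of the features
    features = rng.choice(X.shape[1], size=max_features, replace=False)
    best = None
    for f in features:
        col = X[:, f]
        if randomized_thresholds:
            # Extremely randomized trees: one random threshold per feature
            candidates = [rng.uniform(col.min(), col.max())]
        else:
            # Random forest: search every midpoint between sorted values
            values = np.unique(col)
            candidates = (values[:-1] + values[1:]) / 2.0
        for t in candidates:
            score = split_impurity(col, y, t)
            if best is None or score < best[0]:
                best = (score, int(f), float(t))
    return best


rng = np.random.default_rng(0)
X_node = rng.normal(size=(200, 6))
y_node = (X_node[:, 2] > 0).astype(int)  # only feature 2 is informative

# Random forest: bootstrap sample, then best threshold among 3 random features
idx = rng.integers(0, X_node.shape[0], size=X_node.shape[0])
print("random forest split:", choose_split(X_node[idx], y_node[idx], rng, 3, False))

# Extremely randomized trees: random thresholds, best of those is kept
print("extra trees split:  ", choose_split(X_node, y_node, rng, 3, True))
```

Because each randomized threshold defines a partition that the exhaustive search also considers, the randomized rule can never find a lower impurity on the same features. That is the source of the extra bias mentioned above, traded for less correlated trees.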

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score
import warnings
warnings.filterwarnings('always')

In [2]:
# Input data before feature selection
input_data_before_fs = pd.read_csv('processed_train.csv', index_col=0)

# Input data after feature selection
input_data_after_fs = pd.read_csv('processed_train_after_feature.csv', index_col=0)

# Upsampling without feature selection

# Upsampling with feature selection

# Downsampling without feature selection

# Downsampling with feature selection


# List of all the input data
input_all = {
    "normal_before_fs" : input_data_before_fs,
#     "normal_after_fs" : input_data_after_fs
}

In [3]:
# AdaBoost

from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)

In [4]:
# Gradient Tree Boosting

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0)

In [5]:
# Random Forests

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)

In [6]:
# Extremely Randomized Trees

from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                           min_samples_split=2, random_state=0)

In [8]:
# Functions
from sklearn.metrics import make_scorer, roc_auc_score

def metric_consolidation(input_all, classifier, method = "cross_validation"):
    metrics = {'accuracy': 'accuracy',
               'f1_weighted': 'f1_weighted',
               'roc_auc': make_scorer(roc_auc_score, average='weighted')
              }
    
    for input_name, input_data in input_all.items():
        # split the data
        x_train, x_test, y_train, y_test = preprocessing(input_data)

        # fit the classifier to the training data
        classifier.fit(x_train, y_train)

        # apply all metrics to the classifier for cross_validation
        if method == "cross_validation":
            scores = tenfold(classifier, x_train, y_train, metric = metrics)
            print ("Metrics for %s: \n" %input_name)
            for metric in metrics:
                test_score_name = "test_" + metric
                test_score = scores[test_score_name]
                print ("%s Test Score: %f +/- %f" %(metric, test_score.mean(),
                                               test_score.std()))   
            print ("\n")
            
def preprocessing(data):
    # Group the columns by dtype: boolean, continuous (float64), categorical (int64)
    bool_var = list(data.select_dtypes(['bool']))
    cont_var = list(data.select_dtypes(['float64']))
    cat_var = list(data.select_dtypes(['int64']))

    # Keep only these columns; anything of another dtype (such as ID details) is dropped
    final_input_data = data[cat_var + cont_var + bool_var]
    
    x = final_input_data.loc[:, final_input_data.columns != 'Target'].values
    y = final_input_data['Target'].values
    y=y.astype('int')
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, 
                                                    random_state = 100 , stratify = y)
    
    return x_train, x_test, y_train, y_test

from sklearn.model_selection import StratifiedKFold, cross_validate

def tenfold(model, x, y, metric='accuracy'):
    kfold = StratifiedKFold(n_splits=10, random_state=100, shuffle=True)
    scores = cross_validate(model, x, y, cv=kfold, scoring=metric, 
                            return_train_score=True)
    return scores

# accuracy_mean = scores['test_score'].mean()
# accuracy_std = scores['train_score'].std()
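
The `"test_" + metric` lookup in `metric_consolidation` relies on how `cross_validate` names its result keys when `scoring` is a dict. The short sketch below shows that naming on synthetic data. The dataset, the 5-fold setting, and the `_demo` variable names are assumptions for illustration, not part of the notebook's pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for the training split (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

scoring_demo = {'accuracy': 'accuracy', 'f1_weighted': 'f1_weighted'}
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=100)

scores_demo = cross_validate(
    RandomForestClassifier(n_estimators=10, random_state=0),
    X_demo, y_demo, cv=cv_demo, scoring=scoring_demo, return_train_score=True)

# Each dict key yields one "test_<key>" and one "train_<key>" array,
# holding one score per fold
for name in scoring_demo:
    test_scores = scores_demo['test_' + name]
    print("%s: %f +/- %f" % (name, test_scores.mean(), test_scores.std()))
```

`cross_validate` fits clones of the estimator on each fold, so the scores do not depend on whether the estimator passed in was fitted beforehand.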

In [None]:
# Boosting Parameters
learning_rate_values = [0.1, 0.5, 1]
n_estimators_values = [50, 100, 150]
method_values = ['ada', 'gtb']

for method in method_values:
    for learning_rate in learning_rate_values:
        for n_estimators in n_estimators_values:
            
            if method =='ada':
                clf = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate, 
                                         random_state=0)
            elif method == 'gtb':
                clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=learning_rate, 
                                         random_state=0)
            
            print ("For Boosting with: \n method: %s \n learning rate: %s \n n_estimators: %s"
                  %(method, learning_rate, n_estimators))
            metric_consolidation(input_all, clf)
            print ("\n \n")

For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 50




Metrics for normal_before_fs: 

accuracy Test Score: 0.673432 +/- 0.010978
f1_weighted Test Score: 0.567380 +/- 0.015882



 

For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 100




Metrics for normal_before_fs: 

accuracy Test Score: 0.682820 +/- 0.014255
f1_weighted Test Score: 0.593405 +/- 0.016860



 

For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 150




Metrics for normal_before_fs: 

accuracy Test Score: 0.681036 +/- 0.014907
f1_weighted Test Score: 0.597380 +/- 0.016394



 

For Boosting with: 
 method: ada 
 learning rate: 0.5 
 n_estimators: 50




Metrics for normal_before_fs: 

accuracy Test Score: 0.674774 +/- 0.014374
f1_weighted Test Score: 0.607183 +/- 0.018601



 

For Boosting with: 
 method: ada 
 learning rate: 0.5 
 n_estimators: 100
Metrics for normal_before_fs: 

accuracy Test Score: 0.673016 +/- 0.019268
f1_weighted Test Score: 0.620297 +/- 0.020162



 

For Boosting with: 
 method: ada 
 learning rate: 0.5 
 n_estimators: 150
Metrics for normal_before_fs: 

accuracy Test Score: 0.673423 +/- 0.018802
f1_weighted Test Score: 0.629004 +/- 0.019580



 

For Boosting with: 
 method: ada 
 learning rate: 1 
 n_estimators: 50
Metrics for normal_before_fs: 

accuracy Test Score: 0.664928 +/- 0.018340
f1_weighted Test Score: 0.620061 +/- 0.017520



 

For Boosting with: 
 method: ada 
 learning rate: 1 
 n_estimators: 100
Metrics for normal_before_fs: 

accuracy Test Score: 0.655521 +/- 0.023329
f1_weighted Test Score: 0.626039 +/- 0.024363



 

For Boosting with: 
 method: ada 
 learning rate: 1 
 n_estimators: 150
Me

In [None]:
# Random Forest Parameters
n_estimators_values = [10, 50, 100] # Number of trees in the forest
method_values = ['random', 'extreme']

for method in method_values:
    for n_estimators in n_estimators_values:

        if method == 'random':
            clf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
        elif method =='extreme':
            clf = ExtraTreesClassifier(n_estimators=n_estimators, random_state=0)

        print ("For Random Forests with: \n method: %s \n n_estimators: %s"
              %(method, n_estimators))
        metric_consolidation(input_all, clf)
        print ("\n \n")