## Boosting Algorithms

1. AdaBoost
<br>
The core principle of AdaBoost is to fit a sequence of weak learners (i.e. models that are only slightly better than random guessing, such as small decision trees) on repeatedly reweighted versions of the data. The predictions are then combined through a weighted majority vote (or sum) to produce the final prediction. It can be used for both classification and regression problems.
<br><br>
2. Gradient Tree Boosting
<br>
Gradient Tree Boosting, also called Gradient Boosted Regression Trees (GBRT), is a generalization of boosting to arbitrary differentiable loss functions. GBRT is an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems.
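
The reweighting and weighted vote described for AdaBoost can be sketched by hand. This is a minimal illustration of the classic discrete AdaBoost update for a binary problem, not scikit-learn's internal implementation; the toy dataset, the number of rounds, and the helper name `ensemble_predict` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy binary problem (assumed data); labels recoded to {-1, +1} for the vote.
X, y01 = make_classification(n_samples=300, n_features=8, random_state=0)
y = np.where(y01 == 1, 1, -1)

n_rounds = 20
weights = np.full(len(y), 1.0 / len(y))  # start with uniform sample weights
stumps, alphas = [], []

for _ in range(n_rounds):
    # Weak learner: a depth-1 tree fitted on the current sample weights.
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=weights)
    pred = stump.predict(X)

    # Weighted training error of this weak learner.
    err = np.clip(np.sum(weights[pred != y]), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - err) / err)  # vote weight of this learner

    # Up-weight misclassified samples, down-weight correct ones, renormalise.
    weights *= np.exp(-alpha * y * pred)
    weights /= weights.sum()

    stumps.append(stump)
    alphas.append(alpha)


def ensemble_predict(X_new):
    """Weighted majority vote over all weak learners (hypothetical helper)."""
    votes = sum(a * s.predict(X_new) for a, s in zip(alphas, stumps))
    return np.sign(votes)


first_stump_acc = np.mean(stumps[0].predict(X) == y)
ensemble_acc = np.mean(ensemble_predict(X) == y)
print("first stump training accuracy: %.3f" % first_stump_acc)
print("ensemble training accuracy:    %.3f" % ensemble_acc)
```

Each round concentrates the sample weights on the examples the previous stumps got wrong, which is what "repeatedly modified versions of the data" amounts to in practice.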

## Random Forests

1. Random Forests
<br>
Each tree in the ensemble is built from a sample drawn with replacement (i.e. a bootstrap sample) from the training set. In addition, when splitting a node during the construction of the tree, the chosen split is no longer the best split among all the features. Instead, it is the best split among a random subset of the features.
<br><br>
2. Extremely Randomized Trees
<br>
Randomness goes one step further in the way splits are computed. As in random forests, a random subset of candidate features is used, but instead of looking for the most discriminative thresholds, thresholds are drawn at random for each candidate feature and the best of these randomly generated thresholds is picked as the splitting rule. This usually reduces the variance of the model a bit more, at the expense of a slightly greater increase in bias.
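
The difference between the two split rules can be illustrated on a single node. The sketch below uses only NumPy; the toy data, the Gini criterion, the subset size of 3, and all helper names are assumptions chosen for illustration rather than a description of scikit-learn's internals.

```python
import numpy as np

rng = np.random.default_rng(0)


def gini(labels):
    """Gini impurity of a label array."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def split_impurity(x, y, threshold):
    """Size-weighted impurity of the children produced by x <= threshold."""
    left, right = y[x <= threshold], y[x > threshold]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)


def best_threshold(x, y):
    """Random-forest style: scan every midpoint between sorted unique values."""
    values = np.unique(x)
    candidates = (values[:-1] + values[1:]) / 2.0
    scores = [split_impurity(x, y, t) for t in candidates]
    return candidates[int(np.argmin(scores))]


def random_threshold(x, rng):
    """Extra-trees style: one threshold drawn uniformly in the feature range."""
    return rng.uniform(x.min(), x.max())


# Toy data (assumed): 200 samples, 6 features, label depends on features 0 and 3.
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# Bootstrap sample: draw row indices with replacement.
boot = rng.integers(0, len(y), size=len(y))
Xb, yb = X[boot], y[boot]

# Both methods only consider a random subset of the features at this node.
feature_subset = rng.choice(X.shape[1], size=3, replace=False)


def impurity_of(split):
    feature, threshold = split
    return split_impurity(Xb[:, feature], yb, threshold)


rf_split = min(((f, best_threshold(Xb[:, f], yb)) for f in feature_subset),
               key=impurity_of)
et_split = min(((f, random_threshold(Xb[:, f], rng)) for f in feature_subset),
               key=impurity_of)

print("random forest split:", rf_split, "impurity %.3f" % impurity_of(rf_split))
print("extra trees split:  ", et_split, "impurity %.3f" % impurity_of(et_split))
```

On any single node the exhaustive search can never do worse than a random threshold on the same features; the extra randomness pays off only once many such trees are averaged.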

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score

In [2]:
# Input data before feature selection
input_data_before_fs = pd.read_csv('processed_train.csv', index_col=0)

# Input data after feature selection
input_data_after_fs = pd.read_csv('processed_train_after_feature.csv', index_col=0)

# List of all the input data
input_all = {
    "normal_before_fs" : input_data_before_fs,
    "normal_after_fs" : input_data_after_fs
}

In [15]:
# Functions
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import LabelBinarizer

def multiclass_roc_auc_score(y_test, y_pred, average="macro"):
    lb = LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_auc_score(y_test, y_pred, average=average)

def metric_consolidation(input_all, classifier, method = "cross_validation"):
    metrics = {'accuracy': 'accuracy',
               'roc_auc': make_scorer(multiclass_roc_auc_score, average='weighted'),
               'f1_weighted': 'f1_weighted'
              }
    
    for input_name, input_data in input_all.items():
        # split the data
        x_train, x_test, y_train, y_test = preprocessing(input_data)

        # fit the classifier to the training data
        classifier.fit(x_train, y_train)

        # apply all metrics to the classifier for cross_validation
        if method == "cross_validation":
            scores = tenfold(classifier, x_train, y_train, metric = metrics)
            print ("Metrics for %s: \n" %input_name)
            for metric in metrics:
                test_score_name = "test_" + metric
                test_score = scores[test_score_name]
                print ("%s Test Score: %0.2f +/- %0.2f" %(metric, test_score.mean()*100,
                                               test_score.std()*100))   
            print ("\n")
            
        if method == "test":
            y_pred = classifier.predict(x_test)
            accuracy = accuracy_score(y_test, y_pred)
            roc_score = multiclass_roc_auc_score(y_test, y_pred, average='weighted')
            f1_weighted = f1_score(y_test, y_pred, average='weighted')
            print(confusion_matrix(y_test, y_pred))
            
            metric_values = {'accuracy': accuracy,
                             'roc_auc': roc_score,
                             'f1_weighted': f1_weighted
                            }
            for metric in metrics:
                # Each value is a single score; np.mean/np.std also accept plain Python floats
                test_score = metric_values[metric]
                print ("%s Test Score: %0.2f +/- %0.2f" %(metric, np.mean(test_score)*100,
                                               np.std(test_score)*100)) 
            
def preprocessing(data):
    #Split data into variables types - boolean, categorical, continuous, ID
    bool_var = list(data.select_dtypes(['bool']))
    cont_var = list(data.select_dtypes(['float64']))
    cat_var = list(data.select_dtypes(['int64']))

    #Input Data can be from all except id details
    final_input_data = data[cat_var + cont_var + bool_var]
    
    x = final_input_data.loc[:, final_input_data.columns != 'Target'].values
    y = final_input_data['Target'].values
    y=y.astype('int')
    
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, 
                                                    random_state = 100 , stratify = y)
    
    return x_train, x_test, y_train, y_test

from sklearn.model_selection import StratifiedKFold, cross_validate

def tenfold(model, x, y, metric='accuracy'):
    kfold = StratifiedKFold(n_splits=10, random_state=100, shuffle=True)
    scores = cross_validate(model, x, y, cv=kfold, scoring=metric, 
                            return_train_score=True)
    return scores

# accuracy_mean = scores['test_score'].mean()
# accuracy_std = scores['train_score'].std()
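
The `roc_auc` scorer defined above is computed from hard class predictions, so each one-vs-rest ROC curve has only a single operating point. A possible alternative is to score the predicted class probabilities instead. The sketch below compares the two on synthetic data; the dataset and model are assumptions, and the `multi_class` argument of `roc_auc_score` needs scikit-learn 0.22 or later, which may be newer than the version used for this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

# Synthetic 4-class problem standing in for the real data (assumption).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=100, stratify=y)
model = RandomForestClassifier(n_estimators=50, random_state=100)
model.fit(X_tr, y_tr)

# Label-based score, mirroring multiclass_roc_auc_score above.
lb = LabelBinarizer().fit(y_te)
auc_from_labels = roc_auc_score(lb.transform(y_te),
                                lb.transform(model.predict(X_te)),
                                average="weighted")

# Probability-based one-vs-rest score.
auc_from_proba = roc_auc_score(y_te, model.predict_proba(X_te),
                               multi_class="ovr", average="weighted")

print("ROC AUC from hard labels:   %.3f" % auc_from_labels)
print("ROC AUC from probabilities: %.3f" % auc_from_proba)
```

The two numbers are not directly comparable, so results reported with one definition should not be mixed with results from the other.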

In [4]:
# Adaboost

from sklearn.ensemble import AdaBoostClassifier

clf = AdaBoostClassifier(n_estimators=100, learning_rate=1.0, random_state=0)

In [5]:
# Gradient Tree Boosting

from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
                                 max_depth=1, random_state=0)
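
Gradient tree boosting, as described earlier, fits each new tree to the negative gradient of the loss. For squared error that gradient is simply the residual, which makes the idea easy to sketch by hand. The example below is a simplified regression illustration on assumed toy data, not what `GradientBoostingClassifier` does internally for classification.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Toy 1-D regression problem (assumed data).
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_estimators = 100

# Start from a constant model: the mean minimises squared error.
prediction = np.full_like(y, y.mean())
trees, losses = [], []

for _ in range(n_estimators):
    # For L = 0.5 * (y - F)^2 the negative gradient w.r.t. F is the residual.
    residual = y - prediction
    tree = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X, residual)

    # Take a small step in the direction suggested by the new tree.
    prediction += learning_rate * tree.predict(X)

    trees.append(tree)
    losses.append(np.mean((y - prediction) ** 2))

print("training MSE after first tree: %.4f" % losses[0])
print("training MSE after last tree:  %.4f" % losses[-1])
```

Here `learning_rate` and `n_estimators` play the same roles as the constructor arguments used in the cell above: a smaller step usually needs more trees to reach the same training loss.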

In [6]:
# Random Forests

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=10, max_depth=None,
                             min_samples_split=2, random_state=0)

In [7]:
# Extremely Randomized Trees

from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None,
                           min_samples_split=2, random_state=0)

In [8]:
# Bagging 

from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

clf = BaggingClassifier(LogisticRegression(), n_estimators=10)


In [11]:
# Boosting Parameters
import warnings
warnings.filterwarnings('ignore')

learning_rate_values = [0.1, 0.5, 1]
n_estimators_values = [50, 100, 150]
method_values = ['ada', 'gtb']

for method in method_values:
    for learning_rate in learning_rate_values:
        for n_estimators in n_estimators_values:
            
            if method =='ada':
                clf = AdaBoostClassifier(n_estimators=n_estimators, learning_rate=learning_rate, 
                                         random_state=100)
            elif method == 'gtb':
                clf = GradientBoostingClassifier(n_estimators=n_estimators, learning_rate=learning_rate, 
                                         random_state=100)
            
            print ("For Boosting with: \n method: %s \n learning rate: %s \n n_estimators: %s \n"
                  %(method, learning_rate, n_estimators))
            metric_consolidation(input_all, clf)

For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 50 

Metrics for normal_before_fs: 

roc_auc Test Score: 55.38 +/- 1.61
f1_weighted Test Score: 56.74 +/- 1.59
accuracy Test Score: 67.34 +/- 1.10


For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 100 

Metrics for normal_before_fs: 

roc_auc Test Score: 59.22 +/- 1.75
f1_weighted Test Score: 59.34 +/- 1.69
accuracy Test Score: 68.28 +/- 1.43


For Boosting with: 
 method: ada 
 learning rate: 0.1 
 n_estimators: 150 

Metrics for normal_before_fs: 

roc_auc Test Score: 60.19 +/- 1.70
f1_weighted Test Score: 59.74 +/- 1.64
accuracy Test Score: 68.10 +/- 1.49


For Boosting with: 
 method: ada 
 learning rate: 0.5 
 n_estimators: 50 



KeyboardInterrupt: 
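
The three nested loops above enumerate a small hyperparameter grid. One way to flatten them is scikit-learn's `ParameterGrid`; the sketch below only builds the candidate classifiers and leaves the evaluation call as a comment, since it depends on `metric_consolidation` and `input_all` from this notebook. The `estimators` lookup dictionary is a name introduced here for illustration.

```python
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import ParameterGrid

# Map the short method names used in the notebook to estimator classes.
estimators = {"ada": AdaBoostClassifier, "gtb": GradientBoostingClassifier}

grid = ParameterGrid({
    "method": ["ada", "gtb"],
    "learning_rate": [0.1, 0.5, 1],
    "n_estimators": [50, 100, 150],
})

candidates = []
for params in grid:
    params = dict(params)           # copy so the grid entry is not mutated
    method = params.pop("method")   # remaining keys are constructor arguments
    clf = estimators[method](random_state=100, **params)
    candidates.append((method, params, clf))
    # In the notebook, metric_consolidation(input_all, clf) would be called here.

print("number of configurations:", len(candidates))
```

Note that `ParameterGrid` may visit the combinations in a different order than the hand-written loops.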

In [16]:
# Test Values for Boosting

gtb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 
                                         random_state=100)

metric_consolidation(input_all, gtb, method='test')

[[  9   9   3  34]
 [  9  32   9  61]
 [  2  21   7  59]
 [  5  21  15 448]]
f1_weighted Test Score: 61.93 +/- 0.00
roc_auc Test Score: 62.55 +/- 0.00
accuracy Test Score: 66.67 +/- 0.00
[[  8  12   5  30]
 [  7  33  10  61]
 [  4  19   9  57]
 [  8  23  10 448]]
f1_weighted Test Score: 62.44 +/- 0.00
roc_auc Test Score: 63.43 +/- 0.00
accuracy Test Score: 66.94 +/- 0.00
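
The gap between accuracy and the other two metrics is easier to read from per-class figures. The sketch below recomputes them from the first confusion matrix printed above, using the scikit-learn convention that rows are true classes and columns are predicted classes. The class indices 0 to 3 are row positions only; the actual label values are not shown in the output.

```python
import numpy as np

# First confusion matrix printed above (rows: true class, columns: predicted).
cm = np.array([
    [9, 9, 3, 34],
    [9, 32, 9, 61],
    [2, 21, 7, 59],
    [5, 21, 15, 448],
])

support = cm.sum(axis=1)                   # samples per true class
accuracy = np.trace(cm) / cm.sum()         # correct predictions / all samples
recall = np.diag(cm) / support             # per true class
precision = np.diag(cm) / cm.sum(axis=0)   # per predicted class

print("accuracy: %.4f" % accuracy)
for k in range(len(cm)):
    print("class %d: support=%3d  recall=%.3f  precision=%.3f"
          % (k, support[k], recall[k], precision[k]))
```

The accuracy reproduces the 66.67 reported above. The last class accounts for 489 of the 744 test samples and is recalled about 92% of the time, while the three smaller classes are mostly predicted as that majority class, which is why the accuracy looks better than the weighted F1 and ROC AUC scores.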


In [9]:
# Random Forest Parameters
import warnings
warnings.filterwarnings('ignore')

n_estimators_values = [10, 50, 100] # Number of trees in the forest
method_values = ['random', 'extreme']

for method in method_values:
    for n_estimators in n_estimators_values:

        if method == 'random':
            clf = RandomForestClassifier(n_estimators=n_estimators, random_state=100)
        elif method =='extreme':
            clf = ExtraTreesClassifier(n_estimators=n_estimators, random_state=100)

        print ("For Boosting with: \n method: %s \n n_estimators: %s \n"
              %(method, n_estimators))
        metric_consolidation(input_all, clf)

For Boosting with: 
 method: random 
 n_estimators: 10 

Metrics for normal_before_fs: 

accuracy Test Score: 66.09 +/- 2.12
roc_auc Test Score: 64.27 +/- 1.99
f1_weighted Test Score: 62.39 +/- 2.60


For Boosting with: 
 method: random 
 n_estimators: 50 

Metrics for normal_before_fs: 

accuracy Test Score: 68.24 +/- 1.39
roc_auc Test Score: 62.27 +/- 1.78
f1_weighted Test Score: 61.75 +/- 2.14


For Boosting with: 
 method: random 
 n_estimators: 100 

Metrics for normal_before_fs: 

accuracy Test Score: 68.51 +/- 1.74
roc_auc Test Score: 61.80 +/- 1.78
f1_weighted Test Score: 61.63 +/- 2.57


For Boosting with: 
 method: extreme 
 n_estimators: 10 

Metrics for normal_before_fs: 

accuracy Test Score: 63.84 +/- 2.12
roc_auc Test Score: 62.13 +/- 2.09
f1_weighted Test Score: 60.05 +/- 1.96


For Boosting with: 
 method: extreme 
 n_estimators: 50 

Metrics for normal_before_fs: 

accuracy Test Score: 67.03 +/- 1.46
roc_auc Test Score: 61.74 +/- 1.47
f1_weighted Test Score: 61.09 +/-

In [12]:
random = RandomForestClassifier(n_estimators=50, random_state=100)

metric_consolidation(input_all, random, method='test')

roc_auc Test Score: 60.31 +/- 0.00
f1_weighted Test Score: 60.79 +/- 0.00
accuracy Test Score: 66.94 +/- 0.00


In [11]:
# Bagging Parameters

import warnings
warnings.filterwarnings('ignore')

n_estimators_values = [10, 50, 100] # Number of base estimators in the ensemble
method_values = [LogisticRegression(), KNeighborsClassifier(), DecisionTreeClassifier()]

for method in method_values:
    for n_estimators in n_estimators_values:

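        # Pass the base model positionally: the keyword was renamed from
        # base_estimator to estimator in newer scikit-learn releases
        clf = BaggingClassifier(method, n_estimators=n_estimators, random_state=100)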

        print ("For Bagging with: \n method: %s \n n_estimators: %s \n"
              %(method, n_estimators))
        metric_consolidation(input_all, clf)

For Bagging with: 
 method: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) 
 n_estimators: 10 

Metrics for normal_before_fs: 

accuracy Test Score: 67.38 +/- 1.49
roc_auc Test Score: 57.93 +/- 1.87
f1_weighted Test Score: 58.73 +/- 1.96


For Bagging with: 
 method: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False) 
 n_estimators: 50 

Metrics for normal_before_fs: 

accuracy Test Score: 67.39 +/- 1.43
roc_auc Test Score: 58.14 +/- 1.57
f1_weighted Test Score: 58.87 +/- 1.96


For Bagging with: 
 method: LogisticRegression(C=1.0, class_weight=None, dual=False,

In [10]:
# Base model passed positionally so the call works whether the keyword is base_estimator or estimator
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=100)

metric_consolidation(input_all, bagging, method='test')

roc_auc Test Score: 61.62 +/- 0.00
f1_weighted Test Score: 61.73 +/- 0.00
accuracy Test Score: 66.13 +/- 0.00


In [14]:
# Visualize GTB
from sklearn import tree
from IPython.display import Image  
from sklearn.tree import export_graphviz
import pydotplus

gtb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, 
                                         random_state=100)

# split the data
x_train, x_test, y_train, y_test = preprocessing(input_data_after_fs)

# fit the classifier to the training data
gtb.fit(x_train, y_train)

# Get the regression tree at boosting stage index 50 (the 51st stage) for the first class
sub_tree_50 = gtb.estimators_[50, 0]

dot_data = tree.export_graphviz(
    sub_tree_50,
    out_file=None, filled=True,
    rounded=True,  
    special_characters=True,
    proportion=True,
)
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png()) 
graph.write_pdf("dt_boosting_50.pdf")
graph.write_png("dt_boosting_50.png")

True