# Comparing Boosting Techniques
## AdaBoost, Gradient Boosting Classifier, and XGBoost

This notebook compares the performance of the gradient boosting classifier in scikit-learn with XGBoost using the CharityML dataset. 

Boosting is an ensemble learning technique that trains a series of models (called weak learners) so that the ensemble improves with each iteration. The weak learners are trained in sequence, with each one correcting for the mistakes of its predecessor and thereby improving the ensemble's predictive performance.

The final prediction is then an aggregate of the predictions of the individual weak learners. Common boosting algorithms are AdaBoost, gradient boosting, and XGBoost, an improved implementation of gradient boosting.

The key difference between AdaBoost and gradient boosting is in the way each weak learner improves on its predecessor. For AdaBoost, the points that are misclassified by the current weak learner are assigned greater weight in the next iteration. This changes the distribution of points, so the next weak learner is trained with greater emphasis on not misclassifying these points again.
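As a rough illustration of the reweighting step, the sketch below runs a single AdaBoost round on five toy points in plain Python. The labels, the predictions, and the helper name `adaboost_reweight` are made up for the example. The update is the textbook discrete AdaBoost rule with labels in {-1, +1}, not code taken from this notebook or from scikit-learn.

```python
import math

def adaboost_reweight(weights, y_true, y_pred):
    """Run one discrete-AdaBoost weight update (labels in {-1, +1}).

    Returns the learner's vote weight (alpha) and the renormalized
    sample weights for the next round. Assumes 0 < weighted error < 1.
    """
    # Weighted error rate of the current weak learner
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    err /= sum(weights)
    # Vote weight: larger when the learner's weighted error is smaller
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified points (t * p == -1) are scaled up, correct ones down
    new_w = [w * math.exp(-alpha * t * p)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return alpha, [w / total for w in new_w]

# Toy data (hypothetical): five points, the weak learner misses the last one
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
weights = [0.2] * 5

alpha, new_weights = adaboost_reweight(weights, y_true, y_pred)
print(round(alpha, 3))
print([round(w, 3) for w in new_weights])
```

In this toy round the single misclassified point ends up carrying half of the total weight, which is what pushes the next weak learner to get it right.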

Gradient boosting takes a slightly different approach. Instead of changing the weights of the points misclassified by the previous weak learner, the current weak learner adjusts the prediction itself. It is fit to the residuals left by its predecessors, and its output is added to the running prediction as a correction. It turns out that this method of learning is very similar to the way the gradient descent algorithm works, hence the name. This is more clearly explained in the [informal introduction](https://en.wikipedia.org/wiki/Gradient_boosting) on gradient boosting on Wikipedia. The advantage of gradient boosting over AdaBoost is that it lets you use any differentiable loss function. With AdaBoost, you're limited to using just exponential loss.
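To make the residual-fitting idea concrete, here is a minimal sketch in plain Python. It assumes squared-error loss on a toy regression problem, where the negative gradient is exactly the residual. The data, the `learning_rate` value, and the helpers `fit_stump` and `mse` are hypothetical, and the one-split stump stands in for whatever weak learner a real library would use.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression stump to 1-D inputs by least squares."""
    best = None
    # Try every split point that leaves at least one sample on each side
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Toy regression data (hypothetical)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]

learning_rate = 0.5
# Start from a constant prediction: the mean of the targets
preds = [sum(ys) / len(ys)] * len(ys)
history = [mse(ys, preds)]

for _ in range(10):
    # For squared-error loss, the negative gradient is the residual
    residuals = [y - p for y, p in zip(ys, preds)]
    stump = fit_stump(xs, residuals)
    # Add a damped copy of the new learner's output to the running prediction
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    history.append(mse(ys, preds))

print([round(h, 4) for h in history])
```

The printed training error shrinks from round to round. Swapping in a different differentiable loss would only change how `residuals` is computed, which is the flexibility the paragraph above refers to.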

XGBoost is everything that the vanilla gradient boosting algorithm wanted to be growing up. It's the friend that got into all the good schools, received offers at all the cool jobs, and dated all the good-looking and smart people. In machine learning competitions, it is the algorithm behind many winning solutions on structured data.

XGBoost is just a gradient boosting algorithm with improved features. This [article](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/) clearly outlines its advantages.


### References
1. [Complete Guide to Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
2. [Good video explaining gradient boosting](https://www.youtube.com/watch?v=sRktKszFmSk)
3. [Tutorial on XGBoost][2]

### Installing XGBoost

The original instructions to install XGBoost can be found [here][1]. I had trouble installing it following these instructions, which I suspect is because I'm using the Anaconda distribution. I was able to install it using this command

```
conda install -c conda-forge xgboost=0.6a2
``` 

which I found on Stack Overflow [here][3]. 

[1]: http://xgboost.readthedocs.io/en/latest/build.html
[2]: http://xgboost.readthedocs.io/en/latest/model.html
[3]: https://stackoverflow.com/questions/43464454/install-xgboost-on-anaconda

In [7]:
import sys
sys.version

'3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:04:09) \n[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]'

In [8]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from time import time
from IPython.display import display # Allows the use of display() for DataFrames

# Import supplementary visualization code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the Census dataset
data = pd.read_csv("census.csv")

# Success - Display the first two records
display(data.head(n=2))

Unnamed: 0,age,workclass,education_level,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,Bachelors,13.0,Never-married,Adm-clerical,Not-in-family,White,Male,2174.0,0.0,40.0,United-States,<=50K
1,50,Self-emp-not-inc,Bachelors,13.0,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.0,0.0,13.0,United-States,<=50K


## Prepare Data

In [9]:
# Import sklearn.preprocessing.MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Split the data into features and target label
income_raw = data['income']
features_raw = data.drop('income', axis = 1)

# Log-transform the skewed features
skewed = ['capital-gain', 'capital-loss']
features_log_transformed = pd.DataFrame(data = features_raw)
features_log_transformed[skewed] = features_raw[skewed].apply(lambda x: np.log(x + 1))

# Initialize a scaler, then apply it to the features
scaler = MinMaxScaler() # default=(0, 1)
numerical = ['age', 'education-num', 'capital-gain', 'capital-loss', 'hours-per-week']
features_log_minmax_transform = pd.DataFrame(data = features_log_transformed)
features_log_minmax_transform[numerical] = scaler.fit_transform(features_log_transformed[numerical])


# One-hot encode the 'features_log_minmax_transform' data using pandas.get_dummies()
features_final = pd.get_dummies(features_log_minmax_transform)

# Encode the 'income_raw' data to numerical values
income = pd.get_dummies(income_raw)
income = income['>50K']

## Shuffle and Split Data

In [10]:
# Import train_test_split (sklearn.cross_validation was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split

# Split the 'features' and 'income' data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(features_final, 
                                                    income, 
                                                    test_size = 0.2, 
                                                    random_state = 0)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

Training set has 36177 samples.
Testing set has 9045 samples.


## Training and Prediction Pipeline

In [11]:
from sklearn.metrics import fbeta_score, accuracy_score

def train_predict(learner, sample_size, X_train, y_train, X_test, y_test): 
    '''
    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - sample_size: the size of samples (number) to be drawn from training set
       - X_train: features training set
       - y_train: income training set
       - X_test: features testing set
       - y_test: income testing set
    '''
    
    results = {}
    
    # Fit the learner to the training data using slicing with 
    # 'sample_size' using .fit(training_features[:], training_labels[:])
    start = time()
    learner = learner.fit(X_train[:sample_size], y_train[:sample_size])
    end = time()
    
    # Calculate the training time
    results['train_time'] = end - start
        
    # Get the predictions on the test set(X_test), then get 
    # predictions on the first 300 training samples(X_train) using .predict()
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    
    # Calculate the total prediction time
    results['pred_time'] = end - start
            
    # Compute accuracy on training (the first 300 training samples) and test set
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train)
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    
    # Compute F-score on the training (first 300) and test set
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
       
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))

    return results
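The pipeline above scores models with `fbeta_score(..., beta=0.5)`. As a worked illustration of what that metric computes, the sketch below evaluates the standard F-beta formula, (1 + beta^2) * P * R / (beta^2 * P + R), by hand on a handful of made-up labels. The helper name `fbeta` and the toy labels are assumptions for the example, and the function skips the zero-division handling a library implementation would need.

```python
def fbeta(y_true, y_pred, beta):
    """F-beta score for binary labels (1 = positive class).

    Assumes at least one predicted positive and one actual positive.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy labels (hypothetical): 3 true positives, 1 false positive, 2 false negatives
y_true = [1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1, 0, 0]

f_half = fbeta(y_true, y_pred, beta=0.5)
f_one = fbeta(y_true, y_pred, beta=1.0)
print(round(f_half, 4), round(f_one, 4))
```

Here precision (0.75) is higher than recall (0.6), so the beta = 0.5 score comes out above the F1 score: a beta below 1 weights precision more heavily than recall.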

## Scikit-learn Gradient Boosted Tree

In [15]:
from sklearn.ensemble import GradientBoostingClassifier

learner = GradientBoostingClassifier()
sample_size = len(y_train)
train_predict(learner, sample_size, X_train, y_train, X_test, y_test)

GradientBoostingClassifier trained on 36177 samples.


{'acc_test': 0.86301824212271971,
 'acc_train': 0.85666666666666669,
 'f_test': 0.7395338561802719,
 'f_train': 0.73412698412698407,
 'pred_time': 0.02304530143737793,
 'train_time': 8.164382934570312}

In [6]:
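import xgboost as xgb

# Booster parameters: depth-2 trees, learning rate (eta) of 1, quiet logging,
# and a logistic objective for binary classification
param = {'max_depth':2, 'eta':1, 'silent':1, 'objective':'binary:logistic' }
# Number of boosting rounds
num_round = 2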
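The cell above only sets parameters, so as a rough sketch of what `eta` and `binary:logistic` mean for a prediction, the plain-Python snippet below combines the outputs of two trees for a single sample. The leaf scores are invented for illustration, the base score is ignored, and nothing here calls XGBoost itself. It only mirrors the additive model described in the XGBoost tutorial linked in the references.

```python
import math

def sigmoid(z):
    """Map a log-odds margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

eta = 1.0  # shrinkage applied to each tree's output, matching `param`
# Hypothetical raw leaf scores that two trees (num_round = 2) give one sample
leaf_scores = [0.8, -0.3]

# The trees' scaled outputs are summed into a single margin (log-odds) ...
margin = sum(eta * s for s in leaf_scores)
# ... which the logistic objective maps to a probability of the positive class
probability = sigmoid(margin)
print(round(probability, 4))
```

A smaller `eta` would shrink each tree's contribution, which usually means more rounds are needed to reach the same margin.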



