<div class="alert alert-block alert-info" style="margin-top: 20px; background-color:#DCDCDC">
<strong>Classification</strong> Find what class a sample belongs to.
</div>

# XGBoost (Classification)

## Overview

- [Description](#Description)  
- [AdaBoost](#AdaBoost)
- [EXAMPLE - Decision Boundaries Visualization](#EXAMPLE---Decision-Boundaries-Visualization)

## Description

**XGBoost** is an implementation of [**gradient boosted decision trees**](https://en.wikipedia.org/wiki/Gradient_boosting) designed for speed and performance. XGBoost stands for e**X**treme **G**radient **B**oosting.

<div class="alert alert-block alert-warning" style="margin-top: 20px">
<strong>Boosting</strong>
<br/>
<br/>
Boosting is an **ensemble** technique where new models are added to correct the errors made by existing models. Models are added sequentially until no further improvements can be made. A popular example is the [AdaBoost](http://localhost:8888/notebooks/Machine%20Learning%20Foundations/machine-learning-notes/02%20Classification/07%20AdaBoost.ipynb) algorithm that weights data points that are hard to predict.
<br/>
<br/>
Gradient boosting is an approach where new models are created that predict the residuals or errors of prior models and then added together to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
<br/>
<br/>
This approach supports both regression and classification predictive modeling problems.
</div>
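
The residual-fitting idea described above can be sketched in a few lines of plain Python. This is a minimal illustration for squared-error regression with one-split trees (stumps); the function names `fit_stump`, `boost` and `mse` and the toy data are assumptions made for this example and are not part of XGBoost.

```python
# Minimal gradient boosting for squared-error regression, standard library only.
# All names here (fit_stump, boost, mse) are illustrative, not XGBoost API.

def fit_stump(xs, residuals):
    """Find the one-split regression stump that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:   # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Add stumps one at a time, each fitted to the current residuals."""
    base = sum(ys) / len(ys)                 # initial constant prediction
    stumps = []

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    for _ in range(n_rounds):
        # For squared error, the negative gradient is the residual y - F(x).
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given points."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

baseline = boost(xs, ys, n_rounds=0)         # only predicts the mean
model = boost(xs, ys, n_rounds=50)
print(mse(baseline, xs, ys), mse(model, xs, ys))
```

Each round shrinks the new stump by `learning_rate` before adding it, which is the same role the learning rate plays in the library.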

XGBoost is a software library that can be downloaded, installed and accessed from a variety of interfaces, including:

- CLI
- C++
- Python
- R
- Julia
- Java, Scala

This library offers a number of advanced features.

**Model Features**

- *Gradient Boosting* algorithm, also called gradient boosting machine, including the learning rate
- *Stochastic Gradient Boosting* with sub-sampling at the row, column and column per split levels
- *Regularized Gradient Boosting* with both L1 and L2 regularization
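
As a rough guide, the model features above correspond to constructor arguments of `XGBClassifier` that also appear in the estimator's printed representation later in this notebook. The values below are arbitrary placeholders, not tuned recommendations, and `build_classifier` is a hypothetical helper introduced only for this sketch.

```python
# Illustrative settings only; the values are arbitrary, not tuned.
params = {
    "n_estimators": 100,
    "max_depth": 3,
    "learning_rate": 0.1,      # shrinkage applied to each new tree
    "subsample": 0.8,          # row sub-sampling for each tree
    "colsample_bytree": 0.8,   # column sub-sampling for each tree
    "colsample_bylevel": 0.8,  # column sub-sampling for each tree level
    "reg_alpha": 0.1,          # L1 regularization term
    "reg_lambda": 1.0,         # L2 regularization term
}


def build_classifier(**overrides):
    """Create an XGBClassifier from the settings above (requires xgboost)."""
    from xgboost import XGBClassifier
    return XGBClassifier(**{**params, **overrides})
```

Setting the sub-sampling ratios below 1 enables the stochastic variant, while the two `reg_` arguments control regularization.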

**System Features**

- *Parallelization* of tree construction using all the CPU cores during training
- *Distributed Computing* for training very large models using a cluster of machines
- *Out-of-Core Computing* for very large datasets that don't fit into memory
- *Cache Optimization* of data structures and algorithms to make the best use of hardware

**Algorithm Features**

- *Sparse Aware* implementation with automatic handling of missing data values
- *Block Structure* to support the parallelization of tree construction
- *Continued Training* to further boost an already fitted model on new data
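
The automatic handling of missing values can be pictured as learning a default direction at each split: rows with a missing feature value are sent to whichever side gives the better fit. The sketch below is a simplified illustration that scores a split with squared error rather than the gradient statistics XGBoost actually uses; `split_error` and `best_default_direction` are hypothetical names, and `None` stands in for a missing value.

```python
def split_error(values, targets, threshold, missing_goes_left):
    """Sum of squared errors of a split, with missing values sent to one side."""
    left, right = [], []
    for value, target in zip(values, targets):
        if value is None:
            (left if missing_goes_left else right).append(target)
        elif value <= threshold:
            left.append(target)
        else:
            right.append(target)

    def sse(group):
        if not group:
            return 0.0
        mean = sum(group) / len(group)
        return sum((t - mean) ** 2 for t in group)

    return sse(left) + sse(right)


def best_default_direction(values, targets, threshold):
    """Pick the side for missing values that yields the lower split error."""
    left_error = split_error(values, targets, threshold, missing_goes_left=True)
    right_error = split_error(values, targets, threshold, missing_goes_left=False)
    return "left" if left_error <= right_error else "right"


values = [1.0, 2.0, None, 8.0, 9.0, None]
targets = [0.0, 0.1, 1.0, 1.0, 0.9, 1.1]
direction = best_default_direction(values, targets, threshold=5.0)
print(direction)
```

In this toy data the rows with missing values have targets close to those of the high-value rows, so sending them right gives the lower error.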

Pros:

- Execution Speed
- Model Performance

Cons:

- Susceptible to overfitting




## AdaBoost

[AdaBoostClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html#sklearn.ensemble.AdaBoostClassifier) is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, with the weights of incorrectly classified instances adjusted so that subsequent classifiers focus more on difficult cases.

<div class="alert alert-block alert-info" style="margin-top: 20px">
<strong>AdaBoostClassifier</strong> (base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
<br/>
Parameters:
<ul>
<li>base_estimator: the base estimator from which the boosted ensemble is built (default=DecisionTreeClassifier)</li>
<li>n_estimators: the maximum number of estimators at which boosting is terminated.</li>
<li>learning_rate: learning rate shrinks the contribution of each classifier by learning_rate</li>
<li>algorithm: if ‘SAMME.R’, use the SAMME.R real boosting algorithm; base_estimator must then support calculation of class probabilities. If ‘SAMME’, use the SAMME discrete boosting algorithm. The SAMME.R algorithm typically converges faster than SAMME, achieving a lower test error with fewer boosting iterations.</li>
</ul>
</div>
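
The re-weighting step can be made concrete with a small sketch. This shows one round of the classic discrete AdaBoost update for binary labels in {-1, +1}, not the SAMME.R variant, and it assumes the weighted error is strictly between 0 and 1. The function name `reweight` and the toy labels are made up for this example.

```python
import math


def reweight(weights, y_true, y_pred):
    """One round of discrete AdaBoost re-weighting for labels in {-1, +1}."""
    # Weighted error of the current classifier (assumed to satisfy 0 < error < 1).
    error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)
    alpha = 0.5 * math.log((1.0 - error) / error)   # vote of this classifier
    # Correct predictions (t * p = +1) shrink, mistakes (t * p = -1) grow.
    updated = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


y_true = [+1, +1, -1, -1, +1]
y_pred = [+1, -1, -1, -1, +1]          # the second sample is misclassified
weights = [1 / len(y_true)] * len(y_true)

alpha, new_weights = reweight(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update, the single misclassified sample carries half of the total weight, so the next classifier is pushed to get it right.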

In [1]:
# load libraries and set plot parameters
import numpy as np
import pandas as pd
# import PrettyTable as pt

import matplotlib.pyplot as plt
%matplotlib inline

# plots configuration
# plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = 12, 8
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['legend.fontsize'] = 11

In [2]:
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

dataset = load_iris()
print(dataset['DESCR'])

Iris Plants Database

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris d



In [3]:
df = pd.DataFrame(dataset.data, columns=dataset.feature_names)
df['class'] = pd.Series(dataset.target, name='class')
df.head()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [4]:
df.describe()

|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | class |
|---|---|---|---|---|---|
| count | 150.0 | 150.0 | 150.0 | 150.0 | 150.0 |
| mean | 5.843333 | 3.054 | 3.758667 | 1.198667 | 1.0 |
| std | 0.828066 | 0.433594 | 1.76442 | 0.763161 | 0.819232 |
| min | 4.3 | 2.0 | 1.0 | 0.1 | 0.0 |
| 25% | 5.1 | 2.8 | 1.6 | 0.3 | 0.0 |
| 50% | 5.8 | 3.0 | 4.35 | 1.3 | 1.0 |
| 75% | 6.4 | 3.3 | 5.1 | 1.8 | 2.0 |
| max | 7.9 | 4.4 | 6.9 | 2.5 | 2.0 |


In [5]:
X = df.drop('class', axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=7)

In [6]:
xgboost = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100)
xgboost.fit(X_train, y_train)

XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='multi:softprob', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

In [7]:
xgboost.classes_

array([0, 1, 2])

In [8]:
# feature_importances_
# The higher the value, the more important the feature.
# The values are normalized to sum to 1. How XGBoost scores each feature
# (for example by split count or by gain) depends on the library version
# and its importance_type setting, so this is not the Gini importance
# that scikit-learn tree ensembles report.
xgboost.feature_importances_

array([ 0.22272727,  0.07954545,  0.53636366,  0.16136363], dtype=float32)
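
The importance array follows the column order of the training data, so pairing it with the feature names makes it easier to read. The snippet below copies the values printed above and the iris column names. In the notebook itself, `X.columns` and `xgboost.feature_importances_` could be used instead of the hard-coded lists.

```python
feature_names = ["sepal length (cm)", "sepal width (cm)",
                 "petal length (cm)", "petal width (cm)"]
importances = [0.22272727, 0.07954545, 0.53636366, 0.16136363]  # copied from the output above

# Sort features from most to least important.
ranked = sorted(zip(feature_names, importances),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print("{0:<20} {1:.3f}".format(name, score))
```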

In [9]:
print('Accuracy: {0}'.format(xgboost.score(X_test, y_test)))
y_pred = xgboost.predict(X_test)
print('Number of mislabeled points: {0}'.format((y_test != y_pred).sum()))

Accuracy: 0.9111111111111111
Number of mislabeled points: 4
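
The two numbers are consistent with each other: with `test_size=0.3`, the test set holds 45 of the 150 samples, and 4 errors out of 45 leave 41 correct predictions. A quick check of that arithmetic:

```python
n_test = 45          # 30% of the 150 iris samples
n_mislabeled = 4     # from the output above
accuracy = (n_test - n_mislabeled) / n_test
print(accuracy)
```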


## EXAMPLE - Decision Boundaries Visualization

For visualization purposes we will choose two features: petal width and petal length.

In [10]:
from modules import plot_decision_regions

X = df.drop(labels=['class','sepal length (cm)', 'sepal width (cm)'], axis=1)
y = df['class']
X.head()

|   | petal length (cm) | petal width (cm) |
|---|---|---|
| 0 | 1.4 | 0.2 |
| 1 | 1.4 | 0.2 |
| 2 | 1.3 | 0.2 |
| 3 | 1.5 | 0.2 |
| 4 | 1.4 | 0.2 |


In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit on plain NumPy arrays so the model does not store the DataFrame column names.
# Fitting on the DataFrame directly raised, inside plot_decision_regions:
#   ValueError: feature_names mismatch: ['petal length (cm)', 'petal width (cm)'] ['f0', 'f1']
# which suggests the helper calls predict() on an array without column names.
xgboost = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100)
xgboost.fit(X_train.values, y_train.values)

plot_decision_regions(X_train, X_test, y_train, y_test, classifier=xgboost, test_marker=True)
plt.title('XGBoost')
plt.xlabel('petal length (cm)')
plt.ylabel('petal width (cm)')
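
The `plot_decision_regions` helper comes from a local `modules` package that is not shown here. A helper of this kind typically evaluates the classifier on a dense grid over the two features and colours each grid point by its predicted class. The sketch below is a hypothetical stand-in, not the actual helper: `decision_grid` and `toy_predict` are names invented for this example. Because the grid is a plain NumPy array without column names, a model fitted on a DataFrame may reject it, which is one way a `feature_names mismatch` like the one reported in this section can arise.

```python
import numpy as np


def decision_grid(predict, X, step=0.02, margin=1.0):
    """Evaluate `predict` on a regular grid covering the two columns of X."""
    X = np.asarray(X, dtype=float)
    x_min, x_max = X[:, 0].min() - margin, X[:, 0].max() + margin
    y_min, y_max = X[:, 1].min() - margin, X[:, 1].max() + margin
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
    grid = np.column_stack([xx.ravel(), yy.ravel()])   # plain array, no column names
    zz = np.asarray(predict(grid)).reshape(xx.shape)
    return xx, yy, zz


# Stand-in classifier so the sketch runs without a fitted model.
def toy_predict(points):
    return (points[:, 0] > 2.5).astype(int)   # threshold on the first feature


X_demo = np.array([[1.4, 0.2], [4.7, 1.4], [6.0, 2.5]])
xx, yy, zz = decision_grid(toy_predict, X_demo, step=0.1)
print(xx.shape, zz.shape)
```

With matplotlib, the result could then be drawn with `plt.contourf(xx, yy, zz, alpha=0.3)`, and a fitted model's `predict` method could be passed in place of `toy_predict`.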