# BT2101 Introduction to Ensemble Learning

## 1 Goal

In this notebook, we will explore **Ensemble Learning** including:
* Bagging
* Random Forest
* AdaBoost
* Gradient Boosting

Using **Decision Trees** as base learners, you will:
* Use an open-source package (scikit-learn) to do ensemble learning
* Compare the performance of the different methods

In [None]:
# -*- coding:utf-8 -*-
# __future__ imports must come before any other statement in the cell
from __future__ import division
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from math import sqrt
%matplotlib inline

## 2 Ensemble Learning

### 2.1 Bias-Variance Decomposition

\begin{align}
E[(y-\hat{f}(x))^2] &= (E[\hat{f}(x)-f(x)])^2 + E[(\hat{f}(x)-E[\hat{f}(x)])^2] + \sigma^2  \\
&= \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \text{Irreducible Error}
\end{align}
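
The decomposition above can be checked numerically. The following is a minimal sketch, assuming a made-up true function $f(x)=\sin(2\pi x)$, Gaussian noise, and polynomial fits of different degrees (none of these come from the lecture). It measures the error against the noise-free $f(x)$, so the irreducible term $\sigma^2$ does not appear and the error should equal squared bias plus variance.

```python
import numpy as np

def bias_variance_at_points(degree, n_trials=200, n_train=30, noise_sd=0.3, seed=0):
    """Estimate squared bias and variance of a polynomial fit of a given degree.

    The true function, noise level and sample sizes are illustrative assumptions.
    """
    rng = np.random.RandomState(seed)
    f = lambda x: np.sin(2 * np.pi * x)          # assumed true function
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        # Draw a fresh noisy training set for every trial
        x = rng.uniform(0, 1, n_train)
        y = f(x) + rng.normal(0, noise_sd, n_train)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f(x_test)) ** 2)   # squared bias, averaged over x_test
    variance = np.mean(preds.var(axis=0))             # variance across training sets
    mse_vs_f = np.mean((preds - f(x_test)) ** 2)      # error against the noise-free target
    return bias_sq, variance, mse_vs_f

results = {degree: bias_variance_at_points(degree) for degree in (1, 3, 9)}
for degree in (1, 3, 9):
    b, v, m = results[degree]
    print("degree=%d  bias^2=%.4f  variance=%.4f  error vs f=%.4f" % (degree, b, v, m))
```

In this setup the low-degree fit should show the larger bias and the high-degree fit the larger variance, which is the tradeoff pictured in the chart below.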

A figure to illustrate **Bias** and **Variance**:
<img src="https://www.kdnuggets.com/wp-content/uploads/bias-and-variance.jpg" width="400">

A chart to understand the tradeoff of **Bias** and **Variance**:
<img src="http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png" width="500">

An example of the **overfitting** problem: <br/>
Which line is preferable, black or green?
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/1/19/Overfitting.svg/320px-Overfitting.svg.png" width="300">

Question: <br/>
In previous lectures, you learned several techniques for overcoming the overfitting problem. What are [they](https://elitedatascience.com/overfitting-in-machine-learning)?

### 2.2  Ensemble Learning
Ensemble learning methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone ([Wikipedia](https://en.wikipedia.org/wiki/Ensemble_learning)). 

Ensemble learning is one way to mitigate the overfitting problem, although ensemble methods can still overfit. From the lecture, the basic workflow of an ensemble method is:

![alt text](https://www.researchgate.net/profile/Nipaporn_Chanamarn/publication/308368870/figure/fig4/AS:408666126733315@1474445005322/The-concept-diagram-of-stacking-ensemble-learning-32.jpg "Ensemble Learning")

Examples of Ensemble Learning methods include:
<ol>
    <li>Bootstrap Aggregating (a.k.a. Bagging)</li>
    <li>Random Forest</li>
    <li>Boosting</li>    
</ol>
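
To see why combining models can beat any single constituent, consider a simplified thought experiment: a binary task where each of `n_models` classifiers is correct with the same probability and their errors are independent. Both conditions are assumptions made for illustration and rarely hold exactly in practice. The sketch below computes the probability that a majority vote is correct.

```python
from math import factorial

def n_choose_k(n, k):
    """Binomial coefficient, written out so the sketch has no version requirements."""
    return factorial(n) // (factorial(k) * factorial(n - k))

def majority_vote_accuracy(n_models, p_correct):
    """Probability that more than half of n independent voters are correct (n odd)."""
    needed = n_models // 2 + 1
    return sum(
        n_choose_k(n_models, k) * p_correct ** k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )

# Each hypothetical classifier is right 60% of the time
for n in (1, 5, 25, 101):
    print("%3d models -> majority vote accuracy %.4f" % (n, majority_vote_accuracy(n, 0.6)))
```

The vote improves with more models only because the errors are assumed independent. Bagging and Random Forest try to approximate this by training each model on different data or features.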


## 3 Examples

### 3.1 Case on Handwritten Digit Recognition

#### Dataset:

The **Kaggle** competition dataset can be obtained from https://www.kaggle.com/c/digit-recognizer/data. 

#### Overview:

MNIST ("Modified National Institute of Standards and Technology") is the de facto “hello world” dataset of computer vision. Since its release in 1999, this classic dataset of **handwritten images** has served as the basis for benchmarking classification algorithms. As new machine learning techniques emerge, MNIST remains a reliable resource for researchers and learners alike.

In this competition, your goal is to **correctly identify digits** from a dataset of tens of thousands of handwritten images. We’ve curated a set of tutorial-style kernels which cover everything from regression to neural networks. We encourage you to experiment with different algorithms to learn first-hand what works well and how techniques compare.

#### Acknowledgements:

More details about the dataset, including algorithms that have been tried on it and their levels of success, can be found at http://yann.lecun.com/exdb/mnist/index.html. The dataset is made available under a Creative Commons Attribution-Share Alike 3.0 license.

#### Attributes:

The data files train.csv and test.csv contain gray-scale images of hand-drawn digits, from zero through nine.

Each image is 28 pixels in height and 28 pixels in width, for a total of 784 pixels. Each pixel has a single pixel-value associated with it, indicating the lightness or darkness of that pixel, with higher numbers meaning darker. This pixel-value is an integer between 0 and 255, inclusive.

The training data set, (train.csv), has 785 columns. The first column, called "label", is the digit that was drawn by the user. The rest of the columns contain the pixel-values of the associated image.

Each pixel column in the training set has a name like pixelx, where x is an integer between 0 and 783, inclusive. To locate this pixel on the image, suppose that we have decomposed x as x = i * 28 + j, where i and j are integers between 0 and 27, inclusive. Then pixelx is located on row i and column j of a 28 x 28 matrix, (indexing by zero).
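
The index arithmetic above maps directly to integer division and remainder. A small sketch (the helper names `pixel_position` and `pixel_index` are made up for illustration):

```python
def pixel_position(x, width=28):
    """Return (row i, column j) for column 'pixelx', using x = i * width + j."""
    return divmod(x, width)

def pixel_index(i, j, width=28):
    """Inverse mapping: the x in 'pixelx' for row i and column j."""
    return i * width + j

print(pixel_position(0))     # (0, 0): top-left corner
print(pixel_position(30))    # (1, 2): second row, third column
print(pixel_position(783))   # (27, 27): bottom-right corner
```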

The test data set, (test.csv), is the same as the training set, except that it does not contain the "label" column.

The evaluation metric for this contest is the categorization accuracy, or the proportion of test images that are correctly classified. For example, a categorization accuracy of 0.97 indicates that you have correctly classified all but 3% of the images.
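
As a quick illustration of the metric, here is a hand-written version on ten made-up labels (the helper name and the data are hypothetical). The cells further down use scikit-learn's `accuracy_score` for the same purpose.

```python
def categorization_accuracy(y_true, y_pred):
    """Proportion of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / float(len(y_true))

y_true = [3, 7, 1, 0, 9, 4, 4, 2, 8, 5]
y_pred = [3, 7, 1, 0, 9, 4, 9, 2, 8, 5]   # one of ten digits is misclassified
print(categorization_accuracy(y_true, y_pred))   # 0.9
```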

In [None]:
# Load dataset: You need to download dataset first
%pwd
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')

In [None]:
# 42,000 images; each row holds 1 label plus 28*28 = 784 pixel values
train.shape

In [None]:
# What does an image look like
plt.imshow(np.array(train.iloc[1,1:]).reshape((28, 28)), cmap="gray")
plt.title("This digit is %d" % train.iloc[1,0])
plt.show()

### 3.2 Bagging

A basic procedure for **Bagging** is:

<img src="https://www.safaribooksonline.com/library/view/python-deeper-insights/9781787128576/graphics/3547_07_06.jpg" width="500">
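
The procedure can be sketched by hand: draw bootstrap samples, fit one tree per sample, and aggregate by majority vote. This is a minimal illustration and not how `BaggingClassifier` is implemented. It assumes scikit-learn's small bundled `load_digits` data (8x8 images) instead of the Kaggle files so that it runs quickly, and the helper names are made up.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_bagged_trees(X, y, n_estimators=25, seed=0):
    """Fit one decision tree per bootstrap sample (rows drawn with replacement)."""
    rng = np.random.RandomState(seed)
    n = X.shape[0]
    trees = []
    for _ in range(n_estimators):
        idx = rng.randint(0, n, size=n)          # bootstrap sample of row indices
        tree = DecisionTreeClassifier(random_state=rng.randint(0, 10**6))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Aggregate the individual tree predictions by majority vote."""
    votes = np.stack([tree.predict(X) for tree in trees]).astype(int)  # (n_trees, n_samples)
    return np.array([np.bincount(column).argmax() for column in votes.T])

digits = load_digits()
X_tr, X_va, y_tr, y_va = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

trees = fit_bagged_trees(X_tr, y_tr)
single_acc = np.mean(trees[0].predict(X_va) == y_va)
bagged_acc = np.mean(predict_majority(trees, X_va) == y_va)
print("single tree: %.4f, bagged trees: %.4f" % (single_acc, bagged_acc))
```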

In [None]:
# Bootstrap Aggregating Package
from sklearn.ensemble import BaggingClassifier
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

# Transform dataframe to array
train_feature = train.iloc[:,1:].values
train_target = train.iloc[:,0].values
# test.csv has no "label" column, so every column is a pixel feature
test_feature = test.values

# Split train data: 70% for model fit, 30% for validation
X_train, X_valid, y_train, y_valid = train_test_split(train_feature, train_target, test_size=0.3, random_state=0)

In [None]:
# Fit Bagging Model; the default base learner is a decision tree (Gini impurity, not entropy)
# Just wait a minute...It will take some time
BA = BaggingClassifier(n_estimators=100, random_state=0)
BA_model = BA.fit(X_train, y_train)
BA_model.classes_

In [None]:
# Validation
y_pred_valid = BA_model.predict(X_valid)

In [None]:
# Performance of Bagging model
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
cm = confusion_matrix(y_valid, y_pred_valid)
print(cm)
print("Accuracy: %s" % accuracy_score(y_valid, y_pred_valid))

The validation accuracy is 94.99%, which is reasonably good. <br/>

### 3.3 Random Forest

A basic procedure for **Random Forest** is:

<img src="https://i.ytimg.com/vi/ajTc5y3OqSQ/hqdefault.jpg" width="500">
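
Random Forest adds one ingredient to Bagging: each split considers only a random subset of the features. The sketch below illustrates this by bagging trees restricted to `max_features="sqrt"` and comparing them with `RandomForestClassifier`. It uses the small bundled `load_digits` data as a stand-in for the Kaggle files, and the two ensembles are not expected to give identical numbers.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
X_tr, X_va, y_tr, y_va = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

# Bagged trees whose splits only look at sqrt(n_features) randomly chosen features.
# The base learner is passed positionally because its keyword name differs
# between scikit-learn versions.
bagged_random_trees = BaggingClassifier(
    DecisionTreeClassifier(max_features="sqrt", random_state=0),
    n_estimators=50, random_state=0)

forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0)

acc_bagged = bagged_random_trees.fit(X_tr, y_tr).score(X_va, y_va)
acc_forest = forest.fit(X_tr, y_tr).score(X_va, y_va)
print("bagged random-feature trees: %.4f" % acc_bagged)
print("random forest:               %.4f" % acc_forest)
```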

In [None]:
# Random Forest package
from sklearn.ensemble import RandomForestClassifier
# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split

# Transform dataframe to array
train_feature = train.iloc[:,1:].values
train_target = train.iloc[:,0].values
# test.csv has no "label" column, so every column is a pixel feature
test_feature = test.values

# Split train data: 70% for model fit, 30% for validation
X_train, X_valid, y_train, y_valid = train_test_split(train_feature, train_target, test_size=0.3, random_state=0)

In [None]:
# Fit Random Forest Model; Binary Splitting using Entropy
RF = RandomForestClassifier(criterion='entropy', n_estimators=100, random_state=0)
RF_model = RF.fit(X_train, y_train)
RF_model.classes_

In [None]:
# Validation
y_pred_valid = RF_model.predict(X_valid)

In [None]:
# Performance of Random Forest model
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
cm = confusion_matrix(y_valid, y_pred_valid)
print(cm)
print("Accuracy: %s" % accuracy_score(y_valid, y_pred_valid))

The validation accuracy is 95.95%, slightly higher than the 94.99% obtained with Bagging. <br/>

## 4 Boosting

A typical procedure for a boosting method is:

<img src="https://koalaverse.github.io/machine-learning-in-R/images/boosting_diagram.png" width="500">

Examples of boosting methods include:
1. Boosting based on weights: Adaptive boosting (AdaBoost)
2. Boosting based on residuals: Gradient boosting decision trees (GBDT)

### 4.1 AdaBoost

The algorithm for **AdaBoost** is:

<img src="https://koalaverse.github.io/machine-learning-in-R/images/adaboost_m1.png" width="700">
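
Assuming the figure shows the standard AdaBoost.M1 procedure, its steps can be sketched by hand with decision stumps as weak learners. This is an illustration only, not scikit-learn's `AdaBoostClassifier`. It assumes a binary problem with labels coded as -1/+1 and uses the bundled breast cancer data rather than iris, because AdaBoost.M1 in this form handles two classes. The helper names are made up.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fit_adaboost_m1(X, y, n_rounds=50):
    """AdaBoost.M1 with decision stumps; y must be coded as -1 / +1."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                      # start with uniform observation weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)
        miss = stump.predict(X) != y             # indicator of misclassified rows
        err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)          # vote weight: larger for more accurate stumps
        w = w * np.exp(alpha * miss)             # up-weight the misclassified rows
        w = w / w.sum()                          # rescale for numerical stability
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def predict_adaboost(stumps, alphas, X):
    """Sign of the alpha-weighted sum of stump predictions."""
    scores = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(scores >= 0, 1, -1)

data = load_breast_cancer()
y_pm = np.where(data.target == 1, 1, -1)         # recode labels 0/1 as -1/+1
X_tr, X_va, y_tr, y_va = train_test_split(data.data, y_pm, test_size=0.3, random_state=0)

stumps, alphas = fit_adaboost_m1(X_tr, y_tr)
stump_train_acc = np.mean(stumps[0].predict(X_tr) == y_tr)
boosted_train_acc = np.mean(predict_adaboost(stumps, alphas, X_tr) == y_tr)
boosted_valid_acc = np.mean(predict_adaboost(stumps, alphas, X_va) == y_va)
print("first stump (train): %.4f" % stump_train_acc)
print("boosted (train): %.4f, boosted (validation): %.4f" % (boosted_train_acc, boosted_valid_acc))
```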

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris() 

# Define AdaBoost model with 100 weak learners (it is fitted inside cross_val_score)
Ada = AdaBoostClassifier(n_estimators=100)

# Cross validation
scores = cross_val_score(Ada, iris.data, iris.target, cv=10) 
scores.mean()                             

The average cross-validation accuracy is 94.67%.

### 4.2 Gradient Boosting

The algorithm for **Gradient Boosting** is:

<img src="https://koalaverse.github.io/machine-learning-in-R/images/friedman_gbm.png" width="700">
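
The core idea, fitting each new tree to the residuals of the current ensemble, is easiest to see for regression with squared-error loss, where the negative gradient is exactly the residual. The sketch below covers that special case on synthetic data. It is a simplification of the general algorithm and differs from the classifier used in the next cell, and the helper names and data are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_rounds=100, learning_rate=0.1, max_depth=1):
    """Gradient boosting for squared-error loss: each tree fits the current residuals."""
    init = y.mean()                               # best constant model under squared error
    pred = np.full(y.shape[0], init)
    trees = []
    train_mse = [np.mean((y - pred) ** 2)]
    for _ in range(n_rounds):
        residuals = y - pred                      # negative gradient of 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)   # shrunken update of the ensemble
        trees.append(tree)
        train_mse.append(np.mean((y - pred) ** 2))
    return init, trees, train_mse

def predict_gradient_boosting(init, trees, X, learning_rate=0.1):
    """Initial constant plus the shrunken sum of all tree predictions."""
    return init + learning_rate * sum(tree.predict(X) for tree in trees)

# Synthetic one-dimensional regression problem (illustrative assumption)
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=200)

init, trees, train_mse = fit_gradient_boosting(X, y)
print("training MSE: start %.4f, end %.4f" % (train_mse[0], train_mse[-1]))
```

The `learning_rate` passed to `predict_gradient_boosting` has to match the one used for fitting. A smaller value slows down learning and usually needs more rounds.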

In [None]:
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# Load dataset
iris = load_iris() 

# Define Gradient Boosting model with 100 depth-1 trees (it is fitted inside cross_val_score)
GB = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

# Cross validation
scores = cross_val_score(GB, iris.data, iris.target, cv=10) 
scores.mean()    

The average cross-validation accuracy is 95.33%.

More about Ensemble Learning can be found at http://scikit-learn.org/stable/modules/ensemble.html.

## 5 References

[1] Chris Albon. (2018). Machine Learning with Python Cookbook. O'Reilly.