<img src="https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png" style="float: left; margin: 15px;">
### Ensemble Methods: Random Forests, Boosting

**Week 5 | Lesson 3.2**

---
| TIMING | TYPE |
|:-:|---|
| 25 min | [Review: Decision Trees](#review) |
| 10 min | [Kaggle Winners](#hook) |
| 45 min | [Random Forests, Boosting (AdaBoost, GBTs)](#content) |
| 20 min | [Conclusion](#conclusion) |
| 5 min | [Additional Resources](#more) |

---

### Lesson Objectives
*After this lesson, you will be able to:*

- Explain what a Random Forest is and how it differs from bagging of decision trees
- Explain what Extra Trees models are
- Apply both techniques for supervised machine learning
- Describe Boosting and how it differs from Bagging
- Apply AdaBoost and Gradient Boosting to supervised machine learning problems
---
### Student Pre-Work 

*Before this lesson, you should already be able to:*

- Explain how Decision Trees work
- Describe the bias-variance trade-off

---
<a name="review"></a>
### Review: Decision Trees, Bagging 

| Algorithm | Assumptions | Class/Reg/Both | Loss | Regularization | Parameters | Hyperparameters | Metrics | Intuition | Implementation |
|:-:|---|---|---|---|---|---|---|---|---|
|Decision Trees| Partition the dataset into regions | Both | Regression: Minimize Variance in splits; Classification: Gini, Entropy | None | None |<a href=http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html> Parameters in the SKlearn Documentation </a> | Relevant Metrics for Classification and Regression |<a href=https://www.youtube.com/watch?v=eKD5gxPPeY0> Decision Tree for Applied ML </a>| <a href=http://gabrielelanaro.github.io/blog/2016/03/03/decision-trees.html> Decision Tree Implementation using Numpy </a>|

> **Check:** What are the main advantages of decision trees?

---
> **Check:** Describe how the bagging algorithm works. What is the difference between bootstrapping and bagging?

---
<a name="hook"> </a>

### Applications: Algorithms of Champions 
<a href=http://blog.kaggle.com/2016/11/03/red-hat-business-value-competition-1st-place-winners-interview-darius-barusauskas/> Winners Solution: Predicting Customer Business Value </a>

<a href=http://blog.kaggle.com/2017/01/12/santander-product-recommendation-competition-2nd-place-winners-solution-write-up-tom-van-de-wiele/> Winners Solution: Product Recommendations </a>

<a href=http://blog.kaggle.com/> Blog for Looking Through Past Solutions </a>

<img src=http://www.kdnuggets.com/wp-content/uploads/kaggle.jpg>

---
### Random Forests, Extra Trees, AdaBoost, and GBTs 
<a name="content"></a>

<img src=https://image.slidesharecdn.com/slides-141010042109-conversion-gate01/95/understanding-random-forests-from-theory-to-practice-16-638.jpg?cb=1412914915> 

#### Random Forests
Another way of looking at it...<a href=https://www.coursera.org/learn/practical-machine-learning/lecture/XKsl6/random-forests> Coursera RF </a>

Decision trees (DTs) are powerful, but they have limitations. In particular, trees that are **grown very deep** tend to learn highly irregular patterns: they **overfit their training sets.** Bagging helps mitigate this problem by exposing different trees to different sub-samples of the whole training set.

Random forests are a further way of averaging multiple deep decision trees, trained on different parts of the same training set, with the goal of reducing the variance. This comes at the expense of a small increase in the bias and some loss of interpretability, but generally greatly boosts the performance of the final model.


Random Forests are among the most widely used classification and regression models. They are relatively simple to use because they have few hyperparameters to tune, and they often perform well without much tuning.

> **Check:** Describe how the bagging algorithm works:


_Random forests_ differ from bagged decision trees in only one way: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a **random subset of the features**. This process is sometimes called "feature bagging". The reason for doing this is the correlation between trees in ordinary bagging: if one or a few features are very strong predictors of the response variable (target output), these features will be selected in many of the bagged trees, causing the trees to become correlated. By selecting a random subset of the features at each split, we counter this correlation between base trees, strengthening the overall model.
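A standard way to see why correlated trees limit the benefit of averaging is the variance of a mean of correlated predictions. The symbols below are assumptions introduced for illustration, not part of the lesson: $B$ trees $T_b$, each with prediction variance $\sigma^2$ and pairwise correlation $\rho$.

$$
\mathrm{Var}\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right) = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
$$

Adding more trees shrinks only the second term. The first term stays at $\rho\,\sigma^2$ no matter how large $B$ is, so lowering $\rho$ (which is what feature subsampling aims to do) is the way to reduce it.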

Typically, for a classification problem with $p$ features, $\sqrt{p}$ (rounded down) features are used in each split. For regression problems the inventors recommend $p/3$ (rounded down) with a minimum node size of 5 as the default.
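The rule of thumb above can be sketched in a few lines. The function names are hypothetical, and the floor of one feature is an added safeguard for very small $p$, not something the lesson specifies.

```python
import math
import random


def features_per_split(p, task):
    """Rule-of-thumb number of candidate features examined at each split."""
    if task == "classification":
        # floor(sqrt(p)); the max(1, ...) guard is an assumption for tiny p
        return max(1, math.isqrt(p))
    if task == "regression":
        # floor(p / 3), again with a floor of one feature
        return max(1, p // 3)
    raise ValueError("task must be 'classification' or 'regression'")


def sample_candidate_features(p, task, rng):
    """Draw the random feature subset considered at a single split."""
    m = features_per_split(p, task)
    return sorted(rng.sample(range(p), m))


rng = random.Random(0)
print(features_per_split(16, "classification"))  # 4
print(features_per_split(16, "regression"))      # 5
print(sample_candidate_features(16, "classification", rng))
```

A fresh subset is drawn at every split, so two splits in the same tree usually see different candidate features.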



#### Extremely Randomized Trees
Adding one further step of randomization yields extremely randomized trees, or _ExtraTrees_. Like an ordinary random forest, they consider a random subset of the features at each split, but an additional layer of randomness is introduced. Instead of computing the locally optimal split point for each feature (based on, e.g., information gain or the Gini impurity), **for each feature under consideration, a random value is selected for the split**, and the best of these randomly generated splits is used. The value is drawn from the feature's empirical range in the tree's training set. In other words, the top-down splitting in the tree learner is randomized. Note that the original ExtraTrees formulation trains each tree on the whole training set rather than on a bootstrap sample, and scikit-learn's implementation defaults to `bootstrap=False`.
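The random split step can be sketched as follows. This is a minimal illustration with hypothetical names and made-up toy data. It only draws the thresholds and does not score them or build a tree.

```python
import random


def random_split_candidates(rows, feature_indices, rng):
    """For each candidate feature, draw one threshold uniformly from its observed range."""
    candidates = {}
    for j in feature_indices:
        values = [row[j] for row in rows]
        low, high = min(values), max(values)
        # A random forest would search for the best threshold here;
        # this sketch draws one at random instead.
        candidates[j] = rng.uniform(low, high)
    return candidates


# Toy training rows with two numeric features (illustrative values only)
rows = [
    (2.0, 10.0),
    (3.5, 40.0),
    (5.0, 25.0),
    (9.0, 5.0),
]

rng = random.Random(42)
thresholds = random_split_candidates(rows, feature_indices=[0, 1], rng=rng)
for j, threshold in thresholds.items():
    print(f"feature {j}: split at {threshold:.3f}")
```

Skipping the threshold search makes each split cheaper to compute and adds randomness, which tends to decorrelate the trees further.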


## Guided Practice: Random Forest and ExtraTrees in Scikit Learn (20 min)

Scikit Learn implements both random forest and extra trees methods as part of the `ensemble` module.

Have a look at the documentation here: <a href= http://scikit-learn.org/stable/modules/ensemble.html#forest> Hyperparameters of RF </a>

> **Check:** What main hyperparameters do you notice?


Let's load the [car dataset](https://archive.ics.uci.edu/ml/machine-learning-databases/car/), which we are familiar with by now.

In [None]:
import pandas as pd

# Load the car data (assumes car.csv is in the working directory)
df = pd.read_csv('car.csv')
print(df.head())

print(df.describe())

### Your turn now:

Initialize the following models and check their performance:
- Bagging + Decision Trees
- Random Forest
- Extra Trees

You can also create a function to speed up your work...
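One possible shape for such a helper is sketched below. It runs on synthetic data from `make_classification` rather than the car dataset, so applying it to the encoded car features is still left to you. The helper name, the number of estimators, and the 5-fold setting are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    BaggingClassifier,
    ExtraTreesClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def evaluate_model(model, X, y, cv=5):
    """Return the mean and standard deviation of cross-validated accuracy."""
    scores = cross_val_score(model, X, y, cv=cv)
    return scores.mean(), scores.std()


# Synthetic stand-in data (an assumption; swap in the encoded car features)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "Bagging + Decision Trees": BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50, random_state=0
    ),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

results = {name: evaluate_model(model, X, y) for name, model in models.items()}
for name, (mean, std) in results.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```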



## Boosting

With bagging and random forests we train models on separate subsets and then combine their predictions. In a sense we are parallelizing the training and then combining the results (like a map-reduce...).



_Boosting_ is a different ensembling technique that is sequential.

Boosting is an iterative procedure that adaptively changes the sampling distribution of training records at each iteration, in order to correct the errors of the previous iteration of models. The first iteration uses uniform weights (like bagging) for all samples. In subsequent iterations, the weights are adjusted to emphasize records that were misclassified in previous iterations. The final prediction is constructed by a weighted vote, where the weight of each base classifier depends on its training error.

Because the base classifiers focus more and more closely on records that are difficult to classify as the iterations progress, they face progressively more difficult learning problems.

Boosting takes a base _weak_ learner and tries to make it a _strong_ learner by re-training it on the misclassified samples.



There are several boosting algorithms. In particular, we will cover AdaBoost and gradient boosting, which scikit-learn implements as `AdaBoostClassifier` and `GradientBoostingClassifier`.


### AdaBoost

`AdaBoost` refers to a particular method of training a boosted classifier. A boosted classifier is a classifier of the form
$$
F_T(x) = \sum_{t=1}^T f_t(x)
$$
where each $f_t$ is a weak learner that takes an object $x$ as input and returns a real-valued result indicating the class of the object.
<img src=https://www.analyticsvidhya.com/wp-content/uploads/2015/11/bigd.png>


<img src=http://i.stack.imgur.com/5b2VM.png>


Each weak learner produces an output hypothesis $h(x_i)$ for each sample in the training set. At each iteration $t$, a weak learner is selected and assigned a coefficient $\alpha_t$ such that the total training error $E_t$ of the resulting $t$-stage boosted classifier is minimized:

$$
E_t = \sum_i E[F_{t-1}(x_i) + \alpha_t h(x_i)]
$$

Here $F_{t-1}(x)$ is the boosted classifier that has been built up to the previous stage of training, $E(F)$ is some error function, and $f_t(x) = \alpha_t h(x)$ is the weak learner that is being considered for addition to the final classifier.

At each iteration of the training process, a weight is assigned to each sample in the training set equal to the current error $E(F_{t-1}(x_i))$ on that sample. These weights can be used to inform the training of the weak learner. For instance, decision trees can be grown that favor splitting sets of samples with high weights.


- Fit an additive model (ensemble) in a forward stage-wise manner.
- In each stage, introduce a weak learner to compensate for the shortcomings of the existing weak learners.
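The stage-wise procedure can be sketched in plain Python. This is a minimal illustration, not a reference implementation. It assumes labels in $\{-1, +1\}$, the exponential loss as the error function (which gives $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$ for weighted error $\epsilon_t$), and one-dimensional threshold stumps as weak learners. All function names and the toy data are hypothetical.

```python
import math


def stump_predict(x, threshold, polarity):
    """Weak learner h(x): predict `polarity` above the threshold, `-polarity` otherwise."""
    return polarity if x > threshold else -polarity


def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the lowest-error stump."""
    values = sorted(set(xs))
    # Candidate thresholds: one below all points, plus midpoints between neighbours
    thresholds = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost_fit(xs, ys, n_rounds):
    """Build the ensemble F_T(x) = sum_t alpha_t * h_t(x) one stage at a time."""
    n = len(xs)
    weights = [1.0 / n] * n  # uniform weights in the first iteration
    ensemble = []
    for _ in range(n_rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest
        weights = [
            w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def adaboost_predict(ensemble, x):
    """Sign of the weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, thr, pol) for alpha, thr, pol in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data that no single threshold can separate
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

ensemble = adaboost_fit(xs, ys, n_rounds=3)
print([adaboost_predict(ensemble, x) for x in xs])
```

On this toy data a single stump misclassifies at least three points, while the three-stage ensemble classifies every training point correctly.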

### Gradient Boosting Trees

<a href=http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html> Gradient Boosted Trees </a>

<a href=http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf> A Gentle Introduction to GBTs </a>

Gradient Boosting is a generalization of boosting to arbitrary differentiable loss functions. Gradient Boosted Regression Trees (GBRT) are an accurate and effective off-the-shelf procedure that can be used for both regression and classification problems. Gradient tree boosting models are used in a variety of areas, including web search ranking and ecology.

The advantages of GBRT are:
- Natural handling of data of mixed type (= heterogeneous features)
- Predictive power
- Robustness to outliers in output space (via robust loss functions)

The disadvantages of GBRT are:
- Scalability: because boosting is sequential, it is hard to parallelize.
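To make the idea concrete, here is a minimal sketch of gradient boosting for the squared-error loss, where the negative gradient is simply the residual $y - F(x)$, so each stage fits a small tree to the current residuals. It uses one-dimensional regression stumps instead of full trees. The function names, toy data, and learning rate are assumptions for illustration.

```python
def fit_regression_stump(xs, targets):
    """Least-squares stump: one threshold with a constant prediction on each side."""
    values = sorted(set(xs))
    best = None
    for a, b in zip(values, values[1:]):
        threshold = (a + b) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def regression_stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def gradient_boost_fit(xs, ys, n_rounds, learning_rate):
    """Stage-wise fit: each stump is trained on the residuals of the current model."""
    base = sum(ys) / len(ys)  # initial constant model
    predictions = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * regression_stump_predict(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return base, learning_rate, stumps


def gradient_boost_predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(regression_stump_predict(s, x) for s in stumps)


def mean_squared_error(model, xs, ys):
    return sum((y - gradient_boost_predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy regression data (illustrative values only)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

for rounds in (0, 1, 50):
    model = gradient_boost_fit(xs, ys, n_rounds=rounds, learning_rate=0.5)
    print(rounds, "rounds, training MSE:", round(mean_squared_error(model, xs, ys), 4))
```

Each stage depends on the predictions of all earlier stages, which is why the stages themselves cannot be trained in parallel.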


### Introduction to XGBoost: Content from Jason Brownlee's Machine Learning Mastery

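<a href=https://s3.amazonaws.com/MLMastery/xgboost_with_python_mini_course.pdf?__s=mudfzqhho7h4jzuswfzy> Introduction to XGBoost </a>

XGBoost is an implementation of gradient boosted trees engineered for speed and scale. Its system features include:

- **Parallelization** of tree construction using all of your CPU cores during training.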
- **Distributed Computing** for training very large models using a cluster of machines.
- **Out-of-Core Computing** for very large datasets that don’t fit into memory.
- **Cache Optimization** of data structures and algorithms to make best use of hardware.

In [16]:
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

In [7]:
# load data 
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

In [17]:
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
labels = LabelEncoder()
label_encoded_y = labels.fit_transform(Y)

In [12]:
# create a model
model = XGBClassifier()

In [9]:
n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators, learning_rate=learning_rate)

In [18]:
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)

grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold,
                           verbose=1)

result = grid_search.fit(X, label_encoded_y)

Fitting 10 folds for each of 96 candidates, totalling 960 fits


[Parallel(n_jobs=-1)]: Done  59 tasks      | elapsed:    1.8s
[Parallel(n_jobs=-1)]: Done 359 tasks      | elapsed:   15.5s
[Parallel(n_jobs=-1)]: Done 859 tasks      | elapsed:   32.4s
[Parallel(n_jobs=-1)]: Done 953 out of 960 | elapsed:   35.8s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done 960 out of 960 | elapsed:   36.1s finished


In [22]:
# summarize results
print("Best: %f using %s" % (result.best_score_, result.best_params_))

means = result.cv_results_['mean_test_score']
stds = result.cv_results_['std_test_score']
params = result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: -0.470744 using {'n_estimators': 100, 'learning_rate': 0.1, 'max_depth': 2}
-0.691621 (0.000103) with: {'n_estimators': 50, 'learning_rate': 0.0001, 'max_depth': 2}
-0.690097 (0.000195) with: {'n_estimators': 100, 'learning_rate': 0.0001, 'max_depth': 2}
-0.688588 (0.000276) with: {'n_estimators': 150, 'learning_rate': 0.0001, 'max_depth': 2}
-0.687101 (0.000342) with: {'n_estimators': 200, 'learning_rate': 0.0001, 'max_depth': 2}
-0.691340 (0.000146) with: {'n_estimators': 50, 'learning_rate': 0.0001, 'max_depth': 4}
-0.689552 (0.000291) with: {'n_estimators': 100, 'learning_rate': 0.0001, 'max_depth': 4}
-0.687774 (0.000428) with: {'n_estimators': 150, 'learning_rate': 0.0001, 'max_depth': 4}
-0.686008 (0.000568) with: {'n_estimators': 200, 'learning_rate': 0.0001, 'max_depth': 4}
-0.691188 (0.000208) with: {'n_estimators': 50, 'learning_rate': 0.0001, 'max_depth': 6}
-0.689255 (0.000422) with: {'n_estimators': 100, 'learning_rate': 0.0001, 'max_depth': 6}
-0.687360 (0.000641) 

---
### Conclusion
<a name="conclusion"></a>

In this class we learned about Random Forests, Extremely Randomized Trees, and Boosting. They are different ways of combining many base learners to improve on the performance of a single one.

No single method performs best in every case. For example, Decision Trees are more nimble and easier to communicate, but have a tendency to overfit. Ensemble methods, on the other hand, perform better in more complex scenarios, but may become very complicated and harder to explain.
Have a look [here](https://www.wise.io/resources) for a couple of examples from real world startup Wise.io.

> **Check:** What limitations might these methods have?

---
<a name="more"></a>
### Additional Resources

- **[Two Cultures of Statistics](https://projecteuclid.org/euclid.ss/1009213726)**
- [Original Random Forest Paper](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf) 
- [Random Forest on wikipedia](https://en.wikipedia.org/wiki/Random_forest)
- [Quora question on Random Forest](https://www.quora.com/How-does-randomization-in-a-random-forest-work?redirected_qid=212859)
- [Scikit Learn Ensemble Methods](http://scikit-learn.org/stable/modules/ensemble.html)
- [Scikit Learn Random Forest Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)
- [Introduction to XGBoost](https://s3.amazonaws.com/MLMastery/xgboost_with_python_mini_course.pdf?__s=mudfzqhho7h4jzuswfzy)
- [Gradient Boosted Trees](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
- [A Gentle Introduction to GBTs](http://www.ccs.neu.edu/home/vip/teach/MLcourse/4_boosting/slides/gradient_boosting.pdf)

