# Entry 54 - Gradient Boost

An ensemble learning method that combines multiple Decision Trees, where each tree attempts to correct the mistakes of the previous tree.

## Learning Style

<table align='left'>
    <tr>
        <th>Supervision</th>
        <th>Prediction types</th>
    </tr>
    <tr>
        <td>Supervised</td>
        <td>Regression</td>
    </tr>
    <tr>
        <td></td>
        <td>Classification</td>
    </tr>
</table>

## Description

Gradient Boost trains Decision Trees sequentially. Each new tree is fit to the errors (residuals) left by the trees before it, rather than to the original targets. By default, there is no randomness in gradient boosted trees.
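
As a rough sketch of the sequential idea, the snippet below builds a three-tree ensemble by hand. It assumes a regression task with squared error and no learning rate shrinkage. The synthetic data and the `ensemble_predict` helper are made up for illustration and are not taken from either book.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy quadratic data (an assumption for illustration only).
rng = np.random.RandomState(42)
X = rng.rand(100, 1) - 0.5
y = 3 * X[:, 0] ** 2 + 0.05 * rng.randn(100)

# The first tree fits the targets directly.
tree1 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, y)

# Each later tree fits the residual errors left by the trees before it.
residuals1 = y - tree1.predict(X)
tree2 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals1)

residuals2 = residuals1 - tree2.predict(X)
tree3 = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X, residuals2)


def ensemble_predict(X_new, trees):
    """Hypothetical helper: sum the predictions of every tree in the list."""
    return sum(tree.predict(X_new) for tree in trees)


# Training error with one tree versus all three trees.
mse_one = np.mean((y - ensemble_predict(X, [tree1])) ** 2)
mse_three = np.mean((y - ensemble_predict(X, [tree1, tree2, tree3])) ** 2)
print(mse_one, mse_three)
```

Adding trees lowers the training error here, which is also why too many trees can overfit.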

## Purpose

*Introduction to Machine Learning with Python* page 93: 

> As both gradient boosting and random forests perform well on similar kinds of data, a common approach is to first try random forests, which work quite robustly. If random forests work well but prediction time is at a premium, or it is important to squeeze out the last percentage of accuracy from the machine learning model, moving to gradient boosting often helps.

## Behavior

Often the maximum depth or maximum number of nodes is set very low; *Introduction to Machine Learning with Python* states on page 94 that it is often not deeper than five splits. The implication on page 90 is that strong pre-pruning is used in lieu of randomness. The small size of the trees has the added benefit of using less memory and making predictions faster.

The small size of the trees also means that the trees that comprise Gradient Boost are *weak learners* (see the "Weak and Strong Learners" section of [Entry 51]() for more details). Each tree only performs well on a specific subset of the data, but when many trees are added together, the overall prediction generalizes well across the dataset.

## Parameters

- `learning_rate`: controls how strongly the subsequent model attempts to correct the mistakes of the previous tree. Per page 205 of [Hands-On Machine Learning](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291):
  - Lower values (such as 0.1) require more trees in the ensemble to fit the data
  - Low values combined with more trees usually produce ensembles that generalize better
  
- `n_estimators`: the number of trees to build
  - a high number of trees/`n_estimators` can lead to overfit models
  - more trees are needed when the `learning_rate` is low to build models of similar complexity
- `max_depth`: sets the maximum depth of each Decision Tree
- `max_leaf_nodes`: the maximum number of leaf nodes allowed in each individual Decision Tree
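
Here is a minimal sketch of how these parameters are passed to scikit-learn's `GradientBoostingRegressor`. The dataset from `make_regression` and the specific values are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

gbrt = GradientBoostingRegressor(
    learning_rate=0.1,    # lower values need more trees to fit the data
    n_estimators=200,     # number of trees to build
    max_depth=3,          # keep each tree shallow (a weak learner)
    max_leaf_nodes=None,  # optionally cap the leaves per tree instead
    random_state=42,
)
gbrt.fit(X_train, y_train)

# R^2 on the held-out validation set.
print(gbrt.score(X_val, y_val))
```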

The ensemble is sensitive to changes in the parameters, as can be seen in this illustration from the [Hands-On Machine Learning Chapter 7 GitHub page](https://github.com/ageron/handson-ml2/blob/master/07_ensemble_learning_and_random_forests.ipynb) with two very different sets of parameters for the learning rate and number of estimators:

<img src='images/under_overfit_ensembles.png'>

*Hands-On Machine Learning* suggests two different ways to find the best number of trees.

The first technique uses the `staged_predict` method, which returns an iterator over the predictions the ensemble makes at each stage of training (with one tree, two trees, and so on). You can then make a chart that shows the validation error after each tree, allowing you to find the minimum (chart from the [Hands-On Machine Learning Chapter 7 GitHub page](https://github.com/ageron/handson-ml2/blob/master/07_ensemble_learning_and_random_forests.ipynb); code available on Aurelien's GitHub page or the Jupyter Notebook accompanying this entry):

<img src='images/ensemble_trees_chart.png'>

This technique is a good one to remember because it translates to finding optimal hyperparameters for other algorithms.
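
The `staged_predict` approach could look roughly like the sketch below. The synthetic data, the tree settings, and the variable names are assumptions for illustration; the book's own code is in the notebook linked above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Train a generous number of trees once.
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)

# staged_predict yields validation predictions after 1 tree, 2 trees, ...
errors = [
    mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)
]

# The index of the smallest error is zero-based, so add 1 for the tree count.
best_n_estimators = int(np.argmin(errors)) + 1

# Retrain with only the best number of trees.
gbrt_best = GradientBoostingRegressor(
    max_depth=2, n_estimators=best_n_estimators, random_state=42
)
gbrt_best.fit(X_train, y_train)
print(best_n_estimators)
```

Plotting `errors` against the tree count gives a chart like the one above.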

The second technique uses the `warm_start=True` parameter, which keeps the existing trees when `fit` is called again, allowing incremental training. From there you can use a for loop and a variable to keep track of reductions in error, then stop training when the validation metric doesn't improve for a set number of iterations (*Hands-On Machine Learning* uses 5). The code for this technique is available on Aurelien's [Chapter 7 GitHub page](https://github.com/ageron/handson-ml2/blob/master/07_ensemble_learning_and_random_forests.ipynb) or the Jupyter Notebook accompanying this entry.
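
A hedged sketch of the `warm_start` loop follows. The data, the `patience` variable name, and the cap of 120 trees are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# warm_start=True keeps the trees already built when fit is called again.
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)

min_val_error = float("inf")
rounds_without_improvement = 0
patience = 5  # stop after this many rounds with no improvement

for n_estimators in range(1, 121):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)  # only the newly requested tree is trained
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        rounds_without_improvement = 0
    else:
        rounds_without_improvement += 1
        if rounds_without_improvement == patience:
            break  # early stopping

print(gbrt.n_estimators, min_val_error)
```

Note that when the loop stops early, the model still contains the last few trees that did not improve the validation error.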

## Strengths

- Can provide better accuracy than Random Forests with the right parameter settings
- Minimal preprocessing required (for example, features don't need to be centered or scaled - see [Entry 8](https://julielinx.github.io/blog/08_center_scale_and_latex/))

## Limitations

- Can be computationally expensive and time consuming to train
- Trees are trained sequentially, so training can't be parallelized across trees
- More sensitive to parameter settings than Random Forests - careful tuning required
- I imagine the learning rate presents the same challenges as the learning rate discussed for gradient descent in [Entry 37b](https://julielinx.github.io/blog/37b_regression_gradient_descent/)
- Doesn't usually work well on high-dimensional sparse data

## Variations

### Stochastic Gradient Boosting

A random subsample of the training data can be used for training each tree using the `subsample` hyperparameter. Here's what *Hands-On Machine Learning* has to say on page 207:

> For example, if `subsample` = 0.25, then each tree is trained on 25% of the training instances, selected randomly. As you can probably guess by now, this technique trades a higher bias for a lower variance.

Since a much smaller amount of data is used, training time is reduced considerably.
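
As a small sketch, stochastic gradient boosting only requires setting `subsample` below 1.0. The data and the comparison against a full-sample model are assumptions for illustration, and the relative scores will vary by dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data (an assumption for illustration only).
X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# Default behavior: every tree sees all training instances.
gbrt_full = GradientBoostingRegressor(n_estimators=100, random_state=42)
gbrt_full.fit(X_train, y_train)

# Stochastic version: each tree is trained on a random 25% of the instances.
gbrt_stochastic = GradientBoostingRegressor(
    n_estimators=100, subsample=0.25, random_state=42
)
gbrt_stochastic.fit(X_train, y_train)

# Compare R^2 on the validation set.
print(gbrt_full.score(X_val, y_val), gbrt_stochastic.score(X_val, y_val))
```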

## Up Next

AdaBoost

## Resources

- [Hands-On Machine Learning with Scikit-Learn & TensorFlow](https://www.amazon.com/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291)
- [Introduction to Machine Learning with Python](https://www.amazon.com/Introduction-Machine-Learning-Python-Scientists/dp/1449369413)