In [1]:
from IPython.display import display, HTML
# Widen the notebook container to the full browser width
display(HTML("<style>.container { width:100% !important; }</style>"))

# Gradient Boosted Trees versus Random Forests

Individual decision trees tend to overfit (high variance, low bias).
Ensemble methods, like Random Forests and Gradient Boosted Trees, attempt to correct this.

## Differences between RFs and GBTs:


### Random Forests
* Predictions of each tree are roughly independent. Each tree has a "mind of its own."
  * How do we make trees independent of each other?
    * Bootstrap samples randomly select observations (with replacement) for each tree.
    * A random subset of features is considered at each node.
* Uses bushy (deep) trees

### Gradient Boosted Trees
* Predictions of each tree are not independent. Trees are grown sequentially. 
  * Instead of simply averaging trees as in a random forest, you are taking something like a weighted sum of trees, wherein each new tree is fit to the errors left by the trees before it, so the observations that were predicted worst get the most attention.
* Uses stumpy (shallow) trees
  * Limited depth is what makes learners weak.

## Why Gradient Boosting can Outperform Random Forests


A Random Forest model is the 'average' of a collection of deep decision trees.
* Each tree is low bias, high variance
* Variance **mostly** 'drops out' when all the trees are averaged
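
As a rough illustration of why averaging helps, the sketch below swaps each deep tree for a hypothetical unbiased but noisy predictor (`noisy_tree_prediction`, with made-up values `TRUE_VALUE` and `NOISE_SD`). It assumes the predictions are fully independent, which real forests only approximate; that is why the variance only *mostly* drops out in practice.

```python
import random
import statistics

random.seed(0)

TRUE_VALUE = 10.0  # hypothetical quantity every "tree" tries to predict
NOISE_SD = 2.0     # spread of a single tree's prediction (high variance)


def noisy_tree_prediction():
    """Stand-in for one deep tree: unbiased on average, but noisy."""
    return random.gauss(TRUE_VALUE, NOISE_SD)


def forest_prediction(n_trees):
    """Average the predictions of n_trees independent stand-in trees."""
    return statistics.fmean([noisy_tree_prediction() for _ in range(n_trees)])


# Repeat each experiment many times to estimate the variance of the prediction.
single = [noisy_tree_prediction() for _ in range(2000)]
forest = [forest_prediction(100) for _ in range(2000)]

single_var = statistics.pvariance(single)
forest_var = statistics.pvariance(forest)
print(f"variance of one tree:         {single_var:.3f}")
print(f"variance of 100-tree average: {forest_var:.3f}")
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of trees, while the bias stays where it was.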

    
A Gradient Boosted Trees model results from a series of shallow decision trees, each of which learns from the errors of the trees before it.
* Each tree (stump) is high bias, low variance
* Each step reduces the bias
* The slow learning of a GBT model allows it to reach an extremely low bias
    * As long as you don't let it run for too long and overfit

### Tradeoffs:
* Gradient Boosted Trees require much more extensive parameter tuning than random forests.

* GBTs are not inherently parallelizable, whereas RFs are.

* GBTs can overfit (albeit slowly) as more trees are added, whereas adding more trees to an RF does not cause overfitting.

# Gradient Descent versus Gradient Boosting

## Gradients

### The gradient is a vector of the partial derivatives of a function. It is the n-dimensional generalization of the derivative.  

### The gradient vector, at a given point, points in the direction of steepest ascent.

### Let's break that down a bit:

* Suppose we have two variables, X and Y, and we also have Z, which is a function of both X and Y.
* The change in Z divided by a small change in Y, with X held fixed, is what we refer to as the partial derivative of Z with respect to Y.
* The change in Z divided by a small change in X, with Y held fixed, is what we refer to as the partial derivative of Z with respect to X.
* Suppose we represent each of these slopes with an arrow (vector) along the corresponding axis.
* The direction of the arrow indicates whether the slope is positive or negative, and its length indicates by how much.
* Every point has an arrow that represents the partial derivative of Z with respect to X.
* Every point also has an arrow that represents the partial derivative of Z with respect to Y.
* If we add these two vectors together, we get a new vector, which is called the gradient. It always points in the direction in which Z is increasing most rapidly.
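
The breakdown above can be sketched numerically. The surface `z` below is a made-up example (not from the original notebook), and `numerical_gradient` is a hypothetical helper that approximates each partial derivative by nudging one variable while holding the other fixed.

```python
def z(x, y):
    """Hypothetical surface: Z = x^2 + 3*y^2."""
    return x ** 2 + 3 * y ** 2


def numerical_gradient(f, x, y, h=1e-6):
    """Approximate (dZ/dX, dZ/dY) at (x, y) with central differences."""
    dz_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)  # y held fixed
    dz_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)  # x held fixed
    return dz_dx, dz_dy


grad = numerical_gradient(z, 1.0, 2.0)
print(grad)  # close to the analytic gradient (2x, 6y) = (2, 12)
```

At the point (1, 2) the Y component is much larger than the X component, so the gradient arrow points mostly along Y: that is the direction in which this surface climbs fastest.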

In [1]:
from IPython.display import YouTubeVideo
YouTubeVideo("GkB4vW16QHI")


## Gradient Descent

### General Optimization Algorithm


* It comes up a lot in machine learning.  


* It finds a local minimum of a (cost) function by repeatedly taking steps proportional to the negative of the gradient (or an approximate gradient) of the function at the current point


* Intuitively, you can think of this as repeatedly taking steps in the direction of the steepest descent. Doing so will eventually lead to a local minimum.
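
A minimal sketch of the idea, assuming a simple bowl-shaped cost function chosen for illustration (its minimum is at (3, -1)). The names `cost`, `cost_gradient`, and `gradient_descent`, along with the learning rate and step count, are assumptions rather than part of the original material.

```python
def cost(x, y):
    """Hypothetical cost function with its minimum at (3, -1)."""
    return (x - 3) ** 2 + (y + 1) ** 2


def cost_gradient(x, y):
    """Analytic gradient of cost: (d/dx, d/dy)."""
    return 2 * (x - 3), 2 * (y + 1)


def gradient_descent(grad, start, learning_rate=0.1, n_steps=200):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    x, y = start
    for _ in range(n_steps):
        gx, gy = grad(x, y)
        x -= learning_rate * gx  # move opposite to the gradient
        y -= learning_rate * gy
    return x, y


x_min, y_min = gradient_descent(cost_gradient, start=(0.0, 0.0))
print(x_min, y_min, cost(x_min, y_min))
```

Each step is proportional to the negative gradient, so the steps shrink as the surface flattens near the minimum. On a function with several valleys, the same procedure would stop in whichever local minimum it reaches first.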


## Gradient Boosting:

### Sequential Weak Learners

* Like bagging, boosting is a general approach that can be applied to many statistical learning methods for regression or classification. Gradient boosting is typically applied to decision trees. <br> <br>

Definition of Gradient Boosting:

* Given the current model, we fit a decision tree to the residuals from the model. We then add this new decision tree into the fitted function in order to update the residuals. <br> <br>

* Each of these trees can be rather small, with just a few terminal nodes, determined by a parameter in the algorithm. <br> <br>

* By fitting small trees to the residuals, we slowly improve the model in areas where it does not perform well. <br> <br>

* The shrinkage parameter (λ) slows the process down even further, allowing more and differently shaped trees to attack the residuals.<br> <br>

* Boosting for classification is similar in spirit to boosting for regression, but is a bit more complex.


* Can also do stochastic gradient boosting, which applies the same principle as bagging (fitting each tree to a subset of the data).
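
The regression case described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not a reference implementation: it uses squared-error loss, one feature, depth-one trees (stumps), and toy data from a sine curve. The helpers `fit_stump`, `boost`, and `mse` are hypothetical names, and the stochastic variant is not included.

```python
import math


def fit_stump(xs, residuals):
    """Fit a one-split tree to the residuals by minimizing squared error."""
    best = None
    points = sorted(set(xs))
    for lo, hi in zip(points, points[1:]):
        threshold = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_trees=100, shrinkage=0.1):
    """Add stumps one at a time, each fit to the current residuals."""
    base = sum(ys) / len(ys)           # start from a constant model
    residuals = [y - base for y in ys]
    stumps = []
    for _ in range(n_trees):
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken copy of the stump to the model, then update residuals.
        residuals = [r - shrinkage * stump(x) for x, r in zip(xs, residuals)]
    return lambda x: base + shrinkage * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [i / 10 for i in range(50)]       # toy feature values
ys = [math.sin(x) for x in xs]         # toy target values

mse_1 = mse(boost(xs, ys, n_trees=1), xs, ys)
mse_100 = mse(boost(xs, ys, n_trees=100), xs, ys)
print(f"training MSE after 1 stump:    {mse_1:.4f}")
print(f"training MSE after 100 stumps: {mse_100:.4f}")
```

The `shrinkage` argument plays the role of λ: a smaller value means each stump corrects only a fraction of the remaining error, so more stumps are needed. These numbers are training error only; a held-out set would be needed to see when additional trees start to overfit.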

## Comparison of Gradient Descent and Gradient Boosting

Gradient descent is a generic optimization algorithm for differentiable functions, whereas gradient boosting is a stagewise prediction algorithm.

They're similar in spirit, but they're used for different things:

* Gradient descent updates the parameters of a function step by step to reach a local minimum for a loss function.

* Gradient boosting adds a new function to the existing one at each step to reach a local minimum for a loss function.

Gradient boosting can be viewed as gradient descent in function space: instead of optimizing over the coefficient space, each step moves the model's predictions in the direction that most reduces the error.

In the end, the result of gradient descent is still the same function as at the beginning, just with better parameters. Gradient boosting, however, ends with a totally different function (the sum of multiple functions).