# Ensemble Learning
Wikipedia: In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

![](https://cdn-images-1.medium.com/max/800/1*8T4HEjzHto_V8PrEFLkd9A.png)
![](https://cdn-images-1.medium.com/max/800/1*PaXJ8HCYE9r2MgiZ32TQ2A.png)

# Decision Trees
[Stanford ppt](https://lagunita.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/trees.pdf)
[Bagging and Random Forests](https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/)
[Boosting](https://medium.com/mlreview/gradient-boosting-from-scratch-1e317ae4587d)

Decision trees use a set of splitting rules to segment the predictor space. Ensemble methods built on decision trees include `bagging`, `random forests` and `boosting`. Trees can be used for both regression and classification problems.

## Terminology
`terminal node`: the regions $R_t$ that predictions can fall into.

`internal node`: the points along the tree where the predictor space is split

## Regression Tree building
1. Divide the predictor space (i.e. the set of possible values for x) into J distinct and non-overlapping regions $R_1,R_2,...,R_J$.
2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.

The goal is to find 'boxes' $R_1,...,R_J$ that minimize the following loss function, the residual sum of squares (RSS):

$$
\sum^J_{j=1}\sum_{i \in R_j}(y_i - p_{R_j})^2
$$

where $p_{R_j}$ is the mean of the response values of the training observations in the jth box.
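
As a small illustration of this loss, the sketch below evaluates it for a fixed partition. The function names and the toy responses are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def region_mean(ys):
    """Mean response of the training observations in one region (p_{R_j})."""
    return sum(ys) / len(ys)


def partition_rss(regions):
    """Sum over regions of the squared deviations from each region's mean."""
    total = 0.0
    for ys in regions:
        p = region_mean(ys)
        total += sum((y - p) ** 2 for y in ys)
    return total


# Hypothetical responses, grouped by the region each observation falls into.
regions = [[1.0, 2.0, 3.0], [10.0, 12.0]]
loss = partition_rss(regions)
print(loss)
```

A partition that groups similar responses together gives a small loss; merging the two regions above into one would make it much larger.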

## Recursive Binary Split
This is a top-down, greedy algorithm to split the predictor space into J boxes.

It is top-down because it begins at the top of the tree and successively splits the predictor space; each split is indicated via two new branches further down on the tree. 

It is greedy because at each step of the tree-building process, the best split is made at that particular step, rather than looking ahead and picking a split that will lead to a better tree in some future step.

### The algorithm
1. Select the feature $X_j$ and the cutpoint s such that splitting the predictor space into the regions $\lbrace X \mid X_j < s \rbrace$ and $\lbrace X \mid X_j \ge s \rbrace$ leads to the greatest possible reduction in the loss function.
2. Next, we repeat the process, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the loss function (the residual sum of squares, RSS) within each of the resulting regions. However, this time, instead of splitting the entire predictor space, we split one of the two previously identified regions. We now have three regions.
3. Again, we look to split one of these three regions further, so as to minimize the loss function. The process continues until a stopping criterion is reached; for instance, we may continue until no region contains more than five examples.

We predict the response for a given test observation using the mean of the training observations in the region to which that test observation belongs.
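
The procedure can be sketched in plain Python as follows. This is a minimal illustration, not a reference implementation: the names `best_split`, `grow` and `predict`, the choice of midpoints between sorted values as candidate cutpoints, and the toy data are all assumptions. The sketch recurses into each region in turn rather than picking one region per step; with a stopping rule that depends only on region size, that should produce the same final tree.

```python
def rss(ys):
    """Residual sum of squares of the responses around their mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(X, y):
    """Return (feature j, cutpoint s, loss) minimizing the RSS of the two
    regions X_j < s and X_j >= s, or None if no split is possible."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate cutpoints: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            loss = rss(left) + rss(right)
            if best is None or loss < best[2]:
                best = (j, s, loss)
    return best


def grow(X, y, min_size=5):
    """Split recursively until no region holds more than min_size examples."""
    split = best_split(X, y) if len(y) > min_size else None
    if split is None:
        return {"prediction": sum(y) / len(y)}  # terminal node: region mean
    j, s, _ = split
    left = [(row, yi) for row, yi in zip(X, y) if row[j] < s]
    right = [(row, yi) for row, yi in zip(X, y) if row[j] >= s]
    return {
        "feature": j,
        "cutpoint": s,
        "left": grow([r for r, _ in left], [v for _, v in left], min_size),
        "right": grow([r for r, _ in right], [v for _, v in right], min_size),
    }


def predict(tree, row):
    """Walk down the internal nodes until a terminal node is reached."""
    while "prediction" not in tree:
        branch = "left" if row[tree["feature"]] < tree["cutpoint"] else "right"
        tree = tree[branch]
    return tree["prediction"]


# Hypothetical one-feature data with an obvious jump between x = 3 and x = 10.
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
tree = grow(X, y, min_size=3)
print(tree["cutpoint"], predict(tree, [2.5]), predict(tree, [11.0]))
```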

### Overfitting and pruning
The process described above may produce good predictions on the training set, but is likely to overfit the data, leading to poor test set performance.

We can grow a very large tree, then prune it back to obtain a subtree using `cost complexity pruning` or `weakest link pruning`.

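We consider a sequence of trees indexed by a nonnegative tuning parameter α. For each value of α there corresponds a subtree T ⊂ $T_0$ (where $T_0$ is the large unpruned tree) such that

$$
\sum^{|T|}_{m=1}\sum_{i: x_i \in R_m}(y_i - p_{R_m})^2 + \alpha|T|
$$

is as small as possible. Here |T| indicates the number of terminal nodes of the tree T, $R_m$ is the rectangle (i.e. the subset of predictor space) corresponding to the mth terminal node, and $p_{R_m}$ is the mean of the training observations in $R_m$. The parameter α controls the trade-off between the subtree's complexity and its fit to the training data: when α = 0 the subtree is simply $T_0$, and as α increases the minimizing subtree becomes smaller.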
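
To see how α trades fit against size, the sketch below scores a handful of candidate subtrees with the penalized criterion. The `(number of terminal nodes, training RSS)` pairs are made-up numbers for illustration, and `best_subtree` is a hypothetical helper; a real pruning routine would derive the candidates from $T_0$ itself.

```python
# Hypothetical nested subtrees of T_0, each summarized as
# (number of terminal nodes |T|, training RSS).
subtrees = [(1, 100.0), (2, 60.0), (4, 30.0), (8, 22.0)]


def penalized_cost(n_leaves, rss, alpha):
    """RSS plus the complexity penalty alpha * |T|."""
    return rss + alpha * n_leaves


def best_subtree(subtrees, alpha):
    """Pick the candidate with the smallest penalized cost for this alpha."""
    return min(subtrees, key=lambda t: penalized_cost(t[0], t[1], alpha))


for alpha in (0.0, 5.0, 20.0, 50.0):
    print(alpha, best_subtree(subtrees, alpha))
```

With these numbers the selected subtree shrinks from 8 terminal nodes at α = 0 down to a single node at α = 50.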

### Summary of the complete algorithm
1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations.
2. Apply cost complexity pruning to the large tree in order to obtain a sequence of best subtrees, as a function of α.
3. Use K-fold cross-validation to choose α. For each k = 1,...,K:
    - Repeat Steps 1 and 2 on all of the training data except the kth fold (a fraction $\frac{K-1}{K}$ of it).
    - Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of α.

    Average the results over the K folds for each value of α, and pick α to minimize the average error.
4. Return the subtree from Step 2 that corresponds to the chosen value of α.
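
The bookkeeping in Step 3 can be sketched as below. Only the fold assignment and the averaging are shown; growing and pruning the trees is left out, and the per-fold errors are placeholder numbers rather than results. `k_fold_indices` and `choose_alpha` are hypothetical helper names.

```python
def k_fold_indices(n, K):
    """Assign indices 0..n-1 to K folds of near-equal size."""
    return [list(range(k, n, K)) for k in range(K)]


def choose_alpha(alphas, fold_errors):
    """fold_errors[k][a] is the MSE on left-out fold k for alphas[a].
    Average over folds and return the alpha with the lowest mean error."""
    K = len(fold_errors)
    avg = [sum(fold[a] for fold in fold_errors) / K for a in range(len(alphas))]
    best = min(range(len(alphas)), key=lambda a: avg[a])
    return alphas[best], avg


folds = k_fold_indices(7, 3)

# Placeholder errors for K = 3 folds and three candidate values of alpha.
alphas = [0.0, 1.0, 5.0]
fold_errors = [
    [4.0, 3.0, 5.0],
    [6.0, 3.5, 4.5],
    [5.0, 2.5, 5.5],
]
chosen, avg = choose_alpha(alphas, fold_errors)
print(folds, chosen, avg)
```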
 
## Classification Tree Building
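Classification trees are built like regression trees, except that each region predicts its most commonly occurring class and the residual sum of squares is no longer usable as the splitting criterion. A natural alternative is the classification error rate:

$$
E = 1 - \max_k(p_{mk})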
$$
where $p_{mk}$ represents the proportion of training observations in the mth region that are from the kth class.

This is simply the fraction of the training observations in that region that do not belong to the most common class.

In practice the classification error is not sensitive enough for growing trees, so a better approach is to use the `Gini Index`:
$$
G = \sum^K_{k=1}p_{mk}(1-p_{mk})
$$
G measures the total variance across the K classes and is said to be indicative of node `purity`: G is small when every $p_{mk}$ is close to 0 or 1, i.e. when the node contains predominantly observations from a single class.

Another approach (which is often numerically similar to the Gini index) is `cross-entropy`:
$$
D = - \sum^K_{k=1}p_{mk}\log(p_{mk})
$$
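
The three measures can be compared on a single node with a few lines of Python. The helper names and the toy labels are assumptions for illustration; terms with $p_{mk} = 0$ are skipped in the cross-entropy, following the usual convention that $0 \log 0 = 0$.

```python
import math


def class_proportions(labels):
    """Proportion of observations in the node belonging to each class."""
    n = len(labels)
    return [labels.count(c) / n for c in sorted(set(labels))]


def classification_error(p):
    return 1 - max(p)


def gini(p):
    return sum(pk * (1 - pk) for pk in p)


def cross_entropy(p):
    # Skip zero proportions so that log(0) is never evaluated.
    return -sum(pk * math.log(pk) for pk in p if pk > 0)


pure = class_proportions(["a", "a", "a", "a"])
mixed = class_proportions(["a", "a", "b", "b"])
for name, p in (("pure", pure), ("mixed", mixed)):
    print(name, classification_error(p), gini(p), cross_entropy(p))
```

All three are zero for the pure node and largest for the evenly mixed one.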

## Pros and Cons of Trees
Advantages:
- Trees are very easy to explain to people. In fact, they are even easier to explain than linear regression!
- Some people believe that decision trees more closely mirror human decision-making than do other regression and classification approaches.
- Trees can be displayed graphically, and are easily interpreted even by a non-expert (especially if they are small).
- Trees can easily handle qualitative predictors without the need to create dummy variables.

Disadvantage:
- Unfortunately, trees generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.

## Stopping Criterion
The most common stopping procedure is to use a minimum count on the number of training instances assigned to each leaf node. If the count is less than some minimum then the split is not accepted and the node is taken as a final leaf node.

The count of training members is tuned to the dataset, e.g. 5 or 10. It defines how specific to the training data the tree will be. Too specific (e.g. a count of 1) and the tree will overfit the training data and likely have poor performance on the test set.


## Bagging Decision Trees
Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method; we introduce it here because it is particularly useful and frequently used in the context of decision trees.

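**Averaging a set of observations reduces variance.** Given n independent observations, each with variance $\sigma^2$, the variance of their mean is $\sigma^2/n$. A natural way to reduce the variance of a learning method would therefore be to take many training sets from the population, build a separate model on each, and average the resulting predictions. Of course, this is not practical because we generally do not have access to multiple training sets. Instead, we can bootstrap, taking repeated samples from the single training set.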

### Statistics: Bootstrapping
Let’s assume we have a sample of 100 values (x) and we’d like to get an estimate of the mean of the sample.

We can calculate the mean directly from the sample as:

mean(x) = 1/100 * sum(x)

We know that our sample is small and that our mean has error in it. The bootstrap procedure gives a resampling-based estimate of the mean, and the spread of the resampled means indicates how large that error is:

1. Create many (e.g. 1000) random sub-samples of our dataset with replacement (meaning we can select the same value multiple times).
2. Calculate the mean of each sub-sample.
3. Calculate the average of all of our collected means and use that as our estimated mean for the data.
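
A minimal sketch of these three steps, using only the standard library. The sample itself is synthetic (drawn from a normal distribution with an arbitrary mean and spread), and `bootstrap_means` is a hypothetical helper name.

```python
import random


def bootstrap_means(x, n_resamples=1000, seed=0):
    """Draw n_resamples samples of len(x) values with replacement
    and return the mean of each one."""
    rng = random.Random(seed)
    n = len(x)
    return [sum(rng.choices(x, k=n)) / n for _ in range(n_resamples)]


# Synthetic sample of 100 values; the distribution is an arbitrary choice.
data_rng = random.Random(42)
x = [data_rng.gauss(50, 10) for _ in range(100)]

sample_mean = sum(x) / len(x)
means = bootstrap_means(x)
bootstrap_estimate = sum(means) / len(means)
print(sample_mean, bootstrap_estimate, min(means), max(means))
```

The range of the resampled means gives a rough sense of how much the sample mean would move if the data were collected again.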

### Algorithm
1. Create many (e.g. 100) random sub-samples of our dataset with replacement.
2. Train a CART (Classification and Regression Trees) model on each sample.
3. Given new data, average the predictions of all the models (or, for classification, take a majority vote).

For example, if we had 5 bagged decision trees that made the following class predictions for an input sample: blue, blue, red, blue and red, we would take the most frequent class and predict blue.
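
The vote in this example can be written with `collections.Counter`. The helper name `majority_vote` is an assumption; on a tie, `most_common` returns whichever class was counted first, so a real implementation may want an explicit tie-breaking rule.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequently predicted class."""
    return Counter(predictions).most_common(1)[0][0]


votes = ["blue", "blue", "red", "blue", "red"]
winner = majority_vote(votes)
print(winner)
```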

### Prediction
By the bootstrapping method, we obtain B bootstrapped training sets from our training data. We then train our method on the bth bootstrapped training set in order to get $\hat{f}^{*b}(x)$, its prediction at a point x. We then average the predictions to obtain:
$$
\hat{f}_{bag}(x) = \frac{1}{B}\sum^B_{b=1}\hat{f}^{*b}(x)
$$
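
The averaging formula can be sketched end to end with a deliberately simple base learner. Here each $\hat{f}^{*b}$ is a one-split regression tree on a single feature, which is an assumption made to keep the example short (bagging normally uses deep trees); `fit_stump`, `bagged_predictor` and the toy data are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree on one feature; return a predictor."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        loss = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or loss < best[0]:
            best = (loss, s, ml, mr)
    if best is None:  # every x identical: fall back to the overall mean
        m = sum(ys) / len(ys)
        return lambda x: m
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr


def bagged_predictor(xs, ys, B=100, seed=0):
    """Train B stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(f(x) for f in models) / B


# Toy data: responses near 1 for small x and near 5 for large x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
f_bag = bagged_predictor(xs, ys)
print(f_bag(1.0), f_bag(8.0))
```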

### Classification trees
For classification trees: for each test observation, we record the class predicted by each of the B trees, and take a majority vote: the overall prediction is the most commonly occurring class among the B predictions.

### Characteristics of Bagging Trees
When bagging with decision trees, we are less concerned about individual trees overfitting the training data. For this reason and for efficiency, the individual decision trees are grown deep (e.g. few training samples at each leaf node of the tree) and the trees are not pruned. These trees will have both high variance and low bias. These are important characteristics of sub-models when combining predictions using bagging, because averaging reduces the variance while keeping the bias low.

## Random Forest
Random Forests are a type of ensemble learning and an improvement over bagged decision trees.

It is a simple tweak. In CART, when selecting a split point, the learning algorithm is allowed to look through all variables and all variable values in order to select the optimal split point. The random forest algorithm changes this procedure so that, at each split, the learning algorithm is limited to a random sample of features to search. This decorrelates the trees, so averaging them reduces variance more than plain bagging does.

The number of features that can be searched at each split point (m) must be specified as a parameter to the algorithm. You can try different values and tune it using cross validation.

- For classification a good default is: m = sqrt(p)
- For regression a good default is: m = p/3

where p is the total number of input features (predictors).
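
A sketch of the feature subsampling step, taking p to be the number of input features. Rounding the defaults down to an integer (with a floor of 1) is an assumption, as are the helper names `default_m` and `candidate_features`; libraries differ in how they round.

```python
import math
import random


def default_m(p, task):
    """Default number of features to consider at each split."""
    if task == "classification":
        return max(1, int(math.sqrt(p)))
    return max(1, p // 3)


def candidate_features(p, m, rng):
    """Random subset of m feature indices, redrawn for every split."""
    return sorted(rng.sample(range(p), m))


rng = random.Random(0)
p = 16
m_clf = default_m(p, "classification")
m_reg = default_m(p, "regression")
print(m_clf, m_reg)
print(candidate_features(p, m_clf, rng))  # features searched at one split
print(candidate_features(p, m_clf, rng))  # a fresh draw for the next split
```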

## Boosting 
Boosting is an ensemble technique in which the predictors are not made independently, but sequentially.

This technique employs the logic in which the subsequent predictors learn from the mistakes of the previous predictors. Therefore, the observations have an unequal probability of appearing in subsequent models and ones with the highest error appear most.

Boosting essentially works by starting with a weak learner and incrementally improving on its mistakes. A weak hypothesis or weak learner is defined as one whose performance is at least slightly better than random chance.

## Adaptive Boosting (AdaBoost)
The weak learners in AdaBoost are decision trees with a single split, called decision stumps for their shortness.

AdaBoost works by weighting the observations, putting more weight on difficult-to-classify instances and less on those already handled well. New weak learners are added sequentially, each focusing its training on the more difficult patterns.
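
One reweighting round can be sketched as follows. The notes above do not give the update formulas, so this uses the standard discrete AdaBoost rule for labels in {-1, +1}: learner weight $\alpha = \frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ and observation weights multiplied by $e^{-\alpha y_i h(x_i)}$, then renormalized. The function name and the toy predictions are hypothetical.

```python
import math


def adaboost_round(weights, y_true, y_pred):
    """One AdaBoost reweighting step for labels in {-1, +1}.
    Assumes the weighted error is strictly between 0 and 1."""
    total = sum(weights)
    err = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p) / total
    alpha = 0.5 * math.log((1 - err) / err)  # weight of this weak learner
    # Misclassified points (t * p == -1) are scaled up, correct ones down.
    new = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
    norm = sum(new)
    return alpha, [w / norm for w in new]


# Five observations with equal weight; the stump gets the last one wrong.
weights = [0.2] * 5
y_true = [1, 1, -1, -1, 1]
y_pred = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_round(weights, y_true, y_pred)
print(alpha, new_weights)
```

After the update the single misclassified observation carries half of the total weight, so the next weak learner is pushed to get it right.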



## Gradient Boosting
[slides](https://homes.cs.washington.edu/%7Etqchen/pdf/BoostedTree.pdf)

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees. (Wikipedia definition)

Function to minimize:
$$
\arg\min\sum_{i=1}^nL(y_i,\hat{y}_i) + \sum^K_{k=1}\Omega(f_k)
$$
where n is the number of training examples, L is the loss between the label $y_i$ and the prediction $\hat{y}_i$, K is the number of weak learners, and $\Omega(f_k)$ is the complexity of tree k.

![](img/boosting1.png)
![](img/boosting2.png)
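
A minimal sketch of gradient boosting under simplifying assumptions: squared-error loss, one-split regression trees as weak learners, and no complexity penalty $\Omega$. With squared error the negative gradient is just the residual, so each new tree is fit to the current residuals. The `learning_rate` shrinkage factor, the function names and the toy data are assumptions not taken from the notes above.

```python
def fit_stump(xs, targets):
    """Fit a one-split regression tree on one feature; return a predictor."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x < s]
        right = [t for x, t in zip(xs, targets) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        loss = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if best is None or loss < best[0]:
            best = (loss, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially add stumps, each fit to the residuals of the ensemble."""
    base = sum(ys) / len(ys)  # start from the constant mean prediction
    preds = [base] * len(ys)
    learners = []
    for _ in range(n_rounds):
        # Negative gradient of 1/2 * (y - prediction)^2 is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        f = fit_stump(xs, residuals)
        learners.append(f)
        preds = [p + learning_rate * f(x) for p, x in zip(preds, xs)]
    return lambda x: base + learning_rate * sum(f(x) for f in learners)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data following a curve that a single stump cannot capture.
xs = [float(i) for i in range(10)]
ys = [x ** 2 for x in xs]
mean_y = sum(ys) / len(ys)
mse_base = mse(lambda x: mean_y, xs, ys)
mse_boosted = mse(gradient_boost(xs, ys), xs, ys)
print(mse_base, mse_boosted)
```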

## Estimated performance and testing
For each bootstrap sample taken from the training data (as in bagging and random forests), there will be samples left behind that were not included. These samples are called Out-Of-Bag samples or OOB.

The performance of each model on its left-out samples, when averaged, can provide an estimated accuracy of the bagged models. This estimated performance is often called the OOB estimate of performance.

These performance measures are reliable test error estimates and correlate well with cross-validation estimates.
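
The sketch below shows how the OOB samples fall out of a single bootstrap draw. For a sample of size n drawn with replacement from n observations, each observation is left out with probability $(1 - 1/n)^n$, which is roughly one third for large n. `bootstrap_with_oob` is a hypothetical helper name.

```python
import random


def bootstrap_with_oob(n, rng):
    """Draw n indices with replacement; return them along with the
    indices that were never drawn (the out-of-bag samples)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


rng = random.Random(0)
n = 1000
in_bag, oob = bootstrap_with_oob(n, rng)
oob_fraction = len(oob) / n
print(oob_fraction)
```

Each model would then be scored only on its own `oob` indices, and those scores averaged across models to give the OOB estimate.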