# Bagging

## Model Specification

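Bagging, or bootstrap aggregation, averages the predictions of a base estimator fit to a collection of bootstrap samples, each drawn with replacement from the training data. Writing $\hat{f}^{*b}(x)$ for the prediction of the model fit to the $b$-th of $B$ bootstrap samples, the bagged estimate is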
$$\hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x)$$
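* The above is a Monte Carlo estimate of the true bagging estimate, i.e. the expectation of the bootstrap prediction $\hat{f}^{*}(x)$ under the empirical distribution of the training data; it approaches that value as $B\rightarrow\infty$.
* Note that, as $B\rightarrow\infty$, the bagged estimate $\hat{f}_{bag}(x)$ differs from the original estimate $\hat{f}(x)$ only when the estimator is a nonlinear or adaptive function of the data.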
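
As a rough illustration of the formula and of the second remark, the sketch below bags two base estimators on synthetic one-dimensional data. The sample mean (linear in the responses) barely changes under bagging, whereas a one-split regression stump (adaptive) can change. The data, the helper names `fit_mean`, `fit_stump`, and `bagged_predict`, and the choice $B=500$ are all assumptions made for the example, not something taken from ESL.

```python
import random
import statistics


def fit_mean(xs, ys):
    """Linear base estimator: predict the sample mean of y everywhere."""
    m = statistics.fmean(ys)
    return lambda x: m


def fit_stump(xs, ys):
    """Adaptive base estimator: one-split regression stump chosen by least squares."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        ml, mr = statistics.fmean(left), statistics.fmean(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, ml, mr)
    if best is None:  # all x identical: fall back to the mean
        return fit_mean(xs, ys)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def bagged_predict(fit, xs, ys, x0, B, rng):
    """Average the predictions at x0 of B models fit to bootstrap samples."""
    n = len(xs)
    total = 0.0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # n indices drawn with replacement
        model = fit([xs[i] for i in idx], [ys[i] for i in idx])
        total += model(x0)
    return total / B


# Synthetic data (assumed for illustration): a noisy step function.
rng = random.Random(0)
xs = [i / 29 for i in range(30)]
ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.3) for x in xs]
x0 = 0.45

mean_original = fit_mean(xs, ys)(x0)
mean_bagged = bagged_predict(fit_mean, xs, ys, x0, B=500, rng=rng)
stump_original = fit_stump(xs, ys)(x0)
stump_bagged = bagged_predict(fit_stump, xs, ys, x0, B=500, rng=rng)

print("mean : original %.3f, bagged %.3f" % (mean_original, mean_bagged))
print("stump: original %.3f, bagged %.3f" % (stump_original, stump_bagged))
```

The bagged mean should agree with the original mean up to Monte Carlo noise, while the bagged stump is typically a smoothed version of the original stump near the split point.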

### Variants and Generalizations

An alternative bagging strategy is to average the class probabilities themselves, rather than the individual class votes, before making the final classification. Not only does this produce improved estimates of the class probabilities, but it also tends to produce bagged classifiers with lower variance, especially for small $B$ (ESL gives no reference for this claim and does not say whether it is a theoretical or an empirical result).
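
The two aggregation rules can disagree. The following minimal sketch uses three made-up class-1 probabilities (an assumption for illustration) to show a case where the majority vote and the averaged probability give different labels; the function names are hypothetical.

```python
def vote_then_decide(prob_class1):
    """Majority vote: each model votes for class 1 if its probability exceeds 0.5."""
    votes = sum(1 for p in prob_class1 if p > 0.5)
    return 1 if votes > len(prob_class1) / 2 else 0


def average_then_decide(prob_class1):
    """Average the class-1 probabilities first, then threshold the average."""
    avg = sum(prob_class1) / len(prob_class1)
    return (1 if avg > 0.5 else 0), avg


# Hypothetical class-1 probabilities from three bagged models at one point x.
probs = [0.45, 0.45, 0.90]
vote_label = vote_then_decide(probs)
avg_label, avg_prob = average_then_decide(probs)

print("majority vote label:", vote_label)
print("averaged probability: %.2f -> label %d" % (avg_prob, avg_label))
```

Here two models lean weakly toward class 0 and one leans strongly toward class 1, so voting discards the strength of the third model's prediction while averaging keeps it.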

## Theoretical Properties

### Advantages
- Bagging is used to reduce the variance of the base estimator (e.g. decision trees). The intuition for why it reduces variance or instability (see [CART](CART.ipynb)) is that each base tree may split on different variables because it is fit to a different bootstrap sample. Once the trees are averaged, the influence of any single data point on the splits is significantly reduced.
- In many cases, bagging is very simple to implement and does not interfere with the inner workings of the base estimator.
- Bagging produces out-of-bag (OOB) samples, the observations left out of each bootstrap sample, which naturally provide an error estimate similar to leave-one-out cross-validation (LOO CV).
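
To make the out-of-bag idea concrete, here is a small sketch under assumed conditions: synthetic one-dimensional data and a 1-nearest-neighbor regressor as a stand-in base learner (both chosen only for brevity). Each observation is predicted using only the models whose bootstrap sample did not contain it, and the resulting errors are averaged. The function names are hypothetical.

```python
import math
import random


def nearest_neighbor_predict(train_x, train_y, x0):
    """1-nearest-neighbor regression in one dimension (a stand-in base learner)."""
    j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))
    return train_y[j]


def oob_estimate(xs, ys, B, rng):
    """Return (OOB mean squared error, average fraction of points left out per sample)."""
    n = len(xs)
    sums = [0.0] * n   # running sum of OOB predictions for each observation
    counts = [0] * n   # number of bootstrap samples in which each observation was OOB
    left_out = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of indices
        in_bag = set(idx)
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        for i in range(n):
            if i not in in_bag:  # observation i was not used to fit this model
                sums[i] += nearest_neighbor_predict(bx, by, xs[i])
                counts[i] += 1
                left_out += 1
    errors = [(ys[i] - sums[i] / counts[i]) ** 2 for i in range(n) if counts[i] > 0]
    return sum(errors) / len(errors), left_out / (B * n)


# Synthetic data (assumed for illustration): a noisy sine curve.
rng = random.Random(1)
xs = [rng.uniform(0, 1) for _ in range(50)]
ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.2) for x in xs]
oob_mse, oob_fraction = oob_estimate(xs, ys, B=200, rng=rng)

print("OOB mean squared error: %.3f" % oob_mse)
print("average OOB fraction  : %.3f" % oob_fraction)
```

On average roughly $(1-1/n)^n \approx e^{-1} \approx 0.37$ of the observations are left out of each bootstrap sample, which is why no separate hold-out set is needed for this error estimate.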

### Disadvantages

- Interpretability is sacrificed: an average of many trees is no longer a single tree.
- **Independence among base estimators is crucial for the success of bagging.** When the predictors are highly correlated, the models fit to different bootstrap samples tend to be highly correlated as well, which limits how much averaging can reduce the variance. For $B$ identically distributed estimators with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the average is $\rho\sigma^2+\frac{1-\rho}{B}\sigma^2$, so the first term does not shrink as $B$ grows. This is something that [random forest](random_forest.ipynb) tries to alleviate.
- **The error rate of the base learner still matters.** Bagging a 'bad' classifier can make it worse. A simple example: suppose $Y=1$ for all $x$, and the classifier $\hat{G}(x)$ predicts $Y=1$ (for all $x$) with probability $0.4$ and $Y=0$ with probability $0.6$. The base classifier has a misclassification rate of $60\%$, but the bagged classifier, which takes a majority vote, has a misclassification rate that approaches $100\%$ as $B$ grows.
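
The 'bad classifier' example can be checked with a short simulation. This is a sketch under the simplifying assumption that the $B$ base predictions are independent draws (real bootstrap fits need not be); the choices $B=501$ and 2000 trials are arbitrary, and the function names are hypothetical.

```python
import random


def base_prediction(rng):
    """One draw of the classifier from the example: predicts 1 w.p. 0.4, else 0."""
    return 1 if rng.random() < 0.4 else 0


def bagged_prediction(B, rng):
    """Majority vote over B independent draws of the base classifier."""
    ones = sum(base_prediction(rng) for _ in range(B))
    return 1 if ones > B / 2 else 0


rng = random.Random(0)
trials = 2000
truth = 1  # Y = 1 for all x

base_error = sum(base_prediction(rng) != truth for _ in range(trials)) / trials
bagged_error = sum(bagged_prediction(501, rng) != truth for _ in range(trials)) / trials

print("base misclassification rate  : %.3f" % base_error)
print("bagged misclassification rate: %.3f" % bagged_error)
```

Because each vote is wrong more often than not, the majority vote concentrates on the wrong class as $B$ increases, which is the mechanism behind the $100\%$ figure.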

### Relation to Other Models

- As a way to reduce overfitting, bagging works best with high-variance, low-bias base models (e.g. fully grown decision trees without pruning), while [boosting](boosting.ipynb) tends to work best with biased, low-variance base models (e.g. shallow decision trees). This reflects the opposite goals of the two methods: bagging reduces variance by averaging approximately unbiased base estimators, while boosting reduces bias by taking a regularized sum of low-variance base estimators.
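
The reason bagging wants low-bias base models can be seen in an idealized simulation: averaging shrinks variance but leaves bias untouched. The sketch below assumes the $B$ base estimates are independent Gaussian draws around a biased value, which is an idealization (bootstrap fits are correlated in practice); all numbers and the function name `simulate` are assumptions for illustration.

```python
import random
import statistics


def simulate(bias, sd, B, trials, rng, truth=0.0):
    """Compare one base estimate with the average of B independent base estimates."""
    single, averaged = [], []
    for _ in range(trials):
        draws = [truth + bias + rng.gauss(0, sd) for _ in range(B)]
        single.append(draws[0])
        averaged.append(statistics.fmean(draws))
    return single, averaged


rng = random.Random(0)
single, averaged = simulate(bias=0.5, sd=1.0, B=25, trials=4000, rng=rng)

# The true value is 0, so the mean of the estimates equals their bias.
print("bias     single: %.3f  averaged: %.3f"
      % (statistics.fmean(single), statistics.fmean(averaged)))
print("variance single: %.3f  averaged: %.3f"
      % (statistics.variance(single), statistics.variance(averaged)))
```

Under these assumptions the averaged estimate keeps the bias of the base estimate while its variance falls by roughly a factor of $B$.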

## Empirical Performance

### Advantages and Disadvantages

## Implementation Details and Practical Tricks

## Use Cases

## Results Interpretation, Metrics and Visualization

## References 
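- Hastie, Tibshirani, and Friedman, *The Elements of Statistical Learning* (ESL), Section 8.7
- [scikit-learn documentation, Section 1.11 (Ensemble methods)](http://scikit-learn.org/stable/modules/ensemble.html)

### Further Reading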

## Misc.