# Bagging

## Model Specification

Bagging (short for bootstrap aggregation) averages the predictions of a base estimator fitted to a collection of bootstrap samples, each drawn with replacement from the training data.
$$\hat{f}_{bag}(x)=\frac{1}{B}\sum_{b=1}^B \hat{f}^{*b}(x)$$
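* The above is a Monte Carlo estimate of the true bagging estimate $E_{\hat{\mathcal{P}}}\hat{f}^*(x)$, the expectation over bootstrap samples drawn from the empirical distribution of the data; it approaches that expectation as $B\rightarrow\infty$.
* Note that the bagged estimate $\hat{f}_{bag}(x)$ differs from the original estimate $\hat{f}(x)$ (as $B\rightarrow\infty$) only when the estimator is a nonlinear or adaptive function of the data.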
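
The linear-versus-nonlinear point can be checked numerically. The sketch below is a toy illustration rather than anything taken from ESL: `bagged_estimate` is a hypothetical helper that averages an estimator over bootstrap resamples, and the sample mean (linear in the data) and sample median (nonlinear) are chosen here purely as examples.

```python
import random
import statistics

def bagged_estimate(data, estimator, n_bootstrap=2000, seed=0):
    """Average `estimator` over bootstrap resamples of `data` (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(data)
    estimates = [estimator(rng.choices(data, k=n)) for _ in range(n_bootstrap)]
    return statistics.fmean(estimates)

# Toy data set: 25 draws from a standard normal (an arbitrary choice).
data_rng = random.Random(42)
data = [data_rng.gauss(0.0, 1.0) for _ in range(25)]

plain_mean = statistics.fmean(data)
bagged_mean = bagged_estimate(data, statistics.fmean)

plain_median = statistics.median(data)
bagged_median = bagged_estimate(data, statistics.median)

# Linear estimator: bagging reproduces it up to Monte Carlo noise.
print("mean:  ", plain_mean, "bagged:", bagged_mean)
# Nonlinear estimator: the bagged version is a smoothed blend of order statistics.
print("median:", plain_median, "bagged:", bagged_median)
```

With enough resamples the bagged mean should sit on top of the plain mean, while the bagged median generally does not coincide with the plain median.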

### Variants and Generalizations

- An alternative bagging strategy for classification is to average the class probabilities themselves before reaching a verdict, i.e. soft voting rather than hard voting; see the general discussion in [ensemble](../meta_learning/ensemble.ipynb). Not only does this produce improved estimates of the class probabilities, but it also tends to produce bagged classifiers with lower variance, especially for small $B$ (though this is intuitive, ESL gives no reference and does not say whether it is a theoretical or an empirical result).

- When sampling is without replacement, it is called **pasting**. Bootstrapping introduces a bit more diversity in the subsets that each predictor is trained on, so bagging ends up with a slightly higher bias than pasting; but the extra diversity also means that the predictors end up being less correlated, so the ensemble’s variance is reduced. Overall, bagging often results in better models, which explains why it is generally preferred. However, if you have spare time and CPU power, you can use cross-validation to evaluate both bagging and pasting and select the one that works best.
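
To make the soft-voting versus hard-voting distinction concrete, here is a minimal sketch with made-up numbers. The three probabilities are hypothetical outputs of three bagged base classifiers for a single input; nothing here comes from a real model.

```python
from collections import Counter

# Hypothetical P(class = 1) from three bagged base classifiers for one input x.
probs_class1 = [0.45, 0.45, 0.90]

# Hard voting: each classifier casts one vote for its most likely class.
votes = [1 if p >= 0.5 else 0 for p in probs_class1]
hard_label = Counter(votes).most_common(1)[0][0]

# Soft voting: average the probabilities first, then threshold.
mean_prob = sum(probs_class1) / len(probs_class1)
soft_label = 1 if mean_prob >= 0.5 else 0

print(f"hard vote -> class {hard_label}")
print(f"soft vote -> class {soft_label} (mean probability {mean_prob:.2f})")
```

Two mildly unsure classifiers outvote one confident classifier under hard voting, whereas soft voting lets the confident one carry the decision.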

## Theoretical Properties

### Advantages
- Bagging is used to reduce the variance of the base estimator (e.g. decision trees). The intuition for why bagging reduces variance or instability (see [CART](CART.ipynb)) is that each base tree may split on different variables because of the randomized bootstrap samples. When the trees are subsequently averaged, the influence of any single data point on the splits is significantly reduced.
- In many cases, bagging is very simple to implement and does not interfere with the inner workings of the base estimators.
- Bagging produces out-of-bag (OOB) samples, which naturally provide a leave-one-out-type cross-validation estimate. Indeed, for an original sample with $N$ observations, a bootstrap sample of the same size $N$ excludes a particular observation with probability $(1-1/N)^N\approx e^{-1}\approx 0.37$, i.e. roughly a third of the observations.
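
The out-of-bag probability can be checked with a few lines of arithmetic and a quick simulation. This is only a sketch; the helper names `oob_probability` and `simulated_oob_fraction` are made up for this note.

```python
import math
import random

def oob_probability(n):
    """Probability that a given observation is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n

def simulated_oob_fraction(n, seed=0):
    """Fraction of the n observations left out of one simulated bootstrap sample."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}  # distinct indices that were drawn
    return 1.0 - len(drawn) / n

for n in (10, 100, 1000):
    print(n, round(oob_probability(n), 4))
print("limit e^-1:", round(math.exp(-1), 4))
print("simulated, n=10000:", round(simulated_oob_fraction(10_000), 4))
```

The exact probability rises toward $e^{-1}$ as $N$ grows, and a single simulated bootstrap sample leaves out a similar fraction.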

### Disadvantages

- Interpretability is sacrificed.
- **Independence among base estimators is crucial for the success of bagging**, as with any ensemble method. When the base predictors are highly correlated (as predictors fitted to overlapping bootstrap samples tend to be), averaging removes less of their variance: for $B$ identically distributed estimators with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the average is $\rho\sigma^2+\frac{1-\rho}{B}\sigma^2$, so the first term remains however large $B$ is. This is something that [random forest](random_forest.ipynb) tries to alleviate.
- **Error rate of the base learner still matters.** Bagging a 'bad' classifier can make it worse. A simple example: suppose $Y=1$ for all $x$, and the classifier $\hat{G}(x)$ predicts $Y=1$ for all $x$ with probability $0.4$ and $Y=0$ for all $x$ with probability $0.6$. The base classifier has a misclassification rate of $60\%$, but that of the bagged (majority-vote) classifier approaches $100\%$ as $B$ grows.
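
The 'bad classifier' example can be worked out exactly. The sketch below assumes the $B$ bootstrap replicates vote independently (reasonable here, since the toy classifier ignores the data) and that $B$ is odd so ties cannot occur; `bagged_error` is a hypothetical helper name.

```python
from math import comb

def bagged_error(n_estimators, p_correct=0.4):
    """Error of a majority vote over independent votes, each correct with prob p_correct.

    Assumes an odd number of voters so that ties cannot occur.
    """
    majority = n_estimators // 2 + 1
    p_majority_correct = sum(
        comb(n_estimators, k) * p_correct**k * (1 - p_correct) ** (n_estimators - k)
        for k in range(majority, n_estimators + 1)
    )
    return 1.0 - p_majority_correct

# Error of the majority vote for an increasing number of bagged classifiers.
for b in (1, 11, 101, 501):
    print(b, round(bagged_error(b), 4))
```

With a single classifier the error is the base rate of 0.6; as more votes are pooled, the wrong class wins the majority ever more reliably.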

### Relation to Other Models

- When both data and features are sampled and the base estimator is a tree, the result is close to a [random forest](random_forest.ipynb), which additionally resamples the candidate features at every split rather than once per tree. Fancier names:
    - Sampling both training instances and features is called the **Random Patches method**.
    - Keeping all training instances but sampling features is called the **Random Subspaces method**.

- As a way to reduce overfitting, bagging works best with high-variance base models (e.g. fully grown decision trees without pruning), while [boosting](boosting.ipynb) tends to work best with biased base models (e.g. shallow decision trees). This reflects their opposite goals: bagging reduces variance by averaging approximately unbiased base estimators, while boosting reduces bias by taking a regularized sum of low-variance base estimators.

- Boosting appears to dominate bagging on most problems and has become the preferred choice. Again, [random forest](random_forest.ipynb) seems to help.
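
The Random Subspaces and Random Patches methods mentioned above map onto `BaggingClassifier` settings roughly as follows. This is a sketch on synthetic data; the specific fractions (`0.5`, `0.7`) and ensemble size are arbitrary choices, and the base estimator is passed positionally so the call does not depend on the keyword name used by a given scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Random Subspaces: keep every training instance, sample the features.
subspaces = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=25,
    bootstrap=False, max_samples=1.0,
    bootstrap_features=False, max_features=0.5,
    random_state=0,
).fit(X, y)

# Random Patches: sample both training instances and features.
patches = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=25,
    bootstrap=True, max_samples=0.7,
    bootstrap_features=False, max_features=0.5,
    random_state=0,
).fit(X, y)

# Each base estimator records which feature columns it was trained on.
print(subspaces.estimators_features_[0])
print(patches.estimators_features_[0])
```

Note that here the features are drawn once per base estimator, not at every split as in a random forest.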

## Empirical Performance

### Advantages and Disadvantages

## Implementation Details and Practical Tricks

`BaggingClassifier` wraps a base estimator to provide a meta-estimator.

In [1]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Base estimator passed positionally: the keyword is `base_estimator` before scikit-learn 1.2, `estimator` from 1.2 on.
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=10, max_samples=1.0, max_features=1.0,
    bootstrap=True, bootstrap_features=False, oob_score=True,
)
# bagging.fit(X_train, y_train)  # X_train, y_train: your training data
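
For a runnable end-to-end version, the sketch below fits a bagged tree ensemble on synthetic data and reads off the out-of-bag estimate. The data set, split, and `n_estimators=100` are assumptions made for illustration; a larger ensemble is used so that every training instance is out-of-bag for at least one estimator.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem, used only for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=100,
    bootstrap=True, oob_score=True, random_state=0,
)
bag_clf.fit(X_train, y_train)

# OOB accuracy is computed from training instances each tree did not see.
print("OOB accuracy: ", bag_clf.oob_score_)
print("test accuracy:", bag_clf.score(X_test, y_test))
print("OOB class probabilities, first rows:")
print(bag_clf.oob_decision_function_[:3])
```

The OOB accuracy is obtained without a separate validation set and can be compared against the held-out test accuracy.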

**Some commonly used inputs** (shared with other tree-based algorithms):

- **`base_estimator`**: The base estimator to fit on random subsets of the dataset. If `None`, the base estimator is a decision tree. (This parameter is named `estimator` from scikit-learn 1.2 onward.)

- **`n_estimators`**: The number of base estimators in the ensemble.

- **`max_samples`**: The number of samples to draw from `X` to train each base estimator. If int, then draw `max_samples` samples. If float, then draw `max_samples * X.shape[0]` samples.

- **`bootstrap`**: Whether samples are drawn with replacement; setting it to `False` gives pasting.

- **`n_jobs`**: The number of jobs to run in parallel for both fit and predict. If `-1`, then the number of jobs is set to the number of cores.

- **`random_state`**: If int, `random_state` is the seed used by the random number generator; if a `RandomState` instance, `random_state` is the random number generator; if `None`, the random number generator is the `RandomState` instance used by `np.random`.

- **`oob_score`**: Whether to request an automatic out-of-bag evaluation after training. The resulting score is available through the `oob_score_` attribute of the fitted instance. The OOB decision function for each training instance is also available through the `oob_decision_function_` attribute. When the base estimator has a `predict_proba()` method, the decision function returns the class probabilities for each training instance.

- **`bootstrap_features`**: Whether features are drawn with replacement; how many features each base estimator sees is set by `max_features`. Bagged trees with both instance sampling and feature sampling enabled are not that different from a random forest.

- **`max_features`**: Works like `max_samples`, but for features.
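
Since `bootstrap` switches between bagging and pasting, the two can be compared by cross-validation as suggested earlier. The sketch below uses synthetic data and arbitrary settings; `max_samples=0.5` is an assumption made so that pasting actually trains each tree on a different subset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only for illustration.
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

results = {}
for name, with_replacement in [("bagging", True), ("pasting", False)]:
    clf = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=50,
        max_samples=0.5,             # each tree sees half of the training fold
        bootstrap=with_replacement,  # True: bagging, False: pasting
        n_jobs=1, random_state=0,
    )
    # Mean accuracy over 5 cross-validation folds.
    results[name] = cross_val_score(clf, X, y, cv=5).mean()

for name, score in results.items():
    print(f"{name}: mean CV accuracy {score:.3f}")
```

Which variant wins depends on the data, so the comparison is best rerun on the problem at hand.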

There are other inputs that also allow for randomized features, presumably in the fashion of random forest.

Note that parallelization via `n_jobs`, while useful for bagging here or for [random forest](random_forest.ipynb) since many trees are built, might not benefit building a single, big tree.

Also note that the `BaggingClassifier` automatically performs soft voting instead of hard voting if the base
classifier can estimate class probabilities (i.e., if it has a `predict_proba()` method), which is the case
with `DecisionTreeClassifier`.

## Use Cases

## Results Interpretation, Metrics and Visualization

## References
- ESL, Section 8.7.
- [scikit-learn Document 1.11.1](http://scikit-learn.org/stable/modules/ensemble.html)
- *Hands-On Machine Learning*, Chapter 7.
- MLEDU, Lecture 22.

### Further Reading

## Misc.