# Ensemble Methods: Introduction  
## Ensembles: Parallel vs Sequential
- Ensemble methods combine multiple models  
- Parallel ensembles: each model is built independently   
  - e.g. bagging and random forests   
  - Main Idea: Combine many (high complexity, low bias) models to reduce variance  
- Sequential ensembles  
  - Models are generated sequentially   
  - Try to add new models that do well where previous models fall short

# The Benefits of Averaging  
## A Poor Estimator
- Let $Z,Z_1,...,Z_n$ be i.i.d. with $EZ = \mu$ and $\text{Var} Z = \sigma^2$.  
- We could use any single $Z_i$ to estimate $\mu$.  
- Unbiased: $EZ_i = \mu$.  
- Standard error of estimator would be $\sigma$.  
  - The standard error is the standard deviation of the sampling distribution of a statistic  
  - $\mathrm{SD}(Z)=\sqrt{\operatorname{Var}(Z)}=\sqrt{\sigma^{2}}=\sigma$  
  
## Variance of a Mean
Let’s consider the average of the $Z_i$’s.   
- Average has the same expected value but smaller standard error:   
$$
\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} Z_{i}\right]=\mu \quad \operatorname{Var}\left[\frac{1}{n} \sum_{i=1}^{n} Z_{i}\right]=\frac{\sigma^{2}}{n}
$$  
- Clearly the average is preferred to a single $Z_i$ as estimator  
- Can we apply this to reduce variance of general prediction functions?
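
The claim above can be checked numerically. The following is a minimal simulation sketch, assuming normally distributed $Z_i$ and illustrative values for $\mu$, $\sigma$ and $n$ (none of these choices come from the notes; any distribution with finite variance would do):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 2.0, 3.0, 25      # illustrative values (assumed)
trials = 100_000                 # number of repeated experiments

# Each row holds one draw of Z_1, ..., Z_n.
Z = rng.normal(mu, sigma, size=(trials, n))

single = Z[:, 0]                 # estimator 1: a single Z_i
average = Z.mean(axis=1)         # estimator 2: the sample mean

# Both estimators are centred on mu ...
print("means:    ", single.mean(), average.mean())
# ... but the variance of the average is close to sigma^2 / n.
print("variances:", single.var(), average.var(), sigma**2 / n)
```

Both empirical means should land near $\mu$, while the empirical variance of the average should sit close to $\sigma^2/n$ rather than $\sigma^2$.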

## Averaging Independent Prediction Functions
- Suppose we have $B$ independent training sets from the same distribution  
- Learning algorithm gives $B$ decision functions: $\hat{f}_{1}(x), \hat{f}_{2}(x), \ldots, \hat{f}_{B}(x)$  
- Define the average prediction function as  
$$
\hat{f}_{\mathrm{avg}}=\frac{1}{B} \sum_{b=1}^{B} \hat{f}_{b}
$$  
- What’s random here?
- Fix some $x \in X$.  
- Then average prediction on $x$ is  
$$
\hat{f}_{\text {avg }}(x)=\frac{1}{B} \sum_{b=1}^{B} \hat{f}_{b}(x)
$$  
- Consider $\hat{f}_{\text {avg }}(x)$ and $\hat{f}_{1}(x), \ldots, \hat{f}_{B}(x)$ as random variables (since the training data are random).  
- $\hat{f}_{1}(x), \ldots, \hat{f}_{B}(x)$ are i.i.d.
- $\hat{f}_{\text {avg }}(x)$ and $\hat{f}_{b}(x)$ have the same expected value, but  
- $\hat{f}_{\text {avg }}(x)$ has smaller variance:  
$$
\begin{aligned}
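\operatorname{Var}\left(\hat{f}_{\text {avg }}(x)\right) &=\frac{1}{B^{2}} \operatorname{Var}\left(\sum_{b=1}^{B} \hat{f}_{b}(x)\right) \\
&=\frac{1}{B^{2}} \sum_{b=1}^{B} \operatorname{Var}\left(\hat{f}_{b}(x)\right) \quad \text{(by independence)} \\
&=\frac{1}{B} \operatorname{Var}\left(\hat{f}_{1}(x)\right) \quad \text{(identically distributed)}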
\end{aligned}
$$  
- But in practice we don’t have $B$ independent training sets: we would have to split our single dataset into $B$ disjoint subsets, each much smaller than the original.  
- Instead, we can use the bootstrap....


# Bagging  
- Draw $B$ bootstrap samples $D^1,...,D^B$ from original data $D$.  
- Let $\hat{f}_{1}, \hat{f}_{2}, \ldots, \hat{f}_{B}$ be the decision functions trained on each bootstrap sample  
- The **bagged decision function** is a combination of these:
$$
\hat{f}_{\text {avg }}(x)=\text { Combine }\left(\hat{f}_{1}(x), \hat{f}_{2}(x), \ldots, \hat{f}_{B}(x)\right)
$$  
- How might we combine  
  - decision functions for regression?   
  - binary class predictions? 
  - binary probability predictions? 
  - multiclass predictions?   
  
## Bagging for Regression
- Draw $B$ bootstrap samples $D^1,...,D^B$ from original data $D$.  
- Let $\hat{f}_{1}, \hat{f}_{2}, \ldots, \hat{f}_{B}: X \rightarrow \mathbf{R}$ be the prediction functions trained on each bootstrap sample  
- Bagged prediction function is given as  
$$
\hat{f}_{\mathrm{bag}}(x)=\frac{1}{B} \sum_{b=1}^{B} \hat{f}_{b}(x)
$$  
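- Empirically, $\hat{f}_{bag}$ often performs similarly to what we’d get from training on $B$ independent samples: 
  - $\hat{f}_{bag}(x)$ has the same expected value as a single $\hat{f}_{b}(x)$  
  - $\hat{f}_{bag}(x)$ has smaller variance than a single $\hat{f}_{b}(x)$  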
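
The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the base learner `fit_1nn` (1-nearest-neighbour regression on one-dimensional inputs) and the synthetic sine data are assumptions chosen only because they are simple and high variance.

```python
import numpy as np

def fit_1nn(X, y):
    """Hypothetical base learner: 1-nearest-neighbour regression on 1-D inputs."""
    def predict(x_new):
        # Index of the closest training input for every query point.
        nearest = np.abs(x_new[:, None] - X[None, :]).argmin(axis=1)
        return y[nearest]
    return predict

def bagged_predictor(X, y, fit, B, rng):
    """Train `fit` on B bootstrap samples and average the B predictions."""
    n = len(X)
    predictors = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample D^b: n draws with replacement
        predictors.append(fit(X[idx], y[idx]))

    def predict(x_new):
        # f_bag(x) = (1/B) * sum_b f_b(x)
        return np.mean([f(x_new) for f in predictors], axis=0)

    return predict, predictors

# Synthetic data (assumed for illustration): noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=100)
y = np.sin(2 * np.pi * X) + rng.normal(0, 0.3, size=100)

f_bag, fs = bagged_predictor(X, y, fit_1nn, B=50, rng=rng)
x_grid = np.linspace(0, 1, 5)
pred = f_bag(x_grid)
print(pred)
```

Any function with the same `fit(X, y) -> predict` shape could be passed in place of `fit_1nn`, for example a regression tree from a library.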

## Out-of-Bag Error Estimation
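- Each bagged predictor is trained on about 63% of the distinct training points: a given point appears in a bootstrap sample of size $n$ with probability $1-(1-1/n)^{n} \approx 1-e^{-1} \approx 0.632$.  
- The remaining 37% are called out-of-bag (OOB) observations for that predictor.  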
- For the $i$th training point, let  
$$
S_{i}=\left\{b \mid D^{b} \text { does not contain the } i \text {th point}\right\}
$$  
- For example, if the $i$th training point is missing from exactly $D^3, D^5, D^7$, then $S_i = \{3,5,7\}$
- The OOB prediction on $x_i$ is  
$$
\hat{f}_{\mathrm{OOB}}\left(x_{i}\right)=\frac{1}{\left|S_{i}\right|} \sum_{b \in S_{i}} \hat{f}_{b}\left(x_{i}\right)
$$   
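
A minimal sketch of the OOB computation follows. The base learner `fit_line` (a least-squares line via `np.polyfit`) and the synthetic linear data are assumptions made for illustration; the bookkeeping with `counts` (which stores $|S_i|$) is the part that mirrors the definition above.

```python
import numpy as np

def fit_line(X, y):
    """Hypothetical base learner: least-squares straight line on 1-D inputs."""
    coef = np.polyfit(X, y, deg=1)
    return lambda x_new: np.polyval(coef, x_new)

def oob_predictions(X, y, fit, B, rng):
    """Return the OOB prediction for every training point and |S_i|."""
    n = len(X)
    pred_sum = np.zeros(n)
    counts = np.zeros(n, dtype=int)        # counts[i] = |S_i|
    for _ in range(B):
        idx = rng.integers(0, n, size=n)   # bootstrap sample D^b
        f_b = fit(X[idx], y[idx])
        oob = np.ones(n, dtype=bool)
        oob[idx] = False                   # True where point i is not in D^b
        if oob.any():
            pred_sum[oob] += f_b(X[oob])
            counts[oob] += 1
    f_oob = np.full(n, np.nan)             # NaN if a point was never out of bag
    seen = counts > 0
    f_oob[seen] = pred_sum[seen] / counts[seen]
    return f_oob, counts

# Synthetic data (assumed for illustration).
rng = np.random.default_rng(0)
n, B = 200, 100
X = rng.uniform(0, 1, size=n)
y = 2.0 * X + 1.0 + rng.normal(0, 0.5, size=n)

f_oob, counts = oob_predictions(X, y, fit_line, B, rng)
oob_mse = np.nanmean((f_oob - y) ** 2)
print("average fraction of predictors for which a point is OOB:", counts.mean() / B)
print("OOB mean squared error:", oob_mse)
```

The printed fraction should be roughly 0.37, matching the out-of-bag share stated above.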

**Remark:**
- The OOB error is a good estimate of the test error.  
- OOB error is similar to cross validation error – both are computed on the training set, and each point is scored only by predictors that were not trained on it

## Bagging Classification Trees
- Input space $X = \mathbf{R}^5$ and output space $Y = \{-1,1\}$.  
- Sample size $N = 30$ (simulated data)  
<div align="center"><img src = "./bagging.jpg" width = '500' height = '100' align = center /></div>  

## Comparing Classification Combination Methods
- Two ways to combine classifications: consensus class or average probabilities.
<div align="center"><img src = "./number of sample.jpg" width = '500' height = '100' align = center /></div>  
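
The two combination rules can be sketched as follows. This is an illustrative sketch only: the function names, the 0.5 threshold, the tie-breaking rule, and the toy probabilities are all assumptions, not taken from the figure.

```python
import numpy as np

def consensus_class(class_preds):
    """Majority vote. class_preds: (B, n) array of labels in {-1, +1}."""
    votes = class_preds.sum(axis=0)
    return np.where(votes >= 0, 1, -1)     # ties go to +1 (arbitrary choice)

def average_probability_class(prob_preds, threshold=0.5):
    """Average the predicted P(y = +1 | x) over predictors, then threshold."""
    p_avg = prob_preds.mean(axis=0)
    return np.where(p_avg >= threshold, 1, -1), p_avg

# Toy example: B = 3 predictors (rows), n = 2 points (columns).
probs = np.array([[0.9, 0.2],
                  [0.4, 0.3],
                  [0.4, 0.6]])
classes = np.where(probs >= 0.5, 1, -1)    # each predictor's hard class prediction

by_vote = consensus_class(classes)
by_prob, p_avg = average_probability_class(probs)
print(by_vote)   # first point: two of three predictors vote -1
print(by_prob)   # first point: average probability 1.7/3 is above 0.5
```

On the first point the two rules disagree: one confident predictor outweighs two mildly negative ones when probabilities are averaged, but not when votes are counted.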

## Terms “Bias” and “Variance” in Casual Usage (Warning! Confusion Zone!)
- Restricting the hypothesis space $F$ “biases” the fit  
  - away from the best possible fit of the training data, 
  - and towards a [usually] simpler model.  
- Full, unpruned decision trees have very little bias.  
- Pruning decision trees introduces a bias  
- Variance describes how much the fit changes across different random training sets  
- If different random training sets give very similar fits, then the algorithm has high stability  
- Decision trees are found to be high variance (i.e. not very stable).

## Conventional Wisdom on When Bagging Helps
- Hope is that bagging reduces variance without making bias worse.  
- General sentiment is that bagging helps most when the base prediction functions 
  - are relatively unbiased 
  - have high variance / low stability 
- Hard to find clear and convincing theoretical results on this   
- But following this intuition leads to improved ML methods, e.g. Random Forests


# Random Forest  
## Recall the Motivating Principle of Bagging
Averaging $\hat{f}_1,..., \hat{f}_B$ reduces variance, if they’re based on i.i.d. samples from $P_{X \times Y}$   
- Bootstrap samples are   
  - independent samples from the training set, but   
  - are not independent samples from $P_{X \times Y}$  
- This dependence limits the amount of variance reduction we can get.  
- Would be nice to reduce the dependence between $\hat{f}_i$’s...  

## Main idea of random forests  
Use **bagged decision trees**, but modify the tree-growing procedure to reduce the correlation between trees.
- Key step in random forests  
  - When constructing each tree node, restrict choice of splitting variable to a randomly chosen subset of features of size $m$.  
  - Typically choose $m \approx\sqrt{p}$, where $p$ is the number of features  
  - Can choose $m$ using cross validation  
<div align="center"><img src = "./m.jpg" width = '500' height = '100' align = center /></div>   
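
The key step can be sketched as a single node split. This is a minimal sketch under stated assumptions: the function name `best_split_random_subset` is hypothetical, and the squared-error split criterion is an assumption (the notes do not specify a criterion; classification trees typically use an impurity measure such as Gini instead).

```python
import numpy as np

def best_split_random_subset(X, y, m, rng):
    """Find the best squared-error split using only m randomly chosen features."""
    n, p = X.shape
    features = rng.choice(p, size=m, replace=False)   # random feature subset for this node
    best = (None, None, np.inf)                       # (feature, threshold, sse)
    for j in features:
        values = np.unique(X[:, j])                   # sorted distinct values
        for t in (values[:-1] + values[1:]) / 2:      # midpoints between neighbours
            left = X[:, j] <= t
            right = ~left
            sse = ((y[left] - y[left].mean()) ** 2).sum() \
                + ((y[right] - y[right].mean()) ** 2).sum()
            if sse < best[2]:
                best = (j, t, sse)
    return best, features

# Synthetic data (assumed for illustration): p = 9 features, so m = sqrt(p) = 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 9))
y = (X[:, 3] > 0).astype(float)
m = int(round(np.sqrt(X.shape[1])))

(feature, threshold, sse), candidates = best_split_random_subset(X, y, m, rng)
print("candidate features:", candidates)
print("chosen split: feature", feature, "at threshold", threshold)
```

A full random forest would call a routine like this at every node of every tree, drawing a fresh feature subset each time, so a strong feature (here feature 3) is sometimes unavailable and different trees are pushed towards different splits.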

# Appendix  
- Let $Z,Z_1,...,Z_n$ be i.i.d. with $EZ = \mu$ and $\text{Var} Z = \sigma^2$.  
$$
\mathbb{E}\left[\frac{1}{n} \sum_{i=1}^{n} Z_{i}\right]=\mu \quad \operatorname{Var}\left[\frac{1}{n} \sum_{i=1}^{n} Z_{i}\right]=\frac{\sigma^{2}}{n}
$$  
- What if Z’s are correlated?  
- Suppose $\forall i \ne j$, $\text{Corr}(Z_i,Z_j) = \rho$ . Then   
$$
\operatorname{Var}\left[\frac{1}{n} \sum_{i=1}^{n} Z_{i}\right]=\rho \sigma^{2}+\frac{1-\rho}{n} \sigma^{2}
$$
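
The correlated-case formula can be verified exactly, without simulation, by writing the variance of the mean as $w^{\top} \Sigma w$ with $w = (1/n, \ldots, 1/n)$ and $\Sigma$ the covariance matrix with $\sigma^2$ on the diagonal and $\rho\sigma^2$ elsewhere. The sketch below assumes illustrative values $\sigma^2 = 4$ and $\rho = 0.3$:

```python
import numpy as np

def var_of_mean(n, sigma2, rho):
    """Exact Var[(1/n) * sum Z_i], computed from the covariance matrix."""
    cov = sigma2 * ((1 - rho) * np.eye(n) + rho * np.ones((n, n)))
    w = np.full(n, 1.0 / n)
    return w @ cov @ w

def formula(n, sigma2, rho):
    """Closed form from the appendix: rho*sigma^2 + (1 - rho)/n * sigma^2."""
    return rho * sigma2 + (1 - rho) / n * sigma2

sigma2, rho = 4.0, 0.3   # illustrative values (assumed)
for n in (1, 10, 100, 1000):
    print(n, var_of_mean(n, sigma2, rho), formula(n, sigma2, rho))
```

As $n$ grows, the second term vanishes and the variance levels off at $\rho\sigma^2$, which is why correlation between bootstrap-trained predictors caps the variance reduction from averaging.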