## General formulation: Classification

- Let us train $M$ models: $b_1(x), \dots, b_M(x)$ (train M models separately)
- composition through majority vote:

$a_M(x) = \arg\max_{y \in \mathbb{Y}} \sum^{M}_{m=1}[b_m(x)=y]$ (each model predicts separately; the most frequent label among all models is chosen)
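A minimal sketch of the majority vote for a single object $x$, assuming the predictions $b_1(x), \dots, b_M(x)$ are already collected in a list. The function name `majority_vote` and the example labels are illustrative only.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of models.

    `predictions` holds one label per model for a single object x,
    i.e. [b_1(x), ..., b_M(x)].
    """
    counts = Counter(predictions)
    # most_common(1) returns [(label, count)] for the most frequent label;
    # among tied labels, the one encountered first is returned
    return counts.most_common(1)[0][0]

votes = ["cat", "dog", "cat", "bird", "cat"]
print(majority_vote(votes))
```

With an even number of models ties are possible, so some tie-breaking rule is needed; here the first label seen wins.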

## General formulation: Regression

- Let us train $M$ models: $b_1(x), \dots, b_M(x)$ (train M models separately)
- composition through averaging:

$a_M(x) = \frac{1}{M} \sum^{M}_{m=1}b_m(x)$

How can we build $M$ different models if we have only one training dataset? (An ensemble of identical models is no better than a single model.)

Two options:

- Train independently on different subsamples
    - Bagging
    - Random subspaces
    
- Train successively, each model correcting the errors of the previous ones
    - Boosting
    
    
__Bagging__

- Bagging (bootstrap aggregating).
- Base algorithms are trained independently.
- Each model is trained on a subsample of the training dataset.
- The subsample is obtained by bootstrap.

__Bootstrap__

- Sample with replacement
- Select $N$ elements from $X$, where $N$ is the size of the training dataset. Every element has the same chance of being drawn each time, so repetitions can occur.

    - e.g. $\{x_1, x_2, x_3, x_4\} \to \{x_1, x_2, x_2, x_4\}$
    
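- On average, the subsample contains about 63.2% of the distinct observations: each observation is missed with probability $(1 - \frac{1}{N})^N \approx \frac{1}{e} \approx 0.368$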
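The share of distinct observations in a bootstrap sample can be checked empirically. The sketch below assumes observations are identified by their indices; `bootstrap_indices`, the seed, and the sample size are arbitrary illustrative choices.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices from range(n) uniformly, with replacement."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n = 10_000
sample = bootstrap_indices(n, rng)

# share of distinct observations that made it into the sample
unique_fraction = len(set(sample)) / n
print(f"unique fraction: {unique_fraction:.3f}")  # expected to be close to 1 - 1/e
```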

__Random subspaces__

- Base algorithms are trained independently
- Each model uses its own random subset of features
- May work poorly if some features are crucial for a reasonable model, because models that miss them will be weak
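A minimal sketch of the random subspace idea, assuming the data is a list of rows with one value per feature. The helpers `random_subspace` and `project` are hypothetical names introduced for illustration.

```python
import random

def random_subspace(n_features, q, rng):
    """Pick q distinct feature (column) indices out of n_features."""
    return sorted(rng.sample(range(n_features), q))

def project(rows, feature_idx):
    """Keep only the selected columns of every row."""
    return [[row[j] for j in feature_idx] for row in rows]

rng = random.Random(0)  # arbitrary seed
X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

# each of the M = 3 models gets its own subset of q = 2 features
subspaces = [random_subspace(n_features=4, q=2, rng=rng) for _ in range(3)]
views = [project(X, idx) for idx in subspaces]
for idx, view in zip(subspaces, views):
    print(idx, view)
```

At prediction time each model must receive the same columns it was trained on, so the index lists in `subspaces` have to be stored alongside the models.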

__Source of randomness__

- Bagging: random subset of observations (random rows)
- Random subspaces: random subset of features (random columns)

## Random Forest

__Bias-variance trade-off__

$\mathbb{E}\left[(y-a(x))^2\right] = \text{noise} + \text{bias}^2 + \text{variance}$

The more complex the model, the lower its bias and the higher its variance, and vice versa.

__Linear models__: High bias and low variance. (The model is simple and not very sensitive to the particular training sample.)

__Decision trees__: Low bias and high variance. (They are sensitive to the training sample and overfit easily.)

### Bias and variance for composition

__Linear models__

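- Bias: $a_M(x)$ has the same bias as each $b_m(x)$

- Variance, assuming all base models have the same variance and the same pairwise covariance ($m \ne k$): $\mathrm{Var}(a_M(x))=\frac{1}{M}\mathrm{Var}(b_m(x)) + \frac{M-1}{M}\mathrm{Cov}(b_m(x), b_k(x))$, which for large $M$ is approximately $\frac{1}{M}\mathrm{Var}(b_m(x)) + \mathrm{Cov}(b_m(x), b_k(x))$

- If the base models are uncorrelated, the variance is $M$ times smaller
- The more correlated the models are, the smaller the effect of the ensemble.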
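The effect of correlation can be illustrated with a small simulation. The sketch below is an assumption-laden toy: each "prediction" is a unit-variance Gaussian built from a shared component and an individual one, so that every pair has correlation `rho`. Under these assumptions the exact variance of the average is $\frac{1}{M} + \frac{M-1}{M}\rho$, which approaches $\frac{1}{M}\mathrm{Var} + \mathrm{Cov}$ for large $M$.

```python
import random
import statistics

def ensemble_variance(M, rho, n_trials, rng):
    """Empirical variance of the average of M unit-variance predictions
    whose pairwise correlation is rho."""
    averages = []
    for _ in range(n_trials):
        shared = rng.gauss(0.0, 1.0)  # component common to all models
        preds = [
            (rho ** 0.5) * shared + ((1.0 - rho) ** 0.5) * rng.gauss(0.0, 1.0)
            for _ in range(M)
        ]
        averages.append(sum(preds) / M)
    return statistics.pvariance(averages)

rng = random.Random(0)  # arbitrary seed
M = 10
for rho in (0.0, 0.5, 0.9):
    empirical = ensemble_variance(M, rho, n_trials=20_000, rng=rng)
    theoretical = 1.0 / M + (M - 1) / M * rho
    print(f"rho={rho}: empirical={empirical:.3f}, theoretical={theoretical:.3f}")
```

For `rho = 0` the variance drops by roughly a factor of $M$, while for strongly correlated predictions averaging helps very little.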

__Decision tree: Greedy Algorithm__

1. Put the whole dataset into the root: $R_1= X$

2. Start the tree construction: SplitNode$(1,R_1)$

SplitNode$(m,R_m)$:

1. If stopping criterion is met, then quit

2. Find the best split (feature and threshold): $j,t=\arg\max_{j,t}Q(R_m,j,t)$

3. Split objects: $R_l = \{ (x,y) \in R_m \mid x_j < t \}, R_r = \{ (x,y) \in R_m \mid x_j \ge t \}$

4. Repeat for the child nodes: SplitNode$(l, R_l)$ and SplitNode$(r, R_r)$
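The greedy procedure above can be sketched as a small regression tree. Several details are assumptions not fixed by the notes: $Q$ is taken to be the reduction in the sum of squared errors, the stopping criterion is "at most `n_min` objects in the node or no split improves $Q$", a leaf predicts the mean target, and nodes are plain dictionaries. All function names are illustrative.

```python
def sse(ys):
    """Sum of squared deviations from the mean (impurity of a node)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(R):
    """Return (j, t, Q) maximizing Q(R, j, t), or None if no split exists."""
    parent = sse([y for _, y in R])
    best = None
    for j in range(len(R[0][0])):
        # candidate thresholds: observed values; skip the minimum,
        # because the left part {x_j < t} would be empty
        for t in sorted({x[j] for x, _ in R})[1:]:
            left = [y for x, y in R if x[j] < t]
            right = [y for x, y in R if x[j] >= t]
            Q = parent - sse(left) - sse(right)
            if best is None or Q > best[2]:
                best = (j, t, Q)
    return best

def split_node(R, n_min=1):
    """Recursively build the tree from a list of (x, y) pairs."""
    split = best_split(R) if len(R) > n_min else None
    if split is None or split[2] <= 0.0:
        # stopping criterion met: make a leaf that predicts the mean target
        return {"leaf": sum(y for _, y in R) / len(R)}
    j, t, _ = split
    R_l = [(x, y) for x, y in R if x[j] < t]
    R_r = [(x, y) for x, y in R if x[j] >= t]
    return {"feature": j, "threshold": t,
            "left": split_node(R_l, n_min), "right": split_node(R_r, n_min)}

def predict(node, x):
    """Walk from the root to a leaf and return the leaf value."""
    while "leaf" not in node:
        go_left = x[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

data = [([1.0], 1.0), ([2.0], 1.0), ([3.0], 5.0), ([4.0], 5.0)]
tree = split_node(data)
print(predict(tree, [1.5]), predict(tree, [3.5]))  # prints 1.0 5.0
```

The exhaustive threshold search is quadratic in the node size, which is fine for a toy example but not how production libraries implement it.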

__Random forest modifies the search $j,t=\arg\max_{j,t}Q(R_m,j,t)$: instead of considering all features, it considers a random subset of $q$ features, drawn anew for each split.__

It turns out that if the number of features $q$ sampled for each split is small, the correlation between the resulting trees is also small, which lowers the variance of the composition.
However, a small $q$ leads to a larger bias.

Recommended values of $q$, where $d$ is the total number of features in the dataset:

- Regression task: $q=\frac{d}{3}$
- Classification task: $q=\sqrt{d}$

These are only rules of thumb; it is better to tune $q$.

### Random Forest algorithm

1. For $m=1,2,3,\dots,M$:

    1. Sample $\tilde{X}$ from the training dataset using bootstrap (bagging).

    2. Train decision tree $b_m(x)$ on $\tilde{X}$
        - Stopping criterion: $n_{min}$ objects in the leaf
        - __Select $q$ features randomly before selecting the optimal split at each node__.
    
__Random Forest__

Regression: $a_M(x) = \frac{1}{M} \sum^{M}_{m=1}b_m(x)$

Classification: $a_M(x) = \arg\max_{y \in \mathbb{Y}} \sum^{M}_{m=1}[b_m(x)=y]$
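Putting the pieces together, a random forest for regression might look like the sketch below. It is a toy under stated assumptions, not a reference implementation: the split quality is the reduction in squared error, a node becomes a leaf when it holds at most `n_min` objects or the sampled features give no improvement, and the synthetic data (target depends on feature 0 only) is invented for the demo. All names are illustrative.

```python
import random

def sse(ys):
    """Sum of squared deviations from the mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def build_tree(R, q, n_min, rng):
    """Greedy regression tree; every node looks at q randomly chosen features."""
    ys = [y for _, y in R]
    leaf = {"leaf": sum(ys) / len(ys)}
    if len(R) <= n_min:                       # stopping criterion
        return leaf
    parent = sse(ys)
    best = None
    for j in rng.sample(range(len(R[0][0])), q):   # random feature subset for this node
        for t in sorted({x[j] for x, _ in R})[1:]:
            left = [y for x, y in R if x[j] < t]
            right = [y for x, y in R if x[j] >= t]
            Q = parent - sse(left) - sse(right)
            if best is None or Q > best[0]:
                best = (Q, j, t)
    if best is None or best[0] <= 0.0:        # sampled features give no useful split
        return leaf
    _, j, t = best
    return {"feature": j, "threshold": t,
            "left": build_tree([(x, y) for x, y in R if x[j] < t], q, n_min, rng),
            "right": build_tree([(x, y) for x, y in R if x[j] >= t], q, n_min, rng)}

def predict_tree(node, x):
    while "leaf" not in node:
        go_left = x[node["feature"]] < node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]

def fit_forest(data, M, q, n_min, rng):
    """Train M trees, each on its own bootstrap sample of the data."""
    forest = []
    for _ in range(M):
        boot = [rng.choice(data) for _ in data]    # sample with replacement
        forest.append(build_tree(boot, q, n_min, rng))
    return forest

def predict_forest(forest, x):
    """Regression: average the predictions of all trees."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)

rng = random.Random(0)  # arbitrary seed
d = 3
data = []
for _ in range(200):
    x = [rng.uniform(0.0, 1.0) for _ in range(d)]
    data.append((x, 10.0 if x[0] > 0.5 else 0.0))  # features 1 and 2 are pure noise

forest = fit_forest(data, M=25, q=max(1, d // 3), n_min=5, rng=rng)
high = predict_forest(forest, [0.9, 0.5, 0.5])
low = predict_forest(forest, [0.1, 0.5, 0.5])
print(high, low)
```

For classification the same structure applies, with a classification impurity in place of `sse`, a majority label in each leaf, and a majority vote in place of the average.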

_It does not overfit as $M$ grows, because each tree is trained independently of the others._

__Out-of-bag__

- Each tree uses approximately 63% of the observations
- The remaining (out-of-bag) observations can be used as a validation set for that tree
- $X_m$ - training dataset for $b_m(x)$
- We can estimate error:

$L_{val} = \frac{1}{N}\sum^{N}_{i=1} l(y_i, \frac{1}{\sum^{M}_{m=1}[x_i \notin X_m]} \sum^{M}_{m=1}[x_i \notin X_m] b_m(x_i))$

_We can evaluate generalization ability without a separate validation set._
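A minimal sketch of the out-of-bag estimate, under several assumptions: the loss $l$ is the squared error, the base model is a 1-nearest-neighbour regressor standing in for a decision tree (to keep the example short), and the one-dimensional data is synthetic. The formula is undefined for an observation that appears in every bootstrap sample, so the sketch skips such observations and averages over the rest; for reasonably large $M$ this practically never happens.

```python
import random

def fit_1nn(sample):
    """Stand-in base model b_m: predict the target of the nearest training x."""
    def b(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return b

def oob_error(data, M, rng):
    """Mean squared out-of-bag error of a bagged ensemble of M models."""
    N = len(data)
    models, in_bag = [], []
    for _ in range(M):
        idx = [rng.randrange(N) for _ in range(N)]   # bootstrap indices
        models.append(fit_1nn([data[i] for i in idx]))
        in_bag.append(set(idx))                      # X_m, stored as a set of indices
    total, counted = 0.0, 0
    for i, (x, y) in enumerate(data):
        # use only the models whose training sample X_m does not contain x_i
        preds = [b(x) for b, bag in zip(models, in_bag) if i not in bag]
        if not preds:        # x_i was drawn into every bootstrap sample: skip it
            continue
        total += (y - sum(preds) / len(preds)) ** 2
        counted += 1
    return total / counted

rng = random.Random(0)  # arbitrary seed
data = []
for _ in range(100):
    x = rng.uniform(0.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))  # noisy linear target

print(f"OOB MSE: {oob_error(data, M=50, rng=rng):.4f}")
```

A 1-nearest-neighbour model has zero error on its own training sample, so the in-bag error would be uninformative here; the out-of-bag error is computed only on observations the contributing models never saw.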