# Ensemble Learning

An ensemble (committee) of different <i>base classifiers</i> votes on the outcome.

A Random Forest, for instance, is a collection of Random Decision Trees that work as an ensemble when predicting.

## Bootstrap AGGregatING (BAGGING)
Train each base-classifier on a bootstrap sample of the training data, i.e. a random subset drawn with replacement (this is how you train Random Forests).

Complex base-classifiers may draw complex decision-boundaries, and be prone to over-fitting. This training method leads to reduced variance, and therefore a <u>reduced risk of overfitting</u>.
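
As a minimal sketch of bagging (not a reference implementation), the snippet below draws one bootstrap sample per base-classifier and combines the predictions by majority vote. The names `bootstrap_sample`, `bagging_fit`, `bagging_predict` and the toy base learner `train_midpoint` are assumptions made for illustration; the toy learner also assumes 1-D data where the positive class lies at larger values.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items uniformly at random, with replacement."""
    return [rng.choice(data) for _ in data]

def train_midpoint(sample):
    """Toy base learner (hypothetical): threshold halfway between the class means."""
    pos = [x for x, y in sample if y == +1]
    neg = [x for x, y in sample if y == -1]
    if not pos or not neg:
        # Degenerate bootstrap sample: only one class was drawn.
        label = +1 if pos else -1
        return lambda x: label
    tau = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: +1 if x > tau else -1

def bagging_fit(data, train_fn, n_models, seed=0):
    """Train n_models base classifiers, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [train_fn(bootstrap_sample(data, rng)) for _ in range(n_models)]

def bagging_predict(models, x):
    """Majority vote over the labels predicted by the base classifiers."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Usage: six labelled 1-D points, 25 base classifiers.
data = [(0.0, -1), (1.0, -1), (2.0, -1), (3.0, +1), (4.0, +1), (5.0, +1)]
models = bagging_fit(data, train_midpoint, n_models=25)
print(bagging_predict(models, 4.5), bagging_predict(models, 0.5))
```

Each model sees a slightly different data set, so their individual thresholds differ; the vote averages those differences out, which is where the variance reduction comes from.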

## Boosting
Train each base-classifier using all training data, but with weights indicating how important each training sample is. Assign a different set of weights to each base-classifier. The assignment of weights might be serial, with the previous classifier determining the weights for the next, or parallel, with each set of weights independent of the others.

## Simple/Weak Classifiers
The components used in Ensemble Learning.

Basically a "rule-of-thumb" classifier. For example: select only one feature to base the classification-threshold on (a decision-tree branch).

### Decision Stump
A weak classifier drawn as a single branch (one test, two outcomes) is called a decision stump:

     A or B
      /  \
     /    \
    -1    +1

## Classification and Regression Trees (CART)
Chaining decision stumps to each other.

Leads to a piecewise-constant classification function.

> #### Regression Tree
> Same thing, but all leaves end in a real-valued output rather than a label.

Of course, these trees can become very complicated, which makes them prone to over-fitting.
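
To make the chaining concrete, here is a small sketch of a classification tree stored as nested stumps. The dictionary layout (`feature`, `tau`, `left`, `right`) and the function name `tree_predict` are assumptions chosen for this example, not a standard format.

```python
def tree_predict(node, x):
    """Walk from the root to a leaf.

    An inner node is a dict holding one stump (feature index and threshold tau);
    a leaf is a bare label.
    """
    while isinstance(node, dict):
        branch = "right" if x[node["feature"]] > node["tau"] else "left"
        node = node[branch]
    return node

# Hypothetical two-level tree on 2-D inputs:
# split on feature 0 at 2.0, then (on the right side) on feature 1 at 5.0.
tree = {
    "feature": 0, "tau": 2.0,
    "left": -1,
    "right": {"feature": 1, "tau": 5.0, "left": +1, "right": -1},
}

for point in ([1.0, 9.0], [3.0, 4.0], [3.0, 6.0]):
    print(point, tree_predict(tree, point))
```

Every input that lands in the same leaf gets the same output, which is why the resulting function is constant on each region. Replacing the leaf labels with real numbers would turn this into a regression tree.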

### Random Forest
Bagging + Decision Trees

For each tree, use a random subset of training samples. For each branching, use a random subset of features.
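
The two sources of randomness can be sketched as below. This only illustrates the sampling steps, not the tree growing itself. The function names are hypothetical, and the default of roughly $\sqrt{\text{number of features}}$ candidates per split is an assumed heuristic, not something these notes prescribe.

```python
import math
import random

def tree_training_rows(n_samples, rng):
    """Row indices for one tree: a bootstrap sample, drawn with replacement."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]

def split_candidate_features(n_features, rng, k=None):
    """Feature indices considered at one branching: k distinct random features."""
    if k is None:
        k = max(1, round(math.sqrt(n_features)))  # assumed default heuristic
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)
rows = tree_training_rows(10, rng)          # drawn once per tree
feats = split_candidate_features(16, rng)   # drawn again at every branching
print(rows, feats)
```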

## General Boosting Algorithm
Train weak classifiers sequentially!

1. Set each example weight $d_i=1/N$ => $\vec{d}_1$.
2. Train weak classifier using these weights.
3. Increase the weights of wrongly classified training samples and decrease the weights of correctly classified ones, so that new classifiers focus on what the previous one got wrong => $\vec{d}_2$
4. Train weak classifier using these weights.
5. Repeat X times.
6. Weight all classifier <u>outputs</u> according to their general performance.

## Training a Decision Stump
Find best split threshold $\tau$!

Optimize the cost function, which is the EMPIRICAL RISK FUNCTION (the number of misclassified samples), with each count multiplied by the weight of that sample.

> Observe that the minimized error will always be <= 0.5 (with weights summing to 1): if a classifier had weighted error $\epsilon > 0.5$, flipping its output would give $1 - \epsilon < 0.5$.

Since a weak classifier is so easy to optimize, we don't need gradient descent; we can just brute-force it, testing one candidate threshold per training example.
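
A brute-force search of this kind might look as follows for a single feature. The function names and the returned `(tau, polarity, error)` tuple are assumptions of this sketch. It also assumes that the weights sum to 1, so that flipping the stump turns an error of `err` into `1 - err`.

```python
def train_stump(xs, ys, weights):
    """Brute-force the threshold tau and polarity minimizing the weighted error.

    xs: values of one feature, ys: labels in {-1, +1},
    weights: non-negative sample weights summing to 1.
    The stump predicts `polarity` if x > tau, otherwise `-polarity`.
    """
    best = (None, +1, float("inf"))
    # One candidate threshold per distinct example, plus one below all of them.
    candidates = [min(xs) - 1.0] + sorted(set(xs))
    for tau in candidates:
        # Weighted error of the rule "predict +1 if x > tau, else -1".
        err = sum(w for x, y, w in zip(xs, ys, weights)
                  if (1 if x > tau else -1) != y)
        # The flipped rule has error 1 - err, so the best error is at most 0.5.
        for polarity, e in ((+1, err), (-1, 1.0 - err)):
            if e < best[2]:
                best = (tau, polarity, e)
    return best

def stump_predict(stump, x):
    tau, polarity, _ = stump
    return polarity if x > tau else -polarity

stump = train_stump([1, 2, 3, 4], [-1, -1, +1, +1], [0.25] * 4)
print(stump)
```

With several features, the same search would simply be repeated per feature and the best `(feature, tau, polarity)` kept.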

## Discrete AdaBoost
1. Find a weak classifier $h_t$ that minimizes the weighted classification error $\epsilon_t$.
2. Update the weight of each sample $x_i$ with label $y_i$ by multiplying it with $e^{-\alpha_t y_i h_t(x_i)}$, where
    
    $\alpha_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$
3. Renormalize so that all weights sum up to 1.
4. Repeat 1-to-3 for each classifier.
5. Final classifier is the sign of the sum of all classifiers multiplied by their respective $\alpha_t$.
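
The five steps above can be sketched with decision stumps as the weak classifiers. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the function names are made up for this example, labels are assumed to be in {-1, +1}, and the error is clamped away from 0 and 1 (an addition of this sketch) so that the logarithm stays finite when a stump is perfect.

```python
import math

def fit_stump(X, y, d):
    """Brute-force the stump with the lowest weighted error over all features.
    Returns (feature, tau, polarity, error)."""
    best = (0, 0.0, +1, float("inf"))
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        for tau in [min(col) - 1.0] + sorted(set(col)):
            err = sum(w for v, label, w in zip(col, y, d)
                      if (1 if v > tau else -1) != label)
            for polarity, e in ((+1, err), (-1, 1.0 - err)):
                if e < best[3]:
                    best = (j, tau, polarity, e)
    return best

def stump_output(stump, x):
    j, tau, polarity, _ = stump
    return polarity if x[j] > tau else -polarity

def adaboost_fit(X, y, rounds):
    n = len(X)
    d = [1.0 / n] * n                    # start from uniform weights
    ensemble = []                        # list of (alpha, stump)
    for _ in range(rounds):
        stump = fit_stump(X, y, d)       # step 1
        eps = min(max(stump[3], 1e-10), 1 - 1e-10)  # clamp: assumption of this sketch
        alpha = 0.5 * math.log((1 - eps) / eps)
        # Step 2: misclassified samples (y * h = -1) grow, correct ones shrink.
        d = [w * math.exp(-alpha * label * stump_output(stump, x))
             for w, label, x in zip(d, y, X)]
        total = sum(d)
        d = [w / total for w in d]       # step 3
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, x):
    """Step 5: sign of the alpha-weighted sum of the weak classifiers."""
    score = sum(alpha * stump_output(stump, x) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Usage: labels that no single stump can separate (-, -, -, +, +, +, +, -).
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [-1, -1, -1, +1, +1, +1, +1, -1]
ensemble = adaboost_fit(X, y, rounds=3)
print([adaboost_predict(ensemble, x) for x in X])
```

In this toy set the first stump misclassifies only the last point ($\epsilon_1 = 1/8$, so $\alpha_1 = \frac{1}{2}\ln 7$). After renormalization that point carries half of the total weight, which forces the following stumps to deal with it.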

### Outlier Problem
Outliers will gain weight exponentially each iteration until the classifier is <u>really</u> bad.

How to deal with it:
- Monitor weights
- Weight trimming
    - Maximum weight threshold
    - Disregard samples with large weight
- Use alternative weight update schemes with less aggressive increase
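
One possible reading of the two weight-trimming options is sketched below. The function names and the threshold `max_weight` are hypothetical. Note that after renormalization a capped weight can end up above `max_weight` again, so the cap limits relative growth rather than enforcing a hard bound.

```python
def cap_weights(weights, max_weight):
    """Maximum weight threshold: clip every weight, then renormalize to sum 1."""
    clipped = [min(w, max_weight) for w in weights]
    total = sum(clipped)
    return [w / total for w in clipped]

def drop_heavy_samples(samples, weights, max_weight):
    """Disregard samples whose weight exceeds max_weight; renormalize the rest."""
    kept = [(s, w) for s, w in zip(samples, weights) if w <= max_weight]
    if not kept:
        raise ValueError("max_weight is too low: every sample would be dropped")
    total = sum(w for _, w in kept)
    return [s for s, _ in kept], [w / total for _, w in kept]

weights = [0.7, 0.1, 0.1, 0.1]   # the first sample behaves like an outlier
print(cap_weights(weights, 0.3))
print(drop_heavy_samples(["a", "b", "c", "d"], weights, 0.3))
```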

# Summary Ensemble Learning
- Nonlinear
- Easy to use, just a few parameters
- Inherent feature selection
- Slow to train, but fast to classify
- Look out for outliers: may cause issues!

# Example: Object Detection
Detecting faces
- Sweep a sub-window over the image; for each position, ask: is there a face here, yes or no?
- Features: <b>Haar Features</b>: rectangular shapes of different sizes, divided into black and white areas, which yield one number when applied to an image section: $\sum_y\sum_xI_{x,y}H_{x,y}$, where $I$ is the image and $H$ is the filter (the filter is +1 on one colour and -1 on the other, so the result is the difference between the two area sums)
- In this example, throw hundreds of thousands of these at random points in the sub-window, generating one number per feature.
- Train the detector with AdaBoost.
    - Positive set: small windows with faces
    - Negative set: no faces
    - Apply filters to each image to get features from that image.
    - Train as above.
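
As a rough sketch of the feature computation $\sum_y\sum_xI_{x,y}H_{x,y}$, the snippet below applies one filter at every position of a tiny grayscale image. The helper names are hypothetical, and the two-rectangle filter assumes the usual Haar convention of +1 on one half and -1 on the other. Practical detectors typically speed this up with an integral image, which is not shown here.

```python
def filter_response(image, filt, top, left):
    """Sum of image[y][x] * filt[y][x], with the filter's corner at (top, left)."""
    return sum(
        image[top + fy][left + fx] * filt[fy][fx]
        for fy in range(len(filt))
        for fx in range(len(filt[0]))
    )

def two_rect_filter(height, width):
    """Hypothetical helper: left half +1, right half -1 (width must be even)."""
    half = width // 2
    return [[+1] * half + [-1] * half for _ in range(height)]

def sweep(image, filt):
    """Response at every position where the filter fits inside the image."""
    h, w = len(filt), len(filt[0])
    return [[filter_response(image, filt, top, left)
             for left in range(len(image[0]) - w + 1)]
            for top in range(len(image) - h + 1)]

# A bright region on the left, a dark one on the right.
image = [[9, 9, 1, 1],
         [9, 9, 1, 1]]
print(sweep(image, two_rect_filter(2, 2)))  # → [[0, 16, 0]]
```

The response is zero on flat regions and large where the filter straddles the bright-to-dark edge. Each such number is one feature, and a decision stump on it is one candidate weak classifier for AdaBoost.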