# Introduction to AdaBoost

### Introduction

### Bagging to Boosting

* Bagging

Let's say that we are working with our breast cancer dataset, and are attempting to predict whether an observation is cancerous or not.  Now, we could begin by training a single decision tree to predict if our observation is cancerous.  Or, if we were to use a random forest classifier, we would train multiple decision trees, and then we would have the decision trees vote to make our predictions.

For example, if we train three decision trees, and at least two of three predict a 1, then our random forest predicts a 1.

The idea behind bagging is to use this *wisdom of the crowds* approach.  It tends to work because when one estimator is wrong due to randomness, the other estimators rarely make that same mistake, so the random forest does not emphasize it.  And when an estimator detects an underlying pattern, other estimators tend to detect that pattern too, so the truth wins out.
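
To make the voting argument concrete, here is a small sketch that computes how often a majority vote is right. It assumes, purely for illustration, that every estimator is correct with the same probability `p` and that their mistakes are independent. Real trees in a forest are correlated, so the gain is smaller in practice. The function name `majority_accuracy` is made up for this example.

```python
from itertools import product


def majority_accuracy(p, n_estimators=3):
    """Probability that a strict majority of independent estimators is correct."""
    total = 0.0
    # Enumerate every combination of correct (True) / wrong (False) estimators.
    for outcome in product([True, False], repeat=n_estimators):
        if sum(outcome) > n_estimators / 2:
            prob = 1.0
            for correct in outcome:
                prob *= p if correct else (1 - p)
            total += prob
    return total


# Three estimators that are each right 70% of the time (assumed value)
print(majority_accuracy(0.7))  # about 0.784
```

Under these assumptions, three estimators that are each right 70% of the time vote their way to roughly 78% accuracy.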

* Boosting

Now AdaBoost starts with the same approach.  Just like in bagging, we'll train multiple estimators, and our AdaBoost model will ultimately make a prediction by letting the individual classifiers vote.  So, if we again have three estimators, and each estimator predicts a 1 for the observation being positive and a -1 for the observation being negative, then the hypothesis function of the boosting algorithm looks like the following:

$H(x) = sign\bigg( h_1(x) + h_2(x) + h_3(x) \bigg) $

In other words, when the sum is greater than 0, we predict $1$, and when less than 0 we predict $- 1$.

> So for example, for a given observation:
* if the predictions are $1, -1, -1$, the sum is $1 + -1 + -1 = -1$, so $H(x) = -1$
* or if the predictions are $1, 1, 1$, the sum is $1 + 1 + 1 = 3$, so $H(x) = 1$
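
The unweighted vote above can be sketched in a few lines of Python. The helper name `unweighted_vote` is an assumption made for this example, and a sum of exactly 0 is arbitrarily mapped to -1 (this cannot happen with an odd number of +1/-1 votes).

```python
def unweighted_vote(votes):
    """Return the sign of the summed +1/-1 votes."""
    total = sum(votes)
    # Arbitrary choice for this sketch: a tie (total == 0) counts as -1.
    return 1 if total > 0 else -1


print(unweighted_vote([1, -1, -1]))  # -1
print(unweighted_vote([1, 1, 1]))    # 1
```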

So far, everything we've seen is the same procedure as with our random forest classifier.  Now there are two changes that we'll make with boosting:

1. We'll weight the votes by the accuracy of each estimator.
    > The more accurate the estimator, the stronger its vote.

2. We'll train the estimators one after the other, and weight the observations that the previous classifier misclassified more heavily, so that the next classifier is more likely to classify them correctly.

### Weighing the Observations

Let's cover the step of weighting the observations.  Take a look at the diagram below, which displays how we would train an AdaBoost classifier with three successive estimators.

<img src="./three-steps.png" width="80%">

Let's take this step by step, beginning with the estimator on the left.

* Step 1. Each observation is weighed equally, and we see that the blue dots below the line are misclassified.
* Step 2. Notice that the blue dots are larger (indicating a higher weight), and the red dots are all smaller (as they were properly classified).  The estimator is trained on these weighted observations, and classifies the blue dots on the left correctly.  The red dots in the middle are also classified correctly.
* Step 3.  The blue dots on the right are now the largest, and the third estimator classifies them correctly.

So as you can see, we train our estimators one after the other, each time placing higher weight on the observations previously classified incorrectly.
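
As a rough sketch of what "trained on weighted observations" can mean, the snippet below fits a one-dimensional decision stump by choosing the threshold with the lowest *weighted* error. The toy data, the hand-picked second set of weights, and the helper `fit_stump` are all assumptions made for illustration; they are not taken from the diagram.

```python
def fit_stump(xs, ys, weights):
    """Return (threshold, direction, weighted_error) for the best stump.

    The stump predicts `direction` when x > threshold and `-direction` otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for direction in (1, -1):
            # Weighted error: total weight of the observations this stump gets wrong.
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if (direction if x > threshold else -direction) != y
            )
            if best is None or error < best[2]:
                best = (threshold, direction, error)
    return best


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]  # no single threshold separates these labels

equal_weights = [1 / 6] * 6
first_stump = fit_stump(xs, ys, equal_weights)

# Hand-picked weights that emphasize the two rightmost points (x = 5 and x = 6)
heavier_weights = [0.1, 0.1, 0.1, 0.1, 0.3, 0.3]
second_stump = fit_stump(xs, ys, heavier_weights)

print(first_stump[:2])   # (2, -1)
print(second_stump[:2])  # (4, 1)
```

With equal weights, the chosen stump misclassifies the two rightmost points. Once those points carry more weight, the best stump moves so that they are classified correctly, at the cost of the two leftmost points.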

### Weighing the Classifiers

Let's take another look at the classifiers, which we can represent as $h_1(x)$, $h_2(x)$, and $h_3(x)$, from left to right. 

<img src="./three-steps.png" width="80%">

> From [Adaboost Intuition](https://xavierbourretsicotte.github.io/AdaBoost.html).

So we know that each of the classifiers trains a different tree, with each one focusing on previously misclassified observations.  After the classifiers are trained, they vote.  But they do not each receive an equal vote.  Rather, the classifier with the lowest error rate receives the strongest vote.  The weight that we assign to each classifier is $\alpha$.

So we should update the hypothesis function for our Adaboost model from:

$H(x) = sign\bigg( h_1(x) + h_2(x) + h_3(x) \bigg) $

to:

$H(x) = sign\bigg(\alpha_1 h_1(x) +\alpha_2 h_2(x) +\alpha_3 h_3(x) \bigg) $

This way the more accurate classifiers get a stronger vote than the less accurate classifiers. 
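
Here is a minimal sketch of the weighted vote. The $\alpha$ values are invented for illustration, and `weighted_vote` is a hypothetical helper name.

```python
def weighted_vote(alphas, votes):
    """Return the sign of the alpha-weighted sum of +1/-1 votes."""
    total = sum(alpha * vote for alpha, vote in zip(alphas, votes))
    # Arbitrary choice for this sketch: a total of exactly 0 counts as -1.
    return 1 if total > 0 else -1


alphas = [1.2, 0.4, 0.5]  # made-up classifier weights
votes = [1, -1, -1]       # h_1(x), h_2(x), h_3(x)

print(weighted_vote(alphas, votes))           # 1
print(weighted_vote([1.0, 1.0, 1.0], votes))  # -1
```

With equal weights the two negative votes win, but because the first classifier carries a larger $\alpha$ here, its single positive vote outweighs them.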

So we begin to think of our AdaBoost classifier as going beyond "the wisdom of the crowd" to the "wisdom of a panel of experts", with each "expert" trained to have a specialty that ideally complements the others.  Finally, there is a ranking to our panel, where the accuracy of each classifier determines the strength of its vote.

### Final tweaks

So we've already seen the two major components of adaboost:

1. Each time we train our estimators, we place a higher weight on the observations previously classified incorrectly.
2. We weigh the influence of each estimator by its accuracy.

Now, let's go through a couple other features of the algorithm.  

1. Calculating Alpha

The first is that, when we measure the error of a classifier, it's a *weighted error*.  So a classifier's value of $\alpha$ is penalized more for misclassifying observations with higher weights.

2. Assigning Weights

The second tweak is that we weight our observations not just by whether the previous estimator classified the observation correctly or not, but also based on that classifier's value of $\alpha$.  So if the classifier has a high $\alpha$ but misclassifies the observation, that observation receives an even higher weight (to counteract the strong misclassification).  If $\alpha$ is high and the observation is classified correctly, then the observation gets even less weight than normal, as it does not need to be corrected by a future classifier.
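
The lesson does not spell out the formulas, so the sketch below assumes the ones commonly used for discrete AdaBoost with +1/-1 labels: $\alpha = \frac{1}{2}\ln\frac{1 - \epsilon}{\epsilon}$ for weighted error $\epsilon$, and each weight multiplied by $e^{-\alpha \, y \, h(x)}$ before renormalizing. The helper names and the toy labels and predictions are made up for illustration.

```python
import math


def weighted_error(weights, ys, preds):
    """Share of the total weight that sits on misclassified observations."""
    wrong = sum(w for w, y, p in zip(weights, ys, preds) if y != p)
    return wrong / sum(weights)


def alpha_from_error(error):
    """Classifier weight (assumed formula): larger when the weighted error is smaller."""
    return 0.5 * math.log((1 - error) / error)


def update_weights(weights, ys, preds, alpha):
    """Scale weights up when y != pred and down when y == pred, then normalize."""
    raw = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
    total = sum(raw)
    return [r / total for r in raw]


ys = [1, 1, -1, -1, 1, 1]
preds = [1, 1, -1, -1, -1, -1]  # toy classifier: last two observations are wrong
weights = [1 / 6] * 6

error = weighted_error(weights, ys, preds)
alpha = alpha_from_error(error)
new_weights = update_weights(weights, ys, preds, alpha)
print(round(error, 3), round(alpha, 3))
print([round(w, 3) for w in new_weights])
```

In this toy case the two misclassified observations end up with a weight of about 0.25 each, and the four correctly classified ones with about 0.125 each. A larger $\alpha$ would push those two groups further apart.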

With that, we're done.

### Summary

In this lesson, we discussed the main components of the Adaboost classifier.  

1. We make predictions through a set of classifiers taking a weighted vote for each observation.

> $H(x) = sign\bigg(\alpha_1 h_1(x) +\alpha_2 h_2(x) +\alpha_3 h_3(x) \bigg) $

2. We train each classifier by placing more weight on observations that were previously misclassified.

```python
import pandas as pd

pd.DataFrame({'obs +': ['w --', 'w -'],
              'obs -': ['w ++ ', 'w +']},
             index=['model +', 'model -'])
```

| | obs + | obs - |
|---|---|---|
| model + | w -- | w ++ |
| model - | w - | w + |
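
Putting the two summary points together, here is a minimal end-to-end sketch of the training loop and the weighted vote. It uses decision stumps on one-dimensional toy data and assumes the commonly used discrete AdaBoost formulas for $\alpha$ and the weight update, which this lesson does not state. The data and the helper names (`fit_stump`, `fit_adaboost`, `predict`) are hypothetical, and this is not how any particular library implements the algorithm.

```python
import math


def stump_predict(threshold, direction, x):
    """Predict `direction` when x > threshold and `-direction` otherwise."""
    return direction if x > threshold else -direction


def fit_stump(xs, ys, weights):
    """Return (threshold, direction, weighted_error) with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for direction in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(threshold, direction, x) != y)
            if best is None or error < best[2]:
                best = (threshold, direction, error)
    return best


def fit_adaboost(xs, ys, n_rounds=3):
    """Train stumps one after another, reweighting observations each round."""
    weights = [1 / len(xs)] * len(xs)
    ensemble = []
    for _ in range(n_rounds):
        threshold, direction, error = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # assumed formula
        ensemble.append((alpha, threshold, direction))
        # Misclassified observations gain weight, correct ones lose weight.
        raw = [w * math.exp(-alpha * y * stump_predict(threshold, direction, x))
               for x, y, w in zip(xs, ys, weights)]
        total = sum(raw)
        weights = [r / total for r in raw]
    return ensemble


def predict(ensemble, x):
    """Weighted vote: sign of the alpha-weighted sum of stump predictions."""
    score = sum(alpha * stump_predict(threshold, direction, x)
                for alpha, threshold, direction in ensemble)
    return 1 if score > 0 else -1


xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
ensemble = fit_adaboost(xs, ys, n_rounds=3)
print([predict(ensemble, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

On this toy data no single stump classifies all six points correctly, but the weighted vote of three stumps does.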


<img src="./three-steps.png" width="40%">
<img src="./aggregating-trees.png" width="40%">

1. Weak learners.  
$f(x)$
* But potentially have these work le

$F(x) = sign(f_1(x) + f_2(x) + f_3(x))$

> So one of these can be wrong, so long as the other two are correct.

So if this formula is correct, 

Wisdom of a weighted crowd of experts, each of which is good at part of a space.

### Process

<img src="./loop-process.png" width="60%">

Choose weight such that:

| | obs + | obs - |
|---|---|---|
| model + | w -- | w ++ |
| model - | w - | w + |
