# 1.9 Boosting #

## ***Vocabulary & Code*** ##

# Lecture Notes #

## ***1.9.0 Introduction*** ##

##### Recall two key parameters from PAC learning:
- $\delta$: probability of failure
- $\epsilon$: accuracy parameter

Requirement:
For any choice of $\epsilon$ and $\delta$, $A$ should output, with probability $\ge 1-\delta$, an $\epsilon$-accurate classifier.

$A$ is allowed to run in time $poly(\frac{1}{\epsilon}, \frac{1}{\delta})$, and it is allowed to take $poly(\frac{1}{\epsilon}, \frac{1}{\delta})$ samples.

---

Question: What if we have an algorithm, $A$, that outputs an $\epsilon$-accurate classifier with probability only 5%? How can we use $A$ to obtain a standard PAC learner?

We want to increase that 5% probability of success to $(1-\delta)$. 

The solution is to run $A$ a large number of times, say $t$, independently. Then $Pr[\text{all } t \text{ runs of } A \text{ fail to output an } \epsilon\text{-accurate classifier}] \le (0.95)^t$.

We can make $(0.95)^t$ at most $\delta$ by choosing $t = O(\log\frac{1}{\delta})$. We can then "test" each classifier generated during these $t$ trials to see whether any of them is a good classifier.

**Summary:** \
We have an algorithm that only succeeds with probability 5%. If we run it $t$ times, the probability that it fails to output an $\epsilon$-accurate classifier in all $t$ runs is at most $(0.95)^t$. So we can take $t$ to be $O(\log\frac{1}{\delta})$, and then the probability that it fails to output an accurate hypothesis over these $t$ trials is at most $\delta$. So with probability at least $1-\delta$, at least one of these classifiers is $\epsilon$-accurate.
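The repetition argument can be checked numerically. The sketch below assumes independent runs with a 5% success probability; `runs_needed` and `pick_best` are hypothetical helper names, and `estimate_error` stands in for the "test" step (for example, error on a held-out sample), which the notes do not specify.

```python
import math

def runs_needed(delta, success_prob=0.05):
    """Smallest t such that (1 - success_prob)**t <= delta."""
    return math.ceil(math.log(delta) / math.log(1.0 - success_prob))

def pick_best(run_learner, estimate_error, t):
    """Run the learner t times and keep the candidate with the lowest estimated error.

    run_learner and estimate_error are caller-supplied placeholders (assumptions).
    """
    candidates = [run_learner() for _ in range(t)]
    return min(candidates, key=estimate_error)

for delta in (0.1, 0.01, 0.001):
    t = runs_needed(delta)
    print(f"delta={delta}: t={t}, failure bound (0.95)^t={0.95 ** t:.5f}")
```

Each tenfold decrease in $\delta$ adds roughly the same number of runs, which is the $O(\log\frac{1}{\delta})$ growth.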

---

Trickier Question: What if $\epsilon$ is fixed to, say, 0.49?

Imagine $A$, with probability $\ge 1-\delta$, outputs a classifier with $\epsilon = 0.49$, i.e. one that is only slightly better than random guessing.

Natural Question: How do we amplify/improve the accuracy parameter? 

The solutions are called boosting algorithms.

---

Adaboost overview:


<br>
<center>
    <img src="images/1.9.1.png" alt="Professor Notes" />
</center>
<br>

<br>
<center>
    <img src="images/1.9.2.png" alt="Professor Notes" />
</center>
<br>

## ***1.9.1 Adaboost*** ##

Simplified Adaboost:

Assume we have a training set of size $m$.\
Initially, the first distribution, $D_0$, is the uniform distribution, which corresponds to $w_i = 1 \;\forall i$.\
The distribution is obtained by dividing each weight by $W$, the sum of the weights.

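$E$ = error rate of the weak hypothesis\
$A$ = accuracy = $1-E$ (in these three definitions only; elsewhere $A$ denotes the learning algorithm)\
$\beta$ = $\frac{E}{A}$

Concretely, for a weak learner with advantage $\gamma$: $E = \frac{1}{2}-\gamma$, $\beta = \frac{\frac{1}{2}-\gamma}{\frac{1}{2}+\gamma}$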
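As a quick numeric check of these definitions, take $\gamma = 0.1$ (a value chosen here purely for illustration):

$$E = \frac{1}{2} - 0.1 = 0.4, \qquad 1 - E = 0.6, \qquad \beta = \frac{0.4}{0.6} = \frac{2}{3}$$

Since $\beta < 1$ whenever $\gamma > 0$, multiplying a weight by $\beta$ shrinks it, and a larger $\gamma$ gives a smaller $\beta$.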

How to update the weights: at iteration $t$, run $A$ on the current distribution to obtain $h_t$\
For each $x_i$ such that $h_t(x_i)$ is correct:
$$w_i^{new} = \beta w_i^{old}$$
For each $x_i$ such that $h_t(x_i)$ is incorrect:
$$w_i^{new} = w_i^{old}$$

Repeat for $T$ steps and output $maj(h_1, ..., h_T)$.
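The loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not the lecture's reference implementation: labels are in $\{-1, +1\}$, inputs are one-dimensional, $\gamma$ is supplied by the caller, and `stump_learner` is a hypothetical threshold-based weak learner standing in for $A$. The names `stump_learner` and `simplified_adaboost` are invented for this example.

```python
import numpy as np

def stump_learner(x, y, dist):
    """Hypothetical weak learner: the threshold stump with lowest error under dist."""
    xs = np.sort(x)
    thresholds = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    best_err, best_theta, best_sign = None, None, None
    for theta in thresholds:
        for sign in (1, -1):
            pred = np.where(x > theta, sign, -sign)
            err = dist[pred != y].sum()          # weighted error under dist
            if best_err is None or err < best_err:
                best_err, best_theta, best_sign = err, theta, sign
    return lambda z: np.where(np.asarray(z) > best_theta, best_sign, -best_sign)

def simplified_adaboost(x, y, weak_learner, gamma, T):
    """Simplified AdaBoost: fixed beta, unweighted majority vote."""
    w = np.ones(len(x))                          # w_i = 1 for all i
    beta = (0.5 - gamma) / (0.5 + gamma)         # beta = E / (1 - E), E = 1/2 - gamma
    hypotheses = []
    for _ in range(T):
        dist = w / w.sum()                       # divide by W, the sum of the weights
        h = weak_learner(x, y, dist)
        correct = h(x) == y
        w = np.where(correct, beta * w, w)       # shrink weights of correct points only
        hypotheses.append(h)

    def h_final(z):
        votes = sum(h(z) for h in hypotheses)    # sum of +/-1 predictions
        return np.where(votes >= 0, 1, -1)       # majority vote (ties go to +1)

    return h_final, w

# Toy data (an assumption for illustration): no single stump classifies it perfectly.
x = np.arange(10.0)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, 1])
h_final, weights = simplified_adaboost(x, y, stump_learner, gamma=0.2, T=3)
print("training error:", np.mean(h_final(x) != y))
```

Using an odd $T$ avoids ties in the majority vote. Note that the sketch does not verify that the weak learner actually achieves error at most $\frac{1}{2}-\gamma$ in each round, which the analysis assumes.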

---

Claim: After $T$ iterations, the error of $h_{final} = maj(h_1, ..., h_T)$ is at most $e^{-2T\gamma^2}$. Hence, if we choose $T \approx \frac{1}{\gamma^2}\log(\frac{1}{\epsilon})$, the error of $h_{final}$ is at most $\epsilon$.
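To see how the bound translates into a number of rounds, solve $e^{-2T\gamma^2} \le \epsilon$ for $T$, which gives $T \ge \frac{1}{2\gamma^2}\ln\frac{1}{\epsilon}$. The short sketch below evaluates this; `rounds_needed` is a hypothetical helper name, and the values of $\gamma$ and $\epsilon$ are illustrative only.

```python
import math

def rounds_needed(gamma, eps):
    """Smallest integer T with exp(-2 * T * gamma**2) <= eps."""
    return math.ceil(math.log(1.0 / eps) / (2.0 * gamma ** 2))

gamma, eps = 0.1, 0.01
T = rounds_needed(gamma, eps)
print(f"T={T}, bound={math.exp(-2 * T * gamma ** 2):.5f}")
```

The required $T$ grows like $\frac{1}{\gamma^2}$, so halving the weak learner's advantage roughly quadruples the number of rounds.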

Proving it:

<br>
<center>
    <img src="images/1.9.3.png" alt="Professor Notes" />
</center>
<br>

<br>
<center>
    <img src="images/1.9.4.png" alt="Professor Notes" />
</center>
<br>

<br>
<center>
    <img src="images/1.9.5.png" alt="Professor Notes" />
</center>
<br>

## ***1.9.2 Adaboost Modification*** ##

# Personal Notes #

**[Understanding Machine Learning: From Theory to Algorithms, Chapter 10](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html)** 