# Cross-Validation

## ***Vocabulary***

# Lecture Notes #

## ***Introduction***

We will be looking at how to estimate the true error of a classifier.

Let's assume classification, so the hypothesis h is going to output boolean values (e.g., {0,1} or {-1,1}).

- Hold-out approach (validation set) for testing/approximating the true error of a classifier
    - Leave some part of the training set out during training time
    - Then when you want to evaluate the true error of the classifier, you test the classifier on this held out set.
    - The error on the held out set is the approximated true error for unseen data
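
The hold-out steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the target concept `c(x) = 1 if x >= 0.5`, the toy learner `learn_threshold`, and the 80/20 split are all made up for the example and are not part of the lecture.

```python
import random

def holdout_split(examples, holdout_fraction=0.2, seed=0):
    """Shuffle the examples and split them into (training set, hold-out set)."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]

def learn_threshold(train):
    """Toy learner (hypothetical): predict 1 at or above the smallest positive example."""
    positives = [x for x, y in train if y == 1]
    threshold = min(positives) if positives else float("inf")
    return lambda x: int(x >= threshold)

def error_rate(h, examples):
    """Fraction of (x, label) pairs on which h disagrees with the label."""
    return sum(1 for x, y in examples if h(x) != y) / len(examples)

# Toy data: the target concept c(x) = 1 if x >= 0.5 else 0 is an assumption.
rng = random.Random(1)
examples = [(x, int(x >= 0.5)) for x in (rng.random() for _ in range(1000))]

train, holdout = holdout_split(examples)
h = learn_threshold(train)               # the hold-out set is never seen during training
holdout_error = error_rate(h, holdout)   # approximation of the true error of h
print(f"hold-out error estimate: {holdout_error:.3f}")
```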
 
<br>
<center>
    <img src="images/1.5.1.png" alt="Professor Notes" />
</center>
<br>

**Markov's Inequality**

- Let x be a random variable that takes on only nonnegative values, and let k > 0
- $Pr[x\ge k*\mathbf{E}[x]]\le \frac{1}{k}$
    - *The probability that x is more than a factor of k times the expected value of x, is at most 1 over k*
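
As a quick sanity check of Markov's inequality, we can compare the empirical tail of a nonnegative sample against 1/k. The exponential distribution and the sample size below are arbitrary choices for illustration, and the sample mean stands in for $\mathbf{E}[x]$.

```python
import random

def markov_check(samples, k):
    """Return (empirical Pr[x >= k * mean], the Markov bound 1/k)."""
    mean = sum(samples) / len(samples)
    tail = sum(1 for x in samples if x >= k * mean) / len(samples)
    return tail, 1.0 / k

rng = random.Random(0)
samples = [rng.expovariate(1.0) for _ in range(10_000)]  # nonnegative draws

for k in (2, 4, 8):
    tail, bound = markov_check(samples, k)
    print(f"k={k}: empirical tail {tail:.4f} <= bound {bound:.4f}")
```

The bound is usually loose: it uses nothing about x except its mean and the fact that it is nonnegative.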
 
**Chebyshev's Inequality**

- Review:
    - Let us say $\mathbf{E}[x]$ = $\mu$ 
    - Review the variance of a random variable: $ Var[x] = \mathbf{E}[(x-\mathbf{E}[x])^2] $
        - That is, the average squared deviation of a draw of x from its expectation
    - Recall that $\sqrt{Var[x]} = \text{standard deviation}(x) = \sigma$
 
<br>

If a random variable x has mean $\mu$ and variance $\sigma^2$, then the probability that x deviates from its expectation by more than t standard deviations is less than or equal to one over t<sup>2</sup>.

$$ Pr[|x-\mu| \gt t*\sigma] \le \frac{1}{t^2} $$
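
The same kind of empirical check works for Chebyshev's inequality. The Gaussian sample below is an arbitrary choice for illustration; the sample mean and the population standard deviation of the sample stand in for $\mu$ and $\sigma$.

```python
import random
import statistics

def chebyshev_check(samples, t):
    """Return (empirical Pr[|x - mu| > t * sigma], the Chebyshev bound 1/t^2)."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    tail = sum(1 for x in samples if abs(x - mu) > t * sigma) / len(samples)
    return tail, 1.0 / t**2

rng = random.Random(0)
samples = [rng.gauss(5.0, 2.0) for _ in range(10_000)]

for t in (1.5, 2, 3):
    tail, bound = chebyshev_check(samples, t)
    print(f"t={t}: empirical tail {tail:.4f} <= bound {bound:.4f}")
```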

<br>
<center>
    <img src="images/1.5.2.png" alt="Professor Notes" />
</center>
<br>

## ***Chernoff Bound***

Let's say we have independent random variables $x_1, x_2, ..., x_n$, $x_i \in \{0,1\}$ and that $\mathbf{E}[x_i] = p$ (the Chernoff bound also holds for $\mathbf{E}[x_i] = p_i$, but we are fixing to p for simplicity). 

We also have 

$$S = \sum_{i=1}^{n} x_i $$

Also let $\mathbf{E}[S]$ = $\mu$ = $p*n$, in other words $\mathbf{E}[x_1+...+x_n] = p*n$

The Chernoff Bound says:

$$ Pr[S>\mu +\delta n] \le e^{-2n\delta^2} $$

$$ Pr[S<\mu -\delta n] \le e^{-2n\delta^2} $$

$$ \Rightarrow Pr[|S-\mu| > \delta n] \le 2e^{-2n\delta^2} $$

Basically, when you have a bunch of independent random variables, each with mean p, and you take their sum, you would expect the sum to be about p * n; this is the expectation of S, the sum. What the Chernoff Bound says is that the probability that your sum S deviates from $\mu$ by more than $\delta$n is exponentially small in n.
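
To see how the bound behaves, here is a small simulation sketch. The values p = 0.3 and $\delta$ = 0.1, the choices of n, and the number of trials are arbitrary; the function names are made up for this example.

```python
import math
import random

def chernoff_bound(n, delta):
    """Two-sided bound 2 * exp(-2 * n * delta^2) on Pr[|S - mu| > delta * n]."""
    return 2 * math.exp(-2 * n * delta**2)

def empirical_deviation_rate(n, p, delta, trials=2000, seed=0):
    """Estimate Pr[|S - p*n| > delta*n], where S is a sum of n Bernoulli(p) draws."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        s = sum(1 for i in range(n) if rng.random() < p)
        if abs(s - p * n) > delta * n:
            bad += 1
    return bad / trials

p, delta = 0.3, 0.1
for n in (50, 100, 200):
    observed = empirical_deviation_rate(n, p, delta)
    print(f"n={n}: observed {observed:.4f}, bound {chernoff_bound(n, delta):.4f}")
```

Both columns shrink as n grows, and the observed rate should sit below the bound.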

"The probability that I deviate more than $\delta$n is exponentially small in n, and depending on my choice of $\delta$ I will be getting different bounds for these probabilities." 

<br>
<center>
    <img src="images/1.5.3.png" alt="Professor Notes" />
</center>
<br>

Applying the Chernoff Bound to the case of estimating the true error of a classifier...

We have hold-out set S, and we'll say that |S| = n (the size of S is n).\
Fix h (generated using some independent training set). Recall that there is some underlying distribution D from which we are generating training points, and that S is a sample drawn from D independent of the training set.

$$ Z = Pr_{x \sim D}[h(x) \ne c(x)] $$

where c is the unknown function we are trying to learn, and h is the classifier that we've generated. Z is the true error of h, which is the quantity we want to understand.

What random variable should we define if we want to use the Chernoff Bound...?

Let $x_i$ be the random variable that equals 1 if h is incorrect on the i<sup>th</sup> element of S. It will be 0 if h is correct on the i<sup>th</sup> element of S.

<br>
<center>
    <img src="images/1.5.4.png" alt="Professor Notes" />
</center>
<br>

So we have random variables $x_1, x_2, ..., x_n$, and 

$$ x_i = \begin{cases} 
          1 & \text{if } h \text{ is incorrect on the } i^{th} \text{ element} \\
          0 & \text{otherwise}
          \end{cases}
$$

and 

$$S = \sum_{i=1}^{n} x_i $$

and

$$ \mathbf{E}[S] = n*p $$

where p is actually the true error of h, because p is the expected value of $x_i$, and $x_i$ outputs 1 when incorrect.

$$ Pr[|S - n*p| > \delta n] \le 2e^{-2n\delta^2} $$

(Recall p is the true error of classifier h)  

Say we set $\delta$ = .1, then $ Pr[|S - n*p| > .1n] \le 2e^{\frac{-2n}{100}}$ 

I will call the quantity $ 2e^{\frac{-2n}{100}} $ from the above inequality the confidence parameter. 

How large do we need to take n before the confidence parameter becomes smaller than some small quantity $\alpha$? 

$$ 2e^{\frac{-2n}{100}} \lt \alpha $$
$$ e^{\frac{-2n}{100}} \lt \frac{\alpha}{2} $$
$$ \frac{-2n}{100} \lt \log\left(\frac{\alpha}{2}\right) \Rightarrow n \gt 50*\log\left(\frac{2}{\alpha}\right) $$

So if we want the probability of failure to be less than $\alpha$, and we want to be confident that S is within .1\*n of n\*p, then we need n to be greater than $ 50*log(\frac{2}{\alpha}) $, where log is the natural logarithm.
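
Plugging numbers into this inequality gives concrete hold-out sizes. A minimal sketch (the function name `holdout_size` is made up, and `delta` is left as a parameter so values other than .1 can be tried):

```python
import math

def holdout_size(alpha, delta=0.1):
    """Smallest integer n satisfying 2 * exp(-2 * n * delta^2) < alpha."""
    threshold = math.log(2 / alpha) / (2 * delta**2)  # equals 50 * log(2/alpha) for delta = .1
    return math.floor(threshold) + 1

for alpha in (0.1, 0.05, 0.01):
    print(f"alpha={alpha}: n = {holdout_size(alpha)}")
```

For $\delta$ = .1 this gives n = 150, 185, and 265 for $\alpha$ = .1, .05, and .01. Because n depends on $\alpha$ only through a logarithm, asking for much higher confidence costs relatively few extra hold-out points.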

Notice: if $ |S-n*p| \le .1n $, then dividing through by n gives $ |\frac{S}{n} - p| \le .1 $, so the error rate on the hold-out set is within .1 of the true error rate.
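
The same bound can be read in the other direction: for a hold-out set of fixed size n and a chosen $\alpha$, solve $2e^{-2n\delta^2} = \alpha$ for $\delta$ to see how far the hold-out error rate can be from the true error. The sketch below does this; the function names are made up for illustration.

```python
import math

def error_margin(n, alpha):
    """The delta for which 2 * exp(-2 * n * delta^2) equals alpha."""
    return math.sqrt(math.log(2 / alpha) / (2 * n))

def holdout_interval(num_errors, n, alpha=0.05):
    """Range that contains the true error with probability at least 1 - alpha,
    according to the Chernoff bound, clipped to [0, 1]."""
    rate = num_errors / n
    margin = error_margin(n, alpha)
    return max(0.0, rate - margin), min(1.0, rate + margin)

low, high = holdout_interval(num_errors=20, n=200)
print(f"hold-out error 0.100, true error in [{low:.3f}, {high:.3f}]")
```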

<br>
<center>
    <img src="images/1.5.5.png" alt="Professor Notes" />
</center>
<br>



## ***How It Works***

The hold-out set is somewhat expensive:
- Data is expensive or difficult to obtain, and we aren't using the held-out data for training the classifier
- If we want to try out multiple methods for generating classifiers, we quickly lose confidence in our estimates (if you use the hold-out set m times, you must multiply $\alpha$ by m, since by the union bound the failure probabilities add up). This gets expensive when testing different classifiers and parameter settings.
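
To get a feel for the cost of reusing the hold-out set, the sketch below splits the overall failure probability $\alpha$ evenly across the classifiers being tested, so that each one gets $\alpha$ divided by the number of classifiers. It assumes the classifiers are fixed before looking at the hold-out set; the function name is made up for illustration.

```python
import math

def shared_holdout_size(num_classifiers, alpha, delta=0.1):
    """Hold-out size so that every estimate is within delta of its true error,
    with overall failure probability below alpha."""
    per_classifier_alpha = alpha / num_classifiers
    threshold = math.log(2 / per_classifier_alpha) / (2 * delta**2)
    return math.floor(threshold) + 1

for m in (1, 10, 100):
    print(f"{m} classifiers: n = {shared_holdout_size(m, alpha=0.05)}")
```

With $\alpha$ = .05 and $\delta$ = .1 this gives n = 185, 300, and 415 for 1, 10, and 100 classifiers.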

How can we understand the true error of many different classifiers that we generate? How can we reuse our training set to build different classifiers and still understand their true error? This is still an open problem in the field, but the best current solution is cross-validation.

Cross-validation works very well in practice, and is used in packages such as scikit-learn. 

---

The idea behind cross-validation is that we are going to take our entire training set and break it up into k *folds*. We will then use the training set both to train the classifier and to estimate its true error.

First we'll hold out fold 1 and train using fold 2 through fold k. We will then test on fold 1, and that will be our estimate for that classifier's true error.

Then, we'll hold out fold 2 and train using fold 1 and fold 3 through fold k. We will then test on fold 2, and that will be our estimate for the true error of that classifier. 

We will do this k times, once for each fold, and we will average all the errors that we got. That will be the estimate for the true error of a classifier produced with the parameters used to build the model.
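
The procedure above can be sketched from scratch as follows. This is a minimal illustration, not how any particular library implements it: `train_fn` is a hypothetical callback that takes a list of (x, label) pairs and returns a classifier, and `majority_learner` is a toy learner used only to make the example runnable.

```python
import random

def make_folds(n, k, seed=0):
    """Randomly partition the indices 0..n-1 into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_val_error(examples, train_fn, k=5, seed=0):
    """Average held-out error over k folds."""
    folds = make_folds(len(examples), k, seed)
    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except fold i, then test on fold i.
        train = [examples[j] for f, fold in enumerate(folds) if f != i for j in fold]
        test = [examples[j] for j in test_idx]
        h = train_fn(train)
        mistakes = sum(1 for x, y in test if h(x) != y)
        fold_errors.append(mistakes / len(test))
    return sum(fold_errors) / k

def majority_learner(train):
    """Toy learner (hypothetical): always predict the most common training label."""
    ones = sum(y for _, y in train)
    label = int(2 * ones >= len(train))
    return lambda x: label

# Toy data: 70 examples labeled 1 and 30 labeled 0.
examples = [(i, 1) for i in range(70)] + [(i, 0) for i in range(70, 100)]
cv_error = cross_val_error(examples, majority_learner, k=5)
print(f"5-fold CV error estimate: {cv_error:.2f}")
```

Every example is used for testing exactly once and for training k - 1 times, so no data is permanently set aside.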

<br>
<center>
    <img src="images/1.5.6.png" alt="Professor Notes" />
</center>
<br>

---

Let's use decision trees for an example. We have a training set S, and are trying to decide:
- Should I build a decision tree of depth 10 or depth 15?

We decide using cross-validation...

We create the folds in our training set and set depth = 10. We leave out fold 1, build a decision tree with depth 10 on the remaining folds, and test its accuracy on fold 1. Then we'll hold out fold 2 and do the same thing. We will repeat for all folds and take the average of their error rates, which will provide an estimate of the true error for the decision tree of depth 10.

Then, we will do the exact same thing, but set the depth parameter to 15 this time. We will end up with the estimate for the true error for the tree with depth 15. 

Whichever depth gives the smaller estimated error is the tree we want to use!
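
Here is a self-contained sketch of that comparison. Everything in it is an assumption made for illustration: the toy decision tree splits 1-D points at the median, the data is a noisy threshold concept, and the candidate depths are 2 and 10 rather than 10 and 15, because on a training set this small both of the larger depths would produce the same fully grown tree.

```python
import random

def build_tree(points, depth):
    """Toy 1-D decision tree: split at the median x until depth runs out or the node is pure.
    A leaf is an int label; an internal node is (split, left_subtree, right_subtree)."""
    labels = [y for _, y in points]
    majority = int(2 * sum(labels) >= len(labels))
    if depth == 0 or len(set(labels)) <= 1:
        return majority
    xs = sorted(x for x, _ in points)
    split = xs[len(xs) // 2]
    left = [p for p in points if p[0] < split]
    right = [p for p in points if p[0] >= split]
    if not left or not right:
        return majority
    return (split, build_tree(left, depth - 1), build_tree(right, depth - 1))

def predict(tree, x):
    """Walk down from the root until a leaf label is reached."""
    while isinstance(tree, tuple):
        split, left, right = tree
        tree = left if x < split else right
    return tree

def cv_error(examples, depth, k=5):
    """k-fold cross-validation error for trees of the given maximum depth.
    The examples are assumed to be in random order already, so folds are taken by striding."""
    folds = [examples[i::k] for i in range(k)]
    total = 0.0
    for i, test in enumerate(folds):
        train = [p for f, fold in enumerate(folds) if f != i for p in fold]
        tree = build_tree(train, depth)
        total += sum(1 for x, y in test if predict(tree, x) != y) / len(test)
    return total / k

# Toy data (an assumption): label is 1 when x >= 0.5, flipped with probability 0.2.
rng = random.Random(0)
examples = []
for _ in range(500):
    x = rng.random()
    y = int(x >= 0.5)
    if rng.random() < 0.2:
        y = 1 - y
    examples.append((x, y))

errors = {depth: cv_error(examples, depth) for depth in (2, 10)}
best_depth = min(errors, key=errors.get)
print(errors, "-> use depth", best_depth)
```

The same folds are used for every candidate depth, so the comparison between settings is made on equal footing.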

---

Question: What should k be set to?
- Between 5 and 10 is typically what is used.

<br>
<center>
    <img src="images/1.5.7.png" alt="Professor Notes" />
</center>
<br>

## ***Resources***

**[Understanding Machine Learning: From Theory to Algorithms, Chapter 3](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html)** (Internet link)

# Personal Notes #