# Generalization

## ***Vocabulary***

none yet

# Lecture Notes #

## ***Introduction***

#### **Generalization, or predictive power of a classifier.**

What we want to understand is how well the classifier we built is going to perform when it is given data that it has not seen before. We would like to estimate this generalization error and understand when it is going to have good generalization error. This is going to lead us to the PAC model of learning, which is a foundational model of learning.

- What is the "true error" or generalization error of a classifier?
- Decision trees:
    - Let us fix some tree T, created based on the rules we created above.
    - And let us say we have a probability distribution D on new examples. 
    - Then, for a new challenge x and its label y drawn from D, we want to know:

$$Pr_{(x, y)\sim D}[T(x) \neq y]$$

<br>*[English: The probability that when we randomly draw a challenge from some unknown distribution D, T(x) does not equal y]*

We call this the **true error**, or generalization error, of T. Of course, we are hoping that this quantity is small.

<br>
<center>
    <img src="images/1.3.1.png" alt="Professor Notes" />
</center>
<br>

---

#### **So when might this quantity not be small?**

Let us imagine that we have a training set Տ, where we have challenges x<sup>1</sup> through x<sup>m</sup> and labels y<sup>1</sup> through y<sup>m</sup>, where x<sup>i</sup>∈{0,1}<sup>n</sup> and y<sup>i</sup>∈{0,1}. Assume all x<sup>i</sup> are distinct.

A learner is given Տ and outputs a classifier that is an exact copy of the training set (a lookup table). In this case, the learner is memorizing the training set. No real learning has occurred, and unless Տ contains every possible input, this is a bad classifier.
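To make the memorization problem concrete, here is a minimal Python sketch. The target function (`parity`), the default label 0 for unseen inputs, and the sizes used are assumptions chosen for illustration; they do not come from the lecture.

```python
import itertools
import random

def memorizing_learner(training_set):
    """Return a classifier that is just a lookup table over the training set."""
    table = dict(training_set)
    # Assumption: inputs that were never seen get the default label 0.
    return lambda x: table.get(x, 0)

def error_rate(classifier, examples):
    """Fraction of (x, y) pairs that the classifier labels incorrectly."""
    return sum(classifier(x) != y for x, y in examples) / len(examples)

def parity(x):
    """Hypothetical target function: 1 if x contains an odd number of ones."""
    return sum(x) % 2

n = 10
rng = random.Random(0)
domain = list(itertools.product([0, 1], repeat=n))  # all 2^n inputs
rng.shuffle(domain)
train_inputs, unseen_inputs = domain[:100], domain[100:]

S = [(x, parity(x)) for x in train_inputs]
unseen = [(x, parity(x)) for x in unseen_inputs]

h = memorizing_learner(S)
train_error = error_rate(h, S)        # zero by construction
unseen_error = error_rate(h, unseen)  # roughly one half for this target
print(train_error, unseen_error)
```

The lookup table is perfect on Տ but does no better than guessing on inputs it has not seen, which is the gap between training error and true error.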

<br>
<center>
    <img src="images/1.3.3.png" alt="Professor Notes" />
</center>
<br>

---

Let us again consider a case with training set Տ as described before. We can build a decision tree, with size on the order of the size of Տ, that is consistent with every point in Տ.

**Then the question is: *How well does this tree generalize?* or, *What is the true error of this tree?***

It will be pretty bad, qualitatively, because it is simply memorizing every entry in the training set.

In our decision trees, we only considered achieving a low training error. To create a robust tree that can handle real data, we also need to ask whether we are getting a low generalization error. We want a good combination of both.

---

#### **How can we estimate the true error of a decision tree?**

A **"hold-out"** or **"validation set"** is used for this purpose.

The idea is that we have Տ, the training set, and then we will have some more data that we will not let the decision tree look at, called ᕼ, our hold out.

1. Use Տ to build a decision tree
2. Estimate the tree's true error via its error on ᕼ

By counting the number of mistakes that our tree makes on ᕼ, we compute the error rate of the tree on ᕼ and use that as the estimate of the true error. As long as ᕼ is sufficiently large, for any fixed tree we have built using Տ, its error on ᕼ will be very close to its true error, which is something you can prove.
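The hold-out estimate can be sketched in a few lines of Python. The classifier `tree`, the labeling rule `label`, and the uniform distribution are hypothetical stand-ins, chosen so that the true error is known to be exactly 0.25 and the estimate can be compared against it.

```python
import random

def tree(x):
    """Stand-in for a tree built from S (hypothetical): predict the first bit."""
    return x[0]

def label(x):
    """Hypothetical true labeling: y = x[0] OR x[1]."""
    return x[0] | x[1]

def draw_examples(m, n, rng):
    """Draw m labeled examples with x uniform over {0,1}^n."""
    xs = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(m)]
    return [(x, label(x)) for x in xs]

def holdout_error(classifier, H):
    """Count the mistakes on H and divide by |H|."""
    mistakes = sum(classifier(x) != y for x, y in H)
    return mistakes / len(H)

rng = random.Random(1)
H = draw_examples(5000, 8, rng)   # hold-out set, never used to build the tree
estimate = holdout_error(tree, H)
true_error = 0.25                 # tree errs exactly when x[0] = 0 and x[1] = 1
print(estimate, true_error)
```

With a few thousand held-out points the estimate lands close to 0.25; with only a handful of points it would fluctuate much more.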

It is important that you do not go back to the tree and modify it over and over to reduce error on ᕼ, because at that point you are now incorporating the validation set into the tree with Տ. This would be considered **over-fitting**.

It is also important to consider that hold-out sets can be expensive, especially if you create many models and larger models.

To combat this, we can use a technique called **cross-validation**, which allows you to reuse data that has been held out in a validation set. This will be covered later.

<br>
<center>
    <img src="images/1.3.4.png" alt="Professor Notes" />
</center>
<br>

## ***Model Complexity***

#### **Trading Training Error for Model Complexity** ####

Another approach to estimate the true error of a decision tree is to trade off training error with "model complexity".

We can define another potential function, phi, where phi will map trees to real numbers. A common mapping is, given a training set Տ:

$$ \phi(T) = training\;error\;on\;Տ + \alpha * \frac{size(T)}{|S|} $$

&emsp;*[English: phi of T will output the training error on Տ + some value alpha * the size of T, divided by the size of Տ]*

The alpha is often referred to as a hyperparameter; it is set at the beginning of training.

The goal is to minimize phi, and there are two ways that phi can increase.
- Large training error on Տ
- Tree is large in size

Therefore, we can choose a tree that minimizes the potential function.

**Because the size of the tree will be large when memorizing the training set, by balancing a reasonably small training error with a reasonably small tree, we can create a good tree.**
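As a small worked example, the potential function can be evaluated for a few candidate trees. The candidate trees, their training errors and sizes, and the value of alpha below are made-up numbers for illustration only.

```python
def phi(training_error, tree_size, num_examples, alpha):
    """Potential function: training error plus alpha * size(T) / |S|."""
    return training_error + alpha * tree_size / num_examples

num_examples = 200   # |S|, hypothetical
alpha = 0.5          # hyperparameter, fixed before training (hypothetical value)

# (training error on S, size of tree) for three hypothetical trees
candidates = {
    "memorizing tree": (0.00, 200),
    "medium tree": (0.05, 15),
    "single leaf": (0.40, 1),
}

scores = {
    name: phi(err, size, num_examples, alpha)
    for name, (err, size) in candidates.items()
}
best = min(scores, key=scores.get)
print(scores)
print(best)
```

With these numbers the memorizing tree pays a large size penalty, the single leaf pays a large training-error penalty, and the medium tree minimizes phi.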

<br>
<center>
    <img src="images/1.3.5.png" alt="Professor Notes" />
</center>
<br>

---

#### **Minimum Description Length** ####

Another common approach is MDL, the **minimum description length principle**. Again we are given a training set Տ. The number of bits needed to encode Տ directly is at most $m(n+1)$: m is the number of examples in the training set, n is the number of bits in each example x, and 1 bit is for the label. We can then build a tree T. Let's say T is correct on 90% of Տ and incorrect on 10%. We can encode Տ using 

$$bits(T) + \#bits\;to\;encode\;remaining\;10\%\;wrong$$ 

*[English: the number of bits needed to encode T, the tree, plus the number of bits needed to encode the remaining 10% we got wrong]*.

Recall: A bad learner is one that takes the training set and hands it back to you. What MDL is trying to capture is a notion of compression: how well can you compress the given data? We can encode Տ using just this tree and the bits required to encode the remaining 10% of the training set. So the tree with the smallest description length is the better tree.
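The bit counting can be sketched as follows. The encoding of the mistakes (writing each misclassified example out in full with n + 1 bits) and the specific numbers are assumptions for illustration; the notes do not fix a particular encoding.

```python
def raw_bits(m, n):
    """Upper bound on bits to encode S directly: m examples, n + 1 bits each."""
    return m * (n + 1)

def description_length(tree_bits, num_wrong, n):
    """bits(T) plus the bits needed to encode the examples T gets wrong.

    Assumption: each misclassified example is written out in full (n + 1 bits).
    """
    return tree_bits + num_wrong * (n + 1)

m, n = 1000, 20           # hypothetical training set dimensions
tree_bits = 500           # hypothetical cost of encoding the tree T
num_wrong = m // 10       # T is wrong on 10% of S

direct = raw_bits(m, n)
compressed = description_length(tree_bits, num_wrong, n)
print(direct, compressed)
```

Here the tree plus its exceptions costs far fewer bits than the raw training set, so the tree is compressing the data. A tree that memorizes Տ would cost about as much as Տ itself and give no compression.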

---

#### **Classifying trees based on generalization error** ####
- MDL
- Trading off training error and tree size
- Validation set
- Cross-validation

<br>
<center>
    <img src="images/1.3.6.png" alt="Professor Notes" />
</center>
<br>

## ***PAC Model of Learning***

Consider a distribution D on {0, 1}<sup>n</sup>, our domain, and a function class C:

$$ C = \{decision\;trees\;of\;size\;s\} $$

and fix c ∈ C, where c is the unknown decision tree we want to learn.

---

The learner runs in polynomial time. It receives examples $(x, y)$, where $x \sim D$ and $y=c(x)$. *[English: x is drawn according to the probability distribution D, and y is equal to c(x)]* The learner can request a new draw at any time, obtaining examples (x<sup>1</sup>, y<sup>1</sup>) through (x<sup>m</sup>, y<sup>m</sup>) with y<sup>i</sup> = c(x<sup>i</sup>). 

The goal is for the learner to output h, where h ∈ C, with the following property:

$$ Pr_{x\sim D}[h(x) ≠ c(x)] ≤ ϵ $$

*[English: The probability that, for an x drawn from the distribution D, h(x) does not equal c(x) should be at most epsilon.]*

Think of ϵ as something small, like .01. The learner should be efficient. We had a parameter n (from the domain) and a parameter s (the size of the tree). The learner should always run in time polynomial in n and s. The number of samples, or draws, that the learner can request should also be bounded by a polynomial in n and s (because it takes one time step to take a draw from the distribution).

The real goal is to output some hypothesis (classifier), h, whose true error is at most ϵ.

This is different from the mistake-bounded model of learning, because that only required a bounded number of mistakes. Here we talk about probabilities and distributions, so the model is a little more complicated.

<br>
<center>
    <img src="images/1.3.7.png" alt="Professor Notes" />
</center>
<br>

---

#### **Formalizing it:** ####

    With probability (over the draws from D) at least 1 - δ [one minus delta], the learner should output a hypothesis h such that 

$$ Pr_{x\sim D}[h(x) ≠ c(x)] ≤ ϵ $$

    And the running time, which includes the draws it may take, should be 

$$ \text{run-time} = \text{polynomial}\left(\frac{1}{ϵ}, \frac{1}{δ}, n, s\right) $$

&emsp;*[English: some polynomial in one over epsilon, one over delta, n, and s]*

That is the formal statement of the goal of the learner.

#### **Why do we need the probability at least 1 - δ?** ####

Imagine that the learner keeps requesting new draws from the distribution, and gets really unlucky and all the x's it draws are equal. It gets the same example over and over. We cannot expect the learner in this case to output a classifier that has small error. Thus, we have to allow for some probability of failure. Luckily, the probability of drawing the same example over and over is extremely small.


#### **What does PAC mean?**

P - probably - $ probability\;at\;least\;1 - δ $\
A - approximately - $ Pr_{x\sim D}[h(x) ≠ c(x)] ≤ ϵ $\
C - correct 

<br>

Note: As you demand more accuracy, or smaller ϵ, then you're allowed to run in more time and take more samples.

This would work for any class of functions that output boolean values, not just decision trees, i.e., for any binary classification problem.

<br>
<center>
    <img src="images/1.3.8.png" alt="Professor Notes" />
</center>
<br>

---

#### **When can we PAC learn a function class?**

Or, what function classes can we PAC learn?

Give the learner an algorithm A, which maps training sets to decision trees.

A on a training set S will output a tree T that is consistent with S. Furthermore, the size of T is going to be at most s.

So, A always outputs a consistent hypothesis from C given any training set (assuming there is one).

Question: *Given algorithm A, how can we PAC learn C*?

<br>
<center>
    <img src="images/1.3.9.png" alt="Professor Notes" />
</center>
<br>

**High level algorithm**: Draw sufficiently many training points (which we will call S), use A to find a function (c) in C consistent with S, then output c.

The only question left is *how large should S be*? (Recall that the PAC learner must run in time polynomial in the parameters, so drawing exponentially many points will not do.)

    Illustrative example: Marble Game
    
    Jar 1: all blue marbles\
    Jar 2: 90% red marbles, 10% blue marbles\
    
    Figure out if you've been given Jar 1 or Jar 2, given a random element of the jar any time you want. 
    
    - Pick a random marble from the jar:
        - Case 1: the marble is red -> We have Jar 2
        - Case 2: the marble is blue -> Probably Jar 1
            - Choose at most 100 marbles
                - If we see red, then Jar 2; if not, then choose Jar 1.
             
    **What is the probability of failure?**
    
    Probability of failure is (.1)<sup>100</sup> (the chance that Jar 2 yields 100 blue marbles in a row), which corresponds to the δ parameter in PAC learning.

<br>
<center>
    <img src="images/1.3.10.png" alt="Professor Notes" />
</center>
<br>
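The marble game is easy to check numerically. The sketch below computes the failure probability for a given number of draws and, as a sanity check, simulates the one-draw version of the game on Jar 2. The function names and the simulation setup are illustrative assumptions.

```python
import random

def failure_probability(num_draws, blue_fraction=0.1):
    """Chance that Jar 2 shows only blue marbles in num_draws independent draws."""
    return blue_fraction ** num_draws

def guess_jar(draws):
    """Decision rule from the game: any red marble means Jar 2."""
    return "Jar 2" if "red" in draws else "Jar 1"

def simulate_jar2_failures(num_draws, trials, rng):
    """Fraction of trials in which we are handed Jar 2 but guess Jar 1."""
    failures = 0
    for _ in range(trials):
        draws = ["blue" if rng.random() < 0.1 else "red" for _ in range(num_draws)]
        if guess_jar(draws) == "Jar 1":
            failures += 1
    return failures / trials

rng = random.Random(2)
estimate = simulate_jar2_failures(1, 20000, rng)  # should be near 0.1
print(failure_probability(100))                   # astronomically small
print(estimate)
```

The failure probability shrinks exponentially in the number of draws, which is why a modest number of samples is enough to push it below any fixed δ.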

Back to PAC learning:
- Draw many samples
- Run A
- Output classifier c that is consistent with S given from A

What is the probability this procedure fails? 

And what do we want the above probability to be less than? δ

<br>
<center>
    <img src="images/1.3.11.png" alt="Professor Notes" />
</center>
<br>

The bad event, or failure, is we output c that is consistent with S, but the true error of c is greater than ϵ. So, *what is the probability of this bad event*?

Imagine we have enumerated all functions in C = {c<sub>1</sub>, ..., c<sub>N</sub>}. Fix c<sub>1</sub>, assume c<sub>1</sub> has true error > ϵ (we do not want it, true error too high).

What is $ Pr_S\;[c_1\;is\;consistent\;with\;S] $?

$$ Pr_S\;[c_1\;is\;consistent\;with\;S] \le (1-\epsilon)^{|S|}$$

Now, let's fix c<sub>2</sub>, assume c<sub>2</sub> has true error > ϵ (we do not want it, true error too high). What is the same probability? Also $ Pr_S\;[c_2\;is\;consistent\;with\;S] \le (1-\epsilon)^{|S|}$.

<br>
<center>
    <img src="images/1.3.12.png" alt="Professor Notes" />
</center>
<br>

For every c<sub>i</sub> (with error > ϵ): $ Pr_S\;[c_i\;is\;consistent\;with\;S] \le (1-\epsilon)^{|S|}$.
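This bound can be checked empirically for one fixed bad hypothesis. In the sketch below, the target, the bad hypothesis, and the uniform distribution are hypothetical choices made so that the bad hypothesis has true error exactly 1/8.

```python
import random

def target(x):
    """Hypothetical unknown function c: the first bit."""
    return x[0]

def bad_hypothesis(x):
    """Disagrees with target exactly when x[1] = x[2] = x[3] = 1.

    Under the uniform distribution this happens with probability 1/8.
    """
    return x[0] ^ (x[1] & x[2] & x[3])

def is_consistent(h, S):
    """True if h agrees with every labeled example in S."""
    return all(h(x) == y for x, y in S)

def consistency_rate(m, n, trials, rng):
    """Fraction of random training sets of size m that the bad hypothesis fits."""
    hits = 0
    for _ in range(trials):
        S = []
        for _ in range(m):
            x = tuple(rng.randint(0, 1) for _ in range(n))
            S.append((x, target(x)))
        hits += is_consistent(bad_hypothesis, S)
    return hits / trials

rng = random.Random(3)
epsilon = 1 / 8
m = 10
observed = consistency_rate(m, 4, 20000, rng)
bound = (1 - epsilon) ** m
print(observed, bound)
```

Because the hypothesis here has error exactly ϵ, the observed rate sits close to (1-ϵ)<sup>|S|</sup>; a hypothesis with larger error would be consistent even less often.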

Question: Randomly form S, what is the probability there even exists a function c whose error is > ϵ and is consistent with S?

Hint:

    Union bound: Given events A and B, Pr[A ∨ B] ≤ Pr[A] + Pr[B]
    *The probability that A or B occurs is less than or equal to the probability that A occurs plus the probability that B occurs*

Answer: $Pr[Bad\;Event] \le |C|*(1-\epsilon)^{|S|}$, since there are at most |C| functions to consider and the probability of failure for each was at most $(1-\epsilon)^{|S|}$. Recall, we wanted Pr[Bad Event] < δ, so the goal is 

$$ Pr[Bad\;Event] \le |C|*(1-\epsilon)^{|S|} \le \delta $$

The only unknown variable is |S|, so we will solve for it.

<br>
<center>
    <img src="images/1.3.13.png" alt="Professor Notes" />
</center>
<br>

**Solving for |S|**

$|C|*(1-\epsilon)^{|S|} \le \delta$

$|C|*e^{-\epsilon|S|} \le \delta$ (using 1-x ≤ e<sup>-x</sup>; it suffices to satisfy this stronger condition)

$e^{-\epsilon|S|} \le \frac{\delta}{|C|}$

$-\epsilon|S| \le log(\frac{\delta}{|C|})$

which yields (dividing by $-\epsilon$ flips the inequality, and $-log(\frac{\delta}{|C|}) = log(\frac{|C|}{\delta})$, where log is the natural logarithm):

$$|S| \ge \frac{log(\frac{|C|}{\delta})}{\epsilon}$$

This is saying that if you choose a number of training points at least $\frac{log(\frac{|C|}{\delta})}{\epsilon}$, then with probability ≥ (1-δ), the output function c is 1-ϵ accurate.

Thus we have computed a size of the training set S that is sufficient for the output function c to be 1-ϵ accurate with probability at least 1-δ.
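The sample-size requirement $|S| \ge \frac{1}{\epsilon}ln\frac{|C|}{\delta}$, which follows from $|C|(1-\epsilon)^{|S|} \le \delta$, can be evaluated numerically. The class size and the values of ϵ and δ below are hypothetical.

```python
import math

def sample_size(num_hypotheses, epsilon, delta):
    """Smallest integer m with m >= ln(|C| / delta) / epsilon."""
    return math.ceil(math.log(num_hypotheses / delta) / epsilon)

def failure_bound(num_hypotheses, epsilon, m):
    """Union-bound estimate |C| * (1 - epsilon)^m of the bad-event probability."""
    return num_hypotheses * (1 - epsilon) ** m

num_hypotheses = 2 ** 20   # hypothetical |C|
epsilon, delta = 0.01, 0.05

m = sample_size(num_hypotheses, epsilon, delta)
print(m)                                         # 1686
print(failure_bound(num_hypotheses, epsilon, m)) # below delta
```

Note that the number of samples grows only logarithmically in |C| and in 1/δ, and linearly in 1/ϵ, so a class with exponentially many functions still needs only polynomially many samples.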

<br>
<center>
    <img src="images/1.3.14.png" alt="Professor Notes" />
</center>
<br>

Since we assumed A would give a consistent hypothesis each time, that suggests that there is a "consistent hypothesis" approach to learning...
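A minimal end-to-end sketch of this consistent-hypothesis approach is below. Everything specific in it is an assumption for illustration: the tiny function class (c<sub>i</sub>(x) = x[i]), the brute-force `algorithm_A`, the uniform distribution, and the sample size ⌈ln(|C|/δ)/ϵ⌉.

```python
import itertools
import math
import random

n = 8
# Hypothetical function class C: c_i(x) = x[i] for i = 0..n-1, so |C| = n.
C = [lambda x, i=i: x[i] for i in range(n)]

def algorithm_A(S):
    """Return the index of some hypothesis in C consistent with S, or None."""
    for i, h in enumerate(C):
        if all(h(x) == y for x, y in S):
            return i
    return None

def true_error(h, c):
    """Exact Pr[h(x) != c(x)] under the uniform distribution on {0,1}^n."""
    domain = list(itertools.product([0, 1], repeat=n))
    return sum(h(x) != c(x) for x in domain) / len(domain)

epsilon, delta = 0.1, 0.05
m = math.ceil(math.log(len(C) / delta) / epsilon)  # assumed sample size

rng = random.Random(4)
target_index = 5          # the unknown c, fixed here only for the demo
c = C[target_index]

# Draw m labeled examples (x, c(x)) with x uniform over {0,1}^n.
S = []
for _ in range(m):
    x = tuple(rng.randint(0, 1) for _ in range(n))
    S.append((x, c(x)))

h_index = algorithm_A(S)  # consistent hypothesis returned by A
h = C[h_index]
error = true_error(h, c)
print(m, h_index, error)
```

Brute-force enumeration is only feasible because this class is tiny; for decision trees of size s the class is far too large to enumerate, which is why an efficient algorithm A is needed.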

# Personal Notes #

**[Understanding Machine Learning: From Theory to Algorithms, Chapter 3](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html)** 