# **1.1 Mistake-bounded Model of Learning**

## &emsp;&emsp;Notes

Analogy for mistake-bounded model: 

    Imagine an email spam filter with a 100% guarantee that it will mislabel at most 100 emails. Whether your inbox holds 30 emails or 30,000, the filter makes at most 100 mistakes.

---

In this model we have a <b>"Learner"</b>, which receives data points one at a time. For each data point, it responds with its guess for that point's classification. There is also a <b>"Teacher"</b>, which tells the Learner whether the guess was correct or incorrect. Each time the Teacher reports a mistake, a mistake counter increases by one, and the Learner updates its internal state to learn from that mistake.

<br>
<center>
    <img src="1.1.1.png" alt="Professor Notes" />
</center>
<br>

We say a Learner has mistake bound <i>t</i> if, for every sequence of challenges, the Learner makes at most <i>t</i> mistakes.

<br>

---

<br>

**𝒞 (Script C) definition:**
<br>𝒞 = {monotone disjunctions on n variables} 
<br>&emsp;&emsp;*English: Script C is equal to the set of all monotone disjunctions on n variables*
<br>&emsp;&emsp;*(Note: Called monotone because there are no negations.)*
<br>Domain = {0,1}<sup>n</sup> 
<br>&emsp;&emsp;*English: Domain is equal to the set of 0, 1 to the n (i.e., bit strings of length n)*

Some functions in 𝒞:
- x1 ∨ x3 - *Evaluates to 1 when given a bit string that has a one in the first or third position.*
- f(x) = x1 ∨ x7 ∨ x9 - *Evaluates to 1 when given a bit string that has a one in the first, seventh, or ninth position.*
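
As a minimal sketch of how such a function evaluates a bit string, the Python below builds a monotone disjunction from a list of variable positions. The helper name `monotone_disjunction` and the 1-based positions (chosen to match x1, x3, ... in the notes) are assumptions made for illustration.

```python
def monotone_disjunction(positions):
    """Return f(x) = OR of x_i over the given 1-based positions."""
    def f(x):
        # x is a sequence of 0/1 values; position i in the notes is x[i - 1] here.
        return int(any(x[i - 1] == 1 for i in positions))
    return f


f = monotone_disjunction([1, 3])      # x1 ∨ x3
print(f([0, 0, 1, 0]))                # the third bit is 1
print(f([0, 1, 0, 0]))                # neither the first nor the third bit is 1
```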

<br>
<center>
    <img src="1.1.2.png" alt="Professor Notes" />
</center>
<br>

---

*f* ∈ 𝒞, so *f* is a monotone disjunction. The Learner does not know which monotone disjunction *f* is. The Learner is fed a string from the domain and responds with a 0 or a 1. The Teacher then responds with "correct" if the guess equals f(x), or "mistake" otherwise.

If the Learner's guess equals f(x), nothing changes and the Learner moves on to the next input. If the Teacher replies that the guess was a mistake, the Learner updates its state and then receives another input.

In either case, the Learner learns f(x). If the guess was correct, f(x) equals the guess. If the guess was incorrect, f(x) is simply the opposite of the guess, because there are only two possible labels.
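
The exchange between Learner and Teacher can be sketched as a short loop. This is only an illustration: `run_protocol` and `MemorizingLearner` are hypothetical names, and the toy learner here simply remembers corrections, so it has no useful mistake bound.

```python
def run_protocol(learner, target, challenges):
    """Feed challenges one at a time and count the Learner's mistakes."""
    mistakes = 0
    for x in challenges:
        guess = learner.predict(x)
        correct = (guess == target(x))        # the Teacher's reply
        if not correct:
            mistakes += 1
            # Labels are binary, so a wrong guess reveals f(x) = 1 - guess.
            learner.update(x, 1 - guess)
    return mistakes


class MemorizingLearner:
    """Hypothetical toy learner: remembers the labels it was corrected on."""

    def __init__(self):
        self.seen = {}

    def predict(self, x):
        return self.seen.get(tuple(x), 0)     # guess 0 for anything unseen

    def update(self, x, true_label):
        self.seen[tuple(x)] = true_label


def target(x):
    return int(x[0] == 1 or x[2] == 1)        # x1 ∨ x3


challenges = [(1, 0, 0), (0, 1, 0), (1, 0, 0), (0, 0, 1)]
print(run_protocol(MemorizingLearner(), target, challenges))
```

The second time `(1, 0, 0)` appears, the learner answers correctly because it updated its state after the first mistake.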

<br>
<center>
    <img src="1.1.3.png" alt="Professor Notes" />
</center>
<br>

---

### Question:
Can you come up with a Learner/algorithm with mistake bound at most *n*?
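
One candidate answer, for checking your own attempt, is the standard elimination idea. The sketch below assumes the Learner knows that the target is a monotone disjunction on *n* variables. The class name `EliminationLearner` and the target x1 ∨ x3 are hypothetical choices for the demonstration.

```python
from itertools import product


class EliminationLearner:
    """Keeps the set of variables that might still appear in the target."""

    def __init__(self, n):
        self.candidates = set(range(n))       # 0-based indices; start with all n

    def predict(self, x):
        # Guess 1 if any remaining candidate variable is set in x.
        return int(any(x[i] == 1 for i in self.candidates))

    def update(self, x, true_label):
        # Called only after a mistake. If the true label is 0, no variable
        # that is set to 1 in x can be part of the target, so drop them all.
        if true_label == 0:
            self.candidates -= {i for i, bit in enumerate(x) if bit == 1}


n = 4


def target(x):
    return int(x[0] == 1 or x[2] == 1)        # x1 ∨ x3 (hypothetical target)


learner = EliminationLearner(n)
mistakes = 0
for x in product([0, 1], repeat=n):           # every challenge in {0,1}^n
    guess = learner.predict(x)
    if guess != target(x):
        mistakes += 1
        learner.update(x, 1 - guess)

print(mistakes, sorted(learner.candidates))
```

The candidate set always contains every variable of the target, so the learner can only err by guessing 1 when f(x) = 0. Each such mistake removes at least one candidate, which is why the number of mistakes is at most *n*.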

<br>

## &emsp;&emsp;Resources

**[On-line Algorithms in Machine Learning](C:\Users\laesc\OneDrive\Documents\college\ut%20austin\machine%20learning\4%20-%20resources\online-algorithms-in-machine-learning.pdf)** (Local link)

# **1.2 Decision Trees**

## &emsp;&emsp;Notes

A decision tree represents a Boolean function (it outputs true or false). Each internal node of the tree holds a literal, and each leaf holds a fixed value, which is the output.

The size of the decision tree will be the number of nodes in the tree. The depth (height) of the tree is equal to the length of the longest path from the root to a leaf.

*Note that for an input going into a decision tree, the x is referred to as a "challenge", and the y a "label".*

Topics:
- Heuristics for learning decision trees
- Theoretical properties

---

**Example input: X ∈ {0, 1}<sup>n</sup> (bit string of length n)**

The decision tree encodes some function f from {0, 1}<sup>n</sup> to {0, 1} as follows:

- At each node, the tree decides which branch to take based on the value of the literal, until it reaches the leaf.

The example decision tree's depth = 2, and size = 3.
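
A small sketch of one way to represent such a tree in Python follows. The tree built here is a hypothetical one, not necessarily the tree in the figure, and the code assumes that "size" counts only the nodes that hold literals (not the leaves), which is what makes a depth-2 tree have size 3.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: int                      # fixed output, 0 or 1


@dataclass
class Node:
    var: int                        # 0-based index of the queried variable
    if_zero: "Tree"                 # subtree followed when x[var] == 0
    if_one: "Tree"                  # subtree followed when x[var] == 1


Tree = Union[Leaf, Node]


def evaluate(tree, x):
    """Walk from the root to a leaf, branching on the queried bits of x."""
    while isinstance(tree, Node):
        tree = tree.if_one if x[tree.var] == 1 else tree.if_zero
    return tree.value


def size(tree):
    """Number of literal-holding nodes (leaves are not counted)."""
    if isinstance(tree, Leaf):
        return 0
    return 1 + size(tree.if_zero) + size(tree.if_one)


def depth(tree):
    """Length of the longest root-to-leaf path."""
    if isinstance(tree, Leaf):
        return 0
    return 1 + max(depth(tree.if_zero), depth(tree.if_one))


# Hypothetical tree: x1 at the root, then x2 on the left and x3 on the right.
example = Node(0, Node(1, Leaf(0), Leaf(1)), Node(2, Leaf(1), Leaf(0)))
print(size(example), depth(example), evaluate(example, (0, 1, 0)))
```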

<br>
<center>
    <img src="1.2.1.png" alt="Professor Notes" />
</center>
<br>

---

#### The machine learning problem:
- Given a set of labeled examples, build a tree with low error

<br>

---

<br>

**Տ** = training set, where Տ is a collection of strings and 0, 1 labels.

- So Տ is a collection of X's and y's, where X ∈ {0, 1}<sup>n</sup>, and y ∈ {0, 1}.
<br>

**Error Rate/Training Error/Empirical Error Rate** = (number of mistakes that T makes on Տ) / (size of Տ), where T is a decision tree.
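
As a quick sketch, the error rate can be computed directly from this definition. The function name `training_error` and the four-example set below are assumptions for illustration; any callable that maps a bit string to 0 or 1 can stand in for the tree T.

```python
def training_error(classifier, S):
    """Fraction of labeled examples (x, y) in S that the classifier gets wrong."""
    mistakes = sum(1 for x, y in S if classifier(x) != y)
    return mistakes / len(S)


# Hypothetical training set of (bit string, label) pairs.
S = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def T(x):
    return x[0]                    # stand-in classifier: output the first bit


print(training_error(T, S))
```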

<br>
<center>
    <img src="1.2.2.png" alt="Professor Notes" />
</center>
<br>

---

#### Natural approach for building decision trees:
- Given a set Տ

<br>

- Tree 1: Very simple, trivial tree
    - Tree is a leaf (we don't query any literals, always output 0 or 1)
    - How do we decide what to output?
        - Choose 1 or 0 depending on which label is more prevalent in the dataset
 <br>
     
- Tree 2: More advanced tree
    - Tree has one node, the root
    - How do we decide which literal to put at the root?
        - You want a literal at the root that is going to discriminate between zero and one labels

<br>
<center>
    <img src="1.2.3.png" alt="Professor Notes" />
</center>
<br>

<br>

---

#### So how do we decide which literal to put at the root?

Define a potential function Φ(a):
<br>&emsp;&emsp;*[English: phi of a]*

$$\Phi(a) = \min(a, 1-a)$$

<br>

---

So, for the trivial decision tree:

Before picking any literal *x<sub>i</sub>*, compute Φ(Pr<sub>(x, y)~Տ</sub> (y = 0)), which does not depend on any literal
<br>&emsp;&emsp;*[English: Compute phi of the probability that for an example we choose from Տ that y = 0]*

- Assume: 10 positive examples
- Assume: 5 negative examples
- What is Φ(Pr<sub>(x, y)~Տ</sub> (y = 0))?
    - 1/3
- *This* value is the error rate for the trivial decision tree: it outputs the majority label, so it errs exactly on the minority fraction of the examples.

$$ \Phi\left(\Pr_{(x, y) \sim Տ}[y = 0]\right) $$
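
The worked example above (10 positive and 5 negative examples) can be checked with a few lines of Python. The helper names and the placeholder bit strings are assumptions, and exact fractions are used so the result prints as 1/3.

```python
from fractions import Fraction


def phi(a):
    """Potential function: Phi(a) = min(a, 1 - a)."""
    return min(a, 1 - a)


def prob_label_zero(S):
    """Pr over (x, y) drawn uniformly from S that y = 0."""
    return Fraction(sum(1 for _, y in S if y == 0), len(S))


# 10 positive and 5 negative examples; the bit strings are placeholders.
S = [((1,), 1)] * 10 + [((0,), 0)] * 5

trivial_error = phi(prob_label_zero(S))
majority_label = 0 if prob_label_zero(S) > Fraction(1, 2) else 1
print(trivial_error, majority_label)
```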

<br>
<center>
    <img src="1.2.4.png" alt="Professor Notes" />
</center>
<br>

---

Looking at the tree with one node, pick a literal, *x<sub>1</sub>*, as the root node...

What label should be put on the first leaf?
- Condition on *x<sub>1</sub>* = 0 -> output the majority value

Then, for the second leaf...
- Condition on *x<sub>1</sub>* = 1 -> output the majority value

Meaning, for each option of the value of *x<sub>1</sub>*, we output the majority label for that value of *x<sub>1</sub>*.

<br> 

**What is the new error rate?**

It is a weighted average of the error of each of the new leaves. Explicitly written out, the error rate for the decision tree with one node is:

$$
\Pr_{(x, y) \sim Տ}[x_1 = 0] \cdot \Phi\left(\Pr_{(x, y) \sim Տ}[y = 0 \mid x_1 = 0]\right) +
\Pr_{(x, y) \sim Տ}[x_1 = 1] \cdot \Phi\left(\Pr_{(x, y) \sim Տ}[y = 0 \mid x_1 = 1]\right)
$$
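
A short sketch of this weighted average follows. The function `split_error` and the six-example training set are hypothetical, and an empty branch is treated as contributing nothing because its weight is zero.

```python
from fractions import Fraction


def phi(a):
    return min(a, 1 - a)


def prob_label_zero(S):
    return Fraction(sum(1 for _, y in S if y == 0), len(S))


def split_error(S, i):
    """Error rate of the one-node tree that queries variable i (0-based)."""
    total = Fraction(0)
    for b in (0, 1):
        branch = [(x, y) for x, y in S if x[i] == b]
        if branch:                                   # empty branch has weight 0
            weight = Fraction(len(branch), len(S))   # Pr[x_i = b]
            total += weight * phi(prob_label_zero(branch))
    return total


# Hypothetical training set with two variables.
S = [((0, 0), 0), ((0, 1), 0), ((0, 1), 1),
     ((1, 0), 1), ((1, 1), 1), ((1, 0), 1)]

print(split_error(S, 0), split_error(S, 1))
```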

---

**Gain(x<sub>1</sub>) = Old Rate - New Rate using x<sub>1</sub>**
<br>&emsp;&emsp;*[English: The gain of x<sub>1</sub> is the old error rate minus the new error rate using x<sub>1</sub>]*

This is the reduction in training error obtained by moving from the trivial decision tree to the decision tree with x<sub>1</sub> at the root. We define this quantity as Gain(x<sub>1</sub>).

<br>
<center>
    <img src="1.2.5.png" alt="Professor Notes" />
</center>
<br>

---

Now we can compute Gain(x<sub>i</sub>) for each literal, from x<sub>1</sub> to x<sub>n</sub>, find the literal that maximizes the gain, and place that literal at the root of our tree.
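
Choosing the root can be sketched as an argmax over the gains. The function names and the small training set are assumptions for illustration, and ties are broken in favor of the lowest index simply because that is how Python's `max` behaves.

```python
from fractions import Fraction


def phi(a):
    return min(a, 1 - a)


def leaf_error(S):
    """Error of the trivial tree on S: Phi(Pr[y = 0])."""
    return phi(Fraction(sum(1 for _, y in S if y == 0), len(S)))


def split_error(S, i):
    """Weighted error of the two leaves after querying variable i."""
    total = Fraction(0)
    for b in (0, 1):
        branch = [(x, y) for x, y in S if x[i] == b]
        if branch:
            total += Fraction(len(branch), len(S)) * leaf_error(branch)
    return total


def gain(S, i):
    """Old error rate minus the new error rate using variable i."""
    return leaf_error(S) - split_error(S, i)


def best_literal(S, n):
    """Index of the variable with the largest gain (first one on ties)."""
    return max(range(n), key=lambda i: gain(S, i))


# Hypothetical training set with two variables.
S = [((0, 0), 0), ((0, 1), 0), ((0, 1), 1),
     ((1, 0), 1), ((1, 1), 1), ((1, 0), 1)]

print([gain(S, i) for i in range(2)], best_literal(S, 2))
```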

Once we have done that, each branch works with a subset of the original set. If x<sub>1</sub> is the chosen literal, the left branch uses the training set Տ<sub>|x<sub>1</sub>=0</sub> *[English: Տ restricted to x<sub>1</sub>=0]*, and the right branch uses the training set Տ<sub>|x<sub>1</sub>=1</sub> *[English: Տ restricted to x<sub>1</sub>=1]*. 

We now have two different training sets, one for the left subtree and one for the right subtree. We repeat the process on each one, computing which literal belongs at the root of that subtree, and continue until the tree is complete.

Is this computationally feasible?

    It depends on the functions involved. Here the gain function is relatively easy to compute, but you must also consider how large a tree you want to build. A tree whose size is exponential in the number of features is not computationally feasible to build, so we need some sort of stopping criterion. The stopping criterion will be covered later.
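
The recursive procedure can be sketched as follows. Because the notes defer the stopping criterion, this sketch assumes a simple `max_depth` cutoff as a placeholder. The tuple-based tree encoding, the function names, and the target x1 ∨ x3 are likewise hypothetical.

```python
from collections import Counter
from itertools import product


def phi(a):
    return min(a, 1 - a)


def leaf_error(S):
    return phi(sum(1 for _, y in S if y == 0) / len(S))


def majority(S):
    return Counter(y for _, y in S).most_common(1)[0][0]


def split_error(S, i):
    total = 0.0
    for b in (0, 1):
        branch = [(x, y) for x, y in S if x[i] == b]
        if branch:
            total += len(branch) / len(S) * leaf_error(branch)
    return total


def build_tree(S, n, max_depth):
    """Return ('leaf', label) or ('node', i, left_subtree, right_subtree)."""
    # Assumed stopping rule: depth budget exhausted or all labels agree.
    if max_depth == 0 or leaf_error(S) == 0:
        return ("leaf", majority(S))
    best = max(range(n), key=lambda i: leaf_error(S) - split_error(S, i))
    left = [(x, y) for x, y in S if x[best] == 0]     # S restricted to x_best = 0
    right = [(x, y) for x, y in S if x[best] == 1]    # S restricted to x_best = 1
    if not left or not right:                         # the split separates nothing
        return ("leaf", majority(S))
    return ("node", best,
            build_tree(left, n, max_depth - 1),
            build_tree(right, n, max_depth - 1))


def predict(tree, x):
    while tree[0] == "node":
        _, i, left, right = tree
        tree = right if x[i] == 1 else left
    return tree[1]


# Hypothetical data: every 3-bit string labeled by x1 ∨ x3.
S = [(x, int(x[0] == 1 or x[2] == 1)) for x in product([0, 1], repeat=3)]
tree = build_tree(S, n=3, max_depth=3)
print(tree)
```

On this data every literal has zero gain at the root, since no single split lowers the training error, yet the finished tree still labels every example correctly.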

## &emsp;&emsp;Resources

**[Understanding Machine Learning: From Theory to Algorithms, Chapter 18](https://www.cs.huji.ac.il/~shais/UnderstandingMachineLearning/index.html)** (Internet link)

# **1.3 Generalization**

## &emsp;&emsp;Notes

### 1.3.0 Introduction to Generalization

#### **Generalization, or predictive power of a classifier.**

What we want to understand is how well the classifier we built will perform on data that it has not seen before. We would like to estimate this generalization error and understand when it will be small. This leads us to the PAC model of learning, which is a foundational model of learning.

- What is the "true error" or generalization error of a classifier?
- Decision trees:
    - Let us fix some tree T, built using the procedure described above.
    - And let us say we have a probability distribution D on new examples. 
    - For a new challenge x and its label y drawn from D, we want to know:

$$\Pr_{(x, y) \sim D}[T(x) \neq y]$$

<br>*[English: The probability that when we randomly draw a challenge from some unknown distribution D, T(x) does not equal y]*

We call this the **true error**, or generalization error, of T. Of course, we are hoping that this quantity is small.

<br>
<center>
    <img src="1.3.1.png" alt="Professor Notes" />
</center>
<br>

---

#### **So when might this quantity not be small?**

Let us imagine that we have a training set Տ, where we have challenges x<sup>1</sup> through x<sup>m</sup>, and labels y<sup>1</sup> through y<sup>m</sup>, where x<sup>i</sup>∈{0,1}<sup>n</sup> and y<sup>i</sup>∈{0,1}. Assume all x<sup>i</sup> are distinct.

A learner is given Տ and outputs a classifier that is an exact copy of the training set. In this case, the learner is memorizing the training set. No real learning has occurred, and unless Տ contains every possible input, this is a bad classifier.
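
To make this concrete, here is a small sketch of a memorizing classifier. Several things are assumptions added for the demonstration: the target x1 ∨ x3, the three training points, a uniform distribution over all 3-bit strings standing in for D, and the choice to output 0 on any input that was not memorized.

```python
from itertools import product


def memorizer(S, default=0):
    """Classifier that looks up x in the training set and otherwise guesses `default`."""
    table = {x: y for x, y in S}
    return lambda x: table.get(x, default)


def target(x):
    return int(x[0] == 1 or x[2] == 1)        # x1 ∨ x3 (hypothetical target)


S = [(x, target(x)) for x in [(0, 0, 0), (1, 0, 0), (0, 1, 0)]]
h = memorizer(S)

train_err = sum(h(x) != y for x, y in S) / len(S)

# True error under the assumed uniform distribution over {0,1}^3.
domain = list(product([0, 1], repeat=3))
true_err = sum(h(x) != target(x) for x in domain) / len(domain)

print(train_err, true_err)
```

The training error is zero, but the classifier is wrong on every unseen input whose true label is 1.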

<br>
<center>
    <img src="1.3.3.png" alt="Professor Notes" />
</center>
<br>

---

Let us again consider a case with training set Տ as described before. We can always build a decision tree, with size roughly the size of Տ, that is consistent with every point in Տ.

**Then the question is: *How well does this tree generalize?* or, *What is the true error of this tree?***

It will be pretty bad, qualitatively, because it is simply memorizing every entry in the training set.

In our decision trees, we only considered achieving a low training error. To create a robust tree that can handle real data, we also need to ask whether we are achieving a low true (generalization) error. We want a good combination of both.

---

#### **How can we estimate the true error of a decision tree?**

A **"hold-out"** or **"validation set"** is used for this purpose.

The idea is that we have Տ, the training set, and then we have some more data that we do not let the decision tree look at, called ᕼ, our hold-out set.

1. Use Տ to build a decision tree
2. Estimate the tree's true error via its error on ᕼ

By counting the number of mistakes that our tree makes on ᕼ, we compute the error rate of the tree on ᕼ and use it as the estimate of the true error. As long as ᕼ is sufficiently large, for any fixed tree we have built using Տ, its error on ᕼ will be very close to its true error, which is something you can prove.
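
The two steps can be sketched in a few lines. To keep the example short, a majority-label classifier (the trivial tree) stands in for a full decision tree. The function names, the synthetic data, the hold-out size of 12, and the fixed random seed are all assumptions.

```python
import random


def error_rate(classifier, examples):
    """Fraction of (x, y) pairs that the classifier labels incorrectly."""
    return sum(1 for x, y in examples if classifier(x) != y) / len(examples)


def split_holdout(data, holdout_size, seed=0):
    """Shuffle the data and set aside `holdout_size` examples as H."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    return shuffled[holdout_size:], shuffled[:holdout_size]    # (S, H)


def majority_classifier(S):
    """Stand-in for tree building: always output the majority label of S."""
    ones = sum(y for _, y in S)
    label = 1 if 2 * ones >= len(S) else 0
    return lambda x: label


# Synthetic labeled data: 3-bit strings labeled by x1 ∨ x3.
data = []
for i in range(40):
    x = (i % 2, (i // 2) % 2, (i // 4) % 2)
    data.append((x, int(x[0] == 1 or x[2] == 1)))

S, H = split_holdout(data, holdout_size=12)
T = majority_classifier(S)              # step 1: build the classifier from S only
estimate = error_rate(T, H)             # step 2: estimate the true error on H
print("training error:", error_rate(T, S))
print("hold-out estimate:", estimate)
```

The classifier never sees ᕼ while it is being built, which is what makes its error on ᕼ a fair estimate.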

It is important that you do not go back to the tree and modify it over and over to reduce error on ᕼ, because at that point you are now incorporating the validation set into the tree with Տ. This would be considered **over-fitting**.

It is also important to consider that hold-out sets can be expensive, especially if you create many models and larger models.

To combat this, we can use a technique called **cross-validation**, which allows you to reuse data that has been held out in a validation set. This will be covered later.

<br>
<center>
    <img src="1.3.4.png" alt="Professor Notes" />
</center>
<br>

### 1.3.1 Model Complexity

### 1.3.2 PAC Model of Learning

## &emsp;&emsp;Resources

**[]()** (Local link)