## Sections
* General Definitions
* Mistake-Bounded Learning
* Decision Trees and Potential Functions
* PAC Learning
* Cross-Validation
* Perceptron Learning
* Kernel Functions
* Linear Regression
* Gradient Descent
* Boosting
* Logistic Regression
* Singular Value Decomposition (SVD)
* Principal Component Analysis (PCA)
* Maximum Likelihood Estimation (MLE)
* Bayesian Inference

---

## General Definitions
* **C / $C$** = a general class of potential algorithms/models/functions.
* **On-line Learning** = a learning algorithm that learns from a stream of data. 
* **Off-line Learning** = a learning algorithm where the learner is fed a batch of examples all at once to train on.
* **Generalization Error** = the "true" error of a model when it is subjected to new observations.

---

## Mistake-Bounded Learning
* **Mistake-Bounded Learner** = on-line learning algorithm that has an upper bound on the number of mistakes it will make.
* Simple mistake-bounded learning problem:
    * Say we are presented with strings of bits $x \in \{0, 1\}^n$ and we are to learn a function $f \in C$ that is a monotonic disjunction of the bits in the string.
    * An example of a potential monotonic disjunction function $f \in C$ is: $f(x) = x_1 \lor x_3$. 
        * This function is given a bit string and returns $1$ if it is given a bit string where the first or third bit is $1$. 
    * We are looking for a mistake-bounded learner that will make at most $n$ mistakes before it finds the correct function $f$.
    * To solve this we start with the monotonic function $f(x) = x_1 \lor x_2 \lor ... \lor x_n$ which returns $1$ in all cases except for the case that the entire bit string is $0$.
    * If the learner receives $0110010$ it will guess $1$. If it is told this guess is wrong, it has learned that $x_2$, $x_3$, and $x_6$ are not in the monotonic disjunction, and it removes them from $f$. The learner then takes a new bit string and guesses again.
    * With each new bit string (new challenge) the learner either guesses correctly or makes a mistake that eliminates at least one bit from $f$. Bits that belong to the true disjunction are never removed, so the learner can only err by guessing $1$ when the label is $0$. Thus, in the worst case the learner makes $n$ mistakes before it has learned the true function $f$, where $n$ is the length of the bit string.
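
The elimination procedure above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name, the zero-based indexing, and the example target $f(x) = x_1 \lor x_4$ are all assumptions made for the example.

```python
def learn_monotone_disjunction(examples, n):
    """Online elimination learner for monotone disjunctions over n bits.

    `examples` is an iterable of (x, y) pairs, where x is a tuple of n bits
    and y is the true label f(x). Returns the surviving variable indices and
    the number of mistakes made along the way.
    """
    hypothesis = set(range(n))  # start with x_1 OR x_2 OR ... OR x_n
    mistakes = 0
    for x, y in examples:
        guess = int(any(x[i] for i in hypothesis))
        if guess != y:
            mistakes += 1
            # The hypothesis always contains every true variable, so the only
            # possible mistake is guessing 1 when the label is 0. Every bit
            # set in x therefore cannot be part of the true disjunction.
            hypothesis -= {i for i in range(n) if x[i]}
    return hypothesis, mistakes


# Hypothetical target: f(x) = x_1 OR x_4 (indices 0 and 3 when zero-based).
def target(x):
    return int(x[0] or x[3])


stream = [
    (0, 1, 1, 0, 0, 1, 0),  # label 0: removes x_2, x_3, x_6
    (0, 0, 0, 0, 1, 0, 1),  # label 0: removes x_5, x_7
    (1, 0, 0, 0, 0, 0, 0),  # label 1: guessed correctly
]
learned, mistakes = learn_monotone_disjunction(
    [(x, target(x)) for x in stream], n=7
)
```

Each mistake shrinks the hypothesis by at least one variable, which is where the bound of $n$ mistakes comes from.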

---

## Decision Trees and Potential Functions
* **Size** = number of nodes in the tree.
* **Depth** = length of the longest path from the root to a leaf.
* The question of what feature to differentiate on at the root and subsequent leaves is decided by the **potential function** ($\phi(a)$) of the tree.
    * The potential function also tells us what the error rate on the training set will be if we differentiate on a particular feature.
* **Gini Function** = a common potential function of the form: $\phi(a) = 2 \cdot a \cdot (1 - a)$
    * This particular potential function is differentiable which is a nice property to have.
* **Trivial Tree** = a tree that has a depth of $0$ and only one node. The error rate of the trivial tree where labels are binary is calculated as:

$$\phi(\Pr[NEG])$$ 

* This is simply the potential function applied to the fraction of examples in the training set that are labeled negative ($0$).

<br>

* The general process for selecting the feature to differentiate on if inputs are binary:
    * For the root, compute the potential function $\phi(a)$ of the tree for each potential feature, then select the feature with the lowest error rate.
    * For subsequent leaves, compute the potential function $\phi(a)$ conditional on all previously selected features, then select the feature that maximizes the **gain** (previous best error rate - error rate with the new node).
* Given a tree with three features $X$, $Y$, $Z$, one would find the root feature by computing the following for all three features and taking the feature with the lowest result (lowest error rate).

$$\Pr[X = 0] \cdot \phi(\Pr[NEG | X = 0]) + \Pr[X = 1] \cdot \phi(\Pr[NEG | X = 1])$$

* This is just the probability of the feature being $0$ or $1$ times the error rate conditional on the feature being $0$ or $1$.
* If we select feature $Z$ to put at the root, then we can compute the left child (the root = $1$ path) of the root by computing the following for features $X$ and $Y$ and selecting the feature with the largest gain over our current error rate.

$$\Pr[X = 0 | Z = 1] \cdot \phi(\Pr[NEG | X = 0, Z = 1]) + \Pr[X = 1 | Z = 1] \cdot \phi(\Pr[NEG | X = 1, Z = 1])$$

* This is the same formula as the root, but we are now computing the error rate conditional on $Z = 1$.
* Two methods for knowing when to stop growing a tree:
    * When the gain is less than a small threshold stop adding new nodes.
    * **Pruning** = build an enormous tree and then prune the nodes with the least gain until you get to a desired number of nodes.
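
As a rough sketch of the root-selection step, the snippet below scores each binary feature with the Gini potential and picks the one with the lowest weighted value. The helper names (`gini`, `split_error`, `best_root`) and the four-row toy dataset are assumptions made for illustration; labels use $0$ for negative and $1$ for positive.

```python
def gini(a):
    """Gini potential function: phi(a) = 2 * a * (1 - a)."""
    return 2 * a * (1 - a)


def split_error(rows, labels, feature):
    """Weighted potential after splitting on a binary feature.

    Computes Pr[X=0] * phi(Pr[NEG | X=0]) + Pr[X=1] * phi(Pr[NEG | X=1]).
    """
    total = len(rows)
    error = 0.0
    for value in (0, 1):
        subset = [y for row, y in zip(rows, labels) if row[feature] == value]
        if not subset:
            continue  # an empty branch contributes nothing
        frac_neg = subset.count(0) / len(subset)
        error += (len(subset) / total) * gini(frac_neg)
    return error


def best_root(rows, labels, features):
    """Feature with the lowest weighted potential."""
    return min(features, key=lambda f: split_error(rows, labels, f))


# Toy data (made up): the label happens to equal feature Z.
rows = [
    {"X": 0, "Y": 0, "Z": 0},
    {"X": 0, "Y": 1, "Z": 1},
    {"X": 1, "Y": 0, "Z": 1},
    {"X": 1, "Y": 1, "Z": 0},
]
labels = [0, 1, 1, 0]

trivial_error = gini(labels.count(0) / len(labels))
root = best_root(rows, labels, ["X", "Y", "Z"])
gain = trivial_error - split_error(rows, labels, root)
```

For child nodes the same function can be reused by first filtering `rows` and `labels` down to the examples that reach that node, which is what conditioning on $Z = 1$ amounts to.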

<br>

### Representing Trees in Polynomial Form

Take a simple decision tree with just one root $x_1$:

<pre>

                    x1
                   /  \
                  1   -1

</pre>

This decision tree can be represented in polynomial form as:
$$p(x) = 1 \cdot (1 + x_1) \cdot \frac{1}{2}$$
$$p(x) = -1 \cdot (1 - x_1) \cdot \frac{1}{2}$$

Note that in the top polynomial, which represents the left branch, if $x_1 = 1$ then the polynomial evaluates to 1 and if $x_1 = -1$ then the polynomial evaluates to 0. The polynomials representing all of the branches follow the same pattern: the desired label times a polynomial that evaluates to 1 if the given branch is taken and to 0 if it is not. Summing the branch polynomials gives the polynomial for the whole tree, which here simplifies to $p(x) = x_1$.

For a more complicated tree such as:

<pre>

                     x1
                    /  \
                  x2    -1
                 /  \
                1   -1

</pre>


The path from $x_1$ to $x_2$ to 1, which can be thought of as $p(1, 1) = 1$, can be represented in polynomial form as:

$$p(x) = 1 \cdot (1 + x_1) \cdot (1 + x_2) \cdot \frac{1}{4}$$

Note that if either $x_1$ or $x_2$ is -1 instead of 1, this polynomial evaluates to 0, representing that another path on the tree was taken. I am not going to write out the other two polynomials for this tree as I believe the above examples demonstrate the pattern.
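
To check the pattern numerically, here is a small sketch that sums one polynomial per path of the two-level tree. It assumes, as in the first example, that the left branch is taken when a variable equals $1$; the two path terms not written out above are my own fill-in following that pattern, and the function name is hypothetical.

```python
def tree_polynomial(x1, x2):
    """Sum of the path polynomials for the two-level tree, inputs in {-1, +1}."""
    left_left = 1 * (1 + x1) * (1 + x2) / 4    # x1 = 1, x2 = 1   -> label 1
    left_right = -1 * (1 + x1) * (1 - x2) / 4  # x1 = 1, x2 = -1  -> label -1
    right = -1 * (1 - x1) / 2                  # x1 = -1          -> label -1
    return left_left + left_right + right


# Evaluate the polynomial on every input in {-1, +1}^2.
outputs = {
    (x1, x2): tree_polynomial(x1, x2)
    for x1 in (1, -1)
    for x2 in (1, -1)
}
```

Exactly one path term is non-zero for any input, so the sum always equals the label at the leaf that the input reaches.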


---

## PAC Learning
* Introduced by *Valiant* (1984).
* **PAC Learning** (Probably Approximately Correct) = a theoretical framework in which a learner outputs a hypothesis (algorithm) $h$ that has an error rate of at most $\epsilon$ on new examples. Additionally, we want high confidence, $1 - \delta$, that the error rate is this small.
    * PAC Learning asks whether it will be possible to come up with a hypothesis $h$ from a function class $C$ that will be acceptably accurate on new examples and we can be acceptably confident in that accuracy.
* A finite function class $C$ is PAC learnable if you can show that the number of examples $m$ needed to learn it with an accepted error rate $\epsilon$ and an accepted failure probability $\delta$ is:

$$m = O(\frac{1}{\epsilon} \cdot (\log |C| + \log \frac{1}{\delta}))$$

### Formal example of PAC Learning:
* Consider an underlying data distribution $D$ that is $(0,1)^n$ on the real number line.
* We have a function class $C$ of decision trees and a target function $c \in C$, an unknown decision tree that we want to learn.
* The learner receives an input $(x, y)$ where $x$ is drawn from $D$ and $y = c(x)$.
    * $y_i$ (the label) is always equal to $c(x_i)$ because $c$ is the true function we are trying to learn.
* The goal is to output $h \in C$, a decision method (hypothesis $h$) within the function class $C$ (a decision tree), with the property: $\Pr_{x \sim D} [h(x) \neq c(x)] \leq \epsilon$ 
    * When we draw an example $x$ from $D$, we want our guess $h(x)$ to be incorrect at most an $\epsilon$ fraction of the time (the error rate). 
* When can we PAC learn a function class?
    * Consider an algorithm $A$ that takes a training set $S$ and returns a decision tree $h \in C$ that is consistent with $S$.
    * Given $A$, how can we PAC learn $C$, the function class that the true tree $c$ is taken from?
    * The bad outcome is that we output a tree $h$ that is consistent with $S$ but has a true error > $\epsilon$.
    * For example, assume there is a tree $h_1$ that has true error > $\epsilon$ (bad). What is the probability that $h_1$ is consistent with $S$?
    * The probability is $\leq (1 - \epsilon)^m$ where $m$ is the number of examples in $S$, because each independently drawn example agrees with $h_1$ with probability less than $1 - \epsilon$. This is the probability of drawing a training set $S$ from the distribution $D$ that only has "misleading" examples. 
    * Given that there are at most $|C|$ possible functions to consider ($h_1, h_2, ..., h_{|C|}$), the union bound says the probability of this bad outcome occurring is at most $|C| \cdot (1 - \epsilon)^m$.
    * This is the only bad event that can occur: we draw a sample $S$ that leads us to a tree $h$ that is consistent with $S$ (the tree works well on the examples in $S$) but that has a true error rate > $\epsilon$. Our goal is to make sure that the probability of this bad outcome is $\leq \delta$: 
    
    $$|C| \cdot (1 - \epsilon)^m \leq \delta$$

    * Using the inequality $1 - x \leq e^{-x}$ we can solve for $m$, how many examples should be in our training set: 
    
    $$m \geq \frac{1}{\epsilon} \cdot (\log |C| + \log \frac{1}{\delta})$$

    * This bound allows us to make the statement: if you choose a number of training samples $m$ that satisfies it, then with probability at least $1 - \delta$ ($\delta$ being our accepted failure rate) we will output a tree $h$ that is $1 - \epsilon$ accurate ($\epsilon$ being the error rate of the tree). 
    * One nice property is that the bound depends only on the log of the size of the function class $C$, thus we can have extremely large function classes and still be able to learn them from a polynomial number of examples.
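
As a worked example of the sample-size calculation: solving $|C| \cdot (1 - \epsilon)^m \leq \delta$ with $1 - x \leq e^{-x}$ gives $m \geq \frac{1}{\epsilon}(\ln |C| + \ln \frac{1}{\delta})$. The sketch below evaluates that bound; the function name and the example values ($|C| = 2^{20}$, $\epsilon = 0.05$, $\delta = 0.01$) are made up for illustration.

```python
import math


def pac_sample_size(class_size, epsilon, delta):
    """Smallest integer m with m >= (ln|C| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(class_size) + math.log(1 / delta)) / epsilon)


class_size, epsilon, delta = 2**20, 0.05, 0.01
m = pac_sample_size(class_size, epsilon, delta)

# Probability that some tree with true error > epsilon survives m examples,
# bounded by the union bound |C| * (1 - epsilon)^m.
bad_event_bound = class_size * (1 - epsilon) ** m
```

Squaring the class size only adds another $\ln |C|$ to the numerator, which is the "log of the class size" property in action.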



---

## Cross-Validation

**Markov's Inequality**
* "The probability that a non-negative random variable is at least $K$ times its mean is $\leq \frac{1}{K}$."
* Given a random variable $X$ that takes only non-negative values, where $K > 0$ is a constant and $E[X]$ is the mean of $X$:

$$Pr[X \geq K \cdot E[X]] \leq \frac{1}{K}$$

* If $X$ has $\mu = 4$, the probability of sampling a value of 8 or more ($K = 2$) is $\leq \frac{1}{2}$.

**Chebyshev's Inequality**
* "When we sample from a random variable $X$, the probability that the sample deviates from the mean $\mu$ by more than $t$ standard deviations $\sigma$ is $\leq \frac{1}{t^2}$."

$$Pr[|X - \mu| > t \cdot \sigma] \leq \frac{1}{t^2}$$

* If $X$ has $\mu = 4$ and $\sigma = 2$, the probability of sampling a value more than 4 away from the mean (above 8 or below 0, so $t = 2$) is $\leq \frac{1}{2^2} = \frac{1}{4}$.
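
To see how loose these two bounds can be, the sketch below compares them with exact tail probabilities. The choice of a Poisson distribution with rate 4 is my own assumption, picked only because it has $\mu = 4$ and $\sigma = 2$ like the examples above; the notes do not specify a distribution.

```python
import math


def poisson_pmf(k, lam):
    """Pr[X = k] for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam = 4.0
mu, sigma = lam, math.sqrt(lam)  # mean 4, standard deviation 2

# Markov with K = 2: Pr[X >= 8] <= 1/2.
p_at_least_8 = 1 - sum(poisson_pmf(k, lam) for k in range(8))
markov_bound = 1 / 2

# Chebyshev with t = 2: Pr[|X - 4| > 4] <= 1/4.
# X is never negative, so the event reduces to X > 8.
p_more_than_8 = 1 - sum(poisson_pmf(k, lam) for k in range(9))
chebyshev_bound = 1 / 2**2
```

Both inequalities hold for this distribution with plenty of room to spare, which is typical: they use only the mean (Markov) or the mean and variance (Chebyshev).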

**Chernoff Bound** 
* Say that we have a set of independent random variables $X_1, X_2, ..., X_n$ that each take values in $[0, 1]$ and each have an expected value of $E[X_i] = p$.
* We can define the expected sum of all the random variables $\mu$, which due to linearity of expectation is equal to the sum of the expected values of each random variable.
    * $S = \sum_{i=1}^{n} X_i$, the sum of the random variables, and $\mu = E[S] = \sum_{i=1}^{n} E[X_i] = np$.
    * This is simply multiplying the number of random variables $n$ by the expected value of each random variable $p$.
* The Chernoff Bound says:

$$Pr[|S - \mu| > \delta n] \leq 2e^{-2n \delta^2}$$

* The probability that the sum of the random variables deviates from the expected sum $\mu$ by more than $\delta n$ is $\leq 2e^{-2n \delta^2}$.
* This is akin to saying that the probability that you deviate from $\mu$ is exponentially small in $n$: the sum of many independent random variables with the same $p$ will not deviate much from its expectation.
* Holds for any independent random variables with a bounded range; they do not have to follow a normal distribution.
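
A quick way to sanity-check the bound is to compare it with an exact tail probability. The sketch below assumes the $X_i$ are independent coin flips with bias $p$, so that $S$ is binomial; that distribution, the function names, and the values $p = 0.3$ and $\delta = 0.1$ are assumptions made for the example.

```python
import math


def binomial_tail(n, p, delta):
    """Exact Pr[|S - n*p| > delta*n] for S ~ Binomial(n, p)."""
    mu = n * p
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n + 1)
        if abs(k - mu) > delta * n
    )


def chernoff_bound(n, delta):
    """Right-hand side of the bound: 2 * exp(-2 * n * delta^2)."""
    return 2 * math.exp(-2 * n * delta**2)


# (n, exact tail, bound) for a few sample sizes.
comparison = [
    (n, binomial_tail(n, p=0.3, delta=0.1), chernoff_bound(n, delta=0.1))
    for n in (10, 100, 200)
]
```

For small $n$ the bound exceeds 1 and says nothing; it only becomes useful once $n \delta^2$ is reasonably large.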

### Applying the Chernoff Bound to Estimating the True Error of a Classifier
* We have a hold-out set $S$ of size $n$.
* Fix $h$ (our tree) that was trained on a training set.
* Recall there is an underlying distribution $D$ from which we draw $S$.
* We are interested in the quantity: $Z = \Pr_{x \sim D} [h(x) \neq c(x)]$
    * Where $c$ is the true function we are trying to learn. $Z$ is the true error of the classifier, which is a fixed number once $h$ is fixed.
* Let $X_i$ be a random variable that is 1 if $h$ is incorrect on the $i^{th}$ element of S and 0 otherwise.
* $S = \sum_{i=1}^{n} X_i$ and $E[S] = n \cdot p$ where $p$ is the true error rate.
* Say that we set $\delta = 0.1$. Plugging this into the Chernoff Bound we get a right side of:
    * $2e^{-2n \delta^2} = 2e^{-2n \cdot 0.1^2} = 2e^{-0.02n}$
    * Say that we want this "confidence metric" to be less than $\alpha$. How large does $n$ need to be before we satisfy this? Solving for $n$ (with the natural log) we get:
        * $n > 50 \cdot \log(\frac{2}{\alpha})$
    * $\delta = 0.1$ says I want the error rate on the hold-out set to be within 0.1 of the true error rate, and $\alpha = 0.1$ says that I want to be 90% confident about this. With these values, $n > 50 \cdot \log(20) \approx 150$.
    * $\alpha$ says the probability that our hold-out error rate fails to be within $\delta = 0.1$ of the true error rate is at most 0.1.
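
The hold-out size calculation can be written as a one-line function. Solving $2e^{-2n\delta^2} \leq \alpha$ for $n$ gives $n \geq \frac{\ln(2/\alpha)}{2\delta^2}$; the function name below is hypothetical.

```python
import math


def holdout_size(delta, alpha):
    """Smallest integer n with 2 * exp(-2 * n * delta**2) <= alpha."""
    return math.ceil(math.log(2 / alpha) / (2 * delta**2))


# Within 0.1 of the true error rate, with 90% confidence.
n = holdout_size(delta=0.1, alpha=0.1)
failure_bound = 2 * math.exp(-2 * n * 0.1**2)
```

Halving $\delta$ roughly quadruples the required hold-out size, while tightening $\alpha$ only costs a logarithmic factor.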

### Cross-Validation
* Hold-out is expensive:
    * Data is expensive.
    * If we want to compare multiple methods for generating classifiers on the same hold-out set, we quickly lose confidence in our estimates. 
* Cross-validation works really well in practice, but there is little theory to explain why it performs so well.
* Cross-validation:
    * Take your entire dataset and break it into folds.
    * First, leave out Fold 1 and train using Folds 2 through Fold K. Then test the classifier on Fold 1.
    * Then hold out Fold 2 and train on the rest of the folds ...
    * This process will occur K times, then you average the error rates of the K classifiers. This average is the estimate of the true error rate.
* Between 5 and 10 folds is typically used.
* It is very hard to prove anything about cross-validation because the classifiers re-trained over the folds share training data and are not independent.
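
The fold procedure above can be sketched in plain Python. Everything named here is an assumption made for illustration: the `k_fold_error` helper, the majority-label "learner", and the toy dataset. In practice the folds would usually be shuffled first.

```python
def k_fold_error(data, k, train, error_rate):
    """Average held-out error over k folds.

    `train(examples)` returns a classifier; `error_rate(clf, examples)`
    returns the fraction of `examples` the classifier gets wrong.
    """
    folds = [data[i::k] for i in range(k)]  # fold i takes every k-th example
    errors = []
    for i in range(k):
        held_out = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        clf = train(training)
        errors.append(error_rate(clf, held_out))
    return sum(errors) / k


# Hypothetical learner: always predict the majority label of the training set.
def train_majority(examples):
    labels = [y for _, y in examples]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority


def error_rate(clf, examples):
    return sum(clf(x) != y for x, y in examples) / len(examples)


# Toy data: 20 examples, label 0 for every fourth one and 1 otherwise.
data = [(i, 1 if i % 4 else 0) for i in range(20)]
estimate = k_fold_error(data, k=5, train=train_majority, error_rate=error_rate)
```

Every example is used for testing exactly once and for training $K - 1$ times, which is why no separate hold-out set is needed.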

---

![](./NoteFiles/ML1.png)
![](./NoteFiles/ML2.png)
![](./NoteFiles/ML3.png)
![](./NoteFiles/ML4.png)
![](./NoteFiles/ML5.png)
![](./NoteFiles/ML6.png)
![](./NoteFiles/ML7.png)
![](./NoteFiles/ML8.png)
![](./NoteFiles/ML9.png)
![](./NoteFiles/ML10.png)

---

![](./NoteFiles/ML11.png)
![](./NoteFiles/ML12.png)
![](./NoteFiles/ML13.png)
![](./NoteFiles/ML14.png)
![](./NoteFiles/ML15.png)

---

![](./NoteFiles/ML16.png)
![](./NoteFiles/ML17.png)
![](./NoteFiles/ML18.png)
![](./NoteFiles/ML19.png)
![](./NoteFiles/ML20.png)
![](./NoteFiles/ML21.png)
![](./NoteFiles/ML22.png)

---

![](./NoteFiles/ML23.png)
![](./NoteFiles/ML24.png)
![](./NoteFiles/ML25.png)
![](./NoteFiles/ML26.png)
![](./NoteFiles/ML27.png)
![](./NoteFiles/ML28.png)
![](./NoteFiles/ML29.png)
![](./NoteFiles/ML30.png)

---

![](./NoteFiles/ML31.png)
![](./NoteFiles/ML32.png)
![](./NoteFiles/ML33.png)
![](./NoteFiles/ML34.png)
![](./NoteFiles/ML35.png)

---

![](./NoteFiles/ML36.png)
![](./NoteFiles/ML37.png)
![](./NoteFiles/ML38.png)
![](./NoteFiles/ML39.png)
![](./NoteFiles/ML40.png)
![](./NoteFiles/ML41.png)
![](./NoteFiles/ML42.png)
![](./NoteFiles/ML43.png)

---

![](./NoteFiles/ML44.png)
![](./NoteFiles/ML45.png)
![](./NoteFiles/ML46.png)
![](./NoteFiles/ML47.png)
![](./NoteFiles/ML48.png)
![](./NoteFiles/ML49.png)
![](./NoteFiles/ML50.png)
![](./NoteFiles/ML51.png)

---

![](./NoteFiles/ML52.png)
![](./NoteFiles/ML53.png)
![](./NoteFiles/ML54.png)
![](./NoteFiles/ML55.png)
![](./NoteFiles/ML56.png)
![](./NoteFiles/ML57.png)
![](./NoteFiles/ML58.png)
![](./NoteFiles/ML59.png)
![](./NoteFiles/ML60.png)


---