# 1. Learning Problem

1. $E_{in}$: the error measured on the data points used for training
2. $E_{out}$: the error measured over the entire input space
3. $E_{test}$: the error measured on a held-out test set. As the test set grows, $E_{test}$ gets close to $E_{out}$.


(From later study: because the final hypothesis is fixed before the test set is used, the single-hypothesis Hoeffding Inequality applies to $E_{test}$, which gives a much tighter bound than the one for $E_{in}$. $E_{test}$ is therefore a good estimate of $E_{out}$ when the test set is large. But a good estimate does not improve the hypothesis itself. The trade-off is that every point set aside for testing is one fewer point for training, and a smaller training set tends to produce a worse hypothesis.)

## 1.1 Hoeffding Inequality


For a single hypothesis $h$, the Hoeffding Inequality states:

$$P(\mid E_{in}(h) - E_{out}(h) \mid > \epsilon) \leq 2e^{-2\epsilon^2N}$$

for any $\epsilon > 0$, where $N$ is the number of training examples.

The hypothesis $h$ must be fixed before the dataset is generated. If $h$ is chosen or changed after seeing the data, the Hoeffding Inequality no longer holds for it.
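A minimal simulation sketch of the single-hypothesis case, assuming the usual bin analogy: a fixed $h$ behaves like a coin whose probability of error is $\mu = E_{out}(h)$, and $E_{in}(h)$ is the fraction of errors $\nu$ in $N$ draws. The names `hoeffding_bound` and `violation_rate` and the values of `mu`, `n`, and `eps` are illustrative choices, not part of the notes.

```python
import math
import random

def hoeffding_bound(eps, n, m=1):
    """Right-hand side 2*M*exp(-2*eps^2*N), capped at 1 because it bounds a probability."""
    return min(1.0, 2 * m * math.exp(-2 * eps**2 * n))

def violation_rate(mu, n, eps, trials, rng):
    """Fraction of trials with |nu - mu| > eps, where nu is the mean of n Bernoulli(mu) draws."""
    bad = 0
    for _ in range(trials):
        nu = sum(rng.random() < mu for _ in range(n)) / n
        if abs(nu - mu) > eps:
            bad += 1
    return bad / trials

rng = random.Random(0)
# eps is chosen so that no possible sample mean k/n lands exactly on the boundary mu +/- eps.
mu, n, eps = 0.4, 100, 0.125
rate = violation_rate(mu, n, eps, trials=5000, rng=rng)
bound = hoeffding_bound(eps, n)
print(f"empirical violation rate: {rate:.4f}, Hoeffding bound: {bound:.4f}")
```

The empirical rate should come out well below the bound, which is expected: Hoeffding holds for every $\mu$ and is not tight for any particular one.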

With multiple hypotheses $H = \{h_1, h_2, h_3, ..., h_M\}$, the algorithm picks the final hypothesis $g$ based on the data, so $g$ is chosen after the dataset is generated. To get a bound that does not depend on which hypothesis ends up being chosen, we use the fact that $g$ is always one of the $h_m$:

$$\mid E_{in}(g) - E_{out}(g) \mid > \epsilon \ \Longrightarrow \ \mid E_{in}(h_1) - E_{out}(h_1) \mid > \epsilon \ \text{ or } \mid E_{in}(h_2) - E_{out}(h_2) \mid > \epsilon \ \text{ or } ... \text{ or } \mid E_{in}(h_M) - E_{out}(h_M) \mid > \epsilon$$

The probability of the event on the left is therefore at most the probability of the union of the events on the right, and by the Union Bound the probability of a union is at most the sum of the individual probabilities:

$$P(\mid E_{in}(g) - E_{out}(g) \mid > \epsilon) \leq P\Big(\bigcup_{m=1}^{M} \{\mid E_{in}(h_m) - E_{out}(h_m) \mid > \epsilon\}\Big) \leq \sum_{m=1}^{M} P(\mid E_{in}(h_m) - E_{out}(h_m) \mid > \epsilon)$$

Each term in the sum is bounded by the single-hypothesis Hoeffding Inequality.

Thus,

$$P(\mid E_{in}(g) - E_{out}(g) \mid > \epsilon) \leq 2Me^{-2\epsilon^2N}$$
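A small sketch of why the factor $M$ is needed, assuming the standard many-coins analogy: each of $M$ fair coins plays the role of one hypothesis with $\mu = 0.5$, each is flipped $N$ times, and the "algorithm" picks the coin with the fewest heads. The function name and the values of `M`, `N`, and `eps` are illustrative assumptions.

```python
import math
import random

def min_coin_violation_rate(n_coins, n_flips, eps, trials, rng):
    """Fraction of trials where the coin with the fewest heads has |nu_min - 0.5| > eps."""
    bad = 0
    for _ in range(trials):
        # The n_flips flips of one fair coin are the bits of one random integer.
        fewest_heads = min(
            bin(rng.getrandbits(n_flips)).count("1") for _ in range(n_coins)
        )
        nu_min = fewest_heads / n_flips
        if abs(nu_min - 0.5) > eps:
            bad += 1
    return bad / trials

rng = random.Random(0)
M, N, eps = 1000, 10, 0.35
single = 2 * math.exp(-2 * eps**2 * N)   # bound for one coin fixed in advance
union = min(1.0, M * single)             # bound that accounts for all M coins
rate = min_coin_violation_rate(M, N, eps, trials=200, rng=rng)
print(f"violation rate of the selected coin: {rate:.3f}")
print(f"single-hypothesis bound: {single:.3f}, union bound: {union:.3f}")
```

The coin selected after looking at the data violates the single-hypothesis bound, while the union bound still holds (here it is vacuous because $M$ is large relative to $N$).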

## 1.2 Feasibility of Learning

1. Can we make sure that $E_{out}(g)$ is close enough to $E_{in}(g)$ ?
2. Can we make $E_{in}(g)$ small enough?

The Hoeffding Inequality addresses the first question only. The second question is answered after we run the learning algorithm on the actual data and see how small we can get $E_{in}(g)$ to be.

These two questions give further insight into the role of "complexity", which leads to the discussion of the generalization tradeoff:

The more complex $H$ is, the bigger $M$ becomes, so we run more risk that $E_{in}(g)$ will be a poor estimator of $E_{out}(g)$.

But we stand a better chance that $E_{in}(g)$ can be made small, since $g$ has to come from $H$ and a richer $H$ offers more candidates to fit the data.

The complexity of $f(x)$ does not appear in the Hoeffding Inequality, but $E_{in}$ is likely to be worse when $f(x)$ is more complex, because a complex target is harder to fit.


## 1.3 Noise

When there is no noise, the target is the true function $f(x)$ itself.

BUT when noise exists, the target is formally no longer a function. Instead we use a target distribution $P(y\mid x)$, and each data point $(x, y)$ is generated from the joint distribution $P(x,y) = P(y\mid x)P(x)$.

The noisy target is composed of two things:

1. deterministic target: $f(x) = E(y\mid x)$
2. random noise: $y - f(x)$

The deterministic target can also be viewed as a special case of a noisy target where noise = 0. In other words, we can express $f$ as a distribution $P(y\mid x)$ by choosing $P(y\mid x)$ to be zero everywhere except at $y = f(x)$.


Even though we don't lose any generality by considering the target to be a distribution rather than a function, $E_{in}(g)$ is likely to be worse when noise is present.

# 2. Theory of Generalization

## 2.1 Starting From the Hoeffding Inequality

$$P(\mid E_{in}(g) - E_{out}(g) \mid > \epsilon) \leq 2Me^{-2\epsilon^2N}$$

If we pick a tolerance level $\delta$, then with probability at least $1 - \delta$:

$$ E_{out}(g) \leq E_{in}(g)  + \sqrt{\frac{1}{2N}\ln{\frac{2M}{\delta}}}$$
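
As a worked step (a sketch of the algebra, using only the symbols already defined), the error bar comes from setting the right-hand side of the Hoeffding bound equal to $\delta$ and solving for $\epsilon$:

$$\delta = 2Me^{-2\epsilon^2N} \ \Longrightarrow \ \ln{\frac{2M}{\delta}} = 2\epsilon^2N \ \Longrightarrow \ \epsilon = \sqrt{\frac{1}{2N}\ln{\frac{2M}{\delta}}}$$

The event $\mid E_{in}(g) - E_{out}(g) \mid \leq \epsilon$ then holds with probability at least $1 - \delta$, and keeping only the upper side gives the bound on $E_{out}(g)$.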

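**BUT** this is a very loose bound, not only because in real life $M$ is usually $\infty$, but also because many hypotheses in $H$ overlap heavily, so the union bound counts the same bad events many times.

So, we want to replace $M$ with a **Growth Function** that counts the effective number of hypotheses.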

## 2.2 Effective Number of Hypotheses

The growth function $m_H(N)$ is the maximum number of dichotomies that $H$ can generate on any $N$ points. Since $N$ points have only $2^N$ possible dichotomies:

$$m_H(N) \leq 2^N$$

1. shatter: If $H$ is capable of generating all possible dichotomies on $x_1, ..., x_N$, then $H$ can shatter $(x_1, ..., x_N)$, and $m_H(N) = 2^N$.

2. break point: If *no data set of size $k$* can be shattered by $H$, then $k$ is said to be a break point for $H$:

$$m_H(k) < 2^k$$

It can also be proved that if $m_H(k) < 2^k$, then for all $N$:

$$m_H(N) \leq \sum_{i=0}^{k-1}{N\choose i} \leq N^{k-1} + 1$$
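
The bound can be checked by brute force on two simple hypothesis sets. This is a sketch under assumptions: positive rays ($h(x) = +1$ iff $x > a$) and positive intervals ($h(x) = +1$ iff $l < x < r$) on points on a line are illustrative choices not discussed in these notes, and all function names are hypothetical.

```python
from math import comb

def dichotomies_positive_rays(n):
    """All labelings of n sorted points produced by h(x) = +1 iff x > a."""
    # Placing the threshold in each of the n + 1 gaps gives every distinct labeling.
    return {tuple(+1 if i >= cut else -1 for i in range(n)) for cut in range(n + 1)}

def dichotomies_positive_intervals(n):
    """All labelings of n sorted points produced by h(x) = +1 iff l < x < r."""
    return {
        tuple(+1 if lo <= i < hi else -1 for i in range(n))
        for lo in range(n + 1)
        for hi in range(lo, n + 1)
    }

def break_point_bound(n, k):
    """sum_{i=0}^{k-1} C(n, i): the bound on m_H(n) when k is a break point."""
    return sum(comb(n, i) for i in range(k))

print("N  rays  bound(k=2)  intervals  bound(k=3)  2^N")
for n in range(1, 8):
    rays = len(dichotomies_positive_rays(n))
    intervals = len(dichotomies_positive_intervals(n))
    print(n, rays, break_point_bound(n, 2), intervals, break_point_bound(n, 3), 2**n)
```

Positive rays cannot shatter 2 points and positive intervals cannot shatter 3, so $k = 2$ and $k = 3$ are break points for them. For both sets the counted dichotomies meet the binomial-sum bound with equality and fall far below $2^N$ as $N$ grows.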



## 2.3 VC Bound 

### 2.3.1 VC Dimension

The Vapnik-Chervonenkis dimension of a hypothesis set $H$, written $d_{vc}(H)$, is the largest value of $N$ for which $m_H(N) = 2^N$. Thus if $k$ is the smallest break point, $d_{vc}(H) = k-1$:

$$m_H(N) \leq \sum_{i=0}^{d_{vc}(H)}{N\choose i} \leq N^{d_{vc}(H)} + 1$$

If $d_{vc}(H)$ is finite, $m_H(N)$ grows only polynomially in $N$. Once $M$ is replaced by the growth function, the error bar behaves like $\sqrt{\frac{1}{N}\ln{m_H(N)}}$, which converges to 0 as $N \to \infty$, which means $E_{in}$ becomes close to $E_{out}$.

But if $d_{vc}(H)$ is infinite, there is no guarantee that $E_{in}$ and $E_{out}$ will be close.



### 2.3.2 VC Generalization Bound

After the proof, $M$ is replaced by $m_H$, with some changes to the constants (needed to make the argument hold). With probability at least $1 - \delta$:

$$ E_{out}(g) \leq E_{in}(g)  + \sqrt{\frac{8}{N}\ln{\frac{4m_H(2N)}{\delta}}}$$

Using the polynomial bound on the growth function:

$$ E_{out}(g) \leq E_{in}(g)  + \sqrt{\frac{8}{N}\ln{\frac{4((2N)^{d_{vc}} + 1)}{\delta}}}$$

The bound is loose, but it is still useful because it:

1. Establishes the feasibility of learning with infinite hypothesis sets, as long as the VC dimension is finite
2. Is useful for comparing the generalization performance of different models

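We can also compute the size of the data set we need for a given confidence parameter $\delta$ and error tolerance $\epsilon$:

$$N \geq \frac{8}{\epsilon^2}\ln\left(\frac{4m_H(2N)}{\delta}\right)$$

$N$ appears on both sides, so the condition is implicit in $N$ and is usually solved numerically, for example by iteration.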
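
A minimal sketch of one way to solve the implicit condition: fixed-point iteration, with $m_H(2N)$ replaced by the polynomial bound $(2N)^{d_{vc}} + 1$. The function name, the starting guess `n0`, and the example values of `d_vc`, `eps`, and `delta` are assumptions made for illustration.

```python
import math

def sample_complexity(d_vc, eps, delta, n0=1000.0, tol=1e-6, max_iter=1000):
    """Iterate N <- (8 / eps^2) * ln(4 * ((2N)^d_vc + 1) / delta) until it stops changing."""
    n = n0
    for _ in range(max_iter):
        n_next = 8 / eps**2 * math.log(4 * ((2 * n) ** d_vc + 1) / delta)
        if abs(n_next - n) < tol:
            return n_next
        n = n_next
    return n  # last iterate if the tolerance was not reached

n_needed = sample_complexity(d_vc=3, eps=0.1, delta=0.1)
print(f"N needed for d_vc=3, eps=0.1, delta=0.1: about {n_needed:.0f}")
```

For these example values the iteration settles a little under 30,000 examples, which illustrates how loose the bound is in practice.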



# 3. Approximation Generalization Tradeoff

## 3.1 Classification Problem Setting


$$E_{out} (g) \leq E_{in} (g) + \mathcal{O}\left(\sqrt{d_{vc}\frac{\log N}{N}}\right)$$

**$E_{in} (g)$ decreases as $d_{vc}$ increases**;

**$\mathcal{O}\left(\sqrt{d_{vc}\frac{\log N}{N}}\right)$ increases as $d_{vc}$ increases**.


$d_{vc}$ is the VC dimension, the greatest number of points that can be shattered by the hypothesis set, and $N$ is the number of training examples.



## 3.2 Bias-Variance Tradeoff

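**For Regression Problem Settings**

Start from the mean squared error of the hypothesis $g_D$ learned from training set $D$:

$$E_{out}(g_D) = \mathbb E_{\overrightarrow{x}\sim P}[(g_D(\overrightarrow{x}) - f(\overrightarrow{x}))^2]$$

Taking the expectation over training sets $D$, and writing $\bar g(\overrightarrow{x}) = \mathbb E_D[g_D(\overrightarrow{x})]$ for the average hypothesis, we have:


$$\mathbb E_D[E_{out}(g_D)] = \mathbb E_{\overrightarrow x}[\mathbb E_D[(g_D(\overrightarrow x) - \bar g(\overrightarrow{x}))^2] + (\bar g(\overrightarrow{x}) - f(\overrightarrow{x}))^2]$$

$$= \mathbb E_{\overrightarrow x}[\text{Variance of } g_D(\overrightarrow x) + \text{Bias of } \bar g(\overrightarrow{x})]$$

The cross term vanishes because $\mathbb E_D[g_D(\overrightarrow x) - \bar g(\overrightarrow{x})] = 0$.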


**Bias**

How far the average hypothesis $\bar g$ is from the true function $f$

(How well, on average, does g approximate f?)


**Variance**

How much the hypothesis $g_D$ learned from one particular dataset varies around the average hypothesis $\bar g$

(How well could g approximate anything? How much could noise affect g?)


**More complicated Hypothesis**

Bias decreases, Variance increases


**More data brought to existing Hypothesis**

Bias is fixed once the model is chosen, so there is no change in Bias;

Variance will decrease.
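
The tradeoff can be seen in a small simulation. This is a sketch under assumptions: the target $f(x) = \sin(\pi x)$ on $[-1, 1]$, datasets of only two points, and the two models (a constant and a line) are illustrative choices in the spirit of the textbook example, and all function names are hypothetical.

```python
import math
import random

def f(x):
    return math.sin(math.pi * x)

def fit_constant(x1, x2):
    """Best constant through two points, returned as (slope, intercept) with slope 0."""
    return 0.0, (f(x1) + f(x2)) / 2

def fit_line(x1, x2):
    """Line through two points, returned as (slope, intercept)."""
    a = (f(x2) - f(x1)) / (x2 - x1)
    return a, f(x1) - a * x1

def bias_variance(fit, n_datasets=5000, n_grid=50, seed=0):
    rng = random.Random(seed)
    fits = [fit(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_datasets)]
    # Both models are linear in (a, b), so the average hypothesis has the average (a, b).
    a_bar = sum(a for a, _ in fits) / n_datasets
    b_bar = sum(b for _, b in fits) / n_datasets
    # Midpoint grid on [-1, 1] to approximate the expectation over x.
    grid = [-1 + 2 * (i + 0.5) / n_grid for i in range(n_grid)]
    bias = sum((a_bar * x + b_bar - f(x)) ** 2 for x in grid) / n_grid
    var = sum(
        ((a - a_bar) * x + (b - b_bar)) ** 2 for a, b in fits for x in grid
    ) / (n_datasets * n_grid)
    return bias, var

bias0, var0 = bias_variance(fit_constant)
bias1, var1 = bias_variance(fit_line)
print(f"constant model: bias={bias0:.2f} var={var0:.2f} total={bias0 + var0:.2f}")
print(f"linear model:   bias={bias1:.2f} var={var1:.2f} total={bias1 + var1:.2f}")
```

With so little data, the more complex model should show lower bias but much higher variance, so the simpler model ends up with the smaller expected $E_{out}$.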

# Linear Regression

## Maximum Likelihood Estimation 

Maximum Likelihood Estimation (MLE) is a method of estimating the parameters of a distribution by maximizing a likelihood function, so that under the assumed statistical model the observed data is most probable. In other words, we look for the parameter values that maximize the likelihood of making the observations given the parameters.


**Assumption for Linear Regression - Gaussian Noise Distribution**

1. The distribution of $X$ is arbitrary
2. Given an input vector $\overrightarrow X$ and a parameter vector $\overrightarrow \beta$, $Y = \overrightarrow \beta \cdot \overrightarrow X + \epsilon$
3. $\epsilon \sim Normal (0, \sigma ^2)$
4. $\epsilon$ is independent across observations
5. $Y$ is independent across observations given $X_i$
6. Then ($Y_i$ given $\overrightarrow X_i$) $\sim Normal ( {\overrightarrow {\beta} \cdot \overrightarrow X_i}, \sigma ^2)$




**Probability Density Function for Normal Distribution**

$$P(x_i\mid \mu,\sigma^2)={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x_i-\mu )^{2}}{2\sigma ^{2}}}}$$

**Bayes' Theorem**

$$ P(\theta \mid X)={\frac {P(X \mid \theta)P(\theta)}{P(X)}} $$

where the likelihood function is $P(X \mid \theta)$ and the prior distribution is $P(\theta)$

**For a Dataset, if independence holds**

$$ L = P(Y_1,Y_2,...,Y_n  \mid \overrightarrow X_1, ..., \overrightarrow X_n) = P(Y_1\mid \overrightarrow X_1)P(Y_2 \mid \overrightarrow X_2)...P(Y_n\mid \overrightarrow X_n) = \prod_{i=1} ^n P(Y_i \mid \overrightarrow X_i) $$

Substituting the normal PDF with mean $\overrightarrow {\beta}\overrightarrow X_i$ and variance $\sigma^2$:

$$ L = \frac{1}{\sigma ^n(2\pi)^{\frac{n}{2}}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i - \overrightarrow {\beta}\overrightarrow X_i)^2}$$

$$ \ln(L) = -n\ln(\sqrt {2\pi} \sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i- \overrightarrow {\beta}\overrightarrow X_i)^2$$

To maximize $\ln(L)$ with respect to $\overrightarrow \beta$, note that the first term and the positive factor $\frac{1}{2\sigma^2}$ do not depend on $\overrightarrow \beta$, so it is enough to find:

$$argmax_\beta[- \sum_{i=1}^n(Y_i- \overrightarrow {\beta}\overrightarrow X_i)^2]$$

Thus, $argmin_\beta[\sum_{i=1}^n(Y_i-\overrightarrow {\beta}\overrightarrow X_i)^2]$

*which is exactly the sum of squared errors*

To find the optimum, we can set the first derivative equal to zero, solve for the weights, and then check the sign of the second derivative.
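
A quick numeric check of the equivalence, as a sketch under assumptions: a one-parameter model $Y = \beta x + \epsilon$ with no intercept, a known $\sigma$, and synthetic data. The names `log_likelihood` and `beta_ls` and all numeric values are illustrative.

```python
import math
import random

def log_likelihood(beta, sigma, xs, ys):
    """ln L = -n ln(sqrt(2 pi) sigma) - sum_i (y_i - beta x_i)^2 / (2 sigma^2)."""
    n = len(xs)
    sse = sum((y - beta * x) ** 2 for x, y in zip(xs, ys))
    return -n * math.log(math.sqrt(2 * math.pi) * sigma) - sse / (2 * sigma**2)

rng = random.Random(0)
beta_true, sigma = 2.0, 0.5
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [beta_true * x + rng.gauss(0, sigma) for x in xs]

# Least-squares estimate: set d/d(beta) of sum (y - beta x)^2 to zero and solve.
beta_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# The log-likelihood should peak at the least-squares estimate.
for beta in (beta_ls - 0.1, beta_ls, beta_ls + 0.1):
    print(f"beta={beta:.4f}  ln L={log_likelihood(beta, sigma, xs, ys):.4f}")
```

Moving $\beta$ away from the least-squares value in either direction lowers $\ln(L)$, consistent with the derivation.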



## Mean Squared Error and Vectorized Optimization

Similarly to the MLE method, we want to choose the parameters that give the lowest mean squared error.

Let's try a vectorized optimization, where $X$ is the matrix whose $i$-th row is $\overrightarrow {x_i}^T$ and $\overrightarrow Y$ is the vector of the $Y_i$:

$$ E_{in}(\overrightarrow \beta)=\frac{1}{n}\sum_{i=1}^n(\overrightarrow {\beta}^T \overrightarrow {x_i} - Y_i)^2 = \frac{1}{n}\sum_{i=1}^n(\overrightarrow {x_i}^T \overrightarrow {\beta} - Y_i)^2 = \frac{1}{n}\mid\mid X\overrightarrow \beta - \overrightarrow Y \mid \mid ^2$$

$$ \text{since } \mid \mid \overrightarrow Z \mid\mid = \sqrt{\sum_{i=1}^d z_i^2} = \sqrt{\overrightarrow Z^T \overrightarrow Z}$$

$$ E_{in}(\overrightarrow \beta)= \frac{1}{n}(X\overrightarrow \beta - \overrightarrow Y)^T(X\overrightarrow \beta - \overrightarrow Y)$$

$$ \nabla_\beta E_{in}(\overrightarrow \beta^*) = \frac{1}{n}(2X^TX\overrightarrow \beta^* - 2X^T\overrightarrow Y) = 0$$

$$\overrightarrow \beta^* = (X^TX)^{-1}X^T\overrightarrow Y$$

*when the number of parameters << the size of the dataset, we can usually assume $X^TX$ is invertible*
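
A minimal NumPy sketch of the closed-form solution on synthetic data. The data sizes, `beta_true`, and the noise scale are made-up values for illustration. Solving the linear system $(X^TX)\overrightarrow\beta = X^T\overrightarrow Y$ is used here in place of forming the inverse explicitly, which is a common numerical choice rather than something the derivation requires.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])
beta_true = np.array([1.0, 2.0, -3.0, 0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Normal equations: (X^T X) beta = X^T Y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check with NumPy's least-squares solver, which minimizes ||X beta - Y||^2 directly.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The gradient of E_in should vanish at the solution.
grad = (2 / n) * (X.T @ X @ beta_hat - X.T @ Y)

print("beta_hat:  ", np.round(beta_hat, 3))
print("beta_lstsq:", np.round(beta_lstsq, 3))
print("max |gradient|:", np.abs(grad).max())
```

If $X^TX$ is singular or badly conditioned, `np.linalg.solve` fails or becomes unreliable, and `np.linalg.lstsq` is the safer route.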



***
**Reference**

Learning From Data (http://amlbook.com/)

Maximum Likelihood Wikipedia https://en.wikipedia.org/wiki/Maximum_likelihood_estimation

Estimating simple linear regression II https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf