
# Approximation Generalization Tradeoff

**For Classification Problem Settings**


$$E_{out}(g) \leqslant E_{in}(g) + \mathcal{O}\left(\sqrt{d_{vc}\frac{\log n}{n}}\right)$$

**$E_{in} (g)$ decreases as $d_{vc}$ increases**;

**$\mathcal{O}\left(\sqrt{d_{vc}\frac{\log n}{n}}\right)$ increases as $d_{vc}$ increases**.


$d_{vc}$ is the VC dimension: the largest number of points that the
hypothesis set can shatter. $n$ is the number of training examples.
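To get a feel for how the complexity term behaves, here is a minimal sketch that evaluates $\sqrt{d_{vc}\frac{\log n}{n}}$ for a few values. The hidden constant in $\mathcal{O}(\cdot)$ is assumed to be 1, and the function name `vc_penalty` and the sample size are hypothetical choices made for illustration only.

```python
import math

def vc_penalty(d_vc: int, n: int) -> float:
    """Complexity term sqrt(d_vc * log(n) / n), taking the O(.) constant as 1 (an assumption)."""
    return math.sqrt(d_vc * math.log(n) / n)

n = 1000  # hypothetical number of training examples
for d_vc in (1, 5, 20, 100):
    # The penalty grows with d_vc for a fixed n.
    print(f"d_vc={d_vc:>3}  penalty={vc_penalty(d_vc, n):.3f}")
```

For a fixed $d_{vc}$ the same term shrinks as $n$ grows, which is why a richer hypothesis set needs more data to generalize.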

# Bias-Variance Tradeoff

**For Regression Problem Settings**

Start from the mean squared error of the hypothesis $g_D$ learned from training set $D$:

$$E_{out}(g_D) = \mathbb E_{\overrightarrow{x}\sim P}\left[(g_D(\overrightarrow{x}) - f(\overrightarrow{x}))^2\right]$$

Taking the expectation over training sets $D$, with the average hypothesis $\bar g(\overrightarrow x) = \mathbb E_D[g_D(\overrightarrow x)]$, we have:


$$\mathbb E_D[E_{out}(g_D)] = \mathbb E_{\overrightarrow x}\left[\left(\mathbb E_D[g_D(\overrightarrow x)^2] - \bar g(\overrightarrow{x})^2\right) + (\bar g(\overrightarrow{x}) - f(\overrightarrow{x}))^2\right]$$

$$= \mathbb E_{\overrightarrow x}[\text{Variance of } g_D(\overrightarrow x) + \text{Bias of } \bar g(\overrightarrow{x})]$$
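
The intermediate algebra behind this decomposition can be sketched as follows, for a fixed $\overrightarrow x$ and writing $\bar g(\overrightarrow x) = \mathbb E_D[g_D(\overrightarrow x)]$ for the average hypothesis (arguments dropped for brevity):

$$\mathbb E_D[(g_D - f)^2] = \mathbb E_D[(g_D - \bar g + \bar g - f)^2]$$

$$= \mathbb E_D[(g_D - \bar g)^2] + 2(\bar g - f)\,\mathbb E_D[g_D - \bar g] + (\bar g - f)^2$$

$$= \mathbb E_D[(g_D - \bar g)^2] + (\bar g - f)^2$$

The cross term vanishes because $\mathbb E_D[g_D - \bar g] = 0$, and the first term equals $\mathbb E_D[g_D^2] - \bar g^2$, the variance of $g_D$.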


**Bias**

How far the average hypothesis $\bar g$ is from the true function $f$

(How well, on average, does g approximate f?)


**Variance**

How much the hypothesis $g_D$ learned from a particular dataset varies around the average hypothesis $\bar g$

(How well could g approximate anything? How much could noise affect g?)


**More complicated Hypothesis**

Bias decreases, Variance increases
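
The effect of model complexity can be checked with a small simulation. This is a sketch under assumed settings, not a result from the notes: the target $f(x) = \sin(\pi x)$, the two-point training sets, and the function name `bias_variance` are all hypothetical choices. It compares a constant fit (degree 0) with a straight-line fit (degree 1).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(np.pi * x)  # hypothetical target function

def bias_variance(degree, n_points=2, n_datasets=5000):
    """Estimate bias and variance of degree-`degree` polynomial fits to f on [-1, 1]."""
    x_test = np.linspace(-1, 1, 201)               # grid standing in for x ~ Uniform[-1, 1]
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(-1, 1, n_points)           # one training set D
        coeffs = np.polyfit(x, f(x), degree)       # g_D: least-squares polynomial fit
        preds[i] = np.polyval(coeffs, x_test)
    g_bar = preds.mean(axis=0)                     # average hypothesis
    bias = np.mean((g_bar - f(x_test)) ** 2)       # E_x[(g_bar - f)^2]
    variance = np.mean(preds.var(axis=0))          # E_x[Var_D(g_D(x))]
    return bias, variance

bias0, var0 = bias_variance(degree=0)  # constant hypothesis
bias1, var1 = bias_variance(degree=1)  # linear hypothesis
print(f"constant: bias={bias0:.2f} variance={var0:.2f}")
print(f"linear:   bias={bias1:.2f} variance={var1:.2f}")
```

In this setup the line has the lower bias but the higher variance, so with only two training points the simpler model can end up with the smaller expected error.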


**More data for an existing hypothesis set**

Bias is fixed once the model is chosen, so it does not change;

Variance decreases.
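
A quick simulation illustrates this, again under assumed settings: a straight-line model fitted to noisy samples of $\sin(\pi x)$, with a hypothetical noise level `noise_sd` and a hypothetical helper `line_fit_stats`. The bias is only approximately constant in a finite simulation, so treat the numbers as indicative.

```python
import numpy as np

rng = np.random.default_rng(1)
x_test = np.linspace(-1, 1, 101)  # grid standing in for x ~ Uniform[-1, 1]

def line_fit_stats(n_points, n_datasets=2000, noise_sd=0.3):
    """Bias and variance of a straight-line fit to noisy samples of sin(pi * x)."""
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        x = rng.uniform(-1, 1, n_points)                               # inputs of one training set
        y = np.sin(np.pi * x) + rng.normal(0, noise_sd, n_points)      # noisy targets
        slope, intercept = np.polyfit(x, y, 1)                         # fitted line g_D
        preds[i] = slope * x_test + intercept
    g_bar = preds.mean(axis=0)                                         # average hypothesis
    bias = np.mean((g_bar - np.sin(np.pi * x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias, variance

stats = {n: line_fit_stats(n) for n in (5, 50, 500)}
for n, (bias, variance) in stats.items():
    print(f"n={n:>3}  bias={bias:.3f}  variance={variance:.4f}")
```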

# Linear Regression

## Maximum Likelihood Estimation 

Maximum Likelihood Estimation (MLE) estimates the parameters of a distribution by maximizing a likelihood function, so that the observed data is most probable under the assumed statistical model.
In other words, it finds the parameter values that maximize the probability of the observations given the parameters.


**Assumption for Linear Regression - Gaussian Noise Distribution** 

1. The distribution of $X$ is arbitrary
2. Given $\overrightarrow X$ and $\overrightarrow \beta$, $Y = \overrightarrow \beta \cdot \overrightarrow X + \epsilon$
3. $\epsilon \sim Normal (0, \sigma ^2)$
4. $\epsilon$ is independent across observations
5. $Y$ is independent across observations given $X_i$
6. Then $Y_i \mid \overrightarrow X_i \sim Normal (\overrightarrow \beta \cdot \overrightarrow X_i, \sigma ^2)$




**Probability Density Function for Normal Distribution**

$$P(x_i\mid \mu,\sigma^2)={\frac {1}{\sqrt {2\pi \sigma ^{2}}}}e^{-{\frac {(x_i-\mu )^{2}}{2\sigma ^{2}}}}$$
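
As a small illustration, the density can be evaluated directly from this formula. The function name `normal_pdf` is a hypothetical choice; it takes the variance $\sigma^2$ rather than the standard deviation, to match the notation above.

```python
import math

def normal_pdf(x: float, mu: float, sigma2: float) -> float:
    """Density of Normal(mu, sigma2) at x, where sigma2 is the variance."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

# Peak of the standard normal, equal to 1 / sqrt(2 * pi).
peak = normal_pdf(0.0, 0.0, 1.0)
print(peak)
```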

**Bayes' Theorem**

$$ P(\theta \mid X)={\frac {P(X \mid \theta)P(\theta)}{P(X)}} $$

where the likelihood function is $P(X \mid \theta)$ and the prior distribution is $P(\theta)$

**For a Dataset, if independence holds**

$$ L = P(Y_1,Y_2,\dots,Y_n \mid \overrightarrow X) = P(Y_1\mid x_1)P(Y_2 \mid x_2)\cdots P(Y_n\mid x_n) = \prod_{i=1} ^n P(Y_i \mid x_i) $$

Substituting the normal PDF for each factor:

$$ L = \frac{1}{\sigma ^n(2\pi)^{\frac{n}{2}}} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i - \overrightarrow {\beta}\overrightarrow X_i)^2}$$

$$ \ln(L) = -n\ln(\sqrt {2\pi} \sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i- \overrightarrow {\beta}\overrightarrow X_i)^2$$

To maximize $\ln(L)$ with respect to $\beta$, only the sum matters, because the first term does not depend on $\beta$ and $\frac{1}{2\sigma^2}$ is a positive constant:

$$\arg\max_\beta\left[- \sum_{i=1}^n(Y_i- \overrightarrow {\beta}\overrightarrow X_i)^2\right]$$

Thus, $\arg\min_\beta\left[\sum_{i=1}^n(Y_i-\overrightarrow {\beta}\overrightarrow X_i)^2\right]$

*which is exactly the sum of squared errors*

To find the optimum, set the first derivative to zero, solve for the weights, and then check the sign of the second derivative.
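
The equivalence between maximizing $\ln(L)$ and minimizing squared error can be checked numerically. The sketch below uses synthetic data; the sample size, noise level, `true_beta`, and the helper `log_likelihood` are all hypothetical, and $\sigma$ is treated as known.

```python
import numpy as np

def log_likelihood(beta, X, y, sigma):
    """ln(L) for y_i ~ Normal(beta . x_i, sigma^2), following the formula above."""
    n = len(y)
    residuals = y - X @ beta
    return -n * np.log(np.sqrt(2 * np.pi) * sigma) - residuals @ residuals / (2 * sigma**2)

rng = np.random.default_rng(0)
n, sigma = 200, 0.5                                     # hypothetical sample size and noise level
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept column plus one feature
true_beta = np.array([1.0, 2.0])                        # hypothetical parameters
y = X @ true_beta + rng.normal(0, sigma, n)             # Gaussian noise, as assumed above

# Least-squares estimate: minimizes the sum of squared errors.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

ll_best = log_likelihood(beta_ls, X, y, sigma)
ll_other = log_likelihood(beta_ls + np.array([0.1, -0.1]), X, y, sigma)
print(ll_best > ll_other)  # moving away from the least-squares solution lowers ln(L)
```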



## Mean Squared Error and Vectorized Optimization

As with the MLE approach, we want the parameters that give the lowest mean squared error.

Let's try a vectorized optimization:

$$ E_{in}(\overrightarrow \beta)=\frac{1}{n}\sum_{i=1}^n(\overrightarrow {\beta}^T \overrightarrow {x_i} - Y_i)^2 = \frac{1}{n}\sum_{i=1}^n(\overrightarrow {x_i}^T \overrightarrow {\beta} - Y_i)^2 = \frac{1}{n}\lVert X\overrightarrow \beta - \overrightarrow Y \rVert ^2$$

$$ \text{since } \lVert \overrightarrow Z \rVert = \sqrt{\sum_{i=1}^d z_i^2} = \sqrt{\overrightarrow Z^T \overrightarrow Z}$$

$$ E_{in}(\overrightarrow \beta)= \frac{1}{n}(X\overrightarrow \beta - \overrightarrow Y)^T(X\overrightarrow \beta - \overrightarrow Y)$$

$$ \nabla_\beta E_{in}(\overrightarrow \beta^*) = \frac{1}{n}(2X^TX\overrightarrow \beta^* - 2X^T\overrightarrow Y) = 0$$

$$\overrightarrow \beta^* = (X^TX)^{-1}X^T\overrightarrow Y$$

*when the number of parameters is much smaller than the number of data points, we can assume $X^TX$ is invertible*
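
The closed-form solution can be sketched in NumPy as follows. The data are synthetic, and `beta_true`, the dimensions, and the noise level are hypothetical. The linear system $(X^TX)\overrightarrow\beta = X^T\overrightarrow Y$ is solved directly instead of forming the inverse, which is usually the more numerically stable choice.

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 100, 3                                                 # hypothetical: n samples, d features
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])    # first column is the intercept
beta_true = np.array([0.5, 1.0, -2.0, 3.0])                   # hypothetical parameters
y = X @ beta_true + rng.normal(0, 0.1, n)

# Normal equations: solve (X^T X) beta = X^T y.
beta_star = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient of E_in should vanish at beta_star (up to floating-point error).
grad = (2 / n) * (X.T @ X @ beta_star - X.T @ y)

# Cross-check against NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_star)
print(np.allclose(beta_star, beta_lstsq))
```

When $X^TX$ is singular or badly conditioned (for example, more parameters than data points), `np.linalg.solve` is not appropriate; `np.linalg.lstsq` or the pseudo-inverse is the usual fallback.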



***
**Reference**

Learning From Data (http://amlbook.com/)

Maximum Likelihood Estimation, Wikipedia (https://en.wikipedia.org/wiki/Maximum_likelihood_estimation)

Estimating simple linear regression II (https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lecture-06.pdf)