# Review of Machine Learning Concepts

**e.g.** Linear Regression

$t_i \stackrel{iid}{\sim} \mathcal{N}(w^\top \phi(x_i), \beta^{-1})$  
$t \sim \mathcal{N}(\Phi w, \beta^{-1} I)$  
$L = p(t | w)$  
"optimal" $w$ maximizes $L$, $\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top t$

Alternatively, use prior $w \sim \mathcal{N}(0, \alpha^{-1} I)$  
Posterior $w | t \sim \mathcal{N}(m_N, S_N)$, $m_N = \beta S_N \Phi^\top t$, $S_N = (\alpha I + \beta \Phi^\top \Phi)^{-1}$  
This is the right answer if the prior captures our beliefs and the model is correct
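
The closed-form maximum-likelihood estimate $\hat{w} = (\Phi^\top \Phi)^{-1} \Phi^\top t$ can be sketched for the special case $\phi(x) = (1, x)$, where the normal equations are 2x2 and can be solved by hand. The function name and the toy data are illustrative assumptions, not part of the notes; in practice a linear-algebra library would do the solve.

```python
def fit_line_mle(xs, ts):
    """Maximum-likelihood (least-squares) weights for phi(x) = (1, x).

    Solves (Phi^T Phi) w = Phi^T t, where
    Phi^T Phi = [[n, sum x], [sum x, sum x^2]] and Phi^T t = [sum t, sum x t].
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    st = sum(ts)
    sxt = sum(x * t for x, t in zip(xs, ts))
    det = n * sxx - sx * sx  # determinant of Phi^T Phi
    if det == 0:
        raise ValueError("Phi^T Phi is singular; need two distinct x values")
    w0 = (sxx * st - sx * sxt) / det  # intercept
    w1 = (n * sxt - sx * st) / det    # slope
    return w0, w1


# Toy, noise-free data generated from t = 1 + 2x (hypothetical example).
w0, w1 = fit_line_mle([0, 1, 2, 3], [1, 3, 5, 7])
```

Note that $\beta$ does not appear: the noise precision scales the likelihood but does not change where it is maximized.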

**e.g.** Logistic Regression

$t_i \stackrel{indep}{\sim} Bernoulli(\sigma(w^\top \phi(x_i)))$  
prior: $w \sim \mathcal{N}(0, \alpha^{-1} I)$  
$\hat{w}_{MAP}$ has no closed form; iterate Newton steps $w \leftarrow w - (\alpha I + \Phi^\top diag(y (1 - y)) \Phi)^{-1} (\Phi^\top (y - t) + \alpha w)$  
$y_i = \sigma(w^\top \phi(x_i))$  
true posterior is not Gaussian; sampling from it requires MCMC  
can use approximation (Laplace): $p(w | t) \approx \mathcal{N}(w; m_N, S_N)$, $m_N = \hat{w}_{MAP}$, $S_N = (\alpha I + \Phi^\top R \Phi)^{-1}$, $R = diag(y (1 - y))$ evaluated at $\hat{w}_{MAP}$
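
A minimal sketch of the MAP fit and the Laplace approximation, restricted to the scalar case $\phi(x) = x$ so that $\Phi^\top R \Phi$ reduces to $\sum_i y_i (1 - y_i) x_i^2$ and no matrix inverse is needed. The function names, the fixed iteration count, and the toy data are assumptions made for illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def laplace_logistic_1d(xs, ts, alpha=1.0, iters=25):
    """Newton iterations for the MAP weight, then the Laplace variance S_N.

    Scalar case only (phi(x) = x). The Gaussian prior N(0, 1/alpha)
    contributes alpha * w to the gradient and alpha to the curvature.
    """
    w = 0.0
    for _ in range(iters):
        ys = [sigmoid(w * x) for x in xs]
        grad = sum((y - t) * x for y, t, x in zip(ys, ts, xs)) + alpha * w
        hess = alpha + sum(y * (1.0 - y) * x * x for y, x in zip(ys, xs))
        w -= grad / hess  # Newton step
    # Curvature at the MAP estimate gives the Laplace precision.
    ys = [sigmoid(w * x) for x in xs]
    s_n = 1.0 / (alpha + sum(y * (1.0 - y) * x * x for y, x in zip(ys, xs)))
    return w, s_n


# Hypothetical toy data: negative inputs labelled 0, positive inputs labelled 1.
xs, ts = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]
w_map, s_n = laplace_logistic_1d(xs, ts, alpha=1.0)
```

The data here are linearly separable, so without the prior the weight would grow without bound; the $\alpha w$ term is what keeps the MAP estimate finite.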

# New Framework

Assumptions/rules

* $(x_i, y_i) \sim D$, $D$ is not related to my algorithm/model  
each $(x_i, y_i)$ drawn independently
* what we care about is captured by a loss or objective function $\ell(y, \hat{y})$
    * 0-1 loss for classification (whether $y$ is the same as $\hat{y}$)
    * log-loss $\ell(y, p(y | model)) = -\log p(y | model)$
* there is a finite set of options (hypothesis class $H$)
    * if parameter is continuous (infinite set of options), e.g., $w \in \mathbb{R}^d$,  
    for now, assume each entry $w_k$ in vector $w$ is a float64 ($2^{64 d}$ options)
* realizability assumption  
for $h \in H$, define $L_D(h) = E_{(x, y) \sim D}[\ell(y, h(x))]$  
assume $\exists h^* \in H$ s.t. $L_D(h^*) = 0$  
i.e., there exists a "true" solution
* learning setup/framework
    * sample $s \sim D^N$ (independent sample of size $N$)
* algorithm: ERM  
gets $s$ as input and outputs $\tilde{h} \in H$ that has zero error on $s$  
$L_s(h) = N^{-1} \sum_i \ell(y_i, h(x_i))$
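
The setup above can be sketched with a small finite hypothesis class. The threshold classifiers, the grid of thresholds, and the sample are hypothetical choices for illustration; the notes do not fix a particular $H$.

```python
def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1


def empirical_risk(h, sample):
    """L_s(h) = N^{-1} sum_i loss(y_i, h(x_i))."""
    return sum(zero_one_loss(y, h(x)) for x, y in sample) / len(sample)


def erm(hypotheses, sample):
    """Return a hypothesis minimizing the empirical risk on the sample.

    Under realizability the minimum is 0, so the output is consistent
    with every example in the sample.
    """
    return min(hypotheses, key=lambda h: empirical_risk(h, sample))


def make_threshold(theta):
    # h_theta(x) = 1 if x >= theta else 0
    return lambda x: 1 if x >= theta else 0


# Finite class H: 11 thresholds on a grid (assumed for this example).
H = [make_threshold(k / 10) for k in range(11)]
sample = [(0.05, 0), (0.25, 0), (0.45, 1), (0.80, 1)]
h_tilde = erm(H, sample)
```

Several thresholds are consistent with this sample; `min` simply returns the first one it meets, which is all ERM requires.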

**theorem**

if ERM is run on a problem satisfying the above with sample size $N \geq \frac{\log(|H| / \delta)}{\epsilon}$,  
then with probability $\geq 1 - \delta$, $L_D(\tilde{h}) \leq \epsilon$
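
The sample size required by the theorem is easy to compute. The helper below is a sketch with a hypothetical name; it takes $\log |H|$ rather than $|H|$ so that very large classes (such as $2^{64 d}$) do not overflow.

```python
import math


def sample_size_bound(log_h, delta, epsilon):
    """Smallest integer N with N >= (log|H| + log(1/delta)) / epsilon."""
    if not (0 < delta < 1 and 0 < epsilon < 1):
        raise ValueError("delta and epsilon must lie in (0, 1)")
    return math.ceil((log_h + math.log(1.0 / delta)) / epsilon)


# Illustrative values: |H| = 1000, delta = 0.05.
n_coarse = sample_size_bound(math.log(1000), delta=0.05, epsilon=0.1)
n_fine = sample_size_bound(math.log(1000), delta=0.05, epsilon=0.05)
```

The bound grows only logarithmically in $|H|$ and $1/\delta$ but linearly in $1/\epsilon$, so halving the target error roughly doubles the required sample.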

**corr**

applying to linear regression,  
$\log |W| = 64 d \log 2 \leq 45 d$  
$N \geq \frac{45 d + \log(\delta^{-1})}{\epsilon} \implies L_D(\tilde{w}) \leq \epsilon$ with probability $\geq 1 - \delta$
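
A worked instance may help fix the scale. The values $d = 10$, $\delta = 0.05$, $\epsilon = 0.1$ are illustrative assumptions, not taken from the notes:

$$
\log \frac{|W|}{\delta} = \log |W| + \log \delta^{-1} \leq 45 \cdot 10 + \log 20 \approx 453.0,
\qquad
\frac{453.0}{0.1} = 4530,
$$

so roughly $N = 4530$ samples suffice under these assumptions. The $45 d$ term dominates; the confidence parameter contributes only $\log 20 \approx 3$.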

**proof**

* only aiming to get small error $L_D \leq \epsilon$
* $h$ is "bad" if $L_D(h) > \epsilon$
* focus on one bad hypothesis, $\bar{h}$  
$P(\bar{h} \text{ is not detected using } s)$  
$= P(\bar{h} \text{ labels all examples in sample correctly})$  
$= \prod_i P(\bar{h} \text{ labels correctly } x_i)$  
$\leq (1 - \epsilon)^N$  
$\leq e^{-\epsilon N} \leq e^{-\epsilon \frac{\log(|H| / \delta)}{\epsilon}}$ (using $1 - \epsilon \leq e^{-\epsilon}$ and the bound on $N$)  
$= \delta / |H|$
* $P(\text{output of ERM is bad})$  
$= P(L_D(\tilde{h}) > \epsilon)$  
$\leq P(\exists h \in H, L_D(h) > \epsilon, L_s(h) = 0)$  
$\leq |H| \max_{\bar{h}: L_D(\bar{h}) > \epsilon} P(L_s(\bar{h}) = 0)$ (union bound)  
$\leq |H| \frac{\delta}{|H|}$  
$= \delta$
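
The chain of inequalities in the proof can be checked numerically for one setting. This is only a sanity check under assumed values ($|H| = 1000$, $\delta = 0.05$, $\epsilon = 0.1$), not a substitute for the argument; the function name is hypothetical.

```python
import math


def survival_bounds(epsilon, n):
    """Return ((1 - eps)^N, exp(-eps * N)).

    The first value bounds the probability that one fixed bad hypothesis
    labels all N examples correctly; the second is its exponential relaxation.
    """
    return (1.0 - epsilon) ** n, math.exp(-epsilon * n)


size_h, delta, epsilon = 1000, 0.05, 0.1
n = math.ceil(math.log(size_h / delta) / epsilon)  # sample size from the theorem
exact, relaxed = survival_bounds(epsilon, n)
union_bound = size_h * relaxed  # bound on P(output of ERM is bad)
```

With these values each step of the proof holds with some slack: the per-hypothesis bound is below $\delta / |H|$, and the union bound is below $\delta$.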