# PAC Learning

recall from first lecture

* classification problems  
$loss = \begin{cases} 1 & \text{prediction is wrong} \\ 0 & \text{else} \end{cases}$

* class of hypotheses/predictors $H$

* distribution $D$ over $(x_i, t_i)$  
$t_i \in \{-, +\}$

* realizability  
$\exists h \in H$ s.t. $L_D(h) = P_{(x, t) \sim D}[h(x) \neq t] = 0$

* sample $s \sim D^{(m)}$, i.e., iid from $D$

**theorem** (imprecise statement)  
if sample size $m$ is large enough,  
then w.p. $\geq 1 - \delta$, $L_D(ERM(s)) \leq \epsilon$

ERM outputs $h \in H$ with minimum training error  
in our case, outputs $h$ with zero error due to realizability assumption
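
a minimal sketch of ERM over a small finite class, to make the definition above concrete. the threshold classifiers, the helper names (`training_error`, `erm`, `make_threshold`) and the toy sample are all assumptions for illustration, not part of the lecture.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Callable[[float], int]
Sample = Sequence[Tuple[float, int]]


def training_error(h: Hypothesis, sample: Sample) -> float:
    """Fraction of sample points that h labels incorrectly (0-1 loss)."""
    return sum(h(x) != t for x, t in sample) / len(sample)


def erm(hypotheses: List[Hypothesis], sample: Sample) -> Hypothesis:
    """Return a hypothesis with minimum training error (ties: first one)."""
    return min(hypotheses, key=lambda h: training_error(h, sample))


def make_threshold(theta: float) -> Hypothesis:
    """Hypothetical predictor: label +1 iff x >= theta."""
    return lambda x: 1 if x >= theta else -1


# Toy finite class H and a sample that is realizable by theta = 0.5.
H = [make_threshold(theta) for theta in (0.0, 0.25, 0.5, 0.75, 1.0)]
sample = [(0.1, -1), (0.3, -1), (0.6, 1), (0.9, 1)]

h_hat = erm(H, sample)
print(training_error(h_hat, sample))  # zero here, since the sample is realizable
```

because the toy sample is realizable, ERM returns a hypothesis with zero training error, matching the remark above.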

## learning intervals

### min interval (naive/imprecise statement)

* given $X_i \sim D$ iid on the real line (iid is needed for the $(1 - \epsilon / 2)^m$ step in the proof)
* $t_i = f(x_i)$ such that $t_i \in \{-1, 1\}$
* all $+$ examples are within some interval, all $-$ examples are outside the interval
* then the min interval is $[a, b]$,  
where $a = \min \{x_i : t_i = 1\}$  
and $b = \max \{x_i : t_i = 1\}$
* note that for the true interval $[c, d]$, $a \geq c$ and $b \leq d$
* **claim**: if the min interval is computed from a "large enough" sample of size $m$, then w.p. $\geq 1 - \delta$, the error of $[a, b]$ is $\leq \epsilon$,  
where the error of the interval is the probability of $X$ landing in $[c, d] \setminus [a, b]$ (truly $+$ but predicted $-$)
* *proof*
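    * define $c^+ \geq c$ such that $P(X \in [c, c^+]) = \epsilon / 2$
    * define $d^- \leq d$ s.t. $P(X \in [d^-, d]) = \epsilon / 2$
    * equality is used both ways: $\leq$ for the error bound, $\geq$ for the sampling bound (assuming $D$ has no point masses and $P(X \in [c, d]) \geq \epsilon$; otherwise the error is trivially $\leq \epsilon$)
    * $c^+$ and $d^-$ don't depend on the sample (or on $a, b$), only on $c, d$ and $D$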
    * "good event" (GE): the sample includes at least one point in $[c, c^+]$ and one in $[d^-, d]$;  
    then $a \leq c^+$ and $b \geq d^-$
    * if GE holds, the error of $[a, b]$ is $\leq \epsilon$:  
    $err([a, b]) = P([c, a)) + P((b, d]) \leq P([c, c^+]) + P([d^-, d]) \leq \epsilon$
    * with probability $\geq 1 - \delta$, GE holds:  
    $P(\text{no samples in } [c, c^+]) \leq (1 - \epsilon / 2)^m \leq e^{-\epsilon m / 2}$ (recall $1 - x \leq e^{-x}$)  
    we need this to be $\leq \delta / 2$, which holds when $m \geq \frac{2}{\epsilon} \log \frac{2}{\delta}$
    * the same bound holds for $[d^-, d]$, so by the union bound GE fails w.p. $\leq \delta / 2 + \delta / 2 = \delta$ once $m \geq \frac{2}{\epsilon} \log \frac{2}{\delta}$
* what can we hope for if the "true function" that assigns labels to the data is not represented in $H$?
    * e.g., using a linear classifier when the true function is quadratic
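
the min-interval argument can be checked numerically. the sketch below assumes $D$ is uniform on $[0, 1]$ and a true interval $[0.3, 0.7]$ (both made up for illustration), draws $m = \lceil \frac{2}{\epsilon} \log \frac{2}{\delta} \rceil$ points, and counts how often the learned interval has error $> \epsilon$. function names are hypothetical.

```python
import math
import random


def min_interval(sample):
    """Smallest interval containing all positive examples (None if no positives)."""
    positives = [x for x, t in sample if t == 1]
    if not positives:
        return None
    return min(positives), max(positives)


def realizable_sample_size(eps, delta):
    """Sample size from the proof: m >= (2 / eps) * log(2 / delta)."""
    return math.ceil((2 / eps) * math.log(2 / delta))


def interval_error_uniform(learned, true_interval):
    """Mass of [c, d] \\ [a, b] when X ~ Uniform[0, 1] (assumed distribution)."""
    c, d = true_interval
    if learned is None:
        return d - c
    a, b = learned
    return (a - c) + (d - b)


rng = random.Random(0)           # fixed seed so the run is reproducible
c, d = 0.3, 0.7                  # assumed true interval
eps, delta = 0.1, 0.05
m = realizable_sample_size(eps, delta)

trials, failures = 2000, 0
for _ in range(trials):
    xs = [rng.random() for _ in range(m)]
    sample = [(x, 1 if c <= x <= d else -1) for x in xs]
    if interval_error_uniform(min_interval(sample), (c, d)) > eps:
        failures += 1

failure_rate = failures / trials
print(m, failure_rate)  # the claim predicts failure_rate <= delta
```

the bound is not tight, so the observed failure rate is typically well below $\delta$.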

### agnostic PAC learning

* PAC: "probably approximately correct"

* algorithm $A$ agnostically PAC learns class $H$ if, for any distribution $D$ over labeled examples (i.e., any target labeling, not necessarily in $H$) and any $\epsilon, \delta > 0$: if $A$ uses $m \geq (*)$ examples sampled iid from $D$, then with probability $\geq 1 - \delta$, $A$ outputs $h$ such that $err(h) \leq err(h^*) + \epsilon$, where $h^* = \arg\min_{h \in H} err(h)$
    * $(*)$ is some polynomial in $(\frac{1}{\epsilon}, \frac{1}{\delta})$

* **claim**: the ERM algorithm agnostically PAC learns any finite $H$ if $m \geq (*)$
* note: ERM finds $h \in H$ which minimizes training set error

chernoff/hoeffding bounds

* estimating $p = P(\text{heads})$ for a coin
* $\hat{p} = \frac{1}{n} \sum_i x_i$ where $x_i$ is 1 if heads (0 if tails), from $n$ iid flips
* $P(|p - \hat{p}| > \alpha) \leq 2 e^{-2 n \alpha^2}$

* more generally, iid $X_i \in [a, b]$ with $E[X_i] = \mu$
    * $\hat{\mu} = \frac{1}{n} \sum_i X_i$
    * then $P(|\mu - \hat{\mu}| > \alpha) \leq 2 e^{-2 n \alpha^2 / (b - a)^2}$
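
a quick simulation comparing the Hoeffding bound with the observed deviation frequency for coin flips. the values $p = 0.5$, $n = 200$, $\alpha = 0.1$ and the function names are assumptions chosen for illustration.

```python
import math
import random


def hoeffding_bound(n, alpha, a=0.0, b=1.0):
    """Upper bound on P(|mu - mu_hat| > alpha) for n iid samples in [a, b]."""
    return 2 * math.exp(-2 * n * alpha**2 / (b - a) ** 2)


def empirical_deviation_rate(p, n, alpha, trials, rng):
    """Fraction of trials in which the estimate p_hat misses p by more than alpha."""
    bad = 0
    for _ in range(trials):
        p_hat = sum(rng.random() < p for _ in range(n)) / n
        if abs(p_hat - p) > alpha:
            bad += 1
    return bad / trials


rng = random.Random(0)  # fixed seed so the run is reproducible
p, n, alpha = 0.5, 200, 0.1

bound = hoeffding_bound(n, alpha)  # equals 2 * exp(-4) for these values
rate = empirical_deviation_rate(p, n, alpha, trials=2000, rng=rng)
print(bound, rate)  # the observed rate should not exceed the bound
```

the bound holds for every distribution on $[a, b]$, so for a specific coin the observed rate is usually much smaller.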

*proof*

* thought to consider: what would guarantee $err(ERM(S)) \leq err(h^*) + \epsilon$?
    * what we know: ERM picks model with minimum training set error  
    $\hat{err}(h) = \frac{1}{n} \sum_i I(h(x_i) \neq t_i)$, training set error of $h$  
    (note that $err(h)$ is the population error of $h$)

* "good event": $\forall h \in H$, $|err(h) - \hat{err}(h)| \leq \epsilon / 2$
    * $err(h) = P(h(X) \neq t)$
* lemma 1: if GE holds, then $err(ERM(S)) \leq err(h^*) + \epsilon$

    * let $\hat{h} = ERM(S)$
    * then $err(\hat{h}) \leq \hat{err}(\hat{h}) + \epsilon / 2$ (GE)  
    $\leq \hat{err}(h^*) + \epsilon / 2$ (ERM)  
    $\leq err(h^*) + \epsilon / 2 + \epsilon / 2$ (GE applied to $h^*$)  
    $= err(h^*) + \epsilon$
* lemma 2: w.p. $\geq 1 - \delta$, $\forall h \in H$, $|err(h) - \hat{err}(h)| \leq \epsilon / 2$
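    * fix one hypothesis $h$ and take a random sample $S \sim D^n$
    * let $e_i = 1$ if $h(x_i) \neq t_i$ and $0$ otherwise, so $\hat{err}(h) = \frac{1}{n} \sum_i e_i$
    * $P(e_i = 1) = P(h(X_i) \neq t_i) = err(h)$, so $E[\hat{err}(h)] = err(h)$
    * by Hoeffding with $\alpha = \epsilon / 2$ and $e_i \in [0, 1]$:  
    $P(|err(h) - \hat{err}(h)| \geq \epsilon / 2) \leq 2 \exp(-n \epsilon^2 / 2)$  
    we want this to be $\leq \delta / |H|$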
        * to guarantee this, need $n \epsilon^2 / 2 \geq \log \frac{2 |H|}{\delta}$  
        $\implies n \geq \frac{2}{\epsilon^2} \log \frac{2 |H|}{\delta}$
    * $P(\text{GE does not occur})$  
    $= P(\exists h \in H : |\hat{err}(h) - err(h)| > \epsilon / 2)$  
    $\leq \sum_h P(|\hat{err}(h) - err(h)| > \epsilon / 2)$ (union bound)  
    $\leq \sum_h \frac{\delta}{|H|} = \delta$
    
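* note: for a hypothesis $f(S)$ chosen using the sample (e.g., $f = ERM$), $P(|err(f(S)) - \hat{err}(f(S))| \geq \epsilon / 2)$ cannot be bounded by applying Hoeffding to a single fixed $h$, since $f(S)$ depends on $S$; this is why GE is required for all $h \in H$ at once  
realizable case: $P(\hat{err}(ERM(S)) = 0) = 1$ by definition/assumption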
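
to get a feel for the bound in lemma 2, the sketch below plugs numbers into $n \geq \frac{2}{\epsilon^2} \log \frac{2 |H|}{\delta}$ and checks that the union bound over $H$ then stays below $\delta$. the values $|H| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$ and the function names are assumptions for illustration.

```python
import math


def agnostic_sample_size(h_size, eps, delta):
    """Sample size from lemma 2: n >= (2 / eps^2) * log(2 |H| / delta)."""
    return math.ceil((2 / eps**2) * math.log(2 * h_size / delta))


def per_hypothesis_failure(n, eps):
    """Hoeffding bound on P(|err(h) - err_hat(h)| >= eps / 2) for one fixed h."""
    return 2 * math.exp(-n * eps**2 / 2)


h_size, eps, delta = 1000, 0.1, 0.05
n = agnostic_sample_size(h_size, eps, delta)

# Union bound over all h in H: |H| times the single-hypothesis bound.
union_bound = h_size * per_hypothesis_failure(n, eps)
print(n, union_bound)  # union_bound should be at most delta
```

note the dependence on $|H|$ is only logarithmic, while the dependence on $\epsilon$ is $1 / \epsilon^2$ (versus $1 / \epsilon$ in the realizable interval example).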