# PAC Learning

recall from first lecture

* classification problems  
$loss = \begin{cases} 1 & \text{prediction is wrong} \\ 0 & \text{else} \end{cases}$

* class of hyp/predictors $H$

* distribution $D$ over $(x_i, t_i)$  
$t_i \in \{-, +\}$

* realizability  
$\exists h \in H$ s.t. $L_D(h) = P_{(x, t) \sim D}[h(x) \neq t] = 0$

* sample $s \sim D^{(m)}$, i.e., iid from $D$

**theorem** (imprecise statement)  
if sample size $m$ is large enough,  
then w.p. $1 - \delta$, $L_D(ERM(s)) \leq \epsilon$

ERM outputs $h \in H$ with minimum training error  
in our case, outputs $h$ with zero error due to realizability assumption

## learning intervals

### min interval (naive/imprecise statement)

* given $X_i \sim D$ iid on the real line (independence is used in the proof below)
* $t_i = f(x_i)$ such that $t_i \in \{-1, 1\}$
* all $+$ examples are within some interval, all $-$ examples are outside the interval
* then the min interval is $[a, b]$,  
where $a = \min \{x_i : t_i = 1\}$  
and $b = \max \{x_i : t_i = 1\}$
* note that for the true interval $[c, d]$, $a \geq c$ and $b \leq d$
* **claim**: if min interval uses "large enough" sample size $m$, then w.p. $\geq 1 - \delta$, the error of $[a, b]$ is $\leq \epsilon$  
where the error of the interval is the probability of $X$ landing in $[c, d] \setminus [a, b]$
* *proof*
    * define $c^+ \geq c$ such that $P(X \in [c, c^+]) = \epsilon / 2$
    * define $d^- \leq d$ s.t. $P(X \in [d^-, d]) = \epsilon / 2$  
    (such points exist when $D$ is continuous; if $P(X \in [c, d]) < \epsilon$, the claim is trivial)
    * then $c^+$ and $d^-$ don't actually depend on $a, b$ but only on $c, d$ and $D$
    * "good event" (GE): sample includes at least one point in $[c, c^+]$ and one in $[d^-, d]$  
    then $a \leq c^+$ and $b \geq d^-$
    * then if GE holds, the error of $[a, b]$ is $\leq \epsilon$  
    $err([a, b]) = P([c, a]) + P([b, d]) \leq P([c, c^+]) + P([d^-, d]) \leq \epsilon$
    * then with probability $\geq 1 - \delta$, GE holds  
    $P($ no samples in $[c, c^+]) \leq (1 - \epsilon / 2)^m \leq e^{-\epsilon m / 2}$  
    we need this to be $\leq \delta / 2 \implies m \geq \frac{2}{\epsilon} \log \frac{2}{\delta}$  
    * the same argument bounds $P($ no samples in $[d^-, d])$ by $\delta / 2$, so by the union bound GE holds w.p. $\geq 1 - \delta$ once $m \geq \frac{2}{\epsilon} \log \frac{2}{\delta}$
        * (recall $1 - x \leq e^{-x}$)
* what can we hope for if the "true function" that assigns labels to data is not represented in $H$?
    * e.g., using linear classifier but true is quadratic
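
The min-interval learner and its sample bound $m \geq \frac{2}{\epsilon} \log \frac{2}{\delta}$ can be sketched as follows. This is only an illustration: the choice $X \sim Uniform(0, 1)$ and the names `min_interval`, `sample_size`, and `failure_rate` are assumptions made here, not part of the lecture.

```python
import math
import random


def min_interval(xs, ts):
    """Tightest interval [a, b] containing every positive example (None if there are none)."""
    positives = [x for x, t in zip(xs, ts) if t == 1]
    if not positives:
        return None
    return min(positives), max(positives)


def sample_size(eps, delta):
    """Smallest integer m with m >= (2 / eps) * log(2 / delta)."""
    return math.ceil(2.0 / eps * math.log(2.0 / delta))


def failure_rate(c, d, eps, delta, trials=2000, seed=0):
    """Fraction of trials whose learned interval has error > eps, assuming X ~ Uniform(0, 1)."""
    rng = random.Random(seed)
    m = sample_size(eps, delta)
    failures = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(m)]
        ts = [1 if c <= x <= d else -1 for x in xs]
        learned = min_interval(xs, ts)
        if learned is None:
            err = d - c  # every point of [c, d] is misclassified
        else:
            a, b = learned
            err = (a - c) + (d - b)  # uniform mass of [c, a) and (b, d]
        failures += err > eps
    return failures / trials
```

Calling `failure_rate(0.2, 0.7, 0.1, 0.05)` estimates how often the learned interval misses the $\epsilon$ target; the claim says this frequency should not exceed $\delta$, and in practice it tends to be far smaller because the bound is conservative.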

### agnostic PAC learning

* PAC: "probably approximately correct"

* algorithm $A$ agnostically PAC learns class $H$ if, for any distribution $D$ (i.e., any labeling of the data, not necessarily by some $h \in H$), whenever $A$ uses $m \geq f(\delta, \epsilon)$ examples sampled iid from $D$, then with probability $\geq 1 - \delta$, $A$ outputs $h$ such that $err(h) \leq err(h^*) + \epsilon$ where $h^* = \arg\min_{h \in H} err(h)$
    * $f(\delta, \epsilon)$ is some polynomial in $(\frac{1}{\delta}, \frac{1}{\epsilon})$

**claim**: the ERM algorithm agnostically PAC learns $H$ if $m \geq (*)$ 

* note: ERM finds $h \in H$ which minimizes training set error

chernoff/hoeffding bounds

* estimating $p = P(H)$
* $\hat{p} = \frac{1}{n} \sum x_i$ where $x_i$ is 1 if heads (0 if tails) from iid
* $P(|p - \hat{p}| > \alpha) \leq 2 e^{-2 n \alpha^2}$

* more generally, $X_i \in [a, b]$ s.t. $E[X_i] = \mu$
    * $\hat{\mu} = \bar{X} = \frac{1}{n} \sum_i X_i$
    * then $P(|\mu - \hat{\mu}| > \alpha) \leq 2 e^{-2 n \alpha^2 / (b - a)^2}$

*proof*

* thought to consider: what would guarantee $err(ERM(S)) \leq err(h^*) + \epsilon$?
    * what we know: ERM picks model with minimum training set error  
    $\hat{err}(h) = \frac{1}{n} \sum_i I(h(x_i) \neq t_i)$, training set error of $h$  
    (note that $err(h)$ is the population error of $h$)

* "good event": $\forall h \in H$, $|err(h) - \hat{err}(h)| \leq \epsilon / 2$
    * $err(h) = P(h(X) \neq t)$
* lemma 1: if GE holds, then $err(ERM(S)) \leq err(h^*) + \epsilon$

    * let $\hat{h} = ERM(S)$
    * then $err(\hat{h}) \leq \hat{err}(\hat{h}) + \epsilon / 2$ (GE)  
    $\leq \hat{err}(h^*) + \epsilon / 2$ (ERM)  
    $\leq err(h^*) + \epsilon / 2 + \epsilon / 2$  
    $= err(h^*) + \epsilon$
* lemma 2: w.p. $\geq 1 - \delta$, $\forall h \in H$, $|err(h) - \hat{err}(h)| \leq \epsilon / 2$
    * let $e_i = 1$ if $h(x_i) \neq t_i$ and $0$ otherwise
    * $P(e_i = 1) = P(h(X_i) \neq t_i) = err(h)$
    * fix one hypothesis $h$ and take a random sample $S \sim D^n$  
    $P(|err(h) - \hat{err}(h)| \geq \epsilon / 2) \leq 2 \exp(-n \epsilon^2 / 2)$  
    we want this to be $\leq \delta / |H|$
        * to guarantee this, need $n \epsilon^2 / 2 \geq \log \frac{2 |H|}{\delta}$  
        $\implies n \geq \frac{2}{\epsilon^2} \log \frac{2 |H|}{\delta}$
    * $P(\text{GE does not occur})$  
    $= P(\exists h \in H \text{ s.t. } |\hat{err}(h) - err(h)| > \epsilon / 2)$  
    $\leq \sum_h P(|\hat{err}(h) - err(h)| > \epsilon / 2)$  
    $\leq \sum_h \frac{\delta}{|H|} = \delta$
    
* $P(|err(f(S)) - \hat{err}(f(S))| \geq \epsilon / 2)$  
realizable case: $P(\hat{err}(ERM(S)) = 0) = 1$ by definition/assumption
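
The two lemmas can be exercised on a toy problem. The sketch below assumes a finite class of threshold classifiers $h_\theta(x) = +1$ iff $x \geq \theta$, $x \sim Uniform(0, 1)$, and labels from $\theta = 0.5$ flipped with probability $0.1$; all of these choices and the function names are hypothetical and serve only to make the bound $n \geq \frac{2}{\epsilon^2} \log \frac{2 |H|}{\delta}$ concrete.

```python
import math
import random


def agnostic_sample_size(eps, delta, num_hypotheses):
    """Smallest integer n with n >= (2 / eps^2) * log(2 |H| / delta), as in lemma 2."""
    return math.ceil(2.0 / eps ** 2 * math.log(2.0 * num_hypotheses / delta))


def predict(theta, x):
    return 1 if x >= theta else -1


def empirical_error(theta, sample):
    """Training set error of the threshold classifier h_theta."""
    return sum(predict(theta, x) != t for x, t in sample) / len(sample)


def erm(thetas, sample):
    """Return a hypothesis in the finite class with minimum training set error."""
    return min(thetas, key=lambda theta: empirical_error(theta, sample))


def population_error(theta, noise=0.1):
    """Exact err(h_theta) under the assumed toy distribution."""
    return noise + (1.0 - 2.0 * noise) * abs(theta - 0.5)


def draw_sample(n, rng, noise=0.1):
    sample = []
    for _ in range(n):
        x = rng.random()
        t = 1 if x >= 0.5 else -1
        if rng.random() < noise:
            t = -t  # label noise makes the problem non-realizable
        sample.append((x, t))
    return sample
```

With `thetas = [i / 10 for i in range(11)]`, $\epsilon = 0.1$ and $\delta = 0.05$, the bound asks for 1218 examples, after which `population_error(erm(thetas, sample))` should be within $\epsilon$ of the best achievable error $0.1$ with probability at least $1 - \delta$.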

#### from classification to any type of prediction (e.g., regression)

recall for (one type of) classification:

* $l_{0-1}(t, \hat{t}) = \begin{cases} 1 & t \neq \hat{t} \\ 0 & else \end{cases}$
* $err(h) = E[l_{0-1}(t, h(x))]$
* let $risk(h) = E[l(t, h(x))]$, the expected loss

for regression

* typically $l_2(t, \hat{t}) = (t - \hat{t})^2$
* $\hat{t} = w^\top x$
* $H = \{w\}$, the possible choices (domain?) of $w$
* $risk(w) = E[l(t, w^\top x)]$

log loss

* $l(t, p_w) = -\log p(t | x, w)$
* $risk(w) = E[-\log p(t | x, w)]$

important caveat:  
ERM agnostically PAC learns $H$

$R(h) = E_{x, t}[l(h, (x, t))]$

then $\hat{R}(h) = N^{-1} \sum_i l(h, (x_i, t_i))$ (which $\stackrel{p}{\to} R(h)$)

* lemma 1 is evident from this

for lemma 2, use hoeffding inequality

* e.g., for square loss
    * need to assume $|t_i| \leq T$ and $||x_i|| \leq r$ and $||w|| \leq B$
    * then $(t_i - w^\top x_i)^2 \leq (T + B r)^2 = B^*$

#### markov's inequality

if random variable $X \geq 0$, then $P(X \geq a) \leq \frac{E[X]}{a}$

*proof*  

$E[X] = \int_0^\infty x p(x) dx$  
$= \int_0^a x p(x) dx + \int_a^\infty x p(x) dx$  
$\geq \int_a^\infty x p(x) dx$  
$\geq \int_a^\infty a p(x) dx$  
$= a P(X \geq a)$

#### chebychev's inequality

let $E[Z] = \mu$ and $Var(Z) = \sigma^2$  
then $P(|Z - \mu| \geq a) \leq \frac{\sigma^2}{a^2}$

*proof*

start with markov's inequality  
$P(|Z - \mu| \geq a) = P((Z - \mu)^2 \geq a^2)$  
$\leq \frac{E[(Z - \mu)^2]}{a^2}$  
$= \sigma^2 / a^2$

#### using chebychev's inequality to estimate the mean

let $\bar{X}$ be the sample mean of iid random variables with $E[X_i] = \mu$ and $Var(X_i) = \sigma^2$

then $E[\bar{X}] = \mu$  
and $Var(\bar{X}) = \sigma^2 / n$

$P(|\bar{X} - \mu| > a) \leq \frac{\sigma^2}{n a^2}$

hoeffding's inequality: if $X_i \in [a, b]$, then 
$P(|\bar{X} - \mu| > \alpha) \leq 2 e^{-\frac{2 n \alpha^2}{(b - a)^2}}$

chernoff: $P(|p - \hat{p}| > \alpha) \leq 2 e^{-2 n \alpha^2}$

hoeffding: $P(|\bar{X} - \mu| > \alpha) \leq 2 e^{-\frac{2 n \alpha^2}{(b - a)^2}}$

**e.g.** markov vs chernoff

given 

* coin with $P(H) = .25$
* $n = 200$
* $\hat{p} = n^{-1} \sum_i X_i$

solve for an upper bound for $P(\hat{p} \geq .5)$

* markov: $E[X] / a = .25 / .5 = .5$
* chernoff: $P(\hat{p} \geq .5) \leq P(|p - \hat{p}| \geq .25) \leq 2 e^{-2 (200) (.0625)} \approx 3 \times 10^{-11}$
* can also use chebychev (upper bound of $.015$)
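
The three bounds for this coin example can be computed directly and compared with the exact binomial tail. The helper name `tail_bounds` is an assumption made for this sketch.

```python
import math


def tail_bounds(p, n, threshold):
    """Upper bounds on P(p_hat >= threshold) for n flips of a coin with P(H) = p."""
    a = threshold - p  # required deviation of p_hat above its mean
    k_min = math.ceil(n * threshold)
    exact = sum(
        math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(k_min, n + 1)
    )
    return {
        "markov": p / threshold,                      # E[p_hat] / threshold
        "chebyshev": p * (1 - p) / (n * a ** 2),      # Var(p_hat) / a^2
        "hoeffding": 2 * math.exp(-2 * n * a ** 2),   # two-sided bound
        "exact": exact,                               # binomial tail, for reference
    }


bounds = tail_bounds(p=0.25, n=200, threshold=0.5)
for name, value in bounds.items():
    print(f"{name:>10}: {value:.3e}")
```

The ordering markov > chebyshev > hoeffding > exact shows how much is gained by using variance information and then the exponential moment argument.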

**e.g.** 

given two coins $p_1 = .25$, $p_2 = .5$

want to identify which coin has $p = .25$

one method: pick one coin, flip $n$ times, and if $\hat{p} \leq .375$, then it is coin 1, else it's coin 2

claim: $\forall \delta > 0$, $\exists N$ s.t. $n > N \implies$ we pick the correct coin w.p. $1 - \delta$

*proof*  
in either case, if we make the wrong choice, $|p - \hat{p}| \geq .125$  
$P(|p - \hat{p}| \geq .125) \leq 2 e^{-2 n (.125)^2} \leq \delta$  
$\implies n \geq 32 \log \frac{2}{\delta}$
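
A small simulation of this rule is sketched below. The function names and the simulated coins are assumptions made here; `flips_needed` simply solves $2 e^{-2 n g^2} \leq \delta$ for a gap $g$, which gives $n \geq 32 \log \frac{2}{\delta}$ when $g = .125$.

```python
import math
import random


def flips_needed(gap, delta):
    """Smallest integer n with 2 * exp(-2 * n * gap^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * gap ** 2))


def guess_is_coin1(p_true, n, rng, cutoff=0.375):
    """Flip the chosen coin n times and declare 'coin 1' iff p_hat <= cutoff."""
    heads = sum(rng.random() < p_true for _ in range(n))
    return heads / n <= cutoff


def error_rate(p_true, is_coin1, n, trials=2000, seed=0):
    """Empirical frequency of a wrong declaration for a coin with bias p_true."""
    rng = random.Random(seed)
    wrong = sum(guess_is_coin1(p_true, n, rng) != is_coin1 for _ in range(trials))
    return wrong / trials
```

For $\delta = 0.05$ this gives $n = 119$ flips, and both `error_rate(0.25, True, 119)` and `error_rate(0.5, False, 119)` should come out below $\delta$.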

**e.g.**

given two coins with $p_1 = .25$ and unknown $p_2$

want to identify coin with $P(H) \leq .25$ (coin 1 is this, coin 2 could be this)

no guarantee without knowing what $|p_1 - p_2|$ is (or characterizing it in some way)

**e.g.**

given two coins $p_1 = .25$, $p_2$ unknown

goal: identify coin with $P(H) \leq .375$

method/algorithm: 

* if $p_2 < .25$, then flip both coins $n$ times and choose one with smaller $\hat{p}$, guaranteed to choose a correct answer
* if $p_2 \in [.25, .375]$, same method as in previous case, guaranteed to choose a correct answer
* if $p_2 > .375$, not guaranteed to choose correct answer if $\hat{p}_1 > \hat{p}_2$
    * happens when at least one estimate is off by $\geq (.375 - .25) / 2 = .0625$
    * only one-sided deviations matter here: need $P(\hat{p}_1 - p_1 \geq .0625) \leq \delta / 2$ and $P(p_2 - \hat{p}_2 \geq .0625) \leq \delta / 2$
    * one-sided Hoeffding: $e^{-2 n (.0625)^2} \leq \delta / 2 \implies n \geq 128 \log \frac{2}{\delta}$

**remark**: we require $n \propto \epsilon^{-2}$ and $n \propto -\log \delta$ for this to work

**e.g.**

given

* $k$ coins
* known: $p_1 \leq \alpha$
* unknown: everything else about the other coins

goal: find coin s.t. $p_i \leq 2 \alpha$

algorithm: 

* flip each coin $n$ times
* pick coin $i = \arg\min_i \hat{p}_i$

claim: if $n \geq \frac{2}{\alpha^2} \log \frac{2 k}{\delta}$, then w.p. $\geq 1 - \delta$, we choose coin that satisfies the goal

*proof*

* lemma 1: if $\forall i$, $|p_i - \hat{p}_i| \leq \alpha / 2$, then the algorithm picks $i$ s.t. $p_i \leq 2 \alpha$  
let $j$ be any coin s.t. $p_j > 2 \alpha$  
$\hat{p}_j \geq p_j - \alpha / 2 > \alpha + \alpha / 2$  
$\hat{p}_1 \leq p_1 + \alpha / 2 \leq \frac{3}{2} \alpha$  
$\hat{p}_1 < \hat{p}_j$  
so we do not pick $j$
* lemma 2: w.p. $\geq 1 - \delta$, $\forall i$, $|p_i - \hat{p}_i| \leq \alpha / 2$  
$P(|\hat{p}_i - p_i| > \alpha / 2) \leq 2 \exp(-2 n \alpha^2 / 4) = 2 \exp(-n \alpha^2 / 2)$  
$n \geq \frac{2}{\alpha^2} \log \frac{2 k}{\delta} \implies 2 \exp(-n \alpha^2 / 2) \leq \delta / k$  
$P(\exists i \text{ s.t. } |\hat{p}_i - p_i| > \alpha / 2) \leq k \cdot \delta / k = \delta$ (union bound)
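
The $k$-coin algorithm is short enough to write out. In the sketch below the coin biases are simulated and the names `flips_per_coin` and `pick_coin` are hypothetical; coin 0 in the code plays the role of coin 1 in the notes.

```python
import math
import random


def flips_per_coin(alpha, k, delta):
    """Smallest integer n with n >= (2 / alpha^2) * log(2 k / delta)."""
    return math.ceil(2.0 / alpha ** 2 * math.log(2.0 * k / delta))


def pick_coin(ps, n, rng):
    """Flip each (simulated) coin n times and return the index with the smallest p_hat."""
    estimates = [sum(rng.random() < p for _ in range(n)) / n for p in ps]
    return min(range(len(ps)), key=lambda i: estimates[i])
```

For example, with `ps = [0.2, 0.9, 0.5, 0.45, 0.3]`, $\alpha = 0.2$ and $\delta = 0.1$, the bound gives 231 flips per coin, and the chosen index `i` should satisfy `ps[i] <= 2 * alpha` with probability at least $1 - \delta$.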

*proof of Hoeffding's bound*

statement: $P(|\mu - \hat{\mu}| > \alpha) \leq 2 e^{-2 n \alpha^2 / (b - a)^2}$

we will consider $x_i \in [0, 1]$

consider $P(\hat{\mu} \geq \mu + \alpha)$

claim: for any $h > 0$, $\hat{\mu} \geq \mu + \alpha \iff e^{\hat{\mu}} \geq e^{\mu + \alpha}$  
$\iff e^{n h \hat{\mu}} \geq e^{n h (\mu + \alpha)}$

$P(e^{n h \hat{\mu}} \geq e^{n h (\mu + \alpha)}) \leq \frac{E[e^{n h \hat{\mu}}]}{e^{n h (\mu + \alpha)}}$ (Markov)  
$= e^{-n h \mu - n h \alpha} E[e^{h \sum x_i}]$  
$= e^{-n h \mu - n h \alpha} E[e^{\sum h x_i}]$  
$= e^{-n h \mu - n h \alpha} \prod E[e^{h x_i}]$ (independence)  
$\leq e^{-n h \mu - n h \alpha} \prod E[1 - x_i + x_i e^h]$ (convexity of $e^{h x}$ on $[0, 1]$)  
$= e^{-n h \mu - n h \alpha} (1 - \mu + \mu e^h)^n$  
$= e^{-n h \mu - n h \alpha + n \log (1 - \mu + \mu e^h)}$  
$= e^{-n h \alpha + n (-h \mu + \log (1 - \mu + \mu e^h))}$

we can show that the part in the parentheses is $L(\mu, h) \leq h^2 / 8$  
$L(\mu, h) = -h \mu + \log (1 - \mu + \mu e^h)$  
$L(\mu, 0) = 0$, $L'(\mu, 0) = 0$, $L''(\mu, h) \leq 1 / 4$

and we get $\cdots \leq e^{-n h \alpha + n h^2 / 8}$  
letting $h = 4 \alpha$, we get  
$e^{-n 4 \alpha^2 + n (16 \alpha^2) / 8}$  
$= e^{-2 n \alpha^2}$

similar proof for $P(\hat{\mu} < \mu - \alpha) \leq e^{-2 n \alpha^2}$
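
The key inequality $L(\mu, h) \leq h^2 / 8$ and the choice $h = 4 \alpha$ can be sanity-checked numerically. This is a grid check under assumed ranges ($\mu \in [0, 1]$, $h \in [0, 10]$), not a proof; the function names are made up for this sketch.

```python
import math


def L(mu, h):
    """L(mu, h) = -h * mu + log(1 - mu + mu * e^h), as defined in the proof."""
    return -h * mu + math.log(1.0 - mu + mu * math.exp(h))


def worst_gap(grid=101, h_max=10.0):
    """Largest value of L(mu, h) - h^2 / 8 over a grid; should be <= 0 up to rounding."""
    worst = -math.inf
    for i in range(grid):
        mu = i / (grid - 1)
        for j in range(grid):
            h = h_max * j / (grid - 1)
            worst = max(worst, L(mu, h) - h * h / 8.0)
    return worst


def chernoff_exponent(alpha, h):
    """Per-sample exponent -h * alpha + h^2 / 8 left after applying the lemma."""
    return -h * alpha + h * h / 8.0
```

Minimizing `chernoff_exponent(alpha, h)` over `h` gives $h = 4 \alpha$ and the value $-2 \alpha^2$, which is where the $e^{-2 n \alpha^2}$ rate comes from.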

PAC: $\forall D$, $\forall \epsilon$, $\forall \delta$, the algorithm uses time/samples $O(\epsilon^{-a} \delta^{-b})$ and w.p. $\geq 1 - \delta$ outputs $h$ s.t. $E[l(h, (x, t))] \leq E[l(h^*, (x, t))] + \epsilon$ where $h^* = \arg\min_{h \in H} E[l(h, (x, t))]$, the optimal solution

def: "perhaps PAC learnable": $\delta = 1/2$, arbitrary $D$ and $\epsilon$

claim: if $H$ is perhaps PAC learnable, then $H$ is PAC learnable

alternatively, relax the claim on $\epsilon$ and focus on $\forall D$, $\forall \delta$

claim: if $H$ is weakly PAC learnable, then $H$ is PAC learnable

# Model Selection

* so far, finite hypotheses
* fixed sample size $m$ and a hierarchy of classes $H_0 \subset H_1 \subset H_2 \subset \cdots$
* pretend that we learn on each class separately
    * learning on $H_j$, ERM will pick $h_j \in H_j$  
    $err(h_j) \leq err(h_j^*) + \sqrt{\frac{2}{m} \log \frac{2 |H_j|}{\delta}}$, where $h_j^*$ is the best hypothesis in $H_j$

* $\epsilon_k = \sqrt{\frac{2}{m} \left(\log \frac{4 |H_k|}{\delta} + 2 \log k\right)}$

* "good event": $\forall k$, $\forall h \in H_k$, $|err(h) - \hat{err}(h)| \leq \frac{\epsilon_k}{2}$

* structural risk minimization: pick $k$ and $\hat{h}_k \in H_k$ s.t. $\hat{err}(\hat{h}_k) + \frac{\epsilon_k}{2}$ is minimized
    * training set error + penalty
    * to run this: run ERM on each $H_k$ and compare error + penalty for each $k$
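
The selection step of structural risk minimization can be sketched as below, assuming the per-level ERM training errors have already been computed. Levels are indexed from $k = 1$ here so that $\log k$ is defined, and the training errors and class sizes in the usage note are made-up numbers.

```python
import math


def epsilon_k(k, class_size, m, delta):
    """epsilon_k = sqrt((2 / m) * (log(4 |H_k| / delta) + 2 log k)), for k >= 1."""
    return math.sqrt(2.0 / m * (math.log(4.0 * class_size / delta) + 2.0 * math.log(k)))


def srm_select(train_errors, class_sizes, m, delta):
    """Pick the level k minimizing training error + epsilon_k / 2.

    train_errors[k - 1] is the training error of the ERM hypothesis in H_k,
    class_sizes[k - 1] is |H_k|.
    """
    scores = [
        err + epsilon_k(k, size, m, delta) / 2.0
        for k, (err, size) in enumerate(zip(train_errors, class_sizes), start=1)
    ]
    best_k = min(range(1, len(scores) + 1), key=lambda k: scores[k - 1])
    return best_k, scores
```

For instance, `srm_select([0.30, 0.15, 0.10, 0.09], [10, 100, 10**4, 10**8], m=1000, delta=0.05)` prefers level 3: level 4 has a slightly lower training error, but its larger penalty outweighs the gain.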

proof structure

* L1: if good event holds, then $err(\hat{h}_k) \leq err(h^*) + \epsilon_M$, where $M$ is a level with $h^* \in H_M$
* L2: w.p. $\geq 1 - \delta$, good event holds

*proof of L1*

$err(h^*) \geq \hat{err}(h^*) - \frac{\epsilon_M}{2} \geq \hat{err}(\hat{h}_M) - \frac{\epsilon_M}{2}$  
$= \hat{err}(\hat{h}_M) + \frac{\epsilon_M}{2} - \epsilon_M$  
$\geq \hat{err}(\hat{h}_k) + \frac{\epsilon_k}{2} - \epsilon_M$  
$\geq err(\hat{h}_k) - \epsilon_M$  
$\implies err(\hat{h}_k) \leq err(h^*) + \epsilon_M$

*proof of L2*

* fix $i$ and $h \in H_i$
* $P(|err(h) - \hat{err}(h)| \geq \epsilon_i / 2) \leq 2 e^{-2 m (\epsilon_i / 2)^2}$  
$= 2 e^{-2 m \frac{1}{4} \frac{2}{m} (\log \frac{4 |H_i|}{\delta} + 2 \log i)}$  
$= 2 (\frac{\delta}{4 |H_i|} \frac{1}{i^2})$
$= \frac{\delta}{2 |H_i|} \frac{1}{i^2}$
* then $P(\exists h \in H_i \text{ s.t. } |\cdot| > \frac{\epsilon_i}{2}) \leq |H_i| \frac{\delta}{2 |H_i| i^2}$ 
$= \frac{\delta}{2 i^2}$
* then $P(\exists i, \exists h \in H_i \text{ s.t. } |\cdot| > \epsilon_i / 2) \leq \frac{\delta}{2} \sum_{i=1}^\infty \frac{1}{i^2} = \frac{\delta}{2} \frac{\pi^2}{6} \leq \delta$

* $\forall \epsilon, \delta, h$, let $k(h) =$ min level s.t. $h \in H_k$  
$m(\epsilon, \delta, h) = \frac{2}{\epsilon^2} (\log \frac{4 |H_{k(h)}|}{\delta} + 2 \log k(h))$

* claim: $\forall h \in \cup H_i$, if $m \geq m(\epsilon, \delta, h)$  
then $err(\hat{h}_k) \leq err(h) + \epsilon$

# VC-Dimension

* previous bounds: $\frac{1}{\epsilon} \log \frac{|H|}{\delta}$
    * look at $|H|$ as number of "events" to analyze  
    (number of "behaviors" that $H$ can exhibit on data)

$\mathcal{X}$ set of possible examples

$c$ concept/hypothesis, $c \subset \mathcal{X}$

$2^\mathcal{X}$ set of all subsets of $\mathcal{X}$

$\mathcal{H}$ hypothesis class / concept class, $\mathcal{H} \subset 2^\mathcal{X}$

$S$ sample, $S \subset \mathcal{X}$

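$\pi_\mathcal{H}(S)$ set of behaviors that $\mathcal{H}$ can induce on $S$, i.e., $\{c \cap S \mid c \in \mathcal{H}\}$

$\pi_\mathcal{H}(m)$ largest $|\pi_\mathcal{H}(S)|$ for sample of size $m$

**def** $\mathcal{H}$ shatters $S \iff |\pi_\mathcal{H}(S)| = 2^{|S|}$

$VCD(\mathcal{H})$ is the size of the largest set that is shattered by $\mathcal{H}$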

**case**: interval

claim: VCD(intervals) = 2

proof (sketch): need to show that largest shattered set is of size 2

* show one set of size 2 which is shattered
* show no set of size 3 can be shattered
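
Both steps of the proof sketch can be checked by brute force. The code below enumerates the behaviors that closed intervals induce on a finite set of distinct points; the function names are assumptions made for this sketch.

```python
from itertools import combinations


def interval_labelings(points):
    """All subsets of distinct `points` that some closed interval [lo, hi] can pick out."""
    pts = sorted(points)
    behaviors = {frozenset()}  # an interval that misses every point
    for i in range(len(pts)):
        for j in range(i, len(pts)):
            behaviors.add(frozenset(pts[i:j + 1]))  # intervals select contiguous runs
    return behaviors


def is_shattered(points):
    """True iff intervals induce all 2^|S| labelings on the given points."""
    return len(interval_labelings(points)) == 2 ** len(points)


print(is_shattered([1, 2]))     # a set of size 2
print(is_shattered([1, 2, 3]))  # a set of size 3: {1, 3} without 2 is unreachable
print(any(is_shattered(s) for s in combinations(range(6), 3)))
```

On three points only 7 of the 8 labelings are reachable, since no interval can contain the two outer points while excluding the middle one.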

significance of VCD

in realizable case, $O(\epsilon^{-1} \log \delta^{-1} + \frac{d}{\epsilon} \log \epsilon^{-1})$ suffices for PAC learning

* $d = VCD(H)$

in nonrealizable (agnostic) case, $O(\epsilon^{-2} \log \frac{d}{\delta})$

$\Omega(d / \epsilon)$ sample size required for PAC learning

**case:** union of two intervals

VCD is 4

**case**: rectangles in $\mathbb{R}^2$

VCD is 4

conjunctions over boolean variables

claim: VCD(conjunctions over $n$ variables) = $n$

*proof that $VCD \geq n$*

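* let $S = \{0_i\}_{i=1}^n$, where $0_i$ has a $0$ in position $i$ and $1$ elsewhere
* for any $+/-$ assignment, can find conjunction
* pick any $I \subset \{1, ..., n\}$ to be labeled $-$, e.g., $I = \{1, 3\}$; then $x_1 \wedge x_3$ is false exactly on $0_1$ and $0_3$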

linear threshold function

$\sum_{k=1}^d w_k x_k \geq \theta$

claim: $VCD(LTF) = d + 1$

claim: $VCD(NN_{s, d}) \leq 2 (d + 1) (1 + \log s)$

* $NN_{s, d}$ is any neural net with $s$ nodes where each node is LTF with $d$ input

**def**  
$\phi_d(0) = \phi_0(m) = 1$  
$\phi_d(m) = \phi_d(m - 1) + \phi_{d-1}(m-1)$

**lemma 1** $\phi_d(m) = \sum_{i=0}^d \binom{m}{i}$

**lemma 2** $d \geq m \implies \phi_d(m) = 2^m$, $d < m \implies \phi_d(m) \leq (e m / d)^d = O(m^d)$

* *proof*
    * *case $d \geq m$*
        * $\sum_i^d \binom{m}{i} = \sum_i^m \binom{m}{i} = (1 + 1)^m = 2^m$
    * *case $d < m$*
        * $\sum_i^d \binom{m}{i} = (m/d)^d (d/m)^d \sum_i^d \binom{m}{i}$  
        $= (m/d)^d \sum_i^d \binom{m}{i} (d/m)^d$  
        $\leq (m/d)^d \sum_i^d \binom{m}{i} (d/m)^i$  
        $< (m/d)^d \sum_i^m \binom{m}{i} (d/m)^i$  
        $= (m/d)^d (1 + d/m)^m$  
        $\leq (m/d)^d e^{(d/m) m}$  
        $= (e m / d)^d$
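
Lemmas 1 and 2 are easy to verify for small values. The sketch below implements the recursive definition of $\phi_d(m)$ and the binomial sum, and the usage note compares them with the bound $(e m / d)^d$; the function names are made up here.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def phi_rec(d, m):
    """phi_d(m) from the recursive definition, with phi_d(0) = phi_0(m) = 1."""
    if d == 0 or m == 0:
        return 1
    return phi_rec(d, m - 1) + phi_rec(d - 1, m - 1)


def phi_sum(d, m):
    """phi_d(m) as the binomial sum of lemma 1 (math.comb(m, i) is 0 when i > m)."""
    return sum(math.comb(m, i) for i in range(d + 1))


def sauer_bound(d, m):
    """The polynomial bound (e m / d)^d of lemma 2, valid for 1 <= d < m."""
    return (math.e * m / d) ** d
```

For example, `phi_sum(3, 10)` is 176, far below $2^{10} = 1024$ and also below `sauer_bound(3, 10)`, which illustrates the switch from exponential to polynomial growth once $m > d$.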

**lemma 3** $VCD(H) = d \implies \pi_H(m) \leq \phi_d(m)$

* *proof*
    * consider any sample $S$, $|\pi_H(S)|$
    * pick $x \in S$ and "remove it"
    * compare $\pi_H(S)$ and $\pi_H(S \setminus \{x\})$
    * consider subsets of $S$ that include $x$ and subsets that do not include $x$
        * $|\pi_H(S)| = |\pi_H(S \setminus \{x\})| + |H'|$
            * $H'$ are sets in $\pi_H(S)$ that do not include $x$ but where version with $x$ is also in $\pi_H(S)$
            * $H'$ is a set of sets so it is a concept class  
            but only includes subsets of $S \setminus \{x\}$  
            $H' = \pi_{H'}(S \setminus \{x\})$
    * *claim*: $VCD(H') \leq d - 1$
        * *proof*
            * if set $B$ is shattered by $H'$, then $B \cup \{x\}$ is shattered by $H$
    * so $|\pi_H(S)| = |\pi_H(S \setminus \{x\})| + |H'|$, and the lemma follows
        * *proof* by induction on $m$
            * case $m = 0$  
            $|\pi_H(\emptyset)| = 1$, $\phi_d(0) = 1$
            * case $m$  
            $\pi_H(m) \leq \phi_d(m-1) + \phi_{d-1}(m-1) = \phi_d(m)$

$A = |err_D(h) - err_{S_1}(h)| > \alpha$

$B = |err_{S_1}(h) - err_{S_2}(h)| > \alpha / 2$

claim 1: $P(B) = P(B|A) P(A) + P(B | \neg A) P(\neg A) \geq P(B|A) P(A) \geq P(A) / 2$  
need to show $P(B|A) \geq 1/2$  
assume $A$ happens  
$|err_D(h) - err_{S_1}(h)| \geq \alpha$  
$P(|err_D(h) - err_{S_2}(h)| \geq \alpha / 2) \leq 2 \exp(-2 m (\alpha^2 / 4))$, want this to be $\leq 1/2$  
then $m \geq \frac{2 \log 4}{\alpha^2}$

$|err_{S_1}(h) - err_{S_2}(h)| = |err_{S_1}(h) - err_D(h) + err_D(h) - err_{S_2}(h)|$  
$\geq |err_{S_1}(h) - err_D(h)| - |err_D(h) - err_{S_2}(h)|$  
$\geq \alpha - \alpha / 2 = \alpha / 2$  
then w.p. $\geq 1/2$, $|err_{S_1}(h) - err_{S_2}(h)| \geq \alpha / 2$ $\implies B$

*proof* of claim 4

$S_1, S_2, h$ are fixed  
then we swap $i^{th}$ examples in $S_1, S_2$ w.p. $1/2$  
let $P(\text{error in } S_1) = 1/2$  
$|err_{S_1}(h) - err_{S_2}(h) | \geq \alpha / 2$  
$|n(H) - n(T)| \geq \alpha / 2 \times m$, let $n(T) = m' - n(H)$  
$(2 m')^{-1} (n(H) - n(T)) = (2 m')^{-1} (n(H) - (m' - n(H))) = (2 m')^{-1} (2 n(H) - m') = \hat{p} - 1/2 = \hat{p} - p$  
$\implies |p - \hat{p}| \geq \alpha / 2 \times m (2 m')^{-1} = \frac{\alpha}{4} \frac{m}{m'}$  
$P(|p - \hat{p}| \geq \alpha m / (4 m')) \leq 2 \exp(-2 m' (\alpha^2 / 16) (m^2 / (m')^2))$  
$= 2 \exp(-(\alpha^2 / 8) m^2 / m')$  
$\leq 2 e^{-\alpha^2 m / 8}$ (since $m' \leq m$)

## Rademacher complexity

$\sup_h |L_D(h) - L_S(h)| \leq \epsilon$  
$\stackrel{?}{\implies} L_{S_1}(h) - L_{S_2}(h)$

$\sup_h L_{S_1}(h) - L_{S_2}(h) = \sup_h m^{-1} \sum_{i=1}^m (l_h(z_i) - l_h(z_i'))$  
$= \sup_h m^{-1} \sum_{i=1}^{2m} \sigma_i l_h(z_i)$  
where $S_1 = (z_1, ..., z_m)$, $S_2 = (z_1', ..., z_m')$, $\sigma_i = +1$ for $i = 1, ..., m$ and $\sigma_i = -1$ for $i = m+1, ..., 2m$  
and $z_i = z_{i-m}'$ for $i > m$

rademacher random variable: $x = \pm 1$ each with probability $1/2$,  
typically denoted as $\sigma \sim Rademacher$

for set $A \subset \mathbb{R}^m$,  
$R(A) = m^{-1} E_{\sigma \in \{-1, 1\}^m} [\sup_{a \in A} \sum_i^m \sigma_i a_i]$

normally, we think about $h \in H$ but only care about losses of $h$  
$h \to l(h, z)$, $z = (x, y)$  
$H = \{h\} \implies F = \{l_h(z) = l(h, z) \mid h \in H\}$

compose $F$ with sample $S$ $\implies F(S)$  
$F(S) = \{ \begin{bmatrix} f(z_1) & \cdots & f(z_m) \end{bmatrix}^\top \mid f \in F\}$

then $R(F(S)) = m^{-1} E_\sigma [\sup_{f \in F} \sum \sigma_i f(z_i)]$
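
For a small finite set $A \subset \mathbb{R}^m$, $R(A)$ can be computed exactly by enumerating all $2^m$ sign vectors. The sketch below does this; in the setting of the notes, `A` would be the list of loss vectors $F(S)$, and the function name `rademacher` is an assumption made here.

```python
from itertools import product


def rademacher(A):
    """R(A) = (1/m) E_sigma[ max_{a in A} sum_i sigma_i a_i ], by exact enumeration.

    A is a non-empty list of equal-length tuples; cost is O(2^m * |A| * m),
    so this is only practical for small m.
    """
    m = len(A[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=m):
        total += max(sum(s * x for s, x in zip(sigma, a)) for a in A)
    return total / (2 ** m * m)


print(rademacher([(0.3, -1.0, 2.0)]))                   # a single vector
print(rademacher(list(product((-1, 1), repeat=3))))     # the full sign cube
print(rademacher([(1, 1), (-1, -1)]))                   # two antipodal vectors
```

A single vector cannot correlate with random signs on average, while the full sign cube can match every $\sigma$ exactly; these two extremes are what the complexity measure is meant to separate.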

$Rep(S) = \sup_h L_D(h) - L_S(h)$

*lemma*: $E_S[Rep(S)] \leq 2 E_S[R(F(S))]$

*proof*  
$E_S[\sup_h L_D(h) - L_S(h)] = E_S[\sup_h E_{S'} [L_{S'}(h) - L_S(h)]]$  
$\leq E_S [E_{S'} [\sup_h L_{S'}(h) - L_S(h)]]$  
$= E_S [E_{S'} [\sup_h m^{-1} \sum_i l_h(z_i') - m^{-1} \sum_i l_h(z_i)]]$  
$= E_S [E_{S'} [E_\sigma [\sup_h m^{-1} \sum_i \sigma_i (l_h(z_i') - l_h(z_i))]]]$ (swapping $z_i \leftrightarrow z_i'$ does not change the distribution)  
$\leq E_{S, S', \sigma} [\sup_h m^{-1} \sum_i \sigma_i l_h(z_i') + \sup_h m^{-1} \sum_i (-\sigma_i) l_h(z_i)]$  
$= E_{S'}[R(F(S'))] + E_S[R(F(S))] = 2 E_S[R(F(S))]$

*cor*: $E_S[L_D(ERM(S))] \leq L_D(h^*) + 2 E_S[R(F(S))]$

*proof*  
from lemma, $E_S[\sup_h (L_D(h) - L_S(h))] \leq 2 E_S[R(F(S))]$  
$\implies$ the bound also holds for $h = ERM(S)$  
$\implies E_S[L_D(ERM(S))] \leq E_S[L_S(ERM(S))] + 2 E_S[R(F(S))]$  
$\leq E_S[L_S(h^*)] + 2 E_S[R(F(S))]$  
$= L_D(h^*) + 2 E_S[R(F(S))]$

### McDiarmid's inequality

* consider function $f(\cdot)$ of sample, e.g. $f(z_1, ..., z_m)$  
and $|f(z_1, ... z_i, ..., z_m) - f(z_1, ..., z_i', ..., z_m)| \leq c_i$ ("not too sensitive")  
$P(f(z_1, ..., z_m) - E[f(z_1, ..., z_m)] \geq \epsilon) \leq e^{-2 \epsilon^2 / \sum c_i^2}$
    * also true for the other side  
    $P(E[f(\cdots)] - f(\cdots) > \epsilon) \leq e^{-2 \epsilon^2 / \sum c_i^2}$

* Hoeffding's inequality is a special case of this
    * $f = m^{-1} \sum x_i$, $x_i \in [a, b]$
    * then $c_i \leq \frac{b - a}{m}$
    * $\sum c_i^2 = \frac{(b-a)^2}{m}$  
    plugging this into the exponential yields $e^{-2 \epsilon^2 m / (b-a)^2}$

### conditions for bounded losses

* $-c \leq l_h(z_i) \leq c$
* then $Rep(S) = \sup_h L_D(h) - L_S(h)$
* what happens to $Rep(S)$ when we swap one example?  
$c_i \leq \frac{2 c}{m}$
* $R(F(S)) = m^{-1} E_\sigma \sup_f \sum \sigma_i f(z_i)$
* $\sum c_i^2 = \sum_i^m \frac{4 c^2}{m^2} = \frac{4 c^2}{m}$
* then $e^{-2 \epsilon^2 / \sum c_i^2} = e^{-2 \epsilon^2 m / (4 c^2)} = e^{-\frac{\epsilon^2 m}{2 c^2}} < \delta$  
$\implies \epsilon \geq \sqrt{\frac{2 c^2}{m} \log \frac{1}{\delta}}$

*theorem*: with probability at least $1 - \delta$, 

1. $\forall h$, $L_D(h) \leq L_S(h) + 2 E_S[R(F(S))] + \sqrt{\frac{2 c^2}{m} \log \frac{1}{\delta}}$

2. $\forall h$, $L_D(h) \leq L_S(h)+ 2 R(F(S)) + 3 \sqrt{\frac{2 c^2}{m} \log \frac{2}{\delta}}$

3. $L_D(ERM(S)) \leq L_D(h^*) + 2 E_S[R(F(S))] + 2 \sqrt{\frac{2 c^2}{m} \log \frac{2}{\delta}}$

4. $L_D(ERM(S)) \leq L_D(h^*) + 2 R(F(S)) + 4 \sqrt{\frac{2 c^2}{m} \log \frac{4}{\delta}}$

*proof* (of 1)  

w.p. $\geq 1 - \delta$, $Rep(S) \leq E[Rep(S)] + ((2 c^2 / m) \log (1 / \delta))^{1/2} \leq 2 E_S[R(F(S))] + ((2 c^2 / m) \log (1 / \delta))^{1/2}$

*proof* (of 2)

w.p. $\geq 1 - \delta / 2$, $Rep(S) \leq E[Rep(S)] + ((2 c^2 / m) \log (2 / \delta))^{1/2} \leq 2 E_S[R(F(S))] + ((2 c^2 / m) \log (2 / \delta))^{1/2}$  
then w.p. $\geq 1 - \delta / 2$, this is $\leq 2 R(F(S)) + 3 ((2 c^2 / m) \log (2 / \delta))^{1/2}$

*proof* (of 3)

$L_D(ERM(S)) - L_D(h^*) = L_D(ERM(S)) - L_S(ERM(S)) + L_S(ERM(S)) - L_S(h^*) + L_S(h^*) - L_D(h^*)$

* $L_S(ERM(S)) - L_S(h^*) \leq 0$
* for $L_D(ERM(S)) - L_S(ERM(S))$, apply part (1) with $1 - \delta / 2$:  
$2 E_S[R(F(S))] + ((2 c^2 / m) \log (2 / \delta))^{1/2}$
* for $L_S(h^*) - L_D(h^*)$, apply Hoeffding's inequality:  
$e^{-2 m \alpha^2} = \delta / 2$ where we plug in $\alpha^2 = (2 c^2 / m) \log (2 / \delta) (4 c^2)^{-1}$  

*fact*: let $A' = \{c a + a_0 \mid a \in A\}$ for $A \subset \mathbb{R}^m$, scalar $c$, and $a_0 \in \mathbb{R}^m$  
then $R(A') = |c| R(A)$

*lemma* (Massart's lemma for $R(\cdot)$ of finite sets): $R(A) \leq \max_{a \in A} ||a - \bar{a}|| m^{-1} \sqrt{2 \log |A|}$  
$\bar{a} = |A|^{-1} \sum_{a \in A} a$

*theorem*: if $m \geq \frac{4}{\epsilon^2} (4 \log \phi_d(m) + 8 \log \frac{2}{\delta})$,  
then w.p. $\geq 1 - \delta$, $L_D(ERM(S)) \leq L_D(h^*) + \epsilon$  
where the loss function is 0-1 loss

*proof*: consider $h \in H$ and their predictions on $S$, which is $\in \{0, 1\}^m$  
$|\pi_H(S)| \leq \pi_H(m) \leq \phi_d(m)$  
then $|F(S)| \leq \phi_d(m)$, so it is finite  
bound on norm is $\sqrt{\sum_i^m 1^2} = \sqrt{m}$  
$\implies R(F(S)) \leq \sqrt{m} m^{-1} \sqrt{2 \log \phi_d(m)}$  
$\implies$ w.p. $\geq 1 - \delta$, $L_D(ERM(S)) \leq L_D(h^*) + 2 \sqrt{2 m^{-1} \log \phi_d(m)} + 2 \sqrt{2 m^{-1} \log \frac{2}{\delta}}$ (part 3 of the theorem with $c = 1$)

## Linear functions and predictions based on them

assumptions/conditions

* $||x_i|| \leq R < \infty$
* $||w|| \leq B < \infty$
    * also implies $||w^*|| \leq B$
* $F_1$ are linear functions
    * $F_1 = \{f = w^\top x\}$
    * lemma (26.10): $R(F_1(S)) \leq \frac{B R}{\sqrt{m}}$
* $F_2$: loss function applied over linear score $w^\top x$  
$f = \phi_y(w^\top x) = l(w, (x, y))$  
$F_2 = \{f = \phi_y (w^\top x)\}$
    * e.g., square loss, log loss, hinge loss, 0-1 loss, ramp loss
    * lemma (26.9): if $\phi_y(a)$ is $\rho$-lipschitz $\forall y$,  
    then $R(F_2(S)) \leq \frac{\rho B R}{\sqrt{m}}$

want to derive bound $|l| \leq c$, $|l'| \leq \rho$

* square loss: $(w^\top x - y)^2$
    * need another assumption: $|y| \leq B R$
    * then $|l| \leq (BR + BR)^2 = 4 B^2 R^2$
    * $|l'| = 2 |w^\top x - y| \leq 4 B R$
* logistic loss: $-\log \sigma(y a) = \log (1 + e^{-y a})$ where $a = w^\top x$
    * $|-y a| \leq B R \implies l \leq \log (1 + e^{B R})$  
    $\leq \log 2 e^{B R} \leq 1 + BR$
    * $|l'| = \frac{e^{-y a}}{1 + e^{-y a}}$  
    $= \frac{1}{1 + e^{ya}} \leq 1$
* hinge loss
    * $|l| \leq 1 + BR$
    * $|l'| \leq 1$
* ramp loss: trivial
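
The bounds $|l| \leq c$ and $|l'| \leq \rho$ for the logistic and hinge losses can be spot-checked on a grid of scores $a = w^\top x$ with $|a| \leq B R$. This is a numerical sanity check under the assumption $y \in \{-1, +1\}$, not a proof; `grid_check` is a hypothetical helper.

```python
import math


def logistic_loss(y, a):
    """log(1 + exp(-y * a)) for label y in {-1, +1} and score a."""
    return math.log1p(math.exp(-y * a))


def hinge_loss(y, a):
    """max(0, 1 - y * a) for label y in {-1, +1} and score a."""
    return max(0.0, 1.0 - y * a)


def grid_check(loss, BR, c, rho, steps=1000):
    """True if |loss| <= c and loss is rho-Lipschitz between neighboring grid points on |a| <= BR."""
    step = 2.0 * BR / steps
    for y in (-1, 1):
        prev = None
        for k in range(steps + 1):
            a = -BR + k * step
            value = loss(y, a)
            if abs(value) > c + 1e-12:
                return False
            if prev is not None and abs(value - prev) > rho * step + 1e-12:
                return False
            prev = value
    return True
```

With $B R = 3$, both `grid_check(logistic_loss, 3.0, 4.0, 1.0)` and `grid_check(hinge_loss, 3.0, 4.0, 1.0)` should hold, matching $c = 1 + B R$ and $\rho = 1$.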

*corollary* (from theorem 3): w.p. $\geq 1 - \delta$, $L_D(ERM(S)) \leq L_D(h^*) + 2 m^{-1/2} \rho B R + 2 \sqrt{2 m^{-1} c^2 \log \frac{2}{\delta}}$  
$\rho = 1$, $c = 1 + BR$

consider SVM-esque algorithm: minimize hinge loss on $S$  
$L_D^{0-1}(ERM^{hinge}(S)) \leq L_D^{hinge}(ERM^{hinge}(S)) \leq L_D^{hinge}(w^*) + 2 m^{-1/2} BR + 2 (1 + B R) \sqrt{2 m^{-1} \log \frac{2}{\delta}}$

focus on separable case $\implies \exists w$ s.t. the hinge loss is $0$

recall for SVM $||w||^2 = \gamma^{-2}$  
$\implies$ bound $\sim R / \gamma$  
$\implies$ not dependent on dimension  
and $VCD(linear) = d + 1$

*corollary* (from 1 and 3): 

* w.p. $\geq 1 - \delta$, $\forall w$, $L_D(w) \leq L_S(w) + 4 m^{-1/2} \rho R ||w|| + \sqrt{2 m^{-1} c^2 \log \frac{2 (1 + \log_2 ||w||)^2}{\delta}}$

* $L_D(ERM(S)) \leq L_D(h^*) + 4 m^{-1/2} \rho R ||ERM(S)|| + 2 \sqrt{2 m^{-1} c^2 \log \frac{2 (1 + \log_2 ||ERM(S)||)^2}{\delta}}$

*proof*

let $B_i = 2^i$, $H_i = \{w \mid ||w|| \leq B_i\}$, $\delta_i = \frac{\delta}{2 i^2}$  
apply previous corollary to each $H_i$  
fix any $w$ and let $i = \lceil \log_2 ||w|| \rceil$ $\implies w \in H_i$  
$\implies B_i \leq 2 ||w||$, 
$\frac{1}{\delta_i} = \frac{2 i^2}{\delta} \leq \frac{2}{\delta} (1 + \log_2 ||w||)^2$  
then $\sum \delta_i < \delta$, so w.p. $\geq 1 - \delta$, bounds $\forall i$ hold simultaneously  
$\implies$ can write one unified bound $\forall i$: $B \leq 2 ||w||$, $\delta_i^{-1} \leq \cdots$

$L_D(ERM(S)) - L_D(w^*) = L_D(ERM(S)) - L_S(ERM(S)) + L_S(ERM(S)) - L_S(w^*) + L_S(w^*) - L_D(w^*)$  
$= (L_D(ERM(S)) - L_S(ERM(S))) + (L_S(ERM(S)) - L_S(w^*)) + (L_S(w^*) - L_D(w^*))$

* $L_S(ERM(S)) - L_S(w^*) \leq 0$
* $L_D(ERM(S)) - L_S(ERM(S))$: use (1) with $\delta / 2$  
$P(L_S(\cdot) - L_D(\cdot) \geq \alpha) \leq \delta / 2$
* $L_S(w^*) - L_D(w^*)$: apply Hoeffding inequality with $\alpha = \sqrt{2 m^{-1} c^2 \log \frac{2}{\delta}}$

$E_S[L_D(ERM(S)) - L_D(w^*)] \leq 4 m^{-1/2} \rho R ||ERM(S)|| + \sqrt{ 2 m^{-1} c^2 \log \frac{4 (1 + \log_2 ||ERM(S)||)^2}{\delta}} + \sqrt{2 m^{-1} c^2 \log \frac{2}{\delta}}$

* VCD bounds are "conservative"/loose (possibly overestimates required sample size)
* rademacher bounds and data dependence $\to$ tighter bounds
* maybe the optimization algorithm constrains the results and does not overfit
* rademacher bound for NN: $R(F(S)) = O(\frac{R \sqrt{d} \prod_j^d ||W_j||_F}{\sqrt{m}})$
    * $d$ is depth of NN
    * $W_j$ is matrix of weights from $(j-1)^{th}$ layer to $j^{th}$ layer

## Applying learning theory to Bayesian algorithms

**e.g.** GLM

* prior: $w \sim p(w)$
* likelihood: $\prod_i p(t_i | w)$
* ELBO: $\log p(t) = \log \int_w p(w) p(t | w) dw$  
$= \log \int_w q(w) \frac{p(w)}{q(w)} p(t | w) dw$  
$\geq \int q(w) \log (\frac{p(w)}{q(w)} p(t | w)) dw$  
$= E_q[\log p(t|w)] - d_{KL}(q(w) || p(w))$  
$= \sum_i E_q[\log p(t_i | w)] - d_{KL}(q(w) || p(w))$
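
The ELBO inequality can be checked on a toy model where every integral is a finite sum. The sketch assumes a discrete parameter $w$ (a coin bias on a small grid) with a Bernoulli likelihood; this model and the function names are illustrative assumptions, not the GLM from the notes.

```python
import math


def likelihood(w, ts):
    """p(t | w) = prod_i p(t_i | w) for Bernoulli observations t_i in {0, 1}."""
    out = 1.0
    for t in ts:
        out *= w if t == 1 else 1.0 - w
    return out


def log_evidence(prior, liks):
    """log p(t) = log sum_w p(w) p(t | w)."""
    return math.log(sum(p * l for p, l in zip(prior, liks)))


def elbo(q, prior, liks):
    """E_q[log p(t | w)] - KL(q || p) for distributions on the same finite grid."""
    expected_loglik = sum(qi * math.log(li) for qi, li in zip(q, liks) if qi > 0)
    kl = sum(qi * math.log(qi / pi) for qi, pi in zip(q, prior) if qi > 0)
    return expected_loglik - kl


def posterior(prior, liks):
    """Exact posterior p(w | t), the q that makes the bound tight."""
    z = sum(p * l for p, l in zip(prior, liks))
    return [p * l / z for p, l in zip(prior, liks)]
```

Taking `ws = [0.1, 0.3, 0.5, 0.7, 0.9]`, a uniform prior and `ts = [1, 1, 0, 1]`, the ELBO evaluated at `q = prior` falls below `log_evidence`, while at `q = posterior(prior, liks)` the two agree up to rounding.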

some remarks on the ELBO

* recall: $-ELBO = \sum_i E_q[-\log p(t_i | w)] + d_{KL}(q(w) || p(w))$
* objective: minimize $-ELBO$  
"regularized cumulative loss minimization" (RCLM)  
$\arg\min_{q(w)} \sum_i E_q[-\log p(t_i | w)] + \eta^{-1} d_{KL}(q(w) || p(w))$

compare to MAP estimation for GLM:  
$\arg\min_w \sum_i -\log p(t_i | w) + \lambda ||w||^2$

if $q(w) = \mathcal{N}(m, V)$, given new observation $(x^*, t^*)$, how to predict and compute loss?

* $p(t | x^*) = E_q[p(t | x^*, w)]$
* use logloss: $l(q(w) | (x, t)) = -\log E_q[p(t | x, w)]$

mismatch in placement of $\log$ between ELBO and logloss?

$l_G(q(w) | (x, t)) = E_q [-\log p(t | w)]$ (VI)  
$l_B(q(w) | (x, t)) = -\log E_q[p(t | w)]$  
$r_G(q(w)) = E_{x, t}[l_G(q(w) | (x, t))]$  
$r_B(q(w)) = E_{x, t}[l_B(q(w) | (x, t))]$
$l_w(w | (x, t)) = -\log p(t | w)$ (MAP)  
$r_w(w) = E_{x, t}[l_w(w | (x, t))]$

* want: $alg(S) = RCLM_{q(w), l, reg}(S) \to \hat{q}(w)$  
variational inference: $RCLM_{q, l_G, d_{KL}(q||p)}(S)$
* w.p. $\geq 1 - \delta$, $\forall q$, $r_B(\hat{q}) \leq r_B(q) + \epsilon$
* $r_B(\hat{q}_{l_G}(S)) \leq \cdots$