## The learning process

We switch to batch learning: we receive __all the data at the beginning__ and have to produce a classification function.

### Risk for binary classification

- Binary classification: $|\mathcal y|=2$. A classifier is a function $\mathcal X \to \mathcal y$.

- For every $x \in \mathcal X$, there exists a true label, denoted $h(x)$. True label = the label that is used for training and testing.

__Definition__

> Given a true classifier $h$ and a measure $P$ over $\mathcal X$, the __risk__ of a classifier $f$ is $R(f)=P(f(x) \ne h(x))$

Explicit notation $R_{P, h}(f)=R(f)$

- $0 \le R(f) \le 1$
- Even if $R(f) = 0$, the set $E=\{x: h(x) \ne f(x)\}$ can be large, as long as $P(E)=0$ (even if the risk of a function is zero, some inputs may still be classified wrongly).
- We aim at finding a classifier $f$ with small $R(f)$

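Let $\mathcal y = \{A,B\}$ and consider the constant functions $f_1(x)=A$ and $f_2(x)=B$. Determine $R(f_1)+R(f_2)$.

$R(f_1)+R(f_2) = 1$

Because $R(f_1) = Pr[h(x)=B]$, $R(f_2) = Pr[h(x)=A]$, and $Pr[h(x)=B] + Pr[h(x)=A] = 1$

$\to$ it's easy to find a classifier with risk $\le 1/2$: at least one of $f_1$ and $f_2$ qualifies.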

__example__

- $\mathcal X = \{1,2,3, \dots\} \;\;\;\; \mathcal y =\{0,1\}$
- $h(x) = x \mod 2$
- $P(x) = 2^{-x}$

$f(x)=(x \mod 2) \cdot [x \le 10] = \begin{cases} x \mod 2 & \text{if } x\le 10 \\ 0 & \text{otherwise}\end{cases}$

What is $R(f)$?

$f$ differs from $h$ exactly on the odd $x > 10$, so

$R(f) = 2^{-11} + 2^{-13} + 2^{-15} + \cdots = 2^{-11} \cdot \frac{1}{1-1/4} = 1/3 \cdot 2^{-9}$
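The value above can be checked numerically. The following is a minimal sketch that sums $P(x)$ over the inputs where $f$ and $h$ disagree; the truncation point `N` is an assumption of the sketch (the ignored tail has negligible mass), not part of the example.

```python
from fractions import Fraction

def h(x):
    """True classifier of the example: parity of x."""
    return x % 2

def f(x):
    """Classifier of the example: parity for x <= 10, and 0 otherwise."""
    return x % 2 if x <= 10 else 0

def P(x):
    """Measure of the example: P(x) = 2^{-x}."""
    return Fraction(1, 2 ** x)

N = 200  # truncation point (assumption); the ignored tail has mass below 2^-199
risk = sum(P(x) for x in range(1, N + 1) if f(x) != h(x))
exact = Fraction(1, 3) / 2 ** 9   # closed form 1/3 * 2^{-9}
print(float(risk), float(exact))
```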

## Learning process

The input of a learning algorithm is a labelled dataset

$Z = [(x_1, y_1), \dots, (x_n, y_n)] \in (\mathcal X \times \mathcal y)^*$

Its output is a classifier $f_Z: \mathcal X \to \mathcal y$

measure $P$ over $\mathcal X$   
true classifier $h: \mathcal X \to \mathcal y$   
integer $n$

- Sample $n$ inputs i.i.d. from $P$: $S=[X_1, \dots, X_n]$
- The learner gets

$S^h = [(X_1, h(X_1)), \dots,(X_n, h(X_n))]$ ($h$ is the true function)

- Run the learning algorithm and obtain the classifier $f_{S^h}$
- Sample a test point $\tilde{X}$ from $P$. Check whether $f_{S^h}(\tilde{X}) \ne h(\tilde{X})$

Given $P$ and $h$, the performance of the learning algorithm is 

<center>$P^{n+1}[f_{S^h}(\tilde{x})\ne h(\tilde{x})] = \mathbb{E}_S R(f_{S^h})$</center>

(This holds because we have $n+1$ i.i.d. variables drawn from $P$: $n$ of them form $S$ and one is the test point $\tilde{X}$. Equivalently, it is the expected risk of the learned classifier, where the expectation is over the choice of $S$.)

(The smaller this probability, the better. If it is close to one, we are also happy, because we can then invert the output of every classifier generated by the learning algorithm.)

- Smaller = better. If close to 1 $\implies$ invert $f_{S^h}$
- For a random selection of $f_+$ and $f_-$: this is equal to 1/2

---
__example__

>The learning algorithm may be a probabilistic algorithm. In this case, this internal randomness is also included in the expectation. 
>
>Classifiers are defined as functions, and hence are always deterministic. Note that after a learning algorithm has produced a classifier $f$ and after generating $\tilde{x}$, evaluating $f(\tilde{x})$ does not require randomness. 
>
>Let $\mathcal Y = \{\texttt{A}, \texttt{B}\}, f_1(x) = \texttt{A}$ and $f_2(x) = \texttt{B}$. What is the average risk of the learning algorithm that outputs $f_1$ or $f_2$, each with probability 1/2? 

1/2. We already proved that $\mathrm{R}(f_1) + \mathrm{R}(f_2) = 1$ and this is twice the expected value of the risk.
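As a small sanity check of this answer, the sketch below computes the two risks exactly for one made-up instance. The input space, the measure `P`, and the labels `h` are hypothetical choices for illustration only.

```python
from fractions import Fraction

# Hypothetical instance: X = {0, 1, 2, 3} with an arbitrary measure P and true labels h.
P = {0: Fraction(1, 2), 1: Fraction(1, 4), 2: Fraction(1, 8), 3: Fraction(1, 8)}
h = {0: "A", 1: "B", 2: "B", 3: "A"}

def risk(f):
    """R(f) = P(f(x) != h(x))."""
    return sum(p for x, p in P.items() if f(x) != h[x])

def f1(x):
    return "A"

def f2(x):
    return "B"

# A learner that ignores the data and outputs f1 or f2, each with probability 1/2.
average_risk = Fraction(1, 2) * risk(f1) + Fraction(1, 2) * risk(f2)
print(risk(f1), risk(f2), average_risk)
```

Whatever `P` and `h` are chosen, the two risks sum to 1, so the average is 1/2.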

---



__Performance of 1-nearest neighbor__

$\mathcal X = \{-1, 0, +1\} \;\;\;\; \mathcal y = \{0,1\}$

$P$ is uniform over $\mathcal X \;\;\;\; h(x)=x^2 \;\;\;\; n=2$

$1 \;\;\; 0 \;\;\; 1$  
$\bullet \;\;\; \bullet \;\;\; \bullet$

Cases

- $S$ contains 2 different points ($P = 1/3 \times 1/3 \times 6 = 6/9$, since there are 6 ordered pairs of distinct points).  
    $\implies$ 1 point is not in $S$, and it is always classified wrongly  
    $\implies R(f_{S^h}) = 1/3$  
    
- $S$ contains 2 coinciding points.  
    - Point on the side $\implies R(f_{S^h}) = 1/3$ ($P = 1/3 \times 1/3 \times 2 = 2/9$)  
    - Point in the middle $\implies R(f_{S^h}) = 2/3$ ($P = 1/3 \times 1/3 \times 1 = 1/9$)  
    
Average risk $= 1/9 \times 2/3 + 8/9 \times 1/3 = 10/27$ (performance measure)
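The case analysis can be reproduced by brute force. The sketch below enumerates all 9 ordered training samples and averages the exact risk of the resulting 1-nearest-neighbor classifier; the helper names and the tie-breaking rule are assumptions of the sketch (in this example tied neighbours always carry the same label, so the rule does not matter).

```python
from fractions import Fraction
from itertools import product

X = [-1, 0, 1]
n = 2

def h(x):
    """True classifier of the example."""
    return x * x

def one_nn(sample):
    """1-nearest-neighbour classifier built from a list of (x, y) pairs."""
    def f(x):
        # Ties are broken by sample order.
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return f

def risk(f):
    """Risk under the uniform measure on X."""
    return Fraction(sum(f(x) != h(x) for x in X), len(X))

# All 9 ordered samples are equally likely, so the average risk is a plain mean.
risks = [risk(one_nn([(x, h(x)) for x in S])) for S in product(X, repeat=n)]
average_risk = sum(risks) / len(risks)
print(average_risk)  # 10/27
```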

## Non-trivial bounds require assumptions

Fix $\mathcal X$ (assume $|\mathcal y| = 2$).

__Theorem__
> For every learning algorithm, distribution $P$ over $\mathcal X$, integer $n$, there exists a function $h: \mathcal X \to \mathcal y$ such that
>
>$P^{n+1}(f_{S^h}(X) \ne h(X)|X \notin S) \ge \frac{1}{2}$

(The probability of predicting the wrong label on an unseen test point is at least 1/2. For the worst-case average risk, this implies that simple reproductive learning is optimal.)

_Proof_. Probabilistic method: select $h$ randomly

(We select the function $h$ at random and prove the claim for this random $h$: we show that the expected value of the quantity in the theorem is exactly 1/2. If the expected value of a random variable is 1/2, then there must exist an outcome for which it is at least 1/2.)

$\mathbb{E}_h P^{n+1}(f_{S^h}(X) \ne h(X) | X \notin S) = \frac{1}{2}$

So all we need to do is prove the equation above.

Fix $S$, a point $x\notin S$, and the values of $h$ on $\mathcal X\backslash\{x\}$.

(We select $h$ randomly: for each input we flip a fair coin to choose its label. Imagine that we have already selected all values of $h$ except the one at the point $x$.)

$Pr_{h(x)}[f_{S^h}(x) \ne h(x)] = \frac{1}{2}$

(In the equation above, the only thing that is not fixed is $h(x)$. Since $x \notin S$, the outcome of the learning algorithm is fixed, and so is its prediction on $x$. With two labels and $h(x)$ selected uniformly at random, the probability is 1/2.)

If the statement above holds for every fixed $S$, it also holds for a random $S$; the only requirement is $X \notin S$.

$Pr_{h, S, X \notin S}[f_{S^h}(X) \ne h(X)] = \frac{1}{2}$

$\implies Pr_{h, S, X}[f_{S^h}(X) \ne h(X) | X \notin S] = \frac{1}{2}$

Written as an expected value over $h$, this is $\mathbb{E}_h P^{n+1}(f_{S^h}(X) \ne h(X) | X \notin S) = \frac{1}{2}$
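The averaging argument can be verified exhaustively on a tiny instance. The sketch below is only an illustration under stated assumptions: the input space `X`, the training size `n`, the uniform measure, and the choice of a 1-nearest-neighbour learner are all hypothetical; any deterministic learner could be substituted.

```python
from fractions import Fraction
from itertools import product

X = [0, 1, 2, 3]   # hypothetical small input space, uniform P
n = 2              # training set size

def learn(sample):
    """A deterministic learner (here 1-nearest neighbour; ties go to the earliest pair)."""
    def f(x):
        return min(sample, key=lambda pair: abs(pair[0] - x))[1]
    return f

def error_given_unseen(h):
    """P^{n+1}(f_{S^h}(X) != h(X) | X not in S) for a fixed labelling h."""
    wrong = unseen = 0
    for S in product(X, repeat=n):
        f = learn([(x, h[x]) for x in S])
        for x in X:
            if x not in S:          # condition on the test point being unseen
                unseen += 1
                wrong += f(x) != h[x]
    return Fraction(wrong, unseen)

# Enumerate all 2^|X| labellings h; choosing h uniformly corresponds to fair coin flips.
labellings = [dict(zip(X, labels)) for labels in product([0, 1], repeat=len(X))]
errors = [error_given_unseen(h) for h in labellings]
average = sum(errors) / len(errors)
worst = max(errors)
print(average, worst)
```

The average over all labellings is exactly 1/2, so at least one labelling has conditional error at least 1/2, which is the statement of the theorem for this instance.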

---
If the learning algorithm is probabilistic, we could fix its source of randomness to a certain value, and apply the reasoning for a deterministic algorithm. However, at 2m15 and 2m26, we also take the probability over this source of randomness. How does this possibly affect the value of the probability?

__The probability always remains equal to 1/2.__ For randomized learners, the proof is essentially the same.

__Corollary__

> Let $P$ be uniform. There exists a function $h: \mathcal X \to \mathcal y$ such that $ \mathbb E R(f_{S^h}) \ge \frac{1}{2}(1-\frac{n}{|\mathcal X|})$

($n$ is the size of the training set)

_Proof_

Since $P$ is uniform, $P(S) \le \frac{n}{| \mathcal X|}$. (It can be smaller if $S$ has repeated elements.)

Let $A = [f_{S^h}(x) \ne h(x)], B = [x \notin S]$ 

($A$ is the event that our prediction is wrong, and $B$ is the event that the test point lies outside the training set.)

By the definition of conditional probability and the theorem above, $Pr[A]\ge Pr[A \& B] = Pr[A|B]\cdot Pr[B] \ge \frac{1}{2} \cdot (1-\frac{n}{|\mathcal X|})$

The corollary is related to the curse of dimensionality.

If $\mathcal X$ is infinite, e.g. $\mathcal X = \mathbb R$: apply the corollary to an arbitrarily large finite subset $\mathcal X^{'} \subseteq \mathcal X$. (We conclude that the lower bound on the average risk becomes arbitrarily close to 1/2, since $\frac{n}{|\mathcal X^{'}|} \to 0$.)

If $\mathcal X = \{0,1\}^p$, then to do significantly better than 1/2, we need $\Omega(2^p)$ samples. ($|\mathcal X| = 2^p$)
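To get a feeling for the numbers, the sketch below evaluates the lower bound of the corollary for $\mathcal X = \{0,1\}^p$. The function names and the chosen values of $p$, $n$ and the target risk are illustrative assumptions; the clipping at 0 is added only because the bound is vacuous once $n \ge |\mathcal X|$.

```python
import math
from fractions import Fraction

def risk_lower_bound(n, size):
    """Corollary: (1/2) * (1 - n / |X|), clipped at 0 when n exceeds |X|."""
    return max(Fraction(0), Fraction(1, 2) * (1 - Fraction(n, size)))

def samples_needed(eps, size):
    """Smallest n for which the lower bound drops to eps or below (0 <= eps <= 1/2)."""
    # (1/2)(1 - n/size) <= eps  <=>  n >= (1 - 2*eps) * size
    target = (1 - 2 * Fraction(eps)) * size
    return max(0, math.ceil(target))

for p in (10, 20, 30):
    size = 2 ** p
    print(p, float(risk_lower_bound(1000, size)), samples_needed(Fraction(1, 4), size))
```

For a target risk of 1/4 the required sample size is half of $|\mathcal X| = 2^p$, so it doubles with every extra dimension.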

__Conclusion:__

If any function can be the true function, $\to$ reproductive learning is optimal. But this is infeasible for large dimensions.  
(If we want an interesting learning algorithm and interesting performance bounds, we need some assumptions on the class of true functions $h$. Without assumptions, we might need training sets whose size is exponential in $p$ for $p$-dimensional inputs.)

### Analogue results for multiclass classification

>In the movie we considered binary classification, i.e., $|\mathcal Y| = 2$. We showed that for each input space $\mathcal X$, for each learning algorithm and each integer $n$, there exists a true classifier $h : \mathcal X \rightarrow \mathcal Y$ for which 
>
>$P^{n+1}(f_{S^h}(X) \ne h(X)|X \notin S) \ge 1/2$
>
>Now consider multiclass classification, i.e., we assume $|\mathcal Y| \ge 2$. Let $k = |\mathcal Y|$. If we apply the same reasoning as in the proof, we obtain the same inequality above with a possibly different value in the right-hand side. This value is larger if $k \ge 3$.
>
>What is this value?

$1-1/k$ 

(Probability of $h(X) = f_{S^h}(X)$ is $1/k$)

>In the movie also a corollary is proven that concerns the uniform distribution. What is the analogous lower bound for multiclass classification with $|\mathcal Y| = k \ge 2$? Let $s = |\mathcal X|$ and as usual, let $n$ be the size of the training set.

$(1-1/k)(1-n/s)$

Same as the proof of the corollary.

## Vanishing average risk for large training sets

We define the classes $\mathcal H$ that are learnable with vanishing worst-case risk.

- Let $\mathcal H$ be a set of classifiers $f: \mathcal X \to \mathcal y$. Recall $|\mathcal H| \ge 1$ and $|\mathcal y| = 2$
- We will prove risk bounds that hold for
    - all true classifiers $h$ in $\mathcal H$
    - all distributions $P$ over $\mathcal X$ (we do not restrict $P$)
    
__Definition__

>A class $\mathcal H$ is learnable with vanishing worst-case risk if there exists a learning algorithm for which
>
>$\lim_{n \to +\infty} \text{sup}_{h \in \mathcal H, P} P^{n+1}(f_{S^h}(X) \ne h(X)) = 0$

(That is, there exists a learning algorithm whose error probability, in the supremum over all distributions $P$ and all true classifiers $h \in \mathcal H$, becomes arbitrarily close to zero as $n$ becomes large.)

- __$|\mathcal H|=1 \implies \mathcal H$ is learnable__

(The trivial class that contains only one classifier is, of course, learnable: simply consider the learning algorithm that always outputs the single element of $\mathcal H$.)

- __$|\mathcal X| = \infty$ and $\mathcal H$ contains all functions on $\mathcal X \implies \mathcal H$ is not learnable.__

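Let $\mathcal X^{'} \subseteq \mathcal X$ be of size $|\mathcal X^{'}| = 2n$ and let $P$ be uniform over $\mathcal X^{'}$.

By the corollary $ \mathbb E R(f_{S^h}) \ge \frac{1}{2}(1-\frac{n}{|\mathcal X|})$, there exists an $h \in \mathcal H$ such that

$P^{n+1}[f_{S^h}(X) \ne h(X)] \ge \frac{1}{2} (1-\frac{n}{|\mathcal X^{'}|}) = \frac{1}{4}$

Since this holds for every $n$, the supremum in the definition does not tend to 0.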

- __$|\mathcal H|=2 \implies \mathcal H$ is learnable__

_Proof_

Let $\mathcal H = \{g, g^{'}\}$. Learning algorithm:

$\;\;\;\;$Input: $(x_1, y_1),\dots,(x_n,y_n)$

$\;\;\;\;$If $g(x_i) = y_i \; \forall i \le n$, output $g$, else output $g^{'}$

(There are only two functions in $\mathcal H$ and one of them is the true function, so the output always predicts the true labels on the training set.)

Fix $n, h \in \mathcal H$ and $P$. The event $f_{S^h}(\tilde{x}) \ne h(\tilde{x})$ requires $g(x_i)=g^{'}(x_i) \; \forall i \le n$ and $g(\tilde{x})\ne g^{'}(\tilde{x})$ 

(What is the probability that the test sample is bad? For this, the two functions must predict equal labels on all training samples but different labels on the test sample.
As in the "king taster and candy story", the probability that the test is bad while all training results are good is at most $1/(n+1)$, where $n$ is the training size.)

$P^{n+1}[g(\tilde{X})\ne g^{'}(\tilde X) \& g(X_1) = g^{'}(X_1) \& \cdots \& g(X_n) = g^{'}(X_n)] \le \frac{1}{n+1}$

$\lim_{{n \to +\infty}} \frac{1}{n+1} = 0$

$P^{n+1}(f_{S^h}(\tilde{X}) \ne h(\tilde{X})) \le \frac{1}{n+1}$
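The bound $\frac{1}{n+1}$ can also be checked by a direct calculation, which is a different route from the symmetry argument used above. Assume $q = P(g(X) \ne g^{'}(X))$ (a symbol introduced here for illustration). Because the $n+1$ points are i.i.d., the bad event has probability $q(1-q)^n$. The sketch below maximises this over a grid of values of $q$; the grid and the value of `n` are arbitrary choices.

```python
def error_probability(q, n):
    """Probability that g and g' agree on n i.i.d. training points but differ
    on the test point, where q = P(g(X) != g'(X))."""
    return q * (1 - q) ** n

n = 20
grid = [i / 1000 for i in range(1001)]   # candidate values of q (hypothetical grid)
worst = max(error_probability(q, n) for q in grid)
print(worst, 1 / (n + 1))
```

The maximum is attained near $q = \frac{1}{n+1}$ and stays below $\frac{1}{n+1}$, in line with the bound.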

- __Every finite $\mathcal H$ is learnable__

_Proof_

$\mathcal H = \{f_1, \dots, f_l\}$

Learning algorithm: outputs $f_j$ for the smallest $j$ s.t. $f_j(x_1) = y_1,\dots,f_j(x_n)=y_n$

(that is, the first $f_j$ that is consistent with the training data)

Fix $n, h \in \mathcal H$ and $P$. Fix $j$.

$P^{n+1}[f_j(\tilde X)\ne h(\tilde X)\& f_j(X_1) = h(X_1) \& \dots \& f_j(X_n) = h(X_n)] \le \frac{1}{n+1}$

(as in the case $|\mathcal H| = 2$: here $f_j$ is consistent with all training data but predicts wrongly on the test point)

(But now there are many more functions that our learning algorithm can output by mistake. So, what is the probability that some function in $\mathcal H$ is consistent with all training data but predicts badly on the test point?)

By the union bound, the probability that this happens for some $j \le l$ is at most $l$ times this bound. Thus

$P^{n+1}(f_{S^h}(\tilde{X}) \ne h(\tilde{X})) \le \frac{|\mathcal H|}{n+1}$


This bound holds for every learning algorithm that always outputs a classifier in $\mathcal H$ that is consistent with the training set $S$.

Indeed, the proof bounds the probability that there exists a classifier in $\mathcal H$ that makes a different prediction from $h$ on a test point $\tilde{X}$, but coincides with $h$ on the training set $S = [X_1, \dots, X_n]$.
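The learner for finite classes is short enough to write out. The following is a minimal sketch: `consistent_learner` returns the first consistent hypothesis, and `worst_case_error` computes the exact error probability by enumeration for a small hypothetical class (thresholds on $\{0,1,2,3\}$ with uniform $P$). The class, the input space, and the function names are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

def consistent_learner(hypotheses, sample):
    """Return the first hypothesis that agrees with every labelled pair in the sample."""
    for f in hypotheses:
        if all(f(x) == y for x, y in sample):
            return f
    raise ValueError("no hypothesis is consistent with the sample")

# Hypothetical finite class on X = {0, 1, 2, 3}: thresholds f_t(x) = [x >= t].
X = range(4)
H = [lambda x, t=t: int(x >= t) for t in range(5)]

def worst_case_error(n):
    """max over h in H of P^{n+1}(f_{S^h}(X~) != h(X~)) for uniform P on X."""
    worst = Fraction(0)
    for h in H:
        wrong = total = 0
        for S in product(X, repeat=n):
            f = consistent_learner(H, [(x, h(x)) for x in S])
            for x in X:
                total += 1
                wrong += f(x) != h(x)
        worst = max(worst, Fraction(wrong, total))
    return worst

for n in (1, 3, 6):
    print(n, worst_case_error(n), Fraction(len(H), n + 1))
```

On this instance the exact error is far below $\frac{|\mathcal H|}{n+1}$, which is expected since the union bound is generally loose.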

### Threshold functions in $\mathcal X = \mathbb R$ are learnable

Let $f_t(x)=1_{x \ge t}$ and $\mathcal H_{th} = \{f_t : t\in \mathbb R\} \cup \{x \mapsto 0 , x \mapsto 1 \}$

__Lemma__

> $\mathcal H_{th}$ is learnable.

_Proof_

LA: If $Z$ contains no labels 1, output $f_Z(x) = 0$. Otherwise, output $f_t$ where $t = \min\{x_i : y_i=1\}$

If we remove all points from $S$ except a point $x_i=t$, then $f_{S^h}$ remains the same.

Suppose we select a random set $S^{'}$ of size $n+1$ and remove from it a random point $\tilde{x}$. We use the remaining set $S = S^{'} \backslash \{\tilde x\}$ for training and $\tilde x$ for testing. Then this test point is misclassified __if and only if it is the minimal point in $S^{'}$ that has label 1__. Since each of the $n+1$ points is equally likely to be the removed one, this happens with probability at most $\frac{1}{n+1}$:

$P^{n+1}(f_{S^h}(\tilde{X}) \ne h(\tilde{X})) \le \frac{1}{n+1}$
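The learning algorithm for thresholds can be sketched in a few lines. The function name `learn_threshold` and the sample data (labelled by a true threshold of 0.5) are hypothetical.

```python
def learn_threshold(sample):
    """Threshold learner from a list of (x, y) pairs with y in {0, 1}."""
    positives = [x for x, y in sample if y == 1]
    if not positives:
        return lambda x: 0          # no label 1 seen: output the constant-0 classifier
    t = min(positives)              # smallest input labelled 1
    return lambda x: int(x >= t)

# Hypothetical data labelled by the true threshold 0.5.
sample = [(0.1, 0), (0.9, 1), (0.4, 0), (0.7, 1)]
f = learn_threshold(sample)
print([f(x) for x in (0.3, 0.6, 0.7, 1.2)])  # [0, 0, 1, 1]
```

The learned threshold is 0.7, so the only errors are on inputs between the true threshold and the smallest positive training point (here 0.6 is misclassified as 0).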

### Halfspace in $\mathbb R^p$

LA: Fit the halfspace with the largest margin.

The fitted halfspace depends on at most $p+1$ points in $S$.

By a similar argument, we conclude

$P^{n+1}(f_{S^h}(\tilde{X}) \ne h(\tilde{X})) \le \frac{p+1}{n+1}$


## Sample complexity

### Probably approximately correct

Learning thresholds.
- There is no hope of finding the exact threshold, unless an input in the training set exactly equals the threshold. For smooth $P$, this is a probability-zero event. $\implies$ we want to find $f$ that has risk at most $\varepsilon > 0$

- If $\varepsilon < 1/2$, there is no hope of achieving such an error with probability 1.