## The online mistake model

Hypothesis set $\mathcal H$ containing functions $f: \mathcal X \to \mathcal Y$ (assume that one of the functions always predicts correctly)

Supervisor has $h \in \mathcal H$

Input sequence $x_1,x_2, \dots \in \mathcal X$

For increasing $t$, the online algorithm:

- receives an input $x_t$,
- makes a prediction $\hat{y}_t$,
- receives the correct label $y_t = h(x_t)$

Goal: minimize the number of wrong predictions.
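The protocol above can be sketched as a small loop. This is only an illustration: the names `count_mistakes`, `predict`, `update` and `true_h` are assumptions of this sketch and do not come from the notes.

```python
def count_mistakes(predict, update, true_h, inputs):
    """Run the online protocol and return the number of wrong predictions."""
    mistakes = 0
    for x in inputs:
        y_hat = predict(x)   # the learner commits to a prediction
        y = true_h(x)        # the supervisor reveals y_t = h(x_t)
        if y_hat != y:
            mistakes += 1
        update(x, y)         # the learner may use the revealed label
    return mistakes


# Hypothetical learner that always predicts 0 and ignores the feedback.
mistakes = count_mistakes(
    predict=lambda x: 0,
    update=lambda x, y: None,
    true_h=lambda x: x[2],          # h reads the third bit
    inputs=[(0, 1, 0), (0, 0, 1)],
)
print(mistakes)  # 1
```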

### Example: $\mathcal X = \{0,1\}^p$

Let $f_i(x^{'}) = x^{'}_i$, for $i=1, \dots, p$ and $\mathcal H = \{f_1, f_2, \dots, f_p \}$
(the inputs are bit strings of length $p$, and the hypothesis class is $\{f_1, f_2, \dots, f_p\}$)

__Algorithm:__ on input $x_t$ predict $f_i(x_t)=x_{t,i}$ for the smallest $i$ such that $y_1=f_i(x_1),\dots,y_{t-1} = f_i(x_{t-1})$

(We look for the smallest $i$ such that $f_i$ is consistent with all previously received labels.)
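A minimal sketch of this algorithm, assuming inputs are given as tuples of bits. The function name `run_smallest_consistent` is hypothetical, and indices are 0-based inside the code (so $f_1$ is index 0); the trace reports 1-based indices.

```python
def run_smallest_consistent(p, true_index, inputs):
    """Simulate the algorithm; returns (mistakes, trace)."""
    consistent = list(range(p))          # indices of functions still consistent
    mistakes = 0
    trace = []
    for x in inputs:
        i = consistent[0]                # smallest consistent index
        y_hat = x[i]                     # prediction f_i(x)
        y = x[true_index]                # label from the true function
        mistakes += (y_hat != y)
        consistent = [j for j in consistent if x[j] == y]
        trace.append((y_hat, y, [j + 1 for j in consistent]))
    return mistakes, trace


# Inputs 010 and 110 with true function f_3.
print(run_smallest_consistent(3, 2, [(0, 1, 0), (1, 1, 0)]))
# (1, [(0, 0, [1, 3]), (1, 0, [3])])
```

Each trace entry is (prediction, correct label, indices still consistent).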

---

Example: $\mathcal X = \{0, 1\}^3$, $p=3$. Assume the true function is $f_3$. Initially, all of $\{f_1, f_2, f_3\}$ are consistent.

Receive: $010$  
Predict: $f_1(010)=0$ __correct__  
Receive: $y_1=0$; the functions still consistent are $\{f_1, f_3\}$  

Receive: $110$  
Predict: $f_1(110)=1$ __wrong__  
Receive: $y_2=0$; the only function still consistent is $f_3$

At this point, only $f_3$ is consistent with the labels received so far. From this point on, every prediction is correct, because $f_3$ is the true function $h$.
  
---
  
Example: $\mathcal X = \{0, 1\}^3$, $p=3$. Assume the true function is $f_3$. Initially, all of $\{f_1, f_2, f_3\}$ are consistent.

Receive: $011$  
Predict: $f_1(011)=0$ __wrong__  
Receive: $y_1=1$; the functions still consistent are $\{f_2, f_3\}$  

Receive: $001$  
Predict: $f_2(001)=0$ __wrong__  
Receive: $y_2=1$; the only function still consistent is $f_3$

At this point, only $f_3$ is consistent with the labels received so far. From this point on, every prediction is correct, because $f_3$ is the true function $h$.

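How many mistakes can the algorithm make on a sequence $x_1, x_2, x_3, \dots$? If the true function is $f_1$, it never makes a mistake. In general, it makes at most

$p-1$

mistakes, because each mistake removes at least one function from the list of consistent functions, and the true function is never removed. This bound is attained: in the example above, the algorithm makes $p - 1 = 2$ mistakes.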

### More generally for all $\mathcal X$ and finite $\mathcal H$

Let $\mathcal H = \{f_1,\dots,f_K\}$

Algorithm: Predict according to $f_i$ for the smallest $i \le K$ such that $f_i$ is consistent with all previously seen data.

$\to$ makes at most $K-1$ mistakes
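The general algorithm can be sketched as follows, together with an input sequence on which it makes exactly $K-1$ mistakes. The helper name `consistent_learner_mistakes` and the choice of coordinate functions as hypotheses are assumptions made for this illustration only.

```python
def consistent_learner_mistakes(hypotheses, true_f, inputs):
    """Predict with the first hypothesis consistent with all data seen so far."""
    remaining = list(hypotheses)
    mistakes = 0
    for x in inputs:
        y_hat = remaining[0](x)          # smallest consistent index
        y = true_f(x)
        if y_hat != y:
            mistakes += 1
        # true_f always survives this filter, so `remaining` is never empty
        remaining = [f for f in remaining if f(x) == y]
    return mistakes


K = 5
# f_i reads bit i; the default argument freezes i for each lambda.
hypotheses = [lambda x, i=i: x[i] for i in range(K)]
# Input t has a 0 only at position t, so it eliminates only f_t.
inputs = [tuple(0 if j == t else 1 for j in range(K)) for t in range(K - 1)]
worst = consistent_learner_mistakes(hypotheses, hypotheses[-1], inputs)
print(worst)  # 4
```

Here the true function is the last hypothesis, and every earlier one is eliminated by its own mistake.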

### Number of mistakes

__Definition__

>Given a hypothesis class $\mathcal H$ and an algorithm $A$, the _maximum number of mistakes_ $M(A, \mathcal H)$ of $A$ on $\mathcal H$ is $\max\{\text{mis}(A,f,x_1x_2\cdots): f \in \mathcal H \text{ and } x_1,x_2,\dots \in \mathcal X\}$, where $\text{mis}(A,f,x_1x_2\cdots)$ is the number of mistakes made by $A$ on $x_1,x_2,\dots$ if $f$ is the true function.

For every finite $\mathcal H$, there exists an algorithm $A$ for which $M(A, \mathcal H) \le |\mathcal H| - 1$

__Theorem__

> For every finite $\mathcal H$, there exists an algorithm $A$ for which $M(A, \mathcal H) \le \log_2|\mathcal H|$

_Proof_. For a list $L \in \mathcal Y^*$ of binary labels, let $maj(L)$ denote a label appearing at least $|L|/2$ times, e.g. $maj([1,1,0,0,1]) = 1$.

__Halving algorithm__

>$\mathcal F_1 \gets \mathcal H$
>
>while True:  
>    $\;\;\;\;$Receive $x_t$  
>    $\;\;\;\;$Predict $\hat{y}_t = maj([f(x_t): f\in \mathcal F_t])$  
>    $\;\;\;\;$Receive $y_t$  
>    $\;\;\;\;$$\mathcal F_{t+1} \gets \{f\in \mathcal F_t : f(x_t)=y_t\}$

Makes at most $\log_2|\mathcal H|$ mistakes, because 
- if $\hat{y}_t \ne y_t$, then $|\mathcal F_{t+1}| \le \frac{1}{2}|\mathcal F_{t}|$
- thus after $m$ mistakes, we have $|\mathcal F_{t}| \le |\mathcal H| \cdot 2^{-m}$
- the true hypothesis $f$ always remains in $\mathcal F_t$ thus 

$1\le |\mathcal F_{t}| \le |\mathcal H| \cdot 2^{-m} \implies m \le \log_2|\mathcal H|$
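A minimal Python sketch of the halving algorithm, assuming labels in $\{0,1\}$ and breaking ties in favour of label 1. The function name and the test class of coordinate functions are illustrative assumptions, not part of the notes.

```python
import itertools
import math


def halving_mistakes(hypotheses, true_f, inputs):
    """Predict by majority vote over the hypotheses still consistent."""
    version_space = list(hypotheses)
    mistakes = 0
    for x in inputs:
        votes = [f(x) for f in version_space]
        # label 1 wins ties, so the predicted label has at least half the votes
        y_hat = 1 if 2 * sum(votes) >= len(votes) else 0
        y = true_f(x)
        if y_hat != y:
            mistakes += 1
        version_space = [f for f, v in zip(version_space, votes) if v == y]
    return mistakes


p = 8
hypotheses = [lambda x, i=i: x[i] for i in range(p)]
inputs = list(itertools.product([0, 1], repeat=p))
worst = max(halving_mistakes(hypotheses, f, inputs) for f in hypotheses)
print(worst <= math.log2(p))  # True
```

On this class with $|\mathcal H| = 8$, no choice of true function leads to more than $\log_2 8 = 3$ mistakes, in line with the theorem.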

__example__

>Often, $|\mathcal H|$ is very large. For example it equals $10^{15}$ for binary decision trees with 8 inputs and 15 nodes. In this case, it is very time-consuming to compute the majority vote exactly, and one might use an approximate algorithm.
>
>Assume that our approximate algorithm calculates the majority vote correctly if at least a fraction $1/\sqrt{2} \approx 0.707$ of the classifiers in $\mathcal F_{t}$ produce the same label for the given input $x_t$. Otherwise, our algorithm can produce any label.
>
>Let $N = |\mathcal H|$. What is the maximal number of mistakes our implementation of the halving algorithm can make? 

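- if $\hat{y}_t \ne y_t$, then $|\mathcal F_{t+1}| \le \frac{1}{\sqrt{2}}|\mathcal F_{t}|$: either the vote was computed correctly and at most half of the classifiers survive, or no label reached the fraction $1/\sqrt{2}$, so fewer than that fraction agree with $y_t$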
- thus after $m$ mistakes, we have $|\mathcal F_{t}| \le |\mathcal H| \cdot \frac{1}{\sqrt{2}^{m}}$
- the true hypothesis $f$ always remains in $\mathcal F_t$ thus 

$1\le |\mathcal F_{t}| \le |\mathcal H| \cdot \frac{1}{\sqrt{2}^{m}} \implies \frac{1}{N} \le \frac{1}{\sqrt{2}^{m}} \implies 2^{\frac{m}{2}} \le N \implies \frac{m}{2} \le \log_2 N \implies m \le 2\log_2 N$
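To get a feeling for the numbers, the two bounds can be evaluated for the value $N = 10^{15}$ mentioned in the example. This is a small numeric check only; the variable names are chosen for this sketch.

```python
import math

N = 10**15

# Exact halving: m <= log2(N); approximate majority vote: m <= 2 * log2(N).
exact_bound = math.floor(math.log2(N))
approx_bound = math.floor(2 * math.log2(N))

print(exact_bound, approx_bound)  # 49 99
```

The approximate vote therefore at most doubles the mistake bound.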

## Too many mistakes

What if $\mathcal H$ is the set of all binary functions on $\mathcal X$?

__Lemma__

> For each algorithm $A$ and sequence $x_1, x_2, \dots$ there exists an $f$ on which $A$ makes $|\{x_1, x_2, \dots \}|$ mistakes.

(For every algorithm there exists an infinite input sequence where every label is predicted wrongly.)

_Proof_. Let $\mathcal Y = \{0, 1\}$ and run the algorithm. For each $x_t$ that was not seen before, reveal as the correct label $y_t$ the opposite of the predicted label $\hat{y}_t$.

Receive: $x_1$   
Predict: 1 __wrong__  
Receive: 0  

Receive: $x_2\ne x_1$   
Predict: 0 __wrong__  
Receive: 1

$\dots$

(it means each time we predict a label, we say the correct label is opposite, so on and so on. We get a sequence with all unseen inputs predicted wrongly)
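The adversary from the proof can be sketched in a few lines. The `RoteLearner` class is a hypothetical learner added for the demonstration: it memorises labels and predicts 0 on unseen inputs.

```python
def adversarial_labels(predict, update, inputs):
    """Build a true function f on the fly so that every new input is a mistake."""
    f = {}                               # the labels fixed so far
    mistakes = 0
    for x in inputs:
        y_hat = predict(x)
        if x not in f:
            f[x] = 1 - y_hat             # commit to the opposite label
        y = f[x]
        mistakes += (y_hat != y)
        update(x, y)
    return f, mistakes


class RoteLearner:
    """Hypothetical learner: remembers seen labels, predicts 0 otherwise."""

    def __init__(self):
        self.memory = {}

    def predict(self, x):
        return self.memory.get(x, 0)

    def update(self, x, y):
        self.memory[x] = y


learner = RoteLearner()
inputs = [1, 2, 1, 3, 2]
f, mistakes = adversarial_labels(learner.predict, learner.update, inputs)
print(mistakes == len(set(inputs)))  # True
```

The rote learner errs exactly once per distinct input, which matches the count $|\{x_1, x_2, \dots\}|$ in the lemma.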

If we look at the worst-case number of mistakes of an algorithm, we see that pure reproductive learning is optimal: remember the labels of the inputs already seen, and predict anything for unseen inputs (for example, a random label or always the same label; it does not matter).

### No meaningful mistake bounds

Thus for this $\mathcal H$, reproductive learning is optimal (just remember things).

Principle:

_To prove meaningful performance bounds, we need assumptions on what we "learn"._

(So we will always prove something under the assumption that the true function belongs to some class.)

Therefore, we consider bounds that only apply to functions in a hypothesis class $\mathcal H$.

We always assume that

- $\mathcal H$ is a set of functions from $\mathcal X$ to $\mathcal Y$

- $\mathcal X$, $\mathcal Y$ and $\mathcal H$ are nonempty (for now we assume $|\mathcal Y| = 2$)

__How many mistakes do we need to learn halfspaces in $\mathcal X =\mathbb R^p$?__

(Let us now consider the class $\mathcal H$ of linear threshold functions, so that one label corresponds to a halfspace in $\mathcal X =\mathbb R^p$.)

Is there an algorithm that makes at most $\text{poly}(p)$ mistakes?

No. Learning thresholds in $\mathcal X=\mathbb R$ is already infeasible in this model, and it only gets worse as the dimension grows.

Let us prove this.

__Lemma__

>For every learning algorithm there exists an $f_r \in \mathcal H_{th}$ and an infinite sequence $x_1, x_2, x_3, \dots \in \mathcal X$ on which it predicts every label wrong.

$\mathcal Y = \{-1, +1\},\;\; \mathcal X =\mathbb R,\;\; \mathcal H_{th} = \{f_r: r \in \mathbb R\}$

$f_r(x)=sign(x-r)=\begin{cases}+1 & \text{if } x \ge r \\ -1 & \text{otherwise} \end{cases}$

Idea. Move the threshold right or left depending on the predictions of the algorithm. Let $s=\lim_t x_t$

(If a point is predicted to be positive, we place the threshold to its right and keep it there; otherwise we place it to its left. Either way, the prediction is wrong.)

Receive: $x_1=0$ (the initial value does not matter)  
Predict: $\hat{y}_1=-1$ (the algorithm predicts $-1$, so we put the threshold to the left of $x_1$)  
Receive: $y_1=+1$ (we tell the algorithm that the correct label is $+1$, so its prediction was wrong)

Receive: $x_2=x_1-3^{-1}$  
Predict: $\hat{y}_2=+1$  
Receive: $y_2=-1$  

(So at this point, we have promised to keep the threshold between $x_2$ and $x_1$.)

Receive: $x_3=x_2+3^{-2}$  
Predict: $\hat{y}_3=-1$  
Receive: $y_3=+1$  

(The threshold now lies between $x_2$ and $x_3$, and the next input is $x_4=x_3-3^{-3}$.)

$ \large{-} \;\;\;\;\;\;\;\; x_2\bullet \;\;\;\;\;\;\;\;\;\;\;\; x_4\bullet \;\;\;\; x_3\bullet \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; x_1\bullet \;\;\;\;\;\;\;\;\;\;\;\;\;\;\;\; \large{+} \mathbb R$

_Proof_. Let $x_1=0$. We run $A$ and, for $t \ge 1$, feed it $x_{t+1}=x_t + \hat{y}_t\cdot 3^{-t}$.

Let $s=\lim_t x_t = \sum^{\infty}_{t=1} \hat{y}_t 3^{-t}$ and take $f_s$ as the true function.  
Since $x_t = \sum_{j < t} \hat{y}_j 3^{-j}$, we have $sign(x_t-s) = sign(\sum_{j \ge t}-\hat{y}_j3^{-j})=sign(-\hat{y}_t3^{-t}) = -\hat{y}_t$

(The first term of the sum determines the sign, because the remaining terms add up to at most $\sum_{j>t} 3^{-j} = 3^{-t}/2$ in absolute value. That first term is $-\hat{y}_t 3^{-t}$.  
Because of the minus sign before $\hat{y}_t$, the true label $f_s(x_t)$ is always the opposite of the predicted label.)

(That's how we prove it formally but this lemma is quite obvious)
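The construction can be simulated for finitely many rounds with exact rational arithmetic. This is a sketch under stated assumptions: the nearest-neighbour learner (`predict`, `update`) is hypothetical, and the finite run uses $x_{T+1}$ as the threshold in place of the limit $s$, which labels the first $T$ points in the same way.

```python
from fractions import Fraction


def threshold_adversary(predict, update, rounds):
    """Play `rounds` rounds of the construction; every prediction is wrong."""
    x = Fraction(0)                      # x_1 = 0
    history = []
    for t in range(1, rounds + 1):
        y_hat = predict(x)               # learner predicts -1 or +1
        y = -y_hat                       # adversary reveals the opposite label
        history.append((x, y))
        update(x, y)
        x += y_hat * Fraction(1, 3**t)   # x_{t+1} = x_t + y_hat_t * 3^{-t}
    return history, x                    # x is now x_{rounds+1}


# Hypothetical learner: copy the label of the nearest point seen so far.
seen = []


def predict(x):
    if not seen:
        return 1
    return min(seen, key=lambda point: abs(point[0] - x))[1]


def update(x, y):
    seen.append((x, y))


history, r = threshold_adversary(predict, update, rounds=10)
# All revealed labels agree with the single threshold function f_r.
consistent = all((1 if x >= r else -1) == y for x, y in history)
print(consistent)  # True
```

The learner is wrong in all ten rounds, yet the revealed labels remain consistent with one threshold function.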

__Conclusions__

- We cannot obtain meaningful mistake bounds that apply to all binary functions on $\mathcal X$.  
Each result will apply to some subset $\mathcal H$ of the functions $\mathcal X \to \mathcal Y$.

(We can only obtain meaningful mistake bounds if we restrict the class of true functions.)

- No meaningful bounds for the class of threshold functions on $\mathbb R$

(We proved that the class of threshold functions on $\mathbb R$ is not learnable in this model. This has two consequences: we cannot learn halfspaces, and we cannot learn decision trees with continuous inputs.)

>In the movie, we explained that if $\mathcal H$ is the set of all functions on $\mathcal X$, then no nontrivial mistake bounds can be proven. Thus if $\mathcal X$ contains $k$ elements, then for every online prediction algorithm, there exists a true function for which the prediction algorithm  makes at least $k$ errors.
>
>We consider now the case where we delete a single function, say the constant zero function for convenience, but similar conclusions hold for any other function.
>
>Let $k \ge 1$, $\mathcal X = \{1,2,\ldots,k\}$ and $\mathcal Y = \{0,1\}$. Let $\mathcal H$ be the set of all functions from $\mathcal X$ to $\mathcal Y$ except the constant zero function. 
>
>How many mistakes will the halving algorithm make? Simplify the formula so that no rounding is needed.
>
>hint:
>
>If $k = 1$, then $\mathcal{X} = \{1\}$ and $\mathcal H$ contains only 1 function. 
>
>This is because there are only 2 functions that map elements from $\mathcal X$ to elements from $\mathcal Y$: the constant 0 and constant 1 functions, but the first one is not in $\mathcal H$ by definition of $\mathcal H$. Hence, the majority vote will always be 1, and the halving algorithm makes 0 mistakes.
>
>How many functions are there that map $\mathcal X$ to $\mathcal Y$? Note that this number is equal to the number of bitstrings of length $k$. Can your answer be equal to $k$? Give an integer lower bound, and note that the difference with the exact value is less than 1.

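Indeed, the number of functions from $\mathcal X$ to $\mathcal Y$ equals $2^k$, thus $|\mathcal H| = 2^k - 1$. Hence, the binary logarithm is less than $k$, and since $2^{k-1} \le 2^k - 1$ it is at least $k-1$. The number of mistakes is an integer, so the halving algorithm makes at most

$\lfloor \log_2(2^k-1) \rfloor = k-1$

mistakes.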
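The value $k-1$ can be checked with exact integer arithmetic, using the fact that $\lfloor \log_2 n \rfloor$ equals the bit length of $n$ minus one. The function name `halving_bound` is an assumption of this sketch.

```python
def halving_bound(n_hypotheses):
    """floor(log2(n)) for a positive integer n, computed without floats."""
    return n_hypotheses.bit_length() - 1


# |H| = 2^k - 1 after removing the constant zero function.
bounds = [halving_bound(2**k - 1) for k in range(1, 11)]
print(bounds)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

For $k = 1$ this gives 0, in agreement with the hint.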