## Subfields of statistical learning theory

What is statistical learning theory?

Goal: prove performance guarantees of learning algorithms

- Statistics: specific assumptions on distribution on $X$ and $y$.

- SLT: no or very general assumptions on distribution.

- Supervised: $(x_1, y_1), (x_2, y_2), \dots \in X \times y$

    - Batch: all data is given at once, produce $\hat{f}$
    - Online: alternate between receiving new data and prediction
    - Active: learner can choose $x_i$ and receive $y_i$
    
- Unsupervised: clustering, anomaly detection

- Semisupervised: labelled + unlabelled examples

- Reinforcement learning: control an agent in an environment that gives rewards. Optimize total reward in a given time.

### Overfitting

- A decision tree of very large size just memorizes all the inputs and outputs. It has little predictive value.

In the figure, the vertical axis represents the number of incorrectly predicted labels of the neural net. The weights of the net are represented in the horizontal axes (the network has 40 edges, thus 40 weights but only 2 of them are shown). 

During training, we aim to select the weights so that the network makes a minimal number of mistakes. Roughly stated, the randomized gradient descent method selects a local minimum in some random way.

Neural networks can be very large and at the same time still not overfit. Why? There is a set of solutions that we could call "simple", and each such solution corresponds to a large number of equivalent local minima. On the other hand, "complex" solutions correspond to fewer equivalent local minima. Therefore a sufficiently random training procedure will, with high probability, select a local minimum that corresponds to a simple solution.



## 3 levels of learning

__1. Reproductive learning__

Information is stored almost without processing. 

Problem: cannot make predictions for examples outside of the training set. It just memorizes the symbols.

__2. Rule-based learning__

Learn by following a fixed procedure

E.g. fit a threshold function with minimal error on training set.

E.g. Diagnose diseases: common cold, flu, bronchitis. The decision procedures of experts and books are turned into a rule-based recommender system. Popular research in the 80s; did not become successful.

Can make predictions for all examples, somewhat like a decision tree.

__3. Creative learning__

It's hard to tell how humans learn something. It seems that you're optimizing something: simplicity versus how many errors you make on the data, and so on. Understanding what we're optimizing is what we want to do in this course.

## Definitions: hyperplane, halfspace, halfplane

A hyperplane in $\mathbb R^p$ is an affine subspace of dimension $p-1$. For example in the 3-dimensional Euclidean space $\mathbb R^3$, hyperplanes are planes. In $\mathbb R^2$, hyperplanes are straight lines. (Lines and planes are always straight.)

A halfspace in $\mathbb R^p$ is the set of points on 1 side of a hyperplane. There exist 2 types of halfspaces: a closed halfspace includes all points of the hyperplane and an open halfspace includes no points of the hyperplane. The full space $\mathbb R^p$ is not a halfspace.

A halfplane is a halfspace in $\mathbb R^2$. Thus the border of a halfplane is a line.

A halfspace in $\mathbb R$ is an interval of the form $(-\infty, r]$, $(-\infty, r)$, $[r, +\infty)$, or $(r, +\infty)$ with $r \in \mathbb R$. (Elements $r \in \mathbb R$ are always finite.)

## 2 classification algorithms

Fitting halfspaces

$X = \mathbb{R}^2$, $y=\{1,0\}$

- Hypothesis: x has label 1 $\iff$ it lies in a halfplane (lies on one side of a line)

(We will assume that this hypothesis is true, so that classification of apples and pears, can indeed happen by such a halfplane.)

(We might find a line that defines a halfplane $\hat{H}$, and this line may differ from the true line.)

- For each measure $P$ over $X$ and each true halfplane $H$, given a few random examples sampled from $P$, we can produce $\hat{H}$ s.t. with high probability $P(H \bigtriangleup \hat{H})$ is small.

- Similar: for $X = \mathbb{R}^p$ with $O(p)$ random examples. (We can do the above in every dimension.)

__example__

In this lesson we always consider binary classification, i.e., $|\mathcal Y| = 2$.

We consider classification functions on $\mathcal X = [0,1]^2 \subseteq \mathbb{R}^2$, i.e., the inputs are pairs of numbers in the real interval $[0,1]$. We use the uniform measure over $\mathcal X$. The labels are $\mathcal Y = \{\square, \bullet\}$.

For parameters $a,b \in \mathbb{R}$ and $(u,v) \in \mathcal{X}$, let

$f_{a,b}(u,v)= \begin{cases}\square \;\;\;\text{if}\; v \ge (1-u)a+ub \\ \bullet \;\;\;\text{otherwise} \end{cases}$

Suppose $f_{a,b}$ is the true function and we have fit a function $f_{c,d}$ with $0 \le a \le c \le 1$ and $0 \le b \le d \le 1$. What is the probability that $f_{c,d}$ predicts a wrong label?

![image-2.png](attachment:image-2.png)

__$(c-a+d-b)/2$__  
__This is the area of the trapezium between the two lines. If a point falls in this area, the classifier makes a wrong prediction.__
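The area formula above can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original exercise: the helper names `f` and `disagreement_probability`, the string labels, and the sample values of $a, b, c, d$ are all assumptions chosen for illustration.

```python
import random

def f(a, b, u, v):
    """Classifier f_{a,b}: 'square' on or above the line v = (1-u)a + ub, else 'bullet'."""
    return "square" if v >= (1 - u) * a + u * b else "bullet"

def disagreement_probability(a, b, c, d, n_samples=200_000, seed=0):
    """Estimate P(f_{a,b} != f_{c,d}) under the uniform measure on [0,1]^2."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n_samples):
        u, v = rng.random(), rng.random()
        if f(a, b, u, v) != f(c, d, u, v):
            wrong += 1
    return wrong / n_samples

a, b, c, d = 0.2, 0.3, 0.5, 0.4   # hypothetical values with a <= c and b <= d
estimate = disagreement_probability(a, b, c, d)
exact = (c - a + d - b) / 2       # area of the trapezium between the two lines
print(estimate, exact)
```

The estimate should agree with the exact area up to sampling noise.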

---

__The number of samples that we need is proportional to the number of features $p$ (= input dimension).__

- Hypothesis is not true $\implies$ fit best halfspace. 

(The problem is that our hypothesis might not be true. It might happen that the true separation between apples and pears, is not by a line, but by a more complicated surface. We can still apply the method, we will make some modifications, and we will fit the best halfspace.)

- Hypothesis is far from true $\implies$ large error. 


1. Nearest neighbors classification

- Given $S=[(x_1, y_1), \dots ,(x_n, y_n)]$, classify an input $x^{'}$ as $y_i$ for the closest input $x_i$ in $S$. In case of a tie, use the first point in $S$.

- We use the Euclidean distance, where $p$ is the input dimension:

$d_E(x^{'}, x_i) = \sqrt{(x^{'}_1 - x_{i, 1})^2 + \dots + (x^{'}_p - x_{i,p})^2}$

(As distance, we will for simplicity use Euclidean distance, but you can use any distance you like. In many applications, people design a very specific distance that works very well, especially if the data is not structured by a tuple of real numbers.)

Equivalently, use the square of the distance.

__Voronoi diagrams__

![image.png](attachment:image.png)

- Approximates any smooth classification rule in any dimension, given enough data.

- Never misclassifies training points

(But it actually performs poorly in complex applications: as the dimension increases, it may be no better than random guessing.)

### Conclusion

- If the true classifier is a halfspace in $p$-dimensional space $\implies$ we can find a good approximation with $O(p)$ random examples.

- If the true dimension is small and the true classifier is smooth $\implies$ 1-NN is a good approximation with few random examples.

- If the dimension $p$ is large, 1-NN may perform badly.

Remark: We used the true halfspace $x_1 \ge 0$. For any other halfspace through the origin we obtain the same conclusion because 1-NN and SVM are rotation invariant.

__example__

Let $\mathcal X = \mathbb{R}^p$. Let $f_S$ denote the 1-nearest neighbor classifier applied to the list $S = [(x_1, y_1), ..., (x_n, y_n)] \in (\mathcal X \times \mathcal Y)^n$ using the Euclidean distance. Select all statements that are always correct.

- If $\mathcal X = \mathbb{R}$ then $f_S$ is a halfspace classifier in $\mathcal X$.

> __False.__ For example, with $S = [(-1, 1), (0, 0), (1, 1)]$ the points labeled 1 lie on both sides of the point labeled 0, so the classes cannot be separated by a single halfspace.

- Let $\mathcal X = \mathbb{R}^2$. Assume 

$S=[((1,1),y_1),((1,-1),y_2),((-1,1),y_3),((-1,-1),y_4)],$

then regardless of the labels $y_1, y_2, y_3, y_4$, in each quadrant, all points in the interior of the quadrant have the same label. 

(E.g., the interior of the first quadrant are the points with $\tilde{x} = (\tilde{x}_1, \tilde{x}_2) \in {\mathbb R}^2$ with $\tilde{x}_1>0$ and $\tilde{x}_2>0$.)

- Suppose $S$ consists of 3 labeled points, then for each label $y \in \mathcal{Y}$, the set of inputs $x$ with $f_S(x) = y$ is either equal to $\mathcal X$, a halfspace, a union of 2 halfspaces or an intersection of 2 halfspaces. 

- If $S = [(x_1, y_1), (x_2, y_2)]$ with $x_1 \not= x_2$ and $y_1 \not= y_2$, then $f_S$ is a halfspace classifier.

(Guess: two labeled points can be separated by one halfplane/halfspace; three labeled points may need two halfspaces?)

__example__

Let $\mathcal{X} = [-1,1]^2$ and $\mathcal{Y} = \{\square, \bullet\}$. We consider the uniform probability on $\mathcal X$. In the picture, the area where the true classifier evaluates to $\bullet$ is indicated in gray. Also, a fixed labeled dataset $S$ is shown.

![image.png](attachment:image.png)

Again, let $f_S$ denote the 1-nearest neighbor classifier for the fixed set $S$ shown above. You may assume that the 1-NN decision border is within a distance 0.2 from the true border. 

- Let $y = \square$. The constant function $f(x) = y$ misclassifies a random point with probability exactly $1/2$.

> It will misclassify the points in the 2nd and 4th quadrants.

- The probability that a random element is misclassified by $f_S$ is at most 0.2.

- Every halfplane misclassifies a random point in $\mathcal X$ with probability at least 0.25.

> No matter how you draw the line, at least 1/4 of the points will be misclassified.

In [None]:
# Implement the nearest neighbor algorithm with the Euclidean distance.

# If needed, place auxiliary functions here

def distance(p1, p2):
    """Euclidean distance between two tuples of equal length."""
    return sum((a - b) ** 2 for a, b in zip(p1, p2)) ** 0.5


def nearest_neighbor(train_set, test_point):
    """Return the label of the training input closest to test_point.

    Inputs are tuples of a given length and labels can be anything.
    train_set is a list of (input, label) pairs.
    The strict `<` below keeps the first point in case of a tie."""
    min_dis = float('inf')
    label = None  # returned unchanged if train_set is empty
    for x, y in train_set:
        d = distance(test_point, x)
        if d < min_dis:
            min_dis = d
            label = y
    return label

train_set = [((0, 0), 'a'), ((1,1), 'b')]
test_point = (2,2)

assert nearest_neighbor(train_set, test_point) == 'b'


train_set = [((-5,), 0), ((6,), 1)]
test_point = (0,)

assert nearest_neighbor(train_set, test_point) == 0

In this exercise, you are asked to give an example where the error of the 1-nearest neighbor classifier $f_S$ increases after elements are added to the training set $S$. 

Let $\mathcal X = [-1, 1]$ with the uniform distribution. Let $\mathcal{Y} = \{0,1\}$. We define the true label $h(x)$ of $x \in \mathcal X$ to be  $h(x) = 1$ if $x \ge 0$ and $0$ otherwise. Give an example of a set $S = [(x_1, h(x_1)), ..., (x_n, h(x_n))]$ and a point $x_{n+1}$ for which the error probability of $f_S$ is smaller than of $f_{S'}$, where $S' = [(x_1, h(x_1)), ..., (x_{n+1}, h(x_{n+1}))]$

In [None]:
# modify these variables (all inputs lie in X = [-1, 1])
train_inputs = [-1, 1]
new_input = 0.5


def true_classifier(point) :
    return int(point[0] >= 0)

"""
The test points are evenly spread over [-1, 1]. If the training inputs closest to 0
on each side are symmetric around 0, the 1-NN decision regions are (-inf, 0) and
(0, inf), so only the tie at 0 can be misclassified. A newly added point that breaks
this symmetry moves the decision border away from 0 and leads to a larger error.
"""

def error_probability(train_set) :
    mistakes = 0
    no_test_points = 2000
    for i in range(0,no_test_points):
        point = (2*i/no_test_points - 1,)
        if true_classifier(point) != nearest_neighbor(train_set, point) :
            mistakes += 1
    return mistakes/no_test_points


S = [((x_val,), true_classifier((x_val,))) for x_val in train_inputs]
error_S = error_probability(S)
new_point = (new_input,)
S.append( (new_point, true_classifier(new_point)) )
assert error_S < error_probability(S)

## Nearest neighbors in high dimensions

__Curse of dimensionality__

- Consider $X=[-1,1]^p$

- Let us say that $x \in X$ is _close to the border_ if there is a point in $\mathbb{R}^p\backslash X$ at distance at most 0.1. 
_(That is, a point $x$ is close to the border if it is possible to travel a distance 0.1 and get outside the hypercube. On a line there are two border regions, $[-1,-0.9]$ and $[0.9,1]$; in two dimensions the border has four sides.)_

- If $p=2$, then a fraction $(0.9)^2=0.81$ are far from the border
_(for $p=3, (0.9)^3\approx0.7$)_

- If $p = 100$, almost all points are close to the border, all except a fraction $0.9^{100} < 2.7 \times 10^{-5}$
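The fractions quoted above follow from the fact that a point is far from the border exactly when every coordinate lies within $0.9$ of the origin, which happens with probability $0.9$ per coordinate. A small sketch (the function name `fraction_far_from_border` and its parameters are assumptions for illustration):

```python
def fraction_far_from_border(p, margin=0.1, half_width=1.0):
    """Fraction of the cube [-1,1]^p that is farther than `margin` from the border."""
    # Each coordinate must lie in (-(1 - margin), 1 - margin): probability 0.9.
    return ((half_width - margin) / half_width) ** p

for p in (1, 2, 3, 10, 100):
    print(p, fraction_far_from_border(p))
```

The fraction shrinks geometrically with the dimension $p$.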

### Random points are far from each other

A third way to look at high dimensions is to observe that two random points are far from each other with high probability.

![image.png](attachment:image.png)

- $X = \{0, 1\}^p$ with uniform distribution (=discrete variant in blue above)

- True classifier $h(x)=x_1$ (we classify according to the first coordinate $x_1$.)

- No points are close to the decision border $\implies$ easier

(As the dimension increases, the points get farther from each other and the error grows, because more points lie close to the decision border, as depicted above.)

__example__

>Let $D$ be the squared distance between 2 randomly selected bitstrings $x$ and $x'$.
>
>$D=1_{x_1\ne x^{'}_1} + \dots + 1_{x_p\ne x^{'}_p}$
>
>Selecting a random $p$-bit string is the same as selecting $p$ random bits. We do this both for $x$ and $x^{'}$. We have $x_1 = x^{'}_1$ with probability 1/2 and $x_1 \neq x'_1$ with probability 1/2. Thus each term in the sum equals 0 and 1 with probability 1/2. Hence, $D$ is binomially $B(p,1/2)$ distributed (recall that $p$ is the input dimension).
>
>What is the expected value of $D$?

$\frac{p}{2}$. The expected value of a binomial distribution $B(n, q)$ is $nq$; here $n = p$ and $q = 1/2$.
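The value $p/2$ can be checked empirically. The sketch below is only an illustration; the function names, the sample size and the choice $p = 20$ are assumptions.

```python
import random

def squared_distance(x, y):
    """Squared Euclidean distance between bitstrings = number of differing coordinates."""
    return sum(xi != yi for xi, yi in zip(x, y))

def mean_squared_distance(p, n_pairs=20_000, seed=0):
    """Average squared distance over n_pairs pairs of uniformly random p-bit strings."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_pairs):
        x = [rng.randint(0, 1) for _ in range(p)]
        y = [rng.randint(0, 1) for _ in range(p)]
        total += squared_distance(x, y)
    return total / n_pairs

p = 20
print(mean_squared_distance(p), p / 2)
```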

### Distance between random points

Example: in $p=2$ dimensions $X = \{00,01,10,11\}$

- $X=\{0,1\}^p$ with uniform distribution
- Classification according to the first coordinate $x_1$ of $x$.
- The squared distance of 2 random points $x$ and $x^{'}$

$D=1_{x_1\ne x^{'}_1} + \dots + 1_{x_p\ne x^{'}_p}$

- Each term $1_{x_i\ne x^{'}_i}$ equals 0 w.p. 1/2 and 1 w.p. 1/2.
- For large $p$, $D$ is approximately normal with mean $p/2$ and $\sigma = \frac{1}{2}\sqrt{p}$. _(p0 to 2p+1 column)_


Take a random test point and a random training point. The squared distance between the training point and the test point consists of two parts:

the contribution of the first coordinate and the contribution of the remaining coordinates. __The contribution of the remaining coordinates is noise, and the first coordinate gives us the important information.__

But the first coordinate contributes at most 1, while the other coordinates contribute up to $p-1$. 

The contribution of the remaining coordinates is approximately normally distributed, with a large standard deviation proportional to $\sqrt{p}$.

Every point in the training set therefore gives only a very small amount of information (through its first coordinate).

As a result, we need a huge number of training points before the distance becomes informative: the training set must be almost exponentially large in $p$ in order to classify much better than with probability 1/2. 

- In the example: noise $\gg 1_{x_1\ne x^{'}_1}$  
The squared distance $D(x, x^{'})$ is not informative.  
The effect only disappears if $|S|$ is exponential in $p$.

- The noise is not much larger than $\sqrt{p}$ by the Chernoff bound

$Pr[|D-p/2| \ge \varepsilon p] \le 2 \exp (-2\varepsilon^2 p)$
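To see the effect described in this section, one can simulate 1-NN on $\{0,1\}^p$ with the first coordinate as label. This is a rough sketch under assumed settings (50 training points, 300 test points, a fixed seed); the function name `nn_accuracy` is hypothetical and the numbers it prints are not results from the course.

```python
import random

def nn_accuracy(p, n_train, n_test=300, seed=0):
    """Estimate the accuracy of 1-NN on {0,1}^p when the true label is the first bit."""
    rng = random.Random(seed)

    def random_bitstring():
        return tuple(rng.randint(0, 1) for _ in range(p))

    train = [random_bitstring() for _ in range(n_train)]
    correct = 0
    for _ in range(n_test):
        x = random_bitstring()
        # On bitstrings the squared Euclidean distance is the Hamming distance.
        # min() returns the first minimizer, so ties go to the first training point.
        nearest = min(train, key=lambda t: sum(a != b for a, b in zip(t, x)))
        correct += nearest[0] == x[0]
    return correct / n_test

for p in (2, 10, 100):
    print(p, nn_accuracy(p, n_train=50))
```

With a fixed training set size, the accuracy should drop towards 1/2 as $p$ grows, because the $p-1$ uninformative coordinates dominate the distance.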

### Summary

Curse of dimensionality = random points in a high dimensional space are typically

- close to border
- far from each other

__example__

>In statistical learning theory, we typically do not reason with specific distributions. However, in this lesson we make an exception.
>
>The nearest neighbor algorithm works well if points that are close to each other are likely to have the same labels. If the (squared) distance between 2 random points correlates well with the labels being different, then the 1-NN classifier might work well. In this quiz, we consider the simple example from the movie and show that this is not the case for large dimensions $p$.  
>
>Let $\mathcal{X} = \{0,1\}^p$ and $\mathcal Y = \{0,1\}$. We classify a bitstring $x \in \mathcal X$ by its first coordinate $x_1 \in \mathcal Y$.
>
>We select two independently and uniformly random strings $X$ and $X^{'}$ in $\mathcal X$. Let $Z$ be the random variable that is equal to 1 if $X$ and $X^{'}$ belong to different classes and is 0 otherwise. Thus, $Z = 1$ if $X_1 \not= X'_1$, and $Z=0$ otherwise. Observe that the distribution of $Z$ does not depend on $p$. 
>
>What is the variance of $Z$? 

$Z$ depends only on the first coordinates $X_1$ and $X^{'}_1$. $X_1=1$ w.p. 1/2 and $X_1=0$ w.p. 1/2; the same holds for $X^{'}_1$.  
So $X_1$ and $X^{'}_1$ are equal w.p. 1/2 (the pairs (1,1) and (0,0)),  
and $X_1$ and $X^{'}_1$ are not equal w.p. 1/2 (the pairs (1,0) and (0,1)). 

$\mathbb{E}(Z) = 1/2$

$\mathrm{Var}(Z) = \mathbb{E}\big[(Z - \mathbb{E}Z)^2\big] = \frac{1}{2}(1-\frac{1}{2})^2 + \frac{1}{2}(0-\frac{1}{2})^2 = 1/4$

> Let $D$ denote the squared Euclidean distance between $X$ and $X^{'}$. Thus 
>
>$D=(X_1-X^{'}_1)^2 + \dots +(X_p-X^{'}_p)^2$
>
>The variance of $D$ is a function of the dimension $p$. Determine this variance. 

When $X_i \ne X^{'}_i$ we have $(X_i - X^{'}_i)^2 = 1$, and otherwise $(X_i - X^{'}_i)^2 = 0$.

$\Pr[X_i \ne X^{'}_i] = 1/2 \implies D \sim \mathrm{Binom}(p, 1/2) \implies \mathbb{E}D = p/2, \;\; \mathrm{Var}(D) = p \times \frac{1}{2} (1-\frac{1}{2}) = p/4$

>Determine the covariance of $Z$ and $D$ as a function of $p$.
>
> $cov(Z,D)=\mathbb E((Z-\mathbb EZ)(D-\mathbb ED))$

When $p = 1$: $\mathbb E D = \frac{p}{2} = \frac{1}{2}$. Joint distribution of $Z$ and $D$:

| Z (rows), D (columns) | 0 | 1 |
|--|--|--|
| 0 | 1/2 | 0 |
| 1 | 0 | 1/2 |

$Cov(Z, D) = \frac{1}{2}(0-1/2)(0-1/2) + \frac{1}{2}(1-1/2)(1-1/2) = 1/4$

When $p = 2$: $\mathbb E D = \frac{p}{2} = 1$. Joint distribution of $Z$ and $D$:

| Z (rows), D (columns) | 0 | 1 | 2 |
|--|--|--|--|
| 0 | 1/4 | 1/4 | 0 |
| 1 | 0 | 1/4 | 1/4 |

$Cov(Z, D) = \frac{1}{4}(0-1/2)(0-1) + \frac{1}{4}(0-1/2)(1-1) + \frac{1}{4}(1-1/2)(1-1) + \frac{1}{4}(1-1/2)(2-1) = 1/4$

The same computation works for every $p$, so

$Cov(Z, D) = 1/4$

__Another solution:__

Hint: do you see a relationship between $(X_1 - X^{'}_1)^2$ and $Z$? Note that $Z$ is independent of $X_2, ..., X_p$ and $X^{'}_2, ..., X^{'}_p$.

Write the expectation above as $E(\tilde{Z}(\tilde{Z}+R))$,

where $\tilde{Z}$ only depends on $X_1$ and $X^{'}_1$ and $R$ on $X_2, X'_2, ..., X_p, X^{'}_p$.
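One way to carry out this hint, written as a sketch: here $\tilde{Z} = Z - \mathbb{E}Z$, and $R$ is taken to be the centered contribution of the remaining coordinates (this centering is an assumption about what the hint intends). Since $Z = (X_1 - X^{'}_1)^2$, we have $D - \mathbb{E}D = \tilde{Z} + R$, and $\tilde{Z}$ and $R$ are independent with mean zero.

$$
\begin{aligned}
R &= \sum_{i=2}^{p} \Big( (X_i - X^{'}_i)^2 - \tfrac{1}{2} \Big), \\
cov(Z, D) &= \mathbb{E}\big(\tilde{Z}(\tilde{Z} + R)\big)
 = \mathbb{E}\big(\tilde{Z}^2\big) + \mathbb{E}\big(\tilde{Z}\big)\,\mathbb{E}(R)
 = \mathrm{Var}(Z) + 0 = \tfrac{1}{4}.
\end{aligned}
$$

This agrees with the table computations for $p = 1$ and $p = 2$ and shows that the covariance does not depend on $p$.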

>Calculate the correlation between $Z$ and $D$ as a function of $p$.
>
>$cor(Z,D)=\frac{cov(Z,D)}{\sqrt{Var(Z)Var(D)}}$

$cor(Z,D) = \frac{1/4}{\sqrt{1/4 \times p/4}} = 1/\sqrt{p} $

__Thus the correlation is very small for large dimensions $p$. A small correlation is not enough to prove that 1-NN fails. In the next quiz, we present a problem in which 1-NN performs significantly worse than halfspace classification, unless the training set is exponentially large.__
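The formula $cor(Z,D) = 1/\sqrt{p}$ can be confirmed exactly for small $p$ by enumeration. The sketch below relies on the observation that the pattern of differing coordinates is uniform on $\{0,1\}^p$; the function name `correlation_Z_D` is an assumption for illustration.

```python
from itertools import product
from math import sqrt

def correlation_Z_D(p):
    """Exact cor(Z, D) for independent uniform X, X' in {0,1}^p."""
    # Each coordinate differs with probability 1/2, independently of the others,
    # so the pattern of differing coordinates is uniform on {0,1}^p.
    patterns = list(product((0, 1), repeat=p))
    n = len(patterns)
    z = [pat[0] for pat in patterns]     # Z = 1 iff the first coordinates differ
    d = [sum(pat) for pat in patterns]   # D = number of differing coordinates
    mean_z, mean_d = sum(z) / n, sum(d) / n
    cov = sum((zi - mean_z) * (di - mean_d) for zi, di in zip(z, d)) / n
    var_z = sum((zi - mean_z) ** 2 for zi in z) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / n
    return cov / sqrt(var_z * var_d)

for p in (1, 4, 9, 16):
    print(p, correlation_Z_D(p), 1 / sqrt(p))
```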

### Nearest neighbors failing provably

>In this quiz we present a simple classification function for which the 1-nearest neighbor algorithm requires an exponentially large training set before it classifies correctly with probability slightly better than 1/2.
>
>Let $\mathcal X = \{-1,1\}^p \cup \{ 0^p\}$, in other words, $\mathcal{X}$ is a $p$-dimensional hypercube together with its center $0^p$. 
>
>We use the labels $\mathcal Y = \{0,1\}$. We consider the true classifier $h(x) = \frac{1}{p} \sum_{i \le p}(x_i)^2$, i.e. __all points are mapped to $1$, except the center, which is mapped to $0$__.
>
>Let $\mathbb P$ be the measure over $\mathcal{X}$ that selects the center with probability 1/2, and each point of the hypercube with probability $2^{-p-1}$.
>
>Let $c=0.1$. We show that for each $S \subseteq \mathcal X$, the 1-nearest neighbor classifier $f_S$ misclassifies a random test point $X$ with probability
>
><center>$\large \mathbb{P}(f_S(X) \ne h(X)) \ge \frac{1}{2}-|S|\exp(-cp)$</center>
>
>(We will get a bound with better parameters.)
>
>To warm up, we first consider the special case $p = 2$.
>
>$S = [((-1, 1), 1), ((1, 1), 1), ((-1, -1), 1), ((1, -1), 1), ((0, 0), 0)]$
>
>Each of the 4 outer points has measure 1/8 ($2^{-2-1}$) and the center point has measure 1/2.
>
>Fix $S=[((-1,-1),1), ((0, 0), 0)]$. What is the probability that a random $X$ is misclassified?

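$X$ is misclassified when $X \in \{(-1, 1), (1, 1), (1, -1)\}$: the squared distance to the center is $d_E(x, (0,0))^2 = 2$, while the squared distance to $(-1,-1)$ is $d_E(x, (-1,-1))^2 = 4$, $8$ or $4$ respectively, so the center is the nearest neighbor and $f_S$ predicts $0$ instead of $1$.

Since each of the 4 outer points has measure 1/8, the probability that a random $X$ is misclassified is 3/8.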
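The warm-up case can be verified by enumerating all five points of $\mathcal X$ for $p = 2$. This is a small sketch; the helper names `nearest_label` and `true_label` are assumptions for illustration.

```python
from itertools import product

def nearest_label(train, x):
    """1-NN label using the squared Euclidean distance; ties go to the first point."""
    return min(train, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

p = 2
cube = list(product((-1, 1), repeat=p))
center = (0,) * p

# The measure P: mass 1/2 on the center, mass 2^(-p-1) on each cube point.
measure = {center: 0.5}
for x in cube:
    measure[x] = 2.0 ** (-p - 1)

def true_label(x):
    """True classifier h: 0 on the center, 1 on the cube."""
    return 0 if x == center else 1

S = [((-1, -1), 1), ((0, 0), 0)]
error = sum(m for x, m in measure.items() if nearest_label(S, x) != true_label(x))
print(error)
```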

>Let $x^{'}$ be a fixed point in $\{-1,1\}^p$ and let $X$ denote a uniformly randomly selected point in $\{-1,1\}^p$. Let $D = \mathrm{d}_{\text{E}}(x',X)^2$ denote the squared Euclidean distance. 
>
>What is the expected value of $D$ as a function of $p$?

Let $Z = (X_i - x^{'}_i)^2$ be the contribution of coordinate $i$. When $(X_i, x^{'}_i) = (-1,1)$ or $(1,-1)$ we have $Z = 4$; when $(X_i, x^{'}_i) = (-1,-1)$ or $(1,1)$ we have $Z = 0$.

$\Pr[Z=4]=1/2, \Pr[Z=0]=1/2$

$\mathbb{E}D = p \times (\frac{1}{2} \times 4 + \frac{1}{2} \times 0) = 2p$

>Recall the Chernoff bound. Let $V_1, V_2, ..., V_p$ be independent random Bernoulli variables, each with expected value $\mu$. Then
>
>$P(\frac{1}{p}(V_1+\dots+V_p) \le \mu-\varepsilon ) \le \exp(-2\varepsilon^2 p)$
>
>Use this to obtain an upper bound for the probability that $D \le p$ in the previous question. What is this bound?

__A Bernoulli distribution is the discrete probability distribution of a random variable that takes the value $1$ with probability $\mu$ and the value $0$ with probability $1-\mu$.__

Therefore $D/4$ is a sum of $p$ independent Bernoulli variables, each with expected value equal to 1/2 (in the question above we found $\mathbb{E}Z = 2$, so $\mathbb{E}[Z/4] = 1/2$).



$P(\frac{1}{p} \frac{D}{4} \le 1/2 - \varepsilon) \le \exp(-2 \varepsilon^2 p) \implies P(D \le 2p - 4\varepsilon p) \le \exp(-2 \varepsilon^2 p)$

Let $\varepsilon = 1/4$

$P(D \le p) \le \exp(-2 (1/4)^2p) = \exp(-p/8)$
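For comparison, the exact probability $P(D \le p)$ can be computed from the binomial distribution, since $D/4 \sim \mathrm{Binom}(p, 1/2)$. The sketch below (function name `prob_D_at_most_p` assumed for illustration) prints the exact value next to the bound $\exp(-p/8)$.

```python
from math import comb, exp

def prob_D_at_most_p(p):
    """Exact P(D <= p), where D = 4 * B and B ~ Binomial(p, 1/2)."""
    # D <= p  <=>  B <= p / 4  <=>  B <= floor(p / 4), because B is an integer.
    return sum(comb(p, k) for k in range(p // 4 + 1)) / 2 ** p

for p in (8, 40, 100):
    print(p, prob_D_at_most_p(p), exp(-p / 8))
```

The exact probability stays below the bound, as the Chernoff inequality guarantees, and is typically considerably smaller.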

Consider now the problem of question 1. If $S$ does not contain the center point $0^p$, then $f_S$ predicts the constant $1$ and the probability that $f_S(X) = h(X)$ equals exactly 1/2. Assume now that $S$ contains the center point. Let $n$ denote the size of $S$. 

Let $A_1, ..., A_t$ be events. The union bound states that the probability that at least 1 of the events happens is at most $\mathrm{P}(A_1) + \cdots + \mathrm{P}(A_t)$.

Use the union bound and the above inequality to obtain an upper bound on the probability that $f_S(X) = h(X)$ as a function of the size $n = |S|$ and $p$.

Hint: do not forget about the center point. 
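One possible line of reasoning, offered as a sketch rather than as the official solution: with probability 1/2 the test point $X$ is the center, which is in $S$ and hence classified correctly. With probability 1/2 the test point is uniform on the hypercube, and then $f_S(X) = 1$ requires that some hypercube point $x^{'} \in S$ is at least as close to $X$ as the center, i.e. $d_E(x^{'}, X)^2 \le p$. There are at most $n - 1$ such points $x^{'}$, and each satisfies this with probability at most $\exp(-p/8)$, so the union bound gives

$$
\mathbb{P}(f_S(X) = h(X)) \;\le\; \frac{1}{2} + \frac{1}{2} \sum_{x^{'} \in S \cap \{-1,1\}^p} \mathbb{P}\big(d_E(x^{'}, X)^2 \le p\big) \;\le\; \frac{1}{2} + \frac{1}{2}(n-1)\exp(-p/8).
$$

Because $\exp(-p/8) \le \exp(-0.1\,p)$, this is consistent with the claimed bound $\mathbb{P}(f_S(X) \ne h(X)) \ge \frac{1}{2} - |S|\exp(-cp)$ for $c = 0.1$.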