# Add the notebook's parent directory (which contains libs/) to sys.path
import os
import sys
import time

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from scipy.optimize import minimize
import math
from sklearn.preprocessing import normalize
from functools import partial
import h5py
from scipy.spatial import distance

nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

from matplotlib.colors import ListedColormap
import libs.linear_models as lm
import libs.data_util as data
import libs.nn as nn
import libs.plot as myplot

%matplotlib inline

#### Exercise 8.1

* (a) Suppose some hyperplane tolerates a noise radius $r > \frac{1}{2}|x_{+}-x_{-}|$. Consider the midpoint $m = \frac{1}{2}(x_{+}+x_{-})$ of the segment joining the two points. Since $|m - x_{+}| = |m - x_{-}| = \frac{1}{2}|x_{+}-x_{-}| < r$, the point $m$ lies within the noise radius of both $x_{+}$ and $x_{-}$. To tolerate the noise, the hyperplane would have to label $m$ as $+1$ (as a perturbation of $x_{+}$) and as $-1$ (as a perturbation of $x_{-}$) at the same time, which is impossible. This contradicts the assumption, so no hyperplane can tolerate a noise radius greater than $\frac{1}{2}|x_{+}-x_{-}|$.

* (b) Choose the hyperplane that is perpendicular to the segment between $x_{+}$ and $x_{-}$ and passes through its midpoint. The two open balls of radius $\frac{1}{2}|x_{+}-x_{-}|$ centered at $x_{+}$ and $x_{-}$ lie entirely on opposite sides of this hyperplane. Taking the normal of the hyperplane to point toward $x_{+}$, the signed distance from the hyperplane to any point in the ball around $x_{+}$ is positive, and it is negative for any point in the ball around $x_{-}$.

Thus the hyperplane can tolerate such a noise radius.
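
The construction in (b) can be checked numerically. The sketch below is only an illustration: the two points `x_pos` and `x_neg` are hypothetical values that are not part of the exercise, and `sample_in_ball` is a helper written for this check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data points (not given in the exercise).
x_pos = np.array([2.0, 1.0])
x_neg = np.array([-1.0, 0.0])

# Perpendicular bisector: normal w points from x_neg to x_pos,
# and b is chosen so that the midpoint lies on the hyperplane.
w = x_pos - x_neg
mid = (x_pos + x_neg) / 2
b = -w @ mid
radius = 0.5 * np.linalg.norm(x_pos - x_neg)


def sample_in_ball(center, r, n):
    """Sample n points uniformly from the ball of radius r around center."""
    directions = rng.normal(size=(n, center.size))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = r * rng.uniform(0, 1, size=(n, 1)) ** (1 / center.size)
    return center + radii * directions


# Perturb each point by noise strictly smaller than the tolerated radius.
noisy_pos = sample_in_ball(x_pos, 0.999 * radius, 1000)
noisy_neg = sample_in_ball(x_neg, 0.999 * radius, 1000)

all_pos_ok = bool(np.all(noisy_pos @ w + b > 0))
all_neg_ok = bool(np.all(noisy_neg @ w + b < 0))
print(all_pos_ok, all_neg_ok)
```

Both flags should be `True`: every perturbed copy of $x_{+}$ stays on the positive side and every perturbed copy of $x_{-}$ stays on the negative side.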

#### Exercise 8.2

* (a) 

\begin{align*}
Y\otimes (Xw + b) &= Y \otimes ( \begin{bmatrix}0 \\ -4 \\ 2.4\end{bmatrix}-0.5)\\
&= \begin{bmatrix}-1 \\ -1 \\ +1\end{bmatrix} \otimes \begin{bmatrix}-0.5 \\ -4.5 \\ 1.9\end{bmatrix}\\
&= \begin{bmatrix}0.5\\ 4.5 \\ 1.9\end{bmatrix}\\
\end{align*}

So $\rho = \min_n y_n(w^Tx_n +b) = 0.5$.

* (b) Rescaling by $\frac{1}{\rho}$ gives the new weights $(b', w') = \frac{1}{\rho}(b,w) = (-1, \begin{bmatrix}2.4 \\ -6.4 \end{bmatrix})$.

We compute

\begin{align*}
Y\otimes (Xw' + b') &= Y \otimes ( \begin{bmatrix}0 \\ -8 \\ 4.8\end{bmatrix}-1)\\
&= \begin{bmatrix}-1 \\ -1 \\ +1\end{bmatrix} \otimes \begin{bmatrix}-1 \\ -9 \\ 3.8\end{bmatrix}\\
&= \begin{bmatrix}1\\ 9 \\ 3.8\end{bmatrix}\\
\end{align*}

The new minimum is $\rho' = 1$ which satisfies equation (8.2)

* (c) Scaling $(b, w)$ by the positive constant $\frac{1}{\rho}$ does not change the set $\{x : w^Tx + b = 0\}$, so the two hyperplanes are the same line in 2-dimensional space.
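
The arithmetic in (a) and (b) can be reproduced with NumPy. This is a small sketch that assumes the toy data set $x_1=(0,0)$, $x_2=(2,2)$, $x_3=(2,0)$, which is consistent with the values of $Xw$ above, and $w = (1.2, -3.2)$, which is $w'$ scaled back by $\rho$.

```python
import numpy as np

# Assumed toy data set, consistent with Xw = (0, -4, 2.4) above.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0])
w = np.array([1.2, -3.2])
b = -0.5

# (a) y_n (w^T x_n + b) for every point, and the minimum rho.
margins = y * (X @ w + b)
rho = margins.min()

# (b) Rescale (b, w) by 1/rho and recompute.
w_scaled, b_scaled = w / rho, b / rho
margins_scaled = y * (X @ w_scaled + b_scaled)

print(margins, rho)        # approximately [0.5 4.5 1.9] and 0.5
print(margins_scaled)      # approximately [1.  9.  3.8]
```

After rescaling, the smallest value of $y_n(w'^Tx_n + b')$ is 1, while the sign of $w^Tx+b$ at every point is unchanged.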


#### Exercise 8.3

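If $h$ is the optimal hyperplane, its weights $w_h$ minimize $w^Tw$ subject to the separation constraints, and the data point nearest to the hyperplane is at distance $\frac{1}{|w_h|}$. Suppose $\rho_{+} \ne \rho_{-}$, say $\rho_{+} > \rho_{-}$, so that $\rho_{-} = \frac{1}{|w_h|}$.

Let $l=\frac{\rho_{+}-\rho_{-}}{2}$ and shift the hyperplane by $l$ toward the positive points, keeping its direction. The new distances are $\rho'_{+} = \rho_{+} - l = \frac{\rho_{+}+\rho_{-}}{2} > \frac{1}{|w_h|}$ and $\rho'_{-} = \rho_{-} + l = \frac{\rho_{+}+\rho_{-}}{2} > \frac{1}{|w_h|}$.

The shifted hyperplane still separates the data and has margin $\rho' = \rho'_{+} = \rho'_{-} > \frac{1}{|w_h|}$. Rescaling its weights to canonical form gives $w'$ with $|w'| = \frac{1}{\rho'} < |w_h|$, which contradicts the fact that $w_h$ minimizes $w^Tw$. Hence $\rho_{+} = \rho_{-}$.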


#### Exercise 8.4

Let $Y=\begin{bmatrix}y^1 & y^2 & \dots & y^N\end{bmatrix}$ and $X=\begin{bmatrix}1 & x^T_1 \\ \dots & \dots \\ 1 & x^T_N\end{bmatrix}$, where the $n$-th column of $Y$ is $y^n = \begin{bmatrix}0 \\ \dots \\ y_n \\ \dots \end{bmatrix}$, whose only non-zero entry is $y_n$ in the $n$-th row.

Writing the matrix product as a sum of columns of $Y$ times rows of $X$, we get

\begin{align*}
A &=YX \\
&= \sum^N_{n=1} \begin{bmatrix}0 \\ \dots \\ y_n \\ \dots \end{bmatrix} \begin{bmatrix}1 & x^T_n\end{bmatrix}\\
&= \sum^N_{n=1} \begin{bmatrix}0 & 0^T_d \\ \dots & \dots \\ y_n & y_nx^T_n \\ \dots & \dots \\ 0 & 0^T_d \end{bmatrix}\\
&= \begin{bmatrix}y_1 & y_1x^T_1 \\ \dots & \dots \\ y_N & y_Nx^T_N \end{bmatrix}\\
\end{align*}

This is exactly the required formula for $A$.
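
The identity can be sanity-checked on random data. The sketch below uses hypothetical sizes `N` and `d` and random inputs. It compares the product $YX$, the sum of outer products used in the derivation, and the row-wise form $y_n\begin{bmatrix}1 & x^T_n\end{bmatrix}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3  # hypothetical sizes, chosen only for illustration

x = rng.normal(size=(N, d))
y = rng.choice([-1.0, 1.0], size=N)

X = np.hstack([np.ones((N, 1)), x])  # data matrix with a leading column of 1s
Y = np.diag(y)                       # diagonal matrix with Y_nn = y_n

# Three ways of computing A.
A_product = Y @ X
A_outer = sum(np.outer(Y[:, n], X[n, :]) for n in range(N))
A_rows = y[:, None] * X

print(np.allclose(A_product, A_outer), np.allclose(A_product, A_rows))
```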

#### Exercise 8.5

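$Q$ is diagonal with diagonal entries 0 (once) and 1 ($d$ times), so its eigenvalues are either 0 or 1. All eigenvalues are non-negative, so $Q$ is positive semi-definite. Equivalently, for $u = (b, w)$ we have $u^TQu = w^Tw \ge 0$.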
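
A quick numerical check, assuming $Q$ has the block form $\begin{bmatrix}0 & 0^T_d \\ 0_d & I_d\end{bmatrix}$ and using a hypothetical dimension `d`:

```python
import numpy as np

d = 4  # hypothetical input dimension

# Q = [[0, 0_d^T], [0_d, I_d]]
Q = np.zeros((d + 1, d + 1))
Q[1:, 1:] = np.eye(d)

# eigvalsh is meant for symmetric matrices and returns sorted real eigenvalues.
eigenvalues = np.linalg.eigvalsh(Q)

# The quadratic form u^T Q u ignores the bias entry and equals w^T w.
rng = np.random.default_rng(0)
u = rng.normal(size=d + 1)
quad = u @ Q @ u

print(eigenvalues, quad >= 0)
```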

#### Exercise 8.6 TODO

#### Exercise 8.7

* (a) Let $\rho_1 < \rho_2$ and consider the corresponding hypothesis sets $\mathcal{H}_{\rho_1}$ and $\mathcal{H}_{\rho_2}$. For a given data set $(x_1,y_1),\dots, (x_N, y_N)$, if a dichotomy on the data set can be implemented by some hypothesis $h_2 \in \mathcal{H}_{\rho_2}$, then some hypothesis $h_1 \in \mathcal{H}_{\rho_1}$ implements it as well.

To see this, take the $(b_2,w_2)$ of $h_2$. Since $\rho_2 = \frac{1}{|w_2|}$, setting $(b_1, w_1) =  \frac{\rho_2}{\rho_1}(b_2, w_2)$ gives $|w_1| = \frac{1}{\rho_1}$, so $h_1 = (b_1, w_1)$ belongs to $\mathcal{H}_{\rho_1}$, describes the same hyperplane, and has a margin of $\rho_1$.

Since the margin of $h_1$ is smaller than the margin of $h_2$ ($\rho_1 < \rho_2$), every point outside the margin of $h_2$ is also outside the margin of $h_1$, so $h_1$ implements any dichotomy that $h_2$ implements. The converse need not hold: if we add points (with the appropriate labels on the two sides of $h_1$) that lie inside the margin of $h_2$ but outside the margin of $h_1$, then $h_1$ still implements the new dichotomy while $h_2$ does not.

So we see that $d_{VC}(\rho)$ is non-increasing in $\rho$.

* (b) We first prove that for any 3 points in the unit disc, there must be two that are within distance $\sqrt{3}$ of each other. If this is not true, then there exist 3 points in the unit disc such that the distance between any two of them is larger than $\sqrt{3}$.

The center of the disc can't be one of these 3 points, because the distance from the center to any point in the disc is at most 1.
Now draw rays from the center through the 3 points and extend them to meet the boundary of the disc. Connecting the 3 resulting points on the circle gives three segments, each of which must also be longer than $\sqrt{3}$.

For each segment we can compute the angle it subtends at the center from the law of cosines, $\cos(\theta) = \frac{a^2+b^2-c^2}{2ab}$, where $a=b=1$, $c$ is the length of the segment and $\theta$ is the corresponding angle. Since $c > \sqrt{3}$, we have $\cos(\theta) = 1 - \frac{c^2}{2} < -0.5$, so $\theta > \frac{2\pi}{3}$. This holds for all three angles, so their sum is larger than $3\times \frac{2\pi}{3} = 2\pi$. This is a contradiction, because the three angles between rays from a common center sum to at most $2\pi$.

So for any 3 points in the unit disc, there must be two that are within distance of $\sqrt{3}$ of each other.
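
The geometric claim can also be probed by simulation. The following sketch is a Monte Carlo check and not a proof: it samples random triples in the unit disc and records the largest value of the smallest pairwise distance, then evaluates the extremal case of an equilateral triangle inscribed in the unit circle. The helper names are made up for this illustration.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)


def sample_unit_disc(n):
    """Sample n points uniformly from the unit disc."""
    angles = rng.uniform(0, 2 * np.pi, size=n)
    radii = np.sqrt(rng.uniform(0, 1, size=n))
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])


def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two of the given points."""
    return min(np.linalg.norm(p - q) for p, q in combinations(points, 2))


# Largest "closest pair" distance seen over many random triples.
worst = max(min_pairwise_distance(sample_unit_disc(3)) for _ in range(5000))

# Extremal configuration: equilateral triangle inscribed in the unit circle.
triangle_angles = np.array([0, 2 * np.pi / 3, 4 * np.pi / 3])
triangle = np.column_stack([np.cos(triangle_angles), np.sin(triangle_angles)])
extremal = min_pairwise_distance(triangle)

print(worst <= np.sqrt(3), extremal)
```

No sampled triple should exceed $\sqrt{3}$, and the inscribed equilateral triangle attains it.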

Now suppose that $d_{VC}(\rho) \ge 3$ for some $\rho >\frac{\sqrt{3}}{2}$, that is, $\mathcal{H}_{\rho}$ can shatter some set of 3 points in the unit disc.

Take any 3 points in the unit disc. From what we have proved, two of them are within distance $\sqrt{3}$ of each other. Consider a dichotomy that labels one of these two points $+1$ and the other $-1$. Any hyperplane (a line) that implements this dichotomy must cross the segment between the two points, so the distances from the two points to the line sum to at most the length of the segment, i.e. at most $\sqrt{3}$. Hence at least one of the two distances is at most $\frac{\sqrt{3}}{2} < \rho$. The best case is the perpendicular bisector of the segment, where both distances are equal and at most $\frac{\sqrt{3}}{2}$. So at least one of the two points always lies inside the margin of any hyperplane that implements the dichotomy, and no hypothesis in $\mathcal{H}_{\rho}$ can implement it.

This contradicts the assumption that $\mathcal{H}_{\rho}$ can shatter some set of 3 points. Thus $d_{VC}(\rho) < 3$ when $\rho >\frac{\sqrt{3}}{2}$.

#### Exercise 8.8

* (a) In Figure 8.5, there are 23 points in total, 5 of which are support vectors, so the bound is $E_{CV}(SVM) \le \frac{5}{23}$.

* (b) If one of the four support vectors in a gray box is removed, the classifier does not change. It still has the maximum margin among all candidates.

* (c) The bound is now $\frac{4}{23}$.

#### Exercise 8.9

* (a) Since $u_0$ is optimal for (8.10), it is feasible, so $c-a^Tu_0 \le 0$. For $\alpha \ge 0$ the product $\alpha (c-a^Tu_0)$ is therefore non-positive, and its maximum is attained at $\alpha = 0$: $\max_{\alpha \ge 0} \alpha (c-a^Tu_0) = 0$.

* (b) We now prove that $u_1$ is feasible for (8.10). Assume the contrary, that $c-a^Tu_1 > 0$. Then $\max_{\alpha \ge 0} \alpha (c-a^Tu_1) = \infty$ and the objective in (8.11) goes to infinity at $u_1$. However, at $u=u_0$ the objective in (8.11) is finite, because $\max_{\alpha \ge 0} \alpha (c-a^Tu_0) = 0$. This contradicts the assumption that $u_1$ is optimal for (8.11). So we have $c-a^Tu_1 \le 0$ and $u_1$ is feasible for (8.10).

* (c) From (a) and (b) we know that $c-a^Tu_0 \le 0$ and $c-a^Tu_1 \le 0$. Then we have $\max_{\alpha \ge 0} \alpha (c-a^Tu_0) = \max_{\alpha \ge 0} \alpha (c-a^Tu_1) = 0$.

Hence, for feasible $u$ the objective in (8.11) reduces to $\frac{1}{2}u^TQu + p^Tu$, which is the objective of (8.10), while for infeasible $u$ it is infinite. So $u_1$ minimizes $\frac{1}{2}u^TQu + p^Tu$ over the feasible set, and by definition so does $u_0$. Both attain the same minimum value, so we must have $\frac{1}{2}u^T_0Qu_0 + p^Tu_0 = \frac{1}{2}u^T_1Qu_1 + p^Tu_1$.

* (d) Let $u^*$ be any optimal solution for (8.11). By (b) and (a), we have $\max_{\alpha \ge 0} \alpha (c-a^Tu^*) = 0$. If the maximum is attained at $\alpha^*$, then $\alpha^* (c-a^Tu^*) = 0$, so either $c-a^Tu^*= 0$ or $\alpha^* = 0$.
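
The role of the inner maximization can be illustrated on a one-dimensional toy problem. The sketch below is not part of the exercise: it assumes the hypothetical values $Q=1$, $p=0$, $a=1$, $c=1$ (minimize $\frac{1}{2}u^2$ subject to $u \ge 1$) and evaluates the objective of (8.11) on a grid. The inner maximum is computed in closed form: 0 when $c - au \le 0$ and $\infty$ otherwise.

```python
import numpy as np

# Hypothetical 1-D instance: minimize 0.5*u^2 subject to u >= 1.
Q, p, a, c = 1.0, 0.0, 1.0, 1.0


def inner_max(u):
    """max over alpha >= 0 of alpha * (c - a*u), in closed form."""
    slack = c - a * u
    return 0.0 if slack <= 0 else np.inf


def lagrangian_objective(u):
    """Objective of the unconstrained min-max formulation at u."""
    return 0.5 * Q * u**2 + p * u + inner_max(u)


# Grid from -3 to 3 in steps of 0.01 (built from integers so 1.0 is exact).
grid = np.arange(-300, 301) / 100.0
values = np.array([lagrangian_objective(u) for u in grid])
u_star = grid[np.argmin(values)]

print(u_star, values.min(), c - a * u_star)
```

On this instance the minimizer of the min-max objective is the constrained optimum $u^* = 1$, every infeasible grid point has an infinite objective, and the constraint is active at $u^*$, which is one of the two cases in (d).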

#### Exercise 8.10

\begin{align*}
\frac{\partial{\mathcal{L}}}{\partial{u}_1} &= 2u_1 - \alpha_1 -\alpha_2 = 0 \\
\frac{\partial{\mathcal{L}}}{\partial{u}_2} &= 2u_2 - 2\alpha_1 -\alpha_3 = 0 \\
\end{align*}

Setting the derivatives to 0 gives $u_1 = \frac{\alpha_1 + \alpha_2}{2}, u_2 = \frac{2\alpha_1 + \alpha_3}{2}$. Substituting these into the Lagrangian:

\begin{align*}
\mathcal{L}(u,\alpha) &= u^2_1 + u^2_2 + \alpha_1(2-u_1-2u_2) -\alpha_2u_1 -\alpha_3u_2 \\
&=  \frac{(\alpha_1 + \alpha_2)^2}{4} + \frac{(2\alpha_1 + \alpha_3)^2}{4} + \alpha_1 ( 2- \frac{\alpha_1 + \alpha_2}{2} - (2\alpha_1 + \alpha_3)) - \alpha_2 \frac{\alpha_1 + \alpha_2}{2} - \alpha_3 \frac{2\alpha_1 + \alpha_3}{2} \\
&= \frac{1}{4}\alpha^2_1 + \frac{1}{2}\alpha_1\alpha_2 + \frac{1}{4}\alpha^2_2 + \alpha^2_1 + \alpha_1\alpha_3 +  \frac{1}{4}\alpha^2_3 + 2\alpha_1 -  \frac{1}{2}\alpha^2_1 -  \frac{1}{2}\alpha_1\alpha_2 - 2\alpha^2_1 - \alpha_1\alpha_3 -  \frac{1}{2}\alpha_1\alpha_2 -  \frac{1}{2}\alpha^2_2 - \alpha_1\alpha_3 - \frac{1}{2}\alpha^2_3\\
&= -\frac{5}{4}\alpha^2_1 - \frac{1}{4}\alpha^2_2 - \frac{1}{4}\alpha^2_3 - \frac{1}{2}\alpha_1\alpha_2 - \alpha_1\alpha_3 + 2\alpha_1\\
\end{align*}
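
The algebra can be double-checked numerically. The sketch below evaluates the Lagrangian at the stationary point $u(\alpha)$ and compares it with the simplified expression, whose coefficients are $-\frac{5}{4}, -\frac{1}{4}, -\frac{1}{4}, -\frac{1}{2}, -1$ and $2$, for random non-negative multipliers. The function names are chosen for this illustration only.

```python
import numpy as np


def lagrangian(u, alpha):
    """L(u, alpha) = u1^2 + u2^2 + a1(2 - u1 - 2 u2) - a2 u1 - a3 u2."""
    u1, u2 = u
    a1, a2, a3 = alpha
    return u1**2 + u2**2 + a1 * (2 - u1 - 2 * u2) - a2 * u1 - a3 * u2


def stationary_u(alpha):
    """Solution of dL/du = 0 for fixed alpha."""
    a1, a2, a3 = alpha
    return np.array([(a1 + a2) / 2, (2 * a1 + a3) / 2])


def dual_closed_form(alpha):
    """Simplified L(u(alpha), alpha) as a function of alpha only."""
    a1, a2, a3 = alpha
    return (-1.25 * a1**2 - 0.25 * a2**2 - 0.25 * a3**2
            - 0.5 * a1 * a2 - a1 * a3 + 2 * a1)


rng = np.random.default_rng(0)
alphas = rng.uniform(0, 3, size=(100, 3))
max_gap = max(
    abs(lagrangian(stationary_u(al), al) - dual_closed_form(al))
    for al in alphas
)
print(max_gap)
```

`max_gap` should be at the level of floating-point round-off, which confirms that the simplified expression agrees with direct substitution.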