# Add the parent directory (which contains libs/) to sys.path
import os
import sys
import time

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from scipy.optimize import minimize
import math
from sklearn.preprocessing import normalize
from functools import partial
import h5py
from scipy.spatial import distance

nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

from matplotlib.colors import ListedColormap
import libs.linear_models as lm
import libs.data_util as data
import libs.nn as nn
import libs.plot as myplot

%matplotlib inline

#### Exercise 8.1

* (a) Suppose some hyperplane tolerates a noise radius $r > \frac{1}{2}|x_{+}-x_{-}|$. Consider the midpoint $m = \frac{1}{2}(x_{+}+x_{-})$ of the segment between the two points. Its distance to each of them is $\frac{1}{2}|x_{+}-x_{-}| < r$, so it lies within the noise radius of both. As a perturbed copy of $x_{+}$ it must be labeled $+1$, and as a perturbed copy of $x_{-}$ it must be labeled $-1$. No hyperplane can assign both labels to the same point, which contradicts the assumption. Hence no hyperplane can tolerate a noise radius greater than $\frac{1}{2}|x_{+}-x_{-}|$.

* (b) Choose the hyperplane that is perpendicular to the segment between $x_{+}$ and $x_{-}$ and passes through its midpoint. The two open balls of radius $\frac{1}{2}|x_{+}-x_{-}|$ centered at $x_{+}$ and $x_{-}$ are completely separated by this hyperplane: measured from the midpoint, the projection of any point in the ball around $x_{+}$ onto the normal of the hyperplane (pointing toward $x_{+}$) is positive, and the projection of any point in the ball around $x_{-}$ is negative.

Thus the hyperplane can tolerate such a noise radius.

#### Exercise 8.2

* (a) 

\begin{align*}
Y\otimes (Xw + b) &= Y \otimes ( \begin{bmatrix}0 \\ -4 \\ 2.4\end{bmatrix}-0.5)\\
&= \begin{bmatrix}-1 \\ -1 \\ +1\end{bmatrix} \otimes \begin{bmatrix}-0.5 \\ -4.5 \\ 1.9\end{bmatrix}\\
&= \begin{bmatrix}0.5\\ 4.5 \\ 1.9\end{bmatrix}\\
\end{align*}

So $\rho = \min y_n(w^Tx_n +b) = 0.5$

* (b) Dividing by $\rho$ gives the rescaled weights $(b', w') = \frac{1}{\rho}(b,w) = (-1, \begin{bmatrix}2.4 \\ -6.4 \end{bmatrix})$

We compute

\begin{align*}
Y\otimes (Xw' + b') &= Y \otimes ( \begin{bmatrix}0 \\ -8 \\ 4.8\end{bmatrix}-1)\\
&= \begin{bmatrix}-1 \\ -1 \\ +1\end{bmatrix} \otimes \begin{bmatrix}-1 \\ -9 \\ 3.8\end{bmatrix}\\
&= \begin{bmatrix}1\\ 9 \\ 3.8\end{bmatrix}\\
\end{align*}

The new minimum is $\rho' = 1$ which satisfies equation (8.2)

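* (c) The two hyperplanes are the same: multiplying $(b, w)$ by the positive constant $\frac{1}{\rho}$ changes neither the set of points where $w^Tx + b = 0$ nor the sign of $w^Tx + b$ elsewhere. In the plane, both equations describe the line $[x]_2 = \frac{3}{8}[x]_1 - \frac{5}{32}$.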
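The arithmetic in (a) and (b) can be checked numerically. The sketch below assumes the data set and weights implied by the vectors above ($x_1=(0,0)$, $x_2=(2,2)$, $x_3=(2,0)$, $y=(-1,-1,+1)$, $w=(1.2,-3.2)$, $b=-0.5$); the variable names are illustrative only.

```python
import numpy as np

# Data and weights assumed from the computation above (Exercise 8.2).
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0])
w = np.array([1.2, -3.2])
b = -0.5

signed = y * (X @ w + b)              # y_n (w^T x_n + b) for each point
rho = signed.min()                    # smallest signed value

# Rescale (b, w) by 1/rho so that the smallest signed value becomes 1.
w_scaled, b_scaled = w / rho, b / rho
signed_scaled = y * (X @ w_scaled + b_scaled)

print("rho =", rho)
print("rescaled signed values:", signed_scaled)
```

Rescaling leaves the sign of every $w^Tx_n + b$ unchanged, which is why the separating line in (c) does not move.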


#### Exercise 8.3

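If $h$ is the optimal hyperplane, its $w_h$ minimizes $w^Tw$. Also, for a separating hyperplane in canonical form, the data point nearest to the hyperplane has distance $\frac{1}{|w_h|}$. Suppose $\rho_{+} \ne \rho_{-}$, say $\rho_{+} > \rho_{-}$, so that $\rho_{-} = \frac{1}{|w_h|}$. 

Let $l=\frac{\rho_{+}-\rho_{-}}{2}$ and shift the hyperplane by $l$ toward the positive class without changing its direction. The new distances are $\rho'_{+} = \rho_{+} - l = \frac{\rho_{+}+\rho_{-}}{2} > \frac{1}{|w_h|}$ and $\rho'_{-} = \rho_{-} + l = \frac{\rho_{+}+\rho_{-}}{2} > \frac{1}{|w_h|}$.

The shifted hyperplane still separates the data and has margin $\rho' = \rho'_{+} = \rho'_{-}$. Writing it in canonical form with weights $w'$, we have $\frac{1}{|w'|} = \rho' > \frac{1}{|w_h|}$, i.e. $|w'| < |w_h|$, which contradicts the fact that $w_h$ minimizes $w^Tw$. Hence $\rho_{+} = \rho_{-}$.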


#### Exercise 8.4

Write $Y=\begin{bmatrix}y^1 & y^2 & \dots & y^N\end{bmatrix}$ and $X=\begin{bmatrix}1 & x^T_1 \\ \dots & \dots \\ 1 & x^T_N\end{bmatrix}$, where the $n$-th column of $Y$ is $y^n = \begin{bmatrix}0 \\ \dots \\ y_n \\ \dots \end{bmatrix}$, whose only non-zero entry is $y_n$ in the $n$-th row (so $Y$ is the diagonal matrix of labels).

Expanding the matrix product as a sum of (column of $Y$) times (row of $X$), we get

\begin{align*}
A &=YX \\
&= \sum^N_{n=1} \begin{bmatrix}0 \\ \dots \\ y_n \\ \dots \end{bmatrix} \begin{bmatrix}1 & x^T_n\end{bmatrix}\\
&= \sum^N_{n=1} \begin{bmatrix}0 & 0^T_d \\ \dots & \dots \\ y_n & y_nx^T_n \\ \dots & \dots \\ 0 & 0^T_d \end{bmatrix}\\
&= \begin{bmatrix}y_1 & y_1x^T_1 \\ \dots & \dots \\ y_n & y_nx^T_n \\ \dots & \dots \\ y_N & y_Nx^T_N \end{bmatrix}\\
\end{align*}

This is exactly the required form: row $n$ of $A$ is $y_n\begin{bmatrix}1 & x^T_n\end{bmatrix}$.

#### Exercise 8.5

For $u = \begin{bmatrix}b \\ w\end{bmatrix}$ we have $u^TQu = w^Tw \ge 0$. Equivalently, $Q$ is diagonal with eigenvalues that are either 0 or 1, so $Q$ is positive semi-definite.

#### Exercise 8.6 TODO

#### Exercise 8.7

* (a) Assume $\rho_1 < \rho_2$ and consider the corresponding hypothesis sets $\mathcal{H}_{\rho_1}$ and $\mathcal{H}_{\rho_2}$. For a given data set $(x_1,y_1),\dots, (x_N, y_N)$ and any dichotomy on it, if the dichotomy can be implemented by a hypothesis $h_2 \in \mathcal{H}_{\rho_2}$, then there is a hypothesis $h_1 \in \mathcal{H}_{\rho_1}$ that implements it as well. 

To see this, take the $(b_2,w_2)$ of $h_2$, for which $\rho_2 = \frac{1}{|w_2|}$, and set $(b_1, w_1) =  \frac{\rho_2}{\rho_1}(b_2, w_2)$. Then $|w_1| = \frac{1}{\rho_1}$, so $h_1 = (b_1, w_1)$ belongs to $\mathcal{H}_{\rho_1}$, has margin $\rho_1$, and shares its separating hyperplane with $h_2$.

Because the margin of $h_1$ is narrower than the margin of $h_2$ ($\rho_1 < \rho_2$), every point outside the margin of $h_2$ is also outside the margin of $h_1$ and receives the same label, so $h_1$ implements the dichotomy too. The converse need not hold: points that lie inside the margin of $h_2$ but outside the margin of $h_1$ (with the proper signs on the two sides) can be labeled by $h_1$ but not by $h_2$. 

So every dichotomy that $\mathcal{H}_{\rho_2}$ can implement can also be implemented by $\mathcal{H}_{\rho_1}$, and $d_{VC}(\rho)$ is non-increasing in $\rho$.

* (b) We first prove that among any 3 points in the unit disc, there must be two that are within distance $\sqrt{3}$ of each other. If this is not true, then there exist 3 points in the unit disc whose pairwise distances are all larger than $\sqrt{3}$. 

The center of the disc can't be one of these 3 points, because the distance from the center to any point is at most 1. 
Now draw rays from the center through the 3 points and extend them to the boundary of the disc. Each pair of rays forms an angle larger than $\frac{\pi}{2}$ (otherwise the two points would be within distance $\sqrt{2}$ of each other), and for such an angle moving the points outward along their rays only increases their distance. So the segments connecting the 3 boundary points are all longer than $\sqrt{3}$.

We can compute the angle that each segment subtends at the center from the law of cosines, $\cos(\theta) = \frac{a^2+b^2-c^2}{2ab}$, where $a=b=1$, $c$ is the length of the segment and $\theta$ is the corresponding angle. We have $\cos(\theta) = 1 - \frac{c^2}{2} < -0.5$, so $\theta > \frac{2\pi}{3}$. This holds for all three angles, so they add up to more than $3\times \frac{2\pi}{3} = 2\pi$, which is impossible because the three angles between the rays sum to at most $2\pi$. 

So among any 3 points in the unit disc, there must be two that are within distance $\sqrt{3}$ of each other.

Suppose that $d_{VC}(\rho) \ge 3$ for some $\rho >\frac{\sqrt{3}}{2}$, that is, $\mathcal{H}_{\rho}$ can shatter some set of 3 points. 

Consider these 3 points in the unit disc. From what we have proved, two of them are within distance $\sqrt{3}$ of each other. Consider a dichotomy in which one of these two points is positive and the other is negative. Any hyperplane (a line) that implements this dichotomy must cross the segment between the two points, so the two distances from the points to the line sum to at most the length of the segment, which is at most $\sqrt{3}$. Hence the smaller of the two distances is at most $\frac{\sqrt{3}}{2} < \rho$ (the best case is the perpendicular bisector of the segment), meaning at least one of the two points lies inside the margin of any hyperplane that implements the dichotomy. 

This contradicts the assumption that $\mathcal{H}_{\rho}$ can shatter the 3 points. Thus $d_{VC}(\rho) < 3$ when $\rho >\frac{\sqrt{3}}{2}$.
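The geometric claim used in (b), that any three points in the unit disc contain a pair at distance at most $\sqrt{3}$, can be spot-checked by random sampling. This is only a sanity check under the stated sampling scheme, not a proof; the helper names are made up for this sketch.

```python
import math
import random

def random_point_in_unit_disc(rng):
    """Rejection-sample a point uniformly from the unit disc."""
    while True:
        x, y = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            return (x, y)

def min_pairwise_distance(points):
    """Smallest Euclidean distance between any two of the given points."""
    return min(
        math.hypot(p[0] - q[0], p[1] - q[1])
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )

rng = random.Random(0)
# Largest "closest pair" distance seen over many random triples.
worst = max(
    min_pairwise_distance([random_point_in_unit_disc(rng) for _ in range(3)])
    for _ in range(20000)
)

# Extremal case: an equilateral triangle inscribed in the unit circle.
angles = (math.pi / 2, 7 * math.pi / 6, 11 * math.pi / 6)
triangle = [(math.cos(a), math.sin(a)) for a in angles]

print(worst, min_pairwise_distance(triangle), math.sqrt(3))
```

The inscribed equilateral triangle attains the bound, which is why the threshold on $\rho$ is $\frac{\sqrt{3}}{2}$.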

#### Exercise 8.8

* (a) In Figure 8.5, there are total 23 points with 5 support vectors, so the bound is $E_{CV}(SVM) \le \frac{5}{23}$

* (b) If one of the four support vectors in a gray box is removed, the classifier does not change: it still has the maximum margin among all candidates.

* (c) The bound is now $\frac{4}{23}$

#### Exercise 8.9

* (a) Since $u_0$ is optimal for (8.10), it is feasible, so $c-a^Tu_0 \le 0$. For $\alpha \ge 0$ the product $\alpha (c-a^Tu_0)$ is never positive, so $\max_{\alpha \ge 0} \alpha (c-a^Tu_0) = 0$, attained at $\alpha = 0$.

* (b) We now prove that $u_1$ is feasible for (8.10). Assume to the contrary that $c-a^Tu_1 > 0$. Then $\max_{\alpha \ge 0} \alpha (c-a^Tu_1) = \infty$ and the objective in (8.11) is infinite at $u_1$. However, at $u=u_0$ the objective in (8.11) is finite because $\max_{\alpha \ge 0} \alpha (c-a^Tu_0) = 0$. This contradicts the assumption that $u_1$ is optimal for (8.11). So $c-a^Tu_1 \le 0$ and $u_1$ is feasible for (8.10).

* (c) From (a) and (b), both $u_0$ and $u_1$ are feasible: $c-a^Tu_0 \le 0$ and $c-a^Tu_1 \le 0$. Then $\max_{\alpha \ge 0} \alpha (c-a^Tu_0) = \max_{\alpha \ge 0} \alpha (c-a^Tu_1) = 0$. 

So at any feasible $u$ the objective in (8.11) reduces to $\frac{1}{2}u^TQu + p^Tu$, and $u_1$ minimizes it over the feasible set. By definition, $u_0$ minimizes the same objective over the same set in (8.10). So we must have $\frac{1}{2}u^T_0Qu_0 + p^Tu_0 = \frac{1}{2}u^T_1Qu_1 + p^Tu_1$.

* (d) Let $u^*$ be any optimal solution for (8.11). By (b) and (a), $\max_{\alpha \ge 0} \alpha (c-a^Tu^*) = 0$. If the maximum is attained at $\alpha^*$, then $\alpha^* (c-a^Tu^*) = 0$, so either $c-a^Tu^*= 0$ or $\alpha^* = 0$.

#### Exercise 8.10

\begin{align*}
\frac{\partial{\mathcal{L}}}{\partial{u}_1} &= 2u_1 - \alpha_1 -\alpha_2 = 0 \\
\frac{\partial{\mathcal{L}}}{\partial{u}_2} &= 2u_2 - 2\alpha_1 -\alpha_3 = 0 \\
\end{align*}

Setting the derivatives to 0 gives $u_1 = \frac{\alpha_1 + \alpha_2}{2}, u_2 = \frac{2\alpha_1 + \alpha_3}{2}$. Substituting these into the Lagrangian:

\begin{align*}
\mathcal{L}(u,\alpha) &= u^2_1 + u^2_2 + \alpha_1(2-u_1-2u_2) -\alpha_2u_1 -\alpha_3u_2 \\
&=  \frac{(\alpha_1 + \alpha_2)^2}{4} + \frac{(2\alpha_1 + \alpha_3)^2}{4} + \alpha_1 ( 2- \frac{\alpha_1 + \alpha_2}{2} - (2\alpha_1 + \alpha_3)) - \alpha_2 \frac{\alpha_1 + \alpha_2}{2} - \alpha_3 \frac{2\alpha_1 + \alpha_3}{2} \\
&= \frac{1}{4}\alpha^2_1 + \frac{1}{2}\alpha_1\alpha_2 + \frac{1}{4}\alpha^2_2 + \alpha^2_1 + \alpha_1\alpha_3 +  \frac{1}{4}\alpha^2_3 + 2\alpha_1 -  \frac{1}{2}\alpha^2_1 -  \frac{1}{2}\alpha_1\alpha_2 - 2\alpha^2_1 - \alpha_1\alpha_3 -  \frac{1}{2}\alpha_1\alpha_2 -  \frac{1}{2}\alpha^2_2 - \alpha_1\alpha_3 - \frac{1}{2}\alpha^2_3\\
&= -\frac{5}{4}\alpha^2_1 - \frac{1}{4}\alpha^2_2 - \frac{1}{4}\alpha^2_3 - \frac{1}{2}\alpha_1\alpha_2 - \alpha_1\alpha_3 + 2\alpha_1\\
\end{align*}
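The algebra can be cross-checked numerically: for random $\alpha \ge 0$, plugging the stationary $u_1, u_2$ into the Lagrangian should reproduce the closed form of the dual. The function names below are illustrative, and the closed form used is $-\frac{5}{4}\alpha_1^2 - \frac{1}{4}\alpha_2^2 - \frac{1}{4}\alpha_3^2 - \frac{1}{2}\alpha_1\alpha_2 - \alpha_1\alpha_3 + 2\alpha_1$.

```python
import random

def lagrangian(u1, u2, a1, a2, a3):
    """L(u, alpha) = u1^2 + u2^2 + a1 (2 - u1 - 2 u2) - a2 u1 - a3 u2."""
    return u1 ** 2 + u2 ** 2 + a1 * (2 - u1 - 2 * u2) - a2 * u1 - a3 * u2

def dual(a1, a2, a3):
    """Closed form obtained after eliminating u1 and u2."""
    return (-1.25 * a1 ** 2 - 0.25 * a2 ** 2 - 0.25 * a3 ** 2
            - 0.5 * a1 * a2 - a1 * a3 + 2 * a1)

rng = random.Random(0)
max_gap = 0.0
for _ in range(1000):
    a1, a2, a3 = (rng.uniform(0.0, 5.0) for _ in range(3))
    # Stationary point of the Lagrangian in u for this alpha.
    u1, u2 = (a1 + a2) / 2, (2 * a1 + a3) / 2
    gap = abs(lagrangian(u1, u2, a1, a2, a3) - dual(a1, a2, a3))
    max_gap = max(max_gap, gap)

print("largest discrepancy:", max_gap)
```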


#### Exercise 8.11

* (a) Split the equality constraint $\sum^N_{n=1}y_n\alpha_n = 0$ into the two inequalities $\sum^N_{n=1}y_n\alpha_n \ge 0$ and $\sum^N_{n=1}y_n\alpha_n \le 0$, i.e. $-\sum^N_{n=1}y_n\alpha_n \ge 0$. Combining these two conditions with $\alpha_n \ge 0$ gives $A_D\alpha \ge 0_{N+2}$ with

\begin{align*}
A_D &= \begin{bmatrix}y^T \\ -y^T \\ I_{N\times N}\end{bmatrix}\\
\end{align*}

We set $M=\begin{bmatrix}y_1x_1 & y_2x_2 & \dots & y_Nx_N\end{bmatrix}$, where $x_n, n=1,\dots, N$ are the column vectors with dimension $d\times 1$, so $M$ has a dimension of $d \times N$. 

We also have column vectors of $y=\begin{bmatrix}y_1 \\ y_2 \\ \vdots \\ y_N\end{bmatrix}$ and $\alpha = \begin{bmatrix}\alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N\end{bmatrix}$

Let $Q_D= M^TM = \begin{bmatrix}y_1y_1x^T_1x_1 & y_1y_2x^T_1x_2 & \dots & y_1y_Nx^T_1x_N\\ y_2y_1x^T_2x_1 & y_2y_2x^T_2x_2 & \dots & y_2y_Nx^T_2x_N\\ \dots & \dots & \dots & \dots\\ y_Ny_1x^T_Nx_1 & y_Ny_2x^T_Nx_2 & \dots & y_Ny_Nx^T_Nx_N\end{bmatrix}$. $Q_D$ has a dimension of $N\times N$, with $[Q_D]_{n,m} = y_ny_mx^T_nx_m$.



So we can write 

\begin{align*}
\frac{1}{2}\sum^N_{m=1}\sum^N_{n=1} y_ny_m\alpha_n \alpha_m x^T_nx_m &= \frac{1}{2}\sum^N_{m=1}y_m\alpha_m\sum^N_{n=1} y_n\alpha_n  x^T_nx_m\\
&= \frac{1}{2}\sum^N_{m=1}\alpha_m \sum^N_{n=1}\alpha_n y_nx^T_nx_my_m\\
&= \frac{1}{2}\sum^N_{m=1}\alpha_m \sum^N_{n=1}\alpha_n [Q_D]_{n,m}\\
&= \frac{1}{2}\alpha^T Q_D \alpha\\
\end{align*}

* (b) The factorization was already obtained in (a) when constructing $Q_D$: with $X_s = M^T$ (the $N\times d$ matrix whose rows are $y_nx^T_n$), we have $Q_D = M^TM = X_sX^T_s$.

For any vector $\alpha$, we have 

\begin{align*}
\alpha^TQ_D \alpha & = \alpha^T X_sX^T_s \alpha \\
&= (\alpha^TX_s)(\alpha^TX_s)^T\\
&= |\alpha^TX_s|^2 \\
&\ge 0\\
\end{align*}

So $Q_D$ is positive semi-definite.
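As a concrete illustration, the matrices $Q_D$ and $A_D$ can be assembled with NumPy and the positive semi-definiteness checked through the eigenvalues. The helper name `dual_qp_matrices` and the small data set (the three points of Exercise 8.2) are assumptions made for this sketch.

```python
import numpy as np

def dual_qp_matrices(X, y):
    """Build Q_D = X_s X_s^T and A_D = [y^T; -y^T; I] for the dual QP."""
    Xs = y[:, None] * X                       # row n is y_n x_n^T
    Q_D = Xs @ Xs.T                           # entries y_n y_m x_n^T x_m
    A_D = np.vstack([y, -y, np.eye(len(y))])  # (N + 2) x N
    return Q_D, A_D

# Assumed toy data: the three points from Exercise 8.2.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0])

Q_D, A_D = dual_qp_matrices(X, y)
eigenvalues = np.linalg.eigvalsh(Q_D)         # Q_D is symmetric

print(Q_D)
print(A_D.shape)
print(eigenvalues)
```

Up to rounding, none of the eigenvalues is negative, in line with $\alpha^TQ_D\alpha = |\alpha^TX_s|^2 \ge 0$.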

#### Exercise 8.12

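If all the data is from one class, say $y_n = +1$ for all $n$, the dual constraint $\sum^N_{n=1}y_n\alpha_n = 0$ becomes $\sum^N_{n=1}\alpha_n = 0$. Together with $\alpha_n \ge 0$ this forces $\alpha^*_n = 0$ for $n=1,\dots, N$ (the same holds when all $y_n = -1$).

* (a) $w^* = \sum^N_{n=1}y_n\alpha^*_n x_n = 0$

* (b) Since all $\alpha^*_n = 0$, there is no support vector with $\alpha^*_s > 0$, so the KKT conditions do not determine $b^*$. Instead, with $w^* = 0$ the primal constraints reduce to $y_nb \ge 1$: any $b^* \ge 1$ works if all points are positive, and any $b^* \le -1$ if all points are negative.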

#### Exercise 8.13

Consider a data set with two positive examples at $x_1 = (0,0)$ and $x_2 = (1,0)$, and one negative example at $x_3 = (0,1)$. We look for a hyperplane (a line) that separates the negative example from the positive examples. As there is only one negative example, it must be a support vector; one or both of the positive examples can be support vectors. The distance between $x_1$ and $x_3$ is 1, so no separating line can have a margin larger than $\frac{1}{2}$. The line $-2[x]_2+1 = 0$, i.e. $(b,w) = (1, [0, -2])$, attains this margin and is therefore the optimal fat hyperplane.

The optimal solution $\alpha^*$ has to satisfy $w^* = \sum^N_{n=1}y_n\alpha^*_nx_n$,

\begin{align*}
w^* &= \sum^N_{n=1}y_n\alpha^*_nx_n \\
&= y_1\alpha_1x_1 + y_2\alpha_2x_2 + y_3\alpha_3x_3 \\
&= \alpha_1\begin{bmatrix}0 \\ 0\end{bmatrix} + \alpha_2\begin{bmatrix}1 \\ 0\end{bmatrix} - \alpha_3\begin{bmatrix}0 \\ 1\end{bmatrix} \\
&= \begin{bmatrix}\alpha_2 \\ -\alpha_3\end{bmatrix}
\end{align*}

Since $w^* = \begin{bmatrix}0 \\ -2\end{bmatrix}$, we have $\alpha_2 = 0$ and $\alpha_3 = 2$; the constraint $\sum^N_{n=1}y_n\alpha_n = 0$ then gives $\alpha_1 = 2$.

On the other hand, for this hyperplane all three points lie on the boundary of the margin. 
It's easy to check that for $n=1,2,3$, we have $y_n((w^*)^Tx_n + b^*) = 1$. 

So a point $(x_n, y_n)$ on the boundary, satisfying $y_n((w^*)^Tx_n + b^*)=1$, can still have $\alpha^*_n= 0$, as $\alpha^*_2 = 0$ here.
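A quick numerical check of this counterexample, assuming the dual solution $\alpha^* = (2, 0, 2)$ that follows from $w^* = (0,-2)$ and $\sum_n y_n\alpha_n = 0$ (variable names are illustrative):

```python
import numpy as np

# The three points of the counterexample and their labels.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 1.0, -1.0])

w_star = np.array([0.0, -2.0])
b_star = 1.0
alpha = np.array([2.0, 0.0, 2.0])     # assumed dual solution

margins = y * (X @ w_star + b_star)   # y_n (w^T x_n + b) for each point
w_from_alpha = (alpha * y) @ X        # sum_n y_n alpha_n x_n

print("margins:", margins)
print("w from alpha:", w_from_alpha)
print("sum of y_n alpha_n:", alpha @ y)
```

All three margins equal 1, yet the second multiplier is zero, so lying on the margin boundary does not imply $\alpha^*_n > 0$.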

#### Exercise 8.14

Suppose we remove a data point $(x_n, y_n)$ with $\alpha^*_n = 0$, and let $\mathbf{\alpha}^*$ denote the optimal solution of the original dual problem.

* (a) Since $\mathbf{\alpha}^*$ is optimal for the original dual problem, it satisfies the constraints in (8.21), i.e. $\alpha^*_i \ge 0$ for $i=1,\dots, N$, and

$\sum^N_{i=1} y_i\alpha^*_i = \sum^N_{i\ne n} y_i\alpha^*_i = 0$ since $\alpha^*_n = 0$. The second sum is exactly the equality constraint of the problem with $(x_n, y_n)$ removed.

So $\mathbf{\alpha}^*$ (with $\alpha^*_n$ dropped) is feasible for the new dual problem.

* (b) Suppose there is another feasible solution $\alpha'$ of the new dual with a lower objective value than $\mathbf{\alpha}^*$. Construct a solution $\alpha^c$ of the original dual problem by inserting $\alpha^c_n = 0$ into $\alpha'$. Clearly $\alpha^c$ is feasible for the original dual problem.

From (8.21), the objective value of $\alpha^c$ for the original dual problem is:

\begin{align*}
V(\alpha^c) &= \frac{1}{2}\sum^N_{i=1}\sum^N_{j=1}y_iy_j\alpha^c_i\alpha^c_j x^T_ix_j - \sum^N_{i=1}\alpha^c_i \\
&= \frac{1}{2}\sum^N_{i\ne n}\sum^N_{j\ne n}y_iy_j\alpha^c_i\alpha^c_j x^T_ix_j - \sum^N_{i\ne n}\alpha^c_i\\
&< \frac{1}{2}\sum^N_{i\ne n}\sum^N_{j\ne n}y_iy_j\alpha^*_i\alpha^*_j x^T_ix_j - \sum^N_{i\ne n}\alpha^*_i\\
&= V(\alpha^*) \\
\end{align*}

This contradicts the fact that $\alpha^*$ is optimal for the original problem. So no feasible solution of the new dual problem has a lower objective value than $\mathbf{\alpha}^*$.

* (c) Hence $\mathbf{\alpha}^*$ (with $\alpha^*_n$ dropped) is optimal for the new dual problem.

* (d) Since $w^* = \sum^N_{i=1}y_i\alpha^*_ix_i = \sum^N_{i\ne n}y_i\alpha^*_ix_i$, $w^*$ is the same as in the original problem. 
Also, $b^*$ is computed from a point with $\alpha^*_s > 0$, so it is not affected by removing $(x_n, y_n)$ either. So the optimal hyperplane doesn't change. 

* (e) The final hypothesis does not change when we throw out any data point with $\alpha^*_n=0$, so leaving out such a point gives $e_n = 0$. Only the points with $\alpha^*_n > 0$ can contribute an error, which shows that $E_{CV} = \frac{1}{N}\sum^N_{n=1}e_n \le \frac{\text{number of }\alpha^*_n > 0}{N}$

#### Exercise 8.15

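* (a) Stack the two transforms into $\Phi(x) = \begin{bmatrix}\Phi_1(x) \\ \Phi_2(x)\end{bmatrix}$. Then

\begin{align*}
\Phi(x)^T\Phi(x') &= \begin{bmatrix}\Phi_1(x)^T & \Phi_2(x)^T\end{bmatrix}\begin{bmatrix}\Phi_1(x') \\ \Phi_2(x')\end{bmatrix}\\
&= \Phi_1(x)^T \Phi_1(x') + \Phi_2(x)^T\Phi_2(x') \\
&= K_1(x,x') + K_2(x,x') \\
\end{align*}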

* (b) Let $\Phi_1(x) = \begin{bmatrix} a_1 \\ \dots \\ a_N\end{bmatrix}$ and $\Phi_2(x) = \begin{bmatrix} b_1 \\ \dots \\ b_M\end{bmatrix}$, and write $a'_i, b'_i$ for the corresponding components at $x'$, so we have $K_1(x,x') = \Phi_1(x)^T \Phi_1(x') = \sum_i a_ia'_i$ and $K_2(x,x') = \Phi_2(x)^T \Phi_2(x') = \sum_i b_ib'_i$.

Take $\Phi(x)$ to be the outer product $\Phi_1(x) \Phi_2(x)^T$ flattened into a vector, $\Phi(x) = \begin{bmatrix}v_1 \\ \dots \\ v_N\end{bmatrix}$ where $v_n = \begin{bmatrix}a_nb_1 \\ \dots \\ a_nb_M\end{bmatrix}$

So we have 

$K(x,x')=\Phi(x)^T\Phi(x') = \sum_{n=1} v_n^Tv'_n = \sum_{n=1} \sum_{m=1} a_na'_nb_mb'_m = \sum_{n=1}a_na'_n \sum_{m=1}b_mb'_m = K_1(x,x')K_2(x,x')$

* (c) This shows that if $K_1$ and $K_2$ are kernels, then so are $K_1 + K_2$ and $K_1K_2$. 
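The two constructions can be tried on explicit feature maps. In the sketch below, `phi1` and `phi2` are hypothetical transforms chosen only for illustration; the stacked vector realizes $K_1 + K_2$ and the flattened outer product realizes $K_1K_2$.

```python
import numpy as np

def phi1(x):
    """Hypothetical first transform: the identity (linear kernel)."""
    return np.asarray(x, dtype=float)

def phi2(x):
    """Hypothetical second transform: (1, x_1^2, x_2^2)."""
    x = np.asarray(x, dtype=float)
    return np.array([1.0, x[0] ** 2, x[1] ** 2])

def k1(x, xp):
    return phi1(x) @ phi1(xp)

def k2(x, xp):
    return phi2(x) @ phi2(xp)

def phi_sum(x):
    """Stacked features, whose inner product is K1 + K2."""
    return np.concatenate([phi1(x), phi2(x)])

def phi_prod(x):
    """Flattened outer product, whose inner product is K1 * K2."""
    return np.outer(phi1(x), phi2(x)).ravel()

x, xp = np.array([1.0, -2.0]), np.array([0.5, 3.0])
sum_gap = abs(phi_sum(x) @ phi_sum(xp) - (k1(x, xp) + k2(x, xp)))
prod_gap = abs(phi_prod(x) @ phi_prod(xp) - k1(x, xp) * k2(x, xp))

print(sum_gap, prod_gap)
```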

#### Exercise 8.16

* (a) (8.30) minimizes w.r.t. $w, b, \xi$, so the optimization variable is $u = \begin{bmatrix} b \\ w \\ \xi \end{bmatrix}$

* (b) There are $N$ constraints $y_n(w^Tx_n +b) \ge 1 - \xi_n$, and $N$ constraints $\xi_n \ge 0$. Moving $\xi_n$ to the left-hand side, the right-hand side is $c=\begin{bmatrix}1_N \\ 0_N \end{bmatrix}$. 

On the left-hand side, $A=\begin{bmatrix}YX & I_N \\ 0_{N\times (d+1)} & I_N \end{bmatrix}$ where $YX = \begin{bmatrix}y_1 & y_1x^T_1 \\ \dots & \dots \\ y_n & y_nx^T_n \\ \dots & \dots \\ y_N & y_Nx^T_N \end{bmatrix}$ as in exercise 8.4

The $p^Tu$ term is $C\sum\xi_n$, so we have $p = \begin{bmatrix}0_{d+1} \\ C\cdot 1_N\end{bmatrix}$

Finally, the quadratic term $\frac{1}{2}u^TQu$ must equal $\frac{1}{2}w^Tw$, so only the block of $Q$ that acts on $w$ is non-zero: $Q = \begin{bmatrix}0 & 0^T_d & 0^T_N \\ 0_d & I_d & 0_{d\times N} \\ 0_N & 0_{N\times d} & 0_{N\times N}\end{bmatrix}$. 

* (c) Once we have $u^*$, indexing its entries from 0, $b^* = u^*_0$, $w^* = u^*_{1:d}$, $\xi^* = u^*_{d+1:d+N}$

* (d) Any point with $\xi_n > 0$ violates the margin. The points with $y_n(w^Tx_n +b) = 1$ are on the margin, and the points with $y_n(w^Tx_n +b) > 1$ are correctly separated and outside the margin.
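The block structure from (a) to (c) can be written down directly with NumPy. This is a minimal sketch: the function name `soft_margin_qp`, the value $C = 1$ and the toy data (the three points of Exercise 8.2) are assumptions, and no QP solver is called. It only builds $Q, p, A, c$ and checks one feasible $u$.

```python
import numpy as np

def soft_margin_qp(X, y, C):
    """Build (Q, p, A, c) for u = (b, w, xi), with constraints A u >= c."""
    N, d = X.shape
    size = d + 1 + N
    Q = np.zeros((size, size))
    Q[1:d + 1, 1:d + 1] = np.eye(d)            # only w enters the quadratic term
    p = np.concatenate([np.zeros(d + 1), C * np.ones(N)])
    YX = y[:, None] * np.hstack([np.ones((N, 1)), X])   # rows y_n [1, x_n^T]
    A = np.block([[YX, np.eye(N)],
                  [np.zeros((N, d + 1)), np.eye(N)]])
    c = np.concatenate([np.ones(N), np.zeros(N)])
    return Q, p, A, c

# Assumed toy data: the three points from Exercise 8.2.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0])
Q, p, A, c = soft_margin_qp(X, y, C=1.0)

# A feasible u: b = -1, w = (1, -1), all slacks zero.
u = np.array([-1.0, 1.0, -1.0, 0.0, 0.0, 0.0])
objective = 0.5 * u @ Q @ u + p @ u

d = X.shape[1]
b, w, xi = u[0], u[1:d + 1], u[d + 1:]         # unpacking as in part (c)
print(A @ u >= c, objective, b, w, xi)
```

Note that the Python slices `u[1:d + 1]` and `u[d + 1:]` exclude their upper end, so they select the same entries as the 0-based index ranges in (c).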

#### Exercise 8.17

With the 0/1 classification error, $E_{in} = \frac{1}{N}\sum^N_{n=1}[\text{sign}(w^Tx_n+b)\ne y_n]$ and $E_{SVM}(b,w) = \frac{1}{N}\sum^N_{n=1} \max(1-y_n(w^Tx_n+b), 0)$. Write $e_{in}$ and $e_{SVM}$ for the $n$-th terms of these two sums.

For any $n$, 
* If the point is correctly classified with no margin violation, then $y_n(w^Tx_n+b) \ge 1$ and $\max(1-y_n(w^Tx_n+b), 0) = 0$, so $e_{in} = e_{SVM} = 0$
* If the point is correctly classified but violates the margin, then $ 1 > y_n(w^Tx_n+b) > 0$, so $\max(1-y_n(w^Tx_n+b), 0) = 1-y_n(w^Tx_n+b) > 0$, and we have $0 = e_{in} < e_{SVM}$ 
* If the point is wrongly classified, then $y_n(w^Tx_n+b) \le 0$ and $\max(1-y_n(w^Tx_n+b), 0) = 1 - y_n(w^Tx_n+b) \ge 1$, so we have $1 = e_{in} \le e_{SVM}$ 

In every case $e_{in} \le e_{SVM}$, so averaging over $n$ gives $E_{in} \le E_{SVM}$
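The pointwise inequality can be checked on random signals $s = w^Tx_n + b$ and labels. The function names are illustrative, and the sample includes the boundary cases $s = 0$ and $y_ns = 1$.

```python
import random

def zero_one_error(y, s):
    """1 if sign(s) differs from the label y, else 0 (s stands for w^T x + b)."""
    sign = (s > 0) - (s < 0)
    return 1.0 if sign != y else 0.0

def hinge_error(y, s):
    """The per-point SVM error max(1 - y s, 0)."""
    return max(1.0 - y * s, 0.0)

rng = random.Random(0)
samples = [(rng.choice([-1, 1]), rng.uniform(-3.0, 3.0)) for _ in range(10000)]
samples += [(1, 0.0), (-1, 0.0), (1, 1.0), (-1, -1.0)]   # boundary cases

# Points where the 0/1 error would exceed the hinge error (expected: none).
violations = [(y, s) for y, s in samples
              if zero_one_error(y, s) > hinge_error(y, s)]
e_in = sum(zero_one_error(y, s) for y, s in samples) / len(samples)
e_svm = sum(hinge_error(y, s) for y, s in samples) / len(samples)

print(len(violations), e_in, e_svm)
```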

#### Problem 8.1

Let the two points be $(x_{+},y_{+})$ and $(x_{-}, y_{-})$, with $y_{+} = 1, y_{-} = -1$. From (8.4), we have to minimize

$E = \frac{1}{2}w^Tw $ subject to the two constraints $y_{+}(w^Tx_{+} + b) \ge 1$ and $y_{-}(w^Tx_{-} + b) \ge 1$. That is

\begin{align*}
w^Tx_{+} + b &\ge 1 \\
w^Tx_{-} + b &\le -1 \\
\end{align*}

Subtracting the second inequality from the first gives $w^T(x_{+} - x_{-}) \ge 2$. By the Cauchy-Schwarz inequality, $|w||x_{+} - x_{-}| \ge |w^T(x_{+} - x_{-})| \ge 2$, so $|w| \ge \frac{2}{|x_{+} - x_{-}|}$. The minimal objective value is thus 

$E=\frac{1}{2} \frac{4}{|x_{+} - x_{-}|^2} = \frac{2}{|x_{+} - x_{-}|^2}$

Equality requires $w$ to be parallel to $x_{+} - x_{-}$ and both constraints to be tight, so the optimal hyperplane has $w^* = \frac{2(x_{+} - x_{-})}{|x_{+} - x_{-}|^2}$ and $b^* = 1 - (w^*)^Tx_{+} = \frac{|x_{-}|^2 - |x_{+}|^2}{|x_{+} - x_{-}|^2}$. 

The margin is thus $\frac{1}{|w^*|} = \frac{|x_{+} - x_{-}|}{2}$

This agrees with the results from exercise 8.1
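The closed form for two points is easy to test numerically. The sketch below uses the vector form $w^* = \frac{2(x_{+} - x_{-})}{|x_{+} - x_{-}|^2}$, $b^* = \frac{|x_{-}|^2 - |x_{+}|^2}{|x_{+} - x_{-}|^2}$; the function name and the two sample points are arbitrary choices for illustration.

```python
import numpy as np

def two_point_svm(x_pos, x_neg):
    """Optimal (b, w) for one positive and one negative point."""
    diff = x_pos - x_neg
    sq_dist = diff @ diff
    w = 2.0 * diff / sq_dist
    b = (x_neg @ x_neg - x_pos @ x_pos) / sq_dist
    return b, w

# Arbitrary example points.
x_pos = np.array([3.0, 1.0])
x_neg = np.array([-1.0, 2.0])

b, w = two_point_svm(x_pos, x_neg)
margin = 1.0 / np.linalg.norm(w)

print("w^T x+ + b =", w @ x_pos + b)     # should equal +1
print("w^T x- + b =", w @ x_neg + b)     # should equal -1
print("margin =", margin, "half distance =", np.linalg.norm(x_pos - x_neg) / 2)
```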

#### Problem 8.2

For this data, we let $w=\begin{bmatrix}w_1 \\ w_2 \end{bmatrix}$, we have the objective value

\begin{align*}
E &= \frac{1}{2}w^Tw \\
&= \frac{1}{2}(w^2_1 + w^2_2) \\
\end{align*}

The constraints are 

\begin{align*}
y_1(w_1x_{11} + w_2x_{12} + b) &= -b \ge 1 \\
y_2(w_1x_{21} + w_2x_{22} + b) &= w_2 - b \ge 1 \\
y_3(w_1x_{31} + w_2x_{32} + b) &= -2w_1 + b \ge 1 \\
\end{align*}

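Adding the first and third inequalities gives $-2w_1 \ge 2$, i.e. $w_1 \le -1$, so $w^2_1 \ge 1$. The second inequality, $w_2 \ge 1 + b$, puts no positive lower bound on $w_2$ because the first inequality gives $1 + b \le 0$; in particular $w_2 = 0$ is allowed. 
So the objective achieves its minimum at $w_1 = -1, w_2 = 0$, where $E=\frac{1}{2}$.

With $w_1 = -1$, the first and third constraints read $b \le -1$ and $b \ge -1$, so the optimal $b= -1$.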

The margin is thus $\frac{1}{|w|} = \frac{1}{\sqrt{w^2_1+w^2_2}} = 1$
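A brute-force sanity check of this solution searches a coarse grid of $(b, w_1, w_2)$ for the feasible point with the smallest objective. The data points below are inferred from the three constraints above ($x_1=(0,0)$, $x_2=(0,-1)$, $x_3=(-2,0)$ with $y=(-1,-1,+1)$), and the grid range and step are arbitrary choices for this sketch.

```python
import numpy as np

# Data inferred from the constraints of Problem 8.2 (an assumption).
X = np.array([[0.0, 0.0], [0.0, -1.0], [-2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0])

grid = np.arange(-30, 31) / 10.0                 # -3.0, -2.9, ..., 3.0
B, W1, W2 = np.meshgrid(grid, grid, grid, indexing="ij")

# Keep grid points with y_n (w^T x_n + b) >= 1 for all n (small rounding tolerance).
feasible = np.ones(B.shape, dtype=bool)
for (x1, x2), label in zip(X, y):
    feasible &= label * (W1 * x1 + W2 * x2 + B) >= 1.0 - 1e-9

objective = np.where(feasible, 0.5 * (W1 ** 2 + W2 ** 2), np.inf)
best = np.unravel_index(np.argmin(objective), objective.shape)
b_opt = B[best]
w_opt = np.array([W1[best], W2[best]])

print("b =", b_opt, "w =", w_opt, "E =", objective[best])
```

A grid search only confirms the answer up to the grid resolution; the argument above is what establishes optimality.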


#### Problem 8.3

* (a) 
Let $\alpha = \begin{bmatrix}\alpha_1 \\ \alpha_2 \\\alpha_3 \\\alpha_4 \end{bmatrix}$. 

With $Q_D=\begin{bmatrix}0 & 0 & 0 & 0 \\ 0 & 8 & -4 & -6 \\ 0 & -4 & 4 & 6 \\ 0 & -6 & 6 & 9\end{bmatrix}$, and $A_D = \begin{bmatrix}-1 & -1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1\end{bmatrix}$,  we have

\begin{align*}
\mathcal{L}(\alpha) &=\frac{1}{2}\alpha^TQ_D\alpha - 1^T_N\alpha \\
&=\frac{1}{2}( 8\alpha^2_2 -4\alpha_2\alpha_3 -6\alpha_2\alpha_4 -4\alpha_2\alpha_3 + 4\alpha^2_3 + 6\alpha_3\alpha_4 -6\alpha_2\alpha_4 + 6\alpha_3\alpha_4+9\alpha^2_4) - (\alpha_1+\alpha_2+\alpha_3+\alpha_4)\\
&= 4\alpha^2_2+ 2\alpha^2_3+\frac{9}{2}\alpha^2_4 -4\alpha_2\alpha_3 -6\alpha_2\alpha_4  + 6\alpha_3\alpha_4 - \alpha_1-\alpha_2-\alpha_3-\alpha_4\\
\end{align*}

The constraints are: $A_D\alpha \ge 0$, i.e.

\begin{align*}
-\alpha_1-\alpha_2 + \alpha_3 + \alpha_4 &\ge 0 \\
\alpha_1+\alpha_2 - \alpha_3 - \alpha_4 &\ge 0 \\
\alpha_1 &\ge 0 \\
\alpha_2 &\ge 0 \\
\alpha_3 &\ge 0 \\
\alpha_4 &\ge 0 \\
\end{align*}

The first two inequalities combine into $\alpha_1+\alpha_2 = \alpha_3 + \alpha_4$

* (b) 

From the equality constraint, we have $\alpha_1 = \alpha_3+\alpha_4 - \alpha_2$, replace $\alpha_1$ in the $\mathcal{L}(\alpha)$, we have 

\begin{align*}
\mathcal{L}(\alpha) &= 4\alpha^2_2+ 2\alpha^2_3+\frac{9}{2}\alpha^2_4 -4\alpha_2\alpha_3 -6\alpha_2\alpha_4  + 6\alpha_3\alpha_4 - (\alpha_3+\alpha_4 - \alpha_2)-\alpha_2-\alpha_3-\alpha_4\\
&= 4\alpha^2_2+ 2\alpha^2_3+\frac{9}{2}\alpha^2_4 -4\alpha_2\alpha_3 -6\alpha_2\alpha_4  + 6\alpha_3\alpha_4 -2\alpha_3-2\alpha_4\\
\end{align*}

* (c) Fix $\alpha_3, \alpha_4 \ge 0$, take derivative of $\mathcal{L}(\alpha)$ w.r.t. $\alpha_2$ and set it to zero, we have

\begin{align*}
\frac{\partial{\mathcal{L}(\alpha)}}{\partial{\alpha_2}} &= 8\alpha_2 - 4\alpha_3 - 6\alpha_4 \\
\alpha_2 &= \frac{1}{2}\alpha_3 + \frac{3}{4}\alpha_4\\
\end{align*}

So $\alpha_1 = \alpha_3 + \alpha_4 - \alpha_2 = \frac{1}{2}\alpha_3 + \frac{1}{4}\alpha_4$

Note that both $\alpha_1 \ge 0$ and $\alpha_2 \ge 0$ if $\alpha_3,\alpha_4\ge0$.

* (d) Now we compute $\mathcal{L}(\alpha)$

\begin{align*}
\mathcal{L}(\alpha) &= 4\alpha^2_2+ 2\alpha^2_3+\frac{9}{2}\alpha^2_4 -4\alpha_2\alpha_3 -6\alpha_2\alpha_4  + 6\alpha_3\alpha_4 -2\alpha_3-2\alpha_4\\
&= 4(\frac{1}{2}\alpha_3 + \frac{3}{4}\alpha_4)^2 + 2\alpha^2_3+\frac{9}{2}\alpha^2_4 -4(\frac{1}{2}\alpha_3 + \frac{3}{4}\alpha_4)\alpha_3-6(\frac{1}{2}\alpha_3 + \frac{3}{4}\alpha_4)\alpha_4 +6\alpha_3\alpha_4-2\alpha_3  -2\alpha_4\\
&=\alpha^2_3 + 3\alpha_3\alpha_4 + \frac{9}{4}\alpha^2_4 + 2\alpha^2_3 + \frac{9}{2}\alpha^2_4 - 2\alpha_3^2 - 3\alpha_3\alpha_4 - 3\alpha_3\alpha_4 - \frac{9}{2}\alpha_4^2 + 6\alpha_3\alpha_4 -2\alpha_3-2\alpha_4\\
&= \alpha_3^2 + \frac{9}{4}\alpha^2_4 + 3\alpha_3\alpha_4 -2\alpha_3 - 2\alpha_4\\
&= \left(\alpha_3 + \frac{3\alpha_4 -2}{2}\right)^2 + \alpha_4 -1 \\
\end{align*}

Since $\alpha_4 \ge 0$ and the squared term is non-negative, $\mathcal{L}(\alpha)$ attains its minimal value of $-1$ when $\alpha_4 = 0$ and $\alpha_3 = 1$

Then we have $\alpha_2 = \frac{1}{2}\alpha_3 + \frac{3}{4}\alpha_4 = \frac{1}{2}$ and $\alpha_1 = \alpha_3+\alpha_4 - \alpha_2 = \frac{1}{2}$
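The result can be sanity-checked by evaluating the dual objective at $\alpha^* = (\frac{1}{2}, \frac{1}{2}, 1, 0)$ and at many random feasible points. The sketch reuses the matrix $Q_D$ from part (a); the sampling scheme is an arbitrary choice, and a random search is of course not a proof of optimality.

```python
import numpy as np

# Q_D as given in part (a).
Q_D = np.array([[0, 0, 0, 0],
                [0, 8, -4, -6],
                [0, -4, 4, 6],
                [0, -6, 6, 9]], dtype=float)

def dual_objective(alpha):
    """L(alpha) = 1/2 alpha^T Q_D alpha - 1^T alpha."""
    return 0.5 * alpha @ Q_D @ alpha - alpha.sum()

alpha_star = np.array([0.5, 0.5, 1.0, 0.0])
best_value = dual_objective(alpha_star)

rng = np.random.default_rng(0)
lowest_sampled = np.inf
for _ in range(5000):
    a3, a4 = rng.uniform(0.0, 3.0, size=2)
    a2 = rng.uniform(0.0, a3 + a4)       # keeps alpha_1 = a3 + a4 - a2 >= 0
    alpha = np.array([a3 + a4 - a2, a2, a3, a4])
    lowest_sampled = min(lowest_sampled, dual_objective(alpha))

print(best_value, lowest_sampled)
```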

#### Problem 8.4

Using data from exercise 8.2, we have $N=3$, let $\alpha = \begin{bmatrix}\alpha_1 \\ \alpha_2 \\ \alpha_3 \end{bmatrix}$, then we have constraints:

$\alpha_1 \ge 0, \alpha_2 \ge 0, \alpha_3 \ge 0$ and $\sum^N y_n\alpha_n = -\alpha_1 - \alpha_2 + \alpha_3 = 0$

Let $X=\begin{bmatrix}0 & 0 \\ 2 & 2 \\ 2 & 0 \end{bmatrix} = \begin{bmatrix}x^T_1 \\ x^T_2 \\ x^T_3 \end{bmatrix}$, then we have 

$M = XX^T = \begin{bmatrix}0 & 0 & 0 \\ 0 & 8 & 4 \\ 0 & 4 & 4\end{bmatrix}$, which has $M_{ij} = x^T_ix_j$

The objective value is 

\begin{align*}
\mathcal{L}(\alpha) &= \frac{1}{2} \sum^N_{m=1}\sum^N_{n=1} y_ny_m\alpha_n\alpha_m x^T_nx_m - \sum^N \alpha_n \\
&= \frac{1}{2} \sum^N_{m=1}y_m\alpha_m\sum^N_{n=1} y_n\alpha_n x^T_nx_m - (\alpha_1+\alpha_2+\alpha_3) \\
&= \frac{1}{2} \sum^N_{m=2}y_m\alpha_m\sum^N_{n=2} y_n\alpha_n x^T_nx_m - (\alpha_1+\alpha_2+\alpha_3) \\
&= \frac{1}{2} (-\alpha_2(-8\alpha_2 + 4\alpha_3) + \alpha_3(-4\alpha_2 + 4\alpha_3)) - (\alpha_1+\alpha_2+\alpha_3) \\
&= 4\alpha_2^2 - 2\alpha_2\alpha_3 -2\alpha_2\alpha_3 + 2\alpha_3^2 - \alpha_1 - \alpha_2 - \alpha_3\\
&= 4\alpha_2^2 - 4\alpha_2\alpha_3 + 2\alpha_3^2 - \alpha_1 - \alpha_2 - \alpha_3\\
\end{align*}

Substituting $\alpha_1 = \alpha_3 - \alpha_2$ into the formula, we have

$\mathcal{L}(\alpha) = 4\alpha_2^2 - 4\alpha_2\alpha_3 + 2\alpha_3^2 - 2\alpha_3 = (2\alpha_2 - \alpha_3)^2 + (\alpha_3 -1)^2 -1$

Both squares vanish at $\alpha_3 = 1, \alpha_2 = \frac{1}{2}$, which gives the minimum, so $\alpha_1 = \alpha_3 - \alpha_2 = \frac{1}{2}$ (all three values are non-negative, so this solution is feasible).

We compute the optimal weights $w^* = \sum^N y_n\alpha_nx_n = \begin{bmatrix}1 \\ -1 \end{bmatrix}$. Since all $\alpha_n > 0$, any point can be used to find $b$: e.g. for $\alpha_1 = \frac{1}{2}$ we need $y_1((w^*)^Tx_1 + b) = 1$, i.e. $-b= 1$, so $b=-1$.

This is consistent with the results from example 8.2
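Recovering $(b^*, w^*)$ from the dual solution can be written in a few lines of NumPy. The sketch uses the data matrix $X$ and labels from this problem; computing $b$ from every support vector and checking that the values agree is an extra consistency check added here.

```python
import numpy as np

# Data of Exercise 8.2 and the dual solution derived above.
X = np.array([[0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([-1.0, -1.0, 1.0])
alpha = np.array([0.5, 0.5, 1.0])

w = (alpha * y) @ X                  # w* = sum_n y_n alpha_n x_n

# For a support vector, y_s (w^T x_s + b) = 1 and y_s = +/-1, so b = y_s - w^T x_s.
support = alpha > 0
b_candidates = y[support] - X[support] @ w
b = b_candidates.mean()

print("w* =", w)
print("b from each support vector:", b_candidates)
```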

#### Problem 8.5 TODO

#### Problem 8.6

\begin{align*}
L &= \sum^N_{n=1}|x_n-\mu|^2 \\
&= \sum^N_{n=1}(x_n-\mu)^T(x_n-\mu) \\
&= \sum^N_{n=1}\left(x_n^Tx_n-x_n^T\mu - \mu^Tx_n+\mu^T\mu\right) \\
\end{align*}

Taking the derivative w.r.t. $\mu$ and setting it to 0, we have

\begin{align*}
\frac{\partial{L}}{\partial{\mu}} &= \sum^N_{n=1} (2\mu - 2x_n) \\
&= 2N\mu - 2 \sum^N_{n=1} x_n \\
&= 0 \\
\end{align*}

So we have $\mu =  \frac{1}{N}\sum^N_{n=1} x_n $
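A small numerical illustration: for random points, the sum of squared distances at the sample mean should not exceed its value at any perturbed center, and the gradient $2N\mu - 2\sum_n x_n$ should vanish at the mean. The point cloud and perturbation scale are arbitrary choices for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(50, 3))            # arbitrary sample of 50 points in R^3

def sum_sq_dist(mu):
    """L(mu) = sum_n |x_n - mu|^2."""
    return np.sum((points - mu) ** 2)

mean = points.mean(axis=0)
at_mean = sum_sq_dist(mean)

# Evaluate L at randomly perturbed centers around the mean.
perturbed = [sum_sq_dist(mean + rng.normal(scale=0.5, size=3)) for _ in range(200)]

# Gradient 2 N mu - 2 sum_n x_n evaluated at the mean.
gradient_at_mean = 2 * len(points) * mean - 2 * points.sum(axis=0)

print(at_mean, min(perturbed), gradient_at_mean)
```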

#### Problem 8.7

Let $X=\begin{bmatrix}x_1 & \dots & x_N \end{bmatrix}$, where $x_i$, $i=1,\dots, N$ are the column vectors. Let $y=\begin{bmatrix}y_1 \\ \dots \\ y_N\end{bmatrix}$.

* (a) 

\begin{align*}
|\sum^N_{n=1}y_nx_n|^2 &= |Xy|^2 \\
&= (Xy)^T(Xy) \\
&= y^TX^TXy \\
&= y^TMy \\
&= y^T\begin{bmatrix}\sum^N_{m=1}M_{1m}y_m \\ \dots \\ \sum^N_{m=1}M_{Nm}y_m\end{bmatrix} \\
&= \sum^N_{n=1} y_n\sum^N_{m=1}M_{nm}y_m \\
&= \sum^N_{n=1} \sum^N_{m=1}y_ny_mM_{nm} \\
&= \sum^N_{n=1} \sum^N_{m=1}y_ny_mx^T_nx_m \\
\end{align*}

Where $M=X^TX$ and $M_{nm} = x^T_nx_m$.

* (b) When $n=m$, we have $y_ny_m = y_n^2 = 1$.
When $n\ne m$, we have $P(y_m=1) = \frac{1}{2}$ because there are $\frac{N}{2}$ positive points and $\frac{N}{2}$ negative points. We also have $P(y_n = 1|y_m=1) = \frac{\frac{N}{2}-1}{N-1}$ because, given $y_m=1$, point $n$ must be one of the remaining $\frac{N}{2}-1$ positive points among the remaining $N-1$ points. By symmetry, $P(y_n = -1|y_m=-1)$ has the same value.

\begin{align*}
P(y_ny_m = 1) &= P(y_ny_m = 1|y_m=1)P(y_m=1) + P(y_ny_m = 1|y_m=-1)P(y_m=-1)\\
&= P(y_n = 1|y_m=1)P(y_m=1) + P(y_n = -1|y_m=-1)P(y_m=-1)\\
&= P(y_n = 1|y_m=1)\frac{1}{2} + P(y_n = -1|y_m=-1)\frac{1}{2}\\
&= \frac{\frac{N}{2}-1}{N-1}\frac{1}{2} + \frac{\frac{N}{2}-1}{N-1}\frac{1}{2}\\
&= \frac{\frac{N}{2}-1}{N-1}\\
\end{align*}

We thus have $E[y_ny_m] = 1$ when $m=n$, because then $P[y_ny_m=1] = 1$. 

When $n\ne m$, 

\begin{align*}
E[y_ny_m] &= (+1)\,P(y_ny_m =1) + (-1)\,P(y_ny_m =-1)\\
&= \frac{\frac{N}{2}-1}{N-1} - \frac{\frac{N}{2}}{N-1}\\
&= -\frac{1}{N-1}\\
\end{align*}

* (c) 

\begin{align*}
E\left[\|\sum^N_{n=1}y_nx_n\|^2\right] &= E\left[\sum^N_{n=1} \sum^N_{m=1}y_ny_mx^T_nx_m\right]\\
&= \sum^N_{n=1} \sum^N_{m=1}E[y_ny_m]x^T_nx_m\\
&= \sum^N_{n=1}\left[ \sum^N_{m\ne n}E[y_ny_m]x^T_nx_m + E[y_ny_n]x^T_nx_n\right]\\
&= \sum^N_{n=1}\left[x^T_nx_n - \sum^N_{m\ne n}\frac{1}{N-1}x^T_nx_m\right]\\
&= \sum^N_{n=1}\left[x^T_nx_n -\frac{1}{N-1} \sum^N_{m\ne n}x^T_nx_m\right]\\
&= \frac{N}{N-1}\sum^N_{n=1}\left[\frac{N-1}{N}x^T_nx_n - \frac{1}{N}\sum^N_{m\ne n}x^T_nx_m\right]\\
&= \frac{N}{N-1}\sum^N_{n=1}\left[x^T_nx_n - \frac{1}{N}\sum^N_{m=1}x^T_nx_m\right]\\
&= \frac{N}{N-1}\sum^N_{n=1}\left[x^T_nx_n - x^T_n\frac{1}{N}\sum^N_{m=1}x_m\right]\\
&= \frac{N}{N-1}\sum^N_{n=1}\left[x^T_nx_n - x^T_n\bar{x}\right]\\
&= \frac{N}{N-1}\sum^N_{n=1}\left[x^T_nx_n - x^T_n\bar{x}-(\bar{x}^Tx_n - \bar{x}^T\bar{x})\right]\\
&= \frac{N}{N-1}\sum^N_{n=1}(x_n - \bar{x})^T(x_n - \bar{x})\\
&= \frac{N}{N-1}\sum^N_{n=1}\|x_n - \bar{x}\|^2\\
\end{align*}

where in the third-to-last line we have used the fact that $\sum^N_{n=1}\bar{x}^Tx_n = \sum^N_{n=1}\bar{x}^T\bar{x}$

* (d) From problem 8.6, $\sum^N_{n=1}\|x_n - \mu\|^2 $ attains its minimal value at $\mu = \frac{1}{N}\sum^N_{n=1}x_n = \bar{x}$. Comparing this minimum with the value at $\mu=0$, we have 

$\sum^N_{n=1}\|x_n - \bar{x}\|^2 \le \sum^N_{n=1}\|x_n\|^2 \le NR^2$ by the assumption.

* (e) Substituting the result of (d) into the equation of (c), we have

\begin{align*}
E\left[\|\sum^N_{n=1}y_nx_n\|^2\right] &\le \frac{N}{N-1}\sum^N_{n=1} \|x_n\|^2 \\
&\le \frac{N}{N-1}NR^2 \\ 
&= \frac{N^2R^2}{N-1}\\ 
\end{align*}

Let $Z=\|\sum^N_{n=1}y_nx_n\|^2$ and $t=\frac{N^2R^2}{N-1}$, so that $\sqrt{t} = \frac{NR}{\sqrt{N-1}}$. We claim that $P[\sqrt{Z} \le \sqrt{t}]  = P[\|\sum^N_{n=1}y_nx_n\| \le \frac{NR}{\sqrt{N-1}}] > 0$. Otherwise, if $P[\sqrt{Z} \le \sqrt{t}]  = 0$, then $P[Z > t]  = 1$, and we have (since $Z\ge 0$ and $t>0$)

\begin{align*}
E[Z] &= \int_0^{\infty}ZP(dZ)\\
&= \int_t^{\infty}ZP(dZ)\\
&> t\int_t^{\infty}P(dZ)\\
&= t\\
\end{align*}

This contradicts our conclusion that $E[Z] \le t$

So $P[\|\sum^N_{n=1}y_nx_n\| \le \frac{NR}{\sqrt{N-1}}] > 0$
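For a small $N$, the expectation in (c) and the existence claim in (e) can be verified exactly by enumerating every balanced labeling. The sketch below assumes $N = 6$ random points in the plane and takes $R$ to be the largest norm among them; both choices are arbitrary.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
N = 6
points = rng.uniform(-1.0, 1.0, size=(N, 2))
R = np.linalg.norm(points, axis=1).max()      # every |x_n| <= R

# |sum_n y_n x_n|^2 for every labeling with N/2 positive and N/2 negative points.
sq_norms = []
for positives in itertools.combinations(range(N), N // 2):
    y = -np.ones(N)
    y[list(positives)] = 1.0
    sq_norms.append(np.sum((y @ points) ** 2))
sq_norms = np.array(sq_norms)

expected = sq_norms.mean()                    # exact expectation over labelings
centered = points - points.mean(axis=0)
formula = N / (N - 1) * np.sum(centered ** 2) # N/(N-1) sum_n |x_n - x_bar|^2

bound = N * R / np.sqrt(N - 1)
smallest_norm = np.sqrt(sq_norms.min())

print(expected, formula)
print(smallest_norm, bound)
```

Because the average of $Z$ over the labelings is at most $t$, at least one labeling must have $\sqrt{Z} \le \sqrt{t}$, which is what the comparison of `smallest_norm` with `bound` illustrates.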