In [2]:
# Add lib into sys.path
import os
import sys
import time

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from scipy.optimize import minimize
import math
from sklearn.preprocessing import normalize
from functools import partial
import h5py
from scipy.spatial import distance

nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

from matplotlib.colors import ListedColormap
import libs.linear_models as lm
import libs.data_util as data
import libs.nn as nn
import libs.plot as myplot

%matplotlib inline

#### Exercise 9.1

$x_g = (47,35)$, $x_b = (22, 40)$, $x_u = (21, 36)$

With income measured in thousands, $x_u$ is closer to Mr. Bad than to Mr. Good. So the BoL should NOT give him credit.

If the income is measured in dollars, Mr. Unknown is closer to Mr. Good, so the BoL should give him credit.

In [5]:
# The distance between $x_u$ and the two points
print('--- Income measured in K')
xg = np.array([47, 35])
xb = np.array([22, 40])
xu = np.array([21, 36])

d_ug = np.linalg.norm(xu-xg)
d_ub = np.linalg.norm(xu-xb)
print(f"--- Distance from unknown to Mr. Good: {d_ug}")
print(f"--- Distance from unknown to Mr. Bad: {d_ub}")

print('--- Income measured in dollars')
# Income measured in dollars
xg = np.array([47, 35000])
xb = np.array([22, 40000])
xu = np.array([21, 36000])

d_ug = np.linalg.norm(xu-xg)
d_ub = np.linalg.norm(xu-xb)
print(f"--- Distance from unknown to Mr. Good: {d_ug}")
print(f"--- Distance from unknown to Mr. Bad: {d_ub}")

--- Income measured in K
--- Distance from unknown to Mr. Good: 26.019223662515376
--- Distance from unknown to Mr. Bad: 4.123105625617661
--- Income measured in dollars
--- Distance from unknown to Mr. Good: 1000.3379428972991
--- Distance from unknown to Mr. Bad: 4000.000124999998


#### Exercise 9.2

\begin{align*}
Z &= \gamma X \\
&= (I - \frac{1}{N}1 1^T)X \\
&= X - \frac{1}{N} 1 1^T X\\
&= X - 1\frac{1}{N} \begin{bmatrix}\sum^N_{i=1} x_{i1} & \sum^N_{i=1} x_{i2} & \dots & \sum^N_{i=1} x_{id}\end{bmatrix}  \\
&= X - 1\begin{bmatrix}\bar{x}_1 & \bar{x}_2 & \dots & \bar{x}_d\end{bmatrix}\\
&=  X - 1\bar{x}^T \\
\end{align*}
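The identity can be checked numerically. This is a minimal sketch; the helper name `center` and the reuse of the three points from Exercise 9.1 as data are illustrative assumptions.

```python
import numpy as np

def center(X):
    """Apply the centering operator (I - 11^T / N) to the data matrix X."""
    N = X.shape[0]
    ones = np.ones((N, 1))
    gamma = np.eye(N) - ones @ ones.T / N  # N x N centering matrix
    return gamma @ X

# Illustrative data: the three (age, income in K) points from Exercise 9.1
X = np.array([[47.0, 35.0], [22.0, 40.0], [21.0, 36.0]])
Z = center(X)

# Z should equal X with the column means subtracted from every row
print(np.allclose(Z, X - X.mean(axis=0)))
```

Building the $N\times N$ matrix is only for illustration; in practice subtracting `X.mean(axis=0)` is cheaper.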

#### Exercise 9.3

\begin{align*}
Z &= \begin{bmatrix}z^T_1 \\ \dots \\ z^T_N \end{bmatrix}\\
&= \begin{bmatrix}(Dx_1)^T \\ \dots \\ (Dx_N)^T\end{bmatrix}\\
&= \begin{bmatrix}x_1^TD^T \\ \dots \\ x_N^TD^T \end{bmatrix}\\
&= \begin{bmatrix}x_1^TD \\ \dots \\ x_N^TD \end{bmatrix}\\
&= XD \\
\end{align*}

\begin{align*}
Z^TZ &= (XD)^TXD \\
&= D^TX^TXD\\
&= DX^TXD\\
\end{align*}
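A quick numerical check of both identities. This is a sketch on random data, assuming $D$ is the diagonal matrix with entries $1/\sigma_i$ used for input normalization; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
X = X - X.mean(axis=0)                  # centered data

sigma = np.sqrt(np.mean(X**2, axis=0))  # per-coordinate standard deviation
D = np.diag(1.0 / sigma)                # diagonal, so D^T = D

Z = X @ D                               # matrix form Z = XD
Z_rows = np.array([D @ x for x in X])   # row by row: z_n = D x_n

print(np.allclose(Z, Z_rows))
print(np.allclose(Z.T @ Z, D @ X.T @ X @ D))
```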

#### Exercise 9.4

* (a) $\text{variance}(x_1) = \text{variance}(\hat{x}_1) = 1$, $\text{variance}(x_2) = \text{variance}(\sqrt{1-\epsilon^2}\hat{x}_1+\epsilon\hat{x}_2) = (1-\epsilon^2)\text{variance}(\hat{x}_1)+ \epsilon^2 \text{variance}(\hat{x}_2) = 1$

$\text{covariance}(x_1,x_2) = E[(x_1 - \bar{x}_1)(x_2 - \bar{x}_2)] = E[x_1x_2] = E[\sqrt{1-\epsilon^2}\hat{x}^2_1 + \epsilon\hat{x}_1\hat{x}_2] = \sqrt{1-\epsilon^2}$

* (b) 

\begin{align*}
f(x) &= w_1x_1 + w_2x_2 \\
&= w_1\hat{x}_1 + w_2 (\sqrt{1-\epsilon^2}\hat{x}_1 + \epsilon \hat{x}_2) \\
&= (w_1 + w_2 \sqrt{1-\epsilon^2})\hat{x}_1 + w_2\epsilon \hat{x}_2 \\
&= \hat{w}_1\hat{x}_1 + \hat{w}_2 \hat{x}_2 \\
\end{align*}

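So if we set $\hat{w}_1 = w_1 + w_2 \sqrt{1-\epsilon^2}, \hat{w}_2 = w_2\epsilon$, the two forms agree. For $\epsilon \ne 0$ this map is invertible, $w_2 = \frac{\hat{w}_2}{\epsilon}, w_1 = \hat{w}_1 - \frac{\sqrt{1-\epsilon^2}}{\epsilon}\hat{w}_2$, so any $f$ that is linear in $\hat{x}_1,\hat{x}_2$ is also linear in $x_1,x_2$.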

* (c) From problem (b), we have $\hat{w}_1 = \hat{w}_2 = 1$, so we have $w_1 = \frac{\epsilon - \sqrt{1-\epsilon^2}}{\epsilon}, w_2 = \frac{1}{\epsilon}$, so that $C \ge w^2_1 + w^2_2 = 2\frac{1-\epsilon\sqrt{1-\epsilon^2}}{\epsilon^2}$

* (d) As $\epsilon \to 0$, we have the minimum $C \to \infty $. It means that we need an arbitrarily large $C$ (very weak regularization) to be able to implement the target function, which is not feasible with any fixed budget.

* (e) If there is significant noise in the data and the inputs are correlated, the large $C$ needed leaves the learning effectively unregularized, and overfitting is likely. So the var term can be high while the bias can be low.
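The covariance in (a) and the budget in (c) can be checked with a small simulation. This is a sketch that assumes $\hat{x}_1,\hat{x}_2$ are independent standard normal variables (the exercise only needs zero mean and unit variance); the function names are hypothetical.

```python
import numpy as np

def correlated_inputs(n, eps, seed=0):
    """Sample x1 = xh1, x2 = sqrt(1 - eps^2) * xh1 + eps * xh2."""
    rng = np.random.default_rng(seed)
    x_hat = rng.standard_normal((n, 2))  # assumed: independent N(0, 1)
    x1 = x_hat[:, 0]
    x2 = np.sqrt(1 - eps**2) * x_hat[:, 0] + eps * x_hat[:, 1]
    return np.column_stack([x1, x2])

def min_budget(eps):
    """Smallest C with w1^2 + w2^2 <= C when w_hat1 = w_hat2 = 1."""
    w2 = 1 / eps
    w1 = 1 - np.sqrt(1 - eps**2) / eps
    return w1**2 + w2**2

eps = 0.1
X = correlated_inputs(200_000, eps)
cov = np.cov(X, rowvar=False)
print(cov)                      # diagonal near 1, off-diagonal near sqrt(1 - eps^2)
for e in (0.5, 0.1, 0.01):
    print(e, min_budget(e))     # grows quickly as eps shrinks
```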

#### Exercise 9.5

We compute the distances between $x_1,x_2 $ with $x_{test}$, we have

$d^2_1 = |x_1 - x_{test}|^2 = \sum^d(a_i + 1)^2$ and $d^2_2 = |x_2 - x_{test}|^2 = 4 + \sum^d(b_i + 1)^2$.

Suppose there are $k$ $+1$s in $a_i$ and $l$ $+1$s in $b_i$, then we have

$d^2_1 = \sum^d(a_i + 1)^2 = 4k$ and $d^2_2 = 4 + 4l = 4(l+1)$.

To correctly classify $x_{test}$ we want to have $d_1 < d_2$, which indicates that $k < l+1$, i.e. $k\le l$.

So we need to compute the probability that the number of $+1$s in $a$ is less than or equal to the number of $+1$s in $b$.
Both $a$ and $b$ have $d$ elements, so by symmetry $P(k > l) = P(k < l)$, i.e. the probability of $a$ having more $+1$s than $b$ is equal to the probability of $a$ having fewer $+1$s than $b$. So we have

$2P(k>l) + P(k=l) = 1$, thus $P(k\le l) = P(k < l) + P(k=l) = \frac{1}{2}(1 + P(k=l))$

We only have to compute the probability $P(k=l)$. 

For a given $k$, since $a$ and $b$ are independent, $P\left[(a \text{ has } k\; +1)\cap (b \text{ has } k\; + 1)\right] = \frac{d \choose k}{2^d}\frac{d \choose k}{2^d} = \frac{{d \choose k}^2}{2^{2d}}$

So we have 

\begin{align*}
P(k=l) &= \sum^d_{k=0}P\left[(a \text{ has } k\; +1)\cap (b \text{ has } k\; + 1)\right]\\
&= \sum^d_{k=0}\frac{{d \choose k}^2}{2^{2d}}\\
&= \frac{1}{2^{2d}}\sum^d_{k=0} {d \choose k}^2\\
&= \frac{1}{2^{2d}}{2d \choose d}\\
&= \frac{1}{2^{2d}}\frac{(2d)!}{d!d!}\\
&\text{apply Stirling's approximation } n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n\\
&\approx \frac{1}{2^{2d}}\frac{\sqrt{2\pi 2d}\left(\frac{2d}{e}\right)^{2d}}{\sqrt{2\pi d}\left(\frac{d}{e}\right)^d \sqrt{2\pi d}\left(\frac{d}{e}\right)^d}\\
&= \frac{1}{\sqrt{\pi d}}\\
\end{align*}

So $P(k\le l) = \frac{1}{2}(1 + P(k=l)) = \frac{1}{2} + \frac{1}{2}\frac{1}{\sqrt{\pi d}} = \frac{1}{2} + O(\frac{1}{\sqrt{d}})$

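This is the probability of classifying $x_{test}$ correctly with two data points. 

If there's a third data point $x_3$ (of the same form as $x_2$, with $m$ $+1$s among its random components), then to correctly classify $x_{test}$ we need both $d_1 < d_2$ and $d_1 < d_3$, i.e. $k \le l$ and $k \le m$. These two events share $k$, so they are not independent and their probabilities do not simply multiply. Since $k, l, m$ are i.i.d., each of them is the smallest with probability $\frac{1}{3}$ up to ties, and a tie has probability $O(\frac{1}{\sqrt{d}})$, so $P = \frac{1}{3} + O(\frac{1}{\sqrt{d}})$.

The probability of correctly classifying $x_{test}$ drops from about $\frac{1}{2}$ to about $\frac{1}{3}$.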
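The probabilities can also be computed exactly from the binomial distribution, which is a useful check on the asymptotics. This is a sketch with hypothetical function names; the three-point case assumes the third point has the same form as $x_2$ and conditions on $k$, since the events $k \le l$ and $k \le m$ share $k$.

```python
from math import comb, pi, sqrt

def pmf(d):
    """P(exactly k of d fair +/-1 coordinates are +1), for k = 0..d."""
    return [comb(d, k) / 2**d for k in range(d + 1)]

def p_correct_two_points(d):
    """P(k <= l) for independent k, l ~ Binomial(d, 1/2)."""
    p = pmf(d)
    tail = [sum(p[k:]) for k in range(d + 1)]   # tail[k] = P(l >= k)
    return sum(p[k] * tail[k] for k in range(d + 1))

def p_correct_three_points(d):
    """P(k <= l and k <= m) for independent k, l, m ~ Binomial(d, 1/2)."""
    p = pmf(d)
    tail = [sum(p[k:]) for k in range(d + 1)]
    # given k, the events l >= k and m >= k are independent
    return sum(p[k] * tail[k] ** 2 for k in range(d + 1))

for d in (1, 10, 100, 400):
    approx = 0.5 + 0.5 / sqrt(pi * d)
    print(d, p_correct_two_points(d), approx, p_correct_three_points(d))
```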

#### Exercise 9.6

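* (a) A large offset doesn't change the directions along which the data vary about their mean; it only moves the origin. But PCA computed from the SVD of $X$ (equivalently from $\frac{1}{N}X^TX$) measures spread about the origin, so with a large offset the top 'natural axis' is pulled toward the direction of the mean instead. We should perform input centering before PCA.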

* (b) If one dimension is inflated, this will increase the variance along that dimension and make PCA choose 'natural axes' along the inflated dimension. We should perform input normalization before doing PCA.

* (c) If we do input whitening, the 'natural axes' for the inputs can be any orthogonal axes. Rotation to any angle won't reveal that one dimension is more important than another because they are all equalized.

We shouldn't perform input whitening before doing PCA, because that will equalize all axes for the inputs. PCA won't be able to find axes that have larger variance than the others.
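The effect of an offset in (a) depends on how PCA is computed. The directions of spread about the mean do not move, but the top right singular vector of an uncentered $X$ does. A minimal sketch on synthetic data (the data, the offset and the helper `top_direction` are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.array([1.0, -1.0]) / np.sqrt(2)   # long axis of the cloud
w = np.array([1.0, 1.0]) / np.sqrt(2)    # short axis of the cloud

# Synthetic cloud: large spread along u, small along w, big offset along w
t = rng.standard_normal(500)
noise = 0.1 * rng.standard_normal(500)
X = np.outer(t, u) + np.outer(noise, w) + np.array([100.0, 100.0])

def top_direction(X):
    """Top right singular vector of X."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0]

v_raw = top_direction(X)                        # dominated by the offset
v_centered = top_direction(X - X.mean(axis=0))  # follows the long axis

print(abs(v_raw @ w), abs(v_centered @ u))
```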

#### Exercise 9.7

* (a) 

\begin{align*}
z &= \begin{bmatrix}x^Tv_1 \\ \dots \\ x^Tv_k \end{bmatrix}\\
&= \begin{bmatrix}v_1^Tx \\ \dots \\ v_k^Tx \end{bmatrix}\\
&= \begin{bmatrix}v_1^T \\ \dots \\ v_k^T \end{bmatrix}x \\
&= V^Tx\\
\end{align*}

The dimension of $V$ is $d \times k$, its columns are the basis vectors $v_1, \dots, v_k$

* (b) 

\begin{align*}
Z &= \begin{bmatrix} z^T_1\\ \dots \\ z^T_N \end{bmatrix}\\
&= \begin{bmatrix} (V^Tx_1)^T\\ \dots \\ (V^Tx_N)^T \end{bmatrix}\\
&= \begin{bmatrix} x^T_1V \\ \dots \\ x^T_NV \end{bmatrix}\\
&= \begin{bmatrix} x^T_1 \\ \dots \\ x^T_N \end{bmatrix}V\\
&= XV\\
\end{align*}

* (c) We consider $z=V^Tx$ where $z$ has $d$ components, so $V$ is $d\times d$, $V=\begin{bmatrix}v_1 & \dots & v_d \end{bmatrix}$. Since the $v_i$ are orthonormal, we have $V^TV = I$, so $V^T = V^{-1}$

\begin{align*}
\sum^d_{i=1} z^2_i &= \|z\|^2 \\
&= z^Tz \\
&=  (V^Tx)^T(V^Tx) \\
&= x^TVV^Tx \\
&= x^TVV^{-1}x \\
&= x^Tx \\
&= \sum^d_{i=1} x^2_i \\
\end{align*}

If we choose $k\le d$ components for $z$, we only drop non-negative terms $z^2_i$ from the sum, so we have $\|z\| \le \|x\| $
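These relations are easy to verify numerically. This is a sketch that builds an orthonormal basis from the QR decomposition of a random matrix, an illustrative choice that is not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 2

# Orthonormal basis v_1..v_d as the columns of V_full (illustrative construction)
V_full, _ = np.linalg.qr(rng.standard_normal((d, d)))
X = rng.standard_normal((8, d))

Z_full = X @ V_full          # Z = XV with all d coordinates
Z_k = X @ V_full[:, :k]      # Z = XV with only the first k basis vectors

# Full rotation preserves the norm; keeping k < d coordinates can only shrink it
print(np.allclose(np.linalg.norm(Z_full, axis=1), np.linalg.norm(X, axis=1)))
print(np.all(np.linalg.norm(Z_k, axis=1) <= np.linalg.norm(X, axis=1) + 1e-12))
```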

#### Exercise 9.8

$U^TX = U^TU\Gamma V^T = \Gamma V^T$ and $XV = U\Gamma V^T V = U\Gamma$. Transposing the first identity, we have 

$X^TU = V\Gamma$, so column by column $X^Tu_i = \gamma_i v_i$ and $Xv_i = \gamma_i u_i$
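A numerical check using `np.linalg.svd` on a random matrix (the data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((6, 3))

# Thin SVD: U is 6x3, gamma holds the singular values, Vt is V^T
U, gamma, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

left = X.T @ U    # column i should be gamma_i * v_i
right = X @ V     # column i should be gamma_i * u_i

# Multiplying by the 1-D array gamma scales each column by its singular value
print(np.allclose(left, V * gamma), np.allclose(right, U * gamma))
```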

#### Exercise 9.9

Let $A$ be a $m\times n$ matrix, and $A=\begin{bmatrix} a_1 & \dots & a_n \end{bmatrix}$, where $a_i$ are column vectors of dimension $m\times 1$

* (a) $\|A\|^2_F = \sum_m \|\text{row}_m(A)\|^2 = \sum_n \|\text{column}_n(A)\|^2$

$AA^T = \begin{bmatrix} a_1 & \dots & a_n \end{bmatrix}\begin{bmatrix} a^T_1 \\ \dots \\ a^T_n \end{bmatrix} = \sum^n_{i=1} a_i a^T_i$

So $\text{trace}(AA^T) = \text{trace}(\sum^n_{i=1} a_i a^T_i) =\sum^n_{i=1} \text{trace}(a_i a^T_i) = \sum^n_{i=1} \sum^m_{j=1}a^2_{ij} = \sum^n_{i=1} \|a_i\|^2 = \|A\|^2_F$


Also $A^TA = \begin{bmatrix}a^T_1a_1 & a^T_1a_2 & \dots & a^T_1a_n \\ a^T_2a_1 & a^T_2a_2 & \dots & a^T_2a_n \\ \dots & \dots & \dots & \dots \\ a^T_na_1 & a^T_na_2 & \dots & a^T_na_n \\ \end{bmatrix}$, and $\text{trace}(A^TA) = \sum^n_{i=1} a^T_i a_i = \sum^n_{i=1} \|a_i\|^2 = \|A\|^2_F$

* (b) Apply the result from problem (a) 

\begin{align*}
\|UAV^T\|^2_F &= \text{trace}((UAV^T)^T(UAV^T)) \\
&= \text{trace}(VA^TU^TUAV^T) \\
&= \text{trace}(VA^TAV^T) \\
&= \text{trace}((AV^T)^T(AV^T)) \\
&\text{apply } \text{trace}(XX^T) = \text{trace}(X^TX), X = AV^T \\
&= \text{trace}((AV^T)(AV^T)^T) \\
&= \text{trace}((AV^T)(VA^T)) \\
&= \text{trace}(AA^T) \\
&= \|A\|^2_F\\
\end{align*}
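Both parts can be checked numerically. This is a sketch with random matrices; the orthogonal $U$ and $V$ are taken from QR decompositions, which is just a convenient way to generate them.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))

# Random orthogonal U (4x4) and V (3x3), generated via QR for illustration
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((3, 3)))

fro_sq = np.sum(A**2)                          # squared Frobenius norm
print(np.isclose(fro_sq, np.trace(A @ A.T)))   # part (a)
print(np.isclose(fro_sq, np.trace(A.T @ A)))   # part (a)
print(np.isclose(np.linalg.norm(U @ A @ V.T, 'fro')**2, fro_sq))  # part (b)
```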

#### Exercise 9.10

If all the singular values of $X$ are distinct, then the eigenvalues of $\Sigma$ are all distinct and positive or zero. 

* (a) Let $X$ have dimension $N\times d$, with SVD $X=U\Gamma V^T$ where $U$ is $N\times d$ and $V$ is $d\times d$, so we have $U^TU=I_d$ and $V^TV=VV^T=I_d$. Also let $V=\begin{bmatrix}v_1 & \dots & v_d\end{bmatrix}$, where the $v_i$ are $d\times 1 $ column vectors forming an orthonormal basis. $\Gamma$ is the diagonal matrix of the singular values of $X$, ordered by construction, i.e. $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0 $. 

Consider any unit direction $v=\sum^d_{i=1}q_iv_i$, so that $\|v\|^2 = \sum^d_{i=1}q^2_i = 1$.

\begin{align*}
\text{var}(z_1,\dots,z_N) &= v^T\Sigma v \\
&= v^T \frac{1}{N}X^TX v \\
&= \frac{1}{N}v^T (U\Gamma V^T)^T (U\Gamma V^T) v \\
&= \frac{1}{N}v^T V\Gamma U^TU\Gamma V^T v \\
&= \frac{1}{N}v^T V\Gamma^2  V^T v \\
&= \frac{1}{N}v^T V\Gamma^2 \begin{bmatrix}v^T_1 \\ \dots \\ v^T_d\end{bmatrix}  v \\
&= \frac{1}{N}v^T V\begin{bmatrix}\lambda^2_1 & \dots & 0 \\ 0 & \lambda^2_2 & \dots \\ \dots & \dots & \dots \\ \dots & 0 & \lambda^2_d \end{bmatrix} \begin{bmatrix}v^T_1v \\ \dots \\ v^T_dv\end{bmatrix} \\
&= \frac{1}{N}v^T \begin{bmatrix}v_1 & \dots & v_d\end{bmatrix} \begin{bmatrix}\lambda^2_1v^T_1v \\ \lambda^2_2v^T_2v \\ \dots \\  \lambda^2_dv^T_dv \end{bmatrix}\\
&= \frac{1}{N} \begin{bmatrix}v^Tv_1 & \dots & v^Tv_d\end{bmatrix} \begin{bmatrix}\lambda^2_1v^T_1v \\ \lambda^2_2v^T_2v \\ \dots \\  \lambda^2_dv^T_dv \end{bmatrix}\\
&= \frac{1}{N} \sum^d_{i=1}v^Tv_i \lambda^2_iv^T_iv \\
&= \frac{1}{N} \sum^d_{i=1}\lambda^2_i v^Tv_i v^T_iv \\
&= \frac{1}{N} \sum^d_{i=1}\lambda^2_i \|v^Tv_i\|^2 \\
&\le \frac{1}{N} \sum^d_{i=1}\lambda^2_1 \|v^Tv_i\|^2 \\
&= \frac{\lambda^2_1}{N} \sum^d_{i=1} \|v^Tv_i\|^2 \\
&= \frac{\lambda^2_1}{N} \sum^d_{i=1} q^2_i \\
&= \frac{\lambda^2_1}{N} \\
&= \frac{1}{N} v^T_1v_1  \lambda^2_1 v^T_1v_1\\
&= \frac{1}{N} \sum^d_{i=1} v^T_1v_i  \lambda^2_i v^T_iv_1\\
&= v^T_1\Sigma v_1\\
\end{align*}

So the variance is highest when the principal direction is $v_1$ , the top right singular vector of $X$.

* (b) Following the proof in problem (a), the top-1 principal direction is $v_1$. If we next select a direction $v$ that is orthogonal to $v_1$, then $q_1 = 0$ and $\text{var}(z_1,\dots,z_N) = v^T\Sigma v = \frac{1}{N} \sum^d_{i=2}\lambda^2_i \|v^Tv_i\|^2 \le \frac{\lambda^2_2}{N}$, with equality at $v = v_2$, so the direction with the highest variance is $v_2$. Continuing this way, we see that the top-k principal directions are $v_1, \dots, v_k$. 


* (c) If we don't have the data matrix $X$ but know $\Sigma$, we can use $\Sigma = \frac{1}{N}X^TX = \frac{1}{N}V\Gamma^2  V^T$.

Since $\Sigma$ is symmetric, this is its eigen-decomposition, so the principal directions are the top-k eigenvectors of $\Sigma$.
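The claims in (a) and (c) can be illustrated numerically. This is a sketch on synthetic centered data; the scaling of the columns and the number of random test directions are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
# Synthetic data with clearly different spreads along the three coordinates
X = rng.standard_normal((200, 3)) * np.array([3.0, 1.0, 0.3])
X = X - X.mean(axis=0)
N = X.shape[0]
Sigma = X.T @ X / N

_, s, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                       # top right singular vector

def variance_along(v):
    """Variance of the projections of the data onto the unit vector v."""
    return v @ Sigma @ v

# (a) no random unit direction has larger variance than v1
dirs = rng.standard_normal((1000, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
best_random = max(variance_along(v) for v in dirs)
print(best_random, variance_along(v1), s[0]**2 / N)

# (c) with Sigma only: eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
top_eigvec = eigvecs[:, -1]
print(abs(top_eigvec @ v1))      # 1 up to sign and rounding
```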

#### Exercise 9.11

$x=\begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $z=\begin{bmatrix}1 \\ 1 \\ 2\end{bmatrix}$. The reconstructed test point in $\mathcal{Z}$ space using top-1 PCA is: 

$\hat{z} = z^Tv_1v_1 = 2v_1 = \begin{bmatrix}0 \\ 0 \\ 2\end{bmatrix}$. 
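Taking $\Phi(x) = (x_1, x_2, x_1 + x_2)$, which matches $z$ above, the third component of any image of $\Phi$ must equal the sum of the first two. For $\hat{z}$ we have $0 + 0 \ne 2$, so there is no $\hat{x}$ in $\mathcal{X}$ space with $\Phi(\hat{x}) = \hat{z}$.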
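The computation can be reproduced in a few lines. This sketch assumes $\Phi(x) = (x_1, x_2, x_1+x_2)$, $v_1 = (0,0,1)$ and zero mean in $\mathcal{Z}$ space, which is what the numbers above imply; the least-squares step is an added illustration that $\hat{z}$ lies outside the image of $\Phi$, not part of the exercise.

```python
import numpy as np

def phi(x):
    """Assumed feature transform: (x1, x2) -> (x1, x2, x1 + x2)."""
    return np.array([x[0], x[1], x[0] + x[1]])

v1 = np.array([0.0, 0.0, 1.0])   # assumed top-1 principal direction
x = np.array([1.0, 1.0])

z = phi(x)                       # [1, 1, 2]
z_hat = (z @ v1) * v1            # top-1 reconstruction: [0, 0, 2]

# Phi is linear, phi(x) = A x, so least squares finds the closest image point
A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_ls, *_ = np.linalg.lstsq(A, z_hat, rcond=None)
gap = np.linalg.norm(A @ x_ls - z_hat)

print(z, z_hat, x_ls, gap)       # gap > 0: no exact pre-image exists
```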

#### Exercise 9.12

* (a) The VC dimension of each of $\mathcal{H}_{+},  \mathcal{H}_{-}$ is 1. The VC dimension of $\mathcal{H}$ is 2.

* (b) I would pick $\mathcal{H}_{+}$, because if $x$ is the income, then the credit should increase with income, and $w >0$ makes sense here. 

#### Exercise 9.13

If the hybrid strategy stopped at $\mathcal{H}_{+}$, we can't use the VC bound for $\mathcal{H}_{+}$, as we implicitly assumed a bigger hypothesis set $\mathcal{H}$ when choosing between the two. We should use the VC bound of $\mathcal{H}$ instead.

#### Exercise 9.14

* (a) $sign(\cdot)$ is a monotonically non-decreasing function of its argument, so for $h(x)$ to be monotonically increasing in $x$, for $x \ge z$, we want to have 

$w_0 + w_1 x_1 + w_2 x_2 + w_3 x^2_1 + w_4 x^2_2 + w_5 x_1x_2 \ge w_0 + w_1z_1 + w_2z_2 + w_3z^2_1 + w_4z^2_2 + w_5z_1z_2$ 

i.e. $w_1(x_1-z_1) + w_2(x_2 - z_2) + w_3(x^2_1 - z^2_1) + w_4(x^2_2 - z^2_2) + w_5 (x_1x_2 - z_1z_2) \ge 0$

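So we want $w_1,w_2,w_3,w_4,w_5 \ge 0$ for $h(x)$ to be monotonically increasing in $x$. This is sufficient when the inputs are non-negative, so that $x^2_i \ge z^2_i$ and $x_1x_2 \ge z_1z_2$.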

* (b) For $h(x)$ to be invariant under an arbitrary rotation of $x$, it needs to depend only on $\|x\| = \sqrt{x^2_1 + x^2_2}$

So we constrain $w_1=w_2 = w_5 = 0$ and $w_3 = w_4 = w$, thus $h(x) = sign(w_0 + w\|x\|^2)$

* (c) Since the 2D quadratic function is less than 0 in the middle of the bowl, if we want the positive set to be convex, we want to have $w_3 < 0, w_4 < 0$, such that the bowl is now upside down, and the middle part is larger than 0. The set is convex and enclosed by the points where $h(x) = 0$.


#### Exercise 9.15

Using the data points allows the algorithm more freedom in selecting the hypothesis, and thus a better fit of the data. On the other hand, the virtual examples push the algorithm to select hypotheses that satisfy the constraints, so it doesn't matter much if the hypothesis set contains hypotheses that violate the hint. They will be much less likely to be selected as the final hypothesis. 

If we want to strictly enforce the constraint, we can add more hint examples, so they can help avoid bad hypotheses.

#### Exercise 9.16

* Rotational invariance: $E_{hint}(h) = \frac{1}{N}\sum^N_{n=1}\left(h(x_n) - h(x'_n)\right)^2 1\left[h(x_n) \ne h(x'_n)\right]$

* Convexity hint: $E_{hint}(h) = \frac{1}{N}\sum^N_{n=1}\left(h(x_n) - h(x'_n)\right)^2 1\left[h(\eta x_n + (1-\eta)x'_n) > \eta h(x_n) + (1-\eta)h(x'_n)\right]$, where $0 \le \eta \le 1$


* Perturbation hint:  $E_{hint}(h) = \frac{1}{N}\sum^N_{n=1}\left(h(x_n) - g(x_n)\right)^2 1\left[|h(x_n) - g(x_n) | > \delta h(x_n) \right]$, where $g(x)$ is the known function that the target function is close to.

#### Exercise 9.17 TODO



#### Exercise 9.18 TODO

#### Problem 9.1 TODO

#### Problem 9.2 TODO



#### Problem 9.3 

\begin{align*}
\Sigma &= U\Gamma U^T \\
\Sigma &= U \Gamma^{\frac{1}{2}} \Gamma^{\frac{1}{2}} U^T \\
&= U\Gamma^{\frac{1}{2}}U^TU\Gamma^{\frac{1}{2}}U^T\\
&= (U\Gamma^{\frac{1}{2}}U^T)^2\\
&= (\Sigma^{\frac{1}{2}})^2\\
\end{align*}

So $\Sigma^{\frac{1}{2}} = U\Gamma^{\frac{1}{2}}U^T$

\begin{align*}
I &= UU^T \\ 
&= U\Gamma^{\frac{1}{2}}\Gamma^{-\frac{1}{2}}U^T\\
&= U\Gamma^{\frac{1}{2}}U^T U\Gamma^{-\frac{1}{2}}U^T \\
&= \Sigma^{\frac{1}{2}} \Sigma^{-\frac{1}{2}}\\
\end{align*}

So $\Sigma^{-\frac{1}{2}} = U\Gamma^{-\frac{1}{2}}U^T$

$\Gamma^{\frac{1}{2}}$ and $\Gamma^{-\frac{1}{2}}$ are both diagonal matrices: the former has the square roots $\sqrt{\lambda_i}$ of the eigenvalues of $\Sigma$ on its diagonal, and the latter has $\frac{1}{\sqrt{\lambda_i}}$.
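The two formulas can be checked with an eigen-decomposition. This is a sketch that assumes $\Sigma$ is positive definite; the random $\Sigma$ below is only an example.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((50, 3))
Sigma = B.T @ B / 50             # symmetric, positive definite for this sample

# Sigma = U diag(eigvals) U^T with orthonormal columns in U
eigvals, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(eigvals)) @ U.T
Sigma_neg_half = U @ np.diag(1 / np.sqrt(eigvals)) @ U.T

print(np.allclose(Sigma_half @ Sigma_half, Sigma))
print(np.allclose(Sigma_half @ Sigma_neg_half, np.eye(3)))
# Whitening check: Sigma^{-1/2} Sigma Sigma^{-1/2} = I
print(np.allclose(Sigma_neg_half @ Sigma @ Sigma_neg_half, np.eye(3)))
```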

#### Problem 9.4

Multiply both sides of $A=V\psi$ by $V^T$ on the left, and note that $VV^T = V^TV = I$

\begin{align*}
A &= V\psi \\
\psi &= V^T A \\
\end{align*}

Then, using $A^TA = AA^T = I$, we have $\psi^T\psi = A^TV V^T A = A^TA = I$ and $\psi \psi^T = V^T A A^T V = V^TV = I$,
so $\psi$ is an orthogonal matrix. 
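A numerical check with two random orthonormal bases (generated via QR, an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(6)
# Two random orthonormal bases, stored as the columns of A and V
A, _ = np.linalg.qr(rng.standard_normal((4, 4)))
V, _ = np.linalg.qr(rng.standard_normal((4, 4)))

psi = V.T @ A                    # change of basis: A = V psi

print(np.allclose(V @ psi, A))
print(np.allclose(psi.T @ psi, np.eye(4)), np.allclose(psi @ psi.T, np.eye(4)))
```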

#### Problem 9.5

