# Add the parent directory (which contains libs/) to sys.path
import os
import sys
import time

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from scipy.optimize import minimize
import math
from sklearn.preprocessing import normalize
from functools import partial
import h5py
from scipy.spatial import distance

nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

from matplotlib.colors import ListedColormap
import libs.linear_models as lm
import libs.data_util as data
import libs.nn as nn
import libs.plot as myplot

%matplotlib inline

#### Exercise 7.1

We consider the regions where $f$ is '+'. The top '+' region gives $\bar{h}_1h_2h_3$, the bottom-left '+' region gives $h_1h_2\bar{h}_3$, and the bottom-right '+' region gives $h_1\bar{h}_2h_3$. If any one of these terms is True, then $f$ is True.

So $f=\bar{h}_1h_2h_3 + h_1h_2\bar{h}_3 + h_1\bar{h}_2h_3$

#### Exercise 7.2

* (a) Skipped for the graph. The Boolean $OR(x_1,\dots,x_M)$ is similar to the $OR(x_1,x_2)$, except the weight for bias is $M-0.5$, all the weights for $x_i$ are still 1. For $AND(x_1, \dots, x_M)$, the weight for bias is $-(M-0.5)$, while other weights are all 1.
* (b) Skipped. The weights are $w_i$ for each $x$. 
* (c) Skipped. $\bar{x}_2$ will take the input from $x_2$, and $x_1, \bar{x}_2, x_3$ form the ordinary $OR$ operation, the weight on bias is 2.5

#### Exercise 7.3

It's straightforward to verify that the graph is consistent with $f(x)$.

#### Exercise 7.4

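If we have $h_1(x)=\text{sign}(w^T_1x)$, $h_2(x)=\text{sign}(w^T_2x)$, $h_3(x)=\text{sign}(w^T_3x)$, then each three-input AND is $\text{sign}(\cdot - 2.5)$ and the final three-input OR is $\text{sign}(\cdot + 2.5)$, so we have 
$f= \text{sign}(\text{sign}(-h_1+h_2+h_3-2.5) + \text{sign}(h_1-h_2+h_3-2.5) + \text{sign}(h_1+h_2-h_3-2.5) + 2.5)$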

#### Exercise 7.5
Follow the hint that for large enough $\alpha$ we have $\text{sign}(s)\approx \tanh(\alpha s)$. Given $w_1$ and $\epsilon >0$, choose $w_2 = \alpha w_1$, so that $\tanh(w^T_2x_n) = \tanh(\alpha w^T_1x_n)$. Assuming no data point lies exactly on the boundary, let $m = \min_n |w^T_1x_n| > 0$. Then $|\text{sign}(w^T_1x_n) - \tanh(w^T_2x_n)| = 1 - \tanh(\alpha |w^T_1x_n|) \le 1 - \tanh(\alpha m)$ for every $n$, and this is at most $\epsilon$ (for $0 < \epsilon < 1$) once $\alpha \ge \tanh^{-1}(1-\epsilon)/m$.
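
The choice $w_2 = \alpha w_1$ can be checked numerically. The following is a minimal sketch under stated assumptions: the data `X`, the weights `w1` and the helper name `alpha_for_tolerance` are all made up for illustration, and it assumes no point lies exactly on the boundary $w_1^Tx = 0$.

```python
import numpy as np

def alpha_for_tolerance(w1, X, eps):
    """Return alpha such that 1 - tanh(alpha * m) <= eps, with m = min_n |w1^T x_n|."""
    m = np.abs(X @ w1).min()
    if m == 0:
        raise ValueError("some x_n lies exactly on the boundary w1^T x = 0")
    return np.arctanh(1.0 - eps) / m

rng = np.random.default_rng(0)
# Hypothetical data set: 20 points with a leading bias coordinate.
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
w1 = np.array([0.5, -1.0, 2.0])   # hypothetical perceptron weights
eps = 1e-3

alpha = alpha_for_tolerance(w1, X, eps)
w2 = alpha * w1
# Largest gap between the hard threshold and the soft threshold over the data.
max_gap = np.max(np.abs(np.sign(X @ w1) - np.tanh(X @ w2)))
print(alpha, max_gap)
```

The gap is largest at the point closest to the boundary, which is why the required $\alpha$ scales like $1/m$.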

#### Exercise 7.6

On each layer $l$, we compute $s^{(l)} = \left(W^{(l)}\right)^Tx^{(l-1)}$, this takes $d^{(l)}(d^{(l-1)}+1)$ multiplications and additions. Then we compute $x^{(l)} = \theta(s^{(l)})$, which takes $d^{(l)}$ $\theta$-evaluations. Add them up from $l=1$ to $L$, we have a total of $O(Q)$ multiplications and additions, and $O(V)$ $\theta$-evaluations.

#### Exercise 7.7

$\tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}$, we have $\frac{d \tanh(x)}{dx} = \frac{4}{(e^x+e^{-x})^2} = 1 - \tanh^2(x)$ , so we have

\begin{align*}
\nabla E_{in}(w) &= \nabla \frac{1}{N}\sum^N_{n=1}\left(\tanh(w^Tx_n)-y_n\right)^2 \\
&= \frac{1}{N}\sum^N_{n=1}\nabla\left(\tanh(w^Tx_n)-y_n\right)^2 \\
&= \frac{1}{N}\sum^N_{n=1}2\left(\tanh(w^Tx_n)-y_n\right)\nabla_w \tanh(w^Tx_n)  \\
&= \frac{1}{N}\sum^N_{n=1}2\left(\tanh(w^Tx_n)-y_n\right)\left(1 - \tanh^2(w^Tx_n)\right)x_n  \\
&= \frac{2}{N}\sum^N_{n=1}\left(\tanh(w^Tx_n)-y_n\right)\left(1 - \tanh^2(w^Tx_n)\right)x_n  \\
\end{align*}

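If $\|w\| \to \infty$ (for example, $w$ scaled up along a fixed direction with $w^Tx_n \ne 0$), then $\tanh(w^Tx_n) \to \pm 1$, so $1-\tanh^2(w^Tx_n) \to 0$ and $\nabla E_{in}(w)\to 0$. With a vanishing gradient, gradient descent barely moves the weights, which makes it hard to improve the perceptron solution when starting from large weights.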
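
As a sanity check on the gradient formula, here is a minimal sketch that compares it with a central finite difference and then scales $w$ up to watch the gradient vanish. The four data points, the labels and the weights are hypothetical values chosen only for illustration.

```python
import numpy as np

def e_in(w, X, y):
    """Squared error of the tanh output."""
    return np.mean((np.tanh(X @ w) - y) ** 2)

def grad_e_in(w, X, y):
    """Analytic gradient: (2/N) * sum (tanh - y) * (1 - tanh^2) * x_n."""
    t = np.tanh(X @ w)
    return (2.0 / len(y)) * X.T @ ((t - y) * (1.0 - t ** 2))

def numeric_grad(f, w, h=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

# Hypothetical data (first column is the bias coordinate) and weights.
X = np.array([[1.0, 1.0, 2.0], [1.0, -1.0, 0.5], [1.0, 2.0, -1.0], [1.0, -2.0, -2.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])
w = np.array([0.1, 1.0, 1.0])

analytic = grad_e_in(w, X, y)
numeric = numeric_grad(lambda v: e_in(v, X, y), w)
# Gradient norm as the same direction w is scaled up.
grad_norms = [np.linalg.norm(grad_e_in(c * w, X, y)) for c in (1, 10, 100)]
print(analytic, numeric, grad_norms)
```

Two of the four points are misclassified by `w`, yet the gradient still collapses as the scale grows, which is the difficulty described above.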

#### Exercise 7.8

The weight matrices are the same as in example 7.1. For the data point $x=2$, $y=1$, we have $x^{(0)} = \begin{bmatrix}1 \\ 2 \end{bmatrix}$, $s^{(1)}$ is the same as before, $s^{(1)} = \begin{bmatrix}0.7 \\ 1 \end{bmatrix}$. 

So $x^{(1)} = \begin{bmatrix}1 \\ 0.7 \\ 1 \end{bmatrix}$ (with the identity transformation, $x^{(1)}$ is $s^{(1)}$ with a leading 1), and $s^{(2)} = \begin{bmatrix} -2.1 \end{bmatrix}$, 

$x^{(2)} = \begin{bmatrix}1 \\ -2.1 \end{bmatrix}$

$s^{(3)} = \begin{bmatrix} -3.2 \end{bmatrix}$

$x^{(3)} = -3.2$


Apply backpropagation and note that $\theta'(s^{(l)}) = 1$ to compute 

$\delta^{(3)} = 2(x^{(3)}-y) = -8.4$

$\delta^{(2)} = \theta'(s^{(2)}) \otimes \left[W^{(3)}\delta^{(3)}\right] = -16.8$

$\delta^{(1)} = \begin{bmatrix}-16.8 \\ 50.4 \end{bmatrix}$

$\frac{\partial{e}}{\partial{W^{(1)}}} = x^{(0)}(\delta^{(1)})^T = \begin{bmatrix}-16.8 & 50.4 \\ -33.6 & 100.8\end{bmatrix}$

$\frac{\partial{e}}{\partial{W^{(2)}}} = x^{(1)}(\delta^{(2)})^T = \begin{bmatrix}-16.8 \\ -11.76 \\ -16.8\end{bmatrix}$

$\frac{\partial{e}}{\partial{W^{(3)}}} = x^{(2)}(\delta^{(3)})^T = \begin{bmatrix}-8.4 \\ 17.64 \end{bmatrix}$
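
The forward and backward passes above can be replayed in a few lines. This is a sketch, not the book's code: the weight matrices below are what I take Example 7.1 to use (an assumption; they do reproduce the $s^{(1)}$, $s^{(2)}$ and $s^{(3)}$ quoted above), and the function name `forward_backward_identity` is made up.

```python
import numpy as np

# Assumed weights of Example 7.1 (rows include the bias weight first).
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
W2 = np.array([[0.2], [1.0], [-3.0]])
W3 = np.array([[1.0], [2.0]])
weights = [W1, W2, W3]

def forward_backward_identity(x, y, weights):
    """Forward pass and backpropagation with the identity transformation."""
    xs = [np.concatenate(([1.0], np.atleast_1d(x)))]
    for l, W in enumerate(weights):
        s = W.T @ xs[-1]
        is_output = (l == len(weights) - 1)
        xs.append(s if is_output else np.concatenate(([1.0], s)))
    delta = 2.0 * (xs[-1] - y)            # theta'(s) = 1 for the identity
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(xs[l], delta)  # x^(l-1) (delta^(l))^T
        if l > 0:
            delta = (weights[l] @ delta)[1:]  # drop the bias component
    return xs, grads

xs, grads = forward_backward_identity(2.0, 1.0, weights)
for g in grads:
    print(g)
```

With these assumed weights the output is $x^{(3)} = -3.2$, and the last-layer gradient has a negative first entry and a positive second entry, since $-2.1 \times -8.4 = 17.64$.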


#### Exercise 7.9 

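If we initialize all weights to 0 in a network with at least one hidden layer, every signal $s^{(l)}$ is zero, so (with $\tanh$) every hidden output is $\tanh(0)=0$. During backpropagation each hidden-layer sensitivity is $\delta^{(l)} = \theta'(s^{(l)}) \otimes \left[W^{(l+1)}\delta^{(l+1)}\right] = 0$ because $W^{(l+1)}=0$. So the gradient $G^{(l)}(x_n) = x^{(l-1)}(\delta^{(l)})^T$ vanishes for every weight except the output bias, where the constant input 1 meets a non-zero $\delta^{(L)}$. The hidden-layer weights are never updated through the iterations, so the network can learn nothing beyond a constant output.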

#### Exercise 7.10

For layers $l=1$ to $L$, the weight matrix $W^{(l)}$ has dimension $(d^{(l-1)}+1) \times d^{(l)}$, so the total number of weight parameters is 

$Q = \sum^L_{l=1}d^{(l)}(d^{(l-1)}+1)$

For $L=3$ and $d^{(1)}=d^{(2)}=10$, assuming $d^{(0)}=d$ (the dimension of the input $x$) and $d^{(3)}=1$, we get $Q = 10(d+1) + 10\cdot 11 + 1\cdot 11 = 131 + 10d$
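
The count can be written as a one-line helper. This is a small sketch; the function name `count_weights` and the list layout `[d0, d1, ..., dL]` are my own conventions.

```python
def count_weights(layer_dims):
    """layer_dims = [d0, d1, ..., dL]; returns Q = sum_l d_l * (d_{l-1} + 1)."""
    return sum(d_out * (d_in + 1)
               for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:]))

# Two hidden layers of 10 units and a single output, for a few input dimensions d.
for d in (1, 2, 16, 256):
    print(d, count_weights([d, 10, 10, 1]))
```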

#### Exercise 7.11


Taking the derivative of the second term of $E_{aug}(w,\lambda)$ w.r.t. $w^{(l)}_{ij}$ gives $\frac{\lambda}{N}\frac{2w^{(l)}_{ij}}{\left(1+(w^{(l)}_{ij})^2\right)^2}$

This proves the equation.

We use the ratio of this gradient term to the weight itself to measure the rate of decay.

Dividing the derivative of the second term by the weight $w^{(l)}_{ij}$ gives $\frac{2\lambda}{N}\frac{1}{\left(1+(w^{(l)}_{ij})^2\right)^2}$, which attains its maximum value $\frac{2\lambda}{N}$ as $w^{(l)}_{ij} \to 0$ and falls off like $(w^{(l)}_{ij})^{-4}$ for large weights. So the smaller the weight, the larger the decay relative to the weight itself.

This indicates that small weights decay much faster than large ones.

#### Exercise 7.12

"More data is better" applies to a fixed model $(\mathcal{H}, \mathcal{A})$. With early stopping, however, we are selecting among nested hypothesis sets $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \dots $ that are determined by $\mathcal{D}_{train}$, because at each step $t$ the weights $w_t$ are obtained by descending the in-sample error on $\mathcal{D}_{train}$. If we use the full data $\mathcal{D}$, the iterates $w_1,w_2,\dots$ will be different and, as a result, the hypothesis sets will change even if we keep the step size $\eta$ the same. 

That's why 'more data is better' doesn't apply here. 

#### Exercise 7.13

* (a) $w_{50}$ is generated using data in $\mathcal{D}_{train}$, which has nothing to do with $\mathcal{D}_{val}$, so $E_{val}(w_{50})$ is an unbiased estimate of $E_{out}(w_{50})$

* (b) From formula 2.1, with $M=1$, $N=50$, $\delta = 0.1$, we have $E_{out}(w_{50}) \le E_{val}(w_{50}) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}} =E_{val}(w_{50}) + \sqrt{\frac{1}{100}\ln\frac{2}{0.1}} \approx 0.05 + 0.173 \approx 0.22$ with probability at least $0.9$.

* (c) We can't bound $E_{out}$ with $E_{train}$ without more information, because we don't know how many hypotheses were effectively searched to produce $w_{50}$. For the validation set, $w_{50}$ was fixed by training before $E_{val}$ was calculated, so there's only one hypothesis.
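
The error bar in part (b) is easy to get wrong by hand, so here is a quick arithmetic check. It is a minimal sketch; the helper name `hoeffding_error_bar` is made up, and the inputs $N=50$, $\delta=0.1$, $M=1$, $E_{val}=0.05$ are the ones used above.

```python
import math

def hoeffding_error_bar(n, delta, m=1):
    """sqrt( ln(2m/delta) / (2n) ), the Hoeffding error bar for m hypotheses."""
    return math.sqrt(math.log(2 * m / delta) / (2 * n))

bar = hoeffding_error_bar(n=50, delta=0.1, m=1)
bound = 0.05 + bar
print(round(bar, 4), round(bound, 4))
```

Note the square root applies to the whole product $\frac{1}{100}\ln 20 \approx 0.03$, giving an error bar of about $0.173$.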

#### Exercise 7.14

Taking the derivative of $E(w)$ w.r.t. $w$, we have $\nabla E(w) = (Q^T+Q)(w-w^*) = 2Q(w-w^*)$, which equals $-2Qw^*$ when $w=0$.

The weights that minimize $E(w)$ are $w=w^*$, where $E(w^*)=0$. Starting from $w=0$, gradient descent moves along $-\nabla E(0) = 2Qw^*$. This is parallel to $w^*$ only when $w^*$ is an eigenvector of $Q$; in general it does not point straight at the optimal weights, although it is still a descent direction since $(w^*)^TQw^* > 0$.

This is consistent with the negative gradient being the best direction for a step: that claim holds for an infinitesimally small step, where it gives the steepest local decrease. For a finite step the best direction is generally different, and even when we are moving in a good direction, too large a step size may overshoot.
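
To see concretely how the negative gradient at $w=0$ relates to $w^*$, here is a small numerical sketch. The matrix `Q` and the vectors `w_star` are arbitrary illustrative choices, not values from the exercise, and `cosine_to_optimum` is a made-up helper name.

```python
import numpy as np

def cosine_to_optimum(Q, w_star):
    """Cosine of the angle between -grad E(0) = 2 Q w* and the direction of w*."""
    step_dir = 2 * Q @ w_star          # negative gradient at w = 0
    return step_dir @ w_star / (np.linalg.norm(step_dir) * np.linalg.norm(w_star))

Q = np.diag([1.0, 10.0])               # hypothetical positive definite Q

cos_general = cosine_to_optimum(Q, np.array([1.0, 1.0]))   # w* not an eigenvector
cos_eigen = cosine_to_optimum(Q, np.array([0.0, 1.0]))     # w* is an eigenvector
print(cos_general, cos_eigen)
```

With this `Q` the first cosine is about $0.77$: the step goes downhill but is not aimed at $w^*$. The cosine is exactly 1 only in the eigenvector case.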


#### Exercise 7.15

Start with a U-arrangement $\eta_1 < \eta_2 < \eta_3$ and let $\bar{\eta} = \frac{1}{2}(\eta_1+\eta_3)$. Assume $\bar{\eta} < \eta_2$ (the other case is symmetric). If $E(\bar{\eta}) \ge E(\eta_2)$, the new U-arrangement is $\{\bar{\eta}, \eta_2, \eta_3\}$ and the interval is halved immediately. Otherwise the new U-arrangement is $\{\eta_1, \bar{\eta}, \eta_2\}$, whose length may exceed half of the original. In the second iteration the midpoint is $\bar{\eta}' = \frac{1}{2}(\eta_1 + \eta_2) < \bar{\eta}$, and we end up with either $\{\eta_1, \bar{\eta}', \bar{\eta}\}$ or $\{\bar{\eta}', \bar{\eta}, \eta_2\}$. The first has length $\bar{\eta}-\eta_1 = \frac{1}{2}(\eta_3-\eta_1)$ and the second has length $\frac{1}{2}(\eta_2-\eta_1) < \frac{1}{2}(\eta_3-\eta_1)$, so both are at most half of the original interval $|\eta_3-\eta_1|$.

So after two iterations the original interval is at least halved. Repeating this, we see that the interval length decreases exponentially in the bisection algorithm.
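
The two-step halving can be observed on a toy problem. This is a sketch under assumptions: the error curve `E`, the starting U-arrangement and the function name `bisection_step` are illustrative, and the tie-break used when the midpoint coincides with $\eta_2$ is my own choice rather than part of the exercise.

```python
def bisection_step(E, eta1, eta2, eta3):
    """One bisection update of a U-arrangement eta1 < eta2 < eta3."""
    mid = 0.5 * (eta1 + eta3)
    if mid == eta2:
        # Tie-break (an assumption of this sketch): probe the right half instead.
        mid = 0.5 * (eta2 + eta3)
    if mid < eta2:
        return (eta1, mid, eta2) if E(mid) < E(eta2) else (mid, eta2, eta3)
    return (eta2, mid, eta3) if E(mid) < E(eta2) else (eta1, eta2, mid)

def E(eta):
    """Hypothetical error curve with its minimum at eta = 0.37."""
    return (eta - 0.37) ** 2

arrangement = (0.0, 0.2, 1.0)          # E(0.2) is below both end values
lengths = [arrangement[2] - arrangement[0]]
for _ in range(4):
    arrangement = bisection_step(E, *arrangement)
    lengths.append(arrangement[2] - arrangement[0])
print(arrangement, lengths)
```

A single step may shrink the interval by less than half (here from 1 to 0.8), but every pair of consecutive steps shrinks it by at least half.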

#### Exercise 7.16

In the picture, the gradient of $E_{in}$ has two components. At $w=w(t+1)$, $E_{in}$ achieves its minimum along the direction $v(t)$, so the component of the gradient along $v(t)$ is zero. Since $v(t+1)$ is the conjugate direction of $v(t)$, as we move along $v(t+1)$ the gradient remains perpendicular to the previous search direction $v(t)$, so the point where the second component becomes zero lies on this line. At that point both components of the gradient are zero, so it gives the optimal weights.

#### Exercise 7.17

* Digit '2' will have $\phi_1, \phi_3$
* Digit '3' will have $\phi_1, \phi_6$
* Digit '4' will have $\phi_2$
* Digit '6' will have $\phi_3$
* Digit '7' will have $\phi_2$
* Digit '8' will have $\phi_3, \phi_6$
* Digit '9' will have $\phi_1, \phi_2$

The additional basic shapes shall be able to be composed together to construct each digit. They shouldn't be too big and should not overlap with each other.

#### Exercise 7.18

* (a) For a given $\phi_k(x)$, for the white pixels in the feature, we set $w_{ij} = 0$, and for the black pixels in the feature, we set $w_{ij}=1$ or some positive number, if the input $x$ has the same feature, we will have $\phi_k(x) = 1$. 

* (b) The inputs to the neural network node are the pixel values, i.e. $[x_{ij}]$ matrix 
* (c) See problem (a). 
* (d) I will choose $w_0$ as random gaussian distributed with mean 0 and variance of 1.

#### Exercise 7.19

The symmetry and intensity shall show up in later layers as high level features generated from lower level features. We don't have to manually generate them.

#### Problem 7.1

The input layer has 3 nodes: $1, x_1, x_2$.
The first hidden layer will have 8 nodes: $1, h_1, h_2, h_3, \dots, h_7$:
* $h_1 = sign(x_2-2)$
* $h_2 = sign(x_2-1)$
* $h_3 = sign(x_2+1)$
* $h_4 = sign(x_1+2)$
* $h_5 = sign(x_1+1)$
* $h_6 = sign(x_1-1)$
* $h_7 = sign(x_1-2)$

The second hidden layer will have 3 nodes: $1, h_A, h_B$, they are AND nodes, where

* $h_A = sign(-h_1 + h_2 + h_5 - h_6 -3.5)$
* $h_B = sign(-h_2 + h_3 + h_4 - h_7 -3.5)$

The output layer will have 1 OR node :

$ y = sign(h_A + h_B + 1.5)$

#### Problem 7.2 TODO


* (b) The first hyperplane divides the space into 2 parts, a second hyperplane will divide each of them into 2 parts again, so we have $2\times 2=2^2$ regions, continue this to $M$ hyperplanes, we have at most $2^M$ distinct regions.

#### Problem 7.3

For a given $x$, $f(x)=1$ as long as $x$ belongs to one of the positive regions. If we denote all the positive regions by $r_1,\dots, r_k$, and suppose there's a function $t_{r}(x)=1$ when $x \in r$, then using OR operation, $f=t_{r_1}+t_{r_2}+\dots + t_{r_k}$

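For any label combination $c =(c_1,c_2,\dots,c_M)$ that identifies a region $r$, we have a corresponding $t_r = h^{c_1}_1\dots h^{c_M}_M$, where $h^{c_m}_m = h_m$ if $c_m=1$ and $h^{c_m}_m=\bar{h}_m$ if $c_m=-1$. For $x \in r$ every factor equals 1, since $h_m =1$ if $c_m=1$ and $h_m=-1,\bar{h}_m=1$ if $c_m=-1$; for $x$ in any other region at least one factor is $-1$. This is an AND operation, meaning that $t_r=1$ exactly when $x$ has the labels $c$.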

This indicates that if $x$ falls into one of the positive region $r$, we'll have $t_r(x) =1$, which is desired in the previous step.

#### Problem 7.4

Since $f=t_{r_1}+t_{r_2}+\dots + t_{r_k}$, we have

* Input layer: the number of nodes equals the dimension of $x$ plus 1.
* 1st hidden layer: $M+1$ nodes, each non-bias node represents $h_m = \text{sign}(w^T_mx)$
* 2nd hidden layer: $k+1$ nodes, each non-bias node represents one $t_{r}$
* The output layer: one node that represents $f$

#### Problem 7.5 TODO

#### Problem 7.6 TODO

What is the $W$ here in $O(W^2)$?

Let $E(W)$ be the in-sample error we compute. To calculate its derivatives w.r.t. all weights, we need to compute $\frac{\partial{E}}{\partial{W^{(l)}_{ij}}}$ for each layer $l$. 

* For layer $L$, we perturb each weight, and use the finite difference to calculate the gradient, $\frac{\partial{E}}{\partial{W^{(L)}_{ij}}}$. Since the weights from previous layers don't impact this calculation, we only need $2O(W^{(L)})$ calculations on the layer $L$. (Each difference involves two re-calculations of $E(w)$)
* For layer $L-1$, once we perturb the weights in this layer, we need to re-compute the outputs $x^{(L-1)}$ and $x^{(L)}$, which involves $2O(W^{(L)}+W^{(L-1)})$ calculations.


#### Problem 7.7

* (a) let  $q = \text{dim}(y)$, then $y_n, \hat{y}_n$ are $q\times 1$ vectors. Let $\Delta y_n = y_n - \hat{y}_n$. Then $\Delta Y$ is a $N\times q$ matrix.


\begin{align*}
E_{in} &= \frac{1}{N}\sum^N_{n=1} |y_n - \hat{y}_n|^2 \\
&= \frac{1}{N}\sum^N_{n=1} (y_n - \hat{y}_n)^T(y_n - \hat{y}_n) \\
&= \frac{1}{N}\sum^N_{n=1} \Delta y^T_n \Delta y_n \\
&= \frac{1}{N}\text{trace}(\Delta Y \Delta Y^T) \\
&= \frac{1}{N}\text{trace}\left( (Y-\hat{Y}) (Y-\hat{Y})^T\right) \\
\end{align*}



* (b) Consider the derivatives of $E_{in}$ without the $\frac{1}{N}$ factor, we have:

\begin{align*}
\frac{\partial{E_{in}}}{\partial{V}} &= \frac{\partial{E_{in}}}{\partial{\hat{Y}}} \frac{\partial{\hat{Y}}}{\partial{V}}\\
\end{align*}

Apply the derivatives of trace: $\frac{\partial{Tr(XA)}}{\partial{X}} = A^T$, $\frac{\partial{Tr(AX^T)}}{\partial{X}} = A$, $\frac{\partial{Tr(XBX^T)}}{\partial{X}} = XB^T+XB$, we have

$\frac{\partial{E_{in}}}{\partial{\hat{Y}}} = \frac{\partial\,\text{trace}(YY^T-Y\hat{Y}^T-\hat{Y}Y^T+\hat{Y}\hat{Y}^T)}{\partial{\hat{Y}}} = -Y-Y+2\hat{Y} = 2(\hat{Y}-Y) = 2(ZV-Y)$

On the other hand, we have $\frac{\partial{\hat{Y}}}{\partial{V}} = \frac{\partial{(ZV)}}{\partial{V}} = Z^T$ 

So we have

\begin{align*}
\frac{\partial{E_{in}}}{\partial{V}} &= \frac{\partial{E_{in}}}{\partial{\hat{Y}}} \frac{\partial{\hat{Y}}}{\partial{V}}\\
&= Z^T2(ZV-Y) \\
&= 2Z^TZV - 2Z^TY\\
\end{align*}


Now we compute $\frac{\partial{E_{in}}}{\partial{W}}$, first we re-write $\hat{Y}=ZV$

\begin{align*}
\hat{Y} &= ZV \\
&= \begin{bmatrix}1 & \theta(XW) \end{bmatrix} \begin{bmatrix} V_0 \\ V_1 \end{bmatrix}\\
&= 1V_0 + \theta(XW)V_1 \\
\end{align*}


consider 

\begin{align*}
\frac{\partial{E_{in}}}{\partial{Z}} &= 2(\hat{Y} - Y)V^T\\
\end{align*}

Since the first column of $Z$ is all ones and does not depend on $W$, we can drop that column (keeping only $V_1$) and have 

$\frac{\partial{E_{in}}}{\partial{\theta(XW)}} = 2(\hat{Y} - Y)V^T_1 = 2(1V_0 + \theta(XW)V_1 - Y)V^T_1 = 2(1V_0V^T_1 + \theta(XW)V_1V^T_1 - YV^T_1)$


Notice that the $j+1$-th column of $Z$ (which is the $j$-th column of $\theta(XW)$) depends on $W_{ij}$, now we can calculate

\begin{align*}
\frac{\partial{E_{in}}}{\partial{W_{ij}}} &= \sum^N_{k=1}\frac{\partial{E_{in}}}{\partial{Z_{k,j+1}}}\frac{\partial{Z_{k,j+1}}}{\partial{W_{i,j}}}\\
&= \sum^N_{k=1}\frac{\partial{E_{in}}}{\partial{\theta(XW)_{k,j}}}\frac{\partial{\theta(XW)_{k,j}}}{\partial{W_{i,j}}}\\
&= \sum^N_{k=1} 2 (\hat{y}-y)^T(V^T_1)_{,j}\theta'(XW)_{kj}x^{(0)}_{ki}\\
\end{align*}

Where $\hat{y}$ has dimension $d^{y}\times 1$, $(V^T_1)_{,j}$ is the $j$-th column of matrix $V^T_1$, and $x^{(0)}_n = \begin{bmatrix}1 \\ x_n\end{bmatrix}$ has a dimension of $(d+1)\times 1$ and

$\frac{\partial{\theta(XW)_{k,j}}}{\partial{W_{i,j}}} = \theta'(XW)_{kj}x^{(0)}_{ki}$

Write the above derivatives out and sort out the items, we find that 

$\frac{\partial{E_{in}}}{\partial{W}} = 2X^T\left[\theta'(XW)\otimes \left(\theta(XW)V_1V^T_1 + 1V_0V^T_1 - YV^T_1\right)\right]$
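
Both gradient formulas can be checked against finite differences. The sketch below assumes $\theta = \tanh$, drops the $\frac{1}{N}$ factor as in the derivation, and uses small made-up sizes and random data; `numeric_grad` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, q = 6, 3, 4, 2     # hypothetical sizes: samples, inputs, hidden units, outputs
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])   # bias column included
Y = rng.normal(size=(N, q))
W = rng.normal(size=(d + 1, m))
V = rng.normal(size=(m + 1, q))     # first row is V0 (bias), the rest is V1

def error(W, V):
    """trace((ZV - Y)(ZV - Y)^T), i.e. E_in without the 1/N factor."""
    Z = np.hstack([np.ones((N, 1)), np.tanh(X @ W)])
    R = Z @ V - Y
    return np.trace(R @ R.T)

def analytic_grads(W, V):
    H = np.tanh(X @ W)
    Z = np.hstack([np.ones((N, 1)), H])
    R = Z @ V - Y                   # equals theta(XW) V1 + 1 V0 - Y
    dV = 2 * Z.T @ R
    dW = 2 * X.T @ ((1 - H ** 2) * (R @ V[1:, :].T))
    return dW, dV

def numeric_grad(f, A, h=1e-6):
    """Entry-wise central finite difference of a scalar function of a matrix."""
    G = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        Ap, Am = A.copy(), A.copy()
        Ap[idx] += h
        Am[idx] -= h
        G[idx] = (f(Ap) - f(Am)) / (2 * h)
    return G

dW, dV = analytic_grads(W, V)
dW_num = numeric_grad(lambda A: error(A, V), W)
dV_num = numeric_grad(lambda A: error(W, A), V)
print(np.max(np.abs(dW - dW_num)), np.max(np.abs(dV - dV_num)))
```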

![Problem 7.7](files/LFD7.7_1.jpg "page 1")

![Problem 7.7](files/LFD7.7_3.jpg "page 3")

![Problem 7.7](files/LFD7.7_2.jpg "page 2")

#### Problem 7.8

* (a) The U-arrangement has $E(\eta_2) < \min(E(\eta_1), E(\eta_3))$, so the quadratic interpolant through the three points must be convex (a concave or linear curve attains its minimum over the interval at an endpoint). It therefore has a minimum at some $\eta^*$ where its derivative is zero; the derivative is negative to the left of $\eta^*$ and positive to the right. Suppose the minimum is not within the interval $[\eta_1, \eta_3]$; then $\eta_1, \eta_3$ must both be on one side of $\eta^*$. If they are both on the right side, the derivative is positive on the whole interval, so the interpolant is increasing and $E(\eta_1) < E(\eta_2)$, which contradicts $E(\eta_2) < E(\eta_1)$. Similarly, if they are both on the left side, the interpolant is decreasing and $E(\eta_2) > E(\eta_3)$, which contradicts $E(\eta_2) < E(\eta_3)$. So the minimum of the quadratic interpolant is within the interval $[\eta_1, \eta_3]$.

* (b) Solving the linear equations with 3 points, we find out that 

$a = \frac{(\eta_1-\eta_3)(e_1-e_2)-(\eta_1-\eta_2)(e_1-e_3)}{(\eta_1-\eta_2)(\eta_1-\eta_3)(\eta_2-\eta_3)} = \frac{(\eta_2-\eta_3)e_1 + (\eta_3-\eta_1)e_2+(\eta_1-\eta_2)e_3}{(\eta_1-\eta_2)(\eta_1-\eta_3)(\eta_2-\eta_3)}$

$b = \frac{e_1-e_2}{\eta_1-\eta_2} - a(\eta_1 + \eta_2) $

$c = e_1 - a\eta^2_1 - b\eta_1$

Taking the derivative of $E(\eta)$ w.r.t. $\eta$ and setting it to 0, we have

\begin{align*}
\bar{\eta} &= -\frac{b}{2a} \\
&= -\frac{1}{2}\left(\frac{e_1-e_2}{a(\eta_1-\eta_2)} - (\eta_1 + \eta_2)\right)\\
&= \frac{1}{2}\left((\eta_1 + \eta_2) - \frac{e_1-e_2}{a(\eta_1-\eta_2)} \right)\\
&= \frac{1}{2}\left((\eta_1 + \eta_2) - \frac{(e_1-e_2)(\eta_1-\eta_3)(\eta_2-\eta_3)}{(\eta_1-\eta_3)(e_1-e_2)-(\eta_1-\eta_2)(e_1-e_3)} \right)\\
&= \frac{1}{2}\frac{(\eta_1 + \eta_2)\left((\eta_1-\eta_3)(e_1-e_2)-(\eta_1-\eta_2)(e_1-e_3)\right) - (e_1-e_2)(\eta_1-\eta_3)(\eta_2-\eta_3)}{(\eta_1-\eta_3)(e_1-e_2)-(\eta_1-\eta_2)(e_1-e_3)}\\
&= \frac{1}{2}\frac{(\eta^2_1-\eta^2_3)(e_1-e_2)-(\eta^2_1-\eta^2_2)(e_1-e_3)}{(\eta_1-\eta_3)(e_1-e_2)-(\eta_1-\eta_2)(e_1-e_3)}\\
\end{align*}

* (c) 

  * If $\bar{\eta}$ is on the left of $\eta_2$ and $E(\bar{\eta}) < E(\eta_2)$, the smaller U-arrangement is $[\eta_1, \bar{\eta}, \eta_2]$
  * If $\bar{\eta}$ is on the left of $\eta_2$ and $E(\bar{\eta}) > E(\eta_2)$, the smaller U-arrangement is $[ \bar{\eta}, \eta_2, \eta_3]$
  * If $\bar{\eta}$ is on the right of $\eta_2$ and $E(\bar{\eta}) < E(\eta_2)$, the smaller U-arrangement is $[\eta_2, \bar{\eta}, \eta_3]$
  * If $\bar{\eta}$ is on the right of $\eta_2$ and $E(\bar{\eta}) > E(\eta_2)$, the smaller U-arrangement is $[\eta_1, \eta_2, \bar{\eta}]$  
  
* (d) If $\bar{\eta} =\eta_2$, we can perturb $\bar{\eta}$ slightly, re-compute $E(\bar{\eta})$ (using the original function) and choose the smaller U-arrangement accordingly.
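
The closed form for $\bar{\eta}$ in part (b) can be compared with a direct quadratic fit. This is a minimal sketch: the sample curves and the helper name `quadratic_min` are illustrative, and `np.polyfit` is used only as an independent reference.

```python
import numpy as np

def quadratic_min(eta, e):
    """Minimizer of the quadratic through (eta_i, e_i), i = 1, 2, 3."""
    eta1, eta2, eta3 = eta
    e1, e2, e3 = e
    num = (e1 - e2) * (eta1**2 - eta3**2) - (e1 - e3) * (eta1**2 - eta2**2)
    den = (e1 - e2) * (eta1 - eta3) - (e1 - e3) * (eta1 - eta2)
    return 0.5 * num / den

# Exact quadratic with its minimum at 0.7: the interpolant recovers it.
eta = (0.0, 0.5, 2.0)
exact = quadratic_min(eta, tuple(2 * (t - 0.7) ** 2 + 1 for t in eta))

# Non-quadratic curve: compare against -b/(2a) from a degree-2 polynomial fit.
eta_c = (0.0, 0.5, 1.2)
e_c = tuple(np.cosh(t - 0.4) for t in eta_c)
a, b, c = np.polyfit(eta_c, e_c, 2)
from_formula = quadratic_min(eta_c, e_c)
from_fit = -b / (2 * a)
print(exact, from_formula, from_fit)
```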

#### Problem 7.9 TODO

* (a)  Since $H$ is positive definite and symmetric, we can decompose it into $H=Q\Lambda Q^T$, where $Q$ is orthogonal eigenvector matrix, and $\Lambda$ is the diagonal eigenvalue matrix. And all the eigenvalues are positive.

Let $u=Qx$, 

\begin{align*}
P[E\le E(w^*) + \epsilon] &= P[E(w^*) + \frac{1}{2}(w-w^*)^TH(w-w^*) \le E(w^*) + \epsilon]\\
&= P[\frac{1}{2}(w-w^*)^TH(w-w^*) \le \epsilon]\\
\end{align*}

#### Problem 7.10

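If all weights are zero, then all inputs to non-input layers are zero, and all outputs from $\tanh$ are zero. By formula 7.4, the sensitivities of all hidden layers are zero, because they are multiplied by the zero weights of the next layer. So the gradient is zero for every weight except the output bias, whose gradient is the output sensitivity $\delta^{(L)}$ times the constant 1. There won't be any update on the hidden-layer weights, so we don't want to initialize the weights to zero.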
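
A quick numerical look at the gradient at zero weights, as a hedged sketch: the layer sizes, the data point and the function name `gradients_at` are made up, the error is the squared error on one point, and every non-input node uses $\tanh$.

```python
import numpy as np

def gradients_at(weights, x, y):
    """Backpropagation for a tanh network; returns de/dW for each layer."""
    xs = [np.concatenate(([1.0], x))]
    for l, W in enumerate(weights):
        out = np.tanh(W.T @ xs[-1])
        is_output = (l == len(weights) - 1)
        xs.append(out if is_output else np.concatenate(([1.0], out)))
    out = xs[-1]
    delta = 2 * (out - y) * (1 - out ** 2)          # output sensitivity
    grads = [None] * len(weights)
    for l in reversed(range(len(weights))):
        grads[l] = np.outer(xs[l], delta)
        if l > 0:
            hidden = xs[l][1:]
            delta = (1 - hidden ** 2) * (weights[l] @ delta)[1:]
    return grads

# Hypothetical 2-3-1 network with all weights exactly zero.
zero_weights = [np.zeros((3, 3)), np.zeros((4, 1))]
grads = gradients_at(zero_weights, x=np.array([0.5, -1.5]), y=1.0)
print(grads[0])
print(grads[1])
```

In this sketch the hidden-layer gradient is identically zero, and the only non-zero entry overall is the output bias weight. Updating that bias alone leaves the hidden sensitivities at zero, so the hidden weights stay stuck.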

#### Problem 7.11

Not sure what to do here.

#### Problem 7.12 TODO
#### Problem 7.13 TODO



#### Problem 7.14

* (a) Let $\alpha$ be the Lagrange multiplier, we have the Lagrangian

\begin{align*}
\mathcal{L}&= E_{in}(w_t + \Delta w) + \alpha (\Delta w^T \Delta w - \eta^2)\\
&\approx E_{in}(w_t) + g^T_t\Delta w + \frac{1}{2}\Delta w^TH_t\Delta w + \Delta w^T\alpha I \Delta w - \alpha\eta^2\\
&= E_{in}(w_t) + g^T_t\Delta w + \frac{1}{2}\Delta w^T(H_t+2\alpha I)\Delta w - \alpha\eta^2\\
\end{align*}

Since $\alpha \eta^2$ does not depend on $\Delta w$, we can drop it from the minimization over $\Delta w$, so 
$\mathcal{L} = E_{in}(w_t) + g^T_t\Delta w + \frac{1}{2}\Delta w^T(H_t+2\alpha I)\Delta w$

* (b) Take the derivative of $\mathcal{L}$ w.r.t. $\Delta w$ and $\alpha$ respectively, and set the derivatives to 0. We have

\begin{align*}
\frac{\partial{\mathcal{L}}}{\partial{\Delta w}} &= g_t + (H_t + 2\alpha I)\Delta w \\
&= 0\\
\end{align*}

Note, we have used the fact that $H_t$ is symmetric. 
So $\Delta w = -(H_t + 2\alpha I )^{-1}g_t$

and $\Delta w^T \Delta w = \eta^2$

* (c) From $\Delta w = -(H_t + 2\alpha I )^{-1}g_t$, we have $2\alpha I \Delta w = -(g_t + H_t \Delta w)$, multiply both sides by $\Delta w^T$ from left, we have 

$2\alpha \Delta w^T \Delta w = -(\Delta w^T g_t + \Delta w^T H_t \Delta w)$, i.e. 

$\alpha = -\frac{1}{2\eta^2}(\Delta w^T g_t + \Delta w^T H_t \Delta w)$

Since $\Delta w^T \Delta w = \eta^2$, so we have $|\Delta w| = \eta$, the first term $\frac{1}{2\eta^2}|\Delta w^T g_t| \sim O(\frac{|g_t|}{\eta})$. The second term $\frac{1}{2\eta^2}|\Delta w^T H_t \Delta w| \le \frac{1}{2\eta^2}|\Delta w|^2 |H_t| \sim O(|H_t|)$

So as $\eta$ decreases, we have large $\alpha$.


* (d) Assume that $\alpha$ is large. Since $H_t$ is the symmetric Hessian,  apply eigenvalue decomposition, we have 

$H_t = Q\Lambda Q^T$, where $\Lambda$ is the diagonal matrix with eigenvalues on the diagonal, and $Q$ is the orthonormal eigenvector matrix with $Q^{-1}=Q^T$. Thus we have 

\begin{align*}
\Delta w &= -(H_t + 2\alpha I )^{-1}g_t \\
&= - (Q\Lambda Q^T + 2\alpha I)^{-1}g_t\\
&= - \left(Q(\Lambda + 2\alpha I)Q^T\right)^{-1}g_t\\
&= - Q(\Lambda + 2\alpha I)^{-1}Q^Tg_t\\
\end{align*}

Since $\alpha$ is large, we have the matrix $\Lambda + 2\alpha I \to 2\alpha I$, so 

\begin{align*}
\Delta w &= - Q(\Lambda + 2\alpha I)^{-1}Q^Tg_t\\
&\approx -Q (2\alpha I)^{-1}Q^T g_t \\
&= -\frac{1}{2\alpha}QQ^T g_t \\
&= -\frac{1}{2\alpha}g_t \\
\end{align*}

Substituting this into $\Delta w^T \Delta w = \eta^2$, we have

$\eta^2 = \frac{1}{4\alpha^2}g^T_tg_t = \frac{1}{4\alpha^2}|g_t|^2 $

so $\alpha = \frac{|g_t|}{2\eta}$

* (e) Substituting $\alpha = \frac{|g_t|}{2\eta}$ into the formula $\Delta w = -(H_t + 2\alpha I )^{-1}g_t$, we have

$\Delta w = -(H_t + 2\frac{|g_t|}{2\eta}I )^{-1}g_t =-(H_t + \frac{|g_t|}{\eta}I )^{-1}g_t $
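
Because $\alpha = \frac{|g_t|}{2\eta}$ comes from a large-$\alpha$ approximation, the resulting step only satisfies $|\Delta w| = \eta$ approximately. The sketch below measures how close it gets as $\eta$ shrinks; `H` and `g` are hypothetical values, not from the problem.

```python
import numpy as np

H = np.array([[2.0, 0.5], [0.5, 1.0]])   # hypothetical symmetric Hessian H_t
g = np.array([1.0, -2.0])                # hypothetical gradient g_t

def step(eta):
    """Delta w = -(H + (|g|/eta) I)^{-1} g, the update from part (e)."""
    damping = np.linalg.norm(g) / eta    # this is 2 * alpha
    return -np.linalg.solve(H + damping * np.eye(len(g)), g)

# Ratio |Delta w| / eta: close to 1 when eta is small (alpha large).
ratios = {eta: np.linalg.norm(step(eta)) / eta for eta in (1.0, 0.1, 0.01, 0.001)}
# For small eta the step is nearly a plain gradient step of length eta.
small = step(0.001)
cos_to_neg_grad = small @ (-g) / (np.linalg.norm(small) * np.linalg.norm(g))
print(ratios, cos_to_neg_grad)
```

For large $\eta$ the ratio drops well below 1, which is where the approximation in part (d) stops being accurate.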


#### Problem 7.15

* (a) Consider $H_{k+1} = H_k + g_{k+1}g^T_{k+1}$ and apply the formula for inverse of $(A+zz^T)^{-1}$, we have

\begin{align*}
H^{-1}_{k+1} &=  \left(H_k + g_{k+1}g^T_{k+1}\right)^{-1}\\
&=  H^{-1}_k - \frac{H^{-1}_kg_{k+1}g^T_{k+1}H^{-1}_k}{1+g^T_{k+1}H^{-1}_kg_{k+1}} \\
\end{align*}


* (b) We start with $H_0 = \epsilon I$, which is $W\times W$ matrix. Then we iterate $N$ times to compute $H^{-1}_{k}$ one by one. In each iteration, we spend $O(W^2)$ time on computing the second term in above formula for $H^{-1}_{k+1}$, so the total time is $O(NW^2)$. 

The second term is $O(W^2)$ because:

  * $g^T_{k+1}H^{-1}_k$ takes $O(W^2)$ time
  * So $g^T_{k+1}H^{-1}_kg_{k+1}$ takes $O(2W^2)$ time.
  * $H^{-1}_kg_{k+1}g^T_{k+1}H^{-1}_k$ takes $O(3W^2)$ time.
  * The total time on the second term is $O(W^2)$.
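
The recursion in part (a) can be run directly and compared with an explicit inverse. This is a sketch under assumptions: the gradients are random stand-ins, `eps` plays the role of $\epsilon$ in $H_0 = \epsilon I$, and the function name `inverse_hessian_estimate` is made up. It uses the symmetry of $H^{-1}_k$, so $g^TH^{-1}_k = (H^{-1}_kg)^T$.

```python
import numpy as np

def inverse_hessian_estimate(G, eps=1e-2):
    """G is N x W with one gradient g_k per row.

    Returns (eps * I + sum_k g_k g_k^T)^{-1} using N rank-one updates,
    each costing O(W^2), for O(N W^2) in total.
    """
    n_weights = G.shape[1]
    H_inv = np.eye(n_weights) / eps              # inverse of H_0 = eps * I
    for g in G:
        Hg = H_inv @ g                            # O(W^2)
        H_inv = H_inv - np.outer(Hg, Hg) / (1.0 + g @ Hg)   # O(W^2)
    return H_inv

rng = np.random.default_rng(0)
G = rng.normal(size=(20, 4))                      # hypothetical gradients, N=20, W=4
H_inv = inverse_hessian_estimate(G, eps=1e-2)
direct = np.linalg.inv(1e-2 * np.eye(4) + G.T @ G)   # O(W^3) reference
print(np.max(np.abs(H_inv - direct)))
```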


#### Problem 7.16 TODO
