In [1]:
# Add lib to sys.path
import os
import sys
import time

import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from scipy.optimize import minimize
import math
from sklearn.preprocessing import normalize
from functools import partial
import h5py
from scipy.spatial import distance

nb_dir = os.path.split(os.getcwd())[0]
if nb_dir not in sys.path:
    sys.path.append(nb_dir)

from matplotlib.colors import ListedColormap
import libs.linear_models as lm
import libs.data_util as data
import libs.nn as nn
import libs.plot as myplot

%matplotlib inline

#### Exercise 7.1

We consider the regions where $f$ is '+'. The top '+' region gives $\bar{h}_1h_2h_3$, the bottom-left '+' region gives $h_1h_2\bar{h}_3$, and the bottom-right '+' region gives $h_1\bar{h}_2h_3$. If any of them is True, then $f$ is True.

So $f=\bar{h}_1h_2h_3 + h_1h_2\bar{h}_3 + h_1\bar{h}_2h_3$

#### Exercise 7.2

* (a) Skipped for the graph. The Boolean $OR(x_1,\dots,x_M)$ is built like $OR(x_1,x_2)$: every weight on $x_i$ is still 1, and the bias weight is $M-0.5$. For $AND(x_1, \dots, x_M)$, the bias weight is $-(M-0.5)$ and all other weights are 1.
* (b) Skipped. The weight on each input $x_i$ is $w_i$.
* (c) Skipped. $\bar{x}_2$ is formed from the input $x_2$, and then $x_1, \bar{x}_2, x_3$ feed an ordinary $OR$ whose bias weight is 2.5.
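
The bias weights above can be checked by brute force. The following is a minimal sketch, assuming inputs are encoded as $\pm 1$ and that negating $x_2$ in part (c) amounts to a weight of $-1$ on $x_2$; the helper name `perceptron` and the choice $M=4$ are illustrative only.

```python
from itertools import product

import numpy as np


def perceptron(bias, weights, x):
    """Return sign(bias + weights . x) for a vector x of +/-1 inputs."""
    return int(np.sign(bias + np.dot(weights, x)))


M = 4  # number of inputs, chosen arbitrarily for the check
ones = np.ones(M)

# Parts (a): compare the perceptrons with the Boolean definitions.
for x in product([-1, 1], repeat=M):
    x = np.array(x)
    expected_or = 1 if np.any(x == 1) else -1
    expected_and = 1 if np.all(x == 1) else -1
    assert perceptron(M - 0.5, ones, x) == expected_or
    assert perceptron(-(M - 0.5), ones, x) == expected_and

# Part (c): OR(x1, NOT x2, x3), assuming weight -1 on x2 and bias 2.5.
w_c = np.array([1, -1, 1])
for x in product([-1, 1], repeat=3):
    x = np.array(x)
    expected = 1 if (x[0] == 1 or x[1] == -1 or x[2] == 1) else -1
    assert perceptron(2.5, w_c, x) == expected
```

Because the bias is a half-integer and the weighted sum is an integer, the argument of the sign is never zero, so no tie-breaking rule is needed.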

#### Exercise 7.3

It's straightforward to verify that the graph is consistent with $f(x)$.

#### Exercise 7.4

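If we have $h_1(x)=\text{sign}(w^T_1x)$, $h_2(x)=\text{sign}(w^T_2x)$, $h_3(x)=\text{sign}(w^T_3x)$, then each product term is an $AND$ with bias weight $-2.5$ and the sum is an $OR$ with bias weight $2.5$, so we have
$f= \text{sign}(\text{sign}(-h_1+h_2+h_3-2.5) + \text{sign}(h_1-h_2+h_3-2.5) + \text{sign}(h_1+h_2-h_3-2.5) + 2.5)$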
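
As a sanity check, the sign network can be compared with the Boolean formula of Exercise 7.1 over all eight combinations of $h_1, h_2, h_3$. This is a small sketch, assuming $\pm 1$ encoding and that each $AND$ output enters the final $OR$ with weight $+1$; the function names are made up for the example.

```python
from itertools import product

import numpy as np


def f_network(h1, h2, h3):
    """Two-layer sign network: three ANDs feeding one OR."""
    a = np.sign(-h1 + h2 + h3 - 2.5)  # (NOT h1) AND h2 AND h3
    b = np.sign(h1 - h2 + h3 - 2.5)   # h1 AND (NOT h2) AND h3
    c = np.sign(h1 + h2 - h3 - 2.5)   # h1 AND h2 AND (NOT h3)
    return int(np.sign(a + b + c + 2.5))  # OR of the three terms


def f_boolean(h1, h2, h3):
    """The same function written directly as a Boolean formula."""
    t1, t2, t3 = h1 == 1, h2 == 1, h3 == 1
    value = ((not t1) and t2 and t3) \
        or (t1 and (not t2) and t3) \
        or (t1 and t2 and (not t3))
    return 1 if value else -1


for h in product([-1, 1], repeat=3):
    assert f_network(*h) == f_boolean(*h)
```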

#### Exercise 7.5
Following the hint, $\text{sign}(x)\approx \tanh(\alpha x)$ for large enough $\alpha$. Given $w_1$ and $\epsilon >0$, choose $w_2 = \alpha w_1$ with $\alpha > 0$, so that $w^T_2x_n = \alpha\, w^T_1x_n$ and $\tanh(w^T_2x_n) = \tanh(\alpha\, w^T_1x_n)$. Provided $w^T_1x_n \neq 0$ for every $x_n$ in the data set, $\tanh(\alpha\, w^T_1x_n) \to \text{sign}(w^T_1x_n)$ as $\alpha \to \infty$. Since the data set is finite, a single large enough $\alpha$ makes $|\text{sign}(w^T_1x_n) - \tanh(w^T_2x_n)| \le \epsilon$ for all $n$.
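
One way to make "large enough" concrete: for $s \neq 0$, $|\text{sign}(s) - \tanh(\alpha s)| = 1 - \tanh(\alpha |s|) = \frac{2}{e^{2\alpha |s|}+1}$, which is at most $\epsilon$ when $\alpha \ge \frac{\ln(2/\epsilon - 1)}{2|s|}$ (for $0<\epsilon<1$). The sketch below applies this bound with the smallest $|w^T_1x_n|$ in the data. The data, `w1` and `eps` are made up for illustration, and the bound itself is an addition to the exercise text, not part of it.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])  # made-up data
w1 = np.array([0.3, -1.0, 0.5])                               # made-up weights
eps = 1e-3

s = X @ w1
margin = np.min(np.abs(s))  # must be > 0: no point exactly on the boundary
assert margin > 0

# 1 - tanh(a*|s|) = 2 / (exp(2*a*|s|) + 1) <= eps  <=>  a >= ln(2/eps - 1) / (2*|s|)
alpha = np.log(2.0 / eps - 1.0) / (2.0 * margin)
w2 = alpha * w1

# Largest deviation between the hard and the soft threshold over the data.
gap = np.max(np.abs(np.sign(s) - np.tanh(X @ w2)))
print(alpha, gap)
```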

#### Exercise 7.6

On each layer $l$, we compute $s^{(l)} = \left(W^{(l)}\right)^Tx^{(l-1)}$, which takes $d^{(l)}(d^{(l-1)}+1)$ multiplications and additions. We then compute $x^{(l)} = \theta(s^{(l)})$, which takes $d^{(l)}$ $\theta$-evaluations. Summing over $l=1$ to $L$ gives a total of $O(Q)$ multiplications and additions and $O(V)$ $\theta$-evaluations.

#### Exercise 7.7

Since $\tanh(x) = \frac{e^x-e^{-x}}{e^x+e^{-x}}$, we have $\frac{d \tanh(x)}{dx} = \frac{4}{(e^x+e^{-x})^2} = 1 - \tanh^2(x)$, so

\begin{align*}
\nabla E_{in}(w) &= \nabla \frac{1}{N}\sum^N_{n=1}\left(\tanh(w^Tx_n)-y_n\right)^2 \\
&= \frac{1}{N}\sum^N_{n=1}\nabla\left(\tanh(w^Tx_n)-y_n\right)^2 \\
&= \frac{1}{N}\sum^N_{n=1}2\left(\tanh(w^Tx_n)-y_n\right)\nabla \tanh(w^Tx_n)  \\
&= \frac{1}{N}\sum^N_{n=1}2\left(\tanh(w^Tx_n)-y_n\right)\left(1 - \tanh^2(w^Tx_n)\right)x_n  \\
&= \frac{2}{N}\sum^N_{n=1}\left(\tanh(w^Tx_n)-y_n\right)\left(1 - \tanh^2(w^Tx_n)\right)x_n  \\
\end{align*}

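If $\|w\| \to \infty$ (with $w^Tx_n \neq 0$), then $\tanh(w^Tx_n) \to \pm 1$ and $1 - \tanh^2(w^Tx_n) \to 0$, so $\nabla E_{in}(w)\to 0$. The gradient vanishes for large weights, which is exactly the regime where $\tanh$ behaves like the hard threshold $\text{sign}$, so gradient-based methods find it hard to improve a perceptron solution.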
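
The gradient formula and the vanishing-gradient behaviour can both be checked numerically. The following is a minimal sketch on made-up data; the function names `e_in` and `grad_e_in` are chosen for this example only.

```python
import numpy as np


def e_in(w, X, y):
    """Mean squared error of the tanh unit."""
    return np.mean((np.tanh(X @ w) - y) ** 2)


def grad_e_in(w, X, y):
    """Analytic gradient: (2/N) * sum (tanh(s_n) - y_n) (1 - tanh(s_n)^2) x_n."""
    t = np.tanh(X @ w)
    return 2.0 / len(y) * X.T @ ((t - y) * (1.0 - t ** 2))


rng = np.random.default_rng(1)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 2))])  # made-up inputs
y = np.sign(rng.normal(size=30))                              # made-up labels
w = rng.normal(size=3)

# Central finite differences as an independent check of the formula.
h = 1e-6
numeric = np.array([
    (e_in(w + h * e, X, y) - e_in(w - h * e, X, y)) / (2 * h)
    for e in np.eye(3)
])
print(np.max(np.abs(numeric - grad_e_in(w, X, y))))

# Scaling w up saturates tanh, so the gradient norm shrinks toward zero.
for scale in [1, 10, 100, 1000]:
    print(scale, np.linalg.norm(grad_e_in(scale * w, X, y)))
```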

#### Exercise 7.8

The weight matrices are the same as in Example 7.1, and here $\theta$ is taken to be the identity in every layer. For the data point $x=2$, $y=1$, we have $x^{(0)} = \begin{bmatrix}1 \\ 2 \end{bmatrix}$, and $s^{(1)}$ is the same as before, $s^{(1)} = \begin{bmatrix}0.7 \\ 1 \end{bmatrix}$. 

So $x^{(1)} = \begin{bmatrix}1 \\ 0.7 \\ 1 \end{bmatrix}$, and $s^{(2)} = \begin{bmatrix} -2.1 \end{bmatrix}$, 

$x^{(2)} = \begin{bmatrix}1 \\ -2.1 \end{bmatrix}$

$s^{(3)} = \begin{bmatrix} -3.2 \end{bmatrix}$

$x^{(3)} = -3.2$


Apply backpropagation and note that $\theta'(s^{(l)}) = 1$ to compute 

$\delta^{(3)} = 2(x^{(3)}-y) = -8.4$

$\delta^{(2)} = \theta'(s^{(2)}) \otimes \left[W^{(3)}\delta^{(3)}\right] = -16.8$

$\delta^{(1)} = \begin{bmatrix}-16.8 \\ 50.4 \end{bmatrix}$

$\frac{\partial{e}}{\partial{W^{(1)}}} = x^{(0)}(\delta^{(1)})^T = \begin{bmatrix}-16.8 & 50.4 \\ -33.6 & 100.8\end{bmatrix}$

$\frac{\partial{e}}{\partial{W^{(2)}}} = x^{(1)}(\delta^{(2)})^T = \begin{bmatrix}-16.8 \\ -11.76 \\ -16.8\end{bmatrix}$

$\frac{\partial{e}}{\partial{W^{(3)}}} = x^{(2)}(\delta^{(3)})^T = \begin{bmatrix}-8.4 \\ 17.64 \end{bmatrix}$
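
The forward and backward passes can be reproduced in a few lines of NumPy. This is a sketch under the same assumption as the hand computation, namely that $\theta$ is the identity in every layer, so $\theta'(s^{(l)}) = 1$; the variable names are illustrative.

```python
import numpy as np

# Weights and data point from Example 7.1.
W = [
    np.array([[0.1, 0.2], [0.3, 0.4]]),
    np.array([[0.2], [1.0], [-3.0]]),
    np.array([[1.0], [2.0]]),
]
x_in, y = 2.0, 1.0

# Forward pass with identity activations: x^(l) = [1; s^(l)].
xs = [np.array([[1.0], [x_in]])]
for l, Wl in enumerate(W):
    s = Wl.T @ xs[-1]
    is_output = l == len(W) - 1
    xs.append(s if is_output else np.vstack([[[1.0]], s]))

# Backward pass: theta'(s) = 1, so delta^(l) is W^(l+1) delta^(l+1)
# with the bias row dropped.
deltas = [2.0 * (xs[-1] - y)]
for Wl in reversed(W[1:]):
    deltas.insert(0, (Wl @ deltas[0])[1:])

# Gradients: de/dW^(l) = x^(l-1) (delta^(l))^T.
grads = [x @ d.T for x, d in zip(xs[:-1], deltas)]
for g in grads:
    print(g)
```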


#### Exercise 7.9 

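If we initialize all weights to 0, every input signal $s^{(l)}$ is zero, so every hidden output is $\tanh(0)=0$. During backpropagation the hidden-layer sensitivities are computed through $W^{(l+1)}=0$, so $\delta^{(l)} = 0$ for $l<L$ and the gradients $G^{(l)}(x_n)$ of all hidden-layer weights vanish. In the output layer only the bias weight can receive a nonzero gradient, because the other components of $x^{(L-1)}$ are zero. The non-bias weights therefore stay at zero through the iterations, and $E_{in}$ cannot improve beyond what a constant output achieves.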
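
A quick numerical check of what backpropagation returns at an all-zero initialization. This is a minimal sketch assuming $\tanh$ in every layer and squared error on a single example; the architecture $[2, 3, 1]$, the data point and the function name `gradients` are made up for illustration.

```python
import numpy as np


def gradients(W, x, y):
    """Backpropagation for a tanh network with squared error on one example."""
    xs = [np.vstack([[[1.0]], x])]
    for l, Wl in enumerate(W):
        out = np.tanh(Wl.T @ xs[-1])
        xs.append(out if l == len(W) - 1 else np.vstack([[[1.0]], out]))
    # Output sensitivity: 2 (x^(L) - y) * theta'(s^(L)), with theta' = 1 - tanh^2.
    delta = 2.0 * (xs[-1] - y) * (1.0 - xs[-1] ** 2)
    grads = [xs[-2] @ delta.T]
    for l in range(len(W) - 1, 0, -1):
        hidden = xs[l][1:]  # outputs of layer l, bias removed
        delta = (1.0 - hidden ** 2) * (W[l] @ delta)[1:]
        grads.insert(0, xs[l - 1] @ delta.T)
    return grads


W_zero = [np.zeros((3, 3)), np.zeros((4, 1))]  # architecture d = [2, 3, 1]
x = np.array([[0.5], [-1.0]])
grads = gradients(W_zero, x, y=1.0)
for g in grads:
    print(g)
```

In this setup every entry of the hidden-layer gradient is zero, and the only entry of the output-layer gradient that can be nonzero is the one for the bias weight.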

#### Exercise 7.10

For layers 1 to $L$, the weight matrix $W^{(l)}$ has dimension $(d^{(l-1)}+1) \times d^{(l)}$, so the total number of weight parameters is

$Q = \sum^L_{l=1}d^{(l)}(d^{(l-1)}+1)$

For $L=3$ and $d^{(1)}=d^{(2)}=10$, assuming $d^{(0)}=d$ (the dimension of the input $x$) and $d^{(3)}=1$, we get $Q = 10(d+1) + 10\cdot 11 + 1\cdot 11 = 131 + 10d$.
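
The count is easy to verify programmatically. A small sketch, where `count_weights` is a helper name introduced here and the input dimensions are arbitrary:

```python
def count_weights(d):
    """Q = sum over layers of d[l] * (d[l-1] + 1) for architecture d = [d0, ..., dL]."""
    return sum(d[l] * (d[l - 1] + 1) for l in range(1, len(d)))


# Two hidden layers of 10 nodes each and a single output node.
for d0 in [1, 2, 16, 256]:  # input dimensions chosen only for illustration
    assert count_weights([d0, 10, 10, 1]) == 131 + 10 * d0
```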

#### Exercise 7.11


Taking the derivative of the second term of $E_{aug}(w,\lambda)$ with respect to $w^{(l)}_{ij}$ gives $\frac{\lambda}{N}\frac{2w^{(l)}_{ij}}{\left(1+(w^{(l)}_{ij})^2\right)^2}$

This proves the equation.

We use the ratio of gradient versus weight to check the rate of decay.

From the derivative, the ratio of the second term to the weight $w^{(l)}_{ij}$ is $\frac{2\lambda}{N}\frac{1}{\left(1+(w^{(l)}_{ij})^2\right)^2}$. The factor $\frac{1}{\left(1+(w^{(l)}_{ij})^2\right)^2}$ attains its maximum value of 1 as $w^{(l)}_{ij} \to 0$, so the ratio is at most $\frac{2\lambda}{N}$, and it falls off like $(w^{(l)}_{ij})^{-4}$ for large weights. So the smaller the weight, the larger the decay relative to the weight itself.

This indicates that small weights decay much faster than large ones.
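
To see the effect in numbers, the ratio can be evaluated for a few weight magnitudes. A minimal sketch; `decay_ratio` is a name introduced here, and the values of `lam` and `N` are illustrative rather than taken from the text.

```python
import numpy as np


def decay_ratio(w, lam, N):
    """Regularizer gradient divided by the weight: (2*lam/N) / (1 + w^2)^2."""
    return 2.0 * lam / N / (1.0 + w ** 2) ** 2


lam, N = 0.1, 100  # illustrative values, not from the text
for w in [0.01, 0.1, 1.0, 10.0]:
    print(w, decay_ratio(w, lam, N))
```

The printed ratios decrease monotonically in $|w|$, starting close to $2\lambda/N$ for the smallest weight.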

In [25]:
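import numpy as np

# Forward pass of Example 7.1 with tanh activations.
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])
W2 = np.array([[0.2], [1], [-3]])
W3 = np.array([[1], [2]])
x0 = np.array([[1], [2]])  # bias 1 stacked on the input x = 2
s1 = np.matmul(W1.transpose(), x0)  # approximately [[0.7], [1.0]]
# x1 = [1; tanh(s1)], with the tanh values entered by hand
x1 = np.array([[1], [0.60436778], [0.76159416]])
s2 = np.matmul(W2.transpose(), x1)
x20 = np.tanh(s2)  # approximately -0.90154566
# x2 = [1; tanh(s2)], with the tanh value entered by hand
x2 = np.array([[1], [-0.90154566]])
s3 = np.matmul(W3.transpose(), x2)
x3 = np.tanh(s3)
x3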

array([[-0.66576144]])

In [24]:
# Second entry of de/dW3 in Exercise 7.8: x2[1] * delta3
-2.1 * -8.4

17.64

In [8]:
np.tanh(-0.8)

-0.6640367702678489