In [26]:
import numpy as np
from neural_net import NeuralNet
%load_ext autoreload
%autoreload 2
import scipy.io
%matplotlib inline


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [27]:
filename = "toy_multiclass_1"
nn = NeuralNet.fromMAT(filename, type="train")
print(nn.X)
print(nn.T)

[[  1.00000000e+00   1.01180584e+00  -5.38457383e-02]
 [  1.00000000e+00   1.06541790e+00  -1.84298034e-01]
 [  1.00000000e+00   7.98278299e-01  -1.75002024e-01]
 [  1.00000000e+00   9.53283007e-01  -1.91718955e-01]
 [  1.00000000e+00   1.12958282e+00   1.23101001e-01]
 [  1.00000000e+00   1.04653253e+00  -1.49017208e-01]
 [  1.00000000e+00   9.44380555e-01  -1.14392868e-01]
 [  1.00000000e+00   1.01263348e+00   2.88987258e-02]
 [  1.00000000e+00   9.65880694e-01  -2.74318154e-01]
 [  1.00000000e+00   7.43437429e-01   1.77656391e-01]
 [  1.00000000e+00   1.10137953e+00  -5.29204447e-01]
 [  1.00000000e+00   1.15520049e+00  -7.77235681e-02]
 [  1.00000000e+00   1.05060393e+00  -1.20411222e-01]
 [  1.00000000e+00   1.07029575e+00   1.42107542e-01]
 [  1.00000000e+00   1.07903931e+00   8.13729782e-02]
 [  1.00000000e+00   1.06848329e+00   1.19003762e-01]
 [  1.00000000e+00   9.09102025e-01  -1.00930070e-01]
 [  1.00000000e+00   1.08007921e+00   3.71085908e-01]
 [  1.00000000e+00   1.09905

In [28]:
print(np.shape(nn.T))
print(nn.M)
print(np.shape(nn.W1))
print(np.shape(nn.W2))

(300, 3)
30
(30, 3)
(3, 31)


The activation function is the sigmoid:

$$ g(z) = \frac{1}{1 + e^{-z}} $$
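As a quick sanity check, here is a minimal NumPy sketch of this activation. The name `sigmoid` is our own choice for illustration and is not necessarily what `neural_net.py` uses.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid g(z) = 1 / (1 + exp(-z)), applied element-wise."""
    z = np.asarray(z, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))

g0 = sigmoid(0.0)                        # g(0) is 0.5 by symmetry
g_vec = sigmoid(np.array([-1.0, 1.0]))   # g(-1) + g(1) sums to 1
```

Because the function accepts arrays, it can be applied to a whole vector of activations at once.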



Each $W^{(1)}$ and $W^{(2)}$ (they call them $w^{(1)}$ and $w^{(2)}$) is a matrix of weights:


$$ W^{(1)} = \begin{bmatrix} w_{1,0} & w_{1,1} & \cdots & w_{1,D}\\
w_{2,0} & w_{2,1} & \cdots & w_{2,D} \\
\vdots & \vdots & \ddots & \vdots \\
w_{M,0} & w_{M,1} & \cdots & w_{M,D}  \end{bmatrix}$$

$$ W^{(2)} = \begin{bmatrix} w_{1,0} & w_{1,1} & \cdots & w_{1,M}\\
w_{2,0} & w_{2,1} & \cdots & w_{2,M} \\
\vdots & \vdots & \ddots & \vdots \\
w_{K,0} & w_{K,1} & \cdots & w_{K,M}  \end{bmatrix}$$


To start, we assume $M = N/10$: the number of hidden units in our single hidden layer is $M$, and we initialize it to $N/10$, where $N$ is the number of training examples we have.

For our toy dataset, we just have $D = 2$.
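To make the shapes concrete, here is a sketch of how the two weight matrices could be allocated. The small Gaussian initialization, the seed, and the names `init_weights` and `scale` are assumptions for illustration; this notebook does not show how `NeuralNet` actually initializes `W1` and `W2`.

```python
import numpy as np

def init_weights(N, D, K, scale=0.01, seed=0):
    """Allocate W1 (M x (D+1)) and W2 (K x (M+1)) with M = N // 10 hidden units.

    Column 0 of each matrix holds the bias weights w_{j,0}.
    The Gaussian initialization is an assumption, not the author's method.
    """
    rng = np.random.RandomState(seed)
    M = N // 10
    W1 = scale * rng.randn(M, D + 1)
    W2 = scale * rng.randn(K, M + 1)
    return M, W1, W2

M, W1, W2 = init_weights(N=300, D=2, K=3)
```

With $N = 300$, $D = 2$ and $K = 3$ this reproduces the shapes printed earlier: `(30, 3)` for `W1` and `(3, 31)` for `W2`.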

The loss function is the negative log-likelihood (NLL). Given the training data and parameters $w$, it is:

$$ l(w) = \sum_{i=1}^N \sum_{k=1}^K \big[ -y_k^{(i)} \log h_k(x^{(i)}, w) - (1 - y_k^{(i)}) \log(1 - h_k(x^{(i)}, w)) \big] $$

To avoid overfitting, we add regularization terms using the Frobenius norm $||A||_F$ and use the following as our cost function:

$$ J(w) = l(w) + \lambda \big( ||w^{(1)}||^2_F + ||w^{(2)}||^2_F \big) $$
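The loss and cost can be sketched directly from these two formulas. The function names, the `eps` clipping (added to keep the logarithms finite) and the tiny example values are our own assumptions, not part of the homework code.

```python
import numpy as np

def nll(Y, T, eps=1e-12):
    """Negative log-likelihood summed over samples and classes.

    Y: (N, K) network outputs h_k(x^(i), w), each in (0, 1)
    T: (N, K) one-hot targets y_k^(i)
    eps clips Y away from 0 and 1 so the logs stay finite (our addition).
    """
    Y = np.clip(Y, eps, 1.0 - eps)
    return np.sum(-T * np.log(Y) - (1.0 - T) * np.log(1.0 - Y))

def cost(Y, T, W1, W2, lam):
    """Regularized cost J(w) = l(w) + lambda * (||W1||_F^2 + ||W2||_F^2)."""
    return nll(Y, T) + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

# Made-up example: two samples, two classes, every output equal to 0.5.
Y = np.array([[0.5, 0.5], [0.5, 0.5]])
T = np.array([[1.0, 0.0], [0.0, 1.0]])
W1 = np.ones((2, 3))
W2 = np.ones((2, 3))
J = cost(Y, T, W1, W2, lam=0.1)
```

Note that the Frobenius norm as written penalizes every entry of each matrix, including the bias column.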

### Filling in the holes

The homework description says what to do, but it does not give enough detail to actually implement it.

We also need the activations:

$$ a_j^{(1)} = \sum_{i=1}^D w_{ji}^{(1)}x_i + w_{j0}^{(1)}$$

We then use $g$ to calculate the "feature" for each hidden unit:

$$ z_j = g(a_j^{(1)}) $$


To nicely vectorize our computation of the activations $a$, the first step is to augment our input data with a "1" for each training sample, so that the $M$ bias weights can be included in the weight matrix:

$$ X_{aug} = \begin{bmatrix} 1_{N \times 1} &| & X\end{bmatrix} $$

Each row of $X_{aug}$ is one augmented sample $x_{aug}^T$.

We can now nicely vectorize our computation for the activations:
    
$$ a^{(1)} = W^{(1)} x_{aug} $$

where we note the dimensionality of each:
- $x_{aug}$ is a vector of dimension $(D+1) \times 1$, where $D$ is the dimensionality of the input data
- $W^{(1)}$ is a matrix of dimension $M \times (D+1)$, where $M$ is the number of hidden units
- $a^{(1)}$ is a vector of dimension $M \times 1$

Also note that $x_{aug}$ here is a single training sample $x_n$, for one $n = 1, \ldots, N$, not the whole dataset.
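The augmentation and the matrix product can be sketched as follows. The helper name `augment` and the weight values are made up for illustration; the two input rows are rounded from the data printed at the top of the notebook.

```python
import numpy as np

def augment(X):
    """Prepend a column of ones to X (N x D) so the bias weights live in W."""
    X = np.atleast_2d(X)
    ones = np.ones((X.shape[0], 1))
    return np.hstack([ones, X])

X = np.array([[1.0118, -0.0538],
              [1.0654, -0.1843]])               # two samples, D = 2
W1 = np.arange(12, dtype=float).reshape(4, 3)   # M = 4 hidden units (made-up weights)

X_aug = augment(X)                   # shape (N, D+1)
a_single = np.dot(W1, X_aug[0])      # one sample: shape (M,)
A_batch = np.dot(X_aug, W1.T)        # all samples at once: shape (N, M)
```

Row $n$ of `A_batch` equals the single-sample activation vector for sample $n$, so the batch form is just the per-sample product stacked over all rows.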

### Implementing backprop

The notes from class are not super clear, but Bishop 5.3 is.

Here's what we'll do:

- Apply an input vector $x_n$ to the network and forward propagate through the network using 5.48 and 5.49 in order to find the activations of all hidden and output units
- Evaluate the $\delta_k$ for all the output units using 5.54
- Backpropagate the $\delta$'s using 5.56 to obtain $\delta_j$ for each hidden unit in the network
- Use 5.53 to evaluate the required derivatives


5.48 (should be implemented as a matrix multiplication as described above): $$a_j = \sum_i w_{ji}z_i$$

5.49 (is already vectorized as long as $h()$ accepts numpy arrays as input): $$z_j = h(a_j)$$
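Applying 5.48 and 5.49 twice gives the forward pass for our one-hidden-layer network. This is a sketch under two assumptions: the output units also use the sigmoid, and the hidden features are re-augmented with a leading 1 so that column 0 of $W^{(2)}$ acts as the output bias. The function names are ours, not necessarily those of `nn.forwardProp`.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x_aug, W1, W2, h=sigmoid):
    """Forward propagate one augmented sample through a 1-hidden-layer net.

    x_aug: (D+1,) input with a leading 1
    W1: (M, D+1), W2: (K, M+1)
    Returns hidden activations a1, hidden features z1 (with a leading 1
    for the output-layer bias), output activations a2 and outputs y.
    """
    a1 = np.dot(W1, x_aug)                  # 5.48 for the hidden layer
    z1 = np.concatenate(([1.0], h(a1)))     # 5.49, then prepend the bias unit
    a2 = np.dot(W2, z1)                     # 5.48 for the output layer
    y = h(a2)                               # 5.49, assuming sigmoid outputs
    return a1, z1, a2, y

# All-zero weights make every activation 0, so every output is g(0) = 0.5.
W1 = np.zeros((30, 3))
W2 = np.zeros((3, 31))
a1, z1, a2, y = forward(np.array([1.0, 1.0118, -0.0538]), W1, W2)
```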

5.54 (can just be implemented as one subtraction of K-dimensional vectors): $$\delta_k = y_k - t_k$$

5.56 for going from output to hidden layer: $$\delta_j = h'(a_j)\sum_k w_{kj} \delta_k$$

Can be vectorized as:

$$\delta_{\text{prev layer}} = h'(a_{\text{prev layer}}) \ \ .* \ \ \big( W_{\text{no bias weights}}^T \, \delta_{\text{outputs}} \big) $$

where .* is the element-wise product, $\delta_{\text{outputs}}$ is a $K$-dimensional vector, and $W_{\text{no bias weights}}^T$ is $M \times K$.
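A minimal sketch of 5.54 and 5.56 for one sample, assuming the sigmoid is used as $h$ (so $h'(a) = g(a)(1 - g(a))$) and that the bias weights sit in column 0 of $W^{(2)}$. The function names and the example values are ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    """Derivative of the sigmoid: g'(a) = g(a) * (1 - g(a))."""
    s = sigmoid(a)
    return s * (1.0 - s)

def backprop_deltas(a1, y, t, W2):
    """Return (delta_hidden, delta_out) for one sample.

    a1: (M,) hidden activations; y, t: (K,) outputs and targets;
    W2: (K, M+1) with the bias weights in column 0.
    """
    delta_out = y - t                           # 5.54
    W2_no_bias = W2[:, 1:]                      # drop the bias column -> (K, M)
    delta_hidden = sigmoid_prime(a1) * np.dot(W2_no_bias.T, delta_out)  # 5.56
    return delta_hidden, delta_out

a1 = np.zeros(4)                    # M = 4, so g'(a1) = 0.25 everywhere
W2 = np.ones((3, 5))                # K = 3
y = np.array([0.5, 0.5, 0.5])
t = np.array([1.0, 0.0, 0.0])
delta_hidden, delta_out = backprop_deltas(a1, y, t, W2)
```

The bias column is dropped because the constant bias unit has no incoming weights, so there is no $\delta$ to propagate to it.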

5.53: $$ \frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i $$

Which can be vectorized as an outer product with the same shape as $W$ (rows indexed by $j$, columns by $i$):

$$ \frac{\partial E_n}{\partial W} = \delta z^T$$
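In NumPy the outer product is `np.outer`. Putting $\delta$ first gives a gradient laid out like $W$, with entry $(j, i)$ equal to $\delta_j z_i$; `np.outer(z, delta)` is its transpose. The helper name and the values below are made up for illustration.

```python
import numpy as np

def weight_gradient(delta, z):
    """Per-sample gradient dE_n/dW with entry (j, i) = delta_j * z_i (5.53)."""
    return np.outer(delta, z)

delta_out = np.array([-0.5, 0.5, 0.5])        # K = 3 output deltas
z1 = np.array([1.0, 0.2, 0.4, 0.6, 0.8])      # M + 1 = 5 hidden features incl. bias unit
grad_W2 = weight_gradient(delta_out, z1)      # shape (K, M+1), same as W2
```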



When implementing this as a batch method, we finally sum over all input data samples:


$$ \frac{\partial E}{\partial w_{ji}} = \sum_{n} \frac{\partial E_n}{\partial w_{ji}}$$
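The sum over samples does not need an explicit loop: stacking the per-sample $\delta$'s and $z$'s as rows turns it into a single matrix product. The sketch below checks this against the explicit sum, using random stand-ins for the real deltas and features; `batch_gradient` is a hypothetical helper name.

```python
import numpy as np

def batch_gradient(Deltas, Z):
    """Sum of per-sample outer products delta_n z_n^T over all N samples.

    Deltas: (N, J) deltas for one layer, one row per sample
    Z:      (N, I) inputs to that layer, one row per sample
    Returns a (J, I) matrix equal to sum_n outer(Deltas[n], Z[n]).
    """
    return np.dot(Deltas.T, Z)

rng = np.random.RandomState(0)
Deltas = rng.randn(300, 3)      # N = 300, K = 3 (random stand-ins)
Z = rng.randn(300, 31)          # M + 1 = 31

grad_loop = sum(np.outer(d, z) for d, z in zip(Deltas, Z))   # explicit sum over n
grad_fast = batch_gradient(Deltas, Z)                        # one matrix product
```

This covers only the data term $E$; differentiating the regularizer $\lambda ||W||_F^2$ from the cost function adds $2 \lambda W$ to each gradient.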

In [29]:
print(nn.X[0,:])

[ 1.          1.01180584 -0.05384574]


In [30]:
nn.forwardProp(nn.X[0,:])

In [31]:
W2 = np.ones((10,11))
print(np.shape(W2))
print(np.shape(W2.T))

(10, 11)
(11, 10)


In [32]:
1 / (1 + np.exp(-1))

0.7310585786300049

In [33]:
nn.train()

(30,)
(3, 30)
(31, 3)
