# A Simple Neural Network
## Part 3: Hidden layer

This tutorial is part 3 of the series of tutorials on neural networks (TODO: url). While the previous tutorials described very simple single-layer regression and classification models, this tutorial describes a 2-class classification neural network with 2 input dimensions and a non-linear hidden layer with 2 dimensions. We did not add bias parameters to the previous 2 models, but we will add them to this one. The network of this model is shown in the following figure:

![Image of the logistic model](https://dl.dropboxusercontent.com/u/8938051/Blog_images/SimpleANN03.png)

In [None]:
# Python imports
import numpy as np # Matrix and vector computation package
import matplotlib.pyplot as plt  # Plotting library
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
np.random.seed(seed=1)

## Define the dataset 

In this example the target classes $t$ are generated from 2 class distributions: blue ($t=1$) and red ($t=0$). The red class is a circular distribution that surrounds the distribution of the blue class. This results in a 2D dataset that is not linearly separable. The model from part 2 won't be able to classify both classes correctly since it can only learn linear separators. By adding a hidden layer the model will be able to learn a non-linear separator.

In [None]:
# Define and generate the samples
nb_of_samples_per_class = 50  # The number of samples in each class

# Generate blue samples
blue_mean = [0,0]  # The mean of the blue class
blue_std_dev = 0.3  # standard deviation of blue class
x_blue = np.random.randn(nb_of_samples_per_class, 2) * blue_std_dev + blue_mean

# Generate red samples as circle around blue samples
red_radius_mean = 1.3  # mean of the radius
red_radius_std_dev = 0.2  # standard deviation of the radius
red_rand_radius = np.random.randn(nb_of_samples_per_class) * red_radius_std_dev + red_radius_mean
red_rand_angle = 2 * np.pi * np.random.rand(nb_of_samples_per_class)
x_red = np.asmatrix([red_rand_radius * np.cos(red_rand_angle), 
                     red_rand_radius * np.sin(red_rand_angle)]).T

# Merge samples in set of input variables x, and corresponding set of
# output variables t
x = np.vstack((x_blue, x_red))
T = np.vstack((np.ones((x_blue.shape[0],1)), np.zeros((x_red.shape[0],1))))

In [None]:
# Plot both classes on the x1, x2 plane
plt.plot(x_red[:,0], x_red[:,1], 'ro', label='class red')
plt.plot(x_blue[:,0], x_blue[:,1], 'bo', label='class blue')
plt.grid()
plt.legend(loc=2)
plt.xlabel('x1')
plt.ylabel('x2')
# plt.axis([-4, 4, -4, 4])
plt.title('red vs blue classes in the input space')
plt.show()

## Optimization by backpropagation

We will train this model by using the [backpropagation](http://en.wikipedia.org/wiki/Backpropagation) algorithm that is typically used to train neural networks. Each iteration of the backpropagation algorithm consists of two steps:

1. A forward propagation step to compute the output of the network.
2. A backward propagation step in which the error at the end of the network is propagated backwards through all the neurons, while updating their parameters.

### 1. Forward step

During the forward step the input will be propagated layer by layer through the network to compute the final output of the network.

#### Compute activations of hidden layer
The $n$ input samples with $2$ variables each are given as an $n \times 2$ matrix $X = [[x_{11},x_{12}] \ldots [x_{n1},x_{n2}]]$ where $x_{ij}$ is the value of the $j$-th variable of the $i$-th input sample. These inputs are projected onto the 2 dimensions of the hidden layer $H$ by the following computation: 

$$H = \sigma(X \cdot W_h + b_h) = \frac{1}{1+e^{-(X \cdot W_h + b_h)}} $$

With $H$ resulting in an $n \times 2$ matrix, and where $W_h = [[w_{h11}, w_{h12}],[w_{h21}, w_{h22}]]$ is the $2 \times 2$ weight matrix ($w_{hij}$ is the weight of the connection between input variable $i$ and hidden neuron activation $j$), and $b_h = [b_{h1}, b_{h2}]$ is the $1 \times 2$ bias vector ($b_{hi}$ is the bias of hidden neuron activation $i$), which is added to every row of $X \cdot W_h$.

Note that by adding a column of $1$s we can rewrite $X$ as a $n \times 3$ matrix:

$$X =
\begin{bmatrix} 
1 & x_{11} & x_{12} \\
\vdots & \vdots & \vdots \\
1 & x_{n1} & x_{n2}
\end{bmatrix}$$

And we can incorporate the bias $b_h$ into $W_h$ as a new $3 \times 2$ matrix:

$$W_h =
\begin{bmatrix} 
b_{h1} & b_{h2} \\
w_{h11} & w_{h12} \\
w_{h21} & w_{h22}
\end{bmatrix}$$

After which we can simplify the computation of $H$ by:

$$H = \frac{1}{1+e^{-(X \cdot W_h)}} $$
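
The equivalence of the two formulations can be checked numerically. The following is a minimal sketch, assuming plain NumPy arrays and made-up values; the names `X_raw`, `W`, `b`, `X_aug` and `W_aug` are illustrative and are not part of this tutorial's code:

```python
import numpy as np

# Hypothetical toy values, chosen only for illustration
X_raw = np.array([[0.5, -1.0],
                  [2.0,  0.3],
                  [-0.7, 1.5]])   # n=3 samples with 2 variables each
W = np.array([[0.1, -0.2],
              [0.4,  0.3]])       # 2 x 2 weight matrix
b = np.array([[0.05, -0.1]])      # 1 x 2 bias vector

# Explicit bias: Z = X . W + b (b is broadcast over the rows)
Z_explicit = X_raw.dot(W) + b

# Bias trick: prepend a column of ones to X and stack b on top of W
X_aug = np.hstack((np.ones((X_raw.shape[0], 1)), X_raw))  # 3 x 3
W_aug = np.vstack((b, W))                                  # 3 x 2
Z_trick = X_aug.dot(W_aug)

print(np.allclose(Z_explicit, Z_trick))  # True
```

Both versions give the same pre-activations, so applying the logistic function to either one yields the same $H$.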


#### Compute activations of output 

To compute the output activations we can use the same trick as above to simplify the equations: we add a column of $1$s so that $H$ becomes an $n \times 3$ matrix:

$$H =
\begin{bmatrix} 
1 & h_{11} & h_{12} \\
\vdots & \vdots & \vdots \\
1 & h_{n1} & h_{n2}
\end{bmatrix}$$

and write $W_o$ as a $3 \times 1$ matrix:

$$W_o =
\begin{bmatrix} 
b_o \\
w_{o1} \\
w_{o2}
\end{bmatrix}$$

The output activations $Y = [y_{1} \ldots y_{n}]^T$ are then computed from the hidden activations by the output layer according to:

$$Y = \sigma(H \cdot W_o) = \frac{1}{1+e^{-(H \cdot W_o)}} $$

With $Y$ resulting in a $n \times 1$ matrix.

In [None]:
# Add a column of ones to X
X = np.hstack((np.ones((x.shape[0],1)),x))

# Define the logistic function
def logistic(z): return 1 / (1 + np.exp(-z))

# Function to compute the hidden activations
def hidden_activations(X, Wh):
    H = logistic(X * Wh)
    # Add a column of ones to H
    return np.hstack((np.ones((H.shape[0],1)),H))

# Define output layer feedforward
def output_activations(H, Wo):
    return logistic(H * Wo)

# Define the neural network function
def nn(X, Wh, Wo): 
    return output_activations(hidden_activations(X, Wh), Wo)

# Define the neural network prediction function that only returns
#  1 or 0 depending on the predicted class
def nn_predict(X, Wh, Wo): 
    return np.around(nn(X, Wh, Wo))

In [None]:
# Weights and biases
bh = np.asarray([0, 0])
Wh = np.asmatrix([bh, [1, 1],[1, 1]])

bo = np.asarray([0])
Wo = np.asmatrix([bo, [1], [1]])


print('X.shape: {}'.format(X.shape))
print('Wh.shape: {}'.format(Wh.shape))
print('bh.shape: {}'.format(bh.shape))

H = hidden_activations(X, Wh)
print('H.shape: {}'.format(H.shape))
print('Wo.shape: {}'.format(Wo.shape))
print('bo.shape: {}'.format(bo.shape))

O = output_activations(H, Wo)
print('O.shape: {}'.format(O.shape))

Y = nn(X, Wh, Wo)
print('Y.shape: {}'.format(Y.shape))

Y_pred = nn_predict(X, Wh, Wo)
print('Y_pred.shape: {}'.format(Y_pred.shape))

### 2. Backward step

The backward step will begin with computing the cost at the output node. This cost will then be propagated backwards layer by layer through the network to update the parameters.

The [gradient descent](http://en.wikipedia.org/wiki/Gradient_descent) algorithm is used in every layer to update the parameters in the direction of the negative [gradient](http://en.wikipedia.org/wiki/Gradient).

The parameters $w$ are updated by $w(k+1) = w(k) - \Delta w(k+1)$. $\Delta w$ is defined as: $\Delta w = \mu \delta_{w}$ with $\mu$ the learning rate and $\delta_{w}$ the local gradient at the output of a neuron that has $w$ as a parameter.
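
As an illustration of this update rule, here is a minimal sketch on a toy one-parameter cost $f(w) = (w-3)^2$. The cost, the starting value and the learning rate are assumptions made up for this example and have nothing to do with the network itself:

```python
# Toy cost f(w) = (w - 3)^2 with gradient df/dw = 2 * (w - 3).
# This stands in for the network cost purely for illustration.
def gradient(w):
    return 2.0 * (w - 3.0)

mu = 0.1  # learning rate (illustrative value)
w = 0.0   # initial parameter value
for k in range(50):
    delta_w = mu * gradient(w)  # Delta w = mu * gradient
    w = w - delta_w             # w(k+1) = w(k) - Delta w(k+1)

# After 50 updates w has moved from 0 to very close to the minimum at 3
print(abs(w - 3.0) < 1e-3)  # True
```

Each update shrinks the distance to the minimum by a constant factor here; for the neural network the same rule is applied to every weight, with the gradient supplied by backpropagation.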

#### Compute the cost function

The cost function $\xi$ used in this model is the same cross-entropy cost function used in part 2:

$$\xi(T,Y) = - \sum_{i=1}^{n} \left[ t_i \log(y_i) + (1-t_i) \log(1-y_i) \right]$$
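
To get a feeling for this cost, the sketch below evaluates it for two made-up sets of predictions on three samples. The arrays `T`, `Y_good` and `Y_bad` and the helper name `cross_entropy` are hypothetical and only serve this example:

```python
import numpy as np

def cross_entropy(Y, T):
    # Sum of the per-sample cross-entropy terms
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

T = np.array([[1.0], [0.0], [1.0]])       # target classes
Y_good = np.array([[0.9], [0.1], [0.8]])  # predictions close to the targets
Y_bad = np.array([[0.2], [0.7], [0.4]])   # predictions far from the targets

cost_good = cross_entropy(Y_good, T)
cost_bad = cross_entropy(Y_bad, T)
print(cost_good < cost_bad)  # True
```

Predictions that put high probability on the correct class give a low cost, while confident wrong predictions are penalized heavily.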


#### Update the output layer

At the output the gradient $\delta_{w}$ is computed by ${\partial \xi}/{\partial w}$. We worked out this formula for batch processing in part 2:

$$\frac{\partial \xi}{\partial w_{oi}} = \frac{\partial z_{o}}{\partial w_{oi}} \frac{\partial y}{\partial z_{o}} \frac{\partial \xi}{\partial y} = \sum_{j=1}^n h_{ji} (y_j - t_j)$$

With $z_{o} = H \cdot W_o$. We can write this formula out as matrix operations with the parameters of the output layer:

$$\frac{\partial \xi}{\partial W_o} = H^T \cdot (Y-T) $$

The resulting gradient is a $3 \times 1$ [Jacobian matrix](http://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant):

$$J_o =
\begin{bmatrix} 
\frac{\partial \xi}{\partial b_o} \\
\frac{\partial \xi}{\partial w_{o1}} \\
\frac{\partial \xi}{\partial w_{o2}}
\end{bmatrix}$$
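
One way to gain confidence in this gradient is to compare it with a finite-difference approximation of the cost. The following is a minimal sketch under stated assumptions: it uses plain NumPy arrays instead of the `np.matrix` objects of this tutorial, random made-up values for the hidden activations and weights, and a central difference with a step `eps` that is an arbitrary choice:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(Y, T):
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

rng = np.random.RandomState(0)
n = 5
# Hidden activations with a leading column of ones (n x 3), made up here
H = np.hstack((np.ones((n, 1)), rng.rand(n, 2)))
T = np.array([[1.0], [0.0], [1.0], [0.0], [1.0]])
Wo = rng.randn(3, 1)  # rows: b_o, w_o1, w_o2

# Analytic gradient: H^T . (Y - T)
Y = logistic(H.dot(Wo))
J_analytic = H.T.dot(Y - T)

# Numerical gradient: perturb one parameter at a time
eps = 1e-6
J_numeric = np.zeros_like(Wo)
for i in range(Wo.shape[0]):
    W_plus = Wo.copy()
    W_plus[i, 0] += eps
    W_minus = Wo.copy()
    W_minus[i, 0] -= eps
    J_numeric[i, 0] = (cost(logistic(H.dot(W_plus)), T)
                       - cost(logistic(H.dot(W_minus)), T)) / (2 * eps)

print(np.allclose(J_analytic, J_numeric, atol=1e-5))
```

If the two gradients agree up to the tolerance, the matrix expression matches the derivative of the cost function.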

#### Update the hidden layer

At the hidden layer the gradient $\delta_{w}$ of the neuron with parameter $w$ is computed the same way:

$$\frac{\partial \xi}{\partial w_{hij}} = \frac{\partial z_{hj}}{\partial w_{hij}} \frac{\partial h_j}{\partial z_{hj}} \frac{\partial \xi}{\partial h_j}$$

With, for each sample $k$, $z_{hj} = \sum_{i=0}^{2} x_{ki} \cdot w_{hij}$, where $x_{k0} = 1$ is the bias input and $w_{h0j} = b_{hj}$. ${\partial \xi}/{\partial h_j}$ is the gradient of the error with respect to the output of the hidden node $h_j$. This gradient can be interpreted as the contribution of $h_j$ to the total error.
How do we define this error at the output of the hidden nodes? For each hidden neuron $h_j$ we can compute it as the sum of the errors propagated back along the connections going out of $h_j$. Since there is only one output neuron here, the sum has a single term:

$$\frac{\partial \xi}{\partial h_j} = \frac{\partial z_{o}}{\partial h_j} \frac{\partial \xi}{\partial z_{o}} = w_{oj} \frac{\partial \xi}{\partial z_{o}}$$

Because of this, and because ${\partial z_{hj}}/{\partial w_{hij}} = x_{i}$ and ${\partial h_j}/{\partial z_{hj}} = h_j * (1-h_j)$ and ${\partial \xi}/{\partial z_{o}} = y - t = E$ we compute ${\partial \xi}/{\partial w_{hij}}$ over all samples $k$ as:

$$\frac{\partial \xi}{\partial w_{hij}} = \sum_{k=1}^n x_{ki} * h_{kj} * (1-h_{kj}) * w_{oj}  * (y_k-t_k)$$

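This can be written in matrix format as:

$$\frac{\partial \xi}{\partial W_h} = X^T \cdot \left[ H \circ (1-H) \circ \left( (Y - T) \cdot \tilde{W}_o^T \right) \right]$$

With $\circ$ the [elementwise product](http://en.wikipedia.org/wiki/Hadamard_product), $H$ the $n \times 2$ matrix of hidden activations without the column of $1$s, and $\tilde{W}_o = [w_{o1}, w_{o2}]^T$ the output weights without the bias $b_o$.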

The resulting gradient is a $3 \times 2$ [Jacobian matrix](http://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant):

$$J_h =
\begin{bmatrix} 
\frac{\partial \xi}{\partial b_{h1}} & \frac{\partial \xi}{\partial b_{h2}} \\
\frac{\partial \xi}{\partial w_{h11}} & \frac{\partial \xi}{\partial w_{h12}} \\
\frac{\partial \xi}{\partial w_{h21}} & \frac{\partial \xi}{\partial w_{h22}}
\end{bmatrix}$$
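
The hidden layer gradient can be checked against finite differences in the same way as any other derivative. The sketch below implements the sum over all samples $k$ given above with plain NumPy arrays. It is only an illustration under stated assumptions: the data and weights are random made-up values, and the names `forward` and `E_hidden` are hypothetical helpers that do not appear elsewhere in this tutorial:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(Y, T):
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def forward(X, Wh, Wo):
    H = logistic(X.dot(Wh))                            # n x 2, no bias column
    H_aug = np.hstack((np.ones((H.shape[0], 1)), H))   # n x 3, with ones
    Y = logistic(H_aug.dot(Wo))                        # n x 1
    return H, Y

rng = np.random.RandomState(1)
n = 6
X = np.hstack((np.ones((n, 1)), rng.randn(n, 2)))  # n x 3, with ones
T = np.array([[1.0], [0.0], [1.0], [0.0], [1.0], [0.0]])
Wh = rng.randn(3, 2)  # rows: bias, weights of x1, weights of x2
Wo = rng.randn(3, 1)  # rows: b_o, w_o1, w_o2

# Analytic gradient: the error (y_k - t_k) is sent back through the
# output weights w_oj (bias row excluded), giving an n x 2 matrix
H, Y = forward(X, Wh, Wo)
E_hidden = (Y - T).dot(Wo[1:, :].T)
Jh_analytic = X.T.dot(H * (1 - H) * E_hidden)

# Numerical gradient by central differences
eps = 1e-6
Jh_numeric = np.zeros_like(Wh)
for i in range(Wh.shape[0]):
    for j in range(Wh.shape[1]):
        W_plus = Wh.copy()
        W_plus[i, j] += eps
        W_minus = Wh.copy()
        W_minus[i, j] -= eps
        Jh_numeric[i, j] = (cost(forward(X, W_plus, Wo)[1], T)
                            - cost(forward(X, W_minus, Wo)[1], T)) / (2 * eps)

print(np.allclose(Jh_analytic, Jh_numeric, atol=1e-5))
```

Note that the bias row of the output weights is left out when propagating the error back, because the constant $1$ fed into the output layer does not depend on the hidden weights.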

To start the gradient descent algorithm, you typically pick the initial parameters at random and then update them in the direction of the negative gradient, computed with the backpropagation algorithm.
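
Putting the forward step, the two gradients and the update rule together gives a training loop. The following is a rough, self-contained sketch and not the final implementation of this tutorial: it uses plain NumPy arrays, a smaller dataset generated the same way as above, and a learning rate, iteration count and random initialization that are arbitrary assumptions:

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(Y, T):
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

def forward(X, Wh, Wo):
    H = logistic(X.dot(Wh))                            # n x 2
    H_aug = np.hstack((np.ones((H.shape[0], 1)), H))   # n x 3
    return H, H_aug, logistic(H_aug.dot(Wo))

# Small dataset: blue blob (t=1) surrounded by a red ring (t=0)
rng = np.random.RandomState(1)
n_per_class = 20
x_blue = rng.randn(n_per_class, 2) * 0.3
radius = rng.randn(n_per_class) * 0.2 + 1.3
angle = 2 * np.pi * rng.rand(n_per_class)
x_red = np.column_stack((radius * np.cos(angle), radius * np.sin(angle)))
x = np.vstack((x_blue, x_red))
X = np.hstack((np.ones((x.shape[0], 1)), x))
T = np.vstack((np.ones((n_per_class, 1)), np.zeros((n_per_class, 1))))

# Random initial parameters (bias in the first row of each matrix)
Wh = rng.randn(3, 2)
Wo = rng.randn(3, 1)

mu = 0.005  # learning rate (assumed value)
costs = []
for k in range(1000):
    # Forward step
    H, H_aug, Y = forward(X, Wh, Wo)
    costs.append(cost(Y, T))
    # Backward step: gradients of the output and hidden layer
    E = Y - T
    Jo = H_aug.T.dot(E)
    Jh = X.T.dot(H * (1 - H) * E.dot(Wo[1:, :].T))
    # Gradient descent update
    Wo = Wo - mu * Jo
    Wh = Wh - mu * Jh

print('initial cost: {:.3f}'.format(costs[0]))
print('final cost:   {:.3f}'.format(costs[-1]))
```

Both gradients are computed from the same forward step before any parameter is changed, so the hidden layer gradient uses the output weights from before the update.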

In [None]:
# Define the cost function
def cost(Y, T):
    return - np.sum(np.multiply(T, np.log(Y)) + np.multiply((1-T), np.log(1-Y)))

# Define the error function
def error(Y, T):
    return Y - T

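# Define the gradient function for the output layer
def gradient_output(H, E): 
    return H.T * E

# Define the gradient function for the hidden layer.
# x includes the column of ones and wh includes the bias row (bh is unused).
# error_gradient is the n x 2 error at the hidden outputs: E * Wo[1:,:].T
def gradient_input(x, wh, bh, error_gradient):
    h = logistic(x * wh)
    return x.T * np.multiply(np.multiply(h, 1 - h), error_gradient)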

# # define the update function delta w which returns the 
# #  delta w for each weight in a vector
# def delta_w(w_k, x, t, learning_rate):
#     return learning_rate * gradient(w_k, x, t)

In [None]:
C = cost(Y, T)
print('C.shape: {}'.format(C.shape))

E = error(Y, T)
print('E.shape: {}'.format(E.shape))

print('H.shape: {}'.format(H.shape))
Jo = gradient_output(H, E)
print('Jo.shape: {}'.format(Jo.shape))