# A Simple Neural network
## Part 4: Vectorization

This tutorial is part 4 of the previous tutorials on neural networks (TODO: url). While the previous tutorials described very simple single-layer regression and classification models, this tutorial describes a 2-class classification neural network with 2 input dimensions and a non-linear hidden layer with 3 dimensions, and shows how its computations can be vectorized. While we didn't add the bias parameters to the previous 2 models, we will add them to this model. The network of this model is shown in the following figure:

![Image of the logistic model](https://dl.dropboxusercontent.com/u/8938051/Blog_images/SimpleANN03.png)

In [None]:
# Python imports
import numpy as np # Matrix and vector computation package
import matplotlib.pyplot as plt  # Plotting library
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
np.random.seed(seed=1)

## Define the dataset 

In this example the target classes $t$ will be generated from 2 class distributions: blue ($t=1$) and red ($t=0$), where the red class is a circular distribution that surrounds the distribution of the blue class. This results in a 2D dataset that is not linearly separable. The model from part 2 won't be able to classify both classes correctly since it can learn only linear separators. By adding a hidden layer, the model will be able to learn a non-linear separator.

In [None]:
# Define and generate the samples
nb_of_samples_per_class = 50  # The number of samples in each class

# Generate blue samples
blue_mean = [0,0]  # The mean of the blue class
blue_std_dev = 0.3  # standard deviation of blue class
x_blue = np.random.randn(nb_of_samples_per_class, 2) * blue_std_dev + blue_mean

# Generate red samples as circle around blue samples
red_radius_mean = 1.3  # mean of the radius
red_radius_std_dev = 0.2  # standard deviation of the radius
red_rand_radius = np.random.randn(nb_of_samples_per_class) * red_radius_std_dev + red_radius_mean
red_rand_angle = 2 * np.pi * np.random.rand(nb_of_samples_per_class)
x_red = np.asmatrix([red_rand_radius * np.cos(red_rand_angle), 
                     red_rand_radius * np.sin(red_rand_angle)]).T

# Define target vectors for blue and red
t_blue_vector = np.asarray([1, 0])
t_red_vector = np.asarray([0, 1])
# Define the full target matrix for each class
t_blue = np.tile(t_blue_vector, (nb_of_samples_per_class, 1))
t_red = np.tile(t_red_vector, (nb_of_samples_per_class, 1))

# Merge samples in set of input variables x, and corresponding set of
# output variables t
X = np.vstack((x_blue, x_red))
T = np.vstack((t_blue, t_red))

In [None]:
# Plot both classes on the x1, x2 plane
plt.plot(x_red[:,0], x_red[:,1], 'ro', label='class red')
plt.plot(x_blue[:,0], x_blue[:,1], 'bo', label='class blue')
plt.grid()
plt.legend(loc=2)
plt.xlabel('x1')
plt.ylabel('x2')
# plt.axis([-4, 4, -4, 4])
plt.title('red vs blue classes in the input space')
plt.show()

## Vectorization of backpropagation

### 1. Vectorization of the forward step

#### Compute activations of hidden layer
The $n$ input samples with $2$ variables each are given as a $n \times 2$ matrix $X$:

$$X =
\begin{bmatrix} 
x_{11} & x_{12} \\
\vdots & \vdots \\
x_{n1} & x_{n2}
\end{bmatrix}$$

Where $x_{ij}$ is the value of the $j$-th variable of the $i$-th input sample. These inputs are projected onto the 3 dimensions of the hidden layer $H$ by the weight matrix $W_h$ ($w_{hij}$ is the weight of the connection between input variable $i$ and hidden neuron activation $j$) and bias vector $b_h$:

$$\begin{align}
W_h =
\begin{bmatrix} 
w_{h11} & w_{h12} & w_{h13} \\
w_{h21} & w_{h22} & w_{h23}
\end{bmatrix}
&& b_h = 
\begin{bmatrix} 
b_{h1} \\
b_{h2} \\
b_{h3}
\end{bmatrix}
\end{align}$$

The hidden layer activations $H$ are then computed as follows:

$$H = \sigma(X \cdot W_h + b_h) = \frac{1}{1+e^{-(X \cdot W_h + b_h)}} = \begin{bmatrix} 
h_{11} & h_{12} & h_{13} \\
\vdots & \vdots & \vdots \\
h_{n1} & h_{n2} & h_{n3}
\end{bmatrix}$$

With $\sigma$ the logistic function applied elementwise, the bias $b_h$ added to every row of $X \cdot W_h$, and $H$ an $n \times 3$ matrix.
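
As a shape check, the hidden layer computation can be sketched with plain NumPy arrays. This is a minimal sketch, not the code used in the rest of this tutorial: the sample values and the names `X_demo`, `Wh_demo` and `bh_demo` are made up for illustration, and the bias is assumed to be stored as a length-3 array so that broadcasting adds it to every row of $X \cdot W_h$.

```python
import numpy as np

def logistic(z):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical inputs: n = 4 samples with 2 variables each
X_demo = np.array([[0.0, 0.0],
                   [1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])
Wh_demo = np.ones((2, 3))              # 2 inputs -> 3 hidden units
bh_demo = np.array([0.0, 0.5, -0.5])   # one bias per hidden unit

# X_demo @ Wh_demo has shape (4, 3); bh_demo (shape (3,)) is broadcast over the rows
H_demo = logistic(X_demo @ Wh_demo + bh_demo)
print(H_demo.shape)  # (4, 3)
```

The first sample is all zeros, so its row of `H_demo` depends only on the bias values.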

#### Compute activations of output 

To compute the output activations, the hidden layer activations are projected onto the 2-dimensional output layer. This is done by the $3 \times 2$ weight matrix $W_o$ ($w_{oij}$ is the weight of the connection between hidden layer neuron $i$ and output activation $j$) and the $2 \times 1$ bias vector $b_o$:

$$\begin{align}
W_o =
\begin{bmatrix} 
w_{o11} & w_{o12} \\
w_{o21} & w_{o22} \\
w_{o31} & w_{o32}
\end{bmatrix}
&& b_o = 
\begin{bmatrix} 
b_{o1} \\
b_{o2}
\end{bmatrix}
\end{align}$$

The output activations $Y$ are then computed as follows:

$$Y = \sigma(H \cdot W_o + b_o) = \frac{1}{1+e^{-(H \cdot W_o + b_o)}} = \begin{bmatrix} 
y_{11} & y_{12}\\
\vdots & \vdots \\
y_{n1} & y_{n2} 
\end{bmatrix}$$

With $\sigma$ the logistic function applied elementwise, the bias $b_o$ added to every row of $H \cdot W_o$, and $Y$ an $n \times 2$ matrix.

In [None]:
# Define the logistic function
def logistic(z): return 1 / (1 + np.exp(-z))

# Function to compute the hidden activations
def hidden_activations(X, Wh, bh):
    return logistic(X * Wh + bh)

# Define output layer feedforward
def output_activations(H, Wo, bo):
    return logistic(H * Wo + bo)

# Define the neural network function
def nn(X, Wh, bh, Wo, bo): 
    return output_activations(hidden_activations(X, Wh, bh), Wo, bo)

# Define the neural network prediction function that only returns
#  1 or 0 depending on the predicted class
def nn_predict(X, Wh, bh, Wo, bo): 
    return np.around(nn(X, Wh, bh, Wo, bo))

In [None]:
# Weights and biases
bh = np.asarray([0, 0, 0])
Wh = np.asmatrix([[1, 1, 1],[1, 1, 1]])

bo = np.asarray([0, 0])
Wo = np.asmatrix([[1, 1], [1, 1], [1, 1]])


# Print the shape of every matrix in the forward step
print('X.shape: ', X.shape)
print('Wh.shape: ', Wh.shape)
print('bh.shape: ', bh.shape)

H = hidden_activations(X, Wh, bh)
print('H.shape: ', H.shape)
print('Wo.shape: ', Wo.shape)
print('bo.shape: ', bo.shape)

O = output_activations(H, Wo, bo)
print('O.shape: ', O.shape)

Y = nn(X, Wh, bh, Wo, bo)
print('Y.shape: ', Y.shape)

Y_pred = nn_predict(X, Wh, bh, Wo, bo)
print('Y_pred.shape: ', Y_pred.shape)

### 2. Vectorization of the backward step

#### Compute the error at the output

The cost function $\xi$ for $n$ samples and $C$ classes used in this model is the cross-entropy cost function:

$$\xi(T,Y) = \sum_{i=1}^n \xi(\mathbf{t}_i,\mathbf{y}_i) = - \sum_{i=1}^n \sum_{c=1}^{C} t_{ic} \cdot \log(y_{ic})$$



The error gradient $\delta_{o}$ of this cost function at the softmax output layer is simply:

$$\delta_{o} = \frac{\partial \xi}{\partial Z_o} = Y - T$$

With $Z_o$ an $n \times 2$ matrix of inputs to the softmax layer ($Z_o = H \cdot W_o + b_o$), and $T$ an $n \times 2$ target matrix that corresponds to $Y$. Note that $\delta_{o}$ is also an $n \times 2$ matrix.
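
The identity $\delta_o = Y - T$ can be checked numerically. The sketch below assumes a softmax output with one-hot targets and compares $Y - T$ against central finite differences of the cross-entropy cost. The data and the helper names `softmax` and `cross_entropy` are made up for this check. The forward-step code earlier in this tutorial uses logistic outputs instead; for those, the same expression $Y - T$ follows when the cost is the per-output binary cross-entropy.

```python
import numpy as np

def softmax(Z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    expZ = np.exp(Z - Z.max(axis=1, keepdims=True))
    return expZ / expZ.sum(axis=1, keepdims=True)

def cross_entropy(Z, T):
    """Cross-entropy cost of softmax(Z) against one-hot targets T."""
    return -np.sum(T * np.log(softmax(Z)))

rng = np.random.default_rng(0)
Z_demo = rng.normal(size=(5, 2))        # hypothetical inputs to the output layer
T_demo = np.eye(2)[[0, 1, 1, 0, 1]]     # hypothetical one-hot targets

delta_o = softmax(Z_demo) - T_demo      # analytic gradient Y - T

# Central finite differences, perturbing one entry of Z at a time
eps = 1e-6
numeric = np.zeros_like(Z_demo)
for i in range(Z_demo.shape[0]):
    for j in range(Z_demo.shape[1]):
        Z_plus, Z_minus = Z_demo.copy(), Z_demo.copy()
        Z_plus[i, j] += eps
        Z_minus[i, j] -= eps
        numeric[i, j] = (cross_entropy(Z_plus, T_demo)
                         - cross_entropy(Z_minus, T_demo)) / (2 * eps)

print(np.allclose(delta_o, numeric, atol=1e-6))
```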

#### Update the output layer weights
At the output the gradient $\delta_{w_{oij}}$ is computed by ${\partial \xi}/{\partial w_{oij}}$. We worked out this formula for batch processing over all $n$ samples in part 2:

$$\frac{\partial \xi}{\partial w_{oij}} = \frac{\partial Z_{o}}{\partial w_{oij}} \frac{\partial Y}{\partial Z_{o}} \frac{\partial \xi}{\partial Y} = \sum_{k=1}^n h_{ki} (y_{kj} - t_{kj}) = \sum_{k=1}^n h_{ki} \delta_{okj}$$

We can write this formula out as matrix operations with the parameters of the output layer:

$$\frac{\partial \xi}{\partial W_o} = H^T \cdot (Y-T) = H^T \cdot \delta_{o}$$

The resulting gradient is a $3 \times 2$ [Jacobian matrix](http://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant):

$$J_{W_o} = \nabla_{W_o} \xi =
\begin{bmatrix} 
\frac{\partial \xi}{\partial w_{o11}} & \frac{\partial \xi}{\partial w_{o12}} \\
\frac{\partial \xi}{\partial w_{o21}} & \frac{\partial \xi}{\partial w_{o22}} \\
\frac{\partial \xi}{\partial w_{o31}} & \frac{\partial \xi}{\partial w_{o32}}
\end{bmatrix}$$

The output weights $W_o$ can then be updated with learning rate $\mu$ according to:

$$ W_o(k+1) = W_o(k) - \mu * J_{W_o} $$
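
The output weight gradient and its update can be sketched as follows. All values are random placeholders and the names `H_demo`, `delta_o`, `Wo_demo` and `mu` are assumptions for this example. The loop recomputes the same gradient as a sum of per-sample outer products, to show what the single matrix product replaces.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
H_demo = rng.uniform(size=(n, 3))    # hypothetical hidden activations
delta_o = rng.normal(size=(n, 2))    # hypothetical output errors Y - T
Wo_demo = rng.normal(size=(3, 2))    # hypothetical output weights
mu = 0.1                             # hypothetical learning rate

# Vectorized gradient: one matrix product over all samples
J_Wo = H_demo.T @ delta_o            # shape (3, 2)

# The same gradient as an explicit sum of per-sample outer products
J_loop = np.zeros((3, 2))
for k in range(n):
    J_loop += np.outer(H_demo[k], delta_o[k])

# One gradient descent step on the output weights
Wo_next = Wo_demo - mu * J_Wo
print(J_Wo.shape, np.allclose(J_Wo, J_loop))
```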

#### Update the output layer bias
The bias $b_o$ can be updated in the same manner. The gradient $\delta_{b_{oi}}$ is computed by ${\partial \xi}/{\partial b_{oi}}$. We worked out this formula for batch processing over all $n$ samples in part 2:

$$\frac{\partial \xi}{\partial b_{oi}} = \frac{\partial Z_{o}}{\partial b_{oi}} \frac{\partial Y}{\partial Z_{o}} \frac{\partial \xi}{\partial Y} = \sum_{k=1}^n 1 \cdot (y_{ki} - t_{ki}) = \sum_{k=1}^n \delta_{oki}$$

The resulting gradient is a $2 \times 1$ Jacobian matrix:

$$J_{b_o} = \nabla_{b_o} \xi =
\begin{bmatrix} 
\frac{\partial \xi}{\partial b_{o1}}\\
\frac{\partial \xi}{\partial b_{o2}}
\end{bmatrix}$$

The output bias $b_o$ can then be updated with learning rate $\mu$ according to:

$$ b_o(k+1) = b_o(k) - \mu * J_{b_o} $$
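
In code, the bias gradient is just a sum of $\delta_o$ over the sample axis. A minimal sketch with made-up error values follows; the bias is assumed to be stored as a length-2 array that stands in for the $2 \times 1$ vector.

```python
import numpy as np

# Hypothetical errors Y - T for n = 3 samples and 2 outputs
delta_o = np.array([[ 0.2, -0.2],
                    [-0.5,  0.5],
                    [ 0.1, -0.1]])
bo_demo = np.zeros(2)   # hypothetical current bias
mu = 0.1                # hypothetical learning rate

J_bo = delta_o.sum(axis=0)       # sum over the samples, shape (2,)
bo_next = bo_demo - mu * J_bo    # one gradient descent step
print(J_bo, bo_next)
```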

#### Compute the error at the hidden layer

The error gradient $\delta_{h}$ of the cost function at the hidden layer is defined as:

$$\delta_{h} = \frac{\partial \xi}{\partial Z_h} = \frac{\partial H}{\partial Z_h} \frac{\partial \xi}{\partial H} = \frac{\partial H}{\partial Z_h} \frac{\partial Z_o}{\partial H}\frac{\partial \xi}{\partial Z_o}$$

With $Z_h$ the $n \times 3$ matrix of inputs to the hidden layer ($Z_h = X \cdot W_h + b_h$). Since ${\partial \xi}/{\partial Z_o} = \delta_{o}$, the term ${\partial Z_o}/{\partial H}$ contributes the output weights $W_o$, and the derivative of the logistic function is $H \circ (1-H)$ (with $\circ$ the elementwise product), this results in the $n \times 3$ matrix:

$$\delta_{h} = H \circ (1-H) \circ (\delta_{o} \cdot W_o^T)$$
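
The hidden layer error can be sketched in NumPy, following the definition $\delta_h = {\partial \xi}/{\partial Z_h}$ above and assuming a logistic hidden layer. The arrays are random placeholders and the `_demo` names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
H_demo = rng.uniform(0.1, 0.9, size=(n, 3))   # hypothetical hidden activations
Wo_demo = rng.normal(size=(3, 2))             # hypothetical output weights
delta_o = rng.normal(size=(n, 2))             # hypothetical output errors

# Propagate the output error back through the output weights:
# (n, 2) @ (2, 3) -> (n, 3)
dxi_dH = delta_o @ Wo_demo.T

# Multiply elementwise by the derivative of the logistic function
delta_h = H_demo * (1 - H_demo) * dxi_dH
print(delta_h.shape)  # (5, 3)
```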

### 2. Backward step

The backward step will begin with computing the cost at the output node. This cost will then be propagated backwards layer by layer through the network to update the parameters.

The [gradient descent](http://en.wikipedia.org/wiki/Gradient_descent) algorithm is used in every layer to update the parameters in the direction of the negative [gradient](http://en.wikipedia.org/wiki/Gradient).

The parameters $w$ are updated by $w(k+1) = w(k) - \Delta w(k+1)$. $\Delta w$ is defined as: $\Delta w = \mu \delta_{w}$ with $\mu$ the learning rate and $\delta_{w}$ the local gradient at the output of a neuron that has $w$ as a parameter.
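
As a small illustration of this update rule, the sketch below applies it to a toy cost $(w - 3)^2$ with a single parameter. The cost, the starting value and the learning rate are assumptions chosen only to show the mechanics; they are not part of the network in this tutorial.

```python
def gradient(w):
    """Gradient of the toy cost (w - 3)**2."""
    return 2.0 * (w - 3.0)

w = 0.0     # initial parameter
mu = 0.1    # learning rate
for k in range(100):
    delta_w = mu * gradient(w)   # Delta w = mu * gradient
    w = w - delta_w              # w(k+1) = w(k) - Delta w
print(w)
```

Each step shrinks the distance to the minimum at $w = 3$ by a factor of $0.8$, so the parameter converges there.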

#### Compute the cost function

The cost function $\xi$ used in this model is the same cross-entropy cost function used in part 2:

$$\xi(T,Y) = - \sum_{i=1}^{n} \left[ t_i \log(y_i) + (1-t_i) \log(1-y_i) \right]$$


#### Update the output layer

At the output the gradient $\delta_{w}$ is computed by ${\partial \xi}/{\partial w}$. We worked out this formula for batch processing in part 2:

$$\frac{\partial \xi}{\partial w_{oi}} = \frac{\partial z_{o}}{\partial w_{oi}} \frac{\partial y}{\partial z_{o}} \frac{\partial \xi}{\partial y} = \sum_{j=1}^n h_{ji} (y_j - t_j)$$

With $z_{o} = H * W_o$. We can write this formula out as matrix operations with the parameters of the output layer:

$$\frac{\partial \xi}{\partial W_o} = H^T \cdot (Y-T) $$

The resulting gradient is a $3 \times 1$ [Jacobian matrix](http://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant):

$$J_o =
\begin{bmatrix} 
\frac{\partial \xi}{\partial b_o} \\
\frac{\partial \xi}{\partial w_{o1}} \\
\frac{\partial \xi}{\partial w_{o2}}
\end{bmatrix}$$

#### Update the input layer

At the hidden layer the gradient $\delta_{w}$ of the neuron with parameter $w$ is computed the same way:

$$\frac{\partial \xi}{\partial w_{hij}} = \frac{\partial z_{hj}}{\partial w_{hij}} \frac{\partial h_j}{\partial z_{hj}} \frac{\partial \xi}{\partial h_j}$$

With, for every sample $k$, $z_{hj} = \sum_{i=0}^2 x_{ki} * w_{hij}$, and with ${\partial \xi}/{\partial h_j}$ the gradient of the error with respect to the output $h_j$ of the hidden node. This gradient can be interpreted as the contribution of $h_j$ to the total error.
How do we define this error at the output of the hidden nodes? For each hidden neuron $h_j$ we can compute it as the sum of the errors propagated back from the connections going out of $h_j$:

$$\frac{\partial \xi}{\partial h_j} = \frac{\partial z_{o}}{\partial h_j} \frac{\partial \xi}{\partial z_{o}} = w_{oj} \frac{\partial \xi}{\partial z_{o}}$$

Because of this, and because ${\partial z_{hj}}/{\partial w_{hij}} = x_{i}$ and ${\partial h_j}/{\partial z_{hj}} = h_j * (1-h_j)$ and ${\partial \xi}/{\partial z_{o}} = y - t = E$ we compute ${\partial \xi}/{\partial w_{hij}}$ over all samples $k$ as:

$$\frac{\partial \xi}{\partial w_{hij}} = \sum_{k=1}^n x_{ki} * h_{kj} * (1-h_{kj}) * w_{oj}  * (y_k-t_k)$$

This can be written in matrix format as:

$$\frac{\partial \xi}{\partial W_h} = X^T \cdot \left[ H \circ (1-H) \circ \left( (Y - T) \cdot W_o^T \right) \right]$$

With $\circ$ the [elementwise product](http://en.wikipedia.org/wiki/Hadamard_product).

The resulting gradient is a $3 \times 2$ [Jacobian matrix](http://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant):

$$J_h =
\begin{bmatrix} 
\frac{\partial \xi}{\partial b_{h1}} & \frac{\partial \xi}{\partial b_{h2}} \\
\frac{\partial \xi}{\partial w_{h11}} & \frac{\partial \xi}{\partial w_{h12}} \\
\frac{\partial \xi}{\partial w_{h21}} & \frac{\partial \xi}{\partial w_{h22}}
\end{bmatrix}$$
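
A gradient of this form can be verified with finite differences. The sketch below uses the shapes of the vectorized network (2 inputs, 3 hidden units, 2 logistic outputs) with the binary cross-entropy summed over all outputs, and writes the backpropagated error as $(Y - T) \cdot W_o^T$ so that the matrix dimensions agree. The data, the parameter values and the helper names `forward` and `cost` are assumptions for this check.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Wh, bh, Wo, bo):
    """Return hidden activations H and output activations Y."""
    H = logistic(X @ Wh + bh)
    Y = logistic(H @ Wo + bo)
    return H, Y

def cost(Y, T):
    """Binary cross-entropy summed over all samples and outputs."""
    return -np.sum(T * np.log(Y) + (1 - T) * np.log(1 - Y))

rng = np.random.default_rng(3)
X_demo = rng.normal(size=(8, 2))                 # hypothetical inputs
T_demo = np.eye(2)[rng.integers(0, 2, size=8)]   # hypothetical one-hot targets
Wh, bh = rng.normal(size=(2, 3)), np.zeros(3)
Wo, bo = rng.normal(size=(3, 2)), np.zeros(2)

# Analytic gradient with respect to the hidden weights
H, Y = forward(X_demo, Wh, bh, Wo, bo)
J_Wh = X_demo.T @ (H * (1 - H) * ((Y - T_demo) @ Wo.T))

# Central finite differences, perturbing one weight at a time
eps = 1e-6
numeric = np.zeros_like(Wh)
for i in range(Wh.shape[0]):
    for j in range(Wh.shape[1]):
        Wp, Wm = Wh.copy(), Wh.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        cost_plus = cost(forward(X_demo, Wp, bh, Wo, bo)[1], T_demo)
        cost_minus = cost(forward(X_demo, Wm, bh, Wo, bo)[1], T_demo)
        numeric[i, j] = (cost_plus - cost_minus) / (2 * eps)

print(np.allclose(J_Wh, numeric, atol=1e-5))
```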

To start the gradient descent algorithm, you typically pick the initial parameters at random and then repeatedly update them in the direction of the negative gradient, which is computed with the backpropagation algorithm.
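
Put together, such a training loop could look like the sketch below. It is only an illustration under stated assumptions: a small made-up dataset shaped like the one in this tutorial, logistic outputs with the binary cross-entropy cost, plain NumPy arrays instead of `np.matrix`, and an arbitrary learning rate and iteration count.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)

# Hypothetical toy data: one class near the origin, one class on a ring around it
n_per_class = 20
inner = rng.normal(scale=0.3, size=(n_per_class, 2))
angle = rng.uniform(0, 2 * np.pi, size=n_per_class)
ring = 1.3 * np.column_stack((np.cos(angle), np.sin(angle)))
X_demo = np.vstack((inner, ring))
T_demo = np.vstack((np.tile([1.0, 0.0], (n_per_class, 1)),
                    np.tile([0.0, 1.0], (n_per_class, 1))))

# Random initial weights, zero initial biases
Wh, bh = rng.normal(size=(2, 3)), np.zeros(3)
Wo, bo = rng.normal(size=(3, 2)), np.zeros(2)
mu = 0.01   # learning rate (arbitrary choice)

costs = []
for k in range(200):
    # Forward step
    H = logistic(X_demo @ Wh + bh)
    Y = logistic(H @ Wo + bo)
    costs.append(-np.sum(T_demo * np.log(Y) + (1 - T_demo) * np.log(1 - Y)))
    # Backward step: errors at the output and hidden layers
    delta_o = Y - T_demo
    delta_h = H * (1 - H) * (delta_o @ Wo.T)
    # All gradients are computed before any parameter is changed
    J_Wo, J_bo = H.T @ delta_o, delta_o.sum(axis=0)
    J_Wh, J_bh = X_demo.T @ delta_h, delta_h.sum(axis=0)
    # Gradient descent updates
    Wo, bo = Wo - mu * J_Wo, bo - mu * J_bo
    Wh, bh = Wh - mu * J_Wh, bh - mu * J_bh

print(costs[0], costs[-1])
```

With a learning rate this small the cost should go down over the iterations; a larger rate may converge faster but can also make the updates overshoot.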

In [None]:
# Define the cost function
def cost(Y, T):
    return - np.sum(np.multiply(T, np.log(Y)) + np.multiply((1-T), np.log(1-Y)))

# Define the error function
def error(Y, T):
    return Y - T

# define the gradient function for the output layer
def gradient_output(H, E): 
    return  H.T * E

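# define the gradient function for the hidden layer weights
# (X: inputs, H: hidden activations, Wo: output weights, E: output error Y - T)
# Assumes np.matrix arguments, as in the rest of this notebook, so that
#  * is the matrix product and np.multiply the elementwise product
def gradient_input(X, H, Wo, E):
    return X.T * np.multiply(np.multiply(H, (1 - H)), E * Wo.T)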

# # define the update function delta w which returns the 
# #  delta w for each weight in a vector
# def delta_w(w_k, x, t, learning_rate):
#     return learning_rate * gradient(w_k, x, t)

In [None]:
C = cost(Y, T)
print('C.shape: ', C.shape)

E = error(Y, T)
print('E.shape: ', E.shape)

print('H.shape: ', H.shape)
Jo = gradient_output(H, E)
print('Jo.shape: ', Jo.shape)