# A Simple Neural network
## Part 3: Hidden layer

This tutorial is part 3 of the series of tutorials on neural networks (TODO: url). While the previous tutorials described very simple single-layer regression and classification models, this tutorial describes a 2-class classification neural network with 1 input dimension and a non-linear hidden layer with 1 neuron. The previous 2 models had no bias parameters; this model adds a constant bias term to the output neuron. The network of this model is shown in the following figure:

![Image of the logistic model](https://dl.dropboxusercontent.com/u/8938051/Blog_images/SimpleANN03.png)

In [None]:
# Python imports
import numpy as np # Matrix and vector computation package
import matplotlib.pyplot as plt  # Plotting library
from matplotlib.colors import colorConverter, ListedColormap # some plotting functions
# Allow matplotlib to plot inside this notebook
%matplotlib inline
# Set the seed of the numpy random number generator so that the tutorial is reproducible
np.random.seed(seed=1)

## Define the dataset 

In this example the target classes $t$ corresponding to the inputs $x$ are generated from 2 class distributions: blue ($t=1$) and red ($t=0$). The red class is a [multimodal distribution](http://en.wikipedia.org/wiki/Multimodal_distribution) that surrounds the distribution of the blue class. This results in a 1D dataset that is not linearly separable. The model from part 2 can't classify both classes correctly since it can only learn linear separators. By adding a hidden layer with a non-linear transfer function, the model can learn a non-linear classifier.
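
To make the claim about linear separability concrete, the sketch below tries every single-threshold classifier on a tiny hand-picked 1D dataset. The data values and the helper `best_threshold_accuracy` are assumptions made for illustration only; they are not the samples generated in this tutorial.

```python
import numpy as np

# Hand-picked toy data (an assumption): red (0) around -2 and 2, blue (1) around 0
x_toy = np.array([-2.3, -2.0, -1.7, -0.3, 0.0, 0.3, 1.7, 2.0, 2.3])
t_toy = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

def best_threshold_accuracy(x, t):
    """Best accuracy of any classifier that compares x to a single threshold."""
    xs = np.sort(x)
    # Candidate thresholds: below all samples, between neighbours, above all samples
    candidates = np.concatenate(([xs[0] - 1], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1]))
    best = 0.0
    for c in candidates:
        for blue_on_right in (True, False):
            pred = (x > c) if blue_on_right else (x < c)
            best = max(best, np.mean(pred == t))
    return best

best_accuracy = best_threshold_accuracy(x_toy, t_toy)
print(best_accuracy)  # stays below 1.0: no single threshold separates the classes
```

Whatever threshold is chosen, at least one of the red modes ends up on the blue side, which is why a non-linear hidden layer is needed.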

In [None]:
# Define and generate the samples
nb_of_samples_per_class = 20  # The number of samples in each class
blue_mean = [0]  # The mean of the blue class
red_left_mean = [-2]  # The mean of the left mode of the red class
red_right_mean = [2]  # The mean of the right mode of the red class

std_dev = 0.5  # standard deviation of both classes
# Generate samples from both classes
x_blue = np.random.randn(nb_of_samples_per_class, 1) * std_dev + blue_mean
# Integer division (//) keeps the sample count an int in both Python 2 and 3
x_red_left = np.random.randn(nb_of_samples_per_class//2, 1) * std_dev + red_left_mean
x_red_right = np.random.randn(nb_of_samples_per_class//2, 1) * std_dev + red_right_mean


# Merge samples in set of input variables x, and corresponding set of
# output variables t
x = np.vstack((x_blue, x_red_left, x_red_right))
# print x
t = np.vstack((np.ones((x_blue.shape[0],1)), 
               np.zeros((x_red_left.shape[0],1)), 
               np.zeros((x_red_right.shape[0], 1))))

In [None]:
# Plot samples from both classes as lines on a 1D space
plt.figure(figsize=(8,0.5))
plt.xlim(-4,4)
plt.ylim(-1,1)
# Plot samples
plt.plot(x_blue, np.zeros_like(x_blue), 'b|', ms = 30) 
plt.plot(x_red_left, np.zeros_like(x_red_left), 'r|', ms = 30) 
plt.plot(x_red_right, np.zeros_like(x_red_right), 'r|', ms = 30) 
plt.gca().axes.get_yaxis().set_visible(False)
plt.show()

## Non-linear transfer function 

The non-linear transfer function used in the hidden layer of this example is the [Gaussian](http://en.wikipedia.org/wiki/Gaussian_function) [radial basis function](http://en.wikipedia.org/wiki/Radial_basis_function) (RBF).  
The RBF is a transfer function that is not usually used in neural networks, except for [radial basis function networks](http://en.wikipedia.org/wiki/Radial_basis_function_network). One of the most common transfer functions in neural networks is the [sigmoid](http://en.wikipedia.org/wiki/Sigmoid_function) transfer function.
The RBF makes it possible to separate the blue samples from the red samples in this simple example because it only activates in a certain region around the origin. The RBF is plotted in the figure below and is defined in this example as:

$$ \text{RBF} = \phi(z) = e^{-z^2} $$

The derivative of this RBF function is:

$$ \frac{d \phi(z)}{dz} = -2 z e^{-z^2} = -2 z \phi(z)$$
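
As a quick sanity check of this derivative, the sketch below compares the analytical expression with a central finite difference. The helper name `rbf_derivative` and the step size `eps` are assumptions introduced for this check only.

```python
import numpy as np

def rbf(z):
    """Gaussian RBF: phi(z) = exp(-z^2)."""
    return np.exp(-z**2)

def rbf_derivative(z):
    """Analytical derivative: d phi / dz = -2 * z * phi(z)."""
    return -2 * z * rbf(z)

# Compare against a central finite difference on a few points
z_check = np.linspace(-3, 3, 13)
eps = 1e-6  # small step, chosen arbitrarily
numerical = (rbf(z_check + eps) - rbf(z_check - eps)) / (2 * eps)
max_abs_diff = np.max(np.abs(numerical - rbf_derivative(z_check)))
print(max_abs_diff)  # should be very close to 0
```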

In [None]:
# Define the rbf function
def rbf(z): return np.exp(-z**2)

# Plot the rbf function
z = np.linspace(-6,6,100)
plt.plot(z, rbf(z), 'b-')
plt.xlabel('z')
plt.ylabel('$e^{-z^2}$')
plt.title('rbf function')
plt.grid()
plt.show()

## Optimization by backpropagation

We will train this model with the [backpropagation](http://en.wikipedia.org/wiki/Backpropagation) algorithm, which is typically used to train neural networks. Each iteration of the backpropagation algorithm consists of two steps:

1. A forward propagation step to compute the output of the network.
2. A backward propagation step in which the error at the end of the network is propagated backwards through all the neurons, while updating their parameters.

### 1. Forward step

During the forward step the input will be propagated layer by layer through the network to compute the final output of the network.

#### Compute activations of hidden layer

The activations $h$ of the hidden layer will be computed by:

$$h = \phi(x*w_h) = e^{-(x*w_h)^2} $$

With $w_h$ the weight parameter that transforms the input before applying the RBF transfer function.

#### Compute activations of output 

The output of the final layer and network will be computed by passing the hidden activations $h$ as input to the logistic output function:

$$ y = \sigma(h * w_o - 1) = \frac{1}{1+e^{-(h * w_o - 1)}} $$

With $w_o$ the weight parameter of the output layer.  
Note that we add a bias (intercept) term of $-1$ to the input of the logistic output neuron. Remember from part 2 that without a bias the logistic output neuron can only learn a decision boundary that goes through the origin $(0)$. Since the RBF in the hidden layer maps every input to a value between $0$ and $1$, an output layer without an intercept cannot learn any useful classifier: none of the hidden activations are below $0$, so all samples lie on the same side of the decision boundary. Adding a bias term moves the decision boundary away from the origin. Normally the value of this bias term is learned together with the rest of the weight parameters, but to keep this model simple the bias is held constant in this example.
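
The effect of the bias can be illustrated with a small sketch. The weights `wh_demo` and `wo_demo` and the three inputs are hypothetical values picked by hand for illustration; they are not trained parameters.

```python
import numpy as np

def rbf(z):
    return np.exp(-z**2)

def logistic(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights, chosen only for illustration
wh_demo, wo_demo = 1.0, 5.0
# One input near each mode: left red, blue, right red
x_demo = np.array([-2.0, 0.0, 2.0])
h_demo = rbf(x_demo * wh_demo)  # hidden activations, all in (0, 1]

y_no_bias = logistic(h_demo * wo_demo)        # every output is above 0.5
y_with_bias = logistic(h_demo * wo_demo - 1)  # outputs fall on both sides of 0.5

print(np.around(y_no_bias))    # [1. 1. 1.]
print(np.around(y_with_bias))  # [0. 1. 0.]
```

Without the bias every sample is assigned to the same class for a positive output weight; with the bias of $-1$ the two red inputs fall below the decision threshold.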

In [None]:
# Define the logistic function
def logistic(z): return 1 / (1 + np.exp(-z))

# Function to compute the hidden activations
def hidden_activations(x, wh):
    return rbf(x * wh)

# Define output layer feedforward
def output_activations(h , wo):
    return logistic(h * wo - 1)

# Define the neural network function
def nn(x, wh, wo): 
    return output_activations(hidden_activations(x, wh), wo)

# Define the neural network prediction function that only returns
#  1 or 0 depending on the predicted class
def nn_predict(x, wh, wo): 
    return np.around(nn(x, wh, wo))

In [None]:
# Weights
wh = 1
wo = 1

h = hidden_activations(x, wh)
print('h.shape: {}'.format(h.shape))
o = output_activations(h, wo)
print('o.shape: {}'.format(o.shape))

y = nn(x, wh, wo)
print('y.shape: {}'.format(y.shape))

y_pred = nn_predict(x, wh, wo)
print('y_pred.shape: {}'.format(y_pred.shape))

### 2. Backward step

The backward step will begin with computing the cost at the output node. This cost will then be propagated backwards layer by layer through the network to update the parameters.

The [gradient descent](http://en.wikipedia.org/wiki/Gradient_descent) algorithm is used in every layer to update the parameters in the direction of the negative [gradient](http://en.wikipedia.org/wiki/Gradient).

The parameters $w$ are updated by $w(k+1) = w(k) - \Delta w(k+1)$, where $\Delta w = \mu \, {\partial \xi}/{\partial w}$ with $\mu$ the learning rate and ${\partial \xi}/{\partial w}$ the gradient of the cost function $\xi$ with respect to the parameter $w$.
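
As a minimal sketch of this update rule, the loop below applies it to a toy one-parameter cost $(w - 3)^2$. The toy cost, the function `toy_gradient`, the starting point, and the number of iterations are assumptions for illustration and are not part of the model in this tutorial.

```python
# Toy cost xi(w) = (w - 3)^2 with gradient 2 * (w - 3); hypothetical example
def toy_gradient(w):
    return 2 * (w - 3)

w_toy = 0.0    # initial parameter w(0)
mu = 0.1       # learning rate
for k in range(50):
    delta_w = mu * toy_gradient(w_toy)  # Delta w = mu * (d xi / d w)
    w_toy = w_toy - delta_w             # w(k+1) = w(k) - Delta w(k+1)

print(w_toy)  # approaches the minimum at w = 3
```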

#### Compute the cost function

The cost function $\xi$ used in this model is the same cross-entropy cost function used in part 2 of these tutorials (TODO url):

$$\xi(t,y) = - \sum_{i=1}^{n} \left[ t_i \log(y_i) + (1-t_i) \log(1-y_i) \right]$$

This cost function is plotted for the $w_h$ and $w_o$ parameters in the next figure. Note that this error surface is no longer convex, and that it is symmetric around the $w_h = 0$ axis because the RBF is an even function ($\phi(z) = \phi(-z)$).
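
The symmetry in $w_h$ can be checked numerically. The sketch below is self-contained and uses a small hand-picked dataset and arbitrary weights (both assumptions for illustration) rather than the samples generated earlier.

```python
import numpy as np

def rbf(z):
    return np.exp(-z**2)

def logistic(z):
    return 1 / (1 + np.exp(-z))

def nn(x, wh, wo):
    # Same forward pass as in this tutorial, including the constant bias of -1
    return logistic(rbf(x * wh) * wo - 1)

def cost(y, t):
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

# Hand-picked toy samples (assumption): red (0) on the outside, blue (1) in the middle
x_sym = np.array([[-2.1], [-1.8], [-0.3], [0.2], [1.9], [2.2]])
t_sym = np.array([[0.], [0.], [1.], [1.], [0.], [0.]])

cost_pos = cost(nn(x_sym, 2.0, 5.0), t_sym)   # w_h = +2
cost_neg = cost(nn(x_sym, -2.0, 5.0), t_sym)  # w_h = -2
print(cost_pos, cost_neg)  # the two costs are equal
```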

In [None]:
# Define the cost function
def cost(y, t):
    return - np.sum(np.multiply(t, np.log(y)) + np.multiply((1-t), np.log(1-y)))

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm

# Plot the cost as a function of the weights
# Define a vector of weights for which we want to plot the cost
nb_of_ws = 200 # compute the cost nb_of_ws times in each dimension
wsh = np.linspace(-10, 10, num=nb_of_ws) # weight 1
wso = np.linspace(-15, 15, num=nb_of_ws) # weight 2
ws_x, ws_y = np.meshgrid(wsh, wso) # generate grid
cost_ws = np.zeros((nb_of_ws, nb_of_ws)) # initialize cost matrix
# Fill the cost matrix for each combination of weights
for i in range(nb_of_ws):
    for j in range(nb_of_ws):
        cost_ws[i,j] = cost(nn(x, ws_x[i,j], ws_y[i,j]) , t)
# Plot the cost function surface
# plt.contourf(ws_x, ws_y, cost_ws, 20)
# cbar = plt.colorbar()
# cbar.ax.set_ylabel('cost')
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # 3D axes attached to the figure
surf = ax.plot_surface(ws_x, ws_y, cost_ws, linewidth=0, cmap=cm.coolwarm)
ax.view_init(elev=40, azim=-30)
fig.colorbar(surf)
plt.xlabel('$w_h$')
plt.ylabel('$w_o$')
plt.title('Cost function surface')
plt.grid()
plt.show()

#### Update the output layer

At the output the gradient ${\partial \xi}/{\partial w_o}$ can be worked out the same way as we did in part 2:

$$\frac{\partial \xi}{\partial w_o} = \frac{\partial z_o}{\partial w_o} \frac{\partial y}{\partial z_o} \frac{\partial \xi}{\partial y} = h (y-t) = h * \delta_{o}$$

With $z_o = h * w_o - 1$ the input to the logistic output neuron, and $\delta_{o} = y - t$ the gradient of the error with respect to this input $z_o$.
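
For a single sample, the three factors of this chain rule can be written out as a short worked derivation. The constant bias inside the logistic function does not change the derivative with respect to $w_o$:

$$\frac{\partial \xi}{\partial y} = -\frac{t}{y} + \frac{1-t}{1-y} = \frac{y-t}{y(1-y)}, \qquad \frac{\partial y}{\partial z_o} = y(1-y), \qquad \frac{\partial z_o}{\partial w_o} = h$$

Multiplying the three factors, the $y(1-y)$ terms cancel and what remains is $h (y-t)$.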


#### Update the hidden layer

At the hidden layer the gradient ${\partial \xi}/{\partial w_{h}}$ of the hidden neuron is computed the same way:

$$\frac{\partial \xi}{\partial w_{h}} = \frac{\partial z_{h}}{\partial w_{h}} \frac{\partial h}{\partial z_{h}} \frac{\partial \xi}{\partial h}$$

With $z_{h} = x * w_{h}$, and with ${\partial \xi}/{\partial h} = \delta_{h}$ the gradient of the error with respect to the output $h$ of the hidden neuron. This gradient can be interpreted as the contribution of $h$ to the final error.
How do we define this error gradient $\delta_{h}$ at the output of the hidden neuron? It is the error gradient propagated back through the connection going out of the neuron with output $h$:

$$\delta_{h} = \frac{\partial \xi}{\partial h} = \frac{\partial z_{o}}{\partial h} \frac{\partial y}{\partial z_{o}} \frac{\partial \xi}{\partial y} = w_{o} (y - t) = w_{o} \delta_{o} $$

Because of this, and because ${\partial z_{h}}/{\partial w_{h}} = x$ and ${\partial h}/{\partial z_{h}} = -2 z_h h$ we compute ${\partial \xi}/{\partial w_{h}}$ as:

$$\frac{\partial \xi}{\partial w_{h}} = x * -2 z_h h * \delta_{h}  $$

The gradients for each parameter can again be summed up to compute the update for a batch of input examples.
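
One way to gain confidence in these formulas is a numerical gradient check. The sketch below sums the analytical gradients over a small batch and compares them with central finite differences of the cost. The dataset, the weights, the step size `eps`, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def rbf(z):
    return np.exp(-z**2)

def logistic(z):
    return 1 / (1 + np.exp(-z))

def cost_for_weights(x, t, wh, wo):
    """Cross-entropy cost of the network for given weights."""
    y = logistic(rbf(x * wh) * wo - 1)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

def analytical_gradients(x, t, wh, wo):
    """Batch gradients of the cost with respect to wh and wo."""
    zh = x * wh
    h = rbf(zh)
    y = logistic(h * wo - 1)
    delta_o = y - t                               # gradient at the output
    grad_wo = np.sum(h * delta_o)                 # d xi / d wo
    delta_h = wo * delta_o                        # gradient at the hidden output
    grad_wh = np.sum(x * -2 * zh * h * delta_h)   # d xi / d wh
    return grad_wh, grad_wo

# Hand-picked toy batch and weights (assumptions)
x_gc = np.array([[-2.1], [-1.8], [-0.3], [0.2], [1.9], [2.2]])
t_gc = np.array([[0.], [0.], [1.], [1.], [0.], [0.]])
wh_gc, wo_gc = 0.8, 2.0
eps = 1e-6

grad_wh, grad_wo = analytical_gradients(x_gc, t_gc, wh_gc, wo_gc)
num_wh = (cost_for_weights(x_gc, t_gc, wh_gc + eps, wo_gc)
          - cost_for_weights(x_gc, t_gc, wh_gc - eps, wo_gc)) / (2 * eps)
num_wo = (cost_for_weights(x_gc, t_gc, wh_gc, wo_gc + eps)
          - cost_for_weights(x_gc, t_gc, wh_gc, wo_gc - eps)) / (2 * eps)
print(grad_wh, num_wh)  # analytical vs numerical, should agree closely
print(grad_wo, num_wo)
```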

Gradient descent typically starts from randomly chosen initial parameters, which are then repeatedly updated in the direction of the negative gradient computed by the backpropagation algorithm.

In [None]:
# Define the gradient of the cost at the output (with respect to z_o)
def gradient_output(y, t):
    return y - t

# Define the gradient function for the weight parameter at the output layer
def gradient_weight_out(h, grad_output): 
    return  h * grad_output

# Define the gradient function for the hidden layer
def gradient_hidden(wo, grad_output):
    return wo * grad_output

# Define the gradient function for the weight parameter at the hidden layer
def gradient_weight_hidden(x, zh, h, grad_hidden):
    return x * -2 * zh * h * grad_hidden

# Define the update function to update the network parameters over 1 iteration
def backprop_update(x, t, wh, wo, learning_rate):
    # Compute the output of the network
    # This can be done with y = nn(x, wh, wo), but we need the intermediate 
    #  h and zh for the weight updates.
    zh = x * wh
    h = rbf(zh)  # hidden_activations(x, wh)
    y = output_activations(h, wo)
    # Compute the gradient at the output
    grad_output = gradient_output(y, t)
    # Get the delta for wo
    d_wo = learning_rate * gradient_weight_out(h, grad_output)
    # Compute the gradient at the hidden layer
    grad_hidden = gradient_hidden(wo, grad_output)
    # Get the delta for wh
    d_wh = learning_rate * gradient_weight_hidden(x, zh, h, grad_hidden)
    # return the update parameters
    return (wh-d_wh.sum(), wo-d_wo.sum())

In [None]:
# Run backpropagation
# Set the initial weight parameter
wh = 2
wo = -5
# Set the learning rate
learning_rate = 0.1

# Plot the error surface
# plt.contourf(ws_x, ws_y, cost_ws, 20)
# cbar = plt.colorbar()
# cbar.ax.set_ylabel('cost')
# plt.title('Backpropagation updates')

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')  # 3D axes attached to the figure
surf = ax.plot_surface(ws_x, ws_y, cost_ws, linewidth=0, cmap=cm.coolwarm)
ax.view_init(elev=60, azim=-30)
cbar = fig.colorbar(surf)
cbar.ax.set_ylabel('cost')

def c_inp(x, wh, wo, t):
    return cost(nn(x, wh, wo) , t)

# Start the gradient descent updates and plot the iterations
nb_of_iterations = 60  # number of gradient descent updates
lr_update = 1 - (1.0/100)
for i in range(nb_of_iterations):
    learning_rate *= lr_update
#     print learning_rate
    # Plot the weight-cost value and the line that represents the update
    ax.plot([wh], [wo], [c_inp(x, wh, wo, t)], 'ko')  # Plot the weight cost value
    w_new = backprop_update(x, t, wh, wo, learning_rate) # update the weights
#     print w_new
    ax.plot([wh, w_new[0]], [wo, w_new[1]], [c_inp(x, wh, wo, t), c_inp(x, w_new[0], w_new[1], t)], 'k-')
#     plt.text(wh-0.2, wo+0.4, '$w({})$'.format(i))
    wh, wo = w_new  # set the weight to the updated weights
    
# Plot the last weight, axis, and show figure
ax.plot([wh], [wo], [c_inp(x, wh, wo, t)], 'ko')
# plt.text(wh-0.2, wo+0.4, '$w({})$'.format(nb_of_iterations))  
ax.set_xlabel('$w_h$')
ax.set_ylabel('$w_o$')
ax.set_zlabel('cost')
plt.title('Gradient descent updates on cost surface')
plt.grid()
plt.show()

In [None]:
# Plot samples from both classes as lines on a 1D space
plt.figure(figsize=(8,0.5))
plt.xlim(-0.01,1)
plt.ylim(-1,1)
# print hidden_activations(x, wh)
# Plot samples
plt.plot(hidden_activations(x_blue, wh), np.zeros_like(x_blue), 'b|', ms = 30) 
plt.plot(hidden_activations(x_red_left, wh), np.zeros_like(x_red_left), 'r|', ms = 30) 
plt.plot(hidden_activations(x_red_right, wh), np.zeros_like(x_red_right), 'r|', ms = 30) 
plt.gca().axes.get_yaxis().set_visible(False)
plt.show()

print(nn(x, 3, 10))

In [None]:
nb_of_xs = 10
xs = np.linspace(-4, 4, num=nb_of_xs)
ys = np.linspace(-1, 1, num=nb_of_xs)
xx, yy = np.meshgrid(xs, ys) # create the grid
# Initialize and fill the classification plane
classification_plane = np.zeros((nb_of_xs, nb_of_xs))
for i in range(nb_of_xs):
    for j in range(nb_of_xs):
#         classification_plane[i,j] = rbf(xx[i,j] * 3)
        classification_plane[i,j] = nn_predict(xx[i,j], wh, wo)
print(classification_plane)
# Create a color map to show the classification colors of each grid point
cmap = ListedColormap([
        colorConverter.to_rgba('r', alpha=0.25),
        colorConverter.to_rgba('b', alpha=0.25)])

# Plot the classification plane with decision boundary and input samples


# Plot samples from both classes as lines on a 1D space
plt.figure(figsize=(8,2))
plt.contourf(xx, yy, classification_plane, cmap=cmap)
plt.colorbar()
plt.xlim(-4,4)
plt.ylim(-1,1)
# Plot samples
plt.plot(x_blue, np.zeros_like(x_blue), 'b|', ms = 30) 
plt.plot(x_red_left, np.zeros_like(x_red_left), 'r|', ms = 30) 
plt.plot(x_red_right, np.zeros_like(x_red_right), 'r|', ms = 30) 
plt.gca().axes.get_yaxis().set_visible(False)
plt.show()