#  Artificial Neural Networks

## Overview

<img src="figures/nn.png" width="450"/>

This is the companion "guide/implementation" to the tutorial "A Practitioners Guide to Neural Networks.pdf".  
Please go through that paper so that you understand the steps below.
The path to learning is not simply executing code, but critical thinking, reading new material, and repetition until you understand the topic.
However, for some students, the gap between books and implementation is rather large.
The tutorial was written for these people, and this notebook was provided to put the code from the tutorial in one place.


## Introduction

An artificial neural network (ANN) is a numerical model loosely inspired by the biological brain.
Both "systems" are analogous to switches that change their output state based on the strength of the input.
While a single switch, or neuron, produces a single output, the interconnection of thousands or millions of neurons represents a structured output.
Through *learning*, some neurons are *triggered* differently than others, and repeated learning reinforces those connections and that structure.
The reinforcement process to produce a desired result is called *feedback*.

*Artificial* neural networks attempt to simplify and mimic the biological behavior discussed above.
*Training* an ANN takes one of two forms: *supervised* or *unsupervised*.
In a *supervised* ANN, the network is trained by providing matched input and output data samples and minimizing the error between the network's output and the expected output.
For example: an e-mail spam filter might use specific text within an email to determine the authenticity of the e-mail, and the training of the ANN filter requires many examples, with iterations of feedback, before it will correctly *classify* an e-mail.
An *unsupervised* ANN attempts to "understand" the structure of the input "on its own", requiring special *clustering* or *dimension reduction* algorithms; this method will be discussed in a future tutorial.

This tutorial will walk you through the steps to build, train, and test an artificial neural network.  In our example, we will use an MNIST-style handwritten digit dataset: the 8 × 8 digits dataset bundled with scikit-learn.

Let's get started!

We begin with the standard imports:

In [None]:
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt 
import numpy.random as r
import numpy as np

import warnings
warnings.filterwarnings('ignore')

## Function Definitions

Next we will define some functions.

The biological neuron is simulated in an ANN by an activation function (AF), or switch. If
the input is above a user-defined threshold, the AF switches state, e.g. from 0 to 1, from −1 to 1,
or from 0 to > 0. A commonly used activation function is the sigmoid function.

Here we define the sigmoid function (which is used in the feed-forward pass) and its first derivative (which we will use during training).

In [None]:
def f(x):
    return 1 / (1 + np.exp(-x)) # 1 / (1+exp(-x))

def f_deriv(x):
    return f(x) * (1 - f(x))  # f(x) * (1-f(x))
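
As a quick sanity check, the analytic derivative can be compared against a central finite difference. The sketch below is illustrative only: the names `sigmoid`, `sigmoid_deriv`, and `eps` are assumptions that mirror `f` and `f_deriv` above so that the block runs on its own.

```python
import numpy as np

def sigmoid(x):
    # same formula as f(x) above: 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    # same formula as f_deriv(x) above: f(x) * (1 - f(x))
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6  # step size for the finite difference (assumed value)

# central difference approximation of the derivative
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2.0 * eps)
max_err = np.max(np.abs(numeric - sigmoid_deriv(x)))

print(sigmoid(0.0), sigmoid_deriv(0.0))  # 0.5 0.25
print(max_err < 1e-8)                    # True
```

The derivative peaks at 0.25 when the input is zero and shrinks toward zero for large positive or negative inputs, which is why the scale of the inputs matters later on.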

Now define the feed-forward function.

Feed-forward is the process of computing the output of an ANN when the weights and
biases of the nodes are known. This is the process used after training to classify input
data.

In [None]:
def feed_forward(x, W, b):
    h = {1: x}
    z = {}
    #
    for l in range(1, len(W) + 1):
        # h[1] is the input x; for every later layer, h[l] is the output
        # of the previous layer, so no special case is needed
        z[l+1] = W[l].dot(h[l]) + b[l]  # z^(l+1) = W^(l)*h^(l) + b^(l)
        h[l+1] = f(z[l+1])  # h^(l+1) = f(z^(l+1))
    #    
    return h, z
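
To make the indexing concrete, here is a minimal sketch that runs a feed-forward pass on a tiny 2-2-1 network with hand-picked weights. The weights, biases, and input are assumed values chosen so the result is easy to verify by hand; `f` and `feed_forward` are repeated so that the block is self-contained.

```python
import numpy as np

def f(x):
    return 1 / (1 + np.exp(-x))

def feed_forward(x, W, b):
    h = {1: x}
    z = {}
    for l in range(1, len(W) + 1):
        z[l + 1] = W[l].dot(h[l]) + b[l]
        h[l + 1] = f(z[l + 1])
    return h, z

# assumed toy weights: the hidden layer ignores its input (all zeros),
# so both hidden nodes output sigmoid(0) = 0.5
W = {1: np.zeros((2, 2)), 2: np.array([[1.0, 1.0]])}
b = {1: np.zeros(2), 2: np.array([-1.0])}

h, z = feed_forward(np.array([3.0, -2.0]), W, b)
print(h[2])  # hidden outputs, both 0.5
print(z[3])  # 0.5 + 0.5 - 1.0 = 0.0
print(h[3])  # sigmoid(0.0) = 0.5
```

Note that the dictionaries are keyed by layer number starting at 1, so `h[1]` is the input and `h[len(W) + 1]` is the network output.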

Define the initialization functions to initialize the weights for each layer.

To simplify the code, we’ll use Python dictionary objects (initialized by `{}`). Next, initialize the weights to random values, using the NumPy function `random_sample`. Starting from random rather than identical values lets the nodes within a layer learn different features, which helps training converge.
The weight initialization function is shown below.

In [None]:
def setup_and_init_weights(nn_structure):
    W = {}
    b = {}
    #
    for l in range(1, len(nn_structure)):
        W[l] = r.random_sample((nn_structure[l], nn_structure[l-1]))
        b[l] = r.random_sample((nn_structure[l],))
    #    
    return W, b
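
The short sketch below shows what the initialization produces for a three-layer structure. The function is repeated so that the block runs on its own, and the structure `[64, 30, 10]` is the one used later in this notebook.

```python
import numpy.random as r

def setup_and_init_weights(nn_structure):
    W = {}
    b = {}
    for l in range(1, len(nn_structure)):
        W[l] = r.random_sample((nn_structure[l], nn_structure[l - 1]))
        b[l] = r.random_sample((nn_structure[l],))
    return W, b

W, b = setup_and_init_weights([64, 30, 10])
for l in sorted(W):
    print(l, W[l].shape, b[l].shape)
# 1 (30, 64) (30,)
# 2 (10, 30) (10,)
```

Each weight matrix has one row per node in the next layer and one column per node in the current layer, which is what lets `W[l].dot(h[l])` work in the feed-forward pass. The values drawn by `random_sample` lie in the half-open interval [0, 1).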

Next, set the mean accumulation values $\Delta W$ and $\Delta b$ to zero (these need to be the same size as the weight and bias matrices).

In [None]:
def init_tri_values(nn_structure):
    tri_W = {}
    tri_b = {}
    #
    for l in range(1, len(nn_structure)):
        tri_W[l] = np.zeros((nn_structure[l], nn_structure[l-1]))
        tri_b[l] = np.zeros((nn_structure[l],))
    #    
    return tri_W, tri_b

Define a few helper functions

In [None]:
def calculate_out_layer_delta(y, h_out, z_out):
    # delta^(nl) = -(y_i - h_i^(nl)) * f'(z_i^(nl))
    return -(y-h_out) * f_deriv(z_out)

def calculate_hidden_delta(delta_plus_1, w_l, z_l):
    # delta^(l) = (transpose(W^(l)) * delta^(l+1)) * f'(z^(l))
    return np.dot(np.transpose(w_l), delta_plus_1) * f_deriv(z_l)

def convert_y_to_vect(y):
    y_vect = np.zeros((len(y), 10))
    for i in range(len(y)):
        y_vect[i, y[i]] = 1
    return y_vect
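
A small worked example may help when reading the two delta functions. The numbers below are assumed, illustrative values (one output node, two hidden nodes, all pre-activations equal to zero) chosen so that the arithmetic can be checked by hand; the helper functions are repeated so that the block is self-contained.

```python
import numpy as np

def f(x):
    return 1 / (1 + np.exp(-x))

def f_deriv(x):
    return f(x) * (1 - f(x))

def calculate_out_layer_delta(y, h_out, z_out):
    return -(y - h_out) * f_deriv(z_out)

def calculate_hidden_delta(delta_plus_1, w_l, z_l):
    return np.dot(np.transpose(w_l), delta_plus_1) * f_deriv(z_l)

# output layer: target 1, pre-activation 0, so h = 0.5 and f'(z) = 0.25
z_out = np.array([0.0])
h_out = f(z_out)
y = np.array([1.0])
delta_out = calculate_out_layer_delta(y, h_out, z_out)
print(delta_out)  # -(1 - 0.5) * 0.25 = -0.125

# hidden layer: the output delta is pushed back through the weights
w = np.array([[2.0, -4.0]])  # assumed weights from 2 hidden nodes to the output
z_hidden = np.zeros(2)
delta_hidden = calculate_hidden_delta(delta_out, w, z_hidden)
print(delta_hidden)  # (2 * -0.125) * 0.25 and (-4 * -0.125) * 0.25
```

The sign of each hidden delta follows the sign of the weight that connects that node to the output, so nodes with larger weights receive a larger share of the error.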

Define the *training* function for the neural network

In [None]:
def train_nn(nn_structure, X, y, iter_num=3000, alpha=0.25):
    W, b = setup_and_init_weights(nn_structure)
    cnt = 0
    m = len(y)
    avg_cost_func = []
    #
    print('Starting gradient descent for {} iterations'.format(iter_num))
    #
    while cnt < iter_num:
        if cnt%1000 == 0:
            print('Iteration {} of {}'.format(cnt, iter_num))
        #    
        tri_W, tri_b = init_tri_values(nn_structure)
        avg_cost = 0
        #
        for i in range(len(y)):
            delta = {}
            # perform the feed forward pass and return the stored h and z values, to be used in the gradient descent step
            h, z = feed_forward(X[i, :], W, b)
            #
            # loop from nl down to 1, backpropagating the errors
            for l in range(len(nn_structure), 0, -1):
                if l == len(nn_structure):
                    delta[l] = calculate_out_layer_delta(y[i,:], h[l], z[l])
                    avg_cost += np.linalg.norm((y[i,:]-h[l]))
                else:
                    if l > 1:
                        delta[l] = calculate_hidden_delta(delta[l+1], W[l], z[l])
                    #
                    # triW^(l) = triW^(l) + delta^(l+1) * transpose(h^(l))
                    tri_W[l] += np.dot(delta[l+1][:,np.newaxis], np.transpose(h[l][:,np.newaxis]))
                    #
                    # trib^(l) = trib^(l) + delta^(l+1)
                    tri_b[l] += delta[l+1]
        #
        # perform the gradient descent step for the weights in each layer
        for l in range(len(nn_structure) - 1, 0, -1):
            W[l] += -alpha * (1.0/m * tri_W[l])
            b[l] += -alpha * (1.0/m * tri_b[l])
        #
        # complete the average cost calculation
        avg_cost = 1.0/m * avg_cost
        avg_cost_func.append(avg_cost)
        cnt += 1
        #
    return W, b, avg_cost_func
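
One way to gain confidence in the back-propagation step is a numerical gradient check on a tiny network. The sketch below is not part of the tutorial's training code: `cost`, `backprop_grads`, the `[3, 4, 2]` structure, and the random seed are assumptions made for illustration. It also assumes the per-sample cost $J = \frac{1}{2} \lVert y - h^{(n_l)} \rVert^2$, which is the cost whose gradient the delta formulas above correspond to.

```python
import numpy as np

def f(x):
    return 1 / (1 + np.exp(-x))

def f_deriv(x):
    return f(x) * (1 - f(x))

def feed_forward(x, W, b):
    h = {1: x}
    z = {}
    for l in range(1, len(W) + 1):
        z[l + 1] = W[l].dot(h[l]) + b[l]
        h[l + 1] = f(z[l + 1])
    return h, z

def cost(x, y, W, b):
    # assumed per-sample cost: J = 1/2 * ||y - h^(nl)||^2
    h, _ = feed_forward(x, W, b)
    return 0.5 * np.sum((y - h[len(W) + 1]) ** 2)

def backprop_grads(x, y, W, b):
    # same delta recursion as in train_nn, for a single sample
    n_layers = len(W) + 1
    h, z = feed_forward(x, W, b)
    delta = {n_layers: -(y - h[n_layers]) * f_deriv(z[n_layers])}
    grad_W = {}
    for l in range(n_layers - 1, 0, -1):
        if l > 1:
            delta[l] = W[l].T.dot(delta[l + 1]) * f_deriv(z[l])
        grad_W[l] = np.outer(delta[l + 1], h[l])
    return grad_W

rng = np.random.default_rng(0)
structure = [3, 4, 2]  # small hypothetical network
W = {l: rng.random((structure[l], structure[l - 1])) for l in range(1, len(structure))}
b = {l: rng.random(structure[l]) for l in range(1, len(structure))}
x = rng.random(3)
y = np.array([1.0, 0.0])

grad_W = backprop_grads(x, y, W, b)

# compare every weight gradient against a central finite difference
eps = 1e-6
max_diff = 0.0
for l in W:
    for idx in np.ndindex(W[l].shape):
        original = W[l][idx]
        W[l][idx] = original + eps
        j_plus = cost(x, y, W, b)
        W[l][idx] = original - eps
        j_minus = cost(x, y, W, b)
        W[l][idx] = original
        numeric = (j_plus - j_minus) / (2 * eps)
        max_diff = max(max_diff, abs(numeric - grad_W[l][idx]))

print(max_diff < 1e-7)  # True when backprop matches the numerical gradient
```

A check like this is usually run once on a small network rather than inside the training loop, because it needs two feed-forward passes per weight.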

Define the *testing* function

In [None]:
def predict_y(W, b, X, n_layers):
    m = X.shape[0]
    y = np.zeros((m,))
    for i in range(m):
        h, z = feed_forward(X[i, :], W, b)
        y[i] = np.argmax(h[n_layers])
    return y
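
The loop above calls `feed_forward` once per sample. As an alternative sketch, the same prediction can be computed for all samples at once with matrix operations. The function name `predict_batch` and the random test data are assumptions for illustration, and the weights are expected in the same dictionary layout used throughout this notebook.

```python
import numpy as np

def f(x):
    return 1 / (1 + np.exp(-x))

def predict_batch(W, b, X):
    # X has one sample per row; each layer is applied to all rows at once
    H = X
    for l in range(1, len(W) + 1):
        H = f(H.dot(W[l].T) + b[l])
    # index of the largest output node for every sample
    return np.argmax(H, axis=1)

# hypothetical random weights and inputs, used only to compare both approaches
rng = np.random.default_rng(0)
nn_structure = [64, 30, 10]
W = {l: 0.1 * rng.standard_normal((nn_structure[l], nn_structure[l - 1]))
     for l in range(1, len(nn_structure))}
b = {l: 0.1 * rng.standard_normal(nn_structure[l])
     for l in range(1, len(nn_structure))}
X = rng.standard_normal((5, 64))

batch_pred = predict_batch(W, b, X)

# per-sample reference, equivalent to the loop in predict_y
loop_pred = []
for i in range(X.shape[0]):
    h = X[i, :]
    for l in range(1, len(W) + 1):
        h = f(W[l].dot(h) + b[l])
    loop_pred.append(int(np.argmax(h)))

print(np.array_equal(batch_pred, np.array(loop_pred)))
```

The batched form returns integer class indices, whereas `predict_y` stores them in a float array; the predicted digits are the same.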

## Train and Test

Now that we have defined everything, we can finally train our neural network and test it!

The MNIST dataset is a standard dataset in neural network and deep learning literature.
Here we use a smaller dataset in the same style: the handwritten digits dataset that ships with scikit-learn.
The dataset consists of images of hand-written digits that are labeled, or “tagged”, so that we can train on the images and compare predictions against the true labeled value. 
Each image is 8 × 8 gray-scale pixels, a total of 64 values that indicate pixel intensity. 
We will load it with the Python machine learning library scikit-learn. 
An example image from the dataset is displayed by the code below.

In [None]:
# load the scikit-learn digits dataset (8 x 8 MNIST-style images)
digits = load_digits()
print(digits.data.shape)

# plot the digit that we will test
plt.figure()
plt.gray() 
plt.matshow(digits.images[1]) 
#plt.savefig('nn_digit.png')

The sigmoid activation function is most sensitive to inputs near zero (roughly within ±1) and saturates for inputs of large magnitude, so we need to scale our input data to a comparable range.

First consider one of the dataset pixel representations. 
Notice that the values in this sample range from 0 up to 15 (across the full dataset, pixel intensities range from 0 to 16).

In [None]:
# show the original pixel data
print('dataset pixel representations \n', digits.data[0,:], '\n\n')

Standardizing the data with scikit-learn results in the following.
By default, scikit-learn's `StandardScaler` normalizes each input feature by subtracting its mean and dividing by its standard deviation. 
As shown, the data is centered around zero with a standard deviation of 1 (1$\sigma$), so typical values are on the order of ±1, although individual values can fall outside that range.

In [None]:
# show the scaled pixel data
X_scale = StandardScaler()
X = X_scale.fit_transform(digits.data)
print('scale the input data \n', X[0,:], '\n\n')
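
To show what the scaler does, the sketch below standardizes a small array by hand with NumPy. The array `X_raw` is made-up data, not taken from the digits dataset, and the handling of constant columns is written to mirror the default behavior of `StandardScaler` as I understand it (population standard deviation, constant features left at zero).

```python
import numpy as np

# made-up pixel-like data: 3 samples, 3 features, first feature constant
X_raw = np.array([[0.0,  5.0,  0.0],
                  [0.0, 10.0,  8.0],
                  [0.0, 15.0, 16.0]])

mean = X_raw.mean(axis=0)
std = X_raw.std(axis=0)   # population standard deviation (ddof=0)
std[std == 0] = 1.0       # avoid dividing by zero for constant columns

X_std = (X_raw - mean) / std
print(X_std)
print(X_std.mean(axis=0))  # approximately 0 for every column
print(X_std.std(axis=0))   # 1 for non-constant columns, 0 for the constant one
```

Constant columns matter here because several border pixels in small digit images are always blank.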

In the context of ML, the term “over-fitting” refers to the tendency of ANN models to predict specific inputs very accurately, based on extensive training, but to poorly predict inputs
that deviate even slightly from the training data. Simply stated, “over-fitting” results in the inability to predict anything the ANN has not “seen” previously. 
To detect this, 60 − 80% of a given set of data is typically used for training, while the remaining data is held out for testing.
Using scikit-learn, we can split the data into training and testing sets, as shown below.

In [None]:
# set the digits data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4)

To predict digits from 0 to 9, we need 10 nodes in the output layer. 
For example: the prediction of the digit “2” should produce the output layer result [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]. 
However, in reality the result will more closely resemble [0.05, 0.05, 0.8, 0, 0, 0.05, 0, 0.02, 0.01, 0.02], where the largest value, 0.8, marks the most likely class: the digit “2”. This example happens to sum to 1, but sigmoid outputs are not constrained to do so.
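
Reading the predicted digit from such an output vector is a one-line operation. The sketch below uses the example vector from the text; the variable name `output` is an assumption.

```python
import numpy as np

# example output-layer activations from the text above
output = np.array([0.05, 0.05, 0.8, 0, 0, 0.05, 0, 0.02, 0.01, 0.02])

predicted_digit = int(np.argmax(output))  # index of the largest activation
print(predicted_digit)                    # 2
print(np.isclose(output.sum(), 1.0))      # True for this particular example
```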

In the digits data supplied with scikit-learn, the “targets”, i.e. the classifications of the handwritten digits, are in the form of a single number. 
We need to convert that single number into a vector so that it lines up with our 10 node output layer. 
In other words, if the target value in the dataset is “1” we want to convert it into the vector [0, 1, 0, 0, 0, 0, 0, 0, 0, 0], as shown below.

In [None]:
# convert data to vector
y_v_train = convert_y_to_vect(y_train)
y_v_test = convert_y_to_vect(y_test)
print('y_train[0]=',y_train[0])
print('y_v_train[0]=', y_v_train[0])
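
For reference, the same conversion can be written without an explicit loop by indexing into an identity matrix. This is an alternative sketch, not the function used in this notebook, and the label array `y` below is made up for illustration.

```python
import numpy as np

y = np.array([1, 0, 9, 2])  # made-up integer labels

# row k of the 10 x 10 identity matrix is the one-hot vector for digit k
y_vect = np.eye(10)[y]

print(y_vect.shape)  # (4, 10)
print(y_vect[0])     # [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
```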

Next we will specify the structure of the neural network. 
The input layer requires 64 nodes for the 64 pixels in each image. 
The output layer requires 10 nodes to predict the digits. 
The hidden layer requires enough nodes to account for the complexity of the data. 
Using the rule of thumb of roughly $\frac{N_{input}}{2}$ nodes, where $N_{input}$ represents the number of nodes in the input layer, we define 30 nodes for the hidden layer (close to $64/2 = 32$). 
Therefore:

In [None]:
# set the neural net structure
#    64 nodes to cover the 64 pixels in the image 
#    30 hidden layer nodes to allow for the complexity of the task
#    10 output layer nodes to predict the digits
nn_structure = [64, 30, 10]
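
It can be useful to know how many trainable parameters this structure implies. The short sketch below counts them; the variable names `n_weights` and `n_biases` are assumptions.

```python
nn_structure = [64, 30, 10]

# one weight per connection between consecutive layers
n_weights = sum(n_out * n_in
                for n_in, n_out in zip(nn_structure[:-1], nn_structure[1:]))
# one bias per node in every layer except the input layer
n_biases = sum(nn_structure[1:])

print(n_weights, n_biases, n_weights + n_biases)  # 2220 40 2260
```

With roughly 1,000 training samples and about 2,300 parameters, holding out a test set is a sensible way to watch for over-fitting.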

Finally we start training the ANN.
Notice that the training function defined above does not terminate upon reaching a threshold.
Instead, the function terminates after 3,000 iterations so that we can monitor the change in the average cost function, see `avg_cost_func`.
In each gradient descent iteration, the function cycles through each training sample (`range(len(y))`) and performs the feed-forward pass followed by the back-propagation.
The back-propagation step is an iteration through the layers starting at the output layer and working backwards: `range(len(nn_structure), 0, -1)`.
The average cost is calculated at the output layer (`l == len(nn_structure)`).
The mean accumulation values, $\Delta W$ and $\Delta b$, designated as `tri_W` and `tri_b`, are updated for every layer.
After iterating through all training samples and accumulating the `tri_W` and `tri_b` values, the gradient descent step is computed and the values for the weights and biases are updated:

$$W^{(l)} = W^{(l)} - \alpha [\frac{1}{m} \Delta W^{(l)}]$$
$$b^{(l)} = b^{(l)} - \alpha [\frac{1}{m} \Delta b^{(l)}]$$

At termination, the function returns the trained weight and bias values, as well as the tracked average cost for each iteration. 
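
For reference, the per-sample quantities that feed these updates can be written out as follows. This summary is transcribed from the comments in the helper and training functions above; the symbol $\odot$ for the element-wise product and $n_l$ for the index of the output layer are notational assumptions.

$$\delta^{(n_l)} = -(y - h^{(n_l)}) \odot f'(z^{(n_l)})$$
$$\delta^{(l)} = \left( (W^{(l)})^T \delta^{(l+1)} \right) \odot f'(z^{(l)})$$
$$\Delta W^{(l)} = \Delta W^{(l)} + \delta^{(l+1)} (h^{(l)})^T$$
$$\Delta b^{(l)} = \Delta b^{(l)} + \delta^{(l+1)}$$

The first two lines are evaluated from the output layer backwards, and the last two are accumulated over all $m$ training samples before each gradient descent step.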

In [None]:
# train the neural network
W, b, avg_cost_func = train_nn(nn_structure, X_train, y_v_train)

After the function terminates, we can plot the average cost for each iteration.
As shown in the figure below, by 3,000 iterations the average cost has started to "plateau", implying that additional iterations are not likely to improve performance.

In [None]:
# plot the results
plt.figure()
plt.plot(avg_cost_func)
plt.ylabel('Average J')
plt.xlabel('Iteration number')
plt.grid(True)
plt.show()
#plt.savefig('nn_average_cost_vs_iteration.png')

With an adequately trained neural network model, we can test (64 pixel) inputs from the held-out test set.
This is performed by a *single* feed-forward pass through the network for each input, using our trained weight and bias values.
As discussed previously, we assess the prediction of the output layer by taking the node with the maximum output as the predicted digit using the `numpy.argmax` function.

In [None]:
# show accuracy
y_pred = predict_y(W, b, X_test, len(nn_structure))
print('accuracy score =', accuracy_score(y_test, y_pred)*100)
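
The accuracy score is simply the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up label arrays (not results from this notebook) to show what the percentage means.

```python
import numpy as np

# made-up labels and predictions, for illustration only
y_true = np.array([0, 1, 2, 2, 9])
y_hat = np.array([0.0, 1.0, 2.0, 3.0, 9.0])  # predict_y returns floats

# fraction of positions where the prediction equals the label, as a percentage
accuracy = np.mean(y_true == y_hat) * 100
print(accuracy)  # 4 of 5 correct
```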

## Final Comments

Now that you have reached the end of this tutorial, you should find that the skills that you gained here will help you as you tackle more complex topics.
Good luck!