# Tutorial for Neural Networks

The purpose of this notebook is to create an implementation of a simple  artificial neural network for learning purposes. 

Here are the main concepts associated with an ANN:
- Activation Function 
- Forward Propagation
- Gradient Descent
- Backpropagation

In [3]:
import numpy as np

In [4]:
X = np.array(([3,5, 8],[5,1, 10]), dtype=float)
y = np.array(([75],[82]), dtype=float)

In [35]:
# Normalize the data
X = X/np.amax(X, axis=0)
y = y/100

In [37]:
X

array([[ 0.6,  1. ,  0.8],
       [ 1. ,  0.2,  1. ]])

In [127]:
class NeuralNet(object):
    def __init__(self):
        self.num_layers = 3
        self.input_layer_size = 3
        self.output_layer_size = 1
        self.hidden_layer_size  = 3
        
#         self.w = []
#         for i in range(num_layers-1):
#             self.w[i] = np.random.randn(self.arr[i], self.arr[i+1])
        self.w1 = np.random.randn(self.hidden_layer_size, self.input_layer_size + 1)    
        self.w2 = np.random.randn(self.output_layer_size, self.hidden_layer_size + 1)
        self.w = [self.w1, self.w2]

    def forward_propagation(self, X):
        for l in range(self.num_layers-1):
            if l == 0:
                bias_values = np.array([1 for r in range(X.shape[0])]).reshape(X.shape[0], 1)
                node_in = np.concatenate((bias_values, X), axis=1)
                node_in = node_in.reshape(node_in.shape[1], node_in.shape[0])
            else:
                bias_values = np.array([1 for r in range(h.shape[1])]).reshape(1, h.shape[1])
                node_in = np.concatenate((h, bias_values), axis=0)
                
            z = np.dot(self.w[l], node_in)            
            h = self.sigmoid(z)     

        return h

    def sigmoid(self, z):
        return 1/(1+np.exp(-z))
    
    def sigmoid_derivative(self, z):
        return np.exp(-z)/(1+np.exp(-z)**2)
    
    def cost_function_derivative(self, X, y):
        self.y_hat = self.forward_propagation(X)
        
        delta3 = np.multiply(-(y-self.y_hat), self.sigmoid_derivative(self.z3))
        dj_dw2 = np.dot(self.a2.T, delta3)
        
        delta2 = np.dot(delta3, self.w2.T)*self.sigmoid_derivative(self.z2)
        dj_dw1 = np.dot(X.T, delta2)
        
        return dj_dw1, dj_dw2

In [128]:
NN = NeuralNet()
print NN.forward_propagation(X)
print y

[[ 0.68330412  0.74938673]]
[[ 0.0075]
 [ 0.0082]]


## Activation Function
There are a few important parts to building out a basic Artificial Neural Network. The first thing to do is define the concept of an activation/sigmoid function.

$$f(z)= \frac{1}{1+\exp(-x)}$$

This is the function applied to the inputs of all nodes in the network  and its result is passed to the next layer as an input. 

In [19]:
def sigmoid(self, z):
    """Given an input, returns the result of the sigmoid function."""
    return 1/(1+np.exp(-z))

## Forward Propagation

This activation function has a weight assigned to its input for each individual connection from one node to another. The sum of these results become the output to feed into the next layer of nodes. This concept is called forward propagation.

Implementation-wise, we use numpy arrays and matrix operations to make this function simpler and more time-efficient.

In [57]:
def forward_propagation(self, X):
    for l in range(self.num_layers-1):
        if l == 0:
            node_in = X.T
        else:
            node_in = h
        z = np.dot(self.w[l],node_in) # + b[l] add bias later
        h = self.sigmoid(z)           
    return h

## Gradient Descent and Cost Function

We now have a function that will take an input based on initial arbitrarily generated weights. This will produce a value that makes a prediction of what the data should be. However, this is bound to be off by a certain measure. This difference between a predicted and actual value is a cost, which we will strictly define through a cost function.

$$J(w) = \frac{1}{m}\sum \frac{1}{2} (y^{z} - h^{n_{l}}(x^{z}))^{2} $$

Here, m represents the number of training samples and $h^{n_{l}}$ is the output of the final activation layer.

We can attempt to minimize this cost function iteratively by calculating its derivative and using that gradient to iteratively move through the function to find its minimum value, aka the best predictor of the actual data. We are specifically manipulating each weight in the network to achieve this, shown in the computation below:

$$w^{(l)}_{ij} = w^{(l)}_{ij} - \alpha \frac{\partial}{\partial  w^{(l)}_{ij}}J(w)$$

Here, $\alpha$ represents the step size that indicates the magnitude of the change in weight. It will determine the speed at which gradient descent converges to a solution. To stop gradient descent, we will need to define how accurate (how small the error) the model should be. 

Also, i and j refers to the nodes that the weights are associated with, where i is the destination node and j is the source. 

We'll delve into some math now to show how a partial derivative for a given $w_{ij}$ is cancelled. Let's look at $\frac{\partial J}{\partial  w^{(2)}_{12}}$. We can separate this into a chain of derivatives.

Let's make some definitions first. we will say the output layer function looks like the following:

$$h_{1}^{3} = f(w_{11}^{2}h_{1}^{2} + w_{12}^{2}h_{2}^{2} + w_{13}^{2}h_{3}^{2}) = f(z_{1}^{(2)})$$

So, we can can define our derivative as:

$$\frac{\partial J}{\partial w^{(2)}_{12}} = \frac{\partial J}{\partial h^{(3)}_{1}} \frac{\partial h^{(3)}_{1}}{\partial z_{1}^{2}} \frac{\partial z_{1}^{2}}{\partial w^{(2)}_{12}} $$

We will now evaluate each one separately. We can simplify the third term to:

$$\frac{\partial z_{1}^{2}}{\partial w^{(2)}_{12}} = h_{2}^{2}$$

For the second term, we simply require the derivative of the activation function defined earlier:

$$\frac{\partial h}{\partial z} = f(z)(1 - f(z))$$

The final term is the derivative of the cost function with respect to the output of the activation function. This result is:

$$\frac{\partial J}{\partial h} = -(y_{1} - h_{1}^{(3)})$$

So, the product of thee three results will give us the derivative for that particular weight. 

To generalize this result, let's define a new variable $\delta$:

$$\delta^{(n_{l})}_{i} = -(y_{i}-h_{i}^{(n_{l})})f'(z_{i}^{(n_{l})})$$

Extrapolating for any given connection ij, we can see that:

$$\frac {\partial J(W)}{\partial W_{ij}^{(l)}} = h_{j}^{l} 
\delta_{i}^{(l+1)}$$



## Backpropagation

The above derivations work well for the weights that are closest to the output layer. We need a better way to update the weights in deeper layers. We achieve this with the backpropagation method. 

We need to propagate the $\delta^{(n_{l})}_{i}$ to previous layers. The delta function of nodes from previous layers will be the delta function of the next layer multiplied by the connecting weight from the source to destination node. 

$$\delta^{(l)}_{j} = \delta^{(l=1)}_{1}w^{(l)}_{1j}f'(z_{j})^{(l)}$$




In [95]:
NN = NeuralNet()

print NN.forward_propagation(X[0:1])


(1, 3)

In [35]:
NN  = Neural_Net()

c1 = NN.cost_function_derivative(X, y)

In [36]:
 a,b = NN.cost_function_derivative(X,y)

In [37]:
a

array([[-0.06366729,  0.02766799,  0.25473514],
       [-0.04287048,  0.01929333,  0.18918441]])

In [38]:
b

array([[-0.2191875 ],
       [-0.32555071],
       [-0.19873838]])

$$c = \sqrt{a^2 + b^2}$$