In [4]:
import numpy as np

#Input array
X=np.array([[1,0,1,0],[1,0,1,1],[0,1,0,1]])

#Output
y=np.array([[1],[1],[0]])

#Sigmoid Function
def sigmoid (x):
    return 1/(1 + np.exp(-x))

#Derivative of Sigmoid Function (x is expected to be a sigmoid output, not the raw input)
def derivatives_sigmoid(x):
    return x * (1 - x)

#Variable initialization
epoch=5000 #Setting training iterations
lr=0.1 #Setting learning rate
inputlayer_neurons = X.shape[1] #number of features in data set
hiddenlayer_neurons = 3 #number of hidden layer neurons
output_neurons = 1 #number of neurons at output layer

#weight and bias initialization
wh=np.random.uniform(size=(inputlayer_neurons,hiddenlayer_neurons))
bh=np.random.uniform(size=(1,hiddenlayer_neurons))
wout=np.random.uniform(size=(hiddenlayer_neurons,output_neurons))
bout=np.random.uniform(size=(1,output_neurons))

for i in range(epoch):

    #Forward Propagation
    hidden_layer_input1=np.dot(X,wh)
    hidden_layer_input=hidden_layer_input1 + bh
    hiddenlayer_activations = sigmoid(hidden_layer_input)
    output_layer_input1=np.dot(hiddenlayer_activations,wout)
    output_layer_input= output_layer_input1+ bout
    output = sigmoid(output_layer_input)

    #Backpropagation
    E = y-output
    slope_output_layer = derivatives_sigmoid(output)
    slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)
    d_output = E * slope_output_layer
    Error_at_hidden_layer = d_output.dot(wout.T)
    d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer
    wout += hiddenlayer_activations.T.dot(d_output) *lr
    bout += np.sum(d_output, axis=0,keepdims=True) *lr
    wh += X.T.dot(d_hiddenlayer) *lr
    bh += np.sum(d_hiddenlayer, axis=0,keepdims=True) *lr

print(output)

[[ 0.97983896]
 [ 0.97419403]
 [ 0.03703582]]


# Simple intuition behind neural networks
If you have been a developer or seen one work, you know what it is like to search for bugs in code. You fire various test cases, varying the inputs or circumstances, and watch the output. The change in output gives you a hint about where to look for the bug: which module to check, which lines to read. Once you find it, you make the changes, and the exercise continues until you have the right code or application.

Neural networks work in a very similar manner. A network takes several inputs, processes them through multiple neurons in one or more hidden layers, and returns the result using an output layer. This result estimation process is technically known as <b>“Forward Propagation”</b>.

- Next, we compare the result with the actual output. The task is to make the output of the neural network as close as possible to the actual (desired) output.
    1. Each of the neurons contributes some error to the final output.

- We try to minimize the values/weights of the neurons that contribute to the error. This happens while travelling back through the neurons of the network to find where the error lies, a process known as <b>"Back Propagation"</b>.
- To reduce the number of iterations needed to minimize the error, neural networks use a common algorithm known as “Gradient Descent”, which helps to optimize the task quickly and efficiently.
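
To make the idea of gradient descent concrete, here is a minimal sketch on a hypothetical one-parameter loss, (w - 3)^2. The loss, the starting value and the learning rate are assumptions chosen for illustration; they are not part of the network trained in this notebook.

```python
# Gradient descent on a toy loss with a single parameter w.
# The loss (w - 3)^2 is a hypothetical example with its minimum at w = 3.

def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = 0.0    # arbitrary starting point
lr = 0.1   # learning rate (step size)

for step in range(100):
    # Move against the gradient to reduce the loss
    w -= lr * grad(w)

print(w, loss(w))
```

Each step moves w a little in the direction that lowers the loss, so w approaches 3 and the loss approaches 0.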

# Multi Layer Perceptron and its basics

Just as atoms form the basis of any material on earth, the basic building block of a neural network is a perceptron. So, what is a perceptron?

A perceptron can be understood as anything that takes multiple inputs and produces one output. For example, look at the image below.
<img src='https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/05/27090210/tikz0.png'/>

The above structure takes three inputs and produces one output. The next logical question is: what is the relationship between input and output? Let us start with basic ways and build up to more complex ones.

Below, I have discussed three ways of creating input-output relationships:

<b>1. By directly combining the inputs and computing the output</b> based on a threshold value. For example, take x1=0, x2=1, x3=1 and set the threshold to 0. If <b> x1+x2+x3>0 </b>, the output is 1, otherwise 0. In this case, the perceptron calculates the output as 1.

<b>2. Next, let us add weights to the inputs. </b> <i>Weights give importance to an input.</i> For example, you assign w1=2, w2=3 and w3=4 to x1, x2 and x3 respectively. To compute the output, we multiply each input by its respective weight and compare the sum with the threshold value: <b> w1*x1 + w2*x2 + w3*x3 > threshold.</b> These weights assign more importance to x3 than to x1 and x2.

<b>3. Next, let us add bias.</b> Each perceptron also has a bias, which can be thought of as how flexible the perceptron is. <b>The bias is similar to the constant <i>b</i> of a linear function y = ax + b.</b> <i> It allows us to move the line up and down to fit the prediction to the data better.</i> Without b, the line always goes through the origin (0, 0) and you may get a poorer fit. For example, a perceptron with two inputs requires three weights: one for each input and one for the bias. With the three inputs used above, the linear representation of the input looks like <b> w1*x1 + w2*x2 + w3*x3 + 1*b.</b>
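
The three variants above can be sketched in a few lines of NumPy. The inputs and the weights w1=2, w2=3, w3=4 come from the text; the function name `perceptron`, the threshold of 5 and the bias of -3 are hypothetical values picked only to show how the output can change.

```python
import numpy as np

def perceptron(x, w, b=0.0, threshold=0.0):
    """Return 1 if the weighted sum plus bias exceeds the threshold, else 0."""
    return int(np.dot(w, x) + b > threshold)

x = np.array([0, 1, 1])  # x1=0, x2=1, x3=1

# 1. Plain sum of inputs (all weights equal to 1), threshold 0: 0+1+1 = 2 > 0
out_plain = perceptron(x, np.ones(3))

# 2. Weighted sum with w1=2, w2=3, w3=4 and an assumed threshold of 5: 7 > 5
w = np.array([2, 3, 4])
out_weighted = perceptron(x, w, threshold=5)

# 3. Same weights plus an assumed bias of -3: 7 - 3 = 4, which is not > 5
out_bias = perceptron(x, w, b=-3, threshold=5)

print(out_plain, out_weighted, out_bias)  # 1 1 0
```

The bias shifts the weighted sum before it is compared with the threshold, which is why the third call produces a different output from the second.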

The perceptron later evolved into what is now called an artificial neuron. A neuron applies a non-linear transformation (an activation function) to the weighted inputs and bias.


# What is an activation function?
An activation function takes the weighted sum of the inputs <b>(w1*x1 + w2*x2 + w3*x3 + 1*b)</b> as an argument and returns the output of the neuron.

The activation function is mostly used to make a non-linear transformation, which allows us to fit non-linear hypotheses or to estimate complex functions. <b>There are multiple activation functions, such as “Sigmoid”, “Tanh” and “ReLU”</b>, among many others.
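
As a rough illustration, the three activation functions named above can be written with NumPy as follows. The sample input `z` is an arbitrary choice, and the function names `tanh` and `relu` are local helpers introduced here, not names used elsewhere in this notebook.

```python
import numpy as np

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes any real value into the range (-1, 1)
    return np.tanh(z)

def relu(z):
    # Keeps positive values and replaces negative values with 0
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])  # sample weighted sums

print("sigmoid:", sigmoid(z))
print("tanh:   ", tanh(z))
print("relu:   ", relu(z))
```

All three are non-linear, which is what lets a stack of layers represent more than a single linear transformation.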

# Forward Propagation, Back Propagation and Epochs
So far, we have computed the output; this process is known as <b>“Forward Propagation”</b>. But what if the estimated output is far from the actual output (high error)? In a neural network, we update the biases and weights based on the error. This weight and bias updating process is known as <b>“Back Propagation”</b>.

<b>Back-propagation (BP) algorithms</b> work by determining the loss (or error) at the output and then propagating it back into the network. The weights are updated to minimize the error resulting from each neuron. The first step in minimizing the error is to determine the gradient (derivatives) of each node with respect to the final output. For a mathematical perspective on backward propagation, refer to the sections below.

One round of forward and back propagation over the training data is known as one training iteration, also called an <b>“Epoch”</b>.

# Multi-layer perceptron
- Now, let’s move on to the next part, the Multi-Layer Perceptron. So far, we have seen just a single layer consisting of 3 input nodes, i.e. x1, x2 and x3, and an output layer consisting of a single neuron. But for practical purposes, a single-layer network can do only so much.

- An MLP consists of multiple layers called Hidden Layers stacked in between the Input Layer and the Output Layer as shown below.
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/05/26094834/Screen-Shot-2017-05-26-at-9.47.51-AM-768x404.png"/>
- The image above shows just a single hidden layer in green, but in practice a network can contain multiple hidden layers.
- Another point to remember in the case of an MLP is that <b>all the layers are fully connected, i.e. every node in a layer (except the input and the output layer) is connected to every node in the previous layer and the following layer.</b>

# Full Batch Gradient Descent and Stochastic Gradient Descent
Both variants of Gradient Descent perform the same work of updating the weights of the MLP using the same updating algorithm; the difference lies in the number of training samples used for each update of the weights and biases.

Full Batch Gradient Descent, as the name implies, uses all the training data points for each update of the weights, whereas Stochastic Gradient Descent uses one or more samples, but never the entire training data, for each update.

Let us understand this with a simple example of a dataset of 10 data points with two weights w1 and w2.

<b>Full Batch:</b> You use all 10 data points (the entire training data) to calculate the change in w1 (Δw1) and the change in w2 (Δw2), and then update w1 and w2.

<b>SGD:</b> You use the 1st data point to calculate the change in w1 (Δw1) and the change in w2 (Δw2), and update w1 and w2. Next, when you use the 2nd data point, you work with the updated weights.
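
The difference can be sketched on a toy problem with 10 data points and two weights. Everything in this example is an assumption made for illustration: the linear model, the synthetic data, the target weights (2, -1), the learning rate and the number of epochs.

```python
import numpy as np

# Synthetic dataset: 10 data points, 2 features, generated from known weights
x1 = np.linspace(-1.0, 1.0, 10)
x2 = np.where(np.arange(10) % 2 == 0, 1.0, -1.0)
X = np.column_stack([x1, x2])
true_w = np.array([2.0, -1.0])
y = X @ true_w

lr = 0.1
epochs = 2000

def gradient(w, X_part, y_part):
    """Gradient of 0.5 * mean((X_part @ w - y_part)^2) with respect to w."""
    residual = X_part @ w - y_part
    return X_part.T @ residual / len(y_part)

# Full batch: all 10 points are used for every single update
w_batch = np.zeros(2)
batch_updates = 0
for epoch in range(epochs):
    w_batch -= lr * gradient(w_batch, X, y)
    batch_updates += 1

# SGD: one data point per update, each update sees the latest weights
w_sgd = np.zeros(2)
sgd_updates = 0
for epoch in range(epochs):
    for i in range(len(y)):
        w_sgd -= lr * gradient(w_sgd, X[i:i + 1], y[i:i + 1])
        sgd_updates += 1

print("full batch:", w_batch, "updates:", batch_updates)
print("SGD:       ", w_sgd, "updates:", sgd_updates)
```

Both loops see the same data the same number of times, but SGD performs ten weight updates per pass where full batch performs one.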

# Steps involved in Neural Network methodology
<img src="https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/05/26094834/Screen-Shot-2017-05-26-at-9.47.51-AM-768x404.png"/>

Let’s look at the step-by-step methodology for building a Neural Network (an MLP with one hidden layer, similar to the architecture shown above). At the output layer, we have only one neuron, as we are solving a binary classification problem (predict 0 or 1). We could also have two neurons, one for each of the two classes.

First, look at the broad steps:

0.) We take input and output

- X as an input matrix
- y as an output matrix

1.) We initialize weights and biases with random values (this is a one-time initialization; in the next iteration, we will use the updated weights and biases). Let us define:

<b>
- wh as weight matrix to the hidden layer
- bh as bias matrix to the hidden layer
- wout as weight matrix to the output layer
- bout as bias matrix to the output layer
</b>

2.) We take the matrix dot product of the input and the weights assigned to the edges between the input and hidden layer, then add the biases of the hidden layer neurons to the respective inputs. This is known as a linear transformation:

    hidden_layer_input= matrix_dot_product(X,wh) + bh

3.) Perform a non-linear transformation using an activation function (Sigmoid). Sigmoid will return the output as 1/(1 + exp(-x)).

    hiddenlayer_activations = sigmoid(hidden_layer_input)

4.) Perform a linear transformation on the hidden layer activations (take the matrix dot product with the weights and add the bias of the output layer neuron), then apply an activation function (sigmoid again here, but you can use any other activation function depending on your task) to predict the output:

    output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout
    output = sigmoid(output_layer_input)

All the above steps are known as <b>“Forward Propagation”</b>.
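
Steps 0 to 4 can be collected into one function, which also makes the matrix shapes easy to inspect. This is a sketch that reuses the variable names from the steps above; the helper name `forward` and the random seed are assumptions added here.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(X, wh, bh, wout, bout):
    """Run steps 2 to 4: input -> hidden layer -> output layer."""
    hidden_layer_input = np.dot(X, wh) + bh                  # (samples, hidden)
    hiddenlayer_activations = sigmoid(hidden_layer_input)    # (samples, hidden)
    output_layer_input = np.dot(hiddenlayer_activations, wout) + bout
    output = sigmoid(output_layer_input)                     # (samples, outputs)
    return hiddenlayer_activations, output

# Step 0: input matrix with 3 samples and 4 features
X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]])

# Step 1: random initialization (the seed is an arbitrary choice)
rng = np.random.default_rng(42)
wh = rng.uniform(size=(4, 3))     # 4 input features -> 3 hidden neurons
bh = rng.uniform(size=(1, 3))
wout = rng.uniform(size=(3, 1))   # 3 hidden neurons -> 1 output neuron
bout = rng.uniform(size=(1, 1))

hidden, output = forward(X, wh, bh, wout, bout)
print(hidden.shape, output.shape)  # (3, 3) (3, 1)
```

The bias rows have a leading dimension of 1, so NumPy broadcasting adds the same bias to every sample.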


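5.) Compare the prediction with the actual output and calculate the gradient of the error (Actual - Predicted). The error is the squared-error loss ((y - output)^2)/2, so its derivative with respect to the output is -(y - output); the minus sign is absorbed by adding, rather than subtracting, the updates in steps 10 and 11.

    E = y - output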

6.) Compute the slope/gradient of the hidden and output layer neurons (to compute the slope, we calculate the derivative of the non-linear activation at each layer for each neuron). The gradient of sigmoid can be returned as x * (1 - x), where x is the sigmoid output (the activation), not its input.

    
    slope_output_layer = derivatives_sigmoid(output)
    slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)
    

7.) Compute the change factor (delta) at the output layer, which is the gradient of the error multiplied by the slope of the output layer activation:

     d_output = E * slope_output_layer 

8.) At this step, the error propagates back into the network, which gives the error at the hidden layer. For this, we take the dot product of the output layer delta with the weight parameters of the edges between the hidden and output layer (wout.T).

     Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose) 

9.) Compute the change factor (delta) at the hidden layer by multiplying the error at the hidden layer with the slope of the hidden layer activation:

     d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer 

10.) Update weights at the output and hidden layer: The weights in the network can be updated from the errors calculated for training example(s).

    wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output)*learning_rate 
    wh =  wh + matrix_dot_product(X.Transpose,d_hiddenlayer)*learning_rate

(learning_rate: the amount by which the weights are updated is controlled by a configuration parameter called the learning rate.)

11.) Update biases at the output and hidden layer: The biases in the network can be updated from the aggregated errors at that neuron.

    bias at output_layer = bias at output_layer + sum of delta of output_layer at row-wise * learning_rate
    bias at hidden_layer = bias at hidden_layer + sum of delta of hidden_layer at row-wise * learning_rate
    bh = bh + sum(d_hiddenlayer, axis=0) * learning_rate
    bout = bout + sum(d_output, axis=0) * learning_rate

Steps 5 to 11 are known as “Backward Propagation”.
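
One way to gain confidence in steps 5 to 10 is a gradient check: the quantity added to `wh` (before multiplying by the learning rate) should equal the negative gradient of the loss sum((y - output)^2)/2 with respect to `wh`. The sketch below compares it with a finite-difference estimate. The helper names `forward` and `loss`, the seed and `eps` are assumptions introduced for this check only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(X, wh, bh, wout, bout):
    hidden = sigmoid(X @ wh + bh)
    return hidden, sigmoid(hidden @ wout + bout)

def loss(X, y, wh, bh, wout, bout):
    # Squared-error loss summed over all samples
    _, output = forward(X, wh, bh, wout, bout)
    return 0.5 * np.sum((y - output) ** 2)

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]], dtype=float)
y = np.array([[1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)  # arbitrary seed
wh = rng.uniform(size=(4, 3))
bh = rng.uniform(size=(1, 3))
wout = rng.uniform(size=(3, 1))
bout = rng.uniform(size=(1, 1))

# Steps 5 to 9 as described above
hidden, output = forward(X, wh, bh, wout, bout)
E = y - output
d_output = E * output * (1 - output)
d_hiddenlayer = d_output.dot(wout.T) * hidden * (1 - hidden)
step_wh = X.T.dot(d_hiddenlayer)  # what step 10 adds to wh, before learning_rate

# Finite-difference gradient of the loss with respect to each entry of wh
eps = 1e-6
numeric_grad = np.zeros_like(wh)
for i in range(wh.shape[0]):
    for j in range(wh.shape[1]):
        shift = np.zeros_like(wh)
        shift[i, j] = eps
        numeric_grad[i, j] = (loss(X, y, wh + shift, bh, wout, bout)
                              - loss(X, y, wh - shift, bh, wout, bout)) / (2 * eps)

# step_wh should be the negative of the gradient, so the sum should be near 0
max_abs_diff = np.max(np.abs(step_wh + numeric_grad))
print(max_abs_diff)
```

Because the update direction is the negative gradient, adding it to the weights is a gradient descent step on the squared error.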

One forward and backward propagation iteration is considered one training cycle. As mentioned earlier, when we train a second time, the updated weights and biases are used for forward propagation.

Above, we have updated the weights and biases for the hidden and output layers using the full batch gradient descent algorithm.