# Multi-Layer Neural Network:

A Multi-Layer Neural Network can be seen as an extension of the [two layer neural network](https://www.kaggle.com/hamzafar/two-layers-neural-network) we built before. In this notebook, we walk through the generic architecture of a neural network in three parts:
1. Description of a generic Neural Network
2. Deployment of the Network
3. Performance on training sets of different lengths

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt # plotting graphs

### Generate Data
To discuss the generalized behavior of the model (by generalized we mean that it can work on data of any shape), we have created the generate_bits function. The function is simple: it takes the number of features n_x and the desired number of examples m, and randomly generates binary data, i.e. 0s and 1s, as an array of shape (n_x, m) with one example per column.

In [None]:
def generate_bits(n_x, m):
    # Generate an n_x x m array of ints between 0 and 1, inclusive:
    # n_x: number of features (rows)
    # m: number of examples (columns)
    np.random.seed(1)
    data = np.random.randint(2, size=(n_x, m))
    return(data)

### Create Labels:
To train the network, i.e. to update the parameters weight and bias through their derivatives, the loss function measures the difference between the actual values and the activation values. The actual value is the label that each example has. For example, the actual value of an OR operation:
$$1 + 0 = 1 = \textrm{actual value} \quad [\textrm{OR operation} = +]$$
The generate_label function below takes the data as input and applies the XOR operation across the features of each example.

In [None]:
def generate_label(data, m):
    # generate labels by applying the xor operation across the features of each example
    # return array of labels of shape (1, m)
        # data: binary data set of shape (n_x, m)
        # m: number of examples
    lst_y = []
    y= np.empty((m,1))
    k = 0
    for tmp in data.T:
        xor = np.logical_xor(tmp[0], tmp[1])

        for i in range(2, tmp.shape[0]):
            xor = np.logical_xor(xor, tmp[i])
    #     print(xor)
        lst_y.append(int(xor))
        y[k,:] = int(xor)
        k+=1
    return(y.T)
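
The XOR labelling above can also be written without Python loops. The following is a minimal sketch, not part of the original notebook; the name `xor_labels` is hypothetical, and it assumes `data` is an integer array of 0s and 1s with shape (n_x, m).

```python
import numpy as np

def xor_labels(data):
    # XOR-reduce along the feature axis (axis 0): one label per column/example
    labels = np.bitwise_xor.reduce(data, axis=0)
    # Return shape (1, m) as floats, the same layout generate_label returns
    return labels.reshape(1, -1).astype(float)

demo = np.array([[1, 0, 1],
                 [1, 1, 0],
                 [0, 0, 1]])   # 3 features, 3 examples (one per column)
print(xor_labels(demo))
```

Because XOR is associative, reducing over the whole column gives the same result as chaining `np.logical_xor` feature by feature.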

## Multi Layer Neural Network:
### Forward Propagation:
In a Multi Layer Neural Network, commonly known as a Multi Layer Perceptron (MLP), the sizes of the input and output layers depend on the input features *n_x* and the target label *y*, but the number of hidden layers and hidden units can vary. An *MLP* can have one or more hidden layers, each with one or more hidden units. A single hidden unit works the same as a *[simple logistic unit](https://image.ibb.co/b1mkJS/Single_Neuron.png)*. The figure below illustrates the concept of an MLP:

![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/mlp/1.jpg?raw=true)

The above architecture is unfolded into a *Three layer Neural Network* in the figure below:

![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/mlp/2.jpg?raw=true)

Let's consider a single feature of a single training example that is fed forward through the network. By feed forward we mean that the data is passed through the network from left to right. The data is multiplied by a weight *$w_1$*, a bias *$b_1$* is added to give *$z_1$*, and then the non-linear function *tanh* is applied to *$z_1$*, resulting in *$a_1$*. This *$a_1$* is the input to the next layer, which multiplies it by *$w_2$*, adds *$b_2$* and applies its non-linear function to produce *$a_2$*. The same process continues up to the output layer, where the loss/cost of the network is computed.
The above process can be generalized by passing an input of shape **($n_x, m$)** and shaping the parameters **$w$** and **$b$** according to the number of units in the previous and next layers. The subscripts in each state above denote the layer number.
Now, let's pass some input data of shape **(5,2)** through a network with **two hidden layers**. We use **four hidden units** in the first layer and **three hidden units** in the second layer. The data flow and its shape consistency at each state are shown below:


![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/mlp/3.jpg?raw=true)
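
The shape flow in the figure can be checked numerically. This is a small sketch under stated assumptions: the helper `trace_shapes` is hypothetical, the weights are random, a single output unit is assumed, and *tanh* is applied at every layer purely to trace shapes.

```python
import numpy as np

def trace_shapes(x, layer_dims, seed=0):
    # Forward a batch through randomly initialised layers and record activation shapes
    rng = np.random.default_rng(seed)
    shapes = []
    prev_a = x
    for l in range(1, len(layer_dims)):
        w = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        b = np.zeros((layer_dims[l], 1))
        z = np.dot(w, prev_a) + b      # shape: (units in layer l, m)
        prev_a = np.tanh(z)
        shapes.append(prev_a.shape)
    return shapes

x = np.ones((5, 2))                     # 5 features, 2 examples
print(trace_shapes(x, [5, 4, 3, 1]))
```

For the (5,2) input with four and three hidden units, the recorded shapes should be (4, 2), (3, 2) and (1, 2): the first dimension follows the layer width and the second stays equal to the number of examples.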

Below, we generalize this concept and write it down as pseudo code, then implement it as actual code to get the cost of the network.

                    For Loop l (1 to L): # L is the total number of hidden layer
 $$w_l = \textrm{(next # of hidden units, previous # of input feature/ # of hidden units)}$$                                              
$$b_l = \textrm{(next # of hidden units, 1)}$$                                              

$$z_l = w_l * prev_a + b_l $$                                              

$$a_l = \textrm{Compute activations each term}(z_l)$$                                              

                  cost = logistic difference of  a_L , y

**initialize_paramaters**: This helper function creates the parameters w and b for each layer of the neural network; the initial values of w are random and those of b are zero. The sizes of w and b are determined as follows.

In the first hidden layer, each input feature of x is multiplied by its respective weight w and a bias b is added, which results in a single value z; z is passed to the non-linear function, giving a single unit (neuron). If we have n neurons, the weight parameter has shape (number of neurons, number of input features) and the bias b has shape (number of neurons, 1).

For the second layer of the network the idea is the same, but the shapes follow the layers involved: the parameter w has shape (number of output values, number of neurons) and the bias b has shape (number of output values, 1).

In [None]:
def initialize_paramaters(layer_dims, pt = False):
    # return weights and bias of required network shape
        # layer_dims: layer dim
        # pt: whether to print shapes
    parameters = {}
    L = len(layer_dims)
    for l in range(1,L):
        parameters['w'+str(l)] = np.random.randn(layer_dims[l],layer_dims[l-1]) * 0.01
        parameters['b'+str(l)] = np.zeros(shape=(layer_dims[l], 1))
        
        assert(parameters['w' + str(l)].shape == (layer_dims[l], layer_dims[l - 1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))
        
        if(pt == True):
            print('w'+str(l))
            print(layer_dims[l],layer_dims[l-1])
    return(parameters)
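
To see the shape rule on a concrete network, the sketch below lists the expected parameter shapes without allocating any arrays. `param_shapes` is a hypothetical helper, not part of the notebook.

```python
def param_shapes(layer_dims):
    # Map each parameter name to the shape prescribed by the rule:
    # w_l: (units in layer l, units in layer l-1), b_l: (units in layer l, 1)
    shapes = {}
    for l in range(1, len(layer_dims)):
        shapes['w' + str(l)] = (layer_dims[l], layer_dims[l - 1])
        shapes['b' + str(l)] = (layer_dims[l], 1)
    return shapes

for name, shape in param_shapes([5, 4, 3, 1]).items():
    print(name, shape)
```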

**Non-Linear Functions:** Create two non-linear functions, *sigmoid* and *relu*.

In [None]:
def sigmoid(z):
    # Takes input z and returns the sigmoid of the value
    s = 1 / (1 + np.exp(-z))
    return s

In [None]:
def relu(z):
    # Takes input as z and return relu of value    
    r = np.maximum(0, z)
    return r
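
For large negative `z`, `np.exp(-z)` can overflow and NumPy emits a runtime warning. One common workaround, sketched here as an assumption rather than as part of the notebook, evaluates the two sign cases separately; `stable_sigmoid` is a hypothetical name.

```python
import numpy as np

def stable_sigmoid(z):
    # Evaluate the sigmoid so that np.exp only ever receives non-positive arguments
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))   # z >= 0: exp(-z) <= 1
    exp_z = np.exp(z[~pos])                     # z < 0: exp(z) < 1
    out[~pos] = exp_z / (1.0 + exp_z)
    return out

print(stable_sigmoid(np.array([[-1000.0, 0.0, 1000.0]])))
```

Both branches are algebraically the same function as `sigmoid`; only the order of evaluation differs.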

**single_layer_forward_pass:** This helper function works like a *[simple logistic unit](https://image.ibb.co/b1mkJS/Single_Neuron.png)*: it computes the linear value z and then the non-linear value a. Two non-linear functions are available here: (i) **$sigmoid$**, (ii) **$relu$**

In [None]:
def single_layer_forward_pass(prev_a, w, b, act_fun):
    # return activations a and z
        # prev_a: last layer activations values
        # w, b : parameters weight and bias
        # act_fun: either sigmoid or relu
    z = np.dot(w,prev_a) + b
    a = np.zeros((z.shape))
    if(act_fun == 'sigmoid'):
        a = sigmoid(z)
    elif(act_fun == 'relu'):
        a = relu(z)
    return(a,z)

**multi_layer_forward_pass:** This function computes the activation at the output layer. It calls the functions above and steps through the layers up to the output layer by looping over each single layer.

In [None]:
def multi_layer_forward_pass(x, parameters, layer_dims, act_fun, pt = False):
    # return dictionary of activations of every layer (the last entry is the output layer)
        # x: input data
        # parameters: dictionary object of weights and bias
        # layer_dims: layer dim
        # act_fun: activation function either relu or sigmoid
        # pt: whether to print shapes
    L = len(layer_dims)
    prev_a = x
    activations = {}
    for l in range(1,L):
        if(l == L-1):
#             print('last layer')
            w = parameters['w'+str(l)]
            b = parameters['b'+str(l)]
            a, z = single_layer_forward_pass(prev_a, w, b, 'sigmoid')
            activations['a'+str(l)] = a
            
            assert((a.shape) == (w.shape[0], prev_a.shape[1]))
            
            if(pt == True):
                print('w'+str(l))
                print(a.shape)
        else:
            w = parameters['w'+str(l)]
            b = parameters['b'+str(l)]
            a, z = single_layer_forward_pass(prev_a, w, b, act_fun)
            prev_a = a
            if(pt == True):
                print('w'+str(l))
                print(a.shape)
            activations['a'+str(l)] = a
            
            assert((a.shape) == (w.shape[0], prev_a.shape[1]))
            
    return activations

**compute_cost:** This function calculates the loss of each training example in the data set by comparing the actual value with the activation produced by the network, and then averages all the losses to get a single cost value.

In [None]:
def compute_cost(y, activations, layer_dims, m):
    # calculate cost w.r.t. activation activations and actual value y
        # y: target value
        # activations: dictionary of activations
        # layer_dims: network layers dimensions
        # m: number of training examples in data set
    L = len(layer_dims)   
    a = activations['a'+ str(L-1)]
    loss = -1*(y* np.log(a) + (1-y) * np.log(1-a))
    cost = np.sum(loss)/m
    return(cost)
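
If an activation reaches exactly 0 or 1, `np.log` returns `-inf` and the cost becomes `inf` or `nan`. A common guard, shown here as a sketch and not as the notebook's method, clips the activations first; the function name and the value of `eps` are assumptions.

```python
import numpy as np

def compute_cost_clipped(y, a, eps=1e-12):
    # y, a: arrays of shape (1, m); clip a into [eps, 1 - eps] before taking logs
    m = y.shape[1]
    a = np.clip(a, eps, 1 - eps)
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
    return np.sum(loss) / m

y_demo = np.array([[1.0, 0.0]])
print(compute_cost_clipped(y_demo, np.array([[1.0, 0.0]])))   # saturated but finite
print(compute_cost_clipped(y_demo, np.array([[0.5, 0.5]])))
```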

### Backward Propagation:
In the above section, we implemented the forward pass of the neural network and computed its cost. Now we will implement the back-propagation algorithm to update the parameters *w's* and *b's* and so optimize the network.
The diagram below shows the typical flow of the backward computation. In forward propagation we moved from left to right; in this process we go from right to left to update the parameters.


![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/mlp/4.jpg?raw=true)

In the figure above, we considered only a three-layer architecture. Using this architecture we first derive the equations for the partial derivatives of the parameters *w's* and *b's*, then generalize the equations so they can be implemented for deeper networks. Below are the equations for the partial derivatives of the nodes above:

\begin{equation}
\partial{z_3} = a_3-y_3\\
\partial{b_3} = \partial{z_3} \tag{1}\\
\partial{w_3} = a_2 * \partial{z_3}\\
\partial{z_2} = (w_3 *\partial{z_3}) * \partial{a_2}\\
\textrm{ In case of tanh: }  \partial{a} = (1-a^2)\\
\textrm{ In case of sigmoid: }  \partial{a} = (1-a)*a\\
\begin{split}
\textrm{ In case of relu: }  \partial{a}&=  0 \textrm{ if } a<=0\\
&=1 \textrm{ if } a>0\\
\end{split}\\
\partial{b_2} = \partial{z_2}\\
\partial{w_2} = a_1 * \partial{z_2}\\
\partial{z_1} = (w_2 *\partial{z_2}) * \partial{a_1}\\
\partial{b_1} = \partial{z_1}\\
\partial{w_1} = x * \partial{z_1}
\end{equation}



From the above, we find a pattern: $\partial{b}$ is equal to $\partial{z}$ of the same layer, and $\partial{w}$ is the product of $\partial{z}$ of the same layer and the activations (or input values) of the previous layer. 

The exception is the last layer, where we skipped the intermediate steps and computed $\partial{z}$ directly, since we know there will only be a sigmoid function at the output layer.
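
The shortcut $\partial{z} = a - y$ for a sigmoid output with the logistic loss can be sanity-checked with finite differences. The sketch below is illustrative; the helper names `logistic_loss` and `numeric_dz` and the step size `eps` are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def logistic_loss(z, y):
    # Loss of a single example as a function of the pre-activation z
    a = sigmoid(z)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

def numeric_dz(z, y, eps=1e-6):
    # Central finite difference of the loss with respect to z
    return (logistic_loss(z + eps, y) - logistic_loss(z - eps, y)) / (2 * eps)

z, y = 0.3, 1.0
print(numeric_dz(z, y), sigmoid(z) - y)   # the two values should agree closely
```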

                                        For Loop(L to 1):
                        
                                            if(L = n-l): # last layer of network
$$\partial{z_L} = a_L - y$$

                                            else:
                    
$$\partial{z_L} = (w_{L+1} * \partial{z_{L+1}}) * da$$
$$\textrm{where: } da \textrm{ is partial derivative of tanh, relu, and sigmoid } $$
$$\partial{b_L} = \partial{z_{L}}$$
$$\partial{w_L} = a_{L-1}*\partial{z_{L}}$$



**derivatives of non-linear functions:** Having derived the partial derivatives of the non-linear functions, the two Python functions below, *sigmoid_deri and relu_deri*, implement those equations.

In [None]:
def sigmoid_deri(a):
    # return partial derivative of sigmoid
        #a: activation
    da = (1-a) * a
    return da

In [None]:
def relu_deri(a):
    # return partial derivative of relu: 1 where a > 0, otherwise 0
        #a: activation
    da = (a > 0).astype(float)
    return da
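
The equations above also list the *tanh* case, which the notebook does not implement. A sketch in the same style follows; `tanh_deri` is a hypothetical name, and it takes the activation `a = tanh(z)` as input, like the two functions above.

```python
import numpy as np

def tanh_deri(a):
    # return partial derivative of tanh, expressed through the activation a = tanh(z)
    da = 1 - a ** 2
    return da

# Compare with a central finite difference at one point
z, eps = 0.5, 1e-6
numeric = (np.tanh(z + eps) - np.tanh(z - eps)) / (2 * eps)
print(tanh_deri(np.tanh(z)), numeric)
```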

**derivaive_parameters:** Following the generic equations derived in the pseudo code above, *derivaive_parameters* implements the partial derivatives of the weights *w's* and biases *b's*. The equations are as follows:
$$\partial{w_L} = a_{L-1}*\partial{z_{L}}$$
$$\partial{b_L} = \partial{z_{L}}$$


In [None]:
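def derivaive_parameters(dz, a):
    # compute partial derivatives of w and b, averaged over the examples
        # dz: partial derivative of z
        # a: activations of the previous layer (or the input x)
    m = a.shape[1]  # number of examples, taken from the data instead of a global variable
    dw = (1 / m) * np.dot(dz, a.T)
    db = (1 / m) * np.sum(dz, axis=1, keepdims=True)
    return(dw, db)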

**derivaive_z:** This helper function computes the derivative of the linear function:
$$\partial{z_L} = (w_{L+1} * \partial{z_{L+1}}) * da$$

In [None]:
def derivaive_z(dz, w, da):
    # computes derivative of z
        # dz: next node dz
        # w: parameter weight of next node
        # da: partial derivative of activation functions.
    dz = np.dot(w.T, dz) * da
    return dz

**back_propagate:** This function implements the pseudo code algorithm given in the backpropagation section, beneath the derivative equations.

In [None]:
def back_propagate(x, y, m, parameters, activations, layer_dims, act_fun):
    # return gradients of parameters
        # x: input data
        # y: target label
        # m: number of training examples in data set
        # parameters: dictionary object of weights and bias
        # activations: dictionary object containing activation on different layers
        # layer_dims: layer dim
        # act_fun: activation function either relu or sigmoid

    L = len(layer_dims)
    grads = {}
    grads_line = {}

    for l in reversed(range(L-1)):
        l = l+1
        if(l == L-1):
#             print('a'+str(l))
    #         print(activations['a'+str(l+1)])
            grads_line['dz'+str(l)] = activations['a'+str(l)] - y
        
            grads['dw'+str(l)], grads['db'+str(l)] = derivaive_parameters(grads_line['dz'+str(l)], activations['a'+str(l-1)])
            
            assert(grads['dw'+str(l)].shape) == (parameters['w'+str(l)].shape)
            assert(grads['db'+str(l)].shape) == (parameters['b'+str(l)].shape)
        else:
#             print('a'+str(l))
            da = np.empty(activations['a'+str(l)].shape)

            if(act_fun == 'sigmoid'):
                da = sigmoid_deri(activations['a'+str(l)])            
            elif(act_fun == 'relu'):
                da = relu_deri(activations['a'+str(l)])

            grads_line['dz'+str(l)] = derivaive_z(grads_line['dz'+str(l+1)], parameters['w'+str(l+1)], da)
        
            if(l-1 != 0):
                grads['dw'+str(l)], grads['db'+str(l)] = derivaive_parameters(grads_line['dz'+str(l)], activations['a'+str(l-1)])
            else:
                 grads['dw'+str(l)], grads['db'+str(l)]  = derivaive_parameters(grads_line['dz'+str(l)], x)
                    
            assert(grads['dw'+str(l)].shape) == (parameters['w'+str(l)].shape)
            assert(grads['db'+str(l)].shape) == (parameters['b'+str(l)].shape)
                
    return grads

**update_parameters:** This function updates the parameters w's and b's according to the following equations:
$$w := w - \partial{w} * \textrm{learning rate}$$
$$b := b - \partial{b} * \textrm{learning rate}$$

In [None]:
def update_parameters(parameters, grads, layer_dims):
    # update and return parameters (the learning rate lr is read from the global scope)
        # parameters: dictionary object of w's and b's
        # grads: dictionary object of gradients of w's and b's
        # layer_dims: layer dim

    L = len(layer_dims)
    for l in range(1,L):
#         print(l)
        parameters['w'+str(l)] = parameters['w'+str(l)] -(lr * grads['dw'+str(l)])
        parameters['b'+str(l)] = parameters['b'+str(l)] -(lr * grads['db'+str(l)])
    return parameters    
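
`update_parameters` reads the learning rate from the global variable `lr`. A variant that takes it as an argument and leaves the input dictionary untouched could look like the sketch below; `update_parameters_lr` is a hypothetical name, and the sketch assumes every key such as 'w1' has a matching gradient key 'dw1'.

```python
import numpy as np

def update_parameters_lr(parameters, grads, lr):
    # Build a new dictionary: each parameter moves against its gradient by lr
    return {key: value - lr * grads['d' + key] for key, value in parameters.items()}

params = {'w1': np.array([[1.0]]), 'b1': np.array([[0.0]])}
grads = {'dw1': np.array([[0.5]]), 'db1': np.array([[-1.0]])}
new_params = update_parameters_lr(params, grads, lr=0.1)
print(new_params['w1'], new_params['b1'])
```

Passing `lr` explicitly makes it possible to try several learning rates in one session without rebinding a global.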

In [None]:
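def optimize_parameters(x, y, parameters, act_fun, layer_dims, m, num_iter):
    # run num_iter iterations of forward pass, cost, back-propagation and update
    # return the list of costs per iteration and the learned parameters
    lst_cost = []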
    
    for i in range(num_iter):
        activations = multi_layer_forward_pass(x, parameters, layer_dims, act_fun, pt = False)
        cost = (compute_cost(y, activations, layer_dims, m))
        grads = back_propagate(x, y, m, parameters, activations, layer_dims, act_fun)
        parameters = update_parameters(parameters, grads, layer_dims)
        lst_cost.append(cost)
#         lst_cost.append(format(cost, '.4f'))
    
    return (lst_cost, parameters)

In [None]:
def plt_res(lst, ylab, lr):
    #This will plot the list of values at y axis while x axis will contain number of iteration
    #lst: lst of action/cost
    #ylab: y-axis label
    #lr: learning rate
    plt.plot(lst)
    plt.ylabel(ylab)
    plt.xlabel('iterations')
    plt.title("Learning rate =" + str(lr))
    plt.show()

We have trained the network using 10000, 100000 and 1000000 samples and stored their respective learned parameters (weights and bias) and the cost values.

In [None]:
n_x = 50
n_y = 1
m = 10000
lr = 0.5
num_iter = 30

x = generate_bits(n_x, m)
y = generate_label(x, m)

layer_dims = [n_x,4,3,n_y]
act_fun = 'relu'

parameters = initialize_paramaters(layer_dims, pt = False)

lst_cost_s, parametes_s = optimize_parameters(x, y, parameters, act_fun, layer_dims, m, num_iter)

# plt_res(lst_cost, 'cost', lr)

m = 100000
x = generate_bits(n_x, m)
y = generate_label(x, m)

parameters = initialize_paramaters(layer_dims, pt = False)
lst_cost_m, parametes_m = optimize_parameters(x, y, parameters, act_fun, layer_dims, m, num_iter)


m = 1000000
x = generate_bits(n_x, m)
y = generate_label(x, m)

parameters = initialize_paramaters(layer_dims, pt = False)
lst_cost_l, parametes_l = optimize_parameters(x, y, parameters, act_fun, layer_dims, m, num_iter)


**Prediction** In the prediction step, we do the following two steps:

1. Calculate $\hat{Y} = A = \sigma(w^T * X + b)$
2. Convert the entries of a into 0 (if activation <= 0.5) or 1 (if activation > 0.5), and store the predictions in a vector Y_prediction

To validate how the MLP is performing, we create a new dataset of 0.1 times m size and compute its predictions with the get_prediction function. Labels are also generated for this data and matched against the predicted values to get the accuracy. Since we have trained weights on three different datasets, we compute the accuracy for all three.

In [None]:
def get_prediction(x, parameters, layer_dims, act_fun,  m):
    # returns the prediction on the dataset
        # x: input data (unseen)
        # parameters: dictionary object of weights and bias
        # layer_dims: layer dim
        # act_fun: activation function either relu or sigmoid
        # m: total sample set
    L = len(layer_dims)   
    
    activations = multi_layer_forward_pass(x, parameters, layer_dims, act_fun, pt = False)
    a = activations['a'+ str(L-1)]
    y_prediction = np.zeros((1, m))
    for i in range(a.shape[1]):
        y_prediction[0,i] = 1 if a[0, i] > 0.5 else 0
    return(y_prediction)
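
The thresholding loop can also be expressed as a single array comparison. This is a sketch with a hypothetical helper name, `threshold_predictions`, which assumes `a` holds the output-layer activations with shape (1, m).

```python
import numpy as np

def threshold_predictions(a, cutoff=0.5):
    # 1.0 where the activation is strictly above the cutoff, 0.0 otherwise
    return (a > cutoff).astype(float)

print(threshold_predictions(np.array([[0.2, 0.5, 0.9]])))
```

As in the loop version, an activation of exactly 0.5 maps to 0.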

In [None]:
def get_accuracy(y, y_prediction, m):
    # return the accuracy by calculating the match between actual and predicted labels
        # y: actual values
        # y_prediction: prediction acquired from the get_prediction
        # m: total number of sample
    df = pd.DataFrame()
    df['actual'] = y[0]
    df['prediction'] = y_prediction[0]
    df['compare']= df['prediction'] == df['actual']

#     print(df[df['compare']==True])
#     print('Accuracy: ' ,len(df[df['compare']==True]['compare'])/m)
    return(len(df[df['compare']==True]['compare'])/m)
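
The same accuracy can be computed with NumPy alone. The sketch below uses a hypothetical function name, `accuracy`, and assumes both arrays have the same shape.

```python
import numpy as np

def accuracy(y, y_prediction):
    # Fraction of positions where the label and the prediction agree
    return float(np.mean(y == y_prediction))

print(accuracy(np.array([[1, 0, 1, 1]]), np.array([[1, 1, 1, 0]])))
```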

In [None]:
tm = int(0.1 * m)
x = generate_bits(n_x, tm)
y = generate_label(x, tm)

y_prediction = get_prediction(x, parametes_s, layer_dims, act_fun,  tm)
acc_s = get_accuracy(y, y_prediction, tm)

y_prediction = get_prediction(x, parametes_m, layer_dims, act_fun,  tm)
acc_m = get_accuracy(y, y_prediction, tm)

y_prediction = get_prediction(x, parametes_l, layer_dims, act_fun,  tm)
acc_l = get_accuracy(y, y_prediction, tm)

In [None]:
print('------- 10000 training set-------------')
print('Accuracy at 10000 training set: ', acc_s)
plt_res(lst_cost_s, 'cost', lr)

print('-------100000 training set-------------')
print('Accuracy at 100000 training set: ', acc_m)
plt_res(lst_cost_m, 'cost', lr)

print('-------1000000 training set-------------')
print('Accuracy at 1000000 training set: ', acc_l)
plt_res(lst_cost_l, 'cost', lr)

## Discussion: ##
We have deployed a Multi Layer Perceptron with layer sizes $[n_x,4,3,n_y]$ and the $relu$ activation function. The parameters are optimized for 30 iterations. The same setting is used to learn weights on three different data-set sizes, i.e. *10000, 100000 and 1000000*. Surprisingly, the network gives almost the same accuracy, ~50%, on all three data sets. Since each label is the XOR of random input bits and is therefore 0 or 1 about equally often, ~50% is what random guessing would achieve.

The cost drops the most within roughly the first 10 iterations, and after that it converges only slowly towards a lower cost.
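
To put the ~50% figure in context, the sketch below estimates how often the XOR label equals 1 on random bits. It is an illustration only: the function name `positive_fraction` and the use of NumPy's `default_rng` generator are assumptions, not the notebook's data pipeline. A fraction near 0.5 means a constant guess already scores about 50%.

```python
import numpy as np

def positive_fraction(n_x, m, seed=1):
    # Draw an (n_x, m) array of random bits and XOR-reduce each column into a label
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, size=(n_x, m))
    labels = np.bitwise_xor.reduce(bits, axis=0)
    # Share of examples whose label is 1
    return float(labels.mean())

print(positive_fraction(n_x=50, m=10000))
```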

---