# Two-Layer Neural Network

In the previous kernel, [Regression for XOR](https://www.kaggle.com/hamzafar/regression-for-xor), we learned to build a regression model for the XOR function. To extend this concept, we will treat the whole regression operation as a single unit. By whole operation we mean multiplying the inputs by weights, adding a bias, and applying a non-linear function (sigmoid) to get the activation value. This single unit, which we call a *neuron*, is shown below:

![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/Two%20Layers_Neural_Network/1.png?raw=true)
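As a rough sketch of what a single neuron computes, the snippet below multiplies the inputs by weights, adds a bias, and applies the sigmoid. The names `neuron`, `x`, `w` and `b` and the numeric values are illustrative assumptions, not taken from the previous kernel.

```python
import numpy as np

def sigmoid(z):
    # squash any real value into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # weighted sum of the inputs plus bias, followed by the non-linearity
    z = np.dot(w, x) + b
    return sigmoid(z)

# illustrative values only
x = np.array([1.0, 0.0])    # two input features
w = np.array([0.5, -0.5])   # one weight per feature
b = 0.0                     # bias

a = neuron(x, w, b)         # z = 0.5, so a = sigmoid(0.5)
print(a)
```

With all-zero weights and bias the neuron outputs exactly 0.5, which is why untrained networks with tiny weights start near that value.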

These neurons can be linked together to form a network, commonly known as an *Artificial Neural Network*. We call it artificial because the architecture is inspired by the *human brain*. The architecture mainly consists of an *input layer, hidden layer(s), and an output layer*. The *input layer and output layer* are connected to the external world: the input layer receives the input data and the output layer is compared against the target values. In our case the input is *binary data of length n_x* and the output is the *label generated by applying XOR to the same data.*
The term *hidden layer(s)* is a bit more involved, as a network may contain one or more hidden layers, and each layer can contain one or more *neurons*. The figure below shows a 3-layer architecture containing 2 hidden layers and 1 output layer. There are 4 hidden units (neurons) in each of the hidden layers. This architecture takes 3 input features.

![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/Two%20Layers_Neural_Network/2.jpeg?raw=true)
In this notebook we will do the following:
1. Generate a data set of the desired size (rows, columns) with a function.
1. Create the labels using XOR: the data generated in the previous step is passed to this step to compute the XOR of each example.
1. Implement a two-layer neural network model (feed forward and back propagation).
1. Discuss how the parameter dimensions are preserved.

In [18]:
# This Python 3 environment comes with many helpful analytics libraries installed

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import matplotlib.pyplot as plt # plotting graphs

### Generate Data
To study the generalized behavior of the model (by generalized we mean that it can work on data of any shape), we have created the `generate_bits` function. The function is simple: it takes the desired number of features `n_x` and the number of examples `m`, and randomly generates binary data, i.e. 0s and 1s. Note that the returned array has shape `(n_x, m)`: each column is one example and each row is one feature.

In [19]:
def generate_bits(n_x, m):
    # Generate an n_x by m array of random ints in {0, 1}
    # n_x: number of features (rows of the returned array)
    # m: number of training examples (columns of the returned array)
    np.random.seed(1)  # fixed seed so the data is reproducible
    data = np.random.randint(2, size=(n_x, m))
    return(data)
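A quick usage sketch may make the layout clearer. It repeats the function so the snippet runs on its own; the sizes `n_x=3` and `m=5` are arbitrary illustrative choices.

```python
import numpy as np

def generate_bits(n_x, m):
    # same logic as the notebook function: n_x features by m examples
    np.random.seed(1)
    return np.random.randint(2, size=(n_x, m))

data = generate_bits(n_x=3, m=5)

print(data.shape)   # (features, examples)
print(data[:, 0])   # first example: one bit per feature
```

Because the seed is fixed inside the function, calling it again with the same arguments returns the same array.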


### Create Labels:
To update the weight and bias parameters during training, the loss function measures the difference between the actual values and the activation values. The actual value is the label of each example (each column of the data). For example, the actual value of an OR operation (writing OR as $+$) is:
\begin{equation}
1 + 0 = 1 = \text{actual value} \qquad [\text{OR operation} = +]
\end{equation}
The `generate_label` function below takes the data as input and applies the XOR operation to each example, i.e. column-wise, since every column of `data` is one example.

In [20]:
def generate_label(data, m):
    # generate labels by applying the XOR operation to each example (column)
    # return the labels as a 1 by m array
        # data: binary data set of shape n_x by m (requires n_x >= 2)
        # m: number of training examples
    y = np.empty((m, 1))
    k = 0
    for tmp in data.T:
        # XOR the first two bits, then fold in the remaining bits one by one
        xor = np.logical_xor(tmp[0], tmp[1])

        for i in range(2, tmp.shape[0]):
            xor = np.logical_xor(xor, tmp[i])
        y[k, :] = int(xor)
        k += 1
    return(y.T)
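Since the XOR of a sequence of bits equals the parity of their sum, the same labels can be computed without Python loops. The sketch below is an alternative, not the notebook's method; `xor_labels` is a hypothetical helper name and the small `data` array is made up for illustration.

```python
import numpy as np

def xor_labels(data):
    # XOR over all bits of an example is 1 exactly when the number of 1s is odd
    return (data.sum(axis=0) % 2).reshape(1, -1).astype(float)

# 3 features by 4 examples, one example per column
data = np.array([[0, 0, 1, 1],
                 [0, 1, 0, 1],
                 [1, 1, 1, 0]])

labels = xor_labels(data)

# cross-check against NumPy's XOR reduction over the feature axis
reference = np.logical_xor.reduce(data.astype(bool), axis=0)
reference = reference.astype(float).reshape(1, -1)

print(labels)
print(np.array_equal(labels, reference))
```

For large `m` this avoids iterating over every column in Python, which is where the loop version spends most of its time.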

## Neural Network Architecture:
We will follow the *keep it simple* (KISS) rule and start with a 2-layer neural network, i.e. *1 hidden layer* and *1 output layer*. In the hidden layer we use *tanh* as the non-linear function, and in the output layer we use the *sigmoid* function. The figure below illustrates the concept:

![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/Two%20Layers_Neural_Network/3.jpg?raw=true)

In the above figure, the parameter subscripts denote the layer number, i.e. $w_1$, $b_1$ are associated with *layer-1* while $w_2$, $b_2$ belong to *layer-2*. The activations $a_1$, $a_2$ belong to *layer-1 and layer-2* and are the outputs of the *tanh and sigmoid* functions respectively. We can extend this concept to multiple hidden units in a single layer by stacking up several units, each with its own non-linearity.

The following are the functions that will implement neural network architecture and update the parameters.

**1. layer_sizes:**
This function returns the number of input units, hidden units and output units. The number of input units is the same as the number of features `n_x` of the input data, the number of hidden units is chosen by the user, and the number of output units equals the number of target values per example. In our case there is a single output unit, which is enough for binary classification: it gives *1* when *true* and *0* otherwise.

In [21]:
def layer_sizes(x, y, hidden):
    # create neural network layers and return input, output and hidden units size
        # x : input data
        # y : output labels
        # hidden : number of hidden units in hidden layer
    n_x = x.shape[0]
    n_h = hidden
    n_y = y.shape[0]
    
    return(n_x, n_h, n_y)

**2. initialize_param:**
This helper function creates the parameters *w's* and *b's* for each layer of the network; the initial values of the *w's* are small random numbers and the *b's* are zero. The shapes of the *w's* and *b's* are as follows:
In the first (hidden) layer, each input feature of *x* is multiplied by its respective weight *w* and a bias *b* is added; the result is a single value *z* that is passed through the non-linear function, giving the output of a single unit (neuron). If we have *n* neurons, the weight parameter has shape *(number of neurons, number of input features)* and the bias *b* has shape *(number of neurons, 1)*.
In the second layer the idea is the same, but the shapes follow the layers on either side. So the parameter *w* has shape *(number of output values, number of neurons)* and the bias *b* has shape *(number of output values, 1)*.

In [22]:
def initialize_param(n_x, n_h, n_y):
    # initialize w to small random values and b to zero, and return the parameters
        # n_x : number of input features
        # n_h : number of neurons in the hidden layer
        # n_y : number of outputs
    np.random.seed(15)
    
    w1 = np.random.randn(n_h, n_x) * 0.01  # standard normal, scaled down
    b1 = np.zeros(shape=(n_h, 1))
    w2 = np.random.rand(n_y, n_h) * 0.01   # note: uniform in [0, 0.01), unlike w1
    b2 = np.zeros(shape=(n_y, 1))
    
    parameters = {'w1' : w1,
             'w2' : w2,
             'b1' : b1,
             'b2' : b2
            }
    return(parameters)
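To make the shape rules concrete, the sketch below lists the parameter shapes and counts the trainable values for the configuration used later in this notebook (50 input features, 10 hidden neurons, 1 output). The helper names `parameter_shapes` and `count_parameters` are hypothetical and only serve the illustration.

```python
import numpy as np

def parameter_shapes(n_x, n_h, n_y):
    # shapes follow (units in this layer, units in the previous layer)
    return {'w1': (n_h, n_x), 'b1': (n_h, 1),
            'w2': (n_y, n_h), 'b2': (n_y, 1)}

def count_parameters(shapes):
    # total number of scalar values across all weights and biases
    return sum(int(np.prod(shape)) for shape in shapes.values())

shapes = parameter_shapes(n_x=50, n_h=10, n_y=1)
total = count_parameters(shapes)   # 500 + 10 + 10 + 1

for name, shape in shapes.items():
    print(name, shape)
print('total parameters:', total)
```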

In [23]:
def sigmoid(z):
    # Takes z as input and returns the sigmoid of its value
    s = 1 / (1 + np.exp(-z))
    return s

**3. forward_propagte:**
This helper function computes the activation values of the neurons in each layer. Since we have only two layers, the first layer computes its activations with the *tanh* function and the second layer with the *sigmoid* function. $a_1, a_2$ refer to the activations of layers one and two respectively.

In [24]:
def forward_propagte(x,y, parameters):
    # compute activations and return the z's and a's in a cache
        # x: input data
        # y: target value (not used in the forward pass)
        # parameters: dictionary object of w's and b's
    z1 = np.dot(parameters['w1'], x) + parameters['b1']
    a1 = np.tanh(z1)
    z2 = np.dot(parameters['w2'], a1) + parameters['b2']
    a2 = sigmoid(z2)
    cache = {"z1": z1,
         "a1": a1,
         "z2": z2,
         "a2": a2}
    return(cache)

If we take input data *x* with $n_x = 2$ and 10 training examples, a first layer with *4* hidden neurons, and an output layer with a single neuron, then the **Feed Forward** operation looks like the following figure:
![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/Two%20Layers_Neural_Network/4.jpg?raw=true)
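The same configuration can be traced in code to see the shape of every intermediate value. This is a standalone sketch that mirrors the forward pass described above; the seed and the use of `np.random.default_rng` are assumptions made for the illustration.

```python
import numpy as np

n_x, n_h, n_y, m = 2, 4, 1, 10    # sizes from the example above
rng = np.random.default_rng(0)    # arbitrary seed, illustration only

x = rng.integers(0, 2, size=(n_x, m))          # (2, 10) binary inputs
w1 = rng.standard_normal((n_h, n_x)) * 0.01    # (4, 2)
b1 = np.zeros((n_h, 1))                        # (4, 1)
w2 = rng.standard_normal((n_y, n_h)) * 0.01    # (1, 4)
b2 = np.zeros((n_y, 1))                        # (1, 1)

z1 = w1 @ x + b1                  # (4, 2) @ (2, 10) -> (4, 10), b1 broadcasts over columns
a1 = np.tanh(z1)                  # (4, 10)
z2 = w2 @ a1 + b2                 # (1, 4) @ (4, 10) -> (1, 10)
a2 = 1.0 / (1.0 + np.exp(-z2))    # (1, 10), one probability per example

for name, arr in [('x', x), ('z1', z1), ('a1', a1), ('z2', z2), ('a2', a2)]:
    print(name, arr.shape)
```

Every column keeps its identity through the network: column `j` of `a2` is the prediction for column `j` of `x`.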


**4. compute_cost:**
The `compute_cost` function calculates the loss of each training example in the data set by comparing the actual label with the activation produced by the network (using the cross-entropy loss), and then averages all the losses to get a single cost value.

In [25]:
def compute_cost(y, parameters, cache, m):
    # calculate cost w.r.t. activation a2 and actual value y
        # y: target value
        # cache: dictionary of activations and z's
        # m: number of training examples in data set
    a = cache['a2']
    loss = -1*(y* np.log(a) + (1-y) * np.log(1-a))
    cost = np.sum(loss)/m
    return(cost)
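A small worked example may help with reading the cost values. The sketch below uses the same cross-entropy formula, with one addition that is not in the notebook: the activations are clipped by an assumed `eps` so that `np.log` never receives exactly 0. The function name `cross_entropy_cost` and the sample arrays are illustrative.

```python
import numpy as np

def cross_entropy_cost(y, a, eps=1e-12):
    # clip activations away from 0 and 1 to keep the logarithms finite
    a = np.clip(a, eps, 1 - eps)
    loss = -(y * np.log(a) + (1 - y) * np.log(1 - a))
    return float(np.sum(loss) / y.shape[1])

y = np.array([[1.0, 0.0, 1.0, 0.0]])

# a network that always answers 0.5 has cost ln(2), about 0.693
a_half = np.full_like(y, 0.5)
cost_half = cross_entropy_cost(y, a_half)

# confident and correct predictions give a much lower cost
a_good = np.array([[0.9, 0.1, 0.9, 0.1]])
cost_good = cross_entropy_cost(y, a_good)

print(cost_half, cost_good)
```

A cost that stays close to `ln(2)` during training is therefore a sign that the network is still predicting roughly 0.5 for every example.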

### Derivative w.r.t cost:

Considering the [KISS](https://image.ibb.co/dGWSyS/Capture.jpg) graph above, we derive the partial derivatives of the loss with respect to the parameters *w's* and *b's* for optimization. The prefix $d$ at the start of each variable name denotes the partial derivative of the loss with respect to that variable. One important thing to note here is that $\frac{\partial{a_1}}{\partial{z_1}}$ evaluates to $1 - a_1^2$. A detailed explanation of the derivative of $\tanh$ can be found in this [blog post](http://ronny.rest/blog/post_2017_08_16_tanh/).

\begin{equation}
\begin{split}
dz_2 & = a_2 - y\\
db_2 & = dz_2 \quad ... (1)\\
dw_2 & = a_1 * dz_2 \quad ... (2)\\
da_1 & = w_2 * dz_2\\
dz_1 & = \frac{\partial loss}{\partial z_1}\\
& = \frac{\partial loss}{\partial a_1} * \frac{\partial a_1}{\partial z_1}\\
\text{and: } \frac{\partial a_1}{\partial z_1} & = \tanh'(z_1)\\
& = 1 - {a_1}^2\\
\text{so: } dz_1 & = (w_2 * dz_2) * (1 - {a_1}^2)\\
db_1 & = dz_1 \quad ... (3)\\
dw_1 & = x_1 * dz_1 \quad ... (4)
\end{split}
\end{equation}
Although the graph contains only a single variable for each parameter at each layer, we can replace them with matrices to handle larger structures.
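The identity $\frac{\partial a_1}{\partial z_1} = 1 - a_1^2$ used in the derivation can be checked numerically. The sketch below compares it against a central finite difference; the sample points and the step size `h` are arbitrary choices for the illustration.

```python
import numpy as np

z = np.linspace(-2.0, 2.0, 9)   # a few sample pre-activations
a = np.tanh(z)

analytic = 1 - a ** 2           # derivative expressed through the activation

h = 1e-6                        # small step for the finite difference
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2 * h)

max_err = np.max(np.abs(analytic - numeric))
print(max_err)                  # expected to be tiny
```

Expressing the derivative through `a` rather than `z` is convenient in back propagation, because `a1` is already stored in the cache from the forward pass.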

**5. back_propagate:**
From the derivations above, we use equations $1, 2, 3$ and $4$ to compute the partial derivatives of the weights *w's* and biases *b's* in the `back_propagate` function. As in `compute_cost`, we also average the derivatives by dividing by *m*.

In [26]:
def back_propagate(x, y, parameters, cache, m):
    # compute and return derivatives of parameters 
        # x: input data
        # y: target value
        # parameters: dictionary object of w's and b's
        # cache: dictionary of activations and z's
        # m: number of training examples in data set
        
    a2 =cache['a2']
    a1 =cache['a1']
    w2 = parameters['w2']

    dz2 = a2 - y

    dw2 = (1 / m) * np.dot(dz2, a1.T)
    db2 = (1 / m) * np.sum(dz2, axis=1, keepdims=True)

    # derivative of tanh expressed through the activation: 1 - a1^2
    dtan = 1 - np.power(a1, 2)
    dz1 = np.dot(w2.T, dz2) * dtan

    dw1 =(1 / m) * np.dot(dz1, x.T)
    db1 = (1 / m) * np.sum(dz1, axis=1, keepdims=True)
    
    grads = {'dw1': dw1,
             'dw2': dw2,
             'db1': db1,
             'db2': db2,
        }
    return(grads)

During **Back Propagation**, the shapes of the vectors/matrices of the parameters and intermediate values must stay consistent: the gradient at any graph node has the same shape as the value at that node in the forward pass, so the math operations do not change shapes.
In the figure below, we attach to each graph node its formula and the shape of the resulting matrix, to show how the shapes in the *forward pass* and *backward pass* match. These shapes remain consistent over multiple iterations of the *forward and backward pass*:

![](https://github.com/hamzafar/deep_learning_toys/blob/master/images/Two%20Layers_Neural_Network/5.png?raw=true)
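Both properties, matching shapes and correct gradient values, can be verified on a tiny network. The sketch below is a standalone re-implementation under assumed sizes and seed, not the notebook's functions; it compares the analytic gradient of `w1` with a central finite difference of the cost.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, p):
    a1 = np.tanh(p['w1'] @ x + p['b1'])
    a2 = sigmoid(p['w2'] @ a1 + p['b2'])
    return a1, a2

def cost(x, y, p):
    _, a2 = forward(x, p)
    loss = -(y * np.log(a2) + (1 - y) * np.log(1 - a2))
    return float(np.sum(loss) / x.shape[1])

def backward(x, y, p):
    # same formulas as equations (1) to (4), averaged over m examples
    m = x.shape[1]
    a1, a2 = forward(x, p)
    dz2 = a2 - y
    dz1 = (p['w2'].T @ dz2) * (1 - a1 ** 2)
    return {'w1': dz1 @ x.T / m,
            'b1': dz1.sum(axis=1, keepdims=True) / m,
            'w2': dz2 @ a1.T / m,
            'b2': dz2.sum(axis=1, keepdims=True) / m}

rng = np.random.default_rng(0)      # arbitrary seed
n_x, n_h, n_y, m = 3, 4, 1, 6       # small sizes keep the check fast
x = rng.integers(0, 2, size=(n_x, m)).astype(float)
y = rng.integers(0, 2, size=(n_y, m)).astype(float)
p = {'w1': rng.standard_normal((n_h, n_x)), 'b1': np.zeros((n_h, 1)),
     'w2': rng.standard_normal((n_y, n_h)), 'b2': np.zeros((n_y, 1))}

grads = backward(x, y, p)

# every gradient has the shape of the parameter it belongs to
shapes_match = all(grads[k].shape == p[k].shape for k in p)

# finite-difference gradient for w1, one entry at a time
h = 1e-6
numeric = np.zeros_like(p['w1'])
for idx in np.ndindex(*p['w1'].shape):
    old = p['w1'][idx]
    p['w1'][idx] = old + h
    c_plus = cost(x, y, p)
    p['w1'][idx] = old - h
    c_minus = cost(x, y, p)
    p['w1'][idx] = old
    numeric[idx] = (c_plus - c_minus) / (2 * h)

max_diff = np.max(np.abs(numeric - grads['w1']))
print(shapes_match, max_diff)
```

A check like this is slow, so it is usually run once on a small network while developing and then removed from the training loop.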

**6. update_parameters:**
This function updates the parameters *w's* and *b's* according to the following equations:
\begin{equation}
w := w - \partial{w} * learningRate\\
b := b - \partial{b} * learningRate
\end{equation}

In [27]:
def update_parameters(parameters, grads, lr):
    # update and return parameters 
        # parameters: dictionary object of w's and b's
        # grads: dictionary object of gradients of w's and b's
        # lr: learning rate
                
    parameters['w1'] = parameters['w1'] -(lr * grads['dw1'])
    parameters['w2'] = parameters['w2'] -(lr * grads['dw2'])

    parameters['b1'] = parameters['b1'] -(lr * grads['db1'])
    parameters['b2'] = parameters['b2'] -(lr * grads['db2'])

    return(parameters)
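To see the update rule at work in isolation, the sketch below applies it to a one-variable toy function $f(w) = (w - 3)^2$, whose gradient is $2(w - 3)$. The toy function and starting point are assumptions chosen for illustration; the learning rate 0.07 and the 1000 iterations match the values used later in this notebook.

```python
def gradient(w):
    # derivative of the toy function f(w) = (w - 3)^2
    return 2.0 * (w - 3.0)

lr = 0.07          # learning rate
w = 0.0            # arbitrary starting point

# a single update: w := w - lr * dw
first_step = w - lr * gradient(w)   # 0 - 0.07 * (-6) = 0.42

# repeated updates move w towards the minimum at w = 3
for _ in range(1000):
    w = w - lr * gradient(w)

print(first_step, w)
```

Each step shrinks the distance to the minimum by a factor of `1 - 2 * lr`, so a larger learning rate converges faster here, up to the point where the factor's magnitude exceeds 1 and the updates diverge.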

In [28]:
def optimize_parameters(x, y, parameters, m, num_iter):
    # This function iterates the desired number of times and updates the parameters;
    # it returns the parameters and the list of cost values.
        # x: input data
        # y: target value
        # parameters: dictionary object of w's and b's
        # m: number of training examples in data set
        # num_iter: number of iterations to update parameters and compute cost
        # note: the learning rate lr is read from the global scope and must be
        # defined before this function is called
    lst_cost = []
    
    for i in range(num_iter):
        cache = forward_propagte(x,y, parameters)
    #     print('cost of ite:', compute_cost(y, parameters, cache, m))
        grads = back_propagate(x, y, parameters, cache, m)
        parameters = update_parameters(parameters, grads, lr)
        lst_cost.append(compute_cost(y, parameters, cache, m))
    return(parameters, lst_cost)

In [None]:
def plt_res(lst, ylab, lr):
    # Plot the list of values on the y-axis against the iteration number on the x-axis
    # lst: list of cost values
    # ylab: y-axis label
    # lr: learning rate (shown in the title)
    plt.plot(lst)
    plt.ylabel(ylab)
    plt.xlabel('iterations')
    plt.title("Learning rate =" + str(lr))
    plt.show()

Now we generate two data sets, with *10000* and *100000* training examples, each with *50* features. The parameters *w's* and *b's* are updated *1000* times. The neural network contains *10 hidden neurons* in the first layer. The learning rate is set to 0.07. At the end, the *cost* is plotted on the *y-axis* against the *number of iterations* on the x-axis to see how the cost evolves during training.

In [None]:
n_x = 50
m = 10000
lr = 0.07
num_iter = 1000

x = generate_bits(n_x, m)
y = generate_label(x, m)

n_x, n_h, n_y = layer_sizes(x, y, 10)
parameters = initialize_param(n_x, n_h, n_y)

parameters, lst_cost_s = optimize_parameters(x, y, parameters, m, num_iter)

################################################################################
m = 100000

x = generate_bits(n_x, m)
y = generate_label(x, m)

n_x, n_h, n_y = layer_sizes(x, y, 10)
parameters = initialize_param(n_x, n_h, n_y)

parameters, lst_cost_m = optimize_parameters(x, y, parameters, m, num_iter)


In [None]:
print('------- 10000 training sample-------------')
plt_res(lst_cost_s, 'cost', lr)
print('------- 100000 training sample-------------')
plt_res(lst_cost_m, 'cost', lr)
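Besides the cost curve, one could also look at classification accuracy by thresholding the output activation at 0.5. This is not part of the original notebook: `predict` and `accuracy` are hypothetical helpers, and the toy parameters below are hand-picked so that a one-input network copies its input bit, which only demonstrates the mechanics.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, parameters, threshold=0.5):
    # forward pass followed by thresholding the output probability
    a1 = np.tanh(np.dot(parameters['w1'], x) + parameters['b1'])
    a2 = sigmoid(np.dot(parameters['w2'], a1) + parameters['b2'])
    return (a2 > threshold).astype(int)

def accuracy(y_hat, y):
    # fraction of examples whose predicted label equals the true label
    return float(np.mean(y_hat == y))

# hand-picked toy parameters (assumption): 1 input, 1 hidden unit, 1 output
toy = {'w1': np.array([[5.0]]), 'b1': np.array([[-2.5]]),
       'w2': np.array([[5.0]]), 'b2': np.array([[0.0]])}

x_toy = np.array([[0, 1, 1, 0]])
y_toy = np.array([[0, 1, 1, 0]])

y_hat = predict(x_toy, toy)
acc = accuracy(y_hat, y_toy)
print(y_hat, acc)
```

With the trained `parameters` from this notebook, the same two helpers could be applied to `x` and `y`; an accuracy near 0.5 on balanced binary labels would correspond to guessing.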

----