# Understanding and coding Neural Networks From Scratch in Python and R

URL: https://www.analyticsvidhya.com/blog/2017/05/neural-network-from-scratch-in-python-and-r/

SUNIL RAY, MAY 29, 2017

![Example Neural Network](https://s3-ap-south-1.amazonaws.com/av-blog-media/wp-content/uploads/2017/05/26094834/Screen-Shot-2017-05-26-at-9.47.51-AM.png)

+ $X$ = inputs matrix
+ $Y$ = output matrix
+ $E = (Y - t)^2 / 2$ with $t$ = actual result
+ $w_h$ = weight matrix to the hidden layer
+ $b_h$ = bias matrix to the hidden layer
+ $w_o$ = weight matrix to the output layer
+ $b_o$ = bias matrix to the output layer
+ $f(x)$ = activation function with $f(x) = sigmoid(x) = \frac{1}{1 + e^{-x}}$ and its derivative $f'(x) = f(x) \cdot (1 - f(x))$
+ $\eta$ = learning rate, a configuration parameter
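
As a quick sanity check on the activation function, the sigmoid's derivative $f'(x) = f(x)(1 - f(x))$ can be compared against a central finite difference. This is only a sketch: the helper name `sigmoid_prime`, the sample points and the step size `h` are assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(x):
    # Logistic function 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # Analytic derivative expressed through the sigmoid itself
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.linspace(-4.0, 4.0, 9)   # hypothetical sample points
h = 1e-5                         # finite-difference step (assumption)

# Central difference approximation of the derivative
numeric = (sigmoid(xs + h) - sigmoid(xs - h)) / (2.0 * h)
max_gap = np.max(np.abs(numeric - sigmoid_prime(xs)))
print("largest gap between numeric and analytic derivative:", max_gap)
```

The gap should be tiny compared with the derivative values themselves, which peak at 0.25 for $x = 0$.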

Step 0: Read input and output

In [1]:
import numpy as np

#Input array
X=np.array([[1,0,1,0],[1,0,1,1],[0,1,0,1]])

#Output
y=np.array([[1],[1],[0]])

#Sigmoid Function
def sigmoid (x):
    return 1/(1 + np.exp(-x))

#Derivative of Sigmoid Function
#Note: x is expected to be a sigmoid output f(z), not the raw input z
def derivatives_sigmoid(x):
    return x * (1 - x)

Step 1: Initialize weights and biases with random values (there are dedicated initialization methods, but plain random values are enough for now)

In [2]:
#Variable initialization
epoch=5000                              # Setting training iterations
lr=0.1                                  # Setting learning rate
inputlayer_neurons = X.shape[1]         # number of features in data set
hiddenlayer_neurons = 3                 # number of hidden layer neurons
output_neurons = 1                      # number of neurons at output layer

#weight and bias initialization
wh=np.random.uniform(size=(inputlayer_neurons,hiddenlayer_neurons))
bh=np.random.uniform(size=(1,hiddenlayer_neurons))
wout=np.random.uniform(size=(hiddenlayer_neurons,output_neurons))
bout=np.random.uniform(size=(1,output_neurons))

Step 2: Calculate hidden layer input:
$$H_i = X \cdot w_h + b_h$$
where $H_i$ = hidden layer input

Original format: hidden_layer_input= matrix_dot_product(X,wh) + bh
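
To make the shapes concrete, here is a small sketch of Step 2 with NumPy. The weight and bias values are made up for illustration (fixed rather than random); the point is that `bh`, with shape `(1, 3)`, is broadcast across the three rows of `np.dot(X, wh)`.

```python
import numpy as np

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]])  # 3 samples, 4 features

# Hypothetical fixed parameters (the article draws them at random)
wh = np.full((4, 3), 0.5)             # input -> hidden weights
bh = np.array([[0.1, 0.2, 0.3]])      # one bias per hidden neuron

hidden_layer_input = np.dot(X, wh) + bh   # bh is added to every row

print(hidden_layer_input.shape)
print(hidden_layer_input)
```

Each row of the result belongs to one training sample and each column to one hidden neuron.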

Step 3: Perform non-linear transformation on hidden linear input
$$ H_o = f(H_i)$$
where $H_o$ = output of hidden layer

original format: hiddenlayer_activations = sigmoid(hidden_layer_input)

Step 4: Perform linear and non-linear transformation of hidden layer activation at output layer
\begin{array}{rcl}
    O_i & = & H_o \cdot w_o + b_o \\
    O_o & = & f(O_i)
\end{array}
where $O_i$ = input of the output layer, $O_o$ = output of the output layer

original format: output_layer_input = matrix_dot_product(hiddenlayer_activations, wout) + bout

original format: output = sigmoid(output_layer_input)
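
Steps 2 to 4 together form the forward pass. A compact sketch, assuming the row-per-sample layout used in the code cells (the function name `forward` and the seeded generator are illustrative choices, not part of the original):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(X, wh, bh, wout, bout):
    # Steps 2 and 3: linear map to the hidden layer, then sigmoid
    hiddenlayer_activations = sigmoid(np.dot(X, wh) + bh)
    # Step 4: linear map to the output layer, then sigmoid
    output = sigmoid(np.dot(hiddenlayer_activations, wout) + bout)
    return hiddenlayer_activations, output

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]])

rng = np.random.default_rng(0)          # seed chosen arbitrarily
wh = rng.uniform(size=(4, 3))
bh = rng.uniform(size=(1, 3))
wout = rng.uniform(size=(3, 1))
bout = rng.uniform(size=(1, 1))

hidden, output = forward(X, wh, bh, wout, bout)
print(hidden.shape, output.shape)
```

Because the last operation is a sigmoid, every entry of `output` lies strictly between 0 and 1.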

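Step 5: Calculate gradient of Error(E) at output layer
$$ \frac{(O_o - y)^2}{2} \quad \Rightarrow \quad E = y - O_o$$
where $y$ = actual output.
The left-hand side is the (halved) squared error that training minimizes. The code keeps only its negative gradient with respect to $O_o$, the residual $y - O_o$, in the variable E, which is why the later updates add to the weights instead of subtracting.

original format: E = y-output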

Step 6: Compute slope at output and hidden layer

$$ \sigma_o = f^{\prime}(O_i) = O_o \cdot (1 - O_o)$$
$$ \sigma_h = f^{\prime}(H_i) = H_o \cdot (1 - H_o)$$
where $\sigma_o$ = gradient of the output layer activation function (derivative) and $\sigma_h$ = gradient of the hidden layer activation function


original format: Slope_output_layer= derivatives_sigmoid(output)

original format: Slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)

Step 7: Compute delta at output layer
$$ \delta_o = E * \sigma_o$$
where $\delta_o$ = change of the output layer contributing to the error. The learning rate $\eta$ is applied once, in the update of Steps 10 and 11.

original format: d_output = E $*$ slope_output_layer

Step 8: Calculate Error at hidden layer
$$e_h = \delta_o \cdot w_o^{T}$$
where $e_h$ = error at the hidden layer

original format: Error_at_hidden_layer = matrix_dot_product(d_output, wout.Transpose)

Step 9: Compute delta at hidden layer
$$\delta_h = e_h * \sigma_h$$
where $\delta_h$ = change of the hidden layer contributing to the error

original format: d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer
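
Steps 5 to 9 can be traced on a single pass to confirm that each intermediate has the expected shape. This sketch assumes seeded random parameters (the seed is arbitrary) and follows the variable names of the pseudo-code; it matches the training cell, which applies the learning rate only in the update step.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def derivatives_sigmoid(a):
    # a is a sigmoid output, so the derivative is a * (1 - a)
    return a * (1 - a)

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[1], [1], [0]])

rng = np.random.default_rng(1)          # arbitrary seed (assumption)
wh, bh = rng.uniform(size=(4, 3)), rng.uniform(size=(1, 3))
wout, bout = rng.uniform(size=(3, 1)), rng.uniform(size=(1, 1))

# Forward pass
hiddenlayer_activations = sigmoid(np.dot(X, wh) + bh)
output = sigmoid(np.dot(hiddenlayer_activations, wout) + bout)

# Steps 5-9
E = y - output                                                     # (3, 1)
slope_output_layer = derivatives_sigmoid(output)                   # (3, 1)
slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)  # (3, 3)
d_output = E * slope_output_layer                                  # (3, 1)
Error_at_hidden_layer = d_output.dot(wout.T)                       # (3, 3)
d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer         # (3, 3)

for name, arr in [("d_output", d_output), ("d_hiddenlayer", d_hiddenlayer)]:
    print(name, arr.shape)
```

The deltas have one row per sample, so multiplying by `hiddenlayer_activations.T` or `X.T` in the next step sums the contributions of all samples.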

Step 10: Update weight at both output and hidden layer
$$ w_o = w_o + H_o^T \cdot \delta_o * \eta$$
$$ w_h = w_h + X^T \cdot \delta_h * \eta$$

original format: wout = wout + matrix_dot_product(hiddenlayer_activations.Transpose, d_output)$*$learning_rate

original format: wh =  wh+ matrix_dot_product(X.Transpose,d_hiddenlayer)$*$learning_rate

Step 11: Update biases at both output and hidden layer
\begin{array}{lcr}
    b_h & = & b_h + \sum(\delta_h) * \eta \\
    b_o & = & b_o + \sum(\delta_o) * \eta
\end{array}
original format: bh = bh + sum(d_hiddenlayer, axis=0) $*$ learning_rate

original format: bout = bout + sum(d_output, axis=0) $*$ learning_rate

In [3]:
for i in range(epoch):

    #Forward Propagation
    hidden_layer_input1=np.dot(X,wh)
    hidden_layer_input=hidden_layer_input1 + bh
    hiddenlayer_activations = sigmoid(hidden_layer_input)
    
    output_layer_input1=np.dot(hiddenlayer_activations,wout)
    output_layer_input= output_layer_input1+ bout
    output = sigmoid(output_layer_input)

    #Backpropagation
    E = y-output
    slope_output_layer = derivatives_sigmoid(output)
    slope_hidden_layer = derivatives_sigmoid(hiddenlayer_activations)
    
    d_output = E * slope_output_layer
    Error_at_hidden_layer = d_output.dot(wout.T)    
    d_hiddenlayer = Error_at_hidden_layer * slope_hidden_layer
    
    wout += hiddenlayer_activations.T.dot(d_output) *lr
    bout += np.sum(d_output, axis=0,keepdims=True) *lr
    
    wh += X.T.dot(d_hiddenlayer) *lr
    bh += np.sum(d_hiddenlayer, axis=0,keepdims=True) *lr
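
To check that the loop is actually learning, one option is to record the squared error at every epoch and compare the first and last values. The sketch below repeats the loop inside a function; the name `train`, the fixed seed and the use of `np.random.default_rng` are assumptions added for reproducibility, not part of the original cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def derivatives_sigmoid(a):
    # a is a sigmoid output
    return a * (1 - a)

def train(X, y, hidden=3, epoch=5000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    wh = rng.uniform(size=(X.shape[1], hidden))
    bh = rng.uniform(size=(1, hidden))
    wout = rng.uniform(size=(hidden, y.shape[1]))
    bout = rng.uniform(size=(1, y.shape[1]))
    losses = []
    for _ in range(epoch):
        # Forward pass
        h = sigmoid(np.dot(X, wh) + bh)
        out = sigmoid(np.dot(h, wout) + bout)
        losses.append(float(np.sum((y - out) ** 2) / 2))
        # Backward pass (deltas are computed before any weight changes)
        d_output = (y - out) * derivatives_sigmoid(out)
        d_hidden = d_output.dot(wout.T) * derivatives_sigmoid(h)
        # Parameter update
        wout += h.T.dot(d_output) * lr
        bout += np.sum(d_output, axis=0, keepdims=True) * lr
        wh += X.T.dot(d_hidden) * lr
        bh += np.sum(d_hidden, axis=0, keepdims=True) * lr
    return (wh, bh, wout, bout), losses

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[1], [1], [0]])

params, losses = train(X, y)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Plotting `losses` against the epoch index is a simple way to judge whether `epoch` and `lr` are reasonable for a given data set.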

In [13]:
print("Output layer: ")
print("weights: \n{}".format(wout))
print("biases: \n{}\n".format(bout))

print("Hidden layer: ")
print("weights: \n{}".format(wh))
print("biases: \n{}".format(bh))

Output layer: 
weights: 
[[ 3.54081355]
 [ 3.11519372]
 [-2.46159562]]
biases: 
[[-1.46017449]]

Hidden layer: 
weights: 
[[ 1.61560569  1.46930404 -0.80612237]
 [-1.73880225 -1.41772989  1.23533085]
 [ 1.73563891  1.73681884 -0.55815855]
 [-0.64345475 -1.02417878  0.78395742]]
biases: 
[[-0.36208101 -0.07956337  0.16433623]]
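
The printed parameters can be plugged back into the forward pass to see what the trained network predicts. The values below are copied from the printed output; the forward-pass code is a sketch in the same style as the training cell.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

X = np.array([[1, 0, 1, 0], [1, 0, 1, 1], [0, 1, 0, 1]])
y = np.array([[1], [1], [0]])

# Parameters copied from the printed output
wh = np.array([[ 1.61560569,  1.46930404, -0.80612237],
               [-1.73880225, -1.41772989,  1.23533085],
               [ 1.73563891,  1.73681884, -0.55815855],
               [-0.64345475, -1.02417878,  0.78395742]])
bh = np.array([[-0.36208101, -0.07956337,  0.16433623]])
wout = np.array([[3.54081355], [3.11519372], [-2.46159562]])
bout = np.array([[-1.46017449]])

hidden = sigmoid(np.dot(X, wh) + bh)
prediction = sigmoid(np.dot(hidden, wout) + bout)

print(prediction)               # values close to 1, 1 and 0
print(np.round(prediction))     # rounds to the targets in y
```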


# Mathematical Practice of BP Algorithm

$$ H_i = w_i \cdot X + b_h$$
$$ H_o = f(H_i) = f(w_i \cdot X + b_h)$$
$$ O_i = w_h \cdot H_o + b_o $$
$$ O_o = f(O_i) = f(w_h \cdot H_o + b_o)$$
where $H_i$ = the input of the hidden layer, $H_o$ = the output of the hidden layer (after applying the activation function), $O_i$ = input of the output layer, and $O_o$ = output of the output layer. In this section $w_i$ denotes the weights from the input to the hidden layer and $w_h$ the weights from the hidden to the output layer.

$$ E = (O_o - Y)^2 / 2 \space \Rightarrow \space \frac{\partial E}{\partial O_o} = (O_o - Y)$$
where $Y$ = the actual output

The gradients of the error with respect to the weights of the output and hidden layers:
$$ \frac{\partial E}{\partial w_h} = \frac{\partial E}{\partial O_o} \frac{\partial O_o}{\partial O_i} \frac{\partial O_i}{\partial w_h} = (O_o - Y) \cdot O_o (1 - O_o) \cdot H_o$$
where
\begin{array}{rcl}
    \frac{\partial E}{\partial O_o} & = & O_o - Y \\
    \frac{\partial O_o}{\partial O_i} & = & O_o (1 - O_o) \\
    \frac{\partial O_i}{\partial w_h} & = & H_o
\end{array}

$$ \frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial O_o} \frac{\partial O_o}{\partial O_i} \frac{\partial O_i}{\partial H_o} \frac{\partial H_o}{\partial H_i} \frac{\partial H_i}{\partial w_i} = (O_o - Y) \cdot O_o (1 - O_o) \cdot w_h \cdot H_o (1 - H_o) \cdot X$$
where
\begin{array}{rcl}
    \frac{\partial O_i}{\partial H_o} & = & w_h \\
    \frac{\partial H_o}{\partial H_i} & = & H_o (1 - H_o) \\
    \frac{\partial H_i}{\partial w_i} & = & X
\end{array}