# Neural Networks

An **Artificial Neural Network (ANN)** is a computing system inspired by the biological neural networks that constitute the animal brain.<br><br>

An ANN is built from connected units (nodes) called artificial neurons, arranged in layers, which attempt to learn the underlying relationships in a set of data through algorithms that mimic the way the brain learns.<br><br>

## Neural Network Implementation


1. Pick a **Neural Network Architecture** <br>

2. Randomly initialize weights (or bias) in the range $[-\epsilon, \epsilon]$,
where $\epsilon = \frac{\sqrt{6}}{\sqrt{L_{in} + L_{out}}}$ and
$L_{in}$, $L_{out}$ = number of units in the layers adjacent to $\theta$ <br>

3. Feed **input data** into the Neural Network <br>

4. Implement **Forward Propagation** to compute $h_\theta(x)$ <br>
the data flows **from layer to layer** until we have the output <br>

5. Adjust the given parameter (weight or bias) using **Backpropagation** by subtracting the derivative of the error with respect to the parameter itself

6. Iterate through the process using Gradient Descent (or some other optimization algorithm) to minimize Loss/Cost

The most important step is the 5th. We want to be able to have as many layers as we want, and of any type. But if we modify/add/remove one layer from the network, the output of the network is going to change, which is going to change the error, which is going to change the derivative of the error with respect to the parameters. We need to be able to compute the derivatives regardless of the network architecture, regardless of the activation functions, regardless of the loss we use.<br><br>

In order to achieve that, we must implement **each layer separately**
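
Step 2 of the list above can be sketched as follows. This is a minimal illustration, assuming NumPy; the helper name `init_weights` and the parameter names `l_in` and `l_out` are hypothetical and are not part of the modules defined later in this notebook.

```python
import numpy as np

def init_weights(l_in, l_out, rng=None):
    """Draw an (l_in, l_out) weight matrix uniformly from [-eps, eps]."""
    rng = np.random.default_rng() if rng is None else rng
    # eps = sqrt(6) / sqrt(L_in + L_out), as in step 2
    eps = np.sqrt(6) / np.sqrt(l_in + l_out)
    return rng.uniform(-eps, eps, size=(l_in, l_out))

# hypothetical layer with 2 input units and 3 output units
W = init_weights(2, 3, np.random.default_rng(0))
print(W.shape)
```

Note that the `DenseLayer` defined further down uses a simpler fixed range instead of this formula.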

## Forward Propagation

Every layer that we might create has at least 2 things in common: **input** and **output** data.    <br><br>

![layer_common.png](images/layer_common.png)

The output of one layer is the input of the next layer <br><br>

![forward_propagation.png](images/forward_propagation.png)
<br>
This is **Forward Propagation**. <br><br>
Essentially, the input data is fed into the first layer, then the output of every layer becomes the input of the next layer until we reach the end of the network. <br><br>

We compute the outcome (the hypothesis $h_\theta(X)$) by feeding the input (X) forward through the Neural Network, layer by layer, starting from randomly initialized Weights (W). <br><br>

Next, we calculate the **Error E** by feeding the computed values (**Y***) and actual values (**Y**) into an error function.<br><br>
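
As a rough sketch of the idea, the snippet below chains a few layers, each modelled as a plain function, and then computes an error. The weights `W1` and `W2`, the sample `X`, the target `Y` and the choice of mean squared error are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical randomly initialized weights for two layers
W1 = rng.uniform(-0.5, 0.5, size=(2, 3))
W2 = rng.uniform(-0.5, 0.5, size=(3, 1))

# each "layer" is just a function from its input to its output
layers = [
    lambda x: x @ W1,  # linear layer: 2 inputs -> 3 outputs
    np.tanh,           # activation
    lambda x: x @ W2,  # linear layer: 3 inputs -> 1 output
    np.tanh,           # activation
]

def forward(x, layers):
    """Feed x through the layers; each output is the next layer's input."""
    out = x
    for layer in layers:
        out = layer(out)
    return out

X = np.array([[0.0, 1.0]])      # one sample as a row vector
Y = np.array([[1.0]])           # its actual value
Y_star = forward(X, layers)     # computed value h_theta(X)
E = np.mean((Y - Y_star) ** 2)  # error as a single number
print(Y_star, E)
```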

### Gradient Descent

**Gradient Descent** is an iterative first-order optimization algorithm for finding a local minimum of a differentiable function.<br><br>

**Gradient Descent** is used as an advanced optimization technique to minimize the error _E_ in Neural Networks 

_this is a quick reminder on Gradient Descent and not a learning resource_ <br><br>

Basically, we change parameters (**W** and **B**) so as to decrease the error **E**:  

$ W = W - \alpha \cdot \frac{\partial E}{\partial W} $ <br>
$ B = B - \alpha \cdot \frac{\partial E}{\partial B} $

where $\alpha$ is the learning rate, which we set manually, typically in the range [0, 1]
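
To make the update rule concrete, here is a small sketch on a one-parameter toy problem. The error function $E(w) = (w - 3)^2$, the starting point and the number of steps are assumptions chosen only for illustration.

```python
def error(w):
    # toy error with its minimum at w = 3
    return (w - 3.0) ** 2

def error_derivative(w):
    # dE/dw for the toy error above
    return 2.0 * (w - 3.0)

alpha = 0.1  # learning rate
w = 0.0      # arbitrary starting value

for step in range(100):
    # same rule as above: W = W - alpha * dE/dW
    w = w - alpha * error_derivative(w)

print(w, error(w))
```

Each step moves `w` against the sign of the derivative, so the error shrinks until `w` settles close to 3.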

## Backward Propagation

**Backward Propagation or Backpropagation** is an algorithm for supervised learning of artificial neural networks using gradient descent. <br><br>

Given an Artificial Neural Network and an Error function, the method calculates the gradient of the error function with respect to the Neural Network's weights (W) i.e. $ \frac{\partial E}{\partial W} $ which is further used in Gradient Descent to find the optimal values of W and B <br><br>

Implementation: <br><br>

Having calculated the output **Y*** = $ h_\theta(X) $ and the error **E** = error_function(Y*, Y), we can compute the derivative of the error w.r.t. the output, $ \frac{\partial E}{\partial Y} $. <br><br>

Next, using the **Chain Rule**, we can compute 
1. the derivative of the error w.r.t the parameters $ \frac{\partial E}{\partial W}, \frac{\partial E}{\partial B} $ <br>
2. the derivative of the error w.r.t the input $ \frac{\partial E}{\partial X} $ (because as stated earlier, output of the current layer is the input of the next layer)

![backpropagation_layer_by_layer.png](images/backpropagation_layer_by_layer.png)
<br><br>

**Formula** (_skipping the derivation_)

![backpropagation_formula](images/backpropagation_formula.png)
<br><br>
**NOTE**: *E* is scalar (a number) and *X*, *Y*, *B* and *W* are matrices.<br><br>

**NOTE**: ∂E/∂X must be calculated because the output of one layer is the input of the next. During backpropagation, the derivative of the error w.r.t. the input of a layer therefore serves as the derivative of the error w.r.t. the output of the layer before it (ex: ∂E/∂X of layer 3 = ∂E/∂Y of layer 2)
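
The hand-off described in the note above can be sketched with two scalar "layers". Everything here (the values of `x`, `t`, `w1`, `w2` and the squared error) is an assumption for illustration; the gradients obtained by the chain rule are compared against finite differences.

```python
x, t = 2.0, 1.0       # input and target value
w1, w2 = 0.5, -1.5    # one weight per layer

def error(w1, w2):
    y1 = w1 * x       # layer 1
    y2 = w2 * y1      # layer 2 (its input is y1)
    return (y2 - t) ** 2

# forward pass
y1 = w1 * x
y2 = w2 * y1

# backward pass, last layer first
dE_dy2 = 2 * (y2 - t)   # dE/dY of layer 2
dE_dw2 = dE_dy2 * y1    # dE/dW of layer 2
dE_dx2 = dE_dy2 * w2    # dE/dX of layer 2 ...
dE_dy1 = dE_dx2         # ... which is dE/dY of layer 1
dE_dw1 = dE_dy1 * x     # dE/dW of layer 1

# finite-difference check of both weight gradients
h = 1e-6
num_dw1 = (error(w1 + h, w2) - error(w1 - h, w2)) / (2 * h)
num_dw2 = (error(w1, w2 + h) - error(w1, w2 - h)) / (2 * h)
print(dE_dw1, num_dw1)
print(dE_dw2, num_dw2)
```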

## NOTE:

The classes and functions explained and defined below (`Layer`, `DenseLayer`, `ActivationLayer`, `activation_function` and `activation_function_derivative`, `cost_fun` and `cost_derivative`, and `Network`) are written in separate Python files and finally imported into the main file as modules

## Abstract Class: Layer

The abstract class *Layer*, which all other layers inherit from, holds the properties common to every layer: an input, an output, and the forward propagation and backward propagation methods.<br><br>

The `forward_propagation` method takes in the input and gives us the output **Y*** or **$h_\theta(X)$**<br><br>

The `backward_propagation` method takes in the derivative of the error w.r.t. the output, i.e. ∂E/∂Y (`output_error`), and is responsible for 2 things:
1. updating the parameters (weights & bias), if any
2. returning the derivative of the error w.r.t. the input, i.e. ∂E/∂X
<br>

NOTE: since `Layer` is an abstract class, its `forward_propagation` and `backward_propagation` methods are only declared (they raise `NotImplementedError`) and not defined.<br>
These methods are defined in the inheriting classes

In [1]:
#Abstract class
class Layer:
    def __init__(self):
        self.input = None
        self.output = None

    #computes the output Y of a layer for a given input X
    def forward_propagation(self, input):
        raise NotImplementedError

    #computes dE/dX for a given dE/dY (and update parameters if any)
    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError
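
To illustrate the contract a subclass has to fulfil, here is a minimal sketch. `ScaleLayer` is a hypothetical layer invented for this example (it is not one of the notebook's modules); `Layer` is repeated so that the snippet runs on its own.

```python
class Layer:
    def __init__(self):
        self.input = None
        self.output = None

    def forward_propagation(self, input):
        raise NotImplementedError

    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError

# hypothetical layer without learnable parameters: Y = factor * X
class ScaleLayer(Layer):
    def __init__(self, factor):
        super().__init__()
        self.factor = factor

    def forward_propagation(self, input):
        self.input = input
        self.output = self.factor * input
        return self.output

    def backward_propagation(self, output_error, learning_rate):
        # dE/dX = dE/dY * dY/dX = dE/dY * factor; nothing to update
        return self.factor * output_error

layer = ScaleLayer(2.0)
y = layer.forward_propagation(3.0)              # forward: Y for X = 3
input_error = layer.backward_propagation(1.0, 0.1)  # backward: dE/dX for dE/dY = 1
print(y, input_error)
```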

## Dense Layer

**Dense Layer** or **Fully Connected Layer** is a layer that is deeply connected with its preceding layer which means the neurons of the layer are connected to every neuron of its preceding layer.<br>
In other words, every input neuron is connected to every output neuron<br>


![fully_connected_layer.png](images/fully_connected_layer.png)

The most basic purpose of a Dense layer is to **change the dimension of a vector by using every neuron**<br><br>

`DenseLayer` inherits from the abstract class `Layer` and defines the `forward_propagation` and `backward_propagation` methods

### Forward Propagation

The value of each output neuron can be calculated as :
$ Y = XW + B $, where the input $X$ is a row vector

### Backward Propagation

Given a matrix `output_gradient` containing the derivative of the error with respect to the layer's output i.e. $ \frac{\partial E}{\partial Y} $, we compute:

1. The derivative of the error w.r.t the parameters $ \frac{\partial E}{\partial W} $, $ \frac{\partial E}{\partial B} $

2. The derivative of the error w.r.t the input $ \frac{\partial E}{\partial X} $

Upon derivation, we conclude:

`weights_gradient`: $ \frac{\partial E}{\partial W} = X^T \frac{\partial E}{\partial Y} $

`output_gradient`: $ \frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y} $

`input_gradient`: $ \frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} W^T $
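
One way to convince yourself of these three formulas without going through the derivation is a finite-difference check. The sketch below assumes a single row-vector sample, a mean squared error against an arbitrary target `T`, and random values throughout; the helper `numerical_grad` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 3))   # one sample with 3 input neurons
W = rng.normal(size=(3, 2))   # 3 input neurons -> 2 output neurons
B = rng.normal(size=(1, 2))
T = rng.normal(size=(1, 2))   # arbitrary target values

def error():
    Y = X @ W + B
    return np.mean((Y - T) ** 2)

# analytic gradients from the formulas above
Y = X @ W + B
dE_dY = 2 * (Y - T) / Y.size
dE_dW = X.T @ dE_dY
dE_dB = dE_dY
dE_dX = dE_dY @ W.T

def numerical_grad(f, A, h=1e-6):
    """Central-difference gradient of f() w.r.t. each entry of A (A is modified in place and restored)."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + h
        plus = f()
        A[idx] = old - h
        minus = f()
        A[idx] = old
        grad[idx] = (plus - minus) / (2 * h)
    return grad

print(np.allclose(dE_dW, numerical_grad(error, W), atol=1e-6))
print(np.allclose(dE_dB, numerical_grad(error, B), atol=1e-6))
print(np.allclose(dE_dX, numerical_grad(error, X), atol=1e-6))
```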

In [3]:
from layer import Layer
import numpy as np

#inherit from base class Layer
class DenseLayer(Layer):
    #input_size = number of input neurons
    #output_size = number of output neurons

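    def __init__(self, input_size, output_size):
        #initialize self.input and self.output via the base class
        super().__init__()
        #weights and bias are drawn uniformly from [-0.5, 0.5)
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias    = np.random.rand(1, output_size) - 0.5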

    #returns output for a given input
    def forward_propagation(self, input_data):
        self.input  = input_data
        self.output = np.dot(self.input, self.weights) + self.bias
        return self.output

    #computes dE/dW, dE/dB for a given output_error(output_gradient)=dE/dY. Returns input_error(input_gradient)=dE/dX.
    def backward_propagation(self, output_gradient, learning_rate):
        input_gradient   = np.dot(output_gradient, self.weights.T)
        weights_gradient = np.dot(self.input.T, output_gradient)

        #update parameters
        self.weights -= learning_rate * weights_gradient
        self.bias    -= learning_rate * output_gradient
        return input_gradient

### Activation Layer

The activation function decides whether, and how strongly, a neuron is activated and its signal passed on to the next layer. <br>
In other words, it decides how relevant the neuron's input is to the prediction.<br>

Examples: Binary Step Function, Linear Function, Sigmoid function, Tanh, etc.<br><br>



The Activation Layer simply takes in a group of neurons (a vector) and passes them through the activation function to give activated neurons (vector) of the same shape as the input neurons.<br>

`ActivationLayer` takes 2 parameters : 
1. `activation`: activation function 
2. `activation_prime`: derivative of the activation function

It has 2 methods:
1. `forward_propagation` $ Y= f(X) $
2. `backward_propagation` $ \frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} ⊙ f'(X) $ <br><br>

**NOTE**: We don't use an Activation Function within the Dense Layer because it would complicate the calculations in the Dense Layer. <br>
The Activation Function is just another layer, so it is implemented in its own class.<br><br>

In [4]:
from layer import Layer

class ActivationLayer(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    #returns the activated input
    def forward_propagation(self, input_data):
        self.input  = input_data
        self.output = self.activation(self.input)
        return self.output

    #returns input_error=dE/dX for a given output_gradient=dE/dY.
    #learning_rate is not used because there are no "learnable" parameters.
    def backward_propagation(self, output_gradient, learning_rate):
        return self.activation_prime(self.input) * output_gradient

In [5]:
import numpy as np

# activation function and its derivative
def activation_function(x):
    return np.tanh(x)

def activation_function_derivative(x):
    return 1-np.tanh(x)**2
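
As a quick sanity check, the analytic derivative can be compared with a central finite difference. The sample points in `x` and the step `h` are arbitrary choices for this sketch; the two functions are repeated so that the snippet runs on its own.

```python
import numpy as np

def activation_function(x):
    return np.tanh(x)

def activation_function_derivative(x):
    return 1 - np.tanh(x)**2

x = np.linspace(-2.0, 2.0, 9)  # a few sample points
h = 1e-6                       # finite-difference step

numeric = (activation_function(x + h) - activation_function(x - h)) / (2 * h)
analytic = activation_function_derivative(x)

# largest disagreement between the two estimates
max_diff = np.max(np.abs(numeric - analytic))
print(max_diff)
```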

### Cost Function

**Cost Function** measures the Error between the predicted value (Y*) and the actual value (Y).<br><br>

There are many ways to define the error, and we use them as per the problem's need and/or the algorithm used.<br>
We also need to define the derivative of the Cost Function.<br>

One of the best-known Cost Functions is the **MSE (Mean Squared Error)**, which we use here

![mse.png](images/mse.png)
where **y*** and **y** denote the predicted output and the actual (desired) output respectively.<br><br>

Now to define $ \frac{\partial E}{\partial Y} $,

![partial_derivative.png](images/partial_derivative.png)

In [1]:
import numpy as np

#Cost function and its derivative
def cost_fun(y_true, y_pred):
    return np.mean(np.power(y_true-y_pred, 2))

def cost_derivative(y_true, y_pred):
    return 2*(y_pred-y_true)/y_true.size
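
A small worked example may help here. The values in `y_true` and `y_pred` are made up for illustration; the two functions are repeated so that the snippet runs on its own.

```python
import numpy as np

def cost_fun(y_true, y_pred):
    return np.mean(np.power(y_true-y_pred, 2))

def cost_derivative(y_true, y_pred):
    return 2*(y_pred-y_true)/y_true.size

y_true = np.array([[1.0, 0.0]])  # actual values
y_pred = np.array([[0.8, 0.2]])  # predicted values

# squared differences are 0.04 and 0.04, so the mean is about 0.04
cost = cost_fun(y_true, y_pred)

# 2 * (y_pred - y_true) / 2 gives about [-0.2, 0.2]
grad = cost_derivative(y_true, y_pred)
print(cost, grad)
```

The sign of each gradient entry shows the direction in which the prediction would have to move to increase the error, which is why the update rule subtracts it.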

## Network Class

`Network` defines methods to build the Neural Network, train it and predict output for given input<br><br>

In [14]:
class Network:
    def __init__(self):
        self.layers = []
        self.cost   = None
        self.cost_derivative = None

    def add(self, layer):
        '''add layer to network (Dense/ActivationLayer)'''
        self.layers.append(layer)

    def use(self, cost, cost_derivative):
        '''set Cost function and its derivative to use'''
        self.cost = cost
        self.cost_derivative = cost_derivative

    def predict(self, input_data):
        '''predict output for given input'''
        
        #sample dimension first
        samples = len(input_data)
        result  = []

        #run network over all samples
        for i in range(samples):
            #forward propagation
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward_propagation(output)
            result.append(output)
        return result

    def fit(self, x_train, y_train, epochs, learning_rate):
        '''train the Neural Network'''

        #epoch means one complete pass of the training dataset through the algorithm
        #i.e. the number of times a learning algorithm sees the complete dataset.
        
        #sample dimension first
        samples = len(x_train)

        #training loop
        for e in range(epochs):
            err = 0
            for x, y in zip(x_train, y_train):
                #forward propagation
                output = x
                for layer in self.layers:
                    output = layer.forward_propagation(output)

                #compute cost (for display purpose only)
                err += self.cost(y, output)

                #backpropagation
                grad = self.cost_derivative(y, output)
                for layer in reversed(self.layers):
                    grad = layer.backward_propagation(grad, learning_rate)

            #calculate average error on all samples
            err /= samples
            print('epoch %d/%d   error=%f' % (e+1, epochs, err))


## using the above Neural Network

`Layer`->layer.py, <br>
`DenseLayer`->denselayer.py, <br>
`ActivationLayer`->activationlayer.py, <br>
`activation_function` and `activation_function_derivative`->activationfunctions.py, <br>
`cost_fun` and `cost_derivative`->losses.py and <br>
`Network`->network.py<br>
These classes/functions are written in the Python files listed above and imported into the following (main) file as modules

In [None]:
import numpy as np

from network import Network
from denselayer import DenseLayer
from activationlayer import ActivationLayer
from activationfunctions import activation_function, activation_function_derivative
from losses import cost_fun, cost_derivative

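#training data: the XOR truth table (an assumption for illustration; see example_xor.py)
#each sample is a row vector, so the shapes are (samples, 1, features)
x_train = np.array([[0, 0], [0, 1], [1, 0], [1, 1]]).reshape(4, 1, 2)
y_train = np.array([[0], [1], [1], [0]]).reshape(4, 1, 1)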

#network
net = Network()

#net.add(DenseLayer(no_of_ip_neurons, no_of_op_neurons))
#net.add(ActivationLayer(activation_function, activation_function_derivative))

net.add(DenseLayer(2, 3)) 
net.add(ActivationLayer(activation_function, activation_function_derivative))
net.add(DenseLayer(3, 1))
net.add(ActivationLayer(activation_function, activation_function_derivative))

#train
net.use(cost_fun, cost_derivative)
net.fit(x_train, y_train, epochs=1000, learning_rate=0.1)

#test
output = net.predict(x_train)
print(output)

For a better understanding of how to use the modules, refer to **example_xor.py** in this repository

## References

- [Machine Learning by Andrew Ng](https://www.coursera.org/learn/machine-learning/home/welcome)
- [Neural Network from scratch in Python - article](https://towardsdatascience.com/math-neural-network-from-scratch-in-python-d6da9f29ce65)
- [Neural Network from scratch in Python - video tutorial](https://www.youtube.com/watch?v=pauPCy_s0Ok&t=2s)