# Multiple Layers with Reusable Library for MNIST Classification

## Details

* Reusable library allows an arbitrary number of layers without adjusting the propagation loop
* Layers = multiple
* Activation Function = tanh
* Error Function = MSE

## References

* [Neural Network from scratch in Python](https://towardsdatascience.com/math-neural-network-from-scratch-in-python-d6da9f29ce65)

## Do the Math

### Key Formulas for Back-Propagation


#### Derivative of Error wrt Weights ($dE/dW$)

$$ dE/dW = dE/dY * dY/dW $$

* The derivative of Error wrt Weights depends on the derivative of Error wrt Output and on the derivative of Output wrt Weights.
* This is the gradient we are trying to descend.
* $dE/dY$ depends on the choice of Error Function (Cost Function)
  * For Mean Squared Error (MSE), $dE/dY$ is proportional to the difference between Predicted and Expected Output: $2(Y_p - Y_0)/n$, where $n$ is the number of outputs
* $dY/dW$ depends on what that layer actually does
  * for an **Activation Layer**, there are no "learnable" parameters, so there are no weights to differentiate with respect to and this term is not needed.
  * for a **Fully Connected Layer**, it simplifies to $dY/dW = X$
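One way to build confidence in the fully connected case is a numerical gradient check. The sketch below assumes a single-sample row vector `X`, a layer computing `Y = X @ W + B`, and an MSE error. All names (`rng`, `W_plus`, `error_for` and so on) are illustrative and not part of the library defined later. In matrix form, $dE/dW = dE/dY * dY/dW$ becomes `X.T @ dE_dY`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 3))        # one sample with 3 inputs (row vector)
W = rng.normal(size=(3, 2))        # 3 inputs -> 2 outputs
B = rng.normal(size=(1, 2))
Y_true = rng.normal(size=(1, 2))   # arbitrary target, for illustration only

def error_for(weights):
    """MSE of the layer output for a given weight matrix."""
    Y = X @ weights + B
    return np.mean((Y_true - Y) ** 2)

# Analytic gradient: dE/dY for MSE, then dE/dW = X^T . dE/dY
Y = X @ W + B
dE_dY = 2 * (Y - Y_true) / Y.size
dE_dW = X.T @ dE_dY

# Numerical gradient by central differences, one weight at a time
eps = 1e-6
dE_dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dE_dW_num[i, j] = (error_for(W_plus) - error_for(W_minus)) / (2 * eps)

grad_matches = np.allclose(dE_dW, dE_dW_num, atol=1e-6)
print(grad_matches)
```

If the analytic formula is right, the two gradients should agree up to floating-point noise.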

#### Derivative of Error wrt Bias ($dE/dB$)

$$ dE/dB = dE/dY $$

* Derivative of Error wrt Bias is just Derivative of Error wrt Output
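
To see why, assume the fully connected layer computes $Y = XW + B$ on a single sample, which is the convention used by the library code in this notebook. Under that assumption the bias enters additively, so a short derivation gives:

$$ Y = XW + B \Rightarrow dY/dB = 1 \Rightarrow dE/dB = dE/dY * dY/dB = dE/dY $$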


#### Derivative of Error wrt Input ($dE/dX$)

$$ dE/dX = dE/dY * dY/dX $$

* The derivative of Error wrt Input can be calculated from the derivative of Error wrt Output.
* **Hence, error can be back-propagated up the layers!**
* Again, $dY/dX$ depends on the layer type
  * for an **Activation Layer**, it depends on the chosen **Activation Function** and its derivative
  * for a **Fully Connected Layer**, it simplifies to $dY/dX = W$
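
For the fully connected case, the input gradient can be checked numerically in the same spirit. This is a minimal sketch assuming a row-vector input and `Y = X @ W + B`. The upstream gradient `dE_dY` is random here purely for illustration, and the helper `numerical_input_grad` is hypothetical. With row vectors, $dE/dX = dE/dY * dY/dX$ becomes `dE_dY @ W.T`.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1, 3))
W = rng.normal(size=(3, 2))
B = rng.normal(size=(1, 2))
dE_dY = rng.normal(size=(1, 2))    # stand-in for the gradient arriving from the next layer

def numerical_input_grad(f, x, upstream, eps=1e-6):
    """Central-difference estimate of dE/dX, treating E as sum(f(x) * upstream)."""
    grad = np.zeros_like(x)
    for k in range(x.shape[1]):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[0, k] += eps
        x_minus[0, k] -= eps
        grad[0, k] = np.sum((f(x_plus) - f(x_minus)) * upstream) / (2 * eps)
    return grad

# Analytic input gradient for a fully connected layer
dE_dX = dE_dY @ W.T

# Numerical estimate for comparison
dE_dX_num = numerical_input_grad(lambda x: x @ W + B, X, dE_dY)

input_grad_matches = np.allclose(dE_dX, dE_dX_num, atol=1e-6)
print(input_grad_matches)
```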


### Apply Chain Rule Across Layers for Back-Propagating Error

For a 3-layer net, the error $dE/dY$ can be back-propagated by applying the Chain Rule as follows.

$$ dE/dX = dH_1/dX * dH_2/dH_1 * dY/dH_2 * dE/dY $$

where,

* $dH_1/dX$ is the derivative of Layer 1 Output wrt Original Input
* $dH_2/dH_1$ is the derivative of Layer 2 Output wrt Layer 1 Output
* $dY/dH_2$ is the derivative of Layer 3 Output wrt Layer 2 Output
* $dE/dY$ is the derivative of final output's Error with respect to final Prediction
* $dE/dX$ is the derivative of final output's Error with respect to Original input
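
The chain can be made concrete with three purely linear layers, where each per-layer derivative is just that layer's weight matrix. The sketch below is an illustration under that simplifying assumption (no biases, no activations). The names `W1`, `W2`, `W3` and `error` are hypothetical. It walks the error back one layer at a time and compares the result with a numerical estimate.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1, 4))
W1 = rng.normal(size=(4, 3))       # Layer 1: X  -> H1
W2 = rng.normal(size=(3, 3))       # Layer 2: H1 -> H2
W3 = rng.normal(size=(3, 2))       # Layer 3: H2 -> Y
Y_true = rng.normal(size=(1, 2))

def forward(x):
    return x @ W1 @ W2 @ W3

def error(x):
    return np.mean((Y_true - forward(x)) ** 2)

# Back-propagate: start from dE/dY and multiply by each layer's derivative in turn
Y = forward(X)
dE_dY = 2 * (Y - Y_true) / Y.size
dE_dH2 = dE_dY @ W3.T              # through Layer 3
dE_dH1 = dE_dH2 @ W2.T             # through Layer 2
dE_dX = dE_dH1 @ W1.T              # through Layer 1

# Numerical estimate of dE/dX by central differences
eps = 1e-6
dE_dX_num = np.zeros_like(X)
for k in range(X.shape[1]):
    x_plus, x_minus = X.copy(), X.copy()
    x_plus[0, k] += eps
    x_minus[0, k] -= eps
    dE_dX_num[0, k] = (error(x_plus) - error(x_minus)) / (2 * eps)

chain_rule_holds = np.allclose(dE_dX, dE_dX_num, atol=1e-5)
print(chain_rule_holds)
```

Note that with row-vector inputs the factors are applied from the output side first, which mirrors iterating over the layers in reverse.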



### Nota Bene

* Each layer needs to propagate Error backwards by multiplying that layer's Output Error by the derivative of its Output with respect to its Input
* Activation Layers have no "learnable" weights, so there is nothing to adjust and no need to calculate $dE/dW$ (the output does not depend on any weights, so the gradient with respect to them is effectively 0).

## Define the Library

### Layer base class

In [12]:
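import numpy as np

# Base class
class Layer:
    def __init__(self):
        self.input = None
        self.output = None
        # NOTE: this only runs if a subclass calls super().__init__();
        # the subclasses below define their own __init__ and do not.
        np.random.seed(42)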

    # computes the output Y of a layer for a given input X
    def forward_propagation(self, input):
        raise NotImplementedError

    # computes dE/dX for a given dE/dY (and update parameters if any)
    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError

### Fully Connected Layer

In [13]:
# from layer import Layer

import numpy as np

# inherit from base class Layer
class FCLayer(Layer):
    # input_size = number of input neurons
    # output_size = number of output neurons
    def __init__(self, input_size, output_size):
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

    # returns output for a given input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = np.dot(self.input, self.weights) + self.bias
        return self.output

    # computes dE/dW, dE/dB for a given output_error=dE/dY. Returns input_error=dE/dX.
    def backward_propagation(self, output_error, learning_rate):
        input_error = np.dot(output_error, self.weights.T)
        weights_error = np.dot(self.input.T, output_error)
        # dBias = output_error

        # update parameters
        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * output_error
        return input_error

### Activation Layer

In [14]:
# from layer import Layer

# inherit from base class Layer
class ActivationLayer(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    # returns the activated input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = self.activation(self.input)
        return self.output

    # Returns input_error=dE/dX for a given output_error=dE/dY.
    # learning_rate is not used because there are no "learnable" parameters.
    def backward_propagation(self, output_error, learning_rate):
        return self.activation_prime(self.input) * output_error

### Define Activation Function and its Derivative

In [15]:
import numpy as np

# activation function and its derivative
def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1 - np.tanh(x)**2
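
As a quick sanity check, the derivative can be compared against a central-difference estimate. This is a minimal sketch: the two functions are repeated so the snippet runs on its own, and the sample points and step size `eps` are arbitrary choices.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

x = np.linspace(-3, 3, 13)         # a handful of sample points
eps = 1e-6

# Central-difference approximation of d/dx tanh(x)
numeric = (tanh(x + eps) - tanh(x - eps)) / (2 * eps)
max_abs_diff = np.max(np.abs(numeric - tanh_prime(x)))

print(max_abs_diff)                # expected to be tiny (floating-point noise)
```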

### Define Loss Function and its Derivative

In [16]:
import numpy as np

# loss function and its derivative
def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size
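
A small worked example may help make the two formulas concrete. The functions are repeated so the snippet is self-contained, and the values of `y_true` and `y_pred` are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size

y_true = np.array([[0.0, 1.0, 0.0]])
y_pred = np.array([[0.2, 0.7, -0.1]])

# Squared differences are 0.04, 0.09 and 0.01, so the mean is 0.14 / 3
loss = mse(y_true, y_pred)

# Each entry is 2 * (prediction - target) / 3
grad = mse_prime(y_true, y_pred)

print(loss)
print(grad)
```

The sign of each gradient entry shows which way the prediction should move: a positive entry means the prediction is too high.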

### Wrap everything together and expose fit/predict APIs via a Network class

In [17]:
class Network:
    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_prime = None

    # add layer to network
    def add(self, layer):
        self.layers.append(layer)

    # set loss to use
    def use(self, loss, loss_prime):
        self.loss = loss
        self.loss_prime = loss_prime

    # predict output for given input
    def predict(self, input_data):
        # sample dimension first
        samples = len(input_data)
        result = []

        # run network over all samples
        for i in range(samples):
            # forward propagation
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward_propagation(output)
            result.append(output)

        return result

    # train the network
    def fit(self, x_train, y_train, epochs, learning_rate):
        # sample dimension first
        samples = len(x_train)

        # training loop
        for i in range(epochs):
            err = 0
            for j in range(samples):
                # forward propagation
                output = x_train[j]
                for layer in self.layers:
                    output = layer.forward_propagation(output)

                # compute loss (for display purpose only)
                err += self.loss(y_train[j], output)

                # backward propagation
                error = self.loss_prime(y_train[j], output)
                for layer in reversed(self.layers):
                    error = layer.backward_propagation(error, learning_rate)

            # calculate average error on all samples
            err /= samples
            if (i + 1) % 20 == 0:
                print('epoch %d/%d   error=%f' % (i+1, epochs, err))

## Unit Test

### Solve XOR

In [27]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

# training data
x_train = np.array([[[0,0]], [[0,1]], [[1,0]], [[1,1]]])
y_train = np.array([[[0]], [[1]], [[1]], [[0]]])

# network
net = Network()
net.add(FCLayer(2, 3))
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(3, 1))
net.add(ActivationLayer(tanh, tanh_prime))

# train
net.use(mse, mse_prime)
net.fit(x_train, y_train, epochs=500, learning_rate=0.1)

# test
out = net.predict(x_train)
print(out)

epoch 20/500   error=0.294873
epoch 40/500   error=0.278176
epoch 60/500   error=0.241852
epoch 80/500   error=0.210989
epoch 100/500   error=0.187060
epoch 120/500   error=0.147641
epoch 140/500   error=0.071084
epoch 160/500   error=0.021973
epoch 180/500   error=0.010032
epoch 200/500   error=0.006044
epoch 220/500   error=0.004197
epoch 240/500   error=0.003164
epoch 260/500   error=0.002515
epoch 280/500   error=0.002074
epoch 300/500   error=0.001757
epoch 320/500   error=0.001519
epoch 340/500   error=0.001334
epoch 360/500   error=0.001187
epoch 380/500   error=0.001068
epoch 400/500   error=0.000969
epoch 420/500   error=0.000886
epoch 440/500   error=0.000815
epoch 460/500   error=0.000754
epoch 480/500   error=0.000701
epoch 500/500   error=0.000655
[array([[0.00203502]]), array([[0.96193642]]), array([[0.9660953]]), array([[-0.00299372]])]
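
The raw outputs are continuous values near 0 and 1. One possible way to turn them into XOR labels is to threshold them. The sketch below reuses the printed outputs above. The 0.5 cut-off and the names `scores`, `labels` and `accuracy` are assumptions for illustration, not part of the library.

```python
import numpy as np

# Outputs copied from the run above (one 1x1 array per sample)
out = [np.array([[0.00203502]]), np.array([[0.96193642]]),
       np.array([[0.9660953]]), np.array([[-0.00299372]])]
y_expected = np.array([0, 1, 1, 0])    # XOR truth table

scores = np.array(out).reshape(-1)     # flatten to shape (4,)
labels = (scores > 0.5).astype(int)    # assumed threshold of 0.5
accuracy = np.mean(labels == y_expected)

print(labels)
print(accuracy)
```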
