# More Efficient Implementation

In nn_basic.ipynb, I was able to make a working neural network using just for loops and functions. However, there were a couple of large issues with the implementation. For one, it had no way to implement more than one hidden layer, so it was limited to being an MLP with a single hidden layer. There was also not much room for improving the neural network with more features and additions over time. Lastly, the algorithm was incredibly inefficient because all the mathematics was done using Python for loops, when in reality it is much quicker to use a C-backed library like NumPy to calculate these derivatives with matrix operations.

This implementation directly follows the one made in this video:

https://www.youtube.com/watch?v=pauPCy_s0Ok 

In [2]:
import numpy as np

### Base Layer Implementation:

The base layer is the template that every other layer builds on. It stores the input it receives and the output it produces, and it declares the `forward` and `backward` methods that concrete layers, starting with the Dense layer, fill in.

In [15]:
class Layer:
    def __init__(self):
        self.input=None
        self.output=None
    def forward(self, input):
        # To do: forward propagate with given input
        pass
    def backward(self, output_gradient, learning_rate):
        # TODO : propagate backwards
        pass


### Dense Layer Implementation:

The dense layer connects every neuron in the previous layer to every neuron in this layer through a weight, so each neuron's activation in the dense layer is a weighted sum of the inputs plus a bias.
For the mathematical notation note that:
- $i$ denotes the number of neurons in the previous layer.
- $j$ denotes the number of neurons to be made in this dense layer.
- $w_{ij}$ denotes the weight for the ith input into the jth output neuron.
- $b_j$ denotes the bias for the jth output neuron.
- $\{x_i\}$ denotes the set of input values from the previous layer.
- $\{y_j\}$ denotes the set of output values from this dense layer.

You can write a given output as: $y_j = x_1w_{1j} + x_2w_{2j} + \dots + x_iw_{ij} + b_j$

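However, using matrix multiplication you can rewrite this for all outputs at once, where:
- $W$ denotes the $j \times i$ matrix that stores all the weight values, one row per output neuron.
- $b$ denotes the column vector of bias terms.
- $X$ and $Y$ denote the column vectors of inputs and outputs.

$$Y = WX + b$$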
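To make the equivalence concrete, here is a minimal sketch that computes the same dense-layer output twice, once with nested for loops and once with a single matrix product. The sizes, the seed, and the variable names are illustrative assumptions, not values from this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the example is repeatable

i, j = 3, 2                      # 3 inputs, 2 output neurons (illustrative sizes)
W = rng.standard_normal((j, i))  # weight matrix, one row per output neuron
b = rng.standard_normal((j, 1))  # bias column vector
X = rng.standard_normal((i, 1))  # input column vector

# Loop version: one weighted sum per output neuron.
Y_loop = np.zeros((j, 1))
for out in range(j):
    total = b[out, 0]
    for inp in range(i):
        total += W[out, inp] * X[inp, 0]
    Y_loop[out, 0] = total

# Matrix version: the same calculation in a single call.
Y_matrix = np.dot(W, X) + b

print(np.allclose(Y_loop, Y_matrix))  # True
```

Both versions give the same numbers; the matrix version simply hands the inner loops to NumPy.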

In [16]:
class Dense(Layer):
    def __init__(self, input_size, output_size):
        self.weights= np.random.randn(output_size,input_size) # returns array of size (j,i) sampled randomly from standard normal distribution.
        self.bias=np.random.randn(output_size, 1) # returns array of size (j,1) sampled randomly from standard normal distribution.

    def forward(self, input):
        self.input = input
        return np.dot(self.weights, self.input) + self.bias # performs calculation from formula above.
    
    def backward(self, output_gradient, learning_rate):
        # TODO : propagate backwards
        pass

### Back Propagation:

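Given the gradient of the error with respect to this layer's output, $\frac{\partial E}{\partial Y}$, the three gradients the layer needs are:

- $\frac{\partial E}{\partial W} = \frac{\partial E}{\partial Y} \cdot X^T$

- $\frac{\partial E}{\partial b} = \frac{\partial E}{\partial Y}$

- $\frac{\partial E}{\partial X} = W^T \cdot \frac{\partial E}{\partial Y}$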
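One way to build trust in these formulas is to compare them against finite differences. The sketch below assumes a made-up error function, $E = \frac{1}{2}\sum Y^2$ (chosen only because its gradient with respect to $Y$ is simply $Y$), and a hypothetical helper `numeric_grad`; neither appears in the video or in this notebook.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((2, 3))
b = rng.standard_normal((2, 1))
X = rng.standard_normal((3, 1))

def error(W, b, X):
    # Illustrative error only: E = sum(Y^2) / 2, so dE/dY = Y.
    Y = np.dot(W, X) + b
    return 0.5 * np.sum(Y ** 2)

# Analytic gradients from the three formulas above.
dE_dY = np.dot(W, X) + b
dE_dW = np.dot(dE_dY, X.T)
dE_db = dE_dY
dE_dX = np.dot(W.T, dE_dY)

def numeric_grad(f, A, eps=1e-6):
    # Central finite differences, perturbing one entry at a time.
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        grad[idx] = (f(A_plus) - f(A_minus)) / (2 * eps)
    return grad

ok_W = np.allclose(dE_dW, numeric_grad(lambda A: error(A, b, X), W), atol=1e-5)
ok_b = np.allclose(dE_db, numeric_grad(lambda A: error(W, A, X), b), atol=1e-5)
ok_X = np.allclose(dE_dX, numeric_grad(lambda A: error(W, b, A), X), atol=1e-5)
print(ok_W, ok_b, ok_X)
```

If all three checks agree, the matrix formulas match what you would get by nudging each weight, bias, and input individually.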

In [22]:
class Dense(Layer):
    def __init__(self, input_size, output_size):
        self.weights= np.random.randn(output_size,input_size) # returns array of size (j,i) sampled randomly from standard normal distribution.
        self.bias=np.random.randn(output_size, 1) # returns array of size (j,1) sampled randomly from standard normal distribution.

    def forward(self, input):
        self.input = input
        return np.dot(self.weights, self.input) + self.bias # performs calculation from formula above.
    
    def backward(self, output_gradient, learning_rate):
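        self.weights_gradient = np.dot(output_gradient, self.input.T)
        # compute the input gradient before updating, so it uses the same weights as the forward pass.
        input_gradient = np.dot(self.weights.T, output_gradient)
        self.weights -= learning_rate * self.weights_gradient
        self.bias -= learning_rate * output_gradient
        return input_gradient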

### Implementing Activation Layer:

It is easy to notice that the Dense layer above has no activation function. To address this, the activation function gets its own type of layer, which makes it easy to swap one activation for another. This makes sense when you think about it: an activation layer is really just a very specific kind of layer in which each neuron of the previous layer feeds exactly one neuron, and that neuron applies a given activation function (sigmoid, ReLU, tanh) to its input.

Making it its own layer also keeps the calculations simple: the layer has no parameters to update, and both its forward and backward passes are element-wise operations.

$$Y = f(X)$$ 

$$\frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} \odot f'(X)$$

Here $\odot$ denotes element-wise multiplication.

In [18]:
class Activation(Layer):
    def __init__(self, activation, activation_prime):
        self.activation=activation # this is a function.
        self.activation_prime=activation_prime # also a function.
    def forward(self, input):
        self.input = input
        return self.activation(self.input) # applies inputs to activation function and returns outputs.
    def backward(self, output_gradient, learning_rate):
        return np.multiply(output_gradient, self.activation_prime(self.input))

### Implementing Specific Activation Function:
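**TanH**: squashes any input into the range $(-1, 1)$ and is centered on zero, which makes it a common choice for hidden layers in small networks.

$$f(x) = \tanh(x)$$

$$f'(x) = 1 - [\tanh(x)]^2$$

**Sigmoid**: squashes any input into the range $(0, 1)$, so it is often used when an output should read like a probability.

$$s(x) = \frac{1}{1 + e^{-x}}$$

$$s'(x) = s(x) \cdot (1-s(x))$$

**ReLu**: passes positive inputs through unchanged and sets negative inputs to zero. It is very cheap to compute and widely used in hidden layers.

$$r(x) = \max(0,x)$$

$$r'(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$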
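As a quick sanity check on the derivative formulas, the sketch below compares each one against a central finite difference. The plain functions and the `checks` dictionary are written just for this illustration; the sample points deliberately avoid $x = 0$, where ReLU has a kink and its derivative is only a convention.

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return np.maximum(0, x)

def relu_prime(x):
    return np.where(x >= 0, 1.0, 0.0)

x = np.array([-2.0, -0.5, 0.5, 2.0])  # sample points, none at the ReLU kink
eps = 1e-6

checks = {}
for name, f, f_prime in [("tanh", tanh, tanh_prime),
                         ("sigmoid", sigmoid, sigmoid_prime),
                         ("relu", relu, relu_prime)]:
    # Central difference approximates the slope of f at each sample point.
    numeric = (f(x + eps) - f(x - eps)) / (2 * eps)
    checks[name] = bool(np.allclose(numeric, f_prime(x), atol=1e-6))

print(checks)
```

All three functions work on whole arrays at once, which is what the activation layer needs when it receives a column vector.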

In [19]:
class Tanh(Activation):
    def __init__(self):
        tanh = lambda x: np.tanh(x)
        tanh_prime = lambda x: 1 - np.tanh(x) ** 2
        super().__init__(tanh, tanh_prime)

class Sigmoid(Activation):
    def __init__(self):
        sigmoid = lambda x: 1 / (1 + np.exp(-x))
        sigmoid_prime = lambda x: sigmoid(x) * (1-sigmoid(x))
        super().__init__(sigmoid, sigmoid_prime)
        
class ReLu(Activation):
    def __init__(self):
        relu = lambda x: np.maximum(0, x) # element-wise max, works on arrays.
        relu_prime = lambda x: np.where(x >= 0, 1.0, 0.0) # element-wise derivative.
        super().__init__(relu, relu_prime)

### Error:

$y_i$ : an expected output of the model.

$\hat{y}_i$ : a predicted output of the model.

$$E = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

$$\frac{\partial E}{\partial \hat{Y}} = \frac{2}{n}(\hat{Y} - Y)$$
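A small worked example may help here. It uses the same two functions this notebook defines, applied to made-up values (an assumption for illustration, not data from the XOR problem): the expected output is $(1, 0)$ and the prediction is $(0.5, 0.5)$.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / np.size(y_true)

# Illustrative values only.
y_true = np.array([[1.0], [0.0]])
y_pred = np.array([[0.5], [0.5]])

error = mse(y_true, y_pred)           # (0.5^2 + 0.5^2) / 2
gradient = mse_prime(y_true, y_pred)  # 2 * (y_pred - y_true) / 2

print(error)     # 0.25
print(gradient)  # -0.5 for the first output, 0.5 for the second
```

The sign of each gradient entry tells you which way to move the prediction: the first output is too low (negative gradient), the second is too high (positive gradient).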


In [20]:
def mse(y_true, y_pred):
    return np.mean(np.power(y_true-y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred-y_true) / np.size(y_true)

### Trying It Out:

To test this out I followed the video and tried an implementation of an XOR model.
XOR is a deceptively simple function that has only 2 inputs and follows this rule set:

- if $x_1 = 0$ and $x_2 = 0$ then: $y = 0$

- if $x_1 = 1$ and $x_2 = 0$ then: $y = 1$

- if $x_1 = 0$ and $x_2 = 1$ then: $y = 1$

- if $x_1 = 1$ and $x_2 = 1$ then: $y = 0$

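The trick is that there is no linear way to model this relationship: no single straight line separates the inputs that map to 1 from the inputs that map to 0. That makes it a good test, as our neural network will have to find a nonlinear way to map the inputs to the output.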
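To see the problem for yourself, you can fit the best possible linear model to the four XOR points. The sketch below is not part of the video; it assumes an ordinary least-squares fit with a bias term, using `np.linalg.lstsq`, and the variable names are illustrative.

```python
import numpy as np

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the linear fit can learn a bias term.
A = np.hstack([inputs, np.ones((4, 1))])

# Least-squares solution of A @ params ~ targets.
params, *_ = np.linalg.lstsq(A, targets, rcond=None)
predictions = A @ params

print(np.round(predictions, 3))
```

Every prediction comes out at roughly 0.5: the best a linear model can do is guess the average, which is no better than a coin flip on every input.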

In [13]:
X = np.reshape([[0, 0], [0, 1], [1, 0], [1, 1]], (4, 2, 1))
Y = np.reshape([[0], [1], [1], [0]], (4, 1, 1))

for x,y in zip(X,Y):
    print('x:',x)
    print('y:',y)

x: [[0]
 [0]]
y: [[0]]
x: [[0]
 [1]]
y: [[1]]
x: [[1]
 [0]]
y: [[1]]
x: [[1]
 [1]]
y: [[0]]


In [23]:
X = np.reshape([[0, 0], [0, 1], [1, 0], [1, 1]], (4, 2, 1))
Y = np.reshape([[0], [1], [1], [0]], (4, 1, 1))

network = [
    Dense(2, 3),
    Tanh(),
    Dense(3, 1),
    Tanh()
]

epochs = 100
learning_rate = 0.1

# train

for e in range(epochs):
    error = 0
    for x,y in zip(X,Y):
        # forward
        output = x 
        for layer in network:
            output = layer.forward(output)
        
        error += mse(y, output)

        # backward
        grad = mse_prime(y, output)
        for layer in reversed(network):
            grad = layer.backward(grad, learning_rate)

    error /= len(X)
    print('%d/%d, error=%f' % (e+1, epochs, error))

1/100, error=0.620273
2/100, error=0.372152
3/100, error=0.279402
4/100, error=0.225152
5/100, error=0.194102
6/100, error=0.174230
7/100, error=0.156855
8/100, error=0.139415
9/100, error=0.122286
10/100, error=0.107066
11/100, error=0.094817
12/100, error=0.085120
13/100, error=0.076887
14/100, error=0.069449
15/100, error=0.062633
16/100, error=0.056464
17/100, error=0.050976
18/100, error=0.046164
19/100, error=0.041981
20/100, error=0.038358
21/100, error=0.035221
22/100, error=0.032494
23/100, error=0.030113
24/100, error=0.028022
25/100, error=0.026176
26/100, error=0.024535
27/100, error=0.023070
28/100, error=0.021754
29/100, error=0.020567
30/100, error=0.019492
31/100, error=0.018514
32/100, error=0.017621
33/100, error=0.016802
34/100, error=0.016050
35/100, error=0.015357
36/100, error=0.014716
37/100, error=0.014122
38/100, error=0.013570
39/100, error=0.013056
40/100, error=0.012577
41/100, error=0.012129
42/100, error=0.011708
43/100, error=0.011314
44/100, error=0.0109