<a href="https://colab.research.google.com/github/jfogarty/machine-learning-intro-workshop/blob/master/misc/data-explore-iris-data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Math of Neural Networks - from scratch in Python


This is a Colab implementation of [Math of Neural Networks — from scratch in Python](https://medium.com/datadriveninvestor/math-neural-network-from-scratch-in-python-d6da9f29ce65) on medium.com by **Omar Aflak**.

In this notebook we will go through the mathematics of machine learning and code from scratch, in Python, a small library to build neural networks with a variety of layers (**Fully Connected**, **Convolutional**, etc.). 

Eventually, we will be able to create networks in a modular fashion, very similar to the [Keras Sequential](https://keras.io/models/sequential/) API :

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-3-layer.png?raw=1" />
  <figcaption>3-layer neural network</figcaption></center>
</figure>

I’m assuming you already have some knowledge about neural networks. The purpose here is not to explain why we make these models, but to show **how to make a proper implementation**.

## Layer by Layer

We need to keep in mind the big picture here :

1. We feed input data into the neural network.
1. The data flows from layer to layer until we have the output.
1. Once we have the output, we can calculate the error which is a scalar.
1. Finally we can adjust a given parameter (weight or bias) by subtracting the derivative of the error with respect to the parameter itself, scaled by a learning rate.
1. We iterate through that process.

The most important step is the **4th**. We want to be able to have as many layers as we want, and of any type. But if we modify/add/remove one layer from the network, the output of the network is going to change, which is going to change the error, which is going to change the derivative of the error with respect to the parameters. We need to be able to compute the derivatives regardless of the network architecture, regardless of the activation functions, regardless of the loss we use.

In order to achieve that, we must implement **each layer separately**.

## What every layer should implement

Every layer that we might create (fully connected, convolutional, maxpooling, dropout, etc.) has at least 2 things in common : **input** and **output** data.

<figure>&nbsp;&nbsp;&nbsp;
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-xy-layer.png?raw=1" />
  <figcaption></figcaption></center>
</figure>


### Forward propagation

We can already emphasize one important point which is : **the output of one layer is the input of the next one**.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-forward-propagation.png?raw=1" />
  <figcaption>forward propagation</figcaption></center>
</figure>


This is called **forward propagation**. Essentially, we give the input data to the first layer, then the output of every layer becomes the input of the next layer until we reach the end of the network. By comparing the result of the network ($Y$) with the desired output (let’s say $Y^*$), we can calculate an error $E$.

The goal is to minimize that error by changing the parameters in the network. That is backward propagation (backpropagation).
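
The chaining described above can be sketched in a few lines. This is only an illustration: `double`, `add_one` and `forward` are hypothetical names that stand in for real layers, which come later in the notebook.

```python
# Hypothetical stand-ins for layers: each one is just a function of its input.
def double(x):
    return [2 * v for v in x]

def add_one(x):
    return [v + 1 for v in x]

def forward(layers, x):
    """Feed x through the layers in order; each output becomes the next input."""
    output = x
    for layer in layers:
        output = layer(output)
    return output

y = forward([double, add_one], [1, 2, 3])
print(y)  # [3, 5, 7]
```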

### Gradient Descent

This is a quick **reminder**. If you need to learn more about [gradient descent](https://en.wikipedia.org/wiki/Gradient_descent), there are tons of resources on the [internet](https://www.washingtonpost.com/blogs/wonkblog/post/the-internet-is-in-fact-a-series-of-tubes/2011/09/20/gIQALZwfiK_blog.html).

Basically, we want to change some parameter in the network (call it $w$, usually referred to as a [**weight**](https://hackernoon.com/everything-you-need-to-know-about-neural-networks-8988c3ee4491)) so that the total error **$E$ decreases**. There is a clever way to do it (not randomly) which is the following :

<figure>&nbsp;&nbsp;&nbsp;
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-weight-adjust.png?raw=1" />
  <figcaption><br>systematic adjustment of weights</figcaption></center>
</figure>

Where $α$ is a parameter in the range $[0,1]$ that we set and that is called the [**learning rate**](https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10). 

Anyway, the important thing here is $\frac{\partial E}{\partial w}$ (the derivative of $E$ with respect to $w$). We need to be able to find the value of that expression for any parameter of the network regardless of its architecture.
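
As a minimal sketch of that update rule, here is gradient descent on a single parameter. The error function $E(w) = (w - 3)^2$ and the names `error`, `d_error` and `alpha` are assumptions picked for illustration, not part of the network we are building.

```python
def error(w):
    # Toy error with its minimum at w = 3 (an assumption for this example).
    return (w - 3.0) ** 2

def d_error(w):
    # Derivative dE/dw of the toy error above.
    return 2.0 * (w - 3.0)

alpha = 0.1          # learning rate
w = 0.0              # starting value of the parameter
history = [error(w)]
for _ in range(50):
    w = w - alpha * d_error(w)   # w <- w - alpha * dE/dw
    history.append(error(w))

print(w, history[-1])
```

Each step moves $w$ against the slope, so the error shrinks at every iteration and $w$ approaches 3.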

### Backward propagation

Suppose that we give a layer the **derivative of the error with respect to its output** ($\frac{\partial E}{\partial Y}$), then it must be able to provide the **derivative of the error with respect to its input** ($\frac{\partial E}{\partial X}$).

<figure>&nbsp;&nbsp;&nbsp;
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-back-propagation.png?raw=1" />
  <figcaption><br>back propagation</figcaption></center>
</figure>

Remember that $E$ is a scalar (a number) and $X$ and $Y$ are matrices.

<figure>&nbsp;&nbsp;&nbsp;
    <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-back-propagation-matrix.png?raw=1" /></center>
</figure>

Let’s forget about $\frac{\partial E}{\partial X}$ for now. The trick here is that if we have access to $\frac{\partial E}{\partial Y}$ we can very easily calculate $\frac{\partial E}{\partial W}$ (if the layer has any trainable parameters) **without knowing anything about the network architecture!** We simply use the [chain rule](https://www.khanacademy.org/math/ap-calculus-ab/ab-differentiation-2-new/ab-3-1a/a/chain-rule-review) :

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-chain-rule.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

The unknown is $\frac{\partial y_j}{\partial w}$ which totally depends on how the layer is computing its output. So if every layer has access to $\frac{\partial E}{\partial Y}$, where $Y$ is its own output, then we can update our parameters.
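
Written out for a single parameter $w$ (a sketch, assuming the layer output $Y$ has components $y_j$ as in the sentence above), the chain rule reads:

$$\frac{\partial E}{\partial w} = \sum_j \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial w}$$

The first factor is handed to the layer from outside; only the second one depends on what the layer computes.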

## But why ∂E/∂X ?

Don’t forget, the output of one layer is the input of the next layer. Which means $\frac{\partial E}{\partial X}$ for one layer is $\frac{\partial E}{\partial Y}$ for the previous layer.

That’s it; it’s just a clever way to propagate the error!

Again, we can use the chain rule :

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-chain-rule2.png?raw=1" />
  <figcaption></figcaption></center>
</figure>
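
In symbols (a sketch using the same component notation, with $x_i$ the inputs and $y_j$ the outputs of the layer):

$$\frac{\partial E}{\partial x_i} = \sum_j \frac{\partial E}{\partial y_j}\,\frac{\partial y_j}{\partial x_i}$$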

This is very important: it’s the *key* to understanding backpropagation!

After that, we’ll be able to code a Deep Convolutional Neural Network from scratch in no time!


### Diagram to understand backpropagation

This is what I described earlier. Layer 3 is going to update its parameters using $∂E/∂Y$, and is then going to pass $∂E/∂H_2$ to the previous layer, which is its own “$∂E/∂Y$”. Layer 2 is then going to do the same, and so on and so forth.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-chain-rule3.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

This may seem abstract here, but it will become clear when we apply it to a specific type of layer.

Speaking of abstract, now is a good time to write our first python class.


# And finally some Python code

**Usage NOTE!** Use `Shift+Enter` to step through this notebook, executing the code as you go.

In [1]:
#@title Welcome
import datetime
print(f"Welcome to exploring this notebook at {datetime.datetime.now()}! ")

Welcome to exploring this notebook at 2019-08-09 04:41:24.285715! 


In [0]:
class Context:
    VERBOSE=False    # True for extensive logging during execution.
    QUIET=False      # True for minimal logging during execution.
    WARNINGS=False   # True to enable display of annoying but rarely useful messages.

## Abstract Base Class : Layer

The abstract class Layer, which all other layers will inherit from, handles the simple properties they share: an input, an output, and both a forward and a backward method.

In [0]:
# Base class
class Layer:
    def __init__(self):
        self.input = None
        self.output = None

    # computes the output Y of a layer for a given input X
    def forward_propagation(self, input):
        raise NotImplementedError

    # computes dE/dX for a given dE/dY (and update parameters if any)
    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError

As you can see, there is an extra parameter in `backward_propagation` that I didn’t mention: the `learning_rate`. This parameter should be something like an update policy, or an optimizer as they call it in Keras, but for the sake of simplicity we’re simply going to pass a learning rate and update our parameters using gradient descent.
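
To make the contract concrete before the real layers, here is a sketch of a toy subclass. `ScaleLayer` is hypothetical (it is not in the original article): it computes $y = k \cdot x$ with a single trainable scalar $k$, and works on plain Python lists. The base class is repeated so the snippet runs on its own.

```python
class Layer:  # same base class as above, repeated so the snippet is self-contained
    def __init__(self):
        self.input = None
        self.output = None

    def forward_propagation(self, input):
        raise NotImplementedError

    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError


class ScaleLayer(Layer):
    """Hypothetical layer computing y = k * x with one trainable scalar k."""

    def __init__(self, k=1.0):
        super().__init__()
        self.k = k

    def forward_propagation(self, input):
        self.input = input
        self.output = [self.k * x for x in input]
        return self.output

    def backward_propagation(self, output_error, learning_rate):
        # dE/dX, computed with the value of k used in the forward pass.
        input_error = [self.k * e for e in output_error]
        # dE/dk = sum_j dE/dy_j * dy_j/dk = sum_j dE/dy_j * x_j
        k_error = sum(e * x for e, x in zip(output_error, self.input))
        self.k -= learning_rate * k_error
        return input_error


layer = ScaleLayer(k=2.0)
out = layer.forward_propagation([1.0, 3.0])
grad_in = layer.backward_propagation([1.0, 1.0], learning_rate=0.1)
print(out, grad_in, layer.k)
```

The layer receives $∂E/∂Y$, updates its own parameter, and returns $∂E/∂X$ without knowing anything about its neighbours.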

## Fully Connected Layer

Now let's define and implement the first type of layer : the fully connected layer, or FC layer. FC layers are the most basic layers, as every input neuron is connected to every output neuron. These connections are the weights ($W$), which is a matrix of $w$ parameters.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-nn-layer.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

Before we can implement this, we need to know how we will compute the `forward_propagation` and `backward_propagation` functions for the class.

### Forward Propagation

The value of each output neuron can be calculated as the following :

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-fwd-output.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

With matrices, we can compute this formula for every output neuron in one shot using a [dot product](https://en.wikipedia.org/wiki/Dot_product) :

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-dot-product.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-dot-product2.png?raw=1" />
  <figcaption></figcaption></center>
</figure>
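
In the notation the code uses later (an assumption on shapes: $X$ is a $1 \times i$ row vector, $W$ is $i \times j$ and $B$ is $1 \times j$), the per-neuron formula and its matrix form are:

$$y_j = b_j + \sum_i x_i w_{ij}$$

$$Y = XW + B$$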

We’re done with the forward pass. Now let’s do the backward pass of the FC layer.

The $B$ vector ($b$ values) contains [bias values](https://www.geeksforgeeks.org/effect-of-bias-in-neural-network/). Biases are tuned alongside weights by gradient descent. Where biases differ from weights is that they are independent of the output from previous layers. Conceptually, a bias is the weight on the input from a neuron with a fixed activation of 1, and so it is updated by subtracting just the product of the delta value and the learning rate.

Biases are typically initialised to zero, since symmetry breaking is provided by the small random numbers in the weights, although random values can also be used.

*Note that I’m not using any activation function yet; that’s because we will implement it in a separate layer!*


### Backward Propagation

As we said, suppose we have a matrix containing the derivative of the error with respect to that layer’s output ($∂E/∂Y$). We need :

1. The derivative of the error with respect to the parameters ($∂E/∂W$, $∂E/∂B$)
2. The derivative of the error with respect to the input ($∂E/∂X$)

Let's calculate $∂E/∂W$.

This matrix should be the same size as $W$ itself : $i$ x $j$ where $i$ is the number of input neurons and $j$ the number of output neurons. 

We need **one gradient for every weight** :

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-weight-derivatives.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

Using the chain rule stated earlier, we can write :


<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-weight-derivatives2.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

Therefore,

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-weight-derivatives3.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

That’s it, we have the first formula to update the weights!

Now let's calculate $∂E/∂B$.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-update-weights.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

Again $∂E/∂B$ needs to be of the same size as $B$ itself, one gradient per bias. We can use the chain rule again :

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-update-weights2.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

And conclude that,

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-update-weights3.png?raw=1" />
  <figcaption></figcaption></center>
</figure>


Now that we have $∂E/∂W$ and $∂E/∂B$, we are left with $∂E/∂X$ which is **very important** as it will “act” as $∂E/∂Y$ for the layer before that one.

Again, using the chain rule,

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-update-weights4.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

Finally, we can write the whole matrix :

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-update-weights5.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

That’s it! We have the three formulas we needed for the **fully connected** (FC) layer!
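
For reference, here are the three results in one place, written in LaTeX as a sketch that assumes the forward rule $y_j = b_j + \sum_i x_i w_{ij}$ (that is, $Y = XW + B$ with $X$ a row vector):

$$\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_j}\,x_i \quad\Rightarrow\quad \frac{\partial E}{\partial W} = X^T \frac{\partial E}{\partial Y}$$

$$\frac{\partial E}{\partial b_j} = \frac{\partial E}{\partial y_j} \quad\Rightarrow\quad \frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y}$$

$$\frac{\partial E}{\partial x_i} = \sum_j \frac{\partial E}{\partial y_j}\,w_{ij} \quad\Rightarrow\quad \frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y}\,W^T$$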

## Coding the Fully Connected Layer

The python code that implements this math is the heart of the neural network.

In [0]:
import numpy as np

# inherit from base class Layer
class FCLayer(Layer):
    # input_size = number of input neurons
    # output_size = number of output neurons
    def __init__(self, input_size, output_size):
        # Note: random weight matrix initialization -0.5 .. 0.5
        self.weights = np.random.rand(input_size, output_size) - 0.5
        
        # Note: random bias initialization -0.5 .. 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

    # returns output for a given input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = np.dot(self.input, self.weights) + self.bias
        return self.output

    # computes dE/dW, dE/dB for a given output_error=dE/dY. Returns input_error=dE/dX.
    def backward_propagation(self, output_error, learning_rate):
        input_error = np.dot(output_error, self.weights.T)
        weights_error = np.dot(self.input.T, output_error)
        # dBias = output_error

        # update parameters
        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * output_error
        return input_error
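
One way to gain confidence in these formulas is a numerical gradient check. The sketch below does not use `FCLayer` itself; it repeats the same three expressions on small random arrays and compares them with finite differences. The helper `numeric_grad`, the toy `loss` and the array sizes are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1, 3))       # one sample with 3 input neurons
W = rng.normal(size=(3, 2))       # 3 inputs -> 2 outputs
B = rng.normal(size=(1, 2))
target = rng.normal(size=(1, 2))

def loss():
    # Mean squared error of the layer output against a fixed target.
    Y = np.dot(X, W) + B
    return np.mean((target - Y) ** 2)

# Analytic gradients, using the same expressions as backward_propagation.
Y = np.dot(X, W) + B
dE_dY = 2 * (Y - target) / target.size
dE_dW = np.dot(X.T, dE_dY)
dE_dB = dE_dY
dE_dX = np.dot(dE_dY, W.T)

def numeric_grad(f, A, eps=1e-6):
    """Central finite differences of the scalar f() for each entry of A."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        old = A[idx]
        A[idx] = old + eps
        plus = f()
        A[idx] = old - eps
        minus = f()
        A[idx] = old              # restore the entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

num_dW = numeric_grad(loss, W)
num_dB = numeric_grad(loss, B)
num_dX = numeric_grad(loss, X)

print(np.allclose(dE_dW, num_dW, atol=1e-6),
      np.allclose(dE_dB, num_dB, atol=1e-6),
      np.allclose(dE_dX, num_dX, atol=1e-6))
```

If the three comparisons agree, the matrix formulas and the element-wise derivatives describe the same gradients.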

## Activation Layer

All the calculations we have done so far are completely linear. A stack of linear layers collapses into a single linear map, so such a model cannot learn anything that is not linear (XOR, for example). We need to add **non-linearity** to the model by applying non-linear functions to the output of some layers.

Now we need to redo the whole process for this new type of layer!

No worries, it’s going to be way faster as there are no *learnable* parameters. We just need to calculate $∂E/∂X$.

We will call $f$ and $f'$ the [**activation function**](https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6) and its derivative respectively.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-nonlinear.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

### Forward Propagation

As you will see, it is quite straightforward. For a given input $X$ , the output is simply the activation function applied to every element of $X$. Which means **input** and **output** have the same **dimensions**.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-nonlinear2.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

### Backward Propagation

Given $∂E/∂Y$, we want to calculate $∂E/∂X$.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-nonlinear3.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

Be careful, here we are using an **element-wise** multiplication between the two matrices (whereas in the formulas above, it was a dot product).
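
In symbols (a sketch, using $\odot$ for element-wise multiplication): since $y_i = f(x_i)$ depends only on $x_i$, the sum in the chain rule reduces to a single term,

$$\frac{\partial E}{\partial x_i} = \frac{\partial E}{\partial y_i}\,f'(x_i) \quad\Longleftrightarrow\quad \frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y} \odot f'(X)$$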


## Coding the Activation Layer

The code for the activation layer is just as straightforward.

In [0]:
# inherit from base class Layer
class ActivationLayer(Layer):
    def __init__(self, activation, activation_prime):
        self.activation = activation
        self.activation_prime = activation_prime

    # returns the activated input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = self.activation(self.input)
        return self.output

    # Returns input_error=dE/dX for a given output_error=dE/dY.
    # learning_rate is not used because there are no "learnable" parameters.
    def backward_propagation(self, output_error, learning_rate):
        return self.activation_prime(self.input) * output_error

Now let's write some activation functions and their derivatives. These will be used later to create an `ActivationLayer`.

In [0]:
import numpy as np

# activation functions and their derivatives
def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1 - tanh(x)**2

def sigmoid(x):
    s=1/(1+np.exp(-x))
    return s

def sigmoid_prime(x):
    s=sigmoid(x)
    ds=s*(1-s)
    return ds

def relu(x):
    return np.where(x > 0, x, 0)

def relu_prime(x):
    v = relu(x)
    return np.where(v > 0, 1, 0)
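
A quick sanity check for hand-written derivatives is to compare them with finite differences. The sketch below repeats the `tanh` and `sigmoid` derivatives so it runs on its own; `central_diff` and the test points are assumptions for this example. ReLU is left out because it has no derivative at 0.

```python
import numpy as np

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    s = sigmoid(x)
    return s * (1 - s)

def central_diff(f, x, eps=1e-6):
    """Finite-difference estimate of f'(x), applied element-wise."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x = np.array([-2.0, -0.5, 0.0, 0.7, 3.0])
tanh_ok = np.allclose(tanh_prime(x), central_diff(np.tanh, x), atol=1e-8)
sigmoid_ok = np.allclose(sigmoid_prime(x), central_diff(sigmoid, x), atol=1e-8)
print(tanh_ok, sigmoid_ok)
```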

### Activation Functions

The networks below use only one of these functions, **tanh**, although `sigmoid` and `relu` are defined above as well and quite a few others are available. The selection of activation functions is a key part of the network design, but outside the scope of this note.

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/activation-function-cheat-sheet.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/activation-function-derivatives.png?raw=1" />
  <figcaption></figcaption></center>
</figure>

## Loss Function

Until now, for a given layer, we supposed that $∂E/∂Y$ was given (by the next layer). But what happens to the last layer?
How does it get $∂E/∂Y$?

We provide it manually, and it depends on how we define the error.

The error of the network, which measures how good or bad the network did for a given input data, is defined by **you**. There are many ways to define the error, and one of the most known is called [**MSE — Mean Squared Error**](https://en.wikipedia.org/wiki/Mean_squared_error).

Here $y^*$ and $y$ denote the desired output and the actual output respectively. You can think of the loss as a last layer which takes all the output neurons and squashes them into one single neuron. What we need now, as for every other layer, is to define $∂E/∂Y$. Except now, we finally reached $E$ !

<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/nn-scratch-mse.png?raw=1" />
  <figcaption></figcaption></center>
</figure>
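
Written in LaTeX (a sketch, with $n$ the number of output neurons), the mean squared error and its derivative with respect to each output are:

$$E = \frac{1}{n}\sum_{i=1}^{n} (y_i^* - y_i)^2$$

$$\frac{\partial E}{\partial y_i} = \frac{2}{n}\,(y_i - y_i^*)$$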

These are simply two python functions that you can put in a separate file. They will be used when creating the network.

In [0]:
import numpy as np

# loss function and its derivative
def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size
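
As a small worked example (the two arrays are made up for illustration, and the functions are repeated so the snippet runs on its own):

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size

y_true = np.array([[0.0, 1.0]])   # desired output
y_pred = np.array([[0.5, 0.5]])   # what the network produced

loss_value = mse(y_true, y_pred)        # (0.25 + 0.25) / 2 = 0.25
loss_grad = mse_prime(y_true, y_pred)   # 2 * [0.5, -0.5] / 2 = [0.5, -0.5]
print(loss_value, loss_grad)
```

The sign of each gradient entry says which way that output has to move: the first output is too high, the second too low.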

## Network Class

We're almost done!

We are going to make a Network class to create neural networks very easily akin to the first picture.

I commented almost every part of the code, so it shouldn’t be too complicated to understand if you grasped the previous steps.

In [0]:
class Network:
    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_prime = None

    # add layer to network
    def add(self, layer):
        self.layers.append(layer)

    # set loss to use
    def use(self, loss, loss_prime):
        self.loss = loss
        self.loss_prime = loss_prime

    # predict output for given input
    def predict(self, input_data):
        # sample dimension first
        samples = len(input_data)
        result = []

        # run network over all samples
        for i in range(samples):
            # forward propagation
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward_propagation(output)
            result.append(output)

        return result

    # train the network
    def fit(self, x_train, y_train, epochs, learning_rate):
        # sample dimension first
        samples = len(x_train)

        # training loop
        for i in range(epochs):
            err = 0
            for j in range(samples):
                # forward propagation
                output = x_train[j]
                for layer in self.layers:
                    output = layer.forward_propagation(output)

                # compute loss (for display purpose only)
                err += self.loss(y_train[j], output)

                # backward propagation
                error = self.loss_prime(y_train[j], output)
                for layer in reversed(self.layers):
                    error = layer.backward_propagation(error, learning_rate)

            # calculate average error on all samples
            err /= samples
            print('epoch %d/%d   error=%f' % (i+1, epochs, err))

## Building Neural Networks

Finally! 

We can use our class to create a neural network with as many layers as we want ! We are going to build two neural networks : a simple **XOR** and a **MNIST** solver.

### Solve XOR

Starting with XOR is always important as it’s a simple way to tell if the network is learning anything at all.

Not much needs emphasizing here. Just be careful with the shape of the training data : the **sample dimension must come first**. For example here, the input shape is **(4,1,2)**: there are 4 training samples, each sample is a single row vector, and that vector has two elements.
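
To see why that layout matters, the sketch below follows one sample through the shapes of a first layer with 2 inputs and 8 outputs. The zero-filled `weights` array is only a placeholder with the right shape, not a trained matrix.

```python
import numpy as np

x_train = np.array([[[0, 0]], [[0, 1]], [[1, 0]], [[1, 1]]])

sample = x_train[1]              # indexing the first axis picks one sample
weights = np.zeros((2, 8))       # placeholder with the shape of a 2 -> 8 layer
hidden = np.dot(sample, weights)

print(x_train.shape, sample.shape, hidden.shape)  # (4, 1, 2) (1, 2) (1, 8)
```

Each sample is a `(1, 2)` row vector, which is exactly what the dot product with a `(2, 8)` weight matrix expects.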

In [9]:
import numpy as np

# training data
x_train = np.array([[[0,0]], [[0,1]], [[1,0]], [[1,1]]])
y_train = np.array([[[0]], [[1]], [[1]], [[0]]])

# network
net = Network()
net.add(FCLayer(2, 8))
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(8, 1))
net.add(ActivationLayer(tanh, tanh_prime))

print(f"- The training set inputs shape: {x_train.shape}")
print(f"- The training set output shape: {y_train.shape}")


- The training set inputs shape: (4, 1, 2)
- The training set output shape: (4, 1, 1)


In [10]:
# train the xor function.
net.use(mse, mse_prime)
net.fit(x_train, y_train, epochs=1000, learning_rate=0.1)

epoch 1/1000   error=0.526089
epoch 2/1000   error=0.361008
epoch 3/1000   error=0.333829
epoch 4/1000   error=0.325951
epoch 5/1000   error=0.321671
epoch 6/1000   error=0.318867
epoch 7/1000   error=0.316747
epoch 8/1000   error=0.314915
epoch 9/1000   error=0.313166
epoch 10/1000   error=0.311401
epoch 11/1000   error=0.309576
epoch 12/1000   error=0.307675
epoch 13/1000   error=0.305697
epoch 14/1000   error=0.303646
epoch 15/1000   error=0.301529
epoch 16/1000   error=0.299354
epoch 17/1000   error=0.297127
epoch 18/1000   error=0.294855
epoch 19/1000   error=0.292543
epoch 20/1000   error=0.290197
epoch 21/1000   error=0.287823
epoch 22/1000   error=0.285426
epoch 23/1000   error=0.283011
epoch 24/1000   error=0.280585
epoch 25/1000   error=0.278152
epoch 26/1000   error=0.275720
epoch 27/1000   error=0.273294
epoch 28/1000   error=0.270881
epoch 29/1000   error=0.268487
epoch 30/1000   error=0.266119
epoch 31/1000   error=0.263784
epoch 32/1000   error=0.261487
epoch 33/1000   e

In [11]:
# test the Xor function.
out = net.predict(x_train)
print(out)

[array([[0.00623708]]), array([[0.97785455]]), array([[0.98017154]]), array([[0.02027679]])]


This should produce a result close to the training set output such as:

```
  [array([[0.00069917]]), array([[0.97479752]]), array([[0.97443034]]), array([[-0.0002348]])]
```
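
The outputs are real numbers rather than exact 0s and 1s. A simple way to read them as XOR results (an assumption on my part, using a 0.5 threshold) is shown here with the four values printed by the run above:

```python
# Values copied from the predict() output shown above.
raw = [0.00623708, 0.97785455, 0.98017154, 0.02027679]

# Threshold at 0.5 to turn each output into a bit.
bits = [1 if v >= 0.5 else 0 for v in raw]
print(bits)  # [0, 1, 1, 0]
```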

If this works (which it really, really, should), great!
We can now solve something more interesting: let’s solve [MNIST](https://en.wikipedia.org/wiki/MNIST_database) ([LeCun](http://yann.lecun.com/exdb/mnist/)).

## Solve MNIST

We didn’t implement the Convolutional Layer, but this is not a problem. All we need to do is reshape our data so that it can fit into a Fully Connected Layer.

*The MNIST Dataset consists of images of digits from 0 to 9, of shape 28 x 28 x 1.*


<figure><br>
  <center><img src="https://github.com/jfogarty/machine-learning-intro-workshop/blob/master/images/MNIST.png?raw=1" />
  <figcaption>A sample of MNIST digits</figcaption></center>
</figure>


In [12]:
import numpy as np

from keras.datasets import mnist
# to_categorical is imported directly; the older np_utils module may be missing in newer Keras releases
from keras.utils import to_categorical

# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape and normalize input data
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')
x_train /= 255
# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')
x_test /= 255
y_test = to_categorical(y_test)

print(f"- The training set inputs shape: {x_train.shape}")
print(f"- The training set output shape: {y_train.shape}")
print("")
print(f"- The test set inputs shape: {x_test.shape}")
print(f"- The test set output shape: {y_test.shape}")

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/img-datasets/mnist.npz
- The training set inputs shape: (60000, 1, 784)
- The training set output shape: (60000, 10)

- The test set inputs shape: (10000, 1, 784)
- The test set output shape: (10000, 10)
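
The label encoding used above can be sketched by hand. `one_hot` is a hypothetical stand-in written for this example, not the Keras function; it only aims to show the idea of turning a digit into a vector of size 10.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return one row per label, with a 1 at the label's index and 0 elsewhere."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

print(one_hot([3]))
```

With one output neuron per digit, the network's last layer can then be compared against these vectors directly.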


In [0]:
# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))                   # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(tanh, tanh_prime))

This implementation updates the parameters after every single sample (stochastic gradient descent with a batch size of 1) in a pure Python loop, which is pretty slow. Read [this](https://machinelearningmastery.com/gentle-introduction-mini-batch-gradient-descent-configure-batch-size/) for info on [implementing mini batch gradient descent](https://www.geeksforgeeks.org/ml-mini-batch-gradient-descent-with-python/) which would be considerably faster.
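
The batching part of the mini-batch idea can be sketched as follows. `minibatches` is a hypothetical helper, and this is only the index bookkeeping: the layers in this notebook would also need to handle several samples per update, which is not shown.

```python
def minibatches(n_samples, batch_size):
    """Yield (start, stop) index pairs that cover range(n_samples) in order."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

batches = list(minibatches(10, 4))
print(batches)  # [(0, 4), (4, 8), (8, 10)]
```

Each pair can be used as a slice, for example `x_train[start:stop]`, with the last batch allowed to be smaller than the others.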

In [14]:
TRAINING_SAMPLES=1000
TRAINING_EPOCHS=35
LEARNING_RATE=0.1
# train on 1000 samples
# as we didn't implement mini-batch GD, training will be pretty slow if we update at each iteration on 60000 samples...

# Define the error function to use, along with its first derivative.
net.use(mse, mse_prime)

# Do the actual training.
net.fit(x_train[0:TRAINING_SAMPLES],
        y_train[0:TRAINING_SAMPLES],
        epochs=TRAINING_EPOCHS, 
        learning_rate=LEARNING_RATE)

epoch 1/35   error=0.230528
epoch 2/35   error=0.102204
epoch 3/35   error=0.080772
epoch 4/35   error=0.069671
epoch 5/35   error=0.061284
epoch 6/35   error=0.054669
epoch 7/35   error=0.049162
epoch 8/35   error=0.044423
epoch 9/35   error=0.040429
epoch 10/35   error=0.037183
epoch 11/35   error=0.034027
epoch 12/35   error=0.030873
epoch 13/35   error=0.027818
epoch 14/35   error=0.025197
epoch 15/35   error=0.022951
epoch 16/35   error=0.021148
epoch 17/35   error=0.019669
epoch 18/35   error=0.018392
epoch 19/35   error=0.017225
epoch 20/35   error=0.016249
epoch 21/35   error=0.015300
epoch 22/35   error=0.014442
epoch 23/35   error=0.013811
epoch 24/35   error=0.013465
epoch 25/35   error=0.012524
epoch 26/35   error=0.012085
epoch 27/35   error=0.011899
epoch 28/35   error=0.010598
epoch 29/35   error=0.010378
epoch 30/35   error=0.010056
epoch 31/35   error=0.009845
epoch 32/35   error=0.008965
epoch 33/35   error=0.008604
epoch 34/35   error=0.008474
epoch 35/35   error=0.0

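Note that the reported `error` is the mean squared error averaged over the training samples, not a classification error rate. In the run above it starts at about 0.23 in epoch 1 and is below 0.01 by epoch 31. The exact numbers vary from run to run because the weights are initialised randomly.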


In [15]:
N=4
# test on 4 samples
out = net.predict(x_test[0:N])

np.set_printoptions(formatter={'float': lambda x: "{0:6.3f}".format(x)})

print("--- predicted values : ")
for v in out: print(v)

print("--- true values : ")
for v in y_test[0:N]: print(f"[{v}]")


--- predicted values : 
[[ 0.009 -0.030 -0.139 -0.016  0.047 -0.091 -0.011  0.962  0.024  0.056]]
[[-0.013  0.035  0.549  0.691 -0.035 -0.099 -0.067 -0.071 -0.098  0.042]]
[[ 0.000  0.973 -0.115  0.020 -0.007 -0.058 -0.025  0.004 -0.010 -0.006]]
[[ 0.972  0.025  0.023 -0.056 -0.010 -0.119 -0.023 -0.022  0.019 -0.004]]
--- true values : 
[[ 0.000  0.000  0.000  0.000  0.000  0.000  0.000  1.000  0.000  0.000]]
[[ 0.000  0.000  1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]]
[[ 0.000  1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]]
[[ 1.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000  0.000]]



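This is working well: three of the four test digits come out right, and the second sample is a near miss between 2 and 3 (0.549 against 0.691). Not bad for a network trained on only 1000 samples!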

**Amazing `:)`**
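
To read these vectors as digits, one option is to take the index of the largest entry in each row. The sketch below does that for the four rows printed above; `argmax` and `accuracy` are names introduced for this example.

```python
# Rows copied from the predicted and true values printed above.
predicted = [
    [0.009, -0.030, -0.139, -0.016, 0.047, -0.091, -0.011, 0.962, 0.024, 0.056],
    [-0.013, 0.035, 0.549, 0.691, -0.035, -0.099, -0.067, -0.071, -0.098, 0.042],
    [0.000, 0.973, -0.115, 0.020, -0.007, -0.058, -0.025, 0.004, -0.010, -0.006],
    [0.972, 0.025, 0.023, -0.056, -0.010, -0.119, -0.023, -0.022, 0.019, -0.004],
]
true_digits = [7, 2, 1, 0]

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

predicted_digits = [argmax(row) for row in predicted]
accuracy = sum(p == t for p, t in zip(predicted_digits, true_digits)) / len(true_digits)
print(predicted_digits, accuracy)  # [7, 3, 1, 0] 0.75
```

The same two lines, applied to `net.predict(x_test)` and `y_test`, would give an accuracy over the whole test set.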

### End of notebook.

