# Implementing a Multilayer Artificial Neural Network from Scratch

## Introducing the multilayer neural network architecture

We are going to start our exploration of neural networks with a **M**ulti**L**ayer **P**erceptron (MLP). We'll use it to learn how to connect multiple single neurons. The following picture illustrates an MLP consisting of **three** layers:

![MLP illustration](./etc/mlp.jpg)

The above MLP has one:

- Input Layer (1st layer)
- Hidden Layer (2nd layer)
- Output Layer (3rd layer)

Notice that our MLP is ***fully connected***: each unit in the input layer is **connected to all** units in the hidden layer, and each unit in the hidden layer is **connected to all** units in the output layer. If the network had _more than one_ hidden layer, we would call it a **deep artificial Neural Network (NN)**. The field of **Deep Learning** is concerned with developing algorithms that help us train such structures.

Now, let's define how we will refer to elements in our neural network:

- We denote the $i^{th}$ activation unit in the $l^{th}$ layer as: $a_i^{(l)}$
- We denote the connection from the $k^{th}$ unit in layer $l$ to the $j^{th}$ unit in layer $l + 1$ as: $w_{k,j}^{(l)}$

The book denotes the weight matrix that connects the input layer to the hidden layer as **$W^{(h)}$**, and the weight matrix that connects the hidden layer to the output layer as **$W^{(out)}$**.

---

Wait, what **Weight matrix?!**

Okay, let me walk you through how to derive $W^{(h)}$, and you can try to do the same for $W^{(out)}$.

Let's say we have a data set of only **ONE** training example, and we want to "forward" propagate that one input from the input layer (in) to the hidden layer (h). To do that, we have to compute $a_1^{(h)}$, $a_2^{(h)}$, $\dots$, $a_d^{(h)}$. For instance,

$a_1^{(h)} = \phi(z_1^{(h)})$

$z_1^{(h)} = a_0^{(in)} w_{0,1}^{(in)} + a_1^{(in)} w_{1,1}^{(in)} + \dots + a_m^{(in)} w_{m,1}^{(in)}$

Now consider the following matrix for $z^{(h)}$

![Weight Matrix demonstration](./etc/demo.jpg)

Notice that the weight matrix $W^{(h)}$ is an $m \times d$ matrix, where $d$ is the number of units in the hidden layer and $m$ is the number of units in the input layer, including the bias unit.

As you can see, our training example is multiplied by that weight matrix to compute its net input vector $Z^{(h)}$.

$Z^{(h)} = a^{(in)}W^{(h)}$

$a^{(h)} = \phi(Z^{(h)})$
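Here is a minimal NumPy sketch of that single-example computation. The sizes `m = 4` and `d = 3`, the random values, and the choice of the logistic sigmoid for $\phi$ are all assumptions made for illustration; these notes do not fix any of them at this point.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as a stand-in for phi (an assumption)."""
    return 1.0 / (1.0 + np.exp(-z))

m, d = 4, 3  # illustrative sizes: m input units (incl. bias), d hidden units

rng = np.random.default_rng(0)
a_in = rng.normal(size=(1, m))  # one training example as a 1 x m row
a_in[0, 0] = 1.0                # treat the first input as the bias unit
W_h = rng.normal(size=(m, d))   # m x d weight matrix, W_h[k, j] plays the role of w_{k,j}

z_h = a_in @ W_h                # net input, shape 1 x d
a_h = sigmoid(z_h)              # activation, shape 1 x d

print(z_h.shape, a_h.shape)
```

Each entry `z_h[0, j]` is the weighted sum of all inputs for hidden unit `j`, which is exactly the sum written out for $z_1^{(h)}$ above.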

Here, $a^{(in)}$ is our training example (a $1 \times m$ matrix). Since $W^{(h)}$ is an $m \times d$ matrix, the resulting net input vector $Z^{(h)}$ is a $1 \times d$ row matrix. That net input vector is then passed to the activation function to compute $a^{(h)}$, which is also a $1 \times d$ matrix. Now, we can generalize this computation to all $n$ examples in the training dataset:

$Z^{(h)} = A^{(in)}W^{(h)}$. Here, $A^{(in)}$ is an $n \times m$ matrix.

**Each** training example (row) in the matrix is multiplied by the weight matrix. This happens through the matrix-matrix multiplication of $A^{(in)}$ and $W^{(h)}$, because each row of $A^{(in)}$ gets multiplied by $W^{(h)}$.

This matrix-matrix multiplication results in an $n \times d$ net input matrix $Z^{(h)}$, where the $i^{th}$ row of $Z^{(h)}$ is the product of the $i^{th}$ training example (a $1 \times m$ row vector) and the weight matrix.

Finally, we apply the function $\phi(\bullet)$ to each value in the net input matrix to get the $n \times d$ **activation** matrix. 

$A^{(h)} = \phi(Z^{(h)})$

The $i^{th}$ row of the activation matrix contains the values that the hidden-layer activation units hold after the $i^{th}$ training example has been forward propagated :)
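To see the batch version in action, here is a small NumPy sketch. The sizes (`n = 5`, `m = 4`, `d = 3`), the random data, and the sigmoid used for $\phi$ are assumptions for illustration only. It also checks that row `i` of the activation matrix matches what you get by propagating example `i` on its own.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as a stand-in for phi (an assumption)."""
    return 1.0 / (1.0 + np.exp(-z))

n, m, d = 5, 4, 3  # illustrative sizes: n examples, m inputs, d hidden units

rng = np.random.default_rng(1)
A_in = rng.normal(size=(n, m))  # one training example per row
W_h = rng.normal(size=(m, d))   # m x d weight matrix

Z_h = A_in @ W_h                # net input matrix, shape n x d
A_h = sigmoid(Z_h)              # activation matrix, shape n x d

# Propagate a single example by itself and compare with its row in A_h.
i = 2
single = sigmoid(A_in[i:i + 1] @ W_h)  # shape 1 x d
rows_match = np.allclose(A_h[i], single[0])

print(Z_h.shape, A_h.shape, rows_match)
```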

Similarly, we can write the activation of the **output layer** in vectorized form for multiple examples:

$Z^{(out)} = A^{(h)}W^{(out)}$ and $A^{(out)} = \phi(Z^{(out)})$, where $A^{(out)}$ is an $n \times t$ matrix and $t$ is the number of units in the output layer.

**Challenge**: Can you see why? Can you derive $W^{(out)}$?
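Once you have worked out your own answer, one way to check it is to let NumPy confirm the shapes. This is only a sketch under assumptions: the sizes, the random weights, and the sigmoid for $\phi$ are made up, and the hidden layer's bias unit is left out to mirror the formula above exactly.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid, used here as a stand-in for phi (an assumption)."""
    return 1.0 / (1.0 + np.exp(-z))

n, m, d, t = 5, 4, 3, 2  # illustrative sizes: examples, inputs, hidden, outputs

rng = np.random.default_rng(2)
A_in = rng.normal(size=(n, m))
W_h = rng.normal(size=(m, d))    # input -> hidden
W_out = rng.normal(size=(d, t))  # hidden -> output

A_h = sigmoid(A_in @ W_h)        # n x d
Z_out = A_h @ W_out              # n x t
A_out = sigmoid(Z_out)           # n x t

print(A_h.shape, Z_out.shape, A_out.shape)
```

If the inner dimensions did not line up, the `@` operator would raise a `ValueError`, so a clean run is a quick sanity check on the shapes you derived.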

---

Ok, cool. But why did we use those matrices?

For code efficiency and readability. We used our basic linear algebra skills to write the computations in a more compact way, so we avoid nested Python `for` loops, which are computationally expensive. It also helps us delegate those computations to a GPU.
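To make the comparison concrete, here is a sketch that computes the same net input matrix twice: once with explicit loops and once with a single matrix product. The function names `net_input_loops` and `net_input_vectorized` are hypothetical helpers written for this note, and the sizes are arbitrary.

```python
import numpy as np

def net_input_loops(A, W):
    """Compute Z = A W one multiply-add at a time (slow, for comparison)."""
    n, m = A.shape
    d = W.shape[1]
    Z = np.zeros((n, d))
    for i in range(n):          # each training example
        for j in range(d):      # each unit in the next layer
            for k in range(m):  # each incoming connection
                Z[i, j] += A[i, k] * W[k, j]
    return Z

def net_input_vectorized(A, W):
    """Compute the same Z with one matrix-matrix product."""
    return A @ W

rng = np.random.default_rng(3)
A_in = rng.normal(size=(6, 4))
W_h = rng.normal(size=(4, 3))

same = np.allclose(net_input_loops(A_in, W_h), net_input_vectorized(A_in, W_h))
print(same)
```

Both versions give the same numbers; the vectorized one is a single line and hands the inner loops to optimized library code.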

## Classifying handwritten digits

Let's implement and train our first multilayer NN to classify handwritten digits.

### Obtaining and preparing the MNIST dataset

The dataset is freely available on Yann LeCun's website. It consists of the following four parts:

1. Training dataset images
2. Training dataset labels
3. Test dataset images
4. Test dataset labels
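As a rough sketch of what reading those four files could look like: MNIST is distributed in the IDX binary format, where a label file starts with two big-endian 32-bit integers (magic number 2049, item count) and an image file starts with four (magic number 2051, image count, rows, columns), followed by unsigned bytes. The helper names `parse_idx_labels` and `parse_idx_images` are hypothetical, and the code assumes you already have the decompressed file contents as `bytes`. It is exercised here on tiny synthetic buffers, not on the real files.

```python
import struct
import numpy as np

def parse_idx_labels(buf):
    """Parse the bytes of an IDX label file into a 1-D uint8 array."""
    magic, n = struct.unpack(">II", buf[:8])
    if magic != 2049:
        raise ValueError(f"unexpected magic number for labels: {magic}")
    return np.frombuffer(buf, dtype=np.uint8, count=n, offset=8)

def parse_idx_images(buf):
    """Parse the bytes of an IDX image file into an n x (rows*cols) array."""
    magic, n, rows, cols = struct.unpack(">IIII", buf[:16])
    if magic != 2051:
        raise ValueError(f"unexpected magic number for images: {magic}")
    pixels = np.frombuffer(buf, dtype=np.uint8, count=n * rows * cols, offset=16)
    return pixels.reshape(n, rows * cols)  # one flattened image per row

# Synthetic stand-ins for real files: 2 labels and 2 images of size 2 x 2.
fake_labels = struct.pack(">II", 2049, 2) + bytes([7, 3])
fake_images = struct.pack(">IIII", 2051, 2, 2, 2) + bytes(range(8))

labels = parse_idx_labels(fake_labels)
images = parse_idx_images(fake_images)
print(labels, images.shape)
```

Flattening each image into a row gives the $n \times m$ layout used for $A^{(in)}$ earlier.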