# Overview

References:
* [Building a neural network from scratch](https://www.youtube.com/watch?v=w8yWXqWQYmU)
* [Understanding the math behind neural networks by building one from scratch (no TF/Keras, just numpy)](https://www.samsonzhang.com/2020/11/24/understanding-the-math-behind-neural-networks-by-building-one-from-scratch-no-tf-keras-just-numpy)
* [MNIST database](https://en.wikipedia.org/wiki/MNIST_database)


I'm going to build a neural network from the ground up using Python and basic libraries, without any ML libraries. I'm going to create a neural network to do handwriting recognition against the [MNIST database](https://en.wikipedia.org/wiki/MNIST_database).

Based on the [MNIST docs](http://yann.lecun.com/exdb/mnist/):
```
Each image in the database is 28x28 pixels with a grayscale value for each pixel, ranging between 0 and 255. Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black). 
```

We're going to have $m$ training images.

Call the matrix of examples $E$. Each row of $E$ is one image, flattened into 784 columns ($28 \times 28 = 784$), so $E$ has size $m \times 784$.

$
\begin{array}{@{}c@{}}
\begin{bmatrix}
    0 & 0 & 1 & ... & 0 & 0 & 0 \\
    0 & 0 & 0 & ... & 1 & 0 & 0 \\
    &&& ... &&& \\
    0 & 0 & 0 & ... & 0 & 0 & 0 \\
\end{bmatrix}
\qquad
\begin{array}{@{}l@{}}
    <- X^{(1)} \\
    <- X^{(2)} \\
    \\
    <- X^{(m)} \\
\end{array}
\end{array}
$

We will be transposing this to be $X = E^T$, where there are a fixed 784 rows, and a new column for each example. So, this will be a matrix of size $784 \times m$:

$
\begin{bmatrix}
    | & | & | & | \\
    | & | & | & | \\
    | & | & | & | \\
    X^{(1)} & X^{(2)} & ... & X^{(m)} \\
    | & | & | & | \\
    | & | & | & | \\
    | & | & | & | \\
\end{bmatrix}
$
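As a minimal sketch of this layout, the snippet below builds a stand-in for $E$ and transposes it. It assumes NumPy as the "basic library"; the random pixel values and `m = 5` are made up for illustration and are not real MNIST data.

```python
import numpy as np

rng = np.random.default_rng(0)

m = 5  # hypothetical number of training examples
# Stand-in for real MNIST data: m rows, one flattened 28x28 image per row.
E = rng.integers(0, 256, size=(m, 28 * 28))

# Transpose so that each example becomes a column.
X = E.T

print(E.shape)  # (5, 784)
print(X.shape)  # (784, 5)
```

Column `X[:, k]` holds the same 784 pixel values as row `E[k]`.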

We're going to create a neural network with an input layer (784 nodes), one 10-node hidden layer, and a 10-node output layer representing the digits 0 to 9.


# Forward Propagation

We'll call our input $A^{[0]}$, which is just the transposed matrix of example data, $X$. Its dimensions are therefore $784 \times m$.

$
A^{[0]} = X
$

$Z^{[1]}$ is the *unactivated* first layer. We're going to apply a weight and a bias to get it.

$
Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}
$

The dimensions of the matrices involved in calculating $Z^{[1]}$ are as follows:

$
\begin{array}{@{}ll}
    Z^{[1]} \text{ is } 10 \times m \\
    W^{[1]} \text{ is } 10 \times 784 \\
    A^{[0]} \text{ is } 784 \times m \\
    b^{[1]} \text{ is } 10 \times 1 \\
\end{array}
$
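The shapes above can be checked with a short NumPy sketch. The random inputs and the `rng.random(...) - 0.5` initialization are assumptions chosen only to have something to multiply; the point is how the dimensions line up.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5  # hypothetical number of examples

A0 = rng.random((784, m))          # stand-in for X, values in [0, 1)
W1 = rng.random((10, 784)) - 0.5   # assumed initialization for the weights
b1 = rng.random((10, 1)) - 0.5     # one bias per hidden node

# (10, 784) @ (784, m) -> (10, m). b1 has shape (10, 1), so NumPy
# broadcasts it across all m columns: every example gets the same biases.
Z1 = W1 @ A0 + b1

print(Z1.shape)  # (10, 5)
```

Keeping $b^{[1]}$ as a $10 \times 1$ column, rather than a flat vector of length 10, is what makes the broadcast line up with the columns of $Z^{[1]}$.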

Now we need to apply an activation function. We're going to use ReLU (Rectified Linear Unit). Without an activation function, each node in the hidden layer would just be a linear combination of the nodes from the previous layer, and the output layer would in turn be a linear combination of those, so the whole network would collapse into a single linear model. A non-linear activation function breaks that chain, which allows more complexity in the solution space. ReLU is piecewise linear: it passes positive values through unchanged and clamps everything else to zero. That makes it non-linear overall while being cheaper to compute and differentiate than a sigmoid.

$\text{ReLU}( x ) = x \text{ if } x > 0\text{; } 0 \text{ if } x \le 0$

We'll call the ReLU-activated output of our hidden layer $A^{[1]}$:

$A^{[1]} \text{ is } 10 \times m$

$A^{[1]} = g( Z^{[1]} ) = ReLU( Z^{[1]} )$
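In NumPy, ReLU can be sketched as a single element-wise maximum. The function name `relu` and the small example matrix are illustrative choices, not part of the original notes.

```python
import numpy as np

def relu(Z):
    """Element-wise ReLU: keeps positive entries, replaces the rest with 0."""
    return np.maximum(Z, 0)

# Made-up unactivated values: 2 nodes, 3 examples.
Z1 = np.array([[-2.0, 0.0, 3.5],
               [ 1.0, -0.5, 0.0]])
A1 = relu(Z1)
print(A1)
```

The output has the same shape as the input, which is why $A^{[1]}$ and $Z^{[1]}$ are both $10 \times m$.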

Our unactivated output layer has the following dimensions:

$
\begin{array}{@{}ll}
    Z^{[2]} \text{ is } 10 \times m \\
    W^{[2]} \text{ is } 10 \times 10 \\
    A^{[1]} \text{ is } 10 \times m \\
    b^{[2]} \text{ is } 10 \times 1 \\
\end{array}
$

The unactivated output for the output layer is:

$Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$


# Softmax explained

To get the activated output of the output layer we apply a ***softmax*** function. Softmax turns a vector of raw scores into probabilities: it exponentiates each node in the output layer and divides it by the sum of the exponentials of all of the nodes in the output layer, so the results are positive and sum to 1.

From [Wikipedia](https://en.wikipedia.org/wiki/Softmax_function) we know the softmax function works like this:

$
\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}
$

> ℹ️ Note
>
> Don't confuse the variables above ($z$, $i$, $j$, $K$) with anything else we're referencing in this problem. They come from the generic definition of the function on the Wikipedia page for softmax, where $i$ and $j$ are both indices over the $K$ classes. When we compute the derivative of softmax during backpropagation, we'll need to consider two cases, $i = j$ and $i \neq j$. Separately, in the activated output layer matrix $A^{[2]}$, an entry $A^{[2]}_{i,j}$ uses $i$ as the class index and $j$ as the example index, because that matrix holds one probability for each class and each example.


The softmax of an input vector $z$ is a normalized vector of the same length: every entry lies between 0 and 1, the entries sum to 1, and larger inputs get larger probabilities.

So for example, if we have the following input vector:
$$z = \begin{bmatrix} 
    2.3 \\
    -1.7 \\
    5.8 \\
    0.9 \\
    -3.2 \\
    7.1 \\
    4.6 \\
    3.4 \\
    -4.8 \\
    6.5 \\
\end{bmatrix}$$

Then

$$\text{softmax}(z) = \begin{bmatrix} 
    0.004 \\
    0.000 \\
    0.141 \\
    0.001 \\
    0.000 \\
    0.516 \\
    0.042 \\
    0.013 \\
    0.000 \\
    0.283 \\
\end{bmatrix}$$
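A minimal sketch of the softmax formula applied to this input vector, assuming NumPy. Subtracting the maximum before exponentiating is a common numerical-stability trick added here; it is not part of the formula itself and does not change the result.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector."""
    # Subtracting the max cancels out in the ratio, but it keeps
    # np.exp from overflowing when the inputs are large.
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

z = np.array([2.3, -1.7, 5.8, 0.9, -3.2, 7.1, 4.6, 3.4, -4.8, 6.5])
p = softmax(z)

print(np.round(p, 3))  # probabilities rounded to three decimals
print(p.sum())         # sums to 1 up to floating-point rounding
print(p.argmax())      # 5, the index of the largest input (7.1)
```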



# Output

Now that we have our unactivated output layer matrix $Z^{[2]}$, and we know how softmax works, we can use softmax to get our activated output layer probability distribution. 

Remember, $Z^{[2]} \text{ is } 10 \times m$ so the dimensions of $A^{[2]} \text{ are also } 10 \times m$.

$A^{[2]} = softmax( Z^{[2]} )$

So for example, if $Z^{[2]}$ had the following for one of its examples:
$$Z^{[2]} = \begin{bmatrix} 
    2.3 & ... \\
    -1.7 & ... \\
    5.8 & ... \\
    0.9 & ... \\
    -3.2 & ... \\
    7.1 & ... \\
    4.6 & ... \\
    3.4 & ... \\
    -4.8 & ... \\
    6.5 & ... \\
\end{bmatrix}$$

Then applying that as the input of the softmax function would result in:
$$A^{[2]} = \text{softmax}(Z^{[2]}) = \begin{bmatrix} 
    0.004 & ... \\
    0.000 & ... \\
    0.141 & ... \\
    0.001 & ... \\
    0.000 & ... \\
    0.516 & ... \\
    0.042 & ... \\
    0.013 & ... \\
    0.000 & ... \\
    0.283 & ... \\
\end{bmatrix}$$

Each $A^{[2]}_{i,j}$ value represents the probability that the $j$-th example is of class $i$.
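Because $Z^{[2]}$ holds one column per example, softmax has to be applied to each column separately. A minimal NumPy sketch follows; the name `softmax_columns` and the random stand-in for $Z^{[2]}$ are assumptions for illustration.

```python
import numpy as np

def softmax_columns(Z):
    """Apply softmax independently to each column of a (classes, m) matrix."""
    # Shift each column by its own max for numerical stability.
    exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
    # axis=0 sums down the rows, giving one normalizer per example.
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
Z2 = rng.normal(size=(10, 4))   # stand-in for a real Z^[2] with m = 4
A2 = softmax_columns(Z2)

print(A2.shape)        # (10, 4)
print(A2.sum(axis=0))  # each column sums to 1 (up to rounding)
```

Summing over the wrong axis would normalize across examples instead of across classes, so the `axis=0` choice is the part worth double-checking.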

If we want a "best guess" answer, we can select the class with the highest probability and return that as the prediction. To measure how good a prediction is during training, we compare it with a ***one-hot encoding*** of the correct label. 

Let's say we have a prediction vector for a single example, $\hat{y}$, as follows:

$$\hat{y} = \begin{bmatrix} 
    0.004 \\
    0.000 \\
    0.141 \\
    0.001 \\
    0.000 \\
    0.516 \\
    0.042 \\
    0.013 \\
    0.000 \\
    0.283 \\
\end{bmatrix}$$

Then we can define $y$ as the one-hot encoding of the correct label for the training example. If the label for this example is 5 (which is also the class our prediction above gives the highest probability), then the one-hot encoded vector $y$ looks like this:

$$y = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$

The $1$ sits at index 5, counting from 0. Since the classifier's highest-probability class is also 5, its best guess for this image matches the label. 
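Both ideas, picking the most probable class and one-hot encoding a label, can be sketched in a few lines of NumPy. The helper names `one_hot` and `predict` and the example labels are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Turn integer labels of shape (m,) into a (num_classes, m) one-hot matrix."""
    Y = np.zeros((num_classes, labels.size))
    # For example k, set row labels[k] of column k to 1.
    Y[labels, np.arange(labels.size)] = 1
    return Y

def predict(A2):
    """Best guess per example: the row index of the largest probability."""
    return np.argmax(A2, axis=0)

labels = np.array([5, 0, 9])   # hypothetical true labels for 3 examples
Y = one_hot(labels)
print(Y[:, 0])                 # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]

# Sanity check: predict() recovers the labels from the one-hot matrix itself.
print(predict(Y))              # [5 0 9]
```

In training, `one_hot` is applied to the labels from the dataset, while `predict` is applied to the network output $A^{[2]}$.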


# Loss and Gradient Descent Explained

Backpropagation is the method of taking the loss of the output and using it to calculate changes that we should make to the weights and biases of the preceding layers to move us closer to a more accurate output. Each example will nudge us a little closer to a better solution. The approach we take is to use calculus to calculate the derivative of the loss with respect to each component that contributes to the output, and to adjust each component in the direction that reduces the loss. This approach is known as ***gradient descent***.

Because we're using softmax, we want to use a ***cross-entropy loss function***. We can calculate the loss using the following function:
$$J(\hat{y}, y) = -\sum_{i=0}^{c} y_i \log(\hat{y}_i)$$

Here, $\hat{y}$ is our prediction vector, $y$ is the one-hot encoding of the correct label from the examples in our training data, and the classes are indexed $0$ through $c$ (so $c = 9$ for the ten digits). Because the one-hot vector is $0$ for all entries except the one for the correct class, the loss is just the $-\log()$ of the probability that the prediction assigns to the correct class. 

So, if
$$\hat{y} = \begin{bmatrix} 0.004 \\ 0.000 \\ 0.141 \\ 0.001 \\ 0.000 \\ 0.516 \\ 0.042 \\ 0.013 \\ 0.000 \\ 0.283 \end{bmatrix}$$ 
and
$$y = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}$$
then, using the natural logarithm,
$$J(\hat{y}, y) = -\log(\hat{y}_5) = -\log(0.516) \approx 0.662$$

The closer the probability is to $1$ the closer the loss will be to $0$.
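A small NumPy sketch of the loss for one example. It assumes the natural logarithm, which is what the $1/x$ derivative used in the backpropagation derivation requires, and it relies on $y$ being one-hot. The two prediction vectors are made up to show the behaviour at both ends.

```python
import numpy as np

def cross_entropy(y_hat, y):
    """Cross-entropy loss for one example, using the natural logarithm."""
    # With a one-hot y only the correct class contributes, so we select
    # that entry directly. This also avoids evaluating log(0) for
    # classes whose predicted probability is exactly zero.
    return -np.sum(np.log(y_hat[y == 1]))

y = np.zeros(10)
y[5] = 1  # correct label is 5

confident = np.full(10, 0.01)
confident[5] = 0.91          # hypothetical: 91% on the correct class
unsure = np.full(10, 0.1)    # hypothetical: uniform over all ten classes

print(cross_entropy(confident, y))  # about 0.094
print(cross_entropy(unsure, y))     # about 2.303
```

A uniform guess over ten classes costs $\log(10) \approx 2.303$, a handy reference point for an untrained network.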

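Now that we know what the loss is, we need to work backwards through our network to find out how much each of the weights and biases contributes to it. Once we have those derivatives, each parameter is nudged in the direction that reduces the loss, scaled by a learning rate $\alpha$: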

$$
\begin{array}{@{}ll}
W^{[1]} := W^{[1]} - \alpha \frac{\delta J}{\delta W^{[1]}} \\
b^{[1]} := b^{[1]} - \alpha \frac{\delta J}{\delta b^{[1]}} \\
W^{[2]} := W^{[2]} - \alpha \frac{\delta J}{\delta W^{[2]}} \\
b^{[2]} := b^{[2]} - \alpha \frac{\delta J}{\delta b^{[2]}} \\
\end{array}
$$
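The four update rules translate almost directly into code. In this sketch the function name `update_params`, the tiny arrays and `alpha=0.1` are all made up for illustration. Real gradients would come from backpropagation.

```python
import numpy as np

def update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    """One gradient descent step: move each parameter against its gradient."""
    W1 = W1 - alpha * dW1
    b1 = b1 - alpha * db1
    W2 = W2 - alpha * dW2
    b2 = b2 - alpha * db2
    return W1, b1, W2, b2

# Tiny demonstration with made-up parameters and gradients.
W1, b1 = np.array([[1.0, -2.0]]), np.array([[0.5]])
W2, b2 = np.array([[3.0]]), np.array([[0.0]])
dW1, db1 = np.array([[0.5, -0.5]]), np.array([[1.0]])
dW2, db2 = np.array([[-2.0]]), np.array([[0.0]])

W1, b1, W2, b2 = update_params(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha=0.1)
print(W1, b1, W2, b2)
```

A positive gradient lowers the parameter and a negative gradient raises it, which is why `W2` moves from 3.0 up to 3.2 here.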


# Backpropagation - Derivative of loss

In order to calculate the updates to the weights and biases of the previous layers, we have to work our way backwards. We have a loss function that is applied to an activated output layer (softmax), which is in turn applied to an unactivated output layer.

So, how can we calculate the derivative of the loss with respect to the unactivated output layer?

Remember our loss function is:
$$J(\hat{y}, y) = -\sum_{i=0}^{c} y_i \log(\hat{y}_i)$$

Let's focus on a single example vector rather than the whole matrix to calculate the derivative. 

For our case, $\hat{y}$ is our prediction vector $A^{[2]}$ and $y$ is our one-hot encoded vector for a single example, which we can call $Y$. So:
$$
\hat{y} = A^{[2]} \text{, } y = Y 
$$

> ℹ️ Note
>
> In this section we're using $A^{[2]}$ and $Y$ to represent a single example vector and $A^{[2]}_i$ and $Y_i$ to refer to a single entry in that vector.

Now we can plug those in:
$$J(\hat{y}, y) = J( A^{[2]}, Y ) = -\sum_{i=0}^{c} Y_i \log(A^{[2]}_i)$$

Now we want to calculate:
$$\frac{\delta J}{\delta A^{[2]}_i}$$

Let's take the derivative of the terms inside the summation one at a time, bringing the negative sign inside. We know that the derivative of $\log(x)$ is $1/x$, and only the $i$-th term of the sum depends on $A^{[2]}_i$, so:
$$\frac{\delta J}{\delta A^{[2]}_i} = \frac{\delta}{\delta A^{[2]}_i} \left( -Y_i \log(A^{[2]}_i) \right) = -\frac{Y_i}{A^{[2]}_i}$$

Next, we know that:
$A^{[2]} = softmax( Z^{[2]} )$

The derivative of softmax depends on which output we differentiate and which input we differentiate with respect to, because every $A^{[2]}_i$ depends on every $Z^{[2]}_j$ through the shared denominator. There are two cases:
$$
\frac{\delta A^{[2]}_i}{\delta Z^{[2]}_j} =
\begin{cases}
A^{[2]}_i * (1 - A^{[2]}_i) & \text{if } i = j \\
-A^{[2]}_i * A^{[2]}_j & \text{if } i \neq j
\end{cases}
$$

Using the chain rule, $Z^{[2]}_j$ affects the loss through every entry of $A^{[2]}$, so we sum over all of them:
$$
\frac{\delta J}{\delta Z^{[2]}_j} = \sum_{i=0}^{c} \frac{\delta J}{\delta A^{[2]}_i} * \frac{\delta A^{[2]}_i}{\delta Z^{[2]}_j}
$$

Splitting the sum into the $i = j$ term and the $i \neq j$ terms:
$$
\frac{\delta J}{\delta Z^{[2]}_j} = -\frac{ Y_j }{A^{[2]}_j} * A^{[2]}_j * (1 - A^{[2]}_j) + \sum_{i \neq j} -\frac{ Y_i }{A^{[2]}_i} * (-A^{[2]}_i * A^{[2]}_j) = -Y_j * (1 - A^{[2]}_j) + A^{[2]}_j \sum_{i \neq j} Y_i
$$

Expanding the first term and folding $A^{[2]}_j * Y_j$ back into the sum gives:
$$
\frac{\delta J}{\delta Z^{[2]}_j} = -Y_j + A^{[2]}_j \sum_{i=0}^{c} Y_i
$$

Because $Y$ is **one-hot** this simplifies: its entries sum to $1$, which leaves $A^{[2]}_j - Y_j$, or in vector form $A^{[2]} - Y$.

$$\frac{\delta J}{\delta Z^{[2]}} = A^{[2]} - Y$$
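One way to gain confidence in this result without redoing the calculus is a finite-difference check: nudge each entry of $Z^{[2]}$ slightly, watch how the loss changes, and compare with $A^{[2]} - Y$. The sketch below does this for a single example with NumPy. The random input, the label 3 and the step size `eps` are arbitrary choices.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss(z, y):
    """Cross-entropy of softmax(z) against a one-hot vector y."""
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.normal(size=10)   # hypothetical unactivated outputs for one example
y = np.zeros(10)
y[3] = 1                  # hypothetical correct label

analytic = softmax(z) - y  # the result derived above

# Central finite differences: nudge one z_j at a time.
eps = 1e-6
numeric = np.zeros(10)
for j in range(10):
    step = np.zeros(10)
    step[j] = eps
    numeric[j] = (loss(z + step, y) - loss(z - step, y)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # very close to zero
```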


# Backpropagation

The first thing we'll want to do is calculate the partial derivative of the loss function with respect to the softmax-activated output:

$$\frac{\delta J}{\delta A^{[2]}} \text{ or } dA^{[2]}$$

Let's just consider a single example to figure out the derivative. Let's say that $a$ is a single example of the activated output layer. In other words, just think of a single column (or vector) of the matrix $A^{[2]}$. Also, let's consider $y$ as a single vector that represents the one-hot encoding for that single example. 

Then, the loss is:
$L = -\sum_i y_i * \log(a_i)$

Chaining $dA^{[2]}$ through the derivative of softmax, as worked out in the previous section, gives $a - y$ for the derivative of the loss with respect to the unactivated output of that example. Stacking these columns for all $m$ examples gives a $10 \times m$ matrix. We'll call that $dZ^{[2]}$ because it is the derivative of the loss with respect to our unactivated output layer.

Dimensions:

$
\begin{array}{@{}ll}
dZ^{[2]} \text{ is } 10 \times m \\
A^{[2]} \text{ is } 10 \times m \\
Y \text{ is } 10 \times m \\
\end{array}
$

The derivative is as follows:
$$dZ^{[2]} = A^{[2]} - Y$$

Now we want to see how much the weights and the biases contributed to the error. To get this, we calculate the partial derivatives of the loss with respect to each $W$ and $b$, which are exactly the quantities that appear in the update rules:


$$
\begin{array}{@{}ll}
W^{[1]} := W^{[1]} - \alpha \frac{\delta J}{\delta W^{[1]}} \\
b^{[1]} := b^{[1]} - \alpha \frac{\delta J}{\delta b^{[1]}} \\
W^{[2]} := W^{[2]} - \alpha \frac{\delta J}{\delta W^{[2]}} \\
b^{[2]} := b^{[2]} - \alpha \frac{\delta J}{\delta b^{[2]}} \\
\end{array}
$$

For conciseness, we'll write the partial derivatives like this:

$$dW^{[1]}, db^{[1]}, dW^{[2]}, db^{[2]}$$




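$dW^{[2]}$ is the derivative of the loss function with respect to the weights in layer 2. $db^{[2]}$ is the derivative with respect to the biases in layer 2, which works out to the error $dZ^{[2]}$ averaged over all $m$ examples (the signed error, not its absolute value).

Dimensions:

$
\begin{array}{@{}ll}
dW^{[2]} \text{ is } 10 \times 10 \\
dZ^{[2]} \text{ is } 10 \times m \\
A^{[1]T} \text{ is } m \times 10 \\
\sum dZ^{[2]} \text{ (summed across the } m \text{ examples) is } 10 \times 1 \\
db^{[2]} \text{ is } 10 \times 1 \\
\end{array}
$

$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$$

$$db^{[2]} = \frac{1}{m} \sum dZ^{[2]}$$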
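A NumPy sketch of these two formulas, mainly to show the shapes and which axis the sum runs over. The activations are random stand-ins and the labels `[5, 0, 9, 5]` are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 4  # hypothetical number of examples

A1 = rng.random((10, m))           # stand-in for hidden-layer activations
A2 = rng.random((10, m))
A2 = A2 / A2.sum(axis=0)           # make each column a probability vector
Y = np.eye(10)[:, [5, 0, 9, 5]]    # one-hot columns for labels 5, 0, 9, 5

dZ2 = A2 - Y                               # (10, m)
dW2 = (dZ2 @ A1.T) / m                     # (10, m) @ (m, 10) -> (10, 10)
db2 = dZ2.sum(axis=1, keepdims=True) / m   # sum over examples -> (10, 1)

print(dW2.shape, db2.shape)  # (10, 10) (10, 1)
```

`keepdims=True` keeps `db2` as a $10 \times 1$ column so it has the same shape as $b^{[2]}$ when the update is applied.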





Our objective in backprop is to find $\frac{\delta J}{\delta W^{[1]}}$, $\frac{\delta J}{\delta b^{[1]}}$, $\frac{\delta J}{\delta W^{[2]}}$, and $\frac{\delta J}{\delta b^{[2]}}$. For concision, we'll write these values as $dW^{[1]}$, $db^{[1]}$, $dW^{[2]}$, and $db^{[2]}$. We'll find these values by stepping backwards through our network, starting by calculating $\frac{\delta J}{\delta Z^{[2]}}$, or $dZ^{[2]}$. Turns out that this derivative is simply:

$$dZ^{[2]} = A^{[2]} - Y$$

If you know calculus, you can take the derivative of the loss function and confirm this for yourself. (Hint: $\hat{y} = A^{[2]}$ and $A^{[2]} = \text{softmax}(Z^{[2]})$)

From $dZ^{[2]}$, we can calculate $dW^{[2]}$ and $db^{[2]}$:

$$dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$$

$$db^{[2]} = \frac{1}{m} \Sigma {dZ^{[2]}}$$

Then, to calculate $dW^{[1]}$ and $db^{[1]}$, we'll first find $dZ^{[1]}$:

$$dZ^{[1]} = W^{[2]T} dZ^{[2]} \odot g^{[1]\prime} (Z^{[1]})$$

I won't explain all the details of the math, but you can get some intuitive hints at what's going on here by just looking at the variables. We're applying $W^{[2]T}$ to $dZ^{[2]}$, akin to applying the weights between layers 1 and 2 in reverse. Then, we perform an element-wise multiplication ($\odot$) with the derivative of the activation function, akin to "undoing" it to get the correct error values. Since $g^{[1]}$ is ReLU, that derivative is $1$ wherever $Z^{[1]} > 0$ and $0$ everywhere else.
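Putting the pieces together, a full backward pass might look like the NumPy sketch below. The text stops at $dZ^{[1]}$, so the lines for `dW1` and `db1` are an assumption: they follow the layer 2 formulas with $A^{[0]} = X$ in place of $A^{[1]}$. The function names, random data and initialization are likewise illustrative only.

```python
import numpy as np

def relu_deriv(Z):
    """Derivative of ReLU: 1 where Z > 0, else 0."""
    return (Z > 0).astype(float)

def backward_prop(Z1, A1, A2, W2, X, Y):
    """Gradients for the two-layer network described in these notes."""
    m = X.shape[1]
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = dZ2.sum(axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * relu_deriv(Z1)   # * is element-wise here
    dW1 = (dZ1 @ X.T) / m                 # assumed, by analogy with dW2
    db1 = dZ1.sum(axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2

# Random stand-ins for data and parameters, then one forward pass.
rng = np.random.default_rng(0)
m = 6
X = rng.random((784, m))
Y = np.eye(10)[:, rng.integers(0, 10, size=m)]
W1, b1 = rng.random((10, 784)) - 0.5, rng.random((10, 1)) - 0.5
W2, b2 = rng.random((10, 10)) - 0.5, rng.random((10, 1)) - 0.5

Z1 = W1 @ X + b1
A1 = np.maximum(Z1, 0)
Z2 = W2 @ A1 + b2
exp_Z2 = np.exp(Z2 - Z2.max(axis=0, keepdims=True))
A2 = exp_Z2 / exp_Z2.sum(axis=0, keepdims=True)

dW1, db1, dW2, db2 = backward_prop(Z1, A1, A2, W2, X, Y)
print(dW1.shape, db1.shape, dW2.shape, db2.shape)
# (10, 784) (10, 1) (10, 10) (10, 1)
```

Each gradient has the same shape as the parameter it updates, which is a quick sanity check before plugging the values into the gradient descent step.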