# First some theory

First some basic calculus:

$$
f(X) = A \cdot X
$$

$$
g(X) = X \cdot A
$$

$$
\frac{df}{dX} = A
$$

$$
\frac{dg}{dX} = A^T
$$

Let's start with a simple NN:

**Forward Propagation**

$$ Z = W \cdot X + b$$
$$ \hat{Y} = SoftMax(Z)$$

where 
$$
SoftMax(Z)_i = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}}
$$

**Shapes**

- $X \in \R^{784}$ is a single digit, i.e. a single instance,
- $W \in \R^{10 \times 784}$ is the matrix of weights,
- $b \in \R^{10}$ is the vector of biases,
- $Z \in \R^{10}$ is the output of the layer,
- $\hat{Y} \in \R^{10}$ is our prediction
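The forward pass and the shapes above can be sketched in a few lines of NumPy. This is only an illustration: the input is random data standing in for a real digit, the weight initialisation is arbitrary, and the `softmax` helper (with the usual max-subtraction trick for numerical stability) is an assumption, not something defined in these notes.

```python
import numpy as np

def softmax(z):
    """Softmax of a 1-D vector, shifted by max(z) for numerical stability."""
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

rng = np.random.default_rng(0)
X = rng.random(784)                    # random stand-in for one flattened 28x28 digit
W = rng.normal(0.0, 0.01, (10, 784))   # weights (arbitrary small initialisation)
b = np.zeros(10)                       # biases

Z = W @ X + b                          # output of the layer, shape (10,)
Y_hat = softmax(Z)                     # prediction, shape (10,), sums to 1

print(Z.shape, Y_hat.shape, Y_hat.sum())
```

Subtracting `max(z)` does not change the result of the softmax, it only avoids overflow in `np.exp` for large values.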

Now, the true vector of values $Y$ is a one-hot encoded vector, with $1$ at the position of the correct label and $0$ everywhere else:

$$ 
Y_i = \begin{bmatrix}
           0 \\
           0 \\
           \vdots \\
           0 \\
           1
    \end{bmatrix}
$$

This means that $X_i$ is actually a "9": the ten positions correspond to the digits 0 through 9, and the $1$ sits in the last one.

**Loss function**

Which loss function should we use in this scenario? One simple possibility would be $MSE(Y, \hat{Y})$, but here we have a better option, the cross entropy loss:

$$
L(Y, \hat{Y}) = - \sum y_i \text{log}(\hat{y}_i)
$$

Why is this a good choice? Because the vector of labels $Y$ is one-hot encoded, which means that all of its values are $0$ except one, which is $1$. MSE treats every component the same way, while here we should focus on the only component that is not $0$.

In the sum, every term is multiplied by a component of $Y$, so the only term that survives is the one where $y_i = 1$, together with its corresponding value in $\hat{Y}$.

And what about $\text{log}(\hat{y}_i)$?

Ideally $\hat{y}_i$, the predicted probability of the true class, should be close to $1$ (and all the other components close to $0$). If $\hat{y}_i \approx 1$, then $\text{log}(\hat{y}_i) \approx 0$ and the loss is close to $0$. If instead $\hat{y}_i \approx 0$, then $\text{log}(\hat{y}_i) \to -\infty$ and $L(Y, \hat{Y}) \to +\infty$. This is what we want from a loss function: close to $0$ when the prediction is correct and very large when it is wrong.
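A small numerical illustration of this behaviour, as a sketch only: the two prediction vectors are made up, and the `eps` term is an assumption added to avoid `log(0)`, it is not part of the formula above.

```python
import numpy as np

def cross_entropy(y, y_hat, eps=1e-12):
    """Cross entropy between a one-hot label y and a predicted distribution y_hat.

    eps is a small constant (an assumption of this sketch) that guards against log(0).
    """
    return -np.sum(y * np.log(y_hat + eps))

y = np.zeros(10)
y[9] = 1.0                      # the true label is "9"

good = np.full(10, 0.001)
good[9] = 0.991                 # confident and correct
bad = np.full(10, 0.001)
bad[0] = 0.991                  # confident and wrong (predicts "0")

loss_good = cross_entropy(y, good)   # only -log(0.991) survives the sum
loss_bad = cross_entropy(y, bad)     # only -log(0.001) survives the sum

print(loss_good, loss_bad)
```

The first loss is close to zero and the second one is several hundred times larger, even though both predictions are equally "confident".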

Another reason why it's a good choice is the derivative that results from it, i.e.:

$$
\frac{\partial L}{\partial Z} = \hat{Y} - Y
$$

If you want to find out why, check this article: https://levelup.gitconnected.com/killer-combo-softmax-and-cross-entropy-5907442f60ba
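Without going through the derivation, the formula $\frac{\partial L}{\partial Z} = \hat{Y} - Y$ can at least be checked numerically. The sketch below compares it against a central finite difference on random logits; the step size `h`, the seed and the chosen label are arbitrary assumptions.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def loss(z, y):
    """Cross entropy of softmax(z) against the one-hot label y."""
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(1)
z = rng.normal(size=10)         # random logits
y = np.zeros(10)
y[3] = 1.0                      # arbitrary true label

analytic = softmax(z) - y       # the formula Y_hat - Y

# Central finite differences, one component of z at a time
h = 1e-6
numeric = np.zeros(10)
for i in range(10):
    step = np.zeros(10)
    step[i] = h
    numeric[i] = (loss(z + step, y) - loss(z - step, y)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)
```

The two gradients should agree up to floating-point noise.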

But what we really want to calculate is

$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W}
$$

This is easy to calculate, noting that

$$
\frac{\partial L}{\partial Z} = \hat{Y}-Y
$$
$$
\frac{\partial Z}{\partial W} = X^T
$$

and therefore

$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W} = (\hat{Y}-Y) \cdot X^T
$$

Do the shapes match? Well, $\frac{\partial L}{\partial W}$ must have the same shape as $W$, which is $\R^{10 \times 784}$, and what we have is that

* $(\hat{Y}-Y) \in \R^{10}$, i.e. a column vector in $\R^{10 \times 1}$
* $X \in \R^{784} \implies X^T \in \R^{1 \times 784}$

This implies that

* $(\hat{Y}-Y)\cdot X^T \in \R^{10 \times 784}$, which is exactly what we want
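In NumPy the product of a column vector by a row vector is an outer product. A quick shape check, with a placeholder uniform prediction and a random input (both made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random(784)               # random stand-in for one digit
Y_hat = np.full(10, 0.1)          # placeholder prediction: uniform over the 10 classes
Y = np.zeros(10)
Y[9] = 1.0                        # true label "9"

dZ = Y_hat - Y                    # shape (10,)
dW = np.outer(dZ, X)              # shape (10, 784)

# The same thing written as an explicit (10 x 1) times (1 x 784) matrix product
dW_explicit = dZ.reshape(10, 1) @ X.reshape(1, 784)

print(dW.shape)
```

`np.outer` is used here only because `dZ` and `X` are stored as 1-D arrays; with explicit column and row vectors the `@` operator gives the same matrix.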

Now, what about $b$?

We still have that 

$$
\frac{\partial L}{\partial Z} = \hat{Y}-Y
$$
but...
$$
\frac{\partial Z}{\partial b} = I \in \R^{1 \times 1}
$$

and therefore 

$$
\frac{\partial L}{\partial b} = (\hat{Y}-Y) \cdot I \in \R^{10 \times 1}
$$

So, putting it all together, we have:

**Forward propagation**

$$ Z = W \cdot X + b $$
$$ \hat{Y} = \text{SoftMax}(Z) $$

**Backward propagation**

$$ dZ = (\hat{Y}-Y) \in \R^{10 \times 1} $$
$$ dW = (\hat{Y}-Y) \cdot X^T \in \R^{10 \times 784} $$
$$ db = (\hat{Y}-Y) \cdot I \in \R^{10 \times 1} $$

**Parameters update**

$$ W = W - \alpha \cdot dW $$
$$ b = b - \alpha \cdot db $$
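The three steps (forward propagation, backward propagation, parameter update) fit in one small function. What follows is a minimal sketch on a single random observation, not a full training script: the learning rate `alpha = 0.005`, the initialisation and the number of repetitions are arbitrary assumptions.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

def train_step(W, b, X, Y, alpha):
    """One gradient-descent step on a single observation.

    Returns the updated W and b, plus the loss measured before the update.
    """
    # Forward propagation
    Z = W @ X + b
    Y_hat = softmax(Z)
    loss = -np.sum(Y * np.log(Y_hat))
    # Backward propagation
    dZ = Y_hat - Y
    dW = np.outer(dZ, X)
    db = dZ
    # Parameters update
    return W - alpha * dW, b - alpha * db, loss

rng = np.random.default_rng(3)
X = rng.random(784)                    # random stand-in for one digit
Y = np.zeros(10)
Y[9] = 1.0                             # its label is "9"
W = rng.normal(0.0, 0.01, (10, 784))
b = np.zeros(10)

losses = []
for _ in range(5):
    W, b, loss = train_step(W, b, X, Y, alpha=0.005)
    losses.append(loss)

print(losses)
```

Repeating the step on the same observation should make the loss shrink each time, since every update pushes $Z$ towards the correct label.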


# Generalizing to $n$ observations

The formulas above are for a single observation, i.e. $X \in \R^{784}$ is a single digit, a single image. 
In reality we will have the whole dataset, i.e. $ \mathbb{X} \in \R^{784 \times n}$, where $n$ is the number of observations in the dataset, and similarly $\mathbb{Y} \in \R^{10 \times n}$. 

What we will have is:

* Each column of $\mathbb{X}$ is a single observation $X_i \in \R^{784}$
* Each column of $\mathbb{Y}$ is a single label $Y_i \in \R^{10}$

What should we do when we have $n$ observations?

Let's think about what a good strategy could be.

When we had a single observation, call it $X_1$, what we ended up with was 

$$
dW_1 = (\hat{Y}_1 - Y_1) \cdot X_1^T 
$$

which is the "adjustment" that we must make to $W$ because of the prediction $\hat{Y}_1$ from $X_1$. 
If we had another observation, $X_2$, the adjustment would be

$$
dW_2 = (\hat{Y}_2 - Y_2) \cdot X_2^T 
$$

What would be a good way to merge these two "adjustments"? A simple idea is to take their average, i.e. to sum the two weight "adjustments" $dW_1$ and $dW_2$ and divide by two:

$$
dW = \frac{1}{2} \cdot (dW_1 + dW_2)
$$

and this is exactly what we will do in the generalized version:

**Shapes**

$$ \mathbb{X} \in \R^{784 \times n} $$
$$ \hat{\mathbb{Y}} \in \R^{10 \times n}$$

**Forward propagation**

$$ \mathbb{Z} = W \cdot \mathbb{X} + b $$
$$ \hat{\mathbb{Y}} = \text{SoftMax}(\mathbb{Z}) $$

**Backward propagation**

$$ d\mathbb{Z} = (\hat{\mathbb{Y}}-\mathbb{Y}) \in \R^{10 \times n} $$
$$ dW = \frac{1}{n} \ (\hat{\mathbb{Y}}-\mathbb{Y}) \cdot \mathbb{X}^T \in \R^{10 \times 784} $$

To see why this is the average, note that

$$
(\hat{\mathbb{Y}}-\mathbb{Y}) = [\hat{Y}_1 - Y_1 | \hat{Y}_2 - Y_2 | \dots | \hat{Y}_n - Y_n]
$$
$$
\mathbb{X} = [X_1 | X_2 | \dots | X_n]
$$

so the matrix product expands, column by column, into a sum of the single-observation terms:

$$
\frac{1}{n} \cdot (\hat{\mathbb{Y}}- \mathbb{Y}) \cdot \mathbb{X}^T = \frac{1}{n} \cdot ((\hat{Y}_1 - Y_1)\cdot X_1^T + \dots + (\hat{Y}_n - Y_n)\cdot X_n^T) = \frac{1}{n} \cdot (dW_1 + \dots + dW_n)
$$

which is exactly the average of the "adjustments" from all the predictions $\hat{Y}_1, \dots, \hat{Y}_n$.
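The identity can be verified numerically: a single matrix product gives the same result as looping over the observations and averaging the per-observation outer products. In this sketch the matrix `dZ` is filled with random numbers standing in for $\hat{\mathbb{Y}}-\mathbb{Y}$, and `n = 5` is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
X = rng.random((784, n))          # each column is one observation
dZ = rng.normal(size=(10, n))     # random stand-in for Y_hat - Y, one column per observation

# Vectorised version: a single matrix product
dW_batch = (dZ @ X.T) / n

# Explicit version: average of the n single-observation adjustments dW_i
dW_loop = np.zeros((10, 784))
for i in range(n):
    dW_loop += np.outer(dZ[:, i], X[:, i])
dW_loop /= n

print(np.allclose(dW_batch, dW_loop))
```

The vectorised form is the one worth using in practice, since it avoids the Python loop entirely.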

For $b$, the gradient for a single observation is simply $\hat{Y}-Y \in \R^{10}$. But what if $(\hat{\mathbb{Y}}-\mathbb{Y}) \in \R^{10 \times n}$? Since each column represents one observation, the natural choice is again to average over all the $n$ observations, i.e. to sum the columns of $d\mathbb{Z}$ and divide by $n$:

$$
db = \frac{1}{n} \sum_{i=1}^{n} (\hat{Y}_i - Y_i) \in \R^{10}
$$
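A batched sketch of the forward pass and of $db$, with `n` denoting the number of observations. Two details are assumptions of this example rather than something stated above: `b` is stored as a `(10, 1)` column so that NumPy broadcasts it across all columns of $W \cdot \mathbb{X}$, and the softmax is applied to each column separately. The labels are made up.

```python
import numpy as np

def softmax_columns(Z):
    """Apply softmax independently to each column of Z."""
    exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

rng = np.random.default_rng(5)
n = 4
X = rng.random((784, n))                 # each column is one observation
labels = np.array([9, 0, 3, 9])          # made-up labels, one per column
Y = np.zeros((10, n))
Y[labels, np.arange(n)] = 1.0            # one-hot encode column by column

W = rng.normal(0.0, 0.01, (10, 784))
b = np.zeros((10, 1))                    # column vector, broadcast over the n columns

Z = W @ X + b                            # shape (10, n)
Y_hat = softmax_columns(Z)
dZ = Y_hat - Y

# Sum over the columns (axis=1) and divide by n
db = dZ.sum(axis=1, keepdims=True) / n   # shape (10, 1), same as b

print(db.shape)
```

Keeping `keepdims=True` means `db` has the same shape as `b`, so the update `b - alpha * db` needs no reshaping.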



# Intuition behind parameters update

Focus for a moment on the case where $X \in \R^{784}$ is a single observation and therefore $\hat{Y} \in \R^{10}$ is a single prediction. We have that
$$
\frac{\partial L}{\partial Z} = \hat{Y}-Y
$$
$$
\frac{\partial Z}{\partial W} = X^T
$$

and therefore

$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W} = (\hat{Y}-Y) \cdot X^T
$$

What does $\hat{Y}-Y$ tell us? 

Since 

$$
dZ = \hat{Y}-Y
$$

suppose we have 


$$ 
Y_i = \begin{bmatrix}
           0 \\
           \vdots \\
           0 \\
           1
    \end{bmatrix}
$$

and 


$$ 
\hat{Y}_i = \begin{bmatrix}
           1 \\
           0 \\
           \vdots \\
           0 \\
    \end{bmatrix}
$$

then 


$$ 
dZ = \hat{Y}_i - Y_i = \begin{bmatrix}
           1 \\
           0 \\
           \vdots \\
           0 \\
           -1
    \end{bmatrix}
$$

What does this tell the model? 

$$
Z = Z - dZ = Z - \begin{bmatrix}
           1 \\
           0 \\
           \vdots \\
           0 \\
           -1
    \end{bmatrix} = Z + \begin{bmatrix}
           -1 \\
           0 \\
           \vdots \\
           0 \\
           +1
    \end{bmatrix}
$$

which is basically saying, "Hey man, increase the last value of $Z$ and decrease the first one". Since each $SoftMax$ output $\hat{y}_i$ increases with its own $z_i$ and decreases when any other $z_j$ grows, this means:
* increase the probability of $X$ being predicted as a "9"!
* decrease the probability of $X$ being predicted as a "0"!
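The effect is easy to see numerically. In this sketch the logits are made-up values that strongly favour "0" while the true label is "9", and the step $Z - dZ$ uses the actual softmax output rather than the idealised 0/1 prediction from the example.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))
    return exp_z / np.sum(exp_z)

Z = np.zeros(10)
Z[0] = 5.0                        # made-up logits strongly favouring "0"
Y = np.zeros(10)
Y[9] = 1.0                        # the true label is "9"

p_before = softmax(Z)
dZ = p_before - Y                 # positive for "0", negative for "9"
p_after = softmax(Z - dZ)         # step against the gradient, as in Z = Z - dZ

print(p_before[0], p_after[0])    # probability of "0" goes down
print(p_before[9], p_after[9])    # probability of "9" goes up
```

A single step does not fix the prediction, but it moves both probabilities in the right direction.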

But since 

$$
Z = W \cdot X + b
$$

how should we adjust $W$ in order to improve $L$? We just saw how to adjust $Z$ in order to improve $L$. If we also know how $Z$ changes as a function of $W$, we can combine the two and find how to adjust $W$ in order to improve $L$. 
This is exactly the chain rule:

$$
\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W}
$$

which, written out, is equal to:

$$
\frac{\partial L}{\partial Z} \cdot \frac{\partial Z}{\partial W} = (\hat{Y}-Y) \cdot X^T
$$

which, if we call $(\hat{Y}-Y) = \delta$ and consider 

$$
X = \begin{bmatrix}
           x_1 \\
           x_2 \\
           \vdots \\
           x_{784} \\
    \end{bmatrix}
$$

we obtain, for the example $dZ$ above (first component $1$, last component $-1$, zeros elsewhere), that 

$$
dW = \frac{\partial L}{\partial W} = (\hat{Y}-Y) \cdot X^T = [\delta \cdot x_1 | \dots | \delta \cdot x_{784}] = 
\begin{bmatrix}
           x_1 & x_2 & \dots & x_{784} \\
           0 & 0 & \dots & 0 \\
           \vdots & \vdots & & \vdots \\
           0 & 0 & \dots & 0 \\
           -x_1 & -x_2 & \dots & -x_{784}
    \end{bmatrix}
$$

which, through the update

$$
W = W - dW
$$

is basically saying to $W$:

"*Reduce the first row, which is responsible for the probability of the "0", and increase the last row, which is responsible for the probability of the "9"*."

The intuition behind the update of $b$ is simpler: since $b$ is only added to $W \cdot X$, a change in $b$ moves $Z$ by the same amount, so you simply subtract the $\delta$ of the last prediction, $\hat{Y} - Y$.
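The row structure of $dW$ and its effect on $Z$ can be checked directly. This sketch uses the idealised $\delta$ from the example (predicted "0", truth "9"), a random input, and a hypothetical learning rate `alpha = 0.01` in place of the plain $W = W - dW$.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.random(784)                    # random stand-in for one digit
W = rng.normal(0.0, 0.01, (10, 784))
b = np.zeros(10)

delta = np.zeros(10)
delta[0] = 1.0                         # predicted "0" ...
delta[9] = -1.0                        # ... but the truth is "9"

dW = np.outer(delta, X)                # first row is X, last row is -X, the rest is zero

alpha = 0.01                           # hypothetical learning rate
Z_before = W @ X + b
Z_after = (W - alpha * dW) @ X + (b - alpha * delta)
change = Z_after - Z_before

print(change[0], change[9])            # negative for "0", positive for "9"
```

Only the first and last entries of $Z$ move, each by `alpha * (X @ X + 1)` in absolute value: the `X @ X` part comes from the update of $W$ and the `1` from the update of $b$.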

# Move on to the first layer

OK, so far we have only played around a little, in the sense that this layer is quite simple. Let's complicate things a bit by adding an initial layer with a well-known activation function in NNs, i.e. the $ReLU$ function:

$$
\text{ReLU}(Z)_i = \max(0, z_i)
$$

We obtain for the **Forward Propagation**:

$$
Z^1 = W^1 \cdot X + b^1
$$

$$
A^1 = \text{ReLU}(Z^1)
$$

$$
Z^2 = W^2 \cdot A^1 + b^2
$$

$$
\hat{Y} = A^2 = Softmax(Z^2)
$$

and since the loss function is still the same, i.e. the **cross entropy loss**:

$$
L(\hat{Y}, Y) = -\sum y_i \cdot log(\hat{y}_i)
$$

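we have for the **Backward Propagation**:

$$
dZ^2 = (A^2 - Y)
$$

$$
dW^2 = \frac{1}{n} \ dZ^2 \cdot (A^{1})^T
$$

$$
db^2 = \frac{1}{n} \sum_{i=1}^{n} dZ^2_i
$$

where $n$ is the number of observations and $dZ^2_i$ is the $i$-th column of $dZ^2$.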
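A sketch of the two-layer forward pass together with the second-layer gradients, for a batch of `n` observations stored as columns. The hidden size of 16, the random data, the labels and the initialisation are all arbitrary assumptions; these notes do not fix any of them.

```python
import numpy as np

def relu(Z):
    """Element-wise max(0, z)."""
    return np.maximum(0.0, Z)

def softmax_columns(Z):
    """Apply softmax independently to each column of Z."""
    exp_Z = np.exp(Z - Z.max(axis=0, keepdims=True))
    return exp_Z / exp_Z.sum(axis=0, keepdims=True)

rng = np.random.default_rng(7)
n, hidden = 6, 16                        # hidden size chosen arbitrarily for this sketch
X = rng.random((784, n))                 # each column is one observation
labels = rng.integers(0, 10, size=n)     # made-up labels
Y = np.zeros((10, n))
Y[labels, np.arange(n)] = 1.0

W1 = rng.normal(0.0, 0.01, (hidden, 784))
b1 = np.zeros((hidden, 1))
W2 = rng.normal(0.0, 0.01, (10, hidden))
b2 = np.zeros((10, 1))

# Forward propagation
Z1 = W1 @ X + b1
A1 = relu(Z1)
Z2 = W2 @ A1 + b2
A2 = softmax_columns(Z2)                 # this is Y_hat

# Backward propagation, second layer only
dZ2 = A2 - Y
dW2 = (dZ2 @ A1.T) / n
db2 = dZ2.sum(axis=1, keepdims=True) / n

print(dW2.shape, db2.shape)
```

Compared with the single-layer case, the only change in the second-layer formulas is that $A^1$ takes the place of $X$ as the layer's input.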

Now we need to calculate

$$
dZ^1 = ... ? 
$$

$$
dW^1 = ... ? 
$$ 

$$
db^1 = ... ?
$$

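Suppose we already have $dZ^1$. Then the other two follow the same pattern as in the second layer, with $n$ the number of observations and $dZ^1_i$ the $i$-th column of $dZ^1$:

$$
dW^1 = \frac{1}{n} \ dZ^1 \cdot X^T
$$

$$
db^1 = \frac{1}{n} \sum_{i=1}^{n} dZ^1_i
$$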