# Math preliminaries

### Chain rule

**Composition of functions:**

$$
\cfrac{d f(g(x))}{dx} = \cfrac{d f}{dx} (g(x)) \cfrac{d g(x)}{dx}
$$

This means we differentiate $f$ and evaluate the resulting function at $g(x)$. Then we differentiate $g$ and evaluate the resulting function at $x$. Finally, we multiply the two results.
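
A quick numerical sanity check of the rule above, as a minimal sketch. The choices $f(u) = \sin(u)$ and $g(x) = x^2$ are arbitrary illustrations (they are not part of the derivation), and the helper names are hypothetical.

```python
import math

def f(u):
    return math.sin(u)

def f_prime(u):
    return math.cos(u)

def g(x):
    return x ** 2

def g_prime(x):
    return 2 * x

def chain_rule_derivative(x):
    # Differentiate f, evaluate at g(x); differentiate g, evaluate at x; multiply.
    return f_prime(g(x)) * g_prime(x)

def finite_difference(func, x, eps=1e-6):
    # Central difference approximation of the derivative of func at x.
    return (func(x + eps) - func(x - eps)) / (2 * eps)

x = 0.7
analytic = chain_rule_derivative(x)
numeric = finite_difference(lambda t: f(g(t)), x)
print(analytic, numeric)  # the two values should agree closely
```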

**With an intermediate variable** (where $z$ depends on $y$, and $y$ depends on $x$):

$$
\cfrac{dz}{dx} = \cfrac{dz}{dy} \cdot \cfrac{dy}{dx}
$$


### Quotient rule

$$
\cfrac{d}{dx} \left[ \cfrac{f(x)}{g(x)} \right] = \cfrac{\cfrac{df(x)}{dx} \cdot g(x) - f(x) \cdot \cfrac{dg(x)}{dx}}{(g(x))^2}
$$

### Nabla notation
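
A minimal reminder, assuming the usual convention (the symbols $\mathbf{x}$ and $n$ here are only illustrative): for a scalar function $L$ of a vector $\mathbf{x}$, the nabla operator collects all the partial derivatives into a vector with the same shape as $\mathbf{x}$

$$
\nabla_{\mathbf{x}} L = \begin{bmatrix} \cfrac{\partial L}{\partial x_0} & \cfrac{\partial L}{\partial x_1} & \cdots & \cfrac{\partial L}{\partial x_n} \end{bmatrix}^T
$$

The same idea applies to a matrix $\mathbf{W}$: $\nabla_{\mathbf{W}} L$ has the shape of $\mathbf{W}$, and its entry $(i, j)$ is $\partial L / \partial w_{ij}$.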

### Kronecker delta function

$$
\delta_{i, j} = 
    \begin{cases}
        0 \quad\text{if} ~ i \neq j \\
        1 \quad\text{if} ~ i = j
    \end{cases}
$$

### Cross-entropy loss

$$
L = - \sum_{k=0}^{n_L} y_k ~ log(a^L_k)
$$

Where:
* $y_k$ is the k-th entry of the one-hot encoded class indicator, $\mathbf{y}$
* $a_k^L$ is the activation of unit $k$ in the output layer (numbered as $L$)
* $n_L$ is the number of units at layer $L$ (the output layer)
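
As a minimal sketch, the loss above can be computed with NumPy as follows. The function name `cross_entropy` and the example values are illustrative assumptions, and the code assumes every activation is strictly positive so the logarithm is finite.

```python
import numpy as np

def cross_entropy(y, a):
    """L = -sum_k y_k * log(a_k), for a one-hot target y and activations a."""
    y = np.asarray(y, dtype=float)
    a = np.asarray(a, dtype=float)
    # Assumes all a_k > 0; only the entry where y_k = 1 contributes.
    return float(-np.sum(y * np.log(a)))

y = [0.0, 1.0, 0.0]   # one-hot: the true class is 1
a = [0.2, 0.7, 0.1]   # output activations (sum to 1)
loss = cross_entropy(y, a)
print(loss)           # equals -log(0.7)
```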

### Softmax

$$
Softmax(k, L)=\cfrac{e^{z_k^L}}{\sum_{j=0}^{n_L} e^{z_j^L}}
$$

Where:
* $z_k^L$ is the logit of unit $k$ at layer $L$
* $z_j^L$ is the logit of unit $j$ at layer $L$, with $j$ running over all the units in the layer
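
A minimal NumPy sketch of the definition above. Subtracting the largest logit before exponentiating is an addition that is not in the formula: it leaves the result unchanged (the factor cancels between numerator and denominator) and is a common way to avoid overflow. The function name is illustrative.

```python
import numpy as np

def softmax(z):
    """Softmax over a vector of logits z."""
    z = np.asarray(z, dtype=float)
    shifted = z - np.max(z)      # assumption: shift for numerical stability
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

a = softmax([1.0, 2.0, 3.0])
print(a, a.sum())  # positive entries that add up to 1
```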

### ReLU activation function

$$
ReLU(x) = \begin{cases}
    0 \quad \text{if} ~ x \leq 0 \\
    x \quad \text{if} ~ x > 0
\end{cases}
$$

This function is commonly implemented as $max(0, x)$. The derivative of ReLU is also a piecewise function (strictly speaking it is undefined at $x = 0$; by convention we take it to be 0 there):

$$
\cfrac{d ~ ReLU(x)}{dx} = ReLU'(x) = \begin{cases}
    0 \quad \text{if} ~ x \leq 0 \\
    1 \quad \text{if} ~ x > 0
\end{cases}
$$
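
A minimal NumPy sketch of ReLU and its derivative, taking the derivative to be 0 for $x \leq 0$ and 1 for $x > 0$ (the value at exactly 0 is a convention). The function names are illustrative.

```python
import numpy as np

def relu(x):
    # Element-wise max(0, x).
    return np.maximum(0.0, x)

def relu_prime(x):
    # 1 where x > 0, 0 elsewhere (0 is chosen at x == 0 by convention).
    return (np.asarray(x) > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))        # negative inputs are clipped to zero
print(relu_prime(x))  # only the positive input has slope 1
```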

### Hadamard product
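
The Hadamard product, written $\odot$, multiplies two vectors (or matrices) of the same shape entry by entry: $(\mathbf{u} \odot \mathbf{v})_i = u_i v_i$. A minimal NumPy sketch, with arbitrary example values:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# For NumPy arrays of the same shape, `*` is the element-wise product.
hadamard = u * v
print(hadamard)
```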

### Outer product
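
The outer product, written $\otimes$, takes a vector $\mathbf{u}$ with $m$ entries and a vector $\mathbf{v}$ with $n$ entries and produces an $m \times n$ matrix with entries $(\mathbf{u} \otimes \mathbf{v})_{ij} = u_i v_j$. A minimal NumPy sketch, with arbitrary example values:

```python
import numpy as np

u = np.array([1.0, 2.0])        # m = 2
v = np.array([3.0, 4.0, 5.0])   # n = 3

# outer[i, j] = u[i] * v[j], so the result has shape (2, 3).
outer = np.outer(u, v)
print(outer)
```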

# Training the last layer

Let's say we want to decrease the loss by modifying just a single weight from the last layer (layer $L$). We can use gradient descent for that. The first thing we need to know is how the loss changes with respect to that weight. Applying the chain rule, we split that into two partial derivatives that are easier to calculate:

$$
\cfrac{\partial Loss}{\partial w_{ij}^L} = \cfrac{\partial Loss}{\partial z_i^L} \cfrac{\partial z_i^L}{\partial w_{ij}^L}
$$


### $\partial Loss / \partial z_i^L$

Let's consider the effect on the loss of a small change in $z_i^L$, the output (the "logit") of unit $i$ in the final layer $L$. Replacing the loss with the definition of the cross-entropy, we get:

$$
\cfrac{\partial Loss}{\partial z^L_i} = \cfrac{\partial}{\partial z_i^L} [- \sum_{k=0}^{n_L} y_k ~ log(a_k^L)]
$$


Differentiation is linear, so we can move the derivative inside the sum (and $y_k$ does not depend on $z_i^L$):

$$
= - \sum_{k=0}^{n_L} y_k ~ \cfrac{\partial}{\partial z_i^L} [log(a_k^L)]
$$


Following the chain rule for the composition of functions:

$$
= - \sum_{k=0}^{n_L} y_k \cdot \cfrac{1}{a_k^L} \cdot \cfrac{\partial a_k^L}{\partial z_i^L}
$$

Let's work out the remaining partial derivative, that is, how a small change in the logit of unit $i$ changes the activation of unit $k$. The activation at the last layer is produced by applying the softmax function to the logits:

$$
\begin{align}
    \cfrac{\partial a_k^L}{\partial z_i^L} &= \cfrac{\partial}{\partial z_i^L} Softmax(k, L) \\
        &= \cfrac{\partial}{\partial z_i^L} [\cfrac{e^{z_k^L}}{\sum_{j=0}^{n_L} e^{z_j^L}}]
\end{align}        
$$

Applying the Quotient rule:

$$
= \cfrac{(\cfrac{\partial e^{z_k^L}}{\partial z_i^L} \cdot \sum_{j=0}^{n_L} e^{z_j^L}) - (e^{z_k^L} \cdot \cfrac{\partial}{\partial z_i^L} \sum_{j=0}^{n_L} e^{z_j^L} )}{(\sum_{j=0}^{n_L} e^{z_j^L})^2}
$$

Again, let's consider each derivative individually. For the first one, notice that if $k = i$, then

$$
\cfrac{\partial e^{z_i^L}}{\partial z_i^L} = e^{z_i^L}
$$

Otherwise, if $k \neq i$, then 

$$
\cfrac{\partial e^{z_k^L}}{\partial z_i^L} = 0
$$

Hence

$$
\cfrac{\partial e^{z_k^L}}{\partial z_i^L} = \delta_{k, i} e^{z_i^L}
$$

For the second one, we can pull the partial derivative into the sum

$$
\cfrac{\partial}{\partial z_i^L} \sum_{j=0}^{n_L} e^{z_j^L} = \sum_{j=0}^{n_L} \cfrac{\partial}{\partial z_i^L} e^{z_j^L}
$$

Similarly to the derivative above, all the terms for which $j \neq i$ are zero, hence

$$
\cfrac{\partial}{\partial z_i^L} \sum_{j=0}^{n_L} e^{z_j^L} = e^{z_i^L}
$$

Plugging both derivatives back:

$$
\begin{align}
    \cfrac{\partial a_k^L}{\partial z_i^L} &= \cfrac{(\delta_{k, i} e^{z_i^L} \cdot \sum_{j=0}^{n_L} e^{z_j^L}) - (e^{z_k^L} \cdot e^{z_i^L}) }{(\sum_{j=0}^{n_L} e^{z_j^L})^2} \\
        &= \cfrac{e^{z_i^L}}{\sum_{j=0}^{n_L} e^{z_j^L} } \cdot \cfrac{\delta_{k, i} \sum_{j=0}^{n_L} e^{z_j^L} - e^{z_k^L}}{\sum_{j=0}^{n_L} e^{z_j^L}}
\end{align}
$$

The last step was just a little bit of algebraic rearrangement (factor out $e^{z_i^L}$, split into two fractions). Now, hopefully you can see that the first fraction is the definition of the $Softmax$ function. Also, we can split the second fraction at the minus sign: in the first part the sums cancel each other, leaving only the delta, and the second part is yet another instance of the $Softmax$ definition. In both instances, remember that $a_j^L = Softmax(j, L)$.

$$
\cfrac{\partial a_k^L}{\partial z_i^L} = a_i^L (\delta_{k, i} - a_k^L)
$$
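
One way to gain confidence in this result is to compare it against finite differences. The sketch below builds the matrix of all $\partial a_k^L / \partial z_i^L$ from the formula and from a numerical approximation. The function names and the example logits are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift for numerical stability
    return exp_z / np.sum(exp_z)

def softmax_jacobian(z):
    """J[k, i] = d a_k / d z_i = a_i * (delta_ki - a_k)."""
    a = softmax(z)
    # diag(a)[k, i] = a_i * delta_ki, outer(a, a)[k, i] = a_k * a_i
    return np.diag(a) - np.outer(a, a)

def numeric_jacobian(z, eps=1e-6):
    """Central-difference approximation, one logit at a time."""
    n = z.size
    J = np.zeros((n, n))
    for i in range(n):
        step = np.zeros(n)
        step[i] = eps
        J[:, i] = (softmax(z + step) - softmax(z - step)) / (2 * eps)
    return J

z = np.array([0.5, -1.0, 2.0])
max_error = np.max(np.abs(softmax_jacobian(z) - numeric_jacobian(z)))
print(max_error)  # should be very small
```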



Finally, plugging that derivative back into the initial expression:


$$
\begin{align}
\cfrac{\partial Loss}{\partial z^L_i} &= - \sum_{k=0}^{n_L} y_k \cdot \cfrac{1}{a_k^L} \cdot \cfrac{\partial a_k^L}{\partial z_i^L} \\
    &= - \sum_{k=0}^{n_L} y_k \cdot \cfrac{1}{a_k^L} \cdot a_i^L (\delta_{k, i} - a_k^L)
\end{align}
$$

Notice that we can "extract" the case $k = i$ from the summation

$$
\begin{align}
= - (y_i \cdot \cfrac{1}{a_i^L} \cdot a_i^L (\delta_{i, i} - a_i^L)) &- \sum_{k=0, k \neq i}^{n_L} y_k \cdot \cfrac{1}{a_k^L} \cdot a_i^L (\delta_{k, i} - a_k^L)\\
= - y_i (1 - a_i^L) &- \sum_{k=0, k \neq i}^{n_L} y_k \cdot \cfrac{1}{a_k^L} \cdot a_i^L (0 - a_k^L) \\
= - y_i + y_i a_i^L &- \sum_{k=0, k \neq i}^{n_L} y_k \cdot (- a_i^L) \\
= - y_i + a_i^L y_i &+ a_i^L \sum_{k=0, k \neq i}^{n_L} y_k \\
= - y_i + a_i^L & (y_i + \sum_{k=0, k \neq i}^{n_L} y_k) \\
= - y_i &+ a_i^L \sum_{k=0}^{n_L} y_k
\end{align}
$$

In the last step, we added the $k=i$ case back into the sum by absorbing $y_i$. Now, $\mathbf{y}$ is a one-hot encoded vector, therefore $\sum_{k=0}^{n_L} y_k = 1$:

$$
\cfrac{\partial Loss}{\partial z^L_i} = a_i^L - y_i
$$
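
The result is compact enough to check numerically. The sketch below compares $a_i^L - y_i$ with a finite-difference estimate of the loss gradient with respect to each logit. The function names and the example values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift for numerical stability
    return exp_z / np.sum(exp_z)

def loss_from_logits(z, y):
    # Cross-entropy of the softmax of the logits against a one-hot target.
    return -np.sum(y * np.log(softmax(z)))

z = np.array([0.2, -0.4, 1.5])   # logits of the last layer
y = np.array([0.0, 0.0, 1.0])    # one-hot target

analytic = softmax(z) - y        # a_i - y_i

eps = 1e-6
numeric = np.zeros_like(z)
for i in range(z.size):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (loss_from_logits(z + step, y)
                  - loss_from_logits(z - step, y)) / (2 * eps)

print(analytic)
print(numeric)  # should match the analytic gradient closely
```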

### $\partial z_i^L / \partial w_{ij}^L$

The logit of a unit is calculated by multiplying the previous layer activation by the weights matrix, then adding the biases. Substituting the definition we get:

$$
\begin{align}
\cfrac{\partial z_i^L}{\partial w_{ij}^L} &= \cfrac{\partial}{\partial w_{ij}^L} (\sum_{k=0}^{n_{L-1}} w_{ik}^{L} a_k^{L-1} + b_i^L) \\
    &= \sum_{k=0}^{n_{L-1}} \cfrac{\partial}{\partial w_{ij}^L} (w_{ik}^{L} a_k^{L-1}) + \cfrac{\partial b_i^L}{\partial w_{ij}^L} \\
    &= a_j^{L-1}
\end{align}
$$

As in other cases above, the partial derivative evaluates to zero for every term of the sum except $k = j$, and the bias does not depend on the weight.

### $\partial Loss / \partial w_{ij}^L$

Finally!

$$
\begin{align}
    \cfrac{\partial Loss}{\partial w_{ij}^L} &= \cfrac{\partial Loss}{\partial z_i^L} \cfrac{\partial z_i^L}{\partial w_{ij}^L} \\
        &= (a_i^L - y_i) ~ a_j^{L-1} \\
        &= a_j^{L-1} ~ (a_i^L - y_i) 
\end{align}
$$

We've been using index notation, but in NumPy we work with vectors and matrices. So, let's transform that expression into matrix form. To make it a little easier to see the correspondence between both representations, we can plug in some actual indices. For instance, output indices $i \in \{0, 1, 2, 3\}$ and input indices $j \in \{0, 1, 2\}$:

$$
\begin{bmatrix}
    \cfrac{\partial Loss}{\partial w_{00}^L} & \cfrac{\partial Loss}{\partial w_{01}^L} & \cfrac{\partial Loss}{\partial w_{02}^L} \\
    \cfrac{\partial Loss}{\partial w_{10}^L} & \cfrac{\partial Loss}{\partial w_{11}^L} & \cfrac{\partial Loss}{\partial w_{12}^L} \\
    \cfrac{\partial Loss}{\partial w_{20}^L} & \cfrac{\partial Loss}{\partial w_{21}^L} & \cfrac{\partial Loss}{\partial w_{22}^L} \\
    \cfrac{\partial Loss}{\partial w_{30}^L} & \cfrac{\partial Loss}{\partial w_{31}^L} & \cfrac{\partial Loss}{\partial w_{32}^L} \\
\end{bmatrix}
$$

$$
=\begin{bmatrix}
    (a_0^L - y_0) a_0^{L-1} & (a_0^L - y_0) a_1^{L-1} & (a_0^L - y_0) a_2^{L-1} \\
    (a_1^L - y_1) a_0^{L-1} & (a_1^L - y_1) a_1^{L-1} & (a_1^L - y_1) a_2^{L-1} \\
    (a_2^L - y_2) a_0^{L-1} & (a_2^L - y_2) a_1^{L-1} & (a_2^L - y_2) a_2^{L-1} \\
    (a_3^L - y_3) a_0^{L-1} & (a_3^L - y_3) a_1^{L-1} & (a_3^L - y_3) a_2^{L-1} \\
\end{bmatrix}
$$

$$
=\begin{bmatrix}
    (a_0^L - y_0) \\
    (a_1^L - y_1) \\
    (a_2^L - y_2) \\
    (a_3^L - y_3) \\
\end{bmatrix}
\otimes
\begin{bmatrix}
    a_0^{L-1} \\
    a_1^{L-1} \\
    a_2^{L-1} \\
\end{bmatrix}
$$

In full matrix form:

$$
\nabla_{\mathbf{W^L}} L =  (\mathbf{a^L} - \mathbf{y}) \otimes \mathbf{a^{L-1}}
$$
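
In NumPy this is a single call to `np.outer`. The sketch below uses 4 output units and 3 input activations, matching the shapes of the matrices above. The numbers are made up for illustration.

```python
import numpy as np

a_prev = np.array([0.5, 1.0, 2.0])        # a^{L-1}: 3 activations feeding the last layer
a_out = np.array([0.1, 0.2, 0.3, 0.4])    # a^L: 4 softmax outputs
y = np.array([0.0, 0.0, 1.0, 0.0])        # one-hot target

delta = a_out - y                          # dLoss/dz for each output unit
grad_W = np.outer(delta, a_prev)           # grad_W[i, j] = (a_i - y_i) * a_j^{L-1}

print(grad_W.shape)  # one row per output unit, one column per input
print(grad_W)
```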

### What about the biases?

Actually, we're almost done there:


$$
\cfrac{\partial Loss}{\partial b_{i}^L} = \cfrac{\partial Loss}{\partial z_i^L} \cfrac{\partial z_i^L}{\partial b_{i}^L}
$$

We just have to figure out $\partial z_i^L / \partial b_i^L$. Again, if we substitute with the definition for logits:


$$
\begin{align}
\cfrac{\partial z_i^L}{\partial b_i^L} &= \cfrac{\partial}{\partial b_i^L} (\sum_{k=0}^{n_{L-1}} w_{ik}^{L} a_k^{L-1} + b_i^L) \\
    &= \sum_{k=0}^{n_{L-1}} \cfrac{\partial}{\partial b_i^L} (w_{ik}^{L} a_k^{L-1}) + \cfrac{\partial b_i^L}{\partial b_i^L} \\
    &= 1
\end{align}
$$

None of the terms in the sum depends on $b_i^L$, so their derivatives are zero; the derivative of the bias with respect to itself is 1. Then:

$$
\cfrac{\partial Loss}{\partial b_{i}^L} = \cfrac{\partial Loss}{\partial z_i^L} \cfrac{\partial z_i^L}{\partial b_{i}^L} = a_i^L - y_i
$$

In vector form:

$$
\nabla_{\mathbf{b^L}} L = \mathbf{a^L} - \mathbf{y}
$$
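
Putting the last-layer results together, here is a minimal sketch that returns both gradients and checks one bias entry against finite differences. The function names, shapes, and numbers are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift for numerical stability
    return exp_z / np.sum(exp_z)

def forward_loss(W, b, a_prev, y):
    # Logits of the last layer, then cross-entropy of their softmax.
    z = W @ a_prev + b
    return -np.sum(y * np.log(softmax(z)))

def last_layer_grads(W, b, a_prev, y):
    delta = softmax(W @ a_prev + b) - y   # a^L - y
    grad_W = np.outer(delta, a_prev)      # (a^L - y) outer a^{L-1}
    grad_b = delta                        # a^L - y
    return grad_W, grad_b

W = np.arange(12).reshape(4, 3) / 10.0    # 4 output units, 3 inputs
b = np.array([0.1, -0.2, 0.3, 0.0])
a_prev = np.array([0.5, 1.0, 2.0])
y = np.array([0.0, 1.0, 0.0, 0.0])

grad_W, grad_b = last_layer_grads(W, b, a_prev, y)

# Finite-difference check on a single bias entry (index 2).
eps = 1e-6
b_plus, b_minus = b.copy(), b.copy()
b_plus[2] += eps
b_minus[2] -= eps
numeric_db = (forward_loss(W, b_plus, a_prev, y)
              - forward_loss(W, b_minus, a_prev, y)) / (2 * eps)
print(grad_b[2], numeric_db)  # should agree closely
```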

# Training the other layers

As usual, let's apply the chain rule:

$$
\cfrac{\partial L}{\partial w_{ij}^{L-1}} = \cfrac{\partial L}{\partial z_i^{L-1}} \cfrac{\partial z_i^{L-1}}{\partial w_{ij}^{L-1}}
$$

Or in recursive form:

$$
\cfrac{\partial L}{\partial w_{ij}^{l}} = \cfrac{\partial L}{\partial z_i^{l}} \cfrac{\partial z_i^{l}}{\partial w_{ij}^{l}}
$$

### $\partial z_i^{l} / \partial w_{ij}^l$


There is nothing special about the result from the last layer:

$$
\cfrac{\partial z_i^{L}}{\partial w_{ij}^L} = a_j^{L-1}
$$

so it can be written directly in recursive form:

$$
\cfrac{\partial z_i^{l}}{\partial w_{ij}^l} = a_j^{l-1}
$$

### $\partial L / \partial z_i^{l}$

A small change in a particular logit at layer $L-1$ affects the loss through **all** the logits in layer $L$, so we sum their contributions:

$$
\cfrac{\partial L}{\partial z_i^{L-1}} = \sum_{k=0}^{n_L} \cfrac{\partial L}{\partial z_k^L} \cfrac{\partial z_k^L}{\partial z_i^{L-1}}
$$

We already know $\partial L / \partial z_k^L$, so we can focus on the second factor. Substituting the definition of the logit, then the definition of the activation:

$$
\begin{align}
    \cfrac{\partial z_k^L}{\partial z_i^{L-1}} &= \cfrac{\partial}{\partial z_i^{L-1}} (\sum_{j=0}^{n_{L-1}} w_{kj}^L a_j^{L-1} + b_k^L) \\
        &= \cfrac{\partial}{\partial z_i^{L-1}} (\sum_{j=0}^{n_{L-1}} w_{kj}^L ReLU(z_j^{L-1}) + b_k^L) \\
        &= \sum_{j=0}^{n_{L-1}} \cfrac{\partial}{\partial z_i^{L-1}} (w_{kj}^L ReLU(z_j^{L-1})) + \cfrac{\partial b_k^L}{\partial z_i^{L-1}}
\end{align}
$$

As in other situations, the derivative is nonzero only if $j = i$, so:

$$
\cfrac{\partial z_k^L}{\partial z_i^{L-1}} = w_{ki}^L ReLU'(z_i^{L-1})
$$

Plugging that back, we get:

$$
\cfrac{\partial L}{\partial z_i^{L-1}} = \sum_{k=0}^{n_L} \cfrac{\partial L}{\partial z_k^L} w_{ki}^L ReLU'(z_i^{L-1})
$$

And then back into the original expression:

$$
\begin{align}
    \cfrac{\partial L}{\partial w_{ij}^{L-1}} &= [\sum_{k=0}^{n_L} \cfrac{\partial L}{\partial z_k^L} w_{ki}^L ReLU'(z_i^{L-1}) ] \cfrac{\partial z_i^{L-1}}{\partial w_{ij}^{L-1}} \\
        &= [ReLU'(z_i^{L-1}) \sum_{k=0}^{n_L} \cfrac{\partial L}{\partial z_k^L} w_{ki}^L] ~ a_j^{L-2}
\end{align}
$$


We can transform that into its recursive version

$$
    \cfrac{\partial L}{\partial w_{ij}^{l}} = [ReLU'(z_i^{l}) \sum_{k=0}^{n_{l+1}} \cfrac{\partial L}{\partial z_k^{l+1}} w_{ki}^{l+1}] ~ a_j^{l-1}
$$

Notice that (yep, just swapping the terms)

$$
\sum_{k=0}^{n_{l+1}} \cfrac{\partial L}{\partial z_k^{l+1}} w_{ki}^{l+1} = \sum_{k=0}^{n_{l+1}} w_{ki}^{l+1} \cfrac{\partial L}{\partial z_k^{l+1}}
$$

is just the $i$-th entry of a regular matrix-vector multiplication, using the transpose of the weight matrix because the sum runs over its first index. Introducing the nabla operator, we can rewrite it as

$$
\sum_{k=0}^{n_{l+1}} \cfrac{\partial L}{\partial z_k^{l+1}} w_{ki}^{l+1} = [(\mathbf{W}^{l+1})^T ~ \nabla_{z^{l+1}} L]_i
$$

This helps us write the expression in matrix form:

$$
\nabla_{W^l}L = [ReLU'(z^l) \odot ((\mathbf{W}^{l+1})^T \nabla_{z^{l+1}} L)] \otimes \mathbf{a^{l-1}}
$$
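
A minimal NumPy sketch of this backward step for one hidden layer. The function names, the argument names, and the example numbers are illustrative assumptions; `delta_next` stands for $\nabla_{z^{l+1}} L$, which is assumed to be already known from the layer above.

```python
import numpy as np

def relu_prime(z):
    # 1 where z > 0, 0 elsewhere (0 at z == 0 by convention).
    return (z > 0).astype(float)

def hidden_layer_grad_W(z_l, a_prev, W_next, delta_next):
    """Gradient of the loss w.r.t. W^l, plus the gradient w.r.t. z^l."""
    # ReLU'(z^l) Hadamard ((W^{l+1})^T nabla_{z^{l+1}} L)
    delta_l = relu_prime(z_l) * (W_next.T @ delta_next)
    # Outer product with the activations feeding layer l.
    return np.outer(delta_l, a_prev), delta_l

z_l = np.array([1.0, -2.0])                 # logits of layer l (2 units)
a_prev = np.array([0.5, 1.0, 2.0])          # a^{l-1} (3 inputs)
W_next = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])             # W^{l+1}: 3 units, 2 inputs
delta_next = np.array([0.1, -0.2, 0.3])     # nabla_{z^{l+1}} L

grad_W, delta_l = hidden_layer_grad_W(z_l, a_prev, W_next, delta_next)
print(delta_l)  # the second unit is inactive, so its entry is zero
print(grad_W)   # same shape as W^l: (2, 3)
```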

### What about the biases?

Similarly to the weights

$$
\cfrac{\partial L}{\partial b_i^{L-1}} = \cfrac{\partial L}{\partial z_i^{L-1}} \cfrac{\partial z_i^{L-1}}{\partial b_i^{L-1}}
$$

Again, there is nothing special about the last layer regarding how the logits are calculated; it's the same linear operation:

$$
\cfrac{\partial z_i^{L-1}}{\partial b_i^{L-1}} = 1
$$

And since we just derived the expression for $\partial L / \partial z_i^l$, we can go ahead and claim that:

$$
\nabla_{\mathbf{b}^l} L = ReLU'(z^l) \odot ((\mathbf{W}^{l+1})^T \nabla_{z^{l+1}} L)
$$

# Update rules for any layer

Ladies and gentlemen, here they are, the update rules for any layer:

$$
\mathbf{W^l} \leftarrow \mathbf{W^l} - \eta \nabla_{\mathbf{W^l}} L 
$$

$$
\mathbf{b^l} \leftarrow \mathbf{b^l} - \eta \nabla_{\mathbf{b^l}} L
$$
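
To see all the pieces working together, here is a minimal sketch of one gradient descent step on a tiny network with one ReLU hidden layer and a softmax output. The layer sizes, the random initialization, the learning rate, and every function name are assumptions made for illustration only.

```python
import numpy as np

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # shift for numerical stability
    return exp_z / np.sum(exp_z)

def forward(params, x):
    W1, b1, W2, b2 = params
    z1 = W1 @ x + b1
    a1 = np.maximum(0.0, z1)       # ReLU
    a2 = softmax(W2 @ a1 + b2)
    return z1, a1, a2

def loss(params, x, y):
    return -np.sum(y * np.log(forward(params, x)[2]))

def sgd_step(params, x, y, eta=0.1):
    W1, b1, W2, b2 = params
    z1, a1, a2 = forward(params, x)
    delta2 = a2 - y                                       # last layer
    delta1 = (z1 > 0).astype(float) * (W2.T @ delta2)     # hidden layer
    # W <- W - eta * grad_W, b <- b - eta * grad_b, for each layer
    return (W1 - eta * np.outer(delta1, x),
            b1 - eta * delta1,
            W2 - eta * np.outer(delta2, a1),
            b2 - eta * delta2)

rng = np.random.default_rng(0)
params = (rng.normal(scale=0.5, size=(5, 3)), np.zeros(5),
          rng.normal(scale=0.5, size=(4, 5)), np.zeros(4))
x = np.array([0.5, -1.0, 2.0])
y = np.array([0.0, 0.0, 1.0, 0.0])

new_params = sgd_step(params, x, y)
print(loss(params, x, y), loss(new_params, x, y))
```

For a small enough learning rate $\eta$, the second printed loss would typically be lower than the first, although a single step gives no guarantee.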