# Part a) Analytical warm-up

When using our gradient machinery from project 1, we will need the expressions for the cost/loss functions and their respective gradients. The functions whose gradients we need are:

1. The mean-squared error (MSE) with and without the $L_1$ and $L_2$ norms (regression problems)
2. The binary cross entropy (aka log loss) for binary classification problems with and without the $L_1$ and $L_2$ norms
3. The multiclass cross entropy cost/loss function (aka Softmax cross entropy or just Softmax loss function)

Set up these three cost/loss functions and their respective derivatives and explain the various terms. In this project you will however only use the MSE and the Softmax cross entropy.

We will test three activation functions for our neural network setup:

1. the Sigmoid (aka logistic) function,
2. the ReLU function and
3. the Leaky ReLU function

Set up their expressions and their first derivatives. You may consult the lecture notes (with codes and more) from week 42 at https://compphysics.github.io/MachineLearning/doc/LectureNotes/_build/html/week42.html

## Neural networks basics

$$\boldsymbol{z^l = W^l \cdot a^{l-1} + b^l}$$
$$ \boldsymbol{a^l} = \sigma(\boldsymbol{z^l})$$

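where $l$ denotes the layer, $\boldsymbol{W^l}$ and $\boldsymbol{b^l}$ are the weights and biases of that layer, $\boldsymbol{a^{l-1}}$ is the output of the previous layer, and $\sigma$ is the activation function.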
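
The two equations above can be sketched in NumPy as a single forward step through one layer. This is only an illustration: the helper name `layer_forward`, the layer sizes and the random weights are assumptions made for the example, not part of the project code.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, b, a_prev, activation=sigmoid):
    """Return (z, a) for one layer: z = W @ a_prev + b, a = activation(z)."""
    z = W @ a_prev + b
    a = activation(z)
    return z, a

# Hypothetical sizes: 3 inputs feeding a layer with 2 nodes.
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 3))          # weights W^l
b = np.zeros(2)                      # biases b^l
a_prev = np.array([0.5, -1.0, 2.0])  # output a^{l-1} of the previous layer

z, a = layer_forward(W, b, a_prev)
print(z.shape, a.shape)
```

Returning both `z` and `a` is a deliberate choice here, since the derivative of the activation function is evaluated at `z` when the gradients are computed.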

## Activation functions and their derivatives

### Sigmoid function

$$\sigma(\boldsymbol{z^l}) = \frac{1}{1 + e^{-\boldsymbol{z^l}}}$$

$$\frac{d \sigma }{d \boldsymbol{z^l}} = \sigma(\boldsymbol{z^l}) \cdot (1 - \sigma(\boldsymbol{z^l}))$$
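
As a sanity check, the analytical derivative $\sigma(z)(1 - \sigma(z))$ can be compared with a central finite difference. The sketch below is a minimal NumPy version; the test grid and the step size `h` are arbitrary choices made for this example.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Analytical derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Compare against a central finite difference on a small grid (assumed values).
z = np.linspace(-5.0, 5.0, 11)
h = 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2.0 * h)
max_err = np.max(np.abs(numeric - sigmoid_derivative(z)))
print(max_err)
```

The printed maximum error should be tiny compared with the derivative itself, which peaks at $0.25$ for $z = 0$.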

### ReLU function

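$$\text{ReLU}(\boldsymbol{z^l}) = \max(0, \boldsymbol{z^l})$$

$$\frac{d\, \text{ReLU}}{d \boldsymbol{z^l}} = \begin{cases} 1, & \boldsymbol{z^l} > 0 \\ 0, & \boldsymbol{z^l} < 0 \end{cases}$$

The derivative is not defined at $\boldsymbol{z^l} = 0$; in practice it is usually set to $0$ there.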
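
A minimal NumPy sketch of the ReLU, which returns $z$ for positive inputs and $0$ otherwise, together with its derivative. Setting the derivative to $0$ at $z = 0$ is a convention assumed here, since the function is not differentiable at that point.

```python
import numpy as np

def relu(z):
    """Elementwise ReLU: max(0, z)."""
    return np.maximum(0.0, z)

def relu_derivative(z):
    """1 where z > 0, else 0 (the value at z = 0 is a chosen convention)."""
    return np.where(z > 0, 1.0, 0.0)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))             # [0. 0. 3.]
print(relu_derivative(z))  # [0. 0. 1.]
```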

### Leaky ReLU function
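
The Leaky ReLU is commonly defined with a small slope $\alpha$ for negative inputs, so that the gradient does not vanish completely there:

$$f(\boldsymbol{z^l}) = \begin{cases} \boldsymbol{z^l}, & \boldsymbol{z^l} > 0 \\ \alpha \boldsymbol{z^l}, & \boldsymbol{z^l} \leq 0 \end{cases} \qquad \frac{d f}{d \boldsymbol{z^l}} = \begin{cases} 1, & \boldsymbol{z^l} > 0 \\ \alpha, & \boldsymbol{z^l} \leq 0 \end{cases}$$

The NumPy sketch below follows this definition. The default value $\alpha = 0.01$ and the choice of derivative at $z = 0$ are assumptions made for the example, not values fixed by the project text.

```python
import numpy as np

def leaky_relu(z, alpha=0.01):
    """Elementwise Leaky ReLU: z for z > 0, alpha * z otherwise."""
    return np.where(z > 0, z, alpha * z)

def leaky_relu_derivative(z, alpha=0.01):
    """1 where z > 0, alpha otherwise (value at z = 0 is a convention)."""
    return np.where(z > 0, 1.0, alpha)

z = np.array([-2.0, 0.0, 3.0])
print(leaky_relu(z))
print(leaky_relu_derivative(z))
```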

## MSE as cost function and its derivatives

$$\text{MSE} = \boldsymbol{C(\theta)} = \frac{1}{2}(\boldsymbol{a^l - y})^2,$$

where $\boldsymbol{a^l}$ is the output of layer $l$, and $\boldsymbol{y}$ is the target value. Both $\boldsymbol{a^l}$ and the cost function thus depend on the activation function of the layer and on the parameters $\boldsymbol{\theta = (W, b)}$.
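
A small NumPy sketch of this cost function and its derivative with respect to the output $\boldsymbol{a^l}$. Summing the squared differences over the output nodes of a single sample is an assumption about how the vector expression is reduced to a scalar; the numbers are made up for illustration.

```python
import numpy as np

def mse_cost(a, y):
    """C = 1/2 * sum((a - y)^2) for a single sample (sum over output nodes)."""
    return 0.5 * np.sum((a - y) ** 2)

def mse_cost_derivative(a, y):
    """dC/da = a - y, elementwise."""
    return a - y

a = np.array([0.8, 0.1])  # hypothetical network output
y = np.array([1.0, 0.0])  # hypothetical target
print(mse_cost(a, y))
print(mse_cost_derivative(a, y))
```

The factor $\frac{1}{2}$ is what makes the derivative come out as $\boldsymbol{a^l - y}$ without an extra factor of two.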

To find the derivatives with respect to the weights and the biases, we use the chain rule:

$$\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{W}} = \frac{\partial \boldsymbol{C}}{\partial \boldsymbol{a}}\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z}}\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{W}}$$

$$\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{b}} = \frac{\partial \boldsymbol{C}}{\partial \boldsymbol{a}}\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z}}\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{b}}$$

### Using the Sigmoid activation function

We can do this step by step and gather the terms at the end. The sigmoid function is:

$$\boldsymbol{a^l} = \sigma(\boldsymbol{z^l}) = \frac{1}{1 + e^{-\boldsymbol{z^l}}}$$

#### $\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{W}}:$

$$\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{a^l}} = \frac{\partial}{\partial \boldsymbol{a^l}} (\frac{1}{2}(\boldsymbol{a^l - y})^2) = (\boldsymbol{a^l - y})$$

$$\frac{\partial \boldsymbol{a^l}}{\partial \boldsymbol{z^l}} = \frac{\partial \sigma(\boldsymbol{z^l})}{\partial \boldsymbol{z^l}} = \frac{\partial}{\partial \boldsymbol{z^l}}(\frac{1}{1 + e^{-\boldsymbol{z^l}}}) = \sigma(\boldsymbol{z^l}) \cdot (1 - \sigma(\boldsymbol{z^l})) = \boldsymbol{a^l} \cdot (1 - \boldsymbol{a^l})$$

$$\frac{\partial \boldsymbol{z^l}}{\partial \boldsymbol{W^l}} = \frac{\partial}{\partial \boldsymbol{W^l}}(\boldsymbol{W^l \cdot a^{l-1} + b^l}) = \boldsymbol{a^{l-1}}$$

Gathering the terms yields:

$$\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{W^l}} = (\boldsymbol{a^l - y}) \boldsymbol{a^l} \cdot (1 - \boldsymbol{a^l}) \boldsymbol{a^{l-1}}$$

By defining $\delta^l = (\boldsymbol{a^l - y}) \boldsymbol{a^l} \cdot (1 - \boldsymbol{a^l}) = \frac{\partial \boldsymbol{C}}{\partial \boldsymbol{a^l}} \sigma'$, we have the final expression for the gradient of the cost function with respect to the weights, using MSE and Sigmoid, as:

$$\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{W^l}} = \delta^l \cdot \boldsymbol{a^{l-1}}$$

#### $\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{b}}:$

For the gradient with respect to the bias we can reuse expressions for $\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{a}}$ and $\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{z}}$.

$$\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{b}}=1$$

So the gradient is simply:

$$\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{b^l}} = \delta^l$$

**MSE cost function derivatives with Sigmoid function**

$$\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{W^l}} = \delta^l \cdot \boldsymbol{a^{l-1}}$$

$$\frac{\partial \boldsymbol{C}}{\partial \boldsymbol{b^l}} = \delta^l$$

$$\delta^l = \frac{\partial \boldsymbol{C}}{\partial \boldsymbol{a^l}} \sigma ' $$
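
The expressions above can be checked numerically for a single layer. The sketch below is one possible reading, not the project's implementation: it treats the products in $\delta^l$ as elementwise, interprets $\delta^l \cdot \boldsymbol{a^{l-1}}$ as the outer product so that the gradient has the same shape as $\boldsymbol{W^l}$, and sums the cost over the output nodes. The sizes, random values and function names are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def cost(W, b, a_prev, y):
    """MSE cost 1/2 * sum((a - y)^2) with a sigmoid output layer."""
    a = sigmoid(W @ a_prev + b)
    return 0.5 * np.sum((a - y) ** 2)

def gradients(W, b, a_prev, y):
    """Analytical gradients: dC/dW = outer(delta, a_prev), dC/db = delta."""
    a = sigmoid(W @ a_prev + b)
    delta = (a - y) * a * (1.0 - a)   # delta^l = dC/da * sigma'(z)
    return np.outer(delta, a_prev), delta

# Hypothetical layer: 3 inputs, 2 output nodes.
rng = np.random.default_rng(1)
W = rng.normal(size=(2, 3))
b = rng.normal(size=2)
a_prev = rng.normal(size=3)
y = np.array([1.0, 0.0])

grad_W, grad_b = gradients(W, b, a_prev, y)

# Central finite differences, one parameter at a time.
h = 1e-6
num_W = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += h
        W_minus[i, j] -= h
        num_W[i, j] = (cost(W_plus, b, a_prev, y)
                       - cost(W_minus, b, a_prev, y)) / (2.0 * h)

num_b = np.zeros_like(b)
for i in range(b.size):
    b_plus, b_minus = b.copy(), b.copy()
    b_plus[i] += h
    b_minus[i] -= h
    num_b[i] = (cost(W, b_plus, a_prev, y)
                - cost(W, b_minus, a_prev, y)) / (2.0 * h)

print(np.max(np.abs(grad_W - num_W)))
print(np.max(np.abs(grad_b - num_b)))
```

Both printed differences should be close to zero if the analytical expressions and the numerical estimates agree.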