#  BackPropagation With Layers

### Introduction

Over the last lessons, we learned how to write the hypothesis function of a neural network that consists of multiple layers.

The chain rule is important here because our neural network is one large composite function.  To see how to update the parameters of this composite function, we need to consider how each weight and bias ultimately affects the predictions of our neural network, and thus our cost function.  Let's see this, and then we'll use the chain rule to update the parameters of our function.

### Neural Networks as a Composite Function

Remember that in our fully connected neural network, we have a series of layers, and ultimately output a prediction $\hat{y}$.

$$
\begin{aligned}
z_1 & = xW_1 + b_1 \\
a_1 & = sigmoid(z_1) \\
z_2 & = a_1W_2 + b_2 \\
\end{aligned}
$$

> Here we leave off the softmax function, to keep things simpler.  If interested, the resources below offer explanation as to how to include it in the model.

And we can rewrite this as a composite function:

$\hat{y} = z_2(\sigma(z_1(x)))$

So interpreting the composite function above, we take the features of an observation, $x$, and feed that to our linear layer, whose output is fed to the sigmoid function, and so on.
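
To make the composition concrete, here is a minimal sketch of the forward pass in NumPy.  The function name `forward`, the layer sizes, and the random inputs are assumptions chosen for illustration; only the three equations above come from the lesson.

```python
import numpy as np

def sigma(x):
    """Sigmoid activation, applied elementwise."""
    return 1 / (1 + np.exp(-x))

def forward(X, W1, b1, W2, b2):
    """Forward pass of the two-layer network (softmax left off)."""
    z1 = X.dot(W1) + b1    # first linear layer
    a1 = sigma(z1)         # activation layer
    z2 = a1.dot(W2) + b2   # second linear layer, used as y_hat
    return z1, a1, z2

# Hypothetical sizes: 5 observations, 3 features, 4 hidden units, 1 output
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

z1, a1, y_hat = forward(X, W1, b1, W2, b2)
print(y_hat.shape)  # one prediction per observation
```

Each line of `forward` is one of the nested functions in $z_2(\sigma(z_1(x)))$, evaluated from the inside out.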

### Applying Gradient Descent

Remember that we evaluate our neural network with the cost function:

$J(\theta) = (y_i - z_2(\sigma(z_1(x))))^2$

> We use $\theta$ to broadly refer to the parameters of a function.  So here, $\theta$ represents the weights and biases of both linear layers.

Now let's say we want to make an update to the parameters in our neural network.  Let's just focus on the parameters in our second linear layer, $z_2$.  How should we update the parameters in $z_2$?

Well, we should find the direction of steepest descent by finding how nudging the parameters of $z_2$ changes the output of the cost function.  And we do this through the chain rule.

The parameters of the second linear layer impact the cost function by changing the output of $z_2$, which changes the output of our cost function, $J$.  So:

$\frac{\delta J}{\delta \theta_{z_2}} = \frac{\delta J}{\delta z_2} * \frac{\delta z_2}{\delta \theta_{z_2}}$

And if we want to see how to update the parameters of $z_1$ to approach the minimum of the cost curve, we again need to use the chain rule to consider the impact of the parameters of $z_1$ on the rest of the neural network, and ultimately the cost function.

So to do this, let's keep our chain rule procedure of calculating each of our component derivatives individually.

### Moving to Gradients

Ok, now below we'll move through finding the gradient of some of the components of our neural network.  We do so to see the application of the chain rule in a neural network.  Our focus **will not** be on, say, finding the [derivative of the sigmoid function](https://towardsdatascience.com/derivative-of-the-sigmoid-function-536880cf918e).  If you'd like a deeper discussion of these individual derivatives, check out the resources below.

Ok, again this is our cost function.

$J(\theta) = (y_i - z_2(\sigma(z_1(x))))^2$

1. Derivative of Cost Function 

The first component that we can find the derivative of is the cost function.  Our cost function depends on the inputs from $z_2$.  And thus:

$J(z_2) = (y - z_2)^2$, and the derivative of $J$ with respect to $z_2$ is:

$\frac{dJ}{dz_2} = 2(y - z_2) \cdot (-1) = -2y + 2z_2 = 2(z_2 - y)$
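
One way to sanity check this derivative is to compare it with a finite-difference estimate.  The sketch below does that for a single scalar prediction; the values of `z2` and `y` and the helper names `cost` and `dcost_dz2` are made up for illustration.

```python
def cost(z2, y):
    """Squared error for a single prediction."""
    return (y - z2) ** 2

def dcost_dz2(z2, y):
    """Analytic derivative from the text: 2(z2 - y)."""
    return 2 * (z2 - y)

# Hypothetical prediction and target
z2, y = 0.7, 1.0
eps = 1e-6

# Central difference: nudge z2 a little in each direction
numeric = (cost(z2 + eps, y) - cost(z2 - eps, y)) / (2 * eps)
analytic = dcost_dz2(z2, y)

print(numeric, analytic)  # the two values should agree closely
```

The derivative is negative here because the prediction is below the target, so increasing $z_2$ lowers the cost.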

2. Derivative of a Linear layer

After finding the change in cost with respect to the outputs of $z_2$, we move backwards to calculating the derivative for the parameters of $z_2$.  How will our cost function change as we change these components?

$$J(z_2(x,W,b))$$

Remember that our second linear layer $z_2$ looks like the following:

$z_2 = xW_2 + b_2$

In this case, $x$ is the output from our activation layer $\sigma$, which we'll refer to as $a_1$.  So:

$z_2 = a_1W_2 + b_2$

Now we'll need to use the derivative to find the impact of changing our parameters in $W$ and $b$.

> To do so, we need to go into matrix calculus.  We won't prove it here, but [the general rule](https://explained.ai/matrix-calculus/) is, if $f(x) = x\cdot W$, then 
> $\frac{\delta f}{\delta x} = W^T$.

So knowing this, we take the derivative of the parameters of $z_2 = a_1W_2 + b_2$ and get:

$\frac{\delta J}{\delta W_2}  = a_1^T \cdot \frac{\delta J}{\delta z_2} $ , and $\frac{\delta J}{\delta b_2} = [1 ... 1] \cdot \frac{\delta J}{\delta z_2} $ 
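
Here is a small sketch of these two gradients in NumPy, with a finite-difference check on one weight and on the bias.  It assumes the cost is summed over the observations, and the array sizes and random values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
a1 = rng.uniform(size=(5, 4))   # activations: 5 observations, 4 hidden units
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))
Y = rng.normal(size=(5, 1))

def cost(W2, b2):
    """Sum of squared errors over all observations (an assumption)."""
    z2 = a1.dot(W2) + b2
    return np.sum((Y - z2) ** 2)

# dJ/dz2 = 2(z2 - y), calculated once and reused below
dz2 = 2 * (a1.dot(W2) + b2 - Y)      # shape (5, 1)
dW2 = a1.T.dot(dz2)                  # shape (4, 1), same as W2
db2 = np.ones((1, 5)).dot(dz2)       # shape (1, 1), same as b2

# Finite-difference check on W2[0, 0] and on b2
eps = 1e-6
W2_plus, W2_minus = W2.copy(), W2.copy()
W2_plus[0, 0] += eps
W2_minus[0, 0] -= eps
numeric_dW2_00 = (cost(W2_plus, b2) - cost(W2_minus, b2)) / (2 * eps)
numeric_db2 = (cost(W2, b2 + eps) - cost(W2, b2 - eps)) / (2 * eps)

print(numeric_dW2_00, dW2[0, 0])
print(numeric_db2, db2[0, 0])
```

Multiplying by the row of ones simply sums $\frac{\delta J}{\delta z_2}$ over the observations, which is the same thing as `np.sum(dz2, axis=0, keepdims=True)`.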

Now let's better understand these two derivatives.

1. Understanding $\frac{\delta J}{\delta z_2}$

Both of the derivatives above are multiplied by $\frac{\delta J}{\delta z_2}$ because of the chain rule.  To find the impact of a change in the parameters of $z_2$, we need to assess the change in $z_2$ and, in turn, the change in our cost function $J$.

And notice that we are not explicitly calculating $\frac{\delta J}{\delta z_2}$ here.  Why not?  We already calculated it above: $2(z_2 - y)$.

> This is **backpropagation**.  We calculate the derivative of the outer function $\frac{\delta J}{\delta z_2}$, and then use this already calculated derivative in finding the gradient of the inner layers, via the chain rule.  We'll continue to see this.

2. We're not finished

So above, we see the impact of changing the parameters $W_2$ and the bias vector $b_2$.  But take another look at our second linear layer:

$$z_2 = a_1W_2 + b_2$$

We'll also have to find the derivative with respect to our inputs $a_1$.  Why?  While it's true that we cannot directly change $a_1$, we will need this derivative when we calculate the derivative of our first linear layer.  Changing that first linear layer changes the outputs of the activation layer, so we need to see the activation layer's impact on $z_2$, and ultimately $J$.  Ok, so here's the derivative of $J$ with respect to $a_1$.

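$\frac{\delta J}{\delta a_1} = \frac{\delta J}{\delta z_2} \cdot W_2^T$

> Again, we multiply by $\frac{\delta J}{\delta z_2}$ because of the chain rule.  The factor $\frac{\delta J}{\delta z_2}$ comes first so that the matrix shapes line up with $z_2 = a_1W_2 + b_2$.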
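
The sketch below checks the gradient with respect to $a_1$ numerically.  It assumes one observation per row, so the product is written as `dz2.dot(W2.T)`, and it again assumes a cost summed over observations; the sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
a1 = rng.uniform(size=(5, 4))   # 5 observations, 4 hidden units
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))
Y = rng.normal(size=(5, 1))

def cost(activations):
    """Sum of squared errors as a function of the activations."""
    z2 = activations.dot(W2) + b2
    return np.sum((Y - z2) ** 2)

dz2 = 2 * (a1.dot(W2) + b2 - Y)   # dJ/dz2, shape (5, 1)
da1 = dz2.dot(W2.T)               # dJ/da1, shape (5, 4), same as a1

# Finite-difference check on a single activation, a1[2, 1]
eps = 1e-6
bump = np.zeros_like(a1)
bump[2, 1] = eps
numeric = (cost(a1 + bump) - cost(a1 - bump)) / (2 * eps)

print(numeric, da1[2, 1])
```

Notice that `da1` has the same shape as `a1`: there is one derivative for every activation of every observation.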

3. Derivative of our activation layer

Finally, let's see the derivative of the sigmoid function.  Remember that we use our sigmoid function as our activation layer in our neural network:   $$J(z_2(\sigma(z_1(X))))$$.

Here we assign the output of the sigmoid function to be $a_1 = \sigma(z_1)$.  Then the derivative of our cost function with respect to $z_1$, the input to the sigmoid, is:

* $\frac{\delta J}{\delta z_1}= \sigma(z_1)*(1 - \sigma(z_1)) *\frac{\delta J}{\delta a_1}$
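
The local derivative of the sigmoid, $\sigma(x)(1 - \sigma(x))$, can be checked the same way.  This is only a sketch: the helper name `dsigma` and the sample inputs are assumptions.

```python
import numpy as np

def sigma(x):
    """Sigmoid activation, applied elementwise."""
    return 1 / (1 + np.exp(-x))

def dsigma(x):
    """Derivative of the sigmoid with respect to its input."""
    return sigma(x) * (1 - sigma(x))

# Hypothetical inputs to the activation layer
x = np.array([-2.0, 0.0, 3.0])
eps = 1e-6

numeric = (sigma(x + eps) - sigma(x - eps)) / (2 * eps)
analytic = dsigma(x)

print(numeric)
print(analytic)
```

In backpropagation this local derivative is multiplied elementwise by the already calculated $\frac{\delta J}{\delta a_1}$, which is why the formula above ends with that term.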

### Turning this into Code

Now take a look at the code below.  One of the important things to see is how, as we move down through the code, we reuse the derivatives we previously calculated.

```python
import numpy as np

def sigma(x):
    return 1/(1 + np.exp(-x))

def backwards(L1, a1, W2, Y, Y_hat, X):
    # L1 is the output of the first linear layer (z1), and a1 = sigma(L1)
    # grad loss
    # dJ/dz2 = 2(y_hat - y)
    dloss = 2*(Y_hat - Y)
    # grad z2 = a1W2 + b2
    # dJ/dW2 = a1.T * dJ/dz2
    dW2 = (a1.T).dot(dloss)
    # dJ/db2 = [1 ... 1] * dJ/dz2
    db2 = np.sum(dloss, axis=0, keepdims=True)
    # dJ/da1 = dJ/dz2 * W2.T
    da1 = dloss.dot(W2.T)

    # grad sigma
    # dJ/dz1 = sig(L1)(1 - sig(L1))*da1
    d_sigma = sigma(L1)*(1 - sigma(L1))*da1
    # grad z1 = XW1 + b1
    # dJ/dW1 = X.T * dJ/dz1
    dW1 = np.dot(X.T, d_sigma)
    # dJ/db1 = [1 ... 1] * dJ/dz1
    db1 = np.sum(d_sigma, axis=0, keepdims=True)

    return (dW1, db1, dW2, db2)
```
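
To see the function in use, here is a hedged end-to-end sketch: a forward pass, a backward pass, a finite-difference check on one entry of `W1`, and a single gradient descent step.  The `forward` and `cost` helpers, the layer sizes, the random data, and the learning rate are all assumptions, and `backwards` is restated so the block runs on its own.  It takes the cost to be the sum of squared errors, so the loss gradient keeps the factor of 2 from $2(z_2 - y)$.

```python
import numpy as np

def sigma(x):
    return 1 / (1 + np.exp(-x))

def forward(X, W1, b1, W2, b2):
    L1 = X.dot(W1) + b1          # first linear layer (z1)
    a1 = sigma(L1)               # activation layer
    Y_hat = a1.dot(W2) + b2      # second linear layer (z2)
    return L1, a1, Y_hat

def backwards(L1, a1, W2, Y, Y_hat, X):
    # Restated from the lesson; each step reuses the gradient before it
    dloss = 2 * (Y_hat - Y)                          # dJ/dz2
    dW2 = a1.T.dot(dloss)                            # dJ/dW2
    db2 = np.sum(dloss, axis=0, keepdims=True)       # dJ/db2
    da1 = dloss.dot(W2.T)                            # dJ/da1
    d_sigma = sigma(L1) * (1 - sigma(L1)) * da1      # dJ/dz1
    dW1 = X.T.dot(d_sigma)                           # dJ/dW1
    db1 = np.sum(d_sigma, axis=0, keepdims=True)     # dJ/db1
    return dW1, db1, dW2, db2

def cost(X, Y, W1, b1, W2, b2):
    return np.sum((Y - forward(X, W1, b1, W2, b2)[2]) ** 2)

# Hypothetical data: 6 observations, 3 features, 4 hidden units, 1 output
rng = np.random.default_rng(3)
X, Y = rng.normal(size=(6, 3)), rng.normal(size=(6, 1))
W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))

L1, a1, Y_hat = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backwards(L1, a1, W2, Y, Y_hat, X)

# Finite-difference check on W1[0, 0], the parameter furthest from the cost
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric_dW1_00 = (cost(X, Y, W1_plus, b1, W2, b2)
                  - cost(X, Y, W1_minus, b1, W2, b2)) / (2 * eps)
print(numeric_dW1_00, dW1[0, 0])

# One small gradient descent step on every parameter
learning_rate = 0.001
cost_before = cost(X, Y, W1, b1, W2, b2)
W1, b1 = W1 - learning_rate * dW1, b1 - learning_rate * db1
W2, b2 = W2 - learning_rate * dW2, b2 - learning_rate * db2
cost_after = cost(X, Y, W1, b1, W2, b2)
print(cost_before, cost_after)
```

The check on `W1[0, 0]` exercises the whole chain, since a change in that weight has to pass through the sigmoid and the second linear layer before it reaches the cost.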

### Backpropagation in the code

The important component of the code above is seeing the backpropagation.  Take a look at the code where we find the derivative of the cost function with respect to the input of the sigmoid function.

```python
# grad sigma
# dJ/dz1 = sig(L1)(1 - sig(L1))*da1
d_sigma = sigma(L1)*(1 - sigma(L1))*da1
```

At the very end we multiply by `da1`, the previously calculated impact of nudging the activation layer.  Now let's look at `da1`.

```python
da1 = dloss.dot(W2.T)
```

`da1` is `dloss`, our $\frac{\delta J}{\delta z_2}$, multiplied by $W_2^T$.

So the point is that because we started at the output layer and calculated the derivative of each layer in turn, applying the chain rule further down in the neural network is not so difficult: we have already done the work.  Take a look through the code again to see how this works.

### Summary

In this lesson we saw how backpropagation occurs in a neural network.  Backpropagation is simply an efficient way of using the chain rule to find the gradients of the parameters of a neural network.  

We saw that we can think of the loss function of a neural network as a composite function:

$J(\theta) = (y_i - z_2(\sigma(z_1(x))))^2$

And so to see how to update the weight matrix of a linear layer like $W_2$, we find the derivative of our cost function with respect to each of the functions, starting with the outermost function, $\frac{\delta J}{\delta z_2}$.  Then we continue to work our way towards the parameters of the input layer.  As we move downward towards the input layer, we have already calculated the derivatives further up the chain, so we do not need to recalculate them.


### Resources

* Code Resources
    * [Code derived from wildml's blog post](https://github.com/dennybritz/nn-from-scratch)
    * [gradient of bias code explained](https://datascience.stackexchange.com/questions/20139/gradients-for-bias-terms-in-backpropagation)
* Softmax Derivatives
    * [derivative of softmax](https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function)
    * [softmax](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/)
* Matrix Calculus Resources
    * [Excellent matrix calculus guide](http://cs231n.stanford.edu/vecDerivs.pdf)
    * [Fast ai matrix calculus](https://explained.ai/matrix-calculus/)