# Training with the Chain Rule

### Introduction

In the last lesson, we saw how we can use the derivative to find the rate of change of a composite function.  We also learned that our cost function is itself a composite function:

$J(w, b) = (y -  \sigma(z(w, b)))^2$

Remember that to find the parameters $w$ and $b$, we want to see how the output of our cost function $J(w, b)$ changes as we change $w$ and $b$, but this effect is indirect.

### Seeing the issue

One way to see this is simply to code out the cost function of our neuron.

In [17]:
import numpy as np
def sigmoid(val):
    sig = 1/(1 + np.exp(-val))
    print('sigma = ', sig)
    return sig 

def z(x, w, b):
    z_calc = w*x + b 
    print('z = ', z_calc)
    return z_calc

def h(x, w, b):
    return sigmoid(z(x, w, b))

Next we define a loss function $l$ to calculate the squared error at a single value.

In [25]:
def l(x, w, b, y):
    loss = (y - h(x, w, b))**2
    print('loss', loss)
    return loss

Ok, now let's see how our loss function changes as we change $w$ from .3 to .5.

In [36]:
l(1, .3, .3, 1)

z =  0.6
sigma =  0.6456563062257954
loss 0.12555945331754728


0.12555945331754728

In [37]:
l(1, .5, .3, 1)

z =  0.8
sigma =  0.6899744811276125
loss 0.0961158223520931


0.0961158223520931

Ok, great.  So we can see that changing $w$ affects the output of the function: $J(w, b) = (y -  \sigma(z(w, b)))^2$

We see that changing $w$ changes the output of $z$ which changes the output of the $\sigma$ function, which changes the output of $J(w, b)$. 
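One way to put a number on this chain of effects is to nudge $w$ slightly and measure how much the loss moves. The sketch below is an illustration rather than part of the original notebook: `loss` and `numerical_dJ_dw` are hypothetical names, the print statements are dropped, and `math.exp` stands in for `np.exp` so the cell runs on its own.

```python
import math

def sigmoid(val):
    return 1 / (1 + math.exp(-val))

def loss(x, w, b, y):
    # Same squared error as l(x, w, b, y) above, without the prints.
    return (y - sigmoid(w * x + b)) ** 2

def numerical_dJ_dw(x, w, b, y, eps=1e-6):
    # Central difference: change in the loss per unit change in w.
    return (loss(x, w + eps, b, y) - loss(x, w - eps, b, y)) / (2 * eps)

slope = numerical_dJ_dw(1, 0.3, 0.3, 1)
print(slope)  # negative: increasing w lowers the loss at this point
```

The slope at $w = .3$ comes out close to $-0.16$. That is in line with the two calls above, where the loss fell by about $0.029$ as $w$ rose by $.2$, an average slope of about $-0.147$.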

### Introducing the Chain Rule

As we saw in the last lesson, we can still calculate the derivative of a composite function like this by multiplying the derivatives of these nested functions together.

So if we think of our function $J(w, b)$ as:

$J(w, b) = l(y, \sigma(z(w, b, x)))$

Then we should be able to find this derivative by calculating:

$J'(w, b) = l'*\sigma'*z'$

Of course, we are finding the rate of change with respect to $w$ and $b$, so let's express this as the following:

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w} $

$\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta b} $

> Let's take a moment to articulate the above.  The change in our cost function $J$ as we change our parameter $w$ is the amount that a change in $w$ changes the linear function $z$, times the amount that a change in $z$ changes the output of $\sigma$, times the amount that a change in $\sigma$ changes $J$.

#### 1. Break it down

Now we can solve for the derivatives by calculating each of the components individually.

$$\frac{\delta l}{\delta \sigma}, \frac{\delta \sigma}{\delta z}, \frac{\delta z}{\delta w}, \frac{\delta z}{\delta b}$$ 

1. Finding $\frac{\delta l}{\delta \sigma}$

$l(\sigma) = (y - \sigma(z(w, b)))^2$

$\frac{\delta l}{\delta \sigma} = 2(y - \sigma(z(w, b))) \cdot (-1) = -2(y - \sigma(z(w, b))) = 2(\sigma(z(w, b)) - y)$

And we can represent this in code as:

In [41]:
def dl_dsig(w, b, x, y):
    # h is defined as h(x, w, b), so x goes first
    return 2*(h(x, w, b) - y)

> Remember that $h(x) = \sigma(z(w, b))$ so we use substitution above.

2. Finding $\frac{\delta \sigma}{\delta z}$

$\sigma(z) = \frac{1}{1 + e^{-z}}$

Finding the derivative of the sigmoid function is something we could calculate from scratch, but it would take a while and take us off track.  Let's just skip to the result.

$\sigma'(z) = \sigma(z)*(1 - \sigma(z))$

3. Finding $\frac{\delta z}{\delta w}$

$z(w, b) = wx + b$

$\frac{\delta z}{\delta w} = x$
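The cells that follow call `dsig_dz(w, b, x, y)` and `dz_dw(x)`, which this lesson never defines. Here is a minimal sketch of what they might look like, assuming those call signatures and restating `sigmoid` and `z` without their print statements (and with `math.exp` in place of `np.exp`) so the cell is self-contained. The finite-difference check at the end is an added sanity test for the sigmoid derivative we skipped deriving.

```python
import math

def sigmoid(val):
    return 1 / (1 + math.exp(-val))

def z(x, w, b):
    return w * x + b

def dsig_dz(w, b, x, y):
    # sigma'(z) = sigma(z) * (1 - sigma(z)).
    # y is unused; it is kept only to match the call signature used later.
    sig = sigmoid(z(x, w, b))
    return sig * (1 - sig)

def dz_dw(x):
    # z = w*x + b, so the derivative with respect to w is x.
    return x

# Sanity check: compare the formula to a central difference on sigmoid.
z_val, eps = z(1, 0.3, 0.3), 1e-6
numeric = (sigmoid(z_val + eps) - sigmoid(z_val - eps)) / (2 * eps)
print(dsig_dz(0.3, 0.3, 1, 1), numeric)  # the two values should nearly match
```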

### Combining Together

Ok, now we have found all of the derivatives necessary to find $\frac{\delta J}{\delta w}$.

$\frac{\delta J}{\delta w} = \frac{\delta l}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w} = 2(\sigma(z(w, b)) - y)*\sigma(z(w, b))(1 - \sigma(z(w, b))) * x$

So from there we write this in code as:

In [42]:
def dJ_dw(w, b, x, y):
    # dsig_dz and dz_dw are assumed to implement steps 2 and 3 above
    return dl_dsig(w, b, x, y)*dsig_dz(w, b, x, y)*dz_dw(x)
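To make the product concrete, here is the same calculation with numbers plugged in, using the inputs from the earlier `l(1, .3, .3, 1)` call. This is a worked illustration only: the `_val` variable names are made up for this sketch, and `math.exp` replaces `np.exp` so it runs standalone.

```python
import math

x, w, b, y = 1, 0.3, 0.3, 1          # same inputs as l(1, .3, .3, 1)

z_val = w * x + b                    # 0.6
sig = 1 / (1 + math.exp(-z_val))     # about 0.6457

dl_dsig_val = 2 * (sig - y)          # about -0.7087
dsig_dz_val = sig * (1 - sig)        # about 0.2288
dz_dw_val = x                        # 1

dJ_dw_val = dl_dsig_val * dsig_dz_val * dz_dw_val
print(dJ_dw_val)                     # roughly -0.162
```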

Now this is a perfectly valid way of calculating the derivative, but we'll generally see it written in code differently.  Here's how we normally see it:

In [45]:
def dJ_dw(w, b, x, y):
    dJ_dsig = dl_dsig(w, b, x, y)          # outermost derivative
    dJ_dz = dsig_dz(w, b, x, y)*dJ_dsig    # local derivative * upstream derivative
    dJ_dw = dz_dw(x)*dJ_dz
    return dJ_dw

What we just did was rewrite our function so that each step finds one component's impact on our cost function $J(w, b)$.  The key realization is that each derivative is the product of a local derivative and the upstream derivative that was calculated in the step before it.

So for example, looking at the last calculation in the code, we rewrote our derivative $\frac{\delta J}{\delta w}$ as:

$\frac{\delta J}{\delta w} =   \frac{\delta J}{\delta z}*\frac{\delta z}{\delta w}$

And this works because we already calculated $\frac{\delta J}{\delta z}$ as:

$\frac{\delta J}{\delta z} =  \frac{\delta J}{\delta \sigma}*  \frac{\delta \sigma}{\delta z} $

And we already calculated $\frac{\delta J}{\delta \sigma }$ as well.

So let's look at our code again.  This approach is called backpropagation.

In [None]:
def dJ_dw(w, b, x, y):
    dJ_dsig = dl_dsig(w, b, x, y)          # outermost derivative
    dJ_dz = dsig_dz(w, b, x, y)*dJ_dsig    # local derivative * upstream derivative
    dJ_dw = dz_dw(x)*dJ_dz
    return dJ_dw

The idea is that we calculate each function's impact on our cost function starting with the outermost layer.  Then as we move further down, we calculate each derivative by taking the local derivative and multiplying it by the upstream derivative we already have.

So this allows us to write:

$\frac{\delta J}{\delta w} =   \frac{\delta J}{\delta z}*\frac{\delta z}{\delta w}$
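One payoff of this staged form is that the upstream value $\frac{\delta J}{\delta z}$ can be reused for $b$ as well. Since $z = wx + b$, we have $\frac{\delta z}{\delta b} = 1$, so $\frac{\delta J}{\delta b}$ is simply $\frac{\delta J}{\delta z}$. The sketch below shows this under a few assumptions: `backward` is a hypothetical name, the forward pass is inlined rather than calling `h`, and `math.exp` stands in for `np.exp` so the cell runs by itself.

```python
import math

def sigmoid(val):
    return 1 / (1 + math.exp(-val))

def backward(w, b, x, y):
    # Forward pass: compute the intermediate values once.
    z_val = w * x + b
    sig = sigmoid(z_val)

    # Backward pass: start at the outermost function and work inward,
    # multiplying each local derivative by the upstream derivative.
    dJ_dsig = 2 * (sig - y)
    dJ_dz = sig * (1 - sig) * dJ_dsig
    dJ_dw = x * dJ_dz      # dz/dw = x
    dJ_db = 1 * dJ_dz      # dz/db = 1, so dJ_dz is reused as is
    return dJ_dw, dJ_db

grad_w, grad_b = backward(0.3, 0.3, 2, 1)
print(grad_w, grad_b)  # grad_w is x times grad_b, here twice as large
```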

### Summary
