# Training with the Chain Rule

### Introduction

In the last lesson, we saw how we can use the chain rule to find the rate of change of a composite function.  We learned about this because our cost function is itself a composite function:

$J(w, b) = (y -  \sigma(z(w, b)))^2$

With gradient descent, we want to descend along a cost curve, and we do this by calculating the slope of the curve as we change our parameters, $w$ and $b$.  Calculating this slope means calculating the derivative, which involves the chain rule.

This is because nudging the parameters $w$ and $b$ affects the output of our cost function $J(w, b)$ only indirectly: the parameters change $z$, which changes $\sigma$, which changes $J$.

### Seeing the issue

Just like before, as a first step we can identify the components of our function, here:

$J(w, b) = (y -  \sigma(z(w, b)))^2$

It consists of the following functions:

* $J(y -h(x))$
* $\sigma(z)$
* $z(w, b)$

We can then translate these functions into code.

In [2]:
import numpy as np
def sigmoid(z):
    output = 1/(1 + np.exp(-z))
    print(f'sig(z) = sigmoid({round(z, 2)}) = ', round(output, 2))
    return output 

def z(x, w, b):
    output = w*x + b 
    print(f'z(x, w, b) = ({x}, {w}, {b}) = ', output)
    return output

def h(x, w, b):
    return sigmoid(z(x, w, b))

Then we define a loss function $J$ that calculates the squared error for a single observation.

In [3]:
def J(y, x, w, b):
    y_hat = h(x, w, b)
    loss = (y - y_hat)**2
    print(f'J = (y - y_hat)^2 = ({y} - {round(y_hat, 2)})^2 =' , loss)
    return loss

Ok, now let's see how our loss function changes as we change $w$ from $.1$ to $.5$.

* $w = .1$

In [4]:
J(1, 1, .1, 1)

z(x, w, b) = (1, 0.1, 1) =  1.1
sig(z) = sigmoid(1.1) =  0.75
J = (y - y_hat)^2 = (1 - 0.75)^2 = 0.062370014857361766


0.062370014857361766

* $w = .5$

In [5]:
J(1, 1, .5, 1)

z(x, w, b) = (1, 0.5, 1) =  1.5
sig(z) = sigmoid(1.5) =  0.82
J = (y - y_hat)^2 = (1 - 0.82)^2 = 0.033279071736023486


0.033279071736023486

Great, so we can see that changing $w$ changes the output of $z$, which changes the output of the $\sigma$ function, which in turn changes the output of $J(w, b)$.
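
To make this chain of effects easier to see, here is a minimal sketch that records each intermediate output as $w$ moves from $.1$ to $.5$. The helper name `trace` and the list `ws` are hypothetical, the print statements of the functions above are left out, and `math.exp` stands in for `np.exp` because the inputs are scalars.

```python
import math

def trace(y, x, w, b):
    # return the intermediate outputs z and sigma(z), plus the loss J, for one observation
    z_out = w * x + b
    sig_out = 1 / (1 + math.exp(-z_out))
    loss = (y - sig_out) ** 2
    return z_out, sig_out, loss

ws = [.1, .2, .3, .4, .5]  # hypothetical grid of values for w
rows = [trace(1, 1, w, 1) for w in ws]

for w, (z_out, sig_out, loss) in zip(ws, rows):
    print(f'w = {w}: z = {round(z_out, 2)}, sig(z) = {round(sig_out, 4)}, J = {round(loss, 4)}')
```

With $y = 1$, each increase in $w$ raises $z$, which raises $\sigma(z)$, which lowers the loss.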

> One thing to note is that we are not using the function $J = \frac{1}{n}\sum (y - \hat{y})^2$, but simply $J = (y - \hat{y})^2$.  This is ok: it will not change the logic or the math, and we'll just add back taking the average at the end.

### Back to the Chain Rule

As we saw in the last lesson, we can calculate the derivative of a composite function like this by multiplying the derivatives of the nested functions together.

So if we think of our function $J(w, b)$ as:

$J(w, b) = J(y, \sigma(z(w, b, x)))$

then we should be able to find the derivative $J'$ as:

$J'(w, b) = J'(\sigma(z(w, b))) \cdot \sigma'(z(w, b)) \cdot z'(w, b)$

But because we are finding the rate of change with respect to $w$ and $b$ separately, we express this with the partial derivatives $\frac{\delta J}{\delta w}$ and $\frac{\delta J}{\delta b}$:

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w} $

$\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta b} $

> Let's take a moment to articulate the first derivative, $\frac{\delta J}{\delta w}$, above.  It says that the change in our cost function $J$ as we change our parameter $w$ is the effect that $w$ has on the linear function $z$, times the amount that a change in $z$ affects the output of $\sigma$, times the amount that a change in $\sigma$ affects $J$.

#### 1. Break it down

Now we can solve for the derivatives by calculating each of the components individually.

$$\frac{\delta J}{\delta \sigma}, \frac{\delta \sigma}{\delta z}, \frac{\delta z}{\delta w}, \frac{\delta z}{\delta b}$$ 

> Note that we can calculate both $\frac{\delta J}{\delta w}$ and $\frac{\delta J}{\delta b}$ with only the four components above.

Ok, time to calculate our first derivative.

1. Finding $\frac{\delta J}{\delta \sigma}$

$J(\sigma) = (y - \sigma)^2$

> Where $\sigma$ really is  $\sigma(z(w, b))$

So now finding $\frac{\delta J}{\delta \sigma}$ we get:

$\frac{\delta J}{\delta \sigma} = 2(y - \sigma) \cdot (-1) = -2(y - \sigma) = 2(\sigma - y)$

And we can represent this in code as:

In [41]:
def dl_dsig(w, b, x, y):
    # h is defined as h(x, w, b), so the arguments are passed in that order
    return 2*(h(x, w, b) - y)

> This works because $\sigma = \sigma(z(w, b))$ and $h(x) = \sigma(z(w, b))$ so we use substitution above.
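
As a quick sanity check on this first component, we can treat $J$ as a function of $\sigma$ alone and compare the formula $2(\sigma - y)$ with a numerical slope. This is only a sketch: the names `J_of_sigma` and `dJ_dsigma`, the step size `eps` and the sample value $\sigma = 0.75$ are assumptions made for illustration.

```python
def J_of_sigma(sig, y):
    # the loss written directly in terms of the sigmoid output
    return (y - sig) ** 2

def dJ_dsigma(sig, y):
    # the derivative found above: 2(sigma - y)
    return 2 * (sig - y)

sig, y = 0.75, 1   # hypothetical sample values
eps = 1e-6         # hypothetical step size for the numerical slope

# central difference: rise over run across a small interval around sig
numeric = (J_of_sigma(sig + eps, y) - J_of_sigma(sig - eps, y)) / (2 * eps)
print(dJ_dsigma(sig, y), numeric)
```

The two printed values should agree closely, and both are negative because the prediction is below $y$.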

Now onto finding the derivative of the second component.

2. Finding $\frac{\delta \sigma}{\delta z}$

$\sigma(z) = \frac{1}{1 + e^{-z}}$

The derivative of the sigmoid function can be written in terms of the sigmoid function itself:

$\sigma'(z(x)) = \sigma(z(x))*(1 - \sigma(z(x)))$
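
The sigmoid derivative can be sketched in code as follows. The name `dsig_dz` and its `(w, b, x, y)` signature are assumptions, chosen to match how the function is called when the derivatives are multiplied together later in this lesson. `sigmoid` is restated without its print statement, with `math.exp` standing in for `np.exp` because the inputs are scalars.

```python
import math

def sigmoid(z):
    # print-free copy of the lesson's sigmoid, using math.exp for scalar inputs
    return 1 / (1 + math.exp(-z))

def dsig_dz(w, b, x, y):
    # y is unused; it is kept only so the signature matches the other derivatives
    s = sigmoid(w * x + b)
    return s * (1 - s)

# compare with a central-difference estimate of the slope of sigmoid at z
w, b, x, y = .1, 1, 1, 1
z_value = w * x + b
eps = 1e-6  # hypothetical step size
numeric = (sigmoid(z_value + eps) - sigmoid(z_value - eps)) / (2 * eps)
print(dsig_dz(w, b, x, y), numeric)
```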

3. Finding $\frac{\delta z}{\delta w}$

Remember that $z(w, b) = wx + b$, so:

$\frac{\delta z}{\delta w} = x$
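
In code, this last component is just the input $x$. The sketch below assumes a function named `dz_dw(x)`, matching the call used when the derivatives are combined, and checks it against a numerical slope. The sample values and `eps` are hypothetical.

```python
def z(x, w, b):
    # print-free copy of the linear function from earlier in the lesson
    return w * x + b

def dz_dw(x):
    # nudging w changes z at a rate of x
    return x

x, w, b = 3, .1, 1  # hypothetical sample values
eps = 1e-6          # hypothetical step size

# central difference in w, holding x and b fixed
numeric = (z(x, w + eps, b) - z(x, w - eps, b)) / (2 * eps)
print(dz_dw(x), numeric)
```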

### Combining Together

Ok, now we have found all of the derivatives necessary to find $\frac{\delta J}{\delta w}$.

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w} = 2(\sigma(z) - y) \cdot \sigma(z)(1 - \sigma(z)) \cdot x$

We can represent this in code as simply the multiplication of all of our previously found derivatives:

In [42]:
def dJ_dw(w, b, x, y):
    # chain rule: dJ/dsigma * dsigma/dz * dz/dw
    return dl_dsig(w, b, x, y)*dsig_dz(w, b, x, y)*dz_dw(x)

Excellent, we did it.  We found $\frac{\delta J}{\delta w}$.
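
One way to gain confidence in this result is a gradient check: compare the chain rule product with a numerical slope of $J$ itself. The block below is a self-contained sketch, not the lesson's own cell. It restates the functions without print statements, uses `math.exp` in place of `np.exp`, and the definitions of `dsig_dz` and `dz_dw` as well as the step size `eps` are assumptions.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def h(x, w, b):
    return sigmoid(w * x + b)

def J(y, x, w, b):
    return (y - h(x, w, b)) ** 2

def dl_dsig(w, b, x, y):
    return 2 * (h(x, w, b) - y)

def dsig_dz(w, b, x, y):
    # assumed definition: sigma(z) * (1 - sigma(z)); y is unused
    s = h(x, w, b)
    return s * (1 - s)

def dz_dw(x):
    # assumed definition: z = w*x + b, so dz/dw = x
    return x

def dJ_dw(w, b, x, y):
    return dl_dsig(w, b, x, y) * dsig_dz(w, b, x, y) * dz_dw(x)

w, b, x, y = .1, 1, 1, 1
eps = 1e-6  # hypothetical step size

analytic = dJ_dw(w, b, x, y)
# central difference: nudge w up and down and measure the change in J
numeric = (J(y, x, w + eps, b) - J(y, x, w - eps, b)) / (2 * eps)
print(analytic, numeric)
```

Both values should be negative and nearly identical, which is consistent with the loss falling as $w$ increased from $.1$ to $.5$ earlier.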

### Calculating the other partial derivative

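Because $z(w, b) = wx + b$, we have $\frac{\delta z}{\delta b} = 1$, so only the last factor changes from $\frac{\delta J}{\delta w}$:

$\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta b}$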

In [12]:
def dJ_db(w, b, x, y):
    # same first two factors as dJ_dw; the last factor is dz/db = 1
    return dl_dsig(w, b, x, y)*dsig_dz(w, b, x, y)*1
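
With both partial derivatives available, they can drive the gradient descent mentioned in the introduction. The following is only a sketch under stated assumptions: the `gradients` helper, the `learning_rate` of 0.5 and the 20 steps are hypothetical choices rather than values from the lesson, and it trains on the single observation $x = 1$, $y = 1$ used earlier instead of averaging over a dataset.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def J(y, x, w, b):
    return (y - sigmoid(w * x + b)) ** 2

def gradients(w, b, x, y):
    # hypothetical helper returning (dJ/dw, dJ/db) via the chain rule
    s = sigmoid(w * x + b)
    shared = 2 * (s - y) * s * (1 - s)  # dJ/dsigma * dsigma/dz
    return shared * x, shared * 1       # dz/dw = x, dz/db = 1

w, b = .1, 1
x, y = 1, 1
learning_rate = 0.5  # hypothetical value

losses = [J(y, x, w, b)]
for _ in range(20):
    dw, db = gradients(w, b, x, y)
    # step each parameter against its slope to move down the cost curve
    w -= learning_rate * dw
    b -= learning_rate * db
    losses.append(J(y, x, w, b))

print(losses[0], losses[-1])
```

Because each update moves the parameters in the direction opposite to the slope, the recorded loss shrinks from one step to the next in this example.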

### Summary

In this lesson, we saw how to think of our cost function $J$ as a composite function:

$J(w, b) = (y -  \sigma(z(w, b)))^2$

Therefore, to see how nudging our parameters, here $w$ and $b$, will affect the output of our cost function, we need to apply the chain rule.  This means that we solved for the rate of change of our cost function with respect to one of our parameters, $\frac{\delta J}{\delta w}$ or $\frac{\delta J}{\delta b}$, by calculating each of the components individually.

$$\frac{\delta J}{\delta \sigma}, \frac{\delta \sigma}{\delta z}, \frac{\delta z}{\delta w}, \frac{\delta z}{\delta b}$$

Once we calculated these derivatives, we could combine them to find how the cost function changes with respect to each of our parameters.  That is, we could then find $\frac{\delta J}{\delta w}$ and $\frac{\delta J}{\delta b}$ as:

$\frac{\delta J}{\delta w}  = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w} $  and

$\frac{\delta J}{\delta b}  = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta b} $   
