# Understanding the Chain Rule

### Introduction

Ok, so now that we understand a bit more about derivatives, it's time to go a bit deeper with gradient descent.  Remember that this is why we learned about derivatives.  Derivatives allow us to calculate the slope of our cost curve, or to put it another way, they allow us to calculate how our cost function changes as we change our parameters in our neural network.  

<img src="./cost-curve-slopes.png" width="40%">

Now we generally define our cost function as $J(\theta)$.  So our cost changes as we change our parameters $\theta$.  So really when we have our formula of $next\_w_1 = w_1 - learning\_rate*slope\_at(w_1)$ it's generally written as:

* $next\_w_1 = w_1 - learning\_rate* \frac{\delta J}{\delta w_1}$
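
To make the update rule concrete, here is a minimal sketch of a single step.  The toy cost $(w - 3)^2$, the name `slope_at`, and the learning rate of `0.1` are all assumptions chosen for illustration; they are not the cost function we'll use in this lesson.

```python
# Hypothetical toy cost with its minimum at w = 3 (not the lesson's cost function).
def cost(w):
    return (w - 3.0) ** 2

def slope_at(w):
    # derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

learning_rate = 0.1   # assumed value, for illustration only
w_1 = 1.0

# the update rule: step against the slope
next_w_1 = w_1 - learning_rate * slope_at(w_1)

print(f"next_w_1 = {next_w_1:.2f}")
print(f"cost before = {cost(w_1):.2f}, cost after = {cost(next_w_1):.2f}")
```

Because the slope at `w_1 = 1.0` is negative, subtracting it moves the weight to the right, toward the minimum, and the cost goes down.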

Now, while we've seen some rules for calculating the derivative at a certain value, this one is different.  In the past, we were given a function like $f(x) = x^2$ and calculated that $\frac{\delta f}{\delta x} = 2x$.  And the expression $\frac{\delta f}{\delta x}$ means to calculate how our function's output changes as we nudge the value of $x$.  

This time we'll want to calculate how the cost function $J$ changes as we change the value of $w$ -- $\frac{\delta J}{\delta w}$.  What makes this difficult is that our value $w$ is set way back in our linear layer, and its impact isn't felt until further down in our cost function.  

In this lesson, we'll see how neural networks solve this issue.

### Seeing the Problem

Before seeing the solution, it helps to first make sure we understand the issue presented to us.  We'll see the issue by once again using our cancer dataset.  Let's start by loading up our data.

In [9]:
import pandas as pd
import torch

df = pd.read_csv('./cell_data-2.csv', index_col = 0)
df[:2]

Unnamed: 0,mean_area,is_cancerous
0,1.001,1
1,1.326,1


So above, we can see that we again want to use the mean area to predict if a cell is cancerous.

> Now let's convert to PyTorch tensors.

In [10]:
X = torch.tensor(df[['mean_area']].values).float()
y_actuals = torch.tensor(df['is_cancerous'].values).float()

Ok, now with gradient descent, ultimately our goal is to figure out how nudging our parameter $w$ will change the output from our entire cost function.  And we get an output from the cost function by passing from one layer to the other until it ultimately calculates the cost.  Mathematically, this looks like the following:

* $z(x_i) = w_1*x_i $
* $a(z) =  \frac{1}{1 + e^{-z(x)}} $
* $ SSE = \sum  (y\_actual - y\_hat)^2 $

And we can translate these layers into code with the following:

In [11]:
W = torch.tensor([[1.]], requires_grad = True) # initial weight, arbitrarily set to 1.0

def linear_layer(x):
    z = x @ W
    print('z = X * W = ', float(z[0]))
    return z

def activation_layer(z):
    a = torch.sigmoid(z)
    print('sigma(z) = ', float(a))
    return a
    
def sse(y_actuals, y_hats):
    squared_errors = (y_actuals - y_hats)**2
    sum_squares = torch.sum(squared_errors)
    print('sse = sum (y_actuals - y_hats)**2 = ', float(sum_squares))
    return sum_squares

Ok, so let's see what occurs if we pass through a single value.

In [12]:
x = X[0] # 1.0010

In [13]:
z = linear_layer(x)
y_hat = activation_layer(z)
sse(y_actuals[0], y_hat)

z = X * W =  1.0010000467300415
sigma(z) =  0.7312551140785217
sse = sum (y_actuals - y_hats)**2 =  0.07222381234169006


tensor(0.0722, grad_fn=<SumBackward0>)

So above, we wrote out our functions so that we print out the output from each layer until we ultimately get the cost, $0.072$.  The point of printing these outputs is to better illustrate the problem.  With gradient descent, we want to see how a change in $W$, back in layer one, will ultimately change our cost in layer 3.  And this effect is indirect.  Changing a parameter in our linear layer will change the output of the linear layer, which will then change the output of the activation layer, which then changes the SSE. 

We can of course see this by changing $W$ from `1.0` to `1.01`, just a little bit.

In [14]:
# W = torch.tensor([[1.]], requires_grad = True) 
W = torch.tensor([[1.01]], requires_grad = True) 

z = linear_layer(x)
y_hat = activation_layer(z)
sse(y_actuals[0], y_hat)

z = X * W =  1.0110100507736206
sigma(z) =  0.7332177758216858
sse = sum (y_actuals - y_hats)**2 =  0.07117275148630142


tensor(0.0712, grad_fn=<SumBackward0>)

So again, there's a chain reaction that trickles down the line.  
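
To put numbers on that chain reaction, the sketch below recomputes the three layers in plain Python (no PyTorch) for the first observation, once with $w = 1.0$ and once with $w = 1.01$, and prints how much each layer's output moved.  The helper name `forward` is an assumption made for this illustration.

```python
import math

def forward(w, x=1.001, y=1.0):
    """Return (z, a, sse) for one observation, mirroring the three layers above."""
    z = w * x                          # linear layer
    a = 1.0 / (1.0 + math.exp(-z))     # sigmoid activation
    sse = (y - a) ** 2                 # squared error for a single observation
    return z, a, sse

before = forward(1.0)    # original weight
after = forward(1.01)    # nudged weight

for name, b, a in zip(("z", "sigma(z)", "sse"), before, after):
    print(f"{name:9s} before={b:.5f} after={a:.5f} change={a - b:+.5f}")
```

The changes should line up with the printed outputs above: $z$ moves by about $0.01$, the activation by about $0.002$, and the SSE drops by about $0.001$.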

### Starting the calculation

Ok, so we ultimately want to calculate how changing $w$ will change the output of our cost function.  But to get there, really we need to tackle this a layer at a time.  This means that our first task is to see how the output of our linear layer $z$ changes as we change our parameter value $w$.  So let's do this: 

* $z(w) = x*w $

Now we represent how $z$ changes as we change $w$ as the following:

$\frac{\delta z}{\delta w}$

And this means that in calculating our derivative, $w$ is our changing variable, and we treat $x$ just like a number.  This gives us: 

$\frac{\delta z}{\delta w} = x$.

In [28]:
def delta_z_delta_w(x):
    return x

That's it!  For every unit that we increase our weight, we increase the output by the value of $x$.  Not so bad.  Let's use the value from our first observation.

In [30]:
df = pd.read_csv('./cell_data-2.csv', index_col = 0)
df[:2]

Unnamed: 0,mean_area,is_cancerous
0,1.001,1
1,1.326,1


In [29]:
x = 1.001

delta_z_delta_w(x)

1.001

And we can even check this with what we saw when printing our outputs above: remember that we first passed through $w = 1.0$, and got an output of $z = X * W =  1.001$, and then we let $w =  1.01$ and got an output of $z = X * W =  1.011$.  And if we plug this into our derivative formula: 

$\frac{\delta f}{\delta x} = \frac{f(x_1) - f(x_0)}{x_1 - x_0}$ we get the following:

In [15]:
f_x0 = 1.00100
f_x1 = 1.01101
x0 = 1.0
x1 = 1.01

(f_x1 - f_x0)/(x1 - x0)

1.0010000000000066

Looks pretty good.

### The next layer

Now the next step is to calculate the derivative of the second layer $\sigma(z)$.  The actual calculation is not so important -- we can essentially give it to you.  What's more important is to see why we need to make this calculation.  Again, our goal is to see how nudging our parameter $w$ will ultimately alter our cost function.  We've calculated how changing $w$ affects the output of $z$, and the output of $z$ then gets passed to our activation layer.  So the next step is to see how this change in $z$ will change the output of $\sigma$.  Or to put it mathematically, we need to calculate $\frac{\delta \sigma}{\delta z}$.

So what is $\frac{\delta \sigma}{\delta z}?$  Well we'll just tell you.

We saw before that $\sigma(x) =  \frac{1}{1 + e^{-x}} $, and it turns out that:

$\frac{\delta \sigma}{\delta x} = \sigma(x)*(1 - \sigma(x))$

And in code it is the following:

In [16]:
import torch

def deriv_sigma(z):
    return torch.sigmoid(z)*(1 - torch.sigmoid(z))

> But again, it's not important.
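
If you'd like to convince yourself of the formula anyway, one option is to compare it against a rise-over-run estimate.  The sketch below does this in plain Python; the helper names `sigmoid`, `deriv_sigmoid`, and `numeric_slope`, and the step size `h`, are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def deriv_sigmoid(z):
    # closed-form derivative quoted above
    return sigmoid(z) * (1.0 - sigmoid(z))

def numeric_slope(f, z, h=1e-6):
    # central difference: rise over run across a tiny interval around z
    return (f(z + h) - f(z - h)) / (2 * h)

for z in (-2.0, 0.0, 1.001):
    print(f"z={z}: formula={deriv_sigmoid(z):.6f}  numeric={numeric_slope(sigmoid, z):.6f}")
```

The two columns should agree to several decimal places at each value of `z`.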

So now let's think back to our hypothesis function for our neuron. 

* $z(mean\_area) = w_1*mean\_area $
* $\sigma(z) =  \frac{1}{1 + e^{-z(x)}} $

We want to calculate the impact that nudging $w$ has, not just on our linear layer, but on our activation layer as well.  In math terms we want to find:

$\frac {\delta \sigma}{\delta w}$

So to do that we first calculate the change that nudging $w$ has on $z$, and then multiply this by the impact that nudging $z$ has on $\sigma$.  Or in other words:

$\frac{\delta \sigma}{\delta w} = \frac{\delta z}{\delta w} * \frac{\delta \sigma}{\delta z}$

Now what's nice is that we've already calculated both of these components, $\frac{\delta z}{\delta w}$, and  $\frac{\delta \sigma}{\delta z}$.

$\frac{\delta z}{\delta w} = x = 1.001$, and $\frac{\delta \sigma}{\delta z} = \sigma(z)*(1 - \sigma(z))$ 

Ok, now let's calculate $\frac{\delta \sigma}{\delta w} = \frac{\delta z}{\delta w} * \frac{\delta \sigma}{\delta z}$

In [17]:
import torch
def deriv_sigma(z):
    return torch.sigmoid(z)*(1 - torch.sigmoid(z))

z = torch.tensor(1.001) # z = w*x = 1 * 1.001 
delta_z_delta_w = 1.001

delta_z_delta_w*deriv_sigma(z)

tensor(0.1967)

And we can check our work by looking at the change in outputs that we printed out above.

In [18]:
x_0 = 1.0
x_1 = 1.01

a_1 = 0.7332177758216858
a_0 = 0.7312551140785217
# and then again use our derivative formula 

(a_1 - a_0)/(x_1 - x_0)

0.19626617431640608

Ok, so what we just saw is called the chain rule.  And the point is that we can calculate the chain reaction that nudging a value like $w$ in our linear function $z$ will have on a downstream function like $\sigma$ by multiplying the two component derivatives together.
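
The rise-over-run check is only an approximation, and it gets closer to the chain rule answer as the nudge shrinks.  Here is a small sketch of that in plain Python, assuming the same first observation ($x = 1.001$) and starting weight ($w = 1.0$); the nudge sizes are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x = 1.001   # first observation's mean_area
w = 1.0     # starting weight used in the lesson

# chain rule: d(sigma)/dw = dz/dw * d(sigma)/dz
z = w * x
chain_rule = x * sigmoid(z) * (1.0 - sigmoid(z))

# rise-over-run estimates with progressively smaller nudges to w
for nudge in (0.01, 0.001, 0.0001):
    estimate = (sigmoid((w + nudge) * x) - sigmoid(w * x)) / nudge
    print(f"nudge={nudge}: estimate={estimate:.5f}  chain rule={chain_rule:.5f}")
```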

### A trick to check our work

Now ultimately, we want to see how nudging $w$ will impact our cost function $J$.  But first let's cover a couple of tricks about the chain rule.  The key to getting the chain rule right is just to frame it properly.  Above, we wanted to see the impact that a change in $w$ has on our activation function $\sigma$.  So far we wrote our hypothesis function like so:  

* $z(w_1, x) = w_1*x$
* $\sigma(z) =  \frac{1}{1 + e^{-z(w1, x)}} $

We then want to see the impact that changing $w_1$ has further down on the activation function $\sigma$, that is, $\frac{\delta \sigma}{\delta w}$.  To calculate this, we just multiply the derivative of the first function by the derivative of the second.

$\frac{\delta \sigma}{\delta w} = \frac{\delta z}{\delta w} * \frac{\delta \sigma}{\delta z} = \frac{\delta \sigma}{\delta w}$

We can check our work simply by canceling terms, just as we would when multiplying fractions.  So above, notice that the $\delta z$'s cancel out, and what's remaining is $\delta \sigma$ on top and $\delta w$ on the bottom.  Same as on the left:

<img src="./practice.png" width="20%">

Ok, so we can see that everything looks good.

### Putting it together

Ok so in our task to see what effect nudging our parameter $w$ will have on our cost function $J$, we next have to calculate the derivative of our cost function.  That is, we need to determine how a change in the output of our hypothesis function will change our cost function $ J(w) = \frac{1}{n}\sum_{i=0}^n (y_i - h(x_i))^2 = \frac{1}{n}\sum_{i=0}^n (y_i - \sigma(x_i))^2 $.  

If you want to calculate this derivative, we walk you through it in the bonus section below.  But for now, just take our word for it:
    
$\frac{\delta J}{\delta \sigma} = \frac{1}{n}\sum_{i = 0}^n 2(\sigma(x_i) - y_i)$

Which we can code as:

In [31]:
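import torch
def delta_J_delta_sigma(x_i, y_i):
    # with w = 1, z = w*x_i = x_i, so sigmoid(x_i) is our prediction
    # we pass a single observation, so n = 1 and the mean drops out
    return 2*(torch.sigmoid(x_i) - y_i)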

So now, if our goal is to determine how changing $w$ back in layer 1 affects our cost, we multiply the derivative from each layer together, that is:

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w}$

In [20]:
x_i = torch.tensor(1.001) 
y_i = torch.tensor(1)

# we get the values above from the data in our first observation

delta_J_delta_sigma(x_i, y_i)*deriv_sigma(z)*delta_z_delta_w

tensor(-0.1057)

And if you want to see this written out in math, it looks like the line below (yes, a bit overwhelming).

$\frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w} = \frac{1}{n}\sum_{i = 0}^n 2(\sigma(x_i) - y_i) * \sigma(z)*(1 - \sigma(z)) * x$

Ok, so now let's again check our calculation above, where we got `-0.1057` with the change in the cost function that we printed out above.  These are the numbers we saw above.

In [22]:
x_0 = 1.
x_1 = 1.01

y_0 = 0.07222381234169006
y_1 = 0.07117275148630142

# and then using our rate of change formula we get...

(y_1 - y_0)/(x_1 - x_0)

-0.10510608553886404

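Ok, pretty damn close to our calculation of `-0.1057`.  The small gap is there because we nudged $w$ by `0.01`, and the derivative describes what happens with an infinitely small nudge.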
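
Here is the whole calculation gathered into one place, as a sketch in plain Python rather than PyTorch.  It assumes a single observation (so the mean drops out), and the function names and the step size `h` are chosen just for this illustration.  It multiplies the three layer derivatives together and compares the result against a rise-over-run estimate that uses a much smaller nudge.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, x, y):
    # squared error for a single observation
    return (y - sigmoid(w * x)) ** 2

def delta_J_delta_w(w, x, y):
    z = w * x
    a = sigmoid(z)
    dJ_da = 2.0 * (a - y)      # cost layer
    da_dz = a * (1.0 - a)      # activation layer
    dz_dw = x                  # linear layer
    return dJ_da * da_dz * dz_dw

w, x, y = 1.0, 1.001, 1.0      # starting weight and first observation
analytic = delta_J_delta_w(w, x, y)

h = 1e-6                       # a much smaller nudge than 0.01
numeric = (cost(w + h, x, y) - cost(w - h, x, y)) / (2 * h)

print(f"chain rule: {analytic:.4f}")
print(f"numeric:    {numeric:.4f}")
```

With the smaller nudge, the two values should agree to well beyond four decimal places.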

Ok, so this is how we can perform gradient descent:

$next\_w_1 = w_1 - learning\_rate* \frac{\delta J}{\delta w_1}$

We just update our $w_1$ by the learning rate times the derivative $\frac{\delta J}{\delta w_1}$, and no matter how many layers there are in our neural network, we calculate that derivative by applying the chain rule.
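
As a rough sketch of what repeating that update looks like, the loop below applies it five times to our single observation, in plain Python.  The learning rate of `0.5` and the number of steps are assumptions picked only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(w, x, y):
    return (y - sigmoid(w * x)) ** 2

def delta_J_delta_w(w, x, y):
    # chain rule: cost layer * activation layer * linear layer
    a = sigmoid(w * x)
    return 2.0 * (a - y) * a * (1.0 - a) * x

x, y = 1.001, 1.0       # first observation
w_1 = 1.0               # starting weight
learning_rate = 0.5     # assumed value, chosen only for illustration

costs = [cost(w_1, x, y)]
for step in range(5):
    w_1 = w_1 - learning_rate * delta_J_delta_w(w_1, x, y)
    costs.append(cost(w_1, x, y))
    print(f"step {step + 1}: w_1={w_1:.4f}  cost={costs[-1]:.5f}")
```

Since the derivative is negative here, each step increases `w_1`, and the cost shrinks a little every time.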

### Summary

In this lesson, we saw we can use the chain rule to calculate how nudging our parameter $w_1$ will affect the cost function $J$.  

Now the challenge of performing this calculation is that the effect is indirect.  Changing $w_1$ affects the output in the linear layer, which affects the output from our activation layer, which affects the output from our cost function.

In [24]:
W = torch.tensor([[1.]], requires_grad = True) # initial weight, arbitrarily set to 1.0

z = linear_layer(x)
y_hat = activation_layer(z)
sse(y_actuals[0], y_hat)

z = X * W =  1.0010000467300415
sigma(z) =  0.7312551140785217
sse = sum (y_actuals - y_hats)**2 =  0.07222381234169006


tensor(0.0722, grad_fn=<SumBackward0>)

But we needed to perform this calculation to see how nudging our parameter would change our cost function in gradient descent:

* $next\_w_1 = w_1 - learning\_rate* \frac{\delta J}{\delta w_1}$

To make this calculation, we used the chain rule.  With the chain rule, we calculate the derivative of each component and then multiply the derivatives together.  So to calculate the impact of nudging $w$ through the following layers: 

* $z(x_i) = w_1*x_i $
* $\sigma(z) =  \frac{1}{1 + e^{-z(x)}} $
* $ SSE = \sum  (y\_actual - y\_hat)^2 $

We need to calculate: $\frac{\delta J}{\delta w} = \frac{\delta z}{\delta w}*\frac{\delta \sigma}{\delta z}*\frac{\delta J}{\delta \sigma}$

And the logic of this is that this represents that chain reaction we printed out above: we can track the impact of nudging $w$ on $J$, through calculating the change in $w$'s impact on $z$, and $z$'s impact on $\sigma$ and $\sigma$'s impact on $J$.

### Bonus -- Calculating the derivative of our cost function

Ok so in our task to see what effect nudging our parameter $w$ will have on our cost function $J$, we next have to calculate the derivative of our cost function.  That is, we need to determine how a change in the output of our hypothesis function will change our cost function $ J(w) = \frac{1}{n}\sum_{i=0}^n (y_i - h(x_i))^2 $.  

Ok, we can still simplify this a bit further.  First, know that we can set aside the $\frac{1}{n}\sum_{i = 0}^n$ when calculating the derivative, because the derivative of a sum is the sum of the derivatives, and the constant $\frac{1}{n}$ just carries through.  So let's get rid of it, and then add it back in at the end.  So we'll define this function without the summation as $j$ and have:

$j(w) = (y_i - h(x_i))^2 $

Next, we use the chain rule again, by first defining an error function $\epsilon$ where $\epsilon_i = y_i - h(x_i)$, and then via substitution we can define $j$ to be $j = (y_i - h(x_i))^2  = \epsilon_i^2$.

* $\epsilon = y_i - \sigma(x_i)$

* $j(x) = \epsilon^2$

So now we can calculate each of the derivatives again:

* $\frac{\delta \epsilon}{\delta \sigma} = 0 - \sigma(x_i)^{1 - 1} = - \sigma(x_i)^{0} = -1$

* $\frac{\delta j}{\delta \epsilon} = 2 \epsilon $

So we get that: $\frac{\delta j}{\delta \sigma} = \frac{\delta j}{\delta \epsilon} * \frac{\delta \epsilon}{\delta \sigma} = 2\epsilon * (-1)$ and because $\epsilon = y_i - \sigma(x_i)$ via substitution we get:

* $\frac{\delta j}{\delta \sigma} = -2(y_i - \sigma(x_i)) = 2(\sigma(x_i) - y_i)$

And finally adding back in our summation and mean, we get:

$\frac{\delta J}{\delta \sigma} = \frac{1}{n}\sum_{i = 0}^n 2(\sigma(x_i) - y_i)$.
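
As a quick sanity check on the single-observation result $\frac{\delta j}{\delta \sigma} = 2(\sigma(x_i) - y_i)$, the sketch below compares it against a rise-over-run estimate in plain Python.  It treats the activation output as the input variable; the names `j`, `delta_j_delta_sigma`, and `sigma_out` and the step size `h` are assumptions for this illustration.

```python
def j(sigma_out, y):
    # squared error for one observation, as a function of the activation output
    return (y - sigma_out) ** 2

def delta_j_delta_sigma(sigma_out, y):
    # result of the derivation above
    return 2.0 * (sigma_out - y)

# activation output and label from the first observation
sigma_out, y = 0.7312551, 1.0

h = 1e-6
numeric = (j(sigma_out + h, y) - j(sigma_out - h, y)) / (2 * h)

print(f"formula: {delta_j_delta_sigma(sigma_out, y):.6f}")
print(f"numeric: {numeric:.6f}")
```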