# Introducing Backpropagation

### Introduction

$J(w, b) = (y -  \sigma(z(w, b)))^2$

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w} $

$\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta b} $

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w} = 2(\sigma(z(x_i)) - y_i) * \sigma(z(x_i))(1 - \sigma(z(x_i))) * x_i$

$\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta b} = 2(\sigma(z(x_i)) - y_i) * \sigma(z(x_i))(1 - \sigma(z(x_i))) * 1$

In [1]:
def dJ_dw(w, b, x, y):
    # chain rule: dJ/dsigma * dsigma/dz * dz/dw
    return dl_dsig(w, b, x, y)*dsig_dz(w, b, x, y)*dz_dw(x)

In [3]:
def dJ_db(w, b, x, y):
    # chain rule: dJ/dsigma * dsigma/dz * dz/db, where dz/db = 1
    return dl_dsig(w, b, x, y)*dsig_dz(w, b, x, y)*1

In [1]:
def dl_dsig(w, b, x, y):
    # dJ/dsigma = 2*(sigma - y); h(w, b, x) is assumed to return the prediction sigma(z)
    return 2*(h(w, b, x) - y)
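
The cells above call `h`, `dsig_dz`, and `dz_dw` without defining them in this lesson. The sketch below fills them in so the cells can be run end to end. It assumes a linear layer $z = wx + b$ (consistent with $\frac{\delta z}{\delta w} = x_i$ and $\frac{\delta z}{\delta b} = 1$ above) and the standard logistic sigmoid; the helper bodies and the names `sigmoid` and `z` are illustrative assumptions, not necessarily the definitions used elsewhere in the course.

```python
import math


def sigmoid(value):
    """Logistic function: 1 / (1 + e^(-value))."""
    return 1.0 / (1.0 + math.exp(-value))


def z(w, b, x):
    """Assumed linear layer z = w*x + b (not spelled out in the lesson)."""
    return w * x + b


def h(w, b, x):
    """Prediction sigma(z(w, b)) for a single input x."""
    return sigmoid(z(w, b, x))


def dl_dsig(w, b, x, y):
    """dJ/dsigma for J = (y - sigma)^2."""
    return 2 * (h(w, b, x) - y)


def dsig_dz(w, b, x, y):
    """dsigma/dz = sigma * (1 - sigma); y is unused but kept to match the lesson's call signature."""
    s = h(w, b, x)
    return s * (1 - s)


def dz_dw(x):
    """dz/dw for z = w*x + b."""
    return x


def dJ_dw(w, b, x, y):
    return dl_dsig(w, b, x, y) * dsig_dz(w, b, x, y) * dz_dw(x)


def dJ_db(w, b, x, y):
    return dl_dsig(w, b, x, y) * dsig_dz(w, b, x, y) * 1


# Example with made-up values: w = 0, b = 0, x = 2, y = 1
print(dJ_dw(0.0, 0.0, 2.0, 1.0), dJ_db(0.0, 0.0, 2.0, 1.0))
```

With $w = 0$ and $b = 0$ the prediction is $\sigma(0) = 0.5$, so the three factors are $-1$, $0.25$, and $x = 2$, which makes the result easy to check by hand.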

### A little note about derivatives

Before moving on to backpropagation, let's show a little trick with the derivatives.  Remember that we said that $\frac{\delta J}{\delta w}$ is the following:

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w}$

One way of checking your work is to treat the derivatives on the right as if they were fractions, and make sure that the two sides of the equation really are equal.  So applying this to the above, we see that:

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w} = \frac{\delta J \, \delta \sigma \, \delta z}{\delta \sigma \, \delta z \, \delta w} = \frac{\delta J}{\delta w}$

Similarly we can check that we have identified the components of $\frac{\delta J}{\delta b}$ by treating the derivatives on the right as fractions and ensuring the two sides are equal:

$\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta b} = \frac{\delta J \, \delta \sigma \, \delta z}{\delta \sigma \, \delta z \, \delta b} = \frac{\delta J}{\delta b}$
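
The fraction trick checks that the symbols line up. A complementary check on the values is to compare the chain-rule product against a finite-difference estimate of the slope. The sketch below does this for $\frac{\delta J}{\delta w}$, assuming $z = wx + b$ and the logistic sigmoid; the function names and the sample values are made up for illustration.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def J(w, b, x, y):
    """Cost for a single observation, assuming z = w*x + b."""
    return (y - sigmoid(w * x + b)) ** 2


def chain_rule_dJ_dw(w, b, x, y):
    """dJ/dsigma * dsigma/dz * dz/dw, written out term by term."""
    s = sigmoid(w * x + b)
    return 2 * (s - y) * s * (1 - s) * x


def numeric_dJ_dw(w, b, x, y, eps=1e-6):
    """Central finite difference: nudge w slightly and measure the change in J."""
    return (J(w + eps, b, x, y) - J(w - eps, b, x, y)) / (2 * eps)


# Arbitrary sample values, chosen only for the check
w, b, x, y = 0.5, -0.2, 1.5, 1.0
analytic = chain_rule_dJ_dw(w, b, x, y)
numeric = numeric_dJ_dw(w, b, x, y)
print(analytic, numeric)  # the two values should agree to several decimal places
```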

### Moving to backpropagation

So what we saw above are perfectly valid ways of calculating the derivative.  For example, we calculated $\frac{\delta J}{\delta w}$ as:

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w}$

And as we know, $\frac{\delta J}{\delta w}$ is the impact of a change in $w$ on $J$.  

But what if we want to know the impact of some of the intermediate functions on our cost function $J$?  For example, what if we want to know how a change in $z$ affects $J$?  We can calculate the derivative of $J$ with respect to each intermediate function **with no extra work** if we recalculate our derivative using the following procedure:

* Given a function:

$J(w, b) = (y -  \sigma(z(w, b)))^2$, then:

1. $\frac{\delta J}{\delta \sigma} = \frac{\delta J}{\delta \sigma}$ 

2. $ \frac{\delta J}{\delta z} = \frac{\delta J}{\delta \sigma}* \frac{\delta \sigma}{\delta z}$

3. $ \frac{\delta J}{\delta w} = \frac{\delta J}{\delta z}* \frac{\delta z}{\delta w}$

Looking at the second partial derivative, $\frac{\delta J}{\delta z}$, we calculate it by multiplying:

* the immediately upstream derivative, $\frac{\delta J}{\delta \sigma}$, by 
* the local derivative, $\frac{\delta \sigma}{\delta z}$.

And looking at the final partial derivative, $\frac{\delta J}{\delta w}$, we calculate it by multiplying:

* the immediately upstream derivative, $\frac{\delta J}{\delta z}$, by 
* the local derivative, $\frac{\delta z}{\delta w}$.

This is the idea behind backpropagation.  Given our cost function, here $J(w, b) = (y -  \sigma(z(w, b)))^2$, we start at the outermost function and work inwards, multiplying each upstream derivative by the local derivative:

1. $\frac{\delta J}{\delta \sigma} = \frac{\delta J}{\delta \sigma}$ 

2. $ \frac{\delta J}{\delta z} = \frac{\delta J}{\delta \sigma}* \frac{\delta \sigma}{\delta z}$

3. $ \frac{\delta J}{\delta w} = \frac{\delta J}{\delta z}* \frac{\delta z}{\delta w}$
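
To see that step 2 really does give the impact of $z$ on $J$, the sketch below runs the three steps for one observation and compares the intermediate value $\frac{\delta J}{\delta z}$ with a finite-difference estimate obtained by nudging $z$ directly. It assumes $z = wx + b$ and the logistic sigmoid; `cost_from_z`, `backward_steps`, and the sample values are hypothetical names and numbers used only for this illustration.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def cost_from_z(z, y):
    """J written as a function of the intermediate value z."""
    return (y - sigmoid(z)) ** 2


def backward_steps(w, b, x, y):
    z = w * x + b                      # forward pass (assumed linear form)
    sig = sigmoid(z)
    dJ_dsig = 2 * (sig - y)            # step 1: outermost derivative
    dJ_dz = dJ_dsig * sig * (1 - sig)  # step 2: upstream * local dsigma/dz
    dJ_dw = dJ_dz * x                  # step 3: upstream * local dz/dw
    return z, dJ_dsig, dJ_dz, dJ_dw


w, b, x, y = 0.5, -0.2, 1.5, 1.0
z, dJ_dsig, dJ_dz, dJ_dw = backward_steps(w, b, x, y)

# Nudge z itself and measure the change in J
eps = 1e-6
numeric_dJ_dz = (cost_from_z(z + eps, y) - cost_from_z(z - eps, y)) / (2 * eps)
print(dJ_dz, numeric_dJ_dz)  # the two values should agree to several decimal places
```

The intermediate derivative comes for free: it is just the value held in `dJ_dz` on the way to `dJ_dw`.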

### Backpropagation in code

So the change above is slight.  And the change to our code that calculates the derivative `dJ_dw` is also pretty slight. 

> This was our original code:

In [5]:
def dJ_dw(w, b, x, y):
    return dl_dsig(w, b, x, y)*dsig_dz(w, b, x, y)*dz_dw(x)

* And our new code represents the steps that we saw above:

1. $\frac{\delta J}{\delta \sigma} = \frac{\delta J}{\delta \sigma}$ 

2. $ \frac{\delta J}{\delta z} = \frac{\delta J}{\delta \sigma}* \frac{\delta \sigma}{\delta z}$

3. $ \frac{\delta J}{\delta w} = \frac{\delta J}{\delta z}* \frac{\delta z}{\delta w}$

In [6]:
def dJ_dw(w, b, x, y):
    dJ_dsig_result = dl_dsig(w, b, x, y)  # step 1: outermost derivative
    dJ_dz_result = dsig_dz(w, b, x, y)*dJ_dsig_result  # step 2: local * upstream
    dJ_dw_result = dz_dw(x)*dJ_dz_result  # step 3: local * upstream
    return dJ_dw_result

This approach is valuable because we can reuse these derivatives when we calculate $\frac{\delta J}{\delta b}$: we have already done most of the work.  After all, $\frac{\delta J}{\delta w}$ and $\frac{\delta J}{\delta b}$ are both the same upstream derivative, $\frac{\delta J}{\delta z}$, multiplied by a local derivative:

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta z}\frac{\delta z}{\delta w}$

$\frac{\delta J}{\delta b} = \frac{\delta J}{\delta z}\frac{\delta z}{\delta b}$

And in code we can calculate both derivatives by just slightly changing our code.

In [50]:
def dJ_dw_and_dJ_db(w, b, x, y):

    dJ_dsig_result = dl_dsig(w, b, x, y)
    dJ_dz_result = dJ_dsig_result*dsig_dz(w, b, x, y)
    # the upstream derivatives above, are shared
    # by the partial derivatives below
    
    dJ_dw = dJ_dz_result*dz_dw(x)
    dJ_db = dJ_dz_result*dz_db(x)
    return (dJ_dw, dJ_db)

So you can see that by starting at the outermost derivative, $\frac{\delta J}{\delta \sigma}$, and working inwards, we can keep reusing the upstream derivatives we have already calculated.
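
As a usage example, here is a self-contained variant of the combined function that can be run on its own. It is a sketch under the same assumptions as before ($z = wx + b$, logistic sigmoid); the local derivatives are written inline rather than through the lesson's helper functions, and the input values are made up.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def dJ_dw_and_dJ_db(w, b, x, y):
    sig = sigmoid(w * x + b)           # forward pass, assumed z = w*x + b
    dJ_dsig = 2 * (sig - y)
    dJ_dz = dJ_dsig * sig * (1 - sig)  # shared upstream derivative, computed once

    dJ_dw = dJ_dz * x                  # local derivative dz/dw = x
    dJ_db = dJ_dz * 1                  # local derivative dz/db = 1
    return dJ_dw, dJ_db


grad_w, grad_b = dJ_dw_and_dJ_db(w=0.0, b=0.0, x=2.0, y=1.0)
print(grad_w, grad_b)  # -0.5 -0.25
```

Because both gradients share $\frac{\delta J}{\delta z}$, they differ only by the local factor: here `grad_w` is exactly `x` times `grad_b`.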

### Summary

In this lesson we learned about backpropagation.  We saw that with backpropagation, we start with a cost function, in this lesson:

$J(w, b) = (y -  \sigma(z(w, b)))^2$

Then, we calculate the derivative with a series of steps, starting with the outermost function and moving inwards.  Along the way, we calculate each intermediate function's impact on our cost function $J$.  This means that we change our original approach of calculating $\frac{\delta J}{\delta w}$ as:

$\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} \frac{\delta \sigma}{\delta z} \frac{\delta z}{\delta w}$

to:

1. $\frac{\delta J}{\delta \sigma} = \frac{\delta J}{\delta \sigma}$ 

2. $ \frac{\delta J}{\delta z} = \frac{\delta J}{\delta \sigma}* \frac{\delta \sigma}{\delta z}$

3. $ \frac{\delta J}{\delta w} = \frac{\delta J}{\delta z}* \frac{\delta z}{\delta w}$

And the key is that to calculate each derivative, we multiply the derivative directly upstream by the local derivative.  So for example, to find $\frac{\delta J}{\delta z}$ we use the upstream derivative $\frac{\delta J}{\delta \sigma}$ and multiply it by the local derivative $\frac{\delta \sigma}{\delta z}$.

This allows us to reuse our calculations.  For example, we calculate $\frac{\delta J}{\delta b}$ by using the already calculated upstream derivative $\frac{\delta J}{\delta z}$:

$ \frac{\delta J}{\delta b} = \frac{\delta J}{\delta z}* \frac{\delta z}{\delta b}$

Being able to reuse our upstream derivatives will become even more important as we add more layers to our network, but we'll save that for a future lesson.
