# BackPropagation and Gradient Descent

### Introduction

Ok, so if you've made it this far, you've seen the main components of training a neural network through gradient descent.  We saw that we can calculate the effect of a change in our parameters across multiple layers through the chain rule.  And we also saw how we can descend along a cost curve for a neural network with multiple parameters: we calculate the partial derivative of each parameter and move in proportion to each parameter's partial derivative.  

In this lesson, we'll layer on a couple components to our knowledge -- learning about backpropagation.



### The Setup

Now believe it or not, we've already seen both forward propagation and backward propagation.  Let's review what we did in the last lesson and then we'll call out these two steps.  

In [1]:
import pandas as pd

df = pd.read_csv('./cell_data-2.csv', index_col = 0)
df[:2]

Unnamed: 0,mean_area,is_cancerous
0,1.001,1
1,1.326,1


And converted it into tensors.

In [88]:
import torch
X_tensor = torch.tensor(df[['mean_area']].values)
y_tensor = torch.tensor(df['is_cancerous'].values)

Then we select our first observation.

In [14]:
first_X = X[0]
first_y = y[0]

first_X, first_y

(tensor([1.0010], dtype=torch.float64), tensor(1))

And we define the hypothesis function for our neuron: a linear layer and our activation function.

In [15]:
def linear_fn(w, x, b):
    return w*x + b 

In [16]:
def activation_fn(z):
    return torch.sigmoid(z)

Now we'll need the linear and activation layers to make a prediction.  And we'll need the derivatives we calculated in the previous lessons to update the parameters of our hypothesis function.

Remember that we'll need the components needed to calculate $\frac{\delta J}{\delta w}$ and $\frac{\delta J}{\delta b}$.

* $\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w}$

* $\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta b}$

And as we calculated in previous lessons, these component derivatives are the following:

In [70]:
import torch
def delta_J_delta_sigma(y_hat, y):
    return torch.sum(2*(y_hat - y))

In [71]:
def delta_sigma_delta(z):
    return torch.sigmoid(z)*(1 - torch.sigmoid(z))

In [72]:
def delta_z_delta_w(x):
    return x

In [73]:
def delta_z_delta_b():
    return 1

### The prediction

Ok, so now it's time to make our prediction.  We'll initialize our parameters with some random values.

In [18]:
w = torch.tensor(2.)
b = torch.tensor(-2.)

And make our prediction with just our first datapoint.

In [49]:
z = linear_fn(w, first_x, b)
z

tensor([0.0020], dtype=torch.float64)

In [50]:
y_hat = activation_fn(z)
y_hat

tensor([0.5005], dtype=torch.float64)

Now this step of calculating the output at each layer is called **forward propagation**.  We are passing data through each layer until we get to a prediction -- above $\hat{y} = .50005$.

### Calculating the derivative

Ok, now that we completed the forward propagation, where we passed through our data and calculated an output at each layer, the next step is backward propagation.  With backward propagation, we do something slightly different than we saw in the last lesson.  Previously, we calculated how the parameters $w$ and $b$ impact the cost curve $J$ by multiplying together the component derivatives like so:

* $\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w}$

* $\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta b}$

And it's the same thing in code.  We calculate the component derivatives...

In [40]:
dj_dsig = delta_J_delta_sigma(y_hat, first_y)

In [84]:
dsig_dz = delta_sigma_delta(z)

In [42]:
dz_dw = delta_z_delta_w(first_x)

And then to calculate $\frac{\delta J}{\delta w}$ we just multiply the components together. 

In [44]:
dj_dsig*dsig_dz*dz_dw

tensor([-0.2500], dtype=torch.float64)

### Onto backpropagation

With backpropagation, we work backwards, calculating each layer's impact on $J$.  It's easier to understand by example:  

1. We start by seeing the impact of a small change in the output of our last layer $\sigma$ on $J$, $\frac{\delta J}{\delta \sigma}$.

> This is the same thing we calculated previously.

In [51]:
dj_dsig = delta_J_delta_sigma(y_hat, first_y)
dj_dsig

tensor(-0.9990, dtype=torch.float64)

2. Then we find the impact of a small change in the output of the linear layere on $J$, $\frac{\delta J}{\delta z}$.

> This, we have **not** found calculated before.

This is the formula: $\frac{\delta J}{\delta z} = \frac{\delta \sigma}{\delta z}*\frac{\delta J}{\delta \sigma}$.

Let's think about how the formula above makes sense.  If we want to find a change in $z's$ impact on $J$, this is the impact that $z$ has on the activation layer $\sigma$ multiplied by $\sigma$'s impact on $J$.  So to perform this calculation, we only need to calculate one thing new: $\frac{\delta \sigma}{\delta z}$.  We already calculated $\frac{\delta J}{\delta \sigma}$ above.

In [54]:
dsig_dz = deriv_sigma(z)
dz_dJ = dsig_dz*dj_dsig
dz_dJ

tensor([-0.2497], dtype=torch.float64)

So let's add some terms to our two component derivatives in the formula: $\frac{\delta J}{\delta z} = \frac{\delta \sigma}{\delta z}*\frac{\delta J}{\delta \sigma}$.  

* We'll call our newly calculated derivative $\frac{\delta \sigma}{\delta z}$ our **local derivative**, and
* we'll call the derivative we calculated in the previous layer $\frac{\delta J}{\delta \sigma}$ our **upstream derivative**.

As we'll see this is the approach we'll continue to use as we move down through our layers: calculate the new local derivative and multiply it by the derivative upstream derivative.

3. Find the impact of a change in $w$ on $J$, that is $\frac{\delta J}{\delta w}$.

Now to make this calculation, our formula is: $\frac{\delta J}{\delta w} = \frac{\delta w}{\delta z}*\frac{\delta J}{\delta z}$.  So once again, it's our local derivative where we find $w's$ impact on $z$.  

And then we multiply by the derivative we just calculated upstream, $\frac{\delta J}{\delta z}$, $z$'s impact on our cost function $J$.

In [55]:
dw_dJ = dz_dw*dz_dJ
dw_dJ

tensor([-0.2500], dtype=torch.float64)

So we get the same answer when multiplying all three terms together: $\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w}$.

In [56]:
dj_dsig*dsig_dz*dz_dw

tensor([-0.2500], dtype=torch.float64)

What's nice about this is that **we can reuse $\frac{\delta z}{\delta J}$** when calculating $\frac{\delta J}{\delta b}$.  Let's see this.

4. Find the impact of a change in $b$ on $J$, that is $\frac{\delta J}{\delta b}$.

This again is our local derivative multiplied by the same directly upstream derivative $\frac{\delta z}{\delta J}$.

In [57]:
dJ_db = dz_db*dz_dJ
dJ_dw

tensor([-0.2497], dtype=torch.float64)

So this is backpropagation.  It's a small trick that allows us to reuse some of our earlier calculations.  With backpropagation, we start with our top most layer, here $\frac{\delta \sigma}{\delta J}$, and then move down through our layers each time calculating each layer's impact on the cost function $J$ by multiplying the local derivative by the direectly upstream derivative.

### All together now

Ok, now let's sum our process of forward propagation and backward propagation.  With forward propagation we initialize the values of our parameters.

In [63]:
w = torch.tensor(2.)
b = torch.tensor(-2.)

And then pass our data through our layers to calculate the output at each layer.

In [62]:
z = linear_fn(w, first_x, b)
print('z = ', z)
y_hat = activation_fn(z)
print('y_hat = ', y_hat)

z =  tensor([0.0020], dtype=torch.float64)
y_hat =  tensor([0.5005], dtype=torch.float64)


And with backward propagation, we then calculate how nudging each layer's output will change our cost function.  And we make this calculation by multiplying the local derivative with the upstream derivative.  So to calculate $\frac{\delta J}{\delta w}$, we just move downward through our layers.

In [67]:
dj_dsig = delta_J_delta_sigma(y_hat, first_y)
dJ_dz = delta_sigma_delta(z)*dj_dsig
dJ_dw = delta_z_delta_w(first_x)*dJ_dz

dJ_dw

tensor([-0.2500], dtype=torch.float64)

And then in calculating $\frac{\delta J}{\delta b}$, we already calculated the upstream derivative, so we only have to calculate the local derivative.

In [68]:
dJ_db = delta_z_delta_b()*dJ_dz
dJ_db

tensor([-0.2497], dtype=torch.float64)

> Notice, that in calculating our derivatives above, we need to use calculations from forward propagation.  For example, we passed through `y_hat` to calculate `delta_J_delta_sigma`.  And we passed through `z` to calculate `delta_sigma_delta(z)`.

Ok, so now let's repeatedly perform forward and backward propagation to descend along the cost curve.

In [95]:
w = torch.tensor(2.)
b = torch.tensor(-2.)
eta = .0001 # define learning rate 
for i in range(20):
    for x, y in zip(X_tensor, y_tensor):
    # forward propagation
        z = linear_fn(w, x, b)
        y_hat = activation_fn(z)
    
    # backward propagation
        dj_dsig = delta_J_delta_sigma(y_hat, y)
        dJ_dz = delta_sigma_delta(z)*dj_dsig
        dJ_dw = delta_z_delta_w(first_x)*dJ_dz
        
        dJ_db = delta_z_delta_b()*dJ_dz
    
    # update params w, b
        w = w - eta*dJ_dw
        b = b - eta*dJ_db

In [96]:
w, b

(tensor([2.0255], dtype=torch.float64), tensor([-1.9746], dtype=torch.float64))

And notice how this aligns with the gradient descent approach we saw in Pytorch. 

```python
for (x, y) in zip(X_train_tensor_gpu, y_train_tensor_gpu): # loop through observations and labels
    net.zero_grad() # remove calculated derivatives
    y_hat = net(X_tensor)             # 1. Forward prop: With current weights make a prediction
    loss = criterion(y_hat, y_tensor) # 2. See how off the prediction is according to the cost function
    loss.backward()                   # 3. Back prop: Calculate how each layer's change in output affects J
    opt.step()                        # 4. Update params based on eta and calculated derivatives 
```

So that's forward and backward propagation.  The only thing left is to show, how we can use what we know about matrix algebra to perform the tasks above.  We'll tackle that in our next lesson.

### Summary

In this lesson, we learned about forward and backward propagation.  Forward propagation just means to calculate the output of each layer when we pass data through our hypothesis function.  

And with backward propagation we calculate each layer's impact on the cost function by multiplying the local derivative by the upstream derivative.  So moving backwards through our layers, we calculated:

* $\frac{\delta J}{\delta \sigma}$

* $\frac{\delta J}{\delta z} = \frac{\delta \sigma}{\delta z}*\frac{\delta J}{\delta \sigma}$

* $\frac{\delta J}{\delta w} = \frac{\delta w}{\delta z}*\frac{\delta J}{\delta z}$

And finally, we calculated:

* $\frac{\delta J}{\delta b} = \frac{\delta b}{\delta z}*\frac{\delta J}{\delta z}$

Each time, we multiplied the local derivative by the upstream derivative.  

Finally, we used forward and backward propagation to perform gradient descent, and saw how it aligned with our code in Pytorch.