# Backpropagation and Gradient Descent

### Introduction

Ok, so if you've made it this far, you've seen the main components of training a neural network through gradient descent.  We saw that we can use the chain rule to calculate the effect of a change in our parameters across multiple layers.  And we also saw how to descend along a cost curve for a neural network with multiple parameters: we calculate the gradient.  That is, we calculate the partial derivative of the cost with respect to each parameter, and move each parameter in proportion to its partial derivative.  

In this lesson, we'll see a small update to the way neural networks calculate how a change in parameters affects the cost -- by performing backpropagation.



### The Setup

Now believe it or not, we've already seen much of what is involved in both forward propagation and backward propagation -- whatever that means.  Let's review what we did in the last lesson and then we'll call out these two steps.  

> We'll begin by loading our data.

In [125]:
import pandas as pd

df = pd.read_csv('./cell_multiple.csv')

In [126]:
df[:2]

Unnamed: 0,mean_area,mean_concavity,is_cancerous
0,1.001,0.3001,1
1,1.326,0.0869,1



Next, we convert it into tensors.

In [153]:
import torch
X_tensor = torch.tensor(df[['mean_area', 'mean_concavity']].values).float()
y_tensor = torch.tensor(df['is_cancerous'].values).float()

Then we select our first observation.

In [202]:
first_x = X_tensor[0]
first_y = y_tensor[0]

first_x, first_y

(tensor([1.0010, 0.3001]), tensor(1.))

### The prediction

Ok, so now it's time to make our prediction.  We'll define the components of our hypothesis function.

In [203]:
def linear_fn(x, w, b):
    return x @ w + b

In [204]:
def activation_fn(z):
    return torch.sigmoid(z)

And initialize the related weight vector and bias.

In [205]:
w = torch.tensor([.5, .3]).float()
b = torch.tensor(-2.).float()

And make our prediction with just our first datapoint.

In [208]:
z = linear_fn(first_x, w, b)
z

tensor(-1.4095)

In [209]:
y_hat = activation_fn(z)
y_hat

tensor(0.1963)

Now this step of calculating the output at each layer is called **forward propagation**.  We are passing data through each layer until we get to a prediction -- above, $\hat{y} = .1963$.  So that's it: forward propagation is just passing data through the layers of the hypothesis function of our neural network.
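
To make the idea concrete, here is a minimal sketch that wraps both layers in a single function.  The name `forward` is our own choice for illustration, and the inputs simply repeat the first observation and the hand-picked parameters from above.

```python
import torch

def forward(x, w, b):
    """Pass one observation through the linear layer, then the sigmoid."""
    z = x @ w + b             # linear layer
    y_hat = torch.sigmoid(z)  # activation layer
    return z, y_hat

# Same values as the first observation and the parameters chosen above
x = torch.tensor([1.001, 0.3001])
w = torch.tensor([0.5, 0.3])
b = torch.tensor(-2.0)

z, y_hat = forward(x, w, b)
print('z = ', z)
print('y_hat = ', y_hat)
```

Returning `z` along with `y_hat` is deliberate: as we'll see, backward propagation needs the output of every layer, not just the final prediction.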

### Reviewing the gradient

Now, of course, the parameters of our neural network, $w$ and $b$, were just set arbitrarily by us.  We'll need to use gradient descent to find the parameters that minimize the output of our cost function.  As we saw in the last lesson, we can summarize gradient descent with the following formula:

$$ \theta = \theta - \eta \frac{\delta J}{\delta \theta}  $$

And remember that this term $\frac{\delta J}{\delta \theta}$ is a vector of each parameter's partial derivative.  
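
As a quick illustration of the update rule, here is a small sketch in plain Python.  The helper `gradient_step` and the gradient values are made up for the example; only the starting parameters come from this lesson.

```python
def gradient_step(theta, grad, eta):
    """Apply theta = theta - eta * dJ/dtheta to every parameter."""
    return [t - eta * g for t, g in zip(theta, grad)]

theta = [0.5, 0.3, -2.0]      # w_1, w_2, b as initialized in this lesson
grad = [-0.25, -0.08, -0.25]  # illustrative partial derivatives, not computed from data
eta = 0.1                     # learning rate

new_theta = gradient_step(theta, grad, eta)
print(new_theta)
```

Because every partial derivative here is negative, each parameter increases: we always move against the sign of the derivative.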

Our partial derivatives really consist of the following $\frac{\delta J}{\delta w}$ and $\frac{\delta J}{\delta b}$:

* $\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w}$

* $\frac{\delta J}{\delta b} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta b}$

And as we saw in previous lessons, these component derivatives are the following:

In [70]:
import torch
def delta_J_delta_sigma(y_hat, y):
    return torch.sum(2*(y_hat - y))

In [71]:
def delta_sigma_delta(z):
    return torch.sigmoid(z)*(1 - torch.sigmoid(z))

In [72]:
def delta_z_delta_w(x):
    return x

In [73]:
def delta_z_delta_b():
    return 1

So then we can calculate $\frac{\delta J}{\delta w}$ by multiplying the component derivatives together.

In [215]:
w_grad = delta_z_delta_w(first_x)*delta_sigma_delta(z)*delta_J_delta_sigma(y_hat, first_y)
w_grad

tensor([-0.2539, -0.0761])

And similarly we can calculate $\frac{\delta J}{\delta b}$ by multiplying the component derivatives together.

In [216]:
b_grad = delta_z_delta_b()*delta_sigma_delta(z)*delta_J_delta_sigma(y_hat, first_y)
b_grad

tensor(-0.2536)

Now one thing we'll notice from the above is in calculating `w_grad` and `b_grad` we repeated the same operation twice: `delta_sigma_delta(z)*delta_J_delta_sigma(y_hat, first_y)`.  This may not seem like a big deal, but this number crunching of the gradient is pretty time consuming, so we'd like to be more efficient if possible.  

So let's see how through backpropagation, we can avoid this duplication.

### Onto backpropagation

With backpropagation, instead of simply multiplying together the component derivatives at the end, we will work backwards through our layers, and calculate each layer's impact on $J$.

It's easier to understand by example.  Remember that these were our layers of our neural network:

* $z(x) = w \cdot x + b$
* $\sigma(z) =  \frac{1}{1 + e^{-z(x)}} $
* $ J(\hat{y}, y) = \sum  (y - \hat{y})^2 $
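
For reference, differentiating each of these layers on its own gives the component derivatives we coded earlier.  This is a sketch for a single observation, writing $\hat{y} = \sigma(z)$ and assuming the linear layer includes the bias $b$, as `linear_fn` does:

$$ \frac{\delta J}{\delta \sigma} = 2(\hat{y} - y), \qquad \frac{\delta \sigma}{\delta z} = \sigma(z)(1 - \sigma(z)), \qquad \frac{\delta z}{\delta w} = x, \qquad \frac{\delta z}{\delta b} = 1 $$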

So now we work backwards seeing how our last layer, $\sigma$ has an impact on our cost function.  So that's where we'll start.

1. Calculate the impact of a small change in the output of our last layer $\sigma$ on $J$, $\frac{\delta J}{\delta \sigma}$.

In [218]:
dj_dsig = delta_J_delta_sigma(y_hat, first_y)
dj_dsig

tensor(-1.6074)

> This is exactly the same calculation of $\frac{\delta J}{\delta \sigma}$ that we performed previously, so we just reuse the function from above.

2. Then we find the impact of a small change in the output of the linear layer $z$ on $J$, $\frac{\delta J}{\delta z}$.

> This, we have **not** calculated on its own before.

This is the formula: $\frac{\delta J}{\delta z} = \frac{\delta \sigma}{\delta z}*\frac{\delta J}{\delta \sigma}$.

Let's think about why the formula above makes sense.  If we want to find the impact of a change in $z$ on $J$, this is the impact that $z$ has on the activation layer $\sigma$ multiplied by $\sigma$'s impact on $J$.  So to perform this calculation, we only need to calculate one new thing: $\frac{\delta \sigma}{\delta z}$.  We already calculated $\frac{\delta J}{\delta \sigma}$ above.
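
One way to convince ourselves of this formula is to compare it against a direct numerical estimate.  The sketch below uses plain Python; the helper names `sigmoid` and `cost` are our own, and `z` is rounded from the forward pass above.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def cost(z, y):
    """Squared error for one observation, written as a function of z."""
    return (y - sigmoid(z)) ** 2

z, y = -1.4095, 1.0

# Chain rule: local derivative multiplied by the next derivative up the chain
dJ_dsig = 2 * (sigmoid(z) - y)           # sigma's impact on J
dsig_dz = sigmoid(z) * (1 - sigmoid(z))  # z's impact on sigma
dJ_dz = dsig_dz * dJ_dsig

# Numerical check: nudge z a tiny amount and measure how much J changes
eps = 1e-6
dJ_dz_numeric = (cost(z + eps, y) - cost(z - eps, y)) / (2 * eps)

print(dJ_dz, dJ_dz_numeric)
```

The two numbers should agree to several decimal places, which is what the chain rule promises.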

In [219]:
dsig_dz = delta_sigma_delta(z)
dJ_dz = dsig_dz*dj_dsig
dJ_dz

tensor(-0.2536)

Next let's name the two component derivatives in the formula: $\frac{\delta J}{\delta z} = \frac{\delta \sigma}{\delta z}*\frac{\delta J}{\delta \sigma}$.  

* We'll call our newly calculated derivative $\frac{\delta \sigma}{\delta z}$ our **local derivative**, and
* we'll call the derivative we calculated in the previous layer $\frac{\delta J}{\delta \sigma}$ our **upstream derivative**.

As we'll see, this is the approach we'll continue to use as we move down through our layers: calculate the new local derivative and multiply it by the upstream derivative.

Ok, let's now see how the next layer down has an impact on $J$.

3. Find the impact of a change in $w$ on $J$, that is $\frac{\delta J}{\delta w}$.

Now to make this calculation, our formula is: $\frac{\delta J}{\delta w} = \frac{\delta z}{\delta w}*\frac{\delta J}{\delta z}$.  So once again, we start with our local derivative, which is $w$'s impact on $z$.  

And then we multiply by the derivative we just calculated upstream, $\frac{\delta J}{\delta z}$, $z$'s impact on our cost function $J$.

In [55]:
dz_dw = delta_z_delta_w(first_x)
dJ_dw = dz_dw*dJ_dz
dJ_dw

tensor([-0.2539, -0.0761])

So we get the same answer as when multiplying all three terms together: $\frac{\delta J}{\delta w} = \frac{\delta J}{\delta \sigma} * \frac{\delta \sigma}{\delta z} * \frac{\delta z}{\delta w}$.

In [56]:
dj_dsig*dsig_dz*dz_dw

tensor([-0.2539, -0.0761])

What's nice about this is that **we can reuse $\frac{\delta J}{\delta z}$** when calculating $\frac{\delta J}{\delta b}$.  Let's see this.

4. Find the impact of a change in $b$ on $J$, that is $\frac{\delta J}{\delta b}$.

This again is our local derivative multiplied by the same directly upstream derivative $\frac{\delta J}{\delta z}$.

In [57]:
dz_db = delta_z_delta_b()
dJ_db = dz_db*dJ_dz
dJ_db

tensor(-0.2536)

So this is backpropagation.  It's a small trick that allows us to reuse some of our earlier calculations.  With backpropagation, we start at the top, here with $\frac{\delta J}{\delta \sigma}$, and then move down through our layers, each time calculating that layer's impact on the cost function $J$ by multiplying the local derivative by the directly upstream derivative.
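
The whole backward pass for our small network can be sketched as one function.  The name `backward` and its signature are our own choices for illustration, and the sketch assumes a single observation with the squared error cost used in this lesson.

```python
import torch

def backward(x, y, y_hat):
    """Move from the cost function down to the parameters, one layer at a time."""
    dJ_dsig = 2 * (y_hat - y)      # top of the chain: sigma's impact on J
    dsig_dz = y_hat * (1 - y_hat)  # local derivative of the sigmoid, reusing y_hat
    dJ_dz = dsig_dz * dJ_dsig      # calculated once, reused for both parameters
    dJ_dw = x * dJ_dz              # local derivative dz/dw = x
    dJ_db = 1 * dJ_dz              # local derivative dz/db = 1
    return dJ_dw, dJ_db

x = torch.tensor([1.001, 0.3001])
y = torch.tensor(1.0)
w = torch.tensor([0.5, 0.3])
b = torch.tensor(-2.0)

# Forward propagation first, so that y_hat is available
y_hat = torch.sigmoid(x @ w + b)

dJ_dw, dJ_db = backward(x, y, y_hat)
print(dJ_dw, dJ_db)
```

Here `dJ_dz` is computed a single time and then shared by `dJ_dw` and `dJ_db`, which is exactly the reuse that backpropagation gives us.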

### All together now

Ok, now let's sum up our process of forward propagation and backward propagation.  We start by initializing the values of our parameters.

In [220]:
w = torch.tensor([.5, .3]).float()
b = torch.tensor(-2.).float()

And then perform forward propagation by passing our data through our layers and calculating the output at each layer.

In [222]:
z = linear_fn(first_x, w, b)
print('z = ', z)
y_hat = activation_fn(z)
print('y_hat = ', y_hat)

z =  tensor(-1.4095)
y_hat =  tensor(0.1963)


Once we have calculated the output at each layer, we move to backward propagation.  With backward propagation, we calculate how nudging each layer's output will change our cost function.  And we make this calculation by multiplying the local derivative by the upstream derivative.  So to calculate $\frac{\delta J}{\delta w}$, we just move downward through our layers.

In [67]:
dj_dsig = delta_J_delta_sigma(y_hat, first_y)

dJ_dz = delta_sigma_delta(z)*dj_dsig
dJ_dw = delta_z_delta_w(first_x)*dJ_dz

dJ_dw

tensor([-0.2539, -0.0761])

And then in calculating $\frac{\delta J}{\delta b}$, we already calculated the upstream derivative $\frac{\delta J}{\delta z}$, and so we only have to calculate the local derivative.

In [68]:
dJ_db = delta_z_delta_b()*dJ_dz
dJ_db

tensor(-0.2536)

> Notice, that in calculating our derivatives above, we did need to use calculations from forward propagation.  For example, we passed through `y_hat` to calculate `delta_J_delta_sigma`.  And we passed through `z` to calculate `delta_sigma_delta(z)`.
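
If we want to double check derivatives like these, one option is to compare them against PyTorch's autograd.  The sketch below rebuilds the same single-observation network with `requires_grad=True`; the `manual_` variable names are our own.

```python
import torch

x = torch.tensor([1.001, 0.3001])
y = torch.tensor(1.0)
w = torch.tensor([0.5, 0.3], requires_grad=True)
b = torch.tensor(-2.0, requires_grad=True)

# Forward propagation
z = x @ w + b
y_hat = torch.sigmoid(z)
J = (y - y_hat) ** 2

# Backward propagation, performed by autograd
J.backward()

# Backward propagation by hand, using the formulas from this lesson
with torch.no_grad():
    dJ_dz = y_hat * (1 - y_hat) * 2 * (y_hat - y)
    manual_dJ_dw = x * dJ_dz
    manual_dJ_db = dJ_dz

print(w.grad, manual_dJ_dw)
print(b.grad, manual_dJ_db)
```

The gradients that autograd stores in `w.grad` and `b.grad` should match the values we calculate by hand.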

Ok, so now let's repeatedly perform forward and backward propagation to descend along the cost curve.

In [223]:
w = torch.tensor([.5, .3]).float()
b = torch.tensor(-2.).float()
eta = .0001 # define learning rate 
for i in range(20):  # 20 passes over the data
    for x, y in zip(X_tensor, y_tensor):
        # forward propagation
        z = linear_fn(x, w, b)
        y_hat = activation_fn(z)

        # backward propagation
        dj_dsig = delta_J_delta_sigma(y_hat, y)
        dJ_dz = delta_sigma_delta(z)*dj_dsig
        dJ_dw = delta_z_delta_w(x)*dJ_dz  # local derivative for the current observation
        dJ_db = delta_z_delta_b()*dJ_dz

        # update params w, b
        w = w - eta*dJ_dw
        b = b - eta*dJ_db

In [224]:
w, b

After these passes, `w` and `b` will have moved away from their starting values, each in the direction that lowers the cost.
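
To see that this loop really does descend the cost curve, we can measure the total cost before and after training.  The sketch below is self-contained and uses a tiny, hypothetical dataset rather than the lesson's CSV; the helper `total_cost` and the larger learning rate are also our own choices.

```python
import torch

# Hypothetical toy data, NOT the lesson's dataset
X = torch.tensor([[1.001, 0.3001], [1.326, 0.0869], [0.5, 0.02], [0.4, 0.01]])
y = torch.tensor([1.0, 1.0, 0.0, 0.0])

def total_cost(X, y, w, b):
    """Sum of squared errors over every observation."""
    y_hat = torch.sigmoid(X @ w + b)
    return torch.sum((y - y_hat) ** 2)

w = torch.tensor([0.5, 0.3])
b = torch.tensor(-2.0)
eta = 0.1  # larger than the lesson's value so the toy run moves quickly

cost_before = total_cost(X, y, w, b)

for epoch in range(100):
    for x_i, y_i in zip(X, y):
        # forward propagation
        z = x_i @ w + b
        y_hat = torch.sigmoid(z)
        # backward propagation
        dJ_dz = y_hat * (1 - y_hat) * 2 * (y_hat - y_i)
        # update params w, b
        w = w - eta * x_i * dJ_dz
        b = b - eta * dJ_dz

cost_after = total_cost(X, y, w, b)
print(cost_before, cost_after)
```

If the updates are working, the cost after training should be lower than the cost we started with.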

And notice how this aligns with the gradient descent approach we saw in PyTorch. 

```python
for (x, y) in zip(X_train_tensor_gpu, y_train_tensor_gpu): # loop through observations and labels
    net.zero_grad()             # remove previously calculated derivatives
    y_hat = net(x)              # 1. Forward prop: With current weights make a prediction
    loss = criterion(y_hat, y)  # 2. See how off the prediction is according to the cost function
    loss.backward()             # 3. Back prop: Calculate how each layer's change in output affects J
    opt.step()                  # 4. Update params based on eta and calculated derivatives 
```
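
The snippet above relies on `net`, `criterion` and `opt` being defined elsewhere.  Here is one way a complete version could look.  It is only a sketch: the toy data is hypothetical, and the choices of `nn.Sequential`, `nn.MSELoss(reduction='sum')` and `torch.optim.SGD` are our assumptions, picked to mirror the linear layer, sigmoid and squared error cost from this lesson.

```python
import torch
from torch import nn

torch.manual_seed(0)  # make the random initial weights repeatable

# Hypothetical toy data, NOT the lesson's dataset
X = torch.tensor([[1.001, 0.3001], [1.326, 0.0869], [0.5, 0.02], [0.4, 0.01]])
y = torch.tensor([[1.0], [1.0], [0.0], [0.0]])

net = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())  # linear layer, then sigmoid
criterion = nn.MSELoss(reduction='sum')             # sum of squared errors
opt = torch.optim.SGD(net.parameters(), lr=0.1)     # lr plays the role of eta

with torch.no_grad():
    loss_before = criterion(net(X), y).item()

for epoch in range(100):
    for x_i, y_i in zip(X, y):
        opt.zero_grad()               # remove previously calculated derivatives
        y_hat = net(x_i)              # 1. forward propagation
        loss = criterion(y_hat, y_i)  # 2. cost for this observation
        loss.backward()               # 3. backward propagation
        opt.step()                    # 4. update params

with torch.no_grad():
    loss_after = criterion(net(X), y).item()

print(loss_before, loss_after)
```

Each of the four numbered steps lines up with a block in our handwritten loop: the forward pass, the cost, the backward pass, and the parameter update.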

So that's forward and backward propagation.  Let's now go back to using PyTorch to perform image classification and see if we can understand the process any better.

### Summary

In this lesson, we learned about forward and backward propagation.  Forward propagation just means to calculate the output of each layer when we pass data through our hypothesis function.  

And with backward propagation we calculate each layer's impact on the cost function by multiplying the local derivative by the upstream derivative.  So moving backwards through our layers, we calculated:

* $\frac{\delta J}{\delta \sigma}$

* $\frac{\delta J}{\delta z} = \frac{\delta \sigma}{\delta z}*\frac{\delta J}{\delta \sigma}$

* $\frac{\delta J}{\delta w} = \frac{\delta z}{\delta w}*\frac{\delta J}{\delta z}$

And finally, we calculated:

* $\frac{\delta J}{\delta b} = \frac{\delta z}{\delta b}*\frac{\delta J}{\delta z}$

Each time, we multiplied the local derivative by the upstream derivative.  

Finally, we used forward and backward propagation to perform gradient descent, and saw how it aligned with our code in PyTorch.