# Gradient Descent with Math

### Introduction

Now that we understand a bit more about derivatives, it's time to go deeper into gradient descent. Remember that this is why we learned about derivatives: they allow us to calculate the slope of our cost curve or, to put it another way, how our cost function changes as we change the parameters of our neural network.

<img src="./cost-curve-slopes.png" width="40%">

Now we generally define our cost function as $J(\theta)$.  So our cost changes as we change our parameters $\theta$.

So really when we have our formula of $next\_w_1 = w_1 - learning\_rate*slope\_at(w_1)$ it's generally written as:

* $next\_w_1 = w_1 - learning\_rate * \frac{\partial J}{\partial w_1}$

That is, we update our weight $w_1$ in proportion to how much our cost function $J$ changes as we nudge $w_1$.
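
To make the update rule concrete, here is a minimal sketch in plain Python. The cost curve $J(w) = (w - 3)^2$, the starting weight, the learning rate, and the number of steps are all assumptions chosen for illustration; they are not taken from the cancer cell example.

```python
# Hypothetical cost curve J(w) = (w - 3)^2, which is lowest at w = 3
def cost(w):
    return (w - 3) ** 2

def slope_at(w):
    # derivative of (w - 3)^2 with respect to w
    return 2 * (w - 3)

learning_rate = 0.1  # assumed value
w_1 = 0.0            # assumed starting weight

for step in range(50):
    # next_w_1 = w_1 - learning_rate * slope_at(w_1)
    next_w_1 = w_1 - learning_rate * slope_at(w_1)
    w_1 = next_w_1

print(w_1, cost(w_1))
```

Each step moves $w_1$ against the slope, so the weight drifts toward the bottom of this made-up cost curve.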

In this lesson, we'll see how PyTorch applies the rules we saw for the derivative, and how we can use the derivative to descend a cost curve with multiple parameters.

### Back to PyTorch

Now let's make sense of what it means for PyTorch to use the derivative in gradient descent. With gradient descent, our ultimate goal is to figure out how nudging our parameter $w$ changes the output of our entire cost function, but let's start with how nudging $w$ changes the output of our linear layer.

So let's go back to loading up our cancer cell data.

In [71]:
import pandas as pd
import torch

df = pd.read_csv('./cell_data-2.csv', index_col = 0)

X = torch.tensor(df[['mean_area']].values).float()
y = torch.tensor(df['is_cancerous'].values).float()

And then defining the linear layer of our neuron with the following:

In [72]:
x = X[0] # 1.0010
W = torch.tensor([[1.]], requires_grad = True) 

# linear layer
z = x @ W
# activation layer
a = torch.sigmoid(z)

Ok, now let's think about how the output of our linear layer will change as we change $w$.  Mathematically, we can represent this problem as the following:

$l(w) = w* 1.0010$

> We call our linear function $l$, and notice that we define it as a function of $w$, as that is what we'll be changing.

Then, using our rules for the derivative, we can calculate the derivative as:

$l'(w) = 1*w^{0}*1.0010 = 1.0010$

> So for every unit we increase $w$ by, we increase the output by $1.0010$.
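
As a rough check on that claim, we can nudge $w$ by a small amount and see how much the output moves. This is a sketch in plain Python with the input hardcoded to $1.0010$; the size of the nudge is an assumption.

```python
x = 1.0010  # the feature value from the lesson, hardcoded here

def l(w):
    # linear layer with a single weight and a single input
    return w * x

w = 1.0
nudge = 0.001  # assumed small change in w

# change in output divided by change in w approximates the derivative
slope = (l(w + nudge) - l(w)) / nudge
print(round(slope, 4))
```

Because $l$ is a straight line in $w$, this ratio should match $l'(w)$ up to floating point error, whatever the size of the nudge.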

And we can see that this is what PyTorch calculates.

In [73]:
z.backward()

In [74]:
W.grad

tensor([[1.0010]])

Now, as we said, PyTorch's task isn't *only* to calculate how the output of the linear layer changes. It has to calculate how a change in $w$ carries through the linear layer, to the activation layer, and on to the cost function.
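
As a first step in that direction, the sketch below calls `backward()` on the activation output instead of the linear output, and compares the result with the chain rule worked by hand. It assumes the input is hardcoded to $1.0010$ rather than loaded from the CSV, and it relies on the standard derivative of the sigmoid, $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, which this lesson has not derived.

```python
import torch

x = torch.tensor([1.0010])  # assumed stand-in for X[0]
W = torch.tensor([[1.]], requires_grad=True)

# linear layer
z = x @ W
# activation layer
a = torch.sigmoid(z)

# differentiate the activation output with respect to W
a.backward()

# chain rule by hand: da/dw = sigmoid'(z) * dz/dw = a * (1 - a) * x
manual_grad = (a * (1 - a) * x).detach()

print(W.grad)
print(manual_grad)
```

The two printed values should agree, which suggests that PyTorch is multiplying the derivative of each layer together as it works backward from the output.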