## Training in PyTorch

### Introduction

In this lesson, we'll learn about an extra feature that we get by using PyTorch tensors: the ability of a tensor to calculate its own gradient.  This is probably *the key feature* of PyTorch.  We'll review the importance of the gradient, and see how PyTorch can help us calculate it.

### A forward and backward pass

Now let's review how we would update our parameters.  Currently, our forward pass of the data looks like the following:

* $x_i = [4, 5]$
* $z = 2x_1 + 3x_2  -3$
* $\ell = (y_i - z(x_i))^2$

Now think about gradient descent.  The next step is to update the parameters, which first requires calculating the gradients $\frac{\delta \ell}{\delta w_1}$ and $\frac{\delta \ell}{\delta w_2}$.  In other words, how much does a change in $w_1$ or $w_2$ change our loss, $\ell$?

Remember, we determine this through backpropagation.  So moving backwards, we calculate:

* $\frac{\delta \ell}{\delta z} = 2(z - y)$
* $\frac{\delta z}{\delta w_1} = x_{i1}$
* $\frac{\delta z}{\delta w_2} = x_{i2}$

And with our current values of $x_i = [4, 5]$, $y = 24$, $w = [2, 3]$, and $b = -3$, we get:

* $\frac{\delta \ell}{\delta z} = 2((23 - 3) - 24) = -8$
* $\frac{\delta z}{\delta w_1} = x_{i1} = 4$
* $\frac{\delta z}{\delta w_2} = x_{i2} = 5$
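To finish the calculation, the chain rule multiplies the upstream gradient $\frac{\delta \ell}{\delta z}$ by each local gradient.  Below is a minimal sketch of that arithmetic in plain Python, without any autograd.  The variable names (`dl_dz`, `dz_dw`, `dl_dw`) are our own labels for illustration.

```python
# Manual forward and backward pass for one observation (plain Python, no autograd).
x = [4.0, 5.0]   # features of the observation
w = [2.0, 3.0]   # weights w_1, w_2
b = -3.0         # bias
y = 24.0         # target

# Forward pass
z = w[0] * x[0] + w[1] * x[1] + b    # 2*4 + 3*5 - 3
loss = (y - z) ** 2

# Backward pass
dl_dz = 2 * (z - y)                  # upstream gradient
dz_dw = x                            # local gradients: dz/dw_1 = x_1, dz/dw_2 = x_2
dl_dw = [dl_dz * g for g in dz_dw]   # chain rule: upstream times local

print(z, loss, dl_dz, dl_dw)
```

Both entries of `dl_dw` come out negative, which tells us that increasing either weight would lower the loss.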

### Translating to PyTorch

Let's begin by initializing some data in PyTorch.

In [24]:
import torch

x = torch.tensor([4., 5.], requires_grad = True).T
#          mean area, concavities
x

tensor([4., 5.], grad_fn=<PermuteBackward>)

We begin by defining a tensor representing a single observation. Let's say that this is our cancer observation with mean area of 4 and concavities at 5.

And we'll initialize our parameters of $w_1 = 2$, $w_2 = 3$ and $b = -3$.

In [60]:
w = torch.tensor([2, 3.], requires_grad = True)
b = torch.tensor([-3.], requires_grad = True)

Now let's define our linear function.

In [61]:
def z(x):
    return w @ x + b

In [62]:
z(x)

tensor([20.], grad_fn=<AddBackward0>)

We can define a loss function as the following:

In [63]:
def l(y, z):
    return (y - z) @ (y - z)

In [64]:
y = torch.tensor(24., requires_grad = True)

In [65]:
loss = l(y, z(x))
loss
# (24 - 20)^2

tensor(16., grad_fn=<DotBackward>)

In [66]:
loss.backward()

In [67]:
y.grad

tensor(8.)

In [68]:
w.grad

tensor([-32., -40.])

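> The gradient is negative because $\frac{\delta \ell}{\delta z} = 2(z - y) = 2(20 - 24) = -8$.  Our prediction is below the target, so increasing either weight raises $z$ and lowers the loss.

What's interesting is that PyTorch has calculated the gradient $\frac{\delta \ell}{\delta w}$ for us.  To do this, it took the upstream gradient $\frac{\delta \ell}{\delta z}$ and multiplied it by the local gradients $\frac{\delta z}{\delta w_1}$ and $\frac{\delta z}{\delta w_2}$.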
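As a sanity check, we can compare what autograd produces with the chain rule worked out by hand.  This is a sketch under a few assumptions of our own: `x` and `y` are plain tensors that do not require gradients, `b` is a scalar tensor, and the loss is written as a square rather than a dot product.

```python
import torch

x = torch.tensor([4., 5.])                       # inputs: no gradient needed here
w = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor(-3., requires_grad=True)
y = torch.tensor(24.)

z = w @ x + b          # forward pass
loss = (y - z) ** 2
loss.backward()        # autograd fills in w.grad and b.grad

# Manual chain rule: upstream gradient times local gradient
upstream = 2 * (z.detach() - y)    # dl/dz
manual_w_grad = upstream * x       # dl/dw = dl/dz * dz/dw

print(w.grad, manual_w_grad)
print(torch.allclose(w.grad, manual_w_grad))
```

If the two printed tensors agree, autograd is doing the same multiplication we did by hand.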

### Knowledge of upstream gradient

Let's take a moment to appreciate what we just saw.  We initialized our data:

In [113]:
x = torch.tensor([4., 5.], requires_grad = True).T
w = torch.tensor([2, 3.], requires_grad = True)
b = torch.tensor([-3.], requires_grad = True)
y = torch.tensor(24., requires_grad = True)

And then we made a forward pass of the data with the following:

In [114]:
z = (w @ x + b)
loss = (y - z) @ (y - z) 
loss 

tensor(16., grad_fn=<DotBackward>)

And we can now find $\frac{\delta \ell}{\delta w}$ by calling `loss.backward()`, which we'll hold off on for a moment:

In [115]:
# loss.backward()

Somehow our tensor $w$ is both finding the local gradient, and also multiplying by the upstream gradient.  How does it know about the upstream gradient?  PyTorch records this information during the forward pass.  Let's take a look at `loss`.

In [116]:
loss

tensor(16., grad_fn=<DotBackward>)

We can see that the loss tensor knows how it was created, through a dot product, and it also knows which tensors preceded that.

In [117]:
loss.grad_fn.next_functions

((<SubBackward0 at 0x10be43190>, 0), (<SubBackward0 at 0x10be431d0>, 0))

It sees that the components of the dot product were made by two subtractions.  
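We can keep following `next_functions` all the way back to the parameters.  The sketch below uses a hypothetical helper, `describe_graph`, that we wrote for illustration (it is not part of PyTorch) to collect the name of every backward node reachable from `loss.grad_fn`.  The exact node names, such as `DotBackward` versus `DotBackward0`, may differ between PyTorch versions.

```python
import torch

def describe_graph(fn, depth=0, lines=None):
    """Hypothetical helper: list the backward nodes reachable from fn."""
    if lines is None:
        lines = []
    if fn is None:   # inputs that do not require a gradient have no node
        return lines
    lines.append("  " * depth + type(fn).__name__)
    for parent, _ in fn.next_functions:
        describe_graph(parent, depth + 1, lines)
    return lines

x = torch.tensor([4., 5.])
w = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([-3.], requires_grad=True)
y = torch.tensor(24.)

diff = y - (w @ x + b)
loss = diff @ diff

graph_lines = describe_graph(loss.grad_fn)
print("\n".join(graph_lines))
```

The nodes at the bottom of the printout correspond to the leaf tensors `w` and `b`, which is where the gradients end up being stored.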

The takeaway is that if we specify `requires_grad = True`, PyTorch will keep track of the forward pass and how each tensor was produced.

* $x_i = [4, 5]$
* $z = 2x_1 + 3x_2  -3$
* $\ell = (y_i - z(x_i))^2$
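To see the effect of `requires_grad`, we can compare a tensor that tracks its history with one that does not.  This is a small sketch with made-up variable names (`tracked`, `untracked`), and it also shows `torch.no_grad()`, which switches tracking off temporarily.

```python
import torch

untracked = torch.tensor([4., 5.])                     # requires_grad defaults to False
tracked = torch.tensor([4., 5.], requires_grad=True)

out_untracked = (untracked * 2).sum()
out_tracked = (tracked * 2).sum()

print(out_untracked.grad_fn)   # no history was recorded
print(out_tracked.grad_fn)     # a backward node for the sum

# Inside torch.no_grad(), even tracked tensors produce results without history
with torch.no_grad():
    out_no_grad = (tracked * 2).sum()
print(out_no_grad.requires_grad)
```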

In [118]:
loss.grad_fn

<DotBackward at 0x1145f8790>

Because each tensor keeps track of the operations and tensors that produced it, when we call `loss.backward()` PyTorch knows to calculate the gradient of $w_1$ and $w_2$ by multiplying their local gradients by the upstream gradient.  Before that call, no gradients have been computed.

In [119]:
loss.grad, w.grad

(None, None)

In [120]:
loss.backward()

In [121]:
y.grad, w.grad

(tensor(8.), tensor([-32., -40.]))

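So we can see that in `w`, PyTorch found the local gradient and multiplied it by the upstream gradient.

> The local gradient is $x_1$ and $x_2$.  Note that `y.grad` holds $\frac{\delta \ell}{\delta y} = 2(y - z)$, which is the negative of the upstream gradient $\frac{\delta \ell}{\delta z} = 2(z - y)$, so we flip its sign before dividing.

In [122]:
local_gradient = w.grad / (-y.grad)  # divide by dl/dz, which equals -dl/dy
local_gradient

tensor([4., 5.])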

In [123]:
x

tensor([4., 5.], grad_fn=<PermuteBackward>)

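And we see that `y.grad`, which is $\frac{\delta \ell}{\delta y}$, is $8$, so the upstream gradient $\frac{\delta \ell}{\delta z}$ is $-8$.  Multiplying $-8$ by the local gradient $[4, 5]$ gives `w.grad`.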

In [124]:
y.grad

tensor(8.)

In [125]:
w.grad

tensor([-32., -40.])
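Rather than inferring the upstream gradient from `y.grad`, we can ask PyTorch to keep the gradient of the intermediate tensor `z` itself.  The sketch below uses `retain_grad()` for this.  It assumes `x` and `y` are plain tensors without `requires_grad`, which differs slightly from the cells above.

```python
import torch

x = torch.tensor([4., 5.])
w = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor([-3.], requires_grad=True)
y = torch.tensor(24.)

z = w @ x + b
z.retain_grad()            # keep the gradient of this intermediate tensor
loss = (y - z) @ (y - z)
loss.backward()

print(z.grad)              # upstream gradient dl/dz
print(w.grad)              # upstream gradient times the local gradient x
print(w.grad / z.grad)     # dividing out the upstream gradient recovers x
```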

### Finishing up 

After calculating the gradient, the next step is to move the parameters $w$ in the direction opposite the gradient, by subtracting the gradient times the learning rate.

In [126]:
w

tensor([2., 3.], requires_grad=True)

In [127]:
w - .001*w.grad  # subtract: step against the gradient to reduce the loss

tensor([2.0320, 3.0400], grad_fn=<SubBackward0>)

Then after updating the parameters, we should zero the gradient on the parameter vector, because PyTorch accumulates gradients across calls to `backward()`, and recalculate the gradient for the next change.

In [129]:
w.grad

tensor([-32., -40.])

In [130]:
w.grad.zero_()

tensor([0., 0.])
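Putting these steps together, a few rounds of gradient descent could look like the sketch below.  The learning rate of `0.001`, the five iterations, and the scalar `b` are assumptions chosen for illustration.  The update is wrapped in `torch.no_grad()` so that the update itself is not recorded in the graph.

```python
import torch

x = torch.tensor([4., 5.])
y = torch.tensor(24.)
w = torch.tensor([2., 3.], requires_grad=True)
b = torch.tensor(-3., requires_grad=True)
learning_rate = 0.001   # assumed value for this sketch

losses = []
for step in range(5):
    z = w @ x + b                      # forward pass
    loss = (y - z) ** 2
    losses.append(loss.item())

    loss.backward()                    # populate w.grad and b.grad

    with torch.no_grad():              # the update itself should not be tracked
        w -= learning_rate * w.grad    # step against the gradient
        b -= learning_rate * b.grad

    w.grad.zero_()                     # otherwise the next backward() adds to these
    b.grad.zero_()

print(losses)
```

Each pass through the loop should print a smaller loss than the one before, since we step against the gradient every time.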

### Summary

In this lesson, we got a sense of how backpropagation occurs in PyTorch.  We saw that if we set `requires_grad = True` on our tensors, then PyTorch will keep track of how tensors computed from them are created.

> For example, below, our tensor $z$ knows that it was created by a dot product.

In [134]:
x = torch.tensor([4., 5.], requires_grad = True).T
w = torch.tensor([2, 3.], requires_grad = True)

z = w @ x
z

tensor(23., grad_fn=<DotBackward>)

And `z` knows that the dot product came from tensors $x$ and $w$.  Because of this knowledge, we can perform backpropagation, where we start with the outermost function (above, `z`) and work backwards.

From this, downstream tensors will have knowledge of gradients calculated upstream, and include these upstream gradients in their calculation.

### Resources

[Autograd paperspace](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/)