<a href="https://colab.research.google.com/github/kimiyayam/macro/blob/main/lab05_autograd.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab 5 – Autograd

Before working on this notebook:
  - Create a copy in your drive
  - Set your Runtime to None

Adapted from: [Original Source](https://www.youtube.com/redirect?event=video_description&redir_token=QUFFLUhqa2IzMmVuNGxhUm9XcnN1UXVydWVjZEpkaEJvd3xBQ3Jtc0ttTkp6Tml1MzlqT1Fia3dFNTdteFVlbW5BVGxDNzMxZW51YzVnTUR6cURhOU1TRHdqQmtUaWZfekppMkswUm52SUZ6d285SHA5YVdfMHF3WmhyYWZuODNER0trLTUyM3VQNHpCcnEtakZxWXMwNXI1RQ&q=https%3A%2F%2Fpytorch-tutorial-assets.s3.amazonaws.com%2Fyoutube-series%2FVideo%2B3%2B-%2BAutograd.ipynb)

PyTorch's *Autograd* feature is part of what makes PyTorch flexible and fast for building machine learning projects. It allows for the rapid and easy computation of multiple partial derivatives (also referred to as *gradients*) over a complex computation. This operation is central to backpropagation-based neural network learning.

The power of autograd comes from the fact that it traces your computation dynamically *at runtime*, meaning that if your model has decision branches, or loops whose lengths are not known until runtime, the computation will still be traced correctly, and you'll get correct gradients to drive learning. This, combined with the fact that your models are built in Python, offers far more flexibility than frameworks that rely on static analysis of a more rigidly structured model for computing gradients.
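As a minimal sketch of this dynamic tracing, the function below (a made-up example, not part of the original tutorial) contains a loop and a branch whose behaviour depends on the values seen at runtime. Autograd records only the operations that were actually executed:

```python
import torch

def dynamic_fn(x, n_steps):
    # The branch taken at each iteration is only known at runtime.
    y = x
    for _ in range(n_steps):
        if y.item() > 10.0:
            y = y + 1.0      # local derivative of this step: 1
        else:
            y = y * 2.0      # local derivative of this step: 2
    return y

x = torch.tensor(1.5, requires_grad=True)
y = dynamic_fn(x, n_steps=5)
y.backward()

# Three doublings followed by two additions: y = 14, dy/dx = 2 * 2 * 2 = 8
print(y.item(), x.grad.item())
```

With a different starting value, a different sequence of branches would be traced, and the gradient would change accordingly.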

## What Do We Need Autograd For?

In training a model, we want to minimize the loss, $L$. In the idealized case of a perfect model, that means adjusting its learning weights - that is, the adjustable parameters of the function - such that loss is zero for all inputs. In the real world, it means an iterative process of nudging the learning weights until we see that we get a tolerable loss for a wide variety of inputs.

How do we decide how far and in which direction to nudge the weights? We want to *minimize* the loss, which means making its first derivative with respect to the input equal to 0: $\frac{\partial L}{\partial x} = 0$.

Recall, though, that the loss is not *directly* derived from the input, but a function of the model's output (which is a function of the input directly). Writing $M$ for the model, so that $\vec{y} = M(x)$, the chain rule of differential calculus gives $\frac{\partial L(\vec{y})}{\partial x} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial x} = \frac{\partial L}{\partial y}\frac{\partial M(x)}{\partial x}$.

$\frac{\partial M(x)}{\partial x}$ is where things get complex. The partial derivatives of the model's outputs with respect to its inputs, if we were to expand the expression using the chain rule again, would involve many local partial derivatives over every multiplied learning weight, every activation function, and every other mathematical transformation in the model. The full expression for each such partial derivative is a sum over *every possible path* through the computation graph that ends with the variable whose gradient we are trying to measure, where each path contributes the product of the local gradients along it.

In particular, the gradients over the learning weights are of interest to us - they tell us *what direction to change each weight* to get the loss function closer to zero.

Since the number of such local derivatives (each corresponding to a separate path through the model's computation graph) will tend to go up exponentially with the depth of a neural network, so does the complexity in computing them. This is where autograd comes in: It tracks the history of every computation. Every computed tensor in your PyTorch model carries a history of its input tensors and the function used to create it. Combined with the fact that PyTorch functions meant to act on tensors each have a built-in implementation for computing their own derivatives, this greatly speeds the computation of the local derivatives needed for learning.
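To make the "sum over paths" idea concrete, here is a small sketch (the function and variable names are chosen only for illustration) in which the input reaches the output along two paths. The gradient autograd reports matches the chain-rule expression worked out by hand:

```python
import torch

x = torch.tensor(0.5, requires_grad=True)

# x reaches the output along two paths: through u = x**2 and through v = sin(x).
u = x ** 2
v = torch.sin(x)
out = u * v

out.backward()

# By hand: d(out)/dx = (d out/du)(du/dx) + (d out/dv)(dv/dx)
#                    = v * 2x + u * cos(x)
with torch.no_grad():
    manual = v * 2 * x + u * torch.cos(x)

print(x.grad, manual)
```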

## A Simple Example

Let's start with a straightforward example. First, we'll do some imports to let us graph our results:

In [None]:
%matplotlib inline

In [None]:
import torch

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import math

Next, we'll create an input tensor full of evenly spaced values on the interval $[0, 2{\pi}]$, and specify `requires_grad=True`. (Like most functions that create tensors, `torch.linspace()` accepts an optional `requires_grad` option.) Setting this flag means that in every computation that follows, autograd will be accumulating the history of the computation in the output tensors of that computation.

In [None]:
a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
print(a)

In [None]:
#Q1. What would happen if requires_grad is not set to True?


Next, we'll perform a computation, and plot its output in terms of its inputs:

In [None]:
b = torch.sin(a)
plt.plot(a.detach(), b.detach()) # can't call plot on tensors that require grads. Detach them first


Let's have a closer look at the tensor `b`. When we print it, we see an indicator that it is tracking its computation history:

In [None]:
print(b)

In [None]:
#Q2. What does the argument grad_fn=<SinBackward0> indicate here?


Let's perform some more computations:

In [None]:
c = 2 * b
print(f'c = {c}')

d = c + 1
print(f'd = {d}')

Finally, let's compute a single-element output. 

In [None]:
out = d.sum()
print(out)

## Tracking the Computations
Each `grad_fn` stored with our tensors allows you to walk the computation all the way back to its inputs with its `next_functions` property. We can see below that drilling down on this property on `d` shows us the gradient functions for all the prior tensors. Note that `a.grad_fn` is reported as `None`, indicating that this was an input to the function with no history of its own.

In [None]:
print('d:')
print(d.grad_fn)
print(d.grad_fn.next_functions)
print(d.grad_fn.next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)
print('\nc:')
print(c.grad_fn)
print('\nb:')
print(b.grad_fn)
print('\na:')
print(a.grad_fn)
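Rather than chaining `next_functions` by hand, the same walk can be written as a short recursive function. This is only a sketch: `walk_graph` is a hypothetical helper written for this lab, not a PyTorch function, and the exact node names it prints may vary between PyTorch versions.

```python
import math
import torch

def walk_graph(fn, depth=0):
    """Hypothetical helper: collect (depth, node name) pairs reachable from a grad_fn."""
    if fn is None:           # constants such as the 2 and the 1 have no graph node
        return []
    nodes = [(depth, type(fn).__name__)]
    for next_fn, _ in fn.next_functions:
        nodes.extend(walk_graph(next_fn, depth + 1))
    return nodes

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
d = 2 * torch.sin(a) + 1

nodes = walk_graph(d.grad_fn)
for depth, name in nodes:
    print("  " * depth + name)
```

The last node reached is the one that accumulates the gradient into the leaf tensor `a`, which is why `a` itself has no `grad_fn`.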

With all this machinery in place, how do we get derivatives out? You call the `backward()` method on the output, and check the input's `grad` property to inspect the gradients. When you call `.backward()` on a tensor with no arguments, it expects the calling tensor to contain only a single element, as is the case when computing a loss function.

In [None]:
out.backward()
print(a.grad)
plt.plot(a.detach(), a.grad.detach())
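The single-element requirement mentioned above can be seen directly. In the sketch below, calling `backward()` with no arguments on a 25-element tensor raises an error, while passing a `gradient` tensor of the same shape works. Using a tensor of ones is an illustrative choice here; it weights every element equally, which is what summing first would do.

```python
import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
d = 2 * torch.sin(a) + 1      # 25 elements, not a scalar

try:
    d.backward()              # no arguments on a non-scalar tensor
except RuntimeError as err:
    print(f"RuntimeError: {err}")

# Supplying one weight per element of d makes the call well defined.
d.backward(gradient=torch.ones_like(d))
print(a.grad.shape)
```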

In [None]:
#Q3. What is the range of values of the gradients for this function?


Recall the computation steps we took to get here:

```
a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
b = torch.sin(a)
c = 2 * b
d = c + 1
out = d.sum()
```

Adding a constant, as we did to compute `d`, does not change the derivative. That leaves $c = 2 \cdot b = 2 \sin(a)$, the derivative of which should be $2 \cos(a)$. Looking at the graph above, that's just what we see.
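The visual check can be backed up numerically. This sketch repeats the computation and compares the gradient from autograd with the analytic derivative; the tolerance passed to `torch.allclose()` is an arbitrary choice for float32 values.

```python
import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
out = (2 * torch.sin(a) + 1).sum()
out.backward()

expected = 2 * torch.cos(a.detach())   # analytic derivative, computed without tracking
max_diff = (a.grad - expected).abs().max()
print(torch.allclose(a.grad, expected, atol=1e-6), max_diff.item())
```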

### Exercise

In [None]:
# Q4. Create a tensor containing the numbers 0 to 4 (5 floats) and set it to keep track of the history of the computation in the output tensors


In [None]:
#DELETE THIS LINE!
# x.grad.zero_() # zero the gradients (tensors have no zero_grad() method; only needed when re-running this cell)

# Q5. Create a new tensor y to be the dot product of x on itself. Use torch.dot()

#print(y)

# Q6. Calculate the gradient of y with respect to x by calling the function for backpropagation 

# Q7. Print the gradient


In [None]:
# Q8. Are the gradients consistent with what you would expect from differentiating the function y = x^2?

# Q9. Plot the graph for x and its gradients. Use detach() to detach them first


In [None]:
# Q10. What function is dy/dx based on the plot? Verify further by running this code.
x.grad == 2 * x


## Autograd in NN Training

We've had a brief look at how autograd works, but how does it look when it's used for its intended purpose? Let's define a small model and examine how it changes after a single training batch. First, define a few constants, our model, and some stand-ins for inputs and outputs:

In [None]:
BATCH_SIZE = 16
DIM_IN = 1000
HIDDEN_SIZE = 100
DIM_OUT = 10

class TinyModel(torch.nn.Module):

    def __init__(self):
        super().__init__()

        self.layer1 = torch.nn.Linear(DIM_IN, HIDDEN_SIZE)
        self.relu = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(HIDDEN_SIZE, DIM_OUT)
    
    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x
    
some_input = torch.randn(BATCH_SIZE, DIM_IN, requires_grad=False)
ideal_output = torch.randn(BATCH_SIZE, DIM_OUT, requires_grad=False)

model = TinyModel()

In [None]:
#Q11. How many input units does this NN have?

#Q12. How many output units does it have?

#Q13. How many hidden layers does it have?


One thing you might notice is that we never specify `requires_grad=True` for the model's layers. Within a subclass of `torch.nn.Module`, it's assumed that we want to track gradients on the layers' weights for learning.
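This default can be inspected with `named_parameters()`. The sketch below uses a small stand-in layer (its sizes are arbitrary) rather than the model above, and also shows that the flag can be switched off for an individual parameter with `requires_grad_()`:

```python
import torch

layer = torch.nn.Linear(4, 2)   # small stand-in layer; sizes chosen only for illustration

# Parameters registered on a Module have requires_grad=True by default.
for name, param in layer.named_parameters():
    print(name, tuple(param.shape), param.requires_grad)

# A parameter can be frozen by switching the flag off in place.
layer.bias.requires_grad_(False)
trainable = [name for name, p in layer.named_parameters() if p.requires_grad]
print(trainable)
```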

## Initial Parameter Values
If we look at the layers of the model, we can examine the values of the weights:

In [None]:
print(f"Layer2 sample weights = \n{model.layer2.weight[0][0:10]}") # just a small slice
print(f"Layer2 sample bias = \n{model.layer2.bias[0]}") # just one bias

print("\nGradients:")
print(f"Weights = {model.layer2.weight.grad}")
print(f"Bias = {model.layer2.bias.grad}")


In [None]:
#Q14. Why are the gradients 'None' for the sample weights and bias?


## Forward Pass & Loss Calculation
Let's see how this changes when we run through one training batch. For a loss function, we'll just use the square of the Euclidean distance between our `prediction` and the `ideal_output` (that is, the sum of squared errors; dividing by the number of elements would give the MSE), and we'll use a basic stochastic gradient descent optimizer.

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

prediction = model(some_input)

loss = (ideal_output - prediction).pow(2).sum() # sum of squared errors
print(loss)
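The loss computed above sums the squared errors rather than averaging them. As a sketch of how this relates to PyTorch's built-in loss, the snippet below uses random stand-in tensors (not the model's actual outputs) to compare the hand-written expression with `torch.nn.functional.mse_loss()` under both reductions:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
prediction = torch.randn(16, 10)     # stand-in values for illustration
ideal_output = torch.randn(16, 10)

sum_sq = (ideal_output - prediction).pow(2).sum()
same_sum = F.mse_loss(prediction, ideal_output, reduction="sum")
mean_sq = F.mse_loss(prediction, ideal_output)   # default reduction="mean"

print(sum_sq.item(), same_sum.item(), mean_sq.item())
print(torch.allclose(sum_sq / prediction.numel(), mean_sq))
```

The two versions differ only by a constant factor (the number of elements), so they lead to the same minimum, but the gradients are scaled differently, which interacts with the learning rate.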

## Backpropagating the Loss: `backward()`
Now, let's call `loss.backward()` and see what happens:

In [None]:
loss.backward()
# Print the parameters
print("Parameters:")
print(f"Layer2 sample weights = {model.layer2.weight[0][0:10]}")
print(f"Layer2 sample bias = {model.layer2.bias[0]}")

# Print the gradients
print("\nGradients:")
print(f"Layer2 sample weight gradients = {model.layer2.weight.grad[0][0:10]}")
print(f"Layer2 sample bias gradient = {model.layer2.bias.grad[0]}")


In [None]:
#Q15. Are the sample weights and bias the same as before the loss is backpropagated?

#Q16. Are the gradients of the sample weights and bias the same as before the loss is backpropagated?


## Updating the Parameters with the Optimizer: `step()`
We can see that the gradients have been computed for each learning weight, but the weights remain unchanged, because we haven't run the optimizer yet. The optimizer is responsible for updating model weights based on the computed gradients.

In [None]:
optimizer.step()
print("Parameter updates:")
print(f"Layer2 weights = {model.layer2.weight[0][0:10]}")
print(f"Layer2 bias = {model.layer2.bias[0]}")

print("\nGradients:")
print(f"Layer2 weight gradients = {model.layer2.weight.grad[0][0:10]}")
print(f"Layer2 bias gradients = {model.layer2.bias.grad[0]}")


You should see that `layer2`'s sample weights and bias have changed.
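For plain SGD with no momentum or weight decay, as configured in this lab, the update is simply the old weight minus the learning rate times the gradient. A minimal sketch with a small stand-in layer (not the model above) checks this:

```python
import torch

torch.manual_seed(0)
lr = 0.001
layer = torch.nn.Linear(3, 1)                 # small stand-in model
optimizer = torch.optim.SGD(layer.parameters(), lr=lr)

loss = layer(torch.randn(8, 3)).pow(2).sum()
loss.backward()

# Keep independent copies of the weight and its gradient before the update.
w_before = layer.weight.detach().clone()
grad = layer.weight.grad.detach().clone()

optimizer.step()

# Plain SGD: w_new = w_old - lr * grad
print(torch.allclose(layer.weight.detach(), w_before - lr * grad))
```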

## Resetting the Gradients: `zero_grad()`
One important thing about the process: After calling `optimizer.step()`, you need to call `optimizer.zero_grad()`, or else every time you run `loss.backward()`, the gradients on the learning weights will accumulate:

In [None]:
print(f"Layer2 some weights: \n{model.layer2.weight[0][0:10]}")
print(f"Layer2 one bias: {model.layer2.bias[0]}")

for i in range(0, 5):
    prediction = model(some_input)
    loss = (ideal_output - prediction).pow(2).sum()
    loss.backward()
    
print("\nAfter a few iterations of training:")
print(f"Layer2 some weights GRADs: \n{model.layer2.weight.grad[0][0:10]}")
print(f"Layer2 one bias GRAD: {model.layer2.bias.grad[0]}")

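# set_to_none=False fills the gradients with zeros so they can still be indexed below;
# the default in PyTorch 2.x (set_to_none=True) replaces them with None instead
optimizer.zero_grad(set_to_none=False)

print("\nAfter resetting gradients:")
print(f"Layer2 some weights GRADs: \n{model.layer2.weight.grad[0][0:10]}")
print(f"Layer2 one bias GRAD: {model.layer2.bias.grad[0]}")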


In [None]:
#Q17. Why are the gradients after running loss.backward() multiple times much bigger?
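Putting the three calls together, one common way to arrange a training loop is sketched below. The model, data, and hyperparameters are stand-ins chosen for illustration. Here `zero_grad()` is called at the start of each iteration, which has the same effect as calling it after `step()`.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)                      # small stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
inputs = torch.randn(32, 4)
targets = torch.randn(32, 1)

losses = []
for step in range(20):
    optimizer.zero_grad()                          # clear gradients from the previous step
    loss = (model(inputs) - targets).pow(2).mean()
    loss.backward()                                # compute fresh gradients
    optimizer.step()                               # update the weights
    losses.append(loss.item())

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
```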



## Turning Autograd Off and On

There are situations where you will need fine-grained control over whether autograd is enabled. There are multiple ways to do this, depending on the situation.

The simplest is to change the `requires_grad` flag on a tensor directly:

In [None]:
a = torch.ones(2, 3, requires_grad=True)
print(f'a = {a}')

b1 = 2 * a
print(f'b1 = {b1}')

a.requires_grad = False
b2 = 2 * a
print(f'b2 = {b2}')


In the cell above, we see that `b1` has a `grad_fn` (i.e., a traced computation history), which is what we expect, since it was derived from a tensor, `a`, that had autograd turned on. When we turn off autograd explicitly with `a.requires_grad = False`, computation history is no longer tracked, as we see when we compute `b2`.

If you only need autograd turned off temporarily, a better way is to use the `torch.no_grad()` context manager:

In [None]:
a = torch.ones(2, 3, requires_grad=True) * 2
b = torch.ones(2, 3, requires_grad=True) * 3

c1 = a + b
print(f"c1: \n{c1}")

with torch.no_grad():
    c2 = a + b

print(f"c2: \n{c2}")

c3 = a * b
print(f"c3: \n{c3}")

In [None]:
#Q18. Can we do differentiation on computations on c1? Why?

#Q19. Can we do differentiation on computations on c2? Why?

#Q20. Can we do differentiation on computations on c3? Why?


`torch.no_grad()` can also be used as a function or method decorator:

In [None]:
def add_tensors1(x, y):
    return x + y

@torch.no_grad()
def add_tensors2(x, y):
    return x + y


a = torch.ones(2, 3, requires_grad=True) * 2
b = torch.ones(2, 3, requires_grad=True) * 3

c1 = add_tensors1(a, b)
print(c1)

c2 = add_tensors2(a, b)
print(c2)

There's a corresponding context manager, `torch.enable_grad()`, for turning autograd on when it isn't already. It may also be used as a decorator.
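A short sketch of both uses follows; the function name `doubled` is made up for illustration. Inside a `no_grad()` block, `enable_grad()` switches tracking back on, either for a nested block or for the duration of a decorated function call:

```python
import torch

x = torch.ones(3, requires_grad=True)

with torch.no_grad():
    y_off = x * 2                 # not tracked
    with torch.enable_grad():
        y_on = x * 2              # tracked again inside the inner block

print(y_off.requires_grad, y_on.requires_grad)

@torch.enable_grad()
def doubled(t):
    return t * 2

with torch.no_grad():
    z = doubled(x)                # the decorator re-enables tracking for this call
print(z.requires_grad)
```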

Finally, you may have a tensor that requires gradient tracking, but you want a copy that does not. For this we have the `Tensor` object's `detach()` method - it returns a new tensor that is *detached* from the computation history (it shares the same underlying data as the original rather than duplicating it):

In [None]:
x = torch.rand(5, requires_grad=True)
y = x.detach()

print(x)
print(y)

We did this above when we wanted to graph some of our tensors. This is because `matplotlib` expects a NumPy array as input, and the implicit conversion from a PyTorch tensor to a NumPy array is not enabled for tensors with `requires_grad=True`. Making a detached copy lets us move forward.
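The restriction can be seen by calling `numpy()` directly. In this sketch the conversion fails on the tracked tensor and succeeds on the detached one. It also checks that the detached tensor points at the same memory as the original, so `clone()` would be needed for a fully independent copy.

```python
import torch

x = torch.rand(5, requires_grad=True)

try:
    x.numpy()                         # not allowed while x requires grad
except RuntimeError as err:
    print(f"RuntimeError: {err}")

arr = x.detach().numpy()              # works on the detached tensor
print(arr)

# The detached tensor does not track history and shares storage with x.
y = x.detach()
print(y.requires_grad, y.data_ptr() == x.data_ptr())
```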
