
#A Simple Example

In [None]:
import torch

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import math

In [None]:
a = torch.linspace(0., 2. * math.pi, steps=25,
                   requires_grad=True)
print(a)

`requires_grad=True` tells autograd to record every computation that follows on this tensor: each output tensor carries the history of the operation that produced it.
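
As a minimal sketch of that behavior (the tensor names `tracked` and `plain` are illustrative, not part of the tutorial), the result of an operation on a tracked tensor carries a `grad_fn`, while the same operation on an ordinary tensor does not:

```python
import torch

tracked = torch.ones(3, requires_grad=True)  # autograd records operations on this tensor
plain = torch.ones(3)                        # default: no tracking

tracked_out = tracked * 2
plain_out = plain * 2

# The output of a tracked computation requires grad and remembers how it was made
print(tracked_out.requires_grad, tracked_out.grad_fn)
# The untracked output has no history
print(plain_out.requires_grad, plain_out.grad_fn)
```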

Next, perform a computation and plot its output in terms of its inputs:

In [None]:
b = torch.sin(a)
plt.plot(a.detach(), b.detach())

When we print `b`, we see an indicator that it is tracking its computation history:

In [None]:
print(b)

This `grad_fn` shows that when we execute the backpropagation step and compute gradients, we will need to compute the derivative of sin(x) for all this tensor's inputs.

Let's perform some more computations:

In [None]:
c = 2 * b
print(c)

In [None]:
d = c + 1
print(d)

In [None]:
out = d.sum()
print(out)

When we call `.backward()` on a tensor with no arguments, it expects the calling tensor to contain only a single element, as is the case when computing a loss function.

Each `grad_fn` stored with our tensors allows us to walk the computation all the way back to its inputs with its `next_functions` property.

In [None]:
print('d:')
print(d.grad_fn)
print(d.grad_fn.next_functions)
print(d.grad_fn.next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions)
print(d.grad_fn.next_functions[0][0].next_functions[0][0].next_functions[0][0].next_functions)

In [None]:
print('c:')
print(c.grad_fn)

In [None]:
print('b:')
print(b.grad_fn)

In [None]:
print('a:')
print(a.grad_fn)

Note that `a.grad_fn` is reported as `None`, indicating that this was an input to the function with no history of its own.
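
The manual chain of `next_functions` lookups can also be written as a small recursive helper. This is only a sketch: `walk` is a hypothetical function name, and the node class names it prints may vary between PyTorch versions.

```python
import math
import torch

def walk(fn, depth=0):
    """Collect the class names of the backward nodes reachable from fn, depth-first."""
    if fn is None:  # constants such as the scalar 2 have no backward node
        return []
    names = ["  " * depth + type(fn).__name__]
    for next_fn, _ in fn.next_functions:
        names += walk(next_fn, depth + 1)
    return names

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
d = 2 * torch.sin(a) + 1

for line in walk(d.grad_fn):
    print(line)

# Inputs created by the user are leaves; results of operations are not
print(a.is_leaf, d.is_leaf)
```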

To get the derivatives, we call the `backward()` method on the output and check the input's `grad` property to inspect the gradients:

In [None]:
out.backward()
print(a.grad)
plt.plot(a.detach(), a.grad.detach())

Be aware that only leaf nodes of the computation have their gradients stored in `.grad`. If we try to print `c.grad`, we will get back `None`.

In [None]:
print(c.grad)
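
If the gradient of an intermediate tensor is needed, one option is `retain_grad()`. The sketch below rebuilds the same computation, keeps the gradient on `c`, and checks the gradient on `a` against the analytic derivative of `sum(2 * sin(a) + 1)`, which is `2 * cos(a)`:

```python
import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)
b = torch.sin(a)
c = 2 * b
c.retain_grad()          # ask autograd to keep the gradient of this non-leaf tensor
out = (c + 1).sum()
out.backward()

print(c.grad)            # gradient of out with respect to c

# Compare the autograd result with the hand-derived derivative 2 * cos(a)
matches = torch.allclose(a.grad, 2 * torch.cos(a.detach()), atol=1e-6)
print(matches)
```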

#Autograd in Training
Let's see how autograd works over a single training batch.

First, define some constants, model, and stand-ins for inputs and outputs:

In [None]:
BATCH_SIZE = 16
DIM_IN = 1000
HIDDEN_SIZE = 100
DIM_OUT = 10

In [None]:
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Use the constants defined above instead of repeating the sizes
        self.layer1 = torch.nn.Linear(DIM_IN, HIDDEN_SIZE)
        self.relu = torch.nn.ReLU()
        self.layer2 = torch.nn.Linear(HIDDEN_SIZE, DIM_OUT)

    def forward(self, x):
        x = self.layer1(x)
        x = self.relu(x)
        x = self.layer2(x)
        return x

In [None]:
some_input = torch.randn(BATCH_SIZE, DIM_IN,
                         requires_grad=False)
ideal_output = torch.randn(BATCH_SIZE, DIM_OUT,
                           requires_grad=False)

model = TinyModel()

Within a subclass of `torch.nn.Module`, it is assumed that we want to track gradients on the layers' weights for learning.
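
A quick way to confirm this is to list a module's parameters. The sketch below uses a small stand-alone `torch.nn.Linear` layer (an illustrative stand-in for the model) and prints each parameter's `requires_grad` flag and its not-yet-computed gradient:

```python
import torch

layer = torch.nn.Linear(4, 2)

for name, param in layer.named_parameters():
    # Parameters track gradients by default; .grad stays None until backward() runs
    print(name, tuple(param.shape), param.requires_grad, param.grad)
```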

To inspect a layer of the model, we can examine a slice of its weights and verify that no gradients have been computed yet:

In [None]:
print(model.layer2.weight[0][:10]) # just a small slice

In [None]:
print(model.layer2.weight.grad)

For the loss function, we just use the square of the Euclidean distance between `prediction` and `ideal_output`, and we use basic SGD as the optimizer:

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

prediction = model(some_input)

loss = (ideal_output - prediction).pow(2).sum()
print(loss)

Now call `loss.backward()` and see what happens:

In [None]:
loss.backward()
print(model.layer2.weight[0][:10])

In [None]:
print(model.layer2.weight.grad[0][:10])

The gradients have been computed for each learning weight, but the weights remain unchanged, because we have not run the optimizer yet.
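
For plain SGD without momentum or weight decay, the update the optimizer applies is `weight - lr * grad`. The following sketch, using a small stand-in layer and an arbitrary learning rate rather than the tutorial's model, checks that rule against `optimizer.step()`:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(3, 1)
lr = 0.1  # arbitrary value for illustration
optimizer = torch.optim.SGD(layer.parameters(), lr=lr)

loss = layer(torch.ones(2, 3)).pow(2).sum()
loss.backward()

# Compute the expected new weights by hand before the optimizer touches them
expected = layer.weight.detach() - lr * layer.weight.grad
optimizer.step()

step_matches = torch.allclose(layer.weight.detach(), expected)
print(step_matches)
```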

In [None]:
optimizer.step()
print(model.layer2.weight[0][:10])
# layer2 weights have changed

In [None]:
print(model.layer2.weight.grad[0][:10])

After calling `optimizer.step()`, we need to call `optimizer.zero_grad()`, or else every time we run `loss.backward()`, the gradients on the learning weights will accumulate:

In [None]:
print(model.layer2.weight.grad[0][:10])

for i in range(5):
    prediction = model(some_input)
    loss = (ideal_output - prediction).pow(2).sum()
    loss.backward()
print(model.layer2.weight.grad[0][:10])

In [None]:
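# Recent PyTorch versions set .grad to None by default;
# set_to_none=False keeps zero-filled tensors so the slice below still works
optimizer.zero_grad(set_to_none=False)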
print(model.layer2.weight.grad[0][:10])

Failing to zero the gradients before running our next training batch will cause the gradients to blow up in this manner, causing incorrect and unpredictable learning results.
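
Put together, a typical loop zeroes the gradients once per batch. This is a minimal sketch with a stand-in linear model, random data, and an arbitrary learning rate, not the tutorial's `TinyModel`:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)
inputs = torch.randn(16, 8)
targets = torch.randn(16, 2)

losses = []
for _ in range(20):
    optimizer.zero_grad()                          # clear gradients from the previous step
    loss = (targets - model(inputs)).pow(2).sum()  # same squared-distance loss as above
    loss.backward()                                # compute fresh gradients
    optimizer.step()                               # apply the update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Calling `zero_grad()` at the top of the loop, rather than after `step()`, is an equally common placement; what matters is that it runs once between consecutive `backward()` calls.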

#Turning Autograd off and on

The simplest way is to change the `requires_grad` flag on a tensor directly:

In [None]:
a = torch.ones(2, 3, requires_grad=True)
print(a)

In [None]:
b1 = 2 * a
print(b1)

In [None]:
a.requires_grad = False
b2 = 2 * a
print(b2)

If we only need autograd turned off temporarily, a better way is to use `torch.no_grad()`:

In [None]:
a = torch.ones(2,3, requires_grad=True) * 2
b = torch.ones(2,3, requires_grad=True) * 3

c1 = a + b
print(c1)

In [None]:
with torch.no_grad():
    c2 = a + b
print(c2)

In [None]:
c3 = a * b
print(c3)

`torch.no_grad()` can also be used as a function or method decorator:

In [None]:
def add_tensor1(x, y):
    return x + y

@torch.no_grad()
def add_tensor2(x, y):
    return x + y

a = torch.ones(2,3, requires_grad=True) * 2
b = torch.ones(2,3, requires_grad=True) * 3

In [None]:
c1 = add_tensor1(a, b)
print(c1)

In [None]:
c2 = add_tensor2(a, b)
print(c2)

There is a corresponding context manager, `torch.enable_grad()`, for turning autograd on when it is not already. It may also be used as a decorator.
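
As a short sketch of how the two context managers interact, `torch.enable_grad()` can re-enable tracking inside a `torch.no_grad()` block:

```python
import torch

x = torch.ones(2, 3, requires_grad=True)

with torch.no_grad():
    y = x * 2                  # computed without tracking
    with torch.enable_grad():
        z = x * 2              # tracking is switched back on for this block

print(y.requires_grad, z.requires_grad)  # False True
```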

If we have a tensor that requires gradient tracking but we want a version that does not, we can use the `Tensor` object's `detach()` method. It returns a new tensor that shares the same data but is *detached* from the computation history:

In [None]:
x = torch.rand(5, requires_grad=True)
y = x.detach()

print(x)
print(y)
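
One detail worth checking is whether the detached tensor owns its data. In the sketch below, `data_ptr()` suggests that `detach()` alone shares storage with the original, while adding `clone()` gives an independent copy (the name `z` is illustrative):

```python
import torch

x = torch.rand(5, requires_grad=True)
y = x.detach()           # no history, same underlying storage as x
z = x.detach().clone()   # no history, separate copy of the data

print(y.requires_grad, y.grad_fn)
print(y.data_ptr() == x.data_ptr())
print(z.data_ptr() == x.data_ptr())
```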

#Autograd and In-place Operations
Autograd needs the intermediate values of a computation to perform gradient computations. For this reason, we must be careful about using in-place operations when using autograd.

In [None]:
a = torch.linspace(0., 2. * math.pi, steps=25,
                   requires_grad=True)
torch.sin_(a)  # raises RuntimeError: in-place operation on a leaf tensor that requires grad
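
The sketch below catches that error so the cell can keep running, then falls back to the out-of-place `torch.sin()`, which leaves `a` untouched. The `raised` flag is only there for illustration:

```python
import math
import torch

a = torch.linspace(0., 2. * math.pi, steps=25, requires_grad=True)

raised = False
try:
    torch.sin_(a)            # in-place: would overwrite values autograd may need
except RuntimeError as err:
    raised = True
    print("in-place op rejected:", err)

b = torch.sin(a)             # out-of-place: a keeps its original values
print(raised, b.requires_grad)
```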

#Autograd Profiler
The computation history combined with timing information makes for a handy profiler, and autograd has that feature built in:

In [None]:
device = torch.device('cpu')
run_on_gpu = False
if torch.cuda.is_available():
    device = torch.device('cuda')
    run_on_gpu = True

x = torch.rand(2,3, requires_grad=True)
y = torch.rand(2,3, requires_grad=True)
z = torch.ones(2,3, requires_grad=True)

In [None]:
with torch.autograd.profiler.profile(use_cuda=run_on_gpu) as prf:
    for _ in range(1000):
        z = (z / x) * y

print(prf.key_averages().table(sort_by='self_cpu_time_total'))

This profiler can also label individual sub-blocks of code, break out the data by input tensor shape, and export data as a Chrome tracing tools file.
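
A minimal CPU-only sketch of the first two features, assuming `record_function` for labeling and `record_shapes` with `group_by_input_shape` for the per-shape breakdown. The label `divide_multiply` is a made-up name:

```python
import torch

x = torch.rand(2, 3)
y = torch.rand(2, 3)
z = torch.ones(2, 3)

with torch.autograd.profiler.profile(record_shapes=True) as prf:
    # Everything inside this block is grouped under one user-chosen label
    with torch.autograd.profiler.record_function("divide_multiply"):
        for _ in range(100):
            z = (z / x) * y

# Group the averaged events by the shapes of their input tensors
averages = prf.key_averages(group_by_input_shape=True)
print(averages.table(sort_by="self_cpu_time_total", row_limit=10))
```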

#More Autograd Detail

`torch.autograd` is an engine for computing gradients. The `backward()` call can also take an optional vector input. This vector represents a set of gradients over the tensor, which are multiplied by the Jacobian of the autograd-traced tensor that precedes it:

In [None]:
x = torch.rand(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2
print(y)

If we try to call `y.backward()`, we will get a runtime error:

In [None]:
y.backward()  # raises RuntimeError: grad can be implicitly created only for scalar outputs

For a multi-dimensional output, autograd expects us to provide gradients for those three outputs that it can multiply into the Jacobian:

In [None]:
v = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float) # stand-in for gradients
y.backward(v)
print(x.grad)
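
To make the vector-Jacobian product concrete, here is a sketch with a fixed scale factor instead of the loop above. For `y = 4 * x` the Jacobian is 4 times the identity, so `x.grad` should equal `4 * v`:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x * 4                                   # Jacobian dy/dx is 4 * identity
v = torch.tensor([0.1, 1.0, 0.0001])        # stand-in for upstream gradients

y.backward(v)                               # accumulates v^T J into x.grad
print(x.grad)
print(torch.allclose(x.grad, 4 * v))
```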

#The High-Level API
There is an API on autograd that gives us direct access to important differential matrix and vector operations.

To calculate the Jacobian of a simple function, evaluated for two single-element inputs:

In [None]:
def exp_adder(x,y):
    return 2 * x.exp() + 3 * y

inputs = (torch.rand(1), torch.rand(1)) # arguments for the function
print('Inputs:')
print(inputs)
print('\nJacobian:')
torch.autograd.functional.jacobian(exp_adder, inputs)

We can also do this with higher-order tensors:

In [None]:
inputs = (torch.rand(3), torch.rand(3))
print('Inputs:')
print(inputs)
print('\nJacobian:')
torch.autograd.functional.jacobian(exp_adder, inputs)
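
Because `exp_adder` acts element-wise, its Jacobians can be written by hand: `diag(2 * exp(x))` with respect to `x` and `3 * I` with respect to `y`. The sketch below compares them with what `jacobian()` returns:

```python
import torch

def exp_adder(x, y):
    return 2 * x.exp() + 3 * y

x, y = torch.rand(3), torch.rand(3)
jac_x, jac_y = torch.autograd.functional.jacobian(exp_adder, (x, y))

# Each Jacobian is a 3x3 matrix; off-diagonal entries are zero for element-wise functions
print(jac_x.shape, jac_y.shape)
print(torch.allclose(jac_x, torch.diag(2 * x.exp())))
print(torch.allclose(jac_y, 3 * torch.eye(3)))
```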

We can also compute the Hessian with the `torch.autograd.functional.hessian()` function:

In [None]:
inputs = (torch.rand(1), torch.rand(1))
print('Inputs:')
print(inputs)
print('\nHessian:')
torch.autograd.functional.hessian(exp_adder, inputs)

There is also a function to directly compute the vector-Jacobian product:

In [None]:
def do_some_doubling(x):
    y = x * 2
    while y.data.norm() < 1000:
        y = y * 2
    return y

inputs = torch.randn(3)
my_gradients = torch.tensor([0.1, 1., 0.0001])
torch.autograd.functional.vjp(do_some_doubling, inputs, v=my_gradients)

The `torch.autograd.functional.jvp()` method performs the same matrix multiplication as `vjp()` with the operands reversed. The `vhp()` and `hvp()` methods do the same for a vector-Hessian product.
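
The difference in operand order shows up clearly with a non-symmetric Jacobian. In this sketch, `linear` is a hypothetical function whose Jacobian is the matrix `A`, so `vjp()` picks out a row of `A` and `jvp()` picks out a column:

```python
import torch
from torch.autograd.functional import jvp, vjp

A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])

def linear(x):
    return A @ x              # the Jacobian of this function is A itself

x = torch.zeros(2)
v = torch.tensor([1.0, 0.0])

_, vjp_result = vjp(linear, x, v)   # v^T A: first row of A
_, jvp_result = jvp(linear, x, v)   # A v: first column of A

print(vjp_result)
print(jvp_result)
```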