# Autograd Mechanism

This note presents an overview of how autograd works and records operations.

## Excluding subgraphs from backward

Every Tensor has a flag, `requires_grad`, that allows fine-grained exclusion of subgraphs from gradient computation, which can increase efficiency.

### `requires_grad`

If any single input to an operation requires gradient, its output will also require gradient. Conversely, the output won't require gradient only if none of the inputs require it. Backward computation is never performed in subgraphs where no Tensor requires gradient.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F

In [3]:
x = torch.rand(5, 5)  # requires_grad=False by default
y = torch.rand(5, 5)
z = torch.rand(5, 5, requires_grad=True)

In [4]:
a = x + y
a.requires_grad

False

In [5]:
b = a + z
b.requires_grad

True

This is especially useful when you want to freeze part of your model, or you know in advance that you're not going to use gradients w.r.t. some parameters. For example, if you want to fine-tune a pretrained CNN, it's enough to switch the `requires_grad` flags in the frozen base, and no intermediate buffers will be saved until the computation gets to the last layer, where the affine transform will use weights that require gradient, and the output of the network will also require it.

In [7]:
import torch.optim as optim

In [6]:
import torchvision

In [None]:
model = torchvision.models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False
    
# Replace the last fully-connected layer
# Parameters of newly constructed modules have requires_grad=True by default
model.fc = nn.Linear(512, 100)

# Optimize only the classifier
optimizer = optim.SGD(model.fc.parameters(), lr=1e-2, momentum=0.9)
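
The effect of freezing can be checked without downloading any weights. The following is a minimal sketch, assuming a small hypothetical two-part model (`base` and `head` are illustrative names, not part of torchvision) in place of the ResNet above. After one optimization step, the frozen parameters should have no gradient and should be unchanged.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in for a pretrained base network.
base = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
for param in base.parameters():
    param.requires_grad = False  # freeze the base

# Newly constructed module: its parameters require gradient by default.
head = nn.Linear(16, 4)
model = nn.Sequential(base, head)

# Optimize only the head, as in the ResNet example.
optimizer = optim.SGD(head.parameters(), lr=1e-2, momentum=0.9)

inputs = torch.rand(3, 8)
targets = torch.randint(0, 4, (3,))

# Snapshot the frozen parameters so we can compare after the step.
frozen_before = [p.detach().clone() for p in base.parameters()]

loss = nn.functional.cross_entropy(model(inputs), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Frozen parameters never receive a gradient and are not updated.
frozen_grads = [p.grad for p in base.parameters()]
base_unchanged = all(
    torch.equal(before, p) for before, p in zip(frozen_before, base.parameters())
)
print("frozen grads are None:", all(g is None for g in frozen_grads))
print("base unchanged:", base_unchanged)
print("head has grad:", head.weight.grad is not None)
```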


## How autograd encodes the history

Autograd is a reverse automatic differentiation system. Conceptually, autograd records a graph of all the operations that created the data as you execute them, giving you a directed acyclic graph whose leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule.

Internally, autograd represents this graph as a graph of `Function` objects (really expressions), which can be `apply()`-ed to compute the result of evaluating the graph. When computing the forward pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradient (the `.grad_fn` attribute of each `torch.Tensor` is an entry point into this graph). When the forward pass is completed, we evaluate this graph in the backward pass to compute the gradients.
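
As a small illustration of the `.grad_fn` entry point, the sketch below builds a tiny graph and inspects it. The tensor names are arbitrary, and the exact class names of the backward nodes printed here may differ between PyTorch versions.

```python
import torch

x = torch.rand(3, requires_grad=True)  # leaf tensor created by the user
y = x * 2                              # intermediate result
z = y.sum()                            # root of the graph

# Leaves have no grad_fn; results of recorded operations do.
print("x.is_leaf:", x.is_leaf, "x.grad_fn:", x.grad_fn)
print("z.grad_fn:", type(z.grad_fn).__name__)

# next_functions links each backward node to the nodes of its inputs.
print("z.grad_fn.next_functions:", z.grad_fn.next_functions)

# Evaluate the recorded graph from root to leaves.
z.backward()
print("x.grad:", x.grad)  # d(sum(2 * x))/dx is 2 for every element
```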

**An important thing to note is that the graph is recreated from scratch at every iteration, and this is exactly what allows for using arbitrary Python control flow statements that can change the overall shape and size of the graph at every iteration. You don't have to encode all possible paths before you launch the training: what you run is what you differentiate.**
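
To make this concrete, here is a minimal sketch in which the recorded graph depends on ordinary Python control flow. The function `forward` and its `n_steps` argument are hypothetical names chosen for illustration.

```python
import torch

def forward(x, n_steps):
    # The loop length decides how many multiplications are recorded.
    out = x
    for _ in range(n_steps):
        out = out * x
    # A data-dependent branch: only recorded when it is actually taken.
    if out.item() > 10:
        out = out * 0.5
    return out

x = torch.tensor(2.0, requires_grad=True)

# One step: out = x**2, so the gradient is 2 * x = 4.
(grad_short,) = torch.autograd.grad(forward(x, 1), x)

# Three steps: out = 0.5 * x**4 (branch taken), so the gradient is 2 * x**3 = 16.
(grad_long,) = torch.autograd.grad(forward(x, 3), x)

print(grad_short.item(), grad_long.item())
```

Each call builds a fresh graph, so the two gradients come from two differently shaped graphs over the same input.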

## In-place operations with autograd

Supporting in-place operations in autograd is a hard matter, and we discourage their use in most cases. Autograd's aggressive buffer freeing and reuse make it very efficient, and there are very few occasions when in-place operations actually lower memory usage by any significant amount. Unless you're operating under heavy memory pressure, you might never need to use them.

There are two main reasons that limit the applicability of in-place operations:

1. In-place operations can potentially overwrite values required to compute gradients.
2. Every in-place operation actually requires the implementation to rewrite the computational graph. Out-of-place versions simply allocate new objects and keep references to the old graph, while in-place operations require changing the creator of all inputs to the `Function` representing this operation. This can be tricky, especially if there are many Tensors that reference the same storage, and in-place functions will actually raise an error if the storage of modified inputs is referenced by any other `Tensor`.
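
The first reason can be sketched as follows. The example relies on `exp()` keeping its output for the backward pass (since the derivative of `exp(a)` is `exp(a)`), so overwriting that output in place is expected to make `backward()` fail. The variable names are illustrative only.

```python
import torch

a = torch.rand(3, requires_grad=True)

# exp() needs its own output to compute the gradient.
b = a.exp()
# Overwrite that output in place.
b.add_(1)

try:
    b.sum().backward()
    overwrite_error = None
except RuntimeError as err:
    overwrite_error = err
print("backward after in-place overwrite raised:", overwrite_error is not None)

# The out-of-place version allocates a new tensor and leaves the
# value needed by exp() untouched, so backward succeeds.
c = a.exp()
d = c + 1
d.sum().backward()
print("gradient matches exp(a):", torch.allclose(a.grad, a.detach().exp()))
```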

## In-place correctness checks

Every tensor keeps a version counter that is incremented every time it is marked dirty in any operation. When a `Function` saves any tensors for backward, the version counter of their containing Tensor is saved as well. Once you access `self.saved_tensors`, it is checked, and if it is greater than the saved value, an error is raised. This ensures that if you're using in-place functions and not seeing any errors, you can be sure that the computed gradients are correct.
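
The counter can be observed directly. This sketch reads the `_version` attribute, which is an internal detail of `torch.Tensor` rather than a documented public API, so treat it as a debugging aid under that assumption.

```python
import torch

t = torch.zeros(3)
v0 = t._version  # version of a freshly created tensor

t.add_(1)        # in-place operation marks the tensor dirty
v1 = t._version

t.mul_(2)        # another in-place operation
v2 = t._version

u = t + 1        # out-of-place: t itself is not modified
v3 = t._version

print("versions:", v0, v1, v2, v3)
```

The counter grows with each in-place operation and stays the same after the out-of-place addition.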