# 2.1 Automatic Differentiation in PyTorch

In Section 1.3, we described the computation graph as a “chain of responsibility”: if the loss takes on a certain value, we can trace backward along the chain to determine how much each parameter is “responsible.” In this section, we switch to a more engineering-oriented perspective: How does a deep learning framework automatically build this chain of responsibility, and compute gradients when needed?

Let’s state the problem more plainly: during training, what we need are gradients. But what we actually write is just code — additions, multiplications, convolutions, activation functions… These operations execute line by line during the forward pass and eventually produce a `loss`. So where do the gradients come from? Does the framework symbolically derive one gigantic expression?

Of course not. What deep learning frameworks actually do is more like this:

- During the forward pass, they keep track of what operations you performed, who depends on whom, and what the intermediate results are.
- During the backward pass, they trace back along this “ledger”: starting from the `loss`, they go backward, and whenever they encounter an operation, they use its own “local derivative rule” to pass the gradient further down.

Understanding this mechanism is crucial. It not only explains “where the gradients come from,” but also directly impacts many phenomena we will encounter later: such as why gradients accumulate, why intermediate variables don’t have a `.grad` attribute by default, why some operations can cut off the gradient chain, and why there’s always a trade-off between memory and computation.
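The two bullet points above can be sketched in a few lines of plain Python. This is a toy illustration, not how PyTorch is implemented: the `Var` class, the `sin` helper, and the `backward` function are hypothetical names introduced only for this example, and the sketch supports nothing beyond scalar multiplication and sine.

```python
import math


class Var:
    """Toy scalar that remembers how it was produced (hypothetical, not a PyTorch class)."""

    def __init__(self, value, parents=()):
        self.value = value
        # The "ledger": pairs of (input Var, local derivative w.r.t. that input).
        self.parents = parents
        self.grad = 0.0

    def __mul__(self, other):
        # d(a*b)/da = b and d(a*b)/db = a
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(v):
    # d sin(v)/dv = cos(v)
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))


def backward(out):
    # Build an ordering in which every node comes after all of its inputs.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(out)
    out.grad = 1.0  # d(out)/d(out) = 1
    # Walk the ordering in reverse, so each node is handled before its inputs.
    for node in reversed(order):
        for parent, local in node.parents:
            # Chain rule: incoming gradient times local derivative, accumulated.
            parent.grad += node.grad * local


x, y = Var(2.0), Var(4.0)
z = sin(x * y)
backward(z)
print('dz/dx:', x.grad)  # analytically y * cos(x * y)
print('dz/dy:', y.grad)  # analytically x * cos(x * y)
```

Each operation only knows its own local derivative; the backward loop is what chains them together.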


In [1]:
import torch
import torch.autograd.functional as AF

## 2.1.1 A Computation Graph is Not Drawn — It is Executed

To understand automatic differentiation in PyTorch, the best approach is not to memorize definitions first, but to observe one simple fact: you are only performing forward computation — yet the computation graph is automatically constructed during execution.

Suppose we have a simple function:

$$ z = \sin(x \cdot y) $$

We can decompose it into two basic steps:

1. Compute the dot product: $q = x \cdot y$
2. Apply the sine function: $z = \sin(q)$

Now we tell PyTorch that we want to compute the gradients of `z` with respect to `x` and `y`.

When we create tensors with:


In [2]:
x = torch.arange(1.0, 5.0, requires_grad=True)
y = torch.arange(5.0, 9.0, requires_grad=True)

The argument `requires_grad=True` can be understood as a declaration: these variables need to be “held accountable.” From that point on, any result computed from them will itself require gradients, and PyTorch will internally record:

- Which operation produced this tensor?
- Which tensors did it depend on?
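As a small sketch of how this flag propagates, the snippet below compares results that depend on a tracked tensor with results that do not. The tensor names are illustrative; `detach()` and `torch.no_grad()` are shown as two standard ways to stop recording, which is one reading of what "cutting off the gradient chain" means in practice.

```python
import torch

a = torch.ones(3, requires_grad=True)
b = torch.ones(3)  # requires_grad defaults to False

tracked = a * b          # depends on `a`, so it is tracked
untracked = b * b        # no input requires gradients
cut = (a * b).detach()   # detach() returns a tensor cut off from the graph
with torch.no_grad():    # operations inside this block are not recorded
    silent = a * b

print('tracked:', tracked.requires_grad, tracked.grad_fn is not None)
print('untracked:', untracked.requires_grad)
print('cut:', cut.requires_grad, '| silent:', silent.requires_grad)
```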

Now we perform two ordinary forward computations:


In [3]:
q = torch.dot(x, y)
z = torch.sin(q)
print('z.requires_grad:', z.requires_grad)

z.requires_grad: True


At the surface level, this still looks like pure numerical computation. But internally, PyTorch has already done two important things:

- `z` automatically becomes a tensor that requires gradients (because it depends on `x` and `y`, which require gradients).
- The creation process of `q` and `z` is recorded: `z` is produced by a `sin` operation, `q` is produced by a `dot` operation, and `q` depends on `x` and `y`.

At this stage, don’t worry about what the computation graph “looks like.” Instead, observe a more intuitive phenomenon: before you explicitly trigger backpropagation, gradients do not magically appear.

If we inspect:


In [4]:
print('x.grad:', x.grad)
print('y.grad:', y.grad)

x.grad: None
y.grad: None


we will see that the gradients are not zero but `None`. This makes sense.

Gradients are the result of a backward traversal. Only when you explicitly initiate that traversal (for example, by calling `backward()`), will PyTorch follow the recorded dependency structure, compute gradients, and write them back to the leaf nodes.

If you do not call `backward()`, PyTorch will not compute gradients — and therefore nothing will be populated.


## 2.1.2 What Does `backward()` Actually Do?

In the previous section, we only performed forward computation, but PyTorch had already silently recorded the dependency structure. Now the real question is: when we call `backward()`, what exactly does the framework do? And can we trust the gradients it computes?

We continue with the same example:

$$ q = x^\top y, \quad z = \sin(q) $$

If we compute the gradients manually, we obtain:

$$ \frac{\partial z}{\partial x} = \frac{\partial z}{\partial q} \cdot \frac{\partial q}{\partial x} = \cos(q) \cdot y $$
$$ \frac{\partial z}{\partial y} = \frac{\partial z}{\partial q} \cdot \frac{\partial q}{\partial y} = \cos(q) \cdot x $$

Now let PyTorch do the computation. We initiate backpropagation from the output `z`:


In [5]:
z.backward()
print('x.grad:', x.grad)
print('y.grad:', y.grad)

x.grad: tensor([3.1666, 3.7999, 4.4332, 5.0666])
y.grad: tensor([0.6333, 1.2666, 1.9000, 2.5333])


After this call, `.grad` is no longer `None`. The gradients have been written back to the leaf tensors `x` and `y`. Intuitively, you can think of `backward()` like this:

1. Start from `z`, and assume by default that $\frac{\partial z}{\partial z} = 1$;
2. Follow the dependency chain recorded during the forward pass, but in reverse;
3. At each operator node, apply its **local derivative rule**, and propagate the gradient upstream.

We can verify that PyTorch’s result matches our manual derivation:


In [6]:
assert torch.allclose(x.grad, y * x.dot(y).cos())
assert torch.allclose(y.grad, x * x.dot(y).cos())

At this point, the core idea of automatic differentiation should be clear: the framework does not derive one giant symbolic expression. It only needs to know how to differentiate each primitive operation locally, and then chain these local rules together according to the computation graph.
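An independent way to check whether the gradients can be trusted is to compare them with central finite differences, which use only forward evaluations. The sketch below assumes double precision and a step size `eps = 1e-6`; both are choices made for this example, as is the helper name `f`.

```python
import torch


def f(x, y):
    return torch.sin(torch.dot(x, y))


x = torch.arange(1.0, 5.0, dtype=torch.float64, requires_grad=True)
y = torch.arange(5.0, 9.0, dtype=torch.float64)
f(x, y).backward()  # autograd result lands in x.grad

eps = 1e-6
numeric = torch.zeros_like(x)
with torch.no_grad():  # finite differences need no graph
    for i in range(x.numel()):
        step = torch.zeros_like(x)
        step[i] = eps
        # Central difference along coordinate i.
        numeric[i] = (f(x + step, y) - f(x - step, y)) / (2 * eps)

print('autograd:', x.grad)
print('numeric: ', numeric)
print('match:', torch.allclose(x.grad, numeric, atol=1e-6))
```

Finite differences need two forward evaluations per input coordinate, which is why they serve as a check rather than as a training mechanism.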

**Looking Inside the Graph: `grad_fn`**

If we dig a little deeper, PyTorch exposes part of this backward chain:


In [7]:
print('z.grad_fn:', z.grad_fn.name())
print('q.grad_fn:', q.grad_fn.name())
print('x.grad_fn:', x.grad_fn)
print('y.grad_fn:', y.grad_fn)

z.grad_fn: SinBackward0
q.grad_fn: DotBackward0
x.grad_fn: None
y.grad_fn: None


You will see names such as `SinBackward0` and `DotBackward0`. These names can be roughly interpreted as:

- `z` is not created out of thin air; it is produced by an operator (here, `sin`).
- `grad_fn` is the corresponding backward function object for that operator.

During backpropagation, PyTorch starts from the root node and calls each node’s backward function in sequence:

- When you call `z.backward()`, PyTorch first invokes `SinBackward0`, which multiplies the incoming gradient (1 by default) by the local derivative $\cos(q)$ to obtain $\frac{\partial z}{\partial q}$.
- It then passes this value to `DotBackward0`, which multiplies it by the local derivatives $\frac{\partial q}{\partial x} = y$ and $\frac{\partial q}{\partial y} = x$.
- Finally, the gradients $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$ are accumulated into the leaf tensors.

Leaf tensors (`x`, `y`) do not have `grad_fn` because they are the starting points of the graph — there is nothing upstream to differentiate.

**`next_functions`: Who is Upstream?**

Each backward node also keeps track of its upstream dependencies:


In [8]:
node_q = z.grad_fn.next_functions[0][0]
node_x = node_q.next_functions[0][0]
node_y = node_q.next_functions[1][0]
print('grad_fn of z.child -> q:', node_q.name())
print('grad_fn of q.child -> x:', node_x.name())
print('grad_fn of q.child -> y:', node_y.name())

grad_fn of z.child -> q: DotBackward0
grad_fn of q.child -> x: struct torch::autograd::AccumulateGrad
grad_fn of q.child -> y: struct torch::autograd::AccumulateGrad


These connections describe where the backward pass should go next. For example, `SinBackward0` points to `DotBackward0`, and `DotBackward0` points to two special nodes called `AccumulateGrad`.

`AccumulateGrad` is a special node type. Each leaf tensor that requires gradients has an associated `AccumulateGrad` node. Its job is simple: take the computed gradient and accumulate it into the tensor’s `.grad` attribute. That is why `x.grad` and `y.grad` become populated only after calling `backward()`.
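Rather than indexing `next_functions` by hand, the whole backward graph can be listed with a short recursive walk. The `walk` helper below is a hypothetical name for this sketch; it relies only on the `name()` method and `next_functions` attribute used above. The exact node names may differ slightly between PyTorch versions and platforms.

```python
import torch


def walk(fn, depth=0):
    """Print the backward graph reachable from `fn`, one node per line."""
    if fn is None:  # inputs that do not require gradients have no node
        return []
    print('  ' * depth + fn.name())
    names = [fn.name()]
    for next_fn, _ in fn.next_functions:
        names += walk(next_fn, depth + 1)
    return names


x = torch.arange(1.0, 5.0, requires_grad=True)
y = torch.arange(5.0, 9.0, requires_grad=True)
z = torch.sin(torch.dot(x, y))
names = walk(z.grad_fn)
```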


## 2.1.3 Why Non-Scalar Outputs Cannot Call `backward()` Directly

In the previous example, `z` was a scalar. That is why we could confidently write `z.backward()`. However, many people will immediately encounter a seemingly unreasonable limitation the first time they replace a scalar output with a vector or matrix:


In [9]:
x = torch.arange(1.0, 5.0, requires_grad=True)
y = torch.arange(5.0, 9.0, requires_grad=True)
Z = torch.outer(x, y)
try:
    Z.backward()  # This will raise an error because Z is not a scalar
except RuntimeError as err:
    print('RuntimeError:', err)

RuntimeError: grad can be implicitly created only for scalar outputs


This is not PyTorch being unreasonable. The reason is that the starting point of backpropagation is no longer unique when the output is not a scalar.

For a scalar `z`, we usually care about $\frac{\partial z}{\partial x}$ and $\frac{\partial z}{\partial y}$. The backward pass starts from the output, and the first step is to set $\frac{\partial z}{\partial z} = 1$. This step is reasonable because the unit gradient for a scalar output is unambiguous: we want to backpropagate along the direction of `z`.

But what if the output is a vector or matrix `Z`? What exactly do we want?

- Do we want the gradient of each element of `Z` with respect to `x` and `y`? That would be a higher-order tensor.
- Or do we want some scalar function, such as the sum, mean, or a weighted sum of `Z`, with respect to `x` and `y`?

In other words, for non-scalar outputs, backpropagation must first answer one question: from which “direction” do we propagate the gradient?
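To make the first option concrete, the sketch below materializes the full derivative of `Z` with `torch.autograd.functional.jacobian` and then collapses it by contracting the output axes with an all-ones tensor. The all-ones weighting is just one possible choice of direction, picked for illustration. Building the full Jacobian is only practical at toy sizes like this one.

```python
import torch
import torch.autograd.functional as AF

x = torch.arange(1.0, 5.0)
y = torch.arange(5.0, 9.0)

# jacobian() returns one tensor per input, shaped output_shape + input_shape.
J_x, J_y = AF.jacobian(torch.outer, (x, y))
print('J_x.shape:', J_x.shape)  # entry [i, j, k] is dZ[i, j] / dx[k]

# Contract the two output axes with a weighting tensor of the output's shape.
weights = torch.ones(4, 4)
grad_x = torch.einsum('ij,ijk->k', weights, J_x)
grad_y = torch.einsum('ij,ijk->k', weights, J_y)
print('grad_x:', grad_x)
print('grad_y:', grad_y)
```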

Mathematically, this “direction” is a tensor `v` with the same shape as the output:

$$ v = \frac{\partial L}{\partial Z} $$

Then PyTorch actually computes a vector–Jacobian product (VJP):

$$ \frac{\partial L}{\partial x} = v^\top \left(\frac{\partial Z}{\partial x}\right) $$

For a scalar output, `v` defaults to 1, which amounts to taking $L = Z$; this is why `backward()` can be called without arguments in that case. For non-scalar outputs, we must provide `v` ourselves.

There are two ways to write this.

One way is to explicitly pass `gradient`, indicating the direction along which we want to propagate:


In [10]:
x = torch.arange(1.0, 5.0, requires_grad=True)
y = torch.arange(5.0, 9.0, requires_grad=True)
Z = torch.outer(x, y)
Z.backward(gradient=torch.ones_like(Z))
print('x.grad:', x.grad)
print('y.grad:', y.grad)

x.grad: tensor([26., 26., 26., 26.])
y.grad: tensor([10., 10., 10., 10.])


Here, `torch.ones_like(Z)` means that

$$ \frac{\partial L}{\partial Z_{i,j}} = 1. $$

So passing an all-ones gradient is equivalent to defining

$$ L = \sum_{i,j} Z_{i,j} $$

and then calling `backward()`.

Another way is to first convert `Z` into a scalar and then call `backward()`:


In [11]:
x = torch.arange(1.0, 5.0, requires_grad=True)
y = torch.arange(5.0, 9.0, requires_grad=True)
Z = torch.outer(x, y)
Z = torch.sum(Z)
Z.backward()
print('x.grad:', x.grad)
print('y.grad:', y.grad)

x.grad: tensor([26., 26., 26., 26.])
y.grad: tensor([10., 10., 10., 10.])


In many cases, these two approaches are equivalent. Either we explicitly tell PyTorch along which direction to propagate gradients, or we first reduce the output to a scalar (for example, by summing), so that it implicitly propagates along that scalar direction.
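The equivalence is not limited to all-ones weights. As a sketch, the snippet below uses an arbitrary weighting tensor `v` (random here, purely for illustration) and checks that passing it as `gradient` gives the same result as reducing with the weighted sum $L = \sum_{i,j} v_{i,j} Z_{i,j}$.

```python
import torch

torch.manual_seed(0)
v = torch.rand(4, 4)  # an arbitrary "direction" with the same shape as Z

x = torch.arange(1.0, 5.0, requires_grad=True)
y = torch.arange(5.0, 9.0, requires_grad=True)

# Option 1: pass the direction explicitly.
torch.outer(x, y).backward(gradient=v)
grad_x_explicit, grad_y_explicit = x.grad.clone(), y.grad.clone()

# Reset the accumulated gradients before the second computation.
x.grad, y.grad = None, None

# Option 2: reduce to a scalar with the same weights, then call backward().
(v * torch.outer(x, y)).sum().backward()

print(torch.allclose(grad_x_explicit, x.grad))
print(torch.allclose(grad_y_explicit, y.grad))
```

For this particular function the result can also be written by hand: the gradient with respect to `x` is `v @ y`, and the gradient with respect to `y` is `v.T @ x`.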


## 2.1.4 Higher-Order Derivatives: Making the Differentiation Process Itself Part of the Computation

Up to this point, what we have computed are first-order gradients: given a scalar output (or something that can be converted into a scalar) $L$, we compute $\nabla_x L$ and $\nabla_y L$. But sometimes we need higher-order information, such as second derivatives (certain directions of the Hessian) or curvature, for instance in some regularization terms.

The key point here is: if you want to differentiate a “gradient”, then the process of computing that gradient must itself be differentiable. This is the meaning of `create_graph=True`. When computing first-order derivatives, we not only compute numerical values, but also record the process of computing those derivatives as a new computation graph.

At this point, many people may ask: why not use `backward()`? Because the design goal of `backward()` is training the model. It accumulates gradients into the `.grad` attributes of leaf tensors and, by default, frees the computation graph to save memory. But when computing higher-order derivatives, we usually want:

- The gradient to be returned as a tensor (so that it can be further computed).
- To optionally retain / construct the computation graph (so that we can differentiate again).

Therefore, `torch.autograd.grad` is more commonly used.

We continue to use the same example: $z = \sin(x \cdot y)$. We first compute the first-order derivatives $\frac{dz}{dx}$ and $\frac{dz}{dy}$, and then differentiate these results to see what the second-order derivatives $\frac{d^2 z}{dx^2}$ and $\frac{d^2 z}{dy^2}$ look like.


In [12]:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = torch.sin(x * y)

dzdx, dzdy = torch.autograd.grad(z, (x, y), create_graph=True)
print('dz/dx:', dzdx)
print('dz/dy:', dzdy)

dz/dx: tensor(-0.5820, grad_fn=<MulBackward0>)
dz/dy: tensor(-0.2910, grad_fn=<MulBackward0>)


The most important argument is `create_graph=True`. Without it, `dzdx` and `dzdy` would be treated as pure numerical results, and the information about how they were computed would not be retained, so we could not differentiate them again. With it, the outputs `dzdx` and `dzdy` both carry a `grad_fn`, which indicates that they themselves can be differentiated.
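For contrast, here is a sketch of what happens when `create_graph` is left at its default of `False`. It also shows that `torch.autograd.grad` returns the gradient rather than writing it into `.grad`. The `failed` flag is only a bookkeeping variable for this example.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = torch.sin(x * y)

(dzdx,) = torch.autograd.grad(z, x)  # create_graph defaults to False
print('dzdx:', dzdx)
print('dzdx.grad_fn:', dzdx.grad_fn)  # no record of how dzdx was computed
print('x.grad:', x.grad)              # grad() does not populate .grad

try:
    torch.autograd.grad(dzdx, x)      # try to differentiate a second time
    failed = False
except RuntimeError as err:
    failed = True
    print('RuntimeError:', err)
```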

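When computing higher-order derivatives, we sometimes want to compute gradients with respect to different variables sequentially within the same computation graph. However, after a backward pass, whether triggered by `backward()` or by `torch.autograd.grad`, PyTorch by default frees the graph's saved intermediate values to save memory, which prevents further backward passes on the same graph. If we truly need to perform multiple backward passes on the same forward result, we can set `retain_graph=True` to retain the graph. In the example below, `dzdx` and `dzdy` share one graph, so the first second-order call passes `retain_graph=True` to keep that graph alive for the second call.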


In [13]:
x = torch.tensor(2.0, requires_grad=True)
y = torch.tensor(4.0, requires_grad=True)
z = torch.sin(x * y)

dzdx, dzdy = torch.autograd.grad(z, (x, y), create_graph=True)
print('dz/dx:', dzdx)
print('dz/dy:', dzdy)

(d2zdx2,) = torch.autograd.grad(dzdx, x, retain_graph=True)
(d2zdy2,) = torch.autograd.grad(dzdy, y)
print('d2z/dx2:', d2zdx2)
print('d2z/dy2:', d2zdy2)

dz/dx: tensor(-0.5820, grad_fn=<MulBackward0>)
dz/dy: tensor(-0.2910, grad_fn=<MulBackward0>)
d2z/dx2: tensor(-15.8297)
d2z/dy2: tensor(-3.9574)


However, a more common practice is to execute the forward pass again to obtain a new computation graph. `retain_graph=True` is typically used only when we genuinely need to perform multiple gradient computations on the same graph, such as in experiments with higher-order derivatives or in certain regularization terms.
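The "certain directions of the Hessian" mentioned at the start of this section can be obtained with the same double-differentiation pattern, without ever forming the Hessian. The sketch below computes a Hessian–vector product for an illustrative function $f(x) = \sum_i x_i^3$, chosen because its Hessian, $\mathrm{diag}(6x)$, is easy to verify by hand. The direction `v` is arbitrary.

```python
import torch

x = torch.arange(1.0, 4.0, requires_grad=True)  # [1, 2, 3]
f = (x ** 3).sum()

# First-order gradient, recorded so that it can be differentiated again.
(g,) = torch.autograd.grad(f, x, create_graph=True)  # equals 3 * x**2

# Differentiating the scalar g . v with respect to x yields H v.
v = torch.tensor([1.0, 0.0, 2.0])
(hvp,) = torch.autograd.grad(torch.dot(g, v), x)
print('H v:', hvp)  # H = diag(6x), so H v = 6 * x * v
```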


## 2.1.5 VJP and JVP: What Reverse Mode and Forward Mode Actually Compute

Up to this point, we have been talking about “computing gradients”. But strictly speaking, most functions in deep learning are not scalar-to-scalar mappings. Instead, they are:

$$ f: \mathbb{R}^n \to \mathbb{R}^m $$

Its derivative is a Jacobian matrix:

$$ J = \frac{\partial f}{\partial x} \in \mathbb{R}^{m \times n} $$

The real issue is that when both $m,n$ are large, we almost never explicitly construct $J$. What we actually want, and what the framework computes in practice, is a product involving the Jacobian — either multiplied on the left or on the right.


### 2.1.5.1 VJP: Vector–Jacobian Product (Reverse Mode)

Given an “upstream gradient” vector $v \in \mathbb{R}^m$ (which can be understood as $\frac{\partial L}{\partial f}$), reverse mode computes:

$$ v^\top J \in \mathbb{R}^n $$

This is called the **vector–Jacobian product (VJP)**.

In the language of training:

- We have a scalar `loss`: $L = \mathcal{L}(f(x))$
- An upstream gradient: $v = \frac{\partial L}{\partial f}$
- Backpropagation computes: $\frac{\partial L}{\partial x} = v^\top \frac{\partial f}{\partial x}$

So when we call `backward()` in practice, what is actually being computed is a special case of VJP.


In [14]:
def vjp_func(x: torch.Tensor, y: torch.Tensor):
    return torch.sin(torch.dot(x, y))


x = torch.arange(1.0, 5.0)
y = torch.arange(5.0, 9.0)
out = AF.vjp(vjp_func, (x, y))
print('func(x,y):', out[0])
print('VJP output:', out[1])

func(x,y): tensor(0.7739)
VJP output: (tensor([3.1666, 3.7999, 4.4332, 5.0666]), tensor([0.6333, 1.2666, 1.9000, 2.5333]))


### 2.1.5.2 JVP: Jacobian–Vector Product (Forward Mode)

Forward mode is the opposite. Given an input direction $u \in \mathbb{R}^n$, it computes:

$$ Ju \in \mathbb{R}^m $$

This is called the **Jacobian–vector product (JVP)**.

Intuitively, it answers the following question: if we make a small perturbation in the input space along direction $u$, along which direction will the output change? This is common in sensitivity analysis, implicit layers, certain second-order methods, and some physics or scientific computing settings.


In [15]:
def jvp_func(a: torch.Tensor, b: torch.Tensor):
    return torch.sin(torch.dot(a, b))


x = torch.arange(1.0, 5.0)
y = torch.arange(5.0, 9.0)
v_x = torch.full_like(x, 0.1)
v_y = torch.full_like(y, 0.2)
out = AF.jvp(jvp_func, (x, y), (v_x, v_y))
print('func(x,y):', out[0])
print('JVP output:', out[1])

func(x,y): tensor(0.7739)
JVP output: tensor(2.9133)
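Both examples above use a scalar-valued function, which hides the difference in shapes. As a sketch with a vector-valued function, the snippet below compares `AF.vjp` and `AF.jvp` against an explicitly built Jacobian. The matrix `A`, the function `f`, and the vectors `v` and `u` are made up for illustration. The explicit Jacobian is affordable only because the sizes are tiny.

```python
import torch
import torch.autograd.functional as AF

A = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])


def f(x):
    # Maps R^3 to R^2; its Jacobian is A @ diag(cos(x)), of shape (2, 3).
    return A @ torch.sin(x)


x = torch.tensor([0.1, 0.2, 0.3])
v = torch.tensor([1.0, -1.0])      # lives in the output space R^2
u = torch.tensor([0.5, 0.0, 2.0])  # lives in the input space R^3

J = AF.jacobian(f, x)              # explicit Jacobian, for checking only
_, vjp = AF.vjp(f, x, v)           # v^T J, one value per input
_, jvp = AF.jvp(f, x, u)           # J u, one value per output

print('VJP matches v @ J:', torch.allclose(vjp, v @ J))
print('JVP matches J @ u:', torch.allclose(jvp, J @ u))
```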


### 2.1.5.3 Why VJP Is More Common in Deep Learning

This is not about which method is “more advanced,” but about scale matching.

- In deep learning training, $n$ is usually the parameter dimension (millions or billions).
- $m$ is usually the output dimension (often a scalar).
- What we want is $\nabla_x L \in \mathbb{R}^n$.

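The computational cost of one VJP is roughly that of one backward pass, and a single pass yields the derivative with respect to all $n$ inputs at once. It is therefore suitable when $n$ is large but the output is scalar or low-dimensional. One JVP gives the derivative along only a single input direction, so recovering a full gradient this way would take on the order of $n$ passes. JVP is more suitable when the input dimension is relatively small, but we care about how the output changes along certain directions.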

Therefore, a common rule of thumb is:

- If the output is scalar or low-dimensional and the input dimension is large, reverse mode (VJP) is more appropriate.
- If the input dimension is relatively small and the output dimension is large, forward mode (JVP) may be more appropriate.
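One way to see the scale argument is to rebuild a full Jacobian both ways and count the passes. In the sketch below (the function `f` and the sizes $n = 5$, $m = 2$ are made up for illustration), reverse mode needs one VJP per output and forward mode needs one JVP per input.

```python
import torch
import torch.autograd.functional as AF


def f(x):
    # Hypothetical map from R^5 to R^2.
    return torch.stack([x.sum(), (x ** 2).sum()])


x = torch.arange(1.0, 6.0)
m, n = 2, 5

# Reverse mode: one VJP per output (m calls), each yielding one row of J.
rows = [AF.vjp(f, x, torch.eye(m)[i])[1] for i in range(m)]
J_from_vjp = torch.stack(rows)

# Forward mode: one JVP per input (n calls), each yielding one column of J.
cols = [AF.jvp(f, x, torch.eye(n)[j])[1] for j in range(n)]
J_from_jvp = torch.stack(cols, dim=1)

print('VJP calls:', len(rows), '| JVP calls:', len(cols))
print(J_from_vjp)
```

With a scalar loss ($m = 1$) and millions of parameters, the same counting gives one reverse pass versus millions of forward-mode passes.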


## 2.1.6 Common Errors in Backpropagation


In [16]:
x = torch.arange(1.0, 5.0, requires_grad=True)
y = torch.arange(5.0, 9.0, requires_grad=True)

**Calling `backward()` multiple times**

Calling `backward()` multiple times on the same computation graph will cause an error. After the first backward pass finishes, PyTorch frees the intermediate tensors that were saved only for backpropagation, in order to save memory. Therefore, when we try to traverse the same graph a second time, the saved values it needs are already gone. If multiple gradient computations are needed, you can set `retain_graph=True` in the first call.


In [17]:
z = torch.sin(torch.dot(x, y))
z.backward()
try:
    z.backward()  # This will raise an error because the saved tensors were freed by the first call
except RuntimeError as err:
    print('RuntimeError:', err)

RuntimeError: Trying to backward through the graph a second time (or directly access saved tensors after they have already been freed). Saved intermediate values of the graph are freed when you call .backward() or autograd.grad(). Specify retain_graph=True if you need to backward through the graph a second time or if you need to access saved tensors after calling backward.


In [18]:
z = torch.sin(torch.dot(x, y))
z.backward(retain_graph=True)
z.backward()  # This works because we retained the graph
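Note that a second backward pass does not overwrite `.grad`; it adds to it. The sketch below repeats the retained-graph example on fresh tensors and compares the stored gradient after each call. The names `first` and `second` are just for this illustration.

```python
import torch

x = torch.arange(1.0, 5.0, requires_grad=True)
y = torch.arange(5.0, 9.0, requires_grad=True)

z = torch.sin(torch.dot(x, y))
z.backward(retain_graph=True)
first = x.grad.clone()
z.backward()
second = x.grad.clone()

# The second pass accumulated into .grad instead of replacing it.
print('doubled:', torch.allclose(second, 2 * first))

# Reset before an unrelated backward pass to avoid mixing gradients.
x.grad, y.grad = None, None
```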

**Trying to access the gradient of intermediate nodes**

By default, only leaf tensors (that is, tensors created directly by the user with `requires_grad=True`, rather than produced by an operation) store gradient information in `.grad`. Gradients of intermediate nodes are computed during the backward pass but not stored. If every intermediate tensor stored its gradient, memory usage would increase dramatically. Moreover, during training we only need parameter gradients, not gradients of all intermediate values. Therefore, attempting to access the `.grad` attribute of intermediate tensors will return `None` and may trigger a `UserWarning`. If you need to retain the gradient of an intermediate node, you can call `retain_grad()` on that tensor before calling `backward()`.


In [19]:
import warnings

q = torch.dot(x, y)
z = torch.sin(q)
z.backward()

with warnings.catch_warnings(record=True) as w:
    print('q.grad:', q.grad)
    if len(w) > 0:
        for warn in w:
            print('UserWarning:', warn.message)

q.grad: None


In [20]:
q = torch.dot(x, y)
q.retain_grad()
z = torch.sin(q)
z.backward()
print('q.grad after retain_grad:', q.grad)  # Now q.grad is available

q.grad after retain_grad: tensor(0.6333)


**Using in-place operations**

In PyTorch, operations with a trailing underscore, such as `x.add_(1)` or `x.relu_()`, modify the tensor in place. They do not create a new tensor, but directly modify the memory of `x`. This may appear convenient, but during backpropagation, PyTorch often needs certain intermediate values from the forward pass. If those values are modified in place after the forward pass, the backward computation may lose the information required to compute gradients correctly. Therefore, whenever gradients are needed, it is recommended to avoid in-place operations, or to ensure that they do not modify intermediate variables required for gradient computation. The example below runs into an even stricter rule: a leaf tensor that requires gradients cannot be modified in place at all while gradient tracking is enabled.


In [21]:
z = torch.dot(x, y)
try:
    x.relu_()
except RuntimeError as err:
    print('RuntimeError:', err)

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.


In [22]:
z = torch.dot(x, y)
x = torch.relu(x)
z.backward()
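The leaf-tensor error above is raised immediately. The other failure mode, where an in-place operation overwrites an intermediate value that the backward pass still needs, only surfaces when `backward()` runs. The following sketch uses an intermediate tensor `h` and a flag `raised`, both introduced only for this illustration.

```python
import torch

x = torch.arange(1.0, 5.0, requires_grad=True)
h = x * 2                  # intermediate (non-leaf) tensor
out = torch.sin(h).sum()   # sin keeps h, since its derivative is cos(h)
h.add_(1)                  # allowed here, but it overwrites the saved value

try:
    out.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print('RuntimeError:', err)

# Out-of-place version: `h + 1` creates a new tensor and leaves h untouched.
h = x * 2
out = torch.sin(h).sum()
h_shifted = h + 1
out.backward()
print('x.grad:', x.grad)
```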