
In [None]:
# For tips on running notebooks in Google Colab, see
# https://docs.pytorch.org/tutorials/beginner/colab
%matplotlib inline


Automatic Differentiation with `torch.autograd`
===============================================

When training neural networks, the most frequently used algorithm is
**back propagation**. In this algorithm, parameters (model weights) are
adjusted according to the **gradient** of the loss function with respect
to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation
engine called `torch.autograd`. It supports automatic computation of
gradients for any computational graph.

Consider the simplest one-layer neural network, with input `x`,
parameters `w` and `b`, and some loss function. It can be defined in
PyTorch in the following manner:


In [1]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)

Tensors, Functions and Computational graph
==========================================

This code defines the following **computational graph**:

![](https://pytorch.org/tutorials/_static/img/basics/comp-graph.png)

In this network, `w` and `b` are **parameters**, which we need to
optimize. Thus, we need to be able to compute the gradients of the loss
function with respect to those variables. To do that, we set the
`requires_grad` property of those tensors.


<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<p>You can set the value of <code>requires_grad</code> when creating a tensor, or later by using the <code>x.requires_grad_(True)</code> method.</p>

</div>
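
As a small illustration of the note above, the following sketch turns gradient tracking on in both ways. The tensor names `a` and `c` are made up for this example and are not part of the tutorial's network.

```python
import torch

# Option 1: request gradient tracking when the tensor is created.
a = torch.randn(3, requires_grad=True)

# Option 2: create the tensor first, then switch tracking on in place.
c = torch.randn(3)
print(c.requires_grad)   # False: tensors do not track gradients by default
c.requires_grad_(True)   # trailing underscore: modifies the tensor in place
print(c.requires_grad)   # True
```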



A function that we apply to tensors to construct the computational graph
is in fact an object of class `Function`. This object knows how to compute
the function in the *forward* direction, and also how to compute its
derivative during the *backward propagation* step. A reference to the
backward propagation function is stored in the `grad_fn` property of a
tensor. You can find more information about `Function` [in the
documentation](https://pytorch.org/docs/stable/autograd.html#function).


In [2]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7be105c2f2e0>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7be129fef6a0>
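
To make the role of `grad_fn` more concrete, here is a minimal sketch that rebuilds a similar graph under separate, made-up names (`x_in`, `w_demo`, `b_demo`, `z_demo`) and follows one link backward through `next_functions`. The exact node names printed for the matrix product can differ between PyTorch versions, so none are assumed here.

```python
import torch

x_in = torch.ones(5)
w_demo = torch.randn(5, 3, requires_grad=True)
b_demo = torch.randn(3, requires_grad=True)
z_demo = torch.matmul(x_in, w_demo) + b_demo

# Leaf tensors created directly by the user have no grad_fn.
print(w_demo.is_leaf, w_demo.grad_fn)

# The result of an operation records the function that produced it.
print(type(z_demo.grad_fn).__name__)

# next_functions links a node to the nodes that produced its inputs.
for fn, _input_index in z_demo.grad_fn.next_functions:
    print("  input produced by:", type(fn).__name__)
```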


Computing Gradients
===================

To optimize the parameters of the neural network, we need to
compute the derivatives of our loss function with respect to them,
namely, we need $\frac{\partial loss}{\partial w}$ and
$\frac{\partial loss}{\partial b}$ under some fixed values of `x` and
`y`. To compute those derivatives, we call `loss.backward()`, and then
retrieve the values from `w.grad` and `b.grad`:


In [3]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.0019, 0.0292, 0.3007],
        [0.0019, 0.0292, 0.3007],
        [0.0019, 0.0292, 0.3007],
        [0.0019, 0.0292, 0.3007],
        [0.0019, 0.0292, 0.3007]])
tensor([0.0019, 0.0292, 0.3007])
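
The values above can be cross-checked by hand. For binary cross-entropy with logits averaged over the outputs, the derivative with respect to each logit is $(\sigma(z_j) - y_j)/3$; the chain rule then gives `b.grad` equal to that vector and `w.grad` equal to its outer product with `x`, which is why every row of `w.grad` matches `b.grad` when `x` is all ones. The sketch below checks this on a fresh copy of the network; the `_c` suffixed names are introduced here only to avoid overwriting the tutorial's tensors.

```python
import torch

x_c = torch.ones(5)
y_c = torch.zeros(3)
w_c = torch.randn(5, 3, requires_grad=True)
b_c = torch.randn(3, requires_grad=True)

z_c = torch.matmul(x_c, w_c) + b_c
loss_c = torch.nn.functional.binary_cross_entropy_with_logits(z_c, y_c)
loss_c.backward()

# Hand-derived gradient of the mean BCE-with-logits loss w.r.t. the logits.
dz = (torch.sigmoid(z_c.detach()) - y_c) / y_c.numel()

# Chain rule: d(loss)/db = dz and d(loss)/dw = outer(x, dz).
print(torch.allclose(b_c.grad, dz, atol=1e-6))
print(torch.allclose(w_c.grad, torch.outer(x_c, dz), atol=1e-6))
```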


<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<ul>
<li>We can only obtain the <code>grad</code> properties for the leaf nodes of the computational graph that have the <code>requires_grad</code> property set to <code>True</code>. For all other nodes in our graph, gradients will not be available.</li>
<li>By default, we can only perform gradient calculations using <code>backward</code> once on a given graph, for performance reasons. If we need to do several <code>backward</code> calls on the same graph, we need to pass <code>retain_graph=True</code> to the <code>backward</code> call.</li>
</ul>

</div>

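# **Explaining the note above:**
Put simply, **multiple backward passes** means calling `.backward()` more than once on the same computational graph.

By default, a PyTorch computational graph is **single-use**. To save memory, once you call `loss.backward()`, PyTorch frees the intermediate results stored in the graph as soon as the gradients have been computed. If you then call `.backward()` a second time on the same tensor, PyTorch usually raises a `RuntimeError`.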

---

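### Why run "multiple" backward passes?

Most basic models (such as a simple classifier) need only one loss and one backward pass, but the following more complex deep learning scenarios call for several:

#### 1. Multiple losses

Suppose your model has two output tasks that produce `Loss_1` and `Loss_2`, and both losses share the weights of the first part of the model.

* If you call `Loss_1.backward()` first, the buffers of the graph are freed.
* When you then call `Loss_2.backward()`, it fails because the shared part of the graph is no longer available.
* **Fix:** call `Loss_1.backward(retain_graph=True)` so the graph is kept for `Loss_2`.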

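#### 2. Generative adversarial networks (GANs)

When training a GAN, the losses of D (the discriminator) and G (the generator) are often tightly coupled. Sometimes you need to compute the discriminator's gradients first, keep the graph, and then compute the generator's gradients.

#### 3. Special variants of gradient accumulation

Ordinary gradient accumulation (calling `step` only after several batches) does not need `retain_graph`, but some complex reinforcement learning algorithms (such as A3C) or second-derivative computations need to backtrack through the same graph more than once.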

---

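### An analogy: tearing down the bridge after crossing

Think of the **computational graph** as a temporary **bridge** built to cross a river (compute the gradients):

* **Default behavior:** once you have crossed (finished `backward`), PyTorch tears the bridge down right away to save building material (memory).
* **Multiple backward passes:** passing `retain_graph=True` tells PyTorch: "Don't tear the bridge down yet, I will be coming back across shortly."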

---

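### Code comparison

```python
import torch

x = torch.randn(3, requires_grad=True)
y = x * 2
loss = y.sum()

# First backward pass; keep the graph so it can be reused
loss.backward(retain_graph=True)
print(x.grad)  # tensor([2., 2., 2.])

# Without retain_graph=True above, this second call would raise an error
loss.backward()
print(x.grad)  # gradients accumulate: tensor([4., 4., 4.])
```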
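
For contrast, the sketch below shows what happens when the first call does not retain the graph. It uses a squared input (a choice made for this example) because that operation saves its input for the backward pass, so the second call has nothing left to read; the names `x_demo` and `loss_demo` are hypothetical.

```python
import torch

x_demo = torch.randn(3, requires_grad=True)
loss_demo = (x_demo ** 2).sum()   # pow saves its input for the backward pass

loss_demo.backward()              # first call: saved tensors are freed afterwards
try:
    loss_demo.backward()          # second call on the same graph
    second_call_failed = False
except RuntimeError as err:
    second_call_failed = True
    print(f"RuntimeError: {err}")
```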

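### Summary

* **Single backward pass (default):** saves memory; the graph is freed once it has been used.
* **Multiple backward passes (`retain_graph=True`):** lets you reuse the same intermediate values to compute gradients several times; common in multi-task learning and complex generative models.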


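# **Training iterations vs. multiple backward passes**
This is a classic point of confusion. The key to resolving it is to separate two concepts: **training iterations** and **multiple backward passes on the same graph**.

What you normally think of as "updating the parameters many times" actually **uses a brand-new computational graph in every iteration**.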

---

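### 1. The core difference: new graph vs. old graph

In a **dynamic computational graph** framework like PyTorch, the flow looks like this:

* **The standard training loop (every iteration):**
1. **Forward:** you run `output = model(input)`. PyTorch **builds a "bridge" (the computational graph) from scratch** in memory, recording all intermediate values and the operations that connect them.
2. **Backward:** you run `loss.backward()`. PyTorch walks back along the bridge and computes the gradients.
3. **The bridge is torn down:** to save memory, the graph (and the memory held by its intermediate values) is **freed** as soon as the gradients have been computed.
4. **Next iteration:** the next time you run `model(input)`, a **brand-new bridge** is built.


* **The "multiple backward passes" the documentation mentions:**
this means calling `.backward()` twice in a row on **the same graph, produced by a single forward pass**.
> **Analogy:** you build a bridge and cross it (first backward). If you want to cross again without it being torn down (second backward), you must pass `retain_graph=True` on the first call.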
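
The standard loop described above can be sketched as follows. This is a toy least-squares problem with a hand-written SGD update, and every name in it (`w_loop`, `inputs`, `targets`, the learning rate 0.1) is an assumption made for illustration. Note that `backward()` is called once per iteration and never needs `retain_graph=True`, because each forward pass builds its own graph.

```python
import torch

torch.manual_seed(0)
w_loop = torch.randn(3, requires_grad=True)
inputs = torch.randn(8, 3)
targets = torch.randn(8)
losses = []

for step in range(5):
    pred = inputs @ w_loop                      # forward: builds a new graph
    loss_step = ((pred - targets) ** 2).mean()
    loss_step.backward()                        # backward: uses this graph once
    with torch.no_grad():
        w_loop -= 0.1 * w_loop.grad             # manual SGD update, not tracked
    w_loop.grad.zero_()                         # reset the accumulated gradient
    losses.append(loss_step.item())

print(losses)
```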



---

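### 2. Why rebuilding the graph every iteration is not a burden

You might think that rebuilding the graph in every iteration wastes performance.

It does not, and it is central to PyTorch's flexibility:

1. **Memory efficiency:** if graphs were never freed, GPU memory (VRAM) would quickly fill up with the activations of thousands of intermediate layers as training proceeds.
2. **Dynamism:** because the graph is rebuilt each iteration, the network structure can differ from one iteration to the next (for example, changing the number of layers inside a loop, or branching on a condition), which is very powerful for variable-length sequences (NLP) and dynamic architectures.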

---

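### 3. When do you really need "multiple backward passes"?

If the graph is usually discarded after one use, which unusual cases need `retain_graph=True`?

* **Multi-task learning:**
If you have two separate losses (for example, `loss1` for classification and `loss2` for regression) that share the same backbone network.
```python
# Wrong: the graph is freed after the first line, so the second line raises an error
loss1.backward()
loss2.backward()

# Right:
loss1.backward(retain_graph=True)  # keep the graph for the next call
loss2.backward()  # the graph can be freed after this call

```


* **Adversarial training (GANs):**
When the generator and discriminator losses need to take turns using the same forward result to compute gradients.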
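
A runnable version of the multi-task case might look like the sketch below. The "backbone" is just a single tensor passed through `tanh`, a stand-in chosen for brevity rather than a real network. It also shows a common alternative: because gradients accumulate additively, summing the losses and calling `backward()` once produces the same gradients without `retain_graph`.

```python
import torch

torch.manual_seed(0)
backbone = torch.randn(4, requires_grad=True)   # stand-in for shared weights

# Variant A: two backward calls over one shared graph.
features = torch.tanh(backbone)                 # shared forward computation
loss1 = features.sum()
loss2 = (features ** 2).sum()
loss1.backward(retain_graph=True)               # keep the graph for loss2
loss2.backward()
grad_two_calls = backbone.grad.clone()

# Variant B: rebuild the graph, sum the losses, call backward once.
backbone.grad = None
features = torch.tanh(backbone)
total = features.sum() + (features ** 2).sum()
total.backward()
grad_one_call = backbone.grad.clone()

print(torch.allclose(grad_two_calls, grad_one_call))
```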

---

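### Summary

| Scenario | Involves "multiple backward passes"? | Needs `retain_graph`? |
| --- | --- | --- |
| **Normal training (100 epochs)** | No (each iteration has its own new graph) | No |
| **Several losses on the same batch, with separate `backward` calls** | **Yes** (they share one graph) | **Yes** |
| **Second-derivative computation** | **Yes** (the graph of the first derivative must be kept) | **Yes** (`create_graph=True` turns it on by default) |

So this does not make everyday training any less convenient. On the contrary, PyTorch's default of freeing the graph after use is what lets you fit larger models into limited GPU memory.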
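
The second-derivative row of the table can be illustrated with `torch.autograd.grad`. The sketch below assumes the toy function $y = x^3$ at $x = 3$ (both chosen only for this example), so the first derivative is $3x^2 = 27$ and the second is $6x = 18$.

```python
import torch

x2 = torch.tensor(3.0, requires_grad=True)
y2 = x2 ** 3

# First derivative; create_graph=True records the graph of this computation
# so that the result can itself be differentiated.
(dy_dx,) = torch.autograd.grad(y2, x2, create_graph=True)

# Second derivative: differentiate the first derivative.
(d2y_dx2,) = torch.autograd.grad(dy_dx, x2)

print(dy_dx.item(), d2y_dx2.item())  # 27.0 18.0
```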


Disabling Gradient Tracking
===========================

By default, all tensors with `requires_grad=True` track their
computational history and support gradient computation. However, there
are some cases when we do not need to do that, for example, when we have
trained the model and just want to apply it to some input data, i.e. we
only want to do *forward* computations through the network. We can stop
tracking computations by surrounding our computation code with a
`torch.no_grad()` block:


In [4]:
z = torch.matmul(x, w)+b
print(z.requires_grad)

with torch.no_grad():
    z = torch.matmul(x, w)+b
print(z.requires_grad)

True
False


Another way to achieve the same result is to use the `detach()` method
on the tensor:


In [5]:
z = torch.matmul(x, w)+b
z_det = z.detach()
print(z_det.requires_grad)

False
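
One way to see what `detach()` does to gradient flow is to use a tensor and its detached copy in the same expression. In this sketch (the tensor `t` is made up for the example), autograd treats the detached factor as a constant, so the gradient of `t * t.detach()` is `t` rather than `2 * t`.

```python
import torch

t = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The detached factor is a constant as far as autograd is concerned.
out = (t * t.detach()).sum()
out.backward()

print(t.grad)  # tensor([1., 2., 3.]), not tensor([2., 4., 6.])
```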


There are reasons you might want to disable gradient tracking:

-   To mark some parameters in your neural network as **frozen
    parameters**.
-   To **speed up computations** when you are only doing a forward
    pass, because computations on tensors that do not track
    gradients are more efficient.
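
As a sketch of the frozen-parameter case, the example below freezes the weight of a small `torch.nn.Linear` layer (the layer sizes and batch size are arbitrary choices for illustration) and confirms that only the bias receives a gradient.

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 2)

# Freeze the weight; the bias stays trainable.
layer.weight.requires_grad_(False)

out = layer(torch.randn(3, 4)).sum()
out.backward()

print(layer.weight.grad)  # None: no gradient is computed for a frozen parameter
print(layer.bias.grad)    # tensor([3., 3.]): one contribution per batch row
```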


More on Computational Graphs
============================

Conceptually, autograd keeps a record of data (tensors) and all executed
operations (along with the resulting new tensors) in a directed acyclic
graph (DAG) consisting of
[Function](https://pytorch.org/docs/stable/autograd.html#torch.autograd.Function)
objects. In this DAG, leaves are the input tensors, roots are the output
tensors. By tracing this graph from roots to leaves, you can
automatically compute the gradients using the chain rule.

In a forward pass, autograd does two things simultaneously:

-   run the requested operation to compute a resulting tensor
-   maintain the operation's *gradient function* in the DAG.

The backward pass kicks off when `.backward()` is called on the DAG
root. `autograd` then:

-   computes the gradients from each `.grad_fn`,
-   accumulates them in the respective tensor's `.grad` attribute
-   using the chain rule, propagates all the way to the leaf tensors.

<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<p>An important thing to note is that the graph is recreated from scratch; after each <code>.backward()</code> call, autograd starts populating a new graph. This is exactly what allows you to use control flow statements in your model; you can change the shape, size and operations at every iteration if needed.</p>

</div>
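
The note above can be illustrated with a function whose graph depends on ordinary Python control flow. In this sketch, `dynamic_forward` and its inputs are hypothetical: the loop count and the branch are decided at run time, and autograd differentiates whichever operations actually ran.

```python
import torch

def dynamic_forward(v, n_steps):
    # Plain Python loop and branch; the graph records what actually executed.
    h = v
    for _ in range(n_steps):
        h = h * 2 if h.sum() > 0 else h * 3
    return h.sum()

v = torch.tensor([1.0, 1.0], requires_grad=True)
for n_steps in (1, 3):
    v.grad = None                         # start each run from a clean gradient
    dynamic_forward(v, n_steps).backward()
    print(n_steps, v.grad)                # 2**n_steps in each component
```

With a positive input the doubling branch is taken every time, so the gradient is `tensor([2., 2.])` for one step and `tensor([8., 8.])` for three.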



Optional Reading: Tensor Gradients and Jacobian Products
========================================================

In many cases, we have a scalar loss function, and we need to compute
the gradient with respect to some parameters. However, there are cases
when the output function is an arbitrary tensor. In this case, PyTorch
allows you to compute a so-called **Jacobian product**, and not the
actual gradient.

For a vector function $\vec{y}=f(\vec{x})$, where
$\vec{x}=\langle x_1,\dots,x_n\rangle$ and
$\vec{y}=\langle y_1,\dots,y_m\rangle$, the gradient of $\vec{y}$ with
respect to $\vec{x}$ is given by the **Jacobian matrix**:

$$\begin{aligned}
J=\left(\begin{array}{ccc}
   \frac{\partial y_{1}}{\partial x_{1}} & \cdots & \frac{\partial y_{1}}{\partial x_{n}}\\
   \vdots & \ddots & \vdots\\
   \frac{\partial y_{m}}{\partial x_{1}} & \cdots & \frac{\partial y_{m}}{\partial x_{n}}
   \end{array}\right)
\end{aligned}$$

Instead of computing the Jacobian matrix itself, PyTorch allows you to
compute the **Jacobian product** $v^T\cdot J$ for a given input vector
$v=(v_1 \dots v_m)$. This is achieved by calling `backward` with $v$ as
an argument. The size of $v$ must match the size of the output tensor
on which `backward` is called:


In [None]:
inp = torch.eye(4, 5, requires_grad=True)
out = (inp+1).pow(2).t()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"First call\n{inp.grad}")
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nSecond call\n{inp.grad}")
inp.grad.zero_()
out.backward(torch.ones_like(out), retain_graph=True)
print(f"\nCall after zeroing gradients\n{inp.grad}")

Notice that when we call `backward` for the second time with the same
argument, the value of the gradient is different. This happens because
when doing `backward` propagation, PyTorch **accumulates the
gradients**, i.e. the value of the computed gradients is added to the
`grad` property of all leaf nodes of the computational graph. If you want
to compute the proper gradients, you need to zero out the `grad` property
beforehand. In real-life training an *optimizer* helps us to do this.
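
A minimal sketch of how an optimizer handles this in a training loop is shown below. The parameter `p`, the quadratic loss, and the learning rate are assumptions made for the example; the point is that `optimizer.zero_grad()` clears the gradient left over from the previous step before the next `backward()` call.

```python
import torch

p = torch.nn.Parameter(torch.tensor([1.0, 2.0]))
optimizer = torch.optim.SGD([p], lr=0.1)

for step in range(2):
    optimizer.zero_grad()              # clear gradients from the previous step
    loss_p = (p ** 2).sum()
    loss_p.backward()                  # gradient of sum(p^2) is 2 * p
    grad_this_step = p.grad.clone()
    optimizer.step()                   # p <- p - lr * p.grad

# Second step sees p = [0.8, 1.6], so the gradient is [1.6, 3.2];
# without zero_grad() it would have accumulated to [3.6, 7.2].
print(grad_this_step)
```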


<div style="background-color: #54c7ec; color: #fff; font-weight: 700; padding-left: 10px; padding-top: 5px; padding-bottom: 5px"><strong>NOTE:</strong></div>

<div style="background-color: #f3f4f7; padding-left: 10px; padding-top: 10px; padding-bottom: 10px; padding-right: 10px">

<p>Previously we were calling the <code>backward()</code> function without parameters. This is essentially equivalent to calling <code>backward(torch.tensor(1.0))</code>, which is a useful way to compute the gradients in the case of a scalar-valued function, such as the loss during neural network training.</p>

</div>
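
To connect the Jacobian product in this section with the full Jacobian matrix, the sketch below compares the two on a small function. The function `f`, the input `xj`, and the vector `vj` are made up for illustration, and `torch.autograd.functional.jacobian` is used only as a reference to build $J$ explicitly.

```python
import torch
from torch.autograd.functional import jacobian

def f(x):
    # Toy function from R^3 to R^2, chosen only for illustration.
    return torch.stack([x[0] * x[1], x[1] + x[2] ** 2])

xj = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
vj = torch.tensor([1.0, 0.5])

yj = f(xj)
yj.backward(vj)                    # accumulates v^T J into xj.grad

J = jacobian(f, xj.detach())       # full 2x3 Jacobian, for comparison only
print(J)
print(xj.grad)                     # tensor([2.0000, 1.5000, 3.0000])
print(torch.allclose(xj.grad, vj @ J))
```

Here $J$ has rows $(x_2, x_1, 0) = (2, 1, 0)$ and $(0, 1, 2x_3) = (0, 1, 6)$, so $v^T J = (2, 1.5, 3)$, which is what `backward(vj)` leaves in `xj.grad` without ever forming $J$.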



------------------------------------------------------------------------


Further Reading
===============

-   [Autograd
    Mechanics](https://pytorch.org/docs/stable/notes/autograd.html)
