# Computation Graph Construction Details

## Inplace operation

References:
+ https://github.com/chunhuizhang/bilibili_vlogs/blob/master/learn_torch/grad/05_torch_variables_grad_inplace_operation.ipynb
+ https://www.bilibili.com/video/BV1o24y1b7tk?vd_source=225dba48b31d269151658db856705273

### Introduction to in-place operations

First, import the required packages and check the PyTorch and Python interpreter versions:

In [9]:
import sys
import torch
from torch import nn
print(torch.__version__)
print(sys.version_info)

2.0.1
sys.version_info(major=3, minor=8, micro=17, releaselevel='final', serial=0)


A neural network training step proceeds as follows:

1. Forward pass: compute the loss.
2. `loss.backward()`: compute the gradients.
3. `optimizer.step()`: update the parameters, $x = x - lr \cdot x.grad$.
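
The three steps can be sketched by hand for a single tensor. This is a minimal illustration, not the notebook's code: the quadratic loss and the learning rate `lr = 0.1` are assumptions chosen for the example. Note that the update itself is an in-place operation on a leaf tensor, which is why it runs under `torch.no_grad()`.

```python
import torch

x = torch.tensor([1.0, -2.0], requires_grad=True)  # the tensor being optimized
lr = 0.1                                           # assumed learning rate

loss = (x ** 2).sum()   # 1. forward pass: compute the loss
loss.backward()         # 2. backward pass: x.grad now holds 2 * x
with torch.no_grad():   # 3. update step, hidden from autograd
    x -= lr * x.grad    #    in-place form of x = x - lr * x.grad
x.grad.zero_()          # reset the gradient before the next iteration

print(x)
```

After the update, `x` is still a leaf tensor that requires grad, so the next forward pass can build a fresh graph on top of it.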

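Next, we look at two kinds of in-place operations that autograd does not allow.

1. A leaf tensor with `requires_grad == True` cannot be modified in place.
+ All `Parameters` are leaf nodes and require grad.
+ Use `tensor.is_leaf` to check whether a tensor is a leaf node.

2. A tensor that is needed during the backward pass cannot be modified in place.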
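
As a quick check of the leaf-node rule, the sketch below inspects `is_leaf` on a parameter, on a user-created tensor, and on the result of an operation. The `nn.Linear(2, 1)` layer is an arbitrary choice made for this illustration.

```python
import torch
from torch import nn

layer = nn.Linear(2, 1)   # its weight and bias are nn.Parameter objects
x = torch.ones(2)         # created directly by the user
y = layer(x)              # produced by an operation, so it has a grad_fn

print(layer.weight.is_leaf, layer.weight.requires_grad)  # a leaf that requires grad
print(x.is_leaf, x.requires_grad)                        # a leaf that does not require grad
print(y.is_leaf, y.grad_fn is not None)                  # not a leaf
```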

---

The two examples below show what an in-place operation looks like in each case and how to avoid the error:

1. A leaf tensor with `requires_grad == True` cannot be modified in place

In [10]:
w = torch.zeros(10)  # torch.FloatTensor(10) would allocate uninitialized memory
w.requires_grad = True

In [11]:
w.is_leaf # w is a leaf node

True

Call `normal_` (by convention, methods whose names end in `_` are in-place operations):

In [12]:
# inplace operation
w.normal_()

RuntimeError: a leaf Variable that requires grad is being used in an in-place operation.

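Modifying a leaf node in place directly raises an error. One way to modify a leaf node anyway is to run the in-place operation on `.data`, which shares storage with `w` but is not tracked by autograd: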

In [13]:
print(w.data.requires_grad)
print(w.data)
w.data.normal_()

False
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])


tensor([-0.7905, -1.3713,  0.0745, -0.1921,  1.1473,  0.1278,  0.9638,  1.6434,
        -0.8331,  0.5344])

In [14]:
w.data

tensor([-0.7905, -1.3713,  0.0745, -0.1921,  1.1473,  0.1278,  0.9638,  1.6434,
        -0.8331,  0.5344])
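
An alternative to `.data` that is not used in the original notebook is to wrap the in-place call in `torch.no_grad()`, which is also what the initializers in `torch.nn.init` do. A minimal sketch:

```python
import torch
from torch import nn

w = torch.zeros(10, requires_grad=True)
with torch.no_grad():
    w.normal_()        # allowed: autograd is not recording inside this block

w2 = torch.zeros(10, requires_grad=True)
nn.init.normal_(w2)    # library initializer, fills w2 in place without recording

print(w.is_leaf, w.requires_grad)
print(w2.is_leaf, w2.requires_grad)
```

Both tensors remain leaf nodes that require grad after the modification.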

2. Tensors needed during the backward pass (whether or not they are leaf nodes, variables, or `Parameters`)

Consider the following computation graph:

In [15]:
x = torch.FloatTensor([1., 2.])
w1 = torch.FloatTensor([[2,],[1.]])
w2 = torch.FloatTensor([3.])
w1.requires_grad = True
w2.requires_grad = True

In [16]:
w2.is_leaf

True

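Now we perform an in-place operation on `d`:

> The code below raises an error. The reason:
> + When `f` is computed, `d` holds a particular value, and the derivative of `f` with respect to `w2` depends on that value. If `d` is modified after `f` has been computed, `f.backward()` would compute a wrong derivative with respect to `w2`. Concretely, when `f = torch.matmul(d, w2)` executes, PyTorch's autograd saves a reference to `d` for the later backward pass. During `backward()` it notices that the version of `d` no longer matches the version it saved, and it raises an error instead of returning a wrong gradient.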

In [20]:
# x * w1 -> d; d * w2 -> f
d = torch.matmul(x, w1)
f = torch.matmul(d, w2)
d[:] = 0
f.backward()

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [1]], which is output 0 of torch::autograd::CopySlices, is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

If we modify `d` before computing `f = torch.matmul(d, w2)`, the problem does not occur, because autograd saves `d` only after the modification:


In [21]:
# x * w1 -> d; d * w2 -> f
d = torch.matmul(x, w1)
d[:] = 0
f = torch.matmul(d, w2)
f.backward()
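
If `d` really has to be changed after `f` is computed, one option is to modify a copy instead. The sketch below is an illustration rather than part of the original notebook; it uses the internal `_version` attribute that the error message refers to and shows that writing to a clone leaves the version of `d` untouched.

```python
import torch

x = torch.tensor([1., 2.])
w1 = torch.tensor([[2.], [1.]], requires_grad=True)
w2 = torch.tensor([3.], requires_grad=True)

d = torch.matmul(x, w1)
f = torch.matmul(d, w2)      # autograd saves d here to compute df/dw2 later

version_before = d._version
d_zeroed = d.clone()         # out-of-place copy with its own storage
d_zeroed[:] = 0              # the in-place write only touches the copy
version_after = d._version   # unchanged, so the saved d is still valid

f.backward()
print(version_before == version_after)
print(w2.grad)               # equals the original value of d
```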

### `.data` and `.detach()`

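How the two relate, how they differ, and what to watch out for:
+ `detach`
    - Returns a new Tensor, detached from the current graph.
    - The result will never require gradient.
+ The tensors returned by `x.data` and `x.detach()` have the following in common:
    - Both share the same underlying data with `x`.
    - Both are independent of the computation history of `x`.
    - Both have `requires_grad = False`.
+ They differ in how in-place modifications are handled:
    - Modifying `x.data` in place raises no error, but the gradients computed afterwards are silently wrong (in effect a hidden bug).
    - Modifying `x.detach()` in place raises an error during `backward()` if `x` is needed for gradient computation (**safer for gradients, recommended**).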
    
The two examples below use `.data` and `.detach()` respectively:

In [22]:
a = torch.tensor([1, 2, 3.], requires_grad=True)

out = a.sigmoid()

c = out.data
# c = out.detach()

print(f'a.requires_grad: {a.requires_grad}, out.requires_grad: {out.requires_grad}, c.requires_grad: {c.requires_grad}')

print(out)
print(c)
c.zero_()

print(out)
print(c)

out.sum().backward()
print(a.grad, a.sigmoid()*(1-a.sigmoid()))

a.requires_grad: True, out.requires_grad: True, c.requires_grad: False
tensor([0.7311, 0.8808, 0.9526], grad_fn=<SigmoidBackward0>)
tensor([0.7311, 0.8808, 0.9526])
tensor([0., 0., 0.], grad_fn=<SigmoidBackward0>)
tensor([0., 0., 0.])
tensor([0., 0., 0.]) tensor([0.1966, 0.1050, 0.0452], grad_fn=<MulBackward0>)


In [23]:
a = torch.tensor([1, 2, 3.], requires_grad=True)

out = a.sigmoid()

# c = out.data
c = out.detach()  # safer: an in-place modification of c is detected and backward() raises an error

print(f'a.requires_grad: {a.requires_grad}, out.requires_grad: {out.requires_grad}, c.requires_grad: {c.requires_grad}')

print(out)
print(c)
c.zero_()

print(out)
print(c)

out.sum().backward()
print(a.grad, a.sigmoid()*(1-a.sigmoid()))

a.requires_grad: True, out.requires_grad: True, c.requires_grad: False
tensor([0.7311, 0.8808, 0.9526], grad_fn=<SigmoidBackward0>)
tensor([0.7311, 0.8808, 0.9526])
tensor([0., 0., 0.], grad_fn=<SigmoidBackward0>)
tensor([0., 0., 0.])


RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3]], which is output 0 of SigmoidBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
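
When the goal is a scratch copy that can be modified freely, a common pattern (not shown in the original notebook) is to combine `detach()` with `clone()`, so that the copy no longer shares storage with `out`. A minimal sketch:

```python
import torch

a = torch.tensor([1, 2, 3.], requires_grad=True)
out = a.sigmoid()

c = out.detach().clone()   # detached from the graph and stored in separate memory
c.zero_()                  # modifies the copy only; out keeps its values

out.sum().backward()
expected = (out * (1 - out)).detach()  # analytic derivative of sigmoid
print(a.grad)
print(torch.allclose(a.grad, expected))
```

Here `backward()` succeeds and the gradient matches the analytic value, because the tensor saved by `sigmoid` was never touched.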

### Embedding

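Finally, an `Embedding` example illustrates the same in-place problem. Pay close attention to the order in which the graph is built!

When `b` is computed, the `max_norm` renormalization modifies `embedding.weight` in place, which conflicts with the version of the weight that was saved when `a` was computed.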

In [25]:
n, d, m = 3, 5, 7
# embedding = nn.Embedding(n, d, max_norm=True)
embedding = nn.Embedding(n, d, max_norm=1)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight @ W.t()  # weight must be cloned for this to be differentiable 
b = embedding(idx) @ W.t()  # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [3, 5]] is at version 2; expected version 1 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Cloning the weight first means that later changes to `embedding.weight` no longer affect the computation graph of `a`:

In [26]:
n, d, m = 3, 5, 7
# embedding = nn.Embedding(n, d, max_norm=True)
embedding = nn.Embedding(n, d, max_norm=1)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])
a = embedding.weight.clone() @ W.t()  # clone the weight so the later in-place change does not invalidate a's graph
b = embedding(idx) @ W.t()  # modifies weight in-place
out = (a.unsqueeze(0) + b.unsqueeze(1))
loss = out.sigmoid().prod()
loss.backward()
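
Because the conflict comes from the order in which the graph is built, another option is to run the embedding lookup first, so that the in-place renormalization happens before `embedding.weight` is saved for `a`. The following is a minimal sketch of that reordering, not taken from the original notebook; the seed is an arbitrary choice for reproducibility.

```python
import torch
from torch import nn

torch.manual_seed(0)  # arbitrary seed, only for reproducibility

n, d, m = 3, 5, 7
embedding = nn.Embedding(n, d, max_norm=1)
W = torch.randn((m, d), requires_grad=True)
idx = torch.tensor([1, 2])

b = embedding(idx) @ W.t()     # lookup first: rows whose norm exceeds max_norm are renormalized in place
a = embedding.weight @ W.t()   # the weight is saved for backward after that modification
out = a.unsqueeze(0) + b.unsqueeze(1)
loss = out.sigmoid().prod()
loss.backward()                # no error: the weight has not changed since it was saved

print(embedding.weight.grad.shape, W.grad.shape)
```

Note that in this ordering `a` is computed from the renormalized weights, whereas the `clone()` version uses the weights as they were before the lookup.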