# A Closer Look at Core Concepts in PyTorch

Learn the basic concepts of PyTorch through examples.

This article also draws on: [PyTorch 101, Part 1: Understanding Graphs, Automatic Differentiation and Autograd](https://blog.paperspace.com/pytorch-101-understanding-graphs-and-automatic-differentiation/)

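At its core, PyTorch provides two main features:

- n-dimensional Tensors, similar to numpy arrays but able to run on GPUs;
- automatic differentiation for building and training neural networks.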

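## Tensors

First, a quick refresher on numpy. numpy provides an n-dimensional array object and many functions for manipulating it. Numpy is a generic framework for scientific computing; it knows nothing about computational graphs, deep learning, or gradients. Still, it is easy to use numpy to generate random data, feed it into a two-layer network, and implement the network's forward and backward passes by hand. Below is a two-layer network implemented in numpy: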

```python
import numpy as np

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = np.random.randn(N, D_in)
y = np.random.randn(N, D_out)

# Randomly initialize weights
w1 = np.random.randn(D_in, H)
w2 = np.random.randn(H, D_out)

learning_rate = 1e-6
for t in range(10):  # increase for more iterations
    # Forward pass: compute predicted y
    h = x.dot(w1)
    h_relu = np.maximum(h, 0)
    y_pred = h_relu.dot(w2)

    # Compute and print loss
    loss = np.square(y_pred - y).sum()
    print(t, loss)

    # Backprop: compute gradients of loss with respect to w1 and w2
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.T.dot(grad_y_pred)
    grad_h_relu = grad_y_pred.dot(w2.T)
    grad_h = grad_h_relu.copy()
    grad_h[h < 0] = 0
    grad_w1 = x.T.dot(grad_h)

    # Update weights
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```

```
0 47210584.421595454
1 52147870.419683054
2 57742749.7018303
3 48678706.12763923
4 27685925.824841775
5 11067765.22382442
6 4462490.630094571
7 2432642.970467399
8 1722643.7375642802
9 1364839.1587514523
```


A brief analysis of the computation above. For a single sample, $loss=\sum_i (y_{pred,i}-y_i)^2$, so $\frac{\partial{loss}}{\partial{y_{pred,i}}}=2(y_{pred,i}-y_i)$; in vectorized form, grad_y_pred = 2.0 * (y_pred - y). The remaining gradients follow in the same way. For example, $y_{pred,i}=\sum_j h\_relu_j \, w2_{ji}$, so $\frac{\partial{loss}}{\partial{w2_{ji}}}=\frac{\partial{loss}}{\partial{y_{pred,i}}} \cdot \frac{\partial{y_{pred,i}}}{\partial{w2_{ji}}}$ with $\frac{\partial{y_{pred,i}}}{\partial{w2_{ji}}}=h\_relu_j$; in matrix form this is grad_w2 = h_relu.T.dot(grad_y_pred). For a more formal derivation, see [The evolution of a simple two-layer network](https://blog.csdn.net/StreamRock/article/details/83718443).
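The hand-derived gradients can be sanity-checked numerically. The sketch below is not part of the original notebook: it assumes deliberately tiny layer sizes and a hypothetical helper `numeric_grad` that uses central finite differences, and compares the result with the analytic formulas from the training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_in, H, D_out = 4, 5, 3, 2  # tiny sizes, chosen only for this check
x = rng.standard_normal((N, D_in))
y = rng.standard_normal((N, D_out))
w1 = rng.standard_normal((D_in, H))
w2 = rng.standard_normal((H, D_out))

def loss_fn():
    """Same forward pass and loss as the training loop, using the current w1, w2."""
    h_relu = np.maximum(x.dot(w1), 0)
    return np.square(h_relu.dot(w2) - y).sum()

# Analytic gradients: the same formulas as in the training loop
h = x.dot(w1)
h_relu = np.maximum(h, 0)
grad_y_pred = 2.0 * (h_relu.dot(w2) - y)
grad_w2 = h_relu.T.dot(grad_y_pred)
grad_h = grad_y_pred.dot(w2.T)
grad_h[h < 0] = 0
grad_w1 = x.T.dot(grad_h)

def numeric_grad(w, eps=1e-6):
    """Hypothetical helper: central finite differences, one entry of w at a time."""
    g = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        f_plus = loss_fn()
        w[idx] = old - eps
        f_minus = loss_fn()
        w[idx] = old  # restore the original value
        g[idx] = (f_plus - f_minus) / (2 * eps)
    return g

num_w1 = numeric_grad(w1)
num_w2 = numeric_grad(w2)
print(np.abs(num_w1 - grad_w1).max(), np.abs(num_w2 - grad_w2).max())
```

Both printed differences should be tiny compared with the gradient values themselves, which is a quick way to catch a transposed matrix or a missing factor of 2.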

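First write down the forward pass and the loss as mathematical expressions, then work out the backward pass for a single sample. The tricky part of the backward pass is matrix differentiation; see [Derivative of the 2-norm of a vector?](https://www.zhihu.com/question/31845977) and [Matrix calculus](https://en.wikipedia.org/wiki/Matrix_calculus). The formulas are tedious to typeset, so they are not repeated here.

Numpy is a great framework, but it cannot use GPUs to accelerate its numerical computations. For modern deep neural networks, GPUs often provide speedups of 50x or more, so numpy alone is not enough for modern deep learning.

PyTorch provides the Tensor, which is conceptually the same as a numpy array: an n-dimensional array, with many functions for operating on it.

Most computations you would perform on numpy arrays can also be performed on PyTorch Tensors. Unlike numpy, PyTorch Tensors can use GPUs to accelerate their numeric computations.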
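As a small illustration of this correspondence (a sketch, not from the original notebook), the snippet below converts a numpy array to a Tensor, runs two operations that mirror their numpy counterparts, and falls back to the CPU when no GPU is present. The array contents are arbitrary.

```python
import numpy as np
import torch

# Use the GPU only when one is actually available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

a_np = np.arange(6, dtype=np.float32).reshape(2, 3)
a = torch.from_numpy(a_np).to(device)  # numpy array -> Tensor on the chosen device

b = a.t().mm(a)        # matrix product, like a_np.T.dot(a_np)
c = a.clamp(min=2.0)   # elementwise lower bound, like np.maximum(a_np, 2.0)

# Bring the results back to numpy and compare with the numpy versions
matmul_matches = np.allclose(b.cpu().numpy(), a_np.T.dot(a_np))
clamp_matches = np.allclose(c.cpu().numpy(), np.maximum(a_np, 2.0))
print(matmul_matches, clamp_matches)
```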

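To run a PyTorch Tensor on a GPU, you only need to place it on a CUDA device (the device argument in the code below). As in the numpy example above, the following code feeds random PyTorch Tensors into a two-layer network and implements the forward and backward passes by hand: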

```python
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random input and output data
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Randomly initialize weights
w1 = torch.randn(D_in, H, device=device)
w2 = torch.randn(H, D_out, device=device)

learning_rate = 1e-6
for t in range(10):
    # Forward pass: compute predicted y
    h = x.mm(w1)
    h_relu = h.clamp(min=0)
    y_pred = h_relu.mm(w2)

    # Compute and print loss; loss is a scalar, and is stored in a PyTorch Tensor
    # of shape (); we can get its value as a Python number with loss.item().
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Backprop to compute gradients of w1 and w2 with respect to loss
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h_relu.t().mm(grad_y_pred)
    grad_h_relu = grad_y_pred.mm(w2.t())
    grad_h = grad_h_relu.clone()
    grad_h[h < 0] = 0
    grad_w1 = x.t().mm(grad_h)

    # Update weights using gradient descent
    w1 -= learning_rate * grad_w1
    w2 -= learning_rate * grad_w2
```

```
0 28328694.0
1 26614422.0
2 30046522.0
3 34163608.0
4 34199284.0
5 27548846.0
6 17430478.0
7 9046773.0
8 4364270.0
9 2210282.75
```


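## Autograd

For a small two-layer network, implementing the backward pass by hand is not a big deal, but it quickly becomes tedious for large, complex networks.

Fortunately, we can use **automatic differentiation** to compute the backward pass of a neural network automatically. The autograd package in PyTorch provides exactly this. When using autograd, the forward pass of the network defines a computational graph: nodes in the graph are Tensors, and edges are the functions that produce output Tensors from input Tensors. Backpropagating through this graph then makes it easy to compute gradients. For a refresher on computational graphs, see 2.1-60min-pytorch.

This sounds complicated, but it is simple to use in practice. **Each Tensor represents a node in the computational graph.** If x is a Tensor with x.requires_grad=True, then after a backward pass x.grad is another Tensor holding the gradient of some scalar value with respect to x.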
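A minimal sketch of this behavior, with values chosen only for illustration: a Tensor with requires_grad=True receives its gradient in .grad once backward() is called on a scalar computed from it.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

loss = (x ** 2).sum()             # scalar: 1 + 4 + 9
has_graph = loss.grad_fn is not None  # the result remembers how it was computed

loss.backward()                   # fills x.grad with d(loss)/dx = 2 * x
print(x.grad)                     # tensor([2., 4., 6.])
```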

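Sometimes we do not want PyTorch to build a computational graph. For example, after backpropagation the weight update itself should not be recorded in the graph; wrapping it in torch.no_grad() prevents the graph from being built.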
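The effect of torch.no_grad() can be observed directly. In the sketch below (illustrative values, not from the original notebook), the same multiplication is tracked outside the context and untracked inside it, and an in-place update of a leaf Tensor that requires gradients only succeeds inside the context.

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = w * 2              # recorded in the graph
with torch.no_grad():
    untracked = w * 2        # not recorded

print(tracked.requires_grad, untracked.requires_grad)  # True False

# Outside no_grad, an in-place update of a leaf that requires grad is rejected
try:
    w -= 0.1
    leaf_inplace_allowed = True
except RuntimeError:
    leaf_inplace_allowed = False

# Inside no_grad the same update is fine, which is why the training loops use it
with torch.no_grad():
    w -= 0.1
```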

With PyTorch Tensors and autograd, we no longer need to implement the backward pass of the network by hand:

```python
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights; setting requires_grad=True means that we
# want to compute gradients for these Tensors during the backward pass.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(10):
    # Forward pass: compute predicted y using operations on Tensors. Since w1 and
    # w2 have requires_grad=True, operations involving these Tensors will cause
    # PyTorch to build a computational graph, allowing automatic computation of
    # gradients. Since we are no longer implementing the backward pass by hand we
    # don't need to keep references to intermediate values.
    y_pred = x.mm(w1).clamp(min=0).mm(w2)

    # Compute and print loss. Loss is a Tensor of shape (), and loss.item()
    # is a Python number giving its value.
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass. This call will compute the
    # gradient of loss with respect to all Tensors with requires_grad=True.
    # After this call w1.grad and w2.grad will be Tensors holding the gradient
    # of the loss with respect to w1 and w2 respectively.
    loss.backward()

    # Update weights using gradient descent. For this step we just want to mutate
    # the values of w1 and w2 in-place; we don't want to build up a computational
    # graph for the update steps, so we use the torch.no_grad() context manager
    # to prevent PyTorch from building a computational graph for the updates
    with torch.no_grad():
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()
```

```
0 26639924.0
1 20806696.0
2 20385932.0
3 22464084.0
4 24744818.0
5 24688410.0
6 21053018.0
7 14962570.0
8 9134588.0
9 5084152.0
```


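Let's look at what loss.backward() actually does. In one iteration of the loop, the graph is built from x, through w1 and w2, to y_pred, and then together with y to loss. Along the way w1, w2, y_pred and loss all require gradients. Because this is backpropagation, the w1.grad computed from the graph is $\frac{\partial{loss}}{\partial{w1}}$; by default only leaf Tensors such as w1 and w2 keep their .grad after the call (for a refresher on computational graphs, see 2.1-60min-pytorch).

Next, note the with statement at the end. with sets up a context for the enclosed operations: the weight update should not enter the computational graph, so it runs inside torch.no_grad(). Finally, the gradients of w1 and w2 are zeroed by hand so that the next iteration starts fresh, because backward() accumulates into .grad instead of overwriting it. For why gradients are not cleared by default, see 2.1-60min-pytorch.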
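The accumulation behavior is easy to see in isolation. This sketch (toy values, not from the original notebook) calls backward() twice without zeroing, then once more after zeroing:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()       # tensor([3., 3.])

(w * 3).sum().backward()     # no zeroing in between
second = w.grad.clone()      # tensor([6., 6.]): the new gradient was added on top

w.grad.zero_()
(w * 3).sum().backward()
third = w.grad.clone()       # tensor([3., 3.]) again

print(first, second, third)
```

Without the zero_() calls in the training loop, every step would therefore use the sum of all gradients computed so far.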

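## Defining New autograd Functions

Under the hood, each primitive **autograd operator** is really **two functions that operate on Tensors**. The **forward** function computes output Tensors from input Tensors. The **backward** function receives the gradient of some scalar value with respect to the output Tensors and computes the gradient of that same scalar with respect to the input Tensors.

In PyTorch, torch.autograd.Function is the class that exposes this forward/backward pattern, and we can easily **define our own autograd operator** by subclassing **torch.autograd.Function** and implementing the **forward and backward functions**. We then use the new operator by calling its apply method **like a function**, passing Tensors that contain the input data.

The next example defines a custom autograd function that performs ReLU and uses it to implement a two-layer network.

As you can see, the class first subclasses torch.autograd.Function.

The overridden forward function takes a Tensor containing the input and returns a Tensor containing the output. ctx is a context object that can be used to stash information for the backward computation. Use ctx.save_for_backward to cache Tensors needed in the backward pass; non-Tensor values can be stored as plain attributes on ctx. clamp limits values to a range, and with min=0 it is exactly ReLU. So the forward pass computes ReLU in its last step, after saving the input for backward.

In the backward pass we receive a Tensor containing the gradient of the loss with respect to the **output**, and we need to compute the gradient of the loss with respect to the **input**. Note first that backward only computes gradients; it does not update any weights! How the ReLU gradient is computed was covered in the numpy refresher above. Here grad_output is $\frac{\partial{loss}}{\partial{output}}$, so by the chain rule $\frac{\partial{loss}}{\partial{input}}=\frac{\partial{loss}}{\partial{output}} \frac{\partial{output}}{\partial{input}}$, where $\frac{\partial{output}}{\partial{input}}$ is the derivative of ReLU: 1 where the input is positive and 0 where it is negative. The code therefore clones grad_output into grad_x and sets grad_x[x < 0] = 0.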

```python
# Code in file autograd/two_layer_net_custom_function.py
import torch

class MyReLU(torch.autograd.Function):
    """
    We can implement our own custom autograd Functions by subclassing
    torch.autograd.Function and implementing the forward and backward passes
    which operate on Tensors.
    """
    @staticmethod
    def forward(ctx, x):
        """
        In the forward pass we receive a context object and a Tensor containing the
        input; we must return a Tensor containing the output, and we can use the
        context object to cache objects for use in the backward pass.
        """
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive the context object and a Tensor containing
        the gradient of the loss with respect to the output produced during the
        forward pass. We can retrieve cached data from the context object, and must
        compute and return the gradient of the loss with respect to the input to the
        forward function.
        """
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x


device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold input and output
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Create random Tensors for weights.
w1 = torch.randn(D_in, H, device=device, requires_grad=True)
w2 = torch.randn(H, D_out, device=device, requires_grad=True)

learning_rate = 1e-6
for t in range(10):
    # Forward pass: compute predicted y using operations on Tensors; we call our
    # custom ReLU implementation using the MyReLU.apply function
    y_pred = MyReLU.apply(x.mm(w1)).mm(w2)

    # Compute and print loss
    loss = (y_pred - y).pow(2).sum()
    print(t, loss.item())

    # Use autograd to compute the backward pass.
    loss.backward()

    with torch.no_grad():
        # Update weights using gradient descent
        w1 -= learning_rate * w1.grad
        w2 -= learning_rate * w2.grad

        # Manually zero the gradients after running the backward pass
        w1.grad.zero_()
        w2.grad.zero_()
```

```
0 29388140.0
1 23344078.0
2 20084176.0
3 17090222.0
4 13747746.0
5 10321166.0
6 7316202.5
7 5010261.0
8 3402158.0
9 2342496.5
```


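A brief analysis of the code above. MyReLU.apply inserts the function into the forward computation. The Tensors in the graph that require gradients are w1, w2, y_pred and loss, the same as before. Note that MyReLU defines both the forward and the backward computation.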
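One way to gain confidence in a hand-written backward is torch.autograd.gradcheck, which compares it against finite differences. The sketch below repeats a compact copy of MyReLU so that it runs on its own; the input shape, the seed, and the step that pushes values away from the kink at 0 are assumptions made for this check only.

```python
import torch

class MyReLU(torch.autograd.Function):
    """Compact copy of the custom ReLU from the example above."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        grad_x = grad_output.clone()
        grad_x[x < 0] = 0
        return grad_x

torch.manual_seed(0)
# gradcheck expects double precision; ReLU is not differentiable at 0,
# so entries close to 0 are replaced by 0.5 for this check
x = torch.randn(4, 5, dtype=torch.double)
x = torch.where(x.abs() < 0.1, torch.full_like(x, 0.5), x).requires_grad_()

passed = torch.autograd.gradcheck(MyReLU.apply, (x,), eps=1e-6, atol=1e-4)

# The custom gradient should also agree with the built-in ReLU
g_custom, = torch.autograd.grad(MyReLU.apply(x).sum(), x)
g_builtin, = torch.autograd.grad(torch.relu(x).sum(), x)
print(passed, torch.equal(g_custom, g_builtin))
```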

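To understand the meaning of backward more clearly, here is some supplementary material based on: [A discussion of the backward() functions of nn.Module and nn.autograd.Function in PyTorch](https://cloud.tencent.com/developer/article/1149993).

backward() shows up all the time in PyTorch; we usually call it once the loss has been computed, as in loss.backward(), which computes gradients automatically from the output back to the inputs. Tracing it to its source, this backward comes from torch.autograd.backward. We may rarely call that function directly, so where does it appear? In the Tensor class (loss is a Tensor): **the Tensor class has a backward() method**, and that method **simply calls torch.autograd.backward()**. In other words, during training we feed in data, run it through a series of network operations, compute the loss, and call loss.backward(); that backward() ultimately is the function above.

Next, consider two backward functions: backward() in nn.Module and backward() in torch.autograd.Function. One of them is in fact a fake backward().

When defining a brand-new network layer we **subclass nn.Module**, but we only need to implement __init__ and forward(); there is **no need to implement backward()**. Even if you subclass nn.Module and write a backward() function, it will not be executed at run time. That is the fake backward: autograd never calls it. The goal is to backpropagate from the loss, and the operations performed inside forward already go through autograd functions. In other words, **every operation on Tensors records a function node that defines both forward and backward behavior**, and these nodes are linked into the **computational graph**.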
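The "fake backward" claim can be checked with a few lines. In this sketch, `Scale` is a hypothetical module invented for the demonstration; its backward method sets a flag that would reveal whether autograd ever called it.

```python
import torch

class Scale(torch.nn.Module):
    """Hypothetical one-parameter layer used only for this demonstration."""
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.tensor(2.0))
        self.backward_called = False

    def forward(self, x):
        return x * self.weight

    def backward(self, grad_output):
        # autograd never calls this method
        self.backward_called = True
        return grad_output

layer = Scale()
x = torch.tensor([1.0, 2.0, 3.0])
layer(x).sum().backward()

print(layer.backward_called)  # False
print(layer.weight.grad)      # tensor(6.), computed by autograd from forward alone
```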

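So defining backward in nn.Module is pointless. Can we define our own backward function at all?

Yes, by **subclassing torch.autograd.Function**! The official tutorial, [PyTorch: Defining new autograd functions](https://github.com/jcjohnson/pytorch-examples#pytorch-defining-new-autograd-functions), is the example above, where MyReLU subclasses torch.autograd.Function. So when do we need to define our own?

The nn.Module classes we normally use are essentially a wrapper (container). For example, nn.Conv2d subclasses nn.Module, but its core computation is torch.nn.functional.conv2d. Why wrap it? Simply **for convenience**: a convolution layer has parameters, and those parameters are learnable. The wrapper class wraps them with the Parameter class from torch.nn.parameter and passes them to the functions in torch.nn.functional, which simplifies our code.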
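The wrapper relationship can be seen by calling the functional form with the module's own parameters. This is a sketch with an arbitrary layer configuration and input shape; it relies on the module using its default stride and padding, which match the defaults of the functional call.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
layer = torch.nn.Conv2d(in_channels=1, out_channels=2, kernel_size=3)
x = torch.randn(1, 1, 5, 5)

out_module = layer(x)                                   # the Module wrapper
out_functional = F.conv2d(x, layer.weight, layer.bias)  # the underlying function

weight_is_parameter = isinstance(layer.weight, torch.nn.Parameter)
print(torch.allclose(out_module, out_functional), weight_is_parameter)
```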

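So when do we need torch.autograd.Function to define our own layer? When **an operation cannot be built by composing what PyTorch already provides**, for example a new operation whose gradient you have to derive by hand. Be aware that this touches the low level: you must write forward and backward together and handle the intermediate values yourself, such as grad_input and grad_output. For example: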

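```python
import torch
from torch import nn
from torch.autograd import Function

class my_function(Function):
    # Stand-in operation for illustration only: output = input * weight,
    # with input of shape (N, F) and weight of shape (F,).
    @staticmethod
    def forward(ctx, input, weight):
        ctx.save_for_backward(input, weight)
        return input * weight

    @staticmethod
    def backward(ctx, grad_output):
        input, weight = ctx.saved_tensors
        # Chain rule: return one gradient per argument of forward
        grad_input = grad_output * weight
        grad_weight = (grad_output * input).sum(dim=0)
        return grad_input, grad_weight

# Then wrap it in a Module

class my_module(nn.Module):
    def __init__(self, num_features):
        super(my_module, self).__init__()
        # Initialize the learnable parameters
        self.weight = nn.Parameter(torch.ones(num_features))

    def forward(self, input):
        # Run the Function defined above
        output = my_function.apply(input, self.weight)
        return output
```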

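This way you can define a custom operation, wrap it, and use it as a layer. Whether or not you wrap it has a negligible effect on execution speed.

Now some more about ctx, with reference to: [About the value of CTX](https://discuss.pytorch.org/t/about-the-value-of-ctx/16821), [Difference between 'ctx' and 'self' in python?](https://stackoverflow.com/questions/49516188/difference-between-ctx-and-self-in-python).

ctx stands for context, here the context in which the Function runs. If forward stores nothing on it, ctx is essentially empty.

The same ctx is passed to backward, so you can **use it to store things for use in backward**. It is somewhat similar to the self parameter of a Python class. The difference is that the forward you override in torch.nn.Module is an instance method, whereas the forward of an autograd Function is a staticmethod.

A static method (@staticmethod) is called on the class directly, not on an instance of the class, for example: LinearFunction.backward(x, y)

Since there is no instance, self would be meaningless in a static method, whereas **ctx is an ordinary argument that can be passed in when the static method is called**.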
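To make the role of ctx concrete, here is a sketch of a Function that multiplies by a plain Python constant. `MulConstant` is a name chosen for this illustration; the constant is not a Tensor, so it is stored as an attribute on ctx instead of going through save_for_backward.

```python
import torch

class MulConstant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, constant):
        ctx.constant = constant   # non-Tensor value: keep it as a ctx attribute
        return x * constant

    @staticmethod
    def backward(ctx, grad_output):
        # One return value per forward argument; the constant gets no gradient
        return grad_output * ctx.constant, None

x = torch.tensor([1.0, 2.0], requires_grad=True)
MulConstant.apply(x, 3.0).sum().backward()
print(x.grad)  # tensor([3., 3.])
```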

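A few other functions are also special, such as mark_dirty(); see [Understand mark_dirty()](https://discuss.pytorch.org/t/understand-mark-dirty/303).

First, a note on in-place operations in PyTorch; see [Memory efficient pytorch](https://www.slideshare.net/HyungjooCho2/memory-efficient-pytorch). An in-place operation overwrites its input with the output, for example by updating the input's elements directly.

For example, a = a.add(1) is out-of-place, whereas a.add_(1) is in-place, and so is a += 1. Methods with a trailing underscore operate in place.

If a Tensor is modified in place while its original values are still needed, backward may fail. For example, take $y = x^2$ and $z = x^2$: if the second one is computed in place as z = x.pow_(2) on the Tensor x, the backward of the first can no longer be computed.

Every Tensor carries an internal version counter to track such modifications, and mark_dirty makes sure this counter is updated correctly for Tensors that a custom Function modifies in place. If an operation would make backward compute the wrong result, an error is raised. "dirty" here means much the same as dirty data in a database.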
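The failure mode described above can be reproduced in a few lines. This is a sketch with toy values; `_version` is a private attribute that exposes the version counter and is read here only for illustration, so it should not be relied on in real code.

```python
import torch

a = torch.ones(3, requires_grad=True)
x = a * 2               # non-leaf Tensor, so in-place edits are permitted
y = (x ** 2).sum()      # the backward of pow needs the original values of x

version_before = x._version
x.pow_(2)               # overwrites x in place and bumps its version counter
version_after = x._version

try:
    y.backward()
    raised = False
except RuntimeError as err:
    raised = True
    print('backward failed:', str(err)[:60])

print(version_before, version_after, raised)
```

Autograd notices that the saved Tensor's version no longer matches the one recorded during the forward pass and refuses to compute a silently wrong gradient.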

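## TensorFlow: Static Graphs

PyTorch and TensorFlow are much alike: both define computational graphs and use automatic differentiation to compute gradients. The biggest difference is that PyTorch builds dynamic computational graphs.

In TensorFlow, as used in the graph-mode code below, the computational graph is defined once and then executed over and over, whereas in PyTorch each forward pass defines a new graph. Static graphs are nice because the graph can be optimized up front, for example by devising a strategy for distributing the graph across GPUs. If the same graph is reused many times, the cost of this potentially expensive up-front optimization is amortized across the runs.

One aspect where static and dynamic graphs differ is control flow. For some models we may want to perform a different computation for each data point; for example, a recurrent network might be unrolled for a different number of time steps per data point, and this unrolling can be implemented as a loop. With a static graph the loop construct has to be part of the graph, which is why TensorFlow provides operators such as tf.scan for embedding loops into the graph. With dynamic graphs the situation is simpler: because the graph is built on the fly for each example, we can use ordinary imperative control flow to perform computation that differs per input. This is how TensorFlow builds its graph: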

```python
# Written against the TensorFlow 1.x graph API (tf.placeholder, tf.Session);
# under TensorFlow 2.x these calls live in tf.compat.v1.
import tensorflow as tf
import numpy as np

# First we set up the computational graph:

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create placeholders for the input and target data; these will be filled
# with real data when we execute the graph.
x = tf.placeholder(tf.float32, shape=(None, D_in))
y = tf.placeholder(tf.float32, shape=(None, D_out))

# Create Variables for the weights and initialize them with random data.
# A TensorFlow Variable persists its value across executions of the graph.
w1 = tf.Variable(tf.random_normal((D_in, H)))
w2 = tf.Variable(tf.random_normal((H, D_out)))

# Forward pass: Compute the predicted y using operations on TensorFlow Tensors.
# Note that this code does not actually perform any numeric operations; it
# merely sets up the computational graph that we will later execute.
h = tf.matmul(x, w1)
h_relu = tf.maximum(h, tf.zeros(1))
y_pred = tf.matmul(h_relu, w2)

# Compute loss using operations on TensorFlow Tensors
loss = tf.reduce_sum((y - y_pred) ** 2.0)

# Compute gradient of the loss with respect to w1 and w2.
grad_w1, grad_w2 = tf.gradients(loss, [w1, w2])

# Update the weights using gradient descent. To actually update the weights
# we need to evaluate new_w1 and new_w2 when executing the graph. Note that
# in TensorFlow the act of updating the value of the weights is part of
# the computational graph; in PyTorch this happens outside the computational
# graph.
learning_rate = 1e-6
new_w1 = w1.assign(w1 - learning_rate * grad_w1)
new_w2 = w2.assign(w2 - learning_rate * grad_w2)

# Now we have built our computational graph, so we enter a TensorFlow session to
# actually execute the graph.
with tf.Session() as sess:
    # Run the graph once to initialize the Variables w1 and w2.
    sess.run(tf.global_variables_initializer())

    # Create numpy arrays holding the actual data for the inputs x and targets y
    x_value = np.random.randn(N, D_in)
    y_value = np.random.randn(N, D_out)
    for _ in range(10):
        # Execute the graph many times. Each time it executes we want to bind
        # x_value to x and y_value to y, specified with the feed_dict argument.
        # Each time we execute the graph we want to compute the values for loss,
        # new_w1, and new_w2; the values of these Tensors are returned as numpy
        # arrays.
        loss_value, _, _ = sess.run([loss, new_w1, new_w2],
                                    feed_dict={x: x_value, y: y_value})
        print(loss_value)
```

## PyTorch: nn

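Computational graphs and autograd are a powerful paradigm for defining complex operators and taking derivatives automatically, but for large networks raw autograd is still a bit too low-level.

When building neural networks we usually arrange the computation into layers, some of which have learnable parameters that are optimized during training.

In TensorFlow, packages such as Keras, TensorFlow-Slim, and TFLearn provide higher-level abstractions over raw computational graphs that are useful for building neural networks.

In PyTorch, the nn package serves the same purpose. It defines a set of Modules, roughly equivalent to neural network layers. A Module receives input Tensors and computes output Tensors, but may also hold internal state such as Tensors containing learnable parameters. The nn package also defines a set of loss functions commonly used when training neural networks. Here is the same two-layer network again: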

```python
# Code in file nn/two_layer_net_nn.py
import torch

device = torch.device('cpu')
# device = torch.device('cuda') # Uncomment this to run on GPU

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in, device=device)
y = torch.randn(N, D_out, device=device)

# Use the nn package to define our model as a sequence of layers. nn.Sequential
# is a Module which contains other Modules, and applies them in sequence to
# produce its output. Each Linear Module computes output from input using a
# linear function, and holds internal Tensors for its weight and bias.
# After constructing the model we use the .to() method to move it to the
# desired device.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        ).to(device)

# The nn package also contains definitions of popular loss functions; in this
# case we will use Mean Squared Error (MSE) as our loss function. Setting
# reduction='sum' means that we are computing the *sum* of squared errors rather
# than the mean; this is for consistency with the examples above where we
# manually compute the loss, but in practice it is more common to use mean
# squared error as a loss by setting reduction='mean'.
loss_fn = torch.nn.MSELoss(reduction='sum')

learning_rate = 1e-4
for t in range(10):
    # Forward pass: compute predicted y by passing x to the model. Module objects
    # override the __call__ operator so you can call them like functions. When
    # doing so you pass a Tensor of input data to the Module and it produces
    # a Tensor of output data.
    y_pred = model(x)

    # Compute and print loss. We pass Tensors containing the predicted and true
    # values of y, and the loss function returns a Tensor containing the loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero the gradients before running the backward pass.
    model.zero_grad()

    # Backward pass: compute gradient of the loss with respect to all the learnable
    # parameters of the model. Internally, the parameters of each Module are stored
    # in Tensors with requires_grad=True, so this call will compute gradients for
    # all learnable parameters in the model.
    loss.backward()

    # Update the weights using gradient descent. Each parameter is a Tensor, so
    # we can update it in place and read its gradient like we did before.
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad
```

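To summarize the code above: it uses layers from the nn package such as Linear and ReLU to build an nn.Sequential object. The model implements __call__(), so it can be called like a function; doing so runs forward(), which is why y_pred = model(x) performs the forward pass. Then loss_fn computes the loss, backward() backpropagates, and the gradient descent update runs inside no_grad.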
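Two of these points can be checked directly: calling the model gives the same result as calling forward(), and the learnable parameters that model.parameters() iterates over can be listed by name. The sketch below rebuilds the same Sequential model so that it runs on its own.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(1000, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 10),
)
x = torch.randn(64, 1000)

# model(x) goes through __call__, which runs forward()
same_output = torch.equal(model(x), model.forward(x))

# The two Linear layers contribute a weight and a bias each; ReLU has no parameters
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
total = sum(p.numel() for p in model.parameters())
print(same_output, shapes, total)
```

In regular code, prefer model(x) over model.forward(x), since __call__ also runs any registered hooks.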

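## PyTorch: optim

Up to this point we have updated the weights by hand. That is no burden for simple optimization algorithms such as stochastic gradient descent, but it becomes one with more sophisticated optimizers such as AdaGrad, RMSProp, and Adam.

The optim package in PyTorch abstracts the idea of an optimization algorithm and provides implementations of the commonly used ones.

Next we use the optim package to optimize the model with Adam.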

```python
# Code in file nn/two_layer_net_optim.py
import torch

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Use the nn package to define our model and loss function.
model = torch.nn.Sequential(
          torch.nn.Linear(D_in, H),
          torch.nn.ReLU(),
          torch.nn.Linear(H, D_out),
        )
loss_fn = torch.nn.MSELoss(reduction='sum')

# Use the optim package to define an Optimizer that will update the weights of
# the model for us. Here we will use Adam; the optim package contains many other
# optimization algorithms. The first argument to the Adam constructor tells the
# optimizer which Tensors it should update.
learning_rate = 1e-4
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for t in range(10):
    # Forward pass: compute predicted y by passing x to the model.
    y_pred = model(x)

    # Compute and print loss.
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Before the backward pass, use the optimizer object to zero all of the
    # gradients for the Tensors it will update (which are the learnable weights
    # of the model)
    optimizer.zero_grad()

    # Backward pass: compute gradient of the loss with respect to model parameters
    loss.backward()

    # Calling the step function on an Optimizer makes an update to its parameters
    optimizer.step()
```

```
0 725.2034912109375
1 707.5624389648438
2 690.359619140625
3 673.596923828125
4 657.2451171875
5 641.3485107421875
6 625.83740234375
7 610.8456420898438
8 596.354248046875
9 582.2810668945312
```


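As you can see, the code creates a torch.optim.Adam object directly, passing in the model parameters and the learning rate. Note that before backward(), we call zero_grad() on the **optimizer** to clear the gradients of the Tensors it manages. The final parameter update also goes through the optimizer, via its step() function.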
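To see what step() replaces, the sketch below compares the manual update from the earlier sections with torch.optim.SGD on a toy problem. Plain SGD is used here instead of Adam because its update rule is the same one-liner as the manual version; the toy loss, learning rate, and seed are assumptions for this illustration.

```python
import torch

torch.manual_seed(0)
lr = 1e-2
target = torch.randn(5)

w_manual = torch.randn(5, requires_grad=True)
w_optim = w_manual.detach().clone().requires_grad_()  # same starting point

optimizer = torch.optim.SGD([w_optim], lr=lr)

for _ in range(3):
    # Manual update, as in the autograd section
    loss = (w_manual - target).pow(2).sum()
    loss.backward()
    with torch.no_grad():
        w_manual -= lr * w_manual.grad
        w_manual.grad.zero_()

    # The same step through the optimizer
    optimizer.zero_grad()
    loss = (w_optim - target).pow(2).sum()
    loss.backward()
    optimizer.step()

weights_match = torch.allclose(w_manual, w_optim)
print(weights_match)
```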

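## PyTorch: Custom nn Modules

When you want a model more complex than a simple sequence of existing Modules, you can define your own Module by subclassing nn.Module and defining a forward function that receives input Tensors and produces output Tensors using other Modules or other autograd operations.

Again we implement a two-layer network: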

```python
# Code in file nn/two_layer_net_module.py
import torch

class TwoLayerNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we instantiate two nn.Linear modules and assign them as
        member variables.
        """
        super(TwoLayerNet, self).__init__()
        self.linear1 = torch.nn.Linear(D_in, H)
        self.linear2 = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        In the forward function we accept a Tensor of input data and we must return
        a Tensor of output data. We can use Modules defined in the constructor as
        well as arbitrary (differentiable) operations on Tensors.
        """
        h_relu = self.linear1(x).clamp(min=0)
        y_pred = self.linear2(h_relu)
        return y_pred

# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above.
model = TwoLayerNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. The call to model.parameters()
# in the SGD constructor will contain the learnable parameters of the two
# nn.Linear modules which are members of the model.
loss_fn = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = loss_fn(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The example shows that in a custom class, __init__ defines the layers, while forward defines the computation between them. The rest is the same as before.

## PyTorch: Control Flow + Weight Sharing

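An example of dynamic graphs and weight sharing: a fully connected ReLU network that, on each forward pass, picks a random number between 0 and 3 and uses that many hidden layers, reusing the same weights multiple times to compute the innermost hidden layers.

For this model we can use ordinary Python control flow to implement the loop, and we can **share weights among the innermost layers simply by reusing the same Module multiple times** when defining the forward pass.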

```python
# Code in file nn/dynamic_net.py
import random
import torch

class DynamicNet(torch.nn.Module):
    def __init__(self, D_in, H, D_out):
        """
        In the constructor we construct three nn.Linear instances that we will use
        in the forward pass.
        """
        super(DynamicNet, self).__init__()
        self.input_linear = torch.nn.Linear(D_in, H)
        self.middle_linear = torch.nn.Linear(H, H)
        self.output_linear = torch.nn.Linear(H, D_out)

    def forward(self, x):
        """
        For the forward pass of the model, we randomly choose either 0, 1, 2, or 3
        and reuse the middle_linear Module that many times to compute hidden layer
        representations.

        Since each forward pass builds a dynamic computation graph, we can use normal
        Python control-flow operators like loops or conditional statements when
        defining the forward pass of the model.

        Here we also see that it is perfectly safe to reuse the same Module many
        times when defining a computational graph. This is a big improvement from Lua
        Torch, where each Module could be used only once.
        """
        h_relu = self.input_linear(x).clamp(min=0)
        for _ in range(random.randint(0, 3)):
            h_relu = self.middle_linear(h_relu).clamp(min=0)
        y_pred = self.output_linear(h_relu)
        return y_pred


# N is batch size; D_in is input dimension;
# H is hidden dimension; D_out is output dimension.
N, D_in, H, D_out = 64, 1000, 100, 10

# Create random Tensors to hold inputs and outputs.
x = torch.randn(N, D_in)
y = torch.randn(N, D_out)

# Construct our model by instantiating the class defined above
model = DynamicNet(D_in, H, D_out)

# Construct our loss function and an Optimizer. Training this strange model with
# vanilla stochastic gradient descent is tough, so we use momentum
criterion = torch.nn.MSELoss(reduction='sum')
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
for t in range(500):
    # Forward pass: Compute predicted y by passing x to the model
    y_pred = model(x)

    # Compute and print loss
    loss = criterion(y_pred, y)
    print(t, loss.item())

    # Zero gradients, perform a backward pass, and update the weights.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

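Note that the only difference from the earlier code is the loop inside forward; reusing the same Module in that loop is what implements weight sharing.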
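A small sketch of what weight sharing means in practice, using a single Linear layer with arbitrary sizes: however many times the layer is applied in the loop, it contributes the same set of parameters, and autograd sums the gradient contributions from every use into that one weight. The helper `run` is introduced only for this illustration.

```python
import torch

torch.manual_seed(0)
shared = torch.nn.Linear(4, 4)
x = torch.randn(2, 4)

def run(times):
    """Apply the shared layer `times` times and return the gradient of its weight."""
    shared.zero_grad()
    h = x
    for _ in range(times):
        h = shared(h).clamp(min=0)
    h.sum().backward()
    return shared.weight.grad.clone()

n_params = sum(p.numel() for p in shared.parameters())  # 4 * 4 weights + 4 biases
g1 = run(1)
g3 = run(3)   # three uses, still one weight matrix and one gradient Tensor

print(n_params, tuple(g1.shape), tuple(g3.shape))
```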