In [None]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [None]:
#hide
from fastbook import *

# Application Architectures Deep Dive


We are now in the exciting position that we can fully understand the architectures that we have been using for our state-of-the-art models for computer vision, natural language processing, and tabular analysis. In this chapter, we're going to fill in all the missing details on how fastai's application models work and show you how to build the models they use.



We will also go back to the custom data preprocessing pipeline we saw in <<chapter_midlevel_data>> for Siamese networks and show you how you can use the components in the fastai library to build custom pretrained models for new tasks.



We'll start with computer vision.


## Computer Vision


For computer vision applications, we use the functions `cnn_learner` and `unet_learner` to build our models, depending on the task. In this section we'll explore how to build the `Learner` objects we used in Parts 1 and 2 of this book.


### cnn_learner


Let's take a look at what happens when we use the `cnn_learner` function. We begin by passing this function an architecture to use for the *body* of the network. Most of the time we use a ResNet, which you already know how to create, so we don't need to delve into that any further. Pretrained weights are downloaded as required and loaded into the ResNet.



Then, for transfer learning, the network needs to be *cut*. This refers to slicing off the final layer, which is only responsible for ImageNet-specific categorization. In fact, we do not slice off only this layer, but everything from the adaptive average pooling layer onwards. The reason for this will become clear in just a moment. Since different architectures might use different types of pooling layers, or even completely different kinds of *heads*, we don't just search for the adaptive pooling layer to decide where to cut the pretrained model. Instead, we have a dictionary of information that is used for each model to determine where its body ends, and its head starts. We call this `model_meta`—here it is for resnet-50:


In [None]:
model_meta[resnet50]

{'cut': -2,
 'split': <function fastai.vision.learner._resnet_split(m)>,
 'stats': ([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])}

> jargon: Body and Head: The "head" of a neural net is the part that is specialized for a particular task. For a CNN, it's generally the part after the adaptive average pooling layer. The "body" is everything else, and includes the "stem" (which we learned about in <<chapter_resnet>>).
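To make the cut concrete, here is a minimal sketch that uses plain strings in place of real layers. The child names follow torchvision's ResNet layout (`conv1` through `fc`). The helper `cut_model` is a hypothetical stand-in written for this illustration; it is not fastai's implementation.

```python
# Top-level children of a torchvision-style ResNet, in order.
# Plain strings stand in for the real modules.
resnet_children = ["conv1", "bn1", "relu", "maxpool",
                   "layer1", "layer2", "layer3", "layer4",
                   "avgpool", "fc"]

def cut_model(children, cut):
    "Hypothetical helper: split `children` into (body, removed head) at index `cut`."
    return children[:cut], children[cut:]

body, old_head = cut_model(resnet_children, cut=-2)
print(body)      # everything up to and including layer4
print(old_head)  # the pooling layer and the ImageNet classifier
```

A negative index counts from the end, so `-2` drops exactly the last two children: the adaptive average pooling layer and the final linear layer.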


If we take all of the layers prior to the cut point of `-2`, we get the part of the model that fastai will keep for transfer learning. Now, we put on our new head. This is created using the function `create_head`:


In [None]:
#hide_output
create_head(20,2)

Sequential(
  (0): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=1)
    (mp): AdaptiveMaxPool2d(output_size=1)
  )
  (1): full: False
  (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (3): Dropout(p=0.25, inplace=False)
  (4): Linear(in_features=20, out_features=512, bias=False)
  (5): ReLU(inplace=True)
  (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (7): Dropout(p=0.5, inplace=False)
  (8): Linear(in_features=512, out_features=2, bias=False)
)

```
Sequential(
  (0): AdaptiveConcatPool2d(
    (ap): AdaptiveAvgPool2d(output_size=1)
    (mp): AdaptiveMaxPool2d(output_size=1)
  )
  (1): Flatten()
  (2): BatchNorm1d(20, eps=1e-05, momentum=0.1, affine=True)
  (3): Dropout(p=0.25, inplace=False)
  (4): Linear(in_features=20, out_features=512, bias=False)
  (5): ReLU(inplace=True)
  (6): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True)
  (7): Dropout(p=0.5, inplace=False)
  (8): Linear(in_features=512, out_features=2, bias=False)
)
```


With this function you can choose how many additional linear layers are added to the end, how much dropout to use after each one, and what kind of pooling to use. By default, fastai will apply both average pooling, and max pooling, and will concatenate the two together (this is the `AdaptiveConcatPool2d` layer). This is not a particularly common approach, but it was developed independently at fastai and other research labs in recent years, and tends to provide some small improvement over using just average pooling.
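The concatenated pooling idea can be sketched in plain Python, with nested lists standing in for tensors. The function name `adaptive_concat_pool` and the order of the two halves are choices made for this sketch, not a description of fastai's source.

```python
def adaptive_concat_pool(activations):
    """Pool each channel's grid down to a max and a mean, then concatenate.

    `activations` is a list of channels, each a 2D list (rows of floats).
    Returns a flat list of length 2 * n_channels.
    """
    maxes, means = [], []
    for grid in activations:
        values = [v for row in grid for v in row]
        maxes.append(max(values))
        means.append(sum(values) / len(values))
    # The order (max first, then mean) is a choice made for this sketch.
    return maxes + means

# Two channels, each a 2x2 grid
acts = [[[1.0, 2.0], [3.0, 4.0]],
        [[0.0, 0.0], [0.0, 8.0]]]
pooled = adaptive_concat_pool(acts)
print(pooled)  # [4.0, 8.0, 2.5, 2.0]
```

Note that the output has twice as many features as the input has channels, which is why any layer that follows concatenated pooling must expect double the width.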



fastai is a bit different from most libraries in that by default it adds two linear layers, rather than one, in the CNN head. The reason for this is that transfer learning can still be useful even, as we have seen, when transferring the pretrained model to very different domains. However, just using a single linear layer is unlikely to be enough in these cases; we have found that using two linear layers can allow transfer learning to be used more quickly and easily, in more situations.


> note: One Last Batchnorm?: One parameter to `create_head` that is worth looking at is `bn_final`. Setting this to `True` will cause a batchnorm layer to be added as your final layer. This can be useful in helping your model scale appropriately for your output activations. We haven't seen this approach published anywhere as yet, but we have found that it works well in practice wherever we have used it.


Let's now take a look at what `unet_learner` did in the segmentation problem we showed in <<chapter_intro>>.


### unet_learner


One of the most interesting architectures in deep learning is the one that we used for segmentation in <<chapter_intro>>. Segmentation is a challenging task, because the output required is really an image, or a pixel grid, containing the predicted label for every pixel. There are other tasks that share a similar basic design, such as increasing the resolution of an image (*super-resolution*), adding color to a black-and-white image (*colorization*), or converting a photo into a synthetic painting (*style transfer*)—these tasks are covered by an [online](https://book.fast.ai/) chapter of this book, so be sure to check it out after you've read this chapter. In each case, we are starting with an image and converting it to some other image of the same dimensions or aspect ratio, but with the pixels altered in some way. We refer to these as *generative vision models*.



The way we do this is to start with the exact same approach to developing a CNN head as we saw in the previous problem. We start with a ResNet, for instance, and cut off the adaptive pooling layer and everything after that. Then we replace those layers with our custom head, which does the generative task.



There was a lot of handwaving in that last sentence! How on earth do we create a CNN head that generates an image? If we start with, say, a 224-pixel input image, then at the end of the ResNet body we will have a 7×7 grid of convolutional activations. How can we convert that into a 224-pixel segmentation mask?
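To see where the 7×7 figure comes from, and how large the gap is that the head has to close, here is a small arithmetic sketch. It assumes the ResNet body halves the grid five times (once in the stem convolution, once in the max pooling layer, and once in each of the last three stages). The helper `downsample_sizes` is hypothetical.

```python
def downsample_sizes(size, n_halvings):
    "Sizes after each stride-2 step, starting from `size`."
    sizes = [size]
    for _ in range(n_halvings):
        size //= 2
        sizes.append(size)
    return sizes

# Assumption: the ResNet body halves the grid five times (an overall factor of 32)
sizes = downsample_sizes(224, 5)
print(sizes)                    # [224, 112, 56, 28, 14, 7]
print(224 * 224, "->", 7 * 7)   # 50176 -> 49
```

Going back up therefore takes five doublings, and each of the 49 grid cells has to account for 1,024 output pixels.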



Naturally, we do this with a neural network! So we need some kind of layer that can increase the grid size in a CNN. One very simple approach to this is to replace every pixel in the 7×7 grid with four pixels in a 2×2 square. Each of those four pixels will have the same value—this is known as *nearest neighbor interpolation*. PyTorch provides a layer that does this for us, so one option is to create a head that contains stride-1 convolutional layers (along with batchnorm and ReLU layers as usual) interspersed with 2×2 nearest neighbor interpolation layers. In fact, you can try this now! See if you can create a custom head designed like this, and try it on the CamVid segmentation task. You should find that you get some reasonable results, although they won't be as good as our <<chapter_intro>> results.
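Nearest neighbor interpolation is simple enough to write out by hand. The sketch below works on nested lists rather than tensors, and `upsample_nearest` is a hypothetical name; on real tensors, PyTorch's `nn.Upsample` with `mode='nearest'` plays this role.

```python
def upsample_nearest(grid, scale=2):
    "Repeat every value `scale` times along both axes."
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(scale)]
        out.extend(list(wide) for _ in range(scale))
    return out

small = [[1, 2],
         [3, 4]]
for row in upsample_nearest(small):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

Because the layer has no parameters, all the learning happens in the stride-1 convolutions placed between the upsampling steps.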



Another approach is to replace the nearest neighbor and convolution combination with a *transposed convolution*, otherwise known as a *stride half convolution*. This is identical to a regular convolution, but first zero padding is inserted between all the pixels in the input. This is easiest to see with a picture—<<transp_conv>> shows a diagram from the excellent [convolutional arithmetic paper](https://arxiv.org/abs/1603.07285) we discussed in <<chapter_convolutions>>, showing a 3×3 transposed convolution applied to a 3×3 image.


<img alt="A transposed convolution" width="815" caption="A transposed convolution (courtesy of Vincent Dumoulin and Francesco Visin)" id="transp_conv" src="images/att_00051.png">


As you see, the result of this is to increase the size of the input. You can try this out now by using fastai's `ConvLayer` class; pass the parameter `transpose=True` to create a transposed convolution, instead of a regular one, in your custom head.
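The "insert zeros, then convolve" description can be checked with a small pure-Python sketch. All helper names here are hypothetical, and the sketch assumes the zero-inserted grid is padded by `kernel_size - 1` on every side, which corresponds to `padding=0` in PyTorch's `ConvTranspose2d`. The padding used in the figure may differ.

```python
def zero_insert(grid, stride=2):
    "Place `stride - 1` zeros between neighbouring pixels of a square grid."
    n = len(grid)
    size = (n - 1) * stride + 1
    out = [[0.0] * size for _ in range(size)]
    for i in range(n):
        for j in range(n):
            out[i * stride][j * stride] = grid[i][j]
    return out

def pad(grid, p):
    "Surround a square grid with `p` rows and columns of zeros."
    size = len(grid) + 2 * p
    out = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(grid):
        for j, v in enumerate(row):
            out[i + p][j + p] = v
    return out

def conv2d(grid, kernel):
    "Plain stride-1 convolution with no padding."
    k, n = len(kernel), len(grid)
    m = n - k + 1
    return [[sum(grid[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(m)] for i in range(m)]

def transposed_conv(grid, kernel, stride=2):
    "Zero-insert, pad by kernel_size - 1, then convolve as usual."
    k = len(kernel)
    return conv2d(pad(zero_insert(grid, stride), k - 1), kernel)

image = [[1.0] * 3 for _ in range(3)]
kernel = [[1.0] * 3 for _ in range(3)]  # symmetric, so kernel flipping is irrelevant here
out = transposed_conv(image, kernel)
print(len(image), "->", len(out))  # 3 -> 7
```

Under these assumptions the output size is `(n - 1) * stride + kernel_size`, so a 3×3 input grows to 7×7.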



Neither of these approaches, however, works really well. The problem is that our 7×7 grid simply doesn't have enough information to create a 224×224-pixel output. It's asking an awful lot of the activations of each of those grid cells to have enough information to fully regenerate every pixel in the output. The solution to this problem is to use *skip connections*, like in a ResNet, but skipping from the activations in the body of the ResNet all the way over to the activations of the transposed convolution on the opposite side of the architecture. This approach, illustrated in <<unet>>, was developed by Olaf Ronneberger, Philipp Fischer, and Thomas Brox in the 2015 paper ["U-Net: Convolutional Networks for Biomedical Image Segmentation"](https://arxiv.org/abs/1505.04597). Although the paper focused on medical applications, the U-Net has revolutionized all kinds of generative vision models.


<img alt="The U-Net architecture" width="630" caption="The U-Net architecture (courtesy of Olaf Ronneberger, Philipp Fischer, and Thomas Brox)" id="unet" src="images/att_00052.png">


This picture shows the CNN body on the left (in this case, it's a regular CNN, not a ResNet, and they're using 2×2 max pooling instead of stride-2 convolutions, since this paper was written before ResNets came along) and the transposed convolutional ("up-conv") layers on the right. Then extra skip connections are shown as gray arrows crossing from left to right (these are sometimes called *cross connections*). You can see why it's called a "U-Net!"



With this architecture, the input to the transposed convolutions is not just the lower-resolution grid in the preceding layer, but also the higher-resolution grid in the ResNet body. This allows the U-Net to use all of the information of the original image, as it is needed. One challenge with U-Nets is that the exact architecture depends on the image size. fastai has a unique `DynamicUnet` class that autogenerates an architecture of the right size based on the data provided.
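The bookkeeping behind one cross connection can be sketched with shapes alone. This assumes the upsampled grid and the saved body activation are concatenated along the channel axis, as in the original U-Net paper. The function name and the channel counts are hypothetical, chosen only for illustration.

```python
def upsample_and_concat(low_res_shape, skip_shape, up_channels):
    """Shape produced by one decoder step, as (channels, height, width).

    The low-resolution grid is upsampled 2x to `up_channels` channels, then
    concatenated channel-wise with the same-sized activation saved from the body.
    """
    _, h, w = low_res_shape
    skip_c, skip_h, skip_w = skip_shape
    assert (2 * h, 2 * w) == (skip_h, skip_w), "grids must match to concatenate"
    return (up_channels + skip_c, skip_h, skip_w)

# Hypothetical channel counts, chosen only for illustration
shape = upsample_and_concat((512, 7, 7), (256, 14, 14), up_channels=256)
print(shape)  # (512, 14, 14)
```

The assertion is the reason the architecture depends on the image size: every decoder step has to be paired with a body activation of exactly the matching grid size.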



Let's focus now on an example where we leverage the fastai library to write a custom model.


### A Siamese Network


In [None]:
#hide
from fastai.vision.all import *
path = untar_data(URLs.PETS)
files = get_image_files(path/"images")

class SiameseImage(Tuple):
    def show(self, ctx=None, **kwargs): 
        img1,img2,same_breed = self
        if not isinstance(img1, Tensor):
            if img2.size != img1.size: img2 = img2.resize(img1.size)
            t1,t2 = tensor(img1),tensor(img2)
            t1,t2 = t1.permute(2,0,1),t2.permute(2,0,1)
        else: t1,t2 = img1,img2
        line = t1.new_zeros(t1.shape[0], t1.shape[1], 10)
        return show_image(torch.cat([t1,line,t2], dim=2), 
                          title=same_breed, ctx=ctx)
    
def label_func(fname):
    return re.match(r'^(.*)_\d+\.jpg$', fname.name).groups()[0]

class SiameseTransform(Transform):
    def __init__(self, files, label_func, splits):
        self.labels = files.map(label_func).unique()
        self.lbl2files = {l: L(f for f in files if label_func(f) == l) for l in self.labels}
        self.label_func = label_func
        self.valid = {f: self._draw(f) for f in files[splits[1]]}
        
    def encodes(self, f):
        f2,t = self.valid.get(f, self._draw(f))
        img1,img2 = PILImage.create(f),PILImage.create(f2)
        return SiameseImage(img1, img2, t)
    
    def _draw(self, f):
        same = random.random() < 0.5
        cls = self.label_func(f)
        if not same: cls = random.choice(L(l for l in self.labels if l != cls)) 
        return random.choice(self.lbl2files[cls]),same
    
splits = RandomSplitter()(files)
tfm = SiameseTransform(files, label_func, splits)
tls = TfmdLists(files, tfm, splits=splits)
dls = tls.dataloaders(after_item=[Resize(224), ToTensor], 
    after_batch=[IntToFloatTensor, Normalize.from_stats(*imagenet_stats)])

Let's go back to the input pipeline we set up in <<chapter_midlevel_data>> for a Siamese network. If you remember, it consisted of a pair of images with the label being `True` or `False`, depending on whether they were in the same class or not.



Using what we just saw, let's build a custom model for this task and train it. How? We will use a pretrained architecture and pass our two images through it. Then we can concatenate the results and send them to a custom head that will return two predictions. In terms of modules, this looks like this:


In [None]:
class SiameseModel(Module):
    def __init__(self, encoder, head):
        self.encoder,self.head = encoder,head
    
    def forward(self, x1, x2):
        ftrs = torch.cat([self.encoder(x1), self.encoder(x2)], dim=1)
        return self.head(ftrs)

To create our encoder, we just need to take a pretrained model and cut it, as we explained before. The function `create_body` does that for us; we just have to pass it the place where we want to cut. As we saw earlier, per the dictionary of metadata for pretrained models, the cut value for a resnet is `-2`:


In [None]:
encoder = create_body(resnet34, cut=-2)

Then we can create our head. A look at the encoder tells us the last layer has 512 features, so this head will need to receive `512*4`. Why 4? First we have to multiply by 2 because we have two images. Then we need a second multiplication by 2 because of our concat-pool trick. So we create the head as follows:


In [None]:
head = create_head(512*4, 2, ps=0.5)
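The factor of 4 is easy to double-check with plain arithmetic. The variable names below are introduced only for this sketch.

```python
encoder_features = 512   # channels in the last layer of the resnet34 body
n_images = 2             # the two encodings are concatenated along the channel axis
concat_pool_factor = 2   # average pool and max pool are concatenated in the head

head_inputs = encoder_features * n_images * concat_pool_factor
print(head_inputs)  # 2048
```

If you swap in an encoder whose last layer has a different number of channels, only `encoder_features` changes; the other two factors come from the Siamese design and the pooling trick.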

With our encoder and head, we can now build our model:


In [None]:
model = SiameseModel(encoder, head)

Before using `Learner`, we have two more things to define. First, we must define the loss function we want to use. It's regular cross-entropy, but since our targets are Booleans, we need to convert them to integers or PyTorch will throw an error:


In [None]:
def loss_func(out, targ):
    return nn.CrossEntropyLoss()(out, targ.long())

More importantly, to take full advantage of transfer learning, we have to define a custom *splitter*. A splitter is a function that tells the fastai library how to split the model into parameter groups. These are used behind the scenes to train only the head of a model when we do transfer learning. 



Here we want two parameter groups: one for the encoder and one for the head. We can thus define the following splitter (`params` is just a function that returns all parameters of a given module):


In [None]:
def siamese_splitter(model):
    return [params(model.encoder), params(model.head)]

Then we can define our `Learner` by passing the data, model, loss function, splitter, and any metric we want. Since we are not using a convenience function from fastai for transfer learning (like `cnn_learner`), we have to call `learn.freeze` manually. This will make sure only the last parameter group (in this case, the head) is trained:


In [None]:
learn = Learner(dls, model, loss_func=loss_func, 
                splitter=siamese_splitter, metrics=accuracy)
learn.freeze()
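To illustrate what parameter groups and freezing mean, here is a toy sketch that does not use fastai at all. `Param` and `freeze_to` are hypothetical stand-ins, and fastai's real implementation handles details that this sketch ignores.

```python
class Param:
    "Toy stand-in for a trainable tensor."
    def __init__(self, name):
        self.name, self.requires_grad = name, True

def freeze_to(groups, n):
    "Freeze the parameter groups before index `n`; leave the rest trainable."
    for group in groups[:n]:
        for p in group:
            p.requires_grad = False
    for group in groups[n:]:
        for p in group:
            p.requires_grad = True

# Two parameter groups, in the order a splitter would return them
encoder_params = [Param("encoder.0.weight"), Param("encoder.1.weight")]
head_params = [Param("head.4.weight"), Param("head.8.weight")]
groups = [encoder_params, head_params]

freeze_to(groups, -1)  # freeze everything except the last group
print([p.name for g in groups for p in g if p.requires_grad])
# ['head.4.weight', 'head.8.weight']
```

The order of the groups returned by the splitter matters: the last group is the one that stays trainable, so the head has to come last.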

Then we can directly train our model with the usual methods:


In [None]:
learn.fit_one_cycle(4, 3e-3)

epoch,train_loss,valid_loss,accuracy,time
0,0.367015,0.281242,0.885656,00:26
1,0.307688,0.214721,0.915426,00:26
2,0.275221,0.170615,0.936401,00:26
3,0.223771,0.159633,0.943843,00:26


Then we unfreeze and fine-tune the whole model a bit more with discriminative learning rates (that is, a lower learning rate for the body and a higher one for the head):


In [None]:
learn.unfreeze()
learn.fit_one_cycle(4, slice(1e-6,1e-4))

epoch,train_loss,valid_loss,accuracy,time
0,0.212744,0.159033,0.94452,00:35
1,0.201893,0.159615,0.94249,00:35
2,0.204606,0.152338,0.945196,00:36
3,0.213203,0.148346,0.947903,00:36


An accuracy of 94.8% is very good when we remember that a classifier trained the same way (with no data augmentation) had an error rate of 7%.
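As a rough sketch of how a range such as `slice(1e-6,1e-4)` can turn into one learning rate per parameter group, the function below spaces the rates geometrically between the two ends. Both the name `spread_lrs` and the geometric spacing are assumptions of this sketch rather than a statement about fastai's internals.

```python
def spread_lrs(start, stop, n_groups):
    "Geometrically spaced learning rates from `start` (first group) to `stop` (last group)."
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

print(spread_lrs(1e-6, 1e-4, 2))  # one rate for the encoder, one for the head
print(spread_lrs(1e-6, 1e-4, 3))  # a third group would land in between
```

With the two groups defined by our splitter, the encoder gets the low end of the range and the head gets the high end.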


Now that we've seen how to create complete state-of-the-art computer vision models, let's move on to NLP.


## Natural Language Processing


Converting an AWD-LSTM language model into a transfer learning classifier, as we did in <<chapter_nlp>>, follows a very similar process to what we did with `cnn_learner` in the first section of this chapter. We do not need a "meta" dictionary in this case, because we do not have such a variety of architectures to support in the body. All we need to do is select the stacked RNN for the encoder in the language model, which is a single PyTorch module. This encoder will provide an activation for every word of the input, because a language model needs to output a prediction for every next word.



To create a classifier from this we use an approach described in the [ULMFiT paper](https://arxiv.org/abs/1801.06146) as "BPTT for Text Classification (BPT3C)":


> : We divide the document into fixed-length batches of size *b*. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences.


In other words, the classifier contains a `for` loop, which loops over each batch of a sequence. The state is maintained across batches, and the activations of each batch are stored. At the end, we use the same average and max concatenated pooling trick that we use for computer vision models—but this time, we do not pool over CNN grid cells, but over RNN sequences.
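The loop described above can be sketched with a toy recurrent update in place of the real AWD-LSTM. Everything here is a simplified assumption: the hidden state is a single number, `rnn_step` is a made-up update rule, and the pooled features are just a max and a mean.

```python
def rnn_step(h, x):
    "Toy recurrent update standing in for the stacked RNN."
    return 0.5 * h + x

def encode(chunk, h):
    "Run one chunk through the toy RNN; return per-token outputs and the final state."
    outputs = []
    for x in chunk:
        h = rnn_step(h, x)
        outputs.append(h)
    return outputs, h

def classify_features(tokens, chunk_len):
    "Loop over chunks, carry the state across them, then max- and mean-pool."
    h, stored = 0.0, []
    for start in range(0, len(tokens), chunk_len):
        outputs, h = encode(tokens[start:start + chunk_len], h)
        stored.extend(outputs)  # keep activations for pooling
    return [max(stored), sum(stored) / len(stored)]

doc = [1.0, 0.0, 2.0, 0.0, 0.0, 4.0, 0.0]
print(classify_features(doc, chunk_len=3))
print(classify_features(doc, chunk_len=len(doc)))  # same result: state carries across chunks
```

Because the state is passed from one chunk to the next, splitting the document into chunks does not change the pooled features; it only bounds how much of the sequence is processed at once.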



For this `for` loop we need to gather our data in batches, but each text needs to be treated separately, as they each have their own labels. However, it's very likely that those texts won't all be of the same length, which means we won't be able to put them all in the same array, like we did with the language model.



That's where padding is going to help: when grabbing a bunch of texts, we determine the one with the greatest length, then we fill the ones that are shorter with a special token called `xxpad`. To avoid extreme cases where we have a text with 2,000 tokens in the same batch as a text with 10 tokens (so a lot of padding, and a lot of wasted computation), we alter the randomness by making sure texts of comparable size are put together. The texts will still be in a somewhat random order for the training set (for the validation set we can simply sort them by order of length), but not completely so.



This is done automatically behind the scenes by the fastai library when creating our `DataLoaders`.
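To get a feel for why grouping texts of similar length matters, here is a small pure-Python sketch. The helpers `pad_batch` and `count_padding` are hypothetical, and padding at the end of each text is a simplification made for this illustration. It compares the amount of padding needed in random order with the amount needed when texts are sorted by length.

```python
import random

PAD = "xxpad"

def pad_batch(texts, pad_tok=PAD):
    "Pad every text in the batch to the length of the longest one."
    longest = max(len(t) for t in texts)
    # Padding at the end is a choice made for this sketch
    return [t + [pad_tok] * (longest - len(t)) for t in texts]

def count_padding(texts, bs):
    "Total pad tokens needed when `texts` are batched in the given order."
    total = 0
    for i in range(0, len(texts), bs):
        batch = pad_batch(texts[i:i + bs])
        total += sum(tok == PAD for t in batch for tok in t)
    return total

random.seed(0)
texts = [["tok"] * random.randint(10, 2000) for _ in range(64)]
by_length = sorted(texts, key=len)

print(count_padding(texts, bs=8))      # random order: lots of padding
print(count_padding(by_length, bs=8))  # similar lengths together: far less
```

A fully sorted order is what the validation set can use. For training, the order needs to stay partly random, so the saving will fall somewhere between the two numbers printed here.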


## Tabular


Finally, let's take a look at `fastai.tabular` models. (We don't need to look at collaborative filtering separately, since we've already seen that these models are just tabular models, or use the dot product approach, which we've implemented earlier from scratch.)



Here is the `forward` method for `TabularModel`:



```python
if self.n_emb != 0:
    x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
    x = torch.cat(x, 1)
    x = self.emb_drop(x)
if self.n_cont != 0:
    x_cont = self.bn_cont(x_cont)
    x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
return self.layers(x)
```



We won't show `__init__` here, since it's not that interesting, but we will look at each line of code in `forward` in turn. The first line:


```python
if self.n_emb != 0:
```



is just testing whether there are any embeddings to deal with—we can skip this section if we only have continuous variables. `self.embeds` contains the embedding matrices, so this gets the activations of each:
 
```python
    x = [e(x_cat[:,i]) for i,e in enumerate(self.embeds)]
```



and concatenates them into a single tensor:



```python
    x = torch.cat(x, 1)
```



Then dropout is applied. You can pass `emb_drop` to `__init__` to change this value:



```python
    x = self.emb_drop(x)
```



Now we test whether there are any continuous variables to deal with:



```python
if self.n_cont != 0:
```



They are passed through a batchnorm layer:



```python
    x_cont = self.bn_cont(x_cont)
```



and concatenated with the embedding activations, if there were any:



```python
    x = torch.cat([x, x_cont], 1) if self.n_emb != 0 else x_cont
```



Finally, this is passed through the linear layers (each of which includes batchnorm, if `use_bn` is `True`, and dropout, if `ps` is set to some value or list of values):



```python
return self.layers(x)
```
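The data flow through `forward` can be mimicked for a single row with plain lists. This is a sketch under simplifying assumptions: the embedding matrices are small hand-written tables, dropout and batchnorm are left out, and `tabular_features` is a hypothetical name.

```python
def embed(table, idx):
    "Look up row `idx` of an embedding matrix stored as a list of rows."
    return table[idx]

def tabular_features(x_cat, x_cont, embeds):
    "Concatenate embedding activations with the continuous values for one row."
    x = []
    if embeds:                               # any categorical variables?
        for i, table in enumerate(embeds):
            x += embed(table, x_cat[i])      # one lookup per categorical column
    if x_cont:                               # any continuous variables?
        x += x_cont                          # batchnorm omitted in this sketch
    return x                                 # what the linear layers would receive

# Hypothetical embedding matrices: 3 categories -> 2 dims, 2 categories -> 1 dim
embeds = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
          [[1.0], [2.0]]]
row = tabular_features(x_cat=[2, 0], x_cont=[0.7, -1.2], embeds=embeds)
print(row)  # [0.5, 0.6, 1.0, 0.7, -1.2]
```

The width of the resulting vector is the sum of the embedding sizes plus the number of continuous variables, which is the input size the first linear layer has to accept.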

Congratulations! Now you know every single piece of the architectures used in the fastai library!


## Wrapping Up Architectures


As you can see, the details of deep learning architectures need not scare you now. You can look inside the code of fastai and PyTorch and see just what is going on. More importantly, try to understand *why* it's going on. Take a look at the papers that are being referenced in the code, and try to see how the code matches up to the algorithms that are described.



Now that we have investigated all of the pieces of a model and the data that is passed into it, we can consider what this means for practical deep learning. If you have unlimited data, unlimited memory, and unlimited time, then the advice is easy: train a huge model on all of your data for a really long time. But the reason that deep learning is not straightforward is that your data, memory, and time are typically limited. If you are running out of memory or time, then the solution is to train a smaller model. If you are not able to train for long enough to overfit, then you are not taking advantage of the capacity of your model.



So, step one is to get to the point where you can overfit. Then the question is how to reduce that overfitting. <<reduce_overfit>> shows how we recommend prioritizing the steps from there.


<img alt="Steps to reducing overfitting" width="400" caption="Steps to reducing overfitting" id="reduce_overfit" src="images/att_00047.png">


Many practitioners, when faced with an overfitting model, start at exactly the wrong end of this diagram. Their starting point is to use a smaller model, or more regularization. Using a smaller model should be absolutely the last step you take, unless training your model is taking up too much time or memory. Reducing the size of your model reduces the ability of your model to learn subtle relationships in your data.



Instead, your first step should be to seek to *create more data*. That could involve adding more labels to data that you already have, finding additional tasks that your model could be asked to solve (or, to think of it another way, identifying different kinds of labels that you could model), or creating additional synthetic data by using more or different data augmentation techniques. Thanks to the development of Mixup and similar approaches, effective data augmentation is now available for nearly all kinds of data.



Once you've got as much data as you think you can reasonably get hold of, and are using it as effectively as possible (by taking advantage of all the labels that you can find and doing all the augmentation that makes sense), you may find that you are still overfitting. In that case, you should think about using architectures that generalize better. For instance, adding batch normalization may improve generalization.


If you are still overfitting after doing the best you can at using your data and tuning your architecture, then you can take a look at regularization. Generally speaking, adding dropout to the last layer or two will do a good job of regularizing your model. However, as we learned from the story of the development of AWD-LSTM, adding dropout of different types throughout your model can often help even more. A larger model with more regularization is typically more flexible, and can therefore be more accurate, than a smaller model with less regularization.
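
As a reminder of what dropout does to a layer's activations, here is a minimal sketch of "inverted" dropout in plain Python. The function name `dropout` and its signature are assumptions made for this illustration; in practice you would rely on the dropout layers your framework provides.

```python
import random

def dropout(activations, p=0.5, training=True, rng=random):
    """Zero each activation with probability p and rescale the survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)  # dropout is a no-op at inference time
    keep = 1.0 - p
    # Rescaling keeps the expected value of each activation unchanged.
    return [a / keep if rng.random() >= p else 0.0 for a in activations]

rng = random.Random(0)  # seeded so the sketch is reproducible
out = dropout([1.0] * 1000, p=0.5, rng=rng)
```

With `p=0.5`, roughly half of the activations are zeroed and the rest are doubled, so the layer's average output stays about the same during training as at inference.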


Only after considering all of these options would we recommend that you try using a smaller version of your architecture.
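
The priority order described above can be written down as a simple checklist. The sketch below is only an illustration: the names `OVERFITTING_STEPS` and `next_step` are hypothetical, and listing data collection and data augmentation as separate entries is an assumption made here, since the text discusses both under creating more data.

```python
# Ordered from the first thing to try to the last resort, following the text.
OVERFITTING_STEPS = [
    "create more data (more labels, additional tasks)",
    "use more or different data augmentation",
    "use a more generalizable architecture (e.g. add batch normalization)",
    "add regularization (e.g. dropout)",
    "use a smaller version of the architecture",
]

def next_step(already_tried):
    """Return the highest-priority step not yet tried, or None if all were tried."""
    for step in OVERFITTING_STEPS:
        if step not in already_tried:
            return step
    return None

# Example: after exhausting the two data-related steps, look at the architecture next.
suggestion = next_step(OVERFITTING_STEPS[:2])
```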

## Questionnaire

1. What is the "head" of a neural net?
1. What is the "body" of a neural net?
1. What is "cutting" a neural net? Why do we need to do this for transfer learning?
1. What is `model_meta`? Try printing it to see what's inside.
1. Read the source code for `create_head` and make sure you understand what each line does.
1. Look at the output of `create_head` and make sure you understand why each layer is there, and how the `create_head` source created it.
1. Figure out how to change the dropout, layer size, and number of layers created by `cnn_learner`, and see if you can find values that result in better accuracy from the pet recognizer.
1. What does `AdaptiveConcatPool2d` do?
1. What is "nearest neighbor interpolation"? How can it be used to upsample convolutional activations?
1. What is a "transposed convolution"? What is another name for it?
1. Create a conv layer with `transpose=True` and apply it to an image. Check the output shape.
1. Draw the U-Net architecture.
1. What is "BPTT for Text Classification" (BPT3C)?
1. How do we handle different length sequences in BPT3C?
1. Try to run each line of `TabularModel.forward` separately, one line per cell, in a notebook, and look at the input and output shapes at each step.
1. How is `self.layers` defined in `TabularModel`?
1. What are the five steps for preventing overfitting?
1. Why don't we reduce architecture complexity before trying other approaches to preventing overfitting?

### Further Research

1. Write your own custom head and try training the pet recognizer with it. See if you can get a better result than fastai's default.
1. Try switching between `AdaptiveConcatPool2d` and `AdaptiveAvgPool2d` in a CNN head and see what difference it makes.
1. Write your own custom splitter to create a separate parameter group for every ResNet block, and a separate group for the stem. Try training with it, and see if it improves the pet recognizer.
1. Read the online chapter about generative image models, and create your own colorizer, super-resolution model, or style transfer model.
1. Create a custom head using nearest neighbor interpolation and use it to do segmentation on CamVid.
