In [9]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()



In [10]:
#hide
from fastbook import *
from fastai.vision.widgets import *

# From Model to Production


The six lines of code we saw in <<chapter_intro>> are just one small part of the process of using deep learning in practice. In this chapter, we're going to use a computer vision example to look at the end-to-end process of creating a deep learning application. More specifically, we're going to build a bear classifier! In the process, we'll discuss the capabilities and constraints of deep learning, explore how to create datasets, look at possible gotchas when using deep learning in practice, and more. Many of the key points will apply equally well to other deep learning problems, such as those in <<chapter_intro>>. If you work through a problem similar in key respects to our example problems, we expect you to get excellent results with little code, quickly.



Let's start with how you should frame your problem.


## The Practice of Deep Learning


We've seen that deep learning can solve a lot of challenging problems quickly and with little code. As a beginner, there's a sweet spot of problems that are similar enough to our example problems that you can very quickly get extremely useful results. However, deep learning isn't magic! The same six lines of code won't work for every problem anyone can think of today. Underestimating the constraints and overestimating the capabilities of deep learning may lead to frustratingly poor results, at least until you gain some experience and can solve the problems that arise. Conversely, overestimating the constraints and underestimating the capabilities of deep learning may mean you do not attempt a solvable problem because you talk yourself out of it.

We often talk to people who underestimate both the constraints and the capabilities of deep learning. Both of these can be problems: underestimating the capabilities means that you might not even try things that could be very beneficial, and underestimating the constraints might mean that you fail to consider and react to important issues.



The best thing to do is to keep an open mind. If you remain open to the possibility that deep learning might solve part of your problem with less data or complexity than you expect, then it is possible to design a process where you can find the specific capabilities and constraints related to your particular problem as you work through the process. This doesn't mean making any risky bets — we will show you how you can gradually roll out models so that they don't create significant risks, and can even backtest them prior to putting them in production.


### Starting Your Project


So where should you start your deep learning journey? The most important thing is to ensure that you have some project to work on—it is only through working on your own projects that you will get real experience building and using models. When selecting a project, the most important consideration is data availability. Regardless of whether you are doing a project just for your own learning or for practical application in your organization, you want something where you can get started quickly. We have seen many students, researchers, and industry practitioners waste months or years while they attempt to find their perfect dataset. The goal is not to find the "perfect" dataset or project, but just to get started and iterate from there.



If you take this approach, then you will be on your third iteration of learning and improving while the perfectionists are still in the planning stages!



We also suggest that you iterate from end to end in your project; that is, don't spend months fine-tuning your model, or polishing the perfect GUI, or labelling the perfect dataset… Instead, complete every step as well as you can in a reasonable amount of time, all the way to the end. For instance, if your final goal is an application that runs on a mobile phone, then that should be what you have after each iteration. But perhaps in the early iterations you take some shortcuts, for instance by doing all of the processing on a remote server, and using a simple responsive web application. By completing the project end to end, you will see where the trickiest bits are, and which bits make the biggest difference to the final result.


As you work through this book, we suggest that you complete lots of small experiments, by running and adjusting the notebooks we provide, at the same time that you gradually develop your own projects. That way, you will be getting experience with all of the tools and techniques that we're explaining, as we discuss them.



> s: To make the most of this book, take the time to experiment between each chapter, be it on your own project or by exploring the notebooks we provide. Then try rewriting those notebooks from scratch on a new dataset. It's only by practicing (and failing) a lot that you will get an intuition of how to train a model.  



By using the end-to-end iteration approach you will also get a better understanding of how much data you really need. For instance, you may find you can only easily get 200 labeled data items, and you can't really know until you try whether that's enough to get the performance you need for your application to work well in practice.



In an organizational context you will be able to show your colleagues that your idea can really work by showing them a real working prototype. We have repeatedly observed that this is the secret to getting good organizational buy-in for a project.


Since it is easiest to get started on a project where data is already available, a good place to look is something you are already doing, because you already have data about it. For instance, if you work in the music business, you may have access to many recordings. If you work as a radiologist, you probably have access to lots of medical images. If you are interested in wildlife preservation, you may have access to lots of images of wildlife.

Sometimes, you have to get a bit creative. Maybe you can find some previous machine learning project, such as a Kaggle competition, that is related to your field of interest. Sometimes, you have to compromise. Maybe you can't find the exact data you need for the precise project you have in mind; but you might be able to find something from a similar domain, or measured in a different way, tackling a slightly different problem. Working on these kinds of similar projects will still give you a good understanding of the overall process, and may help you identify other shortcuts, data sources, and so forth.



Especially when you are just starting out with deep learning, it's not a good idea to branch out into very different areas, to places that deep learning has not been applied to before. That's because if your model does not work at first, you will not know whether it is because you have made a mistake, or if the very problem you are trying to solve is simply not solvable with deep learning. And you won't know where to look to get help. Therefore, it is best at first to start with something where you can find an example online where somebody has had good results with something that is at least somewhat similar to what you are trying to achieve, or where you can convert your data into a format similar to what someone else has used before (such as creating an image from your data). Let's have a look at the state of deep learning, just so you know what kinds of things deep learning is good at right now.


### The State of Deep Learning


Let's start by considering whether deep learning can be any good at the problem you are looking to work on. This section provides a summary of the state of deep learning at the start of 2020. However, things move very fast, and by the time you read this some of these constraints may no longer exist. We will try to keep the [book's website](https://book.fast.ai/) up-to-date; in addition, a Google search for "what can AI do now" is likely to provide current information.


#### Computer vision


There are many domains in which deep learning has not been used to analyze images yet, but those where it has been tried have nearly universally shown that computers can recognize what items are in an image at least as well as people can—even specially trained people, such as radiologists. This is known as *object recognition*. Deep learning is also good at recognizing where objects in an image are, and can highlight their locations and name each found object. This is known as *object detection* (there is also a variant of this that we saw in <<chapter_intro>>, where every pixel is categorized based on what kind of object it is part of—this is called *segmentation*). Deep learning algorithms are generally not good at recognizing images that are significantly different in structure or style to those used to train the model. For instance, if there were no black-and-white images in the training data, the model may do poorly on black-and-white images. Similarly, if the training data did not contain hand-drawn images, then the model will probably do poorly on hand-drawn images. There is no general way to check what types of images are missing in your training set, but we will show in this chapter some ways to try to recognize when unexpected image types arise in the data when the model is being used in production (this is known as checking for *out-of-domain* data).



One major challenge for object detection systems is that image labelling can be slow and expensive. There is a lot of work at the moment going into tools to try to make this labelling faster and easier, and to require fewer handcrafted labels to train accurate object detection models. One approach that is particularly helpful is to synthetically generate variations of input images, such as by rotating them or changing their brightness and contrast; this is called *data augmentation* and also works well for text and other types of models. We will be discussing it in detail in this chapter.
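To make the idea of data augmentation concrete, here is a minimal sketch in plain Python. It treats a grayscale image as a list of rows with pixel values in [0, 1]; the helper names (`rotate90`, `flip_horizontal`, `adjust`, `augment`) and the parameter ranges are illustrative assumptions, not part of fastai, and a real project would use a library's augmentation transforms instead.

```python
import random

def rotate90(img):
    """Rotate a 2D image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def adjust(img, brightness=0.0, contrast=1.0):
    """Scale pixels around the image mean, shift them, and clamp to [0, 1]."""
    count = sum(len(row) for row in img)
    mean = sum(sum(row) for row in img) / count
    return [[min(1.0, max(0.0, (p - mean) * contrast + mean + brightness))
             for p in row] for row in img]

def augment(img, rng):
    """Return one random variation: rotation, optional flip, brightness/contrast change."""
    for _ in range(rng.randrange(4)):          # 0, 90, 180, or 270 degrees
        img = rotate90(img)
    if rng.random() < 0.5:
        img = flip_horizontal(img)
    return adjust(img,
                  brightness=rng.uniform(-0.1, 0.1),   # assumed range
                  contrast=rng.uniform(0.8, 1.2))      # assumed range

rng = random.Random(0)
image = [[0.0, 0.25], [0.5, 1.0]]              # tiny stand-in for a real photo
variants = [augment(image, rng) for _ in range(4)]
```

Each call to `augment` yields a slightly different version of the same labeled image, so one hand-labeled example can stand in for several training examples.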



Another point to consider is that although your problem might not look like a computer vision problem, it might be possible with a little imagination to turn it into one. For instance, if what you are trying to classify are sounds, you might try converting the sounds into images of their acoustic waveforms and then training a model on those images.


#### Text (natural language processing)


Computers are very good at classifying both short and long documents based on categories such as spam or not spam, sentiment (e.g., is the review positive or negative), author, source website, and so forth. We are not aware of any rigorous work done in this area to compare them to humans, but anecdotally it seems to us that deep learning performance is similar to human performance on these tasks. Deep learning is also very good at generating context-appropriate text, such as replies to social media posts, and imitating a particular author's style. It's good at making this content compelling to humans too—in fact, even more compelling than human-generated text. However, deep learning is currently not good at generating *correct* responses! We don't currently have a reliable way to, for instance, combine a knowledge base of medical information with a deep learning model for generating medically correct natural language responses. This is very dangerous, because it is so easy to create content that appears to a layman to be compelling, but actually is entirely incorrect.



Another concern is that context-appropriate, highly compelling responses on social media could be used at massive scale—thousands of times greater than any troll farm previously seen—to spread disinformation, create unrest, and encourage conflict. As a rule of thumb, text generation models will always be technologically a bit ahead of models recognizing automatically generated text. For instance, it is possible to use a model that can recognize artificially generated content to actually improve the generator that creates that content, until the classification model is no longer able to complete its task.



Despite these issues, deep learning has many applications in NLP: it can be used to translate text from one language to another, summarize long documents into something that can be digested more quickly, find all mentions of a concept of interest, and more. Unfortunately, the translation or summary could well include completely incorrect information! However, the performance is already good enough that many people are using these systems—for instance, Google's online translation system (and every other online service we are aware of) is based on deep learning.


#### Combining text and images


The ability of deep learning to combine text and images into a single model is, generally, far better than most people intuitively expect. For example, a deep learning model can be trained on input images with output captions written in English, and can learn to generate surprisingly appropriate captions automatically for new images! But again, we have the same warning that we discussed in the previous section: there is no guarantee that these captions will actually be correct.



Because of this serious issue, we generally recommend that deep learning be used not as an entirely automated process, but as part of a process in which the model and a human user interact closely. This can potentially make humans orders of magnitude more productive than they would be with entirely manual methods, and actually result in more accurate processes than using a human alone. For instance, an automatic system can be used to identify potential stroke victims directly from CT scans, and send a high-priority alert to have those scans looked at quickly. There is only a three-hour window to treat strokes, so this fast feedback loop could save lives. At the same time, however, all scans could continue to be sent to radiologists in the usual way, so there would be no reduction in human input. Other deep learning models could automatically measure items seen on the scans, and insert those measurements into reports, warning the radiologists about findings that they may have missed, and telling them about other cases that might be relevant.
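The stroke example describes a design in which the model reorders human work rather than replacing it. A rough sketch of that idea follows; the `triage` function, the `stroke_score` field, and the 0.8 alert threshold are all hypothetical and chosen only for illustration.

```python
def triage(scans, alert_threshold=0.8):
    """Return (review_order, alerts): every scan stays in the queue; likely strokes go first."""
    review_order = sorted(scans, key=lambda scan: scan["stroke_score"], reverse=True)
    alerts = [scan["id"] for scan in review_order
              if scan["stroke_score"] >= alert_threshold]
    return review_order, alerts

# Hypothetical model outputs for three scans
scans = [
    {"id": "scan-1", "stroke_score": 0.10},
    {"id": "scan-2", "stroke_score": 0.93},
    {"id": "scan-3", "stroke_score": 0.45},
]
review_order, alerts = triage(scans)
```

The point of the design is that `review_order` always contains every scan, so a model error can delay a review but never removes a human from the loop.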


#### Tabular data


For analyzing time series and tabular data, deep learning has recently been making great strides. However, deep learning is generally used as part of an ensemble of multiple types of model. If you already have a system that is using random forests or gradient boosting machines (popular tabular modeling tools that you will learn about soon), then switching to or adding deep learning may not result in any dramatic improvement. Deep learning does greatly increase the variety of columns that you can include—for example, columns containing natural language (book titles, reviews, etc.), and high-cardinality categorical columns (i.e., something that contains a large number of discrete choices, such as zip code or product ID). On the down side, deep learning models generally take longer to train than random forests or gradient boosting machines, although this is changing thanks to libraries such as [RAPIDS](https://rapids.ai/), which provides GPU acceleration for the whole modeling pipeline. We cover the pros and cons of all these methods in detail in <<chapter_tabular>>.


#### Recommendation systems


Recommendation systems are really just a special type of tabular data. In particular, they generally have a high-cardinality categorical variable representing users, and another one representing products (or something similar). A company like Amazon represents every purchase that has ever been made by its customers as a giant sparse matrix, with customers as the rows and products as the columns. Once they have the data in this format, data scientists apply some form of collaborative filtering to *fill in the matrix*. For example, if customer A buys products 1 and 10, and customer B buys products 1, 2, 4, and 10, the engine will recommend that A buy 2 and 4. Because deep learning models are good at handling high-cardinality categorical variables, they are quite good at handling recommendation systems. They particularly come into their own, just like for tabular data, when combining these variables with other kinds of data, such as natural language or images. They can also do a good job of combining all of these types of information with additional metadata represented as tables, such as user information, previous transactions, and so forth.
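The customer A / customer B example can be reproduced with a deliberately naive sketch: suggest products bought by customers who share at least one purchase with you. This is only an illustration of "filling in the matrix"; the `recommend` function is a hypothetical helper, not how any production engine (or the collaborative filtering models covered later in the book) actually works.

```python
# Sparse purchase "matrix": customer -> set of product IDs they bought
purchases = {"A": {1, 10}, "B": {1, 2, 4, 10}}

def recommend(customer, purchases):
    """Suggest products bought by customers who share at least one purchase."""
    owned = purchases[customer]
    scores = {}
    for other, items in purchases.items():
        if other == customer:
            continue
        overlap = len(owned & items)           # how similar the two customers are
        if overlap == 0:
            continue
        for item in items - owned:             # only products the customer lacks
            scores[item] = scores.get(item, 0) + overlap
    # Highest score first; ties broken by product ID for a stable order
    return sorted(scores, key=lambda item: (-scores[item], item))

print(recommend("A", purchases))   # [2, 4]
```

Customer B has nothing new to gain from A's purchases, so `recommend("B", purchases)` returns an empty list.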



However, nearly all machine learning approaches have the downside that they only tell you what products a particular user might like, rather than what recommendations would be helpful for a user. Many kinds of recommendations for products a user might like may not be at all helpful—for instance, if the user is already familiar with the products, or if they are simply different packagings of products they have already purchased (such as a boxed set of novels, when they already have each of the items in that set). Jeremy likes reading books by Terry Pratchett, and for a while Amazon was recommending nothing but Terry Pratchett books to him (see <<pratchett>>), which really wasn't helpful because he already was aware of these books!


<img alt="Terry Pratchett books recommendation" caption="A not-so-useful recommendation" id="pratchett" src="images/pratchett.png">


#### Other data types


Often you will find that domain-specific data types fit very nicely into existing categories. For instance, protein chains look a lot like natural language documents, in that they are long sequences of discrete tokens with complex relationships and meaning throughout the sequence. And indeed, it does turn out that using NLP deep learning methods is the current state-of-the-art approach for many types of protein analysis. As another example, sounds can be represented as spectrograms, which can be treated as images; standard deep learning approaches for images turn out to work really well on spectrograms.
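As a rough illustration of how a sound becomes an image, the sketch below computes a magnitude spectrogram with a naive discrete Fourier transform in plain Python. The `spectrogram` function, the window and hop sizes, and the synthetic 128 Hz tone are assumptions made for this example; real code would use an FFT from an audio or numerical library.

```python
import cmath
import math

def spectrogram(samples, window=64, hop=32):
    """Magnitude spectrogram: result[t][k] is the strength of frequency bin k in time frame t."""
    frames = []
    for start in range(0, len(samples) - window + 1, hop):
        chunk = samples[start:start + window]
        # A Hann window tapers the edges of each frame to reduce spectral leakage
        chunk = [s * 0.5 * (1 - math.cos(2 * math.pi * n / (window - 1)))
                 for n, s in enumerate(chunk)]
        frames.append([
            abs(sum(s * cmath.exp(-2j * math.pi * k * n / window)
                    for n, s in enumerate(chunk)))
            for k in range(window // 2 + 1)
        ])
    return frames

rate = 1024                                     # samples per second (assumed)
tone = [math.sin(2 * math.pi * 128 * t / rate) for t in range(512)]
spec = spectrogram(tone)                        # 2D grid: time frames x frequency bins
```

With a 64-sample window at 1024 samples per second, each bin spans 16 Hz, so the 128 Hz tone lands in bin 8 of every frame. The resulting grid of numbers can be rendered as pixel intensities and fed to an image model.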


### The Drivetrain Approach


There are many accurate models that are of no use to anyone, and many inaccurate models that are highly useful. To ensure that your modeling work is useful in practice, you need to consider how your work will be used. In 2012 Jeremy, along with Margit Zwemer and Mike Loukides, introduced a method called *the Drivetrain Approach* for thinking about this issue.


The Drivetrain Approach, illustrated in <<drivetrain>>, was described in detail in ["Designing Great Data Products"](https://www.oreilly.com/radar/drivetrain-approach-data-products/). The basic idea is to start with considering your objective, then think about what actions you can take to meet that objective and what data you have (or can acquire) that can help, and then build a model that you can use to determine the best actions to take to get the best results in terms of your objective.


<img src="images/drivetrain-approach.png" id="drivetrain" caption="The Drivetrain Approach">


Consider a model in an autonomous vehicle: you want to help a car drive safely from point A to point B without human intervention. Great predictive modeling is an important part of the solution, but it doesn't stand on its own; as products become more sophisticated, it disappears into the plumbing. Someone using a self-driving car is completely unaware of the hundreds (if not thousands) of models and the petabytes of data that make it work. But as data scientists build increasingly sophisticated products, they need a systematic design approach.



We use data not just to generate more data (in the form of predictions), but to produce *actionable outcomes*. That is the goal of the Drivetrain Approach. Start by defining a clear *objective*. For instance, Google, when creating their first search engine, considered "What is the user’s main objective in typing in a search query?" This led them to their objective, which was to "show the most relevant search result." The next step is to consider what *levers* you can pull (i.e., what actions you can take) to better achieve that objective. In Google's case, that was the ranking of the search results. The third step was to consider what new *data* they would need to produce such a ranking; they realized that the implicit information regarding which pages linked to which other pages could be used for this purpose. Only after these first three steps do we begin thinking about building the predictive *models*. Our objective and available levers, what data we already have and what additional data we will need to collect, determine the models we can build. The models will take both the levers and any uncontrollable variables as their inputs; the outputs from the models can be combined to predict the final state for our objective.


Let's consider another example: recommendation systems. The *objective* of a recommendation engine is to drive additional sales by surprising and delighting the customer with recommendations of items they would not have purchased without the recommendation. The *lever* is the ranking of the recommendations. New *data* must be collected to generate recommendations that will *cause new sales*. This will require conducting many randomized experiments in order to collect data about a wide range of recommendations for a wide range of customers. This is a step that few organizations take; but without it, you don't have the information you need to actually optimize recommendations based on your true objective (more sales!).



Finally, you could build two *models* for purchase probabilities, conditional on seeing or not seeing a recommendation. The difference between these two probabilities is a utility function for a given recommendation to a customer. It will be low in cases where the algorithm recommends a familiar book that the customer has already rejected (both components are small) or a book that they would have bought even without the recommendation (both components are large and cancel each other out).
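The utility function described here is just a difference of two probabilities. The sketch below uses made-up numbers in place of real model outputs; `recommendation_utility` and the candidate names are hypothetical.

```python
def recommendation_utility(p_buy_if_shown, p_buy_if_not_shown):
    """Incremental purchase probability attributable to showing the recommendation."""
    return p_buy_if_shown - p_buy_if_not_shown

# Hypothetical (P(buy | shown), P(buy | not shown)) pairs, as if from the two models
candidates = {
    "already rejected":       (0.02, 0.01),   # both components small
    "would buy anyway":       (0.90, 0.88),   # both components large, cancel out
    "unfamiliar, good match": (0.30, 0.05),   # the recommendation changes behavior
}

ranked = sorted(candidates,
                key=lambda name: recommendation_utility(*candidates[name]),
                reverse=True)
```

Ranking by this difference, rather than by `p_buy_if_shown` alone, puts the unfamiliar book first even though the "would buy anyway" book has a much higher raw purchase probability.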



As you can see, putting a model to practical use often requires a lot more than just training it! You'll often need to run experiments to collect more data, and consider how to incorporate your models into the overall system you're developing. Speaking of data, let's now focus on how to find data for your project.

## Gathering Data


For many types of projects, you may be able to find all the data you need online. The project we'll be completing in this chapter is a *bear detector* (strictly speaking, a classifier: it assigns each image to a category rather than locating bears within it). It will discriminate between three types of bear: grizzly, black, and teddy bears. There are many images on the internet of each type of bear that we can use. We just need a way to find them and download them. We've provided a tool you can use for this purpose, so you can follow along with this chapter and create your own image recognition application for whatever kinds of objects you're interested in. In the fast.ai course, thousands of students have presented their work in the course forums, displaying everything from hummingbird varieties in Trinidad to bus types in Panama. One student even created an application that would help his fiancée recognize his 16 cousins during Christmas vacation!

At the time of writing, Bing Image Search is the best option we know of for finding and downloading images. It's free for up to 1,000 queries per month, and each query can download up to 150 images. However, something better might have come along between when we wrote this and when you're reading the book, so be sure to check out the [book's website](https://book.fast.ai/) for our current recommendation.


> important: Keeping in Touch With the Latest Services: Services that can be used for creating datasets come and go all the time, and their features, interfaces, and pricing change regularly too. In this section, we'll show how to use the Bing Image Search API available as part of Azure Cognitive Services at the time this book was written. We'll be providing more options and more up to date information on the [book's website](https://book.fast.ai/), so be sure to have a look there now to get the most current information on how to download images from the web to create a dataset for deep learning.


To download images with Bing Image Search, sign up at Microsoft for a free account. You will be given a key, which you can copy and enter in a cell as follows (replacing 'XXX' with your key and executing it):

要使用必应图像搜索下载图像，请在Microsoft注册一个免费帐户。你将获得一个密钥，可以按如下方式复制并输入单元格 (用您的密钥替换 “xxx” 并执行它):

In [11]:
key = 'XXX'

Or, if you're comfortable at the command line, you can set it in your terminal with:


或者，如果你对命令行感到满意，你可以在终端中设置它:

    export AZURE_SEARCH_KEY=your_key_here


and then restart Jupyter Notebook, type this in a cell and execute it:


然后重新启动Jupyter笔记本，在单元格中键入并执行它:

```python
import os  # explicit import so this cell also works on its own
key = os.environ['AZURE_SEARCH_KEY']
```
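
Reading the key from the environment fails with a bare `KeyError` if the variable was never exported, or if Jupyter was not restarted afterwards. As a minimal sketch, one could wrap the lookup so that a missing or placeholder key is caught early. The helper name `read_api_key` and its error message are our own illustrative assumptions, not part of fastai or fastbook:

```python
import os

def read_api_key(env_var="AZURE_SEARCH_KEY", environ=None):
    """Return the API key stored in `env_var`, or raise if it is unusable.

    `read_api_key` is a hypothetical helper, for illustration only.
    `environ` defaults to the real process environment; passing a plain
    dict instead makes the function easy to try out.
    """
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key or key == "XXX":  # 'XXX' is the placeholder used in the text
        raise RuntimeError(f"Set {env_var} before starting Jupyter")
    return key

# Example with an explicit mapping instead of the real environment
demo_key = read_api_key(environ={"AZURE_SEARCH_KEY": " abc123 "})
print(demo_key)
```

Failing loudly at this point is usually easier to diagnose than an authorization error from the search service later on.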


Once you've set `key`, you can use `search_images_bing`. This function is provided by the small `utils` module included with the notebooks online (the output below shows it under the `fastbook` package). If you're not sure where a function is defined, you can just type it in your notebook to find out:

设置 `密钥` 后，可以使用 `search_images_bing`。此功能由在线笔记本附带的小型 `utils` 类提供。如果不确定函数的定义位置，你可以在笔记本中输入它来找出:

In [12]:
search_images_bing

<function fastbook.search_images_bing(key, term, min_sz=128, max_images=150)>

In [13]:
results = search_images_bing(key, 'grizzly bear')
ims = results.attrgot('content_url')
len(ims)

HTTPError: 401 Client Error: PermissionDenied for url: https://api.bing.microsoft.com/v7.0/images/search?q=grizzly+bear&count=150&min_height=128&min_width=128

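With a valid key, this call returns the URLs of 150 grizzly bear images (or, at least, of the images that Bing Image Search finds for that search term). A `401 Client Error: PermissionDenied` like the one shown above typically means the key is missing or invalid, for example because it is still set to the placeholder 'XXX'. Let's look at one of the images: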

我们已经成功下载了150只灰熊的链接 (或者至少是必应图像搜索为该搜索词找到的图像)。让我们来看一个:

In [None]:
#hide
ims = ['http://3.bp.blogspot.com/-S1scRCkI3vY/UHzV2kucsPI/AAAAAAAAA-k/YQ5UzHEm9Ss/s1600/Grizzly%2BBear%2BWildlife.jpg']

In [None]:
dest = 'images/grizzly.jpg'
download_url(ims[0], dest)

In [None]:
im = Image.open(dest)
im.to_thumb(128,128)

This seems to have worked nicely, so let's use fastai's `download_images` to download the images at all the URLs returned for each of our search terms. We'll put each bear type in a separate folder:

这似乎效果很好，那么让我们使用fastai的 `download_images` 下载每个搜索词的所有链接。我们将把每一个放在一个单独的文件夹中:

In [None]:
bear_types = 'grizzly','black','teddy'
path = Path('bears')

In [None]:
if not path.exists():
    path.mkdir()
    for o in bear_types:
        dest = (path/o)
        dest.mkdir(exist_ok=True)
        results = search_images_bing(key, f'{o} bear')
        download_images(dest, urls=results.attrgot('content_url'))

Our folder has image files, as we'd expect:

正如我们期望的那样，我们的文件夹中包含图像文件:

In [None]:
fns = get_image_files(path)
fns

> j: I just love this about working in Jupyter notebooks! It's so easy to gradually build what I want, and check my work every step of the way. I make a _lot_ of mistakes, so this is really helpful to me...

> J: 我喜欢在Jupyter笔记本上工作！逐步构建我想要的东西，并在每一步检查我的工作非常容易。我犯了 _很多_ 错误，所以这对我真的很有帮助.

Often when we download files from the internet, there are a few that are corrupt. Let's check:

通常当我们从互联网上下载文件时，会有一些文件损坏。让我们检查一下:

In [None]:
failed = verify_images(fns)
failed

To remove all the failed images, you can use `unlink` on each of them. Note that, like most fastai functions that return a collection, `verify_images` returns an object of type `L`, which includes the `map` method. This calls the passed function on each element of the collection:

要删除所有失败的图片，你可以对每个图片使用 `unlink`。请注意，与大多数返回集合的fastai函数一样，`verify_images` 返回类型为 `L` 的对象，其中包括 `map` 方法。这会对集合的每个元素调用传递的函数:

In [None]:
failed.map(Path.unlink);
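
The same find-then-delete pattern can be sketched with the standard library alone. The check below only looks at the first bytes of each file for a JPEG or PNG signature, which is a much cruder test than `verify_images`. The names `looks_like_image` and `remove_bad_files` are hypothetical and exist only for this illustration:

```python
import tempfile
from pathlib import Path

# File signatures ("magic bytes") for two common image formats
JPEG_MAGIC = b"\xff\xd8\xff"
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"

def looks_like_image(fn):
    "Crude check: does the file start with a JPEG or PNG signature?"
    with open(fn, "rb") as f:
        head = f.read(8)
    return head.startswith(JPEG_MAGIC) or head.startswith(PNG_MAGIC)

def remove_bad_files(fns):
    "Delete every file that fails the check and return the deleted paths."
    failed = [Path(fn) for fn in fns if not looks_like_image(fn)]
    for fn in failed:
        fn.unlink()
    return failed

# Demo in a temporary folder: one plausible PNG header, one HTML error page
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "bear.png"
    bad = Path(tmp) / "broken.jpg"
    good.write_bytes(PNG_MAGIC + b"rest of file")
    bad.write_bytes(b"<html>404 Not Found</html>")
    removed = [p.name for p in remove_bad_files([good, bad])]
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed, remaining)
```

A saved HTML error page with a `.jpg` extension is a typical example of a "corrupt" download that such a check would catch.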

### Sidebar: Getting Help in Jupyter Notebooks

### 侧边栏: 在Jupyter笔记本中获取帮助

Jupyter notebooks are great for experimenting and immediately seeing the results of each function, but there is also a lot of functionality to help you figure out how to use different functions, or even directly look at their source code. For instance, if you type in a cell:

Jupyter笔记本非常适合进行实验并立即看到每个函数的结果，但是也有很多功能可以帮助你了解如何使用不同的函数，或者甚至直接看他们的源代码。例如，如果你键入一个单元格:

```
??verify_images
```
a window will pop up with:

将弹出一个窗口，其中包含:

```
Signature: verify_images(fns)
Source:   
def verify_images(fns):
    "Find images in `fns` that can't be opened"
    return L(fns[i] for i,o in
             enumerate(parallel(verify_image, fns)) if not o)
File:      ~/git/fastai/fastai/vision/utils.py
Type:      function
```

This tells us what argument the function accepts (`fns`), then shows us the source code and the file it comes from. Looking at that source code, we can see it applies the function `verify_image` in parallel and only keeps the image files for which the result of that function is `False`, which is consistent with the doc string: it finds the images in `fns` that can't be opened.


这告诉我们函数接受什么参数 (`fns`)，然后向我们显示源代码和它来自的文件。查看源代码，我们可以看到它并行应用函数 `verify_image`，并且只保留该函数结果为 `False` 的图像文件，这与doc字符串一致: 它在 `fns` 中找到无法打开的图像。

Here are some other features that are very useful in Jupyter notebooks:


以下是一些在Jupyter笔记本中非常有用的其他功能:

- At any point, if you don't remember the exact spelling of a function or argument name, you can press Tab to get autocompletion suggestions.
- When inside the parentheses of a function, pressing Shift and Tab simultaneously will display a window with the signature of the function and a short description. Pressing these keys twice will expand the documentation, and pressing them three times will open a full window with the same information at the bottom of your screen.
- In a cell, typing `?func_name` and executing will open a window with the signature of the function and a short description.
- In a cell, typing `??func_name` and executing will open a window with the signature of the function, a short description, and the source code.
- If you are using the fastai library, we added a `doc` function for you: executing `doc(func_name)` in a cell will open a window with the signature of the function, a short description and links to the source code on GitHub and the full documentation of the function in the [library docs](https://docs.fast.ai).
- Unrelated to the documentation but still very useful: to get help at any point if you get an error, type `%debug` in the next cell and execute to open the [Python debugger](https://docs.python.org/3/library/pdb.html), which will let you inspect the content of every variable.

- 在任何时候，如果你不记得函数或参数名称的确切拼写，可以按Tab键获取自动完成建议。
- 当在函数的括号内时，同时按下Shift和Tab键将显示一个带有函数签名和简短描述的窗口。按两次这些键将展开文档，按三次将在屏幕底部打开一个包含相同信息的完整窗口。
- 在单元格中，键入 `?func_name` 并执行将打开一个带有函数签名和简短描述的窗口。
- 在单元格中，键入 `??func_name` 并执行将打开一个带有函数签名、简短描述和源代码的窗口。
- 如果你正在使用fastai库，我们为你添加了一个 `doc` 函数: 在单元格中执行 `doc(func_name)` 将打开一个带有该函数签名的窗口，gitHub上源代码的简短描述和链接，以及 [库文档](https://docs.fast.ai)。
- 与文档无关，但仍然非常有用: 如果你遇到错误，要在任何时候获得帮助，请在下一个单元格中键入 `%debug`，然后执行以打开 [Python调试器](https://docs.python.org/3/library/pdb.html)，这将使你检查每个变量的内容。

### End sidebar

### 结束侧边栏

One thing to be aware of in this process: as we discussed in <<chapter_intro>>, models can only reflect the data used to train them. And the world is full of biased data, which ends up reflected in, for example, Bing Image Search (which we used to create our dataset). For instance, let's say you were interested in creating an app that could help users figure out whether they had healthy skin, so you trained a model on the results of searches for (say) "healthy skin." <<healthy_skin>> shows you the kinds of results you would get.

在这个过程中需要注意的一件事是: 正如我们在 <<chapter_intro>> 中讨论的，模型只能反映用于训练它们的数据。这个世界充满了有偏差的数据，这最终反映在例如必应图像搜索 (我们用来创建数据集) 中。例如，假设你有兴趣创建一个应用程序来帮助用户判断自己的皮肤是否健康，因此你针对（例如）“健康的皮肤"的搜索结果训练了一个模型 <<healthy_skin>> 显示你将获得的结果种类。

<img src="images/healthy_skin.gif" width="600" caption="Data for a healthy skin detector?" id="healthy_skin">


With this as your training data, you would end up not with a healthy skin detector, but a *young white woman touching her face* detector! Be sure to think carefully about the types of data that you might expect to see in practice in your application, and check carefully to ensure that all these types are reflected in your model's source data. footnote:[Thanks to Deb Raji, who came up with the "healthy skin" example. See her paper ["Actionable Auditing: Investigating the Impact of Publicly Naming Biased Performance Results of Commercial AI Products"](https://dl.acm.org/doi/10.1145/3306618.3314244) for more fascinating insights into model bias.]

有了这个作为你的训练数据，你最终不会获得一个健康的皮肤探测器，而是一个 *年轻的白人妇女触摸她的脸* 探测器！请务必仔细考虑您可能希望在应用程序中看到的数据类型，并仔细检查以确保所有这些类型都反映在模型的源数据中。脚注: [感谢Deb Raji，她提出了 “健康皮肤” 的例子。参见其论文 [“可操作的审计: 调查公开命名商业AI产品的偏差性能结果的影响”](https://dl.acm.org/doi/10.1145/3306618.3314244)对模型偏差有更多有趣的见解。]

Now that we have downloaded some data, we need to assemble it in a format suitable for model training. In fastai, that means creating an object called `DataLoaders`.

现在我们已经下载了一些数据，我们需要以适合模型训练的格式来装配它。在fastai中，这意味着创建一个名为 `DataLoaders` 的对象。

## From Data to DataLoaders

## 从数据到数据加载器

`DataLoaders` is a thin class that just stores whatever `DataLoader` objects you pass to it, and makes them available as `train` and `valid`. Although it's a very simple class, it's very important in fastai: it provides the data for your model. The key functionality in `DataLoaders` is provided with just these four lines of code (it has some other minor functionality we'll skip over for now):


`DataLoaders` 是一个精简类，它只存储你传递给它的任何 `DataLoaders` 对象，并将它们作为 `train` 和 `valid` 可用。虽然这是一个非常简单的类，但在fastai中非常重要: 它为你的模型提供数据。`DataLoaders` 中的关键功能仅提供了这四行代码 (它还有一些其他次要功能，我们现在将跳过):

```python
class DataLoaders(GetAttr):
    def __init__(self, *loaders): self.loaders = loaders
    def __getitem__(self, i): return self.loaders[i]
    train,valid = add_props(lambda i,self: self[i])
```
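
To see what those four lines amount to, here is a plain-Python stand-in that needs neither fastai nor its `GetAttr` and `add_props` helpers. `SimpleLoaders` is a hypothetical class written for this illustration; it only mimics the indexing and the `train`/`valid` properties, not the other functionality of the real class:

```python
class SimpleLoaders:
    "Hypothetical stand-in: stores loaders and exposes the first two by name."
    def __init__(self, *loaders):
        self.loaders = loaders

    def __getitem__(self, i):
        return self.loaders[i]

    @property
    def train(self):
        return self[0]  # the first loader passed in

    @property
    def valid(self):
        return self[1]  # the second loader passed in

# Any objects can play the role of the loaders; here we use lists of "batches"
dls_demo = SimpleLoaders(["train_batch_1", "train_batch_2"], ["valid_batch_1"])
print(dls_demo.train)
print(dls_demo.valid)
```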


> jargon: DataLoaders: A fastai class that stores multiple `DataLoader` objects you pass to it, normally a `train` and a `valid`, although it's possible to have as many as you like. The first two are made available as properties.

> 行话: 数据加载器: 一个fastai类，用于存储传递给它的多个 `DataLoader` 对象，通常是`train` 和`valid`，尽管您可以拥有任意多的对象。这前两个作为属性提供。

Later in the book you'll also learn about the `Dataset` and `Datasets` classes, which have the same relationship.


在本书的后面，你还将了解 `Dataset` 和 `Datasets` 类，它们之间有着相同的关系。

To turn our downloaded data into a `DataLoaders` object we need to tell fastai at least four things:


要将下载的数据转换为 `DataLoaders` 对象，我们需要告诉fastai至少四件事:

- What kinds of data we are working with
- How to get the list of items
- How to label these items
- How to create the validation set


- 我们正在处理哪些类型的数据
- 如何获取项目列表
- 如何标记这些项目
- 如何创建验证集

So far we have seen a number of *factory methods* for particular combinations of these things, which are convenient when you have an application and data structure that happen to fit into those predefined methods. For when you don't, fastai has an extremely flexible system called the *data block API*. With this API you can fully customize every stage of the creation of your `DataLoaders`. Here is what we need to create a `DataLoaders` for the dataset that we just downloaded:

到目前为止，我们已经看到了许多 *工厂方法* 用于这些东西的特定组合，当你有一个恰好适合这些预定义方法的应用程序和数据结构时，这些方法很方便。当你不这样做时，fastai有一个非常灵活的系统，叫做 *数据块接口* 。使用此接口，你可以完全自定义创建 `DataLoaders` 的每个阶段。下面是我们需要为刚刚下载的数据集创建一个 `DataLoaders`:

In [None]:
bears = DataBlock(
    blocks=(ImageBlock, CategoryBlock), 
    get_items=get_image_files, 
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    item_tfms=Resize(128))

Let's look at each of these arguments in turn. First we provide a tuple where we specify what types we want for the independent and dependent variables: 


让我们依次看看每个参数。首先，我们提供一个元组，在这里我们为自变量和因变量指定我们想要的类型:

```python
blocks=(ImageBlock, CategoryBlock)
```


The *independent variable* is the thing we are using to make predictions from, and the *dependent variable* is our target. In this case, our independent variables are images, and our dependent variables are the categories (type of bear) for each image. We will see many other types of block in the rest of this book.


*自变量* 是我们用来预测的东西，而 *因变量* 是我们的目标。在这种情况下，我们的自变量是图像，我们的因变量是每个图像的类别 (熊的类型)。我们将在本书的其余部分看到许多其他类型的块。

For this `DataLoaders` our underlying items will be file paths. We have to tell fastai how to get a list of those files. The `get_image_files` function takes a path, and returns a list of all of the images in that path (recursively, by default):


对于这个 `DataLoaders`，我们的基础项目将是文件路径。我们必须告诉fastai如何获得这些文件的列表。`get_image_files` 函数采用一个路径，并返回该路径中所有图像的列表 (默认情况下递归):

```python
get_items=get_image_files
```


Often, datasets that you download will already have a validation set defined. Sometimes this is done by placing the images for the training and validation sets into different folders. Sometimes it is done by providing a CSV file in which each filename is listed along with which dataset it should be in. There are many ways that this can be done, and fastai provides a very general approach that allows you to use one of its predefined classes for this, or to write your own. In this case, however, we simply want to split our training and validation sets randomly. We would still like to have the same training/validation split each time we run this notebook, so we fix the random seed (computers don't really know how to create random numbers at all, but simply create lists of numbers that look random; if you provide the same starting point for that list each time, called the *seed*, then you will get the exact same list each time):


通常，你下载的数据集已经设置了验证集。有时，是通过将训练和验证集的图像放入不同的文件夹来完成的。有时，是通过提供一个CSV文件来完成的，其中每个文件名都与它应该在哪个数据集中一起列出。有许多方法可以做到这一点，fastai提供了一个非常通用的方法，允许你为此使用它的一个预定义的类，或者编写你自己的类。但是，在这种情况下，我们只想随机划分训练集和验证集。然而，我们希望每次运行这个笔记本时都有相同的训练/验证划分，因此我们修复了随机种子 (计算机根本不知道如何创建随机数，而是简单地创建看起来随机的数字列表;如果你每次为该列表提供相同的起点 -- 称为 *种子* -- 那么你每次都会得到完全相同的列表):

```python
splitter=RandomSplitter(valid_pct=0.2, seed=42)
```
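
The effect of fixing the seed can be sketched with Python's own `random` module. `random_split` below is a hypothetical function written for this illustration; fastai's `RandomSplitter` has its own implementation, so the particular indices it picks will differ:

```python
import random

def random_split(n_items, valid_pct=0.2, seed=None):
    "Shuffle the indices 0..n_items-1 and set `valid_pct` of them aside for validation."
    rng = random.Random(seed)        # private generator, seeded for repeatability
    idxs = list(range(n_items))
    rng.shuffle(idxs)
    cut = int(valid_pct * n_items)   # number of validation items
    return idxs[cut:], idxs[:cut]    # (training indices, validation indices)

train_a, valid_a = random_split(10, valid_pct=0.2, seed=42)
train_b, valid_b = random_split(10, valid_pct=0.2, seed=42)
print(valid_a == valid_b)  # the same seed gives the same split
```

Calling the function with `seed=None` instead would give a different split on each run, which makes results harder to compare between runs.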


The independent variable is often referred to as `x` and the dependent variable is often referred to as `y`. Here, we are telling fastai what function to call to create the labels in our dataset:


自变量通常被称为 `x`，因变量通常被称为 `y`。在这里，我们告诉fastai要调用什么函数来在数据集中创建标签:

```python
get_y=parent_label
```


`parent_label` is a function provided by fastai that simply gets the name of the folder a file is in. Because we put each of our bear images into folders based on the type of bear, this is going to give us the labels that we need.


`parent_label` 是fastai提供的一个函数，它只是简单的获取文件所在文件夹的名称。因为我们根据熊的类型将每个熊图像放入文件夹，这将为我们提供所需的标签。
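
The idea of labeling by folder name can be sketched with `pathlib` alone. `folder_label` is a hypothetical name, and this is an illustration of the idea rather than a copy of fastai's code:

```python
from pathlib import Path

def folder_label(fn):
    "Return the name of the folder that directly contains `fn`."
    return Path(fn).parent.name

print(folder_label("bears/grizzly/00000001.jpg"))
print(folder_label("bears/teddy/00000042.jpg"))
```

This is why the folder layout chosen at download time matters: the folder names become the categories the model learns.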

Our images are all different sizes, and this is a problem for deep learning: we don't feed the model one image at a time but several of them (what we call a *mini-batch*). To group them in a big array (usually called a *tensor*) that is going to go through our model, they all need to be of the same size. So, we need to add a transform which will resize these images to the same size. *Item transforms* are pieces of code that run on each individual item, whether it be an image, category, or so forth. fastai includes many predefined transforms; we use the `Resize` transform here:


我们的图像大小各不相同，这对深度学习来说是个问题: 我们不会一次给模型一个图像，而是给其中几个图像 (我们称之为 *小批量*)。要将它们组合成要通过我们的模型的大型数组（通常称为 *张量* ），它们的大小都必须相同。因此，我们需要添加一个转换，将这些图像调整为相同的大小。*项目转换* 是在每个单独项目上运行的代码段，无论是图像、类别还是其他。fastai包括许多预定义的转换; 我们在这里使用 `Resize` 转换:

```python
item_tfms=Resize(128)
```


This command has given us a `DataBlock` object. This is like a *template* for creating a `DataLoaders`. We still need to tell fastai the actual source of our data—in this case, the path where the images can be found:

这个命令给了我们一个 `DataBlock` 对象。这就像一个 *模板* 用于创建 `DataLoaders`。我们仍然需要告诉fastai我们数据的实际来源 -- 在这个例子里，可以找到图像的路径:

In [None]:
dls = bears.dataloaders(path)

A `DataLoaders` includes validation and training `DataLoader`s. `DataLoader` is a class that provides batches of a few items at a time to the GPU. We'll be learning a lot more about this class in the next chapter. When you loop through a `DataLoader` fastai will give you 64 (by default) items at a time, all stacked up into a single tensor. We can take a look at a few of those items by calling the `show_batch` method on a `DataLoader`:

`DataLoaders` 包括验证和训练 `DataLoader`。`DataLoader` 是一个类，可一次向显卡提供几个项目的批。我们将在下一章学习更多关于这个类的知识。当你循环通过一个 `Dataloader` 时，fastai一次会给你64个 (默认情况下) 项目，所有项目都堆叠成一个张量。我们可以通过在 `DataLoader` 上调用 `show_batch` 方法来查看其中的一些项目:

In [None]:
dls.valid.show_batch(max_n=4, nrows=1)

By default `Resize` *crops* the images to fit a square shape of the size requested, using the full width or height. This can result in losing some important details. Alternatively, you can ask fastai to pad the images with zeros (black), or squish/stretch them:

默认情况下，`Resize`  使用全宽或全高将图像 *裁剪* 以适合所要求大小的正方形形状。这可能会导致丢失一些重要的细节。或者，你可以要求fastai用零 (黑色) 填充图像，或者压扁/拉伸它们:

In [None]:
bears = bears.new(item_tfms=Resize(128, ResizeMethod.Squish))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

In [None]:
bears = bears.new(item_tfms=Resize(128, ResizeMethod.Pad, pad_mode='zeros'))
dls = bears.dataloaders(path)
dls.valid.show_batch(max_n=4, nrows=1)

All of these approaches seem somewhat wasteful, or problematic. If we squish or stretch the images they end up as unrealistic shapes, leading to a model that learns that things look different to how they actually are, which we would expect to result in lower accuracy. If we crop the images then we remove some of the features that allow us to perform recognition. For instance, if we were trying to recognize breeds of dog or cat, we might end up cropping out a key part of the body or the face necessary to distinguish between similar breeds. If we pad the images then we have a whole lot of empty space, which is just wasted computation for our model and results in a lower effective resolution for the part of the image we actually use.


所有这些方法似乎都有些浪费，或者有些问题。如果我们挤压或拉伸图像，它们最终会成为不真实的形状，从而会导致一个模型，该模型可以了解事物看起来与实际情况不同，我们期望这会导致更低的准确性。如果我们裁剪图像，则会删除一些使我们能够执行识别的功能。例如，如果我们试图识别狗或猫的品种，我们可能最终会裁剪出所需的身体关键部位或面部以区分相似的品种。如果我们填充图像，则有大量的空白空间，这只是浪费了我们模型的计算，导致实际使用的图像部分的有效分辨率较低。
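
The trade-offs described above can be made concrete with a little arithmetic. The sketch below does not touch any pixels; it only computes, for an image of a given width and height, how much is lost when cropping, how much of the output is padding, and how strongly the image is distorted by squishing. The function names and the simplified geometry are our own assumptions, not fastai's implementation:

```python
def crop_geometry(w, h, size):
    "Scale so the shorter side equals `size`, then cut the longer side down to `size`."
    scale = size / min(w, h)
    kept = min(w, h) / max(w, h)   # fraction of the longer side that survives
    return {"scale": scale, "fraction_kept": kept}

def pad_geometry(w, h, size):
    "Scale so the longer side equals `size`, then fill the rest of the square with padding."
    scale = size / max(w, h)
    used = min(w, h) / max(w, h)   # fraction of the square that holds real image
    return {"scale": scale, "fraction_padding": 1 - used}

def squish_geometry(w, h, size):
    "Scale each axis independently so that both equal `size`."
    sx, sy = size / w, size / h
    return {"scale_x": sx, "scale_y": sy, "distortion": max(sx, sy) / min(sx, sy)}

# A landscape photo, 400 px wide and 200 px high, resized to a 128 px square
crop = crop_geometry(400, 200, 128)
pad = pad_geometry(400, 200, 128)
squish = squish_geometry(400, 200, 128)
print(crop)
print(pad)
print(squish)
```

For this 2:1 image, cropping discards half of the width, padding leaves half of the square empty and halves the scale at which the bear is seen, and squishing stretches one axis twice as much as the other.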

Instead, what we normally do in practice is to randomly select part of the image, and crop to just that part. On each epoch (which is one complete pass through all of our images in the dataset) we randomly select a different part of each image. This means that our model can learn to focus on, and recognize, different features in our images. It also reflects how images work in the real world: different photos of the same thing may be framed in slightly different ways.


相反，我们通常在实践中做的是随机选择图像的一部分，并裁剪到该部分。在每个迭代(这是对数据集中所有图像的一次完整循环)，我们随机选择每个图像的不同部分。这意味着我们的模型可以学会关注和识别图像中的不同特征。它还反映了图像在现实世界中的工作方式: 同一件事的不同照片可能以略有不同的方式进行构图。

In fact, an entirely untrained neural network knows nothing whatsoever about how images behave. It doesn't even recognize that when an object is rotated by one degree, it still is a picture of the same thing! So actually training the neural network with examples of images where the objects are in slightly different places and slightly different sizes helps it to understand the basic concept of what an object is, and how it can be represented in an image.


事实上，一个完全未经训练的神经网络对图像的行为一无所知。它甚至没有意识到当一个物体旋转一个角度时，它仍然是同一事物的图片！因此，实际训练带有图像示例的神经网络时，物体位于稍有不同的位置且大小略有不同，这有助于它理解物体的基本概念以及如何在图像中表示它。

Here's another example where we replace `Resize` with `RandomResizedCrop`, which is the transform that provides the behavior we just described. The most important parameter to pass in is `min_scale`, which determines how much of the image to select at minimum each time:

还有另一个例子，我们用 `RandomResizedCrop` 替换 `Resize`，这是提供我们刚才描述的行为的转换。要传递的最重要的参数是 `min_scale`，它确定每次至少选择多少图像:

In [None]:
bears = bears.new(item_tfms=RandomResizedCrop(128, min_scale=0.3))
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=4, nrows=1, unique=True)

We used `unique=True` to have the same image repeated with different versions of this `RandomResizedCrop` transform. This is a specific example of a more general technique, called data augmentation.

我们使用 `unique=True` 将相同的图像，重复使用此 `RandomResizedCrop` 变换的不同版本。这是一种更通用的技术的具体示例，称为数据增强。
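
To give a feel for what `min_scale` controls, here is a simplified sketch of how a random crop box might be chosen before it is resized to the requested size. The aspect-ratio range of 3/4 to 4/3, the number of attempts, and the function name `random_crop_box` are assumptions made for this illustration; the actual transform may differ in its details:

```python
import math
import random

def random_crop_box(width, height, min_scale=0.3, rng=None, attempts=10):
    """Pick a random crop box (left, top, crop_w, crop_h) inside the image.

    The box covers a random fraction of the image area between `min_scale`
    and 1, with an aspect ratio between 3/4 and 4/3 (an assumed range).
    """
    if rng is None:
        rng = random.Random()
    for _ in range(attempts):
        area = rng.uniform(min_scale, 1.0) * width * height
        ratio = math.exp(rng.uniform(math.log(3 / 4), math.log(4 / 3)))
        crop_w = int(round(math.sqrt(area * ratio)))
        crop_h = int(round(math.sqrt(area / ratio)))
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    return 0, 0, width, height  # fallback: use the whole image

rng = random.Random(0)
boxes = [random_crop_box(640, 480, min_scale=0.3, rng=rng) for _ in range(4)]
for box in boxes:
    print(box)
```

A larger `min_scale` keeps more of each image in every crop, at the cost of less variety between epochs.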

### Data Augmentation

### 数据增强

*Data augmentation* refers to creating random variations of our input data, such that they appear different, but do not actually change the meaning of the data. Examples of common data augmentation techniques for images are rotation, flipping, perspective warping, brightness changes and contrast changes. For natural photo images such as the ones we are using here, a standard set of augmentations that we have found work pretty well are provided with the `aug_transforms` function. Because our images are now all the same size, we can apply these augmentations to an entire batch of them using the GPU, which will save a lot of time. To tell fastai we want to use these transforms on a batch, we use the `batch_tfms` parameter (note that we're not using `RandomResizedCrop` in this example, so you can see the differences more clearly; we're also using double the amount of augmentation compared to the default, for the same reason):

*数据增强* 是指为我们输入的数据创建随机变体，使它们看起来有所不同，但实际上不会改变数据的含义。图像的常见数据增强技术的示例是旋转、翻转、透视扭曲、亮度变化和对比度变化。对于我们在这里使用的自然照片图像，`aug_transforms`函数提供了一组标准的增强效果，我们发现它们效果很好。由于我们的图像现在都一样大，因此我们可以使用显卡将这些增强应用于整批图像，这将节省大量时间。为了告诉fastai我们想在批处理中使用这些转换，我们使用了 `batch_tfms` 参数 (请注意，在本例中，我们没有使用 `RandomResizedCrop`，因此，你可以更清楚地看到差异; 出于同样的原因，我们还使用了两倍于默认值的增量):

In [None]:
bears = bears.new(item_tfms=Resize(128), batch_tfms=aug_transforms(mult=2))
dls = bears.dataloaders(path)
dls.train.show_batch(max_n=8, nrows=2, unique=True)
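
The core idea, random variations that change the pixels but not the label, can be shown on a toy scale without any libraries. The sketch below treats a tiny grayscale image as a list of rows and applies a random horizontal flip and a random brightness shift. It is a deliberately simplified illustration with made-up values; it does not reproduce `aug_transforms`, which works on whole batches on the GPU and includes many more kinds of transform:

```python
import random

def augment(image, rng, max_brightness_change=0.2):
    "Return a randomly flipped and brightened copy of a grayscale image (rows of floats in [0, 1])."
    out = [row[:] for row in image]              # copy, so the original stays untouched
    if rng.random() < 0.5:                       # horizontal flip half of the time
        out = [row[::-1] for row in out]
    shift = rng.uniform(-max_brightness_change, max_brightness_change)
    # Apply the brightness shift and clamp every pixel back into [0, 1]
    return [[min(1.0, max(0.0, px + shift)) for px in row] for row in out]

image = [[0.0, 0.5, 1.0],
         [0.2, 0.4, 0.6]]
label = "grizzly"

rng = random.Random(0)
# Each variant looks different, but the label is carried over unchanged
variants = [(augment(image, rng), label) for _ in range(3)]
for img, lab in variants:
    print(lab, img)
```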

Now that we have assembled our data in a format fit for model training, let's actually train an image classifier using it.

现在我们已经以适合模型训练的格式收集了数据，让我们使用它来训练一个图像分类器。

## Training Your Model, and Using It to Clean Your Data

## 训练你的模型，并使用它来清洗你的数据

Time to use the same lines of code as in <<chapter_intro>> to train our bear classifier.


是时候使用与 <<chapter_intro>> 中相同的代码行来训练我们的熊分类器了。

We don't have a lot of data for our problem (150 pictures of each sort of bear at most), so to train our model, we'll use `RandomResizedCrop` with an image size of 224 px, which is fairly standard for image classification, and default `aug_transforms`:

对于我们的问题，我们没有很多数据 (每种熊最多150张图片)，所以为了训练我们的模型，我们将使用 `RandomResizedCrop`，图像大小为224 px，这对于图像分类来说是相当标准的，并且默认为 `aug_transforms`:

In [None]:
bears = bears.new(
    item_tfms=RandomResizedCrop(224, min_scale=0.5),
    batch_tfms=aug_transforms())
dls = bears.dataloaders(path)

We can now create our `Learner` and fine-tune it in the usual way:

我们现在可以创建我们的 `Learner`，并以常规方式对其进行精调:

In [None]:
learn = cnn_learner(dls, resnet18, metrics=error_rate)
learn.fine_tune(4)

Now let's see whether the mistakes the model is making are mainly thinking that grizzlies are teddies (that would be bad for safety!), or that grizzlies are black bears, or something else. To visualize this, we can create a *confusion matrix*:

现在让我们看看模型犯的错误是否主要是把灰熊认成泰迪熊 (这极不安全!)，或者把灰熊认成黑熊，或者别的什么。为了可视化这一点，我们可以创建一个 *混淆矩阵* :

In [None]:
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()

The rows represent all the black, grizzly, and teddy bears in our dataset, respectively. The columns represent the images which the model predicted as black, grizzly, and teddy bears, respectively. Therefore, the diagonal of the matrix shows the images which were classified correctly, and the off-diagonal cells represent those which were classified incorrectly. This is one of the many ways that fastai allows you to view the results of your model. It is (of course!) calculated using the validation set. With the color-coding, the goal is to have white everywhere except the diagonal, where we want dark blue. Our bear classifier isn't making many mistakes!


每一行分别代表我们数据集中的所有黑熊、灰熊和泰迪熊。每一列分别代表模型预测为黑熊、灰熊和泰迪熊的图像。因此，矩阵的对角线显示正确分类的图像，非对角线单元格表示错误分类的图像。这是fastai让你查看模型结果的众多方式之一。它 (当然!) 是使用验证集计算。使用颜色编码，目标是在对角线以外的所有地方都是白色，而对角线是我们想要的深蓝色。我们的熊分类器没有犯很多错误!
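
To make the row and column convention concrete, here is a small sketch that builds such a matrix by hand from lists of actual and predicted labels. The six labels are made up for the example and are not results from our model:

```python
def confusion_matrix(actual, predicted, classes):
    "Count how often each actual class (row) received each predicted class (column)."
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

classes = ["black", "grizzly", "teddy"]
# Toy labels, invented for illustration: one black bear is mistaken for a grizzly
actual    = ["black", "black",   "grizzly", "grizzly", "teddy", "teddy"]
predicted = ["black", "grizzly", "grizzly", "grizzly", "teddy", "teddy"]

cm = confusion_matrix(actual, predicted, classes)
for name, row in zip(classes, cm):
    print(f"{name:>8}", row)

# The diagonal holds the correctly classified items
n_correct = sum(cm[i][i] for i in range(len(classes)))
print(n_correct, "of", len(actual), "correct")
```

The single mistake shows up in the "black" row and the "grizzly" column, which is exactly how an off-diagonal cell should be read.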

It's helpful to see where exactly our errors are occurring, to see whether they're due to a dataset problem (e.g., images that aren't bears at all, or are labeled incorrectly, etc.), or a model problem (perhaps it isn't handling images taken with unusual lighting, or from a different angle, etc.). To do this, we can sort our images by their *loss*.


了解我们的错误到底发生在哪里，了解它们是否是由于数据集问题 (例如，根本不是熊的图像，或者标记不正确，等等。)，或者模型问题 (也许它无法处理用异常照明或从不同角度等拍摄的图像)。为此，我们可以根据它们的 *损失* 对我们的图像进行排序。

The loss is a number that is higher if the model is incorrect (especially if it's also confident of its incorrect answer), or if it's correct, but not confident of its correct answer. In a couple of chapters we'll learn in depth how loss is calculated and used in the training process. For now, `plot_top_losses` shows us the images with the highest loss in our dataset. As the title of the output says, each image is labeled with four things: prediction, actual (target label), loss, and probability. The *probability* here is the confidence level, from zero to one, that the model has assigned to its prediction:

如果模型不正确 (特别是如果它对错误答案置信)，或者如果它是正确的，但对正确答案不置信，损失会更高。在下面几章中，我们将深入了解如何在培训过程中计算和使用损失。现在，`plot_top_losses` 向我们展示了数据集中损失最高的图像。正如输出的标题所说，每个图像都标有四个东西: 预测，实际 (目标标签)，损失和概率。这里的 *概率* 是模型分配给其预测的置信度，从零到一:

In [None]:
interp.plot_top_losses(5, nrows=1)

This output shows that the image with the highest loss is one that has been predicted as "grizzly" with high confidence. However, it's labeled (based on our Bing image search) as "black." We're not bear experts, but it sure looks to us like this label is incorrect! We should probably change its label to "grizzly."


此输出显示损失最高的图像是已预测为具有高置信度的 “灰熊” 的图像。然而，它被标记为 “黑熊” (基于我们的必应图像搜索)。我们不是熊专家，但在我们看来，这个标签确实不正确！我们应该把它的标签改成 “灰熊”。
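
The earlier description of loss, higher when the model is confidently wrong and also when it is right but unsure, can be illustrated with a few numbers. The sketch below assumes the loss is cross-entropy, a common choice for classifiers; the details of how loss is actually calculated are left to later chapters, and the probabilities here are made up:

```python
import math

def cross_entropy(prob_of_true_class):
    "Loss for one item: the negative log of the probability given to the correct class."
    return -math.log(prob_of_true_class)

confident_wrong = cross_entropy(0.01)  # the model put only 1% on the true class
unsure_right = cross_entropy(0.40)     # the true class wins, but with low confidence
confident_right = cross_entropy(0.99)  # the true class gets 99%

for name, loss in [("confident and wrong", confident_wrong),
                   ("right but unsure", unsure_right),
                   ("confident and right", confident_right)]:
    print(f"{name:>20}: {loss:.3f}")
```

Sorting images by this kind of number puts the confidently wrong predictions first, which is why mislabeled images tend to surface at the top of the list.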

The intuitive approach to doing data cleaning is to do it *before* you train a model. But as you've seen in this case, a model can actually help you find data issues more quickly and easily. So, we normally prefer to train a quick and simple model first, and then use it to help us with data cleaning.


进行数据清理的直观方法是在训练模型 *之前* 进行清理。但是正如你在本例中所看到的，模型实际上可以帮助您更快速、更轻松地找到数据问题。因此，我们通常更喜欢先训练一个快速简单的模型，然后用它来帮助我们清理数据。

fastai includes a handy GUI for data cleaning called `ImageClassifierCleaner` that allows you to choose a category and the training versus validation set and view the highest-loss images (in order), along with menus to allow images to be selected for removal or relabeling:

fastai包括一个方便的接口，用于数据清理，称为 `ImageClassifierCleaner`，允许你选择类别和训练与验证集，并查看损失最大的图像 (按顺序)，以及允许选择移除或重新标记图像的菜单:

In [None]:
#hide_output
cleaner = ImageClassifierCleaner(learn)
cleaner

<img alt="Cleaner widget" width="700" src="images/att_00007.png">



In [None]:
#hide
# for idx in cleaner.delete(): cleaner.fns[idx].unlink()
# for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)

We can see that amongst our "black bears" is an image that contains two bears: one grizzly, one black. So, we should choose `<Delete>` in the menu under this image. `ImageClassifierCleaner` doesn't actually do the deleting or changing of labels for you; it just returns the indices of items to change. So, for instance, to delete (`unlink`) all images selected for deletion, we would run:


我们可以看到，在我们的 “黑熊” 中，有一个包含两只熊的图像: 一只灰熊，一只黑熊。因此，我们应该在此图像下的菜单中选择 `<Delete>`。`ImageClassifierCleaner` 实际上并没有为你删除或更改标签; 它只是返回要更改的项目的索引。因此，例如，要删除 (`unlink`) 所有选择删除的图像，我们将运行:

```python
for idx in cleaner.delete(): cleaner.fns[idx].unlink()
```


To move images for which we've selected a different category, we would run:


要移动我们选择的图像到其他类别，我们将运行:

```python
for idx,cat in cleaner.change(): shutil.move(str(cleaner.fns[idx]), path/cat)
```
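
Both steps can be tried safely on throwaway files first. The sketch below is a hedged illustration: `apply_cleaning` is a hypothetical helper, and the index lists stand in for whatever selections were made in the widget. It works in a temporary folder with fake files, deleting one and relabeling another by moving it into a different category folder:

```python
import shutil
import tempfile
from pathlib import Path

def apply_cleaning(fns, to_delete, to_change, root):
    "Delete the files at indices `to_delete`; move those in `to_change` to root/<category>."
    for idx in to_delete:
        fns[idx].unlink()
    for idx, cat in to_change:
        target_dir = Path(root) / cat
        target_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(fns[idx]), str(target_dir / fns[idx].name))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    fns = []
    for cat, name in [("black", "a.jpg"), ("black", "b.jpg"), ("grizzly", "c.jpg")]:
        (root / cat).mkdir(exist_ok=True)
        fn = root / cat / name
        fn.write_bytes(b"fake image bytes")
        fns.append(fn)
    # Pretend a.jpg was marked for deletion and b.jpg was relabeled as a grizzly
    apply_cleaning(fns, to_delete=[0], to_change=[(1, "grizzly")], root=root)
    result = sorted(p.relative_to(root).as_posix() for p in root.rglob("*.jpg"))

print(result)
```

Because the label comes from the folder name, moving a file is all it takes to relabel it.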


> s: Cleaning the data and getting it ready for your model are two of the biggest challenges for data scientists; they say it takes 90% of their time. The fastai library aims to provide tools that make it as easy as possible.


> S: 清理数据并为模型做好准备是数据科学家面临的两个最大挑战; 他们说这需要90%的时间。fastai库旨在提供尽可能简单的工具。

We'll be seeing more examples of model-driven data cleaning throughout this book. Once we've cleaned up our data, we can retrain our model. Try it yourself, and see if your accuracy improves!

我们将在本书中看到更多模型驱动的数据清理示例。一旦我们清理了数据，我们就可以重新训练我们的模型。自己尝试一下，看看你的准确性是否提高了！

> note: No Need for Big Data: After cleaning the dataset using these steps, we generally are seeing 100% accuracy on this task. We even see that result when we download a lot fewer images than the 150 per class we're using here. As you can see, the common complaint that _you need massive amounts of data to do deep learning_ can be a very long way from the truth!

> 注意: 不需要大数据: 在使用这些步骤清理数据集后，我们通常会看到这项任务的准确率为100%。当我们下载的图像比我们在这里使用的每类150张的图像少得多时，我们甚至会看到这个结果。正如你所看到的，常见的抱怨 _你需要大量的数据来做深度学习_ 可能与真相相去甚远！

Now that we have trained our model, let's see how we can deploy it to be used in practice.

现在我们已经训练了我们的模型，让我们看看如何部署它以在实践中使用。

## Turning Your Model into an Online Application

## 把你的模型变成一个在线应用程序

We are now going to look at what it takes to turn this model into a working online application. We will just go as far as creating a basic working prototype; we do not have the scope in this book to teach you all the details of web application development generally.

现在，我们来看看将这种模型转变为可运行的在线应用程序所需的条件。我们将尽可能创建一个基本的工作原型; 在这本书里，我们没法教你web应用程序开发的所有细节。

### Using the Model for Inference

### 使用模型进行推理

Once you've got a model you're happy with, you need to save it, so that you can then copy it over to a server where you'll use it in production. Remember that a model consists of two parts: the *architecture* and the trained *parameters*. The easiest way to save the model is to save both of these, because that way when you load a model you can be sure that you have the matching architecture and parameters. To save both parts, use the `export` method.


一旦你有了一个你满意的模型，你需要保存它，这样你就可以把它复制到一个服务器上，在那里你可以在作品中使用它。请记住，模型由两部分组成: *架构* 和经过训练的 *参数*。保存模型的最简单方法是保存这两者，因为这样当你加载模型时，可以确保你拥有匹配的架构和参数。要保存这两个部分，请使用 `export` 方法。

This method even saves the definition of how to create your `DataLoaders`. This is important, because otherwise you would have to redefine how to transform your data in order to use your model in production. fastai automatically uses your validation set `DataLoader` for inference by default, so your data augmentation will not be applied, which is generally what you want.


此方法甚至保存了如何创建 `DataLoaders` 的定义。这很重要，否则你将不得不重新定义如何转换数据，以便在作品中使用你的模型。默认情况下，fastai会自动使用你的验证集 `DataLoader` 进行推断，因此你的数据增强不会被应用，而这通常就是你想要的。

When you call `export`, fastai will save a file called "export.pkl":

当你调用 `export` 时，fastai将保存一个名为 “export.pkl” 的文件:

In [None]:
learn.export()

Let's check that the file exists, by using the `ls` method that fastai adds to Python's `Path` class:

让我们通过使用fastai添加到Python的 `Path` 类的 `ls` 方法来检查文件是否存在：

In [None]:
path = Path()
path.ls(file_exts='.pkl')

You'll need this file wherever you deploy your app to. For now, let's try to create a simple app within our notebook.


无论你将应用程序部署到何处，都需要此文件。现在，让我们尝试在笔记本中创建一个简单的应用程序。

When we use a model for getting predictions, instead of training, we call it *inference*. To create our inference learner from the exported file, we use `load_learner` (in this case, this isn't really necessary, since we already have a working `Learner` in our notebook; we're just doing it here so you can see the whole process end-to-end):

当我们使用模型来获取预测时，而不是训练，我们称之为 *推断*。为了从导出的文件中创建我们的推理学习器，我们使用 `load_learner` (在这种情况下，这并不是真正必要的，因为我们的笔记本中已经有了一个有效的 `Learner`; 我们只是在这里做，这样你就可以端到端地看到整个过程):

In [None]:
learn_inf = load_learner(path/'export.pkl')

When we're doing inference, we're generally just getting predictions for one image at a time. To do this, pass a filename to `predict`:

当我们做推断时，我们通常一次只得到一个图像的预测。为此，将文件名传递给 `predict`:

In [None]:
learn_inf.predict('images/grizzly.jpg')

This has returned three things: the predicted category in the same format you originally provided (in this case that's a string), the index of the predicted category, and the probabilities of each category. The last two are based on the order of categories in the *vocab* of the `DataLoaders`; that is, the stored list of all possible categories. At inference time, you can access the `DataLoaders` as an attribute of the `Learner`:

这返回了三个内容: 与你最初提供的格式相同的预测类别 (在本例中是字符串)、预测类别的索引以及每个类别的概率。后两者基于 `DataLoaders` 的 *词汇表* 中的类别顺序; 即所有可能类别的存储列表。在推断时，你可以访问 `DataLoaders` 作为 `Learner` 的属性:

In [None]:
learn_inf.dls.vocab

We can see here that if we index into the vocab with the integer returned by `predict` then we get back "grizzly," as expected. Also, note that if we index into the list of probabilities, we see a nearly 1.00 probability that this is a grizzly.

我们可以在这里看到，如果我们用 `predict` 返回的整数索引到词汇表，那么我们会像预期的那样返回 “灰熊”。另外，请注意，如果我们索引到概率列表中，我们会看到将近1.00的概率是灰熊。
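
The relationship between the three returned values can be sketched in plain Python. The probabilities below are made up, and `decode_prediction` is a hypothetical helper that only mirrors the shape of the result described above:

```python
def decode_prediction(probs, vocab):
    "Return (label, index, probs) for the most probable category."
    idx = max(range(len(probs)), key=lambda i: probs[i])  # position of the largest probability
    return vocab[idx], idx, probs

vocab = ["black", "grizzly", "teddy"]  # the stored list of possible categories
probs = [0.002, 0.997, 0.001]          # made-up probabilities, one per category

label, idx, probs_out = decode_prediction(probs, vocab)
print(label, idx, probs_out[idx])
```

The index is what ties everything together: it selects both the category name from the vocab and the matching probability.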

We know how to make predictions from our saved model, so we have everything we need to start building our app. We can do it directly in a Jupyter notebook.

我们知道如何从我们保存的模型中进行预测，所以我们拥有开始构建应用程序所需的一切。我们可以直接在Jupyter笔记本中完成。

### Creating a Notebook App from the Model

### 从模型创建笔记本应用程序

To use our model in an application, we can simply treat the `predict` method as a regular function. Therefore, creating an app from the model can be done using any of the myriad of frameworks and techniques available to application developers.


要在应用程序中使用我们的模型，我们可以简单地将 `predict` 方法视为常规函数。因此，从模型创建应用程序可以使用应用程序开发人员可用的无数框架和技术中的任何一个来完成。

However, most data scientists are not familiar with the world of web application development. So let's try using something that you do, at this point, know: it turns out that we can create a complete working web application using nothing but Jupyter notebooks! The two things we need to make this happen are:


然而，大多数数据科学家并不熟悉web应用程序开发领域。因此，让我们尝试使用你目前所做的事情，在这一点上，你需要知道: 事实证明，我们可以使用Jupyter笔记本创建一个完整的工作web应用程序！我们需要做的两件事是:

- IPython widgets (ipywidgets)
- Voilà


- IPython小组件 (ipywidgets)
- Voilà

*IPython widgets* are GUI components that bring together JavaScript and Python functionality in a web browser, and can be created and used within a Jupyter notebook. For instance, the image cleaner that we saw earlier in this chapter is entirely written with IPython widgets. However, we don't want to require users of our application to run Jupyter themselves.


*IPython小组件* 是在web浏览器中汇集JavaScript和Python功能的接口组件，可以在Jupyter笔记本中创建和使用。例如，我们在本章前面看到的图像清理器完全是用IPython小部件编写的。但是，我们不想要求应用程序的用户自己运行Jupyter。

That is why *Voilà* exists. It is a system for making applications consisting of IPython widgets available to end users, without them having to use Jupyter at all. Voilà is taking advantage of the fact that a notebook _already is_ a kind of web application, just a rather complex one that depends on another web application: Jupyter itself. Essentially, it helps us automatically convert the complex web application we've already implicitly made (the notebook) into a simpler, easier-to-deploy web application, which functions like a normal web application rather than like a notebook.


这就是为什么 *Voilà* 存在。这是一个系统，用于使最终用户可以使用由IPython小组件组成的应用程序，而不需要使用Jupyter。Voilà 正在利用一个事情，即笔记本已经是一种web应用程序，只是一个相当复杂的依赖于另一个web应用程序的应用程序: Jupyter本身。本质上，它帮助我们自动将我们已经隐式制作的复杂web应用程序 (笔记本) 转换为一个更简单、更易于部署的web应用程序，它的功能像一个普通的web应用程序，而不是像一个笔记本。

But we still have the advantage of developing in a notebook, so with ipywidgets, we can build up our GUI step by step. We will use this approach to create a simple image classifier. First, we need a file upload widget:

但是我们仍然有在笔记本中开发的优势，所以有了ipywidgets，我们可以一步一步地建立我们的图形用户接口。我们将使用这种方法创建一个简单的图像分类器。首先，我们需要一个文件上传小组件:

In [None]:
#hide_output
btn_upload = widgets.FileUpload()
btn_upload

<img alt="An upload button" width="159" src="images/att_00008.png">



Now we can grab the image:

现在我们可以抓取图像:

In [None]:
#hide
# For the book, we can't actually click an upload button, so we fake it
btn_upload = SimpleNamespace(data = ['images/grizzly.jpg'])

In [None]:
img = PILImage.create(btn_upload.data[-1])

<img alt="Output widget representing the image" width="117" src="images/att_00009.png">



We can use an `Output` widget to display it:

我们可以使用一个 `Output` 小组件来显示它:

In [None]:
#hide_output
out_pl = widgets.Output()
out_pl.clear_output()
with out_pl: display(img.to_thumb(128,128))
out_pl

<img alt="Output widget representing the image" width="117" src="images/att_00009.png">


Then we can get our predictions:

然后我们可以得到我们的预测:

In [None]:
pred,pred_idx,probs = learn_inf.predict(img)

and use a `Label` to display them:

并使用一个 `Label`  来显示它们:

In [None]:
#hide_output
lbl_pred = widgets.Label()
lbl_pred.value = f'Prediction: {pred}; Probability: {probs[pred_idx]:.04f}'
lbl_pred

`Prediction: grizzly; Probability: 1.0000`

`预测: 灰熊; 概率: 1.0000`

We'll need a button to do the classification. It looks exactly like the upload button:

我们需要一个按钮来进行分类。它看起来就像是上传按钮:

In [None]:
#hide_output
btn_run = widgets.Button(description='Classify')
btn_run

We'll also need a *click event handler*; that is, a function that will be called when it's pressed. We can just copy over the lines of code from above:

我们还需要一个 *点击事件处理程序*; 也就是说，一个在按下时会被调用的函数。我们可以从上面复制代码行:

In [None]:
def on_click_classify(change):
    img = PILImage.create(btn_upload.data[-1])
    out_pl.clear_output()
    with out_pl: display(img.to_thumb(128,128))
    pred,pred_idx,probs = learn_inf.predict(img)
    lbl_pred.value = f'Prediction: {pred}; Probability: {probs[pred_idx]:.04f}'

btn_run.on_click(on_click_classify)

You can test the button now by pressing it, and you should see the image and predictions update automatically!

你现在可以通过按下按钮来测试它，你应该会看到图像和预测自动更新!
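If the callback pattern is new to you, it may help to see it stripped of any GUI. The sketch below uses `SketchButton`, a hypothetical stand-in class written for this example (it is not the ipywidgets implementation), to show what `on_click` does conceptually: it stores your function and calls it later with the button as its single argument, which is why `on_click_classify` accepts one parameter.

```python
class SketchButton:
    """Hypothetical stand-in for a button widget: stores callbacks and calls them on click."""
    def __init__(self, description):
        self.description = description
        self._handlers = []

    def on_click(self, handler):
        # Register a function to be called later; nothing runs yet.
        self._handlers.append(handler)

    def click(self):
        # Simulate a user press: every handler receives the button itself.
        for handler in self._handlers:
            handler(self)

log = []
btn = SketchButton('Classify')
btn.on_click(lambda b: log.append(f'{b.description} pressed'))
btn.click()
print(log)
```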

We can now put them all in a vertical box (`VBox`) to complete our GUI:

我们现在可以将它们全部放在一个垂直框 (`VBox`) 中，以完成我们的图形用户接口:

In [None]:
#hide
#Putting back btn_upload to a widget for next cell
btn_upload = widgets.FileUpload()

In [None]:
#hide_output
VBox([widgets.Label('Select your bear!'), 
      btn_upload, btn_run, out_pl, lbl_pred])

<img alt="The whole widget" width="233" src="images/att_00011.png">


We have written all the code necessary for our app. The next step is to convert it into something we can deploy.

我们已经为我们的应用程序编写了所有必要的代码。下一步是将其转换为我们可以部署的内容。

### Turning Your Notebook into a Real App

### 把你的笔记本变成一个真正的应用程序

In [None]:
#hide
# !pip install voila
# !jupyter serverextension enable voila --sys-prefix

Now that we have everything working in this Jupyter notebook, we can create our application. To do this, start a new notebook and add to it only the code needed to create and show the widgets that you need, and markdown for any text that you want to appear. Have a look at the *bear_classifier* notebook in the book's repo to see the simple notebook application we created.


现在我们已经在这个Jupyter笔记本中工作了，我们可以创建我们的应用程序。为此，启动一个新的笔记本，只添加创建和显示所需小部件所需的代码，并markdown你想要显示的任何文本。看看本书的repo中的 *熊分类器* 笔记本，看看我们创建的简单笔记本应用程序。

Next, install Voilà if you haven't already, by copying these lines into a notebook cell and executing it:


接下来，如果你还没有安装voilà，请将这些行复制到笔记本单元格中并执行它:

    !pip install voila
    !jupyter serverextension enable voila --sys-prefix



Cells that begin with a `!` do not contain Python code, but instead contain code that is passed to your shell (bash, Windows PowerShell, etc.). If you are comfortable using the command line, which we'll discuss more later in this book, you can of course simply type these two lines (without the `!` prefix) directly into your terminal. In this case, the first line installs the `voila` library and application, and the second connects it to your existing Jupyter notebook.


以 `!` 开头的单元格不包含Python代码，而是包含传递给shell的代码 (bash、Windows PowerShell等)。如果你习惯使用命令行（我们将在本书稍后部分讨论），你当然可以直接在终端中键入这两行命令 (不带 `!` 前缀)。在这种情况下，第一行安装 `voila` 库和应用程序，第二行将其连接到现有的Jupyter笔记本。

Voilà runs Jupyter notebooks just like the Jupyter notebook server you are using now does, but it also does something very important: it removes all of the cell inputs, and only shows output (including ipywidgets), along with your markdown cells. So what's left is a web application! To view your notebook as a Voilà web application, replace the word "notebooks" in your browser's URL with: "voila/render". You will see the same content as your notebook, but without any of the code cells.


Voilà 运行Jupyter笔记本，就像你现在使用的Jupyter笔记本服务器一样，但它也做了一些非常重要的事情: 它删除了所有单元格输入，并且只显示输出 (包括ipywidgets) 以及你的markdown单元格。所以剩下的是一个web应用程序！要将你的笔记本视为Voilà的web应用程序，请将浏览器URL中的 “笔记本” 一词替换为: “voila/render”。你将看到与笔记本相同的内容，但没有任何代码单元格。
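The URL change described above is a simple string substitution. As a sketch, assuming a classic Jupyter server running locally (the host, port, and `to_voila_url` helper name are illustrative assumptions):

```python
def to_voila_url(notebook_url):
    """Swap the first '/notebooks/' path segment for '/voila/render/'."""
    if '/notebooks/' not in notebook_url:
        raise ValueError('expected a Jupyter URL containing /notebooks/')
    return notebook_url.replace('/notebooks/', '/voila/render/', 1)

# Example with an assumed local server address
print(to_voila_url('http://localhost:8888/notebooks/bear_classifier.ipynb'))
```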

Of course, you don't need to use Voilà or ipywidgets. Your model is just a function you can call (`pred,pred_idx,probs = learn.predict(img)`), so you can use it with any framework, hosted on any platform. And you can take something you've prototyped in ipywidgets and Voilà and later convert it into a regular web application. We're showing you this approach in the book because we think it's a great way for data scientists and other folks that aren't web development experts to create applications from their models.


当然，你不需要使用voilà 或ipywidgets。你的模型只是一个你可以调用的函数，所以你可以在任何平台上使用它。你可以把你在ipywidgets和voilà 中原型化的东西转换成常规的web应用程序。我们在书中向你展示了这种方法，因为我们认为这是数据科学家和其他不是web开发专家的人从他们的模型中创建应用程序的一个好方法。

We have our app, now let's deploy it!

我们有了应用程序，现在让我们来部署它!

### Deploying Your App

### 部署你的应用程序

As you now know, you need a GPU to train nearly any useful deep learning model. So, do you need a GPU to use that model in production? No! You almost certainly *do not need a GPU to serve your model in production*. There are a few reasons for this:


正如你现在所知道的，你需要一个显卡来训练几乎任何有用的深度学习模型。那么，你需要一个显卡来在生产环境中使用这个模型吗？不！你几乎可以肯定 *不需要显卡来为生产环境中的模型服务*。这有几个原因:

- As we've seen, GPUs are only useful when they do lots of identical work in parallel. If you're doing (say) image classification, then you'll normally be classifying just one user's image at a time, and there isn't normally enough work to do in a single image to keep a GPU busy for long enough for it to be very efficient. So, a CPU will often be more cost-effective.
- An alternative could be to wait for a few users to submit their images, and then batch them up and process them all at once on a GPU. But then you're asking your users to wait, rather than getting answers straight away! And you need a high-volume site for this to be workable. If you do need this functionality, you can use a tool such as Microsoft's [ONNX Runtime](https://github.com/microsoft/onnxruntime) or [AWS SageMaker](https://aws.amazon.com/sagemaker/).
- The complexities of dealing with GPU inference are significant. In particular, the GPU's memory will need careful manual management, and you'll need a careful queueing system to ensure you only process one batch at a time.
- There's a lot more market competition in CPU than GPU servers, as a result of which there are much cheaper options available for CPU servers.


- 正如我们所看到的，显卡只有在并行执行大量相同工作时才有用。如果你正在做 (比如说) 图像分类，那么你通常一次只对一个用户的图像进行分类，通常在单个图像中没有足够的工作来做，以是显卡保持足够长的工作时间以获得高效。因此，CPU通常会更具成本效益。
- 另一种选择是等待几个用户提交他们的图像，然后批处理它们，并在显卡上一次处理它们。但是你要让你的用户等待，而不是马上得到答案！你需要一个高容量的网站才能做到这一点。如果你确实需要这个功能，你可以使用一个工具，比如微软的 [ONNX Runtime](https://github.com/microsoft/onnxruntime) 或 [AWS Sagemaker](https://aws.amazon.com/sagemaker/)
- 处理显卡推断非常复杂。特别是，显卡的内存需要仔细的手动管理，你需要一个严谨的排队系统来确保一次只处理一批。
- CPU比显卡服务器有更多的市场竞争，因此CPU服务器有更便宜的选择。
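To give a feel for the batching alternative mentioned in the list above, and for why it makes users wait, here is a minimal sketch of collecting requests into a batch. It uses only the standard library; `collect_batch`, the batch size, and the wait time are assumptions chosen for illustration, and a production system would need much more (error handling, result routing, GPU memory management).

```python
import queue
import time

def collect_batch(requests, max_size=4, max_wait=0.05):
    """Pull up to `max_size` items from `requests`, waiting at most `max_wait` seconds in total."""
    batch = []
    deadline = time.monotonic() + max_wait
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

q = queue.Queue()
for name in ['a.jpg', 'b.jpg', 'c.jpg']:
    q.put(name)
first = collect_batch(q, max_size=2)    # a full batch is available, so this returns at once
second = collect_batch(q, max_size=2)   # only one item is left, so this waits out the timeout
print(first, second)
```

The second call illustrates the trade-off: on a low-traffic site the lone request sits idle until the wait expires, which is delay added for no benefit.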

Because of the complexity of GPU serving, many systems have sprung up to try to automate this. However, managing and running these systems is also complex, and generally requires compiling your model into a different form that's specialized for that system. It's typically preferable to avoid dealing with this complexity until/unless your app gets popular enough that it makes clear financial sense for you to do so.

由于显卡服务的复杂性，许多系统涌现出来试图实现自动化。但是，管理和运行这些系统也很复杂，通常需要将模型编译成专门针对该系统的不同形式。通常最好避免处理这种复杂性，直到/除非你的应用程序变得足够受欢迎，以至于你这样做有明显的经济意义。

For at least the initial prototype of your application, and for any hobby projects that you want to show off, you can easily host them for free. The best place and the best way to do this will vary over time, so check the [book's website](https://book.fast.ai/) for the most up-to-date recommendations. As we're writing this book in early 2020, the simplest (and free!) approach is to use [Binder](https://mybinder.org/). To publish your web app on Binder, follow these steps:


至少对于应用程序的初始原型，以及你想展现的任何爱好项目，你都可以轻松地免费托管它们。最好的地方和最好的方法会随着时间的推移而变化，所以请查阅 [图书网站](https://book.fast.ai/) 上最新的建议。当我们在2020年初写这本书时，最简单 (而且免费!) 的方法是使用 [Binder](https://mybinder.org/)。要在Binder上发布web应用，请按照以下步骤操作:

1. Add your notebook to a [GitHub repository](http://github.com/).
2. Paste the URL of that repo into Binder's URL, as shown in <<deploy-binder>>.
3. Change the File dropdown to instead select URL.
4. In the "URL to open" field, enter `/voila/render/name.ipynb` (replacing `name` with the name of your notebook).
5. Click the clipboard button at the bottom right to copy the URL and paste it somewhere safe.
6. Click Launch.

    
1. 将你的笔记本添加到 [GitHub仓库](http://github.com/)。
2. 将该代码仓库的URL粘贴到Binder的URL中，如 <<deploy-binder>> 所示。
3. 将文件下拉列表改为选择URL。
4. 在 “要打开的URL” 字段中，输入 `/voila/render/name.ipynb` (将 `name` 替换为你的笔记本的名称)。
5. 点击右下角的点击板按钮复制网址并粘贴到安全的地方。
6. 单击Launch。

<img alt="Deploying to Binder" width="800" caption="Deploying to Binder" id="deploy-binder" src="images/att_00001.png">


The first time you do this, Binder will take around 5 minutes to build your site. Behind the scenes, it is finding a virtual machine that can run your app, allocating storage, collecting the files needed for Jupyter, for your notebook, and for presenting your notebook as a web application.


第一次这样做时，Binder将需要大约5分钟来构建你的网站。在幕后，它正在寻找一个可以运行你的应用程序的虚拟机，分配存储，收集Jupyter、你的笔记本以及将你的笔记本作为web应用程序呈现所需的文件。

Finally, once it has started the app running, it will navigate your browser to your new web app. You can share the URL you copied to allow others to access your app as well.


最后，一旦它启动了应用程序运行，它会将你的浏览器导航到你的新web应用。你可以共享你复制的URL，以允许其他人也访问你的应用程序。

For other (both free and paid) options for deploying your web app, be sure to take a look at the [book's website](https://book.fast.ai/).

对于部署你的网络应用程序的其他 (免费和付费) 选项，请务必查看 [图书网站](https://book.fast.ai/ )。

You may well want to deploy your application onto mobile devices, or edge devices such as a Raspberry Pi. There are a lot of libraries and frameworks that allow you to integrate a model directly into a mobile application. However, these approaches tend to require a lot of extra steps and boilerplate, and do not always support all the PyTorch and fastai layers that your model might use. In addition, the work you do will depend on what kind of mobile devices you are targeting for deployment—you might need to do some work to run on iOS devices, different work to run on newer Android devices, different work for older Android devices, etc. Instead, we recommend wherever possible that you deploy the model itself to a server, and have your mobile or edge application connect to it as a web service.


可能你希望将应用程序部署到移动设备或边缘设备 (如树莓派) 上。有许多库和框架允许你将模型直接集成到移动应用程序中。然而，这些方法往往需要大量额外的步骤和样板文件，而且并不总是支持模型可能使用的所有PyTorch和fastai层。此外，你所做的工作将取决于你部署的移动设备类型 -- 你可能需要做一些工作才能在iOS设备上运行，需要做不同的工作才能在较新的安卓设备上运行，而对于较旧的安卓设备也要进行其他操作，等等。相反，我们建议你尽可能将模型本身部署到服务器，并让你的移动或边缘应用程序作为web服务连接到它。

There are quite a few upsides to this approach. The initial installation is easier, because you only have to deploy a small GUI application, which connects to the server to do all the heavy lifting. More importantly perhaps, upgrades of that core logic can happen on your server, rather than needing to be distributed to all of your users. Your server will have a lot more memory and processing capacity than most edge devices, and it is far easier to scale those resources if your model becomes more demanding. The hardware that you will have on a server is also going to be more standard and more easily supported by fastai and PyTorch, so you don't have to compile your model into a different form.


这种方法有很多优点。初始安装更容易，因为你只需要部署一个小的图形用户界面应用程序，它连接到服务器来完成所有繁重的工作。也许更重要的是，该核心逻辑的升级可以在你的服务器上进行，而不需要分发给所有用户。与大多数边缘设备相比，你的服务器将具有更多的内存和处理能力，如果你的模型要求更高，则扩展这些资源要容易得多。你在服务器上拥有的硬件也将更加标准，更容易得到fastai和PyTorch的支持，因此你不必将模型编译成其他形式。

There are downsides too, of course. Your application will require a network connection, and there will be some latency each time the model is called. (It takes a while for a neural network model to run anyway, so this additional network latency may not make a big difference to your users in practice. In fact, since you can use better hardware on the server, the overall latency may even be less than if it were running locally!) Also, if your application uses sensitive data then your users may be concerned about an approach which sends that data to a remote server, so sometimes privacy considerations will mean that you need to run the model on the edge device (it may be possible to avoid this by having an *on-premise* server, such as inside a company's firewall). Managing the complexity and scaling the server can create additional overhead too, whereas if your model runs on the edge devices then each user is bringing their own compute resources, which leads to easier scaling with an increasing number of users (also known as *horizontal scaling*).

当然，也有缺点。你的应用程序将需要网络连接，并且每次调用模型时都会有一些延迟。(无论如何，神经网络模型需要一段时间才能运行，因此这种额外的网络延迟在实践中可能不会对用户产生很大影响。事实上，因为可以在服务器上使用更好的硬件，所以总延迟甚至可能比在本地运行时还要少!) 此外，如果你的应用程序使用敏感数据，那么用户可能会担心将该数据发送到远程服务器的方法，因此，有时出于隐私考虑，你需要在边缘设备上运行该模型 (通过拥有一个 *预置* 服务器，例如在公司的防火墙内)。管理复杂性和扩展服务器也会产生额外的开销，而如果你的模型在边缘设备上运行，则每个用户都将带来自己的计算资源，这使得随着用户数量的增加更容易扩展 (也称为 *水平扩展*)。
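As a rough illustration of the "model as a web service" idea, here is a sketch built on Python's standard `http.server` module. Everything specific in it is an assumption made for the example: `stub_predict` is a placeholder that stands in for a real call such as `learn_inf.predict`, the JSON field names are arbitrary, and port 8000 is just a convenient default. A real deployment would more likely use an established web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def stub_predict(image_bytes):
    """Placeholder for a real model call; always answers 'grizzly'."""
    return 'grizzly', 1, [0.0, 1.0, 0.0]

def build_response(image_bytes, predict=stub_predict):
    """Run the model on the uploaded bytes and package the answer as JSON text."""
    pred, pred_idx, probs = predict(image_bytes)
    return json.dumps({'prediction': str(pred), 'probability': float(probs[pred_idx])})

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The client sends the raw image as the request body.
        length = int(self.headers.get('Content-Length', 0))
        body = build_response(self.rfile.read(length)).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port=8000):
    """Start the server; this call blocks until the process is stopped."""
    HTTPServer(('0.0.0.0', port), PredictHandler).serve_forever()

print(build_response(b'fake image bytes'))
```

The mobile or edge application then only needs to POST an image and read back a small JSON document, so the model itself can be upgraded on the server without redistributing the app.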

> A: I've had a chance to see up close how the mobile ML landscape is changing in my work. We offer an iPhone app that depends on computer vision, and for years we ran our own computer vision models in the cloud. This was the only way to do it then since those models needed significant memory and compute resources and took minutes to process inputs. This approach required building not only the models (fun!) but also the infrastructure to ensure a certain number of "compute worker machines" were absolutely always running (scary), that more machines would automatically come online if traffic increased, that there was stable storage for large inputs and outputs, that the iOS app could know and tell the user how their job was doing, etc. Nowadays Apple provides APIs for converting models to run efficiently on device and most iOS devices have dedicated ML hardware, so that's the strategy we use for our newer models. It's still not easy but in our case it's worth it, for a faster user experience and to worry less about servers. What works for you will depend, realistically, on the user experience you're trying to create and what you personally find is easy to do. If you really know how to run servers, do it. If you really know how to build native mobile apps, do that. There are many roads up the hill.


> A: 我曾有机会近距离观察移动机器学习在我的工作中是如何变化的。我们提供依赖于计算机视觉的iPhone应用程序，多年来，我们在云中运行自己的计算机视觉模型。这是当时唯一的方法，因为这些模型需要大量的内存和计算资源，并且需要几分钟来处理输入。这种方法不仅需要建立模型 (有趣!) 而且确保一定数量的 “计算工人机器” 绝对总是运行的基础设施 (可怕)，如果流量增加，更多的机器会自动上线，大输入和输出有稳定的存储，IOS应用程序可以知道并告诉用户他们的工作进展如何，等等。如今，Apple提供了用于转换模型以在设备上高效运行的应用程序接口，并且大多数iOS设备都有专用ML硬件，因此这就是我们用于更新模型的策略。这仍然不容易，但在我们的案例中，但是在我们的案例中这是值得的，它可以带来更快的用户体验和并减少对服务器的担心。实际上，适合你的方法取决于你要尝试创建的用户体验，并且你会发现自己很容易做到。如果你真的知道如何运行服务器，那就去做吧。如果你真的知道如何构建本地移动应用程序，那就去做吧。上山的路有很多条。

Overall, we'd recommend using a simple CPU-based server approach where possible, for as long as you can get away with it. If you're lucky enough to have a very successful application, then you'll be able to justify the investment in more complex deployment approaches at that time.


总的来说，我们建议尽可能使用简单的基于CPU的服务器方法，只要你能避免这种情况。如果你幸运地拥有一个非常成功的应用程序，那么届时你将能够证明在更复杂的部署方法上的投资是合理的。

Congratulations, you have successfully built a deep learning model and deployed it! Now is a good time to take a pause and think about what could go wrong.

恭喜您，你已经成功构建了深度学习模型并部署了它！现在是暂停一下，想想哪里会出错的好时机。

## How to Avoid Disaster

## 如何避免灾难

In practice, a deep learning model will be just one piece of a much bigger system. As we discussed at the start of this chapter, a data product requires thinking about the entire end-to-end process, from conception to use in production. In this book, we can't hope to cover all the complexity of managing deployed data products, such as managing multiple versions of models, A/B testing, canarying, refreshing the data (should we just grow and grow our datasets all the time, or should we regularly remove some of the old data?), handling data labeling, monitoring all this, detecting model rot, and so forth. In this section we will give an overview of some of the most important issues to consider; for a more detailed discussion of deployment issues we refer you to the excellent [Building Machine Learning Powered Applications](http://shop.oreilly.com/product/0636920215912.do) by Emmanuel Ameisen (O'Reilly).


实际上，深度学习模型只是更大系统的一部分。正如我们在本章开始时所讨论的，数据产品需要考虑从概念到生产中使用的整个端到端流程。在本书中，我们不能希望涵盖管理部署的数据产品的所有复杂性，例如管理模型的多个版本、A/B测试、canarying、刷新数据 (我们应该一直增加我们的数据集，还是应该定期删除一些旧数据？)，处理数据标记，监控所有这些，检测模型腐烂，等等。在本节中，我们将概述一些需要考虑的最重要的问题; 有关部署问题的更详细讨论，请参考Emmanuel Ameisen (O'Reilly)编写的[构建机器学习驱动的应用程序](http://shop.oreilly.com/product/0636920215912.do)。

One of the biggest issues to consider is that understanding and testing the behavior of a deep learning model is much more difficult than with most other code you write. With normal software development you can analyze the exact steps that the software is taking, and carefully study which of these steps match the desired behavior that you are trying to create. But with a neural network the behavior emerges from the model's attempt to match the training data, rather than being exactly defined.


要考虑的最大问题之一是，理解和测试深度学习模型的行为比您编写的大多数其他代码要困难得多。通过正常的软件开发，您可以分析软件正在采取的确切步骤，并仔细研究这些步骤中哪些与您试图创建的所需行为相匹配。但是通过神经网络，这种行为出现在模型试图匹配训练数据的过程中，而不是被精确定义。

This can result in disaster! For instance, let's say we really were rolling out a bear detection system that will be attached to video cameras around campsites in national parks, and will warn campers of incoming bears. If we used a model trained with the dataset we downloaded there would be all kinds of problems in practice, such as:


这可能会导致灾难！例如，假设我们真的推出了一个熊检测系统，该系统将连接到国家公园营地周围的摄像机上，并警告露营者即将到来的熊。如果我们使用下载的数据集训练的模型，那么在实践中将遇到各种各样的问题，例如:

- Working with video data instead of images
- Handling nighttime images, which may not appear in this dataset
- Dealing with low-resolution camera images
- Ensuring results are returned fast enough to be useful in practice
- Recognizing bears in positions that are rarely seen in photos that people post online (for example from behind, partially covered by bushes, or when a long way away from the camera)

- 处理视频数据而不是图像
- 处理夜间图像，这些图像可能不会出现在此数据集中
- 处理低分辨率相机图像
- 确保快速返回结果以能够在实践中使用
- 识别人们在网上发布的照片中很少看到的位置的熊 (例如从后面，部分被灌木丛覆盖，或者远离相机的时候)

A big part of the issue is that the kinds of photos that people are most likely to upload to the internet are the kinds of photos that do a good job of clearly and artistically displaying their subject matter—which isn't the kind of input this system is going to be getting. So, we may need to do a lot of our own data collection and labelling to create a useful system.


一个很大的问题是，人们最有可能上传到互联网上的照片类型是那些能够很好地清晰和艺术地展示主题的照片类型——这不是这个系统要得到的那种输入。因此，我们可能需要做大量自己的数据收集和标签来创建一个有用的系统。

This is just one example of the more general problem of *out-of-domain* data. That is to say, there may be data that our model sees in production which is very different to what it saw during training. There isn't really a complete technical solution to this problem; instead, we have to be careful about our approach to rolling out the technology.


这只是 *域外* 数据的更普遍问题的一个示例。也就是说，我们的模型在生产中看到的数据可能与训练中看到的数据完全不同。这个问题并没有一个完整的技术解决方案; 相反，我们必须谨慎采用该技术。

There are other reasons we need to be careful too. One very common problem is *domain shift*, where the type of data that our model sees changes over time. For instance, an insurance company may use a deep learning model as part of its pricing and risk algorithm, but over time the types of customers that the company attracts, and the types of risks they represent, may change so much that the original training data is no longer relevant.


还有其他一些原因，我们也需要注意。一个非常常见的问题是 *域转移*，我们的模型看到的数据类型会随着时间的推移而变化。例如，保险公司可能使用深度学习模型作为其定价和风险算法的一部分，但是随着时间的推移，公司吸引的客户类型及其所代表的风险类型可能会发生巨大的变化，以至于原始训练数据不再相关。

Out-of-domain data and domain shift are examples of a larger problem: you can never fully understand the entire behavior of your neural network. Neural networks have far too many parameters for anyone to analytically understand all of their possible behaviors. This is the natural downside of their best feature, their flexibility, which enables them to solve complex problems where we may not even be able to fully specify our preferred solution approaches. The good news, however, is that there are ways to mitigate these risks using a carefully thought-out process. The details of this will vary depending on the details of the problem you are solving, but we will attempt to lay out here a high-level approach, summarized in <<deploy_process>>, which we hope will provide useful guidance.

域外数据和域转移是一个较大问题的示例: 你永远无法完全理解神经网络的整个行为。他们有太多的参数，无法分析地理解他们所有可能的行为。这是其最佳特性的自然缺点——其灵活性，这使他们能够解决复杂的问题，我们甚至可能无法完全指定我们的首选解决方案方法。然而，好消息是，有办法通过仔细考虑的过程来降低这些风险。具体细节将根据你要解决的问题的具体情况而有所不同，但是我们将在此处尝试提出一种概括在<<deploy_process>>中的高级方法，希望该方法可以提供有用的指导。

<img alt="Deployment process" width="500" caption="Deployment process" id="deploy_process" src="images/att_00061.png">



Where possible, the first step is to use an entirely manual process, with your deep learning model approach running in parallel but not being used directly to drive any actions. The humans involved in the manual process should look at the deep learning outputs and check whether they make sense. For instance, with our bear classifier a park ranger could have a screen displaying video feeds from all the cameras, with any possible bear sightings simply highlighted in red. The park ranger would still be expected to be just as alert as before the model was deployed; the model is simply helping to check for problems at this point.


在可能的情况下，第一步是使用完全手动的过程，深度学习模型方法并行运行，但不直接用于驱动任何动作。参与手动过程的人应该查看深度学习的输出，并检查它们是否有意义。例如，使用我们的熊分类器，公园管理员可以有一个屏幕显示所有摄像机的视频，任何可能的熊目击都只需用红色突出显示。公园管理员仍然会像模型部署之前一样保持警惕; 该模型只是帮助检查此时的问题。
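One way to make this first step measurable is to log the human decision and the model output side by side and summarize how often they agree. The sketch below assumes such a log exists as a list of `(human_decision, model_prediction)` pairs; the function name, the log format, and the sample data are all invented for illustration.

```python
def shadow_report(records):
    """Summarize (human_decision, model_prediction) pairs logged while the model runs in parallel."""
    total = len(records)
    disagreements = [r for r in records if r[0] != r[1]]
    agreement = (total - len(disagreements)) / total if total else None
    return {'total': total, 'agreement': agreement, 'disagreements': disagreements}

# Made-up log: the ranger's call drives the action, the model's call is only recorded.
log = [('bear', 'bear'), ('no bear', 'no bear'), ('no bear', 'bear'), ('bear', 'bear')]
report = shadow_report(log)
print(report['agreement'], report['disagreements'])
```

The list of disagreements is often more useful than the agreement rate itself, since those are the cases a person should look at to decide whether the model or the human was right.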

The second step is to try to limit the scope of the model, and have it carefully supervised by people. For instance, do a small geographically and time-constrained trial of the model-driven approach. Rather than rolling our bear classifier out in every national park throughout the country, we could pick a single observation post, for a one-week period, and have a park ranger check each alert before it goes out.


第二步是尝试限制模型的范围，并让人们仔细监督它。例如，对模型驱动方法进行地理和时间限制的小型试验。与其在全国每个国家公园推出我们的熊分类器，不如在一个星期的时间内挑选一个观察站，让公管理员在警报发出之前检查每个警报。

Then, gradually increase the scope of your rollout. As you do so, ensure that you have really good reporting systems in place, to make sure that you are aware of any significant changes to the actions being taken compared to your manual process. For instance, if the number of bear alerts doubles or halves after rollout of the new system in some location, we should be very concerned. Try to think about all the ways in which your system could go wrong, and then think about what measure or report or picture could reflect that problem, and ensure that your regular reporting includes that information.

然后，逐渐扩大发布的范围。在执行此操作时，请确保已建立了非常完善的报告系统，以确保与手动流程相比，你可以意识到正在采取的操作有任何重大变化。例如，如果在某个位置推出新系统后，熊警报的数量翻倍或减半，我们应该非常关注。尝试考虑系统可能出错的所有方式，然后考虑采用何种衡量标准、报告或图片可以反映这个问题，并确保你的定期报告包含这些信息。
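The "doubles or halves" rule of thumb above is easy to turn into an automated check. This is only a sketch under assumed inputs: `baseline` and `current` are alert counts over comparable periods, and the function name and default factor of 2 are our own choices.

```python
def alert_rate_changed(baseline, current, factor=2.0):
    """Return True when `current` is at least `factor` times `baseline`, or at most 1/`factor` of it."""
    if baseline <= 0:
        return current > 0   # any alert counts as a change when there were none before
    ratio = current / baseline
    return ratio >= factor or ratio <= 1 / factor

# Example: 10 alerts per week before rollout, compared with three possible outcomes after
for after in (20, 5, 12):
    print(after, alert_rate_changed(10, after))
```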

> J: I started a company 20 years ago called _Optimal Decisions_ that used machine learning and optimization to help giant insurance companies set their pricing, impacting tens of billions of dollars of risks. We used the approaches described here to manage the potential downsides of something going wrong. Also, before we worked with our clients to put anything in production, we tried to simulate the impact by testing the end-to-end system on their previous year's data. It was always quite a nerve-wracking process, putting these new algorithms into production, but every rollout was successful.

> J: 我在20年前创办了一家名为 _最佳决策_ 的公司，该公司利用机器学习和优化来帮助大型保险公司确定定价，从而影响数百亿美元的风险。我们使用此处介绍的方法来管理出现问题的潜在弊端。此外，在与客户合作以将任何产品投入生产之前，我们试图通过测试其上一年数据的端到端系统来模拟其影响。将这些新算法投入生产总是一个令人紧张的过程，但是每次试运行都是成功的。

### Unforeseen Consequences and Feedback Loops

### 不可预见的后果和反馈循环

One of the biggest challenges in rolling out a model is that your model may change the behaviour of the system it is a part of. For instance, consider a "predictive policing" algorithm that predicts more crime in certain neighborhoods, causing more police officers to be sent to those neighborhoods, which can result in more crimes being recorded in those neighborhoods, and so on. In the Royal Statistical Society paper ["To Predict and Serve?"](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2016.00960.x), Kristian Lum and William Isaac observe that: "predictive policing is aptly named: it is predicting future policing, not future crime."


推出模型的最大挑战之一是模型可能会改变其所属系统的行为。例如，考虑一种“预测性警务”算法，该算法可以预测某些社区中的更多犯罪，从而导致将更多的警务人员发送到这些社区，这可能导致在这些社区中记录更多的犯罪，依此类推。在皇家统计学会的论文 [“预测和服务？”](https://rss.onlinelibrary.wiley.com/doi/full/10.1111/j.1740-9713.2016.00960.x)，Kristian Lum和William Isaac观察到: “预测警务恰如其名: 它预测的是未来警务，而不是未来犯罪。”

Part of the issue in this case is that in the presence of bias (which we'll discuss in depth in the next chapter), *feedback loops* can result in negative implications of that bias getting worse and worse. For instance, there are concerns that this is already happening in the US, where there is significant bias in arrest rates on racial grounds. [According to the ACLU](https://www.aclu.org/issues/smart-justice/sentencing-reform/war-marijuana-black-and-white), "despite roughly equal usage rates, Blacks are 3.73 times more likely than whites to be arrested for marijuana." The impact of this bias, along with the rollout of predictive policing algorithms in many parts of the US, led Bärí Williams to [write in the *New York Times*](https://www.nytimes.com/2017/12/02/opinion/sunday/intelligent-policing-and-my-innocent-children.html): "The same technology that’s the source of so much excitement in my career is being used in law enforcement in ways that could mean that in the coming years, my son, who is 7 now, is more likely to be profiled or arrested—or worse—for no reason other than his race and where we live."


本案例中的部分问题是，在存在偏差的情况下 (我们将在下一章中深入讨论)，*反馈循环* 可能导致该偏差的负面影响越来越严重。例如，有人担心这种情况已经在美国发生，在美国，由于种族原因，逮捕率存在重大偏差。[根据美国公民自由联盟](https://www.aclu.org/issues/smart-justice/sentencing-reform/war-marijuana-black-and-white)，“尽管使用率大致相等，但黑人因大麻被捕的可能性比白人高3.73倍。”这种偏差的影响，以及预测警务算法在美国许多地方的推出，导致bärí Williams [在 *纽约时报* 中写道](https://www.nytimes.com/2017/12/02/opinion/sunday/intelligent-policing-and-my-innocent-children.html):“在我的职业生涯中，令人兴奋的根源是同样的技术是在执法中使用的，这可能意味着在未来几年中，我现年7岁的儿子更有可能被侧写或逮捕 —— 甚至更糟 —— 除了他的种族和我们的住所以外，没有任何其他原因。”
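The mechanics of such a feedback loop can be seen in a deliberately oversimplified toy simulation. To be clear about its assumptions: this is not the analysis from the Lum and Isaac paper or a model of any real system. It assumes two areas with the same true crime rate, a single patrol that always goes to the area with the most recorded crime, and crime that is only recorded where the patrol is present.

```python
def simulate(recorded, true_rate=5, rounds=10):
    """Each round, patrol the area with the most recorded crime; only patrolled crime gets recorded."""
    recorded = list(recorded)
    for _ in range(rounds):
        target = recorded.index(max(recorded))   # the "prediction" follows past records
        recorded[target] += true_rate            # crime is only observed where police are sent
    return recorded

# Both areas have the same underlying rate; the starting records differ by one incident.
print(simulate([11, 10]))
```

Under these assumptions, a one-incident difference in the starting records decides where every later patrol goes, so the recorded data ends up reflecting the patrol schedule instead of the underlying rate.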

A helpful exercise prior to rolling out a significant machine learning system is to consider this question: "What would happen if it went really, really well?" In other words, what if the predictive power was extremely high, and its ability to influence behavior was extremely significant? In that case, who would be most impacted? What would the most extreme results potentially look like? How would you know what was really going on?


在推出重要的机器学习系统之前，一个有用的练习是考虑这个问题: “如果运行得非常好，将会发生什么？” 换句话说，如果预测能力非常高，其影响行为的能力非常显著，会怎么样？在这种情况下，谁会受到最大影响？最极端的结果可能是什么样子？你怎么知道到底发生了什么？

Such a thought exercise might help you to construct a more careful rollout plan, with ongoing monitoring systems and human oversight. Of course, human oversight isn't useful if it isn't listened to, so make sure that there are reliable and resilient communication channels so that the right people will be aware of issues, and will have the power to fix them.

这样的思想练习可能会帮助您通过持续的监视系统和人工监督来构建更仔细的发布计划。当然，如果不被倾听，人工监督是没有用的，因此请确保有可靠且有弹性的沟通渠道，以便合适的人了解问题并有能力解决问题。

## Get Writing!

## 开始写作!

One of the things our students have found most helpful to solidify their understanding of this material is to write it down. There is no better test of your understanding of a topic than attempting to teach it to somebody else. This is helpful even if you never show your writing to anybody—but it's even better if you share it! So we recommend that, if you haven't already, you start a blog. Now that you've completed Chapter 2 and have learned how to train and deploy models, you're well placed to write your first blog post about your deep learning journey. What's surprised you? What opportunities do you see for deep learning in your field? What obstacles do you see?


我们的学生发现最有助于巩固他们对这种材料的理解的一件事就是把它写下来。没有比尝试教别人更好的测试你对一个话题的理解了。即使你从未向任何人展示过你的作品，这也是有帮助的 -- 但是如果你分享它，那就更好了！所以我们建议你（如果还没有的话）创建一个博客。既然你已经完成了第2章，并学习了如何训练和部署模型，那么你就可以撰写关于深度学习之旅的第一篇博客文章了。有什么让你吃惊的？你在自己的领域看到了哪些深度学习的机会？你看到了什么障碍？

Rachel Thomas, cofounder of fast.ai, wrote in the article ["Why You (Yes, You) Should Blog"](https://medium.com/@racheltho/why-you-yes-you-should-blog-7d2544ac1045):


Fast.ai的联合创始人Rachel Thomas在文章 [“为什么你 (是的，你) 应该写博客”](https://medium.com/@racheltho/why-you-yes-you-should-blog-7d2544ac1045)中写道:

```asciidoc
____
The top advice I would give my younger self would be to start blogging sooner. Here are some reasons to blog:

* It’s like a resume, only better. I know of a few people who have had blog posts lead to job offers!
* Helps you learn. Organizing knowledge always helps me synthesize my own ideas. One of the tests of whether you understand something is whether you can explain it to someone else. A blog post is a great way to do that.
* I’ve gotten invitations to conferences and invitations to speak from my blog posts. I was invited to the TensorFlow Dev Summit (which was awesome!) for writing a blog post about how I don’t like TensorFlow.
* Meet new people. I’ve met several people who have responded to blog posts I wrote.
* Saves time. Any time you answer a question multiple times through email, you should turn it into a blog post, which makes it easier for you to share the next time someone asks.
____
```

```asciidoc
____
我给年轻时的自己最重要的建议，就是更早开始写博客。以下是一些写博客的理由:

* 这就像一份简历，只是更好。我知道有几个人因为博客文章而获得了工作机会！
* 帮助你学习。组织知识总是帮助我概括自己的想法。你是否理解某事的测试之一是你能否向别人解释它。写博客文章是一个很好的方法。
* 我因为博客文章收到了会议邀请和演讲邀请。由于写了一篇关于我不喜欢TensorFlow的博客文章，我受邀参加TensorFlow开发峰会（这太棒了！）。
* 结识新朋友。我结识了几位回复过我博客文章的人。
* 节省时间。任何时候你通过电子邮件多次回答一个问题，你都应该把它变成一篇博客文章，这样下次有人问你的时候你就更容易分享了。
____
```

Perhaps her most important tip is this: 


也许她最重要的提示是:

> : You are best positioned to help people one step behind you. The material is still fresh in your mind. Many experts have forgotten what it was like to be a beginner (or an intermediate) and have forgotten why the topic is hard to understand when you first hear it. The context of your particular background, your particular style, and your knowledge level will give a different twist to what you’re writing about.


> : 你最适合帮助比你落后一步的人。这些内容在你的脑海中仍然新鲜。许多专家已经忘记了作为初学者（或中级学习者）的感觉，也忘记了为什么这个主题在第一次听到时很难理解。你的特定背景、特定风格和知识水平会让你所写的内容别具一格。

We've provided full details on how to set up a blog in <<appendix_blog>>. If you don't have a blog already, take a look at that now, because we've got a really great approach set up for you to start blogging for free, with no ads—and you can even use Jupyter Notebook!

我们提供了有关如何在 <<appendix_blog>> 中设置博客的完整详细信息。如果你还没有博客，现在就看看，因为我们已经为你建立了一个非常好的免费博客的方法，没有广告 —— 你甚至可以使用Jupyter笔记本!

## Questionnaire

## 问卷调查

1. Provide an example of where the bear classification model might work poorly in production, due to structural or style differences in the training data.
1. Where do text models currently have a major deficiency?
1. What are possible negative societal implications of text generation models?
1. In situations where a model might make mistakes, and those mistakes could be harmful, what is a good alternative to automating a process?
1. What kind of tabular data is deep learning particularly good at?
1. What's a key downside of directly using a deep learning model for recommendation systems?
1. What are the steps of the Drivetrain Approach?
1. How do the steps of the Drivetrain Approach map to a recommendation system?
1. Create an image recognition model using data you curate, and deploy it on the web.
1. What is `DataLoaders`?
1. What four things do we need to tell fastai to create `DataLoaders`?
1. What does the `splitter` parameter to `DataBlock` do?
1. How do we ensure a random split always gives the same validation set?
1. What letters are often used to signify the independent and dependent variables?
1. What's the difference between the crop, pad, and squish resize approaches? When might you choose one over the others?
1. What is data augmentation? Why is it needed?
1. What is the difference between `item_tfms` and `batch_tfms`?
1. What is a confusion matrix?
1. What does `export` save?
1. What is it called when we use a model for getting predictions, instead of training?
1. What are IPython widgets?
1. When might you want to use CPU for deployment? When might GPU be better?
1. What are the downsides of deploying your app to a server, instead of to a client (or edge) device such as a phone or PC?
1. What are three examples of problems that could occur when rolling out a bear warning system in practice?
1. What is "out-of-domain data"?
1. What is "domain shift"?
1. What are the three steps in the deployment process?


1. 举例说明由于训练数据的结构或风格差异，熊分类模型在生产中可能效果不佳。
1. 文本模型目前在哪些方面存在重大不足？
1. 文本生成模型可能的负面社会影响是什么？
1. 在模型可能会出错，并且这些错误可能有害的情况下，有什么好的方案可以替代将流程自动化？
1. 深度学习特别擅长什么样的表格数据？
1. 在推荐系统中直接使用深度学习模型的主要缺点是什么？
1. 传动系统方法的步骤是什么？
1. 传动系统方法的步骤如何映射到推荐系统？
1. 使用你整理的数据创建图像识别模型，并将其部署到web上。
1. 什么是 `DataLoaders`？
1. 我们需要告诉fastai什么四件事来创建 `DataLoaders`？
1. `DataBlock` 的 `splitter` 参数有什么作用？
1. 我们如何确保随机分割始终提供相同的验证集？
1. 常用什么字母表示自变量和因变量？
1. 裁剪、填充和压缩这几种调整大小的方法之间有什么区别？什么时候你会选择其中一种而不是其他方法？
1. 什么是数据增强？为什么需要它？
1. `item_tfms` 和 `batch_tfms` 有什么区别？
1. 什么是混淆矩阵？
1. `export` 保存什么？
1. 当我们使用模型来获取预测而不是训练时，它叫什么？
1. 什么是IPython小组件？
1. 你可能希望在什么时候使用CPU进行部署？什么时候用GPU会更好？
1. 将应用程序部署到服务器而不是客户端 (或边缘) 设备 (如手机或PC) 有什么缺点？
1. 在实践中推出熊预警系统时可能出现的三个问题的例子是什么？
1. 什么是 “域外数据”？
1. 什么是 “域转移”？
1. 部署过程中的三个步骤是什么？

### Further Research

### 进一步研究

1. Consider how the Drivetrain Approach maps to a project or problem you're interested in.
1. When might it be best to avoid certain types of data augmentation?
1. For a project you're interested in applying deep learning to, consider the thought experiment "What would happen if it went really, really well?"
1. Start a blog, and write your first blog post. For instance, write about what you think deep learning might be useful for in a domain you're interested in.


1. 思考如何将传动系统方法映射到你感兴趣的项目或问题。
1. 什么时候最好避免某些类型的数据增强？
1. 对于一个你对应用深度学习感兴趣的项目，考虑一下思维实验 “如果它运行得非常非常好会发生什么？”
1. 开一个博客，写你的第一篇博文。例如，写下你认为深度学习对你感兴趣的领域可能有用的东西。