## Rethinking the Inception Architecture for Computer Vision
## 重新思考计算机视觉的Inception结构

### Abstract

* Convolutional networks are at the core of most state-of-the-art computer vision solutions for a wide variety of tasks. Since 2014, very deep convolutional networks have become mainstream, yielding substantial gains on various benchmarks. Although increased model size and computational cost tend to translate to immediate quality gains for most tasks (as long as enough labeled data is provided for training), computational efficiency and low parameter count are still enabling factors for various use cases such as mobile vision and big-data scenarios. Here we explore ways to scale up networks that aim at utilizing the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization. We benchmark our methods on the ILSVRC 2012 classification challenge validation set and demonstrate substantial gains over the state of the art: 21.2% top-1 and 5.6% top-5 error for single-frame evaluation using a network with a computational cost of 5 billion multiply-adds per inference and fewer than 25 million parameters. With an ensemble of 4 models and multi-crop evaluation, we report 3.5% top-5 error and 17.3% top-1 error.

* 卷积网络是最先进的计算机视觉解决方案的核心。自2014年以来，非常深的卷积网络开始成为主流，在各种基准上产生了实质性的收益。虽然增加的模型大小和计算成本往往会转化为大多数任务的即时质量提高（只要为训练提供足够的标记数据），计算效率和低参数计数仍然是各种用例（如移动视觉和大数据场景）的有利因素。在这里，我们正在探索如何通过适当的因子化卷积和积极的正则化来尽可能有效地利用增加的计算来扩展网络。我们在ILSVRC 2012分类挑战验证集上对我们的方法进行了基准测试，证明了我们的方法比现有技术有了实质性的提高：单帧评估的最大误差为21.2%，前5个错误为5.6%，每个推理的计算成本为50亿倍，使用的参数少于2500万个。综合4个模型和多种评价指标，我们报告了3.5%的前5个错误和17.3%的前1个错误。

### 1. Introduction

* Since the 2012 ImageNet competition winning entry by Krizhevsky et al., their network “AlexNet” has been successfully applied to a larger variety of computer vision tasks, for example object detection, segmentation, human pose estimation, video classification, object tracking, and super-resolution.

* 自Krizhevsky等人在2012年ImageNet竞赛中获胜以来，他们的网络“AlexNet”已成功应用于更广泛的计算机视觉任务，例如目标检测、分割、人体姿势估计、视频分类、目标跟踪和超分辨率。

* These successes spurred a new line of research that focused on finding higher performing convolutional neural networks. Starting in 2014, the quality of network architectures significantly improved by utilizing deeper and wider networks. VGGNet and GoogLeNet yielded similarly high performance in the 2014 ILSVRC classification challenge. One interesting observation was that gains in the classification performance tend to transfer to significant quality gains in a wide variety of application domains. This means that architectural improvements in deep convolutional architecture can be utilized for improving performance for most other computer vision tasks that are increasingly reliant on high quality, learned visual features. Also, improvements in the network quality resulted in new application domains for convolutional networks in cases where AlexNet features could not compete with hand engineered, crafted solutions, e.g. proposal generation in detection.

* 这些成功刺激了一个新的研究方向，集中于寻找性能更高的卷积神经网络。从2014年开始，网络架构的质量通过使用更深入和更广泛的网络而显著提高。VGGNet和GoogLeNet在2014年ILSVRC分类挑战赛中表现同样出色。一个有趣的观察结果是，分类性能的提高往往会转移到各种应用领域的显著质量提高。这意味着，深卷积体系结构中的体系结构改进可以用于提高大多数其他计算机视觉任务的性能，这些任务越来越依赖于高质量的学习视觉特性。此外，网络质量的提高为卷积网络带来了新的应用领域，在这种情况下，AlexNet功能无法与手工设计、精心设计的解决方案相竞争，例如在检测中生成建议。

* Although VGGNet has the compelling feature of architectural simplicity, this comes at a high cost: evaluating the network requires a lot of computation. The Inception architecture of GoogLeNet, on the other hand, was designed to perform well even under strict constraints on memory and computational budget. For example, GoogLeNet employed only 5 million parameters, a 12× reduction with respect to its predecessor AlexNet, which used 60 million parameters. Furthermore, VGGNet employed about 3× more parameters than AlexNet.

* 尽管VGGNet具有体系结构简单这一引人注目的特性，但这需要付出高昂的代价：评估网络需要大量的计算。另一方面，GoogLeNet的Inception架构也被设计成即使在内存和计算预算的严格限制下也能表现良好。例如，GoogleNet只使用了500万个参数，这比它的前身AlexNet减少了12倍，后者使用了6000万个参数。此外，VGGNet使用的参数比AlexNet多3倍。

* The computational cost of Inception is also much lower than that of VGGNet or its higher performing successors. This has made it feasible to utilize Inception networks in big-data scenarios, where huge amounts of data need to be processed at reasonable cost, and in scenarios where memory or computational capacity is inherently limited, for example in mobile vision settings. It is certainly possible to mitigate parts of these issues by applying specialized solutions to target memory use, or by optimizing the execution of certain operations via computational tricks. However, these methods add extra complexity. Furthermore, these methods could be applied to optimize the Inception architecture as well, widening the efficiency gap again.

* 初始阶段的计算成本也远低于VGGNet或其性能更高的后续产品。这使得在大数据场景中使用初始网络是可行的，在大数据场景中，需要以合理的成本处理大量数据，或者在内存或计算能力固有受限的情况下，例如在移动视觉设置中。当然，通过对目标内存使用应用专门的解决方案，或者通过计算技巧优化某些操作的执行，来减轻这些问题的一部分。然而，这些方法增加了额外的复杂性。此外，这些方法还可用于优化初始架构，再次扩大效率差距。

* Still, the complexity of the Inception architecture makes it more difficult to make changes to the network. If the architecture is scaled up naively, large parts of the computational gains can be immediately lost. Also, the GoogLeNet paper [20] does not provide a clear description of the contributing factors that led to the various design decisions of the GoogLeNet architecture. This makes it much harder to adapt it to new use cases while maintaining its efficiency. For example, if it is deemed necessary to increase the capacity of some Inception-style model, the simple transformation of just doubling all filter bank sizes will lead to a 4× increase in both computational cost and number of parameters. This might prove prohibitive or unreasonable in a lot of practical scenarios, especially if the associated gains are modest. In this paper, we start by describing a few general principles and optimization ideas that proved to be useful for scaling up convolutional networks in efficient ways. Although our principles are not limited to Inception-type networks, they are easier to observe in that context, as the generic structure of the Inception-style building blocks is flexible enough to incorporate those constraints naturally. This is enabled by the generous use of dimensional reduction and the parallel structures of the Inception modules, which allow for mitigating the impact of structural changes on nearby components. Still, one needs to be cautious about doing so, as some guiding principles should be observed to maintain high quality of the models.

* 尽管如此，Inception架构的复杂性使得对网络进行更改变得更加困难。如果架构被天真地放大，很大一部分的计算增益可能会立即丢失。此外，也没有对导致GoogLeNet架构的各种设计决策的因素进行清晰的描述。这使得它在保持效率的同时更难适应新的用例。例如，如果认为有必要增加某个初始类型模型的容量，那么只需将所有滤波器组大小的数量增加一倍的简单转换将导致计算成本和参数数量增加4倍。在许多实际情况下，这可能被证明是禁止性的或不合理的，特别是在相关收益不大的情况下。在这篇文章中，我们从描述一些基本原理和优化思想开始，这些原则和思想被证明是有用的，可以有效地扩展卷积网络。虽然我们的原则不局限于初始类型的网络，但是在这种情况下，它们更容易观察到，因为初始样式构建块的一般结构足够灵活，可以自然地包含这些约束。这是通过大量使用维度缩减和初始模块的并行结构实现的，这些模块允许减轻结构变化对附近组件的影响。不过，我们仍需谨慎行事，因为应遵守一些指导原则，以保持模型的高质量。
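
The 4× figure quoted above follows from the fact that the cost of a convolution scales with the product of its input and output filter counts. A minimal sketch, assuming a single stride-1 convolution without bias; the grid size and channel counts are hypothetical values chosen only for illustration:

```python
def conv_cost(grid, k, c_in, c_out):
    """Return (multiply-adds, weights) of one k x k convolution.

    Assumes stride 1, 'same' padding and no bias term.
    """
    params = k * k * c_in * c_out
    mult_adds = grid * grid * params
    return mult_adds, params


# Hypothetical layer: 35 x 35 grid, 3 x 3 kernel, 64 -> 96 filters.
base_ops, base_params = conv_cost(grid=35, k=3, c_in=64, c_out=96)

# Doubling every filter bank doubles both the input and the output depth.
wide_ops, wide_params = conv_cost(grid=35, k=3, c_in=128, c_out=192)

ops_ratio = wide_ops / base_ops
params_ratio = wide_params / base_params
print(ops_ratio, params_ratio)  # 4.0 4.0
```

The first layer of a network is an exception, since its input depth (the image channels) does not grow with the filter banks.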

### 2. General Design Principles

* Here we describe a few design principles based on large-scale experimentation with various architectural choices for convolutional networks. At this point, the utility of the principles below is speculative, and additional experimental evidence will be necessary to assess their accuracy and domain of validity. Still, grave deviations from these principles tended to result in deterioration in the quality of the networks, and fixing situations where those deviations were detected generally resulted in improved architectures.

* 在这里，我们将描述一些基于大规模实验的设计原则，这些实验使用卷积网络的各种架构选择。在这一点上，以下原则的效用是推测性的，未来将需要额外的实验证据来评估其准确性和有效性范围。然而，与这些原则的严重偏差往往会导致网络质量的恶化，如果检测到这些偏差，则会导致总体架构的改进。

* 1. Avoid representational bottlenecks, especially early in the network. Feed-forward networks can be represented by an acyclic graph from the input layer(s) to the classifier or regressor. This defines a clear direction for the information flow. For any cut separating the inputs from the outputs, one can assess the amount of information passing through the cut. One should avoid bottlenecks with extreme compression. In general, the representation size should gently decrease from the inputs to the outputs before reaching the final representation used for the task at hand. Theoretically, information content cannot be assessed merely by the dimensionality of the representation, as it discards important factors like correlation structure; the dimensionality merely provides a rough estimate of information content.

* 1. 避免代表性瓶颈，尤其是在网络早期。前馈网络可以用从输入层到分类器或回归器的无环图来表示。这为信息流定义了一个明确的方向。对于将输入与输出分离的任何切割，都可以访问通过切割的信息量。一个应该避免瓶颈与极端压缩。一般来说，在达到用于手头任务的最终表示之前，表示大小应该从输入逐渐减小到输出。从理论上讲，信息内容不能仅仅通过表征的维度来评估，因为它抛弃了相关结构等重要因素，而维度仅仅是对信息内容的粗略估计。

* 2. Higher dimensional representations are easier to process locally within a network. Increasing the activations per tile in a convolutional network allows for more disentangled features. The resulting networks will train faster.

* 2. 高维表示更容易在网络中本地处理。增加卷积网络中每个块的激活允许更多的分离特性。由此产生的网络将更快地训练。

* 3. Spatial aggregation can be done over lower dimensional embeddings without much or any loss in representational power. For example, before performing a more spread out (e.g. 3 × 3) convolution, one can reduce the dimension of the input representation without expecting serious adverse effects. We hypothesize that the reason for this is that the strong correlation between adjacent units results in much less loss of information during dimension reduction, if the outputs are used in a spatial aggregation context. Given that these signals should be easily compressible, the dimension reduction even promotes faster learning.

* 3. 空间聚集可以在低维嵌入上进行，而不会损失太多或任何表示能力。例如，在执行更分散的（例如3×3）卷积之前，可以在空间聚集之前减少输入表示的维数，而不会期望严重的不利影响。我们假设这是因为相邻单元之间的强相关性导致在降维过程中，如果将输出用于空间聚集上下文，则会导致更少的信息损失。考虑到这些信号应该很容易压缩，降维甚至可以促进更快的学习。

* 4. Balance the width and depth of the network. Optimal performance of the network can be reached by balancing the number of filters per stage and the depth of the network. Increasing both the width and the depth of the network can contribute to higher quality networks. However, the optimal improvement for a constant amount of computation can be reached if both are increased in parallel. The computational budget should therefore be distributed in a balanced way between the depth and width of the network.

* 4. 平衡网络的宽度和深度。通过平衡每个阶段的滤波器数量和网络的深度，可以达到网络的最佳性能。增加网络的宽度和深度可以提高网络质量。但是，如果两者并行增加，则可以达到对恒定计算量的最佳改进。因此，计算预算应在网络的深度和宽度之间进行平衡分配。

* Although these principles might make sense, it is not straightforward to use them to improve the quality of networks out of the box. The idea is to use them judiciously in ambiguous situations only.

* 虽然这些原则可能有意义，但要使用它们来改进现成的网络作品的质量并不简单。我们的想法是只在模棱两可的情况下明智地使用它们。

### 3. Factorizing Convolutions with Large Filter Size

* Much of the original gains of the GoogLeNet network arise from a very generous use of dimension reduction. This can be viewed as a special case of factorizing convolutions in a computationally efficient manner. Consider for example the case of a 1 × 1 convolutional layer followed by a 3 × 3 convolutional layer. In a vision network, it is expected that the outputs of nearby activations are highly correlated. Therefore, we can expect that their activations can be reduced before aggregation and that this should result in similarly expressive local representations.

* GoogLeNet网络的许多原始收益都来自于对降维的大量使用。这可以看作是以计算效率高的方式分解卷积的一个特例。例如，考虑一个1×1的卷积层，接着是一个3×3的卷积层。在视觉网络中，人们期望近距离激活的输出是高度相关的。因此，我们可以预期，它们的激活可以在聚合之前减少，并且这将导致类似的表达局部表示。

* Here we explore other ways of factorizing convolutions in various settings, especially in order to increase the computational efficiency of the solution. Since Inception networks are fully convolutional, each weight corresponds to one multiplication per activation. Therefore, any reduction in computational cost results in a reduced number of parameters. This means that with suitable factorization, we can end up with more disentangled parameters and therefore with faster training. Also, we can use the computational and memory savings to increase the filter-bank sizes of our network while maintaining our ability to train each model replica on a single computer.

* 在这里，我们将探索在各种情况下分解卷积的其他方法，特别是为了提高解的计算效率。由于初始网络是完全卷积的，每个权重对应于每次激活的一次乘法。因此，计算成本的任何减少都会导致参数数量的减少。这意味着，通过适当的因子分解，我们可以得到更多的解纠缠参数，从而获得更快的训练。此外，我们可以使用节省的计算和内存来增加我们网络的过滤器组大小，同时保持我们在一台计算机上训练每个模型副本的能力。

![avatar](图片1/1.png)

<center>Figure 1. Mini-network replacing the 5×5 convolutions.</center>

#### 3.1. Factorization into Smaller Convolutions

![avatar](图片1/4.png)

![avatar](图片1/5.png)

<center>Figure 5. Inception modules in which each 5×5 convolution is replaced by two 3×3 convolutions, as suggested by principle 3 of Section 2.</center>

* Convolutions with larger spatial filters (e.g. 5 × 5 or 7 × 7) tend to be disproportionately expensive in terms of computation. For example, a 5 × 5 convolution with n filters over a grid with m filters is 25/9 = 2.78 times more computationally expensive than a 3 × 3 convolution with the same number of filters. Of course, a 5 × 5 filter can capture dependencies between the activations of units further apart in the earlier layers, so a reduction of the geometric size of the filters comes at a large cost in expressiveness. However, we can ask whether a 5 × 5 convolution could be replaced by a multi-layer network with fewer parameters and the same input size and output depth. If we zoom into the computation graph of the 5 × 5 convolution, we see that each output looks like a small fully-connected network sliding over 5 × 5 tiles of its input (see Figure 1). Since we are constructing a vision network, it seems natural to exploit translation invariance again and replace the fully connected component by a two-layer convolutional architecture: the first layer is a 3 × 3 convolution, the second is a fully connected layer on top of the 3 × 3 output grid of the first layer (see Figure 1). Sliding this small network over the input activation grid boils down to replacing the 5 × 5 convolution with two layers of 3 × 3 convolution (compare Figure 4 with Figure 5).

* 使用更大的空间滤波器（如5×5或7×7）的卷积往往在计算方面花费不成比例。例如，在具有m个滤波器的网格上使用n个滤波器的5×5卷积比使用相同数量的滤波器的3×3卷积计算开销高25/9=2.78倍。当然，5×5滤波器可以捕捉到前一层中单元激活之间信号之间的依赖关系，因此减小滤波器的几何尺寸需要付出很大的表现力代价。然而，我们可以问是否可以用一个输入大小和输出深度相同的参数较少的多层网络来代替5×5卷积。如果我们放大5×5卷积的计算图，我们会看到每个输出看起来像一个小的完全连接的网络，在它的输入上滑动5×5块（见图1）。由于我们正在构建一个视觉网络，所以很自然地再次利用平移不变性，并用一个两层卷积结构来代替完全连接的组件：第一层是3×3的卷积，第二层是在第一层的3×3输出网格之上的完全连接层（见图1）。在输入激活网格上滑动这个小网络，可以归结为用两层3×3卷积代替5×5卷积（比较图4和图5）。

* This setup clearly reduces the parameter count by sharing the weights between adjacent tiles. To analyze the expected computational cost savings, we make a few simplifying assumptions that apply to typical situations. We can assume that $n = \alpha m$, that is, that we want to change the number of activations per unit by a constant factor $\alpha$. Since the 5 × 5 convolution is aggregating, $\alpha$ is typically slightly larger than one (around 1.5 in the case of GoogLeNet). Having a two-layer replacement for the 5 × 5 layer, it seems reasonable to reach this expansion in two steps, increasing the number of filters by $\sqrt{\alpha}$ in both steps. To simplify our estimate, we choose $\alpha = 1$ (no expansion). If we naively slid the network without reusing the computation between neighboring grid tiles, we would increase the computational cost. Sliding this network can, however, be represented by two 3 × 3 convolutional layers that reuse the activations between adjacent tiles. This way, we end up with a net $\frac{9+9}{25}\times$ reduction of computation, a relative gain of 28% from this factorization. The exact same saving holds for the parameter count, as each parameter is used exactly once in the computation of the activation of each unit. Still, this setup raises two general questions: Does this replacement result in any loss of expressiveness? If our main goal is to factorize the linear part of the computation, would it not suggest keeping linear activations in the first layer? We have run several control experiments (for example, see Figure 2), and using linear activation was always inferior to using rectified linear units in all stages of the factorization. We attribute this gain to the enhanced space of variations that the network can learn, especially if we batch-normalize the output activations. One can see similar effects when using linear activations for the dimension reduction components.

* 此设置通过在相邻分片之间共享权重，明显减少了参数计数。为了分析预期的计算成本节省，我们将做出一些适用于典型情况的简化假设：我们可以假设$n=\alpha m$，也就是说，我们希望通过一个常数α因子来改变每单位的激活次数。由于5×5卷积正在聚集，$\alpha$通常略大于1（对于GoogLeNet，大约为1.5）。用两层替换5×5层，似乎可以通过两个步骤实现这个扩展：在两个步骤中增加$\sqrt{\alpha}$的过滤器数量。为了通过选择$\alpha=1$（无扩展）来简化我们的估计，如果我们不重用相邻网格块之间的计算而直接滑动网络，则会增加计算成本。滑动网络可以用两个3×3的卷积层来表示，这两个卷积层可以重用相邻块之间的激活。这样，我们最终得到净$\frac{9+9}{25}$的计算量减少，通过这种因式分解，得到了28%的相对增益。每一个参数的保存在每一个参数的计算中都是完全相同的。不过，这种设置提出了两个普遍的问题：这种替代会导致表达能力的丧失吗？如果我们的主要目标是分解计算的线性部分，那么是否建议在第一层保持线性激活？我们进行了几个对照实验（例如，见图2），在因子分解的所有阶段，使用线性激活总是不如使用校正的线性单元。我们将此增益归因于网络可以学习的增强的变化空间，特别是当我们批量规范化输出激活时。当对降维组件使用线性激活时，可以看到类似的效果。
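
The 28% figure can be reproduced with a short calculation. The sketch below counts multiply-adds per output position in units of $m^2$ and follows the two-step expansion described above; the function name is made up for this illustration:

```python
import math


def two_3x3_over_5x5(alpha=1.0):
    """Cost of two stacked 3x3 convolutions relative to one 5x5 convolution.

    Costs are per output position, in units of m^2, where m is the input
    depth and alpha * m the output depth. Each 3x3 layer is assumed to
    expand the filter count by sqrt(alpha).
    """
    root = math.sqrt(alpha)
    first = 9 * root            # m -> sqrt(alpha) * m filters
    second = 9 * root * alpha   # sqrt(alpha) * m -> alpha * m filters
    single = 25 * alpha         # m -> alpha * m filters with one 5x5 layer
    return (first + second) / single


ratio = two_3x3_over_5x5(alpha=1.0)
print(f"relative cost: {ratio:.2f}, saving: {1 - ratio:.0%}")
# relative cost: 0.72, saving: 28%
```

With `alpha=1.0` the ratio is exactly 18/25. Other values of `alpha` can be passed to see how the estimate changes when the layer expands the filter bank.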

![avatar](图片1/2.png)

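<center>Figure 2. One of several control experiments between two Inception models, one of which uses a factorization into linear + ReLU layers, while the other uses two ReLU layers. After 3.86 million operations, the former settles at 76.2%, while the latter reaches 77.2% top-1 accuracy on the validation set.</center>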

#### 3.2. Spatial Factorization into Asymmetric Convolutions

* The above results suggest that convolutions with filters larger than 3 × 3 might not be generally useful, as they can always be reduced to a sequence of 3 × 3 convolutional layers. Still, we can ask whether one should factorize them into even smaller convolutions, for example 2 × 2. However, it turns out that one can do even better than 2 × 2 by using asymmetric convolutions, e.g. n × 1. For example, using a 3 × 1 convolution followed by a 1 × 3 convolution is equivalent to sliding a two-layer network with the same receptive field as a 3 × 3 convolution (see Figure 3). Still, the two-layer solution is 33% cheaper for the same number of output filters, if the number of input and output filters is equal. By comparison, factorizing a 3 × 3 convolution into two 2 × 2 convolutions represents only an 11% saving of computation.

* 以上结果表明，滤波器大于3×3a的卷积可能不太有用，因为它们总是可以被简化为3×3卷积层的序列。我们仍然可以问是否应该将它们分解成更小的，例如2×2卷积。然而，使用非对称卷积（如n×1）可以得到比2×2更好的结果。例如，使用3×1卷积，然后使用1×3卷积，相当于滑动具有与3×3卷积相同感受野的两层网络（见图3）。同样数量的输出滤波器，如果输入和输出滤波器的数量相等，两层解决方案的成本仍然是33%。相比之下，将一个3×3的卷积分解成2个2×2的卷积只节省了11%的计算量。
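
Both percentages follow from counting the kernel weights touched per output position when the number of input and output filters is equal. A minimal sketch using exact fractions; the helper name is hypothetical:

```python
from fractions import Fraction


def saving(factor_kernels, full_kernel):
    """Fraction of computation saved by replacing one convolution with a
    stack of smaller ones, assuming equal input and output filter counts
    so that the channel terms cancel.
    """
    full = full_kernel[0] * full_kernel[1]
    factored = sum(h * w for h, w in factor_kernels)
    return 1 - Fraction(factored, full)


asymmetric = saving([(3, 1), (1, 3)], (3, 3))  # 1 - 6/9
symmetric = saving([(2, 2), (2, 2)], (3, 3))   # 1 - 8/9
print(asymmetric, symmetric)  # 1/3 1/9
```

One third is the 33% quoted in the text, and one ninth is the 11%.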

![avatar](图片1/3.png)

* In theory, we could go even further and argue that one can replace any n × n convolution by a 1 × n convolution followed by an n × 1 convolution, and the computational cost saving increases dramatically as n grows (see Figure 6). In practice, we have found that employing this factorization does not work well on early layers, but it gives very good results on medium grid sizes (on m × m feature maps, where m ranges between 12 and 20). At that level, very good results can be achieved by using 1 × 7 convolutions followed by 7 × 1 convolutions.

* 理论上，我们可以更进一步，认为可以用1×n卷积代替任何n×n卷积，然后再进行n×1卷积，并且随着n的增加，节省的计算成本显著增加（见图6）。在实践中，我们发现在早期的图层上使用这种因子分解并不能很好地工作，但是在中等网格大小（m×m特征映射上，m的范围在12到20之间）时可以得到非常好的结果。在这个水平上，先用1×7卷积，再用7×1卷积，可以得到很好的结果。
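
To see how the saving grows with n: a 1 × n convolution followed by an n × 1 convolution touches 2n weights per position where the full n × n convolution touches n². The sketch below assumes equal input and output filter counts, as in the previous comparison:

```python
from fractions import Fraction


def asymmetric_saving(n):
    """Saving from replacing an n x n convolution by 1 x n followed by n x 1."""
    return 1 - Fraction(2 * n, n * n)


savings = {n: asymmetric_saving(n) for n in (3, 5, 7)}
for n, s in savings.items():
    print(f"n={n}: {float(s):.0%} cheaper")
# n=3: 33% cheaper
# n=5: 60% cheaper
# n=7: 71% cheaper
```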

![avatar](图片1/6.png)

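<center>Figure 6. Inception modules after the factorization of the n×n convolutions. In our proposed architecture, we chose n = 7 for the 17×17 grid. (The filter sizes are picked using principle 3.)</center>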

### 4. Utility of Auxiliary Classifiers

* [20] introduced the notion of auxiliary classifiers to improve the convergence of very deep networks. The original motivation was to push useful gradients to the lower layers to make them immediately useful and to improve convergence during training by combating the vanishing gradient problem in very deep networks. Lee et al. also argue that auxiliary classifiers promote more stable learning and better convergence. Interestingly, we found that auxiliary classifiers did not result in improved convergence early in training: the training progression of the network with and without the side head looks virtually identical before both models reach high accuracy. Near the end of training, the network with the auxiliary branches starts to overtake the accuracy of the network without any auxiliary branch and reaches a slightly higher plateau.

* 引入了辅助分类器的概念，以提高非常深层网络的收敛性。最初的动机是将有用的梯度推到较低的层，使其立即有用，并通过克服超深网络中梯度消失问题来提高训练过程中的收敛性。Lee等人还认为辅助量词促进了更稳定的学习和更好的收敛。有趣的是，我们发现辅助分类器在训练初期并没有改善收敛性：在两个模型达到高精度之前，有侧头和无侧头的网络训练进程看起来几乎相同。在训练接近尾声时，有辅助支路的网络开始超越没有辅助支路的网络的精度，并达到稍高的平台。

* [20] also used two side-heads at different stages in the network. The removal of the lower auxiliary branch did not have any adverse effect on the final quality of the network. Together with the observation in the previous paragraph, this means that the original hypothesis of [20], that these branches help evolve the low-level features, is most likely misplaced. Instead, we argue that the auxiliary classifiers act as a regularizer. This is supported by the fact that the main classifier of the network performs better if the side branch is batch-normalized [7] or has a dropout layer. This also gives weak supporting evidence for the conjecture that batch normalization acts as a regularizer.

* 在网络的两个不同阶段也使用了头部。移除下辅助支路对网络的最终质量没有任何不利影响。再加上前面一段的观察，这意味着[20]最初的假设，即这些分支有助于低级特征的进化，很可能是错误的。相反，我们认为辅助分类器充当正则化器。如果副分支是批处理规范化的[7]或有一个丢失层，则网络的主分类器的性能会更好。这也为批处理规范化作为正则化器的猜想提供了一个微弱的支持证据。

![avatar](图片1/7.png)

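<center>Figure 7. Inception modules with expanded filter bank outputs. This architecture is used on the coarsest (8×8) grids to promote high dimensional representations, as suggested by principle 2 of Section 2. We use this solution only on the coarsest grid, since that is where producing high dimensional sparse representations is most critical, as the ratio of local processing (by 1×1 convolutions) is increased compared to spatial aggregation.</center>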

![avatar](图片1/8.png)

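<center>Figure 8. Auxiliary classifier on top of the last 17×17 layer. Batch normalization [7] of the layers in the side head results in a 0.4% absolute gain in top-1 accuracy. The lower axis shows the number of iterations performed, each with batch size 32.</center>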

### 5. Efficient Grid Size Reduction

* Traditionally, convolutional networks used some pooling operation to decrease the grid size of the feature maps. In order to avoid a representational bottleneck, the activation dimension of the network filters is expanded before applying maximum or average pooling. For example, starting with a $d \times d$ grid with $k$ filters, if we would like to arrive at a $\frac{d}{2} \times \frac{d}{2}$ grid with $2k$ filters, we first need to compute a stride-1 convolution with $2k$ filters and then apply an additional pooling step. This means that the overall computational cost is dominated by the expensive convolution on the larger grid, using $2d^2 k^2$ operations. One possibility would be to pool first and convolve afterwards, resulting in $2(\frac{d}{2})^2 k^2$ operations and reducing the computational cost by a factor of four. However, this creates a representational bottleneck, as the overall dimensionality of the representation drops to $(\frac{d}{2})^2 k$, resulting in less expressive networks (see Figure 9). Instead of doing so, we suggest another variant that reduces the computational cost even further while removing the representational bottleneck (see Figure 10). We can use two parallel stride-2 blocks, $P$ and $C$: $P$ is a pooling layer (either average or maximum pooling) and $C$ is a convolutional layer. Both have stride 2, and their filter banks are concatenated as in Figure 10.

* 传统上，卷积网络使用一些池操作来减小特征映射的网格大小。为了避免代表性瓶颈，在应用最大或平均池之前，扩展网络过滤器的激活维度。例如，开始使用$k$过滤器的$d×d$网格，如果我们想要得到一个带有$2k$过滤器的$d×d$网格，我们首先需要使用$2k$过滤器计算一个stride-1卷积，然后应用额外的池化步骤。这意味着，在较大的网格上使用$2d^2k^2$运算，总体计算成本由昂贵的卷积决定。一种可能是使用卷积转换成池，从而产生$2(\frac{d}{2})2k^2$，将计算成本减少四分之一。然而，当表示的整体维度降到$(\frac{d}{2})k$时，这就造成了一个表示瓶颈，从而导致了2个缺乏表达的网络（见图9）。而不是这样做，我们建议另一个变体，在消除表征瓶颈的同时，进一步降低计算成本。（见图10）。我们可以使用两个并行的跨2块：$P$和$C。$$P$是一个池层（平均池或最大池）激活，它们都是第2步，它们的过滤器组如图10所示连接在一起。
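
The operation counts in this paragraph can be compared directly. The sketch below follows the paragraph's own counting convention (kernel size is left out). The cost given for the parallel variant is an assumption made here for illustration: it treats $C$ as a single stride-2 convolution with $k$ filters, which is simpler than the module in Figure 10. The grid size and filter count are hypothetical.

```python
def grid_reduction(d, k):
    """Compare three ways of going from a d x d x k grid to d/2 x d/2 x 2k."""
    half = d // 2
    return {
        # Stride-1 convolution with 2k filters on the d x d grid, then pooling.
        "conv_then_pool_ops": 2 * d**2 * k**2,
        # Pooling first, then a convolution with 2k filters on the small grid.
        "pool_then_conv_ops": 2 * half**2 * k**2,
        # Size of the representation right after pooling (the bottleneck).
        "bottleneck_dim": half**2 * k,
        # Assumption: C is one stride-2 convolution with k filters and
        # P passes k pooled channels through, so the output depth is 2k.
        "parallel_ops": half**2 * k**2,
        "parallel_output_dim": half**2 * 2 * k,
    }


costs = grid_reduction(d=16, k=64)
speedup = costs["conv_then_pool_ops"] // costs["pool_then_conv_ops"]
print(speedup)  # 4
```

Under these assumptions the parallel variant never passes through the narrow $(\frac{d}{2})^2 k$ representation, because the pooled and convolved branches together produce $2k$ channels.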

![avatar](图片1/9.png)

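<center>Figure 9. Two alternative ways of reducing the grid size. The solution on the left violates principle 1 of Section 2, which is to avoid introducing a representational bottleneck. The version on the right is 3 times more expensive computationally.</center>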

![avatar](图片1/10.png)

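<center>Figure 10. Inception module that reduces the grid size while expanding the filter banks. It is both cheap and avoids the representational bottleneck, as suggested by principle 1. The diagram on the right represents the same solution, but from the perspective of grid sizes rather than operations.</center>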

### 6. Inception-v2

* Here we connect the dots from above and propose a new architecture with improved performance on the ILSVRC 2012 classification benchmark. The layout of our network is given in Table 1. Note that we have factorized the traditional 7 × 7 convolution into three 3 × 3 convolutions based on the same ideas as described in Section 3.1. For the Inception part of the network, we have 3 traditional Inception modules at the 35 × 35 grid size with 288 filters each. This is reduced to a 17 × 17 grid with 768 filters using the grid reduction technique described in Section 5. This is followed by 5 instances of the factorized Inception modules as depicted in Figure 5. This is reduced to an 8 × 8 × 1280 grid with the grid reduction technique depicted in Figure 10. At the coarsest 8 × 8 level, we have two Inception modules as depicted in Figure 6, with a concatenated output filter bank size of 2048 for each tile. The detailed structure of the network, including the sizes of filter banks inside the Inception modules, is given in the supplementary material, in the model.txt file included in the tar-file of this submission.

* 在这里，我们将上述各点联系起来，并在ILSVRC 2012分类基准上提出一种新的体系结构，该体系结构具有改进的性能。我们的网络布局如表1所示。请注意，我们基于第3.1节中描述的相同思想，将传统的7×7卷积分解为3个3×3卷积。对于网络的初始部分，我们在35×35处有3个传统的初始模块，每个模块有288个过滤器。使用第5节中描述的网格缩减技术，将其简化为带有768个滤波器的17×17网格。接下来是图5所示的分解初始模块的5个实例。使用图10所示的网格缩减技术，将其简化为8×8×1280网格。在最粗略的8×8级别，我们有两个初始模块，如图6所示，每个块的级联输出滤波器组大小为2048。网络的详细结构，包括初始模块内滤波器组的尺寸，在补充资料中给出型号.txt这是在本次提交的tar文件中。

* However, we have observed that the quality of the network is relatively stable to variations as long as the principles from Section 2 are observed. Although our network is 42 layers deep, our computation cost is only about 2.5 times higher than that of GoogLeNet, and it is still much more efficient than VGGNet.

* 然而，我们观察到，只要遵守第2节的原则，网络的质量对变化是相对稳定的。虽然我们的网络有42层深，但是我们的计算成本只比GoogLeNet高出2.5倍，而且仍然比VGGNet更高效。

### 7. Model Regularization via Label Smoothing

* Here we propose a mechanism to regularize the classifier layer by estimating the marginalized effect of label-dropout during training.

* 在这里，我们提出了一种机制，通过估计训练过程中标签丢失的边缘化效应来正则化分类器层。

* For each training example $x$, our model computes the probability of each label $k \in \{1 \ldots K\}$: $p(k|x) = \frac{\exp(z_k)}{\sum_{i=1}^{K} \exp(z_i)}$. Here, $z_i$ are the logits or unnormalized log-probabilities. Consider the ground-truth distribution over labels $q(k|x)$ for this training example, normalized so that $\sum_k q(k|x) = 1$. For brevity, let us omit the dependence of $p$ and $q$ on example $x$. We define the loss for the example as the cross entropy: $l = -\sum_{k=1}^{K} \log(p(k))\, q(k)$. Minimizing this is equivalent to maximizing the expected log-likelihood of a label, where the label is selected according to its ground-truth distribution $q(k)$. Cross-entropy loss is differentiable with respect to the logits $z_k$ and thus can be used for gradient training of deep models. The gradient has a rather simple form: $\frac{\partial l}{\partial z_k} = p(k) - q(k)$, which is bounded between $-1$ and $1$.

* 对于每个训练示例x，我们的模型计算每个标签$k\in\{1…k\}$中的概率： $p(k|x) = \frac{exp(z_k)}{\sum^K_{i=1} exp(z_i)}$。这里，$z_i$是逻辑或非规范化的对数概率。考虑这个训练示例中标签$q(k | x)$上的基本真实分布，使$\sum_kq(k | x)=1$。为了简洁起见，我们省略$p$和$q$对示例$x$的依赖性。我们将这个例子中的损失定义为交叉熵：$l = − \sum^K_{k=1} log(p(k))q(k)$。最小化这相当于最大化标签的对数似然，其中标签是根据其基本真实分布$q(K)$选择的。交叉熵损失相对于logits$z_k$是可微的，因此可以用于深层模型的梯度训练。梯度有一个相当简单的形式：$\frac{\partial l}{\partial z_k}=p(k)−q(k)$，它在−1和1之间有界$\partial z_k$。
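
The gradient formula can be checked numerically. The following is a small pure-Python sketch, not tied to any framework, that compares $p(k) - q(k)$ against a central finite difference of the cross entropy; the logits and the step size `h` are arbitrary values chosen for illustration:

```python
import math


def softmax(z):
    """p(k) = exp(z_k) / sum_i exp(z_i), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(z, q):
    """l = -sum_k q(k) log p(k), skipping terms where q(k) is zero."""
    p = softmax(z)
    return -sum(qk * math.log(pk) for pk, qk in zip(p, q) if qk > 0)


def grad_logits(z, q):
    """Closed-form gradient dl/dz_k = p(k) - q(k)."""
    return [pk - qk for pk, qk in zip(softmax(z), q)]


z = [2.0, 0.5, -1.0]   # arbitrary logits
q = [1.0, 0.0, 0.0]    # ground-truth label is class 0
analytic = grad_logits(z, q)

h = 1e-6
numeric = []
for k in range(len(z)):
    up, down = z[:], z[:]
    up[k] += h
    down[k] -= h
    numeric.append((cross_entropy(up, q) - cross_entropy(down, q)) / (2 * h))

print(analytic)
print(numeric)
```

The two printed lists should agree to within the finite-difference error, and every entry lies between -1 and 1 as stated in the text.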

* Consider the case of a single ground-truth label $y$, so that $q(y) = 1$ and $q(k) = 0$ for all $k \neq y$. In this case, minimizing the cross entropy is equivalent to maximizing the log-likelihood of the correct label. For a particular example $x$ with label $y$, the log-likelihood is maximized for $q(k) = \delta_{k,y}$, where $\delta_{k,y}$ is Dirac delta, which equals 1 for $k = y$ and 0 otherwise. This maximum is not achievable for finite $z_k$ but is approached if $z_y \gg z_k$ for all $k \neq y$, that is, if the logit corresponding to the ground-truth label is much greater than all other logits. This, however, can cause two problems. First, it may result in over-fitting: if the model learns to assign full probability to the ground-truth label for each training example, it is not guaranteed to generalize. Second, it encourages the differences between the largest logit and all others to become large, and this, combined with the bounded gradient $\frac{\partial l}{\partial z_k}$, reduces the ability of the model to adapt. Intuitively, this happens because the model becomes too confident about its predictions.

* 考虑一个单一的先验标签$y$，因此$q(y)=1$，并且$q(k)=0$ 对于所有$k\neq y$。在这种情况下，最小化交叉熵等同于最大化正确标签的对数可能性。对于标签为$y$的特定示例$x$，对于$q（k）=\delta_{k，y}$，对数似然最大化，其中$\delta_{k,y}$是Dirac delta，对于$k=y$，等于1，否则为0。对于有限的$z_k$无法达到此最大值，但是如果$zy>>zk$用于所有$k\neq y-$，也就是说，如果与先验标签对应的逻辑项比所有其他逻辑项都大得多，则可以接近该最大值。然而，这可能导致两个问题。首先，它可能会导致过度拟合：如果模型学习为每个训练样本指定全概率的基本真值标签，它就不能保证泛化。第二，它鼓励最大logit和所有其他logit之间的差异变大，这与有界梯度$\frac{\partial l}{\partial z}$相结合，降低了模型的适应能力。直觉上，这是因为模型对其预测过于自信。

* We propose a mechanism for encouraging the model to be less confident. While this may not be desired if the goal is to maximize the log-likelihood of training labels, it does regularize the model and makes it more adaptable. The method is very simple. Consider a distribution over labels $u(k)$, independent of the training example $x$, and a smoothing parameter $\varepsilon$. For a training example with ground-truth label $y$, we replace the label distribution $q(k|x) = \delta_{k,y}$ with 

* 我们提出了一种机制来鼓励模型降低自信。虽然如果目标是最大化训练标签的对数似然，这可能并不理想，但它确实使模型正规化，使其更具适应性。方法非常简单。考虑在标签$u(k)$上的分布，独立于训练示例$x$，以及平滑参数$\varepsilon$。对于具有基本真理标签$y$的训练示例，我们将标签分布$q(k | x)=\delta_{k,y}$替换为

![avatar](图片1/12.png)

* which is a mixture of the original ground-truth distribution $q(k|x)$ and the fixed distribution $u(k)$, with weights $1 − \varepsilon$ and $\varepsilon$, respectively. This can be seen as the distribution of the label $k$ obtained as follows: first, set it to the ground-truth label $k = y$; then, with probability $\varepsilon$, replace $k$ with a sample drawn from the distribution $u(k)$. We propose to use the prior distribution over labels as $u(k)$. In our experiments, we used the uniform distribution $u(k) = 1/K$, so that

* 它是先验分布$q(k | x)$和固定分布$u(k)$的混合体，权重分别为$1-\varepsilon$和$\varepsilon$。这可以看作是标签$k$的分布，如下所示：首先，将其设置为基本真理标签$k=y$；然后，使用概率$\varepsilon$，将$k$替换为从分布$u(k)$中提取的样本。我们建议使用标签上的优先分布为$u(k)$。在我们的实验中，我们使用均匀分布$u(k)=1/k$，因此

![avatar](图片1/13.png)

* We refer to this change in ground-truth label distribution as label-smoothing regularization, or LSR.

* 我们将这种在基本真实标签分布中的变化称为标签平滑正则化，或LSR。
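
As described in the text, the smoothed target is a mixture of the one-hot ground truth (weight $1 - \varepsilon$) and the uniform distribution $u(k) = 1/K$ (weight $\varepsilon$). A minimal sketch of that construction; the function name and the small example values are illustrative only:

```python
def smooth_labels(y, num_classes, eps=0.1):
    """Return q'(k) = (1 - eps) * delta(k, y) + eps / num_classes."""
    uniform = eps / num_classes
    return [
        (1.0 - eps) * (1.0 if k == y else 0.0) + uniform
        for k in range(num_classes)
    ]


# Toy example with K = 4 classes and ground-truth label y = 2.
q_smooth = smooth_labels(y=2, num_classes=4, eps=0.1)
print(q_smooth)
```

Every entry of `q_smooth` is strictly positive and the entries still sum to one, so the target stays a valid distribution while no longer demanding full confidence.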

* Note that LSR achieves the desired goal of preventing the largest logit from becoming much larger than all others. Indeed, if this were to happen, then a single $p(k)$ would approach 1 while all others would approach 0. This would result in a large cross-entropy with $q'(k)$ because, unlike $q(k)=\delta_{k,y}$, all $q'(k)$ have a positive lower bound.

* 请注意，LSR实现了预期的目标，即防止最大的logit变得比所有其他logit大得多。事实上，如果发生这种情况，那么单个$q(k)$将接近1，而所有其他的都将接近0。这将导致与$q′(k)$有较大的交叉熵，因为与$q(k)=\delta{ky}$不同，所有的$q′(k)$都有一个正下界。

* Another interpretation of LSR can be obtained by considering the cross entropy:

* 通过考虑交叉熵可以得到LSR的另一种解释：

![avatar](图片1/14.png)

* Thus, LSR is equivalent to replacing a single cross-entropy loss $H(q, p)$ with a pair of such losses $H(q, p)$ and $H(u, p)$. The second loss penalizes the deviation of the predicted label distribution $p$ from the prior $u$, with the relative weight $\frac{\varepsilon}{1-\varepsilon}$. Note that this deviation could be equivalently captured by the KL divergence, since $H(u, p) = D_{KL}(u \| p) + H(u)$ and $H(u)$ is fixed. When $u$ is the uniform distribution, $H(u, p)$ is a measure of how dissimilar the predicted distribution $p$ is to uniform, which could also be measured (but not equivalently) by negative entropy $-H(p)$; we have not experimented with this approach.

* 因此，$LSR$相当于将单个交叉熵损失$H(q，p)$替换为一对这样的损失$H(q，p)$和$H(u，p)$。第二个损失惩罚预测的标签分布p与之前的$u$之间的偏差，相对权重$\varepsilon$.$1-\varepsilon$注意，由于$H(u，p)=D{KL}(u∥p)+H(u)$和$H(u)$的偏差是固定的。当$u$是均匀分布时，$H(u，p)$是预测分布p与均匀分布之间的差异程度的度量，也可以通过负熵$-H(p)$来测量（但不等效）；我们没有尝试过这种方法。
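
Both identities in this paragraph can be verified numerically: because cross entropy is linear in its first argument, $H(q', p) = (1-\varepsilon) H(q, p) + \varepsilon H(u, p)$, and $H(u, p) = D_{KL}(u \| p) + H(u)$. The sketch below uses an arbitrary predicted distribution `p` chosen only for illustration:

```python
import math


def cross_entropy(a, b):
    """H(a, b) = -sum_k a(k) log b(k), skipping terms where a(k) is zero."""
    return -sum(ak * math.log(bk) for ak, bk in zip(a, b) if ak > 0)


def kl_divergence(a, b):
    """D_KL(a || b) = sum_k a(k) log(a(k) / b(k))."""
    return sum(ak * math.log(ak / bk) for ak, bk in zip(a, b) if ak > 0)


eps = 0.1
p = [0.7, 0.2, 0.1]              # arbitrary predicted distribution
q = [1.0, 0.0, 0.0]              # one-hot ground truth, y = 0
u = [1.0 / 3] * 3                # uniform prior over K = 3 labels
q_smooth = [(1 - eps) * qk + eps * uk for qk, uk in zip(q, u)]

lhs = cross_entropy(q_smooth, p)
rhs = (1 - eps) * cross_entropy(q, p) + eps * cross_entropy(u, p)

entropy_u = cross_entropy(u, u)  # H(u), constant with respect to p
kl_form = kl_divergence(u, p) + entropy_u

print(lhs, rhs)
print(cross_entropy(u, p), kl_form)
```

Each printed pair should match up to floating-point rounding.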

* In our ImageNet experiments with $K = 1000$ classes, we used $u(k) = 1/1000$ and $\varepsilon = 0.1$. For ILSVRC 2012, we have found a consistent improvement of about 0.2% absolute both for top-1 error and the top-5 error (cf. Table 3).

* 在我们对$K=1000$类的ImageNet实验中，我们使用了$u(K)=1/1000$和$\varepsilon=0.1$。对于ILSVRC 2012，我们发现top-1误差和top-5误差的绝对改善率一致，约为0.2%（参考表3）。

### 8. Training Methodology

* We have trained our networks with stochastic gradient utilizing the TensorFlow distributed machine learning system, using 50 replicas, each running on an NVIDIA Kepler GPU, with batch size 32 for 100 epochs. Our earlier experiments used momentum with a decay of 0.9, while our best models were achieved using RMSProp with a decay of 0.9 and $\varepsilon = 1.0$. We used a learning rate of 0.045, decayed every two epochs using an exponential rate of 0.94. In addition, gradient clipping with threshold 2.0 was found to be useful to stabilize the training. Model evaluations are performed using a running average of the parameters computed over time.

* 我们利用TensorFlow分布式机器学习系统，用随机梯度方法训练我们的网络，使用50个副本，每个副本在一块NVidia Kepler GPU上运行，批大小为32，训练100个epoch。我们早期的实验使用了衰减为0.9的动量，而我们的最佳模型是使用衰减为0.9、$\varepsilon=1.0$的RMSProp得到的。我们使用0.045的学习率，每两个epoch衰减一次，指数衰减率为0.94。此外，我们发现阈值为2.0的梯度裁剪有助于稳定训练。模型评估使用随时间计算的参数滑动平均值来进行。
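
* The hyperparameters above can be put together in a small sketch. Several details are assumptions rather than statements from the paper: the decay is applied as a staircase (constant within each two-epoch window), clipping is done on the global gradient norm, and $\varepsilon$ is added inside the square root of the RMSProp update. The function names are hypothetical and this is not the TensorFlow implementation.

```python
import math

BASE_LR = 0.045        # initial learning rate from the text
DECAY_RATE = 0.94      # exponential decay factor from the text
EPOCHS_PER_DECAY = 2   # "decayed every two epochs"

def learning_rate(epoch):
    """Staircase exponential decay (assumed): multiply by 0.94 every two epochs."""
    return BASE_LR * DECAY_RATE ** (epoch // EPOCHS_PER_DECAY)

def clip_by_global_norm(grads, threshold=2.0):
    """Rescale the gradient vector if its L2 norm exceeds the threshold (assumed variant)."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= threshold:
        return list(grads)
    scale = threshold / norm
    return [g * scale for g in grads]

def rmsprop_step(params, grads, mean_sq, lr, decay=0.9, eps=1.0):
    """One RMSProp update; eps placement inside the square root is an assumption."""
    new_mean_sq = [decay * m + (1 - decay) * g * g for m, g in zip(mean_sq, grads)]
    new_params = [w - lr * g / math.sqrt(m + eps)
                  for w, g, m in zip(params, grads, new_mean_sq)]
    return new_params, new_mean_sq

# Usage: clip a toy gradient, then take one step at the epoch-0 learning rate.
clipped = clip_by_global_norm([3.0, 4.0])
params, mean_sq = rmsprop_step([1.0, 1.0], clipped, [0.0, 0.0], learning_rate(0))
```

With $\varepsilon = 1.0$ the denominator stays close to 1 while the squared-gradient average is small, so early updates behave much like plain gradient steps scaled by the learning rate.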

### 9. 低分辨率输入的性能

* A typical use-case of vision networks is the post-classification of detection, for example in the Multibox context. This includes the analysis of a relatively small patch of the image containing a single object with some context. The task is to decide whether the center part of the patch corresponds to some object and to determine the class of the object if it does. The challenge is that objects tend to be relatively small and low-resolution. This raises the question of how to properly deal with lower resolution input.

* 视觉网络的一个典型用例是用于检测的后分类，例如在多盒上下文中。这包括对图像的一个相对较小的块的分析，其中包含具有某种上下文的单个对象。任务是确定面片的中心部分是否与某个对象相对应，并确定该对象的类。挑战在于，物体往往相对较小，分辨率较低。这就提出了如何正确处理低分辨率输入的问题。

* The common wisdom is that models employing higher resolution receptive fields tend to result in significantly improved recognition performance. However, it is important to distinguish between the effect of the increased resolution of the first layer receptive field and the effects of larger model capacity and computation. If we just change the resolution of the input without further adjustment to the model, then we end up using computationally much cheaper models to solve more difficult tasks. Naturally, these solutions already lose out because of the reduced computational effort. In order to make an accurate assessment, the model needs to analyze vague hints in order to be able to “hallucinate” the fine details. This is computationally costly. The question therefore remains: how much does higher input resolution help if the computational effort is kept constant? One simple way to ensure constant effort is to reduce the strides of the first two layers in the case of lower resolution input, or to simply remove the first pooling layer of the network.

* 普遍的看法是，使用更高分辨率感受野的模型往往会显著提高识别性能。然而，重要的是要区分第一层感受野分辨率提高带来的影响与模型容量和计算量增大带来的影响。如果我们只是改变输入的分辨率而不进一步调整模型，那么我们最终会使用计算上便宜得多的模型来解决更困难的任务。当然，由于计算量减少，这些方案自然会处于劣势。为了做出准确的评估，模型需要分析模糊的线索，以便能够“幻想”出精细的细节，这在计算上是昂贵的。因此问题仍然是：如果计算量保持不变，更高的输入分辨率能带来多大帮助。确保计算量不变的一个简单方法是在低分辨率输入的情况下减小前两层的步幅，或者直接去掉网络的第一个池化层。

![avatar](图片1/15.png)

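<center>Table 2. Comparison of recognition performance when the size of the receptive field varies but the computational cost is constant.</center>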

* For this purpose we have performed the following three experiments:
    1. 299 × 299 receptive field with stride 2 and maximum pooling after the first layer.
    2. 151 × 151 receptive field with stride 1 and maximum pooling after the first layer.
    3. 79 × 79 receptive field with stride 1 and without pooling after the first layer.

* 为此，我们进行了以下三个实验：
    1. 299×299感受野，步幅2，第一层后最大池化。
    2. 151×151感受野，步幅1，第一层后最大池化。
    3. 79×79感受野，步幅1，第一层后无池化。
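
* The three configurations are chosen so that the feature grid entering the rest of the network has nearly the same size, which is what keeps the computational cost roughly constant. The sketch below illustrates this under assumptions that this section does not state: a 3 × 3 unpadded first convolution and a 3 × 3 max pooling with stride 2. The function names are hypothetical.

```python
def valid_output_size(size, kernel, stride):
    """Output side length of an unpadded convolution or pooling layer."""
    return (size - kernel) // stride + 1

def grid_after_first_stage(input_size, conv_stride, use_pool):
    """Side length of the feature grid after the first conv and optional pooling."""
    size = valid_output_size(input_size, kernel=3, stride=conv_stride)  # assumed 3x3 conv
    if use_pool:
        size = valid_output_size(size, kernel=3, stride=2)              # assumed 3x3/2 pool
    return size

configs = {
    "299x299, stride 2, max pooling": (299, 2, True),
    "151x151, stride 1, max pooling": (151, 1, True),
    "79x79, stride 1, no pooling": (79, 1, False),
}
grids = {name: grid_after_first_stage(*cfg) for name, cfg in configs.items()}
```

Under these assumptions the three grids have side lengths 74, 74 and 77, so the later layers, which account for most of the computation, process almost the same number of spatial positions in all three experiments.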

* All three networks have almost identical computational cost. Although the third network is slightly cheaper, the cost of the pooling layer is marginal (within 1% of the total cost of the network). In each case, the networks were trained until convergence and their quality was measured on the validation set of the ImageNet ILSVRC 2012 classification benchmark. The results can be seen in Table 2. Although the lower-resolution networks take longer to train, the quality of the final result is quite close to that of their higher-resolution counterparts.

* 这三种网络的计算成本几乎相同。尽管第三个网络稍微便宜一点，但池化层的成本微乎其微（在网络总成本的1%以内）。在每种情况下，对网络进行训练直到收敛，并在ImageNet ILSVRC 2012分类基准的验证集上测量其质量。结果见表2。虽然低分辨率网络需要更长的时间来训练，但最终结果的质量与高分辨率网络相当接近。

* However, if one were to naively reduce the network size according to the input resolution, then the network would perform much more poorly. This would be an unfair comparison, though, as we would be comparing a 16 times cheaper model on a more difficult task.

* 然而，如果一个人只是天真地根据输入分辨率缩小网络的大小，那么网络的性能就会差得多。然而，这将是一个不公平的比较，因为我们会比较一个16倍便宜的模型在一个更困难的任务。
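
* As a rough check of the "16 times cheaper" figure, assume (this is not stated in the text) that the cost of a convolutional layer scales with the number of spatial positions at which it is evaluated, and that the naive reduction shrinks every feature map by a factor of about 4 per side:

$$
\frac{\text{cost at } \tfrac{H}{4} \times \tfrac{W}{4}}{\text{cost at } H \times W} \approx \frac{1}{4} \cdot \frac{1}{4} = \frac{1}{16}
$$

Under the same assumption, the exact input ratio between experiments 1 and 3 gives $(299/79)^2 \approx 14.3$, so 16 is best read as a round figure.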

* The results of Table 2 also suggest that one might consider using dedicated high-cost low-resolution networks for smaller objects in the R-CNN context.

* 此外，表2的这些结果表明，人们可以考虑在R-CNN上下文中为较小的对象使用专用的高成本低分辨率网络。

### 10 实验结果与比较

* Table 3 shows the experimental results on the recognition performance of our proposed architecture (Inception-v2) as described in Section 6. Each Inception-v2 line shows the result of the cumulative changes, including the highlighted new modification plus all the earlier ones. Label Smoothing refers to the method described in Section 7. Factorized 7 × 7 includes a change that factorizes the first 7 × 7 convolutional layer into a sequence of 3 × 3 convolutional layers. BN-auxiliary refers to the version in which the fully connected layer of the auxiliary classifier is also batch-normalized, not just the convolutions. We refer to the model in the last row of Table 3 as Inception-v3 and evaluate its performance in the multi-crop and ensemble settings.

* 表3显示了我们提出的体系结构（Inception-v2）的识别性能的实验结果，如第6节所述。每个Inception-v2行显示累积变化的结果，包括高亮的新修改加上所有先前的修改。标签平滑（Label Smoothing）指第7节中描述的方法。Factorized 7×7包括将第一个7×7卷积层分解为一系列3×3卷积层的改动。BN-auxiliary是指辅助分类器的全连接层也进行批量归一化的版本，而不仅仅是卷积层。我们将表3最后一行中的模型称为Inception-v3，并评估其在多裁剪（multi-crop）和集成设置下的性能。
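
* The effect of the "Factorized 7 × 7" change can be illustrated with simple arithmetic. The sketch below assumes stride-1 convolutions, three 3 × 3 layers in the sequence, and the same channel count in every layer; none of these are stated in this section, and the real first layers use different strides and widths. Function names are hypothetical.

```python
def stacked_receptive_field(kernel_sizes):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernel_sizes:
        field += k - 1   # each layer widens the field by (kernel - 1)
    return field

def weights_per_channel_pair(kernel_sizes):
    """Kernel weights per (input channel, output channel) pair, summed over layers."""
    return sum(k * k for k in kernel_sizes)

single = [7]             # one 7x7 convolution
factorized = [3, 3, 3]   # assumed replacement: three stacked 3x3 convolutions

same_field = stacked_receptive_field(single) == stacked_receptive_field(factorized)
single_weights = weights_per_channel_pair(single)          # 49
factorized_weights = weights_per_channel_pair(factorized)  # 27
```

Under these assumptions both variants cover a 7 × 7 input window, while the factorized version needs 27 instead of 49 weights per channel pair and inserts two additional nonlinearities.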

![avatar](图片1/16.png)

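<center>Table 3. Single-crop experimental results comparing the cumulative effects of the various contributing factors. We compare our numbers with the best published single-crop inference results of Ioffe et al. [7]. For the "Inception-v2" rows, the changes are cumulative: each subsequent row includes the new change in addition to the previous ones. The last row includes all the changes and is what we refer to as "Inception-v3" below. Unfortunately, He et al. [6] report only 10-crop evaluation results, not single-crop results; the 10-crop results are reported in Table 4 below.</center>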

![avatar](图片1/17.png)

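<center>Table 4. Single-model, multi-crop experimental results comparing the cumulative effects of the various contributing factors. We compare our numbers with the best published single-model inference results on the ILSVRC 2012 classification benchmark.</center>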

* All our evaluations are done on the 48238 non-blacklisted examples of the ILSVRC-2012 validation set, as suggested by. We have evaluated all the 50000 examples as well, and the results were roughly 0.1% worse in top-5 error and around 0.2% worse in top-1 error. In the upcoming version of this paper, we will verify our ensemble result on the test set, but our last evaluation of BN-Inception in spring indicates that the test and validation set errors tend to correlate very well.

* 我们所有的评估都是在ILSVRC-2012验证集上的48238个未列入黑名单的样本上进行的，如所建议的那样。我们也评估了全部50000个样本，结果top-5误差大约差0.1%，top-1误差大约差0.2%。在本文的后续版本中，我们将在测试集上验证我们的集成结果，不过我们春季对BN-Inception的最后一次评估表明，测试集和验证集的误差往往高度相关。
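
* The top-1 and top-5 error figures used throughout this section can be computed as in the sketch below, which also shows how blacklisted validation examples might be skipped. The function name, the toy scores and the way the blacklist is represented (a set of example indices) are illustrative assumptions, not the evaluation code used in the paper.

```python
def top_k_error(scores, labels, k, blacklist=frozenset()):
    """Fraction of non-blacklisted examples whose label is not among the k best scores."""
    errors = 0
    counted = 0
    for index, (row, label) in enumerate(zip(scores, labels)):
        if index in blacklist:
            continue  # excluded examples count neither as hits nor as misses
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        errors += label not in ranked[:k]
        counted += 1
    return errors / counted

# Toy data: three examples, three classes.
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]

top1_all = top_k_error(scores, labels, k=1)
top2_all = top_k_error(scores, labels, k=2)
top1_filtered = top_k_error(scores, labels, k=1, blacklist={2})
```

On this toy data the top-1 error is 2/3 on all examples and 1/2 once example 2 is excluded, which mirrors how the error rate shifts slightly between the 50000-example and the 48238-example evaluation.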

![avatar](图片1/18.png)

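<center>Table 5. Ensemble evaluation results comparing multi-model, multi-crop reported results. Our numbers are compared with the best published ensemble inference results on the ILSVRC 2012 classification benchmark. *All results except the reported top-5 ensemble result are on the validation set. On the validation set, the ensemble yielded 3.46% top-5 error.</center>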

### 11.结论

* We have provided several design principles to scale up convolutional networks and studied them in the context of the Inception architecture. This guidance can lead to high-performance vision networks that have a relatively modest computation cost compared to simpler, more monolithic architectures. Our highest-quality version of Inception-v3 reaches 21.2% top-1 and 5.6% top-5 error for single-crop evaluation on the ILSVRC 2012 classification benchmark, setting a new state of the art. This is achieved with a relatively modest (2.5×) increase in computational cost compared to the network described in Ioffe et al. Still, our solution uses much less computation than the best published results based on denser networks: our model outperforms the results of He et al., cutting the top-5 (top-1) error by 25% (14%) relative, while being six times cheaper computationally and using at least five times fewer parameters (estimated). Our ensemble of four Inception-v3 models reaches 3.5% top-5 error with multi-crop evaluation, which represents an over 25% reduction relative to the best published results and is almost half of the error of the ILSVRC 2014 winning GoogLeNet ensemble.

* 我们提供了若干扩展卷积网络的设计原则，并在Inception架构的背景下对其进行了研究。这些指导原则可以带来高性能的视觉网络，与更简单、更单一的架构相比，其计算成本相对适中。我们最高质量的Inception-v3版本在ILSVRC 2012分类任务的单裁剪评估中达到21.2%的top-1误差和5.6%的top-5误差，创下了新的最高水平。与Ioffe等人描述的网络相比，这只需要相对适中（2.5倍）的计算成本增加。尽管如此，我们的解决方案所用的计算量仍远少于基于更密集网络的最佳已发表结果：我们的模型优于He等人的结果，将top-5（top-1）误差分别相对降低了25%（14%），同时计算成本低六倍，使用的参数（估计值）至少少五倍。我们由四个Inception-v3模型组成的集成在多裁剪评估下达到3.5%的top-5误差，这比已发表的最佳结果降低了25%以上，几乎是ILSVRC 2014冠军GoogLeNet集成误差的一半。

* We have also demonstrated that high-quality results can be reached with receptive field resolution as low as 79 × 79. This might prove to be helpful in systems for detecting relatively small objects. We have studied how factorizing convolutions and aggressive dimension reductions inside neural networks can result in networks with relatively low computational cost while maintaining high quality. The combination of lower parameter count and additional regularization with batch-normalized auxiliary classifiers and label smoothing allows for training high-quality networks on relatively modest-sized training sets.

* 我们还证明，在接收野分辨率低至79×79的情况下，可以获得高质量的结果。这可能对探测相对较小物体的系统有帮助。我们研究了如何在神经网络中分解卷积和积极的降维，从而在保持高质量的同时，使网络具有相对较低的计算成本。低参数计数和附加正则化与批量规范化辅助分类器和标签平滑相结合，可以在相对较小的训练集上训练高质量的网络。