Going Deeper with Convolutions
----

http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf


Abstract

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

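We propose a deep convolutional network architecture, codenamed Inception, that reaches the state of the art on the ILSVRC 2014 classification and detection tasks. Its main feature is better use of the computing resources inside the network. Through careful design, we increased the depth and width of the network while keeping the computational cost unchanged. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One implementation of the design, submitted to ILSVRC14, is GoogLeNet, a 22-layer deep network whose quality is evaluated on classification and detection.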

1 Introduction

In the last three years, our object classification and detection capabilities have dramatically improved due to advances in deep learning and convolutional networks [10]. One piece of encouraging news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate. On the object detection front, the biggest gains have not come from naive application of bigger and bigger deep networks, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [6].

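Over the past three years, our object classification and detection capabilities have improved greatly thanks to deep learning and convolutional networks. The encouraging news is that most of this progress comes not merely from more powerful hardware, larger datasets and bigger models, but mainly from new ideas, algorithms and improved network architectures. For example, the top ILSVRC 2014 entries used no new data sources beyond the competition's own classification dataset for detection. Our GoogLeNet uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. from two years earlier, yet is more accurate. In object detection, the biggest gains came not from ever larger deep networks but from combining deep architectures with classical computer vision, as in R-CNN (Girshick et al.).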

Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up being a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.

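Also worth noting is that, with the growth of mobile and embedded computing, the efficiency of our algorithms, especially their power and memory use, matters more and more. This factor was part of the considerations behind the deep architecture in this paper, rather than a fixation on accuracy numbers alone. For most experiments, the models were designed to stay within a budget of 1.5 billion multiply-adds at inference time, so that they are not a purely academic curiosity but can be used in the real world, even on large datasets, at a reasonable cost.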

In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the current state of the art.

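This paper focuses on an efficient deep network architecture, codenamed Inception, whose name comes from the Network in Network paper of Lin et al. [12] and the popular internet meme “we need to go deeper”. Here the word “deep” has two meanings: the “Inception module” introduces a new level of organization, and the network itself becomes deeper. In general, the Inception model can be seen as a logical culmination of [12], drawing inspiration from the theoretical work of Arora et al. [2]. The benefits of the architecture are verified on the ILSVRC 2014 classification and detection tasks, where it clearly surpasses the current state of the art.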

![weneedtogodepper](weneedtogodepper.jpg)

2 Related Work

Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.

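Starting with LeNet-5, CNNs have had a typical structure: stacked convolutional layers (optionally followed by normalization and max-pooling layers), followed by one or more fully connected layers. Variants of this basic pattern are very common in image classification and have given the best results on MNIST, CIFAR and, most notably, the ImageNet classification challenge. For large datasets such as ImageNet, the recent trend is to add more layers and larger layers, while using dropout to address overfitting.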

Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19].

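Although max-pooling can cost accurate spatial information, the same ConvNet architecture as [9] has been applied successfully to localization, object detection and human pose estimation.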

Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.

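Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here. However, unlike the fixed two-layer model of [15], all filters in the Inception architecture are learned. Inception layers are repeated many times, giving the 22-layer deep model we call GoogLeNet.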

Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. In their model, additional 1 × 1 convolutional layers are added to the network, increasing its depth. We use this approach heavily in our architecture. However, in our setting, 1 × 1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without a significant performance penalty.

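Network-in-Network is a method proposed by Lin et al. [12] to increase the representational power of neural networks. In their model, additional 1×1 convolutional layers are added to increase the depth of the network. We use this method heavily, but in our setting 1×1 convolutions serve a dual purpose: most importantly, they act as dimension reduction modules that remove computational bottlenecks which would otherwise limit the size of our networks. This lets us increase not only the depth but also the width of the network without a significant performance penalty.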
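
To make the role of a 1×1 convolution concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It treats the 1×1 convolution as the same linear map applied to the channel vector of every pixel, followed by a rectified linear activation. The function name `conv1x1_relu` and the toy input are hypothetical.

```python
def conv1x1_relu(feature_map, weights):
    """Apply a 1x1 convolution followed by ReLU.

    feature_map: H x W x C_in nested lists.
    weights: C_out x C_in nested lists (one row per output filter).
    Returns an H x W x C_out nested list.
    """
    return [
        [
            # Each output channel is a weighted sum over the input channels
            # of the same pixel, clipped at zero by the ReLU.
            [max(0.0, sum(w * x for w, x in zip(filt, pixel))) for filt in weights]
            for pixel in row
        ]
        for row in feature_map
    ]


# Hypothetical toy input: a 2x2 map with 4 channels, reduced to 2 channels.
x = [[[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]],
     [[2.0, 2.0, 2.0, 2.0], [4.0, 3.0, 2.0, 1.0]]]
w = [[0.25, 0.25, 0.25, 0.25],   # averages the four channels
     [1.0, -1.0, 0.0, 0.0]]      # channel 0 minus channel 1
y = conv1x1_relu(x, w)
print(len(y), len(y[0]), len(y[0][0]))  # spatial size is kept, depth shrinks
```

The spatial grid is unchanged while the channel count drops from 4 to 2, which is the dimension reduction the text refers to.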

Finally, the current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: utilizing low-level cues such as color and texture in order to generate object location proposals in a category-agnostic fashion and using CNN classifiers to identify object categories at those locations. Such a two stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.

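Finally, the current state of the art in object detection is R-CNN, proposed by Girshick et al. [6]. R-CNN splits detection into two subproblems: using low-level cues such as color and texture to generate candidate object locations, and using CNN classifiers to identify the object category at those locations. This two-stage approach combines the accuracy of bounding box segmentation from low-level cues with the classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submission but improved both stages, for example with multi-box prediction for higher bounding box recall and ensemble approaches for better categorization of bounding box proposals.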

3 Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of network levels – as well as its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.

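The most direct way to improve a deep network is to increase its size. This means increasing both the depth (the number of levels) and the width (the number of units per level). It is a simple and safe way to train higher-quality models, especially when a large amount of labeled training data is available. However, this simple strategy has two major drawbacks.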

Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This is a major bottleneck as strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between various fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset) as shown in Figure 1.

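A larger size usually means more parameters, which makes the network more prone to overfitting, especially when the training set is limited. This is a major bottleneck, because high-quality labeled datasets are laborious and expensive to obtain and often require human experts to distinguish between many fine-grained visual categories, as in ImageNet (even in the 1000-class ILSVRC subset).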

The other drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then much of the computation is wasted. As the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of performance.

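The other drawback is that increasing the network size greatly increases the use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, a uniform increase in the number of their filters leads to a quadratic increase in computation. If the added capacity is used poorly (for example, if most weights end up close to zero), much of the computation is wasted. Because the computational budget is finite, the network cannot simply be enlarged indiscriminately, even when the main goal is higher quality.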
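
The quadratic growth can be checked with a small cost model. The sketch below assumes the usual multiply-add count for a convolution (output positions × kernel area × input channels × output channels, ignoring biases). The layer sizes are hypothetical and chosen only for illustration.

```python
def conv_multiply_adds(out_h, out_w, kernel, c_in, c_out):
    """Multiply-adds of one convolutional layer (biases ignored)."""
    return out_h * out_w * kernel * kernel * c_in * c_out


def chained_cost(width_multiplier, base_filters=64, size=28, kernel=3, c_input=3):
    """Cost of two chained conv layers whose filter counts are scaled together."""
    filters = base_filters * width_multiplier
    first = conv_multiply_adds(size, size, kernel, c_input, filters)
    # The second layer sees `filters` input channels AND produces `filters`
    # output channels, so its cost grows with the square of the multiplier.
    second = conv_multiply_adds(size, size, kernel, filters, filters)
    return first, second


base_first, base_second = chained_cost(1)
wide_first, wide_second = chained_cost(2)
print(wide_first / base_first, wide_second / base_second)
```

Doubling the filter count doubles the cost of the first layer but quadruples the cost of the second, because both its input depth and its output depth double.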

A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by the sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.

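A fundamental way to solve both problems is to introduce sparsity and replace fully connected layers with sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also rest on firmer theoretical ground thanks to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the dataset can be represented by a large, very sparse deep neural network, then the optimal network topology can be built layer by layer by analyzing the correlation statistics of the preceding layer's activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement agrees with the well-known Hebbian principle (neurons that fire together, wire together) suggests that the idea also applies in practice under less strict conditions.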

Unfortunately, today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off. The gap is widened yet further by the use of steadily improving and highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, yet the trend changed back to full connections with [9] in order to further optimize parallel computation. Current state-of-the-art architectures for computer vision have uniform structure. The large number of filters and greater batch size allows for the efficient use of dense computation.

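Unfortunately, today's computing infrastructure is very inefficient at numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations drops by 100×, the overhead of lookups and cache misses dominates, so switching to sparse matrices might not pay off. The gap is widened further by steadily improving, highly tuned numerical libraries that exploit the details of the underlying CPU or GPU hardware to make dense matrix multiplication extremely fast. Non-uniform sparse models also require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems exploit sparsity in the spatial domain simply by using convolutions. However, convolutions are implemented as collections of dense connections to patches in the previous layer. ConvNets traditionally used random and sparse connection tables in the feature dimensions to break symmetry and improve learning, but the trend has returned to full connections in order to further optimize parallel computation. Current state-of-the-art vision architectures have a uniform structure; the large number of filters and larger batch sizes allow efficient use of dense computation.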

This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.

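This raises the question of whether there is hope for a next, intermediate step: an architecture that uses filter-level sparsity, as the theory suggests, but exploits current hardware through computations on dense matrices. The extensive literature on sparse matrix computation (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched that similar methods will be used for the automated construction of non-uniform deep learning architectures in the near future.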

The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, modest gains were observed early on when compared with reference networks based on [12]. With a bit of tuning the gap widened and Inception proved to be especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly in separation, they turned out to be close to optimal locally. One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have led to its construction. Making sure of this would require a much more thorough analysis and verification.


4 Architectural Details

![inception_module](inception_module.png)

The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggests a layer by layer construction where one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. Thus, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience rather than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect, too (see Figure 2(a)).

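The main idea of Inception is to consider how readily available dense components can approximate and cover an optimal local sparse structure. Note that assuming translation invariance means the network is built from convolutional building blocks. All we need is to find the optimal local structure and repeat it spatially. Arora et al. [2] propose a layer-by-layer construction: analyze the correlation statistics of the last layer and cluster its units into groups with high correlation. These clusters form the units of the next layer and connect to the units of the previous layer. We assume that each unit in an earlier layer corresponds to some region of the input image and that these units are grouped into filter banks. In the lower layers (close to the input), correlated units concentrate in local regions. This yields many clusters concentrated in a single region, which can be covered by a layer of 1×1 convolutions in the next layer. One can also expect a smaller number of more spatially spread-out clusters, which can be covered by convolutions over larger patches, with fewer patches over larger and larger regions. To avoid patch-alignment issues, current versions of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this was a matter of convenience rather than necessity. The suggested architecture is therefore a combination of all these layers, with their output filter banks concatenated into a single output vector that forms the input of the next stage. In addition, because pooling has been essential to the success of current convolutional networks, adding a parallel pooling path in each stage should bring further benefit.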

As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease. This suggests that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.

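As these Inception modules are stacked on top of each other, their output correlation statistics are bound to vary: higher layers capture features of higher abstraction, whose spatial concentration is expected to decrease. The ratio of 3×3 and 5×5 convolutions should therefore increase in higher layers.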

One big problem with the above modules, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: the number of output filters equals the number of filters in the previous stage. The merging of output of the pooling layer with outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.

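The big problem with the module above, at least in this naive form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with many filters. The problem gets worse once pooling units are added: their number of output filters equals the number of filters in the previous stage. Merging the output of the pooling layer with the outputs of the convolutional layers inevitably increases the number of outputs from stage to stage. Although this architecture might cover the optimal sparse structure, it would do so very inefficiently and lead to a computational blow-up within a few stages.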
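
The growth in output depth of the naive module can be sketched with simple bookkeeping. The example below assumes that every stage uses the same hypothetical branch widths (64, 128 and 32 filters) and that the pooling branch passes all of its input channels through unchanged. It is an illustration, not a configuration from the paper.

```python
def naive_module_depth(c_in, n1x1, n3x3, n5x5):
    """Output depth of a naive Inception module.

    The three convolution branches contribute their filter counts, and the
    pooling branch contributes all c_in input channels unchanged.
    """
    return n1x1 + n3x3 + n5x5 + c_in


def stacked_depths(c_in, n1x1, n3x3, n5x5, stages):
    """Depth after each of `stages` stacked naive modules."""
    depths = [c_in]
    for _ in range(stages):
        depths.append(naive_module_depth(depths[-1], n1x1, n3x3, n5x5))
    return depths


print(stacked_depths(192, 64, 128, 32, stages=3))
```

Because the pooling branch always carries the full input depth forward, the output is strictly deeper than the input at every stage, and the 3×3 and 5×5 convolutions of the next stage become correspondingly more expensive.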

This leads to the second idea of the Inception architecture: judiciously reducing dimension wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form and compressed information is harder to process. The representation should be kept sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation making them dual-purpose. The final result is depicted in Figure 2(b).

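This leads to the second idea of the Inception architecture: judiciously reducing dimension wherever the computational requirements would otherwise grow too much. It builds on the success of embeddings: even low-dimensional embeddings can contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form, which is harder to process. The representation should stay sparse in most places and be compressed only when signals have to be aggregated en masse. That is, 1×1 convolutions compute reductions before the expensive 3×3 and 5×5 convolutions. Besides reducing dimension, they also include a rectified linear activation, which makes them dual-purpose. The final result is shown in Figure 2(b).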
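
A rough cost comparison shows why the 1×1 reduction pays off. The sketch below counts multiply-adds for a 5×5 branch with and without a preceding 1×1 reduction. The sizes (a 28×28×192 input, 16 reduction filters, 32 5×5 filters) are assumed here for illustration; they correspond to what the paper's Table 1 lists for inception (3a), which is not reproduced in these notes.

```python
def conv_multiply_adds(out_h, out_w, kernel, c_in, c_out):
    """Multiply-adds of one convolutional layer (biases ignored)."""
    return out_h * out_w * kernel * kernel * c_in * c_out


# Assumed sizes, see the lead-in above.
SIZE, C_IN, N_REDUCE, N_5X5 = 28, 192, 16, 32

# Naive branch: the 5x5 convolution reads all 192 input channels.
naive = conv_multiply_adds(SIZE, SIZE, 5, C_IN, N_5X5)

# Reduced branch: a 1x1 convolution first shrinks 192 channels to 16,
# then the 5x5 convolution reads only those 16 channels.
reduced = (conv_multiply_adds(SIZE, SIZE, 1, C_IN, N_REDUCE)
           + conv_multiply_adds(SIZE, SIZE, 5, N_REDUCE, N_5X5))

print(naive, reduced, round(naive / reduced, 2))
```

Under these assumptions the reduced branch needs roughly a tenth of the multiply-adds of the naive one, even after paying for the extra 1×1 layer.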

In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.

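In general, an Inception network consists of modules of the above type stacked on top of each other, with occasional max-pooling layers of stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to keep the lower layers in traditional convolutional form and use Inception modules only at higher layers. This is not strictly necessary; it simply reflects some infrastructural inefficiencies in our current implementation.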

A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.

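A useful aspect of this architecture is that it allows the number of units at each stage to grow significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the widespread use of dimension reduction before expensive convolutions with larger patch sizes. The design also follows the practical intuition that visual information should be processed at various scales and then aggregated, so that the next stage can abstract features from different scales simultaneously.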

The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. One can utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it. We have found that all the available knobs and levers allow for a controlled balancing of computational resources resulting in networks that are 3 − 10× faster than similarly performing networks with non-Inception architecture, however this requires careful manual design at this point.

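The improved use of computational resources makes it possible to widen each stage and to add more stages without running into computational difficulties. One can also use the Inception architecture to build slightly inferior but computationally cheaper versions. We found that such networks are 3 to 10 times faster than similarly performing networks without the Inception architecture, although this requires careful manual design.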

5 GoogLeNet

By the “GoogLeNet” name we refer to the particular incarnation of the Inception architecture used in our submission for the ILSVRC 2014 competition. We also used one deeper and wider Inception network with slightly superior quality, but adding it to the ensemble seemed to improve the results only marginally. We omit the details of that network, as empirical evidence suggests that the influence of the exact architectural parameters is relatively minor. Table 1 illustrates the most common instance of Inception used in the competition. This network (trained with different image-patch sampling methods) was used for 6 out of the 7 models in our ensemble.

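GoogLeNet refers to the particular incarnation of the Inception architecture used in our ILSVRC 2014 submission. We also used a deeper and wider Inception network, but it improved the ensemble only marginally. We omit the details of that network, since empirical evidence suggests that the exact architectural parameters have a relatively minor influence. Table 1 shows the most common Inception instance used in the competition. This network was used for 6 of the 7 models in our ensemble.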

All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 in the RGB color space with zero mean. “#3×3 reduce” and “#5×5 reduce” stands for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.

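All convolutions, including those inside the Inception modules, use rectified linear activation. The receptive field of the network is 224×224 in the RGB color space with zero mean. “#3×3 reduce” and “#5×5 reduce” denote the number of 1×1 filters in the reduction layer before the 3×3 and 5×5 convolutions. The pool proj column gives the number of 1×1 filters in the projection layer after the built-in max-pooling. All reduction/projection layers use rectified linear activation as well.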

The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with low-memory footprint.

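The network was designed for computational efficiency and practicality, so that inference can run on individual devices, including those with limited computational resources and especially a low memory footprint.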

The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. The exact number depends on how layers are counted by the machine learning infrastructure. The use of average pooling before the classifier is based on [12], although our implementation has an additional linear layer. The linear layer enables us to easily adapt our networks to other label sets, however it is used mostly for convenience and we do not expect it to have a major effect. We found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%, however the use of dropout remained essential even after removing the fully connected layers.

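The network is 22 layers deep when counting only layers with parameters (or 27 layers if pooling is also counted). The overall number of layers (independent building blocks) is about 100; the exact number depends on how layers are counted. The average pooling before the classifier is based on [12], although our implementation has an additional linear layer. The linear layer lets us easily adapt the network to other label sets, but it is used mostly for convenience and we do not expect it to have a major effect. We found that moving from fully connected layers to average pooling improved top-1 accuracy by about 0.6%, but dropout remained essential even after the fully connected layers were removed.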
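
The switch from a fully connected layer to average pooling can be illustrated in a few lines. The sketch below implements global average pooling over an H×W×C feature map and compares the weight counts of a classifier fed by the flattened map with one fed by the pooled vector. The 7×7×1024 final feature map is an assumption taken from the paper's Table 1, which is not reproduced in these notes; the helper names are hypothetical.

```python
def global_average_pool(feature_map):
    """Average an H x W x C nested list over its spatial positions."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for i, value in enumerate(pixel):
                totals[i] += value
    return [t / (height * width) for t in totals]


def linear_weights(n_in, n_out):
    """Number of weights in a dense layer (biases ignored)."""
    return n_in * n_out


# Assumed final feature map of 7x7x1024 and 1000 output classes.
flattened = linear_weights(7 * 7 * 1024, 1000)   # classifier on the raw map
pooled = linear_weights(1024, 1000)              # classifier after pooling
print(flattened, pooled)
```

Pooling first shrinks the classifier input by a factor of 49 under these assumptions, which is where most of the parameter savings in the final layers come from.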

![inception_googlenet](inception_googlenet.png)

Given relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. The strong performance of shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, discrimination in the lower stages in the classifier was expected. This was thought to combat the vanishing gradient problem while providing regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded. Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that it required only one of them to achieve the same effect.

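Given the relatively large depth of the network, the ability to propagate gradients back through all the layers effectively was a concern. The strong performance of shallower networks on this task suggests that the features produced by layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, discrimination was expected in the lower stages of the classifier. This was thought to combat the vanishing gradient problem while providing regularization. The classifiers take the form of small convolutional networks placed on top of the outputs of the Inception (4a) and (4d) modules. During training, their loss is added to the total loss of the network with a discount weight of 0.3. At inference time, these auxiliary networks are discarded. Later control experiments showed that their effect is relatively minor (around 0.5%) and that only one of them is needed to achieve the same effect.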
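
The weighting of the auxiliary losses amounts to a one-line formula. The sketch below assumes each classifier's loss has already been computed as a plain number; the function name and the sample loss values are hypothetical.

```python
AUX_WEIGHT = 0.3  # discount weight for the auxiliary classifiers, as stated above


def total_training_loss(main_loss, aux_losses, aux_weight=AUX_WEIGHT):
    """Main loss plus the discounted sum of the auxiliary classifier losses."""
    return main_loss + aux_weight * sum(aux_losses)


# Hypothetical loss values for the main head and the (4a) and (4d) heads.
print(total_training_loss(2.0, [3.0, 4.0]))

# At inference time the auxiliary heads are discarded, so only the main
# classifier's output is used.
```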

The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:

The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:

• An average pooling layer with 5×5 filter size and stride 3, resulting in an 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.
• A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
• A fully connected layer with 1024 units and rectified linear activation.
• A dropout layer with 70% ratio of dropped outputs.
• A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).

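- An average pooling layer with 5×5 filter size and stride 3, giving a 4×4×512 output for (4a) and 4×4×528 for (4d).
- A 1×1 convolution with 128 filters for dimension reduction, with rectified linear activation.
- A fully connected layer with 1024 units and rectified linear activation.
- A dropout layer with a 70% drop ratio.
- A linear layer with softmax loss as the classifier (predicting the same 1000 classes, but removed at inference time).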
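
The shapes in the list above can be traced with a small helper. The sketch assumes the (4a) and (4d) modules output 14×14 feature maps (taken from the paper's Table 1, not shown in these notes) and that the pooling uses no padding; the function names are hypothetical.

```python
def pooled_size(size, window, stride):
    """Output size of an unpadded pooling window along one dimension."""
    return (size - window) // stride + 1


def aux_head_shapes(in_size, in_channels, reduce_filters=128,
                    fc_units=1024, classes=1000):
    """Output shape after each layer of the auxiliary classifier."""
    s = pooled_size(in_size, window=5, stride=3)
    return [
        (s, s, in_channels),      # 5x5 average pooling, stride 3
        (s, s, reduce_filters),   # 1x1 convolution + ReLU
        (fc_units,),              # fully connected + ReLU
        (fc_units,),              # dropout (shape unchanged)
        (classes,),               # linear layer feeding the softmax loss
    ]


for name, channels in (("4a", 512), ("4d", 528)):
    print(name, aux_head_shapes(14, channels))
```

With a 14×14 input, a 5×5 window and stride 3, the pooled map is 4×4, which matches the 4×4×512 and 4×4×528 outputs stated in the list.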

A schematic view of the resulting network is depicted in Figure 3.

A schematic view of the resulting network is shown in Figure 3.

![googlenet_diagram.png](googlenet_diagram.png)

6 Training Methodology

GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system using modest amount of model and data-parallelism. Although we used a CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17], fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.

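GoogLeNet was trained with the DistBelief distributed machine learning system. Although we used only a CPU-based implementation, a rough estimate suggests that GoogLeNet could be trained to convergence on a few high-end GPUs within a week, with memory usage as the main limitation. We used asynchronous stochastic gradient descent with 0.9 momentum and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging was used to create the final model used at inference time.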
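
The learning rate schedule can be written as a step decay. The sketch below assumes the 4% decrease is applied as a multiplicative step at every 8th epoch boundary; the text does not give the base learning rate, so `base_lr` is a hypothetical parameter.

```python
def learning_rate(epoch, base_lr, decay=0.04, step=8):
    """Step schedule: multiply the rate by (1 - decay) every `step` epochs."""
    return base_lr * (1.0 - decay) ** (epoch // step)


# Hypothetical base rate of 0.1, used only to show the shape of the schedule.
for epoch in (0, 7, 8, 16, 80):
    print(epoch, learning_rate(epoch, base_lr=0.1))
```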

Image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition, includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of training data.

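The image sampling methods changed substantially in the months before the competition, and already converged models were trained further with other options, sometimes together with changed hyperparameters such as dropout and the learning rate. It is therefore hard to give definitive guidance on the single most effective way to train these networks. To complicate matters, some models were trained mainly on relatively smaller crops and others on larger ones, inspired by [8]. Still, one prescription was verified to work very well after the competition: sampling patches of various sizes whose area is distributed evenly between 8% and 100% of the image area, with aspect ratio constrained to the interval [3/4, 4/3]. We also found that the photometric distortions of Andrew Howard [8] were useful for combating overfitting to the imaging conditions of the training data.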
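
One way to read the sampling prescription is sketched below. It draws the crop area uniformly between 8% and 100% of the image area and the aspect ratio from [3/4, 4/3]. Drawing the aspect ratio uniformly, retrying when the crop does not fit, and falling back to the whole image are assumptions of this sketch; the text does not specify them.

```python
import math
import random


def sample_crop(width, height, rng, min_area=0.08, max_area=1.0,
                min_ratio=3 / 4, max_ratio=4 / 3, attempts=10):
    """Return (left, top, crop_width, crop_height) for one random patch."""
    image_area = width * height
    for _ in range(attempts):
        area = rng.uniform(min_area, max_area) * image_area
        ratio = rng.uniform(min_ratio, max_ratio)  # crop width / crop height
        crop_w = int(round(math.sqrt(area * ratio)))
        crop_h = int(round(math.sqrt(area / ratio)))
        # Accept the sample only if the patch fits inside the image.
        if 0 < crop_w <= width and 0 < crop_h <= height:
            left = rng.randint(0, width - crop_w)
            top = rng.randint(0, height - crop_h)
            return left, top, crop_w, crop_h
    return 0, 0, width, height  # fallback: use the whole image


rng = random.Random(0)  # fixed seed so the example is repeatable
print(sample_crop(500, 375, rng))
```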

7 ILSVRC 2014 Classification Challenge Setup and Results

The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes.

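The ILSVRC 2014 classification challenge is to classify each image into one of 1000 leaf-node categories in the ImageNet hierarchy. There are about 1.2 million training images, 50,000 validation images and 100,000 test images. Each image is associated with one ground truth category, and performance is measured on the highest-scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth with the first predicted class, and the top-5 error rate, which compares it with the first five predicted classes: an image counts as correctly classified if the ground truth is among the top five, regardless of its rank. The challenge uses the top-5 error rate for ranking.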
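
The two metrics can be computed with one helper. The sketch below ranks the classes of each image by score and counts an error when the ground truth is not among the first k. The scores and labels are hypothetical toy data over 6 classes.

```python
def top_k_error(scores, labels, k):
    """Fraction of images whose true label is not among the k best scores."""
    errors = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)


# Hypothetical scores for 3 images over 6 classes.
scores = [
    [0.10, 0.50, 0.20, 0.10, 0.05, 0.05],  # true class 1: ranked first
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # true class 2: ranked second
    [0.40, 0.30, 0.10, 0.10, 0.08, 0.02],  # true class 5: ranked last
]
labels = [1, 2, 5]

print("top-1 error:", top_k_error(scores, labels, 1))
print("top-5 error:", top_k_error(scores, labels, 5))
```

The second image is a top-1 error but a top-5 success, which is exactly the case the top-5 metric is meant to forgive.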

We participated in the challenge with no external data used for training. In addition to the training techniques aforementioned in this paper, we adopted a set of techniques during testing to obtain a higher performance, which we describe next.

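We took part in the challenge without external training data. In addition to the training techniques described earlier, we adopted a set of techniques at test time to obtain higher performance, described next.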

1. We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, due to an oversight) and learning rate policies. They differed only in sampling methodologies and the randomized input image order.

2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resized the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This leads to 4×3×6×2 = 144 crops per image. A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).

3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they led to inferior performance compared with simple averaging.

- 我们独立地进行了7个相同版本GooLeNet的训练（包括一个更宽的版本），然后用他们进行合并预测。这些模型使用相同的初始化方法（甚至由于一个过失，使用的是相同的初始权重）和学习率策略。他们的区别仅在采样方法和随机的图片输入顺序。

- 在测试阶段，我们采用一个比Krizhevsky[9]更激进的截取方法。确切的说，我们把图片拉伸到4个尺度，较小的维度（高或者宽）分别是256，288，320和352。取这些拉伸过的图片的左中右的矩形（纵向图片情况中取上中下）。对每个矩形，我们截取4个角和中心的224x224区域，加上把矩形拉伸成224x244，还有他们的对称版本。这样每个图片就有4x3x6x2 = 144个截取。类似的方法在Andrew Howard[8]在去年使用，我们实验过比这个稍差。我们申明如此激进的截取方法在实际中可能未必需要，在给定了合理数量的截取之后，更多的截取带来的好处就不是很重要了。（稍后会提到）

- 多个截取，和多个分类器得到的softmax预测值做平均得到最后的预测。我们在验证集上还试验分析了其他的方法，例如使用在不同的截取上使用最大池，而在不同的分类器上使用平均，但是它们的效果不如简单的做平均。
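The 4×3×6×2 = 144 count in the cropping scheme above can be checked by enumerating the crops explicitly. The sketch below only computes crop rectangles; it does not load or resize any image. The names `CropSpec` and `enumerate_crops`, the rounding of the longer side, and the integer placement of the center square are assumptions made for illustration; the scales, the three squares, the six views per square and the mirroring come from the text.

```python
from typing import List, NamedTuple, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in resized-image pixels

SCALES = (256, 288, 320, 352)  # shorter side after resizing (from the text)
CROP = 224                     # crop size fed to the network (from the text)


class CropSpec(NamedTuple):
    scale: int
    square: Box      # one of the 3 squares inside the resized image
    region: Box      # sub-region of the square that is fed to the network
    resized: bool    # True when the whole square is resized to 224x224
    mirrored: bool


def enumerate_crops(width: int, height: int) -> List[CropSpec]:
    specs = []
    for s in SCALES:
        # Resize so that the shorter side equals s (rounding is an assumption).
        if width >= height:
            w, h = round(width * s / height), s
        else:
            w, h = s, round(height * s / width)
        # Three s x s squares: left/center/right for landscape images,
        # top/center/bottom for portrait images.
        if w >= h:
            squares = [(x, 0, x + s, s) for x in (0, (w - s) // 2, w - s)]
        else:
            squares = [(0, y, s, y + s) for y in (0, (h - s) // 2, h - s)]
        for sq in squares:
            l, t, r, b = sq
            c = (s - CROP) // 2
            views = [
                ((l, t, l + CROP, t + CROP), False),                  # top-left corner
                ((r - CROP, t, r, t + CROP), False),                  # top-right corner
                ((l, b - CROP, l + CROP, b), False),                  # bottom-left corner
                ((r - CROP, b - CROP, r, b), False),                  # bottom-right corner
                ((l + c, t + c, l + c + CROP, t + c + CROP), False),  # center crop
                (sq, True),                                           # whole square, resized
            ]
            for region, resized in views:
                for mirrored in (False, True):
                    specs.append(CropSpec(s, sq, region, resized, mirrored))
    return specs


crops = enumerate_crops(width=500, height=375)
print(len(crops))  # 4 scales x 3 squares x 6 views x 2 mirrorings
```

For a perfectly square input the three squares coincide, so some of the 144 entries describe identical pixels; the enumeration still follows the same recipe.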

In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.

Our final submission to the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking first among all participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about a 40% relative reduction compared to the previous year's best approach (Clarifai), both of which used external data for training the classifiers.
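The relative reductions quoted above follow the usual formula (baseline error minus new error, divided by baseline error). As a rough sanity check, the sketch below inverts that formula to recover the baseline errors implied by the quoted percentages. The function names are illustrative, and since the text gives the 2013 figure only as "about 40%", the second implied baseline is approximate.

```python
def relative_reduction(baseline_error: float, new_error: float) -> float:
    """Relative error reduction, as a fraction of the baseline error."""
    return (baseline_error - new_error) / baseline_error


def implied_baseline(new_error: float, reduction: float) -> float:
    """Baseline error that would yield the given relative reduction."""
    return new_error / (1.0 - reduction)


googlenet_top5 = 6.67  # percent, from the text

# Reductions quoted in the text: 56.5% (vs. 2012) and about 40% (vs. 2013).
baseline_2012 = implied_baseline(googlenet_top5, 0.565)
baseline_2013 = implied_baseline(googlenet_top5, 0.40)

print(f"implied 2012 baseline top-5 error: {baseline_2012:.1f}%")
print(f"implied 2013 baseline top-5 error: {baseline_2013:.1f}%")
```

The implied baselines come out at roughly 15.3% and 11.1% top-5 error, which can be compared against the entries listed in Table 2.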

Table 2 shows the statistics of some of the top-performing approaches over the past 3 years.

In Table 3, we also analyze and report the performance of multiple testing choices, varying the number of models and the number of crops used when predicting an image. When we use one model, we choose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order not to overfit to the testing data statistics.

![googlenet_perf](googlenet_perf.png)
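The test-time combination rule described in this section, averaging softmax probabilities over all crops and all classifiers, can be sketched in a few lines of plain Python. The function names and the toy logits (2 models, 2 crops, 3 classes) are made up for illustration; in the setting of Table 3 the number of models ranges up to 7 and the number of crops up to 144.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    # Subtract the maximum for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_predictions(logits: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Average softmax outputs over every (model, crop) pair.

    logits[m][c] is the logit vector produced by model m on crop c.
    With the same number of crops per model, this equals averaging over
    crops first and then over models.
    """
    probs = [softmax(crop) for model in logits for crop in model]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]


toy_logits = [
    [[2.0, 1.0, 0.0], [1.5, 1.0, 0.5]],  # model 0, crops 0 and 1
    [[0.5, 2.0, 0.0], [2.5, 0.5, 0.0]],  # model 1, crops 0 and 1
]

avg = average_predictions(toy_logits)
predicted_class = max(range(len(avg)), key=lambda k: avg[k])
print(predicted_class, [round(p, 3) for p in avg])
```

Note that model 1 on crop 0 prefers class 1 on its own; the average still selects class 0, which illustrates how averaging smooths out individual disagreements.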

8 ILSVRC 2014 Detection Challenge Setup and Results

The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary. Results are reported using the mean average precision (mAP). The approach taken by GoogLeNet for detection is similar to the R-CNN of [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the selective search [20] approach with multi-box [5] predictions for higher object bounding box recall. In order to reduce the number of false positives, the super-pixel size was increased by 2×. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box [5], resulting in a total of about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 GoogLeNets when classifying each region. This leads to an increase in accuracy from 40% to 43.9%. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.
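The correctness criterion above (same class, and a Jaccard index of at least 50% between the boxes) can be illustrated with a minimal sketch. The box format `(x_min, y_min, x_max, y_max)` and the function names are assumptions; the official evaluation also handles details such as matching each ground-truth object at most once, which this sketch deliberately leaves out.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), assumed format


def jaccard_index(a: Box, b: Box) -> float:
    """Intersection area divided by union area of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box: Box, pred_class: int,
                         gt_box: Box, gt_class: int,
                         threshold: float = 0.5) -> bool:
    # Both conditions from the text: matching class and overlap >= 50%.
    return pred_class == gt_class and jaccard_index(pred_box, gt_box) >= threshold


gt = (0.0, 0.0, 100.0, 100.0)
close = (10.0, 10.0, 110.0, 110.0)  # overlap 8100 / 11900, about 0.68
far = (50.0, 50.0, 150.0, 150.0)    # overlap 2500 / 17500, about 0.14

print(is_correct_detection(close, 7, gt, 7))  # same class, enough overlap
print(is_correct_detection(far, 7, gt, 7))    # same class, too little overlap
print(is_correct_detection(close, 3, gt, 7))  # enough overlap, wrong class
```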

We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use convolutional networks. We report the official scores in Table 4, together with common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data, used for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data, the same way classification data is used for pre-training. The GoogLeNet entry did not use the localization data for pre-training.

In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and, surprisingly, improves by only 0.3 points with an ensemble of 3 models, while GoogLeNet obtains significantly stronger results with the ensemble.

![googlenet_detection_perf.png](googlenet_detection_perf.png)

![googlenet_detetion_models.png](googlenet_detetion_models.png)

9 Conclusions

Our results yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and narrower architectures.

Our object detection work was competitive despite neither utilizing context nor performing bounding box regression, providing further evidence of the strengths of the Inception architecture.

For both classification and detection, it is expected that a similar quality of result can be achieved by much more expensive non-Inception-type networks of similar depth and width. Still, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests future work towards creating sparser and more refined structures in automated ways on the basis of [2], as well as applying the insights of the Inception architecture to other domains.