Going Deeper with Convolutions
----

http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Szegedy_Going_Deeper_With_2015_CVPR_paper.pdf


Abstract

We propose a deep convolutional neural network architecture codenamed Inception that achieves the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. By a carefully crafted design, we increased the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

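We propose a deep convolutional network architecture, codenamed Inception, that reaches the state of the art on the ILSVRC 2014 classification and detection tasks. Its main feature is improved utilization of computing resources inside the network. Through careful design, we increased the depth and width of the network while keeping the computational cost unchanged. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One implementation of the design, submitted to ILSVRC14, is GoogLeNet, a 22-layer deep network whose quality is evaluated on classification and detection.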

1 Introduction

In the last three years, our object classification and detection capabilities have dramatically improved due to advances in deep learning and convolutional networks [10]. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. [9] from two years ago, while being significantly more accurate. On the object detection front, the biggest gains have not come from naive application of bigger and bigger deep networks, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al. [6].

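Over the past three years, our object classification and detection capabilities have improved greatly thanks to deep learning and convolutional networks. The encouraging news is that most of this progress comes not merely from more powerful hardware, larger datasets and bigger models, but mainly from new ideas, algorithms and improved network architectures. For example, the top entries in ILSVRC 2014 used no new data sources for detection beyond the classification dataset of the same competition. Our GoogLeNet uses 12 times fewer parameters than the winning architecture of Krizhevsky et al. from two years earlier, yet is more accurate. In object detection, the biggest gains came not from ever-larger deep networks but from combining deep networks with classical computer vision, as in R-CNN (Girshick et al.).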

Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up as a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.

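It is also worth noting that, as mobile and embedded computing spreads, the efficiency of our algorithms, especially their power and memory use, becomes more important. This factor was taken into account when designing the deep architecture in this paper, rather than fixating solely on accuracy numbers. For most experiments, the models were designed to keep the inference-time computational cost at 1.5 billion multiply-adds, so that they are not a purely academic curiosity but can be used in the real world, even on large datasets, at an acceptable cost.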

In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, where it significantly outperforms the current state of the art.

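In this paper we focus on an efficient deep network architecture codenamed Inception, a name derived from the Network in Network paper by Lin et al. [12] and the popular internet meme “we need to go deeper”. Here the word “deep” has two meanings: first, we introduce a new level of organization in the form of the “Inception module”; second, in the more direct sense, the network depth increases. In general, the Inception model can be seen as a logical culmination of [12], inspired and guided by the theoretical work of Arora et al. [2]. The benefits of the architecture are verified on the ILSVRC 2014 classification and detection tasks, where it clearly exceeds the current state of the art.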

![weneedtogodepper](weneedtogodepper.jpg)

2 Related Work

Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.

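Starting with LeNet-5, CNNs have had a typical structure: stacked convolutional layers (optionally followed by normalization and max-pooling layers), followed by one or more fully connected layers. Variants of this basic design are very popular in image classification and have produced the best results on MNIST, CIFAR and, most notably, the ImageNet classification challenge. For large datasets such as ImageNet, the recent trend is to add more layers and larger layers, while using dropout to address overfitting.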

Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19].

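Although max-pooling raises concerns about losing accurate spatial information, the same convolutional network architecture as [9] has been applied successfully to localization, object detection and human pose estimation.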

Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.

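Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of Gabor filters of different sizes to handle multiple scales. We use a similar strategy here. However, unlike the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Inception layers are repeated many times, forming the 22-layer deep model we call GoogLeNet.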

Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. In their model, additional 1 × 1 convolutional layers are added to the network, increasing its depth. We use this approach heavily in our architecture. However, in our setting, 1 × 1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without a significant performance penalty.

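Network-in-Network is an approach proposed by Lin et al. [12] to increase the representational power of neural networks. In their model, additional 1×1 convolutional layers are added to increase the depth of the network. We use this approach heavily, but in our setting 1×1 convolutions serve a dual purpose: most importantly, they act as dimension-reduction modules that remove computational bottlenecks which would otherwise limit the size of our networks. This makes it possible to increase not only the depth but also the width of the network without a significant performance penalty.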

Finally, the current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: utilizing low-level cues such as color and texture in order to generate object location proposals in a category-agnostic fashion and using CNN classifiers to identify object categories at those locations. Such a two stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.

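Finally, the current state of the art in object detection is R-CNN, proposed by Girshick et al. [6]. R-CNN decomposes detection into two subproblems: using low-level cues such as color and texture to generate candidate object locations in a category-agnostic way, and using CNN classifiers to identify the object categories at those locations. This two-stage approach combines the accuracy of bounding-box segmentation from low-level cues with the classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions but improved both stages, for example with multi-box prediction for higher bounding-box recall and ensemble approaches for better categorization of bounding-box proposals.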

3 Motivation and High Level Considerations

The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth – the number of network levels – as well as its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.

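The most direct way to improve the performance of deep networks is to increase their size. This includes increasing the depth (the number of levels) and the width (the number of units per level). It is a simple and safe way to train higher-quality models, especially when a large amount of labeled training data is available. However, this simple strategy has two major drawbacks.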

Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This is a major bottleneck as strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between various fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset) as shown in Figure 1.

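A larger size generally means more parameters, which makes the network more prone to overfitting, especially when the training set is limited. This is a major bottleneck because high-quality labeled datasets are laborious and expensive to obtain, often requiring human experts to distinguish between many fine-grained visual categories, such as those in ImageNet (even in the 1000-class ILSVRC subset).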

The other drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then much of the computation is wasted. As the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of performance.

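The other drawback is that uniformly increasing the network size greatly increases the use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, a uniform increase in the number of their filters leads to a quadratic increase in computation. If the added capacity is used inefficiently (for example, if most weights end up close to zero), much of the computation is wasted. Because the computational budget is finite, indiscriminately enlarging the network is not the preferred route even when the main goal is higher quality.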
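
The quadratic growth mentioned above can be checked with a little arithmetic. The sketch below is only illustrative: the helper names, the 28×28 feature-map size, the 3×3 kernel and the base width of 64 filters are assumptions chosen for the example, not values taken from the paper.

```python
def conv_multiply_adds(height, width, kernel, in_channels, out_channels):
    """Multiply-adds for one stride-1, same-padded convolution layer."""
    return height * width * kernel * kernel * in_channels * out_channels


def chained_cost(scale, base_filters=64, height=28, width=28, kernel=3):
    """Cost of the second of two chained layers when both widths are scaled.

    The second layer reads `scale * base_filters` channels and writes
    `scale * base_filters` channels, so its cost grows with scale squared.
    All default sizes are illustrative assumptions.
    """
    filters = scale * base_filters
    return conv_multiply_adds(height, width, kernel, filters, filters)


base = chained_cost(1)
doubled = chained_cost(2)
tripled = chained_cost(3)
print(doubled // base, tripled // base)  # cost ratios for 2x and 3x the filters
```

Doubling the filter count of both layers quadruples the cost of the second layer, because both its input depth and its output depth double.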

A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by the sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle – neurons that fire together, wire together – suggests that the underlying idea is applicable even under less strict conditions, in practice.

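A fundamental way to solve both problems is to introduce sparsity and replace the fully connected layers with sparse ones, even inside the convolutions. Besides mimicking biological systems, this also has the advantage of firmer theoretical underpinnings thanks to the groundbreaking work of Arora et al. Their main result states that if the probability distribution of the dataset can be represented by a large, very sparse deep neural network, then the optimal network topology can be constructed layer by layer by analyzing the correlation statistics of the preceding layer's activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well-known Hebbian principle (neurons that fire together, wire together) suggests that the idea applies in practice even under less strict conditions.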

Unfortunately, today’s computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses would dominate: switching to sparse matrices might not pay off. The gap is widened yet further by the use of steadily improving and highly tuned numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning, yet the trend changed back to full connections with [9] in order to further optimize parallel computation. Current state-of-the-art architectures for computer vision have uniform structure. The large number of filters and greater batch size allows for the efficient use of dense computation.

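Unfortunately, today's computing infrastructure is very inefficient at numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced 100-fold, the overhead of lookups and cache misses dominates, so switching to sparse matrices may not pay off. The gap is widened further by steadily improving, highly tuned numerical libraries that exploit the details of the underlying CPU or GPU to make dense matrix multiplication extremely fast. Non-uniform sparse models also require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems exploit sparsity in the spatial domain merely by using convolutions. However, convolutions are implemented as collections of dense connections to patches of the previous layer. ConvNets traditionally used random, sparse connection tables in the feature dimensions to break symmetry and improve learning, but the trend has returned to full connections in order to better optimize parallel computation. Current state-of-the-art vision architectures have a uniform structure, and their large number of filters and larger batch sizes allow efficient use of dense computation.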

This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.

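This raises the question of whether there is hope for a next, intermediate step: an architecture that uses filter-level sparsity, as the theory suggests, but exploits current hardware through computations on dense matrices. The extensive literature on sparse matrix computation (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched that similar methods will be used in the near future for the automated construction of non-uniform deep-learning architectures.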

The Inception architecture started out as a case study for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, modest gains were observed early on when compared with reference networks based on [12]. With a bit of tuning the gap widened and Inception proved to be especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly in separation, they turned out to be close to optimal locally. One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have led to its construction. Making sure of this would require a much more thorough analysis and verification.


4 Architectural Details

The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggest a layer-by-layer construction where one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. Thus, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience rather than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect, too (see Figure 2(a)).

As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease. This suggests that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers.

One big problem with the above modules, at least in this naïve form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: the number of output filters equals the number of filters in the previous stage. The merging of output of the pooling layer with outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.
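
To make the stage-to-stage growth concrete, the sketch below tracks only the output depth of a naïve module. The function names and the filter counts (64, 128, 32, starting from 192 input channels) are assumptions picked for illustration; the point is just that the pooling branch passes every input channel through to the concatenation.

```python
def naive_module_depth(in_channels, n1x1, n3x3, n5x5):
    """Output depth of a naive Inception module.

    The pooling branch has no filters of its own, so it contributes all
    `in_channels` feature maps; the four branches are then concatenated.
    """
    return n1x1 + n3x3 + n5x5 + in_channels


def stacked_depths(in_channels, stages, n1x1=64, n3x3=128, n5x5=32):
    """Depth after each of `stages` identical naive modules.

    The default filter counts are illustrative assumptions.
    """
    depths = []
    for _ in range(stages):
        in_channels = naive_module_depth(in_channels, n1x1, n3x3, n5x5)
        depths.append(in_channels)
    return depths


depths = stacked_depths(in_channels=192, stages=4)
print(depths)  # [416, 640, 864, 1088]
```

Each stage adds the full width of the convolutional branches on top of everything it received, and every 3×3 and 5×5 filter in the next stage must then span that larger depth.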

This leads to the second idea of the Inception architecture: judiciously reducing dimension wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form and compressed information is harder to process. The representation should be kept sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation making them dual-purpose. The final result is depicted in Figure 2(b).

In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation.

A useful aspect of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity at later stages. This is achieved by the ubiquitous use of dimensionality reduction prior to expensive convolutions with larger patch sizes. Furthermore, the design follows the practical intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from the different scales simultaneously.

The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. One can utilize the Inception architecture to create slightly inferior, but computationally cheaper versions of it. We have found that all the available knobs and levers allow for a controlled balancing of computational resources resulting in networks that are 3-10× faster than similarly performing networks with non-Inception architecture, however this requires careful manual design at this point.
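
The effect of placing a 1×1 reduction before an expensive convolution can be estimated by counting multiply-adds. This is a rough sketch under assumed sizes (a 28×28 grid, 192 input channels, a reduction to 16 channels, 32 output filters); the helper names are hypothetical and biases and activations are ignored.

```python
def conv_multiply_adds(height, width, kernel, in_channels, out_channels):
    """Multiply-adds for one stride-1, same-padded convolution layer."""
    return height * width * kernel * kernel * in_channels * out_channels


def direct_5x5(height, width, in_channels, out_channels):
    """5x5 convolution applied straight to the full input depth."""
    return conv_multiply_adds(height, width, 5, in_channels, out_channels)


def reduced_5x5(height, width, in_channels, reduce_channels, out_channels):
    """1x1 reduction to `reduce_channels`, followed by the 5x5 convolution."""
    reduce = conv_multiply_adds(height, width, 1, in_channels, reduce_channels)
    expand = conv_multiply_adds(height, width, 5, reduce_channels, out_channels)
    return reduce + expand


# Illustrative sizes only (assumptions, not values quoted in the text above).
direct = direct_5x5(28, 28, 192, 32)
reduced = reduced_5x5(28, 28, 192, 16, 32)
print(direct, reduced, round(direct / reduced, 1))
```

With these assumed sizes the reduced path needs roughly a tenth of the multiply-adds, and the saving grows as the input depth grows, which is what keeps later stages from blowing up.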



5 GoogLeNet

By the “GoogLeNet” name we refer to the particular incarnation of the Inception architecture used in our submission for the ILSVRC 2014 competition. We also used one deeper and wider Inception network with slightly superior quality, but adding it to the ensemble seemed to improve the results only marginally. We omit the details of that network, as empirical evidence suggests that the influence of the exact architectural parameters is relatively minor. Table 1 illustrates the most common instance of Inception used in the competition. This network (trained with different image-patch sampling methods) was used for 6 out of the 7 models in our ensemble.

All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 in the RGB color space with zero mean. “#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.

The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices including even those with limited computational resources, especially with low-memory footprint.

The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. The exact number depends on how layers are counted by the machine learning infrastructure. The use of average pooling before the classifier is based on [12], although our implementation has an additional linear layer. The linear layer enables us to easily adapt our networks to other label sets, however it is used mostly for convenience and we do not expect it to have a major effect. We found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%, however the use of dropout remained essential even after removing the fully connected layers.
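
The classifier head described above (average pooling followed by a linear layer) can be sketched in a few lines of plain Python. The tiny 2×2 feature maps, the three output classes and the weight values are made-up numbers for illustration, and the function names are hypothetical.

```python
def global_average_pool(feature_maps):
    """Average each channel's H x W map down to a single number."""
    pooled = []
    for channel in feature_maps:
        values = [value for row in channel for value in row]
        pooled.append(sum(values) / len(values))
    return pooled


def linear(vector, weights, biases):
    """Plain linear layer: one weight row and one bias per output class."""
    return [
        sum(w * x for w, x in zip(row, vector)) + bias
        for row, bias in zip(weights, biases)
    ]


# Two channels of 2x2 activations (toy values, not real network outputs).
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],  # channel 0, mean 4.0
    [[2.0, 2.0], [2.0, 2.0]],  # channel 1, mean 2.0
]
pooled = global_average_pool(feature_maps)

# Three hypothetical classes, so the weight matrix is 3 x 2.
weights = [[1.0, 0.0], [0.5, 0.5], [0.0, -1.0]]
biases = [0.0, 1.0, 0.0]
logits = linear(pooled, weights, biases)
print(pooled, logits)
```

Because the pooled vector has one entry per channel regardless of spatial size, swapping in a different label set only means replacing the small weight matrix of the linear layer.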

Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. The strong performance of shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, discrimination in the lower stages in the classifier was expected. This was thought to combat the vanishing gradient problem while providing regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded. Later control experiments have shown that the effect of the auxiliary networks is relatively minor (around 0.5%) and that it required only one of them to achieve the same effect.

The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:

- An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.
- A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.
- A fully connected layer with 1024 units and rectified linear activation.
- A dropout layer with 70% ratio of dropped outputs.
- A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).

A schematic view of the resulting network is depicted in Figure 3.
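
The layer list above can be followed as shape arithmetic, together with the discounted loss. In this sketch the 14×14 input grid is an assumption inferred from the stated 4×4 pooled output, unpadded pooling is assumed, and the function names and loss values are hypothetical.

```python
def pool_output_size(input_size, kernel, stride):
    """Spatial size after a pooling window with no padding (an assumption)."""
    return (input_size - kernel) // stride + 1


def aux_head_shapes(input_size, in_channels, num_classes=1000):
    """Output shape after each layer of the auxiliary classifier."""
    side = pool_output_size(input_size, kernel=5, stride=3)
    return [
        ("avg_pool_5x5_stride_3", (side, side, in_channels)),
        ("conv_1x1_128_relu", (side, side, 128)),
        ("fc_1024_relu", (1024,)),
        ("dropout_0.7", (1024,)),
        ("linear_softmax", (num_classes,)),
    ]


def training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Total training loss: main loss plus discounted auxiliary losses."""
    return main_loss + aux_weight * sum(aux_losses)


# 14x14 is an assumed grid size that reproduces the 4x4 output stated above.
shapes_4a = aux_head_shapes(input_size=14, in_channels=512)
shapes_4d = aux_head_shapes(input_size=14, in_channels=528)
total = training_loss(main_loss=2.0, aux_losses=[1.0, 1.0])  # toy loss values
print(shapes_4a[0], shapes_4d[0], total)
```

At inference time only the main classifier is evaluated, so the auxiliary heads add no cost to a deployed model.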



6 Training Methodology

GoogLeNet networks were trained using the DistBelief [4] distributed machine learning system using a modest amount of model and data-parallelism. Although we used a CPU based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17] and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.
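
The optimizer settings above can be illustrated on a toy problem. This is a single-process sketch, not the asynchronous DistBelief setup: the base learning rate of 0.01, the quadratic objective, the 10 steps per epoch and the plain running-mean form of Polyak averaging are all assumptions made for the example.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.04, step=8):
    """Step schedule: cut the rate by `decay` (4%) every `step` (8) epochs.

    `base_lr` is an assumed starting value; the text does not state one.
    """
    return base_lr * (1.0 - decay) ** (epoch // step)


def sgd_momentum_step(weight, velocity, grad, lr, momentum=0.9):
    """One classical momentum update for a single scalar weight."""
    velocity = momentum * velocity - lr * grad
    return weight + velocity, velocity


def polyak_update(average, weight, count):
    """Running mean of the weights seen so far (`count` starts at 0)."""
    return average + (weight - average) / (count + 1)


# Toy objective f(w) = w ** 2 with gradient 2 * w; 10 steps per "epoch".
weight, velocity, average = 5.0, 0.0, 0.0
for step_index in range(200):
    epoch = step_index // 10
    grad = 2.0 * weight
    weight, velocity = sgd_momentum_step(weight, velocity, grad, learning_rate(epoch))
    average = polyak_update(average, weight, step_index)

print(learning_rate(0), learning_rate(8), weight, average)
```

The averaged weight, rather than the last iterate, would be the one kept for inference in this scheme.
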

Image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, such as dropout and the learning rate. Therefore, it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area with aspect ratio constrained to the interval [3/4, 4/3]. Also, we found that the photometric distortions of Andrew Howard [8] were useful to combat overfitting to the imaging conditions of training data.
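
The patch-sampling prescription can be sketched as follows. The area range and the aspect-ratio interval come from the text; drawing the aspect ratio uniformly, retrying when a patch does not fit, and falling back to the whole image are assumptions of this sketch, as is the function name.

```python
import math
import random


def sample_crop(image_width, image_height, rng, min_area=0.08, max_area=1.0,
                min_ratio=3.0 / 4.0, max_ratio=4.0 / 3.0, max_tries=10):
    """Return (left, top, width, height) of a randomly sampled patch.

    The patch area is drawn uniformly from 8% to 100% of the image area.
    The aspect ratio (width / height) is drawn uniformly from [3/4, 4/3],
    which is an assumed distribution within the stated interval.
    """
    image_area = image_width * image_height
    for _ in range(max_tries):
        area = rng.uniform(min_area, max_area) * image_area
        ratio = rng.uniform(min_ratio, max_ratio)
        crop_w = int(round(math.sqrt(area * ratio)))
        crop_h = int(round(math.sqrt(area / ratio)))
        if 0 < crop_w <= image_width and 0 < crop_h <= image_height:
            left = rng.randint(0, image_width - crop_w)
            top = rng.randint(0, image_height - crop_h)
            return left, top, crop_w, crop_h
    # Fallback (an assumption of this sketch): use the whole image.
    return 0, 0, image_width, image_height


rng = random.Random(0)
crops = [sample_crop(640, 480, rng) for _ in range(5)]
for crop in crops:
    print(crop)
```

A seeded `random.Random` instance is passed in so that the sampled patches are reproducible from run to run.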