Deep Residual Learning for Image Recognition
----

https://arxiv.org/abs/1512.03385

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

神经网络越深就越难训练。我们提出一个残差学习框架来训练比以往深得多的网络。我们明确地把层当做输入层的残差函数。我们提供综合经验证明，这种残差网络更容易被优化，通过显著增加深度可以获得更好的准确率。在ImageNet数据集上，我们使用深到152层的残差网络来做评估，它比VGG[41]深了8倍还多，但复杂程度更低。组合这种网络在ImageNet的测试集上可以达到3.57%的错误率。这个结果赢得了ILSVRC2015分类任务的冠军。我们还给出了在CIFAR-10上的100和1000层的分析。

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

对于很多视觉识别任务来说，表达深度都是非常重要的。仅仅由于极端的深度表达，我们在COCO物体检测数据集上获得了相对28%的改进。深度残差网络是我们提交ILSVRC和COCO2015竞赛的基础。我们同时获得了ImageNet检测、ImageNet定位、COCO检测和COCO分割比赛的第一名。

1 Introduction

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high level features [50] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

深度卷积网络[22,21]在图片分类[21,50,40]上导致了一系列的突破。深度网络以端到端的方式自然地集成了底层/中间层/高层特征[50]还有分类器，而且特征的层次可以通过增加更多的层来丰富（深度）。最近的研究表明[41,44]网络的深度是非常重要的，并且在ImageNet上的最好结果[41,44,13,16]都是深度模型。从16层到30层。许多其他的视觉识别任务[8,12,7,32,27]也从深度模型得到非常多的好处。

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

由于深度的重要性，一个问题就产生了：学习更好的网络是否和叠加更多的层一样容易？回答这个问题的一个障碍是臭名昭著的梯度消失/爆炸[1,9]，它从一开始就阻碍了收敛。但是这个问题已经在很大程度上被初始化中的标准化[23,9,37,13]和中间层的标准化[16]解决了，使得几十层的网络通过反向传播和随机梯度下降[22]能够开始收敛。

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

当更深的网络能够开始收敛时，一个退化问题就暴露出来了：随着网络深度增加，准确率先饱和（这可能并不奇怪）然后迅速下降。意外的是，这种退化并不是由过拟合引起的，在一个合适的深度模型中增加更多的层会导致更高的训练误差，[11,42]提到了这些，并且被我们的实验全面地验证了。图1是一个典型的例子。

![high_training_error.png](high_training_error.png)

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet are presented in Fig. 4.

图1：训练误差（左）和测试误差（右）在CIFAR-10上，20层和56层的简单网络。较深的网络有很大的训练和测试误差。图4显示了在ImageNet上的相同的现象。

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

这种退化（训练准确率）表明不是所有的系统都能同样简单地优化。我们来考虑一个浅网络结构和一个通过添加更多层的深网络。有一种方式来构造更深的模型：新增加的层是一对一的映射，其它层从已经训练的浅网络拷贝而来。这种构造方法表明一个深网络不应该比对应的浅网络有更高的训练误差。但是试验表明，我们目前的学习方案比不上这种构造方法（或者没有足够的时间）。
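The construction argument above can be checked on a toy example. This is a minimal sketch, not the paper's experiment: the "layers" are hypothetical elementwise functions standing in for learned layers, and the names `relu_scale_layer`, `identity_layer`, `forward`, and `mse` are made up for illustration.

```python
# Toy check of the "solution by construction": a deeper model built from a
# shallower one plus identity layers computes exactly the same function.

def relu_scale_layer(scale):
    """Return a toy layer computing ReLU(scale * v) elementwise (stand-in for a learned layer)."""
    def layer(x):
        return [max(0.0, scale * v) for v in x]
    return layer

def identity_layer(x):
    """An added layer that performs identity mapping."""
    return list(x)

def forward(layers, x):
    """Apply the layers in order."""
    for layer in layers:
        x = layer(x)
    return x

def mse(pred, target):
    """Mean squared error, used here as the training error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

shallow = [relu_scale_layer(0.5), relu_scale_layer(2.0)]
deep = shallow + [identity_layer] * 4  # copied layers + added identity layers

x = [1.0, -2.0, 3.0]
target = [1.0, 0.0, 2.5]
shallow_error = mse(forward(shallow, x), target)
deep_error = mse(forward(deep, x), target)
```

Because the deeper model reproduces the shallower one exactly, its training error cannot be higher; the degradation problem is that solvers fail to find a solution at least this good.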

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

在这篇论文中，我们引入一种深度残差学习框架来解决这个退化问题。不同于让每个层直接来拟合底层的映射，我们显式的让这些层拟合一个残差映射。形式上，底层映射用H(x)指代，我们让堆叠的非线性层来拟合另一个映射F(x) := H(x) - x。原来的映射就变成了 F(x) + x。我们假设这个残差映射比原映射要更容易优化。极端情况下，如果原映射是identity，把残差变成0，比直接用这些层来拟合这个identity映射要容易。

The formulation of F(x) + x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.

公式 F(x) + x 可以通过前向网络的“短路连接”来实现（图2）。短路连接[2,34,49]跳过了一个或多个层。在我们这里，短路连接只是简单的做恒等的映射，它们的输出被添加到堆叠层的输出上（图2）。恒等短路连接既没有增加新的参数，也没有额外的计算复杂度。整个网络依然可以使用SGD和反向传播进行端到端的训练，通过一些公共库（例如Caffe[19]）在不改变优化器的情况下可以很简单地被实现。

Figure 2. Residual learning: a building block.

![resnet_building_block](resnet_building_block.png)


We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.

我们在ImageNet[36]上做了充分的实验，来显示这个退化问题并评估我们的方法。我们表明：1）我们的非常深的残差网络很容易优化，但是对应的“普通”网络（简单叠加层）在深度增加的时候出现了更高的训练误差；2）深度残差网络可以很容易通过增加深度来获得准确率的提升，得到比以往的网络好得多的结果。

Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.

类似的现象同样出现在CIFAR-10上[20]，这说明我们的方法不仅仅只限于一个特定的数据集。我们给出了在这个数据集上一个成功训练的100层的模型，另外还探索了1000层的模型。

On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152 layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.

在ImageNet分类数据集上，我们通过深度残差网络获得了出色的成绩。152层的残差网络是ImageNet上公开的最深的网络，并且比VGG[41]的复杂度要小。我们的组合分类器在ImageNet的测试集上的top-5错误率为3.57%，是ILSVRC2015的第一名。这个极深的表达方式在别的识别任务中也有出色的泛化能力，使得我们后来赢得了更多ILSVRC和COCO 2015的第一名：ImageNet检测，ImageNet定位，COCO检测，COCO分割。这些事实强有力的说明了，残差学习的原理是通用的，我们希望它被应用到更多的视觉或者非视觉问题中。

2 Related Work

Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

**残差的表达形式**。在图片识别领域，VLAD[18]使用字典编码残差向量的方式，Fisher Vector[30]可以看成是VLAD的概率版本。这两种方法是图像识别和分类中非常有效的浅表示。对于向量量化，对残差向量编码比对于原向量编码要有效得多。

In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.

在底层视觉和计算机图形学中，为了求解PDE（偏微分方程），Multigrid方法被广泛使用来把系统分成多个尺度的子问题，而每个子问题负责在较粗和较细的尺度上解决残差问题。Multigrid的替代方法是层次基础预处理，它依赖于两个尺度之间的残差向量的变量表达。[3,45,46]已经证明这些优化方法比不考虑残差属性的方法收敛要迅速得多。这些方法都说明一种好的重新表达和预处理可以简化优化问题。

Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an “inception” layer is composed of a shortcut branch and a few deeper branches.

**短路连接**。关于短路连接[2,34,49]的理论和实践已经被研究了很长时间。较早的实践中，添加一个线性层连接网络的输入和输出来训练多层感知机（MLP）[34,49]。在[44,24]中，一些中间层被直接连接到辅助分类器上，用来解决梯度消失和爆炸问题。论文[39,38,31,47]提出了一些方法，通过短路连接来对层响应、梯度和传播误差进行中心化。在[44]中，一个“inception”层由一个短路分支和一些较深的分支组成。

Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

和我们的工作同时进行的，“高速网络”[42, 43]用门函数[15]的方式来表达短路连接。这些“门”依赖于数据并且带有参数，而我们的恒等短路是无参的。当一个短路门“关闭”时（接近于0），高速网络中的层表示的是非残差函数。相对的，我们的方式始终学习残差函数；我们的恒等短路永远不会关闭，所有的信息一直都可以通过，同时还有额外的残差函数需要学习。另外，高速网络并没有显示出通过极大地增加深度（例如，多于100层）来获得准确率提升的能力。
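For comparison, the gated shortcut can be written out next to the identity shortcut. The text above does not give the highway formula; the form below follows the notation commonly used for highway networks, where T is a learned, data-dependent transform gate with its own parameters (the symbols $ W_H $ and $ W_T $ are assumptions of this sketch):

$$ y_{\text{highway}} = H(x, W_H) \cdot T(x, W_T) + x \cdot \left( 1 - T(x, W_T) \right) $$

$$ y_{\text{residual}} = F(x, \left\{ W_i \right\}) + x $$

When the carry term $ 1 - T(x, W_T) $ approaches zero, the highway layer reduces to $ y = H(x, W_H) $, a non-residual function. In the residual form the coefficient on x is fixed at 1, so the shortcut cannot close.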

3 Deep Residual Learning

3.1 Residual Learning

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.

我们假设通过几个叠加的层（不需要是整个网络）来拟合H(x)映射，x 表示这些层中第一层的输入。如果假设多个非线性层可以渐进地逼近复杂的函数，那么也就是说它们同样可以渐进地逼近残差函数，也就是，H(x) - x（假设输入和输出维度相同）。我们显式地让这些层来逼近一个残差函数：F(x) := H(x) - x，而不是原函数H(x)。原函数就变成了F(x) + x。尽管两种形式都应该能够渐进地逼近目标函数（如假设的那样），但是学习的难易度可能不同。

This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.

这个变形是因为反直觉的退化现象而提出的（图1左）。在介绍部分讨论过，如果新添加的层可以被构造成恒等映射，那么一个更深的模型的训练误差就不会比对应的浅模型要大。退化现象显示用多个非线性层逼近恒等映射，优化器可能有困难。在残差学习的形式中，如果最优的映射是恒等，优化器只需简单地把多个非线性层的权重推向0，就可以来逼近恒等映射。
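The point about driving weights toward zero can be seen in a few lines. This is a minimal sketch using plain Python lists, with a two-layer F(x) = W2 ReLU(W1 x) and biases omitted; the helper names `matvec`, `relu`, `plain_stack`, and `residual_stack` are hypothetical.

```python
# With all weights at zero, a plain stack outputs zeros,
# while the residual form F(x) + x outputs x itself (identity mapping).

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def plain_stack(x, W1, W2):
    """Layers fit H(x) directly: H(x) = W2 * ReLU(W1 * x)."""
    return matvec(W2, relu(matvec(W1, x)))

def residual_stack(x, W1, W2):
    """Layers fit the residual: output is F(x) + x with the same F."""
    f = matvec(W2, relu(matvec(W1, x)))
    return [fi + xi for fi, xi in zip(f, x)]

zeros = [[0.0] * 3 for _ in range(3)]
x = [0.5, -1.0, 2.0]

plain_out = plain_stack(x, zeros, zeros)        # all zeros
residual_out = residual_stack(x, zeros, zeros)  # equals x
```

For the plain stack, representing the identity requires a specific non-trivial setting of W1 and W2; for the residual form, the all-zero setting already does it.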

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

在实际中，最优的目标是恒等映射这种情况不太可能，但是我们的变形可能有助于对这个问题进行预处理（precondition）。如果最优函数更接近于恒等映射而不是零映射，那么对于优化器来说，以恒等映射为参照去寻找扰动，应该比把这个函数当作一个全新的函数来学习要容易。通过实验（图7）表明，学习到的残差函数一般来说响应比较小，这说明恒等映射提供了合理的预处理。

3.2 Identity Mapping by Shortcuts

We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:

$$ y = F(x, \left\{ W_i \right\}) + x \tag{1} $$

Here x and y are the input and output vectors of the layers considered. The function $ F(x, \left\{ W_i \right\}) $ represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, $ F = W_2 \sigma(W_1 x) $, in which $ \sigma $ denotes ReLU [29] and the biases are omitted to simplify notation. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., $ \sigma(y) $, see Fig. 2).

我们每隔几个层使用残差学习。基本单元显示如图2。这里x和y是所考虑的层的输入和输出向量。函数$ F(x, \left\{ W_i \right\}) $表示被学习的残差映射。例如在图2中，有两层，$ F = W_2 σ(W_1 x) $，其中σ表示ReLU[29]，偏移为简单起见被忽略了。F + x 操作通过短路连接和按位加法来实现。求和后又采用了一次非线性（也就是，σ(y)，见图2)。
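The building block of Eqn.(1) with the second nonlinearity applied after the addition can be sketched as follows. This is an illustrative fully-connected version on plain Python lists, not the authors' implementation; `building_block` and the example weights are made up for the demonstration.

```python
# Two-layer residual building block: y = W2 * ReLU(W1 * x) + x, then ReLU(y).

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def building_block(x, W1, W2):
    f = matvec(W2, relu(matvec(W1, x)))    # residual mapping F(x, {W_i}), biases omitted
    y = [fi + xi for fi, xi in zip(f, x)]  # shortcut connection: element-wise addition
    return relu(y)                         # second nonlinearity, applied after the addition

W1 = [[1.0, 0.0], [0.0, -1.0]]
W2 = [[0.5, 0.0], [0.0, -2.0]]
x = [2.0, -1.0]

out = building_block(x, W1, W2)
# W1 x = [2, 1]; ReLU keeps both; W2 maps to [1, -2]; adding x gives [3, -3]; final ReLU gives [3, 0]
```

Note that the addition requires F(x) and x to have the same length, which is why the square weight matrices are used here.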

The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

公式1中的短路连接没有引入多余的参数，或者计算复杂度。这不仅在实际中是有吸引力的，并且在比较残差和普通网络的时候也非常重要。我们可以同时在相同的参数数量、深度、宽度和计算开销（除了那个几乎可以忽略不计的按位加法以外）下公平地比较普通和残差网络。

The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection $ W_s $ by the shortcut connections to match the dimensions:

$$ y = F(x, \left\{ W_i \right\}) + W_s x \tag{2} $$

在公式1中，x和F的维度必须相等。如果这不成立的话（例如改变了输入输出的通道数），我们可以对短路连接进行一个线性变换$ W_s $来实现维度的匹配。
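A small sketch of the projection shortcut, where F changes the dimension from 2 to 3 and a hypothetical 3×2 matrix `Ws` projects x to match. The function name `projection_block` and all example weights are assumptions made for illustration.

```python
# Projection shortcut: y = F(x, {W_i}) + Ws * x, used when F(x) and x differ in size.

def matvec(W, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def relu(x):
    return [max(0.0, v) for v in x]

def projection_block(x, W1, W2, Ws):
    f = matvec(W2, relu(matvec(W1, x)))  # residual branch, output size set by W2
    shortcut = matvec(Ws, x)             # linear projection of the input
    if len(f) != len(shortcut):
        raise ValueError("Ws must project x to the same size as F(x)")
    return [a + b for a, b in zip(f, shortcut)]

x = [1.0, 2.0]                                               # 2-dimensional input
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                    # 2 -> 3
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]     # 3 -> 3
Ws = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]                    # 2 -> 3 projection

y = projection_block(x, W1, W2, Ws)
# F(x) = [1, 2, 3], Ws x = [1, 2, 0], so y = [2, 4, 3]
```

Unlike the identity shortcut, `Ws` adds parameters and computation, which is consistent with the text reserving it for dimension matching.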

We can also use a square matrix $ W_s $ in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus $ W_s $ is only used when matching dimensions.

我们也可以在公式1中使用一个方阵$ W_s $，但是通过实验表明，恒等映射已经完全可以解决这个退化问题，并且更经济，所以$ W_s $只是用于维度匹配。

The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: $ y = W_1 x + x $, for which we have not observed advantages.

残差函数F的形式是灵活的。本文的实验表明一个函数F有两层或者三层（图5），但是更多的层也是可能的。但如果F只有1层，公式1就类似于一个线性层：$ y = W_1 x + x $，这样并没有什么好处。
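One way to see why the single-layer case resembles a linear layer is to factor out x (with I denoting the identity matrix, a symbol not used in the text above):

$$ y = W_1 x + x = (W_1 + I) x $$

The shortcut is absorbed into a single weight matrix $ W_1 + I $, so the block can represent nothing that an ordinary linear layer cannot. With two or more layers, the nonlinearity between them prevents this kind of absorption.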

We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function $ F(x, \left\{ W_i \right\}) $ can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.

尽管为了简单起见以上都使用全连接层来表述，但是卷积层也同样适用。函数$ F(x, \left\{ W_i \right\}) $可以表示为多个卷积层。按位加法在两个特征图上按通道进行加法计算。

3.3 Network Architectures

We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.

我们测试过了多种普通/残差网络，都得到了一致的现象。为了讨论方便，我们将描述两个用于ImageNet的模型。

Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [41] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).

**普通网络**。 我们的测试基准普通网络（图3，中）主要是受VGG的启发（图3，左）。卷积层主要都是3x3的卷积核，后跟两个简单的设计原则：(i) 相同的特征图输出有相同的卷积核个数；(ii) 如果特征图大小减半，卷积核的个数翻倍，来保持每一层的时间复杂度。我们直接使用步长为2的卷积层来降采样。这个网络以一个全局的平均池化层，和一个1000路的全连接softmax层结束。带参数的层的总数为34（图3，中）。
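Design rule (ii) can be verified with simple arithmetic. The sketch below counts multiply-adds for one 3×3 convolutional layer as output height × output width × input channels × output channels × kernel area. The function name and the example stage sizes (56×56 with 64 filters, 28×28 with 128 filters) are illustrative assumptions, chosen to match a halving of the feature map.

```python
# Halving the feature map size while doubling the number of filters
# keeps the multiply-add count per layer unchanged.

def conv_multiply_adds(out_h, out_w, in_channels, out_channels, kernel=3):
    """Multiply-adds of one convolutional layer (bias ignored)."""
    return out_h * out_w * in_channels * out_channels * kernel * kernel

before = conv_multiply_adds(56, 56, 64, 64)     # larger map, fewer filters
after = conv_multiply_adds(28, 28, 128, 128)    # map halved per side, filters doubled
```

Halving each spatial side divides the number of output positions by 4, while doubling both input and output channels multiplies the per-position cost by 4, so the two effects cancel.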

It is worth noticing that our model has fewer filters and lower complexity than VGG nets [41] (Fig. 3, left). Our 34-layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.
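Option (A) can be sketched on a feature map stored as nested lists indexed `[channel][row][col]`. The text says only that the shortcut is performed with a stride of 2 and padded with zero entries; implementing the stride as plain spatial subsampling is an assumption of this sketch, and `zero_pad_shortcut` is a hypothetical name.

```python
# Option (A): parameter-free shortcut across a stage boundary.
# Spatial size is reduced by striding, new channels are filled with zeros.

def zero_pad_shortcut(x, out_channels, stride=2):
    """x is a feature map as nested lists [channel][row][col]."""
    # Keep every `stride`-th row and column of each channel (assumed subsampling).
    subsampled = [[row[::stride] for row in channel[::stride]] for channel in x]
    extra = out_channels - len(subsampled)
    if extra < 0:
        raise ValueError("out_channels must be >= number of input channels")
    height = len(subsampled[0])
    width = len(subsampled[0][0])
    # Extra channels are all-zero maps, so no parameters are introduced.
    padding = [[[0.0] * width for _ in range(height)] for _ in range(extra)]
    return subsampled + padding

# One 4x4 input channel holding the values 0..15.
x = [[[float(r * 4 + c) for c in range(4)] for r in range(4)]]
shortcut = zero_pad_shortcut(x, out_channels=2)
```

The result has 2 channels of size 2×2, matching a residual branch that halves the spatial size and doubles the channel count. Option (B) would instead apply a learned 1×1 convolution with stride 2.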

Figure 3. Example network architectures for ImageNet. Left: the VGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.

![resnet.png](resnet.png)

3.4 Implementation

Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to $ 60 \times 10^4 $ iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].
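The optimizer settings above can be sketched for a single scalar weight. The text gives the hyperparameters (learning rate 0.1, divided by 10 on plateau, momentum 0.9, weight decay 0.0001) but not the plateau criterion or the exact update form. The `patience` rule, the folding of weight decay into the gradient, and the names `sgd_momentum_step` and `PlateauSchedule` are all assumptions of this sketch.

```python
# SGD with momentum and weight decay, plus a divide-by-10-on-plateau schedule.

def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One update of a scalar weight; weight decay is added to the gradient (assumed form)."""
    grad_with_decay = grad + weight_decay * weight
    velocity = momentum * velocity - lr * grad_with_decay
    return weight + velocity, velocity

class PlateauSchedule:
    """Divide the learning rate by 10 after `patience` checks without improvement (assumed rule)."""

    def __init__(self, lr=0.1, patience=2):
        self.lr = lr
        self.patience = patience
        self.best = float("inf")
        self.stale = 0

    def step(self, error):
        if error < self.best:
            self.best = error
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr /= 10
                self.stale = 0
        return self.lr

schedule = PlateauSchedule()
lrs = [schedule.step(e) for e in [0.50, 0.40, 0.40, 0.40, 0.39]]
new_weight, new_velocity = sgd_momentum_step(1.0, 0.5, 0.0, lr=0.1)
```

In the example, the error stalls at 0.40 for two consecutive checks, so the learning rate drops from 0.1 to 0.01 on the fourth check.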


In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully-convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).
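The multi-scale averaging step can be sketched as below. The five scales come from the text; the per-scale class scores are made-up numbers for a hypothetical 3-class problem, and `average_scores` is an illustrative name. Running the network at each scale is outside the scope of the sketch.

```python
# Average class scores obtained at several test scales, then pick the top class.

SCALES = (224, 256, 384, 480, 640)  # shorter-side lengths used at test time

def average_scores(scores_per_scale):
    """Element-wise mean over a list of equal-length score vectors."""
    count = len(scores_per_scale)
    num_classes = len(scores_per_scale[0])
    return [sum(scores[c] for scores in scores_per_scale) / count
            for c in range(num_classes)]

# Hypothetical scores for one image, one vector per scale.
scores_by_scale = {
    224: [0.6, 0.3, 0.1],
    256: [0.5, 0.4, 0.1],
    384: [0.2, 0.7, 0.1],
    480: [0.3, 0.6, 0.1],
    640: [0.4, 0.5, 0.1],
}

averaged = average_scores([scores_by_scale[s] for s in SCALES])
predicted_class = max(range(len(averaged)), key=lambda c: averaged[c])
```

Here the smallest scale alone would favor class 0, but the average over all five scales favors class 1.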
