Deep Residual Learning for Image Recognition
----

https://arxiv.org/abs/1512.03385

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers—8× deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

神经网络越深就越难训练。我们提出一个残差学习框架来训练比以往深得多的网络。我们明确地把层当做输入层的残差函数。我们提供综合经验证明，这种残差网络更容易被优化，通过显著增加深度可以获得更好的准确率。在ImageNet数据集上，我们使用深到152层的残差网络来做评估，它比VGG[41]深了8倍还多，但复杂程度更低。这种网络的组合模型在ImageNet的测试集上可以达到3.57%的错误率。这个结果赢得了ILSVRC2015分类任务的冠军。我们还给出了在CIFAR-10上的100和1000层的分析。

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

对于很多视觉识别任务来说，表达的深度都是非常重要的。仅仅由于极深的表达，我们在COCO物体检测数据集上获得了28%的相对改进。深度残差网络是我们提交ILSVRC和COCO 2015竞赛的基础。我们同时也获得了ImageNet检测、ImageNet定位、COCO检测和COCO分割任务的第一名。

1 Introduction

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high level features [50] and classifiers in an end-to-end multilayer fashion, and the “levels” of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit “very deep” [41] models, with a depth of sixteen [41] to thirty [16]. Many other nontrivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

深度卷积网络[22,21]在图片分类[21,50,40]上导致了一系列的突破。深度网络以端到端的方式自然地集成了底层/中间层/高层特征[50]还有分类器，而且特征的层次可以通过增加更多的层来丰富（深度）。最近的研究表明[41,44]网络的深度是非常重要的，并且在ImageNet上的最好结果[41,44,13,16]都是深度模型。从16层到30层。许多其他的视觉识别任务[8,12,7,32,27]也从深度模型得到非常多的好处。

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

由于深度如此重要，一个问题就产生了：训练更好的网络是否和叠加更多的层一样容易？回答这个问题的一个障碍是臭名昭著的梯度消失/爆炸问题[1,9]，它从一开始就阻碍了收敛。但是这个问题已经在很大程度上被归一化的初始化[23,9,37,13]和中间的归一化层[16]解决了，使得几十层的网络能够通过带反向传播的随机梯度下降（SGD）[22]开始收敛。

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

当更深的网络能够开始收敛时，一个退化问题就暴露出来了：随着网络深度增加，准确率先达到饱和（这可能并不奇怪），然后迅速下降。意外的是，这种退化并不是由过拟合引起的，在一个深度合适的模型上增加更多的层会导致更高的训练误差，[11,42]报告了这一点，我们的实验也充分验证了这一点。图1是一个典型的例子。

![high_training_error.png](high_training_error.png)

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer “plain” networks. The deeper network has higher training error, and thus test error. Similar phenomena on ImageNet are presented in Fig. 4.

图1：训练误差（左）和测试误差（右）在CIFAR-10上，20层和56层的简单网络。较深的网络有很大的训练和测试误差。图4显示了在ImageNet上的相同的现象。

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mapping, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that are comparably good or better than the constructed solution (or unable to do so in feasible time).

这种训练准确率的退化表明不是所有的系统都能同样简单地优化。我们来考虑一个浅网络结构和一个在它上添加层的深网络。用一种方式来构造更深的模型：新增加的层是一对一的映射，其它层从已经训练的浅网络拷贝而来。这种构造方法表明一个深网络不应该比对应的浅网络有更高的训练误差。但是试验表明，我们目前的学习方案比不上这种构造方法（或者没有足够的时间）。
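The construction argument above can be checked on a toy example. The sketch below is only an illustration, not the paper's experiment: the two-layer "shallow" network, its weights, and the helper names (`forward`, `identity_matrix`) are made up for this note. It appends identity layers to a small ReLU network and confirms the deeper network computes the same output, so its training error could not be higher if a solver found this solution.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def linear(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def forward(layers, x):
    """Run x through a stack of (linear -> ReLU) layers."""
    for W in layers:
        x = relu(linear(W, x))
    return x


def identity_matrix(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical "learned" shallow model with two layers.
shallow = [
    [[0.5, -1.0], [2.0, 0.25]],
    [[1.0, 1.0], [-0.5, 0.75]],
]

# Deeper counterpart: copy the shallow layers, then add identity layers.
deeper = shallow + [identity_matrix(2), identity_matrix(2)]

x = [1.0, 2.0]
print(forward(shallow, x))
print(forward(deeper, x))
```

The added layers act as exact identities here because the activations leaving the shallow network are already non-negative, so the extra ReLUs change nothing.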

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) − x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

在这篇论文中，我们引入一种深度残差学习框架来解决这个退化问题。不同于让每个层直接来拟合底层的映射，我们显式的让这些层拟合一个残差映射。形式上，底层映射用H(x)指代，我们让堆叠的非线性层来拟合另一个映射F(x) := H(x) - x。原来的映射就变成了 F(x) + x。我们假设这个残差映射比原映射要更容易优化。极端情况下，如果原映射是identity，把残差变成0，比直接用这些层来拟合这个identity映射要容易。

The formulation of F(x) + x can be realized by feedforward neural networks with “shortcut connections” (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameter nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers.

公式 F(x) + x 可以通过前向网络的“短路连接”来实现（图2）。短路连接[2,34,49]跳过了一个或多个层。在我们这里，短路连接只是简单地做恒等映射，它们的输出被加到堆叠层的输出上（图2）。恒等短路连接既不增加额外的参数，也不增加计算复杂度。整个网络依然可以通过带反向传播的SGD进行端到端训练，并且可以使用常见的库（例如Caffe[19]）在不修改优化器的情况下很容易地实现。

Figure 2. Residual learning: a building block.

![resnet_building_block](resnet_building_block.png)


We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Our extremely deep residual nets are easy to optimize, but the counterpart “plain” nets (that simply stack layers) exhibit higher training error when the depth increases; 2) Our deep residual nets can easily enjoy accuracy gains from greatly increased depth, producing results substantially better than previous networks.

我们在ImageNet[36]上做了充分的实验，来展示这个退化问题并评估我们的方法。我们表明：1）我们的极深残差网络很容易优化，但是对应的“普通”网络（简单叠加层）在深度增加的时候出现了更高的训练误差；2）深度残差网络可以很容易地通过大幅增加深度来获得准确率的提升，得到比以往的网络好得多的结果。

Similar phenomena are also shown on the CIFAR-10 set [20], suggesting that the optimization difficulties and the effects of our method are not just akin to a particular dataset. We present successfully trained models on this dataset with over 100 layers, and explore models with over 1000 layers.

类似的现象同样出现在CIFAR-10上[20]，这说明我们的方法不仅仅只限于一个特定的数据集。我们给出了在这个数据集上一个成功训练的100层的模型，另外还探索了1000层的模型。

On the ImageNet classification dataset [36], we obtain excellent results by extremely deep residual nets. Our 152 layer residual net is the deepest network ever presented on ImageNet, while still having lower complexity than VGG nets [41]. Our ensemble has 3.57% top-5 error on the ImageNet test set, and won the 1st place in the ILSVRC 2015 classification competition. The extremely deep representations also have excellent generalization performance on other recognition tasks, and lead us to further win the 1st places on: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation in ILSVRC & COCO 2015 competitions. This strong evidence shows that the residual learning principle is generic, and we expect that it is applicable in other vision and non-vision problems.

在ImageNet分类数据集[36]上，我们通过极深的残差网络获得了出色的成绩。152层的残差网络是ImageNet上已发表的最深的网络，并且比VGG[41]的复杂度要低。我们的组合模型在ImageNet的测试集上的top-5错误率为3.57%，获得了ILSVRC 2015分类竞赛的第一名。这种极深的表达在别的识别任务中也有出色的泛化能力，使得我们进一步赢得了ILSVRC和COCO 2015竞赛中的多个第一名：ImageNet检测，ImageNet定位，COCO检测，COCO分割。这些有力的证据说明残差学习的原理是通用的，我们期望它也适用于其他视觉和非视觉问题。

2 Related Work

Residual Representations. In image recognition, VLAD [18] is a representation that encodes by the residual vectors with respect to a dictionary, and Fisher Vector [30] can be formulated as a probabilistic version [18] of VLAD. Both of them are powerful shallow representations for image retrieval and classification [4, 48]. For vector quantization, encoding residual vectors [17] is shown to be more effective than encoding original vectors.

**残差的表达形式**。在图片识别领域，VLAD[18]使用字典编码残差向量的方式，Fisher Vector[30]可以看成是VLAD的概率版本。这两种方法是图像识别和分类中非常有效的浅表示。对于向量量化，对残差向量编码比对于原向量编码要有效得多。

In low-level vision and computer graphics, for solving Partial Differential Equations (PDEs), the widely used Multigrid method [3] reformulates the system as subproblems at multiple scales, where each subproblem is responsible for the residual solution between a coarser and a finer scale. An alternative to Multigrid is hierarchical basis preconditioning [45, 46], which relies on variables that represent residual vectors between two scales. It has been shown [3, 45, 46] that these solvers converge much faster than standard solvers that are unaware of the residual nature of the solutions. These methods suggest that a good reformulation or preconditioning can simplify the optimization.

在底层视觉和计算机图形学中，为了求解PDE（偏微分方程），Multigrid方法被广泛使用来把系统分成多个尺度的子问题，而每个子问题负责在较粗和较细的尺度上解决残差问题。Multigrid的替代方法是层次基础预处理，它依赖于两个尺度之间的残差向量的变量表达。[3,45,46]已经证明这些优化方法比不考虑残差属性的方法收敛要迅速得多。这些方法都说明一种好的重新表达和预处理可以简化优化问题。

Shortcut Connections. Practices and theories that lead to shortcut connections [2, 34, 49] have been studied for a long time. An early practice of training multi-layer perceptrons (MLPs) is to add a linear layer connected from the network input to the output [34, 49]. In [44, 24], a few intermediate layers are directly connected to auxiliary classifiers for addressing vanishing/exploding gradients. The papers of [39, 38, 31, 47] propose methods for centering layer responses, gradients, and propagated errors, implemented by shortcut connections. In [44], an “inception” layer is composed of a shortcut branch and a few deeper branches.

**短路连接**。关于短路连接[2,34,49]的实践和理论已经被研究了很长时间。较早的一种训练多层感知机（MLP）的做法，是添加一个从网络输入连接到输出的线性层[34,49]。在[44,24]中，一些中间层被直接连接到辅助分类器上，用来解决梯度消失/爆炸问题。论文[39,38,31,47]提出了通过短路连接来对层响应、梯度和传播误差进行中心化的方法。在[44]中，一个“inception”层由一个短路分支和一些较深的分支组成。

Concurrent with our work, “highway networks” [42, 43] present shortcut connections with gating functions [15]. These gates are data-dependent and have parameters, in contrast to our identity shortcuts that are parameter-free. When a gated shortcut is “closed” (approaching zero), the layers in highway networks represent non-residual functions. On the contrary, our formulation always learns residual functions; our identity shortcuts are never closed, and all information is always passed through, with additional residual functions to be learned. In addition, highway networks have not demonstrated accuracy gains with extremely increased depth (e.g., over 100 layers).

和我们的工作同时进行的“高速网络”（highway networks）[42,43]提出了带门函数[15]的短路连接。这些门依赖于数据并且带有参数，而我们的恒等短路是无参的。当一个带门的短路“关闭”时（接近于0），高速网络中的层表示的是非残差函数。相反，我们的形式始终学习残差函数；我们的恒等短路永远不会关闭，所有的信息始终都可以通过，同时还有额外的残差函数需要学习。另外，高速网络并没有展示出在深度极大增加（例如超过100层）时获得准确率提升的能力。

3 Deep Residual Learning

3.1 Residual Learning

Let us consider H(x) as an underlying mapping to be fit by a few stacked layers (not necessarily the entire net), with x denoting the inputs to the first of these layers. If one hypothesizes that multiple nonlinear layers can asymptotically approximate complicated functions, then it is equivalent to hypothesize that they can asymptotically approximate the residual functions, i.e., H(x) − x (assuming that the input and output are of the same dimensions). So rather than expect stacked layers to approximate H(x), we explicitly let these layers approximate a residual function F(x) := H(x) − x. The original function thus becomes F(x) + x. Although both forms should be able to asymptotically approximate the desired functions (as hypothesized), the ease of learning might be different.

我们把H(x)看作要由几个叠加的层（不需要是整个网络）来拟合的底层映射，x 表示这些层中第一层的输入。如果假设多个非线性层可以渐进地逼近复杂的函数，那么也就等价于假设它们可以渐进地逼近残差函数，也就是H(x) - x（假设输入和输出维度相同）。因此我们显式地让这些层来逼近一个残差函数F(x) := H(x) - x，而不是期望它们逼近H(x)。原函数就变成了F(x) + x。尽管两种形式应该都能渐进地逼近目标函数（如假设所述），但是学习的难易程度可能不同。

This reformulation is motivated by the counterintuitive phenomena about the degradation problem (Fig. 1, left). As we discussed in the introduction, if the added layers can be constructed as identity mappings, a deeper model should have training error no greater than its shallower counterpart. The degradation problem suggests that the solvers might have difficulties in approximating identity mappings by multiple nonlinear layers. With the residual learning reformulation, if identity mappings are optimal, the solvers may simply drive the weights of the multiple nonlinear layers toward zero to approach identity mappings.

这个重新表达是受反直觉的退化现象启发而提出的（图1左）。在介绍部分讨论过，如果新添加的层可以被构造成恒等映射，那么一个更深的模型的训练误差就不应该比对应的浅模型大。退化现象表明，优化器在用多个非线性层逼近恒等映射时可能有困难。在残差学习的形式中，如果恒等映射是最优的，优化器只需把多个非线性层的权重推向0，就可以逼近恒等映射。

In real cases, it is unlikely that identity mappings are optimal, but our reformulation may help to precondition the problem. If the optimal function is closer to an identity mapping than to a zero mapping, it should be easier for the solver to find the perturbations with reference to an identity mapping, than to learn the function as a new one. We show by experiments (Fig. 7) that the learned residual functions in general have small responses, suggesting that identity mappings provide reasonable preconditioning.

在实际中，恒等映射不太可能是最优的，但是我们的重新表达可能有助于对问题进行预处理（precondition）。如果最优函数更接近恒等映射而不是零映射，那么对优化器来说，以恒等映射为参照去寻找扰动，应该比把这个函数当作一个全新的函数来学习要容易。我们通过实验（图7）表明，学习到的残差函数一般来说响应比较小，这说明恒等映射提供了合理的预处理。

3.2 Identity Mapping by Shortcuts

We adopt residual learning to every few stacked layers. A building block is shown in Fig. 2. Formally, in this paper we consider a building block defined as:

$$ y = F(x, \left\{ W_i \right\}) + x \tag{1} $$

Here x and y are the input and output vectors of the layers considered. The function $ F (x, \left\{ W_i \right\}) $ represents the residual mapping to be learned. For the example in Fig. 2 that has two layers, $ F = W_2 σ(W_1 x) $ in which σ denotes ReLU [29] and the biases are omitted for simplifying notations. The operation F + x is performed by a shortcut connection and element-wise addition. We adopt the second nonlinearity after the addition (i.e., σ(y), see Fig. 2).

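我们每隔几个堆叠的层使用残差学习。基本单元如图2所示。形式上，本文考虑的基本单元定义如上式，其中 x 和 y 是所考虑的层的输入和输出向量。函数$ F (x, \left\{ W_i \right\}) $表示要学习的残差映射。例如图2中有两层，$ F = W_2 σ(W_1 x) $，其中σ表示ReLU[29]，为了简化记号省略了偏置。F + x 操作通过短路连接和按元素加法来实现。在加法之后我们采用第二个非线性（也就是σ(y)，见图2）。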
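The building block of Eqn.(1) can be sketched for the fully-connected, two-layer case of Fig. 2. This is a minimal pure-Python illustration under the paper's simplification (biases omitted); the function name `residual_block` and the example weights are assumptions of this note, not code from the paper.

```python
def relu(v):
    """Element-wise ReLU (the sigma in the text)."""
    return [max(0.0, a) for a in v]


def matvec(W, v):
    """Matrix-vector product; W is a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def residual_block(x, W1, W2):
    """y = F(x, {W1, W2}) + x with F = W2 * relu(W1 * x), then relu(y)."""
    f = matvec(W2, relu(matvec(W1, x)))    # residual branch F(x)
    y = [fi + xi for fi, xi in zip(f, x)]  # identity shortcut + element-wise add
    return relu(y)                         # second nonlinearity after the addition


# With all weights at zero the residual branch vanishes and the block
# passes a non-negative input through unchanged.
zeros = [[0.0] * 3 for _ in range(3)]
print(residual_block([0.5, 1.0, 2.0], zeros, zeros))
```

The zero-weight case mirrors the argument of Section 3.1: the identity mapping is reached by driving the residual weights toward zero rather than by fitting an identity with nonlinear layers.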

The shortcut connections in Eqn.(1) introduce neither extra parameter nor computation complexity. This is not only attractive in practice but also important in our comparisons between plain and residual networks. We can fairly compare plain/residual networks that simultaneously have the same number of parameters, depth, width, and computational cost (except for the negligible element-wise addition).

公式1中的短路连接既没有引入额外的参数，也没有增加计算复杂度。这不仅在实际中是有吸引力的，并且在比较普通网络和残差网络的时候也非常重要。我们可以在参数数量、深度、宽度和计算开销都相同的条件下（除了那个几乎可以忽略不计的按元素加法以外）公平地比较普通和残差网络。

The dimensions of x and F must be equal in Eqn.(1). If this is not the case (e.g., when changing the input/output channels), we can perform a linear projection $ W_s $ by the shortcut connections to match the dimensions:

$$ y = F(x, \left\{ W_i \right\}) + W_s x \tag{2} $$

在公式1中，x和F的维度必须相等。如果这不成立的话（例如改变了输入输出的通道数），我们可以对短路连接进行一个线性变换$ W_s $来实现维度的匹配。

We can also use a square matrix $ W_s $ in Eqn.(1). But we will show by experiments that the identity mapping is sufficient for addressing the degradation problem and is economical, and thus $ W_s $ is only used when matching dimensions.

我们也可以在公式1中使用一个方阵$ W_s $，但是通过实验表明，恒等映射已经完全可以解决这个退化问题，并且更经济，所以$ W_s $只是用于维度匹配。

The form of the residual function F is flexible. Experiments in this paper involve a function F that has two or three layers (Fig. 5), while more layers are possible. But if F has only a single layer, Eqn.(1) is similar to a linear layer: $ y = W_1 x + x $, for which we have not observed advantages.

残差函数F的形式是灵活的。本文的实验表明一个函数F有两层或者三层（图5），但是更多的层也是可能的。但如果F只有1层，公式1就类似于一个线性层：$ y = W_1 x + x $，这样并没有什么好处。

We also note that although the above notations are about fully-connected layers for simplicity, they are applicable to convolutional layers. The function $ F(x, \left\{ W_i \right\}) $ can represent multiple convolutional layers. The element-wise addition is performed on two feature maps, channel by channel.

尽管为了简单起见以上都使用全连接层来表述，但是卷积层也同样适用。函数$ F(x, \left\{ W_i \right\}) $可以表示多个卷积层。按元素加法在两个特征图上逐通道进行。

3.3 Network Architectures

We have tested various plain/residual nets, and have observed consistent phenomena. To provide instances for discussion, we describe two models for ImageNet as follows.

我们测试过了多种普通/残差网络，都得到了一致的现象。为了讨论方便，我们将描述两个用于ImageNet的模型。

Plain Network. Our plain baselines (Fig. 3, middle) are mainly inspired by the philosophy of VGG nets [41] (Fig. 3, left). The convolutional layers mostly have 3×3 filters and follow two simple design rules: (i) for the same output feature map size, the layers have the same number of filters; and (ii) if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer. We perform downsampling directly by convolutional layers that have a stride of 2. The network ends with a global average pooling layer and a 1000-way fully-connected layer with softmax. The total number of weighted layers is 34 in Fig. 3 (middle).

**普通网络**。 我们的测试基准普通网络（图3，中）主要是受VGG的启发（图3，左）。卷积层主要都是3x3的卷积核，后跟两个简单的设计原则：(i) 相同的特征图输出有相同的卷积核个数；(ii) 如果特征图大小减半，卷积核的个数翻倍，来保持每一层的时间复杂度。我们直接使用步长为2的卷积层来降采样。这个网络以一个全局的平均池化层，和一个1000路的全连接softmax层结束。带参数的层的总数为34（图3，中）。

It is worth noticing that our model has fewer filters and lower complexity than VGG nets [41] (Fig. 3, left). Our 34 layer baseline has 3.6 billion FLOPs (multiply-adds), which is only 18% of VGG-19 (19.6 billion FLOPs).

值得提出我们模型比VGG的卷积核少，复杂度也低（图3，左）。我们34层的基准模型有36亿次浮点计算（乘法加法），只有VGG-19的18%（196亿次）。
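Design rule (ii) and the FLOPs comparison can be verified with a little arithmetic. The sketch below assumes the usual multiply-add count for a convolution (output height × output width × input channels × output channels × kernel area); the 56×56/64-channel and 28×28/128-channel sizes are illustrative values chosen for this note, and `conv_multiply_adds` is a hypothetical helper.

```python
def conv_multiply_adds(out_h, out_w, in_ch, out_ch, k=3):
    """Multiply-adds of one k x k convolution layer (biases ignored)."""
    return out_h * out_w * in_ch * out_ch * k * k


# Rule (ii): halve the feature map size, double the number of filters.
before = conv_multiply_adds(56, 56, 64, 64)
after = conv_multiply_adds(28, 28, 128, 128)
print(before, after)  # the per-layer cost is preserved

# Ratio quoted in the text: 3.6 billion vs. 19.6 billion FLOPs.
ratio = 3.6 / 19.6
print(f"{ratio:.1%}")
```

Halving both spatial sides divides the cost by four, while doubling both the input and output channels multiplies it by four, which is why the per-layer time complexity stays constant.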

Residual Network. Based on the above plain network, we insert shortcut connections (Fig. 3, right) which turn the network into its counterpart residual version. The identity shortcuts (Eqn.(1)) can be directly used when the input and output are of the same dimensions (solid line shortcuts in Fig. 3). When the dimensions increase (dotted line shortcuts in Fig. 3), we consider two options: (A) The shortcut still performs identity mapping, with extra zero entries padded for increasing dimensions. This option introduces no extra parameter; (B) The projection shortcut in Eqn.(2) is used to match dimensions (done by 1×1 convolutions). For both options, when the shortcuts go across feature maps of two sizes, they are performed with a stride of 2.

**残差网络**。在之上的普通网络的基础上，我们加入短路连接（图3，右）就形成了对应的残差版本。当输入和输出的维度一样时恒等短路（公式1）可以被直接使用（图3中的实线短路）。当维度增加时（图3中的虚线短路），我们有两个选项：(A)短路仍然执行恒等映射，对于多出的维度填充0.这种方法没有多余的参数；(B)公式2中的短路变换来匹配维度（使用1x1卷积完成）。对于两种选项，当短路遇到两种尺寸的特征图时，使用步长2来执行。
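The two shortcut options for increasing dimensions can be sketched on a tiny feature map. This is one plausible reading of the text, not the authors' implementation: feature maps are nested lists indexed as `[channel][row][column]`, the stride-2 identity is taken to be plain subsampling, and the function names are invented for this note.

```python
def subsample(fmap, stride=2):
    """Keep every `stride`-th row and column of each channel."""
    return [[row[::stride] for row in ch[::stride]] for ch in fmap]


def shortcut_option_a(fmap, out_channels, stride=2):
    """Option A: strided identity, extra channels filled with zeros (no parameters)."""
    sub = subsample(fmap, stride)
    h, w = len(sub[0]), len(sub[0][0])
    extra = out_channels - len(sub)
    return sub + [[[0.0] * w for _ in range(h)] for _ in range(extra)]


def shortcut_option_b(fmap, Ws, stride=2):
    """Option B: 1x1 convolution with stride; Ws is out_channels x in_channels."""
    sub = subsample(fmap, stride)
    h, w = len(sub[0]), len(sub[0][0])
    return [
        [[sum(Ws[o][c] * sub[c][i][j] for c in range(len(sub))) for j in range(w)]
         for i in range(h)]
        for o in range(len(Ws))
    ]


# Two 4x4 input channels mapped to four 2x2 output channels.
fmap = [[[float(c * 16 + i * 4 + j) for j in range(4)] for i in range(4)]
        for c in range(2)]
print(shortcut_option_a(fmap, out_channels=4))
```

Option A adds no weights, whereas option B adds one `out_channels × in_channels` matrix per dimension-increasing shortcut.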

Figure 3. Example network architectures for ImageNet. Left: the VGG-19 model [41] (19.6 billion FLOPs) as a reference. Middle: a plain network with 34 parameter layers (3.6 billion FLOPs). Right: a residual network with 34 parameter layers (3.6 billion FLOPs). The dotted shortcuts increase dimensions. Table 1 shows more details and other variants.

图3：ImageNet示例网络结构。左：VGG-19模型（196亿次浮点计算）作为参考。中：一个34层的普通网络（36亿次浮点计算）。右：34层的残差网络（36亿次浮点计算）。虚线短路增加维度。表1给出了更多细节和变种的情况。

![resnet.png](resnet.png)

3.4 Implementation

Our implementation for ImageNet follows the practice in [21, 41]. The image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation [41]. A 224×224 crop is randomly sampled from an image or its horizontal flip, with the per-pixel mean subtracted [21]. The standard color augmentation in [21] is used. We adopt batch normalization (BN) [16] right after each convolution and before activation, following [16]. We initialize the weights as in [13] and train all plain/residual nets from scratch. We use SGD with a mini-batch size of 256. The learning rate starts from 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to $ 60 \times 10^4 $ iterations. We use a weight decay of 0.0001 and a momentum of 0.9. We do not use dropout [14], following the practice in [16].

我们在ImageNet上的实现遵从[21,41]的实践经验。图片被缩放，其较短边在[256, 480]区间内随机采样，用于尺度增强[41]。然后从图片或其水平翻转中随机截取一个224x224的区域，并减去逐像素的均值[21]。同时使用了[21]中的标准色彩增强。按照[16]，在每个卷积层之后、激活之前应用了批归一化（BN）。按照[13]来初始化权重，然后从头训练所有的普通/残差网络。使用批大小为256的SGD。学习率从0.1开始，当误差平稳时除以10，模型最多训练$ 60 \times 10^4 $次迭代。使用权重衰减0.0001和动量0.9。按照[16]的经验，我们没有使用dropout[14]。
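The "divide by 10 when the error plateaus" rule can be sketched as a small scheduler. The text does not say how a plateau is detected, so the `patience` and `min_delta` parameters below are assumptions of this note, as is the class name `PlateauSchedule`; only the starting rate of 0.1 and the factor of 10 come from the text.

```python
class PlateauSchedule:
    """Divide the learning rate by 10 after `patience` checks without improvement."""

    def __init__(self, lr=0.1, factor=0.1, patience=3, min_delta=1e-3):
        self.lr = lr
        self.factor = factor
        self.patience = patience    # assumed: number of non-improving checks tolerated
        self.min_delta = min_delta  # assumed: smallest drop that counts as improvement
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, error):
        """Record a new validation error and return the learning rate to use next."""
        if error < self.best - self.min_delta:
            self.best = error
            self.bad_checks = 0
        else:
            self.bad_checks += 1
            if self.bad_checks >= self.patience:
                self.lr *= self.factor
                self.bad_checks = 0
        return self.lr


schedule = PlateauSchedule()
for err in [0.9, 0.8, 0.8, 0.8, 0.8]:
    lr = schedule.step(err)
print(lr)  # one tenth of the starting rate after three flat checks
```

In practice the decay points may simply have been chosen by inspecting the error curves, so this automated rule is only one way to realize the description.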

In testing, for comparison studies we adopt the standard 10-crop testing [21]. For best results, we adopt the fully convolutional form as in [41, 13], and average the scores at multiple scales (images are resized such that the shorter side is in {224, 256, 384, 480, 640}).

在测试的时候，为了做对比研究，我们采用了标准的10-截取（10-crop）测试[21]。为了得到最好的结果，我们采用了[41,13]中的全卷积形式，并对多个尺度上的得分做平均（图片被缩放，使较短边取{224, 256, 384, 480, 640}）。
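Standard 10-crop testing takes the four corner crops and the center crop, plus the horizontal flip of each, and averages the predictions. The sketch below only computes the crop boxes; the function name and the `(left, top, right, bottom)` box convention are choices made for this note.

```python
def ten_crop_boxes(width, height, crop=224):
    """Return 10 (box, flipped) pairs: 4 corners + center, each with and without flip."""
    if width < crop or height < crop:
        raise ValueError("image is smaller than the crop size")
    right, bottom = width - crop, height - crop
    origins = [
        (0, 0),                     # top-left
        (right, 0),                 # top-right
        (0, bottom),                # bottom-left
        (right, bottom),            # bottom-right
        (right // 2, bottom // 2),  # center
    ]
    boxes = [(l, t, l + crop, t + crop) for l, t in origins]
    return [(box, flipped) for flipped in (False, True) for box in boxes]


for box, flipped in ten_crop_boxes(256, 256):
    print(box, "flipped" if flipped else "original")
```

The class scores of the ten views would then be averaged to give the final prediction for the image.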

4 Experiments

4.1 ImageNet Classification

We evaluate our method on the ImageNet 2012 classification dataset [36] that consists of 1000 classes. The models are trained on the 1.28 million training images, and evaluated on the 50k validation images. We also obtain a final result on the 100k test images, reported by the test server. We evaluate both top-1 and top-5 error rates.

我们用包含1000个类别的ImageNet2012分类数据集来评估我们的方法。模型在128w张图片上训练，在5w张图片上做验证。最后在10w张图片做测试，结果由测试服务器来报告。我们给出了top-1和top-5两个错误率。
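Top-1 and top-5 error rates count a sample as wrong when the true label is not among the 1 or 5 highest-scoring classes. The following sketch uses made-up scores over three classes to keep the example readable; `top_k_error` is a helper written for this note.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k top-scoring classes."""
    wrong = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            wrong += 1
    return wrong / len(labels)


# Toy scores for 3 samples over 3 classes (illustrative values only).
scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
]
labels = [1, 1, 0]
print(top_k_error(scores, labels, k=1))
print(top_k_error(scores, labels, k=2))
```

On ImageNet the same function would be called with 1000-way scores and `k=1` or `k=5`.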

Plain Networks. We first evaluate 18-layer and 34-layer plain nets. The 34-layer plain net is in Fig. 3 (middle). The 18-layer plain net is of a similar form. See Table 1 for detailed architectures.

**普通网络**。我们首先评估了18层和34层的普通网络。就是图3中间的那个34层网络。18层网络有类似的结构，详细的结构见表1。

Table 1. Architectures for ImageNet. Building blocks are shown in brackets (see also Fig. 5), with the number of blocks stacked. Downsampling is performed by conv3_1, conv4_1, and conv5_1 with a stride of 2.

表1：ImageNet的架构。基础模块在方括号中显示，同时显示了几个模块叠加。降采样由conv3_1, conv4_1, conv5_1这几个步长为2的卷积层来完成。

![resnet_plainnets](resnet_plainnets.png)

The results in Table 2 show that the deeper 34-layer plain net has higher validation error than the shallower 18-layer plain net. To reveal the reasons, in Fig. 4 (left) we compare their training/validation errors during the training procedure. We have observed the degradation problem: the 34-layer plain net has higher training error throughout the whole training procedure, even though the solution space of the 18-layer plain network is a subspace of that of the 34-layer one.

表2给出的结果显示，34层的普通网络比18层的普通网络在验证集上的错误率高。为了揭示原因，图4（左）比较了它们在训练过程中的训练/验证错误率。可以看到退化问题使得34层的网络在整个训练过程中的训练误差较大，尽管18层网络的解空间只是34层网络的解空间的一个子集。

![resnet_tab2_18_34_error](resnet_tab2_18_34_error.png)

![resnet_fig4_18_34_training](resnet_fig4_18_34_training.png)

We argue that this optimization difficulty is unlikely to be caused by vanishing gradients. These plain networks are trained with BN [16], which ensures forward propagated signals to have non-zero variances. We also verify that the backward propagated gradients exhibit healthy norms with BN. So neither forward nor backward signals vanish. In fact, the 34-layer plain net is still able to achieve competitive accuracy (Table 3), suggesting that the solver works to some extent. We conjecture that the deep plain nets may have exponentially low convergence rates, which impact the reducing of the training error. The reason for such optimization difficulties will be studied in the future.

我们认为这种优化的难题不太可能是因为梯度消失而造成的。这些普通网络在训练的时候都使用了BN[16]，这可以确保前向传播的信号具有非零的方差。我们也验证了反向传播的梯度在BN的作用下具有健康的范数。因此无论前向还是反向的信号都没有消失。实际上，34层的普通网络仍然可以达到有竞争力的准确度（表3），这说明优化器在一定程度上还是起作用的。我们推测深度的普通网络的收敛速度可能呈指数级降低，这影响了训练误差的下降。这种优化难题的原因将在未来研究。

![resnet_tab3_error_rate.png](resnet_tab3_error_rate.png)

Residual Networks. Next we evaluate 18-layer and 34-layer residual nets (ResNets). The baseline architectures are the same as the above plain nets, except that a shortcut connection is added to each pair of 3×3 filters as in Fig. 3 (right). In the first comparison (Table 2 and Fig. 4 right), we use identity mapping for all shortcuts and zero-padding for increasing dimensions (option A). So they have no extra parameter compared to the plain counterparts.

**残差网络**。接下来我们评估18层和34层的残差网络（ResNets）。基准架构和上面的普通网络相同，只是在每一对3x3卷积层上添加了一个短路连接，如图3（右）所示。在第一组比较中（表2和图4右），我们对所有短路使用恒等映射，并使用补0来实现升维（选项A）。所以相对于对应的普通网络没有增加任何额外的参数。

We have three major observations from Table 2 and Fig. 4. First, the situation is reversed with residual learning – the 34-layer ResNet is better than the 18-layer ResNet (by 2.8%). More importantly, the 34-layer ResNet exhibits considerably lower training error and is generalizable to the validation data. This indicates that the degradation problem is well addressed in this setting and we manage to obtain accuracy gains from increased depth.

从表2和图4来看主要三个观察：首先，残差学习的情况正好相反，34层的ResNet比18层的ResNet好出2.8%。更重要的是，34层的ResNet训练误差较低，并且很好的泛化到验证数据上。这说明退化问题被很好的解决了，我们可以通过增加深度来获得更高的准确度。

Second, compared to its plain counterpart, the 34-layer ResNet reduces the top-1 error by 3.5% (Table 2), resulting from the successfully reduced training error (Fig. 4 right vs. left). This comparison verifies the effectiveness of residual learning on extremely deep systems.

其次，相比于对应的普通网络，34层的ResNet成功地降低了训练误差（图4，右相对于左），使其top-1错误率下降了3.5%（表2）。这种对比证实了残差学习在极深的网络上的有效性。

Last, we also note that the 18-layer plain/residual nets are comparably accurate (Table 2), but the 18-layer ResNet converges faster (Fig. 4 right vs. left). When the net is “not overly deep” (18 layers here), the current SGD solver is still able to find good solutions to the plain net. In this case, the ResNet eases the optimization by providing faster convergence at the early stage.

最后，18层的普通和残差网络都具有相当的准确率（表2），但是18层的ResNet收敛更快（图4，右相对于左）。当网络”不是太深“（这里指18层），当前的优化器还有能力对普通网络找到不错的解。在这种情况下，ResNet让优化更容易，并在早期加快了收敛的速度。

Identity vs. Projection Shortcuts. We have shown that parameter-free, identity shortcuts help with training. Next we investigate projection shortcuts (Eqn.(2)). In Table 3 we compare three options: (A) zero-padding shortcuts are used for increasing dimensions, and all shortcuts are parameter-free (the same as Table 2 and Fig. 4 right); (B) projection shortcuts are used for increasing dimensions, and other shortcuts are identity; and (C) all shortcuts are projections.

**恒等 vs. 变换 短路**。无参数的恒等短路已经证实可以帮助训练。接下来我们要研究变换短路（公式2）。表3中，我们有三个选择：(A)补0升维；(B)变换升维，其他短路使用恒等；(C)所有的短路都使用变换。

Table 3 shows that all three options are considerably better than the plain counterpart. B is slightly better than A. We argue that this is because the zero-padded dimensions in A indeed have no residual learning. C is marginally better than B, and we attribute this to the extra parameters introduced by many (thirteen) projection shortcuts. But the small differences among A/B/C indicate that projection shortcuts are not essential for addressing the degradation problem. So we do not use option C in the rest of this paper, to reduce memory/time complexity and model sizes. Identity shortcuts are particularly important for not increasing the complexity of the bottleneck architectures that are introduced below.

表3显示这三种选项都比对应的普通网络好得多。B比A稍微好一些。我们认为这是因为A中补0的维度确实没有进行残差学习。C比B略好，我们把这归因于许多（十三个）变换短路引入的额外参数。但是A/B/C之间的细微差别表明，变换短路对于解决退化问题并不是必需的。所以我们在本文的其余部分不使用选项C，以减小内存/时间的复杂度和模型的大小。恒等短路对于不增加下面要介绍的瓶颈架构的复杂度尤为重要。

Deeper Bottleneck Architectures. Next we describe our deeper nets for ImageNet. Because of concerns on the training time that we can afford, we modify the building block as a bottleneck design. For each residual function F, we use a stack of 3 layers instead of 2 (Fig. 5). The three layers are 1×1, 3×3, and 1×1 convolutions, where the 1×1 layers are responsible for reducing and then increasing (restoring) dimensions, leaving the 3×3 layer a bottleneck with smaller input/output dimensions. Fig. 5 shows an example, where both designs have similar time complexity.

**深度瓶颈架构**。下面介绍ImageNet上更深的网络。因为时间成本的限制，我们把基础模块修改成瓶颈设计。对每个残差函数F，我们使用三个重叠的层而不是两个（图5）。这三个层由1x1, 3x3 和 1x1三个卷积层组成，其中1x1卷积用来降维和升维，使得3x3层有较小的输入输出维度。图5给出了一个例子，这两种设计有类似的时间复杂度。

![resnet_fig5_bottleneck.png](resnet_fig5_bottleneck.png)

The parameter-free identity shortcuts are particularly important for the bottleneck architectures. If the identity shortcut in Fig. 5 (right) is replaced with projection, one can show that the time complexity and model size are doubled, as the shortcut is connected to the two high-dimensional ends. So identity shortcuts lead to more efficient models for the bottleneck designs.

无参的恒等短路对于瓶颈架构尤为重要。如果恒等短路用变换来替代，时间复杂度和模型的尺寸会增倍，因为短路连接了两个高维的两端。所以恒等映射对于瓶颈架构更加高效。
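A weight count makes the two claims about Fig. 5 concrete. The channel widths used below (a 64-d basic block and a 256-d bottleneck block that reduces to 64-d) are the ones I recall from Fig. 5 of the paper, which is only an image in these notes, so treat them as an assumption; `conv_weights` is a helper written for this note and biases are ignored.

```python
def conv_weights(k, in_ch, out_ch):
    """Number of weights in one k x k convolution (biases ignored)."""
    return k * k * in_ch * out_ch


# Basic block: two 3x3 convolutions on 64-d feature maps.
basic = 2 * conv_weights(3, 64, 64)

# Bottleneck block: 1x1 (256 -> 64), 3x3 (64 -> 64), 1x1 (64 -> 256).
bottleneck = (conv_weights(1, 256, 64)
              + conv_weights(3, 64, 64)
              + conv_weights(1, 64, 256))

# Replacing the identity shortcut with a 1x1 projection on the 256-d ends.
projection = conv_weights(1, 256, 256)

print(basic, bottleneck)                     # similar size for both designs
print((bottleneck + projection) / bottleneck)  # close to 2x with a projection shortcut
```

Since both blocks operate at the same spatial resolution, the weight counts are proportional to the multiply-adds, so the same ratios apply to time complexity.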

50-layer ResNet: We replace each 2-layer block in the 34-layer net with this 3-layer bottleneck block, resulting in a 50-layer ResNet (Table 1). We use option B for increasing dimensions. This model has 3.8 billion FLOPs.

50层ResNet：把34层网络的两层模块替换成3层的瓶颈模块，这样就形成了50层的ResNet（表1）。使用B选项来增维。这个模型有38亿次浮点计算。

101-layer and 152-layer ResNets: We construct 101-layer and 152-layer ResNets by using more 3-layer blocks (Table 1). Remarkably, although the depth is significantly increased, the 152-layer ResNet (11.3 billion FLOPs) still has lower complexity than VGG-16/19 nets (15.3/19.6 billion FLOPs).

101层和152层ResNet：使用更多的3层模块，我们构造了101层和152层的ResNet（表1）。尽管深度大大地增加了，但152层的ResNet（113亿次浮点计算）仍然比VGG-16/19（153/196亿次）的复杂度低。
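The depth figures follow from counting weighted layers: one initial convolution, the layers inside the residual blocks, and the final fully-connected layer. The per-stage block counts below are as I read them from Table 1 of the paper (an image in these notes), so they should be checked against the table; the names are invented for this sketch.

```python
# Blocks per stage (conv2_x .. conv5_x); values assumed from Table 1.
BLOCKS_PER_STAGE = {
    18: (2, 2, 2, 2),
    34: (3, 4, 6, 3),
    50: (3, 4, 6, 3),
    101: (3, 4, 23, 3),
    152: (3, 8, 36, 3),
}


def weighted_layers(depth):
    """First conv + layers in all blocks + final fully-connected layer."""
    layers_per_block = 2 if depth <= 34 else 3  # basic vs. bottleneck block
    return 1 + layers_per_block * sum(BLOCKS_PER_STAGE[depth]) + 1


for depth in BLOCKS_PER_STAGE:
    print(depth, weighted_layers(depth))
```

This also shows why swapping each 2-layer block of the 34-layer net for a 3-layer bottleneck gives exactly 50 layers: 16 blocks × 3 layers + 2 = 50. Projection shortcuts are not counted toward the depth.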

The 50/101/152-layer ResNets are more accurate than the 34-layer ones by considerable margins (Table 3 and 4). We do not observe the degradation problem and thus enjoy significant accuracy gains from considerably increased depth. The benefits of depth are witnessed for all evaluation metrics (Table 3 and 4).

50/101/152层的ResNet比34层的精度要高的多（表3和表4）。没有再出现退化现象所以通过增加深度可以明显的获得精度的提升。增加深度带来的好处从各种指标上都能得到体现（表3和表4）。

![resnet_tab4_error_single_model.png](resnet_tab4_error_single_model.png)

Comparisons with State-of-the-art Methods. In Table 4 we compare with the previous best single-model results. Our baseline 34-layer ResNets have achieved very competitive accuracy. Our 152-layer ResNet has a single-model top-5 validation error of 4.49%. This single-model result outperforms all previous ensemble results (Table 5). We combine six models of different depth to form an ensemble (only with two 152-layer ones at the time of submitting). This leads to 3.57% top-5 error on the test set (Table 5). This entry won the 1st place in ILSVRC 2015.

与前沿方法的比较。表4比较了之前最好的几个单模型结果。我们基准的34层ResNet有相当有竞争力的准确率。152层ResNet单模型的top-5验证集误差只有4.49%。这个单模型的结果已经胜过了以往所有组合模型的结果（表5）。我们用6个不同深度的模型进行了组合（在提交结果时只有两个152层的模型）。在测试集上的top-5误差只有3.57%（表5），从而获得了ILSVRC 2015的第一名。

![resnet_tab5_ensembles.png](resnet_tab5_ensembles.png)

4.2 CIFAR-10 and Analysis

We conducted more studies on the CIFAR-10 dataset [20], which consists of 50k training images and 10k testing images in 10 classes. We present experiments trained on the training set and evaluated on the test set. Our focus is on the behaviors of extremely deep networks, but not on pushing the state-of-the-art results, so we intentionally use simple architectures as follows.

我们对CIFAR-10数据集进行了更多的研究。它包含10个分类的50k张训练图片和10k张测试图片。我们在训练集上进行试验，然后用测试集来评估。我们的关注点在极深的网络，而不是追求更好的结果，所以有意的使用以下简单的架构。

The plain/residual architectures follow the form in Fig. 3 (middle/right). The network inputs are 32×32 images, with the per-pixel mean subtracted. The first layer is 3×3 convolutions. Then we use a stack of 6n layers with 3×3 convolutions on the feature maps of sizes {32, 16, 8} respectively, with 2n layers for each feature map size. The numbers of filters are {16, 32, 64} respectively. The subsampling is performed by convolutions with a stride of 2. The network ends with a global average pooling, a 10-way fully-connected layer, and softmax. There are totally 6n+2 stacked weighted layers. The following table summarizes the architecture:

普通/残差架构遵循图3（中/右）的形式。网络的输入是32x32的图片，并减去了逐像素的均值。第一层是3x3卷积层。然后在大小分别为{32, 16, 8}的特征图上堆叠6n个3x3卷积层，每种特征图大小对应2n个层。卷积核的个数分别为{16, 32, 64}。降采样通过步长为2的卷积来完成。网络以全局平均池化、一个10路全连接层和softmax结束。一共是6n+2个包含权重的层。下面的表格总结了架构：

| output map size | 32x32 | 16x16 | 8x8 |
| --- | --- | --- | --- |
| `#` layers | 1+2n | 2n | 2n |
| `#` filters | 16 | 32 | 64 |

When shortcut connections are used, they are connected to the pairs of 3×3 layers (totally 3n shortcuts). On this dataset we use identity shortcuts in all cases (i.e., option A), so our residual models have exactly the same depth, width, and number of parameters as the plain counterparts.

短路连接在每对3x3的层上（共3n个短路）。在这个数据集上只使用了恒等短路（选项A），所以ResNet和对应的普通网络有相同的深度，宽度和参数的个数。
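The layer bookkeeping above (6n+2 weighted layers, 3n shortcuts) is easy to check by hand or with a few lines of code. The following is a minimal sketch; the function name `cifar_resnet_layout` and the dictionary keys are illustrative choices, not names from the paper.

```python
def cifar_resnet_layout(n: int) -> dict:
    """Layer bookkeeping for the 6n+2 CIFAR-10 networks described above."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # One entry per feature-map size, mirroring the table above.
    stages = [
        {"map_size": 32, "filters": 16, "conv_layers": 1 + 2 * n},  # first conv + 2n
        {"map_size": 16, "filters": 32, "conv_layers": 2 * n},
        {"map_size": 8, "filters": 64, "conv_layers": 2 * n},
    ]
    conv_layers = sum(stage["conv_layers"] for stage in stages)
    return {
        "stages": stages,
        "weighted_layers": conv_layers + 1,  # plus the 10-way fully-connected layer
        "shortcuts": 3 * n,                  # one shortcut per pair of 3x3 layers
    }


# The values of n used in the experiments of this section.
for n in (3, 5, 7, 9, 18, 200):
    layout = cifar_resnet_layout(n)
    print(n, layout["weighted_layers"], layout["shortcuts"])
```

For n = 3 this gives 20 weighted layers and 9 shortcuts; n = 18 and n = 200 give the 110-layer and 1202-layer networks discussed later in this section.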

We use a weight decay of 0.0001 and momentum of 0.9, and adopt the weight initialization in [13] and BN [16] but with no dropout. These models are trained with a minibatch size of 128 on two GPUs. We start with a learning rate of 0.1, divide it by 10 at 32k and 48k iterations, and terminate training at 64k iterations, which is determined on a 45k/5k train/val split. We follow the simple data augmentation in [24] for training: 4 pixels are padded on each side, and a 32×32 crop is randomly sampled from the padded image or its horizontal flip. For testing, we only evaluate the single view of the original 32×32 image.

使用权重衰减0.0001和动量0.9，采用[13]中的权重初始化方法和BN[16]，但不使用dropout。这些模型在两个GPU上训练，批大小为128。学习率从0.1开始，在32k和48k次迭代时除以10，在64k次迭代时结束训练，这个训练计划是在45k/5k的训练/验证划分上确定的。训练时采用[24]中的简单数据增强：每边填充4个像素，再从填充后的图片或其水平翻转中随机截取32x32的区域。测试时，只评估原始32x32图片的单个视图。
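The step schedule and the pad-and-crop augmentation described above can be sketched in plain Python. This is an illustration rather than the authors' training code: the function names are hypothetical, the image is a single-channel list of rows to keep the example dependency-free, and padding with zeros is an assumption (the text only says that 4 pixels are padded on each side).

```python
import random


def learning_rate(iteration: int, base_lr: float = 0.1) -> float:
    """Step schedule: divide by 10 at 32k and 48k iterations, stop at 64k."""
    if not 0 <= iteration < 64_000:
        raise ValueError("iteration is outside the 64k-iteration schedule")
    if iteration < 32_000:
        return base_lr
    if iteration < 48_000:
        return base_lr / 10
    return base_lr / 100


def augment(image, pad=4, rng=random):
    """Pad each side by `pad` pixels, flip with probability 0.5, crop to the original size.

    `image` is a list of rows, each row a list of pixel values.
    Zero padding is an assumption made for this sketch.
    """
    height, width = len(image), len(image[0])
    blank_row = [0] * (width + 2 * pad)
    padded = (
        [list(blank_row) for _ in range(pad)]
        + [[0] * pad + list(row) + [0] * pad for row in image]
        + [list(blank_row) for _ in range(pad)]
    )
    if rng.random() < 0.5:
        padded = [row[::-1] for row in padded]  # horizontal flip
    top = rng.randint(0, 2 * pad)   # randint is inclusive on both ends
    left = rng.randint(0, 2 * pad)
    return [row[left:left + width] for row in padded[top:top + height]]


image = [[r * 32 + c for c in range(32)] for r in range(32)]
crop = augment(image, rng=random.Random(0))
print(len(crop), len(crop[0]), learning_rate(40_000))
```

At test time no augmentation is applied: the original 32×32 image is evaluated as is.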

We compare n = {3, 5, 7, 9}, leading to 20, 32, 44, and 56-layer networks. Fig. 6 (left) shows the behaviors of the plain nets. The deep plain nets suffer from increased depth, and exhibit higher training error when going deeper. This phenomenon is similar to that on ImageNet (Fig. 4, left) and on MNIST (see [42]), suggesting that such an optimization difficulty is a fundamental problem.

我们比较了n等于{3, 5, 7, 9}时得到的20、32、44和56层的网络。图6（左）显示了普通网络的行为。较深的普通网络受到深度增加的影响，层数越多训练误差越高。这个现象和ImageNet（图4，左）以及MNIST（见[42]）上的现象类似，表明这种优化困难是一个根本性的问题。

![resnet_fig6_cifar10.png](resnet_fig6_cifar10.png)

We further explore n = 18 that leads to a 110-layer ResNet. In this case, we find that the initial learning rate of 0.1 is slightly too large to start converging (footnote 5). So we use 0.01 to warm up the training until the training error is below 80% (about 400 iterations), and then go back to 0.1 and continue training. The rest of the learning schedule is as done previously. This 110-layer network converges well (Fig. 6, middle). It has fewer parameters than other deep and thin networks such as FitNet [35] and Highway [42] (Table 6), yet is among the state-of-the-art results (6.43%, Table 6).

进一步，我们探索了n等于18时的110层ResNet。在这种情况下，我们发现初始学习率0.1略微偏大，难以开始收敛。因此我们先用0.01预热训练，直到训练误差低于80%（大约400次迭代），然后回到0.1继续训练。其余的学习计划和之前一样。这个110层的网络收敛得很好（图6，中）。它的参数比FitNet[35]和Highway[42]等其他又深又窄的网络更少（表6），但结果仍属于最先进之列（6.43%，表6）。
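The warm-up rule for the 110-layer network depends on the observed training error rather than on a fixed iteration count. A minimal sketch of that switch follows; the function `warmup_learning_rate` and its `warmed_up` flag are hypothetical names, and how the training error is measured (per minibatch or as a running average) is left open because the text does not specify it.

```python
def warmup_learning_rate(train_error, warmed_up, warmup_lr=0.01, base_lr=0.1):
    """Return (learning_rate, warmed_up) for the next iteration.

    Uses `warmup_lr` until the training error first drops below 80%,
    then switches to `base_lr` and never switches back.
    """
    if not warmed_up and train_error < 0.80:
        warmed_up = True
    return (base_lr if warmed_up else warmup_lr), warmed_up


# Illustrative error trace only; these are not measurements from the paper.
warmed_up = False
for error in (0.90, 0.86, 0.79, 0.83):
    lr, warmed_up = warmup_learning_rate(error, warmed_up)
    print(error, lr)
```

Once the switch has happened, the same step schedule as for the shallower networks applies, starting from 0.1.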

![resnet_tab6_error.png](resnet_tab6_error.png)

Analysis of Layer Responses. Fig. 7 shows the standard deviations (std) of the layer responses. The responses are the outputs of each 3×3 layer, after BN and before other nonlinearity (ReLU/addition). For ResNets, this analysis reveals the response strength of the residual functions. Fig. 7 shows that ResNets have generally smaller responses than their plain counterparts. These results support our basic motivation (Sec.3.1) that the residual functions might be generally closer to zero than the non-residual functions. We also notice that the deeper ResNet has smaller magnitudes of responses, as evidenced by the comparisons among ResNet-20, 56, and 110 in Fig. 7. When there are more layers, an individual layer of ResNets tends to modify the signal less.

**层响应分析**。图7显示了层响应的标准差（std）。这里的响应是每个3x3层的输出，取自BN之后、其他非线性操作（ReLU/加法）之前。对于ResNet，这个分析揭示了残差函数的响应强度。图7表明ResNet的响应一般比对应的普通网络小。这些结果支持了我们最基本的动机（3.1节）：残差函数可能普遍比非残差函数更接近0。我们还注意到，越深的ResNet响应幅度越小，图7中ResNet-20、56和110之间的比较证明了这一点。当层数更多时，ResNet中的单个层对信号的修改往往更小。

![resnet_fig7_respones_std.png](resnet_fig7_respones_std.png)

Exploring Over 1000 layers. We explore an aggressively deep model of over 1000 layers. We set n = 200 that leads to a 1202-layer network, which is trained as described above. Our method shows no optimization difficulty, and this $10^3$-layer network is able to achieve training error `<0.1%` (Fig. 6, right). Its test error is still fairly good (7.93%, Table 6).

**探索超过1000层**。我们探索了一个超过1000层的非常激进的深度模型。设n等于200，就形成一个1202层的网络，通过上面描述的方法来训练。我们的方法在训练时没有遇到困难，这个千层网络训练集误差小于0.1%（图6，右）。测试误差也相当不错（7.93%，表6）。

But there are still open problems on such aggressively deep models. The testing result of this 1202-layer network is worse than that of our 110-layer network, although both have similar training error. We argue that this is because of overfitting. The 1202-layer network may be unnecessarily large (19.4M) for this small dataset. Strong regularization such as maxout [10] or dropout [14] is applied to obtain the best results ([10, 25, 24, 35]) on this dataset. In this paper, we use no maxout/dropout and just simply impose regularization via deep and thin architectures by design, without distracting from the focus on the difficulties of optimization. But combining with stronger regularization may improve results, which we will study in the future.

对于这种极深的模型仍然存在一些开放性的问题。这个1202层网络的测试结果比110层的网络要差，尽管两者的训练误差差不多。我们认为这是过拟合造成的。对于这个小数据集来说，1202层的网络可能大得没有必要（19.4M）。在这个数据集上，最好的结果（[10, 25, 24, 35]）都使用了maxout[10]或dropout[14]等强正则化。在本文中，我们没有使用maxout/dropout，只是通过又深又窄的架构设计来施加正则化，以免偏离对优化困难的关注。但结合更强的正则化可能会改善结果，这个我们将在未来研究。
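The 19.4M figure quoted above can be roughly reproduced from the architecture description in this section. The sketch below is a back-of-the-envelope count under assumptions that the text does not state: the 3×3 convolutions carry no bias, every convolution is followed by BN with one scale and one shift per channel, the fully-connected layer has a bias, and the identity shortcuts add no parameters. The function name is hypothetical.

```python
def cifar_resnet_param_count(n: int, num_classes: int = 10) -> int:
    """Approximate parameter count of the 6n+2 CIFAR-10 ResNet (identity shortcuts)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # (in_channels, out_channels) of every 3x3 convolution, in order.
    convs = [(3, 16)]
    in_channels = 16
    for width in (16, 32, 64):
        for _ in range(2 * n):
            convs.append((in_channels, width))
            in_channels = width
    total = 0
    for c_in, c_out in convs:
        total += 3 * 3 * c_in * c_out  # convolution weights (no bias assumed)
        total += 2 * c_out             # BN scale and shift (assumed)
    total += in_channels * num_classes + num_classes  # fully-connected layer with bias
    return total


for n in (18, 200):
    print(6 * n + 2, round(cifar_resnet_param_count(n) / 1e6, 2))
```

Under these assumptions the 1202-layer network comes out at about 19.42M parameters, consistent with the 19.4M quoted above, and the 110-layer network at about 1.73M, roughly eleven times smaller.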

4.3. Object Detection on PASCAL and MS COCO

Our method has good generalization performance on other recognition tasks. Table 7 and 8 show the object detection baseline results on PASCAL VOC 2007 and 2012 [5] and COCO [26]. We adopt Faster R-CNN [32] as the detection method. Here we are interested in the improvements of replacing VGG-16 [41] with ResNet-101. The detection implementation (see appendix) of using both models is the same, so the gains can only be attributed to better networks. Most remarkably, on the challenging COCO dataset we obtain a 6.0% increase in COCO’s standard metric (mAP@[.5, .95]), which is a 28% relative improvement. This gain is solely due to the learned representations.

我们的方法在其他识别任务上有很好的泛化性能。表7和表8显示了在PASCAL VOC 2007和2012[5]以及COCO[26]上的物体检测基准结果。我们采用Faster R-CNN[32]作为检测方法。这里我们感兴趣的是用ResNet-101替换VGG-16[41]所带来的改进。两个模型使用的检测实现（见附录）是一样的，所以提升只能归功于更好的网络。最值得注意的是，在具有挑战性的COCO数据集上，我们在COCO的标准指标（mAP@[.5, .95]）上提高了6.0%，相对提升28%。这个提升完全来自学习到的表达。
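The two figures quoted above (6.0 points absolute, 28% relative) together imply the approximate baseline they are measured against. Writing $\Delta$ for the absolute gain and $b$ for the VGG-16 baseline mAP@[.5, .95] (symbol names chosen here for illustration):

$$
\frac{\Delta}{b} \approx 0.28
\quad\Rightarrow\quad
b \approx \frac{6.0}{0.28} \approx 21.4,
\qquad
b + \Delta \approx 27.4
$$

These are implied values only; both inputs are rounded in the text, so the exact numbers should be read from Table 8.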

Based on deep residual nets, we won the 1st places in several tracks in ILSVRC & COCO 2015 competitions: ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation. The details are in the appendix.

基于深度残差网络，我们获得了好几个ILSVRC和COCO的第一名：ImageNet检测，ImageNet定位，COCO检测，COCO分割。细节见附录。