ImageNet Classification with Deep Convolutional Neural Networks
-----

https://www.cs.toronto.edu/~kriz/


3 The Architecture

The architecture of our network is summarized in Figure 2. It contains eight learned layers — five convolutional and three fully-connected. Below, we describe some of the novel or unusual features of our network’s architecture. Sections 3.1-3.4 are sorted according to our estimation of their importance, with the most important first.

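The network architecture is summarized in Figure 2. It contains 8 learnable layers: 5 conv and 3 fc. Below we describe several novel or unusual features of the architecture. Sections 3.1-3.4 are ordered by our estimate of their importance, most important first.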

3.1 ReLU Nonlinearity

The standard way to model a neuron’s output $f$ as a function of its input $x$ is with $f(x) = \tanh(x)$ or $f(x) = (1 + e^{-x})^{-1}$. In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity $f(x) = \max(0, x)$. Following Nair and Hinton [20], we refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Deep convolutional neural networks with ReLUs train several times faster than their equivalents with tanh units. This is demonstrated in Figure 1, which shows the number of iterations required to reach 25% training error on the CIFAR-10 dataset for a particular four-layer convolutional network. This plot shows that we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models.

In the standard model, a neuron's output f is modeled with tanh(x). In terms of gradient-descent training time, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x). Following [20], we call neurons with this nonlinearity ReLUs. Deep convolutional networks with ReLUs train several times faster than those with tanh. Figure 1 shows this: the number of iterations a 4-layer convnet needs to reach 25% training error on CIFAR-10. The plot shows that with traditional saturating neuron models we could not have experimented with networks this large.
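The word "saturating" can be made concrete by comparing derivatives. The sketch below is only an illustration of that intuition, not the experiment behind Figure 1: it evaluates the derivative of each nonlinearity at a few positive inputs chosen arbitrarily for this example. The derivatives of tanh and the logistic function shrink toward zero as the input grows, while the derivative of the ReLU stays at 1.

```python
import math

def relu(x):
    """Non-saturating nonlinearity f(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of the ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which shrinks as |x| grows."""
    return 1.0 - math.tanh(x) ** 2

def sigmoid_grad(x):
    """Derivative of the logistic function (1 + e^-x)^-1."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# Sample inputs are arbitrary illustration values, not taken from the paper.
for x in (0.5, 2.0, 5.0, 10.0):
    print(f"x={x:5.1f}  relu'={relu_grad(x):.1f}  "
          f"tanh'={tanh_grad(x):.2e}  sigmoid'={sigmoid_grad(x):.2e}")
```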

We are not the first to consider alternatives to traditional neuron models in CNNs. For example, Jarrett et al. [11] claim that the nonlinearity $f(x) = |\tanh(x)|$ works particularly well with their type of contrast normalization followed by local average pooling on the Caltech-101 dataset. However, on this dataset the primary concern is preventing overfitting, so the effect they are observing is different from the accelerated ability to fit the training set which we report when using ReLUs. Faster learning has a great influence on the performance of large models trained on large datasets.

We are not the first to consider alternatives to the traditional neuron model in CNNs. For example, [11] claims that the nonlinearity f(x) = |tanh(x)| works well on the Caltech-101 dataset. There, however, the main concern is overfitting, so the effect they observe is not the speed-up in fitting the training set that we report. Faster training has a large effect on the performance of large models trained on large datasets.

Figure 1: A four-layer convolutional neural network with ReLUs (solid line) reaches a 25% training error rate on CIFAR-10 six times faster than an equivalent network with tanh neurons (dashed line). The learning rates for each network were chosen independently to make training as fast as possible. No regularization of any kind was employed. The magnitude of the effect demonstrated here varies with network architecture, but networks with ReLUs consistently learn several times faster than equivalents with saturating neurons.

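Figure 1: A 4-layer network with conv layers and ReLUs (solid line) reaches 25% training error on CIFAR-10 six times faster than the equivalent tanh network (dashed line). To make training as fast as possible, the learning rate of each network was chosen independently. The ReLU network is consistently several times faster than networks with saturating neurons.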

![alexnet_fig1_relu.png](alexnet_fig1_relu.png)

3.2 Training on Multiple GPUs

A single GTX 580 GPU has only 3GB of memory, which limits the maximum size of the networks that can be trained on it. It turns out that 1.2 million training examples are enough to train networks which are too big to fit on one GPU. Therefore we spread the net across two GPUs. Current GPUs are particularly well-suited to cross-GPU parallelization, as they are able to read from and write to one another’s memory directly, without going through host machine memory. The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. This means that, for example, the kernels of layer 3 take input from all kernel maps in layer 2. However, kernels in layer 4 take input only from those kernel maps in layer 3 which reside on the same GPU. Choosing the pattern of connectivity is a problem for cross-validation, but this allows us to precisely tune the amount of communication until it is an acceptable fraction of the amount of computation.

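A single GTX 580 GPU has only 3 GB of memory, which limits the size of network that can be trained on it; our network is too big to train on one GPU. We therefore spread the network across two GPUs. Current GPUs are well suited to cross-GPU parallelism because they can read and write each other's memory directly, without going through host memory. Our parallelization scheme puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers. For example, the kernels of layer 3 take all kernel maps of layer 2 as input, while the kernels of layer 4 take input only from the layer-3 maps on the same GPU. Choosing the connectivity pattern is a problem for cross-validation, but it lets us tune the amount of communication precisely until it is an acceptable fraction of the amount of computation.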

The resultant architecture is somewhat similar to that of the “columnar” CNN employed by Cireşan et al. [5], except that our columns are not independent (see Figure 2). This scheme reduces our top-1 and top-5 error rates by 1.7% and 1.2%, respectively, as compared with a net with half as many kernels in each convolutional layer trained on one GPU. The two-GPU net takes slightly less time to train than the one-GPU net.²

The resulting architecture is somewhat similar to the “columnar” CNN of [5], except that our columns are not independent (see Figure 2). Compared with a net with half as many kernels trained on a single GPU, this scheme reduces the top-1 and top-5 error rates by 1.7% and 1.2%, respectively. The two-GPU net runs slightly faster than the one-GPU net.

² The one-GPU net actually has the same number of kernels as the two-GPU net in the final convolutional layer. This is because most of the net’s parameters are in the first fully-connected layer, which takes the last convolutional layer as input. So to make the two nets have approximately the same number of parameters, we did not halve the size of the final convolutional layer (nor the fully-connected layers which follow). Therefore this comparison is biased in favor of the one-GPU net, since it is bigger than “half the size” of the two-GPU net.

² This one-GPU net actually has the same number of kernels as the two-GPU net in the final convolutional layer. This is because most of the network's parameters are in the first fc layer, which takes the last conv layer as input. So, to give the two nets approximately the same number of parameters, we did not halve the size of the last conv layer (nor the fc layers that follow). The comparison is therefore biased toward the one-GPU net, since it is bigger than half of the two-GPU net.

3.3 Local Response Normalization

ReLUs have the desirable property that they do not require input normalization to prevent them from saturating. If at least some training examples produce a positive input to a ReLU, learning will happen in that neuron. However, we still find that the following local normalization scheme aids generalization. Denoting by $a^i_{x,y}$ the activity of a neuron computed by applying kernel $i$ at position $(x, y)$ and then applying the ReLU nonlinearity, the response-normalized activity $b^i_{x,y}$ is given by the expression

$$ b^i_{x,y} = a^i_{x,y} \Big/ \Big( k + \alpha \sum_{j=\max(0,\, i-n/2)}^{\min(N-1,\, i+n/2)} \big(a^j_{x,y}\big)^2 \Big)^{\beta} $$

where the sum runs over $n$ “adjacent” kernel maps at the same spatial position, and $N$ is the total number of kernels in the layer. The ordering of the kernel maps is of course arbitrary and determined before training begins. This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels. The constants $k$, $n$, $\alpha$, and $\beta$ are hyper-parameters whose values are determined using a validation set; we used $k = 2$, $n = 5$, $\alpha = 10^{-4}$, and $\beta = 0.75$. We applied this normalization after applying the ReLU nonlinearity in certain layers (see Section 3.5).

This scheme bears some resemblance to the local contrast normalization scheme of Jarrett et al. [11], but ours would be more correctly termed “brightness normalization”, since we do not subtract the mean activity. Response normalization reduces our top-1 and top-5 error rates by 1.4% and 1.2%, respectively. We also verified the effectiveness of this scheme on the CIFAR-10 dataset: a four-layer CNN achieved a 13% test error rate without normalization and 11% with normalization.³
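The normalization formula of this section acts across kernels at a single spatial position, so it can be sketched on a plain list of activities. The following is a minimal illustration, not the authors' GPU implementation. It assumes that n/2 means integer division (so n = 5 gives two neighbors on each side) and uses the hyper-parameter values quoted above as defaults.

```python
def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Response-normalize the activities of N kernels at one position (x, y).

    a[i] is the post-ReLU activity of kernel i. ASSUMPTION: n/2 in the
    formula is taken as integer division.
    """
    num_kernels = len(a)
    half = n // 2
    b = []
    for i in range(num_kernels):
        lo = max(0, i - half)                # first neighbor in the sum
        hi = min(num_kernels - 1, i + half)  # last neighbor in the sum
        sum_sq = sum(a[j] ** 2 for j in range(lo, hi + 1))
        b.append(a[i] / (k + alpha * sum_sq) ** beta)
    return b

# A strongly active neighbor damps a weak activity (lateral inhibition).
alone = local_response_norm([1.0])[0]
crowded = local_response_norm([1.0, 100.0, 1.0])[0]
print(alone > crowded)
```

Because the mean activity is never subtracted, a vector of all-zero activities stays at zero, which is consistent with the "brightness normalization" reading given above.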

3.4 Overlapping Pooling

Pooling layers in CNNs summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]). To be more precise, a pooling layer can be thought of as consisting of a grid of pooling units spaced s pixels apart, each summarizing a neighborhood of size z × z centered at the location of the pooling unit. If we set s = z, we obtain traditional local pooling as commonly employed in CNNs. If we set s < z, we obtain overlapping pooling. This is what we use throughout our network, with s = 2 and z = 3. This scheme reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively, as compared with the non-overlapping scheme s = 2, z = 2, which produces output of equivalent dimensions. We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit.

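Pooling layers in a CNN summarize the outputs of neighboring groups of neurons in the same kernel map. Traditionally, the neighborhoods summarized by adjacent pooling units do not overlap (e.g., [17, 11, 4]). More precisely, a pooling layer can be viewed as a grid of pooling units spaced s pixels apart, each summarizing a z × z neighborhood centered on the unit. Setting s = z gives the traditional local pooling commonly used in CNNs; setting s < z gives overlapping pooling. Our network uses s = 2, z = 3. Compared with the non-overlapping scheme s = 2, z = 2, which produces output of the same size, this reduces the top-1 and top-5 error rates by 0.4% and 0.3%, respectively. We found that models with overlapping pooling are harder to overfit during training.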
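The claim that s = 2, z = 3 and s = 2, z = 2 produce outputs of the same size can be checked with a few lines. This is a one-dimensional sketch under two simplifying assumptions: windows start at each grid position rather than being centered on it, and no padding is applied. The helper names are made up for this example.

```python
def pooled_size(width, z, s):
    """Number of pooling units along one axis (no padding, floor division)."""
    return (width - z) // s + 1

def max_pool_1d(row, z, s):
    """Max-pool a 1-D list with window size z and stride s."""
    return [max(row[i:i + z]) for i in range(0, len(row) - z + 1, s)]

row = [1, 5, 2, 4, 3, 0, 6]
print(max_pool_1d(row, z=3, s=2))  # overlapping: adjacent windows share one element
print(max_pool_1d(row, z=2, s=2))  # traditional non-overlapping pooling

# Both schemes map a width of 55 to the same output width.
print(pooled_size(55, z=3, s=2), pooled_size(55, z=2, s=2))
```

With s < z each input element near a window boundary contributes to two pooling units, which is the overlap the section refers to.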

3.5 Overall Architecture

Now we are ready to describe the overall architecture of our CNN. As depicted in Figure 2, the net contains eight layers with weights; the first five are convolutional and the remaining three are fully-connected. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. Our network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.

As shown in Figure 2, the network contains 8 layers with weights; the first 5 are conv and the remaining 3 are fc. The output of the last fc layer feeds a 1000-way softmax that produces a distribution over the 1000 class labels. The network maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average log-probability of the correct label under the prediction distribution.
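The objective described here, the average log-probability of the correct label under the softmax distribution, can be written out directly. The sketch below is a plain-Python illustration with made-up function names. Subtracting the maximum logit before exponentiating is a standard numerical-stability step that the text does not mention.

```python
import math

def log_softmax(logits):
    """Log of the softmax distribution, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def mean_log_prob(batch_logits, labels):
    """Average over training cases of log p(correct label)."""
    total = sum(log_softmax(logits)[y] for logits, y in zip(batch_logits, labels))
    return total / len(labels)

# With uniform scores over 1000 classes, every label has probability 1/1000.
uniform = [[0.0] * 1000]
print(mean_log_prob(uniform, [7]), -math.log(1000))
```

Maximizing this quantity is the same as minimizing the usual cross-entropy loss, which is its negative.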

The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU (see Figure 2). The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers, of the kind described in Section 3.4, follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer.

The kernels of the second, fourth, and fifth layers connect only to the outputs of the previous layer on the same GPU. Those of the third layer connect to all outputs of the second layer. Neurons in the fc layers connect to all neurons in the previous layer. LRN layers follow the first and second conv layers. Max-pooling follows the LRN layers and the fifth conv layer. ReLU is applied to the output of every conv and fc layer.

The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully-connected layers have 4096 neurons each.

The first conv layer takes a 224 × 224 × 3 input and uses 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels (the distance between neighboring receptive field centers). The second conv layer takes the output of the first (after LRN and max-pooling) and uses 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth conv layers have no pooling or normalization between them. The third has 384 kernels of size 3 × 3 × 256, the fourth has 384 kernels of size 3 × 3 × 192, and the fifth has 256 kernels of size 3 × 3 × 192. Each fc layer has 4096 neurons.
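The kernel depths quoted above (3, 48, 256, 192, 192) follow from the connectivity rule: a kernel that sees only the maps on its own GPU has a depth of half the previous layer's kernel count, while a kernel in the third layer sees all of them. The sketch below derives the depths from that rule. The dictionary and function names are made up for this example.

```python
# Number of kernels per convolutional layer, as stated in the text.
NUM_KERNELS = {1: 96, 2: 256, 3: 384, 4: 384, 5: 256}
# Layers whose kernels read kernel maps from both GPUs (only the third).
CROSS_GPU_LAYERS = {3}
NUM_GPUS = 2

def kernel_depth(layer, image_channels=3):
    """Depth (third dimension) of each kernel in the given conv layer."""
    if layer == 1:
        return image_channels  # the first layer reads the RGB image
    prev = NUM_KERNELS[layer - 1]
    return prev if layer in CROSS_GPU_LAYERS else prev // NUM_GPUS

for layer, size in zip(range(1, 6), (11, 5, 3, 3, 3)):
    print(f"conv{layer}: {NUM_KERNELS[layer]} kernels of size "
          f"{size} x {size} x {kernel_depth(layer)}")
```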

Figure 2: An illustration of the architecture of our CNN, explicitly showing the delineation of responsibilities between the two GPUs. One GPU runs the layer-parts at the top of the figure while the other runs the layer-parts at the bottom. The GPUs communicate only at certain layers. The network’s input is 150,528-dimensional, and the number of neurons in the network’s remaining layers is given by 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.

Figure 2: Illustration of the CNN architecture, showing the responsibilities of the two GPUs explicitly. One GPU runs the top part and the other the bottom part. The GPUs communicate only at certain layers. The network's input is 224 × 224 × 3 = 150,528-dimensional. The number of neurons in the subsequent layers is 253,440–186,624–64,896–64,896–43,264–4096–4096–1000.

![alexnet_fig2_arch.png](alexnet_fig2_arch.png)

layer config | input dimension | output dimension | number of neurons
- | - | - | -
conv1-11x11x3s4-96 | 224x224x3 = 150,528 (padding 3, i.e. treated as 227x227) | 55x55x96, (227-11)/4 + 1 = 55 | 253,440
lrn | | |
maxpool-3x3s2 | | 27x27x48 (per GPU) |
conv2-5x5x48-256 | | | 186,624
lrn | | |
maxpool-3x3s2 | | 13x13x128 (per GPU) |
conv3-3x3x256-384 | | | 64,896
conv4-3x3x192-384 | | | 64,896
conv5-3x3x192-256 | | | 43,264
maxpool-3x3s2 | | 6x6x128 (per GPU) |
fc-4096 | | | 4096
dropout-0.5 | | |
fc-4096 | | | 4096
dropout-0.5 | | |
fc-1000-softmax | | | 1000

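Why does max-pooling appear to halve the number of channels as well?
How are the neuron counts for each layer computed?
How many parameters does each layer have?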
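These three questions can be worked through with arithmetic. On the first: Section 3.2 puts half of the kernels on each GPU, so channel counts such as 48 and 128 in the table are per-GPU halves of 96 and 256; the pooling itself leaves the channel count unchanged. For the other two, the sketch below assumes that a layer's neuron count is output width × output height × number of kernels, and that its parameter count is kernel volume × number of kernels plus one bias per kernel. The paddings of 2 and 1 for conv2 to conv5 are assumptions chosen so that the widths come out as 27 and 13. They are not stated in the text.

```python
def conv_out(size, kernel, stride=1, pad=0):
    """Output width of a convolution or pooling window (floor division)."""
    return (size + 2 * pad - kernel) // stride + 1

# Spatial widths. 227 follows the table's (227-11)/4 + 1; the paddings for
# conv2..conv5 are ASSUMPTIONS that keep the width unchanged.
w1 = conv_out(227, 11, stride=4)       # conv1   -> 55
p1 = conv_out(w1, 3, stride=2)         # maxpool -> 27
w2 = conv_out(p1, 5, pad=2)            # conv2   -> 27
p2 = conv_out(w2, 3, stride=2)         # maxpool -> 13
w3 = w4 = w5 = conv_out(p2, 3, pad=1)  # conv3-5 -> 13
p5 = conv_out(w5, 3, stride=2)         # maxpool -> 6

# (name, output width, kernel size, kernel depth, number of kernels)
convs = [
    ("conv1", w1, 11, 3, 96),
    ("conv2", w2, 5, 48, 256),
    ("conv3", w3, 3, 256, 384),
    ("conv4", w4, 3, 192, 384),
    ("conv5", w5, 3, 192, 256),
]
# (name, number of inputs, number of outputs); the names are labels only.
fcs = [("fc6", p5 * p5 * 256, 4096), ("fc7", 4096, 4096), ("fc8", 4096, 1000)]

neurons = {name: w * w * n for name, w, k, d, n in convs}
params = {name: k * k * d * n + n for name, w, k, d, n in convs}
params.update({name: i * o + o for name, i, o in fcs})
total_params = sum(params.values())

for name in params:
    print(f"{name}: neurons={neurons.get(name, '-')}, params={params[name]:,}")
print(f"total parameters: {total_params:,}")
```

Under these assumptions the counts for conv2 to conv5 reproduce the caption's 186,624, 64,896, 64,896 and 43,264, and the parameter total is close to the 60 million quoted in Section 4. The first-layer count does not match: 55 × 55 × 96 is 290,400, not the 253,440 given in the caption, so that entry cannot be derived from a 55 × 55 × 96 output.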

4 Reducing Overfitting

Our neural network architecture has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label, this turns out to be insufficient to learn so many parameters without considerable overfitting. Below, we describe the two primary ways in which we combat overfitting.

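The network has 60 million parameters. Although the 1000 classes of ILSVRC make each training example impose 10 bits of constraint on the mapping from image to label, this is not enough to learn so many parameters without considerable overfitting. Below we describe the two main ways we combat overfitting.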

4.1 Data Augmentation

The easiest and most common method to reduce overfitting on image data is to artificially enlarge the dataset using label-preserving transformations (e.g., [25, 4, 5]). We employ two distinct forms of data augmentation, both of which allow transformed images to be produced from the original images with very little computation, so the transformed images do not need to be stored on disk. In our implementation, the transformed images are generated in Python code on the CPU while the GPU is training on the previous batch of images. So these data augmentation schemes are, in effect, computationally free.

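The easiest and most common way to reduce overfitting on image data is to enlarge the dataset artificially with label-preserving transformations (e.g., [25, 4, 5]). We use two different forms of data augmentation, both of which produce transformed images from the originals with very little computation, so the transformed images need not be stored on disk. In our implementation the transformations run as Python code on the CPU while the GPU trains on the previous batch of images, so this data augmentation is in effect computationally free.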

The first form of data augmentation consists of generating image translations and horizontal reflections. We do this by extracting random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 images and training our network on these extracted patches.⁴ This increases the size of our training set by a factor of 2048, though the resulting training examples are, of course, highly interdependent. Without this scheme, our network suffers from substantial overfitting, which would have forced us to use much smaller networks. At test time, the network makes a prediction by extracting five 224 × 224 patches (the four corner patches and the center patch) as well as their horizontal reflections (hence ten patches in all), and averaging the predictions made by the network’s softmax layer on the ten patches.

The first form of data augmentation generates image translations and horizontal reflections. We extract random 224 × 224 patches (and their horizontal reflections) from the 256 × 256 training images and train on these patches. This enlarges the training set by a factor of 2048, although the resulting examples are highly interdependent. Without this scheme the network overfits badly, which would force us to use much smaller networks. At test time, we extract five 224 × 224 patches (the four corners and the center) plus their horizontal reflections (ten in all) and average the softmax predictions over the ten patches.

⁴ This is the reason why the input images in Figure 2 are 224 × 224 × 3-dimensional.

This is why the input in Figure 2 is 224 × 224 × 3.
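The cropping scheme can be sketched on a nested-list image. The code below is an illustration rather than the authors' pipeline, and its function names are invented for this example. The training helper draws a random 224 × 224 patch and flips it with probability 0.5; the test helper lists the ten crop positions. The factor of 2048 quoted above equals 32 × 32 × 2, that is, 32 offsets per axis times two reflections.

```python
import random

def random_crop_flip(img, size=224, rng=random):
    """Random size x size patch of a nested-list image, flipped half the time."""
    h, w = len(img), len(img[0])
    top = rng.randrange(h - size + 1)
    left = rng.randrange(w - size + 1)
    patch = [row[left:left + size] for row in img[top:top + size]]
    if rng.random() < 0.5:
        patch = [row[::-1] for row in patch]  # horizontal reflection
    return patch

def ten_crop_offsets(h=256, w=256, size=224):
    """(top, left, flipped) for four corners + center and their reflections."""
    corners_and_center = [
        (0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
        ((h - size) // 2, (w - size) // 2),
    ]
    return [(t, l, f) for f in (False, True) for t, l in corners_and_center]

image = [[r * 256 + c for c in range(256)] for r in range(256)]
patch = random_crop_flip(image)
print(len(patch), len(patch[0]), len(ten_crop_offsets()))
```

At test time the softmax outputs for the ten patches would be averaged element-wise. That step is omitted here because it needs a trained network.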

The second form of data augmentation consists of altering the intensities of the RGB channels in training images. Specifically, we perform PCA on the set of RGB pixel values throughout the ImageNet training set. To each training image, we add multiples of the found principal components, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Therefore to each RGB image pixel $ I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T $ we add the following quantity: $$ [p_1, p_2, p_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T $$ where $p_i$ and $\lambda_i$ are the $i$th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is the aforementioned random variable. Each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again, at which point it is re-drawn. This scheme approximately captures an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination. This scheme reduces the top-1 error rate by over 1%.

The second form of data augmentation alters the intensities of the RGB channels of the training images. Specifically, we perform PCA on the RGB pixel values of the whole ImageNet training set. To each training image we add multiples of the principal components found, with magnitudes proportional to the corresponding eigenvalues times a random variable drawn from a Gaussian with mean 0 and standard deviation 0.1. So to each RGB image pixel $ I_{xy} = [I_{xy}^R, I_{xy}^G, I_{xy}^B]^T $ we add the quantity $$ [p_1, p_2, p_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T $$ where $p_i$ and $\lambda_i$ are the $i$th eigenvector and eigenvalue of the 3 × 3 covariance matrix of RGB pixel values, and $\alpha_i$ is the random variable mentioned above. Each $\alpha_i$ is drawn once for all pixels of a given training image and is re-drawn when that image is used for training again. This scheme approximately captures an important property of natural images: object identity is invariant to changes in the intensity and color of the illumination. It reduces the top-1 error rate by over 1%.
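The color perturbation can be sketched with NumPy. This is an illustration under stated assumptions, not the authors' code: the function names are invented, pixel values are taken to be floats, and the PCA is done with `np.cov` followed by `np.linalg.eigh` on a sample of pixels. The text does not say how the covariance was computed or whether the pixel values were rescaled.

```python
import numpy as np

def fit_rgb_pca(pixels):
    """pixels: (M, 3) array of RGB values from the training set.

    Returns eigenvalues (3,) and eigenvectors as the columns of a (3, 3) array.
    """
    cov = np.cov(pixels, rowvar=False)      # 3 x 3 covariance of R, G, B
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric matrix -> eigh
    return eigvals, eigvecs

def pca_color_jitter(image, eigvals, eigvecs, rng, sigma=0.1):
    """Add [p1, p2, p3][a1*l1, a2*l2, a3*l3]^T to every pixel of one image."""
    alpha = rng.normal(0.0, sigma, size=3)  # drawn once per use of the image
    shift = eigvecs @ (alpha * eigvals)     # one RGB offset, shape (3,)
    return image + shift                    # broadcast over all pixels

rng = np.random.default_rng(0)
pixels = rng.random((1000, 3))              # stand-in for training-set pixels
eigvals, eigvecs = fit_rgb_pca(pixels)
image = rng.random((4, 4, 3))
jittered = pca_color_jitter(image, eigvals, eigvecs, rng)
print(jittered.shape)
```

The same offset is added to every pixel of the image, which matches the statement that each $\alpha_i$ is drawn once per image rather than once per pixel.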

4.2 Dropout

Combining the predictions of many different models is a very successful way to reduce test errors [1, 3], but it appears to be too expensive for big neural networks that already take several days to train. There is, however, a very efficient version of model combination that only costs about a factor of two during training. The recently-introduced technique, called “dropout” [10], consists of setting to zero the output of each hidden neuron with probability 0.5. The neurons which are “dropped out” in this way do not contribute to the forward pass and do not participate in backpropagation. So every time an input is presented, the neural network samples a different architecture, but all these architectures share weights. This technique reduces complex co-adaptations of neurons, since a neuron cannot rely on the presence of particular other neurons. It is, therefore, forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time, we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions produced by the exponentially-many dropout networks.

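Combining the predictions of many different models is a very successful way to reduce test error [1, 3], but it appears too expensive for large neural networks that already take several days to train. There is, however, a very efficient form of model combination that costs only about a factor of two in training time. The recently introduced technique called "dropout" [10] sets the output of each hidden neuron to zero with probability 0.5. Neurons dropped out in this way take no part in the forward pass or in backpropagation. Each time an input is presented the network samples a different architecture, but all these architectures share weights. The technique reduces complex co-adaptation of neurons, since a neuron cannot rely on the presence of particular other neurons. It is therefore forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. At test time we use all the neurons but multiply their outputs by 0.5, which is a reasonable approximation to taking the geometric mean of the predictive distributions of the dropout networks.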

We use dropout in the first two fully-connected layers of Figure 2. Without dropout, our network exhibits substantial overfitting. Dropout roughly doubles the number of iterations required to converge.

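We use dropout in the first two fc layers of Figure 2. Without dropout, the network overfits substantially. Dropout roughly doubles the number of iterations needed to converge.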
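The two phases of dropout as described in this section can be sketched as follows. The sketch works on a plain list of activations and uses made-up function names. It follows the scheme in the text: zero each output with probability 0.5 during training, then scale all outputs by 0.5 at test time. Many later libraries instead rescale during training, which is a different convention from the one shown here.

```python
import random

def dropout_train(activations, p_drop=0.5, rng=random):
    """Training: zero each hidden output independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else a for a in activations]

def dropout_test(activations, p_drop=0.5):
    """Test: keep every neuron but scale its output by the keep probability."""
    return [a * (1.0 - p_drop) for a in activations]

random.seed(0)  # seed chosen arbitrarily so the example is repeatable
hidden = [1.0] * 10000
kept_fraction = sum(dropout_train(hidden)) / len(hidden)
print(round(kept_fraction, 2), dropout_test([2.0, 4.0]))
```

On average the test-time output of a neuron then equals its expected output during training, which is why the scaling factor is the keep probability.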

5 Details of learning

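We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005. We found that this small amount of weight decay was important for the model to learn. In other words, weight decay here is not merely a regularizer: it reduces the model’s training error. The update rule for weight $w$ was

$$ v_{i+1} := 0.9 \cdot v_i - 0.0005 \cdot \epsilon \cdot w_i - \epsilon \cdot \left\langle \frac{\partial L}{\partial w} \Big|_{w_i} \right\rangle_{D_i} $$

$$ w_{i+1} := w_i + v_{i+1} $$

where $i$ is the iteration index, $v$ is the momentum variable, $\epsilon$ is the learning rate, and $\left\langle \frac{\partial L}{\partial w} \big|_{w_i} \right\rangle_{D_i}$ is the average over the $i$th batch $D_i$ of the derivative of the objective with respect to $w$, evaluated at $w_i$.

We use SGD with a batch size of 128, momentum 0.9, and weight decay 0.0005. We found that this small amount of weight decay is important for training the model. In other words, weight decay here is not merely a regularizer; it reduces the model's training error. The update rule for weight $w$ is the one given above.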
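The update rule can be transcribed for a single scalar weight. The sketch below only restates the two assignments with the quoted constants as defaults. The function name and the sample numbers are chosen for this example.

```python
def sgd_step(w, v, grad, lr, momentum=0.9, weight_decay=0.0005):
    """One update of weight w and momentum variable v.

    grad is the batch-averaged derivative of the objective with respect to w,
    evaluated at the current w; lr is the learning rate (epsilon).
    """
    v_next = momentum * v - weight_decay * lr * w - lr * grad
    w_next = w + v_next
    return w_next, v_next

# Sample values chosen only for illustration.
w, v = 1.0, 0.0
w, v = sgd_step(w, v, grad=2.0, lr=0.01)
print(w, v)
```

Note that the weight-decay term is multiplied by the learning rate, so lowering the learning rate also weakens the decay applied in each step.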

We initialized the weights in each layer from a zero-mean Gaussian distribution with standard deviation 0.01. We initialized the neuron biases in the second, fourth, and fifth convolutional layers, as well as in the fully-connected hidden layers, with the constant 1. This initialization accelerates the early stages of learning by providing the ReLUs with positive inputs. We initialized the neuron biases in the remaining layers with the constant 0.

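Weights are initialized from a Gaussian with mean 0 and standard deviation 0.01. The biases of the second, fourth, and fifth conv layers and of the fc hidden layers are initialized to 1. This gives the ReLUs positive inputs and speeds up the early stage of training. All other biases are initialized to 0.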
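The initialization can be summarized as a small lookup table plus a Gaussian draw. In the sketch below the layer names (`conv1` to `fc8`) are labels introduced for this example. The "remaining layers" with bias 0 are read as the first and third convolutional layers and the output layer, which is an interpretation of the text rather than something it lists explicitly.

```python
import random

# Bias constants per layer; layer names are hypothetical labels.
BIAS_INIT = {
    "conv1": 0.0, "conv2": 1.0, "conv3": 0.0, "conv4": 1.0, "conv5": 1.0,
    "fc6": 1.0, "fc7": 1.0,  # fully-connected hidden layers
    "fc8": 0.0,              # output layer (assumed to be a "remaining" layer)
}

def init_layer(name, num_weights, num_biases, std=0.01, rng=random):
    """Zero-mean Gaussian weights and constant biases for one layer."""
    weights = [rng.gauss(0.0, std) for _ in range(num_weights)]
    biases = [BIAS_INIT[name]] * num_biases
    return weights, biases

weights, biases = init_layer("conv2", num_weights=1200, num_biases=256)
print(len(weights), biases[0])
```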

We used an equal learning rate for all layers, which we adjusted manually throughout training. The heuristic which we followed was to divide the learning rate by 10 when the validation error rate stopped improving with the current learning rate. The learning rate was initialized at 0.01 and reduced three times prior to termination. We trained the network for roughly 90 cycles through the training set of 1.2 million images, which took five to six days on two NVIDIA GTX 580 3GB GPUs.

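All layers use the same learning rate, adjusted manually during training. The heuristic is to divide the learning rate by 10 when the validation error rate stops improving at the current learning rate. The learning rate started at 0.01 and was reduced three times before training ended. Training ran for about 90 cycles through the 1.2 million training images and took five to six days on two GTX 580 3GB GPUs.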
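The learning rate was adjusted by hand, so any code version of the heuristic has to invent a precise stopping criterion. The sketch below assumes one: the rate is divided by 10 whenever the validation error has failed to improve on its best value for `patience` consecutive evaluations. The `patience` parameter and the error values are made up for this example.

```python
def lr_schedule(val_errors, lr=0.01, patience=2):
    """Learning rate in effect after each validation measurement.

    ASSUMPTION: "stopped improving" means no new best validation error for
    `patience` consecutive measurements; the paper made this call manually.
    """
    best = float("inf")
    stale = 0
    history = []
    for err in val_errors:
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                lr /= 10.0  # divide the learning rate by 10
                stale = 0
        history.append(lr)
    return history

# Hypothetical validation errors, not results from the paper.
print(lr_schedule([0.9, 0.8, 0.8, 0.8, 0.7]))
```

Starting from 0.01, three such reductions end at a learning rate of 0.00001.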

Figure 3: 96 convolutional kernels of size 11×11×3 learned by the first convolutional layer on the 224×224×3 input images. The top 48 kernels were learned on GPU 1 while the bottom 48 kernels were learned on GPU 2. See Section 6.1 for details.
