Must Know Tips/Tricks in Deep Neural Networks (by Xiu-Shen Wei)
---

http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html

Deep Neural Networks, especially Convolutional Neural Networks (CNNs), allow computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in visual object recognition, object detection, text recognition and many other domains such as drug discovery and genomics.

In addition, many solid papers have been published on this topic, and some high-quality open source CNN software packages have been made available. There are also well-written CNN tutorials and software manuals. However, a recent and comprehensive summary of the details of how to implement an excellent deep convolutional neural network from scratch still seems to be missing. Thus, we collected and summarized many implementation details for DCNNs. Here we introduce these implementation details, i.e., tricks or tips, for building and training your own deep networks.

# Introduction

We assume you already know the basics of deep learning. Here we present implementation details (tricks or tips) for Deep Neural Networks, especially CNNs for image-related tasks, covering eight aspects:

1. data augmentation
2. pre-processing of images
3. network initialization
4. tips during training
5. selection of activation functions
6. diverse regularizations
7. insights from figures
8. methods for ensembling multiple deep networks

Additionally, the corresponding slides are available at [slide]. If there are any problems or mistakes in these materials and slides, or if there is something important or interesting that you think should be added, feel free to contact me.

# Sec. 1: Data Augmentation

Since deep networks need to be trained on a huge number of training images to achieve satisfactory performance, it is better to use data augmentation to boost performance when the original data set contains only a limited number of training images. In fact, data augmentation has become a must when training a deep network.

- There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering. Moreover, you could try combinations of several different operations, e.g., rotation and random scaling at the same time. In addition, you can raise the saturation and value (S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (the same for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. You could also add a value in [-0.1, 0.1] to the hue (H component of HSV) of all pixels in the image/patch.

- Krizhevsky et al. [1] proposed fancy PCA when training the famous AlexNet in 2012. Fancy PCA alters the intensities of the RGB channels in training images. In practice, you first perform PCA on the set of RGB pixel values throughout your training images. Then, for each training image, add the following quantity to each RGB image pixel (i.e., $I_{xy}=[I_{xy}^R, I_{xy}^G, I_{xy}^B]^T$): $[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$, where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the $3 \times 3$ covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Note that each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again. That is, when the model meets the same training image again, a new $\alpha_i$ is drawn for data augmentation. In [1], they claimed that “fancy PCA could approximately capture an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination”. In terms of classification performance, this scheme reduced the top-1 error rate by over 1% in the ImageNet 2012 competition.
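
The fancy PCA step described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the random array `images` stands in for a real training set, and the function name `fancy_pca` is hypothetical rather than taken from [1].

```python
import numpy as np

def fancy_pca(image, eigvecs, eigvals, rng, sigma=0.1):
    """Add one random PCA color offset to every pixel of an H x W x 3 image."""
    alpha = rng.normal(0.0, sigma, size=3)   # drawn once per use of the image
    offset = eigvecs @ (alpha * eigvals)     # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return image + offset                    # the same offset is broadcast to all pixels

rng = np.random.default_rng(0)
images = rng.random((8, 16, 16, 3))          # stand-in for training images in [0, 1]

# PCA on the set of RGB pixel values of the whole training set.
pixels = images.reshape(-1, 3)
pixels = pixels - pixels.mean(axis=0)
cov = pixels.T @ pixels / pixels.shape[0]    # 3 x 3 covariance of RGB values
eigvals, eigvecs = np.linalg.eigh(cov)       # columns of eigvecs are the p_i

augmented = fancy_pca(images[0], eigvecs, eigvals, rng)
```

Calling `fancy_pca` again on the same image draws a fresh `alpha`, which matches the note that each $\alpha_i$ is re-sampled whenever the image is reused.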

# Sec. 2: Pre-Processing

Now we have obtained a large number of training samples (images/crops), but do not hurry! It is still necessary to pre-process these images/crops. In this section, we introduce several approaches for pre-processing.

The first and simplest pre-processing approach is to zero-center the data (subtract the mean) and then normalize it (divide by the standard deviation), which takes two lines of Python:

    >>> import numpy as np
    >>> X -= np.mean(X, axis = 0) # zero-center
    >>> X /= np.std(X, axis = 0) # normalize

Here, X is the input data (NumIns×NumDim). Another form of this pre-processing normalizes each dimension so that the min and max along the dimension are -1 and 1, respectively. It only makes sense to apply this pre-processing if you have a reason to believe that different input features have different scales (or units) but should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (in the range from 0 to 255), so it is not strictly necessary to perform this additional pre-processing step.
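
As a small sketch of the two normalization variants just described, the snippet below standardizes a NumIns×NumDim matrix and, alternatively, rescales each dimension to [-1, 1]. The function names and the synthetic data are illustrative assumptions, and `scale_to_unit_range` assumes no column is constant.

```python
import numpy as np

def standardize(X, eps=1e-8):
    """Zero-center each column and divide by its standard deviation."""
    X = X - X.mean(axis=0)
    return X / (X.std(axis=0) + eps)         # eps guards against zero variance

def scale_to_unit_range(X):
    """Map each column linearly so that its min is -1 and its max is 1."""
    lo, hi = X.min(axis=0), X.max(axis=0)    # assumes hi > lo for every column
    return 2.0 * (X - lo) / (hi - lo) - 1.0

rng = np.random.default_rng(0)
# Five features with deliberately different scales.
X = rng.random((100, 5)) * np.array([1.0, 10.0, 100.0, 0.1, 255.0])

Xs = standardize(X)
Xr = scale_to_unit_range(X)
```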

Another pre-processing approach similar to the first one is PCA whitening. In this process, the data is first centered as described above. Then, you compute the covariance matrix, which describes the correlation structure of the data:

    >>> X -= np.mean(X, axis = 0) # zero-center
    >>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix

After that, you decorrelate the data by projecting the original (but zero-centered) data onto the eigenbasis:

    >>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
    >>> Xrot = np.dot(X, U) # decorrelate the data

The last transformation is whitening, which takes the data in the eigenbasis and divides every dimension by the square root of its eigenvalue to normalize the scale:

    >>> Xwhite = Xrot / np.sqrt(S + 1e-5) # divide by the square roots of the eigenvalues (S holds the eigenvalues of cov)

Note that 1e-5 (or another small constant) is added to prevent division by zero. One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input. In practice, this can be mitigated by stronger smoothing (i.e., increasing 1e-5 to a larger number).
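
To make the effect of the three whitening steps concrete, the following sketch wraps them in one function and checks that the whitened data has a covariance close to the identity. The function name `pca_whiten` and the synthetic correlated data are assumptions made only for this illustration.

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """Zero-center, rotate into the eigenbasis, and rescale to unit variance."""
    X = X - X.mean(axis=0)                   # zero-center
    cov = X.T @ X / X.shape[0]               # covariance matrix
    U, S, _ = np.linalg.svd(cov)             # S holds the eigenvalues of cov
    Xrot = X @ U                             # decorrelate
    return Xrot / np.sqrt(S + eps)           # whiten

rng = np.random.default_rng(0)
# Fixed, well-conditioned mixing matrix that makes the features correlated.
A = np.array([[2.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 0.5],
              [0.0, 0.0, 0.0, 1.0]])
X = rng.normal(size=(2000, 4)) @ A

Xwhite = pca_whiten(X)
cov_white = Xwhite.T @ Xwhite / Xwhite.shape[0]   # should be close to the identity
```

Increasing `eps` shrinks the low-variance directions instead of stretching them fully, which is the smoothing effect mentioned above.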

Please note that we describe PCA and whitening here just for completeness. In practice, these transformations are not used with Convolutional Neural Networks. However, it is still very important to zero-center the data, and it is common to see normalization of every pixel as well.

# Sec. 3: Initializations

Now the data is ready. However, before you begin to train the network, you have to initialize its parameters.

## All Zero Initialization

In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which you expect to be the “best guess” in expectation. This turns out to be a mistake: if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.

## Initialization with Small Random Numbers

Thus, you still want the weights to be very close to zero, but not identically zero. To achieve this, you can initialize the weights of the neurons to small random numbers very close to zero, which is known as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network. The implementation for weights might simply look like $\text{weights} \sim 0.001 \times N(0, 1)$, where $N(0, 1)$ is a zero-mean, unit-standard-deviation Gaussian. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.

## Calibrating the Variances

One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in (i.e., its number of inputs), as follows:

    >>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n)

where “randn” samples from the aforementioned Gaussian and “n” is the number of inputs of the neuron. This ensures that all neurons in the network initially have approximately the same output distribution, and it empirically improves the rate of convergence. The detailed derivations can be found on pages 18 to 23 of the slides. Please note that these derivations do not consider the influence of ReLU neurons.

## Current Recommendation

As mentioned above, the previous initialization, which calibrates the variances of neurons, does not take ReLUs into account. A more recent paper on this topic by He et al. [4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of the weights of each neuron should be 2.0/n:

    >>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation

This is the current recommendation for use in practice, as discussed in [4].
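
A quick numerical experiment can illustrate why the factor 2.0/n matters for ReLU networks. The sketch below pushes random inputs through a stack of fully connected ReLU layers and measures the variance of the last pre-activation; the layer width, depth and batch size are arbitrary choices for illustration, not values from [4].

```python
import numpy as np

def preactivation_variance(scale, n=256, depth=10, batch=1000, seed=0):
    """Variance of the last layer's pre-activations in a random ReLU network."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(batch, n))           # unit-variance inputs
    for _ in range(depth):
        W = rng.normal(size=(n, n)) * scale   # weights with std `scale`
        z = h @ W                             # pre-activations
        h = np.maximum(z, 0.0)                # ReLU
    return z.var()

n = 256
var_plain = preactivation_variance(np.sqrt(1.0 / n), n=n)  # 1/sqrt(n) calibration
var_he = preactivation_variance(np.sqrt(2.0 / n), n=n)     # He et al. initialization
```

With the 1/sqrt(n) calibration, each ReLU roughly halves the signal's second moment, so `var_plain` shrinks geometrically with depth, while `var_he` stays at roughly the same order of magnitude from layer to layer.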

# Sec. 4: During Training

Now everything is ready. Let's start training deep networks!

- **Filters and pooling size**. During training, input image sizes that are divisible by 2 many times are preferred, such as 32 (e.g., CIFAR-10), 64, 224 (e.g., commonly used for ImageNet), 384 or 512. Moreover, it is important to employ small filters (e.g., $3 \times 3$) and small strides (e.g., 1) with zero-padding, which not only reduces the number of parameters but also improves the accuracy of the whole deep network. In particular, the special case mentioned above, i.e., $3 \times 3$ filters with stride 1 and a zero-padding of 1, preserves the spatial size of the images/feature maps. For the pooling layers, the commonly used pooling size is $2 \times 2$.

- **Learning rate**. As described in a blog post by Ilya Sutskever [2], it is recommended to divide the gradients by the mini-batch size, so that you do not always have to change the learning rate (LR) when you change the mini-batch size. Using the validation set is an effective way to find an appropriate LR. A typical value of the LR at the beginning of training is 0.1. In practice, if you see that you have stopped making progress on the validation set, divide the LR by 2 (or by 5) and keep going, which might give you a pleasant surprise.

- **Fine-tune on pre-trained models**. Nowadays, many state-of-the-art deep networks are released by well-known research groups, e.g., in the Caffe Model Zoo and by the VGG Group. Thanks to the wonderful generalization abilities of pre-trained deep models, you can employ these pre-trained models for your own applications directly. To further improve the classification performance on your data set, a very simple yet effective approach is to fine-tune the pre-trained models on your own data. As shown in the following table, the two most important factors are the size of the new data set (small or big) and its similarity to the original data set. Different strategies of fine-tuning can be used in different situations:
    - A good case is when your new data set is very similar to the data used for training the pre-trained model. In that case, if you have very little data, you can just train a linear classifier on the features extracted from the top layers of the pre-trained model.
    - If you have quite a lot of similar data at hand, fine-tune a few top layers of the pre-trained model with a small learning rate.
    - If your own data set is quite different from the data used for the pre-trained model but contains enough training images, a large number of layers should be fine-tuned on your data, again with a small learning rate.
    - If your data set not only contains little data but is also very different from the data used for the pre-trained model, you will be in trouble. Since the data is limited, it seems better to only train a linear classifier. Since the data set is very different, it might not be best to train the classifier on the top of the network, which contains more dataset-specific features. Instead, it might work better to train an SVM classifier on activations/features from somewhere earlier in the network.

| | very similar dataset | very different dataset |
| - | - | - |
| **very little data** | Use a linear classifier on top-layer features | You are in trouble... Try a linear classifier on features from different stages |
| **quite a lot of data** | Fine-tune a few layers | Fine-tune a large number of layers |

Table: fine-tuning pre-trained models on your own data. Different strategies of fine-tuning are used in different situations. As for data sets, Caltech-101 is similar to ImageNet, since both are object-centric image data sets, while the Places database is different from ImageNet, since one is scene-centric and the other is object-centric.
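
Returning to the tip on filter and pooling sizes in this section, the spatial size of a convolution or pooling output can be checked with the standard formula $(W - F + 2P)/S + 1$, where $W$ is the input size, $F$ the filter size, $P$ the zero-padding and $S$ the stride. The formula is not stated in the text above, and the helper name `conv_output_size` is hypothetical; the sketch only illustrates why $3 \times 3$ filters with stride 1 and padding 1 preserve the spatial size.

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    span = in_size - filter_size + 2 * padding
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the input with this stride/padding")
    return span // stride + 1

# 3x3 filter, stride 1, zero-padding 1: the spatial size is preserved.
same = conv_output_size(224, filter_size=3, stride=1, padding=1)

# 2x2 pooling with stride 2 halves the spatial size.
pooled = conv_output_size(224, filter_size=2, stride=2)
```

This also shows why input sizes that can be halved repeatedly are convenient: each $2 \times 2$ pooling with stride 2 needs an even input size.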

# Sec. 5: Activation Functions

One of the crucial factors in deep networks is the activation function, which brings non-linearity into the network. Here we introduce the details and characteristics of some popular activation functions and give advice later in this section.

![neuron](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/neuron.png)
Figures courtesy of Stanford CS231n.

## Sigmoid

![sigmod](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/sigmod.png)

The sigmoid non-linearity has the mathematical form $\sigma(x)=\frac{1}{1 + e^{-x}}$. It takes a real-valued number and “squashes” it into the range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. The sigmoid function has seen frequent use historically since it has a nice interpretation as the firing rate of a neuron: from not firing at all (0) to fully-saturated firing at an assumed maximum frequency (1).

In practice, the sigmoid non-linearity has fallen out of favor and is now rarely used. It has two major drawbacks:

1. *Sigmoids saturate and kill gradients*. A very undesirable property of the sigmoid neuron is that when the neuron's activation saturates at either tail of 0 or 1, the gradient in these regions is almost zero. Recall that during back-propagation, this (local) gradient is multiplied by the gradient of the whole objective with respect to this gate's output. Therefore, if the local gradient is very small, it will effectively “kill” the gradient, and almost no signal will flow through the neuron to its weights and recursively to its data. Additionally, one must take extra care when initializing the weights of sigmoid neurons to prevent saturation. For example, if the initial weights are too large, then most neurons will become saturated and the network will barely learn.

2. *Sigmoid outputs are not zero-centered*. This is undesirable since neurons in later layers of a Neural Network receive data that is not zero-centered. This has implications for the dynamics during gradient descent: if the data coming into a neuron is always positive (e.g., $x>0$ element-wise in $f=w^Tx+b$), then the gradients on the weights $w$ will, during back-propagation, be either all positive or all negative (depending on the gradient of the whole expression $f$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data, the final update for the weights can have variable signs, which somewhat mitigates this issue. Therefore, this is an inconvenience, but it has less severe consequences than the saturated activation problem above.
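
The saturation problem in the first drawback is easy to see numerically. The sketch below uses the identity $\sigma'(x) = \sigma(x)(1 - \sigma(x))$; the sample points are arbitrary and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Local gradient of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

xs = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
grads = sigmoid_grad(xs)   # largest at x = 0, nearly zero in both tails
```

The local gradient never exceeds 0.25, and it is tiny in both tails, so any upstream gradient multiplied by it is strongly attenuated once the neuron saturates. The outputs `sigmoid(xs)` are also all positive, which is the second drawback.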

## tanh(x)

![tanh](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/tanh.png)

The tanh non-linearity squashes a real-valued number into the range [-1, 1]. Like the sigmoid neuron, its activations saturate, but unlike the sigmoid neuron its output is zero-centered. Therefore, in practice the tanh non-linearity is always preferred to the sigmoid non-linearity.

## Rectified Linear Unit

![relu](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/relu.png)

The Rectified Linear Unit (ReLU) has become very popular in the last few years. It computes the function $f(x)=\max(0,x)$, i.e., the activation is simply thresholded at zero.

There are several pros and cons to using ReLUs:

1. (Pros) Compared to sigmoid/tanh neurons that involve expensive operations (exponentials, etc.), the ReLU can be implemented by simply thresholding a matrix of activations at zero. Moreover, ReLUs do not saturate for positive inputs.

2. (Pros) It was found to greatly accelerate (e.g., by a factor of 6 in [1]) the convergence of stochastic gradient descent compared to the sigmoid/tanh functions. It is argued that this is due to its linear, non-saturating form.

3. (Cons) Unfortunately, ReLU units can be fragile during training and can “die”. For example, a large gradient flowing through a ReLU neuron could cause the weights to update in such a way that the neuron will never activate on any data point again. If this happens, the gradient flowing through the unit will forever be zero from that point on. That is, ReLU units can irreversibly die during training since they can get knocked off the data manifold. For example, you may find that as much as 40% of your network can be “dead” (i.e., neurons that never activate across the entire training data set) if the learning rate is set too high. With a proper setting of the learning rate this is less frequently an issue.

## Leaky ReLU

![lrelu](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/leaky.png)

Leaky ReLUs are one attempt to fix the “dying ReLU” problem. Instead of the function being zero when $x<0$, a leaky ReLU has a small slope in the negative part (of 0.01, or so). That is, the function computes $f(x)=\alpha x$ if $x < 0$ and $f(x) = x$ if $x \geq 0$, where $\alpha$ is a small constant. Some people report success with this form of activation function, but the results are not always consistent.
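
As a minimal sketch of the two definitions above, the snippet below implements ReLU and leaky ReLU together with the leaky ReLU's local gradient. The default $\alpha = 0.01$ follows the text; the function names are illustrative.

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x)."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """f(x) = x for x >= 0 and alpha * x for x < 0."""
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Local gradient: 1 for x >= 0 and alpha for x < 0, so it is never exactly zero."""
    return np.where(x >= 0, 1.0, alpha)

x = np.array([-3.0, -0.5, 0.0, 2.0])
```

Because the gradient for negative inputs is $\alpha$ rather than 0, a unit that is pushed into the negative regime can still receive updates, which is the motivation for the leaky variant.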

## Parametric ReLU

More recently, a broader class of activation functions, namely the rectified unit family, has been proposed. In the following, we discuss these variants of ReLU.

![relufamily](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/relufamily.png)

ReLU, Leaky ReLU, PReLU and RReLU. In these figures, $\alpha_i$ is learned for PReLU and fixed for Leaky ReLU. For RReLU, $\alpha_{ji}$ is a random variable that keeps being sampled from a given range during training and remains fixed in testing.

The first variant is called the parametric rectified linear unit (PReLU) [4]. In PReLU, the slopes of the negative part are learned from data rather than pre-defined. He et al. [4] claimed that PReLU is the key factor in surpassing human-level performance on the ImageNet classification task. The back-propagation and update process of PReLU is very straightforward and similar to that of the traditional ReLU, as shown on page 43 of the slides.
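
To illustrate how the slope can be learned, here is a small sketch of the PReLU forward and backward passes. It assumes a single scalar slope `a` shared by all inputs (one possible variant); the value 0.25 and the function names are illustrative choices only.

```python
import numpy as np

def prelu_forward(x, a):
    """f(x) = x for x > 0 and a * x otherwise, with a learnable slope a."""
    return np.where(x > 0, x, a * x)

def prelu_backward(x, a, grad_out):
    """Gradients of the loss with respect to the input x and the slope a."""
    grad_x = grad_out * np.where(x > 0, 1.0, a)
    # df/da is x where x <= 0 and 0 elsewhere; sum because a is shared.
    grad_a = np.sum(grad_out * np.where(x > 0, 0.0, x))
    return grad_x, grad_a

x = np.array([-2.0, -1.0, 0.5, 3.0])
a = 0.25
grad_x, grad_a = prelu_backward(x, a, np.ones_like(x))
```

The slope `a` is then updated by the optimizer like any other parameter, using `grad_a`.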

## Randomized ReLU

The second variant is called the randomized rectified linear unit (RReLU). In RReLU, the slopes of the negative part are randomized within a given range during training and then fixed during testing. As mentioned in [5], in a recent Kaggle National Data Science Bowl (NDSB) competition, it was reported that RReLU could reduce overfitting due to its randomized nature. Moreover, as suggested by the NDSB competition winner, the random $\alpha_i$ in training is sampled from $1/U(3,8)$, and at test time it is fixed to its expectation, i.e., $2/(l+u)=2/11$.
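
The training/testing behavior of RReLU described above can be sketched as follows. Sampling one slope per input element is an assumption of this illustration, and the function name `rrelu` is hypothetical.

```python
import numpy as np

def rrelu(x, rng=None, lower=3.0, upper=8.0, training=True):
    """Randomized leaky ReLU with slope 1/a, a ~ U(lower, upper), in training."""
    if training:
        a = rng.uniform(lower, upper, size=x.shape)  # one sample per element
        slope = 1.0 / a
    else:
        slope = 2.0 / (lower + upper)                # fixed slope at test time
    return np.where(x >= 0, x, slope * x)

rng = np.random.default_rng(0)
x = np.array([-1.0, -1.0, 2.0])
train_out = rrelu(x, rng, training=True)   # negative entries get random slopes
test_out = rrelu(x, training=False)        # negative entries scaled by 2/11
```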

In [5], the authors evaluated the classification performance of two state-of-the-art CNN architectures with different activation functions on the CIFAR-10, CIFAR-100 and NDSB data sets, as shown in the following tables. Please note that, in these two networks, each convolutional layer is followed by the activation function. Also, the $a$ in these tables actually denotes $1/\alpha$, where $\alpha$ is the aforementioned slope.

![relures](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/relures.png)

From these tables, we can see that ReLU does not achieve the best performance on any of the three data sets. For Leaky ReLU, a larger slope $\alpha$ achieves better accuracy. PReLU easily overfits on small data sets (its training error is the smallest, while its testing error is not satisfactory), but it still outperforms ReLU. In addition, RReLU is significantly better than the other activation functions on NDSB, which shows that RReLU can overcome overfitting, because this data set has less training data than CIFAR-10/CIFAR-100. ***In conclusion, all three types of ReLU variants consistently outperform the original ReLU on these three data sets, and PReLU and RReLU seem to be the better choices. Moreover, He et al. also reported similar conclusions in [4].***

# Sec. 6: Regularizations

There are several ways of controlling the capacity of Neural Networks to prevent overfitting:

- **L2 regularization** is perhaps the most common form of regularization. It can be implemented by penalizing the squared magnitude of all parameters directly in the objective. That is, for every weight $w$ in the network, we add the term $\frac{1}{2} \lambda w^2$ to the objective, where $\lambda$ is the regularization strength. It is common to see the factor of $\frac{1}{2}$ in front because then the gradient of this term with respect to the parameter $w$ is simply $\lambda w$ instead of $2 \lambda w$. The L2 regularization has the intuitive interpretation of heavily penalizing peaky weight vectors and preferring diffuse weight vectors.

- **L1 regularization** is another relatively common form of regularization, where for each weight $w$ we add the term $\lambda |w|$ to the objective. It is possible to combine the L1 regularization with the L2 regularization: $\lambda_1 |w| + \lambda_2 w^2$ (this is called Elastic net regularization). The L1 regularization has the intriguing property that it leads the weight vectors to become sparse during optimization (i.e., many weights end up very close to exactly zero). In other words, neurons with L1 regularization end up using only a sparse subset of their most important inputs and become nearly invariant to the “noisy” inputs. In comparison, final weight vectors from L2 regularization are usually diffuse, small numbers. In practice, if you are not concerned with explicit feature selection, L2 regularization can be expected to give superior performance over L1.

- **Max norm constraints**. Another form of regularization is to enforce an absolute upper bound on the magnitude of the weight vector of every neuron and to use projected gradient descent to enforce the constraint. In practice, this corresponds to performing the parameter update as normal, and then enforcing the constraint by clamping the weight vector $\vec{w}$ of every neuron to satisfy $\| \vec{w} \|_2 < c$. Typical values of $c$ are on the order of 3 or 4. Some people report improvements when using this form of regularization. One of its appealing properties is that the network cannot “explode” even when the learning rate is set too high, because the updates are always bounded.

- **Dropout** is an extremely effective and simple regularization technique, recently introduced by Srivastava et al. in [6], that complements the other methods (L1, L2, max norm). During training, dropout can be interpreted as sampling a Neural Network within the full Neural Network, and only updating the parameters of the sampled network based on the input data. (However, the exponential number of possible sampled networks are not independent because they share parameters.) During testing no dropout is applied, which can be interpreted as evaluating an averaged prediction across the exponentially-sized ensemble of all sub-networks (more about ensembles in Sec. 8). In practice, a dropout ratio of p=0.5 is a reasonable default, but this can be tuned on validation data.

![dropout](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/dropout.png)

Dropout [6], the most popular regularization technique. During training, dropout is implemented by keeping a neuron active with some probability p (a hyper-parameter) and setting it to zero otherwise. In addition, Google applied for a US patent for dropout in 2014.
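
Two of the regularizers in this section, max norm constraints and dropout, can be sketched in a few lines. The sketch makes two assumptions that go beyond the text: each row of `W` is taken to be one neuron's weight vector, and dropout is written in its "inverted" form (kept activations are divided by p during training) so that nothing needs to change at test time. Function names are illustrative.

```python
import numpy as np

def max_norm_project(W, c=3.0):
    """Rescale each row (one neuron's weights) so that its L2 norm is at most c."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))
    return W * scale

def dropout_train(h, p, rng):
    """Inverted dropout: keep each unit with probability p and rescale by 1/p."""
    mask = (rng.random(h.shape) < p) / p
    return h * mask

rng = np.random.default_rng(0)

# Apply the max norm projection after a (here simulated) parameter update.
W = rng.normal(size=(4, 100)) * 2.0
W_clamped = max_norm_project(W, c=3.0)

# Apply dropout to a batch of activations during training only.
h = np.ones((1000, 50))
h_dropped = dropout_train(h, p=0.5, rng=rng)
```

With the rescaling by 1/p, the expected value of each activation is unchanged, so the forward pass at test time simply skips `dropout_train`.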

# Sec. 7: Insights from Figures

With the tips above, you can obtain satisfactory settings (e.g., data processing, architecture choices and details, etc.) for your own deep networks. During training, you can draw some figures to monitor how effectively your networks are training.

- As we know, training is very sensitive to the learning rate. As shown in Fig. 1 below, a very high learning rate causes a quite strange loss curve. A low learning rate makes the training loss decrease very slowly, even after a large number of epochs. In contrast, a high learning rate makes the training loss decrease quickly at the beginning, but it also tends to get stuck in a local minimum, so your network might not achieve satisfactory results in that case. With a good learning rate, shown as the red line in Fig. 1, the loss curve decreases smoothly and finally achieves the best performance.

- Now let's zoom in on the loss curve. An epoch is one full pass over the training data, so each epoch contains multiple mini-batches. If we plot the classification loss for every training batch, the curve looks like Fig. 2. Similar to Fig. 1, if the trend of the loss curve looks too linear, the learning rate is too low; if the loss does not decrease much, the learning rate might be too high. Moreover, the “width” of the curve is related to the batch size. If the “width” looks too wide, i.e., the variance between batches is too large, you should increase the batch size.

- Another tip comes from the accuracy curve. As shown in Fig. 3, the red line is the training accuracy and the green line is the validation accuracy. When the validation accuracy converges, the gap between the red line and the green one shows the effectiveness of your deep network. If the gap is big, your network achieves good accuracy on the training data but only low accuracy on the validation set. Your deep model is clearly overfitting the training set, so you should increase the regularization strength. However, no gap at a low accuracy level is not a good thing either, as it shows that your deep model has low learning capacity. In that case, it is better to increase the model capacity for better results.

![trainfigs](http://lamda.nju.edu.cn/weixs/project/CNNTricks/imgs/trainfigs.png)	
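
The two readings above (the train/validation accuracy gap and the “width” of the per-batch loss curve) can be turned into a rough numeric check. The following is only a minimal sketch: the function names and the thresholds `gap_tol`, `low_acc`, `tail` and `window` are illustrative assumptions, not values from the original text, and they should be tuned for your own task.

```python
from statistics import mean, pstdev


def diagnose_accuracy_gap(train_acc, val_acc, gap_tol=0.05, low_acc=0.6, tail=5):
    """Read the end of two accuracy histories (one value per epoch).

    gap_tol, low_acc and tail are hypothetical thresholds chosen for
    illustration only.
    """
    train_end = mean(train_acc[-tail:])
    val_end = mean(val_acc[-tail:])
    if train_end - val_end > gap_tol:
        # Big gap: good on training data, poor on validation data.
        return "overfitting: increase regularization"
    if train_end < low_acc:
        # No gap, but both curves sit at a low accuracy level.
        return "underfitting: increase model capacity"
    return "ok"


def batch_loss_spread(batch_losses, window=20):
    """Rough 'width' of a per-batch loss curve.

    Averages the standard deviation of the loss inside consecutive windows.
    The value also includes any trend inside a window, so keep the window
    short relative to how fast the loss is changing.
    """
    chunks = [batch_losses[i:i + window]
              for i in range(0, len(batch_losses), window)]
    spreads = [pstdev(chunk) for chunk in chunks if len(chunk) > 1]
    return mean(spreads) if spreads else 0.0


if __name__ == "__main__":
    train = [0.50, 0.70, 0.90, 0.95, 0.97, 0.98, 0.99]
    val = [0.50, 0.60, 0.65, 0.66, 0.66, 0.67, 0.67]
    print(diagnose_accuracy_gap(train, val))
    print(batch_loss_spread([1.0, 2.0] * 20))
```

Comparing `batch_loss_spread` between two runs with different batch sizes gives a number to put next to the visual impression of curve width; a smaller value after increasing the batch size is what the tip above would lead you to expect.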


# Sec. 8: Ensemble

In machine learning, ensemble methods [8], which train multiple learners and then combine them, are a state-of-the-art learning approach. It is well known that an ensemble is usually significantly more accurate than a single learner, and ensemble methods have achieved great success in many real-world tasks. In practical applications, especially challenges and competitions, almost all first-place and second-place winners used ensemble methods.

组合模型通常比单个模型要明显更精确。在实际的应用，特别是挑战和竞赛中，几乎第一名和第二名都是用组合方法。

Here we introduce several techniques for building ensembles in the deep learning scenario.

- **Same model, different initialization.** Use cross-validation to determine the best hyperparameters, then train multiple models with the best set of hyperparameters but with different random initialization. The danger with this approach is that the variety is only due to initialization.

- 相同模型，不同初始化。使用交叉验证来决定最好的超参数，然后使用最好的超参数集合，但用不同的随机初始化来训练多个模型。这种方法的风险在于，多样性仅仅取决于初始化。

- **Top models discovered during cross-validation**. Use cross-validation to determine the best hyperparameters, then pick the top few (e.g., 10) models to form the ensemble. This improves the variety of the ensemble but has the danger of including suboptimal models. In practice, this can be easier to perform since it does not require additional retraining of models after cross-validation. Alternatively, you could directly select several state-of-the-art deep models from the Caffe Model Zoo and combine them into an ensemble.

- 交叉验证中发现的最优模型。使用交叉验证来确定最好的超参数，然后挑选最好的几个（例如10个）模型来形成组合。这增加了组合的多样性，但是风险是引入了不太优秀的模型。实际中，这比较容易操作，因为在交叉验证后不需要额外的重新训练。事实上，你可以在Caffe Model Zoo选择几个先进的模型直接组合。

- **Different checkpoints of a single model**. If training is very expensive, some people have had limited success in taking different checkpoints of a single network over time (for example, after every epoch) and using those to form an ensemble. Clearly, this suffers from some lack of variety, but it can still work reasonably well in practice. The advantage of this approach is that it is very cheap.

- 单个模型的不同检查点。如果训练太费时，有人使用单个网络在不同时间的检查点（例如每个周期之后）来组成组合，也得到了一些有限的成功。显然，这缺少多样性，但在实际中仍然可以工作得不错。这个方法的好处是成本很低。

- **Some practical examples**. If your vision task involves high-level image semantics, e.g., event recognition from still images, a better ensemble method is to employ multiple deep models trained on different data sources to extract different and complementary deep representations. For example, in the Cultural Event Recognition challenge held in association with ICCV’15, we utilized five different deep models trained on images from ImageNet, the Places Database, and the cultural images supplied by the competition organizers. We then extracted five complementary deep features and treated them as multi-view data. Combining the “early fusion” and “late fusion” strategies described in [7], we achieved one of the best performances and ranked 2nd in that challenge. Similar to our work, [9] presented the Stacked NN framework to fuse more deep networks at the same time.

- 一些实际的例子。如果你的视觉任务和高层次的图片语义相关，例如静态图片的事件识别，一个更好的组合方法是利用多个在不同数据源上训练的深度模型，来提取不同的、互补的深度表达。例如在与ICCV'15相关的Cultural Event Recognition挑战中，我们利用了5个不同的深度模型，它们分别在ImageNet、Places数据库和竞赛组织者提供的文化图片上训练。然后，我们提取了5个互补的深度特征，把它们看做是数据的多个视图。结合[7]中描述的“早期融合”和“后期融合”策略，我们实现了最好的表现之一，并在该挑战中排名第二。与我们的工作类似，[9]提出了Stacked NN框架来同时融合更多的深度网络。
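
Whichever way the member models are obtained (different initializations, top cross-validation models, or checkpoints), a common way to combine them is to average their predicted class probabilities. The sketch below assumes each model already outputs a probability vector per sample; the function names are hypothetical. It also shows one common reading of “early fusion” (concatenating features before a single classifier) next to score averaging as a form of “late fusion”. Whether this matches the exact strategies in [7] is an assumption.

```python
def average_probabilities(per_model_probs, weights=None):
    """Weighted mean of class-probability vectors for one sample.

    per_model_probs: one list of class probabilities per model.
    weights: optional per-model weights; uniform if omitted.
    """
    n_models = len(per_model_probs)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(per_model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, per_model_probs))
        for c in range(n_classes)
    ]


def ensemble_predict(per_model_probs, weights=None):
    """Index of the class with the highest averaged probability."""
    avg = average_probabilities(per_model_probs, weights)
    return max(range(len(avg)), key=avg.__getitem__)


def early_fusion(per_model_features):
    """Concatenate the feature vectors of several models into one vector,
    to be fed to a single downstream classifier (not shown here)."""
    fused = []
    for features in per_model_features:
        fused.extend(features)
    return fused


if __name__ == "__main__":
    # Three hypothetical models scoring one sample over two classes.
    probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
    print(ensemble_predict(probs))
    print(early_fusion([[0.1, 0.2], [0.3]]))
```

Non-uniform `weights` (for example, proportional to each model's validation accuracy) are one way to limit the influence of the suboptimal models mentioned above.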

# Miscellaneous

In real-world applications, the data is usually **class-imbalanced**: some classes have a large number of images/training instances, while others have very few. As discussed in a recent technical report [10], training deep CNNs on such imbalanced training sets can have a severely negative impact on overall performance. For this issue, the simplest method is to balance the training data by directly up-sampling and down-sampling the imbalanced data, as shown in [10]. Another interesting solution is the special crop processing used in our challenge solution [7]. Because the original cultural event images are imbalanced, we extract crops only from the classes that have a small number of training images, which on the one hand supplies diverse data sources, and on the other hand addresses the class-imbalance problem. In addition, you can adjust the fine-tuning strategy to overcome class imbalance. For example, you can divide your data set into two parts: one contains the classes that have a large number of training samples (images/crops); the other contains the classes with a limited number of samples. Within each part, the class-imbalance problem will be less serious. When fine-tuning on your data set, first fine-tune on the classes that have a large number of training samples (images/crops), and then continue to fine-tune on the classes with a limited number of samples.
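
The balancing and splitting steps described above can be sketched with plain index bookkeeping. This is a minimal illustration, not the procedure of [10] or [7]: it only covers up-sampling (sampling minority classes with replacement up to the size of the largest class), and the function names and the `min_count` threshold that separates "large" from "small" classes are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_indices(labels, seed=0):
    """Return sample indices in which every class appears as often as the
    largest class. Every original sample is kept; minority classes are
    padded by sampling their own indices with replacement."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for idxs in by_class.values():
        balanced.extend(idxs)
        balanced.extend(rng.choices(idxs, k=target - len(idxs)))
    rng.shuffle(balanced)
    return balanced


def split_classes_by_size(labels, min_count):
    """Split class names into (large, small) groups for two-stage
    fine-tuning; min_count is an illustrative threshold."""
    counts = Counter(labels)
    large = sorted(c for c, n in counts.items() if n >= min_count)
    small = sorted(c for c, n in counts.items() if n < min_count)
    return large, small


if __name__ == "__main__":
    labels = ["a"] * 6 + ["b"] * 2 + ["c"]
    balanced = oversample_indices(labels)
    print(Counter(labels[i] for i in balanced))
    print(split_classes_by_size(labels, min_count=3))
```

With the two groups from `split_classes_by_size`, the first fine-tuning stage would use only samples whose class is in the large group, and the second stage only samples from the small group.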