Must Know Tips/Tricks in Deep Neural Networks (by Xiu-Shen Wei)
---

http://lamda.nju.edu.cn/weixs/project/CNNTricks/CNNTricks.html

Deep Neural Networks, especially Convolutional Neural Networks (CNNs), allow computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state of the art in visual object recognition, object detection, text recognition and many other domains such as drug discovery and genomics.

In addition, many solid papers have been published on this topic, and some high-quality open source CNN software packages have been made available. There are also well-written CNN tutorials and CNN software manuals. However, a recent and comprehensive summary of the details of how to implement an excellent deep convolutional neural network (DCNN) from scratch still seems to be missing. Thus, we collected and summarized many implementation details for DCNNs. Here we introduce these implementation details, i.e., tricks or tips, for building and training your own deep networks.

# Introduction

We assume you already have basic knowledge of deep learning. Here we present the implementation details (tricks or tips) of Deep Neural Networks, especially CNNs for image-related tasks, mainly in eight aspects:

1. data augmentation
2. pre-processing of images
3. initialization of networks
4. some tips during training
5. selection of activation functions
6. diverse regularizations
7. some insights found from figures
8. methods of ensembling multiple deep networks

Additionally, the corresponding slides are available at [slide]. If there are any problems or mistakes in these materials and slides, or if there is something important or interesting that you think should be added, feel free to contact me.

# Sec. 1: Data Augmentation

Since deep networks need to be trained on a huge number of training images to achieve satisfactory performance, if the original image data set contains only a limited number of training images, it is better to do data augmentation to boost the performance. In fact, data augmentation has become a must-do step when training a deep network.

- There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering. Moreover, you can try combinations of several different operations, e.g., rotation and random scaling at the same time. In addition, you can raise the saturation and value (the S and V components of the HSV color space) of all pixels to a power between 0.25 and 4 (the same for all pixels within a patch), multiply these values by a factor between 0.7 and 1.4, and add to them a value between -0.1 and 0.1. You can also add a value in [-0.1, 0.1] to the hue (the H component of HSV) of all pixels in the image/patch.

- Krizhevsky et al. [1] proposed fancy PCA when training the famous AlexNet in 2012. Fancy PCA alters the intensities of the RGB channels in training images. In practice, you first perform PCA on the set of RGB pixel values throughout your training images. Then, for each training image, you add the following quantity to each RGB image pixel (i.e., $I_{xy}=[I_{xy}^R, I_{xy}^G, I_{xy}^B]^T$): $[\mathbf{p}_1, \mathbf{p}_2, \mathbf{p}_3][\alpha_1 \lambda_1, \alpha_2 \lambda_2, \alpha_3 \lambda_3]^T$, where $\mathbf{p}_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the $3 \times 3$ covariance matrix of RGB pixel values, respectively, and $\alpha_i$ is a random variable drawn from a Gaussian with mean zero and standard deviation 0.1. Please note that each $\alpha_i$ is drawn only once for all the pixels of a particular training image until that image is used for training again. That is to say, when the model meets the same training image again, a new $\alpha_i$ is drawn for data augmentation. In [1], they claimed that “fancy PCA could approximately capture an important property of natural images, namely, that object identity is invariant to changes in the intensity and color of the illumination”. In terms of classification performance, this scheme reduced the top-1 error rate by over 1% in the ImageNet 2012 competition.
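As a rough illustration of the two bullets above, the following NumPy sketch applies a random horizontal flip and a random crop, followed by the fancy PCA color shift. It assumes images are stored as arrays of shape (N, H, W, 3) with values in [0, 1]; the function names, the crop size and the random stand-in data are made up for this example and are not taken from [1].

```python
import numpy as np


def random_flip_and_crop(image, crop_size, rng):
    """Randomly flip an (H, W, 3) image left-right, then take a random square crop."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal flip
    h, w, _ = image.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    return image[top:top + crop_size, left:left + crop_size, :]


def fit_rgb_pca(images):
    """PCA over all RGB pixel values of the training set; images has shape (N, H, W, 3)."""
    pixels = images.reshape(-1, 3).astype(np.float64)
    pixels = pixels - pixels.mean(axis=0)
    cov = pixels.T @ pixels / pixels.shape[0]  # 3x3 covariance of RGB values
    eigvals, eigvecs = np.linalg.eigh(cov)     # columns of eigvecs are the p_i
    return eigvals, eigvecs


def fancy_pca(image, eigvals, eigvecs, rng, sigma=0.1):
    """Add [p1, p2, p3] @ [a1*l1, a2*l2, a3*l3]^T to every pixel of one image."""
    alphas = rng.normal(0.0, sigma, size=3)    # drawn once per image, each time it is used
    shift = eigvecs @ (alphas * eigvals)       # a single RGB offset, shape (3,)
    return image.astype(np.float64) + shift    # broadcast over all pixels


rng = np.random.default_rng(0)
train_images = rng.random((8, 40, 40, 3))      # stand-in data, values in [0, 1]
eigvals, eigvecs = fit_rgb_pca(train_images)

crop = random_flip_and_crop(train_images[0], 32, rng)
augmented = fancy_pca(crop, eigvals, eigvecs, rng)
```

Calling `fancy_pca` again on the same image draws fresh `alphas`, which matches the note above that the random variables are redrawn each time an image is reused.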

# Sec. 2: Pre-Processing

Now we have obtained a large number of training samples (images/crops), but please do not hurry! It is still necessary to pre-process these images/crops. In this section, we introduce several approaches for pre-processing.

The first and simplest pre-processing approach is to zero-center the data and then normalize it, which takes only a few lines of Python code:

    >>> import numpy as np
    >>> X -= np.mean(X, axis = 0) # zero-center: subtract the per-dimension mean
    >>> X /= np.std(X, axis = 0) # normalize: divide by the per-dimension standard deviation

where X is the input data (NumIns×NumDim, stored as a floating-point array so that the in-place operations work). Another form of this pre-processing normalizes each dimension so that the min and max along the dimension are -1 and 1, respectively. It only makes sense to apply this pre-processing if you have a reason to believe that different input features have different scales (or units) but should be of approximately equal importance to the learning algorithm. In the case of images, the relative scales of pixels are already approximately equal (in the range from 0 to 255), so it is not strictly necessary to perform this additional pre-processing step.
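The min/max variant mentioned above could look roughly like the sketch below. The function name `scale_to_unit_range` and the `eps` guard for constant columns are additions for this example, not part of the original notes.

```python
import numpy as np


def scale_to_unit_range(X, eps=1e-12):
    """Linearly rescale each column of X (NumIns x NumDim) so its min is -1 and its max is 1."""
    X = np.asarray(X, dtype=np.float64)
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    span = np.maximum(hi - lo, eps)  # eps avoids division by zero for constant columns
    return 2.0 * (X - lo) / span - 1.0


# Two features with very different scales end up in the same range.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])
X_scaled = scale_to_unit_range(X)
```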

Another pre-processing approach similar to the first one is PCA whitening. In this process, the data is first centered as described above. Then, you compute the covariance matrix, which describes the correlation structure in the data:

    >>> X -= np.mean(X, axis = 0) # zero-center
    >>> cov = np.dot(X.T, X) / X.shape[0] # compute the covariance matrix

After that, you decorrelate the data by projecting the original (but zero-centered) data onto the eigenbasis:

    >>> U,S,V = np.linalg.svd(cov) # compute the SVD factorization of the data covariance matrix
    >>> Xrot = np.dot(X, U) # decorrelate the data

The last transformation is whitening, which takes the data in the eigenbasis and divides every dimension by the square root of its eigenvalue (i.e., by its standard deviation) to normalize the scale:

    >>> Xwhite = Xrot / np.sqrt(S + 1e-5) # S holds the eigenvalues of cov (equal to its singular values); divide by their square roots

Note that 1e-5 (or another small constant) is added here to prevent division by zero. One weakness of this transformation is that it can greatly exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant dimensions of tiny variance that are mostly noise) to be of equal size in the input. In practice, this can be mitigated by stronger smoothing (i.e., increasing 1e-5 to a larger number).
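Putting the steps above together, a minimal sketch of PCA whitening as one function might look as follows. The name `pca_whiten` and the synthetic correlated data are assumptions for illustration; the check at the end only confirms that the whitened data has a covariance close to the identity matrix.

```python
import numpy as np


def pca_whiten(X, eps=1e-5):
    """Zero-center, decorrelate and whiten X (NumIns x NumDim)."""
    X = np.asarray(X, dtype=np.float64)
    X = X - X.mean(axis=0)             # zero-center
    cov = X.T @ X / X.shape[0]         # covariance matrix
    U, S, _ = np.linalg.svd(cov)       # columns of U: eigenvectors, S: eigenvalues
    X_rot = X @ U                      # decorrelate
    return X_rot / np.sqrt(S + eps)    # whiten: unit variance per dimension


rng = np.random.default_rng(0)
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.5, 1.0, 0.0],
                   [0.5, 0.3, 0.2]])
X = rng.normal(size=(5000, 3)) @ mixing          # correlated synthetic features
X_white = pca_whiten(X)
cov_white = X_white.T @ X_white / X_white.shape[0]
```

Raising `eps` shrinks the low-variance directions instead of stretching them to unit scale, which is the smoothing effect described above.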

Please note that we describe these pre-processing approaches here just for completeness. In practice, the PCA and whitening transformations are not used with Convolutional Neural Networks. However, it is still very important to zero-center the data, and it is common to see normalization of every pixel as well.
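For images, zero-centering could be sketched as below. The array shapes and random stand-in data are assumptions for this example, and reusing the training-set mean for the test images is a common convention that the text above does not spell out.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in image sets with pixel values in [0, 255], shape (N, H, W, 3).
train = rng.integers(0, 256, size=(100, 32, 32, 3)).astype(np.float64)
test = rng.integers(0, 256, size=(20, 32, 32, 3)).astype(np.float64)

mean_image = train.mean(axis=0)             # per-pixel mean, shape (32, 32, 3)
channel_mean = train.mean(axis=(0, 1, 2))   # alternative: one mean per color channel

train_centered = train - mean_image
test_centered = test - mean_image           # assumption: reuse the training statistics
```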

# Sec. 3: Initializations

Now the data is ready. However, before you begin training the network, you have to initialize its parameters.

## All Zero Initialization

In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea might then be to set all the initial weights to zero, which you would expect to be the “best guess” in expectation. But this turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo exactly the same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.
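The symmetry argument can be checked numerically. The sketch below uses a hypothetical two-layer network (tanh hidden layer, linear output, squared error) that is not from the original text, and compares the gradient of the hidden-layer weights under constant, all-zero and small random initialization.

```python
import numpy as np


def hidden_weight_gradient(W1, W2, X, y):
    """Gradient of half the mean squared error w.r.t. W1 for X -> tanh(X W1) -> (.) W2."""
    H = np.tanh(X @ W1)            # hidden activations, shape (N, hidden)
    err = H @ W2 - y               # prediction error, shape (N, 1)
    dH = err @ W2.T                # back-propagate into the hidden layer
    dZ = dH * (1.0 - H ** 2)       # through the tanh non-linearity
    return X.T @ dZ / X.shape[0]   # shape (inputs, hidden): one column per neuron


rng = np.random.default_rng(0)
X = rng.standard_normal((16, 4))
y = rng.standard_normal((16, 1))

# Identical weights: every hidden neuron receives exactly the same gradient.
g_const = hidden_weight_gradient(np.full((4, 3), 0.5), np.full((3, 1), 0.5), X, y)

# All zeros: the gradient for the hidden weights vanishes entirely in this network.
g_zero = hidden_weight_gradient(np.zeros((4, 3)), np.zeros((3, 1)), X, y)

# Small random weights: the columns differ, so the neurons can diverge.
g_rand = hidden_weight_gradient(0.01 * rng.standard_normal((4, 3)),
                                0.01 * rng.standard_normal((3, 1)), X, y)
```

Comparing the columns of `g_const` with those of `g_rand` shows the difference: the former are identical, the latter are not.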

## Initialization with Small Random Numbers

Thus, you still want the weights to be very close to zero, but not identically zero. To achieve this, you can initialize the weights of the neurons to small random numbers that are very close to zero, which is known as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network. The implementation for the weights might simply look like $weights \sim 0.001 \times N(0,1)$, where $N(0,1)$ is a zero-mean, unit-standard-deviation Gaussian. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.

## Calibrating the Variances

One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by dividing its weight vector by the square root of its fan-in (i.e., its number of inputs), as follows:

    >>> w = np.random.randn(n) / np.sqrt(n) # calibrating the variances with 1/sqrt(n)

where “randn” is the aforementioned Gaussian and “n” is the number of the neuron's inputs. This ensures that all neurons in the network initially have approximately the same output distribution, and it empirically improves the rate of convergence. The detailed derivations can be found on pages 18 to 23 of the slides. Please note that the derivations do not consider the influence of ReLU neurons.
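The effect of the 1/sqrt(n) calibration can be checked empirically. The sketch below is not the derivation from the slides; it assumes unit-variance Gaussian inputs and simply measures the variance of the pre-activation $s = \sum_i w_i x_i$ for a few fan-in values. The helper name and the trial count are arbitrary choices for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def output_variance(n, scale, trials=5000):
    """Empirical variance of s = w . x with x ~ N(0, 1) and w = scale * N(0, 1), fan-in n."""
    x = rng.standard_normal((trials, n))
    w = scale * rng.standard_normal((trials, n))
    return float(np.var(np.sum(w * x, axis=1)))


results = {}
for n in (10, 100, 400):
    raw = output_variance(n, 1.0)                        # grows roughly like n
    calibrated = output_variance(n, 1.0 / np.sqrt(n))    # stays close to 1
    results[n] = (raw, calibrated)
    print(f"n={n}: uncalibrated variance {raw:.1f}, calibrated variance {calibrated:.2f}")
```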

## Current Recommendation

As mentioned above, the previous initialization, which calibrates the variances of neurons, does not take ReLUs into account. A more recent paper on this topic by He et al. [4] derives an initialization specifically for ReLUs, reaching the conclusion that the variance of the weights of each neuron should be 2.0/n:

    >>> w = np.random.randn(n) * np.sqrt(2.0/n) # current recommendation

which is the current recommendation for use in practice, as discussed in [4].
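To see why the factor 2.0 matters for ReLUs, one can push random data through a stack of randomly initialized ReLU layers and watch the scale of the activations. The sketch below is an illustrative experiment, not the analysis from [4]; the layer width, depth and batch size are arbitrary choices.

```python
import numpy as np


def mean_square_after_relu_stack(weight_std_fn, n=512, depth=10, batch=256, seed=0):
    """Mean squared activation after `depth` fully connected ReLU layers of width n."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, n))                  # unit-variance inputs
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * weight_std_fn(n)
        a = np.maximum(a @ W, 0.0)                       # ReLU
    return float(np.mean(a ** 2))


he = mean_square_after_relu_stack(lambda n: np.sqrt(2.0 / n))          # variance 2/n
calibrated = mean_square_after_relu_stack(lambda n: np.sqrt(1.0 / n))  # variance 1/n
print(f"variance 2/n: {he:.3f}, variance 1/n: {calibrated:.6f}")
```

With variance 1/n, each ReLU layer roughly halves the mean squared activation, so the signal shrinks quickly with depth; with variance 2/n it stays at about the same scale.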

# Sec. 4: During Training

Now, everything is ready. Let’s start training deep networks!

- **Filters and pooling size**. During training, the size of the input images should preferably be divisible by 2 many times, such as 32 (e.g., CIFAR-10), 64, 224 (e.g., commonly used for ImageNet), 384 or 512. Moreover, it is important to employ small filters (e.g., $3 \times 3$) and small strides (e.g., 1) with zero-padding, which not only reduces the number of parameters but also improves the accuracy of the whole deep network. Meanwhile, the special case mentioned above, i.e., $3 \times 3$ filters with stride 1 and one pixel of zero-padding, preserves the spatial size of the images/feature maps. For the pooling layers, the commonly used pooling size is $2 \times 2$.


- **Learning rate**. As described in a blog post by Ilya Sutskever [2], it is recommended to divide the gradients by the mini-batch size. That way, you do not necessarily have to change the learning rate (LR) whenever you change the mini-batch size. To obtain an appropriate LR, using the validation set is an effective way. A typical value of the LR at the beginning of training is 0.1. In practice, if you see that you have stopped making progress on the validation set, divide the LR by 2 (or by 5) and keep going, which might give you a surprise.

- **Fine-tune on pre-trained models**. Nowadays, many state-of-the-art deep networks are released by famous research groups, e.g., Caffe Model Zoo and VGG Group. Thanks to the wonderful generalization abilities of pre-trained deep models, you can employ these pre-trained models for your own applications directly. To further improve the classification performance on your data set, a very simple yet effective approach is to fine-tune the pre-trained models on your own data. As shown in the following table, the two most important factors are the size of the new data set (small or big) and its similarity to the original data set. Different strategies of fine-tuning can be used in different situations. For instance, a good case is that your new data set is very similar to the data used for training the pre-trained models. In that case, if you have very little data, you can just train a linear classifier on the features extracted from the top layers of the pre-trained models. If you have quite a lot of data at hand, fine-tune a few top layers of the pre-trained models with a small learning rate. If your own data set is quite different from the data used for the pre-trained models but has enough training images, a large number of layers should be fine-tuned on your data, also with a small learning rate. However, if your data set not only contains little data but is also very different from the data used for the pre-trained models, you will be in trouble. Since the data is limited, it seems better to only train a linear classifier. Since the data set is very different, it might not be best to train the classifier on the top of the network, which contains more dataset-specific features. Instead, it might work better to train a linear classifier (e.g., an SVM) on activations/features from somewhere earlier in the network.

| | very similar dataset | very different dataset |
| - | - | - |
| **very little data** | Use a linear classifier on the top layer | You're in trouble... Try a linear classifier from different stages |
| **quite a lot of data** | Fine-tune a few layers | Fine-tune a large number of layers |

Table: Fine-tuning pre-trained models on your data. Different strategies of fine-tuning are used in different situations. As for data sets, Caltech-101 is similar to ImageNet, since both are object-centric image data sets, while the Places database is different from ImageNet, since one is scene-centric and the other is object-centric.
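Returning to the first two tips in the list above, the arithmetic behind filter/pooling sizes and the "divide the LR when validation stalls" rule can be sketched as follows. Both helpers are hypothetical; in particular, the `patience` parameter (how many epochs without improvement count as having stopped making progress) is an assumption that the text does not specify.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    span = input_size + 2 * padding - filter_size
    if span < 0 or span % stride != 0:
        raise ValueError("filter does not tile the padded input evenly")
    return span // stride + 1


def decay_on_plateau(val_errors, initial_lr=0.1, factor=2.0, patience=2):
    """Return the LR used in each epoch, dividing it by `factor` whenever the
    validation error has not improved for `patience` consecutive epochs."""
    lr, best, stale, schedule = initial_lr, float("inf"), 0, []
    for err in val_errors:
        schedule.append(lr)
        if err < best:
            best, stale = err, 0
        else:
            stale += 1
            if stale >= patience:
                lr, stale = lr / factor, 0
    return schedule


same_size = conv_output_size(32, filter_size=3, stride=1, padding=1)  # 3x3 conv keeps the size
halved = conv_output_size(32, filter_size=2, stride=2)                # 2x2 pooling halves it
schedule = decay_on_plateau([0.5, 0.4, 0.4, 0.4, 0.39])
```

Input sizes that are divisible by 2 many times keep `conv_output_size` free of the uneven-tiling error through several rounds of 2x2 pooling.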