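# Regularization
Earlier we covered data augmentation and dropout. In practice, many current networks do not use dropout and rely instead on another technique, called regularization.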

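Regularization is a method that comes from machine learning. It has L1 and L2 variants, and L2 regularization is the more widely used of the two today. Introducing regularization amounts to adding one extra term to the loss function, for example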

$$
f = loss + \lambda \sum_{p \in params} ||p||_2^2
$$

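That is, the squared L2 norm of the parameters is added to the loss as a regularizer. When training the network we minimize not only the loss but also the squared L2 norm of the parameters, which constrains the parameters and keeps them from growing too large.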
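To make the formula concrete, here is a minimal plain-Python sketch that evaluates $f$ for a toy set of parameters. The names `l2_penalty` and `regularized_objective` and all the numbers are assumptions chosen for illustration; they do not come from the CIFAR-10 model used later.

```python
# Minimal sketch: total objective f = loss + lam * (sum of squared parameter entries).
# `params`, `data_loss` and `lam` are illustrative values, not taken from a real model.

def l2_penalty(params):
    """Sum of squared entries over every parameter vector (sum of squared L2 norms)."""
    return sum(v * v for p in params for v in p)

def regularized_objective(data_loss, params, lam):
    """Return data_loss + lam * sum_p ||p||_2^2."""
    return data_loss + lam * l2_penalty(params)

params = [[3.0, -4.0], [1.0, 2.0, 2.0]]  # two toy parameter vectors
data_loss = 0.5
lam = 1e-2

penalty = l2_penalty(params)             # 9 + 16 + 1 + 4 + 4
f = regularized_objective(data_loss, params, lam)
print(penalty, f)
```

Larger parameter values raise `penalty` quadratically, so the optimizer can only keep them large if doing so lowers the data loss by more than the penalty costs.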

Differentiating the new objective $f$ in order to run gradient descent gives

$$
\frac{\partial f}{\partial p_j} = \frac{\partial loss}{\partial p_j} + 2 \lambda p_j
$$
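The gradient formula can be sanity-checked numerically. The sketch below compares the analytic expression with a central finite difference on a toy quadratic loss; the loss function, the target values, and the helper names are all assumptions made for this illustration.

```python
# Sketch: check d f / d p_j = d loss / d p_j + 2 * lam * p_j with central differences.
# The quadratic toy loss and every value below are assumptions for illustration.

TARGET = [1.0, -2.0, 0.5]

def loss(p):
    # Toy data loss: sum_j (p_j - t_j)^2 for a fixed target t.
    return sum((pj - tj) ** 2 for pj, tj in zip(p, TARGET))

def loss_grad(p):
    # Analytic gradient of the toy data loss.
    return [2.0 * (pj - tj) for pj, tj in zip(p, TARGET)]

def f(p, lam):
    # Regularized objective: loss + lam * ||p||_2^2.
    return loss(p) + lam * sum(pj * pj for pj in p)

def analytic_grad(p, lam):
    # Data-loss gradient plus the regularizer's contribution 2 * lam * p_j.
    return [g + 2.0 * lam * pj for g, pj in zip(loss_grad(p), p)]

def numeric_grad(p, lam, eps=1e-6):
    # Central difference in each coordinate.
    grads = []
    for j in range(len(p)):
        up = list(p)
        up[j] += eps
        down = list(p)
        down[j] -= eps
        grads.append((f(up, lam) - f(down, lam)) / (2.0 * eps))
    return grads

p = [0.3, 1.5, -0.7]
lam = 0.1
max_gap = max(abs(a - n) for a, n in zip(analytic_grad(p, lam), numeric_grad(p, lam)))
print(max_gap)  # should be tiny: only floating-point rounding remains
```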

so the parameter update becomes

$$
p_j \rightarrow p_j - \eta (\frac{\partial loss}{\partial p_j} + 2 \lambda p_j) = p_j - \eta \frac{\partial loss}{\partial p_j} - 2 \eta \lambda p_j 
$$


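Here $p_j - \eta \frac{\partial loss}{\partial p_j}$ is exactly the update without the regularizer, and the trailing $2\eta \lambda p_j$ is the regularizer's contribution: at every step each parameter is additionally pulled toward zero by an amount proportional to its own value. This is why the technique is also called weight decay.

In `tf-slim`, the regularizer can be set up with `slim.arg_scope` and `slim.regularizers`. Convolutional and fully connected layers both accept a `weights_regularizer` argument, so `slim.arg_scope([slim.conv2d, slim.fully_connected], weights_regularizer=slim.regularizers.l2_regularizer(0.0001))` attaches an `L2` penalty to the weights of every convolutional and fully connected layer built inside the scope.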
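The decay effect of the $2\eta \lambda p_j$ term is easiest to see when the data gradient is zero: each step then multiplies the parameter by $(1 - 2\eta\lambda)$. A minimal sketch, with `eta`, `lam`, and the starting value chosen purely for illustration:

```python
# Sketch: one SGD step with the L2 term included.
# With a zero data gradient, each step multiplies p by (1 - 2 * eta * lam),
# which is the "decay" in weight decay. All values are illustrative assumptions.

def sgd_step(p, data_grad, eta, lam):
    """p <- p - eta * (data_grad + 2 * lam * p)."""
    return p - eta * (data_grad + 2.0 * lam * p)

eta, lam = 0.1, 0.5
p = 1.0
history = [p]
for _ in range(3):
    p = sgd_step(p, data_grad=0.0, eta=eta, lam=lam)
    history.append(p)

shrink = 1.0 - 2.0 * eta * lam  # per-step multiplicative factor
print(shrink, history)
```

With realistic values such as `eta = 0.01` and `lam = 1e-4` the per-step factor is very close to 1, so the decay only matters because it is applied over many thousands of steps.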

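Note that the size of the regularization coefficient matters a great deal. If it is too large, it constrains the parameters so strongly that the model underfits; if it is too small, the regularizer contributes almost nothing. A suitable weight decay coefficient therefore has to be found by experiment on the problem at hand; `1e-4` or `1e-3` are reasonable values for a first attempt.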
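The trade-off can be seen on a problem small enough to solve by hand. The sketch below fits a one-parameter least-squares model with an L2 penalty, using the closed-form minimizer $w = \sum_i x_i y_i / (\sum_i x_i^2 + \lambda)$. The data (generated with a true slope of 2) and the list of coefficients are assumptions for illustration only.

```python
# Sketch: 1-D least squares with an L2 penalty, minimizing
#   sum_i (w * x_i - y_i)^2 + lam * w^2,
# whose closed-form minimizer is w = sum(x * y) / (sum(x * x) + lam).
# The data (true slope 2) and the lam values are illustrative assumptions.

def ridge_slope(xs, ys, lam):
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x for x in xs]

for lam in (0.0, 1e-4, 1e-3, 1e3):
    # Small lam leaves the slope near 2; a very large lam drives it toward 0.
    print(lam, ridge_slope(xs, ys, lam))
```

A tiny coefficient leaves the fitted slope essentially unchanged, while a very large one forces it toward zero regardless of the data, which is the underfitting case described above.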

Below we add a regularizer to CIFAR-10 training.

In [1]:
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import

import tensorflow as tf
import tensorflow.contrib.slim as slim
from utils import cifar10_input
from utils.resnet import resnet



In [2]:
train_imgs, train_labels, val_imgs, val_labels = cifar10_input.load_data(image_size=96)

Add a default weight decay to every `slim.conv2d` and `slim.fully_connected` layer, defined in one place with `slim.arg_scope`.

In [3]:
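# Every conv layer gets ReLU and batch norm by default.
with slim.arg_scope([slim.conv2d], activation_fn=tf.nn.relu, normalizer_fn=slim.batch_norm):
    # Conv and fully connected weights both get an L2 regularizer with scale 1e-4.
    with slim.arg_scope([slim.conv2d, slim.fully_connected], weights_regularizer=slim.regularizers.l2_regularizer(1e-4)) as sc:
        conv_scope = sc  # keep the combined scope so it can be re-entered when building the network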
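
One detail about the coefficient: in TensorFlow 1.x, `l2_regularizer(scale)` returns `scale * tf.nn.l2_loss(weights)`, and `tf.nn.l2_loss` computes `sum(w ** 2) / 2`. A scale of `1e-4` therefore corresponds to $\lambda = 5 \times 10^{-5}$ in the formula for $f$ earlier in this section. The plain-Python sketch below only mirrors that arithmetic; it is not the library implementation, and `weights` is a made-up example.

```python
# Plain-Python sketch of the arithmetic only, not the library implementation.
# `weights` is a made-up example tensor, flattened to a list.

def l2_loss(weights):
    """Mirrors tf.nn.l2_loss: sum(w ** 2) / 2."""
    return sum(w * w for w in weights) / 2.0

def l2_regularizer(scale):
    """Return a function that computes scale * l2_loss(weights)."""
    def regularizer(weights):
        return scale * l2_loss(weights)
    return regularizer

weights = [3.0, -4.0]
penalty = l2_regularizer(1e-4)(weights)  # 1e-4 * (9 + 16) / 2
effective_lambda = 1e-4 / 2.0            # coefficient in front of sum(w ** 2)
print(penalty, effective_lambda)
```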

In [4]:
is_training = tf.placeholder(tf.bool, name='is_training')

with slim.arg_scope(conv_scope):
    train_out = resnet(train_imgs, 10, is_training=is_training, verbose=True)
    val_out = resnet(val_imgs, 10, is_training=is_training, reuse=True)

input: (64, 96, 96, 3)
block1: (64, 48, 48, 32)
block2: (64, 24, 24, 128)
block3: (64, 12, 12, 256)
block4: (64, 6, 6, 512)
classification: (64, 10)


In [5]:
with tf.variable_scope('loss'):
    train_loss = tf.losses.sparse_softmax_cross_entropy(labels=train_labels, logits=train_out, scope='train')
    val_loss = tf.losses.sparse_softmax_cross_entropy(labels=val_labels, logits=val_out, scope='val')

In [6]:
with tf.name_scope('accuracy'):
    with tf.name_scope('train'):
        train_acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(train_out, axis=-1, output_type=tf.int32), train_labels), tf.float32))
    with tf.name_scope('val'):
        val_acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(val_out, axis=-1, output_type=tf.int32), val_labels), tf.float32))

In [7]:
lr = 0.01

opt = tf.train.MomentumOptimizer(lr, momentum=0.9)

In [8]:
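update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
# The cross-entropy loss alone does not contain the L2 terms: slim stores them in the
# REGULARIZATION_LOSSES collection, so they must be added to the loss being minimized.
reg_loss = tf.losses.get_regularization_loss()
with tf.control_dependencies(update_ops):
    train_op = opt.minimize(train_loss + reg_loss)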

In [9]:
from utils.learning import train_with_bn

In [10]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())

_ = train_with_bn(sess, train_op, train_loss, train_acc, val_loss, val_acc, 20000, is_training)

sess.close()

[train]: step 0 loss = 2.3034 acc = 0.1094 (0.0171 / batch)
[val]: step 0 loss = 2.3041 acc = 0.0781
[train]: step 1000 loss = 1.5135 acc = 0.4688 (0.0841 / batch)
[train]: step 2000 loss = 1.0731 acc = 0.6875 (0.0834 / batch)
[train]: step 3000 loss = 1.3064 acc = 0.6406 (0.0834 / batch)
[train]: step 4000 loss = 0.7001 acc = 0.7656 (0.0835 / batch)
[val]: step 4000 loss = 1.2375 acc = 0.5938
[train]: step 5000 loss = 0.9213 acc = 0.7188 (0.0836 / batch)
[train]: step 6000 loss = 0.4007 acc = 0.8438 (0.0841 / batch)
[train]: step 7000 loss = 0.8119 acc = 0.7500 (0.0844 / batch)
[train]: step 8000 loss = 0.6393 acc = 0.8438 (0.0821 / batch)
[val]: step 8000 loss = 1.4887 acc = 0.6719
[train]: step 9000 loss = 0.2326 acc = 0.8906 (0.0846 / batch)
[train]: step 10000 loss = 0.5696 acc = 0.8594 (0.0846 / batch)
[train]: step 11000 loss = 0.6103 acc = 0.8125 (0.0849 / batch)
[train]: step 12000 loss = 0.1505 acc = 0.9375 (0.0852 / batch)
[val]: step 12000 loss = 0.8726 acc = 0.6875
[train]