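# Batch Normalization
Before we move on to building and training models, we first cover data preprocessing and batch normalization. Training is not easy, and very complex models in particular often fail to converge well. Preprocessing the data and applying batch normalization usually make convergence much easier, which is one important reason convolutional networks can be trained at great depth.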

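## Data Preprocessing
The most common preprocessing methods today are centering and standardization. Centering shifts the data so that it is centered at the origin: subtract the corresponding mean from each feature dimension, which yields zero-mean features. Standardization is just as simple: once the data has zero mean, bring the feature dimensions to the same scale, either by dividing each one by its standard deviation so that it has unit variance, or by using its maximum and minimum to rescale it into the range -1 to 1. A simple illustration follows.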

![](https://ws1.sinaimg.cn/large/006tKfTcly1fmqouzer3xj30ij06n0t8.jpg)

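Both methods are very common. As you may recall, we already used them to standardize the data in the earlier neural network section. Other methods, such as PCA or whitening, are rarely used nowadays.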
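
As a minimal sketch of centering and standardization, the NumPy snippet below applies both to a small toy array. The data values and variable names are made up for illustration and do not come from the earlier section.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

# Centering: subtract the per-feature mean.
X_centered = X - X.mean(axis=0)

# Standardization: divide the centered data by the per-feature standard deviation.
X_standardized = X_centered / X.std(axis=0)

# Alternative: rescale each feature into [-1, 1] using its minimum and maximum.
x_min, x_max = X.min(axis=0), X.max(axis=0)
X_scaled = 2.0 * (X - x_min) / (x_max - x_min) - 1.0

print(X_standardized.mean(axis=0), X_standardized.std(axis=0))
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

After either transformation the two features live on the same scale even though the raw values differ by a factor of 100.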

## Batch Normalization
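In data preprocessing we try to make the input features uncorrelated and roughly standard normal, which generally helps the model perform well. In a very deep network, however, the nonlinear layers make the outputs correlated again, so they no longer follow a standard N(0, 1) distribution, and the center of the outputs may even shift. This makes training difficult, especially for deep models.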

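To address this, a 2015 paper proposed batch normalization. In short, the output of each layer is normalized to zero mean and unit variance, so the next layer also receives a normalized input. This makes training easier and speeds up convergence.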

Batch normalization is simple to implement. For a given batch of data $B = \{x_1, x_2, \cdots, x_m\}$, the algorithm is

$$
\mu_B = \frac{1}{m} \sum_{i=1}^m x_i
$$
$$
\sigma^2_B = \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2
$$
$$
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma^2_B + \epsilon}}
$$
$$
y_i = \gamma \hat{x}_i + \beta
$$

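The first two formulas compute the mean and variance of the data in a batch, and the third formula standardizes every data point in the batch. $\epsilon$ is a small constant introduced for numerical stability, typically $10^{-5}$. Finally, the learnable scale $\gamma$ and shift $\beta$ are applied to produce the output. Below we implement the simple one-dimensional case, which is the case that arises in fully connected neural networks.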

In [1]:
from __future__ import print_function
from __future__ import division
from __future__ import absolute_import

import tensorflow as tf


In [2]:
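def simple_batch_norm_1d(x, gamma, beta):
    eps = 1e-5  # small constant for numerical stability
    # per-feature mean and variance over the batch axis
    x_mean = tf.reduce_mean(x, axis=0, keepdims=True)
    x_var = tf.reduce_mean(tf.square(x - x_mean), axis=0, keepdims=True)
    # standardize, then apply scale and shift
    x_hat = (x - x_mean) / tf.sqrt(x_var + eps)
    return gamma * x_hat + beta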
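
It may help to see why these formulas should produce zero mean and roughly unit variance. The short derivation below uses only the symbols already defined in the batch normalization formulas. Averaging $\hat{x}_i$ and $\hat{x}_i^2$ over the batch gives

$$
\frac{1}{m} \sum_{i=1}^m \hat{x}_i = \frac{1}{\sqrt{\sigma^2_B + \epsilon}} \left( \frac{1}{m} \sum_{i=1}^m x_i - \mu_B \right) = 0
$$
$$
\frac{1}{m} \sum_{i=1}^m \hat{x}_i^2 = \frac{1}{\sigma^2_B + \epsilon} \cdot \frac{1}{m} \sum_{i=1}^m (x_i - \mu_B)^2 = \frac{\sigma^2_B}{\sigma^2_B + \epsilon} \approx 1
$$

So within a batch $\hat{x}_i$ has mean exactly 0 and variance slightly below 1 because of $\epsilon$, and $y_i = \gamma \hat{x}_i + \beta$ has mean $\beta$ and variance $\gamma^2 \sigma^2_B / (\sigma^2_B + \epsilon)$.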

Let's check whether the output is standardized for an arbitrary input

In [3]:
sess = tf.InteractiveSession()

In [4]:
x = tf.reshape(tf.range(15, dtype=tf.float32), (5, 3))
gamma = tf.ones(x.get_shape()[1])
beta = tf.zeros(x.get_shape()[-1])
print('before bn:')
print(sess.run(x))
y = simple_batch_norm_1d(x, gamma, beta)
print('after bn:')
print(sess.run(y))

before bn:
[[ 0.  1.  2.]
 [ 3.  4.  5.]
 [ 6.  7.  8.]
 [ 9. 10. 11.]
 [12. 13. 14.]]
after bn:
[[-1.4142132 -1.4142132 -1.4142132]
 [-0.7071066 -0.7071066 -0.7071066]
 [ 0.         0.         0.       ]
 [ 0.7071066  0.7071066  0.7071066]
 [ 1.4142132  1.4142132  1.4142132]]


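Here there are 5 data points with three features; each column holds the values of one feature across the data points. After batch normalization, every column has zero mean and unit variance.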
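
The claim about the columns can be checked directly. The snippet below is a NumPy re-implementation of the same formula (the name `batch_norm_np` is made up here), applied to the same 5 x 3 input, followed by a print of the per-column mean and standard deviation. It is a sketch for verification, not a replacement for the TensorFlow version.

```python
import numpy as np

def batch_norm_np(x, gamma, beta, eps=1e-5):
    """NumPy counterpart of simple_batch_norm_1d; statistics are taken over axis 0 (the batch)."""
    mean = x.mean(axis=0, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=0, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Same input as above: 5 data points, 3 features.
x = np.arange(15, dtype=np.float32).reshape(5, 3)
y = batch_norm_np(x, gamma=np.ones(3, dtype=np.float32), beta=np.zeros(3, dtype=np.float32))

print(y.mean(axis=0))  # per-column mean, close to 0
print(y.std(axis=0))   # per-column standard deviation, close to 1
```

The standard deviation comes out marginally below 1 because `eps` is added to the variance before the square root.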

This raises a question: should batch normalization also be used at test time?

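The answer is yes: since it is used during training, leaving it out at test time would certainly bias the results. But if the test batch contains only a single data point, its mean is the data point itself and its variance is 0, which is clearly unreasonable. So at test time we do not compute the mean and variance from the test data; instead we use the moving averages of the mean and variance accumulated during training.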
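
The single-sample problem is easy to reproduce. In the NumPy sketch below, the sample and the stored training statistics are made-up values chosen for illustration: normalizing one sample with its own statistics erases it, while normalizing with stored statistics does not.

```python
import numpy as np

eps = 1e-5
x_single = np.array([[0.3, -1.2, 5.0]])  # hypothetical test "batch" with one sample

# Statistics computed from the single sample itself.
mean = x_single.mean(axis=0)
var = x_single.var(axis=0)
self_normalized = (x_single - mean) / np.sqrt(var + eps)
print(var)              # all zeros: one sample has no spread
print(self_normalized)  # all zeros: the input information is lost

# Statistics carried over from training (hypothetical values).
moving_mean = np.array([0.1, -0.8, 4.0])
moving_var = np.array([1.5, 0.9, 2.0])
x_hat = (x_single - moving_mean) / np.sqrt(moving_var + eps)
print(x_hat)            # depends on the sample, as it should
```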

Below we implement a batch normalization function that distinguishes between training and test mode

In [5]:
def batch_norm_1d(x, is_training, decay=0.1, scope='bn'):
    eps = 1e-5

    with tf.variable_scope(scope):
        # running statistics: updated by assignment, not by the optimizer
        moving_mean = tf.get_variable('moving_mean', initializer=tf.zeros_initializer(), shape=x.get_shape()[-1:], dtype=tf.float32, trainable=False)
        moving_var = tf.get_variable('moving_var', initializer=tf.zeros_initializer(), shape=x.get_shape()[-1:], dtype=tf.float32, trainable=False)

        tf.add_to_collection('moving_mean', moving_mean)
        tf.add_to_collection('moving_var', moving_var)

        # learnable scale and shift
        gamma = tf.get_variable('gamma', initializer=tf.random_normal_initializer(), shape=x.get_shape()[-1:])
        beta = tf.get_variable('beta', initializer=tf.random_normal_initializer(), shape=x.get_shape()[-1:])

        def batch_norm_train():
            # training: normalize with the statistics of the current batch
            x_mean, x_var = tf.nn.moments(x, axes=[0])
            x_hat = (x - x_mean) / tf.sqrt(x_var + eps)
            # moving average: keep `decay` of the old value, take `1 - decay` from the batch
            update_moving_mean = moving_mean.assign(decay * moving_mean + (1 - decay) * x_mean)
            update_moving_var = moving_var.assign(decay * moving_var + (1 - decay) * x_var)
            update_moving_op = tf.group(update_moving_mean, update_moving_var)
            # register the update so the training op can depend on it
            tf.add_to_collection(tf.GraphKeys.UPDATE_OPS, update_moving_op)

            return x_hat

        def batch_norm_test():
            # testing: normalize with the running statistics
            x_hat = (x - moving_mean) / tf.sqrt(moving_var + eps)

            return x_hat

        # is_training is a boolean tensor, so the branch is chosen at run time
        x_hat = tf.cond(is_training, batch_norm_train, batch_norm_test)

        return gamma * x_hat + beta
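
To make the moving-average rule easier to follow outside a TensorFlow graph, here is a NumPy sketch of the same train/test switch. The class name `BatchNorm1dNP` and the random data are assumptions made for illustration, and `gamma`/`beta` are left out to keep it short. Note that with `decay=0.1` the newest batch receives weight 0.9, so the running statistics follow recent batches closely.

```python
import numpy as np

class BatchNorm1dNP:
    """NumPy sketch of the train/test logic in batch_norm_1d, without gamma and beta."""

    def __init__(self, num_features, decay=0.1, eps=1e-5):
        self.decay = decay
        self.eps = eps
        # Both running statistics start at zero, as in the TensorFlow version.
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.zeros(num_features)

    def __call__(self, x, is_training):
        if is_training:
            mean, var = x.mean(axis=0), x.var(axis=0)
            # Same rule as the assign ops: keep `decay` of the old value.
            self.moving_mean = self.decay * self.moving_mean + (1 - self.decay) * mean
            self.moving_var = self.decay * self.moving_var + (1 - self.decay) * var
        else:
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.eps)

rng = np.random.RandomState(0)
bn = BatchNorm1dNP(num_features=2)
for _ in range(20):
    batch = rng.normal(loc=3.0, scale=2.0, size=(64, 2))  # hypothetical training data
    bn(batch, is_training=True)

print(bn.moving_mean)  # roughly 3 per feature
print(bn.moving_var)   # roughly 4 per feature
single = bn(np.array([[3.0, 3.0]]), is_training=False)
print(single)          # a single sample can be normalized at test time
```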

In [6]:
import tensorflow.examples.tutorials.mnist.input_data as input_data
import numpy as np

In [7]:
mnist = input_data.read_data_sets('MNIST_data', one_hot=True)

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [8]:
train_set = mnist.train
test_set = mnist.test

In [9]:
# define the placeholders
input_ph = tf.placeholder(shape=(None, 784), dtype=tf.float32)
label_ph = tf.placeholder(shape=(None, 10), dtype=tf.int64)

In [10]:
from utils.layers import fc

In [11]:
def DNN(inputs, is_training=True, scope='dnn', reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        net = fc(inputs, 100, act=tf.identity, scope='fc1')
        net = batch_norm_1d(net, is_training, scope='bn')
        net = tf.nn.relu(net, name='activation')
        net = fc(net, 10, act=tf.identity, scope='classification')
        
        return net

In [12]:
is_training_ph = tf.placeholder(tf.bool, name='is_training_ph')
dnn = DNN(input_ph, is_training=is_training_ph)

In [13]:
loss = tf.losses.softmax_cross_entropy(logits=dnn, onehot_labels=label_ph)

acc = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(dnn, axis=-1), tf.argmax(label_ph, axis=-1)), dtype=tf.float32))

lr = 0.1
optimizer = tf.train.GradientDescentOptimizer(learning_rate=lr)

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op = optimizer.minimize(loss)

In [14]:
batch_size = 64

sess.run(tf.global_variables_initializer())

for e in range(10000):
    images, labels = train_set.next_batch(batch_size)
    sess.run(train_op, feed_dict={input_ph: images, label_ph: labels, is_training_ph: True})
    if e % 1000 == 999:
        test_imgs, test_labels = test_set.next_batch(batch_size)
        loss_train, acc_train = sess.run([loss, acc], feed_dict={input_ph: images, label_ph: labels, is_training_ph: False})
        loss_test, acc_test = sess.run([loss, acc], feed_dict={input_ph: test_imgs, label_ph: test_labels, is_training_ph: False})
        print('STEP {}: train_loss: {:.6f} train_acc: {:.6f} test_loss: {:.6f} test_acc: {:.6f}'.format(e + 1, loss_train, acc_train, loss_test, acc_test))

print('Train Done!')
print('-'*30)

# compute the loss and accuracy over all training samples
train_loss = []
train_acc = []
for _ in range(train_set.num_examples // 100):
    image, label = train_set.next_batch(100)
    loss_train, acc_train = sess.run([loss, acc], feed_dict={input_ph: image, label_ph: label, is_training_ph: False})
    train_loss.append(loss_train)
    train_acc.append(acc_train)

print('Train loss: {:.6f}'.format(np.array(train_loss).mean()))
print('Train accuracy: {:.6f}'.format(np.array(train_acc).mean()))

# compute the loss and accuracy over all test samples
test_loss = []
test_acc = []
for _ in range(test_set.num_examples // 100):
    image, label = test_set.next_batch(100)
    loss_test, acc_test = sess.run([loss, acc], feed_dict={input_ph: image, label_ph: label, is_training_ph: False})
    test_loss.append(loss_test)
    test_acc.append(acc_test)

print('Test loss: {:.6f}'.format(np.array(test_loss).mean()))
print('Test accuracy: {:.6f}'.format(np.array(test_acc).mean()))

STEP 1000: train_loss: 0.135436 train_acc: 0.984375 test_loss: 0.149804 test_acc: 0.953125
STEP 2000: train_loss: 0.147314 train_acc: 0.937500 test_loss: 0.253495 test_acc: 0.906250
STEP 3000: train_loss: 0.079757 train_acc: 0.968750 test_loss: 0.041458 test_acc: 0.984375
STEP 4000: train_loss: 0.109295 train_acc: 0.984375 test_loss: 0.105002 test_acc: 0.968750
STEP 5000: train_loss: 0.030301 train_acc: 1.000000 test_loss: 0.072805 test_acc: 0.984375
STEP 6000: train_loss: 0.062770 train_acc: 0.968750 test_loss: 0.219858 test_acc: 0.937500
STEP 7000: train_loss: 0.032305 train_acc: 1.000000 test_loss: 0.245801 test_acc: 0.953125
STEP 8000: train_loss: 0.028475 train_acc: 1.000000 test_loss: 0.095050 test_acc: 0.984375
STEP 9000: train_loss: 0.015281 train_acc: 1.000000 test_loss: 0.135433 test_acc: 0.968750
STEP 10000: train_loss: 0.013734 train_acc: 1.000000 test_loss: 0.155537 test_acc: 0.953125
Train Done!
------------------------------
Train loss: 0.051219
Train accuracy: 0.985345


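Here $\gamma$ and $\beta$ are trained as parameters and initialized from a random Gaussian distribution. `moving_mean` and `moving_var` are both initialized to 0 and are not updated by the optimizer. After 10000 training steps, we can look at what the moving statistics have become, here the first 10 entries of the moving mean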

In [15]:
moving_mean = tf.get_collection('moving_mean')[0]

In [16]:
print(sess.run(moving_mean)[:10])

[ 0.7441317   0.19530469  1.0170597  -0.10595255 -0.13077882  0.6251181
 -0.00693776  0.41463417 -0.04592047  0.14478667]


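As we can see, these values were modified during training. At test time we no longer need to compute the mean and variance; we use the moving mean and moving variance directly

For comparison, let's look at the result without batch normalization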

In [17]:
def DNN2(inputs, scope='dnn2', reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        net = fc(inputs, 100, act=tf.nn.relu, scope='fc1')
        net = fc(net, 10, act=tf.identity, scope='classification')
        
        return net

In [18]:
dnn2 = DNN2(input_ph)

In [19]:
loss2 = tf.losses.softmax_cross_entropy(logits=dnn2, onehot_labels=label_ph)

acc2 = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(dnn2, axis=-1), tf.argmax(label_ph, axis=-1)), dtype=tf.float32))

train_op2 = optimizer.minimize(loss2)

In [20]:
batch_size = 64

sess.run(tf.global_variables_initializer())

for e in range(10000):
    images, labels = train_set.next_batch(batch_size)
    sess.run(train_op2, feed_dict={input_ph: images, label_ph: labels})
    if e % 1000 == 999:
        test_imgs, test_labels = test_set.next_batch(batch_size)
        loss_train, acc_train = sess.run([loss2, acc2], feed_dict={input_ph: images, label_ph: labels})
        loss_test, acc_test = sess.run([loss2, acc2], feed_dict={input_ph: test_imgs, label_ph: test_labels})
        print('STEP {}: train_loss: {:.6f} train_acc: {:.6f} test_loss: {:.6f} test_acc: {:.6f}'.format(e + 1, loss_train, acc_train, loss_test, acc_test))

print('Train Done!')
print('-'*30)

# compute the loss and accuracy over all training samples
train_loss = []
train_acc = []
for _ in range(train_set.num_examples // 100):
    image, label = train_set.next_batch(100)
    loss_train, acc_train = sess.run([loss2, acc2], feed_dict={input_ph: image, label_ph: label})
    train_loss.append(loss_train)
    train_acc.append(acc_train)

print('Train loss: {:.6f}'.format(np.array(train_loss).mean()))
print('Train accuracy: {:.6f}'.format(np.array(train_acc).mean()))

# compute the loss and accuracy over all test samples
test_loss = []
test_acc = []
for _ in range(test_set.num_examples // 100):
    image, label = test_set.next_batch(100)
    loss_test, acc_test = sess.run([loss2, acc2], feed_dict={input_ph: image, label_ph: label})
    test_loss.append(loss_test)
    test_acc.append(acc_test)

print('Test loss: {:.6f}'.format(np.array(test_loss).mean()))
print('Test accuracy: {:.6f}'.format(np.array(test_acc).mean()))

STEP 1000: train_loss: 0.163043 train_acc: 0.968750 test_loss: 0.343354 test_acc: 0.937500
STEP 2000: train_loss: 0.171674 train_acc: 0.968750 test_loss: 0.229045 test_acc: 0.953125
STEP 3000: train_loss: 0.034481 train_acc: 1.000000 test_loss: 0.049881 test_acc: 1.000000
STEP 4000: train_loss: 0.048298 train_acc: 0.984375 test_loss: 0.301027 test_acc: 0.921875
STEP 5000: train_loss: 0.049916 train_acc: 1.000000 test_loss: 0.128424 test_acc: 0.968750
STEP 6000: train_loss: 0.057422 train_acc: 0.984375 test_loss: 0.046348 test_acc: 0.984375
STEP 7000: train_loss: 0.046987 train_acc: 1.000000 test_loss: 0.037427 test_acc: 0.984375
STEP 8000: train_loss: 0.006977 train_acc: 1.000000 test_loss: 0.038861 test_acc: 0.984375
STEP 9000: train_loss: 0.062435 train_acc: 0.984375 test_loss: 0.030002 test_acc: 1.000000
STEP 10000: train_loss: 0.015280 train_acc: 1.000000 test_loss: 0.021770 test_acc: 1.000000
Train Done!
------------------------------
Train loss: 0.053095
Train accuracy: 0.986073


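Although the final results of the two cases are about the same, the early steps suggest that the network with batch normalization converges faster (for example, at step 1000 the test loss is 0.150 with batch normalization versus 0.343 without, although each of these numbers is measured on a single mini-batch of 64 samples and is therefore noisy). Because this is only a small network, it converges with or without batch normalization, but for deeper networks batch normalization can make training converge much faster.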

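Above we implemented batch normalization ourselves for 2-D inputs. The 4-D case that arises with convolutions is similar: the mean and variance are computed separately for each channel, over the batch and spatial dimensions. Implementing batch normalization by hand is tedious, though. `tensorflow.contrib` has a built-in batch normalization function, `tf.contrib.layers.batch_norm`, and `tf-slim` provides `slim.batch_norm`; their interfaces are also very simple.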

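In addition, the convolution layer `slim.conv2d` takes a parameter for an output normalization function, `normalizer_fn=None`, which defaults to none. We can use `slim.arg_scope` to attach a batch normalization function to every convolution layer.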
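
For the 4-D convolutional case mentioned above, the only change from the 2-D version is the set of axes that are reduced. The NumPy sketch below assumes the `(batch, height, width, channels)` layout used by the convolution layers in this section; the function name `batch_norm_nhwc` and the random feature maps are made up for illustration, and only the training-mode computation is shown.

```python
import numpy as np

def batch_norm_nhwc(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for a (batch, height, width, channels) array."""
    # One mean and one variance per channel: reduce over batch, height and width.
    mean = x.mean(axis=(0, 1, 2), keepdims=True)
    var = x.var(axis=(0, 1, 2), keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

rng = np.random.RandomState(0)
x = rng.normal(loc=5.0, scale=3.0, size=(8, 28, 28, 6))  # hypothetical feature maps
y = batch_norm_nhwc(x, gamma=np.ones(6), beta=np.zeros(6))

print(y.mean(axis=(0, 1, 2)))  # close to 0 for each of the 6 channels
print(y.std(axis=(0, 1, 2)))   # close to 1 for each of the 6 channels
```

`gamma` and `beta` have one entry per channel and broadcast over the other three axes, which mirrors how the 1-D version has one entry per feature.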

Next we use batch normalization in a convolutional network to see its effect

In [21]:
import tensorflow.contrib.slim as slim

In [22]:
def conv_bn_net(inputs, is_training, scope='conv_bn_net', reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        with slim.arg_scope([slim.conv2d], activation_fn=None, normalizer_fn=slim.batch_norm):
            with slim.arg_scope([slim.batch_norm], is_training=is_training):
                net = slim.conv2d(inputs, 6, 3, scope='conv1')
                net = tf.nn.relu(net, name='activation1')
                net = slim.max_pool2d(net, 2, stride=2, scope='max_pool1')
                net = slim.conv2d(net, 16, 5, scope='conv2')
                net = tf.nn.relu(net, name='activation2')
                net = slim.max_pool2d(net, 2, stride=2, scope='max_pool2')
                net = slim.flatten(net, scope='flatten')
                net = slim.fully_connected(net, 10, activation_fn=None, scope='classification')

                return net

We reshape `input_ph` to `[-1, 28, 28, 1]` to fit the convolution layers

In [23]:
input_imgs = tf.reshape(input_ph, [-1, 28, 28, 1], name='reshape')

Build the output with `conv_bn_net`

In [24]:
conv_bn_y = conv_bn_net(input_imgs, is_training=is_training_ph)

In [25]:
loss3 = tf.losses.softmax_cross_entropy(logits=conv_bn_y, onehot_labels=label_ph)

acc3 = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(conv_bn_y, axis=-1), tf.argmax(label_ph, axis=-1)), dtype=tf.float32))

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_op3 = optimizer.minimize(loss3)

In [26]:
batch_size = 64

sess.run(tf.global_variables_initializer())

for e in range(5000):
    images, labels = train_set.next_batch(batch_size)
    sess.run(train_op3, feed_dict={input_ph: images, label_ph: labels, is_training_ph: True})
    if e % 1000 == 999:
        test_imgs, test_labels = test_set.next_batch(batch_size)
        loss_train, acc_train = sess.run([loss3, acc3], feed_dict={input_ph: images, label_ph: labels, is_training_ph: False})
        loss_test, acc_test = sess.run([loss3, acc3], feed_dict={input_ph: test_imgs, label_ph: test_labels, is_training_ph: False})
        print('STEP {}: train_loss: {:.6f} train_acc: {:.6f} test_loss: {:.6f} test_acc: {:.6f}'.format(e + 1, loss_train, acc_train, loss_test, acc_test))

print('Train Done!')
print('-'*30)

# compute the loss and accuracy over all training samples
train_loss = []
train_acc = []
for _ in range(train_set.num_examples // 100):
    image, label = train_set.next_batch(100)
    loss_train, acc_train = sess.run([loss3, acc3], feed_dict={input_ph: image, label_ph: label, is_training_ph: False})
    train_loss.append(loss_train)
    train_acc.append(acc_train)

print('Train loss: {:.6f}'.format(np.array(train_loss).mean()))
print('Train accuracy: {:.6f}'.format(np.array(train_acc).mean()))

# compute the loss and accuracy over all test samples
test_loss = []
test_acc = []
for _ in range(test_set.num_examples // 100):
    image, label = test_set.next_batch(100)
    loss_test, acc_test = sess.run([loss3, acc3], feed_dict={input_ph: image, label_ph: label, is_training_ph: False})
    test_loss.append(loss_test)
    test_acc.append(acc_test)

print('Test loss: {:.6f}'.format(np.array(test_loss).mean()))
print('Test accuracy: {:.6f}'.format(np.array(test_acc).mean()))

STEP 1000: train_loss: 1.165334 train_acc: 0.609375 test_loss: 1.056816 test_acc: 0.578125
STEP 2000: train_loss: 0.356737 train_acc: 0.875000 test_loss: 0.301841 test_acc: 0.890625
STEP 3000: train_loss: 0.036660 train_acc: 1.000000 test_loss: 0.059838 test_acc: 0.984375
STEP 4000: train_loss: 0.018737 train_acc: 1.000000 test_loss: 0.024945 test_acc: 0.984375
STEP 5000: train_loss: 0.006830 train_acc: 1.000000 test_loss: 0.054585 test_acc: 0.984375
Train Done!
------------------------------
Train loss: 0.034995
Train accuracy: 0.989473
Test loss: 0.042106
Test accuracy: 0.986800


For comparison, let's see how the convolutional network does without `bn` layers

In [27]:
def conv_net(inputs, scope='conv_net', reuse=None):
    with tf.variable_scope(scope, reuse=reuse):
        with slim.arg_scope([slim.conv2d], activation_fn=None, normalizer_fn=None):
            net = slim.conv2d(inputs, 6, 3, scope='conv1')
            net = tf.nn.relu(net, name='activation1')
            net = slim.max_pool2d(net, 2, stride=2, scope='max_pool1')
            net = slim.conv2d(net, 16, 5, scope='conv2')
            net = tf.nn.relu(net, name='activation2')
            net = slim.max_pool2d(net, 2, stride=2, scope='max_pool2')
            net = slim.flatten(net, scope='flatten')
            net = slim.fully_connected(net, 10, activation_fn=None, scope='classification')
            
            return net

In [28]:
conv_y = conv_net(input_imgs)

In [29]:
loss4 = tf.losses.softmax_cross_entropy(logits=conv_y, onehot_labels=label_ph)

acc4 = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(conv_y, axis=-1), tf.argmax(label_ph, axis=-1)), dtype=tf.float32))

train_op4 = optimizer.minimize(loss4)

In [30]:
batch_size = 64

sess.run(tf.global_variables_initializer())

for e in range(5000):
    images, labels = train_set.next_batch(batch_size)
    sess.run(train_op4, feed_dict={input_ph: images, label_ph: labels})
    if e % 1000 == 999:
        test_imgs, test_labels = test_set.next_batch(batch_size)
        loss_train, acc_train = sess.run([loss4, acc4], feed_dict={input_ph: images, label_ph: labels})
        loss_test, acc_test = sess.run([loss4, acc4], feed_dict={input_ph: test_imgs, label_ph: test_labels})
        print('STEP {}: train_loss: {:.6f} train_acc: {:.6f} test_loss: {:.6f} test_acc: {:.6f}'.format(e + 1, loss_train, acc_train, loss_test, acc_test))

print('Train Done!')
print('-'*30)

# compute the loss and accuracy over all training samples
train_loss = []
train_acc = []
for _ in range(train_set.num_examples // 100):
    image, label = train_set.next_batch(100)
    loss_train, acc_train = sess.run([loss4, acc4], feed_dict={input_ph: image, label_ph: label})
    train_loss.append(loss_train)
    train_acc.append(acc_train)

print('Train loss: {:.6f}'.format(np.array(train_loss).mean()))
print('Train accuracy: {:.6f}'.format(np.array(train_acc).mean()))

# compute the loss and accuracy over all test samples
test_loss = []
test_acc = []
for _ in range(test_set.num_examples // 100):
    image, label = test_set.next_batch(100)
    loss_test, acc_test = sess.run([loss4, acc4], feed_dict={input_ph: image, label_ph: label})
    test_loss.append(loss_test)
    test_acc.append(acc_test)

print('Test loss: {:.6f}'.format(np.array(test_loss).mean()))
print('Test accuracy: {:.6f}'.format(np.array(test_acc).mean()))

STEP 1000: train_loss: 0.013257 train_acc: 1.000000 test_loss: 0.162615 test_acc: 0.953125
STEP 2000: train_loss: 0.010398 train_acc: 1.000000 test_loss: 0.012174 test_acc: 1.000000
STEP 3000: train_loss: 0.010519 train_acc: 1.000000 test_loss: 0.020304 test_acc: 0.984375
STEP 4000: train_loss: 0.007355 train_acc: 1.000000 test_loss: 0.049724 test_acc: 0.984375
STEP 5000: train_loss: 0.004530 train_acc: 1.000000 test_loss: 0.039649 test_acc: 0.968750
Train Done!
------------------------------
Train loss: 0.038749
Train accuracy: 0.987036
Test loss: 0.047929
Test accuracy: 0.984900


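Later, when we introduce some well-known network architectures, we will gradually come to appreciate the importance of batch normalization. With TensorFlow, adding a batch normalization layer is very convenient.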