# Training a better model

In [1]:
import os
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"   # see issue #152
os.environ["CUDA_VISIBLE_DEVICES"] = ""

In [2]:
%matplotlib inline
import importlib  # `imp` is deprecated; importlib.reload is the supported replacement
import utils
importlib.reload(utils)
from utils import *

  from ._conv import register_converters as _register_converters
Using TensorFlow backend.


In [3]:
#path = "data/dogscats/sample/"
path = "data/dogscats/"
model_path = path + 'models/'
if not os.path.exists(model_path): 
    os.mkdir(model_path)

batch_size=8

## Are we underfitting?

Our validation accuracy so far has generally been higher than our training accuracy. That leads to two obvious questions:

1. How is this possible?
2. Is this desirable?

The answer to (1) is that this is happening because of *dropout*. Dropout refers to a layer that randomly deletes (i.e. sets to zero) each activation in the previous layer with probability *p* (generally 0.5). This only happens during training, not when calculating the accuracy on the validation set, which is why the validation set can show higher accuracy than the training set.
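
To make the train/validation asymmetry concrete, here is a minimal framework-free sketch of a dropout layer. The function name `dropout` and the toy activations are illustrative assumptions, not part of Keras. Real implementations typically also rescale activations so that their expected magnitude matches between the two modes; that step is left out here to keep the focus on when units are dropped.

```python
import random

def dropout(activations, p, training, rng):
    """Zero each activation with probability p, but only while training."""
    if not training:
        # At validation/test time the layer passes activations through untouched.
        return list(activations)
    return [0.0 if rng.random() < p else a for a in activations]

rng = random.Random(0)
acts = [0.3, 1.2, 0.7, 2.1, 0.9, 0.4]

train_out = dropout(acts, p=0.5, training=True, rng=rng)   # some entries zeroed
valid_out = dropout(acts, p=0.5, training=False, rng=rng)  # identical to acts

print(train_out)
print(valid_out)
```

Because the validation pass sees every activation while the training pass sees only a random subset, the same weights can score better on the validation set than on the training set.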

The purpose of dropout is to avoid overfitting. By deleting parts of the neural network at random during training, it ensures that no one part of the network can overfit to one part of the training set. The creation of dropout was one of the key developments in deep learning, and has allowed us to create rich models without overfitting. However, it can also result in underfitting if overused, and this is something we should be careful of with our model.

So the answer to (2) is: this is probably not desirable. It is likely that we can get better validation set results with less (or no) dropout, if we're seeing that validation accuracy is higher than training accuracy - a strong sign of underfitting. So let's try removing dropout entirely, and see what happens!

(We had dropout in this model already because the VGG authors found it necessary for the ImageNet competition. But that doesn't mean it's necessary for dogs v cats, so we will do our own analysis of regularization approaches from scratch.)

## Removing dropout

Our high level approach here will be to start with our fine-tuned cats vs dogs model (with dropout), then fine-tune all the dense layers, after removing dropout from them. The steps we will take are:
- Re-create and load our modified VGG model with binary dependent (i.e. dogs v cats)
- Split the model between the convolutional (*conv*) layers and the dense layers
- Pre-calculate the output of the conv layers, so that we don't have to redundantly re-calculate them on every epoch
- Create a new model with just the dense layers, and dropout p set to zero
- Train this new model using the output of the conv layers as training data.

As before we need to start with a working model, so let's bring in our working VGG 16 model and change it to predict our binary dependent...

In [4]:
model = vgg_ft(2)

...and load our fine-tuned weights.

In [5]:
model.load_weights(model_path+'finetune3.h5')

We're going to be training a number of iterations without dropout, so it would be best for us to pre-calculate the input to the fully connected part of the model, i.e. the output of the last convolutional layer. We'll start by finding this layer in our model, and creating a new model that contains just the layers up to and including this layer:

Find the last convolutional layer:

In [8]:
model.layers

[<keras.layers.core.Lambda at 0x2458283bac8>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245ec9d0940>,
 <keras.layers.convolutional.Conv2D at 0x245fcb37e48>,
 <keras.layers.convolutional.ZeroPadding2D at 0x2458283bc50>,
 <keras.layers.convolutional.Conv2D at 0x24582899ef0>,
 <keras.layers.pooling.MaxPooling2D at 0x24582862400>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4f356a0>,
 <keras.layers.convolutional.Conv2D at 0x245e4f48dd8>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4f5cda0>,
 <keras.layers.convolutional.Conv2D at 0x245e4f73080>,
 <keras.layers.pooling.MaxPooling2D at 0x245e4f73320>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4f84780>,
 <keras.layers.convolutional.Conv2D at 0x245e4f98d68>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4faef98>,
 <keras.layers.convolutional.Conv2D at 0x245e4faeef0>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4fc11d0>,
 <keras.layers.convolutional.Conv2D at 0x245e4fd2c50>,
 <keras.layers.pooling.Ma

In [7]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lambda_1 (Lambda)            (None, 3, 224, 224)       0         
_________________________________________________________________
zero_padding2d_1 (ZeroPaddin (None, 3, 226, 226)       0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 64, 224, 224)      1792      
_________________________________________________________________
zero_padding2d_2 (ZeroPaddin (None, 64, 226, 226)      0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 64, 224, 224)      36928     
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 64, 112, 112)      0         
_________________________________________________________________
zero_padding2d_3 (ZeroPaddin (None, 64, 114, 114)      0         
__________

In [9]:
layers = model.layers

In [10]:
for index, layer in enumerate(layers):
    if type(layer) is Conv2D:
        print("1")

1
1
1
1
1
1
1
1
1
1
1
1
1


In [11]:
[index for index,layer in enumerate(layers) if type(layer) is Conv2D]

[2, 4, 7, 9, 12, 14, 16, 19, 21, 23, 26, 28, 30]

In [12]:
# Find the last convolutional layer
last_conv_idx = [index for index,layer in enumerate(layers) if type(layer) is Conv2D][-1]

In [13]:
last_conv_idx

30

In [14]:
layers[last_conv_idx]

<keras.layers.convolutional.Conv2D at 0x245e50c44e0>

In [15]:
layers[:last_conv_idx+1]

[<keras.layers.core.Lambda at 0x2458283bac8>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245ec9d0940>,
 <keras.layers.convolutional.Conv2D at 0x245fcb37e48>,
 <keras.layers.convolutional.ZeroPadding2D at 0x2458283bc50>,
 <keras.layers.convolutional.Conv2D at 0x24582899ef0>,
 <keras.layers.pooling.MaxPooling2D at 0x24582862400>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4f356a0>,
 <keras.layers.convolutional.Conv2D at 0x245e4f48dd8>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4f5cda0>,
 <keras.layers.convolutional.Conv2D at 0x245e4f73080>,
 <keras.layers.pooling.MaxPooling2D at 0x245e4f73320>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4f84780>,
 <keras.layers.convolutional.Conv2D at 0x245e4f98d68>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4faef98>,
 <keras.layers.convolutional.Conv2D at 0x245e4faeef0>,
 <keras.layers.convolutional.ZeroPadding2D at 0x245e4fc11d0>,
 <keras.layers.convolutional.Conv2D at 0x245e4fd2c50>,
 <keras.layers.pooling.Ma

In [16]:
layers[last_conv_idx+1:]

[<keras.layers.pooling.MaxPooling2D at 0x245e50c4710>,
 <keras.layers.core.Flatten at 0x245e50e82e8>,
 <keras.layers.core.Dense at 0x245e50e8b38>,
 <keras.layers.core.Dropout at 0x245e51108d0>,
 <keras.layers.core.Dense at 0x245e51103c8>,
 <keras.layers.core.Dropout at 0x245e6114160>,
 <keras.layers.core.Dense at 0x245e61d3fd0>]

In [17]:
conv_layers = layers[:last_conv_idx+1]  # every layer up to and including the last conv layer
conv_model = Sequential(conv_layers)  # Sequential model for the convolutional part
# Dense layers - also known as fully connected or 'FC' layers
fc_layers = layers[last_conv_idx+1:]  # fully connected layers
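
The split can be tried out without loading VGG at all. The sketch below uses empty stand-in classes (hypothetical placeholders that only share their names with the Keras layers) and a helper named `split_at_last`, which is an illustrative name rather than a function from `utils`.

```python
# Stand-in layer classes: placeholders only, not the real Keras layers.
class Conv2D: pass
class MaxPooling2D: pass
class Flatten: pass
class Dense: pass

def split_at_last(layers, layer_type):
    """Split a layer list just after the last layer of the given type."""
    idx = [i for i, layer in enumerate(layers) if type(layer) is layer_type][-1]
    return layers[:idx + 1], layers[idx + 1:]

toy_layers = [Conv2D(), MaxPooling2D(), Conv2D(), MaxPooling2D(), Flatten(), Dense()]
conv_part, fc_part = split_at_last(toy_layers, Conv2D)

print([type(l).__name__ for l in conv_part])
print([type(l).__name__ for l in fc_part])
```

Note that the pooling layer following the last convolution ends up in the second half, just as `MaxPooling2D` is the first entry of `fc_layers` above.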

Now we can use the exact same approach to creating features as we used when we created the linear model from the imagenet predictions in the last lesson - it's only the model that has changed. As you're seeing, there's a fairly small number of "recipes" that can get us a long way!

In [18]:
path

'data/dogscats/'

In [19]:
batches = get_batches(path+'train', shuffle=False, batch_size=batch_size)
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)

val_classes = val_batches.classes
trn_classes = batches.classes
val_labels = onehot(val_classes)
trn_labels = onehot(trn_classes)

Found 23000 images belonging to 2 classes.
Found 2000 images belonging to 2 classes.


In [16]:
val_classes

array([0, 0, 0, ..., 1, 1, 1])

In [17]:
trn_classes

array([0, 0, 0, ..., 1, 1, 1])

In [18]:
batches.class_indices

{'cats': 0, 'dogs': 1}
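
The `onehot` helper comes from `utils`, whose source is not shown here. As a rough sketch of what it presumably computes (the name `onehot_sketch` and the explicit `num_classes` argument are assumptions for illustration), each integer class index becomes a row with a single 1:

```python
def onehot_sketch(classes, num_classes):
    """Turn integer class indices into one-hot rows (lists of floats)."""
    rows = []
    for c in classes:
        row = [0.0] * num_classes
        row[c] = 1.0
        rows.append(row)
    return rows

# With {'cats': 0, 'dogs': 1}: two cats followed by one dog.
print(onehot_sketch([0, 0, 1], 2))
```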

In [21]:
# Call predict_generator on the conv model to extract the validation-set features
#val_features = conv_model.predict_generator(val_batches,steps=val_batches.n // batch_size)
#save_array(model_path + 'valid_convlayer_features.bc', val_features)

In [20]:
# Call predict_generator on the conv model to extract the training-set features (the GPU here is too weak to run this)
#trn_features = conv_model.predict_generator(batches, steps = batches.n // batch_size)
# save_array(model_path + 'train_convlayer_features.bc', trn_features)

In [20]:
val_features = load_array(model_path+'valid_convlayer_features.bc')
trn_features = load_array(model_path+'train_convlayer_features.bc')

In [21]:
print(trn_features.shape)
print(val_features.shape)

(23000, 512, 14, 14)
(2000, 512, 14, 14)


For our new fully connected model, we'll create it using the exact same architecture as the last layers of VGG 16, so that we can conveniently copy pre-trained weights over from that model. However, we'll set the dropout layer's p values to zero, so as to effectively remove dropout.

In [22]:
# Copy the weights from the pre-trained model.
# Because dropout has been removed, halve the weights of each layer
def proc_wgts(layer): 
    return [o/2 for o in layer.get_weights()]
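
Why halve? The sketch below is one way to see it, under a stated assumption: dropout zeroes units during training *without* rescaling the survivors. In that setting a layer trained behind dropout with p = 0.5 received, on average, half of the full input, so halving its incoming weights reproduces the same expected pre-activation once dropout is gone. All names here are illustrative, and the sketch covers only a weight vector, not biases. If a framework instead rescales the kept activations during training, the expectations already match and no adjustment would be needed.

```python
from itertools import product

def pre_activation(weights, activations):
    return sum(w * a for w, a in zip(weights, activations))

def expected_with_dropout(weights, activations, p):
    """Average the pre-activation over every keep/drop mask, weighted by its probability."""
    total = 0.0
    for mask in product([0, 1], repeat=len(activations)):
        prob = 1.0
        for keep in mask:
            prob *= (1 - p) if keep else p
        kept = [a * keep for a, keep in zip(activations, mask)]
        total += prob * pre_activation(weights, kept)
    return total

weights = [0.4, -1.0, 2.5]
activations = [1.0, 0.5, 2.0]

with_dropout = expected_with_dropout(weights, activations, p=0.5)
halved = pre_activation([w / 2 for w in weights], activations)

print(with_dropout, halved)  # the two values agree up to rounding
```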

In [23]:
# For very fine adjustments like this, the learning rate must be set very small
opt = RMSprop(lr=0.00001, rho=0.7)

In [24]:
fc_layers

[<keras.layers.pooling.MaxPooling2D at 0x245e50c4710>,
 <keras.layers.core.Flatten at 0x245e50e82e8>,
 <keras.layers.core.Dense at 0x245e50e8b38>,
 <keras.layers.core.Dropout at 0x245e51108d0>,
 <keras.layers.core.Dense at 0x245e51103c8>,
 <keras.layers.core.Dropout at 0x245e6114160>,
 <keras.layers.core.Dense at 0x245e61d3fd0>]

In [25]:
def get_fc_model():
    model = Sequential([
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096, activation='relu'),
        Dropout(0.),  # p=0 effectively removes dropout
        Dense(4096, activation='relu'),
        Dropout(0.),  # p=0 effectively removes dropout
        Dense(2, activation='softmax')
        ])

    for l1,l2 in zip(model.layers, fc_layers):
        l1.set_weights(proc_wgts(l2))  # assign the pre-trained model's weights to the new model

    model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])
    return model

In [26]:
fc_model = get_fc_model()

And fit the model in the usual way:

In [27]:
trn_labels

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...,
       [0., 1.],
       [0., 1.],
       [0., 1.]], dtype=float32)

In [28]:
fc_model.fit(trn_features, trn_labels, epochs=8, 
             batch_size=batch_size, validation_data=(val_features, val_labels))

Train on 23000 samples, validate on 2000 samples
Epoch 1/8


ResourceExhaustedError: OOM when allocating tensor with shape[25088,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: training/RMSprop/mul_2 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](RMSprop/lr/read, training/RMSprop/gradients/dense_5/MatMul_grad/MatMul_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.


Caused by op 'training/RMSprop/mul_2', defined at:
  File "C:\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Anaconda3\lib\site-packages\ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "C:\Anaconda3\lib\site-packages\traitlets\config\application.py", line 658, in launch_instance
    app.start()
  File "C:\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 486, in start
    self.io_loop.start()
  File "C:\Anaconda3\lib\site-packages\tornado\ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "C:\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "C:\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "C:\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\tornado\stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "C:\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "C:\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "C:\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "C:\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2728, in run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2856, in run_ast_nodes
    if self.run_code(code, result):
  File "C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-28-84f7d022e1c8>", line 2, in <module>
    batch_size=batch_size, validation_data=(val_features, val_labels))
  File "C:\Anaconda3\lib\site-packages\keras\models.py", line 1002, in fit
    validation_steps=validation_steps)
  File "C:\Anaconda3\lib\site-packages\keras\engine\training.py", line 1682, in fit
    self._make_train_function()
  File "C:\Anaconda3\lib\site-packages\keras\engine\training.py", line 992, in _make_train_function
    loss=self.total_loss)
  File "C:\Anaconda3\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Anaconda3\lib\site-packages\keras\optimizers.py", line 258, in get_updates
    new_p = p - lr * g / (K.sqrt(new_a) + self.epsilon)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\ops\variables.py", line 796, in _run_op
    return getattr(ops.Tensor, operator)(a._AsTensor(), *args)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 979, in binary_op_wrapper
    return func(x, y, name=name)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1211, in _mul_dispatch
    return gen_math_ops.mul(x, y, name=name)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 5066, in mul
    "Mul", x=x, y=y, name=name)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3392, in create_op
    op_def=op_def)
  File "C:\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[25088,4096] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
	 [[Node: training/RMSprop/mul_2 = Mul[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](RMSprop/lr/read, training/RMSprop/gradients/dense_5/MatMul_grad/MatMul_1)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
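
The tensor shape in the error, `[25088, 4096]`, matches the weight matrix of the first dense layer: the 512 x 14 x 14 conv features become 512 x 7 x 7 after the default 2 x 2 max pooling, and 512 * 7 * 7 = 25088. The back-of-the-envelope sketch below (the helper name `tensor_mib` is an assumption for illustration) estimates how much memory a single float32 tensor of that shape needs. Training holds several tensors of this size at once (the weights, their gradient, and the RMSprop accumulator), which is one plausible reading of why a small GPU runs out of memory here.

```python
def tensor_mib(shape, bytes_per_element=4):
    """Size of a dense tensor in MiB; float32 uses 4 bytes per element."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element / 2**20

flattened = 512 * 7 * 7  # conv features after 2x2 max pooling, then Flatten
print(flattened)                        # 25088
print(tensor_mib((flattened, 4096)))    # 392.0
```

Since this tensor is sized by the layer widths rather than by the number of samples per batch, lowering `batch_size` would not shrink it.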



In [None]:
fc_model.save_weights(model_path+'no_dropout.h5')

In [None]:
fc_model.load_weights(model_path+'no_dropout.h5')

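# Reducing overfitting

Now that we have gotten the model to overfit, we can take steps to reduce the overfitting.

## Approaches to reducing overfitting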

We do not necessarily need to rely on dropout or other regularization approaches to reduce overfitting. There are other techniques we should try first, since regularization, by definition, biases our model towards simplicity - which we only want to do if we know that's necessary. This is the order that we recommend using for reducing overfitting (more details about each in a moment):

1. Add more data
2. Use data augmentation
3. Use architectures that generalize well
4. Add regularization
5. Reduce architecture complexity.

We'll assume that you've already collected as much data as you can, so step (1) isn't relevant (this is true for most Kaggle competitions, for instance). So the next step (2) is data augmentation. This refers to creating additional synthetic data, based on reasonable modifications of your input data. For images, this is likely to involve one or more of: flipping, rotation, zooming, cropping, panning, minor color changes.

Which types of augmentation are appropriate depends on your data. For regular photos, for instance, you'll want to use horizontal flipping, but not vertical flipping (since an upside down car is much less common than a car the right way up, for instance!)

We recommend *always* using at least some light data augmentation, unless you have so much data that your model will never see the same input twice.
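
As a toy illustration of two of these modifications, the sketch below flips and shifts a tiny "image" stored as nested lists. The helper names (`hflip`, `shift_right`, `augment`) are made up for this example; real pipelines would work on pixel arrays and use a library generator such as the Keras one introduced next.

```python
import random

def hflip(img):
    """Mirror the image left-to-right."""
    return [list(reversed(row)) for row in img]

def shift_right(img, k, fill=0):
    """Shift every row k pixels to the right (0 <= k <= width), padding with `fill`."""
    width = len(img[0])
    return [[fill] * k + row[:width - k] for row in img]

def augment(img, rng):
    """Randomly flip, then randomly shift by 0 or 1 pixel."""
    out = hflip(img) if rng.random() < 0.5 else [list(row) for row in img]
    return shift_right(out, rng.randint(0, 1))

img = [[1, 2, 3],
       [4, 5, 6]]

rng = random.Random(0)
for _ in range(3):
    print(augment(img, rng))
```

Each call returns an image of the same size as the input, so the augmented copies can be fed to the model exactly like the originals.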

## About data augmentation

Keras comes with very convenient features for automating data augmentation. You simply define what types and maximum amounts of augmentation you want, and Keras ensures that every item of every batch is randomly changed according to these settings. Here's how to define a generator that includes data augmentation:

In [None]:
# data_format='channels_last' (the Keras 2 name for dim_ordering='tf') uses tensorflow
#   dimension ordering, which is the same order as matplotlib uses for display.
# Therefore when just using for display purposes, this is more convenient
gen = image.ImageDataGenerator(rotation_range=10, width_shift_range=0.1, 
       height_shift_range=0.1, shear_range=0.15, zoom_range=0.1, 
       channel_shift_range=10., horizontal_flip=True, data_format='channels_last')

Let's take a look at how this generator changes a single image (the details of this code don't matter much, but feel free to read the comments and keras docs to understand the details if you're interested).

In [None]:
test = ndimage.imread('data/dogscats/test1/7.jpg')

In [None]:
test.shape

In [None]:
np.expand_dims(test,0).shape

In [None]:
# Create a 'batch' of a single image
img = np.expand_dims(ndimage.imread('data/dogscats/test1/7.jpg'),0)

In [None]:
aug_iter = gen.flow(img)

In [None]:
type(aug_iter)

In [None]:
aug_iter  # an iterator over augmented batches

In [None]:
next(aug_iter)[0].astype(np.uint8)  # get a single generated image

In [None]:
# Get 8 generated (augmented) images
aug_imgs = [next(aug_iter)[0].astype(np.uint8) for i in range(8)]

In [None]:
len(aug_imgs)

In [None]:
img.shape

In [None]:
# Plot the original image
plt.imshow(img[0])

As you can see below, there's no magic to data augmentation - it's a very intuitive approach to generating richer input data. Generally speaking, your intuition should be a good guide to appropriate data augmentation, although it's a good idea to test your intuition by checking the results of different augmentation approaches.

In [None]:
# Plot all of the augmented images
plots(aug_imgs, (20,7), 2)

In [None]:
# Ensure that we return to theano dimension ordering
K.set_image_dim_ordering('th')

## Adding data augmentation

Let's try adding a small amount of data augmentation, and see if we reduce overfitting as a result. The approach will be identical to the method we used to finetune the dense layers in lesson 2, except that we will use a generator with augmentation configured. Here's how we set up the generator, and create batches from it:

In [None]:
gen = image.ImageDataGenerator(rotation_range=15, width_shift_range=0.1, 
                               height_shift_range=0.1, zoom_range=0.1, horizontal_flip=True)

In [None]:
batches = get_batches(path+'train', gen, batch_size=batch_size)
# NB: We don't want to augment or shuffle the validation set
val_batches = get_batches(path+'valid', shuffle=False, batch_size=batch_size)

When using data augmentation, we can't pre-compute our convolutional layer features, since randomized changes are being made to every input image. That is, even if the training process sees the same image multiple times, each time it will have undergone different data augmentation, so the results of the convolutional layers will be different.

Therefore, in order to allow data to flow through all the conv layers and our new dense layers, we attach our fully connected model to the convolutional model--after ensuring that the convolutional layers are not trainable:

In [None]:
fc_model = get_fc_model()

In [None]:
for layer in conv_model.layers:
    layer.trainable = False
# Conv layers + the new fully connected layers; the conv layers are not trained
conv_model.add(fc_model)

Now we can compile, train, and save our model as usual - note that we use *fit_generator()* since we want to pull random images from the directories on every batch.

In [None]:
conv_model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
conv_model.fit_generator(batches,batches.n // batch_size,epochs=8,
                         validation_data=val_batches,validation_steps=val_batches.n//batch_size)
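
In Keras 2 the second argument to *fit_generator()* counts batches per epoch, not samples, which is why the call divides by `batch_size`. A small sketch of that arithmetic for the dataset sizes reported earlier (the helper name `steps_per_epoch` is just for illustration):

```python
def steps_per_epoch(num_samples, batch_size):
    """Number of full batches needed to cover the data once."""
    return num_samples // batch_size

print(steps_per_epoch(23000, 8))  # training set: 2875
print(steps_per_epoch(2000, 8))   # validation set: 250
print(steps_per_epoch(2001, 8))   # 250 as well: floor division skips a partial batch
```

Both dataset sizes here divide evenly by 8, so nothing is skipped; with other sizes, floor division leaves the final partial batch out of each epoch.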

In [None]:
conv_model.save_weights(model_path + 'aug1.h5')

In [None]:
conv_model.load_weights(model_path + 'aug1.h5')

## Batch normalization

### About batch normalization

Batch normalization (*batchnorm*) is a way to ensure that activations don't become too high or too low at any point in the model. Adjusting activations so they are of similar scales is called *normalization*. Normalization is very helpful for fast training - if some activations are very high, they will saturate the model and create very large gradients, causing training to fail; if very low, they will cause training to proceed very slowly. Furthermore, large or small activations in one layer will tend to result in even larger or smaller activations in later layers, since the activations get multiplied repeatedly across the layers.
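
A quick numerical sketch of that compounding effect, assuming (purely for illustration) that every layer scales its input by the same constant gain:

```python
def activation_scale(per_layer_gain, depth):
    """Overall scale after `depth` layers that each multiply by `per_layer_gain`."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_gain
    return scale

for gain in (0.5, 1.0, 1.5):
    print(gain, activation_scale(gain, 20))
```

After only 20 layers, a gain of 1.5 has grown the scale by more than a factor of 3000, while a gain of 0.5 has shrunk it to below one millionth. Only a gain close to 1 keeps activations in a usable range.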

Prior to the development of batchnorm in 2015, only the inputs to a model could be effectively normalized - by simply subtracting their mean and dividing by their standard deviation. However, weights in intermediate layers could easily become poorly scaled, due to problems in weight initialization, or a high learning rate combined with random fluctuations in weights.

Batchnorm resolves this problem by normalizing each intermediate layer as well. The details of how it works are not terribly important (although I will outline them in a moment) - the important takeaway is that **all modern networks should use batchnorm, or something equivalent**. There are two reasons for this:
1. Adding batchnorm to a model can result in **10x or more improvements in training speed**
2. Because normalization greatly reduces the ability of a small number of outlying inputs to over-influence the training, it also tends to **reduce overfitting**.

As promised, here's a brief outline of how batchnorm works. As a first step, it normalizes intermediate layers in the same way as input layers can be normalized. But this on its own would not be enough, since the model would then just push the weights up or down indefinitely to try to undo this normalization. Therefore, batchnorm takes two additional steps:
1. Add two more trainable parameters to each layer - one to multiply all activations to set an arbitrary standard deviation, and one to add to all activations to set an arbitrary mean
2. Incorporate both the normalization, and the learnt multiply/add parameters, into the gradient calculations during backprop.

This ensures that the weights don't tend to push very high or very low (since the normalization is included in the gradient calculations, so the updates are aware of the normalization). But it also ensures that if a layer does need to change the overall mean or standard deviation in order to match the output scale, it can do so.
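
The forward pass just described can be sketched in a few lines for a single unit over one mini-batch. This is a simplified illustration: the names `gamma` and `beta` for the learnt multiply/add parameters and the small constant `eps` are conventional choices rather than anything defined in this notebook, and the sketch leaves out both the gradient computation and the running statistics that a real layer would use at inference time.

```python
import math

def batchnorm_forward(batch, gamma, beta, eps=1e-5):
    """Normalize one unit's activations over a batch, then rescale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    normalized = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return [gamma * x + beta for x in normalized]

# Poorly scaled activations for one unit across a batch of five inputs.
batch = [120.0, 80.0, 100.0, 95.0, 105.0]
out = batchnorm_forward(batch, gamma=2.0, beta=0.5)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((x - out_mean) ** 2 for x in out) / len(out))
print(round(out_mean, 3), round(out_std, 3))
```

Whatever the scale of the incoming activations, the output mean and standard deviation are set by `beta` and `gamma` (up to the small effect of `eps`), which is what lets the layer choose its own output scale.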

### Adding batchnorm to the model

We can use nearly the same approach as before - but this time we'll add batchnorm layers (and dropout layers):

In [None]:
conv_layers[-1].output_shape[1:]

In [None]:
def get_bn_layers(p):
    return [
        MaxPooling2D(input_shape=conv_layers[-1].output_shape[1:]),
        Flatten(),
        Dense(4096,activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(4096,activation='relu'),
        BatchNormalization(),
        Dropout(p),
        Dense(1000,activation='softmax')       
    ]

In [None]:
def load_fc_weights_from_vgg16bn(model):
    "Load weights for model from the dense layers of the Vgg16BN model."
    # See imagenet_batchnorm.ipynb for info on how the weights for
    # Vgg16BN can be generated from the standard Vgg16 weights.
    from vgg16bn import Vgg16BN
    vgg16_bn = Vgg16BN()
    _, fc_layers = split_at(vgg16_bn.model, Convolution2D)
    copy_weights(fc_layers, model.layers)

In [None]:
p=0.6

In [None]:
bn_model = Sequential(get_bn_layers(0.6))

In [None]:
load_fc_weights_from_vgg16bn(bn_model)

In [None]:
def proc_wgts(layer, prev_p, new_p):
    scal = (1-prev_p)/(1-new_p)
    return [o*scal for o in layer.get_weights()]

In [None]:
for l in bn_model.layers: 
    if type(l)==Dense:
        l.set_weights(proc_wgts(l, 0.5, 0.6))
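
The scaling factor in this version of `proc_wgts` is the ratio of the old keep probability to the new one. A small sketch of the arithmetic (the function name `dropout_rescale` is illustrative):

```python
import math

def dropout_rescale(prev_p, new_p):
    """Ratio of keep probabilities when moving from dropout prev_p to new_p."""
    return (1 - prev_p) / (1 - new_p)

print(dropout_rescale(0.5, 0.6))  # about 1.25: fewer units survive, so weights grow
print(dropout_rescale(0.5, 0.0))  # 0.5: removing dropout entirely halves the weights
```

With `new_p = 0` the factor reduces to the halving used by the earlier single-argument `proc_wgts`.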

In [None]:
bn_model.pop()
for layer in bn_model.layers: 
    layer.trainable=False

In [None]:
bn_model.add(Dense(2,activation='softmax'))

In [None]:
bn_model.compile(Adam(), 'categorical_crossentropy', metrics=['accuracy'])

In [None]:
bn_model.fit(trn_features, trn_labels, epochs=8, validation_data=(val_features, val_labels))

In [None]:
bn_model.save_weights(model_path+'bn.h5')

In [None]:
bn_model.load_weights(model_path+'bn.h5')

In [None]:
bn_layers = get_bn_layers(0.6)
bn_layers.pop()
bn_layers.append(Dense(2,activation='softmax'))

In [None]:
final_model = Sequential(conv_layers)
for layer in final_model.layers: layer.trainable = False
for layer in bn_layers: final_model.add(layer)

In [None]:
for l1,l2 in zip(bn_model.layers, bn_layers):
    l2.set_weights(l1.get_weights())

In [None]:
final_model.compile(optimizer=Adam(), 
                    loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
final_model.fit_generator(batches, batches.n // batch_size, epochs=1,
                        validation_data=val_batches, validation_steps=val_batches.n // batch_size)

In [None]:
final_model.save_weights(model_path + 'final1.h5')

In [None]:
final_model.fit_generator(batches, batches.n // batch_size, epochs=4,
                        validation_data=val_batches, validation_steps=val_batches.n // batch_size)

In [None]:
final_model.save_weights(model_path + 'final2.h5')

In [None]:
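# Update the existing learning-rate variable in place rather than rebinding the attribute
K.set_value(final_model.optimizer.lr, 0.001)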

In [None]:
final_model.fit_generator(batches, batches.n // batch_size, epochs=4,
                        validation_data=val_batches, validation_steps=val_batches.n // batch_size)

In [None]:
final_model.save_weights(model_path + 'final3.h5')