# Chapter 7: Advanced deep-learning best practices

## 7.3 Getting the most out of your models

In this section, we'll go beyond "works okay" to "works great and wins machine-learning competitions" by offering you a quick guide to a set of must-know techniques for building state-of-the-art deep learning models.

### 7.3.1 Advanced architecture patterns

#### BATCH NORMALIZATION

*Normalization* is a broad category of methods that seek to make different samples seen by a machine-learning model more similar to each other, which helps the model learn and generalize well to new data.

Previous examples normalized data before feeding it into models. But data normalization should be a concern after every transformation operated by the network: even if the data entering a `Dense` or `Conv2D` layer has a 0 mean and unit variance, there's no reason to expect a priori that this will be the case for the data coming out.

Batch normalization is a type of layer that can adaptively normalize data even as the mean and variance change over time during training. The main effect of batch normalization is that it helps with gradient propagation and thus allows for deeper networks. Some very deep networks can only be trained if they include multiple `BatchNormalization` layers.
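To make the idea concrete, here's a minimal NumPy sketch of what a batch-normalization layer computes on one batch during training. It is a simplification, not the Keras implementation: the function name `batch_norm_train`, the `eps` value, and the random "dense layer" weights are assumptions made for illustration, and the moving averages a real layer keeps for inference are left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-3):
    """Normalize each feature over the batch axis, then rescale and shift."""
    mean = x.mean(axis=0)                      # per-feature mean over the batch
    var = x.var(axis=0)                        # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)    # roughly zero mean, unit variance
    return gamma * x_hat + beta                # learnable scale and offset

rng = np.random.default_rng(0)

# A batch of 64 samples with zero mean and unit variance ...
inputs = rng.normal(size=(64, 16))
# ... passed through a random dense layer with ReLU (hypothetical weights).
weights = rng.normal(size=(16, 8))
activations = np.maximum(inputs @ weights, 0.0)
print("before:", activations.mean(), activations.std())

# gamma=1 and beta=0 are the usual starting values for the learnable parameters.
normalized = batch_norm_train(activations, gamma=np.ones(8), beta=np.zeros(8))
print("after mean:", normalized.mean(axis=0).round(3))
print("after std: ", normalized.std(axis=0).round(3))
```

The activations coming out of the layer are no longer centered, even though the inputs were; after the normalization step each feature is back to approximately zero mean and unit variance.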

The `BatchNormalization` layer is typically used after a convolutional or densely connected layer:

In [None]:
# After a Conv layer
conv_model.add(layers.Conv2D(32, 3, activation='relu'))
conv_model.add(layers.BatchNormalization())

# After a Dense layer
conv_model.add(layers.Dense(32, activation='relu'))
conv_model.add(layers.BatchNormalization())

#### DEPTHWISE SEPARABLE CONVOLUTION

A *depthwise separable convolution* layer performs a spatial convolution on each channel of its input, independently, before mixing output channels via a pointwise convolution. This is **equivalent to separating the learning of spatial features and the learning of channel-wise features**, which makes a lot of sense if you assume that spatial locations in the input are highly correlated, but different channels are fairly independent.

Depthwise separable convolution requires fewer parameters and involves fewer computations, thus resulting in smaller, speedier models. And because it's a more representationally efficient way to perform convolution, it tends to learn better representations using less data, resulting in better-performing models. 

<img src='image/fig716.PNG' width='550'>

These advantages become especially important when you're training small models from scratch on limited data.
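The parameter savings can be checked with a little arithmetic. The sketch below counts the weights of a regular 2D convolution and of a depthwise separable one, assuming a square kernel, one depthwise filter per input channel, and a single bias vector per layer; the helper names are made up for this example. The channel pairs are the ones used by the `SeparableConv2D` layers of the model built later in this section.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Regular convolution: one k x k x in_channels filter per output channel, plus biases."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def separable_conv2d_params(kernel_size, in_channels, out_channels):
    """Depthwise separable convolution (assumes one depthwise filter per input channel)."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per input channel
    pointwise = in_channels * out_channels               # 1 x 1 convolution mixing the channels
    return depthwise + pointwise + out_channels          # plus one bias per output channel

for c_in, c_out in [(3, 32), (32, 64), (64, 64), (64, 128), (128, 64)]:
    regular = conv2d_params(3, c_in, c_out)
    separable = separable_conv2d_params(3, c_in, c_out)
    print(f"{c_in:>3} -> {c_out:<3}  regular: {regular:>6}  separable: {separable:>5}")
```

For the 32-to-64 layer, for instance, this gives 18,496 parameters for the regular convolution against 2,400 for the separable one, and the separable counts agree with the `Param #` column of the `model.summary()` output shown further down.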

Here's how you can build a lightweight, depthwise separable convnet for an image-classification task (softmax categorical classification) on a small dataset:

##### Training a depthwise separable convnet on the CIFAR10 dataset

In [2]:
import keras
from keras.datasets import cifar10

# input image dimensions
height = 32
width = 32
channels = 3

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

x_train = x_train.reshape(x_train.shape[0], height, width, channels)
x_test = x_test.reshape(x_test.shape[0], height, width, channels)
input_shape = (height, width, channels)

x_train = x_train.astype('float32')
x_test = x_test.astype('float32')
x_train /= 255
x_test /= 255
print('x_train shape:', x_train.shape)
print(x_train.shape[0], 'train samples')
print(x_test.shape[0], 'test samples')

# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)

x_train shape: (50000, 32, 32, 3)
50000 train samples
10000 test samples


In [4]:
from keras.models import Sequential, Model
from keras import layers
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

batch_size = 128
epochs = 12

model = Sequential()
model.add(layers.SeparableConv2D(32, 3,
                                 activation='relu',
                                 input_shape=(height, width, channels,))) 
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.MaxPooling2D(2)) 
model.add(layers.SeparableConv2D(64, 3, activation='relu')) 
model.add(layers.SeparableConv2D(128, 3, activation='relu')) 
model.add(layers.GlobalAveragePooling2D())
model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
separable_conv2d_7 (Separabl (None, 30, 30, 32)        155       
_________________________________________________________________
separable_conv2d_8 (Separabl (None, 28, 28, 64)        2400      
_________________________________________________________________
max_pooling2d_3 (MaxPooling2 (None, 14, 14, 64)        0         
_________________________________________________________________
separable_conv2d_9 (Separabl (None, 12, 12, 64)        4736      
_________________________________________________________________
separable_conv2d_10 (Separab (None, 10, 10, 128)       8896      
_________________________________________________________________
max_pooling2d_4 (MaxPooling2 (None, 5, 5, 128)         0         
_________________________________________________________________
separable_conv2d_11 (Separab (None, 3, 3, 64)          9408      
__________

In [5]:
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])

model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          validation_split=0.2,
          verbose=2)

Train on 50000 samples, validate on 10000 samples
Epoch 1/12


KeyboardInterrupt: 

In [None]:
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

##### Regular training with a simple ConvNet on the CIFAR dataset

In [None]:
model_regular = Sequential()
model_regular.add(Conv2D(32, kernel_size=(3, 3),
                         activation='relu',
                         input_shape=input_shape))
model_regular.add(Conv2D(64, (3, 3), activation='relu'))
model_regular.add(MaxPooling2D(pool_size=(2, 2)))
model_regular.add(Dropout(0.25))
model_regular.add(Flatten())
model_regular.add(Dense(128, activation='relu'))
model_regular.add(Dropout(0.5))
model_regular.add(Dense(10, activation='softmax'))  # one output per CIFAR10 class

model_regular.compile(loss='categorical_crossentropy',
                      # RMSprop, the same optimizer as the separable model, for comparison
                      optimizer='rmsprop',
                      # Adding accuracy metrics 
                      metrics=['accuracy'])

model_regular.fit(x_train, y_train,
                  batch_size=batch_size,
                  epochs=epochs,
                  validation_split=0.2,
                  verbose=2)

In [None]:
score = model_regular.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

> The results are not very convincing: the depthwise separable convnet reaches an accuracy of 55%, which is lower than that of the simple ConvNet. This is probably because the model is not big or deep enough.

##### Training with Adadelta optimizer on the CIFAR dataset

In [None]:
model_ada = Sequential()
model_ada.add(Conv2D(32, kernel_size=(3, 3),
                     activation='relu',
                     input_shape=input_shape))
model_ada.add(Conv2D(64, (3, 3), activation='relu'))
model_ada.add(MaxPooling2D(pool_size=(2, 2)))
model_ada.add(Dropout(0.25))
model_ada.add(Flatten())
model_ada.add(Dense(128, activation='relu'))
model_ada.add(Dropout(0.5))
model_ada.add(Dense(10, activation='softmax'))  # one output per CIFAR10 class

model_ada.compile(loss='categorical_crossentropy',
                  # The model_ada is optimized with the Adadelta optimizer
                  optimizer='Adadelta',
                  # Adding accuracy metrics 
                  metrics=['accuracy'])

model_ada.fit(x_train, y_train,
              batch_size=batch_size,
              epochs=epochs,
              validation_split=0.2,
              verbose=2)

In [None]:
score = model_ada.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

### 7.3.2 Hyperparameter optimization

*Hyperparameters* are architecture-level parameters that, unlike the model's weights, are not trained via backpropagation (e.g., the choice of activation function or the dropout rate).

The process of optimizing hyperparameters typically looks like this:
  1. Choose a set of hyperparameters (automatically).
  2. Build the corresponding model.
  3. Fit it to your training data, and measure the final performance on the validation data.
  4. Choose the next set of hyperparameters to try (automatically).
  5. Repeat.
  6. Eventually, measure performance on your test data.
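
The loop described in these steps can be sketched with the simplest automatic strategy, random search. To keep the example small and runnable, it tunes a single hyperparameter (an L2 regularization strength, `alpha`) of a closed-form ridge regression on synthetic data rather than a deep network; the data, the helper names `fit_ridge` and `mse`, the search range, and the number of trials are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (a stand-in for a real dataset).
X = rng.normal(size=(300, 20))
true_w = rng.normal(size=20)
y = X @ true_w + rng.normal(scale=2.0, size=300)

# Train / validation / test split.
X_tr, y_tr = X[:180], y[:180]
X_va, y_va = X[180:240], y[180:240]
X_te, y_te = X[240:], y[240:]

def fit_ridge(features, targets, alpha):
    """Closed-form ridge regression: the 'build and fit' step."""
    n_features = features.shape[1]
    gram = features.T @ features + alpha * np.eye(n_features)
    return np.linalg.solve(gram, features.T @ targets)

def mse(features, targets, w):
    """Mean squared error of the linear model w."""
    return float(np.mean((features @ w - targets) ** 2))

best = {"alpha": None, "val_mse": np.inf, "w": None}
history = []
for trial in range(20):
    # Steps 1 and 4: choose hyperparameters automatically (here at random, on a log scale).
    alpha = 10 ** rng.uniform(-3, 3)
    # Steps 2 and 3: build and fit the model, then score it on the validation data.
    w = fit_ridge(X_tr, y_tr, alpha)
    val_mse = mse(X_va, y_va, w)
    history.append((alpha, val_mse))
    if val_mse < best["val_mse"]:
        best = {"alpha": alpha, "val_mse": val_mse, "w": w}

# Step 6: touch the test set only once, at the very end.
test_mse = mse(X_te, y_te, best["w"])
print("best alpha:", best["alpha"])
print("validation MSE:", best["val_mse"], "test MSE:", test_mse)
```

Each trial refits the model from scratch, which is cheap here but is exactly what makes the same loop expensive when the model is a deep network. Smarter strategies replace only the "choose the next set" step; the rest of the loop stays the same.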
  
Training the weights of a model is relatively easy: you compute a loss function on a mini-batch of data and then use the backpropagation algorithm to move the weights in the right direction. Updating hyperparameters, on the other hand, is extremely challenging. Consider the following:
  + Computing the feedback signal can be extremely expensive: it requires creating and training a new model from scratch on your dataset.
  + In many cases, you must rely on gradient-free optimization techniques, which naturally are far less efficient than gradient descent.
  
Overall, hyperparameter optimization is a powerful technique that is an absolute requirement to get to state-of-the-art models on any task or to win machine-learning competitions.

### 7.3.3 Model ensembling

Another powerful technique for obtaining the best possible results on a task is *model ensembling*. Ensembling consists of pooling together the predictions of a set of different models, to produce better predictions.

Ensembling relies on the assumption that different good models trained independently are likely to be good for *different reasons*: each model looks at slightly different aspects of the data to make its predictions, getting part of the "truth" but not all of it.

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score

x_val = x_test.reshape(10000, 32, 32, 3)

# Use three different models to compute initial predictions
preds = model.predict(x_val)
preds_regular = model_regular.predict(x_val)
preds_ada = model_ada.predict(x_val)

# This new prediction array should be more accurate than any of the initial ones
final_preds = 1/3 * (preds + preds_regular + preds_ada)

# Turn the averaged probabilities into one-hot rows so they can be compared with y_test
final_preds_one_hot = np.zeros_like(final_preds)
final_preds_one_hot[np.arange(len(final_preds)), final_preds.argmax(1)] = 1

accuracy_score(y_test, final_preds_one_hot)
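
To see why averaging helps when models are wrong for different reasons, here's a toy simulation that doesn't need any trained network. Everything in it is an assumption made for illustration: the three "models" are simulated softmax outputs in which the true class gets a boost and each model adds its own independent noise, and the `signal` and `noise` values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_classes = 5000, 10
labels = rng.integers(0, n_classes, size=n_samples)

def noisy_model_predictions(labels, signal=2.0, noise=1.5):
    """Simulate softmax outputs: the true class gets a boost, plus model-specific noise."""
    logits = noise * rng.normal(size=(len(labels), n_classes))
    logits[np.arange(len(labels)), labels] += signal
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

def accuracy(probs, labels):
    """Fraction of samples whose highest-probability class is the true one."""
    return float(np.mean(probs.argmax(axis=1) == labels))

# Three simulated models whose errors are independent of each other.
all_probs = [noisy_model_predictions(labels) for _ in range(3)]
individual_acc = [accuracy(p, labels) for p in all_probs]

# Unweighted average of the predicted probabilities, as in the cell above.
ensemble_probs = np.mean(all_probs, axis=0)
ensemble_acc = accuracy(ensemble_probs, labels)

print("individual:", individual_acc)
print("ensemble:  ", ensemble_acc)
```

Because the noise is independent across the simulated models, it partly cancels out in the average while the shared signal remains. If the three models made the same mistakes, the average would gain little, which is why ensembling works best with models that are as different as possible.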