### From 'works okay' to 'works great and wins ML competitions'
* Residual connections
* Normalization
* Depthwise separable convolution

##### Batch Normalization
* normalized_data = (data - np.mean(data, axis=...)) / np.std(data, axis=...)
* Batch normalization is a type of layer (BatchNormalization in Keras)
* Batch normalization works by internally maintaining an exponential moving average of the batch-wise mean and variance of the data seen during training. The main effect is to help with gradient propagation
* Used extensively in the advanced convnet architectures that ship with Keras, such as ResNet50, Inception V3, and Xception
* Typically used after a convolutional or densely connected layer
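
The moving-average mechanism described above can be sketched in plain NumPy. This is a toy illustration, not the Keras implementation: the class name `SimpleBatchNorm` is hypothetical, the learnable scale and offset that a real BatchNormalization layer also has are left out, and the input data is synthetic.

```python
import numpy as np


class SimpleBatchNorm:
    """Toy batch normalization over the last (feature) axis; not the Keras code."""

    def __init__(self, num_features, momentum=0.99, epsilon=1e-3):
        self.momentum = momentum
        self.epsilon = epsilon
        self.moving_mean = np.zeros(num_features)
        self.moving_var = np.ones(num_features)

    def __call__(self, x, training):
        if training:
            # Normalize with the statistics of the current batch.
            reduce_axes = tuple(range(x.ndim - 1))
            mean = x.mean(axis=reduce_axes)
            var = x.var(axis=reduce_axes)
            # Update the exponential moving averages used at inference time.
            self.moving_mean = self.momentum * self.moving_mean + (1 - self.momentum) * mean
            self.moving_var = self.momentum * self.moving_var + (1 - self.momentum) * var
        else:
            # At inference time, fall back to the accumulated statistics.
            mean, var = self.moving_mean, self.moving_var
        return (x - mean) / np.sqrt(var + self.epsilon)


rng = np.random.default_rng(0)
bn = SimpleBatchNorm(num_features=4, momentum=0.9)

# Synthetic "training" batches with mean 5 and standard deviation 2.
for _ in range(200):
    batch = rng.normal(loc=5.0, scale=2.0, size=(64, 4))
    train_out = bn(batch, training=True)

# Inference on new data reuses the moving averages instead of batch statistics.
test_out = bn(rng.normal(loc=5.0, scale=2.0, size=(1000, 4)), training=False)
```

After enough batches, `bn.moving_mean` and `bn.moving_var` settle near the true mean and variance of the data, which is why normalization still works at inference time on a single sample.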

The BatchNormalization layer takes an axis argument that specifies the feature axis to normalize:
* axis = -1 (the default): Dense, Conv1D, RNN, and Conv2D with data_format set to "channels_last"
* axis = 1: Conv2D with data_format set to "channels_first"

In [2]:
import keras
from keras import layers

conv_model = keras.models.Sequential() 
conv_model.add(layers.Conv2D(32, 3, activation='relu'))
conv_model.add(layers.BatchNormalization())

dense_model = keras.models.Sequential()
dense_model.add(layers.Dense(32, activation='relu'))
dense_model.add(layers.BatchNormalization())

### Depthwise Separable Convolution
* Lighter (fewer trainable parameters)
* Faster (fewer floating point operations)
* A few percentage points better on its task
* Available in Keras as the SeparableConv2D layer

These advantages are especially important when you're training small models from scratch on limited data.
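
To make the "lighter" claim concrete, here is a back-of-the-envelope weight count for a single layer. The helper names are hypothetical, biases are ignored, and a depth multiplier of 1 is assumed; the 3x3 kernel with 64 input and 128 output channels is just an example size.

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights in a regular 2D convolution (bias omitted)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv2d_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise separable convolution (bias omitted, depth multiplier 1)."""
    depthwise = kernel_size * kernel_size * in_channels  # one spatial filter per input channel
    pointwise = in_channels * out_channels               # 1x1 convolution that mixes channels
    return depthwise + pointwise


regular = conv2d_params(3, 64, 128)
separable = separable_conv2d_params(3, 64, 128)
print(regular, separable, round(regular / separable, 1))  # 73728 8768 8.4
```

For this layer size the separable version needs roughly 8 times fewer weights, and the gap grows with the number of output channels.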

In [10]:
from keras.models import Sequential, Model
from keras import layers

height = 64
width = 64
channels = 3
num_class = 10

model = Sequential()
model.add(layers.SeparableConv2D(32, 3, activation='relu', input_shape=(height, width, channels,)))
model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.MaxPooling2D(2))

model.add(layers.SeparableConv2D(64, 3, activation='relu'))
model.add(layers.SeparableConv2D(128, 3, activation='relu'))
model.add(layers.GlobalAveragePooling2D())

model.add(layers.Dense(32, activation='relu'))
model.add(layers.Dense(num_class, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy')


### Hyperparameter optimization
Seemingly arbitrary decisions:
* How many layers to stack
* How many units or filters in each layer
* Should you use 'relu' - or a different activation function
* Should you use BatchNormalization after a given layer
* How much dropout should you use
* ... And so on

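It often turns out that random search (choosing hyperparameters to evaluate at random) is the best solution.
But Hyperopt (https://github.com/hyperopt/hyperopt), a Python library for hyperparameter optimization, can do better than random search.
Hyperas (https://github.com/maxpumperla/hyperas) integrates Hyperopt for use with Keras models.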
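
Random search itself fits in a few lines. The sketch below is illustrative only: the search space, the function names, and `toy_validation_loss` are all made up for the example. In practice the evaluation function would build and train a model with the given configuration and return its validation loss.

```python
import random

# Hypothetical search space; the names and ranges are illustrative only.
search_space = {
    "num_layers": [1, 2, 3, 4],
    "units": [16, 32, 64, 128],
    "activation": ["relu", "tanh", "elu"],
    "use_batch_norm": [True, False],
    "dropout": [0.0, 0.1, 0.25, 0.5],
}


def sample_config(space, rng):
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


def random_search(evaluate, space, num_trials, seed=0):
    """Evaluate num_trials random configurations and keep the lowest-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    for _ in range(num_trials):
        config = sample_config(space, rng)
        score = evaluate(config)
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


def toy_validation_loss(config):
    """Synthetic stand-in for 'build, train, and return the validation loss'."""
    return abs(config["num_layers"] - 2) + abs(config["units"] - 64) / 64 + config["dropout"]


best_config, best_score = random_search(toy_validation_loss, search_space, num_trials=50)
print(best_config, best_score)
```

Because the score comes from validation data, repeated tuning can overfit the validation set, so a final check on held-out test data is still needed.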

### Model Ensembling
preds_a = model_a.predict(x_val)
preds_b = model_b.predict(x_val)
...
preds_z = model_z.predict(x_val)

final_preds = 0.5 * preds_a + 0.1 * preds_b + ...
where the weights (0.5, 0.1, etc.) are assumed to be learned empirically on the validation data

Search for good ensembling weights: random search or simple optimization such as Nelder-Mead.
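
A weight search of this kind might look like the following. This is a minimal sketch using random search over weights that are non-negative and sum to 1. The helper names are hypothetical, and the "model predictions" are synthetic softmax outputs with different noise levels standing in for `model_a.predict(x_val)` and friends.

```python
import numpy as np


def weighted_average(preds, weights):
    """Combine predictions of shape (num_models, num_samples, num_classes)."""
    return np.tensordot(weights, preds, axes=1)


def log_loss(y_true, probs, eps=1e-12):
    """Mean categorical cross-entropy for one-hot labels."""
    return -np.mean(np.sum(y_true * np.log(np.clip(probs, eps, 1.0)), axis=1))


def search_ensemble_weights(preds, y_true, num_trials=500, seed=0):
    """Random search over non-negative weights that sum to 1."""
    rng = np.random.default_rng(seed)
    num_models = preds.shape[0]
    # Include each single model so the result is never worse than the best one.
    candidates = list(np.eye(num_models))
    candidates += [rng.dirichlet(np.ones(num_models)) for _ in range(num_trials)]
    losses = [log_loss(y_true, weighted_average(preds, w)) for w in candidates]
    best = int(np.argmin(losses))
    return candidates[best], losses[best]


# Synthetic validation labels and three fake "models" of decreasing quality.
rng = np.random.default_rng(1)
num_samples, num_classes = 200, 3
labels = rng.integers(0, num_classes, size=num_samples)
y_val = np.eye(num_classes)[labels]


def noisy_model(noise):
    """Softmax over noisy logits that favor the true class."""
    logits = 3.0 * y_val + rng.normal(scale=noise, size=y_val.shape)
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


preds = np.stack([noisy_model(1.0), noisy_model(2.0), noisy_model(3.0)])
weights, loss = search_ensemble_weights(preds, y_val)
print(weights, loss)
```

The weights are tuned on validation data here, so the resulting ensemble should be scored on a separate test set.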

You should ensemble models that are as good as possible whilst being as different as possible.

Example: Ensemble gradient boosted decision trees and neural network based models!