
## The Vanishing/Exploding Gradients Problems

### Glorot and He Initialization

- Glorot: no activation, tanh, logistic, softmax
- He: ReLU and variants
- LeCun: SELU
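
The pairing above comes down to how each scheme scales the weight variance with the layer size. A minimal pure-Python sketch, assuming the usual normal-distribution variants (Glorot: variance 1/fan_avg, He: 2/fan_in, LeCun: 1/fan_in). `init_stddev` is a hypothetical helper for illustration, not a Keras function:

```python
import math

def init_stddev(scheme, fan_in, fan_out):
    """Target standard deviation of a zero-mean normal initializer (illustrative)."""
    fan_avg = (fan_in + fan_out) / 2
    variances = {
        "glorot": 1.0 / fan_avg,  # no activation, tanh, logistic, softmax
        "he": 2.0 / fan_in,       # ReLU and variants
        "lecun": 1.0 / fan_in,    # SELU
    }
    return math.sqrt(variances[scheme])

for scheme in ("glorot", "he", "lecun"):
    print(scheme, init_stddev(scheme, fan_in=300, fan_out=100))
```

In Keras the scheme is normally selected by name, e.g. `kernel_initializer='he_normal'` or `kernel_initializer='lecun_normal'`.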

### Nonsaturating Activation Functions

### Batch Normalization

### Gradient Clipping
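
A rough pure-Python sketch of the two usual clipping variants, assuming gradients are plain lists of floats. `clip_by_value` and `clip_by_norm` are hypothetical helper names, not Keras functions. Clipping by value caps each component separately and can change the gradient's direction; clipping by norm rescales the whole vector and keeps its direction:

```python
import math

def clip_by_value(grad, clip):
    """Clamp every component to the range [-clip, clip]."""
    return [max(-clip, min(clip, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    """Rescale the vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

grad = [0.9, 100.0]
print(clip_by_value(grad, 1.0))  # direction changes
print(clip_by_norm(grad, 1.0))   # direction is preserved
```

In Keras, the optimizer arguments `clipvalue` and `clipnorm` select these behaviours, e.g. `keras.optimizers.SGD(clipvalue=1.0)`.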

## Reusing Pretrained Layers

### Transfer Learning with Keras

### Unsupervised Learning

### Pretraining on an Auxiliary Task

## Fast Optimizers

- Momentum Optimization
- Nesterov Accelerated Gradient
- AdaGrad
- RMSProp
- Adam and Nadam Optimization

Variants of Adam:
- AdaMax
- Nadam
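
To illustrate the first optimizer in the list, here is a minimal sketch of the momentum update on the toy cost J(theta) = theta**2, assuming the common formulation m = beta * m - lr * grad followed by theta = theta + m. `momentum_step` and `gradient` are hypothetical names chosen for this example:

```python
def gradient(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2.0 * theta

def momentum_step(theta, m, grad, lr=0.1, beta=0.9):
    """One momentum update: the gradient changes the velocity m, not theta directly."""
    m = beta * m - lr * grad
    return theta + m, m

theta, m = 5.0, 0.0
for _ in range(200):
    theta, m = momentum_step(theta, m, gradient(theta))
print(theta)  # close to the minimum at 0
```

In Keras the corresponding optimizers are selected by class, e.g. `keras.optimizers.SGD(momentum=0.9, nesterov=True)`, `keras.optimizers.RMSprop`, `keras.optimizers.Adam`, `keras.optimizers.Adamax` and `keras.optimizers.Nadam`.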

### Learning Rate Scheduling

- Power scheduling
- Exponential scheduling
- Piecewise constant scheduling
- Performance scheduling
- 1cycle scheduling

In [1]:
from tensorflow import keras

In [2]:
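# power scheduling
# decay is the inverse of s (decay = 1/s); `decay` belongs to the older Keras optimizer API
# and newer releases may reject it
optimizer = keras.optimizers.SGD(learning_rate=0.01, decay=1e-4)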
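
With `decay=1e-4`, s = 1/decay = 10,000 steps. A small sketch of the resulting curve, assuming the power schedule has the form lr0 / (1 + step / s)**c with c = 1. `power_schedule` is a hypothetical helper, not part of Keras:

```python
def power_schedule(step, lr0=0.01, s=10_000, c=1):
    """Learning rate after `step` training steps under power scheduling."""
    return lr0 / (1 + step / s) ** c

# The rate falls to lr0/2 after s steps, lr0/3 after 2s steps, and so on.
for step in (0, 10_000, 20_000, 30_000):
    print(step, power_schedule(step))
```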

In [6]:
# exponential
def exponential_decay(lr0, s):
    def exponential_decay_fn(epoch):
        return lr0 * 0.1 ** (epoch / s)  # lr drops by a factor of 10 every s epochs
    return exponential_decay_fn

exponential_decay_fn = exponential_decay(lr0=0.01, s=20)

In [None]:
# assumes `model`, `X` and `y` are already defined and the model is compiled
lr_scheduler = keras.callbacks.LearningRateScheduler(exponential_decay_fn)
history = model.fit(X, y, callbacks=[lr_scheduler])

In [19]:
# Piecewise constant
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001
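
The hard-coded thresholds above can be generalized. A possible sketch, assuming a sorted list of epoch boundaries and one more value than there are boundaries. `piecewise_constant` is a hypothetical factory written for this note:

```python
from bisect import bisect_right

def piecewise_constant(boundaries, values):
    """Build a schedule that returns values[i] for the i-th epoch interval."""
    if len(values) != len(boundaries) + 1:
        raise ValueError("need exactly one more value than boundaries")

    def piecewise_constant_fn(epoch):
        # bisect_right counts how many boundaries are <= epoch
        return values[bisect_right(boundaries, epoch)]

    return piecewise_constant_fn

schedule = piecewise_constant([5, 15], [0.01, 0.005, 0.001])
for epoch in (0, 4, 5, 14, 15, 30):
    print(epoch, schedule(epoch))
```

The returned function takes an epoch index, so it can be passed to `keras.callbacks.LearningRateScheduler` in the same way as `exponential_decay_fn`.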

In [20]:
# performance scheduling: multiply the lr by 0.5 when val_loss has not improved for 5 epochs
lr_scheduler = keras.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5)
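
The idea behind performance scheduling can be simulated without Keras. This is a deliberately simplified sketch, not the callback's actual implementation: it ignores options such as `min_delta`, `cooldown` and `min_lr`. `reduce_on_plateau` is a hypothetical function name:

```python
def reduce_on_plateau(val_losses, lr=0.01, factor=0.5, patience=5):
    """Return the learning rate in effect after each epoch's validation loss."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor
                wait = 0
        history.append(lr)
    return history

# Two improving epochs followed by five epochs without improvement
print(reduce_on_plateau([1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]))
```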

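Use `keras.optimizers.schedules` to define a learning rate schedule.

These schedules are step-based: the learning rate is recomputed at every training step rather than once per epoch. For example, when fitting a Keras model, decay every 100000 steps with a base of 0.96:

```python
def decayed_learning_rate(step):
    # initial_learning_rate, decay_rate and decay_steps are the schedule's parameters
    return initial_learning_rate * decay_rate ** (step / decay_steps)
```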

In [26]:
initial_learning_rate = 0.1
lr_schedule = keras.optimizers.schedules.ExponentialDecay(initial_learning_rate,
                                                             decay_steps=100000,
                                                             decay_rate=0.96,
                                                             staircase=True)
optimizer = keras.optimizers.SGD(learning_rate=lr_schedule)
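
To see what `staircase=True` changes, here is a pure-Python sketch of the same formula, assuming staircase mode simply floors `step / decay_steps` so the rate changes in discrete jumps. `exponential_decay_lr` is a hypothetical stand-in, not the Keras class:

```python
import math

def exponential_decay_lr(step, initial_learning_rate=0.1, decay_steps=100_000,
                         decay_rate=0.96, staircase=True):
    """Learning rate at a given training step under exponential decay."""
    exponent = step / decay_steps
    if staircase:
        exponent = math.floor(exponent)  # constant within each block of decay_steps
    return initial_learning_rate * decay_rate ** exponent

for step in (0, 50_000, 99_999, 100_000, 250_000):
    print(step,
          exponential_decay_lr(step),
          exponential_decay_lr(step, staircase=False))
```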

## Avoiding Overfitting Through Regularization

### $\ell_1$ and $\ell_2$ Regularization

In [None]:
# The regularizer is called at each training step to compute the regularization loss,
# which is then added to the final loss.
layer = keras.layers.Dense(100, activation='relu', kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))

In [41]:
from functools import partial
RegularizedDense = partial(keras.layers.Dense, activation='relu', kernel_initializer='he_normal',
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    RegularizedDense(300),
    RegularizedDense(10, activation='softmax', kernel_initializer='glorot_uniform')
])
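
What the regularizer adds to the loss can be written out by hand. A minimal sketch, assuming the kernel is a nested list of floats and the penalties are factor * sum(w**2) for $\ell_2$ and factor * sum(|w|) for $\ell_1$. `l2_penalty` and `l1_penalty` are hypothetical helpers:

```python
def l2_penalty(kernel, factor=0.01):
    """Sum of squared weights, scaled by the regularization factor."""
    return factor * sum(w * w for row in kernel for w in row)

def l1_penalty(kernel, factor=0.01):
    """Sum of absolute weights, scaled by the regularization factor."""
    return factor * sum(abs(w) for row in kernel for w in row)

kernel = [[1.0, -2.0],
          [0.5, 0.0]]
print(l2_penalty(kernel), l1_penalty(kernel))
```

Keras also provides `keras.regularizers.l1` and `keras.regularizers.l1_l2` for the other two combinations.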

### Dropout

In [54]:
keras.layers.Dropout(0.2)

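
What the layer does during training can be sketched in pure Python, assuming the common "inverted dropout" formulation: each input is zeroed with probability `rate`, and the survivors are divided by the keep probability so the expected sum stays the same. `dropout` is a hypothetical function for illustration:

```python
import random

def dropout(inputs, rate, training, rng=None):
    """Inverted dropout on a list of floats; the identity when not training."""
    if not training or rate == 0.0:
        return list(inputs)
    rng = rng or random
    keep_prob = 1.0 - rate
    return [x / keep_prob if rng.random() < keep_prob else 0.0 for x in inputs]

rng = random.Random(42)
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.2, training=True, rng=rng))
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.2, training=False))
```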

### Monte Carlo (MC) Dropout

In [None]:
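import numpy as np

# 100 stochastic forward passes with dropout active, then average the predictions
y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
y_proba = y_probas.mean(axis=0)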

In [71]:
# Alternatively, subclass Dropout so that only the dropout layers stay active at inference.
# Passing training=True to the whole model would also switch other layers
# (e.g. BatchNormalization) into training mode.
class MCDropout(keras.layers.Dropout):
    def call(self, inputs):
        return super().call(inputs, training=True)
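
The averaging step can be illustrated on a toy model without Keras. This sketch assumes a single linear unit with inverted dropout applied to its inputs. `stochastic_predict` and `mc_dropout_predict` are hypothetical names, and the spread of the samples is only a rough uncertainty indicator:

```python
import random
import statistics

def stochastic_predict(x, weights, rate, rng):
    """One forward pass of a linear unit with dropout kept active on its inputs."""
    keep_prob = 1.0 - rate
    return sum(w * (xi / keep_prob if rng.random() < keep_prob else 0.0)
               for w, xi in zip(weights, x))

def mc_dropout_predict(x, weights, rate=0.2, n_samples=100, seed=0):
    """Average many stochastic passes; return the mean and the sample std."""
    rng = random.Random(seed)
    samples = [stochastic_predict(x, weights, rate, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.stdev(samples)

mean, std = mc_dropout_predict([1.0, 2.0, 3.0], [0.5, -0.25, 1.0])
print(mean, std)
```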

### Max-Norm Regularization

In [73]:
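# for a Dense layer: kernel shape is [inputs, units], so axis=0 constrains each unit's incoming weight vector
keras.layers.Dense(100, activation='relu', kernel_initializer='he_normal',
                   kernel_constraint=keras.constraints.max_norm(1.0, axis=0))
# for Conv2D: kernel shape is [height, width, in_channels, filters], so axis=[0, 1, 2] constrains each filter
keras.layers.Conv2D(64, kernel_size=(3,3), strides=(1,1), data_format='channels_last', activation='relu',
                    kernel_constraint=keras.constraints.max_norm(2, axis=[0, 1, 2]))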

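
The constraint itself is a rescaling applied after each training step. A pure-Python sketch for the Dense case with `axis=0`, assuming the kernel is stored as rows of inputs by columns of units. `max_norm_columns` is a hypothetical helper, and it omits the small epsilon a real implementation might add for numerical stability:

```python
import math

def max_norm_columns(kernel, max_value=1.0):
    """Rescale each column (one unit's incoming weights) to an L2 norm of at most max_value."""
    n_units = len(kernel[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in kernel)) for j in range(n_units)]
    # Columns already within the limit keep a scale of 1.0
    scales = [min(1.0, max_value / n) if n > 0 else 1.0 for n in norms]
    return [[row[j] * scales[j] for j in range(n_units)] for row in kernel]

kernel = [[3.0, 0.3],
          [4.0, 0.4]]
# The first column has norm 5 and is shrunk to norm 1; the second (norm 0.5) is untouched
print(max_norm_columns(kernel, max_value=1.0))
```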