# 11. Training Deep Neural Networks

After introducing relatively shallow nets, let's move on to deeper DNNs (layers >= 10; neurons/layer: X00, X000; connections: X0,000). 

Here are some problems we may encounter along the way, and some techniques we can try to solve them:

1. Vanishing / Exploding gradients problem making lower layers very hard to train 
                > Initialization 
2. Not enough data / too costly to label 
                > Transfer learning and unsupervised pretraining
3. Painfully slow training 
                > Optimizers to the rescue!
4. Serious risk of overfitting for models with millions of parameters, especially if there are not enough training instances or if they are too noisy 
                > Good ol' (and new) regularization techniques

### 1. Vanishing / Exploding Gradients Problems

As we know from the previous chapter, backpropagation goes from the output layer to the input layer, propagating the error gradient along the way. Once it has computed the gradient of the cost function with respect to each parameter of the network, Gradient Descent uses these gradients to update each parameter.

**Vanishing** gradients get smaller and smaller as the algorithm progresses down to the lower layers, leaving the lower layers' weights virtually unchanged (training never converges to a good solution).  
**Exploding** gradients get bigger and bigger, making the lower layers' weights extremely large (the algorithm diverges).

This behaviour was not clearly understood until Glorot and Bengio suggested in a 2010 [paper](http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf) that it may be due to the combination of the logistic sigmoid activation function and the weight initialization technique popular at the time (normal distribution with mean 0 and standard deviation 1).

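In short, they showed that with this activation function and this initialization scheme, **the variance of each layer's outputs is much greater than the variance of its inputs**. Going forward in the network, the variance keeps increasing until the activation function saturates in the top layers.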

#### Glorot and He Initialization

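The ideal solution would therefore be, for each layer, to have the variance of the outputs equal to the variance of the inputs ($var_{input} = var_{output}$) and the gradients keep the same variance before and after flowing through the layer in the reverse direction ($var_{forward} = var_{backwards}$). 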

It is actually not possible to guarantee both unless the layer has an equal number of inputs and neurons ($fan_{in} = fan_{out}$), but there is a workaround that works well in practice. With $fan_{avg} = \frac{fan_{in} + fan_{out}}{2}$, initialize the connection weights randomly from either:
* A normal distribution with mean $0$ and variance $\sigma^2 = \frac{1}{fan_{avg}}$
* A uniform distribution between $-r$ and $+r$, with $r = \sqrt{\frac{3}{fan_{avg}}}$

Other initializations exist, differing in the variance used:

**Initialization** | **Activation function** | **$\sigma^2$(Normal)** 
-|-|-
Glorot | None, tanh, logistic, softmax | $\frac{1}{fan_{avg}}$ 
He | ReLU and variants | $\frac{2}{fan_{in}}$
LeCun | SELU | $\frac{1}{fan_{in}}$
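
The variances in the table can be turned into numbers with a few lines of plain Python. This is only a minimal sketch: the helper names `init_params` and `sample_weights` are made up for illustration and are not part of Keras. The uniform limit uses the fact that a uniform distribution on $[-r, r]$ has variance $r^2/3$.

```python
import math
import random

def init_params(fan_in, fan_out, scheme="glorot"):
    """Return (sigma, r): the std for a normal init and the limit for a uniform init."""
    fan_avg = (fan_in + fan_out) / 2
    variance = {
        "glorot": 1 / fan_avg,  # none, tanh, logistic, softmax
        "he": 2 / fan_in,       # ReLU and variants
        "lecun": 1 / fan_in,    # SELU
    }[scheme]
    sigma = math.sqrt(variance)
    r = math.sqrt(3 * variance)  # uniform on [-r, r] has variance r^2 / 3
    return sigma, r

def sample_weights(fan_in, fan_out, scheme="glorot", seed=42):
    """Draw a fan_in x fan_out weight matrix from the normal variant of the scheme."""
    rng = random.Random(seed)
    sigma, _ = init_params(fan_in, fan_out, scheme)
    return [[rng.gauss(0.0, sigma) for _ in range(fan_out)] for _ in range(fan_in)]

sigma, r = init_params(300, 100, "he")
print(f"He init for a 300 -> 100 layer: sigma={sigma:.4f}, uniform limit r={r:.4f}")
```

In Keras itself, the scheme is normally selected by name, as in the `kernel_initializer="he_normal"` argument used later in these notes.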

#### Nonsaturating activation functions

Although the ReLU function solves some of the issues of the sigmoid function, it is far from perfect. A common issue is **dying ReLUs**: a neuron dies when its weights get tweaked in such a way that the weighted sum of its inputs is negative for all instances in the training set. It then keeps outputting 0, and Gradient Descent no longer affects it because the gradient of the ReLU function is 0 when its input is negative.

To solve the problem, we could employ a **leaky ReLU**, which doesn't allow the neurons to die since it has a slope $\alpha$ also when $z<0$.  

![LeakyReLU](images/11.Leaky_ReLU.png)

The slope $\alpha$ is generally set to 0.01, but it could also be:
* Randomized: $\alpha$ is picked randomly in a given range during training and is fixed to an average value during testing (this also seems to act as a regularizer)
* Parametric: $\alpha$ is learned during training (it becomes a parameter rather than a hyperparameter). This performs strongly on complex datasets but is prone to overfitting on small ones. 

Finally, in 2015 Clevert et al. proposed a new activation function called the **Exponential Linear Unit (ELU)**.

$ELU_{\alpha}{(z)} = 
\begin{cases}
\alpha(exp(z) - 1) & z < 0\\
z & z \ge 0
\end{cases}$

![ELU](images/11.ELU.png)

Advantages:

* Negative values when $z<0$, which allows the unit to have an average output closer to 0 and helps alleviate the vanishing gradients problem
* Non-zero gradient for $z<0$, which avoids the dead neuron issue
* If $\alpha = 1$, the function is smooth everywhere, including around $z=0$, which helps speed up Gradient Descent 

**Note**: $-\alpha$ is the value that the ELU function approaches when $z$ is a large negative number.
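
To make the two definitions concrete, here is a small scalar sketch of leaky ReLU and ELU in plain Python. The function names are illustrative only. In Keras one would normally use the built-in activations instead.

```python
import math

def leaky_relu(z, alpha=0.01):
    """Identity for z >= 0, small slope alpha for z < 0 (so the gradient never dies)."""
    return z if z >= 0 else alpha * z

def elu(z, alpha=1.0):
    """Identity for z >= 0, alpha * (exp(z) - 1) for z < 0."""
    return z if z >= 0 else alpha * (math.exp(z) - 1.0)

for z in (-50.0, -1.0, 0.0, 2.0):
    print(f"z={z:6.1f}  leaky_relu={leaky_relu(z):8.4f}  elu={elu(z):8.4f}")
```

For large negative inputs the ELU output flattens out near $-\alpha$, while the leaky ReLU keeps decreasing linearly.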

Last but not least, the **Scaled ELU (SELU)** makes the network self-normalize and often outperforms all of the above, provided certain conditions are met:

* Input features are standardized (mean 0, standard deviation 1)
* Hidden layers use LeCun normal initialization
* The network architecture is sequential (it may not work on RNNs, for instance)
* All layers are dense (although it may work with CNNs as well)

**Tip**: Generally, SELU > ELU > leaky ReLU (and its variants) > ReLU > tanh > logistic BUT most used (and optimized) is still **ReLU**.

#### Batch Normalization

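Although He initialization combined with ELU (or any ReLU variant) can reduce the vanishing/exploding gradients problems at the beginning of training, it does not guarantee that they won't come back later. 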

This problem can be addressed with **Batch Normalization**. The technique consists of adding an operation in the model just before or after the activation function of each hidden layer. The operation zero-centers and normalizes each input, then scales and shifts the result using two new parameter vectors per layer, which lets the model learn the optimal _scale_ and _mean_ of each of the layer’s inputs.

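In many cases, if a BN layer is added as the very first layer of the network, there is no need to standardize the training set: the BN layer will do it (approximately, since it only looks at one batch at a time). 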

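Step by step, for each mini-batch $B$ (in non-mathematical notation):

1. $\mu_B$ = vector of input means, evaluated over the mini-batch
2. $\sigma_B$ = vector of input standard deviations, also evaluated over the mini-batch
3. $\hat{x}^{(i)}$ = vector of zero-centered and normalized inputs for instance $i$
4. $z^{(i)} = \gamma \otimes \hat{x}^{(i)} + \beta$ = output of the BN operation (a rescaled and shifted version of the inputs)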

To sum up, the model is learning:

* $\gamma$ = output scale vector (through backprop)
* $\beta$ = output offset vector (through backprop)
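* $\mu$ = final input mean vector (estimated through an exponential moving avg during training, used instead of $\mu_B$ after training)
* $\sigma$ = final input stdev vector (estimated through an exponential moving avg during training, used instead of $\sigma_B$ after training)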

**Note**: generally speaking, BN makes each epoch slower but requires fewer epochs to converge. All in all, it generally saves time.  

An additional hyperparameter we may have to pay attention to is **momentum**, used in the calculation of running average as such:

$\hat{v}_1 = \hat{v}_0 \times momentum + v \times (1 - momentum)$

Generally a good value is very close to 1.  
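
The steps and the running-average formula can be sketched for a single feature in plain Python. This is a simplified illustration, not the Keras implementation: the function names, the small `eps` smoothing term added inside the square root, and the choice to track a running variance rather than a running standard deviation are all assumptions made for the example.

```python
import math

def batch_norm_train(batch, gamma, beta, running_mean, running_var,
                     momentum=0.99, eps=1e-3):
    """One training-time BN step for one feature over a mini-batch (list of floats)."""
    n = len(batch)
    mu = sum(batch) / n                                        # step 1: batch mean
    var = sum((x - mu) ** 2 for x in batch) / n                # step 2: batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in batch]   # step 3: normalize
    z = [gamma * xh + beta for xh in x_hat]                    # step 4: scale and shift
    # Exponential moving averages, following v_hat <- v_hat * momentum + v * (1 - momentum)
    running_mean = running_mean * momentum + mu * (1 - momentum)
    running_var = running_var * momentum + var * (1 - momentum)
    return z, running_mean, running_var

def batch_norm_predict(x, gamma, beta, running_mean, running_var, eps=1e-3):
    """After training, the running statistics replace the batch statistics."""
    return gamma * (x - running_mean) / math.sqrt(running_var + eps) + beta

z, rm, rv = batch_norm_train([1.0, 2.0, 3.0, 4.0], gamma=1.0, beta=0.0,
                             running_mean=0.0, running_var=1.0)
print(z, rm, rv)
```

With a momentum close to 1, the running statistics move slowly, which is why they only become reliable after many mini-batches.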

#### Gradient Clipping

Another technique involves clipping the gradients so that they never exceed a certain threshold. Mostly used for RNNs, as BN is harder to apply. 
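
As a rough sketch, here are two common ways to clip a gradient vector, written in plain Python with illustrative function names. Clipping each component to a threshold may change the direction of the vector, whereas rescaling by the norm preserves it.

```python
import math

def clip_by_value(grad, threshold=1.0):
    """Clip every component of the gradient to [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grad]

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the whole gradient if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

print(clip_by_value([0.9, 100.0]))  # direction changes: the second axis no longer dominates
print(clip_by_norm([0.9, 100.0]))   # direction kept, but the first component almost vanishes
```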

### 2. Reusing Pretrained Layers

**General note**: this section will have no cell output since we are working with fictional `model_A` and `model_B`.

Generally we don't want to train a DNN from scratch. Instead, we look for an existing network that accomplishes a similar task and reuse its lower layers (**transfer learning**).
The more similar the tasks, the more layers we can reuse. To find the right number, we can start by _freezing_ all the reused layers (leaving their weights fixed), then progressively unfreeze the upper ones and see whether performance improves. 

#### Transfer Learning with Keras

Let's use our Fashion MNIST dataset. We have a model A that works well (90% accuracy) on 8 classes (all except sandals and shirts), and we want to train a new binary classifier (shirt/sandal) for which only 200 labeled images are available (model B).

Why not reuse model A? Except for the output layer, of course. 

In [None]:
import keras

model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))

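Note that `model_A` and `model_B_on_A` now share layers, so training `model_B_on_A` will also affect `model_A`. If we want to avoid that, we need to clone `model_A` first and copy its weights (since `clone_model()` does not clone the weights), then build on the clone. 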

In [None]:
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())

Since the new output layer was initialized randomly it will make large errors (at least at the beginning) so a common solution is to **freeze the reused layers** during the first epochs, giving the new layer some time to learn reasonable weights. 

In [None]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

In [None]:
model_B_on_A.compile(loss="binary_crossentropy", optimizer="sgd",
                     metrics=["accuracy"])

**Note**: always compile model after freezing or unfreezing layers.

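We can now train the model for a few epochs, then unfreeze the reused layers and continue training to fine-tune them for task B. After unfreezing the reused layers, it is usually a good idea to **reduce the learning rate**, once again to avoid damaging the reused weights: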

In [None]:
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4,
                            validation_data=(X_valid_B, y_valid_B))

In [None]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=1e-4)  # the default learning rate is 1e-2
model_B_on_A.compile(loss="binary_crossentropy", optimizer=optimizer,
                     metrics=["accuracy"])
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=16,
                            validation_data=(X_valid_B, y_valid_B))

This approach may lead to vast improvements, but are we being 'honest' here? Probably not. What we are doing is simply trying different configurations until we find one that yields a strong improvement. 

So why did we have to do this in the first place? It turns out that transfer learning does **not** work very well with small dense networks, presumably because small networks learn few patterns, and dense networks learn very specific patterns, which are unlikely to be useful in other tasks. 

#### Unsupervised Pretraining

If we want to tackle a complex task for which we have little labeled data and cannot find a model trained on a similar task, we can still perform _unsupervised pretraining_. 

Basically, we can train an unsupervised model such as an autoencoder or a GAN on the unlabeled data, reuse its lower layers, add the output layer for our task on top, and fine-tune the final network using supervised learning. 

#### Pretraining on an Auxiliary Task

Another option is to train a first NN on a similar (auxiliary) task for which labeled data is more available and then reuse the lower layers for the actual task.  

### 3. Faster Optimizers

So far we have seen four ways to speed up NN training:

1. Good initialization strategy for the connection weights
2. Good activation function
3. Batch Normalization
4. Reusing parts of a pretrained network

In this section we will cover an additional set of tools: **faster optimizers** (than our plain vanilla Gradient Descent). More specifically:

1. Momentum Optimization 
2. Nesterov Accelerated Gradient
3. AdaGrad
4. RMSProp
5. Adam and Nadam optimization

#### 1. Momentum Optimization

The intuition behind momentum optimization (and hence the name) is physical momentum.

Regular Gradient Descent does not care about what the earlier gradients were, since it simply updates the weights using the current gradient: $\theta \leftarrow \theta - \eta \triangledown_\theta J(\theta)$. 
Momentum Optimization, on the other hand, at each iteration subtracts the local gradient (multiplied by the learning rate $\eta$) from the **momentum vector** $m$, and updates the weights by adding this momentum vector. In other words, the gradient is used for acceleration rather than for speed. 
The technique also introduces a hyperparameter $\beta$, called the **momentum**, which acts as a form of friction to keep the momentum from growing too large and to help with convergence: it is set between 0 (high friction) and 1 (no friction), typically 0.9. 

More formally:

1. $m \leftarrow \beta m - \eta \triangledown_\theta J(\theta)$
2. $\theta \leftarrow \theta + m$

Super-easy Keras implementation:

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
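
To see what the optimizer does under the hood, the two update equations can be written out by hand. The sketch below is plain Python, with an illustrative function name `momentum_step` and a toy cost function $J(\theta) = \theta^2$ chosen only for the demonstration.

```python
def momentum_step(theta, m, grad, eta=0.001, beta=0.9):
    """One momentum update: m <- beta*m - eta*grad, then theta <- theta + m."""
    m = [beta * mi - eta * gi for mi, gi in zip(m, grad)]
    theta = [ti + mi for ti, mi in zip(theta, m)]
    return theta, m

# Toy problem: J(theta) = theta^2, whose gradient is 2 * theta
theta, m = [1.0], [0.0]
for _ in range(500):
    grad = [2 * t for t in theta]
    theta, m = momentum_step(theta, m, grad, eta=0.01)

print(theta)  # ends up very close to the minimum at 0
```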

#### 2. Nesterov Accelerated Gradient

Almost always faster than vanilla momentum optimization. NAG measures the gradient of the cost function not at the local position $\theta$ but slightly ahead in the direction of the momentum, at $\theta + \beta m$:

1. $m \leftarrow \beta m - \eta \triangledown_\theta J(\theta + \beta m)$
2. $\theta \leftarrow \theta + m$

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)
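
The only difference from plain momentum is where the gradient is evaluated. A hand-written sketch in plain Python makes this visible. The function name `nag_step` and the `grad_fn` callback are assumptions made for illustration.

```python
def nag_step(theta, m, grad_fn, eta=0.001, beta=0.9):
    """One NAG update: the gradient is measured at the look-ahead point theta + beta*m."""
    lookahead = [ti + beta * mi for ti, mi in zip(theta, m)]
    grad = grad_fn(lookahead)
    m = [beta * mi - eta * gi for mi, gi in zip(m, grad)]
    theta = [ti + mi for ti, mi in zip(theta, m)]
    return theta, m

# Toy cost J(theta) = theta^2, gradient 2 * theta
grad_fn = lambda th: [2 * t for t in th]

theta, m = [1.0], [0.0]
for _ in range(500):
    theta, m = nag_step(theta, m, grad_fn, eta=0.01)
print(theta)
```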

#### 3. AdaGrad

AdaGrad corrects the direction of the algorithm earlier, pointing it more towards the global optimum. It does this by scaling down the gradient vector along the steepest dimensions:

1. $s \leftarrow s + \triangledown_\theta J(\theta) \otimes \triangledown_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta \triangledown_\theta J(\theta) \oslash \sqrt{s+\epsilon}$

**Note**: AdaGrad is often not a good choice for training deep NNs. The learning rate gets scaled down so much that the algorithm ends up stopping entirely before reaching the global optimum.
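
A minimal sketch of one AdaGrad step in plain Python (the function name is illustrative). It accumulates the squared gradients in $s$ and divides each component of the step by $\sqrt{s + \epsilon}$, so a dimension with a steep gradient and a dimension with a shallow one end up moving by a similar amount.

```python
import math

def adagrad_step(theta, s, grad, eta=0.01, eps=1e-10):
    """One AdaGrad update: accumulate squared gradients, then scale the step per dimension."""
    s = [si + gi * gi for si, gi in zip(s, grad)]
    theta = [ti - eta * gi / math.sqrt(si + eps)
             for ti, gi, si in zip(theta, grad, s)]
    return theta, s

# One steep dimension (gradient 10) and one shallow dimension (gradient 0.1)
theta, s = adagrad_step([1.0, 1.0], [0.0, 0.0], [10.0, 0.1], eta=0.1)
print(theta, s)
```

Because $s$ only ever grows, the effective learning rate keeps shrinking, which is the behaviour described in the note above.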

#### 4. RMSProp

RMSProp fixes the "slowing down too fast" issue from AdaGrad by accumulating only the gradients from the most recent iterations:

1. $s \leftarrow \beta s + (1- \beta) \triangledown_\theta J(\theta) \otimes \triangledown_\theta J(\theta)$
2. $\theta \leftarrow \theta - \eta \triangledown_\theta J(\theta) \oslash \sqrt{s+\epsilon}$

The decay rate $\beta$ is typically set to 0.9 (it is called `rho` in Keras):

In [None]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
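
Written out by hand, the RMSProp update differs from AdaGrad only in the exponential decay applied to $s$. The sketch below is plain Python with an illustrative function name. The default `eps` value is an assumption for the example.

```python
import math

def rmsprop_step(theta, s, grad, eta=0.001, beta=0.9, eps=1e-7):
    """One RMSProp update: exponentially decaying average of squared gradients."""
    s = [beta * si + (1 - beta) * gi * gi for si, gi in zip(s, grad)]
    theta = [ti - eta * gi / math.sqrt(si + eps)
             for ti, gi, si in zip(theta, grad, s)]
    return theta, s

theta, s = rmsprop_step([1.0], [0.0], [2.0], eta=0.1)
print(theta, s)
```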

#### 5. Adam and Nadam Optimization

Adam (adaptive moment estimation) combines RMSProp with Momentum Optimization: just like momentum optimization, it keeps track of an exponentially decaying average of past gradients; and
just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients:

1. $m \leftarrow \beta_1 m - (1-\beta_1) \triangledown_\theta J(\theta)$
2. $s \leftarrow \beta_2 s + (1- \beta_2) \triangledown_\theta J(\theta) \otimes \triangledown_\theta J(\theta)$
3. $\hat{m} \leftarrow \frac{m}{1-\beta_1^T}$
4. $\hat{s} \leftarrow \frac{s}{1-\beta_2^T}$
5. $\theta \leftarrow \theta + \eta \hat{m} \oslash \sqrt{\hat{s}+\epsilon}$

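**Note 1**: $T$ is the iteration number (starting at 1). Since $m$ and $s$ are initialized at 0, they are biased toward 0 at the beginning of training. Steps 3 and 4 help boost them.   
**Note 2**: usually $\beta_1$ = 0.9 | $\beta_2$ = 0.999 | $\epsilon$ = $10^{-7}$ | $\eta$ = 0.001 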

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
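
The five Adam steps can be sketched in plain Python as follows. The function name `adam_step` is illustrative, and `t` is the iteration number starting at 1. The sketch keeps the sign convention of the equations above, where $m$ accumulates the negative gradient and is then added to $\theta$.

```python
import math

def adam_step(theta, m, s, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update following the five steps above (t starts at 1)."""
    m = [beta1 * mi - (1 - beta1) * gi for mi, gi in zip(m, grad)]       # step 1
    s = [beta2 * si + (1 - beta2) * gi * gi for si, gi in zip(s, grad)]  # step 2
    m_hat = [mi / (1 - beta1 ** t) for mi in m]                          # step 3
    s_hat = [si / (1 - beta2 ** t) for si in s]                          # step 4
    theta = [ti + eta * mh / math.sqrt(sh + eps)                         # step 5
             for ti, mh, sh in zip(theta, m_hat, s_hat)]
    return theta, m, s

theta, m, s = adam_step([1.0], [0.0], [0.0], [2.0], t=1)
print(theta)  # the first step has a size of about eta, whatever the gradient scale
```

Thanks to the bias correction in steps 3 and 4, the very first update already has a magnitude close to $\eta$ instead of being shrunk by the zero initialization of $m$ and $s$.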

Another variant is Nadam (Adam + the Nesterov trick), which often converges slightly faster than Adam. 

#### Notes on optimization

1. Adaptive optimization methods (including RMSProp, Adam, and Nadam optimization) are often great, but **may generalize poorly**. If model performance is not satisfactory, turn to NAG.
2. Everything discussed above is based on **first-order partial derivatives** (Jacobians). In the literature we may also find algorithms based on **second-order partial derivatives** (Hessians), but given that there are $n^2$ Hessians per output ($n$ = number of parameters), they are far less practical for DNNs with tens of thousands of parameters.
3. Everything above will lead to dense models. If you want to work with sparse models, you can:
    * Get rid of tiny weights
    * Apply strong $l_1$ regularization as it pushes the optimizer to zero out as many weights as it can

#### Optimizers comparison table

![Optimizers](images/11.Optimizers.png)

### Learning rate scheduling

**1. Power scheduling**  
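Set the learning rate to a function of the iteration number $t$: $\displaystyle \eta(t) = \frac{\eta_0}{(1+\frac{t}{s})^c}$, where the initial learning rate $\eta_0$, the power $c$ and the number of steps $s$ are hyperparameters.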

**2. Exponential scheduling**  
Set the learning rate to $\displaystyle \eta(t) = \eta_0 \cdot 0.1^{t/s}$, so that it drops by a factor of 10 every $s$ steps.

**3. Piecewise constant scheduling**  
Constant learning rate for a number of epochs. The dirty secret is then: what is the right sequence and for how long?

**4. Performance scheduling**    
Measure validation error every $N$ steps and reduce learning rate by $\gamma$ factor every time the error stops dropping.

**5. 1cycle scheduling**  
1cycle starts at an initial learning rate $\eta_0$ and increases it linearly up to $\eta_1$ around halfway through training. It then decreases the learning rate linearly back down to $\eta_0$ during the second half of training.  
$\eta_1$ = maximum learning rate, chosen in the same way as the optimal learning rate  
$\eta_0$ = roughly 10 times lower than $\eta_1$
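
Most of these schedules are simple functions of the step or epoch number, so they can be sketched in a few lines of plain Python. Function names, default values and the piecewise boundaries are all illustrative assumptions. Performance scheduling is left out because it depends on the validation error rather than on the step number, and the 1cycle sketch only covers the linear rise from `eta0` to `eta1` and the linear descent back.

```python
def power_schedule(t, eta0=0.01, s=20, c=1):
    """eta(t) = eta0 / (1 + t/s)^c"""
    return eta0 / (1 + t / s) ** c

def exponential_schedule(t, eta0=0.01, s=20):
    """eta(t) = eta0 * 0.1^(t/s): divided by 10 every s steps."""
    return eta0 * 0.1 ** (t / s)

def piecewise_constant_schedule(epoch, boundaries=(5, 15),
                                values=(0.01, 0.005, 0.001)):
    """Constant rate per interval: values[i] until boundaries[i], then the last value."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]

def one_cycle_schedule(t, total, eta0=0.001, eta1=0.01):
    """Linear rise from eta0 to eta1 over the first half, linear descent over the second."""
    half = total / 2
    if t <= half:
        return eta0 + (eta1 - eta0) * t / half
    return eta1 - (eta1 - eta0) * (t - half) / half

for t in (0, 20, 40):
    print(t, power_schedule(t), exponential_schedule(t))
```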

### 4. Avoiding Overfitting Through Regularization

_"With many parameters come great regularization"_ or something like that.  

#### $\ell_1$ and $\ell_2$ Regularization

Implementation using a regularization factor of 0.01:

In [None]:
layer = keras.layers.Dense(100, activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

`keras.regularizers.l1()` and `keras.regularizers.l1_l2()` are also available. 
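
To get a feel for what the regularization factor does, the penalty terms can be computed by hand. The sketch below is plain Python with illustrative function names. It assumes the penalty is the factor times the sum of absolute values ($\ell_1$) or of squares ($\ell_2$) of the layer's weights, added to the training loss.

```python
def l1_penalty(weights, factor=0.01):
    """factor * sum of absolute weight values (pushes weights to exactly 0)."""
    return factor * sum(abs(w) for row in weights for w in row)

def l2_penalty(weights, factor=0.01):
    """factor * sum of squared weight values (keeps weights small)."""
    return factor * sum(w * w for row in weights for w in row)

W = [[1.0, -2.0], [0.5, 0.0]]  # hypothetical 2x2 kernel
print(l1_penalty(W), l2_penalty(W))
```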

To avoid rewriting regularizers, activation function and initialization strategy for all hidden layers, we can use `functools.partial()`: 

In [2]:
from functools import partial

import keras

RegularizedDense = partial(keras.layers.Dense,
                           activation="elu",
                           kernel_initializer="he_normal",
                           kernel_regularizer=keras.regularizers.l2(0.01))

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    RegularizedDense(300),
    RegularizedDense(100),
    RegularizedDense(10, activation="softmax",
                    kernel_initializer="glorot_uniform")
])


#### Dropout

One of the most popular regularization techniques for DL. It is extremely effective (typically a 1-2% accuracy boost) and fairly simple: at every training step, every neuron (including the input neurons, but excluding the output neurons) has a probability $p$ of being temporarily dropped. The _dropout rate_ $p$ is typically set between 10% and 50%: closer to 20-30% for RNNs and closer to 40-50% for CNNs. 

To a certain extent, we can see this as training an ensemble of smaller NNs, each trained on just one training instance, since a different network is sampled at every training step (the networks share many of their weights, so they are very similar, but they are nonetheless all different).

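**Note 1**: In practice we usually apply dropout only to the top one to three layers (excluding the output layer).  
**Note 2**: We need to multiply each input connection weight by the keep probability $(1 - p)$ after training. Alternatively, we can divide each neuron's output by the keep probability during training.  
**Note 3**: Evaluate the training loss without dropout (e.g., after training), since a training loss computed with dropout is not directly comparable to a validation loss computed without it.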
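
The mechanics can be sketched in plain Python. This illustration (the function name and the `rng` argument are assumptions) uses the second option from Note 2: surviving activations are divided by the keep probability during training, so nothing needs to be rescaled afterwards.

```python
import random

def dropout(activations, rate=0.2, training=True, rng=random):
    """Drop each activation with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(activations)  # dropout is inactive outside of training
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))        # some zeros, the rest doubled
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, training=False))  # unchanged
```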

Keras implementation:

In [3]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu",
                       kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu",
                       kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])


Generally speaking:

* Model overfitting > Increase dropout rate
* Model underfitting > Decrease dropout rate