**Chapter 11 – Training Deep Neural Networks**

In [22]:
# Python ≥3.12 is required
import sys
assert sys.version_info >= (3, 12)

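# Scikit-Learn ≥0.20 is required
# (compare parsed versions: a plain string comparison would wrongly accept "0.9")
import sklearn
from packaging import version
assert version.parse(sklearn.__version__) >= version.parse("0.20")

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
assert version.parse(tf.__version__) >= version.parse("2.0")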

print(f"Tensorflow version: {tf.__version__}")
# print(f"Keras Version: {tf.keras.__version__}")
print("GPU is", "available" if tf.config.list_physical_devices('GPU') else "NOT AVAILABLE")

%load_ext tensorboard

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

Tensorflow version: 2.20.0
GPU is NOT AVAILABLE
The tensorboard extension is already loaded. To reload it, use:
  %reload_ext tensorboard


# Problems when training (large) neural networks

- Sometimes you need to tackle a complex problem, such as detecting hundreds of types of objects in high-resolution images.
- You may then need to train a much deeper DNN
  - with 10 layers or many more
  - each containing hundreds of neurons
  - linked by hundreds of thousands of connections
- When training such a network, you may run into several problems:
  - The _vanishing gradients_ problem or the related _exploding gradients_ problem: the gradients grow smaller and smaller, or larger and larger, as they flow backward through the DNN during training. Both of these problems make lower layers very hard to train.
  - You might **not have enough training data** for such a large network, or it might be too costly to label.
  - Training may be extremely **slow**.
  - A model with millions of parameters would severely risk **overfitting** the training set, especially if there are not enough training instances or if they are too noisy.
- In this chapter we will go through each of these problems and present techniques to solve them.

# Vanishing/Exploding Gradients Problem
- As we discussed in the previous chapter, the backpropagation algorithm works by going from the output layer to the input layer, propagating the error gradient along the way. 
- Once the algorithm has computed the gradient of the cost function with regard to each parameter in the network, it uses these gradients to update each parameter with a Gradient Descent step.
- Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers, so the weights of those layers are left virtually unchanged. This is the _vanishing gradients_ problem.
- In some cases the opposite happens: the gradients grow bigger and bigger until layers get extremely large weight updates and the algorithm diverges. This is the _exploding gradients_ problem.
- Research from the 2000-2015 period has shown that these problems are caused largely by the use of the sigmoid activation function, which _saturates_ when inputs become large (negative or positive): the curve flattens out, so its derivative, and with it the gradient, becomes close to 0:
![](img/sigmoid_saturation_plot.png)
- Other activation functions we have seen before also suffer from saturation for large positive and/or negative inputs:
![](img/activation_functions_plot.png)
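
To make the saturation concrete, here is a minimal NumPy sketch. The helper names `sigmoid` and `sigmoid_derivative` are illustrative and not part of the notebook. It evaluates the derivative of the sigmoid at a few inputs and then computes an upper bound on the factor that the activation derivatives alone contribute when backpropagating through 10 sigmoid layers (the weights are ignored, which is a simplification).

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """Derivative of the logistic function: s(z) * (1 - s(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
derivs = sigmoid_derivative(z)
for zi, di in zip(z, derivs):
    print(f"z = {zi:6.1f}  ->  derivative = {di:.6f}")

# The derivative never exceeds 0.25 (its value at z = 0), so ten stacked
# sigmoid activations can scale a gradient by at most 0.25 ** 10.
max_factor_10_layers = 0.25 ** 10
print(f"0.25 ** 10 = {max_factor_10_layers:.2e}")
```

The derivative is largest at an input of 0 and shrinks rapidly on both sides, which is why saturated neurons pass almost no gradient back to the lower layers.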

- To cope with this problem, several **nonsaturating activation functions** have been proposed:
  - Leaky ReLU
    ![](img/leaky_relu_plot.png)
  - ELU (Exponential Linear Unit)
    ![](img/elu_plot.png)
  - SELU (Scaled ELU)
    ![](img/selu_plot.png)
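
The three functions plotted above can be written in a few lines of NumPy. This is a sketch for illustration only: the function names are our own, the slope `alpha=0.01` for the leaky ReLU is an arbitrary example value (the Keras `LeakyReLU` layer has its own default), and the two SELU constants are the rounded values from the SELU definition.

```python
import numpy as np

# Constants from the SELU definition (rounded).
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def leaky_relu(z, alpha=0.01):
    """Identity for positive inputs, small slope alpha for negative inputs."""
    return np.maximum(alpha * z, z)

def elu(z, alpha=1.0):
    """Identity for positive inputs, alpha * (exp(z) - 1) for negative inputs."""
    # np.minimum keeps exp() away from large positive values (no overflow).
    return np.where(z > 0, z, alpha * np.expm1(np.minimum(z, 0.0)))

def selu(z):
    """Scaled ELU: a fixed scale applied to an ELU with a fixed alpha."""
    return SELU_SCALE * elu(z, SELU_ALPHA)

z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("leaky_relu:", leaky_relu(z))
print("elu:       ", elu(z).round(4))
print("selu:      ", selu(z).round(4))
```

Unlike the sigmoid, none of these functions flattens out for large positive inputs, and the leaky ReLU keeps a nonzero slope for negative inputs as well.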



## Training with nonsaturating Activation Functions

### Leaky ReLU

Let's train a neural network on Fashion MNIST using the Leaky ReLU:

In [23]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_train_full = X_train_full / 255.0
X_test = X_test / 255.0
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

- To use the leaky ReLU activation function, create a `LeakyReLU` layer and add it to your model just **after** the layer you want to apply it to.
- We also illustrate the use of a different weight initializer (`he_normal`). The mathematics behind this initializer are outside the scope of this course.

In [24]:
tf.random.set_seed(42)
np.random.seed(42)

model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(300, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(100, kernel_initializer="he_normal"),
    keras.layers.LeakyReLU(),
    keras.layers.Dense(10, activation="softmax")
])

In [25]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [26]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.6156 - loss: 1.2858 - val_accuracy: 0.7320 - val_loss: 0.8586
Epoch 2/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.7486 - loss: 0.7776 - val_accuracy: 0.7796 - val_loss: 0.6925
Epoch 3/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.7808 - loss: 0.6704 - val_accuracy: 0.8006 - val_loss: 0.6223
Epoch 4/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.7978 - loss: 0.6154 - val_accuracy: 0.8146 - val_loss: 0.5803
Epoch 5/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.8074 - loss: 0.5802 - val_accuracy: 0.8228 - val_loss: 0.5517
Epoch 6/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.8140 - loss: 0.5551 - val_accuracy: 0.8280 - val_loss: 0.5305
Epoch 7/10
...

### ELU

Implementing ELU in TensorFlow is trivial; just specify the activation function when building each layer:

In [27]:
keras.layers.Dense(10, activation="elu")

<Dense name=dense_213, built=False>

### SELU

It can be shown that: 
- If you build a neural network composed exclusively of a stack of dense layers, and if all hidden layers use the SELU activation function, then the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and standard deviation of 1 during training, which solves the vanishing/exploding gradients problem.
- As a result, the SELU activation function often significantly outperforms other activation functions for such neural nets (especially deep ones).
- There are, however, a few conditions for self-normalization to happen:
  - The input features must be standardized (mean 0 and standard deviation 1).
  - Every hidden layer’s weights must be initialized using _LeCun normal initialization_. In Keras, this means setting `kernel_initializer="lecun_normal"`.
  - The network’s architecture must be sequential (so no loops as in recurrent networks).
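
The self-normalizing behaviour can be checked numerically without training anything. The following NumPy sketch is a simplified experiment, not the Keras implementation: it pushes standardized random inputs through 50 dense layers with no bias, draws the weights from a plain normal distribution with variance `1 / fan_in` (Keras's `lecun_normal` uses a truncated normal instead), and applies SELU after each layer. The layer sizes and the seed are arbitrary choices.

```python
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU activation with the rounded constants from its definition."""
    return SELU_SCALE * np.where(z > 0, z, SELU_ALPHA * np.expm1(np.minimum(z, 0.0)))

rng = np.random.default_rng(42)
n_samples, n_units, n_layers = 1000, 100, 50

# Standardized inputs: mean 0, standard deviation 1.
a = rng.standard_normal((n_samples, n_units))

for _ in range(n_layers):
    # LeCun-style initialization: zero mean, variance 1 / fan_in.
    W = rng.standard_normal((n_units, n_units)) * np.sqrt(1.0 / n_units)
    a = selu(a @ W)

final_mean, final_std = a.mean(), a.std()
print(f"after {n_layers} layers: mean = {final_mean:.3f}, std = {final_std:.3f}")
```

Under these assumptions the activations of the last layer should still have a mean near 0 and a standard deviation near 1, even though no normalization layer was used.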

![Yann LeCun](img/lecun.png)  
  
Using SELU is easy:

In [28]:
keras.layers.Dense(10, activation="selu",
                   kernel_initializer="lecun_normal")

<Dense name=dense_214, built=False>

Let's create a neural net for Fashion MNIST with 100 hidden layers, using the SELU activation function:

In [29]:
np.random.seed(42)
tf.random.set_seed(42)

In [30]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="selu",
                             kernel_initializer="lecun_normal"))
for layer in range(99):
    model.add(keras.layers.Dense(100, activation="selu",
                                 kernel_initializer="lecun_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

In [31]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

Now let's train it. Do not forget to scale the inputs to mean 0 and standard deviation 1:

In [32]:
pixel_means = X_train.mean(axis=0, keepdims=True)
pixel_stds = X_train.std(axis=0, keepdims=True)
X_train_scaled = (X_train - pixel_means) / pixel_stds
X_valid_scaled = (X_valid - pixel_means) / pixel_stds
X_test_scaled = (X_test - pixel_means) / pixel_stds

In [33]:
history = model.fit(X_train_scaled, y_train, epochs=5,
                    validation_data=(X_valid_scaled, y_valid))

Epoch 1/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 30s 15ms/step - accuracy: 0.6210 - loss: 1.0353 - val_accuracy: 0.7370 - val_loss: 0.7414
Epoch 2/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 26s 15ms/step - accuracy: 0.7554 - loss: 0.6857 - val_accuracy: 0.7440 - val_loss: 0.6803
Epoch 3/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 26s 15ms/step - accuracy: 0.7865 - loss: 0.5924 - val_accuracy: 0.8092 - val_loss: 0.5551
Epoch 4/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 26s 15ms/step - accuracy: 0.7915 - loss: 0.5905 - val_accuracy: 0.7968 - val_loss: 0.5602
Epoch 5/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 26s 15ms/step - accuracy: 0.8057 - loss: 0.5467 - val_accuracy: 0.8126 - val_loss: 0.5314


Now look at what happens if we try to use the ReLU activation function instead:

In [34]:
np.random.seed(42)
tf.random.set_seed(42)

In [35]:
model = keras.models.Sequential()
model.add(keras.layers.Flatten(input_shape=[28, 28]))
model.add(keras.layers.Dense(300, activation="relu", kernel_initializer="he_normal"))
for layer in range(99):
    model.add(keras.layers.Dense(100, activation="relu", kernel_initializer="he_normal"))
model.add(keras.layers.Dense(10, activation="softmax"))

In [36]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [37]:
history = model.fit(X_train_scaled, y_train, epochs=5,
                    validation_data=(X_valid_scaled, y_valid))

Epoch 1/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 31s 16ms/step - accuracy: 0.2428 - loss: 1.8559 - val_accuracy: 0.3750 - val_loss: 1.5514
Epoch 2/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 26s 15ms/step - accuracy: 0.4305 - loss: 1.3620 - val_accuracy: 0.4614 - val_loss: 1.3588
Epoch 3/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 24s 14ms/step - accuracy: 0.5691 - loss: 1.0498 - val_accuracy: 0.6514 - val_loss: 0.9037
Epoch 4/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 20s 11ms/step - accuracy: 0.6443 - loss: 0.8975 - val_accuracy: 0.5820 - val_loss: 0.9765
Epoch 5/5
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 22s 13ms/step - accuracy: 0.6797 - loss: 0.8211 - val_accuracy: 0.6082 - val_loss: 1.0349


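Not great at all: after 5 epochs the validation accuracy is only about 61%, versus about 81% for the SELU network above. The ReLU network suffered from the vanishing/exploding gradients problem.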

# Batch Normalization

- To avoid the vanishing/exploding gradients problem, we can also explicitly normalize the output of a layer.
- This technique is called _batch normalization_.
- It consists of adding an operation in the model just before or after the activation function of each hidden layer.
- This operation zero-centers and normalizes each input over the current mini-batch, then scales and shifts the result using two parameter vectors per layer that are learned during training.
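
As a rough illustration of what the operation computes during training, here is a NumPy sketch of the forward pass for a single mini-batch. The function name `batch_norm_train` and the parameter names `gamma` (scale), `beta` (shift) and `eps` are our own; the sketch leaves out the moving averages that a real Batch Normalization layer keeps for use at inference time.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    """Batch Normalization forward pass for one mini-batch (training mode)."""
    mu = X.mean(axis=0)                     # per-feature mean over the batch
    var = X.var(axis=0)                     # per-feature variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)   # zero-center and normalize
    return gamma * X_hat + beta             # scale and shift

rng = np.random.default_rng(0)
X_batch = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # badly scaled inputs

out = batch_norm_train(X_batch, gamma=np.ones(4), beta=np.zeros(4))
print("mean per feature:", out.mean(axis=0).round(6))
print("std per feature: ", out.std(axis=0).round(3))
```

With `gamma` set to 1 and `beta` set to 0, each output feature has a mean of 0 and a standard deviation very close to 1, whatever the scale of the inputs.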

In [38]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(300, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.BatchNormalization(),
    keras.layers.Dense(10, activation="softmax")
])

In [39]:
model.summary()

In [40]:
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              metrics=["accuracy"])

In [41]:
history = model.fit(X_train, y_train, epochs=10,
                    validation_data=(X_valid, y_valid))

Epoch 1/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 10s 5ms/step - accuracy: 0.7192 - loss: 0.8278 - val_accuracy: 0.8202 - val_loss: 0.5456
Epoch 2/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 8s 5ms/step - accuracy: 0.8035 - loss: 0.5669 - val_accuracy: 0.8442 - val_loss: 0.4686
Epoch 3/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - accuracy: 0.8238 - loss: 0.5046 - val_accuracy: 0.8524 - val_loss: 0.4330
Epoch 4/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 8s 5ms/step - accuracy: 0.8359 - loss: 0.4686 - val_accuracy: 0.8590 - val_loss: 0.4120
Epoch 5/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 8s 4ms/step - accuracy: 0.8440 - loss: 0.4432 - val_accuracy: 0.8626 - val_loss: 0.3977
Epoch 6/10
1719/1719 ━━━━━━━━━━━━━━━━━━━━ 7s 4ms/step - accuracy: 0.8512 - loss: 0.4234 - val_accuracy: 0.8658 - val_loss: 0.3871
Epoch 7/10
...

# Reusing Pretrained Layers (Transfer Learning)
- It is generally not a good idea to train a very large DNN from scratch.
- Instead, you should always try to find an existing neural network that accomplishes a similar task to the one you are trying to tackle.
- Then reuse the lower layers of this network.
- This technique is called _transfer learning_.
  - It speeds up training considerably.
  - It requires significantly less training data.

![](img/pretrained_layers.PNG)

- The output layer of the original model should usually be replaced because it is most likely not useful at all for the new task, and it may not even have the right number of outputs for the new task.
- Similarly, the upper hidden layers of the original model are less likely to be as useful as the lower layers, since the high-level features that are most useful for the new task may differ significantly from the ones that were most useful for the original task. 
- Reusing existing layers with Keras is very simple:

In [42]:
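# Assumes a model for task A was trained and saved earlier as "my_model_A.h5".
# That file is not created in this notebook, hence the FileNotFoundError below.
model_A = keras.models.load_model("my_model_A.h5")
model_B_on_A = keras.models.Sequential(model_A.layers[:-1])  # reuse all layers except the output layer
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))  # new output layer with a single sigmoid unit

FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = 'my_model_A.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)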

Note that `model_B_on_A` and `model_A` actually share layers now, so when we train one, it will update both models. If we want to avoid that, we need to build `model_B_on_A` on top of a *clone* of `model_A`:

In [None]:
# clone_model() copies the architecture only, so the weights are copied separately
model_A_clone = keras.models.clone_model(model_A)
model_A_clone.set_weights(model_A.get_weights())
model_B_on_A = keras.models.Sequential(model_A_clone.layers[:-1])
model_B_on_A.add(keras.layers.Dense(1, activation="sigmoid"))
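
The difference between sharing layers and building on a clone can be illustrated without Keras or a saved model file. In the toy sketch below, `ToyLayer` is a hypothetical stand-in for a layer (it only owns a weight array) and a "model" is just a Python list of such layers; this is an analogy for the object-sharing behaviour, not how Keras is implemented.

```python
import copy
import numpy as np

class ToyLayer:
    """Hypothetical stand-in for a Keras layer: it just owns a weight array."""
    def __init__(self, weights):
        self.weights = np.array(weights, dtype=float)

model_a = [ToyLayer([1.0, 2.0]), ToyLayer([3.0, 4.0])]

# Reusing the layer objects directly: both "models" point at the same weights.
model_b_shared = model_a[:-1]
model_b_shared[0].weights += 10.0      # pretend this is a training update
print(model_a[0].weights)              # model A changed as well

# Building on a deep copy instead leaves the original untouched.
model_a_clone = copy.deepcopy(model_a)
model_b_cloned = model_a_clone[:-1]
model_b_cloned[0].weights += 100.0     # another pretend training update
print(model_a[0].weights)              # unchanged by the second update
```

The first update shows up in `model_a` because the layer object is shared; the second one does not, because it was applied to a copy.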

# Faster Optimizers

Training a very large deep neural network can be painfully slow. So far we have seen four ways to speed up training (and reach a better solution):
- applying a good initialization strategy for the connection weights, 
- using a good activation function
- using Batch Normalization
- reusing parts of a pretrained network 
  
Another huge speed boost comes from using a faster optimizer (for finding the minimum of the cost function) than the regular Gradient Descent optimizer. 

The most popular algorithms are: 
- Momentum optimization
- Nesterov Accelerated Gradient
- AdaGrad
- RMSProp
- Adam, Adamax and Nadam optimization

We don't go into the mathematical details of these algorithms, but keep in mind that they can be useful when fine-tuning your network.

## Momentum optimization

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
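
For intuition only, the sketch below applies the momentum update (a velocity that accumulates past gradients, then a step along that velocity) to a small quadratic function and compares it with plain Gradient Descent. The test function, the starting point, the learning rate of 0.01 and the number of steps are all arbitrary choices made for this illustration; with `momentum=0.0` the loop reduces to plain Gradient Descent.

```python
import numpy as np

def loss(w):
    """Elongated bowl: f(w) = 0.5 * (w[0]**2 + 25 * w[1]**2)."""
    return 0.5 * (w[0] ** 2 + 25.0 * w[1] ** 2)

def grad(w):
    """Gradient of the function above."""
    return np.array([1.0, 25.0]) * w

def run_sgd(momentum, learning_rate=0.01, n_steps=100):
    """Run SGD with the given momentum and return the final loss."""
    w = np.array([1.0, 1.0])
    velocity = np.zeros_like(w)
    for _ in range(n_steps):
        # The velocity keeps a decaying memory of past gradients.
        velocity = momentum * velocity - learning_rate * grad(w)
        w = w + velocity
    return loss(w)

loss_plain = run_sgd(momentum=0.0)
loss_momentum = run_sgd(momentum=0.9)
print(f"plain gradient descent: {loss_plain:.6f}")
print(f"with momentum 0.9:      {loss_momentum:.6f}")
```

On this function the momentum run ends much closer to the minimum after the same number of steps, because the velocity builds up along the shallow direction where plain Gradient Descent crawls.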

## Nesterov Accelerated Gradient

In [None]:
optimizer = keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)

## AdaGrad

In [None]:
optimizer = keras.optimizers.Adagrad(learning_rate=0.001)

## RMSProp

In [None]:
optimizer = keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)

## Adam Optimization

In [None]:
optimizer = keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
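
The hyperparameters `beta_1` and `beta_2` above control two running averages: one of the gradients and one of the squared gradients. The NumPy sketch below shows the standard Adam update with bias correction on a one-dimensional toy problem. The function name `adam_minimize`, the toy objective and the learning rate of 0.1 are assumptions made for this example, and details such as where `epsilon` is added may differ slightly from the Keras implementation.

```python
import numpy as np

def adam_minimize(grad_fn, w0, learning_rate=0.001, beta_1=0.9, beta_2=0.999,
                  epsilon=1e-7, n_steps=1000):
    """Minimize a function given its gradient, using the Adam update rule."""
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)   # running average of the gradients
    v = np.zeros_like(w)   # running average of the squared gradients
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta_1 * m + (1 - beta_1) * g
        v = beta_2 * v + (1 - beta_2) * g ** 2
        # Bias correction: m and v start at 0 and are too small early on.
        m_hat = m / (1 - beta_1 ** t)
        v_hat = v / (1 - beta_2 ** t)
        w = w - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return w

# Toy objective f(w) = w**2, whose gradient is 2 * w, starting from w = 5.
w_final = adam_minimize(lambda w: 2 * w, w0=[5.0], learning_rate=0.1, n_steps=500)
print(w_final)
```

Dividing by the square root of `v_hat` gives each parameter its own effective step size, which is what makes Adam an adaptive learning rate method.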

## Adamax Optimization

In [None]:
optimizer = keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

## Nadam Optimization

In [None]:
optimizer = keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# Avoiding Overfitting Through Regularization

## Dropout
_Dropout_ is one of the most popular regularization techniques for deep neural networks.

- It is a simple algorithm: at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability $p$ of being temporarily “dropped out”.
- A dropped neuron is entirely ignored during this training step, but it may be active during the next step.
- The hyperparameter $p$ is called the _dropout rate_, and it is typically set between 10% and 50%.
- We can understand the power of dropout by realizing that a unique neural network is generated at each training step.
  
  ![](img/dropout.png)
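
The training-time behaviour described above can be sketched in NumPy as follows. The function name `dropout_train` is our own, and the rescaling of the surviving values by `1 / (1 - rate)` (so that the expected value of each input is unchanged) is included as the commonly used "inverted dropout" variant.

```python
import numpy as np

def dropout_train(X, rate, rng):
    """Drop each value with probability `rate` and rescale the survivors."""
    keep_prob = 1.0 - rate
    mask = rng.random(X.shape) < keep_prob   # True where the value is kept
    return X * mask / keep_prob

rng = np.random.default_rng(42)
X = np.ones((1000, 100))
X_dropped = dropout_train(X, rate=0.2, rng=rng)

dropped_fraction = (X_dropped == 0).mean()
print(f"fraction dropped: {dropped_fraction:.3f}")
print(f"mean after dropout: {X_dropped.mean():.3f}")
```

A new random mask is drawn on every call, which corresponds to a different thinned network at every training step.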

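- To implement dropout using Keras, you can use the `keras.layers.Dropout` layer.
- During training, it randomly drops some inputs (setting them to 0) and divides the remaining inputs by the keep probability $1 - p$; after training, it does nothing and simply passes the inputs on to the next layer.
- If you observe that the model is overfitting, you can increase the dropout rate. Conversely, if the model underfits the training set, decrease it.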

In [None]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(300, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(100, activation="elu", kernel_initializer="he_normal"),
    keras.layers.Dropout(rate=0.2),
    keras.layers.Dense(10, activation="softmax")
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="nadam", metrics=["accuracy"])
n_epochs = 2
history = model.fit(X_train_scaled, y_train, epochs=n_epochs,
                    validation_data=(X_valid_scaled, y_valid))