### Chapter 11. Training Deep Neural Networks
## Exercises
1. Glorot and He initialization try to tackle the problem of exploding/vanishing gradients. Rather than initializing the weights completely at random, they scale the random weights so that the standard deviation of each layer's outputs roughly equals that of its inputs. Unfortunately, this mostly helps at the beginning of training; in very deep networks the problem can still come back as training progresses.
2. No, since the symmetry must be broken. If all weights in a layer share the same value, every neuron in that layer receives the same gradient, so the neurons stay identical and the layer behaves like a single neuron. The model is then unlikely to converge to a good solution.
3. Yes, this would be okay.
4.  sigmoid: output layer of binary (or multilabel) classification
    softmax: output layer of multiclass classification (mutually exclusive classes)
    no activation function: output layer of regression
5.  A momentum of 0.99999 means that almost 100% of the momentum is preserved at each step. The optimizer can then roll past local minima with ease, but it also picks up so much speed that it overshoots the global optimum and oscillates around it many times before it converges.
6.  
7.  Yes, dropout slows down training considerably because the network converges more slowly. It does not slow down inference, however, since it is only active during training. MC dropout is trained exactly like regular dropout, but it does slow down inference: dropout stays active, and each prediction requires multiple forward passes, each with a different random set of dropped neurons.
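
As a rough illustration of answer 7, here is a minimal NumPy sketch of MC dropout at inference time. The tiny network, its random untrained weights, and the helper names `forward` and `mc_dropout_predict` are all hypothetical; the only point is that one MC dropout prediction costs `n_samples` forward passes with dropout left on.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical, untrained weights for a tiny network: 4 inputs -> 8 hidden -> 3 classes
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward(X, drop_rate=0.0, rng=None):
    """One forward pass; dropout is applied to the hidden layer if drop_rate > 0."""
    h = np.maximum(0.0, X @ W1)                  # ReLU hidden layer
    if drop_rate > 0.0:
        keep = rng.random(h.shape) >= drop_rate  # random mask, redrawn on every call
        h = h * keep / (1.0 - drop_rate)         # inverted dropout scaling
    return softmax(h @ W2)

def mc_dropout_predict(X, rng, n_samples=100, drop_rate=0.2):
    """Average n_samples stochastic forward passes (dropout stays active)."""
    probas = np.stack([forward(X, drop_rate, rng) for _ in range(n_samples)])
    return probas.mean(axis=0), probas.std(axis=0)

X = rng.normal(size=(5, 4))                  # 5 made-up instances
y_plain = forward(X)                         # regular inference: 1 forward pass
y_mean, y_std = mc_dropout_predict(X, rng)   # MC dropout: 100 forward passes
print(y_mean.round(2))
print(y_std.round(2))
```

With a trained Keras model, the same idea is usually expressed by calling the model several times with `training=True` and averaging the predictions.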

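The behavior described in answer 5 can be reproduced on a toy problem. The sketch below assumes a one-dimensional cost function f(θ) = θ²/2 and a hypothetical helper `momentum_descent`; it is not tied to any particular optimizer implementation, it only applies the usual momentum update (m ← βm − η∇f, θ ← θ + m) and reports how far θ still is from the minimum at the end of the run.

```python
def momentum_descent(beta, lr=0.1, n_steps=500, theta0=1.0):
    """Momentum optimization on f(theta) = theta**2 / 2, whose gradient is theta."""
    theta, m = theta0, 0.0
    path = []
    for _ in range(n_steps):
        grad = theta                # gradient of theta**2 / 2
        m = beta * m - lr * grad    # momentum vector: keep a fraction beta of the old one
        theta = theta + m
        path.append(theta)
    return path

path_typical = momentum_descent(beta=0.9)
path_extreme = momentum_descent(beta=0.99999)

# Largest distance from the minimum (theta = 0) over the last 100 steps
tail_typical = max(abs(t) for t in path_typical[-100:])
tail_extreme = max(abs(t) for t in path_extreme[-100:])
print(f"beta=0.9:     max |theta| over last 100 steps = {tail_typical:.2e}")
print(f"beta=0.99999: max |theta| over last 100 steps = {tail_extreme:.2e}")
```

With β = 0.9 the oscillations die out quickly, whereas with β = 0.99999 almost no momentum is lost at each step, so θ keeps swinging back and forth across the minimum.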
8. Practice training a deep neural network on the CIFAR10 image dataset:
- Build a DNN with 20 hidden layers of 100 neurons each (that’s too
many, but it’s the point of this exercise). Use He initialization and
the Swish activation function.
- Using Nadam optimization and early stopping, train the network on
the CIFAR10 dataset. You can load it with
tf.keras.datasets.cifar10.load_data(). The dataset is
composed of 60,000 32 × 32–pixel color images (50,000 for
training, 10,000 for testing) with 10 classes, so you’ll need a
softmax output layer with 10 neurons. Remember to search for the
right learning rate each time you change the model’s architecture or
hyperparameters.
- Now try adding batch normalization and compare the learning
curves: is it converging faster than before? Does it produce a
better model? How does it affect training speed?
- Try replacing batch normalization with SELU, and make the
necessary adjustments to ensure the network self-normalizes (i.e.,
standardize the input features, use LeCun normal initialization,
make sure the DNN contains only a sequence of dense layers, etc.).
- Try regularizing the model with alpha dropout. Then, without
retraining your model, see if you can achieve better accuracy using
MC dropout.
- Retrain your model using 1cycle scheduling and see if it improves
training speed and model accuracy.
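
The exercise asks for He initialization in a 20-layer network, which ties back to answer 1. The following NumPy sketch gives a rough feel for why the weight scale matters at that depth. It makes several assumptions: it uses random standardized inputs instead of CIFAR10, ReLU instead of Swish (He initialization was derived for ReLU), and a hypothetical helper `final_rms` that only measures the size of the activations at initialization, with no training involved.

```python
import numpy as np

def final_rms(weight_std_fn, n_layers=20, n_units=100, n_inputs=3072, seed=42):
    """Push random inputs through a stack of ReLU layers and return the
    root-mean-square activation of the last layer."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(1000, n_inputs))  # standardized fake inputs, not CIFAR10
    for _ in range(n_layers):
        fan_in = a.shape[1]
        W = rng.normal(scale=weight_std_fn(fan_in), size=(fan_in, n_units))
        a = np.maximum(0.0, a @ W)         # ReLU
    return float(np.sqrt(np.mean(a ** 2)))

rms_he = final_rms(lambda fan_in: np.sqrt(2.0 / fan_in))  # He: std = sqrt(2 / fan_in)
rms_small = final_rms(lambda fan_in: 0.01)                # fixed small std, ignores fan_in
print(f"He initialization: final RMS activation = {rms_he:.3g}")
print(f"Fixed std of 0.01: final RMS activation = {rms_small:.3g}")
```

With the fan-in-aware scale the signal keeps a usable magnitude through all 20 layers, while the fixed small scale shrinks it toward zero layer after layer, which is the forward-pass counterpart of vanishing gradients.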


In [1]:
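import numpy as np
import tensorflow as tf
from joblib import load

# Load a local copy of CIFAR10 saved with joblib, laid out as
# ((X_train, y_train), (X_test, y_test))
cifar10 = load("C:/Users/MaxB2/Documents/Machine_Is_Learning/cifar10_data.joblib")

# Scale pixel values from [0, 255] to [0, 1]
X_train, y_train = cifar10[0][0] / 255., cifar10[0][1]
X_test, y_test = cifar10[1][0] / 255., cifar10[1][1]

# Reset the Keras state and fix the random seeds for reproducibility
tf.keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)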

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
# Note: only 5 hidden layers are used here, although the exercise asks for 20
for _ in range(5):
    model.add(tf.keras.layers.Dense(100, activation="swish",
                                    kernel_initializer="he_normal"))
model.add(tf.keras.layers.Dense(10, activation="softmax"))

optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-3)
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])

# Stop once the validation loss has not improved for 3 epochs, then restore the best weights
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=3, monitor="val_loss",
                                                     restore_best_weights=True)
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model", save_best_only=True)

history = model.fit(X_train, y_train, epochs=100,
                    batch_size=8,
                    validation_split=0.1,
                    callbacks=[early_stopping_cb,checkpoint_cb])




Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100



In [2]:
import pandas as pd
import matplotlib.pyplot as plt

# Plot the learning curves up to the epoch at which early stopping kicked in
num_epochs_used = early_stopping_cb.stopped_epoch + 1
pd.DataFrame(history.history).plot(
    figsize=(8, 5), xlim=[0, num_epochs_used], ylim=[0, 1], grid=True,
    xlabel="Epoch", style=["r--", "r--.", "b-", "b-*"])
plt.show()


In [None]:
optimizer = tf.keras.optimizers.Nadam(learning_rate=1e-3)  # lower the learning rate, then compile and fit again
model.compile(loss="sparse_categorical_crossentropy", optimizer=optimizer,
              metrics=["accuracy"])

history = model.fit(X_train, y_train, epochs=1000,
                    validation_split=0.1,
                    callbacks=[early_stopping_cb,checkpoint_cb])


num_epochs_used = early_stopping_cb.stopped_epoch + 1
pd.DataFrame(history.history).plot(figsize=(8, 5), xlim=[0, num_epochs_used], ylim=[0, 1], grid=True, xlabel="Epoch",style=["r--", "r--.", "b-", "b-*"])
plt.show()
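
The last step of the exercise, 1cycle scheduling, is not attempted above. As a starting point, here is a minimal pure-Python sketch of the learning rate such a schedule might produce at each training step. The function name `one_cycle_lr`, its parameters, and the default ratios (start at `max_lr / 10`, spend the last 10% of the steps annealing down to `start_lr / 1000`) are assumptions chosen for illustration, not a reference implementation.

```python
def one_cycle_lr(step, total_steps, max_lr, start_lr=None, last_lr=None, final_frac=0.1):
    """Learning rate at a given step of a piecewise-linear 1cycle schedule.

    Assumes total_steps is large enough for each phase to have at least one step.
    """
    start_lr = max_lr / 10 if start_lr is None else start_lr
    last_lr = start_lr / 1000 if last_lr is None else last_lr
    final_steps = int(total_steps * final_frac)
    half = (total_steps - final_steps) // 2
    if step < half:
        # Phase 1: linear warm-up from start_lr to max_lr
        return start_lr + (max_lr - start_lr) * step / half
    if step < 2 * half:
        # Phase 2: linear decay from max_lr back to start_lr
        return max_lr - (max_lr - start_lr) * (step - half) / half
    # Phase 3: final linear annealing from start_lr down to last_lr
    remaining = total_steps - 2 * half
    return start_lr - (start_lr - last_lr) * (step - 2 * half) / remaining

schedule = [one_cycle_lr(s, total_steps=1000, max_lr=1e-2) for s in range(1000)]
print(f"first={schedule[0]:.2e}, peak={max(schedule):.2e}, last={schedule[-1]:.2e}")
```

In a Keras training loop, a schedule like this would typically be applied from a custom callback that updates the optimizer's learning rate at the start of every batch; that part is left out here.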

