# Training Deep Neural Networks and How to Solve Common Problems

## The Vanishing/Exploding Gradient Problem

During backpropagation, the algorithm works backward from the output layer to the input layer, computing the gradients used to update the parameters. These gradients often get smaller and smaller as they reach the lower layers. The updates to the lower layers then become almost zero, so training never converges to a good solution. This is called <b>The Vanishing Gradient Problem</b>. The opposite can also happen: the gradients can grow bigger and bigger until the updates become huge and training diverges, which is called <b>The Exploding Gradient Problem</b>. More generally, deep neural networks suffer from unstable gradients, but there are a few ways to mitigate this issue.

### Glorot and He initialization

In [1]:
import tensorflow as tf

### Keras uses Glorot Uniform initialization by default ###

### He Init
dense = tf.keras.layers.Dense(50, activation='relu', 
                             kernel_initializer='he_normal')

### OR ###
dense = tf.keras.layers.Dense(50, activation='relu', 
                             kernel_initializer=tf.keras.initializers.HeNormal())



In [2]:
### He init with uniform distribution based on fan avg
he_avg_init = tf.keras.initializers.VarianceScaling(scale=2., mode='fan_avg', distribution='uniform')
dense = tf.keras.layers.Dense(50, activation='relu', 
                             kernel_initializer=he_avg_init)
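The initializers above differ mainly in the variance they give the weights. As a rough sketch (the helper names `init_std` and `uniform_limit` are made up for illustration, and the truncation Keras applies to its normal initializers is ignored), the standard deviations can be computed like this:

```python
import numpy as np

def init_std(fan_in, fan_out, scheme="glorot"):
    """Standard deviation of a zero-mean weight initializer (illustrative helper)."""
    if scheme == "glorot":   # variance = 1 / fan_avg
        return np.sqrt(2.0 / (fan_in + fan_out))
    if scheme == "he":       # variance = 2 / fan_in
        return np.sqrt(2.0 / fan_in)
    if scheme == "lecun":    # variance = 1 / fan_in
        return np.sqrt(1.0 / fan_in)
    raise ValueError(f"unknown scheme: {scheme}")

def uniform_limit(std):
    """Half-width r of a uniform distribution on [-r, r] that has the given std."""
    return np.sqrt(3.0) * std

# Example: a layer with 300 inputs and 100 outputs
glorot_std = init_std(300, 100, "glorot")
he_std = init_std(300, 100, "he")
lecun_std = init_std(300, 100, "lecun")
```

He initialization gives larger weights than LeCun initialization for the same layer, which compensates for ReLU zeroing out roughly half of the activations.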

### Better Activation Functions

One of the reasons unstable gradients occur is a poor choice of activation function. <br>
ReLU is a good activation function because it is quick to compute and does not saturate for positive values. However, ReLU can suffer from a problem called <b>dying ReLUs</b>: some neurons "die", meaning they only ever output 0. This happens when a neuron's weights get tweaked in such a way that its weighted sum of inputs (plus bias) is negative for every instance in the training set. ReLU outputs 0 for all negative inputs and its gradient there is also 0, so gradient descent can no longer update the neuron. In some networks as many as half of the neurons end up "dead", especially with a high learning rate. <br>
To solve this you can use a variation of ReLU called <b>Leaky ReLU</b>: <br>
$$
\text{LeakyReLU}(x) = \begin{cases}
x & \text{if } x > 0 \\
\alpha x & \text{if } x \leq 0
\end{cases}
$$ <br>

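Because the slope $ \alpha $ for negative inputs is small but nonzero, the gradient is never exactly zero. Neurons may stall for a while, but they never die permanently and always have a chance to recover.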
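To make the formula concrete, here is a minimal NumPy sketch of Leaky ReLU and its gradient. The function names are illustrative and the default `alpha=0.2` is just an assumption chosen to match the slope used in the Keras cells.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    """Leaky ReLU: x for positive inputs, alpha * x otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.2):
    """Derivative of Leaky ReLU: 1 for positive inputs, alpha otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 1.5])
y = leaky_relu(x)       # negative inputs are scaled down, not zeroed
g = leaky_relu_grad(x)  # the gradient is nonzero everywhere
```

With plain ReLU the gradient for the first three inputs would be 0. Here it is `alpha`, so those units still receive updates.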

In [3]:
leaky_relu = tf.keras.layers.LeakyReLU(negative_slope=0.2)
dense = tf.keras.layers.Dense(50, activation=leaky_relu, kernel_initializer='he_normal')

In [4]:
### You can use LeakyReLU as a layer in the network instead of an activation function. It makes no difference
model = tf.keras.models.Sequential([
    # [...]  # more layers
    tf.keras.layers.Dense(50, kernel_initializer="he_normal"),  # no activation
    tf.keras.layers.LeakyReLU(negative_slope=0.2),  # activation as a separate layer
    # [...]  # more layers
])

In [5]:
### Using PReLU instead
prelu = tf.keras.layers.PReLU()

A problem with ReLU, LeakyReLU, and PReLU, is that they are not smooth functions, meaning their derivative changes abruptly at x=0. This can cause the gradient descent to bounce around the optimum which makes it hard to converge. 

#### ELU and SELU

<b>Exponential Linear Unit (ELU)</b> is another activation function that can outperform ReLU and its variations in some cases. <br>
$$
\text{ELU}(x) = \begin{cases}
x & \text{if } x > 0 \\
\alpha (\exp(x) - 1) & \text{if } x \leq 0
\end{cases}
$$<br>
Its advantages include:
- It takes on negative values when x < 0, which allows the unit's average output to be closer to 0 and helps alleviate the vanishing gradients problem.
- It has a nonzero gradient for x < 0, which avoids dead neurons.
- When $ \alpha = 1 $ the function is smooth everywhere, including around x = 0, which helps gradient descent converge faster.

<b>Note: Always use He initialization with ELU. ELU is also slower to compute than ReLU.</b>

In [6]:
dense = tf.keras.layers.Dense(50, activation='elu', 
                             kernel_initializer='he_normal')
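The smoothness claim for $ \alpha = 1 $ can be checked numerically. The sketch below uses illustrative function names (not the Keras implementation) and compares the ELU slope just left and just right of 0:

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU: x for positive inputs, alpha * (exp(x) - 1) otherwise."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps exp() from overflowing on large positive inputs
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU: 1 for positive inputs, alpha * exp(x) otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

eps = 1e-6
# Slopes on either side of 0 with alpha = 1: they almost coincide (smooth)
left_smooth, right_smooth = float(elu_grad(-eps)), float(elu_grad(eps))
# With alpha = 0.5 the slope jumps from about 0.5 to 1 at x = 0 (a kink)
left_kink, right_kink = float(elu_grad(-eps, alpha=0.5)), float(elu_grad(eps, alpha=0.5))
```

For very negative inputs the output saturates at `-alpha`, but the gradient stays positive, which is why the unit does not die the way a ReLU can.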

<b>Scaled ELU (SELU)</b> is a scaled variant of ELU. If you build a deep neural network whose hidden layers are just a stack of Dense layers using SELU, the network will self-normalize: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradient problem. This can cause SELU to outperform other activation functions. <br>
Considerations to keep in mind about SELU:
- The input features must be standardized: mean 0 and standard deviation 1.
- Every hidden layer's weights must be initialized using LeCun normal initialization.
- The self-normalizing property is only guaranteed with plain MLPs. If you try SELU with other architectures, like recurrent neural networks or networks with skip connections (e.g., Wide & Deep nets), it will probably not outperform ELU.
- You cannot use standard regularization techniques such as l1 or l2 regularization, max-norm, batch normalization, or regular dropout, because they break self-normalization. Alpha dropout is the exception.

In [7]:
dense = tf.keras.layers.Dense(50, activation="selu",
                              kernel_initializer="lecun_normal")

SELU is not widely used because of these constraints, and it is often outperformed by other activation functions such as the following.<br>
#### GELU, Swish, and Mish

<b>GELU</b> is a smooth variant of the ReLU activation function. Its more complex, curvy shape seems to make it easier for gradient descent to fit complex patterns. However, it is more computationally expensive than the other activation functions, and the performance boost doesn't always justify the extra cost. <br>
<b> Sigmoid linear unit (SiLU) aka Swish</b> is very close to GELU. Its generalized form adds one extra hyperparameter that scales the input of the sigmoid, which can make it more effective in certain cases. <br>
<b>Mish</b> is a smooth, nonconvex, and nonmonotonic variant of ReLU. It is similar to Swish and GELU.<br>

<b style="color: blue; font-size: 1.2em;">Which Activation function should you use??</b><br>

- ReLU is a good default for simple tasks: it's often just as good as the more sophisticated activation functions, plus it's fast to compute and many libraries and hardware accelerators provide ReLU-specific optimizations.
- Swish is probably a better default for more complex tasks.
- Mish can give you slightly better results than Swish but at the cost of more computation time.
- If you care about runtime latency, then LeakyReLU or parametrized leaky ReLU might be a better option.
- For Deep MLPs, give SELU a try, but make sure to respect the constraints of it.
- If you have time and computing power, try cross validation to find the best activation functions

### Batch Normalization

Batch normalization is another great technique for addressing the vanishing/exploding gradient problem.<br> Batch Normalization (BN) adds an operation just before or after the activation function of each hidden layer. This operation zero-centers and normalizes each input, much like StandardScaler or a Normalization layer, then scales and shifts the result using two learned parameter vectors per layer (gamma and beta). To do so, it estimates the mean and standard deviation of each input over the current mini-batch. <br>
Batch normalization can reduce the vanishing gradient problem so significantly that you can even use older, saturating activation functions like tanh and maybe the sigmoid function. You can also use larger learning rates, which speeds up the training process.<br> 
A drawback of Batch Normalization is that it adds complexity to the model: each iteration takes longer because of the extra computation. However, this is usually worth it because BN can reduce the number of epochs needed to reach the same performance.<br>
The extra layers also slow down predictions. A possible solution is to fuse each BN layer with the layer before it once training is finished. This is done by libraries like TFLite.
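The training-time computation can be sketched in a few lines of NumPy. This is a simplified illustration, not the Keras layer: the function name `batch_norm_train` is made up, `eps` is assumed to be a small smoothing constant, and the moving averages used at inference time are left out.

```python
import numpy as np

def batch_norm_train(X, gamma, beta, eps=1e-3):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = X.mean(axis=0)                     # per-feature mean over the batch
    var = X.var(axis=0)                     # per-feature variance over the batch
    X_hat = (X - mu) / np.sqrt(var + eps)   # zero-centered, roughly unit variance
    return gamma * X_hat + beta, mu, var

rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(256, 4))  # one mini-batch, 4 features
gamma = np.ones(4)   # learned scale (initial value)
beta = np.zeros(4)   # learned shift (initial value)
Z, mu, var = batch_norm_train(X, gamma, beta)
```

At inference time there may be no batch to average over, so a real layer substitutes running estimates of `mu` and `var` collected during training.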

In [8]:
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28,28]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(300, activation='relu',
                          kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(100, activation='relu', 
                          kernel_initializer='he_normal'),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(10, activation='softmax')
])



In [9]:
model.summary()

In [10]:
[(var.name, var.trainable) for var in model.layers[1].variables]

[('gamma', True),
 ('beta', True),
 ('moving_mean', False),
 ('moving_variance', False)]

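The main hyperparameters for Batch Normalization include:
- momentum: used when updating the moving averages of the mean and variance. A good value is typically very close to 1 (for example 0.9, 0.99, or 0.999).
- axis: determines which axis should be normalized. It defaults to -1, the last axis.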

### Gradient Clipping

Gradient clipping is when you clip the gradients during backpropagation so they never exceed some threshold. This helps mitigate the exploding gradient problem.

In [11]:
## Just set the clipvalue or clipnorm of the optimizer
optimizer = tf.keras.optimizers.SGD(clipvalue=1.0)
model.compile(optimizer=optimizer)

This clips each component of the gradient vector to a value between -1 and 1. Clipping by value can change the direction of the gradient vector when one component is clipped much more than the others. If you want to preserve the direction, use the clipnorm parameter instead of clipvalue: it rescales the whole gradient whenever its l2 norm exceeds the threshold.
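The difference between the two modes is easiest to see on a small example. The following NumPy sketch uses illustrative helper names (these are not Keras functions) and applies both kinds of clipping to the same gradient vector:

```python
import numpy as np

def clip_by_value(grad, clip_value):
    """Clip every component independently to [-clip_value, clip_value]."""
    return np.clip(grad, -clip_value, clip_value)

def clip_by_norm(grad, clip_norm):
    """Rescale the whole vector if its l2 norm exceeds clip_norm."""
    norm = np.linalg.norm(grad)
    if norm <= clip_norm:
        return grad
    return grad * (clip_norm / norm)

grad = np.array([0.9, 100.0])
by_value = clip_by_value(grad, 1.0)  # -> [0.9, 1.0]: direction changes a lot
by_norm = clip_by_norm(grad, 1.0)    # -> roughly [0.009, 1.0]: same direction
```

Clipping by value turns a gradient that points almost entirely along the second axis into one that points roughly along the diagonal. Clipping by norm keeps the orientation but nearly eliminates the first component.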

## Reusing Pretrained Layers

It is generally not a good idea to train a very large DNN from scratch without first trying to find an existing neural network that accomplishes a task similar to yours. <b>(How to find them is discussed in Chapter 14)</b> If you find such a model, you can generally reuse most of its layers, except for the top ones. This technique is called <b><i>Transfer Learning</i></b>. It will not only speed up training but also require much less training data.<br><br>
Note: If the input shape of your task doesn't match the input shape of the model whose layers you are reusing, you will need to add a preprocessing step to resize the inputs to the size expected by the original model. <br>
If the tasks are very similar, first try reusing all the layers except for the output layer. You can freeze the weights of the reused layers so that only the layers you add are trained. Start there, then adjust the number of reused and frozen layers until you find a good solution.

### Transfer Learning with Keras

In [12]:
fashion_mnist = tf.keras.datasets.fashion_mnist.load_data()
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist
X_train, y_train = X_train_full[:-5000], y_train_full[:-5000]
X_valid, y_valid = X_train_full[-5000:], y_train_full[-5000:]
X_train, X_valid, X_test = X_train / 255, X_valid / 255, X_test / 255

In [13]:
# extra code – split Fashion MNIST into tasks A and B, then train and save
#              model A to "my_model_A".
import numpy as np

class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

pos_class_id = class_names.index("Pullover")
neg_class_id = class_names.index("T-shirt/top")

def split_dataset(X, y):
    y_for_B = (y == pos_class_id) | (y == neg_class_id)
    y_A = y[~y_for_B]
    y_B = (y[y_for_B] == pos_class_id).astype(np.float32)
    old_class_ids = list(set(range(10)) - set([neg_class_id, pos_class_id]))
    for old_class_id, new_class_id in zip(old_class_ids, range(8)):
        y_A[y_A == old_class_id] = new_class_id  # reorder class ids for A
    return ((X[~y_for_B], y_A), (X[y_for_B], y_B))

(X_train_A, y_train_A), (X_train_B, y_train_B) = split_dataset(X_train, y_train)
(X_valid_A, y_valid_A), (X_valid_B, y_valid_B) = split_dataset(X_valid, y_valid)
(X_test_A, y_test_A), (X_test_B, y_test_B) = split_dataset(X_test, y_test)
X_train_B = X_train_B[:200]
y_train_B = y_train_B[:200]

tf.random.set_seed(42)

model_A = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=[28, 28]),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dense(8, activation="softmax")
])

model_A.compile(loss="sparse_categorical_crossentropy",
                optimizer=tf.keras.optimizers.SGD(learning_rate=0.001),
                metrics=["accuracy"])
history = model_A.fit(X_train_A, y_train_A, epochs=10,
                      validation_data=(X_valid_A, y_valid_A))
model_A.save("my_model_A.keras")

Epoch 1/10
1376/1376 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.4946 - loss: 1.5166 - val_accuracy: 0.8062 - val_loss: 0.6772
Epoch 2/10
1376/1376 ━━━━━━━━━━━━━━━━━━━━ 1s 833us/step - accuracy: 0.8149 - loss: 0.6294 - val_accuracy: 0.8403 - val_loss: 0.5002
Epoch 3/10
1376/1376 ━━━━━━━━━━━━━━━━━━━━ 1s 828us/step - accuracy: 0.8515 - loss: 0.4868 - val_accuracy: 0.8561 - val_loss: 0.4286
Epoch 4/10
1376/1376 ━━━━━━━━━━━━━━━━━━━━ 1s 819us/step - accuracy: 0.8666 - loss: 0.4212 - val_accuracy: 0.8676 - val_loss: 0.3882
Epoch 5/10
1376/1376 ━━━━━━━━━━━━━━━━━━━━ 1s 835us/step - accuracy: 0.8762 - loss: 0.3821 - val_accuracy: 0.8769 - val_loss: 0.3624
Epoch 6/10
1376/1376 ━━━━━━━━━━━━━━━━━━━━ 1s 821us/step - accuracy: 0.8819 - loss: 0.3559 - val_accuracy: 0.8822 - val_loss: 0.3445
Epoch 7/10

In [14]:
## Assuming model A was trained and saved
model_A = tf.keras.models.load_model('my_model_A.keras')
model_B_on_A = tf.keras.Sequential(model_A.layers[:-1])  ## Reuse all layers except for output
model_B_on_A.add(tf.keras.layers.Dense(1, activation='sigmoid')) ## Add the output layer that matches our task

<b>Note:</b> <br>
Since Model B on A shares its layers with Model A, training Model B will also change Model A. To avoid this, clone Model A and copy its weights before reusing its layers.

In [15]:
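## Solution: clone model A first, then build model B from the clone's layers
model_A_clone = tf.keras.models.clone_model(model_A)  # clones the architecture only, not the weights
model_A_clone.set_weights(model_A.get_weights())      # copy the trained weights into the clone
model_B_on_A = tf.keras.Sequential(model_A_clone.layers[:-1])  # model_A itself is left untouched
model_B_on_A.add(tf.keras.layers.Dense(1, activation='sigmoid'))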

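The new output layer starts with random weights, so it makes large errors at the beginning of training. To prevent the resulting large gradients from wrecking the reused weights, freeze the reused layers first.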

In [16]:
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = False

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
### Must always compile after freezing or unfreezing layers
model_B_on_A.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

In [17]:
### You can train the model for a few epochs then unfreeze the weights to really optimize the new model
history = model_B_on_A.fit(X_train_B, y_train_B, epochs=4, validation_data=(X_valid_B, y_valid_B))

# Unfreeze layers
for layer in model_B_on_A.layers[:-1]:
    layer.trainable = True

optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
### Must always compile after freezing or unfreezing layers
model_B_on_A.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

history = model_B_on_A.fit(X_train_B, y_train_B, epochs=5, validation_data=(X_valid_B, y_valid_B))

Epoch 1/4
7/7 ━━━━━━━━━━━━━━━━━━━━ 2s 243ms/step - accuracy: 0.5699 - loss: 2.3539 - val_accuracy: 0.5153 - val_loss: 1.5553
Epoch 2/4
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.5699 - loss: 1.2449 - val_accuracy: 0.5282 - val_loss: 0.7212
Epoch 3/4
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.6135 - loss: 0.6234 - val_accuracy: 0.8210 - val_loss: 0.5583
Epoch 4/4
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.8057 - loss: 0.5345 - val_accuracy: 0.8457 - val_loss: 0.5470
Epoch 1/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 2s 185ms/step - accuracy: 0.8419 - loss: 0.5253 - val_accuracy: 0.8190 - val_loss: 0.5212
Epoch 2/5
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 11ms/step - accuracy: 0.8370 - loss: 0.4988 - val_accuracy: 0.8516 - val_loss: 0.4936
Epoch 3/5

<div class="alert alert-block alert-danger">
Transfer learning does not typically work well with small dense networks: small networks learn few patterns, and dense networks learn very specific patterns that are unlikely to be useful in other tasks.
</div> 
<b>Transfer learning works best with deep convolutional neural networks, which tend to learn feature detectors that are more general.</b>

### Unsupervised Pretraining

Suppose you want to tackle a complex task for which you don't have much labeled training data, but you also can't find a model that was trained on a similar task. <br>
First, you should always try to find more labeled data. However, if you can easily get unlabeled data, you can try <b>Unsupervised Pretraining</b>. This involves training an autoencoder or GAN (see Chapter 17) on the unlabeled data, reusing its lower-level layers in your DNN, then training the DNN on the small labeled dataset.

### Pretraining on an Auxiliary Task

If you do not have much labeled data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. 
<br> 
One example in computer vision: say you want to build a system to recognize faces, but you only have a few pictures of each individual. Instead of trying to gather more pictures of those specific individuals, you can collect lots of pictures of random people from the internet and train a first neural network to detect whether or not two different pictures show the same person. The lower layers of this network learn good low-level facial features, which can be reused in your actual task.
<br> 
Another example comes from NLP: you can download a corpus of millions of text documents and automatically generate labeled data from it. For instance, you can mask out some words and train a neural network to predict the missing words. Then you can reuse the lower-level layers of that network for your specific task.
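The masked-word idea can be illustrated with plain Python. This is only a toy sketch: the function name, the `[MASK]` token, and whitespace tokenization are all assumptions made for the example, whereas a real pipeline would use a proper tokenizer and a large corpus.

```python
import random

def make_masked_examples(sentence, mask_token="[MASK]", n_examples=2, seed=42):
    """Generate (masked sentence, target word) pairs from unlabeled text."""
    rng = random.Random(seed)
    words = sentence.split()
    # Pick distinct positions to mask, one per generated example
    positions = rng.sample(range(len(words)), k=min(n_examples, len(words)))
    examples = []
    for pos in positions:
        masked = words.copy()
        target = masked[pos]       # the label comes from the text itself
        masked[pos] = mask_token
        examples.append((" ".join(masked), target))
    return examples

sentence = "the cat sat on the mat"
pairs = make_masked_examples(sentence)
```

Each pair is a labeled training example obtained without any manual annotation, which is what makes this kind of auxiliary task cheap to set up.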

## Faster Optimizers

Training deep neural networks is a slow process. So far we have seen four ways to speed up training:
- Applying a good initialization strategy for the connection weights
- Using a good activation function
- Using Batch Normalization
- Reusing parts of a pretrained network (possibly built from an auxiliary task or an unsupervised learning task)

<b> Another way</b> is to use a faster optimizer than the regular gradient descent optimizer. 

### Momentum

Regular gradient descent takes one step at a time, updating the weights based only on the current gradient and the step size (learning rate).<br> <b>Momentum optimization</b> uses the gradient as an acceleration, not as a speed. At each step it subtracts the scaled gradient from a momentum vector that accumulates past gradients, then updates the weights by adding this momentum vector. Because previous gradients are remembered, the optimizer picks up speed and rolls down the hill faster.<br>
To account for the "friction" of the hill, it has a <b><i>momentum</i></b> hyperparameter whose value lies between 0 (high friction) and 1 (no friction). A typical value is 0.9.<br>
<br>
Momentum optimization reaches the minimum faster and can roll past local minima more easily. However, it can overshoot and oscillate back and forth around the minimum, which is why having a little bit of friction is good.

In [18]:
## Use the SGD with the momentum hyperparameter
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
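For intuition, the momentum update rule can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the cost function J(theta) = theta², the function names, and the step count. The update itself follows the usual form m ← βm − η∇J, θ ← θ + m.

```python
def grad(theta):
    """Gradient of the toy cost J(theta) = theta**2."""
    return 2.0 * theta

def plain_gd(theta, learning_rate=0.01, n_steps=200):
    """Regular gradient descent: each step depends only on the current gradient."""
    for _ in range(n_steps):
        theta = theta - learning_rate * grad(theta)
    return theta

def momentum_gd(theta, learning_rate=0.01, beta=0.9, n_steps=200):
    """Momentum optimization: the gradient accelerates a momentum vector m."""
    m = 0.0
    for _ in range(n_steps):
        m = beta * m - learning_rate * grad(theta)  # friction + acceleration
        theta = theta + m                           # move by the momentum
    return theta

theta_plain = plain_gd(5.0)
theta_momentum = momentum_gd(5.0)
```

With the same learning rate and number of steps, the momentum version ends up much closer to the minimum at 0, at the cost of some oscillation along the way.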

### Nesterov Accelerated Gradient

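NAG is a variation of momentum optimization that measures the gradient slightly ahead of the current position, in the direction of the momentum vector, rather than at the current position itself. This small tweak usually allows NAG to converge faster.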

In [19]:
## NAG implementation
optimizer = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9,
                                    nesterov=True) ## Extra parameter

### RMSProp

AdaGrad uses an <i>adaptive learning rate</i>: it scales down the gradient vector along the steepest dimensions, so the learning rate decays faster for steep dimensions than for dimensions with gentler slopes. However, AdaGrad risks slowing down too fast and never converging, which is where <b>RMSProp</b> comes in. It fixes this by accumulating only the gradients from the most recent iterations (using exponential decay), as opposed to all the gradients since the beginning of training. 

In [20]:
## Hyperparameter rho is the decay rate. 
optimizer = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
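A single RMSProp step can be sketched in NumPy as follows. The function name and the smoothing term `eps` are assumptions for illustration. The sketch follows the textbook update, which may differ in small details from the Keras implementation.

```python
import numpy as np

def rmsprop_step(theta, grad, s, learning_rate=0.001, rho=0.9, eps=1e-7):
    """One RMSProp update: scale each gradient component by its recent magnitude."""
    s = rho * s + (1.0 - rho) * grad**2             # decaying average of squared grads
    theta = theta - learning_rate * grad / np.sqrt(s + eps)
    return theta, s

theta = np.array([1.0, 1.0])
grad = np.array([100.0, 1.0])   # a steep dimension and a gentle one
s = np.zeros(2)
new_theta, s = rmsprop_step(theta, grad, s)
step = theta - new_theta        # size of the move along each dimension
```

Even though the raw gradient is 100 times larger along the first dimension, both components move by about the same amount. This per-dimension rescaling is what "adaptive learning rate" refers to.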

### Adam

The <b>Adam</b> optimizer combines the ideas of momentum optimization and RMSProp. Like momentum optimization, it keeps track of an exponentially decaying average of past gradients, and like RMSProp, it keeps track of an exponentially decaying average of past squared gradients. 

In [21]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
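The way Adam combines the two running averages can be sketched as below. This follows the commonly published form of the algorithm, including bias correction for the first steps. The function name and `eps` value are assumptions, and the Keras implementation may differ in minor details.

```python
import numpy as np

def adam_step(theta, grad, m, s, t, learning_rate=0.001,
              beta_1=0.9, beta_2=0.999, eps=1e-7):
    """One Adam update at step t (t starts at 1)."""
    m = beta_1 * m + (1 - beta_1) * grad       # decaying average of gradients
    s = beta_2 * s + (1 - beta_2) * grad**2    # decaying average of squared gradients
    m_hat = m / (1 - beta_1**t)                # bias correction: m and s start at 0
    s_hat = s / (1 - beta_2**t)
    theta = theta - learning_rate * m_hat / (np.sqrt(s_hat) + eps)
    return theta, m, s

theta = np.array([1.0, -2.0])
grad = np.array([10.0, -0.1])
m = np.zeros(2)
s = np.zeros(2)
new_theta, m, s = adam_step(theta, grad, m, s, t=1)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `learning_rate` regardless of how large its gradient is.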

### Adam Variations

- AdaMax: This variation scales updates using the max norm of past gradients instead of the l2 norm. It can be more stable than Adam on some datasets, but in general it doesn't perform better than Adam.
- Nadam: This is a variation of Adam that uses the Nesterov trick. It can often converge faster and outperform Adam but can also be outperformed by RMSProp.
- AdamW: This is a variation of Adam that integrates a regularization technique called weight decay, which shrinks the weights a little at every training iteration. Keeping the weights small can improve generalization.

<br>
<br>
<b>Adaptive optimization methods like RMSProp, Adam, and its variations are often great at converging fast to a good solution. However, they can lead to solutions that generalize poorly on some datasets. So if the results are not good, use Nesterov Accelerated Gradient.</b>

In [22]:
## Implementation of Adam variations (each line shows one alternative)
optimizer = tf.keras.optimizers.Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
optimizer = tf.keras.optimizers.Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
## weight_decay is a decay coefficient, not a multiplicative factor such as 0.99, so it is usually small
optimizer = tf.keras.optimizers.AdamW(learning_rate=0.001, beta_1=0.9, beta_2=0.999, weight_decay=0.004)

## Sparse Models
<br>
If you need a very fast model at runtime or need it to take up little memory, a sparse model might be a good idea. One way to get such a model is to apply strong l1 regularization during training, which pushes as many weights as possible to zero, similar to Lasso Regression.
<br><br>
<b>The TensorFlow Model Optimization Toolkit provides a pruning API to build these models more easily.</b>

## Learning Rate Scheduling

Learning rate scheduling means changing the learning rate during training, either according to a predefined schedule or based on how the model is performing.<br>
Different common LRS methods:
- Power Scheduling
- Exponential Scheduling
- Piecewise constant scheduling
- Performance scheduling
- 1cycle scheduling
- Etc
<br><b>All of these can be good options</b>
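As an example of the first item in the list, power scheduling sets the learning rate to lr0 / (1 + t / s)^c. The sketch below is written as an epoch-based schedule function. The name `power_scheduling_fn` and the default values are assumptions chosen for illustration.

```python
def power_scheduling_fn(epoch, lr0=0.01, s=20, c=1):
    """Power scheduling: lr0 / (1 + epoch / s) ** c."""
    return lr0 / (1 + epoch / s) ** c

# With c = 1 the rate is halved after s epochs, cut to a third after 2s, and so on
schedule = [power_scheduling_fn(epoch) for epoch in (0, 20, 40, 60)]
```

The rate drops quickly at first and then more and more slowly. A function with this signature could be passed to a `LearningRateScheduler` callback in the same way as the exponential schedule.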

In [23]:
## Exponential scheduling implementation
def exponential_decay_fn(epoch):
    return 0.01 * 0.1 ** (epoch / 20)

In [24]:
## This callback will update the optimizer's learning_rate attribute at the beginning of each epoch
lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)
# history = model.fit(X_train, y_train, epochs=5, callbacks=[lr_scheduler])
## history.history['learning_rate'] should then hold the rate used at each epoch (the key was 'lr' in Keras 2)

In [25]:
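## The schedule function can also take the current learning rate as a parameter.
## It multiplies the current rate by a constant factor each epoch, so the rate drops 10x every 20 epochs.
def exponential_decay_fn(epoch, lr):
    return lr * 0.1 ** (1 / 20)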

<b>When you save a model, the optimizer and its learning rate get saved with it. This means that with this new schedule function, you can load a trained model and continue training where it left off. This doesn't work if your schedule function relies on the epoch argument, because the epoch count is not saved and restarts at 0 each time you call fit().</b>

In [26]:
## Piecewise constant scheduling
def piecewise_constant_fn(epoch):
    if epoch < 5:
        return 0.01
    elif epoch < 15:
        return 0.005
    else:
        return 0.001

In [27]:
## Use the ReduceLROnPlateau callback for performance scheduling.
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=5)

## Avoiding Overfitting Through Regularization

Because deep neural networks have so many parameters, they can learn complex patterns. However, this also makes them prone to overfitting. <b>Good regularization techniques, like early stopping (the EarlyStopping callback), are needed to prevent this.</b><br>
Here are some other techniques.

### l1 and l2 Regularization

In [28]:
## Just like in simple linear models, you can apply l1 and l2 regularization to the connection weights of a deep neural network
layer = tf.keras.layers.Dense(100, activation='relu',
                             kernel_initializer='he_normal',
                             kernel_regularizer=tf.keras.regularizers.l2(0.01)) # Same for l1

### Dropout 

Dropout layers work by temporarily disabling a fraction of the neurons at each training step. Each neuron is dropped with a probability given by the dropout rate, which is typically set between 10% and 50%. After training, neurons don't get dropped anymore. Because a neuron can never rely on any particular neighbor being present, each one is forced to be as useful as possible on its own, which makes the network more robust. This technique has proven to be very effective. Dropout tends to slow down convergence, but it usually leads to a better model, especially for large models.

In [29]:
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=[28, 28]),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(100, activation="relu",
                          kernel_initializer="he_normal"),
    tf.keras.layers.Dropout(rate=0.2),
    tf.keras.layers.Dense(10, activation="softmax")
])
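What a dropout layer does to its inputs during training can be sketched in NumPy. The function name is illustrative. The sketch assumes the common "inverted dropout" formulation, in which the surviving values are scaled up by 1 / (1 - rate) so that the expected output stays the same and nothing needs to change at inference time.

```python
import numpy as np

def dropout_train(x, rate, rng):
    """Zero out each value with probability `rate`, rescale the survivors."""
    keep_mask = rng.random(x.shape) >= rate   # True for values that are kept
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
out = dropout_train(x, rate=0.2, rng=rng)
dropped_fraction = np.mean(out == 0.0)   # close to the dropout rate
```

Each output is either 0 or the input divided by 0.8, and the average output stays close to the average input.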

### Monte Carlo (MC) Dropout

In [30]:
import numpy as np

y_probs = np.stack([model(X_test, training=True) for sample in range(100)])

y_proba = y_probs.mean(axis=0)

In [31]:
model.predict(X_test[:1]).round(2)

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 147ms/step


array([[0.16, 0.11, 0.06, 0.08, 0.1 , 0.16, 0.1 , 0.06, 0.1 , 0.08]],
      dtype=float32)

In [32]:
y_proba[0].round(2)

array([0.16, 0.1 , 0.06, 0.08, 0.11, 0.16, 0.09, 0.06, 0.1 , 0.08],
      dtype=float32)

In [33]:
class MCDropout(tf.keras.layers.Dropout):
    
    def call(self, inputs, training=False):
        return super().call(inputs, training=True)

### Max-Norm Regularization

In [34]:
dense = tf.keras.layers.Dense(
    100, activation='relu', kernel_initializer='he_normal',
    kernel_constraint=tf.keras.constraints.max_norm(1.))
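The constraint used above rescales a neuron's incoming weight vector whenever its l2 norm exceeds the chosen maximum. A NumPy sketch of that rescaling is shown below. The function name is illustrative and `eps` is an assumed small constant to avoid division by zero. With `axis=0`, each column of the kernel (one column per neuron) is constrained independently.

```python
import numpy as np

def max_norm_constraint(W, max_value=1.0, axis=0, eps=1e-7):
    """Rescale weight vectors whose l2 norm exceeds max_value."""
    norms = np.sqrt(np.sum(W**2, axis=axis, keepdims=True))
    desired = np.clip(norms, 0.0, max_value)   # norms capped at max_value
    return W * (desired / (eps + norms))       # ratio is ~1 if already small enough

# Two neurons with two inputs each: column 0 has norm 5, column 1 has norm ~0.22
W = np.array([[3.0, 0.1],
              [4.0, 0.2]])
W_constrained = max_norm_constraint(W, max_value=1.0)
new_norms = np.linalg.norm(W_constrained, axis=0)
```

The first neuron's weights are shrunk to unit norm while keeping their direction, and the second neuron's weights are left essentially unchanged. Keras applies such a constraint after each training step.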

# Summary

## Default DNN Configuration
| Hyperparameter       | Default Value         |
|----------------------|-----------------------|
| Kernel Initializer | He Initialization |
| Activation Function | ReLU if shallow; Swish if deep |
| Normalization | None if shallow; batch norm if deep |
| Regularization | Early stopping; weight decay if needed |
| Optimizer | Nesterov accelerated gradients or AdamW |
| Learning rate schedule | Performance scheduling or 1cycle |

<div class="alert alert-block alert-danger">
<b>Not Hard Rules. Still try other methods and think about which ones fit best</b>
</div> 

## DNN configuration for a self-normalizing net
| Hyperparameter       | Default Value         |
|----------------------|-----------------------|
| Kernel Initializer | LeCun initialization |
| Activation Function | SELU |
| Normalization | None (self-normalizing) |
| Regularization | Alpha Dropout if needed |
| Optimizer | Nesterov accelerated gradients |
| Learning rate schedule | Performance scheduling or 1cycle |

<div class="alert alert-block alert-danger">
<b>Not Hard Rules. Still try other methods and think about which ones fit best</b>
</div> 

# Exercise 8

## Part A

In [35]:
tf.__version__

'2.18.0'

In [36]:
tf.random.set_seed(42)

model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(tf.keras.layers.Dense(100,
                                    activation="swish",
                                    kernel_initializer="he_normal"))

## Part B

In [37]:
# Output layer
model.add(tf.keras.layers.Dense(10, activation="softmax"))

In [38]:
optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-5)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

In [39]:
cifar10 = tf.keras.datasets.cifar10.load_data()
(X_train_full, y_train_full), (X_test, y_test) = cifar10

X_train = X_train_full[5000:]
y_train = y_train_full[5000:]
X_valid = X_train_full[:5000]
y_valid = y_train_full[:5000]

In [40]:
from pathlib import Path

early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     restore_best_weights=True)
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model.keras",
                                                         save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = Path() / "my_cifar10_logs" / f"run_{run_index:03d}"
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

In [41]:
model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 10s 4ms/step - accuracy: 0.1246 - loss: 17.0840 - val_accuracy: 0.1894 - val_loss: 2.2359
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.2011 - loss: 2.1769 - val_accuracy: 0.2430 - val_loss: 2.0786
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.2569 - loss: 2.0266 - val_accuracy: 0.2780 - val_loss: 1.9721
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.2960 - loss: 1.9360 - val_accuracy: 0.3004 - val_loss: 1.8966
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.3190 - loss: 1.8775 - val_accuracy: 0.3314 - val_loss: 1.8472
Epoch 6/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.3422 - loss: 1.8170 - val_accuracy: 0.3426 - val_loss: 1.8090
Epoch 7/

<keras.src.callbacks.history.History at 0x7677c23d3fe0>
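
The `1407/1407` step count in the log above can be reproduced from the data split. The short check below assumes the batch size is the Keras default of 32, since `model.fit()` was called without a `batch_size` argument; the constant names are illustrative.

```python
import math

TRAIN_IMAGES = 50_000  # size of the CIFAR-10 training set
VALID_IMAGES = 5_000   # first 5,000 images held out for validation
BATCH_SIZE = 32        # assumption: Keras default when batch_size is not given

train_size = TRAIN_IMAGES - VALID_IMAGES
steps_per_epoch = math.ceil(train_size / BATCH_SIZE)
print(train_size, steps_per_epoch)  # 45000 1407
```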

## Part C

In [50]:
## Keras Tuner to find best learning_rate
import keras_tuner as kt

def build_model(hp: kt.HyperParameters):
    ## Hyperparameter selection
    learning_rate = hp.Float('learning_rate', min_value=7.5e-4, max_value=1e-3, sampling='log')

    ## Building model
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
    for _ in range(20):
        model.add(tf.keras.layers.Dense(100, kernel_initializer="he_normal"))
        model.add(tf.keras.layers.BatchNormalization())
        model.add(tf.keras.layers.Activation('swish'))
        
    # Output layer
    model.add(tf.keras.layers.Dense(10, activation="softmax"))

    optimizer = tf.keras.optimizers.Nadam(learning_rate=learning_rate) 
    model.compile(loss="sparse_categorical_crossentropy",
                  optimizer=optimizer,
                  metrics=["accuracy"])

    return model
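
To give a feel for what `sampling='log'` implies, the sketch below generates candidate learning rates that are evenly spaced on a logarithmic scale between the same bounds. The helper `log_spaced` and the choice of five points are assumptions made for illustration; this is not necessarily the exact grid Keras Tuner builds.

```python
import math


def log_spaced(min_value, max_value, num):
    """Return `num` values evenly spaced on a log scale, endpoints included."""
    log_min, log_max = math.log(min_value), math.log(max_value)
    step = (log_max - log_min) / (num - 1)
    return [math.exp(log_min + i * step) for i in range(num)]


# Same bounds as the hp.Float call above; num=5 is an arbitrary choice.
candidates = log_spaced(7.5e-4, 1e-3, num=5)
for lr in candidates:
    print(f"{lr:.4e}")
```

On a log scale, consecutive candidates differ by a constant ratio rather than a constant difference, which spreads trials more evenly when the range spans several orders of magnitude.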

In [53]:
## Do search
grid_search = kt.GridSearch(
    build_model, objective='val_accuracy', max_trials=10, seed=1,
)

# grid_search.search(X_train, y_train, epochs=30, validation_data=(X_valid, y_valid))

Reloading Tuner from ./untitled_project/tuner0.json


Best learning rate: 5.0119e-4
<br> Best validation accuracy: 46%

In [55]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(tf.keras.layers.Dense(100, kernel_initializer="he_normal"))
    model.add(tf.keras.layers.BatchNormalization())
    model.add(tf.keras.layers.Activation('swish'))
    
# Output layer
model.add(tf.keras.layers.Dense(10, activation="softmax"))

optimizer = tf.keras.optimizers.Nadam(learning_rate=5.0119e-4)  # best learning rate from the Keras Tuner search
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

In [56]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     restore_best_weights=True,
                                                     monitor='val_accuracy')
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model.keras",
                                                         save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = Path() / "my_cifar10_logs" / f"run_{run_index:03d}_part_c"
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

In [58]:
model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 6s 4ms/step - accuracy: 0.5950 - loss: 1.1562 - val_accuracy: 0.3924 - val_loss: 1.9901
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 6s 4ms/step - accuracy: 0.6081 - loss: 1.1248 - val_accuracy: 0.3966 - val_loss: 2.0408
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.6170 - loss: 1.0934 - val_accuracy: 0.3844 - val_loss: 2.2029
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.6314 - loss: 1.0621 - val_accuracy: 0.3912 - val_loss: 2.1937
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 8s 5ms/step - accuracy: 0.6381 - loss: 1.0318 - val_accuracy: 0.3962 - val_loss: 2.0691
Epoch 6/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 7s 5ms/step - accuracy: 0.6529 - loss: 0.9931 - val_accuracy: 0.3846 - val_loss: 2.1851
Epoch 7/10

<keras.src.callbacks.history.History at 0x767713a9c710>

## Part D

In [62]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(tf.keras.layers.Dense(100,
                                    kernel_initializer="lecun_normal",
                                   activation='selu'))
    
# Output layer
model.add(tf.keras.layers.Dense(10, activation="softmax"))

optimizer = tf.keras.optimizers.SGD(learning_rate=5e-4, momentum=0.9, nesterov=True) ## Nesterov Accelerated Gradients
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])
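
For reference, the sketch below writes out the SELU activation and the standard deviation that LeCun normal initialization targets, using only the standard library. The constants are the usual published SELU values; the function names are illustrative and are not Keras APIs.

```python
import math

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled ELU: linear for positive inputs, saturating for negative ones."""
    if x > 0:
        return SELU_SCALE * x
    return SELU_SCALE * SELU_ALPHA * (math.exp(x) - 1.0)


def lecun_normal_stddev(fan_in):
    """Target standard deviation of LeCun normal initialization."""
    return math.sqrt(1.0 / fan_in)


# The Flatten layer turns each 32x32x3 image into 3072 inputs.
first_layer_std = lecun_normal_stddev(32 * 32 * 3)
hidden_layer_std = lecun_normal_stddev(100)
print(f"{first_layer_std:.4f} {hidden_layer_std:.4f}")  # 0.0180 0.1000
```

SELU's self-normalizing behaviour is usually described as relying on standardized inputs (mean 0, standard deviation 1) together with this initialization. The cells shown here pass the raw pixel arrays, so standardizing them first may be worth trying.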

In [63]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     restore_best_weights=True,
                                                     monitor='val_accuracy')
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model.keras",
                                                         save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = Path() / "my_cifar10_logs" / f"run_{run_index:03d}_part_d"
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

In [64]:
model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 8s 3ms/step - accuracy: 0.2211 - loss: 2.1249 - val_accuracy: 0.3032 - val_loss: 1.8955
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 3s 2ms/step - accuracy: 0.3251 - loss: 1.8391 - val_accuracy: 0.3626 - val_loss: 1.7634
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 2ms/step - accuracy: 0.3583 - loss: 1.7605 - val_accuracy: 0.3548 - val_loss: 1.7684
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.3826 - loss: 1.7026 - val_accuracy: 0.3874 - val_loss: 1.6920
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.3998 - loss: 1.6624 - val_accuracy: 0.3988 - val_loss: 1.6729
Epoch 6/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.4146 - loss: 1.6244 - val_accuracy: 0.4014 - val_loss: 1.6383
Epoch 7/10

<keras.src.callbacks.history.History at 0x767711061040>

## Part E

In [74]:
model = tf.keras.Sequential()
model.add(tf.keras.layers.Flatten(input_shape=[32, 32, 3]))
for _ in range(20):
    model.add(tf.keras.layers.Dropout(0.25))
    model.add(tf.keras.layers.Dense(100,
                                    kernel_initializer="lecun_normal",
                                   activation='selu'))
    
# Output layer
model.add(tf.keras.layers.Dense(10, activation="softmax"))

# optimizer = tf.keras.optimizers.SGD(learning_rate=5e-4, momentum=0.9, nesterov=True) ## Nesterov Accelerated Gradients
optimizer = tf.keras.optimizers.Nadam(learning_rate=5e-4)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=optimizer,
              metrics=["accuracy"])

In [75]:
early_stopping_cb = tf.keras.callbacks.EarlyStopping(patience=10,
                                                     restore_best_weights=True,
                                                     monitor='val_accuracy')
model_checkpoint_cb = tf.keras.callbacks.ModelCheckpoint("my_cifar10_model.keras",
                                                         save_best_only=True)
run_index = 1 # increment every time you train the model
run_logdir = Path() / "my_cifar10_logs" / f"run_{run_index:03d}_part_e"
tensorboard_cb = tf.keras.callbacks.TensorBoard(run_logdir)
callbacks = [early_stopping_cb, model_checkpoint_cb, tensorboard_cb]

In [76]:
model.fit(X_train, y_train, epochs=100,
          validation_data=(X_valid, y_valid),
          callbacks=callbacks)

Epoch 1/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 18s 7ms/step - accuracy: 0.1021 - loss: 3.2686 - val_accuracy: 0.1046 - val_loss: 2.3227
Epoch 2/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.1033 - loss: 2.3626 - val_accuracy: 0.0972 - val_loss: 2.3131
Epoch 3/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 5s 3ms/step - accuracy: 0.1038 - loss: 2.3406 - val_accuracy: 0.0970 - val_loss: 2.3144
Epoch 4/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.1030 - loss: 2.3346 - val_accuracy: 0.1064 - val_loss: 2.3248
Epoch 5/100
1407/1407 ━━━━━━━━━━━━━━━━━━━━ 4s 3ms/step - accuracy: 0.1018 - loss: 2.3341 - val_accuracy: 0.1038 - val_loss: 2.3051
Epoch 6/100
 714/1407 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.0993 - loss: 2.3289

KeyboardInterrupt: 
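
The accuracy and loss values in the interrupted run above can be compared against a chance-level baseline. The small calculation below shows the accuracy and cross-entropy loss of a 10-class classifier that assigns equal probability to every class; the variable names are illustrative.

```python
import math

NUM_CLASSES = 10  # CIFAR-10 has ten classes

# A uniform prediction puts probability 1/10 on the true class.
chance_accuracy = 1.0 / NUM_CLASSES
chance_loss = -math.log(chance_accuracy)
print(f"{chance_accuracy:.2f} {chance_loss:.4f}")  # 0.10 2.3026
```

The run above stays at roughly 0.10 accuracy with a validation loss near 2.31, which is close to this baseline and suggests the network is not learning. One possible factor, offered as a hypothesis rather than a diagnosis, is that standard `Dropout` changes the mean and variance of the activations that SELU depends on; alpha dropout is the variant designed for SELU networks.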