### Steps:
- Understand the Momentum hyperparameter
- Search the space of hyperparameters
- Saving, checkpointing and loading your models
- Apply learning rate schedulers to our existing networks
- Debugging deep learning models

### Covered topics and learning objectives
- Playing with parameters. Momentum
- Saving, checkpointing and loading your models
- Learning Rate schedulers
- Debugging deep learning models


# ConvNets Parameters


In [None]:
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import models

## Playing with parameters. Momentum

**Momentum:** You already learned about backpropagation and gradient descent, so we will not cover them here. 
Remember that some optimizers use momentum because it is a very effective optimization approach: Instead of using only the gradient of the current step to guide the search, momentum also accumulates the gradient of the past steps to determine the direction to go.
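
To make the accumulation concrete, here is a minimal sketch of gradient descent with momentum on a toy problem. The loss `f(w) = 0.5 * w**2`, the learning rate of 0.01, and the momentum coefficient of 0.9 are illustrative assumptions, and the update rule is the classical "velocity" formulation; individual optimizers may differ in the details.

```python
def grad(w):
    """Gradient of the toy loss f(w) = 0.5 * w**2 (an illustrative assumption)."""
    return w


def sgd_momentum(w0, learning_rate=0.01, momentum=0.9, steps=100):
    """Run gradient descent with momentum and return the final weight."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # The velocity accumulates a decaying sum of past gradients.
        velocity = momentum * velocity - learning_rate * grad(w)
        # The weight moves along the accumulated direction,
        # not only along the current gradient.
        w = w + velocity
    return w


w_plain = sgd_momentum(5.0, momentum=0.0)     # plain gradient descent
w_momentum = sgd_momentum(5.0, momentum=0.9)  # with momentum
print(w_plain, w_momentum)
```

With this small learning rate, the run with momentum ends much closer to the minimum at `w = 0` after the same number of steps, because consecutive gradients point in the same direction and reinforce each other.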

Additional Resources (optional reads):
- [Parameters vs. hyperparameters](https://machinelearningmastery.com/difference-between-a-parameter-and-a-hyperparameter/)
- [Momentum d2l](https://d2l.ai/chapter_optimization/momentum.html)
- [Momentum blog paperspace](https://blog.paperspace.com/intro-to-optimization-momentum-rmsprop-adam/)

Momentum is just one example of a hyperparameter that we can choose. 
Let's learn about how to optimize hyperparameters.

### Using Keras tuner for hyperparameter optimization 

Blindly trying out different architecture configurations works well enough if you just need something that works okay. In this section, we’ll go beyond "works okay", to "works great and wins machine-learning competitions", via a quick guide to a set of must-know techniques for building state-of-the-art deep-learning models.

#### Hyperparameter optimization

When building a deep-learning model, you have to make many seemingly arbitrary decisions: How many layers should you stack? How many units or filters should go in each layer? Should you use relu as activation, or a different function? Should you use BatchNormalization after a given layer? How much dropout should you use? And so on. These architecture-level parameters are called hyperparameters to distinguish them from the parameters of a model, which are trained via backpropagation.

In practice, experienced machine-learning engineers and researchers build intuition over time as to what works and what doesn’t when it comes to these choices—they develop hyperparameter-tuning skills. But there are no formal rules. If you want to get to the very limit of what can be achieved on a given task, you can’t be content with such arbitrary choices. Your initial decisions are almost always suboptimal, even if you have very good intuition. You can refine your choices by tweaking them by hand and retraining the model repeatedly—that’s what machine-learning engineers and researchers spend most of their time doing. But it shouldn’t be your job as a human to fiddle with hyperparameters all day—that is better left to a machine.

Thus you need to explore the space of possible decisions automatically, systematically, in a principled way. You need to search the architecture space and find the best-performing ones empirically. That’s what the field of automatic hyperparameter optimization is about: it’s an entire field of research, and an important one.

The process of optimizing hyperparameters typically looks like this:

1. Choose a set of hyperparameters (automatically).
1. Build the corresponding model.
1. Fit it to your training data, and measure performance on the validation data.
1. Choose the next set of hyperparameters to try (automatically).
1. Repeat.
1. Eventually, measure performance on your test data.
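
The steps above can be sketched as a simple random search. This is a minimal illustration, not how KerasTuner is implemented: `train_and_validate` is a hypothetical stand-in that returns a made-up score instead of training a real model, and the search space values are assumptions chosen for the example.

```python
import random

# Illustrative search space (an assumption for this sketch).
SEARCH_SPACE = {
    "units": [16, 32, 48, 64],
    "optimizer": ["rmsprop", "adam"],
}


def sample_hyperparameters(rng):
    """Steps 1 and 4: choose a set of hyperparameters automatically."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


def train_and_validate(hps):
    """Hypothetical stand-in for steps 2 and 3 (build, fit, validate).

    Returns a made-up validation score so that the sketch runs without
    training a real model.
    """
    score = hps["units"] / 64
    if hps["optimizer"] == "adam":
        score += 0.1
    return score


def random_search(num_trials, seed=0):
    """Step 5: repeat, keeping track of the best configuration so far."""
    rng = random.Random(seed)
    best_hps, best_score = None, float("-inf")
    for _ in range(num_trials):
        hps = sample_hyperparameters(rng)
        score = train_and_validate(hps)
        if score > best_score:
            best_hps, best_score = hps, score
    return best_hps, best_score


best_hps, best_score = random_search(num_trials=20)
print(best_hps, best_score)
```

The final step, measuring performance on the test data, would be done once with the best configuration after the search has finished.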

The key to this process is the algorithm that analyzes the relationship between validation performance and various hyperparameter values to choose the next set of hyperparameters to evaluate. Many different techniques are possible: Bayesian optimization, genetic algorithms, simple random search, and so on.

Training the weights of a model is relatively easy: you compute a loss function on a mini-batch of data and then use backpropagation to move the weights in the right direction. Updating hyperparameters, on the other hand, presents unique challenges. Consider the following:

* The hyperparameter space is typically made of discrete decisions and thus isn’t continuous or differentiable. Hence, you typically can’t do gradient descent in hyperparameter space. Instead, you must rely on gradient-free optimization techniques, which naturally are far less efficient than gradient descent.
* Computing the feedback signal of this optimization process (does this set of hyperparameters lead to a high-performing model on this task?) can be extremely expensive: it requires creating and training a new model from scratch on your dataset.
* The feedback signal may be noisy: if a training run performs 0.2% better, is that because of a better model configuration, or because you got lucky with the initial weight values?
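
One common way to reduce this noise is to train each configuration several times and average the results, which is the purpose of the `executions_per_trial` argument used later in this section. The sketch below illustrates the effect with simulated scores only; `noisy_validation_score`, the "true" accuracy of 0.99, and the noise level are made-up assumptions.

```python
import random
import statistics


def noisy_validation_score(rng, true_accuracy=0.99, noise_std=0.002):
    """Simulate one training run: the true accuracy plus random noise
    (a stand-in for the effect of random weight initialization)."""
    return rng.gauss(true_accuracy, noise_std)


def averaged_score(rng, executions):
    """Average the score over several simulated runs of the same configuration."""
    return statistics.mean(noisy_validation_score(rng) for _ in range(executions))


rng = random.Random(42)
single_runs = [averaged_score(rng, executions=1) for _ in range(1000)]
averaged_runs = [averaged_score(rng, executions=4) for _ in range(1000)]

# The spread of the averaged estimate is smaller than that of a single run.
spread_single = statistics.stdev(single_runs)
spread_averaged = statistics.stdev(averaged_runs)
print(spread_single, spread_averaged)
```

The price is compute: averaging over four executions makes every trial four times as expensive.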

Thankfully, there’s a tool that makes hyperparameter tuning simpler: KerasTuner. Let’s check it out.

The key idea that KerasTuner is built upon is to let you replace hard-coded hyperparameter values, such as units=32, with a range of possible choices, such as Int(name="units", min_value=16, max_value=64, step=16). The set of such choices in a given model is called the search space of the hyperparameter tuning process.

To specify a search space, define a model-building function. It takes a hp argument, from which you can sample hyperparameter ranges, and it returns a compiled Keras model.

In [None]:
import keras_tuner as kt  # in older releases the package was imported as `kerastuner`
from tensorflow import keras
from tensorflow.keras import layers

def build_model(hp):
    units = hp.Int(name="units", min_value=16, max_value=64, step=16)
    model = keras.Sequential([
        layers.Dense(units, activation="relu"),
        layers.Dense(10, activation="softmax")
    ])
    optimizer = hp.Choice(name="optimizer", values=["rmsprop", "adam"])
    model.compile(
        optimizer=optimizer,
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    return model

The next step is to define a "tuner". Schematically, you can think of a tuner as a for loop, which will repeatedly:

* Pick a set of hyperparameter values.
* Call the model-building function with these values to create a model.
* Train the model and record its metrics.

KerasTuner has several built-in tuners available—RandomSearch, BayesianOptimization, Hyperband. Let’s try BayesianOptimization, a tuner that attempts to make smart predictions for which new hyperparameter values are likely to perform best given the outcome of previous choices.

In [None]:
tuner = kt.BayesianOptimization(
    build_model,
    objective="val_accuracy",
    max_trials=4,
    executions_per_trial=2,
    directory="mnist_kt_test",
    overwrite=True,
)

# You can display an overview of the search space via the following
tuner.search_space_summary()

In [None]:
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape((-1, 28 * 28)).astype("float32") / 255
x_test = x_test.reshape((-1, 28 * 28)).astype("float32") / 255
x_train_full = x_train[:]
y_train_full = y_train[:]
num_val_samples = 10000
x_train, x_val = x_train[:-num_val_samples], x_train[-num_val_samples:]
y_train, y_val = y_train[:-num_val_samples], y_train[-num_val_samples:]
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_loss", patience=5),
]
tuner.search(
    x_train, y_train,
    batch_size=128,
    epochs=3,
    validation_data=(x_val, y_val),
    callbacks=callbacks,
    verbose=2,
)

In [None]:
x_train.shape

**Exercise**

Print out the best hyperparameters!

Hint: Use the `tuner` object's `get_best_hyperparameters` method. 
It returns a list of hyperparameter objects, each of which has a `values` attribute.

In [None]:
# SOLUTION
top_n = 4
best_hps = tuner.get_best_hyperparameters(top_n)
for my_hp in best_hps:
    print(my_hp.values)

**Exercise**

What are good values for the following variables?

- `epochs`
- `patience`
- `max_trials`
- `executions_per_trial`


**SOLUTION**

Note: these might not be perfect solutions, but they can help to build a basic intuition/explanation.

A neural net trained from scratch probably needs several epochs to train successfully, so `epochs >= 15` is a reasonable choice.

With 15 epochs in total, it is probably fine to set `patience = 5`

There are 8 (2 optimizers times 4 options for units) different trials possible, running 8 trials is still affordable, therefore `max_trials >= 8`.

Training on MNIST is pretty stable/robust, therefore `executions_per_trial = 1` would probably be enough.
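
The count of 8 possible trials mentioned above can be checked by enumerating the search space from `build_model` (units from 16 to 64 in steps of 16, and two optimizers). This is a plain-Python enumeration for illustration, not a KerasTuner call.

```python
from itertools import product

# Values produced by hp.Int(min_value=16, max_value=64, step=16); max_value is inclusive.
units_options = list(range(16, 64 + 1, 16))
# Values listed in hp.Choice(name="optimizer", ...).
optimizer_options = ["rmsprop", "adam"]

combinations = list(product(units_options, optimizer_options))
for units, optimizer in combinations:
    print(units, optimizer)
print("Number of distinct trials:", len(combinations))
```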

**Exercise**

In Module 1.1, we defined a network using the RMSprop optimizer. 

Go back to Module 1.1 and try to find hyperparameters (learning rate and momentum) which yield a better model.

How could we be sure that we found the best hyperparameters?

**SOLUTION**

Here is the code for the optimizer:

```python
opt = keras.optimizers.RMSprop(
    learning_rate=0.001,
    momentum=0.0,
)
model.compile(optimizer=opt,
              loss='categorical_crossentropy',
              metrics=['accuracy'])
```

In Module 1.1:
- accuracy with defaults (`learning_rate=0.001`, `momentum=0.0`): `0.992`, `0.991`
- accuracy with momentum 0.3: `0.9924`

Running kerastuner:

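```python
import keras_tuner as kt  # in older releases the package was imported as `kerastuner`
from tensorflow import keras
from tensorflow.keras import layers, models  # `models` is needed for models.Sequential below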

def build_model(hp):
    model = models.Sequential()
    model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Conv2D(64, (3, 3), activation='relu'))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(10, activation='softmax'))

    opt = keras.optimizers.RMSprop(
        learning_rate=hp.Choice(name="learning_rate", values=[0.0001, 0.001, 0.01]),  # 0.001,
        momentum=hp.Choice(name="momentum", values=[0.0, 0.3, 0.6, 0.9]),  # 0.0 # 0.3,
    )
    model.compile(optimizer=opt,
                loss='categorical_crossentropy',
                metrics=['accuracy'])
    return model

tuner = kt.BayesianOptimization(
    build_model,
    objective="val_accuracy",
    max_trials=12,
    executions_per_trial=1,
    directory="../data/mnist_kt_test",
    overwrite=True,
)

EPOCHS = 5

tuner.search(
    train_images, train_labels,
    batch_size=64,
    epochs=EPOCHS,
    validation_data=(test_images, test_labels),
    verbose=2,
)
```

After running kerastuner you get an output like this: 

```
val_accuracy: 0.9918000102043152
Best val_accuracy So Far: 0.9922999739646912
```

After running:

```
top_n = 1
best_hps = tuner.get_best_hyperparameters(top_n)
for my_hp in best_hps:
    print(my_hp.values)
```

you get a parameter combination like `{'learning_rate': 0.001, 'momentum': 0.3}`.

This means that `learning_rate=0.001` and `momentum=0.3` are a good hyperparameter choice. 
However, the result is not much different from what we got in Module 1.1.


**Exercise** 

Which hyperparameters can we tune in the ConvNets we have used so far in `Module 1`, `Module 2`, and `Module 3`?

**SOLUTION:**

- learning rate
- momentum
- Neural architecture
    - number of layers
    - number of filter maps
    - number of units per layer
- loss function
- optimizer
- expected input image size
- how strongly to apply data augmentations
- when transfer-learning and/or fine-tuning: how many layers to freeze/fine-tune
- which learning rate scheduler to use
- strength (a.k.a. probability) of dropout

In principle, we could also tune the following, but research already provides clear rules for choosing them, so a hyperparameter search is often unnecessary:
- kernel size
- batch size
- which data augmentations to use

## Saving, checkpointing and loading your models


**Exercise**

Do the tutorial: https://www.tensorflow.org/tutorials/keras/save_and_load. 

Then apply the learnings to your own model from Module 1:
- Save checkpoints during training
- Load the best-performing checkpoint from disk. 
- Evaluate your loaded model and confirm that the performance is the same as during training.

## Learning Rate Schedulers

Go to https://d2l.ai/chapter_optimization/lr-scheduler.html
and read this chapter on learning rate schedulers. 

**Exercise**: Change the cats-vs-dogs code so that it uses learning rate schedulers. 
Which scheduler works best?
* Use the model from the end of Module 1.2, a convnet with data augmentation.
* Assume a limited budget of 20 epochs per try.

Additional Resources (reading is optional):
- https://keras.io/api/callbacks/learning_rate_scheduler/
- https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1


In [None]:
# SOLUTION LR scheduler
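import math

import tensorflow as tf
from d2l import tensorflow as d2l  # plotting helper used below; requires the `d2l` package
from tensorflow.keras.callbacks import LearningRateScheduler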

EPOCHS = 20
LR = 0.001

def get_checkpoint_callback():
    return keras.callbacks.ModelCheckpoint(
      filepath="convnet_from_scratch_with_augmentation_lr_scheduler.keras",
      save_best_only=True,
      monitor="val_loss")

def create_model():
    inputs = keras.Input(shape=(180, 180, 3))
    x = data_augmentation(inputs)
    x = layers.Rescaling(1./255)(x)
    x = layers.Conv2D(filters=32, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters=64, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters=128, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
    x = layers.MaxPooling2D(pool_size=2)(x)
    x = layers.Conv2D(filters=256, kernel_size=3, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs=inputs, outputs=outputs)
    return model

In [None]:
class SquareRootScheduler:
    def __init__(self, lr):
        self.lr = lr
    def __call__(self, num_update):
        return self.lr * pow(num_update + 1.0, -0.5)
scheduler = SquareRootScheduler(lr=LR)

In [None]:
d2l.plot(tf.range(EPOCHS), [scheduler(t) for t in range(EPOCHS)])

In [None]:
def train_with_scheduler(scheduler):

    checkpoint_callback = get_checkpoint_callback()
    callbacks = [checkpoint_callback, LearningRateScheduler(scheduler),]
    opt = keras.optimizers.RMSprop(learning_rate=LR)  # LR is the base learning rate defined above
    model = create_model()
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])

    history = model.fit(train_dataset, epochs=EPOCHS,
        validation_data=validation_dataset, callbacks=callbacks)
    return history

In [None]:
history = train_with_scheduler(scheduler)

By running with different schedulers, we achieve the following training behavior:


SquareRoot scheduler:
<img src="./solution_assets/square_root.png" alt="square_root" style="width: 200px;"/>

Cosine scheduler:
<img src="./solution_assets/cosine.jpg" alt="cosine" style="width: 200px;"/>

Cosine scheduler with warmup:
<img src="./solution_assets/cosine_with_warmup.jpg" alt="cosine_with_warmup" style="width: 200px;"/>
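
The notebook only shows the code for the square-root scheduler. As a rough guide, the cosine schedulers could look like the sketch below, which follows the formulation in the d2l chapter linked above. The class layout and the parameter values (`base_lr=0.001`, 20 epochs, 5 warmup epochs) are assumptions for illustration, not necessarily the settings that produced the plots.

```python
import math


class CosineScheduler:
    """Cosine decay from base_lr to final_lr, with optional linear warmup."""

    def __init__(self, max_update, base_lr=0.001, final_lr=0.0,
                 warmup_steps=0, warmup_begin_lr=0.0):
        self.max_update = max_update
        self.base_lr = base_lr
        self.final_lr = final_lr
        self.warmup_steps = warmup_steps
        self.warmup_begin_lr = warmup_begin_lr

    def __call__(self, epoch, lr=None):
        # A caller may pass the current learning rate as a second argument;
        # this schedule depends only on the epoch, so it is ignored.
        if epoch < self.warmup_steps:
            # Linear warmup from warmup_begin_lr up to base_lr.
            fraction = epoch / self.warmup_steps
            return self.warmup_begin_lr + fraction * (self.base_lr - self.warmup_begin_lr)
        if epoch >= self.max_update:
            # After the schedule ends, stay at the final learning rate.
            return self.final_lr
        # Cosine decay over the epochs that remain after warmup.
        progress = (epoch - self.warmup_steps) / (self.max_update - self.warmup_steps)
        cosine = (1 + math.cos(math.pi * progress)) / 2
        return self.final_lr + (self.base_lr - self.final_lr) * cosine


cosine = CosineScheduler(max_update=20, base_lr=0.001)
cosine_warmup = CosineScheduler(max_update=20, base_lr=0.001, warmup_steps=5)
print([round(cosine(t), 6) for t in range(20)])
print([round(cosine_warmup(t), 6) for t in range(20)])
```

Either instance could be passed to `train_with_scheduler` in place of the square-root scheduler.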


## Debugging deep learning models

This section should take approximately 2 hours, spent studying the following materials:

https://github.com/stanislav-chekmenev/debugging-dl-models

* `notebooks/basics/0_intro.ipynb`: Read/Skim
* `notebooks/basics/1_implement_a_bug_free_model.ipynb`: Read
* `notebooks/basics/2_most_common_bugs_I.ipynb`: Read and do the exercise at the bottom
* `notebooks/basics/3_most_common_bugs_II.ipynb`: Read and do the exercise at the bottom