# Exercise 1.5.3 - Learning Rate Schedules
#### By Jonathan L. Moran (jonathan.moran107@gmail.com)
From the Self-Driving Car Engineer Nanodegree programme offered at Udacity.

## Objectives

* Implement two [learning rate schedules](https://en.wikipedia.org/wiki/Learning_rate#Learning_rate_schedule): the [exponential decay](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules/ExponentialDecay) and [step-wise annealing](https://paperswithcode.com/method/step-decay) strategies;
* Use the off-the-shelf [`tf.keras.optimizers.schedules`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/schedules) and the custom LR schedule [`tf.keras.callbacks.LearningRateScheduler`](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/LearningRateScheduler) wrapper to implement the above strategies;
* Evaluate a lightweight deep neural network (simple [ConvNet](https://en.wikipedia.org/wiki/Convolutional_neural_network)) with the LR schedules on the [GTSRB](https://benchmark.ini.rub.de/gtsrb_dataset.html) dataset.

## 1. Introduction

In [None]:
### Importing the required modules

In [2]:
import logging
import matplotlib.pyplot as plt
import matplotlib.ticker as mtick
import os
import tensorflow as tf
from tensorflow.keras.layers import Dense, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras.preprocessing import image_dataset_from_directory
from typing import List, Tuple

In [None]:
tf.__version__

In [None]:
tf.test.gpu_device_name()

In [None]:
### Setting the environment variables

In [None]:
ENV_COLAB = False                # True if running in Google Colab instance

In [None]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/'

In [None]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out/')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data/')

In [None]:
### Creating subdirectories (if not exists)
os.makedirs(DIR_OUT, exist_ok=True)

### 1.1. Learning Rate Schedules

In machine learning and statistics, the [learning rate](https://en.wikipedia.org/wiki/Learning_rate) is a tunable hyperparameter in an optimisation algorithm that determines the step size at each iteration while moving towards a minimum of a loss function (credit: Wikipedia). Setting the learning rate is often a balancing act between undershooting and overshooting a minimum. When the learning rate is _too low_, a model might fail to converge within the available training time because its steps towards the minimum are simply too small. On the other hand, a learning rate that is _too high_ might result in extremely large steps that _overshoot_ the minimum and miss the target entirely. A good (fixed) learning rate is small enough that the likelihood of overshooting is minimised, yet large enough to make steady progress along the direction of steepest descent towards convergence.
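
The trade-off described above can be illustrated with a minimal sketch. It assumes a toy loss $f(x) = x^2$ (not part of this exercise) and a hypothetical helper `gradient_descent`; the three learning rate values are arbitrary choices for illustration.

```python
def gradient_descent(lr: float, x0: float = 1.0, n_steps: int = 50) -> float:
    """Runs plain gradient descent on f(x) = x**2 and returns the final x."""
    x = x0
    for _ in range(n_steps):
        grad = 2.0 * x       # derivative of the toy loss x**2
        x = x - lr * grad    # step against the gradient, scaled by the LR
    return x


# Compare a very small, a moderate and a too-large learning rate
for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: final x = {gradient_descent(lr):.4g}")
```

With these assumed values, the smallest rate barely moves away from the starting point, the middle rate approaches the minimum at $x = 0$, and the largest rate overshoots further on every step.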

A [learning rate schedule](https://en.wikipedia.org/wiki/Learning_rate#Learning_rate_schedule) helps accomplish this by reducing overshoot while also shortening the time it takes to reach convergence. Two hyperparameters are commonly discussed alongside LR schedules: _decay_ and _momentum_. Decay controls overshooting by decreasing (_annealing_) the learning rate over the course of training, for example by a fixed factor. Momentum helps speed up convergence: analogous to a ball rolling down a hill, it accumulates a fraction of the previous updates so that the optimiser keeps moving in a consistent direction. This helps the optimiser keep descending rather than stalling in a shallow local minimum or on a [saddle point](https://en.wikipedia.org/wiki/Saddle_point) of the differentiable function, much as a ball with little to no momentum will struggle to 'get over' such regions.

#### Step-wise Annealing

One of the easiest learning rate schedules to implement is the [step-wise method](https://paperswithcode.com/method/step-decay). With this schedule, the learning rate is decreased by a static factor at evenly-spaced intervals during the training cycle (usually measured in epochs, i.e., full passes over a dataset). The scale factor $\gamma$ serves as a hyperparameter governing how much to decrease the previous learning rate $\eta_{i-1}$ by,

$$
\begin{align}
\eta_{i} &= \eta_{i-1} * \gamma, 
\end{align}
$$

for the updated learning rate value $\eta_{i}$.

Step-wise annealing can also be, well, _step-wise_. As the name implies, the number of current _steps_ $n$ can also be used to scale the initial fixed learning rate value $\eta_{0}$,

$$
\begin{align}
\eta_{n} &= \eta_{0} * d^{\left\lfloor\frac{1 + n}{r}\right\rfloor},
\end{align}
$$

given a static decay factor $d$ (e.g., $d=0.5$ will decay the LR by half) and a drop-rate $r$ (e.g., $r=10$ will drop the LR every 10 iterations).
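
The closed-form expression above can be sketched in plain Python as follows. The function name `step_decay_lr` and its default values ($\eta_{0} = 0.001$, $d = 0.5$, $r = 10$) are assumptions chosen for illustration only.

```python
def step_decay_lr(step: int, initial_lr: float = 0.001,
                  decay_factor: float = 0.5, drop_rate: int = 10) -> float:
    """Closed-form step decay: eta_0 * d ** floor((1 + n) / r)."""
    # Integer division implements the floor of (1 + n) / r
    n_drops = (1 + step) // drop_rate
    return initial_lr * decay_factor ** n_drops


# The learning rate stays constant between drops and halves at each drop
for n in (0, 8, 9, 19, 29):
    print(f"step {n:2d}: lr = {step_decay_lr(n):.6f}")
```

With $r = 10$, the first drop happens at $n = 9$ because of the $1 + n$ term in the exponent.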

#### Exponential Decay

[Exponential Decay](https://paperswithcode.com/method/exponential-decay) is similar to step-wise annealing in that the LR is gradually decreased over time. However, instead of dropping the learning rate in discrete steps, a decaying exponential function is used to reduce the initial fixed learning rate $\eta_{0}$ smoothly over time,

$$
\begin{align}
\eta_{n} &= \eta_{0} * e^{-dn},
\end{align}
$$

given a static decay factor $d$ and the current number of iterations $n$. Depending on the choice of the decay factor $d$, the exponential decay schedule can reduce the learning rate more rapidly than step-wise annealing.
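
A minimal sketch of this formula follows, assuming a hypothetical helper `exponential_decay_lr` and an arbitrary decay factor $d = 0.01$ (neither is part of the exercise code).

```python
import math


def exponential_decay_lr(step: int, initial_lr: float = 0.001,
                         decay: float = 0.01) -> float:
    """Exponential decay: eta_0 * exp(-d * n)."""
    return initial_lr * math.exp(-decay * step)


# The learning rate shrinks smoothly at every step, with no plateaus
for n in (0, 10, 50, 100):
    print(f"step {n:3d}: lr = {exponential_decay_lr(n):.6f}")
```

Unlike the step-wise schedule, the rate changes at every iteration rather than only at the drop points.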

### 1.2. Adaptive Learning Rate Methods

Given that both the learning rate and its schedules have hyperparameters that need to be manually selected and defined prior to training, a learning rate schedule might not always be optimal. Instead, we can use [adaptive learning rate](https://en.wikipedia.org/wiki/Learning_rate#Adaptive_learning_rate) methods, which rely on heuristics to adjust the learning rate during training rather than on values fixed in advance. Adaptive learning rate methods such as [Adagrad](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#AdaGrad), [Adam](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam) and [RMSProp](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp) build upon the [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) optimiser with adaptive learning rate tuning.

#### AdaGrad

The [AdaGrad](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#AdaGrad) (adaptive gradient) algorithm [1] was pioneered in 2011 by J. Duchi et al., and is a dynamic technique for the normalisation of parameter updates. In brief, weights that receive large gradient update values will have their effective learning rate reduced. Conversely, weights with small or infrequently-updated values will have their effective learning rate increased. Using a parameter referred to as a "cache" variable $c$, AdaGrad performs the following updates,

$$
\begin{align}
c &= c + dx^2, \\
x &= x - \frac{\alpha * dx}{\sqrt{c} + \epsilon}.
\end{align}
$$

The cache variable $c$ keeps track of the per-parameter sum of the squared gradients, which in turn is used to normalise the parameter updates. Here an $\epsilon$ term is added to the denominator of the update term for numerical stability (to avoid a divide by zero error), credit: [G. Singh](https://medium.com/@gsinghviews/adaptive-learning-rate-methods-e6e00dcbae5e).
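
The update can be sketched in plain Python, with the accumulated cache $c$ under the square root. The helper name `adagrad_step`, the step size $\alpha = 0.1$ and the example gradients are assumptions for illustration, not values from the exercise.

```python
import math
from typing import List, Tuple


def adagrad_step(x: List[float], dx: List[float], cache: List[float],
                 alpha: float = 0.1, eps: float = 1e-8
) -> Tuple[List[float], List[float]]:
    """Performs one AdaGrad update and returns the new parameters and cache."""
    # Accumulate the per-parameter sum of squared gradients
    cache = [c + g ** 2 for c, g in zip(cache, dx)]
    # Scale each parameter's step by the root of its accumulated cache
    x = [xi - alpha * g / (math.sqrt(c) + eps)
         for xi, g, c in zip(x, dx, cache)]
    return x, cache


# Two parameters whose gradients differ by a factor of 100
x, cache = adagrad_step(x=[1.0, 1.0], dx=[20.0, 0.2], cache=[0.0, 0.0])
print(x)
print(cache)
```

On this first step both parameters move by roughly the same amount, even though their gradients differ by two orders of magnitude, which is the normalising effect described above.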

#### Adam

_Coming soon_.

#### RMSProp

_Coming soon_.

## 2. Details

To apply a learning rate schedule during training, you can leverage Keras `callbacks`. Callbacks perform various actions
at different stages of training. For example, Keras uses a callback to save the model's weights at 
the end of each training epoch.

In [None]:
### From Udacity's `utils.py`

In [None]:
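class LrLogger(tf.keras.callbacks.Callback):
    """Records the optimiser's current learning rate at the end of each epoch."""

    def __init__(self):
        super().__init__()
        
    def on_train_begin(self, logs=None):
        # Create the 'lr' entry in the History dictionary returned by `fit`
        history = self.model.history.history
        history['lr'] = []

    def on_epoch_end(self, epoch, logs=None):
        history = self.model.history.history
        optimizer = self.model.optimizer
        # NOTE: `_decayed_lr` is a private optimiser method and may not be
        # available in every TensorFlow / Keras release
        decayed_lr = optimizer._decayed_lr('float32').numpy()
        history['lr'].append(decayed_lr)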

You can either use the pre-implemented schedulers in `tf.keras.optimizers.schedules` or implement a scheduler yourself 
using your own custom decay function, as shown below:

```
def decay(model, callbacks, lr=0.001):
    """ create custom decay that does not do anything """
    def scheduler(epoch, lr):
        return lr 

    callbacks.append(tf.keras.callbacks.LearningRateScheduler(scheduler))

    # compile model
    model.compile()
    
    return model, callbacks 
```

In [3]:
### From Udacity's `training.py`

In [5]:
def exponential_decay(
        model: tf.keras.Model, callbacks: List[tf.keras.callbacks.Callback]=[], 
        initial_lr: float=0.001
) -> Tuple[tf.keras.Model, List[tf.keras.callbacks.Callback]]:
    """Compiles and returns Model instance with exponential decay LR schedule.
    
    :param model: the tf.keras.Model instance to compile.
    :param callbacks: the list of tf.keras.callbacks to pass alongside model.
    :param initial_lr: the value to fix the learning rate at before annealing.
    :returns: tuple, the compiled Model instance and its callbacks.
    """
    # IMPLEMENT THIS FUNCTION
    
    # Instantiate the learning rate schedule
    lr_scheduler = tf.keras.optimizers.schedules.ExponentialDecay(
                        initial_learning_rate=initial_lr,
                        decay_steps=100,
                        decay_rate=0.95,
                        staircase=False
    )
    # Instantiate the optimiser
    optimizer = tf.keras.optimizers.Adam(learning_rate=lr_scheduler)
    # Compile the model
    model.compile(optimizer=optimizer, 
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy']
    )
    # Return the model and any specified callbacks
    return model, callbacks

In [7]:
model = tf.keras.Model()

In [8]:
model, callbacks = exponential_decay(model=model, callbacks=[], initial_lr=0.001)
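
For reference, the schedule configured above can be reproduced in plain Python. As documented for `tf.keras.optimizers.schedules.ExponentialDecay`, the decayed rate is `initial_learning_rate * decay_rate ** (step / decay_steps)`, with the exponent floored when `staircase=True`. The helper name below is hypothetical, and `step` counts optimiser updates (batches) rather than epochs.

```python
def keras_style_exponential_decay(step: int, initial_lr: float = 0.001,
                                  decay_steps: int = 100,
                                  decay_rate: float = 0.95,
                                  staircase: bool = False) -> float:
    """Mirrors the documented ExponentialDecay formula (illustrative only)."""
    # With staircase=True the exponent only changes every `decay_steps` steps
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_lr * decay_rate ** exponent


for step in (0, 50, 100, 150):
    smooth = keras_style_exponential_decay(step)
    stairs = keras_style_exponential_decay(step, staircase=True)
    print(f"step {step:3d}: smooth = {smooth:.6f}, staircase = {stairs:.6f}")
```

This matches the form $\eta_{0} * e^{-dn}$ when $d = -\ln(\text{decay\_rate}) / \text{decay\_steps}$, which is approximately $5.13 \times 10^{-4}$ for the values used here.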

In [22]:
def step_decay(
        model: tf.keras.Model, callbacks: List[tf.keras.callbacks.Callback]=[], 
        initial_lr: float=0.001
) -> Tuple[tf.keras.Model, List[tf.keras.callbacks.Callback]]:
    """Compiles and returns Model instance with step-wise decay LR schedule.
    
    :param model: the tf.keras.Model instance to compile.
    :param callbacks: the list of tf.keras.callbacks to pass alongside model.
    :param initial_lr: the value to fix the learning rate at before annealing.
    :returns: tuple, the compiled Model instance and its callbacks.
    """
    #  IMPLEMENT THIS FUNCTION
    
    def scheduler(epoch: int, lr: float):
        """Simple custom constant step-wise annealing schedule."""
        return lr / 2 if epoch % 10 == 0 and epoch > 0 else lr
    
    # Instantiate a custom Keras callback to perform LR annealing
    lr_schedule = tf.keras.callbacks.LearningRateScheduler(
                            schedule=scheduler, verbose=1
    )
    # Build a new list so that the shared default argument is never mutated
    callbacks = [*callbacks, lr_schedule]
    # Instantiate the optimiser with the initial learning rate value
    optimizer = tf.keras.optimizers.Adam(learning_rate=initial_lr)
    # Compile the model
    model.compile(optimizer=optimizer,
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
                  metrics=['accuracy']
    )
    # Return the compiled model and the custom LR scheduler and any other callback
    return model, callbacks

In [23]:
model = tf.keras.Model()

In [24]:
model, callbacks = step_decay(model=model, callbacks=[], initial_lr=0.001)

Feel free to use any decay rates as well as a step size of your choice for the stepwise scheduler.
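
One way to experiment with different values is to build the scheduler from a small factory function. This is only a sketch: `make_step_scheduler`, `drop_factor` and `step_size` are hypothetical names, and the returned function follows the `(epoch, lr)` signature used by the `scheduler` in `step_decay`.

```python
from typing import Callable


def make_step_scheduler(drop_factor: float = 0.5,
                        step_size: int = 10) -> Callable[[int, float], float]:
    """Returns a schedule that scales the LR every `step_size` epochs."""
    def scheduler(epoch: int, lr: float) -> float:
        # Keep the initial LR at epoch 0, then drop at multiples of step_size
        if epoch > 0 and epoch % step_size == 0:
            return lr * drop_factor
        return lr
    return scheduler


# Simulate 12 epochs with a ten-fold drop every 5 epochs
scheduler = make_step_scheduler(drop_factor=0.1, step_size=5)
lr = 0.001
for epoch in range(12):
    lr = scheduler(epoch, lr)
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

The returned function could then be passed as the `schedule` argument of `tf.keras.callbacks.LearningRateScheduler` in place of the fixed `scheduler`.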

You can run `python training.py` to see the effect of different annealing strategies on your training and model performance. Make sure to feed in the GTSRB dataset as the image directory, and use the Desktop to view the visualization of the final training metrics.

In [None]:
### From Udacity's `utils.py`

In [None]:
def get_module_logger(mod_name):
    logger = logging.getLogger(mod_name)
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s %(levelname)-8s %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger

In [None]:
### From Udacity's `training.py`

In [None]:
logger = get_module_logger(__name__)

In [None]:
import argparse

parser = argparse.ArgumentParser(description='Download and process tf files')
parser.add_argument('-d', '--imdir', required=True, type=str,
                    help='data directory')
parser.add_argument('-e', '--epochs', default=10, type=int,
                    help='Number of epochs')
# NOTE: inside a notebook, pass the arguments explicitly instead,
# e.g., `parser.parse_args(['--imdir', DIR_SRC])`
args = parser.parse_args()

logger.info(f'Training for {args.epochs} epochs using {args.imdir} data')

In [None]:
### From Udacity's `utils.py`

In [None]:
def process(image, label):
    """ small function to normalize input images """
    image = tf.cast(image / 255., tf.float32)
    return image, label

In [None]:
def get_datasets(imdir):
    """ extract GTSRB dataset from directory """
    train_dataset = image_dataset_from_directory(imdir, 
                                       image_size=(32, 32),
                                       batch_size=32,
                                       validation_split=0.2,
                                       subset='training',
                                       seed=123,
                                       label_mode='int')

    val_dataset = image_dataset_from_directory(imdir, 
                                        image_size=(32, 32),
                                        batch_size=32,
                                        validation_split=0.2,
                                        subset='validation',
                                        seed=123,
                                        label_mode='int')
    train_dataset = train_dataset.map(process)
    val_dataset = val_dataset.map(process)
    return train_dataset, val_dataset

In [None]:
### From Udacity's `training.py`

In [None]:
# get the datasets
train_dataset, val_dataset = get_datasets(args.imdir)

In [None]:
# Use a separate name so that the module `logger` defined above is not overwritten
lr_logger = LrLogger()
callbacks = [lr_logger]

In [None]:
### From Udacity's `utils.py`

In [31]:
def create_network(
        inputs: tf.keras.Input, outputs: tf.keras.layers.Layer
) -> tf.keras.Model:
    """Creates a tf.keras.Sequential Model with the provided inputs and outputs.
    
    :param inputs: the tf.keras.Input layer of specified shape.
    :param outputs: the tf.keras.layers.Layer instance of desired output,
        this should be a Dense layer with units equal to num. classes.
    :returns: the tf.keras.Model instance.
    """
    
    net = tf.keras.models.Sequential([
        inputs,
        tf.keras.layers.Conv2D(
                filters=6, kernel_size=(3, 3), strides=(1, 1), activation='relu'),
        tf.keras.layers.MaxPooling2D(
                pool_size=(2, 2), strides=(2, 2)),
        tf.keras.layers.Conv2D(
                filters=16, kernel_size=(3, 3), strides=(1, 1), activation='relu'),
        tf.keras.layers.MaxPooling2D(
                pool_size=(2, 2), strides=(2, 2)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(
                units=120, activation='relu'),
        tf.keras.layers.Dense(
                units=84, activation='relu'),
        outputs
    ])
    return net

In [32]:
### From Udacity's `training.py`

In [33]:
INPUT_SHAPE = (32, 32, 3)
N_CLASSES = 43

In [34]:
inputs = tf.keras.Input(shape=INPUT_SHAPE)
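# Softmax output, since the loss functions above are configured with `from_logits=False`
outputs = tf.keras.layers.Dense(units=N_CLASSES, activation='softmax')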

In [35]:
model = create_network(inputs, outputs)

In [36]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 30, 30, 6)         168       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 15, 15, 6)         0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 13, 13, 16)        880       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 576)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 120)               69240     
_________________________________________________________________
dense_3 (Dense)              (None, 84)                1

In [None]:
### Create a list of all desired TensorFlow Keras callbacks
# (the `LrLogger` records the 'lr' values plotted by `display_metrics`)
callbacks = [LrLogger()]

In [None]:
### Compile the model with a Keras `ExponentialDecay` LR schedule
model, callbacks = exponential_decay(model, callbacks=callbacks, initial_lr=0.001)

In [None]:
### Compile the model with a custom `step_decay` LR schedule
# model = create_network(inputs, outputs)
# model, callbacks = step_decay(model, callbacks=callbacks, initial_lr=0.001)

In [None]:
### Fitting the model on the train data and passing in our callbacks

In [None]:
history = model.fit(x=train_dataset, 
                    epochs=args.epochs, 
                    validation_data=val_dataset,
                    callbacks=callbacks
)

In [None]:
### From Udacity's `utils.py`

In [None]:
def display_metrics(history):
    """ plot loss and accuracy from keras history object """
    f, ax = plt.subplots(1, 3, figsize=(15, 5))
    ax[0].plot(history.history['loss'], linewidth=3)
    ax[0].plot(history.history['val_loss'], linewidth=3)
    ax[0].set_title('Loss', fontsize=16)
    ax[0].set_ylabel('Loss', fontsize=16)
    ax[0].set_xlabel('Epoch', fontsize=16)
    ax[0].legend(['train loss', 'val loss'], loc='upper right')
    ax[1].plot(history.history['accuracy'], linewidth=3)
    ax[1].plot(history.history['val_accuracy'], linewidth=3)
    ax[1].set_title('Accuracy', fontsize=16)
    ax[1].set_ylabel('Accuracy', fontsize=16)
    ax[1].set_xlabel('Epoch', fontsize=16)
    ax[1].legend(['train acc', 'val acc'], loc='upper left')
    ax[2].plot(history.history['lr'], linewidth=3)
    ax[2].set_title('Learning rate', fontsize=16)
    ax[2].set_ylabel('Learning Rate', fontsize=16)
    ax[2].set_xlabel('Epoch', fontsize=16)
    ax[2].legend(['learning rate'], loc='upper right')
    # ax[2].ticklabel_format(axis='y', style='sci')
    ax[2].yaxis.set_major_formatter(mtick.FormatStrFormatter('%.2e'))
    plt.tight_layout()
    plt.show()

In [None]:
display_metrics(history)

## 3. Closing Remarks

##### Alternatives
* Test out various starting learning rate values.

##### Extensions to task
* Test learning rate strategies on other model architectures;
* Visualise the learning rate schedule over time;
* Implement other [popular](https://paperswithcode.com/methods/category/learning-rate-schedules) learning rate schedules in literature (e.g., [cosine annealing](https://paperswithcode.com/method/cosine-annealing), [linear warmup with linear decay](https://paperswithcode.com/method/linear-warmup-with-linear-decay));
* Implement [adaptive learning rate](https://en.wikipedia.org/wiki/Learning_rate#Adaptive_learning_rate) methods.

## 4. Future Work

- [ ] Visualise learning rate schedule over time;
- [ ] Compare their performance / effect on model accuracy;
- [ ] Implement other popular learning rate schedules.

## Credits

This assignment was prepared by Thomas Hossler et al., Winter 2021 (link [here](https://www.udacity.com/course/self-driving-car-engineer-nanodegree--nd0013)).


References
* [1] Duchi, J. et al., Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research. 12(61):2121–2159. 2011. [doi:10.5555/1953048.2021068](https://dl.acm.org/doi/10.5555/1953048.2021068).


Helpful resources:
* [Learning Rate Schedule in Practice: an example with Keras and TensorFlow 2.0 by B. Chen | Medium](https://towardsdatascience.com/learning-rate-schedule-in-practice-an-example-with-keras-and-tensorflow-2-0-2f48b2888a0c)
* [Adaptive Learning Rate Methods by G. Singh](https://medium.com/@gsinghviews/adaptive-learning-rate-methods-e6e00dcbae5e)