# Exercise 1.3.2: Stochastic Gradient Descent
#### By Jonathan L. Moran (jonathan.moran107@gmail.com)
From the Self-Driving Car Engineer Nanodegree programme offered at Udacity.

## Objectives

* Create training and validation loops in TensorFlow using the custom functions built in [Exercise 1.3.1](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Object-Detection-in-Urban-Environments/Exercises/1-3-1-Logistic-Regression/2022-08-27-Logistic-Regression.ipynb);
* Implement the logistic regression model using [stochastic gradient descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent);
* Train the model on the [German Traffic Sign Recognition Benchmark](https://benchmark.ini.rub.de) dataset.

## 1. Introduction

In [None]:
### Importing required modules

In [3]:
import numpy as np
import os
import tensorflow as tf
from typing import List

In [None]:
tf.__version__

In [None]:
### Setting environment variables

In [None]:
ENV_COLAB = False                # True if running in Google Colab instance

In [None]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/'

In [None]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out/')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data/')

### 1.1. Custom model functions

We will wrap the cross-entropy loss and accuracy metric functions in a [`tf.keras.metrics.MeanMetricWrapper`](https://www.tensorflow.org/api_docs/python/tf/keras/metrics/MeanMetricWrapper) which is a quick way to build a custom metric function in TensorFlow. Since the `MeanMetricWrapper` expects a per-sample loss array as output, we will slightly modify our `cross_entropy` and `accuracy` functions from [Exercise 1.3.1](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Object-Detection-in-Urban-Environments/Exercises/1-3-1-Logistic-Regression/2022-08-27-Logistic-Regression.ipynb).

#### Softmax activation function

This [softmax](https://en.wikipedia.org/wiki/Softmax_function) activation function is a generalisation of the sigmoid [logistic function](https://en.wikipedia.org/wiki/Logistic_function) to multiple dimensions. The softmax function computes the discrete probability distribution over all classes for each observation in the training step.

#### Cross-entropy loss function

The [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression) loss function serves as a measure of dissimilarity between the maximum likelihood estimate (the class label prediction) and the corresponding ground truth class label. We will use cross-entropy loss as a cost function to minimise over all data points in our training set.

#### Accuracy scoring metric function

Accuracy is the number of correct class predictions divided by the total number of predicted class labels. For a single prediction, this value is either $1$ (correct) or $0$ (incorrect).

### 1.2. Modelling with TensorFlow

In this section we will build and prepare the [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) model for training and evaluation.

#### Stochastic Gradient Descent algorithm

The logistic regression solver we have selected for this assignment is called [_stochastic gradient descent_](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) (SGD). Stochastic gradient descent is simply a gradient descent optimisation with a batch size of $1$. That is, on every weight update step only a single training example is used to compute the gradients. This differs from _batch_ and _mini-batch_ gradient descent, which use the full dataset and a subset of the full dataset, respectively, on each update step.

The `sgd` function loops over all weight values in vector $\mathrm{w}$ and subtracts the gradient, scaled by the learning rate, from each value $w_{i}$, performing a gradient _descent_ towards a minimum of the loss function. Here we perform the _stochastic_ approach with `grad` computed on a per-sample basis.

#### Logistic Regression model

Logistic regression is a simple [linear model](https://en.wikipedia.org/wiki/Linear_model) for classification that attempts to fit a mapping between input data $X$ and output labels $y$. To do so, a linear function $y = m*x + b$ is approximated such that two variables $m$ and $b$ minimise the loss. For a linear model such as this, our $m$ variable represents a vector of _weights_ (denoted $\mathrm{W}$) and our $b$ variable represents our _bias_ term. The loss (error) we are attempting to minimise is computed with an off-the-shelf [cross-entropy loss](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression) function (commonly referred to as _log loss_). This function produces a loss value which essentially measures the difference between the predicted and true outcome variables.

### 1.3. Training and validation

By utilising several TensorFlow Functional APIs we can perform custom training and validation loops.

**Training loop**:
1. Iterate over training dataset sample-by-sample;
2. Perform the forward pass of the model on each sample:
    * Scale each image data array to values between 0 and 1;
    * Compute the maximum likelihood estimate;
    * Calculate the cross-entropy loss (i.e., measure prediction error);
    * Obtain the gradient with respect to model weight vectors;
    * Update the weight vectors by subtracting the gradient (stochastic gradient descent);
    * Calculate the prediction accuracy;
3. Return the mean loss and mean accuracy metrics.


**Validation loop**:
1. Iterate over the validation dataset sample-by-sample;
    * Scale each image data array to values between 0 and 1;
    * Compute the maximum likelihood estimate (i.e., making a prediction);
    * Calculate the prediction accuracy;
2. Return the mean accuracy metric.

In our `training_loop()` method we will use the [`tf.GradientTape`](https://www.tensorflow.org/api_docs/python/tf/GradientTape) tool to perform [automatic differentiation](https://en.wikipedia.org/wiki/Automatic_differentiation) (_autodiff_). TensorFlow's `GradientTape` API records the operations executed during the forward pass of a model, then traverses them in reverse order during the backward pass in order to differentiate a desired function. This is known as _reverse mode differentiation_ and is more efficient than [symbolic differentiation](https://en.wikipedia.org/wiki/Symbolic_differentiation) and [numerical differentiation](https://en.wikipedia.org/wiki/Numerical_differentiation) for computing the partial derivatives of a function with respect to many inputs.
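As a minimal illustration of the tape mechanism, the sketch below differentiates a toy scalar function. The variable `x` and the function $y = x^{2} + 2x$ are hypothetical and not part of the exercise.

```python
import tensorflow as tf

# Hypothetical toy variable; not part of the exercise
x = tf.Variable(3.0)

with tf.GradientTape() as tape:
    # Operations on `x` are recorded while the tape is active
    y = x * x + 2.0 * x

# The recorded operations are traversed in reverse to obtain dy/dx = 2x + 2
dy_dx = tape.gradient(y, x)
print(dy_dx.numpy())  # 8.0
```

The same pattern applies in the training loop, with the loss in place of `y` and the model parameters in place of `x`.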

Here our `GradientTape` routine will calculate the gradient of the loss with respect to the model variables after a forward pass of the model has been run. The forward pass works out a maximum likelihood estimate for an observation $x_{i}$, i.e., the probability that its class label is $y = j$ given the weight vector $w_{j}$ for class $j$:

$$
\begin{align}
    P\left(y = j \vert x_{i}; w_{j} \right) &= \phi_{softmax}\left(z_{i}\right)_{j},
\end{align}
$$

where the $j$-th entry of the input $z_{i}$ is defined to be
$$
\begin{align}
    z_{i,j} = w_{j}^{\top} x_{i} + b_{j}
\end{align}
$$

such that $w_{j}$ is the model weight vector for class $j$, $b_{j}$ is the corresponding bias term, and $x_{i}$ is the feature vector belonging to the single training observation. 

The gradients computed with `GradientTape` for the cross-entropy loss
$$
\begin{align}
    loss\left(x_{i}, y; w, b\right) = H\left(y, \hat{y}\right) = -\sum_{j} y_{j} \log \hat{y}_{j}
\end{align}
$$

for a one-hot encoded true class label vector $y$ and the corresponding predicted probability distribution $\hat{y}$ are taken with respect to the model weight and bias parameters $w$ and $b$.

In stochastic gradient descent, these iterative gradient computations are subtracted from the weight vector on a per-sample basis. An unbiased estimate of the _true_ gradient (i.e., gradient computed over the full dataset) using only a single observation can be achieved when sampling the observation uniformly at random over the entire dataset.
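The unbiasedness claim can be checked numerically. The sketch below is only an illustration under stated assumptions: it swaps in a hypothetical one-dimensional least-squares model with synthetic data (not the logistic regression model or the GTSRB data) because its per-sample gradient has a simple closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical toy data for a 1-D least-squares model y ≈ w * x
x = rng.normal(size=1000)
y = 3.0 * x + rng.normal(scale=0.1, size=1000)
w = 0.5

# Per-sample gradients of the squared error (w * x_i - y_i)^2 w.r.t. w
per_sample_grads = 2.0 * (w * x - y) * x
# "True" gradient of the mean loss over the full dataset
full_grad = per_sample_grads.mean()

# Average of many single-sample gradients drawn uniformly at random
idx = rng.integers(0, len(x), size=20000)
sgd_estimate = per_sample_grads[idx].mean()
print(full_grad, sgd_estimate)
```

Each single-sample gradient is noisy, but their average approaches the full-dataset gradient as more uniformly drawn samples are used.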

### 1.4. German Traffic Sign Recognition Benchmark (GTSRB) dataset

The German Traffic Sign Recognition Benchmark is a multi-class, single-image classification challenge created by J. Stallkamp et al. (2012) at the [Institut für Neuroinformatik](https://benchmark.ini.rub.de/gtsrb_news.html) [1]. The dataset contains over 50,000 unique images of more than 40 distinct classes of traffic signs. 

Each image has been reliably annotated with the following information:
* **Filename**: filename of the corresponding image;
* **Width**: width of the image;
* **Height**: height of the image;
* **ROI.x1**: x-coordinate of top-left corner of traffic sign bounding box;
* **ROI.y1**: y-coordinate of the top-left corner of traffic sign bounding box;
* **ROI.x2**: x-coordinate of bottom-right corner of traffic sign bounding box;
* **ROI.y2**: y-coordinate of the bottom-right corner of traffic sign bounding box; 
* **ClassId**: assigned class label.

In addition to the CSV-formatted annotations, the following information about the images is provided:
* Images contain one traffic sign each;
* Images contain a border of 10% around the actual traffic sign (at least 5 pixels) to allow for edge-based approaches;
* Images are stored in PPM format ([Portable Pixmap, P6](http://en.wikipedia.org/wiki/Netpbm_format));
* Image sizes vary from 15x15 to 250x250 pixels;
* Images are not necessarily square;
* The actual traffic sign is not necessarily centred within the image. This is true for images that were close to the image border in the full camera image;
* The bounding box of the traffic sign is part of the annotations.

Lastly, several pre-calculated feature sets are provided. Namely, _Histogram of Oriented Gradients_ (HOG) features, _Haar-like_ features (5 distinct Haar-like features), and _hue histograms_ (256-bin HSV colour space).

This dataset presents unique real-world challenges within object recognition by providing traffic sign images captured in a variety of lighting/illumination conditions and images that have distortions (e.g., blurring, pixelation) as well as differences in shape, size, etc.

For more information on the GTSRB dataset, see [here](https://benchmark.ini.rub.de/gtsrb_dataset.html). 

## 2. Programming Task

### 2.1. Custom model functions

The following functions from [Exercise 1.3.1](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Object-Detection-in-Urban-Environments/Exercises/1-3-1-Logistic-Regression/2022-08-27-Logistic-Regression.ipynb) have been slightly modified to work on a single input observation.

#### Softmax activation function

In [None]:
### From J. Moran's `2022-08-27-Logistic-Regression.ipynb`

In [13]:
def softmax(logits: tf.Tensor, stable: bool=False) -> tf.Tensor:
    """Returns the softmax probability distribution.
    
    :param logits: a 1xN tf.Tensor of logits.
    :param stable: optional, flag indicating whether
        or not to normalise the input data.
    :returns: soft_logits, a 1xN tf.Tensor of real 
        values in range (0,1) that sum up to 1.0.
    """
    
    assert isinstance(logits, tf.Tensor)
    if stable:
        logits = tf.subtract(logits, tf.reduce_max(logits))
    soft_logits = tf.math.exp(logits)
    soft_logits /= tf.math.reduce_sum(soft_logits)
    return soft_logits
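The effect of the `stable` flag can be seen with large logits. The sketch below repeats the same arithmetic in a hypothetical helper, `softmax_sketch`, so that the cell runs on its own. The logit values are made up to force a `float32` overflow.

```python
import tensorflow as tf

def softmax_sketch(logits, stable=False):
    # Same arithmetic as `softmax` above, repeated so this cell is self-contained
    if stable:
        logits = logits - tf.reduce_max(logits)
    exps = tf.math.exp(logits)
    return exps / tf.reduce_sum(exps)

# Hypothetical logits, large enough that exp() overflows in float32
logits = tf.constant([[1000.0, 1001.0, 1002.0]])

naive = softmax_sketch(logits)                  # exp() gives inf, so inf / inf = nan
shifted = softmax_sketch(logits, stable=True)   # same as softmax of [-2, -1, 0]
print(naive.numpy(), shifted.numpy())
```

Subtracting the maximum logit leaves the resulting distribution unchanged, since the common factor cancels in the ratio, while keeping every exponent at or below zero.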

#### Cross-entropy loss function

In [14]:
def cross_entropy(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Returns the per-sample cross-entropy loss.
    
    :param y_true: a 1xC tf.Tensor, the ground truth class label
        as a one-hot encoded vector of length C (num of total classes).
    :param y_pred: a 1xC tf.Tensor, the predicted per-class probabilities.
    :returns: a 1x1 tf.Tensor, the categorical cross-entropy loss value
        for a single observation and its ground truth label.
    """
    
    # Pick the probability value from the distribution
    # that corresponds to the true class label
    preds = tf.boolean_mask(y_pred, mask=y_true)
    # Taking the negative log-likelihood
    neg_log_likelihood = -tf.math.log(preds)
    # Here we return the categorical cross-entropy loss
    # value for a single observation (no need to normalise)
    return tf.reduce_sum(neg_log_likelihood)

#### Accuracy scoring metric function

In [15]:
def accuracy(y_true: tf.Tensor, y_pred: tf.Tensor) -> tf.Tensor:
    """Evaluates a single prediction against the ground truth.

    :param y_true: a 1x1 scalar tf.Tensor, the ground truth class label
        (not one-hot encoded).
    :param y_pred: a 1x1 scalar tf.Tensor, the predicted class label.
    :returns: acc, a 1x1 scalar tf.Tensor object, 1.0 if correct else 0.0.
    """

    return tf.cast(tf.math.equal(y_true, y_pred), dtype=tf.float32)

### 2.2. Modelling with TensorFlow

#### Stochastic Gradient Descent algorithm

In [None]:
### From Udacity's `training.py`

In [None]:
def sgd(params: List[tf.Tensor], grads: List[tf.Tensor], lr: float, bs: int=1):
    """Performs the stochastic gradient descent update step.
    
    SGD fits a linear model to the input data and target labels.
    Here we assume a logistic regression implementation with a
    softmax activation function and categorical cross-entropy loss.
    
    args:
    - params [list[tensor]]: model params
    - grads [list[tensor]]: param gradients such that params[0].shape == grads[0].shape
    - lr [float]: learning rate
    - bs [int]: batch_size
    """
    # IMPLEMENT THIS FUNCTION
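One possible way to fill in the update step is sketched below. This is not the reference solution: the name `sgd_sketch` and the toy parameters are hypothetical, and dividing by `bs` assumes that the gradients were computed from a loss summed (not averaged) over the batch.

```python
import tensorflow as tf
from typing import List

def sgd_sketch(params: List[tf.Variable], grads: List[tf.Tensor],
               lr: float, bs: int = 1) -> None:
    """Applies one in-place gradient descent step to each parameter."""
    for param, grad in zip(params, grads):
        # Scale by the learning rate and divide by the batch size
        param.assign_sub(lr * grad / bs)

# Hypothetical toy parameters and gradients
W = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
b = tf.Variable([0.5, 0.5])
grads = [tf.ones_like(W), tf.ones_like(b)]
sgd_sketch([W, b], grads, lr=0.1)
print(W.numpy(), b.numpy())
```

With a learning rate of 0.1 and unit gradients, every entry of `W` and `b` decreases by 0.1.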

#### Logistic Regression model

In [None]:
def model():
    # IMPLEMENT THIS FUNCTION
    pass
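A minimal sketch of the forward pass is given below, assuming the model takes a batch of 32x32 RGB images as its argument (the stub above takes none, so the signature is an assumption). The name `model_sketch` is hypothetical, and `tf.nn.softmax` is swapped in for the custom `softmax` so that each row of a batch is normalised separately.

```python
import tensorflow as tf

num_inputs, num_outputs = 32 * 32 * 3, 43
W = tf.Variable(tf.random.normal(shape=(num_inputs, num_outputs), mean=0, stddev=0.01))
b = tf.Variable(tf.zeros(num_outputs))

def model_sketch(X: tf.Tensor) -> tf.Tensor:
    """Returns per-class probabilities for a batch of images in [0, 255]."""
    X = tf.cast(X, tf.float32) / 255.0        # scale pixel values to [0, 1]
    flat = tf.reshape(X, (-1, num_inputs))    # flatten each image to a row vector
    logits = tf.matmul(flat, W) + b           # linear map z = xW + b
    return tf.nn.softmax(logits, axis=-1)     # swapped in for the custom `softmax`

# An all-black image yields logits equal to the (zero) bias, i.e. a uniform distribution
probs = model_sketch(tf.zeros((1, 32, 32, 3)))
print(probs.shape)
```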

### 2.3. Training and validation loops

#### Custom training loop

In [None]:
def training_loop(train_dataset, model, loss, optimizer):
    """
    training loop
    args:
    - train_dataset: 
    - model [func]: model function
    - loss [func]: loss function
    - optimizer [func]: optimizer func
    returns:
    - mean_loss [tensor]: mean training loss
    - mean_acc [tensor]: mean training accuracy
    """
    
    accuracies = []
    losses = []
    for X_train, y_train in train_dataset:
        with tf.GradientTape() as tape:
            # IMPLEMENT THIS FUNCTION
            pass
    mean_acc = tf.math.reduce_mean(tf.concat(accuracies, axis=0))
    mean_loss = tf.math.reduce_mean(losses)
    return mean_loss, mean_acc
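The steps of the training loop can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the toy sizes, the `toy_model`, `toy_loss` and `toy_sgd` helpers, the `lr` argument, and the random dataset standing in for GTSRB. It is one way to arrange the loop, not the reference solution.

```python
import tensorflow as tf

num_inputs, num_outputs = 8, 3                      # hypothetical toy sizes
W = tf.Variable(tf.random.normal((num_inputs, num_outputs), stddev=0.01, seed=0))
b = tf.Variable(tf.zeros(num_outputs))

def toy_model(X):
    return tf.nn.softmax(tf.matmul(X, W) + b, axis=-1)

def toy_loss(y_true, y_pred):
    # y_true is one-hot; mean negative log-likelihood over the batch
    return -tf.reduce_mean(tf.reduce_sum(y_true * tf.math.log(y_pred), axis=-1))

def toy_sgd(params, grads, lr):
    for p, g in zip(params, grads):
        p.assign_sub(lr * g)

def training_loop_sketch(train_dataset, model, loss, optimizer, lr=0.1):
    accuracies, losses = [], []
    for X, y in train_dataset:
        with tf.GradientTape() as tape:
            y_hat = model(X)                                    # forward pass
            batch_loss = loss(tf.one_hot(y, num_outputs), y_hat)
        grads = tape.gradient(batch_loss, [W, b])               # backward pass
        optimizer([W, b], grads, lr)                            # weight update
        preds = tf.argmax(y_hat, axis=-1, output_type=y.dtype)
        accuracies.append(tf.cast(tf.equal(preds, y), tf.float32))
        losses.append(batch_loss)
    return tf.reduce_mean(losses), tf.reduce_mean(tf.concat(accuracies, axis=0))

# Synthetic stand-in for the GTSRB data, served one sample per step
X = tf.random.normal((64, num_inputs), seed=1)
y = tf.argmax(X[:, :num_outputs], axis=-1, output_type=tf.int32)
dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(1)
for epoch in range(3):
    mean_loss, mean_acc = training_loop_sketch(dataset, toy_model, toy_loss, toy_sgd)
```

Only the forward pass and the loss sit inside the `tf.GradientTape` context. The gradient computation and the weight update happen after the tape has finished recording.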

#### Custom validation loop

In [None]:
def validation_loop(val_dataset, model):
    """
    validation loop
    args:
    - val_dataset [tf.data.Dataset]: validation dataset
    - model [func]: model function
    returns:
    - mean_acc [tensor]: mean validation accuracy
    """
    # IMPLEMENT THIS FUNCTION
    return mean_acc
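A possible shape for the validation loop is sketched below. The name `validation_loop_sketch`, the `constant_model` stand-in and the four-sample dataset are hypothetical. Any scaling of the inputs is assumed to happen inside the model function.

```python
import tensorflow as tf

def validation_loop_sketch(val_dataset, model):
    """Returns the mean accuracy of `model` over `val_dataset`."""
    accuracies = []
    for X, y in val_dataset:
        # No tf.GradientTape here: nothing is differentiated or updated
        preds = tf.argmax(model(X), axis=-1, output_type=y.dtype)
        accuracies.append(tf.cast(tf.equal(preds, y), tf.float32))
    return tf.reduce_mean(tf.concat(accuracies, axis=0))

# Hypothetical stand-in model that always gives class 0 the highest probability
def constant_model(X):
    return tf.tile(tf.constant([[0.7, 0.2, 0.1]]), [tf.shape(X)[0], 1])

X = tf.zeros((4, 5))
y = tf.constant([0, 1, 0, 2], dtype=tf.int32)
val_dataset = tf.data.Dataset.from_tensor_slices((X, y)).batch(2)
mean_acc = validation_loop_sketch(val_dataset, constant_model)
print(mean_acc.numpy())  # 0.5
```

Two of the four labels are class 0, so a model that always predicts class 0 scores a mean accuracy of 0.5.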

### 2.4. Evaluation on the GTSRB dataset

#### Considerations for our input data

The following `get_datasets()` method returns a tuple of [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) instances containing the training and validation datasets, respectively.

In [None]:
### From Udacity's `dataset.py`

In [None]:
def get_datasets(imdir: str) -> tuple:
    """Return the training and validation datasets.
    
    :param imdir: absolute path to the directory where the data is stored in.
    :returns: (train_dataset, val_dataset), tuple of tf.data.Dataset instances.
    """
    
    train_dataset = tf.keras.utils.image_dataset_from_directory(
                        imdir, 
                        image_size=(32, 32),
                        batch_size=256,
                        validation_split=0.1,
                        subset='training',
                        seed=123
    )
    val_dataset = tf.keras.utils.image_dataset_from_directory(
                        imdir, 
                        image_size=(32, 32),
                        batch_size=256,
                        validation_split=0.1,
                        subset='validation',
                        seed=123
    )
    return train_dataset, val_dataset

In [None]:
### Getting the training and validation data

In [None]:
imdir = os.path.join(DIR_SRC, 'GTSRB')

In [None]:
train_dataset, val_dataset = get_datasets(imdir)

#### Putting it all together

##### Initialising model parameters

In [None]:
### Defining the model input parameters

In [None]:
num_inputs = 1024*3
num_outputs = 43

In [None]:
imdir = os.path.join(DIR_SRC, 'GTSRB')
epochs = None
batch_size = None
lr = None
args = {'imdir': imdir, 'epochs': epochs, 'batch_size': batch_size, 'lr': lr}

We will use the TensorFlow built-in [`tf.Variable`](https://www.tensorflow.org/api_docs/python/tf/Variable?hl=en) tensor object to distinguish our trainable model parameters (weight, bias vectors) from otherwise static tensor objects. These `tf.Variable` objects maintain a shared, persistent state and therefore come with a few useful operations (e.g., the [`assign_sub()`](https://www.tensorflow.org/api_docs/python/tf/Variable#assign_sub) method) that we will use during training to manipulate their values. Another nice feature of trainable `tf.Variable` objects is that they are automatically watched by `tf.GradientTape`.
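The difference between a watched `tf.Variable` and a plain tensor can be seen in a small sketch. The names `v` and `c` and their values are hypothetical.

```python
import tensorflow as tf

v = tf.Variable([1.0, 2.0])   # trainable by default, so the tape watches it
c = tf.constant([1.0, 2.0])   # plain tensor, not watched unless tape.watch(c) is called

with tf.GradientTape() as tape:
    out = tf.reduce_sum(v * v) + tf.reduce_sum(c * c)

grad_v, grad_c = tape.gradient(out, [v, c])
# grad_v equals 2 * v; grad_c is None because `c` was never watched
v.assign_sub(0.1 * grad_v)    # in-place update: v <- v - 0.1 * grad_v
print(v.numpy(), grad_c)
```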

In [None]:
### Initialising the model variables (weights and bias vectors)
W = tf.Variable(tf.random.normal(shape=(num_inputs, num_outputs), mean=0, stddev=0.01))
b = tf.Variable(tf.zeros(num_outputs))

##### Performing the training and validation loops

In [None]:
### From Udacity's `training.py`

In [None]:
import logging

def get_module_logger(mod_name):
    logger = logging.getLogger(mod_name)
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s %(levelname)-8s %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)
    return logger

In [None]:
### Getting the console logger
logger = get_module_logger(__name__)
logger.info(f"Training for {args['epochs']} epochs using {args['imdir']} data")

In [None]:
### Run training and validation loop

In [None]:
for epoch in range(epochs):
    logger.info(f'Epoch {epoch}')
    ### Perform stochastic gradient descent over training data
    loss, acc = training_loop(train_dataset, model, cross_entropy, sgd)
    logger.info(f'Mean training loss: {loss}, mean training accuracy: {acc}')
    ### Compute validation set accuracy
    val_acc = validation_loop(val_dataset, model)
    logger.info(f'Mean validation accuracy: {val_acc}')

## Details

A training loop goes through each element of the training dataset and uses it to update the model's weights.
A validation loop goes through each element of the validation dataset and uses it to calculate
the metrics (e.g., accuracy). We call an **epoch** one iteration of the training loop followed by one iteration of the validation loop.

The input to your model should be normalized. You can do this by dividing the pixel values by 255: `X /= 255`.

You can run `python training.py` to train your first machine learning model!

You will need to specify the `--imdir`, e.g. `--imdir GTSRB/Final_Training/Images/`, using the provided GTSRB dataset.

## Tips

You don't need `tf.GradientTape` for the validation loop as you will not be computing gradients or updating weights. 

The `assign_sub` Variable method will be useful to perform the weights update in the sgd optimizer.

Use the `tf.one_hot` function to get the one-hot vector from the ground truth label.
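A short sketch of the conversion is shown below. The class count of 5 and the label values are hypothetical; for GTSRB the depth would be 43.

```python
import tensorflow as tf

num_classes = 5                                   # hypothetical; GTSRB has 43 classes
labels = tf.constant([2, 0])                      # integer class ids
one_hot = tf.one_hot(labels, depth=num_classes)   # shape (2, 5), dtype float32
print(one_hot.numpy())
```

Each row contains a single 1.0 at the index of its class id and 0.0 elsewhere.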

## Credits

References
* [1] Stallkamp, J. et al. (2012). Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks. 32:323-332. doi:10.1016/j.neunet.2012.02.016.

Helpful resources:
* [An overview of gradient descent optimization algorithms | S. Ruder](https://ruder.io/optimizing-gradient-descent/)