# Exercise 1.3.1 - Logistic Regression

#### By Jonathan L. Moran (jonathan.moran107@gmail.com)

From the Self-Driving Car Engineer Nanodegree programme offered at Udacity.

## Objectives

In this exercise we will implement the following functions:
* `softmax`: computes the softmax (normalised exponential function) of an input tensor;
* `cross_entropy`: calculates the cross-entropy loss between one-hot encoded prediction and ground truth vectors;
* `model`: logistic regression algorithm;
* `accuracy`: calculates the accuracy between a set of predictions and corresponding ground truth labels.

## 1. Introduction

Here is a bit of [terminology](https://developers.google.com/machine-learning/glossary/) from the Google Machine Learning Glossary before we get started:
* **Accuracy**: fraction of predictions that a classification model got right;
* **Activation function**: a function that takes in the weighted sum of all inputs from the previous layer and generates an output value to be passed onto the next layer in a neural network;
* **Cost function**: measures how well a model is performing in terms of loss over the entire dataset;
* **Logistic regression**: classification model that uses an activation function (typically a [sigmoid function](https://developers.google.com/machine-learning/glossary/#sigmoid_function)) to convert a linear model's raw predictions into a value between 0 and 1;
* **Logits**: vector of raw (non-normalised) predictions that a classification model generates;
* **Loss**: a measure of how far a model's predictions are from its ground-truth label;
* **Log loss**: a function used in binary logistic regression to compute the loss value;
* **Softmax**: a function that generates a vector of (normalised) probabilities with one value for each class.

In [1]:
### Importing required modules

In [2]:
import numpy as np
import os
import tensorflow as tf
import tensorflow.experimental.numpy as tnp
from tensorflow.keras import utils
import timeit

tnp.experimental_enable_numpy_behavior()


In [3]:
### Setting environment variables

In [4]:
ENV_COLAB = False                # True if running in Google Colab instance

In [5]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/'

In [6]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out/')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data/')

### 1.1. Softmax

The [softmax function](https://en.wikipedia.org/wiki/Softmax_function) is a generalisation of the sigmoid [logistic function](https://en.wikipedia.org/wiki/Logistic_function) to multiple dimensions. In machine learning, particularly for multi-class logistic regression, the softmax function $\phi$ converts the $k$ raw class scores (logits) $z^{(i)} = \left(z_{1}^{(i)},\ldots,z_{k}^{(i)}\right)$ of an observation $x_{i}$ into the probability of that observation belonging to each of the $j = 1,\ldots,k$ class labels (assuming the classes are mutually exclusive),

$$
\begin{align}
    P\left(y=j \ \vert \ z^{(i)}\right) = \phi_{softmax}\left(z^{(i)}\right)_{j}
    = \frac{\mathcal{e}^{z_{j}^{(i)}}}{\sum_{l=1}^{k}\mathcal{e}^{z_{l}^{(i)}}}.
    \end{align}
$$
The input $z$ is defined to be
$$
\begin{align}
    z &= w_{0}x_{0} + w_{1}x_{1} + \ldots + w_{m}x_{m} = \sum_{i=0}^{m} w_{i}x_{i} = \mathrm{w}^{\top}\mathrm{x},
    \end{align}
$$
such that $\mathrm{w}$ is the weight vector, $\mathrm{x}$ is the feature vector belonging to a single training observation, and $w_{0}$ is the bias unit (with $x_{0} = 1$).
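
The weighted sum above can be illustrated with a short NumPy sketch. The values of `w` and `x` are made up for illustration, and the sketch assumes $x_{0} = 1$ so that $w_{0}$ acts as the bias.

```python
import numpy as np

# Hypothetical weights and features, for illustration only
w = np.array([0.5, -1.0, 2.0])   # w[0] plays the role of the bias unit w_0
x = np.array([3.0, 0.5])         # raw features x_1, x_2

# Prepend x_0 = 1 so that the bias is absorbed into the dot product
x_aug = np.concatenate(([1.0], x))
z = w @ x_aug                    # z = w_0 * 1 + w_1 * x_1 + w_2 * x_2

# Equivalent form with an explicit bias term
z_explicit = w[1:] @ x + w[0]
print(z, z_explicit)
```

Both forms give the same score, which is why the bias is often written as a separate term `b` in code.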

The softmax function computes the probability $P\left(y=j \vert x_{i}; w_{j}\right)$ for each class. During training, the weights are then corrected by minimising a cost function, here the cross-entropy over the training set observations.

##### Note on numerical stability

Exponentiation in Python can be a problem for larger numbers. A Numpy `float64` value can represent a maximal number on the order of $10^{308}$, but with exponentiation in the softmax function it is possible to overshoot this number, even for fairly modest-sized inputs (as pointed out in [this](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/) post by E. Bendersky).

To handle this, we can multiply the numerator and denominator by an arbitrary constant $C$ and move it into the exponent to obtain

$$
\begin{align}
    \phi_{softmax}\left(z^{(i)}\right)_{j}
    = \frac{C\mathcal{e}^{z_{j}^{(i)}}}{\sum_{l=1}^{k}C\mathcal{e}^{z_{l}^{(i)}}} = \frac{\mathcal{e}^{z_{j}^{(i)} + \mathrm{log}\left(C\right)}}{\sum_{l=1}^{k}\mathcal{e}^{z_{l}^{(i)} + \mathrm{log}\left(C\right)}}.
    \end{align}
$$

Replacing $\mathrm{log}\left(C\right)$ with another arbitrary constant $D$, we can then select a value for $D$ as follows

$$
\begin{align}
    D = -\max\left(z_{1}, z_{2},\ldots,z_{k}\right) 
    \end{align}
$$
such that all inputs $z$ are shifted to be at most zero. Because of this, we can better avoid NaNs: exponentials of large negative numbers "saturate" to zero rather than overflowing to infinity.
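
A minimal NumPy sketch of this shift is given below. The helper name `softmax_np` is hypothetical and not part of the exercise code. It shows that subtracting the maximum leaves the result unchanged for small inputs and avoids NaNs for large ones.

```python
import numpy as np

def softmax_np(z, stable=True):
    """Softmax over the last axis; optionally shifts by the maximum first."""
    z = np.asarray(z, dtype=np.float64)
    if stable:
        z = z - np.max(z, axis=-1, keepdims=True)   # D = -max(z)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)

z = np.array([1.0, 2.0, 3.0])
# Shifting by a constant does not change the softmax output
same = np.allclose(softmax_np(z, stable=False), softmax_np(z, stable=True))

z_large = np.array([1000.0, 2000.0, 3000.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_np(z_large, stable=False)   # exp overflows to inf, inf / inf is nan
shifted = softmax_np(z_large, stable=True)      # largest exponent is now zero
print(same, naive, shifted)
```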

### 1.2. Cross-entropy

In machine learning, [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression) is often used as a loss function computed between two discrete probability distributions. Given a set of predictions $q_{i}$ and corresponding true probability values $p_{i}$ we can compute the cross-entropy loss, i.e., _log loss_ [1],

$$
\begin{align}
    H\left(p,q\right) = -\sum_{i}{p_i}\mathrm{log}q_i = -y\mathrm{log}\hat{y} - \left(1-y\right)\mathrm{log}\left(1-\hat{y}\right)
    \end{align}
$$

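which serves as a measure of dissimilarity between $p$ and $q$; the second equality holds for the binary case, where $p = \left(y, 1-y\right)$ and $q = \left(\hat{y}, 1-\hat{y}\right)$. In classification problems, the higher the entropy value, the less certain we are about the outcome (the prediction) we will get.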

For a set of $n = 1,\ldots,N$ training observations, we can compute the average of the loss function over all observations such that $H\left(p,q\right)$ becomes

$$
\begin{align}
    J(\mathrm{w}) = \frac{1}{N}\sum_{n=1}^{N}H\left(p_{n},q_{n}\right) 
    = -\frac{1}{N}\sum_{n=1}^{N} \begin{bmatrix} y_{n}\mathrm{log}\hat{y}_{n} + \left(1-y_{n}\right)\mathrm{log}\left(1-\hat{y}_{n}\right) \end{bmatrix}.
    \end{align}
$$
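
As a worked example of the averaged log loss $J(\mathrm{w})$, the NumPy sketch below compares confident predictions with uninformative ones. The function name `binary_log_loss` and the toy values are assumptions for illustration, and the predictions are assumed to lie strictly between 0 and 1.

```python
import numpy as np

def binary_log_loss(y, y_hat):
    """Average binary cross-entropy over N observations."""
    y = np.asarray(y, dtype=np.float64)
    y_hat = np.asarray(y_hat, dtype=np.float64)
    per_sample = -(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
    return per_sample.mean()

y = [1, 0, 1, 1]                                         # toy ground-truth labels
confident = binary_log_loss(y, [0.9, 0.1, 0.8, 0.7])     # mostly correct predictions
unsure = binary_log_loss(y, [0.5, 0.5, 0.5, 0.5])        # always predicts 0.5
print(confident, unsure)
```

Predicting 0.5 for every observation gives a loss of $\mathrm{log}\left(2\right)$, and predictions closer to the true labels give a smaller value.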

#### Cross-entropy cost (loss) function

In predictive modelling, cost functions are used to estimate how poorly a model is performing (the loss). In other words, cost functions measure how wrong a model is in its ability to estimate the relationship between the input variables ($X$) and output variables ($y$). For our multi-class dataset we are interested in computing the categorical cross-entropy. 

The cross-entropy loss function for a multi-class dataset can be defined for a training sample $x_{i}$ belonging to class $j$ as

$$
\begin{align}
    loss\left(x, y; w\right) &= H\left(y, \hat{y}\right) = -\sum_{l=1}^{k} y_{l}\mathrm{log}\hat{y}_{l} = -\mathrm{log}\frac{\mathcal{e}^{w_{j}^{\top}x_{i}}}{\sum_{l=1}^{k} \mathcal{e}^{w_{l}^{\top}x_{i}}}
    \end{align}
$$
where $y$ denotes the [one-hot](https://en.wikipedia.org/wiki/One-hot) encoded vector (the class labels) and $\hat{y}$ denotes the probability distribution $h\left(x_{i}\right)$ which is the scaled (softmax) logits.

The cross-entropy cost function for all observations $\left(\mathrm{X}_{i}, \mathrm{Y}_{i}\right)_{i=1}^{N}$ is then

$$
\begin{align}
    loss\left(\mathrm{X}, \mathrm{Y}; \mathrm{w}\right) = -\sum_{i=1}^{N}\sum_{j=1}^{k} {I}\left\{y_{i} = j\right\}\mathrm{log}\frac{\mathcal{e}^{w_{j}^{\top}x_{i}}}{\sum_{l=1}^{k}\mathcal{e}^{w_{l}^{\top} x_{i}}}.
    \end{align}
$$

Here, $I\{\cdot\}$ is the indicator function, which evaluates to $1$ when its argument is true and to $0$ otherwise. Dividing this sum by $N$ gives the average loss over all observations, which we refer to as the _cost_ function following this [nomenclature](https://mmuratarat.github.io/2018-12-21/cross-entropy#difference-between-objective-function-cost-function-and-loss-function).
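
The role of the indicator function can be sketched in NumPy as follows. The probabilities and labels are hypothetical. The one-hot matrix plays the part of $I\{y_{i} = j\}$, so the double sum reduces to picking the predicted probability of the true class for each observation.

```python
import numpy as np

# Hypothetical softmax outputs for N=3 observations and k=4 classes
y_hat = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.6]])
labels = np.array([0, 1, 3])             # true class of each observation
one_hot = np.eye(4)[labels]              # rows of the identity act as one-hot vectors

# Double sum, with the one-hot entries acting as the indicator
total_sum = -np.sum(one_hot * np.log(y_hat))
# Equivalent: index the probability of the true class directly
total_pick = -np.sum(np.log(y_hat[np.arange(len(labels)), labels]))
# Average over the N observations
mean_loss = total_pick / len(labels)
print(total_sum, total_pick, mean_loss)
```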

## 2. Programming Task

### 2.1. Softmax

In this exercise, you have to implement 4 different functions:

* `softmax`: compute the softmax of a vector. This function takes as input a tensor and outputs a discrete probability distribution.

In [7]:
### From Udacity's `logistic.py`

In [8]:
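def softmax(logits, stable=False):
    """Returns the softmax probability distribution.
    
    :param logits: a 1xN tf.Tensor of logits; for a batch,
        each row is normalised independently.
    :param stable: optional, flag indicating whether
        or not to shift the logits by their maximum
        before exponentiating (numerical stability).
    :returns: soft_logits, a tf.Tensor of real values in
        range (0,1) that sum up to 1.0 along the last axis.
    """
    
    assert isinstance(logits, tf.Tensor)
    if stable:
        # Subtract the per-row maximum so the largest exponent is zero
        logits = tf.subtract(logits, tf.reduce_max(logits, axis=-1, keepdims=True))
    soft_logits = tf.math.exp(logits)
    # Normalise over the class (last) axis only
    soft_logits /= tf.math.reduce_sum(soft_logits, axis=-1, keepdims=True)
    return soft_logits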

In [9]:
### Testing the softmax function with N=7 predictions
x = [1.0, 2.0, 3.0, 1.0, 2.0, 2.0, 3.0]
### Converting to tf.Tensor object
x = tf.constant(x, dtype=tf.float64)
### Computing the softmax function and printing results as Numpy array
x_scaled = softmax(x)
x_scaled.numpy()


array([0.04010756, 0.10902364, 0.29635698, 0.04010756, 0.10902364,
       0.10902364, 0.29635698])

In [10]:
### Testing the softmax function with N=3 large values (without normalising)
x_large = tf.constant([1000, 2000, 3000], dtype=tf.float64)
softmax(x_large)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([nan, nan, nan])>

In [11]:
### Testing the softmax function with N=3 large values (with normalising)
x_large = tf.constant([1000, 2000, 3000], dtype=tf.float64)
softmax(x_large, stable=True)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([0., 0., 1.])>

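**Note**: this output is not ideal either, since mathematically the softmax function never returns exactly zero. Here the exponentials of the shifted inputs ($\mathcal{e}^{-2000}$ and $\mathcal{e}^{-1000}$) underflow to zero in `float64`, which is acceptable because the true values are extremely close to zero anyway.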

### 2.2. Cross-entropy

* `cross_entropy`: calculate the cross entropy loss given a vector of predictions (after softmax) and a vector of ground truth (one-hot vector).

In [12]:
### From Udacity's `logistic.py`

In [13]:
def cross_entropy(scaled_logits, one_hot, use_numpy=True):
    """Returns the cross-entropy loss.
    
    :param scaled_logits: an NxC tf.Tensor of scaled softmax
        distribution values, [n_samples x n_classes].
    :param one_hot: an NxC tf.Tensor of one-hot encoded 
        ground truth labels, [n_samples x n_classes].
    :param use_numpy: optional, uses Numpy multidimensional
        array indexing on type-casted tf.experimental.numpy
        ndarrays, uses boolean masking if False.
    :returns: loss, a scalar tf.Tensor with the mean cross-entropy loss.
    """
    
    assert isinstance(scaled_logits, tf.Tensor)
    assert isinstance(one_hot, tf.Tensor)
    n_samples = one_hot.shape[0]
    # For each sample, pick the probability value from the distribution
    # that corresponds to the true class label
    if use_numpy:
        class_labels = tf.math.argmax(one_hot, axis=1)
        preds = scaled_logits[tnp.arange(n_samples), class_labels]
    else:
        preds = tf.boolean_mask(scaled_logits, tf.cast(one_hot, tf.bool))
    # Taking the negative log-likelihood
    log_likelihood = -tf.math.log(preds)
    # Normalising by the sample size
    loss = tf.math.reduce_sum(log_likelihood) / n_samples
    return loss

In [14]:
# Creating our ground-truth labels and using one-hot encoding
y = tf.constant([2, 2, 3, 0, 2, 1, 3])
y_one_hot = tf.one_hot(y, depth=4)
y_one_hot

<tf.Tensor: shape=(7, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

In [15]:
### Creating pseudo-batched data by repeating 'predictions'
X_scaled = tf.stack([x_scaled] * 4, axis=1)
X_scaled

<tf.Tensor: shape=(7, 4), dtype=float64, numpy=
array([[0.04010756, 0.04010756, 0.04010756, 0.04010756],
       [0.10902364, 0.10902364, 0.10902364, 0.10902364],
       [0.29635698, 0.29635698, 0.29635698, 0.29635698],
       [0.04010756, 0.04010756, 0.04010756, 0.04010756],
       [0.10902364, 0.10902364, 0.10902364, 0.10902364],
       [0.10902364, 0.10902364, 0.10902364, 0.10902364],
       [0.29635698, 0.29635698, 0.29635698, 0.29635698]])>

**Note**: this data does not make any sense in the context of this problem; we are simply testing the output of our cross-entropy loss function.

In [16]:
### Testing the cross-entropy loss function with N=1 batch
loss = cross_entropy(X_scaled, y_one_hot)
loss

<tf.Tensor: shape=(), dtype=float64, numpy=2.216190530018549>
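
As a cross-check of the loss value printed above, the same computation can be sketched with NumPy only. This is an independent re-derivation under the same toy inputs, not the exercise implementation; the result should agree with the TensorFlow value up to floating-point round-off.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 1.0, 2.0, 2.0, 3.0])
p = np.exp(x) / np.exp(x).sum()          # same softmax values as `x_scaled`
X = np.stack([p] * 4, axis=1)            # pseudo-batch of shape (7, 4)
y = np.array([2, 2, 3, 0, 2, 1, 3])      # ground-truth class indices

picked = X[np.arange(len(y)), y]         # probability assigned to each true class
loss = -np.log(picked).mean()            # mean negative log-likelihood
print(loss)
```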

### 2.3. Logistic Regression model

* `model`: takes a batch of images (stack of images along the first dimension) and feeds it through the logistic regression model.

In [17]:
### From Udacity's `logistic.py`

In [18]:
def model(X, W, b):
    """Performs one step of the logistic regression model.
    
    :param X: tf.Tensor object, a training observation
        i.e., a single HxWx3 RGB image.
    :param W: tf.Variable object, the weight matrix
        [n_inputs x n_classes].
    :param b: tf.Variable object, the bias vector [n_classes].
    :returns: tf.Tensor, the softmax probability distribution.
    """
    
    assert isinstance(X, tf.Tensor)
    assert isinstance(W, tf.Variable)
    assert isinstance(b, tf.Variable)
    # Compute the product between flattened input and weight vectors
    Z = tf.matmul(tf.reshape(X, shape=(-1, W.shape[0])), W)
    # Add the bias term
    Z += b
    # Return the softmax probabilities P
    return softmax(Z)
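
The shapes involved in this forward pass can be traced with a NumPy sketch. The batch size of 4, the random inputs, and the row-wise normalisation are assumptions made for illustration; each image is flattened to a row, multiplied by the weight matrix, and normalised over the class axis.

```python
import numpy as np

rng = np.random.default_rng(0)
num_inputs, num_outputs = 28 * 28 * 3, 10

X = rng.uniform(size=(4, 28, 28, 3))                       # hypothetical batch of 4 images
W = rng.normal(0.0, 0.01, size=(num_inputs, num_outputs))  # weight matrix
b = np.zeros(num_outputs)                                  # bias vector

Z = X.reshape(-1, num_inputs) @ W + b                      # logits, shape (4, 10)
Z = Z - Z.max(axis=1, keepdims=True)                       # shift for numerical stability
P = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)       # softmax over each row
print(P.shape, P.sum(axis=1))
```

Each row of `P` is a probability distribution over the 10 classes for one image.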

### 2.4. Prediction accuracy

* `accuracy`: given a vector of predictions and a vector of ground truth, calculate the accuracy.

In [19]:
### From Udacity's `logistic.py`

In [20]:
def accuracy(y_hat, y):
    """Calculates the average correct predictions.

    :param y_hat: tf.Tensor, NxC tensor-like object of 
        model's predictions [n_samples x n_classes].
    :param y: tf.Tensor, N-dimensional tensor of
        ground truth class labels (not one-hot encoded).
    returns: acc, a 1x1 scalar tf.Tensor-like object
        with the accuracy score (correct / total predictions).
    """
    
    assert isinstance(y, tf.Tensor) and isinstance(y_hat, tf.Tensor)
    # Get predicted labels with highest probabilities
    y_preds = tf.cast(tf.math.argmax(y_hat, axis=1), dtype=y.dtype)
    # Get number of correct predictions
    n_correct = tf.math.count_nonzero(tf.cast(tf.math.equal(y_preds, y), dtype=tf.int32))
    # Compute average correct predictions
    acc = n_correct / y_hat.shape[0]
    return acc
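
The same idea can be written compactly in NumPy. The toy scores below are assumptions for illustration: the predicted label is the index of the largest score in each row, and the accuracy is the mean of the element-wise matches.

```python
import numpy as np

# Toy scores for 2 observations and 5 classes
y_hat = np.array([[0.8, 0.2, 0.5, 0.2, 5.0],
                  [0.8, 0.2, 0.5, 0.2, 5.0]])
y = np.array([4, 1])                 # ground-truth class labels

y_pred = y_hat.argmax(axis=1)        # predicted class per row
acc = (y_pred == y).mean()           # fraction of correct predictions
print(y_pred, acc)
```

Only the first prediction matches its label here, so half of the predictions are correct.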

### 2.5. Evaluation

We will check the above functions against the provided test values given by Udacity.

In [21]:
### From Udacity's `utils.py`

In [22]:
def check_softmax(func):
    logits = tf.constant([[0.5, 1.0, 2.0, 0.3, 4.0]])
    tf_soft = tf.nn.softmax(logits)
    soft = func(logits)
    l1_norm = tf.norm(tf_soft - soft, ord=1)
    assert l1_norm < 1e-5, 'Softmax calculation is wrong'
    print('Softmax implementation is correct!')


def check_ce(func):
    logits = tf.constant([[0.5, 1.0, 2.0, 0.3, 4.0]])
    scaled_logits = tf.nn.softmax(logits)
    one_hot = tf.constant([[0, 0, 0, 0, 1.0]])
    tf_ce = tf.nn.softmax_cross_entropy_with_logits(one_hot, logits)
    ce = func(scaled_logits, one_hot)
    l1_norm = tf.norm(tf_ce - ce, ord=1)
    assert l1_norm < 1e-5, 'CE calculation is wrong'
    print('CE implementation is correct!')


def check_model(func):
    # only check the output size here
    X = tf.random.uniform([28, 28, 3])
    num_inputs = 28*28*3
    num_outputs = 10
    W = tf.Variable(tf.random.normal(shape=(num_inputs, num_outputs),
                                    mean=0, stddev=0.01))
    b = tf.Variable(tf.zeros(num_outputs))
    out = func(X, W, b)
    assert out.shape == (1, 10), 'Model is wrong!'
    print('Model implementation is correct!')


def check_acc(func):
    y_hat = tf.constant([[0.8, 0.2, 0.5, 0.2, 5.0], [0.8, 0.2, 0.5, 0.2, 5.0]]) 
    y = tf.constant([4, 1])
    acc = func(y_hat, y)
    assert acc == tf.cast(tf.constant(0.5), dtype=acc.dtype), 'Accuracy calculation is wrong!'
    print('Accuracy implementation is correct!') 

    
def compute_ce(func, use_numpy):
    logits = tf.constant([[0.5, 1.0, 2.0, 0.3, 4.0]])
    scaled_logits = tf.nn.softmax(logits)
    one_hot = tf.constant([[0, 0, 0, 0, 1.0]])
    ce = func(scaled_logits, one_hot, use_numpy)

In [23]:
### Testing the `softmax` function

In [24]:
check_softmax(softmax)

Softmax implementation is correct!


In [25]:
### Testing the cross-entropy loss function

In [26]:
check_ce(cross_entropy)

CE implementation is correct!


In [27]:
# Testing average execution time using `tf.boolean_mask`, on m2 chip
timeit.timeit(lambda: compute_ce(cross_entropy, use_numpy=False), number=1000) / 1000

0.0019366699740057812

In [28]:
# Testing average execution time using Numpy indexing, on m2 chip
timeit.timeit(lambda: compute_ce(cross_entropy, use_numpy=True), number=1000) / 1000

0.002889374829013832

In [29]:
### Testing the logistic regression `model` function

In [30]:
check_model(model)

Model implementation is correct!


In [31]:
### Testing the `accuracy` scoring function

In [32]:
check_acc(accuracy)

Accuracy implementation is correct!


## 3. Closing Remarks

##### Alternatives
* Use [`tf.nn.softmax_cross_entropy_with_logits`](https://www.tensorflow.org/api_docs/python/tf/nn/softmax_cross_entropy_with_logits) instead of manually-computing the softmax and cross-entropy loss functions;
   * **Note**: in TF2, the Keras equivalent is [`tf.keras.losses.CategoricalCrossentropy`](https://www.tensorflow.org/api_docs/python/tf/keras/losses/CategoricalCrossentropy) with `from_logits=True`;
* Use [`tf.boolean_mask`](https://www.tensorflow.org/api_docs/python/tf/boolean_mask) instead of Numpy multidimensional array indexing in `cross_entropy` to "mask" correct class prediction probabilities;
* Regularisation by adding an L2 penalty on the weight vector, scaled by an `alpha` hyperparameter, to the cross-entropy loss;
* Perform maximum likelihood estimation and optimisation using the [`tf.compat.v1.train.GradientDescentOptimizer`](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/GradientDescentOptimizer) and [`.minimize(loss)`](https://www.tensorflow.org/api_docs/python/tf/compat/v1/train/GradientDescentOptimizer#minimize) method;
   * **Note**: this TF1.x optimizer has been replaced with [`tf.keras.optimizers.SGD`](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/SGD) in TF2.

##### Extensions of task
* Implement a `fit` function that performs the gradient descent optimisation over a number of epochs;
* Alternatively, use a `tf.Session` to iterate over model computations.
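
One possible shape for such a `fit` function is sketched below in NumPy rather than TensorFlow. The function names, learning rate, epoch count, and toy data are all assumptions for illustration. The sketch uses full-batch gradient descent and the fact that the gradient of the mean cross-entropy with respect to the logits is $(P - Y)/N$, where $P$ holds the softmax outputs and $Y$ the one-hot labels.

```python
import numpy as np

def softmax_rows(Z):
    """Numerically stable softmax over each row."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit(X, y, n_classes, lr=0.5, epochs=200):
    """Full-batch gradient descent on the mean cross-entropy loss."""
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                  # one-hot labels
    losses = []
    for _ in range(epochs):
        P = softmax_rows(X @ W + b)
        losses.append(-np.log(P[np.arange(n_samples), y]).mean())
        G = (P - Y) / n_samples               # gradient w.r.t. the logits
        W -= lr * (X.T @ G)                   # gradient step on the weights
        b -= lr * G.sum(axis=0)               # gradient step on the bias
    return W, b, losses

# Toy data made up for illustration: 6 points, 2 features, 3 classes
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0],
              [0.9, 1.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1, 2, 2])
W, b, losses = fit(X, y, n_classes=3)
print(losses[0], losses[-1])
```

With zero-initialised weights the first loss equals $\mathrm{log}\left(3\right)$ (a uniform guess over three classes), and it decreases as training proceeds.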

## 4. Future Work

- ✅ Run model on actual training data (see [Exercise 1.3.2](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Computer-Vision/Exercises/1-3-2-Stochastic-Gradient-Descent/2022-08-29-Stochastic-Gradient-Descent.ipynb) and [Exercise 1.4.2](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Computer-Vision/Exercises/1-4-2-Building-Custom-CNNs/2022-09-12-Building-Custom-Convolutional-Neural-Networks.ipynb));
- ✅ Use built-in TensorFlow methods for further performance optimisations (see [Exercise 1.4.2](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Computer-Vision/Exercises/1-4-2-Building-Custom-CNNs/2022-09-12-Building-Custom-Convolutional-Neural-Networks.ipynb));
- ✅ Encapsulate current model with `fit` function and perform mini-batched or stochastic gradient descent (see [Exercise 1.3.2](https://github.com/jonathanloganmoran/ND0013-Self-Driving-Car-Engineer/blob/main/1-Computer-Vision/Exercises/1-3-2-Stochastic-Gradient-Descent/2022-08-29-Stochastic-Gradient-Descent.ipynb)).

## Credits
This assignment was prepared by Thomas Hossler and Michael Virgo et al., Winter 2021 (link [here](https://www.udacity.com/course/self-driving-car-engineer-nanodegree--nd0013)).

References
* [1] Ji, S. Xie, Y. Logistic Regression: From Binary to Multi-Class. http://people.tamu.edu/~sji/classes/LR.pdf


Helpful resources:
* [Softmax Regression and How is it Related to Logistic Regression? | KDnuggets](https://www.kdnuggets.com/2016/07/softmax-regression-related-logistic-regression.html)
* [The Softmax function and its derivative | E. Bendersky](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/)
* [Softmax and Cross Entropy Loss | P. Dahal](https://deepnotes.io/softmax-crossentropy)
* [Multinomial Regression with TensorFlow | YouTube](https://www.youtube.com/watch?v=2JiXktBn_2M)
* [Logistic regression 5.2: Multiclass - Softmax regression | YouTube](https://www.youtube.com/watch?v=hYBwBmojXoU)
* [Cross Entropy for TensorFlow | M. Murat ARAT](https://mmuratarat.github.io/2018-12-21/cross-entropy#difference-between-objective-function-cost-function-and-loss-function)