# Exercise 1.3.1 - Logistic Regression

#### By Jonathan L. Moran (jonathan.moran107@gmail.com)

From the Self-Driving Car Engineer Nanodegree programme offered at Udacity.

## Objectives

In this exercise we will implement the following functions:
* `softmax`: computes the softmax (normalised exponential function) of an input tensor;
* `cross_entropy`: calculates the cross-entropy loss between softmax predictions and one-hot encoded ground truth vectors;
* `model`: logistic regression algorithm;
* `accuracy`: calculates the accuracy between a set of predictions and corresponding ground truth labels.

## 1. Introduction

In [1]:
### Importing required modules

In [2]:
import numpy as np
import os
import tensorflow as tf
import tensorflow.experimental.numpy as tnp
from tensorflow.keras import utils

In [3]:
### Setting environment variables

In [4]:
ENV_COLAB = False                # True if running in Google Colab instance

In [5]:
# Root directory
DIR_BASE = '' if not ENV_COLAB else '/content/'

In [6]:
# Subdirectory to save output files
DIR_OUT = os.path.join(DIR_BASE, 'out/')
# Subdirectory pointing to input data
DIR_SRC = os.path.join(DIR_BASE, 'data/')

### 1.1. Softmax

The [softmax function](https://en.wikipedia.org/wiki/Softmax_function) is a generalisation of the sigmoid [logistic function](https://en.wikipedia.org/wiki/Logistic_function) to multiple dimensions. In machine learning, particularly for multi-class logistic regression, the softmax function $\phi$ maps the net inputs of an observation $x_{i}$ to the probability of it belonging to each of the $j = 1,\ldots,k$ class labels (assuming the classes are mutually exclusive),

$$
\begin{align}
    P\left(y=j \ \vert \ z_{i}\right) = \phi_{softmax}\left(z_{i}\right)_{j} 
    = \frac{e^{z_{i,j}}}{\sum_{l=1}^{k} e^{z_{i,l}}},
    \end{align}
$$
where $z_{i,j}$ is the net input of observation $x_{i}$ for class $j$. Each net input $z$ is defined to be
$$
\begin{align}
    z &= w_{0}x_{0} + w_{1}x_{1} + \ldots + w_{m}x_{m} = \sum_{i=0}^{m} w_{i}x_{i} = \mathrm{w}^{\top}\mathrm{x}.
    \end{align}
$$
such that $\mathrm{w}$ is the weight vector, $\mathrm{x}$ is the feature vector belonging to a single training observation, and $w_{0}$ is the bias unit (with $x_{0} = 1$).

The softmax function computes the probability $P\left(y=j \vert x_{i}; w_{j}\right)$ for each class. During training, the weights are then corrected by minimising a cost function, the cross-entropy over the training set observations.
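
The definitions above can be sketched in plain Python, independently of TensorFlow. This is a minimal illustration rather than the exercise solution: the helper name `net_input` and the weight and feature values are hypothetical, chosen only to show one net input per class being turned into a probability distribution.

```python
import math

def net_input(w, x):
    """Dot product w^T x; x[0] is assumed to be 1 so that w[0] acts as the bias."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))

def softmax(z):
    """Normalised exponential of a list of net inputs, one entry per class."""
    exps = [math.exp(z_j) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical weights for k = 3 classes (one weight vector per class)
W = [[0.1, 0.5, -0.3],
     [0.0, -0.2, 0.8],
     [-0.1, 0.3, 0.3]]
# A single observation with two features, preceded by the bias input x_0 = 1
x = [1.0, 2.0, 1.0]

z = [net_input(w_j, x) for w_j in W]   # one net input per class
probs = softmax(z)                     # class probabilities

print("net inputs:   ", z)
print("probabilities:", probs)
print("sum:          ", sum(probs))
```

Each output lies strictly between 0 and 1 and the outputs sum to one, so the list can be read as a discrete distribution over the three classes.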

##### Note on numerical stability

Exponentiation in Python can be a problem for large numbers. A NumPy `float64` value can represent a maximal number on the order of $10^{308}$, but with exponentiation in the softmax function it is possible to overshoot this number, even for fairly modest-sized inputs (as pointed out in [this](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/) post by E. Bendersky).

To handle this, we can multiply the numerator and denominator by an arbitrary constant $C$ and move it into the exponent to obtain

$$
\begin{align}
    \phi_{softmax}\left(z_{i}\right)_{j} 
    = \frac{C e^{z_{i,j}}}{\sum_{l=1}^{k} C e^{z_{i,l}}} = \frac{e^{z_{i,j} + \mathrm{log}\left(C\right)}}{\sum_{l=1}^{k} e^{z_{i,l} + \mathrm{log}\left(C\right)}}.
    \end{align}
$$

Replacing $\mathrm{log}\left(C\right)$ with another arbitrary constant $D$, we can then select a value for $D$ as the negated maximum over the $k$ net inputs of the observation,

$$
\begin{align}
    D = -\max\left(z_{i,1}, z_{i,2},\ldots,z_{i,k}\right) 
    \end{align}
$$
such that the largest input is shifted to zero and all other inputs become _negative_. This avoids NaNs, because exponentials of large negative values "saturate" to zero rather than overflowing to infinity.
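
The effect of the shift can be sketched in plain Python. The function names `softmax_naive` and `softmax_stable` are hypothetical. Note that, as an artefact of using the standard library here, `math.exp` raises an `OverflowError` for very large arguments instead of returning infinity.

```python
import math

def softmax_naive(z):
    """Softmax without any shift; overflows for large inputs."""
    exps = [math.exp(z_j) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    """Softmax with the shift D = -max(z), so every exponent is <= 0."""
    d = -max(z)
    exps = [math.exp(z_j + d) for z_j in z]
    total = sum(exps)   # at least 1.0, because the largest term is exp(0)
    return [e / total for e in exps]

z_small = [1.0, 2.0, 3.0]
z_large = [1000.0, 2000.0, 3000.0]

# For modest inputs the shift does not change the result (up to rounding)
print(softmax_naive(z_small))
print(softmax_stable(z_small))

# For large inputs the naive version fails, while the shifted version does not
try:
    softmax_naive(z_large)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

print("naive overflowed:", naive_overflowed)
print(softmax_stable(z_large))
```

The shifted denominator always contains the term $e^{0} = 1$, so it can never be zero and the division is always well defined.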

### 1.2. Cross-entropy

In machine learning, [cross-entropy](https://en.wikipedia.org/wiki/Cross_entropy#Cross-entropy_loss_function_and_logistic_regression) is often used as a loss function computed between two probability distributions. Given a set of predictions $q_{i}$ and corresponding true probability values $p_{i}$ we can compute the cross-entropy loss, i.e., _log loss_ [1],

$$
\begin{align}
    H\left(p,q\right) = -\sum_{i}{p_i}\mathrm{log}q_i = -y\mathrm{log}\hat{y} - \left(1-y\right)\mathrm{log}\left(1-\hat{y}\right)
    \end{align}
$$

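where the second equality holds for the binary case with true label $y$ and predicted probability $\hat{y}$. The cross-entropy serves as a measure of dissimilarity between $p$ and $q$.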

For a set of $n = 1,\ldots,N$ training observations, we can compute the average of the loss function over all observations such that $H\left(p,q\right)$ becomes

$$
\begin{align}
    J(\mathrm{w}) = \frac{1}{N}\sum_{n=1}^{N}H\left(p_{n},q_{n}\right) 
    = -\frac{1}{N}\sum_{n=1}^{N} \begin{bmatrix} y_{n}\mathrm{log}\hat{y}_{n} + \left(1-y_{n}\right)\mathrm{log}\left(1-\hat{y}_{n}\right) \end{bmatrix}.
    \end{align}
$$
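
As a minimal sketch of the averaged loss $J$, the snippet below evaluates the binary form in plain Python. The function name `binary_cross_entropy` and the label and prediction values are hypothetical, and the predictions are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Average log loss over N observations with labels in {0, 1}."""
    n = len(y_true)
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        total += y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat)
    return -total / n

y_true = [1, 0, 1, 1]
y_good = [0.9, 0.1, 0.8, 0.7]   # predictions close to the labels
y_bad = [0.2, 0.9, 0.3, 0.4]    # predictions far from the labels

loss_good = binary_cross_entropy(y_true, y_good)
loss_bad = binary_cross_entropy(y_true, y_bad)

print("loss (good predictions):", loss_good)
print("loss (bad predictions): ", loss_bad)
```

Predictions that agree with the labels give a smaller loss, which is what makes the quantity usable as a training objective.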

#### Cross-entropy loss function

The cross-entropy loss function is defined for a training sample $x_{i}$ belonging to class $j$ as

$$
\begin{align}
    loss\left(x, y; w\right) &= H\left(y, \hat{y}\right) = -\sum_{c=1}^{k} y_{c}\mathrm{log}\hat{y}_{c} = -\mathrm{log}\frac{e^{w_{j}^{\top}x_{i}}}{\sum_{l=1}^{k} e^{w_{l}^{\top}x_{i}}}
    \end{align}
$$
where $y$ denotes the [one-hot](https://en.wikipedia.org/wiki/One-hot) encoded ground truth vector and $\hat{y}$ denotes the predicted probability distribution $h\left(x_{i}\right)$.

The cross-entropy loss function for all observations $\left(\mathrm{X}_{i}, \mathrm{Y}_{i}\right)_{i=1}^{N}$ is then

$$
\begin{align}
    loss\left(\mathrm{X}, \mathrm{Y}; \mathrm{w}\right) = -\sum_{i=1}^{N}\sum_{j=1}^{k} I\left[y_{i} = j\right]\mathrm{log}\frac{e^{w_{j}^{\top}x_{i}}}{\sum_{l=1}^{k} e^{w_{l}^{\top} x_{i}}}.
    \end{align}
$$
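
The role of the indicator $I\left[y_{i} = j\right]$ can be illustrated with a small plain-Python sketch. The function names and the example net inputs are hypothetical, the net inputs $w_{j}^{\top}x_{i}$ are assumed to be precomputed, and class labels are 0-based here for convenience.

```python
import math

def softmax(z):
    """Softmax of a list of net inputs, one entry per class."""
    exps = [math.exp(z_l) for z_l in z]
    total = sum(exps)
    return [e / total for e in exps]

def loss_indicator(logits, labels):
    """Double sum over observations i and classes j with the indicator I[y_i = j]."""
    loss = 0.0
    for z, y in zip(logits, labels):
        probs = softmax(z)
        for j, p in enumerate(probs):
            indicator = 1.0 if y == j else 0.0
            loss -= indicator * math.log(p)
    return loss

def loss_true_class(logits, labels):
    """Equivalent form: only the true-class term of each observation survives."""
    return -sum(math.log(softmax(z)[y]) for z, y in zip(logits, labels))

# Hypothetical net inputs for N = 3 observations and k = 3 classes
logits = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [0.0, 0.0, 0.0]]
labels = [0, 1, 2]

total_indicator = loss_indicator(logits, labels)
total_true_class = loss_true_class(logits, labels)

print(total_indicator, total_true_class)
```

Both forms return the same value because the indicator zeroes out every term except the one belonging to the true class.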

## 2. Programming Task

### 2.1. Softmax

In this exercise, you have to implement 4 different functions:

* `softmax`: compute the softmax of a vector. This function takes as input a tensor and outputs a discrete probability distribution.

In [7]:
### From Udacity's `logistic.py`

In [8]:
def softmax(logits, stable=False):
    """Returns the softmax probability distribution.
    
    :param logits: a 1xN tf.Tensor of logits.
    :param stable: optional, flag indicating whether
        or not to shift the logits by their maximum
        for numerical stability.
    :returns: soft_logits, a 1xN tf.Tensor of real 
        values in range (0,1) that sum up to 1.0.
    """
    
    assert isinstance(logits, tf.Tensor)
    if stable:
        logits = tf.subtract(logits, tf.reduce_max(logits))
    soft_logits = tf.math.exp(logits)
    soft_logits /= tf.math.reduce_sum(soft_logits)
    return soft_logits

In [9]:
### Testing the softmax function with N=7 predictions
x = [1.0, 2.0, 3.0, 1.0, 2.0, 2.0, 3.0]
### Converting to tf.Tensor object
x = tf.constant(x, dtype=tf.float64)
### Computing the softmax function and printing results as Numpy array
x_scaled = softmax(x)
x_scaled.numpy()


array([0.04010756, 0.10902364, 0.29635698, 0.04010756, 0.10902364,
       0.10902364, 0.29635698])

In [10]:
### Testing the softmax function with N=3 large values (without normalising)
x_large = tf.constant([1000, 2000, 3000], dtype=tf.float64)
softmax(x_large)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([nan, nan, nan])>

In [11]:
### Testing the softmax function with N=3 large values (with normalising)
x_large = tf.constant([1000, 2000, 3000], dtype=tf.float64)
softmax(x_large, stable=True)

<tf.Tensor: shape=(3,), dtype=float64, numpy=array([0., 0., 1.])>

**Note**: this output is not ideal either: in exact arithmetic the softmax function never returns exactly zero, and the zeros here are the result of floating-point underflow. However, for inputs this far apart the true values are extremely close to zero anyway.

### 2.2. Cross-entropy

* `cross_entropy`: calculate the cross-entropy loss given a batch of predictions (after softmax) and the corresponding one-hot encoded ground truth labels.

In [12]:
### From Udacity's `logistic.py`

In [13]:
def cross_entropy(scaled_logits, one_hot):
    """Returns the cross-entropy loss.
    
    :param scaled_logits: an NxC tf.Tensor of scaled softmax
        distribution values, [n_samples x n_classes].
    :param one_hot: an NxC tf.Tensor of one-hot encoded 
        ground truth labels, [n_samples x n_classes].
    :returns: loss, a scalar tf.Tensor with the mean cross-entropy loss.
    """
    
    assert isinstance(scaled_logits, tf.Tensor)
    assert isinstance(one_hot, tf.Tensor)
    assert scaled_logits.shape == one_hot.shape
    n_samples = one_hot.shape[0]
    class_labels = tf.math.argmax(one_hot, axis=1)
    # For each sample, pick the probability value from the distribution
    # that corresponds to the true class label
    preds = tnp.asarray(scaled_logits)[range(n_samples), class_labels]
    log_likelihood = -tf.math.log(preds)
    loss = tf.math.reduce_sum(log_likelihood) / n_samples
    return loss

In [14]:
# Creating our ground-truth labels and using one-hot encoding
y = tf.constant([2.0, 2.0, 3.0, 0.0, 2.0, 1.0, 3.0])
y_one_hot = tf.constant(tf.keras.utils.to_categorical(y))
y_one_hot

<tf.Tensor: shape=(7, 4), dtype=float32, numpy=
array([[0., 0., 1., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.]], dtype=float32)>

In [15]:
### Creating pseudo-batched data by repeating 'predictions'
X_scaled = tf.stack([x_scaled] * 4, axis=1)
X_scaled

<tf.Tensor: shape=(7, 4), dtype=float64, numpy=
array([[0.04010756, 0.04010756, 0.04010756, 0.04010756],
       [0.10902364, 0.10902364, 0.10902364, 0.10902364],
       [0.29635698, 0.29635698, 0.29635698, 0.29635698],
       [0.04010756, 0.04010756, 0.04010756, 0.04010756],
       [0.10902364, 0.10902364, 0.10902364, 0.10902364],
       [0.10902364, 0.10902364, 0.10902364, 0.10902364],
       [0.29635698, 0.29635698, 0.29635698, 0.29635698]])>

In [16]:
### Testing the cross-entropy loss function with N=1 batch
loss = cross_entropy(X_scaled, y_one_hot)
loss

<tf.Tensor: shape=(), dtype=float64, numpy=2.216190530018549>

### 2.3. Logistic Regression model

* `model`: takes a batch of images (stack of images along the first dimension) and feeds it through the logistic regression model.

In [17]:
### From Udacity's `logistic.py`

In [18]:
def model(X, W, b):
    """
    logistic regression model
    args:
    - X [tensor]: input HxWx3
    - W [tensor]: weights
    - b [tensor]: bias
    returns:
    - output [tensor]
    """
    # IMPLEMENT THIS FUNCTION
    return 
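
The stub above is left for the reader to implement. As a rough guide to the computation it describes, here is a plain-Python sketch of a logistic regression forward pass. It is not Udacity's solution and does not use TensorFlow. The names `flatten` and `model_sketch`, the nested-list image layout, and the shapes of `W` (features x classes) and `b` (one bias per class) are all assumptions made for illustration.

```python
import math

def flatten(image):
    """Flatten a nested H x W x C list of pixel values into one feature list."""
    return [value for row in image for pixel in row for value in pixel]

def model_sketch(X, W, b):
    """Return an N x K list of class probabilities.

    X: list of N images (nested H x W x C lists); assumed layout.
    W: D x K nested list of weights, with D = H * W * C; assumed layout.
    b: list of K biases.
    """
    outputs = []
    for image in X:
        x = flatten(image)
        # Net input per class: z_k = sum_d x_d * W[d][k] + b[k]
        logits = [sum(x_d * W[d][k] for d, x_d in enumerate(x)) + b[k]
                  for k in range(len(b))]
        # Row-wise softmax, shifted by the maximum for numerical stability
        shift = max(logits)
        exps = [math.exp(z - shift) for z in logits]
        total = sum(exps)
        outputs.append([e / total for e in exps])
    return outputs

# Two hypothetical 1x2 RGB "images" and K = 2 classes
X = [[[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]],
     [[[0.9, 0.8, 0.7], [0.6, 0.5, 0.4]]]]
W_zero = [[0.0, 0.0] for _ in range(6)]
b_zero = [0.0, 0.0]

y_hat = model_sketch(X, W_zero, b_zero)
print(y_hat)
```

With all-zero weights and biases every class receives the same net input, so each row of the output is the uniform distribution. A TensorFlow implementation would typically replace the loops with a reshape and a matrix multiplication.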

### 2.4. Prediction accuracy

* `accuracy`: given a vector of predictions and a vector of ground truth, calculate the accuracy.

In [19]:
### From Udacity's `logistic.py`

In [20]:
def accuracy(y_hat, Y):
    """
    calculate accuracy
    args:
    - y_hat [tensor]: NxC tensor of model predictions
    - Y [tensor]: N tensor of ground truth classes
    returns:
    - acc [tensor]: accuracy
    """
    # IMPLEMENT THIS FUNCTION
    return acc
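
The accuracy computation described above can be sketched in plain Python as follows. This is an illustration, not the exercise solution: the name `accuracy_sketch` and the example values are hypothetical, and the ground truth is assumed to hold 0-based integer class indices.

```python
def accuracy_sketch(y_hat, y):
    """Fraction of rows in y_hat whose most probable class matches the label.

    y_hat: N x C nested list of predicted class probabilities.
    y: list of N integer class labels (0-based, assumed).
    """
    correct = 0
    for probs, label in zip(y_hat, y):
        # argmax: index of the largest predicted probability
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += int(predicted == label)
    return correct / len(y)

y_hat_example = [[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4],
                 [0.6, 0.3, 0.1]]
y_example = [0, 1, 2, 1]

acc_example = accuracy_sketch(y_hat_example, y_example)
print(acc_example)
```

Three of the four example rows have their largest probability at the labelled class, so the sketch reports an accuracy of 0.75.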

## Tips

You can leverage the `tf.boolean_mask` function to calculate the cross entropy. Keep in mind
that most elements of the ground truth vector are zeros.

## Credits
This assignment was prepared by Thomas Hossler and Michael Virgo et al., Winter 2021 (link [here](https://www.udacity.com/course/self-driving-car-engineer-nanodegree--nd0013)).

References
* [1] Ji, S. Xie, Y. Logistic Regression: From Binary to Multi-Class. http://people.tamu.edu/~sji/classes/LR.pdf


Helpful resources:
* [Softmax Regression and How is it Related to Logistic Regression? | KDnuggets](https://www.kdnuggets.com/2016/07/softmax-regression-related-logistic-regression.html)
* [The Softmax function and its derivative | E. Bendersky](https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/)
* [Softmax and Cross Entropy Loss | P. Dahal](https://deepnotes.io/softmax-crossentropy)