Cross-entropy compares two probability distributions.

In [5]:
import math
import numpy as np

# An example output from the output layer of the neural network
softmax_output = [0.7, 0.1, 0.2]

# Ground truth
target_output = [1, 0, 0]

loss = -(math.log(softmax_output[0]) * target_output[0] +
         math.log(softmax_output[1]) * target_output[1] +
         math.log(softmax_output[2]) * target_output[2])

loss

0.35667494393873245

That is the full categorical cross-entropy calculation, but we can simplify it when the target vector is one-hot. First, the values of `target_output[1]` and `target_output[2]` are both `0` in this case, and anything multiplied by `0` is `0`, so we don't need to calculate the terms at these indices. Next, the value of `target_output[0]` is `1`, and multiplying by `1` changes nothing, so this factor can be omitted as well.

In [4]:
loss = -math.log(softmax_output[0])
loss

0.35667494393873245
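As a quick check of the simplification above, the following sketch computes both the full sum and the single-term shortcut for the same sample and confirms that they agree. The helper names `full_cross_entropy` and `one_hot_cross_entropy` are illustrative and not part of the notebook, and the shortcut assumes the target is a plain Python list containing exactly one `1`.

```python
import math

def full_cross_entropy(predicted, target):
    # Full categorical cross-entropy: one term per class
    return -sum(t * math.log(p) for p, t in zip(predicted, target))

def one_hot_cross_entropy(predicted, target):
    # Shortcut for one-hot targets: only the class marked 1 contributes
    return -math.log(predicted[target.index(1)])

softmax_output = [0.7, 0.1, 0.2]
target_output = [1, 0, 0]

full = full_cross_entropy(softmax_output, target_output)
short = one_hot_cross_entropy(softmax_output, target_output)

print(full, short)
print(math.isclose(full, short))
```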

Consider a scenario with a neural network that performs classification between three classes, and the neural network classifies in batches of three. After running through the softmax activation function with a batch of 3 samples and 3 classes, the network's output layer yields:

In [6]:
softmax_outputs = [[0.7, 0.1, 0.2],
                    [0.1, 0.5, 0.4],
                    [0.02, 0.9, 0.08]]
class_targets = [0, 1, 1]

# We can map these target indices to retrieve the values from the softmax distributions
for targ_inx, distribution in zip(class_targets, softmax_outputs):
    print(distribution[targ_inx])

0.7
0.5
0.9


The `zip()` function lets us iterate over multiple iterables at the same time in Python. This can be further simplified using NumPy (we're creating a NumPy array of the Softmax outputs this time):

In [9]:
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = [0, 1, 1]

softmax_outputs[[0, 1, 2], class_targets]

array([0.7, 0.5, 0.9])

The list `[0, 1, 2]` indexes the first dimension. This dimension holds the per-sample predictions, and we want one value from each of them, so the list contains every row index from `0` up to the last one. There are as many row indices as there are distributions in the batch, so we can use `range()` instead of hard-coding them:

In [11]:
f = softmax_outputs[range(len(softmax_outputs)), class_targets]
f

array([0.7, 0.5, 0.9])
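The indexing used above pairs the two index sequences element by element, so NumPy reads positions `(0, 0)`, `(1, 1)` and `(2, 1)`. The sketch below spells that pairing out with an explicit loop and compares it with the one-line form. The names `rows`, `paired` and `looped` are illustrative, and `np.arange` is used here as an alternative to `range()`.

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = [0, 1, 1]

# One row index per sample in the batch
rows = np.arange(len(softmax_outputs))

# Integer-array indexing pairs rows[i] with class_targets[i]
paired = softmax_outputs[rows, class_targets]

# The same selection written as an explicit loop
looped = np.array([softmax_outputs[r, c] for r, c in zip(rows, class_targets)])

print(paired)
print(np.array_equal(paired, looped))
```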

In [14]:
# Apply negative log
neg_log = -np.log(f)
neg_log

array([0.35667494, 0.69314718, 0.10536052])

Finally, we want an average loss per batch to have an idea about how our model is doing during training.

In [16]:
average_loss = np.mean(neg_log)
average_loss

np.float64(0.38506088005216804)

We have learned that targets can be one-hot encoded, where all values except one are zeros and the correct label's position holds a `1`. They can also be sparse, which means the values they contain are the correct class numbers; the `spiral_data()` function generates them this way. We can let the loss calculation accept either form. Since we implemented it for sparse labels (as in our training data), we have to add a check for one-hot encoded targets and handle that case a bit differently. The check counts dimensions: if the targets are one-dimensional (like a list), they are sparse; if they have 2 dimensions (like a list of lists), they are a set of one-hot encoded vectors. In this second case, instead of indexing the confidences at the target labels, we multiply the confidences by the targets, which zeroes out every value except the one at the correct label, and then sum along the row axis (axis 1).

In [19]:
softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

# Probabilities for target values - only for sparse (class-index) labels
if len(class_targets.shape) == 1:
    correct_confidences = softmax_outputs[
        range(len(softmax_outputs)),
        class_targets
    ]
# Mask values - only for one-hot encoded labels
elif len(class_targets.shape) == 2:
    correct_confidences = np.sum(
        softmax_outputs * class_targets,
        axis=1
    )

# Losses
neg_log = -np.log(correct_confidences)
print(neg_log)

avg_loss = np.mean(neg_log)
avg_loss

[0.35667494 0.69314718 0.10536052]


np.float64(0.38506088005216804)
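To see that the two label formats describe the same information, the sketch below converts sparse labels into one-hot rows and checks that both branches select identical confidences. Building the one-hot rows by indexing an identity matrix from `np.eye` is one possible approach chosen for illustration, not something the notebook itself does, and the variable names are assumptions.

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
sparse_targets = np.array([0, 1, 1])

# Build one-hot rows by picking rows of an identity matrix
n_classes = softmax_outputs.shape[1]
one_hot_targets = np.eye(n_classes, dtype=int)[sparse_targets]

# Sparse path: index the confidence at each target class
from_sparse = softmax_outputs[np.arange(len(softmax_outputs)), sparse_targets]

# One-hot path: mask with the targets, then sum each row
from_one_hot = np.sum(softmax_outputs * one_hot_targets, axis=1)

print(one_hot_targets)
print(np.allclose(from_sparse, from_one_hot))
```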

From a mathematical point of view, `log(0)` is undefined. We already know the following dependence: if `y=log(x)`, then `e^y=x`. Asking what `y` is in `y=log(0)` is the same as asking what `y` is in `e^y=0`. Put simply, the constant `e` raised to any power is always a positive number, so there is no `y` for which `e^y=0`. This means `log(0)` is undefined. Still, we need to be aware of how `log(0)` behaves, and "undefined" does not mean that we know nothing about it. Since `log(0)` is undefined, what is the result for a value very close to `0`?
The limit of the natural logarithm of `x`, with `x` approaching `0` from the positive side, equals negative infinity. In other words, the logarithm decreases without bound as `x` becomes infinitely small, while `x` never reaches `0`.
The situation is a bit different in programming languages. We do not have limits here, just a function which, given a parameter, returns some value. The negative natural logarithm of `0`, in Python with NumPy, equals an infinitely big number (`inf`) rather than being undefined, and NumPy prints a warning about division by `0`. If `-np.log(0)` equals `inf`, is it possible to calculate `e` to the power of negative infinity?

In [23]:
np.e**(-np.inf)


0.0

In [22]:
-np.log(0).item()



inf

In [25]:
# A single infinite value in a list will cause the average of that list to also be infinite:
np.mean([1, 2, 3, -np.log(0)])



np.float64(inf)
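The same effect can be reproduced on a batch of confidences. The sketch below is illustrative only: the `confidences` array is made up, and `np.errstate(divide="ignore")` is used purely to keep the divide-by-zero warning out of the output, not to fix the underlying problem.

```python
import numpy as np

# Hypothetical confidences at the correct class; the last one is exactly 0
confidences = np.array([0.7, 0.5, 0.0])

# Silence the divide-by-zero warning only inside this block
with np.errstate(divide="ignore"):
    losses = -np.log(confidences)

# Only the zero-confidence sample has an infinite loss...
print(np.isinf(losses))
# ...but it is enough to make the batch mean infinite as well
print(np.isinf(np.mean(losses)))
```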

In [27]:
# We could add a very small value to the confidence to prevent it from being zero:
-np.log(1e-7).item()

16.11809565095832

Adding a very small value, one ten-millionth (`1e-7`), to a confidence at the far edge of its range will only insignificantly affect the result, but this method introduces two further issues. First, consider the case where the confidence value is `1`:

In [29]:
-np.log(1 + 1e-7).item()

-9.999999505838704e-08

When the model is fully correct in a prediction and puts all the confidence in the correct label, the loss becomes a negative value instead of being `0`. The other problem is that this shifts the confidence towards `1`, even if only by a very small value. To prevent both issues, it's better to clip values from both sides by the same number, `1e-7` in our case. That means the lowest possible value becomes `1e-7`, while the highest possible value, instead of being `1 + 1e-7`, becomes `1 - 1e-7`:

In [31]:
-np.log(1 - 1e-7).item()

1.0000000494736474e-07
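Applied to a whole array, the two-sided clipping described above could look like the following sketch. It uses `np.clip` with the same `1e-7` margin; the `confidences` array and the name `eps` are assumptions chosen to exercise both edges of the range.

```python
import numpy as np

eps = 1e-7  # clipping margin, matching the value used in this notebook

# Hypothetical confidences covering both extremes and a middle value
confidences = np.array([0.0, 0.5, 1.0])

# Values below eps are raised to eps; values above 1 - eps are lowered to 1 - eps
clipped = np.clip(confidences, eps, 1 - eps)
losses = -np.log(clipped)

print(clipped)
print(losses)
```

With both edges clipped, every loss stays finite and non-negative, at the cost of a small upper bound on the loss of a single sample.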

# Common Loss Class

In [34]:
class Loss:

    # Calculate the data loss (mean of the per-sample losses)
    # given model output and ground truth values
    def calculate(self, output, y):

        # Calculate sample losses
        sample_losses = self.forward(output, y)

        # Calculate mean loss
        data_loss = np.mean(sample_losses)

        # Return loss
        return data_loss

# Cross-Entropy Loss

In [35]:
# This class inherits the `Loss` class and performs all the error calculations that we derived throughout this notebook and can be used as an object.
class Loss_CategoricalCrossentropy(Loss):

    # Forward pass
    def forward(self, y_pred, y_true):

        # Number of samples in a batch
        samples = len(y_pred)

        # Clip data to prevent log(0), which would give an infinite loss
        # Clip both sides to not drag mean towards any value
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        # Probabilities for target values
        # sparse labels are indexed directly; one-hot labels are masked and summed
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[
                range(samples),
                y_true
            ]
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(
                y_pred_clipped * y_true,
                axis=1
            )

        # Losses
        negative_log_likelihoods = -np.log(correct_confidences)

        return negative_log_likelihoods

In [37]:
loss_function = Loss_CategoricalCrossentropy()
loss = loss_function.calculate(softmax_outputs, class_targets)
loss.item()

0.38506088005216804

# Accuracy Calculation

While loss is a useful metric for optimizing a model, the metric commonly used in practice alongside loss is accuracy, which describes how often the largest confidence belongs to the correct class, expressed as a fraction. Conveniently, we can reuse existing variable definitions to calculate the accuracy metric. We will take the `argmax` values of the `softmax_outputs` and compare them to the targets. We have to modify the `softmax_outputs` slightly for this purpose:

In [38]:
# Probabilities of 3 samples
softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])
# Target (ground-truth) labels for 3 samples
class_targets = np.array([0, 1, 1])

# Calculate values along second axis (axis of index 1)
predictions = np.argmax(softmax_outputs, axis=1)

# If targets are one-hot encoded - convert them
if len(class_targets.shape) == 2:
    class_targets = np.argmax(class_targets, axis=1)

# True evaluates to 1; False to 0
accuracy = np.mean(predictions==class_targets)

accuracy


np.float64(0.6666666666666666)
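The accuracy steps above can be wrapped in a small helper that accepts either label format. This is a minimal sketch: the function name `accuracy` and the `one_hot` array are illustrative and do not come from the notebook.

```python
import numpy as np

def accuracy(outputs, targets):
    # Predicted class = index of the largest confidence in each row
    predictions = np.argmax(outputs, axis=1)
    targets = np.asarray(targets)
    # One-hot targets are reduced to class indices first
    if targets.ndim == 2:
        targets = np.argmax(targets, axis=1)
    # Fraction of samples where the prediction matches the target
    return np.mean(predictions == targets)

softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])
sparse = np.array([0, 1, 1])
one_hot = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 1, 0]])

print(accuracy(softmax_outputs, sparse))
print(accuracy(softmax_outputs, one_hot))
```

Both calls should report the same fraction, since the one-hot rows encode the same classes as the sparse labels.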