# Chapter 5: Loss Functions

In [4]:
# Preface: import the necessary packages.
import numpy as np
import matplotlib.pyplot as plt
import math
import nnfs
from resources.classes import DenseLayer, ReLU, SoftMax, Loss, CategoricalCrossEntropy

## Section 1: Categorical Cross-Entropy Loss

If we were doing linear regression and fitting a regression line, we'd be looking at mean squared error (MSE) right now. But because we're doing classification, we're looking at categorical cross-entropy instead. That is a mouthful.

Categorical cross-entropy is used to explicitly compare a ground-truth probability distribution (called "y" or "targets") with some predicted distribution (called "y-hat" or "predictions").

$$
L_{i} = -\sum_{j} y_{i,j} \log(\hat{y}_{i,j})
$$

Here, L(i) denotes the i-th sample's loss value, i is the index of the sample in the set, j is the label/output index, y denotes the target values, and y-hat denotes the predicted values.

We'll be able to simplify this down to:

$$
L_{i} = -\log(\hat{y}_{i,k})
$$

Where L(i) denotes the sample's loss value, i is the i-th sample in the set, k is the index of the target label, y denotes the target values, and y-hat denotes the predictions.

We encode desired outcomes using something called "one-hot" encoding, which is basically where the ground truth value is given a 1, and all other values are assigned a 0. 

So, if we have three options to choose from [a, b, c] and we want the ground truth to be "b", then we make the target list be [0, 1, 0]. Say that our softmax outputs [0.3, 0.4, 0.3], then our categorical cross-entropy function becomes:

$$
L_{i} = -(0 \cdot \log(0.3) + 1 \cdot \log(0.4) + 0 \cdot \log(0.3))
$$ 

Which, when we have one-hot encoded data, further simplifies to:

$$
L_{i} = -\log(0.4)
$$

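This shows why, for one-hot targets, we can simplify the categorical cross-entropy loss function to just L(i) = -log(softmax_output[argmax(target_output)]). Note that the index comes from the target, not from the prediction: we take the log of whatever probability the model assigned to the correct class.

Below, we'll implement this using Python to show it in code.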

In [None]:
softmax_output = [0.3, 0.4, 0.3]
target_output = [0, 1, 0]

# Here we're just doing (target[x] * log(softmax[x])) for each class and summing
loss = -(target_output[0]*math.log(softmax_output[0]) + target_output[1]*math.log(softmax_output[1]) + target_output[2]*math.log(softmax_output[2]))

print(f"Loss out of the 'complex' version: {loss}")

# Here we just simplify it down to the only one which is true in our one-hot encoding
loss = -(target_output[1]*math.log(softmax_output[1]))
print(f"Loss out of simplified version: {loss}")

As you can see: they're identical!!!

Now, let's do an example: one in which we're processing the outputs to find the categorical cross-entropy for each sample in the batch and ultimately for the batch as a whole. Let's do so below!

In [None]:
# We'll be using some placeholder data here.
softmax_outputs = np.array([[0.6, 0.2, 0.2],
                            [0.3, 0.5, 0.2],
                            [0.1, 0.1, 0.8]])

# We'll also assign ground truths to go on
class_targets = [0, 1, 2]

# Now we'll roll all of these up into one to calculate the loss per sample
neg_log = -np.log(softmax_outputs[range(len(softmax_outputs)), class_targets])
print(f"The c.ce.l. for each sample is: {neg_log}")

# Here we can leverage np.mean, which computes sum(x) / len(x) in a single call.
average_loss = np.mean(neg_log)
print(f"The average c.ce.l is: {average_loss}")
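The indexing on the `neg_log` line is the least obvious part of that cell, so here is a minimal sketch of what it does, using the same placeholder data. Passing two index lists to a NumPy array pairs them up element by element, so we get one value per row. The names `rows`, `picked`, and `picked_loop` are just for illustration.

```python
import numpy as np

softmax_outputs = np.array([[0.6, 0.2, 0.2],
                            [0.3, 0.5, 0.2],
                            [0.1, 0.1, 0.8]])
class_targets = [0, 1, 2]

# Pair each row index with its target column index: (0, 0), (1, 1), (2, 2).
rows = np.arange(len(softmax_outputs))
picked = softmax_outputs[rows, class_targets]

# The same thing written as an explicit loop, for comparison.
picked_loop = np.array([softmax_outputs[i, k] for i, k in enumerate(class_targets)])

print(picked)       # [0.6 0.5 0.8]
print(picked_loop)  # [0.6 0.5 0.8]
```

So each entry of `picked` is the confidence the model gave to the correct class for that sample, which is exactly what we take the negative log of.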

However, targets can be either sparsely encoded or one-hot encoded! In the former case, each sample's target is a single number giving the index of the ground-truth class (e.g. 1), whereas in the latter case it is a full vector with a 1 at that index (e.g. [0, 1, 0]).

As such, our loss must be calculated differently depending on the way in which the target data is encoded. 
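To make the two encodings concrete, here is a small sketch that converts between them. The labels and the `n_classes` value are made up for illustration; this is just one common way of doing the conversion with NumPy.

```python
import numpy as np

sparse_targets = np.array([0, 1, 2, 1])  # one class index per sample
n_classes = 3                            # assumed to be known in advance

# Sparse -> one-hot: row k of the identity matrix is the one-hot vector for class k.
one_hot_targets = np.eye(n_classes, dtype=int)[sparse_targets]

# One-hot -> sparse: the position of the 1 in each row is the class index.
recovered = np.argmax(one_hot_targets, axis=1)

print(one_hot_targets)
print(recovered)  # [0 1 2 1]
```

Both forms carry the same information, which is why the loss can support either one by checking the shape of the targets.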

In [None]:
# We'll be using some placeholder data here.
softmax_outputs = np.array([[0.6, 0.2, 0.2],
                            [0.3, 0.5, 0.2],
                            [0.1, 0.1, 0.8]])

# We'll also assign ground truths to go on
class_targets = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 0, 1]])

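# If the labels are sparsely encoded (1-D array of class indices),
# pick out the confidence at each sample's target index.
if len(class_targets.shape) == 1:
    confidences = softmax_outputs[range(len(softmax_outputs)), class_targets]
# If the labels are one-hot encoded (2-D array), multiplying by the targets
# zeroes out every confidence except the one for the correct class.
elif len(class_targets.shape) == 2:
    confidences = np.sum(softmax_outputs * class_targets, axis=1)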
    
# Losses
neg_log = -np.log(confidences)
avg_loss = np.mean(neg_log)

print(avg_loss)

There's one more thing to point out here: the softmax can produce outputs of exactly 1 or 0. We're specifically worried about the latter of the two, because log(0) is undefined (NumPy returns -inf with a warning, so the loss becomes infinite).

And the truthful answer is... there's not a really ideal way of solving it. But there is one which is commonly used: clipping. What that basically means is that you narrow your bounds from 0 to 1 down to 1e-7 to 1 - 1e-7. It doesn't need to be exactly 1e-7, just a very small non-zero value which will have a negligible impact on your model but prevent the log(0) error. We do so using the np.clip() function in NumPy.
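Here's a minimal sketch of clipping in isolation. The confidences are hypothetical values chosen to include an exact 0 and an exact 1, and `eps` is an assumed name for the clipping value.

```python
import numpy as np

# Hypothetical confidences for the correct class; the first is exactly 0.
# Without clipping, -np.log(0.0) would evaluate to inf (with a RuntimeWarning).
confidences = np.array([0.0, 0.5, 1.0])

eps = 1e-7  # assumed clipping value; any very small positive number would do
clipped = np.clip(confidences, eps, 1 - eps)

losses = -np.log(clipped)
print(losses)
```

After clipping, the worst possible per-sample loss is -log(1e-7), which is roughly 16.12: large, but finite, so the mean over the batch stays usable.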

In [3]:
# We'll be using some placeholder data here.
softmax_outputs = np.array([[0.6, 0.2, 0.2],
                            [0.3, 0.5, 0.2],
                            [0.1, 0.1, 0.8]])

# We'll also assign ground truths to go on
class_targets = np.array([[1, 0, 0],
                          [0, 1, 0],
                          [0, 0, 1]])


lossFunction = CategoricalCrossEntropy()
loss = lossFunction.calculate(softmax_outputs, class_targets)
print(loss)

0.47570545188004854
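The `CategoricalCrossEntropy` class above is imported from `resources.classes`, which isn't shown in this chapter. As a rough idea of what such a class could look like, here's a sketch that combines the shape check and the clipping from earlier. The class names `SketchLoss` and `SketchCategoricalCrossEntropy` are hypothetical stand-ins, and the real implementation may well differ.

```python
import numpy as np


class SketchLoss:
    """Hypothetical base class: averages the per-sample losses from forward()."""

    def calculate(self, output, y):
        sample_losses = self.forward(output, y)
        return np.mean(sample_losses)


class SketchCategoricalCrossEntropy(SketchLoss):
    """Hypothetical stand-in for the imported CategoricalCrossEntropy."""

    def forward(self, y_pred, y_true):
        y_true = np.asarray(y_true)
        # Clip to avoid log(0); 1e-7 is an assumed value.
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)

        if y_true.ndim == 1:
            # Sparse labels: pick the predicted probability of the target class.
            confidences = y_pred_clipped[np.arange(len(y_pred_clipped)), y_true]
        elif y_true.ndim == 2:
            # One-hot labels: zero out everything except the target class.
            confidences = np.sum(y_pred_clipped * y_true, axis=1)
        else:
            raise ValueError("y_true must be 1-D (sparse) or 2-D (one-hot)")

        return -np.log(confidences)


softmax_outputs = np.array([[0.6, 0.2, 0.2],
                            [0.3, 0.5, 0.2],
                            [0.1, 0.1, 0.8]])
one_hot = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
sparse = np.array([0, 1, 2])

loss_fn = SketchCategoricalCrossEntropy()
loss_one_hot = loss_fn.calculate(softmax_outputs, one_hot)
loss_sparse = loss_fn.calculate(softmax_outputs, sparse)
print(loss_one_hot, loss_sparse)
```

With this placeholder data, both encodings give the same mean loss of about 0.4757, matching the value printed above.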


We'll combine everything that's been done up to this point. We'll also add a measure of accuracy, which is how often the class the model gives the highest confidence to is the correct one.

In [7]:
from nnfs.datasets import spiral_data

nnfs.init()

# Let's generate some data quickly.
X, y = spiral_data(samples=100, classes=3)

# Now to initialize the neural network.
Dense1 = DenseLayer(2,3)
Dense2 = DenseLayer(3,3)
Activation1 = ReLU()
Activation2 = SoftMax()
loss_function = CategoricalCrossEntropy()  # named so it doesn't shadow the imported Loss class

# Now to do a forward pass through the whole model
Dense1.forward(X)
Activation1.forward(Dense1.output)
Dense2.forward(Activation1.output)
Activation2.forward(Dense2.output)

# Now to quantify the predictions so we can figure out our accuracy
predictions = np.argmax(Activation2.output, axis=1)
if len(y.shape) == 2:
    y = np.argmax(y, axis=1)
accuracy = np.mean(y == predictions)

# Lastly, we can calculate our loss in this batch.
loss = loss_function.calculate(Activation2.output, y)

print(f"The output layer produced: {Activation2.output[:5]}")
print(f"The loss is: {loss}")
print(f"The accuracy is: {accuracy}")

The output layer produced: [[0.33333334 0.33333334 0.33333334]
 [0.33333316 0.3333332  0.33333364]
 [0.33333287 0.3333329  0.33333418]
 [0.3333326  0.33333263 0.33333477]
 [0.33333233 0.3333324  0.33333528]]
The loss is: 1.0986104011535645
The accuracy is: 0.34
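Those numbers aren't arbitrary. The output layer is giving almost exactly 1/3 to each of the three classes, and a model that spreads its confidence evenly over n classes has a per-sample loss of -log(1/n) = log(n). A quick check, where `n_classes` and `uniform_loss` are names introduced just for this sketch:

```python
import math

n_classes = 3

# A model that assigns probability 1/n_classes to every class
# has a per-sample loss of -log(1/n_classes) = log(n_classes).
uniform_loss = -math.log(1 / n_classes)
print(uniform_loss)  # about 1.0986
```

That lines up with the loss of roughly 1.0986 printed above, and an accuracy of 0.34 is about what guessing among three classes would give. In other words, this untrained model is effectively guessing.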


Now we can actually measure how well the model we've built is doing! With that, we should be all set to take the steps which will use these measurements to optimize our model!!
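One thing worth keeping in mind is that loss and accuracy measure different things: accuracy only looks at which class got the highest confidence, while the loss also cares how high that confidence was. Here's a small sketch with two made-up sets of softmax outputs that get every sample right but with very different confidence. The helper functions `accuracy` and `mean_loss` are written just for this example and assume sparse targets.

```python
import numpy as np

targets = np.array([0, 1, 2])

# Two hypothetical sets of softmax outputs with the same argmax in every row.
confident = np.array([[0.90, 0.05, 0.05],
                      [0.05, 0.90, 0.05],
                      [0.05, 0.05, 0.90]])
hesitant = np.array([[0.40, 0.30, 0.30],
                     [0.30, 0.40, 0.30],
                     [0.30, 0.30, 0.40]])


def accuracy(outputs, y):
    # Fraction of samples whose highest-confidence class is the target class.
    return np.mean(np.argmax(outputs, axis=1) == y)


def mean_loss(outputs, y):
    # Mean categorical cross-entropy for sparse targets.
    return np.mean(-np.log(outputs[np.arange(len(outputs)), y]))


print(accuracy(confident, targets), mean_loss(confident, targets))
print(accuracy(hesitant, targets), mean_loss(hesitant, targets))
```

Both sets score an accuracy of 1.0, but the hesitant one has a much higher loss, which is why the loss is the more informative signal to optimize against.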

### Anyways, that's it for this chapter! Thanks for following along with my annotations of *Neural Networks from Scratch* by Kinsley and Kukieła!