***
# ***Calculating Network Error with Loss***
***

In deep learning, calculating network error involves evaluating how well the neural network's predictions match the true labels. This is where the concept of **loss** comes in.

With a randomly initialized model, or even a model initialized with a more sophisticated approach, our goal is to train, or teach, the model over time. To train a model, we tweak the weights and biases to improve the model’s accuracy and confidence. To do this, we calculate how much error the model has. The **loss function**, also referred to as the **cost function**, is the algorithm that quantifies how wrong a model is. **Loss** is the measure of this metric. Since loss is the model’s error, we ideally want it to be 0.

Why do we not calculate a model’s error from its argmax accuracy? Take two example outputs: **[0.22, 0.6, 0.18]** vs **[0.32, 0.36, 0.32]**. If the correct class is the middle one (index 1), the model’s accuracy is identical for both. But are these two outputs really as good as each other? They are not, because accuracy simply applies an argmax to the output to find the index of the largest value. The output of a neural network is actually a set of confidences, and more confidence in the correct answer is better. Because of this, we strive to increase correct confidence and decrease misplaced confidence.
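
To make the comparison concrete, here is a minimal sketch in plain Python. The names `confident`, `hesitant`, and the small `argmax` helper are hypothetical and only serve this illustration. Both outputs give the same argmax prediction, while the negative log of the correct-class confidence (the quantity developed in the rest of this section) tells them apart.

```python
import math

# Two hypothetical softmax outputs; the correct class is index 1 in both cases
confident = [0.22, 0.6, 0.18]
hesitant = [0.32, 0.36, 0.32]
correct_class = 1

def argmax(values):
    # Index of the largest value in a list
    return max(range(len(values)), key=lambda i: values[i])

# Accuracy only looks at the argmax, so both predictions count as correct
print(argmax(confident) == correct_class)  # True
print(argmax(hesitant) == correct_class)   # True

# The negative log of the correct-class confidence separates the two
loss_confident = -math.log(confident[correct_class])
loss_hesitant = -math.log(hesitant[correct_class])
print(loss_confident < loss_hesitant)  # True
```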

***
# ***Common Loss Functions***
***
There are several loss functions used in deep learning, depending on the type of task:

***
### ***Categorical Cross-Entropy Loss***
***

If you are familiar with **linear regression**, then you already know one of the **loss functions** used with neural networks that do regression: **squared error** (or **mean squared error** with neural networks).

We are not performing regression in this example; we are classifying, so we need a different loss function. The model has a softmax activation function for the output layer, which means it’s outputting a probability distribution. **Categorical cross-entropy** is explicitly used to compare a **“ground-truth” probability distribution (y, or “targets”)** and some **predicted distribution (y-hat, or “predictions”)**, so it makes sense to use cross-entropy here. It is also one of the most commonly used loss functions with a softmax activation on the output layer.

The formula for calculating the **categorical cross-entropy** of **y** (actual/desired distribution) and **y-hat** (predicted distribution) is:

$$
L_i = -\sum_{j} y_{i,j} \log(\hat{y}_{i,j})
$$

Where:
- **\(L_i\)**: Loss for the \(i^{th}\) sample.
- **\(j\)**: Index of the class.
- **\(y_{i,j}\)**: True label for the \(i^{th}\) sample and \(j^{th}\) class (one-hot encoded: 1 if true class, 0 otherwise).
- **\(\hat{y}_{i,j}\)**: Predicted probability for the \(i^{th}\) sample and \(j^{th}\) class (output from a softmax function).


In [2]:

import math 

# An example output from the output layer of the neural network
softmax_output = [0.7, 0.1, 0.2] 

# Ground truth 
target_output = [1, 0, 0] 

loss = - (math.log(softmax_output[0]) * target_output[0] + 
          math.log(softmax_output[1]) * target_output[1] + 
          math.log(softmax_output[2]) * target_output[2] ) 

print(loss)


0.35667494393873245


That is the full **categorical cross-entropy calculation**, but we can make a few simplifications given **one-hot target vectors**. First, what are the values of **target_output[1]** and **target_output[2]** in this case? They are both 0, and anything multiplied by 0 is 0, so we do not need to calculate the terms at these indices. Next, what is the value of **target_output[0]** in this case? It is 1, so it can be omitted, as any number multiplied by 1 remains the same. The same output, in this example, can then be calculated with:

In [5]:

loss = -math.log(softmax_output[0])
loss


0.35667494393873245

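Earlier we noted that more confidence in the correct answer is better. The **categorical cross-entropy loss** accounts for that: the lower the confidence in the correct class, the larger the loss. The log values themselves, before negation, look like this: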


In [7]:

print(math.log(1.)) 
print(math.log(0.95)) 
print(math.log(0.9)) 
print(math.log(0.8)) 
print('...') 
print(math.log(0.2)) 
print(math.log(0.1)) 
print(math.log(0.05)) 
print(math.log(0.01)) 


0.0
-0.05129329438755058
-0.10536051565782628
-0.2231435513142097
...
-1.6094379124341003
-2.3025850929940455
-2.995732273553991
-4.605170185988091


We have printed the log values for a few example confidences; the loss is the negative of each value. When the confidence equals 1, meaning the model is 100% **“sure”** about its prediction, the loss for this sample equals 0. The loss rises as the confidence falls toward 0. You might also wonder why we did not print the result of log(0); we’ll explain that shortly.
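
The cell above prints raw log values, which are zero or negative. As a small sketch, negating them gives the per-sample loss for a one-hot target. The `confidences` list simply repeats the illustrative values used above.

```python
import math

# Illustrative confidences in the correct class, from high to low
confidences = [1.0, 0.95, 0.9, 0.8, 0.2, 0.1, 0.05, 0.01]

# 0.0 - log(c) equals -log(c); it is written this way so that a
# confidence of 1.0 yields 0.0 rather than -0.0
losses = [0.0 - math.log(c) for c in confidences]

for confidence, loss in zip(confidences, losses):
    print(f"confidence {confidence:>4}: loss {loss:.4f}")
```

Each loss in the list is larger than the one before it, since the confidences are listed from highest to lowest.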

So far, we have applied **log()** to the **softmax output** without explaining what a logarithm is. A logarithm is the solution for the x-term in an equation of the form \(a^x = b\). For example, \(10^x = 100\) can be solved with a log: \(\log_{10}(100)\), which evaluates to 2. This property of the log function is especially beneficial when e (Euler’s number, ~2.71828) is used as the base (where 10 is in the example). The logarithm with e as its base is referred to as the natural logarithm, natural log, or simply log; you may also see this written as ln: \(\ln(x) = \log(x) = \log_e(x)\). The variety of conventions can make this confusing, so to simplify things, any mention of log will always mean the natural logarithm throughout this book. The natural log represents the solution for the x-term in the equation \(e^x = b\); for example, \(e^x = 5.2\) is solved by \(\log(5.2)\).
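
The base-10 example can be checked directly. This sketch relies only on `math.log10` and `math.log` from Python’s standard library.

```python
import math

# log base 10 of 100 answers "10 to what power gives 100?"
x = math.log10(100)
print(x)        # 2.0
print(10 ** x)  # 100.0

# math.log with a single argument is the natural logarithm (base e)
print(math.log(math.e))  # 1.0
```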

In [9]:
import numpy as np 
 
b = 5.2 
print(np.log(b)) 

print(math.e ** np.log(b))


1.6486586255873816
5.199999999999999


The small difference is the result of floating-point precision. Getting back to the loss calculation, we need to modify our output in two additional ways. First, we will update our process to work on batches of softmax output distributions; second, we will make the negative log calculation dynamic to the target index (the target index has been hard-coded so far).


Consider a scenario with a neural network that performs classification between three classes, and the neural network classifies in batches of three. After running through the softmax activation function with a batch of 3 samples and 3 classes, the network’s output layer yields: 


In [11]:

# Probabilities for 3 samples 
softmax_outputs = np.array([[0.7, 0.1, 0.2], 
                            [0.1, 0.5, 0.4], 
                            [0.02, 0.9, 0.08]]) 

class_targets = [0, 1, 1]  # dog, cat, cat


We need a way to dynamically calculate the categorical cross-entropy, which we now know is a negative log calculation. To determine which value in the softmax output to calculate the negative log from, we simply need to know our target values. In this example, there are 3 classes; let’s say we are trying to classify something as a **“dog,”** **“cat,”** or **“human.”** A dog is class 0 (at index 0), a cat class 1 (index 1), and a human class 2 (index 2). Let’s assume the batch of three sample inputs to this neural network is being mapped to the target values of a dog, cat, and cat. So the targets (as a list of target indices) would be [0, 1, 1].

The first value, **0**, in class_targets means the first softmax output distribution’s intended prediction was the one at index 0 of **[0.7, 0.1, 0.2]**; the model has a **0.7** confidence score that this observation is a dog. This continues throughout the batch, where the intended target of the 2nd softmax distribution, **[0.1, 0.5, 0.4]**, was at index 1; the model has only a 0.5 confidence score that this is a cat, so it is less certain about this observation. In the last sample, the target is also index 1 of the softmax distribution, a value of 0.9 in this case, which is a pretty high confidence.
    
With a collection of softmax outputs and their intended targets, we can map these indices to retrieve the values from the softmax distributions:

In [12]:

for targ_idx, distribution in zip(class_targets, softmax_outputs): 
    print(distribution[targ_idx])
    

0.7
0.5
0.9


 
The ***zip()*** function, again, lets us iterate over multiple iterables at the same time in Python. This can be further simplified using NumPy indexing (softmax_outputs is a NumPy array, which supports it):

In [16]:

softmax_outputs = np.array([[0.7, 0.1, 0.2], 
                            [0.1, 0.5, 0.4], 
                            [0.02, 0.9, 0.08]]) 

class_targets = [0, 1, 1] 

print(softmax_outputs[[0, 1, 2], class_targets])


[0.7 0.5 0.9]



What are the 0, 1, and 2 values? NumPy lets us index an array in multiple ways. One of them is to use a list filled with indices, and that’s convenient for us: we could use class_targets for this purpose, as it already contains the list of indices that we are interested in. The problem is that these indices have to filter the data within each row of the array, which is the second dimension. To do that, we also need to explicitly index the array in its first dimension. This dimension contains the predictions and we, of course, want to retain them all. We can achieve that by using a list containing the numbers from 0 up to the last row index. We know we are going to have as many indices as distributions in our entire batch, so we can use a **range()** instead of typing each value ourselves:


In [17]:

print(softmax_outputs[ 
    range(len(softmax_outputs)), class_targets 
]) 


[0.7 0.5 0.9]
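
To see why the row indices are needed, this sketch compares indexing with `class_targets` alone against indexing with both lists. It reuses the arrays from above and relies only on standard NumPy integer-array indexing; `np.arange` is used here in place of `range()` as an equivalent choice.

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])
class_targets = [0, 1, 1]

# A single index list selects whole rows: rows 0, 1 and 1
rows_only = softmax_outputs[class_targets]
print(rows_only.shape)  # (3, 3)

# Pairing row indices with column indices picks one value per sample
row_indices = np.arange(len(softmax_outputs))
per_sample = softmax_outputs[row_indices, class_targets]
print(per_sample)  # [0.7 0.5 0.9]
```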


This returns an array of the confidences at the target indices for each of the samples. Now we apply the negative log to this array:

In [18]:

print(-np.log(softmax_outputs[ 
    range(len(softmax_outputs)), class_targets 
])) 


[0.35667494 0.69314718 0.10536052]


Finally, we want an average loss per batch to have an idea about how our model is doing during training. There are many ways to calculate an average in Python; the most basic form of an average is the **arithmetic mean**: **sum(iterable) / len(iterable)**. NumPy has a function that computes this average on arrays, np.mean(), so we will use that instead. We add it to the code:

In [19]:

neg_log = -np.log(softmax_outputs[ 
              range(len(softmax_outputs)), class_targets 
          ]) 

average_loss = np.mean(neg_log) 

print(average_loss) 


0.38506088005216804


We have already learned that targets can be one-hot encoded, where all values except one are zeros and the correct label’s position is filled with 1. They can also be sparse, which means that the numbers they contain are the correct class numbers; we are generating them this way with the **spiral_data()** function. We can allow the loss calculation to accept either of these forms. Since we implemented this to work with sparse labels (as in our training data), we have to add a check for one-hot encoded targets and handle them a bit differently. The check can be performed by counting the dimensions: if targets are single-dimensional (like a list), they are sparse, but if there are 2 dimensions (like a list of lists), then there is a set of one-hot encoded vectors.

In this second case, we will implement a solution using the first equation from this chapter instead of filtering out the confidences at the target labels. We have to multiply the confidences by the targets, zeroing out all values except the ones at the correct labels, and then sum along the row axis (axis 1). We have to add a test for the number of dimensions to the code we just wrote, move the calculation of the log values outside of this new if statement, and implement the solution for the one-hot encoded labels following the first equation:

In [24]:

import numpy as np 
 
softmax_outputs = np.array([[0.7, 0.1, 0.2], 
                            [0.1, 0.5, 0.4], 
                            [0.02, 0.9, 0.08]]) 

class_targets = np.array([[1, 0, 0], 
                          [0, 1, 0], 
                          [0, 1, 0]]) 

# Probabilities for target values - only if categorical labels 
if len(class_targets.shape) == 1: 
    correct_confidences = softmax_outputs[ 
        range(len(softmax_outputs)), 
        class_targets 
    ] 
    
# Mask values - only for one-hot encoded labels 
elif len(class_targets.shape) == 2: 
    correct_confidences = np.sum(
        softmax_outputs * class_targets,
        axis=1
    )

# Losses 
neg_log = -np.log(correct_confidences) 
average_loss = np.mean(neg_log) 
print(average_loss) 


0.38506088005216804
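
As a sanity check, the two label formats should give the same loss. The sketch below builds one-hot targets from sparse ones with `np.eye` (one possible way to do the conversion, not something this chapter prescribes) and compares both paths. The names `sparse_targets` and `one_hot_targets` are chosen for this illustration only.

```python
import numpy as np

softmax_outputs = np.array([[0.7, 0.1, 0.2],
                            [0.1, 0.5, 0.4],
                            [0.02, 0.9, 0.08]])

sparse_targets = np.array([0, 1, 1])
# Row i of the identity matrix is the one-hot vector for class i
one_hot_targets = np.eye(softmax_outputs.shape[1], dtype=int)[sparse_targets]

# Sparse path: pick the confidence at each target index
sparse_conf = softmax_outputs[np.arange(len(softmax_outputs)), sparse_targets]
# One-hot path: mask the other classes and sum along each row
one_hot_conf = np.sum(softmax_outputs * one_hot_targets, axis=1)

sparse_loss = np.mean(-np.log(sparse_conf))
one_hot_loss = np.mean(-np.log(one_hot_conf))
print(np.isclose(sparse_loss, one_hot_loss))  # True
```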


***
### ***The Categorical Cross-Entropy Loss Class***
***

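We start with a common **Loss** base class. Its calculate method calls a forward method, which each specific loss class must implement, and returns the mean of the per-sample losses:

In [29]: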

# Common loss class 
class Loss:
 
    # Calculates the data loss
    # given model output and ground truth values
    def calculate(self, output, y):
 
        # Calculate sample losses 
        sample_losses = self.forward(output, y) 
 
        # Calculate mean loss 
        data_loss = np.mean(sample_losses) 
 
        # Return loss 
        return data_loss
        

Let’s convert our loss code into a class for convenience down the line: 

In [34]:

# Cross-Entropy Loss Class
class LossCategoricalCrossentropy(Loss):
    
    # Forward pass
    def forward(self, y_pred, y_true):
        # Number of samples in a batch
        samples = len(y_pred)
        
        # Clip predictions to prevent log(0)
        # Clip both sides to avoid shifting the mean
        y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
        
        # Handle categorical labels (integer labels)
        if len(y_true.shape) == 1:
            correct_confidences = y_pred_clipped[range(samples), y_true]
        
        # Handle one-hot encoded labels
        elif len(y_true.shape) == 2:
            correct_confidences = np.sum(y_pred_clipped * y_true, axis=1)
        
        # Compute the negative log likelihoods
        negative_log_likelihoods = -np.log(correct_confidences)
        
        return negative_log_likelihoods


This class inherits from the Loss class and performs all the error calculations that we derived throughout this chapter, and it can be used as an object. For example, using the manually created output and targets:

In [40]:

loss_function = LossCategoricalCrossentropy() 
loss = loss_function.calculate(softmax_outputs, class_targets) 
print(loss) 


0.38506088005216804
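
The clipping step in `forward` deals with the log(0) case that was skipped earlier. Below is a minimal sketch of the problem and the fix, using a made-up batch in which the model assigns zero confidence to the correct class of the first sample. The array values are hypothetical; the clipping bounds match the ones used in the class above.

```python
import numpy as np

# Hypothetical batch: the first sample gives 0 confidence to its true class (index 0)
y_pred = np.array([[0.0, 1.0, 0.0],
                   [0.1, 0.5, 0.4]])
y_true = np.array([0, 1])
rows = np.arange(len(y_pred))

# Without clipping, -log(0) is infinite and so is the batch mean
with np.errstate(divide="ignore"):
    unclipped_loss = np.mean(-np.log(y_pred[rows, y_true]))
print(unclipped_loss)  # inf

# Clipping keeps every confidence inside [1e-7, 1 - 1e-7]
y_pred_clipped = np.clip(y_pred, 1e-7, 1 - 1e-7)
clipped_loss = np.mean(-np.log(y_pred_clipped[rows, y_true]))
print(np.isfinite(clipped_loss))  # True
```

A single infinite sample loss would otherwise make the mean of the whole batch infinite, which is why the clip is applied before taking the log.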


*** 
### ***Accuracy Calculation***
***

While loss is a useful metric for optimizing a model, the metric commonly used in practice alongside loss is **accuracy**, which describes how often the largest confidence is the correct class, expressed as a fraction. Conveniently, we can reuse existing variable definitions to calculate the accuracy metric. We will take the argmax values from the softmax outputs and then compare these to the targets. This is as simple as doing (note that we slightly modified softmax_outputs for the purpose of this example):

In [44]:

import numpy as np

# Probabilities of 3 samples (output of softmax)
softmax_outputs = np.array([
    [0.7, 0.2, 0.1],
    [0.5, 0.1, 0.4],
    [0.02, 0.9, 0.08]
])

# Target (ground-truth) labels for 3 samples
class_targets = np.array([0, 1, 1])

# Get predicted class indices from softmax outputs
predictions = np.argmax(softmax_outputs, axis=1)

# If targets are one-hot encoded, convert them to class indices
if len(class_targets.shape) == 2:
    class_targets = np.argmax(class_targets, axis=1)

# Calculate accuracy (True -> 1, False -> 0)
accuracy = np.mean(predictions == class_targets)

# Print the accuracy
print('Accuracy:', accuracy)


Accuracy: 0.6666666666666666


We are also handling one-hot encoded targets by converting them to sparse values using np.argmax(). 
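
The same logic can be wrapped in a small helper. This is a sketch only: the function name `calculate_accuracy` is hypothetical and does not appear elsewhere in this chapter. It shows that sparse and one-hot targets lead to the same accuracy.

```python
import numpy as np

def calculate_accuracy(softmax_outputs, class_targets):
    # Predicted class is the index of the largest confidence in each row
    predictions = np.argmax(softmax_outputs, axis=1)
    # Convert one-hot targets to class indices if needed
    if class_targets.ndim == 2:
        class_targets = np.argmax(class_targets, axis=1)
    # Fraction of samples where prediction and target agree
    return np.mean(predictions == class_targets)

softmax_outputs = np.array([[0.7, 0.2, 0.1],
                            [0.5, 0.1, 0.4],
                            [0.02, 0.9, 0.08]])
sparse_targets = np.array([0, 1, 1])
one_hot_targets = np.array([[1, 0, 0],
                            [0, 1, 0],
                            [0, 1, 0]])

sparse_accuracy = calculate_accuracy(softmax_outputs, sparse_targets)
one_hot_accuracy = calculate_accuracy(softmax_outputs, one_hot_targets)
print(sparse_accuracy == one_hot_accuracy)  # True
```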

In [45]:

# activation2 and y are assumed to be defined earlier in the notebook:
# the model's softmax output layer and the training labels
predictions = np.argmax(activation2.output, axis=1)
if len(y.shape) == 2:
    y = np.argmax(y, axis=1)
accuracy = np.mean(predictions == y)

# Print accuracy
print('acc:', accuracy)