# Tutorial 3.1: Evaluating Neural Network Performance: Metrics and Visualizations

Author: [Erik Syniawa](mailto:erik.syniawa@informatik.tu-chemnitz.de)

This notebook introduces essential quantitative measures to evaluate neural network learning performance and how they are implemented in some of the Utils functions. We'll further explore key metrics and visualization techniques that help us understand:

1. How efficiently a model learns
2. Whether it's overfitting or underfitting
3. How its internal representations capture the structure of the data
4. How it performs on individual classes and examples

These insights are crucial for diagnosing problems, tuning hyperparameters, and ultimately building better models.

## 1. Setup and Imports

Let's start by importing the necessary libraries and the custom functions we'll use throughout this notebook.


In [None]:
import os, sys

# Append the parent directory to sys.path so that the Utils package can be imported. A regular Python project would use an installable package (with __init__.py) instead; in a Jupyter notebook this shortcut is simpler:
notebook_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(notebook_dir, ".."))
if root_path not in sys.path:
    sys.path.append(root_path)
    print(f"Added {root_path} to sys.path")
    

# Import our custom functions
from Utils.functions import train_model, evaluate_model, test_model, get_model_predictions
from Utils.plotting import (visualize_training_results, 
                            visualize_test_results, 
                            visualize_misclassified, )

# some little useful helper functions                            
from Utils.little_helpers import set_seed, timer

# set random seed for reproducibility
set_seed(1729)

# Import PyTorch and show version/device info
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"PyTorch version: {torch.__version__} running on {device}")

### 1.1 The Importance of Setting Random Seeds in Deep Learning

Random seeds control the initialization of random number generators, which affect many aspects of neural network training:

1. **Weight initialization**: Random initial weights can lead to different local minima
2. **Data shuffling**: Different batch orderings affect gradient updates
3. **Dropout patterns**: Randomized dropout masks change which neurons are active
4. **Data augmentation**: Random transformations create different training examples

#### Benefits of fixed seeds:

1. **Reproducibility**: Ensures experiments can be reproduced exactly
2. **Debugging**: Helps distinguish between bugs and random variation
3. **Fair Comparisons**: Allows different models/approaches to be compared under identical conditions
4. **Scientific Rigor**: Makes results replicable by other researchers

#### Why CUDA-specific seed setting is necessary:

Neural network training often uses GPU acceleration through CUDA. Setting CUDA-specific seeds is critical because:

1. **GPU Parallelism**: GPUs use parallel processing with their own random number generators
2. **CUDA-specific operations**: Some operations have GPU-specific implementations with different random behaviors
3. **Multi-GPU setups**: Every GPU has its own generator, so each one must be seeded for consistent behavior

Our `set_seed(1729)` function addresses these issues by:

- Setting Python's `random` seed for general randomization -> `random.seed(1729)`
- Setting NumPy's seed for array operations -> `np.random.seed(1729)`
- Setting PyTorch's CPU seed with `torch.manual_seed(1729)`
- Setting CUDA seeds with `torch.cuda.manual_seed(1729)` for the current device
- Setting all CUDA device seeds with `torch.cuda.manual_seed_all(1729)` for using multiple GPUs
- Making cuDNN deterministic with `torch.backends.cudnn.deterministic = True`
- Disabling cuDNN benchmarking with `torch.backends.cudnn.benchmark = False`

The last two settings are particularly important but come with performance costs. Setting ``cudnn.deterministic = True`` forces cuDNN to use deterministic algorithms instead of selecting the fastest but potentially non-deterministic implementation. Setting ``cudnn.benchmark = False`` prevents cuDNN from benchmarking multiple convolution algorithms and selecting the fastest one, which can introduce non-determinism. For details, see the [PyTorch notes on reproducibility](https://pytorch.org/docs/stable/notes/randomness.html).

> **Note**: Setting `torch.backends.cudnn.deterministic = True` and `torch.backends.cudnn.benchmark = False` can significantly slow down training, especially for convolutional networks. This trade-off between reproducibility and performance is an important consideration in research vs. production environments.
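
The calls listed above can be collected into one small function. The following is only a sketch of what such a helper might look like, under the assumption that it does nothing beyond the listed steps; the name `set_seed_sketch` is hypothetical and this is not necessarily how `Utils.little_helpers.set_seed` is written. In recent PyTorch versions `torch.manual_seed` already seeds the CUDA generators as well, so the explicit CUDA calls are kept mainly to mirror the list.

```python
import random

import numpy as np
import torch


def set_seed_sketch(seed: int = 1729) -> None:
    """Hypothetical stand-in that mirrors the steps listed in the text."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy's global RNG
    torch.manual_seed(seed)           # PyTorch RNG
    torch.cuda.manual_seed(seed)      # current CUDA device (ignored without CUDA)
    torch.cuda.manual_seed_all(seed)  # all CUDA devices (ignored without CUDA)
    torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable cuDNN autotuning


# Re-seeding reproduces exactly the same random draws
set_seed_sketch(1729)
first = torch.rand(3)
set_seed_sketch(1729)
second = torch.rand(3)
print(torch.equal(first, second))
```

Calling the function again with the same seed resets all generators, which is why the two tensors are identical.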

## 2. The Main Objective of Deep Learning: Learning Representations

The core idea behind deep learning is to automatically learn meaningful representations of data. As highlighted in [LeCun, Bengio & Hinton (2015)](https://www.nature.com/articles/nature14539), deep learning methods are representation-learning methods with multiple levels of representation. Each layer transforms the representation from the previous level into a more abstract representation, allowing the model to learn complex patterns from raw data.

Consider this example: Imagine a model trained to distinguish between cats and cows based on images. When we look at the final accuracy, we might be satisfied with around 90% correct classifications. However, what the model actually "learned" remains hidden.

It's possible the model has learned legitimate discriminative features (cats have pointed ears and whiskers, cows have horns and spots). But it might instead have learned from statistical biases in the training data - perhaps cows are more frequently photographed outdoors while cats are photographed indoors. In this case, the model might be associating blue skies and green grass with cows, and indoor lighting and furniture with cats.

This creates a problem: these associations are inherently biased by the training dataset and can lead to false classifications when cats are photographed outdoors or cows are photographed in stables. By visualizing the [embeddings](#41-why-embeddings-matter), we might discover that images are clustered primarily by background environment rather than by the animal's intrinsic features - revealing a potential reliability issue that accuracy alone wouldn't show.

The power of deep learning comes from its ability to progressively learn more abstract and useful representations:

- Early layers might detect simple patterns like edges and colors
- Middle layers combine these into more complex patterns like textures and shapes
- Later layers assemble these into object parts and abstract object representations

This hierarchical representation learning is what enables deep networks to solve complex tasks that were previously intractable with traditional machine learning approaches.

<div align="center">
    <img src="figures/features.png" width="750"/>
    <p><i>Figure 1: This image from <a href="https://link.springer.com/chapter/10.1007/978-3-319-10590-1_53">Zeiler & Fergus (2014)</a> visualizes what different layers in their convolutional neural network learn by showing reconstructions of features that most strongly activate specific feature maps. The progression clearly demonstrates the hierarchical nature of learned representations: lower layers (1-2) detect simple edges and textures, middle layers (3-4) capture more complex patterns and object parts (like dog faces or bird legs), while deeper layers (5) represent entire objects with significant pose variation.</i></p>
</div>


## 3. Data Splitting and Cross-Validation

Before we dive into loss functions and other evaluation metrics, we need to establish proper data splitting strategies to ensure reliable model evaluation. How we divide our data directly impacts our ability to assess model generalization.

### 3.1 The Three-Way Split: Train, Validation, and Test

- Training Set (60-80%): Used to optimize model parameters through backpropagation
- Validation Set (10-20%): Used to track generalization during training and to make early stopping decisions. It can also be used for hyperparameter tuning (for example, it drives the `ReduceLROnPlateau` scheduler).
- Test Set (10-20%): Used only for final model evaluation, never for making modeling decisions. Note that for some benchmarks, such as ImageNet, the true labels of the test set are not publicly available, so test accuracy cannot be computed locally.

> What are the advantages of using a validation set that the model has never seen before?

### 3.2 Creating an Example Dataset

Let's create a dataset for an exemplary animal classification task that we'll use throughout this notebook. We'll intentionally make it imbalanced to simulate real-world challenges:

In [None]:
# Create a synthetic animal classification dataset with class imbalance
import torch
from torch.utils.data import TensorDataset, DataLoader, random_split
from sklearn.datasets import make_classification
import numpy as np
import matplotlib.pyplot as plt

# Our class names
class_names = ['cat', 'dog', 'bird', 'cow', 'lynx']
num_classes = len(class_names)

# Create a synthetic classification dataset with imbalanced classes
X, y = make_classification(
    n_samples=5000,
    n_features=20,  # 20 features per sample
    n_informative=10,  # 10 informative features
    n_redundant=5,    # 5 redundant features
    n_classes=num_classes,      # 5 classes (matching our class_names)
    n_clusters_per_class=2,
    weights=[0.4, 0.3, 0.15, 0.1, 0.05],  # Making 'cat' most common, 'lynx' least common
    class_sep=1.5,    # Class separation factor
    random_state=42
)

# Convert to PyTorch tensors
X_tensor = torch.FloatTensor(X)
y_tensor = torch.LongTensor(y)

# Create a PyTorch dataset
dataset = TensorDataset(X_tensor, y_tensor)

# Let's check class distribution
unique_classes, class_counts = np.unique(y, return_counts=True)

plt.figure(figsize=(10, 5))
bars = plt.bar(range(len(unique_classes)), class_counts, color='skyblue')
plt.xticks(range(len(unique_classes)), [class_names[c] for c in unique_classes])
plt.xlabel('Animal Class')
plt.ylabel('Number of Samples')
plt.title('Class Distribution in Animal Classification Dataset')

# Add count labels above bars
for i, (count, bar) in enumerate(zip(class_counts, bars)):
    plt.text(i, count + 10, str(count), ha='center')

plt.tight_layout()
plt.show()

This creates an imbalanced dataset where cats are most common and lynxes are rare - similar to what we might encounter in real-world datasets. This imbalance will help us demonstrate the importance of proper evaluation techniques.

### 3.3 Basic Data Splitting Approaches

#### 3.3.1 Random Splitting

The simplest approach is random splitting, which works well when data is independently and identically distributed (i.i.d.):

In [None]:
# First split: separate test set (20%)
train_val_size = int(0.8 * len(dataset))
test_size = len(dataset) - train_val_size
train_val_dataset, test_dataset = random_split(dataset, [train_val_size, test_size])

# Second split: separate validation set from training (25% of train_val, which is 20% of total)
train_size = int(0.75 * len(train_val_dataset))
val_size = len(train_val_dataset) - train_size
train_dataset, val_dataset = random_split(train_val_dataset, [train_size, val_size])

# Create data loaders
batch_size = 64
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

print(f"Dataset sizes: Total={len(dataset)}, Train={len(train_dataset)}, Val={len(val_dataset)}, Test={len(test_dataset)}")

#### 3.3.2 The Problem with Random Splitting for Imbalanced Data
With random splitting on our imbalanced animal dataset, we might end up with splits that don't represent the true distribution. Let's examine the class distribution across our random splits:

In [None]:
# Function to get class distribution from subset indices
def get_class_distribution(dataset, indices):
    # Extract labels for the specified indices
    labels = [dataset.tensors[1][i].item() for i in indices]
    class_counts = np.bincount(labels, minlength=num_classes)
    return class_counts

# Get indices from each split
train_indices = train_dataset.indices
val_indices = val_dataset.indices
test_indices = test_dataset.indices

# Calculate class distributions
train_dist = get_class_distribution(dataset, train_indices)
val_dist = get_class_distribution(dataset, val_indices)
test_dist = get_class_distribution(dataset, test_indices)

# Visualize distributions
plt.figure(figsize=(12, 6))
x = np.arange(len(class_names))
width = 0.25

plt.bar(x - width, train_dist / train_dist.sum(), width, label='Train', color='skyblue')
plt.bar(x, val_dist / val_dist.sum(), width, label='Validation', color='lightgreen')
plt.bar(x + width, test_dist / test_dist.sum(), width, label='Test', color='salmon')

plt.xlabel('Animal Class')
plt.ylabel('Proportion')
plt.title('Class Distribution Across Random Splits')
plt.xticks(x, class_names)
plt.legend()
plt.tight_layout()
plt.show()

Notice how the proportions of each class can vary between splits. This could be problematic - especially for rare classes like 'lynx', which might be underrepresented or even in the worst case absent in some splits.

#### 3.3.3 Stratified Splitting
For imbalanced datasets like ours, stratified splitting maintains the class distribution across all partitions:

In [None]:
from sklearn.model_selection import train_test_split

# Stratified split using scikit-learn
X_train_val, X_test, y_train_val, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1729, stratify=y  # stratify=y preserves the class proportions in both parts
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train_val, y_train_val, test_size=0.25, random_state=1729, stratify=y_train_val
)

# Verify class distributions in each split
def plot_class_distribution(y_train, y_val, y_test):
    train_dist = np.bincount(y_train, minlength=num_classes) / len(y_train)
    val_dist = np.bincount(y_val, minlength=num_classes) / len(y_val)
    test_dist = np.bincount(y_test, minlength=num_classes) / len(y_test)
    
    plt.figure(figsize=(12, 6))
    x = np.arange(len(class_names))
    width = 0.25
    
    plt.bar(x - width, train_dist, width, label='Train', color='skyblue')
    plt.bar(x, val_dist, width, label='Validation', color='lightgreen')
    plt.bar(x + width, test_dist, width, label='Test', color='salmon')
    
    plt.xlabel('Animal Class')
    plt.ylabel('Proportion')
    plt.title('Class Distribution Across Stratified Splits')
    plt.xticks(x, class_names)
    plt.legend()
    plt.tight_layout()
    plt.show()

plot_class_distribution(y_train, y_val, y_test)

# Convert back to PyTorch datasets for model training
train_dataset = TensorDataset(torch.FloatTensor(X_train), torch.LongTensor(y_train))
val_dataset = TensorDataset(torch.FloatTensor(X_val), torch.LongTensor(y_val))
test_dataset = TensorDataset(torch.FloatTensor(X_test), torch.LongTensor(y_test))

# Create data loaders
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)

With stratified sampling, each split maintains the same proportion of each animal class. This is crucial for rare classes like 'lynx', ensuring that our model sees examples from all classes during training and is evaluated fairly during validation and testing.
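
To put a number on this, the following sketch compares how far the test-set class proportions drift from the full dataset with and without `stratify`. The labels are hypothetical toy data drawn with the same class proportions as above, and the names (`y_demo`, `proportions`) are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical labels with the same class proportions as the tutorial dataset
y_demo = rng.choice(5, size=5000, p=[0.4, 0.3, 0.15, 0.1, 0.05])
idx = np.arange(len(y_demo))


def proportions(labels, n_classes=5):
    """Fraction of samples belonging to each class."""
    return np.bincount(labels, minlength=n_classes) / len(labels)


# Same 80/20 split, once stratified and once purely random
_, test_idx_strat = train_test_split(idx, test_size=0.2, random_state=0, stratify=y_demo)
_, test_idx_rand = train_test_split(idx, test_size=0.2, random_state=0)

full = proportions(y_demo)
dev_strat = np.abs(proportions(y_demo[test_idx_strat]) - full).max()
dev_rand = np.abs(proportions(y_demo[test_idx_rand]) - full).max()
print(f"max deviation from full distribution (stratified): {dev_strat:.4f}")
print(f"max deviation from full distribution (random):     {dev_rand:.4f}")
```

With stratification, each class count in the test set can differ from its proportional share only through rounding, so the deviation stays below one sample per class.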

### 3.4 Cross-Validation Techniques

For smaller datasets or when we need more robust evaluation, cross-validation provides a way to use our data more efficiently.

#### 3.4.1 K-Fold Cross-Validation

<div align="center">
    <img src="figures/cross_validation.png" width="750"/>
    <p><i>Figure 2: k-fold cross validation. Source: https://en.wikipedia.org/wiki/Cross-validation_(statistics)</i></p>
</div>

K-fold cross-validation divides the data into $k$ folds of (approximately) equal size and trains $k$ models, each using $k-1$ folds for training and the remaining fold for validation:

In [None]:
from sklearn.model_selection import KFold

# K-fold Cross Validation
k_folds = 5
kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

# Visualize the folds
plt.figure(figsize=(10, 6))
sample_range = np.arange(len(X))

# Create a colored array for visualization
fold_visualization = np.zeros(len(X))

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    print(f"Fold {fold+1}/{k_folds}")
    print(f"  Training: {len(train_idx)} samples")
    print(f"  Validation: {len(val_idx)} samples")
    
    # Mark this fold's validation indices in our visualization array
    fold_visualization[val_idx] = fold + 1
    
    # Check class distribution in this fold
    train_class_dist = np.bincount(y[train_idx], minlength=num_classes)
    val_class_dist = np.bincount(y[val_idx], minlength=num_classes)
    
    print(f"  Training class distribution: {train_class_dist}")
    print(f"  Validation class distribution: {val_class_dist}")
    print()

# Plot the fold assignment for a subset of data points for clearer visualization
subset_size = 500  # Show first 500 samples for clarity
plt.scatter(sample_range[:subset_size], fold_visualization[:subset_size], 
            c=fold_visualization[:subset_size], cmap='viridis', 
            marker='|', s=100, alpha=0.8)

plt.yticks(np.arange(0, k_folds+1))
plt.xlabel('Sample Index')
plt.ylabel('Fold Assignment (0 = not in validation set)')
plt.title('K-Fold Cross-Validation Data Splitting')
plt.colorbar(label='Validation Fold')
plt.tight_layout()
plt.show()

# Let's also look at how class distribution varies across folds
fold_class_dist = np.zeros((k_folds, num_classes))

for fold, (train_idx, val_idx) in enumerate(kf.split(X)):
    # Count class occurrences in each validation fold
    fold_class_dist[fold] = np.bincount(y[val_idx], minlength=num_classes)

# Plot class distribution across folds
plt.figure(figsize=(12, 6))
bar_width = 0.15
index = np.arange(num_classes)

for fold in range(k_folds):
    plt.bar(index + fold * bar_width, fold_class_dist[fold], bar_width,
            label=f'Fold {fold+1}')

plt.xlabel('Animal Class')
plt.ylabel('Number of Samples in Validation Set')
plt.title('Class Distribution Across Validation Folds')
plt.xticks(index + bar_width * (k_folds-1)/2, class_names)
plt.legend()
plt.tight_layout()
plt.show()

#### 3.4.2 Stratified K-Fold for Imbalanced Data

For our imbalanced animal dataset, standard K-Fold might produce folds with few or no examples of the rare 'lynx' class, just as with random data splitting (see section [3.3.2](#332-the-problem-with-random-splitting-for-imbalanced-data)). Stratified K-Fold keeps the class distribution approximately the same in every fold:

In [None]:
from sklearn.model_selection import StratifiedKFold

# Stratified K-fold cross-validation
skf = StratifiedKFold(n_splits=k_folds, shuffle=True, random_state=42)

# Visualize fold distributions
plt.figure(figsize=(15, 8))

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Calculate class distribution for this fold
    y_train_fold = y[train_idx]
    y_val_fold = y[val_idx]
    
    train_dist = np.bincount(y_train_fold, minlength=num_classes) / len(y_train_fold)
    val_dist = np.bincount(y_val_fold, minlength=num_classes) / len(y_val_fold)
    
    # Plot distributions
    plt.subplot(2, 3, fold+1)
    x = np.arange(len(class_names))
    width = 0.35
    
    plt.bar(x - width/2, train_dist, width, label='Train', color='skyblue')
    plt.bar(x + width/2, val_dist, width, label='Validation', color='salmon')
    
    plt.title(f'Fold {fold+1} Class Distribution')
    plt.xlabel('Animal Class')
    plt.ylabel('Proportion')
    plt.xticks(x, class_names, rotation=45)
    plt.legend()
    plt.tight_layout()

plt.tight_layout()
plt.show()

Notice how Stratified K-Fold maintains similar class proportions across all folds. This is particularly important for our animal classification task where 'lynx' examples are rare.
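
The cells above only show how the data is partitioned. As a rough sketch of how the folds are then used for evaluation, the loop below trains one fresh model per fold and summarizes the validation accuracies. To keep it fast and self-contained, a scikit-learn `LogisticRegression` is swapped in as a stand-in for a neural network, and a smaller synthetic dataset (`X_cv`, `y_cv`) is generated; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Smaller synthetic dataset with the same imbalance as in the tutorial
X_cv, y_cv = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_redundant=5,
    n_classes=5, n_clusters_per_class=2,
    weights=[0.4, 0.3, 0.15, 0.1, 0.05], class_sep=1.5, random_state=42,
)

skf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []
for train_idx, val_idx in skf_demo.split(X_cv, y_cv):
    clf = LogisticRegression(max_iter=1000)       # fresh model for every fold
    clf.fit(X_cv[train_idx], y_cv[train_idx])     # train on k-1 folds
    fold_scores.append(clf.score(X_cv[val_idx], y_cv[val_idx]))  # validate on the rest

fold_scores = np.array(fold_scores)
print(f"accuracy per fold: {np.round(fold_scores, 3)}")
print(f"mean +/- std: {fold_scores.mean():.3f} +/- {fold_scores.std():.3f}")
```

Reporting the mean together with the spread across folds gives a more robust estimate than a single train/validation split, at the cost of training $k$ models.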

### 3.5 Handling Class Imbalance Beyond Stratification
#### 3.5.1 Weighted Loss Functions

Stratified sampling is just one approach to handling imbalanced data. Even with stratified splits, we might still face challenges during training: if the model sees many more 'cat' examples than 'lynx' examples, it might learn to favor the former. One way to counteract this is a weighted loss function, which penalizes errors on minority classes (like 'lynx') more heavily:


In [None]:
# Calculate class weights inversely proportional to class frequencies
class_counts = np.bincount(y, minlength=num_classes)
class_weights = 1. / class_counts
class_weights = class_weights / np.sum(class_weights) * num_classes  # Normalize so the weights sum to num_classes

# Display weights
plt.figure(figsize=(10, 5))
bars = plt.bar(range(len(class_names)), class_weights, color='skyblue')
plt.xticks(range(len(class_names)), class_names)
plt.xlabel('Animal Class')
plt.ylabel('Weight')
plt.title('Class Weights for Imbalanced Data')

# Add weight values above bars
for i, (weight, bar) in enumerate(zip(class_weights, bars)):
    plt.text(i, weight + 0.1, f'{weight:.2f}', ha='center')

plt.tight_layout()
plt.show()

# Convert to PyTorch tensor for the loss function
class_weights_tensor = torch.FloatTensor(class_weights).to(device)

# Use weights in CrossEntropyLoss
weighted_criterion = torch.nn.CrossEntropyLoss(weight=class_weights_tensor)

#### 3.5.2 Resampling Techniques

Alternatively, we can modify the data distribution through:

- **Oversampling**: Duplicate samples from minority classes like 'lynx'
- **Undersampling**: Remove samples from majority classes like 'cat'
- **Combined approach**: Balance the dataset using both oversampling and undersampling

Be aware that these techniques have drawbacks of their own. Oversampling can lead to overfitting, as the model may memorize the duplicated samples rather than generalize from them. Undersampling can discard valuable information, especially if the removed samples are informative, and may not be feasible for smaller datasets. There is no one-size-fits-all solution; the best approach depends on the specific dataset and task at hand.
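
One way to oversample in PyTorch without physically duplicating data is `torch.utils.data.WeightedRandomSampler`, which draws samples with replacement according to per-sample weights. The sketch below uses a hypothetical two-class toy dataset (900 vs. 100 samples) and weights each sample by the inverse size of its class; all variable names are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced toy data: 900 samples of class 0, 100 of class 1
labels_demo = torch.cat([torch.zeros(900, dtype=torch.long),
                         torch.ones(100, dtype=torch.long)])
features_demo = torch.randn(len(labels_demo), 4)
dataset_demo = TensorDataset(features_demo, labels_demo)

# Weight each sample by 1 / (number of samples in its class)
class_counts_demo = torch.bincount(labels_demo)
sample_weights = 1.0 / class_counts_demo[labels_demo].float()

# Draw as many samples per epoch as the dataset has, with replacement
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels_demo),
                                replacement=True)
loader_demo = DataLoader(dataset_demo, batch_size=100, sampler=sampler)

# Count which classes were actually drawn in one pass over the loader
drawn = torch.cat([batch_labels for _, batch_labels in loader_demo])
drawn_counts = torch.bincount(drawn, minlength=2)
print(f"original class counts: {class_counts_demo.tolist()}")
print(f"drawn class counts:    {drawn_counts.tolist()}")
```

The drawn batches are roughly balanced, but the minority samples are repeated many times within an epoch, which is exactly where the overfitting risk mentioned above comes from.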

## 4. Understanding Training and Validation Metrics
Now that we have properly split our data, we need to understand which metrics to use to evaluate our models and how these metrics are affected by our splitting strategies. The way we split our data directly impacts the reliability of our evaluation metrics - appropriate splitting ensures that our metrics actually reflect how well our model will perform on **unseen** data.

### 4.1 Loss Functions: The Optimization Target

The loss function is the core metric that guides neural network optimization. It quantifies how far the model's predictions are from the ground truth, providing the signal used to adjust the model's parameters. The specific choice of loss function depends on the task.

#### 4.1.1 Classification Loss: Cross-Entropy

For classification tasks, the most common loss function is *Cross-Entropy Loss*. It measures the difference between the predicted probability distribution and the true distribution.

**Binary Cross-Entropy**

For binary classification (two classes), we use Binary Cross-Entropy:

$\text{BCE}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$

Where:

- $y_i$ is the true label (0 or 1)
- $\hat{y}_i$ is the predicted probability that sample $i$ belongs to the positive class (label 1)
- $N$ is the number of samples

In PyTorch, ``nn.BCELoss()`` implements binary cross-entropy and expects probabilities as input, for example the output of a sigmoid layer. If you have raw logits instead, use ``nn.BCEWithLogitsLoss()``, which combines a sigmoid layer and the binary cross-entropy loss in a single class. This is numerically more stable than a plain Sigmoid followed by a BCELoss.
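
As a small sanity check of the relationship between the two classes and the formula above, the following sketch computes the binary cross-entropy in three ways on made-up logits and labels. The tensors are illustrative values only.

```python
import torch
import torch.nn as nn

logits_bin = torch.tensor([2.0, -1.0, 0.5])   # raw model outputs
targets_bin = torch.tensor([1.0, 0.0, 1.0])   # true labels as floats

# Variant 1: apply the sigmoid ourselves, then BCELoss on probabilities
probs_bin = torch.sigmoid(logits_bin)
loss_from_probs = nn.BCELoss()(probs_bin, targets_bin)

# Variant 2: BCEWithLogitsLoss applies the sigmoid internally
loss_from_logits = nn.BCEWithLogitsLoss()(logits_bin, targets_bin)

# Variant 3: the BCE formula written out by hand
manual_bce = -(targets_bin * torch.log(probs_bin)
               + (1 - targets_bin) * torch.log(1 - probs_bin)).mean()

print(f"BCELoss on probabilities:    {loss_from_probs.item():.4f}")
print(f"BCEWithLogitsLoss on logits: {loss_from_logits.item():.4f}")
print(f"Manual formula:              {manual_bce.item():.4f}")
```

All three values agree up to floating-point precision for these moderate logits; the advantage of the logits variant shows up for very large or very small logits, where the separate sigmoid saturates.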

**Categorical Cross-Entropy**

For multi-class classification, we use Categorical Cross-Entropy:

$\text{CE}(y, \hat{y}) = -\frac{1}{N}\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$

Where:

- $y_{i,c}$ is 1 if sample $i$ belongs to class $c$ and 0 otherwise (one-hot encoding)
- $\hat{y}_{i,c}$ is the predicted probability that sample $i$ belongs to class $c$
- $C$ is the number of classes
- $N$ is the number of samples

In PyTorch, categorical cross-entropy is available as ``nn.CrossEntropyLoss()``, which combines LogSoftmax and NLLLoss (negative log likelihood loss) in a single class. Importantly, it expects raw logits (**not** softmax outputs) and, in its default usage, class indices (**not** one-hot encoded labels) as targets.

Let's see a practical example of how the model makes "decisions" based on logits and how to calculate the loss based on these logits.

In [None]:
import torch.nn.functional as F

class_names = ['cat', 'dog', 'bird', 'cow', 'lynx']
num_classes = len(class_names)

# Create dummy model outputs (logits) for 3 examples
# These are the raw scores before softmax
logits = torch.tensor([
    [2.0, 0.5, 0.2, 0.1, 0.1],  # Example 1: Model predicts cat with high "confidence"
    [0.2, 0.5, 1.8, 0.3, 0.2],  # Example 2: Model predicts bird with medium "confidence"
    [0.4, 0.4, 0.4, 0.4, 0.5]   # Example 3: Model predicts lynx with low "confidence"
])

# Convert logits to probabilities with softmax
probs = F.softmax(logits, dim=1)
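for i, prob in enumerate(probs):
    pred_class = class_names[torch.argmax(prob).item()]
    # Report the predicted class together with the probability assigned to it
    print(f"Example {i+1}: predicted class = {pred_class}, probability = {prob.max().item():.2f}")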

# Plot the predicted probabilities
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
fig.suptitle('Predicted Probabilities from Logits', fontsize=16)
for i, (prob, ax) in enumerate(zip(probs, axes)):
    prob_np = prob.detach().numpy()
    bars = ax.bar(np.arange(len(class_names)), prob_np, color='skyblue')
    
    # Highlight the highest probability
    max_idx = np.argmax(prob_np)
    bars[max_idx].set_color('navy')
    
    ax.set_xticks(np.arange(len(class_names)))
    ax.set_xticklabels(class_names, rotation=45)
    ax.set_ylim(0, 1)
    ax.set_ylabel('Probability')
    ax.set_title(f'Example {i+1}: Predicts {class_names[max_idx]}')
    
    # Add values above bars
    for j, p in enumerate(prob_np):
        ax.annotate(f'{p:.2f}', (j, p), ha='center', va='bottom')

plt.show()

In [None]:
# To calculate the loss, we need to know the true labels for these examples.
labels = torch.tensor([0, 2, 3])  # One true class index per example (row of logits): cat, bird, cow
n_picks = len(labels)

# 1. Convert logits to log probabilities (log-softmax)
log_probs = F.log_softmax(logits, dim=1)

# 2. Pick the log probability corresponding to the true class for each example
picked_log_probs = log_probs[range(n_picks), labels]

# 3. Take the negative mean
manual_loss = -picked_log_probs.mean()
print(f"Loss calculated manually: {manual_loss.item():.4f}")

In [None]:
import torch.nn as nn

# Or we can use the built-in CrossEntropyLoss function
loss_fn = nn.CrossEntropyLoss()
loss = loss_fn(logits, labels)
print(f"Loss using CrossEntropyLoss: {loss.item():.4f}")

#### 4.1.2 Confidence and Loss

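Despite the large difference in confidence between Examples 1 and 2 on the one hand and Example 3 on the other, at inference time each is reduced to a single definitive prediction (the class with the highest probability). During training, however, confidence matters: a correct prediction made with low confidence incurs a higher loss than a correct prediction made with high confidence, and a confident wrong prediction is penalized most of all: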

In [None]:
# Let's see how the loss behaves for different scenarios:
loss_fn = nn.CrossEntropyLoss() 

# High confidence, correct prediction ("confidence" here means the model's certainty, i.e. the probability it assigns to the predicted class)
high_conf_correct = torch.tensor([[10.0, 0.1, 0.1, 0.1, 0.1]])  # Very confident it's a cat
loss_high_conf_correct = loss_fn(high_conf_correct, torch.tensor([0]))
print(f"High confidence, correct prediction: {loss_high_conf_correct.item():.4f}")

# Low confidence, correct prediction
low_conf_correct = torch.tensor([[1.0, 0.8, 0.7, 0.6, 0.5]])  # Somewhat confident it's a cat
loss_low_conf_correct = loss_fn(low_conf_correct, torch.tensor([0]))
print(f"Low confidence, correct prediction: {loss_low_conf_correct.item():.4f}")

# High confidence, wrong prediction
high_conf_wrong = torch.tensor([[0.1, 0.1, 0.1, 0.1, 10.0]])  # Very confident it's a lynx (but it's a cat)
loss_high_conf_wrong = loss_fn(high_conf_wrong, torch.tensor([0]))
print(f"High confidence, wrong prediction: {loss_high_conf_wrong.item():.4f}")

# Low confidence, wrong prediction
low_conf_wrong = torch.tensor([[0.5, 0.6, 0.7, 0.8, 1.0]])  # Somewhat confident it's a lynx (but it's a cat)
loss_low_conf_wrong = loss_fn(low_conf_wrong, torch.tensor([0]))
print(f"Low confidence, wrong prediction: {loss_low_conf_wrong.item():.4f}")

> **Note**: You can modulate the losses based on confidence by using label smoothing. Label smoothing is a technique used to prevent the model from becoming too confident in its predictions. Instead of assigning a probability of 1 to the true class and 0 to all others, we assign a slightly lower probability (e.g., 0.9) to the true class and distribute the remaining probability mass (e.g., 0.1) among the other classes. (PyTorch's `label_smoothing=ε` implements a common variant that spreads ε uniformly over all $C$ classes, so the true class receives $1 - ε + ε/C$.) This encourages the model to be less certain about its predictions and can help improve generalization. Try it out by inserting `nn.CrossEntropyLoss(label_smoothing=0.1)` above and see how the losses change!
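
To make the effect concrete, the sketch below builds the smoothed target distribution by hand and compares the resulting loss with PyTorch's built-in option. It assumes the uniform variant used by `nn.CrossEntropyLoss`, in which the smoothing mass `eps` is spread over all `C` classes, so the target becomes `(1 - eps) * one_hot + eps / C`. The variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

eps = 0.1
logits_ls = torch.tensor([[10.0, 0.1, 0.1, 0.1, 0.1]])  # very confident "cat"
target_ls = torch.tensor([0])                            # true class: cat
n_cls = logits_ls.shape[1]

# Smoothed target distribution: (1 - eps) * one_hot + eps / C
one_hot = F.one_hot(target_ls, num_classes=n_cls).float()
smoothed_target = (1 - eps) * one_hot + eps / n_cls

# Cross-entropy against the smoothed distribution, written out by hand
manual_ls = -(smoothed_target * F.log_softmax(logits_ls, dim=1)).sum(dim=1).mean()

# Built-in versions with and without label smoothing
builtin_ls = nn.CrossEntropyLoss(label_smoothing=eps)(logits_ls, target_ls)
plain = nn.CrossEntropyLoss()(logits_ls, target_ls)

print(f"smoothed target:            {smoothed_target.squeeze().tolist()}")
print(f"loss without smoothing:     {plain.item():.4f}")
print(f"loss with smoothing:        {builtin_ls.item():.4f}")
print(f"loss with smoothing (hand): {manual_ls.item():.4f}")
```

With smoothing, even a correct and extremely confident prediction keeps a noticeable loss, because the target itself no longer puts all probability on one class.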

#### 4.1.3 Regression Losses

For regression tasks, where we predict continuous values, the most common loss function is Mean Squared Error (MSE or L2 loss). It measures the average squared difference between predicted and true values:

$\text{MSE}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2$

Where:

- $y_i$ is the true value
- $\hat{y}_i$ is the predicted value
- $N$ is the number of samples

In PyTorch, this is implemented as ``nn.MSELoss()``.

The Mean Absolute Error (MAE) is another common regression loss function, which measures the average absolute difference between predicted and true values:

$\text{MAE}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N} |y_i - \hat{y}_i|$

Where:
- $y_i$ is the true value
- $\hat{y}_i$ is the predicted value
- $N$ is the number of samples

In PyTorch, this is implemented as ``nn.L1Loss()``.
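
The practical difference between the two losses is their sensitivity to large errors. The following sketch, on made-up values, compares both losses for predictions where every error is 0.5 and for predictions where one error is 4.0.

```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 2.0, 3.0, 4.0])
y_pred_clean = torch.tensor([1.5, 2.5, 2.5, 3.5])     # every error is 0.5
y_pred_outlier = torch.tensor([1.5, 2.5, 2.5, 8.0])   # the last error is 4.0

mse, mae = nn.MSELoss(), nn.L1Loss()
for name, pred in [("clean", y_pred_clean), ("outlier", y_pred_outlier)]:
    print(f"{name:8s} MSE={mse(pred, y_true).item():.4f}  "
          f"MAE={mae(pred, y_true).item():.4f}")
```

A single large error raises the MSE from 0.25 to 4.1875, while the MAE only grows from 0.5 to 1.375. Squaring makes MSE far more sensitive to outliers than MAE.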

This tutorial will focus on classification tasks, so we won't dwell on this.


#### 4.1.4 Loss Monitoring

During training, the model parameters are updated to minimize the loss, effectively improving the model's predictions. Monitoring the loss is crucial for understanding how well the model is learning. We typically track both training and validation loss. The former indicates how well the model fits the training data, while the latter shows how well it generalizes to **unseen** data.

<div align="center">
    <img src="figures/losses.png" width="1000"/>
    <p><i>Figure 3: Different loss behaviors</i></p>
</div>

Let's examine why monitoring loss is crucial:

1. **Good Convergence**: Both training and validation loss decrease and converge to similar values. The gap between them remains small, indicating the model generalizes well.

2. **Overfitting**: Training loss continues to decrease, but validation loss starts increasing after a certain point. This indicates the model is memorizing the training data rather than learning generalizable patterns.

3. **Underfitting**: Both losses plateau at relatively high values. This suggests the model lacks capacity to capture the underlying patterns in the data.
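
The three patterns above can be turned into a rough automatic check. The helper below is a hypothetical heuristic, not part of the Utils functions; its thresholds (`high_loss`, `gap_tol`) are arbitrary illustrative values that would need adjusting for a real loss function and task.

```python
def diagnose_loss_curves(train_loss, val_loss, high_loss=1.0, gap_tol=0.1):
    """Return a rough label for a pair of per-epoch loss curves.

    high_loss and gap_tol are arbitrary illustrative thresholds.
    """
    # Epoch at which the validation loss was lowest
    best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
    # Overfitting: validation loss rose after its minimum while training loss kept falling
    val_rose = val_loss[-1] - val_loss[best_epoch] > gap_tol
    train_still_falling = train_loss[-1] < train_loss[best_epoch]
    if val_rose and train_still_falling:
        return "overfitting"
    # Underfitting: both curves end at a high value
    if train_loss[-1] > high_loss and val_loss[-1] > high_loss:
        return "underfitting"
    return "good convergence"


# Made-up curves illustrating the three cases
good = diagnose_loss_curves([1.6, 1.0, 0.6, 0.4, 0.35], [1.6, 1.1, 0.7, 0.5, 0.45])
over = diagnose_loss_curves([1.6, 1.0, 0.6, 0.3, 0.1], [1.6, 1.1, 0.8, 1.0, 1.3])
under = diagnose_loss_curves([1.6, 1.5, 1.45, 1.42, 1.41], [1.6, 1.52, 1.48, 1.46, 1.45])
print(good, over, under, sep=" | ")
```

In practice, looking at the plotted curves remains more reliable than any fixed threshold; the function only encodes the qualitative rules from the list.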

### 4.2 Accuracy: The Interpretable Metric

While loss functions drive optimization, accuracy is often more interpretable. It measures the fraction of correct predictions out of all predictions:

$\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$

<div align="center">
    <img src="figures/accuracy.png" width="1000"/>
    <p><i>Figure 4: Different accuracy behaviors</i></p>
</div>
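
A minimal sketch of the accuracy formula on made-up labels (the helper name `accuracy` is hypothetical):

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Illustrative labels for 8 samples and 3 classes
true_labels = [0, 1, 2, 1, 0, 2, 1, 0]
predicted_labels = [0, 1, 1, 1, 0, 2, 0, 0]

acc = accuracy(true_labels, predicted_labels)
print(f"Accuracy: {100 * acc:.1f}%")  # 6 of 8 correct -> 75.0%
```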

### 4.3 Using the `visualize_training_results` Function

Our `visualize_training_results` function makes it easy to plot both loss and accuracy during training:

```python
# `train_model` returns the training history as a dictionary of per-epoch values.
# It has the following structure:
history = {
    'train_loss': train_loss,
    'val_loss': val_loss,
    'train_acc': train_acc,
    'val_acc': val_acc
}

# Visualize the results
visualize_training_results(
    train_losses=history['train_loss'],
    train_accs=history['train_acc'],
    test_losses=history['val_loss'],
    test_accs=history['val_acc'],
    output_dir=None  # Set to a path to save the plot
)
```

### 4.4 Implementing Early Stopping

To prevent overfitting, we often use early stopping. Our `train_model` function has built-in early stopping logic:

```python
# Not run here, but this is how you would use it
history = train_model(
    model=model,  # Your model
    train_loader=train_loader,  # Your training data
    val_loader=val_loader,  # Your validation data
    criterion=nn.CrossEntropyLoss(),  # Your loss function
    optimizer=optimizer,  # Your optimizer
    scheduler=scheduler,  # Your learning rate scheduler (optional can be None)
    device=device,  # Your device to train on
    num_epochs=100,  # Number of epochs
    checkpoint_path='./model_checkpoints/',  # Path to save the model with the best performance in `monitor`
    patience=10,  # Stop if no improvement for 10 epochs
    min_delta=0.001,  # Minimum change to qualify as improvement
    monitor='val_loss',  # Monitor validation loss for early stopping and checkpointing
)
```

Early stopping helps us find the sweet spot between underfitting and overfitting. In the example above we monitor the validation loss and stop training if it has not improved for 10 epochs. `min_delta` is the minimum decrease that counts as an improvement, so tiny fluctuations in the validation loss are not mistaken for real progress.
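
To make the roles of `patience` and `min_delta` concrete, here is a minimal sketch of early stopping logic. It is not the actual code inside `train_model`; the class `EarlyStopper` and the loss values are hypothetical and only illustrate the idea for a monitored quantity that should decrease.

```python
class EarlyStopper:
    """Track a monitored loss and signal when training should stop."""

    def __init__(self, patience=10, min_delta=0.001):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch; return True if training should stop."""
        if val_loss < self.best - self.min_delta:
            # Improvement larger than min_delta: remember it and reset the counter
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


# Illustrative validation losses; a short patience keeps the example small
val_losses = [1.00, 0.80, 0.70, 0.6995, 0.71, 0.72, 0.70, 0.75]
stopper = EarlyStopper(patience=3, min_delta=0.001)

stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.step(loss):
        stopped_at = epoch
        break

print(f"Stopped at epoch {stopped_at}, best val_loss = {stopper.best}")
```

Note that the drop from 0.70 to 0.6995 is smaller than `min_delta`, so it does not reset the counter.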

## 5. Understanding Test-Time Evaluation

Once a model is trained, we need to evaluate it thoroughly on a test set.

### 5.1 Beyond Simple Accuracy

Overall accuracy can be misleading, especially for imbalanced datasets, which is why our `test_model` function provides more detailed metrics. The simulated test results below illustrate why this is useful:


In [None]:
import numpy as np
import pandas as pd

# Use the existing dataset from Section 2 instead of creating a new one
num_samples = len(y)

# Simulate model predictions and confidences
per_image_data = {
    'image_idx': list(range(num_samples)),
    'true_label': y,
    'true_class': [class_names[label] for label in y],
    'predicted_label': [],
    'predicted_class': [],
    'confidence': np.random.uniform(0.7, 1.0, num_samples),
    'correct': [],
}

# Define class-specific accuracy rates
class_accuracy_rates = {
    'cat': 0.95,   # High accuracy for common class
    'dog': 0.85,
    'bird': 0.70,
    'cow': 0.65,
    'lynx': 0.50   # Low accuracy for rare class
}

# Simulate predictions
for i in range(num_samples):
    true_label = per_image_data['true_label'][i]
    true_class = class_names[true_label]
    
    # Determine if prediction is correct based on class-specific accuracy
    correct_prob = class_accuracy_rates[true_class]
    is_correct = np.random.random() < correct_prob
    
    # Get predicted label
    if is_correct:
        pred_label = true_label
    else:
        # If wrong, randomly choose another class
        pred_label = np.random.choice([c for c in range(num_classes) if c != true_label])
    
    # Store results
    per_image_data['predicted_label'].append(pred_label)
    per_image_data['predicted_class'].append(class_names[pred_label])
    per_image_data['correct'].append(is_correct)

# Create per-image DataFrame
per_image_df = pd.DataFrame(per_image_data)

# Calculate overall accuracy
overall_accuracy = 100 * per_image_df['correct'].mean()

# Create aggregated metrics
aggregate_data = []
for i, class_name in enumerate(class_names):
    # Filter for this class
    class_samples = per_image_df[per_image_df['true_label'] == i]
    
    if len(class_samples) > 0:
        class_accuracy = 100 * class_samples['correct'].mean()
        # Average confidence for correct predictions only
        correct_samples = class_samples[class_samples['correct']]
        avg_confidence = correct_samples['confidence'].mean() if len(correct_samples) > 0 else 0
        
        aggregate_data.append({
            'class_name': class_name,
            'accuracy': class_accuracy,
            'avg_confidence': avg_confidence,
            'support': len(class_samples)
        })

# Create aggregate DataFrame
aggregate_df = pd.DataFrame(aggregate_data)

# Print summary
print(f'Overall Test Accuracy: {overall_accuracy:.2f}%')
print("\nPer-class Performance:")
print(aggregate_df.to_string(index=False))

> Do you think that the model has learned to generalize well to categorize these classes?
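
One way to approach this question is to compare the overall accuracy with the macro-averaged accuracy, i.e. the unweighted mean of the per-class accuracies. The sketch below uses an extreme, made-up example (a model that always predicts the majority class); the helper `per_class_accuracy` is hypothetical and not part of the Utils functions.

```python
from collections import defaultdict


def per_class_accuracy(true_labels, predicted_labels):
    """Return {class: accuracy on the samples of that class}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        correct[t] += int(t == p)
    return {c: correct[c] / total[c] for c in total}


# Made-up imbalanced test set: 90 cats, 10 lynxes.
# The "model" predicts 'cat' for every image.
toy_true = ["cat"] * 90 + ["lynx"] * 10
toy_pred = ["cat"] * 100

per_class = per_class_accuracy(toy_true, toy_pred)
overall = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
macro = sum(per_class.values()) / len(per_class)

print(f"Overall accuracy:   {overall:.2f}")  # 0.90
print(f"Per-class accuracy: {per_class}")    # cat: 1.0, lynx: 0.0
print(f"Macro accuracy:     {macro:.2f}")    # 0.50
```

A 90% overall accuracy hides the fact that the rare class is never recognized; the macro average exposes it.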

### 5.2 Visualizing Test Results

After training and evaluating the model, it's important to visualize the results to gain deeper insights:


In [None]:
# Using our visualization function on the simulated test results
visualize_test_results(
    aggregate_df=aggregate_df,
    per_image_df=per_image_df,
    overall_accuracy=overall_accuracy,
    output_dir=None  # Set to a path to save the plot. If None the plot gets displayed
)

### Key insights from test visualizations:

1. **Per-class accuracy**: Reveals which classes the model struggles with, highlighting potential issues with the data or model architecture.

2. **Confidence vs. accuracy**: Shows if the model is well-calibrated. Ideally, confidence should align with accuracy. 

3. **Confusion matrix**: Identifies specific pairs of classes that get confused with each other.

4. **Confidence distribution**: Helps understand if the model is making confident predictions and if those confident predictions are actually correct.
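
To illustrate the confusion matrix from point 3, here is a minimal sketch on made-up labels. The helper `confusion_matrix` and the toy data are hypothetical; `visualize_test_results` may compute its matrix differently.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[t][p] += 1
    return matrix


# Illustrative labels for three classes
toy_class_names = ["cat", "dog", "lynx"]
toy_true = [0, 0, 0, 1, 1, 2, 2, 2, 2]
toy_pred = [0, 0, 0, 1, 0, 0, 0, 2, 1]

cm = confusion_matrix(toy_true, toy_pred, num_classes=3)
for name, row in zip(toy_class_names, cm):
    print(f"{name:>5}: {row}")

# Find the most frequent off-diagonal entry (the most common confusion)
off_diagonal = [(cm[t][p], t, p) for t in range(3) for p in range(3) if t != p]
count, t, p = max(off_diagonal)
print(f"Most common confusion: {toy_class_names[t]} predicted as "
      f"{toy_class_names[p]} ({count} times)")
```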

## 6. Understanding Neural Network Embeddings

### 6.1 Why Embeddings Matter

Deep learning models create internal representations (embeddings) of the input data (see [introduction](#12-the-main-objective-of-deep-learning-learning-representations)). These embeddings capture the essential features that the model uses for classification, typically encoded as vectors in a high-dimensional space where similar concepts are positioned close together.

In summary, examining embeddings helps us:

- **Reveal feature learning**: Show what information the model extracts from raw data
- **Expose clustering structure**: Similar inputs should have similar embeddings
- **Identify failure modes**: Misclassified examples may have unusual embedding patterns
- **Gain interpretability**: Show the model's "understanding" of the data

In our example we might assume that cats and lynxes are more similar in appearance than cats and cows, because they share the same features (whiskers and ears). This similarity can be reflected in the embeddings.
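
A common way to quantify such similarity is the cosine similarity between embedding vectors. The sketch below uses hand-made 4-dimensional vectors purely for illustration; real embeddings come from a trained model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings: the first two dimensions loosely stand for
# "cat-like" features, the last two for "cow-like" features.
toy_embeddings = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "lynx": [0.8, 0.9, 0.2, 0.1],
    "cow": [0.1, 0.2, 0.9, 0.8],
}

sim_cat_lynx = cosine_similarity(toy_embeddings["cat"], toy_embeddings["lynx"])
sim_cat_cow = cosine_similarity(toy_embeddings["cat"], toy_embeddings["cow"])

print(f"cat vs lynx: {sim_cat_lynx:.3f}")
print(f"cat vs cow:  {sim_cat_cow:.3f}")
```

With these toy vectors, the cat embedding is far closer to the lynx embedding than to the cow embedding.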

### 6.2 Visualizing Embeddings with Dimension Reduction

High-dimensional embeddings are hard to visualize directly. Dimension reduction techniques like t-SNE and PCA help us visualize them in 2D or 3D:

In [None]:
import numpy as np
import random
from Utils.plotting import visualize_embeddings_tsne, visualize_embeddings_pca

random.seed(42)
np.random.seed(42)

samples_per_class = 500
embedding_dim = 128

# Create clustered embeddings for each class
embeddings = []
labels = []

for class_idx, class_name in enumerate(class_names):
    # Create a cluster center for this class
    center = np.random.normal(0, 1, embedding_dim)
    
    # Generate samples around this center
    class_samples = center + np.random.normal(0, 1.5, (samples_per_class, embedding_dim))
    
    embeddings.append(class_samples)
    # Integer labels, so they can be used directly as indices into class_names
    labels.append(np.full(samples_per_class, class_idx))

# Combine all classes
embeddings = np.vstack(embeddings)
labels = np.hstack(labels)

# Visualize using t-SNE
visualize_embeddings_tsne(
    embeddings=embeddings,
    labels=labels,
    output_dir=None,
    class_names=class_names
)

# Visualize using PCA
visualize_embeddings_pca(
    embeddings=embeddings,
    labels=labels,
    output_dir=None,
    class_names=class_names
)


### 6.3 Understanding t-SNE vs PCA

Why is the clustering on the same embeddings so different between t-SNE and PCA?

1. **PCA visualization** projects the data onto the directions of maximum variance. It tries to preserve the global structure, emphasizing the overall distribution of points. Large distances in the original space remain large in the PCA projection, and the relative positions of distant clusters are generally maintained.

2. **t-SNE visualization** focuses on preserving neighborhood relationships. It excels at revealing local clusters and patterns, making it excellent for visualizing how data points group together. However, the sizes and distances between clusters may not reflect their true relationships in the original space!
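
To make the "directions of maximum variance" concrete, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is an illustration under simple assumptions (toy Gaussian data, hypothetical helper `pca_project`) and not necessarily how `visualize_embeddings_pca` is implemented.

```python
import numpy as np


def pca_project(data, n_components=2):
    """Project data onto its first principal components.

    Returns the projected points and the fraction of variance
    explained by each kept component.
    """
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, sorted by decreasing variance
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    projected = centered @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


rng = np.random.default_rng(0)

# Toy data: two clusters in 10 dimensions, separated along the first axis
cluster_a = rng.normal(0.0, 1.0, size=(100, 10))
cluster_b = rng.normal(0.0, 1.0, size=(100, 10))
cluster_b[:, 0] += 10.0
toy_data = np.vstack([cluster_a, cluster_b])

projected, explained = pca_project(toy_data)
separation = abs(projected[:100, 0].mean() - projected[100:, 0].mean())

print("Projected shape:", projected.shape)
print("Explained variance ratio:", np.round(explained, 3))
print(f"Cluster separation along PC1: {separation:.2f}")
```

Because the clusters differ mostly along one axis, the first principal component captures most of the variance and keeps the two clusters far apart, which is the global-structure behavior described above.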

**Key Limitations to Be Aware Of**:

- **PCA can miss non-linear structures**: Since PCA is a linear technique, it may fail to capture complex, non-linear relationships in your embeddings. Two classes that are separable by a non-linear boundary might appear mixed in PCA.

- **t-SNE can distort global structure**: While excellent at showing clusters, t-SNE may not preserve the distances between them. Clusters that are far apart in the original space might appear close in t-SNE visualization, or vice versa.

- **t-SNE is sensitive to hyperparameters**: The perplexity parameter in t-SNE (which influences the effective number of neighbors) can significantly affect the visualization. Different perplexity values can produce dramatically different visualizations.

- **Random initialization in t-SNE**: Because t-SNE is non-deterministic, it can produce different visualizations in different runs unless you fix the random seed. Both the `visualize_embeddings_tsne` and the `visualize_embeddings_pca` function set the random seed to 42.

**When Interpreting Model Embeddings**:

When examining neural network embeddings, it's important to use both techniques:

- Use **PCA** to understand the global structure and the principal directions of variation
- Use **t-SNE** to identify clusters and local patterns
- Don't draw strong conclusions from a single visualization technique
- Remember that both techniques are simplifications of a complex, high-dimensional space

By understanding the strengths and limitations of each visualization technique, you can better interpret what your neural network has actually learned from the data. A detailed overview of the limitations of each technique can be found in ["Understanding Dimensionality Reduction: PCA vs t-SNE vs UMAP vs FIt-SNE vs LargeVis vs Laplacian Eigenmaps"](https://carnotresearch.medium.com/understanding-dimensionality-reduction-pca-vs-t-sne-vs-umap-vs-fit-sne-vs-largevis-vs-laplacian-13d0be9ef7f4) and ["Why you should not rely on t-SNE, UMAP or TriMAP"](https://towardsdatascience.com/why-you-should-not-rely-on-t-sne-umap-or-trimap-f8f5dc333e59/).

These articles also discuss other dimension reduction techniques such as UMAP (Uniform Manifold Approximation and Projection) and TriMAP (Large-scale Dimensionality Reduction Using Triplets), which can be useful for visualizing embeddings but are not covered here.

### 6.4 Using Embeddings in Practice

Besides visualization, embeddings have several practical uses:

1. **Transfer learning**: Use embeddings from pre-trained models as features for new tasks
2. **Similarity search**: Find similar examples by comparing embedding distances
3. **Anomaly detection**: Identify unusual inputs that don't cluster well with known classes
4. **Interpretability**: Analyze which input features influence specific embedding dimensions
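
As an illustration of points 2 and 3, the sketch below assigns a query embedding to its nearest class centroid and flags it as an anomaly if even the nearest centroid is far away. The 2-D embeddings, the helper names, and the threshold are all made up for this example; real thresholds have to be chosen from your own data.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def nearest_centroid(embedding, centroids):
    """Return (distance, class name) of the closest class centroid."""
    return min((math.dist(embedding, c), name) for name, c in centroids.items())


# Made-up 2-D embeddings of known classes
known_embeddings = {
    "cat": [[1.0, 0.9], [0.9, 1.1], [1.1, 1.0]],
    "cow": [[5.0, 5.1], [4.9, 5.0], [5.1, 4.9]],
}
centroids = {name: centroid(vs) for name, vs in known_embeddings.items()}

anomaly_threshold = 1.0  # illustrative value

results = []
for query in ([1.0, 1.2], [3.0, 9.0]):
    distance, name = nearest_centroid(query, centroids)
    is_anomaly = distance > anomaly_threshold
    results.append((name, is_anomaly))
    print(f"query={query}: nearest={name}, distance={distance:.2f}, "
          f"anomaly={is_anomaly}")
```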

### 6.5 Visualizing Patch and Position Embeddings for Vision Transformers

We discuss the concepts of the patch and position embeddings for Vision Transformers in the corresponding notebook (section 04_1_ViT).

## 7. One More Thing: Finding the Right Model Capacity

Recent research has revealed that the relationship between model capacity and generalization is more complex than the classical U-shaped curve we see in overfitting. When models are scaled significantly beyond the traditional "sweet spot", we observe what's called the "double descent" phenomenon.

<div align="center">
    <img src="figures/double_descent.PNG" width="1000"/>
    <p><i>Figure 5: Double descent risk curve showing improved performance when scaling beyond the interpolation threshold</i></p>
</div>

> What is Double Descent?

The double descent phenomenon ([Belkin et al., 2019](https://www.pnas.org/doi/abs/10.1073/pnas.1903070116)) shows that:

- **Classical regime**: Test error initially follows the traditional U-shaped curve with a sweet spot balancing underfitting and overfitting.
- **Interpolation threshold**: Test error peaks when the model has just enough capacity to perfectly fit the training data.
- **Modern interpolating regime**: As model capacity increases further, test error surprisingly decreases again, often below the error at the classical sweet spot.

This explains why extremely large neural networks that perfectly fit training data can still generalize remarkably well in practice.
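
If you want to experiment with this yourself, random-feature regression is a common toy setup. The sketch below is an assumption-laden illustration, not a reproduction of the paper's experiments: it fits minimum-norm least-squares models on random ReLU features of increasing width, with all sizes and the noise level chosen arbitrarily. Once the number of features reaches the number of training samples, the model can interpolate the training data. The test error often peaks near that point and falls again for wider models, but on such a small problem the exact shape depends on the random seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_inputs = 30, 500, 5

# Toy regression task: noisy linear targets
true_weights = rng.normal(size=n_inputs)
x_train = rng.normal(size=(n_train, n_inputs))
x_test = rng.normal(size=(n_test, n_inputs))
y_train = x_train @ true_weights + 0.5 * rng.normal(size=n_train)
y_test = x_test @ true_weights


def random_relu_features(x, projection):
    """Fixed random projection followed by a ReLU."""
    return np.maximum(0.0, x @ projection)


results = {}
for n_features in [5, 15, 25, 30, 35, 60, 300]:
    projection = rng.normal(size=(n_inputs, n_features))
    phi_train = random_relu_features(x_train, projection)
    phi_test = random_relu_features(x_test, projection)

    # Minimum-norm least-squares solution via the pseudo-inverse
    coeffs = np.linalg.pinv(phi_train) @ y_train

    train_mse = float(np.mean((phi_train @ coeffs - y_train) ** 2))
    test_mse = float(np.mean((phi_test @ coeffs - y_test) ** 2))
    results[n_features] = (train_mse, test_mse)
    print(f"features={n_features:4d}  train MSE={train_mse:10.4f}  "
          f"test MSE={test_mse:10.4f}")
```

Here the number of random features plays the role of model capacity, and `n_features = n_train = 30` marks the interpolation threshold.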
