📝 **Author:** Amirhossein Heydari - 📧 **Email:** <amirhosseinheydari78@gmail.com> - 📍 **Origin:** [mr-pylin/pytorch-workshop](https://github.com/mr-pylin/pytorch-workshop)

---


**Table of contents**<a id='toc0_'></a>    
- [Dependencies](#toc1_)    
- [Generate Artificial Outputs](#toc2_)    
- [Loss Function](#toc3_)    
  - [Built-in Losses](#toc3_1_)    
    - [Regression tasks](#toc3_1_1_)    
    - [Classification tasks](#toc3_1_2_)    
      - [Binary Classification](#toc3_1_2_1_)    
      - [Multiclass Classification](#toc3_1_2_2_)    
    - [Specialized Classification](#toc3_1_3_)    
    - [Metric Learning / Ranking Losses](#toc3_1_4_)    
    - [Other](#toc3_1_5_)    
  - [Custom Losses](#toc3_2_)    
    - [Example 1: Mean Squared Error [Regression]](#toc3_2_1_)    
    - [Example 2: Cross Entropy Loss [binary classification]](#toc3_2_2_)    
    - [Example 3: Cross Entropy Loss [multiclass classification]](#toc3_2_3_)    
- [Comparison](#toc4_)    
  - [BCELoss vs MSELoss](#toc4_1_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

# <a id='toc1_'></a>[Dependencies](#toc0_)


In [1]:
import matplotlib.pyplot as plt
import torch
import torch.nn.functional as F
from torch import nn

# <a id='toc2_'></a>[Generate Artificial Outputs](#toc0_)

In [None]:
# regression
y_true_reg = torch.randn(10, 1)  # ground truth values
y_pred_reg = torch.randn(10, 1)  # predicted values

# log
print(f"y_true_reg : {y_true_reg}")
print(f"y_pred_reg : {y_pred_reg}")

In [None]:
# binary classification
num_classes = 2
batch_size = 10

y_true_cls_bin = torch.randint(0, num_classes, (batch_size,), dtype=torch.float32)  # true labels (0.0 or 1.0), float as expected by the BCE losses
y_pred_cls_bin = torch.randn(batch_size)  # logits (before sigmoid)

# log
print(f"y_true_cls_bin : {y_true_cls_bin}")
print(f"y_pred_cls_bin : {y_pred_cls_bin}")

In [None]:
# multiclass classification
num_classes = 5
batch_size = 10

y_true_cls_multi = torch.randint(0, num_classes, (batch_size,))  # true class indices
y_pred_cls_multi = torch.randn(batch_size, num_classes)  # logits (before softmax)

# log
print(f"y_true_cls_multi : {y_true_cls_multi}")
print(f"y_pred_cls_multi : {y_pred_cls_multi}")

# <a id='toc3_'></a>[Loss Function](#toc0_)

- A function that quantifies the difference between the **predicted** output of a model and the **true** target values.
- It serves as a **measure** of how well (or poorly) the model's predictions align with the actual outcomes, guiding the optimization process during training.

<figure style="text-align: center;">
  <img src="../../assets/images/third_party/loss-function.png" alt="loss-function.png" style="width: 100%;">
  <figcaption style="text-align: center;">©️ Image: <a href= "https://www.offconvex.org/2016/03/22/saddlepoints">offconvex.org/2016/03/22/saddlepoints</a></figcaption>
</figure>
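
As a minimal sketch of how a loss guides optimization, the snippet below fits `y = w * x` by repeatedly stepping against the gradient of an MSE loss. The toy data, the single parameter `w`, and the learning rate are illustrative assumptions, not part of the workshop's models.

```python
import torch

# toy data generated by y = 2 * x (hypothetical values for illustration)
x = torch.tensor([1.0, 2.0, 3.0])
y = 2.0 * x

w = torch.tensor(0.0, requires_grad=True)  # single trainable parameter
lr = 0.05  # assumed learning rate

history = []
for _ in range(30):
    loss = torch.mean((w * x - y) ** 2)  # MSE between prediction and target
    loss.backward()  # stores d(loss)/dw in w.grad
    with torch.no_grad():
        w -= lr * w.grad  # step against the gradient
    w.grad.zero_()  # reset the accumulated gradient
    history.append(loss.item())

# log
print(f"first loss: {history[0]:.4f}")
print(f"last loss : {history[-1]:.6f}")
print(f"learned w : {w.item():.4f}")
```

The loss shrinks as `w` approaches the value that generated the data, which is the sense in which the loss "guides" training.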


## <a id='toc3_1_'></a>[Built-in Losses](#toc0_)

- PyTorch provides a variety of built-in **loss functions** to simplify training and evaluation, covering classification, regression, and other tasks.

📝 **Docs**:

- Loss Functions: [docs.pytorch.org/docs/stable/nn.html#loss-functions](https://docs.pytorch.org/docs/stable/nn.html#loss-functions)


### <a id='toc3_1_1_'></a>[Regression tasks](#toc0_)

1. [Mean Absolute Error](https://docs.pytorch.org/docs/stable/generated/torch.nn.L1Loss.html) (`torch.nn.L1Loss`)
    - Measures the mean absolute error (MAE) between each element in the input and target tensors (aka L1 norm)
    - Robust to outliers BUT its gradient has the same magnitude for small and large errors (and is undefined at zero error), which can slow or destabilize convergence near the minimum
    - **Formula**:
      - $\text{L1Loss} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|$
    - **Notations**:
      - $N$: Number of samples
      - $\hat{y}_i$: Predicted value for the $i_{th}$ sample
      - $y_i$: True value for the $i_{th}$ sample

1. [Mean Squared Error](https://docs.pytorch.org/docs/stable/generated/torch.nn.MSELoss.html) (`torch.nn.MSELoss`)
    - Measures the mean squared error (MSE) between each element in the input and target tensors (aka squared L2 norm)
    - Sensitive to outliers because it penalizes large errors quadratically (due to squaring)
    - **Formula**:
      - $\text{MSELoss} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2$
    - **Notations**:
      - $N$: Number of samples
      - $\hat{y}_i$: Predicted value for the $i_{th}$ sample
      - $y_i$: True value for the $i_{th}$ sample

1. [Huber](https://docs.pytorch.org/docs/stable/generated/torch.nn.HuberLoss.html) (`torch.nn.HuberLoss`)
    - Combines the best properties of MAE and MSE losses by being less sensitive to outliers than MSE and more stable than MAE
    - It acts `quadratic` for small errors (similar to MSE) and `linear` for large errors (similar to MAE)
    - **Formula**:
      - $
        \text{HuberLoss} = \frac{1}{N} \sum_{i=1}^{N}
        \begin{cases}
        0.5 (\hat{y}_i - y_i)^2 & \text{if } |\hat{y}_i - y_i| < \delta \\
        \delta \cdot (|\hat{y}_i - y_i| - 0.5 \cdot \delta) & \text{otherwise}
        \end{cases}
        $
    - **Notations**:
      - $N$: Number of samples
      - $\hat{y}_i$: Predicted value for the $i_{th}$ sample
      - $y_i$: True value for the $i_{th}$ sample
      - $\delta$: Threshold parameter

1. [Smooth L1](https://docs.pytorch.org/docs/main/generated/torch.nn.SmoothL1Loss.html) (`torch.nn.SmoothL1Loss`)
    - Closely related to Huber Loss: both are quadratic for small errors and linear for large ones, and with $\beta = \delta$ the Smooth L1 loss equals the Huber loss divided by $\beta$
    - Often used in object detection tasks
    - **Formula**:
      - $
        \text{SmoothL1Loss} = \frac{1}{N} \sum_{i=1}^{N}
        \begin{cases}
        0.5 \cdot \frac{(\hat{y}_i - y_i)^2}{\beta} & \text{if } |\hat{y}_i - y_i| < \beta \\
        |\hat{y}_i - y_i| - 0.5 \cdot \beta & \text{otherwise}
        \end{cases}
        $
    - **Notations**:
      - $N$: Number of samples
      - $\hat{y}_i$: Predicted value for the $i_{th}$ sample
      - $y_i$: True value for the $i_{th}$ sample
      - $\beta$: Threshold parameter


In [None]:
criterion = nn.MSELoss()
loss = criterion(y_pred_reg, y_true_reg)

# log
print(f"loss.item(): {loss.item()}")
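
To make the outlier behavior of the four regression losses concrete, here is a small sketch on hand-picked residuals. The values and the shared `threshold` are assumptions chosen for illustration. It also checks the relation between `SmoothL1Loss` and `HuberLoss` when `beta` equals `delta`.

```python
import torch
from torch import nn

# hypothetical residuals: four small errors and one outlier
y_true = torch.zeros(5)
y_pred = torch.tensor([0.1, -0.2, 0.3, -0.1, 8.0])

threshold = 2.0  # used as delta for Huber and as beta for Smooth L1

l1 = nn.L1Loss()(y_pred, y_true)
mse = nn.MSELoss()(y_pred, y_true)
huber = nn.HuberLoss(delta=threshold)(y_pred, y_true)
smooth_l1 = nn.SmoothL1Loss(beta=threshold)(y_pred, y_true)

# log
print(f"L1Loss       : {l1.item():.4f}")
print(f"MSELoss      : {mse.item():.4f}")
print(f"HuberLoss    : {huber.item():.4f}")
print(f"SmoothL1Loss : {smooth_l1.item():.4f}")
print(f"Huber / beta : {(huber / threshold).item():.4f}")
```

The single outlier dominates `MSELoss` because its error is squared, while the other three losses only grow linearly with it.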

### <a id='toc3_1_2_'></a>[Classification tasks](#toc0_)


#### <a id='toc3_1_2_1_'></a>[Binary Classification](#toc0_)

1. [Binary Cross-Entropy](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) (`torch.nn.BCELoss`)
    - Measures the binary cross-entropy loss between the target and the input probabilities (values in $[0, 1]$, e.g. the output of a `Sigmoid`).
    - Penalizes incorrect predictions more heavily, especially when the predicted probability is far from the actual class.
    - **Formula**:
      - $\text{BCE} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]$
    - **Notations**:
      - $N$: Number of samples
      - $y_i$: True label for the $i_{th}$ sample (0 or 1)
      - $p_i$: Predicted probability for the $i_{th}$ sample

1. [Binary Cross-Entropy with Logits](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html) (`torch.nn.BCEWithLogitsLoss`)
    - Measures the binary cross-entropy loss between the target and the input logits.
    - Combines a `torch.nn.Sigmoid` and `torch.nn.BCELoss` in one single class.
    - More numerically stable than using a plain sigmoid followed by a binary cross-entropy loss.
    - **Formula**:
      - $\text{BCEWithLogits} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\sigma(\hat{y}_i)) + (1 - y_i) \log(1 - \sigma(\hat{y}_i)) \right]$
    - **Notations**:
      - $N$: Number of samples
      - $y_i$: True label for the $i_{th}$ sample (0 or 1)
      - $\hat{y}_i$: Logit (raw model output) for the $i_{th}$ sample
      - $\sigma(\hat{y}_i)$: Sigmoid function applied to $\hat{y}_i$, i.e., $\sigma(\hat{y}_i) = \frac{1}{1 + e^{- \hat{y}_i}}$

1. [Soft Margin](https://docs.pytorch.org/docs/main/generated/torch.nn.SoftMarginLoss.html) (`torch.nn.SoftMarginLoss`)
    - Measures the logistic loss between the target and the input.
    - Expects target values to be either 1 or -1.
    - **Formula**:
      - $\text{SoftMarginLoss} = \frac{1}{N} \sum_{i=1}^{N} \log(1 + \exp(-y_i \hat{y}_i))$
    - **Notations**:
      - $N$: Number of samples
      - $y_i$: True label for the $i_{th}$ sample (1 or -1)
      - $\hat{y}_i$: Logit (raw model output) for the $i_{th}$ sample


In [None]:
criterion = nn.BCEWithLogitsLoss()
loss = criterion(y_pred_cls_bin, y_true_cls_bin)

# log
print(f"loss.item(): {loss.item()}")
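
The relations between the three binary losses can be checked numerically. The sketch below uses hypothetical logits and labels. The last part relies on the documented behavior that `BCELoss` clamps its log outputs to be at least -100, so a saturated sigmoid gives a capped loss instead of the true value.

```python
import torch
from torch import nn

logits = torch.tensor([-2.0, 0.5, 3.0])  # hypothetical raw model outputs
targets = torch.tensor([0.0, 1.0, 1.0])  # labels in {0, 1}

bce_with_logits = nn.BCEWithLogitsLoss()(logits, targets)
bce_on_probs = nn.BCELoss()(torch.sigmoid(logits), targets)

# SoftMarginLoss expects labels in {-1, +1}: map 0 -> -1 and 1 -> +1
soft_margin = nn.SoftMarginLoss()(logits, 2 * targets - 1)

# extreme logit with the wrong label: sigmoid saturates to 1.0 in float32
big_logit = torch.tensor([200.0])
wrong_target = torch.tensor([0.0])
stable = nn.BCEWithLogitsLoss()(big_logit, wrong_target)
clamped = nn.BCELoss()(torch.sigmoid(big_logit), wrong_target)

# log
print(f"BCEWithLogitsLoss       : {bce_with_logits.item():.6f}")
print(f"Sigmoid + BCELoss       : {bce_on_probs.item():.6f}")
print(f"SoftMarginLoss (+-1)    : {soft_margin.item():.6f}")
print(f"extreme, with logits    : {stable.item():.1f}")
print(f"extreme, sigmoid + BCE  : {clamped.item():.1f}")
```

For moderate logits all three agree. For the extreme logit only the logits-based version reports a loss that still scales with the size of the mistake.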

#### <a id='toc3_1_2_2_'></a>[Multiclass Classification](#toc0_)

1. [Negative Log Likelihood](https://docs.pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) (`torch.nn.NLLLoss`)
    - Measures the negative log likelihood loss between the target and the input log-probabilities (`LogSoftmax`).
    - Computing `log(softmax(x))` as two separate steps can lead to numerical instability (overflow and underflow); `LogSoftmax` computes the same quantity in a stable way.
    - **Formula**:
      - $\text{NLLLoss} = -\frac{1}{N} \sum_{i=1}^{N} \hat{y}_{i, y_i}$
    - **Notations**:
      - $N$: Number of samples
      - $\hat{y}_{i, y_i}$: Log-probability of the true class $y_i$ for the $i_{th}$ sample

1. [Cross-Entropy Loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) (`torch.nn.CrossEntropyLoss`)
    - Measures the cross-entropy loss between the target and the input logits.
    - Combines a `torch.nn.LogSoftmax` and `torch.nn.NLLLoss` in one single class.
    - Takes raw logits directly, so `LogSoftmax` and `NLLLoss` do not have to be applied as two separate steps
    - **Formula**:
      - $\text{CrossEntropyLoss} = -\frac{1}{N} \sum_{i=1}^{N} \log\left(\frac{\exp(\hat{y}_{i, y_i})}{\sum_{j=1}^{C} \exp(\hat{y}_{i, j})}\right)$
    - **Notations**:
      - $N$: Number of samples
      - $C$: Number of classes
      - $\hat{y}_{i, y_i}$: Logit (raw model output) of the true class $y_i$ for the $i_{th}$ sample
      - $\hat{y}_{i, j}$: Logit (raw model output) for class $j$ of the $i_{th}$ sample
      - $y_i$: True class for the $i_{th}$ sample

1. [Multi-Label Soft Margin](https://docs.pytorch.org/docs/main/generated/torch.nn.MultiLabelSoftMarginLoss.html) (`torch.nn.MultiLabelSoftMarginLoss`)
    - Measures the multi-label one-versus-all loss based on max-entropy between the target and the input.
    - Useful for multi-label classification tasks where each sample can belong to multiple classes.
    - **Formula**:
      - $\text{MultiLabelSoftMarginLoss} = -\frac{1}{N} \sum_{i=1}^{N} \frac{1}{C} \sum_{j=1}^{C} \left[ y_{i, j} \log(\sigma(\hat{y}_{i, j})) + (1 - y_{i, j}) \log(1 - \sigma(\hat{y}_{i, j})) \right]$
    - **Notations**:
      - $N$: Number of samples
      - $C$: Number of classes
      - $y_{i, j}$: True label for class $j$ of the $i_{th}$ sample (0 or 1)
      - $\hat{y}_{i, j}$: Logit (raw model output) for class $j$ of the $i_{th}$ sample
      - $\sigma(\hat{y}_{i, j})$: Sigmoid function applied to $\hat{y}_{i, j}$, i.e., $\sigma(\hat{y}_{i, j}) = \frac{1}{1 + e^{-\hat{y}_{i, j}}}$


In [None]:
criterion = nn.CrossEntropyLoss()
loss = criterion(y_pred_cls_multi, y_true_cls_multi)

# log
print(f"loss.item(): {loss.item()}")
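
The formulas above can be cross-checked with a short sketch. The batch, the class indices, and the multi-hot targets are made-up values. It compares `CrossEntropyLoss` with `LogSoftmax` followed by `NLLLoss`, rewrites the same quantity with `torch.logsumexp`, and compares `MultiLabelSoftMarginLoss` with an element-wise `BCEWithLogitsLoss`.

```python
import torch
from torch import nn

torch.manual_seed(0)
logits = torch.randn(4, 3)  # hypothetical batch: 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# same value written as in the formula: log-sum-exp minus the true-class logit
true_logit = logits[torch.arange(4), targets]
manual = (torch.logsumexp(logits, dim=1) - true_logit).mean()

# multi-label case: each sample may belong to several classes (multi-hot targets)
multi_hot = torch.tensor([[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
ml_soft_margin = nn.MultiLabelSoftMarginLoss()(logits, multi_hot)
bce_logits = nn.BCEWithLogitsLoss()(logits, multi_hot)

# log
print(f"CrossEntropyLoss         : {ce.item():.6f}")
print(f"LogSoftmax + NLLLoss     : {nll.item():.6f}")
print(f"logsumexp form           : {manual.item():.6f}")
print(f"MultiLabelSoftMarginLoss : {ml_soft_margin.item():.6f}")
print(f"BCEWithLogitsLoss        : {bce_logits.item():.6f}")
```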

### <a id='toc3_1_3_'></a>[Specialized Classification](#toc0_)

1. [Connectionist Temporal Classification](https://docs.pytorch.org/docs/main/generated/torch.nn.CTCLoss.html) (`torch.nn.CTCLoss`)
1. [Poisson Negative Log Likelihood](https://docs.pytorch.org/docs/main/generated/torch.nn.PoissonNLLLoss.html) (`torch.nn.PoissonNLLLoss`)
1. [Gaussian Negative Log Likelihood](https://docs.pytorch.org/docs/main/generated/torch.nn.GaussianNLLLoss.html) (`torch.nn.GaussianNLLLoss`)


### <a id='toc3_1_4_'></a>[Metric Learning / Ranking Losses](#toc0_)

1. [Margin Ranking](https://docs.pytorch.org/docs/main/generated/torch.nn.MarginRankingLoss.html) (`torch.nn.MarginRankingLoss`)
1. [Hinge Embedding](https://docs.pytorch.org/docs/main/generated/torch.nn.HingeEmbeddingLoss.html) (`torch.nn.HingeEmbeddingLoss`)
1. [Cosine Embedding](https://docs.pytorch.org/docs/main/generated/torch.nn.CosineEmbeddingLoss.html) (`torch.nn.CosineEmbeddingLoss`)
1. [Multi Margin](https://docs.pytorch.org/docs/main/generated/torch.nn.MultiMarginLoss.html) (`torch.nn.MultiMarginLoss`)
1. [Triplet Margin](https://docs.pytorch.org/docs/main/generated/torch.nn.TripletMarginLoss.html) (`torch.nn.TripletMarginLoss`)
1. [Triplet Margin With Distance](https://docs.pytorch.org/docs/main/generated/torch.nn.TripletMarginWithDistanceLoss.html) (`torch.nn.TripletMarginWithDistanceLoss`)


### <a id='toc3_1_5_'></a>[Other](#toc0_)

1. [Kullback-Leibler divergence](https://docs.pytorch.org/docs/main/generated/torch.nn.KLDivLoss.html) (`torch.nn.KLDivLoss`)
1. [Multi-Label Margin](https://docs.pytorch.org/docs/main/generated/torch.nn.MultiLabelMarginLoss.html) (`torch.nn.MultiLabelMarginLoss`)


## <a id='toc3_2_'></a>[Custom Losses](#toc0_)

- PyTorch lets you define **custom** loss functions using `torch.nn.Module` or simple **Python functions**.
- To create a custom loss, extend `torch.nn.Module` and implement the `forward` method, or define a function that operates on tensors directly.

📝 **Docs**:

- `nn.Module`: [docs.pytorch.org/docs/stable/generated/torch.nn.Module.html](https://docs.pytorch.org/docs/stable/generated/torch.nn.Module.html)

### <a id='toc3_2_1_'></a>[Example 1: Mean Squared Error [Regression]](#toc0_)

In [None]:
def custom_mse(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    return torch.mean((y_pred - y_true) ** 2)


# compute the loss
loss = custom_mse(y_pred_reg, y_true_reg)

# log
print(f"loss: {loss}")

In [None]:
class CustomMSE(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        loss = torch.mean((y_pred - y_true) ** 2)
        return loss


# compute the loss
criterion = CustomMSE()
loss = criterion(y_pred_reg, y_true_reg)

# log
print(f"loss: {loss}")
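
A quick sanity check for a custom loss is to compare it against the built-in equivalent and confirm that autograd produces the expected gradient. The sketch below does this for the MSE expression used above, on freshly generated random tensors (an assumption, so that the block runs on its own).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
y_true = torch.randn(10, 1)
y_pred = torch.randn(10, 1, requires_grad=True)

custom = torch.mean((y_pred - y_true) ** 2)  # same expression as the custom MSE
builtin = F.mse_loss(y_pred, y_true)

# autograd works because the loss is built from differentiable tensor operations
custom.backward()

# analytic gradient of the mean squared error: 2 * (y_pred - y_true) / N
expected_grad = 2 * (y_pred.detach() - y_true) / y_pred.numel()

# log
print(f"custom  : {custom.item():.6f}")
print(f"builtin : {builtin.item():.6f}")
print(f"max gradient difference: {(y_pred.grad - expected_grad).abs().max().item():.2e}")
```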

### <a id='toc3_2_2_'></a>[Example 2: Cross Entropy Loss [binary classification]](#toc0_)

In [None]:
def custom_binary_cross_entropy(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    # numerically stable computation
    loss = torch.clamp(y_pred, min=0) - y_pred * y_true + torch.log1p(torch.exp(-torch.abs(y_pred)))

    # normal computation
    # y_pred_sigmoid = torch.sigmoid(y_pred)
    # loss = - (y_true * torch.log(y_pred_sigmoid) + (1 - y_true) * torch.log(1 - y_pred_sigmoid))

    return loss.mean()


# compute the loss
loss = custom_binary_cross_entropy(y_pred_cls_bin, y_true_cls_bin)

# log
print(f"loss: {loss}")

In [None]:
class CustomBinaryCrossEntropy(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        # numerically stable computation
        loss = torch.clamp(y_pred, min=0) - y_pred * y_true + torch.log1p(torch.exp(-torch.abs(y_pred)))

        # normal computation
        # y_pred_sigmoid = torch.sigmoid(y_pred)
        # loss = - (y_true * torch.log(y_pred_sigmoid) + (1 - y_true) * torch.log(1 - y_pred_sigmoid))

        return loss.mean()


# compute the loss
criterion = CustomBinaryCrossEntropy()
loss = criterion(y_pred_cls_bin, y_true_cls_bin)

# log
print(f"loss: {loss}")
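
The "numerically stable computation" follows from the plain formula. With logit $x$ and label $y$, $\log \sigma(x) = -\log(1 + e^{-x})$ and $\log(1 - \sigma(x)) = -x - \log(1 + e^{-x})$, so the per-sample loss is $x - x y + \log(1 + e^{-x})$. For $x < 0$, rewriting $\log(1 + e^{-x}) = -x + \log(1 + e^{x})$ avoids a large exponent, and both cases combine into $\max(x, 0) - x y + \log(1 + e^{-|x|})$. The sketch below compares the two forms on hypothetical moderate and extreme logits; the helper names are illustrative.

```python
import torch
import torch.nn.functional as F


def stable_bce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # max(x, 0) - x * y + log(1 + exp(-|x|))
    loss = torch.clamp(logits, min=0) - logits * targets + torch.log1p(torch.exp(-torch.abs(logits)))
    return loss.mean()


def naive_bce(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # sigmoid first, then log: breaks down once the sigmoid saturates to 0 or 1
    p = torch.sigmoid(logits)
    return -(targets * torch.log(p) + (1 - targets) * torch.log(1 - p)).mean()


moderate = torch.tensor([-3.0, 0.0, 2.0])
labels = torch.tensor([0.0, 1.0, 1.0])

extreme = torch.tensor([-200.0, 200.0])  # confidently wrong predictions
extreme_labels = torch.tensor([1.0, 0.0])

stable_moderate = stable_bce(moderate, labels)
naive_moderate = naive_bce(moderate, labels)
stable_extreme = stable_bce(extreme, extreme_labels)
naive_extreme = naive_bce(extreme, extreme_labels)
reference_extreme = F.binary_cross_entropy_with_logits(extreme, extreme_labels)

# log
print(f"moderate logits: stable={stable_moderate.item():.6f}, naive={naive_moderate.item():.6f}")
print(f"extreme logits : stable={stable_extreme.item():.1f}, naive={naive_extreme.item()}")
print(f"built-in on extreme logits: {reference_extreme.item():.1f}")
```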

### <a id='toc3_2_3_'></a>[Example 3: Cross Entropy Loss [multiclass classification]](#toc0_)

In [None]:
def custom_cross_entropy(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    log_probs = F.log_softmax(y_pred, dim=1)
    loss = -log_probs[torch.arange(y_true.shape[0]), y_true]
    return loss.mean()


# compute the loss
loss = custom_cross_entropy(y_pred_cls_multi, y_true_cls_multi)

# log
print(f"loss: {loss}")

In [None]:
class CustomCrossEntropy(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
        log_probs = F.log_softmax(y_pred, dim=1)  # apply log softmax for numerical stability
        loss = -log_probs[torch.arange(y_true.shape[0]), y_true]
        return loss.mean()


# compute the loss
criterion = CustomCrossEntropy()
loss = criterion(y_pred_cls_multi, y_true_cls_multi)

# log
print(f"loss: {loss}")
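
The indexing step `log_probs[torch.arange(N), y_true]` picks, for each row, the log-probability of that row's true class. The sketch below walks through it on two hand-picked samples whose softmax is easy to verify by hand (the logits are an assumption for illustration), and shows `gather` as an equivalent way to make the same selection.

```python
import math

import torch
import torch.nn.functional as F

# logits chosen so that softmax gives row 0 -> [0.50, 0.50] and row 1 -> [0.25, 0.75]
logits = torch.tensor([[0.0, 0.0], [0.0, math.log(3.0)]])
targets = torch.tensor([0, 1])

log_probs = F.log_softmax(logits, dim=1)
rows = torch.arange(targets.shape[0])  # one row index per sample
picked = log_probs[rows, targets]  # log_probs[0, 0] and log_probs[1, 1]
loss = -picked.mean()

# equivalent selection with gather along the class dimension
picked_gather = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)

# log
print(f"picked probabilities: {picked.exp()}")
print(f"custom loss         : {loss.item():.6f}")
print(f"F.cross_entropy     : {F.cross_entropy(logits, targets).item():.6f}")
```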

# <a id='toc4_'></a>[Comparison](#toc0_)

## <a id='toc4_1_'></a>[BCELoss vs MSELoss](#toc0_)

- `BCELoss` is more **sensitive** to confident wrong predictions: as `y_pred` moves toward the wrong label, the BCE term $-\log(1 - p)$ grows **without bound**, whereas the squared error on probabilities never exceeds 1


In [None]:
# we have 3 samples for a binary classification
y_true = torch.tensor([[0], [0], [0]], dtype=torch.float32)

# output of model
output = torch.tensor([[0], [1.09864], [10]], dtype=torch.float32)
y_pred = torch.sigmoid(output)

mse_1 = nn.MSELoss(reduction="none")(y_pred, y_true).squeeze()
mse_2 = nn.MSELoss()(y_pred, y_true)
bce_1 = nn.BCELoss(reduction="none")(y_pred, y_true).squeeze()
bce_2 = nn.BCELoss()(y_pred, y_true)

# log
print(f"y_true: {y_true.squeeze()}")
print(f"y_pred: {y_pred.squeeze()}")
print("-" * 50)
print(f"MSELoss [per sample]: {mse_1}")
print(f"MSELoss             : {mse_2:.5f}")
print(f"BCELoss [per sample]: {bce_1}")
print(f"BCELoss             : {bce_2:.5f}")

In [None]:
# plot
y_true = torch.zeros(size=(100, 1))
y_pred = torch.sigmoid(torch.linspace(-10, +10, 100).reshape(-1, 1))
bce_loss = nn.BCELoss(reduction="none")(y_pred, y_true)
mse_loss = nn.MSELoss(reduction="none")(y_pred, y_true)

plt.plot(y_pred, bce_loss, label="BCELoss")
plt.plot(y_pred, mse_loss, label="MSELoss")
plt.title(f"y_true = {y_true[0, 0]}   |   {y_pred.min().round()} <= y_pred <= {y_pred.max().round()}")
plt.xlabel("y_pred")
plt.ylabel("Loss")
plt.legend()
plt.show()
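
The same difference shows up in the gradients that reach the model. As a hedged sketch, the block below reuses the three logits from the comparison above and uses autograd to compute the gradient of each loss with respect to the logits. The helper `grad_wrt_logits` is a hypothetical name introduced here. Analytically, for a sigmoid output $p$ the BCE gradient is $p - y$, while the squared-error gradient is $2 (p - y) \, p (1 - p)$, which vanishes when the sigmoid saturates.

```python
import torch
from torch import nn

y_true = torch.zeros(3)
logits = torch.tensor([0.0, 1.09864, 10.0])  # same model outputs as in the comparison above


def grad_wrt_logits(criterion: nn.Module) -> torch.Tensor:
    # hypothetical helper: gradient of the summed loss with respect to each logit
    z = logits.clone().requires_grad_(True)
    loss = criterion(torch.sigmoid(z), y_true)
    loss.backward()
    return z.grad


bce_grad = grad_wrt_logits(nn.BCELoss(reduction="sum"))
mse_grad = grad_wrt_logits(nn.MSELoss(reduction="sum"))

# log
print(f"BCELoss gradient: {bce_grad}")
print(f"MSELoss gradient: {mse_grad}")
```

For the most confident wrong prediction (logit 10), the BCE gradient stays close to 1 while the MSE gradient is nearly zero, so MSE on sigmoid outputs gives the weakest correction exactly where the prediction is worst.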