<a href="https://colab.research.google.com/github/kameshcodes/deep-learning-theory/blob/main/Loss_Function.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

$$\textbf{Loss Functions}$$

---
---

# $\textbf{I. Loss functions for Regression Tasks}$

## $\textbf{1. Mean Squared Error (MSE) Loss}$

---

### Use Case: $\text{Regression tasks}$

### Formula:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$


where:

- $n$ = total number of observations,
- $y_i$ = actual value of the $i$-th observation,
- $\hat{y}_i$ = predicted value for the $i$-th observation.

### Key Characteristics
---

- **Penalizes large errors**: Due to squaring the error, larger differences between predicted and actual values are penalized more severely. This makes MSE highly sensitive to outliers.
- **Always non-negative**: Since errors are squared, MSE is never negative, and it is zero only when every prediction matches its target exactly.
- **Units**: The units of MSE are squared units of the target variable, making it harder to interpret directly in terms of the original values.

### Advantages
---

- **Smooth loss function**: Easy to compute derivatives for optimization algorithms like gradient descent.
- **Widely used**: MSE is a standard metric, often used for model evaluation and comparison.
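
To illustrate the smoothness claim, here is a minimal NumPy sketch that compares the analytic gradient of MSE, $\frac{\partial\,\text{MSE}}{\partial \hat{y}_i} = -\frac{2}{n}(y_i - \hat{y}_i)$, with a central finite-difference estimate. The helper names `mse` and `mse_grad` and the step size `h` are illustrative choices, not part of any library.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equal-length arrays."""
    return np.mean((y_true - y_pred) ** 2)

def mse_grad(y_true, y_pred):
    """Analytic gradient of MSE with respect to each prediction."""
    return -2.0 * (y_true - y_pred) / y_true.size

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

analytic = mse_grad(y_true, y_pred)

# Central finite differences as an independent check of the formula
h = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.size):
    step = np.zeros_like(y_pred)
    step[i] = h
    numeric[i] = (mse(y_true, y_pred + step) - mse(y_true, y_pred - step)) / (2 * h)

print("analytic gradient:", analytic)
print("numeric gradient: ", numeric)
```

The gradient is linear in the error, so it shrinks smoothly toward zero as a prediction approaches its target.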

### Disadvantages
---

- **Sensitivity to outliers**: Since large errors are squared, the presence of outliers can disproportionately increase the MSE.
- **Interpretability**: Due to squaring the errors, the result is in squared units, which may not be easy to interpret directly in the context of the original data.
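
The outlier sensitivity can be seen with a small made-up example. The sketch below uses hypothetical data in which every prediction is off by 1, then replaces one target so that a single error becomes 20.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between two equal-length arrays."""
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical data: every prediction is off by exactly 1
y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])
mse_clean = mse(y_true, y_pred)

# Replace the last target with an outlier so that one error becomes 20
y_true_outlier = y_true.copy()
y_true_outlier[-1] = 33.0
mse_outlier = mse(y_true_outlier, y_pred)

print(f"MSE without outlier: {mse_clean}")
print(f"MSE with one outlier: {mse_outlier}")
```

One observation out of five moves the MSE from 1.0 to about 80.8, because its squared error (400) dominates the sum.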


### When to Use MSE
---

- When you want to heavily penalize large errors and are concerned with both the magnitude and the distribution of errors.
- Suitable when there are no significant outliers or outliers are acceptable.


<br>
<br>


### Pytorch Implementation

---

In [None]:
import torch.nn as nn
import torch

y = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])

mse_loss = nn.MSELoss()
mse = mse_loss(y_hat, y)

print(f"Mean Squared Error in PyTorch implementation: {mse.item()}")

Mean Squared Error in PyTorch implementation: 0.375


### Numpy Implementation
---

In [None]:
import numpy as np

y_actual = np.array([3, -0.5, 2, 7])
y_predicted = np.array([2.5, 0.0, 2, 8])

mse = np.mean((y_actual - y_predicted)**2)

print(f"Mean Squared Error in NumPy implementation: {mse}")

Mean Squared Error in NumPy implementation: 0.375


## $\textbf{2. Mean Absolute Error (MAE) Loss}$


### **Use case : Regression tasks**

## Formula:

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} | y_i - \hat{y}_i |$$


- $n$: The number of observations.
- $y_i$: The actual value at instance $i$.
- $\hat{y}_i$: The predicted value at instance $i$.

## Key Characteristics

---
- **Units**: MAE is expressed in the same units as the target variable.
- **Error Magnitude**: Represents the average absolute difference between predictions and actual values.
- **No Direction Bias**: MAE does not indicate whether predictions are over or underestimations.

## Advantages

---
- **Simplicity**: Easy to understand and compute.
- **Robustness to Outliers**: Less sensitive to large errors compared to metrics like Mean Squared Error (MSE).
- **Interpretability**: Provides a direct measure of the average error.

## Disadvantages

---
- **Non-Differentiability**: The absolute value function is not differentiable at zero, which can complicate optimization algorithms like gradient descent.
- **Equal Weight to Errors**: MAE assigns equal weight to all errors, which may not be ideal in all contexts.
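
A common way to handle the kink at zero is to use a subgradient. The sketch below is one possible choice, assuming `np.sign` (which returns 0 for a zero error) as the subgradient; the function name `mae_subgrad` and the predictions are hypothetical.

```python
import numpy as np

def mae_subgrad(y_true, y_pred):
    """A subgradient of MAE with respect to each prediction.

    np.sign returns 0 where the error is exactly zero, which is one valid
    choice at the non-differentiable point.
    """
    return -np.sign(y_true - y_pred) / y_true.size

y_true = np.array([50.0, 60.0, 55.0, 70.0])
y_pred = np.array([47.0, 65.0, 55.0, 30.0])  # hypothetical predictions

g = mae_subgrad(y_true, y_pred)
print(g)
```

The error of 40 in the last position gets the same gradient magnitude (0.25) as the error of 3 in the first, which is the "equal weight" behavior in gradient form.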

## When to Use MAE

---

MAE is a good choice in the following situations:

1. **Interpretability**: If you need a metric that is easy to interpret in the same units as your target variable, MAE is ideal. It directly reflects the average error in the predicted values.

2. **Robustness to Outliers**: Use MAE when you want to reduce the influence of outliers compared to other metrics like Mean Squared Error (MSE). Since MAE does not square the errors, it gives a more balanced view of the overall performance, making it suitable when extreme errors are less important.

3. **Equal Weight to Errors**: If all prediction errors should be treated equally, regardless of their magnitude, MAE is appropriate. Unlike MSE, which gives more weight to larger errors, MAE considers all errors equally.

4. **Non-Smooth Cost Functions**: In cases where you don’t mind non-smooth loss functions (because of the absolute value), and optimization methods can handle it, MAE can be a good fit. Some robust regression models use MAE as their loss function.
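
One way to see the robustness point: the best constant prediction under MSE is the mean of the data, while under MAE it is the median. The sketch below checks this by brute force on hypothetical data with one outlier; the grid of candidate values is an arbitrary illustrative choice.

```python
import numpy as np

data = np.array([10.0, 11.0, 12.0, 13.0, 100.0])  # hypothetical, one outlier

# Evaluate both losses for a grid of constant predictions (step 0.1)
candidates = np.linspace(0.0, 110.0, 1101)
mse_vals = np.array([np.mean((data - c) ** 2) for c in candidates])
mae_vals = np.array([np.mean(np.abs(data - c)) for c in candidates])

best_mse = candidates[np.argmin(mse_vals)]
best_mae = candidates[np.argmin(mae_vals)]

print(f"Best constant under MSE: {best_mse:.1f} (mean = {data.mean():.1f})")
print(f"Best constant under MAE: {best_mae:.1f} (median = {np.median(data):.1f})")
```

The outlier pulls the MSE-optimal value far above most of the data, while the MAE-optimal value stays with the bulk of the observations.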


### When Not to Use MAE

---
- **When Large Errors Matter More**: If you need a metric that penalizes large errors more heavily, Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) would be a better choice.
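- **Optimization with Gradient Descent**: The kink at zero is usually handled with a subgradient, but the gradient of MAE has the same magnitude for every nonzero error, so update steps do not shrink as predictions approach the targets. If this makes convergence difficult, a smooth loss function like MSE might be preferred.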



### Numpy Implementation

---


In [None]:
import numpy as np

actual = np.array([50, 60, 55, 70])
predicted = np.array([47, 65, 53, 68])

mae = np.mean(np.abs(actual - predicted))

print(f"Mean Absolute Error (MAE): {mae}")


Mean Absolute Error (MAE): 3.0


### Pytorch Implementation

---

In [None]:
import torch
import torch.nn as nn



actual = torch.tensor([50, 60, 55, 70], dtype=torch.float32)
predicted = torch.tensor([47, 65, 53, 68], dtype=torch.float32)


mae_loss = nn.L1Loss()


mae = mae_loss(predicted, actual)

print(f"Mean Absolute Error (MAE): {mae.item()}")


Mean Absolute Error (MAE): 3.0


## $\textbf{3. Root Mean Squared Error (RMSE)}$

---

### **Use Case:** Regression tasks

**Formula:**
$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

**Advantages:**
- **Interpretability:** Easier to understand as it is in the same units as the target variable.
- **Standard Deviation Representation:** Measures the typical size of the prediction errors; when the errors average to zero, RMSE equals their standard deviation.

**Disadvantages:**
- **Sensitive to Outliers:** Large residuals disproportionately affect RMSE.
- **Not Robust to Non-Normal Errors:** May not fully capture model performance if residuals are not normally distributed.

**When to Use RMSE:**
- When you need a metric in the same units as the target variable.
- When you want to penalize large errors more heavily and interpret results in the context of the target variable's units.
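
As a rough illustration of how RMSE relates to the other two regression metrics, the sketch below computes MSE, RMSE, and MAE on one small set of values. RMSE is never smaller than MAE, and the gap widens as the errors become more uneven.

```python
import numpy as np

targets = np.array([3.0, -0.5, 2.0, 7.5])
predictions = np.array([2.5, 0.0, 2.1, 7.8])

errors = predictions - targets
mse = np.mean(errors ** 2)       # squared units of the target
rmse = np.sqrt(mse)              # same units as the target
mae = np.mean(np.abs(errors))    # same units as the target

print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"MAE:  {mae:.4f}")
```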

<br>

### Pytorch Implementation

---

In [None]:
import torch.nn as nn
import torch
predictions = torch.tensor([2.5, 0.0, 2.1, 7.8], dtype=torch.float32)
targets = torch.tensor([3.0, -0.5, 2.0, 7.5], dtype=torch.float32)

class RMSELoss(nn.Module):
    def __init__(self):
        super(RMSELoss,self).__init__()

    def forward(self,x,y):
        criterion = nn.MSELoss()
        eps = 1e-9
        loss = torch.sqrt(criterion(x, y) + eps)
        return loss

rmse_loss = RMSELoss()
rmse = rmse_loss(predictions, targets)

print(f"Root Mean Squared Error (RMSE) using PyTorch: {rmse.item()}")

Root Mean Squared Error (RMSE) using PyTorch: 0.3872983753681183


### Numpy Implementation
---

In [None]:
import numpy as np

# Manual calculation using NumPy
predictions_np = np.array([2.5, 0.0, 2.1, 7.8])
targets_np = np.array([3.0, -0.5, 2.0, 7.5])
mse = np.mean((predictions_np - targets_np) ** 2)
rmse_np = np.sqrt(mse)

print(f"Root Mean Squared Error (RMSE) using NumPy: {rmse_np}")

Root Mean Squared Error (RMSE) using NumPy: 0.38729833462074165


## $\textbf{4. Mean Absolute Percentage Error (MAPE) Loss}$

---



### Definition
---
Mean Absolute Percentage Error (MAPE) measures the accuracy of a forecasting method by calculating the average percentage difference between actual values and predicted values.

### Use Case: $\text{Regression and Forecasting}$

### Formula
$$\text{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$

where:
- $n$ = total number of observations,
- $y_i$ = actual value of the $i$-th observation,
- $\hat{y}_i$ = predicted value of the $i$-th observation.

### Key Characteristics

---

- **Percentage-based**: Expresses errors as a percentage of the actual values, making it easier to interpret relative to the scale of the data.
- **Non-negative**: MAPE is always non-negative and gives an average percentage error across all observations.
- **Interpretability**: Results are in percentage terms, which is intuitive and easy to understand.

### Advantages

---

- **Scale-independent**: Since it uses percentages, MAPE is not affected by the scale of the data, allowing for comparisons across different datasets or models.
- **Intuitive**: Provides a clear and easily interpretable measure of forecast accuracy.
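
A quick check of the scale-independence claim, using hypothetical helper functions `mape` and `mae` and an arbitrary scale factor: rescaling both the actual and predicted values leaves MAPE unchanged, while MAE scales with the data.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

def mae(y_true, y_pred):
    """Mean absolute error, in the units of the data."""
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
y_pred = np.array([110.0, 195.0, 290.0, 405.0, 510.0])

scale = 1000.0  # e.g. the same quantity reported in different units

print(f"MAPE original: {mape(y_true, y_pred):.4f}%")
print(f"MAPE rescaled: {mape(scale * y_true, scale * y_pred):.4f}%")
print(f"MAE original:  {mae(y_true, y_pred):.1f}")
print(f"MAE rescaled:  {mae(scale * y_true, scale * y_pred):.1f}")
```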

### Disadvantages

---

- **Sensitive to zero values**: When actual values $y_i$ are close to zero, MAPE can become extremely large or undefined, leading to potential issues with interpretation.
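- **Asymmetric**: Although an overestimation and an underestimation of the same absolute size give the same percentage error for a given actual value, their possible ranges differ. With non-negative predictions, an underestimation can contribute at most 100%, while an overestimation is unbounded, so minimizing MAPE can favor models that systematically under-forecast.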
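
Both issues can be reproduced with a few numbers. The sketch below is a minimal illustration with made-up values; the helper `ape` (per-observation absolute percentage error) is a hypothetical name, and `np.errstate` is used only to silence the division-by-zero warning.

```python
import numpy as np

def ape(y_true, y_pred):
    """Absolute percentage error per observation, in percent."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.abs((y_true - y_pred) / y_true) * 100

actual = 100.0

# Worst possible non-negative under-forecast: predict 0
under = ape(np.array([actual]), np.array([0.0]))[0]

# An over-forecast can exceed 100% without limit
over = ape(np.array([actual]), np.array([500.0]))[0]

# A zero actual value makes the ratio undefined (infinite here)
zero_actual = ape(np.array([0.0]), np.array([5.0]))[0]

print(f"Under-forecast to 0:   {under}%")
print(f"Over-forecast to 500:  {over}%")
print(f"Actual value of zero:  {zero_actual}%")
```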


### When to Use MAPE

---

- When you need a **percentage-based measure** of forecasting accuracy and want to understand errors relative to the size of the actual values.
- Suitable when actual values are **not close to zero** and when interpretability in percentage terms is desired.

### Common Questions

---

- **What happens if actual values are zero or near zero?**  
  MAPE can become very large or undefined if actual values are zero or near zero, potentially distorting the error measure.

- **How does MAPE handle overestimation and underestimation?**  
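  For a given actual value, MAPE treats overestimations and underestimations of the same absolute size equally, because it measures the absolute percentage error. The possible range is not symmetric, however: with non-negative predictions, an underestimation contributes at most 100%, while an overestimation has no upper bound.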

### Real-World Example

---

Suppose you are forecasting monthly sales and the actual sales for a month are $\$500$, and your model predicts $\$450$. The absolute percentage error for this prediction would be:

$$\left| \frac{500 - 450}{500} \right| \times 100\% = 10\%$$

If you have multiple such forecasts, averaging these errors gives you the MAPE for your model's performance.


### Numpy Implementation

In [None]:
import numpy as np

# Direct MAPE calculation
def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Example data
y_true = np.array([100, 200, 300, 400, 500])
y_pred = np.array([110, 195, 290, 405, 510])

# Calculate MAPE
mape_value = mape(y_true, y_pred)
print(f'MAPE: {mape_value}%')


MAPE: 3.816666666666667%


### Pytorch (torchmetrics) Implementation

In [None]:
!pip install torchmetrics -q


In [None]:
from torch import tensor
from torchmetrics.regression import MeanAbsolutePercentageError
target = tensor([100, 200, 300, 400, 500])
preds = tensor([110, 195, 290, 405, 510])
mean_abs_percentage_error = MeanAbsolutePercentageError()
mape = mean_abs_percentage_error(preds, target)*100

print(f'MAPE: {mape}%')

MAPE: 3.816666841506958%


# $\textbf{II. Loss functions for Classification Tasks}$

## $\textbf{1. Binary Cross Entropy (BCE) Loss}$


### **Use Case:** Binary Classification tasks

**Formula:**
$$\text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

where:
- $n$ = total number of observations
- $y_i$ = actual binary label for the i-th observation
- $\hat{y}_i$ = predicted probability for the i-th observation (between 0 and 1)

### Advantages

---
- **Directly interpretable in probability space:** The output values are probabilities, making the interpretation straightforward for classification tasks.
- **Logarithmic penalization:** Assigns high penalty to incorrect confident predictions (i.e., very high or very low predicted probabilities that are incorrect).
- **Sensitivity to confidence:** Penalizes confident but incorrect predictions heavily, which encourages the model to calibrate probabilities accurately.
- **Smooth and differentiable:** This makes it suitable for optimization using gradient-based methods.
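
To make the logarithmic penalization concrete, the sketch below evaluates the per-observation loss $-\log(\hat{y})$ for a true label of 1 at a few illustrative predicted probabilities.

```python
import numpy as np

# Predicted probability of the positive class when the true label is 1
p = np.array([0.99, 0.9, 0.5, 0.1, 0.01])

# For y = 1 the BCE term reduces to -log(p)
loss = -np.log(p)

for prob, value in zip(p, loss):
    print(f"p = {prob:.2f} -> loss = {value:.4f}")
```

The loss grows slowly while the prediction is roughly right and increases sharply as the model becomes confidently wrong.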

### Disadvantages

---

- **Sensitive to imbalanced datasets:** The loss can be dominated by the majority class if the dataset is imbalanced, leading to poor performance on the minority class.
- **Overconfident predictions:** When a confidently wrong prediction approaches 0 or 1, the loss grows without bound ($-\log(\hat{y}) \to \infty$ as $\hat{y} \to 0$), which can lead to unstable gradients during training.

### When to Use BCE

---
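- When performing binary classification and you need the model to output probabilities.
- Suitable when false positives and false negatives matter about equally. Plain BCE weights both kinds of error the same way; if their costs differ, a weighted variant (for example, the `pos_weight` argument of `nn.BCEWithLogitsLoss`) is usually more appropriate.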


### Pytorch Implementation

---

In [None]:
import torch
import torch.nn as nn

y_pred = torch.tensor([0.4498, 0.8845, 0.4576, 0.6450, 0.2304], requires_grad=True)
y_true = torch.tensor([0, 1, 1, 0, 1], dtype=torch.float32)


bce_loss = nn.BCELoss()
bce_with_logits_loss = nn.BCEWithLogitsLoss()

# Both losses take (input, target). BCELoss reads y_pred as probabilities.
loss_1 = bce_loss(y_pred, y_true)
print(f'Binary Cross Entropy Loss: {loss_1.item():.4f}')

# BCEWithLogitsLoss reads the same numbers as raw logits and applies a
# sigmoid internally, so the two values are not expected to match.
loss_2 = bce_with_logits_loss(y_pred, y_true)
print(f'Binary Cross Entropy with Logits Loss: {loss_2.item():.4f}')

Binary Cross Entropy Loss: 0.8011
Binary Cross Entropy with Logits Loss: 0.6861


In [None]:
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-9):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    bce = -( (1/len(y_true)) * np.sum(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)) )
    return bce



y_pred = np.array([0.4498, 0.8845, 0.4576, 0.6450, 0.2304], dtype = np.float32)
y_true = np.array([0, 1, 1, 0, 1], dtype = np.float32)

loss = binary_cross_entropy(y_true, y_pred)
print("Binary Cross Entropy Loss:", loss)

Binary Cross Entropy Loss: 0.8011083602905273


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

sigmoid_layer = nn.Sigmoid()

# Treat these values as raw logits (pre-sigmoid scores)
y_logit = torch.tensor([0.4498, 0.8845, 0.4576, 0.6450, 0.2304])
y_true = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])

# The sigmoid layer maps logits to probabilities
y_prob = sigmoid_layer(y_logit)

# F.binary_cross_entropy takes (input, target), with input as probabilities
output = F.binary_cross_entropy(y_prob, y_true)

output

tensor(0.6861)

In [None]:
y_logit = torch.tensor([0.4498, 0.8845, 0.4576, 0.6450, 0.2304])
y_true = torch.tensor([0.0, 1.0, 1.0, 0.0, 1.0])

# The functional form takes (input, target), with input as raw logits;
# the sigmoid is applied internally
loss = F.binary_cross_entropy_with_logits(y_logit, y_true)
loss

tensor(0.6861)
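
For reference, BCE with logits is mathematically a sigmoid followed by BCE. The NumPy sketch below compares that two-step computation with a commonly used numerically stable rewriting, $\max(x, 0) - x\,y + \log(1 + e^{-|x|})$. The function names are illustrative, and this is not a claim about how any particular library implements the loss. Both functions take the model output first and the target second.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(prob, target):
    """BCE on probabilities: argument order is (prediction, target)."""
    return -np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob))

def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x*y + log(1 + exp(-|x|))."""
    return np.mean(
        np.maximum(logit, 0) - logit * target + np.log1p(np.exp(-np.abs(logit)))
    )

logits = np.array([0.4498, 0.8845, 0.4576, 0.6450, 0.2304])
targets = np.array([0.0, 1.0, 1.0, 0.0, 1.0])

two_step = bce(sigmoid(logits), targets)
fused = bce_with_logits(logits, targets)
print(f"sigmoid + BCE:   {two_step:.4f}")
print(f"BCE with logits: {fused:.4f}")

# The stable form stays finite even for an extreme, confidently wrong logit
extreme = bce_with_logits(np.array([100.0]), np.array([0.0]))
print(f"Loss for logit 100 with target 0: {extreme:.1f}")
```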

---

**Incomplete: working on code understanding.**

---
---

## $\textbf{2. Categorical Cross Entropy Loss (CCE) Loss}$


### **Use Case:** Multi-class Classification tasks  

### **Formula:**  
$$\text{CCE} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$$

where:  
- $n$ = total number of observations  
- $C$ = total number of classes  
- $y_{i,c}$ = binary indicator ($0$ or $1$): $1$ if class $c$ is the true label for observation $i$, else $0$
- $\hat{y}_{i,c}$ = predicted probability for class $(c)$ for observation $(i)$
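
The indicator $y_{i,c}$ is a one-hot encoding of the class label. As a small sketch with made-up labels and probabilities, the code below builds the one-hot matrix with `np.eye` and shows that the double sum in the formula keeps only the log-probability of the true class for each observation.

```python
import numpy as np

labels = np.array([2, 0, 1])   # integer class labels for 3 observations
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(num_classes)[labels]

probs = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
])

# Full double sum from the formula
cce_full = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

# Equivalent shortcut: pick the true-class probability in each row
cce_picked = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(one_hot)
print(f"CCE (double sum): {cce_full:.4f}")
print(f"CCE (true class only): {cce_picked:.4f}")
```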

<br>

---
---
**Note: CCE has characteristics, advantages, and disadvantages very similar to those of BCE.**

---
---

<br>

### **Advantages**  

---

- **Directly Interpretable in Probability Space:** Outputs are probabilities for each class, making it straightforward to interpret and evaluate model performance for multi-class classification tasks.  
- **Logarithmic Penalization:** Assigns a high penalty when the predicted probability for the true class is low; the more confident the incorrect prediction, the more severely the logarithm penalizes it.
- **Sensitivity to Confidence:** Penalizes confidently incorrect predictions, encouraging the model to output accurate class probabilities.  
- **Smooth and Differentiable:** Suitable for gradient-based optimization methods, as the loss function is smooth and differentiable.

### **Disadvantages**  

---
- **Sensitive to Imbalanced Datasets:** If some classes are underrepresented, the loss can be dominated by the majority classes, potentially leading to poor performance on minority classes.  
- **Overconfident Predictions:** Similar to BCE, large errors from very high or very low predicted probabilities can lead to unstable gradients during training.

### **When to Use CCE**  

---

- When performing multi-class classification and you need the model to output probabilities for each class.  
- Suitable when each class has equal significance or when the cost of misclassification is balanced across classes.


### Pytorch Implementation

---

In [None]:
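import torch
import torch.nn as nn

y_true = torch.tensor([2, 0, 1])
# Each row is a probability distribution over the 3 classes
y_prob = torch.tensor([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1]
])

# nn.CrossEntropyLoss expects raw logits and applies log-softmax internally.
# Because each row already sums to 1, log(y_prob) is a valid set of logits
# whose softmax reproduces y_prob.
criterion = nn.CrossEntropyLoss()
loss = criterion(torch.log(y_prob), y_true)
print(f'Categorical Cross Entropy Loss: {loss.item():.4f}')

Categorical Cross Entropy Loss: 0.3122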


In [None]:
import numpy as np

def categorical_cross_entropy(y_true, y_pred, epsilon=1e-15):
    y_pred = np.clip(y_pred, epsilon, 1. - epsilon)
    n_samples = y_true.shape[0]
    cce_loss = -np.sum(y_true * np.log(y_pred)) / n_samples
    return cce_loss

y_true = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 0]
])

y_pred = np.array([
    [0.1, 0.2, 0.7],
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1]
])

loss = categorical_cross_entropy(y_true, y_pred)
print(f'Categorical Cross Entropy Loss: {loss}')


Categorical Cross Entropy Loss: 0.3121644797305582


## $\textbf{3. BCE Vs. CCE}$


| **Aspect**                   | **Binary Cross Entropy (BCE)**                              | **Categorical Cross Entropy (CCE)**                        |
|------------------------------|--------------------------------------------------------------|-------------------------------------------------------------|
| **Use Case**                 | Binary classification tasks with two classes                | Multi-class classification tasks with more than two classes |
| **Formula**                  | $-\frac{1}{n} \sum_{i=1}^{n} [y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)]$ | $-\frac{1}{n} \sum_{i=1}^{n} \sum_{c=1}^{C} y_{i,c} \log(\hat{y}_{i,c})$       |
| **Output Range**             | Probabilities between 0 and 1 (using sigmoid function)      | Probability distribution summing to 1 (using softmax function) |
| **Logarithmic Penalization** | Penalizes high-confidence incorrect predictions heavily      | Penalizes incorrect predictions based on confidence for the true class |
| **Activation Function**      | Sigmoid function                                              | Softmax function                                             |
| **Key Characteristics**      | - Logarithmic penalization <br> - Direct output probability  | - Logarithmic penalization <br> - Output probability distribution |
| **Training Context**         | Outputs a probability for a single class                     | Outputs probabilities for multiple classes                  |
| **Usage**                    | Suitable for binary classification problems                  | Suitable for multi-class classification problems           |
