Cross-entropy loss is commonly used in Convolutional Neural Networks (CNNs) when solving classification problems. Here’s when you should use it:

1. Multi-Class Classification
When your CNN is used for multi-class classification (e.g., classifying images into one of several categories).
The output layer should have a softmax activation function to produce probabilities for each class.
Example: Classifying handwritten digits (0–9) in the MNIST dataset.

2. Binary Classification
When your CNN is used for binary classification (e.g., cat vs. dog).
The output layer should have a sigmoid activation function to output a probability between 0 and 1.
In this case, binary cross-entropy (log loss) is used instead of categorical cross-entropy.
Example: Detecting if an X-ray image shows pneumonia (yes/no).

3. Multi-Label Classification
When an image can belong to multiple categories at the same time.
The output layer should have sigmoid activation for each class instead of softmax.
Example: An image of a scene containing both a dog and a car—both labels should be predicted.
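
The three cases above map onto PyTorch losses roughly as follows. This is a minimal sketch with made-up logits and labels (all tensor names are illustrative). One PyTorch-specific detail: `nn.CrossEntropyLoss` and `nn.BCEWithLogitsLoss` apply softmax and sigmoid internally, so the network should output raw logits rather than activated probabilities when these losses are used.

```python
import torch
import torch.nn as nn

# 1. Multi-class: one label per image, logits of shape (batch, classes).
mc_logits = torch.tensor([[2.0, 0.5, -1.0],
                          [0.1, 0.2, 3.0]])
mc_targets = torch.tensor([0, 2])             # class indices, not one-hot
mc_loss = nn.CrossEntropyLoss()(mc_logits, mc_targets)

# 2. Binary: a single logit per image, float targets in {0, 1}.
bin_logits = torch.tensor([1.5, -0.8])
bin_targets = torch.tensor([1.0, 0.0])
bin_loss = nn.BCEWithLogitsLoss()(bin_logits, bin_targets)

# 3. Multi-label: one independent logit per class, multi-hot float targets.
ml_logits = torch.tensor([[2.0, -1.0, 0.5],
                          [-0.5, 1.0, 1.5]])
ml_targets = torch.tensor([[1.0, 0.0, 1.0],   # e.g. first and third labels present
                           [0.0, 1.0, 1.0]])
ml_loss = nn.BCEWithLogitsLoss()(ml_logits, ml_targets)

print(mc_loss.item(), bin_loss.item(), ml_loss.item())
```

The only difference between the binary and multi-label cases is the shape of the logits and targets; the loss is the same.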

### Mathematical Formulation

- Categorical Cross-Entropy (Multi-Class)

$$
L = -\sum_{i=1}^{N} y_i \log(\hat{y}_i)
$$

where $N$ is the number of classes, $y_i$ is the true label for class $i$ (one-hot encoded), and $\hat{y}_i$ is the predicted probability for class $i$.

- Binary Cross-Entropy (Binary & Multi-Label)

$$
L = - \frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

where:
- $N$ is the number of samples (for multi-label, the number of label entries).
- $y_i$ is the true label (0 or 1).
- $\hat{y}_i$ is the predicted probability that $y_i = 1$.
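
Both formulas can be evaluated by hand on a few numbers. The sketch below uses made-up probabilities and cross-checks the binary case against `torch.nn.functional.binary_cross_entropy`, which takes probabilities rather than logits.

```python
import torch
import torch.nn.functional as F

# Categorical cross-entropy for one sample with 3 classes.
y_onehot = torch.tensor([0.0, 0.0, 1.0])      # true class is index 2
y_hat = torch.tensor([0.1, 0.2, 0.7])         # predicted probabilities (sum to 1)
cce = -(y_onehot * torch.log(y_hat)).sum()    # only the true class contributes: -log(0.7)

# Binary cross-entropy averaged over 3 samples.
y = torch.tensor([1.0, 0.0, 1.0])
p = torch.tensor([0.9, 0.1, 0.8])             # predicted P(y = 1)
bce = -(y * torch.log(p) + (1 - y) * torch.log(1 - p)).mean()

# Cross-check against the built-in implementation.
bce_builtin = F.binary_cross_entropy(p, y)
print(cce.item(), bce.item(), bce_builtin.item())
```

Because the label vector is one-hot, the categorical sum collapses to the negative log-probability of the true class.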

### Why Use the Logarithm?
1. Probability-Based Interpretation
In classification, we model the likelihood of the correct class. The likelihood of a whole dataset is a product of per-sample probabilities, and multiplying many small probabilities leads to numerical underflow. Taking the logarithm turns this product into a sum, which keeps the computation numerically stable.

2. Penalizing Confident Wrong Predictions More Heavily

If the true label is \( y = 1 \) and the predicted probability \( \hat{y} \) is close to 0, the loss becomes very large:

$$
L = -\log(\hat{y}) \rightarrow \text{large penalty when } \hat{y} \approx 0
$$

- Conversely, if \( y = 0 \) and \( \hat{y} \approx 1 \), the term \( -\log(1 - \hat{y}) \) is also very large.
- This discourages the model from making confident incorrect predictions.

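3. Convexity in Simple Models
    - For linear models such as logistic regression, cross-entropy is convex in the weights, which makes gradient-based optimization well-behaved. For deep networks such as CNNs the loss is not convex in the parameters, but the logarithm still cancels the saturation of the sigmoid or softmax output, so the gradient stays large when the prediction is confidently wrong. This helps stochastic gradient descent (SGD) make progress.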
4. Connection to Maximum Likelihood Estimation (MLE)
    - The Binary Cross-Entropy loss is equivalent to the negative log-likelihood of a Bernoulli-distributed variable.
    - Minimizing BCE is equivalent to maximizing the likelihood of correct classifications.
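
The MLE connection can be written out explicitly. Assuming the labels are independent and each $y_i$ is Bernoulli-distributed with parameter $\hat{y}_i$ (the model's predicted probability), the likelihood of the observed labels is:

$$
P(y_1, \dots, y_N) = \prod_{i=1}^{N} \hat{y}_i^{\,y_i} (1 - \hat{y}_i)^{1 - y_i}
$$

Taking the negative logarithm and dividing by $N$ gives:

$$
-\frac{1}{N} \log P(y_1, \dots, y_N) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]
$$

The right-hand side is the BCE formula, so minimizing BCE and maximizing the likelihood select the same parameters.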

In short, Binary Cross-Entropy (BCE) contains a logarithm because it is the negative log-likelihood of a probabilistic model. As a result:
- It penalizes confident incorrect predictions heavily.
- It is differentiable, and convex for linear models such as logistic regression.
- It aligns with the principle of Maximum Likelihood Estimation (MLE).
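
Two of the points above, underflow and the growth of the penalty, are easy to check numerically. The sketch below uses only the standard library; the probabilities are made-up values chosen to make the effect visible.

```python
import math

# 1. Underflow: likelihood of 1,000 samples, each predicted with probability 0.01.
probs = [0.01] * 1000
product = 1.0
for p in probs:
    product *= p                              # underflows to exactly 0.0 in float64
log_sum = sum(math.log(p) for p in probs)     # the log-likelihood stays finite

print(product)   # 0.0
print(log_sum)   # about -4605.17

# 2. Penalty: -log(y_hat) for a positive example (y = 1) as the prediction gets worse.
penalties = {y_hat: -math.log(y_hat) for y_hat in [0.9, 0.5, 0.1, 0.01, 0.001]}
for y_hat, loss in penalties.items():
    print(f"y_hat={y_hat:<6} loss={loss:.3f}")
```

The penalty grows without bound as `y_hat` approaches 0, which is why a single confidently wrong prediction can dominate the batch loss.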

### When Not to Use Cross-Entropy
- Regression Problems: Use Mean Squared Error (MSE) or Mean Absolute Error (MAE) instead.
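- Unsupervised Learning: Cross-entropy needs a target label or target distribution, so it is not directly applicable to clustering. Autoencoders usually train with a reconstruction loss such as MSE, although binary cross-entropy is sometimes used when the inputs are scaled to [0, 1].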



# Alternative Loss Functions to Binary Cross-Entropy (BCE) in PyTorch

## 1. Mean Squared Error (MSE) Loss
**Formula**:
$$
L = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2
$$

**When to use**:
- Primarily used for regression; occasionally applied to classification, though cross-entropy usually trains better there.
- Sensitive to outliers, because errors are squared.

**PyTorch Example**:
```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 0.0, 1.0])
y_pred = torch.tensor([0.9, 0.1, 0.8])

mse_loss = nn.MSELoss()
loss = mse_loss(y_pred, y_true)
print("MSE Loss:", loss.item())
```

---

## 2. Hinge Loss
**Formula**:
$$
L = \sum_{i=1}^{N} \max(0, 1 - y_i \hat{y}_i)
$$

**When to use**:
- Used in SVMs and margin-based classification.
- Requires labels in {-1, +1} format and raw (unbounded) scores $\hat{y}_i$ rather than probabilities.

**PyTorch Example**:
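```python
import torch

y_true = torch.tensor([1, -1, 1], dtype=torch.float32)        # labels in {-1, +1}
y_pred = torch.tensor([0.8, -0.5, 0.6], dtype=torch.float32)  # raw scores

# Hinge loss max(0, 1 - y * y_hat), averaged over the batch.
# nn.HingeEmbeddingLoss is a different loss: it returns x where y = 1 and
# max(0, margin - x) where y = -1, so it does not match the formula above.
loss = torch.clamp(1 - y_true * y_pred, min=0).mean()
print("Hinge Loss:", loss.item())
```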

---

## 3. Focal Loss
**Formula**:
$$
L = - \alpha (1 - \hat{y})^\gamma y \log(\hat{y}) - (1 - \alpha) \hat{y}^\gamma (1 - y) \log(1 - \hat{y})
$$

**When to use**:
- For handling imbalanced datasets by down-weighting easy examples.

**PyTorch Example**:
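```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, logits, targets):
        # Per-element BCE, computed from raw logits (not probabilities).
        bce_loss = F.binary_cross_entropy_with_logits(logits, targets, reduction='none')
        probas = torch.sigmoid(logits)
        # p_t is the probability assigned to the true class of each example.
        p_t = probas * targets + (1 - probas) * (1 - targets)
        # alpha for positives, (1 - alpha) for negatives, as in the formula.
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        loss = alpha_t * (1 - p_t) ** self.gamma * bce_loss
        return loss.mean()

y_true = torch.tensor([1.0, 0.0, 1.0])
y_pred = torch.tensor([0.9, 0.1, 0.8])   # probabilities
logits = torch.logit(y_pred)             # convert to logits, which the loss expects

focal_loss = FocalLoss()
loss = focal_loss(logits, y_true)
print("Focal Loss:", loss.item())
```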

---

## 4. Log-Cosh Loss
**Formula**:
$$
L = \sum_{i=1}^{N} \log(\cosh(y_i - \hat{y}_i))
$$

**When to use**:
- A smooth alternative to MSE that is less sensitive to outliers: it behaves like $\frac{1}{2}x^2$ for small errors and like $|x| - \log 2$ for large ones.

**PyTorch Example**:
```python
class LogCoshLoss(nn.Module):
    def forward(self, inputs, targets):
        return torch.mean(torch.log(torch.cosh(inputs - targets)))

y_true = torch.tensor([1.0, 0.0, 1.0])
y_pred = torch.tensor([0.9, 0.1, 0.8])

log_cosh_loss = LogCoshLoss()
loss = log_cosh_loss(y_pred, y_true)
print("Log-Cosh Loss:", loss.item())
```
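
One caveat with the direct implementation: `torch.cosh` overflows in float32 once the absolute error exceeds roughly 89, which turns the loss into `inf`. A common workaround, sketched here as `StableLogCoshLoss` (a hypothetical class name, not part of PyTorch), uses the identity $\log\cosh(x) = x + \text{softplus}(-2x) - \log 2$.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class StableLogCoshLoss(nn.Module):
    """Hypothetical overflow-safe variant of the log-cosh loss."""
    def forward(self, inputs, targets):
        diff = inputs - targets
        # log(cosh(x)) = x + softplus(-2x) - log(2); cosh itself is never evaluated.
        return torch.mean(diff + F.softplus(-2.0 * diff) - math.log(2.0))

y_true = torch.tensor([1.0, 0.0, 1.0])
y_pred = torch.tensor([0.9, 0.1, 0.8])
small = StableLogCoshLoss()(y_pred, y_true)

# With a large error the direct formula overflows, the rewritten one does not.
big_pred = torch.tensor([100.0])
big_true = torch.tensor([0.0])
naive = torch.log(torch.cosh(big_pred - big_true)).mean()
stable = StableLogCoshLoss()(big_pred, big_true)
print(small.item(), naive.item(), stable.item())
```

For small errors both versions agree up to floating-point rounding, so the rewrite only matters when large residuals are possible, for example early in training or with unnormalized targets.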

---

## 5. Kullback-Leibler (KL) Divergence Loss
**Formula**:
$$
L = \sum_{i=1}^{N} y_i \log \frac{y_i}{\hat{y}_i}
$$

**When to use**:
- For comparing probability distributions.

**PyTorch Example**:
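```python
import torch
import torch.nn as nn

# One distribution per row, so the shape is (batch, classes) = (1, 3).
y_true = torch.tensor([[0.8, 0.1, 0.1]])
y_pred = torch.tensor([[0.7, 0.2, 0.1]])

# nn.KLDivLoss expects log-probabilities as input and probabilities as target.
# "batchmean" divides the summed divergence by the size of the first dimension.
kl_loss = nn.KLDivLoss(reduction="batchmean")
loss = kl_loss(torch.log(y_pred), y_true)
print("KL Divergence Loss:", loss.item())
```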


```python
import torch
import torch.nn as nn

y_true = torch.tensor([1.0, 0.0, 1.0], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.1, 0.8], dtype=torch.float32)

mse_loss = nn.MSELoss()
loss = mse_loss(y_pred, y_true)
print("MSE Loss:", loss.item())

#----------------------------------------------------------#
# Requires labels in {-1, +1} format.
# Note: nn.HingeEmbeddingLoss returns x where y = 1 and max(0, margin - x)
# where y = -1, so the printed value is not the SVM hinge max(0, 1 - y * x).
y_true = torch.tensor([1, -1, 1], dtype=torch.float32)
y_pred = torch.tensor([0.8, -0.5, 0.6], dtype=torch.float32)

hinge_loss = nn.HingeEmbeddingLoss()
loss = hinge_loss(y_pred, y_true)
print("Hinge Loss:", loss.item())

#----------------------------------------------------------#
# Note: this is a simplified variant of focal loss. y_pred holds probabilities
# but is treated as logits, and the weighting does not depend on the target.

class FocalLoss(nn.Module):
    def __init__(self, alpha=0.25, gamma=2.0):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma

    def forward(self, inputs, targets):
        bce_loss = nn.functional.binary_cross_entropy_with_logits(inputs, targets, reduction='none')
        probas = torch.sigmoid(inputs)
        loss = self.alpha * (1 - probas) ** self.gamma * bce_loss
        return loss.mean()

y_true = torch.tensor([1.0, 0.0, 1.0])
y_pred = torch.tensor([0.9, 0.1, 0.8])

focal_loss = FocalLoss()
loss = focal_loss(y_pred, y_true)
print("Focal Loss:", loss.item())

#----------------------------------------------------------#

class LogCoshLoss(nn.Module):
    def forward(self, inputs, targets):
        return torch.mean(torch.log(torch.cosh(inputs - targets)))

y_true = torch.tensor([1.0, 0.0, 1.0])
y_pred = torch.tensor([0.9, 0.1, 0.8])

log_cosh_loss = LogCoshLoss()
loss = log_cosh_loss(y_pred, y_true)
print("Log-Cosh Loss:", loss.item())

#----------------------------------------------------------#
# Note: with 1-D tensors, "batchmean" divides by 3 (the first dimension), so
# the printed value is one third of the KL divergence between the distributions.

y_true = torch.tensor([0.8, 0.1, 0.1])
y_pred = torch.tensor([0.7, 0.2, 0.1])

kl_loss = nn.KLDivLoss(reduction="batchmean")
loss = kl_loss(torch.log(y_pred), y_true)
print("KL Divergence Loss:", loss.item())

#----------------------------------------------------------#
```

Output:

```
MSE Loss: 0.020000001415610313
Hinge Loss: 0.9666666984558105
Focal Loss: 0.019345112144947052
Log-Cosh Loss: 0.009950477629899979
KL Divergence Loss: 0.012503479607403278
```
