# Loss Functions and Output Transformations in Deep Networks

## Overview
- Deep networks can approximate any continuous function $\mathbb{R}^n \rightarrow \mathbb{R}^m$ (on a bounded input domain, given enough capacity)
- However, many tasks need discrete or constrained outputs (e.g., class labels, positive values)
- We use:
  - **Output transformations** (for inference)
  - **Loss functions** (for training)

---

<br>

<img src="./images/ortldr.png" width="500" style="display: block; margin: auto;">

<br>

## Output Transformations (for Inference)
- **Each output transformation is paired with a specific loss function**
- Applied at inference time, **after** the model is trained
- Convert raw network outputs into usable values
- Examples:
  - Ensure non-negative outputs using ReLU, or strictly positive outputs using Softplus
  - Map outputs to class probabilities using sigmoid or softmax
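
As a minimal sketch of the transformations listed above, assuming a hypothetical vector of raw outputs `raw` from an already-trained model (the values are chosen purely for illustration):

```python
import torch
import torch.nn.functional as F

# Hypothetical raw outputs of a trained model (illustrative values only).
raw = torch.tensor([-2.0, 0.0, 3.0])

positive = F.softplus(raw)             # strictly positive values
non_negative = F.relu(raw)             # non-negative values (zeros are possible)
binary_probs = torch.sigmoid(raw)      # each entry mapped independently into (0, 1)
class_probs = F.softmax(raw, dim=0)    # entries are non-negative and sum to 1
predicted_class = class_probs.argmax().item()  # index of the most likely class
```

Which transformation fits depends on the task: sigmoid treats every output as its own yes/no probability, while softmax treats the whole vector as one distribution over classes.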

## Loss Functions (for Training)
- Work with raw outputs (logits)
- Must **always** provide **useful gradients** to guide learning
- Represented as:
  - Lowercase $l$: loss for individual data points
  - Uppercase $L$: total loss over dataset
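
The distinction between $l$ and $L$ can be sketched with PyTorch's `reduction` argument. The tensors below are made-up examples, and taking $L$ as the mean (rather than the sum) of the per-point losses is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

# Made-up predictions and targets for three data points.
pred = torch.tensor([2.5, 0.0, 1.0])
target = torch.tensor([3.0, -1.0, 1.0])

# Lowercase l: one squared-error value per data point.
per_point = F.mse_loss(pred, target, reduction="none")

# Uppercase L: a single scalar over all points (here the mean).
total = per_point.mean()
```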

---

## Goal of Training

<br>

<img src="./images/loss.png" width="500" style="display: block; margin: auto;">

<br>

- Minimize **expected loss**:
  $$
  \mathbb{E}_{(x, y) \sim \mathcal{D}}[l(f_\theta(x), y)]
  $$
- Good loss functions:
  - Provide large gradients when the model is wrong
  - Provide small or zero gradients when the model is right
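
The expectation over $\mathcal{D}$ usually cannot be computed exactly, so it is commonly estimated by an average over training samples. As a sketch, assuming $N$ samples $(x_i, y_i)$ drawn independently from $\mathcal{D}$ (the index $i$ and the count $N$ are notation introduced here, and $L$ is taken as the average rather than the sum):

$$
L(\theta) = \frac{1}{N} \sum_{i=1}^{N} l(f_\theta(x_i), y_i) \approx \mathbb{E}_{(x, y) \sim \mathcal{D}}[l(f_\theta(x), y)]
$$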

---

# **Types of Losses**

## Regression Losses

<br>

<img src="./images/regrl.png" width="500" style="display: block; margin: auto;">

<br>

### L1 Loss (Absolute Error)
- $l = |f_\theta(x) - y|$
- Also called "Manhattan" or "taxicab" distance

### L2 Loss (Squared Error)
- $l = (f_\theta(x) - y)^2$
- Corresponds to the **squared** Euclidean distance between prediction and target

**Note**: In practice, the choice between L1 and L2 often doesn't matter much; both perform well, although L2 penalizes large errors more heavily than L1.
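
Both regression losses are simple enough to write by hand. The sketch below uses made-up values and compares the hand-written versions with PyTorch's `l1_loss` and `mse_loss`:

```python
import torch
import torch.nn.functional as F

# Made-up predictions f_theta(x) and targets y.
pred = torch.tensor([1.0, 2.0, 5.0])
target = torch.tensor([1.5, 2.0, 1.0])

l1_manual = (pred - target).abs()   # |f(x) - y| per data point
l2_manual = (pred - target) ** 2    # (f(x) - y)^2 per data point

# Built-in equivalents, kept per data point via reduction="none".
l1_builtin = F.l1_loss(pred, target, reduction="none")
l2_builtin = F.mse_loss(pred, target, reduction="none")
```

The third data point has an error of 4, which contributes 4 to the L1 loss but 16 to the L2 loss.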

---

## Binary Classification

<br>

<img src="./images/closs.png" width="500" style="display: block; margin: auto;">

<br>

- Labels: $y \in \{0, 1\}$
- Model outputs a logit $o$ → pass through sigmoid:
  $$
  \sigma(o) = \frac{1}{1 + e^{-o}}
  $$
- Then compute the **negative log likelihood**:
  $$
  l = -[y \cdot \log(\sigma(o)) + (1 - y) \cdot \log(1 - \sigma(o))]
  $$
- This is also known as **binary cross-entropy loss**
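
To make the formula concrete, the sketch below evaluates it term by term on made-up logits and labels, then compares it with PyTorch's `binary_cross_entropy_with_logits`. The hand-written version is for illustration only and is only reliable for moderate logits such as these:

```python
import torch
import torch.nn.functional as F

# Made-up logits o and binary labels y.
logits = torch.tensor([2.0, -1.0, 0.5])
labels = torch.tensor([1.0, 0.0, 1.0])

# Direct transcription of l = -[y log(sigma(o)) + (1 - y) log(1 - sigma(o))].
p = torch.sigmoid(logits)
manual = -(labels * torch.log(p) + (1 - labels) * torch.log(1 - p))

# Built-in version that takes the raw logits directly.
builtin = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
```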

### Why use the log?
- Without the log:
  - Gradient is **flat** when the model is very wrong → no learning
- With log:
  - Gradient is **large** when wrong → faster learning
  - Gradient is **small** when right → stable convergence
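
The effect can be checked with autograd. For a positive example ($y = 1$), the derivative of $-\sigma(o)$ with respect to $o$ is $-\sigma(o)(1 - \sigma(o))$, which vanishes for very negative $o$, while the derivative of $-\log \sigma(o)$ is $\sigma(o) - 1$, which approaches $-1$. A minimal sketch, assuming a logit of $-10$ as the "very wrong" case:

```python
import torch
import torch.nn.functional as F

# A confidently wrong prediction: the true label is 1, but the logit is very negative.
o_plain = torch.tensor(-10.0, requires_grad=True)
o_log = torch.tensor(-10.0, requires_grad=True)

# Without the log: minimize -sigmoid(o), i.e. maximize the likelihood directly.
(-torch.sigmoid(o_plain)).backward()

# With the log: minimize -log(sigmoid(o)).
(-F.logsigmoid(o_log)).backward()

grad_plain = o_plain.grad.item()  # nearly 0: almost no learning signal
grad_log = o_log.grad.item()      # close to -1: strong learning signal
```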

### Numerical Stability

<br>

<img src="./images/binl.png" width="500" style="display: block; margin: auto;">

<br>

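- Very wrong predictions (e.g., a large negative logit when $y = 1$) make $\sigma(o)$ underflow to exactly 0 in floating point
- Then $\log(\sigma(o))$ evaluates to $-\infty$, so the loss becomes infinite and the gradients can become NaN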
- Solution: use PyTorch's built-in **`BCEWithLogitsLoss`**, which combines sigmoid + log safely
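
The failure mode is easy to reproduce. The sketch below assumes float32 tensors and an extreme logit of $-200$ picked for illustration:

```python
import torch
import torch.nn as nn

# Extreme, made-up case: true label 1, but the logit is hugely negative.
logit = torch.tensor([-200.0])
label = torch.tensor([1.0])

# Naive version: sigmoid underflows to 0 in float32, so the log blows up.
naive = -torch.log(torch.sigmoid(logit))

# Built-in version: works on the logit directly and stays finite.
stable = nn.BCEWithLogitsLoss()(logit, label)
```

Here `naive` is infinite, while `stable` is a finite value of roughly 200, which still carries a usable gradient.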

---

## Multi-Class Classification

<br>

<img src="./images/mcc.png" width="500" style="display: block; margin: auto;">

<br>

- Labels: $y \in \{1, 2, ..., C\}$
- Output: vector of logits → pass through softmax:
  $$
  \text{softmax}(o)_i = \frac{e^{o_i}}{\sum_j e^{o_j}}
  $$
- Then use **negative log-likelihood** of the correct class
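
As an illustration of these two steps, the sketch below applies softmax to a made-up logit vector and takes the negative log of the probability assigned to the correct class. Note that the code uses a zero-based class index, whereas the notes number classes from 1:

```python
import torch

# Made-up logits for a 3-class problem.
logits = torch.tensor([2.0, 1.0, 0.1])
label = 0  # zero-based index of the correct class (class 1 in the notes' numbering)

probs = torch.softmax(logits, dim=0)  # class probabilities that sum to 1
nll = -torch.log(probs[label])        # negative log-likelihood of the correct class
```

This hand-written form is fine for moderate logits like these, but it is shown only to mirror the formula.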

### Cross-Entropy Loss

<br>

<img src="./images/mccl.png" width="500" style="display: block; margin: auto;">

<br>

- PyTorch function: `CrossEntropyLoss`
- Takes:
  - Raw logits from the model (not softmaxed)
  - Ground truth labels (as zero-based class indices, i.e. $0$ to $C - 1$)
- Internally applies `log(softmax(logits))` in a numerically stable way
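
A minimal usage sketch, assuming a made-up batch of 2 samples and 3 classes. The `reference` line recomputes the same quantity with `log_softmax` as a sanity check:

```python
import torch
import torch.nn as nn

# Made-up batch: 2 samples, 3 classes, raw logits straight from the model.
logits = torch.tensor([[2.0, 1.0, 0.1],
                       [0.5, 2.5, 0.3]])
targets = torch.tensor([0, 1])  # zero-based class indices (integer dtype)

loss_fn = nn.CrossEntropyLoss()  # default reduction averages over the batch
loss = loss_fn(logits, targets)

# Same quantity by hand: mean negative log-probability of the target classes.
reference = -torch.log_softmax(logits, dim=1)[torch.arange(2), targets].mean()
```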

### Softmax issues
- If logits differ by large amounts (e.g., >100), softmax can output exact zeros
- Taking log of zero → `-inf`, which makes the loss infinite and can turn gradients into NaN
- Solution: use `CrossEntropyLoss`, not manual softmax + log
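
The problem can be reproduced with two logits that differ by 200 (a made-up extreme case, assuming float32 tensors), where the correct class has the smaller logit:

```python
import torch
import torch.nn.functional as F

# Made-up extreme case: the correct class (index 0) has the much smaller logit.
logits = torch.tensor([[0.0, 200.0]])
target = torch.tensor([0])

# Manual softmax + log: the probability of class 0 underflows to exactly 0.
probs = torch.softmax(logits, dim=1)
naive = -torch.log(probs[0, target])

# Built-in cross-entropy on raw logits stays finite.
stable = F.cross_entropy(logits, target)
```

Here `naive` is infinite, while `stable` is a finite value of roughly 200.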

---

## Summary Table

| Task                     | Output Type   | Recommended Loss Function              |
|--------------------------|---------------|----------------------------------------|
| Regression               | Real values   | L1 or L2 loss                          |
| Binary Classification    | 0 or 1        | `BCEWithLogitsLoss` (binary cross-entropy) |
| Multi-Class Classification | 1 to C       | `CrossEntropyLoss`                     |

> Always use built-in PyTorch loss functions for classification tasks.  
> You may hand-code L1 or L2, but avoid manual sigmoid/softmax + log.

