# Output Representations and Transformations in Deep Networks

## Motivation
- Deep networks can approximate any continuous function $ \mathbb{R}^n \rightarrow \mathbb{R}^m $ arbitrarily well on a compact domain

<br>

<img src="./images/recap.png" width="500" style="display: block; margin: auto;">

<br>

- However, tasks often require outputs that are **discrete or restricted to a subset of** $ \mathbb{R}^m $, such as:
  - Classification labels (discrete)
  - Probabilities (values in $ [0, 1] $)
  - Positive-only values
- To handle these cases, we apply **output transformations** to the network output

---

## Notation

<br>

<img src="./images/outputs.png" width="500" style="display: block; margin: auto;">

<br>

- Input: $ \mathbf{x} \in \mathbb{R}^n $
- Deep network: $ f_\theta $
- Raw network output: $ \mathbf{o} = f_\theta(\mathbf{x}) $ (written $ o $ when scalar)
- Output transformation: $ g $
- Overall model: $ \psi = g \circ f_\theta $, i.e. $ \psi(\mathbf{x}) = g(f_\theta(\mathbf{x})) $
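
The composition order can be made concrete with a minimal sketch. The names `compose`, `f_theta`, and `relu` below are illustrative assumptions, and `f_theta` is a fixed affine map standing in for a trained network:

```python
def compose(g, f):
    """Return psi = g o f, i.e. psi(x) = g(f(x)): the network runs first, then g."""
    return lambda x: g(f(x))

def f_theta(x):
    # Hypothetical stand-in for a deep network: a fixed affine map R^2 -> R.
    w, b = [0.5, -1.0], 0.25
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def relu(o):
    # One possible output transformation g.
    return max(0.0, o)

psi = compose(relu, f_theta)

print(f_theta([1.0, 2.0]), psi([1.0, 2.0]))  # raw output o, then transformed output g(o)
```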

---

## Case 1: Regression

<br>

<img src="./images/posr.png" width="500" style="display: block; margin: auto;">

<br>

### Standard Regression
- $ \psi  : \mathbb{R}^n \rightarrow \mathbb{R}$
- Output is a real number
- No transformation needed:
  $$
  \psi(\mathbf{x}) = f_\theta(\mathbf{x})
  $$

### Positive-only Regression
- $ \psi : \mathbb{R}^n \rightarrow \mathbb{R}^+ $
- Goal: output values in $ \mathbb{R}^+ $
- Possible transformations:
  1. **ReLU**: $ g(o) = \max(0, o) $
     - Always $ \ge 0 $ (can be exactly zero)
     - Not differentiable at $ o = 0 $
  2. **Softplus**: $ g(o) = \log(1 + e^o) $
     - Always $ > 0 $
     - Differentiable everywhere

#### Trade-offs:
- ReLU allows exact zero but has zero gradient when $ o < 0 $
- Softplus is smooth and always has a nonzero gradient, but its output is never exactly zero
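
A minimal sketch of both transformations and their gradients, using only the standard library. The function names are illustrative. The softplus uses the equivalent form $ \max(o, 0) + \log(1 + e^{-|o|}) $, which is an added numerical-stability choice (not part of the definition above) that avoids overflow for large $ o $:

```python
import math

def relu(o: float) -> float:
    return max(0.0, o)

def relu_grad(o: float) -> float:
    # Zero for o < 0; at o == 0 the derivative is undefined, 0.0 is a common convention.
    return 1.0 if o > 0 else 0.0

def softplus(o: float) -> float:
    # Equivalent to log(1 + e^o), rearranged so exp never sees a large positive argument.
    return max(o, 0.0) + math.log1p(math.exp(-abs(o)))

def softplus_grad(o: float) -> float:
    # d/do log(1 + e^o) = 1 / (1 + e^{-o}), the sigmoid; branch to avoid overflow.
    if o >= 0:
        return 1.0 / (1.0 + math.exp(-o))
    z = math.exp(o)
    return z / (1.0 + z)

for o in (-5.0, 0.0, 5.0):
    print(o, relu(o), relu_grad(o), softplus(o), softplus_grad(o))
```

For a negative input such as $ o = -5 $, ReLU returns zero with zero gradient, while softplus returns a small positive value with a small positive gradient.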

---

## Case 2: Binary Classification

<br>

<img src="./images/binc.png" width="500" style="display: block; margin: auto;">

<br>

- Goal: $ \psi : \mathbb{R}^n \rightarrow [0, 1] $

### Transformations:
1. **Thresholding** (non-differentiable):
   $$
   \hat{y} =
   \begin{cases}
   1 & \text{if } o > 0 \\
   0 & \text{otherwise}
   \end{cases}
   $$

2. **Sigmoid** (differentiable):
   $$
   \hat{y} = \sigma(o) = \frac{1}{1 + e^{-o}}
   $$

#### Trade-offs:
- Thresholding is simple, but its gradient is zero wherever it is defined, so it gives no training signal
- Sigmoid is differentiable and gives a **probabilistic interpretation**
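
The two transformations can be sketched as follows (function names are illustrative; the branch in `sigmoid` is an added numerical-stability choice). The loop also illustrates that thresholding the logit at 0 gives the same label as thresholding the sigmoid output at 0.5:

```python
import math

def threshold(o: float) -> int:
    # Hard label: 1 if the logit is positive, 0 otherwise.
    return 1 if o > 0 else 0

def sigmoid(o: float) -> float:
    # 1 / (1 + e^{-o}), written so that exp never receives a large positive argument.
    if o >= 0:
        return 1.0 / (1.0 + math.exp(-o))
    z = math.exp(o)
    return z / (1.0 + z)

for o in (-2.0, 0.0, 3.0):
    p = sigmoid(o)
    # Both decision rules agree: o > 0 exactly when sigmoid(o) > 0.5.
    print(o, p, threshold(o), int(p > 0.5))
```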

---

## Case 3: Multi-Class Classification

<br>

<img src="./images/genc.png" width="500" style="display: block; margin: auto;">

<br>

- Goal: $ \psi : \mathbb{R}^n \rightarrow \{1, 2, \dots, C\} $
- Network outputs a **C-dimensional vector**

### Options:
1. **Argmax**:
   - Output: index of the highest logit
   - Not differentiable
   - Ties are broken arbitrarily

2. **One-hot encoding**:
   $$
   \hat{y}_i =
   \begin{cases}
   1 & \text{if } o_i \ge o_j \;\; \forall j \\
   0 & \text{otherwise}
   \end{cases}
   $$
   - Vector with 1 at max index, 0 elsewhere
   - Also non-differentiable
   - Can indicate ties if multiple 1s present

3. **Softmax** (differentiable):
   $$
   g(\mathbf{o})_i = \frac{e^{o_i}}{\sum_{j=1}^{C} e^{o_j}}
   $$
   - Converts logits into a probability distribution
   - Preferred if gradient flow is required
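
A minimal sketch of the three options on a list of logits. The function names are illustrative, indices are 0-based here rather than $ 1, \dots, C $, and subtracting the maximum logit inside `softmax` is an added numerical-stability step that does not change the result:

```python
import math

def argmax(o):
    # Index of the highest logit; Python's max returns the first maximum on ties.
    return max(range(len(o)), key=lambda i: o[i])

def one_hot(o):
    # 1 at every position that attains the maximum, so ties show up as multiple 1s.
    m = max(o)
    return [1 if oi == m else 0 for oi in o]

def softmax(o):
    # e^{o_i} / sum_j e^{o_j}, computed on shifted logits to avoid overflow.
    m = max(o)
    exps = [math.exp(oi - m) for oi in o]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.0, 3.0, 2.0]
print(argmax(logits))   # index of the largest logit
print(one_hot(logits))  # hard 0/1 vector
print(softmax(logits))  # probabilities that sum to 1
```

Softmax preserves the ordering of the logits, so taking the argmax of the probabilities gives the same label as taking the argmax of the logits.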

---

## Best Practices

<br>

<img src="./images/ortldr.png" width="500" style="display: block; margin: auto;">

<br>

- **Do not embed output transformations inside the model**
  - Keep model as raw layers: linear + non-linear operations
  - Output raw **logits** (for classification) or raw **values** (for regression)
  - Apply transformations only during:
    - Inference
    - Loss calculation (matched to task)

### Terminology:
- **Logits**: raw network outputs before softmax/sigmoid
- **Inference**: Apply transformations like argmax or softmax
- **Training**: Use appropriate **loss functions** that internally apply the correct transformations
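
To illustrate what "loss functions that internally apply the correct transformations" can look like, here is a standard-library sketch of two losses computed directly from logits. The function names are hypothetical; frameworks ship their own versions (for example `torch.nn.CrossEntropyLoss` and `torch.nn.BCEWithLogitsLoss` in PyTorch), and this sketch is not their implementation. Folding softmax or sigmoid into the loss allows a log-sum-exp formulation that stays finite for large logits:

```python
import math

def cross_entropy_from_logits(o, target):
    # -log softmax(o)[target] = logsumexp(o) - o[target], with a max shift for stability.
    m = max(o)
    logsumexp = m + math.log(sum(math.exp(oi - m) for oi in o))
    return logsumexp - o[target]

def bce_from_logit(o, y):
    # -[y log s(o) + (1 - y) log(1 - s(o))] with s = sigmoid simplifies to softplus(o) - y * o.
    return max(o, 0.0) - o * y + math.log1p(math.exp(-abs(o)))

# Training-time use: the model returns raw logits, the loss handles the transformation.
print(cross_entropy_from_logits([1.0, 3.0, 2.0], target=1))
print(bce_from_logit(0.5, y=1))
```

Computing `math.log(softmax(o)[target])` in two separate steps would instead hit `log(0)` once a probability underflows, which is one practical reason to keep the transformation out of the model.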

---

## Summary

- Deep networks output real values, but many tasks require specific output forms
- Output transformations allow us to map from raw real-valued outputs to:
  - $ \mathbb{R}^+ $
  - Binary labels
  - Multi-class probabilities or labels
- Choose transformations based on:
  - Whether you need gradients
  - Whether you care about exact 0/1 vs probabilities
  - Whether you're training or doing inference
- **Always apply transformations outside the model**, not embedded within
