## Lecture 3.3: Output Representations

#### Recap: Deep Networks

**Universal Approximation Theorem**

A two-layer deep network with enough hidden units can approximate any continuous function on a compact domain to arbitrary accuracy.

We might not always want continuous (real-valued) outputs
* How can we convert the real value to what we want?

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{o}$

#### Inputs and Outputs of Networks

Input: $\text{x}\ \in\ \mathbb{R}^{n}$

Output: $\text{o}\ =\ f_{\theta}(\text{x})$

* $f_{\theta}$: deep network

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{o}$

Output transformations: $g$

&ensp;&ensp;&ensp;&ensp;$\psi\ =\ g\ \circ\ f_{\theta}$, i.e. $\hat{y}\ =\ \psi(\text{x})\ =\ g(f_{\theta}(\text{x}))$
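As a minimal sketch of this composition, the snippet below builds a tiny Linear → ReLU → Linear network in plain Python and wraps it with an output transformation. The weights, the helper names (`linear`, `relu`, `f_theta`, `psi`), and the choice of `g` are illustrative assumptions, not part of the lecture.

```python
def linear(x, W, b):
    """Affine layer: computes W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    """Element-wise max(v, 0)."""
    return [max(vi, 0.0) for vi in v]

# Hypothetical parameters theta, hand-picked for illustration only.
W1 = [[1.0, -1.0], [0.5, 2.0]]
b1 = [0.0, -1.0]
W2 = [[1.0, -2.0]]
b2 = [0.5]

def f_theta(x):
    """The network itself: returns the raw real-valued output o."""
    h = relu(linear(x, W1, b1))
    return linear(h, W2, b2)[0]

def psi(x, g):
    """Network followed by an output transformation: psi(x) = g(f_theta(x))."""
    return g(f_theta(x))

x = [1.0, 2.0]
o = f_theta(x)                               # raw output, here -6.5
y_hat = psi(x, g=lambda o: max(o, 0.0))      # ReLU used as g, here 0.0
print(o, y_hat)
```

Keeping `f_theta` and `g` as separate functions makes it easy to swap the output transformation without touching the network.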

#### Positive Regression

Positive regression: $\psi\ :\ \mathbb{R}^{n}\ \rightarrow\ \mathbb{R}_{+}$

Option 1: ReLU
* $\hat{y}\ =\ g(\text{o})\ =\ \text{max}(\text{o},\ 0)$

Option 2: Soft ReLU
* $\hat{y}\ =\ g(\text{o})\ =\ \text{log}(1\ +\ e^{\text{o}})$

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{g}\ \rightarrow\ \hat{y}$

where $f_{\theta}$ covers everything from $\text{x}$ through the final Linear layer (producing $\text{o}$), and $\psi$ covers the whole pipeline, including $g$, up to $\hat{y}$
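The two options can be compared with a short sketch. The function names `relu_out` and `softplus` are assumptions chosen here, and the rewritten form of $\text{log}(1 + e^{\text{o}})$ is a standard numerically stable equivalent rather than something specified in the lecture.

```python
import math

def relu_out(o):
    """Option 1: max(o, 0). Exactly 0 for every negative o."""
    return max(o, 0.0)

def softplus(o):
    """Option 2: log(1 + e^o), written as max(o, 0) + log(1 + e^{-|o|})
    so that exp never overflows for large o."""
    return max(o, 0.0) + math.log1p(math.exp(-abs(o)))

for o in (-5.0, 0.0, 5.0):
    print(o, relu_out(o), softplus(o))
```

Note that ReLU maps all negative outputs to exactly zero, whereas Soft ReLU stays strictly positive and approaches ReLU for large $|\text{o}|$.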

#### Regression

Regression: $\psi\ :\ \mathbb{R}^{n}\ \rightarrow\ \mathbb{R}$
* Identity mapping: $g(\text{o})\ =\ \text{o}$

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{g(o) = o}\ \rightarrow\ \hat{y}$


#### Binary Classification

Binary classification: $\psi\ :\ \mathbb{R}^{n}\ \rightarrow\ [0,\ 1]$

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{g}\ \rightarrow\ \hat{y}$


Option 1: Thresholding
* $\hat{y}\ =\ g(\text{o})\ =\ 1\{\text{o}\ >\ 0\}$

Option 2: Logistic Regression
* $\hat{y}\ =\ g(\text{o})\ =\ \sigma(\text{o})\ =\ \frac{1}{1\ +\ e^{-\text{o}}}$
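Both options can be sketched in a few lines. The names `threshold` and `sigmoid` are assumptions, and the two-branch form of the sigmoid is a common way to avoid overflow rather than something stated in the lecture. The loop checks that thresholding $\text{o}$ at $0$ gives the same label as thresholding $\sigma(\text{o})$ at $0.5$.

```python
import math

def threshold(o):
    """Option 1: hard label, 1 if o > 0 else 0."""
    return 1 if o > 0 else 0

def sigmoid(o):
    """Option 2: 1 / (1 + e^{-o}), split by sign so exp never overflows."""
    if o >= 0:
        return 1.0 / (1.0 + math.exp(-o))
    z = math.exp(o)
    return z / (1.0 + z)

for o in (-2.0, 0.0, 3.0):
    p = sigmoid(o)
    # The hard label agrees with comparing the probability against 0.5.
    assert threshold(o) == (1 if p > 0.5 else 0)
    print(o, threshold(o), p)
```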

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \sigma\ \rightarrow\ \hat{y}$


#### General Classification

Multi-class classification: $\psi\ :\ \mathbb{R}^{n}\ \rightarrow\ \{1,\ \dots,\ C\}$

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{g}\ \rightarrow\ \hat{y}$


Option 1: argmax
* $\hat{y}\ =\ \arg\max_{i}\ \text{o}_{i}$

Option 2: one-hot
* $\hat{\text{y}}\ =\ [0,\ ...,\ 1,\ ...,\ 0]^{\top}$
* $\hat{\text{y}}_{i}\ =\ 1\ \text{if}\ \text{o}_{i}\ \ge\ \text{o}_{j}\ \forall j$, and $\hat{\text{y}}_{i}\ =\ 0$ otherwise

Option 3: softmax
* $p(y\ =\ i)\ =\ \text{softmax}(\text{o})_{i}\ =\ \frac{e^{\text{o}_{i}}}{\sum_{j=1}^{C}\ e^{\text{o}_{j}}}$
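The three options can be sketched as follows. The function names are assumptions, class indices are 0-based here (the notes count from $1$), ties are broken by taking the first maximum, and subtracting the maximum before exponentiating is a standard stability trick that does not change the softmax result.

```python
import math

def argmax(o):
    """Option 1: index of the largest output (first one on ties, 0-based)."""
    return max(range(len(o)), key=lambda i: o[i])

def one_hot(o):
    """Option 2: vector with a single 1 at the argmax position."""
    k = argmax(o)
    return [1 if i == k else 0 for i in range(len(o))]

def softmax(o):
    """Option 3: e^{o_i} / sum_j e^{o_j}, shifted by max(o) to avoid overflow."""
    m = max(o)
    exps = [math.exp(oi - m) for oi in o]
    total = sum(exps)
    return [e / total for e in exps]

o = [1.0, 3.0, 0.5]          # hypothetical raw outputs for C = 3 classes
print(argmax(o))             # index of the winning class
print(one_hot(o))            # one-hot encoding of that class
print(softmax(o))            # probabilities that sum to 1
```

Softmax preserves the ordering of the outputs, so the most probable class under softmax is the same class that argmax picks.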

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{Softmax}\ \rightarrow\ \hat{y}$


#### Output Representations in Practice

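Do **not** add output transformations to the model
* Most output transformations are not differentiable (or are hard to differentiate); thresholding, argmax, and one-hot have zero gradient almost everywhere
* The model cannot be trained through them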
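A small numerical check illustrates the problem for thresholding. This is a sketch under stated assumptions: `numeric_grad` is a hypothetical helper that estimates a derivative by central finite differences, and the sigmoid is included only for contrast.

```python
import math

def threshold(o):
    """Hard thresholding: piecewise constant in o."""
    return 1.0 if o > 0 else 0.0

def sigmoid(o):
    """Smooth alternative, used here only for comparison."""
    return 1.0 / (1.0 + math.exp(-o))

def numeric_grad(g, o, eps=1e-4):
    """Central finite-difference estimate of dg/do (hypothetical helper)."""
    return (g(o + eps) - g(o - eps)) / (2 * eps)

for o in (-1.5, 0.7, 2.0):
    # The thresholding gradient is 0 away from o = 0, so no learning
    # signal would flow back into the network through it.
    print(o, numeric_grad(threshold, o), numeric_grad(sigmoid, o))
```

Because the thresholded output does not change under small changes of $\text{o}$, gradient-based training receives no signal through it.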

**Model Output**
* Always output raw values

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{g}\ \rightarrow\ \hat{\text{y}}$

where the model is $f_{\theta}$ (everything from $\text{x}$ through the final Linear layer, producing the raw output $\text{o}$), and $g$ is applied to $\text{o}$ outside the model to obtain $\hat{\text{y}}$

#### Output Representations - TL;DR
* Deep networks always output real values
* Output transformations convert them into what you want
* Train the network *without* output transformations!

