## Lecture 3.4: Loss Functions

#### Recap: Output Transformations

Input: $\text{x}$

Output: $\text{o}$

Output transformation: $g$

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{g}\ \rightarrow\ \hat{\text{y}}$

where $f_{\theta}$ spans the layers up to and including the final Linear layer and produces $\text{o}$, and $\psi$ spans the whole chain, including $g$, that computes $\hat{y}$

Training

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{Loss}\ \leftarrow\ \text{y}$

Inference

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{g}\ \rightarrow\ \hat{\text{y}}$


#### Recap: Loss

Loss function: </br>
&ensp;&ensp;&ensp;&ensp;$l(\theta\ |\ \text{x}_{i},\ \text{y}_{i})$

Expected Loss: </br>
&ensp;&ensp;&ensp;&ensp;$L(\theta\ |\ \mathcal{D})\ =\ \mathbb{E}_{(\text{x},\ \text{y})\ \sim\ \mathcal{D}}[l(\theta\ |\ \text{x},\ \text{y})]$

$x\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{ReLU}\ \rightarrow\ \text{...}\ \rightarrow\ \text{Linear}\ \rightarrow\ \text{Loss}\ \leftarrow\ \text{y}$
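
In practice the expectation over $\mathcal{D}$ is usually estimated by averaging the per-sample loss over a finite set of examples. A minimal sketch of that average follows; the one-parameter model `theta * x`, the helper names, and the toy data are assumptions made up for illustration, not part of the lecture.

```python
def empirical_loss(per_sample_loss, theta, dataset):
    """Average the per-sample loss l(theta | x, y) over a finite dataset."""
    losses = [per_sample_loss(theta, x, y) for x, y in dataset]
    return sum(losses) / len(losses)


def squared_error(theta, x, y):
    # Hypothetical one-parameter model f_theta(x) = theta * x.
    return (y - theta * x) ** 2


# Toy (x, y) pairs, invented for this example.
toy_dataset = [(1.0, 2.0), (2.0, 4.0), (3.0, 5.0)]
mean_loss = empirical_loss(squared_error, theta=2.0, dataset=toy_dataset)
print(mean_loss)
```

Only the last pair is mispredicted (6 instead of 5), so the average is one third of a single squared error of 1.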

#### Regression

Regression: $\psi\ :\ \mathbb{R}^{n}\ \rightarrow\ \mathbb{R}$

L1 Loss: </br>
$l(\theta\ |\ \text{x},\ \text{y})\ =\ ||\text{y}\ -\ \text{o}||_{1}\ =\ ||\text{y}\ -\ f_{\theta}(\text{x})||_{1}$

L2 Loss: </br>
$l(\theta\ |\ \text{x},\ \text{y})\ =\ ||\text{y}\ -\ \text{o}||_{2}^{2}\ =\ ||\text{y}\ -\ f_{\theta}(\text{x})||_{2}^{2}$
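
As a worked example of the two norms, the sketch below evaluates both losses on a small hand-picked target and output (the values are assumptions chosen so the arithmetic is easy to check by hand).

```python
def l1_loss(y, o):
    """Sum of absolute differences, ||y - o||_1."""
    return sum(abs(yi - oi) for yi, oi in zip(y, o))


def l2_loss(y, o):
    """Sum of squared differences, ||y - o||_2^2."""
    return sum((yi - oi) ** 2 for yi, oi in zip(y, o))


y = [1.0, 2.0]  # target
o = [0.0, 4.0]  # model output
print(l1_loss(y, o))  # |1| + |-2| = 3.0
print(l2_loss(y, o))  # 1^2 + (-2)^2 = 5.0
```

Note that `torch.nn.L1Loss` and `torch.nn.MSELoss` average over elements by default instead of summing, so they would report 1.5 and 2.5 for the same inputs.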

#### Binary Classification

Binary classification: $\psi\ :\ \mathbb{R}^{n}\ \rightarrow\ [0,\ 1]$
* labels $y\ \in\ \{0,\ 1\}$

Likelihood estimation
* $p(0)\ =\ 1\ -\ \sigma(f_{\theta}(x))$
* $p(1)\ =\ \sigma(f_{\theta}(x))$

Binary cross entropy (negative log-likelihood) </br>
&ensp;&ensp;&ensp;&ensp;$l(\theta\ |\ \text{x},\ \text{y})\ \ =\ -\text{log}\ p(y)$

&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;&ensp;$=\ -[y\ \text{log}\ p(1)\ +\ (1\ -\ y)\ \text{log}\ p(0)]$
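
The formula can be transcribed directly for a single scalar output $o\ =\ f_{\theta}(x)$. This is a naive sketch for illustration only; the function names are assumptions, and the direct form is not numerically safe for large $|o|$.

```python
import math


def sigmoid(o):
    return 1.0 / (1.0 + math.exp(-o))


def binary_cross_entropy(o, y):
    """Negative log-likelihood of label y in {0, 1} given the raw output o."""
    p1 = sigmoid(o)   # p(1)
    p0 = 1.0 - p1     # p(0)
    return -(y * math.log(p1) + (1 - y) * math.log(p0))


print(binary_cross_entropy(0.0, 1))  # undecided model: log(2)
print(binary_cross_entropy(2.0, 1))  # confident and correct: small loss
print(binary_cross_entropy(2.0, 0))  # confident and wrong: large loss
```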

#### Binary Classification Loss in Practice

Numerical stability
* $\sigma(\text{o})$ underflows to $0$ for strongly negative $o$ (on the order of $-100$ in single precision)
* $\text{log}(\sigma(o))\ =\ \text{log}(0)\ =\ -\infty$, which turns the loss into $\infty$ or NaN !!

Combine log and $\sigma$ </br>
&ensp;&ensp;&ensp;&ensp;$l(\theta\ |\ \text{x},\ \text{y})\ =\ -[y\ \text{log}\ \sigma(o)\ +\ (1\ -\ y)\ \text{log}\ (1\ -\ \sigma(o))]$
* Use BCEWithLogitsLoss !!
* Numerically more stable than Sigmoid + BCELoss
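
One common way to combine the log and $\sigma$ is the algebraically equivalent form $\text{max}(o,\ 0)\ -\ o\,y\ +\ \text{log}(1\ +\ e^{-|o|})$, which never takes the log of a saturated probability. The sketch below compares it with the naive two-step computation in plain Python; it illustrates the idea and is not a claim about how `BCEWithLogitsLoss` is implemented internally. Function names are assumptions.

```python
import math


def naive_bce(o, y):
    # Sigmoid first, log second: breaks once the sigmoid saturates.
    p1 = 1.0 / (1.0 + math.exp(-o))
    return -(y * math.log(p1) + (1 - y) * math.log(1.0 - p1))


def bce_with_logits(o, y):
    # Fused form: max(o, 0) - o*y + log(1 + exp(-|o|)).
    return max(o, 0.0) - o * y + math.log1p(math.exp(-abs(o)))


# sigmoid(40) rounds to exactly 1.0 in double precision, so log(1 - p1) is log(0).
try:
    naive_result = naive_bce(40.0, 0)
except ValueError as err:
    naive_result = f"failed: {err}"

stable_result = bce_with_logits(40.0, 0)
print(naive_result)
print(stable_result)
```

Plain Python raises an exception on `log(0)` where a tensor library would typically return $-\infty$; the failure has the same cause in both cases. For moderate $o$ the two functions agree.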

#### Multi-Class Classification

Multi-class classification: $\psi\ :\ \mathbb{R}^{n}\ \rightarrow\ \{1,\ ...,\ C\}$
* labels $y \in\ \{1,\ ...,\ C\}$

Likelihood estimation </br>
$$
\text{p}\ =\ \text{softmax}(\text{o})\ =\ 
\begin{bmatrix}
p(1) \\
p(2) \\
\vdots \\
p(C)
\end{bmatrix}
$$

Cross entropy (negative log-likelihood) </br>
$l(\theta\ |\ \text{x},\ \text{y})\ =\ -\text{log}\ p(y)$
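
A direct transcription of softmax followed by the negative log-likelihood, as a naive sketch for small outputs. The function names are assumptions, and labels are 0-based here (Python indexing) instead of the $1,\ ...,\ C$ used above.

```python
import math


def softmax(o):
    exps = [math.exp(oi) for oi in o]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(o, y):
    """Negative log-probability of the true class y (0-based index)."""
    return -math.log(softmax(o)[y])


p = softmax([2.0, 1.0, 0.0])
print(p)                                   # probabilities, largest first
print(cross_entropy([0.0, 0.0, 0.0], 1))   # uniform over 3 classes: log(3)
```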

#### Multi-Class Classification Loss in Practice

Numerical stability
* $\text{softmax}(o)_{i}\ \rightarrow\ 0\ \text{for}\ o_{j}\ -\ o_{i}\ >\ 100$
* $\text{log}(\text{softmax}(o)_{i})\ =\ \text{log}(0)\ =\ -\infty$, which turns the loss into $\infty$ or NaN

Combine log and softmax </br>
&ensp;&ensp;&ensp;&ensp;$l(\theta\ |\ \text{x},\ \text{y})\ =\ -\text{log softmax}(\text{o})_{y}$
* Use CrossEntropyLoss !!
* Numerically more stable than applying softmax and log separately
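
A standard way to fuse the log and softmax is the log-sum-exp trick: subtract the largest output before exponentiating, so no exponential overflows and the log never receives $0$. The sketch below shows the idea in plain Python with 0-based labels; the function names are assumptions, and it is not a statement about PyTorch's internal implementation.

```python
import math


def log_softmax(o):
    # log softmax(o)_i = o_i - (m + log(sum_j exp(o_j - m))), with m = max(o).
    m = max(o)
    log_sum = m + math.log(sum(math.exp(oi - m) for oi in o))
    return [oi - log_sum for oi in o]


def cross_entropy_from_logits(o, y):
    """Cross entropy computed directly from raw outputs o (0-based label y)."""
    return -log_softmax(o)[y]


# exp(1000) would overflow in a naive softmax; the shifted form stays finite.
loss = cross_entropy_from_logits([0.0, 1000.0], 0)
print(loss)
```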

#### Loss Functions - TL;DR
* **Regression**: L1 loss `torch.nn.L1Loss`, L2 loss `torch.nn.MSELoss`
* **Binary Classification**: binary cross-entropy loss `torch.nn.BCEWithLogitsLoss`
* **Multi-Class Classification**: cross-entropy loss `torch.nn.CrossEntropyLoss`
* **Always** use PyTorch's built-in loss functions on the raw outputs $\text{o}$ for better numerical stability!
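
The four losses above can be called as follows. This is a minimal sketch assuming PyTorch is installed; the tensors are hand-picked toy values, and `reduction="none"` is used for the classification losses only to expose the per-sample values. Class labels for `CrossEntropyLoss` are 0-based integer indices.

```python
import math
import torch

# Regression: both losses average over elements by default.
pred = torch.tensor([0.0, 4.0])
target = torch.tensor([1.0, 2.0])
l1 = torch.nn.L1Loss()(pred, target)    # mean absolute error: 1.5
l2 = torch.nn.MSELoss()(pred, target)   # mean squared error: 2.5

# Binary classification: pass raw outputs o, not sigmoid(o).
logits = torch.tensor([-200.0, 0.0, 200.0])
labels = torch.tensor([1.0, 0.0, 0.0])
bce = torch.nn.BCEWithLogitsLoss(reduction="none")(logits, labels)

# Multi-class classification: raw outputs of shape (N, C), integer labels of shape (N,).
class_logits = torch.tensor([[0.0, 1000.0], [0.0, 0.0]])
class_labels = torch.tensor([0, 1])
ce = torch.nn.CrossEntropyLoss(reduction="none")(class_logits, class_labels)

print(l1, l2)
print(bce)  # finite even where sigmoid(o) has saturated
print(ce)   # finite even where softmax(o) has saturated
```

At inference time the output transformation $g$ (for example `torch.sigmoid` or an argmax over classes) is applied to the same raw outputs.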