# Empirical Risk Minimization (ERM)

What is Empirical Risk Minimization (ERM)? Empirical risk is the average loss over the $n$ training points: $\frac{1}{n}\sum^{n}_{i=1}\mathcal{L}(\hat{y}^i, y^i)$. ERM picks the predictor that minimizes this average, i.e. the one that best fits the training data under the chosen loss. Loss functions come from different principles. Some are derived from MLE / MAP for a probabilistic model, such as cross-entropy loss. Hinge loss instead comes from the max-margin formulation of the SVM rather than from a likelihood. Squared loss / MSE can be derived both ways: through MLE when we assume the targets have Gaussian noise, and through Ordinary Least Squares, which minimizes the squared residuals directly.
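As a minimal sketch of the definition, the empirical risk is just the mean of a per-example loss. The helper names `empirical_risk` and `squared_loss` and the toy arrays are illustrative assumptions, not part of any library:

```python
import numpy as np

def empirical_risk(loss_fn, y_hat, y):
    """Average per-example loss: (1/n) * sum_i loss_fn(y_hat_i, y_i)."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(loss_fn(y_hat, y)))

def squared_loss(y_hat, y):
    """Per-example squared residual."""
    return (y - y_hat) ** 2

# Hypothetical targets and predictions, chosen only for illustration
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

risk = empirical_risk(squared_loss, y_pred, y_true)
print(risk)  # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
```

Swapping `squared_loss` for another per-example loss changes which predictor ERM prefers, which is why the choice of loss matters.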

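We can train / fit the model on a stricter loss function and, at deployment time, judge the actual predictions with a looser one. For example, a classifier can be fit with a convex surrogate such as hinge or logistic loss, which still penalizes correct predictions that have a small margin, and then be evaluated with the 0-1 loss.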
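One way to picture training on one loss and evaluating on another is the sketch below. It fits a linear scorer by plain gradient descent on the logistic loss and then reports the 0-1 error of the fitted model. The toy data, step size, and iteration count are assumptions chosen for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian blobs with labels in {-1, +1}
n = 200
X = np.vstack([rng.normal(-2.0, 1.0, size=(n // 2, 2)),
               rng.normal(+2.0, 1.0, size=(n // 2, 2))])
y = np.concatenate([-np.ones(n // 2), np.ones(n // 2)])

def logistic_risk(w, X, y):
    """Empirical risk under the logistic loss log(1 + exp(-margin))."""
    margins = y * (X @ w)
    return float(np.mean(np.logaddexp(0.0, -margins)))

def logistic_risk_grad(w, X, y):
    """Gradient of logistic_risk with respect to w."""
    margins = y * (X @ w)
    # d/dm log(1 + exp(-m)) = -1 / (1 + exp(m)), written with tanh for stability
    coeff = -0.5 * (1.0 - np.tanh(margins / 2.0))
    return (X * (coeff * y)[:, None]).mean(axis=0)

# Training: minimize the smooth surrogate loss
w = np.zeros(2)
for _ in range(500):
    w -= 0.1 * logistic_risk_grad(w, X, y)

# Evaluation: fraction of points with non-positive margin (0-1 loss)
zero_one_risk = float(np.mean(y * (X @ w) <= 0))
print(logistic_risk(w, X, y), zero_one_risk)
```

The surrogate is used for fitting because it has useful gradients; the 0-1 loss is only used to score the final predictions.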

In [3]:
import numpy as np
import matplotlib.pyplot as plt
# plotting defaults
plt.rcParams['figure.dpi'] = 300
plt.rcParams['figure.figsize'] = (18, 12)

---
# Convexity

- A function $f$ is **convex** if the line segment connecting any two points on the graph of $f$ lies on or above the graph, and **strictly convex** if the segment lies **strictly** above the graph (except at its two endpoints)
    - Convex: If there is a local min, then it is a **global** min
    - Strictly Convex: If there is a local min, then it is the **unique global** min
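The chord condition can be spot-checked numerically. The sketch below samples pairs of points and looks for a chord that dips below the graph; the function names and sample grids are illustrative assumptions. A sampled check can only find violations, it cannot prove convexity:

```python
import numpy as np

def violates_convexity(f, xs, ts, tol=1e-12):
    """Return True if, on the sampled points, f rises above one of its chords."""
    for a in xs:
        for b in xs:
            for t in ts:
                chord = t * f(a) + (1 - t) * f(b)
                if f(t * a + (1 - t) * b) > chord + tol:
                    return True
    return False

def square(x):
    return x ** 2

def absolute(x):
    return abs(x)

def step(x):
    # Same shape as the 0-1 loss viewed as a function of the margin
    return float(x <= 0)

xs = np.linspace(-3, 3, 25)
ts = np.linspace(0, 1, 11)

print(violates_convexity(square, xs, ts))    # False
print(violates_convexity(absolute, xs, ts))  # False
print(violates_convexity(step, xs, ts))      # True
```

For the step function, the chord from $x=-3$ to $x=3$ has height 0.5 at the midpoint while the function value there is 1, so the chord lies below the graph.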

---
# Classification Losses

For Classification, we use **Prediction Error-based Losses**:
1. Margin-based Loss
    - Margin for predicted score $\hat{y}$ and true class $y \in \{-1, 1\}$: $\gamma = y\hat{y} = yf_w(x)$, where $f_w(x) = w^\top x$ for a linear model such as a Linear SVM
    - The margin is positive exactly when $y$ and $\hat{y}$ have the same sign, i.e. when the prediction is correct, so our objective is to make the margin as large as possible (Refer to [Perceptron](https://jeffchenchengyi.github.io/machine-learning/01-supervised-learning/classification/perceptron.html) and [Linear SVM](https://jeffchenchengyi.github.io/machine-learning/01-supervised-learning/classification/linear-support-vector-classifiers.html) notes)
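A small sketch of the margin for a linear scorer $f_w(x) = w^\top x$. The weights and data points are hypothetical values picked so the arithmetic is easy to follow:

```python
import numpy as np

# Hypothetical weights and inputs for a linear scorer f_w(x) = w^T x
w = np.array([1.0, -2.0])
X = np.array([[3.0, 1.0],   # score  1.0
              [0.0, 1.0],   # score -2.0
              [1.0, 1.0]])  # score -1.0
y = np.array([1, -1, 1])    # true classes in {-1, +1}

scores = X @ w          # predicted scores y_hat
margins = y * scores    # gamma = y * y_hat
correct = margins > 0   # positive margin means the signs agree

print(margins)  # [ 1.  2. -1.]
print(correct)  # [ True  True False]
```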

### Zero-One Loss

$$
l_{0-1} = 1(\text{Margin}\,\gamma \leq 0)
$$

- Non-convex

### Perceptron Loss

$$
l_{\text{Perceptron}} = \max\{-\gamma, 0\}
$$

- Convex
- Non-differentiable @ $\gamma = 0$

### SVM / Hinge Loss

$$
l_{\text{Hinge}} = \max\{1-\gamma, 0\} = (1-\gamma)_+
$$

- Convex, upper bound on 0-1 loss
- Non-differentiable @ $\gamma=1$
- "Margin-error" for $0 < \gamma < 1$ (Prediction is correct, but loss still gives a loss > 0)

### Logistic / Log Loss

$$
l_{\text{Logistic}} = \log(1 + e^{-\gamma})
$$

- Convex and differentiable everywhere
- Loss is never exactly 0; it only approaches 0 as $\gamma \to \infty$
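The margin-based losses in this section can all be written as functions of the margin $\gamma$. The following is a minimal NumPy sketch; the function names and the sample margins are assumptions for illustration:

```python
import numpy as np

def zero_one_loss(margin):
    """1 if the margin is non-positive, else 0."""
    return (np.asarray(margin) <= 0).astype(float)

def perceptron_loss(margin):
    """max(-margin, 0)"""
    return np.maximum(-np.asarray(margin, dtype=float), 0.0)

def hinge_loss(margin):
    """max(1 - margin, 0)"""
    return np.maximum(1.0 - np.asarray(margin, dtype=float), 0.0)

def logistic_loss(margin):
    """log(1 + exp(-margin)), computed stably with logaddexp."""
    return np.logaddexp(0.0, -np.asarray(margin, dtype=float))

margins = np.array([-2.0, 0.0, 0.5, 1.0, 3.0])

print(zero_one_loss(margins))    # [1. 1. 0. 0. 0.]
print(perceptron_loss(margins))  # [2. 0. 0. 0. 0.]
print(hinge_loss(margins))       # [3.  1.  0.5 0.  0. ]
print(logistic_loss(margins))    # strictly positive at every margin
```

At $\gamma = 0.5$ the prediction is correct (0-1 loss of 0) but the hinge loss is still 0.5, which is the margin error described above.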

### Cross-Entropy Loss

$$
\begin{aligned}
H(p, q) &= -\sum_{i=1}^n \mathrm{p}(x_i) \log_b \mathrm{q}(x_i) && \text{for discrete } x \\
H(p, q) &= -\int_{x} \mathrm{p}(x) \log_b \mathrm{q}(x)\,dx && \text{for continuous } x
\end{aligned}
$$

If we build an animal image classifier to predict a red panda:

<img src="https://live.staticflickr.com/8146/29545409156_e6c3547efc_b.jpg" width="500px"/>

| Animal Classes | Predicted Distribution $q$ | True Distribution $p$ |
| :------------: | :------------------------: | :-------------------: |
| Cat            | 0.02                       | 0.00                  |
| Dog            | 0.30                       | 0.00                  |
| Fox            | 0.38                       | 0.00                  |
| Cow            | 0.00                       | 0.00                  |
| Red Panda      | 0.25                       | 1.00                  |
| Bear           | 0.05                       | 0.00                  |
| Dolphin        | 0.00                       | 0.00                  |

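$$H(p, q) = -\sum_{i=1}^n {\mathrm{p}(x_i) \log_b \mathrm{q}(x_i)} = -\log_2{0.25} = 2 \text{ bits}$$

Only the Red Panda term survives because $p$ is zero for every other class. With the natural log instead of base 2, the same quantity is $-\ln{0.25} \approx 1.386$ nats.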
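Because the true distribution $p$ is one-hot, the sum reduces to the log of the predicted probability of the true class. The sketch below evaluates it in base 2 and in base $e$. To keep it short, all classes other than Red Panda are collapsed into a single hypothetical "other" bin, and `cross_entropy` is an illustrative helper rather than a library function:

```python
import numpy as np

def cross_entropy(p, q, base=np.e):
    """H(p, q) = -sum_i p_i * log_base(q_i), skipping terms where p_i == 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p_i == 0 contribute nothing
    return float(-np.sum(p[mask] * np.log(q[mask])) / np.log(base))

# Two bins for illustration: [Red Panda, everything else]
p = [1.0, 0.0]
q = [0.25, 0.75]

h_bits = cross_entropy(p, q, base=2)  # -log2(0.25) = 2.0
h_nats = cross_entropy(p, q)          # -ln(0.25), about 1.386

print(h_bits, h_nats)
```

The two results differ only by the constant factor $\ln 2$, so the choice of base does not change which predicted distribution minimizes the loss.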

---
# Regression Losses

Regression Spaces:
- Input Space $\mathcal{X} = \mathbf{R}^d$
- Action Space $\mathcal{A} = \mathbf{R}$ (the prediction $\hat{y}$)
- Outcome Space $\mathcal{Y} = \mathbf{R}$ (the true value $y$)

Regression Losses usually only depend on residuals ${r = y - \hat{y}}$

For Regression, we normally use **Distance-based Losses**
1. Only depends on the residual: ${l(\hat{y}, y) = \psi(y - \hat{y})\,\text{for some}\,\psi:\mathbf{R} \rightarrow \mathbf{R}}$
2. Loss is zero when residual is 0: $\psi(0) = 0$

- Distance-based losses are *Translation-invariant*:
    - ${ l(\hat{y} + a, y + a ) = l(\hat{y}, y)}$
    
- When would you not want to use a Distance-based loss?
    - When relative error matters more than absolute error
    - e.g. If we're predicting a percentage,
        - We might want to make sure that ${ l(\hat{y}=9\%, y=10\%) > l(\hat{y}=99\%, y=100\%)}$, which a translation-invariant loss cannot do because both residuals are $1\%$
        - Hence, we'll use something like **Relative Error** ${\frac{y - \hat{y}}{y}}$ instead
        - However, we can often transform the response $y$, e.g. with a log-transform, so that a distance-based loss on the transformed scale behaves like a relative error
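The percentage example can be checked directly. In this sketch the helper names are assumptions, and percentages are written as plain numbers (9.0 means 9%):

```python
import numpy as np

def absolute_loss(y_hat, y):
    """A distance-based loss: depends only on the residual y - y_hat."""
    return abs(y - y_hat)

def relative_error(y_hat, y):
    """Residual scaled by the size of the target."""
    return abs(y - y_hat) / abs(y)

def log_scale_loss(y_hat, y):
    """Absolute loss applied after a log-transform of both values."""
    return abs(np.log(y) - np.log(y_hat))

# Translation invariance: shifting both values by 90 leaves the loss unchanged
print(absolute_loss(9.0, 10.0), absolute_loss(99.0, 100.0))    # 1.0 1.0

# Relative error separates the two cases
print(relative_error(9.0, 10.0), relative_error(99.0, 100.0))  # 0.1 0.01

# The log-transformed distance shows the same ordering
print(log_scale_loss(9.0, 10.0) > log_scale_loss(99.0, 100.0))  # True
```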

### Squared Loss (${l_2}$ Loss)

$${l(r) = r^2}$$

- Not robust to outliers: squaring the residual lets a few large errors dominate the loss

### Absolute / Laplace Loss (${l_1}$ Loss)

$${l(r) = \vert r\vert}$$

- More robust to outliers than squared loss, but not differentiable at $r = 0$
- Gives Median Regression
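A quick way to see the difference between the squared and absolute losses is to find the best constant prediction under each. The sketch below does a brute-force search over candidate constants on a hypothetical sample with one outlier: the squared loss lands on the mean and the absolute loss on the median.

```python
import numpy as np

# Hypothetical sample with one large outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Constant predictions to try, in steps of 0.01
candidates = np.linspace(0.0, 100.0, 10001)

# Empirical risk of each constant prediction under the two losses
residuals = y[None, :] - candidates[:, None]
sq_risk = (residuals ** 2).mean(axis=1)
abs_risk = np.abs(residuals).mean(axis=1)

best_sq = candidates[np.argmin(sq_risk)]    # close to the mean, 22
best_abs = candidates[np.argmin(abs_risk)]  # close to the median, 3

print(best_sq, np.mean(y))
print(best_abs, np.median(y))
```

The single outlier drags the squared-loss minimizer from around 3 up to 22, while the absolute-loss minimizer stays at 3.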

### Huber Loss

- Quadratic for $\vert r \vert \leq \delta$ and linear for $\vert r \vert > \delta$
- Robust and Differentiable
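A minimal sketch of the Huber loss, assuming the common convention $\frac{1}{2}r^2$ for $\vert r \vert \leq \delta$ and $\delta(\vert r \vert - \frac{1}{2}\delta)$ otherwise. The $\frac{1}{2}$ scaling is an assumption here; it makes the two pieces and their slopes match at $\vert r \vert = \delta$:

```python
import numpy as np

def huber_loss(r, delta=1.0):
    """Quadratic for |r| <= delta, linear beyond (assumed 1/2 scaling)."""
    r = np.asarray(r, dtype=float)
    abs_r = np.abs(r)
    quadratic = 0.5 * r ** 2
    linear = delta * (abs_r - 0.5 * delta)
    return np.where(abs_r <= delta, quadratic, linear)

residuals = np.array([-3.0, -1.0, 0.0, 0.5, 2.0])
print(huber_loss(residuals, delta=1.0))  # [2.5   0.5   0.    0.125 1.5  ]
```

Smaller values of `delta` make the loss behave more like the absolute loss, larger values more like the squared loss.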

---
# Other Losses

### Contrastive Loss

### Triplet Loss

---
## Resources

- [David Rosenberg's Lecture on Classification and Regression Losses @ Bloomberg](https://bloomberg.github.io/foml/#lecture-8-loss-functions-for-regression-and-classification)
- [David Rosenberg's Lecture on Statistical Learning Theory and ERM @ Bloomberg](https://bloomberg.github.io/foml/#lecture-3-introduction-to-statistical-learning-theory)
- [Concept Drift](https://machinelearningmastery.com/gentle-introduction-concept-drift-machine-learning/)
- [Covariate Shift](https://www.quora.com/What-is-Covariate-shift)
- [Empirical Risk Minimization Lecture by Kilian Weinberger](http://www.cs.cornell.edu/courses/cs4780/2018fa/lectures/lecturenote10.html)
- [5 Regression Loss Functions All Machine Learners Should Know](https://heartbeat.fritz.ai/5-regression-loss-functions-all-machine-learners-should-know-4fb140e9d4b0)
- [Contrastive and Triplet Loss](https://machinelearningmastery.com/one-shot-learning-with-siamese-networks-contrastive-and-triplet-loss-for-face-recognition/)