[ICCV 2023 Tutorial] Sharon Yixuan Li: Out-of-Distribution detection

- https://abursuc.github.io/many-faces-reliability/
- https://abursuc.github.io/many-faces-reliability/slides/2023_iccv_reliability_sharon_ood.pdf
- https://youtu.be/hgLC9_9ZCJI

<img src="src/001.jpg" style="max-height: 400px" />

Marginal distribution (In distribution; ID): $p^{\text{in}}(\mathbf x)$ \
Input space: $\mathcal X = \mathbb{R}^d$ \
Label space: $\mathcal Y = \{1, -1\}$

Novel distribution (Out of distribution; OOD)

## Challenges

High-capacity neural networks exacerbate __over-confident__ predictions

(Left) In-distribution data: a mixture of 3 Gaussians \
(Right) Decision boundary learned by an MLP: the model is overconfident in the red regions

<img src="src/002.png" style="max-height: 250px; margin: 10px" />
<img src="src/003.png" style="max-height: 250px; margin: 10px" />
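
The overconfidence effect can be reproduced in a toy setting. The sketch below is not the MLP from the figure: it assumes a 3-class linear softmax classifier with hypothetical weights `W` and a hypothetical probe direction, and only illustrates that the maximum softmax probability tends to grow as an input moves farther from the origin, regardless of whether any training data lives there.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical weights of a 3-class linear classifier (one row per class).
W = np.array([[1.0, 0.0],
              [-0.5, 0.8],
              [-0.5, -0.8]])

# Hypothetical direction along which we move away from the origin.
direction = np.array([1.0, 0.2])

scales = [1, 10, 100]
confidences = []
for scale in scales:
    x = scale * direction              # a point farther and farther away
    probs = softmax(W @ x)             # class probabilities at x
    confidences.append(probs.max())    # confidence of the predicted class

for scale, conf in zip(scales, confidences):
    print(f"scale={scale:>3}  max softmax probability={conf:.4f}")
```

Because the logits scale linearly with the input, the gap between the largest logit and the others widens with distance, which pushes the predicted probability toward 1.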

## Tutorial

- Inference-time OOD detection
  - Output-based methods
  - Distance-based methods
- Training-time regularization for OOD detection
  - Safety-aware learning objective
  - Synthesizing virtual outliers
  - Leveraging wild unlabeled data

### Empirical risk minimization

V. Vapnik. Principles of risk minimization for learning theory. NIPS 1991

The standard training objective that underlies most supervised learning methods.

#### Function Estimation Model

The learning process is described through three components:
1. A generator of random vectors $x$, drawn independently from a fixed but unknown distribution $P(x)$.
1. A supervisor which returns an output vector $y$ for every input vector $x$, according to a conditional distribution function $P(y|x)$, also fixed but unknown.
1. A learning machine capable of implementing a set of functions $f(x;w), w \in W$.

The problem of learning is that of choosing, from the given set of functions, the one that best approximates the supervisor's response. The selection is based on a training set of $n$ independent observations:

$$
(x_1, y_1), \cdots, (x_n, y_n) \qquad \cdots (1)
$$

The formulation given above implies that learning corresponds to the problem of function approximation.
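
The three components can be sketched in code. This is a minimal illustration, not part of the original paper: the Gaussian choice for $P(x)$, the logistic choice for $P(y|x)$, the hypothetical parameter `W_TRUE`, and the linear family $f(x;w) = \text{sign}(w \cdot x)$ are all assumptions made only to produce a concrete training set of the form (1).

```python
import numpy as np

rng = np.random.default_rng(0)

def generator(n, d=2):
    """Generator: draw n inputs i.i.d. from P(x).
    Assumption: P(x) is a standard Gaussian in d dimensions."""
    return rng.normal(size=(n, d))

# Hypothetical parameter used by the supervisor (unknown to the learner).
W_TRUE = np.array([2.0, -1.0])

def supervisor(x):
    """Supervisor: sample y in {1, -1} from P(y|x).
    Assumption: P(y = 1 | x) follows a logistic model."""
    p_pos = 1.0 / (1.0 + np.exp(-(x @ W_TRUE)))
    return np.where(rng.random(len(x)) < p_pos, 1, -1)

def f(x, w):
    """Learning machine: the family f(x; w) = sign(w . x), indexed by w."""
    return np.where(x @ w >= 0, 1, -1)

# Training set (x_1, y_1), ..., (x_n, y_n) of n independent observations.
n = 200
x_train = generator(n)
y_train = supervisor(x_train)
```

The learner only ever sees `x_train` and `y_train`; `W_TRUE` and both distributions stay hidden, which is what makes the risk in the next section impossible to compute directly.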

#### Problem of Risk ($R$) Minimization

To choose the best available approximation to the supervisor's response, we measure the loss or discrepancy $L(y, f(x;w))$ between the response $y$ of the supervisor to a given input $x$ and the response $f(x;w)$ provided by the learning machine. The risk is the expected value of this loss:

$$
R(w) = \int L(y, f(x; w)) dP(x, y) \qquad \cdots (2)
$$

The goal is to minimize the risk functional $R(w)$ over the class of functions $f(x; w), w \in W$.
But the joint probability distribution $P(x,y) = P(y|x)P(x)$ is unknown and the only available information is contained in the training set (1).

#### Empirical Risk ($E$) Minimization

In order to solve this problem, the following induction principle is proposed: the risk functional $R(w)$ is replaced by the empirical risk functional

$$
E(w) = \dfrac{1}{n} \sum \limits_{i=1}^{n} L(y_i, f(x_i; w))
$$
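
A minimal sketch of empirical risk minimization, under assumptions chosen only for illustration: the loss $L$ is the 0-1 loss, the function family is $f(x;w) = \text{sign}(w \cdot x)$, the labels come from a hypothetical noiseless `w_true`, and the minimization is a brute-force search over a small grid of candidate directions rather than a gradient-based optimizer.

```python
import numpy as np

def f(x, w):
    """Learning machine: linear classifier f(x; w) = sign(w . x)."""
    return np.where(x @ w >= 0, 1, -1)

def zero_one_loss(y, y_pred):
    """L(y, f(x; w)): 1 for a wrong prediction, 0 for a correct one."""
    return (y != y_pred).astype(float)

def empirical_risk(w, x, y, loss=zero_one_loss):
    """E(w) = (1/n) * sum_i L(y_i, f(x_i; w))."""
    return loss(y, f(x, w)).mean()

# Toy training set; w_true is a hypothetical supervisor, labels are noiseless.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
w_true = np.array([1.0, 1.0])
y = f(x, w_true)

# Candidate parameters W: unit vectors at 72 evenly spaced angles.
angles = np.linspace(0.0, 2.0 * np.pi, 72, endpoint=False)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Empirical risk minimization: pick the candidate with the smallest E(w).
risks = np.array([empirical_risk(w, x, y) for w in candidates])
w_hat = candidates[np.argmin(risks)]

print("selected w:", w_hat)
print("empirical risk of selected w:", risks.min())
```

Note that $E(w)$ only measures error on the observed sample. Nothing in this objective constrains the behavior of $f(x;w)$ on inputs far from the training data, which is where OOD detection comes in.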