[Understanding Deep Learning; Chapter 5](https://udlbook.github.io/udlbook/)

# Loss function

When we train these models, we seek the parameters that produce the best possible mapping from input to output for the task we are considering.

We have a training dataset $\{x_i, y_i\}_{i=1}^{N}$ of input/output pairs.
Consider a model $f_{\theta}(x)$ with parameters $\theta$ that computes an output from input $x$.

We often think that the model directly computes a prediction $y$. We now shift perspective and consider the model as computing a conditional probability distribution $p(y|x)$ over possible outputs $y$ given input $x$.

A **loss function** $L(\theta)$ returns a single number that describes the mismatch between the model predictions and their corresponding ground-truth outputs.
The loss encourages each training output $y_i$ to have a high probability under the distribution $p(y_i|x_i)$ computed from the corresponding input $x_i$.

## Negative log-likelihood criterion

We choose a parametric distribution $p(y|\phi)$ defined on the output domain $y$. Then we use the neural network to compute one or more of the parameters $\phi$ of this distribution.

The model now computes different distribution parameters $\phi_i = f_\theta(x_i)$ for each training input $x_i$. Each observed training output $y_i$ should have high probability under its corresponding distribution $p(y_i|\phi_i)$. Hence, we choose the model parameters $\theta$ so that they maximize the combined probability across all $N$ training examples.

A conditional probability $p(z|\psi)$ can be considered in two ways.
- As a function of $z$, it is a probability distribution that sums to one.
- As a function of $\psi$, it is a likelihood and does not generally sum to one.

We assume that
- the data are identically distributed (the form of the probability distribution over the outputs $y_i$ is the same for each data point).
- the conditional distributions $p(y_i|x_i)$ of the outputs given the inputs are independent, so the total likelihood of the training data decomposes as:
$$
p(y_1, y_2, ..., y_N | x_1, x_2, ..., x_N) = \prod_{i=1}^{N}p(y_i|x_i)
$$
In other words, we assume the data are **independent and identically distributed (i.i.d.)**.

Then, the model parameter $\hat{\theta}$ we want to find is:

$$
\begin{align}
\hat{\theta} & = \underset{\theta}{\mathrm{argmax}}\left[ \prod_{i=1}^{N} p(y_i|x_i)  \right] \\
& = \underset{\theta}{\mathrm{argmax}}\left[ \prod_{i=1}^{N} p(y_i|\phi_i)  \right] \\
& = \underset{\theta}{\mathrm{argmax}}\left[ \prod_{i=1}^{N} p(y_i|f_\theta(x_i))  \right] \\
\end{align}
$$

The combined probability term is the **likelihood** of the parameters and this equation is known as the **maximum likelihood** criterion.

The maximum likelihood criterion is not very practical. Each term $p(y_i|f_\theta(x_i))$ can be small, so the product of many of these terms can be tiny. It may be difficult to represent this quantity with finite precision arithmetic. Fortunately, we can equivalently maximize the logarithm of the likelihood:

$$
\begin{align}
\hat{\theta} & = \underset{\theta}{\mathrm{argmax}}\left[ \prod_{i=1}^{N} p(y_i|f_\theta(x_i))  \right] \\
& = \underset{\theta}{\mathrm{argmax}}\left[ \log \prod_{i=1}^{N} p(y_i|f_\theta(x_i))  \right] \\
& = \underset{\theta}{\mathrm{argmax}}\left[ \sum_{i=1}^{N} \log p(y_i|f_\theta(x_i))  \right] \\
\end{align}
$$

This **log-likelihood** criterion is equivalent because the logarithm is a monotonically increasing function. The log-likelihood criterion has the practical advantage of using a sum of terms, not a product, so representing it with finite precision isn't problematic.
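
The precision problem is easy to reproduce. The following is a minimal sketch, assuming a made-up set of 2000 per-example likelihoods that are all equal to 0.01 (these numbers are illustrative, not from any real model). It compares the direct product with the sum of logarithms.

```python
import math

# Hypothetical per-example likelihoods p(y_i | f_theta(x_i)): 2000 terms of 0.01.
probs = [0.01] * 2000

# Direct product: the true value is 1e-4000, far below the smallest positive
# double (about 5e-324), so the floating-point result underflows to 0.0.
likelihood = math.prod(probs)

# Sum of logs: carries the same information but is easily representable.
log_likelihood = sum(math.log(p) for p in probs)

print(likelihood)      # 0.0
print(log_likelihood)  # about -9210.34
```

Once the product has underflowed to zero, every parameter setting looks equally bad, whereas the log-likelihood still distinguishes between them.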

By convention, model fitting problems are framed in terms of minimizing a loss. To convert the maximum log-likelihood criterion to a minimization problem, we multiply by minus one, which gives us the **negative log-likelihood criterion**:

$$
\begin{align}
\hat{\theta} & = \underset{\theta}{\mathrm{argmin}}\left[-\sum_{i=1}^{N}\log p(y_i|f_\theta(x_i))\right] \\
& = \underset{\theta}{\mathrm{argmin}}[L(\theta)]
\end{align}
$$
This defines the final loss function $L(\theta)$:

$$
L(\theta) = -\sum_{i=1}^{N}\log p(y_i|f_\theta(x_i))
$$
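
As a rough illustration, the loss can be written as a function that takes the model and the log-density as arguments. The Bernoulli distribution, the one-parameter `sigmoid_model`, and the four data points below are assumptions chosen for the example; they are not part of the text above.

```python
import math

def negative_log_likelihood(theta, xs, ys, model, log_prob):
    """L(theta) = -sum_i log p(y_i | f_theta(x_i))."""
    return -sum(log_prob(y, model(x, theta)) for x, y in zip(xs, ys))

def sigmoid_model(x, theta):
    # Hypothetical model f_theta(x): maps x to a Bernoulli parameter in (0, 1).
    return 1.0 / (1.0 + math.exp(-theta * x))

def bernoulli_log_prob(y, lam):
    # log p(y | lam) for y in {0, 1}, where lam is the probability that y = 1.
    return math.log(lam) if y == 1 else math.log(1.0 - lam)

# Made-up training pairs: negative inputs are labelled 0, positive inputs 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

loss_flat = negative_log_likelihood(0.0, xs, ys, sigmoid_model, bernoulli_log_prob)
loss_good = negative_log_likelihood(2.0, xs, ys, sigmoid_model, bernoulli_log_prob)

print(loss_flat)  # 4 * log(2), about 2.77: every label gets probability 0.5
print(loss_good)  # about 0.29: the labels are much more probable
```

A lower loss means the observed outputs have higher probability under the distributions the model computes.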

The network no longer directly predicts the outputs $y$ but instead determines a probability distribution over $y$. When we perform inference, we often want a point estimate rather than a distribution, so we return the maximum of the distribution:

$$
\hat{y} = \underset{y}{\mathrm{argmax}}[p(y|f_{\hat{\theta}}(x))]
$$


### Recipe for constructing loss functions

The recipe for constructing loss functions for training data $\{x_i, y_i\}_{i=1}^{N}$ using the maximum likelihood approach is:

1. Choose a suitable probability distribution $p(y|\phi)$ defined over the domain of the predictions $y$ with distribution parameters $\phi$.
1. Set the machine learning model $f_\theta(x)$ to predict one or more of these parameters, so $\phi=f_\theta(x)$ and $p(y|\phi)=p(y|f_\theta(x))$.
1. To train the model, find the network parameters $\hat{\theta}$ that minimize the negative log-likelihood loss function over the training dataset pairs $\{x_i, y_i\}_{i=1}^{N}$:
$$
\hat{\theta} = \underset{\theta}{\mathrm{argmin}}[L(\theta)] = \underset{\theta}{\mathrm{argmin}}\left[-\sum_{i=1}^{N}\log p(y_i|f_\theta(x_i))\right]
$$
1. To perform inference for a new test example $x$, return either the full distribution $p(y|f_{\hat{\theta}}(x))$ or the value where the distribution is maximized.
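
The four steps above can be sketched end to end. This is only an illustration under stated assumptions: the distribution is a univariate normal with a fixed standard deviation `SIGMA`, the model `f` is a hypothetical one-parameter function predicting the mean, the data are made up, and a crude grid search stands in for a real optimizer.

```python
import math

SIGMA = 1.0  # assumed fixed standard deviation (not learned)

def log_normal_pdf(y, mu, sigma=SIGMA):
    # Step 1: log of the chosen distribution p(y | phi), here a normal with mean mu.
    return -0.5 * math.log(2.0 * math.pi * sigma**2) - (y - mu) ** 2 / (2.0 * sigma**2)

def f(x, theta):
    # Step 2: hypothetical model that predicts the distribution parameter mu.
    return theta * x

def loss(theta, xs, ys):
    # Step 3: negative log-likelihood over the training pairs.
    return -sum(log_normal_pdf(y, f(x, theta)) for x, y in zip(xs, ys))

# Made-up training data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Grid search over theta in 0.00, 0.01, ..., 4.00 (stand-in for an optimizer).
candidates = [i / 100.0 for i in range(0, 401)]
theta_hat = min(candidates, key=lambda t: loss(t, xs, ys))

# Step 4: a normal density peaks at its mean, so the point estimate is f(x, theta_hat).
x_new = 5.0
y_hat = f(x_new, theta_hat)

print(theta_hat, y_hat)
```

For this data the grid search selects `theta_hat = 1.99`. Returning `f(x_new, theta_hat)` together with `SIGMA` would correspond to returning the full distribution instead of a point estimate.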

## Cross entropy

The **information** quantifies the number of bits required to encode and transmit an event. Lower probability events carry more information, and higher probability events carry less.

In information theory, we like to describe the “surprise” of an event. An event is more surprising the less likely it is, meaning it contains more information.

- Low Probability Event (surprising): More information.
- Higher Probability Event (unsurprising): Less information.

The information $h(x)$ of an event $x$ with probability $P(x)$ is calculated as follows, where a base-2 logarithm gives the result in bits:

$$
h(x) = -\log P(x)
$$
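
A small sketch of this formula, assuming a base-2 logarithm so that the result is in bits. The helper name `information` is chosen here for illustration.

```python
import math

def information(p):
    """h(x) = -log2 P(x): information in bits of an event with probability p."""
    return -math.log2(p)

print(information(0.5))    # 1.0 bit, e.g. one fair coin flip
print(information(0.125))  # 3.0 bits, a rarer event carries more information
```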

The **entropy** is the number of bits required to transmit a randomly selected event from a probability distribution. A skewed distribution has a low entropy, whereas a distribution where events have equal probability has a larger entropy.

A skewed probability distribution has less “surprise” and in turn a low entropy because likely events dominate. A balanced distribution is more surprising and in turn has a higher entropy because events are equally likely.

- Skewed Probability Distribution (unsurprising): Low entropy.
- Balanced Probability Distribution (surprising): High entropy.

The entropy $H(P)$ is the expected information under the probability distribution $P(x)$:

$$
H(P) = \sum_x P(x)h(x) = -\sum_x P(x) \log P(x)
$$
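
To make the contrast between skewed and balanced distributions concrete, here is a short sketch that evaluates the entropy in bits for two made-up distributions over four events.

```python
import math

def entropy(dist):
    """H(P) = -sum_x P(x) log2 P(x), in bits; zero-probability events are skipped."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

balanced = [0.25, 0.25, 0.25, 0.25]  # all events equally likely
skewed = [0.97, 0.01, 0.01, 0.01]    # one event dominates

print(entropy(balanced))  # 2.0 bits
print(entropy(skewed))    # about 0.24 bits
```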

The **cross-entropy** is the average number of bits needed to encode data coming from a source with distribution $P$ when we use model $Q$:

$$
H(P, Q) = -\sum_x P(x) \log Q(x)
$$

The **Kullback-Leibler (KL) divergence** is the average number of extra bits needed to encode the data, due to the fact that we used distribution $Q$ to encode the data instead of the true distribution $P$.

- Cross-Entropy: Average total number of bits to represent an event from $P$ when using a code built for $Q$.
- KL Divergence (also called Relative Entropy): Average number of extra bits to represent an event from $P$ when using a code built for $Q$ instead of one built for $P$.

$$
D_{KL}(P || Q) = H(P, Q) - H(P) = \sum_x P(x) \log \frac{P(x)}{Q(x)}
$$
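
The identity $D_{KL}(P || Q) = H(P, Q) - H(P)$ can be checked numerically. The sketch below uses two made-up discrete distributions `P` and `Q` whose probabilities are powers of two, so every quantity comes out exactly in bits.

```python
import math

def entropy(p):
    # H(P) = -sum_x P(x) log2 P(x)
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(P, Q) = -sum_x P(x) log2 Q(x)
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_x P(x) log2 (P(x) / Q(x))
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.25, 0.25]  # true source distribution (made up)
Q = [0.25, 0.25, 0.5]  # model used for encoding (made up)

print(entropy(P))           # 1.5 bits
print(cross_entropy(P, Q))  # 1.75 bits
print(kl_divergence(P, Q))  # 0.25 extra bits = 1.75 - 1.5
```

When the model matches the source, `cross_entropy(P, P)` equals `entropy(P)` and the KL divergence is zero.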

## Cross-entropy criterion

The cross-entropy loss is based on the idea of finding parameters $\phi$ that minimize the distance between the empirical distribution $q(y)$ of the observed data $y$ and a model distribution $p(y|\phi)$.

Consider an empirical data distribution at points $\{ y_i \}_{i=1}^{N}$. We can describe this as a weighted sum of point masses:

$$
q(y) = \frac{1}{N} \sum_{i=1}^{N} \delta(y-y_i)
$$

where $\delta$ is the Dirac delta function. 

The distance between two probability distributions $q(z)$ and $p(z)$ can be evaluated using the Kullback-Leibler (KL) divergence:

$$
D_{KL}(q||p) = \int_{-\infty}^{\infty} q(z) \log \frac{q(z)}{p(z)} dz
$$

We want to minimize the KL divergence between this empirical distribution $q(y)$ and the model distribution $p(y|\phi)$:

$$
\begin{align}
\hat{\phi} & = \underset{\phi}{\mathrm{argmin}}\left[
  \int_{-\infty}^{\infty} q(y) \log q(y) dy - \int_{-\infty}^{\infty} q(y) \log p(y|\phi) dy
  \right] \\
  & = \underset{\phi}{\mathrm{argmin}}\left[- \int_{-\infty}^{\infty} q(y) \log p(y|\phi) dy
  \right]
\end{align}
$$

where the first term disappears, as it has no dependence on $\phi$. The remaining second term is known as the **cross-entropy**.

$$
\begin{align}
\hat{\phi}
  & = \underset{\phi}{\mathrm{argmin}}\left[- \int_{-\infty}^{\infty} \frac{1}{N} \sum_{i=1}^{N} \delta(y-y_i) \log p(y|\phi) dy
  \right] \\
  & = \underset{\phi}{\mathrm{argmin}}\left[- \frac{1}{N} \sum_{i=1}^{N} \log p(y_i|\phi)
  \right] \\
  & = \underset{\phi}{\mathrm{argmin}}\left[- \sum_{i=1}^{N} \log p(y_i|\phi)
  \right]
\end{align}
$$

In the second line, the integral against each delta function picks out the value of $\log p(y|\phi)$ at $y = y_i$. In the third line, the constant factor $1/N$ is dropped because a positive scaling does not change the position of the minimum.


In machine learning, the distribution parameters $\phi$ are computed by the model $f_\theta(x_i)$, so we have:

$$
\hat{\theta} = \underset{\theta}{\mathrm{argmin}}\left[- \sum_{i=1}^{N} \log p(y_i|f_\theta(x_i))
  \right]
$$

This is precisely the negative log-likelihood criterion.

**It follows that the negative log-likelihood criterion (from maximizing the data likelihood) and the cross-entropy criterion (from minimizing the distance between the model and empirical data distributions) are equivalent.**
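
The equivalence can be checked on a toy example. The sketch below assumes a discrete output with three made-up categories, so the empirical distribution is a table of observed frequencies rather than a sum of Dirac delta functions; the two candidate models `model_1` and `model_2` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical observed outputs y_i.
ys = ["a", "a", "b", "a", "c", "b", "a", "a"]
N = len(ys)

# Empirical distribution q(y): fraction of observations equal to each value.
q = {y: count / N for y, count in Counter(ys).items()}

def nll(p):
    # Negative log-likelihood: -sum_i log p(y_i)
    return -sum(math.log(p[y]) for y in ys)

def cross_entropy(q, p):
    # Cross-entropy between empirical q and model p: -sum_y q(y) log p(y)
    return -sum(q[y] * math.log(p[y]) for y in q)

# Two candidate model distributions (made up).
model_1 = {"a": 0.6, "b": 0.3, "c": 0.1}
model_2 = {"a": 0.2, "b": 0.3, "c": 0.5}

for p in (model_1, model_2):
    # The two printed numbers agree: cross-entropy is the NLL divided by N.
    print(nll(p) / N, cross_entropy(q, p))
```

Because the two criteria differ only by the constant factor $1/N$, they always rank candidate models in the same order, and so they share the same minimizer.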