(dl/01ba-nll)=
# Negative log loss (MLE)

The machine learning method follows four steps: defining a model, defining a loss function,
choosing an optimizer, and running it on large compute (e.g. GPUs). A **loss function** 
acts as a smooth surrogate for the true objective, which may not be amenable to available optimization 
techniques. We can therefore think of a loss function as a measure of model quality:
the choice of loss function determines what the model parameters are optimized towards.

```{figure} ../../../img/nn/02-loss-surface.png
---
name: 01c-loss-surface
width: 60%
align: center
---
Loss surface for a model with two weights. [Source](https://cs182sp21.github.io/static/slides/lec-4.pdf)
```


Here we derive a loss function based on the principle of [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation) (MLE), i.e. finding optimal parameters such that the dataset is most probable. Consider a parametric model of the target $p_{\boldsymbol{\Theta}}(y \mid \boldsymbol{\mathsf{x}}).$ 
The **likelihood** of the [iid](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables) sample $\mathcal{D} = \{(\boldsymbol{\mathsf{x}}_i, y_i)\}_{i=1}^N$ can be defined as

$$
\begin{aligned}
L(\boldsymbol{\Theta}) 
= \left({\prod_{i=1}^N {p_{\boldsymbol{\Theta}}(y_i \mid \boldsymbol{\mathsf{x}}_i)}}\right)^{\frac{1}{N}}.
\end{aligned}
$$

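Because of the $1/N$ exponent, this is the geometric mean of the probabilities that the parametric model assigns
to the observed targets, so it can be thought of as the typical probability the model assigns to a sample point.
The exponent does not change which parameters maximize the likelihood.
The iid assumption is what lets the joint probability factor into a product. It also means that the model gets to focus on inputs 
that are more probable, since they are better represented in the sample. 
Each factor lies in $[0, 1]$, so a product of many of them quickly becomes vanishingly small.
Since the logarithm is monotonic, applying it preserves the maximizer while converting the product into a sum: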

$$
\begin{aligned}
\log L(\boldsymbol{\Theta}) 
&= \frac{1}{N}\sum_{i=1}^N \log p_{\boldsymbol{\Theta}}(y_i \mid \boldsymbol{\mathsf{x}}_i).
\end{aligned}
$$
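To see why the logarithm helps numerically, here is a minimal sketch in plain Python. The probabilities are made-up values rather than outputs of a real model, and the variable names are only illustrative.

```python
import math

# Hypothetical per-example probabilities p(y_i | x_i); made up for illustration.
probs = [0.3] * 2000

# Multiplying them directly underflows in double precision.
product = 1.0
for p in probs:
    product *= p

# The mean log-probability (the log-likelihood above) stays well scaled.
mean_log_prob = sum(math.log(p) for p in probs) / len(probs)

# Exponentiating recovers the geometric-mean likelihood L(Theta).
likelihood = math.exp(mean_log_prob)

print(product)        # raw product of 2000 probabilities
print(mean_log_prob)  # close to log(0.3)
print(likelihood)     # close to 0.3
```

The raw product carries no usable information once it underflows, whereas the averaged log-probabilities remain on a scale that is easy to compare across parameter settings.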

MLE then maximizes the log-likelihood with respect to the parameters $\boldsymbol{\Theta}.$ The idea is that a good model should make the observed data more probable. It is common practice in the optimization literature to pose problems as minimization, so we negate the objective. The following then becomes our optimization problem:

$$\boldsymbol{\Theta}^* = \underset{\boldsymbol{\Theta}}{\text{argmin}}\,\left( -\frac{1}{N}\sum_{i=1}^N \log p_{\boldsymbol{\Theta}}(y_i \mid \boldsymbol{\mathsf{x}}_i)\right).$$

This allows us to define the per-example loss $\ell = -\log p_{\boldsymbol{\Theta}}(y \mid \boldsymbol{\mathsf{x}}).$ In general, the loss function can be any nonnegative function whose value approaches zero whenever the prediction of the network approaches the target value. Observe that:

- $p_{\boldsymbol{\Theta}}(y \mid \boldsymbol{\mathsf{x}}) \to 1$ $\implies$ $\ell \to 0$
- $p_{\boldsymbol{\Theta}}(y \mid \boldsymbol{\mathsf{x}}) \to 0$ $\implies$ $\ell \to \infty$ 
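The two limits above can be checked numerically. The following is a small sketch; the helper name `nll` and the listed probabilities are assumptions made for illustration.

```python
import math

def nll(p):
    """Per-example loss: negative log of the probability given to the true target."""
    return -math.log(p)

# Hypothetical probabilities assigned to the correct target,
# ordered from confidently right to confidently wrong.
for p in [0.999, 0.9, 0.5, 0.1, 0.001]:
    print(f"p = {p:<6} loss = {nll(p):.4f}")
```

The loss grows without bound as the probability of the correct target shrinks, so confidently wrong predictions are penalized far more than uncertain ones.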

Using an expectation of the loss over the underlying distribution allows the model to focus on errors based on their probability of occurring. For parameters $\boldsymbol{\Theta},$ we approximate the **true risk**, which is the expectation of $\ell$ under the underlying distribution, with the **empirical risk** calculated on the sample $\mathcal{D}$:

$$
\begin{aligned}
\mathcal{L}(\boldsymbol{\Theta}) 
&= \mathbb{E}_{\boldsymbol{\mathsf{x}},y}\left[\ell(y, f_{\boldsymbol{\Theta}}(\boldsymbol{\mathsf{x}}))\right] \\
&\approx \mathcal{L}_\mathcal{D}(\boldsymbol{\Theta}) = \frac{1}{|\mathcal{D}|} \sum_i \ell(y_i, f_{\boldsymbol{\Theta}}(\boldsymbol{\mathsf{x}}_i)).
\end{aligned}
$$
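The approximation can be illustrated with a toy simulation. This sketch assumes a made-up distribution where the input is ignored and $y = 1$ with probability `P_TRUE`, and a fixed model that predicts $y = 1$ with probability `Q_MODEL`; both constants are hypothetical. Under these assumptions the true risk has a closed form, so it can be compared against the empirical risk for growing sample sizes.

```python
import math
import random

random.seed(0)

P_TRUE = 0.7   # assumed true probability that y = 1 (toy distribution)
Q_MODEL = 0.6  # assumed model probability for y = 1 (fixed parameters)

def loss(y):
    """Negative log-probability the model assigns to the observed label y."""
    return -math.log(Q_MODEL if y == 1 else 1.0 - Q_MODEL)

# True risk: exact expectation of the loss under the toy distribution.
true_risk = P_TRUE * loss(1) + (1.0 - P_TRUE) * loss(0)

def empirical_risk(n):
    """Average loss over an iid sample of size n drawn from the toy distribution."""
    sample = [1 if random.random() < P_TRUE else 0 for _ in range(n)]
    return sum(loss(y) for y in sample) / n

for n in [10, 1_000, 100_000]:
    print(n, round(empirical_risk(n), 4), "vs true risk", round(true_risk, 4))
```

With larger samples the empirical risk should settle near the true risk, which is the sense in which minimizing one stands in for minimizing the other.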

The optimization problem can be written more generally as
$\boldsymbol{\Theta}^* = \underset{\boldsymbol{\Theta}}{\text{argmin}}\, \mathcal{L}_\mathcal{D}(\boldsymbol{\Theta}).$
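
As a concrete instance of this argmin, the sketch below fits a one-parameter Bernoulli model by brute-force search over a grid. The dataset, the model (which ignores the inputs), and the grid are all assumptions chosen to keep the example small; practical models would use a gradient-based optimizer instead.

```python
import math

# Hypothetical binary targets; the inputs x are ignored in this toy model.
ys = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]

def empirical_risk(theta):
    """Mean negative log-likelihood of a Bernoulli model with P(y = 1) = theta."""
    return -sum(math.log(theta if y == 1 else 1.0 - theta) for y in ys) / len(ys)

# Candidate parameters 0.01, 0.02, ..., 0.99 (endpoints excluded to avoid log(0)).
grid = [k / 100 for k in range(1, 100)]

# Brute-force argmin of the empirical risk over the grid.
theta_star = min(grid, key=empirical_risk)

print(theta_star, sum(ys) / len(ys))
```

For this model the minimizer of the empirical risk coincides with the fraction of positive labels in the sample, which is the familiar maximum likelihood estimate for a Bernoulli parameter.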