# L04: Maximum likelihood methods

**Sources and additional reading:**
- Lupton, chapter 7
- Ivezić, chapter 4.2

## Bayes' Theorem as an inference method

In L02 we derived Bayes' Theorem from the definition of conditional probability: let $A$ and $B$ be two events, the joint probability of $A$ and $B$ is then given by $$P(A\cap B) = P(A|B)P(B) = P(B|A) P(A).$$ From this, we can now derive Bayes' Theorem as $$P(B|A) = \frac{P(A|B)P(B)}{P(A)}.$$ As discussed, this relation gives us a way to translate from $P(A|B)$ to $P(B|A)$.

Now the crucial point is that we can reinterpret this equation as a means for performing inference. Let the event $B$ be a particular realization of parameters $\theta$ of a model $M$, and let $A$ be a particular realization of experimental data $x$. Then we can rewrite Bayes' Theorem as: $$P(\theta, M|x) = \frac{P(x|\theta, M)P(\theta, M)}{P(x)}.$$ Here we have introduced four important quantities:

The *likelihood* $P(x|\theta, M)$: The likelihood gives us the probability to obtain an observation $x$ given parameters $\theta$ of a model $M$.

The *prior* $P(\theta, M)$: The prior gives the probability of model parameters $\theta$ prior to having conducted an experiment.

The *posterior* $P(\theta, M|x)$: The posterior is the probability of a set of model parameters $\theta$ given we have observed a particular data realization $x$.

The *evidence* $P(x)$: The evidence acts as a normalizing constant for the posterior and is given by $P(x) = \int \mathrm{d}\theta\, P(x|\theta, M)P(\theta, M)$. In this sense it gives the probability of the data under the model $M$, averaged over all parameter values weighted by the prior.

Written in this way, Bayes' Theorem allows us to translate the probability of data given a model to the probability of a model given the data. Bayes' Theorem has thus become a tool for statistical inference: it gives us an expression for the distribution of model parameters in terms of the likelihood and the prior.
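To make the roles of the four quantities concrete, here is a minimal sketch that evaluates them on a grid for a single parameter. The setup is an assumption made only for illustration: synthetic Gaussian data with known $\sigma$, a flat prior on a finite interval, and hypothetical variable names. Constant factors of the likelihood are dropped, so `evidence` is only proportional to $P(x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0                        # assumed known noise level
x = rng.normal(2.0, sigma, 20)     # synthetic data, true mean 2.0

theta = np.linspace(0.0, 4.0, 2001)    # grid of candidate parameter values
dtheta = theta[1] - theta[0]

# log-likelihood log P(x|theta) on the grid (theta-independent constants dropped)
log_like = -0.5 * np.sum((x[:, None] - theta[None, :]) ** 2, axis=0) / sigma**2

# flat prior P(theta) on the interval covered by the grid
prior = np.ones_like(theta) / (theta[-1] - theta[0])

# likelihood times prior, rescaled by the peak likelihood for numerical stability
unnorm = np.exp(log_like - log_like.max()) * prior

# evidence: integral of likelihood times prior (up to the dropped constants)
evidence = np.sum(unnorm) * dtheta

# posterior P(theta|x), normalized to integrate to one on the grid
posterior = unnorm / evidence

print("posterior peak:", theta[np.argmax(posterior)], " sample mean:", x.mean())
```

With a flat prior the posterior peaks where the likelihood peaks, which is the observation that motivates maximum likelihood estimation.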

## Maximum likelihood inference

We can now use Bayes' Theorem to perform inference on the parameters $\boldsymbol{\theta}$ of a model $M$ (or equivalently estimate the parameters that underlie a given population pdf). We have that $$P(\boldsymbol{\theta}, M|x) \propto P(x|\boldsymbol{\theta}, M)P(\boldsymbol{\theta}, M).$$ If we assume that all values of the model parameters $\boldsymbol{\theta}$ are equally likely a priori, we can simplify even further, i.e. $$P(\boldsymbol{\theta}, M|x) \propto P(x|\boldsymbol{\theta}, M).$$ So we see that the model parameters that maximize the likelihood also maximize the posterior. The principle of maximum likelihood estimation defines the maximum likelihood estimator for the model parameters $\boldsymbol{\theta}$ as $$\hat{\boldsymbol{\theta}}_{\mathrm{ML}} = \underset{\boldsymbol{\theta}}{\mathrm{arg\,max}}\, P(x|\boldsymbol{\theta}, M).$$ So we choose as our estimator for the model parameters the set of parameter values that maximizes the likelihood of obtaining the data as observed.

![MLE.png](attachment:MLE.png)
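When no closed-form maximum is available, the likelihood can be maximized numerically. The sketch below is one way to do this, under assumptions that are not part of the lecture: an exponential model with an unknown rate, synthetic data, and SciPy's `minimize_scalar` as the optimizer. In practice one minimizes the negative log-likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 0.5, size=500)   # synthetic data, true rate 0.5

def neg_log_like(rate):
    # exponential model: log P(x|rate) = n log(rate) - rate * sum(x)
    return -(x.size * np.log(rate) - rate * x.sum())

# maximize the likelihood by minimizing its negative log within loose bounds
res = minimize_scalar(neg_log_like, bounds=(1e-3, 10.0), method="bounded")
rate_ml = res.x

# for this particular model the maximum is also known analytically: 1 / mean(x)
rate_analytic = 1.0 / x.mean()
print(rate_ml, rate_analytic)
```

The exponential model is used here only because its analytic maximum, $1/\bar{x}$, provides a cross-check on the optimizer.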

## Maximum likelihood estimator for the mean $\mu$

Let us assume that we have measured an iid sample of size $n$, $(x_1, ..., x_n)$. Our model is that these measurements are drawn from a Gaussian distribution $\mathcal{N}(\mu, \sigma)$ with mean $\mu$ and standard deviation $\sigma$. While $\sigma$ is known, we would like to use MLE to estimate $\mu$.

Based on our assumptions the probability density of each measurement $x_i$ is $$P(x_i|\mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}e^{-(x_i-\mu)^2/(2\sigma^2)}.$$ Because the measurements are independent, the likelihood of the sample that we observed, given a particular value of $\mu$, is the product $$P(x_1, ..., x_n|\mu, \sigma) = \frac{1}{\left(\sqrt{2\pi}\sigma \right)^n}\prod_i e^{-(x_i-\mu)^2/(2\sigma^2)}.$$ We now need to find the value of $\mu$ that maximizes this likelihood. For simplicity, we can work with the logarithm of the likelihood, i.e. $$\log P(x_1, ..., x_n|\mu, \sigma) = -\frac{n}{2}\log{2\pi}-n\log{\sigma}-\sum_i \frac{(x_i-\mu)^2}{2\sigma^2}.$$ Maximizing the log of the likelihood is equivalent to maximizing the likelihood, as the logarithm is monotonic. Setting the derivative with respect to $\mu$ to zero gives $$\frac{\partial \log P}{\partial \mu}=\frac{1}{\sigma^2}\sum_i (x_i-\mu)\overset{!}{=}0.$$ Thus the MLE for the mean $\mu$ is $$\hat{\mu}_{\mathrm{ML}}=\frac{1}{n}\sum_i x_i.$$ So we see that our oldie-but-goldie sample mean is also the ML estimator for $n$ draws from a Gaussian distribution.
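The result can be checked numerically. The following sketch, using synthetic data and hypothetical function names, evaluates the Gaussian log-likelihood and its derivative with respect to $\mu$, and confirms that the derivative vanishes at the sample mean and that no nearby value of $\mu$ gives a higher log-likelihood.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 2.0                        # assumed known standard deviation
x = rng.normal(5.0, sigma, 100)    # synthetic iid Gaussian sample

def log_like(mu):
    # Gaussian log-likelihood of the whole sample for a given mu
    n = x.size
    return (-0.5 * n * np.log(2 * np.pi) - n * np.log(sigma)
            - np.sum((x - mu) ** 2) / (2 * sigma**2))

def score(mu):
    # derivative of the log-likelihood with respect to mu
    return np.sum(x - mu) / sigma**2

mu_hat = x.mean()                  # candidate ML estimate: the sample mean

# scan the log-likelihood on a grid centred on the sample mean
mu_grid = np.linspace(mu_hat - 1.0, mu_hat + 1.0, 201)
ll_grid = np.array([log_like(m) for m in mu_grid])

print("score at sample mean:", score(mu_hat))
print("grid maximum at:", mu_grid[np.argmax(ll_grid)], " sample mean:", mu_hat)
```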

## Maximum likelihood estimator for the variance $\sigma^2$

In the case in which we know the mean $\mu$ of the underlying distribution but would like to estimate the variance $\sigma^2$, we set the derivative of the log-likelihood with respect to $\sigma$ to zero, $$\frac{\partial \log P}{\partial \sigma}=-\frac{n}{\sigma}+\frac{1}{\sigma^3}\sum_i (x_i-\mu)^2\overset{!}{=}0,$$ which leads to $$\hat{\sigma}_{\mathrm{ML}}^2 = \frac{1}{n}\sum_i (x_i-\mu)^2.$$ If the mean of the Gaussian is known, this ML estimator for the variance is unbiased. If the mean is unknown and must itself be estimated from the data, the ML estimator becomes $$\hat{\sigma}_{\mathrm{ML}}^2 = \frac{1}{n}\sum_i (x_i-\hat{\mu}_{\mathrm{ML}})^2,$$ which is biased: its expectation value is $\frac{n-1}{n}\sigma^2$, so it underestimates the variance for finite $n$.
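The bias of the variance estimator is easy to see in a simulation. The sketch below, with synthetic data and arbitrarily chosen values of $n$ and the number of trials, averages three estimators over many repeated samples: the ML estimator with known mean, the ML estimator with estimated mean, and the usual $1/(n-1)$ sample variance for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, n_trials = 0.0, 1.0, 5, 200_000

# each row is one independent sample of size n
samples = rng.normal(mu, sigma, size=(n_trials, n))

# ML estimator with the true mean known
var_known_mean = np.mean((samples - mu) ** 2, axis=1)

# ML estimator with the mean replaced by the sample mean (divides by n)
var_ml = np.var(samples, axis=1, ddof=0)

# bias-corrected sample variance (divides by n - 1)
var_unbiased = np.var(samples, axis=1, ddof=1)

print("known mean:    ", var_known_mean.mean())
print("ML, est. mean: ", var_ml.mean())
print("1/(n-1) version:", var_unbiased.mean())
```

For $n=5$ and $\sigma^2=1$ the averages should come out close to $1$, $(n-1)/n = 0.8$, and $1$ respectively, up to Monte Carlo noise.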

## Properties of ML estimators

Maximum likelihood estimators have a number of properties:

- They are *consistent*. This means that the MLE converges in probability to the true population parameter with increasing sample size.
- They are *asymptotically normally distributed*, i.e. the distribution of the parameter estimate tends to a Gaussian as the sample size increases. This Gaussian is centered on the true parameter value.
- Asymptotically, they are the minimum-variance estimator of a given quantity.

These properties make ML estimators very popular, but keep in mind that these only hold for large sample sizes. Often these conditions are not met.
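The large-sample caveat can be illustrated with a small simulation. The example below is an assumption chosen for illustration, not a case treated in the lecture: the ML estimator of the rate of an exponential distribution, $1/\bar{x}$, evaluated for increasing sample sizes. For this model the asymptotic standard deviation of the estimator is $\lambda/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(4)
true_rate = 1.0
n_trials = 5000                    # number of repeated experiments per sample size

summary = {}
for n in (5, 50, 500):
    # each row is one experiment with n draws from the exponential model
    samples = rng.exponential(scale=1.0 / true_rate, size=(n_trials, n))
    rate_ml = 1.0 / samples.mean(axis=1)       # ML estimate of the rate per experiment
    summary[n] = (rate_ml.mean(), rate_ml.std(ddof=1), true_rate / np.sqrt(n))

for n, (mean, std, asymptotic_std) in summary.items():
    print(f"n={n:4d}  mean={mean:.3f}  std={std:.3f}  asymptotic std={asymptotic_std:.3f}")
```

At small $n$ the estimator is noticeably biased and its scatter differs from the asymptotic prediction. Both effects shrink as $n$ grows.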

## Error of ML estimators

Another helpful feature of ML methods is that we can obtain approximate errors for these estimators. Let us assume that we have determined the MLE $\boldsymbol{\theta}_{\mathrm{ML}}$ for a vector of model parameters $\boldsymbol{\theta}$ and data $x$ given a likelihood $P(x|\boldsymbol{\theta}, M)$. We can now expand the log-likelihood of our experiment around the MLE $\boldsymbol{\theta}_{\mathrm{ML}}$ as $$\log{P(x|\boldsymbol{\theta}, M)}=\log{P(x|\boldsymbol{\theta}_{\mathrm{ML}}, M)} + \frac{\partial \log{P(x|\boldsymbol{\theta}, M)}}{\partial \theta_{\alpha}}\vert_{\boldsymbol{\theta}_{\mathrm{ML}}}(\theta_{\alpha}-\theta_{\alpha, \mathrm{ML}}) + \frac{1}{2}(\theta_{\alpha}-\theta_{\alpha, \mathrm{ML}})\frac{\partial^2 \log{P(x|\boldsymbol{\theta}, M)}}{\partial \theta_{\alpha}\partial \theta_{\beta}}\vert_{\boldsymbol{\theta}_{\mathrm{ML}}}(\theta_{\beta}-\theta_{\beta, \mathrm{ML}})+...,$$ where we sum over repeated indices. The first-derivative term in the expansion vanishes because the gradient of the log-likelihood is zero at its maximum. Truncating the expansion at second order, we see that around the MLE we can approximate any likelihood as Gaussian, i.e. $$P(x|\boldsymbol{\theta}, M)\simeq P(x|\boldsymbol{\theta}_{\mathrm{ML}}, M) e^{-\frac{1}{2}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\mathrm{ML}})^TC_{\mathrm{ML}}^{-1}(\boldsymbol{\theta}-\boldsymbol{\theta}_{\mathrm{ML}})},$$ where we have defined $$C_{\mathrm{ML}}=-\left(\frac{\partial^2 \log{P(x|\boldsymbol{\theta}, M)}}{\partial \theta_{\alpha}\partial \theta_{\beta}}\vert_{\boldsymbol{\theta}_{\mathrm{ML}}}\right)^{-1}.$$ Note though that this is only approximately true near the MLE: the posterior in the model parameters can be significantly non-Gaussian, even if the data are Gaussian. The distribution of the MLE only becomes Gaussian in the limit of $n\to\infty$.
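The covariance $C_{\mathrm{ML}}$ can also be estimated numerically when the second derivatives are tedious to compute by hand. The sketch below assumes a Gaussian model with both $\mu$ and $\sigma$ unknown, synthetic data, and a hypothetical finite-difference helper `hessian`. It compares the numerical result with the analytic curvature at the MLE, which for this model gives $\mathrm{diag}(\hat{\sigma}^2/n,\ \hat{\sigma}^2/(2n))$.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(1.0, 2.0, 400)      # synthetic Gaussian sample
n = x.size

def log_like(theta):
    # Gaussian log-likelihood in (mu, sigma), parameter-independent constant dropped
    mu, sigma = theta
    return -n * np.log(sigma) - np.sum((x - mu) ** 2) / (2 * sigma**2)

# ML estimates for mean and standard deviation
theta_ml = np.array([x.mean(), x.std(ddof=0)])

def hessian(f, theta, h=1e-4):
    # matrix of second derivatives from central finite differences
    k = theta.size
    steps = h * np.eye(k)
    H = np.empty((k, k))
    for a in range(k):
        for b in range(k):
            H[a, b] = (f(theta + steps[a] + steps[b]) - f(theta + steps[a] - steps[b])
                       - f(theta - steps[a] + steps[b]) + f(theta - steps[a] - steps[b])
                       ) / (4 * h**2)
    return H

# covariance estimate: negative inverse of the Hessian at the MLE
C_ml = -np.linalg.inv(hessian(log_like, theta_ml))

# analytic result for this model, for comparison
C_analytic = np.diag([theta_ml[1] ** 2 / n, theta_ml[1] ** 2 / (2 * n)])
print(C_ml)
print(C_analytic)
```

The step size `h` is a judgment call: too large introduces truncation error, too small amplifies floating-point round-off.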

### Example: What is the error on the MLE for the mean?

From above we have that $$\log P(x_1, ..., x_n|\mu, \sigma) = -\frac{n}{2}\log{2\pi}-n\log{\sigma}-\sum_i \frac{(x_i-\mu)^2}{2\sigma^2},$$ and $$\frac{\partial \log P}{\partial \mu}=\frac{1}{\sigma^2}\sum_i (x_i-\mu).$$ Differentiating once more gives $$\frac{\partial^2 \log P}{\partial \mu^2}=-\frac{n}{\sigma^2}.$$ The variance of the estimator is therefore $$\sigma^2(\hat{\mu}_{\mathrm{ML}})=-\left(\frac{\partial^2 \log P}{\partial \mu^2}\right)^{-1}=\frac{\sigma^2}{n},$$ i.e. the familiar standard error of the mean, $\sigma/\sqrt{n}$.
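This prediction can be compared with the scatter of the sample mean over many repeated experiments. The sketch below uses synthetic data with arbitrarily chosen values for $\mu$, $\sigma$, $n$ and the number of trials.

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, n, n_trials = 3.0, 2.0, 25, 100_000

# ML estimate of the mean for each of n_trials independent samples of size n
mu_hat = rng.normal(mu, sigma, size=(n_trials, n)).mean(axis=1)

empirical = mu_hat.std(ddof=1)     # scatter of the estimator across experiments
predicted = sigma / np.sqrt(n)     # error from the curvature of the log-likelihood

print("empirical:", empirical, " predicted:", predicted)
```

For a Gaussian likelihood in $\mu$ the log-likelihood is exactly quadratic, so the two numbers should agree up to Monte Carlo noise at any $n$, not only asymptotically.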