# Maximum Likelihood Estimation

## Overview

Perhaps the most common approach for deriving estimators is the maximum likelihood method.
This is a frequentist probabilistic framework that seeks a set of parameters for the model 
that maximizes a likelihood function. In particular, we wish to maximize the conditional probability 
of observing the data $D$ given a specific probability distribution and its parameters $\theta$.

Maximum likelihood parameter estimation is a technique that can be used
when we are willing to make assumptions about the probability distribution of
the data. Based on the theoretical probability distribution and the observed data,
the likelihood function is a probability statement that can be made about a
particular set of parameter values. If two sets of parameter values are being
compared, the set with the larger likelihood is deemed more consistent
with the observed data.

Maximum likelihood estimates have many appealing properties [1]:

- Consistency
- Equivariance
- Asymptotic normality
- Asymptotic optimality (efficiency)
- Approximate agreement with the Bayes estimator


## Maximum likelihood estimation

We define the likelihood function as follows [1]

----
**Definition: Likelihood function**

The likelihood function is defined as

\begin{equation}
L_{n}(\theta) = \prod_{i=1}^{n} f(X_i;\theta)
\end{equation}


It is often easier to work with the logarithm of $L_{n}(\theta)$. Hence, we define the log-likelihood function as

\begin{equation}
l_{n}(\theta) = \log L_{n}(\theta)
\end{equation}


----
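
One practical reason for preferring the log-likelihood is numerical: a product of many densities quickly underflows in floating point, whereas the sum of log-densities stays finite. The sketch below illustrates this for a normal model. The helper names (`normal_pdf`, `likelihood`, `log_likelihood`), the sample size, and the seed are assumptions made for illustration only.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(data, mu, sigma):
    """L_n(theta): product of the densities over the sample."""
    return math.prod(normal_pdf(x, mu, sigma) for x in data)

def log_likelihood(data, mu, sigma):
    """l_n(theta): sum of the log-densities over the sample."""
    return sum(
        -math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
        - 0.5 * ((x - mu) / sigma) ** 2
        for x in data
    )

# Hypothetical sample: 2000 draws from a standard normal.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]

L = likelihood(data, 0.0, 1.0)      # the product underflows to 0.0
l = log_likelihood(data, 0.0, 1.0)  # the sum remains a finite negative number
print(L, l)
```

For small samples the two functions agree, in the sense that `math.log(likelihood(...))` matches `log_likelihood(...)` up to rounding.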


The likelihood function $L_{n}(\theta)$ is the joint density of the data, assuming that the data are i.i.d. However, we treat it as a function
of the parameter $\theta$, i.e. [1]

\begin{equation}
L_{n}(\theta):\Theta \rightarrow [0, \infty)
\end{equation}


----
**Remark 1**

The function $L_{n}(\theta)$ is not a density function. That is, in general it does not integrate to 1 with respect to $\theta$ [1].

----

The maximum likelihood estimator or MLE is the value of $\theta$ that maximizes $L_{n}(\theta)$. We 
will denote this value with $\hat{\theta}$. In addition, given that the maximum of the log-likelihood occurs
at the same point as the maximum of $L_{n}(\theta)$, we will often maximize $l_n(\theta)$. Let's see some theoretical
examples in order to solidify the process.

### Example 1

This is a classical example often cited when discussing maximum likelihood estimators. Specifically, let the data
follow the normal distribution with parameters $\mu$ and $\sigma^2$. Ignoring constant factors that do not depend on the parameters, the likelihood function is given as


\begin{equation}
L_{n}(\mu, \sigma) = \prod_{i} \frac{1}{\sigma}\exp\left\{-\frac{1}{2\sigma^2}\left(x_i - \mu\right)^2\right\}
\end{equation}

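This can be written as

\begin{equation}
L_{n}(\mu, \sigma) = \frac{1}{\sigma^n}\exp\left\{-\frac{1}{2\sigma^2}\sum_i\left(x_i - \mu\right)^2\right\} = \frac{1}{\sigma^n}\exp\left\{-\frac{nS^2}{2\sigma^2}\right\}\exp\left\{-\frac{n\left(\bar{x} - \mu\right)^2}{2\sigma^2}\right\}
\end{equation}

where $\bar{x} = \frac{1}{n}\sum_i x_i$ is the sample mean and $S^2 = \frac{1}{n}\sum_i \left(x_i - \bar{x}\right)^2$. The last equality follows from the identity $\sum_i \left(x_i - \mu\right)^2 = nS^2 + n\left(\bar{x} - \mu\right)^2$. Taking the log of the last expression leads to

\begin{equation}
l_n(\mu, \sigma) = -n\log\sigma -\frac{nS^2}{2\sigma^2} - \frac{n\left(\bar{x} - \mu\right)^2}{2\sigma^2}
\end{equation}

Since we seek the maximum of this function, we take the derivatives with respect to $\mu$ and $\sigma$ and set them to zero:

\begin{equation}
\frac{\partial l_n}{\partial \mu} = \frac{n\left(\bar{x} - \mu\right)}{\sigma^2} = 0, ~~ \frac{\partial l_n}{\partial \sigma} = -\frac{n}{\sigma} + \frac{nS^2}{\sigma^3} + \frac{n\left(\bar{x} - \mu\right)^2}{\sigma^3} = 0
\end{equation}

Solving these equations shows that the MLEs of $\mu$ and $\sigma$ are respectively

\begin{equation}
\hat{\mu} = \bar{x}, ~~ \hat{\sigma}=S
\end{equation}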
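
The closed-form result can be checked numerically. The following is a minimal sketch, assuming a simulated sample from $N(5, 2^2)$ and taking $S^2$ to be the average squared deviation from the sample mean (the $1/n$ convention). It evaluates the log-likelihood, with the additive constant dropped, at $(\bar{x}, S)$ and at a few nearby parameter values. The function name, the seed, and the perturbation sizes are illustrative choices.

```python
import math
import random

def log_likelihood(data, mu, sigma):
    """Normal log-likelihood with the additive constant dropped."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - ss / (2.0 * sigma ** 2)

# Hypothetical sample from N(5, 2^2).
rng = random.Random(42)
data = [rng.gauss(5.0, 2.0) for _ in range(500)]
n = len(data)

# Closed-form MLEs: the sample mean and the 1/n standard deviation.
mu_hat = sum(data) / n
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

best = log_likelihood(data, mu_hat, sigma_hat)

# Log-likelihood at small perturbations around the MLE.
neighbours = [
    log_likelihood(data, mu_hat + dm, sigma_hat + ds)
    for dm in (-0.1, 0.0, 0.1)
    for ds in (-0.1, 0.0, 0.1)
    if (dm, ds) != (0.0, 0.0)
]

print(mu_hat, sigma_hat, all(v < best for v in neighbours))
```

Every perturbed parameter pair yields a smaller log-likelihood than $(\bar{x}, S)$, which is what the derivation predicts.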

## Properties of maximum likelihood estimators

ML estimators possess several attractive properties, such as [1]:

- The MLE is consistent, i.e. it converges in probability to the true value of the parameter $\theta$.
- If $\hat{\theta}$ is the MLE of $\theta$ then $g(\hat{\theta})$ is the MLE of $g({\theta})$. This is called equivariance.
- The MLE is asymptotically normal.
- The MLE is asymptotically efficient, i.e. among all well-behaved estimators the MLE has the smallest variance, at least for large samples.
- The MLE is approximately the Bayes estimator.

Below, we briefly discuss some of these properties. For more information, see [1] and the references therein.

### MLE consistency

Consistency of the MLE means that the estimator converges in probability to the true value of the parameter. We have the following
definition


----
**Definition: Consistency**

Let $\lbrace X_1, \dots , X_n \rbrace$ denote a sample of observations and let $\hat{\theta}_n$ be an estimator computed from the sample. We say that
$\hat{\theta}_n$ is consistent if $\hat{\theta}_n \rightarrow_{P} \theta$, that is, if for every $\epsilon > 0$

\begin{equation}
P\left(|\hat{\theta}_n - \theta| > \epsilon \right)\rightarrow 0, ~~ \text{as} ~~ n\rightarrow \infty 
\end{equation}

----


----
**Remark 2**

A sufficient condition for consistency is that the mean squared error of the estimator vanishes (this follows from Markov's inequality), i.e.


\begin{equation}
E \left[ \left(\hat{\theta}_n - \theta \right) ^2 \right] \rightarrow 0, ~~ \text{as} ~~ n\rightarrow \infty 
\end{equation}

----

----
#### Example 1: Consistency of the mean estimate for normal sample

We have shown above that the sample mean is the MLE of the mean of a normally distributed population. We now want to show that
$\bar{x}$ is a consistent estimator. According to **Remark 2** above, it suffices to show 


\begin{equation}
E \left[ \left(\bar{x} - \mu \right) ^2 \right] \rightarrow 0, ~~ \text{as} ~~ n\rightarrow \infty 
\end{equation}

Given that 

\begin{equation}
\bar{x} = \frac{1}{n}\sum_i x_i
\end{equation}

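and that $\bar{x}$ is unbiased, i.e. $E\left[\bar{x}\right] = \mu$, the mean squared error equals the variance of $\bar{x}$. Since the observations are independent with variance $\sigma^2$, we get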

\begin{equation}
E \left[ \left(\bar{x} - \mu \right) ^2 \right] = Var\left[ \bar{x} \right ] = \frac{\sigma^2}{n}\rightarrow 0, ~~ \text{as} ~~ n\rightarrow \infty 
\end{equation}

----
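
The $\sigma^2/n$ decay can also be observed in simulation. The sketch below is a Monte Carlo illustration, not a proof: it estimates $E\left[\left(\bar{x} - \mu\right)^2\right]$ for increasing sample sizes and prints it next to the theoretical value. The parameter values, the number of replications, and the function name `mse_of_sample_mean` are assumptions made for the example.

```python
import random

def mse_of_sample_mean(n, mu, sigma, reps, rng):
    """Monte Carlo estimate of E[(x_bar - mu)^2] for samples of size n."""
    total = 0.0
    for _ in range(reps):
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (x_bar - mu) ** 2
    return total / reps

rng = random.Random(1)
mu, sigma = 0.0, 2.0
sizes = [10, 100, 1000]

# Estimated mean squared error of the sample mean for each sample size.
mse = {n: mse_of_sample_mean(n, mu, sigma, reps=500, rng=rng) for n in sizes}

for n in sizes:
    # Columns: sample size, simulated MSE, theoretical value sigma^2 / n.
    print(n, mse[n], sigma ** 2 / n)
```

The simulated values shrink roughly by a factor of ten each time $n$ grows tenfold, in line with $\sigma^2/n \rightarrow 0$.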

### MLE asymptotic efficiency

This property means that among all well-behaved estimators, the maximum likelihood estimator has the smallest variance for large samples [1].
Recall the Cramér-Rao Lower Bound (CRLB). It gives a lower bound on the variance of estimators of the parameter $\theta$, meaning that the
variance of an estimator cannot be less than what the CRLB prescribes.
The bound is given by


\begin{equation}
Var\left[\hat{\theta}\right] \geq \frac{\left(\frac{\partial}{\partial \theta} E\left[\hat{\theta}\right]\right)^2}{I(\theta)}
\end{equation}

where $I(\theta)$ is the Fisher information. For an unbiased estimator, $E\left[\hat{\theta}\right] = \theta$, so the numerator equals one and the bound simplifies to


\begin{equation}
Var\left[\hat{\theta}\right] \geq \frac{1}{I(\theta)}
\end{equation}


which implies that the variance of any unbiased estimator is at least as large as the inverse of the Fisher information.

----
#### Example 2: Efficiency of the mean estimate for normal sample

Example 1 above showed that for a normally distributed sample, the MLE of the mean is

\begin{equation}
\bar{x} = \frac{1}{n}\sum_i x_i
\end{equation}

Is this an efficient estimator? Given that $\bar{x}$ is an unbiased estimator, all we need to show is that it attains the bound, i.e.


\begin{equation}
Var\left[\bar{x}\right] = \frac{1}{I(\theta)}
\end{equation}

The Fisher information of the sample is

\begin{equation}
I(\theta)= - E\left[\frac{\partial^2}{\partial \theta^2} \log f(\mathbf{x}; \theta)\right]
\end{equation}

which, for our case with $\theta = \mu$ and $\frac{\partial^2}{\partial \mu^2} \log f(\mathbf{x}; \mu) = -\frac{n}{\sigma^2}$, is

\begin{equation}
I(\mu)= \frac{n}{\sigma^2}
\end{equation}

Its inverse, $\sigma^2/n$, is equal to the variance of $\bar{x}$. Hence the estimator $\bar{x}$ is efficient, i.e. we cannot find an unbiased estimator with variance
that is less than that.

----
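
As a rough numerical check of this example, the sketch below uses the standard identity that, under regularity conditions, the Fisher information equals the variance of the score (the derivative of the log-likelihood). This identity is not derived in the text above and is used here as an assumption. For the normal mean with known $\sigma$, the score is $\sum_i (x_i - \mu)/\sigma^2$. The parameter values, the seed, and the helper name `score_mu` are illustrative.

```python
import random
import statistics

def score_mu(data, mu, sigma):
    """Derivative of the normal log-likelihood with respect to mu."""
    return sum(x - mu for x in data) / sigma ** 2

rng = random.Random(7)
mu, sigma, n, reps = 1.0, 2.0, 50, 4000

scores, means = [], []
for _ in range(reps):
    data = [rng.gauss(mu, sigma) for _ in range(n)]
    scores.append(score_mu(data, mu, sigma))
    means.append(sum(data) / n)

fisher_mc = statistics.pvariance(scores)  # Monte Carlo estimate of I(mu)
fisher_exact = n / sigma ** 2             # closed form for the normal mean
var_mean = statistics.pvariance(means)    # simulated variance of the sample mean

print(fisher_mc, fisher_exact)
print(var_mean, 1.0 / fisher_exact)
```

Up to Monte Carlo error, the simulated variance of $\bar{x}$ matches the inverse of the Fisher information, so the bound is attained.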

### Minimum variance estimator

For a parameter $\theta$ we may be able to devise many estimators. The question therefore arises which estimator to choose.
One reasonable criterion is to use the estimator that has the smallest variance. Let's consider again a normally distributed sample.
The MLE for the population mean is $\bar{x}$. The sample median $M$, however, is another option. It satisfies, see [1],

\begin{equation}
\sqrt{n}\left(M-\mu \right) \asymp N(0, \sigma^2\frac{\pi}{2})
\end{equation}

where $\asymp$ denotes convergence in distribution. This implies that $M$ converges to the right value, which is to be expected given that the normal distribution is symmetric.
However, its asymptotic variance $\sigma^2\pi/2$ is larger than the asymptotic variance $\sigma^2$ of $\bar{x}$.


More generally, the relative efficiency of two estimators $T_1$ and $T_2$ with asymptotic variances $Var\left[T_1\right]$ and $Var\left[T_2\right]$ 
that satisfy

\begin{equation}
\sqrt{n}\left(T_1-\theta \right) \asymp N(0, Var\left[T_1\right])
\end{equation}

\begin{equation}
\sqrt{n}\left(T_2-\theta \right) \asymp N(0, Var\left[T_2\right])
\end{equation}

is defined as the ratio of their variances

\begin{equation}
RE = \frac{Var\left[T_1\right]}{Var\left[T_2\right]}
\end{equation}
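
For the mean and the median of a normal sample, the asymptotic variances $\sigma^2$ and $\sigma^2\pi/2$ give $RE = 2/\pi \approx 0.64$ when $T_1$ is the mean and $T_2$ is the median. The sketch below estimates this ratio by simulation. It is only an illustration under assumed settings (standard normal data, $n = 101$, 3000 replications, fixed seed), and the finite-sample ratio need not equal the asymptotic value exactly.

```python
import random
import statistics

rng = random.Random(3)
mu, sigma, n, reps = 0.0, 1.0, 101, 3000

means, medians = [], []
for _ in range(reps):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

# Simulated sampling variances of the two estimators.
var_mean = statistics.pvariance(means)
var_median = statistics.pvariance(medians)

# Relative efficiency with T_1 = mean and T_2 = median.
relative_efficiency = var_mean / var_median
print(var_mean, var_median, relative_efficiency)
```

A ratio below one indicates that the mean has the smaller variance, i.e. the median needs more observations to reach the same precision.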

## Summary

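In this section we discussed maximum likelihood estimators: the likelihood and log-likelihood functions, the MLE for a normal sample, and the consistency, efficiency, and relative efficiency of estimators.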

## References

1. Larry Wasserman, _All of Statistics. A Concise Course in Statistical Inference_, Springer 2003.