In [2]:
%run ../../common/import_all.py

from common.setup_notebook import set_css_style, setup_matplotlib, config_ipython
config_ipython()
setup_matplotlib()
set_css_style()

# The Maximum Likelihood Estimation method

## The likelihood

Imagine you have a statistical model, that is, a mathematical description of your data which depends on some parameters $\theta$. The *likelihood function*, usually indicated as $\mathcal{L}$, is a function of these parameters and represents the probability of observing evidence (observed data) $E$ given said parameters:

$$
\mathcal{L} = P(E \ | \  \theta)
$$

Because it is a function of the parameters given the outcome, you write

$$
\mathcal{L}(\theta \  | \ E) = P(E \ | \  \theta)
$$

The difference between *probability* and *likelihood* is subtle: in everyday language the two words are used interchangeably, but they represent different things. A probability describes how likely the possible outcomes are when the parameters $\theta$ of the underlying model are held fixed. In reality, though, $\theta$ is unknown, and we go through the reverse process: estimating the parameters given the evidence we observe. For this we use the likelihood, which takes the same value as $P(E \ | \ \theta)$ but is read as a function of $\theta$ with the evidence $E$ held fixed. Choosing the $\theta$ that maximises it is exactly what ML estimation does, as described below.

Bear in mind that the likelihood is a function of $\theta$. 
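To make the distinction concrete, here is a minimal sketch using a binomial model (an assumption chosen for illustration; the numbers `n`, `p_fixed` and `k_observed` are hypothetical). With the parameter fixed, the probabilities of all outcomes sum to 1. With the outcome fixed, the same expression viewed as a function of the parameter does not integrate to 1, which is why the likelihood should not be read as a probability distribution over $\theta$.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(k successes in n trials | success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 10  # hypothetical number of trials

# Probability: fix the parameter, vary the outcome. The values sum to 1.
p_fixed = 0.3
total_over_outcomes = sum(binomial_pmf(k, n, p_fixed) for k in range(n + 1))

# Likelihood: fix the outcome, vary the parameter.
# Approximate the integral over p in [0, 1] with the midpoint rule.
k_observed = 7
steps = 10_000
area_over_parameter = sum(
    binomial_pmf(k_observed, n, (i + 0.5) / steps) for i in range(steps)
) / steps

print(total_over_outcomes)  # close to 1
print(area_over_parameter)  # close to 1 / (n + 1), not 1
```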

## The MLE method

The Maximum Likelihood Estimation (MLE) is a procedure to find the parameters of a statistical model via the maximisation of the likelihood so as to maximise the agreement between the model and the observed data.

The likelihood is usually maximised via its logarithm, which is much more convenient because it turns products into sums; since the logarithm is strictly increasing, the likelihood and the log-likelihood attain their maximum at the same parameter values.
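A further practical reason to prefer the logarithm, not discussed above but easy to check, is numerical: a product of many probabilities can underflow to zero in floating point, while the sum of their logarithms stays finite. The sketch below uses a hypothetical list of 2000 per-observation probabilities, all equal to 0.5.

```python
import math

# Hypothetical per-observation probabilities (illustration only).
probs = [0.5] * 2000

# The product is 2**-2000, far below the smallest positive double,
# so it underflows to exactly 0.0.
likelihood = math.prod(probs)

# The sum of logarithms is an ordinary finite number.
log_likelihood = sum(math.log(q) for q in probs)

print(likelihood)
print(log_likelihood)
```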

### Example: a Bernoulli distribution

The likelihood function for a [Bernoulli distribution](../distributions-measures/famous-distributions.ipynb#Bernoulli) ($x_i \in \{0, 1\}$) is, for parameter $p$:
 
\begin{align}
\mathcal{L}(x_1, x_2, \ldots, x_n \ | \ p) &= P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \ | \ p) \\
&= p^{x_1}(1-p)^{1-x_1} \cdot \ldots \cdot p^{x_n}(1-p)^{1-x_n} \\
&= p^{\sum_i x_i}(1-p)^{\sum_i(1-x_i)} \\
&=  p^{\sum_i x_i}(1-p)^{n -\sum_i x_i}
\end{align}

so that if we take the logarithm, we get

$$
\log \mathcal{L} = \sum_i x_i \log p + \Big(n - \sum_i x_i\Big) \log (1-p) \ .
$$

To maximise it, we compute the first derivative and set it to zero

$$
\frac{d \log \mathcal{L}}{d p} = \frac{\sum_i x_i}{p} - \frac{n - \sum_i x_i}{1-p} = 0
$$

which leads to

$$
\sum_i x_i - p \sum_i x_i = np - p \sum_i x_i
$$

and finally to

$$
p = \frac{\sum_i x_i}{n}
$$
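The closed-form result can be checked numerically. The following is a minimal sketch on a hypothetical sample of ten Bernoulli outcomes: it evaluates the log-likelihood derived above on a grid of candidate values of $p$ and compares the best grid point with $\sum_i x_i / n$. The grid search is only a sanity check, not a recommended way to fit the model.

```python
import math

# Hypothetical Bernoulli sample (illustration only).
sample = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
n = len(sample)
k = sum(sample)  # number of ones, i.e. sum_i x_i


def log_likelihood(p):
    """log L = sum_i x_i * log p + (n - sum_i x_i) * log(1 - p)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)


# Closed-form maximum likelihood estimate: the sample mean.
p_closed_form = k / n

# Brute-force check on a grid that excludes the endpoints 0 and 1,
# where the logarithm is undefined.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=log_likelihood)

print(p_closed_form, p_grid)
```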

### Example: estimating the best mean of some data

This example is taken from [[2]](#mle). Let us assume we know that the weights of women are normally distributed with mean $\mu$ and standard deviation $\sigma$. A random sample of $10$ women gives (in pounds):

$$
115, 122, 130, 124, 149, 160, 152, 138, 149, 180
$$

We want to estimate $\mu$. We know

$$
P(x_i ; \mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma^2}} e^{- \frac{(x_i - \mu)^2}{2 \sigma^2}}
$$

The likelihood is (note that the $X_i$ are independent)

\begin{align}
\mathcal{L}(x_i | \mu, \sigma) &= P(X_1=x_1, \ldots, X_n=x_n)  \\
&= \prod_i P(x_i; \mu, \sigma) \\
&= \sigma^{-n} (2 \pi)^{-n/2} e^{- \frac{1}{2 \sigma^2} \sum_i (x_i - \mu)^2}
\end{align}

Now, again it is easier to work with the logarithm:

$$
\log \mathcal{L} = -n \log \sigma - \frac{n}{2} \log 2 \pi - \frac{1}{2 \sigma^2} \sum_i (x_i - \mu)^2
$$

so that 

$$
\frac{d \log \mathcal{L}}{d \mu} = \frac{1}{\sigma^2} \sum_i (x_i - \mu) = 0
$$

$$
\sum_i x_i - n \mu = 0
$$

$$
\mu = \frac{\sum_i x_i}{n}
$$

and so the maximum likelihood estimate of $\mu$ is the sample mean, which for the sample above is $141.9$ pounds. We could do the same to estimate $\sigma$, obtaining (it can be shown through the second derivative that this is a maximum)

$$
\sigma^2 = \frac{\sum_i (x_i - \mu)^2}{n}
$$
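As a quick check, the two estimators can be evaluated on the ten weights listed above. This is a minimal sketch in plain Python; the perturbation sizes used at the end are arbitrary choices, meant only to confirm that nearby parameter values do not give a higher log-likelihood.

```python
import math

# Weights (in pounds) from the sample above.
weights = [115, 122, 130, 124, 149, 160, 152, 138, 149, 180]
n = len(weights)

# Maximum likelihood estimates: sample mean and variance with divisor n.
mu_hat = sum(weights) / n
sigma2_hat = sum((x - mu_hat) ** 2 for x in weights) / n


def log_likelihood(mu, sigma2):
    """Normal log-likelihood, written in terms of sigma^2."""
    return (
        -0.5 * n * math.log(sigma2)
        - 0.5 * n * math.log(2 * math.pi)
        - sum((x - mu) ** 2 for x in weights) / (2 * sigma2)
    )


best = log_likelihood(mu_hat, sigma2_hat)

# Nearby parameter values (arbitrary perturbations) should not do better.
neighbours = [
    log_likelihood(mu_hat + 1.0, sigma2_hat),
    log_likelihood(mu_hat - 1.0, sigma2_hat),
    log_likelihood(mu_hat, sigma2_hat * 1.1),
    log_likelihood(mu_hat, sigma2_hat * 0.9),
]

print(mu_hat, sigma2_hat)
print(all(best >= value for value in neighbours))
```

Note that the estimator for $\sigma^2$ divides by $n$, not by $n - 1$, so it differs from the usual unbiased sample variance.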

## References

1. <a name="cv"></a> [Cross Validated on the difference between Probability and Likelihood](https://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability)
2. <a name="mle"></a> Some examples in [this course](https://onlinecourses.science.psu.edu/stat414/node/191) from Penn State