# Maximum likelihood estimation (MLE)

MLE is a method for estimating the parameters of a model. It chooses the parameter values that maximise the likelihood of the observed data, i.e. the values under which the process described by the model is most likely to have produced the data that were actually observed.

For example, the **probability density** of observing a single data point $x$ generated from a Gaussian distribution with mean $\mu$ and standard deviation $\sigma$ is given by:

$$
    P(x; \mu, \sigma) = \frac{ 1 }{ \sigma \sqrt{2 \pi} } \exp{ -( \frac{ (x - \mu)^{2} }{ 2 \sigma^{2} } ) }
$$

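Assuming the data points are independent and identically distributed, the total (joint) probability density of observing the three data points 9, 9.5 and 11 is the product of the individual densities: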

$$
    P(9, 9.5, 11; \mu, \sigma) = \frac{ 1 }{ \sigma \sqrt{2 \pi} } \exp{ -( \frac{ (9 - \mu)^{2} }{ 2 \sigma^{2} } ) } \times \frac{ 1 }{ \sigma \sqrt{2 \pi} } \exp{ -( \frac{ (9.5 - \mu)^{2} }{ 2 \sigma^{2} } ) } \times \frac{ 1 }{ \sigma \sqrt{2 \pi} } \exp{ -( \frac{ (11 - \mu)^{2} }{ 2 \sigma^{2} } ) }
$$
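The product above can be evaluated numerically. The following is a minimal sketch, not a definitive implementation: the helper `joint_density` and the two parameter guesses (`mu=10.0` and `mu=5.0`, both with `sigma=1.0`) are illustrative assumptions that do not come from the text.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density, evaluated elementwise on x."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def joint_density(data, mu, sigma):
    """Joint density of independent points: the product of the individual densities."""
    return np.prod(gaussian_pdf(data, mu, sigma))

data = np.array([9.0, 9.5, 11.0])

# Two hypothetical parameter guesses (illustrative values only).
near = joint_density(data, mu=10.0, sigma=1.0)  # mean close to the data
far = joint_density(data, mu=5.0, sigma=1.0)    # mean far from the data

print(near, far)
```

A mean close to the data gives a much larger joint density than a mean far from it, which is the quantity MLE maximises.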

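Taking the natural logarithm turns the product into a sum. Because the logarithm is monotonically increasing, the log-likelihood is maximised by the same parameter values as the likelihood: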

$$
    \ln{ (P(9, 9.5, 11; \mu, \sigma)) } = \ln{ (\frac{ 1 }{ \sigma \sqrt{2 \pi} }) } - \frac{ (9 - \mu)^{2} }{ 2 \sigma^{2} } + \ln{ (\frac{ 1 }{ \sigma \sqrt{2 \pi} }) } - \frac{ (9.5 - \mu)^{2} }{ 2 \sigma^{2} } + \ln{ (\frac{ 1 }{ \sigma \sqrt{2 \pi} }) } - \frac{ (11 - \mu)^{2} }{ 2 \sigma^{2} } 
$$

<br>

$$
    \ln{ (P(9, 9.5, 11; \mu, \sigma)) } = -3 \ln{ (\sigma) } - \frac{ 3 }{ 2 }\ln{ (2 \pi) } - \frac{ 1 }{ 2 \sigma^{2} }[ (9 - \mu)^{2} + (9.5 - \mu)^{2} + (11 - \mu)^{2} ]
$$

<br>

Differentiating with respect to $\mu$ and setting the result to zero gives:

$$
    \frac{ \partial{ \ln{ (P(9, 9.5, 11; \mu, \sigma)) } } }{ \partial{\mu} } = \frac{ 1 }{ \sigma^{2} }[9 + 9.5 + 11 - 3 \mu] = 0
$$

<br>

$$
    \mu = \frac{ 9 + 9.5 + 11 }{ 3 }
$$
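The closed-form result can be checked numerically. The sketch below assumes a fixed, hypothetical `sigma = 1.0` (the maximiser in $\mu$ does not depend on it) and a simple grid search over `mu_grid`; both are illustrative choices, not part of the derivation.

```python
import numpy as np

data = np.array([9.0, 9.5, 11.0])

def log_likelihood(data, mu, sigma):
    """Joint Gaussian log-likelihood of independent data points."""
    n = data.size
    return (-n * np.log(sigma)
            - 0.5 * n * np.log(2 * np.pi)
            - np.sum((data - mu) ** 2) / (2 * sigma ** 2))

sigma = 1.0                              # hypothetical fixed value
mu_grid = np.linspace(8.0, 12.0, 4001)   # candidate means, step 0.001

values = np.array([log_likelihood(data, m, sigma) for m in mu_grid])
mu_hat = mu_grid[np.argmax(values)]      # grid point with the highest log-likelihood

print(mu_hat, data.mean())
```

Up to the grid resolution, the maximiser found by the search coincides with the sample mean of the three data points.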

<br>

### The difference between likelihood and probability

$$
    L(\mu, \sigma; data) = P(data; \mu, \sigma)
$$

<br>

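The two expressions are numerically equal but are read differently. $L(\mu, \sigma; data)$ is *the likelihood of the parameters $\mu$ and $\sigma$ taking certain values given the observed data*, i.e. a function of the parameters with the data held fixed, whereas $P(data; \mu, \sigma)$ is *the probability density of observing the data given parameters $\mu$ and $\sigma$*, i.e. a function of the data with the parameters held fixed.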
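One way to see the distinction is to integrate the same Gaussian expression over different arguments. The sketch below is illustrative only: the single observation `x = 1.0`, the integration ranges and the helper `trapezoid_area` are assumptions made for the example.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Gaussian probability density; broadcasts over array arguments."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def trapezoid_area(y, t):
    """Area under the curve y(t) using the trapezoid rule."""
    return np.sum((y[1:] + y[:-1]) / 2 * np.diff(t))

# Probability view: parameters fixed, the data point x varies.
x_grid = np.linspace(-10.0, 10.0, 20001)
area_over_x = trapezoid_area(gaussian_pdf(x_grid, mu=0.0, sigma=1.0), x_grid)

# Likelihood view: one hypothetical observation x = 1.0 is fixed, sigma varies.
sigma_grid = np.linspace(0.01, 100.0, 200001)
area_over_sigma = trapezoid_area(gaussian_pdf(1.0, mu=0.0, sigma=sigma_grid), sigma_grid)

print(area_over_x, area_over_sigma)
```

As a function of the data the density integrates to 1, while as a function of $\sigma$ the same expression need not: the likelihood is not a probability distribution over the parameters.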

### Least Squares Estimate is the same as MLE under a Gaussian model

Intuition: the observed response is equal to the signal plus zero-mean white (Gaussian) noise.

$$
    Y = f^{*}(X) + \epsilon = X \beta^{*} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^{2}I)
$$

$$
    Y \sim \mathcal{N}(X \beta^{*}, \sigma^{2}I)
$$

$$
    \hat{\beta}_{MLE} = \underset{\beta}{\arg\max} \ln{ (P( (y_{i}, x_{i})_{i=1}^{n} | \beta, \sigma^{2} )) } = \underset{\beta}{\arg\min} \sum_{i=1}^{n} (y_{i} - x_{i}\beta)^{2}
$$
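The equality above follows from writing out the Gaussian log-likelihood. As a sketch, assuming the inputs $x_{i}$ are treated as fixed and the noise terms are independent across observations:

$$
    \ln{ (P( (y_{i}, x_{i})_{i=1}^{n} | \beta, \sigma^{2} )) } = -\frac{ n }{ 2 } \ln{ (2 \pi \sigma^{2}) } - \frac{ 1 }{ 2 \sigma^{2} } \sum_{i=1}^{n} (y_{i} - x_{i}\beta)^{2}
$$

The first term does not depend on $\beta$ and the factor $\frac{ 1 }{ 2 \sigma^{2} }$ is positive, so maximising the log-likelihood over $\beta$ is the same as minimising the sum of squared residuals.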

### Regularised Least Squares and the maximum a posteriori (MAP) estimate

$$
    \hat{\beta}_{MAP} = \underset{\beta}{\arg\max} \left[ \ln{ (P( (y_{i}, x_{i})_{i=1}^{n} | \beta, \sigma^{2} )) } + \ln{ (P(\beta)) } \right] = \underset{\beta}{\arg\min} \left[ \sum_{i=1}^{n} (y_{i} - x_{i}\beta)^{2} + \lambda ||\beta||_{2}^{2} \right]
$$

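where:

* $\beta \sim \mathcal{N}(0, \tau^{2} I)$, with $\tau^{2}$ denoting the prior variance (which may differ from the noise variance $\sigma^{2}$)

* $P(\beta) \propto e^{ -\frac{ \beta^{T}\beta }{ 2\tau^{2} } }$

* $P(\beta)$ is the *prior*, i.e. Gaussian with zero mean, and $\ln{ (P(\beta)) }$ is its logarithm

* $\lambda = \sigma^{2} / \tau^{2}$, so a tighter prior (smaller $\tau^{2}$) gives stronger regularisation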
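The link between the Gaussian prior and the penalty term can be checked numerically. The sketch below makes several assumptions for illustration: synthetic data, a hypothetical noise standard deviation `sigma`, and a hypothetical prior standard deviation `tau` that is allowed to differ from `sigma`, which gives a penalty weight `lam = sigma**2 / tau**2`.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 3
sigma, tau = 0.5, 2.0                      # hypothetical noise and prior std devs
beta_true = rng.normal(0.0, tau, size=d)   # coefficients drawn from the prior
X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(0.0, sigma, size=n)

lam = sigma ** 2 / tau ** 2                # penalty weight implied by the prior

# Closed-form minimiser of ||y - X b||^2 + lam * ||b||^2 (ridge regression).
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def neg_log_posterior(beta):
    """Negative log-posterior, up to an additive constant that is free of beta."""
    return np.sum((y - X @ beta) ** 2) / (2 * sigma ** 2) + beta @ beta / (2 * tau ** 2)

# Gradient of the negative log-posterior; it should vanish at the MAP estimate.
grad = -(X.T @ (y - X @ beta_map)) / sigma ** 2 + beta_map / tau ** 2

print(beta_map, grad)
```

The gradient is zero (up to floating-point error) at the ridge solution, so under these assumptions the regularised least squares estimate and the MAP estimate coincide.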

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [4]:
def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of x under a Gaussian with mean mu and standard deviation sigma."""
    return (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-(np.power(x - mu, 2) / (2 * np.power(sigma, 2))))

In [5]:
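def log_likelihood(x, mu=0.0, sigma=1.0):
    """Gaussian log-density of each point in x; sum the result over points for the joint log-likelihood."""
    return np.log(1 / (sigma * np.sqrt(2 * np.pi))) - (np.power(x - mu, 2) / (2 * np.power(sigma, 2)))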