# Maximum Likelihood Estimation for Normal Random Variables

This tutorial demonstrates how to find the maximum likelihood estimates of the mean $\mu$ and the variance $\sigma^2$ (equivalently, the standard deviation $\sigma$) of a Normal random variable $X$. The probability density function of $X$ is:

$$f_X(x;\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left({-\frac{(x-\mu)^2}{2\sigma^2}}\right)$$
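As a quick sanity check of the density above, the following sketch evaluates it with PyTorch and confirms numerically that it integrates to approximately one. The helper name `normal_pdf`, the parameter values, and the integration grid are illustrative assumptions and are not part of the original notebook.

```python
import math
import torch

def normal_pdf(x, mu, sigma):
    """Density of a Normal(mu, sigma^2) variable evaluated at x (hypothetical helper)."""
    var = sigma ** 2
    return torch.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Illustrative parameter values (assumed for this example only).
mu_example, sigma_example = 2.0, 3.0

# Integrate the density over mu +/- 10 standard deviations with the trapezoidal rule.
grid = torch.linspace(mu_example - 10 * sigma_example,
                      mu_example + 10 * sigma_example,
                      10001, dtype=torch.float64)
density = normal_pdf(grid, mu_example, sigma_example)
area = torch.trapz(density, grid)

# The density peaks at x = mu with height 1 / sqrt(2 * pi * sigma^2).
peak = normal_pdf(torch.tensor(mu_example, dtype=torch.float64),
                  mu_example, sigma_example)

print(area.item(), peak.item())
```

The printed area should be very close to 1, and the peak height should match $1/\sqrt{2\pi\sigma^2}$ for the chosen $\sigma$.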

Suppose that you observe 500 independent and identically distributed (i.i.d.) samples of $X$:

In [56]:
import torch

mu = 2
sigma = 3
num_samples = 500

X = torch.randn((num_samples,),requires_grad = False)
X = sigma*X + mu

print('Number of samples: {}\nVariance: {}\nMean: {}'.format(X.shape[0],*torch.var_mean(X)))

Number of samples: 500
Variance: 9.628891944885254
Mean: 2.014725923538208


The independence assumption is valid if observing one of the samples gives you no information about the other samples, while the identically distributed assumption makes sense if all the observations originate from the same underlying random experiment. The dataset $\mathcal{D}$ is then:

$$\mathcal{D} = \{X = x_1,X = x_2,...,X = x_{500}\}$$

The likelihood function is therefore:

$$ p(\mathcal{D};\mu,\sigma^2) = \prod_{i=1}^{500}f_X(x_i;\mu_i,\sigma_{i}^2) = \prod_{i=1}^{500} \frac{1}{\sqrt{2\pi\sigma_{i}^2}} \exp\left({-\frac{(x_i-\mu_i)^2}{2\sigma_{i}^2}}\right) $$

Since the samples are identically distributed:

$$
\mu_1 = \mu_2 = ... = \mu_{500} = \mu \\
\sigma_{1}^2 = \sigma_{2}^2 = ... = \sigma_{500}^2 = \sigma^2
$$
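Substituting these equalities gives a likelihood that depends on only two parameters. The intermediate step below is a worked simplification added for clarity; it uses $\prod_{i=1}^{500} (2\pi\sigma^2)^{-1/2} = (2\pi\sigma^2)^{-250}$:

$$ p(\mathcal{D};\mu,\sigma^2) = \prod_{i=1}^{500} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left({-\frac{(x_i-\mu)^2}{2\sigma^2}}\right) = \left(2\pi\sigma^2\right)^{-250} \exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{500}(x_i-\mu)^2\right) $$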

And the log-likelihood function is:

$$ \ln(p(\mathcal{D};\mu,\sigma^2)) = \sum_{i=1}^{500} \left[\ln\left(\frac{1}{\sqrt{2\pi\sigma_{i}^2}}\right) + \ln\left(\exp\left({-\frac{(x_i-\mu_i)^2}{2\sigma_{i}^2}}\right)\right)\right]$$

$$= -\left(\sum_{i=1}^{500} \ln\left(\sqrt{2\pi\sigma_{i}^2}\right) + \sum_{i=1}^{500} \frac{(x_i-\mu_i)^2}{2\sigma_{i}^2}\right)$$

In [50]:
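# Negative log-likelihood (NLL) of X

from math import pi

def normal_NLLL(X,mu,sigma):
    # mu and sigma hold one value per sample (same shape as X), matching mu_i and sigma_i above
    first_term = torch.sum(torch.log(torch.sqrt(2*pi*torch.pow(sigma,2))))
    second_term = torch.sum(torch.div(torch.pow(X-mu,2),2*torch.pow(sigma,2)))
    return (first_term + second_term)

# evaluate the NLL at randomly initialized per-sample parameters
# (gradients are not tracked here because requires_grad is left at its default of False)

mu = torch.randn((num_samples,))
sigma = torch.randn((num_samples,))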

NLLL = normal_NLLL(X,mu,sigma)

print(NLLL)

tensor(282913.4688)
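One way to gain confidence in a hand-written negative log-likelihood is to compare it against `torch.distributions.Normal`. The sketch below does this for a single shared $\mu$ and $\sigma$; the helper name `normal_nll`, the seed, and the parameter guesses are assumptions made for illustration. The data is synthetic and generated the same way as the samples at the start of this tutorial. The log-normalizer is broadcast over all samples before summing, so a scalar $\sigma$ is counted once per sample.

```python
import math
import torch
from torch.distributions import Normal

def normal_nll(x, mu, sigma):
    """NLL of i.i.d. Normal samples; scalar mu and sigma broadcast over x (hypothetical helper)."""
    per_sample = (torch.log(torch.sqrt(2 * math.pi * sigma ** 2))
                  + (x - mu) ** 2 / (2 * sigma ** 2))
    return per_sample.sum()

torch.manual_seed(0)  # assumed seed, only to make the example repeatable
x = 3.0 * torch.randn(500, dtype=torch.float64) + 2.0  # synthetic samples

# Shared (scalar) parameter guesses; the scale passed to Normal must be positive.
mu_guess = torch.tensor(0.5, dtype=torch.float64)
sigma_guess = torch.tensor(1.5, dtype=torch.float64)

nll_manual = normal_nll(x, mu_guess, sigma_guess)
nll_reference = -Normal(mu_guess, sigma_guess).log_prob(x).sum()

print(nll_manual.item(), nll_reference.item())
print(torch.allclose(nll_manual, nll_reference))
```

If the two values agree, the formula has been transcribed correctly for the shared-parameter case.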


Therefore, the maximum likelihood estimates of $\mu$ and $\sigma^2$ are:

\begin{equation}
\DeclareMathOperator*{\argmin}{\arg\!\min}
\hat{\mu} = \argmin_{\mu}{\left(-\ln\left(p\left(\mathcal{D};\mu,\sigma^2\right)\right)\right)} =  \argmin_{\mu}\sum_{i=1}^{500} \frac{(x_i-\mu)^2}{2\sigma^2}\\
\hat{\sigma}^2 = \argmin_{\sigma^2}{\left(-\ln\left(p\left(\mathcal{D};\mu,\sigma^2\right)\right)\right)} = \argmin_{\sigma^2} \left(\sum_{i=1}^{500}\ln\left(\sqrt{2\pi\sigma^2}\right) + \sum_{i=1}^{500} \frac{(x_i-\mu)^2}{2\sigma^2}\right)
\end{equation}

$\hat{\mu}$ and $\hat{\sigma}^2$ can be computed analytically. First for $\hat{\mu}$:

$$- \frac{\partial\ln\left(p\left(\mathcal{D};\mu,\sigma^2\right)\right)}{\partial\mu} = \frac{1}{2\sigma^2} \sum_{i=1}^{500} -2(x_i-\mu) = -\frac{1}{\sigma^2} \sum_{i=1}^{500} (x_i-\mu) = 0 $$

Solving for $\mu$:

$$\hat{\mu} = \frac{\sum_{i=1}^{500} x_i}{500}$$

This is just the sample mean. Similarly, for the variance $\sigma^2$:

$$
-\frac{\partial\ln\left(p\left(\mathcal{D};\mu,\sigma^2\right)\right)}{\partial\sigma^2} = \frac{500}{2\sigma^2} - \frac{1}{2\sigma^4} \sum_{i=1}^{500} (x_i-\mu)^2 = 0
$$

Solving for $\sigma^2$:

$$
\hat{\sigma}^2 = \frac{1}{500} \sum_{i=1}^{500} (x_i-\hat{\mu})^2
$$

This is the uncorrected sample variance. However, it is a biased estimator of $\sigma^2$. An unbiased estimate is obtained by dividing by $500-1$ instead of $500$:

$$
\hat{\sigma}^2 = \frac{1}{500-1} \sum_{i=1}^{500} (x_i-\hat{\mu})^2
$$

This modification is called [Bessel's correction](https://en.wikipedia.org/wiki/Bessel%27s_correction). Note that the corrected value is no longer the maximum likelihood estimate, although the two differ only by a factor of $500/499$. The sample mean and the Bessel-corrected sample variance can then be computed as follows:

In [57]:
# sample mean

mu_hat = torch.sum(X)/X.shape[0]

print(mu_hat)

tensor(2.0147)


In [60]:
# sample variance with Bessel's correction (divides by n-1)

sigma_squared_hat = torch.sum(torch.pow(X-mu_hat,2))/(X.shape[0]-1)

print(sigma_squared_hat)

tensor(9.6289)
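The closed-form results can be checked numerically. The sketch below uses autograd to confirm that the gradient of the negative log-likelihood vanishes at the sample mean and the $1/n$ variance, and it compares the $1/n$ and $1/(n-1)$ variance estimates. The seed, the synthetic data, and the variable names are assumptions for illustration.

```python
import math
import torch

torch.manual_seed(0)  # assumed seed, only to make the example repeatable
x = 3.0 * torch.randn(500, dtype=torch.float64) + 2.0  # synthetic samples
n = x.shape[0]

# Closed-form maximum likelihood estimates (note the 1/n in the variance).
mu_mle = x.sum() / n
var_mle = ((x - mu_mle) ** 2).sum() / n

# Bessel-corrected variance, which divides by n - 1 instead.
var_unbiased = ((x - mu_mle) ** 2).sum() / (n - 1)

# Evaluate the NLL gradient at the closed-form estimates using autograd.
mu = mu_mle.detach().clone().requires_grad_(True)
var = var_mle.detach().clone().requires_grad_(True)
nll = 0.5 * n * torch.log(2 * math.pi * var) + ((x - mu) ** 2).sum() / (2 * var)
nll.backward()

print(mu.grad.item(), var.grad.item())    # both should be numerically close to zero
print((var_unbiased / var_mle).item())    # ratio equals n / (n - 1)
```

Both gradients should be zero up to floating-point error, which is the stationarity condition used in the derivation above.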


Alternatively, the maximum likelihood estimates of $\mu$ and $\sigma^2$ can be computed using gradient descent. Gradient descent is an iterative algorithm based on the following general methodology:
1. Guess initial values for $\mu$ and $\sigma$.
2. Use the rate of change (gradient) of the negative log-likelihood function at the current values of $\mu$ and $\sigma$ to find new values that reduce the negative log-likelihood. More precisely, if $\mu_0$ and $\sigma_0$ are the initial values, then use:

$$
-\left.\frac{\partial\ln\left(p\left(\mathcal{D};\mu,\sigma^2\right)\right)}{\partial\mu} \right\rvert_{\mu = \mu_0,\,\sigma = \sigma_0} \\
-\left.\frac{\partial\ln\left(p\left(\mathcal{D};\mu,\sigma^2\right)\right)}{\partial\sigma} \right\rvert_{\mu = \mu_0,\,\sigma = \sigma_0}
$$

3. Compute new estimates of $\mu$ and $\sigma$ using the following update rules:

$$
\mu_{n+1} = \mu_n + \eta f_\mu(\mathcal{D},\mu_n,\sigma_n) \\
\sigma_{n+1} = \sigma_n + \eta f_\sigma(\mathcal{D},\mu_n,\sigma_n)
$$

Where:
* $\mu_{n+1}$ and $\sigma_{n+1}$ are the new estimates of $\mu$ and $\sigma$.
* $\mu_n$ and $\sigma_n$ are the current estimates of $\mu$ and $\sigma$.
* $f_\mu(\cdot)$ and $f_\sigma(\cdot)$ are functions of the dataset $\mathcal{D}$ and the current estimates of $\mu$ and $\sigma$.
* $\eta$ is called the _learning rate_, which scales the size of each update step.

4. Repeat steps 2 and 3 until convergence.

The following explanation for how to choose the update functions is adapted from [here](https://eli.thegreenplace.net/2016/understanding-gradient-descent/). Consider the simple function $g(x) = x^2$. The value of $g(x)$ decreases when $x$ is negative and increasing, while its value increases when $x$ is positive and increasing. Suppose that you do not know where the minimum of $g(x)$ is. Let us guess first that the minimum is at $x = -2$. The derivative of $g(x)$ is $g'(x) = 2x$.
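The procedure above can be sketched in PyTorch as follows. This is a minimal illustration under several assumptions that go beyond the text: the update function is taken to be the negative gradient of the negative log-likelihood (the usual gradient descent choice), the loss is averaged over samples so that a learning rate of 0.1 is stable, and a fixed number of steps stands in for a convergence test. The seed, starting values, and names such as `mean_nll` are hypothetical.

```python
import math
import torch

torch.manual_seed(0)  # assumed seed, only to make the example repeatable
x = 3.0 * torch.randn(500, dtype=torch.float64) + 2.0  # synthetic samples

def mean_nll(x, mu, sigma):
    """Average negative log-likelihood per sample for shared scalar mu and sigma."""
    return torch.mean(torch.log(torch.sqrt(2 * math.pi * sigma ** 2))
                      + (x - mu) ** 2 / (2 * sigma ** 2))

# Step 1: guess initial values (assumed starting point).
mu = torch.tensor(0.0, dtype=torch.float64, requires_grad=True)
sigma = torch.tensor(1.0, dtype=torch.float64, requires_grad=True)

eta = 0.1         # learning rate (assumed value)
num_steps = 3000  # fixed iteration budget instead of a convergence check

for step in range(num_steps):
    loss = mean_nll(x, mu, sigma)
    # Step 2: gradients of the averaged NLL at the current estimates.
    grad_mu, grad_sigma = torch.autograd.grad(loss, (mu, sigma))
    # Step 3: move each parameter against its own gradient.
    with torch.no_grad():
        mu -= eta * grad_mu
        sigma -= eta * grad_sigma

# Closed-form maximum likelihood estimates for comparison (1/n variance).
mu_closed = x.mean()
sigma_closed = torch.sqrt(((x - mu_closed) ** 2).mean())

print(mu.item(), mu_closed.item())
print(sigma.item(), sigma_closed.item())
```

With these settings the iterates should approach the closed-form maximum likelihood estimates. Averaging the loss only rescales the gradient by $1/500$, so it changes the effective learning rate but not the location of the minimum.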