# Maximum Likelihood Estimator

Maximum likelihood estimation (MLE) is a method for estimating unknown parameters by choosing the values that maximize the probability of the observed sample.

That is, we ask which value of $\theta$ gives the observed sample the highest probability. (In the continuous case, think of $dx$ as very small, so that $f(\theta, x)\,dx$ is the instantaneous probability of observing exactly the outcomes $x_1, x_2, \ldots$)

$$
\hat{\theta} = \underset{\theta}{\text{argmax}} L(\theta, \mathbf{X})
$$

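where $L(\theta, \mathbf{X}) = \prod_{i=1}^n f(\theta, x_i)$ is the likelihood function.

Equivalently, we can maximize the log-likelihood $\ell(\theta, \mathbf{X})=\log{L(\theta, \mathbf{X})}=\sum_{i=1}^n \log{f(\theta,x_i)}$, because $\log$ is strictly increasing. In most cases the log-likelihood is much easier to work with.

In practice, this means finding the $\theta$ where $\text{D}_\theta(\prod_{i=1}^n f(\theta, x_i)) = 0$ and $\text{D}_{\theta}^2(\prod_{i=1}^n f(\theta, x_i))$ is negative definite. The same conditions can be checked on $\ell$ instead of $L$, since both have the same stationary points and the same maximizers.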
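As a minimal sketch of this recipe, assume an exponential model $f(\theta, x) = \theta e^{-\theta x}$ with rate $\theta$ (a model chosen here purely for illustration). Its log-likelihood is $n\log\theta - \theta\sum x_i$, and setting the derivative to zero gives the closed form $\hat{\theta} = n / \sum x_i$. The helper names `log_likelihood` and `maximize_golden` are hypothetical, and the golden-section search is just one simple way to maximize a unimodal function.

```python
import math
import random


def log_likelihood(rate, xs):
    """Log-likelihood of iid Exponential(rate) data: n*log(rate) - rate*sum(x)."""
    return len(xs) * math.log(rate) - rate * sum(xs)


def maximize_golden(f, lo, hi, tol=1e-9):
    """Golden-section search for the maximizer of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if f(c) > f(d):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


random.seed(0)
true_rate = 2.0  # assumed "true" parameter used to simulate the data
xs = [random.expovariate(true_rate) for _ in range(5000)]

rate_numeric = maximize_golden(lambda r: log_likelihood(r, xs), 1e-6, 50.0)
rate_closed = len(xs) / sum(xs)  # solution of d(log L)/d(rate) = 0
print("numeric argmax:", rate_numeric)
print("closed form   :", rate_closed)
```

The two estimates should agree closely, and both should be near the rate used to simulate the data.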

__The question is: how good is this MLE?__
__How close is our estimate to the true value $\theta_0$?__

## __Theorem__ 


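Under regularity conditions, the MLE is consistent: $\hat{\theta}_n \xrightarrow{P} \theta_0$, that is, $\hat{\theta}_n$ converges to $\theta_0$ in probability.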


_Proof:_

(Sketch) It can be proved that $\lim\limits_{n \rightarrow \infty} P_{\theta_0}[ \ell(\theta_0, \mathbf{X}) > \ell(\theta, \mathbf{X})] = 1$ for all $\theta \ne \theta_0$. That is, with probability tending to 1 as $n \rightarrow \infty$, $\theta_0$ yields a larger likelihood than any other fixed $\theta$.

What we need to prove is that, for every $\epsilon > 0$, $\lim\limits_{n \rightarrow \infty} P[\lvert \hat{\theta}_n - \theta_0 \rvert < \epsilon] = 1$. Define

$$
\begin{aligned}
S_1 & = \lbrace\mathbf{X}: \ell(\theta_0, \mathbf{X}) > \max(\ell(\theta_0 - \frac{\epsilon}{2}, \mathbf{X}), \ell(\theta_0 + \frac{\epsilon}{2}, \mathbf{X}))\rbrace \\
S_2 & = \lbrace\mathbf{X}: \lvert\hat{\theta}_n(\mathbf{X}) - \theta_0 \rvert < \epsilon \rbrace
\end{aligned}
$$

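If $\mathbf{X} \in S_1$ and $\ell$ is continuous in $\theta$, then $\ell$ has a local maximum inside the interval $(\theta_0 - \frac{\epsilon}{2}, \theta_0 + \frac{\epsilon}{2})$. Taking $\hat{\theta}_n(\mathbf{X})$ to be that maximizer, we get $\theta_0 - \frac{\epsilon}{2} \le \hat{\theta}_n(\mathbf{X}) \le \theta_0 + \frac{\epsilon}{2}$, thus $\mathbf{X} \in S_2$. Hence $S_1 \subseteq S_2$. Applying the result above to $\theta = \theta_0 \pm \frac{\epsilon}{2}$ gives $\lim\limits_{n \rightarrow \infty} P[S_1] = 1$, so $1 = \lim\limits_{n \rightarrow \infty} P[S_1] \le \lim\limits_{n \rightarrow \infty} P[S_2] \le 1$, which proves the claim.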

$\blacksquare$
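The statement can be checked empirically. The sketch below assumes an exponential model with rate $\theta_0 = 2$ (an illustrative choice, not part of the proof), whose MLE is $n / \sum x_i$, and estimates $P[\lvert \hat{\theta}_n - \theta_0 \rvert < \epsilon]$ by simulation for growing $n$. The function names are hypothetical.

```python
import random


def exponential_rate_mle(xs):
    """Closed-form MLE of the rate of an Exponential(rate) sample: n / sum(x)."""
    return len(xs) / sum(xs)


def coverage(n, theta0, eps, reps, rng):
    """Fraction of simulated samples of size n with |theta_hat - theta0| < eps."""
    hits = 0
    for _ in range(reps):
        xs = [rng.expovariate(theta0) for _ in range(n)]
        if abs(exponential_rate_mle(xs) - theta0) < eps:
            hits += 1
    return hits / reps


rng = random.Random(1)
theta0, eps = 2.0, 0.2          # assumed true rate and tolerance
sample_sizes = [10, 100, 1000]
fractions = [coverage(n, theta0, eps, 200, rng) for n in sample_sizes]

for n, frac in zip(sample_sizes, fractions):
    print(f"n={n:5d}  estimated P[|theta_hat - theta0| < {eps}] = {frac:.3f}")
```

The estimated probability should move toward 1 as $n$ increases, which is what convergence in probability says.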

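__What about the variance of the MLE?__

Start from the fact that the pdf integrates to 1, and differentiate both sides with respect to $\theta$ (assuming differentiation and integration can be interchanged):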


$$
\begin{aligned}
1 & = \int_{-\infty}^{+\infty} f(x, \theta) dx \\
0 & = \int_{-\infty}^{+\infty} \frac{\partial f(x, \theta)}{\partial\theta} dx \\
0 & = \int_{-\infty}^{+\infty} \frac{\partial f(x, \theta)/\partial\theta}{f(x, \theta)}f(x, \theta) dx \\
0 & = \int_{-\infty}^{+\infty} \frac{\partial{\log{f(x, \theta)}}}{ \partial{\theta}} f(x, \theta) dx
\end{aligned}
$$

$$
0 = E\bigg[\frac{\partial{\log{f(X, \theta)}}}{ \partial{\theta}}\bigg]  \tag{1}
$$

In the multiparameter case, with $\mathbf{\theta} = (\theta_1, \ldots, \theta_p)$:

$$
0 = E[\nabla\log f(X, \mathbf{\theta})] = \begin{bmatrix}
  E\bigg[ \frac{\partial\log f(X, \mathbf{\theta})}{\partial{\theta_1}} \bigg] \\
  E\bigg[ \frac{\partial\log f(X, \mathbf{\theta})}{\partial{\theta_2}} \bigg] \\
  \vdots \\
  E\bigg[ \frac{\partial\log f(X, \mathbf{\theta})}{\partial{\theta_p}} \bigg] \\
\end{bmatrix}
$$


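Differentiating (1) once more with respect to $\theta$, using the product rule under the integral and $\frac{\partial f}{\partial\theta} = \frac{\partial \log f}{\partial\theta} f$, gives

$$
0 = \int_{-\infty}^{+\infty} \frac{\partial^2{\log{f(x, \theta)}}}{ \partial{\theta^2}} f(x, \theta) dx + \int_{-\infty}^{+\infty} \bigg(\frac{\partial{\log{f(x, \theta)}}}{ \partial{\theta}}\bigg)^2 f(x, \theta) dx
$$

Because the score has mean zero by (1), its variance equals its second moment, so we get


$$
I(\theta) = Var\bigg(\frac{\partial{\log{f(X, \theta)}}}{ \partial{\theta}}\bigg) = \int_{-\infty}^{+\infty} \bigg(\frac{\partial{\log{f(x, \theta)}}}{ \partial{\theta}}\bigg)^2 f(x, \theta) dx = - \int_{-\infty}^{+\infty} \frac{\partial^2{\log{f(x, \theta)}}}{ \partial{\theta^2}} f(x, \theta) dx \tag{2}
$$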

In the multiparameter case:

$$
I(\theta) = Cov(\nabla\log f(X, \mathbf{\theta})) = \bigg[ I_{jk} \bigg]
$$

where $I_{jk}$ is:

$$
I_{jk}(\theta) = -E\Big[ \frac{\partial^2\log f(X, \theta)}{\partial\theta_j \partial\theta_k} \Big]
$$


__(2) is called the Fisher Information__
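As a worked example (an illustration that assumes an exponential model, which is not part of the derivation above), take $f(x, \theta) = \theta e^{-\theta x}$ for $x > 0$, so that $E[X] = 1/\theta$ and $Var(X) = 1/\theta^2$. Both forms of (2) give the same value:

$$
\begin{aligned}
\log f(x, \theta) &= \log\theta - \theta x \\
\frac{\partial \log f(x, \theta)}{\partial\theta} &= \frac{1}{\theta} - x, \qquad E\bigg[\frac{1}{\theta} - X\bigg] = \frac{1}{\theta} - \frac{1}{\theta} = 0 \\
Var\bigg(\frac{1}{\theta} - X\bigg) &= Var(X) = \frac{1}{\theta^2} \\
-E\bigg[\frac{\partial^2 \log f(X, \theta)}{\partial\theta^2}\bigg] &= -E\bigg[-\frac{1}{\theta^2}\bigg] = \frac{1}{\theta^2}
\end{aligned}
$$

So for this model $I(\theta) = 1/\theta^2$, and the score has mean zero as (1) requires.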

If $\mathbf{X}$ is an iid sample of size $n$, then its Fisher information is the sum of the Fisher information of the individual observations, because the variance of a sum of independent terms is the sum of their variances:

$$
I_n(\mathbf{\theta}) = Var\bigg(\frac{\partial\log L(\theta, \mathbf{X})}{\partial\theta}\bigg) = Var\bigg(\sum^n_{i=1}\frac{\partial\log{f(X_i,\theta)}}{\partial\theta}\bigg) = nI(\theta)
$$

In the multiparameter case:

Let $Z_j = \sum^n_{i=1} \frac{\partial\log{f(X_i, \mathbf{\theta})}}{\partial\theta_j}$ for $j = 1, \ldots, p$. We can derive that $E[Z_j] = 0$ and

$$
\begin{aligned}
  I_n(\theta) &= Cov\begin{bmatrix}
  Z_1 \\
  Z_2 \\
  \vdots \\
  Z_p
  \end{bmatrix} \\
  & = \begin{bmatrix}
    E[Z_1Z_1] & E[Z_1Z_2] & \cdots & E[Z_1Z_p] \\
    E[Z_2Z_1] & E[Z_2Z_2] & \cdots & E[Z_2Z_p] \\
    \vdots & \vdots & \ddots & \vdots \\
    E[Z_pZ_1] & E[Z_pZ_2] & \cdots & E[Z_pZ_p]
  \end{bmatrix}
\end{aligned}
$$

It can be calculated that the $(j, k)$ entry is:

$$
\begin{aligned}
[I_n(\theta)]_{jk} &= E[Z_jZ_k] \\
                 &= -E\bigg[ \sum^n_{i=1} \frac{\partial^2\log{f(X_i, \theta)}}{\partial\theta_j\partial\theta_k} \bigg] \\
                 &= -nE\bigg[\frac{\partial^2\log{f(X, \theta)}}{\partial\theta_j\partial\theta_k} \bigg] = nI_{jk}(\theta)
\end{aligned}
$$
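For instance, assume a normal model with $\theta = (\mu, \sigma^2)$ (chosen here only as an illustration), so that $\log f(x, \theta) = -\frac{1}{2}\log(2\pi) - \frac{1}{2}\log\sigma^2 - \frac{(x - \mu)^2}{2\sigma^2}$. Taking second derivatives and then expectations, with $E[X - \mu] = 0$ and $E[(X - \mu)^2] = \sigma^2$:

$$
\begin{aligned}
-E\bigg[\frac{\partial^2 \log f}{\partial\mu^2}\bigg] &= -E\bigg[-\frac{1}{\sigma^2}\bigg] = \frac{1}{\sigma^2} \\
-E\bigg[\frac{\partial^2 \log f}{\partial\mu\,\partial\sigma^2}\bigg] &= -E\bigg[-\frac{X - \mu}{\sigma^4}\bigg] = 0 \\
-E\bigg[\frac{\partial^2 \log f}{\partial(\sigma^2)^2}\bigg] &= -E\bigg[\frac{1}{2\sigma^4} - \frac{(X - \mu)^2}{\sigma^6}\bigg] = \frac{1}{2\sigma^4}
\end{aligned}
$$

so that

$$
I_n(\theta) = n \begin{bmatrix}
  \frac{1}{\sigma^2} & 0 \\
  0 & \frac{1}{2\sigma^4}
\end{bmatrix}
$$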


## __Rao-Cramer Lower Bound Theorem__

Let $X_1, X_2,...,X_n$ be iid with pdf $f(x,\theta)$, and let $Y = \mu(X_1, X_2,..., X_n)$ be a statistic with $E[Y]=\kappa(\theta)$. Then

$$
Var(Y) \ge \frac{[\kappa'(\theta)]^2}{nI(\theta)}
$$

If $Y$ is an unbiased estimator of $\theta$, that is $\kappa(\theta) = \theta$, then

$$
Var(Y) \ge \frac{1}{nI(\theta)}
$$
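A quick simulation can illustrate the bound. The sketch below assumes an exponential model with rate $\theta$, for which $I(\theta) = 1/\theta^2$, and takes $Y$ to be the sample mean, so $\kappa(\theta) = 1/\theta$ and $\kappa'(\theta) = -1/\theta^2$. The bound is then $1/(n\theta^2)$, which is also the exact variance of the sample mean, so in this particular case the bound is attained. The model and all variable names are illustrative assumptions.

```python
import random
import statistics

random.seed(2)
theta = 2.0    # assumed true rate of the Exponential(theta) model
n = 50         # sample size
reps = 10000   # number of simulated samples

# Y = sample mean of each simulated sample, so E[Y] = kappa(theta) = 1/theta.
sample_means = [
    sum(random.expovariate(theta) for _ in range(n)) / n
    for _ in range(reps)
]

fisher_info = 1.0 / theta**2     # I(theta) for the exponential rate
kappa_prime = -1.0 / theta**2    # derivative of kappa(theta) = 1/theta
lower_bound = kappa_prime**2 / (n * fisher_info)  # simplifies to 1/(n*theta^2)
empirical_var = statistics.variance(sample_means)

print("Rao-Cramer lower bound:", lower_bound)
print("empirical Var(Y)      :", empirical_var)
```

The empirical variance should be close to the bound, up to simulation noise.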

**For the multiparameter case:**

Let $Y$ be a statistic:

$$
Y = \begin{bmatrix}
  \mu_1(\{X_i\}) \\
  \mu_2(\{X_i\}) \\
  ... \\
  \mu_p(\{X_i\})
\end{bmatrix}
$$ 

with expectation $E[Y]$:

$$
E[Y]=\begin{bmatrix}
  \kappa_1(\theta) \\
  \kappa_2(\theta) \\
  ... \\
  \kappa_p(\theta) \\
\end{bmatrix}
$$

And the first derivative of $E[Y]$:

$$
\begin{aligned}
  D(E[Y]) = D(\kappa) &= \begin{bmatrix}
    \frac{\partial\kappa_1(\theta)}{\partial\theta_1} & \frac{\partial\kappa_1(\theta)}{\partial\theta_2} & \cdots & \frac{\partial\kappa_1(\theta)}{\partial\theta_p} \\
    \frac{\partial\kappa_2(\theta)}{\partial\theta_1} & \frac{\partial\kappa_2(\theta)}{\partial\theta_2} & \cdots & \frac{\partial\kappa_2(\theta)}{\partial\theta_p} \\
    \vdots & \vdots & \ddots & \vdots \\
    \frac{\partial\kappa_p(\theta)}{\partial\theta_1} & \frac{\partial\kappa_p(\theta)}{\partial\theta_2} & \cdots & \frac{\partial\kappa_p(\theta)}{\partial\theta_p} \\
  \end{bmatrix}
\end{aligned}
$$

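Then, where $A \ge B$ means that $A - B$ is positive semidefinite, and using $I_n(\theta) = nI(\theta)$:

$$
\begin{aligned}
  Cov_\theta(Y) &\ge D(\kappa) I_n(\theta)^{-1} D(\kappa)^T = \frac{1}{n} D(\kappa) I(\theta)^{-1} D(\kappa)^T
\end{aligned}
$$

If $Y$ is an unbiased estimator of $\{\theta_i\}$, then $D(\kappa)$ is the identity matrix and:

$$
\begin{aligned}
  Cov_\theta(Y) &\ge I_n(\theta)^{-1} = \frac{1}{n} I(\theta)^{-1}
\end{aligned}
$$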


## Theorem

Assume $X_1, X_2,...,X_n$ are iid with pdf $f(x,\theta_0)$ and that the MLE is consistent, $\hat{\theta}_n \xrightarrow{P} \theta_0$.

Suppose further that $\log f(x, \theta)$ is three times differentiable in $\theta$ for all $\theta \in \Omega$, that $0 < I(\theta_0) < \infty$, and that there exist a constant $\epsilon > 0$ and a function $M(x)$ such that $\bigg\lvert \frac{\partial^3 \log f(x, \theta)}{\partial\theta^3} \bigg\rvert \le M(x)$ for all $\theta \in (\theta_0 - \epsilon, \theta_0 + \epsilon)$, with $E_{\theta_0}[M(X)] < \infty$.

Then

$$
\sqrt{n}(\hat{\theta}_n - \theta_0) \xrightarrow{D} N\bigg(0, \frac{1}{I(\theta_0)}\bigg)
$$
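The limit can be checked by simulation. The sketch below again assumes an exponential model with rate $\theta_0 = 2$ (an illustrative choice), for which $I(\theta_0) = 1/\theta_0^2$ and the limiting variance is therefore $\theta_0^2$. It compares the empirical variance of $\sqrt{n}(\hat{\theta}_n - \theta_0)$ with $1/I(\theta_0)$, and checks that roughly 95% of the draws fall within 1.96 limiting standard deviations of zero. The names `rate_mle` and `z` are hypothetical.

```python
import math
import random
import statistics

random.seed(3)
theta0 = 2.0   # assumed true rate
n = 400        # sample size per replication
reps = 3000    # number of replications


def rate_mle(xs):
    """Closed-form MLE of the exponential rate: n / sum(x)."""
    return len(xs) / sum(xs)


# Scaled estimation errors sqrt(n) * (theta_hat - theta0), one per replication.
z = [
    math.sqrt(n) * (rate_mle([random.expovariate(theta0) for _ in range(n)]) - theta0)
    for _ in range(reps)
]

asymptotic_var = theta0**2              # 1 / I(theta0), with I(theta) = 1/theta^2
empirical_var = statistics.variance(z)
limit_sd = math.sqrt(asymptotic_var)
inside = sum(abs(v) <= 1.96 * limit_sd for v in z) / reps

print("1 / I(theta0)         :", asymptotic_var)
print("empirical variance    :", empirical_var)
print("share within 1.96 sd  :", inside)
```

For finite $n$ the match is only approximate: the MLE of the rate has a small upward bias, which vanishes as $n$ grows.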

