# Maximum Likelihood Theory

## Maximum Likelihood Estimation

### Likelihood for One Observation

Suppose we have a single data point, $x$, that is distributed according to a probability density function $f_\theta$. The likelihood is $f_\theta(x)$ viewed as a function of the parameter $\theta$ for fixed $x$, rather than as a function of $x$ for fixed $\theta$. The likelihood $L_x$ and the log-likelihood $l_x$ are:

\begin{equation}
    L_x(\theta) = f_{\theta}(x)
\end{equation}

\begin{equation}
    l_x(\theta) = \log f_{\theta}(x)
\end{equation}

### Likelihood for iid Observations

Suppose we have a sequence of $iid$ random variables, $X_1, X_2, \dots, X_n$, that have a common probability density function, $f_{\theta}$. Their joint density and the corresponding log-likelihood are:

\begin{equation}
    f_{n,\theta} (\mathbf{x}) = \prod_{i=1}^n f_{\theta}(x_i)
\end{equation}

\begin{equation}
    l_n(\theta) = \sum_{i=1}^{n} \log f_{\theta}(x_i)
\end{equation}

The value of the parameter $\theta$ that maximises the log-likelihood function is called the *maximum likelihood estimate*, $\hat{\theta}_n$, where the subscript $n$ indicates that the estimate is based on $n$ $iid$ observations.
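To make the definition concrete, here is a minimal sketch that maximises $l_n(\theta)$ numerically. The exponential density $f_\theta(x) = \theta e^{-\theta x}$, the simulated data, and every name in the code are assumptions chosen for illustration; they are not part of the original notes.

```python
import numpy as np


def log_likelihood(theta, x):
    """l_n(theta) = sum_i log f_theta(x_i) for f_theta(x) = theta * exp(-theta * x)."""
    return np.sum(np.log(theta) - theta * x)


rng = np.random.default_rng(0)
true_theta = 2.0  # assumed "true" rate, used only to simulate data
x = rng.exponential(scale=1.0 / true_theta, size=500)

# Maximise l_n over a grid of candidate parameter values.
grid = np.linspace(0.5, 4.0, 3501)
log_lik = np.array([log_likelihood(theta, x) for theta in grid])
theta_hat_grid = grid[np.argmax(log_lik)]

# For this model, setting l'_n(theta) = n / theta - sum(x) to zero gives 1 / mean(x).
theta_hat_closed_form = 1.0 / x.mean()

print(f"grid MLE: {theta_hat_grid:.3f}, closed form: {theta_hat_closed_form:.3f}")
```

The grid search and the closed-form solution should agree up to the grid spacing, which is a quick sanity check that the log-likelihood has been coded correctly.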

### Log Likelihood Derivatives

Consider the first two derivatives of the log-likelihood with respect to $\theta$, $l'_x$ and $l''_x$. Differentiating the identity

\begin{equation}
    \int f_{\theta} (x) \,dx = 1
\end{equation}

twice with respect to $\theta$ under the integral sign gives the following results:

\begin{equation}
    E_{\theta} \{ l'_x (\theta) \} = 0
\end{equation}

\begin{equation}
    var_{\theta} \{ l'_x (\theta) \} = -E_{\theta} \{ l''_x (\theta) \}
\end{equation}
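
For readers who want the intermediate steps, the following is a sketch of the derivation. It assumes that $f_\theta$ is smooth enough for differentiation and integration to be exchanged, writes $f'_\theta$ for the derivative with respect to $\theta$, and uses $f'_\theta(x) = l'_x(\theta) f_\theta(x)$:

\begin{align*}
    0 &= \frac{d}{d\theta} \int f_{\theta}(x) \,dx = \int l'_x(\theta) f_{\theta}(x) \,dx = E_{\theta} \{ l'_x(\theta) \}, \\
    0 &= \frac{d}{d\theta} \int l'_x(\theta) f_{\theta}(x) \,dx = \int \left[ l''_x(\theta) + l'_x(\theta)^2 \right] f_{\theta}(x) \,dx = E_{\theta} \{ l''_x(\theta) \} + E_{\theta} \{ l'_x(\theta)^2 \}.
\end{align*}

Because the score has mean zero, $var_{\theta} \{ l'_x(\theta) \} = E_{\theta} \{ l'_x(\theta)^2 \}$, and the second line gives the variance identity.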

### Fisher Information

The Fisher information, $I(\theta)$, is defined as either side of Equation (7):

\begin{equation}
    I(\theta) = var_{\theta} \{ l'_x (\theta) \}
\end{equation}

\begin{equation}
    I(\theta) = -E_{\theta} \{ l''_x (\theta) \}.
\end{equation}

This is a way of measuring the amount of *information* that an observable random variable $X$ carries about an unknown parameter $\theta$ of a distribution that models $X$. Formally, it is the *variance of the score*, or the *expected value of the observed information*.
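
The two definitions can be checked by simulation. The sketch below assumes a Poisson model with mean $\theta$ (a probability mass function, so integrals become sums), for which $l'_x(\theta) = x/\theta - 1$, $l''_x(\theta) = -x/\theta^2$ and $I(\theta) = 1/\theta$. The model and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 3.0  # assumed true Poisson mean
x = rng.poisson(lam=theta, size=200_000)

score = x / theta - 1.0               # l'_x(theta) for each draw
second_derivative = -x / theta**2     # l''_x(theta) for each draw

mean_score = score.mean()                        # should be near 0
var_score = score.var()                          # should be near I(theta)
neg_mean_second = -second_derivative.mean()      # should also be near I(theta)

print(f"E[score]   ~ {mean_score:.4f}")
print(f"var[score] ~ {var_score:.4f}")
print(f"-E[l'']    ~ {neg_mean_second:.4f}")
print(f"1 / theta  = {1.0 / theta:.4f}")
```

Both Monte Carlo estimates of the Fisher information should be close to $1/\theta$, and the average score should be close to zero, up to simulation noise.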

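The Fisher information matrix is used to calculate the covariance matrices associated with maximum-likelihood estimates.

For $n$ $iid$ observations, the Fisher information of the whole sample, $I_n(\theta)$, is $n$ times the Fisher information of a single observation, $I_1(\theta) = I(\theta)$:
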
\begin{align}
    I_n(\theta) &= -E_{\theta} \{ l''_n (\theta) \} \\
    &=  -E_{\theta} \left\{ \frac{d^2}{d\theta^2} \sum_{i=1}^n \log f_{\theta} (x_i) \right\} \\
    &= -\sum_{i=1}^n  E_{\theta} \left\{ \frac{d^2}{d\theta^2} \log f_{\theta} (x_i) \right\} \\
    &= -\sum_{i=1}^n  E_{\theta} \left\{ l''_1(\theta) \right\} \\
    &= n I_1 (\theta) \\
    \therefore I_n (\theta) &= n I_1(\theta).
\end{align}
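
As a worked example, consider a Bernoulli model with success probability $\theta$, so that $f_\theta(x) = \theta^x (1-\theta)^{1-x}$ for $x \in \{0, 1\}$. This model is an assumption chosen for illustration and is not part of the original notes. Then

\begin{align*}
    l_x(\theta) &= x \log \theta + (1 - x) \log (1 - \theta), \\
    l''_x(\theta) &= -\frac{x}{\theta^2} - \frac{1 - x}{(1 - \theta)^2}, \\
    I_1(\theta) &= -E_{\theta} \{ l''_x(\theta) \} = \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta (1 - \theta)}, \\
    I_n(\theta) &= \frac{n}{\theta (1 - \theta)}.
\end{align*}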

### Asymptotics of Log Likelihood Derivatives

#### Law of Large Numbers

With $iid$ data, the *law of large numbers* applies to any average of $iid$ terms. In particular, the average

\begin{equation}
    \frac{1}{n} l'_n(\theta) = \frac{1}{n} \sum_{i=1}^n \frac{d}{d\theta} \log f_{\theta}(x_i),
\end{equation}

converges in probability to its expectation, which by Equation (6) is zero:

\begin{equation}
    \frac{1}{n} l'_n(\theta) \xrightarrow{P} 0
\end{equation}

Similarly, the law of large numbers applied to the average

\begin{equation}
    -\frac{1}{n} l''_n(\theta) = -\frac{1}{n} \sum_{i=1}^n \frac{d^2}{d\theta^2} \log f_{\theta}(x_i),
\end{equation}

says that it converges in probability to its expectation, which by Equation (15) is just $I_1 (\theta)$. Thus,

\begin{equation}
    -\frac{1}{n} l''_n(\theta) \xrightarrow{P} I_1 (\theta)
\end{equation}

#### Central Limit Theorem

If $X_1, X_2, \dots, X_n$ are $iid$ random variables drawn from a population with mean $\mu$ and finite variance $\sigma^2$, and $\bar{X}_n$ is their sample mean, then the limiting distribution of

\begin{equation}
    Z = \left( \frac{\bar{X}_n - \mu}{\sigma \mathbin{/} \sqrt{n}}\right) \quad \text{as } n \rightarrow \infty,
\end{equation}

is the *standard normal distribution*. If we apply this to $l'_n(\theta)$, which is a sum of $n$ $iid$ terms,

\begin{equation}
    \frac{1}{\sqrt{n}} l'_n(\theta) \xrightarrow{D} N(0, I_1(\theta))
\end{equation}

where there is nothing to subtract because the expectation of each term is zero (Equation (6)), the variance of each term is $I_1(\theta)$ (Equation (8)), and $\sqrt{n} \cdot \left( \frac{1}{n} \right) = \frac{1}{\sqrt{n}}$.
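
The limiting variance of the scaled score can be checked by simulation. This is a minimal sketch under an assumed Poisson model with mean $\theta$, for which $l'_n(\theta) = \sum_i (x_i/\theta - 1)$ and $I_1(\theta) = 1/\theta$. The model, sample sizes and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = 3.0    # assumed true Poisson mean
n = 200        # observations per simulated data set
reps = 10_000  # number of simulated data sets

x = rng.poisson(lam=theta, size=(reps, n))

# Score of each simulated sample, evaluated at the true parameter.
score_n = (x / theta - 1.0).sum(axis=1)

# Scale by 1 / sqrt(n), as in the central limit theorem statement.
scaled_score = score_n / np.sqrt(n)

fisher_1 = 1.0 / theta
print(f"mean of scaled score:     {scaled_score.mean():.4f} (limit: 0)")
print(f"variance of scaled score: {scaled_score.var():.4f} (limit: {fisher_1:.4f})")
```

Across replications, the scaled score should have mean near zero and variance near $I_1(\theta)$; a histogram of `scaled_score` would look approximately normal.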

### Asymptotics of MLE

Assuming the MLE is in the interior of the parameter space, the maximum of the log-likelihood occurs where its first derivative vanishes:

\begin{equation}
    l'_n (\hat{\theta}_n) = 0.
\end{equation}

For large $n$, assuming that $\hat{\theta}_n$ is a *consistent* estimator and is therefore close to the true value $\theta$, $l'_n$ can be approximated by a first-order Taylor expansion around $\theta$:

\begin{equation}
    l'_n (\hat{\theta}_n) \approx l'_n(\theta) + l''_n(\theta)(\hat{\theta}_n - \theta).
\end{equation}

Since $l'_n (\hat{\theta}_n)$ is equal to zero, solving for $\hat{\theta}_n - \theta$ and multiplying both sides by $\sqrt{n}$ gives:

\begin{equation}
    \sqrt{n}(\hat{\theta}_n-\theta) \approx -\frac{\frac{1}{\sqrt{n}}l'_n(\theta)}{\frac{1}{n}l''_n(\theta)}.
\end{equation}

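By the central limit theorem applied to the numerator (Equation (21)), the law of large numbers applied to the denominator (Equation (19)), and Slutsky's theorem,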

\begin{equation}
    -\frac{\frac{1}{\sqrt{n}}l'_n(\theta)}{\frac{1}{n}l''_n(\theta)} \xrightarrow{D} \frac{Z}{I_1(\theta)}, \quad \text{where } Z \sim N\left(0, I_1(\theta) \right)
\end{equation}

Now using the fact that $E(Z/c) = E(Z)/c$ and $var(Z/c) = var(Z)/c^2$, we get

\begin{equation}
    \frac{Z}{I_1(\theta)} \sim N \left( 0, I_1(\theta)^{-1} \right),
\end{equation}

and crucially,

\begin{equation}
    \sqrt{n} \left( \hat{\theta}_n - \theta \right) \xrightarrow{D} N \left( 0, I_1(\theta)^{-1} \right).
\end{equation}
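
A small simulation can illustrate this result. The sketch assumes an exponential model $f_\theta(x) = \theta e^{-\theta x}$, for which $\hat{\theta}_n = 1/\bar{x}$ and $I_1(\theta) = 1/\theta^2$, so the limiting variance is $\theta^2$. The model, sample sizes and names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 2.0    # assumed true rate
n = 500        # observations per simulated data set
reps = 10_000  # number of simulated data sets

x = rng.exponential(scale=1.0 / theta, size=(reps, n))

# MLE of the rate in each simulated data set.
theta_hat = 1.0 / x.mean(axis=1)

# Centre and scale as in the asymptotic result.
z = np.sqrt(n) * (theta_hat - theta)

inverse_fisher_1 = theta**2  # 1 / I_1(theta) for the exponential model
print(f"mean of sqrt(n) * (theta_hat - theta):     {z.mean():.3f} (limit: 0)")
print(f"variance of sqrt(n) * (theta_hat - theta): {z.var():.3f} (limit: {inverse_fisher_1:.3f})")
```

The match is only approximate at finite $n$: the MLE of the rate is slightly biased upwards, so the simulated mean is small but not exactly zero.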

### Observed Fisher Information

The *observed Fisher information* is written as

\begin{equation}
    \hat{J}_n (\theta) = -l''_n(\theta).
\end{equation}

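Since $I_n(\theta) = n I_1(\theta)$, Equation (27) can be rewritten in terms of the expected Fisher information. Because $\hat{J}_n(\theta) / n$ converges in probability to $I_1(\theta)$, it can also be rewritten in terms of the observed Fisher information: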

\begin{equation}
    \sqrt{I_n(\theta)} \cdot \left( \hat{\theta}_n - \theta \right) \xrightarrow{D} N \left( 0, 1 \right).
\end{equation}

\begin{equation}
    \sqrt{\hat{J}_n (\theta)} \cdot \left( \hat{\theta}_n - \theta \right) \xrightarrow{D} N \left( 0, 1 \right).
\end{equation}
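
In practice this result is typically used to build confidence intervals, with the unknown $\theta$ inside the observed information replaced by $\hat{\theta}_n$. The sketch below makes that plug-in assumption, again uses an assumed exponential model (for which $-l''_n(\theta) = n/\theta^2$), and estimates by simulation how often a nominal 95% interval covers the true value. All names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
theta = 2.0   # assumed true rate
n = 400       # observations per simulated data set
reps = 5_000  # number of simulated data sets
z_975 = 1.959964  # 97.5% quantile of the standard normal distribution

x = rng.exponential(scale=1.0 / theta, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)

# Observed Fisher information -l''_n evaluated at the MLE (plug-in assumption).
observed_info = n / theta_hat**2
standard_error = 1.0 / np.sqrt(observed_info)

lower = theta_hat - z_975 * standard_error
upper = theta_hat + z_975 * standard_error

# Fraction of simulated intervals that contain the true parameter.
coverage = np.mean((lower <= theta) & (theta <= upper))
print(f"empirical coverage of nominal 95% intervals: {coverage:.3f}")
```

The empirical coverage should be close to 0.95 when the model is correctly specified, which is the situation covered by the theory above.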

## Misspecified Maximum Likelihood Estimation

### Modifying the Theory under Model Misspecification

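Suppose now that the true distribution of the data has density $g$, which is not in the model, so that no value of the parameter $\theta$ corresponds to it. Expectations and variances are then taken with respect to $g$, and we write $E_g$ and $var_g$. The results in Equations (6) and (7) no longer follow automatically, because differentiating Equation (5) under the integral sign only works when the $\theta$ in $E_\theta$ and $var_\theta$ is the same as the $\theta$ in $f_\theta$.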

Consider the expectation of the log likelihood

\begin{equation}
    \lambda_g (\theta) = E_g \{ l_X (\theta) \},
\end{equation}

and suppose that the function $\lambda_g$ achieves its maximum at some point $\theta^*$. Assuming differentiation under the integral sign is possible,

\begin{equation}
    E_g \{ l'_X(\theta^*) \} = 0
\end{equation}

Equation (7) is no longer valid under a misspecified model, so its two sides can no longer be used interchangeably to define the Fisher information. They are replaced by two separate quantities:

\begin{equation}
\begin{aligned}
    V_n(\theta) &= var_g \{l'_n(\theta)\} \\
    J_n(\theta) &= -E_g \{l''_n(\theta)\}
\end{aligned}
\end{equation}

When the model is correctly specified, both $V_n(\theta)$ and $J_n(\theta)$ are simply equal to $I_n(\theta)$. By the same argument that led to Equation (15), we can now say that

\begin{equation}
\begin{aligned}
    V_n(\theta) &= n V_1(\theta) \\
    J_n(\theta) &= n J_1(\theta)
\end{aligned}
\end{equation}
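
As a worked illustration, suppose the working model is exponential, $f_\theta(x) = \theta e^{-\theta x}$, while the data come from some density $g$ on the positive half-line with mean $\mu_g$ and variance $\sigma_g^2$. The exponential model and the symbols $\mu_g$ and $\sigma_g^2$ are assumptions introduced here for illustration. Then

\begin{align*}
    \lambda_g(\theta) &= \log \theta - \theta \mu_g, & \theta^* &= 1 / \mu_g, \\
    V_1(\theta^*) &= var_g \{ 1/\theta^* - X \} = \sigma_g^2, & J_1(\theta^*) &= 1 / (\theta^*)^2 = \mu_g^2.
\end{align*}

The two quantities agree only when $\sigma_g^2 = \mu_g^2$, which holds when $g$ is itself exponential but not, for instance, when $g$ is a gamma density with shape parameter different from one.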

### Asymptotics under Model Misspecification

Following a derivation similar to the one in the section on the asymptotics of the MLE, we conclude that

\begin{equation}
    \sqrt{n} \left( \hat{\theta}_n - \theta^* \right) \xrightarrow{D} N \left(0,J_1(\theta^*)^{-1}V_1(\theta^*)J_1(\theta^*)^{-1}\right),
\end{equation}

so that, for large $n$, approximately

\begin{equation}
    \hat{\theta}_n \approx N \left(\theta^*,\hat{J}_n(\hat{\theta}_n)^{-1}\hat{V}_n(\hat{\theta}_n)\hat{J}_n(\hat{\theta}_n)^{-1}\right),
\end{equation}

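in which $J_n(\theta)$ is replaced by the observed Fisher information $\hat{J}_n(\theta)$ and $V_n(\theta)$ is replaced by the empirical estimate

\begin{equation}
    \hat{V}_n(\theta) = \sum_{i=1}^{n} \left\{ \frac{d}{d\theta} \log f_{\theta}(x_i) \right\}^2.
\end{equation}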

### The Sandwich Estimator

The estimated asymptotic variance of $\hat{\theta}_n$,

\begin{equation}
    \hat{J}_n(\hat{\theta}_n)^{-1}\hat{V}_n(\hat{\theta}_n)\hat{J}_n(\hat{\theta}_n)^{-1},
\end{equation}

is also called the *sandwich estimator*, because $\hat{V}_n$ is sandwiched between two copies of $\hat{J}_n^{-1}$. Under model misspecification, the asymptotic variance is no longer simply the "inverse Fisher information".
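
The following minimal sketch computes the sandwich estimator for a one-parameter model and compares it with the naive inverse observed information. It assumes an exponential working model $f_\theta(x) = \theta e^{-\theta x}$ fitted to gamma-distributed data, so the model is deliberately misspecified. The function name `exponential_sandwich`, the simulation settings and the choice of distributions are all hypothetical and chosen only for illustration.

```python
import numpy as np


def exponential_sandwich(x):
    """MLE, naive variance and sandwich variance for an exponential working model."""
    x = np.asarray(x, dtype=float)
    n = x.size
    theta_hat = 1.0 / x.mean()               # solves l'_n(theta) = n / theta - sum(x) = 0
    scores = 1.0 / theta_hat - x             # per-observation first derivatives at theta_hat
    j_hat = n / theta_hat**2                 # observed information, -l''_n(theta_hat)
    v_hat = np.sum(scores**2)                # empirical estimate of V_n(theta_hat)
    naive_var = 1.0 / j_hat                  # "inverse Fisher information"
    sandwich_var = v_hat / j_hat**2          # J^{-1} V J^{-1} in the scalar case
    return theta_hat, naive_var, sandwich_var


rng = np.random.default_rng(4)
shape, scale = 4.0, 0.5   # assumed gamma data-generating density g
n, reps = 500, 4_000

samples = rng.gamma(shape, scale, size=(reps, n))

# Spread of the MLE across simulated data sets: the target both estimators aim at.
empirical_var = (1.0 / samples.mean(axis=1)).var()

# Variance estimates computed from a single data set, as one would in practice.
theta_hat, naive_var, sandwich_var = exponential_sandwich(samples[0])

print(f"theta_hat:               {theta_hat:.4f}")
print(f"empirical variance:      {empirical_var:.3e}")
print(f"sandwich variance:       {sandwich_var:.3e}")
print(f"naive inverse-info var.: {naive_var:.3e}")
```

In this setup the sandwich variance should track the simulated variance of $\hat{\theta}_n$, whereas the naive estimate should overstate it by roughly the gamma shape parameter. With several parameters the same formula applies with matrices, and the scalar division becomes a matrix inverse.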

## Code Implementation

`scikit-learn` provides robust covariance estimators in `sklearn.covariance`, such as `MinCovDet` (the minimum covariance determinant estimator), which is imported below.

```python
import sys

import pandas as pd
from sklearn.covariance import MinCovDet

# Make the local `tools` directory importable so that `armagarch` can be found
sys.path.append("tools")
from armagarch import order_determination
```

```python
# Load the log-return series for a single ticker
df = pd.read_csv('data/top10_logreturns.csv', index_col=0, parse_dates=True)['D05.SI']

# Fit ARMA-GARCH model
results = order_determination(df.values, max_p=6, max_q=6, gjr=False, verbose=True)
```

```
Running Order Determination...
Current best model: ARMA(1,1)-GARCH(1,1), AIC = -20341.721960227755
Current best model: ARMA(3,6)-GARCH(1,1), AIC = -21792.497033468313
Fitting model...
Model fitting complete.
```


```python
results
```

```
     fun: -10176.860980113877
     jac: array([-1.10289683e+04, -2.34366699e+02, -2.31569946e+02, -4.77299828e+08,
       -4.41069554e+04, -2.02258904e+04])
 message: 'Inequality constraints incompatible'
    nfev: 7
     nit: 1
    njev: 1
  status: 4
 success: False
       x: array([2.53425045e-04, 2.53425045e-04, 2.53425045e-04, 1.35002893e-06,
       3.00000000e-02, 9.00000000e-01])
```

```python
1 - 0.9 - 0.09 / 2 - 0.03
```

```
0.02499999999999998
```