# Astronomy 8824 - Numerical and Statistical Methods in Astrophysics

## Statistical Methods Topic III. Correlated Errors, $\chi^2$, Maximum Likelihood, and MCMC

These notes are for the course Astronomy 8824: Numerical and Statistical Methods in Astrophysics. They are based on notes from David Weinberg with modifications and additions by Paul Martini.
David's original notes are available from his website: http://www.astronomy.ohio-state.edu/~dhw/A8824/index.html

#### Background reading in Statistics, Data Mining, and Machine Learning in Astronomy: 
- Bivariate and Multivariate Gaussians, see $\S\S 3.5.2-3.5.4$ 
- Parameter Errors in a Maximum Likelihood or Posterior Estimate, see $\S 4.2.5$

In [1]:
import math
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from scipy import optimize

# matplotlib settings 
SMALL_SIZE = 14
MEDIUM_SIZE = 16
BIGGER_SIZE = 18

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
plt.rc('lines', linewidth=2)
plt.rc('axes', linewidth=2)
plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)   # fontsize of the figure title

LaTeX macros hidden here -- 
$\newcommand{\expect}[1]{{\left\langle #1 \right\rangle}}$
$\newcommand{\intinf}{\int_{-\infty}^{\infty}}$
$\newcommand{\xbar}{\overline{x}}$
$\newcommand{\ybar}{\overline{y}}$
$\newcommand{\like}{{\cal L}}$
$\newcommand{\llike}{{\rm ln}{\cal L}}$
$\newcommand{\xhat}{\hat{x}}$
$\newcommand{\yhat}{\hat{y}}$
$\newcommand{\xhati}{\hat{x}_i}$
$\newcommand{\yhati}{\hat{y}_i}$
$\newcommand{\sigxi}{\sigma_{x,i}}$
$\newcommand{\sigyi}{\sigma_{y,i}}$
$\newcommand{\cij}{C_{ij}}$
$\newcommand{\cinvij}{C^{-1}_{ij}}$
$\newcommand{\cinvkl}{C^{-1}_{kl}}$
$\newcommand{\valpha}{\vec \alpha}$
$\newcommand{\vth}{\vec \theta}$

### Bivariate and Multivariate Gaussians

Suppose we have two independent variables $x$ and $y$ drawn from Gaussian distributions of width $\sigma_x$ and $\sigma_y$. The joint distribution $p(x,y)=p(x)p(y)$ is a bivariate Gaussian, and the values of $x$ and $y$ are uncorrelated:
$$
\langle (x-\mu_x)(y-\mu_y)\rangle = 0.
$$

If we now consider
$$\eqalign{
x^\prime &= x\cos\alpha-y\sin\alpha \cr
y^\prime &= x\sin\alpha+y\cos\alpha 
}
$$
then we "rotate" the distribution by angle $\alpha$. The distribution $p(x^\prime,y^\prime)$ is still a bivariate
Gaussian, but now the values of $x^\prime$ and $y^\prime$ are correlated.
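
A quick numerical check of this statement is sketched below. The widths, rotation angle, sample size, and random seed are arbitrary choices made for illustration. The comparison value follows from expanding $\langle x^\prime y^\prime\rangle$ for zero-mean, independent $x$ and $y$, which gives $(\sigma_x^2-\sigma_y^2)\sin\alpha\cos\alpha$.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma_x, sigma_y = 2.0, 0.5   # assumed widths (illustrative values)
alpha = np.pi / 6             # assumed rotation angle
n = 200_000

# Independent zero-mean Gaussian draws
x = rng.normal(0.0, sigma_x, n)
y = rng.normal(0.0, sigma_y, n)

# Rotate by angle alpha
xp = x * np.cos(alpha) - y * np.sin(alpha)
yp = x * np.sin(alpha) + y * np.cos(alpha)

cov_before = np.cov(x, y)[0, 1]     # consistent with zero
cov_after = np.cov(xp, yp)[0, 1]    # nonzero after rotation
cov_expected = (sigma_x**2 - sigma_y**2) * np.sin(alpha) * np.cos(alpha)

print(cov_before, cov_after, cov_expected)
```

The covariance vanishes again if $\sigma_x=\sigma_y$, since a circular Gaussian looks the same after any rotation.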

If we have a number of random variables $y_i$, $i=1...M$, which we combine into a vector ${\bf y}$, then the covariance matrix is
$$
C_{ij} = \langle (y_i-\langle y_i \rangle) (y_j-\langle y_j \rangle)\rangle ~.
$$

If the distribution $p({\bf y})$ is a multivariate Gaussian then
$$
p({\bf y}) =  {1 \over (2\pi)^{M/2} \sqrt{{\rm det}({\bf C})}}
  \exp\left(-{1\over 2} \Delta y_i \cinvij \Delta y_j\right) ~,
$$
where $\Delta y_i = y_i-\langle y_i \rangle$, $\cinvij$ is the inverse covariance matrix and I have used the Einstein summation convention: repeated indices ($i,j$ in this case) are automatically summed over.

This can also be written in vector/matrix notation.
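
Written out explicitly, with $\Delta{\bf y} = {\bf y} - \langle {\bf y} \rangle$ treated as a column vector and ${\bf C}^{-1}$ the inverse covariance matrix, the same expression reads
$$
p({\bf y}) = {1 \over (2\pi)^{M/2} \sqrt{{\rm det}({\bf C})}}
  \exp\left(-{1\over 2}\, \Delta{\bf y}^{T}\, {\bf C}^{-1}\, \Delta{\bf y}\right) ~.
$$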

### Correlated Errors: Observables and Parameters

Sometimes the errors on data points are correlated.

For example, there may be a calibration uncertainty that affects many data points in the same way.  For galaxy clustering statistics, measurement errors at different scales are usually correlated.

Even if the errors on data points are uncorrelated, the errors on _parameters_ derived from a multi-parameter fit to the data (e.g., the slope and amplitude of a line) are often correlated, unless one has deliberately constructed parameters that have uncorrelated errors.

It is also possible to have correlated errors on data and uncorrelated errors on parameters, though this is less generic than the reverse case.

### Gaussian Likelihoods and $\chi^2$

If we have uncorrelated, Gaussian errors on observables $y$ 
and a model that predicts $y_k=y_{\rm mod}(x_k)$ then the 
likelihood is $L \propto e^{-\chi^2/2}$ where
$$
\chi^2 = \sum_k {(\Delta y_k)^2 \over \sigma_k^2}
$$
with $\Delta y_k = y_k-y_{\rm mod}(x_k)$.

However, if the errors are correlated then we instead have
$$
\chi^2 = \Delta y_k \cinvkl \Delta y_l.
$$

The two definitions coincide for a diagonal covariance matrix
$C_{kl}=\sigma_k^2 \delta_{kl}$, in which case 
$\cinvkl = \delta_{kl}\sigma_k^{-2}$.
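
As a minimal sketch of how the two definitions relate in practice, the example below evaluates $\chi^2$ with a full covariance matrix. The data values, the errors, and the shared calibration error `sigma_cal` are made up for illustration, and the function name `chi2_correlated` is hypothetical. A linear solve is used in place of an explicit matrix inverse, which is usually the more stable choice numerically.

```python
import numpy as np

def chi2_correlated(y, y_model, cov):
    """Return chi^2 = dy^T C^{-1} dy for data y, model y_model, covariance cov."""
    dy = np.asarray(y, dtype=float) - np.asarray(y_model, dtype=float)
    # Solve C z = dy instead of forming C^{-1} explicitly
    return float(dy @ np.linalg.solve(cov, dy))

# Hypothetical data: independent errors plus a shared calibration error
y = np.array([1.2, 1.9, 3.3])
y_model = np.array([1.0, 2.0, 3.0])
sigma = np.array([0.2, 0.2, 0.3])
sigma_cal = 0.1

cov_diag = np.diag(sigma**2)
# A common additive error adds sigma_cal^2 to every element of C
cov_full = cov_diag + sigma_cal**2 * np.ones((3, 3))

chi2_simple = np.sum((y - y_model)**2 / sigma**2)      # uncorrelated formula
chi2_diag = chi2_correlated(y, y_model, cov_diag)      # same value via the matrix form
chi2_full = chi2_correlated(y, y_model, cov_full)      # differs once errors are correlated

print(chi2_simple, chi2_diag, chi2_full)
```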

### Parameter Errors in a Maximum Likelihood (or MAP) Estimate


For a Gaussian probability distribution
$p(x)=(2\pi\sigma^2)^{-1/2}e^{-(x-\mu)^2/2\sigma^2}$,
$$
\ln p = -{1 \over 2} {(x-\mu)^2 \over \sigma^2} + {\rm const}.
$$

Suppose we have estimated a parameter $\theta$ by maximizing either the likelihood $L$ or the posterior probability $L_p$. The first derivative vanishes at the maximum, so a Taylor expansion gives
$$
\ln L \approx \ln L_0 + 
  {1 \over 2}\left({\partial^2 \ln L\over \partial\theta^2}\right)
  (\theta-\theta_0)^2~,
$$
where $\theta_0$ is the location of the maximum.

Comparing this expansion with the Gaussian form of $\ln p$ above, the second derivative plays the role of $-1/\sigma^2$. We infer that if $L(\theta)$ is adequately described by this Taylor expansion, the $1\sigma$ error on $\theta$ is 
$$
\sigma_\theta = 
  \left(-{\partial^2 \ln L\over \partial\theta^2}\right)^{-1/2},
$$
where the derivative is evaluated at the maximum.
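
The following sketch illustrates this formula for a case where the answer is known: estimating the mean $\mu$ of $N$ Gaussian draws with known $\sigma$, for which the error on the mean is $\sigma/\sqrt{N}$. The data are simulated, and the finite-difference step `h` is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n = 2.0, 100                 # assumed known width and sample size
data = rng.normal(5.0, sigma, n)    # simulated data

def ln_like(mu):
    """Log-likelihood of the mean mu, up to an additive constant."""
    return -0.5 * np.sum((data - mu)**2) / sigma**2

mu_hat = data.mean()                # maximum-likelihood estimate of mu

# Central finite difference for the second derivative at the maximum
h = 1e-3
d2 = (ln_like(mu_hat + h) - 2.0 * ln_like(mu_hat) + ln_like(mu_hat - h)) / h**2

sigma_mu = (-d2) ** -0.5            # 1-sigma error from the curvature
print(sigma_mu, sigma / np.sqrt(n)) # the two should agree closely
```

Here $\ln L$ is exactly quadratic in $\mu$, so the curvature estimate reproduces $\sigma/\sqrt{N}$ up to roundoff. For a non-Gaussian likelihood the agreement is only as good as the Taylor expansion.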

For the more general case of a vector of parameters $\theta_i$, we can define the second-derivative matrix
$$
H_{jk} = - {\partial^2\ln L \over \partial\theta_j\partial\theta_k},
$$
which is sometimes called the Hessian matrix or curvature matrix (though terminology and notation are not standard).

One can approximate the log-likelihood as a multi-dimensional paraboloid near its maximum, to find that the likelihood itself is a multi-dimensional Gaussian with covariance matrix
$$
{\rm Cov}(\theta_j,\theta_k) = \sigma_{jk} = H_{jk}^{-1}
$$

Here $(\sigma_{ii})^{1/2}$ is the error on parameter $\theta_i$ marginalized over uncertainties in other parameters.

If $\sigma_{jk} \neq 0$ for some $j \neq k$ then the uncertainties on parameters $\theta_j$ and $\theta_k$ are
correlated.
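
A minimal sketch of the multi-parameter case, assuming a straight-line model $y = a + bx$ with equal, uncorrelated Gaussian errors on $y$. For this model $H = A^T A/\sigma^2$, where $A$ is the design matrix with columns $1$ and $x$. The abscissas, the error `sigma`, and the helper name `param_cov` are hypothetical choices for illustration.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 11)   # hypothetical abscissas
sigma = 0.5                      # assumed common error on each y value

def param_cov(xvals):
    """Covariance matrix of (a, b) for y = a + b*x with equal Gaussian errors."""
    A = np.column_stack([np.ones_like(xvals), xvals])   # design matrix
    H = A.T @ A / sigma**2       # H_jk = -d^2 lnL / d theta_j d theta_k
    return np.linalg.inv(H)      # Cov(theta_j, theta_k) = (H^{-1})_jk

cov = param_cov(x)
err_marg = np.sqrt(np.diag(cov))                 # marginalized errors on a and b
corr = cov[0, 1] / (err_marg[0] * err_marg[1])   # correlation coefficient

# Measuring x from its mean gives an intercept uncorrelated with the slope
cov_centered = param_cov(x - x.mean())

print(err_marg, corr)
print(cov_centered)
```

The intercept and slope are strongly anticorrelated when all $x$ values lie on one side of zero, even though the data errors are uncorrelated. Redefining the intercept at the mean of $x$ is one example of deliberately constructing parameters with uncorrelated errors.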

I have phrased this discussion in terms of likelihood, but it could equally well be phrased in terms of posterior probability: the log of the posterior probability can also be approximated as a paraboloid about its maximum, and one would just substitute $P_{\rm posterior}$ for $L$ in the expressions.

**Notational caution:** Whenever I write $A^{-1}_{jk}$ I mean the $jk$ element of the inverse of matrix $A$, not the reciprocal of the $jk$ element of $A$, which I would write $(A_{jk})^{-1}$.

### Monte Carlo Markov Chains

A fairly common statistical problem is estimating the probability distribution of parameters in a high-dimensional parameter space.

If the 2nd-order expansion described above is adequate, then one "just" needs to find the maximum likelihood solution and compute the second-derivatives of the likelihood with respect to the parameters.

But sometimes this approximation isn't adequate -- a rule-of-thumb that doesn't always work is that the parabolic approximation is good when the fractional errors on _all_ of the parameters are small.


One option is to grid the parameter space finely and compute the posterior probability at all grid locations within it. Marginal distributions can be computed by summing over axes.

This approach is robust and therefore shouldn't be ignored, but it is often computationally impractical.
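
For a low-dimensional problem the grid approach is only a few lines. The sketch below assumes a toy problem that is not from the notes: inferring the mean and width of a Gaussian from simulated data, with flat priors on both parameters over the gridded range. The grid limits and resolution are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(1.0, 2.0, 50)          # hypothetical data set

# Grid over the two parameters (flat priors assumed within these ranges)
mu = np.linspace(-1.0, 3.0, 201)
sig = np.linspace(1.0, 4.0, 201)
MU, SIG = np.meshgrid(mu, sig, indexing="ij")

# ln L at every grid point, dropping constant terms
resid2 = ((data[:, None, None] - MU)**2).sum(axis=0)
lnL = -data.size * np.log(SIG) - 0.5 * resid2 / SIG**2

# Subtract the maximum before exponentiating to avoid underflow, then normalize
post = np.exp(lnL - lnL.max())
post /= post.sum()

p_mu = post.sum(axis=1)     # marginal distribution of mu (summed over sigma)
p_sig = post.sum(axis=0)    # marginal distribution of sigma (summed over mu)

mu_mean = np.sum(mu * p_mu) # posterior mean of mu
print(mu_mean, data.mean())
```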

For example, we might be trying to determine the constraints from a CMB data set $D$ on the set of cosmological parameters $\vth=(\Omega_m,h,\Omega_b,A,n,\tau)$ that determines the CMB spectrum in the simplest current cosmological scenario.

There are tools for calculating $p(D|\vth I)$, but this calculation might take a few seconds, or minutes, for each model in the parameter space.  

Since the parameter space is six-dimensional, even a relatively coarse grid with 10 points along each parameter direction over the plausible range requires $10^6$ evaluations of $p(D|\vth I)$, and if we add another two parameters then $10^6$ becomes $10^8$.

Thus, a naive grid-based evaluation of the likelihood to find best-fit parameters and error bars may be prohibitively expensive.

Monte Carlo Markov Chains (MCMC), more often called Markov chain Monte Carlo, are a useful tool for this kind of problem, and this approach has taken rapid hold in the cosmology literature.

In effect, MCMC is doing the necessary integrals for marginalization by Monte Carlo integration.

For details, see the references listed below and the things that they in turn refer to, but in brief the idea is as follows.


The goal is to map the posterior probability distribution $p(\vth | DI) \propto p(\vth | I) p(D|\vth I)$ in the neighborhood of its maximum value.

If the prior $p(\vth|I)$ is flat, then we just have $p(\vth|DI) \propto L$.

Procedure:

1. Start from a randomly chosen point in the parameter space, $\vth = \valpha_1$.

2. Take a random step to a new position $\valpha_2$.

3. If $p(\valpha_2|DI) \geq p(\valpha_1|DI)$, "accept" the step: add $\valpha_2$ to the chain, and substitute $\valpha_2 \rightarrow \valpha_1$. Return to step 2.

4. If $p(\valpha_2|DI) < p(\valpha_1|DI)$, draw a random number $x$ from
a uniform distribution from 0 to 1.  If $x < p(\valpha_2|DI)/p(\valpha_1|DI)$,
"accept"  the step and proceed as in 3.  
If $x\geq p(\valpha_2|DI)/p(\valpha_1|DI)$, reject the step.
Save $\valpha_1$ as another (repeated) link on the chain, and go back to 2.
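
The procedure above can be sketched in a few lines. This is a minimal illustration, not a production sampler. It assumes a symmetric Gaussian proposal of fixed width `step_size`, works with $\ln p$ to avoid numerical underflow, and uses a made-up correlated 2D Gaussian as the target. The function name `metropolis`, the fixed starting point, the step size, and the burn-in length are all illustrative choices.

```python
import numpy as np

def metropolis(ln_post, start, step_size, n_steps, rng):
    """Minimal Metropolis sampler following the numbered steps above."""
    current = np.atleast_1d(np.asarray(start, dtype=float))
    ln_p_current = ln_post(current)
    chain = np.empty((n_steps, current.size))
    n_accept = 0
    for i in range(n_steps):
        # Step 2: random (symmetric Gaussian) step to a new position
        proposal = current + step_size * rng.normal(size=current.size)
        ln_p_prop = ln_post(proposal)
        # Step 3: always accept a more probable point.
        # Step 4: otherwise accept with probability p(alpha_2)/p(alpha_1).
        if ln_p_prop >= ln_p_current or rng.uniform() < np.exp(ln_p_prop - ln_p_current):
            current, ln_p_current = proposal, ln_p_prop
            n_accept += 1
        # A rejected step repeats the old point as another link in the chain
        chain[i] = current
    return chain, n_accept / n_steps

# Hypothetical target: zero-mean 2D Gaussian with correlated parameters
cov = np.array([[1.0, 0.8], [0.8, 2.0]])
cinv = np.linalg.inv(cov)

def ln_post(theta):
    return -0.5 * theta @ cinv @ theta

rng = np.random.default_rng(3)
chain, acc = metropolis(ln_post, start=[3.0, -3.0], step_size=1.0,
                        n_steps=50_000, rng=rng)
samples = chain[5_000:]           # discard an assumed burn-in period
sample_cov = np.cov(samples.T)    # should approximate cov
print(acc)
print(sample_cov)
```

The acceptance rule is valid in this simple form only because the proposal is symmetric, meaning a step from $\valpha_1$ to $\valpha_2$ is as likely as the reverse step.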


The chain takes some time to "burn in," i.e., to reach the neighborhood
of the most likely solutions.

However, once this happens, a "long enough" chain will have a 
density of points that is proportional to $p(\vth|DI)$.

To get, for example, the joint pdf of a pair of parameters, one
can just make contours of the density of points in the space of
those two parameters.  Other "nuisance" parameters are marginalized
over automatically, because the points sample the full space.

If you want to calculate the posterior distribution of some 
_function_ of the parameters (e.g., the age of the Universe,
given parameter estimates from the CMB), you can just calculate
that function for all points in the chain, then plot the pdf
of the result.
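
Both operations are simple array manipulations once a chain is in hand. The sketch below uses draws from a 2D Gaussian as a stand-in for a converged chain, since a real chain would come from an MCMC run. The parameter values, the bin count, and the derived quantity $f=\theta_1/\theta_2$ are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-in for a converged chain of (theta_1, theta_2) samples
mean = [1.0, 2.0]
cov = [[0.04, 0.018], [0.018, 0.01]]
chain = rng.multivariate_normal(mean, cov, size=20_000)

# Joint pdf of the two parameters: density of chain points on a 2D grid,
# suitable as input to a contour plot
density, edges1, edges2 = np.histogram2d(chain[:, 0], chain[:, 1],
                                         bins=40, density=True)

# Marginal pdf of theta_1: histogram that column and ignore the rest
p1, edges = np.histogram(chain[:, 0], bins=40, density=True)

# Posterior of a derived quantity: evaluate it at every point in the chain
f = chain[:, 0] / chain[:, 1]
f_lo, f_med, f_hi = np.percentile(f, [16, 50, 84])
print(f_lo, f_med, f_hi)
```

The 16th, 50th, and 84th percentiles give a median and a central 68% interval for the derived quantity, with the parameter correlations propagated automatically.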


There are numerous technical issues related to determining whether
a chain has "converged" (i.e., is fairly sampling the probability
distribution), and to choosing steps in a way that produces fast
convergence and good "mixing" (sampling the distribution fairly
with a relatively small number of points).  

There is an increasingly extensive literature on MCMC methods.
Some starting points are:
Sections 5.8.1 and 5.8.2 of Ivezic et al., and section 15.8
of the 3rd edition of Numerical Recipes, though this topic wasn't
in the 1st or 2nd edition.  

An exceedingly useful and enjoyably written reference is
Hogg \& Foreman-Mackey (2017, arXiv:1710.06068).
Another that goes a bit further in introducing more advanced
methods is Sharma (2017, ARAA, 55, 213).

Properly implemented, MCMC should sample tails or multiple modes
of a distribution that are not well described by the Gaussian approximation.

However, if the Gaussian approximation is adequate, then MCMC is
not a computationally efficient way to find the parameter PDF.