# Astronomy 8824 - Numerical and Statistical Methods in Astrophysics

## Statistical Methods Topic IV. Fisher Matrix Forecasts and Linear Models

These notes are for the course Astronomy 8824: Numerical and Statistical Methods in Astrophysics. They are based on notes from David Weinberg with modifications and additions by Paul Martini.
David's original notes are available from his website: http://www.astronomy.ohio-state.edu/~dhw/A8824/index.html

#### Background reading: 
- Statistics, Data Mining, and Machine Learning in Astronomy, $\S4.2$ 
- Numerical Recipes, Chapter 15
- Gould (2003), arXiv:astro-ph/0310577

```python
import math
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
from scipy import optimize

# matplotlib settings 
SMALL_SIZE = 14
MEDIUM_SIZE = 16
BIGGER_SIZE = 18

plt.rc('font', size=SMALL_SIZE)          # controls default text sizes
plt.rc('axes', titlesize=SMALL_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=BIGGER_SIZE)    # fontsize of the x and y labels
plt.rc('lines', linewidth=2)
plt.rc('axes', linewidth=2)
plt.rc('xtick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=MEDIUM_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)   # fontsize of the figure title
```

LaTex macros hidden here -- 
$\newcommand{\expect}[1]{{\left\langle #1 \right\rangle}}$
$\newcommand{\intinf}{\int_{-\infty}^{\infty}}$
$\newcommand{\xbar}{\overline{x}}$
$\newcommand{\ybar}{\overline{y}}$
$\newcommand{\like}{{\cal L}}$
$\newcommand{\llike}{{\rm ln}{\cal L}}$
$\newcommand{\xhat}{\hat{x}}$
$\newcommand{\yhat}{\hat{y}}$
$\newcommand{\xhati}{\hat{x}_i}$
$\newcommand{\yhati}{\hat{y}_i}$
$\newcommand{\sigxi}{\sigma_{x,i}}$
$\newcommand{\sigyi}{\sigma_{y,i}}$
$\newcommand{\cij}{C_{ij}}$
$\newcommand{\cinvij}{C^{-1}_{ij}}$
$\newcommand{\cinvkl}{C^{-1}_{kl}}$
$\newcommand{\cinvmn}{C^{-1}_{mn}}$
$\newcommand{\valpha}{\vec \alpha}$
$\newcommand{\vth}{\vec \theta}$
$\newcommand{\ymod}{y_{\rm mod}}$
$\newcommand{\dy}{\Delta y}$

### Introduction

The Fisher information matrix, or simply the Fisher matrix, describes the amount of information that some observables $D$ contain about the parameters ${\mathbf \theta}$ of the model distribution for $D$, that is $p(D | \theta)$. 

In some cases, the probability density function $p(D | \theta)$ may depend strongly on $\theta$, such that one may need relatively little data to derive $\theta$. One example is the use of parallax measurements to determine the distance to a nearby star cluster. In this case, $\theta$ is the distance of the cluster, and one would need relatively few measurements to determine $\theta$ reasonably well. In other words, there is a lot of information about the parameters in the observables. 

In other cases, the probability density function $p(D | \theta)$ may depend only weakly on the model parameters. An example is the galaxy luminosity function. The galaxy luminosity function is typically described by a power-law, an exponential cut-off at high luminosity, and a space density. A small number of measurements, especially if confined to a small luminosity range, would not provide much of a constraint. In this case there is relatively little information about parameters in the observables. 

### Single-Parameter Warmup

Suppose we have an observable $y_1$ measured at some $x_1$ that we can predict given some model parameter $\theta_1$, and that we measure $y_1$ with some observational error $\sigma(y_1)$.

Our best estimate of $\theta_1$ is the value that gives the observed value of $y_1$.

In the neighborhood of this best fit value $\hat{\theta}_1$, a linear Taylor expansion implies
$$
y_1(\theta_1) = y_1(\hat{\theta}_1)+
  \left({dy_1 \over d\theta_1}\right)(\theta_1-\hat{\theta}_1).
$$

Simple "chain rule" error propagation then tells us that the error on $\theta_1$ is
$$
\sigma(\theta_1) = \left(dy_1 \over d\theta_1\right)^{-1} \sigma(y_1).
$$
In this expression, $\sigma(\theta_1)$ is the width of the pdf of potential values of $\theta_1$, and $\sigma(y_1)$ the width of the distribution of observed values of $y_1$ relative to the model predictions.

Often we are interested in the fractional error
$$
{\sigma(\theta_1)\over\theta_1} = \sigma(\ln\theta_1) =
  \left({d\ln y_1 \over d\ln\theta_1}\right)^{-1}\sigma(\ln y_1).
$$

For example, if $y_1 \propto \theta_1^3$, then $\ln y_1 = 3 \ln \theta_1 + {\rm const}$ and 
$$
\left( \frac{d\ln y_1}{d\ln \theta_1} \right) = 3
$$
so therefore
$$
\frac{\sigma(\theta_1)}{\theta_1} = \sigma(\ln \theta_1) = \frac{1}{3} \sigma(\ln y_1) = \frac{\sigma(y_1)}{3 y_1} 
$$
and the fractional error on $\theta_1$ is only 1/3 the fractional error on $y_1$.

These results break down if the linear Taylor expansion becomes inaccurate over the observationally allowed range of $y_1$.
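As a rough numerical check of the $y_1 \propto \theta_1^3$ example, the sketch below draws noisy values of $y_1$, inverts the model for $\theta_1$, and compares the scatter with the chain-rule prediction. The true value `theta_true`, the two fractional errors, and the number of trials are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

theta_true = 2.0            # hypothetical true parameter value
y_true = theta_true**3      # model: y = theta^3

def frac_error_theta(frac_err_y, n_trials=200_000):
    """Monte Carlo estimate of sigma(theta)/theta for a given sigma(y)/y."""
    y_obs = y_true * (1.0 + frac_err_y * rng.standard_normal(n_trials))
    theta_est = np.cbrt(y_obs)          # invert the model for each noisy y
    return theta_est.std() / theta_true

for frac_err_y in (0.03, 0.30):
    mc = frac_error_theta(frac_err_y)
    linear = frac_err_y / 3.0           # chain-rule prediction
    print(f"sigma(y)/y = {frac_err_y:.2f}: Monte Carlo {mc:.4f}, linear {linear:.4f}")
```

For the small error the two numbers should agree closely. For the large error the linear prediction is expected to be less accurate, because the Taylor expansion no longer holds over the allowed range of $y_1$.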

In general, the error on a parameter depends on the error on the observable and on the sensitivity of that observable to that parameter. A more sensitive observable gives greater leverage on the parameter.

A good takeaway is that errors on parameters depend on 
1. errors on observables
2. sensitivity of the observables to the parameters

### Fisher Matrix Error Forecasting

Suppose we are considering some future experiment rather than data we have in hand. If we
1. can predict what the measurement errors on the data will be, and
2. know how the data depend on the model parameters,

then we can forecast how accurately we will be able to constrain the parameters. 

Examples: 
1. Given $\sigma(mag)$, how many supernovae do we need to observe in order to obtain a 1\% error on $\Omega_\Lambda$?
2. If you want to measure the slope of the $M-\sigma$ relation with some uncertainty, but can only observe two galaxies, what $\sigma_*$ values should they have? 

If we have a parameter vector $\vth$, the Fisher information matrix is defined by
$$
F_{ij} = - \left\langle{\partial^2\ln L \over \partial\theta_i\partial\theta_j}
                \right\rangle.
$$
The Fisher matrix (or Fisher information matrix) is thus the expected value of the curvature (Hessian) matrix of $-\ln L$. Note that the first derivative of $\ln L$ with respect to each parameter vanishes at the parameter values that maximize the likelihood, so the leading non-trivial term in a Taylor expansion about the maximum is the quadratic one. The likelihood is the product of the probabilities of observing some sample of random variables $\vec x$ given model parameters $\vth$. 

A Hessian matrix is a square matrix of second-order partial derivatives that describes the local curvature of a function of many variables. It also contains the coefficients of the quadratic terms in a local Taylor expansion of a function.

To the extent that the likelihood is well described by a quadratic Taylor expansion, the expected error on parameter $\theta_i$ is
$$
\sigma_i \equiv \sigma(\theta_{i}) = (F^{-1}_{ii})^{1/2}
$$
if all of the parameters are being estimated from the data set and
$$
\sigma_i \equiv \sigma(\theta_{i}) = (F_{ii})^{-1/2}
$$
if all parameters other than $\theta_i$ are known.
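The difference between the two expressions can be seen with a small numerical sketch. The $2\times 2$ Fisher matrix below is made up purely for illustration; the only point is that the marginalized errors from $F^{-1}$ are never smaller than the errors obtained when the other parameter is held fixed.

```python
import numpy as np

# Hypothetical 2x2 Fisher matrix with correlated parameters
F = np.array([[4.0, 3.0],
              [3.0, 9.0]])

cov = np.linalg.inv(F)                           # forecast parameter covariance
sigma_marginalized = np.sqrt(np.diag(cov))       # all parameters fit together
sigma_conditional = 1.0 / np.sqrt(np.diag(F))    # other parameter held fixed

print("marginalized errors:", sigma_marginalized)
print("conditional errors: ", sigma_conditional)
```

The two sets of errors coincide only when the Fisher matrix is diagonal, that is, when the parameter errors are uncorrelated.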

Under more general conditions, the error of any unbiased estimator must be greater than or equal to these values, a result known as the *Cramér-Rao bound*. The Cramér-Rao bound is a lower bound on the variance of unbiased estimators of a parameter: the variance must be at least as large as the corresponding diagonal element of the inverse of the Fisher matrix. This is a statement of the *best* one can do. It is always possible to do worse!

In _Fisher matrix forecasting_, we assume a fiducial model and properties of a data set to predict the Fisher matrix and thereby forecast the errors that will be obtained on model parameters.

There is a pretty good high-level discussion of this in section 2 of Tegmark, Taylor, & Heavens (1997, ApJ, 480, 22)
and a valuable but dense presentation in Gould (2003).

Note that a Fisher matrix forecast only gives you accurate error forecasts *if* the 2nd-order expansion of the likelihood is accurate.

If you're worried this might not be true, then you can use MCMC instead, with your anticipated measurement errors and setting the data equal to the values expected for your fiducial model.

### Parameter sensitivity and observational errors

We can decompose a Fisher matrix into a matrix product:

$$
F_{ij} = - \left\langle{\partial^2\ln L \over \partial\theta_i\partial\theta_j}
                \right\rangle
       = -\left\langle {\partial \dy_k \over \partial \theta_i}\cdot
                     {\partial^2 \ln L \over \partial \dy_k\partial \dy_l} \cdot
		       {\partial \dy_l \over \partial \theta_j}\right\rangle,
$$

where $\dy_k = \ymod(x_k)-y_k$ is the difference between the model prediction for data point $k$ and the observed value, and we are using the Einstein summation convention. Effectively the partial derivatives of $\Delta y$ with respect to the parameters describe the sensitivity of the observables to the parameters, and the inner second derivative of the likelihood with respect to $\Delta y$ describes the errors on the observables.

Because the data values do not depend on the model parameters (they are just observed),

$$
{\partial \dy_k \over \partial \theta_i} =
{\partial \ymod(x_k) \over \partial \theta_i}~.
$$

As we will show below, _if the errors on the observables are Gaussian and independent of the model parameters,_ then

$$
- {\partial^2\ln L \over \partial \dy_k\partial \dy_l}  = \cinvkl,
$$

the inverse covariance matrix. Note the minus sign here, and that in subsequent expressions the Fisher matrix is positive. 

Thus, the Fisher matrix has an "outer" piece $\partial \vec{y}_{\rm mod}/\partial{\vec{\theta}}$ that represents
the sensitivity of the observables to the parameters and an "inner" piece that represents the errors on the observables themselves.

In the 1-parameter, 1-observable case, $C^{-1} = 1/\sigma^2_y$ and we get
$$
F_{11} = {d\ymod \over d\theta} \cdot {1\over \sigma_y^2} \cdot
         {d\ymod \over d\theta}.
$$
Recall that $\sigma_y$ is the uncertainty on the datum $y$ and $\sigma(\theta)$ is the uncertainty on the parameter $\theta$. This result implies
$$
\sigma(\theta) = (F_{11})^{-1/2},
$$
or equivalently
$$
\sigma^2(\theta) = 1/F_{11} = \sigma_y^2 \cdot
  \left({d\ymod\over d\theta}\right)^{-2},
$$
in agreement with our earlier chain rule result.

For a Fisher matrix forecast of parameter errors, we compute the
parameter sensitivity from our model, and we take the expected
values of the observable errors (and their covariances).

As far as I know, $\partial{\vec{y}_{\rm mod}}/\partial{\vec{\theta}}$ doesn't
have a special name, but we can think of it as an "influence matrix"
or "sensitivity matrix."

While computing the Fisher matrix requires assumptions about the
data set, the sensitivity matrix requires only knowledge of the
model, and it can be an interesting quantity to compute even if
one doesn't have a specific data set in mind.



### Fisher matrix for Gaussian Likelihoods 

Suppose we have a Gaussian likelihood function for $N$ data points
$$
L = \frac{1}{ (2 \pi)^{N/2} \sqrt{ {\rm det} {\bf C}} } \exp \left( -\frac{1}{2} \Delta y_m \cinvmn \Delta y_n \right)
$$

$$
-\ln L = {N \over 2}\ln(2\pi) + {1\over 2}\ln[{\rm det}({\bf C})]
  + {1 \over 2}\dy_m\cinvmn\dy_n.
$$
(My reason for changing $kl$ indices to $mn$ indices will become evident shortly.)

For the case of a diagonal covariance matrix (no covariance), $C_{mn}=\sigma_m^2\delta_{mn}$, this expression becomes
$$
-\ln L = {N \over 2}\ln(2\pi) + {1\over 2}\sum \ln\sigma_m^2 +
  {1 \over 2}\sum {\dy_m^2 \over \sigma_m^2},
$$
but we will consider the full covariant case.
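A minimal sketch of these two expressions in code is given below. The function name `neg_log_like` and the residuals and errors are hypothetical; the check is simply that the full matrix expression reduces to the diagonal sum when $C_{mn}=\sigma_m^2\delta_{mn}$.

```python
import numpy as np

def neg_log_like(dy, C):
    """-ln L for residuals dy with covariance matrix C (Gaussian errors)."""
    N = len(dy)
    _, logdet = np.linalg.slogdet(C)        # ln det(C), numerically stable
    chi2 = dy @ np.linalg.solve(C, dy)      # dy_m Cinv_mn dy_n without forming Cinv
    return 0.5 * N * np.log(2 * np.pi) + 0.5 * logdet + 0.5 * chi2

# Hypothetical residuals and a diagonal covariance matrix
sigma = np.array([0.5, 1.0, 2.0])
dy = np.array([0.3, -1.2, 0.8])
C = np.diag(sigma**2)

full = neg_log_like(dy, C)
diagonal = (0.5 * len(dy) * np.log(2 * np.pi)
            + 0.5 * np.sum(np.log(sigma**2))
            + 0.5 * np.sum(dy**2 / sigma**2))
print(full, diagonal)
```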

Assume that we can ignore any dependence of the covariance matrix on the model parameters.  _This is a non-trivial assumption that will not always hold._  

For example, in cosmological applications we  sometimes have "cosmic variance" errors that depend on the amplitude
of matter or galaxy clustering, and the expected size of these errors depends on the cosmological parameters.  

However, if we have a data set that provides tight constraints on parameters, then the allowed model dependence of the covariance matrix usually cannot be large.

If we do make this assumption, then the derivative of ${\rm det}{\bf C}$ with respect to parameters vanishes, and $\cinvmn$ is independent of $\dy_k$ and $\dy_l$, allowing us to rearrange the
"inner" piece of the Fisher matrix:
$$\eqalign{
{1 \over 2} {\partial^2(\Delta y_m\cinvmn\dy_n) \over \partial\Delta y_k\partial\Delta y_l}
&=
{1 \over 2} \sum_{mn} \cinvmn {\partial^2(\Delta y_m\Delta y_n) \over 
                               \partial\Delta y_k \partial\Delta y_l} \cr
&=
{1 \over 2} \sum_{mn} \cinvmn {\partial \over \partial\Delta y_k}
   \left({\partial(\Delta y_m\dy_n) \over \partial\Delta y_l}\right) \cr
&=
{1 \over 2} \sum_{mn} \cinvmn {\partial \over \partial\dy_k}
   \left(\dy_m {\partial \dy_n \over \partial \dy_l} + 
         \dy_n {\partial \dy_m \over \partial \dy_l}\right) \cr
&=
{1 \over 2} \sum_{mn} \cinvmn {\partial \over \partial\dy_k}
   \left(\dy_m \delta_{nl} + \dy_n \delta_{ml}\right) \cr
&=
{1 \over 2} \sum_{mn} \cinvmn (\delta_{km}\delta_{nl} + \delta_{kn}\delta_{ml})
  \cr
&= 
\cinvkl~.
}
$$

On the right-hand sides I have written out sums explicitly for clarity and interchanged sums and derivatives.  

This derivation is a bit mathematically loose, but I think it is correct.
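One way to gain some confidence in this result is a numerical check: build a covariance matrix, evaluate ${1\over 2}\dy_m\cinvmn\dy_n$, and compare its finite-difference Hessian with respect to the residuals against $C^{-1}$. The matrix size, random seed, and step size below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 3x3 covariance matrix (symmetric, positive definite)
B = rng.standard_normal((3, 3))
C = B @ B.T + 3.0 * np.eye(3)
Cinv = np.linalg.inv(C)

def half_chi2(dy):
    """The quadratic form (1/2) dy_m Cinv_mn dy_n."""
    return 0.5 * dy @ Cinv @ dy

# Central finite-difference Hessian with respect to the residuals dy
dy0 = rng.standard_normal(3)
h = 1e-3
unit = np.eye(3)
H = np.empty((3, 3))
for k in range(3):
    for l in range(3):
        ek, el = h * unit[k], h * unit[l]
        H[k, l] = (half_chi2(dy0 + ek + el) - half_chi2(dy0 + ek - el)
                   - half_chi2(dy0 - ek + el) + half_chi2(dy0 - ek - el)) / (4 * h**2)

print(np.max(np.abs(H - Cinv)))   # should be at the level of round-off
```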

Including the "outer" piece, the Fisher matrix is
$$
F_{ij} = 
 {\partial\dy_k\over \partial\theta_i}\cinvkl
 {\partial\dy_l\over\partial\theta_j} = 
 {\partial y_{\rm mod}(x_k)\over \partial\theta_i}\cinvkl
 {\partial y_{\rm mod}(x_l)\over\partial\theta_j}.
$$

Though notationally different, I think this is equivalent to equation (15) of Tegmark et al. (1997), except that the
term ${\bf A}_i{\bf A}_j$ in that equation has vanished because we have assumed that the dependence of $C_{ij}$ on the parameters can be neglected.
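The expression for $F_{ij}$ translates almost directly into a matrix product. The sketch below assumes a hypothetical model $\ymod(x) = A\,e^{-x/\tau}$ with analytic derivatives, and an invented covariance matrix with correlated neighboring points; none of these choices come from the references above.

```python
import numpy as np

def fisher_matrix(J, C):
    """F_ij = J[k, i] Cinv[k, l] J[l, j], with J[k, i] = d y_mod(x_k) / d theta_i."""
    return J.T @ np.linalg.solve(C, J)

# Hypothetical model y_mod(x) = A * exp(-x / tau), evaluated at fiducial values
A_fid, tau_fid = 3.0, 2.0
x = np.linspace(0.0, 6.0, 7)
decay = np.exp(-x / tau_fid)
J = np.column_stack([decay,                             # d y_mod / d A
                     A_fid * x / tau_fid**2 * decay])   # d y_mod / d tau

# Assumed covariance: variance 0.01 with correlated neighboring points
n = x.size
C = 0.01 * (np.eye(n) + 0.3 * (np.eye(n, k=1) + np.eye(n, k=-1)))

F = fisher_matrix(J, C)
sigma_theta = np.sqrt(np.diag(np.linalg.inv(F)))
print("Fisher matrix:\n", F)
print("forecast errors on (A, tau):", sigma_theta)
```

Using `np.linalg.solve` avoids forming $C^{-1}$ explicitly, which is usually preferable numerically when the covariance matrix is large.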

### Straight Line Model

Now consider a model $\ymod(x) = \theta_1 + \theta_2 x$ where $\Delta y_k = y_{mod}(x_k) - y_k$ and so $\Delta y_k = \theta_1 + \theta_2 x_k - y_k$. 

The derivatives are
$$
{\partial\dy_k \over \partial\theta_1} = 1, \qquad
{\partial\dy_k \over \partial\theta_2} = x_k,
$$
making the $2\times 2$ Fisher matrix
$$
F_{ij} = \pmatrix{ \sum\cinvkl & \sum \cinvkl x_l \cr
                   \sum x_k\cinvkl & \sum x_k\cinvkl x_l},
$$
where the summation is over $k$ and $l$. 

This matrix can be inverted recalling that for 
$$
A = \pmatrix{ a & b \cr c & d}, \qquad
A^{-1} = {1\over ad-bc}\pmatrix{d & -b \cr -c & a}.
$$

The errors on the intercept and slope are, respectively, $(F_{11}^{-1})^{1/2}$ and $(F_{22}^{-1})^{1/2}$.

For a diagonal covariance matrix $C_{kl} = \delta_{kl}\sigma_k^2$,
$$
F_{ij} = \pmatrix{ \sum\sigma_k^{-2} & \sum x_k\sigma_k^{-2} \cr
                   \sum x_k\sigma_k^{-2} & \sum x_k^2\sigma_k^{-2}}
       = {N \over \sigma^2} \pmatrix{ 1 & \langle x \rangle \cr
                                      \langle x \rangle & \langle x^2 \rangle},
$$
where $\langle x \rangle \equiv N^{-1}\sum x_k$ and $\langle x^2 \rangle \equiv N^{-1}\sum x_k^2$, and the second equality is for homoscedastic errors $\sigma_k = \sigma$. Note that the indices $i,j$ label parameters, while the sums over $k$ run over the data points. 

Inverting the last case yields
$$
F_{ij}^{-1} = {\sigma^2 \over N}(\langle x^2\rangle - \langle x \rangle^2)^{-1}
  \pmatrix{ \langle x^2 \rangle & -\langle x \rangle \cr
            -\langle x \rangle & 1}.
$$

This is the forecast for the covariance matrix. 

This expression is the same as the top equation on p. 5 of Gould (2003). Here $F_{ij}^{-1}$ refers to the forecast covariance matrix of the parameter errors, and $\langle x\rangle$ and $\langle x^2 \rangle$ refer to expected properties of the data set. In Gould (2003), the $\langle ... \rangle$ averages are over the actual data points obtained, and the result is the actual covariance matrix of the parameter errors.
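The homoscedastic forecast is simple enough to evaluate for different experimental designs. The sketch below compares two made-up sets of $x$ values with the same assumed error $\sigma$; the helper name `line_forecast` is hypothetical.

```python
import numpy as np

def line_forecast(x, sigma):
    """Fisher matrix and analytic inverse for (intercept, slope), equal errors sigma."""
    N = len(x)
    mean_x, mean_x2 = np.mean(x), np.mean(x**2)
    F = (N / sigma**2) * np.array([[1.0, mean_x],
                                   [mean_x, mean_x2]])
    cov = (sigma**2 / N) / (mean_x2 - mean_x**2) * np.array([[mean_x2, -mean_x],
                                                             [-mean_x, 1.0]])
    return F, cov

sigma = 0.2                                  # assumed error on each y value
x_narrow = np.array([0.9, 1.0, 1.1, 1.2])    # hypothetical narrow design
x_wide = np.array([0.0, 1.0, 2.0, 3.0])      # hypothetical wide design

F_narrow, cov_narrow = line_forecast(x_narrow, sigma)
F_wide, cov_wide = line_forecast(x_wide, sigma)

slope_err_narrow = np.sqrt(cov_narrow[1, 1])
slope_err_wide = np.sqrt(cov_wide[1, 1])
print("slope error, narrow design:", slope_err_narrow)
print("slope error, wide design:  ", slope_err_wide)
```

The slope variance is $\sigma^2/[N(\langle x^2\rangle - \langle x\rangle^2)]$, so spreading the points over a wider range in $x$ reduces the forecast slope error.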

### $\chi^2$-minimization for a general linear model

My discussion here follows that of Gould (2003) but with different notation.

Our analysis of the straight-line model can be generalized to the fit of a model that is *linear* in the parameters $\theta_i$,
$$
\ymod(x) \equiv \sum_{i=1}^n \theta_i f_i(x),
$$
where the $f_i(x)$ are specified functions.

Note that the $f_i(x)$ do not need to be linear, e.g., we could have
$$
\ymod(x) = \theta_1 + \theta_2 x + \theta_3 x^2 +\theta_4 \sin(2\pi x).
$$

Again defining $\dy_k \equiv \ymod(x_k) - y_k$, we now have
$$
{\partial \dy_k \over \partial \theta_i} = f_i(x_k),
$$
as $\ymod$ is linear in the $\theta_i$. Therefore, for a Gaussian likelihood function,
$$
F_{ij} = f_i(x_k) \cinvkl f_j(x_l) = {\partial \dy_k \over \partial \theta_i} \cinvkl {\partial \dy_l \over \partial \theta_j}
$$

As before, 
$$
\sigma_{ij} = F_{ij}^{-1}
$$
is the expected covariance matrix of the parameter errors, with $(F_{ii}^{-1})^{1/2}$ the expected error on parameter $\theta_i$ if all parameters must be estimated from the data.

This is equivalent to the result on pp. 3 and 4 of Gould (2003), with the notational translations
$$
{\cal B}_{kl} = \cinvkl \qquad b_{ij} = F_{ij} \qquad c_{ij}=F_{ij}^{-1}.
$$


Importantly, Gould also derives the solution for the minimum $\chi^2$ (maximum likelihood) values of the parameters by requiring $\partial\chi^2/\partial\theta_i = 0$.

The result is
$$
\hat{\theta}_i = F_{ij}^{-1}\left[y_k\cinvkl f_j(x_l)\right] \quad 
  (= c_{ij} d_{j} \hbox{ in Gould's notation}),
$$
where there is an implicit sum over $k,l$ inside the $[..]$, and a sum over $j$. See his page 3, and note $a_i = c_{ij} d_j$. 

_This is a general result for $\chi^2$ fitting of a model that is
linear in the parameters $\theta_i$,
with $F_{ij}$ defined as $f_i(x_k)\cinvkl f_j(x_l)$._


For a diagonal covariance matrix $\cinvkl = \sigma_{k}^{-2}\delta_{kl}$, the Fisher matrix is 
$$
F_{ij} = \sum_{k=1}^N {f_i(x_k) f_j(x_k) \over \sigma_k^2}
$$ 
and 
$$
y_k \cinvkl f_j(x_l) = \sum_{k=1}^N {y_k f_j(x_k) \over \sigma_k^2},
$$
where the sums run over the $N$ data points.

These are the two quantities needed to compute the parameter values for a diagonal covariance matrix. 

As previously emphasized, a diagonal covariance matrix does _not_ imply a diagonal Fisher matrix.  One can have independent data points but still have correlated parameter errors, and vice versa. (Recall the example of the slope and intercept of a line.)

_For $\chi^2$-minimization of a general linear model, one can find best-fit parameter values and the covariance matrix of parameter errors "analytically" (numerical matrix inversions may be required)._
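A minimal sketch of this procedure for a diagonal covariance matrix is shown below, using the example basis $1$, $x$, $x^2$, $\sin(2\pi x)$ from earlier in this section. The true parameter values, the errors, and the random seed are invented for the illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Basis functions f_i(x) for y = theta_1 + theta_2 x + theta_3 x^2 + theta_4 sin(2 pi x)
basis = [lambda x: np.ones_like(x),
         lambda x: x,
         lambda x: x**2,
         lambda x: np.sin(2 * np.pi * x)]

x = np.linspace(0.0, 2.0, 50)
sigma = 0.1 * np.ones_like(x)                   # assumed independent errors
theta_true = np.array([1.0, -0.5, 0.3, 0.8])    # hypothetical true parameters

A = np.column_stack([f(x) for f in basis])      # A[k, i] = f_i(x_k)
y = A @ theta_true + sigma * rng.standard_normal(x.size)

F = A.T @ (A / sigma[:, None]**2)    # F_ij = sum_k f_i(x_k) f_j(x_k) / sigma_k^2
d = A.T @ (y / sigma**2)             # d_j  = sum_k y_k f_j(x_k) / sigma_k^2
cov = np.linalg.inv(F)               # covariance matrix of the parameter errors
theta_hat = cov @ d                  # best-fit parameters
theta_err = np.sqrt(np.diag(cov))

for name, value, err in zip(["theta_1", "theta_2", "theta_3", "theta_4"],
                            theta_hat, theta_err):
    print(f"{name} = {value:.3f} +/- {err:.3f}")
```

For a full covariance matrix, the same two quantities would be built with $C^{-1}$ in place of the $1/\sigma_k^2$ weights.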

Gould (2003) also gives solutions for cases where one imposes
constraints on the parameters (e.g., $\theta_1 + 2\theta_2 = 0$).

### Illustration for a straight line

If we adopt a diagonal covariance matrix and further specify a straight-line model, $f_1(x)=1$, $f_2(x)=x$, we obtain our earlier result for the Fisher matrix, but I will now adopt notation from Numerical Recipes section 15.2:
$$
F_{ij} = \pmatrix{S & S_x \cr S_x & S_{xx}} 
$$
with the inverse-variance weighted sums
$$
S \equiv \sum \sigma_k^{-2}, \quad S_x \equiv \sum x_k\sigma_k^{-2},
\quad S_{xx} = \sum x_k^2\sigma_k^{-2}.
$$
The inverse Fisher matrix is
$$
F_{ij}^{-1} = {1\over S\,S_{xx} - S_x^2}\pmatrix{S_{xx} & -S_x \cr -S_x & S}.
$$
The vector $d_j \equiv y_k\cinvkl f_j(x_l)$ in Gould's notation is $(d_1,d_2)$ with
$$
d_1 = \sum y_k\sigma_k^{-2} \equiv S_y, \quad
d_2 = \sum x_k y_k\sigma_k^{-2} \equiv S_{xy}. 
$$

The minimum-$\chi^2$ solution is then
$$\eqalign{
\theta_1 &= F^{-1}_{11} d_1 + F^{-1}_{12} d_2 = 
  {S_{xx} S_y - S_x S_{xy} \over S S_{xx} - S_x^2} \cr
\theta_2 &= F^{-1}_{21} d_1 + F^{-1}_{22} d_2 = 
  {S S_{xy} - S_x S_y \over S S_{xx} - S_x^2},
}
$$
in agreement with equation 15.2.6 of NR. 
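The sums above are easy to code up directly. The sketch below generates a made-up data set with heteroscedastic errors, evaluates the closed-form solution, and cross-checks it against `np.polyfit`, whose `w` argument takes weights of $1/\sigma_k$ for Gaussian errors.

```python
import numpy as np

rng = np.random.default_rng(7)

x = np.linspace(0.0, 10.0, 25)
sigma = rng.uniform(0.5, 1.5, x.size)      # hypothetical heteroscedastic errors
y = 1.0 + 2.0 * x + sigma * rng.standard_normal(x.size)

# Inverse-variance weighted sums
w = 1.0 / sigma**2
S, Sx, Sxx = w.sum(), (w * x).sum(), (w * x**2).sum()
Sy, Sxy = (w * y).sum(), (w * x * y).sum()
Delta = S * Sxx - Sx**2

intercept = (Sxx * Sy - Sx * Sxy) / Delta
slope = (S * Sxy - Sx * Sy) / Delta
sigma_intercept = np.sqrt(Sxx / Delta)     # (F^{-1}_{11})^{1/2}
sigma_slope = np.sqrt(S / Delta)           # (F^{-1}_{22})^{1/2}

# Cross-check with numpy's weighted polynomial fit (highest power first)
slope_np, intercept_np = np.polyfit(x, y, 1, w=1.0 / sigma)

print(f"intercept = {intercept:.3f} +/- {sigma_intercept:.3f}")
print(f"slope     = {slope:.3f} +/- {sigma_slope:.3f}")
```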

### Expanding a non-linear model

Suppose that our model is a _non-linear_ function of our
parameters, but we know that the correct parameters are 
small perturbations about a fiducial model with parameters
$\theta_{i,{\rm fid}}$.  

In this case, we can make a Taylor expansion
$$
\ymod (x_k) = y_{\rm mod,fid}(x_k) + \Delta\theta_i 
             {\partial \ymod (x_k) \over \partial\theta_i},
$$
where $\Delta\theta_i = \theta_i - \theta_{i,{\rm fid}}$
and the derivative is evaluated for the fiducial values
of all parameters.

This is now a linear model with parameters $\Delta\theta_i$ 
instead of $\theta_i$ and
$$
f_i(x_k) = {\partial \ymod (x_k)\over \partial\theta_i}.
$$

We can use this expansion to fit parameter values
or compute parameter errors or forecast errors
provided that the errors are small enough that the 
linear Taylor expansion remains accurate.

By definition, this expansion holds exactly for a true linear model.

Most Fisher matrix forecasts implicitly assume this kind of linear 
expansion around a fiducial model, 
so they will give accurate forecasts of parameter 
errors only to the extent that the linear expansion is accurate
over the range of parameters allowed by the data.

An MCMC forecast does not rely on this linear approximation.
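When analytic derivatives are inconvenient, the functions $f_i(x_k)$ of the linearized model can be estimated by finite differences about the fiducial parameters. The sketch below assumes a hypothetical power-law model $\ymod = A x^\beta$, invented fiducial values, and independent errors; the helper `sensitivity_matrix` and its step size are illustrative choices rather than a prescribed method.

```python
import numpy as np

def y_mod(x, theta):
    """Hypothetical non-linear model: power law y = A * x**beta."""
    A, beta = theta
    return A * x**beta

def sensitivity_matrix(model, x, theta_fid, rel_step=1e-5):
    """Central-difference estimate of f_i(x_k) = d y_mod(x_k) / d theta_i."""
    theta_fid = np.asarray(theta_fid, dtype=float)
    J = np.empty((len(x), len(theta_fid)))
    for i in range(len(theta_fid)):
        h = rel_step * max(abs(theta_fid[i]), 1.0)
        step = np.zeros_like(theta_fid)
        step[i] = h
        J[:, i] = (model(x, theta_fid + step) - model(x, theta_fid - step)) / (2 * h)
    return J

x = np.linspace(1.0, 10.0, 20)
theta_fid = np.array([2.0, 1.5])         # assumed fiducial (A, beta)
sigma_y = 0.5 * np.ones_like(x)          # assumed independent errors

J = sensitivity_matrix(y_mod, x, theta_fid)
F = (J / sigma_y[:, None]**2).T @ J      # F_ij = sum_k f_i(x_k) f_j(x_k) / sigma_k^2
sigma_theta = np.sqrt(np.diag(np.linalg.inv(F)))
print("forecast errors on (A, beta):", sigma_theta)
```

The forecast is only as good as the linear expansion: if the forecast errors are comparable to the scale over which the derivatives change, the result should be treated with caution.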

### Adding Fisher matrices

Suppose we have two data sets that are statistically independent.

In this case, the joint likelihood (or posterior probability) is just the product of the individual likelihoods (or posterior probabilities), since $p(x,y)=p(x)p(y)$ for independent variables. For example, say we have likelihoods $L_1 = \prod_i p_{i,1}$ and $L_2 = \prod_k p_{k,2}$: 

$$
L_{T} = L_1 L_2 = \prod_i p_{i,1} \prod_k p_{k, 2}, 
$$

$$
\ln L_{T} = \ln L_1 + \ln L_2 = \sum_i \ln p_{i,1} + \sum_k \ln p_{k,2}.
$$

Therefore, one obtains $\langle \ln L \rangle$ for the two data sets by adding the two individual values of $\langle \ln L \rangle$, and the Fisher matrix for the two data sets is just the sum of the Fisher matrices for the individual data sets.

$$
F_{ij,T} = - \left\langle{\partial^2\ln L_1 \over \partial\theta_i\partial\theta_j}
                \right\rangle - \left\langle{\partial^2\ln L_2 \over \partial\theta_i\partial\theta_j}
                \right\rangle
$$

$$
F_{ij,T} = F_{ij,1} + F_{ij, 2}
$$


This still holds even if the data sets are quite different in 
character provided they constrain the same underlying parameters.

For example, one can forecast cosmological parameter errors that
will be obtained by joint fits to CMB data, supernova data, and
a direct measurement of $H_0$ by adding the Fisher matrices
for the three data sets.

This is a powerful technique.

Note that Fisher information scales like an inverse variance,
$F \propto \sigma^{-2}$, and information from independent
data sets adds linearly.
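A short sketch makes the additivity concrete for the straight-line model. The two data sets below are invented; the check is that the sum of the individual Fisher matrices equals the Fisher matrix of the combined data set, and that the combined errors are no larger than either set alone.

```python
import numpy as np

def line_fisher(x, sigma):
    """Fisher matrix for y = theta_1 + theta_2 x with independent errors sigma_k."""
    A = np.column_stack([np.ones_like(x), x])
    return A.T @ (A / sigma[:, None]**2)

def forecast_errors(F):
    """Marginalized parameter errors from a Fisher matrix."""
    return np.sqrt(np.diag(np.linalg.inv(F)))

# Two hypothetical independent data sets constraining the same line
x1, s1 = np.linspace(0.0, 1.0, 10), np.full(10, 0.1)
x2, s2 = np.linspace(5.0, 6.0, 10), np.full(10, 0.3)

F1, F2 = line_fisher(x1, s1), line_fisher(x2, s2)
F_total = F1 + F2
F_joint = line_fisher(np.concatenate([x1, x2]), np.concatenate([s1, s2]))

print("set 1 alone:", forecast_errors(F1))
print("set 2 alone:", forecast_errors(F2))
print("combined:   ", forecast_errors(F_total))
```

In this example the slope is much better constrained by the combination than by either set alone, because the two sets together span a longer baseline in $x$.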
