In [None]:
factor=0.6

from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'scroll': True,
        'width': 1024*factor,
        'height': 768*factor
})


# Classical Inference

K. Leighly 2017

This lecture was drawn from the following sources:
 - G. Richards "Inference" lecture
 - Ivezic Chapter 4
 - "Data Reduction and Error Analysis in the Physical Sciences", P. R. Bevington
 - "Numerical Recipes: The Art of Scientific Computation" 3rd Edition, W. H. Press, S. A. Teukolsky, W. T. Vetterling, B. P. Flannery

Statistical *inference* is about drawing conclusions from data, specifically determining the properties of a population by data sampling.  

This is a complicated topic, and it would be easy to build a full-semester course on it.  We will spend only one lecture period on it, covering a selection of the most basic results, and some useful applications.

In particular, we will discuss: 
 - Maximum likelihood estimation (MLE), a fundamental result that will lead us directly to a discussion of $\chi^2$ and its application and interpretation.
 - Bootstrap error analysis
 - Hypothesis Testing and Comparison of Distributions
 
Ivezic Chapter 4 covers other interesting and useful topics:
 - The Expectation Maximization Algorithm (we will discuss this when we talk about PCA).
 - Histograms (we will cover this later, when we talk about cluster analysis)




## Maximum Likelihood Estimation (MLE)

Maximum Likelihood Estimation is a cornerstone of classical inference.  We have already touched upon some aspects of this in our discussion of Gaussian distributions.

If we know the distribution from which our data were drawn, then we can compute the **probability** or **likelihood** of our data.  

For example, if you know that your data are drawn from a model with a Gaussian distribution, then we've already seen that the probability of getting a specific value of $x$ is given by
$$p(x|\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(x-\mu)^2}{2\sigma^2}\right).$$

If we want to know the total likelihood of our *entire* data set (as opposed to one measurement) then we must compute the *product* of all the individual probabilities (assuming that all the data are drawn independently):
$$L \equiv p(\{x_i\}|M(\theta)) = \prod_{i=1}^n p(x_i|M(\theta)),$$
where $M$ refers to the *model* and $\theta$ refers collectively to the $k$ parameters of the model, which can be multi-dimensional.

In words, this is *the probability of the data given the model*.  However, note that while the components of $L$ may be normalized pdfs, their product is not.  

The product can be very small, so we often take the log of $L$.
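To see why the log is useful, here is a minimal sketch (the sample size, seed, and unit Gaussian are arbitrary choices for illustration, not part of the lecture) comparing the direct product of the pdf values with the sum of their logs.

```python
import numpy as np
from scipy import stats

# Simulated data: 2000 draws from a unit Gaussian (assumed values)
rng = np.random.default_rng(42)
x = rng.normal(loc=0.0, scale=1.0, size=2000)

# Direct product of the individual probabilities densities
L_direct = np.prod(stats.norm.pdf(x, loc=0.0, scale=1.0))

# Sum of the log-probabilities instead
lnL = np.sum(stats.norm.logpdf(x, loc=0.0, scale=1.0))

print('Direct product L = {}'.format(L_direct))
print('Sum of logs lnL  = {:.2f}'.format(lnL))
```

Each pdf value is less than 0.4, so the product of 2000 of them is far below the smallest number a double can represent and is stored as zero, while the log-likelihood remains a perfectly ordinary number.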

We can write this out as
$$L = \prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(x_i-\mu)^2}{2\sigma^2}\right),$$
and simplify to
$$L = \left[\prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \right] \exp\left( -\frac{1}{2} \sum_{i=1}^n \left[\frac{x_i-\mu}{\sigma} \right]^2 \right),$$

where we have written the product of the exponentials as the exponential of the sum of the arguments, which will make things easier to deal with later.

That is, we have done this: $$\prod_{i=1}^n A_i \exp(-B_i) = (A_1A_2\ldots A_n) \exp[-(B_1+B_2+\ldots+B_n)]$$

If you have done $\chi^2$ analysis (e.g., a linear least-squares fit), then you might notice that the exponential factor is just 
$$\exp \left(-\frac{\chi^2}{2}\right).$$

That is, for our Gaussian distribution
$$\chi^2 = \sum_{i=1}^n \left ( \frac{x_i-\mu}{\sigma}\right)^2.$$

So, maximizing the likelihood is the same as minimizing $\chi^2$.

We will discuss this in more detail below.

Note that we could have asked instead about the probability of the *model* given the *data*:
$$p(M(\theta)|\{x_i\}).$$

We'll come back to that in the next lecture; that is a fundamental aspect of Bayesian inference.

That's the **likelihood**, but what is the **maximum likelihood**?

Well, let's say that we know that some data were drawn from a Gaussian distribution, but we don't know the parameters $\theta = (\mu,\sigma)$ of that distribution.  Then MLE is about varying the parameters until we find the maximal value of $L$.  Simple as that.

## MLE applied to a Homoscedastic Gaussian

Let's take a look at an example using a Gaussian model where all the measurements have the same error ($\sigma$).  This is known as having **homoscedastic** errors.  

For an experiment with data $D=\{x_i\}$ in 1D with Gaussian errors, we have
$$L \equiv p(\{x_i\}|\mu,\sigma) = \prod_{i=1}^N \frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(x_i-\mu)^2}{2\sigma^2}\right).$$

Note that this is $p(\{x_i\})$, not $p(x_i)$; that is, the probability _of the full data set_, not just one measurement.

If $\sigma$ is both uniform and *known*, then this is a one-parameter model with $k=1$ and $\theta_1=\mu$. 

As we found above, likelihoods can be really small, so let's define the *log-likelihood function* as ${\rm lnL} = \ln[L(\theta)]$.  The maximum of this function happens at the same place as the maximum of $L$.  Note that any constants in $L$ have the same effect for all model parameters, so constant terms can be ignored.  

In this case we then have $${\rm lnL} = {\rm constant} - \sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma^2}.$$

Take a second and make sure that you understand how we got there.  It might help to remember that above, we wrote

$$L = \left[\prod_{i=1}^n \frac{1}{\sigma\sqrt{2\pi}} \right] \exp\left( -\frac{1}{2} \sum_{i=1}^n \left[\frac{x_i-\mu}{\sigma} \right]^2 \right).$$

We then determine the maximum in the same way that we always do.  It is the parameter set for which the derivative of ${\rm lnL}$ is zero:
$$\frac{d\;{\rm lnL}(\mu)}{d\mu}\Biggr\rvert_{\mu_0} \equiv 0.$$

That gives $$ \sum_{i=1}^N \frac{(x_i - \mu_0)}{\sigma^2} = 0.$$

Since $\sigma = {\rm constant}$, that says 
$$\sum_{i=1}^N x_i = \sum_{i=1}^N \mu_0 = N \mu_0.$$

Thus we find that
$$\mu_0 = \frac{1}{N}\sum_{i=1}^N x_i,$$
which is just the arithmetic mean of all the measurements.

This is a very common result: the most likely value is the arithmetic mean.  Many people, if asked what is the most probable value, or the expected value from the data set, would posit the mean.  But take special note; we have implicitly assumed that the underlying distribution was Gaussian, and that all the errors are the same.  If the underlying distribution is not Gaussian, then this may not be a good estimator of the most probable value.

(And yes, we already discussed this in the previous lecture, but it is such a fundamental result that it bears repeating.)
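As a quick numerical check of this result, the sketch below maximizes the Gaussian log-likelihood over $\mu$ directly and compares the answer with the arithmetic mean. The data values, the seed, and the use of `scipy.optimize.minimize_scalar` are choices made here for illustration only.

```python
import numpy as np
from scipy import optimize

# Simulated homoscedastic data (assumed true mean 10, known sigma 2)
rng = np.random.default_rng(0)
sigma = 2.0
x = rng.normal(loc=10.0, scale=sigma, size=50)

def neg_lnL(mu):
    # Negative log-likelihood, dropping the constant term
    return np.sum((x - mu)**2) / (2.0 * sigma**2)

# Maximizing lnL is the same as minimizing -lnL
res = optimize.minimize_scalar(neg_lnL)
mu_mle = res.x

print('Numerical MLE of mu: {:.6f}'.format(mu_mle))
print('Arithmetic mean:     {:.6f}'.format(x.mean()))
```

The two numbers agree to within the tolerance of the optimizer, as the derivation says they must.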

The uncertainty on our MLEs can be determined from the covariance matrix:
$$\sigma_{jk} = \left( - \frac{\partial^2 \ln L}{\partial\theta_j\,\partial\theta_k} \Biggr\rvert_{\theta=\theta_0}\right)^{-1/2}.$$

- The marginal error bars for each parameter, $\theta_i$ are given by the diagonal elements, $\sigma_{ii}$.  
- If $\sigma_{jk}=0$ for $j \ne k$ then the parameters are uncorrelated.  In this case, the $\sigma_{ii}$ are direct analogs of error bars in the one-dimensional problem.
- If $\sigma_{jk} \ne 0$ for $j \ne k$, then the errors for parameters $\theta_j$ and $\theta_k$ are correlated.  You can picture this, in the two dimensional space, as a bivariate Gaussian with principal axes that are not aligned with coordinate axes.  One of the things this tells us is that some combinations of parameters are better determined than others.

(You have already worked with the bivariate Gaussian in the homework.)

Where does this come from?  It is derived by expanding $\ln L$ in a Taylor expansion.  

Recall the Taylor expansion of $f(x)$ about $x=a$:

$$f(x) = f(a)+\frac{f^\prime(a)}{1!}(x-a)+\frac{f^{\prime\prime}(a)}{2!}(x-a)^2+\frac{f^{\prime\prime\prime}(a)}{3!}(x-a)^3+\dots.$$

We are expanding about the maximum, so the first-derivative term is zero.

When we truncate at the second-order term, we are assuming that the shape of $\ln L$ is a parabola.  This may not necessarily be the case, but it is if the likelihood is Gaussian in the parameters (as the higher order terms are then zero).  

We will talk in more detail in the next lecture about what to do when the likelihood is not parabolic.

In our example, the uncertainty on the mean is 
$$\sigma_{\mu} = \left( - \frac{d^2\ln L(\mu)}{d\mu^2}\Biggr\rvert_{\mu_0}\right)^{-1/2}$$

We find
$$\frac{d^2\ln L(\mu)}{d\mu^2}\Biggr\rvert_{\mu_0} = - \sum_{i=1}^N\frac{1}{\sigma^2} = -\frac{N}{\sigma^2},$$
since, again, $\sigma = {\rm constant}$.  

Then $$\sigma_{\mu} = \frac{\sigma}{\sqrt{N}}.$$

So, our estimator of $\mu$ is $\overline{x}\pm\frac{\sigma}{\sqrt{N}}$, which is a result that you should be familiar with.
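The curvature argument can be checked numerically. This is a minimal sketch, with simulated data and a finite-difference step `h` chosen here for illustration, that estimates the second derivative of $\ln L$ at $\mu_0$ and compares the resulting uncertainty with $\sigma/\sqrt{N}$.

```python
import numpy as np

# Simulated data with known, uniform sigma (assumed values)
rng = np.random.default_rng(1)
sigma, N = 2.0, 100
x = rng.normal(5.0, sigma, N)

def lnL(mu):
    # Log-likelihood up to an additive constant
    return -np.sum((x - mu)**2) / (2.0 * sigma**2)

mu0 = x.mean()   # the MLE derived above

# Central finite-difference estimate of the second derivative at mu0
h = 1e-3
d2 = (lnL(mu0 + h) - 2.0 * lnL(mu0) + lnL(mu0 - h)) / h**2

sigma_mu_numeric = (-d2)**-0.5
sigma_mu_analytic = sigma / np.sqrt(N)

print('From the curvature of lnL: {:.5f}'.format(sigma_mu_numeric))
print('sigma/sqrt(N):             {:.5f}'.format(sigma_mu_analytic))
```

Because $\ln L$ is exactly quadratic in $\mu$ here, the finite-difference curvature reproduces the analytic result up to rounding error.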

## MLE applied to a Heteroscedastic Gaussian

Now let's look at a case where the errors are **heteroscedastic**.  For example, suppose we are measuring the length of a rod and have $N$ measurements, $\{x_i\}$, where the error for each measurement, $\sigma_i$, is known.  Since $\sigma$ is no longer a constant, following the above we have

$$\ln L = {\rm constant} - \sum_{i=1}^N \frac{(x_i - \mu)^2}{2\sigma_i^2}.$$

Taking the derivative:
$$\frac{d\;{\rm lnL}(\mu)}{d\mu}\Biggr\rvert_{\mu_0} = \sum_{i=1}^N \frac{(x_i - \mu_0)}{\sigma_i^2} = 0,$$
then simplifying:

$$\sum_{i=1}^N \frac{x_i}{\sigma_i^2} = \sum_{i=1}^N \frac{\mu_0}{\sigma_i^2},$$

yields a MLE solution of 
$$\mu_0 = \frac{\sum_{i=1}^N (x_i/\sigma_i^2)}{\sum_{i=1}^N (1/\sigma_i^2)},$$

with uncertainty
$$\sigma_{\mu} = \left( \sum_{i=1}^N \frac{1}{\sigma_i^2}\right)^{-1/2}.$$

(Note that this is just the same as taking the weighted mean.)
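A short sketch of the weighted mean and its uncertainty follows. The four measurements and their errors are made-up numbers for illustration.

```python
import numpy as np

# Hypothetical rod-length measurements and their known errors
x = np.array([10.2, 9.8, 10.5, 10.0])
sig = np.array([0.1, 0.2, 0.5, 0.1])

# Inverse-variance weights
w = 1.0 / sig**2

# MLE of mu and its uncertainty for heteroscedastic Gaussian errors
mu0 = np.sum(w * x) / np.sum(w)
sigma_mu = np.sum(w)**-0.5

print('Weighted mean: {:.4f} +/- {:.4f}'.format(mu0, sigma_mu))
print('np.average with weights gives: {:.4f}'.format(np.average(x, weights=w)))
```

Note that the combined uncertainty is smaller than the smallest individual error, and that the noisiest measurement ($\sigma_i = 0.5$) contributes very little to the result.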

## Truncated/Censored Data and Other Cost Functions

Note that dealing with missing data points ("censored data") adds complications.  There are ways to treat this from a classical statistics point of view, but we will instead look at this question when we talk about linear regression, and use a Bayesian point of view. 

## "Goodness" of Fit

The MLE approach tells us what the "best" model parameters are, but _not how good the fit actually is_.  If the model is wrong, "best" might not be particularly revealing!  For example, if you have $N$ points drawn from a linear distribution, you can always fit the data perfectly with an $N-1$ order polynomial.  But that won't necessarily perfectly predict future measurements.
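The following sketch illustrates the point with simulated data. The slope, intercept, noise level, and the test position `x_new` are arbitrary values chosen here.

```python
import numpy as np

# N noisy points scattered around a straight line (assumed values)
rng = np.random.default_rng(11)
N = 6
x = np.linspace(0, 1, N)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, N)

# An (N-1)th-order polynomial passes through every point
coeffs = np.polyfit(x, y, N - 1)
resid = y - np.polyval(coeffs, x)
print('Largest residual of the high-order fit: {:.2e}'.format(np.abs(resid).max()))

# Compare predictions just outside the fitted range
x_new = 1.2
pred_poly = np.polyval(coeffs, x_new)
pred_line = np.polyval(np.polyfit(x, y, 1), x_new)
print('True value at x_new:        {:.3f}'.format(2.0 * x_new + 1.0))
print('High-order polynomial says: {:.3f}'.format(pred_poly))
print('Straight line says:         {:.3f}'.format(pred_line))
```

The residuals of the high-order fit are zero to rounding error, but that says nothing about how well the polynomial predicts a point it has not seen.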

We can describe the **goodness of fit** in words simply as _whether or not it is likely to have obtained the measured value of $\ln L_0$ by randomly drawing from the data_.  That means that we need to know the *distribution* of $\ln L$.  

For the Gaussian case we have just described, we can write
$$z_i = (x_i-\mu)/\sigma,$$ then
$$\ln L = {\rm constant} - \frac{1}{2}\sum_{i=1}^N z_i^2 = {\rm constant} - \frac{1}{2}\chi^2.$$

Here, $\chi^2$ is the same thing that you may already be familiar with and whose distribution we discussed very briefly last week; it will be discussed in more detail below.

So, up to an additive constant, $-2\ln L$ is distributed as $\chi^2$ (with $N-k$ degrees of freedom).  

We define the $\chi^2$ per degree of freedom, $\chi^2_{dof}$, as
$$\chi^2_{dof} = \frac{1}{N-k}\sum_{i=1}^N z^2_i.$$

For a good fit, we would expect that $\chi^2_{dof}\approx 1$.  If $\chi^2_{dof}$ is significantly larger than 1, then it is likely that we are not using the correct model.


### $\chi^2$ Distribution

If we have a Gaussian distribution with values $\{x_i\}$ and we scale and normalize them according to

$$z_i = \frac{x_i-\mu}{\sigma},$$

then the sum of squares, $Q$ 

$$Q = \sum_{i=1}^N z_i^2,$$

will follow the $\chi^2$ distribution.  The *number of degrees of freedom*, $k$ is given by the number of data points, $N$ (minus any constraints, i.e., the number of fit parameters).  The pdf of $Q$ given $k$ defines $\chi^2$ and is given by

$$p(Q|k)\equiv \chi^2(Q|k) = \frac{1}{2^{k/2}\Gamma(k/2)}Q^{k/2-1}\exp(-Q/2),$$

where $Q>0$ and the $\Gamma$ function would just be the usual factorial function if we were dealing with integers, but here we have half integers.

Note that the shape of the distribution *only* depends on the sample size $N=k$ and not on $\mu$ or $\sigma$.  

Examples of the $\chi^2$ distribution are shown below.

![Ivezic, Figure 3.14](http://www.astroml.org/_images/fig_chi2_distribution_1.png)
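A simple simulation can confirm that $Q$ follows the $\chi^2$ distribution regardless of $\mu$ and $\sigma$. This is a sketch with arbitrary choices of $k$, $\mu$, $\sigma$, and the number of trials.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
k = 5                    # number of points per data set (degrees of freedom)
mu, sigma = 3.0, 2.0     # arbitrary Gaussian parameters
ntrials = 100000

# Draw many data sets of k points each, standardize, and sum the squares
x = rng.normal(mu, sigma, size=(ntrials, k))
z = (x - mu) / sigma
Q = np.sum(z**2, axis=1)

# The chi2 distribution with k degrees of freedom has mean k and variance 2k
print('Mean of Q:     {:.3f} (expected {})'.format(Q.mean(), k))
print('Variance of Q: {:.3f} (expected {})'.format(Q.var(), 2 * k))

# Kolmogorov-Smirnov distance between the simulated Q and the chi2 cdf
ks_stat = stats.kstest(Q, stats.chi2(k).cdf).statistic
print('KS distance from chi2(k): {:.4f}'.format(ks_stat))
```

Changing `mu` or `sigma` leaves the distribution of `Q` unchanged; only `k` matters.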


## Understanding $\chi^2$

Let us examine the formula for $\chi^2$:

$$\chi^2 = \sum_{i=1}^N \left(\frac{y_i-y}{\sigma_i} \right)^2$$

 - The numerator is a measure of the spread of the observations.
 - The denominator is a measure of the expected spread due to the uncertainty on the data.
 - One might expect the ratio of the numerator and denominator to be $\sim 1$ for a single data point, if the model describes the data well.
 - Then if we sum over $N$ data points, we should obtain about $\chi^2 \sim N$.
 - So reduced $\chi^2_{dof}\sim 1$ for a good fit makes sense.


### Example of Goodness of Fit

Let's generate some simulated data, fit it, and examine the goodness of fit.  This follows a portion of the example in Bishop Chapter 1.

Specifically, we will generate one cycle of a sine wave, and fit it with various polynomials, and examine the quality of the fit.

In [None]:
## Initiate python stuff
%pylab inline
import scipy.stats

In [None]:
# Set up an x-vector
x=np.linspace(0,1,11)

# y is a sine wave
y=np.sin(2*np.pi*x)

#give y some uncertainty, i.e., apply a Gaussian scatter around 
#the theoretically determined values.
ynew=y+np.random.normal(0,0.15,11)

#assign some errors.  These errors don't have any physical interpretation, 
#but are reasonably appropriate, as they are scattered around the 
#sigma of the uncertainty gaussian (=0.15)
err=np.random.uniform(0.1,0.2,11)

#plot ideal y, and data.
plt.plot(x,y)
plt.errorbar(x, ynew, yerr=err, fmt='o')

Now fit with polynomials of different orders.  

In [None]:
#fit the data with a zeroth order polynomial using polyfit
result0=np.polyfit(x,ynew,0,w=1/err)
print('The best fit coefficients {}'.format(result0))

#poly1d will generate the model using the best fit coefficients.  
y0=np.poly1d(result0)
y0out=y0(x)
plt.errorbar(x, ynew, yerr=err, fmt='o')
plt.plot(x,y0out)

The zeroth order polynomial is a very poor fit to the data.  How poor is it?  We can answer that question by evaluating the $\chi^2$, i.e., 

$$\chi^2 = \sum_{i=1}^N \left(\frac{y_i-y}{\sigma_i} \right)^2$$

where $y_i$ are the observed points, $\sigma_i$ are the uncertainties on those points, and $y$ is the model.

So $\chi^2$ is the sum of the weighted squared distance between the data and the model.  

We know from above that minimizing $\chi^2$ is equivalent to maximizing the likelihood, for the case of the Gaussian likelihood.  That is what polyfit is doing.  We will discuss the process of maximizing the likelihood later.

The output of the polyfit are the fit coefficients (in result).  Those can be used to evaluate the model at {$x_i$}.  Then the model can be compared with the data using the formula for $\chi^2$ above.

But now we want to understand what the computed $\chi^2$ tells us. Section 3.3.7 in Ivezic states, "If $\{x_i\}$ are drawn from a Gaussian distribution, and we define $z_i=(x_i-\mu)/\sigma$, then the sum of its squares, $Q=\sum z_i^2$ follows a $\chi^2$ distribution with $k=N$ degrees of freedom$\dots$" (and then an analytical form for the distribution is given; in practice, we can evaluate using the built-in functions from scipy.stats). 

Here, the idea is that the model is the model that produced the data, and the data, at each point, are drawn from a Gaussian distribution around the model at that point, characterized by a mean (which is the model) and the standard deviation, which expresses the noise on the data.  

Bishop Figure 1.16 illustrates this idea.  Keep in mind that we are not talking about modeling data with a Gaussian, but that the noise on the data is distributed in a Gaussian way around the true value. 


The expectation value of the $\chi^2$ distribution is the number of degrees of freedom $N-k$, where $N$ is the number of data points, and $k$ is the number of model parameters.  What this means in practice is that you cannot evaluate the probability of obtaining a certain value alone, you must compare it with that expected for the number of degrees of freedom.  As noted above, the $\chi^2$ distribution is a function of a single parameter, the number of degrees of freedom (above denoted as $k=N$ but here corrected for the complexity of the model as $N-k$).

Moreover, the rule of thumb for $\chi^2$ is that the reduced $\chi^2$, also known as $\chi^2_\nu$ or $\chi^2_{dof}$, should be approximately 1 for a good fit. 

$$\chi^2_{dof}=\frac{1}{N-k} \sum_{i=1}^N z_i^2 \approx 1.$$

Back to our example.  We can compute the $\chi^2$ for the fit, and estimate the probability that the model provides a good fit to the data.

In [None]:
#Compute chi2
chi2=(((ynew-y0out)/err)**2).sum()
print('The value of chi2 is {}'.format(chi2))

from scipy import stats

#number of degrees of freedom equals the number of points minus the number of fit parameters.  
#We drew 11 points, and have a single fit parameter.

dist=stats.chi2(len(x)-1)
#sf = 1 - cdf: the probability of obtaining a chi2 at least this large
print('The probability of obtaining such a high value of chi2 if the model were a correct representation of the data {}'.format(dist.sf(chi2)))

The probability of obtaining the data given that the model is a zeroth order polynomial is very small.

It is not surprising that the zeroth degree polynomial does not provide a good fit to the data.  Let's try a first degree polynomial, i.e., a straight line.

In [None]:
result1=np.polyfit(x,ynew,1,w=1/err)
print('The fit coefficients {}'.format(result1))
y1=np.poly1d(result1)
y1out=y1(x)
plt.errorbar(x, ynew, yerr=err, fmt='o')
plt.plot(x,y1out)

chi2=(((ynew-y1out)/err)**2).sum()
print('The chi2 is {}'.format(chi2))

from scipy import stats
dist=stats.chi2(len(x)-len(result1))
#sf = 1 - cdf: the probability of obtaining a chi2 at least this large
print('The probability that the straight line model could produce the observed data {}'.format(dist.sf(chi2)))


The probability of obtaining our data set using a linear model is larger than the probability of obtaining our data set using a constant model, but it is still not large.  So let's try increasing the order of the polynomial.



Better to write a function rather than repeating the same snippets of code.

In [None]:
def get_polyfit(x,ynew,err,degree):
    result=np.polyfit(x,ynew,degree,w=1/err)
    print('The fit coefficients {}'.format(result))
    ymodel=np.poly1d(result)
    ymodelout=ymodel(x)
    plt.errorbar(x,ynew,yerr=err,fmt='o')
    plt.plot(x,ymodelout)
    chi2=(((ynew-ymodelout)/err)**2).sum()
    print('The chi2 {}'.format(chi2))
    #a polynomial of this degree has degree+1 fit parameters
    dof=len(x)-(degree+1)
    print('The reduced chi2 {}'.format(chi2/dof))
    dist=stats.chi2(dof)
    #sf = 1 - cdf: the probability of obtaining a chi2 at least this large
    print('The probability that the model could produce the observed data {}'.format(dist.sf(chi2)))
    return chi2,dof
    

In [None]:
degree=2
chi2_2,dof_2=get_polyfit(x,ynew,err,degree)

In [None]:
degree=3
chi2_3,dof_3=get_polyfit(x,ynew,err,degree)

Finally, we obtain a decent probability for a third-order polynomial.

But we can continue:

In [None]:
degree=4
chi2_4,dof_4=get_polyfit(x,ynew,err,degree)

In [None]:
degree=5
chi2_5,dof_5=get_polyfit(x,ynew,err,degree)

## F test

We observe that increasing the order beyond 3 does not improve the quality of the fit dramatically.  Is there a way to test that quantitatively?  Yes, we can use the F test.

If two statistics, $\chi_1^2$ and $\chi^2_2$, which follow the $\chi^2$ distribution have been determined, the ratio of the reduced $\chi^2$ is distributed according to the F distribution.

I.e., 

$$f=\frac{\chi_1^2/\nu_1}{\chi^2_2/\nu_2}$$

where $\nu_1$ and $\nu_2$ are the respective degrees of freedom.  

Note there is some ambiguity about which $\chi^2$ should appear in the numerator.  

So let's experiment with this, first by looking at 3 versus 2 and then 4 versus 3.

In [None]:
f_2_vs_3=(chi2_2/dof_2)/(chi2_3/dof_3)

print('chi2_3, dof_3, chi2_2, dof_2: {} {} {} {}'.format(chi2_3,dof_3,chi2_2,dof_2))

print('The value of f: {}'.format(f_2_vs_3))

#F distribution with the numerator dof first, then the denominator dof
dist_f_2_vs_3=stats.f(dof_2,dof_3)
print('The probability of exceeding this value of f with chi2_2 dof versus chi2_3 dof: {}'.format(dist_f_2_vs_3.sf(f_2_vs_3)))

In [None]:
f_4_vs_3=(chi2_4/dof_4)/(chi2_3/dof_3)

print('chi2_4, dof_4, chi2_3, dof_3: {} {} {} {}'.format(chi2_4, dof_4, chi2_3, dof_3))
print('The value of f: {}'.format(f_4_vs_3))

#F distribution with the numerator dof first, then the denominator dof
dist_f_4_vs_3=stats.f(dof_4,dof_3)
print('The probability of exceeding this value of f with chi2_4 dof versus chi2_3 dof: {}'.format(dist_f_4_vs_3.sf(f_4_vs_3)))

The F test is used to determine whether adding another parameter (and therefore removing a degree of freedom) is warranted.  

The probability that such a large F could have been obtained by accident when we went from a second-order to a third-order polynomial is very small. This means that there is a dramatic improvement in fit (which can be seen in the plots, anyway).

The probability that such a large F could have been obtained by accident when we went from a third-order to a fourth-order polynomial is rather large. This means that there is no significant improvement in fit.

We will discuss these concepts more later.


## $\chi^2$ minimum


Now, the claim has been made that the best fit represents the lowest $\chi^2$ (or highest likelihood).  Just for fun, let's investigate this claim.  Specifically, let us return the variances in the fit coefficients, and use one of them to show that it is in fact a minimum in $\chi^2$.

When the covariance is requested, result[1] is the covariance matrix, which gives the variance of the parameters along the diagonal.  

In [None]:
degree=3

#Turn on the covariance.
result=np.polyfit(x,ynew,degree,w=1/err,cov=True)

#Arbitrarily grab the coefficient for the cubic
std_coeff0=np.sqrt(result[1][0][0])

#Create a vector with some variance around the value of the cubic coefficient.
temp1=np.linspace(-2.0*std_coeff0,2*std_coeff0,101)
print('{} {}'.format(temp1.min(),temp1.max()))

print(result[0])
print(result[1])
print(result[1].shape)

In [None]:
#write a function that will compute some chi2s after varying one of the fit coefficients.  
#This below varies the linear coefficient.
#polyfit orders the coefficients from highest power to lowest, so for a cubic
#index 2 is the linear coefficient.

def get_chi2s(x,ynew,err,result):
    std_coeff=np.sqrt(result[1][2][2])
    temp1=np.linspace(-std_coeff,std_coeff,101)
    chi2_out=np.zeros(101)
    plt.errorbar(x,ynew,yerr=err,fmt='o')
    for i in range(len(temp1)):
        ymodel=np.poly1d([result[0][0],result[0][1],result[0][2]+temp1[i],result[0][3]])
        ymodelout=ymodel(x)
        plt.plot(x,ymodelout)
        chi2_out[i]=(((ynew-ymodelout)/err)**2).sum()
    return temp1,chi2_out


In [None]:
temp1,chi2_out=get_chi2s(x,ynew,err,result)

In [None]:
plt.plot(temp1,chi2_out,'.')
print(chi2_out.min())
test_value=chi2_out.min()+1
print(test_value)
#use a new name so that the polyfit output stored in result is not overwritten
delta_coeff=np.interp(test_value,chi2_out[temp1 > 0],temp1[temp1 > 0])
print(delta_coeff)

This shows indeed that the $\chi^2$ is minimum at the best fit.  

You might be tempted to estimate the errors on this fit parameter by determining the values at which $\chi^2$ increases by some value.  **Look before you leap!**  Such errors will be incorrect due to covariance, and we will explore that next.


## Errors on fit parameters

We now have the parameter values for the best fit cubic equation.  These might be something we want to report in a paper.  However, numbers are useless without uncertainties.  So what are the uncertainties on the parameters of the cubic?

This question can also be tackled using $\chi^2$ (within limits to be discussed below).  

Consider that the parameter values at the $\chi^2$ minimum are given by $\mathbf{a}_{(0)}$, and the value of $\chi^2$ for those parameters is $\chi^2_{min}$.  If the parameters $\mathbf{a}$ are perturbed away from  $\mathbf{a}_{(0)}$, then $\chi^2$ increases.  

(It may help to think of the $\chi^2$ surface as a multi-dimensional bowl.)

We can define a confidence region by a certain amount $\Delta\chi^2$.  Generally speaking, that will be an $M$ dimensional region for $M$ fit parameters.  However, we can also do this for a more limited number of parameters ("parameters of interest").  



The information about how to do this is in the covariance matrix, but before we go into those details, we need to consider when this applies.   Generally speaking, these are most applicable when the errors on the data are normally distributed. If they are not normally distributed, $\chi^2$ can still be used, but the interpretation (e.g., precise values of $\Delta\chi^2$ for different levels of confidence) may not hold.

When the errors are normally distributed, some precepts apply.
  - $\chi^2_{min}$ is distributed as a chi-square distribution with $N-M$ degrees of freedom, where $N$ is the number of data points, and $M$ is the number of fit parameters.  (We have already covered this in the $\chi^2$ minimization discussion above.)
  
  - Consider $\mathbf{a}^S_{(j)}$ to be drawn from the universe of data sets with actual (i.e., physical) parameters $\mathbf{a}_{(0)}$ then the probability distribution of $\delta\mathbf{a} = \mathbf{a}^S_{(j)}- \mathbf{a}_{(0)}$ is the multivariate normal distribution.
  
  $$P(\delta\mathbf{a}) da_0\dots d a_{M-1} = const \times \exp(-\frac{1}{2} \delta\mathbf{a}\centerdot \mathbf{\alpha} \centerdot \delta\mathbf{a}) da_0\dots d a_{M-1}$$
  where 
  $$\alpha_{kl} = \frac{1}{2} \frac{\partial^2\chi^2}{\partial a_k \partial a_l}$$
  
  describes the curvature of the $\chi^2$ surface.  (This is related to the second partial derivative of the log-likelihood which, as you know, comes from the Taylor expansion of $\ln L$ that is used to extract the covariance matrix.)
  

  
  - If $\mathbf{a}^S_{(j)}$ is drawn from the universe of data sets with actual (i.e., physical) parameters $\mathbf{a}_{(0)}$, then the quantity $\Delta \chi^2 = \chi^2(\mathbf{a}^S_{(j)}) - \chi^2(\mathbf{a}_{(0)})$ is also distributed as a $\chi^2$ distribution with $M$ degrees of freedom.  
  
It is the last precept that makes the connection between particular values of $\Delta \chi^2$ and the fraction of the probability distribution enclosed, i.e., the confidence levels.

  - Suppose that $\mathbf{a}^S_{(j)}$ is drawn from the universe of simulated data, but this time, the first $\nu$ components are held fixed, and the remaining $M-\nu$ components are varied to minimize $\chi^2$.  Call this new minimum $\chi^2_\nu$. Then $\Delta\chi^2 \equiv \chi^2_\nu - \chi^2_{min}$ is distributed as $\chi^2$ with $\nu$ degrees of freedom.  
  
What this gives you is the $\Delta \chi^2$ region projected onto the $\nu$-dimensional subspace of the fixed parameters.  

Example: Consider $\nu=1$, where we want to find the confidence of a single parameter. (We call $\nu$ the number of parameters of interest.)  Then this is distributed as $\chi^2$ with one degree of freedom.  

In [None]:
degree=1

dist_chi_1=stats.chi2(degree)

#print out the delta chi^2 for various confidence intervals

conf_intervals=[0.6827, 0.90, 0.9545, 0.99, 0.9973, 0.9999]
print('For these confidence levels, and one parameter of interest {}'.format(conf_intervals))
print('the delta chi2 should be {}'.format(dist_chi_1.ppf(conf_intervals)))


Let us use this to determine the uncertainty on one of the parameters from our cubic model to the data.  We will use the covariance matrix, which is the inverse of the partial derivatives matrix (the Hessian matrix). For the polyfit, the covariance matrix is in result[1].


In [None]:
degree=3
result=np.polyfit(x,ynew,degree,w=1/err,cov=True)
print(result[1])

We will use $$\Delta \chi^2 = \delta\mathbf{a}\centerdot \mathbf{\alpha} \centerdot \delta\mathbf{a}$$ 
and solve for $\delta\mathbf{a}$.  

We can write $\delta\mathbf{a}$ as:

$$\delta\mathbf{a}=\mathbf{\alpha}^{-1} \centerdot \left(\begin{array}{c}c\\ 0\\ \vdots \\ 0 \end{array} \right) = \mathbf{C} \centerdot \left( \begin{array}{c}c\\ 0\\ \vdots \\ 0 \end{array} \right)$$

where $c$ is the constant that will be adjusted to make the $\Delta\chi^2$ equal the desired value, and $C$ is the covariance matrix.  

Combining these two equations gives 
$$ c=\frac{\delta a_0}{C_{00}}$$ 
and therefore 
$$ \Delta \chi^2_\nu = \frac{(\delta a_0)^2}{ C_{00}}$$
so 
$$ \delta a_0 = \pm \sqrt{\Delta \chi^2_{\nu}} \sqrt{C_{00}}.$$

So, what this means is that, given the confidence level desired (e.g., $\Delta \chi^2 = 1$ for 68.3% for one parameter of interest), the uncertainty on the parameter is just the square root of the corresponding diagonal term of the covariance matrix times the square root of the corresponding $\Delta \chi^2_\nu$. 
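The relation $\delta a_0 = \pm\sqrt{\Delta\chi^2_\nu}\sqrt{C_{00}}$ can be verified numerically. The sketch below is self-contained and makes several assumptions of its own: it regenerates a noisy sine wave with a fixed seed and uniform errors, and it requests `cov='unscaled'` from `np.polyfit` so that the covariance matrix is exactly $\mathbf{\alpha}^{-1}$ for the stated errors (the default rescales the matrix by the reduced $\chi^2$). It then holds the cubic coefficient fixed at $a_0 + \delta a_0$, refits the other three coefficients, and checks that $\chi^2$ rises by the expected amount.

```python
import numpy as np
from scipy import stats

# Simulated data: one cycle of a sine wave with Gaussian noise (assumed values)
rng = np.random.default_rng(7)
x = np.linspace(0, 1, 11)
err = np.full(x.size, 0.15)
ynew = np.sin(2 * np.pi * x) + rng.normal(0, 0.15, x.size)

# Cubic fit; 'unscaled' returns the covariance for the errors as given
coeffs, C = np.polyfit(x, ynew, 3, w=1 / err, cov='unscaled')

# Delta chi2 for 68.27% confidence and one parameter of interest
delta_chi2 = stats.chi2(1).ppf(0.6827)
delta_a0 = np.sqrt(delta_chi2) * np.sqrt(C[0, 0])
print('Cubic coefficient: {:.3f} +/- {:.3f}'.format(coeffs[0], delta_a0))

def chi2_of(c):
    return np.sum(((ynew - np.polyval(c, x)) / err)**2)

# Hold the cubic coefficient fixed at a0 + delta_a0 and refit the rest
chi2_min = chi2_of(coeffs)
a0_fixed = coeffs[0] + delta_a0
rest = np.polyfit(x, ynew - a0_fixed * x**3, 2, w=1 / err)
chi2_fixed = chi2_of(np.concatenate(([a0_fixed], rest)))

print('Expected delta chi2: {:.4f}'.format(delta_chi2))
print('Measured delta chi2: {:.4f}'.format(chi2_fixed - chi2_min))
```

Refitting the remaining coefficients is what distinguishes this from simply stepping one coefficient with the others frozen; the latter ignores the covariance between parameters and underestimates the uncertainty.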

What if you are interested in more than one parameter?  Then you must compute the $\Delta \chi^2$ for the number of parameters of interest, e.g., for two parameters of interest:


In [None]:
degree=2

dist_chi_2=stats.chi2(degree)

#print out the delta chi^2 for various confidence intervals

conf_intervals=[0.6827, 0.90, 0.9545, 0.99, 0.9973, 0.9999]
print('For these confidence levels, and two parameters of interest {}'.format(conf_intervals))
print('the delta chi2 should be {}'.format(dist_chi_2.ppf(conf_intervals)))


In addition, the region bounded by a given value of $\Delta \chi^2$ will be a $\nu$-dimensional ellipsoid, since if you have multiple parameters of interest, it means you want to determine the joint confidence region between said parameters.  So for $\nu=2$, the $\chi^2$ boundary will be an ellipse.  This makes sense because we are assuming (by using the second partials to determine the covariance matrix) that the likelihood is Gaussian in the parameters, and therefore the $\chi^2$ surface is a paraboloid.  

So the general form of the equation that you will want to solve is:

$$\Delta\chi^2 = \delta \mathbf{a}^{\prime} \centerdot {\mathbf{C}^{-1}_{proj}} \centerdot \delta \mathbf{a} ^\prime$$

where $\mathbf{a}^\prime$ is the subset of the whole $\mathbf{a}$ that forms the parameters of interest (e.g., the constant plus linear coefficient in the example above), and $\mathbf{C}_{proj}$ is the $\nu \times \nu$ relevant portion of the covariance matrix (e.g., the $[[C_{22}, C_{23}],[C_{32}, C_{33}]]$ portion of the full covariance matrix above).
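For two parameters of interest, the boundary of the confidence region can be traced out from the eigenvectors and eigenvalues of $\mathbf{C}_{proj}$. The sketch below uses a made-up $2\times2$ covariance block; in practice it would be cut out of the full covariance matrix of the fit.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 covariance block for the two parameters of interest
C_proj = np.array([[0.04, 0.018],
                   [0.018, 0.09]])

# Delta chi2 for 68.27% confidence with two parameters of interest
delta_chi2 = stats.chi2(2).ppf(0.6827)
print('Delta chi2 for nu=2: {:.3f}'.format(delta_chi2))

# Principal axes of the ellipse from the eigen-decomposition
evals, evecs = np.linalg.eigh(C_proj)
semi_axes = np.sqrt(delta_chi2 * evals)

# Points on the boundary: scale a unit circle, then rotate it
theta = np.linspace(0, 2 * np.pi, 200)
unit_circle = np.vstack([np.cos(theta), np.sin(theta)])
boundary = evecs @ (semi_axes[:, None] * unit_circle)   # shape (2, 200)

# Every boundary point should satisfy da' . C_proj^-1 . da' = delta_chi2
check = np.einsum('ik,ij,jk->k', boundary, np.linalg.inv(C_proj), boundary)
print('Range of the quadratic form on the boundary: {:.4f} to {:.4f}'.format(
    check.min(), check.max()))
```

The rows of `boundary` are the offsets $\delta a^\prime$ of the two parameters from their best-fit values, so plotting one row against the other (after adding the best-fit values) draws the confidence ellipse.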


## Homework - spectral fitting and $\chi^2$

The first problem of HW4 will involve fitting a couple of models to some simulated data, evaluating and comparing the $\chi^2$, then constructing a couple of $\chi^2$ contours for a couple of pairs of parameters of interest.

For further reference for this part, see Numerical Recipes.

## Bootstrap Resampling

The use of $\chi^2$ makes some significant demands on the nature of the model, the data, and the uncertainties.  Specifically, as noted above, it assumes that at any point, data is distributed around the model as a Gaussian with width $\sigma$ which is basically understood to be the data uncertainty.  

These assumptions may not be valid.  For example, when the number of counts is small, the distribution will be Poisson.  There may be other reasons why the uncertainties are not reliable.  In this case, a parameter may be estimated from the data, but what is the uncertainty on that parameter?

Here's how the bootstrap works in practice:
 - Measure the parameter of interest from the data.
 - Draw a "random sample with replacement" from the sample, and measure the parameter of interest.
 - Do this a large number of times.  How large?  Large enough that there is no difference if you do this process more than once.
 - The distribution of the measured parameter of interest from the bootstrapped samples yields the uncertainty on the parameter of interest.
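The steps above can be sketched as a small general-purpose function. The function name `bootstrap_uncertainty`, its arguments, and the choice of the median as the statistic are illustrative assumptions, not part of any library.

```python
import numpy as np

def bootstrap_uncertainty(data, statistic, n_boot=2000, seed=0):
    """Return the statistic of the data and its bootstrap uncertainty
    (half-width of the central 68% interval)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Random sample with replacement, same size as the original
        resample = rng.choice(data, size=n, replace=True)
        estimates[b] = statistic(resample)
    # Central 68% of the bootstrap distribution
    lo, hi = np.percentile(estimates, [16, 84])
    return statistic(data), 0.5 * (hi - lo)

# Example: uncertainty on the median of a skewed sample (assumed values)
rng = np.random.default_rng(5)
data = rng.gamma(shape=2.0, scale=1.0, size=500)
med, med_err = bootstrap_uncertainty(data, np.median)
print('Median = {:.3f} +/- {:.3f}'.format(med, med_err))
```

The median is used here because, unlike the mean, it has no simple textbook formula for its uncertainty, which is the situation where the bootstrap is most helpful.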

This procedure seems a little sketchy, but it should work _if the bootstrapped sample has the same distribution as the original sample_.  

According to Wikipedia, bootstrap is recommended under the following conditions:
 - "When the theoretical distribution of a statistic of interest is complicated or unknown. Since the bootstrapping procedure is distribution-independent it provides an indirect method to assess the properties of the distribution underlying the sample and the parameters of interest that are derived from this distribution.
 - When the sample size is insufficient for straightforward statistical inference. If the underlying distribution is well-known, bootstrapping provides a way to account for the distortions caused by the specific sample that may not be fully representative of the population."
 
Wikipedia also notes:
  - "if one performs a naive bootstrap on the sample mean when the underlying population lacks a finite variance (for example, a power law distribution), then the bootstrap distribution will not converge to the same limit as the sample mean. As a result, confidence intervals on the basis of a Monte Carlo simulation of the bootstrap could be misleading. Athreya states that "Unless one is reasonably sure that the underlying distribution is not heavy tailed, one should hesitate to use the naive bootstrap"."
  
Let's look at some simple examples of bootstrap usage.

First, let's draw 1000 samples from a gamma distribution, chosen simply because it is asymmetric and non-Gaussian for low values of the shape parameter $k$.  Here, stats.gamma takes the shape parameter $k$, the location parameter, and the scale parameter $\theta$. 

Physically, the idea is similar to measuring the length of a rod multiple times, except that, for whatever reason, the measured lengths do not follow a Gaussian distribution and are instead distributed according to a gamma distribution.  Not very realistic, but that is the idea.


In [None]:
from scipy import stats
dist=stats.gamma(2,0,1)
r=dist.rvs(1000)
plt.hist(r)

We can compute the mean, and the uncertainty on the mean as though the data were Gaussian (as shown above), as follows.

In [None]:
# formatted strings keep these print calls valid in both Python 2 and Python 3
print('The mean is %g' % r.mean())
print('The uncertainty on the mean, if r were drawn from a Gaussian distribution, is %g' % (r.std()/np.sqrt(len(r))))


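Let's compute the distribution of the means using the bootstrap.  Resampling with replacement is implemented in np.random.choice.  If we request an output of shape (n, reps), each column is one bootstrap sample of size n.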

In [None]:
n = len(r)
reps = 10000
xb = np.random.choice(r, (n, reps))   # sampling is with replacement by default
print(xb.shape)

mb = np.mean(xb, axis=0)              # mean of each bootstrap sample (one per column)
print(mb.shape)
plt.hist(mb)
mb.sort()

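# the central 68% interval runs from the 16th to the 84th percentile
lo, hi = np.percentile(mb, [16.0, 84.0])
print('The half-width of the central 68%% interval of the bootstrapped means is %g' % (0.5*(hi - lo)))



The bootstrap estimate of the uncertainty in the mean agrees closely with the Gaussian approximation.  Does this make sense?

I would contend it does make sense.  The mean of the gamma distribution is the shape times the scale (in this case, $2 \times 1$), and with 1000 points the central limit theorem makes the distribution of the sample mean very nearly Gaussian with width $\sigma/\sqrt{N}$, even though the parent distribution is skewed.  So the bootstrapped means should be clustered around that value with essentially the Gaussian width.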

## Example

One example of the application of the bootstrap is given here:
https://ned.ipac.caltech.edu/level5/Sept01/Bhavsar/frames.html

The goal of this paper is to derive the slope of the angular two-point correlation function.  
 - They have a sample of galaxies whose position on the sky is known.
 - They measure the angular distance between each pair of galaxies.
 - They bin these in angular distance, and fit the result with a power law (after correcting for the uniform distribution).  They find that the slope of the power law is $0.77$.
 - The uncertainty on the number of galaxies in a bin is Poisson, and there may not be enough galaxies per bin for those uncertainties to be approximately Gaussian.  So this falls into the category of errors that are not well understood.  Moreover, as noted in the article, what they really want is not the uncertainty on the data that they have, but the scatter among other putative data sets, such as they would obtain by observing another region of the sky.
 - So, assuming that the observed sample is a fair sample, they can draw a large number of bootstrap samples (i.e., random samples with replacement) and perform the same power law fit.  The slopes from all of these are then analyzed to produce a standard deviation on the slope of 0.13.
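
The procedure in the list above can be sketched roughly as follows.  This is not the code from the paper: the positions are uniform random placeholders on a flat unit square, no correction for the uniform distribution is applied, the fit is an unweighted straight line in log-log space, and the function names `pair_separations` and `power_law_slope` are hypothetical.  The resulting slope therefore has nothing to do with the $0.77$ quoted above; the point is only the resampling logic.

```python
import numpy as np

def pair_separations(points):
    """Distances between all distinct pairs of 2-D points."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(len(points), k=1)   # count each pair once
    return dist[upper]

def power_law_slope(points, bins):
    """Slope of log(counts) versus log(separation) from a straight-line fit."""
    counts, edges = np.histogram(pair_separations(points), bins=bins)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    good = counts > 0                            # empty bins have no logarithm
    slope, intercept = np.polyfit(np.log10(centers[good]),
                                  np.log10(counts[good]), 1)
    return slope

rng = np.random.RandomState(42)
points = rng.uniform(0.0, 1.0, size=(200, 2))    # placeholder "galaxy" positions
# the lowest bin edge is > 0, so the zero separations produced by
# duplicated points in a resample are not counted
bins = np.logspace(-2, -0.5, 8)

slope_hat = power_law_slope(points, bins)

reps = 200
slopes = np.empty(reps)
for i in range(reps):
    # resample the galaxies (not the pairs) with replacement, then refit
    idx = rng.randint(0, len(points), len(points))
    slopes[i] = power_law_slope(points[idx], bins)

print('slope = %.2f +/- %.2f' % (slope_hat, slopes.std()))
```

Note that it is the objects that are resampled, and the separations are recomputed for every bootstrap sample.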
 
 



## Homework for bootstrap

A 1-dimensional analog of the galaxy angular distribution is photon arrival times. 

You will be given two sets of photon arrival time data, created from lightcurves generated from two different power spectra.  

You can compute the difference in time between all the points in the data set.

You can then create a vector that spans the range of time differences obtained, and count the number of differences in each bin.  

Finally, you fit the results (time differences vs number of differences in each bin, assuming Poisson errors) with a power law to obtain a slope.

What is the error on the slope?  That can be determined by doing bootstrap resampling on the photon arrival times (note that since you will count differences between all points, it won't matter if the bootstrapped data become out of order).


## Hypothesis Testing

Often we want to know whether a given sample is consistent with some hypothesis.   

For example, you measure a set of data, and it looks interesting.  It will be interesting, you think, if it is not consistent with a Gaussian distribution.  So you ask the statistical question: is a set of data $\{x_i\}$ consistent with a Gaussian distribution $\mathcal{N}(\mu,\sigma)$?  In this example, $\mathcal{N}(\mu,\sigma)$ is the _null hypothesis_, and we are trying to reject it.  

Astronomical example in Ivezic: suppose you are trying to detect a certain object in an image with substantial background.  Because the background fluctuates from pixel to pixel, the contribution of the object in a certain pixel must be larger than the background fluctuation in order to claim a detection.  

It is assumed that you can compute the probability of a given outcome from the null hypothesis: for example, for a cumulative distribution function $H_0(x)$, the probability that you would get a value at least as large as $x_i$ is $p(x> x_i) = 1-H_0(x_i)$.  This is called the _p value_.

Often a significance level $\alpha$ is adopted (ahead of time) and the null hypothesis is rejected when $p \le \alpha$.  For example, if $\alpha=0.05$ (i.e., 5%), and $p< 0.05$, the null hypothesis is rejected at a 0.05 significance level.
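
As a rough illustration of the pixel-detection example, suppose the background in a pixel is Gaussian.  The background level, its fluctuation, and the measured pixel value below are made-up numbers, chosen only to show how $p = 1-H_0(x_i)$ is evaluated and compared with $\alpha$.

```python
from scipy import stats

background_mean = 100.0    # assumed mean background level in a pixel
background_sigma = 10.0    # assumed pixel-to-pixel fluctuation
pixel_value = 125.0        # hypothetical measured value
alpha = 0.05               # significance level, adopted ahead of time

# null hypothesis: the pixel contains background only
h0 = stats.norm(background_mean, background_sigma)

# sf(x) is the survival function, 1 - cdf(x)
p_value = h0.sf(pixel_value)
reject = p_value <= alpha
print('p = %.4f, reject the null hypothesis: %s' % (p_value, reject))
```

Here the pixel is $2.5\sigma$ above the background, so the null hypothesis is rejected at the 0.05 level.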

Note that failing to reject a hypothesis is not the same as proving its correctness. 

Example: if we flip a coin 10 times and get 8 tails, should we reject the hypothesis that the coin is fair?  No, because the binomial distribution predicts that the probability of 8 or more tails is 0.055, and so we cannot reject the hypothesis at the 0.05 significance level.
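
The coin example can be checked numerically with the binomial distribution in scipy.stats.  The variable names are just for illustration.

```python
from scipy import stats

n_flips = 10
n_tails = 8

# P(X >= 8) for a fair coin.  sf(k) is P(X > k), so use k = 7.
p_value = stats.binom.sf(n_tails - 1, n_flips, 0.5)
print('p = %.4f' % p_value)   # equals (45 + 10 + 1) / 1024
```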

**Use hypothesis testing with care.**

There are many pitfalls that fall under the term of _p-hacking_, i.e., manipulating data until you get $p< 0.05$, by, e.g., leaving data out, or trying a variety of statistical tests until you get one where $p<0.05$.  

Maybe this happens in soft sciences more than in physics and astronomy?  

[This article](https://fivethirtyeight.com/features/science-isnt-broken/#part1) gives a nice discussion.

More jargon associated with this:

 - Type I error (false positives): cases where the null hypothesis is true but incorrectly rejected
 - Type II error (false negatives): cases where the null hypothesis is false, but it is not rejected.  This is related to the power of the statistical test.
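
One way to get a feeling for these two error rates is by simulation.  The sketch below is an assumption-laden toy: the null hypothesis is that a Gaussian sample has zero mean, the test is a one-sample t test (`scipy.stats.ttest_1samp`), and the sample size, number of trials, and the alternative mean of 0.5 are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(3)
alpha = 0.05
n_trials = 2000     # number of simulated data sets
n_points = 25       # points per data set

def rejection_rate(true_mean):
    """Fraction of simulated data sets for which H0 (mean = 0) is rejected."""
    data = rng.normal(true_mean, 1.0, size=(n_trials, n_points))
    t_stat, p = stats.ttest_1samp(data, 0.0, axis=1)
    return np.mean(p < alpha)

type1_rate = rejection_rate(0.0)   # H0 is true: every rejection is a false positive
power = rejection_rate(0.5)        # H0 is false: rejections are correct
type2_rate = 1.0 - power           # H0 is false but was not rejected

print('Type I rate  = %.3f (expected to be near alpha = %.2f)' % (type1_rate, alpha))
print('Type II rate = %.3f, power = %.3f' % (type2_rate, power))
```

The Type I rate is set by the choice of $\alpha$, while the Type II rate depends on how different the truth is from the null hypothesis and on how much data you have.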
 
There is a lot more to this than I have mentioned; see Ivezic and other sources.
    

## Comparison of Distributions

Sometimes we would like to answer one of the following questions:

- Are two samples drawn from the same distribution?  
- Is a sample consistent with being drawn from a known distribution?
- Do two sets of measurements imply a difference in the measured quantity?

Ivezic notes that these are not necessarily easy questions to formulate. For example, a Gaussian distribution depends on both $\mu$ and $\sigma$ - it could be that we are interested in only whether the $\mu$ are different, and we don't care about the $\sigma$.

### Nonparametric Methods for Comparing Distributions

A nonparametric method is used when the distribution is not known.  For example, the regular Pearson $r$ correlation coefficient implicitly assumes that both variables are normally distributed.  If they are not, $r$ can still be computed, but the interpretation in terms of $p$-values is not valid.  In contrast, the Spearman rank correlation coefficient correlates the ranks rather than the values, and is more robust in the face of non-normal distributions.
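
A quick illustration of the difference, using made-up data in which $y$ is a monotonic but strongly nonlinear function of $x$ (so $y$ is far from normally distributed):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(5)
x = rng.normal(0.0, 1.0, 200)
y = np.exp(3.0 * x)          # monotonic in x, but with a very long tail

r_pearson, p_pearson = stats.pearsonr(x, y)
r_spearman, p_spearman = stats.spearmanr(x, y)
print('Pearson r    = %.3f' % r_pearson)
print('Spearman rho = %.3f' % r_spearman)
```

Because the ranks of $x$ and $y$ are identical, the Spearman coefficient is 1, while the Pearson coefficient is dragged down by the few very large values of $y$.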

One of the most popular nonparametric methods for comparing distributions is the Kolmogorov-Smirnov (K-S) test, which compares the cumulative distributions of two samples.  

It is based on the maximum distance between the two cumulative distributions, as seen in the [Wikipedia article](https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test#/media/File:KS_Example.png).  Interestingly, this distance has well-understood statistical properties.
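
To make the definition concrete, here is a minimal sketch that computes the maximum distance between two empirical cumulative distributions by hand and compares it with the statistic returned by `scipy.stats.ks_2samp`.  The helper name `ks_distance` and the two Gaussian samples are assumptions for illustration.

```python
import numpy as np
from scipy import stats

def ks_distance(a, b):
    """Maximum vertical distance between the empirical CDFs of a and b."""
    a = np.sort(a)
    b = np.sort(b)
    grid = np.concatenate([a, b])      # the maximum occurs at a data point
    # fraction of each sample that is <= each grid value
    cdf_a = np.searchsorted(a, grid, side='right') / float(len(a))
    cdf_b = np.searchsorted(b, grid, side='right') / float(len(b))
    return np.max(np.abs(cdf_a - cdf_b))

rng = np.random.RandomState(7)
a = rng.normal(0.0, 1.0, 300)
b = rng.normal(0.5, 1.0, 200)          # shifted, and a different sample size

d_manual = ks_distance(a, b)
d_scipy, p_scipy = stats.ks_2samp(a, b)
print('D (by hand) = %.4f, D (scipy) = %.4f, p = %.3g' % (d_manual, d_scipy, p_scipy))
```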

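It is implemented in Python in scipy.stats as kstest (one sample against a known distribution) and ks_2samp (two samples); ksone is the distribution of the one-sided K-S statistic.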



#### Example

Let's return to the question of whether a Poisson distribution can be distinguished from a normal distribution for various values of count rates, as in HW3.

Remember the idea is to compare the Poisson distribution for input value of $\lambda$ with the inferred Gaussian distribution.


In [None]:
def get_poisson_and_normal_samples(rate,numpnts):
    s = np.random.poisson(rate, numpnts)
    avg = np.average(s)
    std = np.std(s)
    t=np.random.normal(avg,std,numpnts)
    return s,t


In [None]:
rate=1.0
numpnts=100

pylab.rcParams['figure.figsize'] = (15, 6)

s,t=get_poisson_and_normal_samples(rate,numpnts)

bins=-4+np.arange(12)

plt.hist(s,bins,histtype='step')
plt.hist(t,bins,histtype='step')


In [None]:
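# density=True replaces the normed=1 keyword, which newer matplotlib versions no longer accept
plt.hist(s,bins,density=True,histtype='step',cumulative=True)
plt.hist(t,bins,density=True,histtype='step',cumulative=True)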


Next, compute the two-sample K-S statistic for these two data sets using ks_2samp.

In [None]:
stats.ks_2samp(s,t)

We see that the p-value is small, so the hypothesis that these two samples are drawn from the same distribution is rejected.  Let's try with rate=20.

In [None]:
rate=20.0

pylab.rcParams['figure.figsize'] = (15, 6)

s,t=get_poisson_and_normal_samples(rate,numpnts)

bins=5+np.arange(30)

plt.hist(s,bins,histtype='step')
plt.hist(t,bins,histtype='step')

stats.ks_2samp(s,t)

In [None]:
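# density=True replaces the normed=1 keyword, which newer matplotlib versions no longer accept
plt.hist(s,bins,density=True,histtype='step',cumulative=True)
plt.hist(t,bins,density=True,histtype='step',cumulative=True)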


The large p-value shows that, with this number of points, the two distributions cannot be distinguished.

Note the dependence on the number of draws; if you have a lot of data, you can distinguish between different distributions better than if you have only a few data points.
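
The dependence on the number of draws can be seen by repeating the rate=20 comparison for increasing sample sizes.  This is a sketch under assumptions: the sample sizes and the seed are arbitrary, and the exact numbers will change from run to run if the seed is changed.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(17)
rate = 20.0
results = {}
for numpnts in [100, 1000, 100000]:
    s = rng.poisson(rate, numpnts)
    # Gaussian sample with the same mean and standard deviation as the Poisson sample
    t = rng.normal(np.average(s), np.std(s), numpnts)
    d, p = stats.ks_2samp(s, t)
    results[numpnts] = (d, p)
    print('N = %6d   D = %.3f   p = %.3g' % (numpnts, d, p))
```

With enough points, even the rate=20 Poisson sample is clearly distinguished from the Gaussian, in part because the Poisson values are discrete.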

###  More on the KS Test

Here we have compared two distributions (note that they don't have to have the same number of points, since it is the normalized cumulative distribution that is compared).  

But you can also compare a set of data with a known distribution using kstest, which takes as its second argument either the name of a distribution in scipy.stats or a callable that evaluates the cumulative distribution function.
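
Both ways of calling `scipy.stats.kstest` are sketched below, using simulated Gaussian data with made-up parameters.  Note that the parameters of the reference distribution are specified independently of the data here; if they are instead estimated from the same data, the p-value returned is no longer strictly valid.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(11)
data = rng.normal(10.0, 2.0, 500)

# distribution given by name, with its parameters passed through `args`
d1, p1 = stats.kstest(data, 'norm', args=(10.0, 2.0))

# the same comparison, with a callable that evaluates the CDF
d2, p2 = stats.kstest(data, stats.norm(10.0, 2.0).cdf)

# comparison with the wrong distribution (a standard normal)
d3, p3 = stats.kstest(data, 'norm')

print('by name:  D = %.3f, p = %.3g' % (d1, p1))
print('callable: D = %.3f, p = %.3g' % (d2, p2))
print('wrong H0: D = %.3f, p = %.3g' % (d3, p3))
```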

Regardless, because it is looking for the maximum distance in the cumulative distributions, it doesn't perform well when the differences are out in the tails.

Note that the KS test is sensitive to location and scale differences, as well as shape differences, but not to reparameterization (e.g., the same result would be obtained for $\ln x$ as for $x$).
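
The insensitivity to reparameterization is easy to verify, because a monotonic transformation such as the logarithm does not change the ordering of the points.  The two lognormal samples below are made up for the purpose.

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(13)
x = rng.lognormal(0.0, 1.0, 150)
y = rng.lognormal(0.3, 1.0, 150)

d_lin, p_lin = stats.ks_2samp(x, y)
d_log, p_log = stats.ks_2samp(np.log(x), np.log(y))
print('x:    D = %.4f, p = %.4g' % (d_lin, p_lin))
print('ln x: D = %.4f, p = %.4g' % (d_log, p_log))
```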

There are other nonparametric tests, including:
 - The Cramer-von Mises criterion
 - The Watson test
 - The Anderson-Darling test
 - The U test
 - The Wilcoxon test
 - The Shapiro-Wilk test
 
There are also parametric tests, which can be used if the sample is consistent with a normal distribution:
 - The t test
 - The F test (which we have already come across)
 
See Ivezic and other sources for details.