In [None]:
factor=0.6

from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'scroll': True,
        'width': 1024*factor,
        'height': 768*factor
})



# Probability Distributions and Estimators

Karen Leighly, 2017

Resources for this material include:
 - Portions of Bishop 1.2
 - Ivezic Chapter 3
 
 

Let's motivate probability distributions through the use of an example.

Consider a fair coin.  If I flip that coin 20 times, how many times will I obtain "heads"?

The naive view would say: 10 times.  Let's see.

In [None]:
## Initiate python stuff
%pylab inline
import scipy.stats

In [None]:
h=np.random.uniform(0.0,1.0,20)
test=np.where(h >= 0.5)
print(len(test[0]))

Repeated runs of this cell show that "10" is observed only some of the time.

Let's look at this in more detail by running the cell multiple times, and counting the number of times the uniform distribution yields a value greater than or equal to 0.5, i.e., "heads".

In [None]:
#Make a function that counts how many of num_flips draws from the uniform distribution are greater than or equal to 0.5

def one_trial(num_flips):
    h=np.random.uniform(0.0,1.0,num_flips)
    test=np.where(h >= 0.5)
    num_heads=len(test[0])
    return num_heads

In [None]:
#Call this function 1000 times (20 flips each time), storing the results in a vector

num_flips=20
num_trials=1000
out=np.zeros(num_trials)

for j in range(num_trials):
    out[j]=one_trial(num_flips)



In [None]:
# and now plot

pylab.rcParams['figure.figsize'] = (15, 6)
pylab.rcParams['axes.labelsize'] = 24

pylab.rcParams['axes.titlesize'] = 24
pylab.rcParams['lines.linewidth'] = 3
pylab.rcParams['lines.markersize'] = 10
pylab.rcParams['xtick.labelsize'] = 16
pylab.rcParams['ytick.labelsize'] = 16

plt.xlabel('Run',fontsize=20)
plt.ylabel('Number of Heads',fontsize=20)


plt.plot(out)

As you can see, the result is not always 10 - there is a <i>distribution</i> around that value.

Let us count how many points are in each bin.

In [None]:
x_coin=np.arange(20)
y_coin=np.zeros(20)
# Bins cover 0..19 heads; 20 heads out of 20 has a probability of about 1e-6 and is omitted.
for i in range(20):
    temp=np.where(out == i)
    y_coin[i]=len(temp[0])
    
plt.xlabel('Number of Heads',fontsize=20)
plt.ylabel('Number of Times #Heads was drawn',fontsize=20)

    
plt.plot(x_coin+0.5,y_coin,drawstyle='steps')

This example yields some very useful concepts.

- Even though we <i>expect</i> 10 "heads" in every 20 draws, there is actually a <i>distribution</i> of numbers of heads.  The number of heads can, in principle, take any value from zero to 20.

- But the peak of the distribution is near 10, so it is consistent with the number that we expect. 

- At the same time, the distribution has some width, which gives us some information about how often we should expect very low or very high values.

### A probability distribution should be normalized

Our histogram above is informative, but the $y$-axis values are related to the number of times the experiment has been performed, and that, of course, may change.

So, if we want to talk about the probability of observing any particular value, we need to normalize the distribution by dividing by the total number of draws.  Let's do this.

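Note that the result is often loosely referred to as a _probability density_.  Strictly speaking, because the number of heads is a discrete variable, it is a _probability mass function_; densities of continuous variables are discussed further on.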


In [None]:
y_coin_normalized=y_coin/y_coin.sum()
plt.xlabel('Number of Heads',fontsize=20)
plt.ylabel('Probability Density',fontsize=20)


plt.axis([0,20,0,y_coin_normalized.max()*1.1])

plt.plot(x_coin+0.5,y_coin_normalized,drawstyle='steps')

In [None]:
# And here are the values for y_coin and y_coin_normalized.  We will come back to this later.

for i in np.arange(20):
    print(x_coin[i],y_coin[i],y_coin_normalized[i])

## Binomial Distribution

The distribution we just generated is known as the binomial distribution.  

The binomial distribution has two parameters: 
 - $N$ is the number of trials (in our example, $N=20$), and 
 - $b$, which is the probability of "success" for each draw (in our example, $b=0.5$).   
 

If the probability of a "success" is $b$, then what is the probability of a failure?

The probability of a "failure" is $1-b$.  

In our case, $b=0.5$ (a fair coin) but it can be a different number between 0 and 1 (a weighted coin).

Then the probability of obtaining $k$ successes in any one set of $N$ trials is

$$p(k|b,N)=\frac{N!}{k!(N-k)!} b^k (1-b)^{N-k}$$

where the "!" stands for factorial.

The special case of $N=1$ is known as the _Bernoulli_ distribution.

From our discussions on Monday, this can be understood as the probability of $k$ consecutive successes followed by $(N-k)$ consecutive failures, multiplied by the number of different orderings (permutations) of such a draw.  
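
As a quick check of the formula, here is a minimal sketch that evaluates it directly using only the standard library. The function name `binomial_pmf` is my own choice for illustration and is not part of any library.

```python
from math import comb

def binomial_pmf(k, N, b):
    """Probability of exactly k successes in N trials with success probability b."""
    # comb(N, k) counts the orderings; b**k * (1-b)**(N-k) is the
    # probability of any one particular ordering.
    return comb(N, k) * b**k * (1.0 - b)**(N - k)

N, b = 20, 0.5
pmf = [binomial_pmf(k, N, b) for k in range(N + 1)]

print("p(k=10) =", pmf[10])
print("p(k=8)  =", pmf[8])
print("sum over all k =", sum(pmf))   # a normalized distribution sums to 1
```

The values can be compared with the normalized histogram of the coin-flip experiment above.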

scipy.stats has an implementation of the binomial distribution, and we can use the "pmf" method to examine the probability distribution from it.

In [None]:
dist=scipy.stats.binom(20,0.5)
ypdf=np.zeros(20)
print(dist.pmf(8))
for i in range(20):
    ypdf[i]=dist.pmf(x_coin[i])

plt.xlabel('Number of Heads',fontsize=20)
plt.ylabel('Probability Density',fontsize=20)
    
plt.plot(x_coin+0.5,y_coin_normalized,drawstyle='steps')
plt.plot(x_coin,ypdf)

We can see that the pmf method on the binom yields a good representation of the generated distribution.

### Aside about methods

_Methods_, in my experience, are a phenomenon unique to python that require a shift in thinking to use effectively.  However, the binom provides a good example of how methods work.

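First, we create an "instance" of binom for the specific values of 20 and 0.5, as follows:

`dist=scipy.stats.binom(20,0.5)`

That instance, now called "dist", lives in memory.

But there are a lot of things you might want to obtain from this specific instance.  For the binomial distribution, the available methods are listed in the documentation for [scipy.stats.binom](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html).  The signatures in the table are those of the general distribution; when a method is called on an instance such as "dist", the parameters `n` and `p` are already fixed and are not passed again (e.g., `dist.pmf(8)`).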

| Method | Notes |
| ------------- |-------------|
| rvs(n, p, loc=0, size=1, random_state=None) |	Random variates. |
| pmf(k, n, p, loc=0)	| Probability mass function.| 
| logpmf(k, n, p, loc=0)	| Log of the probability mass function.| 
| cdf(k, n, p, loc=0)	| Cumulative distribution function.| 
| logcdf(k, n, p, loc=0)	| Log of the cumulative distribution function.| 
| sf(k, n, p, loc=0)| 	Survival function (also defined as 1 - cdf, but sf is sometimes more accurate).| 
| logsf(k, n, p, loc=0)	| Log of the survival function.| 
| ppf(q, n, p, loc=0)| 	Percent point function (inverse of cdf — percentiles).| 
| isf(q, n, p, loc=0)	| Inverse survival function (inverse of sf).| 
| stats(n, p, loc=0, moments='mv')| 	Mean(‘m’), variance(‘v’), skew(‘s’), and/or kurtosis(‘k’).| 
| entropy(n, p, loc=0)	| (Differential) entropy of the RV.| 
| expect(func, args=(n, p), loc=0, lb=None, ub=None, conditional=False)	| Expected value of a function (of one argument) with respect to the distribution.| 
| median(n, p, loc=0)| 	Median of the distribution.| 
| mean(n, p, loc=0)| 	Mean of the distribution.| 
| var(n, p, loc=0)| 	Variance of the distribution.| 
| std(n, p, loc=0)	| Standard deviation of the distribution.| 
| interval(alpha, n, p, loc=0)	| Endpoints of the range that contains alpha percent of the distribution| 

These now do not have to be computed separately; they are already part of the dist instance.
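
To make the idea of an instance and its methods concrete, here is a toy sketch. The class `FrozenBinomial` is hypothetical and written only for illustration; it is not how scipy implements `binom`, but it shows the same pattern of storing the parameters once and then asking the instance for results.

```python
from math import comb, sqrt

class FrozenBinomial:
    """Toy stand-in for a frozen distribution: N and b are stored at creation."""

    def __init__(self, N, b):
        self.N = N
        self.b = b

    def pmf(self, k):
        # The method reads the stored parameters, so the caller passes only k.
        return comb(self.N, k) * self.b**k * (1.0 - self.b)**(self.N - k)

    def mean(self):
        return self.N * self.b

    def std(self):
        return sqrt(self.N * self.b * (1.0 - self.b))

dist_toy = FrozenBinomial(20, 0.5)    # the "instance"
print(dist_toy.pmf(8), dist_toy.mean(), dist_toy.std())
```

Once `dist_toy` exists, every method call uses the same 20 and 0.5 without our having to repeat them.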

## Descriptive Statistics

We will want to describe our statistical distributions with convenient measures of their "location" and "shape". 

We will differentiate between:
 - population statistics - parameters based on the distributions themselves.
 - sample statistics - values computed from data.

Some of these are related to expectation values:

Specifically, the expectation of f(x) is

$$\mathbb{E}[f]=\sum_x p(x)f(x)$$
for the discrete case, and 
$$\mathbb{E}[f]=\int p(x)f(x)dx$$
for the continuous case.


So, the expectation of $x$ yields the mean of the probability distribution $h(x)$.
$$\mu=\int_{-\infty}^{\infty} x h(x)dx.$$

The variance of $f(x)$ provides a measurement of how much variability there is in $f(x)$ around its mean value $\mathbb{E}[f(x)]$.  It is defined as:

$$\mathrm{var}[f]=\mathbb{E}[(f(x)-\mathbb{E}[f(x)])^2]$$

For $f(x)=x$ and a probability distribution $h(x)$, this becomes:

$$\sigma^2 = V = \int_{-\infty}^{\infty}  (x-\mu)^2 h(x) dx,$$

i.e., the expectation of the square distance of points from the mean.
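
For a discrete distribution, the sums above are easy to evaluate by hand or in a few lines of code. The sketch below uses a fair six-sided die as an assumed example; the helper name `expectation` is mine.

```python
def expectation(f, values, probs):
    """E[f] = sum over x of p(x) * f(x) for a discrete distribution."""
    return sum(p * f(x) for x, p in zip(values, probs))

faces = [1, 2, 3, 4, 5, 6]
probs = [1.0 / 6.0] * 6              # fair die: every face equally likely

mu_die = expectation(lambda x: x, faces, probs)
var_die = expectation(lambda x: (x - mu_die) ** 2, faces, probs)

print("mean     =", mu_die)          # exact value is 7/2
print("variance =", var_die)         # exact value is 35/12
```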

Other descriptive statistics can be defined that are useful in different circumstances.  These include
 - skewness - useful to characterize the asymmetry of a distribution
 
 $$\Sigma = \int_{-\infty}^\infty \left(\frac{x-\mu}{\sigma}\right)^3 h(x) dx$$
 
 
 - kurtosis - useful to characterize the shape of a distribution (peaked or flat)
 
 $$K=\int_{-\infty}^\infty \left(\frac{x-\mu}{\sigma}\right)^4 h(x) dx-3$$ 
 
The figure below shows examples of skewness and kurtosis.  You can see that in a sense, they are defined relative to the normal distribution (which we have not talked about yet).
![Figure 3.6](http://www.astroml.org/_images/fig_kurtosis_skew_1.png)

 
 - absolute deviation about a reference value $d$ (typically the mean or the median) - like the variance, but using the absolute value instead of the square (also called an L1 estimator).  Useful (in my experience) when outliers are present.
 $$\delta=\int_{-\infty}^{\infty} \lvert x-d \rvert h(x)dx$$
 - mode (or the most probable value in the case of unimodal functions), $x_m$
 
For data, the mode is implemented as: `mode = scipy.stats.mode(data)`
 



- p% percentiles - e.g., quartiles ($q_{25}$, median=$q_{50}$, $q_{75}$)

The difference between the third and first quartiles is called the interquartile range.  

It is argued that these are more robust estimators than the mean and standard deviation, by which I mean that they are less affected by outliers.
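
Here is a small sketch of what "robust" means in practice, using made-up numbers and the standard-library `statistics` module. One value in the second data set is replaced by a wild outlier; the helper `summarize` is my own name.

```python
import statistics

clean = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1]
contaminated = clean[:-1] + [100.0]   # replace the last value with an outlier

def summarize(data):
    q25, q50, q75 = statistics.quantiles(data, n=4)   # the three quartiles
    return {
        "mean": statistics.fmean(data),
        "std": statistics.stdev(data),
        "median": q50,
        "iqr": q75 - q25,
    }

summary_clean = summarize(clean)
summary_contaminated = summarize(contaminated)
for key in ("mean", "std", "median", "iqr"):
    print(key, summary_clean[key], summary_contaminated[key])
```

The mean and standard deviation are dragged far from their original values by the single outlier, while the median and interquartile range barely move.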

Back to the binomial distribution. Recall that $b$ is the probability of success, and $N$ is the number of draws. 

For the binomial distribution:
 - the mean (expected number of successes) is $\bar{k}=bN$
 - the standard deviation is $\sigma_k=[Nb(1-b)]^{1/2}$
 
Note that these are the _population_ statistics, i.e., expectation values given the distribution, $b$, and $N$.

Examples of the binomial distribution are seen below.

![Figure 3.9](http://www.astroml.org/_images/fig_binomial_distribution_1.png)

## The Multinomial Distribution

The multinomial distribution is a generalization of the binomial distribution.  Rather than a variable with only two discrete values (true or false, as above, which we dummied up by using $\ge 0.5$ and $<0.5$ in our example), it describes a variable that can take $M$ discrete values, with the probability of each value possibly being different.   

Example: I once refereed a paper in which something was classified as either increasing, decreasing, or staying the same.  The authors wanted to make some claims based on the number that were increasing, if I remember correctly, so they wanted to determine the probability that the number they observed could have been obtained by chance.  This was a job for a multinomial distribution, which could have been dummied up from a uniform distribution by looking at values $<0.3333$, between 0.3333 and 0.6666, and $>0.6666$.  

The Python implementation of multinomial sampling is called [numpy.random.multinomial](https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.multinomial.html).
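
The "dummied up" version of the three-category example can be sketched with the standard library alone. The function names, the seed, and the choice of 30 objects are all assumptions made for illustration; the probability formula is the standard multinomial one, $N!/(k_1! \cdots k_M!) \prod_i p_i^{k_i}$.

```python
import random
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of observing the given counts in each category."""
    coeff = factorial(sum(counts))
    for c in counts:
        coeff //= factorial(c)          # N! / (k_1! k_2! ... k_M!)
    p = float(coeff)
    for c, q in zip(counts, probs):
        p *= q ** c
    return p

def draw_counts(n_objects, rng):
    """Sort n_objects uniform draws into three equally likely categories."""
    counts = [0, 0, 0]
    for _ in range(n_objects):
        u = rng.random()
        if u < 1.0 / 3.0:
            counts[0] += 1              # e.g. "decreasing"
        elif u < 2.0 / 3.0:
            counts[1] += 1              # "staying the same"
        else:
            counts[2] += 1              # "increasing"
    return counts

rng = random.Random(42)
example_counts = draw_counts(30, rng)
print(example_counts, multinomial_pmf(example_counts, [1/3, 1/3, 1/3]))
```

With only two categories, `multinomial_pmf` reduces to the binomial formula.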



## Uncertainty in the bin counts

Back to our binomial histogram $y_{coin}$, plotted below.  Does anybody have an idea how to compute the uncertainty on the bins?



In [None]:
    
plt.axis([0,20,0,y_coin.max()*1.1])

plt.plot(x_coin+0.5,y_coin,drawstyle='steps')
plt.plot(x_coin,y_coin,'o')

## Poisson Distribution

The Poisson distribution is a distribution that is near and dear to the heart of X-ray astronomers.  It is a discrete probability distribution that expresses the probability that a certain number of events occurs in a fixed interval of time (or space), given that the average rate of events is known and the events are independent.

It is a limiting case of the binomial distribution: if the number of trials $N$ goes to infinity while the expected number of successes $\mu=pN$ stays fixed (so that the probability of success per trial, $p$, called $b$ above, becomes small), the distribution of the number of successes, $k$, is controlled by $\mu$ alone:

$$p(k|\mu)=\frac{\mu^k \exp(-\mu)}{k!}.$$
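
The relation to the binomial distribution can be checked numerically. In the sketch below (function names and the value $\mu=2$ are my own choices), $\mu=pN$ is held fixed while $N$ grows, and the largest difference between the two distributions over $k=0,\dots,10$ is printed.

```python
from math import comb, exp, factorial

def binomial_pmf(k, N, p):
    return comb(N, k) * p**k * (1.0 - p)**(N - k)

def poisson_pmf(k, mu):
    return mu**k * exp(-mu) / factorial(k)

mu = 2.0
max_diff_by_N = {}
for N in (10, 100, 10000):
    p = mu / N                        # keep mu = p*N fixed while N grows
    max_diff_by_N[N] = max(
        abs(binomial_pmf(k, N, p) - poisson_pmf(k, mu)) for k in range(11)
    )
    print(N, max_diff_by_N[N])
```

The difference shrinks as $N$ increases, which is the sense in which the Poisson distribution is a limit of the binomial.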




The descriptive statistics for the Poisson distribution are:
 - mean: $\mu$
 - standard deviation: $\sqrt{\mu}$
 - skewness: $1/\sqrt{\mu}$
 - kurtosis: $1/\mu$



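The number of counts in each histogram bin is approximately Poisson distributed, and the standard deviation of a Poisson distribution is $\sqrt{\mu}$.  So we can estimate the uncertainty in our histogram above by taking the square root of the number of counts in each bin.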



In [None]:
y_coin_err=sqrt(y_coin)
plt.xlabel('Number of Heads',fontsize=20)
plt.ylabel('Number of Times #Heads was drawn',fontsize=20)
plt.errorbar(x_coin,y_coin,yerr=y_coin_err,linestyle='None')
plt.plot(x_coin,y_coin,'o')

We will talk about histograms in much more detail when we discuss cluster analysis.

## Probability Densities

Let us think more generally about what a probability density is and how it can be defined.  At the same time, let's also generalize to cases where the variable can be continuous rather than discrete (e.g., the height of a person, versus the number of heads).

If the probability of a variable $x$ falling in an interval between $x$ and $x+\delta x$ is given by $p(x)\delta x$ as $\delta x$ approaches $0$, then $p(x)$ is called the probability density over $x$.  
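
To see this definition at work, the sketch below compares $p(x)\delta x$ with the exact probability of an interval for shrinking $\delta x$. The density $p(x)=e^{-x}$ for $x\ge 0$ is an assumption chosen only because its cumulative probability is known exactly.

```python
from math import exp

def pdf(x):
    """Illustrative density p(x) = exp(-x) for x >= 0 (assumed for this sketch)."""
    return exp(-x) if x >= 0 else 0.0

def cdf(x):
    """Exact cumulative probability for the density above."""
    return 1.0 - exp(-x) if x >= 0 else 0.0

x0 = 0.5
rel_errors = []
for dx in (0.5, 0.1, 0.01, 0.001):
    exact = cdf(x0 + dx) - cdf(x0)    # true probability of falling in [x0, x0+dx]
    approx = pdf(x0) * dx             # density times interval width
    rel_errors.append(abs(approx - exact) / exact)
    print(dx, exact, approx, rel_errors[-1])
```

The relative error falls roughly in proportion to $\delta x$, so the approximation becomes exact in the limit.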

For illustration, let's make up a probability density comprised of two Gaussians.

In [None]:
dist1=scipy.stats.norm(0,1)
dist2=scipy.stats.norm(2,0.5)
x=-4.0+0.01*np.arange(800)
y1=np.zeros(800)
y2=np.zeros(800)

for i in range(800):
    y1[i]=dist1.pdf(x[i])
    y2[i]=dist2.pdf(x[i])
yout=y1*2+2*y2

#Don't forget to normalize
prob_dist=yout/scipy.integrate.trapz(yout,x)
plt.ylabel('Probability Density',fontsize=20)
plt.xlabel('Variable',fontsize=20)

plt.plot(x,prob_dist)

## Properties of a Probability Density Function

The probability density $p(x)$ must satisfy two conditions:
$$p(x) \ge 0$$

and 

$$\int_{-\infty}^{\infty} p(x) dx = 1.$$

We can see that the first is true from the plot, and we normalized the function (using `scipy.integrate.trapz`), so the second is true as well.

Also useful is the cumulative distribution function 
$$P(z)=\int_{-\infty}^z p(x) dx$$
which we compute next.  

Naturally, given the definition, the minimum value of the cumulative distribution function is 0, and the maximum value of the cumulative distribution function is 1.

In [None]:
cumfunc=np.zeros(800)
for i in range(800):
    # integrate from x[0] up to and including x[i]
    cumfunc[i]=scipy.integrate.trapz(prob_dist[0:i+1],x[0:i+1])

plt.ylabel('Cumulative Probability',fontsize=20)
plt.xlabel('Variable',fontsize=20)


plt.plot(x,cumfunc)



Let's compute mean, variance, and standard deviation for the distribution above, using the expectation value formalism, i.e., 

$$\mu=\int_{-\infty}^{\infty} x h(x)dx.$$

and

$$\sigma^2 = V = \int_{-\infty}^{\infty}  (x-\mu)^2 h(x) dx.$$



In [None]:
mu=scipy.integrate.trapz(prob_dist*x,x)
print(mu)

var=scipy.integrate.trapz(((x-mu)**2)*prob_dist,x)
print(var)

sigma=np.sqrt(var)
print(sigma)

## The Gaussian (normal) Distribution

The normal distribution is commonly used, and even more commonly assumed, so let's discuss its properties in some detail.  

We write the normal distribution of $x$ as: $\mathcal{N}(x|\mu,\sigma^2)$.  

This nomenclature refers to the fact that the distribution is characterized by the <i>mean</i> $\mu$ and the <i>variance</i> $\sigma^2$.  
 - The square root of the variance is the <i>standard deviation</i>, 
 - while the inverse of the variance, $1/\sigma^2$ is known as the <i>precision</i>.  (The latter is more useful than you might think, given that $\sigma^2$ always appears in the denominator.)


The normal (Gaussian) probability distribution is given by: 

$$p(x)=\mathcal{N}(x|\mu,\sigma^2)=\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left[-\frac{1}{2\sigma^2}(x-\mu)^2\right]$$

where the $\mathcal{N}$ generically refers to the Gaussian distribution, and $\mu$ and $\sigma$ are the mean and standard deviation, which we will discuss shortly.

The Gaussian distribution for a few different values of $\mu$ and $\sigma$ can be seen below:


In [None]:
dist1=scipy.stats.norm(0,1.5)
dist2=scipy.stats.norm(1,1)
dist3=scipy.stats.norm(-1,0.5)

y1=np.zeros(800)
y2=np.zeros(800)
y3=np.zeros(800)

for i in range(800):
    y1[i]=dist1.pdf(x[i])
    y2[i]=dist2.pdf(x[i])
    y3[i]=dist3.pdf(x[i])

plt.plot(x,y1)
plt.plot(x,y2)
plt.plot(x,y3)


Computation of the mean and standard deviations will be left as an exercise.

The cumulative distribution for a Gaussian 

$$P(x|\mu,\sigma) =\frac{1}{(2\pi\sigma^2)^{1/2}}\int_{-\infty}^x \exp\left[-\frac{1}{2\sigma^2}(x^\prime-\mu)^2\right]dx^\prime$$

cannot be written in terms of elementary functions; instead, it is expressed in terms of the Gauss error function:

$$\mathrm{erf}(z)=\frac{2}{\sqrt{\pi}} \int_0^z \exp (-t^2)dt$$

so that 

$$P(x|\mu,\sigma)=\frac{1}{2} \left(1\pm \mathrm{erf}\left(\frac{|x-\mu|}{\sigma \sqrt{2}}\right)\right)$$

with the plus sign for $x>\mu$ and the minus sign otherwise.

Computation of the cumulative function will be left as an exercise.  Note that erf is available in python as [scipy.special.erf](https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.special.erf.html).


### Useful properties of the Gaussian

The Fourier transform of a Gaussian is a Gaussian.  Because a convolution corresponds to a product of Fourier transforms, and the product of two Gaussians is again a Gaussian, this means that the convolution of a Gaussian distribution with another Gaussian distribution also yields a Gaussian distribution.  This is useful for Bayesian statistics if, for example, your likelihood is based on $\chi^2$, which has a structure similar to that of a Gaussian, and your prior is also a Gaussian.

So for example, if you have 
 - $\mathcal{N}(\mu_0,\sigma_0)$ (for example, the intrinsic distribution that you are trying to measure)

and it is convolved with 
 - $\mathcal{N}(b,\sigma_e)$ (for example, an error distribution with bias $b$ and random error $\sigma_e$)

the result will be $\mathcal{N}(\mu_c,\sigma_c)$, where
$$\mu_c=\mu_0+b$$
and
$$\sigma_c=(\sigma_0^2+\sigma_e^2)^{1/2}.$$
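
This can be checked by simulation, since adding an independent measurement error to each intrinsic value produces draws from the convolution of the two distributions. The sketch below uses only the standard library; the parameter values and the seed are arbitrary choices for illustration.

```python
import random

rng = random.Random(1)
mu_0, sigma_0 = 1.0, 2.0      # intrinsic distribution (illustrative values)
bias, sigma_e = 0.5, 1.5      # error distribution (illustrative values)
n = 100000

# Each observed value is an intrinsic value plus an independent error.
observed = [rng.gauss(mu_0, sigma_0) + rng.gauss(bias, sigma_e) for _ in range(n)]

measured_mean = sum(observed) / n
measured_std = (sum((v - measured_mean) ** 2 for v in observed) / n) ** 0.5

mu_c = mu_0 + bias
sigma_c = (sigma_0**2 + sigma_e**2) ** 0.5

print("predicted:", mu_c, sigma_c)
print("measured: ", measured_mean, measured_std)
```

The measured values should agree with the predictions to within the sampling noise.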

There are other useful properties that we will touch on later.

## Sample Statistics from a Gaussian


Suppose we have a sample of points $\mathbf{X}=(x_1,x_2,\dotsc,x_N)$ that we know to have been drawn from a Gaussian distribution whose mean and variance are unknown.  We would like to determine the mean and variance from the data set.  How do we do this?

We can assume that these are _independently_ drawn.  Then the probability product rule tells us that the joint probability of two independent events is the product of the marginal probabilities for each event separately.  Thus, the probability of the data set is

$$p(\mathbf{X}|\mu,\sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n|\mu,\sigma^2)$$

This is the <i>likelihood function</i> for the Gaussian as a function of $\mu$ and $\sigma$.  

A common criterion for determining the parameters in a probability distribution using an observed data set is to find the parameter values that maximize the likelihood function.  It is convenient to take the natural log of the likelihood function, however, before finding the parameters that maximize it.  Since the natural log is a monotonic function, the parameters that maximize a function also maximize its natural log.  

Then the log likelihood function is written as:

$$\ln p(\mathbf{X} | \mu,\sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N}(x_n-\mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln (2\pi).$$

We can maximize this function with respect to $\mu$ and $\sigma^2$ by taking the derivative with respect to each of these two variables and setting the result equal to zero.  Only the first term depends on $\mu$, and its constant prefactor $-1/(2\sigma^2)$ can be dropped:

$$0=\frac{\partial}{\partial\mu} \sum_{n=1}^{N} (x_n-\mu)^2$$
$$=\sum_{n=1}^{N} \frac{\partial}{\partial\mu} (x_n-\mu)^2$$
$$=\sum_{n=1}^{N} 2(x_n-\mu)(-1)$$

so that $\sum_{n=1}^{N} x_n = N\mu$, i.e.,

$$\mu_{ML}=\frac{1}{N} \sum_{n=1}^N x_n$$

This is the sample mean, i.e., the mean of the observed values $\{x_n\}$.  This value for $\mu$ maximizes the log-likelihood function, and is a very familiar formula.

It is similarly easy to determine the maximum likelihood estimate of $\sigma^2$ by differentiating the log likelihood with respect to $\sigma^2$ and setting the result equal to zero:

$$0=\frac{\partial}{\partial\sigma^2}\left[-\frac{1}{2\sigma^2} \sum_{n=1}^{N}(x_n-\mu)^2 - \frac{N}{2} \ln \sigma^2\right]$$

$$=-\frac{1}{2}\sum_{n=1}^N(x_n-\mu)^2 \frac{\partial}{\partial\sigma^2} \left( \frac{1}{\sigma^2} \right)- \frac{N}{2} \frac{\partial}{\partial\sigma^2} \ln \sigma^2$$

$$=-\frac{1}{2}\sum_{n=1}^N(x_n-\mu)^2 \left(\frac{-1}{\sigma^4}\right) - \frac{N}{2} \frac{1}{\sigma^2}$$

Multiplying through by $2\sigma^4$, solving for $\sigma^2$, and replacing $\mu$ by $\mu_{ML}$ gives

$$\sigma^2_{ML}=\frac{1}{N} \sum_{n=1}^N(x_n-\mu_{ML})^2.$$

So now you know where these familiar formulae come from.
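
As a sanity check, the sketch below computes both estimates for a small made-up data set and confirms that the log likelihood is lower when either parameter is nudged away from its maximum likelihood value. The function names and the data are my own, chosen for illustration.

```python
from math import log, pi

def gaussian_ml_estimates(samples):
    """Return (mu_ML, sigma2_ML) for a list of numbers."""
    n = len(samples)
    mu_ml = sum(samples) / n
    sigma2_ml = sum((x - mu_ml) ** 2 for x in samples) / n   # divides by N
    return mu_ml, sigma2_ml

def log_likelihood(samples, mu, sigma2):
    """Gaussian log likelihood, term by term as in the expression above."""
    n = len(samples)
    return (-sum((x - mu) ** 2 for x in samples) / (2.0 * sigma2)
            - 0.5 * n * log(sigma2) - 0.5 * n * log(2.0 * pi))

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu_ml, sigma2_ml = gaussian_ml_estimates(data)
best = log_likelihood(data, mu_ml, sigma2_ml)

print(mu_ml, sigma2_ml, best)
print(log_likelihood(data, mu_ml + 0.1, sigma2_ml))    # lower than best
print(log_likelihood(data, mu_ml, sigma2_ml * 1.1))    # lower than best
```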

### Gaussian confidence levels

The probability that a measurement drawn from a Gaussian distribution lies between $\mu-a$ and $\mu+b$ is
$$\int_{\mu-a}^{\mu+b} p(x|\mu,\sigma) dx.$$
For $a=b=1\sigma$, we get the familiar result of 68.3%.  For $a=b=2\sigma$ it is 95.4%.  So we refer to the ranges $\mu \pm 1\sigma$ and $\mu \pm 2\sigma$ as the 68% and 95% **confidence limits**, respectively.
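
For the symmetric case $a=b=k\sigma$, the integral reduces to $\mathrm{erf}(k/\sqrt{2})$, which follows from the error-function form of the cumulative distribution given earlier. A minimal sketch with the standard library (the function name `prob_within` is mine):

```python
from math import erf, sqrt

def prob_within(k):
    """Probability that a Gaussian draw lies within k sigma of the mean."""
    return erf(k / sqrt(2.0))

for k in (1, 2, 3):
    print(k, prob_within(k))
```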

## Generalized Normal Distribution

The _generalized normal distribution_ can sometimes come in handy. For a location $\mu$, a scale parameter $\alpha$, and a shape parameter $\beta$, the probability distribution function is:

$$h(x)=\frac{\beta}{2\alpha \Gamma(1/\beta)} e^{-(\lvert x-\mu \rvert / \alpha)^\beta}$$

where $\Gamma$ is the gamma function.

- The case where $\beta=1$ is known as the Laplace distribution.
- The case where $\beta=2$ is the Gaussian distribution, with $\alpha=\sigma\sqrt{2}$.

Plots of the generalized normal distribution are seen here: ![generalized normal distribution](https://upload.wikimedia.org/wikipedia/commons/1/10/Generalized_normal_densities.svg)

### The Central Limit Theorem

Why is the Gaussian distribution so popular and commonly used (and, importantly, assumed)?  First, it has a number of convenient properties.
 - the convolution of two Gaussian functions is also a Gaussian.
 - the Fourier transform of a Gaussian is also a Gaussian.
 
But the most important property is the Central Limit Theorem, which states "Given an arbitrary distribution $h(x)$, characterized by a mean $\mu$ and standard deviation $\sigma$, the mean of $N$ values $x_i$ drawn from that distribution will approximately follow a Gaussian distribution $\mathcal{N}(\mu,\sigma/\sqrt{N})$, with the approximation accuracy improving with $N$."

This theorem is the foundation for performing repeat measurements in order to improve the accuracy of one's experiment.  It is telling us something about the *shape* of the distribution that we get when averaging.  The **Law of Large Numbers** further says that the sample mean will converge to the distribution mean as $N$ increases.

Let us explore this concept by experiment.  Consider 10000 numbers drawn from a uniform distribution, and create a histogram of those values between zero and 1.

In [None]:
x=np.random.uniform(0,1,10000)
print(x.shape)
plt.hist(x,bins=20)


It doesn't look much like a normal distribution (in fact, it looks like a uniform distribution), but we have only one draw per trial.

Now, what if at every trial, you picked two, and recorded the average value?

In [None]:
x=np.random.uniform(0,1,20000)
out=np.zeros(10000)
for i in range(10000):
    out[i]=(x[2*i]+x[2*i+1])/2

plt.hist(out,bins=20)



Now the distribution is peaked in the middle.  Does this make sense?


How about 10 draws per data point?

In [None]:
x=np.random.uniform(0,1,100000)
out=np.zeros(10000)
temp=np.zeros(10)
for i in range(10000):
    temp[0:10]=x[10*i:10*(i+1)]
    out[i]=temp.mean()
    
plt.hist(out,bins=20)


As you can see, we have generated a reasonable bell curve from the uniform distribution by simply averaging ten draws per data point over a large number of trials.
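
The theorem also predicts the *width* of these histograms: the standard deviation of the averages should be $\sigma/\sqrt{N}$, where $\sigma=1/\sqrt{12}$ for a uniform distribution on $[0,1]$. The sketch below repeats the experiment with the standard library and compares the two; the seed and variable names are arbitrary choices.

```python
import random

rng = random.Random(0)
sigma_uniform = (1.0 / 12.0) ** 0.5     # standard deviation of uniform(0, 1)
n_trials = 10000
measured_std = {}

for n_avg in (1, 2, 10):
    # One data point = the average of n_avg uniform draws.
    means = [sum(rng.random() for _ in range(n_avg)) / n_avg for _ in range(n_trials)]
    centre = sum(means) / n_trials
    measured_std[n_avg] = (sum((m - centre) ** 2 for m in means) / n_trials) ** 0.5
    print(n_avg, measured_std[n_avg], sigma_uniform / n_avg ** 0.5)
```

The second and third columns should agree to within the sampling noise.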

## Homework: 
You will have the opportunity to explore the central limit theorem for the Poisson distribution in the HW.

## The Cauchy (Lorentzian) Distribution

The Cauchy distribution is defined in terms of location parameter $\mu$ and shape parameter $\gamma$ as

$$p(x|\mu,\gamma)= \frac{1}{\pi \gamma} \left(\frac{\gamma^2}{\gamma^2+(x-\mu)^2}\right).$$

Examples of the Cauchy distribution are seen in Fig. 3.11 of Ivezic:

![Figure 3.11](http://www.astroml.org/_images/fig_cauchy_distribution_1.png)

A notable feature of this distribution is the long tails.  It turns out that for the Cauchy distribution, the median and mode are both equal to $\mu$, but the _mean, variance, standard deviation, and higher moments do not exist_.  

Basically, the long tails mean that there is finite probability that large values will be drawn from the distribution that would skew any estimates of the mean, etc.  In other words, one can compute a sample mean from data drawn from the Cauchy distribution, but it would have large scatter around $\mu$, and this scatter _will not decrease as the sample size increases_.  

What this means is that the _Central Limit theorem will not hold for data drawn from a Cauchy distribution_, because the Central Limit theorem only holds for distributions that have actual means and standard deviations.



Let's investigate this assertion by setting up instances of the normal and Cauchy distributions, and computing the mean and standard deviation for increasing numbers of draws.

In [None]:
normal_dist=scipy.stats.norm(0,1)
cauchy_dist=scipy.stats.cauchy(0,1)

num_draws=2+arange(1000)
norm_dist_mean=np.zeros(1000)
norm_dist_sigma=np.zeros(1000)
cauchy_dist_mean=np.zeros(1000)
cauchy_dist_sigma=np.zeros(1000)
for i in range(1000):
    temp_normal=normal_dist.rvs(num_draws[i])
    temp_cauchy=cauchy_dist.rvs(num_draws[i])
    norm_dist_mean[i]=np.mean(temp_normal)
    norm_dist_sigma[i]=np.std(temp_normal)
    cauchy_dist_mean[i]=np.mean(temp_cauchy)
    cauchy_dist_sigma[i]=np.std(temp_cauchy)
    
plt.figure(1)
plt.subplot(121)
plt.axis([0,1002,-1,1])
plt.ylabel('Normal Distribution Mean',fontsize=20)
plt.xlabel('Number of Draws',fontsize=20)

plt.plot(num_draws,norm_dist_mean)

plt.subplot(122)

plt.axis([0,1002,-200,200])
plt.ylabel('Cauchy Distribution Mean',fontsize=20)
plt.xlabel('Number of Draws',fontsize=20)
plt.plot(num_draws,cauchy_dist_mean)



In [None]:
plt.figure(1)
plt.subplot(121)
plt.axis([0,1002,0,2])
plt.ylabel('Normal Distribution Sigma',fontsize=20)
plt.xlabel('Number of Draws',fontsize=20)

plt.plot(num_draws,norm_dist_sigma)

plt.subplot(122)

plt.axis([0,1002,0,200])
plt.ylabel('Cauchy Distribution Sigma',fontsize=20)
plt.xlabel('Number of Draws',fontsize=20)
plt.plot(num_draws,cauchy_dist_sigma)


It turns out that the ratio of two independent standard normal variables ($z=(x-\mu)/\sigma$, with $z$ drawn from $\mathcal{N}(0,1)$) follows a Cauchy distribution with $\mu=0$ and $\gamma=1$.

What this means is, to quote Ivezic, "Therefore, in cases where the quantity of interest is obtained as a ratio of two other measured quantities, assuming that it is distributed as a Gaussian is a really bad idea if the quantity in the denominator has a finite chance of taking on a zero value.  Furthermore, using the mean value to determine its location parameter (i.e., to get a "best" value implied by the measurements) will not achieve the $1/\sqrt{N}$ error reduction."

They also note that the ratio of two general Gaussians will follow a Hinkley distribution.

**The bottom line is that it is good to be cautious when looking at ratios.**
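
Here is a quick simulation of the ratio of two standard normal draws, using only the standard library (the seed and sample size are arbitrary). For a Cauchy distribution with $\mu=0$ and $\gamma=1$ the quartiles sit at $-1$, $0$, and $+1$, so the sample quartiles give a stable check, while the running mean does not settle down.

```python
import random
import statistics

rng = random.Random(3)
# Ratio of two independent standard normal draws.
ratios = [rng.gauss(0.0, 1.0) / rng.gauss(0.0, 1.0) for _ in range(50000)]

q25, q50, q75 = statistics.quantiles(ratios, n=4)
print("quartiles of the ratio:", q25, q50, q75)

def mean_of_first(n):
    """Sample mean of the first n ratios."""
    return sum(ratios[:n]) / n

for n in (100, 1000, 10000, 50000):
    print(n, mean_of_first(n))
```

The quartiles are robust here for the same reason the median was recommended earlier: they do not depend on the extreme values in the tails.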

### The Bivariate and Multivariate Gaussian Distributions

The multivariate Gaussian distribution illustrates some properties that will be generally useful, in particular the concept of **covariance**.

For a $D$-dimensional vector of variables $\mathbf{x}$, the normal distribution looks like this:

$$\mathcal{N}(\mathbf{x} | \mathbf{\mu}, \mathbf{\Sigma})=\frac{1}{(2\pi)^{D/2}}\frac{1}{|\mathbf{\Sigma}|^{1/2}}\exp\left[-\frac{1}{2}(\mathbf{x}-\mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu})\right].$$

Here, note the boldface, which indicates a vector or matrix: $\mathbf{\mu}$ is the $D$-dimensional mean vector, $\mathbf{\Sigma}$ is the $D\times D$ covariance matrix, and $|\mathbf{\Sigma}|$ is the determinant of $\mathbf{\Sigma}$.  

Let us consider some important properties of this function.  The value $\Delta$, defined by

$$\Delta^2=(\mathbf{x}-\mathbf{\mu})^T\mathbf{\Sigma}^{-1}(\mathbf{x}-\mathbf{\mu}),$$

is a type of distance from $\mathbf{x}$ to $\mathbf{\mu}$ (known as the Mahalanobis distance).

Note that this reduces to the Euclidean distance when $\mathbf{\Sigma}$ is the identity matrix.
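
For two dimensions the quadratic form can be written out by hand. The sketch below does so with the standard library; the function name and the mean and covariance values are illustrative assumptions.

```python
def mahalanobis_squared(x, mu, cov):
    """Delta^2 = (x - mu)^T Sigma^{-1} (x - mu) for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of a 2x2 matrix, written out explicitly.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))

mu_example = [3.0, 2.0]
cov_example = [[1.5, 1.0], [1.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(mahalanobis_squared([3.5, 2.5], mu_example, cov_example))
print(mahalanobis_squared([3.5, 1.5], mu_example, cov_example))
print(mahalanobis_squared([3.5, 2.5], mu_example, identity))   # squared Euclidean distance
```

The two test points are at the same Euclidean distance from the mean, but their $\Delta^2$ values differ because of the off-diagonal term in the covariance.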


Let's explore some properties of $\mathbf{\Sigma}$.  
 - It can be taken to be symmetric without loss of generality, because any antisymmetric component would disappear from the exponent.
 - The exponent is a quadratic form in $\mathbf{x}$, and therefore curves of constant probability density are ellipses (or the higher-dimensional generalization of an ellipse).  
 

Let us consider an example to illustrate some other properties.  Let $\mathbf{\mu}=[3,2]$, and $\mathbf{\Sigma}$ equal to

$$\begin{pmatrix}
1.5 & 1 \\
1 & 2 \\
\end{pmatrix}$$

Let us sample this bivariate Gaussian and plot it several different ways:

In [None]:
pylab.rcParams['figure.figsize'] = (6, 6)

mean = [3,2]
cov = [[1.5,1], [1,2]]  
x, y = np.random.multivariate_normal(mean, cov, 50000).T
plt.plot(x, y, '.')


So $x$ and $y$ together are distributed as a bivariate normal distribution.

Let us first reproduce the mean and covariance from the data.

In [None]:
print('The mean of x is ',x.mean())
print('The mean of y is ',y.mean())
mu_x=x.mean()
mu_y=y.mean()
x_center=x-mu_x
y_center=y-mu_y
print('sigma_xx is ',(x_center*x_center).mean())
print('sigma_yy is ',(y_center*y_center).mean())
print('sigma_xy is ',(x_center*y_center).mean())

print('Compare with the covariance matrix ',cov)



We can also plot as a 2-d histogram:

In [None]:
plt.hist2d(x,y,bins=50,cmap='Greys')


An important aspect of this result is that the two variables are correlated with one another and are therefore not independent.  This means that the distributions and associated estimators of each variable <i>separately</i> do not capture the joint probability density.

Let's illustrate with an example.

Imagine that you are presented with $\mathbf{x}$ and $\mathbf{y}$, two properties measured from some sample (e.g., redshift and luminosity of some quasars).

Then imagine that you wanted to determine the probability that you would observe a quasar with $[x_1,y_1]$.  How might you do that?  

Naively, you might be tempted to model each of $\mathbf{x}$ and $\mathbf{y}$ separately as a Gaussian distribution and, <i>assuming that they are independent</i>, multiply the probabilities of the two together to get the joint probability, as follows.  To illustrate, consider computing the probability density at $(3.5,2.5)$ and $(3.5,1.5)$.  This pair of points is symmetric about the mean of the $y$ distribution, but not symmetric with respect to the joint distribution.

In [None]:
plt.hist2d(x,y,bins=50,cmap='Greys')

xtemp=[3.5,3.5]
ytemp=[2.5,1.5]

plt.plot(xtemp,ytemp,'ro')

In [None]:
plt.hist(x,bins=50,histtype='step')
plt.hist(y,bins=50,histtype='step')
print(x.mean(),x.std())
print(y.mean(),y.std())
from scipy.stats import norm
xdist=norm(x.mean(),x.std())
ydist=norm(y.mean(),y.std())
print('The probability density at x=3.5:', xdist.pdf(3.5))
print('The probability density at y=2.5:', ydist.pdf(2.5))
print('The joint probability density at x=3.5 and y=2.5, assuming that they are independent:',xdist.pdf(3.5)*ydist.pdf(2.5))
print('')
print('The probability density at x=3.5:', xdist.pdf(3.5))
print('The probability density at y=1.5:', ydist.pdf(1.5))
print('The joint probability density at x=3.5 and y=1.5, assuming that they are independent:',xdist.pdf(3.5)*ydist.pdf(1.5))



In fact, you will need to consider the multivariate normal distribution, which accounts for the coupling between the two parameters, as follows.

 - First I compute the instance of the multivariate normal distribution with the mean and covariance given above.
 - Then I compute the probability density at our two example positions.

In [None]:
dist=scipy.stats.multivariate_normal(mean, cov)
print(dist.pdf([3.5,2.5]))
print(dist.pdf([3.5,1.5]))

These are not equal to the naively estimated joint probability above.  This is probably not surprising, since our computation above would predict that the probability density should be the same at $(\mu_x+\delta_x,\mu_y+\delta_y)$ as at $(\mu_x+\delta_x,\mu_y-\delta_y)$, by the symmetry of the individual Gaussians - but we can see from the 2-d histogram that it is not.


Are there any circumstances in which the product of the probability densities of the two separate distributions is equal to the actual joint probability density?  

The product of the Gaussian densities of the two projections equals the density of the 2-dimensional Gaussian _when the covariance matrix is diagonal_.  
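As a quick numerical check of this statement, here is a minimal sketch. The mean and the two covariance matrices (`mean_demo`, `cov_diag`, `cov_full`) are made-up values chosen only for illustration; they are not the ones used earlier in this notebook.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical parameters, for illustration only
mean_demo = np.array([3.0, 2.0])
cov_diag = np.array([[1.5, 0.0],
                     [0.0, 0.5]])   # diagonal: no coupling between x and y
cov_full = np.array([[1.5, 0.6],
                     [0.6, 0.5]])   # same variances, non-zero covariance

point = np.array([3.5, 2.5])

# Product of the two one-dimensional Gaussian densities.
# Both covariance matrices have the same diagonal, so this product is the same for both.
product = (norm(mean_demo[0], np.sqrt(cov_diag[0, 0])).pdf(point[0]) *
           norm(mean_demo[1], np.sqrt(cov_diag[1, 1])).pdf(point[1]))

# True joint densities for the diagonal and the non-diagonal case
joint_diag = multivariate_normal(mean_demo, cov_diag).pdf(point)
joint_full = multivariate_normal(mean_demo, cov_full).pdf(point)

print('product of 1-D densities     : {:.6f}'.format(product))
print('joint, diagonal covariance   : {:.6f}'.format(joint_diag))
print('joint, non-diagonal covariance: {:.6f}'.format(joint_full))
```

The first two numbers should agree, while the third should not.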

How can we get a diagonal covariance matrix?  

If we rotate the coordinate system so that its axes lie along the axes of symmetry of the distribution, we will obtain a diagonal covariance matrix.

How do we rotate the coordinate system? 

We solve the eigenvector problem for the covariance matrix and use the matrix formed by the eigenvectors to rotate the data.
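A minimal sketch of that rotation is below. It draws its own correlated samples from a made-up mean and covariance (`mean_demo`, `cov_demo` are assumptions for illustration) and uses `np.linalg.eigh`, which is NumPy's eigen-solver for symmetric matrices.

```python
import numpy as np

rng = np.random.RandomState(42)

# Hypothetical mean and covariance, for illustration only
mean_demo = np.array([3.0, 2.0])
cov_demo = np.array([[1.5, 0.6],
                     [0.6, 0.5]])
samples = rng.multivariate_normal(mean_demo, cov_demo, size=5000)

# Sample covariance and its eigen-decomposition
sample_cov = np.cov(samples, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(sample_cov)   # columns of eigvecs are the eigenvectors

# Shift to the sample mean, then project each point onto the eigenvectors
rotated = (samples - samples.mean(axis=0)).dot(eigvecs)
rotated_cov = np.cov(rotated, rowvar=False)

print('covariance before rotation:')
print(np.round(sample_cov, 3))
print('covariance after rotation:')
print(np.round(rotated_cov, 3))
```

After the rotation the off-diagonal terms should vanish (up to round-off), and the diagonal terms should be the eigenvalues.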

We can show that this procedure will result in a joint probability density equal to the product of the individual probability densities, as follows:

The general eigenvector equation for the D-dimensional covariance matrix:
   $$\mathbf{\Sigma}\mathbf{u}_i = \lambda_i\mathbf{u}_i$$
   
where $i=1,\dots,D$.  

Linear algebra tells us that since $\mathbf{\Sigma}$ is a real, symmetric matrix, its eigenvalues are real and its eigenvectors can be chosen to form an orthonormal set, i.e., 

$$\mathbf{u}_i^T \mathbf{u}_j = I_{ij}$$
where $\mathbf{I}$ is the D-dimensional identity matrix.

More results from linear algebra (see Bishop, section 2.3) yield

$$\mathbf{\Sigma}^{-1} = \sum_{i=1}^D \frac{1}{\lambda_i} \mathbf{u}_i \mathbf{u}_i^T, $$

with the quadratic form in the Gaussian function becoming:

$$\Delta^2 = \sum_{i=1}^D \frac{y_i^2}{\lambda_i}$$

where $y_i$ is defined as:

$$y_i=\mathbf{u}_i^T (\mathbf{x}-\mathbf{\mu}).$$

In vector form, this is:

$$\mathbf{y}=\mathbf{U}(\mathbf{x}-\mathbf{\mu})$$
where $\mathbf{U}$ is a matrix whose rows are given by $\mathbf{u}_i^T$, and $\{y_i\}$ is a coordinate system that is shifted and rotated with respect to the original $\{x_i\}$ coordinates. 

What has been accomplished by this transformation to the $\mathbf{y}$ coordinate system?  


Assuming all eigenvalues are positive, in this coordinate system the general equation for the multivariate Gaussian simplifies dramatically, chiefly because the analog of the covariance matrix is now a diagonal matrix of eigenvalues.

In addition, the square root of the determinant of the covariance matrix is the product of the square roots of the eigenvalues:
$$|\mathbf{\Sigma}|^{1/2} = \prod_{j=1}^D \lambda_j^{1/2}.$$

So, in the $y_i$ coordinate system, the Gaussian distribution takes the form:

$$p(\mathbf{y}) = \prod_{j=1}^D \frac{1}{(2 \pi \lambda_{j})^{1/2}} \exp \big[-\frac{y_j^2}{2 \lambda_j}\big],$$
i.e., a product of $D$ <i>independent</i> one dimensional Gaussian distributions.  So it has the property that we were looking for: the joint probability distribution factorizes into a product of independent distributions.
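The steps of this derivation can be checked numerically. The sketch below uses a made-up 2-dimensional mean and covariance (`mean_demo`, `cov_demo` are assumptions, not the values from earlier in the notebook) and verifies the spectral form of $\mathbf{\Sigma}^{-1}$, the determinant identity, and the factorization of the density.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Hypothetical mean and covariance, for illustration only
mean_demo = np.array([3.0, 2.0])
cov_demo = np.array([[1.5, 0.6],
                     [0.6, 0.5]])

eigvals, eigvecs = np.linalg.eigh(cov_demo)
U = eigvecs.T                       # rows of U are the u_i^T

# Sigma^{-1} as a sum over (1/lambda_i) u_i u_i^T
inv_spectral = sum(np.outer(u, u) / lam for lam, u in zip(eigvals, U))

# |Sigma|^{1/2} versus the product of lambda_j^{1/2}
sqrt_det = np.sqrt(np.linalg.det(cov_demo))
prod_sqrt_eig = np.prod(np.sqrt(eigvals))

# Transform one point to the y coordinates and compare the two densities
x_point = np.array([3.5, 1.5])
y_point = U.dot(x_point - mean_demo)

joint = multivariate_normal(mean_demo, cov_demo).pdf(x_point)
# Product of independent 1-D Gaussians with variances lambda_j
factorized = np.prod(norm(0.0, np.sqrt(eigvals)).pdf(y_point))

print('spectral inverse matches np.linalg.inv: {}'.format(
    np.allclose(inv_spectral, np.linalg.inv(cov_demo))))
print('sqrt(det) = {:.6f}, product of sqrt(eigenvalues) = {:.6f}'.format(
    sqrt_det, prod_sqrt_eig))
print('joint density = {:.6f}, factorized density = {:.6f}'.format(
    joint, factorized))
```

No extra factor appears in the density because $\mathbf{U}$ is orthogonal, so the change of variables from $\mathbf{x}$ to $\mathbf{y}$ preserves volume.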

### Homework:

Your homework has a problem that allows you to explore this.

### Sample vs. Population Statistics 

Statistics estimated from the *data* are called _sample statistics_ as compared to _population statistics_ which come from knowing the functional form of the pdf.  Up to now we have been computing population statistics.

Specifically, $\mu$ is the *population average*, i.e., it is the expectation value of $x$ for $h(x)$.  But we don't *know* $h(x)$.  So the **sample mean**, $\overline{x}$, is an *estimator* of $\mu$, defined as
$$\overline{x} \equiv \frac{1}{N}\sum_{i=1}^N x_i,$$
which we determine from the data itself.

Then instead of $\sigma^2$, which is the population variance, we have the **sample variance**, $s^2$, where

$$s^2 = \frac{1}{N-1}\sum_{i=1}^N(x_i-\overline{x})^2$$

The denominator is $N-1$ instead of $N$ because we had to determine $\overline{x}$ from the data instead of using a known $\mu$.  Ideally one tries to work in a regime where $N$ is large enough that we can be lazy and ignore this. 

So the mean and variance of a distribution are $\mu$ and $\sigma^2$.  The *estimators* of the distribution are $\overline{x}$ (or $\hat{x}$) and $s^2$.
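A short sketch of these estimators, using simulated data (the population values `mu_true` and `sigma_true` are made up for the illustration). One thing to watch for: NumPy's `var` and `std` divide by $N$ by default (`ddof=0`), so you need `ddof=1` to get the $N-1$ denominator.

```python
import numpy as np

rng = np.random.RandomState(0)
mu_true, sigma_true = 5.0, 2.0      # hypothetical population values
data = rng.normal(mu_true, sigma_true, size=50)

N = len(data)
xbar = data.sum() / N                           # sample mean
s2 = ((data - xbar)**2).sum() / (N - 1)         # sample variance, N-1 denominator

print('sample mean     : {:.4f}'.format(xbar))
print('sample variance : {:.4f}'.format(s2))
print('np.var(ddof=1)  : {:.4f}'.format(data.var(ddof=1)))
print('np.var(ddof=0)  : {:.4f}'.format(data.var(ddof=0)))
```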

### Bias

If the expectation value of an *estimator* differs from the *population* value it is meant to estimate, we say that the estimator is **biased** (perhaps not quite the usage of the word that you are used to).  Again, more on this later.
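To make this concrete, here is a small simulation sketch (the sample size, number of trials, and population variance are arbitrary choices for illustration). It averages the variance estimate over many small samples, once with an $N$ denominator and once with $N-1$.

```python
import numpy as np

rng = np.random.RandomState(4)
N = 5                 # deliberately small sample size
n_trials = 20000      # number of simulated experiments
sigma2_true = 1.0     # hypothetical population variance

# Each row is one experiment of N Gaussian draws
draws = rng.normal(0.0, np.sqrt(sigma2_true), size=(n_trials, N))

mean_var_N = draws.var(axis=1, ddof=0).mean()      # denominator N
mean_var_Nm1 = draws.var(axis=1, ddof=1).mean()    # denominator N-1

print('average estimate with N   : {:.3f}'.format(mean_var_N))
print('average estimate with N-1 : {:.3f}'.format(mean_var_Nm1))
print('population variance       : {:.3f}'.format(sigma2_true))
```

On average the $N$-denominator estimate should come out low by a factor of about $(N-1)/N$, while the $N-1$ version should scatter around the population value.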

### Uncertainty

We would also like to know the uncertainty of our estimates $\overline{x}$ and $s$.  Note that $s$ is **NOT** the uncertainty of $\overline{x}$.  Rather, the uncertainty of $\overline{x}$, denoted $\sigma_{\overline{x}}$, is 
$$ \sigma_{\overline{x}} = \frac{s}{\sqrt{N}},$$
which we call the *standard error of the mean*.

The uncertainty of $s$ itself is
$$\sigma_s = \frac{s}{\sqrt{2(N-1)}} = \frac{1}{\sqrt{2}}\sqrt{\frac{N}{N-1}}\sigma_{\overline{x}}.$$

Note that for large $N$, $\sigma_{\overline{x}} \sim \sqrt{2}\sigma_s$ and for small $N$, $\sigma_s$ is not much smaller than $s$.
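These two formulas can be checked by brute force. The sketch below simulates many experiments of $N$ Gaussian measurements (all numbers are arbitrary illustrative choices) and compares the observed scatter of $\overline{x}$ and of $s$ with the predictions, using the true $\sigma$ in place of $s$. The expression for $\sigma_s$ is an approximation for Gaussian data, so expect close but not exact agreement.

```python
import numpy as np

rng = np.random.RandomState(1)
N = 20
n_trials = 20000
sigma_true = 1.0      # hypothetical population standard deviation

# Each row is one simulated experiment of N Gaussian measurements
draws = rng.normal(0.0, sigma_true, size=(n_trials, N))
means = draws.mean(axis=1)
s_values = draws.std(axis=1, ddof=1)

# Scatter actually observed across the simulated experiments
scatter_of_means = means.std(ddof=1)
scatter_of_s = s_values.std(ddof=1)

# Predictions from the formulas in the text
predicted_sem = sigma_true / np.sqrt(N)
predicted_sigma_s = sigma_true / np.sqrt(2.0 * (N - 1))

print('scatter of the means: {:.4f} (predicted {:.4f})'.format(
    scatter_of_means, predicted_sem))
print('scatter of s        : {:.4f} (predicted {:.4f})'.format(
    scatter_of_s, predicted_sigma_s))
```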

## A Few More Distributions



### Uniform Distribution

The uniform distribution is perhaps more commonly called a "top-hat" or a "box" distribution.  It is specified by a mean, $\mu$, and a width, $W$, where

$$p(x|\mu,W) = \frac{1}{W}$$

over the range $|x-\mu|\le \frac{W}{2}$ and $0$ otherwise.  That says that "given $\mu$ AND $W$, the probability density of $x$ is $\frac{1}{W}$" (as long as we are within a certain range).

Since we are used to thinking of a Gaussian as the *only* type of distribution the concept of $\sigma$ (aside from the width) may seem strange.  But $\sigma$ as mathematically defined above applies here and
$$\sigma = \frac{W}{\sqrt{12}}.$$
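A quick check with `scipy.stats.uniform`, which is parameterized by the lower edge (`loc`) and the width (`scale`) rather than by the mean. The values of `mu_demo` and `W_demo` are made up for the illustration.

```python
import numpy as np
from scipy.stats import uniform

mu_demo, W_demo = 2.0, 3.0      # hypothetical mean and width

# scipy's uniform covers [loc, loc + scale]
box = uniform(loc=mu_demo - W_demo / 2.0, scale=W_demo)

print('pdf inside the box : {:.4f} (1/W = {:.4f})'.format(
    box.pdf(mu_demo), 1.0 / W_demo))
print('pdf outside the box: {:.4f}'.format(box.pdf(mu_demo + W_demo)))
print('sigma              : {:.4f} (W/sqrt(12) = {:.4f})'.format(
    box.std(), W_demo / np.sqrt(12.0)))
```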


### Log Normal

Note that if $x$ is Gaussian distributed with $\mathscr{N}(\mu,\sigma)$, then $y=\exp(x)$ will have a **log-normal** distribution, where the mean of y is $\exp(\mu + \sigma^2/2)$.  
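A sketch that checks the log-normal mean, both by simulation and against `scipy.stats.lognorm`. The values of `mu_demo` and `sigma_demo` are arbitrary; note that scipy parameterizes this distribution with `s` equal to $\sigma$ and `scale` equal to $\exp(\mu)$.

```python
import numpy as np
from scipy.stats import lognorm

mu_demo, sigma_demo = 1.0, 0.5      # hypothetical Gaussian parameters

rng = np.random.RandomState(3)
x_gauss = rng.normal(mu_demo, sigma_demo, size=200000)
y_lognormal = np.exp(x_gauss)       # exponentiate Gaussian draws

predicted_mean = np.exp(mu_demo + sigma_demo**2 / 2.0)
scipy_mean = lognorm(s=sigma_demo, scale=np.exp(mu_demo)).mean()

print('mean of simulated y  : {:.4f}'.format(y_lognormal.mean()))
print('exp(mu + sigma^2/2)  : {:.4f}'.format(predicted_mean))
print('scipy.stats.lognorm  : {:.4f}'.format(scipy_mean))
```

Notice that the mean of $y$ is larger than $\exp(\mu)$: the long tail to high values pulls it up.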

### $\chi^2$ Distribution

We'll run into the $\chi^2$ distribution when we talk about Maximum Likelihood in the next lecture.

If we have a Gaussian distribution with values $\{x_i\}$ and we scale and normalize them according to
$$z_i = \frac{x_i-\mu}{\sigma},$$
then the sum of squares, $Q$ 
$$Q = \sum_{i=1}^N z_i^2,$$
will follow the $\chi^2$ distribution.  The *number of degrees of freedom*, $k$ is given by the number of data points, $N$ (minus any constraints).  The pdf of $Q$ given $k$ defines $\chi^2$ and is given by
$$p(Q|k)\equiv \chi^2(Q|k) = \frac{1}{2^{k/2}\Gamma(k/2)}Q^{k/2-1}\exp(-Q/2),$$
where $Q>0$ and the $\Gamma$ function generalizes the factorial: for a positive integer $n$, $\Gamma(n)=(n-1)!$, but here the argument $k/2$ can also be a half integer.

This is ugly, but it is really just a formula like anything else.  Note that the shape of the distribution *only* depends on the sample size $N=k$ and not on $\mu$ or $\sigma$.  
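To see that it really is just a formula, the sketch below writes it out term by term and compares it with `scipy.stats.chi2.pdf`. It then simulates $Q$ from Gaussian draws; the choices of `k_demo`, `mu_true`, and `sigma_true` are arbitrary, and the mean of $Q$ should come out close to $k$ whatever $\mu$ and $\sigma$ are.

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import chi2

def chi2_pdf(Q, k):
    """Chi-squared pdf, written out from the formula in the text."""
    return (Q**(k / 2.0 - 1.0) * np.exp(-Q / 2.0)
            / (2.0**(k / 2.0) * gamma(k / 2.0)))

k_demo = 5
q_grid = np.linspace(0.5, 20.0, 40)
max_diff = np.max(np.abs(chi2_pdf(q_grid, k_demo) - chi2.pdf(q_grid, k_demo)))
print('largest difference from scipy: {:.2e}'.format(max_diff))

# Simulate Q: k_demo Gaussian draws per experiment, standardized and squared
rng = np.random.RandomState(2)
mu_true, sigma_true = 10.0, 3.0     # hypothetical; the result should not depend on these
x_draws = rng.normal(mu_true, sigma_true, size=(20000, k_demo))
z = (x_draws - mu_true) / sigma_true
Q_sim = (z**2).sum(axis=1)

print('mean of simulated Q: {:.3f} (expected k = {})'.format(Q_sim.mean(), k_demo))
```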

### Student's $t$ Distribution

Another distribution that we'll see later is the Student's $t$ Distribution.

If you have a sample of $N$ measurements, $\{x_i\}$, drawn from a Gaussian distribution, $\mathscr{N}(\mu,\sigma)$, and you apply the transform
$$t = \frac{\overline{x}-\mu}{s/\sqrt{N}},$$
then $t$ will be distributed according to Student's $t$ with $k=N-1$ degrees of freedom, which has the following pdf: 
$$p(t|k) = \frac{\Gamma(\frac{k+1}{2})}{\sqrt{\pi k} \Gamma(\frac{k}{2})} \left(1+\frac{t^2}{k}\right)^{-\frac{k+1}{2}}$$

As with a Gaussian, Student's $t$ is bell shaped, but has "heavier" tails.
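The sketch below writes out the Student's $t$ pdf from the formula and compares it with `scipy.stats.t.pdf`, then compares the probability beyond 3 with that of a unit Gaussian to show the heavier tail. The choice `k_demo = 4` is arbitrary.

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import t as student_t, norm

def student_t_pdf(tval, k):
    """Student's t pdf, written out from the formula in the text."""
    return (gamma((k + 1) / 2.0) / (np.sqrt(np.pi * k) * gamma(k / 2.0))
            * (1.0 + tval**2 / k)**(-(k + 1) / 2.0))

k_demo = 4
t_grid = np.linspace(-6.0, 6.0, 49)
max_diff = np.max(np.abs(student_t_pdf(t_grid, k_demo)
                         - student_t.pdf(t_grid, k_demo)))
print('largest difference from scipy: {:.2e}'.format(max_diff))

# One-sided tail probability beyond 3 (sf is the survival function, 1 - cdf)
tail_t = student_t.sf(3.0, k_demo)
tail_gauss = norm.sf(3.0)
print('P(t > 3), Student t with k=4: {:.5f}'.format(tail_t))
print('P(z > 3), unit Gaussian     : {:.5f}'.format(tail_gauss))
```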

### What's the point?

The point is that we are going to make some measurement.  And we will want to know how likely it is that we would get that measurement in our experiment as compared to random chance.  To determine that we need to know the shape of the distribution.  Let's say that we find that $x=6$.  If our data is $\chi^2$ distributed with 2 degrees of freedom, then we would integrate the $k=2$ curve above from 6 to $\infty$ to determine how likely it is that we would have gotten 6 or larger by chance.  If our data were instead $t$ distributed, we would get a *very* different answer.  Note that it is important that you decide *ahead of time* what the metric will be for deciding whether this result is significant or not.  More on this later, but see [this article](http://fivethirtyeight.com/features/science-isnt-broken/).
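The integral from 6 to $\infty$ does not need to be done by hand: the survival function `sf` of a scipy distribution returns exactly this upper-tail probability. A minimal sketch, assuming a one-sided tail and 2 degrees of freedom for the $t$ case as well (that second choice is mine, made only so the two numbers are comparable):

```python
from scipy.stats import chi2, t as student_t

measured = 6.0
k_demo = 2      # degrees of freedom; for the t case this is an assumed value

# Probability of getting 6 or larger by chance under each distribution
chi2_tail = chi2.sf(measured, k_demo)
t_tail = student_t.sf(measured, k_demo)

print('chi^2 (k=2) tail beyond 6    : {:.4f}'.format(chi2_tail))
print('Student t (k=2) tail beyond 6: {:.4f}'.format(t_tail))
```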