# Effect of Priors

This example is taken from [Eadie, et al. (2023)](https://arxiv.org/abs/2302.04703).

The true but unknown distance $d$ in kiloparsecs (kpc) is related to the true but unknown parallax angle $\pi$ in milliarcseconds (mas) through the equation:
$$ d \ [{\rm kpc}] = \frac{1}{\pi \ [{\rm mas}]}.$$

Through Bayesian inference we wish to infer the parameter $d$ given the data. In this example we will first define the likelihood function, then explore the effect of possible priors, and finally apply this method to a star cluster.
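As a quick illustration of the relation above, here is a minimal sketch of the unit conversion. The helper name `parallax_to_distance_kpc` and the example values are hypothetical and not part of the data set. The conversion applies to the *true* parallax; a noisy measured parallax can be zero or negative, which is one reason the inference below is needed.

```python
import numpy as np

def parallax_to_distance_kpc(parallax_mas):
    """Convert a true parallax in milliarcseconds to a distance in kiloparsecs."""
    parallax_mas = np.asarray(parallax_mas, dtype=float)
    return 1.0 / parallax_mas

# illustrative parallaxes in mas (not taken from the data set)
example_parallaxes = np.array([0.5, 1.0, 2.0])
example_distances = parallax_to_distance_kpc(example_parallaxes)
print(example_distances)
```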

Setting up some basic commands and plotting environment style.

In [None]:
import sys
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
from astropy.io import fits
from sklearn.neighbors import KernelDensity
from scipy.stats import norm

In [None]:
# plot in-line within the notebook
%matplotlib inline

# re-defining plotting defaults
from matplotlib import rcParams

In [None]:
# fix random seed to ensure exact reproducibility
np.random.seed(42)

# Data

Download and load the data. These parallaxes were measured by the *Gaia* spacecraft.

In [None]:
import urllib.request
url = "https://raw.githubusercontent.com/profhewitt/ast3414/refs/heads/main/NGC_2682.fits"
urllib.request.urlretrieve(url, "NGC_2682.fits")

In [None]:
# load data for NGC 2682 (M67) star cluster
hdu = fits.open('NGC_2682.fits')
data = hdu[1].data

# extract membership probabilities from kinematics (Gaia DR2)
pmem = data['HDBscan_MemProb']

# extract parallaxes (Gaia DR2)
# note: no corrections have been applied for any systematics
p, pe = data['Parallax'], data['Parallax_Err']

In [None]:
# get lowest signal-to-noise ratio parallax measurement
idx_min = np.argmin(p/pe)

# Prior distributions

The prior captures the initial knowledge about the model parameters before we have seen the data. Priors may assign higher probability to some values of the model parameters than to others. Priors are often categorized into two classes: *informative* and *non-informative*.

Informative priors can be scientifically motivated, based on previous data, or even based on the data at hand. Non-informative priors can be "flat," weakly informative, or some other reference distribution.

One popular choice for a non-informative prior is a flat prior on an unbounded range. When using such a prior it is important to realize what information your prior is providing, despite being "flat" and "non-informative". For example, we can choose a flat prior for parallax, uniformly distributed between some physically motivated cutoffs $\pi_{min}$ and $\pi_{max}$, so that:
$$
\begin{equation}
p(\pi) \propto \left\{
\begin{aligned}
{\rm constant} \ \ \ \ \ \pi_{min} < \pi < \pi_{max} \\
0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ {\rm otherwise}
\end{aligned}
\right.
\end{equation}
$$

Note that we can also define a prior that is uniform in distance:
$$
\begin{equation}
p(d) \propto \left\{
\begin{aligned}
{\rm constant} \ \ \ \ \ \ d_{min} < d < d_{max} \\
0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ {\rm otherwise}
\end{aligned}
\right.
\end{equation}
$$
where $d_{min} = 1/\pi_{max}$ and $d_{max} = 1/\pi_{min}$.
Both are weakly informative, but they display drastically different behavior as a function of $d$.
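To make the difference concrete: a change of variables gives the distance prior implied by a flat prior in parallax, $p(d) = p(\pi)\,|d\pi/dd| \propto 1/d^2$. The sketch below checks this numerically. The cutoffs `pi_min` and `pi_max`, the sample size, and the binning are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical cutoffs in mas, chosen only for illustration
pi_min, pi_max = 0.2, 5.0
d_min, d_max = 1.0 / pi_max, 1.0 / pi_min

# draw parallaxes from the flat prior and convert them to distances
parallax_draws = rng.uniform(pi_min, pi_max, size=200_000)
distance_draws = 1.0 / parallax_draws

# empirical density of the implied distances
bins = np.linspace(d_min, d_max, 41)
density, _ = np.histogram(distance_draws, bins=bins, density=True)

# analytic expectation: p(d) = d^-2 / (pi_max - pi_min), averaged over each bin
expected = (1.0 / bins[:-1] - 1.0 / bins[1:]) / np.diff(bins) / (pi_max - pi_min)

print(f"largest deviation from 1/d^2 law: {np.max(np.abs(density - expected)):.4f}")
```

A prior that looks uninformative in parallax therefore strongly favors small distances when expressed in $d$.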

There is a better choice of prior for this particular parallax problem, introduced by [Bailer-Jones, et al. (2018)](https://ui.adsabs.harvard.edu/abs/2018AJ....156...58B/). The physical volume $dV$ probed by an infinitesimal solid angle on the sky $d\Omega$ at a given distance $d$ scales as the size of a shell, or $dV \propto d^2$. Assuming a constant stellar number density $\rho$ everywhere means a prior behaving as $p(d) \propto d^2$ is more appropriate. However, we also know that we sit in the disk of our Galaxy, in which the stellar density drops off as we go radially outward in the disk. Assuming we are looking inward, and that the stellar density decreases exponentially with a length scale $L$, we have as a prior:
$$
\begin{equation}
p(d) \propto \left\{
\begin{aligned}
d^2 e^{-d/L} \ \ \ \ \ \ d_{min} < d < d_{max} \\
0 \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ {\rm otherwise}
\end{aligned}
\right.
\end{equation}
$$
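Ignoring the cutoffs, $d^2 e^{-d/L}$ is the kernel of a gamma distribution with shape 3 and scale $L$, so the prior peaks at $d = 2L$. The sketch below checks this with `scipy.stats.gamma`. The value $L = 1$ kpc and the grid are illustrative assumptions.

```python
import numpy as np
from scipy.stats import gamma

L = 1.0  # assumed length scale in kpc (illustrative)

# d^2 exp(-d/L) is the unnormalized pdf of a gamma distribution with shape 3, scale L
prior = gamma(a=3, scale=L)

# evaluate on a grid and locate the mode numerically
d_grid = np.linspace(0.0, 10.0, 10001)
pdf = prior.pdf(d_grid)
mode_numeric = d_grid[np.argmax(pdf)]

# compare with the unnormalized form from the text (skip d = 0 to avoid 0/0)
kernel = d_grid**2 * np.exp(-d_grid / L)
ratio = pdf[1:] / kernel[1:]  # constant if the two agree up to normalization

print(f"mode = {mode_numeric:.2f} kpc (analytic: {2 * L:.2f} kpc)")
print(f"prior median = {prior.median():.2f} kpc")
```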

In [None]:
# deliberately shift mean and inflate errors for illustrative purposes
pdraws = np.random.normal(p[idx_min] - 0.5, pe[idx_min] * 3, size=int(1e8))

In [None]:
# prior 1: uniform in parallax
prior_unif_parallax = np.random.uniform(0, 5, size=int(1e8))

# prior 2: uniform in distance
prior_unif_dist = np.random.uniform(0, 10, size=int(1e8))

# make histogram of data in both parallax and distance
y, bins_p = np.histogram(pdraws, bins=np.linspace(-4, 5, 2501))
cents_p = (bins_p[1:] + bins_p[:-1]) / 2.
yp_par, _ = np.histogram(prior_unif_parallax, bins=bins_p)
yp_dist, _ = np.histogram(1. / prior_unif_dist, bins=bins_p)

y2, bins_d = np.histogram(1./pdraws, bins=np.linspace(0, 10, 2501))
cents_d = (bins_d[1:] + bins_d[:-1]) / 2.
y2p_par, _ = np.histogram(1. / prior_unif_parallax, bins=bins_d)
y2p_dist, _ = np.histogram(prior_unif_dist, bins=bins_d)

In [None]:
# prior 3: Bailer-Jones
# compute function over corresponding bins
y3 = cents_d**2 * np.exp(-cents_d / 1.)  # length scale L = 1 kpc
y3 /= y3.sum() # normalize function

# use inverse-CDF sampling to draw from prior
y3p_draws = np.interp(np.random.uniform(0, 1, size=int(1e8)), y3.cumsum(), cents_d)

# histogram data
y3p_par, _ = np.histogram(1. / y3p_draws, bins=bins_p)
y3p_dist, _ = np.histogram(y3p_draws, bins=bins_d)
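The inverse-CDF step in the cell above can be checked on a smaller scale. The sketch below uses a hypothetical helper, `sample_from_grid`, which is not part of the notebook. It normalizes the cumulative sum explicitly and uses far fewer draws, so it runs quickly.

```python
import numpy as np

def sample_from_grid(grid, weights, size, rng):
    """Draw samples from a gridded, unnormalized density via inverse-CDF sampling."""
    cdf = np.cumsum(weights, dtype=float)
    cdf /= cdf[-1]  # normalize so the CDF ends at 1
    u = rng.uniform(0.0, 1.0, size=size)
    # invert the CDF by linear interpolation: u -> grid value
    return np.interp(u, cdf, grid)

rng = np.random.default_rng(1)
d_grid = np.linspace(0.002, 9.998, 2500)      # bin centers of a 0-10 kpc grid
weights = d_grid**2 * np.exp(-d_grid / 1.0)   # Bailer-Jones form with L = 1 kpc
draws = sample_from_grid(d_grid, weights, size=100_000, rng=rng)

print(f"sample mean = {draws.mean():.2f} kpc")
```

The untruncated distribution has mean $3L$, so the sample mean should land just below 3 kpc; the cutoff at 10 kpc removes only a small tail.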

In [None]:
# plot priors (distance)
plt.plot(cents_d, y2p_par / 5., color='navy', lw=3)
plt.plot(cents_d, y2p_dist, color='firebrick', lw=3)
plt.plot(cents_d, y3p_dist, color='darkorange', lw=3)
plt.fill_between(cents_d, y2p_par / 5., color='dodgerblue', alpha=0.5)
plt.fill_between(cents_d, y2p_dist, color='indianred', alpha=0.5)
plt.fill_between(cents_d, y3p_dist, color='sandybrown', alpha=0.5)
plt.xlim([0, 6])
plt.xlabel('Distance [kpc]')
plt.ylim([0, None])
plt.yticks([])
plt.ylabel('Probability [normalized]')
plt.tight_layout()
plt.legend(['Uniform in Parallax', 'Uniform in Distance', 'Bailer-Jones'])

# Likelihood

We assume that the likelihood function is a product of independent and identically distributed Gaussian random variables with known variance. This is appropriate if the errors on the data are randomly distributed, and there are many measurements. The likelihood function is given by the normal distribution with mean equal to the true underlying parallax $\pi$:
$$ p(y|\pi) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp \big[-\frac{(y-\pi)^2}{2\sigma^2}\big],$$
where $y$ is the measured parallax and $\sigma$ is the associated measurement uncertainty. The parameter of interest is the distance $d$, so the likelihood function can be re-written in terms of the distance as,
$$ p(y|d)=\frac{1}{\sqrt{2\pi\sigma^2}} \exp \big[-\frac{(y-1/d)^2} {2\sigma^2}\big].$$

Below we plot the two likelihood functions to see what each looks like. Note how the Gaussian distribution in parallax leads to a right-skewed distribution in distance because of the inverse relation between parallax angle and distance.
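The second formula above can also be evaluated directly on a grid of distances. The sketch below does this for a hypothetical measurement; `y_obs`, `sigma_obs`, and the function name are illustrative assumptions. It evaluates the formula pointwise rather than histogramming samples.

```python
import numpy as np

def likelihood_distance(d, y, sigma):
    """Gaussian parallax likelihood p(y | d), evaluated as a function of distance d (kpc)."""
    return np.exp(-0.5 * ((y - 1.0 / d) / sigma) ** 2) / np.sqrt(2.0 * np.pi * sigma**2)

# hypothetical measurement: parallax y in mas with uncertainty sigma in mas
y_obs, sigma_obs = 1.0, 0.3

d_grid = np.linspace(0.01, 10.0, 5000)
like = likelihood_distance(d_grid, y_obs, sigma_obs)

# the likelihood peaks where 1/d equals the measured parallax
d_peak = d_grid[np.argmax(like)]
print(f"peak at d = {d_peak:.2f} kpc")

# at large d the value approaches a positive constant instead of zero
print(f"value at d = {d_grid[-1]:.0f} kpc relative to peak: {like[-1] / like.max():.4f}")
```

Because $1/d \to 0$ at large distance, $p(y|d)$ tends to a finite value proportional to $\exp[-y^2/(2\sigma^2)]$. Viewed as a function of $d$ it does not decay to zero, which is one reason the choice of prior matters for noisy parallaxes.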

In [None]:
# plot likelihood (parallax)
plt.fill_between(cents_p, y, color='grey', alpha=0.5)
plt.plot(cents_p, y, color='grey', lw=3)
plt.xlim([-4, 5])
plt.ylim([0, None])
plt.xlabel('Parallax [mas]')
plt.yticks([])
plt.ylabel('Probability [normalized]')
plt.tight_layout()

In [None]:
# plot likelihood (distance)
plt.fill_between(cents_d, y2, color='grey', alpha=0.5)
plt.plot(cents_d, y2, color='grey', lw=3)
plt.xlim([0, 6])
plt.ylim([0, None])
plt.xlabel('Distance [kpc]')
plt.yticks([])
plt.ylabel('Probability [normalized]')
plt.tight_layout()

Now let's compare how these prior distributions affect the posterior distributions by multiplying the Gaussian likelihood by the various priors.

Posterior for uniform in parallax prior (the factor $1/d^2$ is the Jacobian $|d\pi/dd|$):
$$ p(d|y) \propto \frac{1}{d^2} \exp \big[-\frac{(y-1/d)^2} {2\sigma^2}\big].$$

Posterior for uniform in distance prior:
$$ p(d|y) \propto \exp \big[-\frac{(y-1/d)^2} {2\sigma^2}\big].$$

Posterior for Bailer-Jones prior:
$$ p(d|y) \propto d^2 \exp \big[-\frac{d}{L}-\frac{(y-1/d)^2} {2\sigma^2}\big].$$
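These three expressions can be evaluated on a grid without any sampling. The sketch below works in log space for numerical stability. The measurement (`y_obs`, `sigma_obs`), the length scale `L`, and the grid are hypothetical values chosen for illustration.

```python
import numpy as np

# hypothetical measurement (mas) and length scale (kpc), for illustration only
y_obs, sigma_obs, L = 1.0, 0.3, 1.0

d_grid = np.linspace(0.01, 10.0, 5000)
log_like = -0.5 * ((y_obs - 1.0 / d_grid) / sigma_obs) ** 2

# log-priors in distance, up to additive constants
log_priors = {
    "uniform in parallax": -2.0 * np.log(d_grid),
    "uniform in distance": np.zeros_like(d_grid),
    "Bailer-Jones": 2.0 * np.log(d_grid) - d_grid / L,
}

posteriors = {}
for name, log_prior in log_priors.items():
    log_post = log_like + log_prior
    post = np.exp(log_post - log_post.max())  # subtract max for numerical stability
    posteriors[name] = post / post.sum()      # normalize over the grid
    print(f"{name}: mode = {d_grid[np.argmax(post)]:.2f} kpc")
```

For these assumed values the uniform-in-parallax mode falls below $1/y$, the uniform-in-distance mode sits at $1/y$, and the Bailer-Jones mode falls above it.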

Using our posteriors, we can calculate a few summary statistics: the median distance and 95% credible interval, as well as the modes for each distribution. Clearly, very different median distances and 95% credible intervals from the posteriors are found from these different prior distributions.

These posterior probability distributions are plotted below. You can clearly see that each one is skewed to the right (toward larger distances) and that their shapes are noticeably different.
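A central 95% credible interval runs from the 2.5% to the 97.5% quantile of the posterior. The sketch below reads quantiles off a gridded density by interpolating its cumulative sum, then checks the result against a Gaussian with known quantiles. The helper `grid_quantiles` and the test density are hypothetical.

```python
import numpy as np

def grid_quantiles(grid, pdf, probs):
    """Quantiles of a density tabulated on a grid, via interpolation of its CDF."""
    cdf = np.cumsum(pdf, dtype=float)
    cdf /= cdf[-1]  # normalize so the CDF ends at 1
    return np.interp(probs, cdf, grid)

# check on a case with a known answer: a Gaussian with mean 2 and sigma 0.5
grid = np.linspace(-2.0, 6.0, 8001)
pdf = np.exp(-0.5 * ((grid - 2.0) / 0.5) ** 2)

median = grid_quantiles(grid, pdf, 0.5)
lo, hi = grid_quantiles(grid, pdf, [0.025, 0.975])
print(f"median = {median:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

For a Gaussian the interval should be close to the mean $\pm 1.96\sigma$, which gives a quick sanity check before applying the same helper to a skewed posterior.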

In [None]:
# compute median and 95% CIs
# a central 95% interval spans the 2.5% and 97.5% quantiles of the posterior CDF
y2p_par_med = cents_d[np.argmin(np.abs(np.cumsum(y2p_par * y2 / sum(y2p_par * y2)) - 0.5))]
y2p_par_ci = [cents_d[np.argmin(np.abs(np.cumsum(y2p_par * y2 / sum(y2p_par * y2)) - ci))]
              for ci in [0.025, 0.975]]
y2p_dist_med = cents_d[np.argmin(np.abs(np.cumsum(y2p_dist * y2 / sum(y2p_dist * y2)) - 0.5))]
y2p_dist_ci = [cents_d[np.argmin(np.abs(np.cumsum(y2p_dist * y2 / sum(y2p_dist * y2)) - ci))]
               for ci in [0.025, 0.975]]
y3p_dist_med = cents_d[np.argmin(np.abs(np.cumsum(y3p_dist * y2 / sum(y3p_dist * y2)) - 0.5))]
y3p_dist_ci = [cents_d[np.argmin(np.abs(np.cumsum(y3p_dist * y2 / sum(y3p_dist * y2)) - ci))]
               for ci in [0.025, 0.975]]
print("Medians and 95% Credible Interval:")
print(f"Posterior 1: {y2p_par_med:.2f} from [{y2p_par_ci[0]:.2f},{y2p_par_ci[1]:.2f}] kpc")
print(f"Posterior 2: {y2p_dist_med:.2f} from [{y2p_dist_ci[0]:.2f},{y2p_dist_ci[1]:.2f}] kpc")
print(f"Posterior 3: {y3p_dist_med:.2f} from [{y3p_dist_ci[0]:.2f},{y3p_dist_ci[1]:.2f}] kpc")

# compute modes
y2p_par_mode = cents_d[np.argmax(y2p_par * y2)]
y2p_dist_mode = cents_d[np.argmax(y2p_dist * y2)]
y3p_dist_mode = cents_d[np.argmax(y3p_dist * y2)]
print(f"Modes: {y2p_par_mode:.2f}, {y2p_dist_mode:.2f}, {y3p_dist_mode:.2f} kpc, respectively")

In [None]:
# plot posteriors (distance)
plt.plot(cents_d, y2p_par * y2 / sum(y2p_par * y2), color='navy', lw=3)
plt.plot(cents_d, y2p_dist * y2 / sum(y2p_dist * y2), color='firebrick', lw=3)
plt.plot(cents_d, y3p_dist * y2 / sum(y3p_dist * y2), color='darkorange', lw=3)
plt.fill_between(cents_d, y2p_par * y2 / sum(y2p_par * y2), color='dodgerblue', alpha=0.5)
plt.fill_between(cents_d, y2p_dist * y2 / sum(y2p_dist * y2), color='indianred', alpha=0.5)
plt.fill_between(cents_d, y3p_dist * y2 / sum(y3p_dist * y2), color='sandybrown', alpha=0.5)

plt.fill_between(y2p_par_ci, 0, 1, color='navy', alpha=0.1)
plt.plot(y2p_par_med, np.interp(y2p_par_med, cents_d, y2p_par * y2 / sum(y2p_par * y2)), color='navy', marker='o', markersize=11)

plt.fill_between(y3p_dist_ci, 0, 1, color='sandybrown', alpha=0.1)
plt.plot(y3p_dist_med, np.interp(y3p_dist_med, cents_d, y3p_dist * y2 / sum(y3p_dist * y2)), color='sandybrown', marker='o', markersize=11)

plt.xlim([0, 6])
plt.xlabel('Distance [kpc]')
plt.ylim([0, 1e-2])
plt.ylabel('Probability [normalized]')
plt.yticks([])
plt.tight_layout()
plt.legend(['Uniform in Parallax', 'Uniform in Distance', 'Bailer-Jones'])

# Fitting the distance to a star cluster

This example of parallax fitting can be extended to infer the distance to a cluster of stars, based on the collection of parallax measurements of the individual stars. Our data are for the open cluster M67. Assuming there are $n$ stars located at approximately the same distance $d_{cluster}$ and that the measured parallaxes $y = \{y_1, y_2, \ldots, y_n\}$ of the stars are independent given $d_{cluster}$, the combined likelihood is just the product of the individual likelihoods:
$$
p(y_1, y_2,..., y_n| d_{cluster}^{-1}) = \prod^n_{i=1} p(y_i|d_{cluster}^{-1}).
$$
We assume that the measurement uncertainties $\sigma_i$ are known constants. The posterior distribution is given by:
$$
p(d_{cluster}|y) \propto p(d_{cluster}) \prod^n_{i=1} p(y_i|d_{cluster}^{-1}).
$$
Note that the product of $n$ independent Gaussian densities with known variances is also a Gaussian density.
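This property can be checked numerically. In the sketch below the product of the individual Gaussian likelihoods is evaluated by brute force on a parallax grid, and its mean and width are compared with the inverse-variance weighted mean and $(\sum_i \sigma_i^{-2})^{-1/2}$. The synthetic cluster (true parallax, uncertainties, number of stars) is hypothetical and is not the M67 data.

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic cluster: hypothetical true parallax and per-star uncertainties (mas)
true_parallax = 1.15
sigmas = rng.uniform(0.05, 0.3, size=50)
y = rng.normal(true_parallax, sigmas)

# closed form: inverse-variance weighted mean and its standard deviation
weights = 1.0 / sigmas**2
mean_closed = np.sum(weights * y) / np.sum(weights)
std_closed = np.sum(weights) ** -0.5

# brute force: sum the individual Gaussian log-likelihoods on a parallax grid
grid = np.linspace(0.9, 1.4, 5001)
log_like = -0.5 * np.sum(((y[:, None] - grid[None, :]) / sigmas[:, None]) ** 2, axis=0)
like = np.exp(log_like - log_like.max())
like /= like.sum()
mean_grid = np.sum(grid * like)
std_grid = np.sqrt(np.sum((grid - mean_grid) ** 2 * like))

print(f"closed form: {mean_closed:.4f} +/- {std_closed:.4f} mas")
print(f"grid:        {mean_grid:.4f} +/- {std_grid:.4f} mas")
```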

For this "fitting" we will use a different prior, called a "data-driven Gaussian" prior, which has a mean and covariance learned from a subset of the observed data. This is an example of an *informative prior*, which can improve computational efficiency and accuracy. Some downsides of using this prior are that it is considered subjective, is a poor model for non-Gaussian data, cannot handle outliers well, and may bias your results or cause you to overfit your data.

The reason we use a Gaussian prior is to avoid needing an MCMC sampler such as ``emcee``, since the posterior then has a closed form. The product of two Gaussian functions is also a Gaussian, with a new variance $\sigma_3^{2}$ given by $\sigma_3^{-2} = \sigma_1^{-2} + \sigma_2^{-2}$, and a new mean $\mu_3 = \sigma_3^{2}\big(\frac{\mu_1}{\sigma_1^{2}} + \frac{\mu_2}{\sigma_2^{2}} \big)$. To compute the mean and variance of the posterior Gaussian function we just need to cumulatively sum the terms in those relations for all the data points and our prior.
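A minimal sketch of this combination rule follows. The helper `combine_gaussians` and the numerical values are hypothetical. Each mean is weighted by its inverse variance, so the more precise of the two inputs pulls the result toward itself.

```python
import numpy as np

def combine_gaussians(mu1, sigma1, mu2, sigma2):
    """Mean and standard deviation of the (normalized) product of two Gaussians."""
    inv_var = sigma1**-2 + sigma2**-2
    mu = (mu1 * sigma1**-2 + mu2 * sigma2**-2) / inv_var
    return mu, inv_var**-0.5

# hypothetical prior and likelihood summaries in mas
mu_prior, sigma_prior = 1.20, 0.10
mu_like, sigma_like = 1.10, 0.05

mu_post, sigma_post = combine_gaussians(mu_prior, sigma_prior, mu_like, sigma_like)
print(f"posterior: {mu_post:.3f} +/- {sigma_post:.3f} mas")
```

With these values the inverse variances are 100 and 400 mas$^{-2}$, so the combined mean is $(120 + 440)/500 = 1.12$ mas and the combined standard deviation is $500^{-1/2} \approx 0.045$ mas, narrower than either input.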


In [None]:
# sort data in ascending signal-to-noise ratio
idx_asc = np.argsort(p/pe)

# shuffle order (unordered results)
#np.random.shuffle(idx_asc)

# apply membership probability cutoff
sel = idx_asc[pmem[idx_asc] > 0.3]

# define Gaussian prior from first 100 data points
# make array for all data points with that prior value for cumsum
num=100
prior_mean_value = p[sel[:num]].mean()
prior_std_value = p[sel[:num]].var()**0.5
prior_mean = np.full_like(p[sel], prior_mean_value)
prior_std = np.full_like(p[sel], prior_std_value)

# compute joint likelihood as we sequentially add more data
# note that p_inverse_var is the inverse of the variance of the parallax
p_inverse_var = np.cumsum(1. / pe[sel]**2)
pmean = np.cumsum(p[sel] / pe[sel]**2) / p_inverse_var

# compute joint posterior as we add more data
p_inverse_var_post = p_inverse_var + 1. / prior_std**2
pmean_post = (pmean * p_inverse_var + prior_mean / prior_std**2) / p_inverse_var_post

The first plot shows the individual stellar parallax measurements for all stars in the cluster. Note that they are ordered sequentially with ascending signal-to-noise ratio, meaning the least informative data point is considered first.

The second plot shows how the inferred parallax of the cluster changes as more data points are added. The mean values of the joint likelihood and posterior distributions of the cluster's parallax are shown with the standard deviation as an error bar. To the left, the mean value and standard deviation of the prior are also shown. With increasing data points the mean value quickly converges to a value nearly one standard deviation away from

In [None]:
# plot prior and measurements
plt.figure(figsize=(20, 8))
plt.errorbar(np.arange(1, len(sel) + 1), p[sel], yerr=pe[sel],
             linestyle='none', ecolor='gray', alpha=0.6, marker='.', color='black')
plt.xlim([-10, len(sel) + 10])
plt.xlabel('Number of objects')
plt.ylim([0.93, 1.38])
plt.ylabel('Measured parallax [mas]')
plt.tight_layout()

In [None]:
# plot prior, joint likelihood, and joint posterior
plt.figure(figsize=(20, 8))
plt.errorbar(-35, prior_mean[0], yerr=prior_std[0], linestyle='none', ecolor='dodgerblue', alpha=0.6, marker='o', markersize=10, lw=5, color='navy')
plt.errorbar(np.arange(1, len(sel) + 1), pmean, yerr=p_inverse_var**-0.5, ecolor='gray', alpha=0.6, color='black')
plt.errorbar(np.arange(1, len(sel) + 1), pmean_post, yerr=p_inverse_var_post**-0.5, ecolor='dodgerblue', alpha=0.6, color='navy')
plt.xlim([-70, len(sel) + 10])
plt.xlabel('Number of objects')
plt.ylim([0.93, 1.38])
plt.ylabel('Inferred Parallax [mas]')
plt.tight_layout()

# Questions

Please answer the following three questions in this week's Notebook.

1. Find the maximum a posteriori (MAP) estimate for the parallax towards star cluster M67. What is the associated 68% (1$\sigma$) credible interval for this value? How do these values translate to a distance in kpc to the cluster?

2. The *informative* Gaussian prior uses a subset of the data to infer the best-fit parameters of our data. Test whether the final parameter values are robust against the choice of data points for your Gaussian prior. Show your code and state your conclusions below.

3. The Bailer-Jones prior contained assumptions about the structure of our Galaxy. Explain why including this information results in large parallax distances being more likely?