Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model. 

\section{Probability and inference}
Bayesian methods: explicit use of probability

Bayesian data analysis proceeds in three steps:
\begin{enumerate}
\item Full probability model: joint probability for all observable and unobservable quantities
\item Calculate posterior distribution
\item Evaluate fit of the model
\end{enumerate}

Second step aided by computational methodology

Pragmatic advantages of the Bayesian framework, flexibility and generality

\subsection{General notation for statistical inference}
Statistical inference: drawing conclusions from numerical data about quantities that are not observed. 

Population, sample

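Two kinds of estimands: potentially observable quantities (such as future observations) and quantities that are not directly observable (parameters).

$\theta$ denotes unobservable vector quantities or population parameters, $y$ the observed data, and $\tilde{y}$ unknown but potentially observable quantities. In general these are multivariate quantities.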

Vectors are columns: $u^T u$ and $u u^T$

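Data set: $y = (y_1, \dots, y_n)$. The observations are usually assumed exchangeable: the joint density $p(y_1, \dots, y_n)$ is invariant to permutations of the indexes. We commonly model data from an exchangeable distribution as independent and identically distributed given some unknown parameter vector $\theta$.

Explanatory variables or covariates $x$: with $n$ units and $k$ explanatory variables, $X$ is an $n \times k$ matrix.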

Hierarchical modeling: also called multilevel models, information is available on several different levels of observational units. For example, suppose two medical treatments are applied, in separate randomized experiments, to patients in several different cities. Then, if no other information were available, it would be reasonable to treat the patients within each city as exchangeable and also treat the results from different cities as themselves exchangeable. In practice it would make sense to include, as explanatory variables at the city level, whatever relevant information we have on each city, as well as the explanatory variables mentioned before at the individual level, and then the conditional distributions given these explanatory variables would be exchangeable.

\subsection{Bayesian inference}
Bayesian statistical conclusions about a parameter $\theta$ or unobserved data $\tilde{y}$ are made in terms of probability statements: $p(\theta|y)$ and $p(\tilde{y} | y)$

Different notations: 
$\theta \sim N(\mu, \sigma^2)$, $p(\theta) = N(\theta | \mu, \sigma^2)$

\subsubsection{Bayes' rule}
We begin by providing a joint probability distribution for $\theta$ and $y$. The joint probability density function is the product of the sampling distribution and the prior distribution ($p(\theta, y) = p(\theta)p(y | \theta)$). Then the posterior distribution is:
\[
p(\theta | y) = \frac{p(\theta, y)}{p(y)} = \frac{p(\theta)p(y | \theta)}{p(y)}
\]
where $p(y) = \sum_{\theta}p(\theta)p(y |\theta)$ or $p(y) = \int p(\theta)p(y |\theta) d\theta$

Unnormalized posterior density:
\[
p(\theta | y) \propto p(\theta)p(y|\theta)
\]
\subsubsection{Prediction}
Before the data $y$ are considered, the distribution of the unknown but observable $y$ is
\[
p(y) = \int p(y, \theta) d\theta = \int p(\theta) p(y | \theta) d \theta
\]
This ($p(y)$) is called the marginal distribution of $y$ or prior predictive distribution. 

After the data $y$ have been observed, we can predict an unknown observable, $\tilde{y}$, from the same process.

Posterior predictive distribution: the distribution of $\tilde{y}$
\[
p(\tilde{y} | y) = \int p(\tilde{y} | \theta) p(\theta | y) d\theta
\]
\subsubsection{Likelihood}
The data $y$ affect the posterior inference only through $p(y | \theta)$, which is called the likelihood function. 

\subsubsection{Likelihood and odds ratios}
The ratio of the posterior density $p(\theta |y)$ evaluated at the points $\theta_1$ and $\theta_2$ under a given model is called the posterior odds for $\theta_1$ compared to $\theta_2$. 
\[
\frac{p(\theta_1|y)}{p(\theta_2|y)}
\]
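As a short worked step that follows directly from Bayes' rule, the normalizing constant $p(y)$ cancels in the ratio, so the posterior odds equal the prior odds multiplied by the likelihood ratio:
\[
\frac{p(\theta_1|y)}{p(\theta_2|y)} = \frac{p(\theta_1)p(y|\theta_1)/p(y)}{p(\theta_2)p(y|\theta_2)/p(y)} = \frac{p(\theta_1)}{p(\theta_2)} \cdot \frac{p(y|\theta_1)}{p(y|\theta_2)}
\]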
\subsection{Discrete example}
The woman is either a carrier of the gene $(\theta = 1)$ or not $(\theta = 0)$. The prior distribution for the unknown $\theta$ can be expressed simply as $\textrm{Pr}(\theta = 1) = \textrm{Pr}(\theta = 0) = 1/2$. 

The woman has two sons, neither of whom is affected. We generate the likelihood functions:
\[
\textrm{Pr}(y_1 = 0, y_2 = 0 | \theta = 1) = 0.5 \cdot 0.5 = 0.25
\]
\[
\textrm{Pr}(y_1 = 0, y_2 = 0 | \theta = 0) = 1 \cdot 1 = 1
\]
The posterior distribution:
\[
\textrm{Pr}(\theta = 1|y) = \frac{p(y|\theta = 1)\textrm{Pr}(\theta = 1)}{p(y|\theta = 1)\textrm{Pr}(\theta = 1) + p(y | \theta = 0)\textrm{Pr}(\theta = 0)}
\]
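The arithmetic of this example can be checked with a minimal R sketch. The variable names \texttt{prior}, \texttt{likelihood} and \texttt{posterior} are illustrative and not part of the original notes.
\begin{minted}[breaklines]{R}
# Prior: Pr(theta = 1) = Pr(theta = 0) = 1/2
prior <- c(carrier = 0.5, noncarrier = 0.5)

# Likelihood of two unaffected sons under each value of theta
likelihood <- c(carrier = 0.5 * 0.5, noncarrier = 1 * 1)

# Bayes' rule: prior times likelihood, divided by the normalizing constant
posterior <- prior * likelihood / sum(prior * likelihood)
posterior
\end{minted}
The posterior probability that the woman is a carrier is $0.125 / 0.625 = 0.2$, down from the prior value of $0.5$.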
\subsection{Probability as a measure of uncertainty}
Within the Bayesian paradigm, it is equally legitimate to discuss the probability of ‘rain tomorrow’ or of a Brazilian victory in the soccer World Cup as it is to discuss the probability that a coin toss will land heads.
 
Two justifications for numerical measure of uncertainty: symmetry or exchangeability argument, frequency argument. 
 
 \subsection{Some useful results from probability theory}
Probability background: manipulation of joint densities, the definition of simple moments, the transformation of variables, and methods of simulation

We generally represent joint distributions by their joint probability density functions. It is also often useful to factor a joint density as a product of marginal and conditional densities; for example, 
\[
 p(u, v, w) = p(u|v, w) p(v|w)p(w)
 \]

The expectation and variance are defined as follows:

\[
E(u) = \int up(u)du
\]

\[
var(u) = \int (u-E(u))^2p(u)du
\]
If the parameter is a vector, the covariance matrix is defined as 
\[
var(u) = \int (u-E(u))(u-E(u))^Tp(u)du
\]
where $u$ is considered a column vector. 

\subsubsection{Means and variances of conditional distributions}
It is often useful to express the mean and variance of a random variable $u$ in terms of the conditional mean and variance given some related quantity $v$. The mean of $u$ can be obtained by averaging the conditional mean over the marginal distribution of $v$, 
\[
E(u) = E(E(u|v))
\]
where the inner expectation averages over $u$, conditional on $v$, and the outer expectation averages over $v$. 
\[
E(u) = \int \int up(u, v)dudv = \int \int u p(u | v) du p(v) dv = \int E(u|v) p(v) dv
\]The corresponding result for the variance includes two terms, the mean of the conditional variance and the variance of the conditional mean:
\[
var(u) = E(var(u|v)) + var(E(u|v))
\]\subsubsection{Transformation of variables}
Suppose $p_u(u)$ is the density of the vector $u$, and we transform to $v=f(u)$, where $v$ has the same number of components as $u$. 

If $p_u$ is a discrete distribution, and $f$ is a one-to-one function, then the density of $v$ is given by
\[
p_v(v) = p_u(f^{-1}(v))
\]
If $p_u$ is a continuous distribution, and $v=f(u)$ is a one-to-one transformation, then the joint density of the transformed vector is 
\[
p_v(v) = |J|p_u(f^{-1}(v))
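\]
where $|J|$ is the absolute value of the determinant of the Jacobian of the transformation $u = f^{-1}(v)$ as a function of $v$.

\subsection{Bayesian inference in applied statistics}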
Rationale for the use of Bayesian methods: incorporation of multiple levels of randomness and the resultant ability to combine information from different sources, while incorporating all reasonable sources of uncertainty in inferential summaries. 

\subsection{Code}
\subsubsection{Sample prior distribution and visualize density plot}
\begin{minted}[breaklines]{R}
# Load ggplot2 for the plots in this section
library(ggplot2)

# Sample 10000 draws from Beta(45,55) prior
prior_A <- rbeta(n = 10000, shape1 = 45, shape2 = 55)

# Store the results in a data frame
prior_sim <- data.frame(prior_A)

# Construct a density plot of the prior sample
ggplot(prior_sim, aes(x = prior_A)) + 
    geom_density()
\end{minted}

\subsubsection{Sample three different prior distributions and visualize as separate  density functions in the same plot}
\begin{minted}[breaklines]{R}
# Sample 10000 draws from the Beta(1,1) prior
prior_B <- rbeta(n = 10000, shape1 = 1, shape2 = 1)    

# Sample 10000 draws from the Beta(100,100) prior
prior_C <- rbeta(n = 10000, shape1 = 100, shape2 = 100)

# Combine the results in a single data frame
prior_sim <- data.frame(samples = c(prior_A, prior_B, prior_C),
        priors = rep(c("A","B","C"), each = 10000))

# Plot the 3 priors
ggplot(prior_sim, aes(x = samples, fill = priors)) + 
    geom_density(alpha = 0.5)
\end{minted}

\subsubsection{Likelihood simulation}
Here we define a discrete array of parameter values and simulate a sample from the likelihood function using each parameter value. 

\begin{minted}{R}
# geom_density_ridges() is provided by the ggridges package
library(ggplot2)
library(ggridges)

# Define a vector of 1000 p values
p_grid <- seq(from = 0, to = 1, length.out = 1000)

# Simulate 1 poll result for each p in p_grid   
poll_result <- rbinom(n=1000, size=10, prob=p_grid)
poll_result
# Create likelihood_sim data frame
likelihood_sim <- data.frame(p_grid, poll_result)    

# Density plots of p_grid grouped by poll_result
ggplot(likelihood_sim, aes(x = p_grid, y = poll_result, group = poll_result)) + 
    geom_density_ridges()
\end{minted}

\subsubsection{Approximating likelihood}
\begin{minted}{R}
# Density plots of p_grid grouped by poll_result
ggplot(likelihood_sim, aes(x = p_grid, y = poll_result, group = poll_result, fill = poll_result == 6)) + 
    geom_density_ridges()
\end{minted}

\section{Single-parameter models}
Statistical models where only a single scalar parameter is to be estimated. 

\subsection{Estimating a probability from binomial data}
Results of Bernoulli trials: that is, data $y_1, \dots, y_n$, each of which is either 0 or 1. 

The binomial distribution provides a natural model for data that arise from a sequence of $n$ exchangeable trials or draws from a large population where each trial gives rise to one of two possible outcomes, conventionally labeled "success" and "failure". We let $y$ denote the total number of successes in the $n$ trials. The binomial sampling model is
\[
p(y|\theta) = \textrm{Bin}(y|n, \theta) = {n \choose y} \theta^y (1-\theta)^{n-y}
\]To perform Bayesian inference in the binomial model, we must specify a prior distribution for $\theta$. For simplicity, we assume that the prior distribution for $\theta$ is uniform on the interval $[0,1]$. 

Application of Bayes' rule gives the posterior density for $\theta$ as
\[
p(\theta | y) \propto \theta^y(1-\theta)^{n-y}
\]
The factor ${n \choose y}$ can be treated as a constant and left out. We can recognize the posterior as the unnormalized form of the beta distribution
\[
\theta | y \sim \textrm{Beta}(y + 1, n - y + 1)
\]\subsection{Posterior as compromise between data and prior information}
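We might expect that, because the posterior distribution incorporates the information from the data, it will be less variable than the prior distribution. This notion is formalized by applying the conditional mean and variance identities to $\theta$ and $y$:
\[
E(\theta) = E(E(\theta|y))
\]
\[
\textrm{var}(\theta) = E(\textrm{var}(\theta|y)) + \textrm{var}(E(\theta|y))
\]
The first identity says that the prior mean of $\theta$ is the average of all possible posterior means over the distribution of possible data. The variance formula is more interesting because it says that the posterior variance is on average smaller than the prior variance, by an amount that depends on the variation in posterior means over the distribution of possible data. 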

\subsection{Summarizing posterior inference}
Ideally one might report the entire posterior distribution $p(\theta|y)$. For multiparameter problems we use contour plots and scatterplots. 

Mean, median, mode, standard deviation, the interquartile range, and other quantiles. 

\subsection{Posterior quantiles and intervals}
Central posterior interval: for a $100(1-\alpha)\%$ interval, the range of values above and below which lies exactly $100(\alpha/2)\%$ of the posterior probability.
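For the binomial model with a uniform prior, a central posterior interval can be read off from the quantiles of the $\textrm{Beta}(y + 1, n - y + 1)$ posterior. The sketch below assumes hypothetical data ($y = 7$ successes in $n = 10$ trials) chosen only for illustration.
\begin{minted}[breaklines]{R}
# Hypothetical data: y successes in n Bernoulli trials
y <- 7
n <- 10

# Central 95% posterior interval: alpha / 2 of the posterior mass in each tail
alpha <- 0.05
interval <- qbeta(c(alpha / 2, 1 - alpha / 2),
                  shape1 = y + 1, shape2 = n - y + 1)

# Posterior mean of Beta(y + 1, n - y + 1)
post_mean <- (y + 1) / (n + 2)

interval
post_mean
\end{minted}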

\subsection{Informative prior distribution}
In the binomial example, we have so far considered only the uniform prior distribution for $\theta$. 

We consider two basic interpretations that can be given to prior distributions. In the population interpretation, the prior distribution represents a population of possible parameter values, from which the $\theta$ of current interest has been drawn. In the more subjective state of knowledge interpretation, we express our knowledge and uncertainty about $\theta$ as if its value could be thought of as a random realization from the prior distribution. 

\subsubsection{Binomial example with different prior distributions}
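Considered as a function of $\theta$, the binomial likelihood is of the form $\theta^a(1-\theta)^b$. If the prior density is of the same form, with its own values $a$ and $b$, then the posterior density will also be of this form. We parameterize such a prior density as a beta distribution, $\theta \sim \textrm{Beta}(\alpha, \beta)$: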
\[
p(\theta) \propto \theta^{\alpha - 1}(1 - \theta)^{\beta - 1}
\]
Hyperparameters: the parameters of the prior distribution

The posterior density for $\theta$ is 
\[
p(\theta | y) \propto \theta^y (1 - \theta)^{n-y}\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}
\]
\[
= \theta^{y + \alpha - 1}(1 - \theta)^{n - y + \beta -1}
\]
Thus $\theta | y \sim \textrm{Beta}(\alpha + y, \beta + n - y)$.

Conjugacy: the posterior distribution follows the same parametric form as the prior distribution.

In the limit as $n \to \infty$ with fixed $\alpha$ and $\beta$, the parameters of the prior distribution have no influence on the posterior distribution. The central limit theorem of probability theory can be put in a Bayesian context to show:
\[
\Bigg( \frac{\theta - E(\theta | y)}{\sqrt{\textrm{var}(\theta | y)}} \Bigg| y \Bigg) \to N(0,1)
\]
This result is often used to justify approximating the posterior distribution with a normal distribution. 
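A minimal R sketch of the conjugate update and of the normal approximation, assuming a hypothetical $\textrm{Beta}(2, 2)$ prior and hypothetical data of 60 successes in 100 trials (all names and numbers below are illustrative):
\begin{minted}[breaklines]{R}
# Hypothetical prior hyperparameters and data
a_prior <- 2
b_prior <- 2
y <- 60
n <- 100

# Conjugate update: Beta(alpha + y, beta + n - y)
a_post <- a_prior + y
b_post <- b_prior + n - y

# Mean and variance of the beta posterior
post_mean <- a_post / (a_post + b_post)
post_var <- a_post * b_post /
    ((a_post + b_post)^2 * (a_post + b_post + 1))

# Compare an exact posterior quantile with its normal approximation
exact_q <- qbeta(0.975, a_post, b_post)
approx_q <- qnorm(0.975, mean = post_mean, sd = sqrt(post_var))
c(exact = exact_q, normal_approx = approx_q)
\end{minted}
With this much data the two quantiles should be close; with small $n$ or a posterior concentrated near 0 or 1 the approximation is expected to be worse.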

\subsubsection{Conjugate prior distributions}
Formal definition of conjugacy: if $F$ is a class of sampling distributions $p(y|\theta)$, and $P$ is a class of prior distributions for $\theta$, then the class $P$ is conjugate for $F$ if
\[
p(\theta | y) \in P \textrm{ for all } p(\cdot|\theta) \in F \textrm{ and } p(\cdot) \in P
\]
\subsubsection{Nonconjugate prior distributions}
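The basic justification for the use of conjugate prior distributions is similar to that for using standard models: it is easy to understand the results, which can often be put in analytic form, and computations are simplified. Nonconjugate prior distributions pose no new conceptual problems, although they can make posterior inferences less transparent and computation more difficult. 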

\subsubsection{Conjugate prior distributions, exponential families and sufficient statistics}
Probability distributions that belong to an exponential family have natural conjugate prior distributions. 

The class $F$ is an exponential family if all its members have the form
\[
p(y_i | \theta) = f(y_i)g(\theta)e^{\phi(\theta)^Tu(y_i)}
\]\subsection{Normal distribution with known variance}
The normal distribution is fundamental to most statistical modelling. We derive results first for a single data point and then for the general case of many data points.

\subsubsection{Likelihood of one data point}
The sampling distribution is
\[
p(y|\theta) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2}(y-\theta)^2}
\]
\subsubsection{Conjugate prior and posterior distributions}
Considered as a function of $\theta$, the likelihood is an exponential of a quadratic form in $\theta$, so the family of conjugate prior densities looks like
\[
p(\theta) = e^{A\theta^2 + B\theta + C}
\]
We parameterize this family as 
\[
p(\theta) \propto \exp\Bigg( -\frac{1}{2\tau_0^2}(\theta - \mu_0)^2\Bigg)
\]that is $\theta \sim N(\mu_0, \tau_0^2)$ with hyperparameters $\mu_0$ and $\tau_0^2$. 

The posterior is then:
\[
p(\theta | y) \propto \exp\Bigg( - \frac{1}{2}\Bigg(\frac{(y - \theta)^2}{\sigma^2} + \frac{(\theta - \mu_0)^2}{\tau_0^2}\Bigg) \Bigg)
\]which can be simplified into 
\[
p(\theta | y) \propto \exp\Bigg( -\frac{1}{2\tau_1^2}(\theta - \mu_1)^2\Bigg)
\]that is, $\theta | y \sim N(\mu_1, \tau_1^2)$, where 
\[
\mu_1 = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{1}{\sigma^2}y}{\frac{1}{\tau_0^2} + \frac{1}{\sigma^2}} \textrm{ and } \frac{1}{\tau_1^2} = \frac{1}{\tau_0^2} + \frac{1}{\sigma^2}
\]\subsubsection{Precision of the prior and posterior distributions}
Precision: the inverse of the variance

The posterior precision equals the prior precision plus the data precision

There are several different ways of interpreting the form of the posterior mean, $\mu_1$. The posterior mean is expressed as a weighted average of the prior mean and the observed value $y$, with weights proportional to the precisions. 

\subsubsection{Posterior predictive distribution}
The posterior predictive distribution of a future observation, $\tilde{y}, p(\tilde{y}|y)$, can be calculated directly by integration
\[
p(\tilde{y}|y) = \int p(\tilde{y}|\theta) p(\theta |y) d \theta
\]
\[
\propto \int \exp \Bigg( -\frac{1}{2\sigma^2} (\tilde{y} - \theta)^2 \Bigg) \exp \Bigg(-\frac{1}{2\tau_1^2}(\theta - \mu_1)^2 \Bigg) d\theta
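\]
Completing the square in $\theta$ and integrating shows that $\tilde{y} | y \sim N(\mu_1, \sigma^2 + \tau_1^2)$: the predictive variance is the sum of the sampling variance $\sigma^2$ and the posterior variance $\tau_1^2$.

\subsubsection{Normal model with multiple observations}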
The normal model with a single observation can be easily extended to the more realistic situation where a sample of independent and identically distributed observations $y = (y_1, \dots, y_n)$ is available. The posterior density is then
\[
p(\theta | y) \propto p(\theta) p(y | \theta)
\]
\[
= p(\theta) \prod_{i=1}^n p(y_i | \theta) 
\]
\[
\propto \exp\Bigg(-\frac{1}{2\tau_0^2} (\theta - \mu_0)^2 \Bigg) \prod_{i=1}^n \exp \Bigg(- \frac{1}{2\sigma^2}(y_i - \theta)^2 \Bigg)
\]
\[
= \exp \Bigg( -\frac{1}{2} \Bigg(\frac{1}{\tau_0^2}(\theta - \mu_0)^2 + \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - \theta)^2 \Bigg) \Bigg)
\]Simplification of the expression reveals that the posterior distribution depends on $y$ only through the sample mean, i.e. $\overline{y}$ is a sufficient statistic in this model. Furthermore, we get that
\[
p(\theta | y_1, \dots, y_n) = p(\theta | \overline{y}) = N(\theta | \mu_n, \tau_n^2)
\]
where 
\[
\mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\overline{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}} \textrm{ and } \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}
\]
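The update for the normal model with known variance can be sketched in R as follows. The function name \texttt{normal\_posterior}, its arguments and the data are hypothetical and chosen only for illustration; with a single observation the function reduces to the formulas for $\mu_1$ and $\tau_1^2$.
\begin{minted}[breaklines]{R}
# Posterior for the mean of a normal model with known sd sigma
# and a N(mu0, tau0^2) prior on theta
normal_posterior <- function(y, sigma, mu0, tau0) {
    n <- length(y)
    prior_prec <- 1 / tau0^2        # prior precision
    data_prec <- n / sigma^2        # precision of the sample mean
    post_prec <- prior_prec + data_prec
    # Precision-weighted average of the prior mean and the sample mean
    mu_n <- (prior_prec * mu0 + data_prec * mean(y)) / post_prec
    list(mean = mu_n, sd = sqrt(1 / post_prec))
}

# Hypothetical data and a diffuse prior
y <- c(9.8, 10.4, 10.1, 9.5, 10.7)
post <- normal_posterior(y, sigma = 1, mu0 = 0, tau0 = 10)
post
\end{minted}
Because the prior precision is small relative to the data precision here, the posterior mean lies close to the sample mean.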
\subsection{Other standard single-parameter models}
In general, the posterior density has no closed-form expression; the normalizing constant is often especially difficult to compute due to the integral. Much formal Bayesian analysis concentrates on situations where closed forms are available; such models are sometimes unrealistic, but they provide a useful starting point. 

Standard distributions: binomial, normal, Poisson, exponential

Each of these standard models has an associated family of conjugate prior distributions, which we discuss in turn.

\subsubsection{Normal distribution with known mean but unknown variance}
Useful building block for more complicated models. The likelihood for a vector $y$ of $n$ independent and identically distributed observations is
\[
p(y | \sigma^2) \propto \sigma^{-n} \exp \Bigg( - \frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - \theta)^2 \Bigg)
\]
\[
= (\sigma^2)^{-n/2} \exp\Bigg( -\frac{n}{2\sigma^2}v \Bigg)
\]
The sufficient statistic is 
\[
v = \frac{1}{n}\sum_{i=1}^n (y_i - \theta)^2
\]The corresponding conjugate prior density is the inverse-gamma
\[
p(\sigma^2) \propto (\sigma^2)^{-(\alpha + 1)}e^{-\beta/\sigma^2}
\]
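which has hyperparameters $(\alpha, \beta)$. A convenient parameterization is the scaled inverse-$\chi^2$ distribution with scale $\sigma_0^2$ and $v_0$ degrees of freedom, $\sigma^2 \sim \textrm{Inv-}\chi^2(v_0, \sigma_0^2)$, which corresponds to $\alpha = v_0/2$ and $\beta = v_0\sigma_0^2/2$. 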

The resulting posterior density for $\sigma^2$ is
\[
p(\sigma^2 | y) \propto p(\sigma^2)p(y | \sigma^2)
\]
\[
\propto \Bigg( \frac{\sigma_0^2}{\sigma^2} \Bigg)^{v_0/2+1} \exp \Bigg( -\frac{v_0\sigma_0^2}{2\sigma^2} \Bigg) (\sigma^2)^{-n/2} \exp \Bigg( - \frac{n}{2}\frac{v}{\sigma^2} \Bigg)
\]\[
\propto (\sigma^2)^{-((n+v_0)/2+1)} \exp \Bigg( -\frac{1}{2\sigma^2} (v_0 \sigma_0^2 + nv) \Bigg)
\]Thus 
\[
\sigma^2 | y \sim \textrm{Inv-}\chi^2 \Bigg(v_0 + n, \frac{v_0\sigma_0^2 + nv}{v_0 + n} \Bigg)
\]\subsubsection{Poisson model}
The Poisson distribution arises naturally in the study of data taking the form of counts:

If a data point $y$ follows the Poisson distribution with rate $\theta$, then the probability distribution of a single observation $y$ is 
\[
p(y|\theta) = \frac{\theta^y e^{-\theta}}{y!}\textrm{ for } y = 0, 1, 2, \dots
\]
and for a vector $y = (y_1, \dots, y_n)$ of independent and identically distributed observations, the likelihood is 
\[
p(y|\theta) = \prod_{i=1}^n \frac{1}{y_i!}\theta^{y_i}e^{-\theta}
\]
\[
\propto \theta^{t(y)}e^{-n\theta}
\]
where $t(y) = \sum_{i=1}^n y_i$ is the sufficient statistic. We can rewrite the likelihood in exponential family form as 
\[
p(y|\theta) \propto e^{-n\theta}e^{t(y)\log\theta}
\]
Then the natural conjugate prior distribution  is
\[
p(\theta) \propto (e^{-\theta})^{\eta}e^{v\log{\theta}}
\]
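In the more conventional parameterization, $p(\theta) \propto \theta^{\alpha - 1}e^{-\beta\theta}$, this is a gamma density, $\theta \sim \textrm{Gamma}(\alpha, \beta)$. With this conjugate prior distribution, the posterior distribution is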
\[
\theta | y \sim \textrm{Gamma}(\alpha + n\overline{y}, \beta + n)
\]
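A minimal R sketch of this update, assuming a hypothetical $\textrm{Gamma}(2, 1)$ prior (shape $\alpha$, rate $\beta$) and hypothetical counts:
\begin{minted}[breaklines]{R}
# Hypothetical Gamma(alpha, beta) prior, with alpha the shape and beta the rate
alpha <- 2
beta <- 1

# Hypothetical Poisson counts
y <- c(3, 5, 2, 4, 6)

# Conjugate update: Gamma(alpha + n * ybar, beta + n)
alpha_post <- alpha + sum(y)
beta_post <- beta + length(y)

# Posterior mean and central 95% posterior interval for the rate theta
post_mean <- alpha_post / beta_post
interval <- qgamma(c(0.025, 0.975), shape = alpha_post, rate = beta_post)
post_mean
interval
\end{minted}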
\subsubsection{The negative binomial distribution}
The negative binomial distribution is a mixture of Poisson distributions with rates, $\theta$, that follow the gamma distribution:
\[
\textrm{Neg-bin}(y|\alpha, \beta) = \int \textrm{Poisson}(y|\theta) \textrm{Gamma}(\theta|\alpha, \beta) d\theta
\]
\subsubsection{Exponential model}
The sampling distribution of an outcome $y$, given parameter $\theta$, is
\[
p(y|\theta) = \theta \exp(-y\theta)\textrm{ for }y > 0
\]
The sampling distribution of $n$ independent exponential observations $y=(y_1, \dots, y_n)$ with constant rate $\theta$ is
\[
p(y|\theta) = \theta^n \exp(-n\overline{y}\theta)
\]
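As a short worked step, viewed as a function of $\theta$ this likelihood has the form of a gamma density, so the gamma family is conjugate here as well. Assuming a $\textrm{Gamma}(\alpha, \beta)$ prior with shape $\alpha$ and rate $\beta$ (hyperparameters introduced only for this sketch),
\[
p(\theta | y) \propto \theta^{\alpha - 1}e^{-\beta\theta} \cdot \theta^n e^{-n\overline{y}\theta} = \theta^{\alpha + n - 1}e^{-(\beta + n\overline{y})\theta}
\]
that is, $\theta | y \sim \textrm{Gamma}(\alpha + n, \beta + n\overline{y})$.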
\subsection{Non-informative prior distributions}
When prior distributions have no population basis, they can be difficult to construct, and there has long been a desire for prior distributions that can be guaranteed to play a minimal role in the posterior distribution. 

\subsubsection{Proper and improper prior distributions}
Proper prior: the prior density $p(\theta)$ does not depend on data and integrates to 1. 

Posterior distributions obtained from improper prior distributions must be interpreted with great care: one must always check that the posterior distribution has a finite integral and a sensible form.

\subsubsection{Jeffreys' invariance principle}
Jeffreys' approach to defining noninformative prior distributions is based on considering one-to-one transformations of the parameter, $\phi = h(\theta)$. Jeffreys' general principle is that any rule for determining the prior density $p(\theta)$ should yield an equivalent result if applied to the transformed parameter. 

Jeffreys' principle leads to defining the noninformative prior density as $p(\theta) \propto [J(\theta)]^{1/2}$, where $J(\theta)$ is the Fisher information for $\theta$: 
\[
J(\theta) = E\Bigg( \Bigg( \frac{d\log p(y|\theta)}{d\theta} \Bigg)^2 \Bigg| \theta \Bigg) = -E\Bigg(\frac{d^2 \log p(y|\theta)}{d\theta^2} \Bigg| \theta \Bigg)
\]
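As a worked example, consider the binomial model, for which $\log p(y|\theta) = \textrm{constant} + y\log\theta + (n - y)\log(1 - \theta)$. Using $E(y|\theta) = n\theta$,
\[
J(\theta) = -E\Bigg( -\frac{y}{\theta^2} - \frac{n - y}{(1 - \theta)^2} \Bigg| \theta \Bigg) = \frac{n}{\theta} + \frac{n}{1 - \theta} = \frac{n}{\theta(1 - \theta)}
\]
so Jeffreys' prior density is $p(\theta) \propto \theta^{-1/2}(1 - \theta)^{-1/2}$, which is the $\textrm{Beta}(\frac{1}{2}, \frac{1}{2})$ density.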
\subsubsection{Pivotal quantities}
If the density of $y$ is such that $p(y-\theta | \theta)$ is a function that is free of $\theta$ and $y$, say, $f(u)$, where $u = y - \theta$, then $y - \theta$ is a pivotal quantity, and $\theta$ is called a pure location parameter. 

If the density of $y$ is such that $p(\frac{y}{\theta} | \theta)$ is a function that is free of $\theta$ and $y$, say, $g(u)$, where $u = \frac{y}{\theta}$, then $u = \frac{y}{\theta}$ is a pivotal quantity and $\theta$ is called a pure scale parameter. 

\subsubsection{Difficulties with noninformative prior distributions}
The search for noninformative priors has several problems, including:

\begin{enumerate}
\item Establishing a particular specification as the reference prior distribution seems to encourage its automatic, and possibly inappropriate, use. 
\item For many problems, there is no clear choice for a vague prior distribution, since a density that is flat or uniform in one parameterization will not be in another. 
\item Further difficulties arise when averaging over a set of competing models that have improper prior distributions.
\end{enumerate}

\subsection{Weakly informative prior distributions}
Weakly informative prior: proper prior but it is set up so that the information it does provide is intentionally weaker than whatever actual prior knowledge is available. 

Weakly informative priors can be constructed from two different directions:
\begin{enumerate}
\item Start with some version of a noninformative prior distribution and then add enough information so that inferences are constrained to be reasonable.
\item Start with a strong, highly informative prior and broaden it to account for uncertainty in one's prior beliefs and in the applicability of any historically based prior distribution to new data. 
\end{enumerate}

\section{Multiparameter models}
Virtually every practical problem in statistics involves more than one unknown or unobservable quantity. 

Conclusions will often be drawn about one, or only a few, parameters at a time. In this case, the ultimate aim of a Bayesian analysis is to obtain the marginal posterior distribution of the parameters of interest. We first require the joint posterior distribution of all unknowns, and then we integrate the distribution over the unknowns that are not of immediate interest to obtain the desired marginal distribution. 

Nuisance parameters: parameters that are needed to build a realistic model but about which we have no interest in making inferences.

\subsection{Averaging over 'nuisance parameters'}
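To express the ideas of joint and marginal posterior distributions mathematically, suppose $\theta$ has two parts, each of which can be a vector, $\theta = (\theta_1, \theta_2)$, and further suppose that we are only interested in inference for $\theta_1$, so $\theta_2$ may be considered a 'nuisance' parameter. For example, in the normal model with both mean and variance unknown, interest often centers on $\mu$ (playing the role of $\theta_1$), with $\sigma^2$ (playing the role of $\theta_2$) a nuisance parameter: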
\[
y | \mu, \sigma^2 \sim N(\mu, \sigma^2) 
\]
We seek the conditional distribution of the parameter of interest given the observed data; in this case $p(\theta_1|y)$. This is derived from the joint posterior density
\[
p(\theta_1, \theta_2|y) \propto p(y|\theta_1, \theta_2)p(\theta_1, \theta_2)
\]
by averaging over $\theta_2$: 
\[
p(\theta_1 | y) = \int p(\theta_1, \theta_2 |y) d\theta_2
\]
which can be factored to yield
\[
p(\theta_1 | y) = \int p(\theta_1|\theta_2, y) p(\theta_2|y) d\theta_2
\]
Posterior distributions can be computed by marginal and conditional simulation, first drawing $\theta_2$ from its marginal posterior distribution and then $\theta_1$ from its conditional posterior distribution, given the drawn value $\theta_2$. 

\subsection{Normal data with a noninformative prior distribution}
A vector $y$ of $n$ independent observations from a univariate normal distribution, $N(\mu, \sigma^2)$

\subsubsection{A noninformative prior distribution}
A vague prior density for $\mu$ and $\sigma$ is uniform on $(\mu, \log\sigma)$, or equivalently,
\[
p(\mu, \sigma^2) \propto (\sigma^2)^{-1}
\]
\subsubsection{The joint posterior distribution}
Under this prior density, the joint posterior distribution is
\[
p(\mu, \sigma^2 | y) \propto \sigma^{-n-2}\exp \Bigg(-\frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2 \Bigg)
\]
\[
= \sigma^{-n-2}\exp\Bigg(-\frac{1}{2\sigma^2}\Bigg[\sum_{i=1}^n(y_i - \overline{y})^2 + n(\overline{y} - \mu)^2 \Bigg] \Bigg)
\]\[
= \sigma^{-n-2} \exp \Bigg(-\frac{1}{2\sigma^2} [(n-1)s^2 + n(\overline{y}-\mu)^2]\Bigg)
\]
where 
\[
s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \overline{y})^2
\]
\subsubsection{The conditional posterior distribution, $p(\mu|\sigma^2, y)$ }
To determine the posterior distribution of $\mu$, given $\sigma^2$, we simply use the result derived earlier for the mean of a normal distribution with known variance and a uniform prior distribution:
\[
\mu | \sigma^2, y \sim N(\overline{y}, \sigma^2/n)
\]
\subsubsection{The marginal posterior distribution, $p(\sigma^2|y)$}
We average the joint distribution over $\mu$: 
\[
p(\sigma^2 | y) \propto \int \sigma^{-n-2}\exp \Bigg(-\frac{1}{2\sigma^2}[(n-1)s^2 + n(\overline{y} - \mu)^2] \Bigg) d\mu
\]
\[
p(\sigma^2 | y) \propto \sigma^{-n-2} \exp \Bigg( - \frac{1}{2\sigma^2}(n-1)s^2 \Bigg) \sqrt{2 \pi \sigma^2/n}
\]
\[
\propto (\sigma^2)^{-(n+1)/2} \exp \Bigg( -\frac{(n-1)s^2}{2\sigma^2} \Bigg)
\]
which is a scaled inverse-$\chi^2$ density:
\[
\sigma^2|y \sim \textrm{Inv-}\chi^2(n - 1, s^2)
\]
\subsubsection{Sampling from the joint posterior distribution}
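Draws from the joint posterior distribution are obtained in two steps: first draw $\sigma^2$ from its marginal posterior distribution $p(\sigma^2|y)$, then draw $\mu$ from its conditional posterior distribution $p(\mu|\sigma^2, y)$.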
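The two-step simulation can be sketched in R as follows, assuming a small hypothetical data vector. A draw from the scaled inverse-$\chi^2$ distribution $\textrm{Inv-}\chi^2(n - 1, s^2)$ is obtained as $(n - 1)s^2 / X$ with $X \sim \chi^2_{n-1}$.
\begin{minted}[breaklines]{R}
set.seed(1)

# Hypothetical data
y <- c(9.8, 10.4, 10.1, 9.5, 10.7, 10.0, 9.9, 10.3)
n <- length(y)
ybar <- mean(y)
s2 <- var(y)          # sample variance with divisor n - 1

n_draws <- 5000

# Step 1: sigma^2 | y ~ Inv-chi^2(n - 1, s^2)
sigma2 <- (n - 1) * s2 / rchisq(n_draws, df = n - 1)

# Step 2: mu | sigma^2, y ~ N(ybar, sigma^2 / n)
mu <- rnorm(n_draws, mean = ybar, sd = sqrt(sigma2 / n))

# Posterior summaries from the simulation draws
quantile(mu, c(0.025, 0.5, 0.975))
\end{minted}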

\subsubsection{Analytic form of the marginal posterior distribution of $\mu$ }
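In a few cases the marginal posterior distribution can be derived analytically. Here we obtain the marginal posterior density for $\mu$ by integrating the joint posterior density over $\sigma^2$: 
\[
p(\mu | y) = \int_{0}^{\infty}p(\mu, \sigma^2 | y)d\sigma^2
\]
Substituting $z = \frac{A}{2\sigma^2}$, where $A = (n-1)s^2 + n(\mu - \overline{y})^2$, turns this into an unnormalized gamma integral:
\[
p(\mu | y) \propto A^{-n/2}\int_{0}^{\infty}z^{(n-2)/2}\exp(-z)dz
\]
\[
\propto [(n-1)s^2 + n(\mu - \overline{y})^2]^{-n/2}
\]
\[
\propto \Bigg[1 + \frac{n(\mu - \overline{y})^2}{(n-1)s^2} \Bigg]^{-n/2}
\]
which is the $t_{n-1}(\overline{y}, s^2/n)$ density.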
\subsubsection{Posterior predictive distribution for a future observation}
The posterior predictive distribution for a future observation, $\tilde{y}$, can be written as a mixture
\[
p(\tilde{y}|y) = \iint p(\tilde{y}|\mu, \sigma^2, y)p(\mu, \sigma^2 |y) d\mu d\sigma^2
\]

To draw from the posterior predictive distribution, first draw $\mu, \sigma^2$ from their joint posterior distribution and then simulate $\tilde{y} \sim N(\mu, \sigma^2)$.
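A minimal R sketch of this simulation, again assuming a small hypothetical data vector and the noninformative prior of this section:
\begin{minted}[breaklines]{R}
set.seed(2)

# Hypothetical data
y <- c(9.8, 10.4, 10.1, 9.5, 10.7, 10.0, 9.9, 10.3)
n <- length(y)
n_draws <- 5000

# Draw (mu, sigma^2) from the joint posterior distribution
sigma2 <- (n - 1) * var(y) / rchisq(n_draws, df = n - 1)
mu <- rnorm(n_draws, mean = mean(y), sd = sqrt(sigma2 / n))

# One future observation per posterior draw
y_tilde <- rnorm(n_draws, mean = mu, sd = sqrt(sigma2))

# Central 95% posterior predictive interval
pred_interval <- quantile(y_tilde, c(0.025, 0.975))
pred_interval
\end{minted}
The predictive draws are more spread out than the draws of $\mu$, since they combine posterior uncertainty about $(\mu, \sigma^2)$ with sampling variability.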

\subsection{Normal data with a conjugate prior distribution}
\subsubsection{A family of conjugate prior distributions}
A convenient parameterization is given by the following specification:
\[
\mu | \sigma^2 \sim N(\mu_0, \sigma^2/\kappa_0)
\]
\[
\sigma^2 \sim \textrm{Inv-}\chi^2(v_0, \sigma_0^2)
\]which corresponds to the joint prior density
\[
p(\mu, \sigma^2) \propto \sigma^{-1}(\sigma^2)^{-(v_0/2+1)}\exp\Bigg( -\frac{1}{2\sigma^2}[v_0\sigma_0^2 + \kappa_0(\mu_0 - \mu)^2]\Bigg)
\]
\subsubsection{The joint posterior distribution, $p(\mu, \sigma^2|y)$}
Multiplying the prior density by the normal likelihood yields the posterior density
\[
p(\mu, \sigma^2|y) \propto \sigma^{-1}(\sigma^2)^{-(v_0/2+1)}\exp \Bigg( -\frac{1}{2\sigma^2}[v_0\sigma_0^2 + \kappa_0(\mu - \mu_0)^2] \Bigg)
\]
\[
\cdot (\sigma^2)^{-n/2} \exp \Bigg( -\frac{1}{2\sigma^2}[(n-1)s^2 + n(\overline{y} - \mu)^2] \Bigg)
\]
\[
= \textrm{N-Inv-}\chi^2(\mu, \sigma^2 | \mu_n, \sigma_n^2/\kappa_n; v_n, \sigma_n^2)
\]
where after some algebra it can be shown that
\[
\mu_n = \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\overline{y}
\]
\[
\kappa_n = \kappa_0 + n
\]
\[
v_n = v_0 + n
\]
\[
v_n\sigma_n^2 = v_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0n}{\kappa_0 + n}(\overline{y} - \mu_0)^2
\]\subsubsection{The conditional posterior distribution, $p(\mu|\sigma^2, y)$}
The conditional posterior density is proportional to the joint posterior density with $\sigma^2$ held constant
\[
\mu | \sigma^2, y \sim N(\mu_n, \sigma^2/\kappa_n)
\]
\[
= N\Bigg( \frac{\frac{\kappa_0}{\sigma^2}\mu_0 + \frac{n}{\sigma^2}\overline{y}}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}} , \frac{1}{\frac{\kappa_0}{\sigma^2} + \frac{n}{\sigma^2}} \Bigg)
\]
\subsubsection{The marginal posterior distribution, $p(\sigma^2|y)$}
The marginal posterior density is scaled inverse-$\chi^2$: 
\[
\sigma^2 | y \sim \textrm{Inv-}\chi^2(v_n, \sigma_n^2)
\]
\subsubsection{Sampling from the joint posterior distribution}
We first draw $\sigma^2$ from its marginal posterior distribution, then draw $\mu$ from its normal conditional posterior distribution

\subsubsection{Analytic form of the marginal posterior distribution of $\mu$}
\[
p(\mu | y) \propto \Bigg(1 + \frac{\kappa_n(\mu - \mu_n)^2}{v_n\sigma_n^2} \Bigg)^{-(v_n+1)/2}
\]
\[
= t_{v_n}(\mu | \mu_n, \sigma_n^2/\kappa_n)
\]
\subsection{Multinomial model for categorical data}
The multinomial sampling distribution is used to describe data for which each observation is one of $k$ possible outcomes. If $y$ is the vector of counts of the number of observations of each outcome, then
\[
p(y | \theta) \propto \prod_{j=1}^k \theta_j^{y_j}
\]where the sum of the probabilities $\sum_{j=1}^k\theta_j$ is 1. The conjugate prior distribution is a multivariate generalization of the beta distribution known as the Dirichlet,
\[
p(\theta | \alpha) \propto \prod_{j=1}^k \theta_j^{\alpha_j - 1}
\]
The resulting posterior distribution for the $\theta_j$'s is Dirichlet with parameters $\alpha_j + y_j$
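The Dirichlet posterior can be simulated in base R by normalizing independent gamma draws. The sketch below assumes hypothetical counts and a hypothetical uniform $\textrm{Dirichlet}(1, 1, 1)$ prior; the helper \texttt{rdirichlet\_gamma} is defined here for illustration and is not a built-in function.
\begin{minted}[breaklines]{R}
set.seed(1)

# Hypothetical counts for k = 3 outcomes and a Dirichlet(1, 1, 1) prior
y <- c(30, 50, 20)
alpha <- c(1, 1, 1)

# Conjugate update: Dirichlet with parameters alpha_j + y_j
alpha_post <- alpha + y

# Draw from a Dirichlet by normalizing independent Gamma(a_j, 1) variables
rdirichlet_gamma <- function(n_draws, a) {
    k <- length(a)
    # Column j holds n_draws gamma variables with shape a[j]
    g <- matrix(rgamma(n_draws * k, shape = rep(a, each = n_draws)),
                nrow = n_draws)
    g / rowSums(g)
}

theta <- rdirichlet_gamma(2000, alpha_post)

# Simulated posterior means, close to alpha_post / sum(alpha_post)
colMeans(theta)
\end{minted}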

\subsection{Multivariate normal model with known variance}
\subsubsection{Multivariate normal likelihood}
An observable vector $y$ of $d$ components, with the multivariate normal distribution
\[
y | \mu, \Sigma \sim N(\mu, \Sigma)
\]
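where $\mu$ is a column vector of length $d$ and $\Sigma$ is a $d\times d$ variance matrix. The likelihood function for a single observation is
\[
p(y | \mu, \Sigma) \propto |\Sigma|^{-1/2} \exp \Bigg( -\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu) \Bigg)
\]
and for a sample of $n$ independent and identically distributed observations $y_1, \dots, y_n$ it is
\[
p(y_1, \dots, y_n | \mu, \Sigma) \propto |\Sigma|^{-n/2} \exp \Bigg( -\frac{1}{2}\sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1} (y_i - \mu) \Bigg)
\]
\[
= |\Sigma|^{-n/2} \exp\Bigg( -\frac{1}{2}\textrm{tr}(\Sigma^{-1}S_0) \Bigg)
\]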
where $S_0$ is the matrix of sums of squares relative to $\mu$,
\[
S_0 = \sum_{i=1}^n(y_i - \mu)(y_i - \mu)^T
\]\subsubsection{Conjugate prior distribution for $\mu$ with known $\Sigma$}
The conjugate prior distribution for $\mu$ is the multivariate normal distribution, which we parameterize
\[
\mu \sim N(\mu_0, \Lambda_0)
\]
\subsubsection{Posterior distribution for $\mu$ with known $\Sigma$ }
The posterior distribution of $\mu$ is
\[
p(\mu|y, \Sigma) \propto \exp \Bigg( -\frac{1}{2} \Bigg((\mu - \mu_0)^T \Lambda_0^{-1}(\mu - \mu_0) + \sum_{i=1}^n (y_i - \mu)^T \Sigma^{-1}(y_i - \mu) \Bigg) \Bigg)
\]
\[
p(\mu | y, \Sigma) \propto \exp\Bigg(-\frac{1}{2}(\mu - \mu_n)^T \Lambda_n^{-1}(\mu - \mu_n) \Bigg)
\]\[
= N(\mu | \mu_n, \Lambda_n)
\]
where 
\[
\mu_n = (\Lambda_0^{-1} + n\Sigma^{-1})^{-1}(\Lambda_0^{-1}\mu_0 + n\Sigma^{-1}\overline{y})
\]\[
\Lambda_n^{-1} = \Lambda_0^{-1} + n\Sigma^{-1}
\]
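A quick sanity check on these formulas, sketched for the scalar case $d = 1$ with $\Sigma = \sigma^2$, $\Lambda_0 = \tau_0^2$ and $\Lambda_n = \tau_n^2$ (symbols introduced only for this illustration): the posterior precision is the sum of the prior precision and the data precision, and the posterior mean is the precision-weighted average of the prior mean and the sample mean,
\[
\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}, \quad \mu_n = \frac{\frac{1}{\tau_0^2}\mu_0 + \frac{n}{\sigma^2}\overline{y}}{\frac{1}{\tau_0^2} + \frac{n}{\sigma^2}}
\]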
\subsubsection{Posterior conditional and marginal distributions of subvectors of $\mu$ with known $\Sigma$}
The conditional posterior distribution of a subset $\mu^{(1)}$ given the values of a second subset $\mu^{(2)}$ is multivariate normal. If we write superscripts in parentheses to indicate appropriate subvectors and submatrices, then
\[
\mu^{(1)} | \mu^{(2)}, y \sim N \Bigg( \mu_n^{(1)} + \beta^{1|2}(\mu^{(2)} - \mu_n^{(2)}), \Lambda^{1|2} \Bigg)
\]
where the regression coefficients $\beta^{1|2}$ and conditional variance matrix $\Lambda^{1|2}$ are defined by
\[
\beta^{1|2} = \Lambda_n^{(12)} \Bigg( \Lambda_n^{(22)} \Bigg)^{-1}
\]
\[
\Lambda^{1|2} = \Lambda_n^{(11)} - \Lambda_n^{(12)} \Bigg(\Lambda_n^{(22)} \Bigg)^{-1} \Lambda_n^{(21)}
\]
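For intuition, consider the bivariate case $d = 2$, where every block is a scalar. Writing $\rho_n$ for the posterior correlation between $\mu^{(1)}$ and $\mu^{(2)}$ (notation assumed here for illustration only), the definitions above reduce to
\[
\beta^{1|2} = \frac{\Lambda_n^{(12)}}{\Lambda_n^{(22)}}, \quad \Lambda^{1|2} = \Lambda_n^{(11)} - \frac{(\Lambda_n^{(12)})^2}{\Lambda_n^{(22)}} = \Lambda_n^{(11)}(1 - \rho_n^2)
\]
so conditioning on $\mu^{(2)}$ shrinks the variance of $\mu^{(1)}$ by the factor $1 - \rho_n^2$ and shifts its mean in proportion to how far $\mu^{(2)}$ lies from $\mu_n^{(2)}$.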

\subsubsection{Posterior predictive distribution for new data}
The marginal posterior distribution of $\tilde{y}$ is multivariate normal. Its mean and variance follow from the laws of iterated expectation and total variance:
\[
\textrm{E}(\tilde{y}|y) = \textrm{E}(\textrm{E}(\tilde{y}|\mu, y)|y) = \textrm{E}(\mu | y) = \mu_n
\]
and 
\[
\textrm{Var}(\tilde{y}|y) = \textrm{E}(\textrm{Var}(\tilde{y}|\mu, y) | y) + \textrm{Var}(\textrm{E}(\tilde{y}|\mu, y)|y)
\]\[
= \textrm{E}(\Sigma | y) + \textrm{Var}(\mu | y) = \Sigma + \Lambda_n
\]
\subsection{Multivariate normal with unknown mean and variance}
\subsubsection{Conjugate inverse-Wishart family of prior distributions}
The conjugate distribution for the univariate normal with unknown mean and variance is the normal-inverse-$\chi^2$ distribution. We can use the inverse-Wishart distribution, a multivariate generalization of the scaled inverse-$\chi^2$, to describe the prior distribution of the matrix $\Sigma$. 
\[
\Sigma \sim \textrm{Inv-Wishart}_{v_0}(\Lambda_0^{-1})
\]
\[
\mu | \Sigma \sim N(\mu_0, \Sigma/\kappa_0)
\]
which corresponds to the joint prior density
\[
p(\mu, \Sigma) \propto |\Sigma|^{-((v_0+d)/2+1)}\exp \Bigg( -\frac{1}{2}\textrm{tr}(\Lambda_0 \Sigma^{-1}) - \frac{\kappa_0}{2}(\mu - \mu_0)^T \Sigma^{-1} (\mu - \mu_0) \Bigg)
\]
Multiplying the prior density by the normal likelihood results in a posterior density of the same family with parameters
\[
\mu_n = \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\overline{y}
\]\[
\kappa_n = \kappa_0 + n
\]
\[
v_n = v_0 + n
\]
\[
\Lambda_n = \Lambda_0 + S + \frac{\kappa_0n}{\kappa_0 + n}(\overline{y} - \mu_0)(\overline{y} - \mu_0)^T
\]
where $S$ is the sum of squares matrix about the sample mean
\[
S = \sum_{i=1}^n (y_i - \overline{y})(y_i - \overline{y})^T
\]
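As a consistency check, in the scalar case $d = 1$ one can write $\Lambda_0 = v_0\sigma_0^2$ and $\Lambda_n = v_n\sigma_n^2$ (an identification assumed here so that the inverse-Wishart matches the scaled inverse-$\chi^2$ parameterization), and $S = (n-1)s^2$ with $s^2$ the sample variance. The update for $\Lambda_n$ then becomes
\[
v_n\sigma_n^2 = v_0\sigma_0^2 + (n-1)s^2 + \frac{\kappa_0 n}{\kappa_0 + n}(\overline{y} - \mu_0)^2
\]
which combines the prior sum of squares, the sample sum of squares, and a term for the discrepancy between the sample mean and the prior mean.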
\subsection{Example: analysis of a bioassay experiment}
Few multiparameter sampling models allow simple explicit calculation of posterior distributions. Data analysis for such models is possible using computational methods. 

Various dose levels of the compound are given to batches of animals. The animals' responses are characterized by a dichotomous outcome: alive or dead, tumor or no tumor. 
\[
(x_i, n_i, y_i); i = 1, \dots, k
\]
where $x_i$ represents the $i$th of $k$ dose levels given to $n_i$ animals, of which $y_i$ subsequently respond with positive outcome. 

The data points $y_i$ are binomially distributed:
\[
y_i | \theta_i \sim \textrm{Bin}(n_i, \theta_i)
\]
where $\theta_i$ is the probability of death for animals given dose $x_i$. 

Dose-response relation: $\textrm{logit}(\theta_i) = \alpha + \beta x_i$

The likelihood for each group $i$ :
\[
p(y_i | \alpha, \beta, n_i, x_i) \propto [\textrm{logit}^{-1}(\alpha + \beta x_i)]^{y_i}[1 - \textrm{logit}^{-1}(\alpha + \beta x_i)]^{n_i - y_i}
\]
The joint posterior distribution is
\[
p(\alpha, \beta | y, n, x) \propto p(\alpha, \beta | n, x) p(y | \alpha, \beta, n, x)
\]
\[
\propto p(\alpha, \beta) \prod_{i=1}^k p(y_i | \alpha, \beta, n_i, x_i)
\]
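To make the posterior concrete, it helps to work on the log scale. The following is a sketch under the assumption of a uniform prior, $p(\alpha, \beta) \propto 1$ (an assumption made here for illustration; the text above leaves the prior unspecified). Since $\textrm{logit}^{-1}(u) = e^u/(1 + e^u)$, each likelihood factor satisfies $\theta_i^{y_i}(1 - \theta_i)^{n_i - y_i} = e^{y_i(\alpha + \beta x_i)}(1 + e^{\alpha + \beta x_i})^{-n_i}$, so
\[
\log p(\alpha, \beta | y, n, x) = \textrm{const} + \sum_{i=1}^k \Bigg( y_i(\alpha + \beta x_i) - n_i \log(1 + e^{\alpha + \beta x_i}) \Bigg)
\]
This expression can be evaluated on a grid of $(\alpha, \beta)$ values to approximate the posterior numerically.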
\subsection{Summary of elementary modeling and computation}
The lack of multiparameter models permitting easy calculation of posterior distributions is not a major problem, for three reasons: (1) simple simulation methods are available; (2) sophisticated models can often be represented in a hierarchical or conditional manner; (3) we can often apply a normal approximation to the posterior distribution. 

\begin{enumerate}
\item Write the likelihood $p(y|\theta)$
\item Posterior density $p(\theta | y) \propto p(\theta)p(y | \theta)$ 
\item Create a crude estimate of the parameters $\theta$ to use as a starting point
\item Draw simulations $\theta^1, \dots, \theta^S$ from the posterior distribution.
\item If any predictive quantities $\tilde{y}$ are of interest, simulate $\tilde{y}^1, \dots, \tilde{y}^S$ by drawing each $\tilde{y}^s$ from the sampling distribution conditional on the drawn value $\theta^s$, $p(\tilde{y}|\theta^s)$ 
\end{enumerate}
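
Once the draws are available, posterior summaries reduce to sample averages. As a generic sketch (the function $h$ is a placeholder for any quantity of interest, not something defined in the text above), a posterior expectation is approximated by
\[
\textrm{E}(h(\theta) | y) \approx \frac{1}{S}\sum_{s=1}^S h(\theta^s)
\]
and posterior intervals can be read off from the empirical quantiles of $h(\theta^1), \dots, h(\theta^S)$. The same applies to the predictive draws $\tilde{y}^s$.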

\section{Asymptotics and connections to non-Bayesian approaches}
Many simple Bayesian analyses based on noninformative prior distributions give similar results to standard non-Bayesian approaches. 

\subsection{Normal approximation to the posterior distribution}
\subsubsection{Normal approximation to the joint posterior distribution}
If the posterior distribution $p(\theta | y)$ is unimodal and roughly symmetric, it can be convenient to approximate it by a normal distribution. 

Posterior mode $\hat{\theta}$
\[
p(\theta | y) \approx N(\hat{\theta}, [I(\hat{\theta})]^{-1})
\]
where $I(\theta)$ is the observed information
\[
I(\theta) = - \frac{d^2}{d\theta^2}\log p(\theta | y)
\]
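As a worked example of this approximation (a standard case, sketched here under the assumption of a uniform prior on $\theta$), let $y$ be the number of successes in $n$ binomial trials with $0 < y < n$. Then $\log p(\theta | y) = \textrm{const} + y\log\theta + (n - y)\log(1 - \theta)$, and
\[
\hat{\theta} = \frac{y}{n}, \quad I(\hat{\theta}) = \frac{y}{\hat{\theta}^2} + \frac{n - y}{(1 - \hat{\theta})^2} = \frac{n}{\hat{\theta}(1 - \hat{\theta})}
\]
so that
\[
p(\theta | y) \approx N\Bigg(\hat{\theta}, \frac{\hat{\theta}(1 - \hat{\theta})}{n}\Bigg)
\]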
\subsubsection{Interpretation of the posterior density function relative to its maximum}
The multivariate normal distribution provides a benchmark for interpreting the posterior density function and contour plots

\subsubsection{Data reduction and summary statistics}
Under the normal approximation, the posterior distribution is summarized by its mode $\hat{\theta}$ and the curvature of the posterior density $I(\hat{\theta})$; that is, asymptotically these are sufficient statistics. 

\subsubsection{Lower-dimensional normal approximations}
For a finite sample size $n$, the normal approximation is typically more accurate for conditional and marginal distributions of components of $\theta$ than for the full joint distribution. 

\subsection{Large-sample theory}
We review some theory of how the posterior distribution behaves as the amount of data, from some fixed sampling distribution, increases. 

\subsubsection{Notation and mathematical setup}
Asymptotic normality of the posterior distribution

Kullback-Leibler divergence

\subsubsection{Asymptotic normality and consistency}
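As $n \to \infty$, the posterior distribution of $\theta$ approaches normality with mean $\theta_0$ and variance $(nJ(\theta_0))^{-1}$, where $\theta_0$ is the value of $\theta$ that minimizes the Kullback-Leibler divergence from the true data distribution (the true parameter value when the model is correct) and $J$ is the Fisher information.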

In the limit of large $n$, in the context of a specified family of models, the posterior mode, $\hat{\theta}$, approaches $\theta_0$, and the curvature approaches $nJ(\hat{\theta})$ or $nJ(\theta_0)$. 
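
For a single observation the Fisher information is given by the standard definition, stated here as background rather than taken from the text above:
\[
J(\theta) = -\textrm{E}\Bigg( \frac{d^2}{d\theta^2}\log p(y_i | \theta) \,\Bigg|\, \theta \Bigg)
\]
For example, if $y_i \sim N(\theta, \sigma^2)$ with known $\sigma^2$, then $J(\theta) = 1/\sigma^2$ and the limiting posterior variance $(nJ(\theta_0))^{-1}$ is $\sigma^2/n$, the familiar sampling variance of $\overline{y}$.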

\subsubsection{Likelihood dominating the prior distribution}
The asymptotic results formalize the notion that the importance of the prior distribution diminishes as the sample size increases.

\subsection{Counterexamples to the theorems}
A good way to understand the limitations of the large-sample results is to consider cases in which the theorems fail

\subsubsection{Underidentified models and nonidentified parameters}
Underidentified: the model is underidentified given data $y$ if the likelihood $p(y|\theta)$ is equal for a range of values of $\theta$. 

Nonidentified: a parameter $\rho$ is nonidentified if the data supply no information about it, so that the posterior distribution of $\rho$ is the same as its prior distribution, no matter how large the dataset is.

\subsubsection{Number of parameters increasing with sample size}
In complicated problems there can be a large number of parameters, and then we need to distinguish between different types of asymptotics. For example, sometimes a parameter is assigned to each sampling unit in a study, as in $y_i \sim N(\theta_i, \sigma^2)$. The parameters $\theta_i$ generally cannot be estimated consistently unless the amount of data collected from each sampling unit increases along with the number of units. 

\subsubsection{Aliasing}
Aliasing is a special case of underidentified parameters in which the same likelihood function repeats at a discrete set of points. The problem of aliasing is eliminated by restricting the parameter space so that no duplication appears. 
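
A standard illustration, sketched here with notation assumed for the example, is the two-component normal mixture with equal weights:
\[
p(y_i | \mu_1, \mu_2, \sigma_1^2, \sigma_2^2) = 0.5\, N(y_i | \mu_1, \sigma_1^2) + 0.5\, N(y_i | \mu_2, \sigma_2^2)
\]
Interchanging $(\mu_1, \sigma_1^2)$ with $(\mu_2, \sigma_2^2)$ leaves the likelihood unchanged, so the likelihood surface has mirror-image copies. Restricting the parameter space to $\mu_1 \leq \mu_2$ removes the duplication.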

\subsubsection{Unbounded likelihoods}
If the likelihood function is unbounded, then there might be no posterior mode within the parameter space, invalidating both the consistency results and the normal approximation

\subsubsection{Improper posterior distributions}
An improper posterior distribution cannot occur except with an improper prior distribution

\subsubsection{Prior distributions that exclude the point of convergence}
If $p(\theta_0) = 0$ for a discrete parameter space, or if $p(\theta) = 0$ in a neighborhood about $\theta_0$ for a continuous parameter space, then the convergence results, which are based on the likelihood dominating the prior distribution, do not hold. 

\subsubsection{Convergence to the edge of parameter space}
If $\theta_0$ is on the boundary of the parameter space, then the Taylor series expansion must be truncated in some directions, and the normal distribution will not necessarily be appropriate, even in the limit. 

\subsubsection{Tails of the distribution}
The normal approximation can hold for essentially all the mass of the posterior distribution but still not be accurate in the tails

\subsection{Frequency evaluations of Bayesian inference}
The methods of frequentist statistics provide a useful approach for evaluating the properties of Bayesian inferences

\subsubsection{Large-sample correspondence}
The posterior distribution derived from Bayesian theory is asymptotically the same as the repeated sampling distribution. 

\subsubsection{Point estimation, consistency and efficiency}
In many problems, especially with large samples, a point estimate and its estimated standard error are adequate to summarize a posterior inference, but we interpret the estimate as an inferential summary, not as the solution to a decision problem. 

When the truth is included in the family of models being fitted, the posterior mode $\hat{\theta}$, and also the posterior mean and median, are consistent and asymptotically unbiased under mild regularity conditions.

Under  mild regularity conditions, the center of the posterior distribution (defined, for example, by the posterior mean, median, or mode) is asymptotically efficient. 

\subsubsection{Confidence coverage}
Asymptotically a $100(1 - \alpha)\%$ central posterior interval for $\theta$ has the property that, in repeated samples of $y$, $100(1 - \alpha)\%$ of the intervals include the value $\theta_0$. 

\subsection{Bayesian interpretations of other statistical methods}
We briefly consider several statistical concepts - point and interval estimation, likelihood inference, unbiased estimation, frequency coverage of confidence intervals, hypothesis testing, multiple comparisons, nonparametric methods, and the jackknife and bootstrap - and discuss their relation to Bayesian methods.

\subsubsection{Maximum likelihood and other point estimates}
We can often interpret classical point estimates as exact or approximate posterior summaries based on some implicit full probability model. 

In the limit (assuming regularity conditions), the maximum likelihood estimate, $\hat{\theta}$, is a sufficient statistic - and so is the posterior mode, mean, or median. That is, for large enough $n$, the maximum likelihood estimate (or any other summaries) supplies essentially all the information about $\theta$ available from the data. 

\subsubsection{Unbiased estimates}
From a Bayesian perspective, the principle of unbiasedness is reasonable in the limit of large samples, but otherwise is potentially misleading.

\subsubsection{Hypothesis testing}
In order for a Bayesian analysis to yield a nonzero probability for a point null hypothesis, it must begin with a nonzero prior probability for that hypothesis. For a continuous parameter $\theta$, the question ``Does $\theta$ equal 0?'' can generally be rephrased more usefully as ``What is the posterior distribution for $\theta$?''

We do find a form of hypothesis test to be useful when assessing the goodness of fit of a probability model.

\subsubsection{Multiple comparisons and multilevel modeling}
Several competing multiple comparisons procedures have been derived in classical statistics, with rules about when various $\theta_j$'s can be declared significantly different. 

We prefer to handle multiple comparisons problems using hierarchical models. As a result, this Bayesian procedure automatically addresses the key concern of classical multiple comparisons analysis. 

\subsubsection{Nonparametric methods, permutation tests, jackknife, bootstrap}
Hypothesis tests for comparing medians based on ranks do not have direct counterparts in Bayesian inference; therefore it is hard to interpret the resulting estimates and $p$-values from a Bayesian point of view. 

\section{Hierarchical models}
It is natural to model such a problem hierarchically, with observable outcomes modeled conditionally on certain parameters, which themselves are given a probabilistic specification in terms of further parameters, known as hyperparameters. 

Simple nonhierarchical models are usually inappropriate for hierarchical data.

\subsection{Constructing a parameterized prior distribution}
\subsubsection{Analyzing a single experiment in the context of historical data}
We consider the problem of estimating a parameter $\theta$ using data from a small experiment and a prior distribution constructed from similar previous (or historical) experiments. 

\subsubsection{Analysis with a fixed prior distribution }
With 4 tumors observed among the 14 rats in the current experiment, assuming a $\textrm{Beta}(\alpha, \beta)$ prior distribution for $\theta$ yields a $\textrm{Beta}(\alpha + 4, \beta + 10)$ posterior distribution for $\theta$. 

\subsubsection{Approximate estimate of the population distribution using the historical data}
Several logical and practical problems with the approach of directly estimating a prior distribution from existing data:

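If we wanted to use the estimated prior distribution for inference about the first 70 experiments, then the data would be used twice: first to estimate the prior distribution, and then again to estimate each $\theta_j$, which would tend to overstate our precision. 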

The point estimate for $\alpha$ and $\beta$ seems arbitrary, and using any point estimate for $\alpha$ and $\beta$ necessarily ignores some posterior uncertainty.

\subsubsection{Logic of combining information}
It clearly makes more sense to try to estimate the population distribution from all the data, and thereby to help estimate each $\theta_j$, than to estimate all 71 values $\theta_j$ separately. 

\subsection{Exchangeability and hierarchical models}
In order to create a joint probability model for all the parameters $\theta$, we use the crucial idea of exchangeability.

\subsubsection{Exchangeability}
If no information - other than the data $y$ - is available to distinguish any of the $\theta_j$'s from any of the others, and no ordering or grouping of the parameters can be made, one must assume symmetry among the parameters in their prior distribution. 

The simplest form of an exchangeable distribution has each of the parameters $\theta_j$ as an independent sample from a prior (or population) distribution governed by some unknown parameter vector $\phi$; thus
\[
p(\theta | \phi) = \prod_{j=1}^J p(\theta_j | \phi)
\]
In general, $\phi$ is unknown, so our distribution for $\theta$ must average over our uncertainty in $\phi$: 
\[
p(\theta) = \int \Bigg( \prod_{j=1}^J p(\theta_j | \phi) \Bigg) p(\phi) d\phi
\]
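One consequence worth spelling out, as a short derivation under the assumption that the relevant moments exist, is that parameters that are exchangeable through this mixture form are not independent marginally. Writing $m(\phi) = \textrm{E}(\theta_j | \phi)$ (a symbol introduced here only for illustration), for $i \neq j$,
\[
\textrm{cov}(\theta_i, \theta_j) = \textrm{E}(\textrm{cov}(\theta_i, \theta_j | \phi)) + \textrm{cov}(\textrm{E}(\theta_i | \phi), \textrm{E}(\theta_j | \phi)) = 0 + \textrm{Var}(m(\phi)) \geq 0
\]
so learning about one $\theta_i$ is informative about the others, which is what allows a hierarchical model to pool information.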
\subsubsection{Exchangeability when additional information is available on the units}
If the observations can be grouped, we may make a hierarchical model, where each group has its own submodel, but the group properties are unknown. If we assume that group properties are exchangeable, we can use a common prior distribution for the group properties.

If $y_i$ has additional information $x_i$ so that $y_i$ are not exchangeable but $(y_i, x_i)$ still are exchangeable, then we can make a joint model for $(y_i, x_i)$ or a conditional model for $y_i | x_i$ 

In general, the usual way to model exchangeability with covariates is through conditional independence: $p(\theta_1, \dots, \theta_J | x_1, \dots, x_J) = \int[\prod_{j=1}^J p(\theta_j|\phi, x_j)]p(\phi|x)d\phi$ with $x = (x_1, \dots, x_J)$. 

\subsubsection{Objections to exchangeable models}
With no information available to distinguish them, we have no logical choice but to model the $\theta_j$'s exchangeably. 

\subsubsection{The full Bayesian treatment of the hierarchical model}
The appropriate Bayesian posterior distribution is of the vector $(\phi, \theta)$. The joint prior distribution is
\[
p(\phi, \theta) = p(\phi)p(\theta|\phi)
\]
and the joint posterior distribution is
\[
p(\phi, \theta | y) \propto p(\phi, \theta)p(y | \phi, \theta)
\]
\[
= p(\phi, \theta)p(y|\theta)
\]
\subsubsection{The hyperprior distribution}
In order to create a joint probability distribution for $(\phi, \theta)$, we must assign a prior distribution to $\phi$. 

\subsubsection{Posterior predictive distributions}
Two posterior predictive distributions are of interest: (1) the distribution of future observations $\tilde{y}$ corresponding to an existing $\theta_j$, and (2) the distribution of observations $\tilde{y}$ corresponding to future $\theta_j$'s drawn from the same superpopulation.

\subsection{Bayesian analysis of conjugate hierarchical models}
We present an approach that combines analytical and numerical methods to obtain simulations from the joint posterior distribution $p(\theta, \phi|y)$ for the beta-binomial model, for which the population distribution $p(\theta|\phi)$ is conjugate to the likelihood $p(y|\theta)$. 

\subsubsection{Analytic derivation of conditional and marginal distributions}
Three steps analytically
\begin{enumerate}
\item Write the joint posterior density $p(\theta, \phi |y)$ in unnormalized form as the product of $p(\phi)$, $p(\theta|\phi)$ and $p(y|\theta)$
\item Determine $p(\theta | \phi, y)$
\item Obtain the marginal posterior distribution $p(\phi | y)$ 
\end{enumerate}

For many standard models, the marginal posterior distribution of $\phi$ can be computed algebraically

\[
p(\phi | y) = \frac{p(\theta, \phi | y)}{p(\theta | \phi, y)}
\]
\subsubsection{Drawing simulations from the posterior distribution}
Draw from the joint posterior distribution $p(\theta, \phi|y)$ 
\begin{enumerate}
\item Draw the vector of hyperparameters $\phi$ from its marginal posterior distribution $p(\phi|y)$
\item Draw the parameter vector $\theta$ from its conditional posterior distribution, $p(\theta | \phi, y)$ 
\item If desired, draw predictive values $\tilde{y}$ from the posterior predictive distribution given the drawn $\theta$. 
\end{enumerate}

\subsubsection{Application to the model of rat tumors}
\[
y_j \sim \textrm{Bin}(n_j, \theta_j)
\]
\[
\theta_j \sim \textrm{Beta}(\alpha, \beta)
\]
Joint, conditional and marginal posterior distributions
\[
p(\theta, \alpha, \beta | y) \propto p(\alpha, \beta) p(\theta | \alpha, \beta) p(y|\theta, \alpha, \beta)
\]
\[
\propto p(\alpha, \beta) \prod_{j=1}^J \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta_j^{\alpha - 1}(1 - \theta_j)^{\beta - 1}\prod_{j=1}^J \theta_j^{y_j}(1 - \theta_j)^{n_j - y_j}
\]
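Given $(\alpha, \beta)$, the components of $\theta$ have independent beta posterior densities, so the conditional posterior density is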
\[
p(\theta | \alpha, \beta, y) = \prod_{j=1}^J \frac{\Gamma(\alpha + \beta + n_j)}{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}\theta_j^{\alpha + y_j -1}(1 - \theta_j)^{\beta + n_j - y_j - 1}
\]
The marginal posterior distribution
\[
p(\alpha, \beta | y) \propto p(\alpha, \beta) \prod_{j=1}^J \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}\frac{\Gamma(\alpha + y_j) \Gamma(\beta + n_j - y_j)}{\Gamma (\alpha + \beta + n_j)}
\]
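The marginal density above follows from dividing the joint posterior density by the conditional density of $\theta$, using the identity $p(\phi | y) = p(\theta, \phi | y)/p(\theta | \phi, y)$ with $\phi = (\alpha, \beta)$. As a sketch of the step for a single $j$, the factors involving $\theta_j$ cancel:
\[
\frac{\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta_j^{\alpha + y_j - 1}(1 - \theta_j)^{\beta + n_j - y_j - 1}}{\frac{\Gamma(\alpha + \beta + n_j)}{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}\theta_j^{\alpha + y_j - 1}(1 - \theta_j)^{\beta + n_j - y_j - 1}} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\frac{\Gamma(\alpha + y_j)\Gamma(\beta + n_j - y_j)}{\Gamma(\alpha + \beta + n_j)}
\]
The result does not depend on $\theta_j$, as it must. Taking the product over $j$ and multiplying by $p(\alpha, \beta)$ gives the marginal posterior density.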
\subsection{Normal model with exchangeable parameters}
Simple hierarchical model based on the normal distribution, in which observed data are normally distributed with a different mean for each ``group'' or ``experiment'', with known observation variance, and a normal population distribution for the group means. This is the one-way normal random-effects model with known data variance. 

\subsubsection{The data structure}
$J$ independent experiments, with experiment $j$ estimating the parameter $\theta_j$ from $n_j$ independent normally distributed data points, $y_{ij}$, each with known error variance $\sigma^2$; that is
\[
y_{ij} | \theta_j \sim N(\theta_j, \sigma^2), i=1, \dots, n_j; j = 1, \dots, J
\]
Using standard notation from the analysis of variance, we label the sample mean of each group $j$ as
\[
\overline{y}_{\cdot j} = \frac{1}{n_j}\sum_{i=1}^{n_j}y_{ij}
\]
with sampling variance
\[
\sigma_j^2 = \sigma^2/n_j
\]
We can then write the likelihood for each $\theta_j$ using the sufficient statistics $\overline{y}_{\cdot j}$ 
\[
\overline{y}_{\cdot j} | \theta_j \sim N(\theta_j, \sigma_j^2) 
\]
\subsubsection{The hierarchical model}
We assume that the parameters $\theta_j$ are drawn from a normal distribution with hyperparameters $(\mu, \tau)$:
\[
p(\theta_1, \dots, \theta_J | \mu, \tau) = \prod_{j=1}^J N(\theta_j | \mu, \tau^2)
\]
\[
p(\theta_1, \dots, \theta_J) = \int \prod_{j=1}^J [N(\theta_j | \mu, \tau^2)]p(\mu, \tau) d(\mu, \tau)
\]
We assign a noninformative uniform hyperprior distribution to $\mu$ given $\tau$
\[
p(\mu, \tau) = p(\mu | \tau)p(\tau) \propto p(\tau)
\]
\subsubsection{The joint posterior distribution}
\[
p(\theta, \mu, \tau | y) \propto p(\mu, \tau)p(\theta | \mu, \tau)p(y|\theta)
\]
\[
\propto p(\mu, \tau) \prod_{j=1}^J N(\theta_j | \mu, \tau^2) \prod_{j=1}^J N(\overline{y}_{\cdot j} | \theta_j, \sigma_j^2)
\]
\subsubsection{The conditional posterior distribution of the normal means, given the hyperparameters}
The conditional posterior distributions for the $\theta_j$'s are independent, and
\[
\theta_j | \mu, \tau, y \sim N(\hat{\theta}_j, V_j)
\]
where 
\[
\hat{\theta}_j = \frac{\frac{1}{\sigma_j^2}\overline{y}_{\cdot j} + \frac{1}{\tau^2}\mu}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}}\textrm{ and }V_j = \frac{1}{\frac{1}{\sigma_j^2} + \frac{1}{\tau^2}}
\]
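An equivalent way to read these formulas, using a pooling factor $B_j$ that is notation assumed here for illustration, is as partial pooling between the group mean and the population mean:
\[
\hat{\theta}_j = (1 - B_j)\overline{y}_{\cdot j} + B_j\mu, \quad V_j = (1 - B_j)\sigma_j^2, \quad B_j = \frac{\sigma_j^2}{\sigma_j^2 + \tau^2}
\]
As $\tau \to 0$, $B_j \to 1$ and every $\hat{\theta}_j$ collapses to $\mu$ (complete pooling); as $\tau \to \infty$, $B_j \to 0$ and $\hat{\theta}_j \to \overline{y}_{\cdot j}$ (no pooling).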
\subsubsection{The marginal posterior distribution of the hyperparameters}
For the normal distribution, the marginal likelihood has a particularly simple form. The marginal distributions of the group means $\overline{y}_{\cdot j}$, averaging over $\theta$, are independent normal:
\[
\overline{y}_{\cdot j} | \mu, \tau \sim N(\mu, \sigma_j^2 + \tau^2)
\]
Thus we can write the marginal posterior density as 

















