# Part 1: Introduction to Bayesian Inference
When we observe a data set sampled from a larger population, there is variability in the data and uncertainty about the truth. Statistical inference is about applying consistent reasoning to infer the truth when not all the data are available. Bayesian statistics is a particular approach that uses probability to make statistical inferences: we express our uncertainty as probability and update our subjective beliefs in light of new data or evidence (\cite{o2004bayesian}). In particular, Bayesian inference interprets probability as a measure of how strongly an individual believes that a particular event will occur or that a hypothesis is plausible. This is in contrast to Frequentist statistics, which interprets probabilities as the long-run frequencies of events over repeated trials.

In order to carry out Bayesian inference, we utilise Bayes' rule. To derive Bayes' rule, we start with the definition of conditional probability, which gives the probability of an event $A$ given the occurrence of another event $B$ (assuming $P(B) > 0$):

\begin{equation}
	\begin{aligned}
		P(A|B) = \frac{P(A \cap B)}{P(B)}
	\end{aligned}
\end{equation}

This states that the probability of $A$ occurring given that $B$ has occurred is equal to the probability that they have both occurred, relative to the probability that $B$ has occurred. Rearranging gives the product (or chain) rule, $P(A\cap B) = P(A|B)P(B)$. Because the joint probability is symmetric, $P(A\cap B) = P(B\cap A)$, we can write it in two ways and equate them:

\begin{equation}\label{eq:bayes2}
	\begin{aligned}
		P(A\cap B) = P(A|B)P(B)  \\
		P(B \cap A) = P(B|A)P(A) \\
		\therefore  P(A|B) = \frac{P(A)P(B|A)}{P(B)}
	\end{aligned}
\end{equation}
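
As a quick numerical check of the rule, the sketch below plugs in some made-up probabilities. The values of `p_A`, `p_B_A` and `p_B_notA` are hypothetical and chosen only for illustration, and $P(B)$ is obtained from the law of total probability, which is not derived in the text above.

```r
# Hypothetical numbers, chosen only for illustration
p_A      <- 0.01   # P(A): probability of event A
p_B_A    <- 0.95   # P(B|A)
p_B_notA <- 0.05   # P(B|not A)

# P(B) from the law of total probability
p_B <- p_B_A * p_A + p_B_notA * (1 - p_A)

# Bayes' rule: P(A|B) = P(A) P(B|A) / P(B)
p_A_B <- p_A * p_B_A / p_B
print(p_A_B)
```

Even though $P(B|A)$ is high here, $P(A|B)$ stays fairly small because $A$ is rare to begin with.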

## Bayes' Rule for Bayesian Inference

To apply this to Bayesian inference, we rewrite Bayes' rule above in terms of a parameter $\theta$ and data $D$, so that it represents the process of stating prior beliefs and updating them in the face of new data.

\begin{equation}\label{eq:bayes3}
	\begin{aligned}
		p(\theta|D) = \frac{p(\theta)p(D|\theta)}{p(D)}
	\end{aligned}
\end{equation}


- We start with a parameter of interest $\theta$, for which we specify a prior distribution. The prior distribution represents our initial belief about the parameter and is written as $p(\theta)$.

- Then we must consider the data $D$ and the probability of seeing the data as generated by a model with parameter $\theta$. This is expressed as the likelihood, $p(D|\theta)$.

- Lastly, the probability of the data $D$ itself is called the marginal likelihood or evidence, $p(D)$. This is determined by summing (or integrating) the likelihood across all possible values of $\theta$, weighted by our prior belief in each value of $\theta$.

From this we get the posterior distribution $p(\theta|D)$. This is the updated strength of our beliefs in the possible values of $\theta$ once the evidence $D$ has been taken into account. 
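
The three ingredients above can be sketched for a parameter that takes only a few discrete values, so that the evidence is a plain sum. The setting is an assumption made for illustration: `theta` is the success probability of a binomial model, the prior is uniform over three candidate values, and the data are 7 successes in 10 trials.

```r
# Hypothetical setting: theta is a success probability with three candidate values
theta <- c(0.25, 0.50, 0.75)
prior <- rep(1 / 3, 3)            # p(theta): uniform prior belief

# Hypothetical data D: 7 successes in 10 trials
y <- 7
n <- 10
likelihood <- dbinom(y, size = n, prob = theta)   # p(D | theta)

# Evidence p(D): likelihood summed over theta, weighted by the prior
evidence <- sum(prior * likelihood)

# Posterior p(theta | D) from Bayes' rule
posterior <- prior * likelihood / evidence
print(data.frame(theta, prior, likelihood, posterior))
```

The posterior sums to one by construction, and most of the belief moves to the value of `theta` under which the observed data are most probable.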

## Calculating the Posterior 

The evidence or marginal likelihood $p(D)$ is formally written as:

\begin{equation}
	\begin{aligned}
		p(D) = \int p(\theta) p(D|\theta)\ d\theta
	\end{aligned}
\end{equation}
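
For a single parameter this integral can be evaluated numerically. The sketch below assumes a Beta(2, 2) prior and a binomial likelihood with 7 successes in 10 trials (both hypothetical), uses base R's `integrate()`, and compares the result with the known closed form for this particular prior and likelihood pair.

```r
# Hypothetical prior and data, for illustration only
a <- 2; b <- 2          # Beta(a, b) prior on theta
y <- 7; n <- 10         # 7 successes in 10 trials

# Integrand: p(theta) * p(D | theta)
integrand <- function(theta) {
  dbeta(theta, a, b) * dbinom(y, size = n, prob = theta)
}

# Evidence p(D) by numerical integration over theta in [0, 1]
evidence_numeric <- integrate(integrand, lower = 0, upper = 1)$value

# Closed form for a Beta prior with a binomial likelihood
evidence_exact <- choose(n, y) * beta(a + y, b + n - y) / beta(a, b)

print(c(numeric = evidence_numeric, exact = evidence_exact))
```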

When we have one or two parameters we can often calculate the posterior analytically, either by evaluating the marginal likelihood integral directly or by using conjugate priors. In higher-dimensional parameter spaces it can be difficult to obtain $p(D)$, so we often work with the un-normalised posterior instead:
\begin{equation}
	\begin{aligned}
		p(\theta|D) \propto p(\theta)p(D|\theta)
	\end{aligned}
\end{equation}
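
To illustrate that the un-normalised posterior already has the right shape, the sketch below evaluates $p(\theta)p(D|\theta)$ on a grid, rescales it to integrate to one, and compares it with the conjugate result. The Beta(2, 2) prior, the data (7 successes in 10 trials) and the grid size are assumptions made for this example; with a Beta prior and binomial likelihood the exact posterior is Beta(`a + y`, `b + n - y`).

```r
# Hypothetical prior and data, for illustration only
a <- 2; b <- 2          # Beta(a, b) prior
y <- 7; n <- 10         # 7 successes in 10 trials

# Grid of candidate values for theta
theta <- seq(0, 1, length.out = 1001)
step  <- theta[2] - theta[1]

# Un-normalised posterior: prior times likelihood
unnorm <- dbeta(theta, a, b) * dbinom(y, size = n, prob = theta)

# Rescale so that the grid approximation integrates to one
posterior_grid <- unnorm / (sum(unnorm) * step)

# Exact conjugate posterior for comparison
posterior_exact <- dbeta(theta, a + y, b + n - y)

max_abs_diff   <- max(abs(posterior_grid - posterior_exact))
post_mean_grid <- sum(theta * posterior_grid) * step
print(c(max_abs_diff = max_abs_diff, post_mean_grid = post_mean_grid))
```

A grid like this is only practical for one or two parameters, because the number of grid points grows exponentially with the number of parameters.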

In [None]:
# Example

# prior 

# likelihood

# posterior

In [None]:
# plot

## Simulation Methods
Without the normalising constant, the un-normalised posterior is not a proper probability distribution, because it does not integrate to one. In Bayesian statistics, the posterior distribution has to be a probability distribution, from which one can derive moments like the posterior mean.
When analytical methods are not feasible, we can instead use simulation methods to generate samples from the posterior distribution and make inferences from those samples.

### MCMC
Markov Chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from a probability distribution. When the posterior distribution cannot be obtained analytically, we can use MCMC to generate random samples of parameter values drawn from the posterior distribution, $p(\theta|D)$.
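
To make the idea concrete, here is a minimal random-walk Metropolis sampler written in base R. It is one simple MCMC algorithm and is not what JAGS does internally; it is shown only because it needs nothing beyond the un-normalised posterior. The prior, data, proposal standard deviation, chain length and burn-in are all assumptions chosen for illustration.

```r
set.seed(1)

# Hypothetical prior and data, for illustration only
a <- 2; b <- 2          # Beta(a, b) prior on theta
y <- 7; n <- 10         # 7 successes in 10 trials

# Log of the un-normalised posterior: log p(theta) + log p(D | theta)
log_post <- function(theta) {
  if (theta <= 0 || theta >= 1) return(-Inf)   # outside the support
  dbeta(theta, a, b, log = TRUE) + dbinom(y, size = n, prob = theta, log = TRUE)
}

n_iter  <- 20000
samples <- numeric(n_iter)
theta   <- 0.5                      # starting value
lp      <- log_post(theta)

for (i in seq_len(n_iter)) {
  proposal <- theta + rnorm(1, mean = 0, sd = 0.2)   # symmetric random-walk proposal
  lp_prop  <- log_post(proposal)
  # Accept with probability min(1, ratio of un-normalised posteriors)
  if (log(runif(1)) < lp_prop - lp) {
    theta <- proposal
    lp    <- lp_prop
  }
  samples[i] <- theta               # record the current state, accepted or not
}

samples <- samples[-(1:2000)]       # discard burn-in draws
print(mean(samples))                # exact posterior mean here is 9/14
```

Because the acceptance step only uses a ratio of posterior values, the normalising constant $p(D)$ cancels and never has to be computed.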

In [None]:
library(rjags)

## model

### Posterior Summaries

Running MCMC does not give us a probability distribution over values of $\theta$ with corresponding densities. Instead, it gives us a large dataset of sampled parameter values. The more probable regions contain more samples, so we can visualise the posterior distribution by plotting a histogram of the samples. We can summarise $\theta$ by calculating descriptive statistics such as the mean, median or standard deviation of the samples. Probabilities like $P(\theta \geq 0.5)$ are estimated by counting the samples with $\theta \geq 0.5$ and dividing by the total number of samples.
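
These summaries take only a few lines once the samples are available. In the sketch below, `theta_samples` is a stand-in for MCMC output: it is drawn directly from a Beta(9, 5) distribution (a hypothetical posterior) so that the example runs on its own. The 95% interval from sample quantiles is an extra summary not discussed above.

```r
set.seed(42)

# Stand-in for MCMC output: draws from a hypothetical Beta(9, 5) posterior
theta_samples <- rbeta(10000, shape1 = 9, shape2 = 5)

# Descriptive statistics of the posterior samples
post_mean   <- mean(theta_samples)
post_median <- median(theta_samples)
post_sd     <- sd(theta_samples)

# 95% interval from the sample quantiles
post_interval <- quantile(theta_samples, probs = c(0.025, 0.975))

# P(theta >= 0.5): proportion of samples at or above 0.5
prob_ge_half <- mean(theta_samples >= 0.5)

print(c(mean = post_mean, median = post_median, sd = post_sd))
print(post_interval)
print(prob_ge_half)
```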

This method works well on more complex problems with many parameters. For example, to infer two parameters $\theta_1$ and $\theta_2$ we would work with the joint posterior distribution:

$$p(\theta_1, \theta_2|D) \propto p(\theta_1, \theta_2)  p(D|\theta_1, \theta_2) $$ 
and inferring the value of $\theta_1$ on its own would require the marginal posterior distribution for $\theta_1$ (that is, the posterior distribution for $\theta_1$ on its own, not the joint distribution with $\theta_2$), which we can get by integrating over all values of $\theta_2$:

\begin{equation}
	p(\theta_1|D) = \int p(\theta_1, \theta_2|D)\ d\theta_2
\end{equation}

Having posterior samples makes marginalisation much easier: the samples of a single parameter, taken on their own and ignoring all the other parameters, are already samples from its marginal posterior distribution. We can easily plot the distribution or make inferences about a single parameter using these marginal samples.
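
The sketch below shows marginalisation with samples. The matrix `joint_samples` is a stand-in for MCMC output from a two-parameter model: its columns are simulated as correlated normal draws, which is purely an assumption so that the example is self-contained. Selecting the `theta1` column is all that is needed to marginalise over `theta2`.

```r
set.seed(7)
n_samples <- 10000

# Stand-in for joint posterior samples of (theta1, theta2); the correlated
# normal structure is hypothetical and used only for illustration
theta2 <- rnorm(n_samples, mean = 0, sd = 1)
theta1 <- 1 + 0.5 * theta2 + rnorm(n_samples, mean = 0, sd = 0.5)
joint_samples <- cbind(theta1 = theta1, theta2 = theta2)

# Marginal posterior samples of theta1: keep its column, ignore theta2
theta1_marginal <- joint_samples[, "theta1"]

# Summaries of the marginal posterior of theta1
theta1_mean <- mean(theta1_marginal)
theta1_sd   <- sd(theta1_marginal)
print(c(mean = theta1_mean, sd = theta1_sd))
```

No integral over `theta2` is evaluated explicitly; the spread of `theta1_marginal` already reflects the uncertainty in `theta2`.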

In [None]:
# summarise posterior

### Nested Sampling