# Bayesian Inference

This notebook details an approach to Bayesian inference that is motivated by focusing on the likelihood. It is important to note that "inference" here means "inference of model parameters given the observed data".

## Motivation

### Prior

In the world there may be some physical parameter of interest, $\theta$, that you would like to estimate or measure but cannot do so directly. For example, the parameter might be the mass of a subatomic particle or the time of a transit. Given experience of the world, or other knowledge that you may already have that constrains the value of $\theta$, you can represent your belief about the value of $\theta$ in a probability density function (p.d.f.) referred to as the _prior_, $p\left(\theta\right)$.

In this example, to make the prior more notationally distinct, it will be represented as $\pi\left(\vec{\theta}\right)$. The vector notation additionally represents the fact that the prior may exist for multiple parameters of interest, $\vec{\theta}$.

### Model and Likelihood

It might not be possible to directly observe $\vec{\theta}$, but observations can be made that are influenced by the value of the parameter. These observations are data, $\vec{x}$, and their nature and dependence on the parameter can be modeled and formalized through a p.d.f. $p\left(\vec{x} \,\middle|\,\vec{\theta}\right)$.

From this model, $p$, and the observed data, $\vec{x}$, the likelihood function for the parameter values can then be constructed.

$$
L\left(\vec{\theta}\right) = L\left(\vec{\theta}\,\middle|\,\vec{x}\right)
$$

It is worth reiterating that the likelihood is a function of the model parameters only, with the data held fixed at their observed values, and so exists in parameter space. It is not a p.d.f. in $\vec{\theta}$ and so is under no requirement to be normalized to unity.
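As a minimal sketch of this point, assume a hypothetical counting experiment with a binomial model (7 successes observed in 10 trials; these numbers, and all function names below, are illustrative assumptions and not part of this notebook's analysis). Because the data here are discrete, the model is a probability mass function rather than a p.d.f., but the same logic applies: the model sums to one over data space, while the likelihood integrated over parameter space does not.

```python
import math


def binomial_model(k, n, theta):
    """Model p(k | theta): probability of k successes in n trials."""
    return math.comb(n, k) * theta**k * (1.0 - theta) ** (n - k)


def likelihood(theta, k_obs=7, n_trials=10):
    """Likelihood L(theta): the model evaluated at the fixed, observed data."""
    return binomial_model(k_obs, n_trials, theta)


def midpoint_integral(func, lower, upper, n_steps=10_000):
    """Approximate the integral of func over [lower, upper] (midpoint rule)."""
    width = (upper - lower) / n_steps
    return width * sum(func(lower + (i + 0.5) * width) for i in range(n_steps))


# Summing the model over all possible data (theta held fixed) gives one.
model_sum_over_data = sum(binomial_model(k, 10, 0.3) for k in range(11))

# Integrating the likelihood over parameter space (data held fixed) does not.
likelihood_integral = midpoint_integral(likelihood, 0.0, 1.0)

print(f"sum of model over data space:      {model_sum_over_data:.6f}")
print(f"integral of likelihood over theta: {likelihood_integral:.6f}")
```

For this particular model the integral of the likelihood over $\theta \in [0, 1]$ is $1/(n+1) = 1/11$, which makes the lack of normalization explicit.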

It is then natural, given that the likelihood encodes the compatibility of possible parameter values with the observed data, to use this information to improve our beliefs about $\vec{\theta}$, that is, to update our prior, $\pi\left(\vec{\theta}\right)$. This information should be formalized as a new p.d.f., $p\left(\vec{\theta}\, \middle|\,\vec{x}\right)$, referred to as the _posterior_, that is determined through a combination of the likelihood and the prior.

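The prior and the likelihood are two different functions of $\vec{\theta}$, so the choice to multiply them, as opposed to combining them in some other way, needs motivation. By the definition of conditional probability, the product of the model and the prior is the joint p.d.f. of the data and the parameters, $p\left(\vec{x}, \vec{\theta}\right) = p\left(\vec{x}\,\middle|\,\vec{\theta}\right) \pi\left(\vec{\theta}\right)$, and conditioning this joint p.d.f. on the observed data is what yields a p.d.f. for $\vec{\theta}$ alone.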

### Application of Bayes' Theorem

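As a p.d.f. is desired, it makes sense to normalize the product of the likelihood and the prior, which is proportional to the joint distribution of the data and the parameters, by dividing it by its integral over all of parameter space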

$$
\frac{L\left(\vec{\theta}\,\middle|\,\vec{x}\right) \pi\left(\vec{\theta}\right)}{\displaystyle \int_{\theta} L\left(\vec{\theta}\,\middle|\,\vec{x}\right) \pi\left(\vec{\theta}\right)\,d\vec{\theta}}
$$

Noting that the likelihood function is only defined up to a multiplicative constant, $k$, that does not depend on $\vec{\theta}$, so that it can be written as a constant multiplying the model evaluated at the observed data

$$
L\left(\vec{\theta}\,\middle|\,\vec{x}\right) = k \cdot p\left(\vec{x}\,\middle|\,\vec{\theta}\right)
$$

it is seen that the constant $k$ cancels between the numerator and the denominator, so that

$$
\begin{split}
\frac{L\left(\vec{\theta}\,\middle|\,\vec{x}\right) \pi\left(\vec{\theta}\right)}{\displaystyle \int_{\theta} L\left(\vec{\theta}\,\middle|\,\vec{x}\right) \pi\left(\vec{\theta}\right)\,d\vec{\theta}} &= \frac{p\left(\vec{x}\,\middle|\,\vec{\theta}\right) \pi\left(\vec{\theta}\right)}{\displaystyle \int_{\theta} p\left(\vec{x}\,\middle|\,\vec{\theta}\right) \pi\left(\vec{\theta}\right)\,d\vec{\theta}} \\
    &= \frac{p\left(\vec{x}\,\middle|\,\vec{\theta}\right) \pi\left(\vec{\theta}\right)}{\displaystyle p\left(\vec{x}\right)}
\end{split}
$$
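The cancellation of the constant can be checked numerically. The following is a rough sketch only, assuming a hypothetical binomial model (7 successes in 10 trials) and a hypothetical Beta(2, 2) prior, neither of which comes from this notebook. The product of the scaled model and the prior is normalized on a grid for two arbitrary values of the constant, and the results are compared.

```python
import math


def model(k_obs, n_trials, theta):
    """Hypothetical binomial model p(k_obs | theta)."""
    return math.comb(n_trials, k_obs) * theta**k_obs * (1.0 - theta) ** (n_trials - k_obs)


def prior(theta):
    """Hypothetical Beta(2, 2) prior density on [0, 1]."""
    return 6.0 * theta * (1.0 - theta)


def normalized_product(scale, k_obs=7, n_trials=10, n_grid=2_000):
    """Normalize scale * model * prior over a midpoint grid in theta."""
    width = 1.0 / n_grid
    grid = [(i + 0.5) * width for i in range(n_grid)]
    unnormalized = [scale * model(k_obs, n_trials, t) * prior(t) for t in grid]
    norm = width * sum(unnormalized)  # grid approximation of the integral
    return [value / norm for value in unnormalized]


# Two arbitrary choices of the constant k in L = k * p(x | theta)
posterior_a = normalized_product(scale=1.0)
posterior_b = normalized_product(scale=1234.5)

max_difference = max(abs(a - b) for a, b in zip(posterior_a, posterior_b))
print(f"largest difference between the two results: {max_difference:.3e}")
```

Any difference between the two results should be at the level of floating point rounding, since the constant appears in both the numerator and the normalizing integral.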

It is then seen by Bayes' Theorem, a fact of probability that follows directly from the Kolmogorov probability axioms and the definition of conditional probability,

$$
p\left(A \middle| B\right) p\left(B\right) = p\left(B \middle| A\right) p\left(A\right)
$$

that the _posterior_ is given by

$$
\boxed{p\left(\vec{\theta}\, \middle|\,\vec{x}\right) = \frac{p\left(\vec{x}\,\middle|\,\vec{\theta}\right) \pi\left(\vec{\theta}\right)}{p\left(\vec{x}\right)}}\,.
$$

As the data are already observed, and so are **fixed**, the total probability of the data, $p\left(\vec{x}\right)$, referred to as the _evidence_, is a constant for those data. It can then be said that the _posterior_ is proportional to the _likelihood_ times the _prior_

$$
\boxed{p\left(\vec{\theta}\, \middle|\,\vec{x}\right) \propto L\left(\vec{\theta}\,\middle|\,\vec{x}\right) \pi\left(\vec{\theta}\right)}\,.
$$
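As an illustrative check of this proportionality, the sketch below builds a posterior on a grid from likelihood times prior and compares it with a known closed form. It assumes a hypothetical binomial model (7 successes in 10 trials) and a hypothetical Beta(2, 2) prior, for which the posterior is known analytically to be a Beta(9, 5) distribution; all names and numbers are assumptions made for the example and are not taken from this notebook.

```python
import math

K_OBS, N_TRIALS = 7, 10  # hypothetical observed data
ALPHA, BETA = 2.0, 2.0   # hypothetical Beta prior parameters


def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm) * theta ** (a - 1.0) * (1.0 - theta) ** (b - 1.0)


def unnormalized_posterior(theta):
    """Likelihood times prior, with theta-independent factors dropped."""
    likelihood = theta**K_OBS * (1.0 - theta) ** (N_TRIALS - K_OBS)
    return likelihood * beta_pdf(theta, ALPHA, BETA)


# Normalize numerically on a midpoint grid over parameter space
n_grid = 5_000
width = 1.0 / n_grid
grid = [(i + 0.5) * width for i in range(n_grid)]
norm = width * sum(unnormalized_posterior(t) for t in grid)
grid_posterior = [unnormalized_posterior(t) / norm for t in grid]

# Closed-form posterior for this conjugate pair: Beta(ALPHA + k, BETA + n - k)
exact_posterior = [beta_pdf(t, ALPHA + K_OBS, BETA + N_TRIALS - K_OBS) for t in grid]

max_abs_error = max(abs(g - e) for g, e in zip(grid_posterior, exact_posterior))
posterior_mean = width * sum(t * p for t, p in zip(grid, grid_posterior))

print(f"largest deviation from the closed form: {max_abs_error:.3e}")
print(f"posterior mean from the grid:           {posterior_mean:.6f}")
```

The binomial coefficient is deliberately left out of the likelihood here: only the dependence on $\theta$ matters once the product is normalized, which is the content of the proportionality statement. For this conjugate pair the exact posterior mean is $9/14$.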