# Bayesian Framework

### Concepts

* (probabilistic) model
* data/sample
* likelihood
* maximum likelihood
* prior
* posterior
* maximum a posteriori (MAP)
* predictive distribution
* conjugate priors

### example: independent coin toss

* $P(\text{heads}) = p$
* then $P(HHT) = p \times p \times (1 - p)$

let 

$x_i = \begin{cases}
    1 & \text{if } i^{th} \text{ toss is heads} \\
    0 & \text{otherwise}
\end{cases}$

consider 3 cases

1. 300 heads, 200 tails
2. 3 heads, 2 tails
3. 5 heads, 0 tails

### maximum likelihood principle

pick the model (and parameters) that has the highest likelihood given the sample/data

$L(\theta) = P(\text{data} \mid \theta)$  
$\hat{\theta} = \arg\max_\theta P(\text{data} \mid \theta)$

let $\ell(\theta) = \log L(\theta)$. since $\log$ is strictly increasing, $\arg\max_\theta L(\theta) = \arg\max_{\theta} \ell(\theta)$

### back to the coin toss example

$L(p) = \binom{n}{x} p^x (1-p)^{n-x}$  
$\ell(p) = \log \binom{n}{x} + x \log p + (n-x) \log (1-p)$

where $x$ is the number of heads and $n$ is the total number of coin tosses

To maximize w.r.t. $p$, take the derivative and set to 0:

$0 = \frac{x}{p} - \frac{n-x}{1-p}$  
$\implies \hat{p} = \frac{x}{n}$
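
As a quick numerical check of $\hat{p} = x/n$ on the three cases listed earlier, here is a minimal Python sketch. The names `log_likelihood` and `mle_grid` are made up for illustration, and the grid search is only a sanity check on the closed form, not part of the derivation.

```python
import math

def log_likelihood(p, x, n):
    """Binomial log-likelihood without the constant log C(n, x)."""
    # Treat 0 * log(0) as 0 at the endpoints.
    heads = x * math.log(p) if x > 0 else 0.0
    tails = (n - x) * math.log(1 - p) if n - x > 0 else 0.0
    return heads + tails

def mle_grid(x, n, steps=1000):
    """Maximize the log-likelihood over an evenly spaced grid on [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    # Drop grid points where the log-likelihood would be -infinity.
    feasible = [p for p in grid
                if not (p == 0 and x > 0) and not (p == 1 and n - x > 0)]
    return max(feasible, key=lambda p: log_likelihood(p, x, n))

cases = [(300, 500), (3, 5), (5, 5)]  # (heads, total tosses)
for x, n in cases:
    print(f"x={x}, n={n}: closed form {x / n:.3f}, grid search {mle_grid(x, n):.3f}")
```

Note that cases 1 and 2 give the same estimate despite very different sample sizes, and case 3 gives $\hat{p} = 1$, i.e. tails is judged impossible after only 5 tosses.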

alternatively, suppose we have beliefs about $p$ before seeing any data, expressed as a prior:

* $P(p=.5) = .9$
* $P(p=.6) = .1$
* $P(p \not\in \{.5, .6\}) = 0$

Another prior:

* probability density function $f(p) \propto p^2 (1-p)^2$ when $p \in (0, 1)$ and 0 otherwise
* $f(p) = 30 p^2 (1-p)^2$

### prior

distribution over models

oftentimes we choose a type of model and a family of distributions for the parameters of the model

### back to the coin toss example

given a prior and no data, we can compute the probability of heads (with an integral in place of the sum when the prior is continuous):  
$P(H) = \sum_\theta P(H \mid \theta) P(\theta)$  

for the first prior, we have $.9 \times .5 + .1 \times .6 = .51$
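
The same computation can be sketched in Python for both priors. For the continuous prior $f(p) = 30 p^2 (1-p)^2$ the sum becomes $\int_0^1 p \, f(p) \, dp$, approximated here with a midpoint rule. The function names and the number of integration steps are assumptions chosen for illustration.

```python
def prior_predictive_discrete(prior):
    """P(H) = sum_p P(H | p) P(p), using P(H | p) = p."""
    return sum(p * weight for p, weight in prior.items())

def prior_predictive_continuous(density, steps=10000):
    """Midpoint-rule approximation of the integral of p * density(p) over (0, 1)."""
    h = 1.0 / steps
    return sum((i + 0.5) * h * density((i + 0.5) * h) for i in range(steps)) * h

first_prior = {0.5: 0.9, 0.6: 0.1}

def second_prior(p):
    return 30 * p**2 * (1 - p)**2

print(prior_predictive_discrete(first_prior))     # close to 0.51
print(prior_predictive_continuous(second_prior))  # close to 0.5, by symmetry
```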

### posterior distribution

if we have data, then we can update our prior belief

$P(\theta \mid \text{data})$

To compute, we use ...

### Bayes' rule

$P(\theta \mid x) = \frac{P(x \mid \theta) P(\theta)}{P(x)}$

note that $P(x \mid \theta) = L(\theta)$, the likelihood

to compute the denominator:

$P(x) = \sum_\theta P(x | \theta) P(\theta)$

oftentimes we can just use $P(\theta | x) \propto P(x | \theta) P(\theta)$  
i.e., posterior $\propto$ prior $\times$ likelihood

### back to the coin toss example ...

using the first prior, we can compute 
$P(p=.5 | x) = \frac{.9 \times .5^x .5^{n-x}}{.9 \times .5^x .5^{n-x} + .1 \times .6^x .4^{n-x}}$
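
A minimal sketch of this update for a prior supported on finitely many values of $p$, applied to the three cases from the start of the example. The function name `discrete_posterior` is hypothetical. The binomial coefficient is the same for every candidate $p$, so it cancels and is omitted.

```python
def discrete_posterior(prior, x, n):
    """Posterior over a finite set of candidate values of p.

    prior maps each candidate p to its prior probability.
    """
    # prior times likelihood, up to the shared binomial coefficient
    unnormalized = {p: w * p**x * (1 - p)**(n - x) for p, w in prior.items()}
    total = sum(unnormalized.values())  # plays the role of P(x)
    return {p: u / total for p, u in unnormalized.items()}

first_prior = {0.5: 0.9, 0.6: 0.1}
for x, n in [(300, 500), (3, 5), (5, 5)]:
    posterior = discrete_posterior(first_prior, x, n)
    print(f"x={x}, n={n}: P(p=.5 | x) = {posterior[0.5]:.4f}")
```

With 3 heads out of 5 the posterior stays close to the prior, while 300 heads out of 500 moves almost all of the mass to $p = .6$. For much larger $n$ the raw powers can underflow, so a log-space version would be safer.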

for the second prior ...  
$f(p | x) = \frac{f(p) p^x (1-p)^{n-x}}{\int_0^1 f(p) p^x (1-p)^{n-x} dp}$  

we can avoid doing the integral in the denominator by keeping only the factors that depend on $p$:  
$f(p | x) \propto p^{x+2} (1-p)^{n-x+2}$  
and then noting that a density must integrate to 1, which fixes the normalizing constant

### maximum a posteriori principle

choose the model that maximizes the posterior $P(\theta | x)$

note that this often doesn't require normalizing the posterior distribution (just need to compute the argmax)

### back to the coin toss example ...

using the prior $f(p) \propto p^2 (1-p)^2$, we have  
$f(p|x) \propto p^{x+2} (1-p)^{n-x+2}$  
and taking the derivative of the log w.r.t. $p$ and setting to 0, we get  
$0 = \frac{x+2}{p} - \frac{n-x+2}{1-p}$  
$\implies \hat{p} = \frac{x+2}{n+4}$
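
To see how the prior changes the estimate, the sketch below compares the MLE $x/n$ with the maximizer of $p^{x+2} (1-p)^{n-x+2}$ on the three cases. Setting the derivative of the log posterior to 0 gives $(x+2)/(n+4)$, and the grid search is an independent numerical check. The function names are made up for illustration.

```python
import math

def map_closed_form(x, n):
    """Maximizer of p^(x+2) * (1-p)^(n-x+2) on (0, 1)."""
    return (x + 2) / (n + 4)

def map_grid(x, n, steps=10000):
    """Grid search on the unnormalized log posterior over the open interval (0, 1)."""
    def log_post(p):
        return (x + 2) * math.log(p) + (n - x + 2) * math.log(1 - p)
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=log_post)

for x, n in [(300, 500), (3, 5), (5, 5)]:
    print(f"x={x}, n={n}: MLE {x / n:.4f}, "
          f"MAP {map_closed_form(x, n):.4f}, grid {map_grid(x, n):.4f}")
```

With 500 tosses the two estimates nearly coincide. With 5 heads out of 5, the MLE is 1 while the MAP estimate is pulled back toward .5.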

### conjugate prior

a prior distribution such that the posterior distribution is of the same family

### back to the coin toss example ...

we started with

* $x \mid p \sim Binomial(n, p)$
* $p \sim Beta(3, 3)$, i.e. $f(p) \propto p^2 (1-p)^2$

then we get

* $p \mid x \sim Beta(x + 3, n - x + 3)$, i.e. $f(p \mid x) \propto p^{x+2} (1-p)^{n-x+2}$

#### beta distribution

$\theta \sim Beta(a, b)$ iff 
$f(\theta) = \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \theta^{a-1} (1-\theta)^{b-1}$ for $\theta \in (0, 1)$ and 0 otherwise

$\theta \sim Beta(a, b) \implies$

* $E[\theta] = \frac{a}{a+b}$
* $\arg\max_\theta f(\theta) = \frac{a-1}{a+b-2}$ (for $a, b > 1$)
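
The conjugate update reduces to adding counts to the Beta parameters, which the following sketch makes explicit. The helper names are hypothetical. The example uses $a = b = 3$ because $f(p) \propto p^2 (1-p)^2 = p^{3-1} (1-p)^{3-1}$.

```python
def beta_binomial_update(a, b, x, n):
    """Posterior Beta parameters after x heads in n tosses, from a Beta(a, b) prior."""
    return a + x, b + (n - x)

def beta_mean(a, b):
    return a / (a + b)

def beta_mode(a, b):
    """Mode of Beta(a, b); the formula only applies when a > 1 and b > 1."""
    if a <= 1 or b <= 1:
        raise ValueError("mode formula requires a > 1 and b > 1")
    return (a - 1) / (a + b - 2)

a0, b0 = 3, 3  # prior matching f(p) proportional to p^2 (1-p)^2
for x, n in [(300, 500), (3, 5), (5, 5)]:
    a, b = beta_binomial_update(a0, b0, x, n)
    print(f"x={x}, n={n}: Beta({a}, {b}), "
          f"mean {beta_mean(a, b):.4f}, mode {beta_mode(a, b):.4f}")
```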

### predictive distribution

the distribution of a new observation given the data already observed

computed by integrating the parameter out against the posterior, which gives an "aggregate prediction" over all models weighted by how plausible each one is

$f(\tilde{x} | x) = \int_\Theta f(\tilde{x} | \theta, x) f(\theta | x) d\theta$

where $x$ is the sample and $\tilde{x}$ is a new observation

### back to the coin toss example ...

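given our data of $x$ heads out of $n$ tosses and the prior $f(p) \propto p^2 (1-p)^2$, i.e. $Beta(3, 3)$, the posterior is $Beta(x+3, n-x+3)$ and the probability that the next toss is heads is $P(H \mid x) = \int_0^1 p \, f(p \mid x) \, dp = E[p \mid x] = \frac{x+3}{n+6}$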
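
The predictive integral can be checked numerically. The sketch below assumes a $Beta(a, b)$ posterior, approximates $\int_0^1 p \, f(p \mid x) \, dp$ with a midpoint rule, and compares it with the Beta mean $a / (a + b)$. The function names and step count are assumptions for illustration.

```python
import math

def beta_pdf(p, a, b):
    """Beta(a, b) density at p in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(p) + (b - 1) * math.log(1 - p))

def predictive_heads_numeric(a, b, steps=20000):
    """Midpoint-rule approximation of the integral of p * Beta(p; a, b) over (0, 1)."""
    h = 1.0 / steps
    return sum((i + 0.5) * h * beta_pdf((i + 0.5) * h, a, b) for i in range(steps)) * h

def predictive_heads_closed_form(a, b):
    """Posterior mean of p, which equals the probability of heads on the next toss."""
    return a / (a + b)

for x, n in [(300, 500), (3, 5), (5, 5)]:
    a, b = x + 3, n - x + 3  # posterior from a Beta(3, 3) prior
    print(f"x={x}, n={n}: numeric {predictive_heads_numeric(a, b):.5f}, "
          f"closed form {predictive_heads_closed_form(a, b):.5f}")
```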

### example: normal distribution

given

* model $X_i \stackrel{iid}{\sim} \mathcal{N}(\mu, \beta)$ where $\beta$ is the precision ($\beta = \sigma^{-2}$)
* data $X_1, ..., X_n$

then

* $L(\mu, \beta) = \prod_i f(x_i | \mu, \beta) = (\frac{\beta}{2 \pi})^{n/2} \exp(-\frac{\beta}{2} \sum_i (x_i - \mu)^2)$
* $\ell(\mu, \beta) = -\frac{n}{2} \log 2 \pi + \frac{n}{2} \log \beta - \frac{\beta}{2} \sum_i (x_i - \mu)^2$

the MLE is found by taking partial derivatives w.r.t. $\mu$ and $\beta$, setting them to $0$, and solving

* $\hat{\mu} = \frac{\sum_i X_i}{n}$
* $\hat{\beta} = \frac{n}{\sum_i (X_i - \hat{\mu})^2}$
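
A direct translation of the two estimators into Python, as a minimal sketch. The function name `normal_mle` and the sample values are made up for illustration.

```python
def normal_mle(xs):
    """MLE of the mean and precision of a normal distribution from a sample xs."""
    n = len(xs)
    mu_hat = sum(xs) / n
    sum_sq = sum((x - mu_hat) ** 2 for x in xs)
    beta_hat = n / sum_sq  # precision estimate; undefined if all xs are equal
    return mu_hat, beta_hat

sample = [1.0, 2.0, 3.0, 4.0, 5.0]  # illustrative data, not from the notes
mu_hat, beta_hat = normal_mle(sample)
print(mu_hat, beta_hat)  # 3.0 0.5
```

Here the squared deviations sum to 10, so $\hat{\beta} = 5/10$, which corresponds to a variance estimate of $1/\hat{\beta} = 2$.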

note that estimators $\hat{\mu}$ and $\hat{\beta}$ are *random variables*

* $\hat{\mu}$ and $\hat{\beta}$ are functions of random variables $X_1, ..., X_n$ and so must also be random variables

properties of our MLEs:

* $E[\hat{\mu}] = E[\frac{\sum_i X_i}{n}] = \mu$
$\implies \hat{\mu}$ is unbiased
* $Var(\hat{\mu}) = Var(\frac{\sum_i X_i}{n}) = n^{-2} \sum_i Var(X_i)
= \frac{1}{n \beta}$

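* $E[\hat{\beta}^{-1}] = E[\frac{1}{n} \sum_i (X_i - \hat{\mu})^2] = \frac{n-1}{n} \frac{1}{\beta}$
$\implies \hat{\beta}^{-1}$ is a biased estimator of the variance $\beta^{-1}$; $\hat{\beta}$ itself is also biased, with $E[\hat{\beta}] = \frac{n}{n-3} \beta$ for $n > 3$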
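
These properties can be illustrated by simulation. The sketch below draws many samples of size $n$ and averages $\hat{\mu}$ and the variance estimate $1/\hat{\beta} = \frac{1}{n} \sum_i (X_i - \hat{\mu})^2$, whose expectation is $\frac{n-1}{n} \frac{1}{\beta}$. The true parameters, sample size, number of trials, and seed are arbitrary choices for illustration.

```python
import random

def simulate_mle(n=5, mu=0.0, beta=4.0, trials=20000, seed=0):
    """Average mu_hat and 1 / beta_hat over repeated samples of size n."""
    rng = random.Random(seed)
    sigma = beta ** -0.5  # standard deviation implied by the precision
    mu_total, var_total = 0.0, 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        mu_hat = sum(xs) / n
        var_hat = sum((x - mu_hat) ** 2 for x in xs) / n  # equals 1 / beta_hat
        mu_total += mu_hat
        var_total += var_hat
    return mu_total / trials, var_total / trials

mean_mu_hat, mean_var_hat = simulate_mle()
print(mean_mu_hat)   # should be near mu = 0
print(mean_var_hat)  # should be near (n-1)/n * 1/beta = 0.2, not 1/beta = 0.25
```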

### example: normal distribution with known variance/precision

conjugate prior for $\mu$: normal

* $\mu \sim \mathcal{N}(m, b)$ (again, $b$ is precision)

then we get the posterior  
$f(\mu | x) \propto f(x | \mu) f(\mu)$  
$\propto e^{-\frac{b}{2} \mu^2} e^{b m \mu} e^{-\frac{n \beta}{2} \mu^2} e^{\beta \sum_i x_i \mu}$  
and we can compute the parameters for $\mu \mid x$ by completing the square in $\mu$

$\mu \mid x \sim \mathcal{N}(\frac{mb + \beta \sum x_i}{b + n \beta}, b + n \beta)$
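
The update can be written as a short function: the posterior precision is the prior precision plus $n$ times the data precision, and the posterior mean is a precision-weighted average of the prior mean and the observations. This is a minimal sketch; the function name and the example numbers are made up for illustration.

```python
def normal_posterior(m, b, beta, xs):
    """Posterior mean and precision of mu, given prior N(m, b) and known precision beta.

    Both b and beta are precisions (inverse variances).
    """
    n = len(xs)
    post_precision = b + n * beta
    post_mean = (m * b + beta * sum(xs)) / post_precision
    return post_mean, post_precision

# prior N(0, 1), known data precision 1, three illustrative observations
post_mean, post_precision = normal_posterior(m=0.0, b=1.0, beta=1.0, xs=[1.0, 2.0, 3.0])
print(post_mean, post_precision)  # 1.5 4.0
```

The sample mean is 2 and the prior mean is 0, so the posterior mean of 1.5 sits between them, closer to the data because the three observations carry three times the precision of the prior.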