# Generative Models

## Generalizing Bayes' Rule

Recall our second problem from [the introduction](./01_bayes_rule_intro.ipynb#Motivating-Examples---Limitations-of-Machine-Learning-and-(frequentist)-Statistics): we observe some heart rate measurements from one person's workout. We want to know which heart rate interval contains the true mean for this person with 95% _probability_, i.e. the 95% **credible interval** (this is _not_ the same as a 95% confidence interval, see [this thorough explanation](http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/)). We'd also like to know how prior knowledge (e.g. some known population mean and variance) influences our inference of this interval.

For this, we need to generalize Bayes' rule a bit. Let $y = y_1,\ldots,y_n$ be the data we observe, and $\theta$ a parametric **model** representing some **process that generates the data $y$**. We are now interested in $p(\theta\mid y)$, i.e. the probability (density!) of a model, given the observed data. Bayes' rule can now be written as:

<div style="font-size: 3em">
$$
p(\theta\mid y) = \frac{p(y\mid\theta)p(\theta)}{p(y)}
$$
</div>

The components of this equation should be interpreted as follows:

- a **likelihood** $p(y\mid\theta)$, i.e. the likelihood of observing $y$, given a particular model $\theta$,
- a **prior distribution** $p(\theta)$, i.e. the distribution of possible parameter values of the model $\theta$,
- a **marginal likelihood** $p(y)$, i.e. the joint distribution $p(y, \theta)$, with $\theta$ integrated out: $\int_\theta p(y\mid\theta)p(\theta)\,\mathrm{d}\theta$. It does not depend on $\theta$, so it only acts as a normalizing constant and is often left uncomputed,
- a **posterior distribution** $p(\theta\mid y)$, i.e. the distribution of $\theta$, given its prior and after having observed the data $y$.

To reiterate: let it sink in that the prior and posterior distributions are (continuous) **distributions of model parameters**.

Or, in other words, a model (made up of one or more model parameters) can be seen as just a random variable. It is not fixed at a given value or state.
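To make the idea of a model as a random variable concrete, here is a minimal sketch of a two-step generative process: first draw a parameter from its prior, then draw data given that parameter. It assumes a Normal prior on a mean $\mu$ and Normal observation noise; the numbers (130, 80, 100) anticipate the heart rate example in the next section, and the variable names and the seed are illustrative choices, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

m_0, s2_0 = 130.0, 80.0   # assumed prior mean and variance of mu
sigma2_0 = 100.0          # assumed (fixed) observation variance

n_draws = 100_000
# Step 1: the parameter itself is random, mu ~ N(m_0, s_0^2).
mu_draws = rng.normal(loc=m_0, scale=np.sqrt(s2_0), size=n_draws)
# Step 2: given each mu, generate one observation, y ~ N(mu, sigma_0^2).
y_draws = rng.normal(loc=mu_draws, scale=np.sqrt(sigma2_0))

# Marginally y ~ N(m_0, s_0^2 + sigma_0^2), so expect values near 130 and 180.
print(y_draws.mean(), y_draws.var())
```

Every simulated $y$ comes from a different $\mu$, which is why the spread of the simulated data is wider than the observation noise alone.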

## Example, observing a single value with a simple model

Assume that the _process of generating values_ $y_i$ is defined as sampling from a Normal distribution with unknown mean $\mu$ and variance $\sigma^2$: $y_i \sim \mathcal{N}(\mu, \sigma^2)$, so that $\theta = \{\mu, \sigma^2\}$. In this case, Bayes' rule looks like:

<div style="font-size: 2em">
$$
p(\mu, \sigma^2\mid y_1,\ldots,y_n) = \frac{p(y_1,\ldots,y_n\mid\mu, \sigma^2)p(\mu,\sigma^2)}{p(y_1,\ldots,y_n)}
$$
</div>

This seems a bit intimidating, so let's simplify the example by assuming just one observation, $y=170$ bpm. The model $\theta$ can be simplified by considering the variance $\sigma^2$ to be fixed, e.g. at 100 bpm$^2$. Also assume we previously observed a large sample of heart rates from a diverse population, with a mean of 130 bpm and a variance of 80 bpm$^2$. A full specification of the model now looks as follows:

<div style="font-size: 2em">
$$
\begin{align}
\mu &\sim\color{red}{\mathcal{N}(m_0, s_0^2)}\,\mathrm{(prior)}\\
y &\sim\color{blue}{\mathcal{N}(\mu, \sigma^2_0)}\,\mathrm{(likelihood)}\\
m_0 &=130\\
s_0^2 &=80\\
\sigma^2_0 &=100
\end{align}
$$
</div>

where $\mu$ is a random variable, $y$ is a random variable for which we have observations (n=1), and $m_0$, $s_0^2$ and $\sigma_0^2$ are **hyperparameters** (these are fixed before doing inference). This makes it possible to compute the posterior distribution:

<div style="font-size: 2em">
$$
\begin{align}
p(\mu\mid y) &= \frac{\color{blue}{p(y\mid\mu)}\color{red}{p(\mu)}}{p(y)}\\
&= \frac{\color{blue}{p(y\mid\mu)}\color{red}{p(\mu)}}{\int_{-\infty}^{\infty}p(y\mid\mu)p(\mu)\,\mathrm{d}\mu}\\
&\propto \color{blue}{p(y\mid\mu)}\color{red}{p(\mu)}\,\,(1)\\
&=\color{blue}{\frac{1}{\sqrt{2\pi\color{black}{\sigma_0^2}}}\exp\left\{-\frac{1}{2\color{black}{\sigma_0^2}}{(\color{black}{y} - \mu)}^2\right\}}\color{red}{\frac{1}{\sqrt{2\pi \color{black}{s_0^2}}}\exp\left\{-\frac{1}{2 \color{black}{s_0^2}}{(\mu - \color{black}{m_0})}^2\right\}}\,\,(2)\\
&\propto \ldots\,\,(3)\\
&\propto \exp\left\{-\frac{1}{2 \color{orange}{{\left(\frac{1}{s_0^2} + \frac{1}{\sigma_0^2}\right)}^{-1}}}{\left(\mu - \color{green}{{\left(\frac{1}{s_0^2} + \frac{1}{\sigma_0^2}\right)}^{-1}\left(\frac{m_0}{s_0^2} + \frac{y}{\sigma_0^2}\right)}\right)}^2\right\}\\
&= \exp\left\{-\frac{1}{2 \color{orange}{s_1^2}}{(\mu - \color{green}{m_1})}^2\right\}\,\,(4)
\end{align}
$$
</div>

The result at (4) is, apart from a constant factor, the PDF of a normal distribution with mean $m_1$ and variance $s_1^2$. The denominator that was dropped at (1) does not depend on $\mu$; it is just the constant that turns the numerator into a proper PDF. Since we recognize the normal form at (4) and know that the normal PDF integrates to 1, we don't need to compute that integral.

Be aware that expression (2) looks very complicated, but actually has only one variable, $\mu$: the values of $y$, $\sigma_0^2$, $m_0$ and $s_0^2$ are all known constants.
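Because the numerator depends on $\mu$ alone, it can also be evaluated on a grid of $\mu$ values and normalized numerically. The sketch below does that as a sanity check on the closed-form result; the grid bounds and spacing are arbitrary choices made for illustration, and a plain Riemann sum stands in for the integral in the denominator.

```python
import numpy as np
from scipy.stats import norm

m_0, s2_0 = 130.0, 80.0   # prior mean and variance
sigma2_0 = 100.0          # fixed likelihood variance
y = 170.0                 # the single observation

# Grid over mu (bounds and resolution are illustrative assumptions).
mu_grid = np.linspace(80.0, 220.0, 2801)
d_mu = mu_grid[1] - mu_grid[0]

# Numerator of Bayes' rule: likelihood times prior, as a function of mu only.
unnormalized = (
    norm.pdf(y, loc=mu_grid, scale=np.sqrt(sigma2_0))
    * norm.pdf(mu_grid, loc=m_0, scale=np.sqrt(s2_0))
)

# Denominator p(y): the numerator integrated over mu (Riemann sum).
marginal = np.sum(unnormalized) * d_mu
posterior_grid = unnormalized / marginal

# Mean and variance of the numerically normalized posterior.
grid_mean = np.sum(mu_grid * posterior_grid) * d_mu
grid_var = np.sum((mu_grid - grid_mean) ** 2 * posterior_grid) * d_mu
print(grid_mean, grid_var)
```

The printed values should agree closely with the closed-form $m_1 = 1330/9 \approx 147.78$ and $s_1^2 = 400/9 \approx 44.44$ for these hyperparameters.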

For those interested in the rewriting magic that leads to the result, look [here](http://www.ams.sunysb.edu/~zhu/ams570/Bayesian_Normal.pdf).
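In short, the rewriting is a matter of completing the square in $\mu$. A sketch of the step hidden behind (3), working only with the exponent of (2) and collecting everything that does not depend on $\mu$ into constants $c$ and $c'$ (symbols introduced here for illustration):

<div style="font-size: 2em">
$$
\begin{align}
-\frac{{(y - \mu)}^2}{2\sigma_0^2} - \frac{{(\mu - m_0)}^2}{2 s_0^2}
&= -\frac{1}{2}\left[\left(\frac{1}{s_0^2} + \frac{1}{\sigma_0^2}\right)\mu^2 - 2\left(\frac{m_0}{s_0^2} + \frac{y}{\sigma_0^2}\right)\mu\right] + c\\
&= -\frac{1}{2 s_1^2}\left[\mu^2 - 2 m_1 \mu\right] + c\\
&= -\frac{1}{2 s_1^2}{(\mu - m_1)}^2 + c'
\end{align}
$$
</div>

with $\frac{1}{s_1^2} = \frac{1}{s_0^2} + \frac{1}{\sigma_0^2}$, $m_1 = s_1^2\left(\frac{m_0}{s_0^2} + \frac{y}{\sigma_0^2}\right)$, $c = -\frac{y^2}{2\sigma_0^2} - \frac{m_0^2}{2 s_0^2}$ and $c' = c + \frac{m_1^2}{2 s_1^2}$. Exponentiating turns $c'$ into a constant factor, which is absorbed by the proportionality sign.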

Given our likelihood and priors as defined above, the posterior distribution becomes:

<div style="font-size: 2em">
$$
\mu\mid y \sim \mathcal{N}(m_1, s_1^2),\\
s_1^2 = {\left(\frac{1}{s_0^2} + \frac{1}{\sigma_0^2}\right)}^{-1}\\
m_1 = s_1^2\left(\frac{m_0}{s_0^2} + \frac{y}{\sigma_0^2}\right)
$$
</div>
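Plugging in the numbers from the model specification makes these formulas tangible. The sketch below evaluates $s_1^2$ and $m_1$ for $y = 170$, $m_0 = 130$, $s_0^2 = 80$ and $\sigma_0^2 = 100$, and then reads a 95% credible interval for $\mu$ off the posterior quantiles; using the central (equal-tailed) interval is an assumption, since other 95% intervals exist.

```python
import numpy as np
from scipy.stats import norm

m_0, s2_0 = 130.0, 80.0   # prior mean and variance
sigma2_0 = 100.0          # fixed likelihood variance
y = 170.0                 # the single observation

# Posterior variance: inverse of the summed precisions, 400/9 (about 44.44).
s2_1 = 1.0 / (1.0 / s2_0 + 1.0 / sigma2_0)
# Posterior mean: precision-weighted average of m_0 and y, 1330/9 (about 147.78).
m_1 = s2_1 * (m_0 / s2_0 + y / sigma2_0)

# scipy parameterizes the normal by its standard deviation, hence the sqrt.
posterior = norm(loc=m_1, scale=np.sqrt(s2_1))
# Central 95% credible interval: the 2.5% and 97.5% posterior quantiles.
lower, upper = posterior.ppf([0.025, 0.975])

print(f"m_1 = {m_1:.2f}, s_1^2 = {s2_1:.2f}")
print(f"95% credible interval for mu: [{lower:.2f}, {upper:.2f}]")
```

The posterior mean lands between the prior mean and the observation, and the posterior variance is smaller than both the prior variance and the likelihood variance.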

In [None]:
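import numpy as np
from scipy.stats import norm

def to_posterior(y, likelihood_variance, prior):
    """Return the posterior of mu (a frozen normal) after observing a single value y."""
    # Precisions (inverse variances) add up: 1/s_1^2 = 1/s_0^2 + 1/sigma_0^2.
    posterior_variance = 1 / ((1 / prior.var()) + (1 / likelihood_variance))
    # The posterior mean is the precision-weighted average of the prior mean and y.
    posterior_mean = posterior_variance * ((prior.mean() / prior.var()) + (y / likelihood_variance))
    # scipy's `scale` is a standard deviation, not a variance.
    posterior = norm(loc=posterior_mean, scale=np.sqrt(posterior_variance))
    # Label used in the plot legend: the posterior is named after its prior.
    setattr(posterior, 'name', 'posterior_{:g}_{:g}'.format(prior.mean(), prior.var()))
    return posterior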

In [None]:
import plotly.graph_objs as go
import numpy as np

def _dist_range(distributions, n=100):
    """Return n evenly spaced x values covering the central 99.8% of every distribution."""
    ranges = [dist.ppf([0.001, 0.999]) for dist in distributions]
    return np.linspace(
        min(r[0] for r in ranges),
        max(r[1] for r in ranges),
        n
    )

def plot_density(distributions):
    """Plot the PDFs of frozen distributions (each needs a `name` attribute) on one shared x axis."""
    x = _dist_range(distributions)
    return go.FigureWidget(
        data=[
            go.Scatter(x=x, y=dist.pdf(x), mode='lines', line={'shape': 'spline', 'width': 4}, fill='tozeroy', name=dist.name)
            for dist in distributions
        ]
    )

In [None]:
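# Prior from the model specification: m_0 = 130, s_0^2 = 80 (scipy's `scale` is the standard deviation).
prior = norm(loc=130, scale=np.sqrt(80))
setattr(prior, 'name', 'prior_130_80')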

plot_density([prior, to_posterior(170, 100, prior)])
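To see how prior knowledge influences the inference, it helps to vary the prior variance while keeping everything else fixed. The following is a minimal sketch using the closed-form posterior; `posterior_params` is a hypothetical helper written for this illustration, and the three prior variances are arbitrary choices spanning a very confident to a very vague prior.

```python
def posterior_params(y, likelihood_variance, prior_mean, prior_variance):
    """Hypothetical helper: closed-form posterior mean and variance of mu."""
    posterior_variance = 1.0 / (1.0 / prior_variance + 1.0 / likelihood_variance)
    posterior_mean = posterior_variance * (
        prior_mean / prior_variance + y / likelihood_variance
    )
    return posterior_mean, posterior_variance

y, likelihood_variance, prior_mean = 170.0, 100.0, 130.0

results = {}
# Illustrative prior variances: confident, as in the example, and vague.
for prior_variance in (1.0, 80.0, 10_000.0):
    results[prior_variance] = posterior_params(
        y, likelihood_variance, prior_mean, prior_variance
    )
    m_1, s2_1 = results[prior_variance]
    print(f"prior variance {prior_variance:>8.0f}: "
          f"posterior mean {m_1:6.2f}, posterior variance {s2_1:6.2f}")
```

A confident prior (variance 1) keeps the posterior mean close to 130, while a vague prior (variance 10,000) lets the single observation pull it almost all the way to 170.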