# Generative Models

## Generalizing Bayes' Rule

Recall our second problem from [the introduction](./01_bayes_rule_intro.ipynb#Motivating-Examples---Limitations-of-Machine-Learning-and-(frequentist)-Statistics): we observe some heart rate measurements from a workout of a given person. We want to know which heart rate interval contains the true mean for this person with 95% _probability_ a.k.a. the **credible interval** (this is _not_ the same as a 95% confidence interval, see [this thorough explanation](http://jakevdp.github.io/blog/2014/06/12/frequentism-and-bayesianism-3-confidence-credibility/)). We'd also like to know how prior knowledge (e.g. some known population mean and variance) can influence our inference of this interval.

For this, we need to generalize Bayes' rule a bit. Let $y = y_1,\ldots,y_n$ be the data we observe, and $\theta$ a parametric **model** representing some **process that generates the data $y$**. We are now interested in $p(\theta\mid y)$, i.e. the probability (density!) of a model, given the observed data. Bayes' rule can now be written as:

<div style="font-size: 3em">
$$
p(\theta\mid y) = \frac{p(y\mid\theta)p(\theta)}{p(y)}
$$
</div>

The components of this equation should be interpreted as follows:

- a **likelihood** $p(y\mid\theta)$, i.e. the likelihood of observing $y$, given a particular model $\theta$,
- a **prior distribution** $p(\theta)$, i.e. the distribution of possible parameter values of the model $\theta$,
- a **marginal likelihood** $p(y)$, i.e. the joint distribution $p(y, \theta)$, with $\theta$ integrated out: $\int_\theta p(y\mid\theta)p(\theta)\,\mathrm{d}\theta$. Because it does not depend on $\theta$, it only acts as a normalizing constant and is usually not computed explicitly,
- a **posterior distribution** $p(\theta\mid y)$, i.e. the distribution of $\theta$, given its prior and after having observed the data $y$.

To reiterate: the prior and posterior distributions are (continuous) **distributions of model parameters**.

Or, in other words, a model (made up of one or more model parameters) can be seen as just a random variable. It is not fixed at a given value or state.

## Example, observing a single value with a simple model

Assume that the _process of generating values_ $y_i$ is defined as sampling from a Normal distribution with unknown mean $\mu$ and variance $\sigma^2$: $y_i \sim \mathcal{N}(\mu, \sigma^2)$, so that $\theta = \{\mu, \sigma^2\}$. In this case, Bayes' rule looks like:

<div style="font-size: 2em">
$$
p(\mu, \sigma^2\mid y_1,\ldots,y_n) = \frac{p(y_1,\ldots,y_n\mid\mu, \sigma^2)p(\mu,\sigma^2)}{p(y_1,\ldots,y_n)}
$$
</div>

This seems a bit intimidating, so let's simplify the example by assuming just one observation, $y=170$ bpm. The model $\theta$ can be simplified by considering the variance $\sigma^2$ to be fixed, e.g. at 100 bpm$^2$. Also assume we previously observed a large sample of heart rates with a mean of 130 bpm and a variance of 80 bpm$^2$ from a diverse population. A full specification of the model now looks as follows:

<div style="font-size: 2em">
$$
\begin{align}
\mu &\sim\color{red}{\mathcal{N}(m_0, s_0^2)}\,\mathrm{(prior)}\\
y &\sim\color{blue}{\mathcal{N}(\mu, \sigma^2_0)}\,\mathrm{(likelihood)}\\
m_0 &=130\\
s_0^2 &=80\\
\sigma^2_0 &=100
\end{align}
$$
</div>

where $\mu$ is a random variable, $y$ is a random variable for which we have observations (n=1), and $m_0$, $s_0^2$ and $\sigma_0^2$ are **hyperparameters** (these are fixed before doing inference). This makes it possible to compute the posterior distribution (note that $\propto$ means "proportional to"):

<div style="font-size: 2em">
$$
\begin{align}
p(\mu\mid y) &= \frac{\color{blue}{p(y\mid\mu)}\color{red}{p(\mu)}}{p(y)}\\
&= \frac{\color{blue}{p(y\mid\mu)}\color{red}{p(\mu)}}{\int_{-\infty}^{\infty}p(y\mid\mu)p(\mu)\,\mathrm{d}\mu}\\
&\propto \color{blue}{p(y\mid\mu)}\color{red}{p(\mu)}\,\,(1)\\
&=\color{blue}{\frac{1}{\sqrt{2\pi\color{black}{\sigma_0^2}}}exp\left\{-\frac{1}{2\color{black}{\sigma_0^2}}{(\color{black}{y} - \mu)}^2\right\}}\color{red}{\frac{1}{\sqrt{2\pi \color{black}{s_0^2}}}exp\left\{-\frac{1}{2 \color{black}{s_0^2}}{(\mu - \color{black}{m_0})}^2\right\}}\,\,(2)\\
&\propto \ldots\,\,(3)\\
&\propto exp\left\{-\frac{1}{2 \color{orange}{{\left(\frac{1}{s_0^2} + \frac{1}{\sigma_0^2}\right)}^{-1}}}{\left(\mu - \color{green}{{\left(\frac{1}{s_0^2} + \frac{1}{\sigma_0^2}\right)}^{-1}\left(\frac{m_0}{s_0^2} + \frac{y}{\sigma_0^2}\right)}\right)}^2\right\}\\
&= exp\left\{-\frac{1}{2 \color{orange}{s_1^2}}{(\mu - \color{green}{m_1})}^2\right\}\,\,(4)
\end{align}
$$
</div>

The result at (4) is, apart from a constant factor, equal to the PDF of a normal distribution with mean $m_1$ and variance $s_1^2$. The denominator $p(y)$ does not depend on $\mu$: it is just the constant that turns the numerator into a proper PDF. Since we know that a normal PDF integrates to 1, the missing constant must be $1/\sqrt{2\pi s_1^2}$, and we never need to compute the integral in the denominator.

Note that expression (2) looks complicated, but it has only one variable, $\mu$. Everything else is fixed.

For those interested in the rewriting magic hidden in (3) that leads to the result, look [here](http://www.ams.sunysb.edu/~zhu/ams570/Bayesian_Normal.pdf).
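
As a rough sanity check of the derivation, the sketch below evaluates the numerator (likelihood times prior) on a grid of $\mu$ values, approximates the denominator with a simple Riemann sum, and compares the normalized result with the closed-form normal PDF. The helper `normal_pdf`, the grid bounds and the grid resolution are assumptions made for this illustration only.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hyperparameters and observation from the example above
m0, s0_sq, sigma0_sq, y = 130.0, 80.0, 100.0, 170.0

# Evaluate likelihood * prior (the numerator) on a dense grid of mu values
mu = np.linspace(0.0, 300.0, 30001)
d_mu = mu[1] - mu[0]
numerator = normal_pdf(y, mu, sigma0_sq) * normal_pdf(mu, m0, s0_sq)

# Approximate the denominator p(y) with a Riemann sum over the grid
evidence = numerator.sum() * d_mu
posterior_grid = numerator / evidence

# Closed-form posterior parameters from the derivation
s1_sq = 1 / (1 / s0_sq + 1 / sigma0_sq)
m1 = s1_sq * (m0 / s0_sq + y / sigma0_sq)
posterior_exact = normal_pdf(mu, m1, s1_sq)

max_abs_diff = np.max(np.abs(posterior_grid - posterior_exact))
print(f"p(y) ~ {evidence:.6f}, max |grid - exact| = {max_abs_diff:.2e}")
```

If the derivation holds, the two curves should agree up to numerical error, which illustrates that the integral only supplies the normalizing constant.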

Given our likelihood and priors as defined above, the posterior distribution becomes:

<div style="font-size: 2em">
$$
\mu\mid y \sim \mathcal{N}(m_1, s_1^2),\\
s_1^2 = {\left(\frac{1}{s_0^2} + \frac{1}{\sigma_0^2}\right)}^{-1}\\
m_1 = s_1^2\left(\frac{m_0}{s_0^2} + \frac{y}{\sigma_0^2}\right)
$$
</div>
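
As a worked example, plugging in the hyperparameters from the model specification ($m_0 = 130$, $s_0^2 = 80$, $\sigma_0^2 = 100$) and the single observation $y = 170$ gives approximately:

<div style="font-size: 2em">
$$
\begin{align}
s_1^2 &= {\left(\frac{1}{80} + \frac{1}{100}\right)}^{-1} = \frac{1}{0.0225} \approx 44.4\\
m_1 &= s_1^2\left(\frac{130}{80} + \frac{170}{100}\right) = \frac{3.325}{0.0225} \approx 147.8
\end{align}
$$
</div>

The posterior mean lies between the prior mean and the observation, slightly closer to the prior because the prior variance (80) is smaller than the likelihood variance (100).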

This can be easily implemented using [scipy's norm object](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html):

In [None]:
import numpy as np
from scipy.stats import norm  # norm is the scipy object representing a normal distribution

def to_posterior(y, likelihood_variance, prior):
    posterior_variance = 1 / ((1/prior.var()) + (1/likelihood_variance))  # s_1^2 (see above)
    posterior_mean = posterior_variance * ((prior.mean() / prior.var()) + (y / likelihood_variance))  # m_1 (see above)
    posterior = norm(loc=posterior_mean, scale=np.sqrt(posterior_variance))
    setattr(posterior, 'name', f'posterior_{prior.name}')
    return posterior

In [None]:
import plotly.graph_objs as go
import numpy as np

# Helper function for creating an x range for plotting multiple density functions
def _dist_range(distributions, n=1000):
    ranges = [dist.ppf([0.001, 0.999]) for dist in distributions]
    return np.linspace(
        min((r[0] for r in ranges)),
        max((r[1] for r in ranges)),
        n
    )

def plot_densities(distributions, title=''):
    x = _dist_range(distributions)
    return go.FigureWidget(
        data=[
            go.Scatter(x=x, y=dist.pdf(x), mode='lines', line={'shape': 'spline', 'width': 4}, fill='tozeroy', name=dist.name)
            for dist in distributions
        ],
        layout=go.Layout(
            title=title
        )
    )

What's the influence of the choice of prior on the posterior distribution?

In [None]:
prior_tight = norm(loc=130, scale=1)
setattr(prior_tight, 'name', 'prior_tight')

prior_med = norm(loc=130, scale=10)
setattr(prior_med, 'name', 'prior_med')

prior_flat = norm(loc=130, scale=30)
setattr(prior_flat, 'name', 'prior_flat')

plot_densities([prior_tight, prior_med, prior_flat], title='Prior Densities')

How do these priors reflect our (un-)certainty about $\mu$?

In [None]:
plot_densities([to_posterior(170, 100, prior_tight), to_posterior(170, 100, prior_med), to_posterior(170, 100, prior_flat)], title='Posterior Densities')
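
One way to read the posterior plot is to rewrite the posterior mean as a weighted average of the prior mean and the observation, with weights proportional to the precisions (inverse variances). The sketch below is a minimal illustration of that rewriting for the three priors above. The variable names are chosen for this example only.

```python
# Prior standard deviations of the three priors above, plus the fixed
# likelihood variance and the observation used for the posterior plot
prior_mean, y, likelihood_var = 130.0, 170.0, 100.0
prior_sds = {"prior_tight": 1.0, "prior_med": 10.0, "prior_flat": 30.0}

weights = {}
for name, sd in prior_sds.items():
    prior_precision = 1 / sd**2
    data_precision = 1 / likelihood_var
    # Share of the posterior mean that is determined by the prior mean
    w_prior = prior_precision / (prior_precision + data_precision)
    posterior_mean = w_prior * prior_mean + (1 - w_prior) * y
    weights[name] = (w_prior, posterior_mean)
    print(f"{name}: weight on prior = {w_prior:.3f}, posterior mean = {posterior_mean:.1f}")
```

The tight prior keeps about 99% of the weight, so the posterior mean barely moves from 130. The medium prior splits the weight evenly (posterior mean 150), and the flat prior lets the observation dominate (posterior mean 166).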

## Multiple Observations

So far, we only considered 1 observation. When dealing with multiple observations, the posterior becomes:

<div style="font-size: 2em">
$$
\begin{align}
p(\mu\mid y_1,\ldots,y_n) &\propto \color{blue}{p(y_1,\ldots,y_n\mid\mu)}p(\mu)\\
&= \prod_{i=1}^n \color{blue}{\frac{1}{\sqrt{2\pi\color{black}{\sigma_0^2}}}exp\left\{-\frac{1}{2\color{black}{\sigma_0^2}}{(\color{black}{y_i} - \mu)}^2\right\}}p(\mu)
\end{align}
$$
</div>

By the [chain rule of probability](https://en.wikipedia.org/wiki/Chain_rule_(probability)), the joint likelihood $p(y_1,\ldots,y_n\mid\mu)$ factorizes into a product of conditional densities. Under the assumption that every $y_i$ is **independent** of the others given $\mu$, each factor reduces to $p(y_i\mid\mu)$, so the joint likelihood equals the product of the likelihoods of the single observations.
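
The product form can be sketched numerically as follows. The measurements in `y` and the value of `mu` are hypothetical, and `normal_logpdf` is a helper written for this example. In practice the sum of log densities is usually preferred over the raw product, because a product of many small densities quickly underflows.

```python
import numpy as np

def normal_logpdf(x, mean, var):
    """Log density of N(mean, var) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mean) ** 2 / (2 * var)

y = np.array([165.0, 172.0, 168.0, 175.0])  # hypothetical heart rate measurements
mu, sigma0_sq = 170.0, 100.0                # one candidate mean, fixed variance

# Joint likelihood as a product of the single-observation likelihoods
likelihood_product = np.prod(np.exp(normal_logpdf(y, mu, sigma0_sq)))

# The same quantity on the log scale: a sum instead of a product
log_likelihood_sum = np.sum(normal_logpdf(y, mu, sigma0_sq))

print(likelihood_product, np.exp(log_likelihood_sum))
```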

This eventually leads to a similar posterior as for a single observation:

<div style="font-size: 2em">
$$
\mu\mid y_1,\ldots,y_n \sim \mathcal{N}(m_1, s_1^2),\\
s_1^2 = {\left(\frac{1}{s_0^2} + \frac{n}{\sigma^2}\right)}^{-1}\\
m_1 = s_1^2\left(\frac{m_0}{s_0^2} + \frac{n\bar{y}}{\sigma^2}\right)
$$
</div>

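with $n$ being the size and $\bar{y}$ the mean of our sample of observations $y_1,\ldots,y_n$, and $\sigma^2$ the variance of the likelihood. The code below takes a shortcut and plugs in the variance of the sample for $\sigma^2$ when $n > 1$. The posterior computation can be generalized as follows: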

In [None]:
def to_posterior(y, prior, likelihood_var=1):
    n = len(y)
    y_bar = np.mean(y)
    y_var = np.var(y) if n > 1 else likelihood_var
    posterior_variance = 1 / ((1/prior.var()) + (n/y_var))  # s_1^2 (see above)
    posterior_mean = posterior_variance * ((prior.mean() / prior.var()) + (n*y_bar / y_var))  # m_1 (see above)
    posterior = norm(loc=posterior_mean, scale=np.sqrt(posterior_variance))
    setattr(posterior, 'name', f'posterior_prior={prior.name}_ybar={y_bar:.1f}_yvar={y_var:.1f}_n={n}')
    return posterior
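
A useful property of this update is that, when the likelihood variance is known and fixed, processing all observations at once gives the same posterior as updating one observation at a time and using each posterior as the next prior. The sketch below illustrates this under that assumption. Unlike `to_posterior` above, it does not plug in the sample variance, and the function `update` and the measurements are hypothetical.

```python
def update(prior_mean, prior_var, observations, likelihood_var):
    """Normal-normal update with a known (fixed) likelihood variance."""
    n = len(observations)
    y_bar = sum(observations) / n
    post_var = 1 / (1 / prior_var + n / likelihood_var)
    post_mean = post_var * (prior_mean / prior_var + n * y_bar / likelihood_var)
    return post_mean, post_var

observations = [165.0, 172.0, 168.0, 175.0]  # hypothetical measurements
m0, s0_sq, sigma0_sq = 130.0, 100.0, 100.0

# Batch update: all observations at once
batch_mean, batch_var = update(m0, s0_sq, observations, sigma0_sq)

# Sequential update: yesterday's posterior is today's prior
seq_mean, seq_var = m0, s0_sq
for obs in observations:
    seq_mean, seq_var = update(seq_mean, seq_var, [obs], sigma0_sq)

print(batch_mean, batch_var)
print(seq_mean, seq_var)
```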

Taking the "tight" prior from the previous example, how is it updated based on multiple measurements?

In [None]:
plot_densities([
    prior_tight,
    to_posterior([170], prior_tight, likelihood_var=100),
    to_posterior(norm.rvs(loc=170, scale=10, size=10), prior_tight),
    to_posterior(norm.rvs(loc=170, scale=10, size=100), prior_tight),
    to_posterior(norm.rvs(loc=170, scale=10, size=1000), prior_tight),
], title='Posteriors for "tight" prior')

And how about the "medium certainty" prior?

In [None]:
plot_densities([
    prior_med,
    to_posterior([170], prior_med, likelihood_var=100),
    to_posterior(norm.rvs(loc=170, scale=10, size=10), prior_med),
    to_posterior(norm.rvs(loc=170, scale=10, size=100), prior_med),
    to_posterior(norm.rvs(loc=170, scale=10, size=1000), prior_med),
], title='Posteriors for "medium" prior')

In [None]:
prior_med.var()

In [None]:
norm.var(loc=170, scale=10)

In [None]:
to_posterior(norm.rvs(loc=170, scale=10, size=10), prior_med).var()

Why is the posterior variance (much) smaller than that of the prior or the sample?
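
A hint, assuming for simplicity that the sample variance comes out at exactly 100 (in practice it varies from sample to sample): the posterior formula adds precisions, so with $s_0^2 = 100$ and $n = 10$

<div style="font-size: 2em">
$$
\frac{1}{s_1^2} = \frac{1}{s_0^2} + \frac{n}{\sigma^2} \approx \frac{1}{100} + \frac{10}{100} = 0.11 \quad\Rightarrow\quad s_1^2 \approx 9.1
$$
</div>

Also note that the posterior describes our uncertainty about the mean $\mu$, not the spread of individual measurements.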

## The Posterior has the Answers

Given that we observe some sample of 10 heart rate measurements from an unknown person's first workout, e.g.

In [None]:
y = norm.rvs(loc=170, scale=10, size=10)
y

and given the "medium-certainty" prior defined above (in which a population mean of 130 BPM, with variance of 100 is assumed), the posterior becomes

In [None]:
posterior = to_posterior(y, prior_med)

Using the posterior's mean, variance, and (inverse) [Cumulative Distribution Function (CDF)](https://en.wikipedia.org/wiki/Cumulative_distribution_function), it is now possible to answer questions such as:

_What is this user's expected mean heart rate?_

In [None]:
posterior.mean()

_What is the probability that this new user's mean heart rate is below 170?_

In [None]:
posterior.cdf(170)

_What is the probability that this new user's mean heart rate is between 165 and 175?_

In [None]:
posterior.cdf(175) - posterior.cdf(165)

_For which heart rate can we assign a 30% probability that this user's true mean is below it?_

In [None]:
posterior.ppf(0.3)  # This is the inverse of the CDF

_What is the range of heart rates for which there is a 95% probability that it contains this user's true mean heart rate?_

In [None]:
f'[{posterior.ppf(0.025)} - {posterior.ppf(0.975)}]'
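
The interval computed from the 2.5% and 97.5% quantiles is an equal-tailed credible interval. As a small sketch, scipy's frozen distributions also offer an `interval` method that returns the same two endpoints. The distribution `example_posterior` is a hypothetical stand-in with made-up parameters, used here only so that the snippet runs on its own.

```python
from scipy.stats import norm

# Hypothetical posterior, standing in for the one computed from the sample above
example_posterior = norm(loc=166.0, scale=3.0)

# Equal-tailed 95% interval: 2.5% of the probability mass lies on either side
lower, upper = example_posterior.interval(0.95)

# Same endpoints via the inverse CDF, and the mass between them
lower_ppf, upper_ppf = example_posterior.ppf(0.025), example_posterior.ppf(0.975)
mass_inside = example_posterior.cdf(upper) - example_posterior.cdf(lower)

print(f"[{lower:.1f} - {upper:.1f}] contains {mass_inside:.2f} of the probability")
```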

## Introducing ... X

So far we only looked at inference of a single (distribution) parameter based on univariate data $y$. But the Bayesian framework is far more flexible. Can we express a simple linear regression? Assume the most basic example, i.e.

<div style="font-size: 2em">
$$
y_i = \beta x_i + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma^2)
$$
</div>

which could alternatively be written as a likelihood:

<div style="font-size: 2em">
$$
y_i\mid\beta,\sigma^2 \sim \mathcal{N}(\beta x_i, \sigma^2)
$$
</div>

with PDF:

<div style="font-size: 2em">
$$
p(y_i\mid\beta,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}exp\left\{-\frac{{(y_i - \beta x_i)}^2}{2\sigma^2}\right\}
$$
</div>

Keep in mind that only $\beta$ and $\sigma^2$ are treated as random variables in this expression: $x$ and $y$ are fixed, because they have been observed.

Using similar techniques as above, it is possible to derive the posterior $p(\beta,\sigma^2\mid y)$, **in some cases(*)**.
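
One such case can be sketched under two simplifying assumptions that are not part of the text above: $\sigma^2$ is treated as known, and $\beta$ gets a normal prior $\mathcal{N}(b_0, t_0^2)$. The same rewriting as in the single-mean example then yields a normal posterior for $\beta$ with precision $\frac{1}{t_0^2} + \frac{\sum_i x_i^2}{\sigma^2}$. The data below are simulated, and all names (`true_beta`, `b0`, `t0_sq`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from y_i = beta * x_i + eps_i (true_beta and sigma_sq are made up)
true_beta, sigma_sq = 2.0, 1.0
x = np.linspace(0.0, 5.0, 50)
y = true_beta * x + rng.normal(0.0, np.sqrt(sigma_sq), size=x.size)

# Assumed prior: beta ~ N(b0, t0_sq)
b0, t0_sq = 0.0, 10.0

# Posterior for beta: precisions add, the mean is a precision-weighted combination
post_var = 1 / (1 / t0_sq + np.sum(x**2) / sigma_sq)
post_mean = post_var * (b0 / t0_sq + np.sum(x * y) / sigma_sq)

print(f"beta | y ~ N({post_mean:.3f}, {post_var:.5f})")
```

When $\sigma^2$ is unknown as well, the posterior is a joint distribution over $(\beta, \sigma^2)$ and this shortcut no longer applies directly.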

## * The "Different" Cases



In general, getting the posterior distribution of our parameters of interest involves (depending on the complexity of the model and the number of parameters) multiplying two or more PDFs, and dividing that product by its integral over the parameters:

<div style="font-size: 2em">
$$
\text{posterior} = \frac{\text{product of PDFs of likelihood and (many) different priors}}{\text{the integral over the product of PDFs in the numerator}}
$$
</div>

We've seen that in the case of a normally distributed likelihood and normally distributed prior, some clever rewriting gives us a normally distributed posterior (and gives us the value of that nasty integral for free).

But... what if we choose a different prior? There are many different distributions/PDFs to choose from. Instead of the normal prior in the [example model above](#Example,-observing-a-single-value-with-a-simple-model), the prior for $\mu$ could have been a [t distribution](https://en.wikipedia.org/wiki/Student%27s_t-distribution) with PDF

<div style="font-size: 2em">
$$
p(\mu) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\nu\pi}}{\left(1 + \frac{\mu^2}{\nu}\right)}^{-\frac{\nu+1}{2}}
$$
</div>

where $\Gamma(x)$ is the [Gamma function](https://en.wikipedia.org/wiki/Gamma_function), which equals $(x - 1)!$ for positive integers $x$. Taking the product of this PDF and the normal PDF for the likelihood does not result in a well-known PDF.
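
To make this concrete, the sketch below multiplies a t prior with the normal likelihood of a single observation and normalizes the product by brute force on a grid. The degrees of freedom, the shift and scale of the prior, and the grid bounds are assumptions chosen for illustration (the standard t PDF above is centered at 0, which would not suit heart rates). This only works here because there is a single parameter: a grid like this grows exponentially with the number of parameters.

```python
import numpy as np
from scipy.stats import norm, t

y, sigma0 = 170.0, 10.0  # single observation and likelihood sd (variance 100)

# Assumed prior: t distribution with 3 degrees of freedom, shifted and scaled
prior = t(df=3, loc=130.0, scale=10.0)

# Unnormalized posterior: likelihood * prior, evaluated on a grid of mu values
mu = np.linspace(50.0, 250.0, 20001)
d_mu = mu[1] - mu[0]
unnormalized = norm.pdf(y, loc=mu, scale=sigma0) * prior.pdf(mu)

# Brute-force approximation of the integral in the denominator
evidence = unnormalized.sum() * d_mu
posterior_grid = unnormalized / evidence

posterior_mean = np.sum(mu * posterior_grid) * d_mu
print(f"approximate posterior mean: {posterior_mean:.1f}")
```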

What if, in general, $p(y\mid\theta)p(\theta)$ does not evaluate to a known family of probability densities? It means we need to compute that integral in the denominator, which turns out to be intractable in most cases. Are there other means to get the posterior distribution?