# Posterior Approximation

## The Law of Large Numbers and Monte Carlo Estimation

What if the product of likelihood and prior(s) has no well-defined PDF? Then we cannot rely on a ready-made distribution such as `scipy.stats.norm` and its convenient `mean()`, `pdf()`, `cdf()`, or `ppf()` functions to answer probabilistic questions about the posterior. Can such questions be answered in a different way? Assume we have a sample $y_1,\ldots,y_n$ from an _unknown_ distribution:

In [None]:
import numpy as np

# the sample is drawn from a normal distribution, but we pretend not to know the true distribution
y = np.random.normal(loc=130, scale=10, size=10)

The distribution is not defined by a PDF, but by a **sample**. According to the [Law of Large Numbers](https://en.wikipedia.org/wiki/Law_of_large_numbers), the sample mean approximates the distribution mean if the sample is large enough. Is $n=10$ good enough?

In [None]:
np.mean(y)

In [None]:
np.std(y)
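
To get a feeling for how quickly the sample mean settles down, here is a minimal sketch that looks at the mean of the first $n$ values of one long sample. The seed, the sample sizes, and the names `draws` and `running_means` are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, only for reproducibility

# draw one long sample and compute the sample mean of its first n values
draws = rng.normal(loc=130, scale=10, size=100_000)
sizes = [10, 100, 1_000, 10_000, 100_000]
running_means = {n: draws[:n].mean() for n in sizes}

for n, m in running_means.items():
    print(f'n={n:>6}: sample mean = {m:.2f}, abs. error = {abs(m - 130):.2f}')
```

The error does not have to shrink at every step, but for large $n$ the sample mean should end up close to 130.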

As [mentioned before](./notebooks/01_bayes_rule_intro.ipynb#Continuous:), the probability of $Y$ being between e.g. 110 and 130 is

$$Pr(110 \le Y \le 130) = \int_{110}^{130}p(y)\,dy$$

However, $p(y)$ is unknown and all we have is a sample. Fortunately, the sample can be used to approximate the integral:

In [None]:
def sample_cdf(x, sample):
    return (sample <= x).sum() / sample.size  # "integrating" by counting

In [None]:
sample_cdf(130, y) - sample_cdf(110, y)

This can be compared with the probability obtained from the true normal distribution:

In [None]:
from scipy.stats import norm

unknown_norm = norm(loc=130, scale=10)
unknown_norm.cdf(130) - unknown_norm.cdf(110)
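
How far off can the counting estimate be? For independent draws, the fraction of sample values that fall into an interval is a binomial proportion, so its standard error is $\sqrt{p(1-p)/n}$. The sketch below evaluates that formula for a few sample sizes. It plugs in the true probability, which is only possible here because we secretly know the distribution; `proportion_standard_error` is a helper name introduced for this illustration.

```python
import numpy as np
from scipy.stats import norm

def proportion_standard_error(p, n):
    """Standard error of a proportion estimated from n independent draws."""
    return np.sqrt(p * (1 - p) / n)

# true probability of 110 <= Y <= 130 under N(130, 10)
true_norm = norm(loc=130, scale=10)
true_p = true_norm.cdf(130) - true_norm.cdf(110)

se_by_n = {n: proportion_standard_error(true_p, n) for n in (10, 100, 10_000)}

for n, se in se_by_n.items():
    print(f'n={n:>6}: standard error of the estimated probability = {se:.4f}')
```

With $n=10$ the standard error is roughly 0.16, so an estimate that misses the true value by 0.1 or more would not be surprising.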

Samples also allow us to compute quantiles:

In [None]:
def sample_ppf(p, sample):
    p_index = int(np.round((sample.size - 1) * p))
    return np.sort(sample)[p_index]
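
As a cross-check, `sample_ppf` can be compared with NumPy's built-in `np.quantile`. The helper above picks the nearest order statistic, whereas `np.quantile` by default interpolates linearly between neighbouring order statistics, so the two should agree closely for large samples. The sample `demo_sample` and the seed below are assumptions made for this sketch.

```python
import numpy as np

def sample_ppf(p, sample):
    # same nearest-rank helper as above, repeated so this cell runs on its own
    p_index = int(np.round((sample.size - 1) * p))
    return np.sort(sample)[p_index]

rng = np.random.default_rng(seed=1)  # arbitrary seed, only for reproducibility
demo_sample = rng.normal(loc=130, scale=10, size=10_000)

nearest_rank = sample_ppf(0.9, demo_sample)
interpolated = np.quantile(demo_sample, 0.9)  # default: linear interpolation

print(f'sample_ppf: {nearest_rank:.2f}, np.quantile: {interpolated:.2f}')
```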

So the central interval that contains $y$ with a probability of 80% (from the 10th to the 90th percentile) is:

In [None]:
f'[{sample_ppf(0.1, y)} - {sample_ppf(0.9, y)}]'

where the interval from the true normal distribution would be:

In [None]:
f'[{unknown_norm.ppf(0.1)} - {unknown_norm.ppf(0.9)}]'

Using a larger sample leads to more accurate estimates:

In [None]:
y_large = np.random.normal(loc=130, scale=10, size=10000)

{
    'mean (130)': np.mean(y_large),
    'std (10)': np.std(y_large),
    'p_110_130 (0.4772)': sample_cdf(130, y_large) - sample_cdf(110, y_large),
    '80% ([117.2 - 142.8])': f'[{sample_ppf(0.1, y_large)} - {sample_ppf(0.9, y_large)}]'
}



This process of computing means, variances, probabilities, percentiles (or any other quantities of interest) from simulated samples is called **Monte Carlo Estimation**. When using more complex models, simulations such as the ones above can also be **chained**. For example, consider a model similar to the one in the [previous note](./02_generative_models.ipynb#Example,-observing-a-single-value-with-a-simple-model):

<div style="font-size: 2em">
$$
\begin{align}
\mu &\sim\color{red}{t(\nu, m_0, s_0)}\,\mathrm{(prior)}\\
y &\sim\color{blue}{\mathcal{N}(\mu, \sigma^2_0)}\,\mathrm{(likelihood)}\\
\nu &=(n-1)\,\text{degrees of freedom for a sample size of}\,n\\
m_0 &=130\\
s_0 &=10\\
\sigma^2_0 &=100
\end{align}
$$
</div>

The joint density $p(y, \mu) = \color{blue}{p(y\mid\mu)}\color{red}{p(\mu)}$ cannot be expressed as a known family (e.g. Normal) of probability densities, but can be estimated by simulation:

In [None]:
from scipy.stats import t

# basing the prior on a sample of n=10 (so nu = n - 1 = 9), simulating m=1000 draws of mu
mu = t.rvs(df=9, loc=130, scale=10, size=1000)

# an array of m location parameters can be plugged in to get m new samples, one per draw of mu
y_chained = norm.rvs(loc=mu, scale=10)

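# reference values in parentheses follow from the chained model, not from N(130, 10):
# std = sqrt(s_0^2 * nu / (nu - 2) + sigma_0^2) = sqrt(128.6 + 100), approximately 15.1
{
    'mean (130)': np.mean(y_chained),
    'std (15.1)': np.std(y_chained),
    'p_110_130': sample_cdf(130, y_chained) - sample_cdf(110, y_chained),
    '80%': f'[{sample_ppf(0.1, y_chained)} - {sample_ppf(0.9, y_chained)}]'
}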

These are samples from the **prior predictive** distribution, i.e. given the likelihood and all the priors in a model, what values of $y$ can we expect (before having observed any data)?
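
The spread of the prior predictive sample can be checked against the law of total variance, $\mathrm{Var}(y) = \mathrm{E}[\mathrm{Var}(y\mid\mu)] + \mathrm{Var}(\mathrm{E}[y\mid\mu]) = \sigma^2_0 + \mathrm{Var}(\mu)$. The following sketch repeats the chained simulation with a larger number of draws and compares the simulated standard deviation with the theoretical one. The seed, the number of draws, and the variable names are choices made for this illustration.

```python
import numpy as np
from scipy.stats import norm, t

nu, m_0, s_0, sigma_0 = 9, 130, 10, 10

# variance of the location-scale t prior: s_0**2 * nu / (nu - 2) for nu > 2
var_mu = t(df=nu, loc=m_0, scale=s_0).var()

# law of total variance: Var(y) = sigma_0**2 + Var(mu)
theoretical_std = np.sqrt(sigma_0**2 + var_mu)

rng = np.random.default_rng(seed=2)  # arbitrary seed, only for reproducibility
mu_draws = t.rvs(df=nu, loc=m_0, scale=s_0, size=100_000, random_state=rng)
y_draws = norm.rvs(loc=mu_draws, scale=sigma_0, random_state=rng)

print(f'theoretical std: {theoretical_std:.2f}, simulated std: {np.std(y_draws):.2f}')
```

The prior predictive distribution is wider than the likelihood alone, because the uncertainty about $\mu$ adds to the observation noise.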

- Can we use the above approach to sample from the _posterior_ distribution $p(\mu\mid y)$? Why (not)?