# T3 - Bayesian inference
Now that we have reviewed some probability, we can look at statistical inference.
$
% Loading TeX (MathJax)... Please wait
%
\newcommand{\Reals}{\mathbb{R}}
\newcommand{\Expect}[0]{\mathbb{E}}
\newcommand{\NormDist}{\mathcal{N}}
%
\newcommand{\DynMod}[0]{\mathscr{M}}
\newcommand{\ObsMod}[0]{\mathscr{H}}
%
\newcommand{\mat}[1]{{\mathbf{{#1}}}}
%\newcommand{\mat}[1]{{\pmb{\mathsf{#1}}}}
\newcommand{\bvec}[1]{{\mathbf{#1}}}
%
\newcommand{\trsign}{{\mathsf{T}}}
\newcommand{\tr}{^{\trsign}}
\newcommand{\tn}[1]{#1}
\newcommand{\ceq}[0]{\mathrel{≔}}
%
\newcommand{\I}[0]{\mat{I}}
\newcommand{\K}[0]{\mat{K}}
\newcommand{\bP}[0]{\mat{P}}
\newcommand{\bH}[0]{\mat{H}}
\newcommand{\bF}[0]{\mat{F}}
\newcommand{\R}[0]{\mat{R}}
\newcommand{\Q}[0]{\mat{Q}}
\newcommand{\B}[0]{\mat{B}}
\newcommand{\C}[0]{\mat{C}}
\newcommand{\Ri}[0]{\R^{-1}}
\newcommand{\Bi}[0]{\B^{-1}}
\newcommand{\X}[0]{\mat{X}}
\newcommand{\A}[0]{\mat{A}}
\newcommand{\Y}[0]{\mat{Y}}
\newcommand{\E}[0]{\mat{E}}
\newcommand{\U}[0]{\mat{U}}
\newcommand{\V}[0]{\mat{V}}
%
\newcommand{\x}[0]{\bvec{x}}
\newcommand{\y}[0]{\bvec{y}}
\newcommand{\z}[0]{\bvec{z}}
\newcommand{\q}[0]{\bvec{q}}
\newcommand{\br}[0]{\bvec{r}}
\newcommand{\bb}[0]{\bvec{b}}
%
\newcommand{\bx}[0]{\bvec{\bar{x}}}
\newcommand{\by}[0]{\bvec{\bar{y}}}
\newcommand{\barB}[0]{\mat{\bar{B}}}
\newcommand{\barP}[0]{\mat{\bar{P}}}
\newcommand{\barC}[0]{\mat{\bar{C}}}
\newcommand{\barK}[0]{\mat{\bar{K}}}
%
\newcommand{\D}[0]{\mat{D}}
\newcommand{\Dobs}[0]{\mat{D}_{\text{obs}}}
\newcommand{\Dmod}[0]{\mat{D}_{\text{mod}}}
%
\newcommand{\ones}[0]{\bvec{1}}
\newcommand{\AN}[0]{\big( \I_N - \ones \ones\tr / N \big)}
%
% END OF MACRO DEF
$

In [None]:
import resources.workspace as ws
%matplotlib inline
import numpy as np
import scipy as sp
import scipy.stats  # make sure the `stats` submodule is loaded for `sp.stats`
import matplotlib.pyplot as plt
plt.ion();

The [previous tutorial](T2%20-%20Gaussian%20distribution.ipynb) studied the Gaussian probability density function (pdf), defined by:

$$\begin{align}
\mathcal{N}(x \mid \mu, \sigma^2) = (2 \pi \sigma^2)^{-1/2} e^{-(x-\mu)^2/2 \sigma^2} \, , \tag{G1}
\end{align}$$

In [None]:
def pdf_G1(x, meanval, variance):
    return sp.stats.norm.pdf(x, loc=meanval, scale=np.sqrt(variance))

The following implements the [uniform](https://en.wikipedia.org/wiki/Uniform_distribution_(continuous))
(or "flat" or "box") pdf.

In [None]:
def pdf_U1(x, meanval, variance):
    lower = meanval - np.sqrt(3*variance)
    upper = meanval + np.sqrt(3*variance)
    # pdfx = scipy.stats.uniform(loc=lower, scale=(upper-lower)).pdf(x)
    height = 1/(upper - lower)
    pdfx = height * np.ones_like(x)
    pdfx[x<lower] = 0
    pdfx[x>upper] = 0
    return pdfx
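
The half-width $\sqrt{3 \cdot \texttt{variance}}$ used above is what makes the box have the requested variance. The following is a minimal numerical sketch of that claim; the values of `mu` and `var` and the fine grid `xx` are illustrative assumptions, and the box is rebuilt with `np.where` so that the snippet runs on its own.

```python
import numpy as np

# Illustrative values (assumptions, not taken from the tutorial)
mu, var = 2.0, 3.0
half_width = np.sqrt(3 * var)

# A fine grid, and the box pdf of height 1/(upper - lower) on it
xx = np.linspace(-10, 10, 20001)
dxx = xx[1] - xx[0]
pp = np.where(np.abs(xx - mu) <= half_width, 1 / (2 * half_width), 0.0)

# Riemann-sum approximations of the integral, the mean and the variance
total = np.sum(pp) * dxx
mean_num = np.sum(xx * pp) * dxx
var_num = np.sum((xx - mean_num)**2 * pp) * dxx

print("integral:", total)
print("mean:    ", mean_num)
print("variance:", var_num)
```

The three printed numbers should be close to 1, `mu` and `var` respectively, up to the discretisation error of the grid.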

These distributions will help illustrate:

# Bayes' rule
In the Bayesian approach, knowledge and uncertainty about the unknown ($x$)
is quantified through probability.
And **Bayes' rule** is how we do inference: it says how to condition/merge/assimilate/update this belief based on data/observation ($y$).
For *continuous* "random variables", $x$ and $y$, it reads:

$$\begin{align}
p(x|y) &= \frac{p(x) \, p(y|x)}{p(y)} \, , \tag{BR} \\[1em]
\text{i.e.} \qquad \texttt{posterior}\,\text{[pdf of $x$ given $y$]}
\; &= \;
\frac{\texttt{prior}\,\text{[pdf of $x$]}
\; \times \;
\texttt{likelihood}\,\text{[pdf of $y$ given $x$]}}
{\texttt{normalisation}\,\text{[pdf of $y$]}} \, ,
\end{align}
$$

Note that, in contrast to (the frequent aim of) classical statistics, Bayes' rule in itself makes no attempt at producing only a single estimate (but the topic is briefly discussed [further below](#Exc-2.28-(optional):)). It merely states how quantitative belief (weighted possibilities) should be updated in view of new data.

**Exc 2.10:** Derive Bayes' rule from the definition of [conditional pdf's](https://en.wikipedia.org/wiki/Conditional_probability_distribution#Conditional_continuous_distributions).

In [None]:
# ws.show_answer('BR derivation')

**Exc 2.11 (optional):** Laplace called "statistical inference" the reasoning of "inverse probability" (1774). You may also have heard of "inverse problems" in reference to similar problems, but without a statistical framing. In view of this, why do you think we use $x$ for the unknown, and $y$ for the known/given data?

In [None]:
# ws.show_answer('inverse')

Bayes' rule, eqn. (BR), involves functions (the densities), but applies for any/all values of $x$ (and $y$).
Thus, upon discretisation, eqn. (BR) becomes the multiplication of two arrays of values,
followed by a normalisation (explained [below](#Exc-2.14:)). It is hard to overstate how simple this principle is.

In [None]:
def Bayes_rule(prior_values, lklhd_values, dx):
    prod = prior_values * lklhd_values         # pointwise multiplication
    posterior_values = prod/(np.sum(prod)*dx)  # normalization
    return posterior_values

bounds = -15, 15
grid1d = np.linspace(*bounds, 201)
dx = grid1d[1]  - grid1d[0]
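
Before turning to the interactive version, here is a small non-interactive sketch of the same computation. The helper `gauss` (eqn. (G1) written with NumPy only), the names `xs` and `h`, and the values $y=4$, $R=1$ are assumptions made for illustration; `Bayes_rule` is repeated so that the snippet is self-contained.

```python
import numpy as np

def gauss(x, mu, var):
    # Gaussian pdf, eqn. (G1), written out with NumPy only
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def Bayes_rule(prior_values, lklhd_values, dx):
    prod = prior_values * lklhd_values          # pointwise multiplication
    return prod / (np.sum(prod) * dx)           # normalization

xs = np.linspace(-15, 15, 201)
h = xs[1] - xs[0]

prior = gauss(xs, 0.0, 1.0)    # p(x), evaluated on the grid
lklhd = gauss(4.0, xs, 1.0)    # p(y|x) as a function of x, for y=4 and R=1
postr = Bayes_rule(prior, lklhd, h)

integral = np.sum(postr) * h
mode = xs[np.argmax(postr)]
print("Integral of posterior:", integral)
print("Grid point with highest posterior density:", mode)
```

Note that the likelihood is evaluated as a function of $x$ with $y$ held fixed, which is why the grid is passed as the *mean* argument.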

The code below shows Bayes' rule in action.

In [None]:
@ws.interact(y=(*bounds, 1),
             R=(0.01, 20, 0.2),
             top=['y', 'R'])
def Bayes1(y=4.0, R=1.0,
           prior_is_G=True,
           lklhd_is_G=True):
    xf = 0
    Pf = 1
    x = grid1d

    prior_vals = pdf_G1(x, xf, Pf) if prior_is_G else pdf_U1(x, xf, Pf)
    lklhd_vals = pdf_G1(y, x, R)   if lklhd_is_G else pdf_U1(y, x, R)
    postr_vals = Bayes_rule(prior_vals, lklhd_vals, dx)

    def plot(x, y, c, lbl):
        plt.fill_between(x, y, color=c, alpha=.3, label=lbl)
    plt.figure(figsize=(8, 4))
    plot(x, prior_vals, 'blue'  , f'Prior, N(x | {xf:.4g}, {Pf:.4g})')
    plot(x, lklhd_vals, 'green' , f'Lklhd, N({y} | x, {R:.4g})')
    plot(x, postr_vals, 'red'   , f'Postr, pointwise')

    try:
        # See exercise below
        xa, Pa = Bayes_rule_G1(xf, Pf, y, R)
        label = f'Postr, parametric\nN(x | {xa:.4g}, {Pa:.4g})'
        postr_vals_G1 = pdf_G1(x, xa, Pa)
        plt.plot(x, postr_vals_G1, 'purple', label=label)
    except NameError:
        pass

    plt.ylim(0, 0.6)
    plt.legend(loc="upper right", prop={'family': 'monospace'})
    plt.show()

**Exc 2.12:** This exercise serves to make you acquainted with how Bayes' rule blends information.  
 Move the sliders (use arrow keys?) to animate it, and answer the following (with the boolean checkmarks both on and off).
 * What happens to the posterior when $R \rightarrow \infty$ ?
 * What happens to the posterior when $R \rightarrow 0$ ?
 * Move $y$ around. What is the posterior's location (mean/mode) when $R$ equals the prior variance?
 * Can you say something universally valid (for any $y$ and $R$) about the height of the posterior pdf?
 * Does the posterior scale (width) depend on $y$?  
   *Optional*: What does this mean [information-wise](https://en.wikipedia.org/wiki/Differential_entropy#Differential_entropies_for_various_distributions)?
 * Consider the shape (ignoring location & scale) of the posterior. Does it depend on $R$ or $y$?
 * Can you see a shortcut to computing this posterior rather than having to do the pointwise multiplication?
 * For the case of two uniform distributions: What happens when you move the prior and likelihood too far apart? Is this the fault of the implementation, the math, or the problem statement?
 * Play around with the grid resolution (see the cell above). What is in your opinion a "sufficient" grid resolution?

In [None]:
# ws.show_answer('Posterior behaviour')

#### Exc 2.14 (optional):
Show that the normalization in `Bayes_rule()` amounts to (approximately) the same as dividing by $p(y)$.

In [None]:
# ws.show_answer('BR normalization')

In fact, since $p(y)$ is thus implicitly known,
we often don't bother to write it down, simplifying Bayes' rule (eqn. BR) to
$$\begin{align}
p(x|y) \propto p(x) \, p(y|x) \, .  \tag{BR2}
\end{align}$$
Actually, do we even need to care about $p(y)$ at all? All we really need to know is how much more likely some value of $x$ (or an interval around it) is compared to any other $x$.
The normalisation is only necessary because of the *convention* that all densities integrate to $1$.
However, for large models, we usually can only afford to evaluate $p(y|x)$ at a few points (of $x$), so that the integral for $p(y)$ can only be roughly approximated. In such settings, estimation of the normalisation factor becomes an important question too.
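
The point that relative plausibility does not depend on the normalisation can be checked numerically. The sketch below is only an illustration under assumed values (a standard Gaussian prior, $y=4$, $R=1$, and the two grid points $x=3$ and $x=0$); the helper `gauss` is a hypothetical stand-in for eqn. (G1).

```python
import numpy as np

def gauss(x, mu, var):
    # Gaussian pdf, eqn. (G1), written out with NumPy only
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

xs = np.linspace(-15, 15, 201)
h = xs[1] - xs[0]

# Unnormalised posterior, eqn. (BR2), and its normalised version, eqn. (BR)
unnorm = gauss(xs, 0.0, 1.0) * gauss(4.0, xs, 1.0)
normed = unnorm / (np.sum(unnorm) * h)

# Compare how much more plausible x=3 is than x=0, with and without normalisation
i, j = 120, 100   # grid indices of x=3 and x=0 on this particular grid
ratio_unnorm = unnorm[i] / unnorm[j]
ratio_normed = normed[i] / normed[j]
print(ratio_unnorm, ratio_normed)
```

Both ratios agree, because the normalisation constant cancels in the division.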

#### Exc 2.15 'Nonlinear regression':
- (a) Suppose the "observation model" consists in squaring, i.e.
      $y = x^2/4 + \varepsilon$, i.e. $p(y|x) = \NormDist(y|x^2/4, R)$, where $R$ is the variance of $\varepsilon$. Implement this in the above interactive animation code.
- (b) Try $y = |x|$. Compare with (a).
- (c) Try $y = x + 3$. Describe the impact.
- (d) Try $y = 2 x$. Can you reproduce a posterior obtained with $y = x$ ?

Restore $y = x$.

## Gaussian-Gaussian Bayes

The above animation shows Bayes' rule in 1 dimension. Previously, we saw how a Gaussian looks in 2 dimensions. Can you imagine how Bayes' rule looks in 2 dimensions (we'll see in [T5](T5%20-%20Kalman%20filter%20(multivariate).ipynb))? In higher dimensions ($D_x \gg 1$), these things get difficult to imagine, let alone visualize. Similarly, the size of the problem becomes a computational difficulty.

**Exc 2.16 (optional):**
 * (a) How many point-multiplications are needed on a grid with $N$ points in $D_x$ dimensions? Imagine a $D_x$-dimensional cube where each side has a grid with $N$ points on it.
 * *PS: Of course, if the likelihood contains an actual model $\mathcal{H}(x)$ as well, its evaluations (computations) could be significantly more costly than the point-multiplications of Bayes' rule itself.*
 * (b) Suppose we model 15 physical quantities (fields), at each grid node, on a discretized surface model of Earth. Assume the resolution is $1^\circ$ for latitude (110km), $1^\circ$ for longitude. How many variables, $D_x$, are there in total? This is the ***dimensionality*** of the unknown.
 * (c) Suppose each variable has a pdf represented with a grid using only $N=20$ points. How many multiplications are necessary to calculate Bayes' rule (jointly) for all variables on our Earth model?

In [None]:
# ws.show_answer('Dimensionality', 'a')

In response to this computational difficulty, we try to be smart and do something more analytical ("pen-and-paper"): we only compute the parameters (mean and (co)variance) of the posterior pdf.

This is doable and quite simple in the Gaussian-Gaussian case:  
- With a prior $p(x) = \mathcal{N}(x \mid x^\text{f}, P^\text{f})$ and  
- a likelihood $p(y|x) = \mathcal{N}(y \mid x,R)$,  
- the posterior is
$
p(x|y)
= \mathcal{N}(x \mid x^\text{a}, P^\text{a}) \,,
$
where, in the 1-dimensional/univariate/scalar (multivariate is discussed in [T5](T5%20-%20Kalman%20filter%20(multivariate).ipynb)) case:

$$\begin{align}
    P^\text{a} &= 1/(1/P^\text{f} + 1/R) \, , \tag{5} \\\
  x^\text{a} &= P^\text{a} (x^\text{f}/P^\text{f} + y/R) \, .  \tag{6}
\end{align}$$

*There are a lot of sub/super-scripts. Please take a moment to somewhat digest the formulae.*

#### Exc  2.18 'Gaussian-Gaussian Bayes':
Consider the following identity, where $P^\text{a}$ and $x^\text{a}$ are given by eqns. (5) and (6).
$$\frac{(x-x^\text{f})^2}{P^\text{f}} + \frac{(x-y)^2}{R} \quad=\quad \frac{(x - x^\text{a})^2}{P^\text{a}} + \frac{(y - x^\text{f})^2}{P^\text{f} + R} \,, \tag{S2}$$
Notice that the left hand side (LHS) is the sum of two squares with $x$,
but the RHS only contains one square with $x$.
- (a) Derive the first term of the RHS, i.e. eqns. (5) and (6).
- (b) *Optional*: Derive the full RHS (i.e. also the second term).
- (c) Derive $p(x|y) = \mathcal{N}(x \mid x^\text{a}, P^\text{a})$ from eqns. (5) and (6)
  using part (a), Bayes' rule (BR2), and the Gaussian pdf (G1).
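
Before attempting the derivation, it can be reassuring to check identity (S2) numerically. The following sketch does so for one set of randomly drawn values; the seed and the sampling ranges are arbitrary assumptions, and a numerical check is of course no substitute for the algebra.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary test values: any real x, xf, y and any positive Pf, R should work
x, xf, y = rng.normal(size=3)
Pf, R = rng.uniform(0.5, 2.0, size=2)

# Eqns. (5) and (6)
Pa = 1 / (1/Pf + 1/R)
xa = Pa * (xf/Pf + y/R)

# Both sides of identity (S2)
lhs = (x - xf)**2 / Pf + (x - y)**2 / R
rhs = (x - xa)**2 / Pa + (y - xf)**2 / (Pf + R)
print(lhs, rhs)
```

The two printed numbers should agree to within floating-point rounding.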

In [None]:
# ws.show_answer('BR Gauss, a.k.a. completing the square', 'a')

**Exc 2.19:**
The statement $x = \mu \pm \sigma$ is *sometimes* used
as a shorthand for $p(x) = \mathcal{N}(x \mid \mu, \sigma^2)$. Suppose
- you think the temperature $x = 20°C \pm 2°C$,
- a thermometer yields the observation $y = 18°C \pm 2°C$.

Show that your posterior is $p(x|y) = \mathcal{N}(x \mid 19, 2)$.

In [None]:
# ws.show_answer('GG BR example')

The following implements a Gaussian-Gaussian Bayes' rule (eqns 5 and 6).

In [None]:
def Bayes_rule_G1(xf, Pf, y, R):
    Pa = 1 / (1/Pf + 1/R)
    xa = Pa * (xf/Pf + y/R)
    return xa, Pa

**Re-run**/execute the interactive animation code cell up above.
*Note that the inputs and outputs for `Bayes_rule_G1()` are not discretised density values (as for `Bayes_rule()`), but simply 2 numbers: the mean and the variance.*
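
The agreement seen in the plot can also be quantified. The sketch below, a rough illustration rather than a definitive test, computes the mean and variance of the pointwise (grid) posterior by Riemann sums and compares them with the two numbers returned by `Bayes_rule_G1()`. The helper `gauss`, the names `xs` and `h`, and the chosen values of `xf`, `Pf`, `y`, `R` are assumptions; the two Bayes functions are repeated so that the snippet runs on its own.

```python
import numpy as np

def gauss(x, mu, var):
    # Gaussian pdf, eqn. (G1), written out with NumPy only
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def Bayes_rule(prior_values, lklhd_values, dx):
    prod = prior_values * lklhd_values
    return prod / (np.sum(prod) * dx)

def Bayes_rule_G1(xf, Pf, y, R):
    Pa = 1 / (1/Pf + 1/R)
    xa = Pa * (xf/Pf + y/R)
    return xa, Pa

xs = np.linspace(-15, 15, 201)
h = xs[1] - xs[0]
xf, Pf, y, R = 0.0, 1.0, 4.0, 1.0   # illustrative values

# Moments of the discretised posterior
postr = Bayes_rule(gauss(xs, xf, Pf), gauss(y, xs, R), h)
mean_grid = np.sum(xs * postr) * h
var_grid = np.sum((xs - mean_grid)**2 * postr) * h

# Parametric posterior, eqns. (5) and (6)
xa, Pa = Bayes_rule_G1(xf, Pf, y, R)

print("grid:      ", mean_grid, var_grid)
print("parametric:", xa, Pa)
```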

#### Exc 2.20:
- (a) Under what conditions does `Bayes_rule_G1()` provide a good approximation to `Bayes_rule()`?
- (b) *Optional*. Try using one or more of the other [distributions readily available in `scipy`](https://stackoverflow.com/questions/37559470/) in the above animation.

**Exc 2.22:** Algebra exercise: Show that eqn. (5) can be written as
$$P^\text{a} = K R \,,    \tag{8}$$
where
$$K = P^\text{f}/(P^\text{f}+R) \,,    \tag{9}$$
is called the "Kalman gain".  
Then show that eqns (5) and (6) can be written as
$$\begin{align}
    P^\text{a} &= (1-K) P^\text{f} \, ,  \tag{10} \\\
  x^\text{a} &= x^\text{f} + K (y-x^\text{f}) \, . \tag{11}
\end{align}$$

In [None]:
# ws.show_answer('BR Kalman1')

**Exc 2.24 (optional):**
- (a) Show that $0 < K < 1$ since $0 < P^\text{f}, R$.
- (b) Show that $P^\text{a} < P^\text{f}, R$.
- (c) Show that $x^\text{a} \in (x^\text{f}, y)$.
- (d) Why do you think $K$ is called a "gain"?

In [None]:
# ws.show_answer('KG intuition')

**Exc 2.26:** Re-define `Bayes_rule_G1` so as to use eqns. 9-11. Remember to re-run the cell. Verify that you get the same plots as before.

In [None]:
# ws.show_answer('BR Kalman1 code')

#### Exc 2.28 (optional):
*If you must* pick a single point value for your estimate (for example, an action to be taken), you can **decide** on it by optimising (with respect to the estimate) the expected value of some utility/loss function [[ref](https://en.wikipedia.org/wiki/Bayes_estimator)]. For example, if the density of $X$ is symmetric,
   and $\text{Loss}$ is convex and symmetric,
   then $\Expect[\text{Loss}(X - \theta)]$ is minimized
   by the mean, $\Expect[X]$, which also coincides with the median.
   <!-- See Corollary 7.19 of Lehmann, Casella -->
For the expected *squared* loss, $\Expect[(X - \theta)^2]$,
the minimum is the mean for *any distribution*.
Show the latter result.  
*Hint: insert $0 = \,?\, - \,?$.*
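
The claim about the squared loss can be illustrated (not proven) by simulation. The sketch below assumes a skewed distribution, here an exponential with scale 2, chosen only as an example; it estimates the expected squared loss on a grid of candidate estimates `thetas` and locates its minimum.

```python
import numpy as np

rng = np.random.default_rng(42)

# Samples from a skewed (non-symmetric) distribution; the choice is arbitrary
samples = rng.exponential(scale=2.0, size=10_000)

# Monte-Carlo estimate of E[(X - theta)^2] for each candidate estimate theta
thetas = np.linspace(0, 6, 601)
exp_loss = np.array([np.mean((samples - th)**2) for th in thetas])

best_theta = thetas[np.argmin(exp_loss)]
print("Minimiser of the squared loss:", best_theta)
print("Sample mean:                  ", samples.mean())
print("Sample median:                ", np.median(samples))
```

The minimiser lands on the grid point nearest the sample mean, whereas the median of this skewed distribution lies elsewhere.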

In summary, the intuitive idea of **considering the mean of $p(x)$ as the point estimate** has good theoretical foundations.

### Next: [T4 - Filtering & time series](T4%20-%20Filtering%20%26%20time%20series.ipynb)