In [None]:
import resources.workspace as ws
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
plt.ion();

$
% START OF MACRO DEF
% DO NOT EDIT IN INDIVIDUAL NOTEBOOKS, BUT IN macros.py
%
\newcommand{\Reals}{\mathbb{R}}
\newcommand{\Expect}[0]{\mathbb{E}}
\newcommand{\NormDist}{\mathcal{N}}
%
\newcommand{\DynMod}[0]{\mathscr{M}}
\newcommand{\ObsMod}[0]{\mathscr{H}}
%
\newcommand{\mat}[1]{{\mathbf{{#1}}}} 
%\newcommand{\mat}[1]{{\pmb{\mathsf{#1}}}}
\newcommand{\bvec}[1]{{\mathbf{#1}}} 
%
\newcommand{\trsign}{{\mathsf{T}}} 
\newcommand{\tr}{^{\trsign}} 
\newcommand{\tn}[1]{#1} 
\newcommand{\ceq}[0]{\mathrel{≔}}
%
\newcommand{\I}[0]{\mat{I}} 
\newcommand{\K}[0]{\mat{K}}
\newcommand{\bP}[0]{\mat{P}}
\newcommand{\bH}[0]{\mat{H}}
\newcommand{\bF}[0]{\mat{F}}
\newcommand{\R}[0]{\mat{R}}
\newcommand{\Q}[0]{\mat{Q}}
\newcommand{\B}[0]{\mat{B}}
\newcommand{\C}[0]{\mat{C}}
\newcommand{\Ri}[0]{\R^{-1}}
\newcommand{\Bi}[0]{\B^{-1}}
\newcommand{\X}[0]{\mat{X}}
\newcommand{\A}[0]{\mat{A}}
\newcommand{\Y}[0]{\mat{Y}}
\newcommand{\E}[0]{\mat{E}}
\newcommand{\U}[0]{\mat{U}}
\newcommand{\V}[0]{\mat{V}}
%
\newcommand{\x}[0]{\bvec{x}}
\newcommand{\y}[0]{\bvec{y}}
\newcommand{\z}[0]{\bvec{z}}
\newcommand{\q}[0]{\bvec{q}}
\newcommand{\br}[0]{\bvec{r}}
\newcommand{\bb}[0]{\bvec{b}}
%
\newcommand{\bx}[0]{\bvec{\bar{x}}}
\newcommand{\by}[0]{\bvec{\bar{y}}}
\newcommand{\barB}[0]{\mat{\bar{B}}}
\newcommand{\barP}[0]{\mat{\bar{P}}}
\newcommand{\barC}[0]{\mat{\bar{C}}}
\newcommand{\barK}[0]{\mat{\bar{K}}}
%
\newcommand{\D}[0]{\mat{D}}
\newcommand{\Dobs}[0]{\mat{D}_{\text{obs}}}
\newcommand{\Dmod}[0]{\mat{D}_{\text{mod}}}
%
\newcommand{\ones}[0]{\bvec{1}} 
\newcommand{\AN}[0]{\big( \I_N - \ones \ones\tr / N \big)}
%
% END OF MACRO DEF
$
# The Gaussian (Normal) distribution

Consider the Gaussian random variable $x \sim \mathcal{N}(b, B)$.

Equivalently, we may write
$\begin{align}
p(x) = \mathcal{N}(x \mid b, B)
\end{align}$
for its probability density function (**pdf**), which is given by
$$\begin{align}
\mathcal{N}(x \mid b, B) = (2 \pi B)^{-1/2} e^{-(x-b)^2/(2 B)} \, , \tag{G1}
\end{align}$$
for $x \in (-\infty, +\infty)$.

**Exc 2.2:** Code it up (complete the code below)! Hints:
* Note that `**` is the power operator in Python.
* The exponential $e^x$ is available as `np.exp(x)` (cf. `exp(x)` in Matlab).

In [None]:
def pdf_G1(x, b, B):
    "Univariate (scalar), Gaussian pdf"
    ### INSERT ANSWER HERE ###
    return pdf_values

In [None]:
#ws.show_answer('pdf_G1')


Let's plot the pdf.

In [None]:
# Density parameters
b = 0  # mean     of distribution
B = 25 # variance of distribution

# Grid computations
N  = 201                    # num of grid points
xx = np.linspace(-20, 20,N) # grid
dx = xx[1]-xx[0]            # grid spacing
pp = pdf_G1(xx, b, B)       # pdf values

# Plot
plt.figure(figsize=(6, 2))
plt.plot(xx, pp);

This could for example be the pdf of a stochastic noise variable. It could also describe our *quantitative belief* and uncertainty about a parameter (or state), which we model as randomness in Bayesian statistics.

**Exc 2.3:** Play around with `b` and `B` (and re-run the above code) and answer these questions by looking at the resulting figure:
 * How does the pdf curve change when `b` changes?
 * How does the pdf curve change when you increase `B`?
 * In a few words, describe the shape of the Gaussian pdf curve. Does this ring a bell for you? Hint: it should be clear as a bell!
 
<mark><font size="-1">
<b>NB:</b> Restore `B=25` and re-run the above cell (this is a convenient value for the examples below).
</font></mark>


**Exc 2.4*:** Recall the definition of the expectation (in $x$), namely
$$\Expect [f(x)] \mathrel{≔} \int p(x) f(x) \, d x \,,$$
where the integral is over the domain of $x$.

Recall $p(x) = \mathcal{N}(x \mid b, B)$ from eqn (G1).  
Use pen, paper, and calculus to show that
 - (i) $\Expect[1] = 1$.  
   *Hint: This is actually quite difficult.  
   Neither Bernoulli nor Laplace figured it out;
   C. F. Gauss eventually managed it by computing $(\Expect[1])^2$ instead.  
   Try doing the same!* 
 - (ii) the first parameter, $b$, indicates its mean, i.e. that $b = \Expect[x]$.
 - (iii) the second parameter, $B>0$, indicates its variance, i.e. that $B = \Expect[(x-b)^2]$.
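
The three results can also be sanity-checked numerically. The sketch below approximates the integrals by Riemann sums on a wide grid. The helper name `gauss_pdf`, the parameter values, and the grid limits are illustrative choices rather than part of the tutorial's code, and the check is no substitute for the pen-and-paper derivation.

```python
import numpy as np

def gauss_pdf(x, b, B):
    "Scalar Gaussian pdf, eqn (G1). Hypothetical helper for this check only."
    return np.exp(-0.5 * (x - b)**2 / B) / np.sqrt(2 * np.pi * B)

b, B = 3.0, 4.0                        # arbitrary mean and variance
x = np.linspace(b - 40, b + 40, 8001)  # grid spanning +/- 20 standard deviations
dx = x[1] - x[0]
p = gauss_pdf(x, b, B)

total    = np.sum(p) * dx                # approximates E[1]
mean     = np.sum(p * x) * dx            # approximates E[x]
variance = np.sum(p * (x - b)**2) * dx   # approximates E[(x-b)^2]

print(total, mean, variance)  # should be close to 1, b, and B respectively
```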

**Exc 2.5:** Recall $p(x) = \mathcal{N}(x \mid b, B)$ from eqn (G1).  
Use pen, paper, and calculus to answer the following questions,  
which derive some helpful mnemonics about the distribution.

 * (i) Find $x$ such that $p(x) = 0$.
 * (ii) Where is the location of the mode (maximum) of the distribution?
I.e. find $x$ such that $\frac{d p}{d x}(x) = 0$.  
Hint: it's easier to analyse $\log p(x)$ rather than $p(x)$ itself.
 * (iii) Where are the inflection points? I.e. where is $\frac{d^2 p}{d x^2}(x) = 0$?
 * (iv) Some forms of "sensitivity analysis" (a basic form of uncertainty quantification) consist in evaluating $\frac{d^2 p}{d x^2}(x)$ at the mode.  
Explain this by reference to the Gaussian shape.
Hint: calculate and interpret $\frac{d^2 p}{d x^2}(b)$.

### The multivariate (i.e. vector) case
Here's the pdf of the *multivariate* Gaussian:
$$\begin{align}
\NormDist(x \mid  b, B)
&=
|2 \pi B|^{-1/2} \, \exp\Big(-\frac{1}{2}\|x-b\|^2_B\Big) \, , \tag{GM}
\end{align}$$
where $|.|$ represents the determinant, and $\|.\|_W$ represents the norm with weighting: $\|x\|^2_W = x^T W^{-1} x$.  
In this multivariate case, $B$ is called the *covariance* (matrix).

The following implements this pdf. Take a moment to digest the code. Don't worry if you don't understand all of the details. Hints:
 * `@` produces matrix multiplication (`*` in `Matlab`);
 * `*` produces array multiplication (`.*` in `Matlab`);
 * `axis=-1` makes `np.sum()` work along the last dimension of an ND-array.

In [None]:
from numpy.linalg import det, inv

def weighted_norm22(xx, W):
    "Computes the squared norm of each vector (on the last axis) of xx, weighted by W."
    ww = np.sum( (xx @ inv(W)) * xx, axis=-1)
    return ww

def pdf_GM(xx, b, B):
    "pdf -- Gaussian, Multivariate: N(x | b, B)"
    c = np.sqrt(det(2*np.pi*B))
    return 1/c * np.exp(-0.5*weighted_norm22(xx - b, B))
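
To unpack the broadcasting used in `weighted_norm22`, the following sketch compares it with an explicit loop and with `np.einsum` on a small batch of vectors. The function is restated so that the block runs on its own; the test data (`W`, `vecs`) are arbitrary.

```python
import numpy as np
from numpy.linalg import inv

def weighted_norm22(xx, W):
    "Squared W-weighted norm of each vector on the last axis of xx."
    return np.sum((xx @ inv(W)) * xx, axis=-1)

W = np.array([[2.0, 0.5],
              [0.5, 1.0]])      # arbitrary symmetric positive-definite weight
vecs = np.array([[1.0,  0.0],
                 [0.0,  1.0],
                 [1.0, -2.0]])  # three 2-vectors, one per row

vectorized = weighted_norm22(vecs, W)
looped     = np.array([v @ inv(W) @ v for v in vecs])   # x^T W^{-1} x, one vector at a time
einsummed  = np.einsum('...i,ij,...j->...', vecs, inv(W), vecs)

print(vectorized)
print(np.allclose(vectorized, looped), np.allclose(vectorized, einsummed))
```

The vectorized form matters for the contour plot, where the input is a whole grid of 2-vectors rather than a single one.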

The following code plots the pdf as contour (iso-density) curves.

In [None]:
XX, YY = np.meshgrid(xx, xx)
grid = np.dstack((XX, YY))

@ws.interact(corr=(0, 1, .05), var1=(0.1**2, 10**2))
def plot_Gaussian_contours(corr=0.7, var1=1):
    var2 = 1
    cov12 = np.sqrt(var1 * var2) * corr
    Cov = B * np.array([[var1  , cov12],
                        [cov12 , var2]])
    # Eval
    pp = pdf_GM(grid, b=0, B=Cov)

    # Plot
    plt.figure(figsize=(4, 4))
    plt.contour(XX, YY, pp)
    plt.axis('equal');
    plt.show()

**Exc 2.7:** How do the contours look? Try to understand why. Cases:
 * (a) correlation=0.    
 * (b) correlation=0.99.
 * (c) correlation=0.5. (Note that we've used `plt.axis('equal')`).
 * (d) correlation=0.5, but with non-equal variances.
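
One way to reason about the contour shapes is through the eigendecomposition of the covariance matrix: the eigenvectors give the directions of the ellipse axes, and the square roots of the eigenvalues are proportional to their lengths. Below is a minimal sketch for case (c), assuming unit variances (the overall factor `B` used in the plotting code only rescales the ellipse).

```python
import numpy as np

corr, var1, var2 = 0.5, 1.0, 1.0
cov12 = np.sqrt(var1 * var2) * corr
Cov = np.array([[var1,  cov12],
                [cov12, var2]])

# eigh is intended for symmetric matrices; eigenvalues are returned in ascending order
eigvals, eigvecs = np.linalg.eigh(Cov)

print(eigvals)           # variances along the principal axes
print(eigvecs)           # columns are the corresponding axis directions
print(np.sqrt(eigvals))  # relative lengths of the ellipse semi-axes
```

Changing `corr`, `var1`, or `var2` and re-running shows how the axes rotate and stretch in the other cases.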

**Exc 2.8:** Play the [correlation game](http://guessthecorrelation.com/) (doesn't work right in Chrome) until you get a score (shown as gold coins) of 5 or more.

**Exc 2.9:**
* What's the difference between correlation and covariance?
* What's the difference between correlation (or covariance) and dependence?
* Does correlation imply causation?
* Can you use correlation in making predictions?
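
As a numerical prompt for the first two questions, the sketch below uses synthetic samples (the sample size, seed, and variable names are arbitrary choices). It rescales a variable to show how covariance and correlation respond, and it constructs a variable that is fully determined by `x` yet nearly uncorrelated with it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)
y = 2 * x + rng.standard_normal(x.size)  # linearly related to x, plus noise
z = x**2                                 # a deterministic function of x

cov_xy    = np.cov(x, y)[0, 1]
corr_xy   = np.corrcoef(x, y)[0, 1]
cov_x10y  = np.cov(10 * x, y)[0, 1]      # same data, x expressed in different units
corr_x10y = np.corrcoef(10 * x, y)[0, 1]
corr_xz   = np.corrcoef(x, z)[0, 1]

print(cov_xy, cov_x10y)    # covariance changes with the units
print(corr_xy, corr_x10y)  # correlation does not
print(corr_xz)             # close to zero, although z depends entirely on x
```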

# Bayes' rule
Bayes' rule is how we do inference.  
It defines how we should update our prior (quantitative belief) about $x$  
when given an observation $y$ that is somehow related to $x$.


For continuous random variables, $x$ and $y$, it reads:

$$\begin{align}
p(x|y) = \frac{p(x) \, p(y|x)}{p(y)} \, , \tag{2}
\end{align}$$

or, in words:


$$
\text{"posterior" (pdf of $x$ given $y$)}
\; = \;
\frac{\text{"prior" (pdf of $x$)}
\; \times \;
\text{"likelihood" (pdf of $y$ given $x$)}}
{\text{"normalization" (pdf of $y$)}} \, .
$$

In [None]:
#ws.show_example('BR')

**Exc 2.10:** Derive Bayes' rule from the definition of [conditional pdf's](https://en.wikipedia.org/wiki/Conditional_probability_distribution#Conditional_continuous_distributions).

In [None]:
#ws.show_answer('BR derivation')

<em>Exercises marked with an asterisk (*) are optional.</em>

**Exc 2.11*:** Shortly after the Reverend T. Bayes, P. S. Laplace independently (and more clearly) developed Bayes' rule, which he published in 1774. Some time thereafter, what we now call "statistical inference" came to be known as the reasoning of "inverse probability". Nowadays, "inverse problems" are often given a statistical interpretation. Considering this context, why do you think we use $x$ for the "unknown", and $y$ for the known/given/fixed data?

In [None]:
#ws.show_answer('inverse')

Computers generally work with discrete, numerical representations of mathematical entities.
Numerically, pdfs may be represented by their `values` on a grid, such as `xx` from above. Bayes' rule (2) then consists of *grid-point-wise* multiplication, as shown below.

In [None]:
def Bayes_rule(prior_values, lklhd_values, dx):
    "Numerical (pointwise) implementation of Bayes' rule."
    pp = prior_values * lklhd_values   # pointwise multiplication
    posterior_values = pp/(np.sum(pp)*dx) # normalization
    return posterior_values
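
As a quick check of `Bayes_rule`, the sketch below applies it to two Gaussian curves on a grid and confirms that the result integrates to (approximately) one. The function is repeated so that the block runs on its own; `gauss_pdf` is a hypothetical stand-in for `pdf_G1`, and the parameter values are arbitrary.

```python
import numpy as np

def gauss_pdf(x, b, B):
    "Scalar Gaussian pdf, eqn (G1). Hypothetical stand-in for pdf_G1."
    return np.exp(-0.5 * (x - b)**2 / B) / np.sqrt(2 * np.pi * B)

def Bayes_rule(prior_values, lklhd_values, dx):
    "Numerical (pointwise) implementation of Bayes' rule."
    pp = prior_values * lklhd_values   # pointwise multiplication
    return pp / (np.sum(pp) * dx)      # normalization

xx = np.linspace(-20, 20, 201)  # grid
dx = xx[1] - xx[0]              # grid spacing

prior = gauss_pdf(xx, 0, 25)    # N(x | b=0, B=25)
lklhd = gauss_pdf(4.0, xx, 1)   # N(y=4 | x, R=1), viewed as a function of x
postr = Bayes_rule(prior, lklhd, dx)

area = np.sum(postr) * dx
print(area)                     # 1, up to round-off, by construction
print(xx[np.argmax(postr)])     # grid point nearest the posterior mode
```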

The code below shows Bayes' rule in action.  
Again, remember that the only thing it's doing is multiplying the `prior value` and `likelihood value` at each gridpoint.  
Move the sliders with the arrow keys to animate it.

In [None]:
# Fix the prior's parameters
b = 0 # mean
B = 1 # variance

@ws.interact(y=(-10, 10, 1), R=(0.01, 20, 0.2))
def animate_Bayes(y=4.0, R=1):
    prior_vals = pdf_G1(xx, b, B)
    lklhd_vals = pdf_G1(y, xx, R)
    
    postr_vals = Bayes_rule(prior_vals, lklhd_vals, xx[1]-xx[0])

    plt.figure(figsize=(10, 4))
    plt.plot(xx, prior_vals, label=r'prior $\mathcal{N}(x | b, B)$')
    plt.plot(xx, lklhd_vals, label=r'likelihood $\mathcal{N}(y | x, R)$')
    plt.plot(xx, postr_vals, label='posterior - pointwise')
    
    ### Uncomment this block AFTER doing Exc 2.24 ###
    # xhat, P = Bayes_rule_G1(b, B, y, R)
    # postr_vals2 = pdf_G1(xx, xhat, P)
    # plt.plot(xx, postr_vals2, '--', label='posterior - parametric\n' r'$\mathcal{N}(x|\hat{x}, P)$')
    
    plt.ylim(ymax=0.6)
    plt.legend()
    plt.show()

**Exc 2.12:** This exercise serves to make you acquainted with how Bayes' rule blends information.  
Move the sliders to see what happens, and answer the following:
 * What happens to the posterior when $R \rightarrow \infty$ ?
 * What happens to the posterior when $R \rightarrow 0$ ?
 * Move around $y$. What is the posterior's location (mean/mode) when $R = B$ ?
 * Does the posterior scale (width) depend on $y$?  
   What does this mean [information-wise](https://en.wikipedia.org/wiki/Differential_entropy#Differential_entropies_for_various_distributions)?
 * Consider the shape (ignoring location & scale) of the posterior. Does it depend on $R$ or $y$?
 * Can you see a shortcut to computing this posterior rather than having to do the pointwise multiplication?

In [None]:
#ws.show_answer('Posterior behaviour')

**Exc 2.14:** Show that the normalization in `Bayes_rule()` amounts to the same as dividing by $p(y)$.

In [None]:
#ws.show_answer('BR normalization')

In fact, since $p(y)$ is implicitly known,
we often don't bother to write it down, simplifying Bayes' rule (2) to
$$\begin{align}
p(x|y) \propto p(x) \, p(y|x) \, .  \tag{3}
\end{align}$$
Indeed, do we even need to care about $p(y)$ at all? All we really need to know is how much more likely some possible $x = a$ (or an interval around $a$) is compared to any other value of $x$. There is no additional information in $p(y)$, as reflected in the fact that it is implicitly determined by the integral $\int p(x) \, p(y|x) \, d x = p(y)$.

And if we want to be really philosophical, we can note that this last equality is true only because of the convention that all densities integrate to $1$. Otherwise it would be a proportionality. And something that holds only by convention cannot contain any additional information.
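
The proportionality (3) can be illustrated numerically: multiplying the likelihood (or the prior) by any positive constant leaves the normalized posterior unchanged. The sketch below works on a grid like the one used earlier; `gauss_pdf` and `normalize` are hypothetical helpers, and the parameter values are arbitrary. The closed form used for comparison, $p(y) = \mathcal{N}(y \mid b, B+R)$, is a standard result for the Gaussian case that is not derived in this tutorial.

```python
import numpy as np

def gauss_pdf(x, b, B):
    "Scalar Gaussian pdf, eqn (G1). Hypothetical helper."
    return np.exp(-0.5 * (x - b)**2 / B) / np.sqrt(2 * np.pi * B)

def normalize(values, dx):
    "Rescale grid values so that they integrate (Riemann sum) to one."
    return values / (np.sum(values) * dx)

xx = np.linspace(-20, 20, 201)
dx = xx[1] - xx[0]
b, B, y, R = 0.0, 1.0, 4.0, 1.0

prior = gauss_pdf(xx, b, B)
lklhd = gauss_pdf(y, xx, R)

post_a = normalize(prior * lklhd, dx)
post_b = normalize(prior * (1234.5 * lklhd), dx)  # arbitrarily rescaled likelihood
print(np.allclose(post_a, post_b))

# The constant that normalization removes is the grid approximation of p(y)
p_y = np.sum(prior * lklhd) * dx
print(p_y, gauss_pdf(y, b, B + R))
```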

PS1: In some cases (not of our concern here) Bayes' rule is applied to random variables where $y$ is not a given constant. In this case one must of course also keep track of $p(y)$.

PS2: There are methods where $p(x|y)$ is not known (has not been evaluated) for all $x$, but only at a few points.
In these methods, estimation of the normalization factor becomes an important question too.

**Exc 2.15*:** 
* (a) Implement a "uniform" (or "flat" or "box") distribution pdf and call it `pdf_U1(x, b, B)`. These <a href="https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)#Moments">formulae</a> for its mean/variance will be useful. In the above animations, replace `pdf_G1` with your new `pdf_U1` (both for the prior and likelihood). Ensure that everything is working correctly. 

In [None]:
#ws.show_answer('pdf_U1')

* (b) 
 - Why (in the figure) are the walls of the pdf (ever so slightly) inclined?
 - What happens when you move the prior and likelihood too far apart? Is it the fault of the implementation, the math, or the problem statement?

In [None]:
#ws.show_answer('BR U1')

* (c)*:
 - Re-do Exc 2.12, now with `pdf_U1`.
* (d)*:
 - Now test a Gaussian prior with a uniform likelihood.

<mark><font size="-1">
<b>NB:</b> At the end of this exercise, restore `pdf_G1` (both the prior and likelihood) in the above animation (for later use). 
</font></mark>


### Gaussian-Gaussian Bayes

The above animation shows Bayes' rule in 1 dimension. Previously, we saw how a Gaussian looks in 2 dimensions. Can you imagine how Bayes' rule looks in 2 dimensions? In higher dimensions, these things get difficult to imagine, let alone visualize.

Similarly, the size of the calculations required for Bayes' rule poses a difficulty. Indeed, the following exercise shows that (pointwise) multiplication for all grid points becomes preposterous in high dimensions.

**Exc 2.16:**
 * (a) How many point-multiplications are needed on a grid with $N$ points in $M$ dimensions? (Imagine an $M$-dimensional cube where each side has a grid with $N$ points on it)
 * (b) Suppose we model 15 physical quantities, on each grid point, on a discretized surface model of Earth. Assume the resolution is $1^\circ$ for latitude (110 km) and $1^\circ$ for longitude. How many variables are there in total? This is the dimensionality ($M$) of the problem.
 * (c) Suppose each variable has a pdf represented on a grid using only $N=10$ points. How many multiplications are necessary to calculate Bayes' rule (jointly) for all variables on our Earth model?

In [None]:
#ws.show_answer('Dimensionality a')
#ws.show_answer('Dimensionality b')
#ws.show_answer('Dimensionality c')

In response to this computational difficulty, we try to be smart and do something more analytical ("pen-and-paper"): we only compute the parameters (mean and (co)variance) of the posterior pdf.

This is doable and quite simple in the Gaussian-Gaussian case:  
With a prior $p(x) = \mathcal{N}(x \mid b,B)$ and a likelihood $p(y|x) = \mathcal{N}(y \mid x,R)$,
the posterior is
$$\begin{align}
p(x|y)
&= \mathcal{N}(x \mid \hat{x},P) \tag{4} \, ,
\end{align}$$
where, in the univariate (1-dimensional) case:
$$\begin{align}
    P &= 1/(1/B + 1/R) \, , \tag{5} \\\
  \hat{x} &= P(b/B + y/R) \, .  \tag{6} 
\end{align}$$
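
As a worked instance of (5) and (6), take the illustrative values $b = 0$, $B = 4$, $y = 5$, $R = 1$ (chosen here only for easy arithmetic):
$$\begin{align}
    P &= \frac{1}{1/4 + 1/1} = \frac{4}{5} \, , \\\
  \hat{x} &= \frac{4}{5} \Big( \frac{0}{4} + \frac{5}{1} \Big) = 4 \, .
\end{align}$$
The posterior variance is smaller than both $B$ and $R$, and the posterior mean lies between $b$ and $y$, closer to $y$ because the observation is the more precise of the two ($R < B$).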

The multivariate case is discussed in a later tutorial; for now, try to tackle Exc 2.18.

#### Exc  2.18 'Gaussian Bayes':
Derive the above expressions for $P$ and $\hat{x}$
from Bayes' rule (3) and the expression for a Gaussian pdf (G1).

In [None]:
#ws.show_answer('BR Gauss')

**Exc 2.20:** Algebra exercise: Show that $P = K R$, where
$$K = B/(B+R) \,,    \tag{7}$$
is called the "Kalman gain".
Then show that eqns (5) and (6) can be rewritten as
$$\begin{align}
    P &= (1-K)B \, ,  \tag{8} \\\
  \hat{x} &= b + K (y-b) \tag{9} \, ,
\end{align}$$
*Hint: For eqn (8), begin from the right-hand side.*
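
Before doing the algebra, it can be reassuring to confirm numerically that the two formulations agree. The sketch below evaluates eqns (5)-(6) and (7)-(9) for randomly drawn inputs; the variable names, seed, and sampling ranges are arbitrary, and agreement for random inputs is evidence rather than a proof.

```python
import numpy as np

rng = np.random.default_rng(42)
b, y = rng.normal(size=2) * 5        # arbitrary prior mean and observation
B, R = rng.uniform(0.1, 10, size=2)  # arbitrary positive variances

# Eqns (5) and (6)
P_a    = 1 / (1/B + 1/R)
xhat_a = P_a * (b/B + y/R)

# Eqns (7), (8) and (9)
K      = B / (B + R)
P_b    = (1 - K) * B
xhat_b = b + K * (y - b)

print(np.isclose(P_a, K * R))        # P = K R
print(np.isclose(P_a, P_b))          # (5) agrees with (8)
print(np.isclose(xhat_a, xhat_b))    # (6) agrees with (9)
```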

**Exc 2.22*:** Consider the formula for $K$ and its role in the previous couple of equations. Why do you think $K$ is called a "gain"?

In [None]:
#ws.show_answer('KG intuition')

**Exc 2.24:** Implement a Gaussian-Gaussian Bayes' rule (eqns 5 and 6, or eqns 8 and 9) by completing the code below.

In [None]:
def Bayes_rule_G1(b, B, y, R):
    ### INSERT ANSWER HERE ###
    return xhat, P

In [None]:
#ws.show_answer('BR Gauss code')

**Exc 2.26:** Go back to the above animation code, and uncomment the block that uses `Bayes_rule_G1()`. Re-run.  
Make sure its curve coincides with that which uses pointwise multiplication (i.e. `Bayes_rule()`).
This is the main secret of the "Kalman filter".
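
A non-interactive version of this comparison might look as follows. It is a sketch under the same assumptions as the animation (scalar Gaussian prior and likelihood on a uniform grid); `gauss_pdf` is a hypothetical stand-in for `pdf_G1`, and the parametric posterior is written out inline from eqns (5) and (6).

```python
import numpy as np

def gauss_pdf(x, b, B):
    "Scalar Gaussian pdf, eqn (G1). Hypothetical stand-in for pdf_G1."
    return np.exp(-0.5 * (x - b)**2 / B) / np.sqrt(2 * np.pi * B)

b, B = 0.0, 1.0   # prior mean and variance
y, R = 4.0, 1.0   # observation and its error variance

xx = np.linspace(-20, 20, 201)
dx = xx[1] - xx[0]

# Pointwise (grid-based) posterior
pp = gauss_pdf(xx, b, B) * gauss_pdf(y, xx, R)
postr_grid = pp / (np.sum(pp) * dx)

# Parametric posterior, eqns (5) and (6)
P    = 1 / (1/B + 1/R)
xhat = P * (b/B + y/R)
postr_param = gauss_pdf(xx, xhat, P)

max_diff = np.max(np.abs(postr_grid - postr_param))
print(max_diff)   # expected to be tiny compared with the pdf values
```

The parametric route needs only two numbers, whatever the grid size, which is what makes it attractive in high dimensions.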

**Exc 2.30*:** Why are we so fond of the Gaussian assumption?

In [None]:
#ws.show_answer('Why Gaussian')

### Next: [Univariate (scalar) Kalman filtering](T3%20-%20Univariate%20Kalman%20filtering.ipynb)