## Gaussians

In class, we derived (or will derive) the posterior and predictive distributions for the Gaussian-Gaussian model: a Gaussian likelihood with unknown mean and known variance, together with a Gaussian prior on that mean whose own mean and variance are known. This model can be written in generative process notation\footnote{I try to use the following convention for variables: scalar variables are lowercase and italics (e.g., $x$), indices tend to be $n$, $t$, $c$ or $k$, the largest index is uppercase (e.g., $N$), vector variables are bold and lowercase (e.g., ${\bf x}$), and matrix variables are bold and uppercase (e.g., ${\bf X}$).} as:

$\mu \sim {\rm N}(\mu_0, \sigma^2_0)$
$x_1,\dotsc, x_N | \mu, \sigma_x^2 \overset{iid}{\sim} {\rm N}(\mu, \sigma^2_x)$

Remember that $iid$ means *independent* and *identically distributed*, and the generative process notation $x | \mu, \sigma_x^2 \sim {\rm N}(\mu, \sigma_x^2)$ means that, given the values of the parameters $\mu$ and $\sigma_x^2$, $x$ is normally distributed with mean $\mu$ and variance $\sigma_x^2$. So, this means
$p(x|\mu,\sigma_x^2) = \frac{1}{\sigma_x \sqrt{2 \pi}} e^{-\frac{1}{2 \sigma_x^2}\left(x-\mu \right)^2}$
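
As a quick numerical check of the density above, here is a minimal sketch in plain Python. The function name `gaussian_pdf` and its argument names are illustrative choices, not part of the assignment. Note that it takes the variance $\sigma_x^2$, not the standard deviation.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of N(mu, var) at x, where `var` is the variance (not the std)."""
    if var <= 0:
        raise ValueError("variance must be positive")
    # 1 / (sigma * sqrt(2*pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
```

The density is symmetric about `mu` and peaks there, so `gaussian_pdf(mu, mu, var)` gives the height of the bell for that variance.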

Note that it is traditional to use a zero subscript for the parameters of a prior distribution. For this model, there are closed-form solutions for the posterior distribution, $p(\mu | x_1, \dotsc, x_N)$, and the predictive distribution, $p(x_{N+1} | x_1,\dotsc,x_N)$. Models like this one, whose posterior distribution has the same form as the prior distribution, are called conjugate, and you can always look them up on Wikipedia: [Conjugate Prior](https://en.wikipedia.org/wiki/Conjugate_prior). The two distributions are

$
\begin{align*}
{\rm Posterior:} & &  \mu & | x_1,\dotsc, x_N \sim {\rm N} \left( \frac{\mu_0 \sigma_0^{-2} + \sigma_x^{-2} \sum_{n=1}^N{x_n}   } {\sigma_0^{-2} + N \sigma_x^{-2} }, \left[ \sigma_0^{-2} + N \sigma_x^{-2} \right]^{-1}  \right)  \\
{\rm Prediction:} & & x_{N+1} & | x_1, \dotsc, x_N \sim {\rm N} \left( \frac{ \mu_0 \sigma_0^{-2} + \sigma_x^{-2} \sum_{n=1}^N{x_n}  }{\sigma_0^{-2} + N \sigma_x^{-2} },  \left[ \sigma_0^{-2} + N \sigma_x^{-2} \right]^{-1} + \sigma_x^2 \right)
\end{align*}
$
So, the predictive distribution has the same mean as the posterior distribution, but its variance is larger by exactly $\sigma_x^2$. For this problem, use $\mu_0=0$ and $\sigma_0^2 = 1$. We will explore how the number of data points and the variance of the likelihood affect the posterior and predictive distributions.
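
The closed-form update above can be sketched as a small helper. This is one possible implementation under the stated model, not a required solution. The name `gaussian_posterior_predictive` and its arguments (`var_x` for $\sigma_x^2$, `var0` for $\sigma_0^2$) are assumptions made for illustration, and the defaults follow the $\mu_0=0$, $\sigma_0^2=1$ setting used in this problem.

```python
def gaussian_posterior_predictive(xs, var_x, mu0=0.0, var0=1.0):
    """Return ((post_mean, post_var), (pred_mean, pred_var)) for the
    Gaussian-Gaussian model with known likelihood variance `var_x`."""
    if var_x <= 0 or var0 <= 0:
        raise ValueError("variances must be positive")
    n = len(xs)
    # Posterior precision: prior precision plus N times the likelihood precision.
    precision = 1.0 / var0 + n / var_x
    post_var = 1.0 / precision
    # Posterior mean: precision-weighted combination of prior mean and data sum.
    post_mean = (mu0 / var0 + sum(xs) / var_x) * post_var
    # The predictive keeps the posterior mean and adds the observation noise.
    pred_var = post_var + var_x
    return (post_mean, post_var), (post_mean, pred_var)
```

For example, with a single hypothetical observation $x_1 = 1$ and $\sigma_x^2 = 1$, `gaussian_posterior_predictive([1.0], var_x=1.0)` returns `((0.5, 0.5), (0.5, 1.5))`: the posterior mean sits halfway between the prior mean and the datum because the prior and likelihood precisions are equal. With no data, the posterior reduces to the prior.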


### 1. (a) Prior
To provide a baseline, turn in a plot of the prior distribution. Please make sure your plot captures the "interesting" part of the distribution (i.e., the two ends of the x-axis fall in the tails, and the width and maximum of the bell are clearly visible).
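
One way to pick an x-axis that shows both tails is to span a few standard deviations on either side of the mean. The sketch below assumes a span of four standard deviations, which is a common rule of thumb rather than a requirement; `plot_range` and its parameters are hypothetical names.

```python
import math


def plot_range(mean, var, n_std=4.0, num=401):
    """Evenly spaced x values covering mean +/- n_std standard deviations."""
    if var <= 0:
        raise ValueError("variance must be positive")
    if num < 2:
        raise ValueError("need at least two grid points")
    std = math.sqrt(var)
    lo = mean - n_std * std
    hi = mean + n_std * std
    step = (hi - lo) / (num - 1)
    return [lo + i * step for i in range(num)]
```

The resulting list can be passed as the x values to whichever plotting library you use. When several distributions share one figure, a range built from the widest of them keeps every bell fully visible.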

In [None]:
#fill me

### 1. (b) One Datum update
Calculate and plot the posterior and predictive distributions after observing $x_1=2$ for $\sigma_x^2 = 0.25$ and $\sigma_x^2=4$ (that is 4 different distributions: the posterior and predictive for $\sigma_x^2=0.25$ and the posterior and predictive for $\sigma_x^2=4$). How does changing the variance of the likelihood affect the distributions? Are there any differences? Why?

In [None]:
#fill me

### 1. (c) Multiple data update
Calculate and plot the posterior and predictive distributions given $(x_1,\dotsc, x_5) = (2.1, 2.5, 1.4, 2.2, 1.8)$ for $\sigma_x^2 = 0.25$ and $\sigma_x^2 = 4$. How does this compare to the previous example? Note that the average of these data points is 2, so both data sets have the same sample mean. For the distributions that differ from the previous example, why do they differ? For those that do not, why don't they?


In [None]:
#fill me