1. **Restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart).
2. **Run all cells** (in the menubar, select Cell$\rightarrow$Run All).
3. __Use the__ `Validate` __button in the Assignments tab before submitting__.

__Include comments, derivations, explanations, graphs, etc.__ 

You __work in groups__ of 3 people. __Write the full name and S/U-number of all team members!__

---


# Assignment 2 (Statistical Machine Learning 2024)
# **Deadline: 18 October 2024**

## Instructions
* Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE` __including comments, derivations, explanations, graphs, etc.__ 
Elements and/or intermediate steps required to derive the answer have to be in the report. If an exercise requires coding, explain briefly what the code does (in comments). All figures should have titles (descriptions), axis labels, and legends.
* __Please use LaTeX to write down equations/derivations/other math__! How to do that in Markdown cells can be found [here](https://www.fabriziomusacchio.com/blog/2021-08-10-How_to_use_LaTeX_in_Markdown/), a starting point for various symbols is [here](https://www.overleaf.com/learn/latex/Mathematical_expressions).
* Please do __not add new cells__ to the notebook, try to write the answers only in the provided cells. Before you turn the assignment in, make sure everything runs as expected.
* __Use the variable names given in the exercises__, do not assign your own variable names. 
* __Only one team member needs to upload the solutions__. This can be done under the Assignments tab, where you fetched the assignments, and where you can also validate your submissions. Please do not change the filenames of the individual Jupyter notebooks.

For any problems or questions regarding the assignments, ask during the tutorial or send an email to charlotte.cambiervannooten@ru.nl and janneke.verbeek@ru.nl .

## Introduction
Assignment 2 consists of:
1. Classification and decision theory (30 points),
2. Bayesian linear regression (20 points),
3. __Sequential learning (50 points)__.

## Libraries

Please __avoid installing new packages__, unless really necessary.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import scipy.stats as ss

# Set fixed random seed for reproducibility
np.random.seed(2022)

## Exercise 3 - Sequential Learning (50 points)
### Part 1: Obtaining the prior
Consider a four dimensional variable $[x_1, x_2, x_3, x_4]^T$, distributed according to a multivariate Gaussian with mean $\tilde{\mathbf{\mu}} = [1,0,1,2]^T$ and covariance matrix $\tilde{\mathbf{\Sigma}}$ given as
\begin{equation}
    \tilde{\mathbf{\Sigma}} =
    \left(\begin{array}{cc|cc} 
    0.14 & -0.3 & 0.0 & 0.2 \\ 
    -0.3 & 1.16 & 0.2 & -0.8 \\ \hline 
    0.0 & 0.2 & 1.0 & 1.0 \\ 
    0.2 & -0.8 & 1.0 & 2.0 \end{array}\right)
    \label{mat}
    \tag{2}
\end{equation}
We are interested in the conditional distribution over $[x_1, x_2]^T$, given that $x_3 = x_4 = 0$. We know this conditional distribution will also take the form of a Gaussian:
\begin{equation}     \label{prior}
    p\big([x_1,x_2]^T \,|\, x_3 = x_4 = 0 \big) = \mathcal{N}([x_1,x_2]^T | \mathbf{\mu}_p, \mathbf{\Sigma}_{p})
    \tag{3}
\end{equation}
for which the mean and covariance matrix are most easily expressed in terms of the (partitioned) precision matrix (see Bishop, $\S2.3.1$).
#### Part 1.1
Use the partitioned precision matrix $\tilde{\mathbf{\Lambda}} = \tilde{\mathbf{\Sigma}}^{-1}$ to give an explicit expression for the mean $\mathbf{\mu}_p$ and covariance matrix $\mathbf{\Sigma}_p$ of this distribution and calculate their values. (This distribution will be taken as the _prior_ information for the rest of this exercise, hence the subscript $p$). You may use `np.linalg.inv` to calculate matrix inverses.
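
As a rough illustration of the precision-matrix route, the sketch below conditions a made-up 3D Gaussian on its last component. The helper name `conditional_gaussian`, the index arguments, and all numbers are assumptions for illustration only; they are not the values of this exercise, and the sketch uses 1-D arrays rather than column vectors.

```python
import numpy as np

def conditional_gaussian(mu, Sigma, idx_a, idx_b, x_b):
    """Mean and covariance of p(x_a | x_b) for a joint Gaussian N(mu, Sigma).

    Hypothetical helper: idx_a / idx_b are index lists selecting the free and
    the observed components, x_b holds the observed values.
    """
    Lambda = np.linalg.inv(Sigma)               # joint precision matrix
    Lambda_aa = Lambda[np.ix_(idx_a, idx_a)]    # precision block of the free variables
    Lambda_ab = Lambda[np.ix_(idx_a, idx_b)]    # cross block free/observed
    Sigma_cond = np.linalg.inv(Lambda_aa)       # conditional covariance
    mu_cond = mu[idx_a] - Sigma_cond @ Lambda_ab @ (x_b - mu[idx_b])
    return mu_cond, Sigma_cond

# Toy 3D example (made-up numbers, NOT the matrix from equation (2))
mu_toy = np.array([0.0, 1.0, -1.0])
Sigma_toy = np.array([[2.0, 0.5, 0.3],
                      [0.5, 1.0, 0.2],
                      [0.3, 0.2, 1.5]])
mu_c, Sigma_c = conditional_gaussian(mu_toy, Sigma_toy, [0, 1], [2], np.array([0.5]))
print(mu_c)
print(Sigma_c)
```

The same result can be cross-checked against the covariance-based (Schur complement) form of the conditional, which is a useful sanity test for your own derivation.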

YOUR ANSWER HERE

Please also provide the values of $\mathbf{\mu}_p$ and $\mathbf{\Sigma}_p$ in code.

In [None]:
"""
Calculate the mean and covariance. Note: mu is a column vector.
mu_p : array
    The mean.
Sigma_p : matrix
    The covariance.
"""
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""
Hidden test for checking the value of mu_p and Sigma_p.
"""

#### Part 1.2
Generate random number pairs distributed according to the distribution in (3).
Initialize your random generator and then draw a _single_ pair
    \begin{equation}
    \mathbf{\mu}_t = [\mu_{t_1}, \mu_{t_2}]^T
    \label{mu_t}
    \tag{4}
\end{equation}
from this distribution. This will be the "true" mean, hence the subscript $t$. Draw 100 more pairs from the same distribution and plot them together with $\mathbf{\mu}_t$ to see where the "true" mean falls within the prior distribution.

**Hint**: You can use the function `np.random.multivariate_normal`.
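
A minimal sketch of how the hinted function can be called, assuming placeholder parameters: `mean_demo`, `cov_demo`, and the seed value are made up for illustration and should be replaced by the quantities of the exercise (the notebook already sets its own seed in the Libraries cell).

```python
import numpy as np

np.random.seed(0)  # illustrative seed only

mean_demo = np.array([0.0, 0.0])          # placeholder mean
cov_demo = np.array([[1.0, 0.3],
                     [0.3, 0.5]])         # placeholder covariance (symmetric, positive definite)

# Without `size`, a single sample of shape (2,) is returned
single_draw = np.random.multivariate_normal(mean_demo, cov_demo)

# With `size=100`, one sample per row is returned: shape (100, 2)
many_draws = np.random.multivariate_normal(mean_demo, cov_demo, size=100)

print(single_draw.shape, many_draws.shape)
```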

In [None]:
"""
Plot of the randomly generated number pairs and the "true" mean.
"""
# YOUR CODE HERE
raise NotImplementedError()

#### Part 1.3
Make a surface plot of the probability density of the distribution (3).

**Hint**: use the function `ss.multivariate_normal` to calculate the probability density of a multivariate Gaussian random variable. The functions `np.mgrid` and `Axes3D.plot_surface` may also prove useful.
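
The following sketch shows one way the three hinted functions might fit together. The mean, covariance, grid range, and resolution are placeholder assumptions, not the values of distribution (3).

```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as ss

mean_demo = np.array([0.0, 0.0])          # placeholder mean
cov_demo = np.array([[1.0, 0.3],
                     [0.3, 0.5]])         # placeholder covariance

# 60 x 60 grid over [-3, 3] x [-3, 3]; the complex step 60j means "60 points"
X, Y = np.mgrid[-3:3:60j, -3:3:60j]
grid = np.dstack((X, Y))                  # shape (60, 60, 2): one (x1, x2) pair per grid node

# Evaluate the density at every grid node; result has shape (60, 60)
Z = ss.multivariate_normal(mean=mean_demo, cov=cov_demo).pdf(grid)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("$x_1$")
ax.set_ylabel("$x_2$")
ax.set_zlabel("density")
ax.set_title("Density of a placeholder 2D Gaussian")
```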

In [None]:
"""
Plot of the probability density.
"""
# YOUR CODE HERE
raise NotImplementedError()

### Part 2: Generating the data
Here we assume we are dealing with a 2D-Gaussian data generating process 
\begin{equation}
p(\mathbf{x}) = \mathcal{N}(\mathbf{x} | \mathbf{\mu}, \mathbf{\Sigma}) \label{data} \tag{5}
\end{equation}
For the mean $\mathbf{\mu}$, we will use the value $\mathbf{\mu}_t$ drawn in (4) to _generate_ the data. Subsequently, we will pretend that we do not know this "true" value $\mathbf{\mu}_t$ of $\mathbf{\mu}$, and estimate $\mathbf{\mu}$ from the data. For the covariance matrix $\mathbf{\Sigma}$ we will use the "true" value
\begin{equation}
    \mathbf{\Sigma}_t = \left(\begin{array}{cc} 2.0 & 0.8 \\ 0.8 & 4.0 \end{array} \right) \label{VarD} \tag{6}
\end{equation}
to generate the data.
#### Part 2.1
Generate at least 1000 data pairs $\{x_i, y_i\}$, distributed according to equation (5) with $\mathbf{\mu} = \mathbf{\mu}_t$ and $\mathbf{\Sigma} = \mathbf{\Sigma}_t$. Make a scatter plot of these noisy observations and superimpose the plot from Part 1.2 (the prior distribution of the mean) for additional context.

In [None]:
"""
Generate 1000 data pairs.
"""
# YOUR CODE HERE
raise NotImplementedError()

#### Part 2.2
From now on, we will assume (pretend) that the "true" mean $\mathbf{\mu}_t$ is unknown and estimate $\mathbf{\mu}$ from the data. Calculate the maximum likelihood estimates $\mathbf{\mu}_{\mathrm{ML}}$ and $\mathbf{\Sigma}_{\mathrm{ML}}$ for the data, and also an unbiased estimate of $\mathbf{\Sigma}$ (see Bishop, $\S2.3.4$).
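
As a hedged illustration of the difference between the two covariance estimates, the sketch below computes both on stand-in data. The array `X_demo` and all `_demo` names are assumptions; substitute the data generated in Part 2.1.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))        # stand-in data, one observation per row
N = X_demo.shape[0]

mu_ml_demo = X_demo.mean(axis=0)          # sample mean
centered = X_demo - mu_ml_demo            # deviations from the sample mean

Sigma_ml_demo = centered.T @ centered / N              # ML estimate: divides by N (biased)
Sigma_unbiased_demo = centered.T @ centered / (N - 1)  # unbiased estimate: divides by N - 1

print(mu_ml_demo)
print(Sigma_ml_demo)
print(Sigma_unbiased_demo)
```

The two covariance estimates differ only by the factor $N/(N-1)$, so the gap shrinks as the sample grows.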

In [None]:
"""
Calculate the maximum likelihood estimate.
"""
# YOUR CODE HERE
raise NotImplementedError()

Compare the estimates to the true values $\mathbf{\mu}_t$ and $\mathbf{\Sigma}_t$.

YOUR ANSWER HERE

### Part 3: Sequential learning algorithms
We will now estimate the mean $\mathbf{\mu}$ from the generated data and the known covariance $\mathbf{\Sigma}_{t}$ _sequentially_, i.e., by considering the data points one-by-one.
#### Part 3.1
Write a procedure that processes the generated data points $\{\mathbf{x}_n\}$ one-by-one, and after each step computes an updated value of $\mathbf{\mu}_{\mathrm{ML}}$, the maximum likelihood estimate of the mean (using Bishop, eq. 2.126).
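
A minimal sketch of a running-mean update of the form $\mathbf{\mu}^{(n)} = \mathbf{\mu}^{(n-1)} + \frac{1}{n}\big(\mathbf{x}_n - \mathbf{\mu}^{(n-1)}\big)$, applied to stand-in data. The function name `sequential_mean` and the array `points_demo` are hypothetical; check the update rule against Bishop's equation before relying on it.

```python
import numpy as np

def sequential_mean(points):
    """Return the running mean after each point; row n-1 holds the estimate after n points."""
    mu = np.zeros(points.shape[1])
    history = np.empty_like(points, dtype=float)
    for n, x_n in enumerate(points, start=1):
        mu = mu + (x_n - mu) / n          # correct the old estimate by 1/n of the error
        history[n - 1] = mu
    return history

rng = np.random.default_rng(0)
points_demo = rng.normal(size=(300, 2))   # stand-in data, NOT the exercise data
mean_history = sequential_mean(points_demo)
print(mean_history[-1])
```

Storing the whole history, rather than only the final value, makes the convergence plots of Part 3.4 straightforward.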

In [None]:
"""
Calculate the maximum likelihood estimate of the mean.
"""
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

Now we also use the prior information $p(\mathbf{\mu}) = \mathcal{N}(\mathbf{\mu} | \mathbf{\mu}_p, \mathbf{\Sigma}_p)$. From the prior, the generated data and the known covariance $\mathbf{\Sigma}_t$, we will estimate the mean $\mathbf{\mu}$.
#### Part 3.2
Work out the details of sequential Bayesian inference (see Bishop, eq. 2.144) for the mean $\mathbf{\mu}$. Apply Bayes' theorem as in eqs. 2.113-2.117 at each step $n=1,\dots,N$ to compute the new posterior mean $\mathbf{\mu}^{(n)}$ and covariance $\mathbf{\Sigma}^{(n)}$ from the old posterior mean $\mathbf{\mu}^{(n-1)}$ and covariance $\mathbf{\Sigma}^{(n-1)}$ after a new point $\mathbf{x}_n$ has arrived. Use this updated posterior as the prior in the next step. The first step starts from the original prior (3).

__Note__: Do not confuse the posterior $\mathbf{\Sigma}^{(n)}$ with the known $\mathbf{\Sigma}_t$ of the data generating process. 


__Hints__: Bayes' rule is also valid if earlier acquired information is taken into account, for example if this is earlier seen data $D_{n-1} = \{x_1, \ldots, x_{n-1}\}$. Bayes' rule conditioned on this earlier data is 
$$P(\mu|x_{n},D_{n-1}) \propto P(\mu|D_{n-1}) P(x_{n}|\mu,D_{n-1}).$$
Since $D_{n} = \{x_1, \ldots, x_{n}\}$ this is written more conveniently as
$$P(\mu|D_n) \propto P(\mu|D_{n-1}) P(x_{n}|\mu,D_{n-1}).$$
If given the model parameters $\mu$, the probability distribution of $x_n$ is independent of earlier data $D_{n-1}$, we can further reduce this to
$$P(\mu|D_{n}) \propto P(\mu|D_{n-1}) P(x_{n}|\mu).$$
You should be able to see the relation with (2.144), in particular that the factor between square brackets in (2.144) is to be identified with $P(\mu|D_{n-1})$.

Another important insight is that if $P(\mu|D_{n-1})$ and $P(x_{n}|\mu)$ are of the form (2.113) and (2.114), that is, if $P(\mu|D_{n-1})$ is a Gaussian distribution over $\mu$ with a certain mean and covariance (you are free to give these any name, e.g. $\mu^{(n-1)}$, $\Sigma^{(n-1)}$) and if $P(x_{n}|\mu)$ is also Gaussian with a mean that is linear in $\mu$, then you can use (2.116) and (2.117) to compute the posterior $P(\mu|D_{n})$, which is therefore also Gaussian.
 
It is your task to show this. To do this you have to figure out the mapping of the variables and parameters in the current exercise, i.e., what is the correspondence between $\mu, x_n, \Sigma_t, \mu^{(n-1)}, \Sigma^{(n-1)}$ etc. and $x,\mu,\Lambda, y,A,b,L$. Don't forget that some quantities may be zero and others may be identity matrices.
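
For reference, the general linear-Gaussian result from Bishop $\S2.3.3$ that the hints refer to can be summarised as follows; please verify it against your copy of the book. Note that the symbols below are Bishop's generic ones, not those of this exercise. Given
$$p(x) = \mathcal{N}(x \,|\, \mu, \Lambda^{-1}), \qquad p(y|x) = \mathcal{N}(y \,|\, Ax + b, L^{-1}),$$
the conditional distribution of $x$ given $y$ is
$$p(x|y) = \mathcal{N}\big(x \,|\, \Sigma\{A^T L (y - b) + \Lambda\mu\}, \Sigma\big), \qquad \Sigma = (\Lambda + A^T L A)^{-1}.$$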

YOUR ANSWER HERE

#### Part 3.3
Write a procedure that processes the generated data points $\{\mathbf{x}_n\}$ one-by-one, and after each step computes an updated estimate of $\mathbf{\mu}_{\mathrm{MAP}}$, the maximum of the posterior distribution, using the results of the previous exercise.
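
To illustrate the overall loop structure, here is a sketch of the scalar (1D) analogue: a Gaussian prior on a scalar mean with known noise variance, updated one observation at a time. The function name `scalar_bayes_update` and all numbers are assumptions for illustration; the 2D case of this exercise needs the matrix versions you derive in Part 3.2. For a Gaussian posterior the MAP estimate coincides with the posterior mean.

```python
import numpy as np

def scalar_bayes_update(m_old, s2_old, x_n, sigma2):
    """One Bayesian update of a scalar Gaussian mean with known noise variance sigma2."""
    s2_new = 1.0 / (1.0 / s2_old + 1.0 / sigma2)        # precisions add
    m_new = s2_new * (m_old / s2_old + x_n / sigma2)    # precision-weighted average
    return m_new, s2_new

rng = np.random.default_rng(0)
sigma2_demo = 2.0                                        # made-up known noise variance
x_demo = rng.normal(1.0, np.sqrt(sigma2_demo), size=50)  # made-up observations

m0, s20 = 0.0, 1.0                                       # made-up prior mean and variance
m, s2 = m0, s20
map_history = []
for x_n in x_demo:
    m, s2 = scalar_bayes_update(m, s2, x_n, sigma2_demo)  # the posterior becomes the next prior
    map_history.append(m)

print(m, s2)
```

Processing the points one at a time should give the same final posterior as a single batch update on all points, which is a convenient correctness check.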

In [None]:
"""
Calculate the MAP of the mean.
"""
# YOUR CODE HERE
raise NotImplementedError()

#### Part 3.4
Plot both estimates (ML and MAP) in a single graph (1D or 2D) as a function of the number of data points observed. Indicate the true values $\{\mu_{t_1}, \mu_{t_2}\}$ as well. Evaluate your result.

Make sure you store the values for $\mathbf{\mu}_{\mathrm{ML}}$ and $\mathbf{\mu}_{\mathrm{MAP}}$ at each intermediate step $n$ and use these to plot against each other. Useful graphs to get an impression of the convergence behaviour are:
* lineplots of components of $\mathbf{\mu}_{\mathrm{ML}}^{(n)}$ and $\mathbf{\mu}_{\mathrm{MAP}}^{(n)}$ vs. $n$,
* 2D-plot joining points ($\mathbf{\mu}_{\mathrm{ML}}^{(n)}$, $\mathbf{\mu}_{\mathrm{ML}}^{(n+1)}$) for successive $n$,
* combinations of $\mathbf{\mu}_{\mathrm{ML}}$ and $\mathbf{\mu}_{\mathrm{MAP}}$ components in a single plot,
* the final posterior distribution.
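
A possible layout for the first kind of plot in the list above is sketched here with stand-in trajectories. The arrays `est_a` and `est_b` are synthetic placeholders, not real ML or MAP results; replace them with the histories stored in Parts 3.1 and 3.3.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
N_demo = 200
true_demo = np.array([0.5, -0.2])                 # hypothetical "true" mean
n_axis = np.arange(1, N_demo + 1)

# Stand-in trajectories: a running mean, and a shrunken copy of it
samples = rng.normal(true_demo, 1.0, size=(N_demo, 2))
est_a = np.cumsum(samples, axis=0) / n_axis[:, None]
est_b = est_a * n_axis[:, None] / (n_axis[:, None] + 5)

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
for k, ax in enumerate(axes):
    ax.plot(n_axis, est_a[:, k], label="estimate A (stand-in for ML)")
    ax.plot(n_axis, est_b[:, k], label="estimate B (stand-in for MAP)")
    ax.axhline(true_demo[k], color="k", linestyle="--", label="true value")
    ax.set_xscale("log")                          # early steps change fastest
    ax.set_xlabel("number of data points $n$")
    ax.set_ylabel(f"component {k + 1} of the estimate")
    ax.set_title(f"Convergence of component {k + 1}")
    ax.legend()
fig.tight_layout()
```

A logarithmic $n$-axis is a design choice here: it stretches the first few updates, where the influence of the prior is most visible.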

In [None]:
"""
Plots of ML and MAP estimates.
"""
# YOUR CODE HERE
raise NotImplementedError()

Now interpret what you see on the plots.

YOUR ANSWER HERE