1. **Restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart).
2. **Run all cells** (in the menubar, select Cell$\rightarrow$Run All).
3. __Use the__ `Validate` __button in the Assignments tab before submitting__.

__Include comments, derivations, explanations, graphs, etc.__ 

You __work in groups__ of 3 people. __Write the full names and S/U-numbers of all team members!__

---

# Assignment 4 (Statistical Machine Learning 2024)
# **Deadline: 21 December 2024**

## Instructions
* Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE` __including comments, derivations, explanations, graphs, etc.__ 
Elements and/or intermediate steps required to derive the answer have to be in the report. If an exercise requires coding, explain briefly what the code does (in comments). All figures should have titles (descriptions), axis labels, and legends.
* __Please use LaTeX to write down equations/derivations/other math__! How to do that in Markdown cells can be found [here](https://www.fabriziomusacchio.com/blog/2021-08-10-How_to_use_LaTeX_in_Markdown/), a starting point for various symbols is [here](https://www.overleaf.com/learn/latex/Mathematical_expressions).
* Please do __not add new cells__ to the notebook; write your answers only in the provided cells. Before you turn the assignment in, make sure everything runs as expected.
* __Use the variable names given in the exercises__; do not assign your own variable names.
* __Only one team member needs to upload the solutions__. This can be done under the Assignments tab, where you fetched the assignments, and where you can also validate your submissions. Please do not change the filenames of the individual Jupyter notebooks.

For any problems or questions regarding the assignments, ask during the tutorial or send an email to charlotte.cambiervannooten@ru.nl and janneke.verbeek@ru.nl.

## Introduction
Assignment 4 consists of:
1. Bayesian inference in binary response problem (50 points);
2. __The EM algorithm for doping detection (50 points)__;
3. Gibbs sampling and Metropolis-Hastings (50 points);
4. State-Space models (50 points).

## Libraries

Please __avoid installing new packages__, unless really necessary.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it to at least version 3."

import numpy as np
import matplotlib.pyplot as plt

# Set fixed random seed for reproducibility
np.random.seed(2022)

## The EM algorithm for doping detection (50 points)
In a certain hypothetical sport, a banned substance has become popular as a performance-enhancing drug, because its presence is hard to establish directly in blood samples. Recently, it has been discovered that users of the drug tend to show a strong positive correlation between the concentrations of two other quantities, $x_1$ and $x_2$, present in the blood. In contrast, 'clean' athletes tend to fall into one of two or three groups that show either no correlation or a negative correlation between $x_1$ and $x_2$. Unfortunately, as each sample contains only a single, instantaneous measurement of each variable, it is not possible to establish this correlation from an individual sample. However, in many cases it is possible to determine to which _class_ a certain sample belongs by also looking at the values of two other measured variables, $x_3$ and $x_4$: certain combinations of measured values are often typical for one class but highly unusual for others.

After a high-profile event, a large-scale test has resulted in 2000 samples. Rumours suggest the number of positives could be as high as 20\%. However, the exact relationship between the different classes and typical $\mathbf{x}$ values is still not clear. This is where the EM-algorithm comes in ...

The blood sample measurements are modelled as a mixture of $K$ Gaussians, one for each class
\begin{equation}
p(\mathbf{x}|\mathbf{\mu}, \mathbf{\Sigma}, \mathbf{\pi}) = \sum_{k=1}^K \pi_k \mathcal{N}(\mathbf{x}|\mathbf{\mu}_k, \mathbf{\Sigma}_k)
\label{Gmm}
\tag{1}
\end{equation}
where $\mathbf{x} = [x_1, x_2, x_3, x_4]$ represents the values for the measured quantities in the blood sample, $\mathbf{\mu} = \{\mathbf{\mu}_1, \ldots, \mathbf{\mu}_K\}$ and $\mathbf{\Sigma} = \{\mathbf{\Sigma}_1, \ldots, \mathbf{\Sigma}_K\}$ are the means and covariance matrices of the Gaussians for each class, and $\mathbf{\pi} = \{\pi_1, \ldots, \pi_K\}$ are the mixing coefficients in the overall data set.
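As an illustration of equation (1), the mixture density at a single point can be evaluated directly as a weighted sum of Gaussian densities. The sketch below is only an example: the helper names `gaussian_pdf` and `mixture_pdf` and all `*_mix` values are hypothetical and not part of the assignment, and the toy parameters are two-dimensional to keep the numbers readable.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Density of a multivariate Gaussian N(x | mu, Sigma) at a single point x."""
    dim = mu.shape[0]
    diff = x - mu
    # Solve Sigma z = diff instead of forming the inverse explicitly
    maha = diff @ np.linalg.solve(Sigma, diff)
    norm = np.sqrt((2 * np.pi) ** dim * np.linalg.det(Sigma))
    return np.exp(-0.5 * maha) / norm

def mixture_pdf(x, pis, mus, Sigmas):
    """Mixture density of equation (1): sum_k pi_k N(x | mu_k, Sigma_k)."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

# Hypothetical two-component parameters (illustration only, not fitted to any data)
pis_mix = np.array([0.3, 0.7])
mus_mix = np.array([[0.0, 0.0], [3.0, 3.0]])
Sigmas_mix = np.array([np.eye(2), 2.0 * np.eye(2)])

p_mix = mixture_pdf(np.array([0.0, 0.0]), pis_mix, mus_mix, Sigmas_mix)
```

For many data points or poorly scaled covariances, working with log densities is usually more robust than multiplying raw densities.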

We first load the data and set $N$ to the number of datapoints and $D$ to the number of variables in the data set $X$.

In [None]:
# Load data
X = np.loadtxt("doping_mixdata.txt")
N, D = X.shape

1. Try to give an estimate of the number, size and shape of the classes in the data by plotting the distribution of the variables, e.g. using `plt.hist`, `plt.scatter` or a 3D scatter plot (`ax.scatter` on axes created with `projection='3d'`).

In [None]:
"""
Experiment with different plots.
"""
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

2. Implement an EM-algorithm using the description and formulas given in Bishop, $\S9.2.2$. Use variable $K$ for the number of classes and choose a priori equal mixing coefficients $\pi_k$. Initialize the means $\mathbf{\mu}_k$ to random values around the sample mean of each variable, e.g. set $\mu_{k,1}$ to $\bar{x}_1 + \epsilon$ with $-1 \leq \epsilon \leq +1$. Initialize the $\mathbf{\Sigma}_k$ to diagonal matrices with reasonably high variances, e.g. `random.randint(2,6)`, to avoid very small responsibilities in the first step. Make sure the EM-loop runs for at least 100 iterations. Display relevant quantities, at least the log likelihood (9.28), after each step so you can monitor progress and convergence.
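To make the structure of such a loop concrete, here is a minimal sketch of EM for a Gaussian mixture, run on synthetic two-dimensional data rather than on the assignment data. It is one possible layout, not the reference solution. The names `log_gauss`, `em_gmm_sketch` and all `*_toy` variables are hypothetical, and `rng.integers(2, 6)` (integer variances 2 to 5) is an assumed stand-in for the suggested `random.randint(2,6)`. Responsibilities are computed in log space to avoid underflow.

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Row-wise log N(x_n | mu, Sigma) for an (N, D) array X."""
    dim = X.shape[1]
    diff = X - mu
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (dim * np.log(2 * np.pi) + logdet + maha)

def em_gmm_sketch(X, K, n_iter=100, rng=None):
    """Plain EM for a Gaussian mixture; returns parameters and log likelihoods."""
    rng = np.random.default_rng() if rng is None else rng
    n, dim = X.shape
    pis = np.full(K, 1.0 / K)                                     # equal mixing coefficients
    mus = X.mean(axis=0) + rng.uniform(-1.0, 1.0, size=(K, dim))  # sample mean + epsilon
    Sigmas = np.array([np.diag(rng.integers(2, 6, size=dim).astype(float))
                       for _ in range(K)])                        # diagonal, variances 2..5
    log_liks = []
    for _ in range(n_iter):
        # E-step: log(pi_k N(x_n | mu_k, Sigma_k)) for every point n and component k
        log_w = np.stack([np.log(pis[k]) + log_gauss(X, mus[k], Sigmas[k])
                          for k in range(K)], axis=1)
        # log-sum-exp over k gives log p(x_n) without underflow
        m = log_w.max(axis=1, keepdims=True)
        log_px = m[:, 0] + np.log(np.exp(log_w - m).sum(axis=1))
        log_liks.append(log_px.sum())            # log likelihood under current parameters
        gamma = np.exp(log_w - log_px[:, None])  # responsibilities, rows sum to 1
        # M-step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
        pis = Nk / n
    # gamma holds the responsibilities from the final E-step
    return pis, mus, Sigmas, gamma, np.array(log_liks)

# Synthetic data: two well-separated blobs (illustration only)
rng_toy = np.random.default_rng(0)
X_toy = np.vstack([
    rng_toy.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2)),
    rng_toy.normal(loc=[6.0, 6.0], scale=1.0, size=(100, 2)),
])
pis_toy, mus_toy, Sigmas_toy, gamma_toy, ll_toy = em_gmm_sketch(X_toy, K=2, rng=rng_toy)
```

A useful sanity check for any EM implementation is that the log likelihood never decreases from one iteration to the next; a drop usually points to a bug in the E- or M-step.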

In [None]:
"""
Implement the EM algorithm.
"""
# YOUR CODE HERE
raise NotImplementedError()



Now implement a plot routine that plots the $\{x_1, x_2\}$ coordinates of the data points and colors each data point according to the most probable component in the mixture model.
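The "most probable component" of a data point is the one with the largest responsibility. The sketch below shows that hard assignment on a small hand-made responsibility matrix; `gamma_hard`, `labels_hard` and `fractions_hard` are hypothetical names, and the numbers are invented for illustration.

```python
import numpy as np

# Hypothetical responsibilities for 5 data points and K = 2 components;
# each row sums to 1 (in practice these come out of the E-step).
gamma_hard = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.60, 0.40],
    [0.05, 0.95],
    [0.50, 0.50],
])

# Most probable component per data point (ties resolve to the lowest index)
labels_hard = gamma_hard.argmax(axis=1)

# Fraction of the data hard-assigned to each component
fractions_hard = np.bincount(labels_hard, minlength=gamma_hard.shape[1]) / len(labels_hard)
```

An integer label array like `labels_hard` can be passed as the `c` argument of `plt.scatter` to color the points per component.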

In [None]:
"""
Plot routine.
"""
# YOUR CODE HERE
raise NotImplementedError()

3. Set $K=2$, initialize your random generator and run the EM-algorithm on the data. Try different random initializations.

    _(Should converge within 50 steps to two clusters, accounting for roughly 1/3 and 2/3 of the data, respectively)._

    Plot the $\{x_1, x_2\}$ coordinates coloured according to the most probable component.
    

In [None]:
"""
Run the EM algorithm. 
"""
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""
Plot the most probable component. 
"""
# YOUR CODE HERE
raise NotImplementedError()

Describe what happens and compare the results of the different initializations.

YOUR ANSWER HERE

Compute the correlation coefficients
\begin{equation}
\rho_{12} = \frac{\mathrm{cov}[x_1,x_2]}{\sqrt{\mathrm{var}[x_1] \mathrm{var}[x_2]}}
\label{correlationcoeff}
\tag{2}
\end{equation}
of each of the components, i.e., use their covariance matrices to compute the variances and covariances in (\ref{correlationcoeff}); see also Bishop, eq. (2.93).

**Hint**: According to Wikipedia, there is no correlation if $|\rho|<0.1$, and the correlation is small if $0.1<|\rho|<0.3$, medium if $0.3<|\rho|<0.5$ and strong if $|\rho|>0.5$.
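As a small worked example of equation (2), the correlation coefficient can be read directly off a covariance matrix: the off-diagonal entry is the covariance and the diagonal entries are the variances. The helper `correlation_12` and the matrix `Sigma_corr` below are hypothetical and chosen so that the result is easy to verify by hand.

```python
import numpy as np

def correlation_12(Sigma):
    """Correlation coefficient between the first two variables of a covariance matrix."""
    return Sigma[0, 1] / np.sqrt(Sigma[0, 0] * Sigma[1, 1])

# Hypothetical 4x4 covariance matrix (illustration only, not fitted to any data)
Sigma_corr = np.array([
    [4.0, 3.0, 0.0, 0.0],
    [3.0, 9.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

rho_corr = correlation_12(Sigma_corr)  # 3 / sqrt(4 * 9) = 0.5
```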

In [None]:
"""
Compute the correlation coefficients. 
"""
# YOUR CODE HERE
raise NotImplementedError()

Does either class show the characteristic strong positive correlation for $\{x_1, x_2\}$?

YOUR ANSWER HERE

4. Increase the number of classes to $K=3$ and rerun your algorithm on the data, again trying different random initializations. Plot the $\{x_1, x_2\}$ coordinates colored according to the most probable component and compute the correlation coefficients of each of the components. Check in both your plot and your coefficients whether one of the clusters now displays the strong positive $\{x_1, x_2\}$ correlation we are looking for.

    Increase to $K=4$, do the same, and see if this improves your result (in terms of detection of the doping-cluster). Based on your findings, is the rumoured 1-in-5 estimate for users of the drug credible?
    
    **Note:** Please use only the cells allotted for code and explanations.

In [None]:
"""
Run the EM algorithm with K=3. 
"""
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
"""
Run the EM algorithm with K=4. 
"""
# YOUR CODE HERE
raise NotImplementedError()

YOUR ANSWER HERE

Having found the offending cluster in the data using the EM-algorithm, we are now presented with four samples $\{A, B, C, D\}$, with values for $[x_1, x_2, x_3, x_4]$ given as:
\begin{eqnarray*}
\text{A} & = & [11.85, 2.2, 0.5, 4.0] \\
\text{B} & = & [11.95, 3.1, 0.0, 1.0] \\
\text{C} & = & [12.00, 2.5, 0.0, 2.0] \\
\text{D} & = & [12.00, 3.0, 1.0, 6.3]
\end{eqnarray*}
One of these is from a subject who took the performance-enhancing drug, and one is from a subject who tried to tamper with the test by artificially altering one or more of the $x_i$ levels in their blood sample.

5. Identify which sample belongs to the suspected user and which one belongs to the 'fraud'. Explain your choices.
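One way to quantify how "typical" a sample is for a given class is the squared Mahalanobis distance between the sample and that class's Gaussian. If a point really is drawn from a $D$-dimensional Gaussian, this distance follows a chi-squared distribution with $D$ degrees of freedom, so very large values indicate a sample that fits the class poorly. This is only one possible diagnostic, not necessarily the intended approach. The helper `mahalanobis_sq` and all `*_maha` values below are hypothetical and two-dimensional for readability.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))

# Hypothetical component parameters (illustration only, not fitted to any data)
mu_maha = np.array([1.0, 2.0])
Sigma_maha = np.array([[2.0, 0.0],
                       [0.0, 0.5]])

typical_maha = np.array([1.5, 2.0])  # close to the mean along a high-variance axis
unusual_maha = np.array([1.0, 5.0])  # far from the mean along a low-variance axis

d2_typical = mahalanobis_sq(typical_maha, mu_maha, Sigma_maha)  # 0.5**2 / 2.0 = 0.125
d2_unusual = mahalanobis_sq(unusual_maha, mu_maha, Sigma_maha)  # 3.0**2 / 0.5 = 18.0
```

Comparing such distances across all fitted components, alongside the responsibilities, can help separate a sample that fits one class well from one that fits no class at all.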

In [None]:
# Use this code cell if you need to perform any extra computations, then write your explanation in the text cell below.

YOUR ANSWER HERE