# Stochastic Simulation

*Winter Semester 2023/24*

16.02.2024

Prof. Sebastian Krumscheid<br>
Assistant: Stjepan Salatovic

<h3 align="center">
Exercise sheet 12
</h3>

---

<h1 align="center">
Markov Chain Monte Carlo
</h1>

In [1]:
import matplotlib.pyplot as plt
import numpy as np

from scipy.stats import norm
from scipy.special import gamma
from ipywidgets import interact

import warnings
warnings.filterwarnings("ignore")

In [2]:
plt.rc('font', size=14)          # Default text size
plt.rc('axes', titlesize=16)     # Font size of the axes title
plt.rc('axes', labelsize=14)     # Font size of the x and y labels
plt.rc('xtick', labelsize=12)    # Font size of the x tick labels
plt.rc('ytick', labelsize=12)    # Font size of the y tick labels
plt.rc('legend', fontsize=14)    # Font size of the legend

## Exercise 1

In many applications, one needs to sample from a multi-modal distribution $f$. The theory developed so far applies directly to such distributions.
In practice, however, sampling from them using MCMC can be computationally challenging, as we will investigate in this problem. Throughout this exercise, we will consider the bi-modal distribution
$$\tag{1}
f(x;\gamma,x_0)=\frac{e^{-\gamma (x^2-x_0)^2}}{Z}, \quad \gamma>0,
$$
where $Z$ is a normalizing constant. For $x_0>0$, the density has two modes at $x=\pm\sqrt{x_0}$. Depending on the values of $\gamma$ and $x_0$, sampling properly from (1) can become challenging. Intuitively, if the two peaks are far apart, a random walk Metropolis (RWM) sampler may get stuck at one of the peaks when the _step size_ is too small. Conversely, an RWM sampler with very large _steps_ tends to reject often, which makes the whole sampling procedure inefficient. We begin by verifying this. Implement the RWM algorithm with the Gaussian proposal $q(x,\cdot)=\mathcal{N}(x,\sigma^2)$ and target distribution $f(x;\gamma,x_0)$ for $\gamma=1$, $x_0=1,4,9,25$ and different choices of $\sigma$. Discuss the quality of your samples by analyzing the trace plots (one realization of the chain), the autocorrelation functions and the histograms of the chains obtained.
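
The RWM step described above can be sketched as follows. This is only one possible layout, not the reference solution: the names `log_target` and `rwm_sketch`, the starting point `x_init` and the fixed `seed` are assumptions made for illustration. The sketch works with the unnormalized log-density, since $Z$ cancels in the acceptance ratio.

```python
import numpy as np


def log_target(x, gamma, x0):
    """Unnormalized log-density of (1); the constant Z is not needed."""
    return -gamma * (x**2 - x0) ** 2


def rwm_sketch(sigma, gamma, x0, n_steps=10_000, x_init=0.0, seed=0):
    """Random walk Metropolis with proposal N(x, sigma^2) (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    x, log_fx = x_init, log_target(x_init, gamma, x0)
    n_accepted = 0
    for n in range(n_steps):
        y = x + sigma * rng.standard_normal()      # symmetric proposal
        log_fy = log_target(y, gamma, x0)
        # log(1 - U) with U in [0, 1) is the log of a uniform on (0, 1]
        if np.log1p(-rng.uniform()) < log_fy - log_fx:
            x, log_fx = y, log_fy
            n_accepted += 1
        chain[n] = x                               # a rejected step repeats x
    return chain, n_accepted / n_steps


chain, acc_rate = rwm_sketch(sigma=1.0, gamma=1.0, x0=1.0)
print(f"acceptance rate: {acc_rate:.2f}")
```

Comparing log-densities instead of density ratios avoids underflow for large $x_0$, where $e^{-\gamma(x^2-x_0)^2}$ is numerically zero away from the modes.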

**Remark:** Have a look at [`plt.acorr`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.acorr.html) for plotting the autocorrelation.
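
If you prefer to compute the autocorrelation yourself, a minimal NumPy sketch is given below. The function name `empirical_acf` and the synthetic AR(1) test series are assumptions for illustration. Note that `plt.acorr` applies no detrending by default, so the chain mean should be subtracted before calling it; the sketch does this explicitly.

```python
import numpy as np


def empirical_acf(chain, max_lag=50):
    """Empirical autocorrelation at lags 0..max_lag (biased estimator)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()                       # center the chain first
    n = x.size
    var = np.dot(x, x) / n
    acf = np.empty(max_lag + 1)
    for lag in range(max_lag + 1):
        acf[lag] = np.dot(x[: n - lag], x[lag:]) / (n * var)
    return acf


# Synthetic check on an AR(1) series, whose lag-1 autocorrelation is 0.9
rng = np.random.default_rng(1)
series = np.zeros(20_000)
for i in range(1, series.size):
    series[i] = 0.9 * series[i - 1] + rng.standard_normal()

acf = empirical_acf(series, max_lag=5)
print(np.round(acf, 2))
```

A slowly decaying autocorrelation function indicates a strongly correlated chain, which is what one expects for a poorly tuned $\sigma$.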

In [3]:
def f(x: float, gamma: float, x0: float) -> float:
    """
    Computes the density of a bi-modal distribution.

    Args:
        x (float): The point at which to evaluate the density.
        gamma (float): The gamma parameter of the distribution.
        x0 (float): The x0 parameter of the distribution.

    Returns:
        float: The density of the bi-modal distribution at point x.
    """
    # TODO
    return

In [4]:
def RWM(sigma: float, gamma: float, x0: float, N: int = 10000) -> np.ndarray:
    """
    Implements the Random Walk Metropolis (RWM) algorithm.

    Args:
        sigma (float): The standard deviation of the proposal distribution.
        gamma (float): The gamma parameter of the target distribution.
        x0 (float): The x0 parameter of the target distribution.
        N (int, optional): The number of samples to generate. Defaults to 10000.

    Returns:
        np.ndarray: An array of samples generated by the RWM algorithm.
    """
    # TODO
    return

## Exercise 2

Ideally, we would like to obtain (approximately) i.i.d. samples from a target distribution $f$ using Markov Chain Monte Carlo (MCMC) algorithms.
One practical way of doing so is via _sub-sampling_ (also called _batch sampling_), which is used to reduce or eliminate the correlation between successive values of the Markov chain.
That is, instead of considering the entire chain $\{X_n\colon n\ge 0\}$, this technique sub-samples the chain with a batch size $k>1$, so that only the values $\{X_{kn}\colon n\ge 0\}$ are considered.
If the covariance $\text{Cov}_f(X_0,X_n)$ vanishes as $n\to\infty$, then the idea of sub-sampling is quite natural, since $X_{kn}$ and $X_{k(n+1)}$ can be considered approximately independent for $k$ sufficiently large; estimating such a $k$ may be difficult in practice, though.
While sub-sampling provides a way of generating (approximately) i.i.d. samples from $f$ and may thus be useful for assessing the convergence of an MCMC method, it necessarily leads to a loss of efficiency.

Let $\{X_n\in \mathbb{R}^d \colon n\ge 0\}$ be a Markov chain with a unique stationary distribution $f$, and $X_0 \sim f$ (i.e., the chain is at equilibrium). Take $\phi\colon \mathbb{R}^d\to\mathbb{R}$ such that $\mathbb{E}_f\bigl({\lvert \phi\rvert}^2\bigr)<\infty$ and consider two estimators for $\mu=\mathbb{E}_f(\phi)$, namely one that uses the entire Markov chain ($\hat\mu$) and one based on sub-sampling ($\hat\mu_{k}$) using only every $k$-th value:
\begin{equation*}
\hat\mu = \frac{1}{Nk}\sum_{n=1}^{Nk}\phi(X_n)\;,\quad\text{and}\quad \hat\mu_{k} = \frac{1}{N}\sum_{n=1}^N\phi(X_{nk})\;.
\end{equation*}
Show that the variance of $\hat\mu$ satisfies $\mathbb{V}_f(\hat\mu)\le \mathbb{V}_f(\hat\mu_{k})$ for every $k>1$.
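
Before proving the inequality, it can be checked numerically in a case where both variances are available in closed form. The sketch below assumes a stationary chain with $\text{Cov}(X_i,X_j)=\rho^{|i-j|}$ and $\phi(x)=x$ (for example a Gaussian AR(1) process with unit stationary variance); this model and the helper name `var_of_mean` are illustrative assumptions, not part of the exercise.

```python
import numpy as np


def var_of_mean(rho, n_terms):
    """Variance of the mean of n_terms consecutive values of a stationary
    sequence with Cov(X_i, X_j) = rho**|i - j| (unit variance)."""
    lags = np.arange(1, n_terms)
    # sum over all pairs (i, j): n_terms diagonal terms + 2 * off-diagonal terms
    return (n_terms + 2.0 * np.sum((n_terms - lags) * rho**lags)) / n_terms**2


rho, N, k = 0.9, 200, 5
var_full = var_of_mean(rho, N * k)   # estimator using all N*k values
var_sub = var_of_mean(rho**k, N)     # every k-th value has lag-one correlation rho**k

print(f"full chain: {var_full:.5f}, sub-sampled: {var_sub:.5f}")
```

In this example the two variances are close, but the full-chain estimator is never worse, even though its samples are more strongly correlated.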

## Exercise 3

At every iteration of the general Metropolis-Hastings algorithm, a
new candidate state $\boldsymbol{Y}_{n+1}$ is proposed by sampling
$\boldsymbol{Y}_{n+1} \sim q(\boldsymbol{X}_{n},\cdot)$, given the
current state $\boldsymbol{X}_{n}$. Here,
$q(\boldsymbol{x},\boldsymbol{y})$ is the so-called proposal
density. Consider now the case where the proposal does not depend on
the current state, that is
$q(\boldsymbol{x},\boldsymbol{y}) \equiv q(\boldsymbol{y})$, so that the proposed candidate is $\boldsymbol{Y}_{n+1} \sim q$. This particular
Markov Chain Monte Carlo (MCMC) variant is sometimes called
_independent Metropolis-Hastings algorithm_ with fixed proposal
(or simply _independence sampler_). Let's denote the target
density by $f$. As such, this MCMC variant appears very similar to the
Accept-Reject method for sampling from $f$ (cf. Lab 02).
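
To make the algorithm concrete, here is a minimal sketch of the independence sampler for generic log-densities. The function name `independence_sampler`, its callable arguments and the toy example (target $\mathcal{N}(0,1)$, proposal $\mathcal{N}(0,2^2)$, for which $f\le Cq$ holds with $C=2$) are assumptions chosen for illustration. The acceptance probability is $\min\{1, w(y)/w(x)\}$ with the weight $w=f/q$, so normalizing constants cancel.

```python
import numpy as np


def independence_sampler(log_f, log_q, sample_q, n_steps, x_init, seed=0):
    """Independent Metropolis-Hastings with a fixed proposal q (sketch)."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    x = x_init
    log_w_x = log_f(x) - log_q(x)          # log of the weight f(x) / q(x)
    n_accepted = 0
    for n in range(n_steps):
        y = sample_q(rng)                  # proposal ignores the current state
        log_w_y = log_f(y) - log_q(y)
        if np.log1p(-rng.uniform()) < log_w_y - log_w_x:
            x, log_w_x = y, log_w_y
            n_accepted += 1
        chain[n] = x
    return chain, n_accepted / n_steps


# Toy example: unnormalized log-densities are sufficient
log_f = lambda x: -0.5 * x**2              # target N(0, 1)
log_q = lambda x: -0.125 * x**2            # proposal N(0, 2^2)
sample_q = lambda rng: 2.0 * rng.standard_normal()

chain, acc_rate = independence_sampler(log_f, log_q, sample_q, 20_000, 0.0)
print(f"mean {chain.mean():.3f}, variance {chain.var():.3f}, acceptance {acc_rate:.2f}")
```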

1. Suppose there exists a positive constant $C$ such that
	$f(\boldsymbol{x})\le C q(\boldsymbol{x})$ for any
	$\boldsymbol{x}\in\text{supp}(f)=\{\boldsymbol{x}\in\mathbb{R}^d\colon
	f(\boldsymbol{x})>0\}$. Show that the expected acceptance probability of the independent
		Metropolis-Hastings algorithm is _at least_ $\frac{1}{C}$
		whenever the chain is stationary. How does this compare to the
		expected acceptance probability of an Accept-Reject method?

2. Let us compare the independent Metropolis-Hastings algorithm
	and the Accept-Reject method in some more detail by an
	example. Specifically, the goal is to sample from a Gamma
	distribution with shape parameter $\alpha$ and rate parameter
	$\beta$, denoted by $\text{Gamma}(\alpha,\beta)$, so that the target
	PDF reads
	$$f(x) \equiv f(x;\alpha,\beta) = \beta^\alpha x^{\alpha -1}
	e^{-\beta x}/\Gamma(\alpha) \mathbb{I}_{\{x\ge 0\}},$$ where
	$\Gamma$ denotes the Gamma function.

    **Hint:** [`scipy.special.gamma`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gamma.html)

    1. Implement the Accept-Reject method to sample from
		$\text{Gamma}(\alpha,1)$ for $\alpha>1$, using the PDF of the
		$\text{Gamma}(a,b)$ distribution with $a = [\alpha]$ as auxiliary
		density (here $[\alpha]$ denotes the integer part of
		$\alpha$). Show that $b = [\alpha]/\alpha$ is the optimal choice for $b$.


        **Hint:** Recall that
    				$\sum_{k=1}^K \xi_k \sim \mathrm{Gamma}(K,\beta)$ for
    				$K\in\mathbb{N}$, if
    				$\xi_k\overset{\mathrm{i.i.d.}}{\sim}
    				\mathrm{Gamma}(1,\beta)\equiv \mathrm{Exp}(\beta)$.

    2. Use your Accept-Reject method to generate $m$ random numbers
    $X_1,\dots ,X_m$ with each $X_i\sim\text{Gamma}(\alpha,1)$, when
    using $n=5000$ random variables $Y_1,\dots ,Y_n$ from the auxiliary
    $\text{Gamma}([\alpha],[\alpha]/\alpha)$ distribution. Notice that
    $m$ is a random variable, which is at most $n$ due to
    rejections. Perform the simulations for $\alpha = 4.85$.

    3. Implement the independent Metropolis-Hastings algorithm using
    as proposal $q$ the PDF of the
    $\text{Gamma}([\alpha],[\alpha]/\alpha)$ distribution.

    4. Use the same sample $Y_1,\dots ,Y_n$ used within the
    Accept-Reject method, now in the corresponding Metropolis-Hastings
    algorithm to generate $n=5000$ realizations of the target
    distribution $\text{Gamma}(\alpha,1)$ with $\alpha = 4.85$.

    5. Compare both methods with respect to:
        1. their acceptance rates,
        2. their estimates for the mean of the $\text{Gamma}(4.85,1)$
        distribution, which is $4.85$,
        3. the correctness of the target distribution.


        Discuss your results.
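
The hint on sums of exponentials suggests a simple way to draw from the auxiliary $\text{Gamma}([\alpha],[\alpha]/\alpha)$ distribution. The sketch below is one possible way to do this; the function name `sample_gamma_integer_shape` and the fixed seed are assumptions. It uses the inverse-transform method for $\mathrm{Exp}(\beta)$ and the rate parametrization of the target PDF above.

```python
import numpy as np


def sample_gamma_integer_shape(K, beta, size, rng):
    """Draw `size` samples from Gamma(K, beta), K a positive integer and beta
    a rate, as sums of K i.i.d. Exp(beta) variables."""
    u = rng.uniform(size=(size, K))
    # -log(1 - U) / beta ~ Exp(beta) for U ~ U[0, 1)
    return -np.log1p(-u).sum(axis=1) / beta


alpha = 4.85
a = int(np.floor(alpha))      # integer part [alpha] = 4
b = a / alpha                 # rate of the auxiliary density
rng = np.random.default_rng(0)
y = sample_gamma_integer_shape(a, b, size=5000, rng=rng)

print(f"sample mean: {y.mean():.3f} (auxiliary mean a / b = {a / b:.2f})")
```

With $b=[\alpha]/\alpha$, the auxiliary distribution has mean $a/b=\alpha$, the same as the target $\text{Gamma}(\alpha,1)$.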

In [5]:
def f(x: np.ndarray, a: float, b: float) -> np.ndarray:
    """
    Evaluates the PDF of a Gamma distribution with parameters a and b.

    Args:
        x (np.ndarray): An array of values for which to compute the PDF of the specified Gamma distribution.
        a (float): The shape parameter of the Gamma distribution.
        b (float): The rate parameter of the Gamma distribution.

    Returns:
        np.ndarray: Computed PDF values for the input values.
    """
    # TODO
    return

In [6]:
def optimal_C(alpha: float, a: float, b: float) -> float:
    """
    Optimal (smallest admissible) value for `C`: the supremum over `x` of the ratio `f(x; alpha, 1) / f(x; a, b)`.

    Args:
        alpha (float): Parameter of the target distribution function `f`.
        a (float): The shape parameter of the Gamma distribution.
        b (float): The rate parameter of the Gamma distribution.

    Returns:
        float: Optimal value for `C`.
    """
    # TODO
    return

In [7]:
def acceptance_rejection(alpha: float, n: int = 5000) -> tuple:
    """
    Performs acceptance-rejection sampling to generate samples from a target distribution.

    Args:
        alpha (float): Parameter of the target distribution function `f`.
        n (int, optional): Number of samples to generate. Defaults to 5000.

    Returns:
        tuple: A tuple containing the generated samples (numpy array) and the acceptance probability (float).
    """
    # TODO
    return

In [8]:
def metropolis_hastings(alpha: float, n: int = 5000, initial_value: float = 4) -> tuple:
    """
    Performs Metropolis-Hastings sampling to generate samples from a target distribution.

    Args:
        alpha (float): Parameter of the target distribution function `f`.
        n (int, optional): Number of samples to generate. Defaults to 5000.
        initial_value (float, optional): Initial value of the Markov chain. Defaults to 4.

    Returns:
        tuple: A tuple containing the generated samples (numpy array) and the acceptance rate (float).
    """
    # TODO
    return

## Exercise 4 

Let $X\subset\mathbb{R}^d$ and let $P_i\colon X\times\mathcal{B}(X)\rightarrow[0,1]$, $i=1,\dots,m$, be Markov transition kernels on $X$, where $\mathcal{B}(X)$ denotes the Borel $\sigma$-algebra on $X$.

1. Given $a_1,\dots,a_m\in \mathbb{R}^+$, such that $\sum_{i=1}^{m}a_i=1$, show that $P(x,A)=\sum_{i=1}^{m} a_i P_i(x,A)$ is a Markov kernel.

2. Now consider $a_1,\dots,a_m\in \mathbb{R}$, such that $\sum_{i=1}^{m}a_i=1$ (i.e., not necessarily positive weights). Construct an affine combination of kernels $P(x,A)=\sum_{i=1}^{m}a_iP_i(x,A)$, with at least one negative weight, that is still Markovian.

	 **Hint:** Consider a kernel $P_1$ for which there exist a constant $\epsilon>0$ and a measure $\nu$ on $(X, \mathcal{B}(X))$ such that $P_1(x, A) \geq \epsilon \nu(A)$ for all $x \in X$ and all $A \in \mathcal{B}(X)$.

3. Suppose that a probability measure $\pi\colon\mathcal{B}(X)\rightarrow[0,1]$ is invariant for each kernel $P_i$. Show that it is also invariant for $P=\sum_{i=1}^m a_i P_i$, where $a_1,\dots,a_m\in \mathbb{R}^+$ are such that $\sum_{i=1}^{m}a_i=1$.
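
On a finite state space, kernels are row-stochastic matrices, and parts 1 and 3 can be checked numerically before proving them. The sketch below is an illustration only: the three-state distribution `pi`, the helper `metropolis_matrix` and the weights are assumptions. It builds two kernels that share the invariant distribution `pi` and checks their convex combination.

```python
import numpy as np


def metropolis_matrix(pi):
    """Metropolis transition matrix targeting pi, with a uniform proposal
    over the other states (one way to obtain a pi-invariant kernel)."""
    m = pi.size
    P = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i != j:
                P[i, j] = min(1.0, pi[j] / pi[i]) / (m - 1)
        P[i, i] = 1.0 - P[i].sum()         # remaining mass stays at state i
    return P


pi = np.array([0.2, 0.3, 0.5])
P1 = metropolis_matrix(pi)
P2 = np.tile(pi, (pi.size, 1))             # kernel that draws from pi directly
a = np.array([0.3, 0.7])                   # non-negative weights summing to one
P = a[0] * P1 + a[1] * P2

print("rows sum to one:", np.allclose(P.sum(axis=1), 1.0))
print("entries non-negative:", bool((P >= 0).all()))
print("pi invariant:", np.allclose(pi @ P, pi))
```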

## Exercise 5 (optional, no solution)

Consider a Markov chain $\{X_n\} \sim \text{Markov}(\pi,P)$ on a discrete state space $\mathcal{X}$ at equilibrium, with $P$ irreducible, and $\pi$ the unique invariant probability measure of $P$. 
Let $l^2_{\pi}$ be the Hilbert space $l^2_{\pi} = \{f:\mathcal{X} \to \mathbb{R}: \sum_{i\in \mathcal{X}} f(i)^2 \pi_i < \infty\}$ with inner product $(f,g)_{l^2_\pi} = \sum_{i \in \mathcal{X}} f(i)g(i)\pi_i$, and $l^2_{\pi,0} = \{f \in l^2_\pi: \mathbb{E}_\pi[f]=0\}$.

1. Show that if $(P,\pi)$ are in detailed balance, then $(Pf,g)_{l^2_\pi} = (f,Pg)_{l^2_\pi}$ for any $f,g \in l^2_\pi$.

2. Show that $\mathbb{E}[f(X_n)f(X_m)]=(P^{m-n}f,f)_{l^2_\pi}$ for any $f \in l^2_\pi$ and $m>n$.

3. Consider now the estimator $$\hat{\mu}_N = \frac{1}{N} \sum_{n=1}^N f(X_n)$$ of $\mu = \mathbb{E}_\pi[f]$ under the assumption that $f \in l^2_\pi$. Show that $\mathbb{E}_\pi[\hat{\mu}_N] = \mu$, and $$\mathbb{V}ar_{\pi}[\hat{\mu}_N] = \frac{1}{N} \sum_{l=0}^N c_{l,N} (P^l \tilde{f}, \tilde{f})_{l^2_\pi},$$ with $\tilde{f} = f - \mathbb{E}_{\pi}[f] \in l^2_{\pi,0}$ and
\begin{align}
c_{l,N} = \begin{cases} 1, & l=0, \\ 2\left(1-\frac{l}{N}\right), & l>0. \end{cases}
\end{align}

4. Conclude that the asymptotic variance $\mathbb{V}(f,P) := \lim_{N \to \infty} N \mathbb{V}ar_{\pi}(\hat{\mu}_N)$ satisfies $\mathbb{V}(f,P) = ((2(I-P)^{-1}-I)\tilde{f},\tilde{f})_{l^2_\pi}$ if
$$
\tag{3}
\sup_{g \in l^2_{\pi,0},\, g \neq 0} \frac{(Pg,g)_{l^2_\pi}}{\|g\|_{l^2_\pi}^2} = \beta < 1.
$$

5. Consider now the two irreducible transition matrices $P_1$ and $P_2$, both in detailed balance with $\pi$ and satisfying (3) for some $\beta_1,\beta_2$.
Show that if $(P_1)_{ij} \geq (P_2)_{ij}$ for all $i \neq j$, then
\begin{align}
\mathbb{V}(f,P_1) \leq \mathbb{V}(f,P_2),
\end{align}
for any $f \in l^2_\pi$.

    **Hint:** Take $P(\lambda) = (1-\lambda)P_1 + \lambda P_2, \lambda \in [0,1]$ and show that $\frac{d}{d \lambda} \mathbb{V}(f,P(\lambda)) \geq 0$.
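
For a finite state space, the statements in parts 3 to 5 can be explored numerically. The following sketch is an illustration under stated assumptions: the three-state reversible matrix `P1`, its lazy version `P2 = (I + P1) / 2` and the function `f` are hypothetical examples. On mean-zero functions, $(I-P)^{-1}\tilde{f}$ is evaluated as $(I-P+\mathbf{1}\pi^\top)^{-1}\tilde{f}$, since $I-P$ itself is singular on the full space.

```python
import numpy as np

pi = np.array([0.2, 0.3, 0.5])
# Reversible with respect to pi (Metropolis chain with uniform proposal)
P1 = np.array([[0.0, 0.5, 0.5],
               [1 / 3, 1 / 6, 0.5],
               [0.2, 0.3, 0.5]])
P2 = 0.5 * (np.eye(3) + P1)       # lazy chain: smaller off-diagonal entries
f = np.array([0.0, 1.0, 2.0])


def asymptotic_variance(P, pi, f):
    """((2 (I - P)^{-1} - I) f~, f~) in the pi-weighted inner product."""
    f_tilde = f - pi @ f
    Z = np.linalg.inv(np.eye(pi.size) - P + np.outer(np.ones(pi.size), pi))
    g = Z @ f_tilde               # equals (I - P)^{-1} f~ for mean-zero f~
    return 2.0 * np.sum(pi * g * f_tilde) - np.sum(pi * f_tilde**2)


def scaled_variance_finite(P, pi, f, N):
    """N * Var(mu_hat_N) from the finite-N formula with weights c_{l,N}."""
    f_tilde = f - pi @ f
    total, Pl_f = 0.0, f_tilde.copy()
    for l in range(N):
        c = 1.0 if l == 0 else 2.0 * (1.0 - l / N)
        total += c * np.sum(pi * Pl_f * f_tilde)
        Pl_f = P @ Pl_f           # advance to P^(l+1) f~
    return total


v1 = asymptotic_variance(P1, pi, f)
v2 = asymptotic_variance(P2, pi, f)
print(f"V(f, P1) = {v1:.4f}, V(f, P2) = {v2:.4f}")
print(f"finite-N value for P1, N = 2000: {scaled_variance_finite(P1, pi, f, 2000):.4f}")
```

Since $(P_1)_{ij}\ge (P_2)_{ij}$ for all $i\neq j$ in this example, the printed values should be ordered as in part 5.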