# Stochastic Simulation

*Winter Semester 2023/24*

16.02.2024

Prof. Sebastian Krumscheid<br>
Assistant: Stjepan Salatovic

<h3 align="center">
Exercise sheet 12
</h3>

---

<h1 align="center">
Markov Chain Monte Carlo
</h1>

In [1]:
import matplotlib.pyplot as plt
import numpy as np

from scipy.stats import norm
from scipy.special import gamma
from ipywidgets import interact

import warnings
warnings.filterwarnings("ignore")

In [2]:
plt.rc('font', size=14)          # Default text size
plt.rc('axes', titlesize=16)     # Font size of the axes title
plt.rc('axes', labelsize=14)     # Font size of the x and y labels
plt.rc('xtick', labelsize=12)    # Font size of the x tick labels
plt.rc('ytick', labelsize=12)    # Font size of the y tick labels
plt.rc('legend', fontsize=14)    # Font size of the legend

## Exercise 1

In many applications of interest, one needs to sample from a multi-modal distribution $f$. The theory developed so far is directly applicable to these types of distributions.
However, in practice, sampling from these distributions using MCMC can be computationally challenging,  as we will investigate in this problem. Throughout this exercise, we will consider the bi-modal distribution
$$\tag{1}
f(x;\gamma,x_0)=\frac{e^{-\gamma (x^2-x_0)^2}}{Z}, \quad \gamma>0,
$$
where $Z$ is some normalizing constant. Depending on the values of $\gamma$ and $x_0$, designing a sampling strategy to properly sample from (1) can become challenging. Intuitively, if both peaks are too far apart, using a random walk Metropolis  (RWM) might not work, as it is possible for the sampler to get stuck on one of the peaks if the _step-size_ is too small. Conversely, a RWM with very large _steps_ might tend to reject quite often, thus rendering the whole sampling procedure inefficient. We begin by verifying this. Implement the RWM algorithm using as proposal distribution the density of the normal distribution $q(x,y)=\mathcal{N}(x,\sigma^2)$ and target distribution $f(x;\gamma,x_0)$ for $\gamma=1$,  $x_0=1,4,9,25$ and different choices of $\sigma$. Discuss the quality of your samples by analyzing the trace-plots (one realization of the chain), autocorrelation functions and histograms of the chains obtained.

**Remark:** Have a look at [`plt.acorr`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.acorr.html) for plotting the autocorrelation.
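
The constant $Z$ in (1) is not given in closed form here, so it can help to evaluate it numerically before comparing a histogram with the target. The following is a minimal sketch, assuming `scipy.integrate.quad` is available. The helper names `unnormalized_density` and `normalizing_constant`, as well as the truncation of the integration range to $\pm(\sqrt{x_0}+5)$, are illustrative choices and not part of the exercise.

```python
import numpy as np
from scipy.integrate import quad


def unnormalized_density(x, gamma, x0):
    """Unnormalized bi-modal density exp(-gamma * (x^2 - x0)^2)."""
    return np.exp(-gamma * (x ** 2 - x0) ** 2)


def normalizing_constant(gamma, x0):
    """Approximate Z by quadrature on a truncated interval (assumes x0 > 0)."""
    mode = np.sqrt(x0)          # the two modes sit at +/- sqrt(x0)
    half_width = mode + 5.0     # the tails are negligible beyond this point
    Z, _ = quad(unnormalized_density, -half_width, half_width,
                args=(gamma, x0), points=[-mode, mode])
    return Z


for x0 in (1, 4, 9, 25):
    print(f"x0 = {x0:2d}: modes at +/-{np.sqrt(x0):.0f}, Z = {normalizing_constant(1.0, x0):.4f}")
```

The modes move apart like $\sqrt{x_0}$ while the density at the origin, $e^{-\gamma x_0^2}$, decays very quickly, which is what makes crossing between the peaks hard for small step sizes.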

In [3]:
def f(x: float, gamma: float, x0: float) -> float:
    """
    Computes the density of a bi-modal distribution.

    Args:
        x (float): The point at which to evaluate the density.
        gamma (float): The gamma parameter of the distribution.
        x0 (float): The x0 parameter of the distribution.

    Returns:
        float: The density of the bi-modal distribution at point x.
    """
    return np.exp(-gamma * (x ** 2 - x0) ** 2)

In [4]:
def RWM(sigma: float, gamma: float, x0: float, N: int=10000) -> np.array:
    """
    Implements the Random Walk Metropolis (RWM) algorithm.

    Args:
        sigma (float): The standard deviation of the proposal distribution.
        gamma (float): The gamma parameter of the target distribution.
        x0 (float): The x0 parameter of the target distribution.
        N (int, optional): The number of samples to generate. Defaults to 10000.

    Returns:
        np.array: An array of samples generated by the RWM algorithm.
    """
    X = np.zeros(N)
    X[0] = 0
    px = f(X[0], gamma, x0)

    for i in range(N-1):
        y = X[i] + sigma * np.random.randn()
        py = f(y, gamma, x0)
        if py / px > np.random.random():
            X[i+1] = y
            px = py
        else:
            X[i+1] = X[i]
    return X

In [5]:
def plot_diagnostics(sigma: float, x0: float, N: int = 10000) -> None:
    """Interaction helper."""
    gamma = 1

    # Generate samples using the RWM algorithm
    X = RWM(sigma, gamma, x0, N)

    plt.figure(figsize=(15, 10))

    # Trace-plot
    plt.subplot(3, 1, 1)
    plt.plot(X)
    plt.title("Trace-plot")

    # Autocorrelation function
    plt.subplot(3, 1, 2)
    plt.acorr(X - X.mean(), maxlags=100, normed=True)
    plt.xlim(-1, 100)
    plt.title("Autocorrelation")

    # Histogram
    plt.subplot(3, 1, 3)
    plt.hist(X, bins=100, density=True)

    # Overlay the target density, normalized numerically since Z is unknown
    x = np.linspace(-np.sqrt(x0) - 3, np.sqrt(x0) + 3, 2000)
    fx = f(x, gamma, x0)
    plt.plot(x, fx / (fx.sum() * (x[1] - x[0])), 'r')
    plt.title("Histogram and target distribution")

    plt.suptitle(rf"$x_0 = {x0}, \gamma = {gamma}, \sigma = {sigma}$")
    
    plt.tight_layout();

We run the sampler for different values of $x_0$ and $\sigma$ and show the results in the following figure. When $x_0$ and $\sigma$ are of similar magnitude, the sampler explores the distribution correctly. In contrast, when $x_0$ is much larger than $\sigma$, the sampler tends to get stuck at one of the peaks of the distribution. This can be fixed by increasing $\sigma$. Notice that for $\sigma$ around $10$, the chain mixes better and its autocorrelation decays more rapidly, although the acceptance rate is 0.017, which is quite small. In general, it is difficult to choose an appropriate $\sigma$ for sampling from this type of multi-modal distribution with random walk Metropolis. Methods designed to overcome this difficulty include Hamiltonian Monte Carlo, delayed-rejection Metropolis-Hastings, and parallel tempering. We refer the interested reader to [1].

[1]: S. Brooks, A. Gelman. Handbook of Markov Chain Monte Carlo, 2011, CRC press.
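
The acceptance rate quoted above is not returned by the `RWM` function as written. A possible variant that tracks it is sketched below. It also works with the log-density, which avoids dividing two very small numbers when $x_0$ is large. The names `log_target` and `rwm_with_rate`, the `seed` argument, and the use of `numpy.random.default_rng` are assumptions made for this illustration.

```python
import numpy as np


def log_target(x, gamma, x0):
    """Log of the unnormalized bi-modal density."""
    return -gamma * (x ** 2 - x0) ** 2


def rwm_with_rate(sigma, gamma, x0, n_samples=10000, x_init=0.0, seed=None):
    """Random walk Metropolis in log-space; returns the chain and its acceptance rate."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_samples)
    chain[0] = x_init
    log_px = log_target(x_init, gamma, x0)
    n_accepted = 0

    for i in range(n_samples - 1):
        y = chain[i] + sigma * rng.standard_normal()   # symmetric Gaussian proposal
        log_py = log_target(y, gamma, x0)
        # Accept with probability min(1, f(y) / f(x)), evaluated on the log scale
        if np.log(rng.random()) < log_py - log_px:
            chain[i + 1] = y
            log_px = log_py
            n_accepted += 1
        else:
            chain[i + 1] = chain[i]

    return chain, n_accepted / (n_samples - 1)


for sigma in (0.1, 1.0, 10.0):
    _, rate = rwm_with_rate(sigma, gamma=1.0, x0=25.0, seed=0)
    print(f"sigma = {sigma:5.1f}: acceptance rate = {rate:.3f}")
```

Reporting the acceptance rate next to the trace-plot makes the trade-off between small and large step sizes easier to quantify.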

In [6]:
interact(plot_diagnostics, sigma=(0.1, 16.0), x0=[1, 4, 9, 25], N=(1000, 10000));


## Exercise 2

Ideally, we would like to obtain (approximately) i.i.d samples from a target distribution $f$ using Markov Chain Monte Carlo (MCMC) algorithms. 
One practical way of doing so is via _sub-sampling_ (also called _batch sampling_), which is implemented to reduce or eliminate correlation between the successive values in the Markov chain.
That is, instead of considering the entire chain $\{X_n\colon n\ge 0\}$, say, this technique sub-samples the chain with a batch size $k>1$, so that only the values $\{X_{kn}\colon n\ge 0\}$ are considered.
If the covariance $\text{Cov}_f(X_0,X_n)$ vanishes as $n\to\infty$, then the idea of sub-sampling is quite natural since $X_{kn}$ and $X_{k(n+1)}$ can be considered to be approximately independent for $k$ sufficiently big; estimating such a $k$ may be
difficult in practice though.
While sub-sampling provides a way of generating (approx.) i.i.d. samples from $f$ and may thus be useful for assessing the convergence of an MCMC method, it necessarily leads to an efficiency loss.
Let $\{X_n\in \mathbb{R}^d \colon n\ge 0\}$ be a Markov chain with a unique stationary distribution $f$, and $X_0 \sim f$ (i.e., the chain is at equilibrium). Take $\phi\colon \mathbb{R}^d\to\mathbb{R}$ such that $\mathbb{E}_f\bigl({\lvert \phi\rvert}^2\bigr)<\infty$ and consider two estimators for $\mu=\mathbb{E}_f(\phi)$, namely one that uses the entire Markov chain ($\hat\mu$) and one based on sub-sampling ($\hat\mu_{k}$) using only every $k$-th value:
\begin{equation*}
\hat\mu = \frac{1}{Nk}\sum_{n=1}^{Nk}\phi(X_n)\;,\quad\text{and}\quad \hat\mu_{k} = \frac{1}{N}\sum_{n=1}^N\phi(X_{nk})\;.
\end{equation*}
Show that the variance of $\hat\mu$ satisfies $\mathbb{V}_f(\hat\mu)\le \mathbb{V}_f(\hat\mu_{k})$ for every $k>1$.

Let $k>1$. Then define $\hat{\mu}_k^{(0)},\hat{\mu}_k^{(1)},\dots, \hat{\mu}_k^{(k-1)}$ as the shifted versions of $\hat{\mu}_k$, in the sense that:
\begin{equation*}
\hat{\mu}_k^{(j)} = \frac{1}{N}\sum_{n=1}^N\phi(X_{nk-j})\;,\quad j=0,1,\dots,k-1\;.
\end{equation*}
Notice that the estimator $\hat{\mu}$ can then be written as:
\begin{equation*}
\hat{\mu} = \frac{1}{k}\sum_{j=0}^{k-1} \hat{\mu}_k^{(j)}\;,
\end{equation*}
so that the variance of $\hat\mu$ satisfies:
\begin{equation*}
\begin{aligned}
\mathbb{V}_f(\hat\mu) = \mathbb{V}_f\biggl(\frac{1}{k}\sum_{j=0}^{k-1} \hat{\mu}_k^{(j)}\biggr)
&= \frac{\mathbb{V}_f\bigl(\hat{\mu}_k^{(0)}\bigr)}{k} + \sum_{i\not=j}\frac{\text{Cov}_f\bigl(\hat{\mu}_k^{(i)},\hat{\mu}_k^{(j)}\bigr)}{k^2}\\
&\le  \frac{\mathbb{V}_f\bigl(\hat{\mu}_k^{(0)}\bigr)}{k} + \sum_{i\not=j}\frac{\sqrt{\mathbb{V}_f\bigl(\hat{\mu}_k^{(i)}\bigr)\mathbb{V}_f\bigl(\hat{\mu}_k^{(j)}\bigr)}}{k^2}
\end{aligned}
\end{equation*}
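in view of the Cauchy-Schwarz inequality and stationarity, which gives
$\mathbb{V}_f\bigl(\hat{\mu}_k^{(j)}\bigr) = \mathbb{V}_f\bigl(\hat{\mu}_k^{(0)}\bigr)$ for every $j$. The
claim then follows from the stationarity of the Markov chain again,
indeed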
\begin{equation*}
\begin{aligned}
\mathbb{V}_f(\hat\mu) &\le  \frac{\mathbb{V}_f\bigl(\hat{\mu}_k^{(0)}\bigr)}{k} + \sum_{i\not=j}\frac{\sqrt{\mathbb{V}_f\bigl(\hat{\mu}_k^{(i)}\bigr)\mathbb{V}_f\bigl(\hat{\mu}_k^{(j)}\bigr)}}{k^2}\\
&= \frac{\mathbb{V}_f\bigl(\hat{\mu}_k^{(0)}\bigr)}{k} + \frac{k-1}{k}\mathbb{V}_f\bigl(\hat{\mu}_k^{(0)}\bigr) = \mathbb{V}_f\bigl(\hat{\mu}_k\bigr)\;.
\end{aligned}
\end{equation*}
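
The inequality can be checked in a case where all covariances are explicit. The sketch below assumes, purely for illustration, a stationary chain with $\text{Cov}(X_0,X_n)=\rho^{\lvert n\rvert}$ (as for a Gaussian AR(1) process with unit stationary variance) and $\phi(x)=x$, so that both variances are finite sums. The function name `variance_of_mean` and the parameter values are not part of the exercise.

```python
import numpy as np


def variance_of_mean(rho, n_terms, stride=1):
    """Variance of the average of n_terms values taken every `stride` steps
    from a stationary chain with autocovariance rho^|lag| (unit variance)."""
    idx = np.arange(n_terms)
    lags = np.abs(np.subtract.outer(idx, idx))   # |i - j| for all index pairs
    cov = rho ** (stride * lags)                 # Cov(X_{stride*i}, X_{stride*j})
    return cov.sum() / n_terms ** 2


N, k = 50, 5
for rho in (0.9, 0.5, -0.5):
    var_full = variance_of_mean(rho, N * k)           # uses all N*k values
    var_sub = variance_of_mean(rho, N, stride=k)      # uses every k-th value
    print(f"rho = {rho:+.1f}: V(full) = {var_full:.5f} <= V(sub-sampled) = {var_sub:.5f}")
```

For strongly correlated chains the two variances are close, which reflects that the discarded values carry little extra information, but the full-chain estimator never does worse.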

## Exercise 3

At every iteration of the general Metropolis-Hastings algorithm, a
new candidate state $\boldsymbol{Y}_{n+1}$ is proposed by sampling
$\boldsymbol{Y}_{n+1} \sim q(\boldsymbol{X}_{n},\cdot)$, given the
current state $\boldsymbol{X}_{n}$. Here,
$q(\boldsymbol{x},\boldsymbol{y})$ is the so-called proposal
density. Consider now the case where the proposal does not depend on
the current state, that is
$q(\boldsymbol{x},\boldsymbol{y}) \equiv q(\boldsymbol{y})$, so that the proposed candidate is $\boldsymbol{Y}_{n+1} \sim q$. This particular
Markov Chain Monte Carlo (MCMC) variant is sometimes called
_independent Metropolis-Hastings algorithm_ with fixed proposal
(or simply _independence sampler_). Let's denote the target
density by $f$. As such, this MCMC variant appears very similar to the
Accept-Reject method for sampling from $f$ (cf. Lab 02).

1. Suppose there exists a positive constant $C$ such that
	$f(\boldsymbol{x})\le C q(\boldsymbol{x})$ for any
	$\boldsymbol{x}\in\text{supp}(f)=\{\boldsymbol{x}\in\mathbb{R}^d\colon
	f(\boldsymbol{x})>0\}$. Show that the expected acceptance probability of the independent
		Metropolis-Hastings algorithm is _at least_ $\frac{1}{C}$
		whenever the chain is stationary. How does this compare to the
		expected acceptance probability of an Accept-Reject method?

Let $C>0$ such that $f(\boldsymbol{x})\le C q(\boldsymbol{x})$ for any $\boldsymbol{x}\in\text{supp}(f)$.
Suppose that the chain is stationary. 
Then the expected acceptance probability is
\begin{equation*}
\begin{aligned}
\mathbb{E}\Biggl(\min\biggl\{ \frac{f(\boldsymbol{Y}_{n+1})q(\boldsymbol{X}_n)}{f(\boldsymbol{X}_{n})q(\boldsymbol{Y}_{n+1})},1\biggr\}\Biggr) &=  \int \mathbb{I}_{\left\{\frac{f(\mathbf{y})q(\mathbf{x})}{q(\mathbf{y})f(\mathbf{x})}\ge 1\right\}}f(\mathbf{x})q(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y}\\
&+ \int \frac{f(\mathbf{y})q(\mathbf{x})}{q(\mathbf{y})f(\mathbf{x})} \mathbb{I}_{\left\{\frac{f(\mathbf{y})q(\mathbf{x})}{q(\mathbf{y})f(\mathbf{x})}< 1\right\}}f(\mathbf{x})q(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y}\\
&= 2\int \mathbb{I}_{\left\{\frac{f(\mathbf{y})q(\mathbf{x})}{q(\mathbf{y})f(\mathbf{x})}\ge 1\right\}}f(\mathbf{x})q(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y}\\
&\ge 2\int \mathbb{I}_{\left\{\frac{f(\mathbf{y})}{q(\mathbf{y})}\ge \frac{f(\mathbf{x})}{q(\mathbf{x})}\right\}}f(\mathbf{x})\frac{f(\mathbf{y})}{C}\,d\mathbf{x}\,d\mathbf{y}\\
& = \frac{2}{C} \mathbb{P}\Biggl( \frac{f(\boldsymbol{X}_1)}{q(\boldsymbol{X}_1)}\ge \frac{f(\boldsymbol{X}_2)}{q(\boldsymbol{X}_2)}\Biggr) = \frac{1}{C}\;,
\end{aligned}
\end{equation*}
where the last equality follows from the fact that $\boldsymbol{X}_1$ and $\boldsymbol{X}_2$ are independent and both distributed according to $f$ by assumption, so that the probability equals $1/2$ by symmetry (assuming ties occur with probability zero). In contrast, the average acceptance probability for an Accept-Reject method is always equal to $1/C$.
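
The bound can be illustrated by Monte Carlo in a toy setting that is not part of the exercise: target $f=\mathcal{N}(0,1)$ and proposal $q=\mathcal{N}(0,s^2)$ with $s>1$, for which $f/q\le s$, so $C=s$. The sketch below estimates the stationary acceptance probability by drawing independent pairs $X\sim f$, $Y\sim q$. The function name `expected_acceptance` and all parameter values are assumptions made for this example.

```python
import numpy as np


def expected_acceptance(s, n_pairs=200_000, seed=0):
    """Estimate E[min{1, w(Y)/w(X)}] with w = f/q, X ~ f = N(0,1), Y ~ q = N(0, s^2)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_pairs)         # current state at stationarity
    y = s * rng.standard_normal(n_pairs)     # independent proposal

    def log_w(z):
        # log(f(z) / q(z)) for the two Gaussian densities
        return np.log(s) - 0.5 * z ** 2 * (1.0 - 1.0 / s ** 2)

    # min{1, r} = exp(min{0, log r})
    return np.exp(np.minimum(0.0, log_w(y) - log_w(x))).mean()


for s in (1.5, 2.0, 4.0):
    print(f"s = {s}: acceptance ~ {expected_acceptance(s):.3f}, Accept-Reject rate 1/C = {1 / s:.3f}")
```

In this toy case the proposal is accepted with probability one exactly when $\lvert Y\rvert\le\lvert X\rvert$, and a direct calculation gives the value $\frac{4}{\pi}\arctan(1/s)$, which indeed stays above $1/s$.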

2. Let us compare the independent Metropolis-Hastings algorithm
	and the Accept-Reject method in some more detail by an
	example. Specifically, the goal is to sample from a Gamma
	distribution with shape parameter $\alpha$ and rate parameter
	$\beta$, denoted by $\text{Gamma}(\alpha,\beta)$, so that the target
	PDF reads
	$$f(x) \equiv f(x;\alpha,\beta) = \beta^\alpha x^{\alpha -1}
	e^{-\beta x}/\Gamma(\alpha) \mathbb{I}_{\{x\ge 0\}},$$ where
	$\Gamma$ denotes the Gamma function.

    **Hint:** [`scipy.special.gamma`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.special.gamma.html)

    1. Implement the Accept-Reject method to sample from
		$\text{Gamma}(\alpha,1)$ for $\alpha>1$, using the PDF of the
		$\text{Gamma}(a,b)$ distribution with $a = [\alpha]$ as auxiliary
		density (here $[\alpha]$ denotes the integer part of
		$\alpha$). Show that $b = [\alpha]/\alpha$ is the optimal choice for $b$.


        **Hint:** Recall that
    				$\sum_{k=1}^K \xi_k \sim \mathrm{Gamma}(K,\beta)$ for
    				$K\in\mathbb{N}$, if
    				$\xi_k\overset{\mathrm{i.i.d.}}{\sim}
    				\mathrm{Gamma}(1,\beta)\equiv \mathrm{Exp}(\beta)$.
    2. Use your Accept-Reject method to generate $m$ random numbers
    $X_1,\dots ,X_m$ with each $X_i\sim\text{Gamma}(\alpha,1)$, when
    using $n=5000$ random variables $Y_1,\dots ,Y_n$ from the auxiliary
    $\text{Gamma}([\alpha],[\alpha]/\alpha)$ distribution. Notice that
    $m$ is a random variable, which is smaller than $n$ due to
    rejections. Perform the simulations for $\alpha = 4.85$.

    3. Implement the independent Metropolis-Hastings algorithm using
    as proposal $q$ the PDF of the
    $\text{Gamma}([\alpha],[\alpha]/\alpha)$ distribution.

    4. Use the same sample $Y_1,\dots ,Y_n$ used within the
    Accept-Reject method, now in the corresponding Metropolis-Hastings
    algorithm to generate $n=5000$ realizations of the target
    distribution $\text{Gamma}(\alpha,1)$ with $\alpha = 4.85$.

    5. Compare both methods with respect to:
        1. their acceptance rates,
        2. their estimates for the mean of the $\text{Gamma}(4.85,1)$
        distribution, which is $4.85$,
        3. the correctness of the target distribution.


        Discuss your results.

Proving that $b = [\alpha]/\alpha$ is the optimal choice for $b$ follows from straightforward calculations.
In fact, the ratio $f(x;\alpha,1)/f(x;a,b)$ is 
\begin{equation*}
\frac{f(x;\alpha,1)}{f(x;a,b)} = b^{-a} x^{\alpha-a}e^{-(1-b)x} \frac{\Gamma(a)}{\Gamma(\alpha)} \;,
\end{equation*}
for any $x\ge 0$, which yields the bound 
$$
C = b^{-a}{\left(\frac{\alpha-a}{(1-b)e}\right)}^{\alpha-a} \frac{\Gamma(a)}{\Gamma(\alpha)}
$$
for $b<1$. The optimal choice for $b$ then follows by optimization of $C = C(b)$.
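
For completeness, the optimization step can be sketched as follows, treating $b\in(0,1)$ as a continuous variable and keeping $a=[\alpha]<\alpha$ fixed. Taking logarithms of $C(b)$ and differentiating gives
\begin{equation*}
\log C(b) = -a\log b - (\alpha-a)\log(1-b) + \text{const}\;,\qquad
\frac{d}{db}\log C(b) = -\frac{a}{b} + \frac{\alpha-a}{1-b}\;,
\end{equation*}
which vanishes if and only if $a(1-b) = (\alpha-a)b$, that is $b = a/\alpha$. Since
\begin{equation*}
\frac{d^2}{db^2}\log C(b) = \frac{a}{b^2} + \frac{\alpha-a}{(1-b)^2} > 0\;,
\end{equation*}
this critical point is the minimizer. Substituting $b=a/\alpha$ yields
\begin{equation*}
C = {\left(\frac{\alpha}{a}\right)}^{a}{\left(\frac{\alpha}{e}\right)}^{\alpha-a}\frac{\Gamma(a)}{\Gamma(\alpha)}\;,
\end{equation*}
and the ratio $f(x;\alpha,1)/f(x;a,b)$ attains this value at $x=\alpha$. For $\alpha=4.85$ and $a=4$ this gives $C\approx 1.105$.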

In [7]:
def f(x: np.array , a: float, b: float) -> np.array:
    """
    Evaluates the PDF of a Gamma distribution with parameters a and b.

    Args:
        x (np.array): An array of values for which to compute the PDF of the specified Gamma distribution.
        a (float): The shape parameter of the Gamma distribution.
        b (float): The rate parameter of the Gamma distribution.

    Returns:
        np.array: Computed PDF values for the input values
    """
    return b ** a * x ** (a - 1) * np.exp(-b * x) / gamma(a) * (x >= 0)

In [8]:
def optimal_C(alpha: float, a: float, b: float) -> float:
    """
    Optimal value for `C`: optimization of the ratio `f(x; alpha, 1) / f(x; a, b)`.
    
    Args:
        alpha (float): Parameter of the target distribution function `f`.
        a (float): The shape parameter of the Gamma distribution.
        b (float): The rate parameter of the Gamma distribution.

    Returns:
        float: Optimal value for `C`.
    """
    return ((alpha - a) / (1 - b)) ** (alpha - a) * np.exp(a - alpha) * gamma(a) / gamma(alpha) / b ** a

In [13]:
def acceptance_rejection(alpha: float, n: int = 5000) -> tuple:
    """
    Performs acceptance-rejection sampling to generate samples from a target distribution.

    Args:
        alpha (float): Parameter of the target distribution function `f`.
        n (int, optional): Number of samples to generate. Defaults to 5000.

    Returns:
        tuple: A tuple containing the generated samples (numpy array) and the acceptance probability (float).
    """
    a = np.floor(alpha)
    b = np.floor(alpha) / alpha
    C = optimal_C(alpha, a, b)
    print(C)

    Aar = 0  # Acceptance probability
    Xar = np.zeros(n)  # Array to store accepted sample
    i = 0  # Counter for accepted samples
    p = 0  # Counter for proposals
    
    while i < n:
        xi = np.random.gamma(shape=a, scale=1/b)  # Generate a sample from the proposal distribution
        ratio = f(xi, alpha, 1) / (f(xi, a, b) * C)  # Calculate the acceptance ratio
        p += 1
        if np.random.random() < ratio:
            Xar[i] = xi
            i += 1

    Aar = n / p  # Calculate the acceptance probability

    return Xar, Aar

In [14]:
def metropolis_hastings(alpha: float, n: int = 5000, initial_value: float = 4) -> tuple:
    """
    Performs Metropolis-Hastings sampling to generate samples from a target distribution.

    Args:
        alpha (float): Parameter of the target distribution function `f`.
        n (int, optional): Number of samples to generate. Defaults to 5000.
        initial_value (float, optional): Initial value of the Markov chain. Defaults to 4.

    Returns:
        tuple: A tuple containing the generated samples (numpy array) and the acceptance rate (float).
    """
    a = np.floor(alpha)
    b = np.floor(alpha) / alpha

    Xmh = np.zeros(n)  # Array to store generated samples
    Xmh[0] = initial_value  # Initialize the first sample
    Amh = 0  # Counter for accepted proposals

    for i in range(n-1):
        y = np.random.gamma(shape=a, scale=1/b)  # Generate a sample from the proposal distribution
        py = f(y, alpha, 1)
        px = f(Xmh[i], alpha, 1)
        qyx = f(Xmh[i], a, b)
        qxy = f(y, a, b)
        
        ratio = (py * qyx) / (px * qxy)  # Compute the Metropolis-Hastings ratio
        
        if np.random.random() < ratio:
            Xmh[i+1] = y
            Amh += 1
        else:
            Xmh[i+1] = Xmh[i]

    Amh = Amh / (n - 1)  # Acceptance rate over the n - 1 proposals made

    return Xmh, Amh

In [15]:
alpha = 4.85

In [16]:
Xar, Aar = acceptance_rejection(alpha)
Xmh, Amh = metropolis_hastings(alpha)

1.1051430406038085


In [13]:
print("--------------------------")
print(f"AR acceptance rate: {Aar:.3f}")
print(f"MH acceptance rate: {Amh:.3f}")
print("--------------------------")
print(f"AR mean: {np.mean(Xar):.3f}")
print(f"MH mean: {np.mean(Xmh):.3f}")
print("--------------------------")

--------------------------
AR acceptance rate: 0.902
MH acceptance rate: 0.938
--------------------------
AR mean: 4.871
MH mean: 4.893
--------------------------
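
The implementations above draw their proposals independently of each other, and the Accept-Reject routine samples until $n$ values are accepted. Items 2 and 4 of the exercise instead ask for a fixed budget of $n$ proposals $Y_1,\dots,Y_n$ shared by both methods. One way this could look is sketched below. The function names, the `seed` argument, the choice to start the chain at $Y_1$, and the use of log-densities via `scipy.special.gammaln` are assumptions of this sketch, not part of the solution above.

```python
import numpy as np
from scipy.special import gammaln


def log_gamma_pdf(x, shape, rate):
    """Log-density of Gamma(shape, rate) for x > 0."""
    return shape * np.log(rate) + (shape - 1) * np.log(x) - rate * x - gammaln(shape)


def compare_on_shared_proposals(alpha=4.85, n=5000, seed=0):
    """Run Accept-Reject and independent MH on the same proposals.

    Assumes alpha > 1 is not an integer, so that b < 1.
    """
    rng = np.random.default_rng(seed)
    a = np.floor(alpha)
    b = a / alpha
    y = rng.gamma(shape=a, scale=1 / b, size=n)     # shared proposals Y_1, ..., Y_n

    # log(f/q) at the proposals and its maximum log C, attained at x* = (alpha - a) / (1 - b)
    log_w = log_gamma_pdf(y, alpha, 1.0) - log_gamma_pdf(y, a, b)
    x_star = (alpha - a) / (1 - b)
    log_C = log_gamma_pdf(x_star, alpha, 1.0) - log_gamma_pdf(x_star, a, b)

    # Accept-Reject: keep Y_i with probability f(Y_i) / (C q(Y_i)); the count m is random
    x_ar = y[np.log(rng.random(n)) < log_w - log_C]

    # Independent MH: the chain starts at Y_1, every later Y_i is one proposal
    x_mh = np.empty(n)
    x_mh[0], log_w_cur, accepted = y[0], log_w[0], 0
    log_u = np.log(rng.random(n))
    for i in range(1, n):
        if log_u[i] < log_w[i] - log_w_cur:
            x_mh[i], log_w_cur = y[i], log_w[i]
            accepted += 1
        else:
            x_mh[i] = x_mh[i - 1]

    return x_ar, x_mh, len(x_ar) / n, accepted / (n - 1)


x_ar, x_mh, rate_ar, rate_mh = compare_on_shared_proposals()
print(f"AR: m = {len(x_ar)}, rate = {rate_ar:.3f}, mean = {x_ar.mean():.3f}")
print(f"MH: n = {len(x_mh)}, rate = {rate_mh:.3f}, mean = {x_mh.mean():.3f}")
```

With a shared proposal budget, Accept-Reject returns a random number $m\le n$ of independent values, whereas Metropolis-Hastings always returns $n$ correlated values.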


In [14]:
def plot_sampling_diagnostics(alpha: float) -> None:
    """
    Plots diagnostics for the Acceptance-Rejection and Metropolis-Hastings sampling methods.

    Args:
        alpha (float): Parameter of the target distribution function `f`.
    """
    n = 5000
    Xar, Aar = acceptance_rejection(alpha)
    Xmh, Amh = metropolis_hastings(alpha)

    a = np.floor(alpha)
    b = np.floor(alpha) / alpha
    C = optimal_C(alpha, a, b)

    x = np.linspace(0, 20, 1000)
    nn = np.arange(1, n+1)

    plt.figure(figsize=(15, 15))

    # Plot densities
    plt.subplot(3, 2, 1)
    plt.plot(x, C*f(x, a, b))
    plt.plot(x, f(x, alpha, 1))
    plt.legend(["Cg(x)", "f(x)"])
    plt.title("Densities")

    # Plot cumulative means
    plt.subplot(3, 2, 2)
    plt.semilogx(nn, np.cumsum(Xar)/nn)
    plt.semilogx(nn, np.cumsum(Xmh)/nn)
    plt.hlines(alpha, 1, n, color='black')  # Start at 1: the x-axis is logarithmic
    plt.legend(["AR", "MH", "True mean"])
    plt.title("Cumulative Means")

    # Plot autocorrelation for AR
    plt.subplot(3, 2, 3)
    plt.acorr(Xar - Xar.mean(), normed=True)
    plt.xlim(-1, 10)
    plt.legend(["AR"])
    plt.title("Autocorrelation for AR")

    # Plot autocorrelation for MH
    plt.subplot(3, 2, 4)
    plt.acorr(Xmh - Xmh.mean(), normed=True)  # Normalize as for AR so both plots are comparable
    plt.xlim(-1, 10)
    plt.legend(["MH"])
    plt.title("Autocorrelation for MH")

    # Hist for AR
    plt.subplot(3, 2, 5)
    plt.hist(Xar, bins=100, density=True)
    plt.plot(x, f(x, alpha, 1), label="f")
    plt.title('Hist for AR')
    plt.legend()

    # Hist for MH
    plt.subplot(3, 2, 6)
    plt.hist(Xmh, bins=100, density=True)
    plt.plot(x, f(x, alpha, 1), label="f")
    plt.title('Hist for MH')
    plt.legend()

    plt.tight_layout()
    plt.show()

The following figures show that both methods produce accurate approximations of the mean and the PDF. We would expect the sample generated by the MH algorithm to be more correlated, because the AR method produces independent samples, whereas MH produces a Markov chain. This, however, is difficult to see in the autocorrelation plots, as the samples obtained using MH decorrelate quite rapidly.

In [15]:
interact(plot_sampling_diagnostics, alpha=(1.0, 5.0));


## Exercise 4 

Let $X\subset\mathbb{R}^d$ and let $P_i:X\times\mathcal{B}(X)\rightarrow[0,1]$, $i=1,\dots,m$, be Markov transition kernels on $X$, with $\mathcal{B}(X)$ the associated Borel $\sigma$-algebra.

1. Given $a_1,\dots,a_m\in \mathbb{R}^+$, such that $\sum_{i=1}^{m}a_i=1$, show that $P(x,A)=\sum_{i=1}^{m} a_i P_i(x,A)$ is a Markov kernel.

We need to verify that 
1. $\forall A \in \mathcal{B}(X)$, $P(\cdot, A)$ is measurable.
2. $\forall x \in X$, $P(x, \cdot)$ is a probability measure on $(X, \mathcal{B}(X))$.

For the first we have that $\forall A \in \mathcal{B}(X)$, $P(\cdot, A) = \sum a_i P_i(\cdot, A)$ which is measurable as a linear combination of measurable functions. Moreover, $\forall x \in X$, $P(x, \cdot) = \sum_i a_i P_i(x, \cdot)$ is a probability measure on $(X, \mathcal{B}(X))$ since it is a convex combination of probability measures. Notice, in particular, that $P(x, X) = \sum_i a_i P_i(x, X) = \sum_i a_i = 1$.

2. Now consider $a_1,\dots,a_m\in \mathbb{R}$, such that $\sum_{i=1}^{m}a_i=1$ (i.e., not necessarily positive weights). Construct an affine combination of kernels $P(x,A)=\sum_{i=1}^{m}a_iP_i(x,A)$ that is still Markovian.

	 **Hint:** Consider a kernel $P_1$ for which there exist $\epsilon > 0$ and a probability measure $\nu$ on $(X, \mathcal{B}(X))$ such that $P_1(x, A) \geq \epsilon \nu(A)$ for all $x \in X$ and all $A \in \mathcal{B}(X)$.

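Consider a kernel $P_1$ for which there exist $\epsilon \in (0,1]$ and a probability measure $\nu$ on $(X, \mathcal{B}(X))$ such that $P_1(x, A) \geq \epsilon \nu(A)$ for all $x \in X$ and all $A \in \mathcal{B}(X)$. Then if we take $P_2(x, A) = \nu(A)$, the kernel $P(x, A) = (1 + a)P_1(x, A) - aP_2(x, A)$ is Markovian for any $0 < a < \epsilon$. Indeed, the weights $1+a$ and $-a$ sum to one, so $P(x,X)=1$; measurability and $\sigma$-additivity are inherited from $P_1$ and $\nu$; and $P(x,A) \geq \bigl((1+a)\epsilon - a\bigr)\nu(A) \geq 0$, since $a(1-\epsilon) < \epsilon$ whenever $a < \epsilon$.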
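
On a finite state space the construction can be checked directly, since kernels are row-stochastic matrices. The following sketch uses a three-state example chosen for illustration only. The matrices, the helper names `affine_kernel` and `is_markov`, and the tolerance are assumptions of this example.

```python
import numpy as np

nu = np.array([1 / 3, 1 / 3, 1 / 3])            # probability measure nu
P1 = np.array([[0.5, 0.3, 0.2],
               [0.3, 0.5, 0.2],
               [0.2, 0.2, 0.6]])                 # kernel with strictly positive entries
P2 = np.tile(nu, (3, 1))                         # P2(x, .) = nu for every x

# Largest eps with P1(x, {y}) >= eps * nu({y}); on a finite space this
# extends to all sets A by summing over the points of A.
eps = np.min(P1 / nu)


def affine_kernel(a):
    """Affine combination (1 + a) * P1 - a * P2 with weights summing to one."""
    return (1 + a) * P1 - a * P2


def is_markov(P, tol=1e-12):
    """Check non-negativity and unit row sums."""
    return bool(np.all(P >= -tol) and np.allclose(P.sum(axis=1), 1.0))


for a in (0.5, 2.0):
    print(f"eps = {eps:.2f}, a = {a}: Markovian = {is_markov(affine_kernel(a))}")
```

The row sums equal one for every $a$, so the only property that can fail is non-negativity, which is exactly what the minorization condition controls.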

3. Suppose that a measure $\pi:\mathcal{B}(X)\rightarrow[0,1]$ is invariant for each kernel $P_i$. Show that it is also invariant for $P=\sum_{i=1}^m a_i P_i$, where  $a_1,\dots,a_m\in \mathbb{R}^+$, such that $\sum_{i=1}^{m}a_i=1$.

Each kernel $P_i$ preserves $\pi$, that is
\begin{equation}
\int_X P_i(x, A)\pi(dx) = \pi(A), \ \ A \in \mathcal{B}(X),\ \ \forall i = 1,\dots, m.
\end{equation}
We have $\forall A \in \mathcal{B}(X)$ $\int_X P(x, A)\pi(dx) = \sum_i a_i \int_X P_i(x, A)\pi(dx) = \sum_i a_i \pi(A) = \pi(A)$, hence $P$ also preserves $\pi$.
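
The same statement can be verified numerically on a finite state space. The sketch below builds two Metropolis kernels that share the invariant distribution $\pi$ and checks that their convex mixture is again a Markov kernel preserving $\pi$. The target `pi`, the symmetric proposal matrices, the weights, and the helper name `metropolis_kernel` are hypothetical choices made for this illustration.

```python
import numpy as np


def metropolis_kernel(pi, Q):
    """Metropolis transition matrix for target pi and symmetric proposal matrix Q."""
    n = len(pi)
    P = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i, j] = Q[i, j] * min(1.0, pi[j] / pi[i])
        P[i, i] = 1.0 - P[i].sum()    # rejected mass stays at the current state
    return P


pi = np.array([0.2, 0.3, 0.5])
Q_uniform = np.array([[0.0, 0.5, 0.5],
                      [0.5, 0.0, 0.5],
                      [0.5, 0.5, 0.0]])
Q_neighbour = np.array([[0.5, 0.5, 0.0],
                        [0.5, 0.0, 0.5],
                        [0.0, 0.5, 0.5]])

kernels = [metropolis_kernel(pi, Q_uniform), metropolis_kernel(pi, Q_neighbour)]
weights = [0.3, 0.7]
P_mix = sum(w * K for w, K in zip(weights, kernels))

print("rows sum to one:", np.allclose(P_mix.sum(axis=1), 1.0))
print("pi is invariant:", np.allclose(pi @ P_mix, pi))
```

This is the mechanism behind mixing several MCMC moves at random: each move only needs to preserve the target on its own.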

## Exercise 5 (optional, no solution)

Consider a Markov chain $\{X_n\} \sim \text{Markov}(\pi,P)$ on a discrete state space $\mathcal{X}$ at equilibrium, with $P$ irreducible, and $\pi$ the unique invariant probability measure of $P$. 
Let $l^2_{\pi}$ be the Hilbert space $l^2_{\pi} = \{f:\mathcal{X} \to \mathbb{R}: \sum_{i\in \mathcal{X}} f(i)^2 \pi_i < \infty\}$ with inner product $(f,g)_{l^2_\pi} = \sum_{i \in \mathcal{X}} f(i)g(i)\pi_i$, and $l^2_{\pi,0} = \{f \in l^2_\pi: \mathbb{E}_\pi[f]=0\}$.

1. Show that if $(P,\pi)$ are in detailed balance, then $(Pf,g)_{l^2_\pi} = (f,Pg)_{l^2_\pi}$ for any $f,g \in l^2_\pi$.

2. Show that $\mathbb{E}[f(X_n)f(X_m)]=(P^{m-n}f,f)_{l^2_\pi}$ for any $f \in l^2_\pi$ and $m>n$.

3. Consider now the estimator $$\hat{\mu}_N = \frac{1}{N} \sum_{n=1}^N f(X_n)$$ of $\mu = \mathbb{E}_\pi[f]$ under the assumption that $f \in l^2_\pi$. Show that $\mathbb{E}_\pi[\hat{\mu}_N] = \mu$, and $$\mathbb{V}ar[\hat{\mu}_N] = \frac{1}{N} \sum_{l=0}^N c_{l,N} (P^l \tilde{f}, \tilde{f})_{l^2_\pi},$$ with $\tilde{f} = f - \mathbb{E}_{\pi}[f] \in l^2_{\pi,0}$ and
\begin{align}
c_{l,N} = \begin{cases} 1, \quad l=0 \\ 2(1-\frac{l}{N}), \quad l>0 \end{cases}
\end{align}

4. Conclude that the asymptotic variance $\mathbb{V}(f,P) := \lim_{N \to \infty} N \mathbb{V}ar_{\pi}(\hat{\mu}_N)$ satisfies $\mathbb{V}(f,P) = ((2(I-P)^{-1}-I)\tilde{f},\tilde{f})_{l^2_\pi}$ if 
$$
\tag{3}
\sup_{g \in l^2_{\pi,0},\, g \neq 0} \frac{(Pg,g)_{l^2_\pi}}{\|g\|_{l^2_\pi}^2} = \beta < 1.
$$

5. Consider now the two irreducible transition matrices $P_1$ and $P_2$, both in detailed balance with $\pi$ and satisfying (3) for some $\beta_1,\beta_2$.
Show that if $(P_1)_{ij} \geq (P_2)_{ij}$ for all $i \neq j$, then 
\begin{align}
\mathbb{V}(f,P_1) \leq \mathbb{V}(f,P_2),
\end{align}
for any $f \in l^2_\pi$.

    **Hint:** Take $P(\lambda) = (1-\lambda)P_1 + \lambda P_2, \lambda \in [0,1]$ and show that $\frac{d}{d \lambda} \mathbb{V}(f,P(\lambda)) \geq 0$.