# Stan (MCMC) utilities

<hr>

In [1]:
%load_ext blackcellmagic
%load_ext autoreload
%autoreload 2
import re

import numpy as np
import pandas as pd
df = pd.read_csv('sample_data.csv')

import cmdstanpy
import arviz as az

import bebi103

import bokeh.io

bokeh.io.output_notebook()

The bebi103 package contains a set of convenience functions for use with [Stan](http://mc-stan.org/). Much of the functionality is included in [ArviZ](https://arviz-devs.github.io/arviz/), and indeed the `bebi103.stan` module depends on ArviZ and extensively uses its `InferenceData` data type. Furthermore, the bebi103 package offers visualization functions for MCMC results which we will introduce as we go through a Bayesian workflow example below. Again, much of this functionality is present in ArviZ, but with modifications. Specifically, at least in my experience, there are some problems with the Bokeh backend with ArviZ.

## Model and sample data

To demonstrate the usage of the Stan utilities of bebi103, we will consider the following hierarchical generative model.

\begin{align}
&\theta_i \sim \text{Norm}(5, 5),\;\;i\in\{1, 2\},\\[1em]
&\tau_i \sim \text{HalfNorm}(0, 10),\;\;i\in\{1, 2\},\\[1em]
&\sigma_i \sim \text{HalfNorm}(0, 10),\;\;i\in\{1, 2\},\\[1em]
&\rho \sim \text{Uniform}(-1, 1),\\[1em]
&\mathsf{T} = \begin{pmatrix}\tau_1^2 & 0 \\ 0 & \tau_2^2\end{pmatrix}, \\[1em]
&\mathsf{\Sigma} = \begin{pmatrix}\sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2\end{pmatrix}, \\[1em]
&\theta_{1, i} \sim \text{Norm}(\theta, \mathsf{T}) \;\forall\,i\in\{1, 2, 3\},\\[1em]
& \begin{pmatrix} x_{i, j} \\ y_{i,j} \end{pmatrix} \sim \text{Norm}(\theta_{1, i}, \mathsf{\Sigma})\;\forall\,i \in \{1, 2, 3\}, \; j \in \{1, \ldots, n_i \}. 
\end{align}

This is a hierarchical model in which the three vector-valued parameters $\theta_{1,1}$, $\theta_{1,2}$, and $\theta_{1,3}$ are conditioned on the vector-valued hyperparameter $\theta$. We will use a fabricated data set generated from this model with parameters described in the [Using the user guide](using_user_guide.ipynb) section. It is useful to understand the data set, so let's look at the data frame.

In [2]:
df.head()

Unnamed: 0,x,y,trial
0,3.199667,11.432423,1
1,4.308811,10.89086,1
2,4.257118,11.665208,1
3,1.186072,10.39362,1
4,4.789491,11.416682,1


Each trial (there are three, numbered 1, 2, and 3) has an *x* measurement and a *y* measurement.
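
To build intuition for how a data set with this structure can arise, here is a minimal sketch that draws from the generative model above using only the Python standard library. The function name `simulate` and the hyperparameter values are made up for illustration; they are not the values used to generate `sample_data.csv`. Only the trial sizes (20, 25, and 18 points) mirror the data set.

```python
import math
import random


def simulate(theta, tau, sigma, rho, n_per_trial, seed=0):
    """Draw (x, y, trial) rows from the hierarchical model.

    theta, tau, and sigma are length-2 sequences; n_per_trial lists n_i.
    All parameter values passed in below are hypothetical.
    """
    rng = random.Random(seed)
    rows = []
    for trial, n in enumerate(n_per_trial, start=1):
        # T is diagonal, so the two components of theta_{1,i} are independent
        theta_1 = [rng.gauss(theta[k], tau[k]) for k in range(2)]
        for _ in range(n):
            # Correlated bivariate normal built from two standard normals
            z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
            x = theta_1[0] + sigma[0] * z1
            y = theta_1[1] + sigma[1] * (rho * z1 + math.sqrt(1 - rho**2) * z2)
            rows.append((x, y, trial))
    return rows


rows = simulate(
    theta=(3.0, 8.0),
    tau=(1.0, 2.0),
    sigma=(1.5, 1.5),
    rho=0.8,
    n_per_trial=(20, 25, 18),
)
print(len(rows), rows[0])
```

The construction of `y` from `z1` and `z2` gives the pair variances $\sigma_1^2$ and $\sigma_2^2$ and covariance $\rho\sigma_1\sigma_2$, which matches $\mathsf{\Sigma}$ in the model.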

To begin our analysis, we will use the following Stan model, which directly follows from the above generative model.

```C
data {
  // Total number of data points
  int N;

  // Number of entries in each level of the hierarchy
  int J_1;

  //Index array to keep track of hierarchical structure
  int index_1[N];

  // The measurements
  real x[N];
  real y[N];
}


transformed data {
  // Data are two-dimensional, so store in a vector
  vector[2] xy[N];

  for (i in 1:N) {
    xy[i, 1] = x[i];
    xy[i, 2] = y[i];
  }
}


parameters {
  // Hyperparameters level 0
  vector[2] theta;

  // How hyperparameters vary
  vector<lower=0>[2] tau;

  // Parameters
  vector[2] theta_1[J_1];
  vector<lower=0>[2] sigma;
  real<lower=-1, upper=1> rho;
}


transformed parameters {
  // Covariance matrix for hyperparameters
  matrix[2, 2] Tau = [
    [tau[1]^2,  0       ], 
    [0,         tau[2]^2]
  ];

  // Covariance matrix for likelihood
  matrix[2, 2] Sigma = [
    [sigma[1]^2,                 rho * sigma[1] * sigma[2]], 
    [rho * sigma[1] * sigma[2],  sigma[2]^2               ]
  ];
}


model {
  // Hyperpriors
  theta ~ normal(5, 5);
  tau ~ normal(0, 10);

  // Priors
  theta_1 ~ multi_normal(theta, Tau);
  sigma ~ normal(0, 10);
  rho ~ uniform(-1, 1);

  // Likelihood
  for (i in 1:N) {
    xy[i] ~ multi_normal(theta_1[index_1[i]], Sigma);
  }
}


generated quantities {
  real x_ppc[N];
  real y_ppc[N];
  real log_lik[N];

  {
    vector[2] xy_ppc;

    for (i in 1:N) {
      xy_ppc = multi_normal_rng(theta_1[index_1[i]], Sigma);
      log_lik[i] = multi_normal_lpdf(xy[i] | theta_1[index_1[i]], Sigma);
      x_ppc[i] = xy_ppc[1];
      y_ppc[i] = xy_ppc[2];
    }
  }
}
```

A few important notes about the Stan file.

1. The array `index_1` maps each data point to the entry of `theta_1` on which it is conditioned. The entries of `theta_1` are themselves conditioned on the hyperparameters.
2. Posterior predictive samples and the pointwise log-likelihood are computed in the `generated quantities` block.

## A first pass through a simple workflow

We will now proceed through an MCMC workflow, doing the following steps:

1. Compile the Stan model.
2. Convert the data frame into input data for the Stan model.
3. Draw samples and convert them into an ArviZ `InferenceData` instance.
4. Check the sampling diagnostics.
5. Create visualizations of the samples.

### Compiling the Stan model

We start with the first step, compiling the Stan model so that we can use it for sampling. We will use [CmdStanPy](https://cmdstanpy.readthedocs.io/en/latest/). Although the bebi103 package supports [PyStan](https://pystan.readthedocs.io/en/latest/), CmdStanPy is preferred because of the simplicity of its interface and shorter model compilation time.

If we wish to suppress messages to the screen, we can use `bebi103.stan.disable_logging()` with context management.

In [3]:
with bebi103.stan.disable_logging():
    sm = cmdstanpy.CmdStanModel(stan_file='sample_model.stan')

### Preparing hierarchical data for Stan

Now that we have a compiled Stan model, we need to supply it with data. Referring to the data block in the Stan code above, for this hierarchical model we need to specify the number of data points, the values of *x* and *y*, the number of different parameters (`J_1`), and a set of indices that give which of the three $\theta_1$ values each data point is conditioned on. This information is already contained in the tidy data frame, `df`. The `'x'` and `'y'` columns contain the data, and the `'trial'` column determines the level in the hierarchy. The value of `J_1` is given by the number of unique entries in the `'trial'` column, and the number of data points is given by the total number of rows in the data frame.

The `bebi103.stan.df_to_datadict_hier()` function converts a tidy Pandas data frame into a dictionary of data that can be passed to a Stan program for a hierarchical model. Since the `'trial'` column could contain strings or other non-integer entries, these need to be converted to integers for use in Stan. A column with the suffix `_stan` is added to the data frame to give the values of that column that are used in the Stan program. This is for reference when analyzing the results.

In [4]:
data, df = bebi103.stan.df_to_datadict_hier(
    df, level_cols="trial", data_cols=["x", "y"]
)

# Take a look at the updated data frame
df.head()

Unnamed: 0,x,y,trial,trial_stan
0,3.199667,11.432423,1,1
19,4.914436,12.063128,1,1
18,1.435777,7.652054,1,1
17,3.720426,9.402883,1,1
16,5.217736,11.316355,1,1


The data frame has been appropriately updated (the trials were numbers, so there is no difference). Let's look at the `data` dictionary.

In [5]:
data

{'N': 63,
 'J_1': 3,
 'index_1': array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
        2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]),
 'x': array([ 3.19966682e+00,  4.91443608e+00,  1.43577663e+00,  3.72042621e+00,
         5.21773649e+00,  2.45600834e+00,  1.41347129e+00,  3.39407189e+00,
         4.03784248e+00,  2.84546683e+00,  2.33852942e+00,  4.74669596e+00,
         2.40726012e+00,  5.19411912e+00,  4.07225410e+00,  4.78949124e+00,
         1.18607230e+00,  4.25711845e+00,  4.30881119e+00,  9.00909323e+00,
        -2.16818532e-01,  1.08560503e+00,  3.19326444e+00,  6.01394969e-01,
        -5.74518522e-01, -2.34811850e-01,  2.11118778e+00,  5.10072325e+00,
         5.12538845e+00,  2.19125022e+00,  1.69384109e+00,  3.86955927e+00,
         5.11263131e+00,  3.12546099e+00,  1.83221644e+00,  5.02185204e-01,
         7.09153643e+00, -1.85898197e+00,  2.21074922e+00,  3.34

The dictionary has all of the entries necessary to pass into the Stan program while sampling.
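
To make the structure of such a dictionary concrete, here is a rough sketch of how one could be assembled by hand from a list of rows. This is not the bebi103 implementation, and the helper name `to_datadict` and the toy rows are hypothetical. It sorts the rows by level, maps each unique label to a one-based integer, and collects the data columns.

```python
def to_datadict(rows, level_key, data_keys):
    """Build a Stan-style data dictionary from a list of row dicts.

    Hypothetical helper for illustration; labels may be strings or numbers.
    """
    # Map each unique label to a one-based integer (Stan indexes from 1)
    labels = sorted({row[level_key] for row in rows})
    label_to_int = {label: i for i, label in enumerate(labels, start=1)}

    # Sort rows so that data points in the same level are contiguous
    rows = sorted(rows, key=lambda row: label_to_int[row[level_key]])

    data = {
        "N": len(rows),
        "J_1": len(labels),
        "index_1": [label_to_int[row[level_key]] for row in rows],
    }
    for key in data_keys:
        data[key] = [row[key] for row in rows]
    return data


toy_rows = [
    {"x": 1.0, "y": 6.0, "trial": "b"},
    {"x": 3.2, "y": 11.4, "trial": "a"},
    {"x": 4.3, "y": 10.9, "trial": "a"},
]
data_sketch = to_datadict(toy_rows, level_key="trial", data_keys=["x", "y"])
print(data_sketch)
```

The string labels `"a"` and `"b"` become the integers 1 and 2, which is the role the `trial_stan` column plays in the data frame above.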

### Sampling and conversion to ArviZ InferenceData

We can now go about our sampling and get the results. We will again suppress logging. Note that this is in general not a good idea. I do it here to manage space in the documentation.

In [6]:
with bebi103.stan.disable_logging():
    samples = sm.sample(data=data, chains=4, iter_sampling=1000, seed=3252)

The bebi103 package works with the excellent ArviZ `InferenceData` data type. ArviZ can be used to convert samples from Stan, PyMC3, emcee, Pyro, and many others into a well-defined universal data format.

In [7]:
samples = az.from_cmdstanpy(samples, posterior_predictive=['x_ppc', 'y_ppc'], log_likelihood='log_lik')

### Converting InferenceData to a Pandas DataFrame

Now that the samples are stored as `InferenceData`, we can go about using them. Before we do that, we may want to convert the posterior samples into a Pandas data frame to do our own analysis.

In [8]:
df_samples = bebi103.stan.posterior_to_dataframe(samples)

# Take a look
df_samples.head()

Unnamed: 0,theta[0],theta[1],tau[0],tau[1],"theta_1[0,0]","theta_1[0,1]","theta_1[1,0]","theta_1[1,1]","theta_1[2,0]","theta_1[2,1]",...,"Tau[0,1]","Tau[1,0]","Tau[1,1]","Sigma[0,0]","Sigma[0,1]","Sigma[1,0]","Sigma[1,1]",chain__,draw__,divergent__
0,0.745658,7.73568,2.13085,3.9549,2.95811,10.0316,1.77162,5.7839,1.92782,6.85613,...,0.0,0.0,15.6412,4.22708,3.61024,3.61024,3.78122,0,0,False
1,1.55425,6.17589,1.8292,4.57194,3.57556,10.2306,2.50321,6.05627,1.35128,6.64666,...,0.0,0.0,20.9026,4.07698,3.32477,3.32477,3.33682,0,1,False
2,2.86515,5.40452,3.3957,4.85225,3.94654,10.8954,2.44126,5.89735,1.02892,6.20736,...,0.0,0.0,23.5443,3.5881,2.95844,2.95844,3.16458,0,2,False
3,2.44369,6.98701,0.740784,3.08805,3.64068,10.344,2.64325,6.08444,1.43757,6.36464,...,0.0,0.0,9.53604,3.89782,2.9579,2.9579,2.90648,0,3,False
4,2.99485,6.52673,0.324104,2.45259,3.51659,10.1793,2.94214,6.57432,2.40215,7.24472,...,0.0,0.0,6.01521,4.40357,3.88878,3.88878,4.17561,0,4,False


Note how multidimensional variables are named. They are **zero-indexed**, and use brackets with different axes of the index separated by a comma **with no spaces**. Note also that the last three columns are `chain__`, `draw__`, and `divergent__`, which specify which chain the sample comes from, which of the successive draws it was, and whether or not it had a divergence, respectively. This nomenclature is used throughout the bebi103 MCMC parsing functions.
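
The naming convention can be reproduced with a few lines of Python, which is handy when building `var_names` or `omit` lists programmatically. The helper name `flat_names` is hypothetical and is not part of bebi103.

```python
import itertools


def flat_names(name, shape):
    """Return flattened names such as 'theta_1[0,1]' for an array-valued variable."""
    if not shape:
        # Scalars carry no index
        return [name]
    return [
        f"{name}[{','.join(str(i) for i in idx)}]"
        for idx in itertools.product(*(range(n) for n in shape))
    ]


print(flat_names("rho", ()))
print(flat_names("theta_1", (3, 2)))
```

For `theta_1` with shape `(3, 2)`, this gives the same six names that appear as columns in the data frame above.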

### Checking diagnostics

We may check the following diagnostic features:

1. The effective sample size (ESS).
2. The rank-normalized Gelman-Rubin statistic, Rhat ($\hat{R}$).
3. Divergences.
4. Tree depth.
5. Energy Bayesian fraction of missing information (E-BFMI).

The functions to check these diagnostics are based on some of the functionality in ArviZ and from the [work of Michael Betancourt](https://github.com/betanalpha/jupyter_case_studies/blob/master/pystan_workflow/stan_utility.py).

We can start by checking the effective sample size. The `bebi103.stan.check_ess()` function checks for both the effective sample size and the "tail" effective sample size, as defined in [Vehtari, et al., 2019](https://arxiv.org/abs/1903.08008). 

In [9]:
bebi103.stan.check_ess(samples)

Effective sample size looks reasonable for all parameters.


True

The function returned `True`, indicating that all variables have an effective sample size and tail effective sample size greater than the rule of thumb of 100 per chain. We can adjust the rule of thumb, if we like. If we use the `total_ess_rule_of_thumb=400` kwarg, then any parameter with fewer than 1600 effective samples (since we had four chains) is flagged as a failure of the ESS check.

In [10]:
bebi103.stan.check_ess(samples, total_ess_rule_of_thumb=400)

tail-ESS for parameter theta[0] is 1419.9502356659.
ESS for parameter tau[0] is 1303.2077873488836.
tail-ESS for parameter tau[0] is 1385.2222284396416.
ESS for parameter Tau[0,0] is 1303.209227835824.
tail-ESS for parameter Tau[0,0] is 1385.2222284396416.
  ESS or tail-ESS below 100 per chain indicates that expectation values
  computed from samples are unlikely to be good approximations of the
  true expectation values.


False

The function now returns `False`, and also prints diagnostic information about the parameters for which the test failed.
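
The threshold arithmetic can be sketched as follows, assuming (as described above) that the rule of thumb is a per-chain value multiplied by the number of chains. The function name `ess_threshold` is hypothetical, and the ESS values are copied from the printed output.

```python
def ess_threshold(rule_of_thumb_per_chain, n_chains):
    """Total ESS below which a parameter is flagged."""
    return rule_of_thumb_per_chain * n_chains


# ESS values copied from the output above
ess = {"tau[0]": 1303.2077873488836, "Tau[0,0]": 1303.209227835824}

for rule in (100, 400):
    threshold = ess_threshold(rule, n_chains=4)
    flagged = [name for name, value in ess.items() if value < threshold]
    print(rule, threshold, flagged)
```

With the default of 100 per chain the threshold is 400 and nothing is flagged. With 400 per chain the threshold is 1600, and both parameters fall below it.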

Now let's turn to Rhat. The Rhat value is a rank-normalized, folded Rhat, as described in [Vehtari, et al., 2019](https://arxiv.org/abs/1903.08008).

In [34]:
bebi103.stan.check_rhat(samples)

['theta[0]', 'theta[1]', 'tau[0]', 'tau[1]', 'theta_1[0,0]', 'theta_1[0,1]', 'theta_1[1,0]', 'theta_1[1,1]', 'theta_1[2,0]', 'theta_1[2,1]', 'sigma[0]', 'sigma[1]', 'rho', 'Tau[0,0]', 'Tau[0,1]', 'Tau[1,0]', 'Tau[1,1]', 'Sigma[0,0]', 'Sigma[0,1]', 'Sigma[1,0]', 'Sigma[1,1]']
Rhat for parameter Tau[0,1] is NaN.
Rhat for parameter Tau[1,0] is NaN.


False

The function returned `False`, meaning that the Rhat check failed. It informed us that two of the Rhat values were NaN, which causes a failure of the check. It is often the case that variables in the `transformed parameters` or `generated quantities` blocks have NaN Rhat, for instance when they are constant across all draws. In this case, the off-diagonal terms in the matrix `Tau` are both always zero. We can omit these using the `omit` kwarg.

In [12]:
bebi103.stan.check_rhat(samples, omit=['Tau[0,1]', 'Tau[1,0]'])

Rhat looks reasonable for all parameters.


True
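
To see why a constant quantity yields a NaN Rhat, consider the classic Gelman-Rubin statistic. This is a simplified sketch, not the rank-normalized, folded version that bebi103 and ArviZ report, and the function name `classic_rhat` is hypothetical. The statistic divides by the within-chain variance, which is zero for a constant.

```python
import math
import statistics


def classic_rhat(chains):
    """Classic (not rank-normalized) Gelman-Rubin statistic for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: mean within-chain variance; B: n times the variance of the chain means
    W = statistics.fmean([statistics.variance(c) for c in chains])
    B = n * statistics.variance(chain_means)
    if W == 0:
        # Division by zero; for a constant variable this is 0/0, so report NaN
        return math.nan
    var_plus = (n - 1) / n * W + B / n
    return math.sqrt(var_plus / W)


constant = [[0.0] * 5 for _ in range(4)]
varying = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
print(classic_rhat(constant))
print(classic_rhat(varying))
```

The constant chains mimic the always-zero off-diagonal entries of `Tau`.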

Next, we can check for divergences.

In [13]:
bebi103.stan.check_divergences(samples)

47 of 4000 (1.175%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.


False

We failed the divergence test because we had 47 total divergences. We will deal with these momentarily.

We can also check if the sampler was hitting the maximum tree depth. By default, the maximum tree depth used in a sampling call by CmdStanPy is 10, which is also the default value of the `max_treedepth` kwarg of `bebi103.stan.check_treedepth()`. You need to set this kwarg to match whatever tree depth was used in the sampling. In our case, we can use the default.

In [14]:
bebi103.stan.check_treedepth(samples)

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.


True

We have passed this test.

Finally, we can check the E-BFMI.

In [15]:
bebi103.stan.check_energy(samples)

E-BFMI indicated no pathological behavior.


True

We have also passed this test.

For convenience, we can perform all tests at once using the `bebi103.stan.check_all_diagnostics()` function.

In [16]:
bebi103.stan.check_all_diagnostics(samples, omit=['Tau[0,1]', 'Tau[1,0]'])

Effective sample size looks reasonable for all parameters.

Rhat looks reasonable for all parameters.

47 of 4000 (1.175%) iterations ended with a divergence.
  Try running with larger adapt_delta to remove divergences.

0 of 4000 (0.0%) iterations saturated the maximum tree depth of 10.

E-BFMI indicated no pathological behavior.


4

The return value of this function is an integer warning code. When the code is written in binary, each digit indicates whether or not a test passed, with a digit of zero indicating that the test passed. The ordering of the tests is:

- ess
- r_hat
- divergences
- tree depth
- E-BFMI

For example, a warning code of 12 has a binary representation of 01100, which means that R-hat and divergences tests failed. Our warning code of 4 has a binary representation of 00100, which means that the divergence test failed.
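
A small decoder makes the two examples concrete. This sketch follows the reading convention of the examples in the text (five binary digits read left to right in the order of the list). It is not taken from the bebi103 source, and the names `TESTS` and `failed_tests` are hypothetical. If the library assigns bits in a different order, the mapping would need to be adjusted.

```python
TESTS = ["ess", "r_hat", "divergences", "tree depth", "E-BFMI"]


def failed_tests(warning_code):
    """List failed tests, reading the five binary digits left to right as in the text."""
    digits = format(warning_code, "05b")
    return [test for test, digit in zip(TESTS, digits) if digit == "1"]


print(format(4, "05b"), failed_tests(4))
print(format(12, "05b"), failed_tests(12))
```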

### Visualizing results

The bebi103 package allows for three visualizations of MCMC results.

- Trace plots: `bebi103.viz.trace()`
- Parallel coordinate plots: `bebi103.viz.parcoord()`
- Corner plot: `bebi103.viz.corner()`

Each of these takes as their first four arguments:

- `samples`: ArviZ `InferenceData` instance containing the samples. (Note that `corner()` also accepts other data types for `samples`, such as in the [display of bootstrap replicates](bootstrap.ipynb#Graphical-display-of-a-confidence-region).)
- `var_names`: List of the names of variables to include in the plot.
- `palette`: List of colors to use.
- `omit`: List of variable names or regular expression patterns for variables to omit in the plot.

`samples` is the only required argument. 

#### Trace plots

We will begin with a trace plot of the hyperparameters *θ* and *τ*.

In [17]:
bokeh.io.show(
    bebi103.viz.trace(samples, var_names=['theta[0]', 'theta[1]', 'tau[0]', 'tau[1]'])
)

The traces are color-coded by chain.

#### Parallel coordinate plots

For a parallel coordinate plot, we want to omit all four entries of the matrices `Tau` and `Sigma` because the entries are either zero or determined uniquely by other parameters. We can specify regular expressions in the `omit` list to conveniently omit all of the entries in these matrix parameters.

In [19]:
bokeh.io.show(
    bebi103.viz.parcoord(samples, omit=[re.compile("Tau.*"), re.compile("Sigma.*")])
)

In the plot, samples that did not result in a divergence are semi-transparent gray by default. Samples that had a divergence are colored in orange and shown with thicker lines. Because the scales of the respective parameters may differ, it is often easier to rescale each parameter for the plot. We can do this with the `transformation` kwarg. Setting it to `'minmax'` rescales each parameter so that its minimum is zero and its maximum is one.

In [20]:
bokeh.io.show(
    bebi103.viz.parcoord(
        samples,
        omit=[re.compile("Tau.*"), re.compile("Sigma.*")],
        transformation="minmax",
    )
)

This plot is more revealing. We see (especially if we zoom in) that most of the divergences are going through small `tau[0]`. This is indicative of the "funnel" behavior often encountered in hierarchical models.
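
The min-max rescaling used for this plot amounts to the following. This is a sketch of the transformation itself, not of how bebi103 applies it internally. The helper name `minmax` and the handling of a constant parameter are assumptions made for illustration.

```python
def minmax(values):
    """Rescale a sequence so its minimum maps to 0 and its maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant parameter has no spread to rescale; map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# tau[0] draws copied from the first rows of df_samples above
tau_0 = [2.13085, 1.8292, 3.3957, 0.740784, 0.324104]
scaled = minmax(tau_0)
print(scaled)
```

After rescaling, every parameter occupies the same vertical range, so the lines for small `tau[0]` are easy to pick out regardless of the parameter's original units.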

#### Corner plots

Corner plots display scatter plots of samples from all pairs of variables, as well as histograms or ECDFs for individual variables. We can make a corner plot for the hyperparameters and also for ρ.

In [21]:
bokeh.io.show(
    bebi103.viz.corner(samples, var_names=['theta[0]', 'theta[1]', 'rho', 'tau[0]', 'tau[1]'])
)

These are very useful plots for interpreting a posterior, and also for diagnosing problems with sampling. The divergent samples appear in orange. The clustering of divergences for small `tau[0]` is clear.

The diagonal can also be plotted with ECDFs and contours may also be added to the off-diagonal plots.

In [22]:
bokeh.io.show(
    bebi103.viz.corner(
        samples,
        var_names=["theta[0]", "theta[1]", "rho", "tau[0]", "tau[1]"],
        plot_ecdf=True,
        show_contours=True,
    )
)

## Simulation-based calibration (SBC)

In a [principled Bayesian workflow](https://betanalpha.github.io/assets/case_studies/principled_bayesian_workflow.html), simulation-based calibration (SBC) is a useful tool to identify pathologies in a generative model and in the sampler's ability to draw samples from it. The procedure is laid out in [Talts, et al., 2018](https://arxiv.org/abs/1804.06788), and is described in the [Stan manual](https://mc-stan.org/docs/2_24/stan-users-guide/simulation-based-calibration.html).

Briefly, the procedure is:

1. Draw a parameter set *θ* out of the prior. 
2. Use *θ* to draw a data set *y* out of the likelihood.
3. Perform MCMC sampling of the posterior using *y* as if it were the actual measured data set. Draw *L* MCMC samples of the parameters.
4. Do steps 1-3 *N* times (hundreds to a thousand is usually a good number).

For each of these calculations, you should run diagnostics to make sure the effective sample size, Rhat, etc., are within acceptable ranges. You can also compute z-scores, shrinkage, and rank statistics. Please see the [Talts, et al. paper](https://arxiv.org/abs/1804.06788) for details on these quantities and their interpretation. Our focus here is how to use the `bebi103.stan.sbc()` function to perform the analysis and compute these quantities.
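
As a toy illustration of the four steps above, independent of Stan and of bebi103, consider a one-parameter model with $\theta \sim \text{Norm}(0, 1)$ and $y_j \sim \text{Norm}(\theta, 1)$. The posterior is known in closed form, so exact posterior draws stand in for MCMC. The function name `sbc_ranks` and all settings are hypothetical.

```python
import random
import statistics


def sbc_ranks(n_sbc=500, n_draws=99, n_obs=10, seed=3252):
    """Rank statistics for a toy model: theta ~ Norm(0, 1), y_j ~ Norm(theta, 1)."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sbc):
        # Step 1: draw a parameter from the prior
        theta = rng.gauss(0, 1)
        # Step 2: draw a data set from the likelihood
        y = [rng.gauss(theta, 1) for _ in range(n_obs)]
        # Step 3: draw from the posterior, which is known in closed form here
        post_mean = sum(y) / (n_obs + 1)
        post_sd = (1 / (n_obs + 1)) ** 0.5
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]
        # Rank of the true parameter among the posterior draws
        ranks.append(sum(d < theta for d in draws))
    return ranks


ranks = sbc_ranks()
# With correct posterior draws, ranks are uniform on 0, ..., n_draws
print(statistics.fmean(ranks), min(ranks), max(ranks))
```

Because the draws come from the exact posterior, the ranks should be uniformly distributed. A sampler that fails to explore part of the posterior would instead produce a visibly non-uniform rank histogram.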

### Example model

We have already seen in the example so far that we have divergences and the sampler does not seem to be able to sample small values of `tau[0]`. We will therefore write a new Stan program where we have noncentered parameters.

In [23]:
sm_nc = cmdstanpy.CmdStanModel(stan_file='sample_model_noncentered.stan')

INFO:cmdstanpy:compiling stan program, exe file: /Users/bois/Dropbox/git/bebi103/doc/user_guide/sample_model_noncentered
INFO:cmdstanpy:compiler options: stanc_options=None, cpp_options=None
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103/doc/user_guide/sample_model_noncentered
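
The contents of `sample_model_noncentered.stan` are not shown in this section. As a general reminder of what noncentering means, here is a scalar sketch in Python: instead of drawing $\theta_1 \sim \text{Norm}(\theta, \tau)$ directly, draw $\tilde{\theta} \sim \text{Norm}(0, 1)$ and set $\theta_1 = \theta + \tau\tilde{\theta}$. The function names and the values of `theta` and `tau` are hypothetical.

```python
import random
import statistics


def centered(theta, tau, rng):
    """Draw theta_1 ~ Norm(theta, tau) directly."""
    return rng.gauss(theta, tau)


def noncentered(theta, tau, rng):
    """Draw theta_tilde ~ Norm(0, 1), then shift and scale."""
    theta_tilde = rng.gauss(0, 1)
    return theta + tau * theta_tilde


rng = random.Random(3252)
theta, tau = 5.0, 2.0  # hypothetical values
draws_c = [centered(theta, tau, rng) for _ in range(20000)]
draws_nc = [noncentered(theta, tau, rng) for _ in range(20000)]

# Both parametrizations give the same distribution for theta_1
print(statistics.fmean(draws_c), statistics.stdev(draws_c))
print(statistics.fmean(draws_nc), statistics.stdev(draws_nc))
```

The two parametrizations describe the same model. The difference is that the sampler works with $\tilde{\theta}$, whose scale does not depend on $\tau$, which tends to ease the funnel geometry at small $\tau$.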


In [24]:
samples_nc = sm_nc.sample(data=data, chains=4, iter_sampling=1000, seed=3252, adapt_delta=0.99, max_treedepth=15)

INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:start chain 2
INFO:cmdstanpy:start chain 3
INFO:cmdstanpy:start chain 4
INFO:cmdstanpy:finish chain 4
INFO:cmdstanpy:finish chain 3
INFO:cmdstanpy:finish chain 2
INFO:cmdstanpy:finish chain 1


In [25]:
samples_nc = az.from_cmdstanpy(samples_nc, posterior_predictive=['x_ppc', 'y_ppc'], log_likelihood='log_lik')

bokeh.io.show(
    bebi103.viz.corner(samples_nc, var_names=['theta[0]', 'theta[1]', 'rho', 'tau[0]', 'tau[1]'])
)

In [26]:
sm_prior_pred = cmdstanpy.CmdStanModel(stan_file='sample_model_prior_predictive.stan')

INFO:cmdstanpy:found newer exe file, not recompiling
INFO:cmdstanpy:compiled model file: /Users/bois/Dropbox/git/bebi103/doc/user_guide/sample_model_prior_predictive


In [27]:
samples_prior_pred = sm_prior_pred.sample(
    data=data, fixed_param=True, iter_sampling=1000
)

INFO:cmdstanpy:start chain 1
INFO:cmdstanpy:finish chain 1


In [28]:
samples_prior_pred = az.from_cmdstanpy(prior=samples_prior_pred, prior_predictive=['x', 'y'])

In [29]:
samples_prior_pred

In [44]:
try:
    df_sbc = pd.read_csv('sbc_results.csv')
except FileNotFoundError:
    df_sbc = bebi103.stan.sbc(
        prior_predictive_model=sm_prior_pred,
        posterior_model=sm_nc,
        prior_predictive_model_data=data,
        posterior_model_data=data,
        measured_data=['x', 'y'],
        posterior_predictive_var_names=['x_ppc', 'y_ppc'],
        log_likelihood_var_name='log_lik',
        measured_data_dtypes={'x': float, 'y': float},
        sampling_kwargs={'max_treedepth': 15, 'adapt_delta': 0.99},
        diagnostic_check_kwargs={'omit': ['Tau.*', 'Sigma.*']},
        cores=4,
        N=4,
        progress_bar=True,
    )
    
    df_sbc.to_csv('sbc_results.csv', index=False)

100%|██████████| 4/4 [06:22<00:00, 95.73s/it] 
