<a href="https://colab.research.google.com/github/kayla-jackson/spatial-modeling/blob/test-exploratory-sims/notebooks/mcmc_simulation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet bebi103 arviz cmdstanpy"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
# ------------------------------

CmdStan install directory: /root/.cmdstan
Installing CmdStan version: 2.33.1
Downloading CmdStan version 2.33.1
Download successful, file: /tmp/tmps04zhdu4
Extracting distribution


DEBUG:cmdstanpy:cmd: make build -j1
cwd: None


Unpacked download as cmdstan-2.33.1
Building version cmdstan-2.33.1, may take several minutes, depending on your system.


DEBUG:cmdstanpy:cmd: make examples/bernoulli/bernoulli
cwd: None


Installed cmdstan-2.33.1
Test model compilation


In [2]:
# Clone github repo
!git clone --branch test-exploratory-sims https://github.com/kayla-jackson/spatial-modeling.git

Cloning into 'spatial-modeling'...
remote: Enumerating objects: 98, done.
remote: Counting objects: 100% (98/98), done.
remote: Compressing objects: 100% (74/74), done.
remote: Total 98 (delta 39), reused 61 (delta 18), pack-reused 0
Receiving objects: 100% (98/98), 5.82 MiB | 5.84 MiB/s, done.
Resolving deltas: 100% (39/39), done.


In [3]:
# Setup directories
repo_dir = "./spatial-modeling/"
data_dir = os.path.join(repo_dir, "data")
stan_dir = os.path.join(repo_dir, "inst/stan")

# Load libraries

In [4]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import cmdstanpy
import arviz as az

# plotting packages
import iqplot
import bokeh.io
import bokeh.layouts  # used later for gridplot
import bokeh.plotting
bokeh.io.output_notebook()

# Parameter Estimation with Markov Chain Monte Carlo (MCMC)

The code below demonstrates how to estimate parameters using MCMC. The model is defined in the file `inst/stan/car_prob.stan`.

First, we have to compile the model.

In [5]:
fn = os.path.join(stan_dir, "car_prob.stan")

sm = cmdstanpy.CmdStanModel(stan_file=fn)

06:26:12 - cmdstanpy - INFO - compiling stan file /content/spatial-modeling/inst/stan/car_prob.stan to exe file /content/spatial-modeling/inst/stan/car_prob
INFO:cmdstanpy:compiling stan file /content/spatial-modeling/inst/stan/car_prob.stan to exe file /content/spatial-modeling/inst/stan/car_prob
DEBUG:cmdstanpy:cmd: make STANCFLAGS+=--filename-in-msg=car_prob.stan /content/spatial-modeling/inst/stan/car_prob
cwd: /root/.cmdstan/cmdstan-2.33.1
DEBUG:cmdstanpy:Console output:

--- Translating Stan model to C++ code ---
bin/stanc --filename-in-msg=car_prob.stan --o=/content/spatial-modeling/inst/stan/car_prob.hpp /content/spatial-modeling/inst/stan/car_prob.stan

--- Compiling, linking C++ code ---
g++ -std=c++1y -pthread -D_REENTRANT -Wno-sign-compare -Wno-ignored-attributes      -I stan/lib/stan_math/lib/tbb_2020.3/include    -O3 -I src -I stan/src -I stan/lib/rapidjson_1.1.0/ -I lib/CLI11-1.9.1/ -I stan/lib/stan_math/ -I stan/lib/stan_math/lib/eigen_3.4.0 -I stan/lib/stan_math/lib/bo

Next, we specify the data and let Stan do the rest of the work. The parameters that you need to specify are as follows:

* `count_n`: The number of points used to estimate the parameters.
* `counts`: The data used for fitting. Should be an array of size `count_n`.
* `rates`: The probability of capture at each grid point. Should be an array of size `count_n`, with every value between $0$ and $1$.
* `bs_mu`: For now, set to 0.
* `mu_mu`: For now, set to 0.
* `bs_var`: For now, set to 1.
* `mu_var`: For now, set to 5.
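Before passing the data to Stan, it can help to confirm that the inputs match the description above. The following is a minimal sketch, not part of the repository: `check_inputs` is a hypothetical helper, and it assumes that counts are non-negative.

```python
import numpy as np

def check_inputs(counts, rates):
    """Hypothetical helper: validate `counts` and `rates` before sampling."""
    counts = np.asarray(counts)
    rates = np.asarray(rates, dtype=float)
    if counts.ndim != 1 or rates.ndim != 1:
        raise ValueError("counts and rates must be 1-D arrays")
    if counts.shape != rates.shape:
        raise ValueError("counts and rates must have the same length (count_n)")
    if np.any((rates < 0) | (rates > 1)):
        raise ValueError("rates must lie between 0 and 1")
    if np.any(counts < 0):
        raise ValueError("counts must be non-negative (assumption)")
    return counts.size  # the value to pass as count_n

# Toy example with made-up numbers
count_n = check_inputs([3, 0, 7, 2], [0.1, 0.5, 0.5, 0.9])
```

The returned length can be used for `count_n` so that it always agrees with the arrays.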



In [23]:
# Importing the data - 100 genes, 64 valid positions

df = pd.read_csv(os.path.join(data_dir, 'counts.csv'))

df = df[~df['border']] # Filter out locations where border = True

# Create counts array
gene_labels = [f'gene_{i}' for i in range(1, 101)]
gene_counts = df.iloc[:, -100:][gene_labels].values.flatten()

true_p = np.ravel([[p]*100 for p in df['theta'].values])

'''
W = np.array([
    0, 1, 1, 0,
    1, 0, 0, 1,
    1, 0, 0, 1,
    0, 1, 1, 0
])
'''

data = dict(
    count_n=len(gene_counts),  # 6400 = 64 positions x 100 genes
    counts=gene_counts,
    rates=true_p,

    bs_mu=0,
    bs_var=1,
    mu_mu=0,
    mu_var=5,
)
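The cell above relies on `counts` and `rates` lining up element by element: flattening the count matrix row by row lists all genes for the first position, then all genes for the second, so each position's `theta` has to be repeated once per gene. The sketch below illustrates that alignment on a made-up 2-position, 3-gene example. It uses `np.repeat` as an alternative to the list-comprehension form in the cell; both give the same array.

```python
import numpy as np

# Toy stand-in for the real data: 2 positions (rows) x 3 genes (columns).
count_matrix = np.array([[1, 2, 3],
                         [4, 5, 6]])
theta = np.array([0.2, 0.8])  # one capture probability per position
n_genes = count_matrix.shape[1]

# Row-major flatten: all genes at position 0, then all genes at position 1.
counts = count_matrix.flatten()

# Repeat each position's theta once per gene so it aligns with `counts`.
rates = np.repeat(theta, n_genes)

# Same result as the list-comprehension pattern used in the notebook cell.
rates_alt = np.ravel([[p] * n_genes for p in theta])
```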

In [24]:
# Perform sampling with Stan
samples = sm.sample(
    data=data,
    chains=4,
    iter_sampling=1000,
)

# Convert to ArviZ InferenceData instance
samples = az.from_cmdstanpy(posterior=samples)

DEBUG:cmdstanpy:cmd: /content/spatial-modeling/inst/stan/car_prob info
cwd: None
DEBUG:cmdstanpy:input tempfile: /tmp/tmpd__drkl7/34kt2qwl.json
06:50:34 - cmdstanpy - INFO - CmdStan start processing
INFO:cmdstanpy:CmdStan start processing


chain 1 |          | 00:00 Status

chain 2 |          | 00:00 Status

chain 3 |          | 00:00 Status

chain 4 |          | 00:00 Status

DEBUG:cmdstanpy:idx 0
DEBUG:cmdstanpy:running CmdStan, num_threads: 1
DEBUG:cmdstanpy:CmdStan args: ['/content/spatial-modeling/inst/stan/car_prob', 'id=1', 'random', 'seed=8779', 'data', 'file=/tmp/tmpd__drkl7/34kt2qwl.json', 'output', 'file=/tmp/tmpd__drkl7/car_prob8p8nuw3i/car_prob-20231017065034_1.csv', 'method=sample', 'num_samples=1000', 'algorithm=hmc', 'adapt', 'engaged=1']
DEBUG:cmdstanpy:idx 1
DEBUG:cmdstanpy:running CmdStan, num_threads: 1
DEBUG:cmdstanpy:CmdStan args: ['/content/spatial-modeling/inst/stan/car_prob', 'id=2', 'random', 'seed=8779', 'data', 'file=/tmp/tmpd__drkl7/34kt2qwl.json', 'output', 'file=/tmp/tmpd__drkl7/car_prob8p8nuw3i/car_prob-20231017065034_2.csv', 'method=sample', 'num_samples=1000', 'algorithm=hmc', 'adapt', 'engaged=1']
DEBUG:cmdstanpy:idx 2
DEBUG:cmdstanpy:running CmdStan, num_threads: 1
DEBUG:cmdstanpy:CmdStan args: ['/content/spatial-modeling/inst/stan/car_prob', 'id=3', 'random', 'seed=8779', 'data', 'file=/tmp/tmpd__drkl7/34kt2qwl.json', 'ou


06:51:22 - cmdstanpy - INFO - CmdStan done processing.
INFO:cmdstanpy:CmdStan done processing.
DEBUG:cmdstanpy:runset
RunSet: chains=4, chain_ids=[1, 2, 3, 4], num_processes=4
 cmd (chain 1):
	['/content/spatial-modeling/inst/stan/car_prob', 'id=1', 'random', 'seed=8779', 'data', 'file=/tmp/tmpd__drkl7/34kt2qwl.json', 'output', 'file=/tmp/tmpd__drkl7/car_prob8p8nuw3i/car_prob-20231017065034_1.csv', 'method=sample', 'num_samples=1000', 'algorithm=hmc', 'adapt', 'engaged=1']
 retcodes=[0, 0, 0, 0]
 per-chain output files (showing chain 1 only):
 csv_file:
	/tmp/tmpd__drkl7/car_prob8p8nuw3i/car_prob-20231017065034_1.csv
 console_msgs (if any):
	/tmp/tmpd__drkl7/car_prob8p8nuw3i/car_prob-20231017065034_0-stdout.txt
DEBUG:cmdstanpy:Chain 1 console:
method = sample (Default)
  sample
    num_samples = 1000 (Default)
    num_warmup = 1000 (Default)
    save_warmup = 0 (Default)
    thin = 1 (Default)
    adapt
      engaged = 1 (Default)
      gamma = 0.050000000000000003 (Default)
      delt




AttributeError: ignored

Take a quick look at the posterior distribution

In [None]:
samples.posterior

In [None]:
# Convert posterior samples to dataframe for plotting
df_mcmc = samples.posterior.to_dataframe()

# Take a look
df_mcmc.head()

| chain | draw | mu | log_bs | burst_size |
|---|---|---|---|---|
| 0 | 0 | 7.58786 | 0.605529 | 4.03208 |
| 0 | 1 | 6.73069 | 0.530424 | 3.39175 |
| 0 | 2 | 7.10242 | 0.520057 | 3.31174 |
| 0 | 3 | 7.85359 | 0.689744 | 4.8949 |
| 0 | 4 | 7.5388 | 0.549471 | 3.54382 |
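The draws shown above suggest how `log_bs` and `burst_size` are related. The Stan file is not reproduced here, so the following is only a check under an assumption: that `burst_size` equals `10 ** log_bs`. The sketch tests that assumption against the five draws printed above.

```python
import numpy as np

# Values copied from the five draws displayed above.
log_bs = np.array([0.605529, 0.530424, 0.520057, 0.689744, 0.549471])
burst_size = np.array([4.03208, 3.39175, 3.31174, 4.8949, 3.54382])

# Assumption being tested: burst_size is the base-10 exponential of log_bs.
reconstructed = 10.0 ** log_bs

# Largest relative discrepancy between reconstructed and reported values.
max_rel_err = np.max(np.abs(reconstructed - burst_size) / burst_size)
```

A small `max_rel_err` (limited only by the rounding of the printed values) is consistent with the base-10 relationship for these draws.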


In [None]:
# Some plots of the parameters
plots = [
    iqplot.histogram(df_mcmc, q=param, rug=False)
    for param in ["mu", "burst_size"]
]

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Isolate the parameters as NumPy arrays.

In [None]:
mu_vals = df_mcmc.mu.values.flatten()
bs_vals = df_mcmc.burst_size.values.flatten()

The vector `mu_vals` (or `bs_vals`) contains many posterior draws of the parameter `mu` (or `burst_size`), but how do you know which value to report? You can simply take the mean of the vector (e.g. `np.mean(mu_vals)`) and use that as the best estimate, but if the distribution is skewed, as it is for burst size, you may want to use a quantile such as the median instead of the mean.
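To see why the choice matters, the sketch below compares the mean and the median on a synthetic right-skewed sample. The draws are generated from a log-normal distribution purely for illustration; they are not output from the model, and the distribution parameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, right-skewed stand-in for posterior draws (not real model output).
draws = rng.lognormal(mean=1.0, sigma=0.8, size=4000)

mean_est = np.mean(draws)                    # pulled upward by the long right tail
median_est = np.quantile(draws, 0.5)         # robust to the tail
lo, hi = np.quantile(draws, [0.025, 0.975])  # central 95% interval

print(f"mean={mean_est:.2f}  median={median_est:.2f}  95% interval=({lo:.2f}, {hi:.2f})")
```

For a right-skewed sample like this, the mean sits above the median, which is why the median is often preferred as a point estimate for skewed parameters.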

You can explore this by plotting the draws and inspecting their quantiles.

In [None]:
# np.quantile(mu_vals, [0, 0.025, 0.5, 0.95, 1])
np.quantile(bs_vals, [0, 0.025, 0.5, 0.95, 1])

In [None]:
# save the parameter estimates for the current "rates"
best_mu = np.mean(mu_vals)
best_bs = np.quantile(bs_vals, 0.5)

Repeat the steps above, changing the input `rates` at each iteration. Keep track of the `best_mu` and `best_bs`. How do these values change for each new set of rates?
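One way to organize this repetition is a small loop that records the estimates for each set of rates. The sketch below is an assumption about how that bookkeeping could look, not code from the repository: `sweep_rates` and `fit_fn` are hypothetical names, and `fake_fit` is a stand-in that returns made-up draws so the example runs without Stan.

```python
import numpy as np

def sweep_rates(rate_sets, fit_fn):
    """Collect point estimates for each candidate `rates` array.

    `fit_fn` is a hypothetical callable that fits the model for one
    `rates` array and returns (mu_vals, bs_vals) as 1-D arrays of draws.
    """
    results = []
    for label, rates in rate_sets.items():
        mu_vals, bs_vals = fit_fn(rates)
        results.append({
            "label": label,
            "best_mu": float(np.mean(mu_vals)),           # mean, as above
            "best_bs": float(np.quantile(bs_vals, 0.5)),  # median, as above
        })
    return results

def fake_fit(rates):
    """Stand-in for the Stan fit; returns made-up draws for illustration."""
    rng = np.random.default_rng(0)
    scale = 1.0 / np.mean(rates)
    mu_vals = rng.normal(scale, 0.1, 500)
    bs_vals = rng.lognormal(0.0, 0.5, 500) * scale
    return mu_vals, bs_vals

rate_sets = {"low": np.full(10, 0.2), "high": np.full(10, 0.8)}
summary = sweep_rates(rate_sets, fake_fit)
```

In the notebook, `fit_fn` would instead update `data["rates"]`, call `sm.sample`, and return the flattened `mu` and `burst_size` draws as in the cells above.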