<a href="https://colab.research.google.com/github/kayla-jackson/spatial-modeling/blob/test-exploratory-sims/notebooks/car_stan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [5]:
# Colab setup ------------------
import os, sys, subprocess
if "google.colab" in sys.modules:
    cmd = "pip install --upgrade iqplot colorcet bebi103 arviz cmdstanpy"
    process = subprocess.Popen(cmd.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = process.communicate()
    import cmdstanpy; cmdstanpy.install_cmdstan()
# ------------------------------

Installing CmdStan version: 2.33.1
Install directory: /root/.cmdstan
Downloading CmdStan version 2.33.1
Download successful, file: /tmp/tmpdcik2kaa
Extracting distribution


DEBUG:cmdstanpy:cmd: make build -j1
cwd: None


Unpacked download as cmdstan-2.33.1
Building version cmdstan-2.33.1, may take several minutes, depending on your system.


DEBUG:cmdstanpy:cmd: make examples/bernoulli/bernoulli
cwd: None


Test model compilation
Installed cmdstan-2.33.1


In [None]:
# Clone github repo
!git clone --branch test-exploratory-sims https://github.com/kayla-jackson/spatial-modeling.git

In [7]:
# Setup directories
repo_dir = "./spatial-modeling/"
data_dir = os.path.join(repo_dir, "data")
stan_dir = os.path.join(repo_dir, "inst/stan")

# Load libraries

In [22]:
import numpy as np
import pandas as pd
import scipy.special
import scipy.stats as st

import cmdstanpy
import arviz as az

# plotting packages
import iqplot
import bokeh.io
import bokeh.layouts
import bokeh.plotting
bokeh.io.output_notebook()

# Parameter Estimation with Markov Chain Monte Carlo (MCMC)

The code below demonstrates how to estimate parameters using MCMC. The model is defined in the file `inst/stan/car_prob.stan`.

First, we have to compile the model.

In [None]:
fn = os.path.join(stan_dir, "car_prob.stan")

sm = cmdstanpy.CmdStanModel(stan_file=fn)

Next, we specify the data and let Stan do the rest of the work. The parameters that you need to specify are as follows:


* `count_n`: The number of points you will use to estimate parameters.
* `counts`: The data you will use for fitting. Should be an array of size `count_n`.

* `rates`: The probability of capture at each grid point. Each value should be between $0$ and $1$. Should be an array of size `count_n`.


* `bs_mu`: For now, set to 0
* `mu_mu`: For now, set to 0

* `bs_var`: For now, set to 1
* `mu_var`: For now, set to 5
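
Before handing the dictionary to Stan, it can help to check the size and range rules listed above in plain Python. The helper below is a minimal sketch: `check_car_data` is a hypothetical name that does not exist in the repository, and treating the bounds on `rates` as inclusive and `counts` as non-negative are assumptions, not something the Stan model is known to require.

```python
def check_car_data(data):
    """Raise ValueError if the data dict breaks the rules listed above."""
    n = data["count_n"]
    counts = list(data["counts"])  # works for lists and NumPy arrays
    rates = list(data["rates"])

    if len(counts) != n:
        raise ValueError(f"counts has length {len(counts)}, expected {n}")
    if len(rates) != n:
        raise ValueError(f"rates has length {len(rates)}, expected {n}")
    # Assumption: counts are non-negative
    if any(c < 0 for c in counts):
        raise ValueError("counts must be non-negative")
    # Assumption: the bounds 0 and 1 are inclusive
    if any(not 0 <= r <= 1 for r in rates):
        raise ValueError("rates must lie between 0 and 1")
    return True


example = dict(
    count_n=4,
    counts=[10, 30, 15, 12],
    rates=[0.1, 0.2, 0.2, 0.35],
    bs_mu=0,
    bs_var=1,
    mu_mu=0,
    mu_var=5,
)
check_car_data(example)
```

Calling the check right before `sm.sample` turns a shape mismatch into a readable Python error instead of a Stan input error.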



In [17]:
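# Let's say I have a 2x2 grid (4 cells). This is just fake data.

# Adjacency matrix for the 4 cells, numbered row by row
# (defined for reference; it is not passed to the model below)
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
])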

data = dict(
    count_n=4,
    counts=np.array([10, 30, 15, 12]),
    rates=np.array([0.1, 0.2, 0.2, 0.35]),

    bs_mu=0,
    bs_var=1,
    mu_mu=0,
    mu_var=5,
)
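
The 16 entries of `W` above, read row by row as a 4×4 matrix, mark which cells of the 2x2 grid share an edge. For larger grids, writing this out by hand gets tedious, so here is a minimal sketch that builds the same kind of matrix. `grid_adjacency` is a hypothetical helper, and it assumes edge-sharing ("rook") neighbors with cells numbered row by row, which is what the values in `W` imply.

```python
import numpy as np


def grid_adjacency(n_rows, n_cols):
    """Rook-adjacency matrix for an n_rows x n_cols grid, cells numbered row by row."""
    n = n_rows * n_cols
    W = np.zeros((n, n), dtype=int)
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if c + 1 < n_cols:  # neighbor to the right
                W[i, i + 1] = W[i + 1, i] = 1
            if r + 1 < n_rows:  # neighbor below
                W[i, i + n_cols] = W[i + n_cols, i] = 1
    return W


W_2x2 = grid_adjacency(2, 2)
```

For a 2x2 grid this reproduces the matrix written out by hand above.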

In [None]:
# Perform sampling with Stan
samples = sm.sample(
    data=data,
    chains=4,
    iter_sampling=1000,
)

# Convert to ArviZ InferenceData instance
samples = az.from_cmdstanpy(posterior=samples)

Take a quick look at the posterior distribution.

In [None]:
samples.posterior

In [20]:
# Convert posterior samples to dataframe for plotting
df_mcmc = samples.posterior.to_dataframe()

# Take a look
df_mcmc.head()

| chain | draw | mu | log_bs | burst_size |
|------:|-----:|--------:|---------:|-----------:|
| 0 | 0 | 7.58786 | 0.605529 | 4.03208 |
| 0 | 1 | 6.73069 | 0.530424 | 3.39175 |
| 0 | 2 | 7.10242 | 0.520057 | 3.31174 |
| 0 | 3 | 7.85359 | 0.689744 | 4.8949 |
| 0 | 4 | 7.5388 | 0.549471 | 3.54382 |


In [None]:
# Some plots of the parameters
plots = [
    iqplot.histogram(df_mcmc, q=param, rug=False)
    for param in ["mu", "burst_size"]
]

bokeh.io.show(bokeh.layouts.gridplot(plots, ncols=2))

Isolate the parameters as NumPy arrays.

In [30]:
mu_vals = df_mcmc.mu.values.flatten()
bs_vals = df_mcmc.burst_size.values.flatten()

The vector `mu_vals` (or `bs_vals`) contains many posterior draws of the parameter `mu` (or `burst_size`), but how do you know which one to pick? You can simply take the mean of the vector (e.g. `np.mean(mu_vals)`) and use that as the best estimate, but if the distribution is skewed, as it is for burst size, the median or another quantile may be a better summary than the mean.

You can explore this by plotting the draws and by looking at their quantiles.
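
To see why the choice matters, here is a minimal sketch that compares the mean and the median on right-skewed draws. The draws are synthetic lognormal numbers standing in for posterior samples, not output from the Stan model, and `summarize` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed draws standing in for posterior samples (not model output)
draws = rng.lognormal(mean=1.0, sigma=0.75, size=4000)


def summarize(draws):
    """Return the mean, the median, and a central 95% interval of the draws."""
    lower, median, upper = np.quantile(draws, [0.025, 0.5, 0.975])
    return {
        "mean": float(np.mean(draws)),
        "median": float(median),
        "lower": float(lower),
        "upper": float(upper),
    }


summary = summarize(draws)
print(summary)
```

With a long right tail, the mean is pulled toward the large values and sits above the median. The same function can be applied to `mu_vals` and `bs_vals`.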

In [None]:
# Minimum, 2.5%, median, 97.5%, and maximum of the draws
# np.quantile(mu_vals, [0, 0.025, 0.5, 0.975, 1])
np.quantile(bs_vals, [0, 0.025, 0.5, 0.975, 1])

In [33]:
# save the parameter estimates for the current "rates"
best_mu = np.mean(mu_vals)
best_bs = np.quantile(bs_vals, 0.5)

Repeat the steps above, changing the input `rates` at each iteration. Keep track of the `best_mu` and `best_bs`. How do these values change for each new set of rates?
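
One way to organize the repetition is a small loop that swaps in each set of rates and records the estimates. The sketch below is only an outline under stated assumptions: `scan_rates`, `dummy_fit`, and `rate_sets` are hypothetical names, and `dummy_fit` is a stand-in that returns placeholder numbers so the loop runs without Stan.

```python
def scan_rates(rate_sets, base_data, fit_fn):
    """Fit once per rate vector and collect the point estimates.

    fit_fn(data) must return a (best_mu, best_bs) pair.
    """
    results = []
    for rates in rate_sets:
        data = dict(base_data, rates=rates)  # copy, replacing only `rates`
        best_mu, best_bs = fit_fn(data)
        results.append(
            {"rates": list(rates), "best_mu": best_mu, "best_bs": best_bs}
        )
    return results


# Stand-in fit function so the sketch runs without Stan.
# It returns placeholder numbers, not real estimates.
def dummy_fit(data):
    return (sum(data["rates"]), len(data["rates"]))


base = dict(count_n=4, counts=[10, 30, 15, 12], bs_mu=0, bs_var=1, mu_mu=0, mu_var=5)
rate_sets = [[0.1, 0.2, 0.2, 0.35], [0.5, 0.5, 0.5, 0.5]]
results = scan_rates(rate_sets, base, dummy_fit)
```

In the notebook, `fit_fn` would wrap the steps shown earlier: call `sm.sample` with the data, convert the result with `az.from_cmdstanpy`, and return `np.mean(mu_vals)` and `np.quantile(bs_vals, 0.5)`. Passing `results` to `pd.DataFrame` gives one row per set of rates for comparison.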