Version 2.0; 30 November 2024
Authors: Max Goplerud, Omiros Papaspiliopoulos, Giacomo Zanella
This README.md explains the replication archive for the analyses conducted in our paper. To use PF-VI for your own data, please skip to the "Software" section for installation instructions.
The files contained in this archive are listed below. Their usage is described in more detail in the following sections. To simply re-create all of the figures in the paper and examine the output of the analyses, please run the code/create_figures.R script.
build_gg: The replicator should add the raw data from thedata.zipfrom Ghitza and Gelman's (2013) replication archive (https://doi.org/10.7910/DVN/PZAOO6) to this folder. This is required to run theprepare_GG.Rscript that creates thedata/gg_data.RDSobject that is needed for all scripts involving the Ghitza-Gelman data.code: This contains the scripts needed to replicate the analyses.aux_functions.R: Auxiliary functions needed in the simulations, e.g. calculating the accuracy of the variational approximation.bash_condaenv.sh: Create conda environment to call Python from R.binomial_2K.stanandlinear_2K.stan: STAN models used for simulationscollect_sims.R: Pulls together the simulation results into a single.RDSfile.create_figures.R: This will create all tables and figures used in the manuscript. It relies on saved versions of the simulation and analysis output that were run on a high-performance computing environment.gen_data_PSFZ.py: Python script used to sample from the collapsed Gibbs sampler from Papaspiliopoulos, Stumpf-Fetizon, and Zanella (2023) in the simulations. It is lightly adapted fromhttps://github.com/timsf/crossed-effectslog_figures.txt: This contains the output fromRwhencreate_figures.Rwas run.model_list.txt: This contains the combinations of years (1-2) and models (1-9) for the Ghitza-Gelman analysis.prepare_GG.R: This will create the relevant data needed for the Ghitza-Gelman analysis. This script directly taken from an author's previous work (Goplerud;https://doi.org/10.7910/DVN/DI19IB)run_simulations.R: This script will run the simulations; see "Simulation" section for more information about its usage.run_stan_gg.R: This script will estimate the HMC models for the Ghitza-Gelman analysis; see "Ghitza-Gelman Analysis" section for more information about its usage.run_stan_pathfinder.R: This script will estimate the ADVI/Pathfinder models for the Ghitza-Gelman analysis; see "Ghitza-Gelman Analysis" section for more information about its usage.run_vglmer_gg.R: This script will estimate the VI models for the Ghitza-Gelman analysis; see "Ghitza-Gelman Analysis" section for information about its usage.simulation_list: This contains the parameter configuration needed for the simulations, as well as the amount of memory and disk space requested.summarize_gg_acc_advi.R: Estimates accuracy from ADVI/Pathfinder models; see "Ghitza-Gelman Analysis" section for more information about its usage.summarize_gg_mrp.R: Estimates the post-stratified quantities and the linear predictor for each type of respondent; see "Ghitza-Gelman Analysis" section for more information about its usage.summarize_gg_uqf.R: Estimates the UQF from the Ghitza-Gelman CAVI models; see "Ghitza-Gelman Analysis" section for more information about its usage.
crossed_effects_python: This contains Python code to implement the samplers outlined in Papaspiliopoulos, Stumpf-Fetizon, and Zanella (2023); it is downloaded from herehttps://github.com/timsf/crossed-effectsdata: This contains data needed for the analyses. It contains a number of files.formula_list.RDS: A list of formula for the Ghitza-Gelman models.gg_data.RDS,gg_poststrat.RDSandgg_statevalues.RDS: These are needed for the Ghitza-Gelman analyses; they are not provided with the replication data but can be created as discussed above using theprepare_GG.Rscript.gg_table_description.xlsx: This contains the text used in Table 1
figures: This contains the tables and figures in the manuscript.gg_table.texcan only be produced ifoutput_vglmercontains all of the estimated models. GitHub contains the pre-produced version.
output: This contains the output of the simulations and the Ghitza-Gelman analysis.all_simulations.RDS: This contains the results for all of the simulations.acc_advi.RDS: This contains the accuracy of estimates using Pathfinder and ADVI for the Ghitza-Gelman analysis.output_advi: This contains the ADVI/Pathfinder models estimating usingcmdstanr. These are too large to include on GitHub, and thus the folder is empty, but can be accessed and reproduced as noted below.output_mrp: This contains the state-level post-stratified estimates and the linear predictor estimates for the Ghitza-Gelman analyses.output_uqf: This contains the estimated UQF for the Ghitza-Gelman analyses.output_simulation: This contains the results for each simulation. These are too large to included on GitHub, and thus the folder is empty, but can be accessed and reproduced as noted below.output_stan: This contains the HMC models fit using Stan for the Ghitza-Gelman analyses. These are too large to include on GitHub, and thus the folder is empty, but can be accessed and reproduced as noted below.output_vglmer: This contains many different variational models estimated for the Ghitza-Gelman analyses with different factorization schemes. These are too large to include on GitHub. The summarized information on each model is included invglmer_fit_[0-9]_[0-9].RDS.
The software used in this analysis is a branch of the vglmer package that is hosted on GitHub at https://github.com/mgoplerud/vglmer/tree/collapsed. This corresponds to the collapsed branch of the vglmer package. It can be installed in R as follows:
remotes::install_github('mgoplerud/vglmer', ref = 'collapsed')
This branch will eventually be integrated into the main vglmer branch and then put onto CRAN. The specific commit used for the analysis was b9a1112.
The simulations and some scripts need a version of cmdstanr installed. The scripts use the following path; this would likely need to be adapted depending on one's specific environment.
set_cmdstan_path('/cmdstan/cmdstan-2.35.0/')
The following packages and package versions were used in some part of the numerical analyses. The additional package ggpubr is needed for creating the tables and figures in code/create_figures.R
package_list <- c("brms", "caret", "cmdstanr", "dplyr", "glue", "KernSmooth", "lme4", "Matrix", "purrr", "readxl", "reshape2", "reticulate", "RSpectra", "rstan", "scales", "stringr", "tidyverse", "vglmer")
An abbreviated version of the accompanying sessionInfo() is shown below:
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
time zone: Etc/UTC
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] vglmer_0.1.0 lubridate_1.9.3 forcats_1.0.0
[4] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[7] tidyverse_2.0.0 stringr_1.5.1 scales_1.3.0
[10] rstan_2.32.6 StanHeaders_2.32.10 RSpectra_0.16-2
[13] reticulate_1.39.0 reshape2_1.4.4 readxl_1.4.3
[16] purrr_1.0.2 lme4_1.1-35.5 Matrix_1.7-0
[19] KernSmooth_2.23-24 glue_1.7.0 dplyr_1.1.4
[22] cmdstanr_0.8.1 caret_6.0-94 lattice_0.22-6
[25] ggplot2_3.5.1 brms_2.22.0 Rcpp_1.0.13
To create the tables and figures in the manuscript, you may run the following script that pulls together the output from the underlying simulations that has been collected into a small number of files.
Rscript --vanilla code/create_figures.R figures
If you are interested in running the simulations or Ghitza-Gelman data analysis directly, please see the following sections.
First, to run the simulations, you must create a conda environment to call the Python scripts. This can be done as follows or adapted for your specific high-performance computing environment:
bash code/bash_condaenv.sh
We use a high-performance computing (HPC) environment to parallelize the simulations. An individual simulation can be run using the following R command where the arguments indicate, respectively: the type of model to be run ("linear", i.e., Gaussian, or "binomial"), the simulation number, and the size of the model (i.e., the number of levels of the random effect---see run_simulations.R for exactly how this argument maps onto the number of levels). The amount of memory requested on the HPC can be found in simulation_list.txt for each combination of "family" and "size".
Rscript --vanilla code/run_simulations.R linear 1 1
Please note that one would need to adjust the paths of the output to adapt to one's specific HPC; our script redirects them into the output_simulation folder. These individual files were put into a single object (output/all_simulations.RDS) by running Rscript --vanilla code/collect_sims.R.
The relevant tables and figures are created as discussed in the "Creating Tables and Figures" section.
We compare our results against an MCMC baseline estimated using STAN. The models are provided with the replication code (in output_stan) but can be re-estimated, if desired, as follows. One must create the Ghitza-Gelman data using the steps noted above to download the data and then run Rscript --vanilla code/prepare_GG.R. To run the models, one can run, where the first argument is the year (1 or 2), the second is the model (1-9), the third must say binomial, and the fourth is the number of cores (4). The final argument is 4 in all simulations as
Rscript --vanilla code/run_stan_gg.R 1 1 binomial 4
The variational models are run using the run_vglmer_gg.R script as follows, where the first argument is the year (1 or 2) and the second is the model (1-9).
Rscript --vanilla code/run_vglmer_gg.R 1 1
Please note that one would need to adjust the paths of the output to adapt to one's specific HPC as, at present, the assume this done via the submission scripts (e.g., SLURM and/or HTCondor).
Additional quantities after model estimation are computing, including the UQF and the accuracy of state-level predictions---see the paper for more details. These can be estimated using the following three scripts, with appropriate adjustments to the file path if one does this locally. Note that these all require files that are too large to put on the GitHub and must be manually recomputed. In the first two scripts, the first argument is the year (1 or 2) and the second is the model (1-9).
# To estimate the UQF
Rscript --vanilla code/summarize_gg_uqf.R 1 1
# To estimate the MRP quantities and compute accuracy
Rscript --vanilla code/summarize_gg_mrp.R 1 1
# To estimate the MRP quantities and compute accuracy for ADVI models
Rscript --vanilla summarize_gg_acc_advi.R
The HPC maps this output into the output_uqf and output_mrp folders for the first two scripts and into acc_advi.RDS for the third. This is used in the "Creating Tables and Figures" section.