Skip to content

mgoplerud/pfvi_glmm_replication

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Replication Data For "Partially Factorized Variational Inference for High-Dimensional Mixed Models"

Version 2.0; 30 November 2024

Authors: Max Goplerud, Omiros Papaspiliopoulos, Giacomo Zanella

Overview

This README.md explains the replication archive for the analyses conducted in our paper. To use PF-VI for your own data, please skip to the "Software" section for installation instructions.

File Listing

The files contained in this archive are listed below. Their usage is described in more detail in the following sections. To simply re-create all of the figures in the paper and examine the output of the analyses, please run the code/create_figures.R script.

  • build_gg: The replicator should add the raw data from the data.zip from Ghitza and Gelman's (2013) replication archive (https://doi.org/10.7910/DVN/PZAOO6) to this folder. This is required to run the prepare_GG.R script that creates the data/gg_data.RDS object that is needed for all scripts involving the Ghitza-Gelman data.
  • code: This contains the scripts needed to replicate the analyses.
    • aux_functions.R: Auxiliary functions needed in the simulations, e.g. calculating the accuracy of the variational approximation.
    • bash_condaenv.sh: Create conda environment to call Python from R.
    • binomial_2K.stan and linear_2K.stan: STAN models used for simulations
    • collect_sims.R: Pulls together the simulation results into a single .RDS file.
    • create_figures.R: This will create all tables and figures used in the manuscript. It relies on saved versions of the simulation and analysis output that were run on a high-performance computing environment.
    • gen_data_PSFZ.py: Python script used to sample from the collapsed Gibbs sampler from Papaspiliopoulos, Stumpf-Fetizon, and Zanella (2023) in the simulations. It is lightly adapted from https://github.com/timsf/crossed-effects
    • log_figures.txt: This contains the output from R when create_figures.R was run.
    • model_list.txt: This contains the combinations of years (1-2) and models (1-9) for the Ghitza-Gelman analysis.
    • prepare_GG.R: This will create the relevant data needed for the Ghitza-Gelman analysis. This script directly taken from an author's previous work (Goplerud; https://doi.org/10.7910/DVN/DI19IB)
    • run_simulations.R: This script will run the simulations; see "Simulation" section for more information about its usage.
    • run_stan_gg.R: This script will estimate the HMC models for the Ghitza-Gelman analysis; see "Ghitza-Gelman Analysis" section for more information about its usage.
    • run_stan_pathfinder.R: This script will estimate the ADVI/Pathfinder models for the Ghitza-Gelman analysis; see "Ghitza-Gelman Analysis" section for more information about its usage.
    • run_vglmer_gg.R: This script will estimate the VI models for the Ghitza-Gelman analysis; see "Ghitza-Gelman Analysis" section for information about its usage.
    • simulation_list: This contains the parameter configuration needed for the simulations, as well as the amount of memory and disk space requested.
    • summarize_gg_acc_advi.R: Estimates accuracy from ADVI/Pathfinder models; see "Ghitza-Gelman Analysis" section for more information about its usage.
    • summarize_gg_mrp.R: Estimates the post-stratified quantities and the linear predictor for each type of respondent; see "Ghitza-Gelman Analysis" section for more information about its usage.
    • summarize_gg_uqf.R: Estimates the UQF from the Ghitza-Gelman CAVI models; see "Ghitza-Gelman Analysis" section for more information about its usage.
  • crossed_effects_python: This contains Python code to implement the samplers outlined in Papaspiliopoulos, Stumpf-Fetizon, and Zanella (2023); it is downloaded from here https://github.com/timsf/crossed-effects
  • data: This contains data needed for the analyses. It contains a number of files.
    • formula_list.RDS: A list of formula for the Ghitza-Gelman models.
    • gg_data.RDS, gg_poststrat.RDS and gg_statevalues.RDS: These are needed for the Ghitza-Gelman analyses; they are not provided with the replication data but can be created as discussed above using the prepare_GG.R script.
    • gg_table_description.xlsx: This contains the text used in Table 1
  • figures: This contains the tables and figures in the manuscript.
    • gg_table.tex can only be produced if output_vglmer contains all of the estimated models. GitHub contains the pre-produced version.
  • output: This contains the output of the simulations and the Ghitza-Gelman analysis.
    • all_simulations.RDS: This contains the results for all of the simulations.
    • acc_advi.RDS: This contains the accuracy of estimates using Pathfinder and ADVI for the Ghitza-Gelman analysis.
    • output_advi: This contains the ADVI/Pathfinder models estimating using cmdstanr. These are too large to include on GitHub, and thus the folder is empty, but can be accessed and reproduced as noted below.
    • output_mrp: This contains the state-level post-stratified estimates and the linear predictor estimates for the Ghitza-Gelman analyses.
    • output_uqf: This contains the estimated UQF for the Ghitza-Gelman analyses.
    • output_simulation: This contains the results for each simulation. These are too large to included on GitHub, and thus the folder is empty, but can be accessed and reproduced as noted below.
    • output_stan: This contains the HMC models fit using Stan for the Ghitza-Gelman analyses. These are too large to include on GitHub, and thus the folder is empty, but can be accessed and reproduced as noted below.
    • output_vglmer: This contains many different variational models estimated for the Ghitza-Gelman analyses with different factorization schemes. These are too large to include on GitHub. The summarized information on each model is included in vglmer_fit_[0-9]_[0-9].RDS.

Software

The software used in this analysis is a branch of the vglmer package that is hosted on GitHub at https://github.com/mgoplerud/vglmer/tree/collapsed. This corresponds to the collapsed branch of the vglmer package. It can be installed in R as follows:

remotes::install_github('mgoplerud/vglmer', ref = 'collapsed')

This branch will eventually be integrated into the main vglmer branch and then put onto CRAN. The specific commit used for the analysis was b9a1112.

The simulations and some scripts need a version of cmdstanr installed. The scripts use the following path; this would likely need to be adapted depending on one's specific environment.

set_cmdstan_path('/cmdstan/cmdstan-2.35.0/')

The following packages and package versions were used in some part of the numerical analyses. The additional package ggpubr is needed for creating the tables and figures in code/create_figures.R

package_list <- c("brms", "caret", "cmdstanr", "dplyr", "glue", "KernSmooth", "lme4", "Matrix", "purrr", "readxl", "reshape2", "reticulate", "RSpectra", "rstan", "scales", "stringr", "tidyverse", "vglmer")

An abbreviated version of the accompanying sessionInfo() is shown below:

R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: Ubuntu 22.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] vglmer_0.1.0        lubridate_1.9.3     forcats_1.0.0
 [4] readr_2.1.5         tidyr_1.3.1         tibble_3.2.1
 [7] tidyverse_2.0.0     stringr_1.5.1       scales_1.3.0
[10] rstan_2.32.6        StanHeaders_2.32.10 RSpectra_0.16-2
[13] reticulate_1.39.0   reshape2_1.4.4      readxl_1.4.3
[16] purrr_1.0.2         lme4_1.1-35.5       Matrix_1.7-0
[19] KernSmooth_2.23-24  glue_1.7.0          dplyr_1.1.4
[22] cmdstanr_0.8.1      caret_6.0-94        lattice_0.22-6
[25] ggplot2_3.5.1       brms_2.22.0         Rcpp_1.0.13

Creating Tables and Figures

To create the tables and figures in the manuscript, you may run the following script that pulls together the output from the underlying simulations that has been collected into a small number of files.

Rscript --vanilla code/create_figures.R figures

If you are interested in running the simulations or Ghitza-Gelman data analysis directly, please see the following sections.

Simulations

First, to run the simulations, you must create a conda environment to call the Python scripts. This can be done as follows or adapted for your specific high-performance computing environment:

bash code/bash_condaenv.sh

We use a high-performance computing (HPC) environment to parallelize the simulations. An individual simulation can be run using the following R command where the arguments indicate, respectively: the type of model to be run ("linear", i.e., Gaussian, or "binomial"), the simulation number, and the size of the model (i.e., the number of levels of the random effect---see run_simulations.R for exactly how this argument maps onto the number of levels). The amount of memory requested on the HPC can be found in simulation_list.txt for each combination of "family" and "size".

Rscript --vanilla code/run_simulations.R linear 1 1

Please note that one would need to adjust the paths of the output to adapt to one's specific HPC; our script redirects them into the output_simulation folder. These individual files were put into a single object (output/all_simulations.RDS) by running Rscript --vanilla code/collect_sims.R.

The relevant tables and figures are created as discussed in the "Creating Tables and Figures" section.

Ghitza-Gelman Analysis

We compare our results against an MCMC baseline estimated using STAN. The models are provided with the replication code (in output_stan) but can be re-estimated, if desired, as follows. One must create the Ghitza-Gelman data using the steps noted above to download the data and then run Rscript --vanilla code/prepare_GG.R. To run the models, one can run, where the first argument is the year (1 or 2), the second is the model (1-9), the third must say binomial, and the fourth is the number of cores (4). The final argument is 4 in all simulations as

Rscript --vanilla code/run_stan_gg.R 1 1 binomial 4

The variational models are run using the run_vglmer_gg.R script as follows, where the first argument is the year (1 or 2) and the second is the model (1-9).

Rscript --vanilla code/run_vglmer_gg.R 1 1

Please note that one would need to adjust the paths of the output to adapt to one's specific HPC as, at present, the assume this done via the submission scripts (e.g., SLURM and/or HTCondor).

Additional quantities after model estimation are computing, including the UQF and the accuracy of state-level predictions---see the paper for more details. These can be estimated using the following three scripts, with appropriate adjustments to the file path if one does this locally. Note that these all require files that are too large to put on the GitHub and must be manually recomputed. In the first two scripts, the first argument is the year (1 or 2) and the second is the model (1-9).

# To estimate the UQF
Rscript --vanilla code/summarize_gg_uqf.R 1 1
# To estimate the MRP quantities and compute accuracy
Rscript --vanilla code/summarize_gg_mrp.R 1 1
# To estimate the MRP quantities and compute accuracy for ADVI models
Rscript --vanilla summarize_gg_acc_advi.R

The HPC maps this output into the output_uqf and output_mrp folders for the first two scripts and into acc_advi.RDS for the third. This is used in the "Creating Tables and Figures" section.

About

Replication Data for: Partially factorized variational inference for high-dimensional mixed models

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors