Variable Prioritization for Black Box Methods via RelATive cEntrality (RATE)

Our ability to build good predictive models has, in many cases, outstripped our ability to extract interpretable information about the relevance of the input covariates being used. The central aim of Crawford et al. (2018) is to assess variable importance after having fit a nonlinear or nonparametric (Bayesian) model. In this work, we propose a new "RelATive cEntrality" (RATE) measure as an interpretable way to summarize the importance of covariates. By assessing entropy in the joint posterior distribution via Kullback-Leibler divergence (KLD), we can correctly prioritize candidate variables that are not just marginally important, but also those whose associations stem from a significant covarying relationship with other variables in the data. We demonstrate our proposed approach in the context of statistical genetics, where the discovery of variants that are involved in nonlinear interactions is of particular interest. In this repository, we focus on illustrating RATE through Gaussian process (GP) regression, although the methodological innovations can easily be applied to other machine learning-type methods such as Bayesian kernel ridge (BKR) regression or (deep) neural networks. It is well known that nonlinear models often exhibit greater predictive accuracy than linear models, particularly for outcomes generated by complex data architectures. With simulations and real data examples, we show that applying RATE enables an explanation for this improved performance.
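To make the idea concrete, the measure can be sketched as follows. The notation is assumed here for illustration and should be checked against Crawford et al. (2018): beta = (beta_1, ..., beta_p) denotes the effect size analogues of the p covariates, beta_{-j} is that vector with the j-th entry removed, and all distributions are posteriors given the data.

```latex
\mathrm{KLD}(\beta_j) = \mathrm{KL}\big( p(\beta_{-j}) \,\|\, p(\beta_{-j} \mid \beta_j = 0) \big),
\qquad
\mathrm{RATE}(\beta_j) = \frac{\mathrm{KLD}(\beta_j)}{\sum_{l=1}^{p} \mathrm{KLD}(\beta_l)}
```

Under this normalization the RATE values are nonnegative and sum to one, so a variable whose RATE exceeds 1/p is more central than it would be if all p variables were equally important.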

RATE is implemented as a set of parallelizable routines, which can be carried out within an R environment. Supplementary Material for Crawford et al. (2018) can be found on our lab website.

R Packages Required for RATE

The RATE software requires the following R packages to be installed:

BAKR (via GitHub)

Unless stated otherwise, the easiest method to install many of these packages is with the following example command entered in an R shell:

install.packages("corpcor", dependencies = TRUE)

Alternatively, one can also install R packages from the command line.
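Before installing anything, it can help to check which packages are already available. The following is a minimal sketch: the helper name missing_packages is hypothetical, and the package vector contains only corpcor because that is the only CRAN package named in this README. BAKR is installed from GitHub as noted above and is not handled here.

```r
# Return the subset of `pkgs` that cannot be loaded in this R installation.
# `missing_packages` is an illustrative helper, not part of the RATE software.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Only "corpcor" is named in this README; extend the vector as needed.
to_install <- missing_packages(c("corpcor"))

if (length(to_install) > 0) {
  # Print the command rather than running it, so nothing is installed silently.
  message("Run: install.packages(c(", toString(shQuote(to_install)),
          "), dependencies = TRUE)")
} else {
  message("All listed packages are already installed.")
}
```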

C++ Functions Required for GP Regression

The code in this repository assumes that a basic C++ toolchain is already set up on the personal computer or cluster where it is run. If not, the functions and necessary Rcpp packages used to build nonlinear covariance matrices (e.g., BAKR) and fit a GP regression model will not work properly. A simple option is to use gcc. macOS users may install this compiler collection through the Homebrew package manager by typing the following into the terminal:

brew install gcc

Alternatively, for macOS users, the Xcode Command Line Tools provide a C/C++ compiler (Apple Clang, which is also what the gcc command invokes). Instructions on how to install Xcode may be found here. For extra tips on how to run C++ on macOS, please visit here. For tips on how to avoid errors dealing with "-lgfortran" or "-lquadmath", please visit here.
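A quick way to see whether the compilers are visible from within R is sketched below. The helper name check_toolchain and the list of tools are assumptions made for illustration. gfortran is included because of the "-lgfortran" errors mentioned above, and make is included because package compilation commonly relies on it.

```r
# Report which build tools R can find on the current PATH.
# `check_toolchain` is an illustrative helper, not part of the RATE software.
check_toolchain <- function(tools = c("gcc", "g++", "gfortran", "make")) {
  paths <- Sys.which(tools)  # returns "" for any tool that is not found
  data.frame(
    tool = tools,
    path = unname(paths),
    available = unname(nzchar(paths)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

print(check_toolchain())
```

A tool reported as unavailable here is a likely cause of compilation failures when the Rcpp-based packages are installed.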

Demonstrations and Tutorials for Running RATE

We provide a few example scripts that demonstrate how to conduct variable selection in nonlinear models with RATE measures. Here, we consider a simple (and small) genetics example where we simulate genotype data for n individuals with p measured genetic variants. We then randomly select a small number of these predictor variables to be causal and have a true association with the generated (continuous) phenotype. These scripts are meant to serve as a proof of concept and specifically walk through: (1) how to compute a covariance matrix using the Gaussian kernel function; (2) how to fit a standard Bayesian Gaussian process (GP) regression model; and (3) how to prioritize variables via their first, second, third, and fourth order distributional centrality.
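The three steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the repository's tutorial scripts or its RATE function: the simulation design, the kernel bandwidth, the fixed noise variance, and the helper names gauss_kernel and kld_j are all hypothetical, and the GP posterior is computed in closed form rather than by sampling. Only first order centrality is shown.

```r
# Illustrative sketch only: base R, not the repository's RATE or BAKR code.
set.seed(1)

# Simulate a small genotype matrix: n individuals, p variants coded 0/1/2.
n <- 100
p <- 8
X <- matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n, ncol = p)
X <- scale(X)  # center and scale each variant

# Phenotype: variants 1 and 2 are causal and interact (assumed design).
f_true <- X[, 1] + X[, 2] + X[, 1] * X[, 2]
y <- as.numeric(scale(f_true + rnorm(n)))

# (1) Gaussian kernel covariance matrix; the bandwidth 2 * p is an assumption.
gauss_kernel <- function(X, bandwidth) {
  D2 <- as.matrix(dist(X))^2  # squared Euclidean distances between rows
  exp(-D2 / bandwidth)
}
K <- gauss_kernel(X, bandwidth = 2 * p)

# (2) GP regression: f ~ N(0, K), y = f + noise with variance sigma2.
sigma2 <- 1  # treated as known here for simplicity
A <- solve(K + sigma2 * diag(n))
f_mean <- K %*% A %*% y
f_cov <- K - K %*% A %*% K

# Project f onto the covariates to get a Gaussian posterior on the
# effect size analogues: beta = (X'X)^{-1} X' f (valid because n > p here).
X_pinv <- solve(crossprod(X), t(X))
beta_mean <- as.numeric(X_pinv %*% f_mean)
beta_cov <- X_pinv %*% f_cov %*% t(X_pinv)
beta_cov <- (beta_cov + t(beta_cov)) / 2  # enforce exact symmetry

# (3) KL divergence from p(beta_{-j}) to p(beta_{-j} | beta_j = 0),
# using the standard closed form for two multivariate Gaussians.
kld_j <- function(j, mu, Sigma) {
  s_jj <- Sigma[j, j]
  s_mj <- Sigma[-j, j]
  S0 <- Sigma[-j, -j]                  # marginal covariance of beta_{-j}
  S1 <- S0 - tcrossprod(s_mj) / s_jj   # conditional covariance given beta_j
  d <- -s_mj * mu[j] / s_jj            # conditional mean minus marginal mean
  S1_inv <- solve(S1)
  k <- length(mu) - 1
  logdet <- function(M) as.numeric(determinant(M, logarithm = TRUE)$modulus)
  0.5 * (sum(S1_inv * S0) + sum(d * (S1_inv %*% d)) - k +
           logdet(S1) - logdet(S0))
}

kld <- vapply(seq_len(p), kld_j, numeric(1), mu = beta_mean, Sigma = beta_cov)
rate <- kld / sum(kld)  # first order RATE values, which sum to one

# Variables ordered from most to least central.
print(order(rate, decreasing = TRUE))
print(round(rate, 3))
```

The closed-form posterior is used here only to keep the sketch short and dependency free. For realistic numbers of variants, the matrix inversions above become the bottleneck.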

Relevant Citations

L. Crawford, S.R. Flaxman, D.E. Runcie, and M. West (2018). Variable prioritization in nonlinear black box methods: a genetic association case study. Annals of Applied Statistics. In Press.

Questions and Feedback

For questions or concerns with the RATE functions, please contact Lorin Crawford.

We appreciate any feedback you may have on our repository and instructions.
