-
Notifications
You must be signed in to change notification settings - Fork 92
Tutorial: GTEx v8 MASH models integration with a Coronary Artery Disease GWAS
This article presents a working example of GWAS harmonization and imputation to a reference QTL dataset. This provides a tutorial on integrating the latest GTEX v8, biologically informed MASHR models. These models use effect sizes computed with MASHR, on fine-mapped variables from DAP-G.
Fine-mapping analysis on GTEx v8 have selected many variants with evidence of a causal role in expression or splicing; however many of these variants lack an rsid and are not defined in a typical GWAS. To improve GWAS-QTL integration, we harmonize and impute missing variants' summary statistics from the GWAS to the QTL data set, to leverage the missing fine-mapped variants.
Summary statistics imputation is performed, instead of individual-level imputation, because it can be done with a reference panel of individuals, effectively bypassing hurdles in accessing every possible GWAS' sample data. Summary statistics imputation is also much less computationally intensive than individual-level computation.
In the following, we use QTL and GTEx interchangeably, but the concepts apply to any QTL in general. You can find a conceptual overview of best practices here.
- A UNIX pc, workstation, or computation server
- GWAS tools and its prerequisites
- MetaXcan and its prerequisites
- Sample data set. For ease of use, this dataset includes files already published here.
We suggest using an environment management solution like pyenv or conda.
In the following, python
for python >3.5, in an environment with all required packages.
We list here the packages needed for the imputation scripts.
For an overview of GWAS tools, see here.
In the following:
-
GWAS_TOOLS
is a variable pointing to the path ofsrc
folder in GWAS tools software, assumed to have been cloned from github locally (e.g./home/numa/software/genomic_tools/src
). -
METAXCAN
is a variable pointing to thesoftware
folder in MetaXcan software assumed to have been cloned from github locally (e.g./home/numa/software/metaxcan/software
). -
DATA
is a variable pointing to where you downloaded and uncompressed the sample data set (e.g./home/numa/data/gwas_processing_sample_data
) -
OUTPUT
is a fixed folder where you want to output results (i.e./home/numa/cad_mashr_integration
)
When integrating GWAS and QTL studies with methods like PrediXcan, as stated in the best practices guide, it is generally better to fully harmonize and impute a GWAS' missing summary statistics to the QTL study (GTEx in this case). Once the GWAS is harmonized and imputed, then can integration via S-PrediXcan or S-MultiXcan proceed.
For completeness' sake we will also cover less taxing alternatives like "Quick imputation".
The GWAS data set is a gzipped, tab-separated text file. Its first rows look like:
markername chr bp_hg19 effect_allele noneffect_allele effect_allele_freq median_info model beta se_dgc p_dgc het_pvalue n_studies
rs143225517 1 751756 C T .158264 .92 FIXED .013006 .017324 .4528019 .303481 35
rs3094315 1 752566 A G .763018 1 FIXED -.005243 .0157652 .7394597 .146867 36
...
We'll cover in the following:
- a) Harmonization
- b) Alternative quick and dirty harmonization
- Imputation of missing summary statistics
- a) Running S-PrediXcan on imputed GWAS
- b) Running S-PrediXcan on quick-and-dirty harmonized GWAS
This is the basic bash command for harmonizing.
#!/bin/bash
# Here you could write HPC directives if running on a compute cluster
python $GWAS_TOOLS/gwas_parsing.py \
-gwas_file $DATA/gwas/cad.add.160614.website.txt.gz \
-liftover $DATA/liftover/hg19ToHg38.over.chain.gz \
-snp_reference_metadata $DATA/reference_panel_1000G/variant_metadata.txt.gz METADATA \
-output_column_map markername variant_id \
-output_column_map noneffect_allele non_effect_allele \
-output_column_map effect_allele effect_allele \
-output_column_map beta effect_size \
-output_column_map p_dgc pvalue \
-output_column_map chr chromosome \
--chromosome_format \
-output_column_map bp_hg19 position \
-output_column_map effect_allele_freq frequency \
--insert_value sample_size 184305 --insert_value n_cases 60801 \
-output_order variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue zscore effect_size standard_error sample_size n_cases \
-output $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz
For documentation on the harmonization tools, see the GWAS tools wiki.
A quick overview of the arguments follow.
-output_column_map
argument can be specified multiple times, and states how to rename input columns, from an input name to a target column name.
Although there is some flexibility to the naming, we recommend at least chromosome, position, allele and id columns to have the output names shown here.
Please consider using this name conventions because the following steps assume them, and also because some (optional but important) harmonization steps are only performed if the data has columns identified with these names.
--chromosome_format
is an optional flag that prepends "chr" string to the GWAS chromosome entries,
to be compatible with liftover. --insert_value
creates extra columns with constant values extracted from the CAD GWAS article.
We obtained the following execution log:
INFO - Parsing input GWAS
INFO - loaded 9455778 variants
INFO - Performing liftover
INFO - 9455778 variants after liftover
INFO - Creating index to attach reference ids
INFO - Acquiring reference metadata
INFO - alligning alleles
INFO - 7626868 variants after restricting to reference variants
INFO - Ensuring variant uniqueness
INFO - 7626866 variants after ensuring uniqueness
INFO - Checking for missing frequency entries
INFO - Saving...
INFO - Finished converting GWAS in 771.5422394247726 seconds
A simpler harmonization scheme, alternative to the previous one, is using MetaXcan's built-in harmonization.
#!/bin/bash
python $METAXCAN/M03_betas.py \
--snp_map_file $DATA/coordinate_map/map_snp150_hg19.txt.gz \
--gwas_file $DATA/gwas/cad.add.160614.website.txt.gz \
--snp_column markername \
--non_effect_allele_column noneffect_allele \
--effect_allele_column effect_allele \
--beta_column beta \
--pvalue_column p_dgc \
--keep_non_rsid \
--throw \
--output $OUTPUT/quick_harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz
The file $DATA/coordinate_map/map_snp150_hg19.txt.gz
details the mapping from UCSC's hg19 snp database
to GTEx reference.
The other arguments with a _column
suffix are standard MetaXcan arguments
detailing how to identidy input GWAS columns.
Note that the arguments listed here can also be used when running S-PrediXcan, so that S-PrediXcan will perform the quick harmonization behind the scenes. This will not output the harmonized gwas, which might or might not be desireable depending on your user case. Please note that if you run S-PrediXcan with all 49 tissues this way, the same mapping will be performed for each of the 49 tissues; so. the more tissues you run, the more you'll benefit from constructing a harmonized GWAS file first and running S-PrediXcan on it.
This quick harmonization tool is provided for convenience, when a user can't be bothered with the complete harmonization and imputation scheme.
Imputing summary statistics is much less computationally intensive than individual-level imputation followed by GWAS on imputed genotypes. However, it still takes considerable time.
Summary statistics imputation works in a "region-wide" approach, each region a conceptual computation unit. Imputation takes all variants from the GTEx reference in a chromosomal region (we use Berisa-Pickrell LD blocks) and impute missing GWAS variants using present GWAS variants and genotypes from GTEx. Our implementation allows to split the imputation of a full GWAS in "sub-batches", i.e. just imputing for a few regions. By splitting the execution in smaller units, we can parallelize in an HPC environment.
See the following example:
python $GWAS_TOOLS/gwas_summary_imputation.py \
-by_region_file $DATA/eur_ld.bed.gz \
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
-parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet \
-parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome 1 \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch 0 \
--standardise_dosages \
-output $OUTPUT/summary_imputation/CARDIoGRAM_C4D_CAD_ADDITIVE_chr1_sb0_reg0.1_ff0.01_by_region.txt.gz
The following arguments state the splitting of computation:
-
-chromosome 1
states that only regions in chromosome 1 must be computed -
-sub_batches 10
states that these regions must be split in 10 subsets -
-sub_batch 0
states that only regions in the first subset must be computed
Following this example, you would have to run 220 imputation jobs (22 chromosomes x 10 sub-batches).
The -sub_batch
argument should take values from [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
, i.e. sub_batch is zero-based and 10 sub-batches are to be run.
Since we split the imputation jobs into smaller units, we also have a utility to gather all the partial results, with some additional postprocessing for good measure.
python $GWAS_TOOLS/gwas_summary_imputation_postprocess.py \
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz\
-folder $OUTPUT/summary_imputation \
-pattern CARDIoGRAM_C4D_CAD_ADDITIVE.* \
-parsimony 7 \
-output $OUTPUT/processed_summary_imputation/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz
This will give a ready-to use GWAS file.
On some genotyping technologies, palindromic variants (e.g. A/T) are ambiguously defined. To correct this, the default configuration of our imputation+postprocessing scheme will produce the following behinfd the scenes:
- Exclude ambiguous variants from input
- Impute summary statistics for ambiguous variants alongside missing variants
- For ambiguous variants, report
abs(observed summary statistic) * sign(imputed) summary statistic
This way, any potential ambiguity in the GWAS will at leastconsistent with the GTEx observations.
To run S-PrediXcan using MASHR-M models, you need a command like the following
python $METAXCAN/SPrediXcan.py \
--gwas_file $OUTPUT/processed_summary_imputation/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
--snp_column panel_variant_id \
--effect_allele_column effect_allele \
--non_effect_allele_column non_effect_allele \
--zscore_column zscore \
--model_db_path $DATA/models/eqtl/mashr/mashr_Whole_Blood.db \
--covariance $DATA/models/eqtl/mashr/mashr_Whole_Blood.txt.gz \
--keep_non_rsid \
--additional_output \
--model_db_snp_key varID \
--throw \
--output_file $OUTPUT/spredixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE__PM__Whole_Blood.csv
This uses MASHR-M Whole Blood expression model. The sample data also includes sqtl models, for 49 GTEx tissues.
After you run S-PrediXcan, you can run S-MultiXcan too, as in the following:
python $METAXCAN/SMulTiXcan.py \
--models_folder $DATA/models/eqtl/mashr \
--models_name_pattern "mashr_(.*).db" \
--snp_covariance $DATA/models/gtex_v8_expression_mashr_snp_covariance.txt.gz \
--metaxcan_folder $OUTPUT/spredixcan/eqtl/ \
--metaxcan_filter "CARDIoGRAM_C4D_CAD_ADDITIVE__PM__(.*).csv" \
--metaxcan_file_name_parse_pattern "(.*)__PM__(.*).csv" \
--gwas_file $OUTPUT/processed_summary_imputation/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
--snp_column panel_variant_id \
--effect_allele_column effect_allele \
--non_effect_allele_column non_effect_allele \
--zscore_column zscore \
--keep_non_rsid \
--model_db_snp_key varID \
--cutoff_condition_number 30 \
--verbosity 7 \
--throw \
--output $OUTPUT/smultixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE_smultixcan.txt
Through this tutorial, we worked on the integration of Coronary Artery Disease GWAS with GTEx v8 MASHR prediction models. We first harmonized the GWAS, then imputed missing summary statistics for the GWAS, then ran S-PrediXcan and S-MultiXcan.
CARDIoGRAM CAD GWAS downloaded from here, from Nikpay et al (Nat Gen 2016) article.
In the past, MetaXcan and GWAS Tools used different versions of python. Now both use on python>3.5 and this tutorial was updated op reflect the fact.
On a linux environment, this tutorial was run with the following:
#Download the code
git clone git@github.com:hakyimlab/summary-gwas-imputation.git
git clone git@github.com:hakyimlab/MetaXcan.git
# At the HPC cluster available to me, I load conda this way module load gcc/6.2.0 miniconda2/4.4.10
conda create -n myenv python=3.6
source activate myenv
conda install scipy numpy pandas -y
conda install -c conda-forge pyarrow -y
conda install -c bioconda pyliftover -y
wget https://zenodo.org/record/3657902/files/sample_data.tar?download=1
tar -xvpf sample_data.tar\?download\=1
rm sample_data.tar\?download\=1
export DATA=data
export GWAS_TOOLS=summary-gwas-imputation/src
export METAXCAN=MetaXcan/software
export OUTPUT=results
# harmonize input gWAS to our reference
python $GWAS_TOOLS/gwas_parsing.py \
-gwas_file $DATA/gwas/cad.add.160614.website.txt.gz \
-liftover $DATA/liftover/hg19ToHg38.over.chain.gz \
-snp_reference_metadata $DATA/reference_panel_1000G/variant_metadata.txt.gz METADATA \
-output_column_map markername variant_id \
-output_column_map noneffect_allele non_effect_allele \
-output_column_map effect_allele effect_allele \
-output_column_map beta effect_size \
-output_column_map p_dgc pvalue \
-output_column_map chr chromosome \
--chromosome_format \
-output_column_map bp_hg19 position \
-output_column_map effect_allele_freq frequency \
--insert_value sample_size 184305 --insert_value n_cases 60801 \
-output_order variant_id panel_variant_id chromosome position effect_allele non_effect_allele frequency pvalue zscore effect_size standard_error sample_size n_cases \
-output $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz
# The following imputes a portion of the imput GWAS.
# The imputation is meant to be split over many executions to improve paralellism
# so you would have to iterate `-chromosome` between [1,22] and `-sub_batch` between [0,_sub_batches],
# ideally in an HPVC environment
python $GWAS_TOOLS/gwas_summary_imputation.py \
-by_region_file $DATA/eur_ld.bed.gz \
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
-parquet_genotype $DATA/reference_panel_1000G/chr1.variants.parquet \
-parquet_genotype_metadata $DATA/reference_panel_1000G/variant_metadata.parquet \
-window 100000 \
-parsimony 7 \
-chromosome 1 \
-regularization 0.1 \
-frequency_filter 0.01 \
-sub_batches 10 \
-sub_batch 0 \
--standardise_dosages \
-output $OUTPUT/summary_imputation_1000G/CARDIoGRAM_C4D_CAD_ADDITIVE_chr1_sb0_reg0.1_ff0.01_by_region.txt.gz
# Finally, postprocess the harmonized input GWAs and all of the imputation batches' results
python $GWAS_TOOLS/gwas_summary_imputation_postprocess.py \
-gwas_file $OUTPUT/harmonized_gwas/CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
-folder $OUTPUT/summary_imputation_1000G \
-pattern "CARDIoGRAM_C4D_CAD_ADDITIVE.*" \
-parsimony 7 \
-output $OUTPUT/processed_summary_imputation_1000G/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz
python $METAXCAN/SPrediXcan.py \
--gwas_file $OUTPUT/processed_summary_imputation_1000G/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
--snp_column panel_variant_id --effect_allele_column effect_allele --non_effect_allele_column non_effect_allele --zscore_column zscore \
--model_db_path $DATA/models/eqtl/mashr/mashr_Whole_Blood.db \
--covariance $DATA/models/eqtl/mashr/mashr_Whole_Blood.txt.gz \
--keep_non_rsid --additional_output --model_db_snp_key varID \
--throw \
--output_file $OUTPUT/spredixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE__PM__Whole_Blood.csv