Best practices for integrating GWAS and GTEX v8 transcriptome prediction models

Alvaro Barbeira edited this page Feb 7, 2020 · 6 revisions

Best practices when integrating GWAS and prediction models

In the following article, we'll provide an overview of harmonization and imputation of GWAS variants to a reference QTL set (such as GTEx; we'll use QTL and GTEx interchangeably, although the concepts apply to any transcriptome study).

There is a vast and heterogeneous landscape of GWAS (Genome-Wide Association Studies) publicly available. These studies were conducted on different genotyping platforms, using different imputation schemes, defined on different releases of the human genome.

We are particularly interested in methods in the PrediXcan family. These methods use transcriptome prediction models, trained on separate cohorts, to infer transcriptome variation in the GWAS cohort, and then compute transcriptome-to-trait associations. We have shown here that when the intersection of variants between the GWAS and the models is small, performance generally decreases for any method that performs some form of transcriptome-wide analysis.

In other words, the sets of variants available to a particular GWAS might be poorly matched to the set of variants in transcriptome prediction models, so that the predictions have decreased performance. This GWAS-model variant intersection must be considered in any application.

We have released an exciting new family of models that incorporate biological information, which we term MASHR-M. They are available here (in the link: mashr_eqtl.tar and mashr_sqtl.tar contain the single-tissue prediction models and LD compilations, while gtex_v8_expression_mashr_snp_smultixcan_covariance.txt.gz and gtex_v8_splicing_mashr_snp_smultixcan_covariance.txt.gz are the S-MultiXcan LD compilations). These models use effect sizes computed with MASHR on variants fine-mapped with DAP-G.

These models use fine-mapped variants to improve prediction quality. The models are parsimonious and perform better than current alternatives. The fine-mapping was performed with DAP-G on GTEx v8 data; this GTEx version uses hg38 as its reference, and many of its variants don't have an rsid. Most GWAS are defined on earlier releases of the human genome (hg17, hg18, hg19), so matching GWAS variants to the new models can prove difficult for older GWAS or for those with limited imputation.

We have developed a rich set of tools to reconcile variants between GWAS and models. The general process we employ consists of two steps:

  1. Harmonization of GWAS variants
    • Fixing GWAS format inconsistencies
    • Mapping genomic coordinates between different human genome release assemblies.
    • Curating variants for matching allele definitions
  2. Imputation of summary statistics
    • Using a reference panel to impute missing associations from present ones using BLUP (Best Linear Unbiased Prediction)
    • Performed on harmonized GWAS
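The allele-curation step of harmonization can be illustrated with a minimal sketch. It assumes the GWAS reports a z-score together with an effect and a non-effect allele, and that the reference (QTL) panel defines its own allele pair for the same position. The function name `align_to_reference` and its arguments are hypothetical; this is not the code used by the harmonization tools, only an illustration of the idea.

```python
# Hypothetical sketch: express a GWAS z-score relative to the reference panel's
# effect allele. Only single-nucleotide alleles are handled for strand flips.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def align_to_reference(effect_allele, non_effect_allele, zscore,
                       ref_effect, ref_non_effect):
    """Return the z-score aligned to the reference alleles, or None if no match."""
    gwas = (effect_allele.upper(), non_effect_allele.upper())
    ref = (ref_effect.upper(), ref_non_effect.upper())

    if gwas == ref:            # same alleles, same orientation
        return zscore
    if gwas == ref[::-1]:      # effect and non-effect alleles are swapped
        return -zscore

    # Try the opposite strand (complement each allele).
    flipped = tuple(COMPLEMENT.get(a) for a in gwas)
    if None in flipped:        # indels or unknown symbols: not handled here
        return None
    if flipped == ref:
        return zscore
    if flipped == ref[::-1]:
        return -zscore
    return None                # alleles cannot be reconciled; drop the variant


if __name__ == "__main__":
    print(align_to_reference("G", "A", 1.5, "A", "G"))
```

Note that palindromic pairs (A/T, C/G) are treated here as if they were on the same strand, which is exactly the ambiguity that a real pipeline has to resolve by other means.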

Although the problem is straightforward to state, technical details such as the formatting inconsistencies found across many GWAS datasets compound into considerable complexity. We quantify the effect of these preprocessing schemes when running S-PrediXcan.

GWAS Preprocessing: harmonization and imputation

In this preprint, we processed 114 GWAS traits to harmonize them to the GTEx v8 data set. We implemented a feature-rich "Full" GWAS harmonization approach, followed by imputation of summary statistics for GTEx v8 variants that were missing from the GWAS. There were 2 traits defined on hg17, 13 on hg18, and 54 on hg19.

We also implemented a second harmonization approach, "Quick", which is suboptimal but easier to use. We provide it for convenience in less demanding cases, or for users averse to the complexity of full harmonization and imputation.

  1. a) "Full" harmonization (simply called "Harmonization" in the following)

    • Implemented here: src/
    • Custom script handling many formatting intricacies found in publicly available GWAS.
    • Converts between human genome release versions using liftover
    • Curates variants for matching alleles
    • More flexible
  2. b) "Quick" Harmonization

    • Implemented in MetaXcan's software/
    • Uses mapping tables (available here) that define the conversion between variants on different human genome assemblies, precomputed from UCSC's SNP database
    • Integrated into MetaXcan tools
    • Less flexible. Maps fewer variants
    • Easier to use
  3. Imputation

    • Implemented here: src/
    • Less computationally intensive than individual-level imputation.
    • Optimized for large-scale, many-trait phenome-wide analysis.
    • Can run on both an ordinary UNIX PC and an HPC server (the latter is recommended)
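To make the imputation step more concrete, here is a minimal sketch of a commonly used BLUP-style estimator for summary statistics: the z-scores of missing variants are predicted as Σ_mo (Σ_oo + λI)⁻¹ z_o, where Σ is the LD (correlation) matrix from a reference panel, "o" indexes observed variants and "m" missing ones. The function name `impute_zscores`, the ridge term λ (`ridge`) and its default value are assumptions made for illustration; the actual tool may window, regularize, and filter differently.

```python
import numpy as np


def impute_zscores(ld, observed_idx, missing_idx, z_observed, ridge=0.1):
    """Hypothetical sketch of BLUP-style z-score imputation from reference LD.

    ld           : square correlation matrix over all variants in a region
    observed_idx : indices of variants present in the GWAS
    missing_idx  : indices of variants to impute
    z_observed   : GWAS z-scores of the observed variants (same order)
    ridge        : assumed regularization added to the diagonal for stability
    """
    ld = np.asarray(ld, dtype=float)
    z_observed = np.asarray(z_observed, dtype=float)

    sigma_oo = ld[np.ix_(observed_idx, observed_idx)]
    sigma_mo = ld[np.ix_(missing_idx, observed_idx)]

    regularized = sigma_oo + ridge * np.eye(len(observed_idx))
    # (Σ_oo + λI) is symmetric, so solving against Σ_mo^T and transposing
    # yields Σ_mo (Σ_oo + λI)^-1 without forming an explicit inverse.
    weights = np.linalg.solve(regularized, sigma_mo.T).T
    return weights @ z_observed


if __name__ == "__main__":
    toy_ld = [[1.0, 0.8, 0.0],
              [0.8, 1.0, 0.0],
              [0.0, 0.0, 1.0]]
    # Variant 1 is missing; variants 0 and 2 were observed in the GWAS.
    print(impute_zscores(toy_ld, [0, 2], [1], [2.0, 1.0]))
```

The ridge term shrinks imputed values toward zero; with `ridge=0` and uncorrelated observed variants, the imputed z-score is simply the LD-weighted sum of the observed ones.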

A note on "ambiguous variants" (e.g. A/T vs T/A): the imputation scheme imputes summary statistics for palindromic variants, and then reports sign(imputed summary statistic) * abs(observed summary statistic). This way, any potential ambiguity in the GWAS will at least be consistent with the GTEx observations.
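The rule for palindromic variants can be written down directly. The sketch below assumes z-scores as the summary statistic; the helper names `is_palindromic` and `resolve_palindromic` are hypothetical and only illustrate the formula stated above.

```python
def is_palindromic(allele_a, allele_b):
    """True for strand-ambiguous SNPs, i.e. A/T or C/G allele pairs."""
    return {allele_a.upper(), allele_b.upper()} in ({"A", "T"}, {"C", "G"})


def resolve_palindromic(observed_z, imputed_z):
    """Return sign(imputed_z) * abs(observed_z).

    The magnitude comes from the GWAS observation; the direction comes from
    the value imputed from neighboring, unambiguous variants.
    """
    sign = (imputed_z > 0) - (imputed_z < 0)  # -1, 0 or 1
    return sign * abs(observed_z)


if __name__ == "__main__":
    if is_palindromic("A", "T"):
        print(resolve_palindromic(observed_z=-2.5, imputed_z=1.0))
```

With this rule, an imputed value of exactly zero yields zero, because the sign carries no direction in that case.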

Integration with MASHR-M models

We ran S-PrediXcan using 49 MASHR-M models on all 3 families of GWAS processing ("Quick Harmonization", "Harmonization", "Imputation") on the 114 traits above.

We compare several results metrics to highlight the importance of harmonization and imputation.

Fraction of snps used

The above plot shows the distribution of the median fraction of snps in each model that were used (i.e. were present in the GWAS) for each trait-tissue pair, grouped by the human genome release underlying each GWAS. We aggregated by trait-tissue pair to illustrate the more straightforward application of S-PrediXcan to a trait using a few tissues of interest. We notice that for all three releases (hg17, hg18, and hg19), imputation brings the fraction of snps used very close to 100%. GWAS that were only harmonized show a less favorable distribution of GWAS-model variant intersection, with the quick harmonization scheme performing worse than the full harmonization scheme. However, for hg19 the gain from imputation is more modest, suggesting that newer GWAS with high-quality sequencing and imputation might yield acceptable results without going through the complexities of imputation.
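For readers who want to check this metric on their own data, the following minimal sketch computes the fraction of snps used per gene model and its median within one tissue. It assumes each tissue is represented as a mapping from gene to the variant identifiers in that gene's model, and that GWAS variants use the same identifiers; the function names and the toy data are hypothetical.

```python
from statistics import median


def fraction_used(model_variants, gwas_variants):
    """Fraction of a model's variants that are present in the GWAS."""
    model_variants = set(model_variants)
    if not model_variants:
        raise ValueError("model has no variants")
    return len(model_variants & set(gwas_variants)) / len(model_variants)


def median_fraction_used(tissue_models, gwas_variants):
    """Median, across the gene models of one tissue, of the fraction used."""
    gwas_variants = set(gwas_variants)
    return median(fraction_used(variants, gwas_variants)
                  for variants in tissue_models.values())


if __name__ == "__main__":
    # Toy data with made-up variant identifiers.
    toy_models = {
        "geneA": ["v1", "v2"],
        "geneB": ["v3", "v4", "v5", "v6"],
        "geneC": ["v7"],
    }
    toy_gwas = {"v1", "v3", "v4", "v5", "v7"}
    print(median_fraction_used(toy_models, toy_gwas))
```

A low value for a trait-tissue pair is a signal that harmonization alone left many model variants unmatched and that imputation may be worthwhile.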

Fraction of snps used

The above plot shows the same comparison for the number of associations obtained, and we observe a similar trend: imputation achieves the highest number of computable associations in all GWAS, with a similar distribution across hg17-, hg18-, and hg19-based GWAS. Using only a harmonization scheme yields fewer associations for most traits, but a few traits in hg19 perform well enough without imputation.

Colocalized, significant associations

The above plot shows the number of significantly associated genes that are also colocalized (in any tissue) via ENLOC (our colocalization method of choice). Again, the highest number of colocalized associations is obtained in the imputation scheme, but for hg19-based GWAS harmonization performs acceptably.


We recommend performing full harmonization followed by missing summary statistics imputation to leverage all the patterns of variation available to MASHR-M models.

However, different projects might have different constraints. We have run S-PrediXcan on 4000 GWAS traits from the rapid GWAS project in this preprint. Performing imputation for 4000 traits would have taken too long on the computational resources available to us. However, the full harmonization scheme takes about 15 minutes of running time per trait, and when running S-PrediXcan with MASHR-M we found that 95% of model snps were available in the GWAS. In cases such as this, harmonization alone suffices.

For users who favor simplicity, the alternative quick harmonization scheme is easy to use and quick to implement using MetaXcan tools.

A tutorial on a particular GWAS trait is shown here.