Skip to content

Downstreamer

Patrick Deelen edited this page Nov 9, 2021 · 45 revisions

Downstreamer can be used to to perform key gene prioritization and pathway enrichment using GWAS summary statistics. The key gene prioritization of genes are based on co-regulation. This co-regulation is a measure related to co-expression but less driven by tissue differences [2]. Downstreamer does not rely on known pathway assignments of genes but instead uses predicted gene-pathway assignments [1]. Both the gene-pathway assignments and co-regulation scores are calculated using 31,499 public RNA-seq samples from many different tissues [1].

A preprint is available here: https://www.medrxiv.org/content/10.1101/2021.10.21.21265342v2

Content

1️⃣ Getting started
2️⃣ Running Downstreamer
3️⃣ Key gene enrichment analysis
4️⃣ Most important output files
5️⃣ Code availability

1. Getting started

Download tool and reference data here: https://downloads.molgeniscloud.org/downloads/depict2/

2. Running Downstreamer

The main function of Downstreamer is using GWAS summary statistics to predict key genes and pathways. The commands needed to do this using the by us provided pathway databases are listed below.

We have also created wrapper scripts that use configuration files instead of command line arguments. The instructions for this can be found in the readme of the bundle download above.

All Downstreamer commands have the following basis:

java -Xmx10g -jar Downstreamer.jar --mode [MODE] --output

On top of this you might need to allocate more memory when running Downstream. If you get an heap space error, then increase the value of -Xmx10g to increase the amount of memory that Downstreamer can use.

An overview of the different modes can be found here: DS additional modes

2.a. Preparing GWAS summary statistics

First the GWAS summary statistics should be formatted in the following manner:

  • TAB separated text file
  • First column must contain the variant identifiers as used in the genetic reference panel for linkage disequilibrium (LD) calculations. If the reference panel uses RS identifiers then these should be used. If chr:pos identifiers are used by the reference panel then these should be used here.
  • The following columns should contain the GWAS variant p-values. The header of these columns should be name of the GWAS
  • It is not needed to filter on variants present in the reference data, these will automatically be excluded.

Secondly this text file needs to be converted to a binary file for quick access by Downstreamer. That can be done using the CONVERT_TXT mode with the following command:

java -Xmx10g -jar Downstreamer.jar --mode CONVERT_TXT --output [PATH_TO_OUTPUT_FILE] --pvalueToZscore --gwas [PATH_TO_GWAS_TEXT_FILE]

You now get 3 files, 1 binary file with the summary statistics matrix and 2 text files depicting the rows and columns. If all went well the columns files will contain the names of the studies in the original text file.

❗ While it is possible to combine multiple GWAS summary statistics in a single file this is only recommended if they are performed on the same cohort using the same genotyping data.

2.b. Dowstreamer STEP1

The first part of Downstreamer, STEP1 mode, is the most computationally and memory intensive. Here the GWAS SNP p-values are converted to GWAS gene p-values. Additionally this step does a very basis pruning to identify independent top GWAS hits.

It is possible to provide the STEP2 arguments when running STEP1, in this case Downstreamer will automatically continue with STEP2.

Arguments used by STEP1

Argument ShortArgument Type Description
--correctLambda -cl flag Correct the GWAS for the lambda inflation
--debug -d flag Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis.
--gwas -g path GWAS Z-score binary matrix. Rows variants, Cols traits. Without .dat
--genes -ge path File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm
--maf -maf double Minimum MAF
--output -o path The output path
--permutations -p int Number of initial permutations before using Farebrother's RUBEN algorithm to determine gene p-values, min / recommended: 100,000
--permutationFDR -pfdr int Number of random phenotypes to use to determine null distribution for FDR calculation, recommended: 100
--permutationGeneCorrelations -pgc int Number of random phenotypes to use to determine gene correlations, recommended: 10,000
--permutationPathwayEnrichment -ppe int Number of random phenotypes to use to determine null distribution pathway enrichment, recommend: 10,000
--permutationsRescue -pr int Number of permutations to use as fallback in case Farebrother's RUBEN algorithm failed. Optional but recommend to do: 10,000,000.
--referenceGenotypeFormat -R type The reference genotype data type. If not defined will attempt to automatically select the first matching dataset on the specified path
* PED_MAP - plink PED MAP files.
* PLINK_BED - plink BED BIM FAM files.
* VCF - bgziped vcf with tabix index file
* VCFFOLDER - matches all bgziped vcf files + tabix index in a folder
* SHAPEIT2 - shapeit2 phased haplotypes .haps & .sample
* GEN - Oxford .gen & .sample
* BGEN - Oxford .bgen & optionally .sample
* TRITYPER - TriTyper format folder
--referenceGenotypes -r basePath The reference genotypes
--referenceSamples -rs path Samples to include from reference genotypes
--threads -t int Maximum number of calculation threads
--saveUsedVariantsPerGene -uvg flag Save all the variants used per gene to calculate the gene p-value (Warning this will create a very large file)
--variantCorrelation -v double Max correlation between variants to use (recommend = 0.95)
--variantFilter -vf path File with variants to include in gene p-value calculation (optional)
--variantGene -vg path Variant gene mapping. col 1: variant ID col 2: ENSG ID. (optional)
--window -w int Number of bases to add left and right of gene window for step 1, use -1 to not have any variants in window

2.c.Downstreamer STEP2

To run the STEP2 mode the programs needs the gene p-values of the real GWASes and the randomized null GWASes. It also takes the top hits as determined in STEP1 but it is possible to provide your own list of top hits --alternaitveTopHits.

The gene p-values are the basis of the pathway enrichment analysis and gene prioritization in STEP2. Because it is not needed to recalculate the gene p-values when performing multiple enrichment analysis this step only needs to be run once.

Arguments used by STEP2

Argument ShortArgument Type Description
--annotDb -adb String Name of Pathway databases that you want to annotate with GWAS info. Rownames MUST be enembl Id's
--alternaitveTopHits -ath name=path Alternative file with top GWAS hits id chr pos p-value. name must be name as in oter output files for trait
--covariates -cov path File with covariates used to correct the gene p-values. Works in conjunction with -rgl. Residuals of this regression are used as input for the GLS
--cisWindowExtent -cwe int Cis window is defined as [startOfGene-cwe, endOfGene+cwe]. Defaults to: 250000
--debug -d flag Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis.
--excludeHla -eh flag Exclude HLA locus during pathway enrichment (chr6 20mb - 40mb)
--forceNormalGenePvalues -fngp flag Force normal gene p-values before pathway enrichment
--forceNormalPathwayPvalues -fnpp flag Force normal pathway p-values before pathway enrichment
--geneCorrelationWindow -gcw int Window in bases to calculate gene correlations for GLS of pathway enrichment
--genes -ge path File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm
--genePruningR -gpr double Exclude correlated genes in pathway enrichment
--ignoreGeneCorrelations -igc flag Ignore gene correlations in pathway enrichment (not recommended)
--output -o path The output path
--pathwayDatabase -pd name=path Pathway databases, binary matrix with either z-scores for predicted gene pathway associations or 0 / 1 for gene assignments
--permutationFDR -pfdr int Number of random phenotypes to use to determine null distribution for FDR calculation
--permutationGeneCorrelations -pgc int Number of random phenotypes to use to determine gene correlations
--permutationPathwayEnrichment -ppe int Number of random phenotypes to use to determine null distribution pathway enrichment
--qnorm -qn flag Quantile normalize the permutations gene p-values before pathway enrichment to fit the real gene p-value distribution
--regress-gene-lengths -rgl flag Linearly correct for the effect of gene lengths on gene p-values
--saveExcel -se flag Save enrichment results also as excel files. Will generate 1 file per input phenotype
--stepOneOutput -soo path The output path of STEP1
--threads -t int Maximum number of calculation threads

3. Key gene enrichment analysis

After the key gene prioritization performed in STEP2 it is possible to test determine the enrichment of these key-genes in pathways or HPO-terms. This is done using the PRIO_GENE_ENRICH mode.

❗ The --pathwayDatabase option beheaves slightly different here. Now it is used to define on which predictions the enrichment should be performed. This of course can only be done on gene prioritization. If you also performed pathway or tissue prioritization these should be omitted here.

Arguments used by PRIO_GENE_ENRICH

Argument ShortArgument Type Description
--debug -d flag Activate debug mode. This will result in a more verbose log file and will save many intermediate results to files. Not recommended for large analysis.
--excludeHla -eh flag Exclude HLA locus during pathway enrichment (chr6 20mb - 40mb)
--genes -ge path File with gene info. col1: geneName (ensg) col2: chr col3: startPos col4: stopPos col5: geneType col6: chrArm
--output -o path The output path
--pathwayDatabase -pd name=path Pathway databases, binary matrix with either z-scores for predicted gene pathway associations or 0 / 1 for gene assignments
--pathwayDatabase2 -pd2 name=path Pathway databases, binary matrix with 0 or 1 for gene pathway assignments. Only used by PRIO_GENE_ENRICH mode
--stepOneOutput -soo path The output path of STEP1

4. Most important output files

Among the different intermediate files, two files contain the primary output.

TraitName_enrichtments.xlsx

This file contains the results of the different enrichment analysis. Each sheet contains the results of single enrichment source. By default we run enrichment on the following types of datasets. But in principle it can be done on any genes times X matrix.

Gene co-regulation enrichments

As an input we use a gene by gene matrix to expression the relation between genes. We typically use co-regulation to do this (see introduction above). This means that for each gene we have a metric how it relates to all other genes, there scores are then correlated to the GWAS gene scores. The idea being that genes that are co-regulated with many genes that are picked up by the GWAS are more important to the studied trait. Since this is done in a genome wide manner we can prioritize trans acting genes outside of the GWAS loci.

Pathway enrichments

The pathways enrichment are performed using the predicted pathways assignments from [1]. Here we test if the predicted gene assignments to a pathway are correlate to GWAS gene signal.

Sample & tissue enrichments

The sample enrichment (in the expression tab) indicate the correlation between expression levels of each of the 31,499 samples in our gene-network and the GWAS gene p-values. We found that to top enriched samples are often very relevant for the trait studied trait.

The GTEx sheet is used to determine the enrichment of primary tissues by correlating the GWAS to the average expression profile of each tissue in GTEx.

TraitName_cisPrio.xlsx

The co-regulation scores can also be used to prioritize candidate genes within the loci identified by the GWAS. In this file we list all the genes within a window around the independent top variants and we prioritize these variants based on the overall gene prioritization. These prioritization scores do not necessarily have to be very strong since they these genes might only have a small effect on the overall outcome of the traits. We do however suspect that the causal genes within a cis locus will, on average, have higher scores. We are currently investigating this further.

TraitName_[KeyGene]_Enrichment.xlsx

This file is created using the PRIO_GENE_ENRICH mode and contains the enrichment of the Bonferroni significant and FDR 5% significant key genes. The bracketed part of the file name is arbitrary and depends on the name you have used for these predictions in the --pathwayDatabase argument. This file can contain multiple sheets, one for each database specified using --pathwayDatabase2.

Additionally, there will also be a file with GenePvalues instead of a database name. These are the enrichment when using the PASCAL based gene p-values directly.

5. Code availability

https://github.com/molgenis/systemsgenetics/tree/master/Downstreamer

References

[1] https://www.nature.com/articles/s41467-019-10649-4

[2] https://www.nature.com/articles/ng.3173

Clone this wiki locally