Skip to content
Harmonise GWAS summary statistics against a reference VCF
Python Shell
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
bin restructured project Jul 12, 2018
sumstat_harmoniser uses rsid as final step to resolve multiple records in vcf Oct 12, 2018
tests
LICENSE updated argparse and readme to be clear Jul 18, 2018
environment.yaml remove snakemake from requirements Jul 20, 2018
flowchart_v3.svg
readme.md

readme.md

GWAS summary statistics harmoniser

Scripts for harmonising GWAS summary stats against a reference VCF sitelist. See flowchart for implementation details.

Requirements

# Download
git clone https://github.com/opentargets/sumstat_harmoniser.git
cd sumstat_harmoniser

# Install dependencies into isolated environment
conda env create -n sumstat_harmoniser --file environment.yaml

# Activate environment
source activate sumstat_harmoniser

# Run tests
cd tests
./run_tests.py

Usage

$ bin/sumstat_harmoniser --help
usage: main.py [-h] --sumstats <file> --vcf <file> [--hm_sumstats <file>]
               [--hm_statfile <file>] [--strand_counts <file>]
               [--palin_mode [infer|forward|reverse|drop]]
               [--af_vcf_field <str>] [--infer_maf_threshold <float>]
               --chrom_col <str> --pos_col <str> --effAl_col <str>
               --otherAl_col <str> [--beta_col <str>] [--or_col <str>]
               [--or_col_lower <str>] [--or_col_upper <str>] [--eaf_col <str>]
               [--rsid_col <str>] [--only_chrom <str>] [--in_sep <str>]
               [--out_sep <str>] [--na_rep_in <str>] [--na_rep_out <str>]
               [--chrom_map <str> [<str> ...]]

Summary statistc harmoniser

optional arguments:
  -h, --help            show this help message and exit

Input files:
  --sumstats <file>     GWAS summary statistics file
  --vcf <file>          Reference VCF file. Use # as chromosome wildcard.

Output files:
  --hm_sumstats <file>  Harmonised sumstat output file (use 'gz' extension to
                        gzip)
  --hm_statfile <file>  Statistics from harmonisation process output file.
                        Should only be used in conjunction with --hm_sumstats.
  --strand_counts <file>
                        Output file showing number of variants that are
                        forward/reverse/palindromic

Harmonisation mode:
  --palin_mode [infer|forward|reverse|drop]
                        Mode to use for palindromic variants: (a) infer strand
                        from effect-allele freq, (b) assume forward strand,
                        (c) assume reverse strand, (d) drop palindromic
                        variants

Strand inference options:
  Options that are specific to strand inference (--palin_mode infer)

  --af_vcf_field <str>  VCF info field containing alt allele freq (default:
                        AF_NFE)
  --infer_maf_threshold <float>
                        Max MAF that will be used to infer palindrome strand
                        (default: 0.42)

Input column names:
  --chrom_col <str>     Chromosome column
  --pos_col <str>       Position column
  --effAl_col <str>     Effect allele column
  --otherAl_col <str>   Other allele column
  --beta_col <str>      beta column
  --or_col <str>        Odds ratio column
  --or_col_lower <str>  Odds ratio lower CI column
  --or_col_upper <str>  Odds ratio upper CI column
  --eaf_col <str>       Effect allele frequency column
  --rsid_col <str>      rsID column in the summary stat file

Other args:
  --only_chrom <str>    Only process this chromosome
  --in_sep <str>        Input file column separator (default: tab)
  --out_sep <str>       Output file column separator (default: tab)
  --na_rep_in <str>     How NA are represented in the input file (default: "")
  --na_rep_out <str>    How to represent NA values in output (default: "")
  --chrom_map <str> [<str> ...]
                        Map summary stat chromosome names, e.g. `--chrom_map
                        23=X 24=Y`

Examples

# Count number of variants that are forward strand/reverse strand/palindromic,
# without harmonisation

bin/sumstat_harmoniser \
  --sumstats {input sumstats} \
  --vcf {reference vcf} \
  --chrom_col {chrom name column} \
  --pos_col {bp position column} \
  --effAl_col {effect allele column} \
  --otherAl_col {non effect allele column} \
  --stand_counts {output count file}

# Harmonise variants whilst inferring strand of palindromic variants (requires
# known effect allele and VCF alt allele frequencies)

bin/sumstat_harmoniser \
  --sumstats {input sumstats} \
  --vcf {reference vcf} \
  --chrom_col {chrom name column} \
  --pos_col {bp position column} \
  --effAl_col {effect allele column} \
  --otherAl_col {non effect allele column} \
  --beta_col {beta column} \
  --palin_mode infer \
  --eaf_col {effect allele freq column} \
  --af_vcf_field {alt allele freq VCF info field} \
  --infer_maf_threshold {freq threshold for inference} \
  --hm_sumstats {output harmonised sumstats} \
  --hm_statfile {output outcome code stats file}

# Harmonise variants whilst assuming palindromic variants are on forward strand

bin/sumstat_harmoniser \
  --sumstats {input sumstats} \
  --vcf {reference vcf} \
  --chrom_col {chrom name column} \
  --pos_col {bp position column} \
  --effAl_col {effect allele column} \
  --otherAl_col {non effect allele column} \
  --beta_col {beta column} \
  --palin_mode forward \
  --hm_sumstats {output harmonised sumstats} \
  --hm_statfile {output outcome code stats file}

Todo

  • ( ) Check that tbi file exists for the VCF
  • ( ) Find way to speed up reference VCF query (currently using tabix which takes up 85% of run time). Possibilities:
    • ( ) Giggle is reported to be faster than tabix, but I don't know if this is only for multiple querys.
    • ( ) Load required lines from reference VCF into memory
You can’t perform that action at this time.