hail-based pipelines for annotating variant callsets and exporting them to elasticsearch
The hail scripts in this repo can be used to pre-process variant callsets and export them to elasticsearch.


./gcloud_dataproc/ - general-purpose scripts that run locally and perform various operations on dataproc clusters, such as submitting jobs, checking job status, and creating clusters.

  • - creates a dataproc cluster that has VEP pre-installed with a GRCh37 cache. This allows hail pipelines to use vds.vep(..) on GRCh37-aligned datasets.
  • - creates a dataproc cluster that has VEP pre-installed with a GRCh38 cache. This allows hail pipelines to use vds.vep(..) on GRCh38-aligned datasets.
  • creates a dataproc cluster without installing VEP, so vds.vep(..) won't work.
  • creates a cluster that allows hail commands to be run interactively in an ipython notebook.
  • connects to a previously created cluster and re-opens the ipython dashboard in the browser.
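Under the hood, cluster-creation wrappers like these shell out to gcloud. A minimal sketch of the kind of command they construct (the `--num-workers` and `--num-preemptible-workers` flags are real gcloud dataproc flags; the helper name and defaults are illustrative, not from this repo):

```python
import subprocess


def create_cluster_cmd(name, num_workers=2, num_preemptible=0):
    """Build the argument list for creating a dataproc cluster.

    Hypothetical helper - it mirrors what the wrapper scripts do,
    not their exact flags.
    """
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        "--num-workers", str(num_workers),
        "--num-preemptible-workers", str(num_preemptible),
    ]


# To actually create the cluster, the wrapper would run something like:
# subprocess.check_call(create_cluster_cmd("cluster1", 2, 12))
```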

./gcloud_dataproc/ - contains scripts that run locally and perform steps necessary to download, pre-process, and create vds or keytable versions of various reference datasets.

  • prints the names of all existing dataproc clusters in the project.
  • lists all active dataproc jobs.
  • prints gcloud details on a specific dataproc cluster.
  • prints details on a specific dataproc job.
  • resizes an existing dataproc cluster.
  • deletes a specific dataproc cluster.
  • / kills a specific hail job.
  • submits a python hail script to the cluster.
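The job-submission step can be sketched the same way: build a `gcloud dataproc jobs submit pyspark` command for a python hail script, passing script-specific arguments after `--`. The helper name is hypothetical; the gcloud subcommand and flags are real:

```python
def submit_cmd(cluster, script, script_args=()):
    """Sketch of a job-submission wrapper: run a python hail script
    on an existing dataproc cluster. Illustrative, not this repo's code."""
    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "--cluster", cluster, script,
    ]
    if script_args:
        # arguments after "--" are forwarded to the script itself
        cmd += ["--", *script_args]
    return cmd
```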

./hail_scripts/ - contains hail scripts that can only run in a hail environment or dataproc cluster.

Main hail pipelines:

  • annotation and pre-processing pipeline for GRCh37 and GRCh38 rare disease callsets.
  • - joins the gnomAD exome and genome datasets into a structure that contains the info used in the gnomAD browser, and exports this to elasticsearch.
  • runs VEP on a vcf or vds and writes the result to a .vds. WARNING: this must run on a cluster created with one of the VEP-enabled cluster scripts above, matching the genome version of the dataset being annotated.
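The VEP warning above can be made explicit with a small check: a cluster's pre-installed VEP cache must match the dataset's genome build. This helper (name and structure are illustrative, not from this repo) captures that constraint:

```python
# Map genome version to the VEP cache a cluster must have pre-installed
# for vds.vep(..) to work. Hypothetical helper, illustrating the WARNING above.
VEP_CACHES = {"37": "GRCh37", "38": "GRCh38"}


def required_vep_cache(genome_version):
    """Return the VEP cache needed for this genome build, or fail loudly."""
    if genome_version not in VEP_CACHES:
        raise ValueError("Unsupported genome build: %s" % genome_version)
    return VEP_CACHES[genome_version]
```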


  • subsets a vcf or vds to a specific chromosome or locus - useful for creating small datasets for testing.
  • converts a .tsv table to a VDS, allowing the user to specify the chrom, pos, ref, and alt column names.
  • imports a vcf and writes it out as a vds.
  • exports a subset of vds variants to a .tsv for inspection.
  • prints out the vds variant schema.
  • reads in a tsv, imputes the column types, and prints out the resulting keytable schema.
  • prints out vds stats such as the schema, variant count, etc.
  • connects to an elasticsearch instance and prints current indices and other stats.
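The core of the .tsv-to-variants conversion can be sketched in plain python: the user names the chrom/pos/ref/alt columns, and each row becomes a variant record. The function name and default column names are illustrative, not taken from the repo's scripts:

```python
import csv
import io


def tsv_to_variants(tsv_text, chrom_col="chrom", pos_col="pos",
                    ref_col="ref", alt_col="alt"):
    """Yield (chrom, pos, ref, alt) tuples from TSV text, using
    user-specified column names. A sketch of the conversion logic only -
    the real script builds a VDS via hail."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # pos is imputed as an integer; other fields stay as strings
        yield (row[chrom_col], int(row[pos_col]), row[ref_col], row[alt_col])
```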

NOTE: Some of the scripts require a running elasticsearch instance. See the linked docs for deploying a stand-alone elasticsearch cluster, or for deploying one as part of seqr.

Hail 0.2 scripts:

The submit scripts in gcloud_dataproc currently always use Hail 0.1. To run the Hail 0.2 scripts, cloudtools can be used to create a cluster with Hail 0.2:

```
zip -r hail_scripts.zip hail_scripts
cluster start --packages=elasticsearch somecluster
gcloud dataproc jobs submit pyspark --cluster=somecluster ./hail_scripts/v02/ -- --genome-version=37 --host=$ELASTICSEARCH_HOST_IP
```


Run VEP:

```
./gcloud_dataproc/ --hail-version 0.1 ./hail_scripts/v01/ gs://<dataset path>
```

Run rare disease callset pipeline:

```
./gcloud_dataproc/v01/ cluster1 2 12   # create cluster with 2 persistent and 12 preemptible nodes

./gcloud_dataproc/ --cluster cluster1 --project seqr-project ./hail_scripts/v01/ -g 38 --max-samples-per-index 180 --host $ELASTICSEARCH_HOST_IP --num-shards 12 --project-guid my_dataset_name --sample-type WES --dataset-type VARIANTS gs://my-datasets/GRCh38/my_dataset.vcf.gz
```

There's also a shortcut for running the rare disease pipeline which combines the two commands above into one:

```
python ./gcloud_dataproc/ --genome-version 38 --host $ELASTICSEARCH_HOST_IP --project-guid my_dataset_name --sample-type WES --dataset-type VARIANTS gs://my-datasets/GRCh38/my_dataset.vcf.gz
```
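Conceptually, the shortcut composes the two commands above: create a cluster, then submit the loading pipeline to it. A sketch of that composition (the helper and its return shape are hypothetical; the flag names mirror the commands shown above):

```python
def load_dataset_cmds(genome_version, host, project_guid, sample_type,
                      dataset_type, vcf_path, cluster="cluster1"):
    """Return the two argument lists the one-step shortcut wraps:
    cluster creation, then pipeline submission. Illustrative only."""
    # cluster name, persistent nodes, preemptible nodes (as in the example above)
    create_args = [cluster, "2", "12"]
    submit_args = [
        "--cluster", cluster,
        "-g", genome_version,
        "--host", host,
        "--project-guid", project_guid,
        "--sample-type", sample_type,
        "--dataset-type", dataset_type,
        vcf_path,
    ]
    return create_args, submit_args
```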