Annotation Support in Variant Transforms
The annotation support in Variant Transforms has two goals:
- First, to properly parse annotation fields in input VCFs.
- Second, to annotate VCF files during the import process, using open annotation tools/databases like Variant Effect Predictor (VEP).
Parsing Annotation Fields
The annotation field format that is currently supported is based on this spec developed in the SnpEff project. VEP uses the same format; here is a trimmed example from a file annotated by VEP (taken from the gnomAD dataset):
```
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|ALLELE_NUM|...>
```
The corresponding annotations for a sample variant record are provided as the value of the CSQ INFO field. For each variant record, multiple annotation lists are provided, separated by commas. The first field of each list refers to an alternate allele, or ALT, of that record. Note that the actual record of the above example has more fields, and each alternate allele can be repeated multiple times (with different annotations); here we have selected only a small sub-set, one for each alternate allele. When annotation parsing is enabled, annotation lists are split and moved under the alternate_bases records. In other words, there will be a repeated record field CSQ under each alternate_bases element of the output BigQuery table.
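The parsing step described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual Variant Transforms implementation; the helper name and the sample values are hypothetical.

```python
def parse_csq(csq_value, format_names):
    """Split a raw CSQ INFO value into one dict per annotation list.

    Annotation lists for the alternate alleles are comma-separated;
    fields within one list are pipe-separated, in the order given by
    the 'Format: ...' part of the CSQ header line.
    """
    annotations = []
    for annotation_list in csq_value.split(','):
        values = annotation_list.split('|')
        annotations.append(dict(zip(format_names, values)))
    return annotations


# Field names taken from a (trimmed) CSQ header line; sample value is made up.
names = ['Allele', 'Consequence', 'IMPACT', 'SYMBOL']
parsed = parse_csq(
    'T|missense_variant|MODERATE|HBB,-|frameshift_variant|HIGH|HBB', names)
```

Each resulting dict corresponds to one annotation list; its `Allele` field is what must later be matched against the record's alternate alleles.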
Alternate Allele Matching
As can be seen in the above example, the alternate alleles listed in the CSQ field differ from the ALT strings of the variant record (for example, the AAC->C variant is represented by a single -, which means a deletion). This is because the gnomAD dataset was run through VEP with the --minimal option. Variant Transforms supports matching alt fields in this minimal mode; there are three supported modes of alternate allele matching. Variant Transforms can also simulate the --minimal mode of VEP with --minimal_vep_alt_matching, but this is not recommended as it can result in ambiguous matching. The number of such ambiguous cases is counted and reported in the Dataflow dashboard. With this flag, a new field is also added to the output BigQuery table to mark ambiguous cases. This should be used only as a last resort.
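The minimal representation used by VEP can be illustrated by trimming the bases that REF and ALT share, as in the sketch below. This is a simplified illustration of the behavior described above; to_minimal_representation is a hypothetical helper, not Variant Transforms code.

```python
def to_minimal_representation(ref, alt):
    """Trim bases shared between REF and ALT; an empty allele becomes '-'.

    Simplified sketch of VEP-style minimal allele representation:
    the AAC->C deletion from the example above reduces to '-'.
    """
    # Trim the shared suffix.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim the shared prefix.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
    return alt if alt else '-'
```

Matching then compares each CSQ `Allele` value against the minimal representation of each ALT string; when two different ALTs reduce to the same minimal allele, the match is ambiguous.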
Running VEP through Variant Transforms
You can run VEP on input VCF files before importing them into BigQuery. The minimal set of flags to enable this feature is --run_annotation_pipeline and --annotation_output_dir [GCS_PATH], where [GCS_PATH] is a path in a GCS bucket that your project owns.
While parsing annotation fields and breaking them into separate fields is a useful feature, it still requires pre-annotated VCFs. To address this need, we have started an effort to provide options for annotating input files as part of the BigQuery import process, i.e., done automatically by Variant Transforms.
Currently, this option is only supported for VEP: Variant Transforms brings up virtual machines (VMs) on Google Cloud Platform (GCP) and runs VEP Docker images on them. This is slow, and improving this experience is on our roadmap.
Relevant flags for this feature are:
- --run_annotation_pipeline: enables the automatic annotation feature.
- --annotation_output_dir: a path on Google Cloud Storage (GCS) where VEP output files are copied. The file hierarchy of the input files is replicated at this location, with the same file names followed by _vep_output.vcf. Note that if this directory already exists, Variant Transforms fails; this is to prevent unintentional overwriting of old annotated VCFs.
- --shard_variants: by default, the input files are sharded into smaller temporary VCF files before running VEP annotation. If the input files are small, i.e., each VCF file contains fewer than 50,000 variants, setting this flag can be computationally wasteful.
- --number_of_variants_per_shard: the maximum number of variants written to each shard if shard_variants is true. The default value should work for most cases. You may set this flag to a smaller value if you have a dataset with many samples; note that the pipeline may take longer to finish for smaller values of this flag.
- --vep_image_uri: the Docker image for VEP, created using the Dockerfile in the variant-annotation GitHub repo. By default, gcr.io/cloud-lifesciences/vep_91 is used, which is a public image that Google maintains (VEP version 91).
- --vep_cache_path: the GCS location of the compressed VEP cache. This file can be created using the build_vep_cache.sh script. By default, gs://cloud-lifesciences/vep/vep_cache_homo_sapiens_GRCh38_91.tar.gz is used, which is suitable for human genomes aligned to the GRCh38 reference sequence.
- --vep_info_field: by default, an INFO field called CSQ_VT is added to hold the new annotations; use this flag to override the field name.
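The sharding controlled by --shard_variants and --number_of_variants_per_shard amounts to splitting the input records into fixed-size groups, roughly as in this sketch (shard_variants below is a hypothetical helper, not the actual implementation):

```python
def shard_variants(variants, number_of_variants_per_shard):
    """Split a list of variant records into shards of at most the given size.

    Illustrative only: the real pipeline writes each shard to a temporary
    VCF file before running VEP on it.
    """
    return [variants[i:i + number_of_variants_per_shard]
            for i in range(0, len(variants), number_of_variants_per_shard)]
```

Smaller shards mean more (but cheaper) VEP invocations, which is why very small values can slow the pipeline down.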
For other parameters, like how many VMs to use or where the VMs should be
located, the same parameters used for Dataflow pipeline execution
are reused, e.g., --num_workers. It is recommended to provide one worker per
50,000 variants. For instance, for a dataset with 5,000,000 variants, you may
set --num_workers to 100.
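The worker-sizing recommendation above is a simple rule of thumb; recommended_num_workers below is a hypothetical helper that just spells out the arithmetic.

```python
import math


def recommended_num_workers(num_variants, variants_per_worker=50000):
    # One worker per 50,000 variants, rounded up, as recommended above.
    return math.ceil(num_variants / variants_per_worker)
```

For the 5,000,000-variant example, this yields a --num_workers value of 100.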
Caveats and troubleshooting
Under annotation_output_dir, besides the output annotated VCFs, there is a
directory which contains the logs from the virtual machines on which VEP was run.
If VEP fails, this is a good place to look for causes
(in addition to the usual Variant Transforms log files).
VEP's reference check option is enabled for these runs, and the above log files
contain a report of variants that do not match the reference. If you are using
the right VEP cache, there should be no mismatches. Note that mismatched
variants are dropped and hence are not present in the final BigQuery output table.
Querying annotations directly
We imported the gnomAD genomes table using the features described in Parsing Annotation Fields.
The gnomAD sample query notebook has additional queries you can run to practice querying annotation data.
Here is another sample query to extract high impact variants in the HBB gene which encodes the beta chain of Haemoglobin:
```sql
#standardSQL
SELECT reference_name, start_position, reference_bases, ALT.alt, CSQ.*
FROM vcf_imports_external.gnomad_genomes_chr_hg19 AS T,
     T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE CSQ.SYMBOL = "HBB" AND CSQ.IMPACT = "HIGH"
ORDER BY start_position
```
and here is a snapshot from the BigQuery UI for this query:
Note that the gnomAD dataset does not have any calls or samples, but it is a fairly simple exercise to extend this query to pick samples that have these high-impact variants.
Joining with other databases
You can also join with other databases that are available in BigQuery. For instance, several pathway databases are publicly available in BigQuery. Here is a sample query to extract high-impact variants that are in the double-strand break repair pathway as defined by the GO (Gene Ontology) biological process (GO:0006302).
```sql
#standardSQL
SELECT reference_name, start_position, reference_bases, ALT.alt,
       CSQ.Consequence, CSQ.IMPACT, CSQ.SYMBOL
FROM `vcf_imports_external.gnomad_genomes_chr_hg19` AS T,
     T.alternate_bases AS ALT, ALT.CSQ AS CSQ
WHERE
  # Note: Matching based on symbol is "best effort" as the names may not be
  # standardized across sources.
  CSQ.SYMBOL IN (SELECT DB_Object_Symbol
                 FROM `isb-cgc.genome_reference.GO_Annotations`
                 WHERE GO_ID = 'GO:0006302')
  AND CSQ.IMPACT = "HIGH"
ORDER BY start_position
```