Skip to content
This repository has been archived by the owner on Aug 16, 2022. It is now read-only.

mrcepid-rap/mrcepid-annotatecadd

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

AnnotateCADD (DNAnexus Platform App)

This is the source code for an app that runs on the DNAnexus Platform. For more information about how to run or modify it, see https://documentation.dnanexus.com/.

Table of Contents

Introduction

This applet annotates vcf files generated by the mrcepid-filterbcf applet with CADD variant deleteriousness scores. This applet makes heavy use of bcftools and CADD. Please see the bcftools or CADD manual for more information on how individual commands/tools used in this applet work.

This README makes use of DNANexus file and project naming conventions. Where applicable, an object available on the DNANexus platform has a hash ID like:

  • file – file-1234567890ABCDEFGHIJKLMN
  • project – project-1234567890ABCDEFGHIJKLMN

Information about files and projects can be queried using the dx describe tool native to the DNANexus SDK:

dx describe file-1234567890ABCDEFGHIJKLMN

Note: This README pertains to data included as part of the DNANexus project "MRC - Variant Filtering" (project-G2XK5zjJXk83yZ598Z7BpGPk)

Background

CADD is one of the more reliable tools for variant interpretation. However, it has two primary limitations:

  1. Only precomputed scores are provided for all possible (3.2Gb * 3) SNVs in the genome and NOT most InDels
  2. It requires a large amount of storage space for resource files that are used as part of its annotation model.

In particular due to the system resources required, running CADD as part of variant filtering and annotation on the DNANexus platform (as part of mrcepid-filterbcf) was not feasible. This is primarily due to major discrepancies in the storage space needed by CADD (~200Gb), and the storage space needed for variant filtering (~20Gb). In brief, DNANexus links storage space to memory and CPU requirements, so to have enough storage space for CADD resource files, we would need a much more powerful (and expensive) AWS instance than necessary. Additionally, injesting CADD resources onto an AWS instance takes ~1 hour. This would add significant cost to filtering. Thus, this applet runs multiple VCFs on one instance to avoid having to waste resources on ingesting CADD annotations once per VCF file.

Dependencies

Docker

This applet uses Docker to supply dependencies to the underlying AWS instance launched by DNANexus. The Dockerfile used to build dependencies is available as part of the MRCEpid organisation at:

https://github.com/mrcepid-rap/dockerimages/blob/main/cadd.Dockerfile

This Docker image is built off of a 20.04 Ubuntu distribution with miniconda3 pre-installed available via dockerhub. This was done to remove issues with installing miniconda3 via Dockerfile which is a dependancy of CADD. This image is very light-weight and only provides basic OS installation and miniconda3. Other basic software (e.g. wget, make, and gcc) need to be installed manually. For more details on how to build a Docker image for use on the UKBiobank RAP, please see:

https://github.com/mrcepid-rap#docker-images

In brief, the primary bioinformatics software dependencies required by this Applet (and provided in the associated Docker image) are:

This list is not exhaustive and does not include dependencies of dependencies and software needed to acquire other resources (e.g. wget). See the referenced Dockerfile for more information.

Resource Files

This app also makes use of several resource files that are specific to CADD. These files were all downloaded from the CADD Downloads website. These are provided as part of the DNANexus project "MRC - Variant Filtering" (project-G2XK5zjJXk83yZ598Z7BpGPk) in the folder project_resources/ with specific directories listed here:

  • CADD resource files – /project_resources/cadd_files/
  • CADD known VEP hg38 cache - /project_resources/vep_cadd_files/

Methodology

This applet is step 3 (mrc-annotatecadd) of the rare variant testing pipeline developed by Eugene Gardner for the UKBiobank RAP at the MRC Epidemiology Unit:

This applet is fairly straightforward. Given a list of VCF files (the number of files is up to the user), this applet will annotate all variants in that VCF with CADD v1.6. The command line is straightforward:

CADD-scripts/CADD.sh -g GRCh38 -o variants.cadd.tsv.gz variants.vcf

The difficulty was in getting CADD to run properly on AWS, which is unnecessary to document here. For more specifics on this, please see the Dockerfile cited in Docker and the applet source code at src/mrcepid-annotatecadd.py

Running on DNANexus

Inputs

input description
input_vcfs List of files from mrcepid-filterbcf to annotate with CADD

input_vcfs is a file list that MUST contain DNANexus file hash keys (e.g. like file-1234567890ABCDEFGHIJ). A simple way to generate such a list is with the following bash/perl one-liner:

dx ls -l filtered_vcfs/ukb23148_c7_b*_v1_chunk*.bcf.norm* | perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "$1\n";}' > cadd_list.txt

This command will:

  1. Find all filtered vcf files on chromosome 7 and print in dna nexus "long" format which includes a column for file hash (column 7)
  2. Extract the file hash using a perl one-liner and print one file hash per line

The final input file will look something like:

file-1234567890ABCDEFGHIJ
file-2345678901ABCDEFGHIJ
file-3456789012ABCDEFGHIJ
file-4567890123ABCDEFGHIJ

This file then needs to be uploaded to the DNANexus platform, so it can be provided as input:

dx upload cadd_list.txt

Outputs

output description
output_vcfs Output VCF(s) with filtered genotypes and sites, annotated with CADD
output_vcf_idxs .csi index files for output_vcfs
output_veps tabix indexed TSV file with all variants from outputvcfs with annotations
output_vep_idxs .tbi index files for output_veps
output_per_samples .tsv file of variants per-individual to collate number of variants per person

output_vcfs is a standard vcf.gz format file with INFO fields derived from VEP annotations. These fields are identical to those in the paired .tsv.gz file provided with the output_veps output. These are the following:

INFO Field row number in .tsv.gz Field dtype (pandas) Possible Levels (If Factor) Short Description
CHROM 1 object chr1-chr22,chrX, chrY chromosome of variant
POS 2 int64 NA position of variant
REF 3 object NA reference allele (Note that this is not major/minor allele!)
ALT 4 object NA alternate allele (Note that this is not major/minor allele!)
varID 5 object NA variant ID. Simple a concatenation like CHROM:POS:REF:ALT. This ID should be used to match on per-marker association test results.
ogVarID 6 object NA original variant ID in UKBiobank VCF files. Note: This ID SHOULD NOT be used to match on association test results. It is included for posterity and to allow matching on original UK Biobank VCF files.
FILTER 7 object PASS, FAIL Did this variant pass QC?
AF 8 float64 NA Allele Frequency of the allele listed in ALT
F_MISSING 9 float64 NA Proportion of missing (./.) genotypes
AN 10 int64 NA Number of possible alleles ([sample size - n.missing] * 2)
AC 11 int64 NA Number of non-reference alleles of the allele listed in ALT
MANE 12 object NA The MANE transcript for this variant
ENST 13 object NA ENSEMBL transcript (e.g. ENST) for this variant
ENSG 14 object NA ENSEMBL gene (e.g. ENSG) for this variant
BIOTYPE 15 object protein_coding VEP biotype of the ENST.
SYMBOL 16 object NA HGNC Gene symbol
CSQ 17 object All possible values in this table. VEP annotated CSQ
gnomAD_AF 18 float64 NA Non-Finnish European allele frequency of this variant in gnomAD exomes. 0 if not in gnomAD
CADD 19 float64 NA CADDv1.6 phred score
REVEL 20 float64 NA REVEL score for this variant if CSQ is "missense", else nan
SIFT 21 object NA SIFT score for this variant if CSQ is "missense", else NA
POLYPHEN 22 object NA POLYPHEN score for this variant if CSQ is "missense", else NA
LOFTEE 23 object HC,LC LOFTEE score for this variant if PARSED_CSQ is "PTV", else NA. Variants with a "HC" value are high-confidence and should be retained for testing
AA 24 object NA The amino acid change if a missense, InDel, or PTV variant
AApos 25 object NA The position of this variant in the translated protein of this ENST (based on associated ENSP ID)
PARSED_CSQ 26 object All possible values in this table. Eugene-determined consequence. See the README for mrcepid-filterbcf for more information
MULTI 27 bool True, False Was this variant original multiallelic? Determined based on presence of at least one ";" in the ID field
INDEL 28 bool True, False Is this variant an InDel? True if len(REF) != len(ALT)
MINOR 29 object NA The minor allele for this variant. Will be the same as ALT if AF < 0.5
MAJOR 30 object NA The minor allele for this variant. Will be the same as REF if AF < 0.5
MAF 31 float64 NA The minor allele frequency for this variant. Will be the same as AF if AF < 0.5
MAC 32 int64 NA The minor allele count for this variant. Will be the same as AC if AF < 0.5

Command line example

If this is your first time running this applet within a project other than "MRC - Variant Filtering", please see our organisational documentation on how to download and build this app on the DNANexus Research Access Platform:

https://github.com/mrcepid-rap

Running this command is fairly straightforward using the DNANexus SDK toolkit. For the input vcf(s) (provided with the flag -iinput_vcfs) one can use a list of file hashes of output VCFs from mrcepid-filterbcf:

# Using file hash
dx run mrcepid-annotatecadd --priority low --destination filtered_vcfs/ -iinput_vcfs=file-A12345

Where file-A12345 is the list file described above.

Brief I/O information can also be retrieved on the command line:

dx run mrcepid-annotatecadd --help

Some notes here regarding execution:

  1. Outputs are automatically named based on the prefix of the input vcf full path (this is regardless of if you use hash or full path). So the primary VCF output for the above command-line will be ukb23156_c1_b0_v1.norm.filtered.tagged.missingness_filtered.annotated.cadd.vcf.gz. All outputs will be named using a similar convention. Note the addition of .cadd compared to the output of mrcepid-filterbcf!

  2. I have set a sensible (and tested) default for compute resources on DNANexus that is baked into the json used for building the app (at dxapp.json) so setting an instance type is unnecessary. This current default is for a mem3_ssd1_v2_x64 instance (64 CPUs, 512 Gb RAM, 2400Gb storage). Please note that this applet is set up for the parallelisation of many files. To run one file, one needs much less memory. If necessary to adjust compute resources, one can provide a flag like --instance-type mem3_ssd1_v2_x8 to dx run.

Batch Running

It is easier to implement batch running manually, rather than use built-in DNANexus batch functionality. In brief, first generate a list of all files that need to be run through the process as outlined above:

dx ls -l filtered_vcfs/ukb23148_c7_b*_v1_chunk*.norm.filtered.tagged.missingness_filtered.annotated.bcf | \ 
    perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "$1\n";}' > cadd_list.txt

Then, simply use the *NIX default split command to generate a set of individual files that can work through all the files found above on individual instances:

split -a 1 -l 31 cadd_list.txt cadd_list_

A few important notes on the above:

  1. We set the number of files per-list to 31 (using -l 31) because our instance has 64 cores and requires 2 cores per file. This means we should be able to run a total of 32 files at a time, but we need a core to be able to monitor these processes, thus why we do 31 files.
  2. We CAN set the number of files to greater than 31, but this means other files need to finish processing before others can start, meaning runtime will be longer than expected.
  3. This will create files named cadd_list_a, cadd_list_b, cadd_list_c, etc.

Then we upload to dna nexus, and generate a set of commands that will then run this applet:

dx upload bcf_list_* --destination batch_lists/
dx ls -l batch_lists/cadd_list_* | \
    perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "dx run mrcepid-annotatecadd --priority low -iinput_vcfs=$1 --destination filtered_vcfs/ --yes --brief;\n";}' | bash

Releases

No releases published

Packages

No packages published

Languages