AnnotateCADD (DNAnexus Platform App)

This is the source code for an app that runs on the DNAnexus Platform. For more information about how to run or modify it, see https://documentation.dnanexus.com/.

Introduction

This applet annotates vcf files generated by the mrcepid-filterbcf applet with CADD variant deleteriousness scores. This applet makes heavy use of bcftools and CADD. Please see the bcftools or CADD manual for more information on how individual commands/tools used in this applet work.

This README makes use of DNANexus file and project naming conventions. Where applicable, an object available on the DNANexus platform has a hash ID like:

file – file-1234567890ABCDEFGHIJKLMN
project – project-1234567890ABCDEFGHIJKLMN

Information about files and projects can be queried using the dx describe tool native to the DNANexus SDK:

dx describe file-1234567890ABCDEFGHIJKLMN

Note: This README pertains to data included as part of the DNANexus project "MRC - Variant Filtering" (project-G2XK5zjJXk83yZ598Z7BpGPk)

Background

CADD is one of the more reliable tools for variant interpretation. However, it has two primary limitations:

Only precomputed scores are provided for all possible (3.2Gb * 3) SNVs in the genome and NOT most InDels
It requires a large amount of storage space for resource files that are used as part of its annotation model.

In particular due to the system resources required, running CADD as part of variant filtering and annotation on the DNANexus platform (as part of mrcepid-filterbcf) was not feasible. This is primarily due to major discrepancies in the storage space needed by CADD (~200Gb), and the storage space needed for variant filtering (~20Gb). In brief, DNANexus links storage space to memory and CPU requirements, so to have enough storage space for CADD resource files, we would need a much more powerful (and expensive) AWS instance than necessary. Additionally, injesting CADD resources onto an AWS instance takes ~1 hour. This would add significant cost to filtering. Thus, this applet runs multiple VCFs on one instance to avoid having to waste resources on ingesting CADD annotations once per VCF file.

Dependencies

Docker

This applet uses Docker to supply dependencies to the underlying AWS instance launched by DNANexus. The Dockerfile used to build dependencies is available as part of the MRCEpid organisation at:

https://github.com/mrcepid-rap/dockerimages/blob/main/cadd.Dockerfile

This Docker image is built off of a 20.04 Ubuntu distribution with miniconda3 pre-installed available via dockerhub. This was done to remove issues with installing miniconda3 via Dockerfile which is a dependancy of CADD. This image is very light-weight and only provides basic OS installation and miniconda3. Other basic software (e.g. wget, make, and gcc) need to be installed manually. For more details on how to build a Docker image for use on the UKBiobank RAP, please see:

https://github.com/mrcepid-rap#docker-images

In brief, the primary bioinformatics software dependencies required by this Applet (and provided in the associated Docker image) are:

htslib and samtools
bcftools
cadd
- For more details on how to install CADD, please see the CADD github repo

This list is not exhaustive and does not include dependencies of dependencies and software needed to acquire other resources (e.g. wget). See the referenced Dockerfile for more information.

Resource Files

This app also makes use of several resource files that are specific to CADD. These files were all downloaded from the CADD Downloads website. These are provided as part of the DNANexus project "MRC - Variant Filtering" (project-G2XK5zjJXk83yZ598Z7BpGPk) in the folder project_resources/ with specific directories listed here:

CADD resource files – /project_resources/cadd_files/
CADD known VEP hg38 cache - /project_resources/vep_cadd_files/

Methodology

This applet is step 3 (mrc-annotatecadd) of the rare variant testing pipeline developed by Eugene Gardner for the UKBiobank RAP at the MRC Epidemiology Unit:

This applet is fairly straightforward. Given a list of VCF files (the number of files is up to the user), this applet will annotate all variants in that VCF with CADD v1.6. The command line is straightforward:

CADD-scripts/CADD.sh -g GRCh38 -o variants.cadd.tsv.gz variants.vcf

The difficulty was in getting CADD to run properly on AWS, which is unnecessary to document here. For more specifics on this, please see the Dockerfile cited in Docker and the applet source code at src/mrcepid-annotatecadd.py

Running on DNANexus

Inputs

input	description
input_vcfs	List of files from mrcepid-filterbcf to annotate with CADD

input_vcfs is a file list that MUST contain DNANexus file hash keys (e.g. like file-1234567890ABCDEFGHIJ). A simple way to generate such a list is with the following bash/perl one-liner:

dx ls -l filtered_vcfs/ukb23148_c7_b*_v1_chunk*.bcf.norm* | perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "$1\n";}' > cadd_list.txt

This command will:

Find all filtered vcf files on chromosome 7 and print in dna nexus "long" format which includes a column for file hash (column 7)
Extract the file hash using a perl one-liner and print one file hash per line

The final input file will look something like:

file-1234567890ABCDEFGHIJ
file-2345678901ABCDEFGHIJ
file-3456789012ABCDEFGHIJ
file-4567890123ABCDEFGHIJ

This file then needs to be uploaded to the DNANexus platform, so it can be provided as input:

dx upload cadd_list.txt

Outputs

output	description
output_vcfs	Output VCF(s) with filtered genotypes and sites, annotated with CADD
output_vcf_idxs	.csi index files for `output_vcfs`
output_veps	tabix indexed TSV file with all variants from `outputvcfs` with annotations
output_vep_idxs	.tbi index files for `output_veps`
output_per_samples	.tsv file of variants per-individual to collate number of variants per person

output_vcfs is a standard vcf.gz format file with INFO fields derived from VEP annotations. These fields are identical to those in the paired .tsv.gz file provided with the output_veps output. These are the following:

INFO Field	row number in .tsv.gz	Field dtype (pandas)	Possible Levels (If Factor)	Short Description
CHROM	1	object	chr1-chr22,chrX, chrY	chromosome of variant
POS	2	int64	NA	position of variant
REF	3	object	NA	reference allele (Note that this is not major/minor allele!)
ALT	4	object	NA	alternate allele (Note that this is not major/minor allele!)
varID	5	object	NA	variant ID. Simple a concatenation like CHROM:POS:REF:ALT. This ID should be used to match on per-marker association test results.
ogVarID	6	object	NA	original variant ID in UKBiobank VCF files. Note: This ID SHOULD NOT be used to match on association test results. It is included for posterity and to allow matching on original UK Biobank VCF files.
FILTER	7	object	PASS, FAIL	Did this variant pass QC?
AF	8	float64	NA	Allele Frequency of the allele listed in `ALT`
F_MISSING	9	float64	NA	Proportion of missing (./.) genotypes
AN	10	int64	NA	Number of possible alleles ([sample size - n.missing] * 2)
AC	11	int64	NA	Number of non-reference alleles of the allele listed in `ALT`
MANE	12	object	NA	The MANE transcript for this variant
ENST	13	object	NA	ENSEMBL transcript (e.g. ENST) for this variant
ENSG	14	object	NA	ENSEMBL gene (e.g. ENSG) for this variant
BIOTYPE	15	object	protein_coding	VEP biotype of the ENST.
SYMBOL	16	object	NA	HGNC Gene symbol
CSQ	17	object	All possible values in this table.	VEP annotated CSQ
gnomAD_AF	18	float64	NA	Non-Finnish European allele frequency of this variant in gnomAD exomes. 0 if not in gnomAD
CADD	19	float64	NA	CADDv1.6 phred score
REVEL	20	float64	NA	REVEL score for this variant if CSQ is "missense", else nan
SIFT	21	object	NA	SIFT score for this variant if CSQ is "missense", else NA
POLYPHEN	22	object	NA	POLYPHEN score for this variant if CSQ is "missense", else NA
LOFTEE	23	object	HC,LC	LOFTEE score for this variant if PARSED_CSQ is "PTV", else NA. Variants with a "HC" value are high-confidence and should be retained for testing
AA	24	object	NA	The amino acid change if a missense, InDel, or PTV variant
AApos	25	object	NA	The position of this variant in the translated protein of this ENST (based on associated ENSP ID)
PARSED_CSQ	26	object	All possible values in this table.	Eugene-determined consequence. See the README for mrcepid-filterbcf for more information
MULTI	27	bool	True, False	Was this variant original multiallelic? Determined based on presence of at least one ";" in the ID field
INDEL	28	bool	True, False	Is this variant an InDel? True if len(REF) != len(ALT)
MINOR	29	object	NA	The minor allele for this variant. Will be the same as ALT if AF < 0.5
MAJOR	30	object	NA	The minor allele for this variant. Will be the same as REF if AF < 0.5
MAF	31	float64	NA	The minor allele frequency for this variant. Will be the same as AF if AF < 0.5
MAC	32	int64	NA	The minor allele count for this variant. Will be the same as AC if AF < 0.5

Command line example

If this is your first time running this applet within a project other than "MRC - Variant Filtering", please see our organisational documentation on how to download and build this app on the DNANexus Research Access Platform:

https://github.com/mrcepid-rap

Running this command is fairly straightforward using the DNANexus SDK toolkit. For the input vcf(s) (provided with the flag -iinput_vcfs) one can use a list of file hashes of output VCFs from mrcepid-filterbcf:

# Using file hash
dx run mrcepid-annotatecadd --priority low --destination filtered_vcfs/ -iinput_vcfs=file-A12345

Where file-A12345 is the list file described above.

Brief I/O information can also be retrieved on the command line:

dx run mrcepid-annotatecadd --help

Some notes here regarding execution:

Outputs are automatically named based on the prefix of the input vcf full path (this is regardless of if you use hash or full path). So the primary VCF output for the above command-line will be ukb23156_c1_b0_v1.norm.filtered.tagged.missingness_filtered.annotated.cadd.vcf.gz. All outputs will be named using a similar convention. Note the addition of .cadd compared to the output of mrcepid-filterbcf!
I have set a sensible (and tested) default for compute resources on DNANexus that is baked into the json used for building the app (at dxapp.json) so setting an instance type is unnecessary. This current default is for a mem3_ssd1_v2_x64 instance (64 CPUs, 512 Gb RAM, 2400Gb storage). Please note that this applet is set up for the parallelisation of many files. To run one file, one needs much less memory. If necessary to adjust compute resources, one can provide a flag like --instance-type mem3_ssd1_v2_x8 to dx run.

Batch Running

It is easier to implement batch running manually, rather than use built-in DNANexus batch functionality. In brief, first generate a list of all files that need to be run through the process as outlined above:

dx ls -l filtered_vcfs/ukb23148_c7_b*_v1_chunk*.norm.filtered.tagged.missingness_filtered.annotated.bcf | \ 
    perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "$1\n";}' > cadd_list.txt

Then, simply use the *NIX default split command to generate a set of individual files that can work through all the files found above on individual instances:

split -a 1 -l 31 cadd_list.txt cadd_list_

A few important notes on the above:

We set the number of files per-list to 31 (using -l 31) because our instance has 64 cores and requires 2 cores per file. This means we should be able to run a total of 32 files at a time, but we need a core to be able to monitor these processes, thus why we do 31 files.
We CAN set the number of files to greater than 31, but this means other files need to finish processing before others can start, meaning runtime will be longer than expected.
This will create files named cadd_list_a, cadd_list_b, cadd_list_c, etc.

Then we upload to dna nexus, and generate a set of commands that will then run this applet:

dx upload bcf_list_* --destination batch_lists/
dx ls -l batch_lists/cadd_list_* | \
    perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "dx run mrcepid-annotatecadd --priority low -iinput_vcfs=$1 --destination filtered_vcfs/ --yes --brief;\n";}' | bash

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
src		src
test		test
LICENCE		LICENCE
Readme.developer.md		Readme.developer.md
Readme.md		Readme.md
dxapp.json		dxapp.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AnnotateCADD (DNAnexus Platform App)

Table of Contents

Introduction

Background

Dependencies

Docker

Resource Files

Methodology

Running on DNANexus

Inputs

Outputs

Command line example

Batch Running

About

Releases

Packages

Languages

License

mrcepid-rap/mrcepid-annotatecadd

Folders and files

Latest commit

History

Repository files navigation

AnnotateCADD (DNAnexus Platform App)

Table of Contents

Introduction

Background

Dependencies

Docker

Resource Files

Methodology

Running on DNANexus

Inputs

Outputs

Command line example

Batch Running

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages