This is the source code for an app that runs on the DNAnexus Platform. For more information about how to run or modify it, see https://documentation.dnanexus.com/.
This applet annotates vcf files generated by the mrcepid-filterbcf applet with CADD variant deleteriousness scores. This applet makes heavy use of bcftools and CADD. Please see the bcftools or CADD manual for more information on how individual commands/tools used in this applet work.
This README makes use of DNANexus file and project naming conventions. Where applicable, an object available on the DNANexus platform has a hash ID like:
- file –
file-1234567890ABCDEFGHIJKLMN
- project –
project-1234567890ABCDEFGHIJKLMN
Information about files and projects can be queried using the dx describe
tool native to the DNANexus SDK:
dx describe file-1234567890ABCDEFGHIJKLMN
Note: This README pertains to data included as part of the DNANexus project "MRC - Variant Filtering" (project-G2XK5zjJXk83yZ598Z7BpGPk)
CADD is one of the more reliable tools for variant interpretation. However, it has two primary limitations:
- Only precomputed scores are provided for all possible (3.2Gb * 3) SNVs in the genome and NOT most InDels
- It requires a large amount of storage space for resource files that are used as part of its annotation model.
In particular due to the system resources required, running CADD as part of variant filtering and annotation on the DNANexus platform (as part of mrcepid-filterbcf) was not feasible. This is primarily due to major discrepancies in the storage space needed by CADD (~200Gb), and the storage space needed for variant filtering (~20Gb). In brief, DNANexus links storage space to memory and CPU requirements, so to have enough storage space for CADD resource files, we would need a much more powerful (and expensive) AWS instance than necessary. Additionally, injesting CADD resources onto an AWS instance takes ~1 hour. This would add significant cost to filtering. Thus, this applet runs multiple VCFs on one instance to avoid having to waste resources on ingesting CADD annotations once per VCF file.
This applet uses Docker to supply dependencies to the underlying AWS instance launched by DNANexus. The Dockerfile used to build dependencies is available as part of the MRCEpid organisation at:
https://github.com/mrcepid-rap/dockerimages/blob/main/cadd.Dockerfile
This Docker image is built off of a 20.04 Ubuntu distribution with miniconda3 pre-installed available via dockerhub. This was done to remove issues with installing miniconda3 via Dockerfile which is a dependancy of CADD. This image is very light-weight and only provides basic OS installation and miniconda3. Other basic software (e.g. wget, make, and gcc) need to be installed manually. For more details on how to build a Docker image for use on the UKBiobank RAP, please see:
https://github.com/mrcepid-rap#docker-images
In brief, the primary bioinformatics software dependencies required by this Applet (and provided in the associated Docker image) are:
- htslib and samtools
- bcftools
- cadd
- For more details on how to install CADD, please see the CADD github repo
This list is not exhaustive and does not include dependencies of dependencies and software needed to acquire other resources (e.g. wget). See the referenced Dockerfile for more information.
This app also makes use of several resource files that are specific to CADD. These files were all downloaded from the
CADD Downloads website. These are provided as part of the DNANexus
project "MRC - Variant Filtering" (project-G2XK5zjJXk83yZ598Z7BpGPk) in the folder project_resources/
with specific
directories listed here:
- CADD resource files –
/project_resources/cadd_files/
- CADD known VEP hg38 cache -
/project_resources/vep_cadd_files/
This applet is step 3 (mrc-annotatecadd) of the rare variant testing pipeline developed by Eugene Gardner for the UKBiobank RAP at the MRC Epidemiology Unit:
This applet is fairly straightforward. Given a list of VCF files (the number of files is up to the user), this applet will annotate all variants in that VCF with CADD v1.6. The command line is straightforward:
CADD-scripts/CADD.sh -g GRCh38 -o variants.cadd.tsv.gz variants.vcf
The difficulty was in getting CADD to run properly on AWS, which is unnecessary to document here. For more specifics on
this, please see the Dockerfile cited in Docker and the applet source code at src/mrcepid-annotatecadd.py
input | description |
---|---|
input_vcfs | List of files from mrcepid-filterbcf to annotate with CADD |
input_vcfs
is a file list that MUST contain DNANexus file hash keys (e.g. like file-1234567890ABCDEFGHIJ). A simple
way to generate such a list is with the following bash/perl one-liner:
dx ls -l filtered_vcfs/ukb23148_c7_b*_v1_chunk*.bcf.norm* | perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "$1\n";}' > cadd_list.txt
This command will:
- Find all filtered vcf files on chromosome 7 and print in dna nexus "long" format which includes a column for file hash (column 7)
- Extract the file hash using a perl one-liner and print one file hash per line
The final input file will look something like:
file-1234567890ABCDEFGHIJ
file-2345678901ABCDEFGHIJ
file-3456789012ABCDEFGHIJ
file-4567890123ABCDEFGHIJ
This file then needs to be uploaded to the DNANexus platform, so it can be provided as input:
dx upload cadd_list.txt
output | description |
---|---|
output_vcfs | Output VCF(s) with filtered genotypes and sites, annotated with CADD |
output_vcf_idxs | .csi index files for output_vcfs |
output_veps | tabix indexed TSV file with all variants from outputvcfs with annotations |
output_vep_idxs | .tbi index files for output_veps |
output_per_samples | .tsv file of variants per-individual to collate number of variants per person |
output_vcfs
is a standard vcf.gz format file with INFO fields derived from VEP annotations. These fields are identical
to those in the paired .tsv.gz file provided with the output_veps
output. These are the following:
INFO Field | row number in .tsv.gz | Field dtype (pandas) | Possible Levels (If Factor) | Short Description |
---|---|---|---|---|
CHROM | 1 | object | chr1-chr22,chrX, chrY | chromosome of variant |
POS | 2 | int64 | NA | position of variant |
REF | 3 | object | NA | reference allele (Note that this is not major/minor allele!) |
ALT | 4 | object | NA | alternate allele (Note that this is not major/minor allele!) |
varID | 5 | object | NA | variant ID. Simple a concatenation like CHROM:POS:REF:ALT. This ID should be used to match on per-marker association test results. |
ogVarID | 6 | object | NA | original variant ID in UKBiobank VCF files. Note: This ID SHOULD NOT be used to match on association test results. It is included for posterity and to allow matching on original UK Biobank VCF files. |
FILTER | 7 | object | PASS, FAIL | Did this variant pass QC? |
AF | 8 | float64 | NA | Allele Frequency of the allele listed in ALT |
F_MISSING | 9 | float64 | NA | Proportion of missing (./.) genotypes |
AN | 10 | int64 | NA | Number of possible alleles ([sample size - n.missing] * 2) |
AC | 11 | int64 | NA | Number of non-reference alleles of the allele listed in ALT |
MANE | 12 | object | NA | The MANE transcript for this variant |
ENST | 13 | object | NA | ENSEMBL transcript (e.g. ENST) for this variant |
ENSG | 14 | object | NA | ENSEMBL gene (e.g. ENSG) for this variant |
BIOTYPE | 15 | object | protein_coding | VEP biotype of the ENST. |
SYMBOL | 16 | object | NA | HGNC Gene symbol |
CSQ | 17 | object | All possible values in this table. | VEP annotated CSQ |
gnomAD_AF | 18 | float64 | NA | Non-Finnish European allele frequency of this variant in gnomAD exomes. 0 if not in gnomAD |
CADD | 19 | float64 | NA | CADDv1.6 phred score |
REVEL | 20 | float64 | NA | REVEL score for this variant if CSQ is "missense", else nan |
SIFT | 21 | object | NA | SIFT score for this variant if CSQ is "missense", else NA |
POLYPHEN | 22 | object | NA | POLYPHEN score for this variant if CSQ is "missense", else NA |
LOFTEE | 23 | object | HC,LC | LOFTEE score for this variant if PARSED_CSQ is "PTV", else NA. Variants with a "HC" value are high-confidence and should be retained for testing |
AA | 24 | object | NA | The amino acid change if a missense, InDel, or PTV variant |
AApos | 25 | object | NA | The position of this variant in the translated protein of this ENST (based on associated ENSP ID) |
PARSED_CSQ | 26 | object | All possible values in this table. | Eugene-determined consequence. See the README for mrcepid-filterbcf for more information |
MULTI | 27 | bool | True, False | Was this variant original multiallelic? Determined based on presence of at least one ";" in the ID field |
INDEL | 28 | bool | True, False | Is this variant an InDel? True if len(REF) != len(ALT) |
MINOR | 29 | object | NA | The minor allele for this variant. Will be the same as ALT if AF < 0.5 |
MAJOR | 30 | object | NA | The minor allele for this variant. Will be the same as REF if AF < 0.5 |
MAF | 31 | float64 | NA | The minor allele frequency for this variant. Will be the same as AF if AF < 0.5 |
MAC | 32 | int64 | NA | The minor allele count for this variant. Will be the same as AC if AF < 0.5 |
If this is your first time running this applet within a project other than "MRC - Variant Filtering", please see our organisational documentation on how to download and build this app on the DNANexus Research Access Platform:
https://github.com/mrcepid-rap
Running this command is fairly straightforward using the DNANexus SDK toolkit. For the input vcf(s) (provided with the flag
-iinput_vcfs
) one can use a list of file hashes of output VCFs from mrcepid-filterbcf
:
# Using file hash
dx run mrcepid-annotatecadd --priority low --destination filtered_vcfs/ -iinput_vcfs=file-A12345
Where file-A12345
is the list file described above.
Brief I/O information can also be retrieved on the command line:
dx run mrcepid-annotatecadd --help
Some notes here regarding execution:
-
Outputs are automatically named based on the prefix of the input vcf full path (this is regardless of if you use hash or full path). So the primary VCF output for the above command-line will be
ukb23156_c1_b0_v1.norm.filtered.tagged.missingness_filtered.annotated.cadd.vcf.gz
. All outputs will be named using a similar convention. Note the addition of.cadd
compared to the output of mrcepid-filterbcf! -
I have set a sensible (and tested) default for compute resources on DNANexus that is baked into the json used for building the app (at
dxapp.json
) so setting an instance type is unnecessary. This current default is for a mem3_ssd1_v2_x64 instance (64 CPUs, 512 Gb RAM, 2400Gb storage). Please note that this applet is set up for the parallelisation of many files. To run one file, one needs much less memory. If necessary to adjust compute resources, one can provide a flag like--instance-type mem3_ssd1_v2_x8
todx run
.
It is easier to implement batch running manually, rather than use built-in DNANexus batch functionality. In brief, first generate a list of all files that need to be run through the process as outlined above:
dx ls -l filtered_vcfs/ukb23148_c7_b*_v1_chunk*.norm.filtered.tagged.missingness_filtered.annotated.bcf | \
perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "$1\n";}' > cadd_list.txt
Then, simply use the *NIX default split command to generate a set of individual files that can work through all the files found above on individual instances:
split -a 1 -l 31 cadd_list.txt cadd_list_
A few important notes on the above:
- We set the number of files per-list to 31 (using
-l 31
) because our instance has 64 cores and requires 2 cores per file. This means we should be able to run a total of 32 files at a time, but we need a core to be able to monitor these processes, thus why we do 31 files. - We CAN set the number of files to greater than 31, but this means other files need to finish processing before others can start, meaning runtime will be longer than expected.
- This will create files named cadd_list_a, cadd_list_b, cadd_list_c, etc.
Then we upload to dna nexus, and generate a set of commands that will then run this applet:
dx upload bcf_list_* --destination batch_lists/
dx ls -l batch_lists/cadd_list_* | \
perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {print "dx run mrcepid-annotatecadd --priority low -iinput_vcfs=$1 --destination filtered_vcfs/ --yes --brief;\n";}' | bash