This is the source code for an app that runs on the DNAnexus Platform. For more information about how to run or modify it, see https://documentation.dnanexus.com/.
This applet splits raw vcf.gz files provided by the UKBiobank platform into more manageable chunks for downstream processing.
This README makes use of DNAnexus file and project naming conventions. Where applicable, an object available on the DNAnexus platform has a hash ID like:

- file – `file-1234567890ABCDEFGHIJKLMN`
- project – `project-1234567890ABCDEFGHIJKLMN`

Information about files and projects can be queried using the `dx describe` tool native to the DNAnexus SDK:

```
dx describe file-1234567890ABCDEFGHIJKLMN
```
Note: This README pertains to data included as part of the DNAnexus project "MRC - Variant Filtering" (project-G2XK5zjJXk83yZ598Z7BpGPk).
- v2.0.1
  - Fixed a bug where skipped sites were duplicated if processing multiple VCFs in one job.
- v2.0.0
  - Added support for WGS data. This comes with several under-the-hood changes to address issues with the larger WGS data. These changes will not affect processing of WES data, but please see below for specific changes:
    - Modified the standard number of CPUs per VCF to 8. We have done extensive testing to confirm that runtime scales approximately inversely with the number of CPUs allocated – e.g., 2 CPUs ≈ 60m / VCF, 4 CPUs ≈ 30m / VCF.
    - Fixed an issue with sites with a large number of alternate alleles:
      - This issue can lead to excessive memory requirements when processing some VCFs.
      - We have added a method to filter sites with excessive alternate alleles, which can be triggered using the `alt_allele_threshold` parameter.
      - This parameter is set to 99999 by default, so it will not affect WES processing.
      - We recommend a setting of 15 when analysing WGS data.
      - All sites that are filtered will be reported in `skipped_sites.{output_name}.tsv`.
  - Several quality-of-life changes to the codebase to improve readability and maintainability:
    - Implemented a newer version of `general_utilities`
    - Added docstrings to all functions
    - Implemented tests – please see `Readme.developer.md` for more information.
  - Fixed a bug where star alleles could result in an unexpected number of variants in a split bcf file.
- v1.0.0
  - Initial numbered release; see git logs for previous changes.
UKBiobank provides individual .vcf.gz files. This is the same number of files as for the 200k release, despite the sample size having increased by ~2.5x. As such, pipelines designed to work on the 200k data no longer run as efficiently; we wrote this tool to split the provided vcf.gz files into smaller chunks. Default file locations on the RAP are as follows:
- WES: `/Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/`
- WGS: `/Bulk/DRAGEN WGS/Whole genome variant call files (GVCFs) (DRAGEN) [500k release]/`
This applet uses Docker to supply dependencies to the underlying AWS instance launched by DNAnexus. The Dockerfile used to build dependencies is available as part of the MRCEpid organisation at:
https://github.com/mrcepid-rap/dockerimages/blob/main/burdentesting.Dockerfile
This Docker image is built off of the primary 20.04 Ubuntu distribution available via dockerhub. That base image is very lightweight and only provides a basic OS installation, so other basic software (e.g. wget, make, and gcc) needs to be installed manually. For more details on how to build a Docker image for use on the UKBiobank RAP, please see:
https://github.com/mrcepid-rap#docker-images
or
https://github.com/mrcepid-rap/dockerimages
In brief, the primary bioinformatics software dependencies required by this applet are provided in the associated Docker image. This list is not exhaustive and does not include dependencies of dependencies or software needed to acquire other resources (e.g. wget). See the referenced Dockerfile for more information.
This applet does not have any external dependencies.
This applet is step 1 (mrcepid-bcfsplitter) of the rare variant testing pipeline developed by Eugene Gardner for the UKBiobank RAP.
This applet has four major steps:

1. Generate a list of all variants by chromosome/coordinate in a single vcf file.
2. Normalise and left-correct VCFs, removing sites with excessive alternate alleles if requested (a sketch of this step is shown after this list).
3. Split this list into smaller chunks, by default ~5000 variants each (see the `chunk_size` parameter).
4. Extract the variants for each of these chunks into a separate file.
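The normalisation in step 2 is performed with bcftools, which is provided in the Docker image. Below is a minimal sketch of what this call could look like, assuming bcftools is on PATH; `-m -any` (split multiallelics) is an assumption based on the split-multiallelic behaviour described later in this README, and the applet's exact flags may differ:

```python
# Hedged sketch of the normalisation step; not the applet's exact command.
import subprocess

def normalise_vcf(vcf_path: str, reference_fasta: str, out_path: str) -> None:
    """Left-align / normalise variants and split multiallelic records."""
    subprocess.run(
        ['bcftools', 'norm',
         '--fasta-ref', reference_fasta,  # reference the VCFs are aligned to
         '-m', '-any',                    # split multiallelic sites into one record per ALT
         '--output-type', 'b',            # write compressed BCF
         '--output', out_path,
         vcf_path],
        check=True)
```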
In brief, each vcf file is split into roughly 5000-line chunks by default. However, a split bcf file cannot have fewer than `chunk_size / 2` sites or more than `chunk_size + (chunk_size / 2)` sites. In practice, and for WES data, this results in most original vcf.gz files being split into 4-5 smaller .bcfs.
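To make these bounds concrete, here is a minimal Python sketch of the chunking rule; this is an illustration rather than the applet's actual implementation, and `sites` is assumed to be the per-VCF variant list from step 1:

```python
# Illustrative sketch of the chunk-size bounds described above.
def split_into_chunks(sites: list, chunk_size: int = 5000) -> list:
    # Cut the site list into consecutive chunks of at most chunk_size sites.
    chunks = [sites[i:i + chunk_size] for i in range(0, len(sites), chunk_size)]
    # If the final chunk has fewer than chunk_size / 2 sites, fold it into the
    # previous chunk. The merged chunk then holds at most
    # chunk_size + (chunk_size / 2) sites, matching the bounds above.
    if len(chunks) > 1 and len(chunks[-1]) < chunk_size / 2:
        chunks[-2].extend(chunks.pop())
    return chunks
```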
**BIG Note:** There is a fix implemented in this pipeline that solves a previous bug that would duplicate a site if both of the following were true:

- The site has an identical position value to another site in the vcf (i.e. split multiallelics)
- The site and its position duplicate SPAN the junction of a bcf split
  - e.g., if variant 1 of a multiallelic pair is variant no. 5000 and variant 2 is variant no. 5001

This bug resulted in both variants of the pair being present in both split bcf files. In theory, this could also happen to a 3-variant multiallelic, but I have not encountered such a situation. As noted, I have since fixed this issue in the resulting VCFs using a custom script. I am leaving this documentation here for posterity and as a warning in case this error is ever encountered again.
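For illustration only, one way such duplicates could be avoided at split time is to advance a proposed chunk boundary past any run of records sharing a position, so a split multiallelic never spans two chunks. This is a hypothetical sketch, not the custom script referenced above; `positions` is assumed to be the ordered list of POS values for one chromosome:

```python
# Hypothetical boundary adjustment; not the pipeline's actual fix.
def safe_boundary(positions: list, boundary: int) -> int:
    # Move the boundary forward while the first record of the next chunk
    # shares a position with the last record of the current chunk.
    while 0 < boundary < len(positions) and positions[boundary] == positions[boundary - 1]:
        boundary += 1
    return boundary

# Example: if records 4999 and 5000 (0-based) share a POS, a boundary proposed
# at 5000 moves to 5001 and both records stay in the same chunk.
```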
| input | Optional | Default | description |
|---|---|---|---|
| input_vcfs | False | N/A | List of raw input vcf.gz file(s) from DNAnexus |
| chunk_size | True | 5000 | The number of variants to include per output BCF produced by this applet. |
| output_name | True | None | Additional string to use when creating information files (run info, skipped_sites). This does NOT modify the name of split VCF files. |
| alt_allele_threshold | True | 99999 | The maximum number of alternate alleles in a single variant before filtering a site. Uses '>' (greater than) to determine the threshold. |
| human_reference | True | project-Fx2x0fQJ06KfqV7Y3fFZq1jp:file-Fx2x270Jx0j17zkb3kbBf6q2 | dxfile / path pointing to the reference genome the provided VCFs are aligned to |
| human_reference_index | True | project-Fx2x0fQJ06KfqV7Y3fFZq1jp:file-Fx2x21QJ06f47gV73kZPjkQQ | dxfile / path pointing to the reference genome index (.fai) the provided VCFs are aligned to |
The format of the input_vcfs file is as follows:

```
file-1234567890ABCDEFGHIJ
file-ABCDEFGHIJ1234567890
file-0987654321JIHGFEDCBA
file-JIHGFEDCBA0987654321
```

Where `file-1234567890ABCDEFGHIJ` is the DNAnexus file ID for an unprocessed vcf file.
Please find an example (for WES data) of generating the input list(s) for the `input_vcfs` parameter below:

```
# 1. Get a list of all VCFs:
dx ls -l 'Bulk/Exome sequences/Population level exome OQFE variants, pVCF format - final release/ukb23157*.vcf.gz' | perl -ane 'chomp $_; if ($F[6] =~ /^\((\S+)\)$/) {$id = $1; $F[5] =~ s/.vcf.gz//; print "$F[5]\t$id\n";}' > bcf_list.txt

# 2. Split into manageable per-job lists:
split -l 18 -d bcf_list.txt splitjob_

# 3. Upload to DNAnexus:
dx upload splitjob_* --destination batch_files/
```

Note: `-l 18` is used above to ensure that all CPUs on the default instance type are used. Please see VM Resources for more information on why this value was used as an example.
The `alt_allele_threshold` parameter is used to filter out sites with excessive alternate alleles. When using this tool with WGS data, we suggest a value of 15. We have optimised this parameter to allow the applet to function on a `mem1_ssd1` instance type (the `c5d` family on AWS). These low-memory machines are not typically in high demand on AWS, so one can often use low priority on DNAnexus to save costs.
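For reference, a filter on this threshold could look like the following pysam-based sketch; pysam is an assumption here, and the applet's internal implementation may differ:

```python
# Hedged sketch of filtering sites on alt_allele_threshold; illustration only.
import pysam

def filter_excess_alts(vcf_path: str, out_path: str,
                       alt_allele_threshold: int = 15) -> list:
    """Drop sites with > alt_allele_threshold ALT alleles; return skipped sites."""
    skipped = []
    with pysam.VariantFile(vcf_path) as vcf_in, \
            pysam.VariantFile(out_path, 'wb', header=vcf_in.header) as vcf_out:
        for record in vcf_in:
            # Uses '>' (greater than), matching the parameter description above.
            if len(record.alts or ()) > alt_allele_threshold:
                skipped.append((record.chrom, record.pos, record.alts))
            else:
                vcf_out.write(record)
    return skipped
```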
| output | description |
|---|---|
| output_vcfs | All .bcf chunks from the resulting split of all files listed in input_vcfs |
| run_info | Summary statistics for each VCF file split as part of this process. Will be named like vcf_info.{output_name}.tsv |
| skipped_sites | A .tsv file containing sites with > alt_allele_threshold alternate alleles. Will be named like skipped_sites.{output_name}.tsv |
output_vcfs is named based on the original file name of the vcf in input_vcfs, with an additional 'chunk' identifier, like `ukb23157_c1_b0_v1_chunkN.bcf`.
run_info contains the following columns:

- `vcf`: The original vcf file
- `dxid`: The dxid of the original vcf file
- `n_sites`: Number of sites found in the original file
- `n_final_sites`: Number of sites after filtering for alt_allele_threshold
- `norm_n_alts`: Number of alternate alleles found in the normalised file – this SHOULD equal n_norm_sites
- `vcf_size`: Size of the original vcf file in bytes
skipped_sites contains the following columns:

- `vcf`: The original vcf file
- `chrom`: Chromosome of the site
- `pos`: Position of the site
- `alts`: Alternate alleles with MAF ~> 0.1% as a comma-separated list
- `acs`: Allele counts for each alternate allele listed in 'alts' as a comma-separated list
If this is your first time running this applet within a project other than "MRC - Variant Filtering", please see our organisational documentation on how to download and build this app on the DNAnexus Research Access Platform:
https://github.com/mrcepid-rap
For the input vcf list (provided with the flag `-iinput_vcfs`) one can use a file hash from one of the above files:
```
# 1. WES example:
dx run mrcepid-bcfsplitter --priority low --destination filtered_vcfs/ \
    -iinput_vcfs=file-1234567890ABCDE -ioutput_name=list1

# 2. WGS example:
dx run mrcepid-bcfsplitter --priority low --destination filtered_vcfs/ \
    -iinput_vcfs=file-1234567890ABCDE -ioutput_name=list1 -ialt_allele_threshold=15 -ichunk_size=2500
```
Brief I/O information can also be retrieved on the command line:

```
dx run mrcepid-bcfsplitter --help
```
Sensible (and tested) defaults for compute resources on DNAnexus are baked into the json used for building the app (at `dxapp.json`). The current default is a mem1_ssd1_v2_x72 instance (72 CPUs, 144 GB RAM). This machine is typically not in high demand on AWS and should allow jobs to run on low priority. If this machine does appear to be in high demand, switch to another instance type in the c5d family of machines (`mem1_ssd1` on DNAnexus).
Each VCF requires 8 CPUs to run. Thus, when determining the number of VCFs to run per job, take the number of CPUs available on the machine and divide by 8. For example, since the default machine has 72 CPUs, one should run a multiple of 9 VCFs per job (e.g., 9, 18, 27, 36, ...). Note that runtime is roughly linear: a job with 18 VCFs should take roughly double the time of a job with 9 VCFs. This is important to consider when using low priority, as the longer a job runs, the more likely it is to be interrupted. Using the default instance type, we have had good results on low priority with 18 VCFs / job. This strikes a reasonable balance between the number of jobs to monitor (DNAnexus is limited to 100 jobs / user) and the likelihood of using low priority.
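To make the sizing arithmetic explicit, an illustration using the default instance type:

```python
# Sizing rule from above: VCFs per job should be a multiple of
# (CPUs on the instance) / (8 CPUs per VCF).
cpus_on_instance = 72   # mem1_ssd1_v2_x72
cpus_per_vcf = 8        # fixed per-VCF CPU allocation
concurrent_vcfs = cpus_on_instance // cpus_per_vcf   # 9 VCFs run at once
# Run 9, 18, 27, ... VCFs per job; 18 VCFs take roughly twice as long as 9
# because runtime is roughly linear in the number of VCF batches.
```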
Note: If running more than 18 files / job, you may need to adjust the `timeoutPolicy` parameter in `dxapp.json` when building this applet.
If you need to adjust compute resources, provide a flag like `--instance-type mem1_ssd1_v2_x36`.
Please see `Readme.developer.md` for information on setting up and running tests.