DNA Assembly Benchmark for Nanopore long reads
A system for benchmarking DNA assembly tools, based on 3rd generation sequencers.
git clone https://github.com/kkrizanovic/NanoMark.git cd NanoMark src/nanomark.py setup
This will download and compile all tools necessary to run a benchmark.
- Cgmemtime to measure runtime andmaximum memory consumption
- Quast to estimate assembly quality
- Everyhting needed to run each denovo assembly tool for whom a wrapper is defined in folder: ¨¨¨NanoMark/wrappers```
All tools will be downloaded to folder
Running a benchmark
Run benchmark with a following command: NanoMark.py --benchmark <reads_file> <reference_file>
Benchmark results will be stored in folder: /intermediate Each benchmark will have its own folder with a randomly generated name, e.g. benchmark_129eed95-377d-4e8e-a5bf-309631e3df3d. Inside will be a folder for each assembler and a folder for Quast results (containing quast results for each assembler).
Summary Quast and cgmemtime results for each assembler are stored in a benchmark folder in a .tsv file benchmark_summary.tsv. The file contains the following values for each assembler:
- Name : Assembler name
Obtained from cgmemtime:
- Real time : real execution time
- CPU time : CPU execution time
- Maximum RSS : maximum memory consumption
Obtained from Quast:
contigs : number of contigs larger then 500bp generated by an assembler
- Largest contig : largest contig generated by an assembler
- Total length : total contig length generated by an assembler
- GC (%) GC content of contigs generated by an assembler, can be compared to reference GC content to estimate assembly quality
- N50 : N50 values for generated contigs
- NG50 : NG50 value for generated contigs
misassemblies : number of missasembiled generated by an assembler
- Genome fraction (%) : genome fraction covered by generated contigs
- Duplication ratio : how many time is each reference base covered in generated contigs
N's per 100 kbp
mismatches per 100 kbp
indels per 100 kbp
Quast generated fields present in summarized results are represented in a list "qfields" in function "summarize_results" in NanoMark.py. New field can be added by modifying that list.
Including new assemblers in the benchmark
The benchmarking tool currently includes following assembly tools:
- Loman, Quick and Simpson assembly pipeline (http://www.nature.com/nmeth/journal/v12/n8/full/nmeth.3444.html)
Additional assembly tools can be included by writting a wrapper script in Python. Each assembler that needs to be included in the benchmark must have a corresponding wrapper in folder: /wrappers. Wrapper script filenames must start with "wrapper_"
Each wrapper must define three varibales:
- Installation path (ASSEMBLER_PATH)
- Assembler name written to benchmark summary .tsv file (ASSEMBLER_NAME)
- Results file filename (must include relative path from assembler installation path) (ASSEMBLER_RESULTS)
Each wrapper must also define two functions:
- download_and_install() : installs the assembler and makes it ready to run
- run(reads_file, reference_file, machine_name, output_path, output_suffix='') : runs the assembler on given reads and reference files, results are stored in a given folder: output_folder. Attributes machine_name and output_suffix are currently not used, but must be included in function header for compatibility.
Included wrappers are good examples of wrapper implementation.
Usage of wrapper scripts
Installing an assembler (often requires sudo access):
Running the assembly process consists of specifying all reads files in the form:
wrapper_?.py run output_folder reads_type,<reads_path>[<reads_path_b,frag_len,frag_stddev]
More than one dataset can be described in the same command line, simply by listing them in a space-separated manner.
Reads_type can be one of: nanopore/pacbio/single/paired/mate. If reads_type != "paired" or "mate", last three parameters can be omitted.
If reads_type == "paired" or "mate", other end of the pair needs to be in another file provided by reads_path_b.
Examples of usage:
wrapper_falcon.py run output_folder nanopore,reads.fa wrapper_allpathslg.py run output_folder paired,datasets/frag_reads.Solexa-25396.A.fastq,datasets/frag_reads.Solexa-25396.B.fastq,180,10 mate,datasets/jump_reads.Solexa-42866.A.fastq,datasets/jump_reads.Solexa-42866.B.fastq,3000,500 mate,datasets/jump_reads.Solexa-44956.A.fastq,datasets/jump_reads.Solexa-44956.B.fastq,3000,500 nanopore,datasets/reads.fastq
This work has been supported in part by Croatian Science Fundation under the project UIP-11-2013-7353.