Resource-efficient sequence alignment against millions of prokaryotic genomes using graph-based compressed indexing
Dragon aligns query sequences (genes, plasmids, long reads) against millions of prokaryotic genomes while using dramatically less disk and RAM than existing tools. It achieves this by exploiting the massive sequence redundancy among related genomes through three key innovations:
- Coloured compacted de Bruijn graph — shared sequence stored once across all genomes
- Run-length FM-index — compressed seed index proportional to BWT runs, not text length
- Graph-aware colinear chaining — seed chaining that respects genome graph structure
| Dragon | LexicMap | Minimap2 | BLASTn | |
|---|---|---|---|---|
| Disk (2M genomes) | ~100 GB | 5,460 GB | N/A | N/A |
| Query RAM | <4 GB | 4-25 GB | scales linearly | scales linearly |
| Batch queries | parallel | sequential | parallel | sequential |
# Install
git clone https://github.com/dragon-aligner/dragon.git
cd dragon
cargo build --release
# Index genomes
./target/release/dragon index -i /path/to/genomes/ -o my_index/ -k 31
# Search
./target/release/dragon search -i my_index/ -q query.fasta -o results.paf
# View index info
./target/release/dragon info -i my_index/Requires Rust 1.75 or later:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
git clone https://github.com/dragon-aligner/dragon.git
cd dragon
cargo build --releaseThe binary will be at target/release/dragon.
For large-scale index construction, install GGCAT for optimised de Bruijn graph building:
# GGCAT will be used automatically if found in PATH
cargo install ggcatWithout GGCAT, Dragon uses a built-in graph builder suitable for datasets up to ~10,000 genomes.
dragon index \
--input /path/to/genome/fasta/directory/ \
--output /path/to/index/ \
--kmer-size 31 \
--threads 8Input: a directory of FASTA files (.fa, .fasta, .fna), one per genome.
dragon search \
--index /path/to/index/ \
--query query.fasta \
--output results.paf \
--format paf \
--threads 8 \
--max-ram 4.0Output formats:
paf— Pairwise Alignment Format (default)blast6— BLAST tabular (outfmt 6)
| Parameter | Default | Description |
|---|---|---|
--kmer-size |
31 | K-mer size for de Bruijn graph |
--min-seed-len |
15 | Minimum seed match length |
--max-seed-freq |
10,000 | Skip seeds occurring more than this |
--min-chain-score |
50 | Minimum chain score to report |
--max-target-seqs |
100 | Maximum hits per query |
--max-ram |
4.0 | RAM budget in GB |
--threads |
4 | Number of threads |
INDEX BUILD (offline):
FASTA genomes --> GGCAT (ccdBG) --> unitigs + colour bitmaps
|
v
FM-index + Genome Path Index
|
v
On-disk index (~100 GB for 2M genomes)
QUERY (online, <4 GB RAM):
Query --> FM-index backward search --> seed hits
--> Colour voting --> candidate genomes
--> Graph-aware chaining (Fenwick tree DP)
--> Banded wavefront alignment
--> PAF / BLAST output
# Run all tests (39 unit + integration)
cargo test
# Run benchmarks
cargo bench
# Run the benchmark pipeline (requires Python 3.9+)
python3 benchmark/setup_test_data.py
python3 benchmark/run_benchmark.pyDragon is benchmarked against LexicMap, Minimap2, BLASTn, and others using simulated queries at 0-15% divergence. See benchmark/ for the full Snakemake pipeline and manuscript/figures/ for generated plots.
Key result: Dragon maintains >98% sensitivity up to 5% sequence divergence while using ~50x less disk and ~5x less RAM than LexicMap.
For full documentation, see Read the Docs.
dragon/
+-- src/
| +-- main.rs CLI entry point
| +-- index/ Index construction (dbg, unitig, colour, fm, paths)
| +-- query/ Query pipeline (seed, candidate, chain, align)
| +-- io/ FASTA parser, PAF/BLAST output
| +-- ds/ Data structures (Fenwick, Elias-Fano, varint)
| +-- util/ DNA encoding, memory mapping, progress
+-- tests/ Integration tests
+-- benches/ Criterion benchmarks
+-- benchmark/ Snakemake benchmark pipeline
+-- manuscript/ Paper draft and figures
+-- docs/ ReadTheDocs documentation
If you use Dragon, please cite:
Cerdeira, L. (2026). Dragon: resource-efficient sequence alignment against millions of prokaryotic genomes using graph-based compressed indexing. In preparation.
MIT. See LICENSE for details.
