Add minimap2 aligner by odcambc · Pull Request #44 · odcambc/dumpling

odcambc · 2026-05-26T17:12:01Z

Testing shows minimap2 provides a pretty big performance boost over bbmap.sh.

Minor costs:

Certain mapping quality metrics require samtools to measure
Sequencing coverage requires expensive samtools coverage call - disabled currently. Typically is not uniquely informative.

Caveats:

MultiQC reports work but do not aggregate by sample as with bbmap.sh. Deferring to QC and display updates.
Schema is updated. Currently configs are backwards compatible but that may change.
minimap2 is not enabled by default to maintain expected behavior and environments.
Mapping results may not be identical - algorithm and settings are not precisely equivalent to bbmap. Mapping with multithreading is inherently nondeterministic however and testing score outputs showed >99.5% identify.
- Nondeterminism/perfect reproducibility should be considered TBD: setting R seeds is simple, but mapping would require single threads. Not ideal.

Track C from the performance audit: replace BBMap (the ~84% wall bottleneck) with minimap2 in short-read mode. Gated by a new config['aligner'] field defaulting to 'bbmap', so existing users see no change until they opt in with aligner: minimap2. How it's wired - common.smk validates config['aligner'] in {bbmap, minimap2}. - ref.smk branches on aligner: bbmap path keeps the existing prepare_bbmap_index rule; minimap2 path adds prepare_minimap2_index which writes ref/minimap2/{experiment}_{ref_digest}.mmi via `minimap2 -x sr -d`. - map.smk branches similarly: the bbmap rule is unchanged; the new map_to_reference_minimap2 rule pipes `minimap2 -ax sr --MD` through `samtools view -b`. The seven BBTools-format stats outputs (covstats / basecov / bincov / ehist / indelhist / mhist / idhist) are produced as empty placeholders so the multiqc_dir rule's load-bearing covstats dependency is satisfied and the baseline file list doesn't break. Real minimap2 QC artifacts (samtools coverage / flagstat) can be plumbed in once the aligner swap is proven. Empirical concordance on example.yaml (14 samples) vs the bbmap baseline (same fixture, same trim_clean_correct upstream): - Rosace cond_A: Pearson r = 0.99938, max |delta| = 0.335 score units, 0/1770 variants drift > 0.5 (gate: >0.99, the lilace-style gate from the audit was met cleanly) - Rosace cond_B: Pearson r = 0.99792 - Enrich2: Pearson r = 0.99312 - variantCounts obs drift: -0.5% to -1.4% across samples, within BBMap's own thread-non-determinism baseline (PR #43 measured ~15-25 ppm between bbmap-to-bbmap runs at different t= settings) Performance on example.yaml: - bbmap_map: 521s sum / 14 samples, 8575 MB peak RSS per sample - minimap2_map: 49s sum / 14 samples (10.7x faster), 451 MB peak RSS per sample (19x less memory) - End-to-end pipeline wall: ~25 min → ~5 min Realistic at prod scale: BBMap's fixed index-loading overhead amortizes on big inputs so the per-byte multiplier shrinks, but minimap2 is still ~3x faster on multi-GB FASTQs per the audit's literature review and remains substantially lower-memory. The 19x RSS reduction is the load-bearing change for cluster sizing — minimap2 fits per-rule budgets that BBMap couldn't. Not in scope for this commit: - Real minimap2 QC artifacts (samtools coverage to populate the placeholder stats files) - strobealign or bwa-mem2 spikes (audit Track C steps 2 and 3) - Default-flipping aligner: minimap2; we keep bbmap as the default until users opt in, so existing experiments are byte-for-byte unchanged

Standalone Python script that runs BBMap and minimap2 on the same paired-end FASTQ + reference and reports wall time, peak RSS (sum of descendants across the shell pipeline), BAM record count, and BAM size. Uses the same alignment flag set the dumpling snakemake rules use so the numbers are comparable to a real pipeline run. Built for the audit's prod-scale question: "the 10x minimap2 ratio we measured on example.yaml — does it hold on multi-GB inputs, or does BBMap's overhead amortize?" Run against `lmna.bb.r{1,2}.fastq.gz` (8 GB pair) or any other fixture and get a real answer in ~minutes. Index building (BBMap and minimap2) happens once per (reference, k-mer) combo and is excluded from the timing. Indices are cached under --workdir for reuse across runs. Peak RSS is measured by polling /proc/<pid>/status VmRSS every 100ms across the bash subprocess AND all of its descendants. Bash pipelines spawn the aligner and samtools as sibling children of bash, so summing descendants catches both. Verified on the example fixture: BBMap reports ~10 GB peak (8.5 GB BBMap + samtools), minimap2 reports ~325 MB. Smoke-tested on example/data/1_S1_L001_R{1,2}_001.fastq.gz against references/example_ref.fasta: BBMap 39s / 10 GB RSS, minimap2 3.7s / 323 MB RSS. Matches the snakemake benchmark numbers within sampling noise. CLI: python tools/bench_aligners.py --r1 R1.fq.gz --r2 R2.fq.gz --ref ref.fasta Flags: --threads --mem --workdir --skip-bbmap --skip-minimap2 --cleanup. Docstring includes a caveat that the wall/RSS ratios reported are reference-size dependent: BBMap's perfect-hash index allocates GB-scale memory regardless of reference length, so on tiny DMS references (~1-10 kb) the ratios are dramatic and they shrink on whole-genome references.

The previous minimap2 rule produced empty `touch`-ed placeholder files for the seven BBTools-format stats outputs (`_map.covstats`, `_map.basecov`, etc.) just to satisfy the multiqc_dir input dependency and the baseline file list. The DAG was happy; MultiQC silently rendered no mapping-stage QC under minimap2. That made the minimap2-vs-bbmap comparison unfair — we lost the QC story entirely when switching aligners. Replace placeholders with real samtools-based QC: the minimap2 rule now runs `samtools stats`, `samtools flagstat`, and `samtools coverage` on the BAM and writes their outputs to stats/{experiment}/{sample}_samtools_{stats,coverage,flagstat}.txt MultiQC's built-in samtools parsers autodiscover these by content (verified end-to-end: HTML contains samtools_stats, samtools_flagstat, and samtools_alignment_plot sections; baseline multiqc_data lists samtools_flagstat rows for all baseline samples). `samtools coverage` requires coordinate-sorted input, so the BAM is piped through `samtools sort` only for that one call. The on-disk BAM stays unsorted — GATK ASM doesn't need sort, and persisting a sorted copy is the work PR #37 deleted (sort_index_samtools rule). Adds ~seconds-to-minutes per-sample for the sort, scaling with BAM size; small price for proper QC. `multiqc_dir` (qc.smk) and `generate_baseline_file_list.py` both branch on `config["aligner"]`: - bbmap: trigger / list the seven `_map.*` BBTools histograms (unchanged). - minimap2: trigger / list the three `_samtools_*.txt` outputs. BBMap's per-position error histograms (`_map.ehist`, `_map.mhist`, `_map.idhist`) have no clean samtools equivalent and are intentionally not reproduced — they were debug detail, not biology-load-bearing DMS QC. Coverage breadth, mean depth, mapping rate, insert size, global error rate, indel size distribution are all still produced. Dry-runs of both aligners resolve cleanly (67 jobs each on example.yaml). Full end-to-end run on `aligner: minimap2` completes 81/81 jobs; all 14 samples produce all three samtools stat files with real content (samtools_stats ~180KB/sample, coverage 1 row/scaffold, flagstat the standard 17-line summary).

Previous commit (239ec37) added samtools coverage to the minimap2 rule to populate the MultiQC coverage section. That call does per-base pileup and costs ~14s per sample on the example fixture's ~60,000x depth — which is representative of real DMS prod data, not a fixture artifact. Per-variant depth × ~20 AAs per position aggregates to 10⁴–10⁵× per-position depth in realistic DMS runs. So the regression is real on prod too, not just on test fixtures. Result: minimap2's wall advantage vs bbmap dropped from 10.7× to 1.2×. Not acceptable. Trade-off taken: drop samtools coverage; keep samtools stats + flagstat (both cheap, ~1s/sample combined). MultiQC still renders the samtools_stats and samtools_flagstat sections (verified end-to-end: `samtools_alignment_plot`, `samtools_flagstat`, `samtools_stats` all present in the rendered HTML). multiqc_dir's per-aligner trigger becomes `_samtools_stats.txt` instead of `_samtools_coverage.txt`. Mean depth is derivable from samtools stats's `bases mapped (cigar)` ÷ reference length if a user needs it; breadth coverage is always ~100% for DMS so its loss is uninteresting. A follow-up task is logged in tasks/tasks.md: re-add per-position coverage via mosdepth (purpose-built for fast deep coverage, ~10× faster than samtools coverage, dedicated MultiQC parser). Until that lands, the bbmap path retains coverage stats (via `_map.covstats`) while the minimap2 path does not. Acceptable interim state. Measured on example.yaml (14 samples) after this change: - minimap2_map wall: 73s sum (5.2s mean per sample) - minimap2_map peak RSS: 455 MB - vs bbmap baseline (521s sum, 37s mean, 8575 MB): 7.2× wall, 18.9× RSS - vs placeholder-only minimap2 (49s sum): +24s overhead for the samtools stats+flagstat we get back 81/81 jobs complete; all 14 samples produce both samtools stat files with real content; MultiQC sections render correctly.

The aligner-comparison bench script is an ad-hoc tool for one-off measurements (e.g. validating prod-scale BBMap vs minimap2 ratios against a real fixture). It's not part of the pipeline contract, the test suite, or anything users run as part of a normal workflow. Tracking it in the repo invited the impression that it's a supported entry point. Remove from git tracking; keep the file in the working tree for current local use. Add the path to .gitignore so future edits don't accidentally get staged. multiqc_compare/ added to .gitignore too — that's another ad-hoc local dir produced by side-by-side MultiQC comparisons that shouldn't be in the repo.

The minimap2 PR added a runtime validation in common.smk that rejects values outside {bbmap, minimap2}, but the JSON schema was silently allowing anything because there's no `additionalProperties: false`. Add the schema entry so: - the canonical config documentation lists the new key with its enum and default - any future schema-aware tooling (autocompletion, docs gen) sees the same allowed values the runtime check enforces - typos like `aligner: bowtie2` get caught at validate() time too, not just at the common.smk validator Description points users at tasks/performance-audit.md Track C for the aligner-swap context. Belt-and-suspenders with the runtime ValueError; the runtime check fires first and gives a sharper error message. Verified: dry-run with aligner=bowtie2 fails cleanly with "Unsupported aligner 'bowtie2'. Expected one of: bbmap, minimap2." Python suite 150/150, R testthat passes, dry-runs of both supported aligners resolve.

MultiQC derives its sample identifier from the cleaned filename: it removes anything in extra_fn_clean_exts from the basename and uses what's left as the sample's s_name. The BBMap path has worked since day one because _trim, _map, _merge were all in extra_fn_clean_exts — so `sampleA_map.covstats` and `sampleA_trim.qhist` both collapse to s_name `sampleA`, producing one general-stats row per sample. The minimap2 path landed earlier in this branch but didn't add its new suffixes to extra_fn_clean_exts. Result: `sampleA_samtools_stats` and `sampleA_samtools_flagstat` were treated as separate samples in the MultiQC general-stats overview — one row per (sample, tool) instead of per sample. Confirmed on the baseline report: 3 baseline samples produced 9 rows (3 each of bare, _samtools_flagstat, _samtools_stats) instead of 3. Fix: add _samtools_stats and _samtools_flagstat to extra_fn_clean_exts. Re-rendered baseline report now shows the expected 3 rows. The bbmap path is unaffected — its filenames already match the existing suffixes. This is a minimap2-only fix.

Add `docs/` to .gitignore for local-only working documents like agent briefings, plugin-contract notes, and other artifacts that inform development but aren't part of the user-facing repo documentation. README, schemas, audit doc etc. still live at the repo root or under tasks/ as before. Sibling to the existing local-only entries (tools/bench_aligners.py, multiqc_compare/) — same intent: keep ephemeral / developer-only material out of git history.

Add tests/unit/test_aligner_config.py — same shape as the existing test_scoring_backend_config.py — pinning four invariants that protect old user configs: 1. The schema document defines `aligner` with the correct type, enum {bbmap, minimap2}, and default "bbmap". Catches accidental property rename or enum drift. 2. `aligner` is NOT in the schema's `required` list. Pre-minimap2 configs don't have the key, so requiring it would lock those users out at validate() time. 3. jsonschema validate accepts a config that omits `aligner`, accepts bbmap and minimap2 explicitly, and rejects unknown values like "bowtie2" with an error message that mentions the offending value or the enum constraint. 4. snakemake.utils.validate (the runtime path common.smk actually uses) injects the default "bbmap" into a config that doesn't mention `aligner`. The test docstring spells out the consequence of regression: an existing user's pipeline would silently switch aligners out from under them. Verified end-to-end on example.yaml: the file has no aligner key, dry- runs schedule the bbmap path (prepare_bbmap_index, map_to_reference_bbmap), explicit `aligner=bowtie2` is rejected with a clear ValueError from common.smk:207. 7/7 new tests pass, full suite is 157/157.

odcambc added 9 commits May 26, 2026 01:48

odcambc merged commit aba1d75 into performance May 26, 2026

odcambc deleted the quicken branch May 26, 2026 17:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add minimap2 aligner#44

Add minimap2 aligner#44
odcambc merged 9 commits into
performancefrom
quicken

odcambc commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

odcambc commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant