Add minimap2 aligner#44
Merged
Merged
Conversation
Track C from the performance audit: replace BBMap (the ~84% wall
bottleneck) with minimap2 in short-read mode. Gated by a new
config['aligner'] field defaulting to 'bbmap', so existing users see
no change until they opt in with aligner: minimap2.
How it's wired
- common.smk validates config['aligner'] in {bbmap, minimap2}.
- ref.smk branches on aligner: bbmap path keeps the existing
prepare_bbmap_index rule; minimap2 path adds prepare_minimap2_index
which writes ref/minimap2/{experiment}_{ref_digest}.mmi via
`minimap2 -x sr -d`.
- map.smk branches similarly: the bbmap rule is unchanged; the new
map_to_reference_minimap2 rule pipes `minimap2 -ax sr --MD` through
`samtools view -b`. The seven BBTools-format stats outputs
(covstats / basecov / bincov / ehist / indelhist / mhist / idhist)
are produced as empty placeholders so the multiqc_dir rule's
load-bearing covstats dependency is satisfied and the baseline file
list doesn't break. Real minimap2 QC artifacts (samtools coverage /
flagstat) can be plumbed in once the aligner swap is proven.
Empirical concordance on example.yaml (14 samples) vs the bbmap
baseline (same fixture, same trim_clean_correct upstream):
- Rosace cond_A: Pearson r = 0.99938, max |delta| = 0.335 score units,
0/1770 variants drift > 0.5 (gate: >0.99, the lilace-style gate
from the audit was met cleanly)
- Rosace cond_B: Pearson r = 0.99792
- Enrich2: Pearson r = 0.99312
- variantCounts obs drift: -0.5% to -1.4% across samples, within
BBMap's own thread-non-determinism baseline (PR #43 measured ~15-25
ppm between bbmap-to-bbmap runs at different t= settings)
Performance on example.yaml:
- bbmap_map: 521s sum / 14 samples, 8575 MB peak RSS per sample
- minimap2_map: 49s sum / 14 samples (10.7x faster), 451 MB peak RSS
per sample (19x less memory)
- End-to-end pipeline wall: ~25 min → ~5 min
Realistic at prod scale: BBMap's fixed index-loading overhead amortizes
on big inputs so the per-byte multiplier shrinks, but minimap2 is still
~3x faster on multi-GB FASTQs per the audit's literature review and
remains substantially lower-memory. The 19x RSS reduction is the
load-bearing change for cluster sizing — minimap2 fits per-rule budgets
that BBMap couldn't.
Not in scope for this commit:
- Real minimap2 QC artifacts (samtools coverage to populate the
placeholder stats files)
- strobealign or bwa-mem2 spikes (audit Track C steps 2 and 3)
- Default-flipping aligner: minimap2; we keep bbmap as the default
until users opt in, so existing experiments are byte-for-byte
unchanged
Standalone Python script that runs BBMap and minimap2 on the same
paired-end FASTQ + reference and reports wall time, peak RSS (sum of
descendants across the shell pipeline), BAM record count, and BAM size.
Uses the same alignment flag set the dumpling snakemake rules use so
the numbers are comparable to a real pipeline run.
Built for the audit's prod-scale question: "the 10x minimap2 ratio we
measured on example.yaml — does it hold on multi-GB inputs, or does
BBMap's overhead amortize?" Run against `lmna.bb.r{1,2}.fastq.gz` (8 GB
pair) or any other fixture and get a real answer in ~minutes.
Index building (BBMap and minimap2) happens once per (reference,
k-mer) combo and is excluded from the timing. Indices are cached
under --workdir for reuse across runs.
Peak RSS is measured by polling /proc/<pid>/status VmRSS every 100ms
across the bash subprocess AND all of its descendants. Bash pipelines
spawn the aligner and samtools as sibling children of bash, so summing
descendants catches both. Verified on the example fixture: BBMap
reports ~10 GB peak (8.5 GB BBMap + samtools), minimap2 reports ~325 MB.
Smoke-tested on example/data/1_S1_L001_R{1,2}_001.fastq.gz against
references/example_ref.fasta: BBMap 39s / 10 GB RSS, minimap2 3.7s /
323 MB RSS. Matches the snakemake benchmark numbers within sampling
noise.
CLI:
python tools/bench_aligners.py --r1 R1.fq.gz --r2 R2.fq.gz --ref ref.fasta
Flags: --threads --mem --workdir --skip-bbmap --skip-minimap2 --cleanup.
Docstring includes a caveat that the wall/RSS ratios reported are
reference-size dependent: BBMap's perfect-hash index allocates GB-scale
memory regardless of reference length, so on tiny DMS references
(~1-10 kb) the ratios are dramatic and they shrink on whole-genome
references.
The previous minimap2 rule produced empty `touch`-ed placeholder files
for the seven BBTools-format stats outputs (`_map.covstats`,
`_map.basecov`, etc.) just to satisfy the multiqc_dir input dependency
and the baseline file list. The DAG was happy; MultiQC silently
rendered no mapping-stage QC under minimap2. That made the
minimap2-vs-bbmap comparison unfair — we lost the QC story entirely
when switching aligners.
Replace placeholders with real samtools-based QC: the minimap2 rule
now runs `samtools stats`, `samtools flagstat`, and `samtools coverage`
on the BAM and writes their outputs to
stats/{experiment}/{sample}_samtools_{stats,coverage,flagstat}.txt
MultiQC's built-in samtools parsers autodiscover these by content
(verified end-to-end: HTML contains samtools_stats, samtools_flagstat,
and samtools_alignment_plot sections; baseline multiqc_data lists
samtools_flagstat rows for all baseline samples).
`samtools coverage` requires coordinate-sorted input, so the BAM is
piped through `samtools sort` only for that one call. The on-disk BAM
stays unsorted — GATK ASM doesn't need sort, and persisting a sorted
copy is the work PR #37 deleted (sort_index_samtools rule). Adds
~seconds-to-minutes per-sample for the sort, scaling with BAM size;
small price for proper QC.
`multiqc_dir` (qc.smk) and `generate_baseline_file_list.py` both branch
on `config["aligner"]`:
- bbmap: trigger / list the seven `_map.*` BBTools histograms (unchanged).
- minimap2: trigger / list the three `_samtools_*.txt` outputs.
BBMap's per-position error histograms (`_map.ehist`, `_map.mhist`,
`_map.idhist`) have no clean samtools equivalent and are intentionally
not reproduced — they were debug detail, not biology-load-bearing
DMS QC. Coverage breadth, mean depth, mapping rate, insert size,
global error rate, indel size distribution are all still produced.
Dry-runs of both aligners resolve cleanly (67 jobs each on
example.yaml). Full end-to-end run on `aligner: minimap2` completes
81/81 jobs; all 14 samples produce all three samtools stat files with
real content (samtools_stats ~180KB/sample, coverage 1 row/scaffold,
flagstat the standard 17-line summary).
Previous commit (239ec37) added samtools coverage to the minimap2 rule to populate the MultiQC coverage section. That call does per-base pileup and costs ~14s per sample on the example fixture's ~60,000x depth — which is representative of real DMS prod data, not a fixture artifact. Per-variant depth × ~20 AAs per position aggregates to 10⁴–10⁵× per-position depth in realistic DMS runs. So the regression is real on prod too, not just on test fixtures. Result: minimap2's wall advantage vs bbmap dropped from 10.7× to 1.2×. Not acceptable. Trade-off taken: drop samtools coverage; keep samtools stats + flagstat (both cheap, ~1s/sample combined). MultiQC still renders the samtools_stats and samtools_flagstat sections (verified end-to-end: `samtools_alignment_plot`, `samtools_flagstat`, `samtools_stats` all present in the rendered HTML). multiqc_dir's per-aligner trigger becomes `_samtools_stats.txt` instead of `_samtools_coverage.txt`. Mean depth is derivable from samtools stats's `bases mapped (cigar)` ÷ reference length if a user needs it; breadth coverage is always ~100% for DMS so its loss is uninteresting. A follow-up task is logged in tasks/tasks.md: re-add per-position coverage via mosdepth (purpose-built for fast deep coverage, ~10× faster than samtools coverage, dedicated MultiQC parser). Until that lands, the bbmap path retains coverage stats (via `_map.covstats`) while the minimap2 path does not. Acceptable interim state. Measured on example.yaml (14 samples) after this change: - minimap2_map wall: 73s sum (5.2s mean per sample) - minimap2_map peak RSS: 455 MB - vs bbmap baseline (521s sum, 37s mean, 8575 MB): 7.2× wall, 18.9× RSS - vs placeholder-only minimap2 (49s sum): +24s overhead for the samtools stats+flagstat we get back 81/81 jobs complete; all 14 samples produce both samtools stat files with real content; MultiQC sections render correctly.
The aligner-comparison bench script is an ad-hoc tool for one-off measurements (e.g. validating prod-scale BBMap vs minimap2 ratios against a real fixture). It's not part of the pipeline contract, the test suite, or anything users run as part of a normal workflow. Tracking it in the repo invited the impression that it's a supported entry point. Remove from git tracking; keep the file in the working tree for current local use. Add the path to .gitignore so future edits don't accidentally get staged. multiqc_compare/ added to .gitignore too — that's another ad-hoc local dir produced by side-by-side MultiQC comparisons that shouldn't be in the repo.
The minimap2 PR added a runtime validation in common.smk that
rejects values outside {bbmap, minimap2}, but the JSON schema was
silently allowing anything because there's no `additionalProperties:
false`. Add the schema entry so:
- the canonical config documentation lists the new key with its
enum and default
- any future schema-aware tooling (autocompletion, docs gen)
sees the same allowed values the runtime check enforces
- typos like `aligner: bowtie2` get caught at validate() time
too, not just at the common.smk validator
Description points users at tasks/performance-audit.md Track C for
the aligner-swap context. Belt-and-suspenders with the runtime
ValueError; the runtime check fires first and gives a sharper error
message.
Verified: dry-run with aligner=bowtie2 fails cleanly with
"Unsupported aligner 'bowtie2'. Expected one of: bbmap, minimap2."
Python suite 150/150, R testthat passes, dry-runs of both supported
aligners resolve.
MultiQC derives its sample identifier from the cleaned filename: it removes anything in extra_fn_clean_exts from the basename and uses what's left as the sample's s_name. The BBMap path has worked since day one because _trim, _map, _merge were all in extra_fn_clean_exts — so `sampleA_map.covstats` and `sampleA_trim.qhist` both collapse to s_name `sampleA`, producing one general-stats row per sample. The minimap2 path landed earlier in this branch but didn't add its new suffixes to extra_fn_clean_exts. Result: `sampleA_samtools_stats` and `sampleA_samtools_flagstat` were treated as separate samples in the MultiQC general-stats overview — one row per (sample, tool) instead of per sample. Confirmed on the baseline report: 3 baseline samples produced 9 rows (3 each of bare, _samtools_flagstat, _samtools_stats) instead of 3. Fix: add _samtools_stats and _samtools_flagstat to extra_fn_clean_exts. Re-rendered baseline report now shows the expected 3 rows. The bbmap path is unaffected — its filenames already match the existing suffixes. This is a minimap2-only fix.
Add `docs/` to .gitignore for local-only working documents like agent briefings, plugin-contract notes, and other artifacts that inform development but aren't part of the user-facing repo documentation. README, schemas, audit doc etc. still live at the repo root or under tasks/ as before. Sibling to the existing local-only entries (tools/bench_aligners.py, multiqc_compare/) — same intent: keep ephemeral / developer-only material out of git history.
Add tests/unit/test_aligner_config.py — same shape as the existing
test_scoring_backend_config.py — pinning four invariants that protect
old user configs:
1. The schema document defines `aligner` with the correct type,
enum {bbmap, minimap2}, and default "bbmap". Catches accidental
property rename or enum drift.
2. `aligner` is NOT in the schema's `required` list. Pre-minimap2
configs don't have the key, so requiring it would lock those
users out at validate() time.
3. jsonschema validate accepts a config that omits `aligner`,
accepts bbmap and minimap2 explicitly, and rejects unknown
values like "bowtie2" with an error message that mentions the
offending value or the enum constraint.
4. snakemake.utils.validate (the runtime path common.smk actually
uses) injects the default "bbmap" into a config that doesn't
mention `aligner`. The test docstring spells out the consequence
of regression: an existing user's pipeline would silently switch
aligners out from under them.
Verified end-to-end on example.yaml: the file has no aligner key, dry-
runs schedule the bbmap path (prepare_bbmap_index, map_to_reference_bbmap),
explicit `aligner=bowtie2` is rejected with a clear ValueError from
common.smk:207. 7/7 new tests pass, full suite is 157/157.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Testing shows minimap2 provides a pretty big performance boost over bbmap.sh.
Minor costs:
Caveats: