Skip to content

Add minimap2 aligner#44

Merged
odcambc merged 9 commits into
performancefrom
quicken
May 26, 2026
Merged

Add minimap2 aligner#44
odcambc merged 9 commits into
performancefrom
quicken

Conversation

@odcambc
Copy link
Copy Markdown
Owner

@odcambc odcambc commented May 26, 2026

Testing shows minimap2 provides a pretty big performance boost over bbmap.sh.

Minor costs:

  • Certain mapping quality metrics require samtools to measure
  • Sequencing coverage requires expensive samtools coverage call - disabled currently. Typically is not uniquely informative.

Caveats:

  • MultiQC reports work but do not aggregate by sample as with bbmap.sh. Deferring to QC and display updates.
  • Schema is updated. Currently configs are backwards compatible but that may change.
  • minimap2 is not enabled by default to maintain expected behavior and environments.
  • Mapping results may not be identical - algorithm and settings are not precisely equivalent to bbmap. Mapping with multithreading is inherently nondeterministic however and testing score outputs showed >99.5% identify.
    • Nondeterminism/perfect reproducibility should be considered TBD: setting R seeds is simple, but mapping would require single threads. Not ideal.

odcambc added 9 commits May 26, 2026 01:48
Track C from the performance audit: replace BBMap (the ~84% wall
bottleneck) with minimap2 in short-read mode. Gated by a new
config['aligner'] field defaulting to 'bbmap', so existing users see
no change until they opt in with aligner: minimap2.

How it's wired
- common.smk validates config['aligner'] in {bbmap, minimap2}.
- ref.smk branches on aligner: bbmap path keeps the existing
  prepare_bbmap_index rule; minimap2 path adds prepare_minimap2_index
  which writes ref/minimap2/{experiment}_{ref_digest}.mmi via
  `minimap2 -x sr -d`.
- map.smk branches similarly: the bbmap rule is unchanged; the new
  map_to_reference_minimap2 rule pipes `minimap2 -ax sr --MD` through
  `samtools view -b`. The seven BBTools-format stats outputs
  (covstats / basecov / bincov / ehist / indelhist / mhist / idhist)
  are produced as empty placeholders so the multiqc_dir rule's
  load-bearing covstats dependency is satisfied and the baseline file
  list doesn't break. Real minimap2 QC artifacts (samtools coverage /
  flagstat) can be plumbed in once the aligner swap is proven.

Empirical concordance on example.yaml (14 samples) vs the bbmap
baseline (same fixture, same trim_clean_correct upstream):
  - Rosace cond_A: Pearson r = 0.99938, max |delta| = 0.335 score units,
    0/1770 variants drift > 0.5 (gate: >0.99, the lilace-style gate
    from the audit was met cleanly)
  - Rosace cond_B: Pearson r = 0.99792
  - Enrich2:       Pearson r = 0.99312
  - variantCounts obs drift: -0.5% to -1.4% across samples, within
    BBMap's own thread-non-determinism baseline (PR #43 measured ~15-25
    ppm between bbmap-to-bbmap runs at different t= settings)

Performance on example.yaml:
  - bbmap_map: 521s sum / 14 samples, 8575 MB peak RSS per sample
  - minimap2_map: 49s sum / 14 samples (10.7x faster), 451 MB peak RSS
    per sample (19x less memory)
  - End-to-end pipeline wall: ~25 min → ~5 min

Realistic at prod scale: BBMap's fixed index-loading overhead amortizes
on big inputs so the per-byte multiplier shrinks, but minimap2 is still
~3x faster on multi-GB FASTQs per the audit's literature review and
remains substantially lower-memory. The 19x RSS reduction is the
load-bearing change for cluster sizing — minimap2 fits per-rule budgets
that BBMap couldn't.

Not in scope for this commit:
  - Real minimap2 QC artifacts (samtools coverage to populate the
    placeholder stats files)
  - strobealign or bwa-mem2 spikes (audit Track C steps 2 and 3)
  - Default-flipping aligner: minimap2; we keep bbmap as the default
    until users opt in, so existing experiments are byte-for-byte
    unchanged
Standalone Python script that runs BBMap and minimap2 on the same
paired-end FASTQ + reference and reports wall time, peak RSS (sum of
descendants across the shell pipeline), BAM record count, and BAM size.
Uses the same alignment flag set the dumpling snakemake rules use so
the numbers are comparable to a real pipeline run.

Built for the audit's prod-scale question: "the 10x minimap2 ratio we
measured on example.yaml — does it hold on multi-GB inputs, or does
BBMap's overhead amortize?" Run against `lmna.bb.r{1,2}.fastq.gz` (8 GB
pair) or any other fixture and get a real answer in ~minutes.

Index building (BBMap and minimap2) happens once per (reference,
k-mer) combo and is excluded from the timing. Indices are cached
under --workdir for reuse across runs.

Peak RSS is measured by polling /proc/<pid>/status VmRSS every 100ms
across the bash subprocess AND all of its descendants. Bash pipelines
spawn the aligner and samtools as sibling children of bash, so summing
descendants catches both. Verified on the example fixture: BBMap
reports ~10 GB peak (8.5 GB BBMap + samtools), minimap2 reports ~325 MB.

Smoke-tested on example/data/1_S1_L001_R{1,2}_001.fastq.gz against
references/example_ref.fasta: BBMap 39s / 10 GB RSS, minimap2 3.7s /
323 MB RSS. Matches the snakemake benchmark numbers within sampling
noise.

CLI:
  python tools/bench_aligners.py --r1 R1.fq.gz --r2 R2.fq.gz --ref ref.fasta

Flags: --threads --mem --workdir --skip-bbmap --skip-minimap2 --cleanup.

Docstring includes a caveat that the wall/RSS ratios reported are
reference-size dependent: BBMap's perfect-hash index allocates GB-scale
memory regardless of reference length, so on tiny DMS references
(~1-10 kb) the ratios are dramatic and they shrink on whole-genome
references.
The previous minimap2 rule produced empty `touch`-ed placeholder files
for the seven BBTools-format stats outputs (`_map.covstats`,
`_map.basecov`, etc.) just to satisfy the multiqc_dir input dependency
and the baseline file list. The DAG was happy; MultiQC silently
rendered no mapping-stage QC under minimap2. That made the
minimap2-vs-bbmap comparison unfair — we lost the QC story entirely
when switching aligners.

Replace placeholders with real samtools-based QC: the minimap2 rule
now runs `samtools stats`, `samtools flagstat`, and `samtools coverage`
on the BAM and writes their outputs to
  stats/{experiment}/{sample}_samtools_{stats,coverage,flagstat}.txt
MultiQC's built-in samtools parsers autodiscover these by content
(verified end-to-end: HTML contains samtools_stats, samtools_flagstat,
and samtools_alignment_plot sections; baseline multiqc_data lists
samtools_flagstat rows for all baseline samples).

`samtools coverage` requires coordinate-sorted input, so the BAM is
piped through `samtools sort` only for that one call. The on-disk BAM
stays unsorted — GATK ASM doesn't need sort, and persisting a sorted
copy is the work PR #37 deleted (sort_index_samtools rule). Adds
~seconds-to-minutes per-sample for the sort, scaling with BAM size;
small price for proper QC.

`multiqc_dir` (qc.smk) and `generate_baseline_file_list.py` both branch
on `config["aligner"]`:
  - bbmap: trigger / list the seven `_map.*` BBTools histograms (unchanged).
  - minimap2: trigger / list the three `_samtools_*.txt` outputs.

BBMap's per-position error histograms (`_map.ehist`, `_map.mhist`,
`_map.idhist`) have no clean samtools equivalent and are intentionally
not reproduced — they were debug detail, not biology-load-bearing
DMS QC. Coverage breadth, mean depth, mapping rate, insert size,
global error rate, indel size distribution are all still produced.

Dry-runs of both aligners resolve cleanly (67 jobs each on
example.yaml). Full end-to-end run on `aligner: minimap2` completes
81/81 jobs; all 14 samples produce all three samtools stat files with
real content (samtools_stats ~180KB/sample, coverage 1 row/scaffold,
flagstat the standard 17-line summary).
Previous commit (239ec37) added samtools coverage to the minimap2 rule
to populate the MultiQC coverage section. That call does per-base pileup
and costs ~14s per sample on the example fixture's ~60,000x depth —
which is representative of real DMS prod data, not a fixture artifact.
Per-variant depth × ~20 AAs per position aggregates to 10⁴–10⁵×
per-position depth in realistic DMS runs. So the regression is real on
prod too, not just on test fixtures.

Result: minimap2's wall advantage vs bbmap dropped from 10.7× to 1.2×.
Not acceptable.

Trade-off taken: drop samtools coverage; keep samtools stats + flagstat
(both cheap, ~1s/sample combined). MultiQC still renders the
samtools_stats and samtools_flagstat sections (verified end-to-end:
`samtools_alignment_plot`, `samtools_flagstat`, `samtools_stats` all
present in the rendered HTML). multiqc_dir's per-aligner trigger
becomes `_samtools_stats.txt` instead of `_samtools_coverage.txt`.

Mean depth is derivable from samtools stats's `bases mapped (cigar)` ÷
reference length if a user needs it; breadth coverage is always ~100%
for DMS so its loss is uninteresting.

A follow-up task is logged in tasks/tasks.md: re-add per-position
coverage via mosdepth (purpose-built for fast deep coverage, ~10×
faster than samtools coverage, dedicated MultiQC parser). Until that
lands, the bbmap path retains coverage stats (via `_map.covstats`)
while the minimap2 path does not. Acceptable interim state.

Measured on example.yaml (14 samples) after this change:
  - minimap2_map wall: 73s sum (5.2s mean per sample)
  - minimap2_map peak RSS: 455 MB
  - vs bbmap baseline (521s sum, 37s mean, 8575 MB): 7.2× wall, 18.9× RSS
  - vs placeholder-only minimap2 (49s sum): +24s overhead for the
    samtools stats+flagstat we get back

81/81 jobs complete; all 14 samples produce both samtools stat files
with real content; MultiQC sections render correctly.
The aligner-comparison bench script is an ad-hoc tool for one-off
measurements (e.g. validating prod-scale BBMap vs minimap2 ratios
against a real fixture). It's not part of the pipeline contract, the
test suite, or anything users run as part of a normal workflow. Tracking
it in the repo invited the impression that it's a supported entry point.

Remove from git tracking; keep the file in the working tree for current
local use. Add the path to .gitignore so future edits don't accidentally
get staged. multiqc_compare/ added to .gitignore too — that's another
ad-hoc local dir produced by side-by-side MultiQC comparisons that
shouldn't be in the repo.
The minimap2 PR added a runtime validation in common.smk that
rejects values outside {bbmap, minimap2}, but the JSON schema was
silently allowing anything because there's no `additionalProperties:
false`. Add the schema entry so:
  - the canonical config documentation lists the new key with its
    enum and default
  - any future schema-aware tooling (autocompletion, docs gen)
    sees the same allowed values the runtime check enforces
  - typos like `aligner: bowtie2` get caught at validate() time
    too, not just at the common.smk validator

Description points users at tasks/performance-audit.md Track C for
the aligner-swap context. Belt-and-suspenders with the runtime
ValueError; the runtime check fires first and gives a sharper error
message.

Verified: dry-run with aligner=bowtie2 fails cleanly with
"Unsupported aligner 'bowtie2'. Expected one of: bbmap, minimap2."
Python suite 150/150, R testthat passes, dry-runs of both supported
aligners resolve.
MultiQC derives its sample identifier from the cleaned filename: it
removes anything in extra_fn_clean_exts from the basename and uses
what's left as the sample's s_name. The BBMap path has worked since
day one because _trim, _map, _merge were all in extra_fn_clean_exts —
so `sampleA_map.covstats` and `sampleA_trim.qhist` both collapse to
s_name `sampleA`, producing one general-stats row per sample.

The minimap2 path landed earlier in this branch but didn't add its
new suffixes to extra_fn_clean_exts. Result: `sampleA_samtools_stats`
and `sampleA_samtools_flagstat` were treated as separate samples in
the MultiQC general-stats overview — one row per (sample, tool)
instead of per sample. Confirmed on the baseline report: 3 baseline
samples produced 9 rows (3 each of bare, _samtools_flagstat,
_samtools_stats) instead of 3.

Fix: add _samtools_stats and _samtools_flagstat to extra_fn_clean_exts.
Re-rendered baseline report now shows the expected 3 rows.

The bbmap path is unaffected — its filenames already match the
existing suffixes. This is a minimap2-only fix.
Add `docs/` to .gitignore for local-only working documents like
agent briefings, plugin-contract notes, and other artifacts that
inform development but aren't part of the user-facing repo
documentation. README, schemas, audit doc etc. still live at the
repo root or under tasks/ as before.

Sibling to the existing local-only entries (tools/bench_aligners.py,
multiqc_compare/) — same intent: keep ephemeral / developer-only
material out of git history.
Add tests/unit/test_aligner_config.py — same shape as the existing
test_scoring_backend_config.py — pinning four invariants that protect
old user configs:

  1. The schema document defines `aligner` with the correct type,
     enum {bbmap, minimap2}, and default "bbmap". Catches accidental
     property rename or enum drift.

  2. `aligner` is NOT in the schema's `required` list. Pre-minimap2
     configs don't have the key, so requiring it would lock those
     users out at validate() time.

  3. jsonschema validate accepts a config that omits `aligner`,
     accepts bbmap and minimap2 explicitly, and rejects unknown
     values like "bowtie2" with an error message that mentions the
     offending value or the enum constraint.

  4. snakemake.utils.validate (the runtime path common.smk actually
     uses) injects the default "bbmap" into a config that doesn't
     mention `aligner`. The test docstring spells out the consequence
     of regression: an existing user's pipeline would silently switch
     aligners out from under them.

Verified end-to-end on example.yaml: the file has no aligner key, dry-
runs schedule the bbmap path (prepare_bbmap_index, map_to_reference_bbmap),
explicit `aligner=bowtie2` is rejected with a clear ValueError from
common.smk:207. 7/7 new tests pass, full suite is 157/157.
@odcambc odcambc merged commit aba1d75 into performance May 26, 2026
@odcambc odcambc deleted the quicken branch May 26, 2026 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant