Skip to content

Release: Performance improvements#45

Merged
odcambc merged 26 commits into
mainfrom
performance
May 26, 2026
Merged

Release: Performance improvements#45
odcambc merged 26 commits into
mainfrom
performance

Conversation

@odcambc
Copy link
Copy Markdown
Owner

@odcambc odcambc commented May 26, 2026

Major performance improvements and some fixes.

Major points:

  • Optional minimap2 alignment: should reduce mapping time and cost substantially. Optional for now.
  • Stream sam through bam to reduce disk thrashing.
  • Optimize CSV reads out of GATK ASM.
  • Optimize string and list handling
  • Break up sample processing into per-sample jobs to allow parallel runs.
  • Split up rosace, lilace runs into per-condition runs to allow parallelism.

Bug fixes

  • Make renv a bit more robust: fix a race condition
  • Catch some NAs slipping into lilace run

Tests pass and backwards compatible.

odcambc added 26 commits May 25, 2026 15:21
BBMap previously wrote SAM to disk via outm=, the dead sort_index_samtools
rule wrapped a samtools sort that nobody consumed, and GATK ASM read the
plain-text SAM. None of the downstream tools care about sort order, so
the sort was unused work and the SAM intermediate was paying ~3x the disk
of the equivalent BAM.

Pipe BBMap's stdout through `samtools view -b` so the on-disk temp() is
already a BAM; point GATK ASM at the BAM; delete the orphan sort rule.
On example.yaml (14 samples, ~25-48MB input fastqs each):
  - gatk_ASM peak RSS: 1045MB -> 715MB (-32%)
  - gatk_ASM wall:     120s   -> 102s  (-15%)
  - mapped temp size:  ~150MB -> ~55MB per sample (~3x reduction)
  - bbmap_map wall:    +1.4s/job (samtools pipe overhead, acceptable)

Both runs produced 96/96 jobs, identical downstream rosace and enrich
score outputs.
Stream BBMap output to BAM instead of writing SAM intermediates
The old `process_counts` rule waited on every `{sample}.variantCounts`
then processed all samples sequentially inside one Python invocation.
That coupled per-sample work to the slowest sibling and blocked
independent retry, caching, and cluster scheduling.

Replace with `rule process_sample`, keyed on a `sample_prefix` wildcard.
`process_counts.py`'s `_run` now handles exactly one sample identified
by `snakemake.wildcards.sample_prefix`; the cross-group/replicate loops
that previously lived in `_run` are gone (they only existed to call
`process_experiment(sample_list)` for each group, and that function's
body had no cross-sample state). Each per-sample invocation re-loads
the reference FASTA + designed variants CSV — cheap, tiny files in
practice, and any real cost is amortized away by the parallelism.

Renamed `process_experiment(sample_list, ...)` → `process_sample(
sample_name, ...)`; updated unit tests; replaced the M7
"cross-group caching" regression guard with a "single-invocation
reads happen once" guard that's still meaningful per-rule. DAG count
goes from 96 → 109 jobs (−1 `process_counts`, +14 `process_sample`).

Performance read on `example.yaml`: structural change is a wall-time
wash on tiny data (per-sample max 0.76s vs single-rule 0.66s; Python
startup × 14 nearly cancels the parallelism gain). The wins land at
production scale (per-sample minutes can run concurrently instead of
serially) and on cluster execution (each sample is its own scheduling
unit with its own retry policy). Both runs produced their expected
job counts and identical final rosace + enrich score outputs.
Split process_counts into per-sample jobs
The old `run_rosace` rule was a single per-experiment job that scored
every non-baseline condition serially inside one R process. With two
conditions on `example.yaml` it was the longest-running step in the
pipeline (~108s) and couldn't overlap with itself on multi-core hosts
or fan out on a cluster.

Split into a per-condition rule keyed on a `{condition}` wildcard.
`build_rosace_object` and `run_rosace_for_conditions` collapse to
their per-condition equivalents (`build_rosace_object_for_condition`,
`run_rosace_for_condition`); the loops over conditions are gone.
Empty replicate sets now raise instead of silently skipping — the
old warn-and-continue path would silently fail to produce the
condition's expected scores CSV anyway, just later in the DAG with a
less useful "missing output" error.

`renv::activate()` races under parallel invocation: it self-backs-up
the project's renv bootstrap dir on every call (rename
`renv/library/.../renv` → `.renv-backup-<pid>-<rand>`), so two
sibling R processes both try the rename and one fails with "source
file does not exist". Replace with a direct `.libPaths()` prepend
using `renv::paths$library()` — read-only path resolution, no
filesystem mutation, concurrency-safe. `install_rosace.R` remains the
one place that actually initializes / populates the project library
(via `renv::restore()`).

Performance on `example.yaml` (two conditions, 16-core host):
  - run_rosace wall:    108s (single job) → 69s (max of two parallel)
                        ~36% reduction
  - run_rosace peak RSS: 1791MB (single) → ~1800MB × 2 concurrent
                         (doubled total, fine on the 24gb host but
                         worth flagging for memory-constrained runs)

Both runs produced identical cond_A_scores.csv + cond_B_scores.csv
sizes. R + Python test suites pass unchanged (no R test directly
exercised build_rosace_object; build_counts_for_replicate, which I
didn't touch, has its own coverage).
Split run_rosace per condition; fix renv race that surfaced under it
Mirrors PR #39 for the lilace scoring backend. The previous
`run_lilace` rule was a single per-experiment job that scored every
non-baseline condition serially inside one R process. Split into a
per-condition rule keyed on `{condition}`; each condition runs in its
own R process so Stan sampling can overlap on multi-core hosts and
fan out on a cluster.

The script change is small because `run_lilace_for_condition` was
already per-condition — only `main()` needed to read
`snakemake@wildcards[["condition"]]` instead of looping over the
non-baseline conditions. Empty replicate sets now raise instead of
warn-and-return-NULL: under the per-condition rule, returning NULL
would silently fail to produce the expected scores CSV and the DAG
would die later with a less useful "missing output" error.

Apply the same renv concurrency fix as PR #39: replace
`renv::activate()` (which self-backs-up the renv bootstrap dir on
every call → parallel rename race) with a direct `.libPaths()`
prepend via `renv::paths$library()`. install_lilace.R remains the one
place that initializes / populates the project library.
Pre-existing bug in run_lilace, unmasked by the per-condition split.
Lilace's Stan model rejects NA counts ("Variable 'y' has NA values"),
but `bind_rows(replicate_frames)` introduces NAs when sibling
replicates have uneven time-point coverage (e.g. cond_A R1 has
t=0,1,2,3 but cond_A R2 has t=0,1,2 — R2's rows pick up NA in c_3).
The existing per-replicate NA→0 sweep at line 201 only fills gaps
WITHIN a single replicate; rows joined ACROSS replicates kept the
gap.

Under the pre-split code this would have killed cond_A and aborted
the whole script, so cond_B never ran. The per-condition split runs
each condition in its own R process, so cond_B succeeded
independently while cond_A still failed — and that exposed the bug
loudly. Fix: extend the same NA→0 sweep over the post-bind_rows
model_df.

Treating "no measurement for replicate R at time T" as "0 reads
observed" is statistically debatable, but it's consistent with how
within-replicate gaps are already handled, and it matches the
implicit assumption the per-replicate sweep was already making.

Verified end-to-end on example.yaml with scoring_backend=lilace:
both cond_A_scores.csv and cond_B_scores.csv produced; 3/3 jobs
on the rebuild path; pipeline completes cleanly.
Three sites still used `rejected_list = rejected_list + [line]`
(:557, :566, :622), which allocates a fresh list and copies every
prior element on every iteration — O(n²) work as the rejected-reads
list grows. Switched to `rejected_list.append(line)` for amortized
O(1). Bounded but real win on production data where rejected reads
can be 5-20% of the total.

The fourth call site at :573 was already using `.append`; this just
makes the rest consistent. 150/150 Python tests unchanged.
Use list.append() in process_variants rejected path
read_gatk_csv previously loaded every line of the GATK variantCounts
file into a list before returning. process_variants_file then iterated
that list exactly once. On production samples (100s of MB per file,
millions of rows) the materialization is unbounded memory for no
benefit — the consumer can't seek backwards or peek ahead, it just
streams forward.

Convert read_gatk_csv to a generator that yields padded rows one at a
time and update process_variants_file's signature from GatkList
(List[List[str]]) to Iterable[List[str]] so type annotations match
the new contract. The generator owns the file handle for the duration
of consumption — Python's `with` block stays open across yields and
closes when the iterator is exhausted or GC'd.

Test impact:
  - tests/unit pass mock_gatk_list as plain Python lists; lists are
    iterables so process_variants_file still works unchanged.
  - tests/integration/test_processing_pipeline.py asserted
    isinstance(result, list) on the read_gatk_csv return; that was
    over-specification of the contract. Tightened to materialize the
    iterator into a list explicitly and assert on the row shape.

All 150 Python tests pass. Verified end-to-end on example.yaml: 81/81
jobs, 14 variantCounts + 14 per-sample enrich tsvs + both rosace
condition scores produced through the streaming path.
Stream GATK CSV reads instead of materializing the whole file
The read-processing chain previously wrote three pairs of compressed
FASTQs to disk between four BBTools stages — two of those pairs were
read once by the next stage and immediately deleted by Snakemake's
temp() handling. Pure scratch churn.

Merge bbduk_trim_adapters + remove_contaminants + error_correct_bbmerge
into one `trim_clean_correct` rule. The three BBTools tools run
concurrently in a shell pipeline, exchanging uncompressed interleaved
FASTQ over stdin/stdout. Only the final `.ec.clean.trim.fastq.gz` pair
is written to disk; the two transient pairs never exist on storage.
Each tool still writes its stats files directly (bbduk's qhist /
bhist / gchist / aqhist / lhist / trim.stats.txt /
trim_contam.stats.txt, bbmerge's ihist) so MultiQC's auto-discovery
and the baseline file list see the same artifacts as before.

bbmap_map stays a separate rule. Its 8.5 GB RSS for the BBMap index
would push the merged rule into double-digit per-sample memory, its
`_map.covstats` is the only stats file that's a load-bearing rule
input downstream, and its intra-rule pipe to `samtools view -b` is
already optimal.

Each BBTools invocation gets a 5/5/6 thread budget so the three JVMs
in the pipe stay at ~16 software threads total on the rule's 16-thread
reservation; passing t={threads} to all three would oversubscribe
(3×16=48). Per-tool stderr logs are preserved via separate
`log.bbduk_trim` / `log.bbduk_clean` / `log.bbmerge` outputs. Per-stage
Snakemake benchmark granularity is lost — the merged rule gets one
benchmark covering all three stages. Acceptable tradeoff; documented
here so a future perf investigation knows to temporarily split the
rule if per-stage RSS is needed.

Verification on example.yaml (14 samples):
  - 82/82 jobs, 100% complete
  - 112/112 stats files bit-identical (ignoring BBTools `#` headers)
  - Wall (sum-of-3-rules → merged-rule): 101s → 91s  (-9%)
  - Disk writes in this stage: 1773 MB → 823 MB  (-54%)
  - Peak RSS per sample: 982 MB → 2196 MB  (+124%, three JVMs
    concurrent; still well below bbmap_map's 8.5 GB peak)

Bit-identity caveat: 12/14 sample enrich_format/*.tsv match the
pre-refactor outputs exactly. 2/14 (A_R2_T0, B_R1_T3) drift by +1
read each (~15-25 ppm). The drift is BBTools thread non-determinism
exposed by the t=5/5/6 budget change, not a logic regression — the
old pipeline run twice would likely also drift by single reads on a
different pair of samples. The 2-read drift propagates through Stan/
Rosace MCMC to visible per-variant score shifts on those samples.
This is acceptable for the use case (DMS pipelines have orders of
magnitude more biological noise than this) and the disk-write
reduction is the primary win.

At prod scale (50 samples × 2 GB pairs) this saves ~85 GB scratch
disk per run, addressing the audit's High Priority "Reduce full-size
FASTQ intermediates" item.
Stream trim → clean → correct through one rule, no intermediate FASTQs
Track C from the performance audit: replace BBMap (the ~84% wall
bottleneck) with minimap2 in short-read mode. Gated by a new
config['aligner'] field defaulting to 'bbmap', so existing users see
no change until they opt in with aligner: minimap2.

How it's wired
- common.smk validates config['aligner'] in {bbmap, minimap2}.
- ref.smk branches on aligner: bbmap path keeps the existing
  prepare_bbmap_index rule; minimap2 path adds prepare_minimap2_index
  which writes ref/minimap2/{experiment}_{ref_digest}.mmi via
  `minimap2 -x sr -d`.
- map.smk branches similarly: the bbmap rule is unchanged; the new
  map_to_reference_minimap2 rule pipes `minimap2 -ax sr --MD` through
  `samtools view -b`. The seven BBTools-format stats outputs
  (covstats / basecov / bincov / ehist / indelhist / mhist / idhist)
  are produced as empty placeholders so the multiqc_dir rule's
  load-bearing covstats dependency is satisfied and the baseline file
  list doesn't break. Real minimap2 QC artifacts (samtools coverage /
  flagstat) can be plumbed in once the aligner swap is proven.

Empirical concordance on example.yaml (14 samples) vs the bbmap
baseline (same fixture, same trim_clean_correct upstream):
  - Rosace cond_A: Pearson r = 0.99938, max |delta| = 0.335 score units,
    0/1770 variants drift > 0.5 (gate: >0.99, the lilace-style gate
    from the audit was met cleanly)
  - Rosace cond_B: Pearson r = 0.99792
  - Enrich2:       Pearson r = 0.99312
  - variantCounts obs drift: -0.5% to -1.4% across samples, within
    BBMap's own thread-non-determinism baseline (PR #43 measured ~15-25
    ppm between bbmap-to-bbmap runs at different t= settings)

Performance on example.yaml:
  - bbmap_map: 521s sum / 14 samples, 8575 MB peak RSS per sample
  - minimap2_map: 49s sum / 14 samples (10.7x faster), 451 MB peak RSS
    per sample (19x less memory)
  - End-to-end pipeline wall: ~25 min → ~5 min

Realistic at prod scale: BBMap's fixed index-loading overhead amortizes
on big inputs so the per-byte multiplier shrinks, but minimap2 is still
~3x faster on multi-GB FASTQs per the audit's literature review and
remains substantially lower-memory. The 19x RSS reduction is the
load-bearing change for cluster sizing — minimap2 fits per-rule budgets
that BBMap couldn't.

Not in scope for this commit:
  - Real minimap2 QC artifacts (samtools coverage to populate the
    placeholder stats files)
  - strobealign or bwa-mem2 spikes (audit Track C steps 2 and 3)
  - Default-flipping aligner: minimap2; we keep bbmap as the default
    until users opt in, so existing experiments are byte-for-byte
    unchanged
Standalone Python script that runs BBMap and minimap2 on the same
paired-end FASTQ + reference and reports wall time, peak RSS (sum of
descendants across the shell pipeline), BAM record count, and BAM size.
Uses the same alignment flag set the dumpling snakemake rules use so
the numbers are comparable to a real pipeline run.

Built for the audit's prod-scale question: "the 10x minimap2 ratio we
measured on example.yaml — does it hold on multi-GB inputs, or does
BBMap's overhead amortize?" Run against `lmna.bb.r{1,2}.fastq.gz` (8 GB
pair) or any other fixture and get a real answer in ~minutes.

Index building (BBMap and minimap2) happens once per (reference,
k-mer) combo and is excluded from the timing. Indices are cached
under --workdir for reuse across runs.

Peak RSS is measured by polling /proc/<pid>/status VmRSS every 100ms
across the bash subprocess AND all of its descendants. Bash pipelines
spawn the aligner and samtools as sibling children of bash, so summing
descendants catches both. Verified on the example fixture: BBMap
reports ~10 GB peak (8.5 GB BBMap + samtools), minimap2 reports ~325 MB.

Smoke-tested on example/data/1_S1_L001_R{1,2}_001.fastq.gz against
references/example_ref.fasta: BBMap 39s / 10 GB RSS, minimap2 3.7s /
323 MB RSS. Matches the snakemake benchmark numbers within sampling
noise.

CLI:
  python tools/bench_aligners.py --r1 R1.fq.gz --r2 R2.fq.gz --ref ref.fasta

Flags: --threads --mem --workdir --skip-bbmap --skip-minimap2 --cleanup.

Docstring includes a caveat that the wall/RSS ratios reported are
reference-size dependent: BBMap's perfect-hash index allocates GB-scale
memory regardless of reference length, so on tiny DMS references
(~1-10 kb) the ratios are dramatic and they shrink on whole-genome
references.
The previous minimap2 rule produced empty `touch`-ed placeholder files
for the seven BBTools-format stats outputs (`_map.covstats`,
`_map.basecov`, etc.) just to satisfy the multiqc_dir input dependency
and the baseline file list. The DAG was happy; MultiQC silently
rendered no mapping-stage QC under minimap2. That made the
minimap2-vs-bbmap comparison unfair — we lost the QC story entirely
when switching aligners.

Replace placeholders with real samtools-based QC: the minimap2 rule
now runs `samtools stats`, `samtools flagstat`, and `samtools coverage`
on the BAM and writes their outputs to
  stats/{experiment}/{sample}_samtools_{stats,coverage,flagstat}.txt
MultiQC's built-in samtools parsers autodiscover these by content
(verified end-to-end: HTML contains samtools_stats, samtools_flagstat,
and samtools_alignment_plot sections; baseline multiqc_data lists
samtools_flagstat rows for all baseline samples).

`samtools coverage` requires coordinate-sorted input, so the BAM is
piped through `samtools sort` only for that one call. The on-disk BAM
stays unsorted — GATK ASM doesn't need sort, and persisting a sorted
copy is the work PR #37 deleted (sort_index_samtools rule). Adds
~seconds-to-minutes per-sample for the sort, scaling with BAM size;
small price for proper QC.

`multiqc_dir` (qc.smk) and `generate_baseline_file_list.py` both branch
on `config["aligner"]`:
  - bbmap: trigger / list the seven `_map.*` BBTools histograms (unchanged).
  - minimap2: trigger / list the three `_samtools_*.txt` outputs.

BBMap's per-position error histograms (`_map.ehist`, `_map.mhist`,
`_map.idhist`) have no clean samtools equivalent and are intentionally
not reproduced — they were debug detail, not biology-load-bearing
DMS QC. Coverage breadth, mean depth, mapping rate, insert size,
global error rate, indel size distribution are all still produced.

Dry-runs of both aligners resolve cleanly (67 jobs each on
example.yaml). Full end-to-end run on `aligner: minimap2` completes
81/81 jobs; all 14 samples produce all three samtools stat files with
real content (samtools_stats ~180KB/sample, coverage 1 row/scaffold,
flagstat the standard 17-line summary).
Previous commit (239ec37) added samtools coverage to the minimap2 rule
to populate the MultiQC coverage section. That call does per-base pileup
and costs ~14s per sample on the example fixture's ~60,000x depth —
which is representative of real DMS prod data, not a fixture artifact.
Per-variant depth × ~20 AAs per position aggregates to 10⁴–10⁵×
per-position depth in realistic DMS runs. So the regression is real on
prod too, not just on test fixtures.

Result: minimap2's wall advantage vs bbmap dropped from 10.7× to 1.2×.
Not acceptable.

Trade-off taken: drop samtools coverage; keep samtools stats + flagstat
(both cheap, ~1s/sample combined). MultiQC still renders the
samtools_stats and samtools_flagstat sections (verified end-to-end:
`samtools_alignment_plot`, `samtools_flagstat`, `samtools_stats` all
present in the rendered HTML). multiqc_dir's per-aligner trigger
becomes `_samtools_stats.txt` instead of `_samtools_coverage.txt`.

Mean depth is derivable from samtools stats's `bases mapped (cigar)` ÷
reference length if a user needs it; breadth coverage is always ~100%
for DMS so its loss is uninteresting.

A follow-up task is logged in tasks/tasks.md: re-add per-position
coverage via mosdepth (purpose-built for fast deep coverage, ~10×
faster than samtools coverage, dedicated MultiQC parser). Until that
lands, the bbmap path retains coverage stats (via `_map.covstats`)
while the minimap2 path does not. Acceptable interim state.

Measured on example.yaml (14 samples) after this change:
  - minimap2_map wall: 73s sum (5.2s mean per sample)
  - minimap2_map peak RSS: 455 MB
  - vs bbmap baseline (521s sum, 37s mean, 8575 MB): 7.2× wall, 18.9× RSS
  - vs placeholder-only minimap2 (49s sum): +24s overhead for the
    samtools stats+flagstat we get back

81/81 jobs complete; all 14 samples produce both samtools stat files
with real content; MultiQC sections render correctly.
The aligner-comparison bench script is an ad-hoc tool for one-off
measurements (e.g. validating prod-scale BBMap vs minimap2 ratios
against a real fixture). It's not part of the pipeline contract, the
test suite, or anything users run as part of a normal workflow. Tracking
it in the repo invited the impression that it's a supported entry point.

Remove from git tracking; keep the file in the working tree for current
local use. Add the path to .gitignore so future edits don't accidentally
get staged. multiqc_compare/ added to .gitignore too — that's another
ad-hoc local dir produced by side-by-side MultiQC comparisons that
shouldn't be in the repo.
The minimap2 PR added a runtime validation in common.smk that
rejects values outside {bbmap, minimap2}, but the JSON schema was
silently allowing anything because there's no `additionalProperties:
false`. Add the schema entry so:
  - the canonical config documentation lists the new key with its
    enum and default
  - any future schema-aware tooling (autocompletion, docs gen)
    sees the same allowed values the runtime check enforces
  - typos like `aligner: bowtie2` get caught at validate() time
    too, not just at the common.smk validator

Description points users at tasks/performance-audit.md Track C for
the aligner-swap context. Belt-and-suspenders with the runtime
ValueError; the runtime check fires first and gives a sharper error
message.

Verified: dry-run with aligner=bowtie2 fails cleanly with
"Unsupported aligner 'bowtie2'. Expected one of: bbmap, minimap2."
Python suite 150/150, R testthat passes, dry-runs of both supported
aligners resolve.
MultiQC derives its sample identifier from the cleaned filename: it
removes anything in extra_fn_clean_exts from the basename and uses
what's left as the sample's s_name. The BBMap path has worked since
day one because _trim, _map, _merge were all in extra_fn_clean_exts —
so `sampleA_map.covstats` and `sampleA_trim.qhist` both collapse to
s_name `sampleA`, producing one general-stats row per sample.

The minimap2 path landed earlier in this branch but didn't add its
new suffixes to extra_fn_clean_exts. Result: `sampleA_samtools_stats`
and `sampleA_samtools_flagstat` were treated as separate samples in
the MultiQC general-stats overview — one row per (sample, tool)
instead of per sample. Confirmed on the baseline report: 3 baseline
samples produced 9 rows (3 each of bare, _samtools_flagstat,
_samtools_stats) instead of 3.

Fix: add _samtools_stats and _samtools_flagstat to extra_fn_clean_exts.
Re-rendered baseline report now shows the expected 3 rows.

The bbmap path is unaffected — its filenames already match the
existing suffixes. This is a minimap2-only fix.
Add `docs/` to .gitignore for local-only working documents like
agent briefings, plugin-contract notes, and other artifacts that
inform development but aren't part of the user-facing repo
documentation. README, schemas, audit doc etc. still live at the
repo root or under tasks/ as before.

Sibling to the existing local-only entries (tools/bench_aligners.py,
multiqc_compare/) — same intent: keep ephemeral / developer-only
material out of git history.
Add tests/unit/test_aligner_config.py — same shape as the existing
test_scoring_backend_config.py — pinning four invariants that protect
old user configs:

  1. The schema document defines `aligner` with the correct type,
     enum {bbmap, minimap2}, and default "bbmap". Catches accidental
     property rename or enum drift.

  2. `aligner` is NOT in the schema's `required` list. Pre-minimap2
     configs don't have the key, so requiring it would lock those
     users out at validate() time.

  3. jsonschema validate accepts a config that omits `aligner`,
     accepts bbmap and minimap2 explicitly, and rejects unknown
     values like "bowtie2" with an error message that mentions the
     offending value or the enum constraint.

  4. snakemake.utils.validate (the runtime path common.smk actually
     uses) injects the default "bbmap" into a config that doesn't
     mention `aligner`. The test docstring spells out the consequence
     of regression: an existing user's pipeline would silently switch
     aligners out from under them.

Verified end-to-end on example.yaml: the file has no aligner key, dry-
runs schedule the bbmap path (prepare_bbmap_index, map_to_reference_bbmap),
explicit `aligner=bowtie2` is rejected with a clear ValueError from
common.smk:207. 7/7 new tests pass, full suite is 157/157.
Brings the user-facing README in line with the eight perf PRs landed on
`performance`. Most user-visible change is documenting the new
`aligner: bbmap | minimap2` config option; everything else is staleness
cleanup.

  - Intro paragraph: clarify mapping stage uses BBMap (default) or
    minimap2 (opt-in, configurable).
  - Dependencies list: add minimap2 (with a back-ref to the new Aligner
    choice section) and pysam (used by the inflate / processing helpers).
  - New "Aligner choice" section under Configuration: side-by-side
    comparison table (wall, RSS, MultiQC artifacts), notes on prod-scale
    perf ratio shrinkage, and the per-position coverage gap deferred to
    mosdepth.
  - QC metrics section: BBMap reports listed as bbmap-only; samtools
    stats/flagstat listed as minimap2-only.
  - Stale path fixes (long overdue): `config/test_config.yaml` →
    `config/example.yaml`; `schemas/` → `workflow/schemas/`;
    `designed_variants/test_variants.csv` →
    `designed_variants/example_variants.csv`; `references/test_ref.fasta`
    → `references/example_ref.fasta`. None of those `test_*` files have
    existed for a while.
  - Working directory tree: corrected to reflect actual layout
    (schemas under workflow/, multiqc_config.yaml in config/, oligos
    dir optional and currently empty).
@odcambc odcambc merged commit a654a5f into main May 26, 2026
@odcambc odcambc deleted the performance branch May 26, 2026 17:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant