Release: Performance improvements by odcambc · Pull Request #45 · odcambc/dumpling

odcambc · 2026-05-26T17:25:05Z

Major performance improvements and some fixes.

Major points:

Optional minimap2 alignment: should reduce mapping time and cost substantially. Optional for now.
Stream sam through bam to reduce disk thrashing.
Optimize CSV reads out of GATK ASM.
Optimize string and list handling
Break up sample processing into per-sample jobs to allow parallel runs.
Split up rosace, lilace runs into per-condition runs to allow parallelism.

Bug fixes

Make renv a bit more robust: fix a race condition
Catch some NAs slipping into lilace run

Tests pass and backwards compatible.

BBMap previously wrote SAM to disk via outm=, the dead sort_index_samtools rule wrapped a samtools sort that nobody consumed, and GATK ASM read the plain-text SAM. None of the downstream tools care about sort order, so the sort was unused work and the SAM intermediate was paying ~3x the disk of the equivalent BAM. Pipe BBMap's stdout through `samtools view -b` so the on-disk temp() is already a BAM; point GATK ASM at the BAM; delete the orphan sort rule. On example.yaml (14 samples, ~25-48MB input fastqs each): - gatk_ASM peak RSS: 1045MB -> 715MB (-32%) - gatk_ASM wall: 120s -> 102s (-15%) - mapped temp size: ~150MB -> ~55MB per sample (~3x reduction) - bbmap_map wall: +1.4s/job (samtools pipe overhead, acceptable) Both runs produced 96/96 jobs, identical downstream rosace and enrich score outputs.

Stream BBMap output to BAM instead of writing SAM intermediates

The old `process_counts` rule waited on every `{sample}.variantCounts` then processed all samples sequentially inside one Python invocation. That coupled per-sample work to the slowest sibling and blocked independent retry, caching, and cluster scheduling. Replace with `rule process_sample`, keyed on a `sample_prefix` wildcard. `process_counts.py`'s `_run` now handles exactly one sample identified by `snakemake.wildcards.sample_prefix`; the cross-group/replicate loops that previously lived in `_run` are gone (they only existed to call `process_experiment(sample_list)` for each group, and that function's body had no cross-sample state). Each per-sample invocation re-loads the reference FASTA + designed variants CSV — cheap, tiny files in practice, and any real cost is amortized away by the parallelism. Renamed `process_experiment(sample_list, ...)` → `process_sample( sample_name, ...)`; updated unit tests; replaced the M7 "cross-group caching" regression guard with a "single-invocation reads happen once" guard that's still meaningful per-rule. DAG count goes from 96 → 109 jobs (−1 `process_counts`, +14 `process_sample`). Performance read on `example.yaml`: structural change is a wall-time wash on tiny data (per-sample max 0.76s vs single-rule 0.66s; Python startup × 14 nearly cancels the parallelism gain). The wins land at production scale (per-sample minutes can run concurrently instead of serially) and on cluster execution (each sample is its own scheduling unit with its own retry policy). Both runs produced their expected job counts and identical final rosace + enrich score outputs.

Split process_counts into per-sample jobs

The old `run_rosace` rule was a single per-experiment job that scored every non-baseline condition serially inside one R process. With two conditions on `example.yaml` it was the longest-running step in the pipeline (~108s) and couldn't overlap with itself on multi-core hosts or fan out on a cluster. Split into a per-condition rule keyed on a `{condition}` wildcard. `build_rosace_object` and `run_rosace_for_conditions` collapse to their per-condition equivalents (`build_rosace_object_for_condition`, `run_rosace_for_condition`); the loops over conditions are gone. Empty replicate sets now raise instead of silently skipping — the old warn-and-continue path would silently fail to produce the condition's expected scores CSV anyway, just later in the DAG with a less useful "missing output" error. `renv::activate()` races under parallel invocation: it self-backs-up the project's renv bootstrap dir on every call (rename `renv/library/.../renv` → `.renv-backup-<pid>-<rand>`), so two sibling R processes both try the rename and one fails with "source file does not exist". Replace with a direct `.libPaths()` prepend using `renv::paths$library()` — read-only path resolution, no filesystem mutation, concurrency-safe. `install_rosace.R` remains the one place that actually initializes / populates the project library (via `renv::restore()`). Performance on `example.yaml` (two conditions, 16-core host): - run_rosace wall: 108s (single job) → 69s (max of two parallel) ~36% reduction - run_rosace peak RSS: 1791MB (single) → ~1800MB × 2 concurrent (doubled total, fine on the 24gb host but worth flagging for memory-constrained runs) Both runs produced identical cond_A_scores.csv + cond_B_scores.csv sizes. R + Python test suites pass unchanged (no R test directly exercised build_rosace_object; build_counts_for_replicate, which I didn't touch, has its own coverage).

Split run_rosace per condition; fix renv race that surfaced under it

Mirrors PR #39 for the lilace scoring backend. The previous `run_lilace` rule was a single per-experiment job that scored every non-baseline condition serially inside one R process. Split into a per-condition rule keyed on `{condition}`; each condition runs in its own R process so Stan sampling can overlap on multi-core hosts and fan out on a cluster. The script change is small because `run_lilace_for_condition` was already per-condition — only `main()` needed to read `snakemake@wildcards[["condition"]]` instead of looping over the non-baseline conditions. Empty replicate sets now raise instead of warn-and-return-NULL: under the per-condition rule, returning NULL would silently fail to produce the expected scores CSV and the DAG would die later with a less useful "missing output" error. Apply the same renv concurrency fix as PR #39: replace `renv::activate()` (which self-backs-up the renv bootstrap dir on every call → parallel rename race) with a direct `.libPaths()` prepend via `renv::paths$library()`. install_lilace.R remains the one place that initializes / populates the project library.

Pre-existing bug in run_lilace, unmasked by the per-condition split. Lilace's Stan model rejects NA counts ("Variable 'y' has NA values"), but `bind_rows(replicate_frames)` introduces NAs when sibling replicates have uneven time-point coverage (e.g. cond_A R1 has t=0,1,2,3 but cond_A R2 has t=0,1,2 — R2's rows pick up NA in c_3). The existing per-replicate NA→0 sweep at line 201 only fills gaps WITHIN a single replicate; rows joined ACROSS replicates kept the gap. Under the pre-split code this would have killed cond_A and aborted the whole script, so cond_B never ran. The per-condition split runs each condition in its own R process, so cond_B succeeded independently while cond_A still failed — and that exposed the bug loudly. Fix: extend the same NA→0 sweep over the post-bind_rows model_df. Treating "no measurement for replicate R at time T" as "0 reads observed" is statistically debatable, but it's consistent with how within-replicate gaps are already handled, and it matches the implicit assumption the per-replicate sweep was already making. Verified end-to-end on example.yaml with scoring_backend=lilace: both cond_A_scores.csv and cond_B_scores.csv produced; 3/3 jobs on the rebuild path; pipeline completes cleanly.

Perf/lilace per condition

Three sites still used `rejected_list = rejected_list + [line]` (:557, :566, :622), which allocates a fresh list and copies every prior element on every iteration — O(n²) work as the rejected-reads list grows. Switched to `rejected_list.append(line)` for amortized O(1). Bounded but real win on production data where rejected reads can be 5-20% of the total. The fourth call site at :573 was already using `.append`; this just makes the rest consistent. 150/150 Python tests unchanged.

Use list.append() in process_variants rejected path

read_gatk_csv previously loaded every line of the GATK variantCounts file into a list before returning. process_variants_file then iterated that list exactly once. On production samples (100s of MB per file, millions of rows) the materialization is unbounded memory for no benefit — the consumer can't seek backwards or peek ahead, it just streams forward. Convert read_gatk_csv to a generator that yields padded rows one at a time and update process_variants_file's signature from GatkList (List[List[str]]) to Iterable[List[str]] so type annotations match the new contract. The generator owns the file handle for the duration of consumption — Python's `with` block stays open across yields and closes when the iterator is exhausted or GC'd. Test impact: - tests/unit pass mock_gatk_list as plain Python lists; lists are iterables so process_variants_file still works unchanged. - tests/integration/test_processing_pipeline.py asserted isinstance(result, list) on the read_gatk_csv return; that was over-specification of the contract. Tightened to materialize the iterator into a list explicitly and assert on the row shape. All 150 Python tests pass. Verified end-to-end on example.yaml: 81/81 jobs, 14 variantCounts + 14 per-sample enrich tsvs + both rosace condition scores produced through the streaming path.

Stream GATK CSV reads instead of materializing the whole file

The read-processing chain previously wrote three pairs of compressed FASTQs to disk between four BBTools stages — two of those pairs were read once by the next stage and immediately deleted by Snakemake's temp() handling. Pure scratch churn. Merge bbduk_trim_adapters + remove_contaminants + error_correct_bbmerge into one `trim_clean_correct` rule. The three BBTools tools run concurrently in a shell pipeline, exchanging uncompressed interleaved FASTQ over stdin/stdout. Only the final `.ec.clean.trim.fastq.gz` pair is written to disk; the two transient pairs never exist on storage. Each tool still writes its stats files directly (bbduk's qhist / bhist / gchist / aqhist / lhist / trim.stats.txt / trim_contam.stats.txt, bbmerge's ihist) so MultiQC's auto-discovery and the baseline file list see the same artifacts as before. bbmap_map stays a separate rule. Its 8.5 GB RSS for the BBMap index would push the merged rule into double-digit per-sample memory, its `_map.covstats` is the only stats file that's a load-bearing rule input downstream, and its intra-rule pipe to `samtools view -b` is already optimal. Each BBTools invocation gets a 5/5/6 thread budget so the three JVMs in the pipe stay at ~16 software threads total on the rule's 16-thread reservation; passing t={threads} to all three would oversubscribe (3×16=48). Per-tool stderr logs are preserved via separate `log.bbduk_trim` / `log.bbduk_clean` / `log.bbmerge` outputs. Per-stage Snakemake benchmark granularity is lost — the merged rule gets one benchmark covering all three stages. Acceptable tradeoff; documented here so a future perf investigation knows to temporarily split the rule if per-stage RSS is needed. Verification on example.yaml (14 samples): - 82/82 jobs, 100% complete - 112/112 stats files bit-identical (ignoring BBTools `#` headers) - Wall (sum-of-3-rules → merged-rule): 101s → 91s (-9%) - Disk writes in this stage: 1773 MB → 823 MB (-54%) - Peak RSS per sample: 982 MB → 2196 MB (+124%, three JVMs concurrent; still well below bbmap_map's 8.5 GB peak) Bit-identity caveat: 12/14 sample enrich_format/*.tsv match the pre-refactor outputs exactly. 2/14 (A_R2_T0, B_R1_T3) drift by +1 read each (~15-25 ppm). The drift is BBTools thread non-determinism exposed by the t=5/5/6 budget change, not a logic regression — the old pipeline run twice would likely also drift by single reads on a different pair of samples. The 2-read drift propagates through Stan/ Rosace MCMC to visible per-variant score shifts on those samples. This is acceptable for the use case (DMS pipelines have orders of magnitude more biological noise than this) and the disk-write reduction is the primary win. At prod scale (50 samples × 2 GB pairs) this saves ~85 GB scratch disk per run, addressing the audit's High Priority "Reduce full-size FASTQ intermediates" item.

Stream trim → clean → correct through one rule, no intermediate FASTQs

Track C from the performance audit: replace BBMap (the ~84% wall bottleneck) with minimap2 in short-read mode. Gated by a new config['aligner'] field defaulting to 'bbmap', so existing users see no change until they opt in with aligner: minimap2. How it's wired - common.smk validates config['aligner'] in {bbmap, minimap2}. - ref.smk branches on aligner: bbmap path keeps the existing prepare_bbmap_index rule; minimap2 path adds prepare_minimap2_index which writes ref/minimap2/{experiment}_{ref_digest}.mmi via `minimap2 -x sr -d`. - map.smk branches similarly: the bbmap rule is unchanged; the new map_to_reference_minimap2 rule pipes `minimap2 -ax sr --MD` through `samtools view -b`. The seven BBTools-format stats outputs (covstats / basecov / bincov / ehist / indelhist / mhist / idhist) are produced as empty placeholders so the multiqc_dir rule's load-bearing covstats dependency is satisfied and the baseline file list doesn't break. Real minimap2 QC artifacts (samtools coverage / flagstat) can be plumbed in once the aligner swap is proven. Empirical concordance on example.yaml (14 samples) vs the bbmap baseline (same fixture, same trim_clean_correct upstream): - Rosace cond_A: Pearson r = 0.99938, max |delta| = 0.335 score units, 0/1770 variants drift > 0.5 (gate: >0.99, the lilace-style gate from the audit was met cleanly) - Rosace cond_B: Pearson r = 0.99792 - Enrich2: Pearson r = 0.99312 - variantCounts obs drift: -0.5% to -1.4% across samples, within BBMap's own thread-non-determinism baseline (PR #43 measured ~15-25 ppm between bbmap-to-bbmap runs at different t= settings) Performance on example.yaml: - bbmap_map: 521s sum / 14 samples, 8575 MB peak RSS per sample - minimap2_map: 49s sum / 14 samples (10.7x faster), 451 MB peak RSS per sample (19x less memory) - End-to-end pipeline wall: ~25 min → ~5 min Realistic at prod scale: BBMap's fixed index-loading overhead amortizes on big inputs so the per-byte multiplier shrinks, but minimap2 is still ~3x faster on multi-GB FASTQs per the audit's literature review and remains substantially lower-memory. The 19x RSS reduction is the load-bearing change for cluster sizing — minimap2 fits per-rule budgets that BBMap couldn't. Not in scope for this commit: - Real minimap2 QC artifacts (samtools coverage to populate the placeholder stats files) - strobealign or bwa-mem2 spikes (audit Track C steps 2 and 3) - Default-flipping aligner: minimap2; we keep bbmap as the default until users opt in, so existing experiments are byte-for-byte unchanged

Standalone Python script that runs BBMap and minimap2 on the same paired-end FASTQ + reference and reports wall time, peak RSS (sum of descendants across the shell pipeline), BAM record count, and BAM size. Uses the same alignment flag set the dumpling snakemake rules use so the numbers are comparable to a real pipeline run. Built for the audit's prod-scale question: "the 10x minimap2 ratio we measured on example.yaml — does it hold on multi-GB inputs, or does BBMap's overhead amortize?" Run against `lmna.bb.r{1,2}.fastq.gz` (8 GB pair) or any other fixture and get a real answer in ~minutes. Index building (BBMap and minimap2) happens once per (reference, k-mer) combo and is excluded from the timing. Indices are cached under --workdir for reuse across runs. Peak RSS is measured by polling /proc/<pid>/status VmRSS every 100ms across the bash subprocess AND all of its descendants. Bash pipelines spawn the aligner and samtools as sibling children of bash, so summing descendants catches both. Verified on the example fixture: BBMap reports ~10 GB peak (8.5 GB BBMap + samtools), minimap2 reports ~325 MB. Smoke-tested on example/data/1_S1_L001_R{1,2}_001.fastq.gz against references/example_ref.fasta: BBMap 39s / 10 GB RSS, minimap2 3.7s / 323 MB RSS. Matches the snakemake benchmark numbers within sampling noise. CLI: python tools/bench_aligners.py --r1 R1.fq.gz --r2 R2.fq.gz --ref ref.fasta Flags: --threads --mem --workdir --skip-bbmap --skip-minimap2 --cleanup. Docstring includes a caveat that the wall/RSS ratios reported are reference-size dependent: BBMap's perfect-hash index allocates GB-scale memory regardless of reference length, so on tiny DMS references (~1-10 kb) the ratios are dramatic and they shrink on whole-genome references.

The previous minimap2 rule produced empty `touch`-ed placeholder files for the seven BBTools-format stats outputs (`_map.covstats`, `_map.basecov`, etc.) just to satisfy the multiqc_dir input dependency and the baseline file list. The DAG was happy; MultiQC silently rendered no mapping-stage QC under minimap2. That made the minimap2-vs-bbmap comparison unfair — we lost the QC story entirely when switching aligners. Replace placeholders with real samtools-based QC: the minimap2 rule now runs `samtools stats`, `samtools flagstat`, and `samtools coverage` on the BAM and writes their outputs to stats/{experiment}/{sample}_samtools_{stats,coverage,flagstat}.txt MultiQC's built-in samtools parsers autodiscover these by content (verified end-to-end: HTML contains samtools_stats, samtools_flagstat, and samtools_alignment_plot sections; baseline multiqc_data lists samtools_flagstat rows for all baseline samples). `samtools coverage` requires coordinate-sorted input, so the BAM is piped through `samtools sort` only for that one call. The on-disk BAM stays unsorted — GATK ASM doesn't need sort, and persisting a sorted copy is the work PR #37 deleted (sort_index_samtools rule). Adds ~seconds-to-minutes per-sample for the sort, scaling with BAM size; small price for proper QC. `multiqc_dir` (qc.smk) and `generate_baseline_file_list.py` both branch on `config["aligner"]`: - bbmap: trigger / list the seven `_map.*` BBTools histograms (unchanged). - minimap2: trigger / list the three `_samtools_*.txt` outputs. BBMap's per-position error histograms (`_map.ehist`, `_map.mhist`, `_map.idhist`) have no clean samtools equivalent and are intentionally not reproduced — they were debug detail, not biology-load-bearing DMS QC. Coverage breadth, mean depth, mapping rate, insert size, global error rate, indel size distribution are all still produced. Dry-runs of both aligners resolve cleanly (67 jobs each on example.yaml). Full end-to-end run on `aligner: minimap2` completes 81/81 jobs; all 14 samples produce all three samtools stat files with real content (samtools_stats ~180KB/sample, coverage 1 row/scaffold, flagstat the standard 17-line summary).

Previous commit (239ec37) added samtools coverage to the minimap2 rule to populate the MultiQC coverage section. That call does per-base pileup and costs ~14s per sample on the example fixture's ~60,000x depth — which is representative of real DMS prod data, not a fixture artifact. Per-variant depth × ~20 AAs per position aggregates to 10⁴–10⁵× per-position depth in realistic DMS runs. So the regression is real on prod too, not just on test fixtures. Result: minimap2's wall advantage vs bbmap dropped from 10.7× to 1.2×. Not acceptable. Trade-off taken: drop samtools coverage; keep samtools stats + flagstat (both cheap, ~1s/sample combined). MultiQC still renders the samtools_stats and samtools_flagstat sections (verified end-to-end: `samtools_alignment_plot`, `samtools_flagstat`, `samtools_stats` all present in the rendered HTML). multiqc_dir's per-aligner trigger becomes `_samtools_stats.txt` instead of `_samtools_coverage.txt`. Mean depth is derivable from samtools stats's `bases mapped (cigar)` ÷ reference length if a user needs it; breadth coverage is always ~100% for DMS so its loss is uninteresting. A follow-up task is logged in tasks/tasks.md: re-add per-position coverage via mosdepth (purpose-built for fast deep coverage, ~10× faster than samtools coverage, dedicated MultiQC parser). Until that lands, the bbmap path retains coverage stats (via `_map.covstats`) while the minimap2 path does not. Acceptable interim state. Measured on example.yaml (14 samples) after this change: - minimap2_map wall: 73s sum (5.2s mean per sample) - minimap2_map peak RSS: 455 MB - vs bbmap baseline (521s sum, 37s mean, 8575 MB): 7.2× wall, 18.9× RSS - vs placeholder-only minimap2 (49s sum): +24s overhead for the samtools stats+flagstat we get back 81/81 jobs complete; all 14 samples produce both samtools stat files with real content; MultiQC sections render correctly.

The aligner-comparison bench script is an ad-hoc tool for one-off measurements (e.g. validating prod-scale BBMap vs minimap2 ratios against a real fixture). It's not part of the pipeline contract, the test suite, or anything users run as part of a normal workflow. Tracking it in the repo invited the impression that it's a supported entry point. Remove from git tracking; keep the file in the working tree for current local use. Add the path to .gitignore so future edits don't accidentally get staged. multiqc_compare/ added to .gitignore too — that's another ad-hoc local dir produced by side-by-side MultiQC comparisons that shouldn't be in the repo.

The minimap2 PR added a runtime validation in common.smk that rejects values outside {bbmap, minimap2}, but the JSON schema was silently allowing anything because there's no `additionalProperties: false`. Add the schema entry so: - the canonical config documentation lists the new key with its enum and default - any future schema-aware tooling (autocompletion, docs gen) sees the same allowed values the runtime check enforces - typos like `aligner: bowtie2` get caught at validate() time too, not just at the common.smk validator Description points users at tasks/performance-audit.md Track C for the aligner-swap context. Belt-and-suspenders with the runtime ValueError; the runtime check fires first and gives a sharper error message. Verified: dry-run with aligner=bowtie2 fails cleanly with "Unsupported aligner 'bowtie2'. Expected one of: bbmap, minimap2." Python suite 150/150, R testthat passes, dry-runs of both supported aligners resolve.

MultiQC derives its sample identifier from the cleaned filename: it removes anything in extra_fn_clean_exts from the basename and uses what's left as the sample's s_name. The BBMap path has worked since day one because _trim, _map, _merge were all in extra_fn_clean_exts — so `sampleA_map.covstats` and `sampleA_trim.qhist` both collapse to s_name `sampleA`, producing one general-stats row per sample. The minimap2 path landed earlier in this branch but didn't add its new suffixes to extra_fn_clean_exts. Result: `sampleA_samtools_stats` and `sampleA_samtools_flagstat` were treated as separate samples in the MultiQC general-stats overview — one row per (sample, tool) instead of per sample. Confirmed on the baseline report: 3 baseline samples produced 9 rows (3 each of bare, _samtools_flagstat, _samtools_stats) instead of 3. Fix: add _samtools_stats and _samtools_flagstat to extra_fn_clean_exts. Re-rendered baseline report now shows the expected 3 rows. The bbmap path is unaffected — its filenames already match the existing suffixes. This is a minimap2-only fix.

Add `docs/` to .gitignore for local-only working documents like agent briefings, plugin-contract notes, and other artifacts that inform development but aren't part of the user-facing repo documentation. README, schemas, audit doc etc. still live at the repo root or under tasks/ as before. Sibling to the existing local-only entries (tools/bench_aligners.py, multiqc_compare/) — same intent: keep ephemeral / developer-only material out of git history.

Add tests/unit/test_aligner_config.py — same shape as the existing test_scoring_backend_config.py — pinning four invariants that protect old user configs: 1. The schema document defines `aligner` with the correct type, enum {bbmap, minimap2}, and default "bbmap". Catches accidental property rename or enum drift. 2. `aligner` is NOT in the schema's `required` list. Pre-minimap2 configs don't have the key, so requiring it would lock those users out at validate() time. 3. jsonschema validate accepts a config that omits `aligner`, accepts bbmap and minimap2 explicitly, and rejects unknown values like "bowtie2" with an error message that mentions the offending value or the enum constraint. 4. snakemake.utils.validate (the runtime path common.smk actually uses) injects the default "bbmap" into a config that doesn't mention `aligner`. The test docstring spells out the consequence of regression: an existing user's pipeline would silently switch aligners out from under them. Verified end-to-end on example.yaml: the file has no aligner key, dry- runs schedule the bbmap path (prepare_bbmap_index, map_to_reference_bbmap), explicit `aligner=bowtie2` is rejected with a clear ValueError from common.smk:207. 7/7 new tests pass, full suite is 157/157.

Add minimap2 aligner

Brings the user-facing README in line with the eight perf PRs landed on `performance`. Most user-visible change is documenting the new `aligner: bbmap | minimap2` config option; everything else is staleness cleanup. - Intro paragraph: clarify mapping stage uses BBMap (default) or minimap2 (opt-in, configurable). - Dependencies list: add minimap2 (with a back-ref to the new Aligner choice section) and pysam (used by the inflate / processing helpers). - New "Aligner choice" section under Configuration: side-by-side comparison table (wall, RSS, MultiQC artifacts), notes on prod-scale perf ratio shrinkage, and the per-position coverage gap deferred to mosdepth. - QC metrics section: BBMap reports listed as bbmap-only; samtools stats/flagstat listed as minimap2-only. - Stale path fixes (long overdue): `config/test_config.yaml` → `config/example.yaml`; `schemas/` → `workflow/schemas/`; `designed_variants/test_variants.csv` → `designed_variants/example_variants.csv`; `references/test_ref.fasta` → `references/example_ref.fasta`. None of those `test_*` files have existed for a while. - Working directory tree: corrected to reflect actual layout (schemas under workflow/, multiqc_config.yaml in config/, oligos dir optional and currently empty).

odcambc added 26 commits May 25, 2026 15:21

Merge pull request #37 from odcambc/sam-to-bam

7919a58

Stream BBMap output to BAM instead of writing SAM intermediates

Merge pull request #38 from odcambc/perf/process-counts-per-sample

52b0dbb

Split process_counts into per-sample jobs

Merge pull request #39 from odcambc/perf/rosace-per-condition

52f52da

Split run_rosace per condition; fix renv race that surfaced under it

Merge pull request #40 from odcambc/perf/lilace-per-condition

19e8d6e

Perf/lilace per condition

Merge pull request #41 from odcambc/perf/rejected-list-append

472f209

Use list.append() in process_variants rejected path

Merge pull request #42 from odcambc/perf/stream-gatk-reads

98aebce

Stream GATK CSV reads instead of materializing the whole file

Merge pull request #43 from odcambc/perf/stream-bbtools-chain

ddd63c3

Stream trim → clean → correct through one rule, no intermediate FASTQs

Merge pull request #44 from odcambc/quicken

aba1d75

Add minimap2 aligner

odcambc merged commit a654a5f into main May 26, 2026

odcambc deleted the performance branch May 26, 2026 17:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Release: Performance improvements#45

Release: Performance improvements#45
odcambc merged 26 commits into
mainfrom
performance

odcambc commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

odcambc commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant