Split process_counts into per-sample jobs by odcambc · Pull Request #38 · odcambc/dumpling

odcambc · 2026-05-25T17:12:55Z

The old `process_counts` rule waited on every `{sample}.variantCounts`, then processed all samples sequentially inside one Python invocation. That coupled per-sample work to the slowest sibling and blocked independent retry, caching, and cluster scheduling.

Replace it with `rule process_sample`, keyed on a `sample_prefix` wildcard. `_run` in `process_counts.py` now handles exactly one sample, identified by `snakemake.wildcards.sample_prefix`. The cross-group/replicate loops that previously lived in `_run` are gone: they only existed to call `process_experiment(sample_list)` for each group, and that function's body had no cross-sample state. Each per-sample invocation re-loads the reference FASTA and the designed variants CSV. These files are tiny in practice, so the reload is cheap, and any real cost is amortized by the parallelism.
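The per-sample shape described above can be sketched as follows. This is a minimal illustration, not the code in `process_counts.py`: only the names `_run`, `process_sample`, and `snakemake.wildcards.sample_prefix` come from this PR. The loader helpers, the `reference`/`designed` input names, the `variant` CSV column, the `counts` argument, and the filtering logic are hypothetical stand-ins.

```python
import csv
import tempfile
from pathlib import Path
from types import SimpleNamespace


def load_reference(fasta_path):
    # Hypothetical helper: concatenate the sequence lines of a one-record FASTA.
    lines = Path(fasta_path).read_text().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def load_designed_variants(csv_path):
    # Hypothetical helper: read variant names from an assumed "variant" column.
    with open(csv_path, newline="") as handle:
        return {row["variant"] for row in csv.DictReader(handle)}


def process_sample(sample_name, reference, designed, counts):
    # Illustrative body: keep only counts for designed variants.
    # Nothing here depends on any other sample.
    kept = {name: n for name, n in counts.items() if name in designed}
    return {"sample": sample_name, "ref_length": len(reference), "counts": kept}


def _run(snakemake, counts):
    # One invocation handles exactly one sample, named by the wildcard.
    # The shared inputs are re-loaded on every invocation.
    sample_name = snakemake.wildcards.sample_prefix
    reference = load_reference(snakemake.input.reference)
    designed = load_designed_variants(snakemake.input.designed)
    return process_sample(sample_name, reference, designed, counts)


# Stand-in for the object Snakemake injects into a script (assumed layout).
workdir = Path(tempfile.mkdtemp())
(workdir / "ref.fasta").write_text(">ref\nATGGCC\nTAA\n")
(workdir / "designed.csv").write_text("variant\nA2T\nG3C\n")
fake_snakemake = SimpleNamespace(
    wildcards=SimpleNamespace(sample_prefix="sample_A"),
    input=SimpleNamespace(
        reference=workdir / "ref.fasta",
        designed=workdir / "designed.csv",
    ),
)
result = _run(fake_snakemake, counts={"A2T": 12, "X9Y": 3})
```

Because `_run` reads only its own wildcard and its own inputs, a failed sample can be retried or rescheduled without touching its siblings.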

Renamed `process_experiment(sample_list, ...)` → `process_sample(sample_name, ...)`; updated unit tests; replaced the M7 "cross-group caching" regression guard with a "single-invocation reads happen once" guard that is still meaningful per-rule. DAG count goes from 96 → 109 jobs (−1 `process_counts`, +14 `process_sample`).
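A "reads happen once" guard of the kind mentioned above could look roughly like this. It is a sketch under assumptions, not the test added in this PR: `Loaders`, `run_one_sample`, and the file names are hypothetical, and the real guard may count reads differently.

```python
from unittest import mock


class Loaders:
    # Hypothetical stand-ins for the reference and designed-variant readers.
    @staticmethod
    def load_reference(path):
        return "ATGGCCTAA"

    @staticmethod
    def load_designed_variants(path):
        return {"A2T", "G3C"}


def run_one_sample(sample_name, loaders):
    # Hypothetical per-sample entry point: each shared input is read once.
    reference = loaders.load_reference("ref.fasta")
    designed = loaders.load_designed_variants("designed.csv")
    return sample_name, len(reference), len(designed)


def test_single_invocation_reads_happen_once():
    # Mock(wraps=...) passes calls through to the real loaders while
    # recording how many times each one was called.
    loaders = mock.Mock(wraps=Loaders)
    run_one_sample("sample_A", loaders)
    assert loaders.load_reference.call_count == 1
    assert loaders.load_designed_variants.call_count == 1
```

The guard stays meaningful per-rule because it only checks that a single invocation does not re-read its inputs. It makes no claim about caching across samples.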

Performance read on `example.yaml`: the structural change is a wall-time wash on tiny data (per-sample max 0.76s vs single-rule 0.66s; Python startup × 14 nearly cancels the parallelism gain). The wins land at production scale (per-sample work that takes minutes can run concurrently instead of serially) and on cluster execution (each sample is its own scheduling unit with its own retry policy). Both runs produced their expected job counts and identical final rosace + enrich score outputs.
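The trade-off between startup overhead and parallelism can be illustrated with a toy wall-time model. This is a rough sketch, not a measurement: the startup and per-sample durations below are made-up illustrative values, the greedy scheduler is a simplification of what Snakemake or a cluster would do, and the function names are hypothetical.

```python
def serial_wall_time(startup, per_sample):
    # One Python process: pay startup once, then run samples back to back.
    return startup + sum(per_sample)


def parallel_wall_time(startup, per_sample, cores):
    # One process per sample, each paying its own startup cost.
    # Jobs are assigned greedily to the least-loaded core.
    loads = [0.0] * cores
    for work in sorted(per_sample, reverse=True):
        idx = loads.index(min(loads))
        loads[idx] += startup + work
    return max(loads)


STARTUP = 0.5             # assumed interpreter + import time, in seconds
tiny = [0.0625] * 14      # assumed per-sample work on toy data
production = [120.0] * 14  # assumed per-sample work at production scale

tiny_serial = serial_wall_time(STARTUP, tiny)
tiny_two_cores = parallel_wall_time(STARTUP, tiny, cores=2)
prod_serial = serial_wall_time(STARTUP, production)
prod_parallel = parallel_wall_time(STARTUP, production, cores=14)
```

Under these assumed numbers, repeated startup dominates when per-sample work is tiny and cores are scarce, so splitting gains little or even loses. Once per-sample work reaches minutes, the serial sum dwarfs the per-job startup cost.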

odcambc merged commit 52b0dbb into performance May 25, 2026

odcambc deleted the perf/process-counts-per-sample branch May 25, 2026 17:13
