Split process_counts into per-sample jobs #38

Merged: odcambc merged 1 commit into performance from perf/process-counts-per-sample on May 25, 2026
@odcambc (Owner) commented May 25, 2026

The old `process_counts` rule waited on every `{sample}.variantCounts`
then processed all samples sequentially inside one Python invocation.
That coupled per-sample work to the slowest sibling and blocked
independent retry, caching, and cluster scheduling.

Replace with `rule process_sample`, keyed on a `sample_prefix` wildcard.
`process_counts.py`'s `_run` now handles exactly one sample identified
by `snakemake.wildcards.sample_prefix`; the cross-group/replicate loops
that previously lived in `_run` are gone (they only existed to call
`process_experiment(sample_list)` for each group, and that function's
body had no cross-sample state). Each per-sample invocation re-loads
the reference FASTA and designed variants CSV. These files are tiny in
practice, so the reload is cheap, and any real cost is offset by the
parallelism.
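
As a rough illustration of the per-sample entry point described above, the sketch below shows `_run` dispatching on a single `sample_prefix` wildcard. The loader functions, the `input` attribute names, and the return value are hypothetical stand-ins, not the actual contents of `process_counts.py`. A `SimpleNamespace` takes the place of the `snakemake` object that Snakemake injects into scripts.

```python
from types import SimpleNamespace


def load_reference(path):
    # Hypothetical loader: stands in for parsing the reference FASTA.
    return f"reference from {path}"


def load_designed_variants(path):
    # Hypothetical loader: stands in for reading the designed variants CSV.
    return f"variants from {path}"


def process_sample(sample_name, reference, designed_variants):
    # Placeholder body: the real function processes one sample's counts.
    return {
        "sample": sample_name,
        "reference": reference,
        "designed_variants": designed_variants,
    }


def _run(snakemake):
    # Exactly one sample per invocation, identified by the wildcard.
    sample_name = snakemake.wildcards.sample_prefix
    # Shared inputs are re-loaded on every invocation (input names assumed).
    reference = load_reference(snakemake.input.reference)
    designed = load_designed_variants(snakemake.input.designed_variants)
    return process_sample(sample_name, reference, designed)


# Stand-in for the object Snakemake provides to a script.
fake_snakemake = SimpleNamespace(
    wildcards=SimpleNamespace(sample_prefix="sampleA"),
    input=SimpleNamespace(
        reference="ref.fasta",
        designed_variants="designed.csv",
    ),
)
result = _run(fake_snakemake)
```

Because `_run` takes the object as a parameter in this sketch, the same function can be exercised from a unit test without a Snakemake run.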

Renamed `process_experiment(sample_list, ...)` →
`process_sample(sample_name, ...)`; updated unit tests; replaced the M7
"cross-group caching" regression guard with a "single-invocation
reads happen once" guard that's still meaningful per-rule. DAG count
goes from 96 → 109 jobs (−1 `process_counts`, +14 `process_sample`).
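
A guard of the "single-invocation reads happen once" kind might look roughly like the sketch below. It is not the repository's test: `run_one_sample` and its loader parameters are hypothetical, and `unittest.mock.Mock` call counts stand in for whatever the real guard inspects.

```python
from unittest.mock import Mock


def run_one_sample(sample_name, load_reference, load_designed_variants):
    # Hypothetical per-invocation entry point: load shared inputs once,
    # then reuse them for everything done with this sample.
    reference = load_reference()
    designed = load_designed_variants()
    return {
        "sample": sample_name,
        "reference": reference,
        "n_designed": len(designed),
    }


def test_reads_happen_once_per_invocation():
    load_reference = Mock(return_value="ACGT")
    load_designed = Mock(return_value=["v1", "v2"])

    out = run_one_sample("sampleA", load_reference, load_designed)

    # The guard: each shared input is read exactly once per invocation.
    assert load_reference.call_count == 1
    assert load_designed.call_count == 1
    assert out["n_designed"] == 2


test_reads_happen_once_per_invocation()
```

The assertion is scoped to one invocation, so it stays meaningful even though 14 separate `process_sample` jobs each perform their own reads.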

Performance read on `example.yaml`: structural change is a wall-time
wash on tiny data (per-sample max 0.76s vs single-rule 0.66s; Python
startup × 14 nearly cancels the parallelism gain). The wins land at
production scale (per-sample minutes can run concurrently instead of
serially) and on cluster execution (each sample is its own scheduling
unit with its own retry policy). Both runs produced their expected
job counts and identical final rosace + enrich score outputs.
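
The trade-off above can be made concrete with a toy wall-time model. Everything in it is an illustrative assumption rather than a measurement from this PR: the start-up cost, the per-sample work times, and the idealized scheduler (greedy placement on a fixed number of cores) are made up to show the shape of the argument.

```python
def serial_wall_time(work_times, startup):
    # One Python invocation: pay start-up once, process samples in sequence.
    return startup + sum(work_times)


def parallel_wall_time(work_times, startup, cores):
    # One invocation per sample: each job pays start-up. Jobs are placed
    # greedily on the least-loaded core, longest job first.
    loads = [0.0] * cores
    for t in sorted(work_times, reverse=True):
        i = loads.index(min(loads))
        loads[i] += startup + t
    return max(loads)


n_samples = 14   # matches the +14 process_sample jobs in this PR
startup = 0.6    # assumed Python start-up cost, in seconds

# Hypothetical tiny dataset: start-up dominates the per-sample work.
tiny = [0.005] * n_samples
# Hypothetical production dataset: minutes of work per sample.
large = [300.0] * n_samples

tiny_serial = serial_wall_time(tiny, startup)
tiny_parallel = parallel_wall_time(tiny, startup, cores=n_samples)
large_serial = serial_wall_time(large, startup)
large_parallel = parallel_wall_time(large, startup, cores=n_samples)

print(f"tiny:  serial {tiny_serial:.2f}s vs parallel {tiny_parallel:.2f}s")
print(f"large: serial {large_serial:.1f}s vs parallel {large_parallel:.1f}s")
```

Under these assumed numbers the tiny case differs by well under a factor of two, while the large case approaches a 14-fold reduction. Real runs also carry scheduler overhead that this model ignores.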
@odcambc merged commit 52b0dbb into performance on May 25, 2026
@odcambc deleted the perf/process-counts-per-sample branch on May 25, 2026 17:13