Split process_counts into per-sample jobs#38
Merged
Conversation
The old `process_counts` rule waited on every `{sample}.variantCounts`
then processed all samples sequentially inside one Python invocation.
That coupled per-sample work to the slowest sibling and blocked
independent retry, caching, and cluster scheduling.
Replace with `rule process_sample`, keyed on a `sample_prefix` wildcard.
`process_counts.py`'s `_run` now handles exactly one sample identified
by `snakemake.wildcards.sample_prefix`; the cross-group/replicate loops
that previously lived in `_run` are gone (they only existed to call
`process_experiment(sample_list)` for each group, and that function's
body had no cross-sample state). Each per-sample invocation re-loads
the reference FASTA + designed variants CSV — cheap, tiny files in
practice, and any real cost is amortized away by the parallelism.
Renamed `process_experiment(sample_list, ...)` → `process_sample(
sample_name, ...)`; updated unit tests; replaced the M7
"cross-group caching" regression guard with a "single-invocation
reads happen once" guard that's still meaningful per-rule. DAG count
goes from 96 → 109 jobs (−1 `process_counts`, +14 `process_sample`).
Performance read on `example.yaml`: structural change is a wall-time
wash on tiny data (per-sample max 0.76s vs single-rule 0.66s; Python
startup × 14 nearly cancels the parallelism gain). The wins land at
production scale (per-sample minutes can run concurrently instead of
serially) and on cluster execution (each sample is its own scheduling
unit with its own retry policy). Both runs produced their expected
job counts and identical final rosace + enrich score outputs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The old
process_countsrule waited on every{sample}.variantCountsthen processed all samples sequentially inside one Python invocation. That coupled per-sample work to the slowest sibling and blocked independent retry, caching, and cluster scheduling.Replace with
rule process_sample, keyed on asample_prefixwildcard.process_counts.py's_runnow handles exactly one sample identified bysnakemake.wildcards.sample_prefix; the cross-group/replicate loops that previously lived in_runare gone (they only existed to callprocess_experiment(sample_list)for each group, and that function's body had no cross-sample state). Each per-sample invocation re-loads the reference FASTA + designed variants CSV — cheap, tiny files in practice, and any real cost is amortized away by the parallelism.Renamed
process_experiment(sample_list, ...)→process_sample( sample_name, ...); updated unit tests; replaced the M7 "cross-group caching" regression guard with a "single-invocation reads happen once" guard that's still meaningful per-rule. DAG count goes from 96 → 109 jobs (−1process_counts, +14process_sample).Performance read on
example.yaml: structural change is a wall-time wash on tiny data (per-sample max 0.76s vs single-rule 0.66s; Python startup × 14 nearly cancels the parallelism gain). The wins land at production scale (per-sample minutes can run concurrently instead of serially) and on cluster execution (each sample is its own scheduling unit with its own retry policy). Both runs produced their expected job counts and identical final rosace + enrich score outputs.