Hoist Rosace preprocessing out of the per-replicate loop#35
Merged
Conversation
build_rosace_object called FilterData / ImputeData / NormalizeData / IntegrateData inside the inner replicate loop. For a condition with N replicates, each function ran N times — once per replicate addition — against progressively more populated state. Verified against pimentellab/rosace@main (R/preprocessing.R): - FilterData.AssayGrowth writes back to @CountS in-place (no separate filtered slot). Idempotent on a second back-to-back call: the surviving rows already satisfy the predicates, so re-filtering is a no-op. Also clears @norm.counts as a side effect. - ImputeData.AssayGrowth writes imputed values back to @CountS. impute::impute.knn returns the input unchanged when there are no NAs, so re-imputation on already-imputed data is numerically a no-op. - NormalizeData.AssayGrowth writes to a separate @norm.counts slot and recomputes from raw @CountS on each call. Idempotent. The WT baseline is intra-assay (per-replicate), not pooled across replicates — so once-per-rep vs once-per-key produces identical per-assay norm.counts. - IntegrateData.Rosace rebuilds the AssaySet from scratch on each call (R/preprocessing.R:217-238, `assayset <- AssaySet()`), and AddAssaySetData overwrites by name. Repeated calls just rebuild and replace. So the per-replicate loop produced the same final scores as the per-condition pattern, with O(N²) wasted preprocessing work. The intermediate AssaySets between loop iterations were also visible to any downstream consumer reading @assay.sets, even though the loop overwrote them seconds later. Restructure as two passes per condition: Pass 1 — add every replicate's assay via AddAssayData. Pass 2 — run Filter / Impute / Normalize / Integrate once on the consolidated key, matching the vignette pattern in vignettes/intro_rosace.Rmd lines 65-115. Also guard against the case where every replicate for a condition has empty counts. The old loop silently dodged this because the per-replicate F/I/N/I never ran. With F/I/N/I now hoisted, the guard is explicit: skip with a warning rather than calling IntegrateData on a key with no assays. No behavior change: same scores, less CPU. The "different / more correct scores" framing from the original commit message was wrong — re-scoring historical datasets is not required. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
build_rosace_object called FilterData / ImputeData / NormalizeData / IntegrateData inside the inner replicate loop. For a condition with N replicates, each function ran N times — once per replicate addition — against progressively more populated state.
Verified against pimentellab/rosace@main (R/preprocessing.R):
FilterData.AssayGrowth writes back to @CountS in-place (no separate filtered slot). Idempotent on a second back-to-back call: the surviving rows already satisfy the predicates, so re-filtering is a no-op. Also clears @norm.counts as a side effect.
ImputeData.AssayGrowth writes imputed values back to @CountS. impute::impute.knn returns the input unchanged when there are no NAs, so re-imputation on already-imputed data is numerically a no-op.
NormalizeData.AssayGrowth writes to a separate @norm.counts slot and recomputes from raw @CountS on each call. Idempotent. The WT baseline is intra-assay (per-replicate), not pooled across replicates — so once-per-rep vs once-per-key produces identical per-assay norm.counts.
IntegrateData.Rosace rebuilds the AssaySet from scratch on each call (R/preprocessing.R:217-238,
assayset <- AssaySet()), and AddAssaySetData overwrites by name. Repeated calls just rebuild and replace.So the per-replicate loop produced the same final scores as the per-condition pattern, with O(N²) wasted preprocessing work. The intermediate AssaySets between loop iterations were also visible to any downstream consumer reading @assay.sets, even though the loop overwrote them seconds later.
Restructure as two passes per condition:
Pass 1 — add every replicate's assay via AddAssayData.
Pass 2 — run Filter / Impute / Normalize / Integrate once on the
consolidated key, matching the vignette pattern in
vignettes/intro_rosace.Rmd lines 65-115.
Also guard against the case where every replicate for a condition has empty counts. The old loop silently dodged this because the per-replicate F/I/N/I never ran. With F/I/N/I now hoisted, the guard is explicit: skip with a warning rather than calling IntegrateData on a key with no assays.