Skip to content

Hoist Rosace preprocessing out of the per-replicate loop#35

Merged
odcambc merged 1 commit into
mainfrom
rosace-integration
May 25, 2026
Merged

Hoist Rosace preprocessing out of the per-replicate loop#35
odcambc merged 1 commit into
mainfrom
rosace-integration

Conversation

@odcambc
Copy link
Copy Markdown
Owner

@odcambc odcambc commented May 25, 2026

build_rosace_object called FilterData / ImputeData / NormalizeData / IntegrateData inside the inner replicate loop. For a condition with N replicates, each function ran N times — once per replicate addition — against progressively more populated state.

Verified against pimentellab/rosace@main (R/preprocessing.R):

  • FilterData.AssayGrowth writes back to @CountS in-place (no separate filtered slot). Idempotent on a second back-to-back call: the surviving rows already satisfy the predicates, so re-filtering is a no-op. Also clears @norm.counts as a side effect.

  • ImputeData.AssayGrowth writes imputed values back to @CountS. impute::impute.knn returns the input unchanged when there are no NAs, so re-imputation on already-imputed data is numerically a no-op.

  • NormalizeData.AssayGrowth writes to a separate @norm.counts slot and recomputes from raw @CountS on each call. Idempotent. The WT baseline is intra-assay (per-replicate), not pooled across replicates — so once-per-rep vs once-per-key produces identical per-assay norm.counts.

  • IntegrateData.Rosace rebuilds the AssaySet from scratch on each call (R/preprocessing.R:217-238, assayset <- AssaySet()), and AddAssaySetData overwrites by name. Repeated calls just rebuild and replace.

So the per-replicate loop produced the same final scores as the per-condition pattern, with O(N²) wasted preprocessing work. The intermediate AssaySets between loop iterations were also visible to any downstream consumer reading @assay.sets, even though the loop overwrote them seconds later.

Restructure as two passes per condition:
Pass 1 — add every replicate's assay via AddAssayData.
Pass 2 — run Filter / Impute / Normalize / Integrate once on the
consolidated key, matching the vignette pattern in
vignettes/intro_rosace.Rmd lines 65-115.

Also guard against the case where every replicate for a condition has empty counts. The old loop silently dodged this because the per-replicate F/I/N/I never ran. With F/I/N/I now hoisted, the guard is explicit: skip with a warning rather than calling IntegrateData on a key with no assays.

build_rosace_object called FilterData / ImputeData / NormalizeData /
IntegrateData inside the inner replicate loop. For a condition with N
replicates, each function ran N times — once per replicate addition —
against progressively more populated state.

Verified against pimentellab/rosace@main (R/preprocessing.R):

  - FilterData.AssayGrowth writes back to @CountS in-place (no separate
    filtered slot). Idempotent on a second back-to-back call: the
    surviving rows already satisfy the predicates, so re-filtering is a
    no-op. Also clears @norm.counts as a side effect.

  - ImputeData.AssayGrowth writes imputed values back to @CountS.
    impute::impute.knn returns the input unchanged when there are no
    NAs, so re-imputation on already-imputed data is numerically a
    no-op.

  - NormalizeData.AssayGrowth writes to a separate @norm.counts slot
    and recomputes from raw @CountS on each call. Idempotent. The
    WT baseline is intra-assay (per-replicate), not pooled across
    replicates — so once-per-rep vs once-per-key produces identical
    per-assay norm.counts.

  - IntegrateData.Rosace rebuilds the AssaySet from scratch on each
    call (R/preprocessing.R:217-238, `assayset <- AssaySet()`), and
    AddAssaySetData overwrites by name. Repeated calls just rebuild
    and replace.

So the per-replicate loop produced the same final scores as the
per-condition pattern, with O(N²) wasted preprocessing work. The
intermediate AssaySets between loop iterations were also visible to
any downstream consumer reading @assay.sets, even though the loop
overwrote them seconds later.

Restructure as two passes per condition:
  Pass 1 — add every replicate's assay via AddAssayData.
  Pass 2 — run Filter / Impute / Normalize / Integrate once on the
           consolidated key, matching the vignette pattern in
           vignettes/intro_rosace.Rmd lines 65-115.

Also guard against the case where every replicate for a condition
has empty counts. The old loop silently dodged this because the
per-replicate F/I/N/I never ran. With F/I/N/I now hoisted, the
guard is explicit: skip with a warning rather than calling
IntegrateData on a key with no assays.

No behavior change: same scores, less CPU. The "different / more
correct scores" framing from the original commit message was wrong —
re-scoring historical datasets is not required.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@odcambc odcambc merged commit e83d00d into main May 25, 2026
@odcambc odcambc deleted the rosace-integration branch May 25, 2026 03:45
@odcambc odcambc restored the rosace-integration branch May 25, 2026 03:49
@odcambc odcambc deleted the rosace-integration branch May 25, 2026 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant