refactor(container-runtime): extract feature/manager classes from ContainerRuntime#27185
Conversation
…ispatch
Add an internal RuntimeFeatureHost facade that ContainerRuntime uses to drive
lifecycle phases. Subsystems register callbacks against the host and the
runtime invokes phases by name, replacing imperative ordering in loadRuntime2.
This is intentionally minimal: the host exposes only `on(phase, callback)` and
`runPhase(phase)`. Op routing, summary contribution, metadata access, and
dependency resolution are deliberately out of scope; each is its own follow-up
once a real subsystem migrates to use them.
Lifecycle phases: construct, loadFromSnapshot, loadPendingAttachments,
applyStashedOps, ready, connect, disconnect, dispose. The first six are
one-shot per runtime lifetime; connect/disconnect alternate.
Changes:
- runtime-definitions: new internal RuntimeFeatureHost interface and
RuntimeFeatureLifecyclePhase type
- container-runtime: new RuntimeFeatureHostImpl class with unit tests
- container-runtime: ContainerRuntime owns a host instance; loadRuntime2 calls
initializeBaseState and applyStashedOpsAt via host.runPhase("loadFromSnapshot")
and host.runPhase("applyStashedOps") instead of inline awaits
Verification: all 140 existing container-runtime tests pass; new host has 7
dedicated unit tests; tsc + biome + eslint + api-extractor pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the one-shot vs repeating phase distinction part of the API, not internal state. Phases now type-split into: - OneShotLifecyclePhase (construct, loadFromSnapshot, loadPendingAttachments, applyStashedOps, ready, dispose) — registered via host.once() - RepeatingLifecyclePhase (connect, disconnect) — registered via host.on() Caller signals intent with the method name; type system rejects mismatches at the registration boundary instead of throwing at runtime. Mirrors Node EventEmitter conventions. Internally the host keeps separate callback maps and reuses the existing fired-one-shot tracking only for the once path. Repeating phases fire any number of times. containerRuntime.ts call sites updated to use once for loadFromSnapshot and applyStashedOps. New test added: callbacks registered after a repeating phase fires are picked up on subsequent invocations. All 8 host tests pass; 140 existing container-runtime tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lift the summarizer-related machinery (SummaryManager, SummarizerClientElection,
and the per-client Summarizer) out of ContainerRuntime into a dedicated
subsystem class. The subsystem self-registers a once("loadFromSnapshot")
callback on the RuntimeFeatureHost and runs its initialization there.
Net effect on ContainerRuntime:
- 135-line initializeSummarizer method removed wholesale
- 3 field declarations removed (summaryManager, summarizerClientElection,
_summarizer); replaced with a single summarizerSubsystem field
- All ~15 access sites updated to go through the subsystem's accessors
- dispose() collapses 3 disposal lines into a single subsystem.dispose() call
containerRuntime.ts shrinks by ~140 lines net. The summarizer code is now
contained in summary/summarizerSubsystem.ts where it can evolve independently.
The subsystem takes a deps object rather than reaching into ContainerRuntime
directly. The runtime ref is still passed (Summarizer / SummaryManager require
it as ISummarizerRuntime / IConnectedState / ISummarizerInternalsProvider) —
that coupling is preserved, not introduced by this refactor.
Verification: all 140 existing container-runtime tests pass; 8 host tests
pass; tsc + biome + eslint + api-extractor clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nterface
Replaces the RuntimeFeatureHost (subsystems consume) with IRuntimeFeature
(runtime calls into subsystems). Subsystems implement optional lifecycle
methods; the runtime drives the lifecycle by calling them at the right
moments.
Hollywood principle ("don't call us, we'll call you") — same shape as React
class lifecycle methods or Express middleware. Each interface method has its
own typed signature, which is what the host pattern couldn't express cleanly.
Composite pattern: RuntimeFeatureCollection implements
Required<IRuntimeFeature>. ContainerRuntime holds one collection, calls a
single method on it (e.g. `this.features.dispose()`), and the collection
fans out to its members. The runtime never writes a feature-iteration loop.
Files:
- runtime-definitions/src/runtimeFeature.ts: replaced
RuntimeFeatureHost/Phase types with the new IRuntimeFeature interface
- container-runtime/src/runtimeFeatureCollection.ts: new — Composite
implementing Required<IRuntimeFeature>
- container-runtime/src/test/runtimeFeatureCollection.spec.ts: new — 6 unit
tests for the collection
- container-runtime/src/runtimeFeatureHost.ts + spec: deleted
- container-runtime/src/summary/summarizerSubsystem.ts: now `implements
IRuntimeFeature`; private `initialize` becomes public `onLoadFromSnapshot`;
no longer takes a host in its deps
- container-runtime/src/containerRuntime.ts: replaces `host: RuntimeFeatureHostImpl`
with `features: RuntimeFeatureCollection`; load-time host.once+runPhase
pairs become direct `await this.features.onLoadFromSnapshot()` /
`await this.features.onApplyStashedOps(seqNum)` calls; dispose chain
collapses summarizer disposal into `this.features.dispose()`
Verification: 146 tests pass (140 existing + 6 new collection tests);
tsc + biome + eslint + api-extractor clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ection Wraps the existing IGarbageCollector in a thin GarbageCollectionSubsystem that implements IRuntimeFeature, lifting GC's lifecycle plumbing out of ContainerRuntime into the feature collection. Removed from ContainerRuntime: - await this.garbageCollector.initializeBaseState() in initializeBaseState - this.garbageCollector.dispose() in dispose chain - this.garbageCollector.setConnectionState(canSendOps, clientId) in setConnectionStateCore These calls are now driven by the runtime feature collection's onLoadFromSnapshot, dispose, and onConnectionStateChange respectively. The underlying IGarbageCollector is still constructed inline (its construction takes many runtime-internal deps) and still accessed directly for non-lifecycle operations (nodeUpdated callbacks, summary contribution, GC data queries, etc.). Migrating those will require interface extensions (contributeSummary, contributeMetadata, node-activity observation hooks). Interface evolution: - IRuntimeFeature.onConnect/onDisconnect collapsed into a single onConnectionStateChange(canSendOps, clientId) — matches the runtime's setConnectionState signature, which fires for both connect/disconnect transitions AND read-only toggles. The two-method shape was lossy. Verification: 146 tests pass; tsc + biome + eslint + api-extractor clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ture collection PendingStateManager, InboundBatchAggregator, and DeltaScheduler each have only a `dispose()` method that the runtime needed to call during shutdown. Their dispose() signature matches IRuntimeFeature.dispose, so they can be registered directly with the feature collection via structural typing — no wrapper class needed. Removed three explicit dispose() calls from ContainerRuntime.dispose: - this.pendingStateManager.dispose() - this.inboundBatchAggregator.dispose() - this.deltaScheduler.dispose() These are now driven by `this.features.dispose()`. Order: features (now including these three) → channelCollection → _deltaManager. PSM/inbound/ deltaScheduler now dispose before channelCollection rather than after; safe because none of them depend on channelCollection (PSM clears internal queues, the other two unsubscribe from deltaManager which is still alive). Verification: 146 tests pass; tsc + biome + eslint + api-extractor clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d fields RuntimeFeatureCollection.add is now generic (`add<T>(feature: T): T`) and returns the feature so callers can chain registration with assignment: this.foo = this.features.add(new FooFeature(...)); Updated registration sites in ContainerRuntime to the chained form. Two fields (deltaScheduler, inboundBatchAggregator) had no remaining non-registration uses, so the field declarations were removed — they're now anonymous within the feature collection. Verification: builds clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `contributeSummary` to IRuntimeFeature so the runtime can collect summary contributions through the feature collection instead of calling each subsystem directly. Implementation pushed down into the underlying classes (no thin wrapper subsystem layer): - GarbageCollector now implements IRuntimeFeature directly. Adds onLoadFromSnapshot (delegates to initializeBaseState), onConnectionStateChange (delegates to setConnectionState), and contributeSummary (wraps existing summarize + addSummarizeResultToSummary). The previous GarbageCollectionSubsystem wrapper is deleted. - BlobManager now implements IRuntimeFeature directly. Adds contributeSummary (wraps existing summarize + addSummarizeResultToSummary). ContainerRuntime change: the manual summary contribution block const blobManagerSummary = this.blobManager.summarize(); if (...) addSummarizeResultToSummary(summaryTree, blobsTreeName, blobManagerSummary); const gcSummary = this.garbageCollector.summarize(fullTree, trackState, ...); if (gcSummary !== undefined) addSummarizeResultToSummary(summaryTree, gcTreeKey, gcSummary); becomes a single call: this.features.contributeSummary(summaryTree, fullTree, trackState, telemetryContext); Both subsystems are now added to the feature collection directly (`this.features.add(this.garbageCollector)`, `this.features.add(this.blobManager)`) without intermediate wrapper instances. Verification: 147 tests pass; tsc + biome + eslint + api-extractor clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…mmary blob Moves the elected-summarizer blob serialization out of ContainerRuntime.addContainerStateToSummary into SummarizerSubsystem's contributeSummary method. The runtime no longer needs to reach into this.summarizerSubsystem.summarizerClientElection from outside the subsystem. Net effect: 6 lines removed from addContainerStateToSummary; the elected summarizer contribution is now owned by the subsystem that produces it. Verification: 147 tests pass; tsc + biome + eslint + api-extractor clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ndleOp Adds handleOp to IRuntimeFeature; the runtime's switch in validateAndProcessRuntimeMessages now delegates to features.handleOp first, then falls through to a residual switch for IdAllocation/ChunkedOp/Rejoin (types not yet owned by features). Each subsystem owns its own message-type check + content cast (instead of the runtime doing blind casts at the dispatch site): - ChannelCollection: claims FluidDataStoreOp/Attach/Alias - BlobManager: claims BlobAttach - GarbageCollector: claims GC - DocumentsSchemaController: claims DocumentSchemaChange ChannelCollection and DocumentsSchemaController now also implement the IRuntimeFeature interface and are registered with the feature collection; their setConnectionState / dispose paths route through features now. Also moves IRuntimeFeature from runtime-definitions to container-runtime — it's purely an internal-runtime contract, and the move avoids dragging container-runtime-internal message types through runtime-definitions's api- extractor surface. RuntimeFeatureCollection grows a `replace(old, new)` API for test fixtures that swap subsystem instances. Two existing tests in containerRuntime.spec.ts (`patchRuntime` cases) skipped with TODO comments — they replace `runtime.channelCollection` directly but the runtime now dispatches via `runtime.features` which holds the original reference. Update patchRuntime to use `features.replace(...)` to re-enable. Verification: 146 tests pass, 2 pending (skipped); tsc + biome + eslint + api-extractor clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hrough features Move the applyStashedOp, reSubmit, and rollbackStagedChange dispatch tables out of containerRuntime.ts and onto the features themselves. Each feature (channelCollection, blobManager, garbageCollector, documentsSchemaController) now owns its own claim logic — the runtime just iterates features via RuntimeFeatureCollection. Three new optional methods on IRuntimeFeature: applyStashedOp, reSubmitOp, and rollbackStagedOp. The runtime calls features.applyStashedOp / .reSubmitOp / .rollbackStagedOp; each dispatcher walks the feature list and returns the first claim. Residual switches in the runtime now only cover the message types no feature owns yet (IdAllocation, Rejoin). Test fixtures (stubChannelCollection, patchRuntime, patchContainerRuntime) updated to call features.replace so feature dispatch lands on the stubs. The two previously-skipped pending-state-progress tests are re-enabled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the getClientId callback wired in from ContainerRuntime. The aggregator now implements IRuntimeFeature.onConnectionStateChange and keeps its own currentClientId, so the runtime no longer needs to plumb a closure into it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sor as feature Two more chunks of complexity moved out of containerRuntime.ts: 1. ChannelCollection.notifyStagingMode renamed onStagingModeChange and reached via features.onStagingModeChange broadcast. Promotes the staging transition from a single-target call into an extensible lifecycle hook any feature can subscribe to. 2. New IdCompressorFeature owns the IdAllocation arms previously living in the residual switches of validateAndProcessRuntimeMessages, applyStashedOp, and reSubmit. The compressor's finalize bookkeeping stays on ContainerRuntime for now and is reached via a callback; what moves is the message routing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…p routing The last residual switches in validateAndProcessRuntimeMessages, applyStashedOp, and reSubmit collapse to a single fail-fast: if no feature claims the op, the runtime closes with an unknown-type error. Rejoin handling (and the ChunkedOp guard) move into a small RuntimeOpsFeature. The runtime's three message dispatch paths now follow the same shape — features.<verb>Op decides ownership; runtime only handles the unknown case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…atch Pass features into PendingStateManager so it can dispatch stashed ops through features.applyStashedOp directly. Removes the runtime's applyStashedOp wrapper and parseLocalOpContent helper (-35 lines from containerRuntime.ts). The runtime's IRuntimeStateHandler bridge gains closeFn (so PSM can fail the container on unknown-type ops) and loses applyStashedOp. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ture contributeSummary addContainerStateToSummary loses two more inline contributions: - IdCompressor blob now contributed by IdCompressorFeature.contributeSummary; feature takes a getter for the compressor instead of a hasIdCompressor bool. - Alias blob now contributed by ChannelCollection.contributeSummary. addContainerStateToSummary still composes container metadata (lastMessage, GC metadata, schema controller, summaryNumber) and routes the chunks / recentBatchInfo blobs that don't yet belong to a feature. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etector contribute their own summary blobs The remaining inline contributions in addContainerStateToSummary move into the subsystems that own them: - RemoteMessageProcessor.contributeSummary writes the .chunks blob. - DuplicateBatchDetector.contributeSummary writes the .recentBatchInfo blob. Both classes now implement IRuntimeFeature and are added to the feature collection. addContainerStateToSummary collapses to two calls: addMetadataToSummary (top-level container metadata) and features.contributeSummary (everything else). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move the consecutive-reconnects-with-no-progress detector out of ContainerRuntime into a dedicated class. Drops two fields (maxConsecutiveReconnects, consecutiveReconnects) and two methods (shouldContinueReconnecting, resetReconnectCount) from the runtime — ~40 lines out of containerRuntime.ts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nConnectionStateChange Implement IRuntimeFeature on SignalTelemetryManager and register it with the feature collection. The manager now resets its own tracking on each disconnect→connect transition, so the runtime no longer needs the explicit resetTracking call inside setConnectionStateCore. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… owns the IdAllocation outbound message Move the inline closure that Outbox uses to mint an IdAllocation LocalBatchMessage out of the runtime constructor and into the IdCompressorFeature. The outbox callback collapses to a one-line delegation, the feature now owns the build-the-op shape concern, and the runtime drops a 17-line closure. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e + load lifecycle Pull the entire IdCompressor concern out of ContainerRuntime: the compressor instance, the pending-ranges queue accumulated before delayed load, the on-boot vs delayed load paths, and the IdAllocation message finalization all move into IdCompressorFeature. ContainerRuntime keeps thin getters (idCompressor, generateDocumentUniqueId, loadIdCompressor) that delegate to the feature, plus skipSavedCompressorOps which is still computed on the runtime since it depends on pendingRuntimeState. Net: 94 lines off containerRuntime.ts; the runtime no longer carries _idCompressor, pendingIdCompressorOps, processIdCompressorMessages, or the ad-hoc PerformanceEvent.timedExec calls for the two load paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull the per-loadingGroupId snapshot cache + catch-up wait flow out of ContainerRuntime into a dedicated class. The runtime keeps a thin delegating method (getSnapshotForLoadingGroupId) and drops: - snapshotCacheForLoadingGroupIds field - the 100-line getSnapshotForLoadingGroupId body - getSnapshotTreeForPath helper containerRuntime.ts: 5141 → 5023 lines (-118). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull the ContainerExtension store out of ContainerRuntime into a dedicated class: - extensions Map (state) - lazyEventsForExtensions (lazy event emitter wired on first acquire) - acquireExtension / getExtension / acquireExtensionInternal - containerRuntimeCompatDetailsForContainerExtensions (layer-compat const) ContainerRuntime keeps thin acquireExtension/getExtension delegators and the bindRuntimeEventsToExtensionEmitter helper that bridges legacy EventEmitter events into the host's typed emitter. processSignal dispatch now goes through ExtensionsManager.processSignal. containerRuntime.ts: 5023 → 4905 lines (-118). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The runtime's private loadIdCompressor was a one-line forwarder to idCompressorFeature.loadDelayed(). Drop the wrapper and call the feature directly at the three sites that used it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Hi! Thank you for opening this PR. Want me to review it? Based on the diff (3197 lines, 23 files), I've queued these reviewers:
Toggle the reviewer checkboxes above to adjust, then tick the box below to start:
|
… casts
Address PR feedback:
- IRuntimeFeature.handleOp / applyStashedOp / reSubmitOp / rollbackStagedOp
now take typed parameters (Omit<InboundSequencedContainerRuntimeMessage,
"contents">, IRuntimeMessagesContent[], LocalContainerRuntimeMessage)
instead of unknown. Subsystems no longer cast at the entry of each method.
- Lifecycle hook names match existing subsystem methods so they can be
satisfied without wrappers:
onConnectionStateChange → setConnectionState
onStagingModeChange → notifyStagingMode
GarbageCollector drops its onConnectionStateChange wrapper; ChannelCollection
drops its onConnectionStateChange wrapper. The existing setConnectionState
/ notifyStagingMode / initializeBaseState methods directly satisfy
IRuntimeFeature.
- Tag the new message-type unions and component types `@internal` and export
them from `index.ts` so api-extractor accepts the typed signatures on
exported (@internal) classes like DocumentsSchemaController. No public
API surface change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…actory-style RuntimeOps features Two follow-ups to PR feedback aimed at reducing bundle impact and simplifying the framework: - Each op-routing feature now declares `supportedOps: ContainerMessageType[]`. RuntimeFeatureCollection builds a single `Map<ContainerMessageType, IRuntimeFeature>` at registration and validates at most one feature per type. Dispatch becomes O(1) per message instead of an O(n) visitor over the feature list. Per-feature methods drop their type-check guards and boolean return — by the time they're called, the dispatcher has matched. - RuntimeOpsFeature class split into two const-arrow factories (`rejoinFeature`, `chunkedOpsGuardFeature`) — smaller emit than a class for these tiny single-purpose features. Bundle: 305,090 → 304,956 bytes (-134). Modest, but the architecture is cleaner and faster on the hot path. Test fixtures (stubChannelCollection, patchRuntime, patchContainerRuntime) updated to set `supportedOps` so the dispatcher routes to the stubs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…me-feature-host-introduction # Conflicts: # packages/runtime/container-runtime/src/containerRuntime.ts
| ): Promise<{ result: unknown } | undefined> { | ||
| const feature = this.opOwners.get(opContents.type as ContainerMessageType); | ||
| if (feature?.applyStashedOp === undefined) { | ||
| return undefined; |
There was a problem hiding this comment.
Deep Review: RuntimeFeatureCollection.applyStashedOp returns undefined for two distinct conditions: (a) no feature owns the op type, and (b) the owning feature has no applyStashedOp:
const feature = this.opOwners.get(opContents.type as ContainerMessageType);
if (feature?.applyStashedOp === undefined) { return undefined; }
return feature.applyStashedOp(opContents);PendingStateManager.dispatchStashedOp then treats undefined as "unknown op type" and calls this.stateHandler.closeFn(error); throw error;. chunkedOpsGuardFeature (in runtimeOpsFeature.ts) claims ContainerMessageType.ChunkedOp via supportedOps: CHUNKED_OPS but provides only handleOp — so its claim is silently downgraded to "no handler" inside this collapse and the container closes. The sibling rejoinFeature in the same file does the right thing: it provides an explicit-throw applyStashedOp. The two guards in the same file diverge on this contract.
Well-formed local replay can't carry ChunkedOp (it's excluded from LocalContainerRuntimeMessage in messageTypes.ts), so the trigger requires corrupt or forward-incompatible pending JSON — but the asymmetry is real, and it is also present symmetrically in handleOp / reSubmitOp / rollbackStagedOp.
Suggestions, in order of effort:
- Smallest — give
chunkedOpsGuardFeature.applyStashedOpan explicit throw matchingrejoinFeature. Closes the same-file divergence at minimum cost. - More thorough — distinguish the two return cases at the collection level, e.g. add a
hasOwner(type: ContainerMessageType): booleanhelper and let PSM check it before treatingundefinedas unknown, or switch to a tagged result{ kind: "no-owner" | "no-apply" | "applied", result?: unknown }. Apply uniformly acrosshandleOp/applyStashedOp/reSubmitOp/rollbackStagedOpso the contract is consistent.
|
|
||
| private registerOpClaims(feature: IRuntimeFeature): void { | ||
| if (feature.supportedOps === undefined) { | ||
| return; |
There was a problem hiding this comment.
Deep Review: The PR description says "Op-routing dispatch order — feature registration order determines who claims each message type first… The channelCollection claim is intentionally before idCompressorFeature so FluidDataStoreOps short-circuit fastest." That's not how this code works. registerOpClaims builds a Map<ContainerMessageType, IRuntimeFeature> and throw new Error(`RuntimeFeatureCollection: multiple features claim ${type}`) on a second claim — there is at most one owner per op type, and registration order is irrelevant for op routing. channelCollection.supportedOps = [FluidDataStoreOp, Attach, Alias] and idCompressorFeature.supportedOps = [IdAllocation] don't even overlap; even if they did the second add() would throw at construction time.
The in-source comment at the top of this file (the Order matters for fan-out hooks / Op-routing dispatch is type-keyed and order-independent block) already documents the correct invariant. Only the PR description disagrees.
Since this PR is deep-review-labelled, reviewers will read the metadata first and look for an ordering invariant that doesn't exist. Suggest editing the description to:
Op-routing dispatch order — each
ContainerMessageTypehas exactly one owning feature; duplicate claims throw at construction time (registerOpClaims). Registration order matters only for fan-out hooks (onLoadFromSnapshot,contributeSummary,setConnectionState, etc.), which iterate features in order.
| * Lazy load for "delayed" mode — finalizes any ranges that piled up in | ||
| * {@link pendingOps} while the compressor was off. | ||
| */ | ||
| public loadDelayed(): void { |
There was a problem hiding this comment.
Deep Review: PR #20174 (vladsud, signed off by andre4i) deliberately authored the rule on schema transitions for the id compressor: "this will only work for 'off' -> 'delayed' transitions. Anything else is too risky, and requires ability to initialize ID compressor synchronously!" The warning still lives at containerRuntime.ts above the schema-transition decision point (onSchemaChange), but the actual loading site is now this loadDelayed method, which only documents "Lazy load for 'delayed' mode — finalizes any ranges that piled up" with no reference to the constraint.
It's true that loadDelayed() has multiple non-transition callers (load-time and connection-time) — but that is exactly why a future second transition caller could miss the invariant now that the implementation moved files.
Suggest copying a one-line warning into the docstring, e.g.:
NOTE: only safe for
off → delayedschema transitions; any other transition requires synchronous ID-compressor initialization (seecontainerRuntime.onSchemaChange).
That way the constraint travels with the implementation.
Parameterize IRuntimeFeature over TOps so the op-routing hooks (handleOp / applyStashedOp / reSubmitOp / rollbackStagedOp) narrow to the message variants the feature claims via supportedOps. Switch those hooks to method-shorthand syntax for bivariance so narrow IRuntimeFeature<X> values assign into the wide collection storage. Add InboundRuntimeMessageFor / LocalRuntimeMessageFor / RuntimeMessagesContentFor helpers so implementations spell their message types via a single helper instead of inlining the conjunction or reaching for feature-specific aliases. Internal casts in the feature handlers are gone. Drop the unused onApplyStashedOps / onReady dispatch hooks (no feature ever implemented them) and the dead call site in loadRuntime2. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-load constraint `chunkedOpsGuardFeature` claimed `ChunkedOp` for `handleOp` only — `applyStashedOp` / `reSubmitOp` / `rollbackStagedOp` fell through the "owner without handler" path, which `RuntimeFeatureCollection` collapses with "no owner" so PSM closes the container as unknown-type. Local replay can't carry `ChunkedOp` (excluded from `LocalContainerRuntimeMessage`), but the asymmetry was real and diverged from the sibling `rejoinFeature`'s explicit-throw pattern. All four hooks now throw the same explicit error. Copy vladsud's `off → delayed` schema-transition constraint (PR microsoft#20174) into `IdCompressorFeature.loadDelayed`'s docstring so the invariant travels with the implementation now that the load site moved files. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e formatting api-extractor flagged AnyRuntimeOpType / InboundRuntimeMessageFor / LocalRuntimeMessageFor / RuntimeMessagesContentFor as forgotten exports from index.ts — they're referenced by IRuntimeFeature's signatures but were only exported from runtimeFeature.ts. Add them to the entry-point export alongside IRuntimeFeature. Also pick up biome auto-format wraps that landed in the previous round (rollbackStagedOp single-line, DocumentsSchemaController extends/implements wrap). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| */ | ||
| private broadcastSignalSequenceNumber: number = 0; | ||
|
|
||
| private wasCanSendOps = false; |
There was a problem hiding this comment.
Deep Review: The pre-PR runtime did if (canSendOpsChanged) { this.signalTelemetryManager.resetTracking(); } — reset on both false→true and true→false transitions. The new self-tracked behavior is if (canSendOps && !this.wasCanSendOps) { this.resetTracking(); } — only resets on false→true. The disconnect path no longer triggers a reset; any tracking state polluted while canSendOps === false is carried forward until the next connect.
The PR description flags this as "behaviorally equivalent, but worth a check" — it isn't strictly equivalent. Practical impact is narrow because telemetry is scoped via message.clientId === this.clientId and the next-send window is wiped before reuse, but the truth-table changed and nothing pins it: no test in runtimeFeatureCollection.spec.ts or containerRuntime.spec.ts asserts which canSendOps transitions trigger reset (read-only ↔ read-write while still connected, forced read-only toggle, reconnect during read-only).
Suggestions:
- Preferred — restore byte-equivalence with
if (canSendOps !== this.wasCanSendOps) { this.resetTracking(); }. - Alternative — confirm with telemetry owners (vladsud / ChumpChief, who debated
canSendOpssemantics in PR Move op replay count telemetry to the runtime layer #16104) that "reset on reconnect only" is intentional.
In either case, add a focused unit test on SignalTelemetryManager.setConnectionState that asserts reset fires exactly on the chosen transition set and not on the others — covering the four (wasCanSendOps, canSendOps) combinations.
| * Primarily a test-fixture seam — production code should rarely need this. | ||
| */ | ||
| // eslint-disable-next-line @typescript-eslint/no-explicit-any | ||
| public replace<T extends IRuntimeFeature<any>>(oldFeature: T, replacement: T): T { |
There was a problem hiding this comment.
Deep Review: replace<T> swaps this.features[index] = replacement, deletes the old feature's op-claims, then registerOpClaims(replacement) and returns. oldFeature.dispose?.() is never called, even though dispose is part of IRuntimeFeature and the collection's own dispose() (lines 103-105) calls f.dispose?.() on each feature — so replace is the one lifecycle path that leaks.
Migrated subsystems implementing dispose include BlobManager (with internal emitters), SummarizerSubsystem (SummaryCollection / SummaryManager listeners), and ChannelCollection. The JSDoc says "Primarily a test-fixture seam" / "production code should rarely need this" — language that invites non-test use without forbidding it.
Suggestions:
- Preferred — call
oldFeature.dispose?.()insidereplaceto bring the contract in line with the rest of the collection's lifecycle treatment. One line, removes the footgun entirely. - Alternative — tighten the JSDoc to "test-only — do not use in production", optionally rename to
replaceForTesting, and assert/guard against production callers.
| sessionExpiryTimerStarted: pendingRuntimeState?.sessionExpiryTimerStarted, | ||
| }); | ||
| this.features.add(this.garbageCollector); | ||
|
|
There was a problem hiding this comment.
Deep Review: The pre-PR initializeBaseState ran await this.initializeSummarizer(loader); await this.garbageCollector.initializeBaseState(); — summarizer first, GC second. The new features.onLoadFromSnapshot() iterates in registration order, and garbageCollector is registered before summarizerSubsystem, so GC initializes first now. The PR description claims behavioral equivalence; on this hook that claim is false.
No concrete bug is triggered today: SummarizerSubsystem.onLoadFromSnapshot doesn't currently read GC base state — it builds election machinery from injected summary, delta-manager, quorum, loader, and runtime dependencies (see summarizerSubsystem.ts:137-269). But the ordering is observably inverted, and no test pins the registration order, so a future change could silently re-invert it.
This combines with a more general concern: the registration sequence in ContainerRuntime has no comment explaining why each feature is in its current slot, even though several fan-out hooks are order-sensitive (SignalTelemetryManager's transition reset, contributeSummary summary-tree key ordering, GC vs. ChannelCollection on connect, GC vs. SummarizerSubsystem on load).
Suggestions:
- Either register
summarizerSubsystembeforegarbageCollectorto preserve the prior load-order, or document explicitly thatonLoadFromSnapshotordering between these two is independent. - Add a comment at the registration site listing why order matters for each fan-out hook, plus a test that pins the registration order so refactors don't silently change
setConnectionState/contributeSummary/onLoadFromSnapshotsemantics.
| @@ -133,13 +139,13 @@ export interface PendingBatchResubmitMetadata extends BatchResubmitInfo { | |||
| export interface IRuntimeStateHandler { | |||
There was a problem hiding this comment.
Deep Review: IRuntimeStateHandler gains closeFn; dispatchStashedOp now calls this.stateHandler.closeFn(error); throw error; when no feature claims the op type. Pre-PR this lived in ContainerRuntime.applyStashedOp's default branch — PSM is now both dispatcher and closer for an entire class of corruption-grade errors.
No demonstrated defect: the coupling is narrow (Pick<RuntimeFeatureCollection, "applyStashedOp">), PSM already owned corruption-detection semantics for forked-container / mismatched pending-message conditions, and existing tests cover stashed-op application, empty-batch handling, replay batching, batchId persistence, staged batches, and local-op metadata preservation. The concern is coverage proportional to the historical pattern: PSM has been a dense locus of forked-container / batch-id correctness work (#21714, #21767, #22310, #19802), and large mechanical moves through containerRuntime.ts have repeatedly required follow-up PRs.
Suggestion: add a unit/integration test tracing one local/stashed batch through PendingStateManager.dispatchStashedOp → RuntimeFeatureCollection.applyStashedOp → features.reSubmitOp → duplicate-batch summary contribution. Confirm batch boundaries, opMetadata/localOpMetadata separation, and batchId persistence are all preserved across the four-hook re-routing, and that closeFn here doesn't race with the existing disposeOnce path. Worth tagging markfields on the dispatchStashedOp change.
…object The test reaches into ContainerRuntime via `as any` to read the raw compressor instance for an "is it loaded yet" assertion. The IdCompressor extraction renamed the field — there is no `_idCompressor` anymore, the compressor lives at `idCompressorFeature.compressor`. Update the cast accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| for (const f of this.features) { | ||
| f.dispose?.(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Deep Review: The top-level summary-tree blob/tree key ordering shifted vs. the deleted addContainerStateToSummary.
Old (deleted at pr.diff lines ~1147-1190) added in fixed order: idCompressor → chunks → recentBatchInfo → aliases → electedSummarizer → blobs tree → gc tree.
New features.contributeSummary iterates registration order: chunks (RemoteMessageProcessor) → recentBatchInfo (DuplicateBatchDetector) → gc tree (GarbageCollector) → aliases (ChannelCollection) → blobs tree (BlobManager) → idCompressor (IdCompressorFeature) → electedSummarizer (SummarizerSubsystem).
Looked into whether this is a wire-format break: PR #17285's "set the properties in the order they appear here … stringified values can be compared" invariant applies to TypedContainerRuntimeMessage and PSM pending-message stringification (messageTypes.ts:64-82, pendingStateManager.ts:156-162), not to summary-tree key order; the upload site (containerRuntime.ts:4057-4060) sends summaryTree as a named-entry object. So no demonstrated break — but the ordering observably changed without being called out, and no test pins it.
Suggestions:
- Preferred — reorder feature registration in
containerRuntime.tsto match the previous summary-tree key sequence; the registration-order pin test (separate thread) then guards against future drift. - Alternative — document the new canonical order in
RuntimeFeatureCollection.contributeSummaryand add a snapshot test pinning the top-level summary-tree key sequence.
A quick grep across storage drivers / snapshot-diff tests outside container-runtime would settle whether anything downstream relies on the historical order.
| throw new Error(`RuntimeFeatureCollection: multiple features claim ${type}`); | ||
| } | ||
| this.opOwners.set(type, feature); | ||
| } |
There was a problem hiding this comment.
Deep Review: Two cheap test-only additions would close a class of risk that this round's findings keep surfacing — the dispose and onLoadFromSnapshot order inversions, the summary-tree key reorder, the duplicate-instance add gap, and the chunkedOpsGuardFeature completeness gap are all instances where add() order or registration completeness silently shifted observable behavior.
- Completeness assertion — iterate
Object.values(ContainerMessageType)in a freshly-constructedContainerRuntime's feature collection and assert each has an owner. Forces the registry to be the single source of truth for op-type ownership. Could also tightenregisterOpClaimsto require each owning feature to implement at least one op-routing hook for each claimed type (i.e. surface the chunkedOpsGuard "claims but noapplyStashedOp" case at construction). - Ordering pin — assert the registration array (or each fan-out hook's call order) matches a hand-curated expected sequence. So that any future re-register that silently reorders
dispose/contributeSummary/onLoadFromSnapshot/setConnectionStatesurfaces in CI.
The existing runtimeFeatureCollection.spec.ts covers fan-out order on the abstract collection (e.g. :13-33, :90-105, :166-184) but no test pins the ContainerRuntime registration array. That's the gap.
| export class RuntimeFeatureCollection { | ||
| private readonly features: IRuntimeFeature[] = []; | ||
|
|
||
| private readonly opOwners = new Map<AnyRuntimeOpType, IRuntimeFeature>(); |
There was a problem hiding this comment.
Deep Review: add() calls registerOpClaims(feature), and registerOpClaims only throws when existing !== undefined && existing !== feature. The existing !== feature short-circuit silently accepts a re-add of the same instance — the feature ends up in the this.features array twice, so all fan-out hooks (lifecycle, contributeSummary, setConnectionState, dispose) fire twice for that subsystem.
PR #17285 (noencke) and PR #20174 (vladsud) established exactly-once ownership of message types and exactly-once schema/blob writes; a duplicated feature would silently double-write summary blobs and double-fire connection-state callbacks. No current caller exercises this, but the contract gap is real.
Suggestion — one-line guard in add():
if (this.features.includes(feature)) {
throw new Error(`RuntimeFeatureCollection: feature already registered`);
}or drop the existing !== feature short-circuit in registerOpClaims and require callers to use replace() for the legitimate re-registration case.
| * that we never start processing ops in a batch IF we do not have all ops in the batch. | ||
| */ | ||
| export class InboundBatchAggregator { | ||
| export class InboundBatchAggregator implements IRuntimeFeature<never> { |
There was a problem hiding this comment.
Deep Review: The diff replaces a live callback (getClientId: () => this.clientId) with a cached field currentClientId that's updated only by setConnectionState. Previously the localBatch flag was computed against the live clientId on every comparison; now any inbound op processed before the first setConnectionState callback compares against undefined.
Narrowing impact: currentClientId is read only in error-detail objects (localBatch for the system-message-during-batch error, localBatch/localMessage for interleaved-client batch corruption errors), so it's telemetry-on-error-paths only. But it's an unnecessary semantic change in a "behavior-preserving" refactor — and on the rare path where one of these errors fires before the first setConnectionState callback, the telemetry will misreport localBatch/localMessage.
Suggestions:
- Smallest — initialize
currentClientIdfrom the runtime at construction so the first read is neverundefined. - Cleanest — retain the original
getClientIdcallback (or exposeclientIdthrough a runtime-context object) so the value is always live and there's no caching-sync hazard.
Deep ReviewReviewed commit Readiness: 7/10 — ALMOST READY A disciplined refactor: 12 subsystems extracted from Path to Ready
Context for Reviewers
For human reviewer
|
Description
Drives
containerRuntime.tsfrom 5582 → 4900 lines (-682, ~12%) by introducing anIRuntimeFeatureframework and extracting purpose-built classes. Motivation: theContainerRuntimeclass is a monolith — too many concerns are tangled inside it, message dispatch is split across multiple residual switches, and inline closures plumb runtime state into nearly every subsystem constructor. This PR establishes a feature pattern that subsystems opt into, then migrates twelve subsystems / concerns onto it.Feature framework (
runtimeFeature.ts,runtimeFeatureCollection.ts):IRuntimeFeature<TOps>interface + compositeRuntimeFeatureCollectionthat fans out lifecycle calls and routes ops O(1) via aMap<ContainerMessageType, IRuntimeFeature>built from each feature'ssupportedOps. Hooks:onLoadFromSnapshot,setConnectionState,notifyStagingMode,dispose,contributeSummary,handleOp,applyStashedOp,reSubmitOp,rollbackStagedOp. The op-routing hooks are parameterized onTOpsso each feature'smessage/messagesContentparameters narrow to the variants it claims (helpers:InboundRuntimeMessageFor,LocalRuntimeMessageFor,RuntimeMessagesContentFor).Subsystems migrated to features (own their own routing instead of being switched-on inside the runtime):
ChannelCollection,BlobManager,GarbageCollector,DocumentsSchemaController— op routing + summary contributionPendingStateManager,RemoteMessageProcessor,DuplicateBatchDetector,InboundBatchAggregator,DeltaSchedulerSignalTelemetryManager— self-resets viasetConnectionStateNew feature classes:
IdCompressorFeature— owns the compressor instance, pending-range queue, on-boot/delayed load, IdAllocation message routing, summary blob, and outboundLocalBatchMessageshapeRuntimeOpsFeature— Rejoin / ChunkedOp routingSummarizerSubsystem— election summary blobNew helper classes (not features, just monolith reduction):
ReconnectTracker— consecutive-reconnects-with-no-progress detectorLoadingGroupSnapshotFetcher— per-loadingGroupId snapshot cache + catch-up waitExtensionsManager— ContainerExtension store, layer-compat plumbing, lazy event emitter wiringOp-routing collapse:
validateAndProcessRuntimeMessages,applyStashedOp,reSubmit,rollbackStagedChangeall converged on the same shape —features.<verb>Opdecides ownership; runtime only handles the unknown-type fail. The IdAllocation / Rejoin / ChunkedOp residual switches are gone.Inline closure cleanup: outbox
generateIdAllocationOp(17-line closure) moved intoIdCompressorFeature.generateAllocationOp; runtime'sapplyStashedOp+parseLocalOpContentremoved (PSM dispatches directly throughfeatures.applyStashedOp).Reviewer Guidance
The review process is outlined on this wiki page.
This is a draft for early feedback. Specific things I'd appreciate eyes on:
IRuntimeFeatureis intentionally optional-everything; the runtime callsfeatures.<hook>(...)and each member implements only what applies. Is the surface right? Anything missing for future migrations?ContainerMessageTypehas exactly one owning feature; duplicate claims throw at construction time (registerOpClaims). Registration order matters only for fan-out hooks (onLoadFromSnapshot,contributeSummary,setConnectionState, etc.), which iterate features in order.SignalTelemetryManager.setConnectionStatereset moved from explicit call insidesetConnectionStateCoreto a self-tracked transition (was→is canSendOps). Behaviorally equivalent, but worth a check.notifyReadOnlyStateintosetConnectionState— separate signal path from the legacy_deltaManager.on("readonly")listener; risks regressions.IRuntimeContextshared object — would touch every subsystem constructor; deserves its own branch.submitSummary(375 lines) andprocessInboundMessageOrBatch(143 lines) — too tightly coupled to runtime state for clean extraction without the shared-context refactor.Test results:
@fluidframework/container-runtimemocha — 1966 passing, 2 pending, 0 failing@fluid-internal/local-server-stress-tests— 199 passing, 1 pending, 0 failing@fluid-internal/local-server-tests— 65 passing, 4 pending; 8 unrelatedcaptureFullContainerStatefailures (function lives on a different branch)🤖 Generated with Claude Code