-
Notifications
You must be signed in to change notification settings - Fork 562
fix(container-runtime): Simplify some Op Reentrancy code #24099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR simplifies the op reentrancy logic in ContainerRuntime by flushing the Outbox at the very beginning of op processing and by unifying sequence number coherency handling under a single check that can be configured.
- Replaces the old partial batch flushing mechanism with an assertion-based check on sequence numbers
- Introduces a configuration flag (disableSequenceNumberCoherencyAssert) to retain the legacy flush behavior if necessary
- Updates tests and telemetry code to accommodate the new behavior
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| packages/runtime/container-runtime/src/opLifecycle/outbox.ts | Updates sequence number coherency check and telemetry logging, and renames the method to assertSequenceNumberCoherency |
| packages/runtime/container-runtime/src/containerRuntime.ts | Adjusts process flow to flush the outbox before invoking process logic and passes through the new config flag |
| packages/runtime/container-runtime/src/test/* | Updates tests to include the new configuration flag and to reflect changes in expected behavior |
| packages/runtime/container-runtime/src/opLifecycle/batchManager.ts | Minor comment adjustments regarding reentrant ops and reference sequence numbers |
| packages/test/test-end-to-end-tests/src/test/fewerBatches.spec.ts | Updates test expectations for op reentry and batch flushing behavior |
Comments suppressed due to low confidence (2)
packages/runtime/container-runtime/src/opLifecycle/outbox.ts:225
- [nitpick] Consider aligning telemetry property naming by using camelCase (e.g. 'dataDetails' instead of 'Data_details') for consistency across events.
this.logger.sendErrorEvent({
packages/runtime/container-runtime/src/opLifecycle/outbox.ts:47
- [nitpick] Consider renaming 'disableSequenceNumberCoherencyAssert' to a name that more clearly expresses its intent (for example, 'allowIncoherentSequenceNumbers') to improve clarity.
readonly disableSequenceNumberCoherencyAssert: boolean;
|
Marking as Release Driver - you may give the final Approve from a security standpoint (that's the only reason the existing approvals don't count, since I made some final changes taking PR feedback) |
|
Investigated the test failures -- Turns out Rollback had a bug that was hiding under this codepath, and the change exposed it. We weren't resetting the baseline clientSequenceNumber when rolling back. Fixed. |
Description
tl;dr
Flush the Outbox before processing income ops to avoid adding ops to the batch which have a different base (reference sequence number) than the existing batched ops.
There's a kill-bit switch "Fluid.ContainerRuntime.DisableFlushBeforeProcess" that restores the previous behavior, in case of unforeseen side effects of flushing before process.
the full story
There are two mechanisms for detecting / dealing with interleaving submitting and processing ops in invalid ways:
ContainerRuntime.ensureNoDataModelChanges- This wraps the core code inContainerRuntime.process, and marks any ops submitted during this time as "reentrant". When a batch is flushed, if it has any reentrant ops it rebases them (via resubmit).Outbox.maybeFlushPartialBatch- This ensures that every batch has a singularreferenceSequenceNumberacross all its ops. This is needed because an incomingprocess(from DeltaManager processing the inbound queue) can jump in line before a scheduled flush.a. This will also happen any time case (1) happens with multiple reentrancies in a single incoming batch (since we process in between those reentrant submits)
This PR gets rid of the need for the 2nd one by always flushing at the beginning of
ContainerRuntime.process(and for safety/clarity put the wholeprocessfunction insideensureNoDataModelChanges). This way we know that every batch gets a fresh and stablereferenceSequenceNumber.More on point
2.aThis will still result in batched ops having different bases, but those ops will always be marked as reentrant, so we will rebase/resubmit the whole thing.
Also, due to op bunching, this case isn't even detected at times because the way we detect sequence numbers advancing during process ends up out of sync with the actual processing (where reentrant ops come up).