@markfields markfields commented Mar 19, 2025

Description

tl;dr

Flush the Outbox before processing incoming ops to avoid adding ops to the batch which have a different base (reference sequence number) than the existing batched ops.

There's a kill-bit switch "Fluid.ContainerRuntime.DisableFlushBeforeProcess" that restores the previous behavior, in case of unforeseen side effects of flushing before process.

the full story

There are two mechanisms for detecting / dealing with interleaving submitting and processing ops in invalid ways:

  1. ContainerRuntime.ensureNoDataModelChanges - This wraps the core code in ContainerRuntime.process, and marks any ops submitted during this time as "reentrant". When a batch is flushed, if it has any reentrant ops it rebases them (via resubmit).
  2. Outbox.maybeFlushPartialBatch - This ensures that every batch has a single referenceSequenceNumber across all its ops. This is needed because an incoming process (from DeltaManager processing the inbound queue) can jump in line before a scheduled flush.
    a. This will also happen any time case (1) happens with multiple reentrancies in a single incoming batch (since we process in between those reentrant submits)
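
To make mechanism (2) concrete, here is a minimal sketch of the idea, not the actual Outbox code. ToyOutbox, PendingOp, and demoPartialFlush are hypothetical names invented for illustration; only maybeFlushPartialBatch and referenceSequenceNumber come from the description above, and the real method's signature may differ.

```typescript
// Hypothetical, simplified model; NOT the real Outbox from container-runtime.
interface PendingOp {
  contents: string;
  referenceSequenceNumber: number;
}

class ToyOutbox {
  private batch: PendingOp[] = [];
  // Every batch that has been flushed so far, in order.
  public readonly flushed: PendingOp[][] = [];

  submit(contents: string, currentRefSeq: number): void {
    this.maybeFlushPartialBatch(currentRefSeq);
    this.batch.push({ contents, referenceSequenceNumber: currentRefSeq });
  }

  // If the pending batch was built against a different reference sequence
  // number, cut it now so the new op starts a fresh batch.
  private maybeFlushPartialBatch(currentRefSeq: number): void {
    const first = this.batch[0];
    if (first !== undefined && first.referenceSequenceNumber !== currentRefSeq) {
      this.flush();
    }
  }

  flush(): void {
    if (this.batch.length > 0) {
      this.flushed.push(this.batch);
      this.batch = [];
    }
  }
}

// Two ops are submitted at refSeq 10, then an inbound op "jumps in line"
// (modeled here by simply submitting the next op at refSeq 11).
function demoPartialFlush(): number[][] {
  const outbox = new ToyOutbox();
  outbox.submit("a", 10);
  outbox.submit("b", 10);
  outbox.submit("c", 11);
  outbox.flush();
  return outbox.flushed.map((b) => b.map((op) => op.referenceSequenceNumber));
}
```

In this toy model the third op lands in its own batch, so each flushed batch carries exactly one reference sequence number.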

This PR removes the need for the second mechanism by always flushing at the beginning of ContainerRuntime.process (and, for safety and clarity, puts the whole process function inside ensureNoDataModelChanges). This way, every batch gets a fresh and stable referenceSequenceNumber.
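
The effect of flushing at the start of process can be sketched with a toy model. ToyRuntime, runScenario, and batchesAreCoherent are hypothetical names, and the boolean constructor argument only stands in for the "Fluid.ContainerRuntime.DisableFlushBeforeProcess" kill-bit; the real runtime reads its configuration differently. The toy also leaves out the legacy partial-batch flush, so the disabled path shows the mixed-base batch that mechanism existed to prevent.

```typescript
// Hypothetical toy runtime; NOT the real ContainerRuntime.
class ToyRuntime {
  private refSeq = 0;
  // Reference sequence number recorded for each op in the pending batch.
  private batch: number[] = [];
  private readonly disableFlushBeforeProcess: boolean;
  public readonly flushedBatches: number[][] = [];

  constructor(disableFlushBeforeProcess: boolean) {
    this.disableFlushBeforeProcess = disableFlushBeforeProcess;
  }

  // Queue a local op against the current reference sequence number.
  submit(): void {
    this.batch.push(this.refSeq);
  }

  flush(): void {
    if (this.batch.length > 0) {
      this.flushedBatches.push(this.batch);
      this.batch = [];
    }
  }

  // Process one incoming op. Unless the stand-in kill-bit is set, flush
  // first so pending ops never mix with ops submitted after refSeq advances.
  process(sequenceNumber: number): void {
    if (!this.disableFlushBeforeProcess) {
      this.flush();
    }
    this.refSeq = sequenceNumber;
  }
}

// True when every batch has one reference sequence number across its ops.
function batchesAreCoherent(batches: number[][]): boolean {
  return batches.every((b) => b.every((r) => r === b[0]));
}

// Submit two ops, process an incoming op before the scheduled flush,
// submit one more op, then flush.
function runScenario(disableFlushBeforeProcess: boolean): number[][] {
  const rt = new ToyRuntime(disableFlushBeforeProcess);
  rt.submit();
  rt.submit();
  rt.process(1);
  rt.submit();
  rt.flush();
  return rt.flushedBatches;
}
```

With the flush enabled, the scenario yields two batches that each have a single base; with it disabled, this toy yields one batch that mixes bases 0 and 1.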

More on point 2.a

This will still result in batched ops having different bases, but those ops will always be marked as reentrant, so we will rebase/resubmit the whole thing.

Also, due to op bunching, this case isn't even detected at times because the way we detect sequence numbers advancing during process ends up out of sync with the actual processing (where reentrant ops come up).

@github-actions github-actions bot added area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch labels Mar 19, 2025
@markfields markfields changed the title fix(container-runtime): Simplify Op Reentrancy code fix(container-runtime): Simplify some Op Reentrancy code Mar 21, 2025
@markfields markfields marked this pull request as ready for review March 21, 2025 20:22
Copilot AI left a comment

Pull Request Overview

This PR simplifies the op reentrancy logic in ContainerRuntime by flushing the Outbox at the very beginning of op processing and by unifying sequence number coherency handling under a single check that can be configured.

  • Replaces the old partial batch flushing mechanism with an assertion-based check on sequence numbers
  • Introduces a configuration flag (disableSequenceNumberCoherencyAssert) to retain the legacy flush behavior if necessary
  • Updates tests and telemetry code to accommodate the new behavior

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| packages/runtime/container-runtime/src/opLifecycle/outbox.ts | Updates sequence number coherency check and telemetry logging, and renames the method to assertSequenceNumberCoherency |
| packages/runtime/container-runtime/src/containerRuntime.ts | Adjusts process flow to flush the outbox before invoking process logic and passes through the new config flag |
| packages/runtime/container-runtime/src/test/* | Updates tests to include the new configuration flag and to reflect changes in expected behavior |
| packages/runtime/container-runtime/src/opLifecycle/batchManager.ts | Minor comment adjustments regarding reentrant ops and reference sequence numbers |
| packages/test/test-end-to-end-tests/src/test/fewerBatches.spec.ts | Updates test expectations for op reentry and batch flushing behavior |

Comments suppressed due to low confidence (2)

packages/runtime/container-runtime/src/opLifecycle/outbox.ts:225

  • [nitpick] Consider aligning telemetry property naming by using camelCase (e.g. 'dataDetails' instead of 'Data_details') for consistency across events.
this.logger.sendErrorEvent({

packages/runtime/container-runtime/src/opLifecycle/outbox.ts:47

  • [nitpick] Consider renaming 'disableSequenceNumberCoherencyAssert' to a name that more clearly expresses its intent (for example, 'allowIncoherentSequenceNumbers') to improve clarity.
readonly disableSequenceNumberCoherencyAssert: boolean;

@markfields markfields requested a review from vladsud March 21, 2025 20:34
@markfields markfields added the release-blocking Must be addressed before we cut and publish the next release label Mar 31, 2025
markfields commented Mar 31, 2025

Marking as release-blocking because I am about to make more changes in this area, and would like this one to release first to be able to observe the impact of the logic simplification independently. I am just waiting for checks and re-review after final commit.

Release Driver - you may give the final Approve from a security standpoint (that's the only reason the existing approvals don't count, since I made some final changes taking PR feedback)

@markfields markfields enabled auto-merge (squash) March 31, 2025 13:56
markfields commented Mar 31, 2025

Investigated the test failures. It turns out Rollback had a bug that was hiding under this codepath, and the change exposed it: we weren't resetting the baseline clientSequenceNumber when rolling back. Fixed.

@markfields markfields merged commit fa03fe8 into microsoft:main Mar 31, 2025
32 checks passed
@markfields markfields deleted the cr/rebase-reentrant branch March 31, 2025 17:46
markfields added a commit that referenced this pull request Apr 11, 2025
…ushing partial batches (#24303)

Test for #24276. The test makes sure that if a system op is processed
before a scheduled flush happens, we properly cut a new batch.

See #24099 for the full story on this scenario and recent changes to it.
Labels

area: runtime Runtime related issues area: tests Tests to add, test infrastructure improvements, etc base: main PRs targeted against main branch Feature_StagingMode release-blocking Must be addressed before we cut and publish the next release
