fix(container-runtime): Simplify some Op Reentrancy code #24099

markfields · 2025-03-19T20:56:31Z

Description

tl;dr

Flush the Outbox before processing incoming ops, so that we never add ops to a batch whose base (reference sequence number) differs from that of the ops already batched.

There's a kill-bit switch "Fluid.ContainerRuntime.DisableFlushBeforeProcess" that restores the previous behavior, in case of unforeseen side effects of flushing before process.
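A minimal sketch of the flush-before-process idea, not the actual ContainerRuntime code. `ToyOutbox`, `ToyRuntime`, and all their members are hypothetical names invented for illustration; only the config key string comes from this PR. The toy also does not model the previous partial-batch flushing, so setting the switch here simply skips the flush.

```typescript
// Toy model only: ToyOutbox and ToyRuntime are hypothetical, not Fluid Framework APIs.
interface ToyOp {
	contents: string;
	referenceSequenceNumber: number;
}

class ToyOutbox {
	private pending: ToyOp[] = [];
	public readonly flushedBatches: ToyOp[][] = [];

	public submit(op: ToyOp): void {
		this.pending.push(op);
	}

	// Sends everything batched so far as one batch (no-op when empty).
	public flush(): void {
		if (this.pending.length > 0) {
			this.flushedBatches.push(this.pending);
			this.pending = [];
		}
	}
}

class ToyRuntime {
	public readonly outbox = new ToyOutbox();
	private lastProcessedSeq = 0;
	// Assumption: a plain record stands in for the real config provider.
	private readonly settings: Record<string, boolean>;

	constructor(settings: Record<string, boolean> = {}) {
		this.settings = settings;
	}

	public submit(contents: string): void {
		this.outbox.submit({ contents, referenceSequenceNumber: this.lastProcessedSeq });
	}

	public process(incomingSequenceNumber: number): void {
		const killBit =
			this.settings["Fluid.ContainerRuntime.DisableFlushBeforeProcess"] === true;
		if (!killBit) {
			// Cut the current batch before the reference sequence number moves.
			this.outbox.flush();
		}
		this.lastProcessedSeq = incomingSequenceNumber;
	}
}

const runtime = new ToyRuntime();
runtime.submit("a"); // batched against reference sequence number 0
runtime.process(1); // flushes "a" first, then advances to 1
runtime.submit("b"); // batched against reference sequence number 1
runtime.outbox.flush();

// One inner array per flushed batch; each batch has a single base.
const batchRefSeqs = runtime.outbox.flushedBatches.map((batch) =>
	batch.map((op) => op.referenceSequenceNumber),
);
console.log(JSON.stringify(batchRefSeqs));
```

In this toy, each flushed batch ends up with exactly one reference sequence number because the flush always happens before the number can advance.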

the full story

There are two mechanisms for detecting and dealing with invalid interleaving of op submission and op processing:

1. ContainerRuntime.ensureNoDataModelChanges - This wraps the core code in ContainerRuntime.process, and marks any ops submitted during this time as "reentrant". When a batch is flushed, if it has any reentrant ops it rebases them (via resubmit).
2. Outbox.maybeFlushPartialBatch - This ensures that every batch has a singular referenceSequenceNumber across all its ops. This is needed because an incoming process (from DeltaManager processing the inbound queue) can jump in line before a scheduled flush.
   a. This will also happen any time case (1) happens with multiple reentrancies in a single incoming batch (since we process in between those reentrant submits)
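The second mechanism can be sketched roughly as follows. This is a toy under stated assumptions: `PartialFlushOutbox` and its members are hypothetical, and the real Outbox.maybeFlushPartialBatch may differ in detail. It only shows the core idea of cutting a batch whenever the reference sequence number has moved since the first batched op.

```typescript
// Hypothetical toy, not the real Outbox.
interface BatchedOp {
	contents: string;
	refSeq: number;
}

class PartialFlushOutbox {
	private pending: BatchedOp[] = [];
	public readonly flushed: BatchedOp[][] = [];

	public submit(contents: string, currentRefSeq: number): void {
		this.maybeFlushPartialBatch(currentRefSeq);
		this.pending.push({ contents, refSeq: currentRefSeq });
	}

	// If the batch already holds ops with a different base, send them first
	// so that every batch keeps a single reference sequence number.
	private maybeFlushPartialBatch(currentRefSeq: number): void {
		const batchRefSeq = this.pending[0]?.refSeq;
		if (batchRefSeq !== undefined && batchRefSeq !== currentRefSeq) {
			this.flush();
		}
	}

	public flush(): void {
		if (this.pending.length > 0) {
			this.flushed.push(this.pending);
			this.pending = [];
		}
	}
}

const partialOutbox = new PartialFlushOutbox();
partialOutbox.submit("a", 10);
partialOutbox.submit("b", 10);
// An inbound op was processed before the scheduled flush, so the base is now 11.
partialOutbox.submit("c", 11);
partialOutbox.flush();

// Number of ops in each flushed batch: "a" and "b" together, then "c" alone.
const partialFlushSizes = partialOutbox.flushed.map((batch) => batch.length);
console.log(JSON.stringify(partialFlushSizes));
```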

This PR gets rid of the need for the 2nd one by always flushing at the beginning of ContainerRuntime.process (and, for safety/clarity, puts the whole process function inside ensureNoDataModelChanges). This way we know that every batch gets a fresh and stable referenceSequenceNumber.

More on point `2.a`

This will still result in batched ops having different bases, but those ops will always be marked as reentrant, so we will rebase/resubmit the whole thing.

Also, due to op bunching, this case isn't even detected at times because the way we detect sequence numbers advancing during process ends up out of sync with the actual processing (where reentrant ops come up).
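To make the reentrant case above concrete, here is a rough sketch, assuming a toy runtime. `ReentrancyToy`, `processOneOp`, and the way "rebase" is modeled (re-stamping the reference sequence number instead of going through resubmit) are all illustrative assumptions, not the real implementation.

```typescript
// Hypothetical toy: models "mark reentrant ops, rebase the whole batch on flush".
interface ReentrantToyOp {
	contents: string;
	refSeq: number;
	reentrant: boolean;
}

class ReentrancyToy {
	private processingDepth = 0;
	private currentSeq = 0;
	private pending: ReentrantToyOp[] = [];
	public readonly sent: ReentrantToyOp[][] = [];

	// Anything submitted while the callback runs is marked reentrant.
	public ensureNoDataModelChanges(callback: () => void): void {
		this.processingDepth++;
		try {
			callback();
		} finally {
			this.processingDepth--;
		}
	}

	public submit(contents: string): void {
		this.pending.push({
			contents,
			refSeq: this.currentSeq,
			reentrant: this.processingDepth > 0,
		});
	}

	// Processes one op of an incoming batch; onOp stands in for app code
	// reacting to the op (which may submit reentrantly).
	public processOneOp(seq: number, onOp: () => void): void {
		this.ensureNoDataModelChanges(() => {
			this.currentSeq = seq;
			onOp();
		});
	}

	public flush(): void {
		if (this.pending.length === 0) {
			return;
		}
		const needsRebase = this.pending.some((op) => op.reentrant);
		// Assumption: rebasing is modeled as re-stamping every op with the latest base.
		const batch = needsRebase
			? this.pending.map((op) => ({
					contents: op.contents,
					refSeq: this.currentSeq,
					reentrant: false,
				}))
			: this.pending;
		this.sent.push(batch);
		this.pending = [];
	}
}

const reentrancyToy = new ReentrancyToy();
// Two ops from a single incoming batch, each triggering a reentrant submit.
reentrancyToy.processOneOp(5, () => reentrancyToy.submit("x"));
reentrancyToy.processOneOp(6, () => reentrancyToy.submit("y"));
reentrancyToy.flush();

// The batched ops started with bases 5 and 6; after the rebase they share one.
const rebasedRefSeqs = reentrancyToy.sent.map((batch) => batch.map((op) => op.refSeq));
console.log(JSON.stringify(rebasedRefSeqs));
```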

Copilot

Pull Request Overview

This PR simplifies the op reentrancy logic in ContainerRuntime by flushing the Outbox at the very beginning of op processing and by unifying sequence number coherency handling under a single check that can be configured.

Replaces the old partial batch flushing mechanism with an assertion-based check on sequence numbers
Introduces a configuration flag (disableSequenceNumberCoherencyAssert) to retain the legacy flush behavior if necessary
Updates tests and telemetry code to accommodate the new behavior

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| packages/runtime/container-runtime/src/opLifecycle/outbox.ts | Updates sequence number coherency check and telemetry logging, and renames the method to assertSequenceNumberCoherency |
| packages/runtime/container-runtime/src/containerRuntime.ts | Adjusts process flow to flush the outbox before invoking process logic and passes through the new config flag |
| packages/runtime/container-runtime/src/test/* | Updates tests to include the new configuration flag and to reflect changes in expected behavior |
| packages/runtime/container-runtime/src/opLifecycle/batchManager.ts | Minor comment adjustments regarding reentrant ops and reference sequence numbers |
| packages/test/test-end-to-end-tests/src/test/fewerBatches.spec.ts | Updates test expectations for op reentry and batch flushing behavior |

Comments suppressed due to low confidence (2)

packages/runtime/container-runtime/src/opLifecycle/outbox.ts:225

[nitpick] Consider aligning telemetry property naming by using camelCase (e.g. 'dataDetails' instead of 'Data_details') for consistency across events.

this.logger.sendErrorEvent({

packages/runtime/container-runtime/src/opLifecycle/outbox.ts:47

[nitpick] Consider renaming 'disableSequenceNumberCoherencyAssert' to a name that more clearly expresses its intent (for example, 'allowIncoherentSequenceNumbers') to improve clarity.

readonly disableSequenceNumberCoherencyAssert: boolean;

markfields · 2025-03-31T13:53:45Z

Marking as release-blocking because I am about to make more changes in this area, and would like this one to release first to be able to observe the impact of the logic simplification independently. I am just waiting for checks and re-review after final commit.

Release Driver - you may give the final Approve from a security standpoint (that's the only reason the existing approvals don't count, since I made some final changes taking PR feedback)

markfields · 2025-03-31T15:10:31Z

Investigated the test failures -- Turns out Rollback had a bug that was hiding under this codepath, and the change exposed it. We weren't resetting the baseline clientSequenceNumber when rolling back. Fixed.
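As a rough illustration of this class of bug (not the actual fix), here is a toy batch manager whose rollback restores the baseline clientSequenceNumber along with the op list. `ToyBatchManager` and every member on it are hypothetical; the real BatchManager and its checkpoint logic may look quite different.

```typescript
// Hypothetical toy: rollback must restore the baseline as well as the ops.
class ToyBatchManager {
	private ops: string[] = [];
	private baselineClientSeq: number | undefined;

	public push(op: string, currentClientSeq: number): void {
		if (this.ops.length === 0) {
			// The first op in a batch establishes the baseline.
			this.baselineClientSeq = currentClientSeq;
		}
		this.ops.push(op);
	}

	public get baseline(): number | undefined {
		return this.baselineClientSeq;
	}

	public get length(): number {
		return this.ops.length;
	}

	public checkpoint(): { rollback: () => void } {
		const startLength = this.ops.length;
		const startBaseline = this.baselineClientSeq;
		return {
			rollback: () => {
				// Drop everything pushed since the checkpoint...
				this.ops.length = startLength;
				// ...and reset the baseline too. Omitting this line would leave
				// a stale baseline behind after the rollback.
				this.baselineClientSeq = startBaseline;
			},
		};
	}
}

const toyManager = new ToyBatchManager();
const toyCheckpoint = toyManager.checkpoint();
toyManager.push("a", 7);
toyCheckpoint.rollback();

// After rolling back to an empty batch, no baseline should remain.
const baselineAfterRollback = toyManager.baseline;
console.log(baselineAfterRollback === undefined);
```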

…ushing partial batches (#24303) Test for #24276. The test makes sure that if a system op is processed before a scheduled flush happens, we properly cut a new batch. See #24099 for the full story on this scenario and recent changes to it.

markfields added 4 commits March 19, 2025 20:32

reentrancy test debugging

ac43978

Reentrancy tests - why does e2e behave differently from lst?

8c521db

More test stuff

51eb7ea

Flush before process so maybeFlushPartialBatch can simply assert

783a1d7

github-actions bot added the area: runtime, area: tests, and base: main labels Mar 19, 2025

markfields added 4 commits March 19, 2025 23:22

Fix failing PR checks

a25e6bf

Fix issue with multiple reentrant ops thwarting seqNum coherency check

c9a72fc

Merge remote-tracking branch 'origin/main' into cr/rebase-reentrant

f261ba8

Revert tests about rebase

be99c57

markfields changed the title ~~fix(container-runtime): Simplify Op Reentrancy code~~ fix(container-runtime): Simplify some Op Reentrancy code Mar 21, 2025

Address todo comments and evaluate test coverage

994c700

markfields marked this pull request as ready for review March 21, 2025 20:22

markfields requested review from a team, MarioJGMsoft, WillieHabi, agarwal-navin, Copilot, jason-ha, jatgarg, kian-thompson and pragya91 March 21, 2025 20:22

Copilot AI reviewed Mar 21, 2025

Tweaks and tests

7943480

markfields requested a review from vladsud March 21, 2025 20:34

kian-thompson approved these changes Mar 21, 2025

vladsud reviewed Mar 21, 2025

vladsud reviewed Mar 21, 2025

vladsud reviewed Mar 21, 2025

vladsud reviewed Mar 21, 2025

vladsud approved these changes Mar 21, 2025

markfields added 2 commits March 31, 2025 12:25

Merge remote-tracking branch 'origin/main' into cr/rebase-reentrant

956942b

lint

1bed687

markfields mentioned this pull request Mar 31, 2025

Op bunching 1: Bunch contiguous ops for data store in a batch - Runtime part #22839

Merged

markfields added 2 commits March 31, 2025 13:10

PR feedback

66020cf

Refactor kill-bit switch and cover the whole change

5c40d91

markfields added the release-blocking label Mar 31, 2025

markfields enabled auto-merge (squash) March 31, 2025 13:56

Fix rollback bug

c99e280

jatgarg approved these changes Mar 31, 2025

markfields merged commit fa03fe8 into microsoft:main Mar 31, 2025
32 checks passed

markfields deleted the cr/rebase-reentrant branch March 31, 2025 17:46

This was referenced Apr 11, 2025

feat(ContainerRuntime): Flush on DeltaManager's "op" event #24276

Merged

test(container-runtime) Add test for bug fix around old loader and flushing partial batches #24303

Merged

markfields added the Feature_StagingMode label Jun 12, 2025
