feat(container-runtime): enable batchId tracking by default with kill-switch and perf telemetry#27216

Merged
dannimad merged 8 commits into microsoft:main from dannimad:dbd-default
May 6, 2026

Conversation

@dannimad
Contributor

@dannimad dannimad commented May 1, 2026

Enables batchId tracking and the DuplicateBatchDetector by default in FlushMode.TurnBased so that "parallel fork" duplicate batches (the same local state sequenced twice from two containers rehydrated from the same serialized pending state) are caught for all consumers, not only those who opted into Offline Load.
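To make the "parallel fork" case concrete, here is a minimal sketch of the detection idea: remember each batchId as it is sequenced and flag any later batch that carries the same ID. `ForkDetectorSketch`, `check`, and the ID strings are hypothetical names chosen for illustration; this is not the real DuplicateBatchDetector API.

```typescript
// Illustrative sketch only; the class and method names are hypothetical,
// not the real DuplicateBatchDetector.
type CheckResult =
	| { duplicate: false }
	| { duplicate: true; firstSeenAt: number };

class ForkDetectorSketch {
	// batchId -> sequence number at which that batch was first sequenced.
	private readonly firstSeen = new Map<string, number>();

	public check(batchId: string, sequenceNumber: number): CheckResult {
		const firstSeenAt = this.firstSeen.get(batchId);
		if (firstSeenAt !== undefined) {
			// Same local state sequenced a second time: a parallel fork.
			return { duplicate: true, firstSeenAt };
		}
		this.firstSeen.set(batchId, sequenceNumber);
		return { duplicate: false };
	}
}

// Two containers rehydrated from the same pending state submit the same batch.
const detector = new ForkDetectorSketch();
const first = detector.check("batch-A", 10);
const second = detector.check("batch-A", 11);
```

In this sketch `first` is accepted and `second` is reported as a duplicate first seen at sequence number 10. A real detector would also need to bound the map, which the sketch omits.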

Changes in container-runtime:

  • Default-on in TurnBased. batchIdTrackingEnabled is now true whenever the runtime is in FlushMode.TurnBased and the kill-switch is not set. In FlushMode.Immediate the detector is silently skipped (it has nothing meaningful to do without batches).
  • Kill-switch. New config flag Fluid.ContainerRuntime.DisableBatchIdTracking disables the feature without a code change if a regression is observed in the field.
  • Removed the legacy Fluid.ContainerRuntime.enableBatchIdTracking opt-in flag. It is no longer needed now that the feature is on by default in TurnBased. Fluid.Container.enableOfflineFull remains, since Offline Load still requires the same TurnBased-only constraint and continues to throw UsageError if combined with FlushMode.Immediate (back-compat preserved).
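The enablement rule described in the list above can be sketched as a small pure function. The `FlushMode` string union, `ConfigReader`, and `fromSettings` are hypothetical stand-ins so the example is self-contained; only the config key comes from this PR.

```typescript
// Hypothetical stand-ins for illustration; not the real container-runtime types.
type FlushMode = "Immediate" | "TurnBased";
type ConfigReader = (key: string) => boolean | undefined;

const killSwitchKey = "Fluid.ContainerRuntime.DisableBatchIdTracking";

// Tracking is on only in TurnBased mode and only while the kill-switch is unset.
function isBatchIdTrackingEnabled(flushMode: FlushMode, readConfig: ConfigReader): boolean {
	const killSwitchSet = readConfig(killSwitchKey) === true;
	return flushMode === "TurnBased" && !killSwitchSet;
}

// Helper that turns a plain settings object into a ConfigReader.
const fromSettings =
	(settings: Record<string, boolean>): ConfigReader =>
	(key) =>
		settings[key];
```

With no settings at all, TurnBased yields `true`; setting the kill-switch or using Immediate mode yields `false` without throwing.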

Copilot AI review requested due to automatic review settings May 1, 2026 15:57
@github-actions
Contributor

github-actions Bot commented May 1, 2026

Hi! Thank you for opening this PR. Want me to review it?

Based on the diff (433 lines, 11 files), I've queued these reviewers:

  • Correctness — logic errors, race conditions, lifecycle issues
  • Security — vulnerabilities, secret exposure, injection
  • API Compatibility — breaking changes, release tags, type design
  • Performance — algorithmic regressions, memory leaks
  • Testing — coverage gaps, hollow tests

How this works

  • Adjust the reviewer set by ticking/unticking boxes above. Reviewer toggles alone don't trigger anything.

  • Tick Start review below to dispatch the review fleet.

  • After review finishes, tick Start review again to request another run — it auto-resets after each dispatch.

  • This comment updates as new commits land; your reviewer selections are preserved.

  • Start review

Contributor

Copilot AI left a comment

Pull request overview

This PR makes batchId tracking (and fork/duplicate-batch detection) default-on for ContainerRuntime in FlushMode.TurnBased, adds a kill-switch to disable it in the field, and extends summary-time perf telemetry emitted by DuplicateBatchDetector.

Changes:

  • Enable batchId tracking by default in TurnBased mode, skip it in Immediate mode, and add kill-switch Fluid.ContainerRuntime.DisableBatchIdTracking plus a one-time enablement telemetry event.
  • Add per-summary-window perf counters (peakRecentBatchCount, processedBatchCount) to DuplicateBatchDetector telemetry and reset them each summary window.
  • Update/unit-test expectations and configurations to reflect default-on behavior and kill-switch behavior.
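One way to picture the per-summary-window counters is the sketch below. The field names `peakRecentBatchCount` and `processedBatchCount` come from this PR; the class, its methods, and the assumption that the caller passes in the current number of tracked batch IDs are hypothetical.

```typescript
// Illustrative sketch; SummaryWindowCounters and its methods are hypothetical.
interface WindowStats {
	peakRecentBatchCount: number;
	processedBatchCount: number;
}

class SummaryWindowCounters {
	private peakRecentBatchCount = 0;
	private processedBatchCount = 0;

	// Call once per processed batch, passing how many batch IDs are tracked right now.
	public recordBatch(recentBatchCount: number): void {
		this.processedBatchCount++;
		this.peakRecentBatchCount = Math.max(this.peakRecentBatchCount, recentBatchCount);
	}

	// Call at summary time: returns the window's stats and starts a fresh window.
	public reportAndReset(): WindowStats {
		const stats: WindowStats = {
			peakRecentBatchCount: this.peakRecentBatchCount,
			processedBatchCount: this.processedBatchCount,
		};
		this.peakRecentBatchCount = 0;
		this.processedBatchCount = 0;
		return stats;
	}
}
```

Resetting inside the same call that reads the values keeps each telemetry event scoped to exactly one summary window.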

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
| File | Description |
| --- | --- |
| packages/test/test-end-to-end-tests/src/test/batching.spec.ts | Updates test config to preserve “Offline Load + Immediate throws” behavior now that batchId tracking no longer throws in Immediate mode. |
| packages/runtime/container-runtime/src/test/opLifecycle/duplicateBatchDetector.spec.ts | Extends tests to validate new per-summary perf telemetry and counter resets. |
| packages/runtime/container-runtime/src/test/containerRuntime.spec.ts | Updates runtime tests for default-on stamping/detection, adds Immediate-mode skip test, and keeps Offline Load Immediate-mode back-compat error test. |
| packages/runtime/container-runtime/src/opLifecycle/duplicateBatchDetector.ts | Adds perf counters, emits new telemetry fields, and resets per-summary window counters. |
| packages/runtime/container-runtime/src/containerRuntime.ts | Implements default enablement in TurnBased, kill-switch, and enablement telemetry event; allocates detector only when enabled. |

this.closeFn(error);
throw error;
}

Copilot AI May 1, 2026

If Fluid.Container.enableOfflineFull is true while the kill-switch Fluid.ContainerRuntime.DisableBatchIdTracking is also true, the runtime will allow Offline Load in TurnBased mode but will disable batchIdTrackingEnabled (and thus disable batchId stamping + DuplicateBatchDetector). That seems to violate the comment that tracking is a prerequisite for Offline Load and could re-enable unsafe fork scenarios silently. Consider either (a) ignoring the kill-switch when Offline Load is explicitly requested, or (b) throwing a UsageError when both flags are set so the configuration can't be enabled in an unsupported state.

Suggested change
if (offlineLoadRequested && disableBatchIdTracking) {
	const error = new UsageError(
		"Offline mode requires batchId tracking and cannot be used when Fluid.ContainerRuntime.DisableBatchIdTracking is enabled.",
	);
	this.closeFn(error);
	throw error;
}

Contributor Author

The Offline Load flag will also be removed in a follow-up.

Contributor

Deep Review: Following up on this thread now that batchId tracking is default-on. The asymmetry the original comment flagged is still live in the current diff: the offlineLoadRequested UsageError validates only flush mode, not the kill-switch, and the constructor comment immediately above still describes batchId tracking as the prerequisite for Offline Load. Pre-PR, enableOfflineFull=true unconditionally enabled tracking; post-PR, setting enableOfflineFull=true together with Fluid.ContainerRuntime.DisableBatchIdTracking=true in TurnBased silently strips the prerequisite — Offline Load proceeds with fork detection disabled, no error, no telemetry. The PR description says Offline Load is "back-compat preserved" but only the FlushMode portion currently is.

@dannimad you mentioned enableOfflineFull will be removed in a follow-up — that resolves this cleanly if it's imminent. A couple of cheap options either way:

  • (a) Link the follow-up issue here and add a one-line comment in the constructor noting the asymmetry is intentionally transient, or
  • (b) Add the symmetric guard now — extend the existing UsageError to fire when offlineLoadRequested && disableBatchIdTracking, or emit a BatchIdTrackingDisabledForOfflineLoad warning telemetry so the misconfig is at least observable while the follow-up lands.

Either closes the loop on this thread.
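For reference, option (b) could look roughly like the sketch below, written as a pure check so both variants (hard error or warning telemetry) are visible side by side. `checkOfflineLoadConfig`, its options object, the `strict` switch, and the message strings are hypothetical; only the flag semantics and the `BatchIdTrackingDisabledForOfflineLoad` event name come from this thread.

```typescript
// Illustrative sketch; function, option names, and messages are hypothetical.
type FlushMode = "Immediate" | "TurnBased";

interface OfflineLoadCheck {
	error?: string; // caller would wrap this in a UsageError and close
	warningEvent?: string; // caller would emit this as warning telemetry
}

function checkOfflineLoadConfig(opts: {
	flushMode: FlushMode;
	offlineLoadRequested: boolean;
	disableBatchIdTracking: boolean;
	strict: boolean; // true: fail loudly; false: only make the misconfig observable
}): OfflineLoadCheck {
	if (!opts.offlineLoadRequested) {
		return {};
	}
	if (opts.flushMode === "Immediate") {
		// Existing behavior preserved by this PR.
		return { error: "Offline Load requires FlushMode.TurnBased" };
	}
	if (opts.disableBatchIdTracking) {
		// The symmetric guard: the kill-switch strips Offline Load's prerequisite.
		return opts.strict
			? { error: "Offline Load requires batchId tracking" }
			: { warningEvent: "BatchIdTrackingDisabledForOfflineLoad" };
	}
	return {};
}
```

Keeping the check pure makes it cheap to unit-test independently of the constructor, and the `strict` switch is only there to show both alternatives in one place.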

Comment on lines 105 to 110
registry,
loaderProps: {
configProvider: configProvider({
"Fluid.ContainerRuntime.enableBatchIdTracking": true,
"Fluid.Container.enableOfflineFull": true,
}),
},

Copilot AI May 1, 2026

This suite now enables Fluid.Container.enableOfflineFull for all tests. That changes behavior beyond just batchId tracking (and forces callers to remember to pass disableOfflineLoad=true for Immediate mode). If the goal is only to keep the one Offline Load validation test, consider scoping enableOfflineFull to that specific test/setup instead of setting it globally for the whole describeCompat, to avoid running unrelated batching tests under the Offline Load configuration.
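A rough sketch of the scoping idea: keep the suite-wide settings free of Offline Load and let the single validation test opt in through an override. `configProviderSketch`, `configForTest`, and `suiteDefaults` are hypothetical names, and the provider shape is an assumption modeled on the snippet above rather than the actual test utility.

```typescript
// Illustrative sketch; helper names are hypothetical and the real
// configProvider in the test utilities may differ.
type ConfigValue = boolean | number | string | undefined;

interface ConfigProviderSketch {
	getRawConfig(name: string): ConfigValue;
}

const configProviderSketch = (
	settings: Record<string, ConfigValue>,
): ConfigProviderSketch => ({
	getRawConfig: (name) => settings[name],
});

// Suite-wide defaults: nothing Offline Load related.
const suiteDefaults: Record<string, ConfigValue> = {};

// Each test gets the defaults plus whatever it explicitly opts into.
function configForTest(overrides: Record<string, ConfigValue> = {}): ConfigProviderSketch {
	return configProviderSketch({ ...suiteDefaults, ...overrides });
}

// Most tests use the defaults; only the validation test enables Offline Load.
const defaultConfig = configForTest();
const offlineLoadConfig = configForTest({ "Fluid.Container.enableOfflineFull": true });
```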

Comment thread packages/runtime/container-runtime/src/test/containerRuntime.spec.ts Outdated
dannimad and others added 3 commits May 1, 2026 10:34
…pec.ts

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts Outdated
Comment thread packages/test/snapshots/src/validateSnapshots.ts Outdated
Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
dannimad and others added 2 commits May 5, 2026 10:13
Co-authored-by: Copilot <copilot@github.com>
@anthony-murphy
Contributor

Deep Review

Reviewed commit 90edb00 on 2026-05-05.

Readiness: 6/10 — 🔨 MAKING PROGRESS

Design continues to converge — independent solution-space proposals matched the PR's shape (default-on DuplicateBatchDetector in TurnBased, dedicated DisableBatchIdTracking kill-switch, per-summary peak/processed counters). One Tier 2 issue survives this pass: the PR description still promises a BatchIdTrackingEnablement rollout-attribution event that is not emitted in the diff. The Offline-Load + kill-switch asymmetry from the prior review remains tracked on its existing inline thread (3173973950, still unresolved). Both are bounded one-touch fixes; both have author engagement. Holding at 6/10 — no Tier 1, one Tier 2, three Tier 3 polish items held until Tier 2 closes.

Path to Ready

  • Resolve inline threads

Context for Reviewers

For human reviewer
  • Needs human judgment:
    • anthony-murphy's open log-volume question on the enablement event ("do we really need this? we have generalized logs about config values") is yours to call. If "no", the Tier 2 collapses to a description amendment.
    • vladsud: set the silent-fallback-over-throw precedent in "Don't allow FlushModeExperimental.Async if the loader does not support reference sequence numbers" (#14239); right person to validate the deliberate Offline-Load asymmetry (silent for the new default-on path, loud UsageError retained for the legacy explicit enableOfflineFull opt-in).
    • andre4i / markfields — confirm no production runtime path (resubmit during reconnect, applyStashedOp replay, transient ordering) can produce a (sequenceNumber, batchId) shape that trips the duplicate-batch invariant now that detection is default-on. The fewerBatches.spec.ts comment ("trip its invariants and short-circuit") indicates the invariant is an assertion, not graceful handling.
  • Cannot be assessed by the pipeline:
    • Runtime/perf cost of the new peakRecentBatchCount / processedBatchCount counters and per-summary reset on real workloads.
    • Whether the follow-up that removes enableOfflineFull is close enough that the asymmetry can land as-is.
    • Whether any external host outside this repo sets the retired Fluid.ContainerRuntime.enableBatchIdTracking key (in-tree confirmed clear on thread 3184554880).
Review history (2 prior reviews)
  • 031056a 2026-05-05 · 6/10 — enablement event missing; offline-load asymmetry re-pinged
  • b29d0ff 2026-05-04 · 8/10 — three minor polish items; enablement event referenced as if it existed (later confirmed missing)

Comment thread packages/runtime/container-runtime/src/containerRuntime.ts
@dannimad dannimad merged commit 085f51b into microsoft:main May 6, 2026
33 checks passed
@dannimad dannimad deleted the dbd-default branch May 6, 2026 22:38