Return replay retry state from orchestration recovery#1728

Merged
juliusmarminge merged 3 commits into main from t3code/project-open-flag
Apr 4, 2026

@juliusmarminge
Member

@juliusmarminge juliusmarminge commented Apr 4, 2026

Summary

  • Change completeReplayRecovery() to return structured replay outcome data instead of a bare boolean.
  • Preserve the replay progress signal while separately surfacing whether another replay should be attempted.
  • Add coverage for the new retry behavior when replay makes no progress, both with and without newer observed sequences.

Testing

  • Updated apps/web/src/orchestrationRecovery.test.ts to assert the new completion shape and retry cases.
  • Not run: bun fmt
  • Not run: bun lint
  • Not run: bun typecheck
  • Not run: bun run test

Note

Medium Risk
Changes orchestration replay recovery control flow and retry behavior in the app root event router, which can impact client/server state synchronization during sequence gaps. Risk is mitigated by added unit tests but could still affect recovery edge cases (e.g., no-progress replays, disposal timing).

Overview
Replay recovery now returns structured completion data instead of a boolean. completeReplayRecovery() returns { replayMadeProgress, shouldReplay }, separating “did the replay advance the sequence” from “do we still need another replay.”
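
As a rough illustration of the two signals, here is a minimal sketch. The field names come from the description above; the function name `completeReplayRecoverySketch`, its parameters, and the rule used to compute `shouldReplay` (a higher sequence was observed than was applied) are assumptions for illustration and not the code in orchestrationRecovery.ts.

```typescript
// Shape named in the PR description.
interface ReplayRecoveryCompletion {
  replayMadeProgress: boolean; // did the replay advance the applied sequence?
  shouldReplay: boolean; // is another replay still needed?
}

// Hypothetical stand-in for the coordinator method; parameters are assumed.
function completeReplayRecoverySketch(
  replayStartSequence: number,
  latestSequence: number,
  highestObservedSequence: number,
): ReplayRecoveryCompletion {
  return {
    // Progress: the applied sequence moved past where the replay started.
    replayMadeProgress: latestSequence > replayStartSequence,
    // Assumption: another replay is needed while observed events are still ahead.
    shouldReplay: highestObservedSequence > latestSequence,
  };
}
```

With a bare boolean, a caller could not tell "nothing advanced but events are still ahead" apart from "progress was made and more is pending"; the object keeps both answers available.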

Adds bounded, backoff-based retries for no-progress replays. New deriveReplayRetryDecision tracks consecutive no-progress attempts for the same replay frontier, retries with exponential delays (100ms base) up to a max (3), resets the budget when the frontier changes, and logs a warning when stopping early.
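
The retry budget described above could look roughly like the sketch below. The 100ms base and the limit of 3 come from the description; everything else (the name `deriveReplayRetryDecisionSketch`, the tracker shape, the doubling formula, and the behavior after exhaustion) is a guess at one reasonable design, not the real `deriveReplayRetryDecision` signature.

```typescript
const BASE_DELAY_MS = 100; // base delay from the PR description
const MAX_NO_PROGRESS_RETRIES = 3; // retry limit from the PR description

// Hypothetical tracker: counts consecutive no-progress attempts per frontier.
interface RetryTracker {
  frontier: number;
  attempts: number;
}

interface RetryDecision {
  shouldRetry: boolean;
  delayMs: number;
  tracker: RetryTracker | null;
}

function deriveReplayRetryDecisionSketch(
  tracker: RetryTracker | null,
  completion: { replayMadeProgress: boolean; shouldReplay: boolean },
  frontier: number,
): RetryDecision {
  // Nothing left to replay: stop and clear the tracker.
  if (!completion.shouldReplay) {
    return { shouldRetry: false, delayMs: 0, tracker: null };
  }
  // Progress was made: retry immediately with a fresh budget.
  if (completion.replayMadeProgress) {
    return { shouldRetry: true, delayMs: 0, tracker: null };
  }
  // No progress: continue counting only if the frontier is unchanged.
  const sameFrontier = tracker !== null && tracker.frontier === frontier;
  const attempts = sameFrontier && tracker !== null ? tracker.attempts + 1 : 1;
  const next: RetryTracker = { frontier, attempts };
  if (attempts > MAX_NO_PROGRESS_RETRIES) {
    // Budget exhausted; the caller would log a warning here.
    return { shouldRetry: false, delayMs: 0, tracker: next };
  }
  // Assumed doubling: 100ms, 200ms, 400ms.
  return { shouldRetry: true, delayMs: BASE_DELAY_MS * 2 ** (attempts - 1), tracker: next };
}
```

Keeping the tracker as an explicit input and output makes the decision a pure function, which is easy to unit test without timers.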

Updates callers and tests. EventRouter uses the new completion shape and retry decision (including clearing the tracker on replay failure), and orchestrationRecovery.test.ts is expanded to assert the new return type and retry/stop behavior.

Reviewed by Cursor Bugbot for commit cb1880f. Bugbot is set up for automated code reviews on this repo.

Note

Return structured replay retry state with exponential backoff from orchestration recovery

  • completeReplayRecovery now returns a ReplayRecoveryCompletion object ({ replayMadeProgress, shouldReplay }) instead of a boolean, and no longer unconditionally clears pendingReplay on no-progress completion.
  • Adds deriveReplayRetryDecision in orchestrationRecovery.ts to compute whether to retry a replay and with what delay: immediate retry on progress, capped exponential backoff (base 100ms, up to 3 attempts) when there is no progress on the same frontier, and budget reset when the frontier changes.
  • routes/__root.tsx uses the new return value to schedule retries, logging a warning when no-progress retries are exhausted.
  • Behavioral Change: previously a truthy return from completeReplayRecovery triggered an immediate retry unconditionally; now retries are budgeted and backoff-gated when no progress is observed.

Macroscope summarized cb1880f.

- Distinguish replay progress from retry eligibility
- Cover retry and no-op replay cases in tests
@coderabbitai

coderabbitai bot commented Apr 4, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

github-actions bot added labels Apr 4, 2026: size:S (10-29 changed lines, additions + deletions) and vouch:trusted (PR author is trusted by repo permissions or the VOUCHED list).
Contributor

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Autofix Details

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Caller treats object return value as boolean
    • Changed the truthiness check on completeReplayRecovery() to access .shouldReplay on the returned ReplayRecoveryCompletion object, so replay recovery is only triggered when shouldReplay is true.

Or push these changes by commenting:

@cursor push 4d69809910
Preview (4d69809910)
diff --git a/apps/web/src/routes/__root.tsx b/apps/web/src/routes/__root.tsx
--- a/apps/web/src/routes/__root.tsx
+++ b/apps/web/src/routes/__root.tsx
@@ -440,7 +440,7 @@
         return;
       }
 
-      if (!disposed && recovery.completeReplayRecovery()) {
+      if (!disposed && recovery.completeReplayRecovery().shouldReplay) {
         void recoverFromSequenceGap();
       }
     };
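
The reason the old check misfires can be shown in isolation. In the sketch below, the two helper functions are hypothetical names introduced for illustration; only the `ReplayRecoveryCompletion` field names come from the PR.

```typescript
interface ReplayRecoveryCompletion {
  replayMadeProgress: boolean;
  shouldReplay: boolean;
}

// Old caller logic: tests the object itself, and any object is truthy in JavaScript.
function wouldReplayWithTruthinessCheck(
  disposed: boolean,
  completion: ReplayRecoveryCompletion,
): boolean {
  return !disposed && Boolean(completion);
}

// Fixed caller logic: reads the shouldReplay field explicitly.
function wouldReplayWithFieldCheck(
  disposed: boolean,
  completion: ReplayRecoveryCompletion,
): boolean {
  return !disposed && completion.shouldReplay;
}

const noReplayNeeded: ReplayRecoveryCompletion = {
  replayMadeProgress: false,
  shouldReplay: false,
};
```

For `noReplayNeeded`, the truthiness check still reports that a replay should start, while the field check does not. TypeScript does not flag the old condition because an object is a valid operand in an `if`.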

@macroscopeapp
Contributor

macroscopeapp bot commented Apr 4, 2026

Approvability

Verdict: Needs human review

This PR introduces new runtime behavior for orchestration recovery: exponential backoff delays and capped retry attempts. While well-tested and authored by the module's owner, the changes affect when and how often the app retries during sequence gap recovery, warranting human review.

- wait briefly when replay recovery makes no progress
- retry sequence-gap recovery after replay completion
github-actions bot added size:M (30-99 changed lines, additions + deletions) and removed size:S (10-29 changed lines, additions + deletions) labels Apr 4, 2026
Contributor

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Unbounded retry loop when replay consistently makes no progress
    • Reset highestObservedSequence to latestSequence when replay makes no progress, so stale observations no longer permanently satisfy the observedAhead condition and cause infinite retries.

Or push these changes by commenting:

@cursor push 89821febb3
Preview (89821febb3)
diff --git a/apps/web/src/orchestrationRecovery.test.ts b/apps/web/src/orchestrationRecovery.test.ts
--- a/apps/web/src/orchestrationRecovery.test.ts
+++ b/apps/web/src/orchestrationRecovery.test.ts
@@ -65,7 +65,7 @@
     });
   });
 
-  it("retries replay when no progress was made but higher live sequences were observed", () => {
+  it("does not retry replay when no progress was made even if higher live sequences were previously observed", () => {
     const coordinator = createOrchestrationRecoveryCoordinator();
 
     coordinator.beginSnapshotRecovery("bootstrap");
@@ -75,11 +75,11 @@
 
     expect(coordinator.completeReplayRecovery()).toEqual({
       replayMadeProgress: false,
-      shouldReplay: true,
+      shouldReplay: false,
     });
     expect(coordinator.getState()).toMatchObject({
       latestSequence: 3,
-      highestObservedSequence: 5,
+      highestObservedSequence: 3,
       pendingReplay: false,
       inFlight: null,
     });

diff --git a/apps/web/src/orchestrationRecovery.ts b/apps/web/src/orchestrationRecovery.ts
--- a/apps/web/src/orchestrationRecovery.ts
+++ b/apps/web/src/orchestrationRecovery.ts
@@ -130,6 +130,9 @@
         replayStartSequence !== null && state.latestSequence > replayStartSequence;
       replayStartSequence = null;
       state.inFlight = null;
+      if (!replayMadeProgress) {
+        state.highestObservedSequence = state.latestSequence;
+      }
       const replayResolution = resolveReplayNeedAfterRecovery();
       return {
         replayMadeProgress,
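
The effect of the previewed autofix can be traced with a reduced model of the state. This is a sketch only: `RecoveryStateSketch` keeps just two fields, and the comparison standing in for `resolveReplayNeedAfterRecovery()` is an assumption, since that function's body is not shown in the preview.

```typescript
// Reduced, hypothetical model of the coordinator state.
interface RecoveryStateSketch {
  latestSequence: number;
  highestObservedSequence: number;
}

function completeReplaySketch(
  state: RecoveryStateSketch,
  replayStartSequence: number | null,
): { replayMadeProgress: boolean; shouldReplay: boolean } {
  const replayMadeProgress =
    replayStartSequence !== null && state.latestSequence > replayStartSequence;
  if (!replayMadeProgress) {
    // Previewed fix: discard the stale observation so it cannot keep
    // requesting replays that never advance.
    state.highestObservedSequence = state.latestSequence;
  }
  // Assumed stand-in for resolveReplayNeedAfterRecovery().
  const shouldReplay = state.highestObservedSequence > state.latestSequence;
  return { replayMadeProgress, shouldReplay };
}
```

Starting from `latestSequence: 3` and `highestObservedSequence: 5` with a replay that began at 3, the sketch yields `{ replayMadeProgress: false, shouldReplay: false }` and leaves `highestObservedSequence` at 3, which matches the updated expectations in the test diff above.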

Reviewed by Cursor Bugbot for commit 5c18bad.

- Add retry tracking for stalled orchestration replay
- Reset retry budget when the replay frontier changes
- Log when replay recovery stops after exhausting retries
- Co-authored-by: codex <codex@users.noreply.github.com>
github-actions bot added size:L (100-499 changed lines, additions + deletions) and removed size:M (30-99 changed lines, additions + deletions) labels Apr 4, 2026
@juliusmarminge juliusmarminge merged commit 6de4b47 into main Apr 4, 2026
12 checks passed
@juliusmarminge juliusmarminge deleted the t3code/project-open-flag branch April 4, 2026 05:05
aaditagrawal pushed a commit to aaditagrawal/t3code that referenced this pull request Apr 5, 2026
aaditagrawal added a commit to aaditagrawal/t3code that referenced this pull request Apr 5, 2026
…-retry-state

Merge upstream: Return replay retry state from orchestration recovery (pingdotgg#1728)
gigq pushed a commit to gigq/t3code that referenced this pull request Apr 6, 2026
Chrono-byte pushed a commit to Chrono-byte/t3code that referenced this pull request Apr 7, 2026