[codex] Harden WebSocket reconnect recovery #1864

Merged
juliusmarminge merged 3 commits into pingdotgg:main from juliusmarminge:t3code/reconnect-recovery-20260409
Apr 10, 2026

@juliusmarminge juliusmarminge commented Apr 9, 2026

What changed

  • restart stalled reconnect attempts when the expected retry window expires
  • retry replay and snapshot recovery through transient transport failures during reconnect
  • preserve default websocket lifecycle tracking when custom protocol handlers are provided

Why

Reconnect recovery could stall or lose lifecycle bookkeeping after transient websocket transport failures. This hardens the client-side recovery path so reconnect state resumes predictably instead of getting stuck.

Impact

WebSocket reconnect recovery in the web client is more reliable under disconnects, retries, and partial recovery failures.

Validation

  • bun fmt
  • bun lint
  • bun typecheck
  • cd apps/web && bun run test src/components/WebSocketConnectionSurface.logic.test.ts src/environments/runtime/connection.test.ts src/rpc/wsTransport.test.ts

Note

Harden WebSocket reconnect recovery by retrying stalled reconnects and transport errors

  • Replaces exhaustWsReconnectIfStillWaiting with shouldRestartStalledReconnect in WebSocketConnectionSurface.tsx: when a scheduled retry window elapses while the connection is still in the waiting phase, the coordinator now triggers an actual reconnect attempt instead of marking the reconnect as exhausted.
  • Wraps replayEvents and getSnapshot calls in connection.ts with retryTransportRecoveryOperation, which retries up to 20 times with a 250ms delay on transport connection errors.
  • Fixes composeLifecycleHandlers in protocol.ts so default WebSocket lifecycle tracking always runs alongside any custom handlers provided to createWsRpcProtocolLayer.
  • Risk: recovery operations now retry up to 20× on transport errors before giving up, increasing the time before a failed recovery surfaces to the user.

Macroscope summarized db8e216.
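
For readers without the diff open, the retry wrapper described in the summary above could look roughly like the following. This is a minimal sketch, not the code from connection.ts: `retryTransportOperation`, `TransportConnectionError`, `RetryOptions`, and the injectable `sleep` are hypothetical names introduced for illustration, and only the 20-attempt limit and 250ms delay come from the summary.

```typescript
// Hypothetical error type standing in for whatever the client classifies as a
// transport connection error; the real predicate is not shown on this page.
class TransportConnectionError extends Error {}

interface RetryOptions {
  maxAttempts: number; // the summary mentions up to 20
  delayMs: number; // the summary mentions 250ms
  isRetryable: (error: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real timers
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation`, retrying only on retryable errors, with a fixed delay
// between attempts. Non-retryable errors are rethrown immediately.
async function retryTransportOperation<T>(
  operation: () => Promise<T>,
  options: RetryOptions,
): Promise<T> {
  if (options.maxAttempts < 1) {
    throw new RangeError("maxAttempts must be at least 1");
  }
  const sleep = options.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      if (!options.isRetryable(error)) throw error;
      lastError = error;
      // No delay after the final attempt; the error surfaces right away.
      if (attempt < options.maxAttempts) await sleep(options.delayMs);
    }
  }
  throw lastError;
}
```

Under this sketch's reading (20 total attempts, a delay only between attempts), a recovery that never succeeds sleeps for 19 × 250ms = 4.75s plus the time each attempt takes before the failure surfaces, which is the latency cost named in the risk bullet. The merged code may count attempts differently.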


Note

Medium Risk
Modifies WebSocket reconnect coordination and orchestration recovery retry behavior; mistakes could cause reconnect loops, delayed recovery, or missed state updates under flaky networks.

Overview
Hardens client WebSocket reconnect/recovery paths to avoid stalled or brittle reconnect behavior.

The reconnect coordinator now detects a stalled scheduled retry window and proactively calls reconnect() (via shouldRestartStalledReconnect) instead of forcing the connection into an exhausted state, and the old exhaustWsReconnectIfStillWaiting path is removed.
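
The stalled-window check described above can be illustrated with a small predicate. This is a sketch under stated assumptions rather than the code in WebSocketConnectionSurface.tsx: the `ReconnectState` shape, the phase names other than the waiting phase, the `nextRetryAt` field, and the `Sketch`-suffixed function names are all hypothetical.

```typescript
// Hypothetical state shape; the real coordinator's fields are not shown here.
type ReconnectPhase = "idle" | "waiting" | "attempting" | "connected";

interface ReconnectState {
  phase: ReconnectPhase;
  nextRetryAt: number | null; // epoch ms when the scheduled retry should fire
}

// True when the scheduled retry window has elapsed but the connection is
// still parked in the waiting phase, i.e. the retry never actually started.
function shouldRestartStalledReconnectSketch(
  state: ReconnectState,
  now: number,
): boolean {
  return (
    state.phase === "waiting" &&
    state.nextRetryAt !== null &&
    now >= state.nextRetryAt
  );
}

// Instead of marking the connection exhausted, kick off a real attempt.
function onRetryWindowElapsedSketch(
  state: ReconnectState,
  now: number,
  reconnect: () => void,
): void {
  if (shouldRestartStalledReconnectSketch(state, now)) {
    reconnect();
  }
}
```

Keeping the decision in a pure predicate makes it straightforward to unit test without timers, which fits the PR's WebSocketConnectionSurface.logic.test.ts file, although the actual test cases are not shown on this page.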

Orchestration snapshot/replay recovery is wrapped in a transport-error retry loop (up to 20 attempts with a short delay) so transient disconnects during resubscribe/bootstrap don’t immediately fail recovery. The protocol layer also now composes custom lifecycle handlers with default connection-state tracking so user-provided handlers can’t accidentally bypass reconnection bookkeeping, with tests added/updated to cover these behaviors.

Reviewed by Cursor Bugbot for commit db8e216. Bugbot is set up for automated code reviews on this repo. Configure here.
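
The lifecycle-handler composition mentioned in the overview above can be sketched as follows. The handler names (`onOpen`, `onClose`, `onError`) and the `LifecycleHandlers` interface are assumptions made for illustration; the real signatures of composeLifecycleHandlers and createWsRpcProtocolLayer in protocol.ts are not shown on this page.

```typescript
// Hypothetical handler shape, not the actual protocol.ts types.
interface LifecycleHandlers {
  onOpen?: () => void;
  onClose?: (reason: string) => void;
  onError?: (error: unknown) => void;
}

// Returns handlers that always run the default tracking first and then any
// custom handler, so supplying a custom handler cannot replace the defaults.
function composeLifecycleHandlersSketch(
  defaults: LifecycleHandlers,
  custom: LifecycleHandlers = {},
): LifecycleHandlers {
  return {
    onOpen: () => {
      defaults.onOpen?.();
      custom.onOpen?.();
    },
    onClose: (reason) => {
      defaults.onClose?.(reason);
      custom.onClose?.(reason);
    },
    onError: (error) => {
      defaults.onError?.(error);
      custom.onError?.(error);
    },
  };
}
```

The failure mode this guards against is an object spread such as `{ ...defaults, ...custom }`, where a custom `onClose` would silently replace the default one and connection-state tracking would stop seeing close events. Whether the original bug had exactly that shape is not stated in the PR text.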

- restart stalled reconnect timers when the retry window expires
- retry replay and snapshot recovery through transient transport errors
- preserve default websocket lifecycle tracking when adding custom handlers
coderabbitai bot commented Apr 9, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 417e4f22-684e-4679-90e9-3f54781cae5d

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

@github-actions github-actions bot added the size:M and vouch:trusted labels Apr 9, 2026
@juliusmarminge juliusmarminge marked this pull request as ready for review April 9, 2026 23:12
@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is ON, but it could not run because the branch was deleted or merged before autofix could start.

Reviewed by Cursor Bugbot for commit 6fa364b. Configure here.

Co-authored-by: codex <codex@users.noreply.github.com>
@github-actions github-actions bot added the size:L label and removed the size:M label Apr 9, 2026
Co-authored-by: codex <codex@users.noreply.github.com>
macroscopeapp bot commented Apr 9, 2026

Approvability

Verdict: Approved

This PR hardens WebSocket reconnection recovery by adding retry logic for transport errors during recovery operations and fixing lifecycle handler composition. The changes are well-tested bug fixes limited to reconnection edge cases, authored by the primary maintainer of these files.

@juliusmarminge juliusmarminge merged commit 528bb2a into pingdotgg:main Apr 10, 2026
12 checks passed
Labels

  • size:L: 100-499 changed lines (additions + deletions)
  • vouch:trusted: PR author is trusted by repo permissions or the VOUCHED list
