Add exec-server websocket reconnect foundation by starr-openai · Pull Request #23395 · openai/codex

starr-openai · 2026-05-19T02:12:36Z

Summary

split the old remote exec-server client into a durable logical client/session layer and replaceable live connection bindings
add same-host websocket reconnect for remote exec-server environments, shared across process, filesystem, and HTTP capabilities
define the first reconnect/idempotency slice explicitly: resume sessions and recover cursor reads, but do not replay ambiguous side-effecting RPCs
keep rendezvous harness URLs on the same reconnect path while leaving full relay-frame reliable-message resume/replay to the later endpoint-owned protocol slice

Why

Remote exec-server environments currently treat a websocket drop as terminal even when the same exec-server process is still alive. That is too fragile for the same-host case we want first: laptop sleep, tunnel churn, websocket close races, or rendezvous route replacement should not force the rest of Codex to know that the transport changed underneath it.

This PR establishes the reconnect boundary at the environment-owned remote exec-server client. The rest of Codex should keep holding one logical remote environment capability while that client swaps the underlying live ExecServerConnection as needed.

The important constraint is that reconnecting is not the same as replaying every request. A transport can die after the server accepted an RPC but before the client saw the response. For this slice, the protocol has only one operation with a built-in replay cursor: process/read(afterSeq). Every other operation reconnects before later calls, but a call that was in flight when the transport died returns an error instead of risking duplicate side effects.

What Changed

Client/session restructuring

The old lazy remote client held onto the first successful connection forever. This PR replaces that with a logical RemoteExecServerClient that owns durable RemoteExecServerSession state:

current live ExecServerConnection, if any
resumable logical exec-server session_id
one shared in-flight reconnect attempt for the whole logical client
weakly tracked live ProcessSessions that may need rebinding while their process handles still exist
terminal resume failure state when the prior session is definitively gone

ExecServerConnection is now the name for one live JSON-RPC transport binding. It owns connection-local machinery such as the RpcClient, reader task, disconnect latch, process notification routes, and streamed HTTP body routes. The public ExecServerClient name remains as a type alias so existing callers do not need an API migration in this PR.

RemoteProcess, RemoteFileSystem, and the remote HttpClient implementation are now thin adapters over the shared logical RemoteExecServerClient; they do not own separate reconnect loops.
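A minimal sketch of the durable state listed above, assuming illustrative field and method names (track, live_sessions) and stand-in types that are not taken from the PR. It shows only the ownership shape: the logical session holds the optional live binding and weak references to process sessions, so dropped handles fall out of tracking on their own. The shared in-flight reconnect attempt is left out here.

```rust
use std::sync::{Arc, Weak};

/// Hypothetical stand-in for one live JSON-RPC transport binding.
#[derive(Debug)]
pub struct ExecServerConnection {
    pub generation: u64,
}

/// Hypothetical stand-in for the durable state owned by a process handle.
#[derive(Debug)]
pub struct ProcessSession {
    pub process_id: String,
}

/// Durable state of the logical client. Field names are assumptions.
#[derive(Default)]
pub struct RemoteExecServerSession {
    /// Current live binding, if any.
    pub connection: Option<Arc<ExecServerConnection>>,
    /// Resumable logical exec-server session id.
    pub session_id: Option<String>,
    /// Weakly tracked process sessions that may need rebinding.
    pub process_sessions: Vec<Weak<ProcessSession>>,
    /// Set once the prior session is definitively gone.
    pub terminal_resume_failure: Option<String>,
}

impl RemoteExecServerSession {
    /// Tracks a session without keeping it alive.
    pub fn track(&mut self, session: &Arc<ProcessSession>) {
        self.process_sessions.push(Arc::downgrade(session));
    }

    /// Prunes sessions whose handles were dropped and returns the rest,
    /// i.e. the ones a replacement connection would have to rebind.
    pub fn live_sessions(&mut self) -> Vec<Arc<ProcessSession>> {
        self.process_sessions.retain(|weak| weak.strong_count() > 0);
        self.process_sessions
            .iter()
            .filter_map(Weak::upgrade)
            .collect()
    }
}
```

Because only the handle holds a strong reference, dropping the last handle is enough to stop the session from being rebound on the next reconnect.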

Reconnect semantics

When a websocket-backed remote client notices a dead transport, the next remote API call asks the shared logical client for a connection. The client either reuses the live binding, waits for the one reconnect attempt already in progress, or creates one replacement connection and resumes the prior exec-server session_id.
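The three outcomes above (reuse, wait, or create one replacement) can be sketched with a mutex and condition variable. This is a blocking, std-only illustration under assumed names (LogicalClient, Connection, dial); the actual client is presumably async and also resumes the session_id, which is omitted here.

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical stand-in for a live connection binding.
#[derive(Debug, Clone, PartialEq)]
pub struct Connection {
    pub generation: u64,
}

enum State {
    Live(Connection),
    Reconnecting,
    Disconnected,
}

/// Illustrative logical client that allows one reconnect attempt at a time.
pub struct LogicalClient {
    state: Mutex<State>,
    changed: Condvar,
}

impl LogicalClient {
    pub fn new() -> Self {
        Self {
            state: Mutex::new(State::Disconnected),
            changed: Condvar::new(),
        }
    }

    /// Called when the transport is observed to be dead.
    pub fn mark_disconnected(&self) {
        let mut state = self.state.lock().unwrap();
        if matches!(*state, State::Live(_)) {
            *state = State::Disconnected;
        }
    }

    /// Reuses the live binding, waits for an attempt already in progress,
    /// or runs `dial` as the single replacement attempt.
    pub fn connection(
        &self,
        dial: impl FnOnce() -> Result<Connection, String>,
    ) -> Result<Connection, String> {
        let mut state = self.state.lock().unwrap();
        loop {
            if let State::Live(conn) = &*state {
                return Ok(conn.clone());
            }
            if matches!(*state, State::Disconnected) {
                break;
            }
            // Another caller is reconnecting; wait for its result.
            state = self.changed.wait(state).unwrap();
        }
        *state = State::Reconnecting;
        drop(state); // do not hold the lock while dialing

        let result = dial();
        let mut state = self.state.lock().unwrap();
        *state = match &result {
            Ok(conn) => State::Live(conn.clone()),
            Err(_) => State::Disconnected,
        };
        self.changed.notify_all();
        result
    }
}
```

In this sketch a waiter that wakes up after a failed attempt starts its own attempt; whether the real client shares the failure with waiters instead is not stated in the description.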

After resume, tracked process session routes are rebound onto the replacement ExecServerConnection so existing RemoteExecProcess handles continue through the same logical client state. The logical client does not own those sessions forever: RemoteExecProcess handles own the durable ProcessSession state, and the reconnect session keeps only weak references needed to rebind still-live handles.

Preserved remote process sessions now emit a local ResyncRequired event and wake when their transport disappears. Push-based consumers can use that signal to recover through process/read(afterSeq) instead of waiting forever for another pushed event. Direct one-shot/test sessions still fail terminally on disconnect.
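To illustrate why a cursor read is a safe recovery path after ResyncRequired, here is a small sketch. It assumes that process/read(afterSeq) returns retained output chunks whose sequence number is strictly greater than afterSeq; the types Chunk, RetainedOutput, and Consumer are hypothetical and not part of the PR.

```rust
/// Hypothetical output chunk tagged with a monotonically increasing seq.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub seq: u64,
    pub data: String,
}

/// Stand-in for the server-side retained process output.
pub struct RetainedOutput {
    pub chunks: Vec<Chunk>,
}

impl RetainedOutput {
    /// Assumed semantics of process/read(afterSeq): chunks with seq > after_seq.
    pub fn read_after(&self, after_seq: u64) -> Vec<Chunk> {
        self.chunks
            .iter()
            .filter(|chunk| chunk.seq > after_seq)
            .cloned()
            .collect()
    }
}

/// Push-based consumer that remembers the last seq it applied.
pub struct Consumer {
    pub last_seq: u64,
    pub output: String,
}

impl Consumer {
    /// Applies a pushed or read chunk, ignoring anything already seen.
    pub fn apply(&mut self, chunk: &Chunk) {
        if chunk.seq <= self.last_seq {
            return;
        }
        self.output.push_str(&chunk.data);
        self.last_seq = chunk.seq;
    }

    /// Recovery step after a ResyncRequired-style signal: read from the cursor.
    pub fn resync(&mut self, server: &RetainedOutput) {
        for chunk in server.read_after(self.last_seq) {
            self.apply(&chunk);
        }
    }
}
```

Running resync twice in a row leaves the consumer unchanged, which is the property that makes a retried read harmless.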

Resume error handling is intentionally narrow:

"unknown session id ..." is treated as terminal and cached for later callers.
"session ... is already attached to another connection" is treated as a transient resume race and retried briefly.
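The two cases above could be classified roughly as follows. This sketch assumes the client distinguishes them by matching on error message text, which the description suggests but does not confirm; ResumeOutcome, classify_resume_error, and resume_with_retry are hypothetical names, and a real retry loop would wait briefly between attempts.

```rust
/// How a failed resume attempt should be handled (illustrative).
#[derive(Debug, PartialEq, Eq)]
pub enum ResumeOutcome {
    /// The prior session is gone; cache the failure for later callers.
    Terminal,
    /// Likely a resume race with the old connection; retry briefly.
    RetryBriefly,
    /// Anything else is reported as an ordinary failure.
    Other,
}

/// Classifies a resume error by message text (an assumption of this sketch).
pub fn classify_resume_error(message: &str) -> ResumeOutcome {
    if message.starts_with("unknown session id") {
        ResumeOutcome::Terminal
    } else if message.starts_with("session ")
        && message.ends_with("is already attached to another connection")
    {
        ResumeOutcome::RetryBriefly
    } else {
        ResumeOutcome::Other
    }
}

/// Retries only transient conflicts, up to `max_attempts` attempts.
/// Returns the 1-based attempt number that succeeded.
pub fn resume_with_retry<F>(max_attempts: u32, mut try_resume: F) -> Result<u32, String>
where
    F: FnMut() -> Result<(), String>,
{
    let mut last_error = String::from("no resume attempts were made");
    for attempt in 1..=max_attempts {
        match try_resume() {
            Ok(()) => return Ok(attempt),
            Err(message) => match classify_resume_error(&message) {
                ResumeOutcome::RetryBriefly => last_error = message,
                // Terminal and unrecognized errors are not retried.
                _ => return Err(message),
            },
        }
    }
    Err(last_error)
}
```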

Idempotency / replay rules

This PR does not add general request idempotency keys. Instead it makes the first replay boundary explicit at the API layer:

process/read(afterSeq) may reconnect and retry once after a transport-close race because the cursor makes the read recoverable and duplicate-safe.
process/start, process/write, process/terminate, filesystem RPCs, and http/request reconnect before later calls but are not replayed after an ambiguous mid-request disconnect.
streamed HTTP response bodies remain connection-local; a later request can reconnect, but an already-open body-delta stream is not resumed in this slice.

That keeps this foundation honest: we recover the operation that already has a read cursor, and we surface ambiguity for operations that would need explicit idempotency keys or stronger protocol semantics before automatic replay is safe.
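The replay boundary can be pictured as a per-method policy. The sketch below is an assumption-laden illustration, not the PR's code: ReplayPolicy, CallError, and call_with_policy are hypothetical, and the send and reconnect closures stand in for the real RPC dispatch and reconnect paths.

```rust
/// What to do when the transport closes while a call is in flight.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum ReplayPolicy {
    /// Cursor-based read: reconnect and retry once.
    RetryOnceAfterReconnect,
    /// Possibly side-effecting: report the ambiguity to the caller.
    SurfaceAmbiguity,
}

/// Only process/read is replayable in this slice.
pub fn replay_policy(method: &str) -> ReplayPolicy {
    match method {
        "process/read" => ReplayPolicy::RetryOnceAfterReconnect,
        _ => ReplayPolicy::SurfaceAmbiguity,
    }
}

/// Illustrative error type for a single call.
#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    TransportClosed,
    Other(String),
}

/// Sends a request and replays it at most once, and only if the policy allows.
pub fn call_with_policy<T>(
    method: &str,
    mut send: impl FnMut() -> Result<T, CallError>,
    mut reconnect: impl FnMut() -> Result<(), CallError>,
) -> Result<T, CallError> {
    match send() {
        Err(CallError::TransportClosed)
            if replay_policy(method) == ReplayPolicy::RetryOnceAfterReconnect =>
        {
            reconnect()?;
            send()
        }
        other => other,
    }
}
```

A process/write that hits a closed transport comes back as TransportClosed after exactly one send, leaving the decision to retry with the caller.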

Rendezvous / relay behavior

Rendezvous harness URLs are still represented as ExecServerTransportParams::WebSocketUrl, so they take the same logical reconnect path as direct websocket URLs. client_transport continues to detect ?role=harness and wrap the websocket in the relay transport before exec-server initialize/resume runs over it.
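As a rough illustration of the ?role=harness check, the following std-only helper looks for that exact query pair. The function name is_harness_url and the example hosts are hypothetical; the real client_transport most likely uses a proper URL parser, and this sketch does no percent-decoding.

```rust
/// Returns true when the URL's query string contains `role=harness`.
/// Simplified: splits on '?', '&', and '#' without percent-decoding.
pub fn is_harness_url(url: &str) -> bool {
    // Ignore any fragment before looking at the query.
    let without_fragment = url.split('#').next().unwrap_or(url);
    let Some((_, query)) = without_fragment.split_once('?') else {
        return false;
    };
    query.split('&').any(|pair| pair == "role=harness")
}
```

Under this check, a direct websocket URL and a rendezvous harness URL differ only in whether the relay wrapper is applied; both would still flow through the same reconnect path.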

For this slice, reconnecting a rendezvous-backed client means establishing a fresh relay websocket/stream, then resuming the exec-server logical session above that relay binding and recovering process output through process/read(afterSeq).

This is intentionally not full reliable-message relay replay. The Reliable Messages design has harness and executor retain seq/ack/unacked state and resend remaining relay segments end-to-end after same-stream_id resume. That endpoint-owned relay resume/retry protocol is a later slice; this PR only makes the exec-server client/session layer resilient to replacement of the underlying websocket or relay binding.

Validation

added focused reconnect coverage for shared reconnect attempts, reconnect-before-dispatch across remote APIs, cursor-read replay, and no replay for ambiguous non-read APIs
added regressions for idle process resync notification, transient resume-conflict retry, terminal unknown-session caching, process-session cleanup after start/connect failure, and process-id reuse after a dropped remote handle
ran focused remote Bazel tests for //codex-rs/exec-server:exec-server-unit-tests and //codex-rs/rmcp-client:rmcp-client-unit-tests
generated an exec-server coverage report with cargo llvm-cov on the devbox while checking the new reconnect paths

starr-openai added 7 commits May 18, 2026 10:04

Refactor exec-server websocket pump

8a9300e

Add exec-server websocket pump tests

3673b69

Preserve exec-server websocket keepalive ownership

d94300d

Add websocket backpressure regression tests

9215e15

Fix websocket backpressure regression tests

6d6cdeb

Restore server-owned websocket keepalive

6f96405

Remove manual websocket pong handling

90804bb

starr-openai force-pushed the starr/exec-server-reconnect-candidate-b-20260519 branch 2 times, most recently from 3b15152 to db2e2bc, May 19, 2026 02:56

Add exec-server websocket reconnect foundation

ef8267f

starr-openai force-pushed the starr/exec-server-reconnect-candidate-b-20260519 branch from db2e2bc to ef8267f, May 19, 2026 03:05

starr-openai added 4 commits May 18, 2026 20:53

Document and test exec-server reconnect invariants

b25c4b5

Document exec-server reconnect hierarchy

d82bb0d

Harden exec-server reconnect recovery

edd0d6a

Fix reconnect test lint

ee0db32

Base automatically changed from starr/exec-server-ws-keepalive-20260517 to main May 19, 2026 20:32

starr-openai closed this May 28, 2026

starr-openai deleted the starr/exec-server-reconnect-candidate-b-20260519 branch May 29, 2026 22:49

starr-openai wants to merge 12 commits into main from starr/exec-server-reconnect-candidate-b-20260519
