Add exec-server websocket reconnect foundation #23395

Closed
starr-openai wants to merge 12 commits into main from starr/exec-server-reconnect-candidate-b-20260519

starr-openai commented May 19, 2026

Summary

  • split the old remote exec-server client into a durable logical client/session layer and replaceable live connection bindings
  • add same-host websocket reconnect for remote exec-server environments, shared across process, filesystem, and HTTP capabilities
  • define the first reconnect/idempotency slice explicitly: resume sessions and recover cursor reads, but do not replay ambiguous side-effecting RPCs
  • keep rendezvous harness URLs on the same reconnect path while leaving full relay-frame reliable-message resume/replay to the later endpoint-owned protocol slice

Why

Remote exec-server environments currently treat a websocket drop as terminal even when the same exec-server process is still alive. That is too fragile for the same-host case we want first: laptop sleep, tunnel churn, websocket close races, or rendezvous route replacement should not force the rest of Codex to know that the transport changed underneath it.

This PR establishes the reconnect boundary at the environment-owned remote exec-server client. The rest of Codex should keep holding one logical remote environment capability while that client swaps the underlying live ExecServerConnection as needed.

The important constraint is that reconnecting is not the same as replaying every request. A transport can die after the server accepted an RPC but before the client saw the response. For this slice, the protocol has only one operation with a built-in replay cursor: process/read(afterSeq). Every other operation reconnects before the next call, but a call that was in flight when the transport died returns an error rather than risking duplicate side effects.

What Changed

Client/session restructuring

The old lazy remote client held onto the first successful connection forever. This PR replaces that with a logical RemoteExecServerClient that owns durable RemoteExecServerSession state:

  • current live ExecServerConnection, if any
  • resumable logical exec-server session_id
  • one shared in-flight reconnect attempt for the whole logical client
  • weakly tracked live ProcessSessions that may need rebinding while their process handles still exist
  • terminal resume failure state when the prior session is definitively gone
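
As a rough illustration of the state listed above, the sketch below models the session as a plain struct. All field names, the `closed` flag, and the `SessionStatus` helper are assumptions made for illustration; they are not taken from the PR's actual definitions.

```rust
use std::sync::{Arc, Weak};

// Illustrative stand-ins only; the real types carry far more machinery.
struct ExecServerConnection {
    closed: bool,
}
struct ProcessSession;

#[derive(Debug, PartialEq)]
enum SessionStatus {
    Live,
    NeedsReconnect,
    Terminal,
}

// Hypothetical layout mirroring the bullet list, one field per bullet.
#[derive(Default)]
struct RemoteExecServerSession {
    connection: Option<Arc<ExecServerConnection>>, // current live binding, if any
    session_id: Option<String>,                    // resumable logical session id
    reconnect_in_flight: bool,                     // one shared reconnect attempt
    process_sessions: Vec<Weak<ProcessSession>>,   // weakly tracked live sessions
    terminal_failure: Option<String>,              // cached terminal resume failure
}

impl RemoteExecServerSession {
    // A terminal failure wins over everything else; otherwise the session is
    // live only while it holds a connection that has not been closed.
    fn status(&self) -> SessionStatus {
        if self.terminal_failure.is_some() {
            return SessionStatus::Terminal;
        }
        match &self.connection {
            Some(conn) if !conn.closed => SessionStatus::Live,
            _ => SessionStatus::NeedsReconnect,
        }
    }
}
```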

ExecServerConnection is now the name for one live JSON-RPC transport binding. It owns connection-local machinery such as the RpcClient, reader task, disconnect latch, process notification routes, and streamed HTTP body routes. The public ExecServerClient name remains as a type alias so existing callers do not need an API migration in this PR.

RemoteProcess, RemoteFileSystem, and the remote HttpClient implementation are now thin adapters over the shared logical RemoteExecServerClient; they do not own separate reconnect loops.

Reconnect semantics

When a websocket-backed remote client notices a dead transport, the next remote API call asks the shared logical client for a connection. The client either reuses the live binding, waits for the one reconnect attempt already in progress, or creates one replacement connection and resumes the prior exec-server session_id.
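
The reuse / wait / replace decision can be sketched with a mutex and a condition variable. This is a minimal synchronous model under stated assumptions: `LogicalClient`, `Binding`, and `dial_and_resume` are hypothetical names, the dial cannot fail here, and the real client is presumably async rather than thread-based.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::{Arc, Condvar, Mutex};

struct Connection {
    generation: u64,
    alive: AtomicBool,
}

enum Binding {
    Disconnected,
    Reconnecting,
    Live(Arc<Connection>),
}

struct LogicalClient {
    binding: Mutex<Binding>,
    changed: Condvar,
    dials: AtomicU64, // counts how many replacement connections were created
}

impl LogicalClient {
    fn new() -> Self {
        LogicalClient {
            binding: Mutex::new(Binding::Disconnected),
            changed: Condvar::new(),
            dials: AtomicU64::new(0),
        }
    }

    // Stand-in for "open a websocket, then resume the prior session_id".
    fn dial_and_resume(&self) -> Arc<Connection> {
        let generation = self.dials.fetch_add(1, Ordering::SeqCst) + 1;
        Arc::new(Connection { generation, alive: AtomicBool::new(true) })
    }

    fn connection(&self) -> Arc<Connection> {
        let mut binding = self.binding.lock().unwrap();
        loop {
            let reconnecting = match &*binding {
                // Reuse the live binding.
                Binding::Live(conn) if conn.alive.load(Ordering::SeqCst) => {
                    return Arc::clone(conn)
                }
                Binding::Reconnecting => true,
                _ => false,
            };
            if !reconnecting {
                break;
            }
            // Wait for the one reconnect attempt already in progress.
            binding = self.changed.wait(binding).unwrap();
        }
        // This caller becomes the single reconnecting owner.
        *binding = Binding::Reconnecting;
        drop(binding);
        let conn = self.dial_and_resume();
        *self.binding.lock().unwrap() = Binding::Live(Arc::clone(&conn));
        self.changed.notify_all();
        conn
    }
}
```

In this model, any number of callers that notice the dead transport at the same time end up sharing a single replacement connection.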

After resume, tracked process session routes are rebound onto the replacement ExecServerConnection so existing RemoteExecProcess handles continue through the same logical client state. The logical client does not own those sessions forever: RemoteExecProcess handles own the durable ProcessSession state, and the reconnect session keeps only weak references needed to rebind still-live handles.
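
The ownership split described above (handles own the session, the reconnect layer holds only weak references) might look roughly like this. `ReconnectSession`, `bound_connection`, and `rebind_all` are hypothetical names, and a bare connection number stands in for the real route registration.

```rust
use std::sync::{Arc, Mutex, Weak};

struct ProcessSession {
    process_id: String,
    // Which connection generation this session's routes are registered on.
    bound_connection: Mutex<u64>,
}

#[derive(Default)]
struct ReconnectSession {
    tracked: Vec<Weak<ProcessSession>>,
}

impl ReconnectSession {
    fn track(&mut self, session: &Arc<ProcessSession>) {
        self.tracked.push(Arc::downgrade(session));
    }

    // Rebind every session whose handle is still alive, and forget the rest.
    // Returns how many sessions were rebound.
    fn rebind_all(&mut self, new_connection: u64) -> usize {
        let mut rebound = 0;
        self.tracked.retain(|weak| match weak.upgrade() {
            Some(session) => {
                *session.bound_connection.lock().unwrap() = new_connection;
                rebound += 1;
                true
            }
            // The handle was dropped, so nothing needs rebinding.
            None => false,
        });
        rebound
    }
}
```

Because the tracking list never holds a strong reference, dropping the last process handle is enough to release the session; the next rebind pass simply prunes the dead entry.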

Preserved remote process sessions now emit a local ResyncRequired event and wake when their transport disappears. Push-based consumers can use that signal to recover through process/read(afterSeq) instead of waiting forever for another pushed event. Direct one-shot/test sessions still fail terminally on disconnect.
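
A push-based consumer could handle the resync signal along these lines. This is a sketch, not the PR's code: `FakeServer`, `Chunk`, `Consumer`, and the shape of `ProcessEvent` are invented for illustration, and it assumes the server retains output addressable by sequence number.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Chunk {
    seq: u64,
    data: String,
}

// Stand-in for the remote side of process/read(afterSeq).
struct FakeServer {
    retained: Vec<Chunk>,
}

impl FakeServer {
    fn read(&self, after_seq: u64) -> Vec<Chunk> {
        self.retained.iter().filter(|c| c.seq > after_seq).cloned().collect()
    }
}

enum ProcessEvent {
    Output(Chunk),
    ResyncRequired,
}

struct Consumer {
    last_seq: u64,
    output: String,
}

impl Consumer {
    // Applying a chunk at or below the cursor is a no-op, which is what makes
    // a repeated read harmless.
    fn apply(&mut self, chunk: Chunk) {
        if chunk.seq <= self.last_seq {
            return;
        }
        self.last_seq = chunk.seq;
        self.output.push_str(&chunk.data);
    }

    fn handle(&mut self, event: ProcessEvent, server: &FakeServer) {
        match event {
            ProcessEvent::Output(chunk) => self.apply(chunk),
            // Pushed events may have been lost with the old transport, so pull
            // everything after the last sequence number already seen.
            ProcessEvent::ResyncRequired => {
                for chunk in server.read(self.last_seq) {
                    self.apply(chunk);
                }
            }
        }
    }
}
```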

Resume error handling is intentionally narrow:

  • unknown session id ... is treated as terminal and cached for later callers.
  • session ... is already attached to another connection is treated as a transient resume race and retried briefly.
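
The two cases above can be sketched as a classifier plus a bounded retry loop. Matching on message text is an assumption made to keep the example small; the real code may well key off structured error codes. `resume_with_retry` and its parameters are hypothetical, and a real loop would pause briefly between attempts.

```rust
#[derive(Debug, PartialEq)]
enum ResumeOutcome {
    Terminal,
    RetryBriefly,
    Other,
}

fn classify_resume_error(message: &str) -> ResumeOutcome {
    if message.starts_with("unknown session id") {
        ResumeOutcome::Terminal
    } else if message.starts_with("session ")
        && message.ends_with("is already attached to another connection")
    {
        ResumeOutcome::RetryBriefly
    } else {
        ResumeOutcome::Other
    }
}

fn resume_with_retry<F>(
    mut attempt: F,
    max_attempts: usize,
    cached_terminal: &mut Option<String>,
) -> Result<(), String>
where
    F: FnMut() -> Result<(), String>,
{
    // Later callers see the cached terminal failure without dialing again.
    if let Some(err) = cached_terminal.as_ref() {
        return Err(err.clone());
    }
    let mut last = String::from("no resume attempts were made");
    for _ in 0..max_attempts {
        match attempt() {
            Ok(()) => return Ok(()),
            Err(message) => match classify_resume_error(&message) {
                ResumeOutcome::Terminal => {
                    *cached_terminal = Some(message.clone());
                    return Err(message);
                }
                // Transient race: the old connection has not detached yet.
                ResumeOutcome::RetryBriefly => last = message,
                ResumeOutcome::Other => return Err(message),
            },
        }
    }
    Err(last)
}
```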

Idempotency / replay rules

This PR does not add general request idempotency keys. Instead it makes the first replay boundary explicit at the API layer:

  • process/read(afterSeq) may reconnect and retry once after a transport-close race because the cursor makes the read recoverable and duplicate-safe.
  • process/start, process/write, process/terminate, filesystem RPCs, and http/request reconnect before later calls but are not replayed after an ambiguous mid-request disconnect.
  • streamed HTTP response bodies remain connection-local; a later request can reconnect, but an already-open body-delta stream is not resumed in this slice.
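
One way to picture the replay boundary is a per-method policy consulted only after a transport-close race. The method names come from the list above; `ReplayPolicy`, `TransportClosed`, and `call_once_or_replay` are hypothetical, and the reconnect step itself is folded into the `send` closure for brevity.

```rust
#[derive(Debug, PartialEq)]
enum ReplayPolicy {
    RetryOnceAfterReconnect,
    SurfaceAmbiguity,
}

#[derive(Debug)]
struct TransportClosed;

fn replay_policy(method: &str) -> ReplayPolicy {
    match method {
        // The afterSeq cursor makes a repeated read duplicate-safe.
        "process/read" => ReplayPolicy::RetryOnceAfterReconnect,
        // process/start, process/write, process/terminate, filesystem RPCs,
        // and http/request may already have taken effect on the server.
        _ => ReplayPolicy::SurfaceAmbiguity,
    }
}

fn call_once_or_replay<F>(method: &str, mut send: F) -> Result<String, String>
where
    F: FnMut() -> Result<String, TransportClosed>,
{
    match send() {
        Ok(response) => Ok(response),
        Err(TransportClosed) => match replay_policy(method) {
            ReplayPolicy::RetryOnceAfterReconnect => send()
                .map_err(|_| format!("{}: transport closed again during the retry", method)),
            ReplayPolicy::SurfaceAmbiguity => Err(format!(
                "{}: transport closed mid-request; outcome unknown",
                method
            )),
        },
    }
}
```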

That keeps this foundation honest: we recover the operation that already has a read cursor, and we surface ambiguity for operations that would need explicit idempotency keys or stronger protocol semantics before automatic replay is safe.

Rendezvous / relay behavior

Rendezvous harness URLs are still represented as ExecServerTransportParams::WebSocketUrl, so they take the same logical reconnect path as direct websocket URLs. client_transport continues to detect ?role=harness and wrap the websocket in the relay transport before exec-server initialize/resume runs over it.
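
The role check might be as small as the following. This is a guess at the shape of the detection, not the actual client_transport code: `TransportKind`, `has_harness_role`, and `transport_kind` are hypothetical names, and the query parsing is deliberately naive (no percent-decoding).

```rust
#[derive(Debug, PartialEq)]
enum TransportKind {
    DirectWebSocket,
    RelayWrapped,
}

// True when the URL's query string contains the exact pair role=harness.
fn has_harness_role(url: &str) -> bool {
    let without_fragment = url.split('#').next().unwrap_or(url);
    match without_fragment.split_once('?') {
        Some((_, query)) => query.split('&').any(|pair| pair == "role=harness"),
        None => false,
    }
}

// Both kinds arrive as the same websocket URL parameter; only the wrapping
// applied before initialize/resume differs.
fn transport_kind(url: &str) -> TransportKind {
    if has_harness_role(url) {
        TransportKind::RelayWrapped
    } else {
        TransportKind::DirectWebSocket
    }
}
```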

For this slice, reconnecting a rendezvous-backed client means establishing a fresh relay websocket/stream, then resuming the exec-server logical session above that relay binding and recovering process output through process/read(afterSeq).

This is intentionally not full reliable-message relay replay. The Reliable Messages design has harness and executor retain seq/ack/unacked state and resend remaining relay segments end-to-end after same-stream_id resume. That endpoint-owned relay resume/retry protocol is a later slice; this PR only makes the exec-server client/session layer resilient to replacement of the underlying websocket or relay binding.

Validation

  • added focused reconnect coverage for shared reconnect attempts, reconnect-before-dispatch across remote APIs, cursor-read replay, and no replay for ambiguous non-read APIs
  • added regressions for idle process resync notification, transient resume-conflict retry, terminal unknown-session caching, process-session cleanup after start/connect failure, and process-id reuse after a dropped remote handle
  • ran focused remote Bazel tests for //codex-rs/exec-server:exec-server-unit-tests and //codex-rs/rmcp-client:rmcp-client-unit-tests
  • generated an exec-server coverage report with cargo llvm-cov on the devbox while checking the new reconnect paths

starr-openai force-pushed the starr/exec-server-reconnect-candidate-b-20260519 branch 2 times, most recently from 3b15152 to db2e2bc, May 19, 2026 02:56
starr-openai force-pushed the starr/exec-server-reconnect-candidate-b-20260519 branch from db2e2bc to ef8267f, May 19, 2026 03:05
Base automatically changed from starr/exec-server-ws-keepalive-20260517 to main, May 19, 2026 20:32
starr-openai deleted the starr/exec-server-reconnect-candidate-b-20260519 branch, May 29, 2026 22:49