Add exec-server websocket reconnect foundation#23395
Closed
starr-openai wants to merge 12 commits into
Closed
Conversation
3b15152 to
db2e2bc
Compare
db2e2bc to
ef8267f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Why
Remote exec-server environments currently treat a websocket drop as terminal even when the same exec-server process is still alive. That is too fragile for the same-host case we want first: laptop sleep, tunnel churn, websocket close races, or rendezvous route replacement should not force the rest of Codex to know that the transport changed underneath it.
This PR establishes the reconnect boundary at the environment-owned remote exec-server client. The rest of Codex should keep holding one logical remote environment capability while that client swaps the underlying live
ExecServerConnectionas needed.The important constraint is that reconnect is not the same as replaying every request. A transport can die after the server accepted an RPC but before the client saw the response. For this slice, the protocol only has one operation with a built-in replay cursor:
process/read(afterSeq). Everything else must reconnect before later calls, but an ambiguous in-flight call returns an error instead of risking duplicate side effects.What Changed
Client/session restructuring
The old lazy remote client held onto the first successful connection forever. This PR replaces that with a logical
RemoteExecServerClientthat owns durableRemoteExecServerSessionstate:ExecServerConnection, if anysession_idProcessSessions that may need rebinding while their process handles still existExecServerConnectionis now the name for one live JSON-RPC transport binding. It owns connection-local machinery such as theRpcClient, reader task, disconnect latch, process notification routes, and streamed HTTP body routes. The publicExecServerClientname remains as a type alias so existing callers do not need an API migration in this PR.RemoteProcess,RemoteFileSystem, and the remoteHttpClientimplementation are now thin adapters over the shared logicalRemoteExecServerClient; they do not own separate reconnect loops.Reconnect semantics
When a websocket-backed remote client notices a dead transport, the next remote API call asks the shared logical client for a connection. The client either reuses the live binding, waits for the one reconnect attempt already in progress, or creates one replacement connection and resumes the prior exec-server
session_id.After resume, tracked process session routes are rebound onto the replacement
ExecServerConnectionso existingRemoteExecProcesshandles continue through the same logical client state. The logical client does not own those sessions forever:RemoteExecProcesshandles own the durableProcessSessionstate, and the reconnect session keeps only weak references needed to rebind still-live handles.Preserved remote process sessions now emit a local
ResyncRequiredevent and wake when their transport disappears. Push-based consumers can use that signal to recover throughprocess/read(afterSeq)instead of waiting forever for another pushed event. Direct one-shot/test sessions still fail terminally on disconnect.Resume error handling is intentionally narrow:
unknown session id ...is treated as terminal and cached for later callers.session ... is already attached to another connectionis treated as a transient resume race and retried briefly.Idempotency / replay rules
This PR does not add general request idempotency keys. Instead it makes the first replay boundary explicit at the API layer:
process/read(afterSeq)may reconnect and retry once after a transport-close race because the cursor makes the read recoverable and duplicate-safe.process/start,process/write,process/terminate, filesystem RPCs, andhttp/requestreconnect before later calls but are not replayed after an ambiguous mid-request disconnect.That keeps this foundation honest: we recover the operation that already has a read cursor, and we surface ambiguity for operations that would need explicit idempotency keys or stronger protocol semantics before automatic replay is safe.
Rendezvous / relay behavior
Rendezvous harness URLs are still represented as
ExecServerTransportParams::WebSocketUrl, so they take the same logical reconnect path as direct websocket URLs.client_transportcontinues to detect?role=harnessand wrap the websocket in the relay transport before exec-server initialize/resume runs over it.For this slice, reconnecting a rendezvous-backed client means establishing a fresh relay websocket/stream, then resuming the exec-server logical session above that relay binding and recovering process output through
process/read(afterSeq).This is intentionally not full reliable-message relay replay. The Reliable Messages design has harness and executor retain seq/ack/unacked state and resend remaining relay segments end-to-end after same-
stream_idresume. That endpoint-owned relay resume/retry protocol is a later slice; this PR only makes the exec-server client/session layer resilient to replacement of the underlying websocket or relay binding.Validation
//codex-rs/exec-server:exec-server-unit-testsand//codex-rs/rmcp-client:rmcp-client-unit-testsexec-servercoverage report withcargo llvm-covon the devbox while checking the new reconnect paths