Reconnect disconnected exec-server websocket clients with fresh sessions#23562
Reconnect disconnected exec-server websocket clients with fresh sessions#23562starr-openai wants to merge 3 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f663cdab70
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| Err(err) if is_transport_closed_error(&err) => { | ||
| client.mark_disconnected(&err); | ||
| debug!("retrying exec-server request after websocket disconnect"); | ||
| let client = self.get().await?; | ||
| request(client).await |
There was a problem hiding this comment.
Avoid replaying mutating RPCs after a lost response
When the websocket drops after the server has already handled a mutating call but before its response is delivered, this branch treats the pending call as a transport error and immediately invokes the same closure on the reconnected client. Several callers now wrapped in request are not safely replayable (fs_remove with force: false, non-recursive create_directory, process/start), so a successful delete can be reported as NotFound on retry or a started process can be left running while the retry returns a duplicate-process error. Please only retry when the request is known not to have been sent, or restrict automatic retry to idempotent operations.
Useful? React with 👍 / 👎.
| let session = client.register_session(&process_id).await?; | ||
| if let Err(err) = client.exec(params).await { | ||
| session.unregister().await; | ||
| return Err(err); | ||
| } | ||
| Ok(session) |
There was a problem hiding this comment.
Rebind running process handles after reconnect
A process started here returns a Session that is permanently tied to the ExecServerClient instance used for process/start. If that websocket later disconnects, the server session can be resumed by the shared RemoteExecServerClient, but this RemoteExecProcess still services read, write, and terminate through the old session/client, so a running remote command becomes unreadable/unwritable even after another operation has successfully reconnected. Please route process operations through the reconnecting client or re-register/rebind active process sessions after reconnect.
Useful? React with 👍 / 👎.
| file_system | ||
| .write_file(&file_path_abs, payload.clone(), /*sandbox*/ None) | ||
| .await?; | ||
| remote_exec("ss -K state established '( sport = :31987 )'")?; |
There was a problem hiding this comment.
Isolate the reconnect test from parallel remote tests
When the remote-env test binary runs with the default parallel test harness, this command kills every established connection to the shared exec-server port in CODEX_TEST_REMOTE_ENV; #[serial] only serializes this test with other serial tests, not the unannotated remote_env tests that use the same container. If another remote filesystem/process test is running at the same time, its websocket can be dropped unexpectedly and the suite becomes flaky. Please isolate this test's exec-server instance or serialize all tests sharing that remote server.
Useful? React with 👍 / 👎.
5a344e4 to
055c9f8
Compare
055c9f8 to
70cd4a7
Compare
3e3ea88 to
28bb7bf
Compare
Summary
get()Why
The prior PR direction was trying to grow into session restore: resume the old
session_id, preserve existing process handles, and add reconnect retry policy. That is more machinery than we want for this slice.For now, the useful minimum is simpler: later fresh remote operations should not be stuck behind a dead cached websocket client, but anything already attached to the dead connection should fail loudly through the existing disconnect path. The server already has detached-session cleanup via its existing TTL, so this PR does not need to add client-side session preservation.
What Changed
LazyRemoteExecServerClient::get()now keeps the current concrete client in a small mutex-protected cache plus one async connect lock.get()returns it.get()creates a brand-new websocket client with a brand-new exec-server session and replaces the cache.resume_session_id, backoff, request replay, or existingRemoteExecProcessrebinding is added here.get()calls after disconnect share one fresh replacement client/session.Validation
just fmt.