Skip to content

Reconnect disconnected exec-server websocket clients with fresh sessions#23562

Closed
starr-openai wants to merge 3 commits into
mainfrom
starr/exec-reconnect-min-cand-a-full-ci
Closed

Reconnect disconnected exec-server websocket clients with fresh sessions#23562
starr-openai wants to merge 3 commits into
mainfrom
starr/exec-reconnect-min-cand-a-full-ci

Conversation

@starr-openai
Copy link
Copy Markdown
Contributor

@starr-openai starr-openai commented May 19, 2026

Summary

  • replace the one-shot lazy remote exec-server cache with a lock-protected current client
  • when the cached websocket client is already disconnected, create one fresh websocket client/session on the next get()
  • keep existing disconnect failure behavior for old process sessions and HTTP body streams; do not add session resume or request retry

Why

The prior PR direction was trying to grow into session restore: resume the old session_id, preserve existing process handles, and add reconnect retry policy. That is more machinery than we want for this slice.

For now, the useful minimum is simpler: later fresh remote operations should not be stuck behind a dead cached websocket client, but anything already attached to the dead connection should fail loudly through the existing disconnect path. The server already has detached-session cleanup via its existing TTL, so this PR does not need to add client-side session preservation.

What Changed

  • LazyRemoteExecServerClient::get() now keeps the current concrete client in a small mutex-protected cache plus one async connect lock.
  • If that cached client is still connected, get() returns it.
  • If that cached client is a disconnected websocket client, get() creates a brand-new websocket client with a brand-new exec-server session and replaces the cache.
  • If that cached client is stdio-backed, behavior stays one-shot: the dead client is returned and later work surfaces the existing disconnect error.
  • No resume_session_id, backoff, request replay, or existing RemoteExecProcess rebinding is added here.
  • Added focused websocket coverage that proves two concurrent get() calls after disconnect share one fresh replacement client/session.

Validation

  • Ran just fmt.
  • Did not run tests in this pass.

Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f663cdab70

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread codex-rs/exec-server/src/client.rs Outdated
Comment on lines +315 to +319
Err(err) if is_transport_closed_error(&err) => {
client.mark_disconnected(&err);
debug!("retrying exec-server request after websocket disconnect");
let client = self.get().await?;
request(client).await
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Avoid replaying mutating RPCs after a lost response

When the websocket drops after the server has already handled a mutating call but before its response is delivered, this branch treats the pending call as a transport error and immediately invokes the same closure on the reconnected client. Several callers now wrapped in request are not safely replayable (fs_remove with force: false, non-recursive create_directory, process/start), so a successful delete can be reported as NotFound on retry or a started process can be left running while the retry returns a duplicate-process error. Please only retry when the request is known not to have been sent, or restrict automatic retry to idempotent operations.

Useful? React with 👍 / 👎.

Comment on lines +44 to +49
let session = client.register_session(&process_id).await?;
if let Err(err) = client.exec(params).await {
session.unregister().await;
return Err(err);
}
Ok(session)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Rebind running process handles after reconnect

A process started here returns a Session that is permanently tied to the ExecServerClient instance used for process/start. If that websocket later disconnects, the server session can be resumed by the shared RemoteExecServerClient, but this RemoteExecProcess still services read, write, and terminate through the old session/client, so a running remote command becomes unreadable/unwritable even after another operation has successfully reconnected. Please route process operations through the reconnecting client or re-register/rebind active process sessions after reconnect.

Useful? React with 👍 / 👎.

Comment thread codex-rs/core/tests/suite/remote_env.rs Outdated
file_system
.write_file(&file_path_abs, payload.clone(), /*sandbox*/ None)
.await?;
remote_exec("ss -K state established '( sport = :31987 )'")?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Isolate the reconnect test from parallel remote tests

When the remote-env test binary runs with the default parallel test harness, this command kills every established connection to the shared exec-server port in CODEX_TEST_REMOTE_ENV; #[serial] only serializes this test with other serial tests, not the unannotated remote_env tests that use the same container. If another remote filesystem/process test is running at the same time, its websocket can be dropped unexpectedly and the suite becomes flaky. Please isolate this test's exec-server instance or serialize all tests sharing that remote server.

Useful? React with 👍 / 👎.

@starr-openai starr-openai reopened this May 20, 2026
@starr-openai starr-openai force-pushed the starr/exec-reconnect-min-cand-a-full-ci branch 6 times, most recently from 5a344e4 to 055c9f8 Compare May 20, 2026 21:13
@starr-openai starr-openai changed the title Add minimal exec-server websocket reconnect Reconnect disconnected exec-server websocket clients with fresh sessions May 20, 2026
@starr-openai starr-openai force-pushed the starr/exec-reconnect-min-cand-a-full-ci branch from 055c9f8 to 70cd4a7 Compare May 21, 2026 11:58
starr-openai added a commit that referenced this pull request May 21, 2026
@starr-openai starr-openai force-pushed the starr/exec-reconnect-min-cand-a-full-ci branch from 3e3ea88 to 28bb7bf Compare May 21, 2026 12:21
@starr-openai starr-openai deleted the starr/exec-reconnect-min-cand-a-full-ci branch May 21, 2026 12:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant