
Avoid instant remote disconnects under websocket queue pressure #18265

Closed
etraut-openai wants to merge 7 commits into main from etraut/websoket-disconnect

Conversation

etraut-openai (Collaborator) commented Apr 17, 2026

Fixes #18203.

Why

Remote TUI clients connected through codex app-server --listen ws://... can receive bursts of outbound turn/progress notifications. Before this change, a full per-connection writer queue was treated as an immediate slow-client signal, so a healthy WebSocket client that was only momentarily behind could be disconnected mid-turn.

What Changed

Disconnectable WebSocket connections now get a bounded grace window when their outbound writer queue is full. If the queue drains within that window, the message is delivered and the connection stays open. If it remains full, the connection is cancelled so the outbound router cannot be blocked indefinitely.
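The grace-window idea can be sketched with a bounded std channel. This is a simplified, hypothetical illustration, not the PR's code: the real transport is async, and `send_with_grace`, its polling interval, and the message type are all invented here for clarity.

```rust
use std::sync::mpsc::{sync_channel, SyncSender, TrySendError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical sketch: try to enqueue `msg`; if the bounded writer queue is
/// full, keep retrying until `grace` elapses instead of disconnecting at once.
fn send_with_grace<T>(tx: &SyncSender<T>, mut msg: T, grace: Duration) -> Result<(), T> {
    let deadline = Instant::now() + grace;
    loop {
        match tx.try_send(msg) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Full(m)) if Instant::now() < deadline => {
                msg = m;
                thread::sleep(Duration::from_millis(5)); // wait for the queue to drain
            }
            // Grace exhausted or writer gone: the caller disconnects the client.
            Err(TrySendError::Full(m)) | Err(TrySendError::Disconnected(m)) => return Err(m),
        }
    }
}

fn main() {
    let (tx, rx) = sync_channel::<u32>(1); // tiny writer queue
    tx.try_send(0).unwrap(); // queue is now full

    // A "client" that catches up shortly after the burst.
    let drainer = thread::spawn(move || {
        thread::sleep(Duration::from_millis(50));
        rx.recv().unwrap(); // drain one slot
        rx.recv().unwrap() // then receive the delayed message
    });

    // Within the grace window the queue drains, so the send succeeds
    // and the connection stays open.
    assert!(send_with_grace(&tx, 1, Duration::from_millis(500)).is_ok());
    assert_eq!(drainer.join().unwrap(), 1);
    println!("delivered within grace window");
}
```

If nothing drains the queue before the deadline, `send_with_grace` returns the undelivered message and the caller can cancel the connection, which mirrors the "remains full past the grace window" disconnect path described above.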

The implementation also adds a per-connection bounded overflow channel for WebSocket sends. This keeps the shared outbound router from waiting on one slow connection, preserves message order for that connection while the overflow drains, and avoids creating an unbounded backlog of detached send waiters.
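The ordering guarantee can be illustrated with a toy model of the two bounded queues. This is a hypothetical, synchronous sketch: `ConnSender`, `drain_step`, and the capacities are invented names, and the real worker runs asynchronously.

```rust
use std::collections::VecDeque;

/// Hypothetical per-connection sender: a small writer queue plus a bounded
/// overflow buffer. Once anything sits in overflow, new messages are appended
/// there too, so per-connection order is preserved while the overflow drains.
struct ConnSender {
    writer: VecDeque<u32>, // stands in for the bounded writer queue
    writer_cap: usize,
    overflow: VecDeque<u32>, // the bounded overflow channel
    overflow_cap: usize,
}

impl ConnSender {
    /// Non-blocking from the router's perspective: never waits on this client.
    fn send(&mut self, msg: u32) -> Result<(), u32> {
        if self.overflow.is_empty() && self.writer.len() < self.writer_cap {
            self.writer.push_back(msg);
            Ok(())
        } else if self.overflow.len() < self.overflow_cap {
            self.overflow.push_back(msg); // preserve order behind earlier overflow
            Ok(())
        } else {
            Err(msg) // both bounded queues full: apply the grace window
        }
    }

    /// Overflow-worker step: move overflow messages into freed writer slots.
    fn drain_step(&mut self) {
        while self.writer.len() < self.writer_cap {
            match self.overflow.pop_front() {
                Some(m) => self.writer.push_back(m),
                None => break,
            }
        }
    }
}

fn main() {
    let mut conn = ConnSender {
        writer: VecDeque::new(),
        writer_cap: 2,
        overflow: VecDeque::new(),
        overflow_cap: 4,
    };
    for m in 0..5 {
        conn.send(m).unwrap(); // 0 and 1 hit the writer; 2, 3, 4 overflow
    }
    // Simulate the writer flushing frames while the worker refills it.
    let mut delivered = Vec::new();
    while !conn.writer.is_empty() {
        delivered.push(conn.writer.pop_front().unwrap());
        conn.drain_step();
    }
    assert_eq!(delivered, vec![0, 1, 2, 3, 4]); // order preserved end to end
    println!("order preserved: {delivered:?}");
}
```

The key invariant is that once the overflow buffer is non-empty, every new message goes behind it, so the client observes a single FIFO stream even though two queues are involved.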

When a connection times out or its writer closes, the overflow worker now notifies the outbound router so the connection is removed from routing promptly instead of waiting for the later ConnectionClosed event. The overflow-channel-full path also uses the same grace window before disconnecting, so short bursts that saturate both bounded queues do not immediately drop the client.
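The prompt-removal notification might look like the following minimal sketch. The `RouterEvent` enum, connection id, and channel wiring are assumptions for illustration; only the `ConnectionClosed` event name comes from the description above.

```rust
use std::collections::HashMap;
use std::sync::mpsc::channel;
use std::thread;

/// Hypothetical router-notification sketch: when a connection's overflow
/// worker gives up (grace window expired or writer closed), it tells the
/// router directly instead of waiting for a later ConnectionClosed event.
enum RouterEvent {
    ConnectionDead(u64), // connection id (hypothetical)
}

fn main() {
    let (events_tx, events_rx) = channel::<RouterEvent>();

    // The overflow worker for connection 7 detects a dead writer.
    let worker = thread::spawn(move || {
        events_tx.send(RouterEvent::ConnectionDead(7)).unwrap();
    });

    // The router removes the connection from its routing table promptly,
    // so subsequent broadcasts no longer attempt delivery to it.
    let mut routes: HashMap<u64, &str> = HashMap::from([(7, "ws-client")]);
    match events_rx.recv().unwrap() {
        RouterEvent::ConnectionDead(id) => {
            routes.remove(&id);
        }
    }
    worker.join().unwrap();
    assert!(routes.is_empty());
    println!("connection removed from routing");
}
```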

The stdio and in-process transports keep their existing reliable-delivery behavior; the grace/overflow path applies only to disconnectable WebSocket connections.

Testing

Added transport tests covering:

  • temporary WebSocket queue pressure draining without disconnecting
  • per-connection message ordering while overflow messages drain
  • disconnect after the queue remains full past the grace window
  • applying the grace window before disconnecting when both the writer queue and overflow queue are saturated

I also manually confirmed that the repro methodology from #18203 disconnects before the fix and stays connected with this change.

chatgpt-codex-connector[bot] (Contributor) left a comment (marked as resolved)

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d896f5cffc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread codex-rs/app-server/src/transport/mod.rs Outdated
Comment thread codex-rs/app-server/src/transport/mod.rs Outdated
Comment thread codex-rs/app-server/src/transport/mod.rs Outdated
Comment thread codex-rs/app-server/src/transport/mod.rs Outdated
Comment thread codex-rs/app-server/src/transport/mod.rs Outdated
chatgpt-codex-connector[bot] (Contributor) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 64872b998a

Comment thread codex-rs/app-server/src/transport/mod.rs
Comment thread codex-rs/app-server/src/transport/mod.rs
chatgpt-codex-connector[bot] (Contributor) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 997a23330f

Comment thread codex-rs/app-server/src/transport/mod.rs Outdated
Comment thread codex-rs/app-server/src/transport/mod.rs Outdated
Comment thread codex-rs/app-server/src/transport/mod.rs Outdated
maxj-oai (Contributor) left a comment

thank you @etraut-openai

etraut-openai added a commit that referenced this pull request on Apr 24, 2026:

Fixes #18203.

## Why

Remote TUI clients connected through `codex app-server --listen
ws://...` can receive short bursts of outbound turn and tool-output
notifications. The WebSocket transport previously used the shared
128-message channel capacity for its outbound writer queue, so a healthy
client that briefly lagged during normal output streaming could fill the
queue and be disconnected immediately.

This is a smaller mitigation than #18265: instead of adding a new
overflow/backpressure pipeline, keep the existing non-blocking router
behavior and give WebSocket clients enough bounded headroom for
realistic bursts.

## What Changed

- Added a WebSocket-only outbound writer capacity of `64 * 1024`
messages.
- Used that larger capacity only for the WebSocket data writer queue in
`codex-rs/app-server/src/transport/websocket.rs`.
- Left the shared `CHANNEL_CAPACITY` and the existing disconnect-on-full
behavior unchanged for internal/control channels and genuinely stuck
clients.

## Verification

- `cargo test -p codex-app-server
transport::tests::broadcast_does_not_block_on_slow_connection`
- Manually retried the #18203 repro prompt against the remote TUI and
confirmed it stayed connected.


Development

Successfully merging this pull request may close these issues.

App-server disconnects remote TUI mid-turn when outbound queue fills (128 messages)

2 participants