[app-server] Reap orphaned idle threads on websocket disconnect#14997

Closed
cooper-oai wants to merge 3 commits into main from cooper/mcp-thread-orphan-reap
Conversation

cooper-oai commented Mar 18, 2026

Summary

Today, switching threads over a websocket can leave the previous thread loaded even after it has no subscribers. Because local stdio MCP servers are thread-scoped, those orphaned loaded threads keep their helper processes alive. Repeating that across many threads causes process-count and memory growth, which matches the original report (31 local kepler processes and multi-GB RSS after repeated thread switching).

This change keeps the fix intentionally narrow:

  • on websocket disconnect, if a thread is orphaned, already inactive, and has a materialized rollout on disk, app-server now unloads it immediately
  • unloading reuses the existing hard-shutdown path, so thread-scoped stdio MCP servers are reaped instead of lingering after the UI switches away from a thread
  • add a focused websocket regression test that proves a resumed thread's local stdio MCP helper exits after the websocket disconnects
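The reap decision described above can be sketched as a simple predicate. This is an illustrative mock, not the actual app-server code: the names (`ThreadState`, `should_reap_on_disconnect`) and field layout are assumptions.

```rust
// Hypothetical sketch of the disconnect-time reap check.
// Names and fields are illustrative, not the real app-server API.

#[derive(Debug)]
struct ThreadState {
    subscriber_count: usize, // live websocket subscribers
    is_active: bool,         // a turn is currently running
    rollout_on_disk: bool,   // history is materialized and resumable
}

/// A thread is unloaded only when it is orphaned (no subscribers),
/// already inactive, and safely resumable from its rollout on disk.
fn should_reap_on_disconnect(t: &ThreadState) -> bool {
    t.subscriber_count == 0 && !t.is_active && t.rollout_on_disk
}

fn main() {
    let orphaned_idle = ThreadState {
        subscriber_count: 0,
        is_active: false,
        rollout_on_disk: true,
    };
    let still_running = ThreadState {
        subscriber_count: 0,
        is_active: true,
        rollout_on_disk: true,
    };
    // Only the orphaned, idle, resumable thread qualifies for unload.
    println!("{}", should_reap_on_disconnect(&orphaned_idle)); // true
    println!("{}", should_reap_on_disconnect(&still_running)); // false
}
```

Active threads and threads without a materialized rollout fail the check, which is what keeps the fix narrow: only threads that can be cheaply resumed later are unloaded.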

Side effect to be aware of: if the last client disconnects from an idle thread, app-server now closes that thread instead of keeping it loaded in memory. If someone opens that thread again later, it still resumes from saved history, but reopening may be a little slower because the thread and its MCP helper processes start again.

Out of scope for this PR:

  • broader disconnect/reconnect semantics for active or still-running threads
  • grace-period or replay behavior for prompts during reconnect
  • any larger refactor of thread lifecycle management

Validation

  • bazel test //codex-rs/app-server:app-server-all-test

I also ran a manual before/after reproduction of the reported leak using a temporary integration harness with a local stdio MCP helper.

Workload:

  1. Create 15 persisted threads.
  2. Resume each thread over websocket.
  3. Close the websocket before moving to the next thread.
  4. After each switch, record thread/loaded/list, live MCP helper count, and helper RSS.
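The workload loop can be condensed into a toy simulation (not the real harness): a mock server either reaps orphaned idle threads on disconnect, as on this branch, or leaves them loaded, as on origin/main. All names here are assumptions for illustration.

```rust
// Toy model of the 15-thread switch workload. The loaded set stands in
// for both loaded threads and their live MCP helper processes.
use std::collections::HashSet;

struct MockServer {
    reap_on_disconnect: bool,
    loaded: HashSet<u32>,
}

impl MockServer {
    fn resume(&mut self, thread_id: u32) {
        // Resuming a thread loads it and starts its MCP helper.
        self.loaded.insert(thread_id);
    }
    fn disconnect(&mut self, thread_id: u32) {
        // After the websocket closes, the thread is orphaned and idle.
        if self.reap_on_disconnect {
            self.loaded.remove(&thread_id);
        }
    }
}

/// Runs the workload: resume each of 15 threads over a websocket,
/// then close the websocket before moving to the next one.
fn run_workload(reap: bool) -> usize {
    let mut server = MockServer {
        reap_on_disconnect: reap,
        loaded: HashSet::new(),
    };
    for id in 0..15 {
        server.resume(id);
        server.disconnect(id);
    }
    server.loaded.len() // loaded-thread count after all switches
}

fn main() {
    println!("origin/main: {} loaded", run_workload(false)); // 15 loaded
    println!("this branch: {} loaded", run_workload(true)); // 0 loaded
}
```

With reaping disabled the loaded count grows by one per switch; with reaping enabled it returns to zero after every disconnect, mirroring the measured results below.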

Measured results:

  • origin/main: leaked one loaded thread and one live MCP helper per switch; after 15 switches the server still had 15 loaded threads and 15 live helpers, with helper RSS growing to about 104.6 MB
  • this branch: after every websocket close, loaded-thread count returned to 0 and live helper count returned to 0; after 15 switches the server finished with 0 loaded threads and 0 live helpers

That before/after experiment is the closest reproduction of the original user report, and it shows this scoped change stops the linear thread/process accumulation during left-nav thread switching.

Follow-ups

  • handle orphaned active/running threads after disconnect without expanding this PR into a full reconnect-lifecycle rewrite
  • decide whether local MCP server lifetime should remain thread-scoped or move to a shared/pool model, since that would address process bloat more directly
  • revisit broader disconnect/reconnect semantics separately if we want transient websocket blips during active turns to preserve resumability


Co-authored-by: Codex <noreply@openai.com>
cooper-oai force-pushed the cooper/mcp-thread-orphan-reap branch from 91641c5 to 2779ade on March 18, 2026 03:06

github-actions bot commented Apr 2, 2026

Closing this pull request because it has had no updates for more than 14 days. If you plan to continue working on it, feel free to reopen or open a new PR.

@github-actions github-actions Bot closed this Apr 2, 2026