Skip to content

fix: make session MCP startup lazy#22901

Draft
efrazer-oai wants to merge 3 commits into
mainfrom
fix/lazy-session-mcp-startup
Draft

fix: make session MCP startup lazy#22901
efrazer-oai wants to merge 3 commits into
mainfrom
fix/lazy-session-mcp-startup

Conversation

@efrazer-oai
Copy link
Copy Markdown
Contributor

@efrazer-oai efrazer-oai commented May 15, 2026

Summary

App-server thread startup used to build the session's MCP pool up front. thread/start and thread/resume waited for that work, including required MCP server startup, before returning.

This change keeps one MCP pool per session, but removes that startup wait. thread/start and thread/resume now return after session configuration, while MCP initialization begins in the background. The first real MCP consumer still awaits the shared session pool if startup has not finished yet.

This keeps the user-visible startup path fast without changing how MCP consumers observe readiness, and it preserves the early mcpServer/startupStatus/updated notifications that clients already rely on.

This also keeps the forked-thread boundary intact. We still do not share one process-wide MCP pool because MCP runtime state depends on session config, cwd, environment selection, and approval state.

Design decisions

  • Keep the singleton at session scope. Forked threads can carry different MCP config and runtime context, so a process-wide singleton would mix unrelated state.
  • Start initialization in the background after session configuration. This removes MCP startup from the thread/start and thread/resume critical path while still emitting startup status updates promptly.
  • Preserve explicit refresh. Refresh still rebuilds and swaps the shared session pool, and it still uses the active turn context so turn-local environment overrides keep working.
  • Move readiness waits to consumers. Tool building, app tool discovery, and MCP resource/tool calls now await the shared pool on demand instead of making thread creation wait.

Testing

  • Targeted codex-core tests covering deferred refresh and idempotent lazy initialization.
  • Targeted codex-app-server tests covering thread/start, thread/resume, and MCP status notifications.
  • Scoped clippy autofix and formatting for the touched crates.

@efrazer-oai
Copy link
Copy Markdown
Contributor Author

@codex review

Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 599f53bc67

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +888 to +889
pending_mcp_server_refresh: Mutex::new(None),
mcp_connection_manager_initialized: std::sync::atomic::AtomicBool::new(false),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Start MCP initialization for idle thread starts

With the manager now left uninitialized here and no replacement startup path after SessionConfigured, a client that only calls thread/start or thread/resume and then listens for MCP server status never triggers ensure_mcp_connection_manager_initialized(). That means mcpServer/startupStatus/updated notifications are no longer emitted for configured servers until a later turn/tool listing happens; the existing thread_start_emits_mcp_server_status_updated_notifications workflow still waits for those notifications immediately after ThreadStartResponse, so it will time out for optional/broken MCP configs.

Useful? React with 👍 / 👎.

@efrazer-oai efrazer-oai force-pushed the fix/lazy-session-mcp-startup branch from 599f53b to 95d02e8 Compare May 15, 2026 21:26
@efrazer-oai
Copy link
Copy Markdown
Contributor Author

@codex review

@efrazer-oai
Copy link
Copy Markdown
Contributor Author

@codex review

Copy link
Copy Markdown
Contributor

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68dc2fda2f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +227 to +228
client_elicitation_capability: mcp_config.client_elicitation_capability,
elicitation_reviewer: None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pass the Guardian reviewer during lazy MCP init

When the session MCP manager is first created through the new lazy/startup path, elicitation_reviewer is hard-coded to None, whereas the old startup path passed Some(sess.mcp_elicitation_reviewer()). In sessions using Guardian approval routing, MCP servers that emit Guardian MCP-tool elicitations during normal tool calls will now bypass review_guardian_mcp_elicitation for the lifetime of the initial manager and fall back to raw ElicitationRequest handling until a refresh happens.

Useful? React with 👍 / 👎.

let (refreshed_manager, cancel_token) = McpConnectionManager::new(
&mcp_servers,

self.replace_mcp_connection_manager(ReplaceMcpConnectionManagerArgs {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Serialize refresh with lazy MCP initialization

This refresh path can run while the background startup task is still inside ensure_mcp_connection_manager_initialized(), but it does not take the initialization lock or otherwise invalidate the in-flight initial build. If an app-server client sends RefreshMcpServers before the startup prewarm completes and the next turn applies it, the refresh and the original lazy init can race; whichever replace_mcp_connection_manager finishes last wins, so the original config can overwrite the just-refreshed server set.

Useful? React with 👍 / 👎.

Comment on lines +95 to +97
.unwrap_or_else(|err| {
panic!("session MCP environment selections should remain valid: {err}")
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Return an error for invalid MCP environments

With a stored/sticky MCP environment id that is no longer known, this lazy initialization path now panics inside the spawned startup task instead of returning the InvalidRequest error that the old Session::new path produced. That leaves app-server thread/start/thread/resume apparently successful, and the first later MCP consumer can panic again while trying to initialize the manager.

Useful? React with 👍 / 👎.

anyhow::bail!("required MCP servers failed to initialize: {details}");
}
}
sess.start_mcp_connection_manager_initialization();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Cancel lazy MCP initialization on shutdown

Starting MCP initialization as a detached task here means an immediate Op::Shutdown can drain the still-uninitialized manager and emit ShutdownComplete while this task continues holding the session alive. When it later reaches replace_mcp_connection_manager, it can create and swap in real MCP clients after shutdown, leaving stdio MCP server processes running and possibly emitting startup events for a closed thread.

Useful? React with 👍 / 👎.

Comment on lines +214 to +229
self.replace_mcp_connection_manager(ReplaceMcpConnectionManagerArgs {
submit_id: INITIAL_SUBMIT_ID.to_owned(),
approval_policy: &session_configuration.approval_policy,
permission_profile: session_configuration.permission_profile(),
runtime_environment: self.session_mcp_runtime_environment(&session_configuration),
config: &config,
mcp_servers,
store_mode: config.mcp_oauth_credentials_store_mode,
auth: auth.as_ref(),
host_owned_codex_apps_enabled: host_owned_codex_apps_enabled(
&mcp_config,
auth.as_ref(),
),
client_elicitation_capability: mcp_config.client_elicitation_capability,
elicitation_reviewer: None,
})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve required MCP server failures

After moving startup behind lazy initialization, the old required_startup_failures(...) check is no longer called anywhere, so a configured required = true server that fails to start only produces startup events and the turn proceeds with the missing tools. For setups that mark MCP servers required because the agent depends on them, the first MCP/tool-building consumer should still fail or otherwise stop the turn instead of silently continuing without the required server.

Useful? React with 👍 / 👎.

Comment on lines +135 to +139
{
let mut guard = self.services.mcp_startup_cancellation_token.lock().await;
guard.cancel();
*guard = CancellationToken::new();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor interrupts that arrive before token replacement

If the user interrupts an idle session while the lazy MCP init task is still computing provenance/auth status, cancel_mcp_startup() cancels the placeholder token, but this block then discards that cancelled token and installs a fresh uncancelled one before spawning MCP clients. In that timing window, the interrupt no longer cancels startup, so MCP processes can still be launched after the user asked to stop idle startup work.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant