Skip to content

[apps] Add tool_suggest tool.#14287

Merged
mzeng-openai merged 59 commits intomainfrom
dev/mzeng/tool_suggest_2
Mar 12, 2026
Merged

[apps] Add tool_suggest tool.#14287
mzeng-openai merged 59 commits intomainfrom
dev/mzeng/tool_suggest_2

Conversation

@mzeng-openai
Copy link
Collaborator

@mzeng-openai mzeng-openai commented Mar 11, 2026

  • Add tool_suggest tool.
  • Move chatgpt/src/connectors.rs and core/src/connectors.rs into a dedicated mod so that we have all the logic and global cache in one place.
  • Update TUI app link view to support rendering the installation view for mcp elicitation.

shaqayeq-oai and others added 30 commits March 10, 2026 01:00
## Summary
Foundation PR only (base for PR #3).

This PR contains the SDK runtime foundation and generated artifacts:

- pinned runtime binary in `sdk/python/bin/` (`codex` or `codex.exe` by
platform)
- single maintenance script:
`sdk/python/scripts/update_sdk_artifacts.py`
- generated protocol/types artifacts under:
  - `sdk/python/src/codex_app_server/generated/protocol_types.py`
  - `sdk/python/src/codex_app_server/generated/schema_types.py`
  - `sdk/python/src/codex_app_server/generated/v2_all/*`
- generation-contract test wiring (`tests/test_contract_generation.py`)

## Release asset behavior
`update_sdk_artifacts.py` now:
- selects latest release by channel (`--channel stable|alpha`)
- resolves the correct asset for current OS/arch
- extracts platform binary (`codex` on macOS/Linux, `codex.exe` on
Windows)
- keeps runtime on single pinned binary source in `sdk/python/bin/`

## Scope boundary
- ✅ PR #2 = binary + generation pipeline + generated types foundation
- ❌ PR #2 does **not** include examples/integration logic polish (that
is PR #3)

## Validation
- Ran: `python scripts/update_sdk_artifacts.py --channel stable`
- Regenerated and committed resulting generated artifacts
- Local tests pass on branch
Addresses #13586

This doesn't affect our CI scripts. It was user-reported.

Summary
- add `wiremock::ResponseTemplate` and `body_string_contains` imports
behind `#[cfg(not(debug_assertions))]` in
`codex-rs/core/tests/suite/view_image.rs` so release builds only pull
the helpers they actually use
Replace the Unix shell lookup path in `codex-rs/core/src/shell.rs` to
use
`libc::getpwuid_r()` instead of `libc::getpwuid()` when resolving the
current
user's shell.

Why:
- `getpwuid()` can return pointers into libc-managed shared storage
- on the musl static Linux build, concurrent callers can race on that
storage
- this matches the crash pattern reported in tmux/Linux sessions with
parallel
  shell activity

Refs:
- Fixes #13842
There are some bug investigations that currently require us to ask users
for their user ID even though they've already uploaded logs and session
details via `/feedback`. This frustrates users and increases the time
for diagnosis.

This PR includes the ChatGPT user ID in the metadata uploaded for
`/feedback` (both the TUI and app-server).
Summary
- document output types for the various tool handlers and registry so
the API exposes richer descriptions
- update unified execution helpers and client tests to align with the
new output metadata
- clean up unused helpers across tool dispatch paths

Testing
- Not run (not requested)
## Summary
- run the split stdout/stderr PTY test through the normal shell helper
on every platform
- use a Windows-native command string instead of depending on Python to
emit split streams
- assert CRLF line endings on Windows explicitly

## Why this fixes the flake
The earlier PTY split-output test used a Python one-liner on Windows
while the rest of the file exercised shell-command behavior. That made
the test depend on runner-local Python availability and masked the real
Windows shell output shape. Using a native cmd-compatible command and
asserting the actual CRLF output makes the split stdout/stderr coverage
deterministic on Windows runners.
We already have a type to represent the MCP tool output, reuse it
instead of the custom McpHandlerOutput
Fixes a Codex app bug where quitting the app mid-run could leave the
reopened thread stuck in progress and non-interactable. On cold thread
resume, app-server could return an idle thread with a replayed turn
still marked in progress. This marks incomplete replayed turns as
interrupted unless the thread is actually active.
This updates the `skill-creator` sample skill to explicitly cover
forward-testing as part of the skill authoring workflow. The guidance
now treats subagent-based validation as a first-class step for complex
or fragile skills, with an emphasis on preserving evaluation integrity
and avoiding leaked context.

The sample initialization script is also updated so newly created skills
point authors toward forward-testing after validation. Together, these
changes make the sample more opinionated about how skills should be
iterated on once the initial implementation is complete.

- Add new guidance to `SKILL.md` on protecting validation integrity,
when to use subagents for forward-testing, and how to structure
realistic test prompts without leaking expected answers.
- Expand the skill creation workflow so iteration explicitly includes
forward-testing for complex skills, including approval guidance for
expensive or risky validation runs.
## Summary
- add the OpenAI Docs skill under
codex-rs/skills/src/assets/samples/openai-docs
- include the skill metadata, assets, and GPT-5.4 upgrade reference
files
- exclude the test harness and test fixtures

## Testing
- not run (skill-only asset copy)
## Summary
- add ARC monitor support for MCP tool calls by serializing MCP approval
requests into the ARC action shape and sending the relevant
conversation/policy context to the `/api/codex/safety/arc` endpoint
- route ARC outcomes back into MCP approval flow so `ask-user` falls
back to a user prompt and `steer-model` blocks the tool call, with
guardian/ARC tests covering the new request shape
- update the TUI approval copy from “Approve Once” to “Allow” / “Allow
for this session” and refresh the related
  snapshots

---------

Co-authored-by: Fouad Matin <fouad@openai.com>
Co-authored-by: Fouad Matin <169186268+fouad-openai@users.noreply.github.com>
Add forceRemoteSync to plugin/list.
When it is set to True, we will sync the local plugin status with the
remote one (backend-api/plugins/list).
#14232)

…e package

## Summary

This changes the Python SDK packaging model so we no longer commit
`codex`
binaries into `sdk/python`.

Instead, published SDK builds now depend on a separate `codex-cli-bin`
runtime
package that carries the platform-specific `codex` binary. The SDK and
runtime
can be staged together with an exact version pin, so the published
Python SDK
still resolves to a Codex version we know is compatible.

The SDK now resolves `codex` in this order:

- `AppServerConfig.codex_bin` if explicitly set
- installed `codex-cli-bin` runtime package

There is no `PATH` fallback anymore. Published installs either use the
pinned
runtime or fail loudly, and local development uses an explicit
`codex_bin`
override when working from the repo.

## What changed

- removed checked-in binaries from `sdk/python/src/codex_app_server/bin`
- changed `AppServerClient` to resolve `codex` from:
  - explicit `AppServerConfig.codex_bin`
  - installed `codex-cli-bin`
- kept `AppServerConfig.codex_bin` override support for local/dev use
- added a new `sdk/python-runtime` package template for the pinned
runtime
- updated `scripts/update_sdk_artifacts.py` to stage releasable
SDK/runtime
  packages instead of downloading binaries into the repo
- made `codex-cli-bin` build as a platform-specific wheel
- made `codex-cli-bin` wheel-only by rejecting `sdist` builds
- updated docs/tests to match the new packaging flow and explicit
local-dev
  contract

## Why

Checking in six platform binaries made the repo much heavier and tied
normal
source changes to release artifacts.

This keeps the compatibility guarantees we want, but moves them into
packaging:

- the published SDK can depend on an exact `codex-cli-bin==...`
- the runtime package carries the platform-specific binary
- users still get a pinned runtime
- the repo no longer needs to store those binaries

It also makes the runtime contract stricter and more predictable:

- published installs never silently fall back to an arbitrary `codex` on
`PATH`
- local development remains supported through explicit `codex_bin`
- `codex-cli-bin` is distributed as platform wheels only, which avoids
unsafe
source-distribution installs for a package that embeds a prebuilt binary

## Validation

- ran targeted Python SDK tests:
- `python3 -m pytest
sdk/python/tests/test_artifact_workflow_and_binaries.py
sdk/python/tests/test_client_rpc_methods.py
sdk/python/tests/test_contract_generation.py`
- exercised the staging flow with a local dummy binary to verify
SDK/runtime
  staging end to end
- verified the staged runtime package builds a platform-specific wheel
(`Root-Is-Purelib: false`) rather than a universal `py3-none-any` wheel
- added test coverage for the explicit-only runtime resolution model
- added test coverage that `codex-cli-bin` rejects `sdist` builds

---------

Co-authored-by: sdcoffey <stevendcoffey@gmail.com>
- add `model` and `reasoning_effort` to the `spawn_agent` schema so the
values pass through
- validate requested models against `model.model` and only check that
the selected model supports the requested reasoning effort

---------

Co-authored-by: Codex <noreply@openai.com>
image-gen feature will have the model saving to /tmp by default + at all
times
…Reject (#14191)

## Summary
This change makes `AskForApproval::Reject` gate correctly anywhere it
appears inside otherwise-stable app-server protocol types.

Previously, experimental gating for `approval_policy: Reject` was
handled with request-specific logic in `ClientRequest` detection. That
covered a few request params types, but it did not generalize to other
nested uses such as `ProfileV2`, `Config`, `ConfigReadResponse`, or
`ConfigRequirements`.

This PR replaces that ad hoc handling with a generic nested experimental
propagation mechanism.

## Testing

seeing this when run app-server-test-client without experimental api
enabled:
```
 initialize response: InitializeResponse { user_agent: "codex-toy-app-server/0.0.0 (Mac OS 26.3.1; arm64) vscode/2.4.36 (codex-toy-app-server; 0.0.0)" }
> {
>   "id": "50244f6a-270a-425d-ace0-e9e98205bde7",
>   "method": "thread/start",
>   "params": {
>     "approvalPolicy": {
>       "reject": {
>         "mcp_elicitations": false,
>         "request_permissions": true,
>         "rules": false,
>         "sandbox_approval": true
>       }
>     },
>     "baseInstructions": null,
>     "config": null,
>     "cwd": null,
>     "developerInstructions": null,
>     "dynamicTools": null,
>     "ephemeral": null,
>     "experimentalRawEvents": false,
>     "mockExperimentalField": null,
>     "model": null,
>     "modelProvider": null,
>     "persistExtendedHistory": false,
>     "personality": null,
>     "sandbox": null,
>     "serviceName": null
>   }
> }
< {
<   "error": {
<     "code": -32600,
<     "message": "askForApproval.reject requires experimentalApi capability"
<   },
<   "id": "50244f6a-270a-425d-ace0-e9e98205bde7"
< }
[verified] thread/start rejected approvalPolicy=Reject without experimentalApi
```

---------

Co-authored-by: celia-oai <celia@openai.com>
…de (#14236)

Summary
- drop `McpToolOutput` in favor of `CallToolResult`, moving its helpers
to keep MCP tooling focused on the final result shape
- wire the new schema definitions through code mode, context, handlers,
and spec modules so MCP tools serialize the exact output shape expected
by the model
- extend code mode tests to cover multiple MCP call scenarios and ensure
the serialized data matches the new schema
- refresh JS runner helpers and protocol models alongside the schema
changes

Testing
- Not run (not requested)
Summary
- document that `@openai/code_mode` exposes
`set_max_output_tokens_per_exec_call` and that `code_mode` truncates the
final Rust-side output when the budget is exceeded
- enforce the configured budget in the Rust tool runner, reusing
truncation helpers so text-only outputs follow the unified-exec wrapper
and mixed outputs still fit within the limit
- ensure the new behavior is covered by a code-mode integration test and
string spec update

Testing
- Not run (not requested)
- raise the sdk workflow job timeout from 10 to 15 minutes to reduce
false cancellations near the current limit

---------

Co-authored-by: Codex <noreply@openai.com>
- clarify the `close_agent` tool description so it nudges models to
close agents they no longer need
- keep the change scoped to the tool spec text only

Co-authored-by: Codex <noreply@openai.com>
Summary
- document how code-mode can import `output_text`/`output_image` and
ensure `add_content` stays compatible
- add a synthetic `@openai/code_mode` module that appends content items
and validates inputs
- cover the new behavior with integration tests for structured text and
image outputs

Testing
- Not run (not requested)
### Summary
This PR adds first-class ephemeral support to thread/fork, bringing it
in line with thread/start. The goal is to support one-off completions on
full forked threads without persisting them as normal user-visible
threads.

### Testing
…ontacts, Reminders (#14155)

Add additional macOS Sandbox Permissions levers for the following:

- Launch Services
- Contacts
- Reminders
Pass more params to /compact. This should give us parity with the
/responses endpoint to improve caching.

I'm torn about the MCP await. Blocking will give us parity but it seems
like we explicitly don't block on MCPs. Happy either way
adds support for transferring state across code mode invocations.
## Summary
- add `skill_approval` to `RejectConfig` and the app-server v2
`AskForApproval::Reject` payload so skill-script prompts can be
configured independently from sandbox and rule-based prompts
- update Unix shell escalation to reject prompts based on the actual
decision source, keeping prefix rules tied to `rules`, unmatched command
fallbacks tied to `sandbox_approval`, and skill scripts tied to
`skill_approval`
- regenerate the affected protocol/config schemas and expand
unit/integration coverage for the new flag and skill approval behavior
## What changed
- keep the explicit stdin-close behavior after writing so the child
still receives EOF deterministically
- on Windows, stop using `python -c` for the round-trip assertion and
instead run a native `cmd.exe` pipeline that reads one line from stdin
with `set /p` and echoes it back
- send `
` on Windows so the stdin payload matches the platform-native line
ending the shell reader expects

## Why this fixes flakiness
The failing branch-local flake was not in `spawn_pipe_process` itself.
The child exited cleanly, but the Windows ARM runner sometimes produced
an empty stdout string when the test used Python as the stdin consumer.
That makes the test sensitive to Python startup and stdin-close timing
rather than the pipe primitive we actually want to validate. Switching
the Windows path to a native `cmd.exe` reader keeps the assertion
focused on our pipe behavior: bytes written to stdin should come back on
stdout before EOF closes the process. The explicit `
` write removes line-ending ambiguity on Windows.

## Scope
- test-only
- no production logic change
## Summary
  - update the guardian prompting
- clarify the guardian rejection message so an action may still proceed
if the user explicitly approves it after being informed of the risk

  ## Testing
  - cargo run on selected examples
Copy link
Contributor

@canvrno-oai canvrno-oai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested and working, very cool stuff. Added two comments for small issues, nothing blocking. During testing I had a small hiccup (of my own making) since the connector I tested with had been installed and disabled previously. We should think about how to handle this when the connector is disabled, there are a few ways to look at this.

};
let auth = sess.services.auth_manager.auth().await;
let discoverable_connectors =
if turn_context.features.enabled(Feature::Apps) && turn_context.tools_config.search_tool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be gated by the Feature::ToolSuggest feature?

connectors::list_tool_suggest_discoverable_connectors_with_auth(
&turn.config,
auth.as_ref(),
&[],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we pass an empty set here the downstream accessible-connectors filtering will not filter out connectors that are already accessible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mzeng-openai is this a problem? the comment here seems true.

@@ -1,25 +1,18 @@
use std::collections::HashMap;
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move most of the logic to the connectors mod so that we can share the logic and persist global connectors cache in one place.

Comment on lines +16 to +17
- `tool_type`: `connector` or `plugin`
- `action_type`: `install` or `enable`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right now only connector + install is supported. i know we will fast-follow with others, but should we include those in the tool description now? dont want to confuse the model.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated to support plugins.

};
self.app_event_tx.send(AppEvent::SubmitThreadOp {
thread_id: target.thread_id,
op: Op::ResolveElicitation {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe nit: should app_link_view know about resolving elicitations, request_id, etc? its mostly rendering/input handling, should we just return an outcome and consolidate the elicitation parsing + resolution in a single spot in the bottom_pane mod?

turn_id: Some(turn_id),
server_name: CODEX_APPS_MCP_SERVER_NAME.to_string(),
request: McpServerElicitationRequest::Form {
meta: Some(build_tool_suggestion_meta(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will this be cleaned up in the follow up you mentioned here?

it doesnt seem great to shove all the tool suggestion stuff into this meta field like i mentioned.

Copy link
Collaborator

@sayan-oai sayan-oai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving so we can start testing. i think the elicitation parsing and resolution should be consolidated, right now it feels like the TUI needs to know too much about how things are passed and structured.

@mzeng-openai mzeng-openai merged commit ba5b942 into main Mar 12, 2026
32 of 33 checks passed
@mzeng-openai mzeng-openai deleted the dev/mzeng/tool_suggest_2 branch March 12, 2026 05:07
@github-actions github-actions bot locked and limited conversation to collaborators Mar 12, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.