What version of Codex CLI is running?
codex-cli 0.130.0
What subscription do you have?
ChatGPT (auth_mode="Chatgpt", gpt-5.4 access via subscription)
Which model were you using?
gpt-5.4 (the same error fires regardless of model; gpt-5.4 used here because it is the default)
What platform is your computer?
Darwin 24.6.0 x86_64 i386 (macOS)
What terminal emulator and version are you using (if applicable)?
Reproduces from codex exec invoked directly (no terminal multiplexer), and equally reproduces when codex-cli is invoked as a subprocess from another tool. Not terminal-specific.
Codex doctor report
codex doctor requires an interactive stdin in 0.130.0; not available in this non-interactive sandbox. Verbose-run snippet attached in "Additional information" instead.
What issue are you seeing?
codex_models_manager periodically emits this stderr line throughout every run, roughly every 15-45 seconds:
ERROR codex_models_manager::manager: failed to refresh available models: timeout waiting for child process to exit
The run itself completes normally — agents finish tool calls and produce expected output between occurrences — but the stderr noise is constant, the implied background-child latency is non-zero on every occurrence, and on long-running invocations the error count is significant: 12 occurrences in a single 10-minute production heartbeat (sample run a4864072-70a6-48ed-a98b-c8a6098e8499 of an MCP-driven agent host). In that reference run, 100% of stderr emitted by codex was this error (12 of 12 stderr lines).
This is distinct from #22205 ("Azure custom provider: model refresh logs missing field models"), which involves the same codex_models_manager::manager module but a different underlying failure mode (response-decode error vs. child-process-exit timeout). Posting separately so the root causes don't get conflated, but linking #22205 as related.
What steps can reproduce the bug?
Minimal reproducer (fresh CODEX_HOME, default OpenAI provider, default gpt-5.4, no curated plugins required):
SCRATCH=$(mktemp -d)
cp ~/.codex/auth.json "$SCRATCH/"
printf 'model = "gpt-5.4"\n' > "$SCRATCH/config.toml"
CODEX_HOME="$SCRATCH" codex exec --skip-git-repo-check "say DONE" 2>&1 \
| grep -c codex_models_manager
# Observed: 2 occurrences (in ~46-55s wall-clock), even on a single-turn run.
Notes on the reproducer:
- The error only fires when the models cache is cold. The cache lives at
$CODEX_HOME/models_cache.json with a 300s TTL — to reproduce reliably, either use a fresh $CODEX_HOME (as above) or delete models_cache.json between runs.
- Disabling curated plugins (
gmail, google-calendar, github, google-drive, canva — all @openai-curated) does NOT suppress the error. The trigger is internal to the model-refresh path, not the plugin loader.
- Reproduces with the default OpenAI provider; no custom
[model_providers] block needed.
- Production heartbeat density (10-min sessions): 12 occurrences per run, mean gap 55.4s, min gap 13.0s, max gap 140.6s. The variable gap suggests backoff-on-failure logic.
What is the expected behavior?
Either of:
- The model-refresh subprocess should not time out under default conditions. If something is genuinely deadlocking in the child, surface it through the normal error path; do not silently retry the same hung subprocess on every 300s TTL expiry.
- If the refresh attempt is expected to fail in some configurations, downgrade the log level from
ERROR to WARN (or DEBUG) and suppress repeated occurrences once the failure mode is established for a given session — the current pattern emits noise indistinguishable from a real failure to anyone tailing logs.
Either is acceptable. Right now downstream tooling cannot distinguish "agent dead" from "agent fine but Codex is logging spam" without parsing the specific error string.
Additional information
Operational impact for MCP host integrations:
This affects any environment that hosts Codex as a subprocess and uses stderr as a liveness signal. We mitigated downstream (in Paperclip's heartbeat watchdog) by explicitly excluding stderr from the activity signal — but that requires every host integration to know about this Codex-specific noise pattern and ignore it. A cleaner fix in Codex would let host integrations rely on stderr again.
Verbose-log snippet showing the model-cache TTL path (RUST_LOG=info, default config):
INFO codex_models_manager::manager: models cache: evaluating cache eligibility client_version="0.130.0"
INFO codex_models_manager::cache: models cache: attempting load_fresh cache_path=...
INFO codex_models_manager::cache: models cache: cache hit cache_path=... cache_ttl_secs=300
INFO codex_models_manager::manager: models cache: cache entry applied models_count=6 etag=Some("...")
INFO codex_models_manager::manager: models cache: using cached models for OnlineIfUncached
When the cache misses (TTL elapsed, fresh CODEX_HOME, or models_cache.json absent), the load_fresh path spawns a child process that hits the timeout waiting for child process to exit failure — this is the error in the title.
Related issue: #22205 (Azure custom provider, same module, different root cause).
What version of Codex CLI is running?
codex-cli 0.130.0
What subscription do you have?
ChatGPT (auth_mode="Chatgpt",
gpt-5.4access via subscription)Which model were you using?
gpt-5.4 (the same error fires regardless of model; gpt-5.4 used here because it is the default)
What platform is your computer?
Darwin 24.6.0 x86_64 i386 (macOS)
What terminal emulator and version are you using (if applicable)?
Reproduces from
codex execinvoked directly (no terminal multiplexer), and equally reproduces when codex-cli is invoked as a subprocess from another tool. Not terminal-specific.Codex doctor report
codex doctorrequires an interactive stdin in 0.130.0; not available in this non-interactive sandbox. Verbose-run snippet attached in "Additional information" instead.What issue are you seeing?
codex_models_managerperiodically emits this stderr line throughout every run, roughly every 15-45 seconds:The run itself completes normally — agents finish tool calls and produce expected output between occurrences — but the stderr noise is constant, the implied background-child latency is non-zero on every occurrence, and on long-running invocations the error count is significant: 12 occurrences in a single 10-minute production heartbeat (sample run
a4864072-70a6-48ed-a98b-c8a6098e8499of an MCP-driven agent host). In that reference run, 100% of stderr emitted by codex was this error (12 of 12 stderr lines).This is distinct from #22205 ("Azure custom provider: model refresh logs
missing field models"), which involves the samecodex_models_manager::managermodule but a different underlying failure mode (response-decode error vs. child-process-exit timeout). Posting separately so the root causes don't get conflated, but linking #22205 as related.What steps can reproduce the bug?
Minimal reproducer (fresh
CODEX_HOME, default OpenAI provider, defaultgpt-5.4, no curated plugins required):Notes on the reproducer:
$CODEX_HOME/models_cache.jsonwith a 300s TTL — to reproduce reliably, either use a fresh$CODEX_HOME(as above) or deletemodels_cache.jsonbetween runs.gmail,google-calendar,github,google-drive,canva— all@openai-curated) does NOT suppress the error. The trigger is internal to the model-refresh path, not the plugin loader.[model_providers]block needed.What is the expected behavior?
Either of:
ERRORtoWARN(orDEBUG) and suppress repeated occurrences once the failure mode is established for a given session — the current pattern emits noise indistinguishable from a real failure to anyone tailing logs.Either is acceptable. Right now downstream tooling cannot distinguish "agent dead" from "agent fine but Codex is logging spam" without parsing the specific error string.
Additional information
Operational impact for MCP host integrations:
This affects any environment that hosts Codex as a subprocess and uses stderr as a liveness signal. We mitigated downstream (in Paperclip's heartbeat watchdog) by explicitly excluding stderr from the activity signal — but that requires every host integration to know about this Codex-specific noise pattern and ignore it. A cleaner fix in Codex would let host integrations rely on stderr again.
Verbose-log snippet showing the model-cache TTL path (RUST_LOG=info, default config):
When the cache misses (TTL elapsed, fresh CODEX_HOME, or models_cache.json absent), the
load_freshpath spawns a child process that hits thetimeout waiting for child process to exitfailure — this is the error in the title.Related issue: #22205 (Azure custom provider, same module, different root cause).