
feat(cli): add codex doctor diagnostics #22336

Merged
fcoury-oai merged 28 commits into main from fcoury/doctor
May 13, 2026

Conversation

Contributor

@fcoury-oai fcoury-oai commented May 12, 2026

Why

Users and support need a single command that captures the local Codex runtime, configuration, auth, terminal, network, and state shape, without asking the user to know which diagnostic depth to choose first. codex doctor now runs the full set of useful checks by default and defaults to the detailed human-readable output, since the command is usually run when someone already needs context.

The command also targets concrete support failure modes we have seen while iterating on the design:

  • update-target mismatches like “fix(tui): avoid update loops for mismatched npm installs” (#21956), where the installed package manager target can differ from the running executable
  • terminal and multiplexer issues that depend on TERM, tmux/zellij state, color handling, and TTY metadata
  • provider-specific HTTP/WebSocket connectivity, including ChatGPT WebSocket handshakes and API-key/provider endpoint reachability
  • local state/log SQLite integrity problems and large rollout directories
  • feedback reports that need an attached, redacted diagnostic snapshot without asking the user to run a second command

What Changed

  • Adds codex doctor as a grouped CLI diagnostic report with default detailed output and --summary for the compact view.
  • Adds stable report sections for Environment, Configuration, Updates, Connectivity, and Background Server, plus a top Notes block that promotes anomalies such as available updates, large rollout directories, optional MCP issues, and mixed auth signals.
  • Adds runtime provenance, install consistency, bundled/system search readiness, terminal/multiplexer metadata, config.toml parse status, auth mode details, sandbox details, feature flag summaries, update cache/latest-version state, app-server daemon state, SQLite integrity checks, rollout statistics, and provider-aware network diagnostics.
  • Adds ChatGPT WebSocket diagnostics that report the negotiated HTTP upgrade as HTTP 101 Switching Protocols and include timeout, DNS, auth, and provider context in detailed output.
  • Makes reachability provider-aware: API-key OpenAI setups check the API endpoint, ChatGPT auth checks the ChatGPT path, and custom/AWS/local providers check configured HTTP endpoints when available.
  • Adds structured, redacted JSON output where checks is keyed by check id and details is a key/value object for support tooling.
  • Integrates doctor with feedback uploads by attaching a best-effort codex-doctor-report.json report and adding derived Sentry tags for overall status and failing/warning checks.
  • Updates the TUI feedback consent copy so users can see that the doctor report is included when logs/diagnostics are uploaded.
  • Updates the CLI bug issue template to ask reporters for codex doctor --json and render pasted reports as JSON.
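
The bullets above imply a small per-check status model (ok / idle / note / warn / fail) with the report's overall status being the worst status across checks. A minimal sketch of that roll-up in Python — the severity ordering and names are illustrative assumptions, not the PR's actual Rust types:

```python
# Hypothetical sketch of the status roll-up a doctor-style report implies:
# the overall status is the worst status among individual checks.
# The severity ordering here is an assumption, not taken from the PR.
SEVERITY = {"ok": 0, "idle": 1, "note": 2, "warn": 3, "fail": 4}

def overall_status(check_statuses):
    """Return the worst status among the checks, or 'ok' for an empty list."""
    if not check_statuses:
        return "ok"
    return max(check_statuses, key=lambda s: SEVERITY[s])

print(overall_status(["ok", "warn", "ok", "fail", "note"]))  # fail
```

This is why a single unreachable required endpoint is enough to turn the whole report's overall_status to "fail" in the examples below.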

Example Output

The examples below are sanitized from local smoke runs with --no-color so the structure is reviewable in plain text.

codex doctor

Codex Doctor v0.0.0 · macos-aarch64

Notes
   ↑ updates      0.130.0 available (current 0.0.0, dismissed 0.128.0)
   ⚠ rollouts     1,526 active files · 2.53 GB on disk
   ⚠ mcp          MCP configuration has optional issues
   ⚠ auth         mixed auth signals: ChatGPT login plus API key env var; HTTP reachability uses API-key mode
─────────────────────────────────────────────────────────────

Environment
  ✓ runtime      local debug build
      version                  0.0.0
      install method           other
      commit                   unknown
      executable               ~/code/codex.fcoury-doct…x-rs/target/debug/codex
  ✓ install      consistent
      context                  other
      managed by               npm: no · bun: no · package root —
      PATH entries (2)         ~/.local/share/mise/installs/node/24/bin/codex
                               ~/.local/share/mise/shims/codex
  ✓ search       ripgrep 15.1.0 (system, `rg`)
  ✓ terminal     Ghostty 1.3.2-main-+b0f827665 · tmux 3.6a · TERM=xterm-256color
      terminal                 Ghostty
      TERM_PROGRAM             ghostty
      terminal version         1.3.2-main-+b0f827665
      TERM                     xterm-256color
      multiplexer              tmux 3.6a
      tmux extended-keys       on
      tmux allow-passthrough   on
      tmux set-clipboard       on
  ✓ state        databases healthy
      CODEX_HOME               ~/.codex (dir)
      state DB                 ~/.codex/state_5.sqlite (file) · integrity ok
      log DB                   ~/.codex/logs_2.sqlite (file) · integrity ok
      active rollouts          1,526 files · 2.53 GB (avg 1.70 MB)
      archived rollouts        8 files · 3.84 MB (avg 491.11 KB)

Configuration
  ✓ config       loaded
      model                    gpt-5.5 · openai
      cwd                      ~/code/codex.fcoury-doctor/codex-rs
      config.toml              ~/.codex/config.toml
      config.toml parse        ok
      MCP servers              1
      feature flags            36 enabled · 7 overridden (full list with --all)
      overrides                code_mode, code_mode_only, memories, chronicle, goals, remote_control, prevent_idle_sleep
  ✓ auth         auth is configured
      auth storage mode        File
      auth file                ~/.codex/auth.json
      auth env vars present    OPENAI_API_KEY
      stored auth mode         chatgpt
      stored API key           false
      stored ChatGPT tokens    true
      stored agent identity    false
  ⚠ mcp          MCP configuration has optional issues — Set the missing MCP env vars or disable the affected server.
      configured servers       1
      disabled servers         0
      streamable_http servers  1
      optional reachability    openaiDeveloperDocs: https://developers.openai.com/mcp (HEAD connect failed; GET connect failed)
  ✓ sandbox      restricted fs + restricted network · approval OnRequest
      approval policy          OnRequest
      filesystem sandbox       restricted
      network sandbox          restricted

Connectivity
  ✓ network      network-related environment looks readable
  ✓ websocket    connected (HTTP 101 Switching Protocols) · 15s timeout
      model provider           openai
      provider name            OpenAI
      wire API                 responses
      supports websockets      true
      connect timeout          15000 ms
      auth mode                chatgpt
      endpoint                 wss://chatgpt.com/backend-api/<redacted>
      DNS                      2 IPv4, 2 IPv6, first IPv6
      handshake result         HTTP 101 Switching Protocols
  ✗ reachability one or more required provider endpoints are unreachable over HTTP — Check proxy, VPN, firewall, DNS, and custom CA configuration.
      reachability mode        API key auth
      openai API               https://api.openai.com/v1 connect failed (required)

Background Server
  ○ app-server   not running (ephemeral mode)

─────────────────────────────────────────────────────────────
11 ok · 1 idle · 4 notes · 1 warn · 1 fail · failed

--summary compact output           --all expand truncated lists
--json redacted report

codex doctor --summary

Codex Doctor v0.0.0 · macos-aarch64

Notes
   ↑ updates      0.130.0 available (current 0.0.0, dismissed 0.128.0)
   ⚠ rollouts     1,526 active files · 2.53 GB on disk
   ⚠ mcp          MCP configuration has optional issues
   ⚠ auth         mixed auth signals: ChatGPT login plus API key env var; HTTP reachability uses API-key mode
─────────────────────────────────────────────────────────────

Environment
  ✓ runtime      local debug build
  ✓ install      consistent
  ✓ search       ripgrep 15.1.0 (system, `rg`)
  ✓ terminal     Ghostty 1.3.2-main-+b0f827665 · tmux 3.6a · TERM=xterm-256color
  ✓ state        databases healthy

Configuration
  ✓ config       loaded
  ✓ auth         auth is configured
  ⚠ mcp          MCP configuration has optional issues — Set the missing MCP env vars or disable the affected server.
  ✓ sandbox      restricted fs + restricted network · approval OnRequest

Updates
  ✓ updates      update configuration is locally consistent

Connectivity
  ✓ network      network-related environment looks readable
  ✓ websocket    connected (HTTP 101 Switching Protocols) · 15s timeout
  ✗ reachability one or more required provider endpoints are unreachable over HTTP — Check proxy, VPN, firewall, DNS, and custom CA configuration.

Background Server
  ○ app-server   not running (ephemeral mode)

─────────────────────────────────────────────────────────────
11 ok · 1 idle · 4 notes · 1 warn · 1 fail · failed

Run codex doctor without --summary for detailed diagnostics.
--all expand truncated lists       --json redacted report

codex doctor --json shape

{
  "schema_version": 1,
  "overall_status": "fail",
  "checks": {
    "runtime.provenance": {
      "id": "runtime.provenance",
      "category": "Environment",
      "status": "ok",
      "summary": "local debug build",
      "details": {
        "version": "0.0.0",
        "install method": "other",
        "commit": "unknown"
      }
    },
    "sandbox.helpers": {
      "id": "sandbox.helpers",
      "category": "Configuration",
      "status": "ok",
      "summary": "restricted fs + restricted network · approval OnRequest",
      "details": {
        "approval policy": "OnRequest",
        "filesystem sandbox": "restricted",
        "network sandbox": "restricted"
      }
    }
  }
}
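
Because checks is keyed by check id and details is a flat key/value object, support tooling can consume the report without positional assumptions. A hedged sketch of how such tooling might derive the Sentry-style tags described above (the function and tag names are hypothetical, not from the PR):

```python
import json

def doctor_tags(report_json):
    """Derive summary tags from a doctor --json report: overall status,
    per-status counts, and the ids of failing or warning checks."""
    report = json.loads(report_json)
    counts = {}
    flagged = []
    for check_id, check in report.get("checks", {}).items():
        status = check.get("status", "unknown")
        counts[status] = counts.get(status, 0) + 1
        if status in ("fail", "warn"):
            flagged.append(check_id)
    return {
        "doctor.overall_status": report.get("overall_status"),
        "doctor.status_counts": counts,
        "doctor.flagged_checks": sorted(flagged),
    }

sample = """{"schema_version": 1, "overall_status": "fail", "checks": {
  "runtime.provenance": {"status": "ok"},
  "net.reachability": {"status": "fail"}}}"""
print(doctor_tags(sample))
```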

/feedback new sentry attachment

(screenshot: CleanShot 2026-05-13 at 15 36 14)

New section in CLI issue template

(screenshot: CleanShot 2026-05-13 at 15 47 24)

How to Test

  1. Run cargo run --bin codex -- doctor --no-color.
  2. Confirm the detailed report is the default and includes promoted Notes, grouped sections, terminal details, state DB integrity, rollout stats, provider reachability, WebSocket diagnostics, and app-server status.
  3. Run cargo run --bin codex -- doctor --summary --no-color.
  4. Confirm the compact view keeps the same sections and summary counts but omits detailed key/value rows.
  5. Run cargo run --bin codex -- doctor --json.
  6. Confirm the output is redacted JSON, checks is an object keyed by check id, and each check's details is a key/value object.
  7. Preview the CLI bug issue template and confirm the Codex doctor report field appears after the terminal field, asks for codex doctor --json, and renders pasted output as JSON.
  8. Start a feedback flow that includes logs.
  9. Confirm the upload consent copy lists codex-doctor-report.json alongside the log attachments.

Targeted tests:

  • cargo test -p codex-cli doctor
  • cargo test -p codex-app-server doctor_report_tags_summarize_status_counts
  • cargo test -p codex-feedback
  • cargo test -p codex-tui feedback_view
  • just argument-comment-lint
  • git diff --check

@fcoury-oai fcoury-oai changed the title Add richer codex doctor diagnostics Run full codex doctor diagnostics by default May 12, 2026
@fcoury-oai fcoury-oai changed the title Run full codex doctor diagnostics by default feat(doctor): add codex doctor diagnostics May 12, 2026
@fcoury-oai fcoury-oai changed the title feat(doctor): add codex doctor diagnostics feat(cli): add codex doctor diagnostics May 12, 2026
Collaborator

@etraut-openai etraut-openai left a comment


This is a great feature! I've been thinking about doing something like this for a while but never got around to it. Thanks for doing this!

Here are some thoughts & questions:

  • What does "running local" mean? Does that mean it's not installed using npm, bun, or homebrew? Or does it mean that it's not connected to a remote app server?
  • I noticed that you're checking the connectivity of the "OpenAI endpoints" via HTTP. Do you check for websocket connectivity? That's a common complaint from customers.
  • Do you have any checks for Azure connectivity? We receive a lot of bugs about Azure endpoint connectivity.
  • I wonder if there's an opportunity to integrate this with the existing /feedback mechanism. For example, would it make sense to run this diagnostic report and upload it via sentry? Could be done as a follow-on feature.
  • Another common error that we're seeing lately has to do with the integrity of the state db and the log db (both SQLite). Is there an integrity check that we could run on these and include it in the report?
  • Another thing that might be useful to include in the report is some stats about local rollout files - both non-archived and archived: counts, total disk space consumed, average rollout size.
  • It might be useful in the configuration section to output which feature flags are enabled in the config. Users sometimes get into trouble by enabling features that are not yet ready for use.
  • I find the verbose output much more useful than the non-verbose output. I'm wondering if we should always output verbose for this feature. If you're running this, you probably want as much information as possible. What do you think?
  • It runs locally, so it won't help diagnose problems when connecting remotely. For example, if a remote app-server is having problems connecting to the responses endpoint, this won't help. I think that's OK. Just pointing it out.

@fcoury-oai
Contributor Author

@etraut-openai thanks for the detailed pass. Here’s what changed in response to each item:

What does "running local" mean? Does that mean it's not installed using npm, bun, or homebrew? Or does it mean that it's not connected to a remote app server?

Updated the wording to avoid the ambiguous “local” label. Doctor now reports runtime provenance more explicitly, e.g. local debug build, and keeps install/package-manager context in the install section.

I noticed that you're checking the connectivity of the "OpenAI endpoints" via HTTP. Do you check for websocket connectivity? That's a common complaint from customers.

Added a WebSocket diagnostic. It checks the active provider/auth path when WebSockets are supported and reports the DNS shape, timeout, endpoint, auth mode, and the handshake result, shown as connected (HTTP 101 Switching Protocols) when the upgrade succeeds.
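
The HTTP 101 result comes from the Upgrade handshake that opens every WebSocket: per RFC 6455, the server must answer the upgrade request with status 101 or the connection never becomes a WebSocket. A minimal, hypothetical sketch of classifying the server's status line (the buckets are illustrative, not the PR's exact logic):

```python
def classify_handshake(status_line):
    """Classify the first line of a WebSocket upgrade response.
    Per RFC 6455, only a 101 status completes the upgrade."""
    parts = status_line.strip().split(" ", 2)
    if len(parts) < 2 or not parts[0].startswith("HTTP/"):
        return "malformed response"
    code = parts[1]
    if code == "101":
        return "connected (HTTP 101 Switching Protocols)"
    if code in ("401", "403"):
        return f"auth rejected (HTTP {code})"
    return f"upgrade refused (HTTP {code})"

print(classify_handshake("HTTP/1.1 101 Switching Protocols"))
```

Distinguishing auth rejections from other refusals is what lets the diagnostic include auth context alongside the handshake result.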

Do you have any checks for Azure connectivity? We receive a lot of bugs about Azure endpoint connectivity.

Added provider-aware reachability. Doctor no longer hard-codes OpenAI/ChatGPT as the only meaningful endpoints. It now uses the active provider configuration and probes the configured provider endpoint when present, including custom/Azure-style endpoints. Deeper Azure-specific validation for deployment names, API versions, and Azure auth conventions is still a follow-up.
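
The selection logic described above can be sketched roughly as follows — the mapping and endpoint values are reconstructed from this description for illustration, not taken from the code:

```python
def reachability_targets(auth_mode, provider):
    """Pick which endpoints a reachability probe should hit, based on the
    active auth mode and provider config. Endpoint values are illustrative."""
    if provider.get("base_url"):
        # Custom/Azure-style/local providers: probe the configured endpoint.
        return [provider["base_url"]]
    if auth_mode == "chatgpt":
        return ["https://chatgpt.com"]
    if auth_mode == "api_key":
        return ["https://api.openai.com/v1"]
    return []

print(reachability_targets("api_key", {}))
```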

I wonder if there's an opportunity to integrate this with the existing /feedback mechanism. For example, would it make sense to run this diagnostic report and upload it via sentry? Could be done as a follow-on feature.

Implemented this integration. When users consent to upload logs through /feedback, the app-server runs codex doctor --json best-effort, attaches it as codex-doctor-report.json, and adds doctor-derived Sentry tags such as overall status, warning/fail counts, and failing check ids. The upload consent UI also lists the doctor report explicitly.

Another common error that we're seeing lately has to do with the integrity of the state db and the log db (both SQLite). Is there an integrity check that we could run on these and include it in the report?

Added SQLite integrity checks for both state and log databases using PRAGMA integrity_check. Existing DBs that are corrupt or unreadable now fail doctor.
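
PRAGMA integrity_check returns the single row ok for a healthy database and a list of problem descriptions otherwise. The doctor itself is Rust; this Python sketch just demonstrates the pragma's behavior via the stdlib bindings:

```python
import sqlite3

def sqlite_integrity(path):
    """Run PRAGMA integrity_check and report whether the database is healthy.
    Returns (ok, messages): ok is True when the pragma reports 'ok'."""
    try:
        conn = sqlite3.connect(path)
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        conn.close()
    except sqlite3.DatabaseError as exc:
        return False, [str(exc)]
    messages = [row[0] for row in rows]
    return messages == ["ok"], messages

# An in-memory database is trivially healthy.
print(sqlite_integrity(":memory:"))  # (True, ['ok'])
```

Catching DatabaseError matters because a file that is not SQLite at all raises before the pragma ever runs.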

Another thing that might be useful to include in the report is some stats about local rollout files - both non-archived and archived: counts, total disk space consumed, average rollout size.

Added rollout stats for active and archived rollouts, including file count, total disk usage, and average size. Large active rollout usage is promoted into the Notes block.
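
The reported stats (count, total disk usage, average size) fall out of a single pass over a rollout directory. A hedged sketch, with the directory layout assumed rather than taken from the PR:

```python
import os
import tempfile

def rollout_stats(dir_path):
    """Count files under dir_path and report total and average size in bytes."""
    sizes = []
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            sizes.append(os.path.getsize(os.path.join(root, name)))
    count = len(sizes)
    total = sum(sizes)
    avg = total / count if count else 0.0
    return {"files": count, "total_bytes": total, "avg_bytes": avg}

# A fresh empty dir: {'files': 0, 'total_bytes': 0, 'avg_bytes': 0.0}
print(rollout_stats(tempfile.mkdtemp()))
```

The same totals can then be compared against a threshold to decide whether to promote a "large rollout directory" warning into the Notes block.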

It might be useful in the configuration section to output which feature flags are enabled in the config. Users sometimes get into trouble by enabling features that are not yet ready for use.

Added feature flag details to the config section: enabled count, overridden count, explicit overrides, and legacy alias mappings. The default human output keeps long lists truncated, with --all to expand.

I find the verbose output much more useful than the non-verbose output. I'm wondering if we should always output verbose for this feature. If you're running this, you probably want as much information as possible. What do you think?

Changed doctor to be detailed by default and added --summary for compact output. The default output now uses hierarchy, Notes, grouped sections, and two-column details so the extra information is still scannable.

It runs locally, so it won't help diagnose problems when connecting remotely. For example, if a remote app-server is having problems connecting to the responses endpoint, this won't help. I think that's OK. Just pointing it out.

Kept this PR scoped to local diagnostics. The report now makes local runtime/app-server status clearer, but remote app-server diagnostics remain a follow-up rather than being mixed into this local command.

Additional improvements added while addressing the feedback:

  • Added progress output while doctor runs, with JSON/summary modes staying quiet.
  • Added redacted structured JSON output with checks keyed by stable check id and details represented as structured fields.
  • Added safer redaction for URLs, including credentials in userinfo, query strings, fragments, and secret-looking path segments.
  • Added MCP diagnostics for disabled servers, optional vs required reachability, stdio command resolution, executable permissions, and invalid remote-sourced env vars in local stdio configs.
  • Added provider/auth correctness fixes so API-key users do not require ChatGPT reachability, provider-specific auth is respected, and malformed stored auth is detected.
  • Added clearer warning/failure output: non-ok checks now carry a cause, measured/expected values, offending detail fields, and a concrete remedy.
  • Added terminal diagnostics for TERM, locale, terminal size, TERMINFO, tmux, zellij, and related terminal environment signals.
  • Added CLI issue template guidance asking users to paste codex doctor --json output.
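
The URL redaction described above can be sketched with stdlib URL parsing. The secret-spotting heuristics here are assumptions for illustration, not the PR's actual rules:

```python
from urllib.parse import urlsplit, urlunsplit

SECRET_HINTS = ("key", "token", "secret", "auth", "session")

def redact_url(url):
    """Strip credentials, query strings, fragments, and secret-looking
    path segments from a URL before it lands in a diagnostic report."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    if parts.port:
        host = f"{host}:{parts.port}"  # drop userinfo, keep host:port
    segments = []
    for seg in parts.path.split("/"):
        if any(h in seg.lower() for h in SECRET_HINTS) or len(seg) > 32:
            segments.append("<redacted>")
        else:
            segments.append(seg)
    query = "<redacted>" if parts.query else ""
    fragment = "<redacted>" if parts.fragment else ""
    return urlunsplit((parts.scheme, host, "/".join(segments), query, fragment))

print(redact_url("https://user:pass@api.example.com/v1/tokens/abc?key=123"))
# https://api.example.com/v1/<redacted>/abc?<redacted>
```

Redacting the whole query and fragment, rather than trying to whitelist parameters, is the conservative choice for a report that gets attached to uploads.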

@etraut-openai
Collaborator

The upload consent UI also lists the doctor report explicitly

I think the consent dialog is a per-client UI. Let's make sure that this change to the /feedback flow doesn't break the IDE extension or app.

@fcoury-oai fcoury-oai enabled auto-merge (squash) May 13, 2026 21:21
@fcoury-oai fcoury-oai merged commit 9798eb3 into main May 13, 2026
31 checks passed
@fcoury-oai fcoury-oai deleted the fcoury/doctor branch May 13, 2026 21:23
@github-actions github-actions Bot locked and limited conversation to collaborators May 13, 2026