Skip to content

M5: shim tracer + first real benchmark numbers#19

Merged
madeye merged 1 commit intomainfrom
m5/shim-tracer-and-real-run
Apr 20, 2026
Merged

M5: shim tracer + first real benchmark numbers#19
madeye merged 1 commit intomainfrom
m5/shim-tracer-and-real-run

Conversation

@madeye
Copy link
Copy Markdown
Owner

@madeye madeye commented Apr 20, 2026

Summary

Lands the macOS-without-root tracing path and reports the first end-to-end benchmark numbers against `openai/codex@rust-v0.121.0`.

Bench infrastructure

  • New `shim` tracer mode (auto-picked on macOS without root). PATH-shadow shim wrappers count per-binary fork/exec by appending a line to a counter file before exec'ing the real binary.
  • `--add-dir "$OUT_DIR"` whitelists the bench output dir in codex's Seatbelt sandbox so the shim writes don't EPERM.
  • `$ZDOTDIR` override re-prepends the shim dir after the user's zsh profile runs — without this, `zsh -lc` puts `/opt/homebrew/bin` first and shadows our wrappers.
  • Tag resolution now uses `gh release list --exclude-pre-releases` (codex's release tags look like `rust-v0.121.0`, not bare semver), with a relaxed regex fallback for repos without `gh`.
  • Per-run `CODEX_HOME` is bootstrapped with a copy of `auth.json` so codex can call the model in isolation from the user's MCP server config.

Reporting

  • `compare.py` shows every binary that fired in either run (not just the headline list), plus a `mcp_tool_call` breakdown grouped by `server/tool` so it's obvious whether codex actually routed through mcp-cli or stuck with built-in Bash.
  • Token parser handles codex JSONL keys (`input_tokens` / `cached_input_tokens` / `output_tokens`) in addition to the human-readable fallback.

Headline finding

Full writeup: `bench/codex-forkexec/results/2026-04-20-rust-v0.121.0.md`.

```
metric baseline with mcp-cli delta

execve total 93 88 -5
rg 17 13 -4
git 16 16 +0
sed 56 54 -2
wc 3 4 +1
awk 1 1 +0
wall clock (s) 283.31 232.74 -50.56
input tokens 2,357,650 2,460,309 +102,659
cached input tokens 2,215,040 2,230,528 +15,488
output tokens 9,880 8,797 -1,083

codex MCP tool calls (server/tool)
codex/list_mcp_resource_templates 2 2 +0
codex/list_mcp_resources 2 2 +0
```

The MCP plugin loads fine; codex never picks its tools over its built-in Read / Grep / Bash. The execve delta and wall-clock delta are within single-sample model variance — codex didn't actually call `mcp-cli/fs_read` or `mcp-cli/search_grep` once.

Closing this gap needs either a codex-side preference rule or a rtk-style Bash-rewriting hook — neither lives in mcp-cli today. The bench harness itself is now a clean reproducible measurement, ready to verify any future fix.

Test plan

  • `bash -n bench/codex-forkexec/run.sh`
  • Python AST parse on `compare.py` and `parse_trace.py`
  • Smoke run with a tiny prompt (validated shim plumbing end-to-end)
  • Real run against `openai/codex@rust-v0.121.0` — table above

🤖 Generated with Claude Code

Adds the macOS-without-root tracing path and reports the first
end-to-end benchmark numbers against openai/codex@rust-v0.121.0.

Bench infrastructure:

* New `shim` tracer mode (auto-picked on macOS without root, on
  Linux without strace, etc.). PATH-shadow shim wrappers count
  per-binary fork/exec by appending a line to a counter file
  before exec'ing the real binary.

* `--add-dir "$OUT_DIR"` whitelists the bench output dir in
  codex's Seatbelt sandbox so the shim writes don't EPERM.

* `$ZDOTDIR` override re-prepends the shim dir after the user's
  zsh profile runs — without this, `zsh -lc` puts
  `/opt/homebrew/bin` first and shadows our wrappers.

* Tag resolution now uses `gh release list --exclude-pre-releases`
  (codex's release tags look like `rust-v0.121.0`, not bare
  semver), with a relaxed regex fallback for repos without `gh`.

* Per-run `CODEX_HOME` is bootstrapped with a copy of the user's
  `auth.json` so codex can actually call the model in isolation
  from the user's MCP server config.

Reporting:

* `compare.py` now shows every binary that fired in either run
  (not just the headline DAEMON_REPLACED list), plus a
  `mcp_tool_call` breakdown grouped by `server/tool` so the
  report makes it obvious whether codex actually routed through
  mcp-cli or stuck with built-in Bash. Token parser handles
  codex's JSONL keys (`input_tokens` / `cached_input_tokens` /
  `output_tokens`) in addition to the human-readable fallback.

Headline finding (full writeup in
`bench/codex-forkexec/results/2026-04-20-rust-v0.121.0.md`):

* execve total: 93 → 88 (delta -5, single-sample noise)
* wall clock: 283.31s → 232.74s (delta -50.56s, also noise)
* codex MCP tool calls landing on mcp-cli: 0
* codex MCP tool calls hitting startup discovery: 4

The MCP plugin loads fine; codex never picks its tools over its
built-in Read / Grep / Bash. Closing this needs either a codex
preference rule or a rtk-style Bash-rewriting hook — neither
lives in mcp-cli today. The bench harness itself is now a clean
reproducible measurement, ready to verify any future fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@madeye madeye merged commit 4b40839 into main Apr 20, 2026
1 check passed
@madeye madeye deleted the m5/shim-tracer-and-real-run branch April 20, 2026 08:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant