M5: shim tracer + first real benchmark numbers by madeye · Pull Request #19 · madeye/mcp-cli

madeye · 2026-04-20T07:48:53Z

Summary

Lands the macOS-without-root tracing path and reports the first end-to-end benchmark numbers against `openai/codex@rust-v0.121.0`.

Bench infrastructure

New `shim` tracer mode (auto-picked on macOS without root). PATH-shadow shim wrappers count per-binary fork/exec by appending a line to a counter file before exec'ing the real binary.
`--add-dir "$OUT_DIR"` whitelists the bench output dir in codex's Seatbelt sandbox so the shim writes don't EPERM.
`$ZDOTDIR` override re-prepends the shim dir after the user's zsh profile runs — without this, `zsh -lc` puts `/opt/homebrew/bin` first and shadows our wrappers.
Tag resolution now uses `gh release list --exclude-pre-releases` (codex's release tags look like `rust-v0.121.0`, not bare semver), with a relaxed regex fallback for repos without `gh`.
Per-run `CODEX_HOME` is bootstrapped with a copy of `auth.json` so codex can call the model in isolation from the user's MCP server config.

Reporting

`compare.py` shows every binary that fired in either run (not just the headline list), plus a `mcp_tool_call` breakdown grouped by `server/tool` so it's obvious whether codex actually routed through mcp-cli or stuck with built-in Bash.
Token parser handles codex JSONL keys (`input_tokens` / `cached_input_tokens` / `output_tokens`) in addition to the human-readable fallback.

Headline finding

Full writeup: `bench/codex-forkexec/results/2026-04-20-rust-v0.121.0.md`.

```
metric baseline with mcp-cli delta

execve total 93 88 -5
rg 17 13 -4
git 16 16 +0
sed 56 54 -2
wc 3 4 +1
awk 1 1 +0
wall clock (s) 283.31 232.74 -50.56
input tokens 2,357,650 2,460,309 +102,659
cached input tokens 2,215,040 2,230,528 +15,488
output tokens 9,880 8,797 -1,083

codex MCP tool calls (server/tool)
codex/list_mcp_resource_templates 2 2 +0
codex/list_mcp_resources 2 2 +0
```

The MCP plugin loads fine; codex never picks its tools over its built-in Read / Grep / Bash. The execve delta and wall-clock delta are within single-sample model variance — codex didn't actually call `mcp-cli/fs_read` or `mcp-cli/search_grep` once.

Closing this gap needs either a codex-side preference rule or a rtk-style Bash-rewriting hook — neither lives in mcp-cli today. The bench harness itself is now a clean reproducible measurement, ready to verify any future fix.

Test plan

`bash -n bench/codex-forkexec/run.sh`
Python AST parse on `compare.py` and `parse_trace.py`
Smoke run with a tiny prompt (validated shim plumbing end-to-end)
Real run against `openai/codex@rust-v0.121.0` — table above

🤖 Generated with Claude Code

Adds the macOS-without-root tracing path and reports the first end-to-end benchmark numbers against openai/codex@rust-v0.121.0. Bench infrastructure: * New `shim` tracer mode (auto-picked on macOS without root, on Linux without strace, etc.). PATH-shadow shim wrappers count per-binary fork/exec by appending a line to a counter file before exec'ing the real binary. * `--add-dir "$OUT_DIR"` whitelists the bench output dir in codex's Seatbelt sandbox so the shim writes don't EPERM. * `$ZDOTDIR` override re-prepends the shim dir after the user's zsh profile runs — without this, `zsh -lc` puts `/opt/homebrew/bin` first and shadows our wrappers. * Tag resolution now uses `gh release list --exclude-pre-releases` (codex's release tags look like `rust-v0.121.0`, not bare semver), with a relaxed regex fallback for repos without `gh`. * Per-run `CODEX_HOME` is bootstrapped with a copy of the user's `auth.json` so codex can actually call the model in isolation from the user's MCP server config. Reporting: * `compare.py` now shows every binary that fired in either run (not just the headline DAEMON_REPLACED list), plus a `mcp_tool_call` breakdown grouped by `server/tool` so the report makes it obvious whether codex actually routed through mcp-cli or stuck with built-in Bash. Token parser handles codex's JSONL keys (`input_tokens` / `cached_input_tokens` / `output_tokens`) in addition to the human-readable fallback. Headline finding (full writeup in `bench/codex-forkexec/results/2026-04-20-rust-v0.121.0.md`): * execve total: 93 → 88 (delta -5, single-sample noise) * wall clock: 283.31s → 232.74s (delta -50.56s, also noise) * codex MCP tool calls landing on mcp-cli: 0 * codex MCP tool calls hitting startup discovery: 4 The MCP plugin loads fine; codex never picks its tools over its built-in Read / Grep / Bash. Closing this needs either a codex preference rule or a rtk-style Bash-rewriting hook — neither lives in mcp-cli today. The bench harness itself is now a clean reproducible measurement, ready to verify any future fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

madeye merged commit 4b40839 into main Apr 20, 2026
1 check passed

madeye deleted the m5/shim-tracer-and-real-run branch April 20, 2026 08:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

M5: shim tracer + first real benchmark numbers#19

M5: shim tracer + first real benchmark numbers#19
madeye merged 1 commit intomainfrom
m5/shim-tracer-and-real-run

madeye commented Apr 20, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

madeye commented Apr 20, 2026

Summary

Bench infrastructure

Reporting

Headline finding

``` metric baseline with mcp-cli delta

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

```
metric baseline with mcp-cli delta