M5: shim tracer + first real benchmark numbers #19
Merged
Adds the macOS-without-root tracing path and reports the first end-to-end benchmark numbers against openai/codex@rust-v0.121.0.

Bench infrastructure:

* New `shim` tracer mode (auto-picked on macOS without root, on Linux without strace, etc.). PATH-shadow shim wrappers count per-binary fork/exec by appending a line to a counter file before exec'ing the real binary.
* `--add-dir "$OUT_DIR"` whitelists the bench output dir in codex's Seatbelt sandbox so the shim writes don't fail with EPERM.
* `$ZDOTDIR` override re-prepends the shim dir after the user's zsh profile runs; without this, `zsh -lc` puts `/opt/homebrew/bin` first and shadows our wrappers.
* Tag resolution now uses `gh release list --exclude-pre-releases` (codex's release tags look like `rust-v0.121.0`, not bare semver), with a relaxed regex fallback for repos without `gh`.
* Per-run `CODEX_HOME` is bootstrapped with a copy of the user's `auth.json` so codex can actually call the model in isolation from the user's MCP server config.

Reporting:

* `compare.py` now shows every binary that fired in either run (not just the headline DAEMON_REPLACED list), plus a `mcp_tool_call` breakdown grouped by `server/tool` so the report makes it obvious whether codex actually routed through mcp-cli or stuck with built-in Bash. Token parser handles codex's JSONL keys (`input_tokens` / `cached_input_tokens` / `output_tokens`) in addition to the human-readable fallback.

Headline finding (full writeup in `bench/codex-forkexec/results/2026-04-20-rust-v0.121.0.md`):

* execve total: 93 → 88 (delta -5, single-sample noise)
* wall clock: 283.31s → 232.74s (delta -50.56s, also noise)
* codex MCP tool calls landing on mcp-cli: 0
* codex MCP tool calls hitting startup discovery: 4

The MCP plugin loads fine; codex never picks its tools over its built-in Read / Grep / Bash. Closing this needs either a codex preference rule or an rtk-style Bash-rewriting hook; neither lives in mcp-cli today.

The bench harness itself is now a clean reproducible measurement, ready to verify any future fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Lands the macOS-without-root tracing path and reports the first end-to-end benchmark numbers against `openai/codex@rust-v0.121.0`.
Bench infrastructure
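The `shim` tracer counts per-binary fork/exec by having PATH-shadow wrappers append a line to a counter file before exec'ing the real binary. The following is a minimal Python sketch of that idea, not the harness's actual wrapper: the function names, the `SHIM_COUNTER_FILE` variable, and the default counter path are hypothetical, and the real shim may well be a shell script.

```python
import os
import shutil
import sys
from typing import List, Optional


def record_and_resolve(name: str, shim_dir: str, counter_file: str) -> Optional[str]:
    """Append one line for `name` to the counter file, then locate the real binary.

    The shim directory is dropped from the search path so the lookup cannot
    resolve back to the wrapper itself.
    """
    with open(counter_file, "a", encoding="utf-8") as fh:
        fh.write(name + "\n")

    shim_real = os.path.realpath(shim_dir)
    search_dirs = [
        d
        for d in os.environ.get("PATH", "").split(os.pathsep)
        if d and os.path.realpath(d) != shim_real
    ]
    return shutil.which(name, path=os.pathsep.join(search_dirs))


def run_shim(argv: List[str]) -> None:
    """Entry point a wrapper named after the shadowed binary would call."""
    name = os.path.basename(argv[0])
    shim_dir = os.path.dirname(os.path.abspath(argv[0]))
    # Hypothetical variable and default path, used only for this sketch.
    counter_file = os.environ.get("SHIM_COUNTER_FILE", "/tmp/shim-counts.txt")

    real = record_and_resolve(name, shim_dir, counter_file)
    if real is None:
        sys.exit(127)  # conventional "command not found" exit status
    # Replace this process with the real binary, preserving the arguments.
    os.execv(real, [name] + argv[1:])
```

A wrapper file in the shim directory would end with `run_shim(sys.argv)`. Under these assumptions, counting lines per name in the counter file gives the per-binary totals.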
Reporting
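The token parser reads codex's JSONL keys `input_tokens`, `cached_input_tokens`, and `output_tokens`. A rough sketch of how such keys could be totalled is below. It is not the code in `compare.py`: the function names are hypothetical, the event shape is unknown here (so the sketch searches nested objects), and it assumes per-event values should be summed rather than treated as cumulative.

```python
import json
from typing import Any, Dict, Iterable

TOKEN_KEYS = ("input_tokens", "cached_input_tokens", "output_tokens")


def _collect(node: Any, totals: Dict[str, int]) -> None:
    """Walk a decoded JSON value and add up any integer token fields found."""
    if isinstance(node, dict):
        for key, value in node.items():
            is_count = isinstance(value, int) and not isinstance(value, bool)
            if key in TOKEN_KEYS and is_count:
                totals[key] += value
            else:
                _collect(value, totals)
    elif isinstance(node, list):
        for item in node:
            _collect(item, totals)


def sum_token_usage(lines: Iterable[str]) -> Dict[str, int]:
    """Sum token counts across JSONL lines, skipping blank or non-JSON lines."""
    totals = {key: 0 for key in TOKEN_KEYS}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # left for a human-readable fallback parser
        _collect(event, totals)
    return totals
```

Passing an open log file, for example `sum_token_usage(open(path))`, returns a dict with one total per key.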
Headline finding
Full writeup: `bench/codex-forkexec/results/2026-04-20-rust-v0.121.0.md`.
```
metric                                baseline  with mcp-cli     delta
execve total                                93            88        -5
  rg                                        17            13        -4
  git                                       16            16        +0
  sed                                       56            54        -2
  wc                                         3             4        +1
  awk                                        1             1        +0
wall clock (s)                          283.31        232.74    -50.56
input tokens                         2,357,650     2,460,309  +102,659
cached input tokens                  2,215,040     2,230,528   +15,488
output tokens                            9,880         8,797    -1,083
codex MCP tool calls (server/tool)
  codex/list_mcp_resource_templates          2             2        +0
  codex/list_mcp_resources                   2             2        +0
```
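As a cross-check on the table, the per-binary rows add up to the `execve total` row in both runs. The sketch below recomputes the deltas over the union of binaries seen in either run, using the counts reported above. The helper name and dict layout are illustrative assumptions, not the structure used by `compare.py`.

```python
from typing import Dict, List, Tuple


def union_deltas(
    baseline: Dict[str, int], candidate: Dict[str, int]
) -> List[Tuple[str, int, int, int]]:
    """Return (name, baseline, candidate, delta) for every binary in either run."""
    names = sorted(set(baseline) | set(candidate))
    return [
        (n, baseline.get(n, 0), candidate.get(n, 0),
         candidate.get(n, 0) - baseline.get(n, 0))
        for n in names
    ]


# Per-binary execve counts copied from the table above.
baseline = {"rg": 17, "git": 16, "sed": 56, "wc": 3, "awk": 1}
with_mcp_cli = {"rg": 13, "git": 16, "sed": 54, "wc": 4, "awk": 1}

rows = union_deltas(baseline, with_mcp_cli)
for name, before, after, delta in rows:
    print(f"{name:<6}{before:>4}{after:>4}{delta:>+4}")

total_before = sum(baseline.values())
total_after = sum(with_mcp_cli.values())
print(f"{'total':<6}{total_before:>4}{total_after:>4}{total_after - total_before:>+4}")
```

A binary that appears in only one run is counted as 0 in the other, so it still gets a row.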
The MCP plugin loads fine; codex never picks its tools over its built-in Read / Grep / Bash. The execve delta and wall-clock delta are within single-sample model variance: codex did not call `mcp-cli/fs_read` or `mcp-cli/search_grep` even once, and the only MCP tool calls recorded are the four startup-discovery calls listed in the table.
Closing this gap needs either a codex-side preference rule or an rtk-style Bash-rewriting hook; neither lives in mcp-cli today. The bench harness itself is now a clean reproducible measurement, ready to verify any future fix.
Test plan
🤖 Generated with Claude Code