feat(comparison): codegraph head-to-head + B1 symbol-slot governance + B2 eval harness by omerakben · Pull Request #21 · omerakben/code-oz

omerakben · 2026-05-10T23:50:54Z

Summary

Head-to-head comparison of code-oz vs @colbymchenry/codegraph v0.7.2 (tree-sitter + SQLite code-intelligence indexer for Claude Code) plus implementation of the two action-bearing borrows identified by Codex.

The single most consequential outcome was Codex's pre-implementation Q8 catch: code-oz's reserved-but-unsupported 'symbol' slot in RepoContextToolName was already contract debt today, not harmless optionality. The type union allowed it, permissions.ts validated it, error types referenced it, but runner.ts rejected it with a generic "unsupported tool" error. Defense-in-depth wiring across config-load (src/agents/schema.ts) and runtime (src/tools/repo-context/permissions.ts) closes that debt while preserving the schema slot for a telemetry-gated reopen.

Comparison verdict

YES, code-oz is ahead on category — codegraph is a code-intelligence indexer for chat-tool agents, not an SDLC orchestrator. 90 percent of codegraph's surface is structurally orthogonal. The 10 percent overlap is exactly the tool_use.repo_context permission scope, which the comparison contracts and the implementation closes.

What's in this PR

B1 — Symbol slot contract cleanup (Option D-reserved)

RESERVED_REPO_CONTEXT_TOOLS = ['symbol'] constant in src/agents/schema.ts
validateRepoContext rejects 'symbol' in tools[] at config-load with schema_invalid_permissions and a rule anchored to REPO_CONTEXT.md § Reservation
intersectPermissions rejects any direct tool === 'symbol' request at runtime with tool_unavailable (defense-in-depth for type-bypassed callers like JSON-decoded payloads)
RepoContextToolName JSDoc + TOOL_DESCRIPTIONS.symbol persona vocabulary updated to RESERVED status with doc anchor
docs/contracts/REPO_CONTEXT.md § "Reservation and reopen-the-slot signal" codifies the 4-condition AND telemetry signal that would reopen the slot: high search churn + manifest-cap saturation + phase result-tokens > 200k + downstream VERIFY/REVIEW failure attributable to missed semantic context, on three runs across two repos
Type-union member preserved so the schema slot is callable for backward-compat when telemetry triggers a reopen

B2 — Three-case deterministic eval harness

tests/evaluation/repo-context/harness.ts drives runRepoContextTool (production code path) and computes recall@k + bytes/tokens + tool-call counts + truncation flag from the audit log
Three cases:
- case-01 discovery — recall@4 = 1.0 with one tool call (regression detector for grep correctness)
- case-02 usage — recall@6 = 1.0 with no maxResults saturation (regression detector for caller-graph queries at default caps)
- case-03 budget pressure — 40 matching files × 3 lines/file = 120 candidate matches against maxResults=25; asserts anyTruncated === true, per-call byte envelope, and precision under truncation (every returned path starts with src/match/, no decoy leakage)
scripts/eval-repo-context.ts standalone runner with --json / --strict flags
package.json script: bun run eval:repo_context

B5 — Framework-aware route detection reclassified

docs/design/ROADMAP.md W3 section: new "Deferred-with-trigger items" subsection covering both the symbol-slot reopen condition and B5 (reopens if a routing/API-surface audit persona enters the company roster)
Old "Optional symbol LSP integration (deferred from M6)" line replaced; M6 acceptance bullet updated

Documentation

docs/comparison/06-codegraph/COMPARISON.md — full head-to-head, borrow ranking, decision section
docs/comparison/06-codegraph/CODEX_BRIEFING.md — pre-implementation dispatch (historical)
docs/comparison/06-codegraph/CODEX_RESPONSE.md — R0 transcript + Postscript with canonical review-round table
docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md — planning record + Outcome section

Verification

bun test: 3116 pass / 0 fail / 2 skip (3108 baseline + 8 new). Three consecutive runs deterministic
bun run typecheck: clean
bun run eval:repo_context --strict: PASS on all three cases

Cross-model peer review

Multi-round Codex review process (canonical table in docs/comparison/06-codegraph/CODEX_RESPONSE.md § Postscript):

R0 019e12ed accept-with-modifications — pre-impl debate; flagged contract debt in Q8
R1 019e1326 fix-first — 3 block-push (REPO_CONTEXT.md drift, recall@k metric bug, case-03 not exercising budget pressure)
R2 019e1330 fix-first — 4 doc/parity drift
R3 019e141b fix-first — 3 doc-drift in companion files
R4 019e1421 fix-first — 3 doc-drift sweep (COMPARISON.md + ROADMAP M6)
R5 019e142d fix-first — CODEX_BRIEFING.md missing historical banner
R6 019e1436 fix-first — metadata-recursion: stabilize round-count language
R7 019e143d fix-first — 3 residual hardcoded round-count sites
R8 019e1442 fix-first — 6 residual round-count sites in planning-body
R9 019e1447 push — convergence

Test plan

bun test shows 3116+ pass / 0 fail
bun run typecheck clean
bun run eval:repo_context --strict PASS on all three cases
Reviewer-side spot check: try declaring tools: ['glob', 'grep', 'read', 'symbol'] in an agent fixture — config-load should reject with the rule string pointing at REPO_CONTEXT.md § Reservation

Comparison vs ~/Projects/agents/templates/codegraph (@colbymchenry/codegraph v0.7.2; tree-sitter+SQLite code-intelligence indexer). Verdict: code-oz is ahead on category. Two action-bearing borrows + one reclassified deferred-with-trigger; Codex (gpt-5.5 xhigh, thread 019e12ed) returned accept-with-modifications. Codex's load-bearing catch (Q8): the reserved-but-unsupported 'symbol' slot is already contract debt today — RepoContextToolName allows it, runner.ts errors with a generic "unsupported tool". Synthesis converted the original "defer indefinitely" into Option D-reserved with a 4-condition AND telemetry signal locked for the reopen path. Companion IMPLEMENTATION_PLAN.md sequences the four-commit follow-up: B1 contract cleanup, B2 three-case eval harness, B5 reclassification.

…able Closes the contract debt Codex caught in the codegraph comparison (thread 019e12ed Q8): the 'symbol' member of RepoContextToolName lived in the type union, in permission validation, and in error type unions while the runtime rejected it with a generic "unsupported tool" error. That is neither a clean deferral nor a usable feature — anyone reading the contract sees a tool name that exists nowhere downstream. Changes (defense-in-depth, two layers): 1. Config-load (src/agents/schema.ts): RESERVED_REPO_CONTEXT_TOOLS constant + validateRepoContext rejects any agent listing 'symbol' in tool_use.repo_context.tools[] with code schema_invalid_permissions and a rule pointing at REPO_CONTEXT.md § Reservation. 2. Runtime (src/tools/repo-context/permissions.ts): intersectPermissions rejects any direct request whose tool === 'symbol' with code tool_unavailable, even when the request bypasses TypeScript type checking (untyped JSON, untrusted callers). The type-union member in RepoContextToolName is preserved on purpose: the schema slot is callable for backward-compat when the 4-condition AND telemetry signal in REPO_CONTEXT.md § "Reservation and reopen-the- slot signal" fires (high search churn + manifest-cap saturation + phase result-tokens > 200k + downstream VERIFY/REVIEW failure attributable to missed semantic context, on three runs across two repos). Vocabulary updates: - src/prompts/index.ts TOOL_DESCRIPTIONS.symbol describes the slot as RESERVED with the doc anchor. - src/tools/repo-context/runner.ts unsupported-tool branch documents the unreachability and stays as a typed safety net for future tool additions. - docs/contracts/REPO_CONTEXT.md § "Reservation and reopen-the-slot signal" codifies the 4-condition AND. Tests: - tests/agents-permissions.test.ts: flipped existing "symbol allowed in schema" test to assert the RESERVED rejection. - tests/repo-context-symbol-reservation.test.ts: new defense-in-depth coverage at the runtime layer + RESERVED list shape + persona prompt vocabulary check. Test count: 3108 -> 3113. Typecheck clean.

Borrows codegraph's SEARCH_QUALITY_LOOP methodology, narrowed per Codex Q4 in the codegraph comparison synthesis (thread 019e12ed): three cases, deterministic, recall@k + bytes/tokens + tool-call counts; no LLM-judged path in default CI. Pure addition; no contract change. Cases: case-01-discovery — given a task prompt, can grep find the expected auth files? Asserts recall@4 = 1.0 under default caps with one tool call. case-02-usage — broad symbol query for callers of one function. Asserts every caller surfaces under default caps without saturating maxResults. case-03-budget-pressure — 30 candidate files, 12 matching, 13 tool calls (1 grep + 12 reads) against tightened caps (maxResults=25, maxBytesPerResult=4096, maxFilesForNextManifest=10). Asserts recall ≥ 0.8 floor and total result bytes < 1.5 MB envelope. The harness drives runRepoContextTool — the same orchestrator the spine uses — so metrics reflect the production code path including permission intersection and event emission. Wire-up: tests/evaluation/repo-context/harness.ts — case runner; emits JSON metrics; throws actionable error when rg is missing. tests/evaluation/repo-context/case-*.ts — three case fixtures. tests/evaluation/repo-context/runner.test.ts — bun-test wrapper so regressions surface in `bun test`. scripts/eval-repo-context.ts — standalone runner; `bun run eval:repo_context` emits JSON report and supports --strict (CI). package.json — eval:repo_context script. Test count: 3113 -> 3116 (3 new eval cases). Typecheck clean. Standalone runner: PASS case-01 / PASS case-02 / PASS case-03 with recall=1.0 across all three.

Closes B5 reclassification from the codegraph comparison synthesis (Codex thread 019e12ed Q6): framework-aware route detection moves from "no-borrow today" to "deferred-with-trigger" — reopens if a routing/API-surface audit persona enters the company roster (W4 candidate). Until then no orchestrator persona consumes route nodes and the borrow does not earn its rule-20 cost. Replaces the prior "Optional `symbol` LSP integration for repo-context tools (deferred from M6)" line in the W3 section with a new "Deferred-with-trigger items" subsection covering both: - The `symbol` tool backend (Option D-reserved; reopen condition lives in REPO_CONTEXT.md § "Reservation and reopen-the-slot signal"). - The B5 framework-aware route detection (reopen on routing-audit persona). The W3 section's other items (provider integration, LanguagePack, integrations, DSPy MIPRO, concurrent runs) are unchanged.

Three block-push findings from Codex R1 (thread 019e1326): 1. REPO_CONTEXT.md:32-34 stale wording. The bullet "tool_use.repo_context is the only tool_use sub-scope in v0.1" was wrong — M7-M10 added tool_use.write, .execute, .review_request, .debate. The line about 'symbol' being "optional in M6, deferred to W3" contradicted the new RESERVED contract added in commit f351688. Replaced both with accurate text and a forward link to the Reservation section. 2. recall@k metric was not actually computing recall@k. The harness was unioning all returned paths and counting hits — a noisy query returning 50 files including the expected 4 would still score 1.0. Refactored to preserve event-encounter order, deduplicate in order, slice to k, then count. Renamed `distinctReturnedPaths` to `orderedReturnedPaths` so the field name matches the semantics. Added doc comment explaining what the metric catches. 3. case-03 was not actually exercising budget pressure. With 12 matching files and maxResults=25, grep never truncated and the "budget pressure" name was theater. Rebuilt the case with 40 matching files × 3 lines/file = 120 candidate matches against maxResults=25 — truncation now genuinely fires (anyTruncated=true asserted). Replaced the recall floor (which depended on rg's filesystem traversal order, platform-dependent) with a precision metric: every returned path must start with `src/match/`, never `src/decoy/`. Precision-under-truncation is the real regression signal (a tool change that leaks decoys past cap saturation must fail). The standalone runner gained matching `precisionPathPrefix` threshold + `mustTruncate` flag. Side effect: case-01's README contained the trigger word "authenticate", which leaked into the result stream and made recall@4 nondeterministic (rg's parallel traversal sometimes put README.md before src/index.ts in encounter order). Removed the trigger word from README content; recall is now deterministically 1.0 across runs. Verification: bun test 3116 pass / 0 fail; typecheck clean; three consecutive bun-test runs all deterministic; bun run eval:repo_context --strict passes all three cases.

…ings Four findings from Codex R2 (thread 019e1330), all doc/parity drift from the R1 behavioral fixes: Area 3 — case-03 stale prose: - tests/evaluation/repo-context/harness.ts file-header: case-03 bullet still said "selected-path counts and result bytes stay below caps while keeping recall ≥ threshold." Rewrote to reflect the precision-under-truncation contract and the explicit no-recall-assertion rationale. - tests/evaluation/repo-context/runner.test.ts file-header: case-03 bullet still claimed "recall must stay ≥ 0.8 under tight caps." Rewrote to describe the truncation + envelope + precision assertions and to call out platform-dependent traversal order as the reason recall is excluded. - tests/evaluation/repo-context/case-03-budget-pressure.ts:44 matching() helper comment said "Two TARGET_SYMBOL occurrences per file ⇒ 80 grep match-lines for 40 files." The fixture body actually emits three matches per file (120 total). Corrected. - tests/evaluation/repo-context/case-03-budget-pressure.ts:59-62 described a k=10 recall window that no longer matches expectedAll (k = 40 after the R1 rebuild). Rewrote to point at the file-header explanation. Area 2 — minor recall@k doc precision: - harness.ts interface doc said "k = expectedPaths.length"; the code uses `new Set(expectedPaths).size` to keep the metric well-defined for fixtures that might list a path twice. Tightened the doc to match the code. Area 4 — standalone runner parity: - scripts/eval-repo-context.ts: new `minReturnedPaths` threshold mirrors the bun-test wrapper's `expect(orderedReturnedPaths.length).toBeGreaterThan(0)` sanity check. Wired `minReturnedPaths: 1` into all three case threshold entries so `bun run eval:repo_context --strict` reaches the same pass/fail conclusion as `bun test`. Areas 1 and 5 returned `push` from R2 directly and need no changes. Verification: bun test 3116 pass / 0 fail; typecheck clean; standalone runner --strict reports PASS on all three cases.

…cript) Three drift findings from Codex R3 (thread 019e141b), all in companion docs that the prior rounds did not audit: 1. docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md described case-03 as "30 candidate files; pattern matches 12; recall@k ≥ 0.8; selected-path count assertion" and proposed a new schema_reserved_tool error code. The shipped contract is 40 matching files × 3 match-lines/file = 120 candidate matches; asserts truncation + envelope + precision (no recall assertion; no selected-path assertion); reuses schema_invalid_permissions and tool_unavailable rather than minting a new code. Resolution: marked the document as a historical planning snapshot in its header, added an "Outcome (post-R3 final state)" section at the bottom that captures every divergence (error-code reuse, case-03 redesign, case-01 README fix, standalone runner parity), plus a review-round summary table. 2. docs/comparison/06-codegraph/CODEX_RESPONSE.md captured the R0 pre-implementation Q&A verbatim. Several specifics evolved during R1+R2 — particularly the B2 case-03 contract and the recall@k metric semantics. The transcript still read as the current contract. Resolution: marked the document as a historical R0 transcript in its header, added a "Postscript — implementation evolution (R1 / R2 / R3)" section at the bottom enumerating each divergence with commit anchors. 3. tests/evaluation/repo-context/harness.ts inline comment at the recall@k computation said "k = expectedPaths.length" while the code uses `new Set(expectedPaths).size` (and the public JSDoc was already corrected in R2). The inline comment was the last drifted surface. Resolution: rewrote the inline comment to match the code and explain why the dedup'd set size is the right denominator. Verification: bun test 3116 pass / 0 fail; typecheck clean. No code changes in this commit beyond a single comment block — only the companion docs that ship in the PR were updated. Closes Codex R3 (thread 019e141b). Next dispatch should return push.

…ROADMAP M6 line) Three drift findings from Codex R4 (thread 019e1421), all in companion/planning docs that the R3 audit narrowed past: 1. docs/comparison/06-codegraph/COMPARISON.md:33 (Symbol-aware queries row) still described the post-implementation state as "runner.ts errors 'unsupported tool symbol'" with a W3 ROADMAP reference to "Optional LSP integration." That misrepresents the shipped contract. Rewrote the cell to name the actual shipped rejection points: schema.ts config-load via schema_invalid_permissions + permissions.ts runtime via tool_unavailable + the 4-condition AND telemetry signal in REPO_CONTEXT.md § Reservation. 2. docs/comparison/06-codegraph/COMPARISON.md:54 (G1 narrative) still referenced "W3 ROADMAP defers it as Optional LSP integration" as if that line was still in the roadmap. The roadmap entry was replaced in commit 366dd9e. Rewrote G1 to point at the post- resolution location (REPO_CONTEXT.md § Reservation) and the shipped Option D-reserved decision. 3. docs/comparison/06-codegraph/COMPARISON.md:88 and :115 described the B1 cleanup as "permissions.ts config-load rejection with typed tool_unavailable." That conflated config-load (schema.ts) and runtime (permissions.ts) layers. Rewrote both to name the four shipped components: RESERVED_REPO_CONTEXT_TOOLS constant + validateRepoContext rejection (schema_invalid_permissions) + intersectPermissions runtime guard (tool_unavailable) + RepoContextToolName JSDoc + REPO_CONTEXT.md § Reservation. 4. docs/comparison/06-codegraph/COMPARISON.md:133 (References section) still listed "code-oz W3 LSP line" pointing at the deleted ROADMAP wording. Replaced with the post-resolution reference: ROADMAP W3 § "Deferred-with-trigger items" added in commit 366dd9e. 5. docs/design/ROADMAP.md:135 M6 acceptance bullet still said "symbol optional, callable by PLAN under tool_use.repo_context." The W3 section was corrected in commit 366dd9e but the M6 acceptance criterion remained stale. Rewrote to describe the reserved-but-not-permissionable contract with a forward link to REPO_CONTEXT.md § Reservation. Also added a "Document status — synthesis complete and shipped" banner to COMPARISON.md's frontmatter region so future readers know the document reflects post-implementation state (matching the banner pattern already on IMPLEMENTATION_PLAN.md and CODEX_RESPONSE.md). Frontmatter gained explicit review-rounds field tracking R0+R1+R2+R3+R4 thread IDs. Verification: bun test 3116 pass / 0 fail; typecheck clean; final grep audit shows the only remaining "Optional symbol LSP" hits are in M6-era historical docs (MERGE_PLAN.md, SESSION_M6_KICKOFF.md — out of scope for this PR) and the ROADMAP entry that intentionally references the prior wording when explaining the replacement. Closes Codex R4 (thread 019e1421). Next dispatch should return push.

Closes Codex R5 (thread 019e142d): the briefing file's lines 21-23 and 41-43 framed the pre-implementation symbol-slot state and the old "Optional symbol LSP integration" ROADMAP wording as current. The file was the last companion doc without an explicit historical-snapshot banner; IMPLEMENTATION_PLAN.md and CODEX_RESPONSE.md already received the same treatment in earlier rounds (commits 63721e7, 15d5d43). Resolution mirrors the established pattern: - Frontmatter `status: dispatched` changed to `historical (pre- implementation dispatch; superseded by R1/R2/R3/R4/R5 — see banner)`. - Frontmatter companion list extended to point at CODEX_RESPONSE.md and IMPLEMENTATION_PLAN.md so readers see the chain. - Blockquote banner added immediately after frontmatter naming the pre-implementation status, identifying the two stale spots inside (symbol-slot framing + Optional LSP wording), and pointing readers at the post-implementation locations (IMPLEMENTATION_PLAN § Outcome and REPO_CONTEXT § Reservation). No body content changed; the historical artifact is preserved verbatim for the audit trail. Verification: bun test 3116 pass / 0 fail; typecheck clean.

Codex R6 (thread 019e1436) caught a recursive metadata problem: each fix-first closure round wrote frontmatter/banner copy that pinned a specific round count, which the next round then flagged as stale. R6 specifically flagged that COMPARISON.md said "4 review rounds → R4 push", IMPLEMENTATION_PLAN.md said "Codex R3 push", and CODEX_RESPONSE.md status said "superseded by R1/R2/R3" — all stale once R4 and R5 fix-first closures landed. Resolution — break the recursion by: 1. Moving the canonical round-by-round table to a single location (CODEX_RESPONSE.md § Postscript "Review round summary"). IMPL_PLAN mirrors it. All other companion docs point at that table instead of duplicating it. 2. Stabilizing banner and frontmatter language to refer to "a multi- round Codex review process" or "subsequent rounds" instead of a specific count. A future round that fix-firsts metadata can update the canonical table without invalidating banner copy. 3. Updating the canonical table to include R0, R1, R2, R3, R4, R5, R6 (this commit), and a `next` row marking the target convergence. Files touched: - COMPARISON.md frontmatter (status + review-rounds; references table-as-source-of-truth) and banner (stable-language). - IMPLEMENTATION_PLAN.md frontmatter and banner (stable-language); Outcome § "Codex review round summary" extended to R4/R5/R6 with commit anchors per row. - CODEX_RESPONSE.md frontmatter and banner; Postscript table extended to R4/R5/R6 + `next` row. - CODEX_BRIEFING.md frontmatter and banner. No code changes. No technical contract changes. The shipped implementation contract from commits f351688 + bf0c12d + 366dd9e + b41b3f5 + 28ee554 is unchanged. Verification: bun test 3116 pass / 0 fail; typecheck clean. Closes Codex R6. Next dispatch should return push.

Codex R7 (thread 019e143d) caught three sites where R6's stabilization sweep missed hardcoded round counts: 1. CODEX_RESPONSE.md:170 Postscript heading still said "(R1 / R2 / R3)" — the original R3-era title; line 172 still said "three subsequent Codex review rounds." Rewrote to drop the count from both heading and prose; the canonical round-by-round table at the bottom of the Postscript is the single source of truth. 2. COMPARISON.md:8 review-rounds frontmatter still hardcoded "R1..R6 sequential fix-first closures" with the explicit thread-ID list. Replaced with a pointer to the canonical table in CODEX_RESPONSE.md § Postscript. 3. IMPLEMENTATION_PLAN.md:9 review-rounds frontmatter said "Outcome § Codex review round summary" is the canonical table, and the R6 entry in that same table claimed it had migrated round-counts "to this table as the single source of truth." That conflicted with R6's stated fix (which named CODEX_RESPONSE.md § Postscript as canonical). Updated both references to point at CODEX_RESPONSE.md and clarify that the IMPL_PLAN table is a mirror. Both tables (CODEX_RESPONSE Postscript + IMPL_PLAN Outcome) now include R7 with the (current commit) anchor. Stable language used elsewhere: "multi-round Codex review process", "subsequent rounds", "the canonical round-by-round table" — none pin a specific count. The only remaining round-count occurrences are inside the canonical tables themselves, which is correct (the table IS the source of truth; round numbers belong there). Verification: bun test 3116 pass / 0 fail; typecheck clean; grep audit confirms no hardcoded "three/four/five rounds" or "R1..Rn" range copy outside canonical tables. Closes Codex R7. Next dispatch should return push.

Codex R8 (thread 019e1442) flagged six remaining sites in IMPLEMENTATION_PLAN.md where round-count copy survived earlier sweeps. The planning-body section preserved the original "two-round target" plan; the Outcome section had a "post-R3 final state" heading; the test-count table row was labeled "After B2 + R1 + R2". All have been replaced with stable language: - IMPLEMENTATION_PLAN.md:29 "two Codex review rounds (R1 + R2 minimum) reaching push" → "a Codex review process culminating in push" + footnote that the original plan called for two rounds and the canonical table records what actually happened. - IMPLEMENTATION_PLAN.md:95-102 review-rounds section rewritten: "dispatch Codex R1", "dispatch R2", "Two-round target. R1+R2 must converge" → generic "dispatch a Codex review", "dispatch a follow-up review", "original target: minimum two-round convergence; canonical table records actual history." - IMPLEMENTATION_PLAN.md:112 "## Outcome (post-R3 final state)" → "## Outcome (post-implementation final state)". - IMPLEMENTATION_PLAN.md:120 "### B2 — case-03 redesign (R1 → R2 fixes)" → "### B2 — case-03 redesign (post-debate fixes)". - IMPLEMENTATION_PLAN.md:124 body prose "Codex R1 finding 3 caught" → "A post-debate review (see canonical table) caught" to remove the hardcoded round number. - IMPLEMENTATION_PLAN.md:147 test-count table "After B2 + R1 + R2" → "After B2 + post-debate fixes". Note: specific-round attributions inside the Outcome and Postscript sections (e.g., "Codex R1 finding 3" or "Codex R1 (thread 019e1326) caught") are preserved because those sections are explicitly the round-by-round history. The stability rule applies to banners, frontmatter, planning-body, and headings — not to body prose inside the round-summary documentation itself. Verification: bun test 3116 pass / 0 fail; typecheck clean. The canonical round-by-round table in CODEX_RESPONSE.md § Postscript remains the single source of truth (mirrored in IMPL_PLAN Outcome). Closes Codex R8. Next dispatch should return push.

Trailing administrative commit closing the R8 paper-trail. The R8 content fix landed in 13d2206; this commit extends the canonical round-by-round tables in CODEX_RESPONSE.md § Postscript and IMPLEMENTATION_PLAN.md § Outcome with the R8 row + commit anchor. No code or contract changes. Two-line diff in each table.

Codex R9 (thread 019e1447) returned push. The shipped contract is intact (RESERVED_REPO_CONTEXT_TOOLS + validateRepoContext schema_invalid_permissions + intersectPermissions tool_unavailable + REPO_CONTEXT.md § Reservation + 4-condition AND telemetry signal + case-03 40-files-3-lines-precision-not-recall contract), no remaining stale round-count copy outside the canonical tables, and the metadata-recursion sweep that consumed R6-R8 has converged. This commit extends both canonical round-by-round tables (CODEX_RESPONSE.md § Postscript and IMPLEMENTATION_PLAN.md § Outcome) with the R9 push row and removes the "(next)" pending row. The branch is now ready for PR. Final state: - 14 commits on top of origin/main - bun test: 3116 pass / 0 fail - bun run typecheck: clean - bun run eval:repo_context --strict: PASS on all three cases

coderabbitai · 2026-05-10T23:51:02Z

Warning

Rate limit exceeded

@omerakben has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 58 minutes and 46 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 918f960b-407a-4b43-92e5-e53fd5d4f8e7

📥 Commits

Reviewing files that changed from the base of the PR and between 0dce4b0 and 8cf3726.

📒 Files selected for processing (20)

docs/comparison/06-codegraph/CODEX_BRIEFING.md
docs/comparison/06-codegraph/CODEX_RESPONSE.md
docs/comparison/06-codegraph/COMPARISON.md
docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md
docs/contracts/REPO_CONTEXT.md
docs/design/ROADMAP.md
package.json
scripts/eval-repo-context.ts
src/agents/schema.ts
src/prompts/index.ts
src/tools/repo-context/permissions.ts
src/tools/repo-context/runner.ts
src/tools/repo-context/types.ts
tests/agents-permissions.test.ts
tests/evaluation/repo-context/case-01-discovery.ts
tests/evaluation/repo-context/case-02-usage.ts
tests/evaluation/repo-context/case-03-budget-pressure.ts
tests/evaluation/repo-context/harness.ts
tests/evaluation/repo-context/runner.test.ts
tests/repo-context-symbol-reservation.test.ts

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/comparison-codegraph

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f9e0e8c87

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-10T23:54:28Z

+  const project = await mkdtemp(join(tmpdir(), `codeoz-eval-${caseName}-`))
+  for (const [rel, content] of setup.files) {
+    const abs = join(project, rel)
+    const dir = abs.slice(0, abs.lastIndexOf('/'))


Use path.dirname when creating fixture directories

Deriving dir via abs.slice(0, abs.lastIndexOf('/')) is POSIX-only and breaks on Windows, where join() produces backslash-separated paths (so lastIndexOf('/') is -1). In that environment this computes an invalid parent path and can fail before fixtures are written, which makes both tests/evaluation/repo-context/runner.test.ts and bun run eval:repo_context non-portable. Use dirname(abs) (or equivalent path API) to make fixture setup work across platforms.

Useful? React with 👍 / 👎.

gemini-code-assist

Code Review

This pull request implements several improvements derived from a comparison with the codegraph tool. Key changes include tightening the governance of the symbol tool slot by marking it as explicitly reserved and non-permissionable, and introducing a new three-case deterministic evaluation harness for repo-context tools (glob, grep, read). The harness measures discovery, usage, and budget pressure metrics like recall@k and byte envelopes. Feedback focused on improving the portability of the evaluation harness by replacing brittle manual path manipulation with standard library functions to ensure compatibility across different operating systems.

gemini-code-assist · 2026-05-10T23:55:34Z

+
+import { mkdtemp, mkdir, writeFile } from 'node:fs/promises'
+import { tmpdir } from 'node:os'
+import { join } from 'node:path'


Please add dirname to the imports from node:path to support the suggested refactoring of the directory creation logic.

Suggested change

import { join } from 'node:path'

import { dirname, join } from 'node:path'

gemini-code-assist · 2026-05-10T23:55:34Z

+  const project = await mkdtemp(join(tmpdir(), `codeoz-eval-${caseName}-`))
+  for (const [rel, content] of setup.files) {
+    const abs = join(project, rel)
+    const dir = abs.slice(0, abs.lastIndexOf('/'))


The manual path slicing logic using lastIndexOf('/') is brittle and will fail on Windows systems where the path separator is \. Using path.dirname is more robust and portable.

Suggested change

const dir = abs.slice(0, abs.lastIndexOf('/'))

const dir = dirname(abs)

Copilot

Pull request overview

This PR tightens governance around the reserved repo_context tool slot (symbol) to eliminate contract debt, and adds a deterministic evaluation harness to regression-test repo_context retrieval quality (recall@k + budget metrics) without introducing new authority surfaces.

Changes:

Enforce 'symbol' as reserved-but-not-permissionable at both config-load (validateRepoContext) and runtime (intersectPermissions), and align prompt vocabulary + docs to the reservation contract.
Add a 3-case deterministic repo_context eval harness (discovery/usage/budget-pressure) used in CI (bun test) and a standalone runner (bun run eval:repo_context).
Update roadmap + comparison documentation to reflect the post-synthesis reservation/reopen trigger contract and the codegraph head-to-head record.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
tests/repo-context-symbol-reservation.test.ts	Adds defense-in-depth tests for runtime rejection of `tool === 'symbol'` + prompt vocabulary checks.
tests/evaluation/repo-context/runner.test.ts	CI wrapper that runs the 3-case eval harness under `bun test` (skips/branches when `rg` missing).
tests/evaluation/repo-context/harness.ts	Core harness: builds temp repo, drives `runRepoContextTool`, derives recall@k + budget metrics from `events.jsonl`.
tests/evaluation/repo-context/case-01-discovery.ts	Fixture for broad discovery via grep; asserts recall@k and basic invariants.
tests/evaluation/repo-context/case-02-usage.ts	Fixture for call-site discovery; asserts recall and no truncation.
tests/evaluation/repo-context/case-03-budget-pressure.ts	Fixture that forces cap truncation and asserts envelope + precision-under-truncation.
tests/agents-permissions.test.ts	Updates schema validation test to assert `'symbol'` is rejected at config-load as RESERVED.
src/tools/repo-context/types.ts	Documents that `'symbol'` remains in the tool-name union but is reserved in v0.x.
src/tools/repo-context/runner.ts	Adds rationale comment about unsupported-tool branch being unreachable in v0.x.
src/tools/repo-context/permissions.ts	Adds runtime defense-in-depth rejection for untyped `request.tool === 'symbol'`.
src/prompts/index.ts	Exports `TOOL_DESCRIPTIONS` and updates `symbol` description to RESERVED with doc anchor.
src/agents/schema.ts	Adds `RESERVED_REPO_CONTEXT_TOOLS` and rejects reserved tools in `validateRepoContext`.
scripts/eval-repo-context.ts	Standalone runner for harness with `--json` / `--strict` reporting and thresholds.
package.json	Adds `eval:repo_context` script entry.
docs/design/ROADMAP.md	Reclassifies symbol backend + route detection as deferred-with-trigger items and updates M6 acceptance text.
docs/contracts/REPO_CONTEXT.md	Codifies reservation + explicit reopen-the-slot telemetry signal and rejection behavior.
docs/comparison/06-codegraph/COMPARISON.md	Adds shipped comparison record and borrow decision write-up.
docs/comparison/06-codegraph/CODEX_RESPONSE.md	Captures Codex R0 transcript + postscript with canonical review-round summary.
docs/comparison/06-codegraph/CODEX_BRIEFING.md	Adds historical pre-implementation briefing record with banner.
docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md	Adds implementation plan with authoritative “Outcome” section capturing final shipped state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+    expect(() =>
+      intersectPermissions({
+        agentPermissions: noPerms,
+        request: req,
+        projectRoot: PROJECT,
+      }),
+    ).toThrow(RepoContextError)


+    // Unreachable in v0.x: 'glob' / 'grep' / 'read' all set `result` above,
+    // and 'symbol' is rejected by intersectPermissions before we reach the
+    // dispatch (see permissions.ts § "Defense-in-depth for the
+    // reserved-but-not-permissionable slot"). Kept as a typed safety net
+    // so future tool additions cannot regress the invariant silently.


…ssion Untyped callers (JSON-decoded payloads, tests, future call sites with untrusted input) could reach runRepoContextTool with tool='symbol'. intersectPermissions correctly threw, but the runner's downstream event-logging path called describeQuery/readRequestedRootsForLog on the unrecognized tool, producing query=undefined. appendEvent's schema validator then threw EventLogError, violating the function's "always emits exactly one repo_context_searched event" guarantee. Move the symbol-reservation guard to runner entry. The runner now rejects any non-runnable tool name (symbol or unknown) with tool_unavailable BEFORE any other operation. intersectPermissions keeps its own guard for direct callers; this is defense-in-depth at the next layer up. Strengthen the order-of-checks test: assert issues[0].code is tool_unavailable and tool is symbol, not just RepoContextError thrown. A regression that moves the guard below tu===undefined would surface tool_no_permissions and fail the assertion. Add three runner-level tests covering symbol rejection, no event emitted on the guard path (proving order), and rejection of arbitrary unknown tool names. Closes Copilot PR #21 block-class finding.

The earlier abs.slice(0, abs.lastIndexOf('/')) was POSIX-only. On Windows, node:path's join produces backslash-separated paths, so lastIndexOf('/') returns -1 and the derived dir is an invalid prefix. Fixture setup fails before any tool call, making the harness and bun run eval:repo_context non-portable. Use dirname() which honors the platform separator. Closes Codex PR #21 P2 portability finding.

omerakben · 2026-05-11T01:33:49Z

Three PR review findings closed in 8cf3726 / 5eaa508:

Copilot block-class (runtime bug) — runner.ts now guards tool=symbol (and any non-runnable tool) at function entry before intersectPermissions and event emission. Untyped callers previously caused appendEvent to throw EventLogError on query=undefined, breaking the "always emits one event" guarantee. Three new runner-level tests cover symbol rejection, the no-event-emitted invariant (proving order), and unknown-tool rejection.
Codex P2 (Windows portability) — harness.ts now uses node:path dirname() instead of POSIX-only abs.slice(0, lastIndexOf('/')). The earlier code returned -1 on Windows where join produces backslashes, breaking fixture setup.
Copilot (weak test assertion) — the "guard runs first" test in tests/repo-context-symbol-reservation.test.ts now asserts issues[0].code === 'tool_unavailable' and tool === 'symbol' instead of only toThrow(RepoContextError). A regression that moves the guard below tu===undefined would surface tool_no_permissions and fail the assertion.

3119 tests pass (+3 new). typecheck clean.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cf372665a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-11T01:37:01Z

+    name: 'case-02-usage',
+    setup: CASE_02_USAGE,
+    thresholds: { minRecall: 1.0, mustNotTruncate: true, minReturnedPaths: 1 },


Enforce case-02 saturation guard in strict eval mode

--strict is documented as using the same regression thresholds as tests/evaluation/repo-context/runner.test.ts, but case-02 omits the test’s orderedReturnedPaths.length < 50 assertion. That means the standalone evaluator can report PASS even when the usage query has drifted to the max-results boundary (e.g., returning 50 distinct paths without truncation), which weakens the harness’s ability to catch search-noise regressions outside bun test and can produce misleading experiment outcomes.

Useful? React with 👍 / 👎.

omerakben added 14 commits May 10, 2026 14:16

Copilot AI review requested due to automatic review settings May 10, 2026 23:50

Copilot started reviewing on behalf of omerakben May 10, 2026 23:51 View session

chatgpt-codex-connector Bot reviewed May 10, 2026

View reviewed changes

gemini-code-assist Bot reviewed May 10, 2026

View reviewed changes

Copilot AI reviewed May 10, 2026

View reviewed changes

omerakben added 2 commits May 10, 2026 21:33

chatgpt-codex-connector Bot reviewed May 11, 2026

View reviewed changes

omerakben merged commit 36aa2fc into main May 11, 2026
1 check passed

omerakben deleted the feat/comparison-codegraph branch May 30, 2026 03:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(comparison): codegraph head-to-head + B1 symbol-slot governance + B2 eval harness#21

feat(comparison): codegraph head-to-head + B1 symbol-slot governance + B2 eval harness#21
omerakben merged 16 commits into
mainfrom
feat/comparison-codegraph

omerakben commented May 10, 2026

Uh oh!

coderabbitai Bot commented May 10, 2026 •

edited

Loading

Rate limit exceeded

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 10, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 10, 2026

Uh oh!

gemini-code-assist Bot May 10, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

omerakben commented May 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	import { join } from 'node:path'
	import { dirname, join } from 'node:path'

	const dir = abs.slice(0, abs.lastIndexOf('/'))
	const dir = dirname(abs)

Conversation

omerakben commented May 10, 2026

Summary

Comparison verdict

What's in this PR

B1 — Symbol slot contract cleanup (Option D-reserved)

B2 — Three-case deterministic eval harness

B5 — Framework-aware route detection reclassified

Documentation

Verification

Cross-model peer review

Test plan

Uh oh!

coderabbitai Bot commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Rate limit exceeded

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 10, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

omerakben commented May 11, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 11, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai Bot commented May 10, 2026 •

edited

Loading