Skip to content

feat(comparison): codegraph head-to-head + B1 symbol-slot governance + B2 eval harness#21

Merged
omerakben merged 16 commits into
mainfrom
feat/comparison-codegraph
May 11, 2026
Merged

feat(comparison): codegraph head-to-head + B1 symbol-slot governance + B2 eval harness#21
omerakben merged 16 commits into
mainfrom
feat/comparison-codegraph

Conversation

@omerakben
Copy link
Copy Markdown
Owner

Summary

Head-to-head comparison of code-oz vs @colbymchenry/codegraph v0.7.2 (tree-sitter + SQLite code-intelligence indexer for Claude Code) plus implementation of the two action-bearing borrows identified by Codex.

The single most consequential outcome was Codex's pre-implementation Q8 catch: code-oz's reserved-but-unsupported 'symbol' slot in RepoContextToolName was already contract debt today, not harmless optionality. The type union allowed it, permissions.ts validated it, error types referenced it, but runner.ts rejected it with a generic "unsupported tool" error. Defense-in-depth wiring across config-load (src/agents/schema.ts) and runtime (src/tools/repo-context/permissions.ts) closes that debt while preserving the schema slot for a telemetry-gated reopen.

Comparison verdict

YES, code-oz is ahead on category — codegraph is a code-intelligence indexer for chat-tool agents, not an SDLC orchestrator. 90 percent of codegraph's surface is structurally orthogonal. The 10 percent overlap is exactly the tool_use.repo_context permission scope, which the comparison contracts and the implementation closes.

What's in this PR

B1 — Symbol slot contract cleanup (Option D-reserved)

  • RESERVED_REPO_CONTEXT_TOOLS = ['symbol'] constant in src/agents/schema.ts
  • validateRepoContext rejects 'symbol' in tools[] at config-load with schema_invalid_permissions and a rule anchored to REPO_CONTEXT.md § Reservation
  • intersectPermissions rejects any direct tool === 'symbol' request at runtime with tool_unavailable (defense-in-depth for type-bypassed callers like JSON-decoded payloads)
  • RepoContextToolName JSDoc + TOOL_DESCRIPTIONS.symbol persona vocabulary updated to RESERVED status with doc anchor
  • docs/contracts/REPO_CONTEXT.md § "Reservation and reopen-the-slot signal" codifies the 4-condition AND telemetry signal that would reopen the slot: high search churn + manifest-cap saturation + phase result-tokens > 200k + downstream VERIFY/REVIEW failure attributable to missed semantic context, on three runs across two repos
  • Type-union member preserved so the schema slot is callable for backward-compat when telemetry triggers a reopen

B2 — Three-case deterministic eval harness

  • tests/evaluation/repo-context/harness.ts drives runRepoContextTool (production code path) and computes recall@k + bytes/tokens + tool-call counts + truncation flag from the audit log
  • Three cases:
    • case-01 discovery — recall@4 = 1.0 with one tool call (regression detector for grep correctness)
    • case-02 usage — recall@6 = 1.0 with no maxResults saturation (regression detector for caller-graph queries at default caps)
    • case-03 budget pressure — 40 matching files × 3 lines/file = 120 candidate matches against maxResults=25; asserts anyTruncated === true, per-call byte envelope, and precision under truncation (every returned path starts with src/match/, no decoy leakage)
  • scripts/eval-repo-context.ts standalone runner with --json / --strict flags
  • package.json script: bun run eval:repo_context

B5 — Framework-aware route detection reclassified

  • docs/design/ROADMAP.md W3 section: new "Deferred-with-trigger items" subsection covering both the symbol-slot reopen condition and B5 (reopens if a routing/API-surface audit persona enters the company roster)
  • Old "Optional symbol LSP integration (deferred from M6)" line replaced; M6 acceptance bullet updated

Documentation

  • docs/comparison/06-codegraph/COMPARISON.md — full head-to-head, borrow ranking, decision section
  • docs/comparison/06-codegraph/CODEX_BRIEFING.md — pre-implementation dispatch (historical)
  • docs/comparison/06-codegraph/CODEX_RESPONSE.md — R0 transcript + Postscript with canonical review-round table
  • docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md — planning record + Outcome section

Verification

  • bun test: 3116 pass / 0 fail / 2 skip (3108 baseline + 8 new). Three consecutive runs deterministic
  • bun run typecheck: clean
  • bun run eval:repo_context --strict: PASS on all three cases

Cross-model peer review

Multi-round Codex review process (canonical table in docs/comparison/06-codegraph/CODEX_RESPONSE.md § Postscript):

  • R0 019e12ed accept-with-modifications — pre-impl debate; flagged contract debt in Q8
  • R1 019e1326 fix-first — 3 block-push (REPO_CONTEXT.md drift, recall@k metric bug, case-03 not exercising budget pressure)
  • R2 019e1330 fix-first — 4 doc/parity drift
  • R3 019e141b fix-first — 3 doc-drift in companion files
  • R4 019e1421 fix-first — 3 doc-drift sweep (COMPARISON.md + ROADMAP M6)
  • R5 019e142d fix-first — CODEX_BRIEFING.md missing historical banner
  • R6 019e1436 fix-first — metadata-recursion: stabilize round-count language
  • R7 019e143d fix-first — 3 residual hardcoded round-count sites
  • R8 019e1442 fix-first — 6 residual round-count sites in planning-body
  • R9 019e1447 push — convergence

Test plan

  • bun test shows 3116+ pass / 0 fail
  • bun run typecheck clean
  • bun run eval:repo_context --strict PASS on all three cases
  • Reviewer-side spot check: try declaring tools: ['glob', 'grep', 'read', 'symbol'] in an agent fixture — config-load should reject with the rule string pointing at REPO_CONTEXT.md § Reservation

omerakben added 14 commits May 10, 2026 14:16
Comparison vs ~/Projects/agents/templates/codegraph (@colbymchenry/codegraph
v0.7.2; tree-sitter+SQLite code-intelligence indexer). Verdict: code-oz is
ahead on category. Two action-bearing borrows + one reclassified
deferred-with-trigger; Codex (gpt-5.5 xhigh, thread 019e12ed) returned
accept-with-modifications.

Codex's load-bearing catch (Q8): the reserved-but-unsupported 'symbol' slot
is already contract debt today — RepoContextToolName allows it, runner.ts
errors with a generic "unsupported tool". Synthesis converted the original
"defer indefinitely" into Option D-reserved with a 4-condition AND
telemetry signal locked for the reopen path.

Companion IMPLEMENTATION_PLAN.md sequences the four-commit follow-up:
B1 contract cleanup, B2 three-case eval harness, B5 reclassification.
…able

Closes the contract debt Codex caught in the codegraph comparison
(thread 019e12ed Q8): the 'symbol' member of RepoContextToolName lived
in the type union, in permission validation, and in error type unions
while the runtime rejected it with a generic "unsupported tool" error.
That is neither a clean deferral nor a usable feature — anyone reading
the contract sees a tool name that exists nowhere downstream.

Changes (defense-in-depth, two layers):

1. Config-load (src/agents/schema.ts): RESERVED_REPO_CONTEXT_TOOLS
   constant + validateRepoContext rejects any agent listing 'symbol'
   in tool_use.repo_context.tools[] with code schema_invalid_permissions
   and a rule pointing at REPO_CONTEXT.md § Reservation.

2. Runtime (src/tools/repo-context/permissions.ts): intersectPermissions
   rejects any direct request whose tool === 'symbol' with code
   tool_unavailable, even when the request bypasses TypeScript type
   checking (untyped JSON, untrusted callers).

The type-union member in RepoContextToolName is preserved on purpose:
the schema slot is callable for backward-compat when the 4-condition
AND telemetry signal in REPO_CONTEXT.md § "Reservation and reopen-the-
slot signal" fires (high search churn + manifest-cap saturation +
phase result-tokens > 200k + downstream VERIFY/REVIEW failure
attributable to missed semantic context, on three runs across two
repos).

Vocabulary updates:
- src/prompts/index.ts TOOL_DESCRIPTIONS.symbol describes the slot as
  RESERVED with the doc anchor.
- src/tools/repo-context/runner.ts unsupported-tool branch documents
  the unreachability and stays as a typed safety net for future tool
  additions.
- docs/contracts/REPO_CONTEXT.md § "Reservation and reopen-the-slot
  signal" codifies the 4-condition AND.

Tests:
- tests/agents-permissions.test.ts: flipped existing
  "symbol allowed in schema" test to assert the RESERVED rejection.
- tests/repo-context-symbol-reservation.test.ts: new defense-in-depth
  coverage at the runtime layer + RESERVED list shape + persona prompt
  vocabulary check.

Test count: 3108 -> 3113. Typecheck clean.
Borrows codegraph's SEARCH_QUALITY_LOOP methodology, narrowed per Codex
Q4 in the codegraph comparison synthesis (thread 019e12ed): three
cases, deterministic, recall@k + bytes/tokens + tool-call counts; no
LLM-judged path in default CI. Pure addition; no contract change.

Cases:

  case-01-discovery        — given a task prompt, can grep find the
                              expected auth files? Asserts recall@4 = 1.0
                              under default caps with one tool call.

  case-02-usage             — broad symbol query for callers of one
                              function. Asserts every caller surfaces
                              under default caps without saturating
                              maxResults.

  case-03-budget-pressure   — 30 candidate files, 12 matching, 13 tool
                              calls (1 grep + 12 reads) against
                              tightened caps (maxResults=25,
                              maxBytesPerResult=4096,
                              maxFilesForNextManifest=10). Asserts
                              recall ≥ 0.8 floor and total result bytes
                              < 1.5 MB envelope.

The harness drives runRepoContextTool — the same orchestrator the spine
uses — so metrics reflect the production code path including permission
intersection and event emission.

Wire-up:

  tests/evaluation/repo-context/harness.ts    — case runner; emits JSON
                                                metrics; throws actionable
                                                error when rg is missing.
  tests/evaluation/repo-context/case-*.ts     — three case fixtures.
  tests/evaluation/repo-context/runner.test.ts — bun-test wrapper so
                                                regressions surface in
                                                `bun test`.
  scripts/eval-repo-context.ts                — standalone runner;
                                                `bun run eval:repo_context`
                                                emits JSON report and
                                                supports --strict (CI).
  package.json                                — eval:repo_context script.

Test count: 3113 -> 3116 (3 new eval cases). Typecheck clean. Standalone
runner: PASS case-01 / PASS case-02 / PASS case-03 with recall=1.0 across
all three.
Closes B5 reclassification from the codegraph comparison synthesis
(Codex thread 019e12ed Q6): framework-aware route detection moves from
"no-borrow today" to "deferred-with-trigger" — reopens if a
routing/API-surface audit persona enters the company roster (W4
candidate). Until then no orchestrator persona consumes route nodes and
the borrow does not earn its rule-20 cost.

Replaces the prior "Optional `symbol` LSP integration for repo-context
tools (deferred from M6)" line in the W3 section with a new
"Deferred-with-trigger items" subsection covering both:

- The `symbol` tool backend (Option D-reserved; reopen condition lives
  in REPO_CONTEXT.md § "Reservation and reopen-the-slot signal").
- The B5 framework-aware route detection (reopen on routing-audit
  persona).

The W3 section's other items (provider integration, LanguagePack,
integrations, DSPy MIPRO, concurrent runs) are unchanged.
Three block-push findings from Codex R1 (thread 019e1326):

1. REPO_CONTEXT.md:32-34 stale wording. The bullet "tool_use.repo_context
   is the only tool_use sub-scope in v0.1" was wrong — M7-M10 added
   tool_use.write, .execute, .review_request, .debate. The line about
   'symbol' being "optional in M6, deferred to W3" contradicted the
   new RESERVED contract added in commit f351688. Replaced both with
   accurate text and a forward link to the Reservation section.

2. recall@k metric was not actually computing recall@k. The harness
   was unioning all returned paths and counting hits — a noisy query
   returning 50 files including the expected 4 would still score 1.0.
   Refactored to preserve event-encounter order, deduplicate in order,
   slice to k, then count. Renamed `distinctReturnedPaths` to
   `orderedReturnedPaths` so the field name matches the semantics.
   Added doc comment explaining what the metric catches.

3. case-03 was not actually exercising budget pressure. With 12
   matching files and maxResults=25, grep never truncated and the
   "budget pressure" name was theater. Rebuilt the case with 40
   matching files × 3 lines/file = 120 candidate matches against
   maxResults=25 — truncation now genuinely fires (anyTruncated=true
   asserted). Replaced the recall floor (which depended on rg's
   filesystem traversal order, platform-dependent) with a precision
   metric: every returned path must start with `src/match/`, never
   `src/decoy/`. Precision-under-truncation is the real regression
   signal (a tool change that leaks decoys past cap saturation must
   fail). The standalone runner gained matching `precisionPathPrefix`
   threshold + `mustTruncate` flag.

Side effect: case-01's README contained the trigger word "authenticate",
which leaked into the result stream and made recall@4 nondeterministic
(rg's parallel traversal sometimes put README.md before src/index.ts in
encounter order). Removed the trigger word from README content; recall
is now deterministically 1.0 across runs.

Verification: bun test 3116 pass / 0 fail; typecheck clean; three
consecutive bun-test runs all deterministic; bun run eval:repo_context
--strict passes all three cases.
…ings

Four findings from Codex R2 (thread 019e1330), all doc/parity drift
from the R1 behavioral fixes:

Area 3 — case-03 stale prose:
  - tests/evaluation/repo-context/harness.ts file-header: case-03
    bullet still said "selected-path counts and result bytes stay
    below caps while keeping recall ≥ threshold." Rewrote to reflect
    the precision-under-truncation contract and the explicit
    no-recall-assertion rationale.
  - tests/evaluation/repo-context/runner.test.ts file-header: case-03
    bullet still claimed "recall must stay ≥ 0.8 under tight caps."
    Rewrote to describe the truncation + envelope + precision
    assertions and to call out platform-dependent traversal order
    as the reason recall is excluded.
  - tests/evaluation/repo-context/case-03-budget-pressure.ts:44
    matching() helper comment said "Two TARGET_SYMBOL occurrences per
    file ⇒ 80 grep match-lines for 40 files." The fixture body
    actually emits three matches per file (120 total). Corrected.
  - tests/evaluation/repo-context/case-03-budget-pressure.ts:59-62
    described a k=10 recall window that no longer matches
    expectedAll (k = 40 after the R1 rebuild). Rewrote to point at
    the file-header explanation.

Area 2 — minor recall@k doc precision:
  - harness.ts interface doc said "k = expectedPaths.length"; the
    code uses `new Set(expectedPaths).size` to keep the metric
    well-defined for fixtures that might list a path twice. Tightened
    the doc to match the code.

Area 4 — standalone runner parity:
  - scripts/eval-repo-context.ts: new `minReturnedPaths` threshold
    mirrors the bun-test wrapper's
    `expect(orderedReturnedPaths.length).toBeGreaterThan(0)` sanity
    check. Wired `minReturnedPaths: 1` into all three case threshold
    entries so `bun run eval:repo_context --strict` reaches the same
    pass/fail conclusion as `bun test`.

Areas 1 and 5 returned `push` from R2 directly and need no changes.

Verification: bun test 3116 pass / 0 fail; typecheck clean; standalone
runner --strict reports PASS on all three cases.
…cript)

Three drift findings from Codex R3 (thread 019e141b), all in companion
docs that the prior rounds did not audit:

1. docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md described
   case-03 as "30 candidate files; pattern matches 12; recall@k ≥
   0.8; selected-path count assertion" and proposed a new
   schema_reserved_tool error code. The shipped contract is 40
   matching files × 3 match-lines/file = 120 candidate matches;
   asserts truncation + envelope + precision (no recall assertion;
   no selected-path assertion); reuses schema_invalid_permissions
   and tool_unavailable rather than minting a new code.

   Resolution: marked the document as a historical planning snapshot
   in its header, added an "Outcome (post-R3 final state)" section
   at the bottom that captures every divergence (error-code reuse,
   case-03 redesign, case-01 README fix, standalone runner parity),
   plus a review-round summary table.

2. docs/comparison/06-codegraph/CODEX_RESPONSE.md captured the R0
   pre-implementation Q&A verbatim. Several specifics evolved during
   R1+R2 — particularly the B2 case-03 contract and the recall@k
   metric semantics. The transcript still read as the current
   contract.

   Resolution: marked the document as a historical R0 transcript in
   its header, added a "Postscript — implementation evolution (R1 /
   R2 / R3)" section at the bottom enumerating each divergence with
   commit anchors.

3. tests/evaluation/repo-context/harness.ts inline comment at the
   recall@k computation said "k = expectedPaths.length" while the
   code uses `new Set(expectedPaths).size` (and the public JSDoc was
   already corrected in R2). The inline comment was the last drifted
   surface.

   Resolution: rewrote the inline comment to match the code and
   explain why the dedup'd set size is the right denominator.

Verification: bun test 3116 pass / 0 fail; typecheck clean. No code
changes in this commit beyond a single comment block — only the
companion docs that ship in the PR were updated.

Closes Codex R3 (thread 019e141b). Next dispatch should return push.
…ROADMAP M6 line)

Three drift findings from Codex R4 (thread 019e1421), all in
companion/planning docs that the R3 audit narrowed past:

1. docs/comparison/06-codegraph/COMPARISON.md:33 (Symbol-aware queries
   row) still described the post-implementation state as "runner.ts
   errors 'unsupported tool symbol'" with a W3 ROADMAP reference to
   "Optional LSP integration." That misrepresents the shipped contract.
   Rewrote the cell to name the actual shipped rejection points:
   schema.ts config-load via schema_invalid_permissions + permissions.ts
   runtime via tool_unavailable + the 4-condition AND telemetry signal
   in REPO_CONTEXT.md § Reservation.

2. docs/comparison/06-codegraph/COMPARISON.md:54 (G1 narrative) still
   referenced "W3 ROADMAP defers it as Optional LSP integration" as
   if that line was still in the roadmap. The roadmap entry was
   replaced in commit 366dd9e. Rewrote G1 to point at the post-
   resolution location (REPO_CONTEXT.md § Reservation) and the
   shipped Option D-reserved decision.

3. docs/comparison/06-codegraph/COMPARISON.md:88 and :115 described
   the B1 cleanup as "permissions.ts config-load rejection with typed
   tool_unavailable." That conflated config-load (schema.ts) and
   runtime (permissions.ts) layers. Rewrote both to name the four
   shipped components: RESERVED_REPO_CONTEXT_TOOLS constant +
   validateRepoContext rejection (schema_invalid_permissions) +
   intersectPermissions runtime guard (tool_unavailable) +
   RepoContextToolName JSDoc + REPO_CONTEXT.md § Reservation.

4. docs/comparison/06-codegraph/COMPARISON.md:133 (References section)
   still listed "code-oz W3 LSP line" pointing at the deleted
   ROADMAP wording. Replaced with the post-resolution reference:
   ROADMAP W3 § "Deferred-with-trigger items" added in commit 366dd9e.

5. docs/design/ROADMAP.md:135 M6 acceptance bullet still said
   "symbol optional, callable by PLAN under tool_use.repo_context."
   The W3 section was corrected in commit 366dd9e but the M6
   acceptance criterion remained stale. Rewrote to describe the
   reserved-but-not-permissionable contract with a forward link to
   REPO_CONTEXT.md § Reservation.

Also added a "Document status — synthesis complete and shipped"
banner to COMPARISON.md's frontmatter region so future readers know
the document reflects post-implementation state (matching the
banner pattern already on IMPLEMENTATION_PLAN.md and
CODEX_RESPONSE.md). Frontmatter gained explicit review-rounds
field tracking R0+R1+R2+R3+R4 thread IDs.

Verification: bun test 3116 pass / 0 fail; typecheck clean; final
grep audit shows the only remaining "Optional symbol LSP" hits are
in M6-era historical docs (MERGE_PLAN.md, SESSION_M6_KICKOFF.md —
out of scope for this PR) and the ROADMAP entry that intentionally
references the prior wording when explaining the replacement.

Closes Codex R4 (thread 019e1421). Next dispatch should return push.
Closes Codex R5 (thread 019e142d): the briefing file's lines 21-23 and
41-43 framed the pre-implementation symbol-slot state and the old
"Optional symbol LSP integration" ROADMAP wording as current. The
file was the last companion doc without an explicit historical-snapshot
banner; IMPLEMENTATION_PLAN.md and CODEX_RESPONSE.md already received
the same treatment in earlier rounds (commits 63721e7, 15d5d43).

Resolution mirrors the established pattern:

- Frontmatter `status: dispatched` changed to `historical (pre-
  implementation dispatch; superseded by R1/R2/R3/R4/R5 — see banner)`.
- Frontmatter companion list extended to point at CODEX_RESPONSE.md
  and IMPLEMENTATION_PLAN.md so readers see the chain.
- Blockquote banner added immediately after frontmatter naming the
  pre-implementation status, identifying the two stale spots inside
  (symbol-slot framing + Optional LSP wording), and pointing readers
  at the post-implementation locations (IMPLEMENTATION_PLAN § Outcome
  and REPO_CONTEXT § Reservation).

No body content changed; the historical artifact is preserved
verbatim for the audit trail.

Verification: bun test 3116 pass / 0 fail; typecheck clean.
Codex R6 (thread 019e1436) caught a recursive metadata problem: each
fix-first closure round wrote frontmatter/banner copy that pinned a
specific round count, which the next round then flagged as stale. R6
specifically flagged that COMPARISON.md said "4 review rounds → R4
push", IMPLEMENTATION_PLAN.md said "Codex R3 push", and
CODEX_RESPONSE.md status said "superseded by R1/R2/R3" — all stale
once R4 and R5 fix-first closures landed.

Resolution — break the recursion by:

1. Moving the canonical round-by-round table to a single location
   (CODEX_RESPONSE.md § Postscript "Review round summary"). IMPL_PLAN
   mirrors it. All other companion docs point at that table instead
   of duplicating it.

2. Stabilizing banner and frontmatter language to refer to "a multi-
   round Codex review process" or "subsequent rounds" instead of a
   specific count. A future round that fix-firsts metadata can update
   the canonical table without invalidating banner copy.

3. Updating the canonical table to include R0, R1, R2, R3, R4, R5,
   R6 (this commit), and a `next` row marking the target convergence.

Files touched:
- COMPARISON.md frontmatter (status + review-rounds; references
  table-as-source-of-truth) and banner (stable-language).
- IMPLEMENTATION_PLAN.md frontmatter and banner (stable-language);
  Outcome § "Codex review round summary" extended to R4/R5/R6 with
  commit anchors per row.
- CODEX_RESPONSE.md frontmatter and banner; Postscript table
  extended to R4/R5/R6 + `next` row.
- CODEX_BRIEFING.md frontmatter and banner.

No code changes. No technical contract changes. The shipped
implementation contract from commits f351688 + bf0c12d + 366dd9e
+ b41b3f5 + 28ee554 is unchanged.

Verification: bun test 3116 pass / 0 fail; typecheck clean.

Closes Codex R6. Next dispatch should return push.
Codex R7 (thread 019e143d) caught three sites where R6's
stabilization sweep missed hardcoded round counts:

1. CODEX_RESPONSE.md:170 Postscript heading still said "(R1 / R2 /
   R3)" — the original R3-era title; line 172 still said "three
   subsequent Codex review rounds." Rewrote to drop the count from
   both heading and prose; the canonical round-by-round table at the
   bottom of the Postscript is the single source of truth.

2. COMPARISON.md:8 review-rounds frontmatter still hardcoded "R1..R6
   sequential fix-first closures" with the explicit thread-ID list.
   Replaced with a pointer to the canonical table in
   CODEX_RESPONSE.md § Postscript.

3. IMPLEMENTATION_PLAN.md:9 review-rounds frontmatter said "Outcome §
   Codex review round summary" is the canonical table, and the R6
   entry in that same table claimed it had migrated round-counts "to
   this table as the single source of truth." That conflicted with
   R6's stated fix (which named CODEX_RESPONSE.md § Postscript as
   canonical). Updated both references to point at CODEX_RESPONSE.md
   and clarify that the IMPL_PLAN table is a mirror.

Both tables (CODEX_RESPONSE Postscript + IMPL_PLAN Outcome) now
include R7 with the (current commit) anchor. Stable language used
elsewhere: "multi-round Codex review process", "subsequent rounds",
"the canonical round-by-round table" — none pin a specific count.

The only remaining round-count occurrences are inside the canonical
tables themselves, which is correct (the table IS the source of
truth; round numbers belong there).

Verification: bun test 3116 pass / 0 fail; typecheck clean; grep
audit confirms no hardcoded "three/four/five rounds" or "R1..Rn"
range copy outside canonical tables.

Closes Codex R7. Next dispatch should return push.
Codex R8 (thread 019e1442) flagged six remaining sites in
IMPLEMENTATION_PLAN.md where round-count copy survived earlier sweeps.
The planning-body section preserved the original "two-round target"
plan; the Outcome section had a "post-R3 final state" heading; the
test-count table row was labeled "After B2 + R1 + R2". All have
been replaced with stable language:

- IMPLEMENTATION_PLAN.md:29 "two Codex review rounds (R1 + R2
  minimum) reaching push" → "a Codex review process culminating
  in push" + footnote that the original plan called for two
  rounds and the canonical table records what actually happened.
- IMPLEMENTATION_PLAN.md:95-102 review-rounds section rewritten:
  "dispatch Codex R1", "dispatch R2", "Two-round target. R1+R2
  must converge" → generic "dispatch a Codex review", "dispatch
  a follow-up review", "original target: minimum two-round
  convergence; canonical table records actual history."
- IMPLEMENTATION_PLAN.md:112 "## Outcome (post-R3 final state)"
  → "## Outcome (post-implementation final state)".
- IMPLEMENTATION_PLAN.md:120 "### B2 — case-03 redesign (R1 → R2
  fixes)" → "### B2 — case-03 redesign (post-debate fixes)".
- IMPLEMENTATION_PLAN.md:124 body prose "Codex R1 finding 3
  caught" → "A post-debate review (see canonical table) caught"
  to remove the hardcoded round number.
- IMPLEMENTATION_PLAN.md:147 test-count table "After B2 + R1 +
  R2" → "After B2 + post-debate fixes".

Note: specific-round attributions inside the Outcome and Postscript
sections (e.g., "Codex R1 finding 3" or "Codex R1 (thread 019e1326)
caught") are preserved because those sections are explicitly the
round-by-round history. The stability rule applies to banners,
frontmatter, planning-body, and headings — not to body prose inside
the round-summary documentation itself.

Verification: bun test 3116 pass / 0 fail; typecheck clean. The
canonical round-by-round table in CODEX_RESPONSE.md § Postscript
remains the single source of truth (mirrored in IMPL_PLAN Outcome).

Closes Codex R8. Next dispatch should return push.
Trailing administrative commit closing the R8 paper-trail. The R8
content fix landed in 13d2206; this commit extends the canonical
round-by-round tables in CODEX_RESPONSE.md § Postscript and
IMPLEMENTATION_PLAN.md § Outcome with the R8 row + commit anchor.

No code or contract changes. Two-line diff in each table.
Codex R9 (thread 019e1447) returned push. The shipped contract is
intact (RESERVED_REPO_CONTEXT_TOOLS + validateRepoContext
schema_invalid_permissions + intersectPermissions tool_unavailable +
REPO_CONTEXT.md § Reservation + 4-condition AND telemetry signal +
case-03 40-files-3-lines-precision-not-recall contract), no
remaining stale round-count copy outside the canonical tables, and
the metadata-recursion sweep that consumed R6-R8 has converged.

This commit extends both canonical round-by-round tables
(CODEX_RESPONSE.md § Postscript and IMPLEMENTATION_PLAN.md §
Outcome) with the R9 push row and removes the "(next)" pending row.
The branch is now ready for PR.

Final state:
- 14 commits on top of origin/main
- bun test: 3116 pass / 0 fail
- bun run typecheck: clean
- bun run eval:repo_context --strict: PASS on all three cases
Copilot AI review requested due to automatic review settings May 10, 2026 23:50
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 10, 2026

Warning

Rate limit exceeded

@omerakben has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 58 minutes and 46 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 918f960b-407a-4b43-92e5-e53fd5d4f8e7

📥 Commits

Reviewing files that changed from the base of the PR and between 0dce4b0 and 8cf3726.

📒 Files selected for processing (20)
  • docs/comparison/06-codegraph/CODEX_BRIEFING.md
  • docs/comparison/06-codegraph/CODEX_RESPONSE.md
  • docs/comparison/06-codegraph/COMPARISON.md
  • docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md
  • docs/contracts/REPO_CONTEXT.md
  • docs/design/ROADMAP.md
  • package.json
  • scripts/eval-repo-context.ts
  • src/agents/schema.ts
  • src/prompts/index.ts
  • src/tools/repo-context/permissions.ts
  • src/tools/repo-context/runner.ts
  • src/tools/repo-context/types.ts
  • tests/agents-permissions.test.ts
  • tests/evaluation/repo-context/case-01-discovery.ts
  • tests/evaluation/repo-context/case-02-usage.ts
  • tests/evaluation/repo-context/case-03-budget-pressure.ts
  • tests/evaluation/repo-context/harness.ts
  • tests/evaluation/repo-context/runner.test.ts
  • tests/repo-context-symbol-reservation.test.ts
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/comparison-codegraph

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3f9e0e8c87

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

const project = await mkdtemp(join(tmpdir(), `codeoz-eval-${caseName}-`))
for (const [rel, content] of setup.files) {
const abs = join(project, rel)
const dir = abs.slice(0, abs.lastIndexOf('/'))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Use path.dirname when creating fixture directories

Deriving dir via abs.slice(0, abs.lastIndexOf('/')) is POSIX-only and breaks on Windows, where join() produces backslash-separated paths (so lastIndexOf('/') is -1). In that environment this computes an invalid parent path and can fail before fixtures are written, which makes both tests/evaluation/repo-context/runner.test.ts and bun run eval:repo_context non-portable. Use dirname(abs) (or equivalent path API) to make fixture setup work across platforms.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements several improvements derived from a comparison with the codegraph tool. Key changes include tightening the governance of the symbol tool slot by marking it as explicitly reserved and non-permissionable, and introducing a new three-case deterministic evaluation harness for repo-context tools (glob, grep, read). The harness measures discovery, usage, and budget pressure metrics like recall@k and byte envelopes. Feedback focused on improving the portability of the evaluation harness by replacing brittle manual path manipulation with standard library functions to ensure compatibility across different operating systems.


import { mkdtemp, mkdir, writeFile } from 'node:fs/promises'
import { tmpdir } from 'node:os'
import { join } from 'node:path'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Please add dirname to the imports from node:path to support the suggested refactoring of the directory creation logic.

Suggested change
import { join } from 'node:path'
import { dirname, join } from 'node:path'

const project = await mkdtemp(join(tmpdir(), `codeoz-eval-${caseName}-`))
for (const [rel, content] of setup.files) {
const abs = join(project, rel)
const dir = abs.slice(0, abs.lastIndexOf('/'))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The manual path slicing logic using lastIndexOf('/') is brittle and will fail on Windows systems where the path separator is \. Using path.dirname is more robust and portable.

Suggested change
const dir = abs.slice(0, abs.lastIndexOf('/'))
const dir = dirname(abs)

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens governance around the reserved repo_context tool slot (symbol) to eliminate contract debt, and adds a deterministic evaluation harness to regression-test repo_context retrieval quality (recall@k + budget metrics) without introducing new authority surfaces.

Changes:

  • Enforce 'symbol' as reserved-but-not-permissionable at both config-load (validateRepoContext) and runtime (intersectPermissions), and align prompt vocabulary + docs to the reservation contract.
  • Add a 3-case deterministic repo_context eval harness (discovery/usage/budget-pressure) used in CI (bun test) and a standalone runner (bun run eval:repo_context).
  • Update roadmap + comparison documentation to reflect the post-synthesis reservation/reopen trigger contract and the codegraph head-to-head record.

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
tests/repo-context-symbol-reservation.test.ts Adds defense-in-depth tests for runtime rejection of tool === 'symbol' + prompt vocabulary checks.
tests/evaluation/repo-context/runner.test.ts CI wrapper that runs the 3-case eval harness under bun test (skips/branches when rg missing).
tests/evaluation/repo-context/harness.ts Core harness: builds temp repo, drives runRepoContextTool, derives recall@k + budget metrics from events.jsonl.
tests/evaluation/repo-context/case-01-discovery.ts Fixture for broad discovery via grep; asserts recall@k and basic invariants.
tests/evaluation/repo-context/case-02-usage.ts Fixture for call-site discovery; asserts recall and no truncation.
tests/evaluation/repo-context/case-03-budget-pressure.ts Fixture that forces cap truncation and asserts envelope + precision-under-truncation.
tests/agents-permissions.test.ts Updates schema validation test to assert 'symbol' is rejected at config-load as RESERVED.
src/tools/repo-context/types.ts Documents that 'symbol' remains in the tool-name union but is reserved in v0.x.
src/tools/repo-context/runner.ts Adds rationale comment about unsupported-tool branch being unreachable in v0.x.
src/tools/repo-context/permissions.ts Adds runtime defense-in-depth rejection for untyped request.tool === 'symbol'.
src/prompts/index.ts Exports TOOL_DESCRIPTIONS and updates symbol description to RESERVED with doc anchor.
src/agents/schema.ts Adds RESERVED_REPO_CONTEXT_TOOLS and rejects reserved tools in validateRepoContext.
scripts/eval-repo-context.ts Standalone runner for harness with --json / --strict reporting and thresholds.
package.json Adds eval:repo_context script entry.
docs/design/ROADMAP.md Reclassifies symbol backend + route detection as deferred-with-trigger items and updates M6 acceptance text.
docs/contracts/REPO_CONTEXT.md Codifies reservation + explicit reopen-the-slot telemetry signal and rejection behavior.
docs/comparison/06-codegraph/COMPARISON.md Adds shipped comparison record and borrow decision write-up.
docs/comparison/06-codegraph/CODEX_RESPONSE.md Captures Codex R0 transcript + postscript with canonical review-round summary.
docs/comparison/06-codegraph/CODEX_BRIEFING.md Adds historical pre-implementation briefing record with banner.
docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md Adds implementation plan with authoritative “Outcome” section capturing final shipped state.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +79 to +85
expect(() =>
intersectPermissions({
agentPermissions: noPerms,
request: req,
projectRoot: PROJECT,
}),
).toThrow(RepoContextError)
Comment on lines +116 to +120
// Unreachable in v0.x: 'glob' / 'grep' / 'read' all set `result` above,
// and 'symbol' is rejected by intersectPermissions before we reach the
// dispatch (see permissions.ts § "Defense-in-depth for the
// reserved-but-not-permissionable slot"). Kept as a typed safety net
// so future tool additions cannot regress the invariant silently.
omerakben added 2 commits May 10, 2026 21:33
…ssion

Untyped callers (JSON-decoded payloads, tests, future call sites with
untrusted input) could reach runRepoContextTool with tool='symbol'.
intersectPermissions correctly threw, but the runner's downstream
event-logging path called describeQuery/readRequestedRootsForLog on the
unrecognized tool, producing query=undefined. appendEvent's schema
validator then threw EventLogError, violating the function's "always
emits exactly one repo_context_searched event" guarantee.

Move the symbol-reservation guard to runner entry. The runner now
rejects any non-runnable tool name (symbol or unknown) with
tool_unavailable BEFORE any other operation. intersectPermissions keeps
its own guard for direct callers; this is defense-in-depth at the
next layer up.

Strengthen the order-of-checks test: assert issues[0].code is
tool_unavailable and tool is symbol, not just RepoContextError thrown.
A regression that moves the guard below tu===undefined would surface
tool_no_permissions and fail the assertion.

Add three runner-level tests covering symbol rejection, no event
emitted on the guard path (proving order), and rejection of arbitrary
unknown tool names.

Closes Copilot PR #21 block-class finding.
The earlier abs.slice(0, abs.lastIndexOf('/')) was POSIX-only. On
Windows, node:path's join produces backslash-separated paths, so
lastIndexOf('/') returns -1 and the derived dir is an invalid prefix.
Fixture setup fails before any tool call, making the harness and
bun run eval:repo_context non-portable.

Use dirname() which honors the platform separator.

Closes Codex PR #21 P2 portability finding.
@omerakben
Copy link
Copy Markdown
Owner Author

Three PR review findings closed in 8cf3726 / 5eaa508:

  1. Copilot block-class (runtime bug) — runner.ts now guards tool=symbol (and any non-runnable tool) at function entry before intersectPermissions and event emission. Untyped callers previously caused appendEvent to throw EventLogError on query=undefined, breaking the "always emits one event" guarantee. Three new runner-level tests cover symbol rejection, the no-event-emitted invariant (proving order), and unknown-tool rejection.

  2. Codex P2 (Windows portability) — harness.ts now uses node:path dirname() instead of POSIX-only abs.slice(0, lastIndexOf('/')). The earlier code returned -1 on Windows where join produces backslashes, breaking fixture setup.

  3. Copilot (weak test assertion) — the "guard runs first" test in tests/repo-context-symbol-reservation.test.ts now asserts issues[0].code === 'tool_unavailable' and tool === 'symbol' instead of only toThrow(RepoContextError). A regression that moves the guard below tu===undefined would surface tool_no_permissions and fail the assertion.

3119 tests pass (+3 new). typecheck clean.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 8cf372665a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +64 to +66
name: 'case-02-usage',
setup: CASE_02_USAGE,
thresholds: { minRecall: 1.0, mustNotTruncate: true, minReturnedPaths: 1 },
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce case-02 saturation guard in strict eval mode

--strict is documented as using the same regression thresholds as tests/evaluation/repo-context/runner.test.ts, but case-02 omits the test’s orderedReturnedPaths.length < 50 assertion. That means the standalone evaluator can report PASS even when the usage query has drifted to the max-results boundary (e.g., returning 50 distinct paths without truncation), which weakens the harness’s ability to catch search-noise regressions outside bun test and can produce misleading experiment outcomes.

Useful? React with 👍 / 👎.

@omerakben omerakben merged commit 36aa2fc into main May 11, 2026
1 check passed
@omerakben omerakben deleted the feat/comparison-codegraph branch May 30, 2026 03:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants