feat(comparison): codegraph head-to-head + B1 symbol-slot governance + B2 eval harness#21
Conversation
Comparison vs ~/Projects/agents/templates/codegraph (@colbymchenry/codegraph v0.7.2; tree-sitter+SQLite code-intelligence indexer). Verdict: code-oz is ahead on category. Two action-bearing borrows + one reclassified deferred-with-trigger; Codex (gpt-5.5 xhigh, thread 019e12ed) returned accept-with-modifications. Codex's load-bearing catch (Q8): the reserved-but-unsupported 'symbol' slot is already contract debt today — RepoContextToolName allows it, runner.ts errors with a generic "unsupported tool". Synthesis converted the original "defer indefinitely" into Option D-reserved with a 4-condition AND telemetry signal locked for the reopen path. Companion IMPLEMENTATION_PLAN.md sequences the four-commit follow-up: B1 contract cleanup, B2 three-case eval harness, B5 reclassification.
…able Closes the contract debt Codex caught in the codegraph comparison (thread 019e12ed Q8): the 'symbol' member of RepoContextToolName lived in the type union, in permission validation, and in error type unions while the runtime rejected it with a generic "unsupported tool" error. That is neither a clean deferral nor a usable feature — anyone reading the contract sees a tool name that exists nowhere downstream. Changes (defense-in-depth, two layers): 1. Config-load (src/agents/schema.ts): RESERVED_REPO_CONTEXT_TOOLS constant + validateRepoContext rejects any agent listing 'symbol' in tool_use.repo_context.tools[] with code schema_invalid_permissions and a rule pointing at REPO_CONTEXT.md § Reservation. 2. Runtime (src/tools/repo-context/permissions.ts): intersectPermissions rejects any direct request whose tool === 'symbol' with code tool_unavailable, even when the request bypasses TypeScript type checking (untyped JSON, untrusted callers). The type-union member in RepoContextToolName is preserved on purpose: the schema slot is callable for backward-compat when the 4-condition AND telemetry signal in REPO_CONTEXT.md § "Reservation and reopen-the- slot signal" fires (high search churn + manifest-cap saturation + phase result-tokens > 200k + downstream VERIFY/REVIEW failure attributable to missed semantic context, on three runs across two repos). Vocabulary updates: - src/prompts/index.ts TOOL_DESCRIPTIONS.symbol describes the slot as RESERVED with the doc anchor. - src/tools/repo-context/runner.ts unsupported-tool branch documents the unreachability and stays as a typed safety net for future tool additions. - docs/contracts/REPO_CONTEXT.md § "Reservation and reopen-the-slot signal" codifies the 4-condition AND. Tests: - tests/agents-permissions.test.ts: flipped existing "symbol allowed in schema" test to assert the RESERVED rejection. - tests/repo-context-symbol-reservation.test.ts: new defense-in-depth coverage at the runtime layer + RESERVED list shape + persona prompt vocabulary check. Test count: 3108 -> 3113. Typecheck clean.
Borrows codegraph's SEARCH_QUALITY_LOOP methodology, narrowed per Codex
Q4 in the codegraph comparison synthesis (thread 019e12ed): three
cases, deterministic, recall@k + bytes/tokens + tool-call counts; no
LLM-judged path in default CI. Pure addition; no contract change.
Cases:
case-01-discovery — given a task prompt, can grep find the
expected auth files? Asserts recall@4 = 1.0
under default caps with one tool call.
case-02-usage — broad symbol query for callers of one
function. Asserts every caller surfaces
under default caps without saturating
maxResults.
case-03-budget-pressure — 30 candidate files, 12 matching, 13 tool
calls (1 grep + 12 reads) against
tightened caps (maxResults=25,
maxBytesPerResult=4096,
maxFilesForNextManifest=10). Asserts
recall ≥ 0.8 floor and total result bytes
< 1.5 MB envelope.
The harness drives runRepoContextTool — the same orchestrator the spine
uses — so metrics reflect the production code path including permission
intersection and event emission.
Wire-up:
tests/evaluation/repo-context/harness.ts — case runner; emits JSON
metrics; throws actionable
error when rg is missing.
tests/evaluation/repo-context/case-*.ts — three case fixtures.
tests/evaluation/repo-context/runner.test.ts — bun-test wrapper so
regressions surface in
`bun test`.
scripts/eval-repo-context.ts — standalone runner;
`bun run eval:repo_context`
emits JSON report and
supports --strict (CI).
package.json — eval:repo_context script.
Test count: 3113 -> 3116 (3 new eval cases). Typecheck clean. Standalone
runner: PASS case-01 / PASS case-02 / PASS case-03 with recall=1.0 across
all three.
Closes B5 reclassification from the codegraph comparison synthesis (Codex thread 019e12ed Q6): framework-aware route detection moves from "no-borrow today" to "deferred-with-trigger" — reopens if a routing/API-surface audit persona enters the company roster (W4 candidate). Until then no orchestrator persona consumes route nodes and the borrow does not earn its rule-20 cost. Replaces the prior "Optional `symbol` LSP integration for repo-context tools (deferred from M6)" line in the W3 section with a new "Deferred-with-trigger items" subsection covering both: - The `symbol` tool backend (Option D-reserved; reopen condition lives in REPO_CONTEXT.md § "Reservation and reopen-the-slot signal"). - The B5 framework-aware route detection (reopen on routing-audit persona). The W3 section's other items (provider integration, LanguagePack, integrations, DSPy MIPRO, concurrent runs) are unchanged.
Three block-push findings from Codex R1 (thread 019e1326): 1. REPO_CONTEXT.md:32-34 stale wording. The bullet "tool_use.repo_context is the only tool_use sub-scope in v0.1" was wrong — M7-M10 added tool_use.write, .execute, .review_request, .debate. The line about 'symbol' being "optional in M6, deferred to W3" contradicted the new RESERVED contract added in commit f351688. Replaced both with accurate text and a forward link to the Reservation section. 2. recall@k metric was not actually computing recall@k. The harness was unioning all returned paths and counting hits — a noisy query returning 50 files including the expected 4 would still score 1.0. Refactored to preserve event-encounter order, deduplicate in order, slice to k, then count. Renamed `distinctReturnedPaths` to `orderedReturnedPaths` so the field name matches the semantics. Added doc comment explaining what the metric catches. 3. case-03 was not actually exercising budget pressure. With 12 matching files and maxResults=25, grep never truncated and the "budget pressure" name was theater. Rebuilt the case with 40 matching files × 3 lines/file = 120 candidate matches against maxResults=25 — truncation now genuinely fires (anyTruncated=true asserted). Replaced the recall floor (which depended on rg's filesystem traversal order, platform-dependent) with a precision metric: every returned path must start with `src/match/`, never `src/decoy/`. Precision-under-truncation is the real regression signal (a tool change that leaks decoys past cap saturation must fail). The standalone runner gained matching `precisionPathPrefix` threshold + `mustTruncate` flag. Side effect: case-01's README contained the trigger word "authenticate", which leaked into the result stream and made recall@4 nondeterministic (rg's parallel traversal sometimes put README.md before src/index.ts in encounter order). Removed the trigger word from README content; recall is now deterministically 1.0 across runs. Verification: bun test 3116 pass / 0 fail; typecheck clean; three consecutive bun-test runs all deterministic; bun run eval:repo_context --strict passes all three cases.
…ings
Four findings from Codex R2 (thread 019e1330), all doc/parity drift
from the R1 behavioral fixes:
Area 3 — case-03 stale prose:
- tests/evaluation/repo-context/harness.ts file-header: case-03
bullet still said "selected-path counts and result bytes stay
below caps while keeping recall ≥ threshold." Rewrote to reflect
the precision-under-truncation contract and the explicit
no-recall-assertion rationale.
- tests/evaluation/repo-context/runner.test.ts file-header: case-03
bullet still claimed "recall must stay ≥ 0.8 under tight caps."
Rewrote to describe the truncation + envelope + precision
assertions and to call out platform-dependent traversal order
as the reason recall is excluded.
- tests/evaluation/repo-context/case-03-budget-pressure.ts:44
matching() helper comment said "Two TARGET_SYMBOL occurrences per
file ⇒ 80 grep match-lines for 40 files." The fixture body
actually emits three matches per file (120 total). Corrected.
- tests/evaluation/repo-context/case-03-budget-pressure.ts:59-62
described a k=10 recall window that no longer matches
expectedAll (k = 40 after the R1 rebuild). Rewrote to point at
the file-header explanation.
Area 2 — minor recall@k doc precision:
- harness.ts interface doc said "k = expectedPaths.length"; the
code uses `new Set(expectedPaths).size` to keep the metric
well-defined for fixtures that might list a path twice. Tightened
the doc to match the code.
Area 4 — standalone runner parity:
- scripts/eval-repo-context.ts: new `minReturnedPaths` threshold
mirrors the bun-test wrapper's
`expect(orderedReturnedPaths.length).toBeGreaterThan(0)` sanity
check. Wired `minReturnedPaths: 1` into all three case threshold
entries so `bun run eval:repo_context --strict` reaches the same
pass/fail conclusion as `bun test`.
Areas 1 and 5 returned `push` from R2 directly and need no changes.
Verification: bun test 3116 pass / 0 fail; typecheck clean; standalone
runner --strict reports PASS on all three cases.
…cript) Three drift findings from Codex R3 (thread 019e141b), all in companion docs that the prior rounds did not audit: 1. docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md described case-03 as "30 candidate files; pattern matches 12; recall@k ≥ 0.8; selected-path count assertion" and proposed a new schema_reserved_tool error code. The shipped contract is 40 matching files × 3 match-lines/file = 120 candidate matches; asserts truncation + envelope + precision (no recall assertion; no selected-path assertion); reuses schema_invalid_permissions and tool_unavailable rather than minting a new code. Resolution: marked the document as a historical planning snapshot in its header, added an "Outcome (post-R3 final state)" section at the bottom that captures every divergence (error-code reuse, case-03 redesign, case-01 README fix, standalone runner parity), plus a review-round summary table. 2. docs/comparison/06-codegraph/CODEX_RESPONSE.md captured the R0 pre-implementation Q&A verbatim. Several specifics evolved during R1+R2 — particularly the B2 case-03 contract and the recall@k metric semantics. The transcript still read as the current contract. Resolution: marked the document as a historical R0 transcript in its header, added a "Postscript — implementation evolution (R1 / R2 / R3)" section at the bottom enumerating each divergence with commit anchors. 3. tests/evaluation/repo-context/harness.ts inline comment at the recall@k computation said "k = expectedPaths.length" while the code uses `new Set(expectedPaths).size` (and the public JSDoc was already corrected in R2). The inline comment was the last drifted surface. Resolution: rewrote the inline comment to match the code and explain why the dedup'd set size is the right denominator. Verification: bun test 3116 pass / 0 fail; typecheck clean. No code changes in this commit beyond a single comment block — only the companion docs that ship in the PR were updated. Closes Codex R3 (thread 019e141b). Next dispatch should return push.
…ROADMAP M6 line) Three drift findings from Codex R4 (thread 019e1421), all in companion/planning docs that the R3 audit narrowed past: 1. docs/comparison/06-codegraph/COMPARISON.md:33 (Symbol-aware queries row) still described the post-implementation state as "runner.ts errors 'unsupported tool symbol'" with a W3 ROADMAP reference to "Optional LSP integration." That misrepresents the shipped contract. Rewrote the cell to name the actual shipped rejection points: schema.ts config-load via schema_invalid_permissions + permissions.ts runtime via tool_unavailable + the 4-condition AND telemetry signal in REPO_CONTEXT.md § Reservation. 2. docs/comparison/06-codegraph/COMPARISON.md:54 (G1 narrative) still referenced "W3 ROADMAP defers it as Optional LSP integration" as if that line was still in the roadmap. The roadmap entry was replaced in commit 366dd9e. Rewrote G1 to point at the post- resolution location (REPO_CONTEXT.md § Reservation) and the shipped Option D-reserved decision. 3. docs/comparison/06-codegraph/COMPARISON.md:88 and :115 described the B1 cleanup as "permissions.ts config-load rejection with typed tool_unavailable." That conflated config-load (schema.ts) and runtime (permissions.ts) layers. Rewrote both to name the four shipped components: RESERVED_REPO_CONTEXT_TOOLS constant + validateRepoContext rejection (schema_invalid_permissions) + intersectPermissions runtime guard (tool_unavailable) + RepoContextToolName JSDoc + REPO_CONTEXT.md § Reservation. 4. docs/comparison/06-codegraph/COMPARISON.md:133 (References section) still listed "code-oz W3 LSP line" pointing at the deleted ROADMAP wording. Replaced with the post-resolution reference: ROADMAP W3 § "Deferred-with-trigger items" added in commit 366dd9e. 5. docs/design/ROADMAP.md:135 M6 acceptance bullet still said "symbol optional, callable by PLAN under tool_use.repo_context." The W3 section was corrected in commit 366dd9e but the M6 acceptance criterion remained stale. Rewrote to describe the reserved-but-not-permissionable contract with a forward link to REPO_CONTEXT.md § Reservation. Also added a "Document status — synthesis complete and shipped" banner to COMPARISON.md's frontmatter region so future readers know the document reflects post-implementation state (matching the banner pattern already on IMPLEMENTATION_PLAN.md and CODEX_RESPONSE.md). Frontmatter gained explicit review-rounds field tracking R0+R1+R2+R3+R4 thread IDs. Verification: bun test 3116 pass / 0 fail; typecheck clean; final grep audit shows the only remaining "Optional symbol LSP" hits are in M6-era historical docs (MERGE_PLAN.md, SESSION_M6_KICKOFF.md — out of scope for this PR) and the ROADMAP entry that intentionally references the prior wording when explaining the replacement. Closes Codex R4 (thread 019e1421). Next dispatch should return push.
Closes Codex R5 (thread 019e142d): the briefing file's lines 21-23 and 41-43 framed the pre-implementation symbol-slot state and the old "Optional symbol LSP integration" ROADMAP wording as current. The file was the last companion doc without an explicit historical-snapshot banner; IMPLEMENTATION_PLAN.md and CODEX_RESPONSE.md already received the same treatment in earlier rounds (commits 63721e7, 15d5d43). Resolution mirrors the established pattern: - Frontmatter `status: dispatched` changed to `historical (pre- implementation dispatch; superseded by R1/R2/R3/R4/R5 — see banner)`. - Frontmatter companion list extended to point at CODEX_RESPONSE.md and IMPLEMENTATION_PLAN.md so readers see the chain. - Blockquote banner added immediately after frontmatter naming the pre-implementation status, identifying the two stale spots inside (symbol-slot framing + Optional LSP wording), and pointing readers at the post-implementation locations (IMPLEMENTATION_PLAN § Outcome and REPO_CONTEXT § Reservation). No body content changed; the historical artifact is preserved verbatim for the audit trail. Verification: bun test 3116 pass / 0 fail; typecheck clean.
Codex R6 (thread 019e1436) caught a recursive metadata problem: each fix-first closure round wrote frontmatter/banner copy that pinned a specific round count, which the next round then flagged as stale. R6 specifically flagged that COMPARISON.md said "4 review rounds → R4 push", IMPLEMENTATION_PLAN.md said "Codex R3 push", and CODEX_RESPONSE.md status said "superseded by R1/R2/R3" — all stale once R4 and R5 fix-first closures landed. Resolution — break the recursion by: 1. Moving the canonical round-by-round table to a single location (CODEX_RESPONSE.md § Postscript "Review round summary"). IMPL_PLAN mirrors it. All other companion docs point at that table instead of duplicating it. 2. Stabilizing banner and frontmatter language to refer to "a multi- round Codex review process" or "subsequent rounds" instead of a specific count. A future round that fix-firsts metadata can update the canonical table without invalidating banner copy. 3. Updating the canonical table to include R0, R1, R2, R3, R4, R5, R6 (this commit), and a `next` row marking the target convergence. Files touched: - COMPARISON.md frontmatter (status + review-rounds; references table-as-source-of-truth) and banner (stable-language). - IMPLEMENTATION_PLAN.md frontmatter and banner (stable-language); Outcome § "Codex review round summary" extended to R4/R5/R6 with commit anchors per row. - CODEX_RESPONSE.md frontmatter and banner; Postscript table extended to R4/R5/R6 + `next` row. - CODEX_BRIEFING.md frontmatter and banner. No code changes. No technical contract changes. The shipped implementation contract from commits f351688 + bf0c12d + 366dd9e + b41b3f5 + 28ee554 is unchanged. Verification: bun test 3116 pass / 0 fail; typecheck clean. Closes Codex R6. Next dispatch should return push.
Codex R7 (thread 019e143d) caught three sites where R6's stabilization sweep missed hardcoded round counts: 1. CODEX_RESPONSE.md:170 Postscript heading still said "(R1 / R2 / R3)" — the original R3-era title; line 172 still said "three subsequent Codex review rounds." Rewrote to drop the count from both heading and prose; the canonical round-by-round table at the bottom of the Postscript is the single source of truth. 2. COMPARISON.md:8 review-rounds frontmatter still hardcoded "R1..R6 sequential fix-first closures" with the explicit thread-ID list. Replaced with a pointer to the canonical table in CODEX_RESPONSE.md § Postscript. 3. IMPLEMENTATION_PLAN.md:9 review-rounds frontmatter said "Outcome § Codex review round summary" is the canonical table, and the R6 entry in that same table claimed it had migrated round-counts "to this table as the single source of truth." That conflicted with R6's stated fix (which named CODEX_RESPONSE.md § Postscript as canonical). Updated both references to point at CODEX_RESPONSE.md and clarify that the IMPL_PLAN table is a mirror. Both tables (CODEX_RESPONSE Postscript + IMPL_PLAN Outcome) now include R7 with the (current commit) anchor. Stable language used elsewhere: "multi-round Codex review process", "subsequent rounds", "the canonical round-by-round table" — none pin a specific count. The only remaining round-count occurrences are inside the canonical tables themselves, which is correct (the table IS the source of truth; round numbers belong there). Verification: bun test 3116 pass / 0 fail; typecheck clean; grep audit confirms no hardcoded "three/four/five rounds" or "R1..Rn" range copy outside canonical tables. Closes Codex R7. Next dispatch should return push.
Codex R8 (thread 019e1442) flagged six remaining sites in IMPLEMENTATION_PLAN.md where round-count copy survived earlier sweeps. The planning-body section preserved the original "two-round target" plan; the Outcome section had a "post-R3 final state" heading; the test-count table row was labeled "After B2 + R1 + R2". All have been replaced with stable language: - IMPLEMENTATION_PLAN.md:29 "two Codex review rounds (R1 + R2 minimum) reaching push" → "a Codex review process culminating in push" + footnote that the original plan called for two rounds and the canonical table records what actually happened. - IMPLEMENTATION_PLAN.md:95-102 review-rounds section rewritten: "dispatch Codex R1", "dispatch R2", "Two-round target. R1+R2 must converge" → generic "dispatch a Codex review", "dispatch a follow-up review", "original target: minimum two-round convergence; canonical table records actual history." - IMPLEMENTATION_PLAN.md:112 "## Outcome (post-R3 final state)" → "## Outcome (post-implementation final state)". - IMPLEMENTATION_PLAN.md:120 "### B2 — case-03 redesign (R1 → R2 fixes)" → "### B2 — case-03 redesign (post-debate fixes)". - IMPLEMENTATION_PLAN.md:124 body prose "Codex R1 finding 3 caught" → "A post-debate review (see canonical table) caught" to remove the hardcoded round number. - IMPLEMENTATION_PLAN.md:147 test-count table "After B2 + R1 + R2" → "After B2 + post-debate fixes". Note: specific-round attributions inside the Outcome and Postscript sections (e.g., "Codex R1 finding 3" or "Codex R1 (thread 019e1326) caught") are preserved because those sections are explicitly the round-by-round history. The stability rule applies to banners, frontmatter, planning-body, and headings — not to body prose inside the round-summary documentation itself. Verification: bun test 3116 pass / 0 fail; typecheck clean. The canonical round-by-round table in CODEX_RESPONSE.md § Postscript remains the single source of truth (mirrored in IMPL_PLAN Outcome). Closes Codex R8. Next dispatch should return push.
Trailing administrative commit closing the R8 paper-trail. The R8 content fix landed in 13d2206; this commit extends the canonical round-by-round tables in CODEX_RESPONSE.md § Postscript and IMPLEMENTATION_PLAN.md § Outcome with the R8 row + commit anchor. No code or contract changes. Two-line diff in each table.
Codex R9 (thread 019e1447) returned push. The shipped contract is intact (RESERVED_REPO_CONTEXT_TOOLS + validateRepoContext schema_invalid_permissions + intersectPermissions tool_unavailable + REPO_CONTEXT.md § Reservation + 4-condition AND telemetry signal + case-03 40-files-3-lines-precision-not-recall contract), no remaining stale round-count copy outside the canonical tables, and the metadata-recursion sweep that consumed R6-R8 has converged. This commit extends both canonical round-by-round tables (CODEX_RESPONSE.md § Postscript and IMPLEMENTATION_PLAN.md § Outcome) with the R9 push row and removes the "(next)" pending row. The branch is now ready for PR. Final state: - 14 commits on top of origin/main - bun test: 3116 pass / 0 fail - bun run typecheck: clean - bun run eval:repo_context --strict: PASS on all three cases
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (20)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3f9e0e8c87
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| const project = await mkdtemp(join(tmpdir(), `codeoz-eval-${caseName}-`)) | ||
| for (const [rel, content] of setup.files) { | ||
| const abs = join(project, rel) | ||
| const dir = abs.slice(0, abs.lastIndexOf('/')) |
There was a problem hiding this comment.
Use path.dirname when creating fixture directories
Deriving dir via abs.slice(0, abs.lastIndexOf('/')) is POSIX-only and breaks on Windows, where join() produces backslash-separated paths (so lastIndexOf('/') is -1). In that environment this computes an invalid parent path and can fail before fixtures are written, which makes both tests/evaluation/repo-context/runner.test.ts and bun run eval:repo_context non-portable. Use dirname(abs) (or equivalent path API) to make fixture setup work across platforms.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Code Review
This pull request implements several improvements derived from a comparison with the codegraph tool. Key changes include tightening the governance of the symbol tool slot by marking it as explicitly reserved and non-permissionable, and introducing a new three-case deterministic evaluation harness for repo-context tools (glob, grep, read). The harness measures discovery, usage, and budget pressure metrics like recall@k and byte envelopes. Feedback focused on improving the portability of the evaluation harness by replacing brittle manual path manipulation with standard library functions to ensure compatibility across different operating systems.
|
|
||
| import { mkdtemp, mkdir, writeFile } from 'node:fs/promises' | ||
| import { tmpdir } from 'node:os' | ||
| import { join } from 'node:path' |
| const project = await mkdtemp(join(tmpdir(), `codeoz-eval-${caseName}-`)) | ||
| for (const [rel, content] of setup.files) { | ||
| const abs = join(project, rel) | ||
| const dir = abs.slice(0, abs.lastIndexOf('/')) |
There was a problem hiding this comment.
There was a problem hiding this comment.
Pull request overview
This PR tightens governance around the reserved repo_context tool slot (symbol) to eliminate contract debt, and adds a deterministic evaluation harness to regression-test repo_context retrieval quality (recall@k + budget metrics) without introducing new authority surfaces.
Changes:
- Enforce
'symbol'as reserved-but-not-permissionable at both config-load (validateRepoContext) and runtime (intersectPermissions), and align prompt vocabulary + docs to the reservation contract. - Add a 3-case deterministic
repo_contexteval harness (discovery/usage/budget-pressure) used in CI (bun test) and a standalone runner (bun run eval:repo_context). - Update roadmap + comparison documentation to reflect the post-synthesis reservation/reopen trigger contract and the codegraph head-to-head record.
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/repo-context-symbol-reservation.test.ts | Adds defense-in-depth tests for runtime rejection of tool === 'symbol' + prompt vocabulary checks. |
| tests/evaluation/repo-context/runner.test.ts | CI wrapper that runs the 3-case eval harness under bun test (skips/branches when rg missing). |
| tests/evaluation/repo-context/harness.ts | Core harness: builds temp repo, drives runRepoContextTool, derives recall@k + budget metrics from events.jsonl. |
| tests/evaluation/repo-context/case-01-discovery.ts | Fixture for broad discovery via grep; asserts recall@k and basic invariants. |
| tests/evaluation/repo-context/case-02-usage.ts | Fixture for call-site discovery; asserts recall and no truncation. |
| tests/evaluation/repo-context/case-03-budget-pressure.ts | Fixture that forces cap truncation and asserts envelope + precision-under-truncation. |
| tests/agents-permissions.test.ts | Updates schema validation test to assert 'symbol' is rejected at config-load as RESERVED. |
| src/tools/repo-context/types.ts | Documents that 'symbol' remains in the tool-name union but is reserved in v0.x. |
| src/tools/repo-context/runner.ts | Adds rationale comment about unsupported-tool branch being unreachable in v0.x. |
| src/tools/repo-context/permissions.ts | Adds runtime defense-in-depth rejection for untyped request.tool === 'symbol'. |
| src/prompts/index.ts | Exports TOOL_DESCRIPTIONS and updates symbol description to RESERVED with doc anchor. |
| src/agents/schema.ts | Adds RESERVED_REPO_CONTEXT_TOOLS and rejects reserved tools in validateRepoContext. |
| scripts/eval-repo-context.ts | Standalone runner for harness with --json / --strict reporting and thresholds. |
| package.json | Adds eval:repo_context script entry. |
| docs/design/ROADMAP.md | Reclassifies symbol backend + route detection as deferred-with-trigger items and updates M6 acceptance text. |
| docs/contracts/REPO_CONTEXT.md | Codifies reservation + explicit reopen-the-slot telemetry signal and rejection behavior. |
| docs/comparison/06-codegraph/COMPARISON.md | Adds shipped comparison record and borrow decision write-up. |
| docs/comparison/06-codegraph/CODEX_RESPONSE.md | Captures Codex R0 transcript + postscript with canonical review-round summary. |
| docs/comparison/06-codegraph/CODEX_BRIEFING.md | Adds historical pre-implementation briefing record with banner. |
| docs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md | Adds implementation plan with authoritative “Outcome” section capturing final shipped state. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| expect(() => | ||
| intersectPermissions({ | ||
| agentPermissions: noPerms, | ||
| request: req, | ||
| projectRoot: PROJECT, | ||
| }), | ||
| ).toThrow(RepoContextError) |
| // Unreachable in v0.x: 'glob' / 'grep' / 'read' all set `result` above, | ||
| // and 'symbol' is rejected by intersectPermissions before we reach the | ||
| // dispatch (see permissions.ts § "Defense-in-depth for the | ||
| // reserved-but-not-permissionable slot"). Kept as a typed safety net | ||
| // so future tool additions cannot regress the invariant silently. |
…ssion Untyped callers (JSON-decoded payloads, tests, future call sites with untrusted input) could reach runRepoContextTool with tool='symbol'. intersectPermissions correctly threw, but the runner's downstream event-logging path called describeQuery/readRequestedRootsForLog on the unrecognized tool, producing query=undefined. appendEvent's schema validator then threw EventLogError, violating the function's "always emits exactly one repo_context_searched event" guarantee. Move the symbol-reservation guard to runner entry. The runner now rejects any non-runnable tool name (symbol or unknown) with tool_unavailable BEFORE any other operation. intersectPermissions keeps its own guard for direct callers; this is defense-in-depth at the next layer up. Strengthen the order-of-checks test: assert issues[0].code is tool_unavailable and tool is symbol, not just RepoContextError thrown. A regression that moves the guard below tu===undefined would surface tool_no_permissions and fail the assertion. Add three runner-level tests covering symbol rejection, no event emitted on the guard path (proving order), and rejection of arbitrary unknown tool names. Closes Copilot PR #21 block-class finding.
The earlier abs.slice(0, abs.lastIndexOf('/')) was POSIX-only. On
Windows, node:path's join produces backslash-separated paths, so
lastIndexOf('/') returns -1 and the derived dir is an invalid prefix.
Fixture setup fails before any tool call, making the harness and
bun run eval:repo_context non-portable.
Use dirname() which honors the platform separator.
Closes Codex PR #21 P2 portability finding.
|
Three PR review findings closed in 8cf3726 / 5eaa508:
3119 tests pass (+3 new). typecheck clean. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8cf372665a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| name: 'case-02-usage', | ||
| setup: CASE_02_USAGE, | ||
| thresholds: { minRecall: 1.0, mustNotTruncate: true, minReturnedPaths: 1 }, |
There was a problem hiding this comment.
Enforce case-02 saturation guard in strict eval mode
--strict is documented as using the same regression thresholds as tests/evaluation/repo-context/runner.test.ts, but case-02 omits the test’s orderedReturnedPaths.length < 50 assertion. That means the standalone evaluator can report PASS even when the usage query has drifted to the max-results boundary (e.g., returning 50 distinct paths without truncation), which weakens the harness’s ability to catch search-noise regressions outside bun test and can produce misleading experiment outcomes.
Useful? React with 👍 / 👎.
Summary
Head-to-head comparison of code-oz vs
@colbymchenry/codegraphv0.7.2 (tree-sitter + SQLite code-intelligence indexer for Claude Code) plus implementation of the two action-bearing borrows identified by Codex.The single most consequential outcome was Codex's pre-implementation Q8 catch: code-oz's reserved-but-unsupported
'symbol'slot inRepoContextToolNamewas already contract debt today, not harmless optionality. The type union allowed it,permissions.tsvalidated it, error types referenced it, butrunner.tsrejected it with a generic "unsupported tool" error. Defense-in-depth wiring across config-load (src/agents/schema.ts) and runtime (src/tools/repo-context/permissions.ts) closes that debt while preserving the schema slot for a telemetry-gated reopen.Comparison verdict
YES, code-oz is ahead on category — codegraph is a code-intelligence indexer for chat-tool agents, not an SDLC orchestrator. 90 percent of codegraph's surface is structurally orthogonal. The 10 percent overlap is exactly the
tool_use.repo_contextpermission scope, which the comparison contracts and the implementation closes.What's in this PR
B1 — Symbol slot contract cleanup (Option D-reserved)
RESERVED_REPO_CONTEXT_TOOLS = ['symbol']constant insrc/agents/schema.tsvalidateRepoContextrejects'symbol'intools[]at config-load withschema_invalid_permissionsand a rule anchored toREPO_CONTEXT.md § ReservationintersectPermissionsrejects any directtool === 'symbol'request at runtime withtool_unavailable(defense-in-depth for type-bypassed callers like JSON-decoded payloads)RepoContextToolNameJSDoc +TOOL_DESCRIPTIONS.symbolpersona vocabulary updated to RESERVED status with doc anchordocs/contracts/REPO_CONTEXT.md§ "Reservation and reopen-the-slot signal" codifies the 4-condition AND telemetry signal that would reopen the slot: high search churn + manifest-cap saturation + phase result-tokens > 200k + downstream VERIFY/REVIEW failure attributable to missed semantic context, on three runs across two reposB2 — Three-case deterministic eval harness
tests/evaluation/repo-context/harness.tsdrivesrunRepoContextTool(production code path) and computes recall@k + bytes/tokens + tool-call counts + truncation flag from the audit logmaxResults=25; assertsanyTruncated === true, per-call byte envelope, and precision under truncation (every returned path starts withsrc/match/, no decoy leakage)scripts/eval-repo-context.tsstandalone runner with--json/--strictflagspackage.jsonscript:bun run eval:repo_contextB5 — Framework-aware route detection reclassified
docs/design/ROADMAP.mdW3 section: new "Deferred-with-trigger items" subsection covering both the symbol-slot reopen condition and B5 (reopens if a routing/API-surface audit persona enters the company roster)symbolLSP integration (deferred from M6)" line replaced; M6 acceptance bullet updatedDocumentation
docs/comparison/06-codegraph/COMPARISON.md— full head-to-head, borrow ranking, decision sectiondocs/comparison/06-codegraph/CODEX_BRIEFING.md— pre-implementation dispatch (historical)docs/comparison/06-codegraph/CODEX_RESPONSE.md— R0 transcript + Postscript with canonical review-round tabledocs/comparison/06-codegraph/IMPLEMENTATION_PLAN.md— planning record + Outcome sectionVerification
bun test: 3116 pass / 0 fail / 2 skip (3108 baseline + 8 new). Three consecutive runs deterministicbun run typecheck: cleanbun run eval:repo_context --strict: PASS on all three casesCross-model peer review
Multi-round Codex review process (canonical table in
docs/comparison/06-codegraph/CODEX_RESPONSE.md § Postscript):019e12edaccept-with-modifications — pre-impl debate; flagged contract debt in Q8019e1326fix-first — 3 block-push (REPO_CONTEXT.md drift, recall@k metric bug, case-03 not exercising budget pressure)019e1330fix-first — 4 doc/parity drift019e141bfix-first — 3 doc-drift in companion files019e1421fix-first — 3 doc-drift sweep (COMPARISON.md + ROADMAP M6)019e142dfix-first — CODEX_BRIEFING.md missing historical banner019e1436fix-first — metadata-recursion: stabilize round-count language019e143dfix-first — 3 residual hardcoded round-count sites019e1442fix-first — 6 residual round-count sites in planning-body019e1447push — convergenceTest plan
bun testshows 3116+ pass / 0 failbun run typecheckcleanbun run eval:repo_context --strictPASS on all three casestools: ['glob', 'grep', 'read', 'symbol']in an agent fixture — config-load should reject with the rule string pointing atREPO_CONTEXT.md § Reservation