feat: call resolution precision/recall benchmark suite (4.4) #507
carlos-alm merged 3 commits into main
Add tests/benchmarks/resolution/ with hand-annotated fixture projects for JavaScript and TypeScript. Each fixture declares expected call edges in an expected-edges.json manifest. The benchmark runner builds the graph, compares resolved edges against expected, and reports precision/recall per language and per resolution mode (static, receiver-typed). Runs as part of npm test; CI fails if metrics drop below baseline.

Current baselines:
- JS: 100% precision, 60% recall (9/15 edges)
- TS: 100% precision, 69% recall (11/16 edges)

Impact: 43 functions changed, 9 affected
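The comparison step described above can be sketched as a set intersection over edge keys. This is a minimal illustration, not the PR's actual implementation: the edge shape `{ source, target }`, the `edgeKey` helper, and the choice to report 0 when a denominator is empty are all assumptions made for the example.

```javascript
// Hypothetical edge shape: { source: "caller", target: "callee" }.
// The real manifest schema may differ; this only illustrates the arithmetic.
function edgeKey(edge) {
  return `${edge.source}->${edge.target}`;
}

function computeMetrics(resolvedEdges, expectedEdges) {
  const resolved = new Set(resolvedEdges.map(edgeKey));
  const expected = new Set(expectedEdges.map(edgeKey));

  // True positives: edges the resolver produced that the manifest expects.
  const truePositives = [...resolved].filter((key) => expected.has(key));
  // False positives: resolved edges that the manifest does not list.
  const falsePositives = [...resolved].filter((key) => !expected.has(key));
  // False negatives: expected edges the resolver missed.
  const falseNegatives = [...expected].filter((key) => !resolved.has(key));

  return {
    // Convention for this sketch: an empty denominator yields 0.
    precision: resolved.size === 0 ? 0 : truePositives.length / resolved.size,
    recall: expected.size === 0 ? 0 : truePositives.length / expected.size,
    falsePositives,
    falseNegatives,
  };
}
```

With the JS baseline quoted above, 9 correctly resolved edges out of 15 expected and no spurious edges would give precision 9/9 and recall 9/15.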
Claude finished @carlos-alm's task
Reviewing PR: Call Resolution Precision/Recall Benchmark Suite
Conducting comprehensive review following the 5-phase process:
Todo List:

Phase 1: Root Cause Analysis ✓
Problem Being Solved: This PR addresses a critical gap in the codebase's testing infrastructure. Currently, codegraph has no quantitative way to measure how accurately it resolves function calls - a core capability that directly impacts AI agent effectiveness (the stated problem: "AI coding assistants waste tokens re-orienting in large codebases, hallucinate dependencies, and miss blast radius").
Why This Approach: The approach is well-justified:
Risk Assessment: Low risk - this is purely additive testing infrastructure with no changes to core logic.

Phase 2: Configuration Change Detection ✓
EXCELLENT - No configuration changes detected:
This is exactly what we want to see - the PR adds testing without weakening quality gates.

Phase 3: Test integrity verification ✓
STRONG - Test assertions are appropriately strict:
The test structure follows the documented pattern of "ratcheting up" baselines as resolution improves rather than lowering standards.

Phase 4: Code Quality Review ✓
Implementation Quality - STRONG:
Fixtures & Annotations (
Fixture Quality - JS/TS examples:
Schema validation (
Documentation updates:
No Security/Performance Issues Detected:

Phase 5: Backlog Compliance Assessment ✓
Root Cause Analysis
Problem Being Solved: Call resolution accuracy measurement for continuous regression prevention
Backlog Compliance
Critical Concerns
None detected. This is exemplary testing infrastructure.
Final Recommendation
Greptile Summary
This PR introduces a call-resolution precision/recall benchmark suite (
Key changes:
Issues found:
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
```mermaid
sequenceDiagram
    participant Vitest
    participant BenchRunner as resolution-benchmark.test.js
    participant FS as Node FS
    participant Builder as buildGraph()
    participant DB as SQLite (.codegraph/graph.db)
    participant Metrics as computeMetrics()
    Vitest->>BenchRunner: run suite
    BenchRunner->>FS: discoverFixtures() - readdirSync(fixtures/)
    FS-->>BenchRunner: ["javascript", "typescript"]
    loop for each language
        BenchRunner->>FS: copyFixture(lang) → tmp dir
        BenchRunner->>Builder: buildFixtureGraph(tmpDir)
        Builder->>DB: write nodes + edges
        Builder-->>BenchRunner: done
        BenchRunner->>DB: extractResolvedEdges() SELECT edges JOIN nodes
        DB-->>BenchRunner: resolvedEdges[]
        BenchRunner->>FS: read expected-edges.json
        FS-->>BenchRunner: manifest { edges[] }
        BenchRunner->>Metrics: computeMetrics(resolvedEdges, expectedEdges)
        Metrics-->>BenchRunner: { precision, recall, byMode, falsePositives, falseNegatives }
        BenchRunner->>Vitest: assert precision ≥ threshold
        BenchRunner->>Vitest: assert recall ≥ threshold
        BenchRunner->>Vitest: assert staticRecall ≥ threshold
        BenchRunner->>Vitest: assert receiverRecall ≥ threshold
        BenchRunner->>FS: rmSync(tmpDir)
    end
    BenchRunner->>Vitest: afterAll - print summary table
```
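
The diagram shows the metrics step returning a `byMode` breakdown that feeds the `staticRecall` and `receiverRecall` assertions. One way such a breakdown could be computed is sketched below. The edge shape `{ source, target, mode }` and the function name `recallByMode` are hypothetical, chosen only for illustration.

```javascript
// Assumed edge shape: { source, target, mode }, where mode is
// "static" or "receiver-typed" as annotated in the manifest.
function recallByMode(resolvedEdges, expectedEdges) {
  const resolved = new Set(
    resolvedEdges.map((edge) => `${edge.source}->${edge.target}`)
  );

  const byMode = {};
  for (const edge of expectedEdges) {
    // Create the bucket for this mode on first sight.
    if (!byMode[edge.mode]) {
      byMode[edge.mode] = { expected: 0, found: 0, recall: 0 };
    }
    const bucket = byMode[edge.mode];
    bucket.expected += 1;
    if (resolved.has(`${edge.source}->${edge.target}`)) {
      bucket.found += 1;
    }
  }

  // Recall per mode = expected edges found / expected edges in that mode.
  for (const bucket of Object.values(byMode)) {
    bucket.recall = bucket.found / bucket.expected;
  }
  return byMode;
}
```

Splitting recall by mode keeps a regression in one resolution strategy from being masked by strong results in the other.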
- Ratchet TypeScript recall thresholds to measured minus 10pp (recall 0.58, receiverRecall 0.45, staticRecall 0.9) so the CI gate catches regressions
- Remove duplicate formatReport console.log from precision test (afterAll already prints the summary)
- Use withFileTypes in copyFixture to skip subdirectories safely
- Guard discoverFixtures with existsSync to prevent opaque ENOENT at import
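
The two filesystem fixes in this commit can be illustrated with a short sketch. The function names match those in the commit message, but the bodies, parameters, and the `resolution-bench-` temp-directory prefix are assumptions for the example rather than the code that was merged.

```javascript
const fs = require("node:fs");
const os = require("node:os");
const path = require("node:path");

// Return fixture directory names, or an empty list when the fixtures
// directory is absent, so callers see no ENOENT at import time.
function discoverFixtures(fixturesDir) {
  if (!fs.existsSync(fixturesDir)) return [];
  return fs
    .readdirSync(fixturesDir, { withFileTypes: true })
    .filter((entry) => entry.isDirectory())
    .map((entry) => entry.name);
}

// Copy only the regular files of one fixture into a fresh temp directory.
// withFileTypes yields Dirent objects, so subdirectories can be skipped
// without an extra stat call per entry.
function copyFixture(fixtureDir) {
  const tmpDir = fs.mkdtempSync(path.join(os.tmpdir(), "resolution-bench-"));
  for (const entry of fs.readdirSync(fixtureDir, { withFileTypes: true })) {
    if (!entry.isFile()) continue;
    fs.copyFileSync(
      path.join(fixtureDir, entry.name),
      path.join(tmpDir, entry.name)
    );
  }
  return tmpDir;
}
```

The caller is responsible for removing the returned directory, for example with `fs.rmSync(tmpDir, { recursive: true, force: true })`.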
Addressed all review feedback:
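All 4 Greptile review items addressed in 7e2e707:
- P1: TypeScript thresholds tightened (recall 0.58, receiverRecall 0.45, staticRecall 0.9)
- P2: Duplicate formatReport console.log removed from the precision test
- P2: copyFixture uses withFileTypes to skip subdirectories
- P2: discoverFixtures guarded with existsSync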
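
The "measured minus 10pp" rule from the commit message can be written as a small helper. This is a sketch under assumptions: the name `ratchetThreshold` is hypothetical, and rounding down to two decimals is inferred from the TypeScript recall figures (11/16 measured, 0.58 threshold), not confirmed by the PR.

```javascript
// Derive a CI threshold from a measured metric by subtracting a margin
// (default 10 percentage points) and rounding down to two decimals.
// The small epsilon guards against floating-point results such as
// 59.99999999999999 being floored one step too low.
function ratchetThreshold(measured, margin = 0.1) {
  const scaled = (measured - margin) * 100 + 1e-9;
  return Math.max(0, Math.floor(scaled) / 100);
}

// TypeScript overall recall: 11 of 16 expected edges resolved.
const tsRecallThreshold = ratchetThreshold(11 / 16);
console.log(tsRecallThreshold);
```

Rounding down keeps the gate slightly lenient, so ordinary measurement noise does not fail CI while a real drop of more than ten points still does.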
Summary
- tests/benchmarks/resolution/ with hand-annotated fixture projects for JavaScript and TypeScript
- expected-edges.json manifest with resolution mode annotations (static, receiver-typed)
- npm test: CI fails if metrics drop below baseline thresholds
- expected-edges.schema.json
- CONTRIBUTING.md with resolution benchmark docs and instructions for adding new language fixtures
- README.md with links to all benchmarks including resolution precision/recall

Current measured baselines
Missing edges (improvement targets)
JavaScript (6 false negatives):
- this.logger.*() chains: nested receiver dispatch not yet resolved
- this._write(): same-class method self-calls

TypeScript (5 false negatives):
- this.repo.*() / this.serializer.*(): interface-typed field dispatch (Repository<User>, Serializer<User>)

Test plan
- npx vitest run tests/benchmarks/resolution/: 12/12 tests pass
- npm test: full suite passes (1925 tests)
- npx @biomejs/biome check tests/benchmarks/: lint clean (1 warning: TS import type suggestion)
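
For readers unfamiliar with the JavaScript call shapes listed under the missing edges, the snippet below shows both patterns in a small program. The classes `Logger` and `UserService` are hypothetical and are not taken from the PR's fixtures; they only show what a `this.logger.*()` chain and a `this._write()` self-call look like at a call site.

```javascript
class Logger {
  constructor() {
    this.lines = [];
  }

  info(message) {
    // Same-class method self-call: the callee is another method on `this`.
    this._write("INFO", message);
  }

  _write(level, message) {
    this.lines.push(`[${level}] ${message}`);
  }
}

class UserService {
  constructor(logger) {
    this.logger = logger;
  }

  createUser(name) {
    // Nested receiver dispatch: the callee hangs off a field of `this`,
    // so resolving it requires knowing what `this.logger` refers to.
    this.logger.info(`created ${name}`);
    return { name };
  }
}
```

An expected edge for the second pattern would run from `UserService.createUser` to `Logger.info`, which needs the type of the `logger` field to be tracked from the constructor.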