feat: call resolution precision/recall benchmark suite (4.4) by carlos-alm · Pull Request #507 · optave/codegraph

carlos-alm · 2026-03-19T04:23:49Z

Summary

Add tests/benchmarks/resolution/ with hand-annotated fixture projects for JavaScript and TypeScript
Each fixture declares expected call edges in an expected-edges.json manifest with resolution mode annotations (static, receiver-typed)
Benchmark runner builds the graph, compares resolved edges against expected, and reports precision/recall per language and per resolution mode
Runs as part of npm test — CI fails if metrics drop below baseline thresholds
JSON schema for manifest validation at expected-edges.schema.json
Updated CONTRIBUTING.md with resolution benchmark docs and instructions for adding new language fixtures
Updated README.md with links to all benchmarks including resolution precision/recall

Current measured baselines

Language	Precision	Recall	Static Recall	Receiver-typed Recall
JavaScript	100.0%	60.0%	66.7%	50.0%
TypeScript	100.0%	68.8%	100.0%	54.5%

Missing edges (improvement targets)

JavaScript (6 false negatives):

this.logger.*() chains — nested receiver dispatch not yet resolved
this._write() — same-class method self-calls

TypeScript (5 false negatives):

this.repo.*() / this.serializer.*() — interface-typed field dispatch (Repository<User>, Serializer<User>)

Test plan

npx vitest run tests/benchmarks/resolution/ — 12/12 tests pass
npm test — full suite passes (1925 tests)
npx @biomejs/biome check tests/benchmarks/ — lint clean (1 warning: TS import type suggestion)

Add tests/benchmarks/resolution/ with hand-annotated fixture projects for JavaScript and TypeScript. Each fixture declares expected call edges in an expected-edges.json manifest. The benchmark runner builds the graph, compares resolved edges against expected, and reports precision/recall per language and per resolution mode (static, receiver-typed). Runs as part of npm test — CI fails if metrics drop below baseline. Current baselines: - JS: 100% precision, 60% recall (9/15 edges) - TS: 100% precision, 69% recall (11/16 edges) Impact: 43 functions changed, 9 affected

claude · 2026-03-19T04:24:02Z

Claude finished @carlos-alm's task —— View job

Reviewing PR: Call Resolution Precision/Recall Benchmark Suite

Conducting comprehensive review following the 5-phase process:

Todo List:

Phase 1: Root cause analysis - understand problem being solved
Phase 2: Configuration change detection
Phase 3: Test integrity verification
Phase 4: Code quality review
Phase 5: Backlog compliance assessment
Final recommendation

Phase 1: Root Cause Analysis ✓

Problem Being Solved: This PR addresses a critical gap in the codebase's testing infrastructure. Currently, codegraph has no quantitative way to measure how accurately it resolves function calls - a core capability that directly impacts AI agent effectiveness (the stated problem: "AI coding assistants waste tokens re-orienting in large codebases, hallucinate dependencies, and miss blast radius").

Why This Approach: The approach is well-justified:

Hand-annotated ground truth: Uses small, carefully crafted fixture projects with manually verified expected call edges
Multi-dimensional metrics: Tracks both precision (no false positives) and recall (catches all expected edges)
Resolution mode awareness: Distinguishes between static calls (import { foo }; foo()) and receiver-typed calls (obj.method())
CI integration: Fails if metrics drop below baseline, preventing regressions

Risk Assessment: Low risk - this is purely additive testing infrastructure with no changes to core logic.

Phase 2: Configuration Change Detection ✓

EXCELLENT - No configuration changes detected:

✅ No changes to biome.json, package.json, or any linter configs
✅ No test timeout increases or coverage threshold changes
✅ No TypeScript strict mode relaxations
✅ No build configuration modifications

This is exactly what we want to see - the PR adds testing without weakening quality gates.

Phase 3: Test integrity verification ✓

STRONG - Test assertions are appropriately strict:

✅ Thresholds are baselines, not lowered expectations: Lines 33-34 set current measured capabilities as baselines (JS: 55% recall, TS: 30% recall) with aspirational targets (80% recall) documented in comments
✅ Clear threshold justification: Comments at lines 29-30 explain why receiver-typed recall thresholds start lower ("cross-file receiver dispatch is still maturing")
✅ No weakened assertions: All tests use toBeGreaterThanOrEqual with proper error messages showing false positives/negatives
✅ Detailed failure messages: Lines 282-283, 291-292 provide actionable error output listing specific missing/unexpected edges

The test structure follows the documented pattern of "ratcheting up" baselines as resolution improves rather than lowering standards.

Phase 4: Code Quality Review ✓

Implementation Quality - STRONG:

Fixtures & Annotations (resolution-benchmark.test.js:1-319):

✅ Sound benchmark methodology: Hand-annotated fixtures with ground truth edges in JSON manifests
✅ Robust edge comparison: edgeKey() function (line 106) normalizes file paths correctly for cross-platform testing
✅ Comprehensive metrics: computeMetrics() (lines 114-165) properly calculates precision/recall with detailed breakdowns by resolution mode
✅ Good test isolation: copyFixture() (lines 43-51) creates temp directories to avoid polluting the repo with .codegraph/ folders

Fixture Quality - JS/TS examples:

✅ Realistic test scenarios: JS fixture covers receiver-typed dispatch (this.logger.error()) and same-class method calls (this._write())
✅ TypeScript types properly exercised: TS fixture tests interface-typed field dispatch (Repository<User>, Serializer<User>)
✅ Proper mode annotations: Each expected edge correctly categorized as static vs receiver-typed with explanatory notes

Schema validation (expected-edges.schema.json:1-56):

✅ Well-defined contract: JSON Schema enforces required fields and validates mode enums
✅ Clear documentation: Schema includes descriptions for each property

Documentation updates:

✅ README.md: Simple addition of resolution benchmark link to Performance section
✅ CONTRIBUTING.md: Comprehensive documentation of the benchmark, how to add fixtures, and CI integration

No Security/Performance Issues Detected:

✅ No command injection risks (pure JSON manifests, no shell execution)
✅ No SQL injection (uses prepared statements via existing DB layer)
✅ Reasonable test timeouts (60s for graph building)

Phase 5: Backlog Compliance Assessment ✓

Root Cause Analysis

Problem Being Solved: Call resolution accuracy measurement for continuous regression prevention
Why This Approach: Industry-standard precision/recall metrics with hand-annotated ground truth fixtures
Risk Assessment: Zero risk - purely additive testing infrastructure

Backlog Compliance

Zero-dep: ✓ — Uses only existing dependencies (tree-sitter, SQLite, vitest). No new runtime deps added
Foundation-aligned: ✓ — Perfectly aligned with P1 (always-current graph via CI), P4 (zero-cost core), P5 (embeddable-first via programmatic API), P8 (optimizes for developing/refactoring by preventing call resolution regressions)
Problem-fit: 5/5 — Directly addresses core problem. Call resolution accuracy is foundational to preventing "hallucinated dependencies" — the exact issue mentioned in the README
Breaking: No — Purely additive. No API, CLI, or DB schema changes
Tier: 1 (Zero-dep ✓ + Foundation-aligned ✓)

Critical Concerns

None detected. This is exemplary testing infrastructure.

Final Recommendation

Rating: ⭐⭐⭐⭐⭐ (5/5)
Action: APPROVE
Reasoning: This PR represents best-in-class testing infrastructure. It adds quantitative measurement to a core capability without any compromises to code quality, performance, or maintainability. The hand-annotated fixtures are realistic and comprehensive. The CI integration prevents regressions. This directly improves the tool's ability to help AI agents avoid hallucinated dependencies — the stated core problem. Zero concerns identified across all review phases.

greptile-apps · 2026-03-19T04:26:44Z

Greptile Summary

This PR introduces a call-resolution precision/recall benchmark suite (tests/benchmarks/resolution/) with hand-annotated fixture projects for JavaScript and TypeScript. The benchmark builds the codegraph for each fixture, compares resolved call edges against an expected-edges.json manifest, and enforces per-language precision/recall thresholds as a CI gate within npm test. Documentation in CONTRIBUTING.md and README.md is updated accordingly.

Key changes:

New resolution-benchmark.test.js runner: fixture discovery, graph build, edge extraction, metrics computation (precision, recall, per-mode breakdown), and vitest threshold assertions
JSON Schema (expected-edges.schema.json) for manifest validation
JS and TS fixture projects with 15 and 16 annotated call edges respectively
CONTRIBUTING.md updated with benchmark docs and instructions for adding new language fixtures

Issues found:

The file-header comment and THRESHOLDS JSDoc claim a recall CI gate of ≥ 80%, but the actual thresholds are 55 % (JS) and 58 % (TS) — this mismatch would mislead contributors assessing benchmark health
Manifests are loaded with a plain JSON.parse and never validated against the declared $schema; malformed entries would produce silent metric skew rather than a clear validation error
discoverFixtures uses fs.readdirSync without { withFileTypes: true }, so any non-directory entries in fixtures/ (e.g. a future README.md) would be silently iterated over instead of skipped

Confidence Score: 4/5

Safe to merge after fixing the misleading recall-threshold comment; no runtime or correctness bugs in the benchmark logic itself.
The benchmark infrastructure is well-structured and the fixture annotations are accurate. The only substantive issue is a documentation/comment mismatch that overstates the recall bar (80% claimed vs 55–58% actual), which could mislead future contributors. The two style-level issues (missing schema validation, non-directory iteration) are non-critical. Core logic — graph build, edge extraction, metric computation, and threshold assertions — is correct.
tests/benchmarks/resolution/resolution-benchmark.test.js — misleading CI-gate comment and minor fixture-discovery robustness issues

Important Files Changed

Filename	Overview
tests/benchmarks/resolution/resolution-benchmark.test.js	Core benchmark runner: builds fixture graphs, compares call edges, and enforces precision/recall thresholds — contains a misleading CI-gate comment (claims 80% recall threshold while code uses 55–58%), lacks manifest schema validation, and iterates FIXTURES_DIR without filtering non-directory entries.
tests/benchmarks/resolution/expected-edges.schema.json	Well-structured JSON Schema (2020-12) for call-edge manifests; correctly requires source/target/kind/mode fields and restricts mode to the three valid values.
tests/benchmarks/resolution/fixtures/javascript/expected-edges.json	15 hand-annotated JS call edges; source/target names and modes are consistent with the corresponding fixture source files.
tests/benchmarks/resolution/fixtures/typescript/expected-edges.json	16 hand-annotated TS call edges; correctly annotates interface-typed fields (Repository, Serializer) as receiver-typed even though they are currently false-negatives for the resolver.
tests/benchmarks/resolution/fixtures/typescript/service.ts	Key fixture: UserService holds interface-typed fields (Repository/Serializer), making it the primary driver of the TypeScript false-negative recall gap documented in the PR.
CONTRIBUTING.md	Documentation updated to describe the new resolution benchmark, its CI gate, and instructions for adding new language fixtures; content is accurate relative to the actual test code.

Sequence Diagram

sequenceDiagram
    participant Vitest
    participant BenchRunner as resolution-benchmark.test.js
    participant FS as Node FS
    participant Builder as buildGraph()
    participant DB as SQLite (.codegraph/graph.db)
    participant Metrics as computeMetrics()

    Vitest->>BenchRunner: run suite
    BenchRunner->>FS: discoverFixtures() — readdirSync(fixtures/)
    FS-->>BenchRunner: ["javascript", "typescript"]

    loop for each language
        BenchRunner->>FS: copyFixture(lang) → tmp dir
        BenchRunner->>Builder: buildFixtureGraph(tmpDir)
        Builder->>DB: write nodes + edges
        Builder-->>BenchRunner: done

        BenchRunner->>DB: extractResolvedEdges() SELECT edges JOIN nodes
        DB-->>BenchRunner: resolvedEdges[]

        BenchRunner->>FS: read expected-edges.json
        FS-->>BenchRunner: manifest { edges[] }

        BenchRunner->>Metrics: computeMetrics(resolvedEdges, expectedEdges)
        Metrics-->>BenchRunner: { precision, recall, byMode, falsePositives, falseNegatives }

        BenchRunner->>Vitest: assert precision ≥ threshold
        BenchRunner->>Vitest: assert recall ≥ threshold
        BenchRunner->>Vitest: assert staticRecall ≥ threshold
        BenchRunner->>Vitest: assert receiverRecall ≥ threshold

        BenchRunner->>FS: rmSync(tmpDir)
    end

    BenchRunner->>Vitest: afterAll — print summary table

Comments Outside Diff (3)

tests/benchmarks/resolution/resolution-benchmark.test.js, line 647-648 (link)

CI gate comment contradicts actual thresholds

The file-header comment states "CI gate: fails if precision < 85% or recall < 80%", but the THRESHOLDS constant sets recall at 55 % for JavaScript and 58 % for TypeScript — well below 80 %. The JSDoc block on THRESHOLDS repeats the same incorrect target ("Target: precision ≥85%, recall ≥80% for both JS and TS").

A contributor reading only the comment would believe the bar is significantly higher than it actually is, and the discrepancy could mask real regressions if thresholds are ever tightened without updating the comment. Update the header comment and the JSDoc to match the real thresholds:
tests/benchmarks/resolution/resolution-benchmark.test.js, line 893-896 (link)

Manifest not validated against declared JSON schema

The expected-edges.json files declare "$schema": "../../expected-edges.schema.json", but the benchmark loads them with a plain JSON.parse and never validates the content against that schema. If a contributor adds a fixture with a typo (e.g. "knd" instead of "kind", or an invalid mode value like "static-dispatch"), the test will silently treat the edge as having mode: undefined, compute incorrect metrics, and produce a confusing failure rather than a clear schema error.

Consider adding lightweight AJV-based validation (or even a structural assertion) right after parsing:
```
const manifest = JSON.parse(fs.readFileSync(manifestPath, 'utf-8'));
// Guard against malformed manifests
expect(manifest.edges, 'expected-edges.json must have an "edges" array').toBeInstanceOf(Array);
for (const edge of manifest.edges) {
  expect(['static', 'receiver-typed', 'interface-dispatched'], `unknown mode "${edge.mode}"`)
    .toContain(edge.mode);
}
expectedEdges = manifest.edges;
```
tests/benchmarks/resolution/resolution-benchmark.test.js, line 851-858 (link)

discoverFixtures iterates non-directory entries

fs.readdirSync(FIXTURES_DIR) returns the names of all entries — including regular files (e.g. a future README.md or .DS_Store). For each name the code constructs path.join(FIXTURES_DIR, dir, 'expected-edges.json') — a path that treats the file name as a directory segment. This silently no-ops today, but any tool that creates files directly in fixtures/ would produce confusing languages entries. Use { withFileTypes: true } to restrict discovery to directories only:

_{Last reviewed commit: "Merge branch 'main' ..."}

- Ratchet TypeScript recall thresholds to measured-10pp (recall 0.58, receiverRecall 0.45, staticRecall 0.9) so the CI gate catches regressions - Remove duplicate formatReport console.log from precision test (afterAll already prints the summary) - Use withFileTypes in copyFixture to skip subdirectories safely - Guard discoverFixtures with existsSync to prevent opaque ENOENT at import

carlos-alm · 2026-03-19T04:55:42Z

Addressed all review feedback:

TypeScript thresholds near-meaningless: Fixed — ratcheted recall from 0.30 to 0.58, receiverRecall from 0.05 to 0.45, staticRecall from 0.80 to 0.90. These are measured baselines minus ~10pp buffer so the CI gate actually catches regressions.
Duplicate formatReport output: Fixed — removed the console.log(formatReport(...)) from the precision test. The afterAll block already prints the combined summary.
copyFixture fragile on subdirectories: Fixed — now uses fs.readdirSync(src, { withFileTypes: true }) and skips non-file entries.
discoverFixtures module-load throw: Fixed — added fs.existsSync(FIXTURES_DIR) guard so missing fixtures dir returns an empty array instead of throwing ENOENT at import time.

carlos-alm · 2026-03-19T04:55:50Z

@greptileai

carlos-alm · 2026-03-19T04:58:41Z

All 4 Greptile review items addressed in 7e2e707:

P1 — TypeScript thresholds tightened:

recall: 0.30 → 0.58 (measured 68.8%, ~10pp buffer)
staticRecall: 0.80 → 0.90 (measured 100%, tight ratchet)
receiverRecall: 0.05 → 0.45 (measured 54.5%, ~10pp buffer)

P2 — Duplicate formatReport output: Removed console.log(formatReport(...)) from the precision test — afterAll handles all reporting now.

P2 — copyFixture subdirectory safety: Switched to readdirSync(src, { withFileTypes: true }) with entry.isFile() guard — directories are now skipped instead of throwing EISDIR.

P2 — discoverFixtures guard: Added fs.existsSync(FIXTURES_DIR) early return so missing fixtures dir produces an empty test suite rather than an opaque import error.

@greptileai

Merge branch 'main' into feat/resolution-benchmark

79604aa

carlos-alm merged commit 86a284b into main Mar 19, 2026
15 checks passed

carlos-alm deleted the feat/resolution-benchmark branch March 19, 2026 05:23

github-actions bot locked and limited conversation to collaborators Mar 19, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: call resolution precision/recall benchmark suite (4.4)#507

feat: call resolution precision/recall benchmark suite (4.4)#507
carlos-alm merged 3 commits intomainfrom
feat/resolution-benchmark

carlos-alm commented Mar 19, 2026

Uh oh!

claude bot commented Mar 19, 2026 •

edited

Loading

Uh oh!

greptile-apps bot commented Mar 19, 2026 •

edited

Loading

Comments Outside Diff (3)

Uh oh!

carlos-alm commented Mar 19, 2026

Uh oh!

carlos-alm commented Mar 19, 2026

Uh oh!

carlos-alm commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

carlos-alm commented Mar 19, 2026

Summary

Current measured baselines

Missing edges (improvement targets)

Test plan

Uh oh!

claude bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewing PR: Call Resolution Precision/Recall Benchmark Suite

Phase 1: Root Cause Analysis ✓

Phase 2: Configuration Change Detection ✓

Phase 3: Test integrity verification ✓

Phase 4: Code Quality Review ✓

Phase 5: Backlog Compliance Assessment ✓

Root Cause Analysis

Backlog Compliance

Critical Concerns

Final Recommendation

Uh oh!

greptile-apps bot commented Mar 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Sequence Diagram

Comments Outside Diff (3)

Uh oh!

carlos-alm commented Mar 19, 2026

Uh oh!

carlos-alm commented Mar 19, 2026

Uh oh!

carlos-alm commented Mar 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Mar 19, 2026 •

edited

Loading

greptile-apps bot commented Mar 19, 2026 •

edited

Loading