feat(benchmark): call-graph focused benchmark + tree-sitter end_line fix #219

Merged
jafreck merged 5 commits into main from fix/postgres-benchmark-answers
Mar 20, 2026

@jafreck (Owner) commented Mar 20, 2026

Summary

Two related improvements: a benchmark overhaul focused on call-graph questions, and an indexer fix that patches SCIP symbol end_line using tree-sitter data.

Benchmark Overhaul

Question catalog: Replaced the 15-question mixed catalog (call graph, type hierarchy, module deps, test mapping, complexity, API surface, architecture) with 13 call-graph focused questions, 10 of which leverage lore_snippet for code-level understanding.

Key changes:

  • Added symbol2 placeholder for Q1.5 (shared callers) and Q1.7 (call chain trace)
  • Redesigned Q1.7 from "trace deepest chain" (nondeterministic) to "trace from X to Y" (deterministic)
  • Redesigned Q10.1 from "rank top 3 by fan-in" (nondeterministic) to "list all callers of X" (deterministic)
  • Fixed incorrect ground truth across all repos (verified against pinned SHA DBs)
  • Removed prose from expectedAnswer; use only concrete tokens (function names, file paths) for reliable substring scoring
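
The substring scoring mentioned in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual scorer: the function name `scoreAnswer` and the choice to return the fraction of matched tokens are assumptions. Only the token names come from the esbuild Q1.1 ground truth in this PR.

```typescript
// Hypothetical scorer: fraction of expected tokens that appear verbatim in the answer.
function scoreAnswer(answerText: string, expectedTokens: string[]): number {
  if (expectedTokens.length === 0) return 0;
  const hits = expectedTokens.filter((token) => answerText.includes(token));
  return hits.length / expectedTokens.length;
}

// Expected answer reduced to concrete tokens (function names only, no prose).
const esbuildQ11Tokens = ["runImpl", "handleBuildRequest"];

const goodAnswer = "Build is called from runImpl and from handleBuildRequest.";
const staleAnswer = "Build is called from main and rebuildImpl.";

console.log(scoreAnswer(goodAnswer, esbuildQ11Tokens)); // 1
console.log(scoreAnswer(staleAnswer, esbuildQ11Tokens)); // 0
```

Under a scorer of this shape, prose in expectedAnswer would have to be reproduced word for word to count, which is presumably why the catalog now keeps only identifiers and file paths.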

Ground truth fixes:

  • esbuild Q1.1: callers of Build are runImpl/handleBuildRequest, not main/rebuildImpl
  • jackson-databind Q7.3: createCollectionDeserializer delegates to findTypeDeserializer etc., not sibling factories
  • lore-self Q7.3: resolutionStage only calls resolveSymbolEdges (no resolveImports)
  • lore-self Q10.1: top fan-in is setLoreMeta (5 files), not openDb
  • fastapi Q1.2/Q7.3: add_api_route calls get_value_or_default/APIRoute, not get_request_handler
  • jackson-databind Q1.2: strip receiver qualifiers (type.getContentType → getContentType)

Indexer Fix: Tree-sitter Patches SCIP Symbol end_line

Problem: scip-go provides empty enclosingRange for all definitions. The SCIP stage's estimateSymbolEndLine heuristic fails for Go methods with complex parameter types (e.g., map[string]interface{}) because naive brace-counting sees {} in the type annotation.
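
To illustrate the failure mode, here is a hypothetical brace counter run against a Go method whose parameter type contains `interface{}`. It is a stand-in, not the real estimateSymbolEndLine, whose implementation is not shown in this PR. The Go snippet and all names are invented for the example, and line indices are 0-based.

```typescript
// Hypothetical heuristic: the symbol ends on the line where the first
// opened brace is closed again. It knows nothing about Go's type syntax.
function naiveEndLine(lines: string[], startLine: number): number {
  let depth = 0;
  for (let i = startLine; i < lines.length; i++) {
    const line = lines[i] ?? "";
    for (let j = 0; j < line.length; j++) {
      const ch = line.charAt(j);
      if (ch === "{") {
        depth++;
      } else if (ch === "}") {
        depth--;
        if (depth === 0) return i; // first balanced close is taken as the body end
      }
    }
  }
  return startLine; // no balanced pair found
}

const goMethodLines = [
  "func (s *Server) handle(params map[string]interface{}) error {",
  "    if params == nil {",
  "        return nil",
  "    }",
  "    return s.run(params)",
  "}",
];

// The "{}" in interface{} balances immediately, so the scan stops on the signature.
console.log(naiveEndLine(goMethodLines, 0)); // 0
console.log(goMethodLines.length - 1); // 5, the real last line of the body
```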

Fix: SourceIndexStage.computeMetricsForScipFiles (which already parses SCIP-sourced files with tree-sitter for metrics) now also patches end_line when tree-sitter provides a wider range. Includes receiver-prefix name matching (Server.HandleRequest → HandleRequest).
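
A rough sketch of the patching rule described above, under stated assumptions: the `SymbolRow` and `TreeSitterDef` shapes, the helper names, and the decision to also require matching start lines are all hypothetical and not taken from SourceIndexStage. The line numbers in the sample data are invented as well.

```typescript
// Hypothetical row shapes; the real schema is not shown in this PR.
interface SymbolRow {
  name: string;
  startLine: number;
  endLine: number;
}
interface TreeSitterDef {
  name: string;
  startLine: number;
  endLine: number;
}

// Receiver-prefix matching: "Server.HandleRequest" and "HandleRequest" compare equal.
function bareName(qualified: string): string {
  const dot = qualified.lastIndexOf(".");
  return dot === -1 ? qualified : qualified.slice(dot + 1);
}

function patchEndLines(symbols: SymbolRow[], defs: TreeSitterDef[]): SymbolRow[] {
  return symbols.map((sym) => {
    const match = defs.find(
      (def) => def.startLine === sym.startLine && bareName(def.name) === bareName(sym.name)
    );
    // Only widen: keep the SCIP value unless tree-sitter reports a later end line.
    if (match !== undefined && match.endLine > sym.endLine) {
      return { name: sym.name, startLine: sym.startLine, endLine: match.endLine };
    }
    return sym;
  });
}

const scipSymbols: SymbolRow[] = [
  { name: "Server.HandleRequest", startLine: 10, endLine: 10 }, // signature only
  { name: "helper", startLine: 50, endLine: 60 },
];
const treeSitterDefs: TreeSitterDef[] = [
  { name: "HandleRequest", startLine: 10, endLine: 42 },
  { name: "helper", startLine: 50, endLine: 55 }, // narrower, so it is ignored
];

const patchedSymbols = patchEndLines(scipSymbols, treeSitterDefs);
console.log(patchedSymbols.map((s) => s.endLine)); // [ 42, 60 ]
```

Widening only, rather than always overwriting, keeps any SCIP range that is already at least as large as the tree-sitter one.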

Impact: Go method handleBuildRequest goes from end_line=596 (wrong, just signature) to end_line=790 (correct, full body). This unlocks 20 outgoing call edges and makes cross-file callers like handleBuildRequest → Build visible.

Test: Integration test creates a Go method with map[string]interface{} params, a synthetic SCIP index with no enclosingRange, runs both stages, and asserts tree-sitter corrects end_line.

Benchmark Results (Run 5)

| Repo | Lore Correctness | Pass Rate |
|---|---|---|
| lore-self | 98.7% | 14/14 |
| jackson-databind | 96.2% | 13/14 |
| postgres | 85.6% | 12/14 |
| zod | 83.7% | 14/14 |
| fastapi | 73.7% | 14/14 |
| esbuild | 59.4% | 14/14 |

Average lore-arm correctness: ~83% (up from ~27% at session start).
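
As a quick check of the ~83% figure, the unweighted mean of the six per-repo correctness values in the Run 5 table can be recomputed. Treating the average as an unweighted mean across repos is an assumption, and the variable names are illustrative.

```typescript
// Per-repo lore-arm correctness (percent), copied from the Run 5 table.
const run5Correctness: Array<[string, number]> = [
  ["lore-self", 98.7],
  ["jackson-databind", 96.2],
  ["postgres", 85.6],
  ["zod", 83.7],
  ["fastapi", 73.7],
  ["esbuild", 59.4],
];

// Unweighted mean across repos.
const meanCorrectness =
  run5Correctness.reduce((sum, entry) => sum + entry[1], 0) / run5Correctness.length;

console.log(meanCorrectness.toFixed(1)); // "82.9"
```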

jafreck added 5 commits March 15, 2026 21:00
…able questions

- Update Q1.6 expected answer from 'None' to 'addNSItemForReturning'
- Update Q10.2 with accurate caller counts from SCIP index
- Remove Q2.1 (type hierarchy) and Q3.5 (external deps) — not applicable to postgres repo
…postgres

Q5.1 relies on embedding-based semantic search, which we want to
exclude from the structural benchmark question set.
Benchmark overhaul:
- Replace 15-question mixed catalog with 13 call-graph focused questions
- 10 of 13 questions leverage lore_snippet for code-level understanding
- Add symbol2 placeholder for Q1.5 (shared callers) and Q1.7 (call chain trace)
- Redesign Q1.7 to trace between specific endpoints (deterministic)
- Redesign Q10.1 to query fan-in for a specific function (deterministic)
- Fix ground truth: esbuild Q1.1 callers, jackson-databind Q7.3 delegation,
  lore-self Q7.3/Q1.7/Q10.1, fastapi add_api_route callees
- Strip receiver qualifiers from jackson-databind Q1.2 expectedAnswer
- Remove prose from expectedAnswer; use only concrete tokens for scoring

Indexer fix — tree-sitter patches SCIP symbol end_line:
- SCIP Go indexer (scip-go) provides no enclosingRange for definitions
- SourceIndexStage now patches symbol end_line using tree-sitter data
  when tree-sitter provides a wider range
  … 59.4% correctness (14/14 pass, Go indexer limitation)
- Lower category count threshold from 5 to 3 (now 4 categories)
- Replace Q6.1 reference with Q7.3 in renderPrompt test (Q6.1 removed)
- Update tasks test category count assertion

codecov Bot commented Mar 20, 2026

Codecov Report

❌ Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.82%. Comparing base (508f9be) to head (1ad94db).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/indexer/stages/scip-indexer.ts | 40.00% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #219      +/-   ##
==========================================
+ Coverage   87.49%   87.82%   +0.33%     
==========================================
  Files          85       85              
  Lines        9475     9482       +7     
  Branches     2932     2936       +4     
==========================================
+ Hits         8290     8328      +38     
+ Misses       1185     1154      -31     

@jafreck jafreck merged commit c4ea440 into main Mar 20, 2026
3 checks passed
@jafreck jafreck mentioned this pull request Mar 21, 2026
jafreck added a commit that referenced this pull request Mar 27, 2026
…fix (#219)

* fix(benchmark): correct postgres expected answers and remove inapplicable questions

- Update Q1.6 expected answer from 'None' to 'addNSItemForReturning'
- Update Q10.2 with accurate caller counts from SCIP index
- Remove Q2.1 (type hierarchy) and Q3.5 (external deps) — not applicable to postgres repo

* fix(benchmark): remove Q5.1 (semantic similarity) from lore-self and postgres

Q5.1 relies on embedding-based semantic search, which we want to
exclude from the structural benchmark question set.

* feat(benchmark): call-graph focused benchmark + tree-sitter end_line fix

Benchmark overhaul:
- Replace 15-question mixed catalog with 13 call-graph focused questions
- 10 of 13 questions leverage lore_snippet for code-level understanding
- Add symbol2 placeholder for Q1.5 (shared callers) and Q1.7 (call chain trace)
- Redesign Q1.7 to trace between specific endpoints (deterministic)
- Redesign Q10.1 to query fan-in for a specific function (deterministic)
- Fix ground truth: esbuild Q1.1 callers, jackson-databind Q7.3 delegation,
  lore-self Q7.3/Q1.7/Q10.1, fastapi add_api_route callees
- Strip receiver qualifiers from jackson-databind Q1.2 expectedAnswer
- Remove prose from expectedAnswer; use only concrete tokens for scoring

Indexer fix — tree-sitter patches SCIP symbol end_line:
- SCIP Go indexer (scip-go) provides no enclosingRange for definitions
- SourceIndexStage now patches symbol end_line using tree-sitter data
  when tree-sitter provides a wider range
  … 59.4% correctness (14/14 pass, Go indexer limitation)

* fix(test): update benchmark test assertions for 13-question catalog

- Lower category count threshold from 5 to 3 (now 4 categories)
- Replace Q6.1 reference with Q7.3 in renderPrompt test (Q6.1 removed)
- Update tasks test category count assertion
@jafreck jafreck deleted the fix/postgres-benchmark-answers branch March 27, 2026 20:55