feat(benchmark): call-graph focused benchmark + tree-sitter end_line fix by jafreck · Pull Request #219 · jafreck/Lore

jafreck · 2026-03-20T23:15:30Z

Summary

Two related improvements: a benchmark overhaul focused on call-graph questions, and an indexer fix that patches SCIP symbol end_line using tree-sitter data.

Benchmark Overhaul

Question catalog: Replaced the 15-question mixed catalog (call graph, type hierarchy, module deps, test mapping, complexity, API surface, architecture) with 13 call-graph focused questions, 10 of which leverage lore_snippet for code-level understanding.

Key changes:

Added symbol2 placeholder for Q1.5 (shared callers) and Q1.7 (call chain trace)
Redesigned Q1.7 from "trace deepest chain" (nondeterministic) to "trace from X to Y" (deterministic)
Redesigned Q10.1 from "rank top 3 by fan-in" (nondeterministic) to "list all callers of X" (deterministic)
Fixed incorrect ground truth across all repos (verified against pinned SHA DBs)
Removed prose from expectedAnswer; use only concrete tokens (function names, file paths) for reliable substring scoring
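
To illustrate why token-only expected answers score more reliably than prose, here is a minimal sketch of substring scoring. The function name `scoreAnswer`, the case-sensitive matching, and the "fraction of tokens found" formula are assumptions for illustration; the benchmark's actual scorer may differ. The tokens come from the esbuild Q1.1 ground-truth fix listed below.

```typescript
// Hypothetical scorer (not the benchmark's actual implementation):
// returns the fraction of expected tokens that appear verbatim in the answer.
function scoreAnswer(answer: string, expectedTokens: string[]): number {
  if (expectedTokens.length === 0) return 0;
  const hits = expectedTokens.filter((token) => answer.includes(token));
  return hits.length / expectedTokens.length;
}

// Concrete tokens only: function names, no surrounding prose.
const expectedCallers = ["runImpl", "handleBuildRequest"];

// Both tokens present: full credit.
const fullScore = scoreAnswer(
  "Build is called by runImpl and handleBuildRequest.",
  expectedCallers,
);

// Only one of the two tokens present: half credit.
const partialScore = scoreAnswer("Build is called by main and runImpl.", expectedCallers);
```

With prose in the expected answer, a correct response phrased differently would fail the substring check; with bare identifiers, phrasing does not affect the score.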

Ground truth fixes:

esbuild Q1.1: callers of Build are runImpl/handleBuildRequest, not main/rebuildImpl
jackson-databind Q7.3: createCollectionDeserializer delegates to findTypeDeserializer etc., not sibling factories
lore-self Q7.3: resolutionStage only calls resolveSymbolEdges (no resolveImports)
lore-self Q10.1: top fan-in is setLoreMeta (5 files), not openDb
fastapi Q1.2/Q7.3: add_api_route calls get_value_or_default/APIRoute, not get_request_handler
jackson-databind Q1.2: strip receiver qualifiers (type.getContentType → getContentType)

Indexer Fix: Tree-sitter Patches SCIP Symbol end_line

Problem: scip-go provides empty enclosingRange for all definitions. The SCIP stage's estimateSymbolEndLine heuristic fails for Go methods with complex parameter types (e.g., map[string]interface{}) because naive brace-counting sees {} in the type annotation.
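
The failure mode can be reproduced with a small sketch. This is a hypothetical character-level brace counter, not the actual estimateSymbolEndLine code; it assumes the heuristic stops as soon as the brace depth returns to zero. The Go snippet and the `dispatch` method in it are made up for illustration.

```typescript
// Hypothetical naive heuristic: scan from the definition line and return the
// first line on which the brace depth drops back to zero.
function naiveEndLine(lines: string[], startLine: number): number {
  let depth = 0;
  for (let i = startLine; i < lines.length; i++) {
    const line = lines[i] ?? "";
    for (const ch of line) {
      if (ch === "{") {
        depth++;
      } else if (ch === "}") {
        depth--;
        if (depth === 0) return i; // first balanced point wins
      }
    }
  }
  return startLine; // no balanced brace pair found
}

// Zero-based line indices; the method body really ends at index 5.
const goSource = [
  "func (s *Server) HandleRequest(params map[string]interface{}) error {",
  "\tif params == nil {",
  "\t\treturn nil",
  "\t}",
  "\treturn s.dispatch(params)",
  "}",
];

// The "{}" in interface{} balances on the signature line, so the scan stops there.
const estimatedEnd = naiveEndLine(goSource, 0);
```

Here `estimatedEnd` is 0 (the signature line) rather than 5, which matches the "just signature" symptom described in the Impact paragraph.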

Fix: SourceIndexStage.computeMetricsForScipFiles (which already parses SCIP-sourced files with tree-sitter for metrics) now also patches end_line when tree-sitter provides a wider range. Includes receiver-prefix name matching (Server.HandleRequest → HandleRequest).
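
A minimal sketch of the patching rule, under stated assumptions: the `SymbolRange` shape, the `bareName` and `patchEndLines` helpers, and matching candidates by start line are all hypothetical and are not taken from SourceIndexStage. Only the two behaviours described above are modelled: widen end_line when tree-sitter reports a later end, and compare names with the receiver prefix stripped.

```typescript
// Hypothetical symbol shape; the real indexer's types are not shown in this PR.
interface SymbolRange {
  name: string;
  startLine: number;
  endLine: number;
}

// Drop a receiver prefix: "Server.HandleRequest" -> "HandleRequest".
function bareName(name: string): string {
  const dot = name.lastIndexOf(".");
  return dot === -1 ? name : name.slice(dot + 1);
}

// Return a copy of the SCIP symbols, widening endLine only when a tree-sitter
// symbol with the same bare name and start line ends later.
function patchEndLines(
  scipSymbols: SymbolRange[],
  treeSitterSymbols: SymbolRange[],
): SymbolRange[] {
  return scipSymbols.map((sym) => {
    const match = treeSitterSymbols.find(
      (ts) => ts.startLine === sym.startLine && bareName(ts.name) === bareName(sym.name),
    );
    return match && match.endLine > sym.endLine ? { ...sym, endLine: match.endLine } : sym;
  });
}

// Illustrative line numbers only.
const patched = patchEndLines(
  [
    { name: "Server.HandleRequest", startLine: 10, endLine: 10 }, // signature-only range
    { name: "helper", startLine: 50, endLine: 60 }, // already wide enough
  ],
  [
    { name: "HandleRequest", startLine: 10, endLine: 42 },
    { name: "helper", startLine: 50, endLine: 55 }, // narrower: ignored
  ],
);
```

In this sketch the first symbol's end line is widened from 10 to 42, while the second keeps its original end line of 60 because tree-sitter's range is narrower.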

Impact: Go method handleBuildRequest goes from end_line=596 (wrong, just signature) to end_line=790 (correct, full body). This unlocks 20 outgoing call edges and makes cross-file callers like handleBuildRequest → Build visible.

Test: Integration test creates a Go method with map[string]interface{} params, a synthetic SCIP index with no enclosingRange, runs both stages, and asserts tree-sitter corrects end_line.

Benchmark Results (Run 5)

Repo	Lore Correctness	Pass Rate
lore-self	98.7%	14/14
jackson-databind	96.2%	13/14
postgres	85.6%	12/14
zod	83.7%	14/14
fastapi	73.7%	14/14
esbuild	59.4%	14/14

Average lore-arm correctness: ~83% (up from ~27% at session start).
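
For reference, the ~83% figure is consistent with an unweighted mean of the six per-repo correctness values in the table. The snippet below is a simple arithmetic check, assuming the average is unweighted; the variable names are illustrative.

```typescript
// Per-repo lore-arm correctness from the Run 5 table, in percent.
const runFiveCorrectness: Array<[string, number]> = [
  ["lore-self", 98.7],
  ["jackson-databind", 96.2],
  ["postgres", 85.6],
  ["zod", 83.7],
  ["fastapi", 73.7],
  ["esbuild", 59.4],
];

// Unweighted mean across repos (an assumption about how the average was taken).
const meanCorrectness =
  runFiveCorrectness.reduce((sum, [, pct]) => sum + pct, 0) / runFiveCorrectness.length;

// Rounded to the nearest whole percent for comparison with the "~83%" claim.
const roundedMean = Math.round(meanCorrectness);
```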


codecov · 2026-03-20T23:31:54Z

Codecov Report

❌ Patch coverage is 81.25000% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.82%. Comparing base (508f9be) to head (1ad94db).
⚠️ Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
src/indexer/stages/scip-indexer.ts	40.00%	3 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #219      +/-   ##
==========================================
+ Coverage   87.49%   87.82%   +0.33%     
==========================================
  Files          85       85              
  Lines        9475     9482       +7     
  Branches     2932     2936       +4     
==========================================
+ Hits         8290     8328      +38     
+ Misses       1185     1154      -31


…fix (#219)

* fix(benchmark): correct postgres expected answers and remove inapplicable questions
- Update Q1.6 expected answer from 'None' to 'addNSItemForReturning'
- Update Q10.2 with accurate caller counts from SCIP index
- Remove Q2.1 (type hierarchy) and Q3.5 (external deps): not applicable to postgres repo

* fix(benchmark): remove Q5.1 (semantic similarity) from lore-self and postgres
Q5.1 relies on embedding-based semantic search, which we want to exclude from the structural benchmark question set.

* feat(benchmark): call-graph focused benchmark + tree-sitter end_line fix
Benchmark overhaul:
- Replace 15-question mixed catalog with 13 call-graph focused questions
- 10 of 13 questions leverage lore_snippet for code-level understanding
- Add symbol2 placeholder for Q1.5 (shared callers) and Q1.7 (call chain trace)
- Redesign Q1.7 to trace between specific endpoints (deterministic)
- Redesign Q10.1 to query fan-in for a specific function (deterministic)
- Fix ground truth: esbuild Q1.1 callers, jackson-databind Q7.3 delegation, lore-self Q7.3/Q1.7/Q10.1, fastapi add_api_route callees
- Strip receiver qualifiers from jackson-databind Q1.2 expectedAnswer
- Remove prose from expectedAnswer; use only concrete tokens for scoring
Indexer fix (tree-sitter patches SCIP symbol end_line):
- SCIP Go indexer (scip-go) provides no enclosingRange for definitions
- SourceIndexStage now patches symbol end_line using tree-sitter data when tree-sitter provides a wider range
[…] 59.4% correctness (14/14 pass, Go indexer limitation)

* fix(test): update benchmark test assertions for 13-question catalog
- Lower category count threshold from 5 to 3 (now 4 categories)
- Replace Q6.1 reference with Q7.3 in renderPrompt test (Q6.1 removed)
- Update tasks test category count assertion

jafreck added 5 commits March 15, 2026 21:00

fix(benchmark): remove Q5.1 (semantic similarity) from lore-self and …

3ea50d0

…postgres Q5.1 relies on embedding-based semantic search, which we want to exclude from the structural benchmark question set.

merge: resolve conflict with origin/main (keep call-graph question ca…

915fe4c

…talog)

fix(test): update benchmark test assertions for 13-question catalog

1ad94db

- Lower category count threshold from 5 to 3 (now 4 categories) - Replace Q6.1 reference with Q7.3 in renderPrompt test (Q6.1 removed) - Update tasks test category count assertion

jafreck merged commit c4ea440 into main Mar 20, 2026
3 checks passed

jafreck mentioned this pull request Mar 21, 2026

chore: release v0.3.9 #225

Merged

jafreck deleted the fix/postgres-benchmark-answers branch March 27, 2026 20:55

jafreck merged 5 commits into main from fix/postgres-benchmark-answers
