feat(benchmark): call-graph focused benchmark + tree-sitter end_line fix #219
Merged
…able questions
- Update Q1.6 expected answer from 'None' to 'addNSItemForReturning'
- Update Q10.2 with accurate caller counts from SCIP index
- Remove Q2.1 (type hierarchy) and Q3.5 (external deps), which are not applicable to the postgres repo

…postgres
Q5.1 relies on embedding-based semantic search, which we want to exclude from the structural benchmark question set.

Benchmark overhaul:
- Replace 15-question mixed catalog with 13 call-graph focused questions
- 10 of 13 questions leverage lore_snippet for code-level understanding
- Add symbol2 placeholder for Q1.5 (shared callers) and Q1.7 (call chain trace)
- Redesign Q1.7 to trace between specific endpoints (deterministic)
- Redesign Q10.1 to query fan-in for a specific function (deterministic)
- Fix ground truth: esbuild Q1.1 callers, jackson-databind Q7.3 delegation, lore-self Q7.3/Q1.7/Q10.1, fastapi add_api_route callees
- Strip receiver qualifiers from jackson-databind Q1.2 expectedAnswer
- Remove prose from expectedAnswer; use only concrete tokens for scoring

Indexer fix (tree-sitter patches SCIP symbol end_line):
- SCIP Go indexer (scip-go) provides no enclosingRange for definitions
- SourceIndexStage now patches symbol end_line using tree-sitter data when tree-sitter provides a wider range
[…]
- … 59.4% correctness (14/14 pass, Go indexer limitation)

- Lower category count threshold from 5 to 3 (now 4 categories)
- Replace Q6.1 reference with Q7.3 in renderPrompt test (Q6.1 removed)
- Update tasks test category count assertion
Codecov Report
❌ Patch coverage is […]
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #219      +/-   ##
==========================================
+ Coverage   87.49%   87.82%   +0.33%
==========================================
  Files          85       85
  Lines        9475     9482       +7
  Branches     2932     2936       +4
==========================================
+ Hits         8290     8328      +38
+ Misses       1185     1154      -31
jafreck added a commit that referenced this pull request on Mar 27, 2026:

…fix (#219)

* fix(benchmark): correct postgres expected answers and remove inapplicable questions
- Update Q1.6 expected answer from 'None' to 'addNSItemForReturning'
- Update Q10.2 with accurate caller counts from SCIP index
- Remove Q2.1 (type hierarchy) and Q3.5 (external deps), which are not applicable to the postgres repo

* fix(benchmark): remove Q5.1 (semantic similarity) from lore-self and postgres
Q5.1 relies on embedding-based semantic search, which we want to exclude from the structural benchmark question set.

* feat(benchmark): call-graph focused benchmark + tree-sitter end_line fix
Benchmark overhaul:
- Replace 15-question mixed catalog with 13 call-graph focused questions
- 10 of 13 questions leverage lore_snippet for code-level understanding
- Add symbol2 placeholder for Q1.5 (shared callers) and Q1.7 (call chain trace)
- Redesign Q1.7 to trace between specific endpoints (deterministic)
- Redesign Q10.1 to query fan-in for a specific function (deterministic)
- Fix ground truth: esbuild Q1.1 callers, jackson-databind Q7.3 delegation, lore-self Q7.3/Q1.7/Q10.1, fastapi add_api_route callees
- Strip receiver qualifiers from jackson-databind Q1.2 expectedAnswer
- Remove prose from expectedAnswer; use only concrete tokens for scoring
Indexer fix (tree-sitter patches SCIP symbol end_line):
- SCIP Go indexer (scip-go) provides no enclosingRange for definitions
- SourceIndexStage now patches symbol end_line using tree-sitter data when tree-sitter provides a wider range
[…]
- … 59.4% correctness (14/14 pass, Go indexer limitation)

* fix(test): update benchmark test assertions for 13-question catalog
- Lower category count threshold from 5 to 3 (now 4 categories)
- Replace Q6.1 reference with Q7.3 in renderPrompt test (Q6.1 removed)
- Update tasks test category count assertion
Summary
Two related improvements: a benchmark overhaul focused on call-graph questions, and an indexer fix that patches SCIP symbol `end_line` using tree-sitter data.

Benchmark Overhaul

Question catalog: Replaced the 15-question mixed catalog (call graph, type hierarchy, module deps, test mapping, complexity, API surface, architecture) with 13 call-graph focused questions, 10 of which leverage `lore_snippet` for code-level understanding.

Key changes:
- Added `symbol2` placeholder for Q1.5 (shared callers) and Q1.7 (call chain trace)
- Removed prose from `expectedAnswer`; use only concrete tokens (function names, file paths) for reliable substring scoring

Ground truth fixes:
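- esbuild Q1.1: callers of `Build` are `runImpl`/`handleBuildRequest`, not `main`/`rebuildImpl`
- jackson-databind Q7.3: `createCollectionDeserializer` delegates to `findTypeDeserializer` etc., not sibling factories
- lore-self: `resolutionStage` only calls `resolveSymbolEdges` (no `resolveImports`)
- lore-self: `setLoreMeta` (5 files), not `openDb`
- fastapi: `add_api_route` calls `get_value_or_default`/`APIRoute`, not `get_request_handler`
- jackson-databind Q1.2: receiver qualifiers stripped (`type.getContentType` → `getContentType`)

Indexer Fix: Tree-sitter Patches SCIP Symbol end_line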
Problem: `scip-go` provides an empty `enclosingRange` for all definitions. The SCIP stage's `estimateSymbolEndLine` heuristic fails for Go methods with complex parameter types (e.g., `map[string]interface{}`) because naive brace counting treats the balanced `{}` in the type annotation as the end of the body.

Fix: `SourceIndexStage.computeMetricsForScipFiles` (which already parses SCIP-sourced files with tree-sitter for metrics) now also patches `end_line` when tree-sitter provides a wider range. Includes receiver-prefix name matching (`Server.HandleRequest` → `HandleRequest`).

Impact: Go method `handleBuildRequest` goes from `end_line=596` (wrong, just the signature) to `end_line=790` (correct, full body). This unlocks 20 outgoing call edges and makes cross-file callers like `handleBuildRequest → Build` visible.

Test: Integration test creates a Go method with `map[string]interface{}` params, a synthetic SCIP index with no `enclosingRange`, runs both stages, and asserts that tree-sitter corrects `end_line`.

Benchmark Results (Run 5)
Average lore-arm correctness: ~83% (up from ~27% at session start).
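
To illustrate why the overhaul restricts `expectedAnswer` to concrete tokens, here is a minimal Python sketch of substring scoring. The `score_answer` function is hypothetical and is not the benchmark's actual scorer; it only assumes that scoring checks whether each expected token appears verbatim in the response.

```python
def score_answer(response: str, expected_tokens: list[str]) -> float:
    """Fraction of expected tokens that appear verbatim in the response.

    Hypothetical scorer for illustration; the real benchmark may differ.
    """
    if not expected_tokens:
        return 0.0
    hits = sum(1 for token in expected_tokens if token in response)
    return hits / len(expected_tokens)


response = "Build is called by runImpl and handleBuildRequest."

# Concrete tokens: each one either appears in the response or it does not.
print(score_answer(response, ["runImpl", "handleBuildRequest"]))  # 1.0

# A prose expectation rarely matches verbatim, even when the answer is right.
print(score_answer(response, ["runImpl and handleBuildRequest call Build"]))  # 0.0
```

With prose in the expected answer, a correct response can still score zero, which is presumably why only function names and file paths are kept.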
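
The indexer fix described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's code: `estimate_end_line` is a stand-in for a naive per-line brace counter (the real `estimateSymbolEndLine` may work differently), and `patch_end_lines`, the dictionary layout, and matching on bare name plus start line are all hypothetical.

```python
def estimate_end_line(lines: list[str], start: int) -> int:
    """Naive heuristic (assumed): stop at the first line where braces balance."""
    depth = 0
    seen_brace = False
    for i in range(start, len(lines)):
        for ch in lines[i]:
            if ch == "{":
                depth += 1
                seen_brace = True
            elif ch == "}":
                depth -= 1
        if seen_brace and depth == 0:
            return i
    return start


def bare_name(name: str) -> str:
    """Drop a receiver prefix: 'Server.HandleRequest' -> 'HandleRequest'."""
    return name.rsplit(".", 1)[-1]


def patch_end_lines(scip_symbols: list[dict], ts_symbols: list[dict]) -> None:
    """Widen a SCIP symbol's end_line when tree-sitter reports a later one."""
    # Assumption: symbols are paired by bare name and start line.
    ts_by_key = {(bare_name(t["name"]), t["start_line"]): t for t in ts_symbols}
    for sym in scip_symbols:
        match = ts_by_key.get((bare_name(sym["name"]), sym["start_line"]))
        if match and match["end_line"] > sym["end_line"]:
            sym["end_line"] = match["end_line"]


# A Go method whose parameter type contains a balanced "{}" (0-based lines).
go_source = [
    "func (s *Server) HandleRequest(",
    "\tparams map[string]interface{},",
    ") error {",
    "\treturn nil",
    "}",
]
scip = [{"name": "Server.HandleRequest", "start_line": 0,
         "end_line": estimate_end_line(go_source, 0)}]
tree_sitter = [{"name": "HandleRequest", "start_line": 0, "end_line": 4}]

print(scip[0]["end_line"])  # 1: the heuristic stopped inside the signature
patch_end_lines(scip, tree_sitter)
print(scip[0]["end_line"])  # 4: widened to the closing brace
```

Only widening (never shrinking) the range keeps the patch conservative: a SCIP range that already covers the full body is left alone.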