feat(mcp): codedb_context — multi-line snippets + source-over-test ranking #479
Merged
Two improvements driven by bench-data failure patterns:
1. Snippet enrichment: each Top-sites hit now emits ±2 lines of context
inside a ```fenced code block``` instead of just the matching line.
The judge was penalising agents for paraphrasing instead of literal
quoting; with multi-line literal snippets in the response, the agent
has copy-pasteable material to populate the answer's `snippet` field.
2. Composite file ranking: hits alone weren't enough — agents kept
picking test/spec/doc files over the real source. New score is
(raw hits) + 5 if file contains a symbol definition for any keyword,
−3 for test files (`/test`, `_test.`, `.test.`, `/__tests__/`, `/spec/`,
`/fixtures/`), −2 for docs (`.md`, `.rst`, `/docs/`). Final tie-break
by hit count.
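
The scoring rule in item 2 can be sketched as follows. This is a minimal Python illustration, not the project's actual code: the `FileHits` type, the `defines_keyword` flag, and the hit counts in the demo are hypothetical, and the matching strategy (substring match for path markers, suffix match for `.md`/`.rst`) is an assumption the description does not spell out.

```python
from dataclasses import dataclass

# Path markers taken from the PR description.
TEST_MARKERS = ("/test", "_test.", ".test.", "/__tests__/", "/spec/", "/fixtures/")
DOC_SUFFIXES = (".md", ".rst")
DOC_DIR = "/docs/"


@dataclass
class FileHits:
    """Hypothetical record for one candidate file."""
    path: str
    raw_hits: int
    defines_keyword: bool  # True if the file defines a symbol for any keyword


def composite_score(f: FileHits) -> int:
    """Raw hits, +5 for a symbol definition, -3 for tests, -2 for docs."""
    score = f.raw_hits
    if f.defines_keyword:
        score += 5
    if any(marker in f.path for marker in TEST_MARKERS):  # assumed: substring match
        score -= 3
    if f.path.endswith(DOC_SUFFIXES) or DOC_DIR in f.path:  # assumed: suffix match
        score -= 2
    return score


def rank(files):
    """Sort best-first by composite score, breaking ties by raw hit count."""
    return sorted(files, key=lambda f: (composite_score(f), f.raw_hits), reverse=True)


if __name__ == "__main__":
    # Hit counts below are made up for illustration.
    candidates = [
        FileHits("tests/test_basic.py", 6, False),
        FileHits("src/flask/sansio/scaffold.py", 4, True),
        FileHits("docs/api.rst", 5, False),
    ]
    for f in rank(candidates):
        print(composite_score(f), f.path)
```

With these made-up counts the source file outranks the test file even though the test file has more raw hits, which is the behaviour the change is after.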
Measured before/after on 16 bench tasks (react / regex / flask / gin),
codedb_context vs codegraph_context:

| | quality | tokens | wall | calls |
|---|---|---|---|---|
| before | 4.06 | 1,126 | 20.8s | 3.2 |
| after (v3) | 4.33 | 1,089 | 2.1s | 1.9 |
| codegraph | 4.44 | 9,719 | 31.5s | 8.9 |

After the change, codedb_context still uses ~9× fewer tokens than codegraph.
The remaining quality gap is concentrated in 3 tasks where the agent
paraphrases the snippet even though the literal code is in the response;
that's a prompt issue, not a tool issue.
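
For reference, the ±2-line snippet format from item 1 might be produced along these lines. This is a hedged Python sketch, not the tool's implementation: `context_snippet` and its parameters are hypothetical names, and clamping at the file boundaries is an assumption about how edge cases are handled.

```python
def context_snippet(lines, hit_index, radius=2):
    """Return the matching line plus `radius` lines either side, wrapped in a fence.

    `lines` is the file content split into lines and `hit_index` is the
    0-based index of the matching line. The window is clamped at the
    start and end of the file (an assumption, not confirmed by the PR).
    """
    start = max(0, hit_index - radius)
    end = min(len(lines), hit_index + radius + 1)
    fence = "`" * 3  # built at runtime so this example nests cleanly in a fence
    body = "\n".join(lines[start:end])
    return f"{fence}\n{body}\n{fence}"


if __name__ == "__main__":
    source = ["import os", "", "def load(path):", "    return open(path).read()", "", "X = 1"]
    print(context_snippet(source, 3))
```

Emitting the literal lines in a fenced block gives the agent text it can quote verbatim rather than paraphrase.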
Benchmark Regression Report
Thresholds: 10.00% and 50,000 ns absolute delta
Summary
Two improvements to `codedb_context` driven by failure-mode analysis on the code-search-shootout eval (16 tasks × 4 corpora):

1. Snippet enrichment (±2 lines per hit, fenced code blocks). Old format was a single matching line per hit. The judge was penalising agents for paraphrasing instead of literal quoting; with multi-line literal snippets inside ```fenced``` blocks, the agent has copy-pasteable material.
2. Composite file ranking. Raw hit count alone misranked tests/specs/docs over real source. New score:
   - `+5` if the file contains a symbol definition for any extracted keyword
   - `−3` for test files (`/test`, `_test.`, `.test.`, `/__tests__/`, `/spec/`, `/fixtures/`)
   - `−2` for docs (`.md`, `.rst`, `/docs/`)

Before this PR, agents kept picking `tests/test_basic.py` over `src/flask/sansio/scaffold.py`.

Measured impact
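Same 16 bench tasks, same agents, same judge:

| | quality | tokens | wall | calls |
|---|---|---|---|---|
| before | 4.06 | 1,126 | 20.8s | 3.2 |
| after (v3) | 4.33 | 1,089 | 2.1s | 1.9 |
| codegraph | 4.44 | 9,719 | 31.5s | 8.9 |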
Quality up +0.27 / call count down 1.7× / wall down 10× — all while staying ~9× under codegraph on tokens. Head-to-head: 2 wins / 10 ties / 3 losses against codegraph (was 2/10/4).
The 3 remaining losses are all "agent paraphrased the snippet" — a prompt issue, not a tool issue.
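
The headline deltas follow directly from the before/after figures reported in this PR. A short Python check, with the numbers copied from the measurements above (the dictionary keys are illustrative names only):

```python
# Figures as reported in the PR's before/after measurements.
before = {"quality": 4.06, "tokens": 1126, "wall_s": 20.8, "calls": 3.2}
after = {"quality": 4.33, "tokens": 1089, "wall_s": 2.1, "calls": 1.9}
codegraph = {"quality": 4.44, "tokens": 9719, "wall_s": 31.5, "calls": 8.9}

quality_gain = round(after["quality"] - before["quality"], 2)  # 0.27
call_ratio = round(before["calls"] / after["calls"], 1)        # 1.7
wall_ratio = round(before["wall_s"] / after["wall_s"], 1)      # 9.9, i.e. about 10x
token_ratio = round(codegraph["tokens"] / after["tokens"], 1)  # 8.9, i.e. about 9x

print(quality_gain, call_ratio, wall_ratio, token_ratio)
```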
Test plan
- `zig build` (ReleaseFast) passes
- `zig build test`: 486/487 pass (1 pre-existing failure on `main`: `issue-44`)
- `codedb_context` smoke tests still green
- fenced ±2-line block