Skip to content

feat(search): per-file diversification so top-K isn't one class's methods#29

Open
mschreib28 wants to merge 2 commits into
mainfrom
upstream/feat/search-diversify
Open

feat(search): per-file diversification so top-K isn't one class's methods#29
mschreib28 wants to merge 2 commits into
mainfrom
upstream/feat/search-diversify

Conversation

@mschreib28
Copy link
Copy Markdown
Owner

Summary\n\nWhen a query matches many symbols in a single file, current ranking returns the matching class plus 9 of its members from the same file. The first hit is informative; the next 9 are implementation detail that pushes peer files (subclasses, callers, sibling modules) past the limit. This PR caps results per file so search surfaces representative breadth across the codebase rather than burying the user in one class's internals.\n\n## Empirical lift on codegraph (limit=10, default perFileCap=3)\n\n| Query | Before (max from one file) | After |\n|---|---:|---:|\n| ExtractionOrchestrator | 10/10 | 9/10 (only one file matches; backfill kicks in) |\n| database | 8/10 | 3/10 |\n| config | 5/10 | 3/10 |\n| resolve | 4/10 | 3/10 |\n| extract / parse | 3 (no regression) | 3 |\n\nTop-1 result is preserved in every case — diversification only reorders second-and-onward.\n\n## Components\n\n- SearchOptions.perFileCap?: number in src/types.ts — default 3; set to 0 to disable.\n\n- diversifyByFile(results, limit, perFileCap) in src/search/query-utils.ts: pure function. First pass picks at most perFileCap per file in score order. If limit isn't yet filled, backfills from skipped (in original score order) so we never return fewer results than the caller requested.\n\n- Wiring in src/db/queries.ts: applied after the existing rescoring pass, when there are more candidates than the caller's limit. Relies on the existing 5× internal over-fetch in searchNodesFTS for headroom — no new multiplier added (multiplier-on-multiplier composition was the reviewer's blocking concern in an earlier draft).\n\n- 13 regression tests covering pure-function behavior (cap, backfill, top-preservation, perFileCap=0, limit edges) + integration tests against an end-to-end DB.\n\n## Files changed\n\n| File | Change |\n|---|---|\n| src/types.ts | Add perFileCap to SearchOptions |\n| src/search/query-utils.ts | Add diversifyByFile pure helper |\n| src/db/queries.ts | Wire diversifyByFile into searchNodes; comment on the over-fetch composition |\n| __tests__/diversify.test.ts (NEW) | 13 regression tests |\n\n## Why this PR (not symbol clustering)\n\nThe previous proposal was symbol clustering — grouping User/UserService/UserController into a "User feature." After prototyping against codegraph (which is a tool-style codebase, not entity-style), the clusters that emerged were mostly verb collisions (get*, extract*, resolve*) — naming convention noise, not features. Result diversification turned out to be the actual cure for the pain point clustering was meant to address: "search returned 10 hits, all from one class" → "search returned representatives across files."\n\n## Test plan\n\n- [x] npm test: 393/393 pass on macOS\n- [x] npx tsc --noEmit clean\n- [x] Bench script confirms the lift in the table above\n- [x] Independent reviewer pass before pushing — addressed three findings:\n - Multiplier-on-multiplier (4× outer × 5× inner = 20× for large limits): outer multiplier removed; the inner over-fetch is sufficient. Re-ran bench to confirm no regression.\n - Within-limit reorder: documented as intentional pure-function behavior; integration path correctly skips when results.length <= limit.\n - MCP exposure of perFileCap: deferred — default 3 is the desired new behavior; MCP callers pick it up implicitly.\n\n## Backwards compatibility note\n\nA caller relying on getting all 10 hits from one file via searchNodes will now see at most 3 (with backfill if no peers exist). The new behavior is opt-out via perFileCap: 0.\n\n🤖 Generated with Claude Code\n


Copied from colbymchenry/codegraph#107

andreinknv and others added 2 commits April 26, 2026 16:36
…hods

When a query matches many symbols in a single file, current ranking
returns the matching class plus 9 of its members from the same file.
The first hit is informative; the next 9 are implementation detail
that pushes peer files (subclasses, callers, sibling modules) past the
limit. This PR caps results per file so search surfaces representative
breadth across the codebase rather than burying the user in one
class's internals.

## Empirical lift on codegraph (limit=10, default cap=3)

| Query | Before (max from one file) | After |
|---|---:|---:|
| ExtractionOrchestrator | 10/10 | 9/10 (only one file matches; backfill kicks in) |
| database | 8/10 | 3/10 |
| config | 5/10 | 3/10 |
| resolve | 4/10 | 3/10 |
| extract / parse | 3 (no regression) | 3 |

Top-1 result is preserved in every case — diversification only
reorders second-and-onward.

## Components

- `SearchOptions.perFileCap?: number` — default 3; 0 disables.

- `diversifyByFile(results, limit, perFileCap)` in
  src/search/query-utils.ts: pure function. First pass picks at most
  perFileCap per file in score order. If limit isn't yet filled,
  backfills from skipped (in original score order) so we never return
  fewer results than the caller requested.

- searchNodes wires it after the existing rescoring pass, when there
  are more candidates than the caller's limit. Relies on the existing
  5x internal over-fetch in searchNodesFTS for headroom — no new
  multiplier added (multiplier-on-multiplier composition was the
  reviewer's blocking concern in an earlier draft).

## Files changed

| File | Change |
|---|---|
| src/types.ts | Add perFileCap to SearchOptions |
| src/search/query-utils.ts | Add diversifyByFile pure helper |
| src/db/queries.ts | Wire diversifyByFile into searchNodes; comment on the over-fetch composition |
| __tests__/diversify.test.ts (NEW) | 13 regression tests |

## Test plan

- [x] npm test: 393/393 pass on macOS
- [x] npx tsc --noEmit clean
- [x] Bench script confirms the lift in the table above
- [x] Independent reviewer pass before pushing — addressed:
  - Multiplier-on-multiplier (4x outer * 5x inner = 20x for large
    limits): outer multiplier removed; inner over-fetch is sufficient.
  - Within-limit reorder: documented as intentional pure-function
    behavior; integration path correctly skips when results <= limit.
  - MCP exposure of perFileCap: deferred — default 3 is the desired
    new behavior; MCP can pick it up later if users want to tune.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
searchNodes returns bm25 + kindBonus + scorePathRelevance +
nameMatchBonus, which has no upper bound. The CLI then displayed
that as `(score * 100).toFixed(0)%` — producing "10449%", "6553%",
"2251%" for ordinary searches like `query serve` against ollama.
Beyond being misleading, the value isn't comparable across queries.

Render each hit's score as a fraction of the top hit's score so the
top result is always "100%" and everything below scales relative to
it. Topscore=0 (degenerate) shows as "0%" instead of NaN.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants