## Problem
`Explorer.searchContent` (src/explore.zig:1504) returns matches in scan order through a 5-tier pipeline (word, prefix, trigram, sparse, full). The only post-hoc ordering is a per-line occurrence-count tiebreaker at src/explore.zig:1674-1688. This means:
- A file mentioning every query term many times ranks identically to a file with a single tangential mention.
- Multi-term queries (`"parse Token"`, `"render component state"`) dump every match without surfacing the most relevant file first.
- The filename fuzzy-finder (`fuzzyScore` at src/explore.zig:4881) is the only path with a real relevance score; content search has none.
BM25 over the existing inverted indices would rank documents by a length-normalized sum of tf × idf contributions, so the densest, term-rich files surface above the noise.
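For reference, the proposal uses standard Okapi BM25 (the k1 and b constants appear under Behavior below):

$$
\mathrm{score}(D, Q) = \sum_{t \in Q} \mathrm{idf}(t) \cdot \frac{\mathrm{tf}(t, D) \cdot (k_1 + 1)}{\mathrm{tf}(t, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\left(\frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5} + 1\right)
$$

where N is the indexed document count, df(t) the number of documents containing t, tf(t, D) the occurrence count of t in D, and avgdl the average document length in tokens.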
## Failing test
Branch: `issue-393-failing-test` (HEAD: 214b3c3). See `test "issue-393: BM25 ranking surfaces high-density file before single-mention file"` in src/tests.zig. Fails on main with:

```
src/tests.zig:9376:33: error: no field or member function named 'searchContentRanked' in 'explore.Explorer'
```
The test indexes a dense file (many `parse`/`Token` hits), a sparse file (one `parse` mention), and three noise files, calls `searchContentRanked("parse Token", ...)`, and asserts that the top-ranked result is `src/dense.zig` with `score > 0`.
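For orientation, this is the shape of that assertion only, not the literal test body on the branch; `indexFixture` and the `path` field on `SearchResult` are stand-ins:

```zig
test "issue-393 (shape of the assertion)" {
    const allocator = std.testing.allocator;
    // indexFixture is a stand-in for however the real test builds the
    // Explorer over the dense/sparse/noise fixture files.
    var explorer = try indexFixture(allocator);
    defer explorer.deinit();

    const results = try explorer.searchContentRanked("parse Token", allocator, 10);
    defer allocator.free(results);

    try std.testing.expectEqualStrings("src/dense.zig", results[0].path);
    try std.testing.expect(results[0].score > 0);
}
```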
## Proposed API
Additive, opt-in. The existing `searchContent` keeps its scan-order semantics, so callers that just want hit listings are unaffected.
```zig
pub const RankedResult = SearchResult; // already has score: f32

pub fn searchContentRanked(
    self: *Explorer,
    query: []const u8,
    allocator: std.mem.Allocator,
    max_results: usize,
) ![]const SearchResult;
```
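A caller-side sketch of the intended contract, assuming the returned slice is caller-owned and `SearchResult` exposes a `path` field (only `score: f32` is confirmed by the alias above):

```zig
const results = try explorer.searchContentRanked("parse Token", allocator, 20);
defer allocator.free(results); // the caller owns the returned slice

// Results arrive sorted by descending BM25 score.
for (results, 0..) |r, i| {
    std.debug.print("{d}. {s} (score {d:.2})\n", .{ i + 1, r.path, r.score });
}
```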
Behavior:
- Tokenizes the query the same way `WordIndex` tokenizes documents (lowercase + identifier split). Single-token queries fall through to the existing pipeline (BM25 with one term degenerates to df-weighted scan order).
- Multi-term queries: collect candidate documents from `WordIndex.search` for each term plus `TrigramIndex.candidates` for the joined literal. Score each candidate doc with BM25 (k1=1.2, b=0.75; see the scoring sketch after this list). Emit one `SearchResult` per top-N document; `line` is the highest-tf line for any query term in that doc.
- MCP wiring (out of scope for this issue but sketched): add `rank: "bm25"` to `codedb_search` args; the default stays unranked.
## Implementation sketch
### State (in `WordIndex`)
The `WordIndex.index` map already stores `(doc_id, line_num)` per term; repeated entries for the same `(term, doc_id)` give us per-doc TF for free. Add the following lazily built state:
```zig
// All optional — built lazily on the first searchContentRanked call,
// invalidated by indexFile / removeFile.
doc_lengths: ?std.AutoHashMap(u32, u32), // doc_id -> # tokens
total_tokens: u64, // sum of doc_lengths.values
df_cache: ?std.StringHashMap(u32), // term -> # distinct docs (computed on demand)
ranking_dirty: bool,
```
- N (doc count): already available via `WordIndex.fileCount`.
- avgdl: `total_tokens / N`.
- df(term): count distinct doc_ids in `index.get(term).items`; O(hits) once, then cached.
- tf(term, doc): count entries in the posting list with a matching doc_id; O(hits/N) matches on average via a single linear pass, and for low-df hot terms this is bounded by the existing 512 `MAX_POSTINGS` pressure (see the helper sketch after this list).
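A sketch of the two posting-list helpers, assuming an entry layout of `(doc_id, line)` and postings grouped by doc_id (both inferred from the description above, not read from the source):

```zig
fn docFrequency(self: *WordIndex, term: []const u8) u32 {
    if (self.df_cache.?.get(term)) |cached| return cached;
    const postings = self.index.get(term) orelse return 0;
    var distinct: u32 = 0;
    var last_doc: ?u32 = null;
    for (postings.items) |p| {
        // Assumes postings for a term are grouped by doc_id; if not,
        // count with a scratch AutoHashMap instead.
        if (last_doc == null or p.doc_id != last_doc.?) distinct += 1;
        last_doc = p.doc_id;
    }
    self.df_cache.?.put(term, distinct) catch {}; // a dropped cache entry is harmless
    return distinct;
}

fn termFrequency(self: *WordIndex, term: []const u8, doc_id: u32) u32 {
    const postings = self.index.get(term) orelse return 0;
    var tf: u32 = 0;
    for (postings.items) |p| {
        if (p.doc_id == doc_id) tf += 1; // repeated entries encode tf
    }
    return tf;
}
```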
### Persistence
- `doc_lengths` adds `4 + 8 * N` bytes; for N=14k files that's ~110 KB. Persist it alongside the `WordIndex` on-disk format: `DiskHeader` bumps to v3 and adds an optional length-table block (a layout sketch follows this list); readers without v3 ignore the block and rebuild lazily.
- `df_cache` is not persisted; it's a per-query memoization, and a cold start does one O(hits) pass per term.
- No per-(term, doc) tf table on disk: tf is recomputed from the existing `index` posting list, which already encodes occurrence counts as repeated entries.
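An illustrative writer for the optional length-table block; the layout (a 4-byte count, then 8 bytes per doc) matches the `4 + 8 * N` budget above but is not the actual `DiskHeader` format:

```zig
fn writeLengthTable(doc_lengths: *const std.AutoHashMap(u32, u32), writer: anytype) !void {
    try writer.writeInt(u32, doc_lengths.count(), .little); // 4-byte entry count
    var it = doc_lengths.iterator();
    while (it.next()) |entry| {
        try writer.writeInt(u32, entry.key_ptr.*, .little); // doc_id
        try writer.writeInt(u32, entry.value_ptr.*, .little); // token count
    }
}
```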
### Perf and memory budget
- Extra heap state: ~110 KB for `doc_lengths` on a 14k-file repo, ~8 KB on a 1k-file repo; negligible vs. the 11M codedb-linux-x86_64 binary.
- `df_cache`: bounded by the distinct query terms seen in a session; typically <1k entries, ~32 KB.
- Trigram `MAX_POSTINGS=512` stays untouched. We do not need full per-doc tf for trigrams; word-level TF on the `WordIndex` side is sufficient and already exists.
- Per-query cost: O(Σ_t hits(t)) — same as the existing word-index path. Sorting by score adds O(C log C) where C ≤ max_results × #terms.
- Cold-start rebuild of `doc_lengths`: one pass over the `file_words` map, <50ms on 14k files (see the rebuild sketch after this list).
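A sketch of that cold-start pass, assuming `file_words` maps doc_id to a per-file token list (the exact value type is a guess):

```zig
fn rebuildDocLengths(self: *WordIndex, allocator: std.mem.Allocator) !void {
    var lengths = std.AutoHashMap(u32, u32).init(allocator);
    errdefer lengths.deinit();
    self.total_tokens = 0;
    var it = self.file_words.iterator();
    while (it.next()) |entry| {
        const n_tokens: u32 = @intCast(entry.value_ptr.items.len);
        try lengths.put(entry.key_ptr.*, n_tokens);
        self.total_tokens += n_tokens;
    }
    if (self.doc_lengths) |*old| old.deinit(); // drop any stale table
    self.doc_lengths = lengths;
    self.ranking_dirty = false;
}
```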
## Out of scope
- Reranking the trigram-only Tier 1 (no doc-level term info there). Trigrams stay as a candidate filter; ranking happens after candidates are resolved through word-level lookups.
- Cross-language stemming / synonyms.
- Phrase / proximity scoring (BM25F, BM25 with positions): a possible follow-up using the existing `loc_mask` byte in `PostingMask`, but not in this issue.
- Persisting `df_cache` to disk.
- Replacing the default `searchContent` ranking; keep the additive `searchContentRanked` entry point.
## Fix
Implement `searchContentRanked` per the sketch above, wire up the `WordIndex.docLength` / `WordIndex.docFrequency` / `WordIndex.termFrequency` helpers, and turn the failing test green.