index: add BM25 ranking for content search #400

@justrach

Description

Problem

Explorer.searchContent (src/explore.zig:1504) returns matches in scan order through a 5-tier pipeline (word, prefix, trigram, sparse, full). The only post-hoc ordering is a per-line occurrence-count tiebreaker at src/explore.zig:1674-1688. This means:

  • A file mentioning every query term many times ranks identically to a file with a single tangential mention.
  • Multi-term queries ("parse Token", "render component state") dump every match without surfacing the most relevant file first.
  • The filename fuzzy-finder (fuzzyScore at src/explore.zig:4881) is the only path with a real relevance score; content search has none.

BM25 over the existing inverted indices would rank documents by a length-normalized tf × idf sum, so the densest, term-rich files surface above noise.
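For reference, the Okapi BM25 score with the k1=1.2, b=0.75 defaults proposed below can be sketched as follows. Python is used for brevity; the dict-based tf/df inputs are illustrative stand-ins, not the Zig structures:

```python
import math

def bm25_score(query_terms, doc_id, tf, df, N, doc_len, avgdl, k1=1.2, b=0.75):
    """Okapi BM25: sum over query terms of idf(t) * saturated, length-normalized tf.

    tf maps (term, doc_id) -> occurrence count; df maps term -> distinct doc count;
    N is the corpus size; avgdl the average document length in tokens.
    """
    score = 0.0
    for t in query_terms:
        if df.get(t, 0) == 0:
            continue  # term absent from the corpus contributes nothing
        idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
        f = tf.get((t, doc_id), 0)
        score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * doc_len / avgdl))
    return score
```

With these defaults, a file hitting every query term many times scores strictly above a file with a single tangential mention, which is exactly the ordering the failing test asserts.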

Failing test

Branch: issue-393-failing-test (HEAD: 214b3c3). See test "issue-393: BM25 ranking surfaces high-density file before single-mention file" in src/tests.zig. Fails on main with:

src/tests.zig:9376:33: error: no field or member function named 'searchContentRanked' in 'explore.Explorer'

The test indexes a dense file (many parse/Token hits), a sparse file (one parse mention), and three noise files, calls searchContentRanked("parse Token", ...), and asserts that the top-ranked result is src/dense.zig with score > 0.

Proposed API

Additive, opt-in. Existing searchContent keeps its scan-order semantics so callers that just want hit listings are unaffected.

pub const RankedResult = SearchResult;  // already has score: f32

pub fn searchContentRanked(
    self: *Explorer,
    query: []const u8,
    allocator: std.mem.Allocator,
    max_results: usize,
) ![]const SearchResult;

Behavior:

  • Tokenizes the query the same way WordIndex tokenizes documents (lowercase + identifier split). Single-token queries fall through to the existing pipeline (with a single term, idf is a constant factor, so BM25 adds little beyond the existing occurrence-count tiebreaker).
  • Multi-term queries: collect candidate documents from WordIndex.search for each term plus TrigramIndex.candidates for the joined literal. Score each candidate doc with BM25 (k1=1.2, b=0.75). Emit one SearchResult per top-N document — line is the highest-tf line for any query term in that doc.
  • MCP wiring (out of scope for this issue but sketched): add rank: "bm25" to codedb_search args; default stays unranked.
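A minimal sketch of the multi-term path described above, again in Python for brevity. Here `postings` stands in for the WordIndex posting lists (one (doc_id, line) entry per occurrence, so repeats encode per-doc tf); trigram candidate merging and the real tokenizer are elided:

```python
import math
import re
from collections import Counter

def search_ranked(query, postings, doc_lengths, max_results=10, k1=1.2, b=0.75):
    """Rank candidate docs by BM25; emit (doc_id, best_line, score) per doc."""
    # Stand-in tokenizer: lowercase + split on non-identifier characters.
    terms = [t for t in re.split(r"[^a-z0-9_]+", query.lower()) if t]
    if len(terms) < 2:
        return []  # single-token queries would fall through to the existing pipeline
    N = len(doc_lengths)
    avgdl = sum(doc_lengths.values()) / N
    candidates = {doc for t in terms for doc, _line in postings.get(t, [])}
    results = []
    for doc in candidates:
        score, best = 0.0, None
        for t in terms:
            hits = [line for d, line in postings.get(t, []) if d == doc]
            if not hits:
                continue
            df = len({d for d, _line in postings[t]})
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            f = len(hits)  # repeated entries give per-doc tf for free
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * doc_lengths[doc] / avgdl))
            line, n = Counter(hits).most_common(1)[0]
            if best is None or n > best[1]:
                best = (line, n)  # highest-tf line for any query term
        results.append((doc, best[0], score))
    results.sort(key=lambda r: -r[2])
    return results[:max_results]
```

The real implementation would union in TrigramIndex.candidates for the joined literal before scoring; the scoring loop itself is unchanged by that.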

Implementation sketch

State (in WordIndex)

The WordIndex.index map already stores (doc_id, line_num) per term — repeated entries for the same (term, doc_id) give us per-doc TF for free. Add three lazy structures:

// All optional — built lazily on the first searchContentRanked call,
// invalidated by indexFile / removeFile.
doc_lengths: ?std.AutoHashMap(u32, u32) = null, // doc_id -> token count
total_tokens: u64 = 0,                          // sum of doc_lengths values
df_cache: ?std.StringHashMap(u32) = null,       // term -> # distinct docs (computed on demand)
ranking_dirty: bool = false,

  • N (doc count): already available via WordIndex.fileCount.
  • avgdl: total_tokens / N.
  • df(term): count distinct doc_ids in index.get(term).items — O(hits) once, cached.
  • tf(term, doc): count entries in posting list with matching doc_id — O(hits/N) on average via a single linear pass; for low-df hot terms this is bounded by the existing 512 MAX_POSTINGS pressure.
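The df/tf derivations above amount to counting passes over the posting list. A Python sketch, with a dict-of-lists standing in for the WordIndex map:

```python
def doc_frequency(postings, term):
    """df: number of distinct doc_ids in the term's posting list.

    One O(hits) pass; the result is what df_cache would memoize per query.
    """
    return len({doc for doc, _line in postings.get(term, [])})

def term_frequency(postings, term, doc_id):
    """tf: repeated (doc_id, line) entries in the posting list encode occurrences,
    so tf is a single linear pass with no extra stored state."""
    return sum(1 for doc, _line in postings.get(term, []) if doc == doc_id)
```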

Persistence

  • doc_lengths adds 4 + 8 * N bytes — for N=14k files that's ~110 KB. Persist alongside the WordIndex on-disk format (DiskHeader bumps to v3, adds an optional length-table block; readers without v3 ignore it and rebuild lazily).
  • df_cache is not persisted — it's a per-query memoization; cold start does one O(hits) pass per term.
  • No per-(term, doc) tf table on disk: tf is recomputed from the existing index posting list, which already encodes occurrence count by repeated entries.
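One hypothetical encoding consistent with the 4 + 8 × N budget above: a u32 entry count followed by (u32 doc_id, u32 token_count) pairs. The actual v3 DiskHeader block layout is up to the implementation; this Python sketch only illustrates the size math:

```python
import struct

def encode_doc_lengths(doc_lengths):
    """Pack doc_lengths as: u32 count, then (u32 doc_id, u32 len) pairs.

    Little-endian, 4 + 8*N bytes total — matching the stated budget.
    """
    out = struct.pack("<I", len(doc_lengths))
    for doc_id, n in sorted(doc_lengths.items()):
        out += struct.pack("<II", doc_id, n)
    return out

def decode_doc_lengths(buf):
    """Inverse of encode_doc_lengths; a pre-v3 reader would skip this block."""
    (count,) = struct.unpack_from("<I", buf, 0)
    return {doc: n for doc, n in struct.iter_unpack("<II", buf[4:4 + 8 * count])}
```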

Perf and memory budget

  • Extra heap state: ~110 KB for doc_lengths on a 14k-file repo, ~8 KB on a 1k-file repo — negligible vs. the 11M codedb-linux-x86_64 binary.
  • df_cache: bounded by distinct query terms across a session — typically <1k entries, ~32 KB.
  • Trigram MAX_POSTINGS=512 stays untouched. We do not need full per-doc tf for trigrams; word-level TF on the WordIndex side is sufficient and already exists.
  • Per-query cost: O(Σ_t hits(t)) — same as the existing word-index path. Sorting by score adds O(C log C) where C ≤ max_results × #terms.
  • Cold-start rebuild of doc_lengths: one pass over file_words map. <50ms on 14k files.

Out of scope

  • Reranking the trigram-only Tier 1 (no doc-level term info there). Trigrams stay as a candidate filter; ranking happens after candidates are resolved through word-level lookups.
  • Cross-language stemming / synonyms.
  • Phrase / proximity scoring (BM25F, BM25 with positions) — possible follow-up using the existing loc_mask byte in PostingMask, but not in this issue.
  • Persisting df_cache to disk.
  • Replacing the default searchContent ranking — keep the additive searchContentRanked entry point.

Fix

Implement searchContentRanked per sketch above, wire WordIndex.docLength / WordIndex.docFrequency / WordIndex.termFrequency helpers, and turn the failing test green.

Metadata

Labels: enhancement (New feature or request)