refactor(mem_wal): redesign FTS mem index for single-writer multi-reader#6726
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
91c8d65 to
4afab4d
Compare
FTS mem index — performance & memory updateBeyond the original single-writer/multi-reader redesign, this branch now also What changed
Benchmark (real FineWeb corpus, 150 queries, k=10)Against the pre-redesign mem index, at 1M documents:
Against Apache Lucene (same corpus and queries) as a reference point:
Split by query type at 1M docs:
Term queries — the dominant FTS workload — are at or ahead of Lucene on Known follow-upThe blended p95 is dominated by phrase queries (~14 ms at 1M, ~3x Lucene). |
4f93f5a to
5c054cb
Compare
There was a problem hiding this comment.
we should not create a new bench folder, these should be mved under rust/lance/benches. We should probably create a mem_wal folder with specific subfolders to organize different benchmarks at this point
There was a problem hiding this comment.
Done. Moved all the FTS bench files out of the top-level bench/ folder into rust/lance/benches/mem_wal/, with a fts/ subfolder for FTS-specific work:
rust/lance/benches/mem_wal/
run_shard_writer_backpressure.sh
fts/
mem_wal_fts_bench.rs
mem_wal_fineweb_fts.rs
LuceneFtsBench.java
run_fts_compare.sh
run_fineweb_fts.sh
Cargo.toml's [[bench]] entries now point at the new paths via path = .... I kept this scoped to the bench files this PR introduces — the pre-existing flat mem_wal_*.rs benches are left in place to avoid an unrelated churn diff; happy to do a follow-up that consolidates those into mem_wal/ too.
Also scrubbed the driver scripts and bench doc comments of infra-specific references (S3 bucket names, instance paths, AWS region exports) — they now default to tmpdir-based paths and resolve the repo root via git rev-parse.
jackye1995
left a comment
There was a problem hiding this comment.
mostly looks good to me, thanks for iterating this with me, just a final comment regarding the file layout
Replace the SkipMap<(token, row_position)> postings layout with per-term ArcSwap<TermSlice> slices and an ArcSwap<Snapshot>-published per-batch visibility watermark. Readers grab a snapshot, walk per-term chunks filtered by chunk.batch_position < snapshot.visible_count, and score with snapshot-coupled BM25 stats. Writer publishes the snapshot only after every term chunk for a batch is linked, so readers never observe a partial document or BM25 stats out of sync with the postings. Also: tokenize-time tokenizer ownership moves out of a shared Mutex into a tokenizer pool plus a writer-dedicated slot, so search calls do not serialize against the writer; add Utf8View to the supported text types; add memory_usage() for size-based flush triggers; reuse lance-index InvertedIndex's TokenSet/DocSet/PostingListBuilder for the flush path.
Boolean and Boost queries called search_query recursively for each sub-clause, and every leaf (search / search_phrase / search_fuzzy) took its own self.snapshot.load_full(). A writer publishing between sub-queries left the compound result mixing BM25 stats from different snapshots — n, avgdl, and df disagreed across leaves of the same query, so the summed score wasn't valid for any point in time. Snapshot once at search_query / search_with_options and pass the Arc<Snapshot> through every leaf and every recursive call. Public search* entry points keep their signatures and snapshot internally. Also: filter entry_count() by visibility; correct stale doc-comments on Snapshot::batches, TermChunk::positions, and Snapshot::batch_for; hoist test-body 'use std::sync::Arc;' to the test module's top.
New rust/lance/benches/mem_wal_fineweb_fts.rs covers three metrics across the 12 configs in the design doc: - write throughput at memtable sizes 100k / 500k / 1M - MemTable FTS query latency (avg/p50/p95) over 100 high-frequency tokens + 50 sampled phrases - consistency: |memtable_top10 ∩ post_flush_disk_top10| / |union| as a user-approved replacement for recall@k The bench downloads HuggingFaceFW/fineweb sample/10BT shards, caches them, and is fully env-driven so a single binary handles every config. Driver script bench/run_fineweb_fts.sh loops the 12 configs, uploads each result.json to S3, and prints a summary. Also: make `dataset::mem_wal::index` public so the bench can call `FtsMemIndex::search_with_options` directly to time the MemTable read path.
The previous spin on max_indexed_batch_position never terminated when max_memtable_rows triggers auto-flushes during ingest: the counter is reset on each new active memtable, so target_batch_pos = total_batches - 1 is unreachable from the active generation. Close the writer instead — it drains the final WAL flush and any outstanding memtable flush; the inline sync_indexed_write covers the per-put index updates. close() time is included in the measured elapsed so configs with different flush cadences are compared apples-to-apples.
…e bench Rewrites mem_wal_fineweb_fts as a CLI-arg bench modeled on the upstream mem_wal_shard_writer_backpressure bench: same Mode matrix (async/sync × index/no-index), same ShardWriter wiring, JSON output. Payload is real HuggingFace FineWeb text and the maintained index is the in-memory FTS index. Splits the run into independent --phase write|read invocations so a process never holds two ShardWriter lifecycles in sequence — that was the deadlock in the first iteration. The driver runs every config under a timeout watchdog so a hang costs one window, not days.
In async-index modes the FTS index is updated in the background WAL-flush handler, so querying the MemTable right after the put loop hit an empty index — consistency scored 0 because MemTable top-10 was always empty. The read phase now waits for the IndexStore visibility watermark to reach the last buffered batch before querying. Auto-flush is disabled in the read phase so the watermark is a stable, monotonic target. Reports the catchup time as index_catchup_seconds.
The watermark-based catchup wait never completed for async-index mode with auto-flush disabled. The read panel measures FTS read latency and MemTable/on-disk consistency, both of which are properties of the populated index and independent of whether ingestion indexed sync or async. So the read phase now forces sync_indexed_write: the FTS index is fully populated when the put loop returns and the query phase can start immediately, no watermark polling.
The read-phase consistency scored 0 because the 1M-row seed polluted the on-disk comparison: the MemTable FTS index covers only the ingested rows, but `full_text_search` ranked over all 1.1M rows and its top-10 was dominated by seed rows the MemTable never held — disjoint by construction. The read phase now seeds with only READ_SEED_ROWS (1000) rows and prefilters the on-disk query to `id >= READ_SEED_ROWS`, so both the MemTable and on-disk top-10 cover exactly the ingested population and BM25 stats are effectively identical.
…nce dataset The read-phase consistency was unmeasurable: (1) the post-MemWAL-flush on-disk query saw nothing because flushed data lives in the MemWAL LSM structure that a plain Dataset::scan does not traverse, and (2) the manual FtsMemIndex row_position -> id mapping was mis-aligned. The read phase now queries the MemTable through the production MemTableScanner (returns the id column directly, no manual mapping) and compares against a freshly built reference Lance dataset holding the identical ingested rows + a normal FTS index. Both sides cover exactly the same population with identical BM25 stats. Token queries are drawn from mid-frequency terms so the top-10 is well-determined rather than a high-frequency near-tie.
The MemTableScanner snapshots the visibility watermark at plan time, and that watermark only advances as the WAL becomes durable. Without durable_write the read phase queried a partially-visible MemTable: the async-mode read configs scored consistency ~0.03-0.23 while the durable sync-mode ones scored ~0.72-0.75 over identical data. The read phase now forces durable_write on so the scanner always sees the full ingested MemTable.
…e phase Adds paced ingest to the write phase for a backpressure sweep, mirroring the mem_wal_shard_writer_backpressure bench used for the HNSW vector panel. The write phase now reports slow_puts_ge_1s/10s, the memtable + pending-WAL backlog left when the put loop ends, and honors a paced target so a sweep can find the max sustainable async FTS throughput (the rate at which the flush/index pipeline keeps up without accumulating backpressure).
…cation With no --max-memtable-rows the write-phase config defaulted to usize::MAX/2, so max_memtable_batches = max_rows/batch_rows overflowed into a ~9e15 Vec capacity and the writer aborted on a 664 PB allocation. The default is now derived from the byte budget and clamped to 16M rows.
Centralizes the vector and FTS write-backpressure tests on one bench and one driver: the paced-ingest / WAL-queue-sampling / skip-close methodology is identical, and only --index-type selects which column is indexed (vec via IVF/PQ, or text via the inverted index). This makes the FTS backpressure numbers directly comparable to the HNSW vector sweep. The fineweb-shaped `text` column is now generated as varied vocabulary words instead of a single repeated character so the FTS index does representative work; the byte size is unchanged so vector runs (where text is inert payload) are unaffected. run_shard_writer_backpressure.sh drives both index types via INDEX_TYPE.
Mirrors the HNSW-vs-hnswlib comparison for full-text search: - mem_wal_fts_bench.rs — Lance side. `gen` slices a FineWeb corpus and writes the shared inputs (corpus.txt, corpus_tok.txt, queries.txt, exact-BM25 truth.txt); `bench` builds a raw FtsMemIndex and measures build throughput, term/phrase query latency + QPS, and recall@k. - lucene_fts_bench/LuceneFtsBench.java — Lucene reference, in-memory ByteBuffersDirectory + BM25Similarity(1.2,0.75), same inputs, same metrics, same JSON output shape. - run_fts_compare.sh — builds Lucene from the local checkout (JDK 25), builds both benches, runs Run A (pre-tokenized, isolates index + scorer) and Run B (native analyzers) across a 100k/500k/1M sweep, reports both impls side by side plus their mutual top-k overlap.
One Future per query made the multi-thread QPS dominated by thread-pool submit overhead (smoke run: Lucene qps_nt 10k vs Lance 105k for cheap queries). Submit one task per thread, each striding a slice of the queries x reps work — the same shape as the Lance side's rayon loop.
Adds a pub `wand_search` to the inverted module: it builds PostingIterators from caller-supplied PostingLists and runs the block-max WAND algorithm, with a no-op row mask. The WAND machinery was pub(crate); this exposes it for the in-memory FTS MemTable index, whose frozen segments will hold postings as PostingList::Plain and need the same query primitive as the on-disk path. Stage 1 of the segment-structured FtsMemIndex redesign.
Adds `Partition` + `PartitionBuilder` to the FTS mem index: an immutable frozen slice of inserts holding per-term posting lists in the on-disk `PostingList::Plain` form, queried with the shared block-max WAND (`wand_search`). WAND reports partition-local doc ids; `Partition::search` remaps them to MemTable row positions via the partition's `DocSet`. Standalone and dead-code-allowed — wired into `FtsMemIndex` in the next stage. Part of the partition-structured redesign that replaces the O(corpus) per-batch-chunk query layout.
Replaces the per-batch-chunk `FtsMemIndex` layout — where every query walked every chunk of every query term, making latency O(corpus) (p95 ~1.2s at 1M docs) — with Lucene's segment model adapted to a Lance FTS partition. The index is now a set of immutable `Partition`s plus a bounded mutable tail, published together behind one `ArcSwap` (`IndexState`) so a freeze is observed atomically. Each partition holds frozen, on-disk-shaped posting lists queried with block-max WAND in ~O(matches); the writer freezes the tail into a new partition past `freeze_threshold_rows` (50k) and merges partitions past a hard cap. All scoring routes through one corpus-wide `MemBM25Scorer` so partition and tail scores are comparable; the scorer and the scanned tail share a single snapshot to keep BM25 stats consistent. Flush merges every partition and the tail into one `InnerBuilder`. The public query API is unchanged. Exposes `Scorer` from `lance-index`'s inverted module.
Block-max WAND top-k pruning needs per-term `max_score` upper bounds, which in-memory partition posting lists do not carry (`max_score` is `None`). A WAND given a `limit` therefore pruned by a bound it could not compute and silently dropped valid top-k docs — term_recall@10 collapsed from 1.0 to ~0.1 once the corpus exceeded the freeze threshold and partitions appeared. Per-partition WAND now always runs unbounded; the scan is still O(matches) (the partition layout, not WAND pruning, is what makes latency flat), and `search_with_options` truncates to the real limit after merging. Adds a regression test asserting a limited search over many partitions returns the exact global top-k.
Block-max WAND needs per-term `max_score` upper bounds to prune safely. Frozen in-memory partition posting lists carry none (`max_score` is `None`), so even an unbounded WAND mis-pruned and term_recall@10 collapsed from 1.0 to ~0.12 once the corpus crossed the freeze threshold. `Partition::search_match` / `search_phrase` now scan the per-term merged posting list directly: OR accumulates each token's BM25 contribution per doc, phrase intersects from the rarest token and verifies positions. A partition holds one merged list per term, so this is O(matches) — the per-batch-chunk O(corpus) layout was the original problem, not the absence of WAND pruning. All scoring still routes through the shared `MemBM25Scorer`, so partition and tail scores stay identical. Block-max WAND can be reintroduced later once partitions carry precomputed per-term score bounds.
…r slack Partition posting lists were `PlainPostingList`s whose positions were each built through a fresh Arrow `ListBuilder`/`Int32Builder`. Those builders default-allocate 1024-element buffers, so the ~1M+ tiny per-term posting lists across a large index carried ~10+ GB of pure capacity slack — in-memory FTS footprint regressed ~2.8x (29 GB vs 10 GB at 1M docs). Replaces the partition posting representation with `PartitionPosting`: exact-sized plain `Vec`s, partition-local `u32` doc ids (not `u64`), `u32` frequencies (not `f32`), and a CSR position layout. No Arrow builders, no slack, smaller per-element. The OR/AND scan now accumulates into flat arrays indexed by the dense local doc id instead of a per-posting `HashMap`.
A limited query previously scored every matching doc in every partition — O(matches), so p50 grew with the corpus and trailed Lucene by ~24x. Each `PartitionPosting` now carries `(max_freq, min_dl)`, which yields a sound per-term BM25 score upper bound (`doc_weight` is monotone in both). With that bound, `Partition::wand_match` runs block-max WAND: it maintains a top-k heap and skips any doc whose cumulative term upper bounds cannot reach the k-th score. This is exact — the earlier WAND attempt mis-pruned only because the posting lists carried no bound at all. WAND runs for OR queries that carry a result limit (`search_with_options`); unbounded `search` and AND still take the exact direct scan. Adds a test asserting WAND top-k equals the exact full-scan top-k across partitions.
…onary Frozen partitions held uncompressed `Vec<u32>` posting lists plus a per- partition `HashMap<Arc<str>,u32>` term dictionary duplicated across every partition — ~4 GB at 1M docs (~5x Lucene). Postings are now stored in three shared per-partition buffers as 128-doc blocks: VByte + delta doc-id gaps, VByte frequencies, and VByte delta-encoded positions. A `PostingCursor` decodes a block at a time and `skip_to` jumps whole blocks via `BlockMeta` without decoding — so WAND still block-skips and the exact O(matches) scan path is unchanged. Per-term overhead drops to one `PostingRef`. The term dictionary becomes a sorted `Box<[Arc<str>]>` (binary-searched, no hashmap node overhead) whose strings are shared across partitions via an index-wide `TermInterner`. Flush is unaffected — `to_index_builder_reversed` decodes postings through the same cursor, reverses, and rebuilds the Lance FTS on-disk format as before. Adds VByte and multi-block cursor round-trip tests.
VByte's byte-at-a-time decode loop made compressed-partition queries far slower than the uncompressed layout — p95 at 1M docs rose 4.7 ms to 24.7 ms. Doc ids and frequencies in each 128-doc block are now fixed-width bit-packed (doc ids as `doc - first_doc`, both at the block's minimal bit width, recorded in `BlockMeta`). Decoding a block is now a tight shift/mask loop instead of a branchy VByte scan, restoring query latency while keeping the footprint compressed. Positions stay VByte (decoded lazily, off the hot path). Adds a bit-pack round-trip test covering zero/sub-byte/byte-crossing/full widths.
Each partition's WAND previously ran with its own top-k heap, cold-starting the pruning threshold at -inf. At 1M docs (~20 partitions) that re-paid the "fill the heap to raise the threshold" warm-up ~20x, leaving FTS query p50 ~2.5x Lucene even on the uncompressed layout. Limited (top-k) OR searches now feed a single shared `TopK` across the whole index: the tail is scanned first to warm the threshold, then each partition's `wand_into` prunes against the shared, monotonically rising threshold — so a partition processed late skips far more. Still exact: the per-term `doc_weight(max_freq, min_dl)` upper bound is sound, and partitions hold disjoint docs so the shared heap collects the true global top-k. Unbounded `search` and AND keep the exact O(matches) scan.
The per-list WAND bound (`doc_weight(max_freq over the whole posting list, ...)`) is so loose for common terms that the pivot never drops below the threshold — WAND degenerated into a full scan, so neither the shared threshold nor a faster codec could help. Each `BlockMeta` now carries the block's `max_freq`, and `wand_into` is a block-max WAND: the per-list bound still selects the pivot doc, then a per-block bound — summed over all lanes for the blocks covering the pivot — decides whether the whole `[pivot_doc, min block end]` region can be skipped without decoding or scoring it. The bound is sound (block max_freq and term min_dl over-estimate every doc in the block), so the top-k stays exact. `PostingCursor` gains shallow `block_max_freq_for` / `block_end_for` lookups (binary search over `BlockMeta`, no payload decode). Adds a 600-doc/2-partition test asserting the WAND top-k score multiset equals the exact full-scan top-k across several k.
This reverts commit 0aa620c.
Compression cut the in-memory FTS footprint 2.7x but VByte/scalar block decoding regressed query p95 at 1M docs from ~4.7 ms to ~25 ms — and the FineWeb top-k queries turned out to be scan-bound, not prune-bound, so WAND pruning could not recover it. Decode speed is the lever. Full 128-doc posting blocks (doc ids and frequencies) are now packed and unpacked with `bitpacking`'s SIMD `BitPacker4x`; the partial final block of each term keeps the scalar codec. The byte layout and `BlockMeta` widths are unchanged, so the two codecs interoperate per block. Positions stay VByte (decoded lazily, off the hot path). Adds the `bitpacking` crate dependency.
Reports term vs phrase p50/p95 separately so a latency regression can be attributed to one query class instead of the blended percentile.
Diagnosis (term/phrase latency split): term queries were already fast (p95 ~0.6 ms, ahead of Lucene) — the blended p95 of ~25 ms was entirely phrase queries. `search_phrase` calls `positions()` for the candidate docs of the rare driving term, but each call VByte-decoded the *whole* 128-doc block's positions to use one — ~128x wasted work, scattered across blocks. A document's position count equals its term frequency, so no per-doc count needs storing: each block's positions are now one bit-packed delta stream, and doc `i`'s slice is located from the freq prefix sum. `bitpack_read_at` reads that slice directly — phrase decodes one document's positions, never the block. This also drops the per-doc count bytes, so the footprint is slightly smaller. Replaces the VByte position codec entirely.
run_fts_compare.sh selected the bench binary with `sort | tail -1` — a lexical sort by Cargo's build hash. When a dependency change rotates that hash, an older binary can sort last, so the harness silently ran stale code. Now it clears old binaries before building and selects by modification time.
…er API `initialize_mem_wal` is now a builder (`.maintained_indexes(...).execute()`) rather than taking a `MemWalConfig`; update the benchmark accordingly after rebasing onto main.
The FTS MemTable index runs its own block-max WAND over its compressed in-memory posting layout, so the `wand_search` / `WandTerm` entry point added to `lance-index` early in this work is now dead. Revert it and the `wand` re-export — `lance-index` is left with a single one-identifier change: `Scorer` joins `MemBM25Scorer` in the `inverted` re-export so the in-memory index can call the scorer's trait methods.
The `lance` crate gained a `bitpacking` dependency; the separate pylance lockfile must list it so CI's `--locked` lint passes.
…fra references Relocate the FTS bench files out of the top-level bench/ folder into rust/lance/benches/mem_wal/, with an fts/ subfolder for FTS-specific benchmarks and drivers. Replace hard-coded S3 buckets, instance paths, and AWS region exports in the driver scripts and bench doc comments with generic placeholders and tmpdir-based defaults.
3b83a08 to
9f62f87
Compare
Summary
Replaces the
SkipMap<(token, row_position)>postings layout inrust/lance/src/dataset/mem_wal/index/fts.rswith per-termArcSwap<TermSlice>slices and anArcSwap<Snapshot>-publishedper-batch visibility watermark. Readers grab the snapshot, walk per-term
chunks filtered by
chunk.batch_position < snapshot.visible_count, andscore with snapshot-coupled BM25 stats.
The writer publishes the snapshot only after every term chunk for a
batch is linked, so readers never observe a partial document or BM25
stats out of sync with the postings — per-batch monotonic visibility.
Also in this PR:
Mutexinto a small reader pool plus a writer-dedicated slot, so search
calls no longer serialize against the writer.
Utf8Viewis added to the supported text types alongsideUtf8/LargeUtf8; non-string columns now produce a clear error instead ofbeing silently skipped.
FtsMemIndex::memory_usage()for size-based flush triggers.Boolean,Boost) snapshot once atsearch_queryand thread the same
Arc<Snapshot>through every leaf, so acompound query never mixes BM25 stats from different snapshots.
lance_index::scalar::invertedtypes(
TokenSet/DocSet/PostingListBuilder); the on-disk format isunchanged. The stale "flush is not yet implemented" header doc is
corrected.
Public API (
FtsMemIndexconstructors,FtsQueryExpr,SearchOptions,BooleanQueryBuilder,FtsIndexConfig,to_index_builder_reversed)is preserved. No callers of the removed internal
FtsKey/PostingValuetypes existed outside the file.cc @jackye1995 for review.