fix(mem_wal): build secondary indexes when flushing active memtable #6901
Merged
The background memtable flush handler called `flush`, which writes data and the bloom filter but builds no secondary indexes. As a result, point lookups over flushed generations fell back to full scans, and vector search, which uses index-only `fast_search` over flushed generations, could not see the flushed rows at all (a correctness bug, not just a perf regression).

Thread the shard's index configs into `MemTableFlushHandler` and call `flush_with_indexes` when any are configured, so each flushed generation carries the same secondary indexes as the active memtable. Plain `flush` is still used when no indexes are configured to avoid a needless dataset open. The indexed path's future is boxed to keep the flush async block under the type-layout recursion limit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
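The dispatch described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from this PR: `FlushHandler`, `IndexConfig`, `FlushedGeneration`, the method bodies, and the tiny `block_on` helper are all hypothetical stand-ins. Only the overall shape (plain `flush` when no index configs exist, a boxed `flush_with_indexes` future otherwise) follows the description.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for one secondary index configuration of a shard.
#[derive(Clone, Debug)]
struct IndexConfig {
    name: String,
}

/// Hypothetical result of a flush: row count plus the indexes that were built.
#[derive(Debug)]
struct FlushedGeneration {
    rows: usize,
    indexes: Vec<String>,
}

/// Hypothetical handler that receives the shard's index configs.
struct FlushHandler {
    index_configs: Vec<IndexConfig>,
}

impl FlushHandler {
    /// Plain flush: persists data only, builds no secondary indexes.
    async fn flush(&self, rows: usize) -> FlushedGeneration {
        FlushedGeneration { rows, indexes: Vec::new() }
    }

    /// Indexed flush: additionally "builds" one index per config.
    async fn flush_with_indexes(&self, rows: usize) -> FlushedGeneration {
        let indexes = self.index_configs.iter().map(|c| c.name.clone()).collect();
        FlushedGeneration { rows, indexes }
    }

    /// Chooses the flush path based on whether any indexes are configured.
    async fn flush_memtable(&self, rows: usize) -> FlushedGeneration {
        if self.index_configs.is_empty() {
            self.flush(rows).await
        } else {
            // Boxing moves the inner future's state to the heap, so it is not
            // inlined into the state of this async fn.
            Box::pin(self.flush_with_indexes(rows)).await
        }
    }
}

/// Waker that does nothing; enough for futures that never suspend.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, for this sketch only.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

In this sketch, a handler built with an empty `index_configs` returns a generation with no indexes, while a handler with one config returns a generation carrying that index name. A real handler would run on an async runtime rather than on a busy-polling loop.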
jackye1995 approved these changes May 21, 2026
Contributor
cc @touch-of-grey you probably want to rerun the backpressure benchmark after this
touch-of-grey added a commit to touch-of-grey/lance that referenced this pull request May 22, 2026
…ance-format#6901) lance lance-format#6901 makes the memtable flush handler build the shard's maintained secondary indexes on each flushed generation, so the FTS index now exists on every flushed gen without the bench creating it. Remove the manual create_index loop; both scoring modes still use the fast indexed path, and the rescore planner's flat fallback remains for the no-maintained-index case (covered by unit tests).
Summary
The background memtable flush handler (`MemTableFlushHandler::flush_memtable`) called `flush`, which persists the data file and bloom filter but builds no secondary indexes. The handler never even received the shard's index configs.

Two query-side consequences over flushed generations:
- Point lookups (`point_lookup.rs`) run a `filter_expr` scan. Lance can route that through a scalar index, but none existed, so lookups fell back to a full scan (perf regression).
- Vector search (`vector_search.rs`) uses index-only `fast_search()`. Its doc comment assumes "each flushed memtable has its own vector index built during flush", which was false, so flushed vector rows were invisible to KNN. This is a correctness bug, not just perf.

Changes
- Thread `index_configs` into `MemTableFlushHandler`.
- Call `flush_with_indexes` when any indexes are configured so each flushed generation carries the same secondary indexes as the active memtable; keep plain `flush` when none are configured to avoid a needless dataset open.
- Box the `flush_with_indexes` future to keep the flush async block under the type-layout recursion limit.

Testing
- `test_flushed_generation_is_indexed`: writes through the real `ShardWriter` path, forces a flush, and asserts the flushed generation (a) carries the BTree index and (b) resolves `id = 5` via `ScalarIndexQuery` rather than a scan. HNSW/FTS flush and `fast_search` over an indexed flushed generation are already covered by existing tests.
- `cargo test -p lance --lib dataset::mem_wal::`: 316 passed, 0 failed.
- `cargo fmt --all` and `cargo clippy -p lance --tests -- -D warnings`: clean.

🤖 Generated with Claude Code
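To illustrate why the vector-search case in the summary is a correctness bug rather than a slowdown, here is a toy model. Everything in it is an assumption made for illustration: `Generation`, `index_only_search`, and `flat_search` are hypothetical, vectors are reduced to a single `f32`, and none of this mirrors Lance's actual search code. It only shows that an index-only search silently skips generations that have no index, whereas a flat scan still sees them.

```rust
/// Hypothetical flushed generation: (row id, 1-D "vector") pairs plus a flag
/// saying whether a vector index was built for it at flush time.
struct Generation {
    rows: Vec<(u64, f32)>,
    indexed: bool,
}

/// Returns the id of the row closest to `query` among the given generations.
fn nearest<'a>(gens: impl Iterator<Item = &'a Generation>, query: f32) -> Option<u64> {
    gens.flat_map(|g| g.rows.iter())
        .min_by(|a, b| (a.1 - query).abs().total_cmp(&(b.1 - query).abs()))
        .map(|(id, _)| *id)
}

/// Index-only search: generations without an index contribute no candidates.
fn index_only_search(gens: &[Generation], query: f32) -> Option<u64> {
    nearest(gens.iter().filter(|g| g.indexed), query)
}

/// Flat search: scans every row of every generation, indexed or not.
fn flat_search(gens: &[Generation], query: f32) -> Option<u64> {
    nearest(gens.iter(), query)
}
```

With one indexed generation holding row 1 at `0.0` and one unindexed generation holding row 2 at `5.0`, a query for `5.0` returns row 1 from `index_only_search` and row 2 from `flat_search`. Once the second generation is marked as indexed, both agree on row 2, which is the situation the flush change is meant to guarantee.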