feat: allow LSM scanner to read fresh tier without a base table#6749
Conversation
Adds an opt-in constructor on LsmScanner and LsmDataSourceCollector that omits the base Lance table from the scan. The scanner then reads only the active memtable and flushed L0 generations, with deduplication semantics unchanged. Existing new(...) signatures are preserved. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
…nner Adds a filter() integration test for scanners built via without_base_table() to exercise the schema-driven filter path, and extends the vector-search without-base-table test to execute the plan so runtime issues surface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@claude review once |
There was a problem hiding this comment.
LGTM — additive constructors with the existing path preserved and good test coverage across all three planners.
Extended reasoning...
Overview
This PR adds LsmScanner::without_base_table(...) and LsmDataSourceCollector::without_base_table(...) so callers that own the base read path elsewhere can scan only the WAL fresh tier (active memtable ∪ L0 flushed generations). Internally, LsmDataSourceCollector::base_table becomes Option<Arc<Dataset>> and the scanner now carries a BaseSource enum plus an explicit schema field. collect() / collect_for_shards() simply skip the base source when absent; the three planners (scan, point lookup, vector search) already drive off the collector so they need no changes.
Security risks
None. This is internal query-path code with no auth, crypto, permissions, or external-input surface. No new I/O paths or path-resolution behavior beyond what the existing new() path already does (uses caller-supplied base_path for _mem_wal/{shard}/{gen} resolution, same as before).
Level of scrutiny
Low-to-moderate. The PR is purely additive at the public API: existing LsmScanner::new(...) and LsmDataSourceCollector::new(...) signatures are unchanged, so Python bindings (python/src/mem_wal.rs) and benches (rust/lance/benches/mem_wal_*.rs) keep compiling without edits. The one technically-breaking change — LsmDataSourceCollector::base_table() now returns Option<&Arc<Dataset>> — has no callers outside the module (verified via grep). Storing schema on the scanner is equivalent behavior since the base dataset schema doesnt change during a scanners lifetime.
Other factors
- The bug hunting system found no issues.
- Test coverage is solid: new tests cover scan (with/without filter, empty fresh tier), point lookup (hit + miss), and vector search plan construction & execution. The plan-shape assertions check that no
base/dataLanceRead is emitted, which directly validates the intent of the change. - Patch coverage 92.5% per codecov; the uncovered lines are in the
Debugimpl and thePathOnlybranches ofbuild_collectorthat are exercised at runtime by the new tests. - The maintainer (westonpace) explicitly requested a review.
Summary
LsmScanner::without_base_table(schema, base_path, snapshots, pk_columns)andLsmDataSourceCollector::without_base_table(base_path, snapshots)so callers can scan only the active memtable and flushed L0 generations.base_tablebecomesOption<Arc<Dataset>>;collect()/collect_for_shards()skip the base source when it is absent. All three planners (scan, point lookup, vector search) are unaffected since they already drive off the collector and keep their ownbase_schemafield.LsmScanner::new(...)andLsmDataSourceCollector::new(...)signatures are unchanged, so Python bindings and benches keep compiling without edits.Motivation
Some callers own the base read path elsewhere and only need the WAL's contribution to a query — the active memtable ∪ L0 flushed generations. Without this change,
LsmScanneralways pulls the base dataset in, which duplicates work and forces those callers to subtract the base contribution out of band. With the new constructor, the same scanner code path serves both modes; dedup semantics across generations are unchanged.Test plan
cargo test -p lance --lib dataset::mem_wal::scanner— 55/55 passing, including three new tests:test_lsm_scan_without_base_table— verifies the plan excludes any base-tableLanceReadand that ids only present in the base are absent from the result.test_lsm_scan_without_base_table_no_flushed_no_active— empty fresh tier returns an empty plan cleanly.test_point_lookup_without_base_table— point lookup hits a flushed generation; missing key returns empty.test_vector_search_without_base_table— plan construction succeeds and excludes the base.cargo clippy -p lance --tests -- -D warningscleancargo fmt -p lanceclean🤖 Generated with Claude Code