Skip to content

feat: allow LSM scanner to read fresh tier without a base table#6749

Merged
westonpace merged 2 commits into
lance-format:mainfrom
hamersaw:feature/wal-lsm-scanner-read-paths
May 13, 2026
Merged

feat: allow LSM scanner to read fresh tier without a base table#6749
westonpace merged 2 commits into
lance-format:mainfrom
hamersaw:feature/wal-lsm-scanner-read-paths

Conversation

@hamersaw
Copy link
Copy Markdown
Contributor

Summary

  • Adds LsmScanner::without_base_table(schema, base_path, snapshots, pk_columns) and LsmDataSourceCollector::without_base_table(base_path, snapshots) so callers can scan only the active memtable and flushed L0 generations.
  • Internally, base_table becomes Option<Arc<Dataset>>; collect() / collect_for_shards() skip the base source when it is absent. All three planners (scan, point lookup, vector search) are unaffected since they already drive off the collector and keep their own base_schema field.
  • Existing LsmScanner::new(...) and LsmDataSourceCollector::new(...) signatures are unchanged, so Python bindings and benches keep compiling without edits.

Motivation

Some callers own the base read path elsewhere and only need the WAL's contribution to a query — the active memtable ∪ L0 flushed generations. Without this change, LsmScanner always pulls the base dataset in, which duplicates work and forces those callers to subtract the base contribution out of band. With the new constructor, the same scanner code path serves both modes; dedup semantics across generations are unchanged.

Test plan

  • cargo test -p lance --lib dataset::mem_wal::scanner — 55/55 passing, including three new tests:
    • test_lsm_scan_without_base_table — verifies the plan excludes any base-table LanceRead and that ids only present in the base are absent from the result.
    • test_lsm_scan_without_base_table_no_flushed_no_active — empty fresh tier returns an empty plan cleanly.
    • test_point_lookup_without_base_table — point lookup hits a flushed generation; missing key returns empty.
    • test_vector_search_without_base_table — plan construction succeeds and excludes the base.
  • cargo clippy -p lance --tests -- -D warnings clean
  • cargo fmt -p lance clean

🤖 Generated with Claude Code

Adds an opt-in constructor on LsmScanner and LsmDataSourceCollector that
omits the base Lance table from the scan. The scanner then reads only the
active memtable and flushed L0 generations, with deduplication semantics
unchanged. Existing new(...) signatures are preserved.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added the enhancement New feature or request label May 12, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 12, 2026

…nner

Adds a filter() integration test for scanners built via without_base_table()
to exercise the schema-driven filter path, and extends the vector-search
without-base-table test to execute the plan so runtime issues surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@westonpace
Copy link
Copy Markdown
Member

@claude review once

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — additive constructors with the existing path preserved and good test coverage across all three planners.

Extended reasoning...

Overview

This PR adds LsmScanner::without_base_table(...) and LsmDataSourceCollector::without_base_table(...) so callers that own the base read path elsewhere can scan only the WAL fresh tier (active memtable ∪ L0 flushed generations). Internally, LsmDataSourceCollector::base_table becomes Option<Arc<Dataset>> and the scanner now carries a BaseSource enum plus an explicit schema field. collect() / collect_for_shards() simply skip the base source when absent; the three planners (scan, point lookup, vector search) already drive off the collector so they need no changes.

Security risks

None. This is internal query-path code with no auth, crypto, permissions, or external-input surface. No new I/O paths or path-resolution behavior beyond what the existing new() path already does (uses caller-supplied base_path for _mem_wal/{shard}/{gen} resolution, same as before).

Level of scrutiny

Low-to-moderate. The PR is purely additive at the public API: existing LsmScanner::new(...) and LsmDataSourceCollector::new(...) signatures are unchanged, so Python bindings (python/src/mem_wal.rs) and benches (rust/lance/benches/mem_wal_*.rs) keep compiling without edits. The one technically-breaking change — LsmDataSourceCollector::base_table() now returns Option<&Arc<Dataset>> — has no callers outside the module (verified via grep). Storing schema on the scanner is equivalent behavior since the base dataset schema doesnt change during a scanners lifetime.

Other factors

  • The bug hunting system found no issues.
  • Test coverage is solid: new tests cover scan (with/without filter, empty fresh tier), point lookup (hit + miss), and vector search plan construction & execution. The plan-shape assertions check that no base/data LanceRead is emitted, which directly validates the intent of the change.
  • Patch coverage 92.5% per codecov; the uncovered lines are in the Debug impl and the PathOnly branches of build_collector that are exercised at runtime by the new tests.
  • The maintainer (westonpace) explicitly requested a review.

Copy link
Copy Markdown
Member

@westonpace westonpace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@westonpace westonpace merged commit abb988f into lance-format:main May 13, 2026
29 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants