
feat: unify WalAppender into ShardWriter via enable_memtable mode #6675

Merged

jackye1995 merged 6 commits into lance-format:main from touch-of-grey:unify-shard-writer-wal-appender on May 5, 2026

Conversation

@touch-of-grey
Contributor

Summary

  • Add ShardWriterConfig::enable_memtable (default true); when false, ShardWriter runs in a new WAL-only mode that keeps the async batched WAL pipeline but skips MemTable allocation, in-memory indexes, MemTable freezing, and MemTable-bytes backpressure (a separate WAL-only backpressure budget reuses max_unflushed_memtable_bytes).
  • WalAppender becomes the WAL write primitive used by both modes inside WalFlusher, replacing the prior duplicate plain-put path. MemTable mode now also gets atomic put-if-not-exists, conflict retry, and fence-on-write.
  • WalAppender stays public as the lowest-level synchronous-atomic appender. New pub(crate) WalAppender::with_claimed_epoch lets ShardWriter::open claim the epoch once and inject it.
  • WAL-only mode uses a drainable WalOnlyState queue with snapshot/commit semantics so a failed append leaves batches in the queue for retry instead of dropping them silently.
  • memtable_stats(), scan(), active_memtable_ref() now return Result<...> and produce a clear invalid_input error in WAL-only mode.

Behavior change to call out: first WAL entry on a fresh shard is now position 0 instead of 1, matching the 0-based positions documented in the MemTable & WAL spec. The previous flusher seeded its counter from wal_entry_position_last_seen + 1 and so always skipped position 0.

Context: this realizes the layering discussed in #6669 (comment) — keep WalAppender as the low-level primitive and let users always use ShardWriter, with a config switch to turn the MemTable on or off.
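A minimal usage sketch of the new mode (hedged: only `enable_memtable`, `index_configs`, and the Result-returning accessors are from this PR; the `open` signature, `Default` impl, and the `batches` value are illustrative assumptions):

```rust
// Hedged sketch, not the exact API: the open signature, Default impl, and
// `batches` are assumptions; the config flag and its semantics are from this PR.
let config = ShardWriterConfig {
    enable_memtable: false, // WAL-only mode: no MemTable, indexes, or freezing
    index_configs: vec![],  // must be empty in WAL-only mode (validated at open)
    ..Default::default()    // assumed Default impl
};

// Hypothetical open call; the real parameter list may differ.
let writer = ShardWriter::open(object_store, base_path, shard_id, config).await?;

// Writes still flow through the async batched WAL pipeline; backpressure is
// keyed off the pending-queue size rather than MemTable bytes.
writer.put(batches).await?;

// memtable_stats(), scan(), and active_memtable_ref() return a clear
// invalid_input error in this mode instead of panicking.
```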

cc @jackye1995


@claude (Bot) left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions (Bot) added the enhancement label on May 4, 2026
@jackye1995
Contributor

@claude review

ShardWriter is now the single high-level entry point; a new
`ShardWriterConfig::enable_memtable` flag (default true) selects between
the existing MemTable + indexes + Lance-file flushing pipeline and a
new WAL-only mode that keeps the async batched WAL pipeline but skips
MemTable allocation, in-memory indexes, MemTable freezing, and
MemTable-bytes backpressure. WalAppender stays as the lower-level
synchronous-atomic primitive and is now also the WAL write engine
inside `WalFlusher`, replacing the previous duplicate plain-`put`
implementation. As a consequence, MemTable mode also gains atomic
put-if-not-exists, conflict retry, and fence-on-write for free; first
WAL entry on a fresh shard is now position 0 (spec-aligned) instead
of 1.

Discussion: lance-format#6669 (comment)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@touch-of-grey force-pushed the unify-shard-writer-wal-appender branch from 94d760d to 56de709 on May 4, 2026 08:53

@claude (Bot) left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 rust/lance/src/dataset/mem_wal/wal.rs:701-710 — Stats observability regression: wal_stats().next_wal_entry_position reads 0 between open() and the first append on a shard with prior entries, where pre-PR it returned manifest.wal_entry_position_last_seen + 1 immediately on open. The append path itself is correct (lazy discovery still works), but the public WalStats field doc ("Next WAL entry position to be used") doesn't carry the new "best-effort/not-authoritative" caveat, so external monitoring will silently read 0 for reopened shards until the first append lands. Easy fix: seed next_entry_position_hint from manifest.wal_entry_position_last_seen + 1 in WalAppender::with_claimed_epoch (the manifest_store/manifest are already in scope at ShardWriter::open).

    Extended reasoning...

    What changed. Pre-PR, WalFlusher::new took next_wal_entry_position: u64 and stored it in an AtomicU64 initialized to manifest.wal_entry_position_last_seen + 1, so wal_stats().next_wal_entry_position immediately reflected the post-recovery cursor. Post-PR, WalFlusher::next_wal_entry_position() (wal.rs:568) delegates to wal_appender.next_entry_position_hint(), which is initialized to AtomicU64::new(0) in WalAppender::with_claimed_epoch (wal.rs:744) and is only bumped after a successful append (wal.rs:812).

    Where it surfaces. ShardWriter::wal_stats() exposes the value publicly via the WalStats struct in write.rs. The new code's own doc comments at wal.rs:704–706 and wal.rs:770–775 explicitly call this out ("Stays at 0 until the first append discovers the starting position", "not authoritative"). However, the WalStats field itself is documented as "Next WAL entry position to be used", with no "best-effort" caveat — so the public observability contract has silently regressed.

    Step-by-step proof. 1) A shard is closed with WAL entries at positions 0..99; the manifest persists wal_entry_position_last_seen = 99. 2) A new ShardWriter reopens the shard via ShardWriter::open, which constructs WalAppender::with_claimed_epoch and WalFlusher::new(appender). 3) next_entry_position_hint is now AtomicU64::new(0). 4) Caller invokes writer.wal_stats() — returns next_wal_entry_position = 0, indistinguishable from a fresh shard. Pre-PR the same call would have returned 100. 5) Once the first put() triggers WalAppender::append, discover_next_position lazily probes from the manifest hint, the append succeeds at position 100, and the hint catches up — but the window between open and first append is exposed.

    Why the refutation isn't fully load-bearing. The refutation observes that the doc comments next to the field correctly mark it as best-effort. That is true, but the refutation conflates the appender-internal hint (whose docs were updated) with the public WalStats surface (whose docs weren't). A caller reading wal_stats() does not see those caveats; they see a struct field claiming to report the next WAL position. Pre-PR they got the right answer immediately on open; post-PR they get 0 transiently. That is the regression — not in the WAL append path (which is unaffected) but in the public stat.

    Severity. Nit. No correctness impact: the WAL append path uses lazy discovery and writes to the right position. No internal consumer of WalStats depends on the value being authoritative. The window also corrects itself after the first append. But it is a real, observable behavior change for any external monitor reading the field, and the fix is essentially free.

    Suggested fix. In WalAppender::with_claimed_epoch, accept the manifest (or the seed value) and initialize next_entry_position_hint to manifest.wal_entry_position_last_seen + 1 (or 0 if no manifest exists yet). Both ShardWriter::open and WalAppender::open already have the manifest available at the call site, so no extra storage round-trip is needed. Alternatively, eagerly call discover_next_position once during ShardWriter::open and seed the hint from its result.
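A sketch of that seeding, with `ShardManifest` as a stand-in type (only the `wal_entry_position_last_seen` field and the +1 rule are taken from this comment):

```rust
use std::sync::atomic::AtomicU64;

// Stand-in for the real manifest type; only this field is assumed here.
struct ShardManifest {
    wal_entry_position_last_seen: u64,
}

// Compute the initial value for `next_entry_position_hint` at construction
// time, instead of starting at 0 and waiting for the first append to catch up.
fn seed_position_hint(manifest: Option<&ShardManifest>) -> AtomicU64 {
    let seed = manifest
        .map(|m| m.wal_entry_position_last_seen.saturating_add(1))
        .unwrap_or(0); // no manifest yet: a fresh shard starts at position 0
    AtomicU64::new(seed)
}
```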

Comment on lines +184 to +208

/// Whether to maintain an in-memory MemTable on top of the WAL.
///
/// When `true` (default), the writer maintains an in-memory `MemTable`,
/// optionally with indexes, and asynchronously flushes frozen MemTables
/// to Lance files alongside the WAL.
///
/// When `false`, the writer skips the MemTable layer entirely:
/// - No MemTable / BatchStore / IndexStore is allocated.
/// - `index_configs` must be empty (validated at open time).
/// - No MemTable freezing or Lance file flushing happens.
/// - No MemTable-bytes backpressure (`max_unflushed_memtable_bytes` is ignored).
/// - Each WAL flush drains the pending-batch queue, so the in-memory
/// footprint is bounded by `max_wal_buffer_size` /
/// `max_wal_flush_interval`.
/// - The async batched WAL pipeline still runs, driven by the same
/// `max_wal_buffer_size`, `max_wal_flush_interval`, and
/// `durable_write` settings as MemTable mode.
///
/// MemTable-tied tunables (`max_memtable_size`, `max_memtable_rows`,
/// `max_memtable_batches`, `max_unflushed_memtable_bytes`,
/// `ivf_index_partition_capacity_safety_factor`, `sync_indexed_write`,
/// `async_index_buffer_rows`, `async_index_interval`) are ignored when
/// `enable_memtable == false`.
///

🟡 The doc string for enable_memtable says max_unflushed_memtable_bytes is ignored in WAL-only mode (line 195) and lists it among MemTable-tied tunables 'ignored when enable_memtable == false' (lines 203-207), but the implementation reuses it as the WAL-only backpressure budget — put_wal_only feeds WalOnlyState::estimated_size() into a BackpressureController that triggers when bytes exceed max_unflushed_memtable_bytes. A user reading the doc may leave the knob at the 1GB default and be surprised when WAL-only puts start blocking under sustained writes, or unexpectedly throttle WAL-only writers when tuning the knob down for MemTable mode. Fix is doc-only: replace 'is ignored' with 'is reused as the WAL-only pending-queue backpressure budget' and remove it from the ignored-tunables list (the inline comment in open_wal_only_mode already states this contract correctly).

Extended reasoning...

The contradiction. The ShardWriterConfig::enable_memtable doc string in rust/lance/src/dataset/mem_wal/write.rs makes two explicit claims about WAL-only mode:

  • Line 195: "No MemTable-bytes backpressure (`max_unflushed_memtable_bytes` is ignored)."
  • Lines 203-207: lists max_unflushed_memtable_bytes among MemTable-tied tunables that 'are ignored when enable_memtable == false'.

The implementation does the opposite. open_wal_only_mode constructs BackpressureController::new(config.clone()), and the inline comment right above it (lines 1242-1246) explicitly says: 'Reuse BackpressureController (which is keyed off max_unflushed_memtable_bytes) as the WAL-only backpressure budget. WAL-only callers feed it WalOnlyState::estimated_size(). Keeps the config knob meaningful in WAL-only mode and prevents the pending queue from growing unbounded under non-durable writes.' The PR description itself says 'a separate WAL-only backpressure budget reuses max_unflushed_memtable_bytes'. So the author intends the knob to apply, but the doc string says it doesn't.

Why existing code doesn't prevent it. The doc string and the inline comment are in the same file, only ~1050 lines apart, but nothing cross-checks them. The BackpressureController is fundamentally keyed off config.max_unflushed_memtable_bytes (maybe_apply_backpressure at line 650 returns early only when unflushed_memtable_bytes < self.config.max_unflushed_memtable_bytes), and put_wal_only calls it unconditionally. There is no separate WAL-only knob; reusing the existing field is the design.

Step-by-step proof.

  1. A user sets config.enable_memtable = false and reads the doc, which says max_unflushed_memtable_bytes is ignored. They leave it at the default (1GB) without thinking about it.
  2. ShardWriter::open calls open_wal_only_mode, which builds backpressure = BackpressureController::new(config.clone()) — the controller now holds max_unflushed_memtable_bytes = 1GB.
  3. The user issues put calls. put dispatches to put_wal_only, which at line 1389 calls backpressure.maybe_apply_backpressure(|| (state.estimated_size(), None)).await? before doing anything else.
  4. maybe_apply_backpressure (line 650) checks if unflushed_memtable_bytes < self.config.max_unflushed_memtable_bytes. As long as the WAL-only pending queue stays under 1GB, this is a no-op — matching the user's expectation that the knob is ignored.
  5. Under sustained writes (e.g., bursty producer outpacing the WAL flush handler), WalOnlyState::estimated_size() grows past 1GB. The check now fails, the controller enters its wait loop (line 666 onwards), and put blocks until the pending queue drains below the threshold. The user, who was told the knob is ignored, has no idea why their puts are blocking.
  6. Symmetrically: a user on MemTable mode tunes max_unflushed_memtable_bytes down to 16MB to throttle MemTable growth. They later flip enable_memtable = false to get a pure WAL pipeline, expecting the throttle no longer applies. It silently still does — WAL-only writes now block at 16MB pending instead of 1GB.

Impact. This is on a public configuration field on a stable-looking config struct. The doc is what users will read; the inline comment in open_wal_only_mode is implementation-internal. A user who follows the doc will misconfigure their writer, either by leaving an unintended 1GB pending budget or by unintentionally inheriting a tight MemTable-mode throttle. Severity is low because behavior is reasonable in both default cases — the bug is the doc lying about a knob that is in fact load-bearing.
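To make the load-bearing check concrete, a simplified sketch of the gate (only `max_unflushed_memtable_bytes` and the comparison are taken from the code quoted above; everything else is illustrative):

```rust
// Simplified model of the described backpressure gate; not the real types.
struct Config {
    max_unflushed_memtable_bytes: u64, // the reused knob (1 GiB default per above)
}

// In WAL-only mode the bytes fed to the controller are the pending-batch
// queue size (WalOnlyState::estimated_size()), so puts start blocking once
// the queue crosses the same threshold used for MemTable bytes.
fn should_block(pending_wal_bytes: u64, config: &Config) -> bool {
    pending_wal_bytes >= config.max_unflushed_memtable_bytes
}
```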

Fix. Doc-only. On line 195, replace 'No MemTable-bytes backpressure (max_unflushed_memtable_bytes is ignored).' with something like 'WAL-only backpressure on the pending-batch queue is bounded by max_unflushed_memtable_bytes (the same knob is reused as the pending-queue budget).' On lines 203-207, remove max_unflushed_memtable_bytes from the 'ignored' list. This brings the doc in line with the existing inline comment at lines 1242-1246.

@codecov (Bot) commented on May 4, 2026

Codecov Report

❌ Patch coverage is 92.46285% with 71 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| rust/lance/src/dataset/mem_wal/write.rs | 92.82% | 40 Missing and 12 partials ⚠️ |
| rust/lance/src/dataset/mem_wal/wal.rs | 91.24% | 14 Missing and 5 partials ⚠️ |


touch-of-grey and others added 2 commits May 4, 2026 15:25
- Doc fix: `enable_memtable` doc said `max_unflushed_memtable_bytes` is
  ignored in WAL-only mode, but the implementation reuses it as the
  WAL-only pending-queue backpressure budget. Update the doc to match.
- Stats fix: seed `WalAppender::next_entry_position_hint` from
  `manifest.wal_entry_position_last_seen + 1` at construction. Previously
  the hint was initialized at 0 and only updated after the first
  successful append, so `wal_stats().next_wal_entry_position` returned 0
  on a reopened shard between `open()` and the first put — a regression
  from the pre-PR baseline that read the post-recovery cursor
  immediately. Adds `test_wal_stats_seeded_from_manifest_on_reopen`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `ShardWriter::open` always started with an empty MemTable, so
writes that were durable in the WAL but never reached a Lance generation
flush (e.g. because the previous writer crashed) were unreachable
through the writer's read path until they were re-flushed.

This commit wires the replay path the spec describes: on open in
MemTable mode, walk WAL entries from
`manifest.replay_after_wal_entry_position + 1` (or position 0 if no
flushes have happened yet) up to the WAL tip via `WalTailer`, insert
their batches into the freshly-built MemTable, and update any
in-memory indexes accordingly. Each entry's `writer_epoch` is checked
against our `claim_epoch`-returned epoch — a strictly greater epoch
indicates a successor writer claimed the shard mid-open and we abort
with a fence error. After replay, `WalAppender::seed_next_position`
seeds the appender's position counter so the first put writes past the
replayed entries instead of paying the lazy-discovery probe cost.

WAL-only mode is unaffected (no MemTable to rebuild).

Tests:
- recovery: writer A durably writes, drops without close; writer B's
  open replays A's batches.
- no-op: fresh shard opens to empty MemTable.
- fence: a higher-epoch entry injected via direct `WalAppender` causes
  open to fail with a clear `WAL replay aborted ... fenced` error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@touch-of-grey
Contributor Author

@jackye1995 — quick heads-up: there was a previous change I had drafted for doing MemTable replay on ShardWriter::open that I never pushed. Now that we have WalTailer in this PR's layering, we can just use it directly for replay, so I added that along the way as a separate commit.

What it does, in MemTable mode only:

  • On open, after the fresh MemTable + indexes are built, walk WalTailer::read_entry(pos) from manifest.replay_after_wal_entry_position + 1 (or 0 for a shard with no flushed generations yet) up to the tip.
  • For each entry, check entry.writer_epoch <= our_claimed_epoch — strictly greater means a successor claimed the shard between our claim_epoch and replay, so we abort the open with a clear fence error.
  • Insert the entry's batches into the MemTable and feed the resulting StoredBatches to IndexStore::insert_batches_parallel (via spawn_blocking) so any in-memory indexes are rebuilt from the WAL.
  • Seed WalAppender::next_entry_position from highest_replayed + 1 so the first put after open writes past the replayed entries without paying for the lazy-discovery probe.

WAL-only mode is unchanged — no MemTable to rebuild.
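For illustration, a simplified synchronous sketch of the replay loop just described (stand-in types and closures; only `writer_epoch`, the fence rule, and the replay range come from this PR):

```rust
// Stand-in entry type; the real payload is record batches, not raw bytes.
struct WalEntry {
    writer_epoch: u64,
    batches: Vec<Vec<u8>>,
}

// `read_entry` stands in for the async WalTailer::read_entry(pos) call and
// `insert_batches` for the MemTable insert; both are synchronous here.
fn replay_memtable_from_wal(
    read_entry: impl Fn(u64) -> Option<WalEntry>,
    mut insert_batches: impl FnMut(Vec<Vec<u8>>),
    start_position: u64,
    claimed_epoch: u64,
) -> Result<u64, String> {
    let mut pos = start_position;
    while let Some(entry) = read_entry(pos) {
        // A strictly greater epoch means a successor writer claimed the shard
        // between our claim_epoch and replay: abort the open with a fence error.
        if entry.writer_epoch > claimed_epoch {
            return Err(format!("WAL replay aborted at position {pos}: fenced"));
        }
        if !entry.batches.is_empty() {
            insert_batches(entry.batches);
        }
        pos += 1;
    }
    // One past the last entry found, or `start_position` if nothing was there;
    // the caller can seed the appender's next write position from this value.
    Ok(pos)
}
```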

Three new tests cover: recovery (writer A drops without close → writer B's reopened MemTable contains A's rows), no-op on a fresh shard, and fence-on-replay (a high-epoch entry injected via direct WalAppender causes the next opener to fail with WAL replay aborted ... fenced).

All 226 mem_wal tests pass; clippy + fmt clean.

Please take another look.

CI caught two follow-up sites that needed adjusting after `memtable_stats`,
`scan`, and `active_memtable_ref` started returning `Result<...>` and
`ShardWriterConfig` gained the `enable_memtable` field:

- `python/src/mem_wal.rs`: propagate the new errors via `?` in
  `RegionWriter::close` and `RegionWriter::scan`, and flatten the
  match arms in `RegionWriter::memtable_stats` so the active and
  closed branches share one `Result<MemTableStats, lance::Error>`
  type.
- `rust/lance/benches/mem_wal_read.rs`: `.unwrap()` on the two
  `active_memtable_ref().await` call sites.
- `rust/lance/benches/mem_wal_write.rs`: add `enable_memtable` to the
  exhaustive `ShardWriterConfig` literal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions (Bot) added the python label on May 5, 2026
- `mem_wal_write` gains an `ENABLE_MEMTABLE` env knob (yes/no/both)
  so we can compare MemTable-mode and WAL-only-mode write throughput
  side-by-side. WAL-only branches automatically force
  `MEMWAL_MAINTAINED_INDEXES=none` and skip `INDEXED_WRITE=yes` since
  sync-indexed writes require a MemTable.
- New `mem_wal_replay` benchmark in two variants:
  1. `WalTailer::read_entry` throughput — pull N pre-written WAL
     entries off storage end-to-end.
  2. `ShardWriter::open` replay — measure the full replay cost a
     post-crash reopen pays in MemTable mode (tailer reads + MemTable
     inserts).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread rust/lance/src/dataset/mem_wal/write.rs Outdated
shard_id,
manifest_store.clone(),
epoch,
manifest.wal_entry_position_last_seen.saturating_add(1),
Contributor


the next_entry_position_hint should be the max of wal_entry_position_last_seen and replay_after_wal_entry_position, then +1

Contributor Author


Done in 92054be — seed is now max(wal_entry_position_last_seen, replay_after_wal_entry_position) + 1 in both WalAppender::open and ShardWriter::open.
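As a sketch, the seed rule from 92054be (stand-in manifest struct; only the two cursor fields and the max-then-+1 rule are taken from this thread):

```rust
// Stand-in for the real manifest; only these two cursors are assumed.
struct ShardManifest {
    wal_entry_position_last_seen: u64,
    replay_after_wal_entry_position: u64,
}

// Post-recovery cursor: whichever of the two positions is further ahead, plus one.
fn post_recovery_seed(manifest: &ShardManifest) -> u64 {
    manifest
        .wal_entry_position_last_seen
        .max(manifest.replay_after_wal_entry_position)
        .saturating_add(1)
}
```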

Comment thread rust/lance/src/dataset/mem_wal/write.rs Outdated
};

let tailer = WalTailer::new(object_store, base_path, shard_id);
let batches_before = memtable.batch_count();
Contributor


why do we need to record before? shouldn't it always be an empty memtable if we replay?

Contributor Author


Right, fixed in 92054be — dropped the batches_before bookkeeping; the MemTable is freshly built before replay runs so we just index [0, batch_count). Added a debug_assert_eq! for the invariant.

Comment thread rust/lance/src/dataset/mem_wal/write.rs Outdated
if !entry.batches.is_empty() {
memtable.insert_batches_only(entry.batches).await?;
}
highest_replayed = Some(position);
Contributor


at this point, the tailer already found the tip of the WAL. this optional information could be passed to the WAL appender so it can skip trying to re-discover the tip of the WAL when memtable is enabled.

Contributor Author


Done in 92054be — replay_memtable_from_wal now returns the next WAL write position unconditionally (highest replayed + 1, or the original start_position itself when the loop found nothing — that position is provably empty since read_entry just returned None for it). The caller seeds WalAppender::next_entry_position unconditionally, so the first post-open put skips the discovery probe even on shards where replay had nothing to do.

Three small follow-ups from @jackye1995's review:

- Seed `WalAppender::next_entry_position_hint` from
  `max(wal_entry_position_last_seen, replay_after_wal_entry_position) + 1`
  rather than just `wal_entry_position_last_seen + 1`. The two cursors
  can lead each other depending on which was updated last (the
  replay-after position is updated authoritatively at flush, while the
  last-seen hint is best-effort), so the max is the correct
  post-recovery cursor. Same fix in both `WalAppender::open` and
  `ShardWriter::open`.

- Drop the `batches_before = memtable.batch_count()` bookkeeping in
  `replay_memtable_from_wal`. The MemTable is freshly built before
  replay runs, so the BatchStore is empty by construction — index the
  whole `[0, batch_count)` range. Asserts the invariant via
  `debug_assert_eq!`.

- Have `replay_memtable_from_wal` return the next WAL write position
  unconditionally (i.e. one past the highest replayed entry, or
  `start_position` itself when the loop found nothing — that position
  is provably empty since the tailer just returned `None` for it). The
  caller now seeds `WalAppender::next_entry_position` unconditionally,
  so the first put after open always skips the
  `discover_next_position` probe — even on shards where replay had
  nothing to do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jackye1995
Contributor

The failure is unrelated, I will fix it in main.

Contributor

@jackye1995 left a comment


thanks for adding this!

@jackye1995 merged commit e9d1617 into lance-format:main on May 5, 2026
27 of 28 checks passed