feat: unify WalAppender into ShardWriter via enable_memtable mode #6675
Conversation
@claude review
ShardWriter is now the single high-level entry point; a new `ShardWriterConfig::enable_memtable` flag (default true) selects between the existing MemTable + indexes + Lance-file flushing pipeline and a new WAL-only mode that keeps the async batched WAL pipeline but skips MemTable allocation, in-memory indexes, MemTable freezing, and MemTable-bytes backpressure. WalAppender stays as the lower-level synchronous-atomic primitive and is now also the WAL write engine inside `WalFlusher`, replacing the previous duplicate plain-`put` implementation. As a consequence, MemTable mode also gains atomic put-if-not-exists, conflict retry, and fence-on-write for free; first WAL entry on a fresh shard is now position 0 (spec-aligned) instead of 1. Discussion: lance-format#6669 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Additional findings (outside current diff — PR may have been updated during review):

🟡 rust/lance/src/dataset/mem_wal/wal.rs:701-710 — Stats observability regression: `wal_stats().next_wal_entry_position` reads 0 between `open()` and the first append on a shard with prior entries, where pre-PR it returned `manifest.wal_entry_position_last_seen + 1` immediately on open. The append path itself is correct (lazy discovery still works), but the public `WalStats` field doc ("Next WAL entry position to be used") doesn't carry the new "best-effort/not-authoritative" caveat, so external monitoring will silently read 0 for reopened shards until the first append lands. Easy fix: seed `next_entry_position_hint` from `manifest.wal_entry_position_last_seen + 1` in `WalAppender::with_claimed_epoch` (the manifest_store/manifest are already in scope at `ShardWriter::open`).

Extended reasoning...
What changed. Pre-PR, `WalFlusher::new` took `next_wal_entry_position: u64` and stored it in an `AtomicU64` initialized to `manifest.wal_entry_position_last_seen + 1`, so `wal_stats().next_wal_entry_position` immediately reflected the post-recovery cursor. Post-PR, `WalFlusher::next_wal_entry_position()` (wal.rs:568) delegates to `wal_appender.next_entry_position_hint()`, which is initialized to `AtomicU64::new(0)` in `WalAppender::with_claimed_epoch` (wal.rs:744) and is only bumped after a successful append (wal.rs:812).

Where it surfaces. `ShardWriter::wal_stats()` exposes the value publicly via the `WalStats` struct in write.rs. The new code's own doc comments at wal.rs:704-706 and wal.rs:770-775 explicitly call this out ("Stays at 0 until the first append discovers the starting position", "not authoritative"). However, the `WalStats` field itself is documented as "Next WAL entry position to be used", with no "best-effort" caveat — so the public observability contract has silently regressed.

Step-by-step proof. 1) A shard is closed with WAL entries at positions 0..99; the manifest persists `wal_entry_position_last_seen = 99`. 2) A new `ShardWriter` reopens the shard via `ShardWriter::open`, which constructs `WalAppender::with_claimed_epoch` and `WalFlusher::new(appender)`. 3) `next_entry_position_hint` is now `AtomicU64::new(0)`. 4) Caller invokes `writer.wal_stats()` — it returns `next_wal_entry_position = 0`, indistinguishable from a fresh shard. Pre-PR the same call would have returned 100. 5) Once the first `put()` triggers `WalAppender::append`, `discover_next_position` lazily probes from the manifest hint, the append succeeds at position 100, and the hint catches up — but the window between open and first append is exposed.

Why the refutation isn't fully load-bearing. The refutation observes that the doc comments next to the field correctly mark it as best-effort. That is true, but the refutation conflates the appender-internal hint (whose docs were updated) with the public `WalStats` surface (whose docs weren't). A caller reading `wal_stats()` does not see those caveats; they see a struct field claiming to report the next WAL position. Pre-PR they got the right answer immediately on open; post-PR they get 0 transiently. That is the regression — not in the WAL append path (which is unaffected) but in the public stat.

Severity. Nit. No correctness impact: the WAL append path uses lazy discovery and writes to the right position. No internal consumer of `WalStats` depends on the value being authoritative. The window also corrects itself after the first append. But it is a real, observable behavior change for any external monitor reading the field, and the fix is essentially free.

Suggested fix. In `WalAppender::with_claimed_epoch`, accept the manifest (or the seed value) and initialize `next_entry_position_hint` to `manifest.wal_entry_position_last_seen + 1` (or 0 if no manifest exists yet). Both `ShardWriter::open` and `WalAppender::open` already have the manifest available at the call site, so no extra storage round-trip is needed. Alternatively, eagerly call `discover_next_position` once during `ShardWriter::open` and seed the hint from its result.
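The suggested seeding can be sketched in a few lines. This is a minimal, self-contained stand-in — the `Manifest` struct here is hypothetical, and the real `WalAppender::with_claimed_epoch` signature may differ; only the seeding rule itself comes from the discussion above:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the shard manifest (assumption: only the
// `wal_entry_position_last_seen` field matters for this sketch).
struct Manifest {
    wal_entry_position_last_seen: u64,
}

// Seed the appender's position hint at construction instead of starting
// at 0: a reopened shard immediately reports the post-recovery cursor,
// while a fresh shard (no manifest yet) starts at 0.
fn seed_hint(manifest: Option<&Manifest>) -> AtomicU64 {
    let seed = manifest
        .map(|m| m.wal_entry_position_last_seen.saturating_add(1))
        .unwrap_or(0);
    AtomicU64::new(seed)
}

fn main() {
    let reopened = Manifest { wal_entry_position_last_seen: 99 };
    // Reopened shard: stats read 100 right away, not a transient 0.
    assert_eq!(seed_hint(Some(&reopened)).load(Ordering::Relaxed), 100);
    // Fresh shard: 0, matching the spec's 0-based first entry.
    assert_eq!(seed_hint(None).load(Ordering::Relaxed), 0);
}
```

Lazy discovery on the first append still runs, so the seed only needs to be a best-effort hint — stale values are corrected by `discover_next_position`.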
```rust
/// Whether to maintain an in-memory MemTable on top of the WAL.
///
/// When `true` (default), the writer maintains an in-memory `MemTable`,
/// optionally with indexes, and asynchronously flushes frozen MemTables
/// to Lance files alongside the WAL.
///
/// When `false`, the writer skips the MemTable layer entirely:
/// - No MemTable / BatchStore / IndexStore is allocated.
/// - `index_configs` must be empty (validated at open time).
/// - No MemTable freezing or Lance file flushing happens.
/// - No MemTable-bytes backpressure (`max_unflushed_memtable_bytes` is ignored).
/// - Each WAL flush drains the pending-batch queue, so the in-memory
///   footprint is bounded by `max_wal_buffer_size` /
///   `max_wal_flush_interval`.
/// - The async batched WAL pipeline still runs, driven by the same
///   `max_wal_buffer_size`, `max_wal_flush_interval`, and
///   `durable_write` settings as MemTable mode.
///
/// MemTable-tied tunables (`max_memtable_size`, `max_memtable_rows`,
/// `max_memtable_batches`, `max_unflushed_memtable_bytes`,
/// `ivf_index_partition_capacity_safety_factor`, `sync_indexed_write`,
/// `async_index_buffer_rows`, `async_index_interval`) are ignored when
/// `enable_memtable == false`.
```
🟡 The doc string for `enable_memtable` says `max_unflushed_memtable_bytes` is ignored in WAL-only mode (line 195) and lists it among MemTable-tied tunables "ignored when `enable_memtable == false`" (lines 203-207), but the implementation reuses it as the WAL-only backpressure budget — `put_wal_only` feeds `WalOnlyState::estimated_size()` into a `BackpressureController` that triggers when bytes exceed `max_unflushed_memtable_bytes`. A user reading the doc may leave the knob at the 1GB default and be surprised when WAL-only puts start blocking under sustained writes, or unexpectedly throttle WAL-only writers when tuning the knob down for MemTable mode. Fix is doc-only: replace "is ignored" with "is reused as the WAL-only pending-queue backpressure budget" and remove it from the ignored-tunables list (the inline comment in `open_wal_only_mode` already states this contract correctly).
Extended reasoning...
The contradiction. The `ShardWriterConfig::enable_memtable` doc string in rust/lance/src/dataset/mem_wal/write.rs makes two explicit claims about WAL-only mode:
- Line 195: "No MemTable-bytes backpressure (`max_unflushed_memtable_bytes` is ignored)."
- Lines 203-207: lists `max_unflushed_memtable_bytes` among MemTable-tied tunables that "are ignored when `enable_memtable == false`".
The implementation does the opposite. `open_wal_only_mode` constructs `BackpressureController::new(config.clone())`, and the inline comment right above it (lines 1242-1246) explicitly says: "Reuse BackpressureController (which is keyed off max_unflushed_memtable_bytes) as the WAL-only backpressure budget. WAL-only callers feed it WalOnlyState::estimated_size(). Keeps the config knob meaningful in WAL-only mode and prevents the pending queue from growing unbounded under non-durable writes." The PR description itself says "a separate WAL-only backpressure budget reuses max_unflushed_memtable_bytes". So the author intends the knob to apply, but the doc string says it doesn't.

Why existing code doesn't prevent it. The doc string and the inline comment are in the same file, only ~1050 lines apart, but nothing cross-checks them. The `BackpressureController` is fundamentally keyed off `config.max_unflushed_memtable_bytes` (`maybe_apply_backpressure` at line 650 returns early only when `unflushed_memtable_bytes < self.config.max_unflushed_memtable_bytes`), and `put_wal_only` calls it unconditionally. There is no separate WAL-only knob; reusing the existing field is the design.
Step-by-step proof.
- A user sets `config.enable_memtable = false` and reads the doc, which says `max_unflushed_memtable_bytes` is ignored. They leave it at the default (1GB) without thinking about it. `ShardWriter::open` calls `open_wal_only_mode`, which builds `backpressure = BackpressureController::new(config.clone())` — the controller now holds `max_unflushed_memtable_bytes = 1GB`.
- The user issues `put` calls. `put` dispatches to `put_wal_only`, which at line 1389 calls `backpressure.maybe_apply_backpressure(|| (state.estimated_size(), None)).await?` before doing anything else. `maybe_apply_backpressure` (line 650) checks `if unflushed_memtable_bytes < self.config.max_unflushed_memtable_bytes`. As long as the WAL-only pending queue stays under 1GB, this is a no-op — matching the user's expectation that the knob is ignored.
- Under sustained writes (e.g., a bursty producer outpacing the WAL flush handler), `WalOnlyState::estimated_size()` grows past 1GB. The check now fails, the controller enters its wait loop (line 666 onwards), and `put` blocks until the pending queue drains below the threshold. The user, who was told the knob is ignored, has no idea why their puts are blocking.
- Symmetrically: a user in MemTable mode tunes `max_unflushed_memtable_bytes` down to 16MB to throttle MemTable growth. They later flip `enable_memtable = false` to get a pure WAL pipeline, expecting the throttle no longer applies. It silently still does — WAL-only writes now block at 16MB pending instead of 1GB.
Impact. This is on a public configuration field on a stable-looking config struct. The doc is what users will read; the inline comment in open_wal_only_mode is implementation-internal. A user who follows the doc will misconfigure their writer, either by leaving an unintended 1GB pending budget or by unintentionally inheriting a tight MemTable-mode throttle. Severity is low because behavior is reasonable in both default cases — the bug is the doc lying about a knob that is in fact load-bearing.
Fix. Doc-only. On line 195, replace 'No MemTable-bytes backpressure (max_unflushed_memtable_bytes is ignored).' with something like 'WAL-only backpressure on the pending-batch queue is bounded by max_unflushed_memtable_bytes (the same knob is reused as the pending-queue budget).' On lines 203-207, remove max_unflushed_memtable_bytes from the 'ignored' list. This brings the doc in line with the existing inline comment at lines 1242-1246.
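The gate the proof above walks through reduces to a single comparison. A tiny sketch of that predicate (function names here are stand-ins; the real `BackpressureController` waits and retries rather than returning a bool):

```rust
// Stand-in for the one check the controller is keyed off: backpressure
// engages once the pending bytes reach the configured budget.
fn backpressure_engaged(pending_bytes: u64, max_unflushed_memtable_bytes: u64) -> bool {
    pending_bytes >= max_unflushed_memtable_bytes
}

fn main() {
    const GIB: u64 = 1 << 30;
    // Default 1 GiB budget: WAL-only puts pass while under the budget...
    assert!(!backpressure_engaged(512 << 20, GIB));
    // ...but block once the pending queue crosses it, even though the
    // doc claimed the knob was ignored in WAL-only mode.
    assert!(backpressure_engaged(GIB + 1, GIB));
    // A MemTable-mode tuning (16 MiB) silently carries into WAL-only mode.
    assert!(backpressure_engaged(32 << 20, 16 << 20));
}
```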
- Doc fix: `enable_memtable` doc said `max_unflushed_memtable_bytes` is ignored in WAL-only mode, but the implementation reuses it as the WAL-only pending-queue backpressure budget. Update the doc to match. - Stats fix: seed `WalAppender::next_entry_position_hint` from `manifest.wal_entry_position_last_seen + 1` at construction. Previously the hint was initialized at 0 and only updated after the first successful append, so `wal_stats().next_wal_entry_position` returned 0 on a reopened shard between `open()` and the first put — a regression from the pre-PR baseline that read the post-recovery cursor immediately. Adds `test_wal_stats_seeded_from_manifest_on_reopen`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `ShardWriter::open` always started with an empty MemTable, so writes that were durable in the WAL but never reached a Lance generation flush (e.g. because the previous writer crashed) were unreachable through the writer's read path until they were re-flushed. This commit wires the replay path the spec describes: on open in MemTable mode, walk WAL entries from `manifest.replay_after_wal_entry_position + 1` (or position 0 if no flushes have happened yet) up to the WAL tip via `WalTailer`, insert their batches into the freshly-built MemTable, and update any in-memory indexes accordingly. Each entry's `writer_epoch` is checked against our `claim_epoch`-returned epoch — a strictly greater epoch indicates a successor writer claimed the shard mid-open and we abort with a fence error. After replay, `WalAppender::seed_next_position` seeds the appender's position counter so the first put writes past the replayed entries instead of paying the lazy-discovery probe cost. WAL-only mode is unaffected (no MemTable to rebuild). Tests: - recovery: writer A durably writes, drops without close; writer B's open replays A's batches. - no-op: fresh shard opens to empty MemTable. - fence: a higher-epoch entry injected via direct `WalAppender` causes open to fail with a clear `WAL replay aborted ... fenced` error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
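The replay-on-open flow described in this commit can be sketched with simplified stand-in types (`WalEntry` and a `Vec` in place of the real MemTable are assumptions; the real code drives `WalTailer` and uses the epoch returned by `claim_epoch`):

```rust
struct WalEntry {
    position: u64,
    writer_epoch: u64,
    batches: Vec<&'static str>, // stand-in for Arrow record batches
}

// Walk entries from `start_position` to the tip, inserting batches into
// the (freshly built, empty) memtable. Returns the next WAL write
// position: one past the highest replayed entry, or `start_position`
// itself when nothing was found.
fn replay(
    entries: &[WalEntry],
    start_position: u64,
    our_epoch: u64,
    memtable: &mut Vec<&'static str>,
) -> Result<u64, String> {
    let mut next = start_position;
    for e in entries.iter().filter(|e| e.position >= start_position) {
        // A strictly greater epoch means a successor writer claimed the
        // shard mid-open: abort with a fence error.
        if e.writer_epoch > our_epoch {
            return Err(format!("WAL replay aborted at {}: fenced", e.position));
        }
        if !e.batches.is_empty() {
            memtable.extend(e.batches.iter().copied());
        }
        next = e.position + 1;
    }
    Ok(next)
}

fn main() {
    let entries = vec![
        WalEntry { position: 3, writer_epoch: 1, batches: vec!["a"] },
        WalEntry { position: 4, writer_epoch: 1, batches: vec!["b", "c"] },
    ];
    let mut memtable = Vec::new();
    // Post-crash reopen: both entries land in the memtable and the
    // appender is seeded to write at position 5.
    assert_eq!(replay(&entries, 3, 1, &mut memtable), Ok(5));
    assert_eq!(memtable, vec!["a", "b", "c"]);
    // A higher-epoch entry fences the replaying writer.
    let fenced = vec![WalEntry { position: 5, writer_epoch: 2, batches: vec![] }];
    assert!(replay(&fenced, 5, 1, &mut Vec::new()).is_err());
}
```

The return value is what lets the caller seed the appender unconditionally: when the loop finds nothing, `start_position` itself is provably empty because the tailer just returned no entry for it.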
@jackye1995 — quick heads-up: there was a previous change I had drafted for doing MemTable replay on open. What it does, in MemTable mode only:
WAL-only mode is unchanged — no MemTable to rebuild. Three new tests cover: recovery (writer A drops without close → writer B's reopened MemTable contains A's rows), no-op on a fresh shard, and fence-on-replay (a high-epoch entry injected via direct `WalAppender`). All 226 mem_wal tests pass; clippy + fmt clean. Please take another look.
CI caught two follow-up sites that needed adjusting after `memtable_stats`, `scan`, and `active_memtable_ref` started returning `Result<...>` and `ShardWriterConfig` gained the `enable_memtable` field: - `python/src/mem_wal.rs`: propagate the new errors via `?` in `RegionWriter::close` and `RegionWriter::scan`, and flatten the match arms in `RegionWriter::memtable_stats` so the active and closed branches share one `Result<MemTableStats, lance::Error>` type. - `rust/lance/benches/mem_wal_read.rs`: `.unwrap()` on the two `active_memtable_ref().await` call sites. - `rust/lance/benches/mem_wal_write.rs`: add `enable_memtable` to the exhaustive `ShardWriterConfig` literal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `mem_wal_write` gains an `ENABLE_MEMTABLE` env knob (yes/no/both)
so we can compare MemTable-mode and WAL-only-mode write throughput
side-by-side. WAL-only branches automatically force
`MEMWAL_MAINTAINED_INDEXES=none` and skip `INDEXED_WRITE=yes` since
sync-indexed writes require a MemTable.
- New `mem_wal_replay` benchmark in two variants:
1. `WalTailer::read_entry` throughput — pull N pre-written WAL
entries off storage end-to-end.
2. `ShardWriter::open` replay — measure the full replay cost a
post-crash reopen pays in MemTable mode (tailer reads + MemTable
inserts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```rust
shard_id,
manifest_store.clone(),
epoch,
manifest.wal_entry_position_last_seen.saturating_add(1),
```
the `next_entry_position_hint` should be the max of `wal_entry_position_last_seen` and `replay_after_wal_entry_position`, then +1
Done in 92054be — the seed is now `max(wal_entry_position_last_seen, replay_after_wal_entry_position) + 1` in both `WalAppender::open` and `ShardWriter::open`.
```rust
};
// ...
let tailer = WalTailer::new(object_store, base_path, shard_id);
let batches_before = memtable.batch_count();
```
why do we need to record before? shouldn't it always be an empty memtable if we replay?
Right, fixed in 92054be — dropped the batches_before bookkeeping; the MemTable is freshly built before replay runs so we just index [0, batch_count). Added a debug_assert_eq! for the invariant.
```rust
if !entry.batches.is_empty() {
    memtable.insert_batches_only(entry.batches).await?;
}
highest_replayed = Some(position);
```
at this point, the tailer already found the tip of the WAL. this optional information could be passed to the WAL appender so it can skip trying to re-discover the tip of the WAL when memtable is enabled.
Done in 92054be — replay_memtable_from_wal now returns the next WAL write position unconditionally (highest replayed + 1, or the original start_position itself when the loop found nothing — that position is provably empty since read_entry just returned None for it). The caller seeds WalAppender::next_entry_position unconditionally, so the first post-open put skips the discovery probe even on shards where replay had nothing to do.
Three small follow-ups from @jackye1995's review: - Seed `WalAppender::next_entry_position_hint` from `max(wal_entry_position_last_seen, replay_after_wal_entry_position) + 1` rather than just `wal_entry_position_last_seen + 1`. The two cursors can lead each other depending on which was updated last (the replay-after position is updated authoritatively at flush, while the last-seen hint is best-effort), so the max is the correct post-recovery cursor. Same fix in both `WalAppender::open` and `ShardWriter::open`. - Drop the `batches_before = memtable.batch_count()` bookkeeping in `replay_memtable_from_wal`. The MemTable is freshly built before replay runs, so the BatchStore is empty by construction — index the whole `[0, batch_count)` range. Asserts the invariant via `debug_assert_eq!`. - Have `replay_memtable_from_wal` return the next WAL write position unconditionally (i.e. one past the highest replayed entry, or `start_position` itself when the loop found nothing — that position is provably empty since the tailer just returned `None` for it). The caller now seeds `WalAppender::next_entry_position` unconditionally, so the first put after open always skips the `discover_next_position` probe — even on shards where replay had nothing to do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
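The cursor-seeding rule from the first follow-up, as a tiny sketch (the two field names come from the discussion above; the wrapping function is hypothetical):

```rust
// The two cursors can lead each other: `replay_after` is updated
// authoritatively at flush, `last_seen` is a best-effort hint. The
// correct post-recovery cursor is one past whichever is ahead.
fn post_recovery_cursor(
    wal_entry_position_last_seen: u64,
    replay_after_wal_entry_position: u64,
) -> u64 {
    wal_entry_position_last_seen
        .max(replay_after_wal_entry_position)
        .saturating_add(1)
}

fn main() {
    // last-seen ahead (appends since the last flush)
    assert_eq!(post_recovery_cursor(99, 40), 100);
    // replay-after ahead (flush advanced it past a stale hint)
    assert_eq!(post_recovery_cursor(40, 99), 100);
    // both at 0: entry 0 exists, so the next write goes to 1
    assert_eq!(post_recovery_cursor(0, 0), 1);
}
```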
The failure is unrelated, I will fix it in main.
jackye1995
left a comment
thanks for adding this!
Summary
- `ShardWriterConfig::enable_memtable` (default `true`); when `false`, `ShardWriter` runs in a new WAL-only mode that keeps the async batched WAL pipeline but skips MemTable allocation, in-memory indexes, MemTable freezing, and MemTable-bytes backpressure (a separate WAL-only backpressure budget reuses `max_unflushed_memtable_bytes`).
- `WalAppender` becomes the WAL write primitive used by both modes inside `WalFlusher`, replacing the prior duplicate plain-`put` path. MemTable mode now also gets atomic put-if-not-exists, conflict retry, and fence-on-write.
- `WalAppender` stays public as the lowest-level synchronous-atomic appender. New `pub(crate) WalAppender::with_claimed_epoch` lets `ShardWriter::open` claim the epoch once and inject it.
- `WalOnlyState` queue with snapshot/commit semantics, so a failed append leaves batches in the queue for retry instead of dropping them silently.
- `memtable_stats()`, `scan()`, `active_memtable_ref()` now return `Result<...>` and produce a clear `invalid_input` error in WAL-only mode.

Behavior change to call out: the first WAL entry on a fresh shard is now position 0 instead of 1, matching the 0-based positions documented in the MemTable & WAL spec. The previous flusher seeded its counter from `wal_entry_position_last_seen + 1` and so always skipped position 0.

Context: this realizes the layering discussed in #6669 (comment) — keep `WalAppender` as the low-level primitive and let users always use `ShardWriter`, with a config switch to turn the MemTable on or off.

cc @jackye1995
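The snapshot/commit semantics in the `WalOnlyState` bullet can be sketched with a toy queue (the type and method names here are assumptions; only the retry-on-failure behavior comes from the summary):

```rust
// Pending batches stay queued until the append durably succeeds; a
// failed append commits nothing, so the batches remain for retry.
struct PendingQueue<T> {
    queue: Vec<T>,
}

impl<T: Clone> PendingQueue<T> {
    fn new() -> Self {
        Self { queue: Vec::new() }
    }
    fn push(&mut self, item: T) {
        self.queue.push(item);
    }
    // Snapshot what we are about to append without removing it.
    fn snapshot(&self) -> Vec<T> {
        self.queue.clone()
    }
    // Commit only after the append succeeded: drop the appended prefix.
    fn commit(&mut self, appended: usize) {
        self.queue.drain(..appended);
    }
}

fn main() {
    let mut q = PendingQueue::new();
    q.push("batch-1");
    q.push("batch-2");
    let snap = q.snapshot();
    assert_eq!(snap.len(), 2);
    // Simulated append failure: no commit, both batches still pending.
    assert_eq!(q.queue.len(), 2);
    // Successful retry: commit the snapshot, queue drains.
    q.commit(snap.len());
    assert!(q.queue.is_empty());
}
```

The point of snapshotting rather than draining up front is that the failure path needs no compensation logic: nothing was removed, so a retry simply re-snapshots.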