feat: unify WalAppender into ShardWriter via enable_memtable mode #6675
Conversation
@claude review
ShardWriter is now the single high-level entry point; a new `ShardWriterConfig::enable_memtable` flag (default true) selects between the existing MemTable + indexes + Lance-file flushing pipeline and a new WAL-only mode that keeps the async batched WAL pipeline but skips MemTable allocation, in-memory indexes, MemTable freezing, and MemTable-bytes backpressure. WalAppender stays as the lower-level synchronous-atomic primitive and is now also the WAL write engine inside `WalFlusher`, replacing the previous duplicate plain-`put` implementation. As a consequence, MemTable mode also gains atomic put-if-not-exists, conflict retry, and fence-on-write for free; first WAL entry on a fresh shard is now position 0 (spec-aligned) instead of 1. Discussion: lance-format#6669 (comment) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Additional findings (outside current diff — PR may have been updated during review):

🟡 rust/lance/src/dataset/mem_wal/wal.rs:701-710 — Stats observability regression: `wal_stats().next_wal_entry_position` reads 0 between `open()` and the first append on a shard with prior entries, where pre-PR it returned `manifest.wal_entry_position_last_seen + 1` immediately on open. The append path itself is correct (lazy discovery still works), but the public `WalStats` field doc ("Next WAL entry position to be used") doesn't carry the new "best-effort/not-authoritative" caveat, so external monitoring will silently read 0 for reopened shards until the first append lands. Easy fix: seed `next_entry_position_hint` from `manifest.wal_entry_position_last_seen + 1` in `WalAppender::with_claimed_epoch` (the manifest_store/manifest are already in scope at `ShardWriter::open`).

Extended reasoning...
What changed. Pre-PR, `WalFlusher::new` took `next_wal_entry_position: u64` and stored it in an `AtomicU64` initialized to `manifest.wal_entry_position_last_seen + 1`, so `wal_stats().next_wal_entry_position` immediately reflected the post-recovery cursor. Post-PR, `WalFlusher::next_wal_entry_position()` (wal.rs:568) delegates to `wal_appender.next_entry_position_hint()`, which is initialized to `AtomicU64::new(0)` in `WalAppender::with_claimed_epoch` (wal.rs:744) and is only bumped after a successful append (wal.rs:812).

Where it surfaces. `ShardWriter::wal_stats()` exposes the value publicly via the `WalStats` struct in write.rs. The new code's own doc comments at wal.rs:704-706 and wal.rs:770-775 explicitly call this out ("Stays at 0 until the first append discovers the starting position", "not authoritative"). However, the `WalStats` field itself is documented as "Next WAL entry position to be used", with no "best-effort" caveat — so the public observability contract has silently regressed.

Step-by-step proof. 1) A shard is closed with WAL entries at positions 0..99; the manifest persists `wal_entry_position_last_seen = 99`. 2) A new `ShardWriter` reopens the shard via `ShardWriter::open`, which constructs `WalAppender::with_claimed_epoch` and `WalFlusher::new(appender)`. 3) `next_entry_position_hint` is now `AtomicU64::new(0)`. 4) Caller invokes `writer.wal_stats()` — it returns `next_wal_entry_position = 0`, indistinguishable from a fresh shard. Pre-PR the same call would have returned 100. 5) Once the first `put()` triggers `WalAppender::append`, `discover_next_position` lazily probes from the manifest hint, the append succeeds at position 100, and the hint catches up — but the window between open and first append is exposed.

Why the refutation isn't fully load-bearing. The refutation observes that the doc comments next to the field correctly mark it as best-effort. That is true, but the refutation conflates the appender-internal hint (whose docs were updated) with the public `WalStats` surface (whose docs weren't). A caller reading `wal_stats()` does not see those caveats; they see a struct field claiming to report the next WAL position. Pre-PR they got the right answer immediately on open; post-PR they get 0 transiently. That is the regression — not in the WAL append path (which is unaffected) but in the public stat.

Severity. Nit. No correctness impact: the WAL append path uses lazy discovery and writes to the right position. No internal consumer of `WalStats` depends on the value being authoritative. The window also corrects itself after the first append. But it is a real, observable behavior change for any external monitor reading the field, and the fix is essentially free.

Suggested fix. In `WalAppender::with_claimed_epoch`, accept the manifest (or the seed value) and initialize `next_entry_position_hint` to `manifest.wal_entry_position_last_seen + 1` (or 0 if no manifest exists yet). Both `ShardWriter::open` and `WalAppender::open` already have the manifest available at the call site, so no extra storage round-trip is needed. Alternatively, eagerly call `discover_next_position` once during `ShardWriter::open` and seed the hint from its result.
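The suggested seeding can be sketched in a few lines. This is a minimal, self-contained stand-in — the `Manifest` struct here is hypothetical, and the real `WalAppender::with_claimed_epoch` signature may differ; only the seeding rule itself comes from the discussion above:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the shard manifest (assumption: only the
// `wal_entry_position_last_seen` field matters for this sketch).
struct Manifest {
    wal_entry_position_last_seen: u64,
}

// Seed the appender's position hint at construction instead of starting
// at 0: a reopened shard immediately reports the post-recovery cursor,
// while a fresh shard (no manifest yet) starts at 0.
fn seed_hint(manifest: Option<&Manifest>) -> AtomicU64 {
    let seed = manifest
        .map(|m| m.wal_entry_position_last_seen.saturating_add(1))
        .unwrap_or(0);
    AtomicU64::new(seed)
}

fn main() {
    let reopened = Manifest { wal_entry_position_last_seen: 99 };
    // Reopened shard: stats read 100 right away, not a transient 0.
    assert_eq!(seed_hint(Some(&reopened)).load(Ordering::Relaxed), 100);
    // Fresh shard: 0, matching the spec's 0-based first entry.
    assert_eq!(seed_hint(None).load(Ordering::Relaxed), 0);
}
```

Lazy discovery on the first append still runs, so the seed only needs to be a best-effort hint — stale values are corrected by `discover_next_position`.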
```rust
/// Whether to maintain an in-memory MemTable on top of the WAL.
///
/// When `true` (default), the writer maintains an in-memory `MemTable`,
/// optionally with indexes, and asynchronously flushes frozen MemTables
/// to Lance files alongside the WAL.
///
/// When `false`, the writer skips the MemTable layer entirely:
/// - No MemTable / BatchStore / IndexStore is allocated.
/// - `index_configs` must be empty (validated at open time).
/// - No MemTable freezing or Lance file flushing happens.
/// - No MemTable-bytes backpressure (`max_unflushed_memtable_bytes` is ignored).
/// - Each WAL flush drains the pending-batch queue, so the in-memory
///   footprint is bounded by `max_wal_buffer_size` /
///   `max_wal_flush_interval`.
/// - The async batched WAL pipeline still runs, driven by the same
///   `max_wal_buffer_size`, `max_wal_flush_interval`, and
///   `durable_write` settings as MemTable mode.
///
/// MemTable-tied tunables (`max_memtable_size`, `max_memtable_rows`,
/// `max_memtable_batches`, `max_unflushed_memtable_bytes`,
/// `ivf_index_partition_capacity_safety_factor`, `sync_indexed_write`,
/// `async_index_buffer_rows`, `async_index_interval`) are ignored when
/// `enable_memtable == false`.
```
🟡 The doc string for `enable_memtable` says `max_unflushed_memtable_bytes` is ignored in WAL-only mode (line 195) and lists it among MemTable-tied tunables "ignored when `enable_memtable == false`" (lines 203-207), but the implementation reuses it as the WAL-only backpressure budget — `put_wal_only` feeds `WalOnlyState::estimated_size()` into a `BackpressureController` that triggers when bytes exceed `max_unflushed_memtable_bytes`. A user reading the doc may leave the knob at the 1GB default and be surprised when WAL-only puts start blocking under sustained writes, or unexpectedly throttle WAL-only writers when tuning the knob down for MemTable mode. Fix is doc-only: replace "is ignored" with "is reused as the WAL-only pending-queue backpressure budget" and remove it from the ignored-tunables list (the inline comment in `open_wal_only_mode` already states this contract correctly).
Extended reasoning...
The contradiction. The `ShardWriterConfig::enable_memtable` doc string in rust/lance/src/dataset/mem_wal/write.rs makes two explicit claims about WAL-only mode:
- Line 195: "No MemTable-bytes backpressure (`max_unflushed_memtable_bytes` is ignored)."
- Lines 203-207: lists `max_unflushed_memtable_bytes` among MemTable-tied tunables that "are ignored when `enable_memtable == false`".
The implementation does the opposite. `open_wal_only_mode` constructs `BackpressureController::new(config.clone())`, and the inline comment right above it (lines 1242-1246) explicitly says: "Reuse BackpressureController (which is keyed off max_unflushed_memtable_bytes) as the WAL-only backpressure budget. WAL-only callers feed it WalOnlyState::estimated_size(). Keeps the config knob meaningful in WAL-only mode and prevents the pending queue from growing unbounded under non-durable writes." The PR description itself says "a separate WAL-only backpressure budget reuses max_unflushed_memtable_bytes". So the author intends the knob to apply, but the doc string says it doesn't.

Why existing code doesn't prevent it. The doc string and the inline comment are in the same file, only ~1050 lines apart, but nothing cross-checks them. The `BackpressureController` is fundamentally keyed off `config.max_unflushed_memtable_bytes` (`maybe_apply_backpressure` at line 650 returns early only when `unflushed_memtable_bytes < self.config.max_unflushed_memtable_bytes`), and `put_wal_only` calls it unconditionally. There is no separate WAL-only knob; reusing the existing field is the design.
Step-by-step proof.
- A user sets `config.enable_memtable = false` and reads the doc, which says `max_unflushed_memtable_bytes` is ignored. They leave it at the default (1GB) without thinking about it. `ShardWriter::open` calls `open_wal_only_mode`, which builds `backpressure = BackpressureController::new(config.clone())` — the controller now holds `max_unflushed_memtable_bytes = 1GB`.
- The user issues `put` calls. `put` dispatches to `put_wal_only`, which at line 1389 calls `backpressure.maybe_apply_backpressure(|| (state.estimated_size(), None)).await?` before doing anything else. `maybe_apply_backpressure` (line 650) checks `if unflushed_memtable_bytes < self.config.max_unflushed_memtable_bytes`. As long as the WAL-only pending queue stays under 1GB, this is a no-op — matching the user's expectation that the knob is ignored.
- Under sustained writes (e.g., a bursty producer outpacing the WAL flush handler), `WalOnlyState::estimated_size()` grows past 1GB. The check now fails, the controller enters its wait loop (line 666 onwards), and `put` blocks until the pending queue drains below the threshold. The user, who was told the knob is ignored, has no idea why their puts are blocking.
- Symmetrically: a user in MemTable mode tunes `max_unflushed_memtable_bytes` down to 16MB to throttle MemTable growth. They later flip `enable_memtable = false` to get a pure WAL pipeline, expecting the throttle no longer applies. It silently still does — WAL-only writes now block at 16MB pending instead of 1GB.
Impact. This is on a public configuration field on a stable-looking config struct. The doc is what users will read; the inline comment in open_wal_only_mode is implementation-internal. A user who follows the doc will misconfigure their writer, either by leaving an unintended 1GB pending budget or by unintentionally inheriting a tight MemTable-mode throttle. Severity is low because behavior is reasonable in both default cases — the bug is the doc lying about a knob that is in fact load-bearing.
Fix. Doc-only. On line 195, replace 'No MemTable-bytes backpressure (max_unflushed_memtable_bytes is ignored).' with something like 'WAL-only backpressure on the pending-batch queue is bounded by max_unflushed_memtable_bytes (the same knob is reused as the pending-queue budget).' On lines 203-207, remove max_unflushed_memtable_bytes from the 'ignored' list. This brings the doc in line with the existing inline comment at lines 1242-1246.
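The gate the proof above walks through reduces to a single comparison. A tiny sketch of that predicate (function names here are stand-ins; the real `BackpressureController` waits and retries rather than returning a bool):

```rust
// Stand-in for the one check the controller is keyed off: backpressure
// engages once the pending bytes reach the configured budget.
fn backpressure_engaged(pending_bytes: u64, max_unflushed_memtable_bytes: u64) -> bool {
    pending_bytes >= max_unflushed_memtable_bytes
}

fn main() {
    const GIB: u64 = 1 << 30;
    // Default 1 GiB budget: WAL-only puts pass while under the budget...
    assert!(!backpressure_engaged(512 << 20, GIB));
    // ...but block once the pending queue crosses it, even though the
    // doc claimed the knob was ignored in WAL-only mode.
    assert!(backpressure_engaged(GIB + 1, GIB));
    // A MemTable-mode tuning (16 MiB) silently carries into WAL-only mode.
    assert!(backpressure_engaged(32 << 20, 16 << 20));
}
```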
- Doc fix: `enable_memtable` doc said `max_unflushed_memtable_bytes` is ignored in WAL-only mode, but the implementation reuses it as the WAL-only pending-queue backpressure budget. Update the doc to match. - Stats fix: seed `WalAppender::next_entry_position_hint` from `manifest.wal_entry_position_last_seen + 1` at construction. Previously the hint was initialized at 0 and only updated after the first successful append, so `wal_stats().next_wal_entry_position` returned 0 on a reopened shard between `open()` and the first put — a regression from the pre-PR baseline that read the post-recovery cursor immediately. Adds `test_wal_stats_seeded_from_manifest_on_reopen`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `ShardWriter::open` always started with an empty MemTable, so writes that were durable in the WAL but never reached a Lance generation flush (e.g. because the previous writer crashed) were unreachable through the writer's read path until they were re-flushed. This commit wires the replay path the spec describes: on open in MemTable mode, walk WAL entries from `manifest.replay_after_wal_entry_position + 1` (or position 0 if no flushes have happened yet) up to the WAL tip via `WalTailer`, insert their batches into the freshly-built MemTable, and update any in-memory indexes accordingly. Each entry's `writer_epoch` is checked against our `claim_epoch`-returned epoch — a strictly greater epoch indicates a successor writer claimed the shard mid-open and we abort with a fence error. After replay, `WalAppender::seed_next_position` seeds the appender's position counter so the first put writes past the replayed entries instead of paying the lazy-discovery probe cost. WAL-only mode is unaffected (no MemTable to rebuild). Tests: - recovery: writer A durably writes, drops without close; writer B's open replays A's batches. - no-op: fresh shard opens to empty MemTable. - fence: a higher-epoch entry injected via direct `WalAppender` causes open to fail with a clear `WAL replay aborted ... fenced` error. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
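The replay-on-open flow described in this commit can be sketched with simplified stand-in types (`WalEntry` and a `Vec` in place of the real MemTable are assumptions; the real code drives `WalTailer` and uses the epoch returned by `claim_epoch`):

```rust
struct WalEntry {
    position: u64,
    writer_epoch: u64,
    batches: Vec<&'static str>, // stand-in for Arrow record batches
}

// Walk entries from `start_position` to the tip, inserting batches into
// the (freshly built, empty) memtable. Returns the next WAL write
// position: one past the highest replayed entry, or `start_position`
// itself when nothing was found.
fn replay(
    entries: &[WalEntry],
    start_position: u64,
    our_epoch: u64,
    memtable: &mut Vec<&'static str>,
) -> Result<u64, String> {
    let mut next = start_position;
    for e in entries.iter().filter(|e| e.position >= start_position) {
        // A strictly greater epoch means a successor writer claimed the
        // shard mid-open: abort with a fence error.
        if e.writer_epoch > our_epoch {
            return Err(format!("WAL replay aborted at {}: fenced", e.position));
        }
        if !e.batches.is_empty() {
            memtable.extend(e.batches.iter().copied());
        }
        next = e.position + 1;
    }
    Ok(next)
}

fn main() {
    let entries = vec![
        WalEntry { position: 3, writer_epoch: 1, batches: vec!["a"] },
        WalEntry { position: 4, writer_epoch: 1, batches: vec!["b", "c"] },
    ];
    let mut memtable = Vec::new();
    // Post-crash reopen: both entries land in the memtable and the
    // appender is seeded to write at position 5.
    assert_eq!(replay(&entries, 3, 1, &mut memtable), Ok(5));
    assert_eq!(memtable, vec!["a", "b", "c"]);
    // A higher-epoch entry fences the replaying writer.
    let fenced = vec![WalEntry { position: 5, writer_epoch: 2, batches: vec![] }];
    assert!(replay(&fenced, 5, 1, &mut Vec::new()).is_err());
}
```

The return value is what lets the caller seed the appender unconditionally: when the loop finds nothing, `start_position` itself is provably empty because the tailer just returned no entry for it.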
@jackye1995 — quick heads-up: there was a previous change I had drafted for doing MemTable replay on open. What it does, in MemTable mode only:
WAL-only mode is unchanged — no MemTable to rebuild. Three new tests cover: recovery (writer A drops without close → writer B's reopened MemTable contains A's rows), no-op on a fresh shard, and fence-on-replay (a high-epoch entry injected via direct `WalAppender`). All 226 mem_wal tests pass; clippy + fmt clean. Please take another look.
CI caught two follow-up sites that needed adjusting after `memtable_stats`, `scan`, and `active_memtable_ref` started returning `Result<...>` and `ShardWriterConfig` gained the `enable_memtable` field: - `python/src/mem_wal.rs`: propagate the new errors via `?` in `RegionWriter::close` and `RegionWriter::scan`, and flatten the match arms in `RegionWriter::memtable_stats` so the active and closed branches share one `Result<MemTableStats, lance::Error>` type. - `rust/lance/benches/mem_wal_read.rs`: `.unwrap()` on the two `active_memtable_ref().await` call sites. - `rust/lance/benches/mem_wal_write.rs`: add `enable_memtable` to the exhaustive `ShardWriterConfig` literal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- `mem_wal_write` gains an `ENABLE_MEMTABLE` env knob (yes/no/both)
so we can compare MemTable-mode and WAL-only-mode write throughput
side-by-side. WAL-only branches automatically force
`MEMWAL_MAINTAINED_INDEXES=none` and skip `INDEXED_WRITE=yes` since
sync-indexed writes require a MemTable.
- New `mem_wal_replay` benchmark in two variants:
1. `WalTailer::read_entry` throughput — pull N pre-written WAL
entries off storage end-to-end.
2. `ShardWriter::open` replay — measure the full replay cost a
post-crash reopen pays in MemTable mode (tailer reads + MemTable
inserts).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
```rust
shard_id,
manifest_store.clone(),
epoch,
manifest.wal_entry_position_last_seen.saturating_add(1),
```
the `next_entry_position_hint` should be the max of `wal_entry_position_last_seen` and `replay_after_wal_entry_position`, then +1
Done in 92054be — the seed is now `max(wal_entry_position_last_seen, replay_after_wal_entry_position) + 1` in both `WalAppender::open` and `ShardWriter::open`.
```rust
};
// ...
let tailer = WalTailer::new(object_store, base_path, shard_id);
let batches_before = memtable.batch_count();
```
why do we need to record before? shouldn't it always be an empty memtable if we replay?
Right, fixed in 92054be — dropped the batches_before bookkeeping; the MemTable is freshly built before replay runs so we just index [0, batch_count). Added a debug_assert_eq! for the invariant.
```rust
if !entry.batches.is_empty() {
    memtable.insert_batches_only(entry.batches).await?;
}
highest_replayed = Some(position);
```
at this point, the tailer already found the tip of the WAL. this optional information could be passed to the WAL appender so it can skip trying to re-discover the tip of the WAL when memtable is enabled.
Done in 92054be — replay_memtable_from_wal now returns the next WAL write position unconditionally (highest replayed + 1, or the original start_position itself when the loop found nothing — that position is provably empty since read_entry just returned None for it). The caller seeds WalAppender::next_entry_position unconditionally, so the first post-open put skips the discovery probe even on shards where replay had nothing to do.
Three small follow-ups from @jackye1995's review: - Seed `WalAppender::next_entry_position_hint` from `max(wal_entry_position_last_seen, replay_after_wal_entry_position) + 1` rather than just `wal_entry_position_last_seen + 1`. The two cursors can lead each other depending on which was updated last (the replay-after position is updated authoritatively at flush, while the last-seen hint is best-effort), so the max is the correct post-recovery cursor. Same fix in both `WalAppender::open` and `ShardWriter::open`. - Drop the `batches_before = memtable.batch_count()` bookkeeping in `replay_memtable_from_wal`. The MemTable is freshly built before replay runs, so the BatchStore is empty by construction — index the whole `[0, batch_count)` range. Asserts the invariant via `debug_assert_eq!`. - Have `replay_memtable_from_wal` return the next WAL write position unconditionally (i.e. one past the highest replayed entry, or `start_position` itself when the loop found nothing — that position is provably empty since the tailer just returned `None` for it). The caller now seeds `WalAppender::next_entry_position` unconditionally, so the first put after open always skips the `discover_next_position` probe — even on shards where replay had nothing to do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
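The cursor-seeding rule from the first follow-up, as a tiny sketch (the two field names come from the discussion above; the wrapping function is hypothetical):

```rust
// The two cursors can lead each other: `replay_after` is updated
// authoritatively at flush, `last_seen` is a best-effort hint. The
// correct post-recovery cursor is one past whichever is ahead.
fn post_recovery_cursor(
    wal_entry_position_last_seen: u64,
    replay_after_wal_entry_position: u64,
) -> u64 {
    wal_entry_position_last_seen
        .max(replay_after_wal_entry_position)
        .saturating_add(1)
}

fn main() {
    // last-seen ahead (appends since the last flush)
    assert_eq!(post_recovery_cursor(99, 40), 100);
    // replay-after ahead (flush advanced it past a stale hint)
    assert_eq!(post_recovery_cursor(40, 99), 100);
    // both at 0: entry 0 exists, so the next write goes to 1
    assert_eq!(post_recovery_cursor(0, 0), 1);
}
```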
The failure is unrelated, I will fix it in main.
jackye1995
left a comment
thanks for adding this!
Summary
- `ShardWriterConfig::enable_memtable` (default `true`); when `false`, `ShardWriter` runs in a new WAL-only mode that keeps the async batched WAL pipeline but skips MemTable allocation, in-memory indexes, MemTable freezing, and MemTable-bytes backpressure (a separate WAL-only backpressure budget reuses `max_unflushed_memtable_bytes`).
- `WalAppender` becomes the WAL write primitive used by both modes inside `WalFlusher`, replacing the prior duplicate plain-`put` path. MemTable mode now also gets atomic put-if-not-exists, conflict retry, and fence-on-write.
- `WalAppender` stays public as the lowest-level synchronous-atomic appender. New `pub(crate) WalAppender::with_claimed_epoch` lets `ShardWriter::open` claim the epoch once and inject it.
- `WalOnlyState` queue with snapshot/commit semantics, so a failed append leaves batches in the queue for retry instead of dropping them silently.
- `memtable_stats()`, `scan()`, `active_memtable_ref()` now return `Result<...>` and produce a clear `invalid_input` error in WAL-only mode.

Behavior change to call out: the first WAL entry on a fresh shard is now position 0 instead of 1, matching the 0-based positions documented in the MemTable & WAL spec. The previous flusher seeded its counter from `wal_entry_position_last_seen + 1` and so always skipped position 0.

Context: this realizes the layering discussed in #6669 (comment) — keep `WalAppender` as the low-level primitive and let users always use `ShardWriter`, with a config switch to turn the MemTable on or off.

cc @jackye1995
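The snapshot/commit semantics in the `WalOnlyState` bullet can be sketched with a toy queue (the type and method names here are assumptions; only the retry-on-failure behavior comes from the summary):

```rust
// Pending batches stay queued until the append durably succeeds; a
// failed append commits nothing, so the batches remain for retry.
struct PendingQueue<T> {
    queue: Vec<T>,
}

impl<T: Clone> PendingQueue<T> {
    fn new() -> Self {
        Self { queue: Vec::new() }
    }
    fn push(&mut self, item: T) {
        self.queue.push(item);
    }
    // Snapshot what we are about to append without removing it.
    fn snapshot(&self) -> Vec<T> {
        self.queue.clone()
    }
    // Commit only after the append succeeded: drop the appended prefix.
    fn commit(&mut self, appended: usize) {
        self.queue.drain(..appended);
    }
}

fn main() {
    let mut q = PendingQueue::new();
    q.push("batch-1");
    q.push("batch-2");
    let snap = q.snapshot();
    assert_eq!(snap.len(), 2);
    // Simulated append failure: no commit, both batches still pending.
    assert_eq!(q.queue.len(), 2);
    // Successful retry: commit the snapshot, queue drains.
    q.commit(snap.len());
    assert!(q.queue.is_empty());
}
```

The point of snapshotting rather than draining up front is that the failure path needs no compensation logic: nothing was removed, so a retry simply re-snapshots.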