docs(audit): apply_batch_parallel verify-vs-write overlap potential (ABW-1) #4670
Merged
…ABW-1)

Catalogues the current `apply_batch_parallel` two-phase shape (parallel verify + serial drain), quantifies the wall-clock breakdown across balanced/CPU-bound/I/O-bound regimes, sketches the bounded-channel pipelined alternative, and recommends gating ABW-2/3 on BR-3i.f bench evidence showing verify and write costs within 2x of each other.

Findings:
- Today's shape has zero production callers (per RJN-1 audit); promotion is the open question.
- For single-file batches the per-file Mutex serialises every write; pipelining buys nothing.
- For balanced workloads the expected speedup is ~1.13x; for verify-dominated workloads ~1.5x; for write-dominated workloads ~1.03x.
- Only the middle regime justifies the design complexity (bounded channel, writer thread, error propagation rework).

Recommendation: skip ABW-2/3 unless measurement places verify and write within 2x; otherwise close the line as investigated.
oferchen added a commit that referenced this pull request on May 21, 2026
Discharges the ABW-1 audit's recommendation (PR #4670, section 4) to skip the pipelined verify/write design for apply_batch_parallel until bench evidence shows verify and write costs within 2x of each other on a production-relevant workload cell.

- Recaps the ABW-1 quantified speedup table (1.13x balanced, 1.50x CPU-bound, 1.03x I/O-bound, ~0x single-file).
- Documents why deferring the design (not just the implementation) is the right call: peak benefit is workload-dependent, complexity-to-payoff is poor in the measured cells, and the PIP-3+5 dispatch heuristic (PR #4666) already gates the degenerate single-file case out of parallel-receive-delta.
- Names BR-3j.f (#2508) as the gating re-bench task and lifts the audit's decision gate verbatim.
- Preserves the option: per-file Mutex is the real bottleneck; a future multi-threaded-per-file writer or a CPU-bound verify regime would re-open ABW-2.
oferchen added a commit that referenced this pull request on May 21, 2026
…-2 rename (#4676)

RJN-2 (PR #4660, merged) chose the rename path over the fanout-refactor path, discharging the RJN-1 audit (PR #4656, merged) with apply_chunk_parallel -> apply_one_chunk plus a rustdoc redirect to apply_batch_parallel. RJN-3 (implement fanout) and RJN-4 (bench scheduler shape) were the "if RJN-2 chose refactor" branch that did not get taken; this doc closes both.

- RJN-3 stays closed: zero production callers of apply_one_chunk; the real multi-chunk win sits in apply_batch_parallel, where ABW-1 (PR #4670) already recommended deferring per its quantified 1.03x-1.50x speedup range; the ABW-2/3/4 closure doc defers that track pending BR-3j.f (#2508) bench data.
- RJN-4 is N/A: with RJN-3 deferred there is no "after" cell to measure; the production-weighted scheduler-shape bench effort belongs in parallel_receive_delta_perf via BR-3j.f, not at the per-chunk entry point.
- Re-open conditions: a production caller of apply_one_chunk ships AND profiling shows the per-chunk path is hot.

Project memory references project_rayon_join_per_chunk_noop.md and project_apply_batch_write_serial.md; both observations remain accurate under the renamed function.
Summary
Pure research/audit (no source changes). Investigates whether `ParallelDeltaApplier::apply_batch_parallel` should pipeline its parallel verify phase with its serial write phase (the question raised in `project_apply_batch_write_serial.md`).

- Catalogues today's two-phase shape at `crates/engine/src/concurrent_delta/parallel_apply.rs:515-542`: a `par_iter().collect()` barrier between verify and a serial drain that acquires the per-file `Mutex<FileSlot>` per chunk.
- Quantifies the wall-clock breakdown across balanced, CPU-bound, and I/O-bound-write regimes. Writes dominate in all three because verify scales with `K` workers and writes do not.
- Sketches the bounded-channel pipelined alternative, identifies the data dependency (per-file `chunk_sequence` order, already enforced by `FileSlot::ingest` + `ReorderBuffer`), and estimates the speedup ceiling: ~1.13x balanced, ~1.5x verify-dominated, ~1.03x write-dominated.
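
To make the speedup ceilings above easier to sanity-check, here is a minimal back-of-envelope model. It is a sketch under assumptions that are not stated on this page: `K = 8` verify workers, per-chunk verify:write cost ratios of 1:1, 4:1 and 1:4, and an idealised pipeline with no fill or drain latency. All function names are hypothetical. Under those assumptions the model yields 1.125, 1.5 and 1.03125, which is in line with the ~1.13x / ~1.5x / ~1.03x figures; the audit's own parameters may differ.

```rust
/// Two-phase shape: every verify finishes (spread over `workers` threads)
/// before the serial write drain starts, so the two costs add.
fn two_phase_time(chunks: f64, verify_cost: f64, write_cost: f64, workers: f64) -> f64 {
    chunks * verify_cost / workers + chunks * write_cost
}

/// Idealised pipeline: verify and write overlap fully, so the slower stage
/// sets the pace. Ignoring fill/drain latency makes this an upper bound.
fn pipelined_time(chunks: f64, verify_cost: f64, write_cost: f64, workers: f64) -> f64 {
    (chunks * verify_cost / workers).max(chunks * write_cost)
}

/// Speedup ceiling of the pipelined shape over the two-phase shape.
/// The chunk count cancels out; 1000 is an arbitrary placeholder.
fn speedup_ceiling(verify_cost: f64, write_cost: f64, workers: f64) -> f64 {
    let chunks = 1000.0;
    two_phase_time(chunks, verify_cost, write_cost, workers)
        / pipelined_time(chunks, verify_cost, write_cost, workers)
}

fn main() {
    let workers = 8.0; // assumed K, not taken from the audit
    // (label, per-chunk verify cost, per-chunk write cost): assumed ratios
    let regimes = [
        ("balanced", 1.0, 1.0),
        ("verify-dominated", 4.0, 1.0),
        ("write-dominated", 1.0, 4.0),
    ];
    for (label, verify_cost, write_cost) in regimes {
        let ceiling = speedup_ceiling(verify_cost, write_cost, workers);
        println!("{label:>16}: {ceiling:.3}x");
    }
}
```

The model also shows why writes dominate in every regime: the verify term is divided by `K` while the write term is not, so the serial drain stays the larger share of wall-clock time even when per-chunk verify is several times more expensive.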
Recommendation
Skip ABW-2/3 unless BR-3i.f (`crates/engine/benches/parallel_verify_chunk.rs`) and `parallel_receive_delta_perf` show verify and write costs within 2x of each other on a production-relevant workload. Otherwise the design complexity (bounded channel, writer thread, error propagation rework, new test cells) exceeds the gain. Per-file `apply_batch_parallel` has zero production callers today (per RJN-1 / PR #4656), so no current path regresses.
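
For a sense of the moving parts that recommendation weighs, the following is a minimal std-only sketch of a bounded-channel pipeline: verify workers feed one writer thread, which restores per-file `chunk_sequence` order before appending. It is not the audited code. `Chunk`, `verify`, `apply_pipelined` and the byte-sum "checksum" are hypothetical stand-ins, sequences are assumed to start at 0, and a single writer owning all files replaces the per-file `Mutex<FileSlot>`.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for a delta chunk; not the crate's real type.
#[derive(Clone, Debug)]
struct Chunk {
    file_id: usize,
    chunk_sequence: u64,
    payload: Vec<u8>,
    expected_sum: u32,
}

/// Toy verifier: a byte sum stands in for the real checksum.
fn verify(chunk: &Chunk) -> Result<(), String> {
    let sum: u32 = chunk.payload.iter().map(|&b| u32::from(b)).sum();
    if sum == chunk.expected_sum {
        Ok(())
    } else {
        Err(format!("file {} chunk {} failed verify", chunk.file_id, chunk.chunk_sequence))
    }
}

/// Verify on `workers` threads and stream results to one writer thread through
/// a channel bounded at `capacity`, so writes begin before verification ends.
fn apply_pipelined(
    chunks: &[Chunk],
    workers: usize,
    capacity: usize,
) -> Result<BTreeMap<usize, Vec<u8>>, String> {
    let workers = workers.max(1);
    let (tx, rx) = sync_channel::<Result<Chunk, String>>(capacity);
    thread::scope(|scope| {
        let writer = scope.spawn(move || {
            let mut files: BTreeMap<usize, Vec<u8>> = BTreeMap::new();
            let mut parked: BTreeMap<(usize, u64), Vec<u8>> = BTreeMap::new();
            let mut next_seq: BTreeMap<usize, u64> = BTreeMap::new();
            let mut first_error: Option<String> = None;
            for msg in rx {
                match msg {
                    Err(e) => {
                        first_error.get_or_insert(e);
                    }
                    // After a failure, stop writing but keep draining so
                    // senders never block on a full channel.
                    Ok(_) if first_error.is_some() => {}
                    Ok(Chunk { file_id, chunk_sequence, payload, .. }) => {
                        parked.insert((file_id, chunk_sequence), payload);
                        let next = next_seq.entry(file_id).or_insert(0);
                        // Flush every chunk that is now contiguous for this file.
                        while let Some(bytes) = parked.remove(&(file_id, *next)) {
                            files.entry(file_id).or_default().extend(bytes);
                            *next += 1;
                        }
                    }
                }
            }
            match first_error {
                Some(e) => Err(e),
                None => Ok(files),
            }
        });
        for worker in 0..workers {
            let tx = tx.clone();
            scope.spawn(move || {
                // Strided split of the batch across the verify workers.
                for chunk in chunks.iter().skip(worker).step_by(workers) {
                    let msg = verify(chunk).map(|()| chunk.clone());
                    if tx.send(msg).is_err() {
                        break;
                    }
                }
            });
        }
        drop(tx); // the channel closes once the last worker finishes
        writer.join().expect("writer thread panicked")
    })
}

/// Helper that builds a chunk whose checksum matches its payload.
fn chunk(file_id: usize, chunk_sequence: u64, payload: &[u8]) -> Chunk {
    let expected_sum = payload.iter().map(|&b| u32::from(b)).sum();
    Chunk { file_id, chunk_sequence, payload: payload.to_vec(), expected_sum }
}

fn main() {
    let batch = vec![chunk(0, 1, b"b"), chunk(1, 0, b"x"), chunk(0, 0, b"a")];
    let files = apply_pipelined(&batch, 2, 1).expect("all chunks verify");
    println!("{files:?}");
}
```

Even in this toy form, the extra pieces are visible: the writer must keep draining after the first error, out-of-order arrivals need parking, and channel capacity becomes a tuning knob. That is the complexity the recommendation sets against a ceiling of at most ~1.5x.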
Test plan
closing the line of work with a note on `project_apply_batch_write_serial.md`.