
docs(audit): apply_batch_parallel verify-vs-write overlap potential (ABW-1)#4670

Merged
oferchen merged 1 commit into master from docs/audit-abw-1-verify-write-overlap
May 21, 2026

Conversation

@oferchen
Owner

Summary

Pure research/audit (no source changes). Investigates whether
ParallelDeltaApplier::apply_batch_parallel should pipeline its parallel
verify phase with its serial write phase (the question raised in
project_apply_batch_write_serial.md).

  • Catalogues the current two-phase shape at
    crates/engine/src/concurrent_delta/parallel_apply.rs:515-542: a
    par_iter().collect() barrier between verify and a serial drain that
    acquires the per-file Mutex<FileSlot> per chunk.
  • Quantifies wall-clock breakdown across balanced, CPU-bound-verify, and
    I/O-bound-write regimes. Writes dominate in all three because verify
    scales with K workers and writes do not.
  • Sketches the bounded-channel + writer-thread pipelined alternative,
    identifies the data dependency (per-file chunk_sequence order, already
    enforced by FileSlot::ingest + ReorderBuffer), and estimates the
    speedup ceiling: ~1.13x balanced, ~1.5x verify-dominated, ~1.03x
    write-dominated.
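The pipelined alternative described above can be sketched with the standard library alone. This is a minimal illustration, not the audit's design or the crate's code: `VerifiedChunk`, `verify`, and `apply_pipelined` are hypothetical names, `std::thread::scope` stands in for the rayon pool, a `BTreeMap` stands in for `ReorderBuffer`, and an in-memory `Vec<u8>` stands in for the file. It assumes chunk sequences are contiguous and start at 0.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical verified chunk: its position within the file plus its payload.
struct VerifiedChunk {
    chunk_sequence: u64,
    data: Vec<u8>,
}

/// Stand-in for the verify step (assumption: an empty payload fails verification).
fn verify(chunk_sequence: u64, data: Vec<u8>) -> Result<VerifiedChunk, String> {
    if data.is_empty() {
        return Err(format!("chunk {chunk_sequence} failed verification"));
    }
    Ok(VerifiedChunk { chunk_sequence, data })
}

/// Verifies on `workers` threads while one writer thread drains a bounded channel.
/// Returns the bytes "written" in chunk_sequence order, or the first error seen.
fn apply_pipelined(
    chunks: Vec<(u64, Vec<u8>)>,
    workers: usize,
    capacity: usize,
) -> Result<Vec<u8>, String> {
    let workers = workers.max(1);
    let (tx, rx) = sync_channel::<Result<VerifiedChunk, String>>(capacity);

    // Stripe the chunks across the verify workers.
    let mut shards: Vec<Vec<(u64, Vec<u8>)>> = (0..workers).map(|_| Vec::new()).collect();
    for (i, chunk) in chunks.into_iter().enumerate() {
        shards[i % workers].push(chunk);
    }

    thread::scope(|s| {
        // Writer: restores per-file order, so verify workers may finish in any order.
        let writer = s.spawn(move || {
            let mut pending: BTreeMap<u64, Vec<u8>> = BTreeMap::new();
            let mut next = 0u64;
            let mut written: Vec<u8> = Vec::new();
            let mut first_err: Option<String> = None;
            for msg in rx {
                match msg {
                    Ok(chunk) => {
                        if first_err.is_none() {
                            pending.insert(chunk.chunk_sequence, chunk.data);
                        }
                    }
                    Err(e) => {
                        if first_err.is_none() {
                            first_err = Some(e);
                        }
                    }
                }
                // After an error, keep draining so senders never block, but stop writing.
                if first_err.is_some() {
                    continue;
                }
                while let Some(data) = pending.remove(&next) {
                    written.extend_from_slice(&data);
                    next += 1;
                }
            }
            match first_err {
                Some(e) => Err(e),
                None => Ok(written),
            }
        });

        for shard in shards {
            let tx = tx.clone();
            s.spawn(move || {
                for (seq, data) in shard {
                    // send() blocks when the channel is full: that is the backpressure.
                    if tx.send(verify(seq, data)).is_err() {
                        break;
                    }
                }
            });
        }
        drop(tx); // The writer's loop ends once every sender is gone.
        writer.join().expect("writer thread panicked")
    })
}
```

Even in this toy form the error path needs its own handling in the writer, which is the "error propagation rework" cost the audit counts against the design.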

Recommendation

Skip ABW-2/3 unless BR-3i.f
(crates/engine/benches/parallel_verify_chunk.rs) and
parallel_receive_delta_perf show verify and write costs within 2x of
each other on a production-relevant workload. Without that evidence, the design
complexity (bounded channel, writer thread, error propagation rework, new
test cells) exceeds the gain. Per-file apply_batch_parallel has zero
production callers today (per RJN-1 / PR #4656), so no current path
regresses.
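One way to read the 2x gate is through an ideal-overlap model. This is an assumption for illustration; the audit's own model is not reproduced in this description. If the two-phase shape costs `verify_wall + write_wall` and a perfect pipeline costs `max(verify_wall, write_wall)`, the speedup ceiling is their ratio. The function names below are hypothetical, and the wall-time ratios in the comments are back-derived from the quoted ceilings under this model, not measured values.

```rust
/// Speedup ceiling if verify and write overlapped perfectly (assumed model).
/// `verify_wall` is the verify phase's wall time after parallelism; `write_wall`
/// is the serial write phase's wall time. Units cancel, so any unit works.
fn overlap_speedup_ceiling(verify_wall: f64, write_wall: f64) -> f64 {
    (verify_wall + write_wall) / verify_wall.max(write_wall)
}

/// The decision gate from the recommendation: the two costs lie within 2x of each other.
fn within_2x(verify_wall: f64, write_wall: f64) -> bool {
    let lo = verify_wall.min(write_wall);
    let hi = verify_wall.max(write_wall);
    hi <= 2.0 * lo
}

// Under this model, with write_wall normalised to 1.0:
//   verify_wall = 0.13 gives a 1.13x ceiling (the balanced figure)
//   verify_wall = 0.50 gives a 1.50x ceiling (the verify-dominated figure)
//   verify_wall = 0.03 gives a 1.03x ceiling (the write-dominated figure)
```

In this reading the gate passes exactly when the ceiling is at least 1.5x, and the ceiling never exceeds 2x, which it reaches only when the two phases cost the same.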

Test plan

  • Audit reviewed; recommendation actioned via ABW-2 design doc or by
    closing the line of work with a note on
    project_apply_batch_write_serial.md.

…ABW-1)

Catalogues the current `apply_batch_parallel` two-phase shape
(parallel verify + serial drain), quantifies the wall-clock breakdown
across balanced/CPU-bound/I/O-bound regimes, sketches the bounded-channel
pipelined alternative, and recommends gating ABW-2/3 on BR-3i.f bench
evidence showing verify and write costs within 2x of each other.

Findings:

- Today's shape has zero production callers (per RJN-1 audit); promotion
  is the open question.
- For single-file batches the per-file Mutex serialises every write;
  pipelining buys nothing.
- For balanced workloads the expected speedup is ~1.13x; for verify-
  dominated workloads ~1.5x; for write-dominated workloads ~1.03x.
- Only the middle regime justifies the design complexity (bounded channel,
  writer thread, error propagation rework).

Recommendation: skip ABW-2/3 unless measurement places verify and write
costs within 2x of each other; otherwise close the line as investigated.
github-actions bot added the documentation label on May 21, 2026
@oferchen merged commit e9d9872 into master on May 21, 2026
10 checks passed
@oferchen deleted the docs/audit-abw-1-verify-write-overlap branch on May 21, 2026 13:42
oferchen added a commit that referenced this pull request May 21, 2026
Discharges the ABW-1 audit's recommendation (PR #4670, section 4) to
skip the pipelined verify/write design for apply_batch_parallel until
bench evidence shows verify and write costs within 2x of each other on
a production-relevant workload cell.

- Recaps the ABW-1 quantified speedup table (1.13x balanced, 1.50x
  CPU-bound, 1.03x I/O-bound, ~1.0x single-file, i.e. no gain).
- Documents why deferring the design (not just the implementation) is
  the right call: peak benefit is workload-dependent, complexity-to-
  payoff is poor in the measured cells, and the PIP-3+5 dispatch
  heuristic (PR #4666) already gates the degenerate single-file case
  out of parallel-receive-delta.
- Names BR-3j.f (#2508) as the gating re-bench task and lifts the
  audit's decision gate verbatim.
- Preserves the option: per-file Mutex is the real bottleneck; a
  future multi-threaded-per-file writer or a CPU-bound verify regime
  would re-open ABW-2.
oferchen added a commit that referenced this pull request May 21, 2026
…-2 rename (#4676)

RJN-2 (PR #4660, merged) chose the rename path over the fanout-refactor path,
discharging the RJN-1 audit (PR #4656, merged) with apply_chunk_parallel ->
apply_one_chunk plus a rustdoc redirect to apply_batch_parallel. RJN-3
(implement fanout) and RJN-4 (bench scheduler shape) were the "if RJN-2 chose
refactor" branch that did not get taken; this doc closes both.

- RJN-3 stays closed: zero production callers of apply_one_chunk; the real
  multi-chunk win sits in apply_batch_parallel, where ABW-1 (PR #4670) already
  recommended deferring per its quantified 1.03x-1.50x speedup range; the
  ABW-2/3/4 closure doc defers that track pending BR-3j.f (#2508) bench data.
- RJN-4 is N/A: with RJN-3 deferred there is no "after" cell to measure; the
  production-weighted scheduler-shape bench effort belongs in
  parallel_receive_delta_perf via BR-3j.f, not at the per-chunk entry point.
- Re-open conditions: a production caller of apply_one_chunk ships AND
  profiling shows the per-chunk path is hot.

Project memory references project_rayon_join_per_chunk_noop.md and
project_apply_batch_write_serial.md - both observations remain accurate
under the renamed function.
