bench(engine): decompose parallel dispatch overhead at 100K items (#1551) by oferchen · Pull Request #4206 · oferchen/rsync

oferchen · 2026-05-17T08:17:23Z

Summary

Adds a Criterion bench that decomposes the concurrent delta pipeline's dispatch-time cost into its three constituent components at the 100K work-item scale. Each group is stripped of the other two so reviewers can see which one dominates and steer the next round of optimisation accordingly.

- thread_spawn_only: pure std::thread::spawn / join lifecycle, swept over {1, 4, 8, 16}.
- channel_only: bounded work_queue::bounded (crossbeam_channel) send/recv, no workers, no reorder buffer.
- reorderbuffer_only: ReorderBuffer insert + drain on a deterministically shuffled 100K-item sequence.
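To make the reorderbuffer_only workload concrete, here is a minimal sketch of "insert + drain on a deterministically shuffled sequence". It does not use the engine's real ReorderBuffer: `MiniReorderBuffer`, `drain_ready`, and `shuffled_sequence` are hypothetical names invented for illustration, and the fixed-seed LCG shuffle is only one assumed way to get a reproducible order.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the engine's ReorderBuffer (not the real type).
/// Items arrive tagged with a sequence number and are released in order.
pub struct MiniReorderBuffer<T> {
    next: u64,                 // next sequence number allowed to leave
    pending: BTreeMap<u64, T>, // out-of-order items waiting for their turn
}

impl<T> MiniReorderBuffer<T> {
    pub fn new() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }

    /// Stores an item under its sequence number.
    pub fn insert(&mut self, seq: u64, item: T) {
        self.pending.insert(seq, item);
    }

    /// Pops every item that is contiguous with the next expected sequence.
    pub fn drain_ready(&mut self) -> Vec<T> {
        let mut out = Vec::new();
        while let Some(item) = self.pending.remove(&self.next) {
            out.push(item);
            self.next += 1;
        }
        out
    }
}

/// Fisher-Yates shuffle driven by a fixed-seed LCG, so every run
/// sees the same arrival order (assumed approach, not the bench's).
pub fn shuffled_sequence(n: u64, seed: u64) -> Vec<u64> {
    let mut v: Vec<u64> = (0..n).collect();
    let mut state = seed;
    for i in (1..v.len()).rev() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = ((state >> 33) as usize) % (i + 1);
        v.swap(i, j);
    }
    v
}
```

Feeding `shuffled_sequence(100_000, seed)` through `insert` followed by `drain_ready` exercises the same shape of work as the group, with the caveat that the real buffer's data structure and costs may differ.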

Inputs are pre-allocated outside the timed section via iter_batched. Every group reports throughput as Throughput::Elements(100_000) so ops/sec figures are directly comparable across components.
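The measurement discipline described above can be sketched with the standard library alone. This is not Criterion's implementation: `time_batched` and `elements_per_sec` are hypothetical helpers that only illustrate the idea of building inputs outside the timed region and normalising by a fixed element count.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `routine` `iters` times and returns the time spent inside it.
/// `setup` builds a fresh input for each iteration and is never timed.
pub fn time_batched<I, O>(
    iters: u32,
    mut setup: impl FnMut() -> I,
    mut routine: impl FnMut(I) -> O,
) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        let input = setup(); // allocation happens before the clock starts
        let start = Instant::now();
        let output = black_box(routine(input)); // only this call is timed
        total += start.elapsed();
        drop(output); // dropping the result is also kept off the clock
    }
    total
}

/// Converts an elapsed time into elements per second for a fixed
/// element count, so different components share one unit.
pub fn elements_per_sec(elements: u64, elapsed: Duration) -> f64 {
    elements as f64 / elapsed.as_secs_f64()
}
```

In a Criterion bench the same roles are played by `Bencher::iter_batched` and `BenchmarkGroup::throughput(Throughput::Elements(..))`. The sketch only shows why a shared element count makes the per-component figures comparable.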

The top-of-file module doc cross-references the related benches (#4180 reorder cache, #4203 sync-channel, #4204 reorder memory, #1885 in-bench metrics) and names the optimisation tracks the results will steer (#1681 lock-free MPSC, #1370 per-thread pools, #1271 buffer slab).

Refs #1551.

Test plan

- `cargo bench -p engine --bench parallel_dispatch_overhead` runs all three groups under the ~3 second per-iteration budget on a workstation-class machine
- CI fmt + clippy stays clean (no new warnings from the new bench)
- CI nextest stays green (bench-only change, no runtime code modified)

Add a Criterion bench that isolates each of the three dispatch-time costs the concurrent delta pipeline pays at 100K-file scale:
- thread_spawn_only: pure OS thread lifecycle, swept over {1, 4, 8, 16}
- channel_only: bounded crossbeam work_queue send/recv with no workers
- reorderbuffer_only: ReorderBuffer insert + drain on a shuffled sequence

Each group pre-allocates inputs outside the timed section and reports throughput via Throughput::Elements(100_000) so cells produce comparable ops/sec figures.

The top-of-file module doc cross-references the existing reorder cache, sync-channel, and memory benches and names the optimisation tracks (lock-free MPSC, per-thread pools, buffer slab) the results will steer.

Refs #1551

…#4212) Document the four CAPACITY_MULTIPLIER sites and the two duplicate hard-coded `2`s in delta_pipeline.rs, justify each against the recent dispatch benches (#4203 channel overhead, #4204 reorder memory, #4206 dispatch decomposition, #4209 sp vs mp), and recommend keeping the default at 2 with one follow-up bench specified to challenge it.

…1572) (#4209) Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion groups that both move 100K items through the concurrent delta work queue:
- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items via the gated `Clone` impl on WorkQueueSender. Compiled only when `--features multi-producer` is set.

Both groups report `Throughput::Elements(100_000)` so items/sec figures compare directly regardless of how the work is split. Inputs are pre-allocated outside the timed section via `iter_batched`, matching the discipline used by the parallel_dispatch_overhead bench.

The MP group is feature-gated behind the engine crate's existing `multi-producer` feature - no Cargo.toml dependency change is required; the gate already exists at `crates/engine/Cargo.toml:91`.

The top-of-file documentation cross-references the audit at `docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173), #4203 sync_channel bench, and #4206 parallel_dispatch_overhead bench, and spells out the decision criteria (>=15% SP-vs-MP delta) that this bench informs.
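The single-producer versus multi-producer split can be illustrated with a small sketch. It assumes `std::sync::mpsc::sync_channel` as a stand-in for the project's WorkQueueSender, plain `u64` values instead of DeltaWork items, and an arbitrary channel capacity; `run_producers` is a hypothetical helper, not code from the bench.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Pushes `total_items` values through one bounded channel using
/// `producers` threads, each sending an equal share, and returns how
/// many values the consumer received.
pub fn run_producers(producers: usize, total_items: usize, capacity: usize) -> usize {
    assert!(producers > 0 && total_items % producers == 0);
    let per_producer = total_items / producers;

    // Bounded channel: senders block once `capacity` items are in flight.
    let (tx, rx) = sync_channel::<u64>(capacity);

    let mut handles = Vec::with_capacity(producers);
    for p in 0..producers {
        let tx = tx.clone(); // one sender handle per producer thread
        handles.push(thread::spawn(move || {
            for i in 0..per_producer {
                tx.send((p * per_producer + i) as u64)
                    .expect("receiver should outlive producers");
            }
        }));
    }
    // Drop the original sender so the receive loop ends when producers finish.
    drop(tx);

    let received = rx.iter().count();
    for handle in handles {
        handle.join().expect("producer thread panicked");
    }
    received
}
```

Calling `run_producers(1, 100_000, 64)` and `run_producers(4, 100_000, 64)` mirrors the 1 x 100K versus 4 x 25K split. Both move the same total, which is why a shared element count keeps the two figures comparable.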

Captures the design for parallelizing the receiver's per-file delta apply loop while preserving per-file token order and wire-format parity. Documents the current sequential surface, the dormant ParallelDeltaPipeline that would host the change, the backpressure model, and the gating prerequisites - chiefly the parity-test gap flagged by audit #4205 (G2) - that block default adoption. Recommends a phased opt-in rollout: land the sequential-vs-parallel parity test first, then add a hidden CLI gate, then collect #4214 / #4206 bench evidence, and only then consider flipping the default.

) (#4206)

* bench(engine): decompose parallel dispatch overhead at 100K items

Add a Criterion bench that isolates each of the three dispatch-time costs the concurrent delta pipeline pays at 100K-file scale:
- thread_spawn_only: pure OS thread lifecycle, swept over {1, 4, 8, 16}
- channel_only: bounded crossbeam work_queue send/recv with no workers
- reorderbuffer_only: ReorderBuffer insert + drain on a shuffled sequence

Each group pre-allocates inputs outside the timed section and reports throughput via Throughput::Elements(100_000) so cells produce comparable ops/sec figures.

The top-of-file module doc cross-references the existing reorder cache, sync-channel, and memory benches and names the optimisation tracks (lock-free MPSC, per-thread pools, buffer slab) the results will steer.

Refs #1551

* style(engine): avoid clippy doc list false positive in bench

oferchen added 2 commits May 17, 2026 11:17

style(engine): avoid clippy doc list false positive in bench

f723011

oferchen merged commit 77b3ef6 into master May 17, 2026
40 checks passed

oferchen deleted the bench/parallel-dispatch-overhead-1551 branch May 17, 2026 09:20

This was referenced May 17, 2026

bench(engine): single-producer vs multi-producer WorkQueue overhead (#1572) #4209

Merged

docs(design): tune CAPACITY_MULTIPLIER from parallel-dispatch benches (#1553) #4212

Merged

oferchen mentioned this pull request May 17, 2026

docs(design): parallel receive-side delta application (#1368) #4223

Merged

4 tasks
