bench(engine): decompose parallel dispatch overhead at 100K items (#1551) #4206
Merged
Add a Criterion bench that isolates each of the three dispatch-time
costs the concurrent delta pipeline pays at 100K-file scale:
- thread_spawn_only: pure OS thread lifecycle, swept over {1, 4, 8, 16}
- channel_only: bounded crossbeam work_queue send/recv with no workers
- reorderbuffer_only: ReorderBuffer insert + drain on a shuffled sequence
Each group pre-allocates inputs outside the timed section and reports
throughput via Throughput::Elements(100_000) so cells produce comparable
ops/sec figures. The top-of-file module doc cross-references the
existing reorder cache, sync-channel, and memory benches and names the
optimisation tracks (lock-free MPSC, per-thread pools, buffer slab) the
results will steer.
Refs #1551
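As a rough illustration of what the first two groups isolate, here is a minimal std-only sketch. It is not the bench from this PR: `time_thread_spawn`, `time_channel_only`, and `elements_per_sec` are hypothetical helpers, `std::sync::mpsc::sync_channel` stands in for the crossbeam-backed work_queue, and a plain `Instant` timer stands in for Criterion. It keeps the same discipline of allocating inputs before the timer starts and reporting elements per second.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: spawn `n` no-op threads and join them all,
/// returning the wall-clock time of the whole spawn/join lifecycle.
pub fn time_thread_spawn(n: usize) -> Duration {
    let start = Instant::now();
    let handles: Vec<_> = (0..n).map(|i| thread::spawn(move || i)).collect();
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    start.elapsed()
}

/// Hypothetical helper: push pre-allocated items through a bounded channel
/// on a single thread (no workers). Items are sent in chunks of at most
/// `capacity` and then received, so a send never blocks on a full queue.
/// Returns the elapsed time and a checksum of everything received.
pub fn time_channel_only(items: Vec<u64>, capacity: usize) -> (Duration, u64) {
    assert!(capacity > 0, "a zero-capacity channel would deadlock here");
    let (tx, rx) = sync_channel::<u64>(capacity);
    let mut iter = items.into_iter().peekable();
    let mut checksum = 0u64;

    // The input vector was built by the caller, outside the timed section.
    let start = Instant::now();
    while iter.peek().is_some() {
        let mut sent = 0;
        while sent < capacity {
            match iter.next() {
                Some(item) => {
                    tx.send(item).expect("receiver alive");
                    sent += 1;
                }
                None => break,
            }
        }
        for _ in 0..sent {
            checksum = checksum.wrapping_add(rx.recv().expect("sender alive"));
        }
    }
    (start.elapsed(), checksum)
}

/// Convert an element count and elapsed time into an elements/sec figure.
pub fn elements_per_sec(elements: u64, elapsed: Duration) -> f64 {
    elements as f64 / elapsed.as_secs_f64()
}
```

Calling `time_thread_spawn` for each of 1, 4, 8, and 16 mirrors the sweep described above. Dividing a fixed element count by each elapsed time gives figures that can be compared across components, which is the role `Throughput::Elements(100_000)` plays in the real bench.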
This was referenced May 17, 2026
oferchen
added a commit
that referenced
this pull request
May 17, 2026
…#4212) Document the four CAPACITY_MULTIPLIER sites and the two duplicate hard-coded `2`s in delta_pipeline.rs, justify each against the recent dispatch benches (#4203 channel overhead, #4204 reorder memory, #4206 dispatch decomposition, #4209 sp vs mp), and recommend keeping the default at 2 with one follow-up bench specified to challenge it.
oferchen
added a commit
that referenced
this pull request
May 17, 2026
…1572) (#4209) Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion groups that both move 100K items through the concurrent delta work queue:
- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items via the gated `Clone` impl on WorkQueueSender. Compiled only when `--features multi-producer` is set.
Both groups report `Throughput::Elements(100_000)` so items/sec figures compare directly regardless of how the work is split. Inputs are pre-allocated outside the timed section via `iter_batched`, matching the discipline used by the parallel_dispatch_overhead bench. The MP group is feature-gated behind the engine crate's existing `multi-producer` feature - no Cargo.toml dependency change is required; the gate already exists at `crates/engine/Cargo.toml:91`. The top-of-file documentation cross-references the audit at `docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173), #4203 sync_channel bench, and #4206 parallel_dispatch_overhead bench, and spells out the decision criteria (>=15% SP-vs-MP delta) that this bench informs.
oferchen
added a commit
that referenced
this pull request
May 17, 2026
Captures the design for parallelizing the receiver's per-file delta apply loop while preserving per-file token order and wire-format parity. Documents the current sequential surface, the dormant ParallelDeltaPipeline that would host the change, the backpressure model, and the gating prerequisites - chiefly the parity-test gap flagged by audit #4205 (G2) - that block default adoption. Recommends a phased opt-in rollout: land the sequential-vs-parallel parity test first, then add a hidden CLI gate, then collect #4214 / #4206 bench evidence, and only then consider flipping the default.
oferchen
added a commit
that referenced
this pull request
May 18, 2026
) (#4206)
* bench(engine): decompose parallel dispatch overhead at 100K items
Add a Criterion bench that isolates each of the three dispatch-time costs the concurrent delta pipeline pays at 100K-file scale:
- thread_spawn_only: pure OS thread lifecycle, swept over {1, 4, 8, 16}
- channel_only: bounded crossbeam work_queue send/recv with no workers
- reorderbuffer_only: ReorderBuffer insert + drain on a shuffled sequence
Each group pre-allocates inputs outside the timed section and reports throughput via Throughput::Elements(100_000) so cells produce comparable ops/sec figures. The top-of-file module doc cross-references the existing reorder cache, sync-channel, and memory benches and names the optimisation tracks (lock-free MPSC, per-thread pools, buffer slab) the results will steer.
Refs #1551
* style(engine): avoid clippy doc list false positive in bench
oferchen
added a commit
that referenced
this pull request
May 18, 2026
…#4212) Document the four CAPACITY_MULTIPLIER sites and the two duplicate hard-coded `2`s in delta_pipeline.rs, justify each against the recent dispatch benches (#4203 channel overhead, #4204 reorder memory, #4206 dispatch decomposition, #4209 sp vs mp), and recommend keeping the default at 2 with one follow-up bench specified to challenge it.
oferchen
added a commit
that referenced
this pull request
May 18, 2026
…1572) (#4209) Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion groups that both move 100K items through the concurrent delta work queue:
- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items via the gated `Clone` impl on WorkQueueSender. Compiled only when `--features multi-producer` is set.
Both groups report `Throughput::Elements(100_000)` so items/sec figures compare directly regardless of how the work is split. Inputs are pre-allocated outside the timed section via `iter_batched`, matching the discipline used by the parallel_dispatch_overhead bench. The MP group is feature-gated behind the engine crate's existing `multi-producer` feature - no Cargo.toml dependency change is required; the gate already exists at `crates/engine/Cargo.toml:91`. The top-of-file documentation cross-references the audit at `docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173), #4203 sync_channel bench, and #4206 parallel_dispatch_overhead bench, and spells out the decision criteria (>=15% SP-vs-MP delta) that this bench informs.
oferchen
added a commit
that referenced
this pull request
May 18, 2026
Captures the design for parallelizing the receiver's per-file delta apply loop while preserving per-file token order and wire-format parity. Documents the current sequential surface, the dormant ParallelDeltaPipeline that would host the change, the backpressure model, and the gating prerequisites - chiefly the parity-test gap flagged by audit #4205 (G2) - that block default adoption. Recommends a phased opt-in rollout: land the sequential-vs-parallel parity test first, then add a hidden CLI gate, then collect #4214 / #4206 bench evidence, and only then consider flipping the default.
Summary
Adds a Criterion bench that decomposes the concurrent delta pipeline's dispatch-time cost into its three constituent components at the 100K work-item scale. Each group is stripped of the other two so reviewers can see which one dominates and steer the next round of optimisation accordingly.
- `thread_spawn_only` - pure `std::thread::spawn`/`join` lifecycle, swept over {1, 4, 8, 16}.
- `channel_only` - bounded `work_queue::bounded` (`crossbeam_channel`) send/recv, no workers, no reorder buffer.
- `reorderbuffer_only` - `ReorderBuffer` insert + drain on a deterministically shuffled 100K-sequence.

Inputs are pre-allocated outside the timed section via `iter_batched`. Every group reports throughput as `Throughput::Elements(100_000)` so ops/sec figures are directly comparable across components.

The top-of-file module doc cross-references the related benches (#4180 reorder cache, #4203 sync-channel, #4204 reorder memory, #1885 in-bench metrics) and names the optimisation tracks the results will steer (#1681 lock-free MPSC, #1370 per-thread pools, #1271 buffer slab).
Refs #1551.
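For readers unfamiliar with the reorder step, the following is a minimal sketch of the kind of work `reorderbuffer_only` exercises: insert out-of-order sequence numbers, then drain whatever has become contiguous. `ReorderSketch`, `shuffled_sequence`, and `reorder` are hypothetical names, and the `BTreeMap` layout is an assumption for illustration. The engine's actual `ReorderBuffer` and the shuffle used by the bench may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the engine's ReorderBuffer: holds items that
/// arrived out of order until the next expected sequence number shows up.
pub struct ReorderSketch<T> {
    next: u64,
    pending: BTreeMap<u64, T>,
}

impl<T> ReorderSketch<T> {
    pub fn new() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }

    /// Park an item under its sequence number.
    pub fn insert(&mut self, seq: u64, item: T) {
        self.pending.insert(seq, item);
    }

    /// Move every item that is now contiguous with the last released one
    /// into `out`, in sequence order.
    pub fn drain_ready(&mut self, out: &mut Vec<T>) {
        while let Some(item) = self.pending.remove(&self.next) {
            out.push(item);
            self.next += 1;
        }
    }
}

/// Deterministic Fisher-Yates shuffle of 0..n driven by a fixed-seed LCG,
/// so every run sees the same arrival order (no external RNG crate needed).
pub fn shuffled_sequence(n: u64, seed: u64) -> Vec<u64> {
    let mut seq: Vec<u64> = (0..n).collect();
    let mut state = seed;
    for i in (1..seq.len()).rev() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = ((state >> 33) as usize) % (i + 1);
        seq.swap(i, j);
    }
    seq
}

/// Feed a shuffled sequence through the buffer and collect the ordered output.
pub fn reorder(input: Vec<u64>) -> Vec<u64> {
    let mut buf = ReorderSketch::new();
    let mut out = Vec::with_capacity(input.len());
    for seq in input {
        buf.insert(seq, seq);
        buf.drain_ready(&mut out);
    }
    out
}
```

Building the shuffled input with a fixed seed before timing starts keeps the measured region limited to insert and drain, and makes runs repeatable.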
Test plan
`cargo bench -p engine --bench parallel_dispatch_overhead` runs all three groups under the ~3 second per-iteration budget on a workstation-class machine