bench(engine): single-producer vs multi-producer WorkQueue overhead (#1572) #4209
Merged
…1572) Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion groups that both move 100K items through the concurrent delta work queue:

- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items via the gated `Clone` impl on WorkQueueSender. Compiled only when `--features multi-producer` is set.

Both groups report `Throughput::Elements(100_000)` so items/sec figures compare directly regardless of how the work is split. Inputs are pre-allocated outside the timed section via `iter_batched`, matching the discipline used by the parallel_dispatch_overhead bench.

The MP group is feature-gated behind the engine crate's existing `multi-producer` feature, so no Cargo.toml dependency change is required; the gate already exists at `crates/engine/Cargo.toml:91`. The top-of-file documentation cross-references the audit at `docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173), the #4203 sync_channel bench, and the #4206 parallel_dispatch_overhead bench, and spells out the decision criteria (>=15% SP-vs-MP delta) that this bench informs.
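The shape of the two groups can be approximated without Criterion, as a rough sketch of what is being measured. This assumes `std::sync::mpsc::sync_channel` as a stand-in for the engine's work queue: `DeltaWork` here is a hypothetical placeholder struct, `CAPACITY` is an arbitrary bound, and `make_batches` / `run` are illustrative helpers, not the bench's code. Setup (pre-allocating items) happens outside the timed section, as `iter_batched` does, and the rate is always normalised to the full 100K items.

```rust
use std::sync::mpsc::{sync_channel, SyncSender};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the engine's `DeltaWork` item.
struct DeltaWork {
    index: usize,
}

/// Total items per iteration, matching `Throughput::Elements(100_000)`.
const TOTAL: usize = 100_000;
/// Hypothetical channel bound; the engine derives its own capacity.
const CAPACITY: usize = 1024;

/// Untimed setup (the role of Criterion's `iter_batched` setup closure):
/// split TOTAL pre-allocated items evenly across `producers` batches.
fn make_batches(producers: usize) -> Vec<Vec<DeltaWork>> {
    assert_eq!(TOTAL % producers, 0, "producers must divide TOTAL");
    let per = TOTAL / producers;
    (0..producers)
        .map(|p| {
            (p * per..(p + 1) * per)
                .map(|index| DeltaWork { index })
                .collect::<Vec<DeltaWork>>()
        })
        .collect()
}

/// Timed section: one thread per batch pushes into a single bounded channel
/// while the calling thread drains it. Returns (count, index sum, elapsed).
fn run(batches: Vec<Vec<DeltaWork>>) -> (usize, u64, Duration) {
    let (tx, rx) = sync_channel::<DeltaWork>(CAPACITY);
    // One producer moves the original sender (no clone, like the SP path);
    // N producers need N - 1 clones, like the feature-gated MP path.
    let mut senders: Vec<SyncSender<DeltaWork>> =
        (1..batches.len()).map(|_| tx.clone()).collect();
    senders.push(tx);

    let start = Instant::now();
    let handles: Vec<_> = batches
        .into_iter()
        .zip(senders)
        .map(|(batch, tx)| {
            // `tx` is dropped when this producer thread finishes.
            thread::spawn(move || {
                for item in batch {
                    tx.send(item).expect("receiver dropped");
                }
            })
        })
        .collect();

    // The iterator ends once every sender has been dropped.
    let (count, sum) = rx
        .iter()
        .fold((0usize, 0u64), |(n, s), w| (n + 1, s + w.index as u64));
    for h in handles {
        h.join().expect("producer panicked");
    }
    (count, sum, start.elapsed())
}

fn main() {
    for producers in [1, 4] {
        let batches = make_batches(producers);
        let (count, _sum, elapsed) = run(batches);
        // Normalise to items/sec over all items, regardless of the split.
        let rate = count as f64 / elapsed.as_secs_f64();
        println!("{producers}p/100k: {count} items, {rate:.0} items/sec");
    }
}
```

The index sum is only a cheap check that every item arrived exactly once; absolute timings from this sketch say nothing about the engine's queue.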
oferchen added a commit that referenced this pull request on May 17, 2026
…#4212) Document the four CAPACITY_MULTIPLIER sites and the two duplicate hard-coded `2`s in delta_pipeline.rs, justify each against the recent dispatch benches (#4203 channel overhead, #4204 reorder memory, #4206 dispatch decomposition, #4209 sp vs mp), and recommend keeping the default at 2 with one follow-up bench specified to challenge it.
oferchen added a commit that referenced this pull request on May 17, 2026
…1405) (#4224) Replaces the prior multi-root-focused note at the same path with a focused #1405 design discussion for parallel generator fan-in. Surveys the current SP / MP shape (feature-gated Clone at work_queue/multi_producer.rs), the candidate use cases, the three adjacent designs (Clone via #1569, Arc-wrap via #1610, explicit producer-vector constructor), and the ordering, capacity, and bench evidence that would have to land before flipping the default. Recommends keeping WorkQueueSender Send + !Clone in default builds and the multi-producer cargo feature opt-in. Documents the cost so reviewers can reject naive MP refactors. Promotion to default-on is gated on PR #4209 SP-vs-MP throughput parity, PR #4214 drain bench showing no regression, and an actual fan-in caller materialising.
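The "Send + !Clone by default, Clone behind a cargo feature" shape can be sketched as follows. This is a minimal illustration assuming a wrapper over `std::sync::mpsc::SyncSender`; `QueueSender` and `bounded` are hypothetical names, not the engine's API, and the engine keeps its gated impl in a separate module (`work_queue/multi_producer.rs`) rather than inline as here.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SendError, SyncSender};

/// Hypothetical sender wrapper; illustrates the gating pattern only.
pub struct QueueSender<T> {
    inner: SyncSender<T>,
}

impl<T> QueueSender<T> {
    /// Blocks while the bounded queue is full.
    pub fn send(&self, item: T) -> Result<(), SendError<T>> {
        self.inner.send(item)
    }
}

// Only compiled with `--features multi-producer`. Without it, `QueueSender`
// is `Send` (when `T: Send`) but not `Clone`, so a second producer cannot be
// created by accident.
#[cfg(feature = "multi-producer")]
impl<T> Clone for QueueSender<T> {
    fn clone(&self) -> Self {
        QueueSender { inner: self.inner.clone() }
    }
}

/// Hypothetical constructor for a bounded single-consumer queue.
pub fn bounded<T>(capacity: usize) -> (QueueSender<T>, Receiver<T>) {
    let (inner, rx) = sync_channel(capacity);
    (QueueSender { inner }, rx)
}

fn main() {
    let (tx, rx) = bounded::<u32>(4);
    // Default build: the sender moves into the single producer thread.
    let producer = std::thread::spawn(move || {
        for i in 0..3 {
            tx.send(i).expect("receiver alive");
        }
    });
    let received: Vec<u32> = rx.iter().collect();
    producer.join().unwrap();
    println!("{received:?}");
}
```

Because the impl is removed at compile time rather than hidden behind a runtime check, a naive MP refactor fails to build in the default configuration instead of silently changing ordering or capacity behaviour.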
oferchen added a commit that referenced this pull request on May 18, 2026
…1572) (#4209) Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion groups that both move 100K items through the concurrent delta work queue:

- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items via the gated `Clone` impl on WorkQueueSender. Compiled only when `--features multi-producer` is set.

Both groups report `Throughput::Elements(100_000)` so items/sec figures compare directly regardless of how the work is split. Inputs are pre-allocated outside the timed section via `iter_batched`, matching the discipline used by the parallel_dispatch_overhead bench.

The MP group is feature-gated behind the engine crate's existing `multi-producer` feature, so no Cargo.toml dependency change is required; the gate already exists at `crates/engine/Cargo.toml:91`. The top-of-file documentation cross-references the audit at `docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173), the #4203 sync_channel bench, and the #4206 parallel_dispatch_overhead bench, and spells out the decision criteria (>=15% SP-vs-MP delta) that this bench informs.
Summary
- Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` (Criterion bench) with two groups that both move 100K items through the concurrent delta work queue:
  - `sp/1p/100k`: one producer thread pushes 100K pre-allocated `DeltaWork` items via the default-build `Send + !Clone` `WorkQueueSender` path.
  - `mp/4p/100k`: four producer threads each push 25K pre-allocated items via the gated `Clone` impl on `WorkQueueSender`. Compiled only when `--features multi-producer` is set.
- Both groups report `Throughput::Elements(100_000)` so items/sec compare directly regardless of producer count. Inputs are pre-allocated outside the timed section via `iter_batched`, matching the discipline used by the `parallel_dispatch_overhead` bench (#4206, "bench(engine): decompose parallel dispatch overhead at 100K items (#1551)").
- The MP group reuses the engine crate's existing feature, declared as `multi-producer = []` at `crates/engine/Cargo.toml:91`. Only the `[[bench]]` entry is added.

Why
PR #4173 audited every live `WorkQueueSender` site and concluded that all three production producer sites are correctly single-producer today. This PR (task #1572) produces the quantitative SP-vs-MP overhead figure that gates whether the `multi-producer` feature should stay opt-in or graduate to default-on if a future caller wants MP fan-in. The bench's top-of-file documentation spells out the decision criteria (a >=15% delta threshold, matching the buffer-pool sharded-benchmark gate) and cross-references the #4173 audit, the #4203 `sync_channel` bench, and the #4206 `parallel_dispatch_overhead` bench, so reviewers can read all four data points together.

Test plan
- `cargo bench -p engine --bench sp_vs_mp_workqueue` builds and runs the SP group on the default build (locally, only `cargo fmt` was run; the benches themselves are deferred to CI per repo policy).
- `cargo bench -p engine --features multi-producer --bench sp_vs_mp_workqueue` builds and runs both groups when the feature is enabled.
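Once both groups have run, applying the >=15% gate from the Why section is simple arithmetic. The sketch below assumes the delta is taken relative to the SP figure and treats a gap in either direction as crossing the gate; the PR text does not say which baseline the bench's documentation uses or which outcome maps to which decision, so `relative_delta`, `crosses_gate`, and `GATE` are hypothetical helpers, and the numbers in `main` are illustrative, not measured results.

```rust
/// Threshold taken from the PR text (>=15%); the bench's own constant may differ.
const GATE: f64 = 0.15;

/// Relative SP-vs-MP throughput delta, as a fraction of the SP figure.
/// Assumption: the delta is measured relative to single-producer throughput.
fn relative_delta(sp_items_per_sec: f64, mp_items_per_sec: f64) -> f64 {
    (mp_items_per_sec - sp_items_per_sec).abs() / sp_items_per_sec
}

/// True when the measured gap reaches the gate in either direction.
fn crosses_gate(sp_items_per_sec: f64, mp_items_per_sec: f64) -> bool {
    relative_delta(sp_items_per_sec, mp_items_per_sec) >= GATE
}

fn main() {
    // Illustrative numbers only, not measured results.
    let (sp, mp) = (1_000_000.0, 900_000.0);
    println!(
        "delta = {:.1}%, crosses gate: {}",
        relative_delta(sp, mp) * 100.0,
        crosses_gate(sp, mp)
    );
}
```

Because both groups report `Throughput::Elements(100_000)`, the two items/sec figures Criterion prints can be fed in directly without rescaling for the 1-vs-4 producer split.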