docs(audits): WorkQueueSender multi-producer usage audit (#1383) #4173
Merged
Promote the in-source audit at multi_producer_audit.rs into a workspace-level audit with explicit file:line citations for every production producer site and a complete inventory of test-only sites. Findings: all 3 production producer sites correctly use single-producer ownership; zero sites require or pseudo-require multi-producer. Keep WorkQueueSender Send+!Clone by default, keep the multi-producer feature gated, and do not introduce an Arc<WorkQueueSender> primitive.
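The ownership rule the audit defends can be illustrated with a minimal sketch. It assumes a wrapper over `std::sync::mpsc::SyncSender` as a stand-in for the project's channel; the `DeltaWork` payload, the `work_queue` constructor, and `run_single_producer` are hypothetical names for illustration, not the crate's actual API. The wrapper is `Send` but has no `Clone` impl unless the `multi-producer` feature is enabled, and dropping the only sender is what tells the receiver to stop.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SendError, SyncSender};
use std::thread;

/// Hypothetical stand-in for the real work item type.
#[derive(Debug, PartialEq)]
pub struct DeltaWork(pub u64);

/// Wraps the channel sender without deriving `Clone`, so a default build has
/// exactly one producer handle. It is still `Send` because `SyncSender<T>`
/// is `Send` whenever `T: Send`.
pub struct WorkQueueSender {
    inner: SyncSender<DeltaWork>,
}

impl WorkQueueSender {
    pub fn send(&self, item: DeltaWork) -> Result<(), SendError<DeltaWork>> {
        self.inner.send(item)
    }
}

// Opt-in cloning, compiled only when the feature is enabled.
#[cfg(feature = "multi-producer")]
impl Clone for WorkQueueSender {
    fn clone(&self) -> Self {
        Self { inner: self.inner.clone() }
    }
}

/// Hypothetical constructor: a bounded queue with a single sender handle.
pub fn work_queue(capacity: usize) -> (WorkQueueSender, Receiver<DeltaWork>) {
    let (tx, rx) = sync_channel(capacity);
    (WorkQueueSender { inner: tx }, rx)
}

/// Moves the only sender into a producer thread. When that thread returns,
/// the sender is dropped, which ends the receiver's iteration.
pub fn run_single_producer(n: u64) -> Vec<DeltaWork> {
    let (tx, rx) = work_queue(8);
    let producer = thread::spawn(move || {
        for i in 0..n {
            tx.send(DeltaWork(i)).expect("receiver alive");
        }
        // `tx` goes out of scope here: this is the shutdown signal.
    });
    let drained: Vec<DeltaWork> = rx.iter().collect();
    producer.join().expect("producer panicked");
    drained
}
```

In this sketch, calling `tx.clone()` in a default build is a compile error, which is the property the audit recommends keeping.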
9698d42 to 8feb963
oferchen added a commit that referenced this pull request on May 17, 2026
… WorkQueue (#1573) (#4207)

Engages with the #4173 audit conclusion that WorkQueueSender stays single-producer. Shows that I1 (#2196) is the wrong instrument for the first-byte hypothesis: enumeration runs before send_file_list entry, so I1 excludes it by construction. Recommends deferring pending a W1 benchmark (process start to first inbound flist byte) on multi-root cold-cache workloads.
This was referenced May 17, 2026
oferchen added a commit that referenced this pull request on May 17, 2026
Design note for the lock-free MPSC variant of WorkQueueReceiver::drain_parallel. Sketches the crossbeam_channel swap, documents the ordering contract delegated to ReorderBuffer, defines the 20% threshold on the #4214 drain_parallel_alternatives bench that gates the migration, and lays out a feature-flag rollout plan. Recommendation is to defer the implementation until the #4214 numbers land on the reference host. Cross-refs #4170, #4173, #4203, #4214.
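The ordering contract mentioned above can be pictured with a small sketch: workers may finish out of order, and a reorder stage releases results strictly by sequence number. `SketchReorderBuffer` below is a hypothetical illustration under that assumption; it is not the project's `ReorderBuffer`, whose real interface is not shown in this thread.

```rust
use std::collections::BTreeMap;

/// Hypothetical reorder stage: accepts (sequence, item) pairs in any order
/// and releases items only in ascending sequence order, starting at 0.
pub struct SketchReorderBuffer<T> {
    next_seq: u64,
    pending: BTreeMap<u64, T>,
}

impl<T> SketchReorderBuffer<T> {
    pub fn new() -> Self {
        Self { next_seq: 0, pending: BTreeMap::new() }
    }

    /// Stores `item` under `seq` and returns every item that has become
    /// deliverable, in order. Returns an empty Vec while a gap remains.
    pub fn push(&mut self, seq: u64, item: T) -> Vec<T> {
        self.pending.insert(seq, item);
        let mut ready = Vec::new();
        while let Some(next) = self.pending.remove(&self.next_seq) {
            ready.push(next);
            self.next_seq += 1;
        }
        ready
    }

    /// Number of items still waiting on an earlier sequence number.
    pub fn pending_len(&self) -> usize {
        self.pending.len()
    }
}
```

With a stage like this downstream, the collector itself does not need to preserve arrival order, which is what makes an unordered channel drain a candidate in the first place.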
oferchen added a commit that referenced this pull request on May 17, 2026
…1572) (#4209)

Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion groups that both move 100K items through the concurrent delta work queue:

- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items via the gated `Clone` impl on WorkQueueSender. Compiled only when `--features multi-producer` is set.

Both groups report `Throughput::Elements(100_000)` so items/sec figures compare directly regardless of how the work is split. Inputs are pre-allocated outside the timed section via `iter_batched`, matching the discipline used by the parallel_dispatch_overhead bench.

The MP group is feature-gated behind the engine crate's existing `multi-producer` feature - no Cargo.toml dependency change is required; the gate already exists at `crates/engine/Cargo.toml:91`. The top-of-file documentation cross-references the audit at `docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173), #4203 sync_channel bench, and #4206 parallel_dispatch_overhead bench, and spells out the decision criteria (>=15% SP-vs-MP delta) that this bench informs.
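The workload shape of the two groups can be sketched without Criterion. The example below is an assumption-laden illustration: it uses `std::sync::mpsc::sync_channel` in place of the project's queue, and `DeltaWork`, `preallocate`, and `push_through` are hypothetical names. It shows only how the same total is split across one or four producers with inputs built before any thread starts; timing and throughput reporting are left to the real bench.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for the real work item type.
#[derive(Debug)]
pub struct DeltaWork(pub u64);

/// Builds `total` items up front, split evenly into `producers` batches,
/// so allocation stays outside whatever section is later timed.
pub fn preallocate(producers: usize, total: usize) -> Vec<Vec<DeltaWork>> {
    assert!(producers > 0 && total % producers == 0, "total must split evenly");
    let per = total / producers;
    (0..producers)
        .map(|p| (0..per).map(|i| DeltaWork((p * per + i) as u64)).collect())
        .collect()
}

/// Pushes every batch through one bounded channel, one thread per batch,
/// and returns how many items the consumer received.
pub fn push_through(batches: Vec<Vec<DeltaWork>>, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<DeltaWork>(capacity);
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let tx = tx.clone();
            thread::spawn(move || {
                for item in batch {
                    tx.send(item).expect("receiver alive");
                }
            })
        })
        .collect();
    // Drop the original handle so the channel closes once all producers finish.
    drop(tx);
    let received = rx.iter().count();
    for handle in handles {
        handle.join().expect("producer panicked");
    }
    received
}
```

Calling `push_through(preallocate(1, 100_000), 1024)` mirrors the single-producer shape, and `push_through(preallocate(4, 100_000), 1024)` mirrors the four-producer shape; both move the same 100K items, which is why a shared element count makes the figures comparable.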
oferchen added a commit that referenced this pull request on May 17, 2026
Adds drain_parallel_alternatives benchmark comparing three fan-in strategies on the WorkQueueReceiver drain path: the current sharded Mutex<Vec> indexed by rayon thread index, a per-thread Vec with final concat (no mutex), and a crossbeam_channel MPSC drain. Runs at 10K and 100K items across 4, 8, and 16 rayon workers, reporting throughput in elements per second so reviewers can compare per-iteration cost directly. Pre-allocates DeltaWork items outside the timed section, isolates each worker count in a private rayon pool, and shares the same simulated per-item compute across strategies so the only delta between groups is the collector itself. Closes the measurement gap that #1681 needs to decide whether the current sharded Mutex<Vec> warrants replacement (#1682, refs #4170 / #4173 / #4203).
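The three collectors being compared can be sketched side by side. This is a simplified illustration under stated assumptions: it uses `std::thread::scope` instead of a rayon pool and `std::sync::mpsc` instead of `crossbeam_channel`, works on plain `u64` values rather than DeltaWork, and every function name here is hypothetical. All three run the same `compute` step, so only the fan-in differs.

```rust
use std::sync::mpsc::channel;
use std::sync::Mutex;
use std::thread;

/// Simulated per-item compute, shared by every strategy.
pub fn compute(x: u64) -> u64 {
    x.wrapping_mul(2_654_435_761)
}

/// Items per worker, rounded up; never zero so `chunks` is always valid.
fn chunk_len(len: usize, workers: usize) -> usize {
    assert!(workers > 0, "need at least one worker");
    ((len + workers - 1) / workers).max(1)
}

/// Strategy 1: one `Mutex<Vec>` shard per worker, indexed by worker id.
pub fn sharded_mutex(items: &[u64], workers: usize) -> Vec<u64> {
    let chunk = chunk_len(items.len(), workers);
    let shards: Vec<Mutex<Vec<u64>>> =
        (0..workers).map(|_| Mutex::new(Vec::new())).collect();
    thread::scope(|s| {
        for (w, part) in items.chunks(chunk).enumerate() {
            let shard = &shards[w];
            s.spawn(move || {
                for &x in part {
                    shard.lock().unwrap().push(compute(x));
                }
            });
        }
    });
    shards.into_iter().flat_map(|m| m.into_inner().unwrap()).collect()
}

/// Strategy 2: each worker fills a private Vec; results are concatenated
/// after the join, so no lock is taken.
pub fn per_thread_concat(items: &[u64], workers: usize) -> Vec<u64> {
    let chunk = chunk_len(items.len(), workers);
    thread::scope(|s| {
        let handles: Vec<_> = items
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(|&x| compute(x)).collect::<Vec<u64>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

/// Strategy 3: workers send results into an MPSC channel that is drained
/// once every sender has been dropped. Output order is not preserved.
pub fn channel_drain(items: &[u64], workers: usize) -> Vec<u64> {
    let chunk = chunk_len(items.len(), workers);
    let (tx, rx) = channel::<u64>();
    thread::scope(|s| {
        for part in items.chunks(chunk) {
            let tx = tx.clone();
            s.spawn(move || {
                for &x in part {
                    tx.send(compute(x)).expect("receiver alive");
                }
            });
        }
    });
    drop(tx); // close the channel so the drain below terminates
    rx.iter().collect()
}
```

The first two strategies happen to return results in input order in this sketch, while the channel drain does not, so any comparison of outputs should sort first or rely on a downstream reorder step.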
Summary

- Promote the in-source audit at `crates/engine/src/concurrent_delta/multi_producer_audit.rs` into a workspace-level audit at `docs/audits/workqueue-sender-multi-producer-audit.md` with explicit file:line citations for every production producer site and a complete inventory of test-only sites.
- Findings: all 3 production producer sites (`ParallelDeltaPipeline` / `ThresholdDeltaPipeline` plus the sender-drop shutdown signal) correctly use single-producer ownership; zero sites require multi-producer; zero use Arc/Mutex wrappers that would qualify as pseudo-multi-producer.
- Keep `WorkQueueSender` `Send + !Clone` by default, keep the `multi-producer` cargo feature gated, and do not introduce an `Arc<WorkQueueSender>` primitive (#1610 / #1613). The `crossbeam_channel::Sender` is already internally an `Arc`; doubling the refcount layer adds nothing.
- Cross-references: `docs/audits/workqueue-sp-vs-mp-overhead.md` (#1572) benchmark plan and `docs/design/arc-workqueue-sender-eval.md` (#1383 evaluation note).

Test plan