docs(audits): WorkQueueSender multi-producer usage audit (#1383) #4173

Merged
oferchen merged 1 commit into master from docs/workqueue-sender-audit-1383
May 16, 2026
Conversation

@oferchen
Owner

Summary

Test plan

  • CI fmt+clippy
  • CI nextest (stable)
  • CI Windows, macOS, Linux musl
  • Pure docs change; no source files modified

@github-actions github-actions Bot added the documentation label May 16, 2026
Promote the in-source audit at multi_producer_audit.rs into a
workspace-level audit with explicit file:line citations for every
production producer site and a complete inventory of test-only sites.

Findings: all 3 production producer sites correctly use single-producer
ownership; zero sites require or pseudo-require multi-producer. Keep
WorkQueueSender Send+!Clone by default, keep the multi-producer feature
gated, and do not introduce an Arc<WorkQueueSender> primitive.
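
The ownership shape the audit recommends can be sketched roughly as follows. This is an illustrative stand-in built on `std::sync::mpsc`, not the engine's actual code: the struct body, the `work_queue` constructor, and `run_single_producer` are assumptions made for the example. Only the `Send + !Clone` default and the `multi-producer` feature gate come from the audit text.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SendError, SyncSender};
use std::thread;

/// Hypothetical stand-in for the real WorkQueueSender: it is `Send` (it can
/// move to a producer thread) but deliberately does not implement `Clone`.
pub struct WorkQueueSender<T> {
    inner: SyncSender<T>,
}

impl<T> WorkQueueSender<T> {
    pub fn send(&self, item: T) -> Result<(), SendError<T>> {
        self.inner.send(item)
    }
}

// Multi-producer use is opt-in: `Clone` exists only when the feature is on.
#[cfg(feature = "multi-producer")]
impl<T> Clone for WorkQueueSender<T> {
    fn clone(&self) -> Self {
        Self { inner: self.inner.clone() }
    }
}

/// Hypothetical constructor: a bounded queue with one sender and one receiver.
pub fn work_queue<T>(capacity: usize) -> (WorkQueueSender<T>, Receiver<T>) {
    let (tx, rx) = sync_channel(capacity);
    (WorkQueueSender { inner: tx }, rx)
}

/// Moves the sole sender into one producer thread and drains the queue.
pub fn run_single_producer(items: u32) -> Vec<u32> {
    let (tx, rx) = work_queue::<u32>(16);
    let producer = thread::spawn(move || {
        for i in 0..items {
            tx.send(i).expect("receiver dropped");
        }
        // `tx` is dropped here, which closes the queue.
    });
    let received: Vec<u32> = rx.iter().collect();
    producer.join().expect("producer panicked");
    received
}
```

Because the sender is moved rather than shared, the queue closes as soon as the one producer finishes, and no `Arc` wrapper is needed to reach that point.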
@oferchen oferchen force-pushed the docs/workqueue-sender-audit-1383 branch from 9698d42 to 8feb963 Compare May 16, 2026 20:03
@oferchen oferchen merged commit b8d19d3 into master May 16, 2026
7 checks passed
@oferchen oferchen deleted the docs/workqueue-sender-audit-1383 branch May 16, 2026 20:03
oferchen added a commit that referenced this pull request May 17, 2026
… WorkQueue (#1573) (#4207)

Engages with the #4173 audit conclusion that WorkQueueSender stays
single-producer. Shows that I1 (#2196) is the wrong instrument for the
first-byte hypothesis: enumeration runs before send_file_list entry,
so I1 excludes it by construction. Recommends deferring pending a W1
benchmark (process start to first inbound flist byte) on multi-root
cold-cache workloads.
oferchen added a commit that referenced this pull request May 17, 2026
Design note for the lock-free MPSC variant of
WorkQueueReceiver::drain_parallel. Sketches the crossbeam_channel
swap, documents the ordering contract delegated to ReorderBuffer,
defines the 20% threshold on the #4214 drain_parallel_alternatives
bench that gates the migration, and lays out a feature-flag rollout
plan. Recommendation is to defer the implementation until the
#4214 numbers land on the reference host. Cross-refs #4170, #4173,
#4203, #4214.
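
To illustrate what delegating the ordering contract to a reorder buffer can mean, here is a minimal sketch. It is not the project's `ReorderBuffer`: the sequence-number scheme, the `BTreeMap` storage, and the `push` signature are all assumptions for the example. The idea is that a channel-based drain may deliver items in any order, and a downstream buffer restores sequence order before anything is emitted.

```rust
use std::collections::BTreeMap;

/// Hypothetical reorder buffer: accepts (sequence, item) pairs in any order
/// and releases items strictly in sequence order, starting from 0.
pub struct ReorderBuffer<T> {
    next: u64,
    pending: BTreeMap<u64, T>,
}

impl<T> ReorderBuffer<T> {
    pub fn new() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }

    /// Stores one arrival and returns every item that is now ready to emit.
    pub fn push(&mut self, seq: u64, item: T) -> Vec<T> {
        self.pending.insert(seq, item);
        let mut ready = Vec::new();
        // Release the contiguous run that starts at the next expected sequence.
        while let Some(item) = self.pending.remove(&self.next) {
            ready.push(item);
            self.next += 1;
        }
        ready
    }
}
```

With this shape the collector upstream is free to be unordered, which is what makes a lock-free MPSC drain a candidate at all.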
oferchen added a commit that referenced this pull request May 17, 2026
…1572) (#4209)

Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion
groups that both move 100K items through the concurrent delta work
queue:

- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork
  items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items
  via the gated `Clone` impl on WorkQueueSender. Compiled only when
  `--features multi-producer` is set.

Both groups report `Throughput::Elements(100_000)` so items/sec figures
compare directly regardless of how the work is split. Inputs are
pre-allocated outside the timed section via `iter_batched`, matching
the discipline used by the parallel_dispatch_overhead bench.

The MP group is feature-gated behind the engine crate's existing
`multi-producer` feature, so no Cargo.toml dependency change is required;
the gate already exists at `crates/engine/Cargo.toml:91`. The top-of-file
documentation cross-references the audit at
`docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173),
the #4203 sync_channel bench, and the #4206 parallel_dispatch_overhead
bench, and spells out the decision criterion (>=15% SP-vs-MP delta) that
this bench informs.
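
The comparison described above can be approximated without Criterion. The sketch below is a simplified stand-in, not the contents of `sp_vs_mp_workqueue.rs`. It uses `std::sync::mpsc::sync_channel` in place of the engine's work queue, a hypothetical `DeltaWork` struct, and `Instant` timing instead of Criterion's `iter_batched`. Unlike the real bench, the timed section here also includes thread spawn cost.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real DeltaWork item.
#[derive(Debug)]
pub struct DeltaWork {
    pub id: u64,
}

/// Pushes `total` pre-allocated items through a bounded queue using
/// `producers` threads and returns (items received, elapsed time).
/// Assumes `total` divides evenly by `producers`.
pub fn move_items(total: usize, producers: usize) -> (usize, Duration) {
    assert!(producers > 0 && total % producers == 0);
    let per_producer = total / producers;

    // Pre-allocate inputs outside the timed section.
    let batches: Vec<Vec<DeltaWork>> = (0..producers)
        .map(|p| {
            (0..per_producer)
                .map(|i| DeltaWork { id: (p * per_producer + i) as u64 })
                .collect()
        })
        .collect();

    let (tx, rx) = sync_channel::<DeltaWork>(1024);
    let start = Instant::now();
    let handles: Vec<_> = batches
        .into_iter()
        .map(|batch| {
            let tx = tx.clone();
            thread::spawn(move || {
                for item in batch {
                    tx.send(item).expect("receiver dropped");
                }
            })
        })
        .collect();
    drop(tx); // Drop the original so the receiver sees the queue close.

    let received = rx.iter().count();
    let elapsed = start.elapsed();
    for handle in handles {
        handle.join().expect("producer panicked");
    }
    (received, elapsed)
}

/// Items per second, comparable across splits because both shapes are
/// normalised by the same total element count.
pub fn items_per_sec(total: usize, elapsed: Duration) -> f64 {
    total as f64 / elapsed.as_secs_f64()
}
```

Calling `move_items(100_000, 1)` and `move_items(100_000, 4)` and passing each elapsed time to `items_per_sec` gives two figures on the same scale, which is the property `Throughput::Elements(100_000)` provides in the Criterion groups.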
oferchen added a commit that referenced this pull request May 17, 2026
Adds drain_parallel_alternatives benchmark comparing three fan-in
strategies on the WorkQueueReceiver drain path: the current sharded
Mutex<Vec> indexed by rayon thread index, a per-thread Vec with final
concat (no mutex), and a crossbeam_channel MPSC drain. Runs at 10K and
100K items across 4, 8, and 16 rayon workers, reporting throughput in
elements per second so reviewers can compare per-iteration cost
directly.

Pre-allocates DeltaWork items outside the timed section, isolates each
worker count in a private rayon pool, and shares the same simulated
per-item compute across strategies so the only delta between groups is
the collector itself. Closes the measurement gap that #1681 needs to
decide whether the current sharded Mutex<Vec> warrants replacement
(#1682, refs #4170 / #4173 / #4203).
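
The three fan-in strategies can be sketched side by side. This is a std-only approximation with stated substitutions: scoped threads stand in for the rayon pool, `std::sync::mpsc::channel` stands in for `crossbeam_channel`, and `compute`, `ranges`, and the three function names are hypothetical. Each strategy runs the same per-item compute, so only the collector differs.

```rust
use std::ops::Range;
use std::sync::mpsc::channel;
use std::sync::Mutex;
use std::thread;

/// Simulated per-item compute shared by all three strategies.
fn compute(x: u64) -> u64 {
    x.wrapping_mul(2_654_435_761).rotate_left(7)
}

/// Splits 0..n into `workers` contiguous ranges (`workers` must be > 0).
fn ranges(n: u64, workers: u64) -> Vec<Range<u64>> {
    let chunk = (n + workers - 1) / workers;
    (0..workers)
        .map(|w| (w * chunk).min(n)..((w + 1) * chunk).min(n))
        .collect()
}

/// Strategy 1: one Mutex<Vec> shard per worker, indexed by worker number.
pub fn sharded_mutex(n: u64, workers: u64) -> Vec<u64> {
    let shards: Vec<Mutex<Vec<u64>>> =
        (0..workers).map(|_| Mutex::new(Vec::new())).collect();
    thread::scope(|s| {
        for (w, range) in ranges(n, workers).into_iter().enumerate() {
            let shard = &shards[w];
            s.spawn(move || {
                for x in range {
                    shard.lock().unwrap().push(compute(x));
                }
            });
        }
    });
    shards.into_iter().flat_map(|m| m.into_inner().unwrap()).collect()
}

/// Strategy 2: each worker fills a private Vec; results are concatenated.
pub fn per_thread_concat(n: u64, workers: u64) -> Vec<u64> {
    thread::scope(|s| {
        let handles: Vec<_> = ranges(n, workers)
            .into_iter()
            .map(|range| s.spawn(move || range.map(compute).collect::<Vec<u64>>()))
            .collect();
        handles.into_iter().flat_map(|h| h.join().unwrap()).collect()
    })
}

/// Strategy 3: workers send results over an MPSC channel; one drain collects.
pub fn channel_drain(n: u64, workers: u64) -> Vec<u64> {
    let (tx, rx) = channel::<u64>();
    thread::scope(|s| {
        for range in ranges(n, workers) {
            let tx = tx.clone();
            s.spawn(move || {
                for x in range {
                    tx.send(compute(x)).unwrap();
                }
            });
        }
    });
    drop(tx); // Close the channel so the drain below terminates.
    rx.iter().collect()
}
```

All three return the same multiset of results. The channel variant does not preserve per-worker order, so any caller that needs ordering has to restore it downstream.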