bench(engine): decompose parallel dispatch overhead at 100K items (#1551) #4206

Merged
oferchen merged 2 commits into master from bench/parallel-dispatch-overhead-1551 on May 17, 2026

Conversation

@oferchen
Owner

Summary

Adds a Criterion bench that decomposes the concurrent delta pipeline's dispatch-time cost into its three constituent components at the 100K work-item scale. Each group is stripped of the other two so reviewers can see which one dominates and steer the next round of optimisation accordingly.

  • thread_spawn_only - pure std::thread::spawn / join lifecycle, swept over {1, 4, 8, 16}.
  • channel_only - bounded work_queue::bounded (crossbeam_channel) send/recv, no workers, no reorder buffer.
  • reorderbuffer_only - ReorderBuffer insert + drain on a deterministically shuffled sequence of 100K items.
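
For readers unfamiliar with the reorder step, the work measured by reorderbuffer_only can be sketched roughly as below. This is a std-only illustration, not the engine's ReorderBuffer: `ReorderSketch`, `shuffled_sequence`, and `reorder_all` are hypothetical names, the `BTreeMap` backing store is an assumption, and the fixed-seed LCG shuffle stands in for whatever deterministic shuffle the bench actually uses.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the engine's ReorderBuffer: accepts
/// (sequence, value) pairs in any order and releases values strictly
/// in sequence order.
struct ReorderSketch<T> {
    next: u64,
    pending: BTreeMap<u64, T>,
}

impl<T> ReorderSketch<T> {
    fn new() -> Self {
        Self { next: 0, pending: BTreeMap::new() }
    }

    /// Parks a value until every lower sequence number has been released.
    fn insert(&mut self, seq: u64, value: T) {
        self.pending.insert(seq, value);
    }

    /// Releases every value that is now contiguous with the last one released.
    fn drain_ready(&mut self, out: &mut Vec<T>) {
        while let Some(value) = self.pending.remove(&self.next) {
            out.push(value);
            self.next += 1;
        }
    }
}

/// Deterministic Fisher-Yates shuffle of 0..n driven by a fixed-seed LCG
/// (an assumption for this sketch; the same seed always gives the same order).
fn shuffled_sequence(n: u64, seed: u64) -> Vec<u64> {
    let mut items: Vec<u64> = (0..n).collect();
    let mut state = seed;
    for i in (1..items.len()).rev() {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let j = ((state >> 33) % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
    items
}

/// Inserts every sequence number in arrival order and drains after each insert.
fn reorder_all(input: &[u64]) -> Vec<u64> {
    let mut buf = ReorderSketch::new();
    let mut out = Vec::with_capacity(input.len());
    for &seq in input {
        buf.insert(seq, seq);
        buf.drain_ready(&mut out);
    }
    out
}

fn main() {
    let input = shuffled_sequence(100_000, 42);
    let out = reorder_all(&input);
    assert!(out.iter().copied().eq(0..100_000));
    println!("drained {} items in order", out.len());
}
```

In a bench shaped like this, the shuffled input would be built in the untimed setup and only `reorder_all` would be timed.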

Inputs are pre-allocated outside the timed section via iter_batched. Every group reports throughput as Throughput::Elements(100_000) so ops/sec figures are directly comparable across components.
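
The measurement discipline described above (setup excluded from the timed region, throughput normalised to element count) can be illustrated without Criterion. The sketch below is a hand-rolled, std-only approximation of the idea behind iter_batched, not Criterion's implementation; `time_batched` and `elements_per_sec` are hypothetical helper names introduced only for this example.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Runs `routine` `iters` times. A fresh input is built by `setup` before
/// each run, and only the routine itself is timed.
fn time_batched<I, O>(
    iters: u32,
    mut setup: impl FnMut() -> I,
    mut routine: impl FnMut(I) -> O,
) -> Duration {
    let mut total = Duration::ZERO;
    for _ in 0..iters {
        let input = setup(); // untimed: allocation happens here
        let start = Instant::now();
        let output = routine(input); // timed: only the work under test
        total += start.elapsed();
        black_box(output); // keep the result observable, outside the timed region
    }
    total
}

/// Converts a total duration into elements processed per second, so groups
/// that each handle the same element count can be compared directly.
fn elements_per_sec(elements: u64, iters: u32, total: Duration) -> f64 {
    (elements as f64 * iters as f64) / total.as_secs_f64()
}

fn main() {
    const ITEMS: u64 = 100_000;
    const ITERS: u32 = 10;
    let total = time_batched(
        ITERS,
        || (0..ITEMS).collect::<Vec<u64>>(),
        |v: Vec<u64>| v.iter().sum::<u64>(),
    );
    println!("{:.0} elements/sec", elements_per_sec(ITEMS, ITERS, total));
}
```

Because every group divides by the same 100_000 elements, a slower component shows up directly as a lower elements/sec figure rather than needing per-group normalisation.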

The top-of-file module doc cross-references the related benches (#4180 reorder cache, #4203 sync-channel, #4204 reorder memory, #1885 in-bench metrics) and names the optimisation tracks the results will steer (#1681 lock-free MPSC, #1370 per-thread pools, #1271 buffer slab).

Refs #1551.

Test plan

  • cargo bench -p engine --bench parallel_dispatch_overhead runs all three groups under the ~3 second per-iteration budget on a workstation-class machine
  • CI fmt + clippy stays clean (no new warnings from the new bench)
  • CI nextest stays green (bench-only change, no runtime code modified)

oferchen added 2 commits May 17, 2026 11:17
Add a Criterion bench that isolates each of the three dispatch-time
costs the concurrent delta pipeline pays at 100K-file scale:

- thread_spawn_only: pure OS thread lifecycle, swept over {1, 4, 8, 16}
- channel_only: bounded crossbeam work_queue send/recv with no workers
- reorderbuffer_only: ReorderBuffer insert + drain on a shuffled sequence

Each group pre-allocates inputs outside the timed section and reports
throughput via Throughput::Elements(100_000) so cells produce comparable
ops/sec figures. The top-of-file module doc cross-references the
existing reorder cache, sync-channel, and memory benches and names the
optimisation tracks (lock-free MPSC, per-thread pools, buffer slab) the
results will steer.

Refs #1551
@oferchen oferchen merged commit 77b3ef6 into master May 17, 2026
40 checks passed
@oferchen oferchen deleted the bench/parallel-dispatch-overhead-1551 branch May 17, 2026 09:20
oferchen added a commit that referenced this pull request May 17, 2026
…#4212)

Document the four CAPACITY_MULTIPLIER sites and the two duplicate
hard-coded `2`s in delta_pipeline.rs, justify each against the recent
dispatch benches (#4203 channel overhead, #4204 reorder memory, #4206
dispatch decomposition, #4209 sp vs mp), and recommend keeping the
default at 2 with one follow-up bench specified to challenge it.
oferchen added a commit that referenced this pull request May 17, 2026
…1572) (#4209)

Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion
groups that both move 100K items through the concurrent delta work
queue:

- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork
  items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items
  via the gated `Clone` impl on WorkQueueSender. Compiled only when
  `--features multi-producer` is set.
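
The shape of the two groups can be sketched with the standard library alone. This is an illustration under stated assumptions, not the bench itself: `std::sync::mpsc::sync_channel` stands in for the engine's work queue, plain `usize` values stand in for DeltaWork items, `push_through` is a hypothetical helper, and the channel capacity of 64 is arbitrary. Note that `SyncSender` is always `Clone`, whereas the commit describes `WorkQueueSender` as gating `Clone` behind a feature.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Sends `total` items through a bounded channel using `producers` threads,
/// each pushing an equal share, and returns how many items the consumer saw.
/// Assumes `total` is divisible by `producers`.
fn push_through(total: usize, producers: usize, capacity: usize) -> usize {
    let (tx, rx) = sync_channel::<usize>(capacity);
    let share = total / producers;

    let handles: Vec<_> = (0..producers)
        .map(|p| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 0..share {
                    tx.send(p * share + i).expect("receiver alive");
                }
            })
        })
        .collect();

    // Drop the original sender so the channel closes once every producer exits.
    drop(tx);

    // Consume on the calling thread until all senders are gone.
    let received = rx.iter().count();

    for handle in handles {
        handle.join().expect("producer panicked");
    }
    received
}

fn main() {
    // Single producer: one thread pushes all 100K items.
    let sp = push_through(100_000, 1, 64);
    // Multi producer: four threads push 25K items each.
    let mp = push_through(100_000, 4, 64);
    println!("sp received {sp}, mp received {mp}");
}
```

Both calls move the same 100K items, which is why a shared element count makes the two configurations comparable.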

Both groups report `Throughput::Elements(100_000)` so items/sec figures
compare directly regardless of how the work is split. Inputs are
pre-allocated outside the timed section via `iter_batched`, matching
the discipline used by the parallel_dispatch_overhead bench.

The MP group is feature-gated behind the engine crate's existing
`multi-producer` feature - no Cargo.toml dependency change is required;
the gate already exists at `crates/engine/Cargo.toml:91`. The top-of-
file documentation cross-references the audit at
`docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173),
#4203 sync_channel bench, and #4206 parallel_dispatch_overhead bench,
and spells out the decision criteria (>=15% SP-vs-MP delta) that this
bench informs.
oferchen added a commit that referenced this pull request May 17, 2026
Captures the design for parallelizing the receiver's per-file delta
apply loop while preserving per-file token order and wire-format
parity. Documents the current sequential surface, the dormant
ParallelDeltaPipeline that would host the change, the backpressure
model, and the gating prerequisites - chiefly the parity-test gap
flagged by audit #4205 (G2) - that block default adoption.

Recommends a phased opt-in rollout: land the sequential-vs-parallel
parity test first, then add a hidden CLI gate, then collect #4214 /
#4206 bench evidence, and only then consider flipping the default.
oferchen added a commit that referenced this pull request May 18, 2026
) (#4206)

* bench(engine): decompose parallel dispatch overhead at 100K items

Add a Criterion bench that isolates each of the three dispatch-time
costs the concurrent delta pipeline pays at 100K-file scale:

- thread_spawn_only: pure OS thread lifecycle, swept over {1, 4, 8, 16}
- channel_only: bounded crossbeam work_queue send/recv with no workers
- reorderbuffer_only: ReorderBuffer insert + drain on a shuffled sequence

Each group pre-allocates inputs outside the timed section and reports
throughput via Throughput::Elements(100_000) so cells produce comparable
ops/sec figures. The top-of-file module doc cross-references the
existing reorder cache, sync-channel, and memory benches and names the
optimisation tracks (lock-free MPSC, per-thread pools, buffer slab) the
results will steer.

Refs #1551

* style(engine): avoid clippy doc list false positive in bench
oferchen added a commit that referenced this pull request May 18, 2026
…#4212)

Document the four CAPACITY_MULTIPLIER sites and the two duplicate
hard-coded `2`s in delta_pipeline.rs, justify each against the recent
dispatch benches (#4203 channel overhead, #4204 reorder memory, #4206
dispatch decomposition, #4209 sp vs mp), and recommend keeping the
default at 2 with one follow-up bench specified to challenge it.
oferchen added a commit that referenced this pull request May 18, 2026
…1572) (#4209)

Adds `crates/engine/benches/sp_vs_mp_workqueue.rs` with two Criterion
groups that both move 100K items through the concurrent delta work
queue:

- `sp/1p/100k`: one producer thread pushes 100K pre-allocated DeltaWork
  items via the default-build `Send + !Clone` WorkQueueSender path.
- `mp/4p/100k`: four producer threads each push 25K pre-allocated items
  via the gated `Clone` impl on WorkQueueSender. Compiled only when
  `--features multi-producer` is set.

Both groups report `Throughput::Elements(100_000)` so items/sec figures
compare directly regardless of how the work is split. Inputs are
pre-allocated outside the timed section via `iter_batched`, matching
the discipline used by the parallel_dispatch_overhead bench.

The MP group is feature-gated behind the engine crate's existing
`multi-producer` feature - no Cargo.toml dependency change is required;
the gate already exists at `crates/engine/Cargo.toml:91`. The top-of-
file documentation cross-references the audit at
`docs/audits/workqueue-sender-multi-producer-audit.md` (PR #4173),
#4203 sync_channel bench, and #4206 parallel_dispatch_overhead bench,
and spells out the decision criteria (>=15% SP-vs-MP delta) that this
bench informs.
oferchen added a commit that referenced this pull request May 18, 2026
Captures the design for parallelizing the receiver's per-file delta
apply loop while preserving per-file token order and wire-format
parity. Documents the current sequential surface, the dormant
ParallelDeltaPipeline that would host the change, the backpressure
model, and the gating prerequisites - chiefly the parity-test gap
flagged by audit #4205 (G2) - that block default adoption.

Recommends a phased opt-in rollout: land the sequential-vs-parallel
parity test first, then add a hidden CLI gate, then collect #4214 /
#4206 bench evidence, and only then consider flipping the default.