Skip to content

feat(engine): SpillableReorderBuffer wiring + parallel receive delta apply (#1884, #1368)#4319

Merged
oferchen merged 1 commit into
masterfrom
feat/spill-and-parallel-recv-delta-redo
May 17, 2026
Merged

feat(engine): SpillableReorderBuffer wiring + parallel receive delta apply (#1884, #1368)#4319
oferchen merged 1 commit into
masterfrom
feat/spill-and-parallel-recv-delta-redo

Conversation

@oferchen
Copy link
Copy Markdown
Owner

Summary

  • Wires the existing SpillableReorderBuffer into DeltaConsumer behind a new opt-in ConcurrentDeltaConfig { spill_threshold_bytes, spill_dir } and exposes spill activity via DeltaConsumer::stats() -> DeltaConsumerStats { spill_events }. Default path (spawn, spawn_bypass, spawn_with_config(.., ConcurrentDeltaConfig::default())) is bit-for-bit identical to the prior behaviour; the existing histogram/metrics machinery on the bare path is preserved verbatim.
  • Lands the parallel receive-side delta apply scaffold behind a new parallel-receive-delta cargo feature on the engine crate (forwarded from transfer). engine::concurrent_delta::parallel_apply introduces DeltaChunk and ParallelDeltaApplier with per-file Mutex write serialization and per-file ReorderBuffer chunk replay. ReceiverContext::enable_parallel_receive_delta installs the existing ParallelDeltaPipeline only when the feature is compiled in.
  • crates/engine/src/concurrent_delta/mod.rs re-exports the union of ConcurrentDeltaConfig, DeltaConsumer, DeltaConsumerStats, HistogramStats, ReorderMetrics, ReorderBuffer, and (feature-gated) DeltaChunk, ParallelDeltaApplier.

Replaces the conflict-stalled PRs #4299 and #4300 with a single combined change on top of current master. The wire protocol is unchanged - the spill layer is strictly receiver-local memory management; the parallel apply path is opt-in for benchmarking the rollout in docs/design/parallel-receive-delta-application.md.

Test plan

  • cargo check -p engine --all-features --tests clean
  • cargo check -p transfer --features parallel-receive-delta --tests clean
  • cargo fmt --all -- --check clean
  • Unit: spillable_consumer_preserves_order_under_pressure - 1000 items through a 1 KiB budget, deliberately delayed head-of-line item, verifies in-order delivery and spill_events > 0
  • Unit: spillable_consumer_matches_bare_output_byte_for_byte - compares spill vs bare paths via SpillCodec encoding
  • Unit: spawn_with_config_off_matches_spawn, stats_zero_when_spill_disabled pin the default-off invariants
  • Unit: ConcurrentDeltaConfig covers default, off, with_spill_threshold, with_spill_dir, spill_enabled
  • Unit: parallel_apply::* - in-order, shuffled, batched byte-equality across two files, missing/double registration, finish-with-pending errors
  • Property: random_chunk_sizes_and_permutations_match_sequential (48 cases, deterministic permutation per seed)
  • All 26 existing concurrent_delta::consumer::* tests pass alongside the new spill suite

…apply

Wires the existing SpillableReorderBuffer into DeltaConsumer behind a new
opt-in ConcurrentDeltaConfig, and lands the parallel receive-side delta
apply scaffold behind the parallel-receive-delta cargo feature.

SpillableReorderBuffer wiring (#1884)
- New ConcurrentDeltaConfig { spill_threshold_bytes, spill_dir } selects
  between the bare ReorderBuffer (default, behaviour unchanged) and the
  bounded-memory SpillableReorderBuffer when a threshold is supplied.
- DeltaConsumer::spawn_with_config dispatches via a ReorderMode enum so
  spawn / spawn_bypass / spawn_with_config share one inner loop entry.
- DeltaConsumerStats surfaces the cumulative spill_events counter via a
  lock-free AtomicU64 published by the reorder thread.
- Spill backend construction or I/O failures map to DeltaResult::failed
  for the offending sequence so the receiver maps to upstream exit code
  11 (FileIo) and aborts. Existing histogram/metrics machinery on the
  bare path is preserved verbatim.

Parallel receive delta apply (#1368)
- New parallel-receive-delta feature on engine (forwarded from transfer).
  Default off so production receivers continue to drive the sequential
  apply loop in receiver/transfer.rs.
- engine::concurrent_delta::parallel_apply adds DeltaChunk and
  ParallelDeltaApplier. Per-file Mutex serialises destination writes,
  per-file ReorderBuffer replays chunks in submission order, and
  rayon::join / par_iter fans the verify step across the rayon pool
  while keeping per-file byte order exact.
- ReceiverContext::enable_parallel_receive_delta installs the existing
  ParallelDeltaPipeline only when the feature is compiled in, leaving
  the default receiver loop untouched.

Re-exports the union of ConcurrentDeltaConfig, DeltaConsumerStats, and
(feature-gated) DeltaChunk / ParallelDeltaApplier from
crates/engine/src/concurrent_delta/mod.rs alongside the existing
HistogramStats, ReorderMetrics, and ReorderBuffer surface.

Tests
- spillable_consumer_preserves_order_under_pressure drives 1000 items
  through a 1 KiB budget with a deliberately delayed head-of-line item
  and asserts both in-order delivery and spill_events > 0.
- spillable_consumer_matches_bare_output_byte_for_byte compares spill
  vs bare paths via SpillCodec encoding.
- spawn_with_config_off_matches_spawn and stats_zero_when_spill_disabled
  pin the default-off invariants.
- parallel_apply: in-order, shuffled, and batched byte-equality tests
  plus a proptest over random chunk sizes / deterministic permutations.

Replaces the conflict-stalled PRs #4299 and #4300 with a single
combined change on top of current master.

Closes #1884
Closes #1368
@oferchen oferchen merged commit 82fd842 into master May 17, 2026
4 of 15 checks passed
oferchen added a commit that referenced this pull request May 17, 2026
Add criterion bench and design doc to inform the default-on decision for
the parallel-receive-delta feature (#1368 followup, PR #4319 scaffold).

The bench drives ParallelDeltaApplier vs a sequential baseline across
three workload classes: small_files (10000 x 4 KiB, 50/50 delta/whole),
mixed (1000 files, 4 KiB - 4 MiB, 50/50), and large_files (10 x 256 MiB,
all delta). Both cells write to in-memory sinks so the comparison
isolates apply-loop scheduling cost from disk I/O.

The doc lays out five promotion criteria (small_files >= 10% wall-clock
win at 4+ threads, zero wire-format divergence, no single-workload
regression > 5%, one release cycle of opt-in soak, two consecutive
nightly green runs) and three promotion paths (default-features flip,
runtime auto-detect on file_count/total_size, per-workload CLI flag),
with Path B as the recommended default unless the bench shows parallel
wins universally.
@github-actions github-actions Bot added the enhancement New feature or request label May 17, 2026
oferchen added a commit that referenced this pull request May 18, 2026
…apply (#4319)

Wires the existing SpillableReorderBuffer into DeltaConsumer behind a new
opt-in ConcurrentDeltaConfig, and lands the parallel receive-side delta
apply scaffold behind the parallel-receive-delta cargo feature.

SpillableReorderBuffer wiring (#1884)
- New ConcurrentDeltaConfig { spill_threshold_bytes, spill_dir } selects
  between the bare ReorderBuffer (default, behaviour unchanged) and the
  bounded-memory SpillableReorderBuffer when a threshold is supplied.
- DeltaConsumer::spawn_with_config dispatches via a ReorderMode enum so
  spawn / spawn_bypass / spawn_with_config share one inner loop entry.
- DeltaConsumerStats surfaces the cumulative spill_events counter via a
  lock-free AtomicU64 published by the reorder thread.
- Spill backend construction or I/O failures map to DeltaResult::failed
  for the offending sequence so the receiver maps to upstream exit code
  11 (FileIo) and aborts. Existing histogram/metrics machinery on the
  bare path is preserved verbatim.

Parallel receive delta apply (#1368)
- New parallel-receive-delta feature on engine (forwarded from transfer).
  Default off so production receivers continue to drive the sequential
  apply loop in receiver/transfer.rs.
- engine::concurrent_delta::parallel_apply adds DeltaChunk and
  ParallelDeltaApplier. Per-file Mutex serialises destination writes,
  per-file ReorderBuffer replays chunks in submission order, and
  rayon::join / par_iter fans the verify step across the rayon pool
  while keeping per-file byte order exact.
- ReceiverContext::enable_parallel_receive_delta installs the existing
  ParallelDeltaPipeline only when the feature is compiled in, leaving
  the default receiver loop untouched.

Re-exports the union of ConcurrentDeltaConfig, DeltaConsumerStats, and
(feature-gated) DeltaChunk / ParallelDeltaApplier from
crates/engine/src/concurrent_delta/mod.rs alongside the existing
HistogramStats, ReorderMetrics, and ReorderBuffer surface.

Tests
- spillable_consumer_preserves_order_under_pressure drives 1000 items
  through a 1 KiB budget with a deliberately delayed head-of-line item
  and asserts both in-order delivery and spill_events > 0.
- spillable_consumer_matches_bare_output_byte_for_byte compares spill
  vs bare paths via SpillCodec encoding.
- spawn_with_config_off_matches_spawn and stats_zero_when_spill_disabled
  pin the default-off invariants.
- parallel_apply: in-order, shuffled, and batched byte-equality tests
  plus a proptest over random chunk sizes / deterministic permutations.

Replaces the conflict-stalled PRs #4299 and #4300 with a single
combined change on top of current master.

Closes #1884
Closes #1368
oferchen added a commit that referenced this pull request May 18, 2026
Add criterion bench and design doc to inform the default-on decision for
the parallel-receive-delta feature (#1368 followup, PR #4319 scaffold).

The bench drives ParallelDeltaApplier vs a sequential baseline across
three workload classes: small_files (10000 x 4 KiB, 50/50 delta/whole),
mixed (1000 files, 4 KiB - 4 MiB, 50/50), and large_files (10 x 256 MiB,
all delta). Both cells write to in-memory sinks so the comparison
isolates apply-loop scheduling cost from disk I/O.

The doc lays out five promotion criteria (small_files >= 10% wall-clock
win at 4+ threads, zero wire-format divergence, no single-workload
regression > 5%, one release cycle of opt-in soak, two consecutive
nightly green runs) and three promotion paths (default-features flip,
runtime auto-detect on file_count/total_size, per-workload CLI flag),
with Path B as the recommended default unless the bench shows parallel
wins universally.
oferchen added a commit that referenced this pull request May 18, 2026
…apply (#4319)

Wires the existing SpillableReorderBuffer into DeltaConsumer behind a new
opt-in ConcurrentDeltaConfig, and lands the parallel receive-side delta
apply scaffold behind the parallel-receive-delta cargo feature.

SpillableReorderBuffer wiring (#1884)
- New ConcurrentDeltaConfig { spill_threshold_bytes, spill_dir } selects
  between the bare ReorderBuffer (default, behaviour unchanged) and the
  bounded-memory SpillableReorderBuffer when a threshold is supplied.
- DeltaConsumer::spawn_with_config dispatches via a ReorderMode enum so
  spawn / spawn_bypass / spawn_with_config share one inner loop entry.
- DeltaConsumerStats surfaces the cumulative spill_events counter via a
  lock-free AtomicU64 published by the reorder thread.
- Spill backend construction or I/O failures map to DeltaResult::failed
  for the offending sequence so the receiver maps to upstream exit code
  11 (FileIo) and aborts. Existing histogram/metrics machinery on the
  bare path is preserved verbatim.

Parallel receive delta apply (#1368)
- New parallel-receive-delta feature on engine (forwarded from transfer).
  Default off so production receivers continue to drive the sequential
  apply loop in receiver/transfer.rs.
- engine::concurrent_delta::parallel_apply adds DeltaChunk and
  ParallelDeltaApplier. Per-file Mutex serialises destination writes,
  per-file ReorderBuffer replays chunks in submission order, and
  rayon::join / par_iter fans the verify step across the rayon pool
  while keeping per-file byte order exact.
- ReceiverContext::enable_parallel_receive_delta installs the existing
  ParallelDeltaPipeline only when the feature is compiled in, leaving
  the default receiver loop untouched.

Re-exports the union of ConcurrentDeltaConfig, DeltaConsumerStats, and
(feature-gated) DeltaChunk / ParallelDeltaApplier from
crates/engine/src/concurrent_delta/mod.rs alongside the existing
HistogramStats, ReorderMetrics, and ReorderBuffer surface.

Tests
- spillable_consumer_preserves_order_under_pressure drives 1000 items
  through a 1 KiB budget with a deliberately delayed head-of-line item
  and asserts both in-order delivery and spill_events > 0.
- spillable_consumer_matches_bare_output_byte_for_byte compares spill
  vs bare paths via SpillCodec encoding.
- spawn_with_config_off_matches_spawn and stats_zero_when_spill_disabled
  pin the default-off invariants.
- parallel_apply: in-order, shuffled, and batched byte-equality tests
  plus a proptest over random chunk sizes / deterministic permutations.

Replaces the conflict-stalled PRs #4299 and #4300 with a single
combined change on top of current master.

Closes #1884
Closes #1368
oferchen added a commit that referenced this pull request May 18, 2026
Add criterion bench and design doc to inform the default-on decision for
the parallel-receive-delta feature (#1368 followup, PR #4319 scaffold).

The bench drives ParallelDeltaApplier vs a sequential baseline across
three workload classes: small_files (10000 x 4 KiB, 50/50 delta/whole),
mixed (1000 files, 4 KiB - 4 MiB, 50/50), and large_files (10 x 256 MiB,
all delta). Both cells write to in-memory sinks so the comparison
isolates apply-loop scheduling cost from disk I/O.

The doc lays out five promotion criteria (small_files >= 10% wall-clock
win at 4+ threads, zero wire-format divergence, no single-workload
regression > 5%, one release cycle of opt-in soak, two consecutive
nightly green runs) and three promotion paths (default-features flip,
runtime auto-detect on file_count/total_size, per-workload CLI flag),
with Path B as the recommended default unless the bench shows parallel
wins universally.
@oferchen oferchen deleted the feat/spill-and-parallel-recv-delta-redo branch May 19, 2026 19:29
oferchen added a commit that referenced this pull request May 21, 2026
…euristic (PIP-3 + PIP-5) (#4666)

* perf(transfer): enable parallel receive-delta by default via Path B heuristic

Wires the receiver-side parallel delta apply path into production per
`docs/design/parallel-receive-delta-default-on.md` Path B, combining
steps 4 and 5 in a single change.

- Add `PARALLEL_RECEIVE_FILE_COUNT_THRESHOLD = 100` and
  `PARALLEL_RECEIVE_BYTES_THRESHOLD = 64 MiB` on the receiver. Thresholds
  match the existing rayon parallel-stat cutoff convention and the
  `copy_file_range` 64 MiB crossover.
- Add `ReceiverContext::select_receiver_strategy(file_count, total_size)`
  (pure heuristic), `total_source_bytes()` (sums the in-memory file
  list), and `dispatch_receiver_strategy()` (logs the decision under the
  `GENR` debug channel and swaps the delta pipeline when the
  `parallel-receive-delta` feature is on).
- Call `dispatch_receiver_strategy` at the top of `run_sync`,
  `run_pipelined`, and `run_pipelined_incremental`, immediately after
  `setup_transfer` returns the file count and the file list is in
  memory.
- Surface the choice as `TransferStats::receiver_strategy_chosen` via
  the new `ReceiverStrategy { Sequential, Parallel }` enum
  (`Sequential` by default).
- Promote `parallel-receive-delta` into the `default = [...]` set on
  `engine`, `transfer`, `core`, `cli`, and the workspace binary so the
  shipped `oc-rsync` picks up the dispatch with no opt-in flag. Stay
  compatible with `--no-default-features` builds: when the feature is
  compiled out the dispatcher logs `receiver_strategy=parallel_unavailable`
  and falls back to sequential so the telemetry counter never lies
  about the path actually taken.

Tests:
- Unit coverage for the heuristic boundary matrix (below/above each
  threshold, exact-boundary, empty transfer).
- File-list-integration coverage for `total_source_bytes` and the full
  `dispatch_receiver_strategy` flow with populated file lists.

Wire-format parity stays guarded by the existing proptests (#4300 +
#4319). The soak + bench gates (criteria 1, 3, 4, 5 in section 5) are
explicit risk acceptance; PIP-4 (interop suite re-run) and PIP-6 (bench
backfill) remain as follow-ups.

Refs #1368, #2566, #2568.

* style(transfer): apply rustfmt to PIP-3+5 receiver-strategy code

CI fmt+clippy diff: collapsed select_receiver_strategy multi-line
signature and the FileEntry::new_file argument list in
dispatch_large_bytes_picks_parallel onto single lines.

Behaviour-neutral.

* fix(transfer): drop identity-op multiplier in receiver-strategy tests

clippy::identity-op fired on `1 * 1024 * 1024`. Collapsed to
`1024 * 1024` in the two boundary-matrix tests.
oferchen added a commit that referenced this pull request May 21, 2026
…PIP-3+5 wire-up (#4677)

Combined closure note covering three trackers whose deliverables have
already shipped or were absorbed into existing design surface:

- PIP-2 (#2565): the design doc was supposed to describe the
  token_loop -> ParallelDeltaApplier migration. That document already
  exists as docs/design/parallel-receive-delta-default-on.md
  (shipped in PR #4319 scaffold, extended by PR #4666 default-on
  flip). Marked completed retroactively.
- FFB-3 (#2576): conditional task that only activated if FFB-1 chose
  Option A or B. FFB-1 (PR #4659) chose Option D - barrier baked into
  finish_file - so every existing caller already gets the barrier
  without callsite migration. Marked deferred N-A.
- FFB-4 (#2577): explicit flush_workers call at the PIP-3 wire-up
  site. PIP-3 + PIP-5 (PR #4666 merged) routes through finish_file,
  which already barriers via Option D. No explicit call needed.
  Marked satisfied-by-design.

Note: FFB-2 implementation (PR #4665, in CI) needed a spin-then-yield
fix around finish_file's try_unwrap to close a Windows release-race;
future work to restructure DecrementGuard would be FFB-5+ if filed.

No source changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant