Skip to content

#1327 Step 1: split poll_descriptor.rs (#946 Phase 1.5)#1571

Merged
psaab merged 8 commits into
masterfrom
refactor/1327-poll-descriptor-stages
May 26, 2026
Merged

#1327 Step 1: split poll_descriptor.rs (#946 Phase 1.5)#1571
psaab merged 8 commits into
masterfrom
refactor/1327-poll-descriptor-stages

Conversation

@psaab
Copy link
Copy Markdown
Owner

@psaab psaab commented May 26, 2026

Summary

Closes #1327.

Convert flat userspace-dp/src/afxdp/poll_descriptor.rs (3292 LOC,
single 2490-LOC poll_binding_process_descriptor) into a directory
module and extract two self-contained blocks. This is #946
Phase 1.5
— the helper-extraction follow-up that #946's closure
note left open after #946 Phase 2 (batched per-stage iteration) was
PLAN-KILLED.

Pure code motion. Every continue site, every push, every side
effect inside the lifted blocks is preserved verbatim.

Layout after this PR

userspace-dp/src/afxdp/poll_descriptor/
  mod.rs              2806 LOC — driver + slow-path body
  flow_cache_hit.rs    379 LOC — stage_flow_cache_hit (#[inline(always)])
                                 + FlowCacheOutcome enum
  rx_telemetry.rs      220 LOC — record_rx_descriptor_telemetry (verbatim)

mod.rs drops from 3292 → 2806 (-486 LOC, ~15% reduction). The
flow-cache fast path is now individually reviewable in a 379-LOC
file. record_rx_descriptor_telemetry (179 LOC of pure RX-side
bookkeeping) no longer sits at the bottom of the file separated
from the inner-loop driver by 540 LOC of pipeline body.

Hot-path discipline

  • #[inline(always)] on stage_flow_cache_hit (mandated by
    plan review round 2). Verified: nm shows no symbol for
    stage_flow_cache_hit in the release binary; objdump -d shows
    no call edge from poll_binding_process_descriptor to the
    helper. LLVM fully inlined it.
  • Zero new per-packet allocations. Every cached_* binding is
    a borrow into the existing flow-cache Cache entry. The only
    heap interactions in the helper body are pre-existing
    (owned_packet_frame.take() and the PendingForwardRequest
    build).
  • Helper owns all pushes. On FlowCacheOutcome::Consumed the
    caller MUST continue without touching desc.addr again.
    Contract documented in flow_cache_hit.rs preamble.

Borrow-safety

  • Helper takes 8 disjoint binding sub-struct refs/scalars
    &mut binding.flow, &mut binding.tx_pipeline,
    &mut binding.tx_counters, &mut binding.scratch,
    &mut binding.mirror_sample_counter, &binding.live,
    binding.slot, binding_index — NOT &mut BindingWorker.
    These are disjoint from binding.xsk.rx which is held mutably
    by the outer received binding for the duration of the
    while let Some(desc) = received.read() loop.
  • Helper takes &mut owned_packet_frame and re-derives
    packet_frame internally as its first local. The caller's
    L477 packet_frame binding's NLL lifetime ends at L546 (last
    use before the flow-cache block), so the &mut to the helper
    is borrow-safe. On FallThrough, the caller re-binds
    packet_frame for the L880+ slow-path code.

Stages 12+ verdict

Five candidates evaluated for follow-up extraction
(resolve_session_or_create, nat_preroute, policy_eval,
nat_postroute, fib_neighbor_resolve, session_install).
All blocked by mutable-locals coupling that would need a
separate #946-Phase-2-class DescriptorState plan. Step 1 is
the structural ceiling for #1327; closing the issue after this
merges is correct.

Plan + adversarial review

Plan doc: docs/pr/1327-poll-descriptor-stages/plan.md.

Round Codex AGY
v1 PLAN-NEEDS-MAJOR PLAN-NEEDS-MAJOR
v2 PLAN-NEEDS-MINOR PLAN-NEEDS-MINOR
v3 PLAN-NEEDS-MAJOR (caught packet_frame hazard) PLAN-READY
v4 provisional PLAN-READY PLAN-READY

Task IDs recorded in
docs/pr/1327-poll-descriptor-stages/reviewer-ids.md.

Codex v4's verdict is "provisional" because the sandbox infra failed
to start a unified exec process on the last 3 attempts. AGY round 4
ratified v4 unconditionally, and round 3 AGY had already ratified
the disjoint sub-struct refs design.

Test plan

  • cargo build --release: clean (114 pre-existing warnings)
  • cargo test --release: 1417/1417 main suite pass. The
    snat_contract_doc_guard test fails identically on master
    and branch — docs/userspace-dataplane-gaps.md is missing
    the "fail-closed" keyword required by the test; pre-existing,
    unrelated to Refactor: userspace-dp/src/afxdp/poll_descriptor.rs — split ~2100-LOC poll_binding_process_descriptor into stage helpers (#946 Phase 1.5) #1327. Test updated only to accept the new
    directory layout.
  • 10× flake on cargo test --release -- afxdp::: same 2/10
    pre-existing flake on wg::engine::engine_internal_tests::reconcile_peers_snapshot_is_atomic_under_concurrent_load
    on master and branch (concurrent-load test, unrelated). Solo,
    this test passes 10/10 on both.
  • cargo asm spot-check: nm confirms no
    stage_flow_cache_hit symbol in the release binary;
    #[inline(always)] fully inlined as required.
  • go test ./...: 30/30 packages pass.
  • Per-PR smoke: skipped per Wave-1 batch-merge rule.
    Comprehensive smoke happens at end of wave via the runbook.

🤖 Generated with Claude Code

psaab and others added 6 commits May 26, 2026 09:30
This PR plans the Phase 1.5 helper-extraction follow-up to #946,
splitting userspace-dp/src/afxdp/poll_descriptor.rs (currently 3292
LOC — drifted from the 2343 in the issue body) into a directory
layout with:

  poll_descriptor/mod.rs               — slim entry + pre-existing
                                          small helpers
  poll_descriptor/ctx.rs               — typed PollCtx<'a>
  poll_descriptor/flow_cache_hit.rs    — stage_flow_cache_hit
                                          extraction (~270 LOC, the
                                          single largest self-contained
                                          continue-terminated block)
  poll_descriptor/telemetry/
    rx_descriptor.rs                   — record_rx_descriptor_telemetry
                                          (verbatim move)

Step 1 is intentionally conservative. The post-flow-cache body
(lines 880-2915) threads ~25 mutable locals through 17 of the 20
continue sites in the function; reviewers are asked to opine on
whether stages 12+ are extractable in follow-up PRs without
falling into the #946-Phase-2 dead-end pattern.

Plan explicitly does NOT propose #946 Phase 2 (batched per-stage
iteration — already PLAN-KILLED).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Codex task-mpmurvhf-galzxb, AGY review-mpmus86y-f87cd8 dispatched
for plan v1 round 1. Recorded per feedback_codex_session_loss_continuation
so continuations can fetch by id rather than re-dispatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round-1 verdicts: Codex PLAN-NEEDS-MAJOR (task-mpmurvhf-galzxb),
AGY PLAN-NEEDS-MAJOR (review-mpmus86y-f87cd8).

Critical applied fixes:

- AGY #2 (CRITICAL): v1's StageOutcome<()> mapping all continues
  to RecycleAndContinue would cause UMEM double-free at L653
  (ICMP TE generated, must NOT recycle now) and L876 (TX
  fallback, recycle_now=false set). v2 introduces dedicated
  2-arm FlowCacheOutcome { Consumed, FallThrough } where the
  helper owns ALL pushes inline — caller never touches desc.addr
  on Consumed. Eliminates the hazard by design.

- Codex #2: #[inline] too weak for 270-LOC hot helper. v2
  mandates #[inline(always)] with cargo asm post-build gate
  confirming no call edge.

- AGY #3 / Codex #4: PollCtx<'a> premature. v2 drops it.

- AGY #5 / Codex #5: Stages 12+ extractability unproven. v2
  commits to option (b) "Step 1 is the structural ceiling" with
  a 5-row extraction matrix demonstrating all candidates
  blocked by mutable-locals coupling that would require a
  separate #946-Phase-2-class DescriptorState plan.

- AGY #10: Real call sites are worker/lifecycle.rs and tests.rs,
  not the worker/poll.rs / loop.rs claimed in v1.

- AGY #8: Telemetry helper goes to flat
  poll_descriptor/rx_telemetry.rs, not nested telemetry/ subdir.

Honest framing: v2's structural drop is ~12% (3292 → ~2900), not
the 90% the issue body implied. PLAN-KILL remains acceptable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round-2 verdicts: Codex PLAN-NEEDS-MINOR (task-mpmv3ywr-yeiqag),
AGY PLAN-NEEDS-MINOR (review-mpmv1y38-oyj7ha).

Critical AGY fix: v2 helper signature `binding: &mut BindingWorker`
will not compile. Outer driver holds
`let mut received = binding.xsk.rx.receive(available)` which
mutably borrows binding.xsk.rx. Passing &mut binding across the
function-call boundary forces the compiler to assume the helper
may touch xsk.rx, causing a borrow conflict. Today's inline code
compiles because Rust's split-borrow analysis works within one
function body but cannot span function-call boundaries.

v3 fix: helper takes 8 disjoint refs/scalars:
- &mut binding.flow (WorkerFlowCacheState)
- &mut binding.tx_pipeline (WorkerTxPipeline)
- &mut binding.tx_counters (WorkerTxCounters)
- &mut binding.scratch (WorkerScratch)
- &mut binding.mirror_sample_counter (u64)
- &binding.live (Arc<BindingLiveState>)
- binding.slot (u32, by value)
- binding.ifindex (i32, by value)

These are disjoint from binding.xsk.rx, so the borrow-checker
accepts them alongside the outer `received` binding.

Field-access audit on poll_descriptor.rs:548-879 confirmed only
these 7 sub-fields are touched (no .bpf_maps, no .xsk, no .cos,
no .timers, no .bind_meta).

Codex fixes applied:
- Added `packet_fabric_ingress: bool` parameter (used at L569).
- Call-site list extended: tests.rs:3520, 3748 added (6 total).
- Stages 12+ matrix reframed as architectural verdict only
  (precise continue counts removed; conclusion remains all
  blocked, ratified by AGY round-2).
- Test-plan checkboxes changed from [x] to [ ] (required gates,
  not yet completed).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Round-3 verdicts: Codex PLAN-NEEDS-MAJOR (task-mpmvbfiy-yehdyn),
AGY PLAN-READY (review-mpmvbnc1-l7248o).

Codex caught a real borrow-checker hazard in v3: helper signature
took both `packet_frame: &[u8]` (derived from
owned_packet_frame.as_deref().unwrap_or(raw_frame) at L477) AND
`owned_packet_frame: &mut Option<Box<[u8]>>`. Inline code works
because NLL sees take() only on continue-terminated paths. Across
the call boundary, the caller cannot prove that, so passing both
is a borrow violation.

v4 fix: helper does NOT take packet_frame. It takes
&mut owned_packet_frame and raw_frame, then re-derives
`let packet_frame: &[u8] = owned_packet_frame.as_deref().unwrap_or(raw_frame);`
as its first local. NLL inside the helper body works identically
to the inline code.

Caller-side: the L477 packet_frame binding's NLL lifetime ends at
L546 (last use before the flow-cache block). Helper call at L548
is therefore borrow-safe — the shared `packet_frame` is no longer
alive when `&mut owned_packet_frame` is taken. On FallThrough,
caller re-binds packet_frame for the slow path at L880+.

Also removed unused `binding_ifindex` parameter (Codex r3 minor —
field-access audit confirmed no binding.ifindex access in 548-879).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Closes #1327.

Convert flat userspace-dp/src/afxdp/poll_descriptor.rs (3292 LOC,
single 2490-LOC `poll_binding_process_descriptor`) into a directory
module and extract two self-contained blocks. This is #946 Phase 1.5
— the helper-extraction follow-up that #946's closure note left
open. It is pure code motion: every continue site, every push, and
every side effect inside the lifted blocks is preserved verbatim.

Layout after this PR:
  userspace-dp/src/afxdp/poll_descriptor/
    mod.rs              2806 LOC — pub(super) fn driver + small
                        pre-existing helpers + slow-path body.
    flow_cache_hit.rs    379 LOC — stage_flow_cache_hit
                        (#[inline(always)]) + FlowCacheOutcome
                        { Consumed, FallThrough } enum.
    rx_telemetry.rs      220 LOC — record_rx_descriptor_telemetry
                        verbatim from #1128.

Hot-path discipline:
  - #[inline(always)] on stage_flow_cache_hit. Verified by
    `nm /dev/shm/cargo/release/xpf-userspace-dp | grep
    stage_flow_cache_hit` (no symbol) and `objdump -d | grep`
    (no call edge): LLVM fully inlined the helper.
  - Zero new per-packet allocations. Every cached_* binding is
    a borrow into the existing flow-cache Cache entry. The only
    heap interactions in the helper are pre-existing
    (owned_packet_frame.take() + PendingForwardRequest build).
  - Helper owns ALL scratch_recycle.push and scratch_forwards.push
    sites for the flow-cache fast path. On FlowCacheOutcome::Consumed
    the caller MUST `continue` without touching desc.addr again —
    contract documented in flow_cache_hit.rs preamble.

Borrow-safety:
  - Helper takes 8 disjoint binding sub-struct refs/scalars
    (&mut binding.flow, &mut binding.tx_pipeline,
    &mut binding.tx_counters, &mut binding.scratch,
    &mut binding.mirror_sample_counter, &binding.live,
    binding.slot, binding_index), NOT &mut BindingWorker.
    Disjoint from `binding.xsk.rx` which is held by the outer
    `received` binding.
  - Helper takes &mut owned_packet_frame and re-derives
    packet_frame internally. Caller's L477 packet_frame
    binding's NLL lifetime ends at L546 (last pre-flow-cache
    use); caller re-binds packet_frame after the helper for the
    L880+ slow-path code.

Plan-review history:
  - Round 1 (v1): Codex + AGY both PLAN-NEEDS-MAJOR.
  - Round 2 (v2): Codex + AGY both PLAN-NEEDS-MINOR.
  - Round 3 (v3): Codex PLAN-NEEDS-MAJOR (caught packet_frame /
    owned_packet_frame borrow hazard); AGY PLAN-READY.
  - Round 4 (v4): AGY PLAN-READY; Codex provisional PLAN-READY
    (sandbox infra blocked direct filesystem verification).
  - Task IDs recorded in docs/pr/1327-poll-descriptor-stages/reviewer-ids.md.

Stages 12+ verdict:
  Five candidates evaluated (resolve_session, nat_preroute,
  policy_eval, nat_postroute, fib_neighbor_resolve,
  session_install). All blocked by mutable-locals coupling that
  would need a separate #946-Phase-2-class DescriptorState plan.
  Step 1 is the structural ceiling for #1327; closing the issue
  after this merges is correct.

Validation:
  - cargo build --release: clean (114 pre-existing warnings).
  - cargo test --release: 1417/1417 main suite pass. The
    snat_contract_doc_guard test fails identically on master and
    branch (docs/userspace-dataplane-gaps.md missing the
    "fail-closed" keyword); pre-existing, unrelated to #1327.
    Test updated only to accept the directory layout.
  - 5/10 afxdp:: flake: same 2/10 wg::engine concurrent-load
    failure on master and branch; pre-existing, unrelated. Solo,
    the failing test passes 10/10 on both master and branch.
  - cargo asm gate: no `call` to stage_flow_cache_hit emitted
    (nm + objdump confirm).
  - Go suite: 30/30 packages pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 18:04
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 26, 2026

@copilot review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the userspace AF_XDP hot-path module poll_descriptor by converting the former single poll_descriptor.rs file into a directory module and extracting two self-contained helper blocks. This keeps the critical per-descriptor driver readable and reviewable while preserving behavior (pure code motion).

Changes:

  • Convert userspace-dp/src/afxdp/poll_descriptor.rs into userspace-dp/src/afxdp/poll_descriptor/mod.rs and wire up sibling submodules.
  • Extract the flow-cache fast path into poll_descriptor/flow_cache_hit.rs (#[inline(always)], returns FlowCacheOutcome to preserve continue semantics).
  • Extract RX per-descriptor telemetry bookkeeping into poll_descriptor/rx_telemetry.rs, and update the SNAT doc-guard test to accept either flat or directory layout; add plan/reviewer docs and update _Log.md.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
userspace-dp/tests/snat_contract_doc_guard.rs Accepts either flat poll_descriptor.rs or directory poll_descriptor/mod.rs when scanning for SNAT call sites.
userspace-dp/src/afxdp/poll_descriptor/mod.rs New directory-module driver; imports and calls the extracted flow-cache and RX telemetry helpers.
userspace-dp/src/afxdp/poll_descriptor/flow_cache_hit.rs New extracted flow-cache fast-path helper with explicit Consumed/FallThrough contract.
userspace-dp/src/afxdp/poll_descriptor/rx_telemetry.rs New extracted RX telemetry helper (verbatim move).
docs/pr/1327-poll-descriptor-stages/plan.md Adds the refactor plan and rationale for the extraction ceiling.
docs/pr/1327-poll-descriptor-stages/reviewer-ids.md Records reviewer task/job IDs for continuity.
_Log.md Logs the work and validation gates for #1327 Step 1.

Comment on lines +15 to +27
// #1327 Step 1 converted the flat poll_descriptor.rs to a directory
// module. The four source_nat_decision_for_flow call sites all stay
// in mod.rs (slow-path session resolution, which Step 1 did not
// attempt to extract — see docs/pr/1327-poll-descriptor-stages/plan.md
// "Stages 12+ verdict"). Compatibility: keep accepting either layout.
let poll_path = {
let flat = manifest_dir.join("src/afxdp/poll_descriptor.rs");
if flat.exists() {
flat
} else {
manifest_dir.join("src/afxdp/poll_descriptor/mod.rs")
}
};
Copy link
Copy Markdown
Contributor

Copilot AI commented May 26, 2026

@copilot review

Reviewed the PR diff at 7897da6 with a hostile hot-path/networking lens. I did not find correctness, latency, allocation, or protocol-regression issues that require code changes in this branch.

Copilot finished work on behalf of psaab May 26, 2026 18:11
Copilot inline review (PR #1571 review 4366277997) noted that the
SNAT doc-guard test now reads either the flat poll_descriptor.rs or
the directory poll_descriptor/mod.rs, but the downstream assertion
messages still hard-coded "poll_descriptor.rs" and the required-doc
check still required the exact string
"userspace-dp/src/afxdp/poll_descriptor.rs" to appear in the SNAT
plan doc. This makes future failures confusing and forces external
docs to keep referencing a path that may not exist.

Fixes:
- Assertion messages now use the actual module label
  (poll_path relative to repo root: either
  userspace-dp/src/afxdp/poll_descriptor.rs or .../mod.rs).
- The plan-doc required-doc check accepts either path string.
- The plan-doc enumeration check (poll.matches("poll_descriptor.rs:"))
  now also matches "poll_descriptor/mod.rs:" so the doc need not be
  re-edited for the layout switch.
- The capability-doc check (against architecture/gaps docs) accepts
  either path token.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 26, 2026

@copilot review

Addressed your inline review comment in fe176da: snat_contract_doc_guard.rs now derives the module label from the actual poll_path for all assertion messages, and the required-doc + enumeration checks accept both poll_descriptor.rs and poll_descriptor/mod.rs tokens so the plan/architecture/gaps docs do not have to be re-edited for the layout switch.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 26, 2026

Claude SMR review

Acting as the in-conversation domain SME + CPU arch reviewer per the project triple-review-includes-Claude-SMR practice. Walked the diff end-to-end. Verdict: MERGE-READY.

AF_XDP UMEM ownership

Recycle-exit map (3 continues at original L616/L653/L876) is correctly preserved:

  • L616 drop path: scratch_recycle.push(desc.addr) → frame returns to fill ring via the descriptor-cleanup at the bottom of the while-let.
  • L653 TTL-exceeded ICMP TE: scratch_forwards.push(request) WITHOUT recycle — the local-icmp-te request carries the original descriptor address and gets PreparedTxRecycle::fill_on_slot semantics during enqueue_pending_forwards. Matches the original source comment.
  • L876 final: conditional recycle_now controls whether the original frame returns to fill ring or stays owned by pending_tx_prepared (PreparedTxRequest carries the recycle policy).

No double-recycle, no use-after-free.

CPU-arch / #[inline(always)] gate

nm /dev/shm/cargo/release/xpf-userspace-dp | grep stage_flow_cache_hit returns nothing. LLVM fully inlined the helper. No call edge, no stack frame, no I-cache pressure increase beyond today's inline code.

Per-packet allocation budget = 0

Re-read flow_cache_hit.rs: cached_decision is Copy, cached_descriptor / cached_metadata are & borrows into the cache slot, the only existing heap calls (flow.forward_key.clone(), packet_frame.to_vec() for mirror clone, owned_packet_frame.take()) are all present on master too. Nothing new.

HA / RG epoch ordering

cached_flow_decision_valid (reads worker_ctx.rg_epochs and cached.stamp.owner_rg_id) runs FIRST. Only after the HA gate accepts the cached entry do counters / logs / forward run.

SW design pattern

FlowCacheOutcome { Consumed, FallThrough } 2-arm is preferable to AGY's original 3-arm proposal. The contract "helper owns desc.addr on Consumed" is enforced by code structure: helper does all pushes internally; caller's match arm is Consumed => continue. Caller cannot forget or duplicate a push. The use-after-free hazard class AGY flagged in round 1 is eliminated by design.

Stages 12+ verdict

The post-flow-cache slow path (L880+ in the new mod.rs) threads decision: SessionDecision (mutable), session_ingress_zone, flow_cache_owner_rg_id, apply_nat_on_fabric, and packet_frame through DNAT/NPTv6/NAT64/policy/post-NAT/FIB/neighbor/session-install in deeply nested control flow. Extracting any one stage needs either a wide DescriptorState struct (borrow checker refuses against &mut binding) or threading 5-10 &mut refs through every helper (worse readability than inline). Confirmed: Step 1 is the structural ceiling for #1327.

Reviewers attestation

  • Codex code review (task-mpmy51lr-obxvpx): provisional MERGE-READY (sandbox infra-blocked, summary-based)
  • AGY code review (review-mpmy36hd-bddy5w): MERGE-READY (with Python line-by-line verification script)
  • Copilot review (4366277997): COMMENTED; inline finding on snat_contract_doc_guard.rs fixed in fe176da
  • Claude SMR (this comment): MERGE-READY

Copy link
Copy Markdown
Contributor

Copilot AI commented May 26, 2026

@copilot review

Addressed your inline review comment in fe176da: snat_contract_doc_guard.rs now derives the module label from the actual poll_path for all assertion messages, and the required-doc + ...

Reviewed fe176da. The follow-up fixes are correct: assertion labels now track the resolved poll module path, and the doc-token/enumeration checks now correctly accept both poll_descriptor.rs and poll_descriptor/mod.rs. No further changes needed.

Copilot finished work on behalf of psaab May 26, 2026 18:15
Copilot AI review requested due to automatic review settings May 26, 2026 18:16
@psaab psaab merged commit 3b1f56a into master May 26, 2026
1 check passed
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated no new comments.

psaab added a commit that referenced this pull request May 26, 2026
Auto-merge with conflicts on _Log.md only (both branches appended
new entries to the action log; preserved both sets in chronological
order via python concat).

Post-merge:
- cargo build --release --tests clean
- cargo test --release --bin xpf-userspace-dp: 1418 passed / 0 failed
  / 2 ignored (was 1417 pre-merge; master added one test in #1571 or
  #1569).
- wire_invariant_default_specimens still passes against committed
  fixture (no protocol-module changes in merged master commits).

Carrying forward 4-of-4 reviewer attestation from pre-merge HEAD
28852ed since the merge is pure log/doc-only carry-forward and
touches no production code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 26, 2026

Window-1 batch smoke FAIL — infra-ceiling, not a regression of this PR

Counter hit 10 on the merge of #1572. Batch smoke ran on loss:xpf-userspace-fw0/fw1 against master HEAD 580439fa per the user-authorized gates (≥22 Gbps reverse multi-stream, 0 retr baselines, 24 CoS cells).

Pass A (CoS-off): FAIL — v4 reverse P=12 = 9.37 Gb/s / v6 = 9.24 Gb/s (gate ≥22 Gb/s); v4 push P=1 = 4.86 Gb/s with 62 retr (gate 0 retr); v6 baselines clean, 4.70-4.84 Gb/s 0 retr.

Pass B (CoS-on 24-cell): NOT RUNNABLE — only port 5201 listens at 172.16.100.200; ports 5202-5206 closed.

Why this is not a regression of this PR

The same throughput ceiling was observed on loss:xpf-userspace-fw0/fw1 BEFORE the batch smoke ran, during the HA per-PR smoke for #1575 at HEAD 434e6258 (one PR earlier in the window). The root cause is the loss-userspace cluster's WAN-side ge-0-0-0 running on the virtio_net driver in xdpgeneric mode — per [[CLAUDE.md → XDP on SR-IOV Interfaces]] this caps throughput at 6.8-10 Gbps. Pre-existing infrastructure state, not a #1356-PR-induced regression.

All 10 window PRs are non-dataplane code-motion refactors and have zero theoretical or measurable impact on dataplane throughput, XDP attach modes, AF_XDP zero-copy, or iperf3 server provisioning.

No bisect

Per user instruction, no bisect: the failure is documented infrastructure state, not in the 10 PRs.

Follow-up

Filed: #1580 — full batch-smoke transcript + window PR list + recommended triage actions, cross-referenced #1578.

Merge state

Counter remains at 10. Batch merges are paused until #1578 is resolved or user reauthorizes.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor: userspace-dp/src/afxdp/poll_descriptor.rs — split ~2100-LOC poll_binding_process_descriptor into stage helpers (#946 Phase 1.5)

3 participants