
Fix buildDHCPv6Modifiers: DUID added before nil opts guard #2

Closed
Copilot wants to merge 2 commits into master from copilot/suggest-code-improvements

Conversation


Copilot AI commented Feb 25, 2026

buildDHCPv6Modifiers was unconditionally calling getDUID() before the opts == nil early return, causing TestBuildDHCPv6Modifiers/nil_opts to fail — 1 modifier returned instead of 0.

Change

Move the opts == nil guard to the top of the function. The DUID client-ID modifier is only meaningful alongside actual DHCPv6 options; nil opts means no request will be made.

// Before
func (m *Manager) buildDHCPv6Modifiers(ifaceName string, opts *DHCPv6Options) []dhcpv6.Modifier {
    var mods []dhcpv6.Modifier
    if duid, err := m.getDUID(ifaceName); err == nil { // runs even when opts == nil
        mods = append(mods, dhcpv6.WithClientID(duid))
    }
    if opts == nil {
        return mods // returns [DUID modifier] instead of nil
    }
    ...
}

// After
func (m *Manager) buildDHCPv6Modifiers(ifaceName string, opts *DHCPv6Options) []dhcpv6.Modifier {
    if opts == nil {
        return nil
    }
    var mods []dhcpv6.Modifier
    if duid, err := m.getDUID(ifaceName); err == nil {
        mods = append(mods, dhcpv6.WithClientID(duid))
    }
    ...
}
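The fixed control flow is easy to pin with a test. A minimal, self-contained sketch follows; `options`, `modifier`, and `buildModifiers` are simplified stand-ins for the real `*DHCPv6Options`, `dhcpv6.Modifier`, and `(*Manager).buildDHCPv6Modifiers`, which live in the repository and depend on the insomniacslk/dhcp types:

```go
package main

import "fmt"

// Simplified stand-ins for the real types; the actual code uses
// *Manager, *DHCPv6Options, and dhcpv6.Modifier.
type options struct{ requestDNS bool }
type modifier func()

// buildModifiers mirrors the fixed control flow: the nil guard runs
// before any DUID work, so nil opts yields nil modifiers.
func buildModifiers(opts *options, duidOK bool) []modifier {
	if opts == nil {
		return nil
	}
	var mods []modifier
	if duidOK { // stands in for getDUID succeeding
		mods = append(mods, func() {})
	}
	return mods
}

func main() {
	fmt.Println(len(buildModifiers(nil, true)))        // 0: guard wins
	fmt.Println(len(buildModifiers(&options{}, true))) // 1: DUID modifier added
}
```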


The DUID modifier was being added before the opts == nil early-return,
causing buildDHCPv6Modifiers("eth0", nil) to return 1 modifier instead
of 0 and failing TestBuildDHCPv6Modifiers/nil_opts.

Move the nil guard to the top of the function so that nil opts returns
immediately with no modifiers. The DUID is only needed when there are
actual DHCPv6 options to send.

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Review code and suggest improvements" to "Fix buildDHCPv6Modifiers: DUID added before nil opts guard" on Feb 25, 2026
Copilot AI requested a review from psaab February 25, 2026 11:01
psaab added a commit that referenced this pull request Feb 25, 2026
…il opts

Four fixes sourced from GitHub Copilot PR review (#2-#5):

1. NAT64 state cleanup (PR #5): compileNAT64() returned early when
   len(ruleSets)==0, skipping SetNAT64Count(0) and DeleteStaleNAT64().
   Removing all NAT64 rules left stale prefixes in BPF maps.

2. DNAT wildcard port-0 (PR #3): skip redundant dnat_table wildcard
   lookup when meta->dst_port is already 0 — the lookup would be
   identical to the one that just failed. Applied to both v4 and v6.

3. GC scratch buffer reuse (PR #4): sweep() allocated fresh slices
   every 10s cycle. Reuse backing arrays via [:0] reset to reduce
   allocation churn under high session turnover.

4. DHCPv6 nil opts guard (PR #2): move opts==nil check before getDUID()
   so nil opts returns nil modifiers, not a DUID-only modifier list.
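Fix 3's `[:0]` reset is a standard Go allocation-avoidance pattern. A minimal sketch, with hypothetical names (the real `sweep()` and its session types are not shown here):

```go
package main

import "fmt"

// expired collects the IDs of dead sessions each GC cycle. Reusing the
// caller-owned scratch slice via scratch[:0] keeps the backing array,
// so steady-state sweeps stop allocating once the slice has grown to
// its high-water mark. Names are illustrative, not the dataplane's.
func expired(scratch []uint64, sessions map[uint64]bool) []uint64 {
	scratch = scratch[:0] // reset length, keep capacity
	for id, dead := range sessions {
		if dead {
			scratch = append(scratch, id)
		}
	}
	return scratch
}

func main() {
	var scratch []uint64
	sessions := map[uint64]bool{1: true, 2: false, 3: true}
	scratch = expired(scratch, sessions)
	fmt.Println(len(scratch)) // 2 expired sessions collected
}
```

The caller must hold the returned slice (the reset happens on the value passed in), which is why the sketch threads `scratch` through the call.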

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@psaab psaab closed this Feb 25, 2026
psaab added a commit that referenced this pull request Mar 22, 2026
Extract retry_pending_neigh() function and call it on BOTH:
1. End of RX batch processing (existing path)
2. Early return when rx.available() == 0 (NEW — the critical fix)

Without #2, the buffered SYN sits until the next packet arrives
(TCP retransmit ~1s) because the worker sleeps in poll() and only
checks pending_neigh after processing RX packets. With #2, the
retry runs on every poll wake (every 1ms in interrupt mode),
forwarding the buffered packet within ~2ms of ARP resolution.

23.3 Gbps warm throughput confirmed. Cold connect still needs
work — the retry fires but TCP handshake timing may cause issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@psaab psaab deleted the copilot/suggest-code-improvements branch April 7, 2026 02:45
psaab added a commit that referenced this pull request Apr 19, 2026
Four changes addressing the review findings:

HIGH — bounded ingest → drain_shaped_tx loop:
  Single re-ingest (previous commit) only closes the race window
  up to its call site. Peers can still push to the MPSC inbox
  during the prepared/backup drain that follows. Replace the
  single re-ingest with a bounded loop (budget 4) of
  ingest → drain_shaped_tx that widens the window until either
  all three pending buffers quiesce or the budget is exhausted.
  Budget sized to absorb a 3× burst without risking starvation
  of the outer worker loop.

HIGH — belt-and-suspenders drop filter at backup paths:
  Add drop_cos_bound_prepared_leftovers + drop_cos_bound_local_leftovers.
  Before the post-CoS backup transmit, scan pending_tx_prepared
  and pending_tx_local; any item whose cos_queue_id.is_some() is
  dropped (UMEM slot recycled via recycle_prepared_immediately
  for prepared items; Vec<u8> freed for local items) instead of
  transmitted unshaped. Makes the CoS cap a hard guarantee
  regardless of how many race windows remain open upstream.

MEDIUM — suppress owner_pps/peer_pps attribution on re-ingest:
  Items left in pending_tx_local after the first pass get
  re-classified as owner-local by the second ingest's
  owner_local_count / peer_count split; that corrupts the
  per-binding pps provenance telemetry (inflates owner_pps,
  deflates peer_pps). Split ingest into
  ingest_cos_pending_tx_with_provenance(count_pps: bool); the
  first pass counts, re-ingests do not.

LOW #1 — cacheline isolation:
  Move drain_sent_bytes_shaped_unconditional and
  post_drain_backup_bytes from BindingLiveState (which shares a
  line with multi-writer redirect_inbox_overflow_drops) onto the
  already-isolated #[repr(align(64))] OwnerProfileOwnerWrites
  block. Owner-only writes no longer ping-pong with peer overflow
  writers during overload. Also move the two new drop-filter
  counters there.

LOW #2 — document read skew:
  drain_sent_bytes_shaped_unconditional, post_drain_backup_bytes,
  and tx_bytes are written at separate sites and read at separate
  scrape instants. Doc-comment clarifies these are
  delta-over-window diagnostics, not point-in-time equalities.

Also adds BindingStatus fields for post_drain_backup_cos_drops
and post_drain_backup_cos_drop_bytes so operators can observe
the drop filter firing via the JSON snapshot.

All 703 Rust tests + Go dataplane/userspace tests pass.

Refs #760, #772, #773

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 20, 2026
Address both adversarial reviews on the D3 RSS indirection PR —
Codex HIGH #1/#2/MEDIUM and Go MEDIUM #1/#2/#3 + LOWs.

Kill switch (Codex MEDIUM, rollback):
  New `set system dataplane rss-indirection enable|disable` knob.
  Default enabled; `disable` short-circuits applyRSSIndirection so the
  kernel's default hashing stays intact. Compiled into a bool on
  UserspaceConfig, gated at both the startup and reconcile call sites.

Reconcile on config change (Codex HIGH #2):
  applyConfig now calls reapplyRSSIndirection() after cluster transport
  reconcile. Worker-count changes (`set system dataplane workers N`) and
  kill-switch toggles take effect on commit, not only first boot.
  Idempotent: matching tables skip the write.

Explicit per-interface mlx5 guard at the call site (Codex HIGH #1):
  applyRSSIndirection now driver-filters before invoking ethtool. Any
  non-mlx5 netdev logs at slog.Debug and is skipped without shelling
  out. The existing per-interface check inside applyRSSIndirectionOne
  is retained as defense in depth.

Injectable sysfs scan (Go MEDIUM #3):
  rssExecutor grows a listInterfaces() method; tests no longer leak to
  real /sys/class/net. This also unblocks Go MEDIUM #2.

Real non-mlx5 skip test (Go MEDIUM #2):
  TestApplyRSSIndirection_NonMlx_Skips now calls applyRSSIndirection
  with a virtio_net + iavf fixture and asserts zero ethtool calls.
  Adds TestApplyRSSIndirection_MixedDrivers_OnlyMlxTouched to cover
  the mixed-driver host case from HIGH #1.

Stable errors.Is for ethtool-not-found (Go MEDIUM #1):
  isExecNotFound swapped to errors.Is(err, exec.ErrNotFound). Drops
  the brittle substring match and its stale comment. Pinned by a new
  TestIsExecNotFound_DetectsSentinel.

Two-call idempotency test (Go LOW #3):
  TestApplyRSSIndirection_TwiceIsIdempotent asserts the second call
  after a matching-table first call adds exactly one probe ethtool
  invocation and zero writes.

Fixture cleanup (Go LOW #1):
  Dead `ethtoolXFailNotFound` field removed from fakeRSSExecutor.

Boot-time log demotion (Go LOW #2):
  `workers <= 0` (non-userspace deploys) downgraded to slog.Debug.
  `workers == 1` (explicit operator choice) stays at slog.Info.

Config schema + compiler:
  `rss-indirection` appears under `system dataplane` with args=1 and
  compiles the string `disable` to UserspaceConfig.RSSIndirectionDisabled.
  TestUserspaceDataplaneRSSIndirectionDisable pins the default-off,
  disable-flips-on, enable-is-no-op behaviour.

No functional regression when D3 is enabled (default): the weight
vector, idempotency logic, and ordering invariants are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 20, 2026
…ning

Address PR #796 review findings from Codex (HIGH + MEDIUM + LOW)
and Rust reviewer (3x MEDIUM).

== Codex HIGH — queue_vtime round-trip neutrality ==

`cos_queue_pop_front()` advanced `queue_vtime` by the drained
packet's bytes BEFORE TX was attempted. On TX-ring-full retry the
popped item bounced back via `cos_queue_push_front()`, but
`queue_vtime` stayed advanced — so a newly-active flow arriving
after the retry inherited an inflated vtime anchor and was pushed
behind established traffic even though zero bytes had actually
drained. This broke the byte-rate fairness contract under
TX-ring-full conditions.

Fix: per-item symmetric rewind. `cos_queue_push_front` subtracts
the pushed item's bytes from `queue_vtime` (mirroring pop_front's
add). Per-item (not a single snapshot) is the right granularity
because TX-ring-full settle paths pop N items into a scratch Vec,
submit M, and push_front the remaining N-M in LIFO order — a
single snapshot would only work for full rollback; symmetric
rewind handles partial commit too.
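The production code is Rust; a minimal Go sketch of the symmetric-rewind invariant (all names hypothetical) shows why per-item rewind makes a pop/push-front round trip vtime-neutral:

```go
package main

import "fmt"

// item is a queued packet; queue tracks virtual time in drained bytes.
type item struct{ bytes uint64 }

type queue struct {
	items []item
	vtime uint64 // advances by bytes on pop, rewound on push-front
}

// popFront charges the drained bytes to vtime before TX is attempted.
func (q *queue) popFront() item {
	it := q.items[0]
	q.items = q.items[1:]
	q.vtime += it.bytes
	return it
}

// pushFront is the TX-ring-full bounce path. The fix: subtract the
// item's bytes, mirroring popFront's add, so the round trip is
// neutral even when only some of N popped items are re-queued.
func (q *queue) pushFront(it item) {
	q.items = append([]item{it}, q.items...)
	q.vtime -= it.bytes
}

func main() {
	q := &queue{items: []item{{1500}, {64}}}
	before := q.vtime
	it := q.popFront()
	q.pushFront(it) // TX ring full: bounce back
	fmt.Println(q.vtime == before) // true: round trip is neutral
}
```

Because the rewind is per-item, popping N items and re-queuing the unsent N−M works without any saved snapshot of vtime, which is the partial-commit case a single-snapshot rollback could not handle.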

== Rust MEDIUM #1 — drained-bucket round-trip neutrality ==

Closely related to the Codex HIGH: when the popped item was the
sole packet in its bucket, the bucket drained, head/tail reset
to 0, and a push_front re-anchored to `max(0, vtime) + bytes` —
one packet past the pre-pop head. Fix: new
`CoSQueuePopSnapshot` captures pre-pop bucket + head-finish +
tail-finish on every pop; push_front on a matching bucket
restores head/tail exactly. Snapshot invalidated on any
`cos_queue_push_back` (new enqueue changes bucket state
underneath). Snapshot is per-queue, size-bounded (16 bytes +
option discriminant), no hot-path allocation.

== Codex MEDIUM — brief-idle re-entry pin ==

New pin `mqfq_brief_idle_reentry_exercises_both_max_arms` —
bucket drains, another bucket drains advancing vtime modestly,
first bucket re-enqueues. Exercises BOTH arms of the
`max(tail_finish, queue_vtime) + bytes` re-anchor formula:
first re-enqueue (tail=0 < vtime, picks vtime) and second
enqueue (tail > vtime, picks tail). Prior pins only covered
large-idle cases.

== Rust MEDIUM #2 — real-field headroom pin ==

Rewrote `mqfq_finish_time_u64_has_decades_of_headroom` from a
calculator (recomputes wrap math in Rust, never reads the real
field) into an actual pin: drives `queue_vtime` to
`u64::MAX - 10_000`, enqueues 9000-byte packets via
`cos_queue_push_back`, asserts the bucket's head/tail did not
wrap AND that the second enqueue saturates at `u64::MAX`. A
regression to u32 or non-saturating add would now fail the test.
Kept the original 40-year calculator as a supplementary sanity
check.

== Rust MEDIUM #3 — golden-vector table ==

New pin `mqfq_golden_vector_pop_order_vs_drr` — a small table of
(packet_sizes, expected_mqfq_order, reference_drr_order) rows,
asserting MQFQ pop order per row AND that the table includes at
least one row where MQFQ diverges from DRR. Documents the
tie-break rule in `cos_queue_min_finish_bucket` and locks the
MQFQ-vs-DRR divergence into the test surface for regression
protection.

== Codex round-2 LOW #2 / Rust LOW — idle-return anchor pin ==

New pin `mqfq_idle_flow_reanchors_at_frontier_not_zero` — drains
flow A for 4500 bytes, idle-returning flow B's first packet must
anchor at `queue_vtime + bytes = 5700`, not at `0 + bytes = 1200`.
Complements `mqfq_queue_vtime_advances_by_drained_bytes` and
`mqfq_bucket_drain_resets_finish_time` by asserting the
CONSEQUENCE of those invariants.

== Rust LOW — back-reference in field doc ==

Added `Read by cos_queue_min_finish_bucket` back-reference on
`flow_bucket_head_finish_bytes` so a reader starting from the
field learns the consuming function.

== Existing pin updated ==

`mqfq_push_front_is_finish_time_neutral_on_active_bucket` now
also captures `pre_pop_vtime` and asserts it equals post-restore
`queue_vtime`. Prior version missed the Codex HIGH (test was
green while the bug shipped).

== Tests ==

All 11 MQFQ-tagged tests pass. All 9 flow_fair-tagged tests
pass. Full test suite: 731 passed, 0 failed, 1 ignored.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 20, 2026
Append round-2 verification section to
`docs/785-phase3-pr-review-rust.md` covering the fixes in
`4416eb27`.

All 3 round-1 MEDIUMs CLOSED:
- MED #1 (drained-bucket round-trip) — `CoSQueuePopSnapshot` +
  `last_pop_snapshot: Option<...>` on `CoSQueueRuntime`, taken in
  `cos_queue_push_front`, invalidated in `cos_queue_push_back`.
  Allocation-free, leak-free, composes with Codex HIGH's
  symmetric vtime rewind.
- MED #2 (headroom calculator) — rewritten pin now drives
  `queue_vtime` to `u64::MAX - 10_000` and calls real
  `cos_queue_push_back` twice, asserting first enqueue anchors
  exactly at `near_wrap + 9000` and second saturates at
  `u64::MAX`.
- MED #3 (mixed-size golden vector) — 3-row table with hard-coded
  MQFQ pop order; DRR reference column documented-not-executed;
  `any_divergent` meta-assert guards against accidental
  collapse-to-DRR.

New-angle audits clean:
- No new production unsafe/unwrap/expect/eprintln/panic (two new
  `.expect()` are test-only).
- New pins assert full state (head, tail, bytes, vtime, active
  count, items.len()).
- Snapshot cost: 24-byte Option on queue struct, written on every
  pop, cleared on every push_back — no allocation, in cache line.

11 mqfq_* tests pass (`cargo test --release ... mqfq`).

Merge-ready from Rust-quality angle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 20, 2026
Incorporates two adversarial reviews into the plan:

Codex review (docs/line-rate-investigation-plan-review.md)
- HIGH #1 (counter ordering for H-FWD-1): Steps 1 and 2 swapped —
  userspace-dp counters (dbg_tx_ring_full, pending_tx_local_
  overflow_drops, outstanding_tx, flow_steer_snapshot) now captured
  BEFORE NIC counters. See "Step 1: capture userspace-dp counters".
- HIGH #2 (fairness rollback brittleness): fixed-5pp gate replaced
  with matched 5-run pre/post statistical protocol, rollback if
  mean-CoV delta > 2 × pre-stddev, retr > +100, or p99 regression.
  See "Statistical protocol".
- MEDIUM (TCP CC uncontrolled): new Step 0.3 pins tcp_congestion_
  control on client + server, verifies via ss -ti during runs.
- MEDIUM (XSK bind mode / fallback): new Step 0.6 captures
  xsk_bind_mode, zero-copy, XDP_FALLBACK per binding.
- MEDIUM (H-REV-3 observables): hypothesis restated against
  TX-ring-full / overflow / restore-helper observables.
- MEDIUM/LOW (success criteria, rollback mechanics): single success
  definition + explicit deferred-findings section.

Systems review (docs/line-rate-investigation-plan-review-systems.md)
- HIGH S-1 (IRQ ↔ worker CPU alignment): new Step 0.1 audits
  /proc/irq/*/smp_affinity_list for mlx5 queues; misalignment is a
  zero-code fix applied before any dataplane change.
- HIGH S-2 (no latency metric): p50/p99 via concurrent ping -i 0.01
  + ss -ti added to Step 3 and to the Validation section; latency
  non-regression gate added to rollback protocol.
- HIGH S-3 (NAPI/coalescence/busy-poll unaudited): new Step 0.2
  audits ethtool -c, netdev_budget, gro_flush_timeout, SO_BUSY_POLL.
  H-FWD-5 added. mpstat now breaks out %usr/%sys/%soft/%irq.
- MEDIUM S-4 (ring quadruple): Step 0.4 + renamed Step 5 audit all
  four rings (RX/TX/fill/completion), distinguishing TX-produce
  from completion-reap starvation.
- MEDIUM S-5 (L1d pressure of 24 KB MQFQ state): perf stat -e
  L1-dcache-load-misses added to Step 3 during-test sampling.
- MEDIUM S-6 (MQFQ small-ACK pathology): new H-REV-6 hypothesis +
  new Step 7 jumbo-MSS diagnostic; ACK-bucket histogram capture.
- LOW S-7 (C-states / frequency): turbostat + cpupower folded as
  Step 0.5 + Step 3 diagnostic observable.

Hard stops expanded: any mqfq_* OR flow_steer_* regression triggers
rollback. Step 0 (zero-code audit) lands inside Phase B, runs
before Step 1. Phases A-E structure preserved.

Plan grew from 296 to 522 lines (+292 / -66).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 21, 2026
Round-2 review HIGH #2 (subsumes MEDIUM schema-mismatch and
MEDIUM no-failure-handling): the prior step1-capture.sh used the
wrong control-socket schema and had no robust handling for
realistic mid-cell failures.

Schema corrections (verified against userspace-dp/src/protocol.rs):
- Request envelope key is `type`, not `request_type`
  (ControlRequest, protocol.rs:625-627)
- Response shape is `{ok, status: {...}}`; per_binding,
  cos_interfaces, and filter_term_counters all live under
  `.status.*`, not at the top level (protocol.rs:719-736, 980-992)
- Filter-term hit counter field is `packets`, not
  `matched_packets` (FirewallFilterTermCounterStatus,
  protocol.rs:919-930)

Failure-mode handling:
- ctl_status() helper retries socat once on timeout / decode
  failure / `ok != true` before giving up (round-2: "on socket
  timeout, retry once then mark cell SUSPECT").
- Pre-cell with-cos check: if .status.cos_interfaces is empty
  when with-cos is requested, auto-invoke apply-cos-config.sh
  once and wait up to 30 s for reconciliation; if either fails,
  halt_with_dump() writes a state dump under
  step1-evidence/halt-<ts>/ and exits 3 (round-2: "on CoS apply
  failure, bail with error message").
- Pre-cell no-cos check: if cos_interfaces is non-empty when
  no-cos is requested, halt with dump (orchestrator must remove
  CoS first; the script will not toggle CoS mid-run).
- Post-run failover detection: if the post-run control-socket
  call fails, recheck primary via cli `show chassis cluster
  status`. If primary is still fw0, halt (likely daemon crash);
  if primary moved, mark cell SUSPECT (failover during cell)
  rather than silently saving a half-empty cell.
- 12-sample sampler: each socat call retries once on failure;
  on second failure writes an `_error: control_socket_timeout`
  placeholder line so I6 detects it deterministically.

I6 invariant updated to count `_error` placeholders as failures
and to read .status.per_binding (the nested path). I10 now
correctly reads .status.cos_interfaces and .status.filter_term_counters
with the `packets` field.

Adds a top-of-file comment block citing the exact protocol.rs line
ranges so future maintainers can re-verify the schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 21, 2026
Extends the Step 1 statistical scripts with the two derivations Codex
round-3 demanded:

1. step1-rss-multinomial.py --skewed-worker0 <p0>: adds a power-case
   simulation where worker 0 draws p0 of the multinomial mass and the
   other three share (1-p0)/3. Reports per-cell Verdict A fire rate
   under the alternative, plus multi-cell aggregation P(>= 2 of N
   cells fire). Closes the round-3 partial on finding #3 (defense of
   max>=9 was untested on skewed RSS): at p0=0.56, per-cell fire rate
   is 0.6302 — NOT the claimed >99% — but multi-cell aggregation at
   N=8 forward cells gives 0.9949, which matches the actual plan-level
   defense (the >=2-of-N FP-discount policy).

2. step1-rate-spread-analysis.py: adds a nonparametric bootstrap
   (default 10000 trials, seed=42) for a 95% CI on Y = mean + 2*stddev.
   Closes round-3 finding #2 HIGH: the published Y=2.72 sits near the
   upper end of the [1.82, 2.88] CI with n=4, so the point estimate is
   noisy. Script output documents the re-derivation policy (re-run
   with expanded --cells when Step 1 produces new baseline cells).

No weakening: both scripts are deterministic and re-runnable.
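The committed script is Python; a stdlib Go sketch of the same nonparametric bootstrap for Y = mean + 2·stddev conveys the mechanics. The cell values below are made up for illustration, not the real n=4 baseline cells, and population stddev is assumed:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"sort"
)

// yStat is the threshold statistic Y = mean + 2*stddev.
func yStat(xs []float64) float64 {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	mean := sum / float64(len(xs))
	var ss float64
	for _, x := range xs {
		ss += (x - mean) * (x - mean)
	}
	return mean + 2*math.Sqrt(ss/float64(len(xs)))
}

// bootstrapCI resamples with replacement and returns the empirical
// 95% interval for Y. A fixed seed keeps the run deterministic and
// re-runnable, matching the script's contract.
func bootstrapCI(xs []float64, trials int, seed int64) (lo, hi float64) {
	rng := rand.New(rand.NewSource(seed))
	ys := make([]float64, trials)
	re := make([]float64, len(xs))
	for t := range ys {
		for i := range re {
			re[i] = xs[rng.Intn(len(xs))]
		}
		ys[t] = yStat(re)
	}
	sort.Float64s(ys)
	return ys[int(0.025 * float64(trials))], ys[int(0.975 * float64(trials))]
}

func main() {
	cells := []float64{1.1, 1.4, 1.9, 2.2} // illustrative, not real data
	lo, hi := bootstrapCI(cells, 10000, 42)
	fmt.Printf("Y=%.2f CI=[%.2f, %.2f]\n", yStat(cells), lo, hi)
}
```

With n=4 the interval is wide, which is exactly the point the round-3 finding makes about treating the point estimate as noisy.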

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 21, 2026
Closes the four Codex round-3 items on step1-plan.md:

- Round-3 #3 (PARTIAL): §4.2 now publishes the Monte Carlo TRUE-
  POSITIVE power under a 56 %-skew alternative (per-cell fire rate
  0.6302 at max>=9), NOT the >99% the prior defense implied. The
  honest defense is the multi-cell aggregation in §4.6: P(>=2 of 8
  cells fire | 56% skew) = 0.9949.

- Round-3 #2 HIGH: §4.7 adds a Threshold Y re-derivation policy.
  Bootstrap 95% CI for Y is [1.82, 2.88] with n=4, so Y=2.72 sits
  near the CI upper edge. When Step 1 produces new baseline cells,
  the script must be re-run with the expanded --cells list; if Y
  moves more than the CI half-width (0.53), update the plan.

- Round-3 #4 HIGH (0.77 expected false A across 12 cells): §4.6
  commits the FP-discount / multi-cell aggregation policy. Single
  Verdict A firings are treated as noise; Verdict A triggers Step 2
  only when k_A >= 2 cells fire. Same rule applies to Verdict B.
  Verdict C stays single-cell (per-cell FP < 0.01). §8 Step-2
  triggers are rewritten to use k_ counts from §4.6.

- Round-3 #3 HIGH (CoS apply no-rollback): §6 step 2 documents the
  new atomic apply-cos-config.sh contract (commit check + atomic
  commit + post-commit verify + rollback 1 on failure). §6 step 7
  (remove-with-cos) now requires the same atomic pattern. §10
  risks table adds a dedicated row for the apply-atomicity
  guarantee.

No weakening of thresholds. The FP-discount policy is the
tightening Codex asked for: single-cell A was previously load-
bearing; it is now explicitly discounted. All derivations are
reproducible via committed Python scripts (stdlib only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 21, 2026
HIGH #1 — submit-side amortization collapse on small batches. §3.1
revised: per-commit stamp (one monotonic_nanos() per writer.commit())
applied to the `inserted` descriptors only; retry-tail not stamped.
Why `now_ns` reuse is rejected (up to ~1 ms staleness across the
worker loop at worker.rs:619 + afxdp.rs:176-178).

HIGH #2 — relaxed-atomic cross-CPU visibility on ARM. §3.6.a decision:
keep Relaxed on both writer and reader, document bounded-skew
semantics (mirrors existing umem.rs:1322-1329 pattern), downgrade
invariants 6/7 and §8 hard-stop #4 to "|sum - count| / count ≤ 0.01"
rather than exact equality. Upgrade to Release/Acquire rejected on
~2 % ARM cost grounds.

HIGH #3 — sidecar false sharing across workers. §3.3 rewritten after
reading the code: each binding has its own UMEM via
`WorkerUmemPool::new` at worker.rs:445 with `shared_umem=false`,
wrapped in `Rc<WorkerUmemInner>` at umem.rs:16-18 (single-owner
thread). Cross-worker false-sharing is structurally impossible.
UMEM-headroom in-frame approach (c) rejected on blast-radius grounds.

HIGH #4 (Codex numbering) — overhead arithmetic. §3.4 rewritten with
three operating-point numbers (`inserted = 256 / 64 / 1`); honest
worst-case 45 ns/pkt = 9.4 % of per-queue 481 ns budget. Corrects
earlier 0.13 % figure that hid the partial-batch regime and divided
by workers instead of queue.

Also closes MED #5 (sentinel vs clock-0 collision), MED #6 (sidecar
size 192 KiB not 64 KiB; 3×ring_entries per bind.rs:37-44), MED #7
(wire-size growth ~8-10 KiB per status poll), MED #8 (Bonferroni
family corrected to 3 composite tests per cell × 12 cells = 36),
MED #9 (symbolic two-thread race test replaced by partial-batch,
retry-unwind, and bounded-skew tests). LOW #10 (no ambient `now_ns`
reuse), LOW #13 (named const asserts + boundary test).

Plan-ready: pending Codex round-2.
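The stamp/completion flow above can be sketched in Go (the real `bucket_index_for_ns`, `TX_SIDECAR_UNSTAMPED`, and bucket edges live in umem.rs; the edges and sentinel value here are illustrative only):

```go
package main

import "fmt"

// Sentinel meaning the sidecar slot never received a submit stamp
// (retry-unwound submit, or a failed clock read canonicalized to 0).
const unstamped = 0

var edges = []uint64{1_000, 10_000, 100_000, 1_000_000, 10_000_000} // ns

// bucketIndex maps a completion latency to its histogram bucket,
// with a final overflow bucket past the last edge.
func bucketIndex(deltaNs uint64) int {
	for i, e := range edges {
		if deltaNs < e {
			return i
		}
	}
	return len(edges)
}

// recordCompletion drops samples whose slot was never stamped, so an
// unstamped descriptor cannot pollute the latency histogram.
func recordCompletion(hist []uint64, submitNs, nowNs uint64) {
	if submitNs == unstamped {
		return
	}
	hist[bucketIndex(nowNs-submitNs)]++
}

func main() {
	hist := make([]uint64, len(edges)+1)
	recordCompletion(hist, 1_000, 2_500)  // 1.5 µs latency → bucket 1
	recordCompletion(hist, unstamped, 99) // sentinel: sample dropped
	fmt.Println(hist)
}
```

The sentinel skip is what keeps the sum/count invariants meaningful: only stamped submits contribute to either side.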

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab pushed a commit that referenced this pull request Apr 21, 2026
…wned-static-send, 5% stop, bucket-0)

Round-2 review left all three HIGH findings PARTIAL or OPEN, plus two MED
findings around silent acceptance-criteria regression and bucket-0
deferral. Each is now closed with an in-place plan change:

HIGH #1 (VDSO fast path, PARTIAL -> CLOSED). Add §3.4a grounding the
"NOT a syscall" claim on strace (host) + AT_SYSINFO_EHDR
(bpfrx-fw0 VM) evidence committed in the prior commit. Chose option
(a) + (c) from the findings — verify on the target AND document the
deployment dependency with a named remediation path (graceful degrade
via the existing sentinel rule, not panic). Invariant 1 in §4 now
cites the evidence and names the fallback.

HIGH #2 (1% skew tolerance untested, Bonferroni tighter than
tolerance, OPEN -> CLOSED). Derive K_skew = ceil(λ × W_read) = 1
completion per snapshot at 1 Mpps × 1 µs in §3.6 R2. Replace §11.3
Bonferroni with a cell-level block-permutation test (Fisher-Pitman
style) whose null distribution is constructed from the data itself —
within-block reshuffles absorb K_skew noise by construction, so the
0.0014 bound is neither needed nor applied. Add §6.1 test #7 that
computes K_skew from the harness's own measured write rate and
read-window, asserting |sum - count| <= K_skew + 2 (paranoia margin
for TSO/ARM). The 1% integration hard-stop in §8 is now defended as
the scheduler-preemption-robust bound, not a guess.

HIGH #3 (snapshot-thread crossing, PARTIAL -> CLOSED). Add §3.5a
pinning BindingCountersSnapshot to owned values via a compile-time
const assert that requires 'static + Send. The struct already derives
Clone + Serialize + Deserialize + Default on u32/u64/i32 scalars; the
#812 extension adds Vec<u64> + two u64s, all owned. Rust type-system
trick: 'static on a struct with no lifetime parameter mechanically
rejects any future &'a U field addition. Defense in depth: §6.1 test
#4 (JSON round-trip) also fails for any accidental reference field
via serde's DeserializeOwned default.

MED (overhead hard-stop silently widened from 1% to 5%/10%).
Defended in §8.1 (option (b) from findings): the 1% bound was proposed
before §3.4 re-derived actual costs. Typical case 3.1%, worst case
9.4% on the inserted==1 partial-batch regime documented at
tx.rs:5953-5961 / tx.rs:6164-6174. A 1% gate would hard-stop on the
instrumentation's own presence under partial-batch traffic; that is
the wrong failure mode. 5% steady-state + 10% small-batch soft-gate
is the cheapest correctness-preserving budget; alternatives (rdtsc,
sampled stamping, bucket-midpoint sum) all lose measurement fidelity.

MED (bucket-0 coarseness deferred). Resolved as out-of-scope in §3.2
and §12 item 8. The §11 classifier's pre-registered statistics
(D1/D2/D3) all read from buckets 3+ — no verdict depends on sub-µs
resolution. Bucket 0 is intentionally coarse because the
MQFQ-vs-shaper separation (verdict B) lives in tens-of-microseconds
(buckets 4-7), not sub-µs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 21, 2026
psaab added a commit that referenced this pull request Apr 21, 2026
…d 3 HIGH #2)

λ upper bound bumped from 1 Mpps to 2 Mpps per-worker to cover the small-packet (64B at 25 Gbps line rate → ~2 Mpps per-worker under single-direction load) case that Codex round 3 flagged.

K_skew = ceil(2e6 × 1e-6) = 2 completions (was 1). Downstream:
- Integrity test assertion remains |sum − count| ≤ K_skew_observed + 2 — unchanged wording, new upper bound.
- §8 hard-stop 1% still tolerates K_skew of 2 comfortably at typical C ≥ 1000.
- Off-CPU pathology analysis: 4ms × 2Mpps = 8000 completions (was 4000); still bounded within one order of magnitude of 1% gate.
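The K_skew bound is a one-line derivation; a tiny Go check pins the arithmetic across the plan's λ revisions (1, 2, and later 3 Mpps):

```go
package main

import (
	"fmt"
	"math"
)

// kSkew = ceil(λ × W_read): the maximum completions that can land
// inside one snapshot read window W_read (seconds) at per-worker
// completion rate λ (packets per second).
func kSkew(lambdaPps, wReadSec float64) int {
	return int(math.Ceil(lambdaPps * wReadSec))
}

func main() {
	w := 1e-6 // 1 µs read window
	fmt.Println(kSkew(1e6, w), kSkew(2e6, w), kSkew(3e6, w)) // 1 2 3
}
```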

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 21, 2026
…ntics, block-permutation math)

Three focused round-4 fixes to the plan:

1. HIGH #2 PARTIAL: widen λ from 2 Mpps to 3 Mpps per worker to
   match the plan's own small-packet line-rate derivation
   (25 Gbps / 64 B / 4 workers ≈ 3.05 Mpps). K_skew = ceil(3e6 ×
   1e-6) = 3 completions, not 2. Downstream numbers updated
   throughout §3.6: pure memory-ordering bound `K_skew / C ≤
   0.03 %` at C ≥ 10 000; off-CPU pathology `4 ms × 3 Mpps =
   12 000` completions, `12 000 / 200 000 = 6 %`.

2. NEW (1 % gate vs 4 ms preemption): adopt the "observability of
   the event is the point" stance explicitly. The 1 % gate IS
   expected to fire under ≥ ~1.5 ms CFS preemption; that fire is
   a measurement pathology the harness should surface, not a
   false positive. §3.6 now calibrates three regimes: pure
   memory-ordering (`≤ 0.03 %`, unit-test §6.1 #7), preemption-
   extended short-window (up to ~6 %, investigation-worthy not
   auto-block), integration (60-s run at C ≥ 10^7 where 4 ms
   preemption dilutes to 0.12 %, sustained > 1 % = merge block).
   §8 hard-stop #4 rewritten as a two-part rule: aggregate
   `|sum - count| / count > 0.01` on the full run AND
   per-snapshot fire rate > 5 % across N = 20 short-window
   snapshots (the "system not stable enough to measure" signal).

3. §11.3 NEW: replace the degenerate whole-window mass-ratio
   formulation (order-invariant under block permutation, trivial
   null — round-4 Codex finding) with a formally specified
   two-sample Fisher-Pitman block permutation test on PER-BLOCK
   statistics. Pre-registered per-block `T_D1,b`, `T_D2,b`,
   `T_D3,b` (order-sensitive because they vary across blocks),
   cell-level reduction `T_v = max_b T_v,b`, two-sample
   `Δ = mean(cell) − mean(baseline)` statistic, N_perm = 10 000,
   one-sided empirical p-value `p_v ≤ 0.05`. Cites Pesarin &
   Salmaso (2010) §3.2 and provides concrete
   `scipy.stats.permutation_test` invocation with
   `permutation_type='independent'`, `n_resamples=10_000`,
   `alternative='greater'`.
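The plan invokes `scipy.stats.permutation_test`; a minimal stdlib Go equivalent of the independent-sample, one-sided test (mean difference as the statistic, names and data made up) shows the mechanics:

```go
package main

import (
	"fmt"
	"math/rand"
)

func mean(xs []float64) float64 {
	var s float64
	for _, x := range xs {
		s += x
	}
	return s / float64(len(xs))
}

// permutationP pools both samples, reshuffles labels nResamples times,
// and counts how often the permuted Δ = mean(cell) − mean(baseline)
// reaches the observed Δ (one-sided, 'greater').
func permutationP(cell, base []float64, nResamples int, seed int64) float64 {
	obs := mean(cell) - mean(base)
	pool := append(append([]float64{}, cell...), base...)
	rng := rand.New(rand.NewSource(seed))
	hits := 1 // +1 correction keeps the empirical p-value strictly positive
	for t := 0; t < nResamples; t++ {
		rng.Shuffle(len(pool), func(i, j int) { pool[i], pool[j] = pool[j], pool[i] })
		if mean(pool[:len(cell)])-mean(pool[len(cell):]) >= obs {
			hits++
		}
	}
	return float64(hits) / float64(nResamples+1)
}

func main() {
	cell := []float64{9.1, 9.4, 9.8, 10.2}
	base := []float64{1.0, 1.2, 0.9, 1.1}
	fmt.Println(permutationP(cell, base, 10000, 42) <= 0.05) // true: clear separation
}
```

Note this sketches only the two-sample reduction; the plan's per-block statistics and `T_v = max_b T_v,b` cell-level reduction sit on top of it.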

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 21, 2026
Add 9 Rust unit tests covering the plan §6.1 / §5 test surface:

Rust-side (umem.rs test module):

  1. tx_latency_hist_bucket_boundary_roundtrip — plan §6.1 #1 /
     §5.1. Drive record_tx_completions_with_stamp with
     deterministic T0 and T0+K for K ∈ {500, 1500, 10_000, 100_000,
     10_000_000}; assert exactly one count in the bucket predicted
     by bucket_index_for_ns and zero in every other bucket. Pairs
     with the existing bucket boundary test so a bucket-layout
     drift fails BOTH pins.
  2. tx_latency_hist_partial_batch_stamping_only_touches_accepted_prefix
     — plan §6.1 #2 / §3.1 R1. `inserted ∈ {1, 2, 32, 64, 256}`;
     assert only the first `inserted` sidecar slots hold the stamp,
     tail stays at TX_SIDECAR_UNSTAMPED. The Codex HIGH #1 small-
     batch regime contract.
  3. tx_latency_hist_retry_unwind_leaves_no_stamps — plan §6.1 #3.
     The `inserted == 0` retry-unwind path hands an empty iterator
     to stamp_submits; every sidecar slot stays at the sentinel.
  4. tx_latency_hist_sentinel_skip_for_unstamped_completion —
     plan §6.1 #5 / §5.4 / Codex MED #5. A completion against an
     unstamped slot (never stamped, or canonicalized 0 from a
     VDSO-failure stamp) drops the sample — no bucket increment,
     no count/sum increment.
  5. tx_latency_hist_single_thread_sum_equals_count — plan §6.1
     #6 / §5.2. Drive N = 10_000 synthetic stamps + completions
     in one thread; assert `sum(hist) == count` exactly AND
     `sum_ns == sum(deltas)` exactly (single-thread invariant 6
     per plan §4).
  6. tx_latency_hist_cross_thread_snapshot_skew_within_bound —
     plan §6.1 #7 / §3.6 R2. Writer thread fires 1M+ fetch_adds;
     reader thread snapshots ≥ 5000 times. For each snapshot
     compute K_skew_i = ceil(λ_obs × W_read_i) + 2 using
     Instant::now() bracketing, assert |sum - count| ≤ K_skew_i.
     The derivation-driven cross-thread bound, not a free
     parameter.
  7. tx_submit_ns_sidecar_single_writer_ownership_is_rc_not_arc —
     plan §6.1 #6. Compile-time pin via fn-pointer probes that
     `WorkerUmem::shares_allocation_with` (calls Rc::ptr_eq) and
     `WorkerUmem::allocation_ptr` (calls Rc::as_ptr) retain the
     Rc shape. A future Rc→Arc migration breaks both bodies —
     silent drift becomes a loud compile failure.
  8. (extension of binding_live_snapshot_propagates_709_owner_profile_counters)
     — pin that the snapshot() path copies all three new
     atomics (hist + count + sum_ns) into BindingLiveSnapshot.
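
The sentinel-skip (pin 4) and sum-equals-count (pin 5) contracts reduce to a small Python model; the bucket layout and helper names below are illustrative, not the shipped Rust ones:

```python
TX_SIDECAR_UNSTAMPED = 0  # sentinel: slot never stamped (or canonicalized 0)
BUCKET_BOUNDS_NS = [1_000, 10_000, 100_000, 1_000_000]  # illustrative layout

def bucket_index_for_ns(delta_ns):
    for i, bound in enumerate(BUCKET_BOUNDS_NS):
        if delta_ns < bound:
            return i
    return len(BUCKET_BOUNDS_NS)  # overflow bucket

class Hist:
    def __init__(self):
        self.buckets = [0] * (len(BUCKET_BOUNDS_NS) + 1)
        self.count = 0
        self.sum_ns = 0

    def record_completion(self, stamp_ns, now_ns):
        # Sentinel skip: an unstamped completion drops the sample entirely,
        # no bucket increment, no count/sum increment.
        if stamp_ns == TX_SIDECAR_UNSTAMPED:
            return
        delta = now_ns - stamp_ns
        self.buckets[bucket_index_for_ns(delta)] += 1
        self.count += 1
        self.sum_ns += delta

h = Hist()
deltas = [500, 1_500, 10_000, 100_000]
for d in deltas:
    h.record_completion(stamp_ns=100, now_ns=100 + d)
h.record_completion(stamp_ns=TX_SIDECAR_UNSTAMPED, now_ns=999)  # dropped
assert sum(h.buckets) == h.count == 4       # single-thread invariant
assert h.sum_ns == sum(deltas)
```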

main.rs tests (wire-contract side):

  9. tx_latency_hist_serialization_roundtrip — plan §6.1 #4.
     JSON encode/decode round-trip on a non-trivial histogram;
     assert field-equality including Vec<u64> contents.
 10. tx_latency_hist_backward_compat_old_payload_deserializes —
     plan §6.1 #4 (second half) / §7 PR #804 wire-contract
     break row. Pre-#812 JSON payload (fields absent)
     deserializes with empty Vec and zero u64 via
     `#[serde(default)]`. THE backward-compat contract for
     step1-capture.
 11. tx_latency_hist_binding_counters_snapshot_is_static_send —
     plan §6.1 #8. Runtime corollary of the named compile-time
     const-assert added in commit 5 — exercising the `'static
     + Send` bound at test time too. Defence-in-depth if the
     const-assert were ever silently removed.
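
Pin 10's `#[serde(default)]` contract, modeled in Python (the wire keys follow the Go-side section of this message; the decode helper itself is hypothetical):

```python
import json

def decode_binding_counters(payload: str) -> dict:
    """Decode with serde-default-like semantics: absent fields fall back
    to an empty histogram / zero counters instead of failing."""
    raw = json.loads(payload)
    return {
        "tx_submit_latency_hist": raw.get("tx_submit_latency_hist", []),
        "tx_submit_latency_count": raw.get("tx_submit_latency_count", 0),
        "tx_submit_latency_sum_ns": raw.get("tx_submit_latency_sum_ns", 0),
    }

# Pre-#812 payload: the three fields are absent, decode still succeeds.
old = decode_binding_counters('{"other_counter": 7}')
assert old["tx_submit_latency_hist"] == []
assert old["tx_submit_latency_count"] == 0

# New payload round-trips field-for-field.
new = decode_binding_counters(
    '{"tx_submit_latency_hist": [1, 2], "tx_submit_latency_count": 3,'
    ' "tx_submit_latency_sum_ns": 4500}')
assert new["tx_submit_latency_hist"] == [1, 2]
```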

Also extends existing
binding_counters_snapshot_serializes_with_expected_wire_keys
and binding_counters_snapshot_projects_ring_pressure_fields to
include the three new wire keys so a rename or misattribution
is caught.

Refactor: factored the per-offset reap fold out of
reap_tx_completions into a new shared helper
`record_tx_completions_with_stamp` so the unit pins exercise
the PRODUCTION algorithm, not a test-only fake. The live
reap_tx_completions now calls the helper; same semantics,
same atomics, same order.

Go-side (pkg/dataplane/userspace/protocol.go):

  Add tx_submit_latency_hist / _count / _sum_ns fields on
  BindingStatus and BindingCountersSnapshot. omitempty keeps
  forward-compat — a pre-#812 helper that lacks these fields
  decodes into empty slice / zero u64 without Unmarshal
  erroring. Needed by the Go decoder that the daemon's
  status-poll path uses for step1-capture.

Plan: docs/pr/812-tx-latency-histogram/plan.md §6.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 21, 2026
…r threads

Codex round-1 HIGH #2 (docs/pr/812-tx-latency-histogram/codex-code-review.md
§7): the previous pin named `..._cross_thread_snapshot_skew_within_bound`
did reads AND the K_skew computation on the MAIN thread in a loop after
spawning a writer that called raw `fetch_add` on the atomics. That is
not a cross-thread test of the production code — it was testing a
synthetic concurrency model. Codex §7 also noted the λ_obs derivation
used the current snapshot's count / current elapsed, which
underestimated the rate during startup and made the bound nonsensical
— the landed pin was red on a clean build.

Rewrite:

1. Writer thread drives the PRODUCTION helpers `stamp_submits` +
   `record_tx_completions_with_stamp` in a tight loop (NOT raw
   `fetch_add`). The helpers own sidecar mutation AND the owner-
   profile atomics, so the pin exercises the exact shipped fold.

2. Separate reader thread does the snapshot loop and captures
   per-sample `(skew, w_read_ns)` into a shared Mutex<Vec<_>>.

3. Writer warmup gate: writer spins through 10k cycles BEFORE
   signalling `reader_warm = true`. Reader blocks until the gate
   (bounded 2s) so the samples it takes are steady-state, not
   startup-biased.

4. λ_obs is computed ONCE, post-hoc, from `count_final /
   elapsed_wall_ns` AFTER stopping the writer (plan §6.1 /
   Codex §7). W_read_max is the max observed read window across
   all reader samples. Bound: `K_skew = ceil(λ_obs × W_read_max) + 2`.

5. Single assertion: `max(|sum − count|) ≤ K_skew`. +2 margin is
   the derivation-independent TSO / ARM ordering allowance.

Harness sanity-checks: `count_final > 0`, `samples.len() > 0`.
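
A thread-free sketch of the step-4/5 bound (illustrative rates; the real pin measures λ_obs and W_read_max from the run):

```python
import math

def k_skew(lambda_obs_per_ns: float, w_read_max_ns: int) -> int:
    # Bound from step 4: ceil(lambda_obs * W_read_max) + 2, where +2 is the
    # derivation-independent memory-ordering allowance.
    return math.ceil(lambda_obs_per_ns * w_read_max_ns) + 2

# A writer doing one count+sum update per 100 ns and a reader whose two
# loads are separated by a 350 ns window: at most ceil(350/100) = 4
# updates can land between the loads, so |sum - count| <= 4 <= K_skew.
rate = 1 / 100            # updates per ns, measured post-hoc
window = 350              # max observed read window, ns
worst_case_skew = 4       # updates fitting in the window
assert worst_case_skew <= k_skew(rate, window)
```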

Verified: `cargo test --release --bin xpf-userspace-dp tx_latency_hist_`
passes all 11 pins in 0.2s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 25, 2026
* #805 D3 RSS refresh on workers↔queues transition — plan

Bug: when `system dataplane workers` is bumped from <queue_count
to >=queue_count (e.g. 4→6 on 6-queue mlx5),
applyRSSIndirectionOne early-returns because computeWeightVector
gives nil for workers>=queues. The previously-written
[1,1,1,1,0,0] table stays live; queues 4 and 5 starve.

Fix: when nil-because-workers>=queues, inspect the live table;
if not default round-robin, run `ethtool -X iface default`.

Plan covers: root cause, detection helper, implementation
(single-file rss_indirection.go change), tests (5 cases — 2
behavioral + 2 parser + 1 regression), acceptance (test
cluster live deploy + Codex review + Copilot + test-failover),
risks (rebalance race already mitigated by rssWriteMu, manual
table writes intentionally clobbered by daemon ownership).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #805 plan — address Codex round 1 (2 HIGH, 6 MED, 1 LOW)

R1 verdict was PLAN-READY NO. Critical correctness fixes:

HIGH:
- 1: claimed `rssWriteMu`, `applyRSSIndirectionLocked`, epoch
  bumps that came from #840 (REVERTED). Master state has no
  locking, void return. §3 added documenting actual master
  contract. Fix is purely additive within the current
  void-returning shape.
- 2: claimed bool return on `applyRSSIndirectionOne`. Master
  has void return. Plan now uses void `maybeRestoreDefault`
  helper consistent with master shape.

MED:
- 3: the "uses every queue at least once" check accepts
  custom non-default tables as false positives. §4 captured empirically on
  loss:xpf-userspace-fw0 — mlx5 default is exact round-robin
  `entry[i] = i mod queue_count`. §7 tightened to exact-match.
- 4: empirical default-table capture committed via §4 +
  fixture in test #8.
- 5: skip-reason was string-parsing. §5 uses structured
  `workers > 1 && workers >= queues` condition directly.
- 6: probe-error behavior unspecified. §6 mirrors existing
  apply-path's ErrNotFound + generic err handling.
- 7: missing tests for queue_count=0, workers>queues
  (not just ==). §9 tests #2 + #6 added.

LOW:
- 8: workers∈{0,1} regression guard added (§9 tests #4 + #5).
- 9: `make test-failover` demoted from merge blocker to
  optional defense-in-depth.

§13 review-response table maps each finding to resolution.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #805 plan — address Codex round 2 (1 MED real bug, 2 LOW spec, 1 LOW wording, 1 LOW test)

R2 verdict was PLAN-READY NO. Fixes:

MED:
- 2: indirectionTableIsDefault could return true vacuously on
  empty/unparseable ethtool -x output. Added sawAnyRow guard
  (mirrors existing indirectionTableMatches shape) + §9 test
  #11 to pin the empty-output case.
- 3: runtime queue-count-only changes (ethtool -L without config
  commit) explicitly listed out-of-scope in §12. Operators
  changing ringparam are expected to follow with a config
  commit. Netlink-watch loop for ringparam events is separate
  scope.

LOW:
- 1: §3 wrongly cited d.applySem as the startup-vs-reconcile
  serializer. Real mechanism is lifecycle ordering — startup
  runs from enumerateAndRenameInterfaces before API/CLI is
  wired up. Wording corrected so reviewers don't waste time
  chasing a non-existent semaphore.
- 4: §9 test #8 fixture location pinned: inline string literal,
  not a testdata/ file. Output is small enough to embed cleanly.
- 5: §9 test #12 added: BootSequence_4then6_RestoresDefault
  covers the operator workflow end-to-end (workers=4 writes
  constrained, workers=6 sees stale and restores default).

§13 R2 review-response table added (5 findings).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #805 plan — address Codex round 3 (1 LOW edge case)

R3 LOW: the sawAnyRow flag was set as soon as a row index parsed,
BEFORE any queue token was parsed. Input like "0:\n" (row index
with empty value list) would set sawAnyRow=true, the inner
for-loop over bytes.Fields would not execute, and the function
would return true — vacuously-default, exactly the failure mode
R2#2 was supposed to prevent.

Fix: rename sawAnyRow → sawAnyEntry, set inside the inner field
loop only AFTER a queue value has been parsed AND matched the
expected round-robin position. Now requires at least one verified
queue entry before returning true.

§9 test #11 expanded to enumerate three failure cases:
1. empty []byte{}
2. non-row text (header only)
3. row index with no queue tokens (the R3 case)

§13 R3 review-response row added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #805 D3 RSS indirection refresh on workers↔queues transition

When `system dataplane workers` is bumped from a value below the
NIC RX-queue count to a value at-or-above it (e.g. 4 → 6 on a
6-queue mlx5), the previously-written `[1,1,1,1,0,0]` indirection
table stays live. Queues 4 and 5 now host worker-bound AF_XDP
sockets but receive no RSS traffic, starving those workers.

Root cause: `applyRSSIndirectionOne` calls `computeWeightVector`,
which returns nil when `workers >= queues`. The caller treats
nil as "skip" and never touches the live table. Correct on a
fresh install (kernel default round-robin = what we want) but
wrong on the workers<queues → workers>=queues transition.

Fix: on the nil-weights skip path, when
`workers > 1 && workers >= queues > 0`, inspect the live
indirection table; if it isn't the kernel's default round-robin
shape, run `ethtool -X iface default` to restore it.

`indirectionTableIsDefault` is a strict round-robin parser:
`entry[i] == i mod queueCount` exactly. Verified empirically
against the live mlx5 default table on
`loss:xpf-userspace-fw0/ge-0-0-2`. Stricter than
`indirectionTableMatches` (which would accept any custom
table that uses every queue at least once); rejects empty,
unparseable, value-less, and non-round-robin inputs via
`sawAnyEntry` guard.
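
A stdlib sketch of the strict check (the ethtool -x text shape below is assumed, and the helper mirrors but is not the Go function):

```python
def indirection_table_is_default(output: str, queue_count: int) -> bool:
    """True only if every parsed entry matches the kernel default
    round-robin shape entry[i] == i % queue_count, with at least one
    verified entry (the sawAnyEntry guard against vacuous defaults)."""
    if queue_count <= 0:
        return False
    saw_any_entry = False
    for line in output.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # header / non-row text
        base = int(head.strip())
        for offset, token in enumerate(rest.split()):
            if not token.isdigit():
                return False
            if int(token) != (base + offset) % queue_count:
                return False
            saw_any_entry = True  # set only AFTER a verified entry
    return saw_any_entry

default_6q = "RX flow hash indirection table:\n0: 0 1 2 3 4 5\n6: 0 1 2 3 4 5\n"
assert indirection_table_is_default(default_6q, 6)
assert not indirection_table_is_default("0: 1 1 1 1 0 0\n", 6)  # concentrated
assert not indirection_table_is_default("", 6)                  # empty
assert not indirection_table_is_default("0:\n", 6)              # R3 value-less row
```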

`maybeRestoreDefault` mirrors the existing apply-path's
ethtool-probe failure handling: ErrNotFound → log Warn, return;
generic err → log Warn with output, return; never attempts a
write on probe failure.

Other skip paths (workers <= 0, workers == 1) leave the table
alone — those bring-up / single-worker cases have no prior
workers<queues state to undo.

12 new tests in pkg/daemon/rss_indirection_test.go:
- 7 behavioral (workers transition cases, stale-table preserved
  on workers∈{0,1}, queueCount=0 short-circuit, probe-failure
  skip)
- 4 parser (round-robin true, concentrated false, every-queue-
  but-non-round-robin false, empty/unparseable/value-less false)
- 1 end-to-end (BootSequence_4then6 covers operator workflow:
  step 1 writes [1,1,1,1,0,0]; step 2 sees stale and restores)

Plan + 4 Codex review rounds documented at
docs/pr/805-rss-refresh/plan.md (PLAN-READY YES at R4).

Closes #805.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #805 address Codex code-review LOWs (3 fixes: doc + 4 tests)

Codex code review verdict: MERGE YES with 3 LOW (cosmetic doc +
missing test coverage). All applied:

LOW 1: Stale §M2 invariant comment in applyRSSIndirection's
docblock said "workers >= queue_count is skipped" but the new
maybeRestoreDefault path probes and may write. Updated to
reflect the new behavior.

LOW 2: Generic-error branches in maybeRestoreDefault (non-
ErrNotFound -x probe failure; non-ErrNotFound -X default write
failure) had no regression coverage.
- Added argvErr scripted-error mechanism to fakeRSSExecutor:
  argv-prefix-keyed (out, err) tuples that take precedence
  over the existing per-iface ethtoolX/ethtoolC paths.
- Test #13 RestoreEthtoolXProbeGenericError_LogAndSkip: -x
  returns generic error → no -X default invocation.
- Test #14 RestoreEthtoolXDefaultGenericError_LoggedAndSwallowed:
  -X default returns generic error → both probe and write
  recorded, function returns normally without propagating.

LOW 3: New branch's interaction with non-mlx5 / empty-driver
guard untested at applyRSSIndirectionOne entry-point level.
- Test #15 NonMlxDriver_WorkersEqualsQueues_NotTouched:
  virtio_net driver with workers=6,queues=6 stale table →
  zero ethtool calls (driver guard short-circuits before
  restore path).
- Test #16 EmptyDriver_WorkersEqualsQueues_NotTouched: empty
  driver string (sysfs unreadable) → same expectation.

All 16 #805 tests pass plus 880+ existing tests clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #805 address Copilot inline (queues>0 vs queues>1 mismatch)

Comment in applyRSSIndirectionOne said "workers >= queues > 1"
but the guard was `workers > 1 && workers >= queues && queues > 0`.
On a single-queue NIC (queues=1) the guard would let
maybeRestoreDefault run, even though there's no possible
concentration to undo: queueCount=1 means entry[i] = i mod 1 = 0
for every i, which is both the default and the only possible
layout.

Tighten the guard to `queues > 1` to match the comment's intent
and skip the unnecessary probe on single-queue NICs.

Test #17 QueueCountOne_NoOp pins the new behavior with
workers=6, queues=1, stale-looking table — expects zero
ethtool calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #805 address Copilot inline review (5 fixes — clarity + determinism)

1. rss_indirection.go:242 — log "rss weight reshaping skipped"
   instead of "rss indirection skipped" so it's clear that the
   weight-vector write is what's skipped, not the whole RSS path.
   When workers>=queues>1 with a stale table, the next line(s)
   show the actual restore action.

2. rss_indirection_test.go:47 — comment for argvErr said
   "checked AFTER" but code checks BEFORE. Aligned to actual
   precedence.

3. rss_indirection_test.go:71 — Go map iteration is randomized,
   so two overlapping prefix keys would produce flaky results.
   Implemented longest-prefix-wins tie-breaking for determinism.

4. rss_indirection_test.go:846 — test #17 appeared before #16.
   Reordered to be sequential.

5. queues>1 vs queues>0 mismatch was already fixed in 744f8b2
   (Copilot saw the change but flagged a stale comment-vs-code
   diff — addressed via the comment update accompanying #1).
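
Fix 3's longest-prefix-wins rule, sketched standalone (hypothetical fake-executor shape):

```python
def scripted_result(argv, scripted):
    """Pick the scripted (out, err) whose argv prefix matches argv,
    longest prefix winning, so overlapping keys resolve deterministically
    despite randomized dict iteration order."""
    best_key, best_val = None, None
    for prefix, result in scripted.items():
        if tuple(argv[:len(prefix)]) == tuple(prefix):
            if best_key is None or len(prefix) > len(best_key):
                best_key, best_val = prefix, result
    return best_val

scripted = {
    ("ethtool", "-x"): ("table-dump", None),
    ("ethtool", "-x", "eth0"): ("eth0-dump", None),  # longer, must win
}
assert scripted_result(["ethtool", "-x", "eth0"], scripted) == ("eth0-dump", None)
assert scripted_result(["ethtool", "-x", "eth1"], scripted) == ("table-dump", None)
```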

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 26, 2026
- mouse_latency_probe.py: drop unused `random` and `struct` imports
  (Copilot R2 #6).
- mouse_latency_probe.py: docstring now documents all three
  min-attempts floors (M=1: 500, 2≤M<10: 1000, M≥10: 5000) so the
  intermediate branch is no longer surprising (Copilot R2 #5).
- test-mouse-latency.sh: distinguish iperf3-settle pull failure
  from a real cwnd-not-settled, with `INVALID-iperf3-settle-pull-failed`
  attribution (Copilot R2 #1).
- test-mouse-latency.sh: explicit probe.json availability check —
  `INVALID-probe-pull-failed`/`probe-missing`/`probe-invalid-json`
  instead of letting the matrix wrapper silently lose attribution
  (Copilot R2 #2).
- test-mouse-latency.sh: journalctl stderr now lands in
  `${OUT_DIR}/jc-stderr-${FW}.txt` instead of a per-rep `/tmp/`
  file on the caller, so it's captured alongside the other rep
  artifacts (Copilot R2 #3).
- test-mouse-latency-matrix.sh: preflight JSON parsing wrapped in
  try/except + .get() with explicit `FAIL invalid-json=...` /
  `FAIL missing-field=...` lines, so a partial write or schema
  drift produces an actionable diagnosis instead of an aborting
  stack trace (Copilot R2 #4).
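
The three floors from the docstring fix, as one function (sketch; M is the mouse-coroutine count):

```python
def min_attempts_floor(m: int) -> int:
    """Validity floor on total attempts as a function of coroutine count M:
    M=1 -> 500, 2 <= M < 10 -> 1000, M >= 10 -> 5000."""
    if m >= 10:
        return 5000
    if m >= 2:
        return 1000
    return 500

assert min_attempts_floor(1) == 500
assert min_attempts_floor(5) == 1000
assert min_attempts_floor(10) == 5000
```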

70 tests still green; both shell scripts pass `bash -n`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 26, 2026
- mouse_latency_probe.py: per-attempt connect/recv timeouts now
  bounded by remaining time to the deadline, so a probe near the
  deadline can no longer overrun by up to 10s and consume the
  iperf3 slack budget. Probe runtime now consistently ≤ duration
  + small constant. (Copilot R3 #1)

- mouse_latency_probe.py: payload generated once per coroutine
  instead of per-attempt — uniqueness is irrelevant for the
  byte-stateless echo path; the per-attempt os.urandom was
  avoidable CPU on the source. (Copilot R3 #2)

- test-mouse-latency-matrix.sh: WALL_CAP_HIT flag so the outer
  cell loop also stops once the wall budget is exceeded —
  previously `return 0` only exited run_cell and the outer loop
  would iterate the remaining cells, repeatedly tripping the cap.
  (Copilot R3 #3)

- test-mouse-latency-matrix.sh: rep_is_valid wraps the JSON
  parse + key access in try/except so a malformed probe.json or
  schema drift produces a clean "invalid" verdict instead of a
  Python stack trace in the matrix log. (Copilot R3 #4)

- test-mouse-latency.sh: apply-cos-config now targets the current
  RG0 primary (resolved via current_primary) instead of hard-coded
  fw0. If the cluster is already failed over before the rep
  starts, hardcoding fw0 would attempt to apply the fixture on
  the secondary. (Copilot R3 #5)

- test-mouse-latency.sh: screen-pre snapshot now captured from
  the same node the post-snapshot will follow (current_primary at
  pre-time, same node at post-time). The pre/post diff is now
  consistent regardless of which node was primary at rep start.
  Records the captured node in screen-pre-fw.txt for analyst
  cross-reference. (Copilot R3 #6)
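
The R3 #1 clamp from the first item above, reduced to its core (hypothetical names; the real probe applies this inside asyncio timeouts):

```python
BASE_TIMEOUT_S = 10.0

def attempt_timeout(deadline: float, now: float) -> float:
    """Per-attempt timeout clamped to the remaining time to the deadline,
    so a probe started near the deadline cannot overrun by up to 10 s."""
    remaining = deadline - now
    return max(0.0, min(BASE_TIMEOUT_S, remaining))

start = 1000.0
deadline = start + 60.0
assert attempt_timeout(deadline, start) == 10.0        # plenty of budget
assert attempt_timeout(deadline, start + 55.0) == 5.0  # clamped near the end
assert attempt_timeout(deadline, start + 61.0) == 0.0  # past the deadline
```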

70 tests still green; both shell scripts pass `bash -n`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 26, 2026
…w-up (#906)

* #905 mouse-latency tail: plan v4 (7 Codex rounds, PLAN-NEEDS-MINOR)

Plan for measurement-only PR characterizing mouse-latency tail
under elephant load using the operator-provisioned echo server
on 172.16.80.200:7. 12-cell run matrix (N elephants × M mice),
PASS gate at p99(N=128,M=10) ≤ 2× idle baseline.

Seven Codex hostile review rounds; disposition tables in §11:
- R1: 17 findings, PLAN-NEEDS-MAJOR
- R2: 7 findings, PLAN-NEEDS-MAJOR
- R3: 6 findings, PLAN-NEEDS-MAJOR
- R4: 8 findings, PLAN-NEEDS-MAJOR
- R5: 5 findings, PLAN-NEEDS-MINOR
- R6: 3 findings, PLAN-NEEDS-MINOR
- R7: 2 findings, PLAN-NEEDS-MINOR

Stopping plan iteration at R7 per the #838 lesson: late rounds
surfacing minor spec-clarity issues with no structural defects
indicate it's time to let code review take over. Implementation
follows.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #905 mouse-latency tail: harness implementation

Adds the measurement harness specified in
docs/pr/905-mouse-latency/plan.md. 12 new files in test/incus/:

Python parsers + tests (no third-party deps; stdlib statistics +
asyncio only):
- cluster_status_parse.py — `cli show chassis cluster status` →
  list of (rg_id, node_id, state) triples (incl. secondary-hold)
- iperf3_sum_parse.py — iperf3 text-mode `[SUM]` row → bps
- mouse_latency_probe.py — closed-loop M-coroutine TCP probe
  driver with histogram + statistics.quantiles percentiles
  + per-coroutine RPS distribution + validity gate
- mouse_latency_aggregate.py — per-cell median-by-p99 +
  decision verdict; honours orchestrator INVALID-* markers
- mouse_latency_orchestrate.py — cwnd-settle / collapse /
  RG-flap helper subcommands
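
The stdlib percentile path the probe driver leans on (statistics.quantiles per the list above; sample data is illustrative):

```python
from statistics import quantiles

def p99_ms(latencies_ms):
    """p99 via stdlib statistics.quantiles: cut the sample into 100
    quantile intervals and take the 99th cut point."""
    return quantiles(latencies_ms, n=100)[98]

samples = list(range(1, 101))  # 1..100 ms
assert 99.0 <= p99_ms(samples) <= 100.0
```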

Shell orchestrator + matrix wrapper:
- test-mouse-latency.sh — one rep with the full validity
  pipeline from plan §4.5 (CoS preflight, mpstat over the
  probe window only, dual-node journalctl HA-transition diff,
  1Hz RG state polling, post-snapshot follows current primary)
- test-mouse-latency-matrix.sh — 12-cell matrix with
  preflight gating + 10/15-rep accounting per plan §4.7

68 unit tests pass via
`python3 -m unittest discover -s test/incus/ -p '*_test.py'`.

Five rounds of Codex hostile code review; each round's findings
addressed inline. Smoke run on the loss cluster has not yet
been executed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #905 mouse-latency: address Copilot R1 review (8 inline comments)

- plan.md: filename references hyphenated → underscored to match the
  Python module names actually shipped (mouse_latency_probe.py etc.).
- plan.md: §7.1 merge gate dropped "findings.md exists" (the harness
  PR merges first; findings come in a follow-up after the matrix runs).
- plan.md: §4.5 step 2 now states the implemented mpstat lifecycle
  (`mpstat 1 <duration>` over the probe window only, not duration+30).
- plan.md: §4.5 step 3 clarifies that polling the primary alone is
  sufficient because `Manager.FormatStatus` returns both nodes' rows
  in one query.
- mouse_latency_probe.py: compute_validity uses the previously-unused
  `completed` parameter to surface count-bookkeeping inconsistencies.
  Two new unit tests.
- mouse_latency_probe.py + mouse_latency_aggregate.py: per-coroutine
  RPS distribution renamed to attempts_per_second_per_coroutine_*
  (the previous achieved_rps_per_coroutine_* conflated workload-
  offered attempts with completion-rate, which uses different
  totals).
- test-mouse-latency-matrix.sh: top-of-file comment now describes
  the actual rep-accounting behavior ("up to 15 reps as needed for
  10 valid"), not a >30%-trigger conditional that was simplified
  out.
- test-mouse-latency-matrix.sh: preflight comment explains the 60s
  duration choice (M=1 floor of 500 attempts).

70 tests green via the discover command from §7.1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #905 mouse-latency: address Codex R6 (1 HIGH + 1 MED)

R6 found a stale-artifact contamination bug. Two cells with the
same rep index (e.g. cell_N0_M10/rep_00 and cell_N128_M10/rep_00)
both wrote to /tmp/probe-rep_00.json on the source container; a
failed pull on the second cell silently inherited the first
cell's data. Includes the cell name in REP_TAG so remote temp
paths are unique, plus an explicit `rm -f` of the temp files at
rep start as defense in depth.

R6 MED: a failover/failback hidden inside missed 1Hz RG poll
samples could pass as stable. Added a final RG-state snapshot at
end of rep and an explicit initial-vs-final triple-set comparison
(in addition to the per-sample drift check on the poll log).

Both shell scripts pass `bash -n`; 70 unit tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #905 mouse-latency: address Codex R7 HIGH (local stale-artifact)

The R6 HIGH fix (remote temp-file collision) made both REP_TAG
and the remote-side rm cell-aware. Codex R7 found the same class
of bug on the LOCAL side: if OUT_DIR is reused (e.g. a rerun into
an existing rep dir) and the probe run fails before overwriting
probe.json, the previous run's stale data would silently survive
and be picked up by `rep_is_valid` in the matrix wrapper.

Add an explicit rm of all per-rep local artifacts (probe.json,
iperf3.txt, mpstat.txt, RG state files, INVALID-* markers, etc.)
at rep start.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #905 mouse-latency: address Copilot R2 review (6 inline comments)

- mouse_latency_probe.py: drop unused `random` and `struct` imports
  (Copilot R2 #6).
- mouse_latency_probe.py: docstring now documents all three
  min-attempts floors (M=1: 500, 2≤M<10: 1000, M≥10: 5000) so the
  intermediate branch is no longer surprising (Copilot R2 #5).
- test-mouse-latency.sh: distinguish iperf3-settle pull failure
  from a real cwnd-not-settled, with `INVALID-iperf3-settle-pull-failed`
  attribution (Copilot R2 #1).
- test-mouse-latency.sh: explicit probe.json availability check —
  `INVALID-probe-pull-failed`/`probe-missing`/`probe-invalid-json`
  instead of letting the matrix wrapper silently lose attribution
  (Copilot R2 #2).
- test-mouse-latency.sh: journalctl stderr now lands in
  `${OUT_DIR}/jc-stderr-${FW}.txt` instead of a per-rep `/tmp/`
  file on the caller, so it's captured alongside the other rep
  artifacts (Copilot R2 #3).
- test-mouse-latency-matrix.sh: preflight JSON parsing wrapped in
  try/except + .get() with explicit `FAIL invalid-json=...` /
  `FAIL missing-field=...` lines, so a partial write or schema
  drift produces an actionable diagnosis instead of an aborting
  stack trace (Copilot R2 #4).

70 tests still green; both shell scripts pass `bash -n`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #905 mouse-latency: address Codex R9 (2 small fixes)

- test-mouse-latency.sh stale-artifact cleanup list now includes
  jc-stderr-*.txt so a rerun into an existing rep dir doesn't
  inherit prior journalctl stderr (Copilot R2 #3 introduced this
  artifact path; Codex R9 noticed it was missing from the wipe).

- test-mouse-latency-matrix.sh preflight JSON parser now coerces
  `validity` to {} when it isn't a dict, so schema-drifted JSON
  produces a clean FAIL line instead of stack-tracing on
  `validity.get(...)` (Copilot R2 #4 partial residual).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #905 mouse-latency: address Copilot R3 review (6 inline comments)

- mouse_latency_probe.py: per-attempt connect/recv timeouts now
  bounded by remaining time to the deadline, so a probe near the
  deadline can no longer overrun by up to 10s and consume the
  iperf3 slack budget. Probe runtime now consistently ≤ duration
  + small constant. (Copilot R3 #1)

- mouse_latency_probe.py: payload generated once per coroutine
  instead of per-attempt — uniqueness is irrelevant for the
  byte-stateless echo path; the per-attempt os.urandom was
  avoidable CPU on the source. (Copilot R3 #2)

- test-mouse-latency-matrix.sh: WALL_CAP_HIT flag so the outer
  cell loop also stops once the wall budget is exceeded —
  previously `return 0` only exited run_cell and the outer loop
  would iterate the remaining cells, repeatedly tripping the cap.
  (Copilot R3 #3)

- test-mouse-latency-matrix.sh: rep_is_valid wraps the JSON
  parse + key access in try/except so a malformed probe.json or
  schema drift produces a clean "invalid" verdict instead of a
  Python stack trace in the matrix log. (Copilot R3 #4)

- test-mouse-latency.sh: apply-cos-config now targets the current
  RG0 primary (resolved via current_primary) instead of hard-coded
  fw0. If the cluster is already failed over before the rep
  starts, hardcoding fw0 would attempt to apply the fixture on
  the secondary. (Copilot R3 #5)

- test-mouse-latency.sh: screen-pre snapshot now captured from
  the same node the post-snapshot will follow (current_primary at
  pre-time, same node at post-time). The pre/post diff is now
  consistent regardless of which node was primary at rep start.
  Records the captured node in screen-pre-fw.txt for analyst
  cross-reference. (Copilot R3 #6)

70 tests still green; both shell scripts pass `bash -n`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 26, 2026
1. mouse_latency_orchestrate.py: rg-state-flapped sample sort now
   keys on int(ts) rather than the string ts to avoid lexicographic
   mis-ordering if timestamp digit width ever varies (Copilot R3 #1).

2. mouse_latency_aggregate.py: select_valid_reps now requires p99
   to be numerically present, not just validity.ok=True. Median
   selection no longer coerces missing p99 to 0 (which could
   mis-pick a malformed rep as the median and skew the verdict).
   (Copilot R3 #2)

3. test-mouse-latency.sh: rename "SYN-cookie counter snapshot" to
   "Screen flood-counter snapshot" — the underlying CLI command
   `show security screen statistics zone wan` reports SCREEN
   flood-event counters, NOT SYN-cookie-specific counters. The
   manifest's `screen_engaged` field still measures the right
   signal; just the comments were misleading. (Copilot R3 #3, #4)
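
Fix 2's selection rule as a sketch (hypothetical rep schema; field names are illustrative):

```python
def select_median_rep(reps):
    """Pick the median-by-p99 valid rep. A rep participates only when
    validity.ok is True AND p99 is numerically present, so a malformed
    rep can no longer be coerced to p99=0 and mis-picked as the median."""
    valid = [
        r for r in reps
        if r.get("validity", {}).get("ok")
        and isinstance(r.get("p99"), (int, float))
    ]
    if not valid:
        return None
    valid.sort(key=lambda r: r["p99"])
    return valid[len(valid) // 2]

reps = [
    {"p99": 4.0, "validity": {"ok": True}},
    {"p99": None, "validity": {"ok": True}},   # ok=True but p99 missing: excluded
    {"p99": 9.0, "validity": {"ok": True}},
    {"p99": 6.0, "validity": {"ok": True}},
]
assert select_median_rep(reps)["p99"] == 6.0
```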

Edits are mid-flight-safe for the running iperf-c shared matrix
(naming and parsing fixes; no behavior change to active reps).

72 tests still green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request Apr 29, 2026
* docs: #921 plan v1 — ifindex_to_zone + EgressInterface.zone → u16

* docs: #921 plan v2 — address Codex round-1 NEEDS-REVISION

Codex round-1 (task-mojw27u8-zcmsih, 4m50s) found 5 real issues
(plus one wire-compat confirmation):

1. GRE handling was misdescribed: gre.rs:260-265 already produces
   a u16 ingress_zone for UserspaceDpMeta — after this PR it becomes
   a single direct id lookup, not the v1's "translate to name via
   zone_id_to_name" round-trip. Critically, GRE decap fires from
   afxdp.rs:947-955 BEFORE the flow cache, so this is a per-packet
   path on GRE-tunnel workloads (not just session-miss). Real win.

2. Unknown / skipped-zone invariant was underspecified. After this
   PR an interface pointing at a dropped zone (reserved id, > u8 max)
   collapses to zone_id == 0. v2 documents this as the canonical
   "unknown" sentinel and adds two tests:
   - interface_pointing_at_skipped_zone_collapses_to_zone_id_zero
   - egress_with_unknown_zone_name_collapses_to_zone_id_zero

3. Churn inventory was wrong. Real numbers:
   - ifindex_to_zone references: 12 in 5 Rust files.
   - EgressInterface constructors: 31 across 9 files.
   - Total touched: ~14 files; estimate <300 LOC.
   - test_fixtures.rs has NO EgressInterface ctors (v1 was wrong).

5. Debug-log: must update both forwarding_build.rs:440 (FWD_STATE
   ifindex_to_zone) AND :445-451 (per-egress eg.zone). v1 only
   covered the first.

6. EgressInterface read sites: only 5 real reads (forwarding.rs:147,
   :176, forwarding_build.rs:450, tunnel.rs:184-188, frame.rs:3625
   test). No long-lived String borrows that would force translation.

(4) wire compat is correct — confirmed unchanged.

Test count target updated 849 → 851 (+4 new tests including the
two unknown-zone collapse tests).

* docs: #921 plan v2.1 — correct EgressInterface ctor file list (Codex round-2)

* #921: ifindex→zone and EgressInterface.zone string→u16

Eliminates the last two String-valued zone maps in ForwardingState
that survived #919/#922:

1. ForwardingState.ifindex_to_zone: FastMap<i32, String>
   → ifindex_to_zone_id: FastMap<i32, u16>.
2. EgressInterface.zone: String → zone_id: u16.

Both maps are populated once at config commit time (in
forwarding_build.rs) by translating the snapshot's zone NAME via
zone_name_to_id. Hot-path callers now do ONE direct u16 lookup
where they previously did TWO HashMap lookups (ifindex → String,
then String → u16) plus a String hash on the second.

Per Codex round-1 finding #1, GRE decap (afxdp.rs:947-955) fires
BEFORE the flow cache, so gre.rs:260-265's zone-id resolution is
on the per-packet path for matching native GRE traffic — the
direct-id win there is per-packet, not just session-miss.

Per Codex round-1 finding #2, an interface whose snapshot zone
string refers to a zone that was dropped at config build (reserved
id, > u8 max) collapses to zone_id == 0 (the canonical "unknown"
sentinel). Tests pin this invariant.

Files (per Codex round-1 inventory correction):
- types.rs: 2 field type changes.
- forwarding_build.rs: 4 sites — populate ifindex_to_zone_id from
  snapshot zone NAME, populate EgressInterface.zone_id, render
  both debug-log lines via zone_id_to_name, plus 4 new tests.
- forwarding.rs: 3 sites — zone_pair_for_flow_with_override
  (test-only, slow-path String-from-id), zone_pair_ids_for_flow_with_override
  (hot path: direct u16 lookup), cluster_peer_return_fast_path.
- gre.rs:260: per-packet GRE decap direct id lookup.
- tunnel.rs:184: direct EgressInterface.zone_id load.
- afxdp.rs: 3 sites — screen profile lookup (slow path) + DNAT
  fallback (slow path) + zone-encoded fabric redirect (direct id).
- 30 EgressInterface constructors across 9 files (coordinator,
  flow_cache, frame, frame_tx, ha, session_glue, tests, worker,
  forwarding_build): zone: "lan".to_string() → zone_id: TEST_LAN_ZONE_ID.
  Done via targeted Python pass (stateful tracking of EgressInterface
  scope to avoid corrupting the unrelated InterfaceSnapshot.zone and
  TunnelEndpointSnapshot.zone snapshot fields, which this PR DOES NOT
  touch — those remain String wire-protocol fields).
- frame.rs:3625 test assertion: egress.zone → egress.zone_id.

cargo test --release: 851 passed (847 baseline + 4 new), 0 failed.
cargo build --release: clean.

Plan: docs/pr/921-zone-id-maps/plan.md (v2.1, Codex round-3
PLAN-READY at task-mojwmnmr-d63ipm in 17s).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request May 1, 2026
Three findings, all valid:

1. neighbor_dispatch.rs:52 — `mem::take(&mut binding.pending_neigh)` +
   `binding.pending_neigh.reserve(...)` drops the original preallocated
   VecDeque buffer every sweep that has retained items, then allocates a
   new one. Defeats reuse and introduces per-poll alloc/free churn during
   cold-start storms.

   Replaced with in-place `pop_front` × snapshotted-len + `push_back`
   rotation. Items pushed back land at the tail and are NOT re-visited
   in this sweep because the for-loop iterates exactly `pending_len`
   times. The same backing buffer stays attached to `binding.pending_neigh`
   the whole time. Behavior is unchanged; allocation pattern is fixed.
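
A minimal sketch of that rotation (the item type and retain predicate are placeholders for PendingNeighPacket and the real classifier):

```rust
use std::collections::VecDeque;

// Visit each pending item exactly once without detaching the backing
// buffer. `retain` decides whether an item stays queued for next sweep.
fn sweep<T>(pending: &mut VecDeque<T>, mut retain: impl FnMut(&T) -> bool) {
    let pending_len = pending.len(); // snapshot BEFORE the loop
    for _ in 0..pending_len {
        let item = pending.pop_front().expect("len snapshot guards this");
        if retain(&item) {
            // Re-queued items land at the tail and are NOT re-visited,
            // because the loop runs exactly `pending_len` times.
            pending.push_back(item);
        }
        // else: dropped (timeout) or dispatched in the real code
    }
}

fn main() {
    let mut q: VecDeque<i32> = (1..=6).collect();
    let cap_before = q.capacity();
    sweep(&mut q, |x| x % 2 == 0); // retain evens
    assert_eq!(q.iter().copied().collect::<Vec<_>>(), vec![2, 4, 6]);
    // Same backing buffer: length never exceeded the snapshot, so the
    // deque never reallocated and its capacity is unchanged.
    assert_eq!(q.capacity(), cap_before);
}
```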

2. afxdp.rs:273 — comment incorrectly claimed "allocated lazily, idle
   bindings stay near zero footprint" when worker/mod.rs's
   `VecDeque::with_capacity(MAX_PENDING_NEIGH)` was eager preallocation.
   Comment was wrong about what the code did.

3. afxdp.rs:274 — preallocating ~576 KB per binding per worker × N
   bindings × M workers added real upfront RSS cost. With the cap at
   4096 and typical deployments having dozens of binding-worker pairs,
   this ran into 50-100 MB just for empty queues.

Resolution for #2 + #3: switch worker/mod.rs init from
`VecDeque::with_capacity(MAX_PENDING_NEIGH)` to `VecDeque::new()` (zero
initial capacity). The 4096 admission cap is enforced at the push-front
site in poll_descriptor.rs:2314 — the constant remains the upper bound,
but the underlying buffer grows on push only when traffic queues up.
Idle bindings stay at zero footprint, matching the (now-true) comment.

cargo test --release: 865 passed, 0 failed (no behavior change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request May 1, 2026
…(N) (#1075)

* Cold Start 1: MAX_PENDING_NEIGH 64→4096 + retry_pending_neigh O(N²)→O(N)

GEMINI-NEXT.md Section 3 ("Cold Start Resolution") items 1 + 2.
Targets the 1-2 second initial-connection delay during connection
bursts (cluster failback, parallel-connect storms, etc.).

## Item 1: buffer cap 64 → 4096

`MAX_PENDING_NEIGH` raised from 64 to 4096. PendingNeighPacket is
~144 B (XdpDesc + UserspaceDpMeta + decision + queued_ns), so the
worst-case per-binding footprint when fully populated is ~576 KB —
allocated lazily (VecDeque grows on demand), so idle bindings stay
near zero. The previous 64-cap meant a 65th concurrent new-flow
packet during the 2 s ARP/NDP probe window dropped silently;
parallel TCP-connect storms during cluster failback hit this
trivially.
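
The worst-case footprint arithmetic quoted above checks out (trivial sketch using the numbers in the text):

```rust
fn main() {
    const PKT_SIZE_BYTES: usize = 144;   // ~size of PendingNeighPacket, per the text
    const MAX_PENDING_NEIGH: usize = 4096;
    let bytes = PKT_SIZE_BYTES * MAX_PENDING_NEIGH;
    assert_eq!(bytes, 589_824);          // exactly 576 KiB per fully-populated binding
    assert_eq!(bytes / 1024, 576);
}
```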

## Item 2: O(N²) → O(N) drain-classify-restore

Previous `retry_pending_neigh` body did `binding.pending_neigh.remove(i)`
inside a while-i-loop. VecDeque::remove is O(n) per call, so the
sweep was O(n²) in queue depth. With the cap bumped to 4096 the
quadratic cost would be a real fairness hazard.

New pattern:
1. `let pending = std::mem::take(&mut binding.pending_neigh);` —
   take the whole VecDeque out at zero cost.
2. Walk `pending` once consuming each PendingNeighPacket. Classify:
   - timeout → push frame to pending_fill_frames, drop pkt.
   - neighbor still missing → `binding.pending_neigh.push_back(pkt)`
     to retain for the next sweep.
   - neighbor resolved → rewrite frame in place, build PreparedTxRequest,
     enqueue to local or peer binding.
3. Each item touched exactly once → O(n).

The classifier is now a flat sequence of `let-else` early-exits
instead of nested if/else, so each outcome reads top-to-bottom.
Behavior is equivalent: the three outcomes (drop, retain, dispatch)
match the previous remove-or-skip-or-process-and-remove flow.
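
The take-based sweep, sketched minimally (Outcome and the classifier are placeholders; a later fixup in this PR replaces mem::take with in-place rotation to preserve the backing buffer, but the each-item-once O(n) structure is the same):

```rust
use std::collections::VecDeque;

// Hypothetical stand-in for the drop / retain / dispatch classification.
enum Outcome { Drop, Retain, Dispatch }

fn sweep(pending: &mut VecDeque<u32>, classify: impl Fn(u32) -> Outcome) -> Vec<u32> {
    let mut dispatched = Vec::new();
    // Take the whole queue out at zero cost (leaves an empty VecDeque behind).
    let taken = std::mem::take(pending);
    for pkt in taken { // each item touched exactly once -> O(n)
        match classify(pkt) {
            Outcome::Drop => {}                        // timeout: frame recycled
            Outcome::Retain => pending.push_back(pkt), // neighbor still missing
            Outcome::Dispatch => dispatched.push(pkt), // neighbor resolved
        }
    }
    dispatched
}

fn main() {
    let mut q: VecDeque<u32> = (0..9).collect();
    let sent = sweep(&mut q, |p| match p % 3 {
        0 => Outcome::Drop,
        1 => Outcome::Retain,
        _ => Outcome::Dispatch,
    });
    assert_eq!(q.iter().copied().collect::<Vec<_>>(), vec![1, 4, 7]);
    assert_eq!(sent, vec![2, 5, 8]);
}
```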

cargo test --release: 865 passed, 0 failed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Cold Start 1 fixup: address Copilot review on PR #1075

Three findings, all valid:

1. neighbor_dispatch.rs:52 — `mem::take(&mut binding.pending_neigh)` +
   `binding.pending_neigh.reserve(...)` drops the original preallocated
   VecDeque buffer every sweep that has retained items, then allocates a
   new one. Defeats reuse and introduces per-poll alloc/free churn during
   cold-start storms.

   Replaced with in-place `pop_front` × snapshotted-len + `push_back`
   rotation. Items pushed back land at the tail and are NOT re-visited
   in this sweep because the for-loop iterates exactly `pending_len`
   times. The same backing buffer stays attached to `binding.pending_neigh`
   the whole time. Behavior is unchanged; allocation pattern is fixed.

2. afxdp.rs:273 — comment incorrectly claimed "allocated lazily, idle
   bindings stay near zero footprint" when worker/mod.rs's
   `VecDeque::with_capacity(MAX_PENDING_NEIGH)` was eager preallocation.
   Comment was wrong about what the code did.

3. afxdp.rs:274 — preallocating ~576 KB per binding per worker × N
   bindings × M workers added real upfront RSS cost. With the cap at
   4096 and typical deployments having dozens of binding-worker pairs,
   this ran into 50-100 MB just for empty queues.

Resolution for #2 + #3: switch worker/mod.rs init from
`VecDeque::with_capacity(MAX_PENDING_NEIGH)` to `VecDeque::new()` (zero
initial capacity). The 4096 admission cap is enforced at the push-front
site in poll_descriptor.rs:2314 — the constant remains the upper bound,
but the underlying buffer grows on push only when traffic queues up.
Idle bindings stay at zero footprint, matching the (now-true) comment.

cargo test --release: 865 passed, 0 failed (no behavior change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request May 2, 2026
Codex round-1 verdict on the closeout PR was MERGE-NEEDS-MAJOR with
4 specific FAILs (3 doc-correctness + 1 base-ref misread):

- Doc accuracy (FAIL #2): the `PaddedVtimeSlot::publish` doc and the
  plan both said "5 publish sites". Codex caught a 6th: the
  rollback-path direct `slot.publish` at `cos/queue_ops/push.rs:126`
  bypasses the `publish_committed_queue_vtime` helper. Updated both
  the doc and the plan to "6 publish sites total: 4 post-settle + 1
  demote-restore + 1 direct rollback".

- Memory-ordering overclaim (FAIL #3): the doc on
  `participating_v_min_snapshot` claimed "the result reflects SOME
  consistent snapshot of participating peers". There is no lock,
  seqlock, retry, or epoch making the result linearizable across
  slots. Weakened the claim to "the set of values observed during
  the scan, where each individual value is a valid Acquire-load of
  that slot at some moment within the scan window".

- Stale plan-vs-shipped reference (FAIL #6): plan §"What this PR
  does" still said "extend the doc on `read_v_min` and
  `participating_peer_count`" while the diff actually deletes those
  helpers. Updated plan to describe the actual diff (rewrite stale
  publish doc + replace helpers).

Codex's "diff scope" finding was based on a stale base ref
(`efa06ee8`, PR #1141's merge commit) instead of the current master
(`5196abd4`, PR #1142's merge). The PR's actual GitHub-base diff is
3 files. No rebase needed.

Codex's "Closes #N → Refs #N" point (FAIL #7) addressed via PR
description update (separate from this commit).

Tests still 940 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request May 2, 2026
…1143)

* #940/#941/#942 closeout: doc fixes + dead-helper consolidation

Audit closeout for the V_min trio whose implementation work landed
in PRs #950 (#940), #952 (#941), #953 (#942) but whose issues stayed
open. The implementation is correct; this PR fixes the residual
documentation gaps the original acceptance criteria called out and
consolidates two unused helpers into a single canonical entry point.

Changes:

- types/shared_cos_lease.rs:
  - Fix stale `PaddedVtimeSlot::publish` doc that claimed a
    first-enqueue publish the implementation deliberately omits;
    document the rationale (NOT_PARTICIPATING peers are skipped
    in the V_min reduction; no committed work exists pre-settle;
    publishing pre-vacate `queue_vtime` would falsely throttle peers).
  - Replace dead `read_v_min` and `participating_peer_count` (zero
    callers; the actual iteration was inlined in
    `cos_queue_v_min_continue`) with a single
    `participating_v_min_snapshot(worker_id) -> (u32, Option<u64>)`
    helper. The new helper carries the canonical memory-ordering doc:
    each `slot.read()` is an independent Acquire load paired with the
    corresponding Release store; the iteration is non-atomic across
    slots; the V_min reduction's correctness contract tolerates SOME
    consistent snapshot, not all-reads-atomic.

- cos/queue_ops/v_min.rs:
  - Replace inlined slot iteration in `cos_queue_v_min_continue` with
    a call to `floor.participating_v_min_snapshot(worker_id)`.
    Byte-for-byte semantic equivalence (smoke confirms identical
    per-class throughput pre/post: iperf-a 954 Mb/s, iperf-b 9.54 Gb/s,
    iperf-c 23.5 Gb/s, iperf-d 12.4 Gb/s, iperf-e 15.3 Gb/s, iperf-f
    18.1 Gb/s).
  - Add module-level doc capturing the publish-only-on-commit
    invariant, the no-first-enqueue-publish rationale, and the
    enforcing tests (`vmin_pop_snapshot_does_not_publish`,
    `vmin_no_first_enqueue_publish`).
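
A sketch of the helper's shape (the slot storage, sentinel value, and padding are assumptions; the signature, the skip-own-slot behavior, and the per-slot Acquire semantics follow the text):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

const NOT_PARTICIPATING: u64 = u64::MAX; // hypothetical sentinel encoding

// Each slot.load(Acquire) pairs with that slot's Release store. The
// iteration is non-atomic ACROSS slots: the result is the set of values
// observed during the scan, not a cross-slot snapshot.
fn participating_v_min_snapshot(slots: &[AtomicU64], worker_id: usize) -> (u32, Option<u64>) {
    let mut peers = 0u32;
    let mut v_min: Option<u64> = None;
    for (i, slot) in slots.iter().enumerate() {
        if i == worker_id {
            continue; // skip our own slot
        }
        let v = slot.load(Ordering::Acquire);
        if v == NOT_PARTICIPATING {
            continue; // NOT_PARTICIPATING peers are skipped in the reduction
        }
        peers += 1;
        v_min = Some(v_min.map_or(v, |m| m.min(v)));
    }
    (peers, v_min)
}

fn main() {
    let slots: Vec<AtomicU64> = [42, NOT_PARTICIPATING, 17, 99]
        .into_iter()
        .map(AtomicU64::new)
        .collect();
    // Worker 3 scans: 42 (peer), skip sentinel, 17 (peer), skip own slot.
    assert_eq!(participating_v_min_snapshot(&slots, 3), (2, Some(17)));
}
```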

Acceptance criteria coverage table is in
docs/pr/917-vmin-trio-closeout/plan.md.

Microbench result (#940 AC): bench_pop_commit_settle_publish ran
in --release at 237.4 ns/pkt (640k packets, 151.96 ms). No regression
gate yet — this becomes the future baseline.

HA-failover validation (#942's last AC): scripts/userspace-ha-failover-validation.sh
ran cleanly on failover correctness (session sync, fabric forwarding,
no zero-throughput intervals, expected retx=319). One marginal
threshold miss (0.938 vs 1.0 Gb/s gate during the failover window)
is pre-existing test-gate strictness, not a V_min regression.

cargo test --release: 940 passed.

Closes #940, #941, #942 (closure comments to follow this merge).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* PR-1143 round-2: address Codex MERGE-NEEDS-MAJOR

Codex round-1 verdict on the closeout PR was MERGE-NEEDS-MAJOR with
4 specific FAILs (3 doc-correctness + 1 base-ref misread):

- Doc accuracy (FAIL #2): the `PaddedVtimeSlot::publish` doc and the
  plan both said "5 publish sites". Codex caught a 6th: the
  rollback-path direct `slot.publish` at `cos/queue_ops/push.rs:126`
  bypasses the `publish_committed_queue_vtime` helper. Updated both
  the doc and the plan to "6 publish sites total: 4 post-settle + 1
  demote-restore + 1 direct rollback".

- Memory-ordering overclaim (FAIL #3): the doc on
  `participating_v_min_snapshot` claimed "the result reflects SOME
  consistent snapshot of participating peers". There is no lock,
  seqlock, retry, or epoch making the result linearizable across
  slots. Weakened the claim to "the set of values observed during
  the scan, where each individual value is a valid Acquire-load of
  that slot at some moment within the scan window".

- Stale plan-vs-shipped reference (FAIL #6): plan §"What this PR
  does" still said "extend the doc on `read_v_min` and
  `participating_peer_count`" while the diff actually deletes those
  helpers. Updated plan to describe the actual diff (rewrite stale
  publish doc + replace helpers).

Codex's "diff scope" finding was based on a stale base ref
(`efa06ee8`, PR #1141's merge commit) instead of the current master
(`5196abd4`, PR #1142's merge). The PR's actual GitHub-base diff is
3 files. No rebase needed.

Codex's "Closes #N → Refs #N" point (FAIL #7) addressed via PR
description update (separate from this commit).

Tests still 940 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* PR-1143 round-3: address Codex MERGE-NEEDS-MINOR

Codex round-2 verdict was MERGE-NEEDS-MINOR with 3 remaining nits:

1. The negation "does NOT produce a linearizable cross-slot snapshot"
   still contained the word "linearizable". Rewrote the paragraph
   to assert what the iteration IS (a set of independent
   Acquire-loads with bounded staleness) without naming the
   property it isn't, removing the term entirely.

2. Residual stale references to the deleted helpers in:
   - shared_cos_lease.rs:89 — said "see read_v_min and the inlined
     iterator". Updated to reference participating_v_min_snapshot.
   - shared_cos_lease.rs:168 — historical "Replaces the prior" doc.
     Reframed as a description of what the helper does today
     (single-pass for cos_queue_v_min_continue).
   - docs/pr/917-vmin-trio-closeout/plan.md — scrubbed "AND on
     first enqueue" wording, "5 publish sites" stale count
     (was already corrected in audit table; removed from rev-2
     changelog), and "inconsistent snapshots" phrase. Historical
     references to read_v_min / participating_peer_count remain
     where they document what the PR replaced; clarified inline.

3. The round-2 commit message used generic "#N" wording instead of
   citing #940/#941/#942 explicitly. The round-2 commit is already
   pushed; this round-3 commit cites all three issues by number
   here for the durable record.

Affected issues remain #940, #941, #942 (the V_min trio).

Tests still 940 passing; no code changes in this round (doc only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* PR-1143 round-4: plan internal-consistency fix (Copilot)

The plan's "What this PR does NOT do" section claimed "no
HA-failover smoke (out of scope)" while Gap 4 said the PR DOES
invoke scripts/userspace-ha-failover-validation.sh. Reconciled:
the PR runs the existing script (and records the result); what's
out of scope is BUILDING a new HA-replay-storm harness.

Affected: docs/pr/917-vmin-trio-closeout/plan.md only.

Tests still 940 passing; doc-only change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* PR-1143 round-5: final stale-reference scrub (Codex round-3, #940/#941/#942)

Codex round-3 verification surfaced two remaining stale references
in docs/pr/917-vmin-trio-closeout/plan.md:

1. Audit row for #942's HA-failover smoke still said "Out of scope
   for closeout — would need a dedicated test harness". This
   contradicted Gap 4 (which says the PR runs
   scripts/userspace-ha-failover-validation.sh) and the corrected
   "What this PR does NOT do" section. Updated the audit row to
   "PARTIAL" with the actual status: existing script ran, all
   failover-correctness checks PASS; building a new HA-replay-
   storm harness remains out of scope.

2. Gap 2 section header described the dead helpers
   (read_v_min, participating_peer_count) in current tense as
   "defined" with "zero call sites". Reworded to past tense
   ("Pre-PR state ... were defined ...") so the section reads as
   a description of what the PR replaced, not what currently
   exists.

This commit explicitly references #940, #941, and #942 (the V_min
trio) — Codex flagged the prior round-4 commit `4f308a32` for
omitting these issue numbers from its commit message.

Affected: docs/pr/917-vmin-trio-closeout/plan.md only. No code
change. Tests still 940 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request May 5, 2026
Codex round-6 PLAN-NEEDS-MINOR with 3 text-level nits:

1. RegenerateNeighborSnapshot now filters to publishable-only
   before publishUpdateNeighbors (closes v5 nit #2 fully)
2. neighborSnapshotPublishable uses net.ParseIP / net.ParseMAC
   instead of string-empty checks (matches Rust accept invariant)
3. Removed 'or repurposed to call resolveNeighborsInner' from
   section 6 — stale option from v3 contradicting force-probe

Full trajectory (6 review rounds, 7 plan versions):
- v1 PLAN-NEEDS-MAJOR (1 hole)
- v2 PLAN-NEEDS-MAJOR (5 round-1 findings)
- v3 PLAN-NEEDS-MAJOR (4 round-3 findings, post-redesign)
- v4 PLAN-NEEDS-MAJOR (5 round-3 findings)
- v5 PLAN-NEEDS-MAJOR (5 round-4 findings)
- v6 PLAN-NEEDS-MINOR (6 round-5 nits)
- v7 PLAN-NEEDS-MINOR (3 round-6 text fixes)

Convergence reached. Per #946 / Tier B / Tier S pattern, NEEDS-MINOR
after specific text fixes = apply and proceed to implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request May 5, 2026
* #1197 plan v1: skip-on-conflict + snapshot self-heal (Option A + A1)

Defensive fix for the stale-MAC preinstall bug discovered on
2026-05-05. preinstallSnapshotNeighbors must check kernel ARP
state before overwriting, and trigger snapshot regeneration when
a skip fires so the snapshot self-heals.

Pending adversarial plan review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 plan v2: address all 5 Codex round-1 findings

v1 (commit 2709752) only fixed the second NeighSet loop. Codex
PLAN-NEEDS-MAJOR caught:

1. First loop also broken: promotes STALE→REACHABLE before
   second-loop check runs. v2 makes first loop state-preserving
   (only re-confirm REACHABLE; leave STALE/DELAY/PROBE alone).

2. Authority model needed: steady-state vs takeover need
   different semantics. v2 splits to 'missing-only outside
   takeover' for periodic; takeover-context preinstall deferred.

3. A1 (snapshot regeneration) is MANDATORY: userspace-dp Rust
   forwarding (forwarding/mod.rs:1464) checks static snapshot
   BEFORE dynamic netlink. Even if kernel ARP is fixed, stale
   state.neighbors wins forwarding. v2 adds explicit
   Manager.RegenerateNeighborSnapshot() called on conflict.

4. NeighList primitive: v2 batches per (ifindex,family) instead
   of per-entry. Uses state bitmasks. PERMANENT included in
   learnedMask (operator-installed; respect it).

5. IPv6 NDP in scope: same code path; same fix applies.

Gemini Pro 3 ACP timeout (8th today); proceeded on Codex per
feedback_gemini_infra_outage_merge_policy.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 plan v3: kernel-as-authority redesign per user feedback

User: 'shouldn't we be listening for neighbor advertisements and
updating those when it changes... shouldn't we doing a neighbor
solicit when our view of the world expires... pre-emptively send
out an NS before we expire the entry to make sure it's still there.'

That's RFC 4861 NUD applied to the snapshot. v2 was incremental
defense (don't overwrite fresher kernel entry); v3 is the right
architecture (don't push to kernel at all in steady-state; listen
and proactively probe).

v3 design:
- DELETE preinstallSnapshotNeighbors entirely (stop fighting kernel NUD)
- ADD neighborListener: NeighSubscribe → RTM_NEWNEIGH/DELNEIGH → regen snapshot
- ADD Manager.RegenerateNeighborSnapshot(): publish update_neighbors on change
- REPLACE 15s preinstall tick with 15s resolveNeighborsInner probe tick
  (existing function; just a new caller)

Leverages existing infra: NeighSubscribe in daemon_ha_fabric.go:809
(narrowly scoped to fabric); resolveNeighbors in daemon_neighbor.go:117
(already sends ARP/NS for known next-hops).

Pending Codex review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 plan v4: address 4 Codex round-3 holes

v3 (commit 867bd36) was directionally right but missing critical
details. Codex round-3 PLAN-NEEDS-MAJOR caught:

1. resolveNeighborsInner skips STALE — 15s tick on it is no-op
   for stale entries → add forceProbeNeighbors sibling that
   probes regardless of NUD state
2. Monitor filter must match snapshot keyspace (all kernel
   neighbors on configured forwarding/fabric ifaces), not
   narrowed to static next-hops
3. RTM_NEWNEIGH MAC-only check inadequate — also handle
   RTM_DELNEIGH eviction, FAILED/INCOMPLETE transitions; add
   forwarding-effective diff (ignore REACHABLE↔STALE churn on
   same MAC)
4. NeighSubscribe is lossy — use NeighSubscribeWithOptions with
   ListExisting:true, error-callback resubscribe, debounce, 60s
   safety reconciliation tick

Phase 1 PR1 must contain ALL THREE: stale probing + del-event
regen + broad filter. Otherwise need TTL or dynamic-first lookup
as foundation (deferred).

Gemini Pro 3 ACP timeout (9th today).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 plan v5: address 5 Codex round-4 holes

v4 (commit 4bcb794) had directionally right kernel-as-authority
shape but Codex round-4 PLAN-NEEDS-MAJOR caught:

1. collectMonitoredNeighbors undefined; force path needs RG-
   takeover trigger too (not only 15s tick); add target
   prioritization (STALE/PROBE/DELAY first → next-hops → rest;
   cap at 256)
2. Listener resubscribe break-in-select broken; extract
   runOneSubscription helper with explicit lifetime + ReceiveBufferSize
3. Forwarding-effective diff in WRONG layer — neighborsEqual
   compares raw state. Add manager-level neighborsEqualForwarding
   on (key, MAC, usable-bit); use in RegenerateNeighborSnapshot
   AND BumpFIBGeneration paths
4. Filter must EXACTLY match buildNeighborSnapshots keyspace.
   Export MonitoredInterfaceLinkIndexes(cfg) helper; reuse in
   both buildNeighborSnapshots and isMonitoredNeighbor — drift
   impossible
5. shouldTriggerRegen MUST treat DELAY/PROBE/NOARP as usable
   too (userspace forwarding/mod.rs:45 does); only ignore same-
   MAC same-usable-class churn
6. Atomic-PR constraint: PR1 must contain delete + listener +
   force-probe + regen-diff atomically; do not ship deletion alone

Pending Codex round-5 verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 plan v6: address 6 Codex round-5 nits; ready to implement

v5 (commit 0ae008d) hit PLAN-NEEDS-MINOR (convergence). 6 fixes:

1. neighborsEqualForwarding now filters to publishable entries
   (skip empty-MAC INCOMPLETE/NONE); old-usable→INCOMPLETE
   detected via key disappearance
2. NUD_NONE explicitly rejected in Go publish + matched in
   neighborSnapshotPublishable predicate; mirrors userspace
3. isMonitoredNeighbor adds snapshot-key fallback for runtime
   ifindex drift (delete events on disappeared links still
   processed)
4. collectMonitoredNeighbors tiers globally on annotated NUD
   state (build set → annotate → sort risky→fresh + criticality)
5. runOneSubscription: explicit close(done) on subscribe error;
   regenDebouncer uses same-goroutine timer-channel pattern
   (no AfterFunc race on pending)
6. HA section text corrected: takeover calls forceProbeNeighbors
   not resolveNeighbors; Risk table updated

Pending Codex round-6 verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 plan v7: 3 final text fixes; proceeding to implementation

Codex round-6 PLAN-NEEDS-MINOR with 3 text-level nits:

1. RegenerateNeighborSnapshot now filters to publishable-only
   before publishUpdateNeighbors (closes v5 nit #2 fully)
2. neighborSnapshotPublishable uses net.ParseIP / net.ParseMAC
   instead of string-empty checks (matches Rust accept invariant)
3. Removed 'or repurposed to call resolveNeighborsInner' from
   section 6 — stale option from v3 contradicting force-probe

Full trajectory (6 review rounds, 7 plan versions):
- v1 PLAN-NEEDS-MAJOR (1 hole)
- v2 PLAN-NEEDS-MAJOR (5 round-1 findings)
- v3 PLAN-NEEDS-MAJOR (4 round-3 findings, post-redesign)
- v4 PLAN-NEEDS-MAJOR (5 round-3 findings)
- v5 PLAN-NEEDS-MAJOR (5 round-4 findings)
- v6 PLAN-NEEDS-MINOR (6 round-5 nits)
- v7 PLAN-NEEDS-MINOR (3 round-6 text fixes)

Convergence reached. Per #946 / Tier B / Tier S pattern, NEEDS-MINOR
after specific text fixes = apply and proceed to implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 impl: kernel-as-authority neighbor reconciliation

Replace the buggy periodic preinstallSnapshotNeighbors mechanism
with an event-driven listener + force-probe design. Kernel ARP/
NDP becomes the source of truth; xpfd listens for changes and
proactively probes monitored neighbors to keep the kernel state
fresh — instead of pushing a stale snapshot back into the kernel
every 15s.

Implementation matches plan v7 (commit a9f8e2e) after 6 rounds
of Codex hostile review (round-1 PLAN-NEEDS-MAJOR through round-6
PLAN-NEEDS-MINOR). Phase 1 ships atomically per Codex's explicit
constraint: delete + listener + force-probe + regen-diff together.

Changes:

- pkg/dataplane/userspace/snapshot.go:
  - Add neighborsEqualForwarding (publishable-only diff on
    (ifindex, ip, mac); ignores REACHABLE↔STALE aging churn)
  - Add neighborSnapshotPublishable predicate (matches Rust
    accept rules at handlers.rs:165 / forwarding/mod.rs:45;
    rejects 'failed', 'incomplete', 'none', empty/malformed
    MAC, invalid IP)
  - Add MonitoredInterfaceLinkIndexes(cfg) — exported
    enumeration shared by snapshot publish + listener filter,
    making drift impossible

- pkg/dataplane/userspace/manager.go:
  - Add Manager.RegenerateNeighborSnapshot — event-driven
    rebuild + publish (used by listener debouncer + 60s safety
    tick)
  - Add Manager.LookupSnapshotNeighbor — O(N) read for
    forwarding-effective diff in shouldTriggerRegen
  - Add Manager.SnapshotHasIfindex — listener-filter fallback
    for runtime ifindex drift (delete events on disappeared
    links still processed)
  - Add filterPublishableNeighbors helper
  - BumpFIBGeneration: switch to neighborsEqualForwarding
    + filterPublishableNeighbors on publish

- pkg/daemon/daemon_neighbor.go:
  - DELETE preinstallSnapshotNeighbors (lines 24-105 in master
    — the bug source)
  - Remove the call from maintainClusterNeighborReadiness
  - Add forceProbeNeighbors call to the 15s tick (alongside
    existing skip-stale resolveNeighbors)

- pkg/daemon/daemon_neighbor_listener.go (new):
  - neighborListener: NeighSubscribeWithOptions{ListExisting,
    ReceiveBufferSize=1MB} with resubscribe loop on close;
    60s safety reconciliation tick
  - runOneSubscription: explicit lifetime helper; no
    break-in-select; close(done) on subscribe error path
  - regenDebouncer: 100ms coalescer with same-goroutine
    timer-channel pattern (no AfterFunc race)
  - shouldTriggerRegen: forwarding-effective diff per event;
    triggers on MAC change / DELNEIGH / FAILED / INCOMPLETE;
    ignores REACHABLE↔STALE↔DELAY↔PROBE on same MAC
  - isMonitoredNeighbor: MonitoredInterfaceLinkIndexes union
    with SnapshotHasIfindex fallback
  - usableNUD bitmask = REACHABLE|STALE|DELAY|PROBE|PERMANENT|
    NOARP (matches Rust forwarding/mod.rs:45; excludes NUD_NONE)
  - forceProbeNeighbors: probe all monitored neighbors
    INCLUDING STALE/DELAY/PROBE (resolveNeighbors skips them);
    target prioritization tier1→tier3 (stale-risk → critical
    REACHABLE → rest); cap at neighborProbeMaxTargets=256
  - collectMonitoredNeighbors: snapshot-keys + fabric-peers,
    annotated with current kernel NUD state, sorted by tier

- pkg/daemon/daemon_ha_vip.go:
  - VRRP MASTER announce calls forceProbeNeighbors instead of
    just resolveNeighbors so stale entries are re-validated
    on takeover (resolveNeighbors handles missing/cold)

- pkg/daemon/daemon_run.go:
  - Start neighborListener goroutine alongside the existing
    runPeriodicNeighborResolution
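
The usableNUD mask and the forwarding-effective churn-ignore diff can be sketched like this (the real listener is Go daemon code; NUD bit values are the linux/neighbour.h constants; the function shapes are assumptions):

```rust
// NUD state bits per linux/neighbour.h; NUD_NONE is 0x00.
const NUD_REACHABLE: u16 = 0x02;
const NUD_STALE: u16 = 0x04;
const NUD_DELAY: u16 = 0x08;
const NUD_PROBE: u16 = 0x10;
const NUD_FAILED: u16 = 0x20;
const NUD_NOARP: u16 = 0x40;
const NUD_PERMANENT: u16 = 0x80;

// usableNUD mask: excludes NUD_NONE, FAILED, INCOMPLETE.
const USABLE_NUD: u16 =
    NUD_REACHABLE | NUD_STALE | NUD_DELAY | NUD_PROBE | NUD_PERMANENT | NUD_NOARP;

fn usable(state: u16) -> bool {
    state & USABLE_NUD != 0
}

// Regen fires only when forwarding would actually change: the MAC moved,
// or the usable classification flipped. Same-MAC REACHABLE<->STALE aging
// churn is ignored.
fn should_trigger_regen(old_mac: u64, old_state: u16, new_mac: u64, new_state: u16) -> bool {
    old_mac != new_mac || usable(old_state) != usable(new_state)
}

fn main() {
    let mac = 0x0011_2233_4455u64;
    // Same-MAC aging churn: no regen.
    assert!(!should_trigger_regen(mac, NUD_REACHABLE, mac, NUD_STALE));
    // MAC change: regen.
    assert!(should_trigger_regen(mac, NUD_REACHABLE, mac + 1, NUD_REACHABLE));
    // Transition to FAILED (unusable): regen.
    assert!(should_trigger_regen(mac, NUD_REACHABLE, mac, NUD_FAILED));
}
```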

Tests:

- pkg/dataplane/userspace/snapshot_neighbors_1197_test.go:
  - TestNeighborSnapshotPublishable (12 subtests covering all
    accept/reject categories)
  - TestNeighborsEqualForwarding (8 subtests covering MAC change,
    state transitions, key removal, churn-ignore)
  - TestFilterPublishableNeighbors

- pkg/daemon/daemon_neighbor_listener_test.go:
  - TestProbeTierClassification (10 subtests)
  - TestUsableNUDMask (9 subtests covering the bitmask
    invariant)
  - TestNeighborListenerNUDStateBitmaskCoverage
  - TestNeighborListenerEventTypes (RTM_NEWNEIGH/DELNEIGH
    differ)

Full Go test suite passes; build clean.

Pending: Codex hostile code review on this commit + Gemini
adversarial code review + smoke matrix on loss userspace
cluster (#1197 trigger condition reproducibly fixed) +
PR with Copilot review.

Closes #1197 (after smoke + reviews pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 impl v2: 4 Codex code-review fixes

Round-1 code review (MERGE-NEEDS-MAJOR) caught 3 majors + 1 minor:

1. collectMonitoredNeighbors omitted configured-targets source.
   Cold-start path (no snapshot yet) had empty probe set. Fixed:
   added Source 2 (configured next-hops + DHCP gateways + backup
   router + DNAT pools + static NAT translated + address-book
   /32 /128 hosts) via new shared collectNeighborProbeTargets
   helper extracted from resolveNeighborsInner.

2. Listener hot path was O(N): isMonitoredNeighbor recomputed
   monitored ifindexes via netlink LinkByName per event;
   LookupSnapshotNeighbor did locked O(N) scan. Fixed: added
   cached neighborIndex map[neighborIndexKey]*NeighborSnapshot
   on Manager (rebuilt on snapshot publish), exposed via
   MonitoredIfindexes() returning defensive copy. Listener now
   does O(1) ifindex membership check + O(1) snapshot lookup.

3. neighborSnapshotPublishable used exact-string match for
   'failed'/'incomplete'; Rust uses substring contains match
   after lowercase (forwarding/mod.rs:45). Fixed: substring
   match. Also: shouldTriggerRegen now treats NUD_NONE for
   existing snapshot entries as transition-to-unusable
   (previously could be ignored until 60s safety tick).

4. neighborProbeMaxTargets was hardcoded const; plan says
   configurable. Fixed: read from BPFRX_NEIGHBOR_PROBE_MAX_TARGETS
   env, default 256.

Tests pass, build clean.

Pending: round-2 code review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 impl v3: 5 Codex round-2 fixes

Round-2 (MERGE-NEEDS-MAJOR) found 3 majors + 2 minors. v3 fixes:

1. neighborIndex now indexes ONLY publishable entries
   (rebuildNeighborIndex applies neighborSnapshotPublishable
   filter). Prevents the unpublishable→publishable transition
   on same MAC from being silently ignored: if existing entry
   was failed/incomplete (not in index) and new event makes it
   reachable, listener sees 'no existing' → trigger fires.

2. Replaced MonitoredIfindexes() (returned full map copy per
   event = O(N) + heap churn) with IsMonitoredIfindex(int) bool
   (O(1) direct lookup, no allocation). Listener hot path is
   now genuinely O(1).

3. rebuildMonitoredIfindexes now called from BumpFIBGeneration
   and RegenerateNeighborSnapshot paths too (not just full
   snapshot publish at line 299). Catches link recreation under
   new ifindex during steady-state.

4. LookupSnapshotNeighbor: clarified semantics; returns a value
   copy via *NeighborSnapshot (caller-friendly API). Index hit
   means publishable entry only.

5. resolveNeighborsInner refactored to use shared
   collectNeighborProbeTargets helper. Drift risk between
   resolve+force-probe paths eliminated; single canonical
   target-collection source.

Plus: composite NUD state safety. usable now requires both 'has
usable bit' AND 'no FAILED/INCOMPLETE bit' so a (REACHABLE|FAILED)
state is correctly classified as unusable.
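
A minimal Go sketch of the composite-state rule (the NUD bit values are
the kernel's from linux/neighbour.h; the exact usable-bit set is an
assumption for illustration):

```go
package main

import "fmt"

// Linux NUD state bits (linux/neighbour.h). NUD state is a bitmask, so
// composite values like REACHABLE|FAILED can occur.
const (
	nudIncomplete = 0x01
	nudReachable  = 0x02
	nudStale      = 0x04
	nudDelay      = 0x08
	nudProbe      = 0x10
	nudFailed     = 0x20
	nudNoARP      = 0x40
	nudPermanent  = 0x80
)

// nudUsable applies the composite-safety rule: at least one usable bit
// AND no FAILED/INCOMPLETE bit. The usable set here is illustrative.
func nudUsable(state int) bool {
	const usableBits = nudReachable | nudStale | nudDelay | nudProbe |
		nudNoARP | nudPermanent
	return state&usableBits != 0 && state&(nudFailed|nudIncomplete) == 0
}

func main() {
	fmt.Println(nudUsable(nudReachable))             // true
	fmt.Println(nudUsable(nudReachable | nudFailed)) // false: composite state
	fmt.Println(nudUsable(0))                        // false: NUD_NONE
}
```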

Tests pass, build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 impl v4: ordering bug fixes + composite-state test

Round-3 (MERGE-NEEDS-MAJOR) caught 2 ordering bugs + 1 minor:

1. neighborIndex was rebuilt at 'm.lastSnapshot = snap'
   (line 299) but apply_snapshot fired later (line 343). On
   apply_snapshot failure, the listener thought entries were
   published when they weren't, suppressing real same-MAC
   transitions. Fixed: rebuild AFTER apply_snapshot succeeds.

2. rebuildMonitoredIfindexes was gated on 'neighbors-diff-changed
   AND publish-success'. If a link was recreated under a new
   ifindex without a neighbor change, the cache stayed stale →
   events on new ifindex dropped. Fixed: rebuild
   UNCONDITIONALLY whenever buildNeighborSnapshots runs (in
   BumpFIBGeneration and RegenerateNeighborSnapshot paths) —
   doesn't depend on diff result.

3. Added TestCompositeNUDStateUnusable verifying composite
   NUD safety (REACHABLE|FAILED → unusable, etc.) — Codex
   v3 noted the v3 logic was correct but had no test coverage.

Tests pass, build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 impl v5: 2 Codex round-4 fixes (deferred-publish + apply_snapshot filter)

Round-4 (MERGE-NEEDS-MAJOR) caught 2 majors:

1. syncSnapshotLocked() (the deferred-publish path that fires
   when Compile() defers during XSK startup) didn't rebuild
   listener caches on apply_snapshot success. Same stale-entry
   class as v3 fixed for the direct path. Fixed: added
   rebuildNeighborIndex() + rebuildMonitoredIfindexes() after
   the deferred apply_snapshot success.

2. apply_snapshot was sending raw snap.Neighbors. Rust's full-
   snapshot build accepts state='none' entries that Go's
   predicate rejects. Result: Rust holds entries Go can't index
   → can't track removal. Fixed: filter to publishable-only via
   filterPublishableNeighbors before sending apply_snapshot,
   matching the update_neighbors path's filter.
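
A Go sketch of the publishable filter, assuming the lowercase substring
match described in v3 (type and field names are illustrative; the real
predicate also excludes other rows such as malformed MACs):

```go
package main

import (
	"fmt"
	"strings"
)

// Neighbor is a stand-in for the snapshot entry type; only State matters
// for this sketch.
type Neighbor struct {
	IP    string
	State string
}

// publishable mirrors the predicate described in the commits: lowercase
// substring match on "failed"/"incomplete" (not exact-string equality),
// plus rejection of state "none".
func publishable(n Neighbor) bool {
	s := strings.ToLower(n.State)
	if s == "none" {
		return false
	}
	return !strings.Contains(s, "failed") && !strings.Contains(s, "incomplete")
}

// filterPublishableNeighbors keeps only entries the dataplane should see.
func filterPublishableNeighbors(in []Neighbor) []Neighbor {
	out := make([]Neighbor, 0, len(in))
	for _, n := range in {
		if publishable(n) {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	got := filterPublishableNeighbors([]Neighbor{
		{"10.0.0.1", "REACHABLE"},
		{"10.0.0.2", "NUD_FAILED"}, // substring "failed" after lowercase
		{"10.0.0.3", "incomplete"},
		{"10.0.0.4", "none"},
	})
	fmt.Println(len(got)) // 1
}
```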

Tests pass, build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 impl v6: status-loop catch-up path rebuilds caches

Codex round-5 (MERGE-NEEDS-MAJOR) caught one remaining stale-cache
path: syncSnapshotLocked's status-loop catch-up branch (helper
already has the snapshot, per LastSnapshotGeneration). That branch
set publishedSnapshot but didn't rebuild listener caches.

Same fix pattern as the direct + deferred publish paths:
add rebuildNeighborIndex() + rebuildMonitoredIfindexes() after
marking the snapshot as published.

Tests pass, build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197 impl v7: catch-up branch full bookkeeping

Codex round-6 found the v6 catch-up branch only rebuilt caches
but didn't mirror the FULL successful-apply_snapshot bookkeeping.
publishedPlanKey + lastSnapshotHash were left stale. That breaks
the same-plan-during-XSK-startup exception (publishedPlanKey check
at process.go:283), can force unnecessary helper restarts, and can
leave in place the very deadlock this catch-up branch exists to repair.

v7 mirrors the full bookkeeping:
  m.publishedSnapshot = m.lastSnapshot.Generation
  m.publishedPlanKey = planKey
  m.lastSnapshotHash = hash (if hashOK)
  m.rebuildNeighborIndex()
  m.rebuildMonitoredIfindexes()

Tests pass, build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197: 5 Copilot review fixes

Copilot inline review on PR #1198 caught:

1. NeighList cache key collision: cacheKey=ifindex*2+family
   collides because FAMILY_V4=2, FAMILY_V6=10 → (if=1,v6) and
   (if=5,v4) both = 12. Fixed: struct key {ifindex, family}.

2. RegenerateNeighborSnapshot bumped lastSnapshot.Generation but
   didn't advance publishedSnapshot or refresh lastSnapshotHash.
   Status loop saw it as unpublished. Fixed: mirror the same
   bookkeeping the apply_snapshot path does.

3. Cold-start pass only ran resolveNeighbors() (skips STALE) so
   inherited stale entries weren't re-validated until first 15s
   tick. Fixed: forceProbeNeighbors() also runs at startup.

4. shouldTriggerRegen had no direct unit coverage. Refactored
   into pure shouldTriggerRegenWithProvider() so a stub
   neighborSnapshotProvider (not the full DataPlane interface)
   can drive it. Added TestShouldTriggerRegen with 8 subtests
   covering DELNEIGH, MAC change, REACHABLE→STALE churn (ignored),
   transition to FAILED/INCOMPLETE, new publishable, new
   without MAC, composite REACHABLE|FAILED.

5. lastSnapshotHash was computed from raw snapshot.Neighbors,
   so churn in filtered-out rows (state=none, malformed MAC) would
   shift the hash and force unnecessary apply_snapshot republishes.
   Fixed: snapshotContentHash() now uses filterPublishableNeighbors()
   for the hash payload — matches what userspace-dp actually sees.
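
The fix-1 collision is easy to demonstrate; a Go sketch (the struct-key
type name is illustrative, the family values are the kernel's):

```go
package main

import "fmt"

// netlink address-family constants.
const (
	familyV4 = 2  // FAMILY_V4 / AF_INET
	familyV6 = 10 // FAMILY_V6 / AF_INET6
)

// The old arithmetic key collides: ifindex*2+family is not injective.
func oldKey(ifindex, family int) int { return ifindex*2 + family }

// neighCacheKey is the fixed composite key: distinct field tuples can't
// collide. Name is illustrative; the commit says "struct key {ifindex, family}".
type neighCacheKey struct {
	ifindex int
	family  int
}

func main() {
	fmt.Println(oldKey(1, familyV6) == oldKey(5, familyV4)) // true: both 12

	cache := map[neighCacheKey][]string{}
	cache[neighCacheKey{1, familyV6}] = []string{"fe80::1"}
	cache[neighCacheKey{5, familyV4}] = []string{"10.0.0.1"}
	fmt.Println(len(cache)) // 2: separate entries under the struct key
}
```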

Tests pass (66+ subtests total), build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* #1197: 7 Copilot v2 review fixes (duplicate probes + criticality + docs)

Copilot re-reviewed at e908b96 and found 7 new substantive issues:

1-3. forceProbeNeighbors and resolveNeighbors walked overlapping
   target sets at all 3 call sites (startup, 15s tick, VRRP
   takeover). Result: every cold-start / steady-state probe
   doubled, exactly when minimizing ARP burst matters most.

   Fix: removed source 2 (configured next-hops + NAT + address-
   book) from collectMonitoredNeighbors. forceProbeNeighbors now
   walks only snapshot-keys + fabric peers — non-overlapping with
   resolveNeighbors's configured-target set. Cold-start startup
   pass uses resolveNeighbors only (snapshot empty anyway).

4. Criticality was a single boolean conflating address-book
   hosts with real next-hops/fabric peers. Under cap pressure
   (256), large address-book could crowd out gateways.
   Fix: 3-level criticality scale (Normal=0, NextHop=1,
   Fabric=2). Sort within tier honors finer-grained priority.
   (Address-book is removed from force-probe set entirely above,
   so this becomes future-proofing for any added critical-but-
   non-fabric targets.)

5. Source 1 of collectMonitoredNeighbors used SnapshotNeighbors
   (raw lastSnapshot.Neighbors) which includes filtered-out
   entries the dataplane never received. Fix: new
   Manager.ForEachSnapshotNeighbor walks neighborIndex
   (publishable-only); listener uses that.

6. plan.md claimed configurable knob 'neighbor-probe-max-targets';
   only env var BPFRX_NEIGHBOR_PROBE_MAX_TARGETS exists. Plan
   doc updated to match shipped behavior.

7. plan.md claimed TTL-based snapshot expiry shipped; it didn't.
   Plan doc clarified: kernel events + 60s safety reconciliation
   are sufficient; per-entry TTL is deferred follow-up.
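
The fix-4 tiered cap can be sketched in Go as a sort-then-truncate (the
target struct, its prio field, and the helper name are assumptions; the
tier values Normal=0, NextHop=1, Fabric=2 come from the commit):

```go
package main

import (
	"fmt"
	"sort"
)

// Probe criticality tiers per the commit: Normal=0, NextHop=1, Fabric=2.
const (
	tierNormal  = 0
	tierNextHop = 1
	tierFabric  = 2
)

// probeTarget is a stand-in; prio models the finer-grained priority
// honored within a tier (field names are illustrative).
type probeTarget struct {
	ip   string
	tier int
	prio int
}

// capTargets sorts highest tier first (then prio) and truncates to max,
// so fabric peers and next-hops survive cap pressure before normal hosts.
func capTargets(ts []probeTarget, max int) []probeTarget {
	sort.SliceStable(ts, func(i, j int) bool {
		if ts[i].tier != ts[j].tier {
			return ts[i].tier > ts[j].tier
		}
		return ts[i].prio > ts[j].prio
	})
	if len(ts) > max {
		ts = ts[:max]
	}
	return ts
}

func main() {
	got := capTargets([]probeTarget{
		{"10.0.0.9", tierNormal, 0},
		{"10.0.0.1", tierFabric, 0},
		{"10.0.0.5", tierNextHop, 0},
	}, 2)
	for _, t := range got {
		fmt.Println(t.ip) // fabric peer, then next-hop; normal host dropped
	}
}
```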

Tests: TestProbeTierClassification updated to use 3-level scale.
All tests pass; build clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request May 6, 2026
… (Codex round-4 #1) + scrub stale 'leak' / 'the leaked one' framing (Codex round-4 #2) + soften cache-line wording to acknowledge unspecified layout (Codex round-4 #3)
psaab added a commit that referenced this pull request May 6, 2026
…1202)

* #1187 plan v1: extend BatchCounters to cover disposition + screen_drops + tx_errors

* #1187 plan v2: scope narrowed (drop tx_errors, defer 3 small counters), reframe primary value as DDoS resilience per Gemini Pro 3.1, fix forward_candidate_packets leak per Codex finding #5, route through TelemetryContext per Codex finding #4

* #1187 plan v3: screen_drops MANDATORY batching (no fallback) per both reviewers; preserve exception_packets current semantics; correct cache-line wording

* #1187 plan v4: round-3 fixes — TelemetryContext (not BatchCounters), delete stale open-questions, correct cache-line wording, reframe forward_candidate_packets at disposition.rs:161 as cold (not hot-path) leak per Codex finding

* #1187 plan v5: name record_forwarding_disposition_hot/cold explicitly (Codex round-4 #1) + scrub stale 'leak' / 'the leaked one' framing (Codex round-4 #2) + soften cache-line wording to acknowledge unspecified layout (Codex round-4 #3)

* #1187 plan v6: scrub final hot-path leak residue (§4.4 heading + framing); reword section 4.4 from Option A/B to 'out of scope'; update §10 question 3 to reflect softened cache-line wording

* #1187 plan v7: scrub stale Option A at lines 95+194; correct §4.4 framing — coordinator/inject.rs is RPC-driven cold path, not 1Hz status poll

* #1187 Phase 1: extend BatchCounters with disposition + screen_drops counters

Per docs/pr/1187-telemetry-double-buffer/plan.md v7 (PLAN-READY).
Codex rounds 1-7 + Gemini Pro 3.1 rounds 1-2.

Adds 8 new u64 fields to BatchCounters (afxdp/mod.rs:308-336):
screen_drops, policy_denied_packets, route_miss_packets,
neighbor_miss_packets, discard_route_packets, next_table_packets,
local_delivery_packets, exception_packets. Extends flush() with
8 new conditional flush blocks.

Introduces DispositionCounters<'a> enum in afxdp/disposition.rs
with Hot(&mut BatchCounters) / Cold(&BindingLiveState) variants and
per-counter bump methods. Refactors record_disposition and
record_forwarding_disposition to take DispositionCounters instead
of `live: &BindingLiveState`. Hot callers in
poll_descriptor.rs:2071 and 2096 pass
DispositionCounters::Hot(telemetry.counters); cold callers in
coordinator/inject.rs:43 and 59 pass DispositionCounters::Cold(live).

stage_screen_check at poll_stages.rs:227 now takes
counters: &mut BatchCounters in place of binding_live (only used
for screen_drops). Drop verdict bumps counters.screen_drops += 1
through the batch — DDoS resilience: SYN flood is the primary
trigger, unbatched atomics here would cause MESI ping-pong with
the coordinator's status reads under volumetric attack.
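
The batch-then-flush shape can be sketched in Go (this is an
illustrative analogue of the Rust BatchCounters pattern, not the real
code; field names are assumptions):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// liveCounters stands in for the shared, coordinator-visible atomics.
type liveCounters struct {
	screenDrops atomic.Uint64
}

// batchCounters accumulates plain (non-atomic) per-batch counts; the hot
// path bumps these, and flush() touches the shared atomic at most once
// per batch, avoiding per-packet cache-line ping-pong with status readers.
type batchCounters struct {
	screenDrops uint64
}

func (b *batchCounters) flush(live *liveCounters) {
	if b.screenDrops != 0 { // conditional flush: no atomic op for idle counters
		live.screenDrops.Add(b.screenDrops)
		b.screenDrops = 0
	}
}

func main() {
	var live liveCounters
	var batch batchCounters
	for i := 0; i < 64; i++ { // e.g. a SYN-flood batch, every packet dropped
		batch.screenDrops++
	}
	batch.flush(&live)
	fmt.Println(live.screenDrops.Load()) // 64
}
```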

Three counters explicitly DEFERRED to a follow-up PR (plan §2):
config_gen_mismatches, fib_gen_mismatches, unsupported_packets —
they fire only during reconcile windows AND are gated by
record_exception()'s mutex/timestamp/string work, so the per-atomic
saving is dominated by other costs. They keep direct fetch_add to
live.

forward_candidate_packets at disposition.rs:161 (now in cold-path
ForwardCandidate arm) is reachable only from coordinator/inject.rs
RPC path, not the worker per-packet hot path. Hot path already
routes through telemetry.counters.forward_candidate_packets at
poll_descriptor.rs:213,1706 (existing). PR keeps the cold-path
direct write — bumping a counter on the coordinator-inject path
that fires at RPC rate has no MESI-thrash concern.

tx_errors batching DROPPED from this PR (Codex round-1 finding):
real fan-out spans 6+ sites including umem/mod.rs, tx/drain.rs,
tx/transmit.rs, cos/queue_service/mod.rs, worker/cos.rs. Also,
BatchCounters is created AFTER the first drain_pending_tx() call
in worker/lifecycle.rs:59, so a TX-only error during early drain
would be silently lost. Needs separate design.

cargo build --release: clean
cargo test --release: 974/974 pass

* #1187 review: add #[inline] to DispositionCounters::bump_*, privatize 8 new BatchCounters fields, fix plan version comment

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/312e8185-0492-4934-ae75-9da522b56a35

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

* #1187 review: fix stale BatchCounters comment, update plan §4.2+§4.5 to match impl, add 3 hot/cold regression tests

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/232c0153-2abb-447d-86ad-668df14a1d44

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
psaab added a commit that referenced this pull request May 6, 2026
Both reviewers converged on the same core finding: the v1 adapter trait
was underspecified for the divergences among the 4 current variants. Concrete
fixes:

1. Trait owns descriptor insertion + stamping (Gemini Trait Boundary
   Flaw). Skeleton can't iterate A::Scratch to map XdpDesc or extract
   offsets for stamp_submits — the adapter must.

   New methods:
     fn insert_descriptors(scratch, writer) -> u32
     fn stamp_accepted(scratch, inserted, ts_submit, tx_submit_ns)

2. Rollback unified into single cancel_submission method (Gemini #2)
   — local flow-fair restores TX frames AND queue items;
   prepared flow-fair pushes items back without touching TX frames;
   FIFO drops/releases. cancel_submission takes Option<&mut Queue>
   + free_tx_frames + scratch.

3. drain_to_scratch signature widened with required resources for
   prepared variants (pending_fill_frames, in_flight_prepared_recycles)
   and free_tx_frames for local variants — Codex finding #1.

4. settle_submission stays for queue accounting + partial-rollback
   of inserted prefix; stamping moves OUT of settle into stamp_accepted
   (Codex finding #3).

5. Plan slop fixed (Codex): VecDeque<u64> not Vec<TxFrame>;
   pseudocode return is sent_packets > 0 not !sent_packets == 0.

6. Generic monomorphization confirmed acceptable by both — no dyn
   dispatch risk. Hot-path inlining discipline: zero-sized adapters,
   #[inline] tiny methods, verify with size/cargo bloat if needed.

7. Order serialized: implement #1207 ONLY after #1206 lands on
   master, then rebase. Concurrent work creates avoidable conflicts
   in the same hot-path files.

Reviewers (round-1):
  Codex task-mou6u87v-3y6ih0 PLAN-NEEDS-MAJOR
  Gemini Pro 3 task-mou6vz0k-2bk35b PLAN-NEEDS-MAJOR
psaab added a commit that referenced this pull request May 7, 2026
…-KILL

Codex (task-moupsqds) PLAN-NEEDS-MAJOR — not KILL on race-safety; v1
issues fixable. v2 applies all 5 fixes:

#1. AFD accounting moves from pop-time to settle-time.
    Pop-time fetch_add over-states served work because TX-failure
    restoration pushes items back. Same pitfall that killed #1215 v1.
    v2 hooks into apply_cos_send_result family in tx_completion.rs
    and increments per-bucket served_bytes only for the inserted
    prefix. No fetch_sub anywhere; rebuild bucket hash from
    settle-time item (cheap, batch-amortized).

#2. Published summary carries window DELTA, not cumulative.
    v1 compared cumulative aggregate_served to one-window fair_share
    -> would mark/drop active buckets forever. v2 adds
    window_served_delta + fair_share_window_bytes (matching units);
    snapshot owner stores prior_aggregate in its own state and
    publishes the delta.

#3. Batch-hoist ArcSwap.load AND ECN-write costs.
    v1 budget claim 3-5ns for ArcSwap.load was wrong; real cost ~30ns
    per load (arc-swap v1.8.2 docs/source). v2 hoists ArcSwap.load
    to drain-batch entry (~30ns / TX_BATCH_SIZE=64 = ~0.5ns
    amortized per pop). Per-pop cost: 2 array reads + 1 conditional
    ECN write = ~10ns marked, ~4ns unmarked.

#4. ECN write fast-path helper.
    v1 said ~5ns; ecn.rs:98 actually parses IPv4 + updates checksum.
    v2 adds cos_item_mark_ce_fast that uses cached parsed metadata
    from xdp_main / forward.rs instead of re-parsing. ~10ns marked
    write vs ~50ns full parse.

#5. Smoothed CSFQ drop probability (vs Gemini's drop-cliff critique).
    v1 binary 'if delta > 2*fair { drop }' creates bursty-drop
    pathology. v2 uses CSFQ formula:
      p_drop = max(0, (delta - fair) / delta)
    scaled to 0-255 for cheap random-byte comparison. Matches the
    classic CSFQ paper.
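
The fix-5 probability maps to a few lines of integer math; a Go sketch
(the 0-255 scaling detail and function name are illustrative):

```go
package main

import "fmt"

// csfqDropProb8 returns the smoothed CSFQ drop probability
// p = max(0, (delta - fair) / delta), scaled to 0..255 so the hot path
// can compare it against a single random byte.
func csfqDropProb8(delta, fair uint64) uint8 {
	if delta == 0 || delta <= fair {
		return 0 // at or under fair share: never drop
	}
	return uint8((delta - fair) * 255 / delta)
}

func main() {
	fmt.Println(csfqDropProb8(100, 200)) // 0   (under fair share)
	fmt.Println(csfqDropProb8(200, 100)) // 127 (p = 0.5)
	fmt.Println(csfqDropProb8(1000, 10)) // 252 (heavy over-share)
	// hot-path use: drop (or ECN-mark) when randomByte < csfqDropProb8(...)
}
```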

Gemini (task-mouptde4) PLAN-KILL: cache-line bouncing concern still
on the table for ArcSwap.load (Gemini disputes the hazard-pointer
fast path is contention-free under 12M loads/sec). v2 mitigates by
batch-hoisting (#3) but Gemini may still PLAN-KILL on its position
that 'PR #1217 should be the steady-state product.' That's
acceptable: v2 keeps the research-only framing; #1211 does NOT
block #1217 shipping.

ECN deployment reality (Gemini B): also acceptable — even if <50%
of TCP receivers honor CE marks, the CSFQ probabilistic-drop
fallback (#5) still applies pressure to ECN-stripped flows.

Submit-failure pitfall (Gemini E + Codex #1) FIXED by #1 above —
not a residual concern in v2.
psaab added a commit that referenced this pull request May 7, 2026
…ILL #1)

Round-2 verdicts:
- Codex (task-mousvzy4): PLAN-NEEDS-MAJOR — 4 majors (canonical text
  contradicts v2 fixes; bpf_random_u32 won't compile; ECN metadata
  premise wrong; settle hook location wrong)
- Gemini (task-mousw4d4): PLAN-KILL — RFC 3168 violation (binary
  100% ECN mark vs probabilistic drop) is a fairness cliff that
  starves ECN flows; document inconsistency between v2 fixes section
  and §3.2 hot path; complexity-vs-value (with #1217 as alternative)

v3 fixes:

#6 RFC 3168-compliant ECN marking (Gemini #1 FATAL):
   v2 had `if delta > fair && ECT { mark }` — unconditional 100% mark.
   RFC 3168 mandates same probability for mark vs drop. v3 unifies
   under the smoothed CSFQ probability: same p applies to ECT (marks)
   and non-ECT (drops).

#7 Per-worker non-atomic PRNG (Codex #2):
   bpf_random_u32() doesn't exist in Rust userspace. v3 uses splitmix64
   with per-worker state seeded from existing cos_flow_hash_seed_from_os
   path; no rand crate dep needed.

#8 ECN-mark moves to pre-submit (Codex #3 + design reorg):
   TxRequest doesn't carry L3 offsets, so the 'fast-path' claim was
   wrong. v3 reorganizes: per-pop reads delta/fair (batch-hoisted),
   pre-submit in tx/dispatch.rs decides mark/drop/pass with cached
   L3 offsets, settle increments served_bytes for inserted prefix
   (Fix #1 unchanged).

#9 Correct settle hook reference (Codex #4):
   settle_exact_*_scratch_submission_flow_fair (queue_service/mod.rs
   ~740), not apply_cos_send_result.

§3.2 reconciliation (both reviewers' document inconsistency finding):
   Added explicit NOTE pointing readers to the v2 fixes section as
   the canonical design, and listing each §3.2 element that's
   superseded. The original §3.2 text retained for traceability of
   design evolution but marked clearly as not-the-implementation.

Gemini's PLAN-KILL on complexity-vs-value remains a legitimate
position. v3 doesn't dispute that view; it ensures the technical
flaws Gemini and Codex flagged are fixed so the value/complexity
debate is the only remaining issue. Per user directive: '#1211
stays research-only; PR #1217 ships even if AFD dies'. PLAN-KILL
on complexity-vs-value grounds is acceptable.
psaab added a commit that referenced this pull request May 7, 2026
…xes)

Round-2 verdicts:
- Gemini (task-mov1moka): PLAN-READY
- Codex (task-mov1mk84): PLAN-NEEDS-MAJOR with 2 NEW blockers

Codex new blocker #1: FlowCacheEntry has no last_used_ns field.
v2's '100ms scan flow_cache for fresh entries' was unimplementable.

v3 Fix #1: add last_used_epoch: u16 to FlowCacheEntry. Owner-only
single u16 store on every lookup() hit (~1-2 ns/lookup). Per-binding
current_epoch atomic incremented at the worker's existing 100ms
tick. Active-flow count = entries with epoch within last 10 ticks
(1s window). Tick-side O(N) scan over 8192 entries every 100ms =
80K loads/sec/worker = ~10us/sec. Wraparound-safe via wrapping_sub
on u16 (256 epochs = 25.6s headroom vs 1s window).
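
The wraparound-safe freshness test reduces to unsigned subtraction; a Go
sketch using uint16 (which wraps like Rust's wrapping_sub; function name
and window value are illustrative):

```go
package main

import "fmt"

// flowFresh: an entry is active if its last_used_epoch is within
// `window` ticks of current_epoch. uint16 subtraction wraps, so the
// comparison stays correct across epoch wraparound.
func flowFresh(current, lastUsed, window uint16) bool {
	return current-lastUsed <= window
}

func main() {
	fmt.Println(flowFresh(50, 45, 10))   // true: 5 ticks old
	fmt.Println(flowFresh(50, 30, 10))   // false: 20 ticks old
	fmt.Println(flowFresh(3, 65533, 10)) // true: wrapped, 6 ticks old
}
```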

Codex new blocker #2: per-queue + ≥1% throughput qualification not
joinable. iperf3 -P N uses single destination port; RSS decides
binding; no way to map 'stream X had ≥1% throughput' to 'stream X
landed on binding Y'.

v3 Fix #2: distinct source ports per iperf3 stream
(--cport CPORT_BASE+i). Each stream has unique 5-tuple. Harness
reads kernel RSS config (ethtool -x for indirection table; ethtool
-u for n-tuple rules; RSS key from ethtool) and computes Toeplitz
hash for each stream's 5-tuple to deterministically map streams to
RX queues. RX queue → binding via existing BindingStatus.bound_rx_queue
field. ≥1% qualification applied at harness using mapped streams.
Self-test on harness startup: single-stream prediction must match
observed binding TX bytes; fail-fast on mismatch.

Hot-path cost: single u16 store per lookup. Order of magnitude
smaller than v1's HashMap-insert proposal (60 ms/sec/core).

Production observability still deferred to #1220.
psaab added a commit that referenced this pull request May 7, 2026
Codex round-3 (task-mov1wpqo) PLAN-NEEDS-MAJOR with 4 findings:
1. v3 harness sketch ran iperf3 -P N without --cport, contradicting
   the per-stream --cport claim earlier
2. RSS tuple wrong for -R workload (data direction reversed; on-wire
   RX tuple at measured-direction interface, not control tuple;
   plus possible NAT translation)
3. bound_rx_queue field doesn't exist in BindingStatus (real fields:
   QueueID/WorkerID/Interface/Ifindex). No public proto for binding
   status — must specify control-socket vs Manager.Status() vs new gRPC
4. Stale impl details on last_used_epoch

v4 architectural simplification: drop RSS-join entirely. After
verifying the actual codebase via pkg/dataplane/userspace/protocol.go:615:

- Data plane has flow_cache + epoch counter (Fix #v3-1 unchanged)
- 100ms-tick scan publishes per-binding active_flow_count to a
  Prometheus metric: xpf_userspace_binding_active_flow_count{binding_slot=N}
- Harness scrapes /metrics every 1s during the iperf3 run
- {a_i} = per-binding active_flow_count read directly. No RSS hash
  computation, no indirection-table read, no direction-reversal
  complications for -R, no kernel hash key extraction.

This addresses ALL 4 Codex findings:
- #1: dropped per-stream --cport. Single iperf3 -P N -J. Per-stream
  join via start.connected[].socket <-> intervals[].streams[].socket
  (the canonical iperf3 JSON join Codex suggested)
- #2: irrelevant; we don't compute RSS hashes
- #3: harness reads via existing /metrics endpoint. ONE new metric
  added (snapshot, not rolling-window — much narrower than v1's
  production exports deferred to #1220)
- #4: last_used_epoch impl details cleaned up

Trade-offs explicitly acknowledged + documented:
- Per-binding count includes all flows, not just per-CoS-queue.
  Fine for iperf3-only test workloads. Production observability with
  per-queue qualification deferred to #1220.
- No ≥1% throughput qualification at data plane. Conservative effect
  (slightly higher Cstruct than strict contract reading; gate slightly
  more forgiving). Starved-flow gate (Gate 1) catches starved flows
  exactly via iperf3 per-stream throughputs at harness layer.

Net architecture: data plane tracks epoch (1-2 ns/lookup), publishes
binding count on tick (10 us/sec/worker), exposes via /metrics.
Harness uses 1 new Rust binary, 1 new Prometheus metric, 1 new
flow_cache field. Total ~300 LOC vs v1's ~600 LOC.
psaab added a commit that referenced this pull request May 7, 2026
…1 finding #2)

Pins the Prometheus emitter's wire shape per Codex round-1 review:
3-binding fixture → 3 metrics, slot=0 ActiveFlowCount=5 → gauge value 5
with correct labels {binding_slot=0, queue_id=0, worker_id=0,
iface=ge-0-0-1}. Mirrors the existing emitWorkerRuntime test pattern.

Atop Copilot SWE Agent's autonomous review-fix commits (87973ff +
1526ff5) which addressed: epoch-0 sentinel, hot-path double-borrow,
slow-path inline, parse_args panic, BUG message clarity. My local
161a3435 'round-1 review fixes' commit got 2-way conflict on
flow_cache.rs + fairness-eval.rs vs the agent's edits — agent's
versions were equivalent on the substantive fixes (epoch-0, panic
exits, etc) plus added the double-borrow + slow-path-inline fixes
I missed. Reset local to the agent's tip and re-applied only the
Go metric test (which the agent didn't add).
psaab added a commit that referenced this pull request May 7, 2026
….md cleanup

Codex round-2 (task-mov55tld) MERGE-NEEDS-MAJOR. Round-1 finding #1
(start.connected[] seeding) regressed when I reset to Copilot agent's
87973ff during the rebase conflict — the agent's version of
fairness-eval.rs didn't include the seeding, and my reapplied seeding
got dropped. Re-applied here:

1. Seed per_stream_buckets from start.connected[].socket so streams
   that contributed zero bytes for the entire steady-state window
   appear in the map with empty bucket vec → starved_flow_count
   correctly increments.

2. n_iperf_streams derives from start.connected[].len() when
   non-empty, falling back to test_start.num_streams. This makes the
   harness fail-fast guard actually catch missing-from-intervals
   streams.

3. plan.md cleanup (Codex round-2 finding #2):
   - 8192 entries → 4096 entries (matches FLOW_CACHE_SIZE constant)
   - 100ms tick → ~65ms gate (call-rate dependent)
   - 256 epochs → fixed already in code, plan was already correct
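
The seeding fix (item 1) can be sketched in Go, independent of the Rust
binary's actual types (function and parameter names are illustrative;
the behavior matches the commit's description):

```go
package main

import "fmt"

// starvedFlowCount seeds per-stream buckets from start.connected[].socket,
// so a stream that never appears in intervals[] still exists in the map
// with an empty bucket list and is counted as starved.
func starvedFlowCount(connectedSockets []int, intervalsBySocket map[int][]uint64) int {
	buckets := map[int][]uint64{}
	for _, s := range connectedSockets {
		buckets[s] = nil // seed: silent streams must still appear
	}
	for s, b := range intervalsBySocket {
		buckets[s] = append(buckets[s], b...)
	}
	starved := 0
	for _, b := range buckets {
		total := uint64(0)
		for _, v := range b {
			total += v
		}
		if total == 0 {
			starved++
		}
	}
	return starved
}

func main() {
	// Codex's round-2 reproducer: socket 5 active, socket 7 silent.
	got := starvedFlowCount([]int{5, 7}, map[int][]uint64{5: {1000, 1200}})
	fmt.Println(got) // 1
}
```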

Test verification:
- 2-stream JSON with socket 5 active + socket 7 silent →
  starved_flow_count: 1 (was 0 before fix), verdict: FAIL with
  Gate 1 diagnostic. Codex's round-2 reproducer scenario.
- All 22 fairness pure-fn tests + 22 flow_cache tests + 8 drift CI pass.
- cargo build --release clean (1 warning: dead-code on local_port,
  marked #[allow] earlier).
psaab added a commit that referenced this pull request May 7, 2026
…ount + Prometheus + fairness-eval (#1220)

* #1219 plan v1: fairness harness — Cstruct + distinct-flow-count + Prometheus

Implementation plan for the harness work mandated by the
fairness-regimes contract (PR #1217 e1ec6b9). Three deliverables:

1. Rust pure-fn module (userspace-dp/src/fairness/mod.rs):
   - compute_cstruct(distribution: &[u32]) -> f64
   - compute_observed_cov(per_flow: &[u64]) -> f64
   - starved_flow_count(per_flow_buckets: &[Vec<u64>]) -> u32
   - is_saturated(buckets: &[u64], cap: u64) -> bool
   Unit-tested against the contract's 5-row worked-example table
   (0,0.47,0.20,0.58,0,0).

2. Per-binding distinct-flow-count signal (DistinctFlowTracker
   with bounded LRU set, single-writer record() on flow-cache hit
   path, periodic age_out() on worker tick, atomic snapshot read).
   MAX_TRACKED_FLOWS=1024, FLOW_AGE_OUT_NS=1s.

3. Prometheus exports (4 metrics: cstruct, observed_cov,
   starved_flows, saturated). Re-implemented in Go from same spec
   with shared test-vector CI gate to prevent Rust/Go drift.

Test harness fairness-harness.sh wraps iperf3 -> per-flow buckets
-> Go helper -> contract gates -> PASS/FAIL.

Smoke fixture (deterministic RSS-skew per Codex Path 0) is a
separate follow-up issue; v1 uses iperf3's natural RSS.

7 open questions for adversarial review including the distinct-
flow cap, hash collisions, Rust/Go drift strategy, HA failover
timing of the tracker, steady-state window detection, Prometheus
cadence, harness self-test scope.

PLAN-KILL is acceptable if the harness logic complexity outweighs
the operational value (the immediate goal: answer 'is 47% iperf3
CoV at structural ceiling or scheduler bug').

* v2: massively reduced scope addressing 4 Codex blockers + 3 Gemini fatals

Round-1 verdicts both PLAN-NEEDS-MAJOR with convergent fatals:
- Codex: DistinctFlowTracker locality wrong (Arc + &mut self
  conflict); 6% CPU on hot path; semantic drift (per-binding vs
  per-queue); Go can't compute observed_cov from current status
- Gemini: HashMap on hot path = FATAL; 1024 cap = FATAL high-fan-in
  false-pass; Rust+Go formula drift = FAIL

v2 fundamental rescope:

DROPPED FROM v1:
- Production Prometheus exports (xpf_fairness_*) -> deferred to #1220
- DistinctFlowTracker HashMap entirely
- Per-packet record() on flow-cache lookup
- Rust+Go re-implementation
- Continuous starved-flow counter

KEPT FROM v1:
- Rust pure-fn module (Codex independently verified Cstruct math correct)
- gRPC per-binding active_flow_count field

NEW IN v2:
- Active flow count comes from EXISTING flow_cache via O(N) scan at
  worker's existing 100ms tick (Gemini's remediation). Cost ~10us/sec/
  worker on a tick path that has spare time. NO per-packet write.
- Per-queue active flows + ≥1% throughput qualification: computed
  AT THE HARNESS from iperf3 JSON output, not in the data plane
  (addresses Codex blocker #3).
- Single source of truth = Rust binary 'fairness-eval' that reads
  iperf3-out.json + binding-flows.jsonl and emits {Cstruct,
  observed_cov, regime, verdict}. No Go side at all in v2.
  (Addresses Gemini D / Codex #4)

Net effect: v2 ships ~150 LOC pure-fns + ~30 LOC flow_cache scan +
~20 LOC gRPC plumbing + ~120 LOC eval binary + ~80 LOC bash harness =
total ~400 LOC vs v1's ~600 LOC, with simpler hot path (no new
HashMap), no Prometheus changes, no Rust+Go drift.

Production observability (Prometheus + rolling windows) becomes
issue #1220 to file after v2 lands.

Operational goal preserved: answer 'is 47% iperf3 CoV at structural
ceiling or scheduler bug?' This is the deliverable test in §7.

* v3: add flow_cache epoch + RSS-join harness mapping (Codex round-2 fixes)

Round-2 verdicts:
- Gemini (task-mov1moka): PLAN-READY
- Codex (task-mov1mk84): PLAN-NEEDS-MAJOR with 2 NEW blockers

Codex new blocker #1: FlowCacheEntry has no last_used_ns field.
v2's '100ms scan flow_cache for fresh entries' was unimplementable.

v3 Fix #1: add last_used_epoch: u16 to FlowCacheEntry. Owner-only
single u16 store on every lookup() hit (~1-2 ns/lookup). Per-binding
current_epoch atomic incremented at the worker's existing 100ms
tick. Active-flow count = entries with epoch within last 10 ticks
(1s window). Tick-side O(N) scan over 8192 entries every 100ms =
80K loads/sec/worker = ~10us/sec. Wraparound-safe via wrapping_sub
on u16 (256 epochs = 25.6s headroom vs 1s window).

Codex new blocker #2: per-queue + ≥1% throughput qualification not
joinable. iperf3 -P N uses single destination port; RSS decides
binding; no way to map 'stream X had ≥1% throughput' to 'stream X
landed on binding Y'.

v3 Fix #2: distinct source ports per iperf3 stream
(--cport CPORT_BASE+i). Each stream has unique 5-tuple. Harness
reads kernel RSS config (ethtool -x for indirection table; ethtool
-u for n-tuple rules; RSS key from ethtool) and computes Toeplitz
hash for each stream's 5-tuple to deterministically map streams to
RX queues. RX queue → binding via existing BindingStatus.bound_rx_queue
field. ≥1% qualification applied at harness using mapped streams.
Self-test on harness startup: single-stream prediction must match
observed binding TX bytes; fail-fast on mismatch.

Hot-path cost: single u16 store per lookup. Order of magnitude
smaller than v1's HashMap-insert proposal (60 ms/sec/core).

Production observability still deferred to #1220.

* v4: drop RSS-join entirely; harness reads {a_i} via Prometheus scrape

Codex round-3 (task-mov1wpqo) PLAN-NEEDS-MAJOR with 4 findings:
1. v3 harness sketch ran iperf3 -P N without --cport, contradicting
   the per-stream --cport claim earlier
2. RSS tuple wrong for -R workload (data direction reversed; on-wire
   RX tuple at measured-direction interface, not control tuple;
   plus possible NAT translation)
3. bound_rx_queue field doesn't exist in BindingStatus (real fields:
   QueueID/WorkerID/Interface/Ifindex). No public proto for binding
   status — must specify control-socket vs Manager.Status() vs new gRPC
4. Stale impl details on last_used_epoch

v4 architectural simplification: drop RSS-join entirely. After
verifying the actual codebase via pkg/dataplane/userspace/protocol.go:615:

- Data plane has flow_cache + epoch counter (Fix #v3-1 unchanged)
- 100ms-tick scan publishes per-binding active_flow_count to a
  Prometheus metric: xpf_userspace_binding_active_flow_count{binding_slot=N}
- Harness scrapes /metrics every 1s during the iperf3 run
- {a_i} = per-binding active_flow_count read directly. No RSS hash
  computation, no indirection-table read, no direction-reversal
  complications for -R, no kernel hash key extraction.

This addresses ALL 4 Codex findings:
- #1: dropped per-stream --cport. Single iperf3 -P N -J. Per-stream
  join via start.connected[].socket <-> intervals[].streams[].socket
  (the canonical iperf3 JSON join Codex suggested)
- #2: irrelevant; we don't compute RSS hashes
- #3: harness reads via existing /metrics endpoint. ONE new metric
  added (snapshot, not rolling-window — much narrower than v1's
  production exports deferred to #1220)
- #4: last_used_epoch impl details cleaned up

Trade-offs explicitly acknowledged + documented:
- Per-binding count includes all flows, not just per-CoS-queue.
  Fine for iperf3-only test workloads. Production observability with
  per-queue qualification deferred to #1220.
- No ≥1% throughput qualification at data plane. Conservative effect
  (slightly higher Cstruct than strict contract reading; gate slightly
  more forgiving). Starved-flow gate (Gate 1) catches starved flows
  exactly via iperf3 per-stream throughputs at harness layer.

Net architecture: data plane tracks epoch (1-2 ns/lookup), publishes
binding count on tick (10 us/sec/worker), exposes via /metrics.
Harness uses 1 new Rust binary, 1 new Prometheus metric, 1 new
flow_cache field. Total ~300 LOC vs v1's ~600 LOC.

* v5: address Codex round-4 (4 findings; task-mov2afuw)

1. Stale v3 RSS section deleted entirely (lines 340-403 in v4
   still had the 6-step ethtool/Toeplitz/RSS-join design that
   v4 was supposed to replace). v5 leaves only a one-line
   '§3.4.1 (DELETED) v3 RSS-join steps removed' marker.

2. Metric pipeline path explicit. v4 was internally inconsistent
   between 'harness polls gRPC' (line 175) and 'Prometheus
   unchanged' (line 487) and 'harness scrapes new Prometheus'
   (round-3 resolution). v5 spells out the full plumbing:
   Rust BindingLiveState.active_flow_count (AtomicU32, owner
   100ms tick) -> Rust snapshot BindingStatus.active_flow_count ->
   helper-process control-socket JSON -> Go BindingStatus
   ActiveFlowCount uint32 (pkg/dataplane/userspace/protocol.go:615)
   -> Prometheus emitter in pkg/api/metrics.go:424 with new
   metrics_test.go case.

3. Harness fail-fast guard for {a_i} correctness. Without ≥1%
   throughput qualification at the data plane, over-counting
   could move Cstruct either direction. v5 mandates: during
   steady-state, sum(per_binding_active_flow_count) must match
   non-starved iperf3 stream count within max(2, 0.10 * N)
   tolerance, else harness exits with diagnostic.

4. fairness-eval binary location corrected: was userspace-dp/bin
   (doesn't exist); v5 uses canonical Cargo src/bin/ subdirectory:
   userspace-dp/src/bin/fairness-eval.rs (auto-discovered, no
   [[bin]] entry needed in Cargo.toml).

Section 4 (Public API preservation) updated: Prometheus is no
longer 'unchanged' — adds 1 new metric
xpf_userspace_binding_active_flow_count{binding_slot=N} (snapshot,
not rolling window; rolling-window production exports remain
deferred to #1220).

* v6: purge gRPC/proto references; pin to helper-process JSON only

Codex round-5 (task-mov2ix1o): PLAN-NEEDS-MINOR with single
residual issue — §3.3 still framed active_flow_count as 'published
to gRPC' with a proto/xpf/v1/dataplane.proto example block, even
though v5's intended path is JSON over the helper-process control
socket. §4 'gRPC: 1 new field on per-binding status JSON' had the
same ambiguity.

v6 fixes:

§3.3 'Atomic gauge published to gRPC' renamed to 'Atomic gauge
published via existing JSON status path'. Replaced the proto
example with a Rust BindingStatus snapshot struct that's
serialized into the helper-process control-socket JSON and
decoded by Go BindingStatus (pkg/dataplane/userspace/protocol.go
:615). Explicit note: 'No public gRPC / proto change'.

§4 Public API: 'gRPC: 1 new field on per-binding status JSON'
replaced with 'gRPC / public proto: no change' + separate bullet
clarifying the new field is on the internal helper-process
status JSON, backward-compatible via json:omitempty.

All other Codex round-4 findings remain addressed. v6 is a
single-edit cleanup; expecting PLAN-READY.

* v7: purge 6 remaining stale gRPC refs (Codex round-6 minor)

Codex round-6 (task-mov2qck7) PLAN-NEEDS-MINOR. v6 fixed §3.3 + §4
correctly but 6 stale 'gRPC' references remained scattered:

- line 175-178: 'BindingStatus.active_flow_count copied to gRPC status'
  -> 'copied to the helper-process status JSON'
- line 240: snapshot doc-comment 'reads via the gRPC status surface'
  -> 'reads via the helper-process status JSON surface'
- line 524: test plan 'verify gRPC field is populated'
  -> 'verify the new active_flow_count field is populated in the
     helper-process status JSON (and surfaced through Manager.Status())'
- line 563: open question 'gRPC poll cadence'
  -> 'Prometheus scrape cadence'
- line 578: HA-failover detection 'gRPC reports role_change_at'
  -> 'status JSON reports role_change_at'
- line 583: verdict request 'gRPC field' -> 'helper-process status
  JSON field'

Remaining 'gRPC' mentions are all legitimate: header + the §3.3 +
§4 'No public gRPC / proto change' negations + line 23/48 'not via
a new gRPC RPC'.

Plan should now be internally consistent on the helper-process JSON
path. Expecting PLAN-READY.

* #1219 part 1/N: Rust fairness pure-fns module

Pure-fn module for the fairness contract gate computations per
docs/fairness-regimes.md (merged via PR #1217 e1ec6b9).

Functions:
- compute_cstruct(distribution: &[u32]) -> f64
  Population CoV across the per-flow share multiset
  {1/a_i : repeated a_i times for active workers}. Idle workers
  excluded.
- compute_observed_cov(per_flow_throughputs: &[u64]) -> f64
- starved_flow_count(per_flow_buckets: &[Vec<u64>]) -> u32
  Counts flows below 1% of mean per-cell throughput for the
  ENTIRE window (transient dips don't count).
- is_saturated(aggregate_buckets: &[u64], cap: u64) -> bool
  >= 95% of cap for >= 80% of buckets.

Tests pin all 5 worked-example Cstruct values from the contract:
- {2,2,2,2,2,2}: 0.00
- {1,1,2,2,3,3}: 0.47
- {0,2,2,2,3,3}: 0.20
- {1,3,0,0,0,0}: 0.58
- {6,0,0,0,0,6}: 0.00
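Since compute_cstruct is the central gate computation, a hedged standalone sketch reconstructed from the definition and worked examples above (not the shipped module; only the signature and the share-multiset definition come from this log):

```rust
// Population CoV across the per-flow share multiset
// {1/a_i repeated a_i times}, idle workers (a_i == 0) excluded.
// Reconstruction from the description above, not the shipped code.
fn compute_cstruct(distribution: &[u32]) -> f64 {
    let shares: Vec<f64> = distribution
        .iter()
        .filter(|&&a| a > 0)
        .flat_map(|&a| std::iter::repeat(1.0 / a as f64).take(a as usize))
        .collect();
    if shares.is_empty() {
        return 0.0;
    }
    let n = shares.len() as f64;
    let mean = shares.iter().sum::<f64>() / n;
    // Population (not sample) variance, per the contract's definition.
    let var = shares.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / n;
    var.sqrt() / mean
}

fn main() {
    // The five worked-example values pinned by the tests (2-decimal rounding).
    assert!((compute_cstruct(&[2, 2, 2, 2, 2, 2]) - 0.00).abs() < 0.005);
    assert!((compute_cstruct(&[1, 1, 2, 2, 3, 3]) - 0.47).abs() < 0.005);
    assert!((compute_cstruct(&[0, 2, 2, 2, 3, 3]) - 0.20).abs() < 0.005);
    assert!((compute_cstruct(&[1, 3, 0, 0, 0, 0]) - 0.58).abs() < 0.005);
    assert!((compute_cstruct(&[6, 0, 0, 0, 0, 6]) - 0.00).abs() < 0.005);
    println!("all cstruct worked examples match");
}
```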

22/22 tests pass. cargo build clean. Zero hot-path impact (the
module is invoked only by the harness binary, not by the data
plane).

Next parts: flow_cache last_used_epoch counter (part 2);
BindingLiveState + 100ms tick scan (part 3); status JSON
extension + Go decoder (part 4); Prometheus metric (part 5);
fairness-eval binary (part 6); harness script (part 7).

Plan PLAN-READY through 7 review rounds; Codex final round-7
(task-mov2x81x) no findings.

* #1219 part 2/N: flow_cache last_used_epoch + tick_advance_epoch + count_active_flows

Per plan §3.2 (Fix #v3-1): add a per-binding epoch counter to
FlowCache and a u16 last_used_epoch on FlowCacheEntry written
on every successful lookup() hit.

Per-hit cost: single u16 store on a struct already in cache
from the key check. ~1-2 ns/lookup vs the v1 HashMap-insert
proposal at ~30 ns.

Periodic count_active_flows scan iterates entries in O(N) =
8192 cells, comparing wrapping_sub(current_epoch, last_used_epoch)
< 10 (1s window at 100ms tick cadence). Wraparound-safe via u16
wrapping arithmetic; window << 25.6s wraparound period.
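The wraparound-safe check above can be sketched as a minimal standalone function (ACTIVE_WINDOW_EPOCHS matches the "< 10" window; the FlowCache / FlowCacheEntry plumbing around it is elided and illustrative):

```rust
// Minimal sketch of the wraparound-safe activity check described above.
const ACTIVE_WINDOW_EPOCHS: u16 = 10; // 1 s window at the 100 ms tick

fn is_active(current_epoch: u16, last_used_epoch: u16) -> bool {
    // u16 wrapping arithmetic keeps the comparison correct across the
    // epoch counter's wraparound, since the window is far smaller than
    // the full u16 period.
    current_epoch.wrapping_sub(last_used_epoch) < ACTIVE_WINDOW_EPOCHS
}

fn main() {
    assert!(is_active(100, 95));         // hit 5 ticks ago: still active
    assert!(!is_active(100, 80));        // hit 20 ticks ago: aged out
    assert!(is_active(3, u16::MAX - 2)); // counter wrapped 6 ticks ago: active
    println!("wraparound-safe window check ok");
}
```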

Worker tick will call tick_advance_epoch() at 100ms cadence
(part 3 will wire this into the per-binding live state +
status snapshot path).

Tests added (5 new):
- count_active_flows_starts_at_zero
- count_active_flows_excludes_never_touched_entries
- count_active_flows_marks_recently_hit
- count_active_flows_ages_out_after_window
- count_active_flows_handles_epoch_wraparound

cargo test 999/999 + 32 flow_cache (was 27). cargo build clean.
No existing tests break.

* #1219 part 3/N: wire active_flow_count into BindingLiveState + ~65ms tick

Hooks the flow_cache epoch counter (part 2) into the worker's
existing periodic tick at update_binding_debug_state (umem/mod.rs:986).

BindingLiveState gains:
- active_flow_count: AtomicU32 (initialized to 0)

update_binding_debug_state now:
- calls binding.flow.flow_cache.tick_advance_epoch() per tick
- stores binding.flow.flow_cache.count_active_flows() into
  binding.live.active_flow_count via Relaxed atomic store

Tick cadence note: the actual cadence is ~65ms (driven by the
0xFFFF debug-state counter at ~1M calls/sec / 65536 = ~15 Hz).
The plan's 100ms target is approximate; with ACTIVE_WINDOW_EPOCHS=10
the active-flow window is ~650ms, comfortably under the 1s target
and well clear of u16 wraparound (~16.6s).

cargo test 1004/1004 (was 999). cargo build clean. Hot-path
unaffected — no per-packet write change beyond part 2's u16 store.

* #1219 part 4/N: BindingStatus snapshot + JSON pipeline

Wires the BindingLiveState.active_flow_count atomic (part 3) through
the snapshot path so the helper-process control-socket JSON carries
the field for the Go decoder + Prometheus emitter (part 5).

Changes:
- BindingDebugSnapshot (worker/mod.rs:1778): add active_flow_count: u32
- BindingStatus (protocol.rs:1149): add #[serde] active_flow_count: u32
- BindingCountersSnapshot (protocol.rs:1502): add active_flow_count: u32
  + propagate via b.active_flow_count in From impl (protocol.rs:1656)
- umem/mod.rs:672 snapshot construction: load active_flow_count.load(Relaxed)
- main_tests.rs: 3 BindingCountersSnapshot test sites add field

cargo test 1004/1004 + 8/8 drift CI clean.

* #1219 part 5/N: Go BindingStatus.ActiveFlowCount + Prometheus emitter

Plumbs the helper's active_flow_count snapshot through the Go
manager and emits it as a new Prometheus gauge.

Changes:
- pkg/dataplane/userspace/protocol.go: add ActiveFlowCount uint32
  on BindingStatus + on the lean per-binding snapshot struct
  (`active_flow_count` JSON, omitempty for forward-compat)
- pkg/api/metrics.go:
  - new xpfCollector.bindingActiveFlowCount Desc
  - new prometheus.NewDesc("xpf_userspace_binding_active_flow_count",
    labels: binding_slot, queue_id, worker_id, iface)
  - emitBindingActiveFlowCount() walks status.Bindings, emits
    one GaugeValue per binding
  - hooked into collectUserspaceStatus()
  - registered in Describe()

go build ./... clean. go test ./pkg/api/ ./pkg/dataplane/ clean.

Net Rust → Go → Prometheus pipeline now complete:
flow_cache.last_used_epoch (per-hit u16 store) → BindingLiveState
.active_flow_count (~65ms tick scan) → BindingDebugSnapshot
→ helper-process status JSON → Go BindingStatus.ActiveFlowCount
→ Prometheus xpf_userspace_binding_active_flow_count gauge.

Per the plan: NO public gRPC change. The new field is on the
internal helper-process control-socket JSON only.

* #1219 part 6/N: fairness-eval Rust binary

Test harness binary at userspace-dp/src/bin/fairness-eval.rs (Cargo's
auto-discovered src/bin/ subdirectory; no [[bin]] entry needed).

Reads:
- iperf3 -P N -J --forceflush JSON output
- binding-flows.tsv (timestamp, binding_slot, count) — produced by
  the harness scraping /metrics for xpf_userspace_binding_active_flow_count

Computes per docs/fairness-regimes.md:
- {a_i}: median per-binding count over the steady-state window
- Cstruct: structural CoV ceiling from {a_i}
- observed_CoV: sample CoV across per-stream window-mean throughputs
- starved_flow_count: streams below 1% of mean for entire window
- saturated: aggregate vs (Na/Nv) × shaper_rate
- gap = observed - Cstruct; PASS iff gap <= epsilon (0.05) AND
  starved_flow_count == 0 AND harness fail-fast guard holds

Harness fail-fast guard (Codex round-4 finding #3):
sum(a_i) ≈ non-starved iperf stream count within max(2, 10% × N).
Disagreement flags background-flow pollution / RSS unexpected
behavior; harness should not report a verdict from inconsistent
inputs.
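The guard's tolerance formula can be sketched as follows (illustrative names; this reflects the formula as stated in this commit, before later fixes adjusted the expected sum for bidirectional flow_cache entries):

```rust
// Sketch of the harness fail-fast guard described above: sum of the
// scraped {a_i} must agree with the non-starved iperf3 stream count
// within max(2, 10% of N), else no verdict is reported.
fn a_i_sum_check_ok(a_i_sum: u32, n_non_starved: u32) -> bool {
    let tolerance = std::cmp::max(2, (0.10 * n_non_starved as f64).round() as u32);
    (a_i_sum as i64 - n_non_starved as i64).unsigned_abs() <= tolerance as u64
}

fn main() {
    assert!(a_i_sum_check_ok(12, 12));  // exact agreement
    assert!(a_i_sum_check_ok(13, 12));  // within the tolerance of 2
    assert!(!a_i_sum_check_ok(20, 12)); // off by 8: fail fast, no verdict
    println!("fail-fast guard examples ok");
}
```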

Pure fns are shared with the main binary via a #[path = '../fairness.rs']
mod declaration so the math has a single source of truth (Rust only;
no Go re-impl).

Output: JSON verdict {distribution_a_i, n_active, cstruct,
observed_cov, gap, epsilon, saturated, aggregate_mbps,
starved_flow_count, a_i_sum_check_ok, verdict, failure_reasons}.
Exit code 0 on PASS, 1 on FAIL, 2 on parse/IO error.

Smoke-tested with synthetic data: 2-stream {1Gbps, 500Mbps} test
correctly reports CoV=0.33, gap=0.33, FAIL with Gate 2 diagnostic.

Build: cargo build --release --bin fairness-eval clean

* #1219 part 7/N: fairness-harness.sh test driver script

Bash wrapper at test/incus/fairness-harness.sh that:

1. Runs iperf3 -P N -J --forceflush against $TARGET on $PORT for $T s
2. Concurrently scrapes /metrics every 1s, extracts
   xpf_userspace_binding_active_flow_count{binding_slot=N} into a TSV
3. Calls fairness-eval Rust binary on both inputs
4. Returns the binary's exit code (0 PASS, 1 FAIL, 2 parse error)

Defaults match the iperf-c P=12 -R workload that produced today's 47%
per-flow CoV measurement — the immediate operational target.

Configurable via positional args + env (METRICS_URL, N_WORKERS,
SHAPER_RATE_BPS, WARMUP, FINAL_BURST, FAIRNESS_EVAL path).

awk extracts the metric using the standard Prometheus exposition
format (single value at end of line; binding_slot label parsed via
regex). Per-second cadence over the steady-state window gives 60+
samples per binding for stable {a_i} median.
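For reference, the exposition-format line the harness pulls apart looks like the example below; this is an equivalent extraction sketched in Rust (illustrative only, simplified to a single label; the shipped harness does the pull in the shell script):

```rust
// Parse one Prometheus exposition line of the metric described above
// into (binding_slot, count). Simplified single-label sketch.
fn parse_line(line: &str) -> Option<(u32, u32)> {
    // e.g. xpf_userspace_binding_active_flow_count{binding_slot="3"} 7
    let rest = line.strip_prefix("xpf_userspace_binding_active_flow_count{")?;
    let (labels, value) = rest.rsplit_once("} ")?;
    let slot = labels
        .split(',')
        .find_map(|kv| kv.trim().strip_prefix("binding_slot=\""))?
        .trim_end_matches('"')
        .parse()
        .ok()?;
    Some((slot, value.trim().parse().ok()?))
}

fn main() {
    let line = "xpf_userspace_binding_active_flow_count{binding_slot=\"3\"} 7";
    assert_eq!(parse_line(line), Some((3, 7)));
    assert_eq!(parse_line("unrelated_metric 1"), None);
    println!("exposition-line extraction ok");
}
```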

bash -n syntax-clean.

Parts 1-7 SHIPPED. Implementation pipeline complete:
- Rust pure-fns (part 1) ✓
- flow_cache last_used_epoch (part 2) ✓
- BindingLiveState + ~65ms tick (part 3) ✓
- Snapshot → JSON pipeline (part 4) ✓
- Go BindingStatus + Prometheus emitter (part 5) ✓
- fairness-eval binary (part 6) ✓
- harness script (part 7) ✓

Next: smoke matrix on loss userspace cluster + open PR + dispatch
triple-review of code.

* #1219 part 4 fix: bridge active_flow_count through coordinator::refresh_bindings

Smoke-test discovery: even though parts 1-7 shipped, the metric
xpf_userspace_binding_active_flow_count stayed at 0 during iperf3
load. Tracing the snapshot pipeline revealed that
coordinator::refresh_bindings (afxdp/coordinator/mod.rs:1130) is
where BindingLiveSnapshot fields are copied into BindingStatus
(the wire-visible JSON struct). I missed adding active_flow_count
to that copy block.

Without this bridge, BindingLiveState.active_flow_count atomic was
correctly being updated by the worker tick (visible via the snapshot
fn at umem/mod.rs:669), but the value was never landing on the
BindingStatus that gets serialized to the helper-process status JSON
(and thus never reached the Go side / Prometheus emitter).

One-line fix: copy snap.active_flow_count -> binding.active_flow_count
alongside the existing flow_cache_collision_evictions copy.

cargo test 1004/1004 + 22 fairness + 8 drift CI clean.

Pipeline now end-to-end:
flow_cache.last_used_epoch (per-hit u16 store)
  -> count_active_flows on 100ms tick
  -> BindingLiveState.active_flow_count atomic
  -> BindingLiveSnapshot.active_flow_count
  -> coordinator::refresh_bindings copies it into BindingStatus  ← THIS WAS MISSING
  -> Helper-process status JSON
  -> Go BindingStatus.ActiveFlowCount
  -> Prometheus xpf_userspace_binding_active_flow_count gauge.

Re-deploying to verify metric now reports non-zero during load.

* #1219 part 7 fix: bidirectional flow_cache entries + portable awk

End-to-end smoke discoveries:

1. mawk on the cluster doesn't support gawk's 3-arg match($0, re, arr).
   The harness scrape_metrics function fell back to no extraction,
   producing an empty binding-flows.tsv. Replaced the extraction
   with a portable sed -nE pattern that doesn't depend on the awk
   flavor installed.

2. Each TCP flow creates flow_cache entries on BOTH ingress AND
   egress bindings (forward flow + reverse flow). With 12 streams,
   sum(a_i) is ~24 not ~12. Updated the harness fail-fast guard:
     expected_sum = 2 × n_non_starved
     tolerance = max(2, 0.10 × expected_sum)

OPERATIONAL ANSWER (the deliverable that motivated the harness):

End-to-end run on the loss userspace cluster, 12 stream iperf3
P=12 t=30 -R against 172.16.80.200:5201:

  distribution_a_i = [2,7,0,2,2,0,0,2,0,3,0,0,1,2,0,4,1,0]
  n_active        = 10 / 18
  cstruct         = 59.9%   (structural CoV ceiling for this RSS distribution)
  observed_cov    = 51.4%
  gap             = -8.45pp (observed BELOW ceiling)
  saturated       = true (22.3 Gbps)
  starved_flow_count = 0
  verdict         = PASS

This is the answer to the user's mandate question: 'is 47% iperf3 CoV
at structural ceiling or scheduler bug?'. The 47-51% per-flow CoV is
8.45pp BELOW the structural ceiling for the observed RSS distribution.
The per-worker scheduler is doing better than the RSS placement
allows. There is no scheduler bug. The remaining variance is pure
RSS placement skew.

The fairness contract gates that PR #1217 codified are now empirically
verifiable on the test bench. The user's 'flow evenness' work is
either:
- Done (gate passes; current implementation is at-or-below the
  structural ceiling)
- Or requires moving to the killed-architectural-paths
  (#937 ingress XDP_REDIRECT — kernel-blocked; #1211 AFD ECN
  overlay — research-only)

Per the fairness-regimes contract, this measurement result LANDS the
contract: gate satisfied for this run.

* Remove fairness-eval binary committed by accident (built artifact, not source)

* review fixes: epoch-0 sentinel, hot-path double-borrow, slow-path inline, parse_args panic

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/0bcf2f3a-3a88-4a11-a42c-0ec3d48666b0

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

* review nits: clarify BUG message + assertion text

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/0bcf2f3a-3a88-4a11-a42c-0ec3d48666b0

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

* test: add TestEmitBindingActiveFlowCount_LabelsAndValue (Codex round-1 finding #2)

Pins the Prometheus emitter's wire shape per Codex round-1 review:
3-binding fixture → 3 metrics, slot=0 ActiveFlowCount=5 → gauge value 5
with correct labels {binding_slot=0, queue_id=0, worker_id=0,
iface=ge-0-0-1}. Mirrors the existing emitWorkerRuntime test pattern.

Atop Copilot SWE Agent's autonomous review-fix commits (87973ff +
1526ff5) which addressed: epoch-0 sentinel, hot-path double-borrow,
slow-path inline, parse_args panic, BUG message clarity. My local
161a3435 'round-1 review fixes' commit hit a 2-way conflict on
flow_cache.rs + fairness-eval.rs vs the agent's edits. The agent's
versions were equivalent on the substantive fixes (epoch-0, panic
exits, etc.) and added the double-borrow + slow-path-inline fixes
I missed. Reset local to the agent's tip and re-applied only the
Go metric test (which the agent didn't add).

* #1219 round-2 review fixes: re-apply start.connected[] seeding + plan.md cleanup

Codex round-2 (task-mov55tld) MERGE-NEEDS-MAJOR. Round-1 finding #1
(start.connected[] seeding) regressed when I reset to Copilot agent's
87973ff during the rebase conflict — the agent's version of
fairness-eval.rs didn't include the seeding, and my reapplied seeding
got dropped. Re-applied here:

1. Seed per_stream_buckets from start.connected[].socket so streams
   that contributed zero bytes for the entire steady-state window
   appear in the map with empty bucket vec → starved_flow_count
   correctly increments.

2. n_iperf_streams derives from start.connected[].len() when
   non-empty, falling back to test_start.num_streams. This makes the
   harness fail-fast guard actually catch missing-from-intervals
   streams.

3. plan.md cleanup (Codex round-2 finding #2):
   - 8192 entries → 4096 entries (matches FLOW_CACHE_SIZE constant)
   - 100ms tick → ~65ms gate (call-rate dependent)
   - 256 epochs → already fixed in the code; the plan text was already correct

Test verification:
- 2-stream JSON with socket 5 active + socket 7 silent →
  starved_flow_count: 1 (was 0 before fix), verdict: FAIL with
  Gate 1 diagnostic. Codex's round-2 reproducer scenario.
- All 22 fairness pure-fn tests + 22 flow_cache tests + 8 drift CI pass.
- cargo build --release clean (1 warning: dead-code on local_port,
  marked #[allow] earlier).

* #1219 round-3 review fixes: per-WORKER {a_i} aggregation + Copilot test additions

Codex round-3 (task-mov5ervt) MERGE-NEEDS-MAJOR + Gemini round-1
(task-mov4lrtd) MERGE-NEEDS-MAJOR, a convergent fatal finding: the
harness was building {a_i} by binding_slot across ALL interfaces
(3 ifaces × 6 queues = 18 bindings), producing a fictitious
18-element distribution
instead of the contract's per-worker count for the test's
data-direction interface.

v3 fix:

1. fairness-harness.sh scrape_metrics now writes 6 columns
   (timestamp, binding_slot, queue_id, worker_id, iface, count)
   so fairness-eval can filter and aggregate properly.

2. fairness-eval --iface arg added; backward-compat parser handles
   both new 6-column TSV and legacy 3-column.

3. fairness-eval {a_i} computation rewritten:
   - filter rows to --iface (test's data-direction)
   - aggregate by (timestamp, worker_id) summing counts
   - take median per-worker over the steady-state window
   - distribution_a_i is now per-worker, not per-binding-slot

4. Default IFACE=ge-0-0-2 in harness script (loss cluster's data
   direction; configurable via env).

Plus Copilot 5 comments addressed:

5. Test pinning active_flow_count projection through
   BindingCountersSnapshot::From: assert snap.active_flow_count == 71
   added to the existing projection test.

6. Wire-key assertion now includes 'active_flow_count' in the
   binding_counters_snapshot_serializes_with_expected_wire_keys
   test; fixture value changed from 0 → 31 so omitempty doesn't
   skip the key.

7. local_port: u32 marked #[allow(dead_code)] with rationale.

Test verification:
- 6-col TSV with balanced 6-worker distribution → {a_i} =
  [2,2,2,2,2,2], Cstruct=0.00 (correct).
- cargo test 1006/1006 + 22 fairness + 8 drift CI clean.
- cargo build --release clean (1 expected dead-code warning on
  dropped local_port path).

* review fixes (round 3): fix 100ms doc comments, u16 wraparound math, read_to_string panic, mod fairness cfg(test)

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/a70b49a8-2bcc-408d-9839-19d6b7ba6ff2

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

* Codex round-4 (task-mov5x7j7) MERGE-NEEDS-MAJOR — caught a real bug
in the round-3 fix:

1. **Major: fail-fast guard was still 2× n_streams.** With the
   --iface filter introduced in round-3, sum(a_i) is one direction
   only. P=12 should expect ~12, not ~24 — round-3's untouched
   guard would reject correct single-iface data. Fixed: select
   multiplier based on whether iface filtering is in effect.

2. **Minor: silent zeros on legacy-input + --iface combo.** A
   legacy 3-col TSV (iface="") combined with a fresh harness's
   --iface flag silently dropped every row → empty {a_i} → bogus
   PASS. Fixed: detect "filter set but no iface label in any
   row" and warn loudly + treat --iface as unset for that input.

3. **Minor: stale 100ms / 8192-entry prose.** flow_cache.rs:359
   doc-comment and plan.md:291 still claimed the old 100ms tick
   and ~8192-entry cap — now ~65ms (umem 0xFFFF gate) and 4096
   throughout.

4. **Real test gap (Codex check #6):** added two unit tests in
   src/bin/fairness-eval.rs:
   - six_col_multi_iface_per_worker_aggregation: 2 ts × 3 ifaces
     × 6 workers, verify --iface filter collapses to per-worker
     sums on the filtered iface only and noise on other ifaces
     is dropped.
   - three_col_legacy_parses_with_empty_iface: pin legacy parser
     produces iface="" and worker_id=binding_slot.

Test verification:
- cargo test --release --bin fairness-eval: 24/24 pass
  (was 22 fairness pure-fns; +2 new tsv_tests).
- Full cargo test --release: 1006/1006 + 24 + 8 drift CI clean.

* #1219 round-5 review fixes: extract aggregate_per_worker helper + clean up stale prose

Codex round-5 (task-mov69swk) MERGE-NEEDS-MINOR. Two findings:

1. **Tests didn't cover the production fix.** Round-4's added tests
   only exercised the parser/filter shape — not iface_filter_active,
   per_ts_worker grouping, or direction_multiplier. A reverted
   `direction_multiplier = 2` would not have failed any test.

   Round-5 fix: extract two helpers that ARE the production fix:
   - `aggregate_per_worker(rows, iface_arg, n_workers, warmup, final)`
     → returns AggregateResult { distribution_a_i, iface_filter_active }.
   - `direction_multiplier(iface_filter_active: bool) -> u32`.

   main() now calls both. New aggregation_tests module exercises:
   - filter_iface_and_groups_by_worker (noise on other iface MUST NOT
     contaminate filtered distribution)
   - sums_multiple_queues_per_worker (BTreeMap<(ts, worker_id), u32>
     accumulator must SUM across queues, not replace)
   - legacy_3col_disables_filter (legacy parser produces iface=""
     even with --iface set; filter must collapse to inactive)
   - missing_workers_default_to_zero
   - median_smooths_jitter (single outlier on either side is filtered)
   - direction_multiplier_iface_filter_active_is_one (this would have
     caught the round-3 bug Codex round-4 found)
   - direction_multiplier_no_iface_filter_is_two (legacy bidirectional
     fall-through preserved)

   These tests directly fail under any of:
     - per-binding grouping in place of per-worker;
     - reverted direction_multiplier;
     - filter applied to legacy iface="" rows;
     - sum-then-median replaced by raw count.

2. **Stale 100ms / 8192 prose** at plan.md:34, 169, 235, 270, 279,
   364, umem/mod.rs:235, flow_cache.rs:497. All mechanically replaced
   with ~65 ms (umem 0xFFFF gate) and 4096 entries / ~650 ms window
   throughout. Final grep confirms zero residual matches.
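The two extracted helpers can be sketched as below (signatures simplified from the commit text: the real aggregate_per_worker also takes the iface arg plus warmup/final trimming; rows here are (timestamp, worker_id, count) after iface filtering):

```rust
use std::collections::BTreeMap;

// Group by (timestamp, worker_id), SUMMING across queues (not replacing),
// then take the per-worker median over the window.
fn aggregate_per_worker(rows: &[(u64, u32, u32)], n_workers: usize) -> Vec<u32> {
    let mut per_ts_worker: BTreeMap<(u64, u32), u32> = BTreeMap::new();
    for &(ts, worker, count) in rows {
        *per_ts_worker.entry((ts, worker)).or_insert(0) += count;
    }
    let mut samples: Vec<Vec<u32>> = vec![Vec::new(); n_workers];
    for (&(_, worker), &count) in &per_ts_worker {
        samples[worker as usize].push(count);
    }
    samples
        .into_iter()
        .map(|mut s| match s.len() {
            0 => 0, // missing workers default to zero
            n => {
                s.sort_unstable();
                s[n / 2] // median smooths scrape jitter
            }
        })
        .collect()
}

// Single-iface filtering sees one direction only (expect ~N flows);
// the legacy bidirectional path counts both directions (expect ~2N).
fn direction_multiplier(iface_filter_active: bool) -> u32 {
    if iface_filter_active { 1 } else { 2 }
}

fn main() {
    // Worker 0 has two queues (counts 1 + 1) at each timestamp;
    // worker 1 has one queue with an outlier in the last sample.
    let rows = [
        (1, 0, 1), (1, 0, 1), (1, 1, 3),
        (2, 0, 1), (2, 0, 1), (2, 1, 3),
        (3, 0, 1), (3, 0, 1), (3, 1, 9),
    ];
    assert_eq!(aggregate_per_worker(&rows, 2), vec![2, 3]);
    assert_eq!(direction_multiplier(true), 1);
    assert_eq!(direction_multiplier(false), 2);
    println!("per-worker aggregation sketch ok");
}
```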

Test verification:
- cargo test --release --bin fairness-eval: 31/31 pass (was 24; +7 new)
- Full suite: 1006/1006 + 31 + 8 drift CI clean.
- Final grep -rn '100 ms|100ms|~8192|8192 entries|256 epochs|25\.6 s|256 ticks'
  on docs/pr/1219-fairness-harness/, userspace-dp/src/afxdp/, and
  userspace-dp/src/bin/fairness-eval.rs returns zero matches.

* #1219 round-6 review fixes: metrics.go HELP text + plan.md "1s" → "~650 ms"

Codex round-6 (task-mov6lp4k) MERGE-NEEDS-MINOR. Two findings, both
addressed:

1. **pkg/api/metrics.go:362 stale Prometheus HELP text.** The
   xpf_userspace_binding_active_flow_count gauge was documented as
   "active in the last ~1s" but the ACTIVE_WINDOW_EPOCHS=10 ×
   ~65ms tick = ~650 ms. User-visible metric documentation; fixed
   to match the implementation.

2. **plan.md:95 / :281 / :366 still said "1 second window"** /
   "1s" / "1 second" — three more sites missed in the round-5
   sweep. Now consistent with the rest of the doc at ~650 ms.

   Note: plan.md:36 ("Harness scrapes /metrics every 1 second")
   and :466 ("last 1s" final-burst) and umem/mod.rs:384/:409
   ("~1s debug-report tick") are about UNRELATED 1-second
   intervals (Prometheus scrape cadence, iperf3 final-burst
   window, and the existing debug-report tick) — they are
   correct as-is and were intentionally left untouched.

Gemini round-3 (task-mov6mk28) verdict: **MERGE-READY**. Helper
extraction parity, test coverage of subtle bugs (worker_id !=
binding_slot, multi-queue summation), direction_multiplier code
paths, Cstruct math regression, plan doc consistency, worker_id
stability, hot-path impact, architectural completeness, JSON wire
schema, and arg validation all PASS. "You are good to merge."

Test verification:
- go test ./pkg/api/...: pass.
- cargo test --release: 1006 + 31 + 8 unchanged.

* review fixes (round 5): min steady-state guard, worker_id validation, metrics 1s→650ms, omitempty comment, plan.md stale text, harness label-order comment

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/93fd3daa-a3d2-472e-843c-8413351b77bd

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

* review fixes (round 5+): min steady-state guard, worker_id Result, omitempty comment, harness label-order comment, plan.md stale text

Agent-Logs-Url: https://github.com/psaab/xpf/sessions/93fd3daa-a3d2-472e-843c-8413351b77bd

Co-authored-by: psaab <196946+psaab@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: psaab <196946+psaab@users.noreply.github.com>
psaab added a commit that referenced this pull request May 7, 2026
…_bytes-based rate

Both round-2 reviewers converged on the same fix; v3 adopts both sets of findings:

Codex round-2 (task-mow38bom, PLAN-NEEDS-MAJOR) — 5 findings:
1. active_flow_count is binding-wide, not per (egress_ifindex, queue_id).
2. Naive cap-check on MQFQ front HOL-blocks; need eligible-bucket scan.
3. observed_bps NOT existing flow_cache state.
4. SharedCoSQueueLease doesn't auto-free tokens for nonempty queues.
5. class_rate concrete source needs definition (exact vs surplus phase).

Gemini round-2 (task-mow3904l, PLAN-KILL) — narrow but valid:
- C. observed_bps via TX completion path crosses worker boundaries
  → would re-introduce v1's contention problem. FATAL for v2.
- B. SharedCoSQueueLease behavior 'elegantly correct' for goal.
- F. PerClassFairnessState belongs in CoSQueueConfigState (Arc'd
  cross-worker), NOT FlowFairState (per-worker Box).
- Concrete v3 alternative: track bytes natively via existing
  flow_bucket_bytes [u64; COS_FLOW_FAIR_BUCKETS] in FlowFairState
  (verified at types/cos.rs:563).

User calibration ('gemini can be wrong a lot') applied: this PLAN-KILL
is the substantive kind — a narrow point with code-cited grounds and
a concrete fix. Adopted on the merits, not capitulated to.

v3 design:

1. PerClassFairnessState in CoSQueueConfigState (per egress_ifindex,
   queue_id) — Codex #1, Gemini F.
2. Per-bucket rate via flow_bucket_bytes diff + 10ms local timestamp.
   No flow_cache touch. No cross-worker write — Gemini C fix.
3. Cap-aware MQFQ selector scans active buckets, skips over-cap,
   falls back to lowest-finish if all capped — Codex #2.
4. Per-queue active_flow_count via extension of count_active_flows scan
   (Option A: single extra pass on owner-only path) — Codex #1.
5. class_rate concrete: exact-phase = transmit_rate / N_active_workers;
   surplus-phase = root_shaping_rate × local_share — Codex #5.
6. Conditional shared_lease.release_local_tokens() when locally-capped
   for N consecutive batches — Codex #4.

Acceptance unchanged: per-flow CoV ≤ Cstruct + 0.10 on the user's exact
command, no aggregate regression > 5%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
psaab added a commit that referenced this pull request May 8, 2026
User chose option C from the v2 STOP report. v3 abandons token
accumulation entirely. Per-epoch counters reset every 200µs;
no pool to leak, no balance to steal share from, no accumulating
credit to violate cap.

Mechanism:
- Per-class epoch state: epoch_start_ns, total_grant_cap,
  active_worker_count, seq
- Per-worker grants: Box<[AtomicU64]> reset on epoch rotation
- Per-worker active_flow_buckets: Box<[AtomicU32]> read at epoch
  rotation only
- Acquire: primary share = total_cap / active_workers; surplus
  capped by total_cap - sum_of_grants
- Epoch rotation: CAS-serialized; one winner resets all counters
  and publishes new cap
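The acquire step above can be sketched as follows (a hedged reconstruction: field and function names follow this commit's bullet list, but the epoch-rotation CAS and surplus policy details are elided and illustrative):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Per-epoch grant state for one class: counters reset every epoch,
// so there is no accumulating balance to leak or steal from.
struct EpochState {
    total_grant_cap: u64,
    active_worker_count: u64,
    grants: Vec<AtomicU64>, // per-worker bytes granted this epoch
}

impl EpochState {
    fn acquire(&self, worker_id: usize, want: u64) -> u64 {
        let primary = self.total_grant_cap / self.active_worker_count.max(1);
        let mine = self.grants[worker_id].load(Ordering::Relaxed);
        let granted: u64 = self.grants.iter().map(|g| g.load(Ordering::Relaxed)).sum();
        let class_room = self.total_grant_cap.saturating_sub(granted);
        // Primary share first, then surplus; both bounded by class_room
        // so sum_of_grants can never exceed total_grant_cap (the sum
        // check before the surplus grant).
        let primary_part = primary.saturating_sub(mine).min(want).min(class_room);
        let surplus_part = (want - primary_part).min(class_room - primary_part);
        let grant = primary_part + surplus_part;
        self.grants[worker_id].fetch_add(grant, Ordering::Relaxed);
        grant
    }
}

fn main() {
    let state = EpochState {
        total_grant_cap: 100,
        active_worker_count: 2,
        grants: (0..2).map(|_| AtomicU64::new(0)).collect(),
    };
    assert_eq!(state.acquire(0, 40), 40); // within the primary share (50)
    assert_eq!(state.acquire(1, 70), 60); // 50 primary + 10 surplus remaining
    assert_eq!(state.acquire(0, 20), 0);  // epoch cap exhausted: no grant
    println!("per-epoch grant sketch ok");
}
```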

Addresses every convergent v1+v2 kill point:
- F1 polling skew: primary share cap enforced per epoch
- F2 token lifecycle: no tokens, only per-epoch grant counters
- F3 epoch CAS: simpler (single CAS, reset is the whole work)
- F4 sparse worker_id: max_worker_id+1 sizing, matches_config
  includes it, bounds-checked
- F5 rehydration: at epoch rotation, 200µs window
- F6 scope: explicitly Guarantee-phase exact queue only; root
  lease and surplus-sharing keep legacy path
- Gemini #2 token leak: no pool to leak
- Gemini #7 cap violation: enforced at acquire (sum check before
  surplus grant)
- Gemini #10 my_count==0 bypass: no greedy fallback; surplus
  capped by class total

Pending v3 adversarial Codex + Gemini Pro 3 review.
psaab added a commit that referenced this pull request May 8, 2026
…t parity

Codex MERGE-NEEDS-MAJOR (task-moxb516r-4c2nxg) on PR #1230 v8
implementation. Spine sound; one HIGH bug + two MEDIUM doc fixes.

HIGH (Codex finding #1) — rehydration clobber:
rehydrate_worker_active_count used store(count) but multiple
runtimes can share a worker thread + lease (multi-binding
same-worker case). The second install would overwrite the
first runtime's count, eventually allowing decrements to drive
the slot to 0 while the other runtime still has active flows.

Fix: change store -> fetch_add. Per-runtime install + Arc::ptr_eq
detection (token_bucket.rs::ensure_v8_lease_attached) guarantees
each runtime fires rehydration exactly once per (runtime, lease-Arc)
pair, so the additive sum across runtimes is correct.

Also: skip rehydrate when count == 0 (saves a no-op atomic).
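
A minimal sketch of the fix; the function name mirrors the commit
text, but the signature is an assumption for illustration.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// `slot` stands in for the shared per-worker active-flow counter that
// multiple runtimes may install into; the signature is assumed.
fn rehydrate_worker_active_count(slot: &AtomicU32, count: u32) {
    if count == 0 {
        return; // skip the no-op atomic when there is nothing to install
    }
    // Before the fix this was `slot.store(count, ...)`: a second runtime
    // sharing the same worker thread + lease clobbered the first runtime's
    // count. fetch_add makes installs additive, so the slot holds the sum
    // across all runtimes bound to this worker.
    slot.fetch_add(count, Ordering::AcqRel);
}

fn main() {
    let slot = AtomicU32::new(0);
    rehydrate_worker_active_count(&slot, 5); // first runtime installs 5
    rehydrate_worker_active_count(&slot, 3); // second runtime adds, not overwrites
    assert_eq!(slot.load(Ordering::Acquire), 8);
}
```

The per-runtime install plus Arc::ptr_eq detection described above is
what makes the additive sum safe: each runtime contributes exactly once
per (runtime, lease-Arc) pair.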

Tests: rename v8_rehydrate_worker_active_count_idempotent to
v8_rehydrate_worker_active_count_is_additive; add new
v8_rehydrate_multiple_workers_isolated +
v8_rehydrate_multi_binding_same_worker_summation tests proving
the Codex finding's failure mode is fixed.

MEDIUM (Codex finding #2 + Copilot) — misleading helper invariant:
active_buckets.rs documented its helpers as MUST-call, but they are
dead code; the production sites inline the logic instead. Update the
doc to reflect that reality and clarify the helpers are reference-only,
for future call sites that lack an existing as_mut borrow.

Plus: add a debug_assert in the inlined unbump path (accounting.rs:84
dequeue site) to match the dead-code helper's assertion, taken from
Codex note #2 verbatim. It catches a saturating_sub past zero in dev
builds.

Doc fixes:
- flow_bucket_tx_bytes wrap-time math: was '~5e9 seconds at 100
  Gbps' (wrong); is '~47 years at 100 Gbps sustained' (correct,
  u64 wraps at 2^64 bytes / 12.5 GB/s).
- flow_bucket_observed_bps unit: was 'bytes/sec' (wrong); is
  'BITS/sec' matching account_flow_bucket_tx's bytes × 8 × 1e9
  / dt_ns formula. Aligns with name suffix (_bps = bits-per-sec).
- flow_bucket_tx_bytes update sites: now lists all 4 paths,
  not just the post-settle 2.
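
Both corrected numbers can be checked with a few lines of arithmetic
(only the stated constants are used; nothing below is taken from the
repo):

```rust
// u64 byte counter wrap time at a sustained 100 Gbps (= 12.5e9 bytes/sec).
fn wrap_years_at_100gbps() -> f64 {
    let bytes_per_sec = 100_000_000_000u64 / 8; // 12.5 GB/s
    let wrap_secs = u64::MAX as f64 / bytes_per_sec as f64;
    wrap_secs / (365.25 * 24.0 * 3600.0)
}

/// bytes * 8 * 1e9 / dt_ns yields BITS per second, matching the _bps suffix.
/// (Sketch only: real accounting must guard the intermediate product
/// against overflow for large byte counts.)
fn observed_bps(bytes: u64, dt_ns: u64) -> u64 {
    bytes * 8 * 1_000_000_000 / dt_ns
}

fn main() {
    // ~47 years (about 1.48e9 seconds), not "~5e9 seconds".
    assert!((wrap_years_at_100gbps() - 46.8).abs() < 0.5);
    // 1250 bytes over 1 µs is 10 Gbps.
    assert_eq!(observed_bps(1_250, 1_000), 10_000_000_000);
}
```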

Tests: 1032 + 47 = 1079 cargo test --release pass (was 1077).