Skip to content

userspace: add bounded dataplane event producer#1404

Merged
psaab merged 3 commits into
masterfrom
codex/1403-1379-event-infra
May 18, 2026
Merged

userspace: add bounded dataplane event producer#1404
psaab merged 3 commits into
masterfrom
codex/1403-1379-event-infra

Conversation

@psaab
Copy link
Copy Markdown
Owner

@psaab psaab commented May 17, 2026

Summary

Refs #1379.

Adds the bounded userspace dataplane event producer infrastructure needed before wiring policy-deny, screen-drop, and filter-log call sites:

  • adds fixed-size, non-blocking RT_FLOW event emission from EventStreamWorkerHandle
  • adds per-event-kind/per-ingress-zone GCRA rate limiting
  • adds per-kind sent/drop accounting split by rate-limit, queue-full, and disconnected reasons
  • updates event-stream stats/tests and docs to describe the infrastructure/call-site boundary

This intentionally does not claim full #1379 closure. Runtime producer call sites, operator-surfaced helper counters, and end-to-end syslog validation remain open.

Validation

  • cargo test event_stream
  • go test ./pkg/dataplane/userspace
  • go test ./pkg/api
  • git diff --check

Copilot AI review requested due to automatic review settings May 17, 2026 18:47
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

@copilot review

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds bounded userspace dataplane event producer infrastructure for future RT_FLOW policy-deny, screen-drop, and filter-log emission, while keeping runtime call-site wiring as follow-up work.

Changes:

  • Adds Rust producer-side rate limiting, non-blocking enqueue, and per-kind/drop-reason counters.
  • Updates event-stream stats behavior/tests for queued/sent accounting.
  • Updates documentation/status tests to reflect the infrastructure vs. remaining call-site gap.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
userspace-dp/src/event_stream/producer.rs Adds dataplane event producer, limiter, outcomes, and counters.
userspace-dp/src/event_stream/producer_tests.rs Covers enqueue, rate-limit, queue-full, and disconnected outcomes.
userspace-dp/src/event_stream/mod.rs Wires producer module, stats fields, shared state, and send accounting.
userspace-dp/src/event_stream/tests.rs Adds sent-counter assertions for existing send paths.
userspace-dp/src/event_stream/README.md Documents producer API and RT_FLOW telemetry requirements.
pkg/dataplane/userspace/statusfmt_test.go Extends status summary expectations for event-stream counters.
docs/userspace-dataplane-gaps.md Updates dataplane event logging gap status.
docs/pr/1373-retire-ebpf-dataplane/plan-1379-dataplane-events.md Updates implementation status, invariants, tests, and remaining gaps.

Comment on lines +200 to +204
pub(crate) fn try_emit_dataplane_event_at(
&self,
event: DataplaneEventPayload,
now_ns: u64,
) -> DataplaneEventEmitOutcome {
Copy link
Copy Markdown
Contributor

Copilot AI commented May 17, 2026

@copilot review

Completed an adversarial pass on 000ba46 (hot-path + protocol/accounting focus). No additional correctness/perf blockers were found that require a follow-up code change in this PR.

Copilot finished work on behalf of psaab May 17, 2026 18:52
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-1 Codex review on 000ba46e

Codex verdict: MERGE-NEEDS-MAJOR

MAJOR 1 — Hot-path limiter does redundant divisions per attempted event

userspace-dp/src/event_stream/producer.rs:63-71: every call recomputes interval_ns() and burst_horizon_ns(), and burst_horizon_ns() calls interval_ns() again.

Under a deny/log storm, even rate-limited DROPS pay these divisions per packet. Both functions appear cheap individually but sum to material overhead at 1M+ events/s.

Fix: precompute interval_ns, burst_horizon_ns, and unlimited once in the DataplaneEventRateLimiter constructor.

MAJOR 2 — Starvation invariant not actually enforced

Dataplane telemetry and session deltas share the SAME mpsc::sync_channel:

  • mod.rs:39-40, :209-223 — single channel
  • afxdp/session_delta.rs:143-145 — session push uses the same channel

The limiter is per (kind, zone) with default 1000/s and burst 1024 (producer.rs:14-15). 256 zone buckets × 3 kinds × 1024 burst = potentially 786K frames in a multi-zone storm.

Worked failure: multi-zone deny storm consumes the entire 8192-frame queue. Session open/close events drop because there's no space left.

Violates the plan's claim: "prevent deny storms from starving session open/close/update events"

Fix paths:

  • Global telemetry cap (sum across all (kind,zone) buckets bounded)
  • Reserved session capacity in the queue (e.g., session events get 2048-frame reserved slot)
  • Priority queueing (session events preempt telemetry)
  • Separate telemetry queue from session deltas

Scope note (Codex)

The producer is intentionally unwired in this slice (producer.rs:1). No policy/screen/filter runtime call sites yet. No config compile → snapshot → runtime emission path. This is the FOUNDATION layer of #1379, NOT end-to-end behavior. Plan accurately lists wiring as deferred.

Other checks

  • ✓ Backwards-compat: no snapshot/schema changes
  • ✓ Wire path: existing 136-byte RT_FLOW codec; Go decode exists
  • ⊝ Tests cover helper API/rate/drop only; don't pin Rust producer → Go listener → logging end-to-end; no session-starvation-under-telemetry-pressure test

Recommendation

Block on:

  1. Precompute interval_ns/burst_horizon_ns in the rate-limiter constructor
  2. Enforce the actual starvation invariant: global telemetry cap, reserved session capacity, OR priority queueing

Codex task: task-mpa6kye9-d0qq2q. Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-2 Codex review on e101bf7d

Codex verdict: MERGE-NEEDS-MAJOR (early-release budget bug)

r1 verification

  • ⊝ Producer-side fairness: partially addressed. r2 added total + per-kind telemetry caps, but they don't enforce true in-flight bound.
  • ⊝ Test coverage: not closed — new tests pin admission only, not end-to-end.

NEW MAJOR — Queue budget released too early

userspace-dp/src/event_stream/producer.rs:288: producer acquires budget BEFORE try_send_frame(). But I/O loop at mod.rs:471-485 releases the budget as soon as it dequeues from the Rust mpsc channel — BEFORE:

  • The frame is written to the Unix socket
  • Daemon callback ACK
  • The frame possibly only sits in write_buf or replay_buf

Worked failure (Codex): If daemon/socket is slow, write_buf grows at mod.rs:493-505 while the budget has been freed → producers admit ANOTHER half-channel of telemetry. Storm sits ahead of session/control frames downstream of the mpsc channel.

Plan doc claims "in-flight telemetry cap"; implementation only caps "still-in-channel". Misleading.

Fix: release budget only when:

  • Frame successfully written to socket AND popped from write_buf (and replay_buf after ACK trims), OR
  • Track in-flight at socket-write boundary, not mpsc dequeue boundary

Test gap (Codex)

producer_tests.rs:170-207 only fills test mpsc channel. Never runs run_connected_loop, never exercises socket backpressure/WouldBlock, never verifies replay-buffer behavior, never proves session frame delivered+ACKed under telemetry storm.

The test would PASS with the early-release bug above.

Recommendation

Block on:

  1. Move budget release to socket-write boundary (not mpsc dequeue)
  2. Add end-to-end test: telemetry storm via run_connected_loop → assert session frame still ACKs within bounded time

Codex task: task-mpa7x31p-4g4zc2. Not merging — author's decision.

Copilot AI review requested due to automatic review settings May 17, 2026 21:14
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

@copilot review

Copy link
Copy Markdown
Contributor

Copilot AI commented May 17, 2026

Round-1 Codex review on 000ba46e

Codex verdict: MERGE-NEEDS-MAJOR...

Addressed in e101bf7: limiter runtime now precomputes interval/burst horizon once, and telemetry admission adds bounded total/per-kind queue budgets so session/control frames retain reserved shared-channel capacity.

Round-2 Codex review on e101bf7d

Codex verdict: MERGE-NEEDS-MAJOR (early-release budget bug)...

Addressed in 9fc6d0b: telemetry budget is no longer released on mpsc dequeue; it is held through replay retention and released on ACK trim or definitive drop paths (eviction/drain), with connected-loop tests covering the would-block/drain behavior.

@copilot review

Completed a fresh adversarial pass on the current head (e101bf7 + 9fc6d0b). I don’t see additional major blockers beyond the issues those two commits already resolved.

Copilot finished work on behalf of psaab May 17, 2026 21:17
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Comment on lines +340 to +342
fn rate_bucket_index(kind: DataplaneEventKind, ingress_zone_id: u16) -> usize {
let zone_bucket = usize::from(ingress_zone_id).min(DATAPLANE_EVENT_ZONE_BUCKETS - 1);
kind_index(kind) * DATAPLANE_EVENT_ZONE_BUCKETS + zone_bucket
Comment on lines +263 to +272
impl EventStreamWorkerHandle {
/// Fixed-size, non-blocking dataplane telemetry emission.
///
/// `now_ns` is a caller-supplied monotonic timestamp used only for rate
/// limiting; `event.timestamp_ns` remains the on-wire event timestamp.
pub(crate) fn try_emit_dataplane_event_at(
&self,
event: DataplaneEventPayload,
now_ns: u64,
) -> DataplaneEventEmitOutcome {
@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Claude r3 review on 9fc6d0b8

Verdict: MERGE-READY (pending Codex/Gemini cross-check)

r2 MAJOR closure verified

r2 raised: budget released on channel drain BEFORE consumer ack, allowing replay_buf to retain frames whose budget was already returned.

r3 fix is structurally correct:

  1. pop_replay_frame() (mod.rs:734) is now the sole release path — release_dataplane_event_queue_budget(shared, frame) is called only via this helper.
  2. All four callers of pop_replay_frame:
    • process_control_frames ack-trim loop (mod.rs:604) — release on ACK
    • push_replay_frame LRU eviction (mod.rs:725) — release on definitive drop
    • release_replay_dataplane_event_queue_budget shutdown drain (mod.rs:371, 743-748) — release on io_thread exit
  3. Channel→replay_buf intake (run_connected_loop mod.rs:469 + handle_drain_request mod.rs:653) no longer releases budget at drain time.
  4. producer.rs:285 reordered: try_acquire now precedes encode_dataplane_event, so the encode allocation is skipped on budget exhaustion.

Verification

  • git show 9fc6d0b8 -- userspace-dp/src/event_stream/mod.rs — confirmed two release_dataplane_event_queue_budget removals from drain paths, one new shutdown release.
  • Producer ordering matches r2 critique ("acquire-then-encode").

Open risk

  • Shutdown drain: if io_thread_main panics before the new release_replay_dataplane_event_queue_budget(&shared, &mut replay_buf) line runs (mod.rs:371), residual budget leaks until process exit. Acceptable — process is dying anyway. Worth a comment if tests don't cover.
  • 194-LOC test addition needs verification it exercises the disconnect-with-frames-in-replay-buf path.

Awaiting Codex (task-mpaag88k-mtifyw) + Gemini Pro 3 (task-mpaagqdu-37x92v). Not merging — author's decision.

@psaab
Copy link
Copy Markdown
Owner Author

psaab commented May 17, 2026

Round-3 triple-review synthesis on 9fc6d0b8

Reviewer Verdict
Claude MERGE-READY
Codex MERGE-READY
Gemini Pro 3 MERGE-READY

All three converge. r2 budget-release MAJOR is closed.

Codex confirmation

  • Producer reordering at producer.rs:268-323: acquire-before-encode; rate-limit and queue-full paths return BEFORE budget acquisition.
  • Release encapsulated in pop_replay_frame (mod.rs:734-741); four legitimate caller sites (ACK trim 599-605, replay eviction 723-731, shutdown drain 707-715 + 745-749).
  • New test at tests.rs:152-245 explicitly pins the r2 bug: drains mpsc into replay, asserts second emit gets QueueFull (would have succeeded under r2's drain-time release).
  • README at README.md:57-67 accurately describes acquire → queue → replay-retain → ACK/drop → release.

Gemini confirmation

Same findings with quote evidence. Noted "the only Vec::new and vec! calls introduced exist strictly inside test cases" — no hot-path allocation.

Open note

Codex flagged decrement_if_positive underflow guard is weak in release build, but no reachable double-release path in this patch. Tracker-only.

Recommendation

Merge-ready.

Codex task: task-mpaag88k-mtifyw. Gemini task: task-mpaagqdu-37x92v. Not merging — author's decision.

@psaab psaab force-pushed the codex/1403-1379-event-infra branch from 9fc6d0b to aa6f366 Compare May 18, 2026 03:26
@psaab psaab merged commit e8a1c66 into master May 18, 2026
psaab added a commit that referenced this pull request May 18, 2026
Adds bounded dataplane event producer (#1404) in userspace.

Fixes event producer queue fairness. Holds event telemetry budget until
ack or drop.

Touches userspace-dp/src/event_stream, docs/pr/1373-retire-ebpf-dataplane,
pkg/dataplane/userspace, and docs/userspace-dataplane-gaps.md across
documentation, tests, userspace dataplane, and dataplane integration. The
largest file deltas are userspace-dp/src/event_stream/producer.rs,
userspace-dp/src/event_stream/producer_tests.rs, and
userspace-dp/src/event_stream/tests.rs. The diff is 1025 additions and 34
deletions across 9 files.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants