diff --git a/docs/audits/bgid-lifecycle.md b/docs/audits/bgid-lifecycle.md new file mode 100644 index 000000000..e6c58aa21 --- /dev/null +++ b/docs/audits/bgid-lifecycle.md @@ -0,0 +1,211 @@ +# BGID lifecycle audit (BGE-1, BGE-2) + +Tracking issues: #2293 (allocation audit) and #2294 (release audit). Sibling +material: [`io-uring-bgid-namespace.md`](io-uring-bgid-namespace.md), +[`io-uring-bgid-exhaustion.md`](io-uring-bgid-exhaustion.md), +[`iouring-bgid-namespace.md`](iouring-bgid-namespace.md). This document is +the precondition for BGE-3 (high-water-mark stat) and BGE-4 (free-list +recycling tuning); it freezes the current allocation + release call graph +so subsequent tasks can land surgically. + +## 1. Namespace + +- Kernel `struct io_uring_buf_reg.bgid` is `u16`, bounded to 65 536 IDs. + Definition mirrored at + `crates/fast_io/src/io_uring/buffer_ring.rs:90` + (`IoUringBufReg.bgid: u16`). +- mmap offset folds the bgid at bit 16 + (`crates/fast_io/src/io_uring/buffer_ring.rs:439`, + `IORING_OFF_PBUF_RING | (u64::from(config.bgid) << 16)`), so the + per-io_uring-instance namespace cannot be widened without a kernel ABI + change. +- `BGID_NAMESPACE_SIZE = u16::MAX as u32 + 1` + (`crates/fast_io/src/io_uring/buffer_ring.rs:129`) is the hard cap the + allocator enforces. + +## 2. Allocation flow + +Single entry point: `BgidAllocator::allocate()` at +`crates/fast_io/src/io_uring/buffer_ring.rs:193`. + +1. Drain the process-wide free-list first + (`buffer_ring.rs:197-203`, helper `bgid_free_list()` at `:146`). A + `OnceLock>>` returns the most recently deallocated id + ahead of advancing the monotonic counter. +2. If the free-list is empty, bump the monotonic counter + `NEXT_BGID: AtomicU32` (`buffer_ring.rs:138`) via + `fetch_add(1, Relaxed)` at `buffer_ring.rs:208`. Stored as `u32` so the + `>= 65 536` boundary is observable without wrap. +3. On overshoot the allocator undoes the fetch (`fetch_sub(1, Relaxed)` + at `buffer_ring.rs:216`) and returns `BufferRingError::BgidExhausted`. + The undo keeps the counter capped at `BGID_NAMESPACE_SIZE` so repeated + calls after exhaustion never wrap back into the valid `u16` range and + silently collide with a live ring. + +Construction wrapper: `BufferRing::new_with_allocator()` at +`buffer_ring.rs:564` calls `BgidAllocator::allocate()`, threads the id +into `BufferRingConfig::bgid`, then sets `self.allocator_owned = true` +(`buffer_ring.rs:572`) so `Drop` knows to recycle. If `BufferRing::new()` +fails after allocation, the wrapper hands the id back via +`BgidAllocator::deallocate(bgid)` at `buffer_ring.rs:576` before +propagating the error, so a failed registration does not leak a slot. + +## 3. Release flow + +Single release path: `BufferRing::Drop` at +`crates/fast_io/src/io_uring/buffer_ring.rs:720`. + +1. Issue `IORING_UNREGISTER_PBUF_RING` (opcode 23) with the matching bgid + (`buffer_ring.rs:723-741`). Kernel reclaims the group slot + immediately; the same bgid may then be re-registered. +2. `munmap` the ring descriptor region (`buffer_ring.rs:746-748`). +3. `dealloc` the buffer slab (`buffer_ring.rs:754-756`). +4. If `self.allocator_owned` is `true` + (`buffer_ring.rs:761-763`), call `BgidAllocator::deallocate(bgid)`. + `deallocate` (`buffer_ring.rs:245`) takes the free-list mutex and + `push`es the id, guarded by `!free_list.contains(&bgid)` for + idempotence on double-drop scenarios. + +Caller-supplied bgids (`BufferRing::new`) leave `allocator_owned = false`, +so the kernel slot is unregistered but no process-side reuse happens. +The caller continues to own that namespace slot, matching the original +intent for fixed-purpose probes. + +## 4. `BgidExhausted` + +- Variant declared at + `crates/fast_io/src/io_uring_common.rs:370` + (`BgidExhausted` with display + `io_uring buffer group ID namespace exhausted (limit: 65535)`). +- `From for io::Error` (`io_uring_common.rs:378`) maps + it to `io::ErrorKind::InvalidInput` at `io_uring_common.rs:387` so + upstream callers see a typed I/O error. +- Raised in two places: + - `BgidAllocator::allocate()` at + `crates/fast_io/src/io_uring/buffer_ring.rs:217` when the counter + crosses the namespace limit and the free-list is empty. + - Non-Linux stub + `crates/fast_io/src/io_uring_stub/buffer_ring.rs:104` + (`pub fn allocate() -> Result { Err(BgidExhausted) }`) + so the stub keeps a parity surface but never hands out IDs. + +## 5. Daemon session usage + +`BufferRing` and `BgidAllocator` are not yet wired into any production +caller outside `fast_io`. A workspace grep over `crates/daemon`, +`crates/engine`, `crates/core`, and `crates/cli` returns no hits for +`BufferRing` or `new_with_allocator`. The accept loop in +`crates/daemon/src/...` does not provision a per-session PBUF ring +today, so accepted connections do not consume a bgid, do not need to +return one on close, and the namespace cap is effectively untouched in +the steady state. + +Implication: the allocator is currently defensive plumbing for the +follow-ups tracked under #1936/#1937 (per-session PBUF wiring) and the +shared-ring work in `shared_ring.rs`. When those land, every accepted +connection that opens a read PBUF ring consumes one bgid, and a write +ring consumes a second. Until then the only `allocate`/`deallocate` +exercise comes from the in-file unit tests at +`buffer_ring.rs:1067-1190`. + +## 6. Risk profile + +The lifecycle is correct on paper, but the surface is narrow enough to +hide regressions. The audit flagged four exposures. + +1. **No high-water-mark telemetry.** `BgidAllocator::remaining()` exists + (`buffer_ring.rs:259`) but there is no counter for peak in-use bgids, + no `tracing::warn!` when occupancy crosses a threshold, and no Prom + metric. A long-running daemon that leaks rings (forgotten `Box`, + `Arc` cycle) consumes the namespace silently until the first + `BgidExhausted` failure. BGE-3 should land + `BgidAllocator::peak_used()` + a `tracing::warn!` at 50 % occupancy + (>= 32 768 in flight) throttled to one log per minute. +2. **Free-list is unbounded.** `bgid_free_list()` returns a + `Mutex>`. The vector grows monotonically up to 65 536 + entries (1 MiB at `u64` granularity, 128 KiB packed) and never + shrinks. Harmless in size, but worth a `Vec::with_capacity(1 << 12)` + to avoid 16+ realloc rounds during ramp-up. +3. **Free-list mutex on every allocate / deallocate.** The cited comment + at `buffer_ring.rs:194-196` accepts this trade for "one buffer ring + per long-running task". Once per-session rings land + (#1936/#1937) the cadence rises to twice per accepted connection; + acceptable at hundreds of accepts/sec, hot at tens of thousands. + BGE-4 should consider a sharded free-list or a lock-free `ArrayQueue` + if the per-session work materially raises contention. +3. **Idempotence comment vs. behaviour.** `deallocate` is idempotent for + double-drop (`!free_list.contains(&bgid)`), but a stray + `BgidAllocator::deallocate(constant)` from outside the crate would + pollute the free-list with an id the allocator never issued. Public + visibility on `deallocate` is necessary for the wrapper failure path + at `buffer_ring.rs:576`; documentation at `buffer_ring.rs:236-244` + already calls this out. No code change needed; the audit recommends + `#[doc(hidden)]` plus a `pub(crate)` review if the visibility is not + required by an external consumer (today it is not). +4. **Counter cap relies on `fetch_sub`.** Under a hypothetical 65 536 + concurrent threads racing past `BGID_NAMESPACE_SIZE`, each + `fetch_add` past the limit is followed by a `fetch_sub`, so the + counter never escapes by more than `thread_count - 1` and the next + load still trips the exhaustion check. Correct, but subtle. A + `compare_exchange` loop would be easier to reason about; the + `fetch_add`+`fetch_sub` pair is preserved because it avoids the loop + on the hot path. + +## 7. BGE-4 fix plan: free-list pool keyed off Drop signal + +The current implementation already wires a Drop signal +(`allocator_owned` -> `deallocate`). What is missing is the +*observability* and *bounded growth* hooks. Concrete deltas: + +1. **Pre-size the free-list.** + ```rust + FREE_LIST.get_or_init(|| Mutex::new(Vec::with_capacity(1 << 12))) + ``` + at `buffer_ring.rs:148`. Cap initial allocation; the `Vec` still + grows under sustained churn but skips the early doubling rounds. +2. **Peak tracker.** Add `static PEAK_USED: AtomicU32 = AtomicU32::new(0)` + alongside `NEXT_BGID`. On every successful `allocate`, compute + `in_use = NEXT_BGID - free_list.len()` and `PEAK_USED.fetch_max(...)`. + Expose via `BgidAllocator::peak_used()` so the daemon can surface + the value to its existing metrics endpoint. +3. **Threshold warning.** When `in_use` crosses + `BGID_NAMESPACE_SIZE / 2`, emit + `tracing::warn!(target: "fast_io::bgid", peak = ..., remaining = ..., + "io_uring bgid namespace 50% full")`. Throttle via a + `OnceLock>` so steady-state busy daemons emit at most + once per minute. +4. **Optional per-ring weak handle.** For BGE-4 stretch goal, replace + the global free-list with `Arc` carried inside + `BufferRing` (mirrors the sketch in + [`io-uring-bgid-exhaustion.md`](io-uring-bgid-exhaustion.md) section 5). + That allows distinct pools per io_uring instance, useful once the + shared-ring work in `shared_ring.rs` allows multiple rings per + process. + +## 8. BGE-5 stress-test plan + +The existing tests cover correctness (`buffer_ring.rs:1066-1190`) but +not steady-state churn. BGE-5 should add: + +1. **Churn loop.** `#[test]` that allocates and immediately drops + `BufferRing::new_with_allocator` 200 000 times and asserts + `BgidAllocator::peak_used() <= ring_size + thread_count`. Run under + the existing test serialiser (`bgid_test_lock()` at + `buffer_ring.rs:1081`) to avoid cross-test interference. +2. **Concurrent allocate / deallocate.** `loom` model (the `fast_io` + crate already pulls `loom` for other tests) covering the + `fetch_add` / `fetch_sub` cap and the free-list pop / push race. +3. **Exhaustion + recovery.** Allocate `u16::MAX + 1` ids without + dropping, assert `BgidExhausted`, drop one, assert + `BgidAllocator::allocate()` returns the freed id. Already partially + covered by `bgid_allocator_reuses_freed_ids` at `buffer_ring.rs:1156` + and `bgid_allocator_free_list_persists_after_exhaustion` at + `buffer_ring.rs:1169`; extend with the recovery half. +4. **`tracing` capture.** Use `tracing-test` or a custom subscriber to + assert the 50 % warning fires exactly once per throttle window under + a synthetic ramp. +5. **Long-soak harness.** Optional `#[ignore]` test that runs for + `RUST_TEST_TIME_SOAK=600` seconds with two threads allocating and + dropping in tight loops, asserting `peak_used()` stays within + tolerance. Useful before tagging a release that wires per-session + PBUF rings (#1936/#1937).