evpn-linux: ADR-0059 slice 3b — wire FDB-NHG onto the apply path#88
Conversation
First slice in the ADR-0059 chain with operational behavior change. Multi-homed Type 2 routes now program FDB nexthop groups via `NDA_NH_ID` in the kernel; traffic to a multi-homed MAC fans out across every observed alias VTEP on the encap path. Slice 4 (M40 manual containerlab smoke against FRR) closes out the validation story. What lands: - `compute_diff` Pass 1b — emits `InstallFdbNhg` / `UpdateFdbNhgMembers` / `RemoveFdbNhg` based on `RemoteMacEntry::alias_group_key`. IPv6 alias members fall back to single-dst FDB with `warn!` per `(VNI, MAC)` (slice 2's socket rejects v6; graceful degradation until v6 follow-up). Six transition tests cover the owned↔desired matrix. - `NexthopOps` impls on `LinuxDataplane` (delegates to slice 2's `NexthopSocket` + slice 3a's `linux::fdb_nhg`) and `InMemoryDataplane` (records calls for unit-testing the coordinator). - Reconcile actor coordinator — intercepts the three FDB-NHG op variants before `Dataplane::apply`, calls `NexthopOps` methods in ADR-0059 §5 invariant 1+2 order (install: members → group → FDB row; teardown: FDB row → group → members per refcount). `ActorState` extends with `nh_id_alloc`, `groups`, `adopted_unreferenced`, `adoption_done`. - `NexthopSocket::dump_owned` — `RTM_GETNEXTHOP` multipart parser tolerating `NHA_GROUP_TYPE` / `NHA_OP_FLAGS` per research §1, filtering by tag bits + `NHA_FDB`. 7 unit tests. - Startup adoption + deferred stale cleanup: first `reconcile_once` pass dumps tagged nexthops, reserves their IDs in the allocator (prevents `NLM_F_REPLACE` collisions with a prior daemon's state), and tracks them in `adopted_unreferenced`. After the first apply phase, anything still unreferenced is true-stale and gets deleted + released. Gated to one shot via `adoption_done`. Per research §8, this is startup-only; periodic drift recovery deferred to slice 3.5. - New 3 new `DataplaneOpKind` variants from slice 3a are now emitted by the actor and surface in `AppliedOp` / `FailedOp`. - Docker-runnable netns integration test (`tests/netns_fdb_nhg.rs`): round-trip install + verify via `ip nexthop show` + `bridge fdb show nhid` + idempotent del + CVE-2025-39851 guard negative path. Harness extended with `fdb_nhg` / `fdb_nhg_roundtrip` / `fdb_nhg_cve` selectors. Deviation from approved plan: adoption happens in the first `reconcile_once` pass rather than at `connect()` time. Keeps allocator + refcount state in `ActorState` (per ADR §7 boundary) without changing `connect()`'s signature. Net behavior identical — adoption completes before any `InstallFdbNhg` is emitted. Stats: 1622 lines added / 57 changed across 12 files. 226 lib tests passing (227 with the 7 dump-parser tests + 6 transition tests added on this branch); all pre-existing unit + integration tests still green. References: - ADR-0059 (merged #83): design. - Slice 1 (#84), 2 (#86), 3a (#87): foundation. - Research findings on member-order canonicalization, atomic replace semantics, CVE test scope, `bridge fdb show` regex, `ip nexthop show` format, Docker capabilities, per-VTEP NH key, RTNLGRP_NEIGH coverage gaps all folded into the implementation.
There was a problem hiding this comment.
Pull request overview
Wires ADR-0059 slice 3b into the EVPN Linux dataplane apply path: multi-homed Type 2 MACs are now programmed using kernel FDB nexthop groups (NHGs) with startup adoption of tagged NHIDs and an end-to-end privileged netns integration test.
Changes:
- Extend
compute_diffto emit FDB-NHG ops (InstallFdbNhg/UpdateFdbNhgMembers/RemoveFdbNhg) and add IPv6 graceful fallback. - Add reconcile-actor coordination for FDB-NHG lifecycle (allocator + refcounts + adoption/cleanup).
- Implement
NexthopOpsfor Linux + in-memory dataplanes and addRTM_GETNEXTHOPdump support + Docker/netns harness updates.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| src/evpn_dataplane.rs | Tightens supervisor spawn generic bound to require NexthopOps. |
| ROADMAP.md | Updates roadmap status for ADR-0059 slice chain. |
| CHANGELOG.md | Adds unreleased entry describing slice 3b operational change and wiring. |
| crates/evpn-linux/src/diff.rs | Emits FDB-NHG ops (Pass 1b), handles transitions, adds tests. |
| crates/evpn-linux/src/reconcile.rs | Adds NHG coordinator apply path, startup adoption + stale cleanup, actor state. |
| crates/evpn-linux/src/linux/mod.rs | Adds dedicated nexthop socket and NexthopOps impl for LinuxDataplane. |
| crates/evpn-linux/src/linux/fdb_nhg.rs | Updates module docs to reflect slice 3b wiring status. |
| crates/evpn-linux/src/linux/nexthop_raw/encode.rs | Adds RTM_GETNEXTHOP dump encoder. |
| crates/evpn-linux/src/linux/nexthop_raw/socket.rs | Implements multipart dump parsing + dump_owned() with unit tests. |
| crates/evpn-linux/src/linux/nexthop_raw/uapi.rs | Adds missing UAPI consts needed for dump tolerance. |
| crates/evpn-linux/src/in_memory.rs | Implements NexthopOps in the in-memory dataplane for coordinator tests. |
| crates/evpn-linux/tests/netns_fdb_nhg.rs | Adds privileged netns integration test for FDB-NHG programming + CVE guard. |
| crates/evpn-linux/tests/docker/run-netns-tests.sh | Adds selectors to run the new netns test binary and subtests. |
Comments suppressed due to low confidence (2)
crates/evpn-linux/src/reconcile.rs:917
cleanup_unreferenced_adoptionsreleases NHID allocator slots even whendel_nexthop(id)fails. If the kernel object is still present, releasing the ID can allow a later allocation to reuse it and inadvertently REPLACE/overwrite a still-live kernel nexthop. Safer approach: onlyrelease(id)after a confirmed successful delete (including idempotent ENOENT), and keep failed IDs inadopted_unreferencedso cleanup can retry on the next pass (which likely also implies not flippingadoption_doneyet).
if let Err(e) = self.dataplane.del_nexthop(id).await {
tracing::debug!(?e, id, "adoption cleanup: del_nexthop failed (best-effort)");
}
self.state.nh_id_alloc.release(id);
}
crates/evpn-linux/src/reconcile.rs:1419
- In the
RemoveFdbNhgpath,del_nexthoperrors are ignored (let _ = ...) but the allocator slot is still released immediately afterward. If the delete failed and the kernel object remains, the allocator may later reuse the same ID andNLM_F_REPLACEcould overwrite a still-live nexthop/group. Consider releasing IDs only after a confirmed successful delete (including idempotent ENOENT), and preserving enough state to retry cleanup on the next reconcile pass when deletion fails.
RefDelta::GroupShouldDelete { id, members } => {
let _ = dataplane.del_nexthop(id).await;
state.nh_id_alloc.release(id);
for ip in members {
if let Some(vtep_id) = state.groups.record_member_unref(ip, *group_key) {
let _ = dataplane.del_nexthop(vtep_id).await;
state.nh_id_alloc.release(vtep_id);
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -143,6 +209,168 @@ pub fn compute_diff( | |||
| Plan { ops } | |||
| // Single-dst → FDB-NHG transition. Remove the old row, | ||
| // install the group-backed one. | ||
| updates.push(DataplaneOp::RemoveRemoteFdb { vni, mac }); |
| // Gated to the first pass so we don't re-walk an | ||
| // already-empty map on every reconcile. | ||
| if !self.state.adoption_done { |
| // Step 1: install any per-VTEP members not yet present. | ||
| for ip in members { | ||
| if state.groups.vtep_nh(ip).is_none() { | ||
| let id = state.nh_id_alloc.alloc_vtep_nh().map_err(|e| { | ||
| crate::error::DataplaneError::Other(format!( | ||
| "ADR-0059: vtep NH alloc failed: {e}" | ||
| )) | ||
| })?; | ||
| dataplane.add_nexthop_member(id, *ip).await?; | ||
| state.groups.record_member_install(*ip, id); | ||
| state.adopted_unreferenced.remove(&id); |
|
|
||
| // GroupOwned and pre-existing struct imports kept used. The | ||
| // `GroupOwned` type isn't directly constructed in these tests | ||
| // but is referenced indirectly via `GroupOwnedMap`'s helpers. | ||
| fn _silence_unused_import_warning() -> Option<GroupOwned> { | ||
| None | ||
| } |
| - [x] **EVPN Phase 3 partial — Gate 8b enforcement intent foundation** (v0.17.0) — DF-election role state now feeds the Linux dataplane supervisor as a portable `(ESI, VNI)` BUM-enforcement table. The reconciler resolves bridge, VXLAN ifindex, and CE-facing port identity and reports the desired action (`allow` for DF, `suppress` for Non-DF) through `DataplaneReport.bum_enforcement`. Observable-only: no kernel filter mutation yet. | ||
| - [x] **EVPN Phase 3 partial — Gate 9 slice 6 symmetric Interface-less IRB end-to-end** (v0.18.0, ADR-0058) — `[[evpn_ip_vrfs]]` TOML schema with VRF / L3VXLAN device binding and operator-supplied Router MAC, pure-logic Type 5 origination + projection helpers (RFC 9136 §4.4.2 Interface-less IRB), `IpVrfStatus` readiness probe checking the seven ADR-0058 §3 predicates, Linux rtnetlink dumps for VRF / L3VXLAN inventory, and `Dataplane::probe_ip_vrfs` called every reconcile pass with readiness surfaced via `tracing::info!` / `warn!` + `DataplaneReport.ip_vrf_status` + `EvpnService.ListIpVrfs` / `GetIpVrf` gRPC + `rustbgpctl evpn vrfs`. Slice 6 PR A (#77) wires daemon-side origination (kernel-route dump + conservative classifier + L3 originator task + `originated_routes_count`); slice 6 PR B (#78) wires dataplane-side import (best-path subscription + projection + transactional `L3OwnedState` with value-aware drift detection + four-phase apply ordering + Router MAC conflict detection + foreign-state preservation + `installed_routes_count`). M39 manual containerlab smoke validates the bidirectional symmetric IRB datapath against FRR 10.3.1. | ||
| - [ ] **EVPN Phase 3 — Multi-homing enforcement (Gate 8b residual) + Gate 9 overlay-index IRB** — Gate 8b residuals: aliasing dataplane ECMP per [ADR-0059](docs/adr/0059-evpn-aliasing-fdb-nexthop-groups.md) (FDB nexthop groups via `NDA_NH_ID` + `NHA_FDB`; raw-netlink construction because `rtnetlink 0.21` has no nexthop API). Slice 1 (portable intent — `RemoteMacEntry::alias_group_key`) and slice 2 (`nexthop_raw` netlink primitive) have shipped on `main`; **slice 3 is in flight, split into 3a (state types + `NexthopOps` trait + `fdb_nhg` apply + CVE-2025-39851 guard, infrastructure only) and 3b (diff Pass 1b + reconcile actor coordinator + NHID startup adoption + Docker-runnable netns test, where operational behavior change lands)**. Slice 3.5 deferred: periodic `RTM_GETNEXTHOP` drift recovery + `RTNLGRP_NEXTHOP` (out of scope for the ADR). Slice 4 (M40 manual containerlab smoke against FRR) is the final piece. Then production-default multi-homing enforcement after the 24 h soak. Gate 9 follow-ups: overlay-index IRB (RFC 9135 — Gate 9 slice 6 ships the Interface-less variant only), auto-derived Route Targets (RFC 8365 §5.1.2.1), privileged-runner CI gate for the M39 smoke. Needed for active-active ToR deployments and broader tenant-routing topologies. | ||
| - [ ] **EVPN Phase 3 — Multi-homing enforcement (Gate 8b residual) + Gate 9 overlay-index IRB** — Gate 8b residuals: aliasing dataplane ECMP per [ADR-0059](docs/adr/0059-evpn-aliasing-fdb-nexthop-groups.md) (FDB nexthop groups via `NDA_NH_ID` + `NHA_FDB`; raw-netlink construction because `rtnetlink 0.21` has no nexthop API). Slice 1 (portable intent — `RemoteMacEntry::alias_group_key`), slice 2 (`nexthop_raw` netlink primitive), slice 3a (state types + apply primitive + CVE-2025-39851 guard, infrastructure-only), and slice 3b (diff Pass 1b + reconcile actor coordinator + `NexthopOps` impls + startup NHID adoption + Docker-runnable netns test — **operational behavior change**) have all shipped on `main`. Slice 4 (M40 manual containerlab smoke against FRR) is the remaining piece before production-default multi-homing enforcement after the 24 h soak. Slice 3.5 deferred follow-ups: periodic `RTM_GETNEXTHOP` drift recovery (research §8 confirms forced NHG deletion doesn't emit RTNLGRP_NEIGH events), `apply_aliasing_ecmp` operator toggle, IPv6 alias members. Gate 9 follow-ups: overlay-index IRB (RFC 9135 — Gate 9 slice 6 ships the Interface-less variant only), auto-derived Route Targets (RFC 8365 §5.1.2.1), privileged-runner CI gate for the M39 smoke. Needed for active-active ToR deployments and broader tenant-routing topologies. |
| let nh_after = ip_nexthop_show(); | ||
| assert!( | ||
| nh_after.trim().is_empty() | ||
| || (!nh_after.contains("0x30000001") && !nh_after.contains("0x40000001")), | ||
| "nexthop table should be empty (of ours): {nh_after}", |
Apply 7 review findings on the slice 3b coordinator PR: 1. Plan ordering: emit deletes before creates+updates so transitions (SingleDst↔FdbNhg, group-key drift) sequence remove → add at the kernel-op level, not just per-op. 2. SingleDst→FdbNhg `RemoveRemoteFdb` gated on kernel snapshot presence — keeps the op idempotent when the prior row never actually landed (failed install carried in `last_applied`). 3. Adoption cleanup gated on `failed.is_empty()` so a partial first-reconcile pass doesn't wipe stale-but-still-needed adopted NHIDs before the second pass gets a chance. 4. `apply_nhg_op` rolls back the allocator slot (`nh_id_alloc.release(id)`) on netlink failure so a transient kernel error doesn't leak bitmap IDs in either the install or update-members path. 5. Remove the unused `GroupOwned` import and the `_silence_unused_import_warning` helper from diff.rs. 6. ROADMAP slice 3b reads "in flight, pending merge" rather than claiming shipped on main. 7. Netns post-teardown emptiness asserted via `dump_owned_nexthops()` instead of hex-substring matching on `ip nexthop show` (which prints IDs in decimal — the prior assertion could pass even when entries were live).
| state | ||
| .groups | ||
| .record_group_member_change(*group_key, new_members); |
| // No record of the group — log + skip; the next | ||
| // InstallFdbNhg-driven pass will create it. | ||
| tracing::warn!( | ||
| ?group_key, | ||
| "UpdateFdbNhgMembers: group not in owned-state map; skipping" | ||
| ); | ||
| return Ok(()); |
| if !self.state.adoption_done { | ||
| match self.dataplane.dump_owned_nexthops().await { | ||
| Ok(adopted) => { | ||
| for nh in adopted { | ||
| match self.state.nh_id_alloc.reserve(nh.id) { | ||
| Ok(()) => { | ||
| self.state.adopted_unreferenced.insert(nh.id, nh); | ||
| } | ||
| Err(e) => { | ||
| tracing::warn!( | ||
| ?e, | ||
| id = nh.id, | ||
| "adoption: reserve failed; ignoring" | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| } | ||
| Err(e) => { | ||
| // Dump failed — log + continue. The actor will | ||
| // re-attempt adoption on the next pass since | ||
| // adoption_done stays false until apply_plan | ||
| // succeeds at least once. | ||
| tracing::warn!(error = %e, "adoption: dump_owned_nexthops failed"); | ||
| } | ||
| } |
| // Drain to avoid holding the borrow across the await. | ||
| let stale: Vec<u32> = self.state.adopted_unreferenced.keys().copied().collect(); | ||
| for id in stale { | ||
| // del_nexthop is idempotent on ENOENT (slice 2). Errors | ||
| // are logged at debug; the next reconcile won't try | ||
| // again because adoption_done is set after this loop. | ||
| if let Err(e) = self.dataplane.del_nexthop(id).await { | ||
| tracing::debug!(?e, id, "adoption cleanup: del_nexthop failed (best-effort)"); | ||
| } | ||
| self.state.nh_id_alloc.release(id); | ||
| } | ||
| self.state.adopted_unreferenced.clear(); |
Four findings on the previous fixup pass, all in `reconcile.rs`: 1. `InstallFdbNhg` drift path: when an existing group's member set drifts, `record_group_member_change` returns the removed IPs; previously that return was discarded, leaving per-VTEP nexthops orphaned in `vtep_nhs` and the kernel. Now mirrors the `UpdateFdbNhgMembers` path — `record_member_unref` → `del_nexthop` → allocator release for each removed IP whose last group-ref dropped. 2. `UpdateFdbNhgMembers` missing-group: previously logged + returned `Ok(())`, which cleared retry state on a no-op. `compute_diff` only emits this op when the group exists in `GroupOwnedMap`, so the branch indicates `owned` and `groups` have drifted out of sync. Now returns a transient `Other` error so the actor retries / surfaces it. 3. Adoption completion: track `dump_succeeded` from the `dump_owned_nexthops()` call and AND it into the `adoption_done = true` gate. Without this, a dump failure on the first pass would let `adoption_done` flip true while the allocator had reserved no kernel NHIDs, permanently stranding collision risk for future `InstallFdbNhg` ops. 4. `cleanup_unreferenced_adoptions`: now returns whether every staged delete succeeded; only release the allocator slot on a per-ID `Ok`. On `Err`, leave the slot reserved + the entry in `adopted_unreferenced` so the next reconcile pass retries — releasing on transient netlink failure would let a future `alloc_vtep_nh` / `alloc_nhg` hand out an ID still live in the kernel. `adoption_done` only flips true when cleanup returned `true`.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (2)
crates/evpn-linux/src/reconcile.rs:1472
- Same issue as above: this GC loop ignores
del_nexthoperrors but still releases the allocator slot. Releasing on failure can lead to reusing an ID that is still present in the kernel. Prefer releasing only on successful delete (del already treats ENOENT as Ok), and keep failed deletions reserved + queued for retry.
for ip in removed {
if let Some(vtep_id) = state.groups.record_member_unref(ip, *group_key) {
let _ = dataplane.del_nexthop(vtep_id).await;
state.nh_id_alloc.release(vtep_id);
}
crates/evpn-linux/src/reconcile.rs:1493
- On group teardown,
del_nexthoperrors are ignored but the group/member IDs are released anyway. If a delete fails (e.g., transient netlink error), the allocator can hand out an ID that still exists in the kernel. Recommend releasing IDs only afterdel_nexthopsucceeds, and retaining enough state to retry deletion later while keeping the IDs reserved.
RefDelta::GroupShouldDelete { id, members } => {
let _ = dataplane.del_nexthop(id).await;
state.nh_id_alloc.release(id);
for ip in members {
if let Some(vtep_id) = state.groups.record_member_unref(ip, *group_key) {
let _ = dataplane.del_nexthop(vtep_id).await;
state.nh_id_alloc.release(vtep_id);
}
| let _ = dataplane.del_nexthop(vtep_id).await; | ||
| state.nh_id_alloc.release(vtep_id); |
| let res = match op { | ||
| DataplaneOp::InstallFdbNhg { .. } | ||
| | DataplaneOp::UpdateFdbNhgMembers { .. } | ||
| | DataplaneOp::RemoveFdbNhg { .. } => { | ||
| apply_nhg_op(&mut self.dataplane, &mut self.state, op).await | ||
| } | ||
| _ => self.dataplane.apply(op).await, | ||
| }; |
| /// presence of `NHA_FDB`. ADR-0059 slice 3b uses this at | ||
| /// `LinuxDataplane::connect()` time so the allocator can |
Three findings on the second fixup pass: 1. Steady-state GC paths (InstallFdbNhg drift, UpdateFdbNhgMembers, RemoveFdbNhg) released allocator slots unconditionally even when `del_nexthop` errored — same allocator-collision risk that the previous round's `cleanup_unreferenced_adoptions` fix addressed, but applied only to startup adoption. Extract the pattern into a shared `try_del_and_release_alloc` helper used by all three apply-time GC sites: release only when `del_nexthop` returned `Ok`, otherwise log + retain the slot to prevent a future `alloc_vtep_nh` / `alloc_nhg` from re-handing an in-kernel ID. 2. Retry/permanent-failure key for `UpdateFdbNhgMembers` was `(VNI, MacAddress::zero())` because the op has no per-MAC identity — every group update within the same VNI collided on a placeholder key. Adds a dedicated `nhg_retry: RetrySchedule<AliasGroupKey>` and `nhg_permanent_failures: BTreeMap<AliasGroupKey, DataplaneOp>` to `ActorState`, mirroring the BUM (ifindex-keyed) precedent. The apply loop, retry-deadline aggregation, permanent-suppression check, transient-retry record, and success path all dispatch `UpdateFdbNhgMembers` through the new key space. Extracts the permanent-suppression-check pattern into a generic `check_permanent_suppression` helper so the 3-way dispatch stays readable. 3. `NexthopSocket::dump_owned` doc said it ran "at `LinuxDataplane::connect()` time"; adoption was deferred to the reconcile actor's first pass per the slice 3b plan's ownership- boundary decision. Update the doc to match the call site.
| fn map_nexthop_error(e: nexthop_raw::NexthopError) -> DataplaneError { | ||
| match e { | ||
| nexthop_raw::NexthopError::Io(io) => DataplaneError::Io(io), | ||
| nexthop_raw::NexthopError::Kernel(errno) => { | ||
| DataplaneError::InvalidArgument(format!("nexthop kernel errno {errno}")) | ||
| } | ||
| nexthop_raw::NexthopError::Validation(v) => { | ||
| DataplaneError::InvalidArgument(format!("nexthop validation: {v}")) | ||
| } | ||
| nexthop_raw::NexthopError::Ipv6Unsupported => { | ||
| DataplaneError::InvalidArgument("nexthop IPv6 unsupported (slice 2.5 follow-up)".into()) | ||
| } | ||
| other => DataplaneError::Other(format!("nexthop socket: {other}")), | ||
| } |
| // IPv6 fallback — slice 2 `NexthopSocket` rejects v6 | ||
| // gateways. Diff degrades to single-dst (primary VTEP | ||
| // only) and emits one warn per (VNI, MAC) so the | ||
| // operator sees the degradation. | ||
| tracing::warn!( | ||
| ?vni, | ||
| mac = %mac, | ||
| "ADR-0059 IPv6 alias members not yet supported (slice 2.5 follow-up); \ | ||
| falling back to single-dst FDB row at primary VTEP", | ||
| ); |
| /// Build a bridge + VXLAN port with `nolearning` in the current netns. | ||
| /// Returns the VNI for downstream test setup. | ||
| async fn build_l2_topology() -> EvpnInstanceId { | ||
| run("ip", &["addr", "add", "127.0.0.10/32", "dev", "lo"]); | ||
| run("ip", &["link", "set", "lo", "up"]); | ||
| run("ip", &["link", "add", "name", "br100", "type", "bridge"]); | ||
| run("ip", &["link", "set", "br100", "up"]); | ||
| run( | ||
| "ip", | ||
| &[ | ||
| "link", | ||
| "add", | ||
| "name", | ||
| "vxlan100", | ||
| "type", | ||
| "vxlan", | ||
| "id", | ||
| "100", | ||
| "local", | ||
| "127.0.0.10", | ||
| "dstport", | ||
| "4789", | ||
| "nolearning", | ||
| ], | ||
| ); | ||
| run("ip", &["link", "set", "vxlan100", "master", "br100"]); | ||
| run("ip", &["link", "set", "vxlan100", "up"]); | ||
| EvpnInstanceId::new(100).expect("VNI 100") | ||
| } |
Three findings on the third fixup pass: 1. `map_nexthop_error` collapsed every `NexthopError::Kernel(errno)` into `DataplaneError::InvalidArgument`, which the actor classifies as `Permanent`. That meant transient errnos (`EBUSY`, `ENOMEM`, `ENOBUFS`) got permanently suppressed, stranding allocator reservations and kernel state until restart. Splits the errno classification out into `map_nexthop_kernel_errno`, mirroring the dispatch in `linux::fdb::errno_to_dataplane_error`: `EPERM`/`EACCES` → `PermissionDenied`, `EOPNOTSUPP` → `KernelTooOld` (with the NDA_NH_ID kernel-version hint), `EINVAL` → `InvalidArgument`, everything else → `Other` (transient). 2. IPv6-alias fallback `tracing::warn!` fired in `compute_diff` on every reconcile pass — sustained log spam for stable configurations with v6 multi-homed entries. Moves the warn-trigger responsibility to the reconcile actor: `compute_diff` now returns the `(VNI, MAC)` set currently in fallback on `Plan`, and the actor warns only for `(VNI, MAC)` keys that newly entered the set since the previous pass. Keys that drop out are pruned so a future re-entry produces a fresh warn. 3. `build_l2_topology` in the netns test was `async fn` with no `.await` points — trips `clippy::unused_async` (not enabled by default but flagged on review). Drops `async` + adjusts the one call site.
| // (FDB row deleted out-of-band), re-emit Install for | ||
| // this MAC. ADR §5 invariant 8: idempotent on | ||
| // EEXIST so this is safe even when the row is still | ||
| // there. | ||
| let kernel_has_row = snapshot | ||
| .find_fdb(vni, mac) | ||
| .is_some_and(|k| k.nh_id.is_some()); |
| /// Best-effort delete of a tagged kernel nexthop, releasing the | ||
| /// allocator slot only if the delete succeeded. On `Err` the slot | ||
| /// stays reserved so a future `alloc_vtep_nh` / `alloc_nhg` cannot | ||
| /// hand out an ID that is still live in the kernel — the failure | ||
| /// scenario flagged by ADR-0059 review. The leaked slot is bounded | ||
| /// (per-ID, per-pass) and reclaimable on a fresh daemon start via | ||
| /// the slice 3b adoption pass; periodic drift recovery (slice 3.5) | ||
| /// will eventually retry orphan deletions. | ||
| async fn try_del_and_release_alloc<D: crate::dataplane::NexthopOps>( | ||
| dataplane: &mut D, | ||
| alloc: &mut crate::nh_id_alloc::NhIdAllocator, | ||
| id: u32, | ||
| site: &'static str, | ||
| ) { | ||
| match dataplane.del_nexthop(id).await { | ||
| Ok(()) => alloc.release(id), | ||
| Err(e) => tracing::warn!( | ||
| ?e, | ||
| id, | ||
| site, | ||
| "FDB-NHG GC: del_nexthop failed; allocator slot retained to prevent kernel-ID collision" | ||
| ), | ||
| } |
Two findings on the fourth fixup pass:
1. FDB-NHG same-key drift check (`compute_diff` Pass 1b) only
verified that the kernel snapshot row had *some* `nh_id`, not
that it matched the locally-tracked group ID. A manual
`bridge fdb replace` or partial-failure carryover could leave
the row pointing at a different NHG and we'd silently forward
traffic via the wrong group. Restructured the per-MAC check to
require:
- the snapshot row exists, AND
- it has `nh_id == Some(local_id)` for the group we track
under `linux_key`.
Any other state (missing row, single-dst row with no NHID,
different NHID, or local group missing) re-emits Install —
safe via NLM_F_REPLACE. Folds the previous group-missing heal
into this single check, which dedupes naturally across MACs
via the apply path's `record_group_install`.
2. `try_del_and_release_alloc` previously logged + retained the
allocator slot on `del_nexthop` failure but had no retry path.
The surrounding op (`RemoveFdbNhg` last-ref teardown,
`UpdateFdbNhgMembers` member GC, Install-drift heal) still
returned `Ok(())` and the actor dropped the `GroupOwnedMap` /
`OwnedSet` entries — so on `EBUSY` / `ENOMEM` the kernel
orphan was permanently stranded with no signal for compute_diff
to re-emit a Remove. Now queues failed IDs into
`state.pending_deletes` (allocator slot still retained), and
adds `drain_pending_deletes` that runs after `apply_plan` each
pass — retries every queued ID, releases the slot + drops from
the queue on success, leaves entries for next pass on persistent
failure. Independent of the one-shot `cleanup_unreferenced_adoptions`
(which walks startup-adopted IDs); steady-state GC failures land
in the new queue.
| // Fresh install for this MAC. | ||
| creates.push(DataplaneOp::InstallFdbNhg { | ||
| vni, | ||
| mac, | ||
| group_key: linux_key, | ||
| members: canonical, | ||
| }); |
| let mut dump_succeeded = true; | ||
| if !self.state.adoption_done { | ||
| match self.dataplane.dump_owned_nexthops().await { | ||
| Ok(adopted) => { | ||
| for nh in adopted { | ||
| match self.state.nh_id_alloc.reserve(nh.id) { | ||
| Ok(()) => { | ||
| self.state.adopted_unreferenced.insert(nh.id, nh); | ||
| } | ||
| Err(e) => { | ||
| tracing::warn!( | ||
| ?e, | ||
| id = nh.id, | ||
| "adoption: reserve failed; ignoring" | ||
| ); | ||
| } | ||
| } | ||
| } | ||
| } | ||
| Err(e) => { | ||
| // Dump failed — log + leave adoption_done false so | ||
| // the next reconcile pass re-attempts the dump | ||
| // before any FDB-NHG ops are applied against an | ||
| // under-reserved allocator. | ||
| tracing::warn!(error = %e, "adoption: dump_owned_nexthops failed"); | ||
| dump_succeeded = false; | ||
| } | ||
| } |
Two findings on the fifth fixup pass:
1. `compute_diff` Pass 1b `None` (fresh-install) arm emitted
`InstallFdbNhg` based solely on `last_applied` being empty,
without consulting the kernel snapshot. Two ways this hurt:
- foreign-entry preservation violation: an operator's static
FDB row at the same `(vni, mac)` would be overwritten by
NLM_F_REPLACE.
- restart case: `last_applied` is empty post-restart but the
kernel may still hold rustbgpd-owned rows from the prior
instance — fine, but we'd also overwrite foreign ones.
Mirrors `handle_existing_kernel_entry`'s gate: install only when
the snapshot row is absent or has `extern_learn` set (our marker).
Otherwise log + skip the install.
2. Startup-adoption dump failure was tracked via `dump_succeeded`
but the reconcile pass kept going into `compute_diff` + `apply_plan`
anyway, with an allocator that hadn't reserved pre-existing
kernel NHIDs — collision risk that NLM_F_REPLACE turns into
silent kernel-state overwrite. Moved the adoption dump to after
`dump_snapshot` (so the report context is available) and changed
the failure path to short-circuit: emit a no-ops report and
return, deferring to the next reconcile pass. The
`dump_succeeded` gate at the cleanup step is now unnecessary
(reaching that point implies the dump succeeded) so it's
removed in favor of a simpler `failed.is_empty()` check.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
crates/evpn-linux/src/reconcile.rs:1554
- If
add_nexthop_group(...).await?fails, any new per-VTEP members installed above remain present but (sincerecord_group_member_changeis not reached) are not attached to the group inGroupOwnedMap. A permanent failure would leave these members orphaned and allocator slots reserved. Consider best-effort rollback/enqueue of member IDs created in this Update op when the group REPLACE step fails.
// Compute new member-id list and REPLACE the group.
let member_ids: Vec<u32> = members
.iter()
.map(|ip| state.groups.vtep_nh(ip).expect("just installed").id)
.collect();
let Some(g_id) = state.groups.group(group_key).map(|g| g.id) else {
// `compute_diff` Pass 1b only emits `UpdateFdbNhgMembers`
// when the group exists in `GroupOwnedMap`, so hitting
// this branch means `owned` and `groups` have drifted
// out of sync — an internal state inconsistency.
// Surface it as a transient error (classified `Other`
// → retried) so it shows up in metrics rather than
// silently clearing retry state.
return Err(crate::error::DataplaneError::Other(format!(
"ADR-0059: UpdateFdbNhgMembers for {group_key:?} but group missing \
from GroupOwnedMap (owned/groups state drift)",
)));
};
dataplane.add_nexthop_group(g_id, &member_ids).await?;
let new_members: BTreeSet<_> = members.iter().copied().collect();
| let id = state.nh_id_alloc.alloc_nhg().map_err(|e| { | ||
| crate::error::DataplaneError::Other(format!("ADR-0059: nhg alloc failed: {e}")) | ||
| })?; | ||
| if let Err(e) = dataplane.add_nexthop_group(id, &member_ids).await { | ||
| state.nh_id_alloc.release(id); | ||
| return Err(e); | ||
| } |
| // Step 3: install the FDB row pointing at the group. | ||
| // Failure here leaves the group + members installed in | ||
| // `state.groups` — they're refcounted, so the next pass's | ||
| // Install/Update for any MAC sharing this `group_key` | ||
| // will reuse them. Per-VTEP members not yet ref'd by any | ||
| // MAC stay tracked in `groups.vtep_nhs`; the GC happens | ||
| // on the corresponding RemoveFdbNhg path's last-ref unref. | ||
| dataplane.install_fdb_nhg_row(*vni, *mac, group_id).await?; | ||
| state.groups.record_mac_ref(*group_key, *vni, *mac); |
| ) { | ||
| match dataplane.del_nexthop(id).await { | ||
| Ok(()) => state.nh_id_alloc.release(id), | ||
| Err(e) => { | ||
| tracing::warn!( | ||
| ?e, | ||
| id, | ||
| site, | ||
| "FDB-NHG GC: del_nexthop failed; queued for retry, allocator slot retained" | ||
| ); | ||
| state.pending_deletes.insert(id); | ||
| } | ||
| } |
Three findings on the sixth fixup pass: 1. `apply_nhg_op InstallFdbNhg` Step 2 (group install) or Step 3 (`install_fdb_nhg_row`) failure left any per-VTEP members newly-created in Step 1 (and the newly-created group on Step 3 failure) installed in the kernel with zero MAC refs — the actor would record the op as failed, classify as `Permanent` on e.g. permission/CVE-guard errors, and never have a signal to re-emit a Remove. Adds `rollback_partial_install` helper + new `GroupOwnedMap::drop_unreferenced_group` method that walks the group + members in reverse-creation order, clears local tracking, and routes each delete through `try_del_and_release_alloc` (so transient rollback delete failures land in the retry queue). 2. Same shape on `UpdateFdbNhgMembers`: Step 1 created members, Step 2 (group REPLACE) failure or the owned/groups-drift error path could leak the newly-installed per-VTEP members. Tracks new members + rolls them back on any later-step failure. 3. `try_del_and_release_alloc` enqueued every delete failure for retry, including permanent classes (`PermissionDenied`, `KernelTooOld`) that can't be cured by retrying. Now classifies via `DataplaneError::class()`: transient/conflict → queue + retry; permanent → warn once + retain slot (operator-intervention territory). Stops the runaway retry loop + repeated warns on permanent failure scenarios.
| @@ -47,6 +51,13 @@ pub struct Plan { | |||
| /// (sorted by `(VNI, MAC)`), then updates, then deletes. The | |||
| /// actor applies them serially with backoff per-op on failure. | |||
| pub ops: Vec<DataplaneOp>, | |||
| let status = build_instance_status(&intent.instances, &probes); | ||
| let bum_enforcement = build_bum_enforcement_status( | ||
| &intent.bum_enforcement, | ||
| &KernelSnapshot::new(), |
| && let Some((_id_from_map, _members)) = state.groups.drop_unreferenced_group(&group_key) | ||
| { | ||
| try_del_and_release_alloc(dataplane, state, group_id, site).await; |
| ) { | ||
| use crate::error::FailureClass; | ||
| match dataplane.del_nexthop(id).await { | ||
| Ok(()) => state.nh_id_alloc.release(id), |
Four small findings on the seventh fixup pass: 1. `Plan::ops` docstring claimed creates → updates → deletes ordering, but the build order changed to deletes → creates → updates in an earlier round to keep transition pairs (Remove → Install) safe within a single plan. Update the doc to match. 2. Adoption-dump-failure short-circuit built BUM enforcement against `KernelSnapshot::new()` when the real `snapshot` was already available — would report BUM readiness/action as if no links existed. Switch to the real snapshot. 3. `rollback_partial_install` ignored the `(group_id, members)` returned by `drop_unreferenced_group` and instead used the caller-supplied `group_id`. Source these from the map (the ID we actually tracked there) with a debug_assert on equality, so any future state-drift bug surfaces at test time rather than silently deleting the wrong kernel object. 4. `try_del_and_release_alloc` released the allocator slot on `Ok` but didn't remove the ID from `state.pending_deletes`. A prior- pass-queued ID that got cleared by the steady-state path would stay queued and re-attempt every reconcile pass (kernel returns ENOENT idempotent-ACK, but the trace spam adds up). Now drops the entry from `pending_deletes` on success.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
crates/evpn-linux/src/in_memory.rs:351
- Similarly,
remove_fdb_nhg_rowonly updatesfdb_nhg_rows; if you makeinstall_fdb_nhg_rowupdatestate.kernelfor snapshot consistency, this should also remove/clear the correspondingKernelSnapshotFDB entry socompute_diffconverges correctly in in-memory actor tests.
async fn remove_fdb_nhg_row(
&mut self,
vni: EvpnInstanceId,
mac: MacAddress,
) -> Result<(), DataplaneError> {
let mut state = self.state.lock().expect("poisoned");
// Idempotent — slice 3b coordinator may issue this on
// already-removed rows during stale cleanup.
state.fdb_nhg_rows.remove(&(vni, mac));
Ok(())
| .group(group_key) | ||
| .map(|g| (g.id, g.members.iter().copied().collect::<Vec<_>>())); | ||
| let group_id = if let Some((g_id, existing_members)) = existing { | ||
| if existing_members != *members { |
| async fn install_fdb_nhg_row( | ||
| &mut self, | ||
| vni: EvpnInstanceId, | ||
| mac: MacAddress, | ||
| nh_id: u32, | ||
| ) -> Result<(), DataplaneError> { | ||
| let mut state = self.state.lock().expect("poisoned"); | ||
| state.fdb_nhg_rows.insert((vni, mac), nh_id); | ||
| Ok(()) | ||
| } |
| # Test selector. `bum_*` runs the Gate 8b harness; `fdb_nhg` runs | ||
| # the ADR-0059 slice 3b FDB nexthop group integration test. | ||
| TEST_BIN="netns_bum_filter" | ||
| case "${1:-all}" in | ||
| spike) FILTER="bum_filter_spike_validates_kernel_primitive" ;; | ||
| roundtrip) FILTER="linux_dataplane_set_bum_port_flags_round_trip" ;; | ||
| all|"") FILTER="" ;; | ||
| fdb_nhg) TEST_BIN="netns_fdb_nhg"; FILTER="" ;; | ||
| fdb_nhg_roundtrip) TEST_BIN="netns_fdb_nhg"; FILTER="round_trip_install_and_remove_fdb_nhg" ;; | ||
| fdb_nhg_cve) TEST_BIN="netns_fdb_nhg"; FILTER="cve_guard_blocks_install_when_learning_enabled" ;; |
Three findings on the eighth fixup pass: 1. `InMemoryDataplane::install_fdb_nhg_row` recorded the row in `fdb_nhg_rows` but never mirrored it into `state.kernel`. The slice 3b drift check in `compute_diff` Pass 1b reads from the snapshot returned by `dump_snapshot()`, which only reflects `state.kernel`, so a unit test using the in-memory backend would never see the row and re-emit `InstallFdbNhg` forever. Insert a matching `KernelFdbEntry` (`nh_id: Some(_)`, `dst: None`, `extern_learn: true`) on install; remove it on `remove_fdb_nhg_row`. 2. `existing_members != *members` in the Install drift heal: readable as written (the `!=` operator's `PartialEq::ne` takes references, no move), but rewritten to `.as_slice() != .as_slice()` for clarity. 3. `run-netns-tests.sh` header described only the Gate 8b selectors; the script now also dispatches `fdb_nhg`, `fdb_nhg_roundtrip`, and `fdb_nhg_cve` for slice 3b. Update the usage block to document all selectors + bump the kernel requirement to >= 5.8 for the fdb_nhg bundle (NDA_NH_ID appeared then).
| /// becomes `InvalidArgument` (the actor's classifier maps it to | ||
| /// `Permanent` so a programming bug doesn't backoff-retry forever); |
| // [10.0.0.2]. Expect UpdateFdbNhgMembers emitted once. | ||
| let mut desired = RemoteMacTable::builder(); | ||
| desired | ||
| .insert(vni(100), mac(1), entry_multi_homed("10.0.0.2", &[], 7)) |
Two findings on the ninth fixup pass: 1. `map_nexthop_error` doc said `Kernel(errno)` becomes `InvalidArgument`, but slice-3b round 4 routed it through `map_nexthop_kernel_errno` for per-errno classification (`PermissionDenied` / `KernelTooOld` / `InvalidArgument` / `Other`). Update the doc to match so the retry-vs-suppression behavior isn't misread. 2. `fdb_nhg_member_set_shrinks_to_one_stays_group_backed` test constructed a `RemoteMacEntry` with `alias_group_key = Some(_)` and empty `alias_vtep_ips`, which violates the projection invariant (`empty alias_vtep_ips ⇔ alias_group_key.is_none()` per `projection.rs:259`) — production wire intent can't reach an N=1 group-backed state because the projection collapses to single-dst. Test now exercises N=3 → N=2 instead, which is the realistic "stays group-backed" path and still covers the UpdateFdbNhgMembers code path. Renamed to drop the misleading "to_one" suffix.
| // Per-group member-set diff. Dedupe across MACs that | ||
| // share the group. The group-missing case (our local | ||
| // state doesn't track this `linux_key`) is intentionally | ||
| // *not* healed here — the per-MAC kernel-drift check | ||
| // below emits one Install per MAC, which the apply path | ||
| // dedupes at the group level via `record_group_install`. | ||
| if group_seen.insert(linux_key) | ||
| && let Some(existing) = groups.group(&linux_key) | ||
| { | ||
| let existing_members: Vec<_> = existing.members.iter().copied().collect(); | ||
| if existing_members != canonical { | ||
| updates.push(DataplaneOp::UpdateFdbNhgMembers { | ||
| group_key: linux_key, | ||
| members: canonical.clone(), | ||
| }); | ||
| } | ||
| } | ||
|
|
||
| // Per-MAC kernel-snapshot drift check. Re-emit Install | ||
| // when: | ||
| // - kernel has no FDB row for `(vni, mac)`, | ||
| // - kernel row is single-dst (no `nh_id`), | ||
| // - kernel row points at an `nh_id` that doesn't | ||
| // match our locally-tracked group ID (manual | ||
| // `bridge fdb replace`, partial-failure carryover, | ||
| // or stale state — we'd silently forward via the | ||
| // wrong group otherwise), | ||
| // - or we don't track the group locally | ||
| // (startup-adoption mismatch). | ||
| // `InstallFdbNhg` is idempotent under `NLM_F_REPLACE` so | ||
| // re-emitting when the row is already correct is safe. | ||
| let kernel_row_correct = snapshot | ||
| .find_fdb(vni, mac) | ||
| .and_then(|k| k.nh_id) | ||
| .is_some_and(|kernel_id| tracked_group_id == Some(kernel_id)); | ||
| if !kernel_row_correct { | ||
| creates.push(DataplaneOp::InstallFdbNhg { | ||
| vni, | ||
| mac, | ||
| group_key: linux_key, | ||
| members: canonical, | ||
| }); | ||
| } |
One finding on the tenth fixup pass: A single plan could emit both `UpdateFdbNhgMembers` (group member-set drift) and `InstallFdbNhg` (per-MAC kernel-row drift) for the same `group_key`. `apply_nhg_op InstallFdbNhg` already does a REPLACE in its group-already-exists branch, so the subsequent `UpdateFdbNhgMembers` is a redundant second REPLACE on the same NHID — wasted netlink, more complex failure sequencing. Post-process the plan to drop `UpdateFdbNhgMembers` ops whose `group_key` is also covered by an `InstallFdbNhg` in the same plan.
| // Step 3: install the FDB row pointing at the group. On | ||
| // failure, roll back the newly-created group (if any) and | ||
| // members — they have no MAC ref yet, so they're true | ||
| // orphans without rollback. | ||
| if let Err(e) = dataplane.install_fdb_nhg_row(*vni, *mac, group_id).await { | ||
| rollback_partial_install( |
| /// ADR-0059 slice 3b coordinator: apply an FDB-NHG op by calling | ||
| /// [`NexthopOps`] methods in ADR §5 invariant 1+2 order, updating | ||
| /// the allocator + [`GroupOwnedMap`] state. | ||
| /// | ||
| /// Free function (not a method) so it can borrow `dataplane` and | ||
| /// `state` mutably from disjoint fields of the actor. |
Two findings on the eleventh fixup pass: 1. `rollback_partial_install` could corrupt `GroupOwnedMap` on the existing-group drift-heal Step 3 failure path: Step 2's successful REPLACE had already attached the new members to the (still-installed) group via `record_group_member_change`, so the post-rollback `drop_vtep_nh` would remove members that the surviving group still references — leaving the group's `members` set pointing at gone kernel state. Rollback now skips members whose per-VTEP NH still has live group refs (`vtep_nh_is_orphan` returns false); the next reconcile heals the missing FDB row. 2. Shutdown `drain()` issued `RemoveRemoteFdb` for every owned `(VNI, MAC)`, regardless of `OwnedEntryKind`. For FdbNhg entries that's the wrong op shape — `Dataplane::apply` rejects with `InvalidArgument` and the kernel group/members stay orphaned across restarts. Drain now snapshots `(vni, mac, kind)` and dispatches: SingleDst → `RemoveRemoteFdb` (existing path), FdbNhg → `RemoveFdbNhg` routed through `apply_nhg_op` so the FDB row → group → members teardown sequence runs and refcount state stays consistent for sibling MACs that share the group.
| return Err(crate::error::DataplaneError::Other(format!( | ||
| "ADR-0059: vtep NH alloc failed: {e}" | ||
| ))); |
| // Surface it as a transient error (classified `Other` | ||
| // → retried) so it shows up in metrics rather than | ||
| // silently clearing retry state. | ||
| rollback_partial_install(dataplane, state, &new_members, None, "update_no_group") | ||
| .await; | ||
| return Err(crate::error::DataplaneError::Other(format!( | ||
| "ADR-0059: UpdateFdbNhgMembers for {group_key:?} but group missing \ | ||
| from GroupOwnedMap (owned/groups state drift)", | ||
| ))); |
| /// `NHA_GROUP_TYPE` — `u16` discriminator for group flavor | ||
| /// (MPATH vs resilient). Kernel may include this on dump even when | ||
| /// we don't set it on add; slice 3b's parser tolerates and ignores. | ||
| pub(crate) const NHA_GROUP_TYPE: u16 = 3; | ||
|
|
||
| /// `NHA_OP_FLAGS` — bitmask added in newer kernels; emitted on | ||
| /// dump regardless of whether userspace set it. Parser tolerates. | ||
| pub(crate) const NHA_OP_FLAGS: u16 = 15; |
| "adoption: dump_owned_nexthops failed; deferring this reconcile pass to avoid allocator collisions" | ||
| ); | ||
| let status = build_instance_status(&intent.instances, &probes); | ||
| // `dump_snapshot` already succeeded — use the real | ||
| // snapshot so BUM enforcement reflects actual link | ||
| // state, not "no links exist". | ||
| let bum_enforcement = | ||
| build_bum_enforcement_status(&intent.bum_enforcement, &snapshot); | ||
| self.emit_report( | ||
| status, | ||
| vec![], | ||
| vec![], | ||
| bum_enforcement, | ||
| ip_vrf_status, | ||
| ip_vrf_routes, | ||
| ) | ||
| .await; | ||
| return; |
Four findings on the twelfth fixup pass: 1. Allocator failures (`alloc_vtep_nh` / `alloc_nhg` exhaustion or tag-bit collision) returned `DataplaneError::Other` (classified `Transient`), so the actor would infinite-retry an un-recoverable condition. Switch to `InvalidArgument` (Permanent) so the op-shape gets suppressed and the operator sees it surface. 2. `UpdateFdbNhgMembers` "group missing from GroupOwnedMap" branch represents an internal owned/groups state-drift invariant violation — also returned `Other` (transient) and would spin forever. Switch to `InvalidArgument` (Permanent) for the same reason; if state ever drifts, the suppression flags it as a bug rather than hiding it behind retry churn. 3. `constants_match_adr_0059_uapi_table` didn't pin the new `NHA_GROUP_TYPE = 3` and `NHA_OP_FLAGS = 15` constants added for the slice 3b dump-tolerance list. Drift here would silently break the dump-path filter — add the asserts. 4. `dump_owned_nexthops` failure short-circuited the entire reconcile pass regardless of error class. On `KernelTooOld` / `PermissionDenied` / `InvalidArgument` (no `NDA_NH_ID` / no `CAP_NET_ADMIN` / message-shape bug) that would permanently disable single-dst FDB / BUM / L3 too, not just FDB-NHG. Now classifies: Permanent → mark `adoption_done = true` + log once + keep going (FDB-NHG ops in this pass will Permanent-fail at apply time and get per-op-shape suppressed); Transient/Conflict → keep the existing short-circuit so a flaky dump doesn't under-reserve the allocator.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
crates/evpn-linux/src/reconcile.rs:1066
cleanup_unreferenced_adoptionsdeletes stale nexthops in ascending ID order. Since member IDs (0x3000_....) sort before group IDs (0x4000_....), this will try to delete members while their (stale) group still exists, which the kernel typically rejects; that forces at least one extra reconcile pass and keepsadoption_donefalse even if the group delete succeeds later in the same loop. Consider deleting stale NHGs before members (using the storedKernelNexthopKind/tag bits to order), so cleanup can succeed in a single pass and avoid avoidable warnings/retries.
async fn cleanup_unreferenced_adoptions(&mut self) -> bool {
// Drain to avoid holding the borrow across the await.
let stale: Vec<u32> = self.state.adopted_unreferenced.keys().copied().collect();
let mut all_ok = true;
for id in stale {
// del_nexthop is idempotent on ENOENT (slice 2). Only
// release + drop tracking on Ok — on Err, leave the slot
// reserved + the entry in the map so the next reconcile
// pass retries.
match self.dataplane.del_nexthop(id).await {
Ok(()) => {
self.state.nh_id_alloc.release(id);
self.state.adopted_unreferenced.remove(&id);
}
Err(e) => {
tracing::warn!(
?e,
id,
"adoption cleanup: del_nexthop failed; leaving reserved for next pass"
);
all_ok = false;
}
}
}
all_ok
| async fn drain_pending_deletes(&mut self) { | ||
| let ids: Vec<u32> = self.state.pending_deletes.iter().copied().collect(); | ||
| for id in ids { | ||
| match self.dataplane.del_nexthop(id).await { | ||
| Ok(()) => { | ||
| self.state.nh_id_alloc.release(id); | ||
| self.state.pending_deletes.remove(&id); | ||
| tracing::debug!(id, "pending_deletes: drained on retry"); | ||
| } | ||
| Err(e) => { | ||
| tracing::trace!(?e, id, "pending_deletes: still failing"); | ||
| } | ||
| } | ||
| } |
Four findings from a deep review pass on top of the 13 Copilot
rounds:
1. LOAD-BEARING — Startup adoption cleanup could delete still-live
adopted NHIDs after FDB-NHG installs were permanently suppressed.
`apply_plan`'s suppression branch `continue`s out without joining
`failed`, so once a permanent failure was recorded the next pass
saw `failed.is_empty()` and ran cleanup against
`adopted_unreferenced` — including IDs that a kernel FDB row was
still pointing at. `cleanup_unreferenced_adoptions` now:
- Blocks cleanup entirely if any FDB-NHG op is permanently
suppressed (`nhg_permanent_failures` non-empty, or
`permanent_failures` contains InstallFdbNhg / UpdateFdbNhgMembers
/ RemoveFdbNhg).
- Threads `&KernelSnapshot` through; walks `snapshot.iter_fdb()`
to build a retention set of every adopted NHID a kernel FDB
row references, plus recursive members of retained groups.
- Only deletes adopted IDs NOT in the retention set; retained
IDs stay in `adopted_unreferenced` for the next pass to
re-evaluate.
2. Teardown order — both `pending_deletes` and
`cleanup_unreferenced_adoptions` iterate `BTreeSet<u32>` in
ascending order; with the tag scheme (members at 0x3000_xxxx,
groups at 0x4000_xxxx) that deletes members before their parent
group, which the kernel can reject with `EINVAL` and which
defeats ADR-0059 §5 invariant 2 (FDB row → group → members).
Partition by `NhIdAllocator::is_nhg` and drain groups first.
3. Actor-level test coverage was missing for the slice 3b
coordinator + adoption state machine. The diff-level tests
exercise `compute_diff` Pass 1b in isolation, and the privileged
netns test (`tests/netns_fdb_nhg.rs`) validates the raw netlink
primitives — neither covers `apply_nhg_op` through
`ReconcileActor` with `GroupOwnedMap` / `OwnedSet` / report. Add
four new tests in `crates/evpn-linux/tests/reconcile_actor.rs`:
- `fdb_nhg_multihomed_intent_installs_members_group_and_row`
- `fdb_nhg_shared_group_refcount_teardown` (two MACs share
group; withdraw-one keeps it, withdraw-last tears down)
- `fdb_nhg_adoption_retains_live_kernel_fdb_refs` (preloaded
NHG + member + kernel FDB row → cleanup retains)
- `fdb_nhg_perm_suppressed_install_blocks_adoption_cleanup`
(Permanent Install + later suppressed pass with empty `failed`
→ cleanup still blocked)
Adds three `InMemoryHandle` helpers (`pre_load_nexthop_op`,
`nexthop_ops`, `kernel_fdb_nh_id`) and threads a
`take_universal_failure` hook into every `NexthopOps` method so
the test can drive a Permanent FDB-NHG apply.
4. Nit — `encode_add_fdb_member` had two `match gateway { ... }`
blocks (one for `family`, one for the wire octets). Fold into
one match so the two derivations can't drift.
First post-mortem for the symmetric Interface-less IRB datapath. Run completed cleanly at 24h00m elapsed against v0.18.0 (`5619ace` = v0.18.0 + harness PR #80) — peak PE1 RSS 14.34 MB, steady-state slope 0.025 MB/h, 0 BGP flaps, 0 installed_routes oscillations, 99.93% tenant/observed correlation. PASS, all gates green. Verdict + sign-off mirror the Gate 8b post-mortem at `docs/soak-gate8b-24h-bum-state.md`. Five non-blocking follow-ups noted for the next re-soak: - Watch for the single ~22h RSS step recurrence. - Source `pe1_observed_routes` from a Prometheus counter, not the periodic `DataplaneReport` (eliminate the single observation-lag sampling race observed at row 1386). - Fix the harness so `pe1.log` actually captures rustbgpd output (it was 0 bytes this run). - Clarify `CHURN_INTERVAL_SEC` semantics (configured 30 s, actual cadence ~60 s/transition). - Add an `originated_routes` column to the CSV now that v0.18.0 ships the Prometheus series. Raw artifacts pinned alongside per the `docs/artifacts/soak/gate8b-20260510T152451Z/` convention.
Summary
First slice in the ADR-0059 chain with operational behavior change. Multi-homed Type 2 routes now program FDB nexthop groups via
NDA_NH_IDin the kernel; traffic to a multi-homed MAC fans out across every observed alias VTEP on the encap path. Slice 4 (M40 manual containerlab smoke against FRR) closes out the validation story.This PR is the wiring follow-up to slice 3a (#87) which shipped the foundation infrastructure with zero behavior change. The user-supplied research (member-order canonicalization, atomic replace semantics, CVE test scope,
bridge fdb showregex,ip nexthop showformat, Docker capabilities, per-VTEP NH key, RTNLGRP_NEIGH coverage gaps) is folded into the implementation per-section.What's in this PR
compute_diffPass 1b (crates/evpn-linux/src/diff.rs) — emitsInstallFdbNhg/UpdateFdbNhgMembers/RemoveFdbNhgbased onRemoteMacEntry::alias_group_key+last_applied+GroupOwnedMapstate. IPv6 alias members fall back to single-dstAddRemoteFdbwith atracing::warn!per(VNI, MAC)— slice 2'sNexthopSocketrejects v6 gateways, so this is graceful degradation until the v6 fixture follow-up. 6 transition tests cover the owned↔desired matrix: single-dst → FDB-NHG, FDB-NHG → single-dst, group-key drift, N→1 drain, NotReady drain, IPv6 fallback.NexthopOpsimpls:LinuxDataplanedelegates to slice 2'sNexthopSocket+ slice 3a'slinux::fdb_nhgwith the CVE-2025-39851 guard inline;InMemoryDataplanerecordsadd_nexthop_member/add_nexthop_group/del_nexthop/install_fdb_nhg_row/remove_fdb_nhg_row/dump_owned_nexthopscalls for unit-testing the coordinator without netlink.Reconcile actor coordinator (
crates/evpn-linux/src/reconcile.rs::apply_nhg_op) — intercepts the three FDB-NHG op variants beforeDataplane::apply, callsNexthopOpsmethods in ADR-0059 §5 invariant 1+2 order:NLM_F_REPLACE) → FDB row.NexthopSocket::dump_owned(crates/evpn-linux/src/linux/nexthop_raw/socket.rs) —RTM_GETNEXTHOPmultipart parser. ToleratesNHA_GROUP_TYPE/NHA_OP_FLAGSper research §1 (kernel emits them on dump even when we don't set them on add); filters by tag bits +NHA_FDBpresence. 7 unit tests covering member-decode, group-decode, unknown-attr tolerance, tag-bit filtering, non-FDB skip,NLMSG_DONEtermination, continuation.Startup adoption + deferred stale cleanup: first
reconcile_oncepass dumps tagged nexthops, reserves their IDs inNhIdAllocator(preventsNLM_F_REPLACEcollisions with a prior daemon's state), and tracks them inadopted_unreferenced. After the first apply phase, anything still unreferenced byGroupOwnedMapis true-stale and gets deleted + released. Gated to one shot viaadoption_done. Per research §8, this is startup-only; periodicRTM_GETNEXTHOPdrift recovery (forced NHG deletion doesn't emit RTNLGRP_NEIGH) is deferred to slice 3.5.DataplaneOpKindvariants from slice 3a are now actually emitted:InstallFdbNhg { mac },UpdateFdbNhgMembers { esi, ethernet_tag },RemoveFdbNhg { mac }. Each carries the discriminator needed for operator reporting; group-level update uses(esi, ethernet_tag)since there's no per-MAC identity (thevnilives onAppliedOp).Docker-runnable netns integration test (
crates/evpn-linux/tests/netns_fdb_nhg.rs): round-trip install + verification viaip nexthop show(member formatid %u via %s fdb, group formatid %u group %u/%u fdbper research §5) +bridge fdb show(regex(^|\\s)nhid\\s+${id}(\\s|$)per research §4) +dump_owned_nexthopsround-trip + idempotent del + CVE-2025-39851 guard negative path (per research §3: don't try to crash the kernel; assertlearning=on→ install returnsInvalidArgumentreferencing the CVE, no FDB row programmed).Harness extended at
crates/evpn-linux/tests/docker/run-netns-tests.shwithfdb_nhg/fdb_nhg_roundtrip/fdb_nhg_cveselectors. Existing cap set (NET_ADMIN + SYS_ADMIN + apparmor=unconfined) is correct per research §6.Design notes / deviations from the approved plan
Adoption happens in the first
reconcile_oncepass rather than atLinuxDataplane::connect()time. Keeps allocator + refcount state inActorState(per ADR §7 boundary) without changingconnect()'s signature. Net behavior identical — adoption completes before anyInstallFdbNhgis emitted in compute_diff Pass 1b.Per-VTEP NH key is
IpAddronly (not(VNI, IpAddr)) per research §7 (FRR'szebra_evpn_l2_nh_findkeys by VTEP IP). Group key stays VNI-scoped becauseshare_l2_nhgdefaults off.Member set canonicalization: encoder produces sorted+deduped (
aliasing::group_membersalready does this);GroupOwnedMapstores inBTreeSet<IpAddr>; diff compares order-insensitively. Per research §1, kernel preserves stored order on dump but canonicalization avoids false drift.What's not in this PR (slice 3.5 / slice 4 / future)
RTM_GETNEXTHOPdrift recovery — slice 3.5. Forced NHG deletion doesn't emit RTNLGRP_NEIGH (research §8); requires periodic dump orRTNLGRP_NEXTHOPmulticast (latter explicit ADR out-of-scope).apply_aliasing_ecmpoperator off-switch — slice 3b ships enforcement-on; toggle is slice 3.5+ if operators ask.Stats
Test plan
cargo fmt --all -- --checkcargo clippy --workspace --all-targets -- -D warningscargo test --workspace --no-fail-fast(full sweep)RUSTDOCFLAGS=\"-D warnings\" cargo doc --workspace --no-depscargo +1.92 check --workspace --all-targets --locked(MSRV)References
alias_group_key).nexthop_rawnetlink primitive.6ead38147ebb.