Skip to content

evpn-linux: ADR-0059 slice 3b — wire FDB-NHG onto the apply path#88

Merged
lance0 merged 16 commits into
mainfrom
evpn/adr-0059-slice-3b-coordinator-and-adoption
May 13, 2026
Merged

evpn-linux: ADR-0059 slice 3b — wire FDB-NHG onto the apply path#88
lance0 merged 16 commits into
mainfrom
evpn/adr-0059-slice-3b-coordinator-and-adoption

Conversation

@lance0
Copy link
Copy Markdown
Owner

@lance0 lance0 commented May 12, 2026

Summary

First slice in the ADR-0059 chain with operational behavior change. Multi-homed Type 2 routes now program FDB nexthop groups via NDA_NH_ID in the kernel; traffic to a multi-homed MAC fans out across every observed alias VTEP on the encap path. Slice 4 (M40 manual containerlab smoke against FRR) closes out the validation story.

This PR is the wiring follow-up to slice 3a (#87) which shipped the foundation infrastructure with zero behavior change. The user-supplied research (member-order canonicalization, atomic replace semantics, CVE test scope, bridge fdb show regex, ip nexthop show format, Docker capabilities, per-VTEP NH key, RTNLGRP_NEIGH coverage gaps) is folded into the implementation per-section.

What's in this PR

  • compute_diff Pass 1b (crates/evpn-linux/src/diff.rs) — emits InstallFdbNhg / UpdateFdbNhgMembers / RemoveFdbNhg based on RemoteMacEntry::alias_group_key + last_applied + GroupOwnedMap state. IPv6 alias members fall back to single-dst AddRemoteFdb with a tracing::warn! per (VNI, MAC) — slice 2's NexthopSocket rejects v6 gateways, so this is graceful degradation until the v6 fixture follow-up. 6 transition tests cover the owned↔desired matrix: single-dst → FDB-NHG, FDB-NHG → single-dst, group-key drift, N→1 drain, NotReady drain, IPv6 fallback.

  • NexthopOps impls: LinuxDataplane delegates to slice 2's NexthopSocket + slice 3a's linux::fdb_nhg with the CVE-2025-39851 guard inline; InMemoryDataplane records add_nexthop_member / add_nexthop_group / del_nexthop / install_fdb_nhg_row / remove_fdb_nhg_row / dump_owned_nexthops calls for unit-testing the coordinator without netlink.

  • Reconcile actor coordinator (crates/evpn-linux/src/reconcile.rs::apply_nhg_op) — intercepts the three FDB-NHG op variants before Dataplane::apply, calls NexthopOps methods in ADR-0059 §5 invariant 1+2 order:

    • Install: per-VTEP members → group (create or NLM_F_REPLACE) → FDB row.
    • Update: add new members → group REPLACE (atomic alias-set swap) → GC removed per-VTEP members whose ref-count hits zero.
    • Remove: FDB row first → if last MAC, delete group + unref per-VTEP members; release allocator slots for anything cleaned up.
  • NexthopSocket::dump_owned (crates/evpn-linux/src/linux/nexthop_raw/socket.rs) — RTM_GETNEXTHOP multipart parser. Tolerates NHA_GROUP_TYPE / NHA_OP_FLAGS per research §1 (kernel emits them on dump even when we don't set them on add); filters by tag bits + NHA_FDB presence. 7 unit tests covering member-decode, group-decode, unknown-attr tolerance, tag-bit filtering, non-FDB skip, NLMSG_DONE termination, continuation.

  • Startup adoption + deferred stale cleanup: first reconcile_once pass dumps tagged nexthops, reserves their IDs in NhIdAllocator (prevents NLM_F_REPLACE collisions with a prior daemon's state), and tracks them in adopted_unreferenced. After the first apply phase, anything still unreferenced by GroupOwnedMap is true-stale and gets deleted + released. Gated to one shot via adoption_done. Per research §8, this is startup-only; periodic RTM_GETNEXTHOP drift recovery (forced NHG deletion doesn't emit RTNLGRP_NEIGH) is deferred to slice 3.5.

  • DataplaneOpKind variants from slice 3a are now actually emitted: InstallFdbNhg { mac }, UpdateFdbNhgMembers { esi, ethernet_tag }, RemoveFdbNhg { mac }. Each carries the discriminator needed for operator reporting; group-level update uses (esi, ethernet_tag) since there's no per-MAC identity (the vni lives on AppliedOp).

  • Docker-runnable netns integration test (crates/evpn-linux/tests/netns_fdb_nhg.rs): round-trip install + verification via ip nexthop show (member format id %u via %s fdb, group format id %u group %u/%u fdb per research §5) + bridge fdb show (regex (^|\\s)nhid\\s+${id}(\\s|$) per research §4) + dump_owned_nexthops round-trip + idempotent del + CVE-2025-39851 guard negative path (per research §3: don't try to crash the kernel; assert learning=on → install returns InvalidArgument referencing the CVE, no FDB row programmed).

    Harness extended at crates/evpn-linux/tests/docker/run-netns-tests.sh with fdb_nhg / fdb_nhg_roundtrip / fdb_nhg_cve selectors. Existing cap set (NET_ADMIN + SYS_ADMIN + apparmor=unconfined) is correct per research §6.

Design notes / deviations from the approved plan

  • Adoption happens in the first reconcile_once pass rather than at LinuxDataplane::connect() time. Keeps allocator + refcount state in ActorState (per ADR §7 boundary) without changing connect()'s signature. Net behavior identical — adoption completes before any InstallFdbNhg is emitted in compute_diff Pass 1b.

  • Per-VTEP NH key is IpAddr only (not (VNI, IpAddr)) per research §7 (FRR's zebra_evpn_l2_nh_find keys by VTEP IP). Group key stays VNI-scoped because share_l2_nhg defaults off.

  • Member set canonicalization: encoder produces sorted+deduped (aliasing::group_members already does this); GroupOwnedMap stores in BTreeSet<IpAddr>; diff compares order-insensitively. Per research §1, kernel preserves stored order on dump but canonicalization avoids false drift.

What's not in this PR (slice 3.5 / slice 4 / future)

  • M40 manual containerlab smoke against FRR — slice 4.
  • Periodic RTM_GETNEXTHOP drift recovery — slice 3.5. Forced NHG deletion doesn't emit RTNLGRP_NEIGH (research §8); requires periodic dump or RTNLGRP_NEXTHOP multicast (latter explicit ADR out-of-scope).
  • IPv6 alias members on FDB-NHG — slice 2.5 follow-up. Diff falls back to single-dst with warn today.
  • apply_aliasing_ecmp operator off-switch — slice 3b ships enforcement-on; toggle is slice 3.5+ if operators ask.

Stats

  • 13 files changed: +1999 / -57 lines (largest contributor: the reconcile coordinator + parser).
  • 227 lib tests passing (+13 new on this branch: 7 dump-parser + 6 transition). All pre-existing diff / l3_diff / linux::* / reconcile tests still green.
  • 21 integration / smoke buckets unchanged.

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --workspace --all-targets -- -D warnings
  • cargo test --workspace --no-fail-fast (full sweep)
  • RUSTDOCFLAGS=\"-D warnings\" cargo doc --workspace --no-deps
  • cargo +1.92 check --workspace --all-targets --locked (MSRV)
  • CI: build matrix + interop pass on push.
  • Manual netns smoke (Docker, no sudo):
    crates/evpn-linux/tests/docker/run-netns-tests.sh fdb_nhg

References

First slice in the ADR-0059 chain with operational behavior
change. Multi-homed Type 2 routes now program FDB nexthop groups
via `NDA_NH_ID` in the kernel; traffic to a multi-homed MAC fans
out across every observed alias VTEP on the encap path. Slice 4
(M40 manual containerlab smoke against FRR) closes out the
validation story.

What lands:

- `compute_diff` Pass 1b — emits `InstallFdbNhg` /
  `UpdateFdbNhgMembers` / `RemoveFdbNhg` based on
  `RemoteMacEntry::alias_group_key`. IPv6 alias members fall back
  to single-dst FDB with `warn!` per `(VNI, MAC)` (slice 2's
  socket rejects v6; graceful degradation until v6 follow-up).
  Six transition tests cover the owned↔desired matrix.

- `NexthopOps` impls on `LinuxDataplane` (delegates to slice 2's
  `NexthopSocket` + slice 3a's `linux::fdb_nhg`) and
  `InMemoryDataplane` (records calls for unit-testing the
  coordinator).

- Reconcile actor coordinator — intercepts the three FDB-NHG op
  variants before `Dataplane::apply`, calls `NexthopOps` methods
  in ADR-0059 §5 invariant 1+2 order (install: members → group →
  FDB row; teardown: FDB row → group → members per refcount).
  `ActorState` extends with `nh_id_alloc`, `groups`,
  `adopted_unreferenced`, `adoption_done`.

- `NexthopSocket::dump_owned` — `RTM_GETNEXTHOP` multipart
  parser tolerating `NHA_GROUP_TYPE` / `NHA_OP_FLAGS` per
  research §1, filtering by tag bits + `NHA_FDB`. 7 unit tests.

- Startup adoption + deferred stale cleanup: first
  `reconcile_once` pass dumps tagged nexthops, reserves their IDs
  in the allocator (prevents `NLM_F_REPLACE` collisions with a
  prior daemon's state), and tracks them in
  `adopted_unreferenced`. After the first apply phase, anything
  still unreferenced is true-stale and gets deleted + released.
  Gated to one shot via `adoption_done`. Per research §8, this is
  startup-only; periodic drift recovery deferred to slice 3.5.

- New 3 new `DataplaneOpKind` variants from slice 3a are now
  emitted by the actor and surface in `AppliedOp` / `FailedOp`.

- Docker-runnable netns integration test
  (`tests/netns_fdb_nhg.rs`): round-trip install + verify via
  `ip nexthop show` + `bridge fdb show nhid` + idempotent del +
  CVE-2025-39851 guard negative path. Harness extended with
  `fdb_nhg` / `fdb_nhg_roundtrip` / `fdb_nhg_cve` selectors.

Deviation from approved plan: adoption happens in the first
`reconcile_once` pass rather than at `connect()` time. Keeps
allocator + refcount state in `ActorState` (per ADR §7 boundary)
without changing `connect()`'s signature. Net behavior identical
— adoption completes before any `InstallFdbNhg` is emitted.

Stats: 1622 lines added / 57 changed across 12 files. 226 lib
tests passing (227 with the 7 dump-parser tests + 6 transition
tests added on this branch); all pre-existing unit + integration
tests still green.

References:
- ADR-0059 (merged #83): design.
- Slice 1 (#84), 2 (#86), 3a (#87): foundation.
- Research findings on member-order canonicalization, atomic
  replace semantics, CVE test scope, `bridge fdb show` regex,
  `ip nexthop show` format, Docker capabilities, per-VTEP NH key,
  RTNLGRP_NEIGH coverage gaps all folded into the implementation.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Wires ADR-0059 slice 3b into the EVPN Linux dataplane apply path: multi-homed Type 2 MACs are now programmed using kernel FDB nexthop groups (NHGs) with startup adoption of tagged NHIDs and an end-to-end privileged netns integration test.

Changes:

  • Extend compute_diff to emit FDB-NHG ops (InstallFdbNhg / UpdateFdbNhgMembers / RemoveFdbNhg) and add IPv6 graceful fallback.
  • Add reconcile-actor coordination for FDB-NHG lifecycle (allocator + refcounts + adoption/cleanup).
  • Implement NexthopOps for Linux + in-memory dataplanes and add RTM_GETNEXTHOP dump support + Docker/netns harness updates.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/evpn_dataplane.rs Tightens supervisor spawn generic bound to require NexthopOps.
ROADMAP.md Updates roadmap status for ADR-0059 slice chain.
CHANGELOG.md Adds unreleased entry describing slice 3b operational change and wiring.
crates/evpn-linux/src/diff.rs Emits FDB-NHG ops (Pass 1b), handles transitions, adds tests.
crates/evpn-linux/src/reconcile.rs Adds NHG coordinator apply path, startup adoption + stale cleanup, actor state.
crates/evpn-linux/src/linux/mod.rs Adds dedicated nexthop socket and NexthopOps impl for LinuxDataplane.
crates/evpn-linux/src/linux/fdb_nhg.rs Updates module docs to reflect slice 3b wiring status.
crates/evpn-linux/src/linux/nexthop_raw/encode.rs Adds RTM_GETNEXTHOP dump encoder.
crates/evpn-linux/src/linux/nexthop_raw/socket.rs Implements multipart dump parsing + dump_owned() with unit tests.
crates/evpn-linux/src/linux/nexthop_raw/uapi.rs Adds missing UAPI consts needed for dump tolerance.
crates/evpn-linux/src/in_memory.rs Implements NexthopOps in the in-memory dataplane for coordinator tests.
crates/evpn-linux/tests/netns_fdb_nhg.rs Adds privileged netns integration test for FDB-NHG programming + CVE guard.
crates/evpn-linux/tests/docker/run-netns-tests.sh Adds selectors to run the new netns test binary and subtests.
Comments suppressed due to low confidence (2)

crates/evpn-linux/src/reconcile.rs:917

  • cleanup_unreferenced_adoptions releases NHID allocator slots even when del_nexthop(id) fails. If the kernel object is still present, releasing the ID can allow a later allocation to reuse it and inadvertently REPLACE/overwrite a still-live kernel nexthop. Safer approach: only release(id) after a confirmed successful delete (including idempotent ENOENT), and keep failed IDs in adopted_unreferenced so cleanup can retry on the next pass (which likely also implies not flipping adoption_done yet).
            if let Err(e) = self.dataplane.del_nexthop(id).await {
                tracing::debug!(?e, id, "adoption cleanup: del_nexthop failed (best-effort)");
            }
            self.state.nh_id_alloc.release(id);
        }

crates/evpn-linux/src/reconcile.rs:1419

  • In the RemoveFdbNhg path, del_nexthop errors are ignored (let _ = ...) but the allocator slot is still released immediately afterward. If the delete failed and the kernel object remains, the allocator may later reuse the same ID and NLM_F_REPLACE could overwrite a still-live nexthop/group. Consider releasing IDs only after a confirmed successful delete (including idempotent ENOENT), and preserving enough state to retry cleanup on the next reconcile pass when deletion fails.
                RefDelta::GroupShouldDelete { id, members } => {
                    let _ = dataplane.del_nexthop(id).await;
                    state.nh_id_alloc.release(id);
                    for ip in members {
                        if let Some(vtep_id) = state.groups.record_member_unref(ip, *group_key) {
                            let _ = dataplane.del_nexthop(vtep_id).await;
                            state.nh_id_alloc.release(vtep_id);
                        }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread crates/evpn-linux/src/diff.rs Outdated
Comment on lines 206 to 209
@@ -143,6 +209,168 @@ pub fn compute_diff(
Plan { ops }
Comment thread crates/evpn-linux/src/diff.rs Outdated
Comment on lines +300 to +302
// Single-dst → FDB-NHG transition. Remove the old row,
// install the group-backed one.
updates.push(DataplaneOp::RemoveRemoteFdb { vni, mac });
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +540 to +542
// Gated to the first pass so we don't re-walk an
// already-empty map on every reconcile.
if !self.state.adoption_done {
Comment on lines +1307 to +1317
// Step 1: install any per-VTEP members not yet present.
for ip in members {
if state.groups.vtep_nh(ip).is_none() {
let id = state.nh_id_alloc.alloc_vtep_nh().map_err(|e| {
crate::error::DataplaneError::Other(format!(
"ADR-0059: vtep NH alloc failed: {e}"
))
})?;
dataplane.add_nexthop_member(id, *ip).await?;
state.groups.record_member_install(*ip, id);
state.adopted_unreferenced.remove(&id);
Comment thread crates/evpn-linux/src/diff.rs Outdated
Comment on lines +1195 to +1201

// GroupOwned and pre-existing struct imports kept used. The
// `GroupOwned` type isn't directly constructed in these tests
// but is referenced indirectly via `GroupOwnedMap`'s helpers.
fn _silence_unused_import_warning() -> Option<GroupOwned> {
None
}
Comment thread ROADMAP.md Outdated
- [x] **EVPN Phase 3 partial — Gate 8b enforcement intent foundation** (v0.17.0) — DF-election role state now feeds the Linux dataplane supervisor as a portable `(ESI, VNI)` BUM-enforcement table. The reconciler resolves bridge, VXLAN ifindex, and CE-facing port identity and reports the desired action (`allow` for DF, `suppress` for Non-DF) through `DataplaneReport.bum_enforcement`. Observable-only: no kernel filter mutation yet.
- [x] **EVPN Phase 3 partial — Gate 9 slice 6 symmetric Interface-less IRB end-to-end** (v0.18.0, ADR-0058) — `[[evpn_ip_vrfs]]` TOML schema with VRF / L3VXLAN device binding and operator-supplied Router MAC, pure-logic Type 5 origination + projection helpers (RFC 9136 §4.4.2 Interface-less IRB), `IpVrfStatus` readiness probe checking the seven ADR-0058 §3 predicates, Linux rtnetlink dumps for VRF / L3VXLAN inventory, and `Dataplane::probe_ip_vrfs` called every reconcile pass with readiness surfaced via `tracing::info!` / `warn!` + `DataplaneReport.ip_vrf_status` + `EvpnService.ListIpVrfs` / `GetIpVrf` gRPC + `rustbgpctl evpn vrfs`. Slice 6 PR A (#77) wires daemon-side origination (kernel-route dump + conservative classifier + L3 originator task + `originated_routes_count`); slice 6 PR B (#78) wires dataplane-side import (best-path subscription + projection + transactional `L3OwnedState` with value-aware drift detection + four-phase apply ordering + Router MAC conflict detection + foreign-state preservation + `installed_routes_count`). M39 manual containerlab smoke validates the bidirectional symmetric IRB datapath against FRR 10.3.1.
- [ ] **EVPN Phase 3 — Multi-homing enforcement (Gate 8b residual) + Gate 9 overlay-index IRB** — Gate 8b residuals: aliasing dataplane ECMP per [ADR-0059](docs/adr/0059-evpn-aliasing-fdb-nexthop-groups.md) (FDB nexthop groups via `NDA_NH_ID` + `NHA_FDB`; raw-netlink construction because `rtnetlink 0.21` has no nexthop API). Slice 1 (portable intent — `RemoteMacEntry::alias_group_key`) and slice 2 (`nexthop_raw` netlink primitive) have shipped on `main`; **slice 3 is in flight, split into 3a (state types + `NexthopOps` trait + `fdb_nhg` apply + CVE-2025-39851 guard, infrastructure only) and 3b (diff Pass 1b + reconcile actor coordinator + NHID startup adoption + Docker-runnable netns test, where operational behavior change lands)**. Slice 3.5 deferred: periodic `RTM_GETNEXTHOP` drift recovery + `RTNLGRP_NEXTHOP` (out of scope for the ADR). Slice 4 (M40 manual containerlab smoke against FRR) is the final piece. Then production-default multi-homing enforcement after the 24 h soak. Gate 9 follow-ups: overlay-index IRB (RFC 9135 — Gate 9 slice 6 ships the Interface-less variant only), auto-derived Route Targets (RFC 8365 §5.1.2.1), privileged-runner CI gate for the M39 smoke. Needed for active-active ToR deployments and broader tenant-routing topologies.
- [ ] **EVPN Phase 3 — Multi-homing enforcement (Gate 8b residual) + Gate 9 overlay-index IRB** — Gate 8b residuals: aliasing dataplane ECMP per [ADR-0059](docs/adr/0059-evpn-aliasing-fdb-nexthop-groups.md) (FDB nexthop groups via `NDA_NH_ID` + `NHA_FDB`; raw-netlink construction because `rtnetlink 0.21` has no nexthop API). Slice 1 (portable intent — `RemoteMacEntry::alias_group_key`), slice 2 (`nexthop_raw` netlink primitive), slice 3a (state types + apply primitive + CVE-2025-39851 guard, infrastructure-only), and slice 3b (diff Pass 1b + reconcile actor coordinator + `NexthopOps` impls + startup NHID adoption + Docker-runnable netns test — **operational behavior change**) have all shipped on `main`. Slice 4 (M40 manual containerlab smoke against FRR) is the remaining piece before production-default multi-homing enforcement after the 24 h soak. Slice 3.5 deferred follow-ups: periodic `RTM_GETNEXTHOP` drift recovery (research §8 confirms forced NHG deletion doesn't emit RTNLGRP_NEIGH events), `apply_aliasing_ecmp` operator toggle, IPv6 alias members. Gate 9 follow-ups: overlay-index IRB (RFC 9135 — Gate 9 slice 6 ships the Interface-less variant only), auto-derived Route Targets (RFC 8365 §5.1.2.1), privileged-runner CI gate for the M39 smoke. Needed for active-active ToR deployments and broader tenant-routing topologies.
Comment on lines +265 to +269
let nh_after = ip_nexthop_show();
assert!(
nh_after.trim().is_empty()
|| (!nh_after.contains("0x30000001") && !nh_after.contains("0x40000001")),
"nexthop table should be empty (of ours): {nh_after}",
Apply 7 review findings on the slice 3b coordinator PR:

1. Plan ordering: emit deletes before creates+updates so
   transitions (SingleDst↔FdbNhg, group-key drift) sequence
   remove → add at the kernel-op level, not just per-op.
2. SingleDst→FdbNhg `RemoveRemoteFdb` gated on kernel snapshot
   presence — keeps the op idempotent when the prior row never
   actually landed (failed install carried in `last_applied`).
3. Adoption cleanup gated on `failed.is_empty()` so a partial
   first-reconcile pass doesn't wipe stale-but-still-needed
   adopted NHIDs before the second pass gets a chance.
4. `apply_nhg_op` rolls back the allocator slot
   (`nh_id_alloc.release(id)`) on netlink failure so a transient
   kernel error doesn't leak bitmap IDs in either the install
   or update-members path.
5. Remove the unused `GroupOwned` import and the
   `_silence_unused_import_warning` helper from diff.rs.
6. ROADMAP slice 3b reads "in flight, pending merge" rather
   than claiming shipped on main.
7. Netns post-teardown emptiness asserted via
   `dump_owned_nexthops()` instead of hex-substring matching
   on `ip nexthop show` (which prints IDs in decimal — the
   prior assertion could pass even when entries were live).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +1349 to +1351
state
.groups
.record_group_member_change(*group_key, new_members);
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +1405 to +1411
// No record of the group — log + skip; the next
// InstallFdbNhg-driven pass will create it.
tracing::warn!(
?group_key,
"UpdateFdbNhgMembers: group not in owned-state map; skipping"
);
return Ok(());
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +417 to +442
if !self.state.adoption_done {
match self.dataplane.dump_owned_nexthops().await {
Ok(adopted) => {
for nh in adopted {
match self.state.nh_id_alloc.reserve(nh.id) {
Ok(()) => {
self.state.adopted_unreferenced.insert(nh.id, nh);
}
Err(e) => {
tracing::warn!(
?e,
id = nh.id,
"adoption: reserve failed; ignoring"
);
}
}
}
}
Err(e) => {
// Dump failed — log + continue. The actor will
// re-attempt adoption on the next pass since
// adoption_done stays false until apply_plan
// succeeds at least once.
tracing::warn!(error = %e, "adoption: dump_owned_nexthops failed");
}
}
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +912 to +923
// Drain to avoid holding the borrow across the await.
let stale: Vec<u32> = self.state.adopted_unreferenced.keys().copied().collect();
for id in stale {
// del_nexthop is idempotent on ENOENT (slice 2). Errors
// are logged at debug; the next reconcile won't try
// again because adoption_done is set after this loop.
if let Err(e) = self.dataplane.del_nexthop(id).await {
tracing::debug!(?e, id, "adoption cleanup: del_nexthop failed (best-effort)");
}
self.state.nh_id_alloc.release(id);
}
self.state.adopted_unreferenced.clear();
Four findings on the previous fixup pass, all in
`reconcile.rs`:

1. `InstallFdbNhg` drift path: when an existing group's
   member set drifts, `record_group_member_change` returns
   the removed IPs; previously that return was discarded,
   leaving per-VTEP nexthops orphaned in `vtep_nhs` and the
   kernel. Now mirrors the `UpdateFdbNhgMembers` path —
   `record_member_unref` → `del_nexthop` → allocator release
   for each removed IP whose last group-ref dropped.

2. `UpdateFdbNhgMembers` missing-group: previously logged
   + returned `Ok(())`, which cleared retry state on a no-op.
   `compute_diff` only emits this op when the group exists
   in `GroupOwnedMap`, so the branch indicates `owned` and
   `groups` have drifted out of sync. Now returns a transient
   `Other` error so the actor retries / surfaces it.

3. Adoption completion: track `dump_succeeded` from the
   `dump_owned_nexthops()` call and AND it into the
   `adoption_done = true` gate. Without this, a dump failure
   on the first pass would let `adoption_done` flip true
   while the allocator had reserved no kernel NHIDs,
   permanently stranding collision risk for future
   `InstallFdbNhg` ops.

4. `cleanup_unreferenced_adoptions`: now returns whether
   every staged delete succeeded; only release the
   allocator slot on a per-ID `Ok`. On `Err`, leave the
   slot reserved + the entry in `adopted_unreferenced` so
   the next reconcile pass retries — releasing on
   transient netlink failure would let a future
   `alloc_vtep_nh` / `alloc_nhg` hand out an ID still
   live in the kernel. `adoption_done` only flips true
   when cleanup returned `true`.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (2)

crates/evpn-linux/src/reconcile.rs:1472

  • Same issue as above: this GC loop ignores del_nexthop errors but still releases the allocator slot. Releasing on failure can lead to reusing an ID that is still present in the kernel. Prefer releasing only on successful delete (del already treats ENOENT as Ok), and keep failed deletions reserved + queued for retry.
            for ip in removed {
                if let Some(vtep_id) = state.groups.record_member_unref(ip, *group_key) {
                    let _ = dataplane.del_nexthop(vtep_id).await;
                    state.nh_id_alloc.release(vtep_id);
                }

crates/evpn-linux/src/reconcile.rs:1493

  • On group teardown, del_nexthop errors are ignored but the group/member IDs are released anyway. If a delete fails (e.g., transient netlink error), the allocator can hand out an ID that still exists in the kernel. Recommend releasing IDs only after del_nexthop succeeds, and retaining enough state to retry deletion later while keeping the IDs reserved.
                RefDelta::GroupShouldDelete { id, members } => {
                    let _ = dataplane.del_nexthop(id).await;
                    state.nh_id_alloc.release(id);
                    for ip in members {
                        if let Some(vtep_id) = state.groups.record_member_unref(ip, *group_key) {
                            let _ = dataplane.del_nexthop(vtep_id).await;
                            state.nh_id_alloc.release(vtep_id);
                        }

Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +1393 to +1394
let _ = dataplane.del_nexthop(vtep_id).await;
state.nh_id_alloc.release(vtep_id);
Comment on lines +841 to +848
let res = match op {
DataplaneOp::InstallFdbNhg { .. }
| DataplaneOp::UpdateFdbNhgMembers { .. }
| DataplaneOp::RemoveFdbNhg { .. } => {
apply_nhg_op(&mut self.dataplane, &mut self.state, op).await
}
_ => self.dataplane.apply(op).await,
};
Comment on lines +244 to +245
/// presence of `NHA_FDB`. ADR-0059 slice 3b uses this at
/// `LinuxDataplane::connect()` time so the allocator can
Three findings on the second fixup pass:

1. Steady-state GC paths (InstallFdbNhg drift, UpdateFdbNhgMembers,
   RemoveFdbNhg) released allocator slots unconditionally even when
   `del_nexthop` errored — same allocator-collision risk that the
   previous round's `cleanup_unreferenced_adoptions` fix addressed,
   but applied only to startup adoption. Extract the pattern into a
   shared `try_del_and_release_alloc` helper used by all three
   apply-time GC sites: release only when `del_nexthop` returned
   `Ok`, otherwise log + retain the slot to prevent a future
   `alloc_vtep_nh` / `alloc_nhg` from re-handing an in-kernel ID.

2. Retry/permanent-failure key for `UpdateFdbNhgMembers` was
   `(VNI, MacAddress::zero())` because the op has no per-MAC identity
   — every group update within the same VNI collided on a placeholder
   key. Adds a dedicated `nhg_retry: RetrySchedule<AliasGroupKey>`
   and `nhg_permanent_failures: BTreeMap<AliasGroupKey, DataplaneOp>`
   to `ActorState`, mirroring the BUM (ifindex-keyed) precedent. The
   apply loop, retry-deadline aggregation, permanent-suppression
   check, transient-retry record, and success path all dispatch
   `UpdateFdbNhgMembers` through the new key space. Extracts the
   permanent-suppression-check pattern into a generic
   `check_permanent_suppression` helper so the 3-way dispatch stays
   readable.

3. `NexthopSocket::dump_owned` doc said it ran "at
   `LinuxDataplane::connect()` time"; adoption was deferred to the
   reconcile actor's first pass per the slice 3b plan's ownership-
   boundary decision. Update the doc to match the call site.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comment on lines +680 to +693
fn map_nexthop_error(e: nexthop_raw::NexthopError) -> DataplaneError {
match e {
nexthop_raw::NexthopError::Io(io) => DataplaneError::Io(io),
nexthop_raw::NexthopError::Kernel(errno) => {
DataplaneError::InvalidArgument(format!("nexthop kernel errno {errno}"))
}
nexthop_raw::NexthopError::Validation(v) => {
DataplaneError::InvalidArgument(format!("nexthop validation: {v}"))
}
nexthop_raw::NexthopError::Ipv6Unsupported => {
DataplaneError::InvalidArgument("nexthop IPv6 unsupported (slice 2.5 follow-up)".into())
}
other => DataplaneError::Other(format!("nexthop socket: {other}")),
}
Comment thread crates/evpn-linux/src/diff.rs Outdated
Comment on lines +134 to +143
// IPv6 fallback — slice 2 `NexthopSocket` rejects v6
// gateways. Diff degrades to single-dst (primary VTEP
// only) and emits one warn per (VNI, MAC) so the
// operator sees the degradation.
tracing::warn!(
?vni,
mac = %mac,
"ADR-0059 IPv6 alias members not yet supported (slice 2.5 follow-up); \
falling back to single-dst FDB row at primary VTEP",
);
Comment on lines +105 to +133
/// Build a bridge + VXLAN port with `nolearning` in the current netns.
/// Returns the VNI for downstream test setup.
async fn build_l2_topology() -> EvpnInstanceId {
run("ip", &["addr", "add", "127.0.0.10/32", "dev", "lo"]);
run("ip", &["link", "set", "lo", "up"]);
run("ip", &["link", "add", "name", "br100", "type", "bridge"]);
run("ip", &["link", "set", "br100", "up"]);
run(
"ip",
&[
"link",
"add",
"name",
"vxlan100",
"type",
"vxlan",
"id",
"100",
"local",
"127.0.0.10",
"dstport",
"4789",
"nolearning",
],
);
run("ip", &["link", "set", "vxlan100", "master", "br100"]);
run("ip", &["link", "set", "vxlan100", "up"]);
EvpnInstanceId::new(100).expect("VNI 100")
}
Three findings on the third fixup pass:

1. `map_nexthop_error` collapsed every `NexthopError::Kernel(errno)`
   into `DataplaneError::InvalidArgument`, which the actor classifies
   as `Permanent`. That meant transient errnos (`EBUSY`, `ENOMEM`,
   `ENOBUFS`) got permanently suppressed, stranding allocator
   reservations and kernel state until restart. Splits the errno
   classification out into `map_nexthop_kernel_errno`, mirroring the
   dispatch in `linux::fdb::errno_to_dataplane_error`:
   `EPERM`/`EACCES` → `PermissionDenied`, `EOPNOTSUPP` →
   `KernelTooOld` (with the NDA_NH_ID kernel-version hint),
   `EINVAL` → `InvalidArgument`, everything else → `Other` (transient).

2. IPv6-alias fallback `tracing::warn!` fired in `compute_diff` on
   every reconcile pass — sustained log spam for stable
   configurations with v6 multi-homed entries. Moves the warn-trigger
   responsibility to the reconcile actor: `compute_diff` now returns
   the `(VNI, MAC)` set currently in fallback on `Plan`, and the
   actor warns only for `(VNI, MAC)` keys that newly entered the
   set since the previous pass. Keys that drop out are pruned so
   a future re-entry produces a fresh warn.

3. `build_l2_topology` in the netns test was `async fn` with no
   `.await` points — trips `clippy::unused_async` (not enabled by
   default but flagged on review). Drops `async` + adjusts the one
   call site.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread crates/evpn-linux/src/diff.rs Outdated
Comment on lines +398 to +404
// (FDB row deleted out-of-band), re-emit Install for
// this MAC. ADR §5 invariant 8: idempotent on
// EEXIST so this is safe even when the row is still
// there.
let kernel_has_row = snapshot
.find_fdb(vni, mac)
.is_some_and(|k| k.nh_id.is_some());
Comment on lines +1600 to +1622
/// Best-effort delete of a tagged kernel nexthop, releasing the
/// allocator slot only if the delete succeeded. On `Err` the slot
/// stays reserved so a future `alloc_vtep_nh` / `alloc_nhg` cannot
/// hand out an ID that is still live in the kernel — the failure
/// scenario flagged by ADR-0059 review. The leaked slot is bounded
/// (per-ID, per-pass) and reclaimable on a fresh daemon start via
/// the slice 3b adoption pass; periodic drift recovery (slice 3.5)
/// will eventually retry orphan deletions.
async fn try_del_and_release_alloc<D: crate::dataplane::NexthopOps>(
dataplane: &mut D,
alloc: &mut crate::nh_id_alloc::NhIdAllocator,
id: u32,
site: &'static str,
) {
match dataplane.del_nexthop(id).await {
Ok(()) => alloc.release(id),
Err(e) => tracing::warn!(
?e,
id,
site,
"FDB-NHG GC: del_nexthop failed; allocator slot retained to prevent kernel-ID collision"
),
}
Two findings on the fourth fixup pass:

1. FDB-NHG same-key drift check (`compute_diff` Pass 1b) only
   verified that the kernel snapshot row had *some* `nh_id`, not
   that it matched the locally-tracked group ID. A manual
   `bridge fdb replace` or partial-failure carryover could leave
   the row pointing at a different NHG and we'd silently forward
   traffic via the wrong group. Restructured the per-MAC check to
   require:
     - the snapshot row exists, AND
     - it has `nh_id == Some(local_id)` for the group we track
       under `linux_key`.
   Any other state (missing row, single-dst row with no NHID,
   different NHID, or local group missing) re-emits Install —
   safe via NLM_F_REPLACE. Folds the previous group-missing heal
   into this single check, which dedupes naturally across MACs
   via the apply path's `record_group_install`.

2. `try_del_and_release_alloc` previously logged + retained the
   allocator slot on `del_nexthop` failure but had no retry path.
   The surrounding op (`RemoveFdbNhg` last-ref teardown,
   `UpdateFdbNhgMembers` member GC, Install-drift heal) still
   returned `Ok(())` and the actor dropped the `GroupOwnedMap` /
   `OwnedSet` entries — so on `EBUSY` / `ENOMEM` the kernel
   orphan was permanently stranded with no signal for compute_diff
   to re-emit a Remove. Now queues failed IDs into
   `state.pending_deletes` (allocator slot still retained), and
   adds `drain_pending_deletes` that runs after `apply_plan` each
   pass — retries every queued ID, releases the slot + drops from
   the queue on success, leaves entries for next pass on persistent
   failure. Independent of the one-shot `cleanup_unreferenced_adoptions`
   (which walks startup-adopted IDs); steady-state GC failures land
   in the new queue.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Comment thread crates/evpn-linux/src/diff.rs Outdated
Comment on lines +325 to +331
// Fresh install for this MAC.
creates.push(DataplaneOp::InstallFdbNhg {
vni,
mac,
group_key: linux_key,
members: canonical,
});
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +457 to +484
let mut dump_succeeded = true;
if !self.state.adoption_done {
match self.dataplane.dump_owned_nexthops().await {
Ok(adopted) => {
for nh in adopted {
match self.state.nh_id_alloc.reserve(nh.id) {
Ok(()) => {
self.state.adopted_unreferenced.insert(nh.id, nh);
}
Err(e) => {
tracing::warn!(
?e,
id = nh.id,
"adoption: reserve failed; ignoring"
);
}
}
}
}
Err(e) => {
// Dump failed — log + leave adoption_done false so
// the next reconcile pass re-attempts the dump
// before any FDB-NHG ops are applied against an
// under-reserved allocator.
tracing::warn!(error = %e, "adoption: dump_owned_nexthops failed");
dump_succeeded = false;
}
}
Two findings on the fifth fixup pass:

1. `compute_diff` Pass 1b `None` (fresh-install) arm emitted
   `InstallFdbNhg` based solely on `last_applied` being empty,
   without consulting the kernel snapshot. Two ways this hurt:
     - foreign-entry preservation violation: an operator's static
       FDB row at the same `(vni, mac)` would be overwritten by
       NLM_F_REPLACE.
     - restart case: `last_applied` is empty post-restart but the
       kernel may still hold rustbgpd-owned rows from the prior
       instance — fine, but we'd also overwrite foreign ones.
   Mirrors `handle_existing_kernel_entry`'s gate: install only when
   the snapshot row is absent or has `extern_learn` set (our marker).
   Otherwise log + skip the install.

2. Startup-adoption dump failure was tracked via `dump_succeeded`
   but the reconcile pass kept going into `compute_diff` + `apply_plan`
   anyway, with an allocator that hadn't reserved pre-existing
   kernel NHIDs — collision risk that NLM_F_REPLACE turns into
   silent kernel-state overwrite. Moved the adoption dump to after
   `dump_snapshot` (so the report context is available) and changed
   the failure path to short-circuit: emit a no-ops report and
   return, deferring to the next reconcile pass. The
   `dump_succeeded` gate at the cleanup step is now unnecessary
   (reaching that point implies the dump succeeded) so it's
   removed in favor of a simpler `failed.is_empty()` check.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

crates/evpn-linux/src/reconcile.rs:1554

  • If add_nexthop_group(...).await? fails, any new per-VTEP members installed above remain present but (since record_group_member_change is not reached) are not attached to the group in GroupOwnedMap. A permanent failure would leave these members orphaned and allocator slots reserved. Consider best-effort rollback/enqueue of member IDs created in this Update op when the group REPLACE step fails.
            // Compute new member-id list and REPLACE the group.
            let member_ids: Vec<u32> = members
                .iter()
                .map(|ip| state.groups.vtep_nh(ip).expect("just installed").id)
                .collect();
            let Some(g_id) = state.groups.group(group_key).map(|g| g.id) else {
                // `compute_diff` Pass 1b only emits `UpdateFdbNhgMembers`
                // when the group exists in `GroupOwnedMap`, so hitting
                // this branch means `owned` and `groups` have drifted
                // out of sync — an internal state inconsistency.
                // Surface it as a transient error (classified `Other`
                // → retried) so it shows up in metrics rather than
                // silently clearing retry state.
                return Err(crate::error::DataplaneError::Other(format!(
                    "ADR-0059: UpdateFdbNhgMembers for {group_key:?} but group missing \
                     from GroupOwnedMap (owned/groups state drift)",
                )));
            };
            dataplane.add_nexthop_group(g_id, &member_ids).await?;
            let new_members: BTreeSet<_> = members.iter().copied().collect();

Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +1491 to +1497
let id = state.nh_id_alloc.alloc_nhg().map_err(|e| {
crate::error::DataplaneError::Other(format!("ADR-0059: nhg alloc failed: {e}"))
})?;
if let Err(e) = dataplane.add_nexthop_group(id, &member_ids).await {
state.nh_id_alloc.release(id);
return Err(e);
}
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +1505 to +1513
// Step 3: install the FDB row pointing at the group.
// Failure here leaves the group + members installed in
// `state.groups` — they're refcounted, so the next pass's
// Install/Update for any MAC sharing this `group_key`
// will reuse them. Per-VTEP members not yet ref'd by any
// MAC stay tracked in `groups.vtep_nhs`; the GC happens
// on the corresponding RemoveFdbNhg path's last-ref unref.
dataplane.install_fdb_nhg_row(*vni, *mac, group_id).await?;
state.groups.record_mac_ref(*group_key, *vni, *mac);
Comment on lines +1639 to +1651
) {
match dataplane.del_nexthop(id).await {
Ok(()) => state.nh_id_alloc.release(id),
Err(e) => {
tracing::warn!(
?e,
id,
site,
"FDB-NHG GC: del_nexthop failed; queued for retry, allocator slot retained"
);
state.pending_deletes.insert(id);
}
}
Three findings on the sixth fixup pass:

1. `apply_nhg_op InstallFdbNhg` Step 2 (group install) or Step 3
   (`install_fdb_nhg_row`) failure left any per-VTEP members
   newly-created in Step 1 (and the newly-created group on Step 3
   failure) installed in the kernel with zero MAC refs — the actor
   would record the op as failed, classify as `Permanent` on e.g.
   permission/CVE-guard errors, and never have a signal to re-emit
   a Remove. Adds `rollback_partial_install` helper + new
   `GroupOwnedMap::drop_unreferenced_group` method that walks the
   group + members in reverse-creation order, clears local tracking,
   and routes each delete through `try_del_and_release_alloc` (so
   transient rollback delete failures land in the retry queue).

2. Same shape on `UpdateFdbNhgMembers`: Step 1 created members, Step
   2 (group REPLACE) failure or the owned/groups-drift error path
   could leak the newly-installed per-VTEP members. Tracks new
   members + rolls them back on any later-step failure.

3. `try_del_and_release_alloc` enqueued every delete failure for
   retry, including permanent classes (`PermissionDenied`,
   `KernelTooOld`) that can't be cured by retrying. Now classifies
   via `DataplaneError::class()`: transient/conflict → queue +
   retry; permanent → warn once + retain slot (operator-intervention
   territory). Stops the runaway retry loop + repeated warns on
   permanent failure scenarios.
@lance0 lance0 requested a review from Copilot May 12, 2026 21:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Comment thread crates/evpn-linux/src/diff.rs Outdated
Comment on lines 50 to 53
@@ -47,6 +51,13 @@ pub struct Plan {
/// (sorted by `(VNI, MAC)`), then updates, then deletes. The
/// actor applies them serially with backoff per-op on failure.
pub ops: Vec<DataplaneOp>,
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
let status = build_instance_status(&intent.instances, &probes);
let bum_enforcement = build_bum_enforcement_status(
&intent.bum_enforcement,
&KernelSnapshot::new(),
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +1734 to +1736
&& let Some((_id_from_map, _members)) = state.groups.drop_unreferenced_group(&group_key)
{
try_del_and_release_alloc(dataplane, state, group_id, site).await;
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
) {
use crate::error::FailureClass;
match dataplane.del_nexthop(id).await {
Ok(()) => state.nh_id_alloc.release(id),
Four small findings on the seventh fixup pass:

1. `Plan::ops` docstring claimed creates → updates → deletes
   ordering, but the build order changed to deletes → creates →
   updates in an earlier round to keep transition pairs (Remove →
   Install) safe within a single plan. Update the doc to match.

2. Adoption-dump-failure short-circuit built BUM enforcement
   against `KernelSnapshot::new()` when the real `snapshot` was
   already available — would report BUM readiness/action as if no
   links existed. Switch to the real snapshot.

3. `rollback_partial_install` ignored the `(group_id, members)`
   returned by `drop_unreferenced_group` and instead used the
   caller-supplied `group_id`. Source these from the map (the ID
   we actually tracked there) with a debug_assert on equality, so
   any future state-drift bug surfaces at test time rather than
   silently deleting the wrong kernel object.

4. `try_del_and_release_alloc` released the allocator slot on `Ok`
   but didn't remove the ID from `state.pending_deletes`. A prior-
   pass-queued ID that got cleared by the steady-state path would
   stay queued and re-attempt every reconcile pass (kernel returns
   ENOENT idempotent-ACK, but the trace spam adds up). Now drops
   the entry from `pending_deletes` on success.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

crates/evpn-linux/src/in_memory.rs:351

  • Similarly, remove_fdb_nhg_row only updates fdb_nhg_rows; if you make install_fdb_nhg_row update state.kernel for snapshot consistency, this should also remove/clear the corresponding KernelSnapshot FDB entry so compute_diff converges correctly in in-memory actor tests.
    async fn remove_fdb_nhg_row(
        &mut self,
        vni: EvpnInstanceId,
        mac: MacAddress,
    ) -> Result<(), DataplaneError> {
        let mut state = self.state.lock().expect("poisoned");
        // Idempotent — slice 3b coordinator may issue this on
        // already-removed rows during stale cleanup.
        state.fdb_nhg_rows.remove(&(vni, mac));
        Ok(())

Comment thread crates/evpn-linux/src/reconcile.rs Outdated
.group(group_key)
.map(|g| (g.id, g.members.iter().copied().collect::<Vec<_>>()));
let group_id = if let Some((g_id, existing_members)) = existing {
if existing_members != *members {
Comment on lines +331 to +340
async fn install_fdb_nhg_row(
&mut self,
vni: EvpnInstanceId,
mac: MacAddress,
nh_id: u32,
) -> Result<(), DataplaneError> {
let mut state = self.state.lock().expect("poisoned");
state.fdb_nhg_rows.insert((vni, mac), nh_id);
Ok(())
}
Comment on lines +28 to +37
# Test selector. `bum_*` runs the Gate 8b harness; `fdb_nhg` runs
# the ADR-0059 slice 3b FDB nexthop group integration test.
TEST_BIN="netns_bum_filter"
case "${1:-all}" in
spike) FILTER="bum_filter_spike_validates_kernel_primitive" ;;
roundtrip) FILTER="linux_dataplane_set_bum_port_flags_round_trip" ;;
all|"") FILTER="" ;;
fdb_nhg) TEST_BIN="netns_fdb_nhg"; FILTER="" ;;
fdb_nhg_roundtrip) TEST_BIN="netns_fdb_nhg"; FILTER="round_trip_install_and_remove_fdb_nhg" ;;
fdb_nhg_cve) TEST_BIN="netns_fdb_nhg"; FILTER="cve_guard_blocks_install_when_learning_enabled" ;;
Three findings on the eighth fixup pass:

1. `InMemoryDataplane::install_fdb_nhg_row` recorded the row in
   `fdb_nhg_rows` but never mirrored it into `state.kernel`. The
   slice 3b drift check in `compute_diff` Pass 1b reads from the
   snapshot returned by `dump_snapshot()`, which only reflects
   `state.kernel`, so a unit test using the in-memory backend
   would never see the row and re-emit `InstallFdbNhg` forever.
   Insert a matching `KernelFdbEntry` (`nh_id: Some(_)`,
   `dst: None`, `extern_learn: true`) on install; remove it on
   `remove_fdb_nhg_row`.

2. `existing_members != *members` in the Install drift heal:
   readable as written (the `!=` operator's `PartialEq::ne` takes
   references, no move), but rewritten to
   `.as_slice() != .as_slice()` for clarity.

3. `run-netns-tests.sh` header described only the Gate 8b
   selectors; the script now also dispatches `fdb_nhg`,
   `fdb_nhg_roundtrip`, and `fdb_nhg_cve` for slice 3b. Update
   the usage block to document all selectors + bump the kernel
   requirement to >= 5.8 for the fdb_nhg bundle (NDA_NH_ID
   appeared then).
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment thread crates/evpn-linux/src/linux/mod.rs Outdated
Comment on lines +675 to +676
/// becomes `InvalidArgument` (the actor's classifier maps it to
/// `Permanent` so a programming bug doesn't backoff-retry forever);
Comment thread crates/evpn-linux/src/diff.rs Outdated
// [10.0.0.2]. Expect UpdateFdbNhgMembers emitted once.
let mut desired = RemoteMacTable::builder();
desired
.insert(vni(100), mac(1), entry_multi_homed("10.0.0.2", &[], 7))
Two findings on the ninth fixup pass:

1. `map_nexthop_error` doc said `Kernel(errno)` becomes
   `InvalidArgument`, but slice-3b round 4 routed it through
   `map_nexthop_kernel_errno` for per-errno classification
   (`PermissionDenied` / `KernelTooOld` / `InvalidArgument` /
   `Other`). Update the doc to match so the
   retry-vs-suppression behavior isn't misread.

2. `fdb_nhg_member_set_shrinks_to_one_stays_group_backed` test
   constructed a `RemoteMacEntry` with `alias_group_key = Some(_)`
   and empty `alias_vtep_ips`, which violates the projection
   invariant (`empty alias_vtep_ips ⇔ alias_group_key.is_none()`
   per `projection.rs:259`) — production wire intent can't
   reach an N=1 group-backed state because the projection
   collapses to single-dst. Test now exercises N=3 → N=2
   instead, which is the realistic "stays group-backed" path
   and still covers the UpdateFdbNhgMembers code path. Renamed
   to drop the misleading "to_one" suffix.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comment on lines +401 to +443
// Per-group member-set diff. Dedupe across MACs that
// share the group. The group-missing case (our local
// state doesn't track this `linux_key`) is intentionally
// *not* healed here — the per-MAC kernel-drift check
// below emits one Install per MAC, which the apply path
// dedupes at the group level via `record_group_install`.
if group_seen.insert(linux_key)
&& let Some(existing) = groups.group(&linux_key)
{
let existing_members: Vec<_> = existing.members.iter().copied().collect();
if existing_members != canonical {
updates.push(DataplaneOp::UpdateFdbNhgMembers {
group_key: linux_key,
members: canonical.clone(),
});
}
}

// Per-MAC kernel-snapshot drift check. Re-emit Install
// when:
// - kernel has no FDB row for `(vni, mac)`,
// - kernel row is single-dst (no `nh_id`),
// - kernel row points at an `nh_id` that doesn't
// match our locally-tracked group ID (manual
// `bridge fdb replace`, partial-failure carryover,
// or stale state — we'd silently forward via the
// wrong group otherwise),
// - or we don't track the group locally
// (startup-adoption mismatch).
// `InstallFdbNhg` is idempotent under `NLM_F_REPLACE` so
// re-emitting when the row is already correct is safe.
let kernel_row_correct = snapshot
.find_fdb(vni, mac)
.and_then(|k| k.nh_id)
.is_some_and(|kernel_id| tracked_group_id == Some(kernel_id));
if !kernel_row_correct {
creates.push(DataplaneOp::InstallFdbNhg {
vni,
mac,
group_key: linux_key,
members: canonical,
});
}
One finding on the tenth fixup pass:

  A single plan could emit both `UpdateFdbNhgMembers` (group
  member-set drift) and `InstallFdbNhg` (per-MAC kernel-row
  drift) for the same `group_key`. `apply_nhg_op InstallFdbNhg`
  already does a REPLACE in its group-already-exists branch, so
  the subsequent `UpdateFdbNhgMembers` is a redundant second
  REPLACE on the same NHID — wasted netlink, more complex
  failure sequencing.

Post-process the plan to drop `UpdateFdbNhgMembers` ops whose
`group_key` is also covered by an `InstallFdbNhg` in the same
plan.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Comment on lines +1559 to +1564
// Step 3: install the FDB row pointing at the group. On
// failure, roll back the newly-created group (if any) and
// members — they have no MAC ref yet, so they're true
// orphans without rollback.
if let Err(e) = dataplane.install_fdb_nhg_row(*vni, *mac, group_id).await {
rollback_partial_install(
Comment on lines +1409 to +1414
/// ADR-0059 slice 3b coordinator: apply an FDB-NHG op by calling
/// [`NexthopOps`] methods in ADR §5 invariant 1+2 order, updating
/// the allocator + [`GroupOwnedMap`] state.
///
/// Free function (not a method) so it can borrow `dataplane` and
/// `state` mutably from disjoint fields of the actor.
Two findings on the eleventh fixup pass:

1. `rollback_partial_install` could corrupt `GroupOwnedMap` on
   the existing-group drift-heal Step 3 failure path: Step 2's
   successful REPLACE had already attached the new members to
   the (still-installed) group via `record_group_member_change`,
   so the post-rollback `drop_vtep_nh` would remove members
   that the surviving group still references — leaving the
   group's `members` set pointing at gone kernel state.
   Rollback now skips members whose per-VTEP NH still has live
   group refs (`vtep_nh_is_orphan` returns false); the next
   reconcile heals the missing FDB row.

2. Shutdown `drain()` issued `RemoveRemoteFdb` for every owned
   `(VNI, MAC)`, regardless of `OwnedEntryKind`. For FdbNhg
   entries that's the wrong op shape — `Dataplane::apply`
   rejects with `InvalidArgument` and the kernel group/members
   stay orphaned across restarts. Drain now snapshots
   `(vni, mac, kind)` and dispatches: SingleDst →
   `RemoveRemoteFdb` (existing path), FdbNhg → `RemoveFdbNhg`
   routed through `apply_nhg_op` so the FDB row → group →
   members teardown sequence runs and refcount state stays
   consistent for sibling MACs that share the group.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +1487 to +1489
return Err(crate::error::DataplaneError::Other(format!(
"ADR-0059: vtep NH alloc failed: {e}"
)));
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +1659 to +1667
// Surface it as a transient error (classified `Other`
// → retried) so it shows up in metrics rather than
// silently clearing retry state.
rollback_partial_install(dataplane, state, &new_members, None, "update_no_group")
.await;
return Err(crate::error::DataplaneError::Other(format!(
"ADR-0059: UpdateFdbNhgMembers for {group_key:?} but group missing \
from GroupOwnedMap (owned/groups state drift)",
)));
Comment on lines +52 to +59
/// `NHA_GROUP_TYPE` — `u16` discriminator for group flavor
/// (MPATH vs resilient). Kernel may include this on dump even when
/// we don't set it on add; slice 3b's parser tolerates and ignores.
pub(crate) const NHA_GROUP_TYPE: u16 = 3;

/// `NHA_OP_FLAGS` — bitmask added in newer kernels; emitted on
/// dump regardless of whether userspace set it. Parser tolerates.
pub(crate) const NHA_OP_FLAGS: u16 = 15;
Comment thread crates/evpn-linux/src/reconcile.rs Outdated
Comment on lines +526 to +543
"adoption: dump_owned_nexthops failed; deferring this reconcile pass to avoid allocator collisions"
);
let status = build_instance_status(&intent.instances, &probes);
// `dump_snapshot` already succeeded — use the real
// snapshot so BUM enforcement reflects actual link
// state, not "no links exist".
let bum_enforcement =
build_bum_enforcement_status(&intent.bum_enforcement, &snapshot);
self.emit_report(
status,
vec![],
vec![],
bum_enforcement,
ip_vrf_status,
ip_vrf_routes,
)
.await;
return;
Four findings on the twelfth fixup pass:

1. Allocator failures (`alloc_vtep_nh` / `alloc_nhg` exhaustion or
   tag-bit collision) returned `DataplaneError::Other` (classified
   `Transient`), so the actor would infinite-retry an
   un-recoverable condition. Switch to `InvalidArgument` (Permanent)
   so the op-shape gets suppressed and the operator sees it surface.

2. `UpdateFdbNhgMembers` "group missing from GroupOwnedMap" branch
   represents an internal owned/groups state-drift invariant
   violation — also returned `Other` (transient) and would spin
   forever. Switch to `InvalidArgument` (Permanent) for the same
   reason; if state ever drifts, the suppression flags it as a
   bug rather than hiding it behind retry churn.

3. `constants_match_adr_0059_uapi_table` didn't pin the new
   `NHA_GROUP_TYPE = 3` and `NHA_OP_FLAGS = 15` constants added
   for the slice 3b dump-tolerance list. Drift here would
   silently break the dump-path filter — add the asserts.

4. `dump_owned_nexthops` failure short-circuited the entire
   reconcile pass regardless of error class. On
   `KernelTooOld` / `PermissionDenied` / `InvalidArgument` (no
   `NDA_NH_ID` / no `CAP_NET_ADMIN` / message-shape bug) that
   would permanently disable single-dst FDB / BUM / L3 too,
   not just FDB-NHG. Now classifies: Permanent → mark
   `adoption_done = true` + log once + keep going (FDB-NHG
   ops in this pass will Permanent-fail at apply time and
   get per-op-shape suppressed); Transient/Conflict → keep
   the existing short-circuit so a flaky dump doesn't
   under-reserve the allocator.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 14 out of 14 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

crates/evpn-linux/src/reconcile.rs:1066

  • cleanup_unreferenced_adoptions deletes stale nexthops in ascending ID order. Since member IDs (0x3000_....) sort before group IDs (0x4000_....), this will try to delete members while their (stale) group still exists, which the kernel typically rejects; that forces at least one extra reconcile pass and keeps adoption_done false even if the group delete succeeds later in the same loop. Consider deleting stale NHGs before members (using the stored KernelNexthopKind/tag bits to order), so cleanup can succeed in a single pass and avoid avoidable warnings/retries.
    async fn cleanup_unreferenced_adoptions(&mut self) -> bool {
        // Drain to avoid holding the borrow across the await.
        let stale: Vec<u32> = self.state.adopted_unreferenced.keys().copied().collect();
        let mut all_ok = true;
        for id in stale {
            // del_nexthop is idempotent on ENOENT (slice 2). Only
            // release + drop tracking on Ok — on Err, leave the slot
            // reserved + the entry in the map so the next reconcile
            // pass retries.
            match self.dataplane.del_nexthop(id).await {
                Ok(()) => {
                    self.state.nh_id_alloc.release(id);
                    self.state.adopted_unreferenced.remove(&id);
                }
                Err(e) => {
                    tracing::warn!(
                        ?e,
                        id,
                        "adoption cleanup: del_nexthop failed; leaving reserved for next pass"
                    );
                    all_ok = false;
                }
            }
        }
        all_ok

Comment on lines +1013 to +1026
async fn drain_pending_deletes(&mut self) {
let ids: Vec<u32> = self.state.pending_deletes.iter().copied().collect();
for id in ids {
match self.dataplane.del_nexthop(id).await {
Ok(()) => {
self.state.nh_id_alloc.release(id);
self.state.pending_deletes.remove(&id);
tracing::debug!(id, "pending_deletes: drained on retry");
}
Err(e) => {
tracing::trace!(?e, id, "pending_deletes: still failing");
}
}
}
lance0 added 2 commits May 12, 2026 20:59
Four findings from a deep review pass on top of the 13 Copilot
rounds:

1. LOAD-BEARING — Startup adoption cleanup could delete still-live
   adopted NHIDs after FDB-NHG installs were permanently suppressed.
   `apply_plan`'s suppression branch `continue`s out without joining
   `failed`, so once a permanent failure was recorded the next pass
   saw `failed.is_empty()` and ran cleanup against
   `adopted_unreferenced` — including IDs that a kernel FDB row was
   still pointing at. `cleanup_unreferenced_adoptions` now:
   - Blocks cleanup entirely if any FDB-NHG op is permanently
     suppressed (`nhg_permanent_failures` non-empty, or
     `permanent_failures` contains InstallFdbNhg / UpdateFdbNhgMembers
     / RemoveFdbNhg).
   - Threads `&KernelSnapshot` through; walks `snapshot.iter_fdb()`
     to build a retention set of every adopted NHID a kernel FDB
     row references, plus recursive members of retained groups.
   - Only deletes adopted IDs NOT in the retention set; retained
     IDs stay in `adopted_unreferenced` for the next pass to
     re-evaluate.

2. Teardown order — both `pending_deletes` and
   `cleanup_unreferenced_adoptions` iterate `BTreeSet<u32>` in
   ascending order; with the tag scheme (members at 0x3000_xxxx,
   groups at 0x4000_xxxx) that deletes members before their parent
   group, which the kernel can reject with `EINVAL` and which
   defeats ADR-0059 §5 invariant 2 (FDB row → group → members).
   Partition by `NhIdAllocator::is_nhg` and drain groups first.

3. Actor-level test coverage was missing for the slice 3b
   coordinator + adoption state machine. The diff-level tests
   exercise `compute_diff` Pass 1b in isolation, and the privileged
   netns test (`tests/netns_fdb_nhg.rs`) validates the raw netlink
   primitives — neither covers `apply_nhg_op` through
   `ReconcileActor` with `GroupOwnedMap` / `OwnedSet` / report. Add
   four new tests in `crates/evpn-linux/tests/reconcile_actor.rs`:
     - `fdb_nhg_multihomed_intent_installs_members_group_and_row`
     - `fdb_nhg_shared_group_refcount_teardown` (two MACs share
       group; withdraw-one keeps it, withdraw-last tears down)
     - `fdb_nhg_adoption_retains_live_kernel_fdb_refs` (preloaded
       NHG + member + kernel FDB row → cleanup retains)
     - `fdb_nhg_perm_suppressed_install_blocks_adoption_cleanup`
       (Permanent Install + later suppressed pass with empty `failed`
       → cleanup still blocked)
   Adds three `InMemoryHandle` helpers (`pre_load_nexthop_op`,
   `nexthop_ops`, `kernel_fdb_nh_id`) and threads a
   `take_universal_failure` hook into every `NexthopOps` method so
   the test can drive a Permanent FDB-NHG apply.

4. Nit — `encode_add_fdb_member` had two `match gateway { ... }`
   blocks (one for `family`, one for the wire octets). Fold into
   one match so the two derivations can't drift.
First post-mortem for the symmetric Interface-less IRB datapath.
Run completed cleanly at 24h00m elapsed against v0.18.0
(`5619ace` = v0.18.0 + harness PR #80) — peak PE1 RSS 14.34 MB,
steady-state slope 0.025 MB/h, 0 BGP flaps, 0 installed_routes
oscillations, 99.93% tenant/observed correlation. PASS, all gates
green. Verdict + sign-off mirror the Gate 8b post-mortem at
`docs/soak-gate8b-24h-bum-state.md`.

Five non-blocking follow-ups noted for the next re-soak:
- Watch for the single ~22h RSS step recurrence.
- Source `pe1_observed_routes` from a Prometheus counter, not
  the periodic `DataplaneReport` (eliminate the single
  observation-lag sampling race observed at row 1386).
- Fix the harness so `pe1.log` actually captures rustbgpd
  output (it was 0 bytes this run).
- Clarify `CHURN_INTERVAL_SEC` semantics (configured 30 s,
  actual cadence ~60 s/transition).
- Add an `originated_routes` column to the CSV now that v0.18.0
  ships the Prometheus series.

Raw artifacts pinned alongside per the
`docs/artifacts/soak/gate8b-20260510T152451Z/` convention.
@lance0 lance0 merged commit ac72619 into main May 13, 2026
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants