Skip to content

controller: hoist unknown BGP peer cleanup before main router bgp block#3627

Merged
ben-malbeclabs merged 2 commits intomainfrom
bc/controller-template-unknown-peer-cleanup
Apr 30, 2026
Merged

controller: hoist unknown BGP peer cleanup before main router bgp block#3627
ben-malbeclabs merged 2 commits intomainfrom
bc/controller-template-unknown-peer-cleanup

Conversation

@ben-malbeclabs
Copy link
Copy Markdown
Contributor

@ben-malbeclabs ben-malbeclabs commented Apr 30, 2026

Summary of Changes

  • Hoist UnknownBgpPeers no neighbor X emission into a per-peer block that re-enters router bgp 65342 before the main config, removing the inline cleanup loops from address-family ipv4, address-family vpn-ipv4, and the per-VRF block.
  • Each peer's removal now starts with a clean router bgp 65342 re-entry and explicitly emits no neighbor X at top-level and in each AF/VRF, so cleanup is comprehensive and isolated per peer.
  • Refresh fixtures to match the new emission order: unknown.peer.removal.tmpl, e2e.peer.removal.tmpl, e2e.last.user.tmpl.

Fundamental Bug Being Fixed

The previous template emitted no neighbor X per unknown peer inside address-family ipv4, address-family vpn-ipv4, and each vrf vrfN block. EOS silently exits AF/VRF context when no neighbor X is issued for a peer that doesn't exist there. With a list of unknown peers, this means:

  1. First peer in the list doesn't exist in the AF → EOS exits AF context.
  2. Second peer's no neighbor X runs at router bgp top-level instead of inside the AF.
  3. All subsequent peers continue at the wrong context.

So the cleanup was order-dependent and non-deterministic: depending on which peer happened to iterate first, you'd either get the intended per-AF/VRF cleanup for the whole list or a partial mix where only the first peer got AF treatment and the rest got incidental top-level deletion. The bug rarely produced observable damage because top-level deletion is also "remove this peer," and UnknownBgpPeers are by definition peers that should be fully removed — but the cleanup behavior was unreliable in principle.

RFC-18 made the bug visible by adding next-hop resolution ribs ... (an AF-only command) immediately after the UnknownBgpPeers loop in address-family vpn-ipv4. When the loop's first peer triggered context-exit, the new command landed at router bgp top-level and EOS rejected it as Invalid input. That was the surface symptom; the real defect was structural and pre-existing.

The fix:

  • One router bgp 65342 re-entry per unknown peer means context corruption inside one peer's cleanup is bounded to that single iteration.
  • Each peer's cleanup explicitly covers top-level + each AF + each VRF, so removal is comprehensive regardless of where the peer was actually configured.
  • Removes the three inline loops in the main BGP block, eliminating the source of context-bleed for any AF-scoped command emitted later in the template.

Diff Breakdown

Category Files Lines (+/-) Net
Core logic 1 +15 / -9 +6
Fixtures 3 +55 / -15 +40

Single-file template change with three matching fixture refreshes.

Key files (click to expand)
  • controlplane/controller/internal/controller/templates/tunnel.tmpl — adds the per-peer cleanup block (top-level + per-AF + per-VRF no neighbor), removes the three inline UnknownBgpPeers ranges.
  • controlplane/controller/internal/controller/fixtures/e2e.last.user.tmpl — updated expected output for two unknown peers.
  • controlplane/controller/internal/controller/fixtures/e2e.peer.removal.tmpl — updated expected output for two unknown peers.
  • controlplane/controller/internal/controller/fixtures/unknown.peer.removal.tmpl — updated expected output for one unknown peer.

Testing Verification

  • Bug reproduced on physical EOS (chi-dn-dzd8, 7130LBR): no neighbor X for a non-existent peer inside address-family vpn-ipv4 followed by next-hop resolution ribs tunnel-rib system-tunnel-rib is rejected as Invalid input. Control test (same command without preceding no neighbor) is accepted normally — proving the rejection is caused by silent context-exit, not by the command itself.
  • Fix verified on the same hardware: cleanup block targeting three non-existent peer IPs (including 169.254.1.7, exact-match canary one digit off from real 169.254.1.6) produces zero diff. A sentinel command emitted after the cleanup correctly lands at router bgp 65342 top-level context. All real neighbors untouched. Session aborted; device state restored.

@ben-malbeclabs ben-malbeclabs marked this pull request as ready for review April 30, 2026 17:02
@ben-malbeclabs ben-malbeclabs requested a review from nikw9944 April 30, 2026 17:03
@ben-malbeclabs ben-malbeclabs self-assigned this Apr 30, 2026
@ben-malbeclabs ben-malbeclabs added the bug Something isn't working label Apr 30, 2026
EOS silently exits address-family context when `no neighbor X` is
issued for a peer that doesn't exist in that AF. Any AF-scoped command
emitted after the cleanup loop then runs at the wrong level — RFC-18's
`next-hop resolution ribs` line was the first to visibly hit this,
failing as "Invalid input" on physical EOS.

Move UnknownBgpPeers `no neighbor X` emission into a per-peer block
that re-enters `router bgp 65342` before the main config, and remove
the inline cleanup loops from address-family ipv4, address-family
vpn-ipv4, and the per-VRF block. Each peer's removal starts with a
clean `router bgp 65342` context, so any silent context-exit inside
the cleanup is bounded to that one iteration.
@ben-malbeclabs ben-malbeclabs force-pushed the bc/controller-template-unknown-peer-cleanup branch from 1a299fe to 115de72 Compare April 30, 2026 19:52
@ben-malbeclabs ben-malbeclabs merged commit 657fb50 into main Apr 30, 2026
47 of 49 checks passed
@ben-malbeclabs ben-malbeclabs deleted the bc/controller-template-unknown-peer-cleanup branch April 30, 2026 21:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants