telemetry/bgpstatus: collect BGP state from all tenant VRF namespaces#3597

Merged
juan-malbeclabs merged 4 commits into main from telemetry/bgpstatus-multi-vrf-namespace on Apr 28, 2026

@juan-malbeclabs
Contributor

Summary of Changes

  • Users whose tenant has VrfId != 1 have their GRE tunnel interface placed in ns-vrf<N> on the Arista device, but the BGP status submitter was only checking ns-vrf1. This caused a persistent "tunnel not found" debug log and left those users' onchain BGP status permanently stale.
  • Fix: on each tick, derive the full set of VRF namespaces from programData.Tenants via a new vrfNamespaces helper, then collect BGP socket stats and local interfaces from all of them before the per-user loop. Tunnel IPs are globally unique (onchain-allocated), so merging across namespaces is safe.
  • Introduced NamespaceCollector as an injectable function type in Config, replacing LocalNet. DefaultCollector wraps the real Linux calls; tests supply a mock without any Linux syscalls.

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 2 | +82 / -21 | +61 |
| Scaffolding | 1 | +1 / -1 | 0 |
| Tests | 3 | +508 / -11 | +497 |

Mostly test additions; the fix itself is compact -- ~60 net lines across two core files.

Key files
  • controlplane/telemetry/internal/bgpstatus/submitter.go -- adds DefaultCollector (wraps Linux calls), rewrites tick() to loop over all VRF namespaces and merge results; aborts only if every namespace fails
  • controlplane/telemetry/internal/bgpstatus/bgpstatus.go -- adds NamespaceCollector func type, vrfNamespaces helper (derives namespace list from tenant VRF IDs), replaces LocalNet with Collector in Config
  • controlplane/telemetry/internal/bgpstatus/submitter_linux_test.go -- four new tick() behavioral tests: user in ns-vrf2 found and reported Up; partial namespace failure continues; all-namespace failure aborts
  • controlplane/telemetry/internal/bgpstatus/submitter_test.go -- migrates test helpers to NamespaceCollector, adds five vrfNamespaces unit tests (dedup, zero VrfId skip, multi-tenant)
  • e2e/user_bgp_status_test.go -- adds TestE2E_UserBGPStatus_NonDefaultTenant: creates a tenant (VrfId != 1), connects a client under it, and verifies Up -> Down BGP status transitions onchain
  • controlplane/telemetry/cmd/telemetry/main.go -- wires DefaultCollector(localNet) into bgpstatus.Config

Testing Verification

  • All existing bgpstatus unit tests pass unchanged
  • TestVrfNamespaces_* (5 cases): deduplication, zero VrfId skip, base namespace always included, additional VRF appended
  • TestTick_* (4 cases, Linux): user in ns-vrf2 is found and reported Up; one failing namespace is warned and skipped while others succeed; all namespaces failing aborts the tick with no submissions
  • TestE2E_UserBGPStatus_NonDefaultTenant: end-to-end validation that a user in a non-default tenant VRF reaches BGP status Up onchain when the session is established, and Down after the daemon is killed

Users whose tenant has VrfId != 1 have their GRE tunnel interface placed
in ns-vrf<N> on the Arista device. The BGP status submitter was only
checking ns-vrf1, causing a persistent "tunnel not found" debug log and
leaving those users' onchain BGP status stale.

Fix: derive the set of Linux VRF namespaces from programData.Tenants on
each tick (vrfNamespaces helper), then collect BGP socket stats and
local interfaces from all of them, merging before the per-user loop.
Tunnel IPs are globally unique (onchain-allocated) so merging across
namespaces is safe.

The NamespaceCollector function type replaces LocalNet in Config, making
the per-namespace collection fully injectable for testing. DefaultCollector
wraps the real Linux calls and is wired in main.go.

Unit tests cover vrfNamespaces (dedup, zero VrfId skip, multi-tenant) and
tick() behavior (user in ns-vrf2, partial namespace failure, all-fail abort).
Adds TestE2E_UserBGPStatus_NonDefaultTenant, which exercises the
multi-namespace collection path end-to-end: a tenant is created (VrfId != 1),
a client connects under that tenant, and the test verifies that the BGP
status submitter correctly reports Up (session established) and then Down
(doublezerod killed) onchain for the user whose tunnel lives in ns-vrf<N>.
@juan-malbeclabs juan-malbeclabs enabled auto-merge (squash) April 28, 2026 13:48
@juan-malbeclabs juan-malbeclabs merged commit d3dd140 into main Apr 28, 2026
48 of 52 checks passed
@juan-malbeclabs juan-malbeclabs deleted the telemetry/bgpstatus-multi-vrf-namespace branch April 28, 2026 16:30