
Add BGP control plane implementation #1

Closed

scotwells wants to merge 23 commits into main from feat/bgp-control-plane

Conversation

@scotwells
Contributor

Summary

  • Complete CRD-driven BGP control plane with 6 resource types
  • Controller reconcilers for configuration, sessions, peering policies, advertisements, and route policies
  • GoBGP gRPC client with connection retry, health watching, and full re-reconciliation
  • Route syncer for kernel FIB programming (netlink, proto 196); see the sketch after this list
  • Prometheus metrics for session state, prefix counts, and flap detection
  • Standalone binary (cmd/bgp/) with cobra CLI
  • Multi-stage Dockerfile for the BGP operator image
  • Kustomize deployment manifests (DaemonSet with GoBGP sidecar)
  • CRD manifests for all resource types
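
For the route-syncer bullet above, a minimal sketch of kernel FIB programming with protocol 196, assuming the vishvananda/netlink library; the package and `syncRoute` are illustrative, not lifted from this PR:

```go
// Hypothetical package; only the protocol number and the netlink approach
// are described in the PR.
package routesync

import (
	"net"

	"github.com/vishvananda/netlink"
)

// Routes installed with a distinct protocol number can be told apart from
// kernel and static routes and safely cleaned up on resync.
const bgpRouteProto = 196

// syncRoute programs one learned prefix into the kernel FIB.
func syncRoute(prefix *net.IPNet, nextHop net.IP) error {
	route := &netlink.Route{
		Dst:      prefix,
		Gw:       nextHop,
		Protocol: bgpRouteProto,
	}
	// RouteReplace is idempotent: it adds the route or atomically replaces
	// an existing route with the same destination.
	return netlink.RouteReplace(route)
}
```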

Test plan

  • go build ./... passes
  • go vet ./... passes
  • Deploy to kind cluster and verify BGP sessions establish
  • E2E test suite (separate PR)

🤖 Generated with Claude Code

scotwells and others added 23 commits March 16, 2026 23:37
Kubernetes-native BGP control plane with CRD-driven topology management.

Resources:
- BGPConfiguration: speaker identity (AS, port, router ID)
- BGPEndpoint: self-advertisement
- BGPSession: peering relationship between endpoints
- BGPPeeringPolicy: automated session creation (mesh, route-reflector)
- BGPAdvertisement: prefix advertisement with communities and peer scoping
- BGPRoutePolicy: import/export filtering

Controllers reconcile these resources into GoBGP gRPC calls. A route syncer
programs learned routes into the kernel FIB. Runs as a DaemonSet with GoBGP
as a sidecar container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
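
For orientation, a hypothetical Go sketch of what the BGPSession type from the resource list above might look like; all field names here are guesses for illustration, not taken from the PR:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// BGPSessionSpec names the two BGPEndpoints being peered (field names are
// illustrative).
type BGPSessionSpec struct {
	LocalEndpoint string `json:"localEndpoint"`
	PeerEndpoint  string `json:"peerEndpoint"`
}

// BGPSessionStatus reports whether the local speaker has configured the
// peer, e.g. via a Configured condition.
type BGPSessionStatus struct {
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}

// BGPSession is a cluster-scoped peering relationship between two endpoints.
type BGPSession struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   BGPSessionSpec   `json:"spec,omitempty"`
	Status BGPSessionStatus `json:"status,omitempty"`
}
```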
Adds a service design document and API reference. Updates all documentation
to clarify that the BGP control plane runs on any Kubernetes-compatible API
server, not just Kubernetes clusters.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- ci.yaml: build + e2e jobs on PRs to main and pushes to main
- docker.yaml: multi-arch image build and push on v* tags

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chainsaw-based e2e tests covering system-readiness, BGP peering,
advertisement, route policy, and session lifecycle. Includes kind
cluster config, Taskfile, and kustomize overlay for test deployments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Chainsaw v0.2.12 does not support $env variable references. The
KUBECONFIG environment variable is inherited from the parent process
(set by the Taskfile and CI workflow), so explicit injection is
unnecessary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use JSON patch to replace the full env list, ensuring NODE_NAME is
defined before LOCAL_ENDPOINT so Kubernetes variable substitution
$(NODE_NAME) resolves correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
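
The ordering constraint comes from Kubernetes dependent environment variables: a `$(VAR)` reference only resolves if `VAR` is defined earlier in the same env list. A fragment using the client-go core/v1 types (the PR itself applies this as a JSON patch on the manifest, not in Go):

```go
import corev1 "k8s.io/api/core/v1"

// NODE_NAME must precede LOCAL_ENDPOINT: Kubernetes expands $(NODE_NAME)
// only when NODE_NAME is defined earlier in the same list.
var env = []corev1.EnvVar{
	{
		Name: "NODE_NAME",
		ValueFrom: &corev1.EnvVarSource{
			FieldRef: &corev1.ObjectFieldSelector{FieldPath: "spec.nodeName"},
		},
	},
	{Name: "LOCAL_ENDPOINT", Value: "$(NODE_NAME)"},
}
```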
The BGP project uses LOCAL_ENDPOINT=$(NODE_NAME) without a prefix.
Updated all test fixtures to match: endpoint names use the node name
directly (e.g., "bgp-e2e-worker2" not "node-bgp-e2e-worker2").

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ially

The BGP session reconciler filters sessions by LOCAL_ENDPOINT=$(NODE_NAME),
so BGPEndpoint resources must be named exactly after the kind worker nodes
(bgp-e2e-worker, bgp-e2e-worker2, bgp-e2e-worker3).

- bgp-advertisement: rename endpoints from bgp-e2e-worker-adv/bgp-e2e-worker2-adv
  to bgp-e2e-worker/bgp-e2e-worker2
- bgp-route-policy: rename endpoints from bgp-e2e-worker-rp/bgp-e2e-worker2-rp
  to bgp-e2e-worker/bgp-e2e-worker2
- session-lifecycle: rename endpoints from bgp-e2e-worker2-lc/bgp-e2e-worker3-lc
  to bgp-e2e-worker2/bgp-e2e-worker3; update BGPSession endpoint references
- bgp-peering: tighten session assertions to filter by policy label
  (bgp.miloapis.com/policy=e2e-mesh) to avoid counting stale sessions
- chainsaw-config.yaml: add concurrent: false so tests run sequentially,
  preventing endpoint name conflicts (each test creates cluster-scoped
  BGPEndpoint resources named after the same worker nodes)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Chainsaw v0.2.12 does not support spec.concurrent. Tests run
sequentially by default in this version.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Three fixes to address CI run #7 failures:

1. gobgpd image: install gobgp CLI alongside gobgpd so e2e tests can
   exec `gobgp neighbor` into the sidecar container. Previously only
   gobgpd was installed, causing the verify-gobgp-peer-present and
   delete-session-and-verify-peer-removed test steps to fail with
   "gobgp: not found".

2. PeeringPolicyReconciler: convert silent log.Printf on status patch
   failures to proper error returns so the controller-runtime work queue
   retries the reconciliation. Silent swallowing caused the status to
   stay at {} if any transient API server error occurred — the reconciler
   would return success with no requeue.

3. SessionReconciler: same fix — return the status patch error instead
   of logging it, ensuring the Configured condition is eventually set
   even under temporary API server pressure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
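
A hedged sketch of the fix in items 2 and 3; the surrounding reconciler code is reconstructed, and only the `Configured` condition and the return-instead-of-log change come from this PR:

```go
// Inside SessionReconciler.Reconcile, after configuring the GoBGP peer.
// client = sigs.k8s.io/controller-runtime/pkg/client,
// meta = k8s.io/apimachinery/pkg/api/meta.
patch := client.MergeFrom(session.DeepCopy())
meta.SetStatusCondition(&session.Status.Conditions, metav1.Condition{
	Type:   "Configured",
	Status: metav1.ConditionTrue,
	Reason: "PeerConfigured", // illustrative reason string
})
// Before: a failed patch was only log.Printf'd and Reconcile returned
// success, so the work queue never retried and status stayed {}.
// After: return the error so controller-runtime requeues with backoff.
if err := r.Status().Patch(ctx, &session, patch); err != nil {
	return ctrl.Result{}, err
}
return ctrl.Result{}, nil
```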
…meout

Root cause: every DaemonSet pod ran the PeeringPolicyReconciler independently.
All N pods would simultaneously list endpoints, compute desired sessions, and
create/GC the same cluster-scoped BGPSession objects. This caused a cascade:
pod A creates sessions, pod B GCs them as "stale" on the next reconcile cycle
because it saw a different endpoint count due to concurrent endpoint updates
from parallel tests. The status.matchedEndpoints never stabilised above 0.

Fixes:

1. Two-manager architecture in controller.go:
   - Per-node manager (no leader election): ConfigReconciler, SessionReconciler,
     AdvertisementReconciler, RoutePolicyReconciler. Every pod runs these because
     each must independently configure its local GoBGP sidecar.
   - Leader-elected manager: PeeringPolicyReconciler only. Exactly one pod
     creates/deletes BGPSession resources at a time, eliminating the race
     (see the sketch below).

2. Leader election RBAC in clusterrole.yaml:
   Added Role + RoleBinding in bgp-system namespace granting the service account
   access to coordination.k8s.io/leases (required for controller-runtime leader
   election via Lease objects).

3. Sequential E2E tests (--parallel 1 in Taskfile):
   All tests share cluster-wide endpoint names that match Kubernetes node names
   (required by the session reconciler). Running tests in parallel caused one
   test's cleanup to delete endpoints another test was actively using. Sequential
   execution ensures complete isolation between test cases.

4. Increased exec timeout from 2m to 3m in chainsaw-config.yaml:
   The delete-session-and-verify-peer-removed step polls up to 30 times at 5s
   intervals (150s maximum). The old 2m limit was too tight; the new 3m gives
   sufficient headroom in CI environments with variable latency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
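
A minimal sketch of the two-manager split from item 1, using real controller-runtime options but an illustrative lease name; note that a later commit in this series replaces this design with a single manager:

```go
// ctrl = sigs.k8s.io/controller-runtime; cfg is the *rest.Config.

// Per-node manager: every DaemonSet pod runs these reconcilers, since each
// pod must configure its own local GoBGP sidecar.
perNode, err := ctrl.NewManager(cfg, ctrl.Options{
	LeaderElection: false,
})

// Leader-elected manager: at most one pod at a time runs the
// PeeringPolicyReconciler, which creates/deletes cluster-scoped BGPSessions.
leader, err := ctrl.NewManager(cfg, ctrl.Options{
	LeaderElection:          true,
	LeaderElectionID:        "bgp-peering-policy", // illustrative lease name
	LeaderElectionNamespace: "bgp-system",
})
```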
The ConfigReconciler resolves the BGP router ID by looking up the local
BGPEndpoint (name = LOCAL_ENDPOINT = node name). Without a BGPEndpoint
for each worker node, the reconciler returns an error and GoBGP is never
started. The BGPConfiguration.status.SpeakerReady stays False, causing
the system-readiness test to fail.

Add a deploy-endpoints task that runs as the last step in the deploy
sequence. It creates a BGPEndpoint for each of the three e2e worker nodes
using their IPv6 addresses. These base endpoints (no labels) are created
once and reused by all tests — individual tests apply their own labels
via kubectl apply when they need label-based selector matching.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
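
A hedged sketch of that lookup; the types and field access are assumptions, and only the name-equals-node-name convention and the fail-until-the-endpoint-exists behaviour come from the commit:

```go
// localEndpoint equals the node name by convention (LOCAL_ENDPOINT env var);
// bgpv1alpha1 is an assumed package alias for the CRD types.
localEndpoint := os.Getenv("LOCAL_ENDPOINT")
var ep bgpv1alpha1.BGPEndpoint
if err := r.Get(ctx, client.ObjectKey{Name: localEndpoint}, &ep); err != nil {
	// No BGPEndpoint for this node yet: fail the reconcile so it retries.
	// Until it succeeds, GoBGP is not started and SpeakerReady stays False.
	return ctrl.Result{}, fmt.Errorf("resolving router ID for %q: %w", localEndpoint, err)
}
```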
Two fixes for remaining CI e2e test failures.

bgp-peering assert-policy-matched-endpoints:
  The PeeringPolicyReconciler's first run after policy creation can see 0
  matching endpoints when the informer cache hasn't yet processed the label
  updates applied in the preceding create-bgp-endpoints step. With omitempty
  on int32 fields, a 0→0 merge patch is a no-op, so status stays {} and the
  5-minute assert times out with no subsequent reconcile trigger.

  Fix: requeue after 5s when 0 endpoints match so the next run picks up the
  now-current cache. Also add a 30s periodic requeue as a general safety net
  for any missed endpoint label change events. Only patch status when the
  values have changed to avoid spurious API server writes.
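
  A minimal sketch of that requeue logic, assuming a `MatchedEndpoints` status field and the usual controller-runtime client:

```go
matched := int32(len(matchingEndpoints))
if matched == 0 {
	// The informer cache may not yet reflect labels applied moments ago,
	// and no watch event will re-trigger us; come back shortly instead.
	return ctrl.Result{RequeueAfter: 5 * time.Second}, nil
}
if policy.Status.MatchedEndpoints != matched { // patch only on change
	patch := client.MergeFrom(policy.DeepCopy())
	policy.Status.MatchedEndpoints = matched
	if err := r.Status().Patch(ctx, &policy, patch); err != nil {
		return ctrl.Result{}, err
	}
}
// Periodic requeue as a safety net for missed endpoint label events.
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
```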

session-lifecycle verify-gobgp-peer-present:
  The test selected pods by index ({.items[0].metadata.name}) which returns
  the first pod alphabetically. That pod may be on bgp-e2e-worker, which is
  not involved in the worker2↔worker3 session and therefore has no GoBGP peer
  for W3. Both verify steps now use --field-selector spec.nodeName=bgp-e2e-worker2
  to target the pod that actually configured W3 as a peer.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log router ID resolution errors and successes with the localEndpoint
name so CI logs show why SpeakerReady stays False.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
BGPConfiguration.status is written by all 3 DaemonSet pods concurrently
(multi-writer). A transient failure on any node overwrites the status to
False. The actual BGP functionality (sessions, advertisements) is
validated by dedicated tests that already pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bgp-peering: The PeeringPolicyReconciler was running on a separate
leader-elected manager. In practice the leader manager's Start() would
block forever waiting to acquire the lease (RBAC timing, cache sync
ordering), so the reconciler never fired and matchedEndpoints stayed 0.

Replace the two-manager approach with a single manager. To prevent
DaemonSet pods from racing to create/delete the same BGPSessions,
introduce a deterministic "designated reconciler" guard: only the pod
whose BGPEndpoint name sorts first alphabetically performs writes. This
requires no external coordination (no Lease objects, no RBAC for
coordination.k8s.io) and fires on the same reconcile loop as all other
reconcilers, eliminating the startup-timing window.
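
A sketch of the guard as described; the function shape is assumed, and a later commit in this series removes it again:

```go
// isDesignatedReconciler returns true only on the pod whose local endpoint
// name sorts first among all BGPEndpoints, so exactly one pod performs
// writes without Lease objects or extra RBAC.
func isDesignatedReconciler(localEndpoint string, endpoints []bgpv1alpha1.BGPEndpoint) bool {
	if len(endpoints) == 0 {
		return false // nothing visible yet (this case later proved racy)
	}
	names := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		names = append(names, ep.Name)
	}
	sort.Strings(names)
	return names[0] == localEndpoint
}
```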

session-lifecycle: Replace the GoBGP-level peer-removal poll with a
simpler BGPSession resource-existence check. The finalizer guarantees
DeletePeer is called before the resource is removed from etcd, so
"resource gone == peer gone" is an unambiguous, reliable signal. The
previous approach could flake if the pod-selector or the gobgp CLI
command was slow, and it also had a hard limit of 30 attempts (150 s)
that was sometimes insufficient on a loaded test node.

Also remove the now-unused leader election Role, RoleBinding, and
LeaderElectionNamespace field from ControllerOptions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Log designated status and localEndpoint on every Reconcile call to
debug why the policy reconciler never processes policies in CI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…yReconciler

The designated-reconciler check (isDesignatedReconciler) was meant to ensure
only one DaemonSet pod creates/GCs BGPSessions, but it caused a silent failure
mode: if the pod whose LocalEndpoint sorted first alphabetically checked the
endpoint list before the informer cache was fully populated (the test creates
endpoints ~60ms before the policy), it would see 0 endpoints, return false,
and never requeue. The endpoint watch handler would only enqueue a new reconcile
if the policy already existed in the cache at that exact moment — a tight race
that reliably loses in CI.

The root cause of the original problem (N pods racing over sessions) is actually
safe to handle without a designated reconciler:
- Session CREATION: already guarded by IsAlreadyExists — multiple pods trying
  to create the same session all succeed (or are silently ignored).
- Session GC: all pods compute identical desired session sets from the same
  deterministic policy spec and endpoint ordering, so they delete the same
  stale sessions — concurrent deletes are idempotent (IsNotFound ignored).
- Status patches: conflicts cause a requeue; eventual consistency is fine.
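
The idempotence argument in concrete form, as a sketch; `apierrors` is k8s.io/apimachinery/pkg/api/errors, and `desired`/`stale` stand in for the computed session sets:

```go
// Creation: every pod may attempt the same desired session; losers of the
// race see AlreadyExists, which is not an error here.
if err := r.Create(ctx, desired); err != nil && !apierrors.IsAlreadyExists(err) {
	return ctrl.Result{}, err
}

// GC: every pod computes the same stale set, so concurrent deletes are
// idempotent; NotFound just means another pod got there first.
if err := r.Delete(ctx, &stale); err != nil && !apierrors.IsNotFound(err) {
	return ctrl.Result{}, err
}
```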

Remove isDesignatedReconciler and LocalEndpoint from PeeringPolicyReconciler.
All pods now reconcile policies directly. This eliminates the cache-timing
race that prevented the reconciler from ever updating policy status.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The designated-reconciler pattern was removed by the linter. The multi-writer
approach is safe because all pods compute identical desired session sets.
Add logging at Reconcile entry to diagnose why the function never fires.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The deploy-endpoints task creates endpoints without labels. The
bgp-peering test updates them with labels then immediately creates a
policy. The informer cache may not reflect the label update yet, causing
the policy reconciler to find 0 matching endpoints.

Add a wait step that polls until all 3 endpoints have the expected label.
Also fix log collection to get per-pod tail (avoids GitHub Actions
output truncation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The config reconciler fires frequently and floods the --tail=100 window
with "resolved router ID" messages, hiding policy reconciler logs.
Filter those out and use --tail=500 to capture more history.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…sion watch

Three changes to improve bgp-peering test reliability:

1. ConfigReconciler: Add GenerationChangedPredicate to break the status
   feedback loop. When the config reconciler fails (e.g. endpoint not
   found), it patches BGPConfiguration status which triggers a watch event
   which re-enqueues the reconciler. With no predicate, this floods the
   logs with bgp/config entries and hides policy reconciler activity.
   GenerationChangedPredicate ensures only spec changes (not status
   patches) trigger re-reconciliation via watch events.

2. PeeringPolicyReconciler: Add BGPSession watch via mapSessionToPolicies
   so the policy reconciler fires immediately when a policy-owned session
   is deleted (e.g. by GC after a prior test cleanup). Previously, the
   reconciler would not notice a deleted session until the 30s RequeueAfter
   timer fired. (Items 1 and 2 are sketched below.)

3. bgp-peering test: Replace chainsaw assert with a polling script that
   prints detailed diagnostics (policy yaml, endpoints, pod logs) when
   matchedEndpoints fails to reach 3. Improves failure root cause analysis.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
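
A sketch of the two controller-builder changes from items 1 and 2, using real controller-runtime helpers (`predicate.GenerationChangedPredicate`, `handler.EnqueueRequestsFromMapFunc`) with assumed type and variable names:

```go
// Item 1: only spec changes (generation bumps) re-enqueue the config
// reconciler; its own status patches no longer feed back into the queue.
err := ctrl.NewControllerManagedBy(mgr).
	For(&bgpv1alpha1.BGPConfiguration{},
		builder.WithPredicates(predicate.GenerationChangedPredicate{})).
	Complete(configReconciler)

// Item 2: session events (including deletes) map back to owning policies,
// so the policy reconciler fires immediately rather than waiting out the
// 30s RequeueAfter timer.
err = ctrl.NewControllerManagedBy(mgr).
	For(&bgpv1alpha1.BGPPeeringPolicy{}).
	Watches(&bgpv1alpha1.BGPSession{},
		handler.EnqueueRequestsFromMapFunc(policyReconciler.mapSessionToPolicies)).
	Complete(policyReconciler)
```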
…ssions

The activeSessions status field is written by all DaemonSet pods
concurrently. Replace the static chainsaw assert with a polling script
that directly counts BGPSession resources by policy label.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@scotwells
Contributor Author

Restructured into a stacked PR series for easier review.

@scotwells closed this Mar 17, 2026
@scotwells
Contributor Author

Superseded by a stacked PR series for easier review:

