BGP support for control plane APIs #1281
Conversation
#### Support Procedures
It won't be possible to upgrade a cluster from Keepalived/HAproxy to BGP for now.
Why can't you change between the two?
I guess this is technically feasible; maybe I was being too strict in the initial writing, and upgrades could be supported from traditional to BGP networking.
For now, I changed the wording to say that this is not possible now but might be revisited in the future. I'm convinced that an upgrade would bring a certain level of complexity that is not required at the moment.
Creates a new interface to be able to choose which load-balancer to deploy in the Control Plane. For on-prem, this will allow to not deploy Keepalived/HAproxy but use BGP (with FRR) in MCO. This API is required so:
* we expose load-balancer options from the install-config
* we add these options in the PlatformSpec, so it can be used in the cluster, and changed on day 2.

enhancement: openshift/enhancements#1281
feature: OSASINFRA-2999

Co-authored-by: Matthew Booth <mbooth@redhat.com>
Co-authored-by: Pierre Prinetti <pierreprinetti@redhat.com>
Co-authored-by: Emilien Macchi <emilien@redhat.com>
Is there an option where part of the cluster won't have BGP routing and VIPs are not expected to run there (remote worker nodes, for example)? How will that be addressed?
@tsorya is working on https://issues.redhat.com/browse/OPNET-133 - let's see if there is some common infra / logic that can be reused.
For Ingress, yes, we'll have to use node selectors to choose where we want the Ingress router to run, and we'll need to address these two use cases. I'll make sure this is mentioned in the enhancement.
I agree (and sorry, I haven't gone through the proposal yet). MetalLB will run its own FRR instance, so we might hit issues and should be really careful about exposing different (and non-default) ports for the two FRR instances. The MetalLB Operator comes with a node selector that allows picking which nodes to run MetalLB on, which may come in handy. So "don't expose Ingress from where MetalLB is running" seems a sufficient approximation for now. One thing that I have in mind (but hasn't been discussed anywhere) is to have a way for MetalLB to attach to and configure a portion of an already running FRR instance instead of shipping its own.
That sounds interesting. Do you think that's something we could target for 4.13 (TP)? Would you mind sharing your design idea? Do you think this could be added later eventually? I'm trying to move this enhancement forward, as we have customers waiting for this now.
Not really; it is a MetalLB evolution that I've been thinking about lately, and it needs to be shared and discussed both internally and externally. I'm not sure it is even going to be accepted, or whether it works.
For the time being, I'd just say that we can't overlap with MetalLB because the other thing is still quite blurry.
```go
// The current default configuration uses VRRP.
//
// +optional
APILoadBalancer OpenStackAPILoadBalancer `json:"apiLoadBalancer"`
```
Last-minute doubt: should this be called `APIRouting` instead?
`ControlPlaneRouting`, maybe? But again, we need to drop "API" here, since it's not only for the API.
different Failure Domains (stretched cluster) and where load balancing of API and Ingress operates at Layer 3.

The minimal requirements include:
- Highly available and dynamically routable load-balancing API access for internal clients.
Are internal clients using this endpoint? They're most likely using `kubernetes.default.svc`.
I believe the requirement here is for external clients. For internal clients we might need CNI configuration.
I changed the wording to "external". What do you mean by CNI configuration? Could you explain a bit more? Thanks.
`kubernetes.default.svc` will be using a ClusterIP provided by OVNKubernetes. If we wanted to, e.g., ensure pods talk to their local kube-apiserver by default to reduce east-west traffic, I think we'd need to teach OVNKubernetes how to do that.
Firing from the hip here, though, so delighted to be educated!
In conjunction with the work that has been (and is being) done on Failure Domains, we will propose a new [API](https://github.com/openshift/api/pull/1321) to configure Load Balancers, so we will be able to:
- Change the load-balancer type (default remains VRRP), to be able to enable BGP.
Classic bikeshed: I know `VRRP` was what I originally called this, but I wonder if we should instead call it something more generic like `L2`. VRRP is an implementation detail which we don't expose to the user. For example, MetalLB and kube-vip don't use VRRP for their L2 load balancing, and we could currently switch our implementation to one of those without requiring the user to make any configuration changes.
`BGP`, on the other hand, is directly exposed to the user: they have to provide their BGP configuration. We wouldn't call it `FRR`, for the same reason I'm questioning `VRRP`.
Good idea.
@pierreprinetti FYI, we might want to adjust the API PR.
```bash
while :; do
  if check_health "api"; then
    echo "Healthcheck for API OK"
    for vip in $API_VIPS; do
      create_vip $vip
      create_vip_routing $vip
    done
  else
    echo "Healthcheck for API not OK"
    for vip in $API_VIPS; do
      delete_vip $vip
    done
  fi

  if check_health "ingress"; then
    echo "Healthcheck for Ingress OK"
    for vip in $INGRESS_VIPS; do
      create_vip $vip
      create_vip_routing $vip
    done
  else
    echo "Healthcheck for Ingress not OK"
    for vip in $INGRESS_VIPS; do
      delete_vip $vip
    done
  fi
  sleep 2
done
```
Not clear what `create_vip` and `create_vip_routing` mean here. Also not convinced that the inclusion of Ingress is correct.
From the PoC:

```bash
create_vip_routing()
{
    # prepare the host to route the VIP traffic through the gateway
    # takes one argument: the VIP to route.
    local vip=$1
    if ! ip rule show table 168 | grep -q $vip; then
        echo `date "+%FT%T"` "Create ip rule for $NETWORK_CIDR in table 168"
        ip rule add from $vip to $NETWORK_CIDR lookup 168
    fi
    if ! ip route show table 168 | grep -q $GATEWAY; then
        echo `date "+%FT%T"` "Create default route for table 168 via $GATEWAY"
        ip route add default via $GATEWAY table 168
    fi
}

create_vip()
{
    # create the VIP on the node.
    # takes one argument: the VIP to create.
    local vip=$1
    if ! ip address show dev lo | grep -q $vip; then
        echo `date "+%FT%T"` "Creating VIP $vip"
        ip address add $vip dev lo
    fi
}
```

Also, why not add Ingress? If that's because you suggest using MetalLB for it, I already responded to that comment elsewhere.
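The monitoring loop above also calls `delete_vip`, which isn't shown in this excerpt. A minimal sketch consistent with `create_vip` (my own guess at the shape, not code from the PoC) could be:

```shell
#!/usr/bin/env bash
# Hypothetical counterpart to create_vip; not taken from the PoC itself.
delete_vip()
{
    # remove the VIP from the node.
    # takes one argument: the VIP to delete.
    local vip=$1
    if ip address show dev lo | grep -q "$vip"; then
        echo `date "+%FT%T"` "Deleting VIP $vip"
        ip address del "$vip" dev lo
    fi
}
```

Like `create_vip`, it is idempotent: the `grep` guard makes it a no-op when the address is not present on `lo`.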
##### Health check

When checking the health of a service (API or Ingress), we'll run a `curl` against the local service and check the response code.
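The enhancement doesn't spell out `check_health`; a minimal sketch of what it could look like, assuming illustrative local endpoints (`/readyz` on 6443 for the API, the router's health port 1936 for Ingress; both are assumptions, not from the PoC), is:

```shell
#!/usr/bin/env bash
# Hypothetical check_health helper; URLs and ports are assumptions.
check_health()
{
    # takes one argument: "api" or "ingress".
    # returns 0 when the local service answers with HTTP 200.
    local service=$1 url code
    case "$service" in
        api)     url="https://localhost:6443/readyz" ;;
        ingress) url="http://localhost:1936/healthz/ready" ;;
        *)       return 1 ;;
    esac
    # -k because the local API endpoint may use a self-signed certificate;
    # -m 5 bounds the probe so the monitoring loop keeps its cadence.
    code=$(curl -k -s -m 5 -o /dev/null -w '%{http_code}' "$url")
    [ "$code" = "200" ]
}
```

Returning an exit code (rather than printing) lets the loop above use it directly in `if check_health "api"; then ...`.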
Ideally users will want to ensure that control plane membership changes like this have a minimal effect on existing connections. Naively this could mean that if an API node goes out, rehashing by the upstream router could result in connection resets on almost all existing connections. Can something like consistent hashing be useful with only 3 nodes? Either way, this is likely an important documentation issue.
I'm not sure what to say about this in the RFE, so help would be welcome. I understand what you're saying, but I guess we also have the same issue with Keepalived today, so I'm trying to see what needs to be said in this RFE that matters specifically for BGP.
cc: @rvanderp3 @bostrt
### Non-Goals

The proposed BGP-based load-balancing requires that an existing BGP infrastructure is already in place when deploying. For this reason, the default load-balancing solution will remain Keepalived/HAproxy.
Also, while this enhancement will be implemented in a way that allows all on-premise platforms to use it, right now we haven't planned the work to implement and test it for platforms other than OpenStack.
This statement doesn't match the API.
ack. Thanks for your review on the API PR, I'll take a look asap.
My first pass at this. I think the big thing is figuring out how bootstrap is going to work. Ideally we don't want traffic going to the cluster apiservers until bootstrapping is completed. I'm pretty sure it would sometimes wedge deployments if we let the keepalived vip move too early.
The assets needed for the implementation of these capabilities are rendered by the [MCO](https://github.com/openshift/machine-config-operator), and [baremetal-runtimecfg](https://github.com/openshift/baremetal-runtimecfg) is used significantly within the manifests and templates of the machine-config-operator project.
There are also changes in the API and installer projects, which are further described in sections below.
Since this is a replacement for keepalived, I would assume we'll use the same VIP fields and only the backend implementation will change.
I'm also curious about the nodeip-configuration change. I don't see anything proposed and if I knew about this I've forgotten. :-)
- *controlPlaneLoadBalancer* will be a new parameter containing all BGP configurations for the control plane.
- *type* - VRRP or BGP (default FRR for backward compatibility.
I'm confused by this. Would we ever use VRRP here? Also, the default is FRR, which is not an option?
Big typo on my side, sorry. No, we wouldn't use VRRP here I guess, and the default will never be BGP anyway.
- *controlPlaneLoadBalancer* will be a new parameter containing all BGP configurations for the control plane.
- *type* - VRRP or BGP (default FRR for backward compatibility.
- *speakers* - provides a list of BGP speakers for each subnet that the control plane will be connected to.
- *subnetId* - platform specific parameter where each BGP speaker is attached to a specific subnet. Another alternative to make this platform agnostic is to use the subnet CIDR instead.
Since this is in a platform-specific config structure it's probably okay to use the ID, although using a CIDR would make it easier to document and implement on the installer side. I suppose some platform-specific logic would be needed in MCO to look up the ID though.
Yes, I'm working on a PoC for the CIDR. I'll update the enhancement next week.
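For illustration, the parameters described above might appear in an install-config fragment roughly like this. The key names follow the bullet list above, but the overall shape, the `peers` list, and its fields are invented here and may differ from the final API:

```yaml
# Hypothetical install-config.yaml fragment; not the final schema.
platform:
  openstack:
    controlPlaneLoadBalancer:
      type: BGP
      speakers:
        # one speaker per subnet the control plane is connected to;
        # a subnetCIDR key is the platform-agnostic alternative discussed above
        - subnetId: <subnet-uuid>
          peers:            # hypothetical shape
            - address: 192.0.2.1
              asn: 64513
```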
This paragraph is WIP.

How does the bootstrap node know its configuration?
I suppose there will need to be a bootstrap config section in install-config. This can be applied via the manifests folder in MCO like it is for keepalived. It can probably be simplified since it won't be long-lived and just needs to stand up a functional IP for the duration of bootstrap.
The question that remains for me is how do we signal when the VIP should move to the masters. Right now that is controlled by the bootstrap node, which stops its keepalived when bootstrapping is completed. I'm not super familiar with BGP, but if there's some way to give the bootstrap higher priority while it's active, that would replicate what we do with keepalived. You'd need to stop FRR after bootstrapping is complete, but that logic already exists for keepalived so it shouldn't be a problem.
> I suppose there will need to be a bootstrap config section in install-config. (...)

I'm trying to avoid this, and in my initial PoC I was able to. I'll try to make it so the bootstrap can find this info as soon as the local API is available to provide the Infra object. Again, I might be wrong and this won't work, but I'm currently prototyping it.

> The question that remains for me is how do we signal when the VIP should move to the masters.

In my initial PoC it worked the same way: Keepalived is stopped by the handleBootstrapStopKeepalived function: https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/monitor/dynkeepalived.go#L142-L184
When I tried, it worked fine, but again I'm redoing the PoC with a new API and using subnetCIDR. I'll keep this comment open for now and give an update soon.
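On the "higher priority while the bootstrap is active" idea: one way to express that in BGP, sketched here under assumptions (FRR, a private ASN 64512, an invented peer and VIP; none of this is from the PoC), is to have the masters prepend their AS path so upstream routers prefer the bootstrap's shorter path until FRR is stopped on the bootstrap:

```
! Hypothetical frr.conf fragment for a master node (ASN 64512 assumed).
! The bootstrap would advertise the same VIP without prepending, so its
! shorter AS path wins until its FRR instance is stopped.
router bgp 64512
 neighbor 192.0.2.1 remote-as 64513
 address-family ipv4 unicast
  network 192.0.2.100/32
  neighbor 192.0.2.1 route-map MASTERS-OUT out
 exit-address-family
!
route-map MASTERS-OUT permit 10
 set as-path prepend 64512 64512
```

Stopping FRR on the bootstrap then withdraws its route, and the masters' (prepended) advertisements take over, mirroring the keepalived hand-off.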
A critical difference between this solution and keepalived (and external load balancers) is that keepalived's configuration is static on disk, whereas with BGP we update the configuration by rendering a new `frr.conf` using values ultimately sourced from the API server. This means that
Can we not do the same thing as with keepalived (which also renders its config based on values retrieved from the API, for the record)? The config file is stored on the host filesystem so that if the node goes down, when it comes back up the previous config is still there and keepalived is able to stand back up based on the previous state.
The config file is stored on the filesystem, but I'll double check.
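For the record, keepalived's config in this stack is rendered from templates by baremetal-runtimecfg, and a BGP implementation could plausibly do the same for `frr.conf`, persisting the rendered file on the host so a rebooting node can come back with its last-known config. A hypothetical Go-template fragment (field names invented, not the actual runtimecfg schema):

```
router bgp {{ .LocalASN }}
{{- range .Peers }}
 neighbor {{ .Address }} remote-as {{ .ASN }}
{{- end }}
 address-family ipv4 unicast
{{- range .VIPs }}
  network {{ . }}/32
{{- end }}
 exit-address-family
```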
This is an enhancement that supports OSASINFRA-2999.
@EmilienM: all tests passed!
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details.
/remove-lifecycle stale
The OpenStack team has decided that for now we'll work on External Load-Balancer support, which will help to meet some of our needs. /close
@EmilienM: Closed this PR.