BGP support for control plane APIs #1281
Conversation
#### Support Procedures
It won't be possible to upgrade a cluster from Keepalived/HAproxy to BGP for now.
Why can't you change between the two?
I guess this is technically feasible; maybe I was being too strict in the initial writing, and upgrades could be supported from traditional to BGP networking.
For now, I changed the wording to say that this is not possible now but might be revisited in the future. I'm convinced that an upgrade would bring a certain level of complexity that is not required at the moment.
Creates a new interface to be able to choose which load-balancer to deploy in the Control Plane. For on-prem, this will allow to not deploy Keepalived/HAproxy but use BGP (with FRR) in MCO. This API is required so:
* we expose load-balancer options from the install-config
* we add these options in the PlatformSpec, so it can be used in the cluster, and changed on day 2.

enhancement: openshift/enhancements#1281
feature: OSASINFRA-2999

Co-authored-by: Matthew Booth <mbooth@redhat.com>
Co-authored-by: Pierre Prinetti <pierreprinetti@redhat.com>
Co-authored-by: Emilien Macchi <emilien@redhat.com>
Is there an option where part of the cluster won't have BGP routing and VIPs are not expected to run there (remote worker nodes, for example)? How will that be addressed?
@tsorya is working on https://issues.redhat.com/browse/OPNET-133 - let's see if there is some common infra / logic that can be reused.
For Ingress, yes, we'll have to use node selectors to choose where we want the Ingress router to run, and we'll need to address these two use cases. I'll make sure this is mentioned in the enhancement.
I agree (and sorry, I haven't gone through the proposal yet). MetalLB will run its own FRR instance, so we might hit issues and should be really careful about exposing different (and non-default) ports for the two FRR instances. The MetalLB Operator comes with a node selector that allows picking which nodes to run MetalLB on, which may come in handy. So "don't expose Ingress from where MetalLB is running" seems a sufficient approximation for now. One thing that I have in mind (but hasn't been discussed anywhere) is to have a way for MetalLB to attach to and configure a portion of an already running FRR instance instead of shipping its own.
That sounds interesting. Do you think that's something we could target for 4.13 (TP)? Would you mind sharing your design idea? Do you think this could be added later eventually? I'm trying to move this enhancement forward, as we have customers waiting for this now.
Not really; it is a MetalLB evolution that I've been thinking about lately, and it needs to be shared and discussed both internally and externally. I'm not sure it is even going to be accepted, or whether it works.
For the time being, I'd just say that we can't overlap with MetalLB because the other thing is still quite blurry.
```go
// The current default configuration uses VRRP.
//
// +optional
APILoadBalancer OpenStackAPILoadBalancer `json:"apiLoadBalancer"`
```
Last-minute doubt: should this be called `APIRouting` instead?
`ControlPlaneRouting`, maybe? But again, we need to drop "API" here, since it's not only for the API.
different Failure Domains (stretched cluster) and where load balancing of API and Ingress operates at Layer 3.

The minimal requirements include:
- Highly available and dynamically routable load-balancing API access for internal clients.
Are internal clients using this endpoint? They're most likely using `kubernetes.default.svc`.
I believe the requirement here is for external clients. For internal clients we might need CNI configuration.
I changed the wording to "external". What do you mean by CNI configuration? Could you explain a bit more? Thanks.
`kubernetes.default.svc` will be using a ClusterIP provided by OVNKubernetes. If we wanted to, e.g., ensure pods talk to their local kube-apiserver by default to reduce east-west traffic, I think we'd need to teach OVNKubernetes how to do that.
Firing from the hip here, though, so delighted to be educated!
In conjunction with the work that has been (and is being) done on Failure Domains, we will propose a new [API](https://github.com/openshift/api/pull/1321) to configure Load Balancers, so we will be able to:
- Change the load-balancer type (default remains VRRP), to be able to enable BGP.
Classic bikeshed: I know `VRRP` was what I originally called this, but I wonder if we should instead call it something more generic like `L2`. VRRP is an implementation detail which we don't expose to the user. For example, MetalLB and kube-vip don't use VRRP for their L2 load balancing, and we could currently switch our implementation to one of those without requiring the user to make any configuration changes.
`BGP`, on the other hand, is directly exposed to the user: they have to provide their BGP configuration. We wouldn't call it `FRR`, for the same reason I'm questioning `VRRP`.
Good idea.
@pierreprinetti FYI, we might want to adjust the API PR.
```bash
while :; do
  if check_health "api"; then
    echo "Healthcheck for API OK"
    for vip in $API_VIPS; do
      create_vip $vip
      create_vip_routing $vip
    done
  else
    echo "Healthcheck for API not OK"
    for vip in $API_VIPS; do
      delete_vip $vip
    done
  fi

  if check_health "ingress"; then
    echo "Healthcheck for Ingress OK"
    for vip in $INGRESS_VIPS; do
      create_vip $vip
      create_vip_routing $vip
    done
  else
    echo "Healthcheck for Ingress not OK"
    for vip in $INGRESS_VIPS; do
      delete_vip $vip
    done
  fi
  sleep 2
done
```
Not clear what `create_vip` and `create_vip_routing` mean here. Also not convinced that the inclusion of Ingress is correct.
From the PoC:

```bash
create_vip_routing()
{
    # prepare the host to route the VIP traffic through the gateway
    # takes one argument: the VIP to route.
    local vip=$1
    if ! ip rule show table 168 | grep -q $vip; then
        echo `date "+%FT%T"` "Create ip rule for $NETWORK_CIDR in table 168"
        ip rule add from $vip to $NETWORK_CIDR lookup 168
    fi
    if ! ip route show table 168 | grep -q $GATEWAY; then
        echo `date "+%FT%T"` "Create default route for table 168 via $GATEWAY"
        ip route add default via $GATEWAY table 168
    fi
}

create_vip()
{
    # create the VIP on the node.
    # takes one argument: the VIP to create.
    local vip=$1
    if ! ip address show dev lo | grep -q $vip; then
        echo `date "+%FT%T"` "Creating VIP $vip"
        ip address add $vip dev lo
    fi
}
```

Also, why not add Ingress? If that's because you suggest using MetalLB for it, I already responded to that comment elsewhere.
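The monitoring loop above also calls `delete_vip`, which isn't shown in this excerpt. A minimal sketch consistent with `create_vip` (my own guess at the shape, not code from the PoC) could be:

```shell
#!/usr/bin/env bash
# Hypothetical counterpart to create_vip; not taken from the PoC itself.
delete_vip()
{
    # remove the VIP from the node.
    # takes one argument: the VIP to delete.
    local vip=$1
    if ip address show dev lo | grep -q "$vip"; then
        echo `date "+%FT%T"` "Deleting VIP $vip"
        ip address del "$vip" dev lo
    fi
}
```

Like `create_vip`, it is idempotent: the `grep` guard makes it a no-op when the address is not present on `lo`.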
##### Health check

When checking the health of a service (API or Ingress), we'll run a `curl` against the local service and check the response code.
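The enhancement doesn't spell out `check_health`; a minimal sketch of what it could look like, assuming illustrative local endpoints (`/readyz` on 6443 for the API, the router's health port 1936 for Ingress; both are assumptions, not from the PoC), is:

```shell
#!/usr/bin/env bash
# Hypothetical check_health helper; URLs and ports are assumptions.
check_health()
{
    # takes one argument: "api" or "ingress".
    # returns 0 when the local service answers with HTTP 200.
    local service=$1 url code
    case "$service" in
        api)     url="https://localhost:6443/readyz" ;;
        ingress) url="http://localhost:1936/healthz/ready" ;;
        *)       return 1 ;;
    esac
    # -k because the local API endpoint may use a self-signed certificate;
    # -m 5 bounds the probe so the monitoring loop keeps its cadence.
    code=$(curl -k -s -m 5 -o /dev/null -w '%{http_code}' "$url")
    [ "$code" = "200" ]
}
```

Returning an exit code (rather than printing) lets the loop above use it directly in `if check_health "api"; then ...`.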
Ideally users will want to ensure that control plane membership changes like this have a minimal effect on existing connections. Naively this could mean that if an API node goes out, rehashing by the upstream router could result in connection resets on almost all existing connections. Can something like consistent hashing be useful with only 3 nodes? Either way, this is likely an important documentation issue.
I'm not sure what to say about this in the RFE, so help would be welcome. I understand what you're saying, but I guess we also have the same issue with Keepalived today, so I'm trying to see what needs to be said in this RFE that matters specifically for BGP.
cc: @rvanderp3 @bostrt
### Non-Goals

The proposed BGP-based load-balancing requires that an existing BGP infrastructure is already in place when deploying. For this reason, the default load-balancing solution will remain Keepalived/HAproxy.
Also, while this enhancement will be implemented in a way that allows all on-premise platforms to use it, right now we haven't planned the work to implement and test it for platforms other than OpenStack.
This statement doesn't match the API.
ack. Thanks for your review on the API PR, I'll take a look asap.
My first pass at this. I think the big thing is figuring out how bootstrap is going to work. Ideally we don't want traffic going to the cluster apiservers until bootstrapping is completed. I'm pretty sure it would sometimes wedge deployments if we let the keepalived vip move too early.
The assets needed for the implementation of these capabilities are rendered by the [MCO](https://github.com/openshift/machine-config-operator), and [baremetal-runtimecfg](https://github.com/openshift/baremetal-runtimecfg) is used significantly within the manifests and templates of the machine-config-operator project.
There are also changes in the API and installer projects, which are further described in sections below.
Since this is a replacement for keepalived, I would assume we'll use the same VIP fields and only the backend implementation will change.
I'm also curious about the nodeip-configuration change. I don't see anything proposed and if I knew about this I've forgotten. :-)
- *controlPlaneLoadBalancer* will be a new parameter containing all BGP configurations for the control plane.
- *type* - VRRP or BGP (default FRR for backward compatibility.
I'm confused by this. Would we ever use VRRP here? Also, the default is FRR, which is not an option?
Big typo on my side, sorry. No, we wouldn't use VRRP here I guess, and the default will never be BGP anyway.
- *controlPlaneLoadBalancer* will be a new parameter containing all BGP configurations for the control plane.
- *type* - VRRP or BGP (default FRR for backward compatibility.
- *speakers* - provides a list of BGP speakers for each subnet that the control plane will be connected to.
- *subnetId* - platform specific parameter where each BGP speaker is attached to a specific subnet. Another alternative to make this platform agnostic is to use the subnet CIDR instead.
Since this is in a platform-specific config structure it's probably okay to use the ID, although using a CIDR would make it easier to document and implement on the installer side. I suppose some platform-specific logic would be needed in MCO to look up the ID though.
Yes, I'm working on a PoC for the CIDR. I'll update the enhancement next week.
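For illustration, the parameters described above might appear in an install-config fragment roughly like this. The key names follow the bullet list above, but the overall shape, the `peers` list, and its fields are invented here and may differ from the final API:

```yaml
# Hypothetical install-config.yaml fragment; not the final schema.
platform:
  openstack:
    controlPlaneLoadBalancer:
      type: BGP
      speakers:
        # one speaker per subnet the control plane is connected to;
        # a subnetCIDR key is the platform-agnostic alternative discussed above
        - subnetId: <subnet-uuid>
          peers:            # hypothetical shape
            - address: 192.0.2.1
              asn: 64513
```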
This paragraph is WIP.

How does the bootstrap node know its configuration?
I suppose there will need to be a bootstrap config section in install-config. This can be applied via the manifests folder in MCO like it is for keepalived. It can probably be simplified since it won't be long-lived and just needs to stand up a functional IP for the duration of bootstrap.
The question that remains for me is how do we signal when the VIP should move to the masters. Right now that is controlled by the bootstrap node, which stops its keepalived when bootstrapping is completed. I'm not super familiar with BGP, but if there's some way to give the bootstrap higher priority while it's active, that would replicate what we do with keepalived. You'd need to stop FRR after bootstrapping is complete, but that logic already exists for keepalived so it shouldn't be a problem.
> I suppose there will need to be a bootstrap config section in install-config. (...)

I'm trying to avoid this, and in my initial PoC I was able to. I'll try to make it so the bootstrap can find this info as soon as the local API is available to provide the Infra object. Again, I might be wrong and this won't work, but I'm currently prototyping it.

> The question that remains for me is how do we signal when the VIP should move to the masters.

In my initial PoC it worked the same way: Keepalived is stopped by the handleBootstrapStopKeepalived function: https://github.com/openshift/baremetal-runtimecfg/blob/master/pkg/monitor/dynkeepalived.go#L142-L184
When I tried, it worked fine, but again I'm redoing the PoC with a new API and using subnetCIDR. I'll keep this comment open for now and give an update soon.
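On the "higher priority while the bootstrap is active" idea: one way to express that in BGP, sketched here under assumptions (FRR, a private ASN 64512, an invented peer and VIP; none of this is from the PoC), is to have the masters prepend their AS path so upstream routers prefer the bootstrap's shorter path until FRR is stopped on the bootstrap:

```
! Hypothetical frr.conf fragment for a master node (ASN 64512 assumed).
! The bootstrap would advertise the same VIP without prepending, so its
! shorter AS path wins until its FRR instance is stopped.
router bgp 64512
 neighbor 192.0.2.1 remote-as 64513
 address-family ipv4 unicast
  network 192.0.2.100/32
  neighbor 192.0.2.1 route-map MASTERS-OUT out
 exit-address-family
!
route-map MASTERS-OUT permit 10
 set as-path prepend 64512 64512
```

Stopping FRR on the bootstrap then withdraws its route, and the masters' (prepended) advertisements take over, mirroring the keepalived hand-off.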
A critical difference between this solution and keepalived (and external load balancers) is that keepalived's configuration is static on disk, whereas with BGP we update the configuration by rendering a new `frr.conf` using values ultimately sourced from the API server. This means that
Can we not do the same thing as with keepalived (which also renders its config based on values retrieved from the API, for the record)? The config file is stored on the host filesystem so that if the node goes down, when it comes back up the previous config is still there and keepalived is able to stand back up based on the previous state.
The config file is stored on the filesystem, but I'll double check.
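For the record, keepalived's config in this stack is rendered from templates by baremetal-runtimecfg, and a BGP implementation could plausibly do the same for `frr.conf`, persisting the rendered file on the host so a rebooting node can come back with its last-known config. A hypothetical Go-template fragment (field names invented, not the actual runtimecfg schema):

```
router bgp {{ .LocalASN }}
{{- range .Peers }}
 neighbor {{ .Address }} remote-as {{ .ASN }}
{{- end }}
 address-family ipv4 unicast
{{- range .VIPs }}
  network {{ . }}/32
{{- end }}
 exit-address-family
```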
This is an enhancement that supports OSASINFRA-2999.
@EmilienM: all tests passed!
Inactive enhancement proposals go stale after 28d of inactivity. See https://github.com/openshift/enhancements#life-cycle for details.
/remove-lifecycle stale
The OpenStack team has decided that for now we'll work on External Load-Balancer support, which will help to meet some of our needs. /close
@EmilienM: Closed this PR.