-
Notifications
You must be signed in to change notification settings - Fork 1k
Add BFD support #967
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BFD support #967
Conversation
be53371 to
b69f399
Compare
74328ef to
7985cda
Compare
| echo-mode | ||
| {{end -}} | ||
| {{ if .EchoInterval -}} | ||
| echo-interval {{.EchoInterval}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the echo interval of the BFD in frr is configured this way.
In their docs it is splits to:
echo receive-interval <disabled|(10-60000)>
echo transmit-interval (10-60000)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That part changed between 7.5 (which we are pinning now) and latest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I am missing something in the config there should be a matching peer section in the bfd config ? the above configure only the profile ? https://docs.frrouting.org/en/latest/bfd.html?highlight=bfd#configuration
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can associate a neighbor to a bfd profile, which is what happens in https://github.com/metallb/metallb/pull/967/files#diff-1d9a33c302812585b1b2d1897f84126492129d1f9415786a73dcfa26603686c9R51
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do u know how this will work with view bgp do u have one bfd process for every bgp instance or its shared and in that case u pick peers from both ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't neighbour I was asking about what I am asking about is the following
bfd
peer 192.168.0.1
label home-peer
no shutdown
!
!
bgp config will have what u showed
router bgp 65530
neighbor 192.168.0.1 remote-as 65531
neighbor 192.168.0.1 bfd profile blah
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are not configuring bfd peers explicitly because we are leveraging the bgp's bfd option described here https://docs.frrouting.org/en/latest/bfd.html#bgp-bfd-configuration
If you have a bfd profile defined, it's totally fine to add the option to the neighbour and that neighbor will be backed up by a bfd session. So the configuration is more compact without the need of adding the peer explicitly.
This is what we do here 694ed34#diff-1d9a33c302812585b1b2d1897f84126492129d1f9415786a73dcfa26603686c9R51
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do u know how this will work with view bgp do u have one bfd process for every bgp instance or its shared and in that case u pick peers from both ?
I don't know, I guess we'll figure it out if / when we re-enable views.
2f43fd7 to
8777ad7
Compare
| type BFDProfile struct { | ||
| Name string | ||
| ReceiveInterval *uint32 | ||
| TransmitInterval *uint32 | ||
| DetectMultiplier *uint32 | ||
| EchoInterval *uint32 | ||
| EchoMode bool | ||
| PassiveMode bool | ||
| MinimumTTL *uint32 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this declared both here and in internal/config (I assume it's intentional because of configBFDProfileToFRR)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To enforce domain and module separation.
One is the data structure we read from the configuration, the other one is the one we pass to the template. They serve different purpouses and I did not want the config data structures to leak into the frr package.
speaker/bgp_controller.go
Outdated
| if c.bgpType != bgpFrr && len(profiles) > 0 { | ||
| return errors.New("bfd Profiles not supported in native mode") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this necessary? the native mode will return that error anyways?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, it's a leftover of the previous check. Will remove it.
internal/config/config_test.go
Outdated
| `, | ||
| }, | ||
| { | ||
| desc: "Peer with duplicated BFD Profile", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be duplicate BFD Profile name?
| }, | ||
| { | ||
| desc: "BFD Profile with too high range receive interval", | ||
| raw: ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is the same as the one above? + probably you should add a test for the other parameters that are validated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, need to change. I was torn about testing every single parameter, thought it was significant enough (and less boring) to validate the range function against only one of them, but I can add more tests if you think it's necessary.
e2etest/bgp_test.go
Outdated
| var peers []peer | ||
| var peerAddrs []string | ||
| for _, c := range frrContainers { | ||
| for i, c := range frrContainers { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comment about the commit: Filling ... in the external containers but using them as configuration of speakers is misleading. - that's misleading because it wasn't used here correctly 😅
From the containerConfig's (which represents the external FRR container) perspective, configuration for the speaker side should be under NeighborConfig (and not RouterConfig). That's because RouterConfig is how it should configure itself and NeighborConfig is the way the speaker should present itself to the container.
I think we should preserve that and not declare any parameters somewhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we call the structure FRR container, a reader expects to find parameters affecting the container only.
We are currently leveraging some of those to set up the metallb config because they are symmetric (i.e. the password, the ASNs). Some other belong to metallb only and it does not make sense to have them here unless we think about a wider wrapper.
| container.NeighborConfig.BFDEnabled = true | ||
| }) | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to call validateFRRPeeredWithNodes for BFD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I forgot about it
e2etest/bgp_test.go
Outdated
| }, 4*time.Minute, 1*time.Second).Should(BeNil()) | ||
|
|
||
| }, | ||
| table.FEntry("default", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FEntry?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ops
c805231 to
66b63f6
Compare
9de0428 to
374083c
Compare
|
lgtm |
71233de to
ac68518
Compare
| sharpd=no | ||
| pbrd=no | ||
| bfdd=no | ||
| bfdd=yes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to also enable the BFD daemon in dev-env/bgp/frr/daemons? Having it disabled means one would need to manually toggle this setting, restart the external FRR container and resume with the desired BFD configuration. This is only for testing deployments (dev-env) so impact is low to none.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It won't hurt I guess.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Here we enable an optional BFD profile that can be associated with a BGP session. Using BFD in conjunction with BGP allows faster failover detection. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
Filling routerid / holdtime in the external containers but using them as configuration of speakers is misleading. Here we decouple it. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
| EchoMode bool | ||
| PassiveMode bool | ||
| MinimumTTL *uint32 | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shutdown option ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure what the question is, if you are asking why I did not add the shutdown option which says:
Enables or disables the peer. When the peer is disabled an ‘administrative down’ message is sent to the remote peer.
I think that goes against the metallb api. If a user puts a peer in the config, it's because he wants to enable it. If he wants to disable, he'll just remove the peer entry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is shuting down the bfd peer I am assume its used for diagnosticsnothing against metallb api
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I understood the question then. Why would we need to add Shutdown here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
its one of the BFD config option in the FRR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But does it matter if we don't use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
functionality wise probably not but was just curious why u drop it to begin with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not add it because we don't need it, as there's no need to add stuff we won't be using.
| echo-mode | ||
| {{end -}} | ||
| {{ if .EchoInterval -}} | ||
| echo-interval {{.EchoInterval}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do u know how this will work with view bgp do u have one bfd process for every bgp instance or its shared and in that case u pick peers from both ?
| echo-mode | ||
| {{end -}} | ||
| {{ if .EchoInterval -}} | ||
| echo-interval {{.EchoInterval}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this isn't neighbour I was asking about what I am asking about is the following
bfd
peer 192.168.0.1
label home-peer
no shutdown
!
!
bgp config will have what u showed
router bgp 65530
neighbor 192.168.0.1 remote-as 65531
neighbor 192.168.0.1 bfd profile blah
!
| res := &BFDProfile{} | ||
| res.Name = p.Name | ||
| var err error | ||
| res.DetectMultiplier, err = bfdIntFromConfig(p.DetectMultiplier, 2, 255) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to add 4th arg to bfdIntFromConfig to have the default settings if its not specified, pls refer to https://docs.frrouting.org/en/latest/bfd.html#clicmd-bfd to find the appropriate default value for each config options
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't frr apply the default settings itself if not specified?
I fix defaults here, metallb will have to work with a fixed frr version which has 1:1 mapping to those parameters. If we keep it like this, we won't add the parameters that are not specified, making the config rendering more compatible.
| return nil, errors.Wrapf(err, "Failed to parse neighbours %s", res) | ||
| } | ||
| return peers, nil | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so the CI as it stands it make sure BFD sessions are created , how about the failure detection part how does that been verified/tested, do we know how long it takes for BFD enabled session to detect a failure ? any plans to cover this in CI or at least manual testing for now and doc it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Failover tests are in the backlog. We'd need to carefully think about them, because we are still subject to the time it takes ot execute the commands. Maybe setting a ludicrous hold time on the bgp side and to check that the failover happens in a lower time would work.
I'd keep this on the functional side though.
| return k8s.SyncStateErrorNoRetry | ||
| } | ||
|
|
||
| if c.bgpType == bgpNative { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bgpNative ?!? u meant bgpFrr ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope. The check here is, if native check if something frr only is applied
e0a32e6 to
d062214
Compare
oribon
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than these small comments, LGTM
e2etest/bgp_test.go
Outdated
| for _, item := range speakers.Items { | ||
| i := item | ||
| speakerPods = append(speakerPods, &i) | ||
| speakerPods = append(speakerPods, &item) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
accidental change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep fixing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
| } | ||
|
|
||
| for _, c := range frrContainers { | ||
| pairExternalFRRWithNodes(cs, c) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should have this in the loop above like you did here:
Line 452 in d062214
| pairExternalFRRWithNodes(cs, c, func(container *frrcontainer.FRR) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes did not notice while rebasing
e2etest/bgp_test.go
Outdated
| configData := configFile{ | ||
| Pools: []addressPool{ | ||
| { | ||
| Name: "bgp-test", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would be better to rename this something like bfd-test (maybe with -bfd.Name suffix)?
Extend e2e tests to make sure bfd pairing works with the external containers. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
BFD is FRR only so we won't cover it. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
|
LGTM, the layer2 fail from the ci isn't related |
* Enable BGP + BFD support Here we enable an optional BFD profile that can be associated with a BGP session. Using BFD in conjunction with BGP allows faster failover detection. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com> * Decouple the configuration structure of external containers to speakers Filling routerid / holdtime in the external containers but using them as configuration of speakers is misleading. Here we decouple it. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com> * E2ETests: cover BFD Extend e2e tests to make sure bfd pairing works with the external containers. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com> * Skip BFD tests in CI in legacy lanes BFD is FRR only so we won't cover it. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
* Enable BGP + BFD support Here we enable an optional BFD profile that can be associated with a BGP session. Using BFD in conjunction with BGP allows faster failover detection. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com> * Decouple the configuration structure of external containers to speakers Filling routerid / holdtime in the external containers but using them as configuration of speakers is misleading. Here we decouple it. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com> * E2ETests: cover BFD Extend e2e tests to make sure bfd pairing works with the external containers. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com> * Skip BFD tests in CI in legacy lanes BFD is FRR only so we won't cover it. Signed-off-by: Federico Paolinelli <fpaoline@redhat.com>
This PR adds BFD support on top of BGP.
It will be possible to enable BFD for a given BGP peer, following the configuration described in the design proposal: https://github.com/metallb/metallb/blob/main/design/bgp-bfd.md