layer2: Make nodes re-join memberlist cluster #662
Conversation
@rata I've spent a couple of hours on this. I'm not done reviewing yet but am submitting what I have to speed things up. I'll continue the review ASAP.
In the commit message:
This means the node is alive, but not taken into account as a possible speaker, making it pointless to run the MetalLB speaker.
This is worse than that: this is a split brain, and you might have multiple nodes responding to ARP requests, causing the TCP connection to switch nodes and get TCP resets.
The other comments are not blockers.
Yes, true. I didn't go into the details since I linked the issue. I tried to explain, but the commit message was already big and explaining that in more detail while staying clear was complicated, so I just explained that bit and let the link to the issue do the rest :)
No need to go into detail, but I read your message as "one node is unusable", when we should read it as "the full service might be unusable"; kind of day and night.
@liuyuan10 thanks a lot for the review! :)
Thanks for the effort @rata. The functionality looks good to me. I was able to verify correct behavior following a network partition.
I've left some comments, mainly around naming and comments.
I'm fine with having code review, but is there any plan to ship something at some point?
Yes, of course :). I'd like to release in the next few weeks. I'll try to address @johananl's review and merge when it is ready.
As requested by Johannes here: metallb#662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As suggested by Johannes here: metallb#662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As requested by Johannes here: metallb#662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As requested by Johannes here: metallb#662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As suggested by Johannes here: metallb#662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
@johananl PTAL. I think I addressed your last remaining comments. FYI, I amended your GitHub suggestions into the respective commits that introduced them. That is why you won't see separate commits applying them, but they are addressed :)
Signed-off-by: Etienne Champetier <echampetier@anevia.com> Amended-by: Johannes Liebermann <johannes@kinvolk.io> Amended-by: Rodrigo Campos <rodrigo@kinvolk.io>
Currently, nodes join the memberlist cluster only on startup and if they are removed (e.g. small network downtime and they recover) they never join again. This means the node is alive, but not taken into account as a possible speaker, making it pointless to run the MetalLB speaker. Furthermore, it can cause split brain and you might have multiple nodes responding to ARP requests, causing the TCP connection to be reset often and have downtime in practice.

This commit adds a go routine to do memberlist joins in a loop. This was missing in the commit that introduced memberlist (3c5fc63 "Use hashicorp/memberlist to speedup dead node detection") and fixes issue metallb#589.

This routine tries to join nodes in two situations: (a) every 1 minute, (b) when the current k8s node object changes and the SetNode() callback is called.

The case (b) is used when the node loses connectivity for long enough for Kubernetes to mark it in a Not Ready state. When the node is Ready again (e.g. the network is restored for the node), the callback is invoked and the node tries to join the cluster.

The case (a), periodic checks, is needed for downtimes that are smaller than the time Kubernetes takes to mark a node as Not Ready (about 45 sec by default). For example, if a node has a small downtime (~15 seconds), memberlist will remove it from the cluster, but as this is short enough for Kubernetes not to change the node state to Not Ready, the SetNode() callback won't be invoked (as the node didn't change). In these cases, the periodic join every 1 minute will make those nodes join the memberlist cluster again.

The internal Kubernetes watcher already has this logic to notify when the current node changes. In the Layer2 case, though, it was just a "return nil". Now, it queues a message to process a re-join.

This routine tries to join only the nodes that are not currently part of the cluster, as Etienne Champetier advised that the memberlist Join() operation is not a no-op for nodes that are already members and can take a considerable amount of time (especially when only a small group of nodes needs to re-join the cluster).

Furthermore, the IPs to join are stored in the SpeakerList struct so we can recover nodes even if the API server is down. This means we don't need to query the API server and can just use the IPs cached in the struct. This was also very good advice from Etienne.

There is another go routine introduced here to update the IPs to join stored in the struct. This go routine updates them by querying the API server, only every 5 minutes so as not to overload the API server (something very easy to do by mistake). This is in a separate go routine because, as I mention in the code, the current version of client-go we are using doesn't support a context or a timeout, so that query can block. To avoid this blocking other operations, we use a separate go routine. When we update to a newer client-go version, this will of course not be needed and we can just run with a timeout if we want to.

Fixes: metallb#589

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As requested by Johannes here: metallb#662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As requested by Johannes here: metallb#662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
A node should be ignored if it is not present in the map or if it is present and not ready (i.e. present but with a false value). This issue never triggers in practice today, though, because the speakers map (created by the speakerlist method UsableSpeakers()) always contains entries with true values. Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
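For illustration only, here is a minimal sketch of the check that commit message describes; the map and function names are assumptions, not the actual MetalLB identifiers:

```go
// usableNode reports whether a node may be considered as a possible
// speaker: it must be present in the speakers map AND marked ready.
// A missing entry or a false value both mean "ignore this node".
func usableNode(speakers map[string]bool, node string) bool {
	ready, ok := speakers[node]
	return ok && ready
}
```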
As suggested by Johannes here: metallb#662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
That's fine, I never actually use the "commit" UI option. I use the suggestions feature just for the diff display.
LGTM. Let's 🚢 it.
As requested by Johannes here: #662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As requested by Johannes here: #662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As requested by Johannes here: #662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> (cherry picked from commit 45f8058)
As requested by Johannes here: #662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> (cherry picked from commit 79db52b)
As suggested by Johannes here: #662 (comment) Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io> (cherry picked from commit 124f6be)
This PR makes nodes re-join the memberlist cluster after they have left it for some reason (network outage, etc.).
It is based on the ideas from the RFC PR #647, but it also stores the speaker IPs so we can recover even when the API server is down (before, we relied on the API server to give us the IPs, so if it was down we couldn't join members), following @champtar's suggestion on that PR.
It consists of two commits: the first one is cherry-picked from #595 and just moves the memberlist code from `speaker/main.go` to its own package, so it is easier to reason about and abstract. The second one just adds a go routine that will try to re-join the nodes whenever it is needed. The semantic changes in the second commit aim to be simple to follow and reason about, without any major rework. More rework can be done in the future, in another minor release, taking into account the ideas from #595. This is aimed at v0.9.4, therefore the change tries to be as simple and correct as possible, avoiding major semantic changes.
I've tested this locally with kind (with `inv dev-env`) and blocking traffic with iptables. It worked well and as expected in all situations (API server down and recovers, another node down and recovers, etc.). Let me know what you think.
TODO:
I'll c&p the commit message here from the rejoin commit, as it is the core of this PR. To properly understand what I refer to here, please look just at this commit and not the whole diff of the PR:
Currently, nodes join the memberlist cluster only on startup and if they are removed (e.g. small network downtime and they recover) they never join again. This means the node is alive, but not taken into account as a possible speaker, making it pointless to run the MetalLB speaker.
This commit adds a go routine to do memberlist joins in a loop. This was missing in the commit that introduced memberlist (3c5fc63 "Use hashicorp/memberlist to speedup dead node detection") and fixes issue #589.
This routine tries to join nodes in two situations: (a) every 1 minute, (b) when the current k8s node object changes and the SetNode() callback is called.
The case (b) is used when the node loses connectivity for long enough for Kubernetes to mark it in a Not Ready state. When the node is Ready again (e.g. the network is restored for the node), the callback is invoked and the node tries to join the cluster.
The case (a), periodic checks, is needed for downtimes that are smaller than the time Kubernetes takes to mark a node as Not Ready (about 45 sec by default). For example, if a node has a small downtime (~15 seconds), memberlist will remove it from the cluster, but as this is short enough for Kubernetes not to change the node state to Not Ready, the SetNode() callback won't be invoked (as the node didn't change). In these cases, the periodic join every 1 minute will make those nodes join the memberlist cluster again.
The internal Kubernetes watcher already has this logic to notify when the current node changes. In the Layer2 case, though, it was just a "return nil". Now, it queues a message to process a re-join.
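To make the two triggers concrete, here is a rough sketch of how such a join loop could be structured; the `SpeakerList` fields and the `rejoin` channel below are illustrative assumptions, not the actual MetalLB code:

```go
package speakerlist

import "time"

// SpeakerList is a simplified stand-in for the real struct: rejoin is
// signalled by the SetNode() callback, stop ends the loop on shutdown.
type SpeakerList struct {
	rejoin chan struct{}
	stop   chan struct{}
}

// joinLoop re-joins the memberlist cluster (a) every minute and
// (b) whenever a rejoin is requested via the rejoin channel.
func (sl *SpeakerList) joinLoop(join func()) {
	ticker := time.NewTicker(1 * time.Minute)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C: // case (a): catches outages too short for Kubernetes to notice
		case <-sl.rejoin: // case (b): the k8s node object changed (SetNode() was called)
		case <-sl.stop:
			return
		}
		join()
	}
}
```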
This routine tries to join only the nodes that are not currently part of the cluster, as Etienne Champetier advised that the memberlist Join() operation is not a no-op for nodes that are already members and can take a considerable amount of time (especially when only a small group of nodes needs to re-join the cluster).
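As an illustration of the "join only non-members" idea, a hedged sketch of the set difference; the real code may key on node names rather than IPs, so treat these identifiers as assumptions:

```go
package speakerlist

import "github.com/hashicorp/memberlist"

// membersToJoin returns the cached speaker IPs that are not currently
// members of the cluster, so Join() is called only for those.
func membersToJoin(cachedIPs []string, ml *memberlist.Memberlist) []string {
	current := make(map[string]bool)
	for _, m := range ml.Members() {
		current[m.Addr.String()] = true
	}
	var toJoin []string
	for _, ip := range cachedIPs {
		if !current[ip] {
			toJoin = append(toJoin, ip)
		}
	}
	return toJoin
}
```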
Furthermore, the IPs to join are stored in the SpeakerList struct so we can recover nodes even if the API server is down. This means we don't need to query the API server and can just use the IPs cached in the struct. This was also very good advice from Etienne.
There is another go routine introduced here to update the IPs to join stored in the struct. This go routine updates them by querying the API server, only every 5 minutes so as not to overload the API server (something very easy to do by mistake). This is in a separate go routine because, as I mention in the code, the current version of client-go we are using doesn't support a context or a timeout, so that query can block. To avoid this blocking other operations, we use a separate go routine. When we update to a newer client-go version, this will of course not be needed and we can just run with a timeout if we want to.
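A hedged sketch of that refresh goroutine, assuming an older client-go where Pods().List() takes no context (matching the limitation described above); the label selector and the store callback are illustrative, not the real MetalLB code:

```go
package speakerlist

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// refreshJoinIPs periodically re-reads the speaker pod IPs from the API
// server and hands them to store(). It runs in its own goroutine because
// List() cannot be given a timeout with this client-go version and may block.
func refreshJoinIPs(client kubernetes.Interface, namespace, selector string, store func([]string)) {
	for {
		pods, err := client.CoreV1().Pods(namespace).List(metav1.ListOptions{LabelSelector: selector})
		if err == nil {
			ips := make([]string, 0, len(pods.Items))
			for _, p := range pods.Items {
				ips = append(ips, p.Status.PodIP)
			}
			store(ips)
		}
		// Query only every 5 minutes to avoid overloading the API server.
		time.Sleep(5 * time.Minute)
	}
}
```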