Skip to content

layer2: Make nodes re-join memberlist cluster #662

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 6 commits into from
Oct 14, 2020

Conversation

rata
Copy link
Contributor

@rata rata commented Jul 9, 2020

This PR makes the nodes join the memberlist cluster when they left for some reason (network outage, etc.).

It is based on the ideas from the RFC PR #647 but storing the Speaker IPs to also be able to recover when the API server is down (before we relied on the API server to gives us the APIs, so if it was down we couldn't join members), based on @champtar suggestion on that PR.

It consists of two commits: the first one is cherry-picked from #595 and just moves the memberlist code from speaker/main.go to it's own package, so it is easier to reason and abstract. The second one just adds a go routine that will try to re-join the nodes whenever it is needed.

The semantic changes in the second commit aim to be simple to follow and reason about, without any major rework. More reworks can be done in the future, in another minor release, taking into accounts the ideas from #595. This is aimed at v0.9.4, therefore the change tries to be as simple and correct as possible, avoiding major semantic changes.

I've tested this locally with kind (with inv dev-env) and blocking traffic with iptables. It worked well and as expected on all situations (API server down and recovers, one other node down and recovers, etc.).

Let me know what you think.

TODO:

  • Add tests. Memberlist is quite untested currently.

I'll c&p the commit message here from the rejoin commit, as it is the core of this PR. To properly understand what I refer to here, please look just at this commit and not the whole diff of the PR:

Currently, nodes join the memberlist cluster only on startup and if they are removed (e.g. small network downtime and they recover) they never join again. This means the node is alive, but not taken into account as a possible speaker, making it pointless to run the MetalLB speaker.

This commit adds a go routine to do memberlist joins in a loop. This was missing in the commit that introduced memberlist (3c5fc63 "Use hashicorp/memberlist to speedup dead node detection") and fixes issue #589.

This routine tries to join nodes in two situations: (a) every 1 minute, (b) when the current k8s node object changes and the SetNode() callback is called.

The case (b) is used when the node loses connectivity for long enough for Kubernetes to mark it in a Not Ready state. When the node is Ready again (e.g network is restored for the node), the callback is invoked and the node tries to join the cluster.

The case (a), periodic checks, is needed for downtimes that are smaller than the time Kubernetes takes to mark a node as Not Ready (about 45 sec by default). For example, if a node has a small downtime (~15 seconds) memberlist will remove it from the cluster but as this is short enough for Kubernetes to not change the node state to Not Ready, the SetNode() callback won't be invoked (as the node didn't change). In these cases, the periodic join every 1 minute will make those node join the
memberlist cluster again.

The internal kubernetes watcher already has this logic to notify when the current node changes. In the Layer2 case, though, it was just a "return nil". Now, it queues the message to process a re-join.

This routine tries to join only the nodes that are not currently part of the cluster, as it was advised by Etienne Champetier that the memberlist Join() operation is not a no-op for nodes that are already members and can take a considerable amount of time (specially when only a small group of nodes need to re-join the cluster).

Furthermore, the IPs to join are stored in the SpeakerList struct to be able to recover nodes even if the API server is down. This means we don't need to query the API server and just use the IPs cached in the struct. This was also very good advise from Etienne.

There is another go routine introduced here to update the IPs to join stored in the struct. This go routine updates them by quering the API server, every 5 minutes only to not overload the API server (something very easy to do by mistake). This is in a separate go routine because, as I mention in the code, the current version of client-go we are using doesn't support a context or a timeout so that query can block. To avoid this blocking other operations, we use a separate go routine. When we update to new client-go versions, this will of course not be needed and we can just run with a timeout if we want to.

Copy link
Member

@johananl johananl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rata I've spent a couple of hours on this. I'm not done reviewing yet but am submitting what I have to speed things up. I'll continue the review ASAP.

Copy link
Contributor

@champtar champtar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the commit message:
This means the node is alive, but not taken into account as a possible speaker, making it pointless to run the MetalLB speaker.
This is worse than that, this is a split brain and you might have multiple nodes responding to ARP requests causing the tcp connection to switch nodes and get tcp reset

The other comments are not blockers

@champtar
Copy link
Contributor

champtar commented Jul 9, 2020

@johananl that is why I wanted to review each commit separately, I can't address any of your reviews on my commit
@rata I'll let you fix my commit

@rata
Copy link
Contributor Author

rata commented Jul 9, 2020

@champtar

This is worse than that, this is a split brain and you might have multiple nodes responding to ARP requests causing the tcp connection to switch nodes and get tcp reset

Yes, true. Didn't go into the details as I linked the issue. Tried to explain but the commit message was already big and explaining that in more detail was complicated while being clear, so I just explained that bit and let the link to the issue do the rest :)

@champtar
Copy link
Contributor

champtar commented Jul 9, 2020

@champtar

This is worse than that, this is a split brain and you might have multiple nodes responding to ARP requests causing the tcp connection to switch nodes and get tcp reset

Yes, true. Didn't go into the details as I linked the issue. Tried to explain but the commit message was already big and explaining that in more detail was complicated while being clear, so I just explained that bit and let the link to the issue do the rest :)

No need to go into detail, but I read your message as 1 node is unusable, when we should read the full service might be unusable, kind of day and night

@johananl johananl force-pushed the rata/memberlist-rejoin branch 3 times, most recently from f232830 to 408cccd Compare July 10, 2020 15:55
@rata
Copy link
Contributor Author

rata commented Jul 13, 2020

@liuyuan10 thanks a lot for the review! :)

@russellb russellb added this to the v0.9.4 milestone Aug 5, 2020
@rata rata force-pushed the rata/memberlist-rejoin branch 2 times, most recently from 25362bc to 69e12ee Compare August 13, 2020 15:00
@johananl johananl force-pushed the rata/memberlist-rejoin branch from 69e12ee to 408cccd Compare August 14, 2020 14:16
@rata rata force-pushed the rata/memberlist-rejoin branch 2 times, most recently from 74dcf2b to 5c1e8eb Compare August 14, 2020 16:36
@rata
Copy link
Contributor Author

rata commented Aug 14, 2020

@johananl @champtar or anyone else interested, PTAL. Please take into account the discussion that already took place, though :)

Just fixed most of the comments. There are a few open that need more discussion :)

@rata rata requested a review from johananl August 16, 2020 14:43
Copy link
Member

@johananl johananl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort @rata. The functionality looks good to me. I was able to verify correct behavior following a network partition.

I've left some comments, mainly around naming and comments.

@champtar
Copy link
Contributor

I'm fine with having code review, but any plan to ship something at some point ?

  • we have bug reports that might be fixed by this PR
  • a good part of this code is changed/going away in my other commits in SpeakerList improvements #595

@rata
Copy link
Contributor Author

rata commented Sep 11, 2020

I'm fine with having code review, but any plan to ship something at some point ?

Yes, of course :). I'd like to release in the next few weeks. I'll try to address @johananl review and merge when it is ready.

rata added a commit to kinvolk/metallb that referenced this pull request Oct 5, 2020
As requested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
rata added a commit to kinvolk/metallb that referenced this pull request Oct 5, 2020
As suggested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
@rata rata force-pushed the rata/memberlist-rejoin branch from 2046420 to 639530c Compare October 5, 2020 15:26
rata added a commit to kinvolk/metallb that referenced this pull request Oct 5, 2020
As requested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
rata added a commit to kinvolk/metallb that referenced this pull request Oct 5, 2020
As requested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
rata added a commit to kinvolk/metallb that referenced this pull request Oct 5, 2020
As suggested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
@rata rata force-pushed the rata/memberlist-rejoin branch from 639530c to 1c5acfd Compare October 5, 2020 15:35
@rata
Copy link
Contributor Author

rata commented Oct 5, 2020

@johananl PTAL. I think I address your last remaining comments.

FYI, I amended your github suggestions to the respective commits that introduced them. That is why you won't see a commit doing them, but they are addressed :)

champtar and others added 6 commits October 5, 2020 17:39
Signed-off-by: Etienne Champetier <echampetier@anevia.com>
Amended-by: Johannes Liebermann <johannes@kinvolk.io>
Amended-by: Rodrigo Campos <rodrigo@kinvolk.io>
Currently, nodes join the memberlist cluster only on startup and if they
are removed (e.g. small network downtime and they recover) they never
join again. This means the node is alive, but not taken into account as
a possible speaker, making it pointless to run the MetalLB speaker.
Furthermore, it can cause split brain and you might have multiple nodes
responding to ARP requests causing the TCP connection to be reset often
and have downtime in practice.

This commit adds a go routine to do memberlist joins in a loop. This was
missing in the commit that introduced memberlist
(3c5fc63 "Use hashicorp/memberlist to
speedup dead node detection") and fixes issue metallb#589.

This routine tries to join nodes in two situations: (a) every 1 minute,
(b) when the current k8s node object changes and the SetNode() callback
is called.

The case (b) is used when the node loses connectivity for long enough
for Kubernetes to mark it in a Not Ready state. When the node is Ready
again (e.g network is restored for the node), the callback is invoked
and the node tries to join the cluster.

The case (a), periodic checks, is needed for downtimes that are smaller
than the time Kubernetes takes to mark a node as Not Ready (about 45 sec
by default). For example, if a node has a small downtime (~15 seconds)
memberlist will remove it from the cluster but as this is short enough
for Kubernetes to not change the node state to Not Ready, the SetNode()
callback won't be invoked (as the node didn't change). In these cases,
the periodic join every 1 minute will make those node join the
memberlist cluster again.

The internal kubernetes watcher already has this logic to notify when
the current node changes. In the Layer2 case, though, it was just a
"return nil". Now, it queues the message to process a re-join.

This routine tries to join only the nodes that are not currently part of
the cluster, as it was advised by Etienne Champetier that the memberlist
Join() operation is not a no-op for nodes that are already members and
can take a considerable amount of time (specially when only a small
group of nodes need to re-join the cluster).

Furthermore, the IPs to join are stored in the SpeakerList struct to be
able to recover nodes even if the API server is down. This means we
don't need to query the API server and just use the IPs cached in the
struct. This was also very good advise from Etienne.

There is another go routine introduced here to update the IPs to join
stored in the struct. This go routine updates them by quering the API
server, every 5 minutes only to not overload the API server (something
very easy to do by mistake). This is in a separate go routine because,
as I mention in the code, the current version of client-go we are using
doesn't support a context or a timeout so that query can block. To avoid
this blocking other operations, we use a separate go routine. When we
update to new client-go versions, this will of course not be needed and
we can just run with a timeout if we want to.

Fixes: metallb#589
Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As requested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As requested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
A node should be ignored if it is not present in the map or if it is
present and not ready (i.e. present but with a false value).

This issue never triggers in practice today, though, because the
speakers map (created by UsableSpeakers() speakerlist method) always
contains entries with true values.

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
As suggested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
@rata rata force-pushed the rata/memberlist-rejoin branch from 1c5acfd to 07ba1df Compare October 5, 2020 15:39
@johananl
Copy link
Member

FYI, I amended your github suggestions to the respective commits that introduced them. That is why you won't see a commit doing them, but they are addressed :)

That's fine, I never actually use the "commit" UI option. I use the suggestions feature just for the diff display.

Copy link
Member

@johananl johananl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Let's 🚢 it.

@johananl johananl merged commit 124f6be into metallb:main Oct 14, 2020
johananl pushed a commit that referenced this pull request Oct 14, 2020
As requested by Johannes here:
	#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
johananl pushed a commit that referenced this pull request Oct 14, 2020
As requested by Johannes here:
	#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
@rata rata deleted the rata/memberlist-rejoin branch October 14, 2020 16:42
johananl pushed a commit that referenced this pull request Oct 16, 2020
As requested by Johannes here:
	#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
(cherry picked from commit 45f8058)
johananl pushed a commit that referenced this pull request Oct 16, 2020
As requested by Johannes here:
	#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
(cherry picked from commit 79db52b)
johananl pushed a commit that referenced this pull request Oct 16, 2020
As suggested by Johannes here:
	#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
(cherry picked from commit 124f6be)
rata added a commit to kinvolk/metallb that referenced this pull request Feb 17, 2022
As requested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
rata added a commit to kinvolk/metallb that referenced this pull request Feb 17, 2022
As requested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
rata added a commit to kinvolk/metallb that referenced this pull request Feb 17, 2022
As suggested by Johannes here:
	metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
novad03 pushed a commit to novad03/k8s-meta that referenced this pull request Nov 25, 2023
As requested by Johannes here:
	metallb/metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
novad03 pushed a commit to novad03/k8s-meta that referenced this pull request Nov 25, 2023
As requested by Johannes here:
	metallb/metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
novad03 pushed a commit to novad03/k8s-meta that referenced this pull request Nov 25, 2023
As suggested by Johannes here:
	metallb/metallb#662 (comment)

Signed-off-by: Rodrigo Campos <rodrigo@kinvolk.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants