-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid to take into account wrong versions of protocols in Vsn. #178
Avoid to take into account wrong versions of protocols in Vsn. #178
Conversation
On Consul, sometimes, nodes do send a pMin = pMan = 0 in Vsn This causes a corruption of the acceptable versions of protocol and thus requiring version = [0, 1]. After this corruption occurs, all new nodes cannot join anymore, it then force the restart of all Consul servers to resume normal operations. While not fixing the root cause, this patch discards alive nodes claiming version 0,0,0 and will avoid this breakage. See hashicorp/consul#3217
cc @Aestek |
LGTM. This provides good protection against bad messages as seen in hashicorp/consul#3217 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pierresouchay This PR looks good to me. So far I just have a couple questions about why this helps. I have made a couple guesses but it would be great if you could describe how the problem occurs.
A few comments:
=> meaning that this code crashes when Vsn is empty |
Co-Authored-By: pierresouchay <pierresouchay@users.noreply.github.com>
@i0rek @mkeeler This bug has been reproduced quite a large time in our infrastructure because we have lots of agents being started/shutdown very often (due to poor lifecycle of somee of our machines). I have been able to reproduce it quite easily with very recent Consul version (here 3e299c0192b5277b938d99ff0fadfba45f2835a9 )
In a few seconds, on the server:
In the client's window:
By adding time.Sleep(1s), I just helped a bit, but this error happens eventually on a large enough cluster. |
… there is an Alive delegate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pierresouchay and @Aestek Thanks for diving in and figuring this one out as well as the clear documentation of the issue in #180.
It makes total sense what is happening now and why this PR would fix it.
LGTM! Thanks! |
…ible: [1, 0] This is fixed in hashicorp/memberlist#178, bump memberlist to fix possible split brain in Consul.
…ible: [1, 0] This is fixed in hashicorp/memberlist#178, bump memberlist to fix possible split brain in Consul.
Upgrade leads to protocol version (2) is incompatible: [1, 0] (#5313) This is fixed in hashicorp/memberlist#178, bump memberlist to fix possible split brain in Consul.
…ible: [1, 0] This is fixed in hashicorp/memberlist#178, bump memberlist to fix possible split brain in Consul.
…ible: [1, 0] This is fixed in hashicorp/memberlist#178, bump memberlist to fix possible split brain in Consul.
…ible: [1, 0] This is fixed in hashicorp/memberlist#178, bump memberlist to fix possible split brain in Consul.
full diff: hashicorp/memberlist@3d8438d...v0.1.4 - hashicorp/memberlist#158 Limit concurrent push/pull connections - hashicorp/memberlist#159 Prioritize alive message over other messages - hashicorp/memberlist#168 Add go.mod - hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters - hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop - hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn - hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
full diff: hashicorp/memberlist@3d8438d...v0.1.4 - hashicorp/memberlist#158 Limit concurrent push/pull connections - hashicorp/memberlist#159 Prioritize alive message over other messages - hashicorp/memberlist#168 Add go.mod - hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters - hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop - hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn - hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
full diff: hashicorp/memberlist@3d8438d...v0.1.4 - hashicorp/memberlist#158 Limit concurrent push/pull connections - hashicorp/memberlist#159 Prioritize alive message over other messages - hashicorp/memberlist#168 Add go.mod - hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters - hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop - hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn - hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
full diff: hashicorp/memberlist@3d8438d...v0.1.4 - hashicorp/memberlist#158 Limit concurrent push/pull connections - hashicorp/memberlist#159 Prioritize alive message over other messages - hashicorp/memberlist#168 Add go.mod - hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters - hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop - hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn - hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
On Consul, sometimes, nodes do send a pMin = pMan = 0 in Vsn
This causes a corruption of the acceptable versions of protocol
and thus requiring version = [0, 1].
After this corruption occurs, all new nodes cannot join anymore, it
then force the restart of all Consul servers to resume normal
operations.
While not fixing the root cause, this patch discards alive nodes
claiming version 0,0,0 and will avoid this breakage.
See hashicorp/consul#3217