Avoid to take into account wrong versions of protocols in Vsn. #178

Merged

Conversation

pierresouchay
Contributor

In Consul, nodes sometimes send pMin = pMax = 0 in Vsn.
This corrupts the set of acceptable protocol versions,
effectively requiring version = [0, 1].

After this corruption occurs, no new node can join anymore,
which then forces a restart of all Consul servers to resume normal
operations.

While not fixing the root cause, this patch discards alive messages
from nodes claiming version 0,0,0 and avoids this breakage.

See hashicorp/consul#3217
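For context, the kind of guard this adds in memberlist's alive handling looks roughly like this (a sketch of the idea only, not the exact merged diff):

	// Sketch: discard alive messages whose advertised protocol versions
	// cannot be valid, before they are merged into member state.
	if len(a.Vsn) >= 3 {
		pMin, pMax, pCur := a.Vsn[0], a.Vsn[1], a.Vsn[2]
		if pMin == 0 || pMax == 0 || pMin > pMax {
			m.logger.Printf("[WARN] memberlist: ignoring alive message for '%s': protocol versions %d/%d/%d are invalid",
				a.Node, pMin, pCur, pMax)
			return
		}
	}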

@pierresouchay
Contributor Author

cc @Aestek

@ShimmerGlass

LGTM. This provides good protection against bad messages as seen in hashicorp/consul#3217

Member

@mkeeler mkeeler left a comment


@pierresouchay This PR looks good to me. So far I just have a couple questions about why this helps. I have made a couple guesses but it would be great if you could describe how the problem occurs.

@pierresouchay
Contributor Author

pierresouchay commented Jan 25, 2019

A few comments:
For some reason (legacy, I suppose), memberlist supports receiving an empty Vsn (this is covered by a unit test), but there is no safety check before these lines:

	if m.config.Alive != nil {
		node := &Node{
			Name: a.Node,
			Addr: a.Addr,
			Port: a.Port,
			Meta: a.Meta,
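			// Note: a.Vsn is indexed directly below without a length check,
			// so an alive message carrying an empty Vsn panics here.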
			PMin: a.Vsn[0],
			PMax: a.Vsn[1],
			PCur: a.Vsn[2],
			DMin: a.Vsn[3],
			DMax: a.Vsn[4],
			DCur: a.Vsn[5],
		}
		if err := m.config.Alive.NotifyAlive(node); err != nil {
			m.logger.Printf("[WARN] memberlist: ignoring alive message for '%s': %s",
				a.Node, err)
			return
		}
	}

=> meaning that this code panics when Vsn is empty.
I could guard against it, but I wanted to keep the change minimal.
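For completeness, a minimal guard at that spot could look like this (a sketch only, deliberately left out of this PR to keep the change small):

	if m.config.Alive != nil {
		// Hypothetical guard (sketch): skip the notification instead of
		// panicking when the alive message carries no version information.
		if len(a.Vsn) < 6 {
			m.logger.Printf("[WARN] memberlist: ignoring alive message for '%s': Vsn has %d elements, expected 6",
				a.Node, len(a.Vsn))
			return
		}
		// ... build the Node and call NotifyAlive as above ...
	}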

hanshasselberg and others added 2 commits January 25, 2019 16:42
Co-Authored-By: pierresouchay <pierresouchay@users.noreply.github.com>
@pierresouchay
Contributor Author

pierresouchay commented Jan 30, 2019

@i0rek @mkeeler This bug has been reproduced many times in our infrastructure because we have lots of agents being started and shut down very often (due to the poor lifecycle of some of our machines).

I have been able to reproduce it quite easily with a very recent Consul version (here 3e299c0192b5277b938d99ff0fadfba45f2835a9):

  1. Start a non-patched server:
consul agent -data-dir $PWD/server1 -node server1 -server -datacenter testDC -bootstrap-expect 1 -serf-lan-port 9500
  2. Apply a very small patch to vendor/github.com/hashicorp/memberlist/memberlist.go by adding time.Sleep(1 * time.Second) at line 202, just before the line if err := m.setAlive(); err != nil { (see the diff sketch after this list).

  3. Run the following command:

while true; do consul agent -data-dir $PWD/client1 -node client1 -join 127.0.0.1:9500  -serf-lan-port 10300 -client 127.0.0.2 -datacenter=testDC -hcl leave_on_terminate=false & sleep 1.2 ; kill -9 %1; sleep .1; done
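For reference, the step 2 patch is a single added line before the existing setAlive call (diff sketch; the exact line depends on the vendored memberlist revision):

+	time.Sleep(1 * time.Second) // widen the startup window so the race is easier to hit
 	if err := m.setAlive(); err != nil {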

In a few seconds, on the server:

consul agent -data-dir $PWD/server1 -node server1 -server -datacenter testDC -bootstrap-expect 1 -serf-lan-port 9500
BootstrapExpect is set to 1; this is the same as Bootstrap mode.
bootstrap = true: do not enable unless necessary
==> Starting Consul agent...
==> Consul agent running!
           Version: '1.4.1-dev'
           Node ID: '649218c4-7d9e-af51-4b77-dcff7c565dfa'
         Node name: 'server1'
        Datacenter: 'testdc' (Segment: '<all>')
            Server: true (Bootstrap: true)
       Client Addr: [127.0.0.1] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 192.168.3.31 (LAN: 9500, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2019/01/30 01:13:29 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:649218c4-7d9e-af51-4b77-dcff7c565dfa Address:192.168.3.31:8300}]
    2019/01/30 01:13:29 [INFO] raft: Node at 192.168.3.31:8300 [Follower] entering Follower state (Leader: "")
    2019/01/30 01:13:30 [INFO] serf: EventMemberJoin: server1.testdc 192.168.3.31
    2019/01/30 01:13:30 [WARN] serf: Failed to re-join any previously known node
    2019/01/30 01:13:31 [INFO] serf: EventMemberJoin: server1 192.168.3.31
    2019/01/30 01:13:31 [WARN] serf: Failed to re-join any previously known node
    2019/01/30 01:13:31 [INFO] consul: Adding LAN server server1 (Addr: tcp/192.168.3.31:8300) (DC: testdc)
    2019/01/30 01:13:31 [INFO] consul: Handled member-join event for server "server1.testdc" in area "wan"
    2019/01/30 01:13:31 [INFO] agent: Started DNS server 127.0.0.1:8600 (tcp)
    2019/01/30 01:13:31 [INFO] agent: Started DNS server 127.0.0.1:8600 (udp)
    2019/01/30 01:13:31 [INFO] agent: Started HTTP server on 127.0.0.1:8500 (tcp)
    2019/01/30 01:13:31 [INFO] agent: started state syncer
    2019/01/30 01:13:34 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2019/01/30 01:13:34 [INFO] raft: Node at 192.168.3.31:8300 [Candidate] entering Candidate state in term 10
    2019/01/30 01:13:34 [INFO] raft: Election won. Tally: 1
    2019/01/30 01:13:34 [INFO] raft: Node at 192.168.3.31:8300 [Leader] entering Leader state
    2019/01/30 01:13:34 [INFO] consul: cluster leadership acquired
    2019/01/30 01:13:34 [INFO] consul: New leader elected: server1
    2019/01/30 01:13:34 [INFO] consul: member 'client1' reaped, deregistering
    2019/01/30 01:13:34 [INFO] agent: Synced node info
    2019/01/30 01:13:34 [INFO] serf: EventMemberJoin: client1 192.168.3.31
    2019/01/30 01:13:34 [INFO] consul: member 'client1' joined, marking health alive
    2019/01/30 01:13:35 [INFO] serf: EventMemberUpdate: client1
    2019/01/30 01:13:37 [WARN] memberlist: Was able to connect to client1 but other probes failed, network may be misconfigured
    2019/01/30 01:13:37 [INFO] serf: EventMemberUpdate: client1
    2019/01/30 01:13:38 [ERR] memberlist: Failed push/pull merge: Node 'client1' protocol version (2) is incompatible: [1, 0] from=192.168.3.31:58158
    2019/01/30 01:13:38 [ERR] memberlist: Failed push/pull merge: Node 'client1' protocol version (2) is incompatible: [1, 0] from=192.168.3.31:58159
    2019/01/30 01:13:39 [ERR] memberlist: Failed push/pull merge: Node 'client1' protocol version (2) is incompatible: [1, 0] from=192.168.3.31:58165
    2019/01/30 01:13:39 [ERR] memberlist: Failed push/pull merge: Node 'client1' protocol version (2) is incompatible: [1, 0] from=192.168.3.31:58166
    2019/01/30 01:13:42 [INFO] memberlist: Suspect client1 has failed, no acks received

In the client's window:

while true; do consul agent -data-dir $PWD/client1 -node client1 -join 192.168.3.31:9500  -serf-lan-port 10300 -client 127.0.0.2 -datacenter=testDC -hcl leave_on_terminate=false & sleep 1.2 ; kill -9 %1; sleep .1; done
[1] 23981
==> Starting Consul agent...
==> Joining cluster...
    Join completed. Synced with 1 initial agents
==> Consul agent running!
           Version: '1.4.1-dev'
           Node ID: '02552157-072f-bf43-63bf-dc429f40afea'
         Node name: 'client1'
        Datacenter: 'testdc' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [127.0.0.2] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 192.168.3.31 (LAN: 10300, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2019/01/30 01:13:34 [INFO] serf: EventMemberJoin: client1 192.168.3.31
    2019/01/30 01:13:34 [INFO] serf: Attempting re-join to previously known node: server1: 192.168.3.31:9500
    2019/01/30 01:13:34 [INFO] agent: Started DNS server 127.0.0.2:8600 (tcp)
    2019/01/30 01:13:34 [INFO] agent: Started DNS server 127.0.0.2:8600 (udp)
    2019/01/30 01:13:34 [INFO] agent: Started HTTP server on 127.0.0.2:8500 (tcp)
    2019/01/30 01:13:34 [INFO] agent: (LAN) joining: [192.168.3.31:9500]
    2019/01/30 01:13:34 [INFO] serf: EventMemberJoin: server1 192.168.3.31
    2019/01/30 01:13:34 [INFO] consul: adding server server1 (Addr: tcp/192.168.3.31:8300) (DC: testdc)
    2019/01/30 01:13:34 [INFO] serf: Re-joined to previously known node: server1: 192.168.3.31:9500
    2019/01/30 01:13:34 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2019/01/30 01:13:34 [INFO] agent: started state syncer
    2019/01/30 01:13:34 [INFO] agent: Synced node info
[1]+  Killed: 9               consul agent -data-dir $PWD/client1 -node client1 -join 192.168.3.31:9500 -serf-lan-port 10300 -client 127.0.0.2 -datacenter=testDC -hcl leave_on_terminate=false
[1] 23986
==> Starting Consul agent...
==> Joining cluster...
    Join completed. Synced with 1 initial agents
==> Consul agent running!
           Version: '1.4.1-dev'
           Node ID: '02552157-072f-bf43-63bf-dc429f40afea'
         Node name: 'client1'
        Datacenter: 'testdc' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [127.0.0.2] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 192.168.3.31 (LAN: 10300, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2019/01/30 01:13:34 [WARN] memberlist: Refuting an alive message
    2019/01/30 01:13:34 [INFO] serf: EventMemberJoin: client1 192.168.3.31
    2019/01/30 01:13:34 [INFO] serf: EventMemberJoin: server1 192.168.3.31
    2019/01/30 01:13:35 [INFO] serf: EventMemberJoin: client1 192.168.3.31
    2019/01/30 01:13:35 [INFO] serf: Attempting re-join to previously known node: server1: 192.168.3.31:9500
    2019/01/30 01:13:35 [INFO] consul: adding server server1 (Addr: tcp/192.168.3.31:8300) (DC: testdc)
    2019/01/30 01:13:35 [INFO] consul: New leader elected: server1
    2019/01/30 01:13:35 [INFO] agent: Started DNS server 127.0.0.2:8600 (tcp)
    2019/01/30 01:13:35 [INFO] agent: Started DNS server 127.0.0.2:8600 (udp)
    2019/01/30 01:13:35 [INFO] agent: Started HTTP server on 127.0.0.2:8500 (tcp)
    2019/01/30 01:13:35 [INFO] agent: (LAN) joining: [192.168.3.31:9500]
    2019/01/30 01:13:35 [INFO] serf: Re-joined to previously known node: server1: 192.168.3.31:9500
    2019/01/30 01:13:35 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2019/01/30 01:13:35 [INFO] agent: started state syncer
    2019/01/30 01:13:35 [INFO] agent: Synced node info
[1]+  Killed: 9               consul agent -data-dir $PWD/client1 -node client1 -join 192.168.3.31:9500 -serf-lan-port 10300 -client 127.0.0.2 -datacenter=testDC -hcl leave_on_terminate=false
[1] 23994
==> Starting Consul agent...
==> Joining cluster...
    Join completed. Synced with 1 initial agents
==> Consul agent running!
           Version: '1.4.1-dev'
           Node ID: '02552157-072f-bf43-63bf-dc429f40afea'
         Node name: 'client1'
        Datacenter: 'testdc' (Segment: '')
            Server: false (Bootstrap: false)
       Client Addr: [127.0.0.2] (HTTP: 8500, HTTPS: -1, gRPC: -1, DNS: 8600)
      Cluster Addr: 192.168.3.31 (LAN: 10300, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2019/01/30 01:13:36 [WARN] memberlist: Refuting an alive message
    2019/01/30 01:13:36 [INFO] serf: EventMemberJoin: client1 192.168.3.31
    2019/01/30 01:13:37 [INFO] serf: EventMemberJoin: client1 192.168.3.31
    2019/01/30 01:13:37 [INFO] serf: Attempting re-join to previously known node: server1: 192.168.3.31:9500
    2019/01/30 01:13:37 [INFO] agent: Started DNS server 127.0.0.2:8600 (udp)
    2019/01/30 01:13:37 [INFO] agent: Started DNS server 127.0.0.2:8600 (tcp)
    2019/01/30 01:13:37 [INFO] agent: Started HTTP server on 127.0.0.2:8500 (tcp)
    2019/01/30 01:13:37 [INFO] agent: (LAN) joining: [192.168.3.31:9500]
    2019/01/30 01:13:37 [INFO] serf: EventMemberJoin: server1 192.168.3.31
    2019/01/30 01:13:37 [INFO] consul: adding server server1 (Addr: tcp/192.168.3.31:8300) (DC: testdc)
    2019/01/30 01:13:37 [INFO] serf: Re-joined to previously known node: server1: 192.168.3.31:9500
    2019/01/30 01:13:37 [INFO] agent: (LAN) joined: 1 Err: <nil>
    2019/01/30 01:13:37 [INFO] agent: started state syncer
    2019/01/30 01:13:37 [INFO] agent: Synced node info
[1]+  Killed: 9               consul agent -data-dir $PWD/client1 -node client1 -join 192.168.3.31:9500 -serf-lan-port 10300 -client 127.0.0.2 -datacenter=testDC -hcl leave_on_terminate=false
[1] 24000
==> Starting Consul agent...
==> Joining cluster...
==> 1 error(s) occurred:

* Failed to join 192.168.3.31: Node 'server1' protocol version (2) is incompatible: [1, 0]

Adding the time.Sleep(1 * time.Second) just makes the race easier to hit; on a large enough cluster this error eventually happens anyway.
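For intuition: the acceptable protocol range is roughly the intersection of every member's advertised [PMin, PMax], so a single member claiming [0, 0] collapses the upper bound below the lower bound, after which every join is rejected with the impossible range [1, 0] seen above. A toy illustration of that collapse (not memberlist's actual verifyProtocol code):

	package main

	import "fmt"

	func main() {
		// Three members: two healthy ones advertising [1, 5] and one corrupted
		// node advertising [0, 0], as in the alive messages this PR discards.
		members := [][2]uint8{{1, 5}, {1, 5}, {0, 0}}
		lo, hi := uint8(0), uint8(255)
		for _, m := range members {
			if m[0] > lo {
				lo = m[0]
			}
			if m[1] < hi {
				hi = m[1]
			}
		}
		// Prints "acceptable protocol range: [1, 0]", i.e. an empty range.
		fmt.Printf("acceptable protocol range: [%d, %d]\n", lo, hi)
	}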

Member

@mkeeler mkeeler left a comment


@pierresouchay and @Aestek Thanks for diving in and figuring this one out as well as the clear documentation of the issue in #180.

It makes total sense what is happening now and why this PR would fix it.

@banks
Member

banks commented Feb 4, 2019

LGTM! Thanks!

@mkeeler mkeeler merged commit b38abf6 into hashicorp:master Feb 4, 2019
pierresouchay added a commit to pierresouchay/consul that referenced this pull request Feb 4, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
pierresouchay added a commit to criteo-forks/consul that referenced this pull request Feb 5, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
mkeeler pushed a commit to hashicorp/consul that referenced this pull request Feb 5, 2019
Upgrade leads to protocol version (2) is incompatible: [1, 0] (#5313)

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
ShimmerGlass pushed a commit to criteo-forks/consul that referenced this pull request Feb 8, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
ShimmerGlass pushed a commit to criteo-forks/consul that referenced this pull request Feb 8, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
LeoCavaille pushed a commit to DataDog/consul that referenced this pull request Mar 30, 2019
…ible: [1, 0]

This is fixed in hashicorp/memberlist#178, bump
memberlist to fix possible split brain in Consul.
thaJeztah added a commit to thaJeztah/libnetwork that referenced this pull request Aug 26, 2019
full diff: hashicorp/memberlist@3d8438d...v0.1.4

- hashicorp/memberlist#158 Limit concurrent push/pull connections
- hashicorp/memberlist#159 Prioritize alive message over other messages
- hashicorp/memberlist#168 Add go.mod
- hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters
- hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop
- hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn
- hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah added a commit to thaJeztah/libnetwork that referenced this pull request Aug 26, 2019
full diff: hashicorp/memberlist@3d8438d...v0.1.4

- hashicorp/memberlist#158 Limit concurrent push/pull connections
- hashicorp/memberlist#159 Prioritize alive message over other messages
- hashicorp/memberlist#168 Add go.mod
- hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters
- hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop
- hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn
- hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah added a commit to thaJeztah/libnetwork that referenced this pull request Feb 26, 2020
full diff: hashicorp/memberlist@3d8438d...v0.1.4

- hashicorp/memberlist#158 Limit concurrent push/pull connections
- hashicorp/memberlist#159 Prioritize alive message over other messages
- hashicorp/memberlist#168 Add go.mod
- hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters
- hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop
- hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn
- hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>
thaJeztah added a commit to thaJeztah/libnetwork that referenced this pull request May 11, 2021
full diff: hashicorp/memberlist@3d8438d...v0.1.4

- hashicorp/memberlist#158 Limit concurrent push/pull connections
- hashicorp/memberlist#159 Prioritize alive message over other messages
- hashicorp/memberlist#168 Add go.mod
- hashicorp/memberlist#167 Various changes to improve the cpu impact of TransmitLimitedQueue in large clusters
- hashicorp/memberlist#169 added back-off to accept loop to avoid a tight loop
- hashicorp/memberlist#178 Avoid to take into account wrong versions of protocols in Vsn
- hashicorp/memberlist#189 Allow a dead node's name to be taken by a new node

Signed-off-by: Sebastiaan van Stijn <github@gone.nl>