
Joining back 3rd manager sometimes causes reelection #1364

Closed
tonistiigi opened this issue Aug 11, 2016 · 29 comments

@tonistiigi
Member

In a three-daemon cluster the leader is killed and the 2 remaining nodes elect a new leader. When the old leader is started again it shouldn't trigger a new election, but sometimes it does (tested on a slow ARM machine).

logs:

daemon1 d20073174 nodeid=2f7116c9f3135f00 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log
daemon2 d30297954 nodeid=feb61f32aa6d9a2 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log
daemon3 d35310795 nodeid=51dbe61bbca33471 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log

timeline:

2016-08-11T00:52:14.175173269Z daemon1 (the leader) is shut down https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L214
2016-08-11T00:52:17.276697359Z daemon2 starts a new election https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L154
2016-08-11T00:52:17.290743959Z daemon2 becomes the new leader https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L167
2016-08-11T00:52:21.453692753Z daemon1 is restarted https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L221
2016-08-11T00:52:29.206870451Z daemon1 kicks off a new election (why?) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L370
from the logs it appears that daemon3 refused to vote for daemon1, and there is no vote from daemon2
2016-08-11T00:52:31.288503039Z in daemon2, all sends to daemon1 still fail for 10 sec https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L199
2016-08-11T00:52:33.226706096Z daemon3 logs that daemon2 has lost its leader status (nothing in daemon2's logs) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log-L176
2016-08-11T00:52:37.603734049Z daemon3 sets itself as candidate and becomes leader (voted for by daemon1 but not daemon2) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log-L193
2016-08-11T00:52:37.680063567Z a strange "failed to remove node" message in daemon2 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L216

@LK4D4
Contributor

LK4D4 commented Aug 11, 2016

I wonder if we should tune the raft options. Right now reelection starts 5 seconds after a node was last observed. A slow machine, CPU spikes, network problems, and the sequential nature of our raft package can all lead to a missed heartbeat.
ping @abronan
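
For orientation, here is a minimal sketch of how these tick settings map to wall-clock time in the etcd raft package that swarmkit builds on. The 1-second tick interval and the ElectionTick/HeartbeatTick values are illustrative assumptions (chosen to match the 5-second figure above), not a quote of the project's actual defaults:

package main

import (
	"fmt"
	"time"

	"github.com/coreos/etcd/raft"
)

func main() {
	// Illustrative values only: the raft state machine is ticked on a timer
	// (assumed here to be 1 second per tick) and the Config counts ticks.
	const tickInterval = time.Second
	cfg := raft.Config{
		ID:              0x01,
		HeartbeatTick:   1, // leader sends a heartbeat every 1 tick
		ElectionTick:    5, // follower calls an election after 5 ticks with no leader contact
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}
	fmt.Printf("heartbeat every %v, election after %v of silence\n",
		time.Duration(cfg.HeartbeatTick)*tickInterval,
		time.Duration(cfg.ElectionTick)*tickInterval)
}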

@aaronlehmann
Collaborator

My suspicion is that since the ARM machines use a network block device, raft's fsyncs introduce too much latency and lead to timeouts.

Could you please try running the tests on a ramdisk or local storage and see if it makes a difference?

@tonistiigi
Member Author

@aaronlehmann I ran it from tmpfs and the test failed the same way. https://www.dropbox.com/s/yfcj8yul12f6bd4/leaderelection-tmpfs.tar.gz?dl=0 has all the logs and swarm state.

@abronan
Contributor

abronan commented Aug 11, 2016

@tonistiigi Is it possible to try the test with election-tick and heartbeat-tick bumped to higher values (like 2 or 3 times the default)?

The default values won't cut it for slow environments or ARM, so we should tweak those values as @LK4D4 mentioned.

@tonistiigi
Member Author

@abronan Tested with ElectionTick=9 and HeartbeatTick=3, but it didn't seem to make a difference. Logs: https://www.dropbox.com/s/fisn7d6xpf3hvi4/leadereelction-modraft.tar.gz?dl=0

@tonistiigi
Member Author

diff --git a/integration-cli/check_test.go b/integration-cli/check_test.go
index 4dfc2f2..165b5b2 100644
--- a/integration-cli/check_test.go
+++ b/integration-cli/check_test.go
@@ -243,7 +243,10 @@ func (s *DockerSwarmSuite) AddDaemon(c *check.C, joinSwarm, manager bool) *Swarm
                                JoinToken:   token,
                        }), check.IsNil)
                } else {
-                       c.Assert(d.Init(swarm.InitRequest{}), check.IsNil)
+                       req := swarm.InitRequest{}
+                       req.Spec.Raft.ElectionTick = 9
+                       req.Spec.Raft.HeartbeatTick = 3
+                       c.Assert(d.Init(req), check.IsNil)
                }
        }

@aaronlehmann
Collaborator

Not sure that patch will make a difference, as I believe these settings in the ClusterSpec are ignored (see #1326).

@tonistiigi
Member Author

@aaronlehmann @abronan I did 5 runs of this test with ElectionTick=9 and HeartbeatTick=3 set in the Docker source (not through the API), and all of them passed.
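
For reference, a sketch of what that kind of source-level override could look like, modeled on the DefaultRaftConfig() call that appears in the raft.go diff further down this thread. The package layout and the idea that only these two fields need changing are assumptions, not a quote of the actual patch:

// Assumed shape of the defaults in manager/state/raft; the other
// api.RaftConfig fields are omitted here and would keep their existing values.
package raft

import "github.com/docker/swarmkit/api"

func DefaultRaftConfig() api.RaftConfig {
	return api.RaftConfig{
		HeartbeatTick: 3, // heartbeat every 3 ticks instead of the stock value
		ElectionTick:  9, // wait 9 ticks of silence before starting an election
	}
}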

@LK4D4
Contributor

LK4D4 commented Aug 11, 2016

@tonistiigi Cool

@aaronlehmann
Collaborator

Let's try to understand why it sometimes takes a node more than 3 seconds to send a heartbeat. Is it blocked on I/O? Is it a networking issue?

I think this part of the log suggests something related to networking/GRPC:

time="2016-08-11T00:52:22.279387847Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:22.279672693Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:22.280490509Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:22.280744834Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:25.278396028Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:25.278711474Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:25.279343367Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:28.288733637Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:28.288980962Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:28.289528413Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:31.288264154Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:31.288503039Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:31.289682262Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:34.278710214Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:34.279260705Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:34.279432908Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:37.311223633Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:37.311468318Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:37.311629921Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 

By 52:29, 2f7116c9f3135f00 has already started an election, and is receiving messages from 51dbe61bbca33471, so it surely must be reachable. We are supposed to reset the GRPC connection and replace it with a new one if the RPC times out, but perhaps there is a problem related to this.

@aaronlehmann
Collaborator

Here's some additional logging to try with the original heartbeat/election tick values:

diff --git a/manager/state/raft/raft.go b/manager/state/raft/raft.go
index 0cfbdfa..37751f2 100644
--- a/manager/state/raft/raft.go
+++ b/manager/state/raft/raft.go
@@ -341,6 +341,8 @@ func (n *Node) Run(ctx context.Context) error {
            n.Tick()

        case rd := <-n.Ready():
+           n.Config.Logger.Debug("processing raft update")
+
            raftConfig := DefaultRaftConfig()
            n.memoryStore.View(func(readTx store.ReadTx) {
                clusters, err := store.FindClusters(readTx, store.ByName(store.DefaultClusterName))
@@ -432,6 +434,7 @@ func (n *Node) Run(ctx context.Context) error {

            // Advance the state machine
            n.Advance()
+           n.Config.Logger.Debug("finished processing raft update")

        case snapshotIndex := <-n.snapshotInProgress:
            if snapshotIndex > n.snapshotIndex {
@@ -751,6 +754,8 @@ func (n *Node) ProcessRaftMessage(ctx context.Context, msg *api.ProcessRaftMessa
        return nil, grpc.Errorf(codes.InvalidArgument, "no message provided")
    }

+   n.Config.Logger.Debugf("received message from %x", msg.Message.From)
+
    // Don't process the message if this comes from
    // a node in the remove set
    if n.cluster.IsIDRemoved(msg.Message.From) {
@@ -777,6 +782,8 @@ func (n *Node) ProcessRaftMessage(ctx context.Context, msg *api.ProcessRaftMessa
        return nil, err
    }

+   n.Config.Logger.Debugf("finished processing message from %x", msg.Message.From)
+
    return &api.ProcessRaftMessageResponse{}, nil
 }

@@ -1083,6 +1090,8 @@ func (n *Node) sendToMember(members map[uint64]*membership.Member, m raftpb.Mess

    _, err := conn.ProcessRaftMessage(ctx, &api.ProcessRaftMessageRequest{Message: &m})
    if err != nil {
+       n.Config.Logger.Debugf("failed to send message to %x: %v", m.To, err)
+
        if grpc.ErrorDesc(err) == ErrMemberRemoved.Error() {
            n.removeRaftFunc()
        }

I have access to an ARM machine, so I can try this myself. If I run TESTFLAGS='-check.f DockerSwarmSuite.*' make test-integration-cli, will it preserve the logs outside the container by default?

@aaronlehmann
Collaborator

Should be fixed by #1367. Reopen if not.

@tonistiigi
Member Author

Still seeing this failing after #1367; logs/state with extra debug: leaderelection4.tar.gz

tonistiigi reopened this Aug 12, 2016
@abronan
Contributor

abronan commented Aug 12, 2016

Let's keep this one open until we add the necessary flags to tweak the tick values. I don't think #1367 alone fixes the issue; it was contributing to it, but we still need a higher election timeout on ARM.

@aaronlehmann
Collaborator

@abronan: This is not a problem with tick values. In the logs, daemon 2 tries to send RPCs to daemon 1 many times, and they all time out. daemon 1's RPC handler doesn't even get invoked. After 5 seconds of inactivity, daemon 1 starts an election.

It seems to be a GRPC-related problem, where reconnecting doesn't work properly. We're trying to debug further by enabling GRPC logging.
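
As a side note, one way to surface gRPC's internal logging in the grpc-go of that era was the grpclog.SetLogger hook (since superseded by SetLoggerV2). A minimal sketch, assuming the debug output should go to stderr alongside the daemon logs:

package main

import (
	"log"
	"os"

	"google.golang.org/grpc/grpclog"
)

func main() {
	// Route gRPC's internal messages (connection state changes, transport
	// errors) to stderr so they interleave with the daemon's debug output.
	grpclog.SetLogger(log.New(os.Stderr, "grpc: ", log.LstdFlags|log.Lmicroseconds))

	// ... start the daemon or test program here ...
}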

@aaronlehmann
Collaborator

Relevant log messages from daemon 1:

time="2016-08-12T19:41:51.042965673Z" level=info msg="4956d4e4eaba468d became follower at term 2"
time="2016-08-12T19:41:51.043210238Z" level=info msg="newRaft 4956d4e4eaba468d [peers: [], term: 2, commit: 24, applied: 0, lastindex: 24, lastterm: 2]"
time="2016-08-12T19:41:51.050498147Z" level=debug msg="processing raft update"
time="2016-08-12T19:41:51.078656802Z" level=debug msg="finished processing raft update"
time="2016-08-12T19:41:56.020217400Z" level=info msg="4956d4e4eaba468d is starting a new election at term 2"

and from daemon 2:

time="2016-08-12T19:41:54.742029697Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 4 desc = context deadline exceeded"
time="2016-08-12T19:41:54.742356624Z" level=warning msg="ReplaceMemberConnection oldMemeber.Conn.Close() 5284645292421564045 5284645292421564045"
time="2016-08-12T19:41:54.742560388Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:41:54.742687871Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:41:57.744058481Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 4 desc = context deadline exceeded"
time="2016-08-12T19:41:57.744819257Z" level=warning msg="ReplaceMemberConnection oldMemeber.Conn.Close() 5284645292421564045 5284645292421564045"
time="2016-08-12T19:41:57.745221185Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:41:57.745360068Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:41:57.745684115Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:42:00.769452458Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 4 desc = context deadline exceeded"
time="2016-08-12T19:42:00.770170392Z" level=warning msg="ReplaceMemberConnection oldMemeber.Conn.Close() 5284645292421564045 5284645292421564045"
time="2016-08-12T19:42:00.770626282Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:42:00.783075936Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:42:00.995584435Z" level=debug msg="received message from 4956d4e4eaba468d"
time="2016-08-12T19:42:00.995745678Z" level=debug msg="finished processing message from 4956d4e4eaba468d"
time="2016-08-12T19:42:00.996044844Z" level=info msg="2c1a070ca2c6eb3f [term: 3] received a MsgVote message with higher term from 4956d4e4eaba468d [term: 4]"
time="2016-08-12T19:42:00.996234528Z" level=info msg="2c1a070ca2c6eb3f became follower at term 4"

@aaronlehmann
Collaborator

@abronan: This is not a problem with tick values.

I was wrong. I wrote a test program that makes a TLS connection to itself using the certs and keys from the integration test run. The TLS handshake takes 1.9 seconds. Since our send timeout is 2 seconds, this clearly doesn't work well.

Adjusting the send timeout and tick counts is one option. Another would be to reduce the ECC key size, or make it configurable.

Where do we want to go from here? Taking 2 seconds to open a connection is a bit absurd IMHO. I'm reluctant to tweak the defaults with that use case in mind, since it means it will take normal setups longer to recover after the leader goes down. But I'm not sure there's a better way if we want to support these tests on ARM (do we?).
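
For reference, a rough reconstruction of the kind of timing check described above: a TLS server and client in one process, with placeholder paths standing in for the CA, certificate, and key from the integration test run, and a one-way handshake rather than swarm's mutual TLS, so absolute numbers will differ:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"time"
)

func main() {
	// Placeholder paths and server name: substitute the CA, certificate, and
	// key taken from the integration test run's swarm state directory, and
	// the name the certificate was issued for.
	caPEM, err := ioutil.ReadFile("ca.crt")
	if err != nil {
		panic(err)
	}
	roots := x509.NewCertPool()
	roots.AppendCertsFromPEM(caPEM)

	cert, err := tls.LoadX509KeyPair("node.crt", "node.key")
	if err != nil {
		panic(err)
	}

	serverCfg := &tls.Config{Certificates: []tls.Certificate{cert}}
	clientCfg := &tls.Config{RootCAs: roots, ServerName: "swarm-node"}

	ln, err := tls.Listen("tcp", "127.0.0.1:0", serverCfg)
	if err != nil {
		panic(err)
	}
	go func() {
		srv, err := ln.Accept()
		if err != nil {
			return
		}
		// Drive the server side of the handshake, then hang up.
		srv.(*tls.Conn).Handshake()
		srv.Close()
	}()

	start := time.Now()
	conn, err := tls.Dial("tcp", ln.Addr().String(), clientCfg)
	if err != nil {
		panic(err)
	}
	fmt.Println("TCP connect + TLS handshake took", time.Since(start))
	conn.Close()
}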

@abronan
Contributor

abronan commented Aug 12, 2016

@aaronlehmann

I'm reluctant to tweak the defaults with that use case in mind, since it means it will take normal setups longer to recover after the leader goes down

Losing the leader should be an exceptional event. Bumping the default election tick to 10s (like etcd) makes the default sane for slow setups that would otherwise wrongly trigger chains of elections because of degraded network performance. It is not only about ARM. IMHO it is better to take longer to recover than to wrongly trigger elections because of a low timeout/tick (which in the long run results in more unavailability).

@aaronlehmann
Collaborator

Yeah, that sounds reasonable.

@aaronlehmann
Collaborator

RSA keys do the handshake way faster on ARM, but the key generation takes forever.

Reducing the size of the root key from 384 bits to 256 bits takes the handshake time from 1.9 seconds to about 280ms. Should we consider that?
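
To get a feel for the curve-size difference without a full TLS setup, here is a rough sketch that times ECDSA signing with P-256 versus P-384. Signing is only one component of the handshake cost, so treat the output as a proxy rather than a reproduction of the figures in this thread:

package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"time"
)

func timeSigning(name string, curve elliptic.Curve) {
	key, err := ecdsa.GenerateKey(curve, rand.Reader)
	if err != nil {
		panic(err)
	}
	digest := sha256.Sum256([]byte("handshake transcript stand-in"))

	const iterations = 20
	start := time.Now()
	for i := 0; i < iterations; i++ {
		if _, _, err := ecdsa.Sign(rand.Reader, key, digest[:]); err != nil {
			panic(err)
		}
	}
	fmt.Printf("%s: %v per signature\n", name, time.Since(start)/iterations)
}

func main() {
	timeSigning("P-256", elliptic.P256())
	timeSigning("P-384", elliptic.P384())
}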

@aaronlehmann
Collaborator

cc @diogomonica

@aaronlehmann
Collaborator

Some data points from my laptop: the handshake takes 100 ms with the 384-bit root key, and 3 ms with a 256-bit root key.

I think unless we really need/want the security level of 384 bits, we should reduce the size of the root key (or use an RSA root key?). If a network partition is resolved and hundreds or thousands of workers connect back to a manager, the TLS handshakes will become very expensive and possibly lead to timeouts.

This would also have the nice side effect of fixing the issues on ARM (though I'm open to allowing longer ElectionTicks and send timeouts as well).

@aaronlehmann
Collaborator

Another data point from the laptop: the handshake takes 2 ms with an RSA-2048 key as the root key. But we probably don't want to switch to RSA even though we generate root keys infrequently, because older versions don't support it and it would really slow down the tests.

@aaronlehmann
Collaborator

This is hopefully fixed after #1376.

There is still the issue of making ElectionTick adjustable or improving the default.

@aaronlehmann
Collaborator

If we can confirm that #1375 fixes the ARM CI problems, I suggest closing this and creating a separate ticket about changing the default ElectionTicks.

@LK4D4
Contributor

LK4D4 commented Aug 17, 2016

@aaronlehmann I think at least one case of the tests passing is known now :)
ping @tonistiigi

@tonistiigi
Member Author

@aaronlehmann
Collaborator

When I saw the Jenkins link I thought it was something bad...

@aaronlehmann
Collaborator

The ARM CI problems have been fixed for a while. I'm going to close this and open a separate issue about changing the default ElectionTicks.
