
Joining back 3rd manager sometimes causes reelection #1364

Closed
tonistiigi opened this issue Aug 11, 2016 · 29 comments

@tonistiigi
Member

In a three-daemon cluster the leader is killed and the 2 remaining nodes elect a new leader. When the old leader is started again it shouldn't trigger a new election, but sometimes it does (tested on a slow ARM machine).

logs:

daemon1 d20073174 nodeid=2f7116c9f3135f00 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log
daemon2 d30297954 nodeid=feb61f32aa6d9a2 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log
daemon3 d35310795 nodeid=51dbe61bbca33471 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log

timeline:

2016-08-11T00:52:14.175173269Z daemon1 (the leader) is shut down https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L214
2016-08-11T00:52:17.276697359Z daemon2 starts a new election https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L154
2016-08-11T00:52:17.290743959Z daemon2 becomes the new leader https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L167
2016-08-11T00:52:21.453692753Z daemon1 is restarted https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L221
2016-08-11T00:52:29.206870451Z daemon1 kicks off a new election (why?) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L370
from the logs it appears that daemon3 refused to vote for daemon1, and there is no vote from daemon2
2016-08-11T00:52:31.288503039Z in daemon2, all sends to daemon1 still fail for 10 sec https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L199
2016-08-11T00:52:33.226706096Z daemon3 logs that daemon2 has lost its leader status (nothing in daemon2's logs) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log-L176
2016-08-11T00:52:37.603734049Z daemon3 sets itself as candidate and becomes leader (voted for by daemon1 but not daemon2) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log-L193
2016-08-11T00:52:37.680063567Z a strange "failed to remove node" message in daemon2 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L216

@LK4D4
Contributor

LK4D4 commented Aug 11, 2016

I wonder if we should tune the raft options. Right now reelection starts 5 seconds after a node was last observed. A slow machine, CPU spikes, network problems, and the sequential nature of our raft package can all lead to a missed heartbeat.
ping @abronan
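
For orientation, here is a minimal sketch of how these tick settings map to wall-clock time in the etcd raft package that swarmkit builds on. The 1-second tick interval and the ElectionTick/HeartbeatTick values are illustrative assumptions (chosen to match the 5-second figure above), not a quote of the project's actual defaults:

package main

import (
	"fmt"
	"time"

	"github.com/coreos/etcd/raft"
)

func main() {
	// Illustrative values only: the raft state machine is ticked on a timer
	// (assumed here to be 1 second per tick) and the Config counts ticks.
	const tickInterval = time.Second
	cfg := raft.Config{
		ID:              0x01,
		HeartbeatTick:   1, // leader sends a heartbeat every 1 tick
		ElectionTick:    5, // follower calls an election after 5 ticks with no leader contact
		Storage:         raft.NewMemoryStorage(),
		MaxSizePerMsg:   1024 * 1024,
		MaxInflightMsgs: 256,
	}
	fmt.Printf("heartbeat every %v, election after %v of silence\n",
		time.Duration(cfg.HeartbeatTick)*tickInterval,
		time.Duration(cfg.ElectionTick)*tickInterval)
}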

@aaronlehmann
Collaborator

My suspicion is that since the ARM machines use a network block device, raft's fsyncs introduce too much latency and lead to timeouts.

Could you please try running the tests on a ramdisk or local storage and see if it makes a difference?

@tonistiigi
Member Author

@aaronlehmann I ran it from tmpfs and the test failed the same way. https://www.dropbox.com/s/yfcj8yul12f6bd4/leaderelection-tmpfs.tar.gz?dl=0 has all the logs and swarm state.

@abronan
Contributor

abronan commented Aug 11, 2016

@tonistiigi Is it possible to try the test with election-tick and heartbeat-tick bumped to higher values (like 2 or 3 times the default)?

The default values won't cut it for slow environments or ARM, so we should tweak those values as @LK4D4 mentioned.

@tonistiigi
Member Author

@abronan Tested with ElectionTick=9 and HeartbeatTick=3, but it didn't seem to make a difference. Logs: https://www.dropbox.com/s/fisn7d6xpf3hvi4/leadereelction-modraft.tar.gz?dl=0

@tonistiigi
Member Author

diff --git a/integration-cli/check_test.go b/integration-cli/check_test.go
index 4dfc2f2..165b5b2 100644
--- a/integration-cli/check_test.go
+++ b/integration-cli/check_test.go
@@ -243,7 +243,10 @@ func (s *DockerSwarmSuite) AddDaemon(c *check.C, joinSwarm, manager bool) *Swarm
                                JoinToken:   token,
                        }), check.IsNil)
                } else {
-                       c.Assert(d.Init(swarm.InitRequest{}), check.IsNil)
+                       req := swarm.InitRequest{}
+                       req.Spec.Raft.ElectionTick = 9
+                       req.Spec.Raft.HeartbeatTick = 3
+                       c.Assert(d.Init(req), check.IsNil)
                }
        }

@aaronlehmann
Collaborator

Not sure that patch will make a difference, as I believe these settings in the ClusterSpec are ignored (see #1326).

@tonistiigi
Member Author

@aaronlehmann @abronan I did 5 runs of this test with ElectionTick=9 and HeartbeatTick=3 set in the Docker source (not through the API), and all of them passed.
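
For reference, a sketch of what that kind of source-level override could look like, modeled on the DefaultRaftConfig() call that appears in the raft.go diff further down this thread. The package layout and the idea that only these two fields need changing are assumptions, not a quote of the actual patch:

// Assumed shape of the defaults in manager/state/raft; the other
// api.RaftConfig fields are omitted here and would keep their existing values.
package raft

import "github.com/docker/swarmkit/api"

func DefaultRaftConfig() api.RaftConfig {
	return api.RaftConfig{
		HeartbeatTick: 3, // heartbeat every 3 ticks instead of the stock value
		ElectionTick:  9, // wait 9 ticks of silence before starting an election
	}
}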

@LK4D4
Contributor

LK4D4 commented Aug 11, 2016

@tonistiigi Cool

@aaronlehmann
Collaborator

Let's try to understand why it sometimes takes a node more than 3 seconds to send a heartbeat. Is it blocked on I/O? Is it a networking issue?

I think this part of the log suggests something related to networking/GRPC:

time="2016-08-11T00:52:22.279387847Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:22.279672693Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:22.280490509Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:22.280744834Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:25.278396028Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:25.278711474Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:25.279343367Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:28.288733637Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:28.288980962Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:28.289528413Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:31.288264154Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:31.288503039Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:31.289682262Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:34.278710214Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:34.279260705Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:34.279432908Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:37.311223633Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:37.311468318Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 
time="2016-08-11T00:52:37.311629921Z" level=debug msg="feb61f32aa6d9a2 failed to send message to 2f7116c9f3135f00 because it is unreachable [next = 25, match = 0, state = ProgressStateProbe, waiting = false, pendingSnapshot = 0]" 

By 52:29, 2f7116c9f3135f00 has already started an election, and is receiving messages from 51dbe61bbca33471, so it surely must be reachable. We are supposed to reset the GRPC connection and replace it with a new one if the RPC times out, but perhaps there is a problem related to this.

@aaronlehmann
Collaborator

Here's some additional logging to try with the original heartbeat/election tick values:

diff --git a/manager/state/raft/raft.go b/manager/state/raft/raft.go
index 0cfbdfa..37751f2 100644
--- a/manager/state/raft/raft.go
+++ b/manager/state/raft/raft.go
@@ -341,6 +341,8 @@ func (n *Node) Run(ctx context.Context) error {
            n.Tick()

        case rd := <-n.Ready():
+           n.Config.Logger.Debug("processing raft update")
+
            raftConfig := DefaultRaftConfig()
            n.memoryStore.View(func(readTx store.ReadTx) {
                clusters, err := store.FindClusters(readTx, store.ByName(store.DefaultClusterName))
@@ -432,6 +434,7 @@ func (n *Node) Run(ctx context.Context) error {

            // Advance the state machine
            n.Advance()
+           n.Config.Logger.Debug("finished processing raft update")

        case snapshotIndex := <-n.snapshotInProgress:
            if snapshotIndex > n.snapshotIndex {
@@ -751,6 +754,8 @@ func (n *Node) ProcessRaftMessage(ctx context.Context, msg *api.ProcessRaftMessa
        return nil, grpc.Errorf(codes.InvalidArgument, "no message provided")
    }

+   n.Config.Logger.Debugf("received message from %x", msg.Message.From)
+
    // Don't process the message if this comes from
    // a node in the remove set
    if n.cluster.IsIDRemoved(msg.Message.From) {
@@ -777,6 +782,8 @@ func (n *Node) ProcessRaftMessage(ctx context.Context, msg *api.ProcessRaftMessa
        return nil, err
    }

+   n.Config.Logger.Debugf("finished processing message from %x", msg.Message.From)
+
    return &api.ProcessRaftMessageResponse{}, nil
 }

@@ -1083,6 +1090,8 @@ func (n *Node) sendToMember(members map[uint64]*membership.Member, m raftpb.Mess

    _, err := conn.ProcessRaftMessage(ctx, &api.ProcessRaftMessageRequest{Message: &m})
    if err != nil {
+       n.Config.Logger.Debugf("failed to send message to %x: %v", m.To, err)
+
        if grpc.ErrorDesc(err) == ErrMemberRemoved.Error() {
            n.removeRaftFunc()
        }

I have access to an ARM machine, so I can try this myself. If I run TESTFLAGS='-check.f DockerSwarmSuite.*' make test-integration-cli, will it preserve the logs outside the container by default?

@aaronlehmann
Collaborator

Should be fixed by #1367. Reopen if not.

@tonistiigi
Member Author

Still seeing this failing after #1367; logs/state with extra debug: leaderelection4.tar.gz

tonistiigi reopened this Aug 12, 2016
@abronan
Contributor

abronan commented Aug 12, 2016

Let's keep this one open until we add the necessary flags to tweak the tick values. I don't think #1367 alone fixes the issue; it was contributing to it, but we still need a higher election timeout on ARM.

@aaronlehmann
Collaborator

@abronan: This is not a problem with tick values. In the logs, daemon 2 tries to send RPCs to daemon 1 many times, and they all time out. daemon 1's RPC handler doesn't even get invoked. After 5 seconds of inactivity, daemon 1 starts an election.

It seems to be a GRPC-related problem, where reconnecting doesn't work properly. We're trying to debug further by enabling GRPC logging.
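
As a side note, one way to surface gRPC's internal logging in the grpc-go of that era was the grpclog.SetLogger hook (since superseded by SetLoggerV2). A minimal sketch, assuming the debug output should go to stderr alongside the daemon logs:

package main

import (
	"log"
	"os"

	"google.golang.org/grpc/grpclog"
)

func main() {
	// Route gRPC's internal messages (connection state changes, transport
	// errors) to stderr so they interleave with the daemon's debug output.
	grpclog.SetLogger(log.New(os.Stderr, "grpc: ", log.LstdFlags|log.Lmicroseconds))

	// ... start the daemon or test program here ...
}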

@aaronlehmann
Collaborator

Relevant log messages from daemon 1:

time="2016-08-12T19:41:51.042965673Z" level=info msg="4956d4e4eaba468d became follower at term 2"
time="2016-08-12T19:41:51.043210238Z" level=info msg="newRaft 4956d4e4eaba468d [peers: [], term: 2, commit: 24, applied: 0, lastindex: 24, lastterm: 2]"
time="2016-08-12T19:41:51.050498147Z" level=debug msg="processing raft update"
time="2016-08-12T19:41:51.078656802Z" level=debug msg="finished processing raft update"
time="2016-08-12T19:41:56.020217400Z" level=info msg="4956d4e4eaba468d is starting a new election at term 2"

and from daemon 2:

time="2016-08-12T19:41:54.742029697Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 4 desc = context deadline exceeded"
time="2016-08-12T19:41:54.742356624Z" level=warning msg="ReplaceMemberConnection oldMemeber.Conn.Close() 5284645292421564045 5284645292421564045"
time="2016-08-12T19:41:54.742560388Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:41:54.742687871Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:41:57.744058481Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 4 desc = context deadline exceeded"
time="2016-08-12T19:41:57.744819257Z" level=warning msg="ReplaceMemberConnection oldMemeber.Conn.Close() 5284645292421564045 5284645292421564045"
time="2016-08-12T19:41:57.745221185Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:41:57.745360068Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:41:57.745684115Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:42:00.769452458Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 4 desc = context deadline exceeded"
time="2016-08-12T19:42:00.770170392Z" level=warning msg="ReplaceMemberConnection oldMemeber.Conn.Close() 5284645292421564045 5284645292421564045"
time="2016-08-12T19:42:00.770626282Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:42:00.783075936Z" level=debug msg="failed to send message to 4956d4e4eaba468d: rpc error: code = 2 desc = grpc: the client connection is closing"
time="2016-08-12T19:42:00.995584435Z" level=debug msg="received message from 4956d4e4eaba468d"
time="2016-08-12T19:42:00.995745678Z" level=debug msg="finished processing message from 4956d4e4eaba468d"
time="2016-08-12T19:42:00.996044844Z" level=info msg="2c1a070ca2c6eb3f [term: 3] received a MsgVote message with higher term from 4956d4e4eaba468d [term: 4]"
time="2016-08-12T19:42:00.996234528Z" level=info msg="2c1a070ca2c6eb3f became follower at term 4"

@aaronlehmann
Collaborator

@abronan: This is not a problem with tick values.

I was wrong. I wrote a test program that makes a TLS connection to itself using the certs and keys from the integration test run. The TLS handshake takes 1.9 seconds. Since our send timeout is 2 seconds, this clearly doesn't work well.

Adjusting the send timeout and tick counts is one option. Another would be to reduce the ECC key size, or make it configurable.

Where do we want to go from here? Taking 2 seconds to open a connection is a bit absurd IMHO. I'm reluctant to tweak the defaults with that use case in mind, since it means it will take normal setups longer to recover after the leader goes down. But I'm not sure there's a better way if we want to support these tests on ARM (do we?).
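
For reference, a rough reconstruction of the kind of timing check described above: a TLS server and client in one process, with placeholder paths standing in for the CA, certificate, and key from the integration test run, and a one-way handshake rather than swarm's mutual TLS, so absolute numbers will differ:

package main

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"io/ioutil"
	"time"
)

func main() {
	// Placeholder paths and server name: substitute the CA, certificate, and
	// key taken from the integration test run's swarm state directory, and
	// the name the certificate was issued for.
	caPEM, err := ioutil.ReadFile("ca.crt")
	if err != nil {
		panic(err)
	}
	roots := x509.NewCertPool()
	roots.AppendCertsFromPEM(caPEM)

	cert, err := tls.LoadX509KeyPair("node.crt", "node.key")
	if err != nil {
		panic(err)
	}

	serverCfg := &tls.Config{Certificates: []tls.Certificate{cert}}
	clientCfg := &tls.Config{RootCAs: roots, ServerName: "swarm-node"}

	ln, err := tls.Listen("tcp", "127.0.0.1:0", serverCfg)
	if err != nil {
		panic(err)
	}
	go func() {
		srv, err := ln.Accept()
		if err != nil {
			return
		}
		// Drive the server side of the handshake, then hang up.
		srv.(*tls.Conn).Handshake()
		srv.Close()
	}()

	start := time.Now()
	conn, err := tls.Dial("tcp", ln.Addr().String(), clientCfg)
	if err != nil {
		panic(err)
	}
	fmt.Println("TCP connect + TLS handshake took", time.Since(start))
	conn.Close()
}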

@abronan
Contributor

abronan commented Aug 12, 2016

@aaronlehmann

I'm reluctant to tweak the defaults with that use case in mind, since it means it will take normal setups longer to recover after the leader goes down

Losing the leader should be an exceptional event. Bumping the default election tick to 10s (like etcd) makes the default sane for slow setups that would otherwise wrongly trigger chains of elections because of degraded network performance. It is not only about ARM. IMHO it is better to take longer to recover than to wrongly trigger elections because of a low timeout/tick (which in the long run results in more unavailability).

@aaronlehmann
Collaborator

Yeah, that sounds reasonable.

@aaronlehmann
Collaborator

RSA keys do the handshake way faster on ARM, but the key generation takes forever.

Reducing the size of the root key from 384 bits to 256 bits takes the handshake time from 1.9 seconds to about 280ms. Should we consider that?
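
To get a feel for the curve-size difference without a full TLS setup, here is a rough sketch that times ECDSA signing with P-256 versus P-384. Signing is only one component of the handshake cost, so treat the output as a proxy rather than a reproduction of the figures in this thread:

package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"fmt"
	"time"
)

func timeSigning(name string, curve elliptic.Curve) {
	key, err := ecdsa.GenerateKey(curve, rand.Reader)
	if err != nil {
		panic(err)
	}
	digest := sha256.Sum256([]byte("handshake transcript stand-in"))

	const iterations = 20
	start := time.Now()
	for i := 0; i < iterations; i++ {
		if _, _, err := ecdsa.Sign(rand.Reader, key, digest[:]); err != nil {
			panic(err)
		}
	}
	fmt.Printf("%s: %v per signature\n", name, time.Since(start)/iterations)
}

func main() {
	timeSigning("P-256", elliptic.P256())
	timeSigning("P-384", elliptic.P384())
}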

@aaronlehmann
Collaborator

cc @diogomonica

@aaronlehmann
Collaborator

Some data points from my laptop: the handshake takes 100 ms with the 384-bit root key, and 3 ms with a 256-bit root key.

I think unless we really need/want the security level of 384 bits, we should reduce the size of the root key (or use an RSA root key?). If a network partition is resolved and hundreds or thousands of workers connect back to a manager, the TLS handshakes will become very expensive and possibly lead to timeouts.

This would also have the nice side effect of fixing the issues on ARM (though I'm open to allowing longer ElectionTicks and send timeouts as well).

@aaronlehmann
Collaborator

Another data point from the laptop: the handshake takes 2 ms with an RSA-2048 key as the root key. But we probably don't want to switch to RSA even though we generate root keys infrequently, because older versions don't support it and it would really slow down the tests.

@aaronlehmann
Collaborator

This is hopefully fixed after #1376.

There is still the issue of making ElectionTick adjustable or improving the default.

@aaronlehmann
Collaborator

If we can confirm that #1375 fixes the ARM CI problems, I suggest closing this and creating a separate ticket about changing the default ElectionTicks.

@LK4D4
Contributor

LK4D4 commented Aug 17, 2016

@aaronlehmann I think at least one case of the tests passing is known now :)
ping @tonistiigi

@tonistiigi
Member Author

@aaronlehmann
Collaborator

When I saw the Jenkins link I thought it was something bad...

@aaronlehmann
Collaborator

The ARM CI problems have been fixed for a while. I'm going to close this and open a separate issue about changing the default ElectionTicks.
