Joining back 3rd manager sometimes causes reelection #1364
I wonder if we should tune the raft options. Right now, reelection starts 5 seconds after the last observation of a node. A slow machine, CPU spikes, network problems, and the sequential nature of our raft package can all lead to a missed heartbeat.
My suspicion is that since the ARM machines use a network block device, raft's fsyncs introduce too much latency and lead to timeouts. Could you please try running the tests on a ramdisk or local storage and see if it makes a difference?
@aaronlehmann I ran it from tmpfs and the test failed the same way. https://www.dropbox.com/s/yfcj8yul12f6bd4/leaderelection-tmpfs.tar.gz?dl=0 has all the logs and swarm state.
@tonistiigi Is it possible to try the test bumping the tick values? The default values won't cut it for slow environments or ARM, so we should tweak those values as @LK4D4 mentioned.
@abronan Tested with |
Not sure that patch will make a difference, as I believe these settings in the |
@aaronlehmann @abronan I did 5 runs of this test with |
@tonistiigi Cool |
Let's try to understand why it sometimes takes a node more than 3 seconds to send a heartbeat. Is it blocked on I/O? Is it a networking issue? I think this part of the log suggests something related to networking/GRPC:
By 52:29, |
Here's some additional logging to try with the original heartbeat/election tick values:
I have access to an ARM machine, so I can try this myself. If I run |
Should be fixed by #1367. Reopen if not. |
Still seeing this fail after #1367; logs/state with extra debug: leaderelection4.tar.gz
Let's keep this one open until we add the necessary flags to tweak the tick values. I don't think #1367 alone fixes the issue. It was contributing to the issue but we still need a higher timeout for the election on ARM. |
@abronan: This is not a problem with tick values. In the logs, daemon 2 tries to send RPCs to daemon 1 many times, and they all time out. daemon 1's RPC handler doesn't even get invoked. After 5 seconds of inactivity, daemon 1 starts an election. It seems to be a GRPC-related problem, where reconnecting doesn't work properly. We're trying to debug further by enabling GRPC logging.
Relevant log messages from daemon 1:
and from daemon 2:
I was wrong. I wrote a test program that makes a TLS connection to itself using the certs and keys from the integration test run. The TLS handshake takes 1.9 seconds. Since our send timeout is 2 seconds, this clearly doesn't work well. Adjusting the send timeout and tick counts is one option. Another would be to reduce the ECC key size, or make it configurable. Where do we want to go from here? Taking 2 seconds to open a connection is a bit absurd IMHO. I'm reluctant to tweak the defaults with that use case in mind, since it means it will take normal setups longer to recover after the leader goes down. But I'm not sure there's a better way if we want to support these tests on ARM (do we?). |
Losing the Leader should be an exceptional event. Bumping the default Election tick to |
Yeah, that sounds reasonable. |
RSA keys do the handshake way faster on ARM, but the key generation takes forever. Reducing the size of the root key from 384 bits to 256 bits takes the handshake time from 1.9 seconds to about 280ms. Should we consider that? |
cc @diogomonica |
Some data points from my laptop: the handshake takes 100 ms with the 384 bit root key, and 3 ms with a 256 bit root key. I think unless we really need/want the security level of 384 bits, we should reduce the size of the root key (or use an RSA root key?). If a network partition is resolved and hundreds or thousands of workers connect back to a manager, the TLS handshakes will become very expensive and possibly lead to timeouts. This would also have the nice side effect of fixing the issues on ARM (though I'm open to allowing longer |
Another data point from the laptop: the handshake takes 2 ms with an RSA-2048 key as the root key. But we probably don't want to switch to RSA even though we generate root keys infrequently, because older versions don't support it, and key generation would really slow down the tests.
This is hopefully fixed after #1376. There is still the issue of making |
If we can confirm that #1375 fixes the ARM CI problems, I suggest closing this and creating a separate ticket about changing the default |
@aaronlehmann I think we know of at least one passing test case now :)
@LK4D4 @aaronlehmann Indeed, this happened yesterday https://jenkins.dockerproject.org/job/Docker-PRs-arm/298/console |
When I saw the jenkins link I thought it was something bad... |
ARM CI problems have been fixed for a while. I'm going to close this and open a separate issue about changing the default |
In a three-daemon cluster, the leader is killed and the two remaining nodes elect a new leader. When the old leader is started again, it shouldn't start a new election, but sometimes it does (tested on a slow ARM machine).
logs:
daemon1 d20073174 nodeid=2f7116c9f3135f00 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log
daemon2 d30297954 nodeid=feb61f32aa6d9a2 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log
daemon3 d35310795 nodeid=51dbe61bbca33471 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log
timeline:
2016-08-11T00:52:14.175173269Z daemon1(leader) is shut down https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L214
2016-08-11T00:52:17.276697359Z daemon2 starts new election https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L154
2016-08-11T00:52:17.290743959Z daemon2 becomes new leader https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L167
2016-08-11T00:52:21.453692753Z daemon1 is restarted https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L221
2016-08-11T00:52:29.206870451Z daemon1 kicks off a new election (why?) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon1-d20073174-log-L370
From the logs it appears that daemon3 refused to vote for daemon1, and there was no vote from daemon2.
2016-08-11T00:52:31.288503039Z in daemon2, all sends to daemon1 still fail for 10 seconds https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L199
2016-08-11T00:52:33.226706096Z daemon3 logs that daemon2 has lost its leader status (nothing in daemon2's logs) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log-L176
2016-08-11T00:52:37.603734049Z daemon3 sets itself as candidate and becomes leader (voted for by daemon1 but not daemon2) https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon3-d35310795-log-L193
2016-08-11T00:52:37.680063567Z a strange "failed to remove node" message in daemon2 https://gist.github.com/tonistiigi/985a6e0b90b5a94b6bb0639c880636cb#file-daemon2-d30297954-log-L216