-
Notifications
You must be signed in to change notification settings - Fork 618
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Increase gRPC request timeout to 20 seconds for sending snapshots #2391
Conversation
i'm pretty sure that's going to mean it takes 45 seconds to recognize that a raft member is down. |
I think @dperny's concern is worth investigation. |
@dperny let me test this to make sure, but your concern is legitimate. This is only the worst case though right? Does changing this timeout mean we need to adjust other Raft settings? @davidwilliamson in your tests, is it possible to check for @dperny's concern above? |
3dbe6ea
to
86fccb3
Compare
Codecov Report
@@ Coverage Diff @@
## master #2391 +/- ##
==========================================
+ Coverage 60.5% 60.68% +0.18%
==========================================
Files 128 128
Lines 26319 26332 +13
==========================================
+ Hits 15923 15980 +57
+ Misses 9006 8958 -48
- Partials 1390 1394 +4 |
86fccb3
to
1815b5f
Compare
@dperny why did you claim that detecting down Raft members will take longer to detect when down? Is the "heartbeat" done through a gRPC call that times out at 2 seconds? Anyway, based on a discussion with @anshulpundir, I've updated this PR to only increase the time limit for when we send a snapshot. This is a little hacky perhaps, but should avoid other issues. I would still suggest that @davidwilliamson test this change out before we consider merging it. I will create a build shortly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spoke with @nishanttotla offline about making this a more targeted to only affect snap send messages.
Oops, didn't see @nishanttotla previous message, where he already mentioned this. Please ignore. |
PTAL, this is less of a hack, but might be over-engineered. WDYT? @nishanttotla @anshulpundir |
It would be good to see that this is tunable. Do we have a PR for the real solution? |
@dperny looks alright to me, I think that is the right way to engineer it. The only qualm I have with doing this is that if we introduce this What we also need to figure out is whether the core change (in peer.go) is the right one to begin with. |
Had the same concern as @dperny - I see it got addressed, looks good to me. @stevvooe Agree with the tunable part generally speaking, however I'd hope we'd get rid of this workaround very soon and therefore the tunable option would become useless. Can we do a round of manual testing to validate this? Specifically, raft down nodes with and without the patch as well as sending a large snapshot. Not sure what to think about the current value of 45 seconds. Given the max snapshot size is 128MB, you'd need a transfer speed of at least 23 mbit/s to make it happen within the timeout limit, which I think is a reasonable expectation? There are a few edge cases but I don't think we can really address those. I believe the raft state machine counts any message for heart beat purposes. If the peer goes down just as we're sending a snapshot, it would take 43 additional seconds to detect it's down. But that's really a edge case and we can't really work around that, can we? |
@aluzzardi re: "Given the max snapshot size is 128MB, you'd need a transfer speed of at least 23 mbit/s to make it happen within the timeout limit" What is the expected behavior if there are, say, five snapshots (current + four historical), i.e., Is the 45 second timeout (and thus the bandwidth calculation) per snapshot, or for the entire history? |
@davidwilliamson |
Only the latest snapshot is sent @davidwilliamson |
+1 on making this tunable, since the previous value of 2s for the timeout also seems arbitrary. |
manager/state/raft/transport/peer.go
Outdated
// adjust timeout to be higher for when a snapshot is being sent. This | ||
// is to accommodate for the fact that snapshots can be large. | ||
if m.Type == raftpb.MsgSnap { | ||
timeout = 45 * time.Second |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define a constant for the default ?
1815b5f
to
c0586aa
Compare
Added a Also reduced the value of the large timeout to 20 seconds. Assuming a 100Mbps connection, about 240MB can be sent in a period of 20 seconds. This is well over our allowed limit. |
SendTimeout time.Duration | ||
TLSCredentials credentials.TransportCredentials | ||
KeyRotator EncryptionKeyRotator | ||
SendTimeout time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to be able to modify this without a code change, for testing etc. Is that currently possible ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would have to be wired through the CLI and require more changes. I don't think that should be part of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't recommending that it be a part of this PR. But, I was curious if there is an easy way to add customizable options to swarmkit. Is the answer no ? Also, is CLI the only way to do it ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's certainly possible (and not hard) to make this configurable at the SwarmKit level, but assuming that it'll mostly be used through the Docker API means it's not very valuable unless it's done end-to-end. This is just my opinion.
Raft: mr, | ||
HeartbeatInterval: 3 * time.Second, | ||
SendTimeout: 2 * time.Second, | ||
SendTimeoutSnapshot: 20 * time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd still suggest making this (and others too) a constant and using it here and in raft.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
constants won't make this clearer, just harder to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, its better to have it defined in one place than inline in many places. Agreed that that other place is test code, but still.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's do that in a follow-up if we think it's required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be a simple change, what do you think ? Also, I don't think there is a rush to get this in since 17.03 RC2 isn't waiting for this.
manager/state/raft/raft.go
Outdated
SendTimeout time.Duration | ||
// SendTimeoutSnapshot is the timeout on the sending snapshots to other raft | ||
// nodes. Leave this as 0 to get the default value. | ||
SendTimeoutSnapshot time.Duration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be SendSnapshotTimeout
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me after the outstanding comments are addressed.
The same problem might exist for AppendEntries
. Snapshots are most likely to take a long time to send, but sending a lot of entries outside a snapshot could take a long time too.
I think it is safe to increase the timeout for the general case. It's true that it might take a leader longer to notice that a follower node is down, but this case isn't particularly important. It's more important when the leader goes down and some other node needs to start a leader election, but this is handled by the Raft state machine and I don't think the timeout has any influence on this. Of course, it would be a good idea to test any changes here carefully.
c0586aa
to
df42e10
Compare
@aaronlehmann for I understand that increasing the overall timeout may work out, but given that we're able to be specific here, I think we should only increase timeout where necessary. |
Even if all the entries are sent together, does increasing the grpc message size have any affect on this ? @nishanttotla @aaronlehmann Basically I'm curious why you feel that the timeout for that also needs to be increased. |
The MaxSizePerMsg, which limits the number of entries in an append message, is set to 64K. Depending on the size of each entry, a larger grpc message size can lead to larger append messages. Also, depending on the size of each entry, MaxSizePerMsg should probably be lowered. |
LGTM |
I think it would be fine to split these across multiple gRPC requests if we detect that a lot of data would be sent. It would also be okay to increase the timeout. Whatever's easiest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can increase the timeout for specifically for append entries message, if thats possible. Otherwise, reducing MaxSizePerMsg seems simpler.
df42e10
to
56f1d85
Compare
Based on a chat with @anshulpundir, we have decided to pursue the case of |
Signed-off-by: Nishant Totla <nishanttotla@gmail.com>
56f1d85
to
e3e2821
Compare
After more discussion, @anshulpundir and I have decided to just increase the timeout for |
@thaJeztah @andrewhsu we must cherry pick this PR along with the gRPC limit increase. This must go into 17.09 as well. |
Since #2375 was merged, we've been getting
context deadline exceeded
while sending large raft snapshots. After investigation from @anshulpundir and I, it seems that the gRPC context timeout was too short at 2 seconds to be sufficient for sending large snapshots. This PR increases the default value to 45 seconds. Testing from @davidwilliamson confirms that this fixes the issue.cc @anshulpundir @aluzzardi @stevvooe @aaronlehmann @marcusmartins @mghazizadeh
Signed-off-by: Nishant Totla nishanttotla@gmail.com