agent, manager: decrease backoff for session and raftproxy #1259
Current coverage is 54.84% (diff: 100%)
@@             master   #1259   diff @@
==========================================
  Files            78      78
  Lines         12428   12422     -6
  Methods           0       0
  Messages          0       0
  Branches          0       0
==========================================
+ Hits           6808    6813     +5
+ Misses         4678    4669     -9
+ Partials        942     940     -2
I don't know if I agree with making the max backoff only 1 second. If the agent tries several times to connect to managers and doesn't succeed, it shouldn't continue trying every second forever. Ideally we would back off slowly, and after the first minute or so we would slow down.
@aaronlehmann makes sense, will try to tune it.
@aaronlehmann @stevvooe: This is related to the old exponential backoff conversations we had. Especially if the node was rm'ed, you probably don't want the lower bound of the retry to be 0.
@@ -13,7 +13,7 @@ import (

 const (
 	initialSessionFailureBackoff = time.Second
-	maxSessionFailureBackoff     = 8 * time.Second
+	maxSessionFailureBackoff     = time.Second
Note that this also sets the session failure backoff, in addition to GRPC.
LGTM. One second was the original value here. With the way grpc implements the backoff algorithm, a 1 second bound, with jitter, should perform better. Even with 1000 nodes, this is only about 1000 connections/second, spread with randomness and jitter across several managers. Note that there is a better algorithm for this use case that doesn't have a lower bound on the random delay for each retry, such that it will select something from 0 to max. I'll keep trying to get GRPC to open up so we can implement this.
LGTM with all the above caveats.
@aluzzardi @aaronlehmann I think I can tune agent side, like having 30 seconds backoff after some number of tries. WDYT?
@LK4D4 This backoff is set for GRPC, so we don't have any control over the algorithm except for the maximum backoff time. We can adjust the session retry backoff, but I am not sure if that is the cause of this issue.
@aaronlehmann I've changed the agent session backoff code to a safer version.
2639965 to 04dbd61
For raftproxy in manager, just cap the grpc backoff at 1 second. For the agent session, change the initial backoff value to 100 milliseconds, so it will retry much faster at first.

Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>
ping @aaronlehmann @stevvooe
LGTM
LGTM
We need to recover from "failed node" case as fast as possible.
ping @aaronlehmann @aluzzardi @dongluochen @stevvooe