agent, manager: decrease backoff for session and raftproxy #1259

Merged
merged 1 commit into master from the decrease_backoffs branch on Aug 1, 2016

Conversation

@LK4D4 (Contributor) commented Jul 28, 2016

We need to recover from "failed node" case as fast as possible.

ping @aaronlehmann @aluzzardi @dongluochen @stevvooe

@codecov-io commented Jul 28, 2016

Current coverage is 54.84% (diff: 100%)

Merging #1259 into master will increase coverage by 0.06%

@@             master      #1259   diff @@
==========================================
  Files            78         78          
  Lines         12428      12422     -6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           6808       6813     +5   
+ Misses         4678       4669     -9   
+ Partials        942        940     -2   


Powered by Codecov. Last update 058bc22...87fa6f4

@aaronlehmann (Collaborator)

I don't know if I agree with making the max backoff only 1 second. If the agent tries several times to connect to managers and doesn't succeed, it shouldn't keep trying every second forever. Ideally we would back off gradually, so that after the first minute or so the retries slow down.

@LK4D4 (Contributor, Author) commented Jul 28, 2016

@aaronlehmann makes sense, will try to tune it.

@aluzzardi (Member)

Especially if the node was node rm -f'ed /cc @diogomonica - in those cases, you want to retry very fast and then gradually slow down to avoid hitting the dispatcher too hard.

@aaronlehmann @stevvooe: This is related to the old exp backoff conversations we had. If the node was rm'ed, you probably don't want the lower bound of the retry to be 0.
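For illustration, a minimal sketch (not swarmkit code) of a backoff whose random delay keeps a nonzero lower bound, so a removed node genuinely slows down instead of retrying near-instantly forever; all names and values below are assumptions:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// delayWithFloor grows an exponential window per failed attempt, but never
// sleeps less than half of that window, so the retry rate actually drops.
func delayWithFloor(attempt int, base, ceiling time.Duration) time.Duration {
	window := base << uint(attempt)
	if window <= 0 || window > ceiling { // guard against overflow and cap it
		window = ceiling
	}
	// lower bound window/2, upper bound window
	return window/2 + time.Duration(rand.Int63n(int64(window/2)+1))
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Println(delayWithFloor(attempt, 100*time.Millisecond, 8*time.Second))
	}
}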

@@ -13,7 +13,7 @@ import (

 const (
 	initialSessionFailureBackoff = time.Second
-	maxSessionFailureBackoff     = 8 * time.Second
+	maxSessionFailureBackoff     = time.Second
Contributor (inline review comment):
Note that this also sets the session failure backoff, in addition to GRPC.

@stevvooe (Contributor)

LGTM

One second was the original value here. With the way grpc implements the backoff algorithm, a 1 second bound, with jitter, should perform better. Even with 1000 nodes, this is only 1000 connections/second, spread out by randomness and jitter across several managers.

Note that there is a better algorithm for this use case that doesn't have a lower bound on the random delay for each retry, such that it will select something from 0 to max. I'll keep trying to get GRPC to open this up so we can implement it.
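As a sketch of the algorithm described here ("full jitter": no lower bound, each delay picked uniformly from 0 up to the current max), assuming the retry loop were under our control rather than gRPC's; the helper below is hypothetical:

package main

import (
	"fmt"
	"math/rand"
	"time"
)

// fullJitterDelay picks each retry delay uniformly from [0, upper], where
// upper grows exponentially per attempt and is capped at maxDelay.
func fullJitterDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	upper := base << uint(attempt)
	if upper <= 0 || upper > maxDelay {
		upper = maxDelay
	}
	return time.Duration(rand.Int63n(int64(upper) + 1)) // anywhere from 0 to upper
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(fullJitterDelay(attempt, 100*time.Millisecond, time.Second))
	}
}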

@diogomonica (Contributor)

LGTM with all the above caveats.

@LK4D4 LK4D4 added this to the 1.12.1 milestone Jul 28, 2016
@LK4D4 (Contributor, Author) commented Jul 28, 2016

@aluzzardi @aaronlehmann I think I can tune the agent side, like having a 30-second backoff after some number of tries. WDYT?

@stevvooe (Contributor)

@LK4D4 This backoff is set for GRPC, so we don't have any control over the algorithm except for the maximum backoff time. We can adjust the session retry backoff, but I am not sure if that is the cause of this issue.
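For context, capping gRPC's reconnect backoff at dial time looks roughly like the minimal sketch below; the address and setup are hypothetical, and WithBackoffMaxDelay is assumed to be the knob in question (grpc-go exposes little else about its backoff algorithm):

package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
)

func main() {
	// Cap the reconnect backoff so a lost manager is retried at least once
	// per second instead of gRPC's much larger default maximum delay.
	conn, err := grpc.Dial(
		"manager.example.com:4242", // hypothetical manager address
		grpc.WithInsecure(),
		grpc.WithBackoffMaxDelay(time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
}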

@LK4D4 (Contributor, Author) commented Jul 29, 2016

@aaronlehmann I've changed the agent session backoff code to a safer version.
Thanks for the review!

@LK4D4 LK4D4 force-pushed the decrease_backoffs branch 2 times, most recently from 2639965 to 04dbd61 on July 29, 2016 at 20:45
For raftproxy in manager just cap grpc backoff to 1 second.
For agent session change initial backoff value to 100 milliseconds, so it
will retry much faster at first.

Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>
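As an illustrative sketch of the retry shape the commit message describes (not the actual swarmkit agent code, and assuming the 8-second cap from the original constants is kept): start at 100ms so the first retries are fast, double on each consecutive failure up to the cap, and reset once a session is established again.

package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	initialSessionFailureBackoff = 100 * time.Millisecond
	maxSessionFailureBackoff     = 8 * time.Second
)

// runSession is a stand-in for establishing and holding an agent session.
func runSession() error { return errors.New("session lost") }

func main() {
	var backoff time.Duration
	for i := 0; i < 6; i++ { // bounded here only so the example terminates
		if backoff > 0 {
			fmt.Println("waiting", backoff, "before reconnecting")
			time.Sleep(backoff)
		}
		if err := runSession(); err != nil {
			// grow the delay on every consecutive failure, capped at 8s
			if backoff == 0 {
				backoff = initialSessionFailureBackoff
			} else {
				backoff *= 2
			}
			if backoff > maxSessionFailureBackoff {
				backoff = maxSessionFailureBackoff
			}
			continue
		}
		backoff = 0 // success: the next failure starts fast again
	}
}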
@LK4D4 (Contributor, Author) commented Aug 1, 2016

ping @aaronlehmann @stevvooe

@aaronlehmann (Collaborator)

LGTM

1 similar comment
@stevvooe (Contributor) commented Aug 1, 2016

LGTM
