agent, manager: decrease backoff for session and raftproxy #1259

LK4D4 · 2016-07-28T14:57:42Z

We need to recover from the "failed node" case as fast as possible.

ping @aaronlehmann @aluzzardi @dongluochen @stevvooe

codecov-io · 2016-07-28T15:05:55Z

Current coverage is 54.84% (diff: 100%)

Merging #1259 into master will increase coverage by 0.06%

@@             master      #1259   diff @@
==========================================
  Files            78         78          
  Lines         12428      12422     -6   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           6808       6813     +5   
+ Misses         4678       4669     -9   
+ Partials        942        940     -2

Powered by Codecov. Last update 058bc22...87fa6f4

aaronlehmann · 2016-07-28T17:25:32Z

I don't know if I agree with making the max backoff only 1 second. If the agent tries several times to connect to managers and doesn't succeed, it shouldn't continue trying every second forever. Ideally we would back off slowly, and after the first minute or so we would slow down.

LK4D4 · 2016-07-28T17:34:07Z

@aaronlehmann makes sense, will try to tune it.

aluzzardi · 2016-07-28T17:57:05Z

Especially if the node was removed with node rm -f /cc @diogomonica - in those cases, you want to retry very fast, then gradually slow down to avoid hitting the dispatcher too hard.

@aaronlehmann @stevvooe: This is related to the old exp backoff conversations we had. If the node was rm'ed, you probably don't want the lower bound of the retry to be 0.

stevvooe · 2016-07-28T19:05:41Z

agent/agent.go

@@ -13,7 +13,7 @@ import (

 const (
 	initialSessionFailureBackoff = time.Second
-	maxSessionFailureBackoff     = 8 * time.Second
+	maxSessionFailureBackoff     = time.Second


Note that this also sets the session failure backoff, in addition to GRPC.

stevvooe · 2016-07-28T19:18:15Z

LGTM

One second was the original value here. With the way grpc implements the backoff algorithm, a 1 second bound, with jitter, should perform better. Even with 1000 nodes, this is only about 1000 connection attempts per second, randomized by jitter and spread across several managers.

Note that there is a better algorithm for this use case that doesn't have a lower bound on the random delay for each retry, such that it will select something from 0 to max. I'll keep trying to get GRPC to open this up so we can implement it.

diogomonica · 2016-07-28T20:51:08Z

LGTM with all the above caveats.

LK4D4 · 2016-07-28T22:53:16Z

@aluzzardi @aaronlehmann I think I can tune the agent side, like having a 30 second backoff after some number of tries. WDYT?

stevvooe · 2016-07-28T22:56:57Z

@LK4D4 This backoff is set for GRPC, so we don't have any control over the algorithm except for the maximum backoff time. We can adjust the session retry backoff, but I am not sure if that is the cause of this issue.

LK4D4 · 2016-07-29T20:27:33Z

@aaronlehmann I've changed the agent session backoff code to a safer version.
Thanks for the review!

For raftproxy in manager just cap grpc backoff to 1 second. For agent session change initial backoff value to 100 milliseconds, so it will retry much faster at first.

Signed-off-by: Alexander Morozov <lk4d4math@gmail.com>

LK4D4 · 2016-08-01T16:36:44Z

ping @aaronlehmann @stevvooe

aaronlehmann · 2016-08-01T16:51:34Z

LGTM

stevvooe · 2016-08-01T19:36:45Z

LGTM

GordonTheTurtle added the status/0-triage label Jul 28, 2016

stevvooe reviewed Jul 28, 2016

LK4D4 added this to the 1.12.1 milestone Jul 28, 2016

LK4D4 mentioned this pull request Jul 29, 2016

[1.12] Killing leader makes all containers end up on a single node moby/moby#25017

Closed

LK4D4 force-pushed the decrease_backoffs branch from 130b1c3 to 6984c94 Compare July 29, 2016 20:26

LK4D4 force-pushed the decrease_backoffs branch 2 times, most recently from 2639965 to 04dbd61 Compare July 29, 2016 20:45

agent, manager: reduce backoffs influence

87fa6f4

LK4D4 force-pushed the decrease_backoffs branch from 04dbd61 to 87fa6f4 Compare July 29, 2016 20:46

LK4D4 merged commit fe55e5d into moby:master Aug 1, 2016

LK4D4 deleted the decrease_backoffs branch August 1, 2016 21:09

aaronlehmann added process/cherry-pick process/cherry-picked and removed status/0-triage process/cherry-pick labels Aug 1, 2016
