
picker: increase range of peer observation #1267

Merged

merged 1 commit into master from increase-observation-weight on Jul 29, 2016

Conversation

stevvooe (Contributor)

To allow the weighting algorithms to work properly, the values of
observation weights need to be selected to be greater than 1. Without
this, the observations, positive or negative, will always converge to 1.
We also relax the smoothing factor to allow the observations to react
faster to changes.

Arguably, we could remove the weight parameter from Observe and then
make the manager state truly discrete but that is a much larger change.

Signed-off-by: Stephen J Day <stephen.day@docker.com>

cc @LK4D4 @aaronlehmann
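
To make the convergence problem concrete, here is a minimal, self-contained Go sketch of an exponentially smoothed observation of this general shape. The update rule, the 0.7 smoothing factor, and the use of math.Ceil are illustrative assumptions for the demonstration rather than the picker's actual code:

package main

import (
    "fmt"
    "math"
)

// observe applies one exponentially smoothed observation and rounds the
// result up to an integer. The constant names and values are illustrative.
func observe(current int, obs, smoothing float64) int {
    smoothed := smoothing*float64(current) + (1-smoothing)*obs
    return int(math.Ceil(smoothed))
}

func main() {
    // With an observation weight of 1, repeated negative observations can
    // never pull the value below 1: the smoothed result stays above zero,
    // and Ceil rounds it back up to 1.
    w := 1
    for i := 0; i < 5; i++ {
        w = observe(w, -1, 0.7)
        fmt.Println("weight 1: ", w) // stays at 1 forever
    }

    // With a weight of 10 there is room between the discrete steps, so
    // negative observations actually drive the value down.
    w = 1
    for i := 0; i < 5; i++ {
        w = observe(w, -10, 0.7)
        fmt.Println("weight 10:", w) // walks down toward -10
    }
}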

stevvooe added this to the 1.12.1 milestone on Jul 28, 2016
stevvooe force-pushed the increase-observation-weight branch from d7399c2 to 48b52dc on July 28, 2016 at 22:32
codecov-io commented on Jul 28, 2016

Current coverage is 54.91% (diff: 16.66%)

Merging #1267 into master will increase coverage by 0.19%

@@             master      #1267   diff @@
==========================================
  Files            78         78          
  Lines         12418      12418          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           6795       6819    +24   
+ Misses         4677       4660    -17   
+ Partials        946        939     -7   

Last update 069614f...bbc4243

@@ -299,15 +303,15 @@ func (p *Picker) WaitForStateChange(ctx context.Context, sourceState grpc.Connec
// TODO(stevvooe): This is questionable, but we'll see how it works.
Collaborator

Maybe remove the TODO now that we are taking a closer look at this.

Contributor Author

This is still questionable... and we're probably getting rid of this after the move to grpc.LoadBalancer.

LK4D4 (Contributor) commented on Jul 28, 2016

LGTM

LK4D4 (Contributor) commented on Jul 28, 2016

@aaronlehmann true, we should modify the test I wrote to use ideas like that: Observe each peer twice, then downweight one and see how often it is selected.

aaronlehmann (Collaborator)

I'm wondering if using math.Ceil really makes sense.

For example, when the weight has converged to 10, and we observe with -10, we should end up at 0. But if the math was not exact due to FP rounding, and we ended up with a value like 0.00000001, this would turn into a weight of 1. It wouldn't be a big deal if the domain was large, but since we've limited ourselves to 21 discrete steps, a difference of 1 is quite significant.

I think using floating point weights for internal state would prevent roundoffs from becoming amplified like this.
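
A toy illustration of the roundoff concern, assuming the smoothed value is kept as a float64 and then passed through math.Ceil (illustrative only, not the picker's code):

package main

import (
    "fmt"
    "math"
)

func main() {
    // A value that should be exactly zero after cancelling observations,
    // but is not, because of floating-point rounding.
    residual := 0.1 + 0.2 - 0.3
    fmt.Println(residual)            // 5.551115123125783e-17, not 0
    fmt.Println(math.Ceil(residual)) // 1: the tiny error becomes a whole step

    // Rounding to nearest (or clamping only at the ends of the range)
    // does not amplify the error the same way.
    fmt.Println(math.Round(residual)) // 0
}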

if peer == peers[0] {
    // we have an *extremely* low probability of selecting this node
    // (like 0.5%) once. We still allow the delta to keep from being
    // flaky.
Collaborator

Thanks for adding the protection against flakiness. It really sucks when a probabilistic test fails 1 in 1000 times because someone felt that was rare enough not to be a problem.

Contributor Author

This has a probability of 0.5%^20 of failing.
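
(Taking the stated 0.5% per-selection probability at face value, 0.005^20 ≈ 9.5 × 10^-47, so a failure along this path is effectively impossible.)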

LK4D4 (Contributor) commented on Jul 28, 2016

@aaronlehmann @stevvooe I'm not sure that I understand the algorithm, but this test

func TestRemotesDownweight(t *testing.T) {
    peers := []api.Peer{{Addr: "one"}, {Addr: "two"}, {Addr: "three"}}
    index := make(map[api.Peer]struct{}, len(peers))
    remotes := NewRemotes()

    for _, peer := range peers {
        index[peer] = struct{}{}
    }

    for _, p := range peers {
        remotes.Observe(p, 10)
        remotes.Observe(p, 10)
    }

    remotes.Observe(peers[0], -10)

    samples := 100000
    chosen := 0

    for i := 0; i < samples; i++ {
        p, err := remotes.Select()
        if err != nil {
            t.Fatalf("error selecting remote: %v", err)
        }
        if p == peers[0] {
            chosen++
        }
    }
    ratio := float32(chosen) / float32(samples)
    t.Logf("ratio: %f", ratio)
    if ratio > 0.001 {
        t.Fatalf("downweighted peer is choosen too often, ratio: %f", ratio)
    }
}

passes; the downweighted peer is chosen in only 0.001% of cases.

}

// one bad observation should mark the node as bad
remotes.Observe(peers[0], -DefaultObservationWeight)
Collaborator

Should we observe a few times with a positive weight before this downweighting to make sure that we're still unlikely to select this node even if its weight started as > 0 (as it will in practice)?

Contributor Author

You can either have balanced downweighting, where failures immediately reduce selection probability, or make the negative weight smaller than the positive weight so that several observations are required before a peer is fully downweighted.

In this case, initial condition is DefaultObservationWeight and we downweight with -DefaultObservationWeight, expecting it to cross zero.
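
For a rough sanity check, assuming an update of the form new = s*old + (1 - s)*obs: a single observation of -W from a steady state of +W lands at (2s - 1)*W, so any smoothing factor s <= 0.5 brings the weight to zero or below in one step.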

Collaborator

I missed the initial condition. Where is it set up?

Contributor Author

The initial loop converges it to DefaultObservationWeight.
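
Under the same assumed smoothing shape (new = s*old + (1 - s)*obs) with s = 0.5, repeatedly observing +W from an initial weight of 0 gives W/2, 3W/4, 7W/8, ..., which converges to W, i.e. to DefaultObservationWeight, after enough iterations of the loop.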

stevvooe force-pushed the increase-observation-weight branch 2 times, most recently from ee5fbe1 to 74fdc94 on July 29, 2016 at 00:04
stevvooe (Contributor Author)

The tests have been adjusted to reduce failure probability. Another test has been changed to deal with bias reduction, discovered after running it tens of thousands of times.

remotes := NewRemotes(peers...)
seen := map[api.Peer]int{}
selections := 1000
tolerance := 0.20 // allow 10% delta to reduce test failure probability
Collaborator

comment says 10%

stevvooe force-pushed the increase-observation-weight branch from 74fdc94 to bfd10a1 on July 29, 2016 at 00:18
stevvooe force-pushed the increase-observation-weight branch from bfd10a1 to bbc4243 on July 29, 2016 at 00:34
aaronlehmann (Collaborator)

LGTM
