
picker: increase range of peer observation #1267

Merged

merged 1 commit into master from increase-observation-weight on Jul 29, 2016

Conversation

stevvooe (Contributor)

To allow the weighting algorithms to work properly, the values of
observation weights need to be selected to be greater than 1. Without
this, the observations, positive or negative, will always converge to 1.
We also relax the smoothing factor to allow the observations to react
faster to changes.

Arguably, we could remove the weight parameter from Observe and then
make the manager state truly discrete but that is a much larger change.

Signed-off-by: Stephen J Day <stephen.day@docker.com>

cc @LK4D4 @aaronlehmann
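
To make the convergence problem concrete, here is a minimal, self-contained Go sketch of an exponentially smoothed observation of this general shape. The update rule, the 0.7 smoothing factor, and the use of math.Ceil are illustrative assumptions for the demonstration rather than the picker's actual code:

package main

import (
    "fmt"
    "math"
)

// observe applies one exponentially smoothed observation and rounds the
// result up to an integer. The constant names and values are illustrative.
func observe(current int, obs, smoothing float64) int {
    smoothed := smoothing*float64(current) + (1-smoothing)*obs
    return int(math.Ceil(smoothed))
}

func main() {
    // With an observation weight of 1, repeated negative observations can
    // never pull the value below 1: the smoothed result stays above zero,
    // and Ceil rounds it back up to 1.
    w := 1
    for i := 0; i < 5; i++ {
        w = observe(w, -1, 0.7)
        fmt.Println("weight 1: ", w) // stays at 1 forever
    }

    // With a weight of 10 there is room between the discrete steps, so
    // negative observations actually drive the value down.
    w = 1
    for i := 0; i < 5; i++ {
        w = observe(w, -10, 0.7)
        fmt.Println("weight 10:", w) // walks down toward -10
    }
}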

stevvooe added this to the 1.12.1 milestone on Jul 28, 2016
stevvooe force-pushed the increase-observation-weight branch from d7399c2 to 48b52dc on July 28, 2016 at 22:32
codecov-io commented on Jul 28, 2016

Current coverage is 54.91% (diff: 16.66%)

Merging #1267 into master will increase coverage by 0.19%

@@             master      #1267   diff @@
==========================================
  Files            78         78          
  Lines         12418      12418          
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           6795       6819    +24   
+ Misses         4677       4660    -17   
+ Partials        946        939     -7   

Last update 069614f...bbc4243

@@ -299,15 +303,15 @@ func (p *Picker) WaitForStateChange(ctx context.Context, sourceState grpc.Connec
// TODO(stevvooe): This is questionable, but we'll see how it works.
Collaborator

Maybe remove the TODO now that we are taking a closer look at this.

Contributor Author

This is still questionable... and we're probably getting rid of this after the move to grpc.LoadBalancer.

LK4D4 (Contributor) commented on Jul 28, 2016

LGTM

LK4D4 (Contributor) commented on Jul 28, 2016

@aaronlehmann true, we should modify the test I wrote to use ideas like that: Observe each peer twice, then downweight one and see how often it is selected.

aaronlehmann (Collaborator)

I'm wondering if using math.Ceil really makes sense.

For example, when the weight has converged to 10, and we observe with -10, we should end up at 0. But if the math was not exact due to FP rounding, and we ended up with a value like 0.00000001, this would turn into a weight of 1. It wouldn't be a big deal if the domain was large, but since we've limited ourselves to 21 discrete steps, a difference of 1 is quite significant.

I think using floating point weights for internal state would prevent roundoffs from becoming amplified like this.
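
A toy illustration of the roundoff concern, assuming the smoothed value is kept as a float64 and then passed through math.Ceil (illustrative only, not the picker's code):

package main

import (
    "fmt"
    "math"
)

func main() {
    // A value that should be exactly zero after cancelling observations,
    // but is not, because of floating-point rounding.
    residual := 0.1 + 0.2 - 0.3
    fmt.Println(residual)            // 5.551115123125783e-17, not 0
    fmt.Println(math.Ceil(residual)) // 1: the tiny error becomes a whole step

    // Rounding to nearest (or clamping only at the ends of the range)
    // does not amplify the error the same way.
    fmt.Println(math.Round(residual)) // 0
}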

if peer == peers[0] {
    // we have an *extremely* low probability of selecting this node
    // (like 0.5%) once. We still allow the delta to keep from being
    // flaky.
Collaborator

Thanks for adding the protection against flakiness. It really sucks when a probabilistic test fails 1 in 1000 times because someone felt that was rare enough not to be a problem.

Contributor Author

This has a probability of 0.5%^20 of failing.
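
(Taking the stated 0.5% per-selection probability at face value, 0.005^20 ≈ 9.5 × 10^-47, so a failure along this path is effectively impossible.)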

LK4D4 (Contributor) commented on Jul 28, 2016

@aaronlehmann @stevvooe I'm not sure that I understand the algorithm, but this test

func TestRemotesDownweight(t *testing.T) {
    peers := []api.Peer{{Addr: "one"}, {Addr: "two"}, {Addr: "three"}}
    index := make(map[api.Peer]struct{}, len(peers))
    remotes := NewRemotes()

    for _, peer := range peers {
        index[peer] = struct{}{}
    }

    for _, p := range peers {
        remotes.Observe(p, 10)
        remotes.Observe(p, 10)
    }

    remotes.Observe(peers[0], -10)

    samples := 100000
    chosen := 0

    for i := 0; i < samples; i++ {
        p, err := remotes.Select()
        if err != nil {
            t.Fatalf("error selecting remote: %v", err)
        }
        if p == peers[0] {
            chosen++
        }
    }
    ratio := float32(chosen) / float32(samples)
    t.Logf("ratio: %f", ratio)
    if ratio > 0.001 {
        t.Fatalf("downweighted peer is choosen too often, ratio: %f", ratio)
    }
}

passes; the downweighted peer is chosen in only 0.001% of cases.

}

// one bad observation should mark the node as bad
remotes.Observe(peers[0], -DefaultObservationWeight)
Collaborator

Should we observe a few times with a positive weight before this downweighting to make sure that we're still unlikely to select this node even if its weight started as > 0 (as it will in practice)?

Contributor Author

You can either have balanced downweighting, where failures immediately reduce selection probability, or make the negative weight smaller than the positive weight so that several observations are required before a peer is fully downweighted.

In this case, initial condition is DefaultObservationWeight and we downweight with -DefaultObservationWeight, expecting it to cross zero.
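
For a rough sanity check, assuming an update of the form new = s*old + (1 - s)*obs: a single observation of -W from a steady state of +W lands at (2s - 1)*W, so any smoothing factor s <= 0.5 brings the weight to zero or below in one step.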

Collaborator

I missed the initial condition. Where is it set up?

Contributor Author

The initial loop converges it to DefaultObservationWeight.
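
Under the same assumed smoothing shape (new = s*old + (1 - s)*obs) with s = 0.5, repeatedly observing +W from an initial weight of 0 gives W/2, 3W/4, 7W/8, ..., which converges to W, i.e. to DefaultObservationWeight, after enough iterations of the loop.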

stevvooe force-pushed the increase-observation-weight branch 2 times, most recently from ee5fbe1 to 74fdc94 on July 29, 2016 at 00:04
stevvooe (Contributor Author)

The tests have been adjusted to reduce failure probability. Another test has been changed to deal with bias reduction, discovered after running it tens of thousands of times.

remotes := NewRemotes(peers...)
seen := map[api.Peer]int{}
selections := 1000
tolerance := 0.20 // allow 10% delta to reduce test failure probability
Collaborator

comment says 10%

stevvooe force-pushed the increase-observation-weight branch from 74fdc94 to bfd10a1 on July 29, 2016 at 00:18
stevvooe force-pushed the increase-observation-weight branch from bfd10a1 to bbc4243 on July 29, 2016 at 00:34
aaronlehmann (Collaborator)

LGTM
