This repository has been archived by the owner on Apr 19, 2024. It is now read-only.

Make ConsistantHash have a better distribution #56

Merged
merged 7 commits into from
May 14, 2020

Conversation

bohde
Contributor

@bohde bohde commented May 9, 2020

In a 3 peer cluster, I observed up to 60% of requests being routed to a single peer. I traced this back to the ConsistantHash algorithm only choosing one point in the ring per peer, which can skew the distribution. To alleviate this, I adjusted the algorithm in a few different ways:

  1. Generate 512 points in the ring per peer

This reduces the probability that a given peer will take more requests than others.

  2. Change the default hashing algorithm from crc32 to 64-bit fnv1 from https://godoc.org/github.com/segmentio/fasthash.

This has a better distribution than crc32 at lower numbers of points per peer.

  3. Rework internals to have zero allocations and no map lookups in ConsistantHash.Get

I've included tests and benchmarks for ConsistantHash for all hash functions in fasthash, including tests for how a set of requests are distributed amongst peers.

This is backwards incompatible in a few minor ways:

  1. The signature of HashFunc now returns a uint64.

  2. Requests will most likely route to a new peer, meaning peers with this code can't work with peers with the previous algorithm, since they won't agree on the canonical peer for a request.

Below are results from the new distribution tests and benchmarks. I've also deployed a variant of this to a test environment, and can confirm that requests are more evenly distributed.

Distributions

=== RUN   TestConsistantHash/distribution/fasthash/fnv1a
    TestConsistantHash/distribution/fasthash/fnv1a: hash_test.go:77: host: c.svc.local, percent: 0.357200
    TestConsistantHash/distribution/fasthash/fnv1a: hash_test.go:77: host: a.svc.local, percent: 0.336400
    TestConsistantHash/distribution/fasthash/fnv1a: hash_test.go:77: host: b.svc.local, percent: 0.306400
=== RUN   TestConsistantHash/distribution/fasthash/fnv1
    TestConsistantHash/distribution/fasthash/fnv1: hash_test.go:77: host: a.svc.local, percent: 0.349700
    TestConsistantHash/distribution/fasthash/fnv1: hash_test.go:77: host: b.svc.local, percent: 0.340100
    TestConsistantHash/distribution/fasthash/fnv1: hash_test.go:77: host: c.svc.local, percent: 0.310200
=== RUN   TestConsistantHash/distribution/fasthash/jody
    TestConsistantHash/distribution/fasthash/jody: hash_test.go:77: host: b.svc.local, percent: 0.276900
    TestConsistantHash/distribution/fasthash/jody: hash_test.go:77: host: c.svc.local, percent: 0.523900
    TestConsistantHash/distribution/fasthash/jody: hash_test.go:77: host: a.svc.local, percent: 0.199200
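
A distribution check of this kind can be sketched as follows. This is not the test from `hash_test.go`: the `distribution` helper, the synthetic key format, and the modulo-based stand-in picker are assumptions for illustration, and the picker under test would be swapped in for `pick`.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strconv"
)

// distribution sends n synthetic keys through pick and returns the
// fraction of keys that landed on each host.
func distribution(pick func(key string) string, n int) map[string]float64 {
	counts := make(map[string]int)
	for i := 0; i < n; i++ {
		counts[pick("key-"+strconv.Itoa(i))]++
	}
	shares := make(map[string]float64, len(counts))
	for host, c := range counts {
		shares[host] = float64(c) / float64(n)
	}
	return shares
}

func main() {
	hosts := []string{"a.svc.local", "b.svc.local", "c.svc.local"}
	// Stand-in picker (hash modulo host count); replace with the picker under test.
	pick := func(key string) string {
		h := fnv.New64()
		h.Write([]byte(key))
		return hosts[h.Sum64()%uint64(len(hosts))]
	}
	shares := distribution(pick, 10000)

	// Sort host names so the report order is stable across runs.
	names := make([]string, 0, len(shares))
	for host := range shares {
		names = append(names, host)
	}
	sort.Strings(names)
	for _, host := range names {
		fmt.Printf("host: %s, percent: %f\n", host, shares[host])
	}
}
```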

Benchmarks

BenchmarkConsistantHash/fasthash/fnv1a
BenchmarkConsistantHash/fasthash/fnv1a-8        17533118                70.5 ns/op             0 B/op          0 allocs/op
BenchmarkConsistantHash/fasthash/fnv1
BenchmarkConsistantHash/fasthash/fnv1-8         18706365                65.4 ns/op             0 B/op          0 allocs/op
BenchmarkConsistantHash/fasthash/jody
BenchmarkConsistantHash/fasthash/jody-8         20959976                54.8 ns/op             0 B/op          0 allocs/op

1. Generate 512 points in the ring per peer to reduce the probability
that a given peer will take more requests than others.

2. Change the default hashing algorithm from crc32 to fnv1 from github.com/segmentio/fasthash

3. Adjust internal implementation to have zero allocation and avoid map lookups.

4. Add tests and benchmarks, including distribution tests for hash
functions.

```
BenchmarkConsistantHash/fasthash/fnv1
BenchmarkConsistantHash/fasthash/fnv1-8         18706365                65.4 ns/op             0 B/op          0 allocs/op
```

5. Adjust functional tests to accommodate the new distribution
@bohde bohde requested a review from thrawn01 as a code owner May 9, 2020 18:30
@mailgun-ci

Can one of the admins verify this patch?

This prevents random test failures.
@thrawn01
Contributor

This is great!

Thank you for spending the time to test and change this algorithm! I would love to get this merged before PR #55.

As you say, this will change the distribution of keys on systems in production, and that might not be desirable for a non-major release (not ready to bump to v1.0 yet 😄). So my thought is that we keep the current ConsistentHash as the default but create an optional ConsistentFastHash with your code, which can be assigned to gubernator.Config.PeerPicker during initialization of gubernator. Once we are ready for a new major release of gubernator, we can make ConsistentFastHash the default PeerPicker and deprecate the ConsistentHash implementation.

We should also create FastHashFunc and DefaultFastHash so we don't change the current public interface.

Thoughts?
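
The opt-in arrangement proposed above could look roughly like this. Everything here is hypothetical: the `PeerPicker` interface is reduced to a placeholder `Name` method, and `Config`, `setDefaults`, and both picker types are illustrative stand-ins, not gubernator's actual API.

```go
package main

import "fmt"

// PeerPicker is a hypothetical, reduced stand-in for the picker interface.
type PeerPicker interface {
	Name() string
}

// consistentHash stands in for the existing single-point picker.
type consistentHash struct{}

func (consistentHash) Name() string { return "single-point" }

// replicatedHash stands in for the new picker with several points per peer.
type replicatedHash struct{ replicas int }

func (r replicatedHash) Name() string {
	return fmt.Sprintf("replicated(%d)", r.replicas)
}

// Config is a hypothetical configuration struct with an optional picker.
type Config struct {
	PeerPicker PeerPicker
}

// setDefaults keeps the existing picker unless the caller opted in to another.
func (c *Config) setDefaults() {
	if c.PeerPicker == nil {
		c.PeerPicker = consistentHash{}
	}
}

func main() {
	var def Config
	def.setDefaults()

	opt := Config{PeerPicker: replicatedHash{replicas: 512}}
	opt.setDefaults()

	fmt.Println(def.PeerPicker.Name(), opt.PeerPicker.Name())
}
```

Because the default is only applied when the field is nil, existing deployments keep their current key-to-peer mapping until they explicitly choose the new picker.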

@bohde
Contributor Author

bohde commented May 13, 2020

That sounds good. I think that provides room for testing this on more data sets to verify distributions are fairly equal before establishing a default in a major release.

I've made a variant of the proposed change and added it to the pull request. I named it ReplicatedConsistantHash, since the key improvement is replicating each host multiple times throughout the ring rather than the speed gains. I've backported the zero-allocation setup to the original ConsistantHash and written benchmarks for it alongside the 32-bit variants in fasthash. The results are as follows:

BenchmarkConsistantHash/crc32
BenchmarkConsistantHash/crc32-8         19633891                62.7 ns/op             0 B/op          0 allocs/op
BenchmarkConsistantHash/fasthash/fnv1a
BenchmarkConsistantHash/fasthash/fnv1a-8                32143843                37.4 ns/op             0 B/op          0 allocs/op
BenchmarkConsistantHash/fasthash/fnv1
BenchmarkConsistantHash/fasthash/fnv1-8                 33347883                36.2 ns/op             0 B/op          0 allocs/op
BenchmarkReplicatedConsistantHash
BenchmarkReplicatedConsistantHash/fasthash/fnv1a
BenchmarkReplicatedConsistantHash/fasthash/fnv1a-8              18421224                65.9 ns/op             0 B/op          0 allocs/op
BenchmarkReplicatedConsistantHash/fasthash/fnv1
BenchmarkReplicatedConsistantHash/fasthash/fnv1-8               19322701                64.2 ns/op             0 B/op          0 allocs/op

I also added tests for the original ConsistantHash (including a regression test to verify that the zero allocation improvement didn't change the distribution). For the original algorithm, I'm seeing the following distribution, which is similar to the real world data I was seeing:

    TestConsistantHash/distribution/crc32: hash_test.go:98: host: a.svc.local, percent: 0.160900
    TestConsistantHash/distribution/crc32: hash_test.go:98: host: b.svc.local, percent: 0.603900
    TestConsistantHash/distribution/crc32: hash_test.go:98: host: c.svc.local, percent: 0.235200

I've added a config option, GUBER_HASH_REPLICAS, which configures the new ReplicatedConsistantHash. I've deployed a branch with this change, configured with 512 replicas, to production, and each node now receives between 31% and 35% of requests. I'd like this to be closer to 33%, and will continue to experiment with settings.

@thrawn01
Contributor

thrawn01 commented May 13, 2020

Thanks for making that change! I created a PR against your branch to allow users to pick the picker and algorithm. See bohde#1

If that looks okay to you, merge it into your branch and we can get this PR merged into master.

@bohde
Contributor Author

bohde commented May 14, 2020

Merged!

@thrawn01 thrawn01 merged commit cd5588d into mailgun:master May 14, 2020