This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

partitionBy bySeriesWithTags (aka "shard by tag") #1282

Closed · wants to merge 18 commits

Conversation

@Dieterbe (Contributor) commented Apr 12, 2019:

fix #1123
This PR is a reworked version of #1146
It is intended as a more rigorous approach to implementing bySeriesWithTags.
Changing our sharding scheme is not a trivial operation (there is a lot of deployment overhead), so I want to make sure we get it right.
This PR tells a story: if you go through the commits one by one, you'll see I've added several candidate implementations of the actual sharding, plus a testing framework to validate them.
Once we agree that the tests are conclusive and we pick a winner, we can simply remove the other implementations.

Criteria (somewhat in order of importance):

  1. distribution performance (metrics must be as evenly distributed across partitions as possible)
  2. maturity of library
  3. computational performance

In particular, I want to answer questions such as:

  1. Does jump hashing bring value? (It has a reputation for producing very even distributions.) And is the Sarama hasher good enough? (Note: it uses 32-bit FNV-1a, whereas JumpPartitionerFnv uses 64-bit FNV-1a.)
  2. Since jump needs a uint64 input, which pre-hashing step do we use to feed input into it? (See the sketch after the chat excerpt below.)
  3. How correct is @dgryski in the conversation below? (He is a hashing-algorithm wizard.) There is a plethora of hash functions that all seem like viable candidates (siphash, metro, xxhash, fnv, ...).
Dieter Plaetinck 10:42 PM
hash master @dgryski, is there a best practice for converting arbitrary-length []byte slices to uint64's for feeding into go-jump? the byte slices are string ids (metric names)
Damian Gryski 10:43 PM
@dieter_be any fast hash function; go-metro or what-have-you
Dieter Plaetinck 10:46 PM
@dgryski so not some kind of hand-written loop that goes over all byte values and adds them together or something? also, which library would you recommend for a production app?
Damian Gryski 10:47 PM
@dieter_be pick a fast one: https://github.com/dgryski/go-metro
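
To make the above concrete, here is a minimal sketch of the approach under discussion (illustrative only, not code from this PR; it assumes the upstream go-metro and go-jump APIs): pre-hash the arbitrary-length key to a uint64 with a fast hash (go-metro here; xxhash works the same way), then feed that into jump consistent hashing to pick a partition.

package main

import (
	"fmt"

	jump "github.com/dgryski/go-jump"
	metro "github.com/dgryski/go-metro"
)

// partitionKey maps a series key (e.g. a metric name with tags) onto one of numPartitions.
func partitionKey(key []byte, numPartitions int32) int32 {
	h := metro.Hash64(key, 0)               // arbitrary-length bytes -> uint64, seed 0
	return jump.Hash(h, int(numPartitions)) // uint64 -> [0, numPartitions)
}

func main() {
	fmt.Println(partitionKey([]byte("some.metric.name;dc=us-east;host=a1"), 32))
}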

As far as analysis goes, I tested with fakemetrics and 4 datasets that I pulled from some of our HM instances (with permission).
Of course I can't share the contents or name names here, though ops is our internal monitoring instance.

-rw-r--r--  1 dieter dieter 1.5G Apr 12 15:49 fng.txt
-rw-r--r--  1 dieter dieter 860M Nov 30 23:47 id.txt
-rw-r--r--  1 dieter dieter 113M Nov 30 22:14 ops.txt
-rw-r--r--  1 dieter dieter 774K Nov 30 23:52 rtm.txt

Now, there's obviously a lot of output, some of it more relevant than the rest (e.g. no one should run large amounts of metrics on a small number of partitions). I have mentioned some tips in cluster/partitioner/partitioner_test.go.
So I have just grepped for the dataset/partition-count combinations that seem most relevant (r is an alias for grep).

~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/32.*fng' results.txt

              sarama/32          fng   3024891 -> 0.002749
          jump-mauro/32          fng   3024891 -> 0.003652
            jump-fnv/32          fng   3024891 -> 0.003544
          jump-metro/32          fng   3024891 -> 0.003134
            jump-sip/32          fng   3024891 -> 0.003200
         jump-xxhash/32          fng   3024891 -> 0.003050
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/128.*fng' results.txt

              sarama/128         fng   3024891 -> 0.005799
          jump-mauro/128         fng   3024891 -> 0.006008
            jump-fnv/128         fng   3024891 -> 0.007099
          jump-metro/128         fng   3024891 -> 0.006588
            jump-sip/128         fng   3024891 -> 0.006446
         jump-xxhash/128         fng   3024891 -> 0.006058
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/32.*ops' results.txt

              sarama/32          ops   1009213 -> 0.007097
          jump-mauro/32          ops   1009213 -> 0.006237
            jump-fnv/32          ops   1009213 -> 0.006406
          jump-metro/32          ops   1009213 -> 0.006473
            jump-sip/32          ops   1009213 -> 0.004689
         jump-xxhash/32          ops   1009213 -> 0.006385
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/128.*ops' results.txt

              sarama/128         ops   1009213 -> 0.012501
          jump-mauro/128         ops   1009213 -> 0.012525
            jump-fnv/128         ops   1009213 -> 0.012115
          jump-metro/128         ops   1009213 -> 0.011460
            jump-sip/128         ops   1009213 -> 0.012769
         jump-xxhash/128         ops   1009213 -> 0.013628
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/32.*id' results.txt

              sarama/32           id    687540 -> 0.007669
          jump-mauro/32           id    687540 -> 0.006550
            jump-fnv/32           id    687540 -> 0.007153
          jump-metro/32           id    687540 -> 0.008065
            jump-sip/32           id    687540 -> 0.006398
         jump-xxhash/32           id    687540 -> 0.005489
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/128.*id' results.txt

              sarama/128          id    687540 -> 0.013978
          jump-mauro/128          id    687540 -> 0.014216
            jump-fnv/128          id    687540 -> 0.012401
          jump-metro/128          id    687540 -> 0.012911
            jump-sip/128          id    687540 -> 0.013363
         jump-xxhash/128          id    687540 -> 0.014056
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/8.*rtm' results.txt

              sarama/8           rtm      3424 -> 0.030103
          jump-mauro/8           rtm      3424 -> 0.054707
            jump-fnv/8           rtm      3424 -> 0.024084
          jump-metro/8           rtm      3424 -> 0.051627
            jump-sip/8           rtm      3424 -> 0.032121
         jump-xxhash/8           rtm      3424 -> 0.031434

Original scaling guidance is 25-50k up to 250k-500k metrics per instance (per 8 shards), though in prod we see 1-2M per 8 shards.
So the test targets should be, for:
1 shard: 62.5k-250k
32 shards: 2-8M
128 shards: about 8-32M total

nameWithTagsBuffer.WriteString(t)
}

return nameWithTagsBuffer.Bytes()
Dieterbe (PR author):
@robert-milan do you think we can make this better?

Contributor:

We could use a pool, and also call Grow on the buffer before writing to it, but we would just be guessing at the size. This could decrease allocations.

I haven't followed the entire code path, but it looks like we are always passing in nil for the b []byte, so that doesn't help us at all. I think a pool makes the most sense.
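
A minimal sketch of that pooling idea (illustrative only; the helper name keyBySeriesWithTags and the exact key format are assumptions, not the PR's code):

package partitioner

import (
	"bytes"
	"sync"
)

// A pool of reusable buffers, so building the key doesn't allocate a fresh buffer each time.
var nameWithTagsBufferPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// keyBySeriesWithTags builds "name;tag1;tag2;..." using a pooled buffer. Grow is
// called with the exact final size here; with real inputs it may only be a guess.
// The result is copied out because the pooled buffer's bytes are reused after Put.
func keyBySeriesWithTags(name string, tags []string) []byte {
	buf := nameWithTagsBufferPool.Get().(*bytes.Buffer)
	defer nameWithTagsBufferPool.Put(buf)
	buf.Reset()

	size := len(name)
	for _, t := range tags {
		size += 1 + len(t)
	}
	buf.Grow(size)

	buf.WriteString(name)
	for _, t := range tags {
		buf.WriteByte(';')
		buf.WriteString(t)
	}

	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	return out
}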

@Dieterbe (PR author) commented Apr 18, 2019:

Right, we could use a pool, but that's a concern for the caller.
I should have been clearer; specifically, I wonder:

  1. whether you think it's silly to use a human-friendly ASCII separator like ; rather than 0x00 or something. I don't see any downside to using something friendly, though since it's only a single char we should probably use WriteRune instead of WriteString.
  2. whether there is anything to take into account w.r.t. the upcoming interning. To be precise, I don't want to end up in a situation where the output differs once we support interning, or where keeping the output the same makes the implementation suboptimal.

Contributor:

I might be missing something, as I have not reviewed the PR in great detail, but as I understand it this is only used for partitioning. My PR deals strictly with interning in the index. The only time I touch any MetricData is when I convert it into a MetricDefinition, so I don't think the two affect each other.

  1. I don't think it matters. Also, I would use WriteByte instead of WriteRune. Are we worried about needing an escape sequence (like when processing stream data) that we have to scan for or something? That would require a different answer.

  2. Since this only appears to operate on MetricData I don't think it will be an issue.

@Dieterbe (PR author):

I want to do a bit more analysis before drawing conclusions.

@dgryski commented Apr 12, 2019:

An even distribution is important, as is the "peak-to-mean" ratio, i.e. how wide your distribution is: what is the maximum number of elements mapped to a single shard vs. the mean? A small standard deviation will help with capacity planning.

As for choosing a hash function, metro and cespare's xxhash should both be sufficiently fast and give an equivalent distribution. Siphash will also give a good distribution but will be slower (assuming you're using siphash2-4). You'll get a speedup with siphash1-3, which should give the same distribution, but it will probably still end up slower than both metrohash and xxhash.

Edit: my siphash1-3 implementation: https://github.com/dgryski/go-sip13
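
For reference, a minimal sketch of the two numbers reported in the results in this thread (illustrative, not the actual test code in cluster/partitioner/partitioner_test.go, whose exact formulas may differ): the coefficient of variation of the per-partition counts, and the percent difference between the smallest and largest partition.

package partitioner

import "math"

// distributionStats summarizes how evenly metrics are spread across partitions:
// cov is the coefficient of variation (stddev/mean) of the per-partition counts,
// diffPct is the percent difference between the smallest and largest partition
// (it becomes +Inf when some partition receives nothing at all).
func distributionStats(counts []int) (cov, diffPct float64) {
	min, max := counts[0], counts[0]
	var sum float64
	for _, c := range counts {
		sum += float64(c)
		if c < min {
			min = c
		}
		if c > max {
			max = c
		}
	}
	mean := sum / float64(len(counts))

	var sq float64
	for _, c := range counts {
		d := float64(c) - mean
		sq += d * d
	}
	cov = math.Sqrt(sq/float64(len(counts))) / mean
	diffPct = 100 * float64(max-min) / float64(min)
	return cov, diffPct
}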

@Dieterbe (PR author):

I have added a "% diff between min and max"; the new results are in. I also removed the test for 1k metrics because it was basically useless.
Interestingly, sarama's partitioner definitely has a worst case, as can be seen here:

~/g/s/g/g/m/c/partitioner ❯❯❯ grep '32.*fake-true.*1000000\b'  results.txt
              sarama/32    fake-true   1000000 -> cov=1.109, diff=+Inf%
          jump-mauro/32    fake-true   1000000 -> cov=0.005, diff=2.21%
            jump-fnv/32    fake-true   1000000 -> cov=0.006, diff=2.25%
          jump-metro/32    fake-true   1000000 -> cov=0.005, diff=2.17%
            jump-sip/32    fake-true   1000000 -> cov=0.005, diff=2.40%
         jump-xxhash/32    fake-true   1000000 -> cov=0.004, diff=1.55%
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '32.*fake-true.*10000000\b'  results.txt
              sarama/32    fake-true  10000000 -> cov=1.107, diff=+Inf%
          jump-mauro/32    fake-true  10000000 -> cov=0.001, diff=0.55%
            jump-fnv/32    fake-true  10000000 -> cov=0.002, diff=0.59%
          jump-metro/32    fake-true  10000000 -> cov=0.002, diff=0.66%
            jump-sip/32    fake-true  10000000 -> cov=0.002, diff=0.90%
         jump-xxhash/32    fake-true  10000000 -> cov=0.002, diff=0.75%
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '128.*fake-true.*10000000\b'  results.txt
              sarama/128   fake-true  10000000 -> cov=1.107, diff=+Inf%
          jump-mauro/128   fake-true  10000000 -> cov=0.003, diff=1.46%
            jump-fnv/128   fake-true  10000000 -> cov=0.003, diff=1.70%
          jump-metro/128   fake-true  10000000 -> cov=0.004, diff=1.99%
            jump-sip/128   fake-true  10000000 -> cov=0.004, diff=1.93%
         jump-xxhash/128   fake-true  10000000 -> cov=0.003, diff=1.57%

In fact, all cases with an Inf percentage diff are the sarama fake-true cases.

Doing the previous tests again:

~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/32.*fng' results.txt
              sarama/32          fng   3024891 -> cov=0.003, diff=1.27%
          jump-mauro/32          fng   3024891 -> cov=0.004, diff=1.45%
            jump-fnv/32          fng   3024891 -> cov=0.004, diff=2.04%
          jump-metro/32          fng   3024891 -> cov=0.003, diff=1.16%
            jump-sip/32          fng   3024891 -> cov=0.003, diff=1.45%
         jump-xxhash/32          fng   3024891 -> cov=0.003, diff=1.55%
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/128.*fng' results.txt
              sarama/128         fng   3024891 -> cov=0.006, diff=2.97%
          jump-mauro/128         fng   3024891 -> cov=0.006, diff=3.83%
            jump-fnv/128         fng   3024891 -> cov=0.007, diff=3.96%
          jump-metro/128         fng   3024891 -> cov=0.007, diff=3.66%
            jump-sip/128         fng   3024891 -> cov=0.006, diff=3.08%
         jump-xxhash/128         fng   3024891 -> cov=0.006, diff=3.12%
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/32.*ops' results.txt
              sarama/32          ops   1009213 -> cov=0.007, diff=3.03%
          jump-mauro/32          ops   1009213 -> cov=0.006, diff=2.71%
            jump-fnv/32          ops   1009213 -> cov=0.006, diff=2.77%
          jump-metro/32          ops   1009213 -> cov=0.006, diff=3.18%
            jump-sip/32          ops   1009213 -> cov=0.005, diff=1.83%
         jump-xxhash/32          ops   1009213 -> cov=0.006, diff=2.63%
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/128.*ops' results.txt
              sarama/128         ops   1009213 -> cov=0.013, diff=6.39%
          jump-mauro/128         ops   1009213 -> cov=0.013, diff=6.31%
            jump-fnv/128         ops   1009213 -> cov=0.012, diff=7.46%
          jump-metro/128         ops   1009213 -> cov=0.011, diff=5.93%
            jump-sip/128         ops   1009213 -> cov=0.013, diff=6.47%
         jump-xxhash/128         ops   1009213 -> cov=0.014, diff=7.96%
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/32.*id' results.txt
              sarama/32           id    687540 -> cov=0.008, diff=2.68%
          jump-mauro/32           id    687540 -> cov=0.007, diff=3.18%
            jump-fnv/32           id    687540 -> cov=0.007, diff=3.11%
          jump-metro/32           id    687540 -> cov=0.008, diff=2.88%
            jump-sip/32           id    687540 -> cov=0.006, diff=2.68%
         jump-xxhash/32           id    687540 -> cov=0.005, diff=2.23%
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/128.*id' results.txt
              sarama/128          id    687540 -> cov=0.014, diff=8.13%
          jump-mauro/128          id    687540 -> cov=0.014, diff=6.73%
            jump-fnv/128          id    687540 -> cov=0.012, diff=6.63%
          jump-metro/128          id    687540 -> cov=0.013, diff=8.26%
            jump-sip/128          id    687540 -> cov=0.013, diff=7.45%
         jump-xxhash/128          id    687540 -> cov=0.014, diff=7.52%
~/g/s/g/g/m/c/partitioner ❯❯❯ grep '/8.*rtm' results.txt
              sarama/8           rtm      3424 -> cov=0.030, diff=9.73%
          jump-mauro/8           rtm      3424 -> cov=0.055, diff=17.14%
            jump-fnv/8           rtm      3424 -> cov=0.024, diff=8.43%
          jump-metro/8           rtm      3424 -> cov=0.052, diff=17.41%
            jump-sip/8           rtm      3424 -> cov=0.032, diff=9.51%
         jump-xxhash/8           rtm      3424 -> cov=0.031, diff=9.69%

It's hard to find meaningful differences between the implementations, especially between metro, sip and xxhash (as predicted).
I'm inclined to go with jump-xxhash because it's fast and the library seems popular; it's also used by other projects such as InfluxDB.

// partition by series: metrics are distributed across all metrictank instances
// to allow horizontal scalability
return m.KeyBySeries(b), nil
func (p jumpPartitionerMauro) RequiresConsistency() bool {
Member:

What is RequiresConsistency() for? Is this meant to be part of the "Partitioner" interface?

@Dieterbe (PR author):

Went with xxhash+jump; removed all the others.

  • Callers like tsdb-gw no longer have to set the key property, and can instead use ManualPartitioner and just obtain the partition themselves. (Actually, they could already do this before by calling Kafka.Partition(), but it used to instantiate a temporary sarama message for some reason. Not anymore.)

How does it look now?

"github.com/raintank/schema"
)

type Partitioner interface {
	Partition(schema.PartitionedMetric, int32) (int32, error)
	Partition(schema.PartitionedMetric, int32) int32
Contributor:

Is this interface still used anywhere? Same question for the tsdb code, btw; I can't see where it's used.

Dieterbe (PR author):

I was wondering the same thing and looked around a bit, and also didn't see it being used anywhere. I'll just remove it.


func (k *Kafka) Partition(m schema.PartitionedMetric, numPartitions int32) int32 {
	key := k.GetPartitionKey(m, nil)
	return (jumpPartitioner{}).PartitionKey(key, numPartitions)
@replay (Contributor) commented Apr 22, 2019:

It seems like this is the only place where the jumpPartitioner is ever used, so what's the remaining benefit of making it satisfy the sarama.Partitioner interface? Because it currently implements that interface, we need to first instantiate it and then call a method on that instance, which seems unnecessary if it's never used as a sarama partitioner. So we might as well just have a standalone function like partitionKeyWithJumpHash([]byte, int32) int32 instead of instantiating this struct and then calling the method on it.

Dieterbe (PR author):

Hmm, yes. This and some other things here are rather confusing.
I'll see if I can refactor it.

@Dieterbe (PR author):

@replay @woodsaj how does it look now?
I kept the interface, but now it is actually used.

@woodsaj (Member) left a comment:

This seems like a huge mess. You can't add a new implementation of an interface "Partitioner" and then add comments to instruct users not to use certain methods because they will get unexpected and incorrect results. I don't even understand how you could think this is a good idea.

I am all for moving to xxhash and jump, but to do that we need to completely refactor how we do partitioning.

I suggest we just get rid of the metrictank/cluster/partitioner package and move everything into raintank/schema.

The "schema.PartitionedMetric" interface should just be updated to something like:

type PartitionByMethod uint8
const (
	PartitionByOrg PartitionByMethod = iota
	PartitionBySeries
	PartitionBySeriesWithTags
)
type PartitionedMetric interface {
	Validate() error
	SetId()
	GetPartitionID(method PartitionByMethod, partitions int32) int32
}
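
A rough sketch of how such a GetPartitionID could dispatch on the method, using the xxhash+jump combination chosen above (illustrative only; the field choices, the ';' separator and the lack of error handling are assumptions, and the implementation that eventually lands in raintank/schema may differ):

package main

import (
	"encoding/binary"
	"fmt"
	"strings"

	"github.com/cespare/xxhash/v2"
	jump "github.com/dgryski/go-jump"
)

type PartitionByMethod uint8

const (
	PartitionByOrg PartitionByMethod = iota
	PartitionBySeries
	PartitionBySeriesWithTags
)

// getPartitionID picks the partition key according to the method, then maps it
// to a partition with xxhash + jump.
func getPartitionID(method PartitionByMethod, orgID uint32, name string, tags []string, partitions int32) int32 {
	var key []byte
	switch method {
	case PartitionByOrg:
		var b [4]byte
		binary.LittleEndian.PutUint32(b[:], orgID)
		key = b[:]
	case PartitionBySeries:
		key = []byte(name)
	case PartitionBySeriesWithTags:
		key = []byte(name + ";" + strings.Join(tags, ";"))
	}
	return jump.Hash(xxhash.Sum64(key), int(partitions))
}

func main() {
	fmt.Println(getPartitionID(PartitionBySeriesWithTags, 1, "some.metric", []string{"dc=us-east", "host=a1"}, 32))
}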

@Dieterbe (PR author):

The awoods-style partitioning is now merged in schema: raintank/schema#26.
I will create a new PR to replace this one.

@replay (Contributor) commented Jul 10, 2019:

This one looks good to me. But if you want to replace it, that's fine too.

@robert-milan (Contributor):

Closed in favor of #1427

@Dieterbe (PR author):

Note:
In hindsight, it turns out there was a bug in BenchmarkJumpPartitionerFnv: it was allocating on each iteration due to the constructor.
With this fix (fix.txt) the benchmark becomes:

taskset --cpu-list 1,2 go test . -run='^$' -test.benchmem -bench .
goos: linux
goarch: amd64
pkg: github.com/grafana/metrictank/cluster/partitioner
BenchmarkPartitionerSarama-2       	 5727127	       206 ns/op	       0 B/op	       0 allocs/op
BenchmarkJumpPartitionerMauro-2    	 4032470	       284 ns/op	     210 B/op	       1 allocs/op
BenchmarkJumpPartitionerFnv-2      	 4189651	       287 ns/op	       0 B/op	       0 allocs/op
BenchmarkJumpPartitionerMetro-2    	11723461	       103 ns/op	       0 B/op	       0 allocs/op
BenchmarkJumpPartitionerSip-2      	 8235633	       146 ns/op	       0 B/op	       0 allocs/op
BenchmarkJumpPartitionerXxhash-2   	12077864	        99.7 ns/op	       0 B/op	       0 allocs/op
PASS
ok  	github.com/grafana/metrictank/cluster/partitioner	8.389s

BenchmarkJumpPartitionerFnv is now faster and no longer allocates, but the conclusion remains the same (xxhash still leads).
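
For illustration, the shape of that fix as a sketch (assuming the constructor is what gets hoisted; the actual change is in the attached fix.txt, and this stand-in type is not the PR's code): construction happens once before the timed loop, so the per-iteration numbers measure only the partitioning itself.

package partitioner

import (
	"hash/fnv"
	"testing"

	jump "github.com/dgryski/go-jump"
)

// fnvJumpPartitioner is an illustrative stand-in for the FNV-based jump partitioner.
type fnvJumpPartitioner struct{}

func newFnvJumpPartitioner() fnvJumpPartitioner { return fnvJumpPartitioner{} }

func (fnvJumpPartitioner) PartitionKey(key []byte, numPartitions int32) int32 {
	h := fnv.New64a()
	h.Write(key)
	return jump.Hash(h.Sum64(), int(numPartitions))
}

var sink int32 // assigned to, so the compiler can't optimize the benchmarked call away

func BenchmarkJumpPartitionerFnv(b *testing.B) {
	p := newFnvJumpPartitioner() // constructed once, outside the timed loop
	key := []byte("some.metric.name;dc=us-east;host=a1")
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sink = p.PartitionKey(key, 32)
	}
}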

Successfully merging this pull request may close these issues:
  • partitioning metrics by nameWithTags (aka shard by tag)

5 participants