[WIP] add partitioner keyBySeriesWithTags #1146
Conversation
I've done testing like this:
After discussion with Dieter I'm now going to add jump consistent hashing to the new partitioner, because that way we should hopefully be able to reach a more equal distribution among the partitions. I'm going to test that too once it's done.
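For reference, jump consistent hash (from the Lamping & Veach paper "A Fast, Minimal Memory, Consistent Hash Algorithm") is only a few lines; this is a direct Go transcription of the paper's published code, with the bucket count playing the role of the partition count:

```go
package main

import "fmt"

// jumpHash maps a 64-bit key to a bucket in [0, numBuckets),
// following the Lamping & Veach jump consistent hash algorithm.
func jumpHash(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

func main() {
	// e.g. spread a 64-bit key over 8 partitions
	fmt.Println(jumpHash(0xdeadbeef, 8))
}
```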
@Dieterbe I've been reading more through how the partitioning code works: currently our "Partitioner" in MT doesn't actually do the partitioning, it only generates a key which is then used as input for Sarama's default partitioner. The Sarama default partitioner is using an FNV-1a hash of that key, modulo the partition count.
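In other words, roughly this (a sketch of what sarama's default hash partitioner does with a keyed message, ignoring details like the random fallback for keyless messages):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashPartition mimics sarama's default (hash) partitioner for keyed
// messages: FNV-1a over the key bytes, modulo the partition count.
func hashPartition(key []byte, numPartitions int32) int32 {
	h := fnv.New32a()
	h.Write(key)
	p := int32(h.Sum32()) % numPartitions
	if p < 0 {
		p = -p
	}
	return p
}

func main() {
	fmt.Println(hashPartition([]byte("some.metric.name"), 8))
}
```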
We could use our partitioner everywhere by simply changing the sarama config to use the ManualPartitioner and setting the partition ID in the kafka message. This is what notifierKafka does, as it already knows the partition ID to use. We actually want to do this anyway, as it means we then don't need to include the key field in the kafka messages, which will save a lot of bytes.
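A sketch of what that could look like (sarama's NewManualPartitioner respects ProducerMessage.Partition; the brokers, topic, partitionID, and payload here are placeholders supplied by the caller):

```go
package main

import "github.com/Shopify/sarama"

// publish sends a payload to an explicitly chosen partition,
// so no Key field is needed on the message at all.
func publish(brokers []string, topic string, partitionID int32, payload []byte) error {
	config := sarama.NewConfig()
	config.Producer.Return.Successes = true // required by SyncProducer
	// let us pick partitions ourselves instead of hashing a key
	config.Producer.Partitioner = sarama.NewManualPartitioner

	producer, err := sarama.NewSyncProducer(brokers, config)
	if err != nil {
		return err
	}
	defer producer.Close()

	_, _, err = producer.SendMessage(&sarama.ProducerMessage{
		Topic:     topic,
		Partition: partitionID, // computed by our own partitioner
		Value:     sarama.ByteEncoder(payload),
	})
	return err
}
```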
Before switching to jumphash let's see if we observe uneven distributions. It's easy to check, both in docker-compose and in prod instances or our ops instance, via mt-index-cat.
If we change the partitioner to
I just found something interesting when I wanted to test the distribution: I ran fakemetrics with tags, generating 10k series, feeding into a tsdb with
Note that the 4 counts assigned to odd partition IDs add up to exactly 10k, so I'm assuming those are the ones generated by fakemetrics. I was confused about why all the partitions assigned to the fakemetrics metrics are odd numbers, so I checked how the distribution looks in our ops cluster. There the distribution looks better:
I still can't explain why all the fakemetrics-generated metrics end up with odd partition IDs, but I can reproduce it. This does the same as sarama's default partitioner and it results in the same numbers. When playing with the metric names in that play.golang link it's obvious that patterns emerge, e.g. if the tags get removed and only this metric format is used
I don't know why this is happening, but it looks like under certain circumstances the distribution of the default partitioner is not great.
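For anyone who wants to poke at this themselves, a minimal sketch of the experiment (the metric name pattern here is a guess; fakemetrics' actual format may differ):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func main() {
	counts := make([]int, 8) // assuming 8 partitions
	for i := 0; i < 10000; i++ {
		// hypothetical name pattern, standing in for what fakemetrics generates
		name := fmt.Sprintf("some.id.of.a.metric.%d;host=host%d", i, i)
		h := fnv.New32a()
		h.Write([]byte(name))
		p := int32(h.Sum32()) % 8
		if p < 0 {
			p = -p
		}
		counts[p]++
	}
	fmt.Println(counts) // per-partition series counts
}
```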
That is exactly what https://github.com/grafana/metrictank/blob/master/cluster/partitioner/partitioner.go#L19 gives you. github.com/grafana/metrictank/cluster/partitioner.Kafka actually uses the sarama code directly.
I'm doing another round of tests. There are mainly two things that I need to check:
Regarding point
Regarding point
If this PR gets accepted I'll create a corresponding PR to github.com/raintank/schema before merging, so we can keep everything in sync.
@@ -81,6 +85,23 @@ func (m *MetricData) KeyBySeries(b []byte) []byte {
	return b
}

func (m *MetricData) KeyBySeriesWithTags(b []byte) []byte {
	nameWithTagsBuffer := bytes.NewBuffer(b[:0])
it shouldn't be up to this function to modify pre-existing data in the slice.
if you want to discard previous data (we generally do), it's up to the caller to reset len back to 0.
this function should just append data behind pre-existing data, like the other functions (above) do.
for some examples of this practice in the standard library, see
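(presumably the standard library's Append* family is meant here; e.g. strconv.AppendInt and time.Time.AppendFormat follow exactly that convention:)

```go
b := make([]byte, 0, 64)
b = strconv.AppendInt(b, 42, 10)             // appends "42" after existing data
b = append(b, ' ')
b = time.Now().AppendFormat(b, time.RFC3339) // same convention: append, don't overwrite
```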
i just checked all the call sites, they already honor this pattern (by just passing in nil slices)
"testing" | ||
) | ||
|
||
func TestNameWithTagFromMetricData(t *testing.T) { |
i think we are probably doing a bit too much (in general, not just here) testing of particular implementations, rather than behavior.
that said i don't have a concrete suggestion, so ...
Usually my reason to add tests like this one is just to verify that my implementation actually works. They wouldn't necessarily need to be committed, but if I don't commit them then somebody who wants to modify the implementation would have to test it again independently.
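e.g. something along these lines (hypothetical: the MetricData fields and the `name;tag1=val1;...` output format are assumed from the surrounding PR, not confirmed):

```go
func TestKeyBySeriesWithTags(t *testing.T) {
	md := MetricData{
		Name: "my.metric",
		Tags: []string{"dc=east", "host=a"},
	}
	got := string(md.KeyBySeriesWithTags(nil))
	want := "my.metric;dc=east;host=a" // assumed name-with-tags format
	if got != want {
		t.Fatalf("got %q, want %q", got, want)
	}
}
```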
func (m *MetricDefinition) KeyBySeriesWithTags(b []byte) []byte {
	return []byte(m.NameWithTags())
}
probably better to follow the same style as KeyBySeries and use the given input slice.
note that `b = append(b, []byte(foo)...)` has recently been optimized. It doesn't allocate anymore:
https://go-review.googlesource.com/c/go/+/109517
here's the difference in action:
~/g/s/g/D/s/append ❯❯❯ cat append_test.go
package main

import "testing"

var a []byte
var nameWithTags = "hello"

func NameWithTags() string {
	return nameWithTags
}

func Benchmark1(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a = []byte(NameWithTags())
	}
	b.Log(len(a))
}

func Benchmark2(b *testing.B) {
	for i := 0; i < b.N; i++ {
		a = a[:0]
		a = append(a, []byte(NameWithTags())...)
	}
	b.Log(len(a))
}
~/g/s/g/D/s/append ❯❯❯ go test -bench . -benchmem -v .
goos: linux
goarch: amd64
pkg: github.com/Dieterbe/sandbox/append
Benchmark1-8 50000000 29.1 ns/op 8 B/op 1 allocs/op
--- BENCH: Benchmark1-8
append_test.go:16: 5
append_test.go:16: 5
append_test.go:16: 5
append_test.go:16: 5
append_test.go:16: 5
Benchmark2-8 100000000 13.1 ns/op 0 B/op 0 allocs/op
--- BENCH: Benchmark2-8
append_test.go:24: 5
append_test.go:24: 5
append_test.go:24: 5
append_test.go:24: 5
append_test.go:24: 5
PASS
ok github.com/Dieterbe/sandbox/append 2.811s
of course, since currently our callers don't actually recycle slices, we won't see a real benefit, but i'd rather apply the best practice here now so we can reap the benefit when we do start recycling.
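so the method could look like this (a sketch following the suggestion; appending a string to a []byte with `...` is valid Go and skips the intermediate conversion):

```go
func (m *MetricDefinition) KeyBySeriesWithTags(b []byte) []byte {
	return append(b, m.NameWithTags()...)
}
```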
it's worth pondering whether it makes sense to just use the Id field as the key. the question boils down to: how useful is it to keep series hashed to the same partition when only their interval / mtype / unit changes? I think it may be useful in the future, if we ever get to the point of introducing optimizations (e.g. merging series together when the unit changes) which would benefit from the data locality. and also we need them together for pruning to work properly.
cluster/partitioner/partitioner.go
case "bySeriesWithTags": | ||
return &Kafka{ | ||
PartitionBy: partitionBy, | ||
Partitioner: &jumpPartitioner{}, |
can we please stop using pointers when there's no need for them.
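i.e. something like this (assuming jumpPartitioner carries no state and its methods use value receivers):

```go
case "bySeriesWithTags":
	return &Kafka{
		PartitionBy: partitionBy,
		Partitioner: jumpPartitioner{}, // value, not pointer: nothing to share
```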
If the id field is used, changing the interval of a series would no longer work properly, as the metricDefs for a single series name could end up on different mt instances. I think queries might still work, but the loading/pruning would break, because we need to know all of the metricDefs for a series name to decide whether a single metricDef is too old or not.
force-pushed from 0a83689 to f8f5623
// so we keep adding slices of 8 bytes to the jump key as uint64 values
// if the result wraps around the uint64 boundary it doesn't matter because
// this only needs to be fast and consistent
key = append(key, make([]byte, 8-len(key)%8)...)
we can probably do this logic without having to allocate here.
I think i saw something similar in the bstream code. i'll give it a shot
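one way to do it without the padding allocation (a sketch, assuming the goal is just a fast, consistent 64-bit reduction of the key as the comment above describes):

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// foldKey sums the key as little-endian uint64 chunks; the trailing short
// chunk is assembled manually, so no padding allocation is needed.
// overflow wrap-around is fine: this only needs to be fast and consistent.
func foldKey(key []byte) uint64 {
	var sum uint64
	for len(key) >= 8 {
		sum += binary.LittleEndian.Uint64(key)
		key = key[8:]
	}
	var last uint64
	for i, c := range key {
		last |= uint64(c) << (8 * uint(i))
	}
	return sum + last
}

func main() {
	fmt.Println(foldKey([]byte("some.metric;host=a")))
}
```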
force-pushed from eee44ce to 60d4c03
This reverts commit 2c7af35. This was too hastily merged. See grafana/metrictank#1146 for continuation
closing in favor of #1282
This is for: #1123
Still doing some testing