partitionBy bySeriesWithTags (aka "shard by tag") #1282
Conversation
nameWithTagsBuffer.WriteString(t)
}
return nameWithTagsBuffer.Bytes()
@robert-milan do you think we can make this better?
We could use a pool, and also call Grow on the buffer before writing to it, but we would just be guessing at the size. This could decrease allocations.
I haven't followed the entire code path, but it looks like we are always passing in nil for the b []byte, so that doesn't help us at all. I think a pool makes the most sense.
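A minimal sketch of the pooling idea being discussed, assuming a hypothetical helper `keyWithTags` (the names and the `;` separator follow this thread, but this is not the PR's actual code):

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses buffers across key computations to cut allocations,
// as suggested above. Grow could additionally be called with a size
// estimate, at the cost of guessing.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// keyWithTags is a hypothetical helper: it builds "name;tag1;tag2..."
// in a pooled buffer and returns a copy that the caller owns.
func keyWithTags(name string, tags []string) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	buf.WriteString(name)
	for _, t := range tags {
		buf.WriteByte(';')
		buf.WriteString(t)
	}
	// copy out: the buffer goes back to the pool and may be reused
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	bufPool.Put(buf)
	return out
}

func main() {
	fmt.Println(string(keyWithTags("cpu.usage", []string{"dc=us", "host=a"})))
}
```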
right, we could use a pool but that's a concern for the caller.
I should have been clearer, I specifically wonder:
- if you think it's silly to use a human-friendly ASCII separator like ; and not 0x00 or something. I don't see any downside to using something friendly. Though I will say, since it's only a single char, we should probably use WriteRune instead of WriteString
- is there anything there to take into account wrt the upcoming interning? To be precise, I don't want to be in a situation where the output is different once we support interning, or where keeping the output the same makes the implementation suboptimal
I might be missing something, as I have not reviewed the PR in great detail, but as I understand it this is only used for partitioning. My PR deals strictly with interning in the index. The only time I touch any MetricData is when I convert it into a MetricDefinition, so I don't think the two affect each other.
- I don't think it matters. Also, I would use WriteByte instead of WriteRune. Are we worried about needing an escape sequence (like when processing stream data) that we have to scan for or something? That would require a different answer.
- Since this only appears to operate on MetricData, I don't think it will be an issue.
I want to do a bit more analysis before drawing conclusions.
An even distribution is important, as is the "peak-to-mean" ratio, i.e. how wide your distribution is: what is the maximum number of elements mapped to a single shard vs. the mean. A small standard deviation will help with capacity planning. As for choosing a hash function, metro and cespare's xxhash should both be sufficiently fast and give an equivalent distribution. Siphash will also give a good distribution but will be slower (assuming you're using siphash2-4). You'll get a speedup with siphash1-3, which should give the same distribution but will probably still end up being slower than both metrohash and xxhash. Edit: my siphash1-3 implementation: https://github.com/dgryski/go-sip13
I have added a "% diff between min and max".
All cases with an Inf percentage diff are in fact the sarama fake-true case. Doing the previous tests again: it's hard to find meaningful differences between the implementations, especially between metro, sip and xxhash (as predicted).
cluster/partitioner/partitioner.go
Outdated
// partition by series: metrics are distributed across all metrictank instances
// to allow horizontal scalability
return m.KeyBySeries(b), nil
func (p jumpPartitionerMauro) RequiresConsistency() bool {
what is RequiresConsistency() for? is this meant to be part of the "Partitioner" interface?
it is part of the sarama.Partitioner interface.
https://github.com/Shopify/sarama/blob/a74fb772b5cfe121956313c4f6596e372e83cd27/partitioner.go#L17-L22
this way, runtime Partition() calls don't need to check the type every time, we can remove the error checking, and we avoid having to construct a whole sarama message.
went with xxhash+jump. removed all others.
how does it look now?
"github.com/raintank/schema"
)

type Partitioner interface {
Partition(schema.PartitionedMetric, int32) (int32, error)
Partition(schema.PartitionedMetric, int32) int32
Is this interface still used anywhere? same question in the tsdb code btw. can't see where it's used.
i was wondering the same thing and looked around a bit, and also didn't see it being used anywhere. i'll just remove it.
cluster/partitioner/partitioner.go
Outdated
func (k *Kafka) Partition(m schema.PartitionedMetric, numPartitions int32) int32 {
key := k.GetPartitionKey(m, nil)
return (jumpPartitioner{}).PartitionKey(key, numPartitions)
it seems like this is the only place where the jumpPartitioner is ever used. So what's still the benefit of making it satisfy the sarama.Partitioner interface? Because it currently implements that interface, we need to first instantiate it and then call a method on that instance, which seems unnecessary if it's never used as a sarama partitioner. So we might as well just have a standalone function like partitionKeyWithJumpHash([]byte, int32) int32 instead of instantiating this struct and then calling the method on it.
hmm yes. this and some other things here are rather confusing.
i'll see if i can refactor it.
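A standalone function along the lines suggested above could look like this. The inner loop is Lamport and Veach's jump consistent hash; FNV-1a from the stdlib stands in for xxhash here only to keep the sketch free of third-party imports, so this is an illustration of the shape, not the PR's code:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// jump implements the jump consistent hash: it maps a 64-bit key to a
// bucket in [0, numBuckets), moving only ~1/n of keys when a bucket
// is added.
func jump(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

// partitionKeyWithJumpHash has the standalone signature suggested
// above; real code would hash the key with xxhash.Sum64 instead of FNV.
func partitionKeyWithJumpHash(key []byte, numPartitions int32) int32 {
	h := fnv.New64a()
	h.Write(key)
	return jump(h.Sum64(), numPartitions)
}

func main() {
	fmt.Println(partitionKeyWithJumpHash([]byte("some.series;host=a"), 32))
}
```

No struct to instantiate, no unused interface methods: the caller just passes the partition key bytes and the partition count.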
This seems like a huge mess. You can't add a new implementation of an interface "Partitioner" and then add comments instructing users not to use certain methods because they will get unexpected and incorrect results. I don't even understand how you could think this is a good idea.
I am all for moving to xxhash and jump, but to do that we need to completely refactor how we do partitioning.
I suggest we just get rid of the metrictank/cluster/partitioner package and move everything into raintank/schema.
The "schema.PartitionedMetric" interface should just be updated to something like
type PartitionByMethod uint8
const (
PartitionByOrg PartitionByMethod = iota
PartitionBySeries
PartitionBySeriesWithTags
)
type PartitionedMetric interface {
Validate() error
SetId()
GetPartitionID(method PartitionByMethod, partitions int32) int32
}
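To illustrate the proposal, a hypothetical metric type could satisfy the suggested GetPartitionID method like this (only that method is sketched; Validate/SetId are omitted, and FNV plus a plain modulo stand in for xxhash+jump to keep it self-contained):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type PartitionByMethod uint8

const (
	PartitionByOrg PartitionByMethod = iota
	PartitionBySeries
	PartitionBySeriesWithTags
)

// metric is a hypothetical stand-in for something like MetricData.
type metric struct {
	OrgID int
	Name  string
	Tags  []string
}

// GetPartitionID picks the key material based on the method, then
// hashes it into [0, partitions). Real code would use xxhash+jump
// instead of FNV+modulo.
func (m *metric) GetPartitionID(method PartitionByMethod, partitions int32) int32 {
	h := fnv.New64a()
	switch method {
	case PartitionByOrg:
		fmt.Fprintf(h, "%d", m.OrgID)
	case PartitionBySeries:
		h.Write([]byte(m.Name))
	case PartitionBySeriesWithTags:
		h.Write([]byte(m.Name))
		for _, t := range m.Tags {
			h.Write([]byte{';'})
			h.Write([]byte(t))
		}
	}
	return int32(h.Sum64() % uint64(partitions))
}

func main() {
	m := &metric{OrgID: 1, Name: "cpu.usage", Tags: []string{"host=a"}}
	fmt.Println(m.GetPartitionID(PartitionBySeriesWithTags, 32))
}
```

This keeps the method selection with the metric itself, so callers never touch a partitioner struct with half-usable methods.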
the awoods-style partitioning is now merged in schema.
This one looks good to me. But if you want to replace it that's fine too.
Closed in favor of #1427
Note: BenchmarkJumpPartitionerFnv is now faster and no longer allocates, but the conclusion would still be the same (xxhash still leads).
fix #1123
This PR is a reworked version of #1146
It intends to be a more well-founded approach to implementing bySeriesWithTags. Making changes to our sharding scheme is not a trivial operation (lots of overhead to deploy), so I want to make sure we get it right.
This PR tells a story: If you look at the commits one by one, you'll see I've added a bunch of different possible implementations for the actual sharding, and a testing framework to validate the different implementations.
Once we agree that the tests are conclusive and we pick a winner, we can simply remove all other implementations.
Criteria (somewhat in order of importance):
In particular, I want to answer questions such as:
As far as analysis goes, I tested with fakemetrics and 4 datasets that I pulled from some of our HM instances - with permission.
Of course I can't share the contents or name names here, though ops is our internal monitoring instance.
Now, there's obviously lots of output, some more relevant than others (e.g. no one should run large amounts of metrics on a small number of partitions). I have mentioned some tips in cluster/partitioner/partitioner_test.go
So I have just grepped for the dataset-partitioncount combinations that seem most relevant (r means grep).
original scaling is 25-50k up to 250k-500k per instance (per 8 shards)
though in prod we see 1-2M per 8 shards
so test targets should be, for
1 shard: 62.5k-250k
32 shards: 2-8M
128 shards: about 8-32M total