Consistent hashing in addition to hashmod #4309
Comments
Could you elaborate on the benefit that consistent hashing would bring in the case of a mod-sharded Prometheus? Normally consistent hashing for a distributed k/v store solves the problem of not having to move around so many keys' values to other nodes upon resharding, or for caches (where values are not actively moved around) it helps to keep more cache entries valid. But we neither move around data (series) between mod-sharded Prometheus servers, nor do we query them in a cache-like fashion (where it's ok if data is missing), so we always have to query all servers of a mod-sharded group anyway. So the only problem with the current hashing I readily see is the series churn strain on each Prometheus's TSDB. That is, upon resharding with the current hashmod approach, most targets land on a different server, so most of their series have to be created afresh in that server's TSDB.
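To make the churn point above concrete, here is a rough, self-contained Go sketch (not Prometheus code; the target names, the FNV hash, and the toy ring are all made up for illustration) that counts how many targets change shard when going from 3 to 4 shards under plain mod-N hashing versus a simple consistent-hash ring:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hash64 returns a stable 64-bit hash of s.
func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// modShard is plain mod-N sharding, conceptually what hashmod plus a keep rule does.
func modShard(target string, shards int) int {
	return int(hash64(target) % uint64(shards))
}

// ring is a minimal consistent-hash ring with virtual nodes per shard.
type ring struct {
	points []uint64       // sorted positions of all virtual nodes
	owner  map[uint64]int // position -> shard index
}

func newRing(shards, vnodes int) *ring {
	r := &ring{owner: map[uint64]int{}}
	for s := 0; s < shards; s++ {
		for v := 0; v < vnodes; v++ {
			p := hash64(fmt.Sprintf("shard-%d-vnode-%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// shard returns the shard owning the first virtual node clockwise of the target's hash.
func (r *ring) shard(target string) int {
	h := hash64(target)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0 // wrap around the ring
	}
	return r.owner[r.points[i]]
}

func main() {
	var targets []string
	for i := 0; i < 10000; i++ {
		targets = append(targets, fmt.Sprintf("target-%d:9090", i))
	}

	modMoved, ringMoved := 0, 0
	before, after := newRing(3, 128), newRing(4, 128)
	for _, t := range targets {
		if modShard(t, 3) != modShard(t, 4) {
			modMoved++
		}
		if before.shard(t) != after.shard(t) {
			ringMoved++
		}
	}
	// Expect roughly 3/4 of targets to move under mod-N, but only about 1/4 on the ring.
	fmt.Printf("mod-N: %d of %d targets moved\n", modMoved, len(targets))
	fmt.Printf("ring:  %d of %d targets moved\n", ringMoved, len(targets))
}
```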
We already have a unique identification of targets; this would be the key, and we want to figure out whether the key belongs to the “current” Prometheus server. It’s about targets that move when the number of shards changes, and therefore their respective time-series, which among other things affects alerting. I’m not intending to move any data around. And as you mentioned, new time-series have a big cost in TSDB, so we want to keep this event to a minimum where we can. As I mentioned, with bounded loads we may be able to build something like bin packing without an external system performing any scheduling (but also, as I mentioned, my knowledge in this area currently consists of reading a couple of papers and blog posts).
Yep, I see the TSDB churn concern, but I'm curious whether there are any other valid benefits of adding consistent hashing. I'm not sure about the alerting example, although that's an interesting point. While non-consistent hashing will lose continuity (and accumulated alert "for" state) for more targets than consistent hashing would, either way you have to deal with some of that loss on every resharding.
I agree with Julius. When you change the hashing, basically everything will break. Using consistent hashing doesn't change that.
IIUC, hashmod exists so that when deploying more than one Prometheus instance, the same targets will always be scraped by the same Prometheus instance, and you can scale horizontally without worrying about missed or duplicated scrape targets. The part that explains hashmod…
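For reference, a minimal Go sketch of that keep/drop decision (a simplification; Prometheus's hashmod relabel action derives the hash from the configured source labels, and the exact hash reduction may differ from what is shown here):

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// shardFor hashes a target's identity (here simply its address) onto one of
// numShards shards, roughly what hashmod computes during relabeling.
func shardFor(target string, numShards uint64) uint64 {
	sum := md5.Sum([]byte(target))
	return binary.BigEndian.Uint64(sum[8:]) % numShards
}

func main() {
	const numShards = 3
	const myShard = 1 // the shard index this Prometheus instance keeps

	targets := []string{"app-0:9090", "app-1:9090", "db-0:9100", "cache-0:9121"}
	for _, t := range targets {
		// A "keep" relabel rule on the hashmod result has the same effect:
		// every target is scraped by exactly one of the N instances.
		if shardFor(t, numShards) == myShard {
			fmt.Println("scraped by this instance:", t)
		}
	}
}
```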
domgreen commented Jun 25, 2018
The use case that I have is that I am collecting metrics in a multi-tenant environment, where I would like to ensure that when scaling horizontally I keep as many of a tenant's targets as possible on the same scrape instance, and therefore, on the read path, minimise the number of Prometheus instances that need to be queried. 1/n targets moving is much better in this scenario than all n targets moving to a new scrape instance.
@domgreen so you are recording all sharding changes over the retention period and then computing the same hash in a multitenant gateway as Prometheus does, so you can figure out which instances to hit for what data (different ones depending on what sharding events happened over the query time range)?
domgreen commented Jun 25, 2018
I'm not currently recording changes to sharding; at the moment we shard in a sidecar and use file_sd_config to keep tenants together, but we are looking to extend this in a more generic way, either via Thanos or Prometheus. We have many tenants whose targets can change, so for aggregation and alerting we want to keep all of a tenant's data on as few scrapers as possible. I was hoping to build this into one of those two.
@domgreen But the only way you can get all the data is to never reshard at all, always ask all shards, or explicitly track all reshardings. I'm not sure it's in Thanos's roadmap to track reshardings, and without that you are always stuck with either incomplete data (although it'll be less incomplete with consistent hashing) or querying everyone.
domgreen commented Jun 25, 2018
@juliusv I totally agree with you; I was hoping that we could minimise the number of extra nodes we had to query to get the complete picture for a tenant.
So to summarize, IMO there are two possible legitimate benefits of consistent hashing so far:
1. Less series churn in each scraping Prometheus's local TSDB when the number of shards changes.
2. On the read path, needing to query fewer instances for a given tenant's data, accepting that the data may still be incomplete around a resharding.
So far I don't see the second point happening in practice yet, but I'm happy to be proven wrong. That leaves the local TSDB churn case. FWIW I don't have a strong opinion on whether that warrants adding a new hashing option, but just want people to be clear on what the actual benefits are.
In the expected use case for this I'd expect users to throw away all the data in the scraping TSDBs when they reshard, as it's fairly complicated to get this right and any data you care about would already be aggregated up on the master. I don't get the impression that this is a very strong argument, particularly as we're talking about adding yet another relabelling action, very similar to an existing one, to satisfy a very niche use case where there's already quite a bit of custom code involved. If you want that level of control over how targets are allocated in a horizontal sharding setup, you could always use file_sd.
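As an illustration of that last suggestion, here is a hedged Go sketch (the file names, the shard label, and the trivial hash-based assignment are all made up; any assignment policy, such as keeping a tenant's targets together, could be plugged in) that writes one file_sd JSON file per shard, which each Prometheus instance would then consume via its own file_sd_config:

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
	"os"
)

// fileSDGroup mirrors the JSON format read by file_sd_config: a list of
// target groups, each with targets and optional labels.
type fileSDGroup struct {
	Targets []string          `json:"targets"`
	Labels  map[string]string `json:"labels,omitempty"`
}

// assign decides which shard owns a target; swap in whatever policy you need.
func assign(target string, shards int) int {
	h := fnv.New32a()
	h.Write([]byte(target))
	return int(h.Sum32() % uint32(shards))
}

func main() {
	const shards = 3
	targets := []string{"tenant-a-app:9090", "tenant-a-db:9100", "tenant-b-app:9090"}

	perShard := make([][]string, shards)
	for _, t := range targets {
		s := assign(t, shards)
		perShard[s] = append(perShard[s], t)
	}

	for s, ts := range perShard {
		out, err := json.MarshalIndent([]fileSDGroup{{
			Targets: ts,
			Labels:  map[string]string{"shard": fmt.Sprint(s)},
		}}, "", "  ")
		if err != nil {
			panic(err)
		}
		// Each Prometheus instance points its file_sd_config at its own file.
		if err := os.WriteFile(fmt.Sprintf("shard-%d.json", s), out, 0o644); err != nil {
			panic(err)
		}
	}
}
```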
I think this discussion really diverged from the main argument for this proposal. Context: improbable-eng/thanos#387
IMO keeping tenant stuff together is not the main issue here, just somewhat connected; the main issue is how to dynamically change the number of shards up and down without too much time-series churn and without disruption. We just think that keeping tenant stuff together might be useful for some use cases where we know that a huge number of targets (and the series for them) go up and down together for each tenant. (For us, a single tenant spins up huge workloads that consist of several isolated VMs and services that need to be monitored.) So IMO we need to clarify that we need to be able to:
1. dynamically scale the number of scraping Prometheus instances up and down with as little time-series churn and disruption as possible;
2. ideally keep each tenant's targets on as few instances as possible.
We would love your feedback on how to meet these use cases. Thanks for the feedback so far!
I think what's being discussed in improbable-eng/thanos#387 is pretty far outside the scope of Prometheus, as it's full-on building a clustered monitoring system. I think any features we add should make sense in terms of a standalone Prometheus setup, both for sanity and so we don't end up sleepwalking into either being a clustered system or having all the maintenance pains of being one.
Consistent hashing doesn't help you a lot there; this is quite a bit more involved than that, as e.g. you still have hot shards and need to worry about resharding for queries. How to actually build a transparently sharded distributed storage system like this is, I believe, more @tomwilkie's area.
Thanks for your input Brian - totally agree with you. @tomwilkie, if you have a moment to look at our proposed way of solving our goal, your feedback would be awesome! I am pretty sure we are not the only ones with such a use case.
In Cortex we use consistent hashing to solve this problem; but we have also decoupled scraping and storage, so scaling scraping with hashmod is more than good enough. And when we do change sharding, we move data around**.
** this isn't strictly implemented right now: when we scale, only 1/n of the data will need moving, but as all queries have to hit all nodes, not moving the data isn't strictly necessary - it will get marked as stale and flushed in due course.
I think concretely, for the Thanos ingestion scaling problem, this means that…
@brancz Do you mean to totally wipe out (or rather "flush out", as discussed in prometheus/tsdb#346) everything on every scale-out or scale-in, reshard, and start a new set of Prometheus instances?
Yes, I'm thinking of flushing everything in the WAL to a TSDB block and uploading that. I haven't thought it through a lot, but my first idea was that it should be possible to start a new set of Prometheuses immediately, as Thanos would deduplicate at query time and alerts are deduplicated at the Alertmanager, so I don't think there even needs to be a lot of coordination (other than local to the Prometheus server that needs to "flush").
What about something like a DIBS system? In a k8s environment this could look like this: each Prometheus server marks the pods it has claimed with an annotation identifying itself plus a heartbeat, so when another server sees these annotations it knows the target is already taken and skips it.
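A minimal sketch of how that claim check could look, assuming the claim lives in pod annotations with a heartbeat (the annotation keys, the TTL, and the overall flow here are hypothetical, not an agreed design):

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical annotation keys; a real implementation would read these from
// the Kubernetes pod object and write them back when claiming a target.
const (
	claimedByAnnotation = "prometheus.io/claimed-by"
	heartbeatAnnotation = "prometheus.io/claim-heartbeat"
)

// claimIsFresh reports whether another scraper's claim on a target should be
// respected: it must name a different scraper and have a recent heartbeat.
func claimIsFresh(annotations map[string]string, self string, now time.Time, ttl time.Duration) bool {
	owner := annotations[claimedByAnnotation]
	if owner == "" || owner == self {
		return false // unclaimed, or claimed by us
	}
	hb, err := time.Parse(time.RFC3339, annotations[heartbeatAnnotation])
	if err != nil {
		return false // missing or malformed heartbeat: treat the claim as stale
	}
	return now.Sub(hb) < ttl
}

func main() {
	now := time.Now()
	pod := map[string]string{
		claimedByAnnotation: "prometheus-1",
		heartbeatAnnotation: now.Add(-30 * time.Second).Format(time.RFC3339),
	}
	if claimIsFresh(pod, "prometheus-0", now, 2*time.Minute) {
		fmt.Println("target already claimed by another scraper, skipping")
	} else {
		fmt.Println("claim it: write our own claimed-by and heartbeat annotations")
	}
}
```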
domgreen commented Jul 2, 2018
@krasi-georgiev although the main benefit of the DIBS system is that it would mean no central location for managing targets / assignment, I feel that having this central orchestration for assignment / sharding gives us more flexibility and reliability in the long run, as well as being a single place to debug any issues. With the DIBS approach you would only know how "hot" your own shard is, so you could choose not to take on any more targets, but you would not be able to assign data to the least loaded scraper; nor would it allow us to do any bin packing of jobs or specific labels. Within k8s the approach of using labels to define who is scraping, plus their heartbeat, could work, but for our use case we will be scraping targets outside of the k8s cluster, so labels alone would not solve it for us; here we would probably end up having to store the data in etcd or something similar. Would service discovery in the DIBS approach have all targets written to all nodes, so that you can then iterate over each target to identify whether it had already been claimed?
Can you give an example? I didn't understand why this wouldn't work in your use case.
I believe you can still use relabelling for this.
Yeah, I wrote a PoC where a Prometheus would drop all targets that were not “scheduled onto it” (meaning any Kubernetes pod that had a specific annotation), and then had an additional scheduler that would add the annotation to a pod. While this works, and scheduling can be done quite dynamically and without any additional data store (in the Kubernetes case), it's very difficult to determine whether this system is actually working. This is, as Brian mentioned, a full-on distributed monitoring system. With consistent hashing with bounded loads I was hoping to take away the distributed-system part but still kind of get the bin-packing load distribution.
Yeah, independently of whether you think such a system (DIBS) is a good idea (I happen to think centralized control is a better idea much of the time), if you want to build something as advanced as that, you can still use file_sd to feed the resulting target assignments into Prometheus.
I’m ok with that and think file_sd is a reasonable interface (if high-performance cases should come up we could think about extending this, but I haven’t seen a case where this was necessary). I should have mentioned that, for the above reasons, I decided against such a distributed monitoring system. From my side we can close this, but I don’t want to shut down the discussion for others just yet, so I’ll leave this open for now.
Yea, feel free to close this issue; we can continue the discussion of whether dynamic target assignment from a centralized place makes sense here: improbable-eng/thanos#387. Thanks all for your input!
brancz closed this Jul 2, 2018
brancz commented Jun 25, 2018 (edited)
The hashmod relabeling action is already super useful in order to shard Prometheus targets onto multiple Prometheus instances without having to treat Prometheus as a distributed system. However, hashmod uses an md5 hash, which is evenly distributed, but in contrast to consistent hashing methods, the distribution of targets to nodes is very disruptive when the number of shards changes.
With this issue I propose to add an additional relabeling action that implements a consistent hashing method and can be used in the same way as hashmod, but is less disruptive in terms of re-sharding of targets when changing the number of shards.
I would also like to use this issue to start the discussion of which consistent hashing algorithm we should choose to accomplish this. I have not performed any experiments on this yet, so any experience and knowledge would be highly appreciated.
In terms of characteristics/trade-offs of the system, we don't care much about lookup speed, as relabeling results are cached for target-config combinations, so we can focus on getting the best distribution and the least disruptive re-sharding. I have no experience with this, but I feel we could even do consistent hashing with bounded loads, using the scrape limits as the "bounded load" (only I'm not sure how to describe the maximum load other than having it be a new configuration field, but I guess that's what we have the discussion comments here for 🙂).
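To make the bounded-loads idea a bit more concrete, here is a rough Go sketch of the scheme as I understand it from the "consistent hashing with bounded loads" paper (the capacity factor and everything else here is illustrative, not a proposed configuration surface): targets hash onto a ring, and if the nearest shard is already at its capacity (which is where a per-shard scrape limit could plug in), the target spills over to the next shard clockwise.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"sort"
)

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

type boundedRing struct {
	points   []uint64       // sorted virtual-node positions on the ring
	owner    map[uint64]int // position -> shard index
	load     []int          // current number of targets per shard
	capacity int            // max targets per shard (the "bounded load")
}

func newBoundedRing(shards, vnodes, totalTargets int, c float64) *boundedRing {
	r := &boundedRing{owner: map[uint64]int{}, load: make([]int, shards)}
	// Capacity = ceil(c * average load), as in consistent hashing with bounded loads.
	r.capacity = int(math.Ceil(c * float64(totalTargets) / float64(shards)))
	for s := 0; s < shards; s++ {
		for v := 0; v < vnodes; v++ {
			p := hash64(fmt.Sprintf("shard-%d/%d", s, v))
			r.points = append(r.points, p)
			r.owner[p] = s
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// assign walks clockwise from the target's position and places it on the
// first shard that still has spare capacity.
func (r *boundedRing) assign(target string) int {
	h := hash64(target)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	for n := 0; n < len(r.points); n++ {
		s := r.owner[r.points[(i+n)%len(r.points)]]
		if r.load[s] < r.capacity {
			r.load[s]++
			return s
		}
	}
	return -1 // unreachable as long as capacity*shards >= total targets
}

func main() {
	const shards, targets = 4, 1000
	r := newBoundedRing(shards, 64, targets, 1.25)
	for i := 0; i < targets; i++ {
		r.assign(fmt.Sprintf("target-%d:9090", i))
	}
	fmt.Println("per-shard load:", r.load, "capacity:", r.capacity)
}
```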
@brian-brazil @tomwilkie I remember we talked about this once or twice at KubeCon Copenhagen, let me know if you have any other insights.
cc @mxinden @Bplotka @domgreen