Consistent hashing in addition to hashmod #4309

Closed
brancz opened this Issue Jun 25, 2018 · 27 comments


brancz commented Jun 25, 2018

The hashmod relabeling action is already super useful for sharding Prometheus targets onto multiple Prometheus instances without having to treat Prometheus as a distributed system. However, hashmod uses an md5 hash, which gives an even distribution, but unlike consistent hashing methods, changing the number of shards redistributes most targets across nodes, which is very disruptive.
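For illustration, here is a rough Go sketch of this style of modulo sharding (it mimics the idea, not the exact relabel implementation; the hashed value is assumed to be the target address):

```go
package main

import (
	"crypto/md5"
	"encoding/binary"
	"fmt"
)

// hashmodShard mimics the idea behind the hashmod relabel action: hash the
// concatenated source label values and take the result modulo the shard count.
func hashmodShard(labelValue string, numShards uint64) uint64 {
	sum := md5.Sum([]byte(labelValue))
	// Interpret the lower 8 bytes of the md5 sum as an unsigned integer.
	return binary.BigEndian.Uint64(sum[8:]) % numShards
}

func main() {
	for _, t := range []string{"app-0:9090", "app-1:9090", "app-2:9090"} {
		// With plain modulo, growing the shard count from 3 to 4 reassigns
		// most targets - that is the disruption described above.
		fmt.Printf("%s -> shard %d of 3, shard %d of 4\n",
			t, hashmodShard(t, 3), hashmodShard(t, 4))
	}
}
```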

With this issue I propose adding an additional relabeling action that implements a consistent hashing method. It could be used in the same way as hashmod, but would be less disruptive in terms of re-sharding targets when the number of shards changes.

I would also like to use this issue to start the discussion of which consistent hashing algorithm we should choose to accomplish this. I have not performed any experiments on this yet, so any experience and knowledge would be highly appreciated.
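As one illustrative candidate (not a proposal for the final choice), Lamping & Veach's jump consistent hash maps a 64-bit key to a bucket in [0, n) and only moves about 1/n of the keys when growing from n-1 to n buckets; a minimal Go sketch:

```go
package main

import "fmt"

// jumpHash implements the jump consistent hash by Lamping & Veach
// ("A Fast, Minimal Memory, Consistent Hash Algorithm"). It maps key to a
// bucket in [0, numBuckets) and moves only ~1/n of the keys when the bucket
// count grows from n-1 to n.
func jumpHash(key uint64, numBuckets int32) int32 {
	var b, j int64 = -1, 0
	for j < int64(numBuckets) {
		b = j
		key = key*2862933555777941757 + 1
		j = int64(float64(b+1) * (float64(int64(1)<<31) / float64((key>>33)+1)))
	}
	return int32(b)
}

func main() {
	// A key keeps its bucket across most resizes; here the same keys are
	// placed for 3 and then 4 buckets.
	for _, key := range []uint64{42, 1234567, 987654321} {
		fmt.Printf("key %d -> bucket %d of 3, bucket %d of 4\n",
			key, jumpHash(key, 3), jumpHash(key, 4))
	}
}
```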

In terms of characteristics/trade-offs, we don't care much about lookup speed, as relabeling results are cached per target-config combination, so we can focus on getting the best distribution and the least disruptive re-sharding. I have no experience with this, but I feel we could even do consistent hashing with bounded loads, using the scrape limits as the "bounded load" (I'm just not sure how to express the maximum load other than as a new configuration field, but I guess that's what the discussion here is for 🙂).
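Very roughly, the bounded-loads idea could look like the sketch below: targets live on a hash ring, and a shard that is already at its capacity (e.g. derived from a scrape limit) is skipped in favour of the next one clockwise. All names are made up for illustration, and for this to work as a relabel-style rule every Prometheus would have to walk the same target list in the same order so the load counters stay deterministic.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ringEntry is one virtual node on the hash ring.
type ringEntry struct {
	hash  uint64
	shard string
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

// assignBounded walks the ring clockwise from the target's hash and picks the
// first shard whose current load is below its capacity (the "bounded load").
func assignBounded(ring []ringEntry, load map[string]int, capacity int, target string) string {
	h := hash64(target)
	start := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= h })
	for i := 0; i < len(ring); i++ {
		e := ring[(start+i)%len(ring)]
		if load[e.shard] < capacity {
			load[e.shard]++
			return e.shard
		}
	}
	return "" // every shard is full
}

func main() {
	shards := []string{"prometheus-0", "prometheus-1", "prometheus-2"}
	var ring []ringEntry
	for _, s := range shards {
		for v := 0; v < 16; v++ { // a few virtual nodes per shard
			ring = append(ring, ringEntry{hash64(fmt.Sprintf("%s#%d", s, v)), s})
		}
	}
	sort.Slice(ring, func(i, j int) bool { return ring[i].hash < ring[j].hash })

	load := map[string]int{}
	for i := 0; i < 12; i++ {
		t := fmt.Sprintf("target-%d:9090", i)
		fmt.Println(t, "->", assignBounded(ring, load, 5, t))
	}
}
```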

@brian-brazil @tomwilkie I remember we talked about this once or twice at KubeCon Copenhagen, let me know if you have any other insights.

cc @mxinden @Bplotka @domgreen


juliusv commented Jun 25, 2018

Could you elaborate on the benefit that consistent hashing would bring in the case of a mod-sharded Prometheus? Normally consistent hashing for a distributed k/v store solves the problem of not having to move around so many keys' values to other nodes upon resharding, or for caches (where values are not actively moved around) it helps to keep more cache entries valid. But we neither move around data (series) between mod-sharded Prometheus servers, nor do we query them in a cache-like fashion (where it's ok if data is missing), so we always have to query all servers of a mod-sharded group anyway. So the only problem with the current hashing I readily see is the series churn strain on each Prometheus's TSDB. That is, upon resharding with current hashmod, most series will change (and need to be indexed), while consistent hashing would only create K/n new series on each server.
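To put a rough number on "most series will change": a quick simulation (assuming uniformly distributed hashes, which md5 effectively gives) of how many keys change shard under plain modulo when going from 10 to 11 shards:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	const keys = 1_000_000
	const n = 10 // going from 10 to 11 shards

	rng := rand.New(rand.NewSource(1))
	moved := 0
	for i := 0; i < keys; i++ {
		h := rng.Uint64() // stands in for a uniformly distributed target hash
		if h%n != h%(n+1) {
			moved++
		}
	}
	// Prints roughly 0.91: with plain modulo almost every key changes shard,
	// whereas an ideal consistent hash would move only ~1/(n+1) ≈ 0.09.
	fmt.Printf("moved %.2f of keys when going from %d to %d shards\n",
		float64(moved)/keys, n, n+1)
}
```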


brancz commented Jun 25, 2018

We already have a unique identification of targets; this would be the key, and we want to figure out whether the key belongs to the "current" Prometheus server. It's about targets that move when the number of shards changes, and therefore their respective time series, which among other things affects alerting. I'm not intending to move any data around. And as you mentioned, new time series have a big cost in TSDB, so we want to keep this to a minimum where we can.

And as I mentioned, with bounded loads we may be able to build something like bin packing without an external system performing any scheduling (though, as I also mentioned, my knowledge in this area currently consists of reading a couple of papers and blog posts).


juliusv commented Jun 25, 2018

We already have a unique identification of targets; this would be the key, and we want to figure out whether the key belongs to the "current" Prometheus server. It's about targets that move when the number of shards changes, and therefore their respective time series, which among other things affects alerting. I'm not intending to move any data around. And as you mentioned, new time series have a big cost in TSDB, so we want to keep this to a minimum where we can.

Yep, I see the TSDB churn concern, but I'm curious whether there are any other valid benefits of adding consistent hashing. I'm not sure about the alerting example, although that's an interesting point. While non-consistent hashing will lose continuity (and accumulation of `for` state) for most or all alert elements, consistent hashing will still change K/n of the alert elements, so it still fundamentally means you can't reshard and expect alerting to work properly across reshards. Just half-broken is still broken?


brian-brazil commented Jun 25, 2018

I agree with Julius. When you change the hashing, basically everything will break. Using consistent hashing doesn't change that.


krasi-georgiev commented Jun 25, 2018

IIUC, the point is that when deploying more than one Prometheus instance, the same targets will always be scraped by the same Prometheus instance, so you can scale horizontally without worrying about missing or duplicated scrape targets.

The part that explains hashmod:
https://www.robustperception.io/scaling-and-federating-prometheus/


domgreen commented Jun 25, 2018

The use case I have is collecting metrics in a multi-tenant environment, where I would like to ensure that when scaling horizontally I keep as much of each tenant as possible on the same scrape instance, so that on the read path queries hit a minimal number of Prometheus instances. 1/n targets moving is much better in this scenario than all targets moving to a new scrape instance.


juliusv commented Jun 25, 2018

@domgreen so you are recording all sharding changes over the retention period and then computing the same hash in a multitenant gateway as Prometheus does, so you can figure out which instances to hit for what data (different ones depending on what sharding events happened over the query time range)?


domgreen commented Jun 25, 2018

I'm not currently recording changes to sharding; at the moment I'm sharding in a sidecar and using file_sd_config to keep tenants together. But I'm looking to extend this in a more generic way, either via Thanos or Prometheus.

We have many tenants whose targets can change, so for aggregation and alerting we want to keep all the data for each tenant on a minimal number of scrapers.

I was hoping to build into Thanos query the notion that you can hit a minimal number of scrape instances when running a query for a specific tenant's data. For query, aggregation, and alerting it would therefore be beneficial to keep all TSDB data on a minimum number of shards and not reallocate. So when scaling scrapers I do not want to lose aggregation from a node if I can help it. Maybe it's something we build into Thanos, but ideally we'd find a good solution for all use cases.


juliusv commented Jun 25, 2018

@domgreen But the only way you can get all the data is to never reshard at all, always ask all shards, or explicitly track all reshardings. I'm not sure it's in Thanos's roadmap to track reshardings, and without that you are always stuck with either incomplete data (although it'll be less incomplete with consistent hashing) or querying everyone.


domgreen commented Jun 25, 2018

@juliusv I totally agree with you; I was hoping that we could minimise the number of extra nodes we have to query to get the complete picture for a tenant.


juliusv commented Jun 25, 2018

So to summarize, IMO there are two possible legitimate benefits of consistent hashing so far:

  • Reduce series indexing churn on each shard's local TSDB.
  • Reduce the number of shards you need to query in the case that someone really builds a proxy that tracks reshardings and what data lived where at what time.

So far I don't see the second point happening in practice yet, but I'm happy to be proven wrong.

That leaves the local TSDB churn case. FWIW I don't have a strong opinion on whether that warrants adding a new hashing option, but just want people to be clear on what the actual benefits are.


brian-brazil commented Jun 25, 2018

Reduce series indexing churn on each shard's local TSDB.

In the expected use case for this I'd expect users to throw away all the data in the scraping TSDBs when they reshard, as it's fairly complicated to get this right and any data you care about would already be aggregated up on the master.

I don't get the impression that this is a very strong argument, particularly as we're talking about adding yet another relabelling action which is very similar to an existing one to satisfy a very niche use case where there's already quite a bit of custom code involved. If you want that level of control over how targets are allocated in a horizontal sharding setup, you could always use file_sd.


bwplotka commented Jun 26, 2018

I think this discussion has really diverged from the main argument for why hashmod is not enough.

Context: improbable-eng/thanos#387

IMO keeping tenant stuff together is not the main issue, though it is somewhat connected. The main issue is how to dynamically change the number of shards up and down without too much time series churn and without disruption. We just think that keeping tenant stuff together might be useful for some use cases where we know that a huge number of targets (and their series) go up and down together for each tenant. (For us, a single tenant spins up huge workloads consisting of several isolated VMs and services that need to be monitored.)

So, to clarify, IMO we need to be able to:

  • enable auto-scaling of Prometheus instances to meet dynamic targets demands (including scaling down if Prometheus instances are under-utilized)
  • keep each tenant's targets together if possible (for easier resharding/autoscaling and fewer Prometheus instances to query - though with Thanos there is less data to query [fresh data only])

We would love your feedback on how to meet these use cases ⬆️

Thanks for the feedback so far!


brian-brazil commented Jun 26, 2018

I think what's being discussed in improbable-eng/thanos#387 is pretty far outside the scope of Prometheus, as it's full-on building a clustered monitoring system. I think any features we add should make sense in terms of a standalone Prometheus setup, both for sanity and so we don't end up sleepwalking into either being a clustered system or having all the maintenance pains of being one.

The main issue is how to dynamically change the number of shards up and down without too much time series churn and without disruption.

Consistent hashing doesn't help you a lot there; this is quite a bit more involved than that, as e.g. you still have hot shards and need to worry about resharding for queries. How to actually build a transparently sharded distributed storage system like this is, I believe, more @tomwilkie's area.


bwplotka commented Jun 26, 2018

Thanks for your input, Brian - totally agree with you.

@tomwilkie, if you have a moment to look at our proposed way of solving this, your feedback would be awesome! I am pretty sure we are not the only ones with such a use case.


tomwilkie commented Jun 27, 2018

In Cortex we use consistent hashing to solve this problem, but we have also decoupled scraping & storage, so scaling scraping with hashmod is more than good enough. And when we do change sharding, we move data around**.

** this isn't strictly implemented right now: when we scale, only 1/n of the data would need moving, but since all queries have to hit all nodes, not moving the data isn't strictly necessary - it will get marked as stale and flushed in due course.


brancz commented Jun 27, 2018

Concretely, for the Thanos ingestion scaling problem I think this means that hashmod is good enough to distribute targets across ingesting Prometheus servers, and when the number of shards is increased or decreased, a TSDB block is built from the WAL and uploaded to object storage, and the new sharding can be rolled out simultaneously.


bwplotka commented Jun 27, 2018

@brancz Do you mean to totally wipe out (or rather "flush out", as discussed here: prometheus/tsdb#346) everything on every scale-out or scale-in, reshard, and start a new set of Prometheus instances?


brancz commented Jun 27, 2018

Yes, I'm thinking of flushing everything in the WAL to a TSDB block and uploading that. I haven't thought it through a lot, but my first idea was that a new set of Prometheus servers could be started immediately, as Thanos would deduplicate at query time and alerts are deduplicated at the Alertmanager, so I don't think there even needs to be a lot of coordination (other than locally on the Prometheus server that needs to "flush").


krasi-georgiev commented Jun 28, 2018

What about something like a DIBS system?
This could be a Prometheus sidecar.
Sidecar settings:
max_targets - how many targets to scrape
heartbeat_timestamp - how often to refresh the DIBS timestamp

In a k8s environment this would look like this:
the first Prometheus that sees a new target also adds pod annotations
scraper=prom1
scraper_heartbeat=10.00.10-28.06.2018
prom1 uses relabeling to scrape only pods with the annotation scraper=prom1

So when another server sees these annotations, it means prom1 is currently scraping this target and the last heartbeat was at 10.00.10-28.06.2018.
The sidecar would also be responsible for refreshing the timestamp to confirm that prom1 is still scraping this target.
If the timestamp doesn't update for 10 min, another sidecar can DIBS this target.
Sidecars can only DIBS up to their max_targets setting.
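For what it's worth, a minimal sketch of the claim/heartbeat logic described above, with the annotation store abstracted behind an interface (all names are hypothetical; this is not an existing sidecar):

```go
package main

import (
	"fmt"
	"time"
)

// claim records who scrapes a target and when it last heartbeated.
type claim struct {
	scraper   string
	heartbeat time.Time
}

// TargetClaims abstracts wherever the claim is stored (pod annotations, etcd, ...).
type TargetClaims interface {
	Get(target string) (scraper string, heartbeat time.Time, ok bool)
	Set(target, scraper string, heartbeat time.Time) error
}

// memClaims is an in-memory stand-in for pod annotations, just for the demo.
type memClaims map[string]claim

func (m memClaims) Get(t string) (string, time.Time, bool) {
	c, ok := m[t]
	return c.scraper, c.heartbeat, ok
}

func (m memClaims) Set(t, s string, hb time.Time) error {
	m[t] = claim{s, hb}
	return nil
}

const claimTimeout = 10 * time.Minute

// maybeClaim decides whether this scraper should "DIBS" the target: unclaimed
// targets, targets it already owns, or targets with a stale heartbeat are
// claimed, as long as maxTargets is not exceeded.
func maybeClaim(claims TargetClaims, target, self string, owned, maxTargets int, now time.Time) (bool, error) {
	owner, heartbeat, ok := claims.Get(target)
	switch {
	case ok && owner == self:
		// Already ours: refresh the heartbeat.
		return true, claims.Set(target, self, now)
	case ok && now.Sub(heartbeat) < claimTimeout:
		// Someone else is actively scraping it.
		return false, nil
	case owned >= maxTargets:
		// We are at capacity; leave it for another sidecar.
		return false, nil
	default:
		return true, claims.Set(target, self, now)
	}
}

func main() {
	claims := memClaims{}
	now := time.Now()

	ok, _ := maybeClaim(claims, "app-0:9090", "prom1", 0, 100, now)
	fmt.Println("prom1 claimed app-0:9090:", ok) // true: unclaimed

	// prom2 sees a fresh heartbeat from prom1 and backs off.
	ok, _ = maybeClaim(claims, "app-0:9090", "prom2", 0, 100, now.Add(time.Minute))
	fmt.Println("prom2 claimed app-0:9090:", ok) // false

	// After 10 minutes without a heartbeat, prom2 may take over.
	ok, _ = maybeClaim(claims, "app-0:9090", "prom2", 0, 100, now.Add(11*time.Minute))
	fmt.Println("prom2 claimed app-0:9090:", ok) // true
}
```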


domgreen commented Jul 2, 2018

@krasi-georgiev although the main benefit of the DIBS system is that there would be no central location for managing target assignment, I feel that having central orchestration for assignment/sharding gives us more flexibility and reliability in the long run, as well as a single place to debug any issues.

With the DIBS approach you would only know how "hot" your own shard is, so you could choose not to take on any more targets, but you would not be able to assign targets to the least loaded scraper. Nor would it allow us to do any bin packing of jobs or specific labels.

Within k8s the approach of using labels to record who is scraping a target and their heartbeat works, but for our use case we will be scraping targets outside of the k8s cluster, so labels alone would not solve it for us. There we would probably end up having to store the data in etcd or something similar.

Would service discovery in the DIBS approach write all targets to all nodes, so you could then iterate over each target to identify whether it had already been claimed?


krasi-georgiev commented Jul 2, 2018

With the DIBS approach you would only know how "hot" your own shard is, so you could choose not to take on any more targets, but you would not be able to assign targets to the least loaded scraper.

Can you give an example? I didn't understand why this wouldn't work in your use case.

Nor would it allow us to do any bin packing of jobs or specific labels.

I believe you can still use relabelling for this.


brancz commented Jul 2, 2018

Yeah, I wrote a PoC where a Prometheus would drop all targets that were not "scheduled onto it" (meaning any Kubernetes pod that had a specific annotation), plus an additional scheduler that would add the annotation to a pod. While this works, and scheduling can be done quite dynamically and without any additional data store (in the Kubernetes case), it's very difficult to determine whether such a system is actually working. This is, as Brian mentioned, a full-on distributed monitoring system.

With consistent hashing with bounded loads I was hoping to take away the distributed system part but still get something like the bin-packing load distribution.


juliusv commented Jul 2, 2018

Yeah, independently of whether you think such a system (DIBS) is a good idea (I happen to think centralized control is a better idea much of the time), if you want to build something as advanced as that, you can still use file_sd to do custom sharding, and it's not something core Prometheus needs to offer.


brancz commented Jul 2, 2018

I'm ok with that and think file_sd is a reasonable interface (if high-performance cases come up we could think about extending it, but I haven't seen a case where that was necessary). I should have mentioned that for the above reasons I decided against such a distributed monitoring system. From my side we can close this, but I don't want to shut down the discussion for others just yet, so I'll leave this open for now.


bwplotka commented Jul 2, 2018

Yeah, feel free to close this issue; we can continue the discussion of whether dynamic target assignment from a centralized place makes sense here: improbable-eng/thanos#387

Thanks all for your input!

brancz closed this Jul 2, 2018

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
