[TSDB] Prometheus instances dumping blocks simultaneously #8278
Off the top of my head, I think as long as we still produce 2h blocks the time we actually run compaction could be offset (hard to tell for me right now how complex this would be). I think this is a great one to be discussed in the storage working group indeed.
There are many places where we already take advantage, or plan to take advantage, of the fact that blocks always have the same phase; changing that would break lots of things (e.g. Thanos, Cortex, backfilling). May I ask why it matters that different machines have increased CPU usage at the same times? You need to provision a core for background tsdb processing one way or the other; when that core is actually used doesn't change that.
This is a sane request, I would love to see it solved. We know that there is some kind of de facto stop-the-world in Prometheus at that time, and it would be great if it did not happen at the same time in HA pairs.
There is no stop-the-world in compaction - only a few places where mutexes are very briefly taken to switch out state. Compaction is an asynchronous process during which ingestion and querying continue as normal.
This would be a nice-to-have for my team as well. CPU spiking isn't my concern; rather, our aggregate egress traffic for our Prometheus nodes increases by 3 orders of magnitude every 2 hours as the Thanos sidecar uploads the blocks to object storage.
@brian-brazil In an ideal world each Prometheus server runs on a different hardware machine, but that is not the case in the real world. In my case the Prometheus servers are currently running on separate VMs, causing a high load on the host for a short time when they all dump their blocks. It is also uncommon to reserve a CPU just so one process runs smoothly. Normally the host systems are overcommitted to reduce wasted resources.
That's very common, as that's the essence of provisioning and capacity planning - providing whatever resources each process needs, keeping in mind that average and peak are not the same thing. Otherwise you're underprovisioned and at risk of an outage. In the case of Prometheus, to run smoothly you need to provision a core for background tasks - even if when head compaction happens were to change. (Also you need 33% more CPU on top of how much you think you need in total, due to how the Go GC works.) One core is not a lot to ask, and Prometheus purposefully does not use as many cores as it can during compaction, so that it is possible to reasonably provision its CPU usage. In practice, given the numbers you've provided, this means you need to set aside at least 2 cores for each of your Prometheus servers on average. Plus whatever you need to handle query spikes.
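The rule of thumb above can be sketched as simple arithmetic. This is only an illustration of the comment's numbers (33% GC headroom plus one spare core for background work); `provisionedCores` is a hypothetical helper, not a Prometheus API:

```go
package main

import (
	"fmt"
	"math"
)

// provisionedCores sketches the provisioning rule of thumb: take the cores
// you expect Prometheus itself to need, add ~33% headroom for the Go
// garbage collector, and reserve one extra core for background work such
// as compaction. Hypothetical helper for illustration only.
func provisionedCores(expected float64) float64 {
	return math.Ceil(expected*1.33) + 1
}

func main() {
	fmt.Println(provisionedCores(1.0)) // 3
	fmt.Println(provisionedCores(2.0)) // 4
}
```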
That's not what I proposed in my first comment. I wrote that the flushing itself could happen at different times, but still produce blocks that are time-aligned in the same way as today. |
That'd avoid that class of issues, however it'd require more RAM and could still cause issues for backfilling, as the head block would be longer than we'd otherwise expect.
Let's say we had a configurable duration over which servers randomly distribute their starting time for compaction, could we not use that duration to account for that? It's a trade-off for RAM for sure, but I don't see why this shouldn't be a choice. |
Yes, I think that keeping blocks aligned as they are now is important, and let's have the ability to add jitter-delay to the compaction, like we do now for scrape. |
It's not just a tradeoff for RAM, but also for disk due to the WAL, and for CPU due to having more queries hit data in the head. For a Thanos setup it'd cost more RAM/network/CPU as well. In addition it invalidates present assumptions in backfilling about where the head is, and other systems may make similar assumptions. That all needs to be considered.
Most of this is out of my scope of knowledge. Only thing I can really offer on the last point is that a common technique used in electrical demand curtailment is to use unique/evenly distributed offsets from a common start time. Sometimes it’s hard set and other times it’s calculated from network addressing to keep it unique. |
Was there any progress on this recently? |
NOTE: We discussed this topic today on Prometheus DevSummit. We agreed to implement this. Help wanted. |
Hello, I'd try to work on this, I'll put together a PoC and see how we could implement this. |
@machine424 |
I started looking at this since the feature was approved at the Dev Summit. |
@machine424 Please carry on then. Thanks. |
@machine424 Hello, are you still working on this issue? We are interested in this feature and are currently considering implementing it. Thank you |
Hello, |
Please tell me what you think about #12532 |
Hello @codesome @beorn7, since you've contributed the most to head.go, what do you think about this approach #12532 (comment)? |
@codesome has very limited availability right now. And I am not actually the expert here. @jesusvazquez is probably the best person to review (but on vacation this week ;). Sorry for letting this sit for so long. We have a serious review backlog. |
I'll try to have a look at this next week. |
Sorry for the silly question, but do I understand correctly that the issue assumes all the instances have been (re)started at the very same time? I've never experienced this issue because I usually restart Prometheus instances in a staggered fashion, so their 2-hour cycles are out of sync by a couple minutes and that's usually enough. |
Nope, Prometheus dumps its blocks at minute zero, every two hours, regardless of when it was restarted. So all instances are always in sync.
Hello from the bug scrub. @machine424 is working on this. #12532
What did you do?
All our Prometheus instances dump their blocks every 2 hours (fixed for the Thanos setup).
The problem I see is that all Prometheus instances dump these blocks in sync:
all instances: 09:00, 11:00, 13:00 ...
This makes capacity planning hard, as we get huge CPU peaks.
What did you expect to see?
I would like to have some randomness so that the multiple Prometheus instances can dump at different times, e.g.
Environment
kubernetes
prometheus 2.23.0
thanos v0.17.2
Already discussed this with @brancz on the Thanos Slack. A discussion about this is scheduled in the Prometheus Storage Working Group.