
[TSDB] Prometheus instances dumping blocks simultaneously #8278

Open
ahurtaud opened this issue Dec 11, 2020 · 30 comments

Comments

@ahurtaud
Contributor

What did you do?
All our Prometheus instances dump their blocks every 2 hours (fixed for the Thanos setup).
The problem is that every Prometheus instance dumps these blocks in sync:
all instances: 09:00, 11:00, 13:00 ...

This makes capacity planning hard, as we see huge CPU peaks:
[screenshot (2020-12-11 10:09): CPU usage peaking simultaneously across all instances]

What did you expect to see?
I would like some randomness so that the multiple Prometheus instances can dump at different times, e.g. (see the sketch below the list):

  • instance1: 09:28, 11:28, 13:28 ...
  • instance2: 09:44, 11:44, 13:44 ...
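
For illustration, here is a minimal sketch of how such per-instance offsets could be derived. The function name, the hashing scheme, and the very idea of an instance-derived delay are assumptions made for the example; nothing like this exists as a Prometheus flag today.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// offsetFor maps an instance name to a stable delay inside the block range,
// so replicas would spread their block dumps instead of all firing at the
// same wall-clock time. Purely illustrative; not an existing Prometheus option.
func offsetFor(instance string, blockRange time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(instance))
	return time.Duration(h.Sum64() % uint64(blockRange))
}

func main() {
	for _, inst := range []string{"prometheus-0", "prometheus-1"} {
		fmt.Printf("%s: block boundary + %v\n", inst, offsetFor(inst, 2*time.Hour).Round(time.Minute))
	}
}
```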

Environment
kubernetes
prometheus 2.23.0
thanos v0.17.2

I already discussed this with @brancz on the Thanos Slack. A discussion about this is scheduled in the Prometheus Storage Working Group.

@brancz
Member

brancz commented Dec 11, 2020

Off the top of my head, I think that as long as we still produce 2h blocks, the time we actually run compaction could be offset (it's hard for me to tell right now how complex this would be). I think this is a great one to be discussed in the storage working group indeed.

@brian-brazil
Contributor

There are many places where we already take advantage, or plan to take advantage, of the fact that blocks always have the same phase; changing that would break lots of things (e.g. Thanos, Cortex, backfilling).

May I ask why it matters that different machines have increased CPU usage at the same times? You need to provision a core for background TSDB processing one way or the other; when that core is actually used doesn't change that.
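
For context on the "same phase" point: block boundaries are obtained by truncating timestamps down to a multiple of the block range, which is why every instance ends up on the same wall-clock boundaries no matter when it was started. A simplified sketch of that alignment math (not the actual Prometheus source):

```go
package main

import (
	"fmt"
	"time"
)

// rangeMaxFor returns the upper boundary of the 2h-aligned block containing t,
// by truncating t down to a multiple of the block width and adding one width.
// Simplified from the kind of logic TSDB uses; not a copy of the real code.
func rangeMaxFor(t time.Time, width time.Duration) time.Time {
	ms := t.UnixMilli()
	w := width.Milliseconds()
	return time.UnixMilli((ms/w)*w + w)
}

func main() {
	now := time.Date(2020, 12, 11, 10, 9, 0, 0, time.UTC)
	// Every instance computes the same boundary (12:00 UTC here), so the
	// blocks they eventually cut cover identical time ranges.
	fmt.Println(rangeMaxFor(now, 2*time.Hour).Format(time.RFC3339))
}
```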

@roidelapluie
Member

This is a sane request; I would love to see it solved. We know that there is some kind of de facto stop-the-world in Prometheus at that time, and it would be great if it did not happen at the same time in HA pairs.

@brian-brazil
Contributor

There is no stop-the-world in compaction, only a few places where mutexes are very briefly taken to switch out state. Compaction is an asynchronous process during which ingestion and querying continue as normal.

@XNorith

XNorith commented Dec 11, 2020

This would be a nice-to-have for my team as well. CPU spiking isn't my concern; rather, the aggregate egress traffic for our Prometheus nodes increases by three orders of magnitude every 2 hours as the Thanos sidecar uploads the blocks to object storage.
We also see aggregate disk I/O spikes every two hours that limit our options for using shared storage.

@Jakob3xD

@brian-brazil In an ideal world each Prometheus server runs on a different hardware machine, but that is not the case in the real world. In my case the Prometheus servers are currently running on separate VMs, causing a high load on the host for a short time when they dump the blocks. It is also uncommon to reserve a CPU just for one process to work smoothly. Normally the host systems are overcommitted to reduce the waste of resources.
I currently run 78 Prometheus servers spread over several hosts due to the current behavior.
Query: sum(irate(process_cpu_seconds_total{job="prometheus"}[1m]))
[graph: aggregate Prometheus CPU usage from the query above, spiking every two hours]

@brian-brazil
Contributor

It is also uncommon to reserve a CPU just for one process to work smoothly.

That's very common, as that's the essence of provisioning and capacity planning: providing whatever resources each process needs, keeping in mind that average and peak are not the same thing. Otherwise you're underprovisioned and at risk of an outage. In the case of Prometheus, to run smoothly you need to provision a core for background tasks, even if when head compaction happens were to change. (Also, you need 33% more CPU on top of how much you think you need in total, due to how the Go GC works.) One core is not a lot to ask, and Prometheus purposefully does not use as many cores as it can during compaction, so that it is possible to reasonably provision its CPU usage.

In practice, given the numbers you've provided, this means you need to set aside at least 2 cores for each of your Prometheus servers on average, plus whatever you need to handle query spikes.

@roidelapluie
Member

There is no stop-the-world in compaction, only a few places where mutexes are very briefly taken to switch out state. Compaction is an asynchronous process during which ingestion and querying continue as normal.

#7669

@brancz
Member

brancz commented Dec 11, 2020

There are many places where we already take advantage, or plan to take advantage, of the fact that blocks always have the same phase; changing that would break lots of things (e.g. Thanos, Cortex, backfilling).

That's not what I proposed in my first comment. I wrote that the flushing itself could happen at different times, but still produce blocks that are time-aligned in the same way as today.

@brian-brazil
Contributor

That'd avoid that class of issues; however, it'd require more RAM and could still cause issues for backfilling, as the head block would be longer than we'd otherwise expect.

@brancz
Member

brancz commented Dec 18, 2020

Let's say we had a configurable duration over which servers randomly distribute their compaction start time; could we not use that duration to account for that? It's a trade-off against RAM for sure, but I don't see why this shouldn't be a choice.
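
A minimal sketch of that idea, assuming a hypothetical maximum-jitter setting (the name, default, and wiring are invented for illustration): the random delay would only move the moment compaction is triggered, while the resulting block would still cover the usual aligned 2h range.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// compactionDelay picks a random delay in [0, maxJitter) once per process.
// Block min/max times would stay on the aligned 2h grid; only the trigger
// moves. Hypothetical sketch, not the interface Prometheus ships.
func compactionDelay(maxJitter time.Duration) time.Duration {
	if maxJitter <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(maxJitter)))
}

func main() {
	delay := compactionDelay(30 * time.Minute)
	fmt.Printf("head compaction starts %v after the block boundary\n", delay.Round(time.Second))
}
```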

@roidelapluie
Member

Yes, I think that keeping blocks aligned as they are now is important; let's add the ability to apply a jitter delay to compaction, like we do now for scrapes.

@brian-brazil
Contributor

It's not just a trade-off for RAM, but also for disk (due to the WAL) and CPU (due to more queries hitting data in the head). For a Thanos setup it'd also mean more RAM/network/CPU. In addition, it invalidates present assumptions in backfilling about where the head is, and other systems may make similar assumptions. That all needs to be considered.

@mtfoley
Contributor

mtfoley commented Nov 7, 2021

Most of this is outside my scope of knowledge. The only thing I can really offer on the last point is that a common technique used in electrical demand curtailment is to use unique, evenly distributed offsets from a common start time. Sometimes the offset is hard-set, and other times it's calculated from network addressing to keep it unique.
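
As a sketch of that evenly-distributed-offset idea applied here (the replica index and count would have to come from outside Prometheus, e.g. a StatefulSet ordinal; this is an illustration, not an existing mechanism):

```go
package main

import (
	"fmt"
	"time"
)

// evenOffset spreads total replicas uniformly across the block range:
// replica i starts its dump at boundary + i*window/total.
// Illustrative only; the index has to be supplied from outside.
func evenOffset(replica, total int, window time.Duration) time.Duration {
	return window * time.Duration(replica) / time.Duration(total)
}

func main() {
	for i := 0; i < 4; i++ {
		fmt.Printf("replica %d: boundary + %v\n", i, evenOffset(i, 4, 2*time.Hour))
	}
}
```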

@azhiltsov

Was there any progress on this recently?
A jitter delay for compaction could spread the load, and that is what the topic starter (and I) are mostly interested in.

@xie233

xie233 commented Jan 18, 2023

Was there any progress on this recently?
We have the same problem now. Having the Thanos sidecar upload blocks at a different time might solve it, but we would like to solve it without changing the Thanos sidecar.

@bwplotka
Member

NOTE: We discussed this topic today at the Prometheus dev summit. We agreed to implement this.

Help wanted.

@machine424
Collaborator

Hello,

I'd like to try working on this. I'll put together a PoC and see how we could implement it.

@codebasky
Contributor

@machine424
I was looking into it and discussed it in the prometheus-dev channel, but forgot to tag it here. If you have not spent a long time on it, let me know and I will proceed further. Thanks.

@machine424
Collaborator

I started looking at this since the feature was approved at the Dev Summit.
Could you share a link to the discussion? If it's at a more advanced stage than what I have, I'll step out. Otherwise I'd like to continue working on this (it would be my first Prometheus contribution).

@codebasky
Contributor

@machine424 Please carry on then. Thanks.

@ahurtaud
Contributor Author

@machine424 Hello, are you still working on this issue? We are interested in this feature and are currently considering implementing it. Thank you

@machine424
Collaborator

Hello,
Yes, I’ll be pushing soon.

@machine424
Collaborator

Please tell me what you think about #12532

@machine424
Collaborator

Hello @codesome @beorn7, since you've contributed the most to head.go, what do you think about the approach in #12532 (comment)?

@beorn7
Member

beorn7 commented Aug 15, 2023

@codesome has very limited availability right now, and I am not actually the expert here. @jesusvazquez is probably the best person to review (but he's on vacation this week ;)). Sorry for letting this sit for so long; we have a serious review backlog.

@jesusvazquez
Member

I'll try to have a look at this next week.

@prometheus prometheus deleted a comment from prombot Aug 28, 2023
@nemobis
Contributor

nemobis commented Dec 22, 2023

Sorry for the silly question, but do I understand correctly that the issue assumes all the instances have been (re)started at the very same time? I've never experienced this issue because I usually restart Prometheus instances in a staggered fashion, so their 2-hour cycles are out of sync by a couple minutes and that's usually enough.

@ahurtaud
Contributor Author

ahurtaud commented Jan 8, 2024

Sorry for the silly question, but do I understand correctly that the issue assumes all the instances have been (re)started at the very same time? I've never experienced this issue because I usually restart Prometheus instances in a staggered fashion, so their 2-hour cycles are out of sync by a couple minutes and that's usually enough.

Nope, Prometheus dumps its blocks at minute zero, every two hours, whatever the restart date. So they are always all in sync.
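
A rough sketch of why that is: the trigger depends on the head's time span against the aligned 2h grid rather than on process uptime, so restarts don't shift it (simplified, not the actual source):

```go
package main

import (
	"fmt"
	"time"
)

// compactable mirrors the rough shape of the head-compaction check: once the
// data in the head spans more than 1.5x the block range, a block is cut, and
// its boundaries are aligned to the 2h grid. Wall-clock time, not process
// start time, therefore decides when every instance compacts. Simplified sketch.
func compactable(headMin, headMax time.Time, blockRange time.Duration) bool {
	return headMax.Sub(headMin) > blockRange/2*3
}

func main() {
	minT := time.Date(2020, 12, 11, 8, 0, 0, 0, time.UTC)
	maxT := time.Date(2020, 12, 11, 11, 5, 0, 0, time.UTC)
	fmt.Println(compactable(minT, maxT, 2*time.Hour)) // true: the head spans more than 3h
}
```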

@beorn7
Member

beorn7 commented May 21, 2024

Hello from the bug scrub.

@machine424 is working on this: #12532
