
[TSDB] Prometheus instances dumping blocks simultaneously #8278

Open
ahurtaud opened this issue Dec 11, 2020 · 30 comments

Comments

@ahurtaud
Contributor

What did you do?
All our Prometheus instances dump their blocks every 2 hours (fixed for the Thanos setup).
The problem is that every Prometheus instance dumps these blocks in sync:
all instances: 09:00, 11:00, 13:00 ...

This makes capacity planning hard, as we see huge CPU peaks:
[screenshot (2020-12-11 10:09): CPU usage peaking simultaneously across all instances]

What did you expect to see?
I would like some randomness so that the multiple Prometheus instances can dump at different times, e.g. (see the sketch below the list):

  • instance1: 09:28, 11:28, 13:28 ...
  • instance2: 09:44, 11:44, 13:44 ...
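
For illustration, here is a minimal sketch of how such per-instance offsets could be derived. The function name, the hashing scheme, and the very idea of an instance-derived delay are assumptions made for the example; nothing like this exists as a Prometheus flag today.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// offsetFor maps an instance name to a stable delay inside the block range,
// so replicas would spread their block dumps instead of all firing at the
// same wall-clock time. Purely illustrative; not an existing Prometheus option.
func offsetFor(instance string, blockRange time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(instance))
	return time.Duration(h.Sum64() % uint64(blockRange))
}

func main() {
	for _, inst := range []string{"prometheus-0", "prometheus-1"} {
		fmt.Printf("%s: block boundary + %v\n", inst, offsetFor(inst, 2*time.Hour).Round(time.Minute))
	}
}
```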

Environment
kubernetes
prometheus 2.23.0
thanos v0.17.2

I already discussed this with @brancz on the Thanos Slack. A discussion about this is scheduled in the Prometheus Storage Working Group.

@brancz
Member

brancz commented Dec 11, 2020

Off the top of my head, I think that as long as we still produce 2h blocks, the time we actually run compaction could be offset (it's hard for me to tell right now how complex this would be). I think this is a great one to be discussed in the storage working group indeed.

@brian-brazil
Contributor

There are many places where we already take advantage, or plan to take advantage, of the fact that blocks always have the same phase; changing that would break lots of things (e.g. Thanos, Cortex, backfilling).

May I ask why it matters that different machines have increased CPU usage at the same times? You need to provision a core for background TSDB processing one way or the other; when that core is actually used doesn't change that.
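
For context on the "same phase" point: block boundaries are obtained by truncating timestamps down to a multiple of the block range, which is why every instance ends up on the same wall-clock boundaries no matter when it was started. A simplified sketch of that alignment math (not the actual Prometheus source):

```go
package main

import (
	"fmt"
	"time"
)

// rangeMaxFor returns the upper boundary of the 2h-aligned block containing t,
// by truncating t down to a multiple of the block width and adding one width.
// Simplified from the kind of logic TSDB uses; not a copy of the real code.
func rangeMaxFor(t time.Time, width time.Duration) time.Time {
	ms := t.UnixMilli()
	w := width.Milliseconds()
	return time.UnixMilli((ms/w)*w + w)
}

func main() {
	now := time.Date(2020, 12, 11, 10, 9, 0, 0, time.UTC)
	// Every instance computes the same boundary (12:00 UTC here), so the
	// blocks they eventually cut cover identical time ranges.
	fmt.Println(rangeMaxFor(now, 2*time.Hour).Format(time.RFC3339))
}
```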

@roidelapluie
Member

This is a sane request; I would love to see it solved. We know that there is some kind of de facto stop-the-world in Prometheus at that time, and it would be great if it did not happen at the same time in HA pairs.

@brian-brazil
Contributor

There is no stop-the-world in compaction, only a few places where mutexes are very briefly taken to switch out state. Compaction is an asynchronous process during which ingestion and querying continue as normal.

@XNorith

XNorith commented Dec 11, 2020

This would be a nice-to-have for my team as well. CPU spiking isn't my concern; rather, the aggregate egress traffic for our Prometheus nodes increases by three orders of magnitude every 2 hours as the Thanos sidecar uploads the blocks to object storage.
We also see aggregate disk I/O spikes every two hours that limit our options for using shared storage.

@Jakob3xD

@brian-brazil In an ideal world each Prometheus server runs on a different hardware machine, but that is not the case in the real world. In my case the Prometheus servers are currently running on separate VMs, causing a high load on the host for a short time when they dump the blocks. It is also uncommon to reserve a CPU just for one process to work smoothly. Normally the host systems are overcommitted to reduce the waste of resources.
I currently run 78 Prometheus servers spread over several hosts due to the current behavior.
Query: sum(irate(process_cpu_seconds_total{job="prometheus"}[1m]))
[graph: aggregate Prometheus CPU usage from the query above, spiking every two hours]

@brian-brazil
Contributor

It is also uncommon to reserve a CPU just for one process to work smoothly.

That's very common, as that's the essence of provisioning and capacity planning: providing whatever resources each process needs, keeping in mind that average and peak are not the same thing. Otherwise you're underprovisioned and at risk of an outage. In the case of Prometheus, to run smoothly you need to provision a core for background tasks, even if when head compaction happens were to change. (Also, you need 33% more CPU on top of how much you think you need in total, due to how the Go GC works.) One core is not a lot to ask, and Prometheus purposefully does not use as many cores as it can during compaction, so that it is possible to reasonably provision its CPU usage.

In practice, given the numbers you've provided, this means you need to set aside at least 2 cores for each of your Prometheus servers on average, plus whatever you need to handle query spikes.

@roidelapluie
Member

There is no stop-the-world in compaction, only a few places where mutexes are very briefly taken to switch out state. Compaction is an asynchronous process during which ingestion and querying continue as normal.

#7669

@brancz
Member

brancz commented Dec 11, 2020

There are many places where we already take advantage, or plan to take advantage, of the fact that blocks always have the same phase; changing that would break lots of things (e.g. Thanos, Cortex, backfilling).

That's not what I proposed in my first comment. I wrote that the flushing itself could happen at different times, but still produce blocks that are time-aligned in the same way as today.

@brian-brazil
Contributor

That'd avoid that class of issues; however, it'd require more RAM and could still cause issues for backfilling, as the head block would be longer than we'd otherwise expect.

@brancz
Member

brancz commented Dec 18, 2020

Let's say we had a configurable duration over which servers randomly distribute their compaction start time; could we not use that duration to account for that? It's a trade-off against RAM for sure, but I don't see why this shouldn't be a choice.
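
A minimal sketch of that idea, assuming a hypothetical maximum-jitter setting (the name, default, and wiring are invented for illustration): the random delay would only move the moment compaction is triggered, while the resulting block would still cover the usual aligned 2h range.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// compactionDelay picks a random delay in [0, maxJitter) once per process.
// Block min/max times would stay on the aligned 2h grid; only the trigger
// moves. Hypothetical sketch, not the interface Prometheus ships.
func compactionDelay(maxJitter time.Duration) time.Duration {
	if maxJitter <= 0 {
		return 0
	}
	return time.Duration(rand.Int63n(int64(maxJitter)))
}

func main() {
	delay := compactionDelay(30 * time.Minute)
	fmt.Printf("head compaction starts %v after the block boundary\n", delay.Round(time.Second))
}
```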

@roidelapluie
Member

Yes, I think that keeping blocks aligned as they are now is important; let's add the ability to apply a jitter delay to compaction, like we do now for scrapes.

@brian-brazil
Contributor

It's not just a trade-off for RAM, but also for disk (due to the WAL) and CPU (due to more queries hitting data in the head). For a Thanos setup it'd also mean more RAM/network/CPU. In addition, it invalidates present assumptions in backfilling about where the head is, and other systems may make similar assumptions. That all needs to be considered.

@mtfoley
Contributor

mtfoley commented Nov 7, 2021

Most of this is outside my scope of knowledge. The only thing I can really offer on the last point is that a common technique used in electrical demand curtailment is to use unique, evenly distributed offsets from a common start time. Sometimes the offset is hard-set, and other times it's calculated from network addressing to keep it unique.
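
As a sketch of that evenly-distributed-offset idea applied here (the replica index and count would have to come from outside Prometheus, e.g. a StatefulSet ordinal; this is an illustration, not an existing mechanism):

```go
package main

import (
	"fmt"
	"time"
)

// evenOffset spreads total replicas uniformly across the block range:
// replica i starts its dump at boundary + i*window/total.
// Illustrative only; the index has to be supplied from outside.
func evenOffset(replica, total int, window time.Duration) time.Duration {
	return window * time.Duration(replica) / time.Duration(total)
}

func main() {
	for i := 0; i < 4; i++ {
		fmt.Printf("replica %d: boundary + %v\n", i, evenOffset(i, 4, 2*time.Hour))
	}
}
```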

@azhiltsov

Was there any progress on this recently?
A jitter delay for compaction could spread the load, and that is what the topic starter (and I) are mostly interested in.

@xie233

xie233 commented Jan 18, 2023

Was there any progress on this recently?
We have the same problem now. Having the Thanos sidecar upload blocks at a different time might solve it, but we would like to solve it without changing the Thanos sidecar.

@bwplotka
Member

NOTE: We discussed this topic today at the Prometheus dev summit. We agreed to implement this.

Help wanted.

@machine424
Collaborator

Hello,

I'd like to try working on this. I'll put together a PoC and see how we could implement it.

@codebasky
Contributor

@machine424
I was looking into it and discussed it in the prometheus-dev channel, but forgot to tag it here. If you have not spent a long time on it, let me know and I will proceed further. Thanks.

@machine424
Collaborator

I started looking at this since the feature was approved at the Dev Summit.
Could you share a link to the discussion? If it's at a more advanced stage than what I have, I'll step out. Otherwise I'd like to continue working on this (it would be my first Prometheus contribution).

@codebasky
Contributor

@machine424 Please carry on then. Thanks.

@ahurtaud
Contributor Author

@machine424 Hello, are you still working on this issue? We are interested in this feature and are currently considering implementing it. Thank you

@machine424
Collaborator

Hello,
Yes, I’ll be pushing soon.

@machine424
Collaborator

Please tell me what you think about #12532

@machine424
Collaborator

Hello @codesome @beorn7, since you've contributed the most to head.go, what do you think about the approach in #12532 (comment)?

@beorn7
Member

beorn7 commented Aug 15, 2023

@codesome has very limited availability right now, and I am not actually the expert here. @jesusvazquez is probably the best person to review (but he's on vacation this week ;)). Sorry for letting this sit for so long; we have a serious review backlog.

@jesusvazquez
Member

I'll try to have a look at this next week.

@prometheus prometheus deleted a comment from prombot Aug 28, 2023
@nemobis
Contributor

nemobis commented Dec 22, 2023

Sorry for the silly question, but do I understand correctly that the issue assumes all the instances have been (re)started at the very same time? I've never experienced this issue because I usually restart Prometheus instances in a staggered fashion, so their 2-hour cycles are out of sync by a couple minutes and that's usually enough.

@ahurtaud
Contributor Author

ahurtaud commented Jan 8, 2024

Sorry for the silly question, but do I understand correctly that the issue assumes all the instances have been (re)started at the very same time? I've never experienced this issue because I usually restart Prometheus instances in a staggered fashion, so their 2-hour cycles are out of sync by a couple minutes and that's usually enough.

Nope, Prometheus dumps its blocks at minute zero, every two hours, whatever the restart date. So they are always all in sync.
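
A rough sketch of why that is: the trigger depends on the head's time span against the aligned 2h grid rather than on process uptime, so restarts don't shift it (simplified, not the actual source):

```go
package main

import (
	"fmt"
	"time"
)

// compactable mirrors the rough shape of the head-compaction check: once the
// data in the head spans more than 1.5x the block range, a block is cut, and
// its boundaries are aligned to the 2h grid. Wall-clock time, not process
// start time, therefore decides when every instance compacts. Simplified sketch.
func compactable(headMin, headMax time.Time, blockRange time.Duration) bool {
	return headMax.Sub(headMin) > blockRange/2*3
}

func main() {
	minT := time.Date(2020, 12, 11, 8, 0, 0, 0, time.UTC)
	maxT := time.Date(2020, 12, 11, 11, 5, 0, 0, time.UTC)
	fmt.Println(compactable(minT, maxT, 2*time.Hour)) // true: the head spans more than 3h
}
```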

@beorn7
Member

beorn7 commented May 21, 2024

Hello from the bug scrub.

@machine424 is working on this: #12532
