Avoid HA pair hitting targets at the same time #1546
brian-brazil added the enhancement label on Apr 11, 2016
AFAICT the phasing is not based on absolute time anyways, but relative to the latest scrape (https://github.com/prometheus/prometheus/blob/master/retrieval/target.go#L127). Given that, won't things distribute enough anyways?
I can't confirm this; our scrape times are independent from each other. I believe you'll run into this issue more often if you reload/restart both Prometheis at the exact same time though. OT: What external labels do you usually set? Our pairs are all 100% identical actually, including external_labels.
It's relative to absolute time, so that if a target disappears and reappears (e.g. race in SD) it'll keep its offset.
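To make the mechanism concrete, here is a minimal sketch of that behaviour, assuming a hash-derived per-target offset applied against absolute wall-clock time; the function names and the choice of FNV hashing are illustrative and not copied from retrieval/target.go.

```go
// Illustrative sketch only (not the actual Prometheus source): the offset is
// a pure function of the target's identity, and the next scrape is aligned to
// absolute time plus that offset, so a target that disappears and reappears
// keeps the same phase within its scrape interval.
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

// scrapeOffset returns a stable offset in [0, interval) derived from the
// target's identity (e.g. its label set rendered as a string).
func scrapeOffset(identity string, interval time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(identity))
	return time.Duration(h.Sum64() % uint64(interval))
}

// nextScrape schedules relative to absolute time: the start of the current
// interval plus the target's offset, or the next interval if that moment has
// already passed.
func nextScrape(now time.Time, identity string, interval time.Duration) time.Time {
	next := now.Truncate(interval).Add(scrapeOffset(identity, interval))
	if !next.After(now) {
		next = next.Add(interval)
	}
	return next
}

func main() {
	id := `{job="node",instance="db1:9100"}`
	now := time.Now()
	fmt.Println(nextScrape(now, id, 15*time.Second))
	// Re-adding the target a few seconds later lands on the same phase.
	fmt.Println(nextScrape(now.Add(3*time.Second), id, 15*time.Second))
}
```

Because the offset depends only on the target's identity, two identically configured HA replicas compute the same offset and therefore scrape at the same instant, which is the behaviour this issue wants to avoid.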
That'll cause you some fun with remote storage, and possibly also federation depending on exactly what you're doing. |
@brian-brazil How so? Remote storage does not really exist so far, so it's hard to plan for that, and federation will add an instance label which will distinguish the results of the two nodes. What labels do you suggest, something like
There are some aspects that can be planned; from experience, you'll want both in storage and distinguished.
You should be setting
Andor commented on Apr 22, 2016
And how can we call a query like that?
@Andor There's no query-time federation at this point. You'd have to federate the necessary aggregated data into your higher-level Prometheus first and then query the data there.
@Andor That's unrelated to this issue and to HA. This isn't possible currently.
fabxc added the kind/enhancement label and removed the enhancement label on Apr 28, 2016
A simpler way of doing this might be adding the hostname to the hash, as it's unlikely it'll be the same in a HA pair (at least if they're in the same datacenter).
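A rough sketch of that suggestion, assuming the same kind of hash-derived offset as above; using os.Hostname() as the salt is an illustrative choice, not something taken from the codebase.

```go
// Hypothetical sketch: mix the local hostname into the per-target hash so two
// HA replicas (which usually have different hostnames) land on different
// phases for the same target.
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"time"
)

func hostSaltedOffset(identity string, interval time.Duration) time.Duration {
	hostname, err := os.Hostname()
	if err != nil {
		hostname = "" // fall back to the unsalted behaviour
	}
	h := fnv.New64a()
	h.Write([]byte(hostname)) // the only addition over the plain per-target hash
	h.Write([]byte(identity))
	return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
	fmt.Println(hostSaltedOffset(`{job="node",instance="db1:9100"}`, 15*time.Second))
}
```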
brian-brazil referenced this issue on Apr 25, 2017: [Discussion] Caching values for a short time #153 (closed)
Adding both
brian-brazil added the priority/P3 and not-as-easy-as-it-looks labels on Jul 14, 2017
Doesn't making the scrape interval relative to the last scrape time add some jitter, making sure multiple concurrent requests are less likely to synchronize? See https://cloudplatform.googleblog.com/2016/11/how-to-avoid-a-self-inflicted-DDoS-Attack-CRE-life-lessons.html for reasoning as to why you would want to add jitter / randomness to your intervals.
No, the scrape duration doesn't impact when the next scrape happens. Randomness is not desirable here, as it'd make the data less clean. |
As a counterpoint, we are currently relying on this behaviour; if this changed, we would request an option to maintain sync between HA pairs.
This is an implementation detail, which you should not depend on. From a philosophical standpoint, if you care about timestamps at this level then your monitoring is likely very fragile and you should look at making it more resilient.
It's not about timestamps, it's about load reduction for SNMP targets, as per prometheus/snmp_exporter#153. Yes, it sucks, but we still have a few legacy, or not-so-legacy, devices which are being icky.
That's an independent problem, where your devices lack capacity to deal with scrapes. This is about trying to reduce load spikes.
We're 17 comments into discussing the benefits of different behaviours. You can't just dismiss "the current behaviour works better than the proposed one for specific cases" by pointing to "implementation details" – if it's been behaving consistently enough that someone actually relies on it behaving this way, then this is a specific behaviour that this issue proposes to change. It doesn't matter whether that has been documented, especially since in other discussions you have dismissed requests to document with "look at the code" @brian-brazil. @RichiH did look at the code.
I think both behaviours are valid: spread out scrape times, or consistently scrape. It should be easy enough to gate the additional salting on a configuration or flag?
Making it an optional flag would allow randomizing if the flag is empty, and staying in sync if it's set to a specific value.
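Purely as an illustration of that idea (no such flag exists in Prometheus; the flag name and behaviour are invented here): with the flag empty each server salts its offsets with something host-specific, while setting the same value on every server keeps their scrape phases identical.

```go
// Hypothetical flag-gated salting, sketched with the standard flag package.
package main

import (
	"flag"
	"fmt"
	"hash/fnv"
	"os"
	"time"
)

// Invented flag name for illustration only.
var offsetSalt = flag.String("scrape.offset-salt", "",
	"hypothetical: salt mixed into scrape offset hashing; empty means per-host")

func gatedOffset(identity string, interval time.Duration) time.Duration {
	salt := *offsetSalt
	if salt == "" {
		// Empty flag: per-host salt, so HA replicas spread apart.
		salt, _ = os.Hostname()
	}
	// Servers configured with the same non-empty salt compute identical
	// offsets and therefore stay in sync.
	h := fnv.New64a()
	h.Write([]byte(salt))
	h.Write([]byte(identity))
	return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
	flag.Parse()
	fmt.Println(gatedOffset(`{job="node",instance="db1:9100"}`, 15*time.Second))
}
```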
I don't think this should be configurable, nor that it should be anything other than a black box to users. Depending on Prometheus servers scraping at the same time goes against our general resilience philosophy; we explicitly do not try to keep things in sync between a HA pair and should not add features that encourage that. We can't create a flag every time one user has a non-recommended setup that breaks due to implementation changes (https://xkcd.com/1172/).
So we have valid arguments for leaving the behavior as it is. I still don't see an argument for changing anything. The theoretical arguments are avoiding deadlocks and reducing load spikes. Have we had any real-world observation where either of those is happening that gives us reason to change our current behavior? The deadlock case especially just means the exporter has a totally solvable bug. Plus, our spreading of scrapes is merely a best-effort solution anyway. I'd reject that with your own argument, @brian-brazil, that the incapability of a certain exporter should not dictate Prometheus' behavior. TL;DR: as long as there's no real-world usage report giving us reason to make a change, I suggest keeping the behavior as it is.
I'd disagree there.
I've gotten one. It'd be a fairly subtle thing though that most users wouldn't realise, it'd do things like slightly increase 99th percentile latency.
I never mentioned deadlocks. What I meant by locking was that scraping at the same instant increases lock contention.
There's spreading our scrapes to balance load within Prometheus, which works fine in practice (unless you're pulling from a few large targets - such as whole Prometheus servers via federation). And then there's the impact of the scrapes on the targets themselves, and once again for a single Prometheus scraping a service this works okay in practice. What this is about is a micro-optimisation to reduce the impact of multiple Prometheus servers scraping the same target. It'd only ever be best effort, but it'd be better than what we have now. It's a general principle of distributed system design to not have requests happening at exactly the same time, as that tends to cause both fan-in and fan-out problems.
It's also a general principle to not fix it if it ain't broken. Relying on scrapes not happening at the same time is at least as non-recommended as relying on them being consistent.
I never said a user should rely on them (and by chance many scrapes will still overlap; e.g. for a 10s interval with scrapes taking 1s, we'd expect ~10% to still overlap between the HA pair). This is merely a behind-the-scenes change that reduces the amount of overlap, and thus helps reduce tail latency.
If the scrapes are not in sync, falling back on the second Prometheus instance in a HA pair will subtly change the underlying data on every switchover.
That is expected; in general we expect the HA pair to be out of sync. They're there for alert redundancy, not to have identical data (https://www.robustperception.io/monitoring-without-consensus/). Many users report this already today.
I dismiss it in this specific case. @RichiH asked about this behaviour and was told that it is not specified, was going to change, and should not be depended on. And he chose to depend on it anyway. A single user building something which goes against our best practices, is dependent on implementation details of a particular version of Prometheus, and which they were explicitly told is an implementation detail subject to change is not exactly a strong argument for keeping a behaviour around. There is a recommended solution to the problem Richie is having, and it's the standard approach to not being able to scrape the metrics within the scrape interval: either increase the scrape interval or decrease the volume of metrics you're pulling. Neither adding caches nor depending on the exact time Prometheus scrapes at (to try and work around the reason it is recommended not to use caches) is recommended. We should not be doing anything to either encourage or enable such approaches.
That's not what happened here, but even if there were such a behaviour we'd have to look at why the users find that behaviour useful on its own merits and see what the proper solution (if any) for their problem is in the broader context. Depending on that we may decide, for example, to specify it, keep it unspecified but try not to break it, keep it unspecified, or even go so far as to go out of our way to break that use case if it was particularly problematic. We have things in all these categories today. For me this particular one is pretty clearly in the unspecified category; users explicitly should not care about how this works. As long as we obey the scrape_interval we're good.
If someone reads the source code and depends on some implementation detail they find, then it's on them if that implementation detail changes. Also, in the cases you're referring to, part of the reason for me saying to look at the code is that the behaviour in question was a non-trivial implementation detail where change is likely. This is one of those things you have to be careful with for the sake of being able to evolve the codebase; we can't let ourselves be ossified by users depending on things that we've told them not to depend on. Unspecified behaviour is a thing, and it gives us flexibility that ultimately benefits our users. For example, I think we're on about the 4th set of logic for the code we're talking about in this issue, and aside from the 30% CPU reduction from the 2nd set of logic, no one ever noticed the changes.
I totally agree with @brian-brazil here that exact scrape phases are an implementation detail and shouldn't be relied upon by the user. Optimizing this under the hood as suggested sounds good to me. Although it is impossible for us to quantify the size of the effect on average in the world, the increased lock contention and potentially higher tail latency in targets make sense to me too. HA pairs already have slightly out-of-phase scrapes, and I always tell dashboard users to only hit one member of the HA pair from Grafana (either directly or through e.g. HAProxy with main and backup backends). The standard HA strategy in Prometheus is optimized for reliable alerting, not for always- So generally I'm for this change.
Result from the dev summit:
In addition, we'll document that this behaviour is unspecified.
Closed by #5181
brian-brazil commented on Apr 11, 2016
Currently we determine the phase of a target's scrape by its labels and params. This means that a HA pair will both try to scrape targets at exactly the same time, causing a bigger load spike and potential locking issues.
In order to avoid this we need to offset them somehow. I propose also adding external_labels to the hash, as that should be different between the pair.
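As a concrete illustration of this proposal (a minimal sketch only, not the change that eventually closed this issue via #5181): fold the server's external_labels into the per-target hash in a deterministic order, so replicas that differ in at least one external label compute different phases. The "replica" label below is an assumed example of such a differing label.

```go
// Sketch only: derive the offset from the target identity plus the
// server-wide external_labels, hashed in sorted key order.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"time"
)

func offsetWithExternalLabels(identity string, externalLabels map[string]string, interval time.Duration) time.Duration {
	h := fnv.New64a()
	h.Write([]byte(identity))

	keys := make([]string, 0, len(externalLabels))
	for k := range externalLabels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for map iteration
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(externalLabels[k]))
	}
	return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
	target := `{job="node",instance="db1:9100"}`
	// The two replicas differ only in their external "replica" label, so they
	// end up with different offsets for the same target.
	fmt.Println(offsetWithExternalLabels(target, map[string]string{"replica": "a"}, 15*time.Second))
	fmt.Println(offsetWithExternalLabels(target, map[string]string{"replica": "b"}, 15*time.Second))
}
```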