
Avoid HA pair hitting targets at the same time #1546

Closed
brian-brazil opened this Issue Apr 11, 2016 · 32 comments

brian-brazil commented Apr 11, 2016

Currently we determine the phase of a target's scrape by its labels and params. This means that a HA pair will both try to scrape targets at exactly the same time, causing a bigger load spike and potential locking issues.

In order to avoid this we need to offset them somehow. I propose also adding external_labels to the hash, as that should differ between the pair.
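For illustration, here is a minimal sketch of the idea, with hypothetical names and a simplified label model (this is not the actual Prometheus code): hash the target's identity, map it onto the scrape interval, and note how folding each server's external_labels into the same hash would land the two HA members on different phases.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"time"
)

// phaseOffset hashes the target's labels (scrape params would be hashed the
// same way; omitted here) together with the server's external_labels, and
// maps the result onto [0, interval) to pick this target's scrape phase.
func phaseOffset(targetLabels, externalLabels map[string]string, interval time.Duration) time.Duration {
	h := fnv.New64a()
	for _, m := range []map[string]string{targetLabels, externalLabels} {
		keys := make([]string, 0, len(m))
		for k := range m {
			keys = append(keys, k)
		}
		sort.Strings(keys) // hash in a deterministic order
		for _, k := range keys {
			h.Write([]byte(k))
			h.Write([]byte{0xff})
			h.Write([]byte(m[k]))
			h.Write([]byte{0xff})
		}
	}
	return time.Duration(h.Sum64() % uint64(interval))
}

func main() {
	target := map[string]string{"job": "node", "instance": "db1:9100"}
	// Same target, different external_labels on each HA member, so the two
	// servers end up scraping it out of phase.
	fmt.Println(phaseOffset(target, map[string]string{"replica": "a"}, 10*time.Second))
	fmt.Println(phaseOffset(target, map[string]string{"replica": "b"}, 10*time.Second))
}
```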

juliusv commented Apr 11, 2016

AFAICT the phasing is not based on absolute time anyway, but relative to the latest scrape (https://github.com/prometheus/prometheus/blob/master/retrieval/target.go#L127). Given that, won't things distribute enough anyway?

grobie commented Apr 11, 2016

I can't confirm this; our scrape times are independent of each other. I believe you'll run into this issue more often if you reload/restart both Prometheis at exactly the same time, though.

OT: What external labels do you usually set? Our pairs are all 100% identical actually, including external_labels.

brian-brazil commented Apr 11, 2016

It's relative to absolute time, so that if a target disappears and reappears (e.g. due to a race in service discovery) it'll keep its offset.
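As a rough sketch of what "relative to absolute time" means here (a hypothetical helper, not the function linked above): the next scrape is placed on a wall-clock grid determined by the target's hash, so a target that briefly vanishes from service discovery rejoins the same slot when it comes back.

```go
package main

import (
	"fmt"
	"time"
)

// nextScrape places the next scrape on an absolute wall-clock grid: the
// target always fires at the same point (hash % interval) within each
// interval, regardless of when it was (re)discovered.
func nextScrape(now time.Time, interval time.Duration, hash uint64) time.Time {
	base := now.UnixNano() % int64(interval) // where we are within the current interval
	slot := int64(hash % uint64(interval))   // this target's fixed point within the interval
	wait := slot - base
	if wait <= 0 {
		wait += int64(interval)
	}
	return now.Add(time.Duration(wait))
}

func main() {
	now := time.Now()
	// Even if the target disappears and is re-added later, the schedule it
	// rejoins is the same 10s grid offset by hash % interval.
	fmt.Println(nextScrape(now, 10*time.Second, 12345))
	fmt.Println(nextScrape(now.Add(37*time.Second), 10*time.Second, 12345))
}
```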

brian-brazil commented Apr 11, 2016

> OT: What external labels do you usually set? Our pairs are all 100% identical actually, including external_labels.

That'll cause you some fun with remote storage, and possibly also federation depending on exactly what you're doing.

grobie commented Apr 12, 2016

@brian-brazil How so? Remote storage does not really exist so far, so it's hard to plan for that, and federation will add an instance label which will distinguish the results of the two nodes. What labels do you suggest, something like role=ha1/role=ha2?

brian-brazil commented Apr 12, 2016

> Remote storage does not really exist so far, so it's hard to plan for that

There are some aspects that can be planned for; from experience, you'll want both in storage and distinguished from each other.

> and federation will add an instance label which will distinguish the results of the two nodes.

You should be setting honor_labels for federation; otherwise you'll mess up your labels.

Andor commented Apr 22, 2016

And how can we run a query like that?
http://i.imgur.com/JENmJpE.png

juliusv commented Apr 22, 2016

@Andor There's no query-time federation at this point. You'd have to federate the necessary aggregated data into your higher-level Prometheus first and then query the data there.

brian-brazil commented Apr 22, 2016

@Andor That's unrelated to this issue and to HA. This isn't currently possible.

fabxc added kind/enhancement and removed enhancement labels Apr 28, 2016

brian-brazil commented Mar 7, 2017

A simpler way of doing this might be to add the hostname to the hash, as it's unlikely to be the same across a HA pair (at least if they're in the same datacenter).

RichiH commented Apr 25, 2017

Adding both external_labels and the local hostname seems prudent. An additional source of deterministic entropy, common in the networking world, would be the system's (primary, non-local) IP addresses.
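A sketch of what such a per-server seed could look like, with illustrative names only (hostname and external_labels hashed into one value that would then be mixed into every target's scrape offset):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"
	"sort"
)

// jitterSeed is an illustrative per-server seed: the local hostname and the
// external_labels are hashed together, and the resulting value would then be
// mixed into every target's hash before computing its scrape offset.
func jitterSeed(externalLabels map[string]string) (uint64, error) {
	hostname, err := os.Hostname()
	if err != nil {
		return 0, err
	}
	h := fnv.New64a()
	h.Write([]byte(hostname))
	keys := make([]string, 0, len(externalLabels))
	for k := range externalLabels {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte(externalLabels[k]))
	}
	return h.Sum64(), nil
}

func main() {
	seed, err := jitterSeed(map[string]string{"replica": "a"})
	if err != nil {
		panic(err)
	}
	fmt.Println(seed)
}
```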

JorritSalverda commented Aug 1, 2017

Doesn't scheduling the scrape interval relative to the last scrape time add some jitter, making it less likely that multiple concurrent requests synchronize?

See https://cloudplatform.googleblog.com/2016/11/how-to-avoid-a-self-inflicted-DDoS-Attack-CRE-life-lessons.html for reasoning as to why you would want to add jitter / randomness to your intervals.

brian-brazil commented Aug 1, 2017

> Doesn't scheduling the scrape interval relative to the last scrape time add some jitter, making it less likely that multiple concurrent requests synchronize?

No, the scrape duration doesn't impact when the next scrape happens. Randomness is not desirable here, as it'd make the data less clean.

RichiH commented Aug 1, 2017

As a counterpoint, we are currently relying on this behaviour; if this changed, we would request an option to maintain sync between HA pairs.

brian-brazil commented Aug 1, 2017

This is an implementation detail, which you should not depend on. From a philosophical standpoint, if you care about timestamps at this level then your monitoring is likely very fragile and you should look at making it more resilient.

RichiH commented Aug 1, 2017

It's not about timestamps; it's about load reduction for SNMP targets, as per prometheus/snmp_exporter#153.

Yes, it sucks, but we still have a few legacy, or not-so-legacy, devices which are being icky.

brian-brazil commented Aug 1, 2017

That's an independent problem, where your devices lack capacity to deal with scrapes.

This is about trying to reduce load spikes.

matthiasr commented Aug 1, 2017

We're 17 comments into discussing the benefits of different behaviours. You can't just dismiss "the current behaviour works better than the proposed one for specific cases" by pointing to "implementation details" – if it's been behaving consistently enough that someone actually relies on it behaving this way, then this is a specific behaviour that this issue proposes to change. It doesn't matter whether that has been documented, especially since in other discussions you have dismissed requests to document with "look at the code" @brian-brazil. @RichiH did look at the code.

matthiasr commented Aug 1, 2017

I think both behaviours are valid: spreading out scrape times, or scraping consistently. It should be easy enough to gate the additional salting on a configuration option or flag?

RichiH commented Aug 1, 2017

An optional flag would allow randomizing when the flag is empty, and staying in sync when it's set to a specific value.

brian-brazil commented Aug 1, 2017

I don't think this should be configurable, nor that it should be anything other than a black box to users. Depending on Prometheus servers scraping at the same time goes against our general resilience philosophy; we explicitly do not try to keep things in sync between a HA pair and should not add features that encourage that.

We can't create a flag every time one user has a non-recommended setup that breaks due to implementation changes (https://xkcd.com/1172/).

fabxc commented Aug 1, 2017

So we have valid arguments for leaving the behavior as it is. I still don't see an argument for changing anything.

Theoretical arguments are avoiding deadlocks and reducing load spikes. Have we had any real-world observation where either of those is happening that gives us reason to change our current behavior? Especially the deadlock case just means the exporter has a totally solvable bug. Plus, us spreading our scrapes is merely a best-effort solution anyway. I'd reject that with your own argument, @brian-brazil, that the incapability of a certain exporter should not dictate Prometheus' behavior.

TL;DR: as long as there's no real-world usage report giving us reason to make a change, I suggest keeping the behavior as it is.

brian-brazil commented Aug 1, 2017

> So we have valid arguments for leaving the behavior as it is.

I'd disagree there.

> Have we had any real-world observation where either of those is happening that gives us reason to change our current behavior?

I've gotten one. It'd be a fairly subtle thing that most users wouldn't realise, though; it'd do things like slightly increase 99th percentile latency.

> Especially the deadlock case just means the exporter has a totally solvable bug.

I never mentioned deadlocks. What I meant by locking was that scraping at the same instant increases lock contention.

> Plus, us spreading our scrapes is merely a best-effort solution anyway.

There's spreading our scrapes to balance load within Prometheus, which works fine in practice (unless you're pulling from a few large targets, such as whole Prometheus servers via federation). And then there's the impact of the scrapes on the targets themselves; once again, for a single Prometheus scraping a service this works okay in practice.

What this is about is a micro-optimisation to reduce the impact of multiple Prometheus servers scraping the same target. It'd only ever be best effort, but it'd be better than what we have now.

It's a general principle of distributed system design to not have requests happening at exactly the same time, as that tends to cause both fan-in and fan-out problems.

matthiasr commented Aug 1, 2017

It's also a general principle to not fix it if it ain't broken. Relying on scrapes not happening at the same time is at least as non-recommended as relying on them being consistent.

brian-brazil commented Aug 1, 2017

> Relying on scrapes not happening at the same time is at least as non-recommended as relying on them being consistent.

I never said a user should rely on them (and by chance many scrapes will still overlap; e.g. for a 10s interval with scrapes taking 1s, we'd expect ~10% to still overlap between the HA pair). This is merely a behind-the-scenes change that reduces the amount of overlap, and thus helps reduce tail latency.
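To make that ~10% concrete under a simplifying assumption (the two servers' phases are independent and uniform over the interval), the chance that one server starts its scrape while the other's 1s scrape is still in flight is roughly:

```latex
P(\text{B starts during A's scrape}) \approx \frac{\text{scrape duration}}{\text{scrape interval}} = \frac{1\,\text{s}}{10\,\text{s}} = 10\%
```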

RichiH commented Aug 1, 2017

If the scrapes are not in sync, falling back on the second Prometheus instance in a HA pair will subtly change the underlying data on every switchover.

brian-brazil commented Aug 1, 2017

That is expected; in general we expect the HA pair to be out of sync. They're there for alert redundancy, not to have identical data. https://www.robustperception.io/monitoring-without-consensus/

Many users report this already today.

brian-brazil commented Aug 2, 2017

> You can't just dismiss "the current behaviour works better than the proposed one for specific cases" by pointing to "implementation details" –

I dismiss it in this specific case.

@RichiH asked about this behaviour and was told that it is not specified, was going to change, and should not be depended on. And he chose to depend on it anyway. A single user building something which goes against our best practices, is dependent on implementation details of a particular version of Prometheus, and which they were explicitly told is an implementation detail subject to change is not exactly a strong argument for keeping a behaviour around.

There is a recommended solution to the problem RichiH is having, and it's the standard approach to not being able to scrape the metrics within the scrape interval: either increase the scrape interval or decrease the volume of metrics you're pulling. Neither adding caches nor depending on the exact time Prometheus scrapes at (to try to work around the reason it is recommended not to use caches) is recommended. We should not be doing anything to either encourage or enable such approaches.

> if it's been behaving consistently enough that someone actually relies on it behaving this way, then this is a specific behaviour that this issue proposes to change.

That's not what happened here, but even if there were such a behaviour we'd have to look at why the users find that behaviour useful on its own merits and see what the proper solution (if any) for their problem is in the broader context. Depending on that, we may decide for example to specify it, keep it unspecified but try not to break it, keep it unspecified, or even go so far as to go out of our way to break that use case if it was particularly problematic.

We have things in all these categories today. For me this particular one is pretty clearly in the unspecified category; users explicitly should not care about how this works. As long as we obey the scrape_interval we're good.

> It doesn't matter whether that has been documented, especially since in other discussions you have dismissed requests to document with "look at the code"

If someone reads the source code and depends on some implementation detail they find, then it's on them if that implementation detail changes. Also, in the cases you're referring to, part of the reason for me saying to look at the code is that the behaviour in question was a non-trivial implementation detail where change is likely.

This is one of those things you have to be careful with for the sake of being able to evolve the codebase; we can't let ourselves be ossified by users depending on things that we've told them not to depend on.

Unspecified behaviour is a thing, and it gives us flexibility that ultimately benefits our users. For example, I think we're on about the 4th set of logic for the code we're talking about in this issue, and aside from the 30% CPU reduction from the 2nd set of logic, no one ever noticed the changes.

juliusv commented Aug 2, 2017

I totally agree with @brian-brazil here that exact scrape phases are an implementation detail and shouldn't be relied upon by the user. Optimizing this under the hood as suggested sounds good to me. Although it is impossible for us to quantify the size of the effect on average in the world, the increased lock contention and potentially higher tail latency in targets make sense to me too.

HA pairs already have slightly out-of-phase scrapes, and I always tell dashboard users to only hit one member of the HA pair from Grafana (either directly or through e.g. HAProxy with main and backup backends). The standard HA strategy in Prometheus is optimized for reliable alerting, not for always-up dashboards.

So generally I'm for this change.

RichiH commented Aug 23, 2017

Result from the dev summit:

brian-brazil commented Aug 23, 2017

In addition, we'll document that this behaviour is unspecified.

brian-brazil commented Mar 12, 2019

Closed by #5181
