
Deduping Identical targets #884

Closed
nikgrok opened this Issue Jul 8, 2015 · 19 comments

nikgrok commented Jul 8, 2015

Hey guys,

Today, if you are using service discovery (Consul), querying two services in one job returns duplicate scrapes. If possible, could you make scrapes with the same address in the same job unique?
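For context, a rough sketch of the kind of config shape that produces this (the Consul agent address and service names below are made up, and the syntax follows the scrape_configs/consul_sd_configs format as I understand it):

scrape_configs:
  - job_name: 'hadoop'
    consul_sd_configs:
      - server: 'localhost:8500'                            # made-up Consul agent address
        services: ['hadoop-datanode', 'hadoop-tasktracker'] # made-up service names

A machine registered under both services is discovered twice by this one job and is therefore scraped twice.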

fabxc (Member) commented Jul 8, 2015

My stance on this has always been that duplicate target configuration has to be resolved by the user.

If targets are duplicated with different static labels, Prometheus cannot decide for you which one is "the right one".
If you configure your targets or SD within one job, they carry distinct information about how they were retrieved. This information is used to preserve target state across reloads. It would be possible to check that no relabeling rules modify targets based on the SD information and then dedupe targets on that basis, but this would involve a lot of implementation complexity.

Could you elaborate why the very same target shows up in two distinct services for you? This doesn't seem like a desirable configuration to me.

brian-brazil (Member) commented Jul 8, 2015

If, at the end of all the SD and processing, there are completely identical duplicate targets among all jobs, it seems reasonable to me to scrape only one of them. By identical I mean all labels, scheme, basic auth, interval, metrics path, and anything else. Metric relabelling would be the one wrinkle to this.

fabxc (Member) commented Jul 8, 2015

"seems reasonable" is not good enough given how invasive this change would be. Only with very good reasons where this problem occurs with sane SD setups.

nikgrok (Author) commented Jul 8, 2015

Our use case is as follows. We'd like to monitor Hadoop machines. We have a service for the TaskTracker, the DataNode, and the NameNode. For me to set up a node exporter instance to scrape all machines, there is no way to avoid repeats (since some machines have both a TaskTracker and a DataNode, while others have just a TaskTracker). Today, I have no way of uniquely monitoring all Hadoop servers because of this. The only other option is to include both services, which almost doubles my number of scrapes.

A simple merge would fix it completely.

fabxc (Member) commented Jul 8, 2015

So is this about running the node exporter? If so, exporting node metrics has no direct relation to Hadoop running on them. Node exporters should probably have their own service. I assume you want to know that these nodes belong to your Hadoop cluster, which Consul tags and relabeling can solve for you.
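A rough sketch of that approach (the service and tag names are made up): a dedicated node-exporter service, plus a relabel rule that turns a Consul tag into a label.

scrape_configs:
  - job_name: 'node'
    consul_sd_configs:
      - server: 'localhost:8500'      # made-up Consul agent address
        services: ['node-exporter']   # one dedicated service for the node exporter
    relabel_configs:
      # __meta_consul_tags is the tag list joined by the tag separator,
      # e.g. ",hadoop,prod,". Targets carrying the "hadoop" tag get cluster="hadoop".
      - source_labels: ['__meta_consul_tags']
        regex: '.*,hadoop,.*'
        target_label: 'cluster'
        replacement: 'hadoop'

Nodes without the tag simply don't get the cluster label, since a non-matching relabel rule leaves the target untouched.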

In my opinion it doesn't boil down to a "simple merge", unfortunately. But PRs are always welcome – a place to start would probably be to detach the scraping from the targets.

matthiasr (Contributor) commented Jul 9, 2015

I think this is a valid request. At SC we don't encounter this because our cookbook forces us to work around it anyway, by creating a single service discovery entry for every possible group we want to change.

But I do see the need to "get the node-level statistics from all nodes of groups A and B" where A and B may or may not overlap. It is currently impossible to do this without either duplicating the intersection, or creating another entry for the overlap, which is nothing other than de-duplicating outside of Prometheus.

To be clear, I think scraping the same instance from several jobs should not be merged, but if I specify

dns_sd_configs:
  - names:
      - some_target.some_domain
      - other_target.other_domain

and one endpoint is listed in both SRV records, it should only be scraped and recorded once. The same goes for other SD methods.

matthiasr (Contributor) commented Jul 9, 2015

(actually, why isn't this deduplicating at the storage layer?)

brian-brazil (Member) commented Jul 9, 2015

(actually, why isn't this deduplicating at the storage layer?)

Getting two scrapes of the same target at different time offsets is going to be confusing, and can't be de-duplicated at the storage layer due to the different timestamps.

matthiasr (Contributor) commented Jul 9, 2015

But it should end up as one time series with double the sampling rate?

@nikgrok could you describe how the duplication manifests for you?

brian-brazil (Member) commented Jul 9, 2015

But it should end up as one time series with double the sampling rate?

Yes. Probably interesting interactions with #394 too.

nikgrok (Author) commented Aug 6, 2015

Hey guys, trying to restart this discussion. @matthiasr, basically, if we use the Consul config today, two services with a machine in both will cause that machine to be scraped twice.

Example:
Hadoop DataNode
srv1 srv2 srv3

Hadoop TaskTracker
srv2 srv3

Total scraped targets
srv1:9100
srv2:9100
srv3:9100
srv2:9100
srv3:9100

beorn7 (Member) commented Aug 6, 2015

I guess the discussion ended at "but why is that so bad?". @fabxc can tell us how invasive the de-duping would be.

fabxc (Member) commented Aug 6, 2015

Thanks to an in-person discussion with @matthiasr, I now better understand why the scenario is reasonable in general.

The question indeed remains "but why is that so bad?". There is an increased sampling rate, but no race or anything like that. Is the impact on the storage relevant? It's certainly a brain-itch that should be fixed eventually, but not one an end user would usually see.

Scraping would essentially have to be detached from targets. This would solve the double scraping but the target would still be discovered multiple times. If the latter is the actual thing that's annoying people, there are easier ways to fix that at the UI level.

@nikgrok It would be good to know which part actually causes frustration for you.

nikgrok (Author) commented Aug 6, 2015

I imagine multiple endpoints will be created in storage (in the case of two services with 90% overlap), so you are increasing the space required by 90%. Am I wrong about that?

Also, it means that the collectors will be scraping 90% more, and we already plan on really hammering Prometheus. (Our Hadoop infrastructure is already at 3.5k machines.)

fabxc (Member) commented Aug 6, 2015

That sounds like a very legitimate use case. I'll try to get to this sooner rather than later.

grobie (Member) commented Aug 6, 2015

Why do you use the Consul results for the TaskTracker/DataNode services to discover node exporters? If you register a separate service for the node exporter, it would only be scraped once. Using your example, shouldn't it be:

Hadoop Datanode
srv1:PORTDATA srv2:PORTDATA srv3:PORTDATA

Hadoop TaskTracker
srv2:PORTTASK srv3:PORTTASK

Node exporters
srv1:9100 srv2:9100 srv3:9100

Total scraped targets
srv1:PORTDATA
srv1:9100
srv2:PORTDATA
srv2:PORTTASK
srv2:9100
srv3:PORTDATA
srv3:PORTTASK
srv3:9100

EDIT: I understand @matthiasr's comment now. If it's about having a common label everywhere, couldn't an additional label be applied to all three jobs (DataNode, TaskTracker, node exporter)?
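A minimal sketch of what that could look like (the cluster label name and value are made up; the same rule would be repeated on the tasktracker and node exporter jobs):

scrape_configs:
  - job_name: 'hadoop-datanode'
    consul_sd_configs:
      - server: 'localhost:8500'       # made-up Consul agent address
        services: ['hadoop-datanode']
    relabel_configs:
      # Attach the same static label in every job so queries can select
      # all Hadoop-related series via cluster="hadoop".
      - target_label: 'cluster'
        replacement: 'hadoop'
  # ... the 'hadoop-tasktracker' and 'node-exporter' jobs would carry
  #     the identical relabel rule.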

fabxc (Member) commented Sep 12, 2015

@nikgrok #1064 implements this. You might want to give it a try.

@fabxc fabxc added this to the v0.17.0 milestone Sep 21, 2015

@fabxc fabxc self-assigned this Sep 21, 2015

fabxc (Member) commented Apr 7, 2016

Fixed in 0.18.0rc1

@fabxc fabxc closed this Apr 7, 2016

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
