
Consul SD: Include health check information in metadata #770

Open
beorn7 opened this Issue Jun 3, 2015 · 11 comments

beorn7 commented Jun 3, 2015

So that users who need that information can relabel it onto their metrics.

__meta_consul_health or something...

fabxc commented Jun 15, 2015

There can be multiple health checks per node. The check IDs do not match our labelname constraints.
Health checks can have four states: unknown, passing, warning, or critical.

What would make sense is __meta_consul_health_<state>, containing the matching check IDs concatenated by a configurable separator (a checkSeparator, analogous to how __meta_consul_tags works).
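A sketch of how the proposed label could be consumed in a scrape configuration. Note that __meta_consul_health_passing is hypothetical here, following the naming scheme proposed above; it is not an existing Prometheus meta label:

```yaml
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'localhost:8500'   # assumed local Consul agent
    relabel_configs:
      # Hypothetical meta label from the proposal above: the IDs of all
      # checks currently in state "passing", joined by the separator.
      - source_labels: [__meta_consul_health_passing]
        target_label: consul_checks_passing
```

This is the same relabeling pattern users already apply to __meta_consul_tags to copy discovery metadata onto their metrics.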

sdurrheimer commented Jun 24, 2015

Just in case, there can be multiple health checks per node AND per service.

beorn7 referenced this issue Aug 5, 2015 in the merged pull request "consul port label" #948.


beorn7 commented Aug 5, 2015

See #948 (comment) for an example for output with multiple health checks.

nfirvine commented Apr 11, 2018

Any movement on this? I see that #2085 was closed due to it causing too many RPCs.

For me the impact is with prometheus finding its alertmanagers using consul SD. Even though the node/service is unhealthy, prometheus is still attempting to send alerts to it.

#2085 (comment) indicates you should keep trying to scrape services, even if they're unhealthy, which I'm not quite sure I understand, but maybe. On the other hand, sending alerts to unhealthy alertmanagers seems wrong in general, and at least deserves the ability to opt out, I would think.

brian-brazil commented Apr 12, 2018

Even though the node/service is unhealthy, prometheus is still attempting to send alerts to it.

There's no problem with doing that.

nfirvine commented Apr 13, 2018

No, it's not really a problem in terms of Prometheus doing its job, but the logs are filled with spurious errors. Also, I was hoping to use the series prometheus_notifications_alertmanagers_discovered, but it seems to include all these unhealthy ones.

brian-brazil commented Apr 13, 2018

the logs are filled with spurious errors.

That sounds like something we can improve, can you file an issue?

I was hoping to use the series prometheus_notifications_alertmanagers_discovered but it seems it includes all these unhealthy ones.

If your alertmanagers are unhealthy that often, you have bigger problems.

nfirvine commented Apr 17, 2018

the logs are filled with spurious errors.
That sounds like something we can improve, can you file an issue?

I can, but from my point of view avoiding enumerating those alertmanagers in the first place seems like the right fix. If we rip out those log messages, there'd really be no way to tell if an alertmanager that's reporting healthy in consul is actually failing to accept alerts.

Honestly, I'm struggling to understand the resistance to figuring out a way to exclude unhealthy consul services. Seems like it'd be a pretty desirable feature, and the design of scraping unhealthy targets is unintuitive. Kubernetes SD for endpoints has the ready meta label, and I don't know how this is that different. Granted, I haven't worked with the Consul API extensively, so very likely I'm underestimating how difficult/expensive it is. FWIW, I'm not just idly complaining: I want to know if somebody else tried and found it too expensive, because I'm considering writing it myself.

If your alertmanagers are unhealthy that often, you have bigger problems.

Yes, a bigger problem that I would love to be able to alert on using my meta-monitoring. ;)

brian-brazil commented Apr 17, 2018

Yes, a bigger problem that I would love to be able to alert on using my meta-monitoring. ;)

Which is why it's important to scrape non-healthy instances :)

Something is wrong with your setup; this is a very weak use case.

nfirvine commented Apr 17, 2018

I am scraping them. But I need to know how many targets Prometheus considers a valid place to send alerts, not just whether each is a valid target for scraping. That information doesn't seem to exist. Anyway, I don't seem to be able to convey this, so let's drop it.

Can we talk about the other aspect though? I think there's valid reasons for something like __meta_consul_service_healthy, just like __meta_kubernetes_pod_ready. Are there technical or design reasons why this shouldn't happen?
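If such a label existed, filtering out unhealthy services would be a single relabel rule, mirroring the common __meta_kubernetes_pod_ready pattern. A hypothetical sketch — __meta_consul_service_healthy is only proposed in this thread, not implemented:

```yaml
relabel_configs:
  # Keep only targets whose Consul service reports healthy.
  # __meta_consul_service_healthy is hypothetical, named by analogy
  # with the Kubernetes SD label __meta_kubernetes_pod_ready.
  - source_labels: [__meta_consul_service_healthy]
    regex: 'true'
    action: keep
```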

brian-brazil commented Apr 18, 2018

That information doesn't seem to exist.

There's no such distinction.
