Show big and fat warning on web status page upon certain conditions. #1481

Open
beorn7 opened this Issue Mar 9, 2016 · 12 comments

8 participants
@beorn7
Member

beorn7 commented Mar 9, 2016

For example: dirty storage, quarantining of series, too high persist pressure, too many memory chunks, and so on.

Ideally with a short explanation of what's happening and what can be done about it.

@beorn7

Member Author

beorn7 commented Mar 9, 2016

@fabxc fallout from our discussion

@fabxc

Member

fabxc commented Mar 9, 2016

Sounds good.

@RichiH

Member

RichiH commented May 23, 2016

Arguably, certain machine-level metrics like available disk space, system load, heavy swap/full memory, high IO wait, and possibly others fall into the same category of "the user should really handle this themselves, but this is really bad and we need to yell about it".

@beorn7

Member Author

beorn7 commented May 23, 2016

@RichiH Those are machine-level metrics you want to monitor anyway, whether or not Prometheus runs on your machine. Also, those metrics are not even accessible to the Prometheus server binary.

The conditions this issue refers to are Prometheus-specific. You should still monitor them in your meta-monitoring, too. But showing on the status page that something is fundamentally wrong within Prometheus sounds like a good idea.

There is no intent, though, to turn the status page into a system-health monitor.

@dominikschulz

Contributor

dominikschulz commented Oct 16, 2016

Maybe a little off-topic, but it fits this discussion: Is that status page suitable as a health check URL (e.g. for K8s Liveness Probes)?

@juliusv

Member

juliusv commented Oct 16, 2016

@dominikschulz This would just be part of the normal web UI. It wouldn't return a non-200 if it visually reports some errors there. So not really.

Generally, you can use the availability of the web UI in a limited way as a startup health probe, as the web UI is started as the last thing on Prometheus startup. However, it won't tell you much about whether Prometheus is generally healthy, only that it has finished starting up.
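The limited startup probe described above could be sketched as follows. This is a minimal Python illustration, not part of Prometheus: the function names `web_ui_responds` and `wait_for_startup` are made up for this example, and the URL you pass in is whatever address your server listens on. A `True` result only means the web UI answered with HTTP 200, that is, startup has finished; it says nothing about overall health.

```python
import time
import urllib.error
import urllib.request


def web_ui_responds(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not up (yet).
        return False


def wait_for_startup(check, attempts: int = 30, delay: float = 1.0) -> bool:
    """Call `check()` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A caller would use it as `wait_for_startup(lambda: web_ui_responds(url))`, with `url` pointing at the web UI.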

@brian-brazil

Member

brian-brazil commented Jul 14, 2017

See also #1468

@gouthamve

Member

gouthamve commented Jan 18, 2018

Is this still relevant given 2.0?

@brian-brazil

Member

brian-brazil commented Jan 18, 2018

The storage conditions less so, but there are still things like rule groups taking longer than their interval.

@RichiH

Member

RichiH commented Jan 18, 2018

It probably changes a bit. For example, isolation might want to let users know if there's a really old write hanging somewhere or, as Brian beat me to writing, if recording rules are taking longer than they should.

@RichiH

Member

RichiH commented Jan 18, 2018

Though @gouthamve's implicit question is valid; there should be a new list of things to handle.

@roidelapluie

Contributor

roidelapluie commented Feb 14, 2018

Easy ones:

  • Prometheus failed to reload its configuration
  • WAL corruptions
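
The conditions collected in this thread (failed config reload, WAL corruptions, rule groups taking longer than their interval) could be turned into status-page warnings along these lines. This is only a sketch in Python: `ServerState`, `RuleGroup`, `status_warnings`, and all field names are hypothetical and do not correspond to any existing Prometheus API. Each warning pairs a short explanation with a suggested action, as the original request asks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RuleGroup:
    # Hypothetical view of one rule group's timing.
    name: str
    interval_seconds: float
    last_duration_seconds: float


@dataclass
class ServerState:
    # Hypothetical snapshot of the conditions discussed in this issue.
    config_reload_ok: bool = True
    wal_corruptions: int = 0
    rule_groups: List[RuleGroup] = field(default_factory=list)


def status_warnings(state: ServerState) -> List[str]:
    """Return one human-readable warning per problematic condition."""
    warnings = []
    if not state.config_reload_ok:
        warnings.append(
            "Configuration reload failed. "
            "Check the server log for the error and fix the configuration file."
        )
    if state.wal_corruptions > 0:
        warnings.append(
            f"{state.wal_corruptions} WAL corruption(s) detected. "
            "Recent data may be affected; check the server log."
        )
    for group in state.rule_groups:
        if group.last_duration_seconds > group.interval_seconds:
            warnings.append(
                f"Rule group '{group.name}' took {group.last_duration_seconds:.1f}s "
                f"to evaluate, longer than its {group.interval_seconds:.1f}s interval. "
                "Consider splitting the group or simplifying its rules."
            )
    return warnings
```

An empty list means there is nothing to show; a status page would render each returned string as a prominent banner.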