Show big and fat warning on web status page upon certain conditions. #1481
beorn7 added the feature-request label Mar 9, 2016
@fabxc fallout from our discussion
Sounds good.
fabxc added kind/enhancement and removed feature request labels Apr 28, 2016
Arguably, certain machine-level metrics like available disk space, system load, heavy swap/full memory, high IO wait, and possibly others fall into the same category of "the user should really handle this themselves, but this is really bad and we need to yell about it".
@RichiH Those are machine-level metrics you want to monitor anyway, whether or not Prometheus runs on that machine. Also, those metrics are not even accessible to the Prometheus server binary. The conditions this issue refers to are Prometheus-specific. You should still monitor them in your meta-monitoring, too. But showing on the status page that something is fundamentally wrong within Prometheus sounds like a good idea. There is no intent, though, to turn the status page into a system-health monitor.
Maybe a little off-topic, but it fits this discussion: Is that status page suitable as a health check URL (e.g. for K8s Liveness Probes)?
@dominikschulz This would just be part of the normal web UI. It wouldn't return a non-200 if it visually reports some errors there. So not really. Generally, you can use the availability of the web UI in a limited way as a startup health probe, as the web UI is started as the last thing on Prometheus startup. However, it won't tell you much about whether Prometheus is generally healthy, only that it has finished starting up.
beorn7 added the component/ui label Nov 2, 2016
brian-brazil added the priority/P3 label Jul 14, 2017
See also #1468
brian-brazil referenced this issue Aug 1, 2017: Throttled scraping with small heap size doesn't get logged. #3011 (Closed)
Is this still relevant given 2.0?
The storage conditions less so, but there are still things like rule groups taking longer than their interval.
It probably changes a bit. For example, isolation might want to let users know if there's a really old write hanging somewhere or, as Brian beat me to writing, recording rules taking longer than they should.
Though @gouthamve's implicit question is valid; there should be a new list of things to handle.
Easy ones:
beorn7 commented Mar 9, 2016
Like dirty storage, quarantining of series, too high persist pressure, too many memory chunks...
Ideally with a little explanation of what's happening and what can be done.
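One possible shape for such a warning, purely as a sketch: each condition carries a short explanation and a suggested remedy, and the status page renders whatever is currently active. The `statusWarning` type, the `renderWarnings` function, and the placeholder texts are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// statusWarning is a hypothetical entry for the status page: the condition
// plus a short explanation and a hint about what can be done.
type statusWarning struct {
	Condition   string
	Explanation string
	Remedy      string
}

// renderWarnings formats the active warnings as plain text.
// It returns an empty string if there is nothing to report.
func renderWarnings(ws []statusWarning) string {
	var b strings.Builder
	for _, w := range ws {
		fmt.Fprintf(&b, "WARNING: %s\n  What is happening: %s\n  What can be done: %s\n",
			w.Condition, w.Explanation, w.Remedy)
	}
	return b.String()
}

func main() {
	// Placeholder texts; real explanations would have to come from the maintainers.
	active := []statusWarning{
		{
			Condition:   "too many memory chunks",
			Explanation: "(placeholder explanation)",
			Remedy:      "(placeholder remedy)",
		},
	}
	fmt.Print(renderWarnings(active))
}
```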