Improve `/-/healthy` endpoint to check more than the webHandler #3650
We have both /-/healthy, which does what you describe, and /-/ready, which does what I think you want, right? See #2997 for context.
If I read this correctly, it should always check if it's ready before returning 200 on the ready endpoint. The handler is wrapped by https://github.com/prometheus/prometheus/pull/2997/files#diff-73cbd2009b0e4f156e8ca0f47e95b016R273
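To make the wrapping idea concrete, here is a minimal sketch of how a readiness gate around a handler could look. It is not the code from the linked PR: the `readiness` type, `whenReady`, `setReady`, `newMux`, and the response bodies are all hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readiness is a hypothetical stand-in for the web handler's ready flag.
type readiness struct {
	ready uint32 // 0 = not ready, 1 = ready; accessed atomically
}

func (r *readiness) setReady() { atomic.StoreUint32(&r.ready, 1) }

// whenReady wraps next so that it only runs once the flag is set.
// Until then every request gets 503 Service Unavailable.
func (r *readiness) whenReady(next http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, req *http.Request) {
		if atomic.LoadUint32(&r.ready) == 0 {
			http.Error(w, "Service Unavailable", http.StatusServiceUnavailable)
			return
		}
		next(w, req)
	}
}

// newMux registers both endpoints; only /-/ready goes through the wrapper.
func newMux(r *readiness) *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/-/healthy", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "Healthy.")
	})
	mux.Handle("/-/ready", r.whenReady(func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "Ready.")
	}))
	return mux
}

// statusOf sends an in-memory GET to h and returns the response status.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	r := &readiness{}
	mux := newMux(r)
	fmt.Println("before:", statusOf(mux, "/-/healthy"), statusOf(mux, "/-/ready"))
	r.setReady()
	fmt.Println("after: ", statusOf(mux, "/-/healthy"), statusOf(mux, "/-/ready"))
}
```

Note that in this sketch nothing ever resets the flag, so once `/-/ready` starts answering 200 it keeps doing so.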
True, but as I read it What doesn't look normal to me, is that |
/-/ready is supposed to respond with 200 no matter what. It can be used as a liveness probe in Kubernetes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/ Now regarding /-/healthy, it looks like you're right. Nothing sets that to non-ready. So I suppose it's a reasonable request to have this return non-2xx if something breaks after startup. @beorn7 / @grobie Thoughts?
Sounds good. What and how do you want to check?
I would check that reads and writes to the local storage are working. So I'm looking for new ideas. Do you have any suggestions?
Maybe you could:
This will only detect that nothing is obviously wrong (such as a huge deadlock), but that should be good enough. We should be careful, though, because /-/healthy will be called once per second or more, so nothing too expensive should be done.
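One way to keep a frequently polled endpoint cheap is to cache the check result for a short time, so that probes arriving once per second only trigger the real check every few seconds. This is a swapped-in technique nobody proposed above, sketched under assumptions: `cachedCheck`, `newCachedCheck`, and `fakeClock` are hypothetical names, and the 10-second TTL is arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// cachedCheck memoises the result of a possibly expensive health check.
// All names here are hypothetical.
type cachedCheck struct {
	mu      sync.Mutex
	ttl     time.Duration
	now     func() time.Time // injectable clock, so the cache is testable
	check   func() error     // the possibly expensive check
	last    time.Time
	lastErr error
	primed  bool
}

func newCachedCheck(ttlSeconds int, now func() time.Time, check func() error) *cachedCheck {
	return &cachedCheck{ttl: time.Duration(ttlSeconds) * time.Second, now: now, check: check}
}

// Result returns the cached outcome and re-runs check only when it is stale.
// Caveat: check runs under the mutex, so a check that hangs blocks all callers.
func (c *cachedCheck) Result() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if t := c.now(); !c.primed || t.Sub(c.last) >= c.ttl {
		c.lastErr = c.check()
		c.last = t
		c.primed = true
	}
	return c.lastErr
}

// fakeClock is a manually advanced clock used for the demonstration.
type fakeClock struct{ t time.Time }

func (f *fakeClock) Now() time.Time      { return f.t }
func (f *fakeClock) Advance(seconds int) { f.t = f.t.Add(time.Duration(seconds) * time.Second) }

func main() {
	clock := &fakeClock{t: time.Unix(0, 0)}
	runs := 0
	c := newCachedCheck(10, clock.Now, func() error {
		runs++
		if runs > 1 {
			return errors.New("storage check failed")
		}
		return nil
	})
	for i := 0; i < 3; i++ {
		fmt.Println("probe:", c.Result(), "real checks so far:", runs)
	}
	clock.Advance(10) // the cached result is now stale
	fmt.Println("probe:", c.Result(), "real checks so far:", runs)
}
```

The trade-off is detection latency: a failure can go unnoticed for up to one TTL, and the check itself still needs its own timeout so a hang does not block every probe.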
Those could get expensive on large setups.
@brian-brazil any idea of calls that would be good enough but still cheap? I think we should use the approach used in the Pushgateway: https://github.com/prometheus/pushgateway/blob/c62e6bb458ff2192bc5df82a1a7f9d8ac826fac7/handler/misc.go#L23 and add
The most basic implementation for these would just lock/unlock the mutex, if there is one, and check that there are no obvious errors. What do you think?
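A rough sketch of that lock/unlock idea, under assumptions: each component exposes a `Healthy() error` method, and the handler answers 200 only if all of them return nil. The `HealthChecker` interface, the `store` type, the timeout, and the 503 status are hypothetical choices for illustration, not code taken from the Pushgateway or Prometheus.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// HealthChecker is a hypothetical interface a component could implement.
type HealthChecker interface {
	Healthy() error
}

// store stands in for any component guarded by a mutex.
type store struct {
	mu      sync.Mutex
	timeout time.Duration
}

func newStore(timeoutMillis int) *store {
	return &store{timeout: time.Duration(timeoutMillis) * time.Millisecond}
}

// Healthy reports an error if the mutex cannot be acquired within timeout.
// Caveat: while the mutex stays stuck, each call leaves one goroutine blocked.
func (s *store) Healthy() error {
	done := make(chan struct{})
	go func() {
		s.mu.Lock()
		s.mu.Unlock() // being able to take the lock at all is the whole check
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-time.After(s.timeout):
		return errors.New("store: mutex not acquired in time")
	}
}

// healthyHandler answers 200 only if every component reports healthy.
func healthyHandler(components ...HealthChecker) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		for _, c := range components {
			if err := c.Healthy(); err != nil {
				http.Error(w, err.Error(), http.StatusServiceUnavailable)
				return
			}
		}
		fmt.Fprintln(w, "OK")
	})
}

// statusOf sends an in-memory GET to h and returns the response status.
func statusOf(h http.Handler) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/-/healthy", nil))
	return rec.Code
}

func main() {
	s := newStore(50)
	h := healthyHandler(s)
	fmt.Println("idle store:  ", statusOf(h))
	s.mu.Lock() // simulate a stuck lock holder
	fmt.Println("locked store:", statusOf(h))
	s.mu.Unlock()
}
```

A plain `Lock()` in the handler would hang the probe itself on a deadlock, which is why the sketch races the lock attempt against a timer. The timeout has to be shorter than the probe's own timeout for the error to be reported.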
simonpasquier added the kind/enhancement label on Aug 7, 2018
Related #3807
Thib17 commented Jan 4, 2018
Prometheus already provides a `/-/healthy` endpoint (https://github.com/prometheus/prometheus/blob/master/web/web.go#L262). But this endpoint keeps responding 200 for as long as the webHandler is up. In some situations the webHandler is up, but Prometheus doesn't work as expected (e.g. it can't write to disk, some important mutexes are deadlocked, ...).

Therefore `/-/healthy` should be improved to better check Prometheus's health, in order to detect when it needs to be killed. In my opinion, testing that we can at least read samples that are fresh enough could be a good start.

I would be happy to help with that, but I have no idea how I should start.
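As a starting point for the "fresh enough" idea, here is a minimal sketch that avoids reading from storage on every probe. Instead it records the time of the last successful append and compares its age to a limit. That is a cheaper proxy, not the read-based check described above, and everything in it is an assumption: the `freshness` type, its methods, and the 60-second limit are hypothetical and not part of the Prometheus storage API.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
	"time"
)

// freshness records when the last sample was appended. Hypothetical type.
type freshness struct {
	lastUnix int64 // Unix seconds of the last append; 0 = none yet; atomic
	maxAge   int64 // maximum tolerated age in seconds
}

func newFreshness(maxAgeSeconds int64) *freshness {
	return &freshness{maxAge: maxAgeSeconds}
}

// markAppendAt would be called by the write path after a successful append.
func (f *freshness) markAppendAt(unix int64) { atomic.StoreInt64(&f.lastUnix, unix) }

// MarkAppend records an append at the current wall-clock time.
func (f *freshness) MarkAppend() { f.markAppendAt(time.Now().Unix()) }

// checkAt returns an error if no sample exists or the newest one is too old.
func (f *freshness) checkAt(nowUnix int64) error {
	last := atomic.LoadInt64(&f.lastUnix)
	if last == 0 {
		return errors.New("no sample appended yet")
	}
	if age := nowUnix - last; age > f.maxAge {
		return fmt.Errorf("last sample is %ds old, limit is %ds", age, f.maxAge)
	}
	return nil
}

// Healthy evaluates freshness against the current wall-clock time.
func (f *freshness) Healthy() error { return f.checkAt(time.Now().Unix()) }

func main() {
	f := newFreshness(60)
	fmt.Println("before any append:", f.Healthy())
	f.MarkAppend()
	fmt.Println("right after append:", f.Healthy())
	fmt.Println("five minutes later:", f.checkAt(time.Now().Unix()+300))
}
```

The check costs a single atomic load, so it is safe to call once per second. A server with no targets configured would never append anything, so a real implementation would need to decide how to treat that case.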