
Improve `/-/healthy` endpoint to check more than the webHandler #3650

Open
Thib17 opened this Issue Jan 4, 2018 · 11 comments

@Thib17
Contributor

Thib17 commented Jan 4, 2018

Prometheus already provides a /-/healthy endpoint (https://github.com/prometheus/prometheus/blob/master/web/web.go#L262). But this endpoint keeps responding 200 for as long as the webHandler is up. In some situations the webHandler is up, but Prometheus doesn't work as expected (e.g. it can't write to disk, some important mutexes are deadlocked, ...).

Therefore /-/healthy should be improved to check Prometheus's health more thoroughly, so we can detect when it needs to be killed. In my opinion, testing that we can at least read samples that are fresh enough would be a good start.
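
A naïve version of that could look roughly like the sketch below. Everything here is illustrative: latestSampleTime stands in for whatever accessor the storage layer could expose, it is not an existing Prometheus function.

```go
// Hypothetical sketch only: latestSampleTime is a stand-in for whatever the
// storage layer could expose, not real Prometheus code.
package health

import (
	"fmt"
	"net/http"
	"time"
)

// freshHealthyHandler answers /-/healthy with 200 only if storage has
// ingested a sample within maxAge, and with 503 otherwise.
func freshHealthyHandler(latestSampleTime func() (time.Time, bool), maxAge time.Duration) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ts, ok := latestSampleTime()
		if !ok || time.Since(ts) > maxAge {
			http.Error(w, "no fresh samples in storage", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "Prometheus is Healthy.")
	}
}
```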

I would be happy to help with that, but I have no idea where to start.

@discordianfish

Member

discordianfish commented Jan 4, 2018

We have both /-/healthy, which does what you describe, and /-/ready, which does what I think you want, right? See #2997 for context.

@Thib17

Contributor Author

Thib17 commented Jan 4, 2018

/-/ready seems to be used only at start-up. AFAIK, once the webHandler is set to ready, it never changes.
So if Prometheus starts properly and then stops working as expected after a while, both the current /-/healthy and /-/ready will keep responding 200. In such a situation I believe /-/ready is right to respond 200, but /-/healthy should respond 5xx.

@discordianfish

Member

discordianfish commented Jan 4, 2018

If I read this correctly, it should always check if it's ready before returning 200 on the ready endpoint. The handler is wrapped by https://github.com/prometheus/prometheus/pull/2997/files#diff-73cbd2009b0e4f156e8ca0f47e95b016R273
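
For reference, the readiness gating looks roughly like this. This is a simplified paraphrase rather than the exact code from that diff; only the ready field and isReady() are names taken from this thread, the rest is illustrative.

```go
// Simplified paraphrase of the readiness gate; not the exact Prometheus code.
package web

import (
	"net/http"
	"sync/atomic"
)

type Handler struct {
	ready uint32 // 0: starting up, 1: ready; only ever flipped to 1
}

// Ready marks the handler as ready once start-up has finished.
func (h *Handler) Ready() {
	atomic.StoreUint32(&h.ready, 1)
}

func (h *Handler) isReady() bool {
	return atomic.LoadUint32(&h.ready) > 0
}

// testReady wraps an endpoint so it answers 503 until start-up is done and
// then always delegates; nothing ever resets the flag back to 0.
func (h *Handler) testReady(f http.HandlerFunc) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if !h.isReady() {
			http.Error(w, "Service Unavailable", http.StatusServiceUnavailable)
			return
		}
		f(w, r)
	}
}
```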

@Thib17

Contributor Author

Thib17 commented Jan 4, 2018

True, but as I read it, h.ready is never set back to 0. So once it's set to 1, isReady() always returns true, and /-/ready always responds 200. That looks normal to me, as it only checks that start-up is done.

What doesn't look normal to me is that /-/healthy always returns 200, whether Prometheus works as expected or not.

@discordianfish

Member

discordianfish commented Jan 4, 2018

/-/ready is supposed to respond with 200 no matter what. It can be used as a liveness probe in Kubernetes: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-probes/

Now regarding /-/healthy, it looks like you're right. Nothing ever sets it to unhealthy. So I suppose it's a reasonable request to have it return non-2xx if something breaks after startup. @beorn7 / @grobie Thoughts?

@grobie

Member

grobie commented Jan 4, 2018

Sounds good. What do you want to check, and how?

@Thib17

Contributor Author

Thib17 commented Jan 5, 2018

I would check that reads and writes to the local storage are working.
To do that, I first thought of querying the up metric and checking that its timestamps are fresh enough. But that doesn't work, because a Prometheus without any targets (and therefore without an up metric) would never be healthy.

So I'm looking for new ideas. Do you have any suggestions?

@iksaif

Contributor

iksaif commented Jan 9, 2018

Maybe you could:

  • List series names (equivalent of /api/v1/label/__name__/values): this validates that the query engine works
  • List all targets (equivalent of /api/v1/targets): this validates that service discovery works
  • List Alertmanagers (equivalent of /api/v1/alertmanagers): this validates that Alertmanager service discovery works

This would only detect that nothing is obviously wrong (e.g. a huge deadlock), but that should be good enough. We should be careful, though, because /-/healthy will be called once per second or more, so nothing too expensive should be done. A rough sketch of such a handler is below.
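
To be clear, this is only illustrative: the check functions below are placeholders for cheap internal calls equivalent to the three endpoints listed above, not an existing Prometheus API. The point is just that the handler would run each check with a short timeout and fail fast.

```go
// Illustrative sketch: the check functions are placeholders for cheap
// internal equivalents of the three API endpoints listed above.
package health

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

type checks struct {
	seriesNames   func(ctx context.Context) error // ~ /api/v1/label/__name__/values
	targets       func(ctx context.Context) error // ~ /api/v1/targets
	alertmanagers func(ctx context.Context) error // ~ /api/v1/alertmanagers
}

// handler runs each check with a short timeout so that /-/healthy stays
// cheap even when it is probed every second.
func (c checks) handler(w http.ResponseWriter, r *http.Request) {
	ctx, cancel := context.WithTimeout(r.Context(), time.Second)
	defer cancel()

	for name, check := range map[string]func(context.Context) error{
		"query engine":           c.seriesNames,
		"target discovery":       c.targets,
		"alertmanager discovery": c.alertmanagers,
	} {
		if err := check(ctx); err != nil {
			http.Error(w, fmt.Sprintf("%s check failed: %v", name, err), http.StatusServiceUnavailable)
			return
		}
	}
	w.WriteHeader(http.StatusOK)
	fmt.Fprintln(w, "Prometheus is Healthy.")
}
```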

@brian-brazil

Member

brian-brazil commented Jan 9, 2018

Those could get expensive on large setups.

@iksaif

Contributor

iksaif commented Jan 9, 2018

@brian-brazil any idea of calls that would be good enough but still cheap?

I think we should use the approach taken in the Pushgateway (https://github.com/prometheus/pushgateway/blob/c62e6bb458ff2192bc5df82a1a7f9d8ac826fac7/handler/misc.go#L23) and add a Healthy() error method to the important subsystems:

  • ScrapeManager
  • TargetManager
  • RuleManager
  • QueryEngine
  • Notifier

The most basic implementation for these would just lock/unlock the mutex, if there is one, and check that there are no obvious errors. See the sketch below.
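
Very rough sketch of that shape: the HealthChecker interface and the subsystem wiring are hypothetical, not existing Prometheus types; only the Healthy() error signature mirrors what the Pushgateway handler linked above relies on.

```go
// Hypothetical sketch: HealthChecker and the subsystem map are made up for
// illustration; each subsystem would implement its own Healthy() method.
package health

import (
	"fmt"
	"net/http"
)

// HealthChecker would be implemented by ScrapeManager, TargetManager,
// RuleManager, QueryEngine and Notifier.
type HealthChecker interface {
	Healthy() error
}

// healthyHandler reports 503 as soon as one subsystem reports an error,
// so a liveness probe can restart the instance.
func healthyHandler(subsystems map[string]HealthChecker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		for name, s := range subsystems {
			if err := s.Healthy(); err != nil {
				http.Error(w, fmt.Sprintf("%s is unhealthy: %v", name, err), http.StatusServiceUnavailable)
				return
			}
		}
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "Prometheus is Healthy.")
	}
}
```

Iterating over a small fixed set of subsystems keeps this cheap enough to be hit every second, as long as each Healthy() stays trivial (take/release a lock, look at a last-error field).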

What do you think?

@simonpasquier

Member

simonpasquier commented Aug 7, 2018

Related #3807
