
Feature: add healthcheck #61

Open
alexellis opened this issue Jul 6, 2019 · 5 comments

@alexellis
Member

Expected Behaviour

A healthcheck over HTTP, or an exec probe, which Kubernetes can use to check readiness and health.

Current Behaviour

N/a

Possible Solution

Please suggest one of the options above, or see how other projects are doing this and report back.

Context

A health check can help with robustness: Kubernetes could restart an unhealthy queue-worker, or hold back traffic until it is ready.
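
For illustration, here is a minimal sketch of the HTTP option in Go (the language the queue-worker is written in). The `/healthz` path, the port, and the `ready` flag wiring are assumptions made for the sketch, not existing queue-worker behaviour:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// ready would be flipped by the worker once its NATS connection is up
// (hypothetical wiring, for illustration only).
var ready atomic.Bool

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			fmt.Fprintln(w, "OK")
			return
		}
		http.Error(w, "not connected to NATS", http.StatusServiceUnavailable)
	})

	ready.Store(true) // stand-in for "connected to NATS"
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```

A Kubernetes httpGet probe pointed at `/healthz` on that port would then report unready until the connection is established.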

@alexellis
Member Author

@matthiashanel what are your thoughts on this?

@matthiashanel
Contributor

@alexellis, I can see how adding an HTTP endpoint makes sense if the service itself serves HTTP: there you'd get some real feedback, e.g. if your service slows down, so does the health check endpoint. In the queue worker the two would be largely unrelated, so I don't quite see the benefit justifying the added complexity.
As for readiness: if connect fails, the program will exit, causing a restart. While this happens, messages will continue to be stored in NATS Streaming.

Did you run into a concrete problem where this could help?

@alexellis
Member Author

Most Kubernetes services should have a way to express health and readiness via an exec, TCP, or HTTP probe. This can be used for a number of things including decisions about scaling or recovery.

If we're fairly sure that this is not required when interacting with NATS then I'll close it out.

I wonder if there is any value in exploring metrics instrumentation of the queue-worker itself, or if the metrics in the gateway and NATS itself are enough to get a good picture of things?
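
To make the metrics idea concrete, here is a hedged sketch of instrumenting the queue-worker with Prometheus via github.com/prometheus/client_golang; the metric name and port are invented for the example:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// messagesProcessed would count handled messages; the metric name is
// hypothetical, not something the queue-worker exports today.
var messagesProcessed = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "queue_worker_messages_processed_total",
	Help: "Total messages processed by the queue worker.",
})

func main() {
	prometheus.MustRegister(messagesProcessed)

	// The worker's message handler would call messagesProcessed.Inc()
	// after each function invocation completes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```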

@matthiashanel
Contributor

matthiashanel commented May 13, 2020

Health probe:
The most useful value I can imagine the queue worker producing is how many messages it is currently processing. But a value of 5 says little about whether scaling is needed or not; scaling is needed when there are too many messages the service has not yet seen.

Readiness probe:
The queue worker does not open a port or serve HTTP, which makes a readiness probe a tough nut to crack. "Ready" for the queue worker essentially means the NATS connection got established; if that fails, the queue worker exits. I can imagine conditions where the streaming client does not return from connect, but starting a web server just to indicate readiness and protect against that seems even more complex. Do I make sense here?
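
One lighter-weight guard against a connect that never returns, as a sketch: bound the initial connect with the stan.go client's ConnectWait option, so a hang becomes an error, an exit, and a container restart. Cluster ID, client ID, and URL are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/stan.go"
)

func main() {
	// ConnectWait bounds how long the initial connect may block, so a hung
	// connect turns into a fatal error (and a restart) instead of a worker
	// that looks alive but never becomes ready.
	sc, err := stan.Connect("test-cluster", "queue-worker",
		stan.NatsURL("nats://127.0.0.1:4222"),
		stan.ConnectWait(10*time.Second))
	if err != nil {
		log.Fatalf("could not connect to NATS Streaming: %v", err)
	}
	defer sc.Close()
}
```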

We will get a lot more mileage by using the metrics NATS already has. In nats-streaming-server, all that has to happen is opening the monitoring port with `-m <port>`.

This example shows how to discover channels and then inspect one via curl:

```
nats-streaming-server -m 8080

curl http://127.0.0.1:8080/streaming/channelsz
{
  "cluster_id": "test-cluster",
  "server_id": "ZAs0tFNCNAd5CZuEm0I0xA",
  "now": "2020-05-13T14:49:04.556034-04:00",
  "offset": 0,
  "limit": 1024,
  "count": 2,
  "total": 2,
  "names": [
    "queue",
    "foo"
  ]
}

curl "http://127.0.0.1:8080/streaming/channelsz?channel=queue"
{
  "name": "queue",
  "msgs": 1,
  "bytes": 22,
  "first_seq": 1,
  "last_seq": 1
}

# this one also returns information about the subscribers
# (note the quotes: an unquoted & would background the command)
curl "http://127.0.0.1:8080/streaming/channelsz?channel=foo&subs=1"
```

https://docs.nats.io/nats-streaming-concepts/monitoring#monitoring-a-nats-streaming-channel-with-grafana-and-prometheus
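
Building on those endpoints, here is a sketch of how an external checker (or autoscaler) could read the channel stats in Go; the struct mirrors only the fields shown in the curl output above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// channelz mirrors the fields shown in the /streaming/channelsz output.
type channelz struct {
	Name    string `json:"name"`
	Msgs    int    `json:"msgs"`
	LastSeq uint64 `json:"last_seq"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8080/streaming/channelsz?channel=queue")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var ch channelz
	if err := json.NewDecoder(resp.Body).Decode(&ch); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("channel %q holds %d messages (last_seq=%d)\n", ch.Name, ch.Msgs, ch.LastSeq)
}
```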
