
Feature: add healthcheck #61

Open
alexellis opened this issue Jul 6, 2019 · 5 comments

@alexellis
Member

Expected Behaviour

A healthcheck over HTTP, or an exec probe, which Kubernetes can use to check readiness and health.

Current Behaviour

N/a

Possible Solution

Please suggest one of the options above, or see how other projects are doing this and report back.

Context

A health check can help with robustness: Kubernetes could restart an unhealthy queue-worker, or hold back traffic until it is ready.
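
For illustration, here is a minimal sketch of the HTTP option in Go (the language the queue-worker is written in). The `/healthz` path, the port, and the `ready` flag wiring are assumptions made for the sketch, not existing queue-worker behaviour:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"sync/atomic"
)

// ready would be flipped by the worker once its NATS connection is up
// (hypothetical wiring, for illustration only).
var ready atomic.Bool

func main() {
	http.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if ready.Load() {
			fmt.Fprintln(w, "OK")
			return
		}
		http.Error(w, "not connected to NATS", http.StatusServiceUnavailable)
	})

	ready.Store(true) // stand-in for "connected to NATS"
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```

A Kubernetes httpGet probe pointed at `/healthz` on that port would then report unready until the connection is established.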

@alexellis
Member Author

@matthiashanel what are your thoughts on this?

@matthiashanel
Contributor

@alexellis, I can see how adding an HTTP endpoint makes sense if the service itself serves HTTP: there you'd get some real feedback, e.g. if your service slows down, so does the health check endpoint. In the queue worker the two would be largely unrelated, so I don't quite see the benefit justifying the added complexity.
As for readiness: if connect fails, the program will exit, causing a restart. While this happens, messages will continue to be stored in NATS Streaming.

Did you run into a concrete problem where this could help?

@alexellis
Member Author

Most Kubernetes services should have a way to express health and readiness via an exec, TCP, or HTTP probe. This can be used for a number of things including decisions about scaling or recovery.

If we're fairly sure that this is not required when interacting with NATS then I'll close it out.

I wonder if there is any value in exploring metrics instrumentation of the queue-worker itself, or if the metrics in the gateway and NATS itself are enough to get a good picture of things?
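
To make the metrics idea concrete, here is a hedged sketch of instrumenting the queue-worker with Prometheus via github.com/prometheus/client_golang; the metric name and port are invented for the example:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// messagesProcessed would count handled messages; the metric name is
// hypothetical, not something the queue-worker exports today.
var messagesProcessed = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "queue_worker_messages_processed_total",
	Help: "Total messages processed by the queue worker.",
})

func main() {
	prometheus.MustRegister(messagesProcessed)

	// The worker's message handler would call messagesProcessed.Inc()
	// after each function invocation completes.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```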

@matthiashanel
Contributor

matthiashanel commented May 13, 2020

Health probe:
The most useful value I can imagine the queue worker producing is how many messages it is currently processing. But a value of 5 says little about whether scaling is needed or not; scaling is needed when there are too many messages the service has not yet seen.

Readiness probe:
The queue worker does not open a port or serve HTTP, which makes a readiness probe a tough nut to crack. "Ready" for the queue worker essentially means the NATS connection got established; if that fails, the queue worker exits. I can imagine conditions where the streaming client does not return from connect, but starting a web server just to indicate readiness and protect against that seems even more complex. Do I make sense here?
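
One lighter-weight guard against a connect that never returns, as a sketch: bound the initial connect with the stan.go client's ConnectWait option, so a hang becomes an error, an exit, and a container restart. Cluster ID, client ID, and URL are placeholders:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/stan.go"
)

func main() {
	// ConnectWait bounds how long the initial connect may block, so a hung
	// connect turns into a fatal error (and a restart) instead of a worker
	// that looks alive but never becomes ready.
	sc, err := stan.Connect("test-cluster", "queue-worker",
		stan.NatsURL("nats://127.0.0.1:4222"),
		stan.ConnectWait(10*time.Second))
	if err != nil {
		log.Fatalf("could not connect to NATS Streaming: %v", err)
	}
	defer sc.Close()
}
```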

We will get a lot more mileage by using the metrics NATS already has. In nats-streaming-server, all that has to happen is opening the monitoring port with `-m <port>`.

This example shows how to discover channels and then inspect one via curl:

```
nats-streaming-server -m 8080

curl http://127.0.0.1:8080/streaming/channelsz
{
  "cluster_id": "test-cluster",
  "server_id": "ZAs0tFNCNAd5CZuEm0I0xA",
  "now": "2020-05-13T14:49:04.556034-04:00",
  "offset": 0,
  "limit": 1024,
  "count": 2,
  "total": 2,
  "names": [
    "queue",
    "foo"
  ]
}

curl "http://127.0.0.1:8080/streaming/channelsz?channel=queue"
{
  "name": "queue",
  "msgs": 1,
  "bytes": 22,
  "first_seq": 1,
  "last_seq": 1
}

# this one also returns information about the subscribers
# (note the quotes: an unquoted & would background the command)
curl "http://127.0.0.1:8080/streaming/channelsz?channel=foo&subs=1"
```

https://docs.nats.io/nats-streaming-concepts/monitoring#monitoring-a-nats-streaming-channel-with-grafana-and-prometheus
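
Building on those endpoints, here is a sketch of how an external checker (or autoscaler) could read the channel stats in Go; the struct mirrors only the fields shown in the curl output above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// channelz mirrors the fields shown in the /streaming/channelsz output.
type channelz struct {
	Name    string `json:"name"`
	Msgs    int    `json:"msgs"`
	LastSeq uint64 `json:"last_seq"`
}

func main() {
	resp, err := http.Get("http://127.0.0.1:8080/streaming/channelsz?channel=queue")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var ch channelz
	if err := json.NewDecoder(resp.Body).Decode(&ch); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("channel %q holds %d messages (last_seq=%d)\n", ch.Name, ch.Msgs, ch.LastSeq)
}
```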
