healthcheck not failing even when service registration is not successful #8783

Closed
wkloucek opened this issue Apr 5, 2024 · 21 comments · Fixed by #8880
Labels
Interaction:Blue-Ticket Priority:p2-high Escalation, on top of current planning, release blocker Type:Bug

Comments

@wkloucek
Contributor

wkloucek commented Apr 5, 2024

Describe the bug

For some reason, none of the services can complete their service registration:

...
appregistry-5c7cd79696-vvzzm appregistry {"level":"error","service":"app-registry","error":"Failed to store data in bucket 'ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOMFYGSLTBOBYC24TFM5UXG5DSPE======': nats: connection closed","time":"2024-04-05T05:25:12Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/registry/register.go:33","message":"registration error for external service com.owncloud.api.app-registry"}
authmachine-6b75db9859-rjnkk authmachine {"level":"error","service":"auth-machine","error":"Failed to store data in bucket 'ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOMFYGSLTBOV2GQLLNMFRWQ2LOMU======': nats: connection closed","time":"2024-04-05T05:25:13Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/registry/register.go:33","message":"registration error for external service com.owncloud.api.auth-machine"}
gateway-7f75b8d7d-smmkt gateway {"level":"error","service":"gateway","error":"Failed to store data in bucket 'ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOMFYGSLTHMF2GK53BPE======': nats: connection closed","time":"2024-04-05T05:25:13Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/registry/register.go:33","message":"registration error for external service com.owncloud.api.gateway"}
frontend-599d689b84-d6bj7 frontend {"level":"error","service":"frontend","error":"Failed to store data in bucket 'ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOO5SWELTGOJXW45DFNZSA====': nats: connection closed","time":"2024-04-05T05:25:13Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/registry/register.go:33","message":"registration error for external service com.owncloud.web.frontend"}
authservice-59696b97d8-wgbq7 authservice {"level":"error","service":"auth-service","error":"Failed to store data in bucket 'ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOMFYGSLTBOV2GQLLTMVZHM2LDMU======': nats: connection closed","time":"2024-04-05T05:25:14Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/registry/register.go:33","message":"registration error for external service com.owncloud.api.auth-service"}
gateway-7f75b8d7d-z85gw gateway {"level":"error","service":"gateway","error":"Failed to store data in bucket 'ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOMFYGSLTHMF2GK53BPE======': nats: connection closed","time":"2024-04-05T05:25:14Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/registry/register.go:33","message":"registration error for external service com.owncloud.api.gateway"}
groups-664b5c5459-jpsqq groups {"level":"error","service":"groups","error":"Failed to store data in bucket 'ONSXE5TJMNSS24TFM5UXG5DSPFPWG33NFZXXO3TDNRXXKZBOMFYGSLTHOJXXK4DT': nats: connection closed","time":"2024-04-05T05:25:15Z","line":"github.com/owncloud/ocis/v2/ocis-pkg/registry/register.go:33","message":"registration error for external service com.owncloud.api.groups"}
...

The only service that actually fails and restarts is the appprovider. All other services pretend to be healthy and stay running.

$ kubectl get pods -n ocis
NAME                                       READY   STATUS      RESTARTS          AGE
antivirus-7c77fb964b-g8r4l                 1/1     Running     0                 6d16h
antivirus-7c77fb964b-wqdh8                 1/1     Running     0                 6d16h
appprovider-office-6c6846cf9f-c4wcs        1/1     Running     980 (5m41s ago)   6d16h
appregistry-5c7cd79696-vvzzm               1/1     Running     0                 6d16h
audit-57484b77c5-grq5f                     1/1     Running     0                 6d16h
audit-57484b77c5-s48lv                     1/1     Running     0                 6d16h
authmachine-6b75db9859-jvdz4               1/1     Running     0                 6d16h
authmachine-6b75db9859-rjnkk               1/1     Running     0                 6d16h
authservice-59696b97d8-h9p9c               1/1     Running     0                 6d16h
authservice-59696b97d8-wgbq7               1/1     Running     0                 6d16h
clientlog-fdfb746f8-bj7l9                  1/1     Running     7 (4d7h ago)      4d7h
clientlog-fdfb746f8-pc2pn                  1/1     Running     0                 6d16h
eventhistory-7bdddbcc76-mrp59              1/1     Running     0                 6d16h
eventhistory-7bdddbcc76-q2r4b              1/1     Running     0                 6d16h
frontend-599d689b84-d6bj7                  1/1     Running     0                 6d16h
frontend-599d689b84-p872k                  1/1     Running     8 (4d7h ago)      4d7h
gateway-7f75b8d7d-smmkt                    1/1     Running     0                 6d16h
gateway-7f75b8d7d-z85gw                    1/1     Running     0                 6d16h
graph-bb5764595-nf9xn                      1/1     Running     0                 6d16h
graph-bb5764595-rqv9p                      1/1     Running     0                 6d16h
groups-664b5c5459-jpsqq                    1/1     Running     0                 6d16h
groups-664b5c5459-mfck5                    1/1     Running     8 (4d7h ago)      4d7h
ldap-main-0                                1/1     Running     0                 6d16h
ldap-secondary-0                           1/1     Running     0                 6d16h
nats-0                                     2/2     Running     0                 21h
nats-1                                     2/2     Running     0                 21h
nats-2                                     2/2     Running     0                 21h
nats-box-5d5787998c-zqdlx                  1/1     Running     0                 6d17h
notifications-5bc9cb698c-6mcdf             1/1     Running     0                 6d16h
ocdav-66fd6cdd68-9pcsm                     1/1     Running     0                 6d16h
ocdav-66fd6cdd68-vbzf9                     1/1     Running     0                 6d16h
ocs-56cc5f7777-4djwq                       1/1     Running     8 (4d7h ago)      4d7h
ocs-56cc5f7777-vbr6s                       1/1     Running     0                 6d16h
policies-648db4488b-mq585                  1/1     Running     0                 6d16h
policies-648db4488b-vfztq                  1/1     Running     0                 6d16h
postprocessing-6bfb9c75ff-dfhmb            1/1     Running     0                 6d16h
postprocessing-6bfb9c75ff-hmdhx            1/1     Running     0                 6d16h
proxy-574c44c499-xhbdm                     1/1     Running     0                 6d16h
proxy-574c44c499-zvb48                     1/1     Running     0                 6d16h
search-86969f87cc-dnmjv                    1/1     Running     0                 6d16h
settings-7b79f98c45-pdfqh                  1/1     Running     0                 6d16h
settings-7b79f98c45-qvbnq                  1/1     Running     0                 6d16h
sharing-fcd5df4b9-tcn2c                    1/1     Running     0                 6d16h
sharing-fcd5df4b9-x5f56                    1/1     Running     0                 6d16h
sse-567f9f4658-mfl5x                       1/1     Running     0                 6d16h
sse-567f9f4658-zm8nr                       1/1     Running     0                 6d16h
storagepubliclink-69dbfb86df-2c9ns         1/1     Running     0                 6d16h
storagepubliclink-69dbfb86df-6wbpr         1/1     Running     0                 6d16h
storageshares-765cb77cf4-26bpc             1/1     Running     0                 6d16h
storageshares-765cb77cf4-cghkg             1/1     Running     0                 6d16h
storagesystem-557cc5f9bb-bt8ph             1/1     Running     8 (4d7h ago)      4d7h
storagesystem-557cc5f9bb-hx748             1/1     Running     0                 6d16h
storageusers-85697664c7-kzjds              1/1     Running     0                 6d16h
storageusers-85697664c7-wbqk5              1/1     Running     0                 6d16h
thumbnails-8d4d7bd56-lt5zx                 1/1     Running     0                 6d16h
thumbnails-cleanup-28538116-nfv4r          0/1     Completed   0                 130m
thumbnails-cleanup-28538176-26h7g          0/1     Completed   0                 70m
thumbnails-cleanup-28538236-8fzvg          0/1     Completed   0                 10m
userlog-74fd6d7b99-7jp6r                   1/1     Running     0                 6d16h
userlog-74fd6d7b99-wlfkx                   1/1     Running     0                 6d16h
users-b897bfdf-6wnvm                       1/1     Running     0                 6d16h
users-b897bfdf-nlrkx                       1/1     Running     0                 6d16h
web-6945f454d4-5f2qb                       1/1     Running     8 (4d7h ago)      4d7h
web-6945f454d4-tmvkx                       1/1     Running     0                 6d16h
webdav-58f7f68c5b-m6629                    1/1     Running     0                 6d16h
webdav-58f7f68c5b-wm8rc                    1/1     Running     0                 6d16h
webfinger-7465cfdcd9-2drtz                 1/1     Running     0                 6d16h
webfinger-7465cfdcd9-tvf5f                 1/1     Running     0                 6d16h
wopi-wopiserver-699df58bc7-7tqkk           1/1     Running     10 (4d7h ago)     6d16h
wopi-wopiserver-699df58bc7-qbtbz           1/1     Running     10 (4d7h ago)     6d16h

Steps to reproduce

I have no reproducer to get into that state. (Actually, recreating all oCIS pods got me out of this state.)

Expected behavior

No service returns a positive health status when the service registration doesn't work.

Actual behavior

All pods are marked as healthy.

Setup

I have oCIS 5.0.0 installed via the oCIS Helm Chart (owncloud/ocis-charts@fe5697d) with an external NATS cluster as service registry, store and cache.

Additional context

The oCIS chart uses the /healthz endpoint to determine the status of an oCIS pod: https://github.com/owncloud/ocis-charts/blob/fe5697d0a8a0431d7b0b39e4928beb66f2276baf/charts/ocis/templates/_common/_tplvalues.tpl#L207-L220
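For illustration, this is roughly what a liveness probe hitting /healthz sees when the handler does not take dependency state into account. A minimal, hypothetical Go sketch follows (the handler wiring and debug port are assumptions, not the actual oCIS debug server code); an endpoint like this keeps returning 200 OK even while service registration keeps failing:

package main

import "net/http"

func main() {
	mux := http.NewServeMux()
	// A static handler like this always reports 200 OK, so a Kubernetes
	// liveness probe against /healthz passes regardless of registry state.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		_, _ = w.Write([]byte("OK"))
	})
	_ = http.ListenAndServe("127.0.0.1:9205", mux) // 9205: example debug port
}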


@micbar micbar added the Priority:p3-medium Normal priority label Apr 5, 2024
@wkloucek
Contributor Author

wkloucek commented Apr 8, 2024

@micbar I would like to escalate this, for more details see https://github.com/owncloud/enterprise/issues/6547

The NATS client seems to be stuck in a disconnected state.

@micbar micbar added Priority:p2-high Escalation, on top of current planning, release blocker and removed Priority:p3-medium Normal priority labels Apr 8, 2024
@wkloucek
Contributor Author

wkloucek commented Apr 8, 2024

Reminds me of #7056

@butonic
Member

butonic commented Apr 8, 2024

I am wondering what our nats clients, using the default options, do after 60 reconnect attempts have been reached:

// Default Constants
const (
	Version                   = "1.33.1"
	DefaultURL                = "nats://127.0.0.1:4222"
	DefaultPort               = 4222
	DefaultMaxReconnect       = 60
	DefaultReconnectWait      = 2 * time.Second
	DefaultReconnectJitter    = 100 * time.Millisecond
	DefaultReconnectJitterTLS = time.Second
	DefaultTimeout            = 2 * time.Second
	DefaultPingInterval       = 2 * time.Minute
	DefaultMaxPingOut         = 2
	DefaultMaxChanLen         = 64 * 1024       // 64k
	DefaultReconnectBufSize   = 8 * 1024 * 1024 // 8MB
	RequestChanLen            = 8
	DefaultDrainTimeout       = 30 * time.Second
	DefaultFlusherTimeout     = time.Minute
	LangString                = "go"
)

Hm ... that would explain why the nats-js registry logs "registration error for external service ..." together with "nats: connection closed" 🤔
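For reference, a minimal sketch (not oCIS code; the URL is an example) that makes this default behaviour observable: with DefaultMaxReconnect and DefaultReconnectWait the client retries for roughly two minutes, then the ClosedHandler fires and the connection stays closed for good, which matches the "nats: connection closed" errors in the logs above.

package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect("nats://nats:4222",
		nats.MaxReconnects(nats.DefaultMaxReconnect),  // 60 attempts
		nats.ReconnectWait(nats.DefaultReconnectWait), // 2 seconds apart
		nats.DisconnectErrHandler(func(_ *nats.Conn, err error) {
			log.Printf("nats disconnected: %v", err)
		}),
		nats.ClosedHandler(func(_ *nats.Conn) {
			// Fires once the reconnect budget is exhausted; the client never
			// dials again on its own, and every later operation returns
			// "nats: connection closed".
			log.Println("nats connection closed permanently")
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Close()
	time.Sleep(5 * time.Minute) // keep the process alive to observe the callbacks
}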

@butonic
Member

butonic commented Apr 8, 2024

The registry should kill the service if the underlying store cannot recover its connection.

@butonic
Member

butonic commented Apr 8, 2024

At least the service should not report as healthy if the nats client connection was closed.

@wkloucek
Contributor Author

wkloucek commented Apr 8, 2024

The reproducer is:

  • take all nats pods down
  • wait long enough that all services log "connection closed" for their service registration
  • then bring nats back up

If you look at the outgoing connections of the oCIS pods, you don't see any connection attempt or active connection to nats, see #8783 (comment)

@kobergj
Collaborator

kobergj commented Apr 9, 2024

Decided to fix this as follows:

  • use the integrated nats option ClosedCB
  • wrap this option in the nats-js-kv store (upstream, go-micro)
  • set `conn = nil` when the connection is closed

This should cause ocis to automatically reconnect on the next operation when the connection is closed.
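A rough sketch of that idea (package, type and method names here are illustrative, not the actual micro/plugins change): the ClosedHandler drops the cached connection, so the next store operation dials NATS again instead of reusing a dead client.

package natsjskv

import (
	"sync"

	"github.com/nats-io/nats.go"
)

type store struct {
	mu   sync.Mutex
	conn *nats.Conn
	url  string
}

// connection returns a usable NATS connection, re-dialing if the previous
// one was closed permanently.
func (s *store) connection() (*nats.Conn, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.conn != nil && !s.conn.IsClosed() {
		return s.conn, nil
	}
	nc, err := nats.Connect(s.url,
		nats.ClosedHandler(func(_ *nats.Conn) {
			// The connection is closed for good: forget it so the next call
			// to connection() establishes a fresh one.
			s.mu.Lock()
			s.conn = nil
			s.mu.Unlock()
		}),
	)
	if err != nil {
		return nil, err
	}
	s.conn = nc
	return nc, nil
}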

@butonic
Member

butonic commented Apr 9, 2024

@kobergj
Collaborator

kobergj commented Apr 9, 2024

Here is the micro PR: micro/plugins#139

Just a draft for now, as I need to test it properly.

@kobergj
Collaborator

kobergj commented Apr 9, 2024

@butonic reading the documentation you posted: should we rather use DisconnectedErrCB? Or react to both?

@wkloucek
Contributor Author

@kobergj does #8880 also make the healthcheck fail when service registration is not possible? I think we really must reflect a non-working service in the health status, even if we now try to reconnect to the service registry forever.

@kobergj
Collaborator

kobergj commented Apr 18, 2024

Yes. It will correctly reconnect until MaxRetries is reached. If it cannot reconnect, it will call os.Exit, killing the pod.

We still need to add the same mechanism for the event handlers, but service registration is supposed to work properly now.
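As a sketch of the described behaviour (function and parameter names are hypothetical, not the actual oCIS code): retry the registration a bounded number of times, then exit non-zero so the supervisor (Kubernetes, or the single-binary runtime) restarts the process.

package registry

import (
	"os"
	"time"
)

// registerWithRetry retries the given registration function and kills the
// process once all attempts have failed.
func registerWithRetry(register func() error, maxRetries int, wait time.Duration) {
	for i := 0; i < maxRetries; i++ {
		if err := register(); err == nil {
			return
		}
		time.Sleep(wait)
	}
	// All attempts failed: exit with a non-zero code so the pod is restarted.
	os.Exit(1)
}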

@micbar
Contributor

micbar commented Apr 18, 2024

@kobergj sorry to ask again, but does the health endpoint switch to not healthy during those reconnect attempts?

@kobergj
Collaborator

kobergj commented Apr 19, 2024

No - we have no control over that. Reconnects are done by the nats package. We would need to implement some magic (call nats, use the disconnect handler, ...) to achieve something like that.

But I would not recommend that. If we implemented something like that and nats went down, it would take all services down with it. I would say it is better if just nats dies and the other services reconnect when it is back.

@wkloucek
Contributor Author

I'd really like to have a way to see the health of an oCIS deployment WITHOUT looking at the logs. This is not possible with the current approach.
Even when looking at the logs, you might not see the state, because those reconnects might happen without any logs being issued.

Currently an oCIS installation on Kubernetes is a black box to me, and I only know it's working once I have logged in, uploaded and downloaded a file, created shares, ....

@micbar
Contributor

micbar commented Apr 19, 2024

@wkloucek Let me try to formulate what we decided together with you some time ago.

Service Health

We consider a service healthy when the service itself and its core functions are available.

Dependencies

A lot of services have dependencies, e.g. NATS, the reva gateway or the S3 connection. We do not indicate an unhealthy service when one of its dependencies is not working. Our understanding was that if we did, the whole service mesh could show as unhealthy even though only one dependency needs to be fixed / restarted.

This case with the NATS connection

In this case, the ocis service is healthy and can work as soon as the NATS service becomes healthy again.

@kobergj
Collaborator

kobergj commented Apr 19, 2024

I'd really like to have a way to see the health of an oCIS deployment WITHOUT looking at the logs. This is not possible with the current approach.

@wkloucek
I don't understand. Isn't this story the first step on this path? Or would you like ocis to die instead of reconnecting, as @micbar suggested?

@wkloucek
Contributor Author

Our understanding was that if we did, the whole service mesh could show as unhealthy even though only one dependency needs to be fixed / restarted.

But even if failing healthchecks are considered bad in this case, we should still reflect unmet dependencies in the readiness status.

Or would you like ocis dying instead reconnecting

These are not contradictory for me. oCIS services can do reconnects and at the same time report themselves as unhealthy or not ready. A supervisor like Kubernetes would let such a service run for some time and only restart it after x failing healthchecks. (Basically this is what oCIS does right now: reconnect x times and then exit with status code 1.)

@kobergj
Collaborator

kobergj commented Apr 19, 2024

Ok I see. So what would be the ideal behaviour?

  • While reconnecting be healthy but not ready? Or be unhealthy directly?
  • When reconnect fails be unhealthy but alive? Or just die directly?

For the single binary it is convenient that ocis exits when reconnect fails.

@micbar
Contributor

micbar commented Apr 19, 2024

For the single binary it is convenient that ocis exits when reconnect fails.

We can make this work for the single process only.

@wkloucek
Contributor Author

For Kubernetes I think it would be desirable to:

  • have /healthz report unhealthy when it's "our problem". A connection that stays in the "connection closed" state would fit here. Another example could be "permission denied" because of credentials that have changed, where a restart of the pod would fix it since the secret object was updated in the meantime. Keep in mind we supply most configuration via environment variables and don't reload config files. (https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/ suggests "Avoid checking dependencies in liveness probes. Liveness probes should be inexpensive and have response times with minimal variance.", but I understand that as "don't make on-the-fly requests to dependencies to compute the healthiness". Checking the state of e.g. a connection pool (connected vs. not connected) instead would be a very inexpensive way to check the dependencies.)

  • have /readyz report not ready when it's "someone else's problem" and of course when "we're not ready (yet) to serve traffic". I'd actually count reconnect results like "failed to dial" / "connection timed out" here.

  • having /healthz report unhealthy and leaving the pod restart to Kubernetes would be preferable - this provides extended configurability (timeouts, retries), exponential backoff delays and visibility (Kubernetes will report a failing health check instead of exit code 1). But I have the feeling this is a lower priority since it also needs the /healthz implementation first.

If you do the "our problem" vs. "someone else's problem" split, you also avoid this:

We do not indicate an unhealthy service when one of its dependencies is not working. Our understanding was that if we did, the whole service mesh could show as unhealthy even though only one dependency needs to be fixed / restarted.

As a drastic example:
We forgot to deploy NATS but configured it to be used as store, cache and service registry.
All oCIS services would be healthy but not ready.
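A minimal sketch of that split (assumed names, not existing oCIS code): /healthz fails only for "our problem" - a permanently closed client the process cannot recover on its own - while /readyz fails for "someone else's problem" such as a dependency that is down or still reconnecting. Checking the client's status is a cheap local lookup, in line with the "inexpensive liveness probe" advice quoted above.

package debug

import (
	"net/http"

	"github.com/nats-io/nats.go"
)

// probes wires the two endpoints. nc.IsClosed() and nc.Status() are local
// state lookups, not network round trips.
func probes(nc *nats.Conn) *http.ServeMux {
	mux := http.NewServeMux()
	// /healthz: unhealthy only when the client gave up for good; a restart
	// (or new configuration) is the remedy.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		if nc.IsClosed() {
			http.Error(w, "nats connection closed", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	// /readyz: not ready while the dependency is down or we are still
	// reconnecting; no restart needed, just no traffic yet.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
		if nc.Status() != nats.CONNECTED {
			http.Error(w, "nats not connected", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	return mux
}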
