
Informers do not surface API server request failures to callers #155

Closed
seh opened this issue Mar 21, 2017 · 27 comments · Fixed by kubernetes/kubernetes#87329
Labels
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@seh

seh commented Mar 21, 2017

When I call cache.NewInformer to create an Informer and then call cache.Controller.Run on the returned value, the controller periodically contacts the API server to list the objects of the given type. At some point the API server rejects these requests, causing the controller to log messages like the following:

Failed to list *v1beta1.Ingress: the server has asked for the client to provide credentials (get ingresses.extensions)

Failed to list *v1.Service: the server has asked for the client to provide credentials (get services)

Failed to list *v1.Endpoints: the server has asked for the client to provide credentials (get endpoints)

These failures repeat periodically, swelling the log file, and it may take many days before we notice that our controller is ostensibly still running, but only inertly; it can't get its job done without talking to the API server. Sometimes the server changes its mind and starts fulfilling the requests again, but these failure periods can persist for days.

A caller of cache.Controller.Run should have some way of detecting that these failures are occurring in order to declare the process unhealthy. Retrying automatically to smooth over intermittent network trouble is a nice feature, but having neither the ability to control it nor a way to detect its ongoing failure makes it dangerous.

I would be happy with either of the following two improvements:

  • Accept a callback (perhaps via a new sibling method for Controller.Run) that tells a caller when these request failures arrive.
    It could also accept a caller-provided channel, and push errors into the channel when they arise, dropping errors that can't be delivered synchronously.
  • Provide a way to integrate a controller into a "healthz" handler.
    That leaves the health criteria opaque to callers—and probably begs for some way to configure the thresholds—but still allows a calling process to indicate that it's in dire shape.

We discussed this gap in the "kubernetes-dev" channel in the "Kubernetes" Slack team.
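
For concreteness, here is a minimal Go sketch of what the first option could look like from the controller's side. Nothing in it is an existing client-go API; runController and relist are hypothetical stand-ins for a sibling of cache.Controller.Run and its internal relist against the API server.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// runController stands in for a hypothetical variant of cache.Controller.Run
// that reports each failed list/watch attempt to a caller-supplied callback
// while still retrying internally, exactly as the controller does today.
func runController(stop <-chan struct{}, onListError func(error)) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		if err := relist(); err != nil {
			onListError(err)
		}
		time.Sleep(time.Second) // stand-in for the real retry/backoff period
	}
}

// relist is a placeholder for the controller's real LIST against the API server.
func relist() error {
	return errors.New("the server has asked for the client to provide credentials")
}

func main() {
	stop := make(chan struct{})
	time.AfterFunc(3*time.Second, func() { close(stop) })
	runController(stop, func(err error) {
		// The caller decides what to do: count, log, or mark itself unhealthy.
		log.Printf("informer list failed: %v", err)
	})
}
```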

@caesarxuchao
Member

cc @lavalamp, he's the original author

@seh
Author

seh commented Mar 24, 2017

It occurred to me after writing the second suggestion about the channel that if the caller provides it, the caller can choose a buffer for the channel, but the caller can also close the channel prematurely, causing Controller.Run* to panic later. Alternatively, if Controller.Run* returned the channel, then the caller can't control the buffer, but at least the caller can't close the channel either, assuming the returned channel only allows receive operations.

With a callback, we then have to worry about the caller panicking, or taking so long to react that we can't attend to the next scheduled retry of the failed operation. The main goal of hearing about these errors is counting them, which is best done over a rolling window. It's not so important which errors are arising; what matters is knowing that a lack of events firing on the ResourceEventHandler is not just due to a lack of observable activity in the cluster, but rather because we're deaf to any such activity.
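
To make the rolling-window idea concrete, here is a small sketch of the consuming side, assuming the controller returned a receive-only error channel (so the caller can't close it) and sent into it non-blockingly, dropping errors on overflow. The healthTracker type and its threshold are invented for illustration and are not part of client-go.

```go
package main

import (
	"fmt"
	"time"
)

// healthTracker counts informer errors seen within a rolling window; a healthz
// handler would consult healthy() rather than caring which errors occurred.
type healthTracker struct {
	window    time.Duration
	threshold int
	times     []time.Time
}

func (h *healthTracker) observe(now time.Time) {
	h.times = append(h.times, now)
}

func (h *healthTracker) healthy(now time.Time) bool {
	cutoff := now.Add(-h.window)
	kept := h.times[:0]
	for _, t := range h.times {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	h.times = kept
	return len(h.times) < h.threshold
}

func main() {
	// errs stands in for the channel a future Controller.Run variant might return.
	errs := make(chan error, 16)
	h := &healthTracker{window: time.Minute, threshold: 3}

	// Simulate a burst of rejected list requests, then stop producing.
	for i := 0; i < 4; i++ {
		errs <- fmt.Errorf("the server has asked for the client to provide credentials")
	}
	close(errs)

	for range errs {
		h.observe(time.Now())
	}
	fmt.Println("healthy:", h.healthy(time.Now())) // false: too many failures in the window
}
```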

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 22, 2017
@seh
Author

seh commented Dec 22, 2017

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Dec 22, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Mar 22, 2018
@seh
Author

seh commented Mar 22, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Mar 22, 2018
@lavalamp
Member

lavalamp commented Mar 23, 2018 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 21, 2018
@seh
Author

seh commented Jun 21, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jun 21, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 19, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 19, 2018
@seh
Author

seh commented Oct 19, 2018

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Oct 19, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 17, 2019
@george-angel

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 18, 2019
@discordianfish

This issue makes it impossible to handle errors like this: kubernetes/ingress-nginx#2837

@discordianfish

Unless I'm mistaken, that means every client using the informer has no way to handle connection issues with the apiserver, and cluster operators have no way to monitor this either.

@discordianfish

It should be possible to fix this now with kubernetes/kubernetes#73937 merged.

@k8s-ci-robot added the priority/important-soon label on Oct 14, 2019
@discordianfish

> I think it's best to integrate these with healthz. I am not a huge fan of the callback idea as there's not much useful to be done if the informer stops working.

Not exactly sure what "integrate with healthz" means, but a callback/channel would be what I (naively?) would want, to simply fail/exit with an error or reconnect the informer. An error channel seems the idiomatic way to me.

> I also think that we need to surface a metric that counts requests made by a client -- if a controller stops making more requests for long enough, it needs attention. We should also, in sibling metrics, track observed latency and observed errors.

While such a metric would be useful either way, I think it's more robust to handle failing connections more explicitly.
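
One existing hook that can provide such per-client request metrics today, without any informer changes, is rest.Config.WrapTransport. The sketch below assumes Prometheus for the counter; the metric name, countingRoundTripper, and instrument are illustrative choices, while WrapTransport and the Prometheus client calls are real APIs.

```go
package main

import (
	"log"
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/rest"
)

// apiRequests counts every request the client makes, labelled by HTTP status
// (or "error" for transport-level failures), so alerting can notice a client
// that has gone quiet or is being rejected.
var apiRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_client_requests_total",
		Help: "Requests made to the API server by this process.",
	},
	[]string{"code"},
)

type countingRoundTripper struct{ next http.RoundTripper }

func (c *countingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := c.next.RoundTrip(req)
	code := "error"
	if err == nil {
		code = strconv.Itoa(resp.StatusCode)
	}
	apiRequests.WithLabelValues(code).Inc()
	return resp, err
}

// instrument wraps the client transport so every informer list/watch shows up
// in the counter above.
func instrument(cfg *rest.Config) *rest.Config {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &countingRoundTripper{next: rt}
	}
	return cfg
}

func main() {
	prometheus.MustRegister(apiRequests)
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	_ = instrument(cfg) // hand the instrumented config to kubernetes.NewForConfig as usual
}
```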

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 14, 2020
@george-angel

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 14, 2020
@nicks
Contributor

nicks commented Jan 17, 2020

This has been causing a lot of problems for us. Informers get into an access-denied error loop and spew out an overwhelming amount of error logs. For details, see tilt-dev/tilt#2702

I proposed a PR upstream that will address our problems, but I don't know if it addresses some of the other ideas on this thread around healthz: kubernetes/kubernetes#87329
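
For readers landing here later: the PR above adds a SetWatchErrorHandler hook on SharedIndexInformer (first released with client-go v0.19, if the release mapping here is right). A rough sketch of wiring it up, assuming the signature as merged:

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	svcInformer := factory.Core().V1().Services().Informer()

	// Must be set before the informer starts. The handler runs on every failed
	// list/watch attempt, in addition to the default logging and backoff.
	if err := svcInformer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		log.Printf("service informer list/watch failed: %v", err)
		// e.g. flip a healthz flag, increment a counter, or exit the process here.
	}); err != nil {
		log.Fatal(err)
	}

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until killed; a real controller would wire up signal handling
}
```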

nicks added a commit to tilt-dev/kubernetes that referenced this issue Mar 14, 2020
When creating an informer, this adds a way to add custom error handling, so that
Kubernetes tooling can properly surface the errors to the end user.

Fixes kubernetes/client-go#155
@seh
Author

seh commented Apr 4, 2020

What a day! Thank you, @nicks, and all the careful reviewers.

k8s-publishing-bot pushed a commit that referenced this issue Apr 4, 2020
When creating an informer, this adds a way to add custom error handling, so that
Kubernetes tooling can properly surface the errors to the end user.

Fixes #155

Kubernetes-commit: 435b40aa1e5c0ae44e0aeb9aa6dbde79838b3390
roivaz added a commit to 3scale-ops/marin3r that referenced this issue Apr 15, 2020
WaitForCacheSync waits forever and never returns if a persistent error occurs. It also looks like there is no way in the current version to surface informer problems to the caller, as stated in this issue: kubernetes/client-go#155.
Error handling for informers was added recently in kubernetes/kubernetes#87329, which is only available in master for the moment.
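
Until that error handling lands in a release, one possible work-around for the "WaitForCacheSync waits forever" part is to bound the wait yourself. cache.WaitForCacheSync is a real client-go helper; waitForSyncOrDie and the never-syncing demo below are just a sketch.

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForSyncOrDie gives the informer caches a bounded amount of time to fill.
// If the API server keeps rejecting requests, the process fails fast instead of
// hanging forever inside WaitForCacheSync.
func waitForSyncOrDie(timeout time.Duration, synced ...cache.InformerSynced) {
	stop := make(chan struct{})
	time.AfterFunc(timeout, func() { close(stop) })
	if !cache.WaitForCacheSync(stop, synced...) {
		log.Fatal("informer caches did not sync in time; check API server connectivity and credentials")
	}
}

func main() {
	// Demo with a sync func that never reports success, standing in for an
	// informer whose list requests keep failing.
	neverSynced := func() bool { return false }
	waitForSyncOrDie(5*time.Second, neverSynced)
}
```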