
Informers do not surface API server request failures to callers #155

Closed
seh opened this issue Mar 21, 2017 · 27 comments · Fixed by kubernetes/kubernetes#87329
Labels
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@seh

seh commented Mar 21, 2017

When I call cache.NewInformer to create an Informer and then call cache.Controller.Run on the returned value, the controller periodically contacts the API server to list the objects of the given type. At some point the API server rejects these requests, causing the controller to log messages like the following:

Failed to list *v1beta1.Ingress: the server has asked for the client to provide credentials (get ingresses.extensions)

Failed to list *v1.Service: the server has asked for the client to provide credentials (get services)

Failed to list *v1.Endpoints: the server has asked for the client to provide credentials (get endpoints)

These failures repeat periodically, swelling the log file, and it may take many days before we notice that our controller is ostensibly still running, but only inertly; it can't get its job done without talking to the API server. Sometimes the server changes its mind and starts fulfilling the requests again, but these failure periods can persist for days.

A caller of cache.Controller.Run should have some way of detecting that these failures are occurring in order to declare the process unhealthy. Retrying automatically to smooth over intermittent network trouble is a nice feature, but having neither the ability to control it nor a way to detect its ongoing failure makes it dangerous.

I would be happy with either of the following two improvements:

  • Accept a callback (perhaps via a new sibling method for Controller.Run) that tells a caller when these request failures arrive.
    It could also accept a caller-provided channel, and push errors into the channel when they arise, dropping errors that can't be delivered synchronously.
  • Provide a way to integrate a controller into a "healthz" handler.
    That leaves the health criteria opaque to callers—and probably begs for some way to configure the thresholds—but still allows a calling process to indicate that it's in dire shape.

We discussed this gap in the "kubernetes-dev" channel in the "Kubernetes" Slack team.
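
For concreteness, here is a minimal Go sketch of what the first option could look like from the controller's side. Nothing in it is an existing client-go API; runController and relist are hypothetical stand-ins for a sibling of cache.Controller.Run and its internal relist against the API server.

```go
package main

import (
	"errors"
	"log"
	"time"
)

// runController stands in for a hypothetical variant of cache.Controller.Run
// that reports each failed list/watch attempt to a caller-supplied callback
// while still retrying internally, exactly as the controller does today.
func runController(stop <-chan struct{}, onListError func(error)) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		if err := relist(); err != nil {
			onListError(err)
		}
		time.Sleep(time.Second) // stand-in for the real retry/backoff period
	}
}

// relist is a placeholder for the controller's real LIST against the API server.
func relist() error {
	return errors.New("the server has asked for the client to provide credentials")
}

func main() {
	stop := make(chan struct{})
	time.AfterFunc(3*time.Second, func() { close(stop) })
	runController(stop, func(err error) {
		// The caller decides what to do: count, log, or mark itself unhealthy.
		log.Printf("informer list failed: %v", err)
	})
}
```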

@caesarxuchao
Member

cc @lavalamp, he's the original author

@seh
Author

seh commented Mar 24, 2017

It occurred to me after writing the second suggestion about the channel that if the caller provides it, the caller can choose a buffer for the channel, but the caller can also close the channel prematurely, causing Controller.Run* to panic later. Alternatively, if Controller.Run* returned the channel, then the caller can't control the buffer, but at least the caller can't close the channel either, assuming the returned channel only allows receive operations.

With a callback, we then have to worry about the caller panicking, or taking so long to react that we can't attend to the next scheduled retry of the failed operation. The main goal of hearing about these errors is counting them, which is best done over a rolling window. It's not so important which errors are arising; what matters is knowing that a lack of events firing on the ResourceEventHandler is not just due to a lack of observable activity in the cluster, but rather because we're deaf to any such activity.
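
To make the rolling-window idea concrete, here is a small sketch of the consuming side, assuming the controller returned a receive-only error channel (so the caller can't close it) and sent into it non-blockingly, dropping errors on overflow. The healthTracker type and its threshold are invented for illustration and are not part of client-go.

```go
package main

import (
	"fmt"
	"time"
)

// healthTracker counts informer errors seen within a rolling window; a healthz
// handler would consult healthy() rather than caring which errors occurred.
type healthTracker struct {
	window    time.Duration
	threshold int
	times     []time.Time
}

func (h *healthTracker) observe(now time.Time) {
	h.times = append(h.times, now)
}

func (h *healthTracker) healthy(now time.Time) bool {
	cutoff := now.Add(-h.window)
	kept := h.times[:0]
	for _, t := range h.times {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	h.times = kept
	return len(h.times) < h.threshold
}

func main() {
	// errs stands in for the channel a future Controller.Run variant might return.
	errs := make(chan error, 16)
	h := &healthTracker{window: time.Minute, threshold: 3}

	// Simulate a burst of rejected list requests, then stop producing.
	for i := 0; i < 4; i++ {
		errs <- fmt.Errorf("the server has asked for the client to provide credentials")
	}
	close(errs)

	for range errs {
		h.observe(time.Now())
	}
	fmt.Println("healthy:", h.healthy(time.Now())) // false: too many failures in the window
}
```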

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with an /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Dec 22, 2017
@seh
Author

seh commented Dec 22, 2017

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Dec 22, 2017
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Mar 22, 2018
@seh
Author

seh commented Mar 22, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Mar 22, 2018
@lavalamp
Member

lavalamp commented Mar 23, 2018 via email

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 21, 2018
@seh
Author

seh commented Jun 21, 2018

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jun 21, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Sep 19, 2018
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Oct 19, 2018
@seh
Author

seh commented Oct 19, 2018

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label on Oct 19, 2018
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 17, 2019
@george-angel

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 18, 2019
@discordianfish

This issue makes it impossible to handle errors like this: kubernetes/ingress-nginx#2837

@discordianfish

Unless I'm mistaken, that means every client using the informer has no way to handle connection issues with the apiserver, and cluster operators have no way to monitor this either.

@discordianfish

It should be possible to fix this now with kubernetes/kubernetes#73937 merged.

@k8s-ci-robot added the priority/important-soon label on Oct 14, 2019
@discordianfish

> I think it's best to integrate these with healthz. I am not a huge fan of the callback idea as there's not much useful to be done if the informer stops working.

Not exactly sure what "integrate with healthz" means, but a callback/channel would be what I (naively?) would want, to simply fail/exit with an error or reconnect the informer. An error channel seems the idiomatic way to me.

> I also think that we need to surface a metric that counts requests made by a client -- if a controller stops making more requests for long enough, it needs attention. We should also, in sibling metrics, track observed latency and observed errors.

While such a metric would be useful either way, I think it's more robust to handle failing connections more explicitly.
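
One existing hook that can provide such per-client request metrics today, without any informer changes, is rest.Config.WrapTransport. The sketch below assumes Prometheus for the counter; the metric name, countingRoundTripper, and instrument are illustrative choices, while WrapTransport and the Prometheus client calls are real APIs.

```go
package main

import (
	"log"
	"net/http"
	"strconv"

	"github.com/prometheus/client_golang/prometheus"
	"k8s.io/client-go/rest"
)

// apiRequests counts every request the client makes, labelled by HTTP status
// (or "error" for transport-level failures), so alerting can notice a client
// that has gone quiet or is being rejected.
var apiRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "apiserver_client_requests_total",
		Help: "Requests made to the API server by this process.",
	},
	[]string{"code"},
)

type countingRoundTripper struct{ next http.RoundTripper }

func (c *countingRoundTripper) RoundTrip(req *http.Request) (*http.Response, error) {
	resp, err := c.next.RoundTrip(req)
	code := "error"
	if err == nil {
		code = strconv.Itoa(resp.StatusCode)
	}
	apiRequests.WithLabelValues(code).Inc()
	return resp, err
}

// instrument wraps the client transport so every informer list/watch shows up
// in the counter above.
func instrument(cfg *rest.Config) *rest.Config {
	cfg.WrapTransport = func(rt http.RoundTripper) http.RoundTripper {
		return &countingRoundTripper{next: rt}
	}
	return cfg
}

func main() {
	prometheus.MustRegister(apiRequests)
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	_ = instrument(cfg) // hand the instrumented config to kubernetes.NewForConfig as usual
}
```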

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jan 14, 2020
@george-angel

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jan 14, 2020
@nicks
Contributor

nicks commented Jan 17, 2020

This has been causing a lot of problems for us. Informers get into an access-denied error loop and spew out an overwhelming amount of error logs. For details, see tilt-dev/tilt#2702

I proposed a PR upstream that will address our problems, but I don't know if it addresses some of the other ideas on this thread around healthz: kubernetes/kubernetes#87329
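
For readers landing here later: the PR above adds a SetWatchErrorHandler hook on SharedIndexInformer (first released with client-go v0.19, if the release mapping here is right). A rough sketch of wiring it up, assuming the signature as merged:

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	factory := informers.NewSharedInformerFactory(client, 10*time.Minute)
	svcInformer := factory.Core().V1().Services().Informer()

	// Must be set before the informer starts. The handler runs on every failed
	// list/watch attempt, in addition to the default logging and backoff.
	if err := svcInformer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		log.Printf("service informer list/watch failed: %v", err)
		// e.g. flip a healthz flag, increment a counter, or exit the process here.
	}); err != nil {
		log.Fatal(err)
	}

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // run until killed; a real controller would wire up signal handling
}
```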

nicks added a commit to tilt-dev/kubernetes that referenced this issue Mar 14, 2020
When creating an informer, this adds a way to add custom error handling, so that
Kubernetes tooling can properly surface the errors to the end user.

Fixes kubernetes/client-go#155
@seh
Author

seh commented Apr 4, 2020

What a day! Thank you, @nicks, and all the careful reviewers.

k8s-publishing-bot pushed a commit that referenced this issue Apr 4, 2020
When creating an informer, this adds a way to add custom error handling, so that
Kubernetes tooling can properly surface the errors to the end user.

Fixes #155

Kubernetes-commit: 435b40aa1e5c0ae44e0aeb9aa6dbde79838b3390
roivaz added a commit to 3scale-ops/marin3r that referenced this issue Apr 15, 2020
WaitForCacheSync waits forever and never returns if a persistent error occurs. It also looks like there is no way in the current version to surface informer problems to the caller, as stated in this issue: kubernetes/client-go#155.
Error handling for informers was added recently in kubernetes/kubernetes#87329, which is only available in master for the moment.
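
Until that error handling lands in a release, one possible work-around for the "WaitForCacheSync waits forever" part is to bound the wait yourself. cache.WaitForCacheSync is a real client-go helper; waitForSyncOrDie and the never-syncing demo below are just a sketch.

```go
package main

import (
	"log"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForSyncOrDie gives the informer caches a bounded amount of time to fill.
// If the API server keeps rejecting requests, the process fails fast instead of
// hanging forever inside WaitForCacheSync.
func waitForSyncOrDie(timeout time.Duration, synced ...cache.InformerSynced) {
	stop := make(chan struct{})
	time.AfterFunc(timeout, func() { close(stop) })
	if !cache.WaitForCacheSync(stop, synced...) {
		log.Fatal("informer caches did not sync in time; check API server connectivity and credentials")
	}
}

func main() {
	// Demo with a sync func that never reports success, standing in for an
	// informer whose list requests keep failing.
	neverSynced := func() bool { return false }
	waitForSyncOrDie(5*time.Second, neverSynced)
}
```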