Informers do not surface API server request failures to callers #155
cc @lavalamp; he's the original author.
It occurred to me after writing the second suggestion about the channel that if the caller provides it, the caller can choose a buffer for the channel, but the caller can also close the channel prematurely, causing a panic when we send on it. With a callback, we then have to worry about the caller panicking, or taking so long to react that we can't attend to the next scheduled retry of the failed operation. The main goal of hearing about these errors is counting them—best done over a rolling window. It's not so important which errors are arising; what matters is that a lack of events firing on the channel indicates health.
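A minimal sketch of the channel option's "drop if not delivered synchronously" semantics (hypothetical names, not an existing client-go API):

```go
// reportListWatchError pushes a failure into the caller-provided channel,
// dropping it when nobody can receive right away so the controller's retry
// loop is never blocked by a slow or absent reader.
func reportListWatchError(errCh chan<- error, err error) {
	select {
	case errCh <- err: // delivered; the caller can count it over a rolling window
	default: // dropped: the buffer is full or no one is listening
	}
}
```

Note that this still panics if the caller closes `errCh` prematurely, which is exactly the hazard raised above.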
Issues go stale after 90d of inactivity. Prevent issues from auto-closing with an `/lifecycle frozen` comment. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
I agree with: `Provide a way to integrate a controller into a "healthz" handler`
cc @cheftako
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
This issue makes it impossible to handle errors like this: kubernetes/ingress-nginx#2837
Unless I'm mistaken, that means every client using the informer has no way to handle connection issues with the apiserver, and cluster operators have no way to monitor this either.
It should be possible to fix this now with kubernetes/kubernetes#73937 merged. |
Not exactly sure what 'integrate with healthz' means, but a callback/channel would be what I (naively?) would want, to simply fail/exit with an error or reconnect the informer. An error channel seems the idiomatic way to me.
While such a metric would be useful either way, I think it's more robust to handle failing connections more explicitly.
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with `/close`. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle stale
This has been causing a lot of problems for us. Informers get into an access-denied error loop and spew out an overwhelming amount of error logs. For details, see tilt-dev/tilt#2702. I proposed a PR upstream that addresses our problems, but I don't know if it addresses some of the other ideas on this thread around healthz: kubernetes/kubernetes#87329
This adds a way to supply custom error handling when creating an informer, so that Kubernetes tooling can properly surface the errors to the end user. Fixes kubernetes/client-go#155
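A minimal sketch of wiring up the hook that change introduced (assuming client-go v0.19+, where `SharedIndexInformer` gained `SetWatchErrorHandler`; the failure counter and logging are illustrative):

```go
package main

import (
	"log"
	"sync/atomic"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()

	var failures int64 // export to metrics or a healthz check as you see fit

	// The handler must be set before the informer starts, or it errors out.
	if err := podInformer.SetWatchErrorHandler(func(r *cache.Reflector, err error) {
		atomic.AddInt64(&failures, 1)
		log.Printf("informer list/watch failed: %v", err)
	}); err != nil {
		log.Fatal(err)
	}

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	select {} // a real controller would do its work here
}
```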
What a day! Thank you, @nicks, and all the careful reviewers.
WaitForCacheSync waits forever and never returns in the case a persistent error occurs. On the other hand, it looks like there is no way in the current version to surface informer problems to the caller, as stated in this issue: kubernetes/client-go#155. Error handling for informers was added recently in kubernetes/kubernetes#87329, which is only available in master for the moment.
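In the meantime, the open-ended wait can be bounded with a caller-owned stop channel; a sketch using only long-standing `tools/cache` APIs (`waitForSyncWithTimeout` is a hypothetical helper):

```go
import (
	"fmt"
	"time"

	"k8s.io/client-go/tools/cache"
)

// waitForSyncWithTimeout bounds WaitForCacheSync with a deadline so a
// persistent list/watch failure surfaces as an error instead of blocking
// the caller forever.
func waitForSyncWithTimeout(informer cache.SharedIndexInformer, timeout time.Duration) error {
	stopCh := make(chan struct{})
	timer := time.AfterFunc(timeout, func() { close(stopCh) })
	defer timer.Stop()
	if !cache.WaitForCacheSync(stopCh, informer.HasSynced) {
		return fmt.Errorf("informer cache did not sync within %s", timeout)
	}
	return nil
}
```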
When I call `cache.NewInformer` to create an Informer and then call `cache.Controller.Run` on the returned value, the controller periodically contacts the API server to list the objects of the given type. At some point the API server starts rejecting these requests, and the controller logs each failure. These failures repeat periodically, swelling the log file, and it may take many days before we notice that our controller is ostensibly still running, but only inertly; it can't get its job done without talking to the API server. Sometimes the server changes its mind and starts fulfilling the requests again, but these failure periods can persist for days.

A caller of `cache.Controller.Run` should have some way of detecting that these failures are occurring in order to declare the process unhealthy. Retrying automatically to smooth over intermittent network trouble is a nice feature, but having neither the ability to control it nor to detect its ongoing failure makes it dangerous.

I would be happy with either of the following two improvements:
- Accept an optional error callback (supplied via `cache.NewInformer`, or passed to `Controller.Run`) that tells a caller when these request failures arrive. It could also accept a caller-provided channel, and push errors into the channel when they arise, dropping errors that can't be delivered synchronously.
- Provide a way to integrate a controller into a "healthz" handler, reporting unhealthy when too many requests have failed recently. That leaves the health criteria opaque to callers—and probably begs for some way to configure the thresholds—but still allows a calling process to indicate that it's in dire shape. (A sketch of this idea follows below.)
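A rough sketch of that second idea; the type, window, and threshold are hypothetical, not anything client-go provides. Wire it up with `http.Handle("/healthz", tracker)` and call `RecordFailure` from whatever error hook exists:

```go
import (
	"net/http"
	"sync"
	"time"
)

// healthTracker reports unhealthy once too many informer failures land
// inside a rolling window.
type healthTracker struct {
	mu       sync.Mutex
	failures []time.Time
	window   time.Duration // e.g. 5 * time.Minute
	maxFails int           // e.g. 3
}

// RecordFailure notes one failed API server request.
func (h *healthTracker) RecordFailure() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.failures = append(h.failures, time.Now())
}

// ServeHTTP implements the healthz endpoint: 200 while recent failures
// stay under the threshold, 503 once they reach it.
func (h *healthTracker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	cutoff := time.Now().Add(-h.window)
	recent := h.failures[:0]
	for _, t := range h.failures {
		if t.After(cutoff) {
			recent = append(recent, t)
		}
	}
	h.failures = recent // discard timestamps outside the window
	count := len(recent)
	h.mu.Unlock()

	if count >= h.maxFails {
		http.Error(w, "too many API server failures", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```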
We discussed this gap in the "kubernetes-dev" channel in the "Kubernetes" Slack team.