Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cache: add error handling to informers #87329

Merged

Conversation

nicks
Copy link
Contributor

@nicks nicks commented Jan 17, 2020

When creating an informer, this adds a way to add custom error handling
or backoff logic, so that Kubernetes tooling can properly surface
the errors to the terminal.

Fixes kubernetes/client-go#155

/kind feature

NONE

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 17, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @nicks!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @nicks. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jan 17, 2020
@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jan 17, 2020
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 17, 2020
@nicks
Copy link
Contributor Author

nicks commented Jan 17, 2020

/assign @deads2k

@nicks nicks force-pushed the nicks/informer-error-handling branch 2 times, most recently from 13f9ad7 to 57b93a8 Compare January 18, 2020 00:39
@@ -327,7 +342,7 @@ func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
case err == io.ErrUnexpectedEOF:
klog.V(1).Infof("%s: Watch for %v closed with unexpected EOF: %v", r.name, r.expectedTypeName, err)
default:
utilruntime.HandleError(fmt.Errorf("%s: Failed to watch %v: %v", r.name, r.expectedTypeName, err))
errToReturn = fmt.Errorf("Failed to watch %v: %v", r.expectedTypeName, err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not return the error right away?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wanted to preserve the current behavior where we don't return on connection refused (i.e., line 351)

@@ -318,6 +332,7 @@ func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {

w, err := r.listerWatcher.Watch(options)
if err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about removing all the specific error handling here to a DefaultDropWatchHandler that is registered by default?
Because I'd expect people would like to handle EOF and ConnectionRefused error (if only for monitoring purposes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The most important thing for us is to have some mechanism to bubble up to user "hey, your auth tokens expired, and we can't talk to kubernetes anymore".

I'm deeply worried that this PR is going to turn into a quagmire where we can't add any error-handling mechanism at all until we've had a long discussion on which errors bubble up and which do not.

I think that's a good discussion to have. But I don't feel like I'm well-equipped to facilitate that discussion, or to weigh competing needs (e.g., people who want the informer to retry EOF internally vs those who do not)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood, but I'm just suggesting to keep the existing error handling but move it to a DefaultDropWatchHandler. That way a user can customize it without changing the existing default behaviour.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm...i spent some time playing around with this, and couldn't come up with a solution that fit well. This error handling depends on unexported fields and functions of the package (r.expectedTypeName, isExpiredError), and wasn't sure how much we should really be exposing.

Did you have a particular API in mind?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicks A DefaultDropWatchHandler could use these fields, right? Something like this:

func DefaultDropWatchHandler(err error, doneCh <-chan struct{}) {
	switch {
			case isExpiredError(err):
				r.setIsLastSyncResourceVersionExpired(true)
				klog.V(4).Infof("%s: watch of %v closed with: %v", r.name, r.expectedTypeName, err)
			case err == io.EOF:
				// watch closed normally
			case err == io.ErrUnexpectedEOF:
				klog.V(1).Infof("%s: Watch for %v closed with unexpected EOF: %v", r.name, r.expectedTypeName, err)
			default:
				utilruntime.HandleError(fmt.Errorf("%s: Failed to watch %v: %v", r.name, r.expectedTypeName, err))
	}
	if !utilnet.IsConnectionRefused(err) {
		doneCh <- <- struct{}{}
	}
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i pushed a new branch to demonstrate what that would look like, what do you think?

@fedebongio
Copy link
Contributor

/assign @jpbetz

@nicks nicks force-pushed the nicks/informer-error-handling branch from 57b93a8 to 16b822d Compare January 25, 2020 04:48
@nicks nicks force-pushed the nicks/informer-error-handling branch from 16b822d to d302563 Compare February 4, 2020 16:07
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2020
@nicks nicks force-pushed the nicks/informer-error-handling branch from d302563 to 2c6a01a Compare February 4, 2020 16:11
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 4, 2020

type DropWatchHandler func(err error, doneCh <-chan struct{})

func createDefaultDropWatchHandler(r *Reflector) DropWatchHandler {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd change the DropWatchHandler type to func(r *Reflector, err error, doneCh <- chan struct{}) and then define func DefaultDropWatchHandler(...). Seems more idiomatic to me and having access to Reflector in custom handlers seems useful anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done!

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp, nicks

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 13, 2020
When creating an informer, this adds a way to add custom error handling, so that
Kubernetes tooling can properly surface the errors to the end user.

Fixes kubernetes/client-go#155
@nicks nicks force-pushed the nicks/informer-error-handling branch from 8890889 to 435b40a Compare March 14, 2020 00:25
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 14, 2020
@nicks
Copy link
Contributor Author

nicks commented Mar 16, 2020

/retest

@k8s-ci-robot
Copy link
Contributor

@nicks: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@lavalamp
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 16, 2020
@lavalamp
Copy link
Member

verify is almost never flaky, btw.

@nicks
Copy link
Contributor Author

nicks commented Mar 16, 2020

/test pull-kubernetes-verify

@lavalamp I 100% believe you that verify is not flaky, but is there a guide anywhere on how to read its output?

The logs both seem to be saying "all tests passed" and the job exited with failure.
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/87329/pull-kubernetes-verify/1239582059362521088

@nicks
Copy link
Contributor Author

nicks commented Mar 16, 2020

/retest

oh, nm, it passed!!

@lavalamp
Copy link
Member

lavalamp commented Mar 16, 2020 via email

@nicks
Copy link
Contributor Author

nicks commented Apr 3, 2020

Is there anything left to do with this PR? Tide says "Must be in milestone v1.18", but I'm not sure if that's valid anymore now that v1.18.0 is out.

@lavalamp
Copy link
Member

lavalamp commented Apr 3, 2020

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 3, 2020
@nicks
Copy link
Contributor Author

nicks commented Apr 3, 2020

/retest

1 similar comment
@nicks
Copy link
Contributor Author

nicks commented Apr 3, 2020

/retest

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

2 similar comments
@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit 7c6473b into kubernetes:master Apr 4, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.19 milestone Apr 4, 2020
@Levovar
Copy link

Levovar commented Apr 8, 2020

@nicks not trying to necro the PR but considering almost all but two members of cache.Reflector are unexported:
https://github.com/kubernetes/client-go/blob/master/tools/cache/reflector.go#L49
not sure how useful it is to reguire it in WatchErrorHandler's signature
I was very happy with the new API and planning to use it in my Informers, but I feel this input parameter will be mostly unused due to this
what do you think about it?

roivaz added a commit to 3scale-ops/marin3r that referenced this pull request Apr 15, 2020
The WaitForCacheSync waits forever and never returns in the case a
persisten error occurs. On the other hand, it looks like there is no
way in the current version of surfacing informer problems to the caller,
as stated in this issue: kubernetes/client-go#155.
Error handling for informers has been added recently in
kubernetes/kubernetes#87329 which is only available
in master for the moment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Informers do not surface API server request failures to callers