daemon: Add GetNode() wrapper which retries on errors #409
Conversation
(Not tested yet.)
Force-pushed: e36d410 → 450fd12
I feel like we should acquire this once in the daemon around
/test unit
Aren't we going to cache annotations this way? (Unless I'm not understanding the comment.)
Hmm. Right. We have a node lister we should probably be using. Also, looking over the code in the daemon, we have many places that
Yup, I force-pushed an update so we do this everywhere.
/approve
/retest
Force-pushed: 450fd12 → 260b412
Pushed an update which follows the pattern in #410. Still trying to test this, though; my worker node for some reason is not coming up. (Also caught another instance of
/hold
pkg/daemon/node.go (outdated)
if err == wait.ErrWaitTimeout {
	glog.Warningf("Timed out trying to fetch node %s; last error: %v", node, lastErr)
}
return nil, lastErr
so, this is fine, but we can still wrap the time out error https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400
Ahh OK, I thought the goal of #410 was keeping the same error so it could be matched higher up in the stack. If we wrap it, we're losing information, right?
As an example, the instance in updateNodeRetry() is matching against errors.IsConflict(err).
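To illustrate the point being made: matching on the concrete error type is what lets the retry loop distinguish "try again" from "give up". Below is a minimal, stdlib-only sketch of that pattern; `errConflict` stands in for the HTTP 409 that `k8s.io/apimachinery/pkg/api/errors.IsConflict` detects against a real cluster, and `maxAttempts` is a placeholder bound, not a value from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the HTTP 409 conflict that
// k8s.io/apimachinery/pkg/api/errors.IsConflict detects on a real cluster.
var errConflict = errors.New("conflict: object was modified")

// updateNodeRetry retries update() while it fails with a conflict and
// returns any other error immediately, mirroring the retry-on-IsConflict
// pattern discussed above.
func updateNodeRetry(maxAttempts int, update func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = update(); err == nil {
			return nil
		}
		if !errors.Is(err, errConflict) {
			return err // non-conflict errors are not retried
		}
		// Conflict: another writer updated the object first; try again.
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	attempts := 0
	err := updateNodeRetry(5, func() error {
		attempts++
		if attempts < 3 {
			return errConflict
		}
		return nil
	})
	fmt.Println(err, attempts)
}
```

Wrapping with %w (or pkg/errors.Wrap in the pre-Go-1.13 code this PR targets) is what keeps such matching possible after context is added.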
Well, yeah, wrapping is something we still need to do for these kinds of errors. Say this times out for some reason: callers want to know 1) was it a timeout? and 2) what was the underlying error? The code right now isn't telling callers that we actually timed out, and we need to bubble that up to build logic around permanent vs. temporary failures. The example at https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400 tells you both 1) and 2). We should even start using pkg/errors.Wrap, which has a Cause() method that tells us more about the wrapped error.
I'm talking about this https://github.com/pkg/errors#adding-context-to-an-error
Ahh yes, Cause() is exactly what I mean. I didn't realize that's what you meant, since we don't currently do that in https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400. OK, will update to use Wrap!
> Ahh yes, Cause() is exactly what I mean. I didn't realize that's what you meant, since we don't currently do that in https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400. OK
Well, by wrapping in that snippet I meant that we were still wrapping the timeout err with lastErr: https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L398. I think this is what's causing the issue in openshift#395 (comment).
Force-pushed: 260b412 → a232d68
My testing of this is blocked on openshift/machine-api-operator#205. Trying to pull up an older cluster.
Test failure is:
which should be fixed now that openshift/installer#1180 is merged.
/test e2e-aws-op
/hold cancel
Tested this now (albeit on an older cluster).
👍
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: ashcrow, jlebon, runcom. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
👍 great
I'm not opposed to this, but I still think we should be using informers, at least eventually.
Can we track this somewhere in an issue, maybe? Just so we know we have ideas on how to better tackle this.
AWS resource problems. Let's wait a bit before we issue a retest.
/retest
/retest
Please review the full test history for this PR and help us cut down flakes.
Flake.
/retest
// getNodeAnnotationExt is like getNodeAnnotation, but allows one to customize ENOENT handling
func getNodeAnnotationExt(client corev1.NodeInterface, node string, k string, allowNoent bool) (string, error) {

// GetNode gets the node object.
func GetNode(client corev1.NodeInterface, node string) (*core_v1.Node, error) {
You can lowercase the func name here, as it's only used within pkg/daemon.
Hmm, true. Though WDYT about getting this in, and we work on #409 (comment) as a follow-up?
That's OK as well (though I'm missing context on how to use informers for these scenarios). /me goes to learn
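For context on the informer suggestion: what an informer buys is that reads are served from a local, watch-maintained cache instead of issuing a GET (and a retry loop) per call. A stdlib-only sketch of that read path; the types are stand-ins, since a real shared informer from client-go would maintain the cache and hand out a NodeLister. Note the trade-off raised earlier in the thread: cached reads can be briefly stale.

```go
package main

import (
	"fmt"
	"sync"
)

// Node stands in for *core_v1.Node.
type Node struct {
	Name        string
	Annotations map[string]string
}

// nodeLister mimics the read path a shared informer provides: a local cache
// that watch events keep up to date, so Get never hits the API server.
type nodeLister struct {
	mu    sync.RWMutex
	cache map[string]*Node
}

// onAddOrUpdate is what the informer's event handler would invoke.
func (l *nodeLister) onAddOrUpdate(n *Node) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.cache[n.Name] = n
}

// Get serves from the cache; reads may be briefly stale, which is the
// annotation-caching concern raised earlier in this conversation.
func (l *nodeLister) Get(name string) (*Node, error) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	n, ok := l.cache[name]
	if !ok {
		return nil, fmt.Errorf("node %q not found in cache", name)
	}
	return n, nil
}

func main() {
	l := &nodeLister{cache: map[string]*Node{}}
	l.onAddOrUpdate(&Node{Name: "worker-0", Annotations: map[string]string{"example": "value"}})
	n, err := l.Get("worker-0")
	fmt.Println(n.Name, err)
}
```

A cache miss or staleness is why retries don't disappear entirely, but the per-call API traffic (and the transient-error surface this PR retries around) does.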
/retest
wtf is going on
/retest |