daemon: Add GetNode() wrapper which retries on errors #409

Merged: 1 commit, Feb 15, 2019

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Feb 11, 2019

I think this is what's causing the issue in:
#395 (comment)

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 11, 2019
@jlebon
Copy link
Member Author

jlebon commented Feb 11, 2019

(Not tested yet.)

@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Feb 11, 2019
@cgwalters
Copy link
Member

I feel like we should acquire this once in the daemon around getStateAndConfigs or so. It's not like anything in it is going to change that we care about.

@ashcrow
Copy link
Member

ashcrow commented Feb 11, 2019

/test unit

@runcom
Copy link
Member

runcom commented Feb 11, 2019

> I feel like we should acquire this once in the daemon around getStateAndConfigs or so. It's not like anything in it is going to change that we care about.

aren't we going to cache annotations this way? (unless I'm not understanding the comment)

@cgwalters
Copy link
Member

> aren't we going to cache annotations this way? (unless I'm not understanding the comment)

Hmm. Right. We have a node lister we should probably be using.

Also looking over the code in the daemon we have many places that .Get() our own node:


$ git grep Nodes.*Get
daemon/daemon.go:               if err := dn.nodeWriter.SetUpdateDone(dn.kubeClient.CoreV1().Nodes(), dn.name, state.pendingConfig.GetName()); err != nil {
daemon/daemon.go:       node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})
daemon/daemon.go:       node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})
daemon/update.go:               node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})

@jlebon
Copy link
Member Author

jlebon commented Feb 11, 2019

Yup, I force-pushed an update so we do this everywhere.

@runcom
Copy link
Member

runcom commented Feb 11, 2019

/approve

@runcom
Copy link
Member

runcom commented Feb 12, 2019

/retest

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 12, 2019
@jlebon
Copy link
Member Author

jlebon commented Feb 12, 2019

Pushed an update which follows the pattern in #410. Still trying to test this though. My worker node for some reason is not coming up. (Also caught another instance of Nodes().Get()!)

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 12, 2019
if err == wait.ErrWaitTimeout {
	glog.Warningf("Timed out trying to fetch node %s; last error: %v", node, lastErr)
}
return nil, lastErr
Copy link
Member

Copy link
Member Author

Ahh OK, I thought the goal of #410 was keeping the same error so it could be matched higher up in the stack. If we wrap it, we're losing information, right?

Copy link
Member Author

As an example, the instance in updateNodeRetry() is matching against errors.IsConflict(err).

Copy link
Member

well, mmm yeah, wrapping is something we still need to do for these kinds of errors. Example: say this times out for some reason; callers want to know 1) is it a timeout? 2) what was the underlying error?
The code right now isn't telling callers that we actually timed out, right? We need to bubble that up so callers can build logic around permanent/temporary failures (?)

The example here https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400 tells you both 1) and 2), though we should start using pkg/errors.Wrap, which has a .Cause() method that will tell us more about the wrapped error

Copy link
Member

Copy link
Member Author

Ahh yes, Cause() is exactly what I mean. I didn't realize that's what you meant since we don't currently do that in https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400. K, will update to use Wrap!

Copy link
Member

> Ahh yes, Cause() is exactly what I mean. I didn't realize that's what you meant since we don't currently do that in https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400. K

well, by wrapping in that snippet I meant that we were still wrapping the timeout err with lastErr https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L398

@jlebon
Copy link
Member Author

jlebon commented Feb 12, 2019

My testing this is blocked on openshift/machine-api-operator#205. Trying to pull up an older cluster.

@jlebon
Copy link
Member Author

jlebon commented Feb 13, 2019

Test failure is:

F0212 18:30:50.637693 1 api.go:59] Machine Config Server exited with error: listen tcp :49500: bind: address already in use

which should be fixed now that openshift/installer#1180 is merged.

/test e2e-aws-op

@jlebon
Copy link
Member Author

jlebon commented Feb 13, 2019

e2e-aws passed, which seems to indicate that the happy path works at least.

@jlebon
Copy link
Member Author

jlebon commented Feb 13, 2019

/hold cancel

Tested this now (albeit on an older cluster).

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 13, 2019
Copy link
Member

@ashcrow ashcrow left a comment

👍

@ashcrow
Copy link
Member

ashcrow commented Feb 13, 2019

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 13, 2019
@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, jlebon, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ashcrow,jlebon,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom
Copy link
Member

runcom commented Feb 13, 2019

👍 great

@cgwalters
Copy link
Member

I'm not opposed to this but I still think we should be using Informers at least eventually.

@runcom
Copy link
Member

runcom commented Feb 13, 2019

> I'm not opposed to this but I still think we should be using Informers at least eventually.

can we track this somewhere in an issue maybe? just to know we have ideas on how to better tackle this issue

@ashcrow
Copy link
Member

ashcrow commented Feb 13, 2019

AWS resource problems. Let's wait a bit before we issue a retest.

@rphillips
Copy link
Contributor

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@ashcrow
Copy link
Member

ashcrow commented Feb 14, 2019

Flake

/retest

// getNodeAnnotationExt is like getNodeAnnotation, but allows one to customize ENOENT handling
func getNodeAnnotationExt(client corev1.NodeInterface, node string, k string, allowNoent bool) (string, error) {
// GetNode gets the node object.
func GetNode(client corev1.NodeInterface, node string) (*core_v1.Node, error) {
Copy link
Member

you can lowercase the func name here as it's only used within pkg/daemon

Copy link
Member Author

Hmm, true. Though WDYT about getting this in, and we just work on #409 (comment) as a follow up?

Copy link
Member

That's ok as well (though I'm missing context on how to use informers for these scenarios) /me goes to learn

@runcom
Copy link
Member

runcom commented Feb 14, 2019

/retest

6 similar comments

@runcom
Copy link
Member

runcom commented Feb 14, 2019

wtf is going on

@runcom
Copy link
Member

runcom commented Feb 15, 2019

/retest

1 similar comment

@openshift-merge-robot openshift-merge-robot merged commit 8cdcfbc into openshift:master Feb 15, 2019
@jlebon jlebon deleted the pr/node-retry branch May 1, 2023 15:32
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
8 participants