daemon: Add GetNode() wrapper which retries on errors #409

Merged
1 commit merged into openshift:master from jlebon:pr/node-retry on Feb 15, 2019

Conversation

8 participants
jlebon (Member) commented Feb 11, 2019

I think this is what's causing the issue in:
#395 (comment)

jlebon (Member, Author) commented Feb 11, 2019

(Not tested yet.)

cgwalters (Contributor) commented Feb 11, 2019

I feel like we should acquire this once in the daemon around getStateAndConfigs or so. It's not like anything in it is going to change that we care about.

ashcrow (Member) commented Feb 11, 2019

/test unit

runcom (Member) commented Feb 11, 2019

> I feel like we should acquire this once in the daemon around getStateAndConfigs or so. It's not like anything in it is going to change that we care about.

aren't we going to cache annotations this way? (unless I'm not understanding the comment)

cgwalters (Contributor) commented Feb 11, 2019

> aren't we going to cache annotations this way? (unless I'm not understanding the comment)

Hmm. Right. We have a node lister we should probably be using.

Also looking over the code in the daemon we have many places that .Get() our own node:


$ git grep Nodes.*Get
daemon/daemon.go:               if err := dn.nodeWriter.SetUpdateDone(dn.kubeClient.CoreV1().Nodes(), dn.name, state.pendingConfig.GetName()); err != nil {
daemon/daemon.go:       node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})
daemon/daemon.go:       node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})
daemon/update.go:               node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})

jlebon (Member, Author) commented Feb 11, 2019

Yup, I force-pushed an update so we do this everywhere.

runcom (Member) commented Feb 11, 2019

/approve

runcom (Member) commented Feb 12, 2019

/retest

jlebon force-pushed the jlebon:pr/node-retry branch from 450fd12 to 260b412 on Feb 12, 2019

openshift-ci-robot added size/M and removed size/S labels on Feb 12, 2019

jlebon (Member, Author) commented Feb 12, 2019

Pushed an update which follows the pattern in #410. Still trying to test this though. My worker node for some reason is not coming up. (Also caught another instance of Nodes().Get()!)

/hold

if err == wait.ErrWaitTimeout {
	glog.Warningf("Timed out trying to fetch node %s; last error: %v", node, lastErr)
}
return nil, lastErr
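
For context, a minimal sketch of the full retry wrapper this tail belongs to, assuming the GetNode signature shown later in the review and wait.PollImmediate from k8s.io/apimachinery; the interval, timeout, and surrounding structure are illustrative rather than the merged implementation.

package daemon

import (
	"time"

	"github.com/golang/glog"
	core_v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	corev1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

// GetNode polls for the node object, retrying on transient API errors.
// Sketch only: the interval and timeout values are illustrative.
func GetNode(client corev1.NodeInterface, node string) (*core_v1.Node, error) {
	var n *core_v1.Node
	var lastErr error
	err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		var getErr error
		n, getErr = client.Get(node, metav1.GetOptions{})
		if getErr != nil {
			lastErr = getErr
			return false, nil // not fatal; keep polling
		}
		return true, nil
	})
	if err != nil {
		if err == wait.ErrWaitTimeout {
			glog.Warningf("Timed out trying to fetch node %s; last error: %v", node, lastErr)
		}
		return nil, lastErr
	}
	return n, nil
}

Call sites like dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{}) from the grep output earlier could then be routed through this helper.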

jlebon (Member, Author) commented Feb 12, 2019

Ahh OK, I thought the goal of #410 was keeping the same error so it could be matched higher up in the stack. If we wrap it, we're losing information, right?

jlebon (Member, Author) commented Feb 12, 2019

As an example, the instance in updateNodeRetry() is matching against errors.IsConflict(err).

runcom (Member) commented Feb 12, 2019

Well, mmm, yeah, wrapping is something we still need to do for these kinds of errors. Example: say this times out for some reason; callers want to know 1) is it a timeout? 2) what was the underlying error?
The code right now isn't telling callers that we actually timed out, right? We need to bubble that up to build logic around permanent/temporary failures (?)

The example here https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400 is telling you both 1) and 2). We should even start using pkg/errors.Wrap, which pairs with errors.Cause() to tell us more about the wrapped error.

jlebon (Member, Author) commented Feb 12, 2019

Ahh yes, Cause() is exactly what I mean. I didn't realize that's what you meant since we don't currently do that in https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400. K, will update to use Wrap!

runcom (Member) commented Feb 12, 2019

> Ahh yes, Cause() is exactly what I mean. I didn't realize that's what you meant since we don't currently do that in https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400. K

well, by wrapping in that snippet I meant that we were still wrapping the timeout err with lastErr https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L398
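
To illustrate the pattern being discussed, here is a small hypothetical example of how pkg/errors wrapping keeps the underlying API error matchable: the caller wraps with context but can still recover the cause and test it with apierrors.IsConflict, the same predicate updateNodeRetry relies on.

package main

import (
	"fmt"

	"github.com/pkg/errors"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Simulate a conflict error coming back from the API server.
	underlying := apierrors.NewConflict(
		schema.GroupResource{Resource: "nodes"}, "worker-0",
		fmt.Errorf("the object has been modified"))

	// Wrap it with context, e.g. when a retry loop gives up.
	wrapped := errors.Wrap(underlying, "timed out updating node")

	// Callers get both pieces of information: the context from the wrap
	// and, via errors.Cause(), the original error to match against.
	fmt.Println(wrapped)                                     // "timed out updating node: ..."
	fmt.Println(apierrors.IsConflict(errors.Cause(wrapped))) // true
}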

Commit: daemon: Add GetNode() wrapper which retries on errors
I think this is what's causing the issue in:
#395 (comment)

jlebon force-pushed the jlebon:pr/node-retry branch from 260b412 to a232d68 on Feb 12, 2019

jlebon (Member, Author) commented Feb 12, 2019

Testing this is blocked on openshift/machine-api-operator#205. Trying to bring up an older cluster.

jlebon (Member, Author) commented Feb 13, 2019

Test failure is:

F0212 18:30:50.637693 1 api.go:59] Machine Config Server exited with error: listen tcp :49500: bind: address already in use

which should be fixed now that openshift/installer#1180 is merged.

/test e2e-aws-op

jlebon (Member, Author) commented Feb 13, 2019

e2e-aws passed, which seems to indicate that the happy path works at least.

jlebon (Member, Author) commented Feb 13, 2019

/hold cancel

Tested this now (albeit on an older cluster).

ashcrow (Member) left a comment

👍

ashcrow (Member) commented Feb 13, 2019

/lgtm

openshift-ci-robot commented Feb 13, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, jlebon, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ashcrow,jlebon,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

runcom (Member) commented Feb 13, 2019

👍 great

cgwalters (Contributor) commented Feb 13, 2019

I'm not opposed to this but I still think we should be using Informers at least eventually.

runcom (Member) commented Feb 13, 2019

> I'm not opposed to this but I still think we should be using Informers at least eventually.

Can we track this somewhere, in an issue maybe? Just so we know we have ideas on how to better tackle this.
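
For reference, a minimal sketch of the informer-based alternative cgwalters mentions, assuming client-go's shared informer factory; the daemon would then read its node from the lister's local cache instead of issuing a fresh Get against the API server each time. The startNodeLister name is hypothetical.

package daemon

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// startNodeLister (hypothetical) starts a shared node informer and returns a
// lister backed by its local, watch-driven cache.
func startNodeLister(client kubernetes.Interface, stopCh <-chan struct{}) (corelisters.NodeLister, error) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	nodeInformer := factory.Core().V1().Nodes()
	lister := nodeInformer.Lister()

	factory.Start(stopCh)
	// Block until the cache has been populated so early reads don't miss.
	if !cache.WaitForCacheSync(stopCh, nodeInformer.Informer().HasSynced) {
		return nil, fmt.Errorf("timed out waiting for node informer cache to sync")
	}
	return lister, nil
}

Lookups would then go through lister.Get(nodeName), which serves from the cache and is kept current by the watch, sidestepping transient API errors on the read path.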

ashcrow (Member) commented Feb 13, 2019

AWS resource problems. Let's wait a bit before we issue a retest.

rphillips (Contributor) commented Feb 14, 2019

/retest

openshift-bot commented Feb 14, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

ashcrow (Member) commented Feb 14, 2019

Flake

/retest

// getNodeAnnotationExt is like getNodeAnnotation, but allows one to customize ENOENT handling
func getNodeAnnotationExt(client corev1.NodeInterface, node string, k string, allowNoent bool) (string, error) {
// GetNode gets the node object.
func GetNode(client corev1.NodeInterface, node string) (*core_v1.Node, error) {

runcom (Member) commented Feb 14, 2019

You can lowercase the func name here, as it's only used within pkg/daemon.

jlebon (Member, Author) commented Feb 14, 2019

Hmm, true. Though WDYT about getting this in, and we just work on #409 (comment) as a follow-up?

runcom (Member) commented Feb 14, 2019

That's ok as well (though I'm missing context on how to use informers for these scenarios) /me goes to learn

runcom (Member) commented Feb 14, 2019

/retest

(6 similar /retest comments from runcom followed.)

runcom (Member) commented Feb 14, 2019

wtf is going on

runcom (Member) commented Feb 15, 2019

/retest

(1 similar /retest comment from runcom followed.)

openshift-merge-robot merged commit 8cdcfbc into openshift:master on Feb 15, 2019

6 checks passed
  • ci/prow/e2e-aws: Job succeeded.
  • ci/prow/e2e-aws-op: Job succeeded.
  • ci/prow/images: Job succeeded.
  • ci/prow/rhel-images: Job succeeded.
  • ci/prow/unit: Job succeeded.
  • tide: In merge pool.