daemon: Add GetNode() wrapper which retries on errors #409

Merged
1 commit merged into openshift:master from jlebon:pr/node-retry on Feb 15, 2019

Conversation

8 participants
jlebon (Member) commented Feb 11, 2019

I think this is what's causing the issue in:
#395 (comment)

jlebon (Member, Author) commented Feb 11, 2019

(Not tested yet.)

cgwalters (Contributor) commented Feb 11, 2019

I feel like we should acquire this once in the daemon around getStateAndConfigs or so. It's not like anything in it is going to change that we care about.

ashcrow (Member) commented Feb 11, 2019

/test unit

runcom (Member) commented Feb 11, 2019

> I feel like we should acquire this once in the daemon around getStateAndConfigs or so. It's not like anything in it is going to change that we care about.

aren't we going to cache annotations this way? (unless I'm not understanding the comment)

cgwalters (Contributor) commented Feb 11, 2019

> aren't we going to cache annotations this way? (unless I'm not understanding the comment)

Hmm. Right. We have a node lister we should probably be using.

Also looking over the code in the daemon we have many places that .Get() our own node:


$ git grep Nodes.*Get
daemon/daemon.go:               if err := dn.nodeWriter.SetUpdateDone(dn.kubeClient.CoreV1().Nodes(), dn.name, state.pendingConfig.GetName()); err != nil {
daemon/daemon.go:       node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})
daemon/daemon.go:       node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})
daemon/update.go:               node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})

jlebon (Member, Author) commented Feb 11, 2019

Yup, I force-pushed an update so we do this everywhere.

runcom (Member) commented Feb 11, 2019

/approve

runcom (Member) commented Feb 12, 2019

/retest

jlebon force-pushed the jlebon:pr/node-retry branch from 450fd12 to 260b412 on Feb 12, 2019

openshift-ci-robot added size/M and removed size/S labels on Feb 12, 2019

jlebon (Member, Author) commented Feb 12, 2019

Pushed an update which follows the pattern in #410. Still trying to test this though. My worker node for some reason is not coming up. (Also caught another instance of Nodes().Get()!)

/hold

if err == wait.ErrWaitTimeout {
	glog.Warningf("Timed out trying to fetch node %s; last error: %v", node, lastErr)
}
return nil, lastErr
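
For context, a minimal sketch of the full retry wrapper this tail belongs to, assuming the GetNode signature shown later in the review and wait.PollImmediate from k8s.io/apimachinery; the interval, timeout, and surrounding structure are illustrative rather than the merged implementation.

package daemon

import (
	"time"

	"github.com/golang/glog"
	core_v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	corev1 "k8s.io/client-go/kubernetes/typed/core/v1"
)

// GetNode polls for the node object, retrying on transient API errors.
// Sketch only: the interval and timeout values are illustrative.
func GetNode(client corev1.NodeInterface, node string) (*core_v1.Node, error) {
	var n *core_v1.Node
	var lastErr error
	err := wait.PollImmediate(10*time.Second, 5*time.Minute, func() (bool, error) {
		var getErr error
		n, getErr = client.Get(node, metav1.GetOptions{})
		if getErr != nil {
			lastErr = getErr
			return false, nil // not fatal; keep polling
		}
		return true, nil
	})
	if err != nil {
		if err == wait.ErrWaitTimeout {
			glog.Warningf("Timed out trying to fetch node %s; last error: %v", node, lastErr)
		}
		return nil, lastErr
	}
	return n, nil
}

Call sites like dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{}) from the grep output earlier could then be routed through this helper.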

jlebon (Member, Author) commented Feb 12, 2019

Ahh OK, I thought the goal of #410 was keeping the same error so it could be matched higher up in the stack. If we wrap it, we're losing information, right?

jlebon (Member, Author) commented Feb 12, 2019

As an example, the instance in updateNodeRetry() is matching against errors.IsConflict(err).

runcom (Member) commented Feb 12, 2019

Well, mmm, yeah, wrapping is something we still need to do for these kinds of errors. Example: say this times out for some reason; callers want to know 1) is it a timeout? 2) what was the underlying error?
The code right now isn't telling callers that we actually timed out, right? We need to bubble that up to build logic around permanent/temporary failures (?)

The example here https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400 is telling you both 1) and 2). We should even start using pkg/errors.Wrap, which pairs with errors.Cause() to tell us more about the wrapped error.

jlebon (Member, Author) commented Feb 12, 2019

Ahh yes, Cause() is exactly what I mean. I didn't realize that's what you meant since we don't currently do that in https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400. K, will update to use Wrap!

runcom (Member) commented Feb 12, 2019

> Ahh yes, Cause() is exactly what I mean. I didn't realize that's what you meant since we don't currently do that in https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L397-L400. K

well, by wrapping in that snippet I meant that we were still wrapping the timeout err with lastErr https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L398
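
To illustrate the pattern being discussed, here is a small hypothetical example of how pkg/errors wrapping keeps the underlying API error matchable: the caller wraps with context but can still recover the cause and test it with apierrors.IsConflict, the same predicate updateNodeRetry relies on.

package main

import (
	"fmt"

	"github.com/pkg/errors"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

func main() {
	// Simulate a conflict error coming back from the API server.
	underlying := apierrors.NewConflict(
		schema.GroupResource{Resource: "nodes"}, "worker-0",
		fmt.Errorf("the object has been modified"))

	// Wrap it with context, e.g. when a retry loop gives up.
	wrapped := errors.Wrap(underlying, "timed out updating node")

	// Callers get both pieces of information: the context from the wrap
	// and, via errors.Cause(), the original error to match against.
	fmt.Println(wrapped)                                     // "timed out updating node: ..."
	fmt.Println(apierrors.IsConflict(errors.Cause(wrapped))) // true
}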

Commit: daemon: Add GetNode() wrapper which retries on errors
I think this is what's causing the issue in:
#395 (comment)

jlebon force-pushed the jlebon:pr/node-retry branch from 260b412 to a232d68 on Feb 12, 2019

jlebon (Member, Author) commented Feb 12, 2019

Testing this is blocked on openshift/machine-api-operator#205. Trying to bring up an older cluster.

jlebon (Member, Author) commented Feb 13, 2019

Test failure is:

F0212 18:30:50.637693 1 api.go:59] Machine Config Server exited with error: listen tcp :49500: bind: address already in use

which should be fixed now that openshift/installer#1180 is merged.

/test e2e-aws-op

jlebon (Member, Author) commented Feb 13, 2019

e2e-aws passed, which seems to indicate that the happy path works at least.

jlebon (Member, Author) commented Feb 13, 2019

/hold cancel

Tested this now (albeit on an older cluster).

ashcrow (Member) left a comment

👍

ashcrow (Member) commented Feb 13, 2019

/lgtm

openshift-ci-robot commented Feb 13, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow, jlebon, runcom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [ashcrow,jlebon,runcom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

runcom (Member) commented Feb 13, 2019

👍 great

cgwalters (Contributor) commented Feb 13, 2019

I'm not opposed to this but I still think we should be using Informers at least eventually.

runcom (Member) commented Feb 13, 2019

> I'm not opposed to this but I still think we should be using Informers at least eventually.

Can we track this somewhere, in an issue maybe? Just so we know we have ideas on how to better tackle this.
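
For reference, a minimal sketch of the informer-based alternative cgwalters mentions, assuming client-go's shared informer factory; the daemon would then read its node from the lister's local cache instead of issuing a fresh Get against the API server each time. The startNodeLister name is hypothetical.

package daemon

import (
	"fmt"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	corelisters "k8s.io/client-go/listers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// startNodeLister (hypothetical) starts a shared node informer and returns a
// lister backed by its local, watch-driven cache.
func startNodeLister(client kubernetes.Interface, stopCh <-chan struct{}) (corelisters.NodeLister, error) {
	factory := informers.NewSharedInformerFactory(client, 30*time.Minute)
	nodeInformer := factory.Core().V1().Nodes()
	lister := nodeInformer.Lister()

	factory.Start(stopCh)
	// Block until the cache has been populated so early reads don't miss.
	if !cache.WaitForCacheSync(stopCh, nodeInformer.Informer().HasSynced) {
		return nil, fmt.Errorf("timed out waiting for node informer cache to sync")
	}
	return lister, nil
}

Lookups would then go through lister.Get(nodeName), which serves from the cache and is kept current by the watch, sidestepping transient API errors on the read path.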

ashcrow (Member) commented Feb 13, 2019

AWS resource problems. Let's wait a bit before we issue a retest.

rphillips (Contributor) commented Feb 14, 2019

/retest

openshift-bot commented Feb 14, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

ashcrow (Member) commented Feb 14, 2019

Flake

/retest

// getNodeAnnotationExt is like getNodeAnnotation, but allows one to customize ENOENT handling
func getNodeAnnotationExt(client corev1.NodeInterface, node string, k string, allowNoent bool) (string, error) {
// GetNode gets the node object.
func GetNode(client corev1.NodeInterface, node string) (*core_v1.Node, error) {

runcom (Member) commented Feb 14, 2019

You can lowercase the func name here, as it's only used within pkg/daemon.

jlebon (Member, Author) commented Feb 14, 2019

Hmm, true. Though WDYT about getting this in, and we just work on #409 (comment) as a follow-up?

runcom (Member) commented Feb 14, 2019

That's ok as well (though I'm missing context on how to use informers for these scenarios) /me goes to learn

runcom (Member) commented Feb 14, 2019

/retest

(6 similar /retest comments from runcom followed.)

runcom (Member) commented Feb 14, 2019

wtf is going on

runcom (Member) commented Feb 15, 2019

/retest

(1 similar /retest comment from runcom followed.)

openshift-merge-robot merged commit 8cdcfbc into openshift:master on Feb 15, 2019

6 checks passed
  • ci/prow/e2e-aws: Job succeeded.
  • ci/prow/e2e-aws-op: Job succeeded.
  • ci/prow/images: Job succeeded.
  • ci/prow/rhel-images: Job succeeded.
  • ci/prow/unit: Job succeeded.
  • tide: In merge pool.