
daemon: Clearly log if error is from draining #408

Merged
merged 1 commit into openshift:master from cgwalters:drain-more-errors on Feb 12, 2019

Conversation

@cgwalters
Contributor

cgwalters commented Feb 11, 2019

Saw this in a log:

```
I0211 21:20:46.924255   61902 daemon.go:660] Unable to apply update: rpc error: code = Unknown desc =
```

It must be from the drain; let's make that clear.
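A minimal sketch of the idea behind the PR, not the actual patch: wrap the error returned from the drain path so that the "Unable to apply update" log line names its source. It uses github.com/pkg/errors (the same wrapping the diff below relies on); the function names drainNode and applyUpdate are illustrative stand-ins, not the daemon's real functions.

```go
// Minimal sketch with stand-in names: wrap the drain error so the
// "Unable to apply update: ..." message says where the failure came from.
package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// drainNode stands in for the daemon's drain call; it always fails here so
// the wrapped message is visible.
func drainNode() error {
	return fmt.Errorf("rpc error: code = Unknown desc =")
}

func applyUpdate() error {
	if err := drainNode(); err != nil {
		// The wrap is the whole point: the log now names draining as the source.
		return errors.Wrap(err, "failed to drain node")
	}
	return nil
}

func main() {
	if err := applyUpdate(); err != nil {
		fmt.Printf("Unable to apply update: %v\n", err)
		// Unable to apply update: failed to drain node: rpc error: code = Unknown desc =
	}
}
```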

@ashcrow
Member

ashcrow left a comment

👍

@ashcrow

Member

ashcrow commented Feb 11, 2019

/lgtm

Resolved review thread on pkg/daemon/update.go (outdated)
@cgwalters

Contributor Author

cgwalters commented Feb 11, 2019

/lgtm cancel

based on #408 (comment)

@openshift-ci-robot

openshift-ci-robot commented Feb 11, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ashcrow

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@runcom

Member

runcom commented Feb 11, 2019

pretty sure the error reported on Slack comes from:

```go
func (dn *Daemon) updateOSAndReboot(newConfig *mcfgv1.MachineConfig) error {
	if err := dn.updateOS(newConfig); err != nil {
		return err
	}

	// Skip draining of the node when we're not cluster driven
	if dn.onceFrom == "" {
		glog.Info("Update prepared; draining the node")

		node, err := dn.kubeClient.CoreV1().Nodes().Get(dn.name, metav1.GetOptions{})
		if err != nil {
			return err
		}
```

Otherwise we should have gotten a timeout error, right?

@jlebon

Member

jlebon commented Feb 11, 2019

Yeah, I agree the error is likely from the Nodes().Get(). See #409.

daemon: Clearly log if error is from draining
Saw this in a log:

```
I0211 21:20:46.924255   61902 daemon.go:660] Unable to apply update: rpc error: code = Unknown desc =
```

It must be from the drain; let's make that clear.

@cgwalters cgwalters force-pushed the cgwalters:drain-more-errors branch from 68c69a2 to ec40c08 Feb 11, 2019

@openshift-ci-robot openshift-ci-robot added size/S and removed size/XS labels Feb 11, 2019

```diff
-	if err != nil {
-		return err
+	if lastErr != nil {
+		return errors.Wrapf(lastErr, "Failed to drain node (%s tries)", backoff.Steps)
 	}
```

@runcom

runcom Feb 11, 2019

Member

we're going to miss the wait.* error this way, since errors other than just the timeout can happen if someone someday adds another error case that returns a real err:

```go
// If the condition never returns true, ErrWaitTimeout is returned. All other
// errors terminate immediately.
func ExponentialBackoff(backoff Backoff, condition ConditionFunc) error {
```

I was thinking something like https://github.com/openshift/machine-config-operator/blob/master/pkg/operator/sync.go#L357-L362
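Roughly, the pattern being suggested looks like the self-contained sketch below (an approximation, not the linked sync.go code or the actual PR): keep the last attempt's error in a closure, let a real error from the ConditionFunc terminate and surface immediately, and only fall back to lastErr when the backoff itself times out. tryDrain and the backoff values are stand-ins.

```go
// Sketch of the "check for ErrWaitTimeout, otherwise return the real error"
// pattern around an exponential backoff. Stand-in names throughout.
package main

import (
	"fmt"
	"time"

	"github.com/pkg/errors"
	"k8s.io/apimachinery/pkg/util/wait"
)

func tryDrain() error {
	// Stand-in for drain.Drain(...); always fails so the timeout path is exercised.
	return fmt.Errorf("pods still terminating")
}

func drainWithRetries() error {
	backoff := wait.Backoff{Duration: 10 * time.Millisecond, Factor: 2, Steps: 3}

	var lastErr error
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := tryDrain(); err != nil {
			lastErr = err
			return false, nil // keep retrying
		}
		return true, nil
	})
	if err != nil {
		// Distinguish "we ran out of retries" from a genuine error returned by
		// the ConditionFunc, so the latter is never silently replaced by lastErr.
		if err == wait.ErrWaitTimeout {
			return errors.Wrapf(lastErr, "failed to drain node (%d tries)", backoff.Steps)
		}
		return err
	}
	return nil
}

func main() {
	fmt.Println(drainWithRetries())
}
```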

@runcom

runcom Feb 11, 2019

Member

this is not blocking of course; we control the ConditionFunc here today, but if someone jumps in and adds a return false, err, we're going to miss it. Just suggesting we keep the same pattern.

@runcom

Member

runcom commented Feb 11, 2019

just a comment which is really a nit, since we control the ConditionFunc

/lgtm

```go
	err := drain.Drain(dn.kubeClient, []*corev1.Node{node}, &drain.DrainOptions{
		DeleteLocalData:    true,
		Force:              true,
		GracePeriodSeconds: 600,
		IgnoreDaemonsets:   true,
	})
	if err != nil {
		glog.Infof("Draining failed with: %v; retrying...", err)
```

@jlebon

jlebon Feb 11, 2019

Member

Why not keep the logging here to show some progress if we're retrying?
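For context, keeping that per-attempt log means it lives inside the ConditionFunc, so every retry still prints progress even though the error is wrapped only once at the end. A tiny illustrative sketch, with a stand-in failure instead of the real drain call and fmt.Printf standing in for glog.Infof:

```go
// Sketch of where the progress log sits: inside the retried ConditionFunc.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func tryDrain() error {
	return fmt.Errorf("pods still terminating") // always fails, to show the retries
}

func main() {
	backoff := wait.Backoff{Duration: 5 * time.Millisecond, Factor: 2, Steps: 3}
	_ = wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := tryDrain(); err != nil {
			// The log this review asks to keep: each failed attempt stays visible.
			fmt.Printf("Draining failed with: %v; retrying...\n", err)
			return false, nil
		}
		return true, nil
	})
}
```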

@jlebon

jlebon Feb 12, 2019

Member

Follow up for this in #412!

@openshift-bot

openshift-bot commented Feb 12, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot

openshift-bot commented Feb 12, 2019

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit afcc5e2 into openshift:master Feb 12, 2019

6 checks passed

ci/prow/e2e-aws: Job succeeded.
ci/prow/e2e-aws-op: Job succeeded.
ci/prow/images: Job succeeded.
ci/prow/rhel-images: Job succeeded.
ci/prow/unit: Job succeeded.
tide: In merge pool.