
update: Retry Drain() multiple times #394

Merged: 1 commit merged into openshift:master on Feb 7, 2019

Conversation

jlebon (Member) commented Feb 7, 2019

Don't want to fail on transient errors.

Closes: #393
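
As a rough sketch of what this looks like in Go (the function name, the client/node parameters, and the backoff values below are illustrative, not the exact merged code; the Drain signature is the one visible in the vendored drain.go diff further down, and wait.ExponentialBackoff is the helper visible in the review context):

import (
	"fmt"
	"time"

	"github.com/golang/glog"
	drain "github.com/openshift/kubernetes-drain"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// drainNodeWithRetry keeps retrying Drain() on failure instead of treating the
// first error as fatal; only after the backoff is exhausted does the error
// propagate (which is what would eventually mark the node degraded).
func drainNodeWithRetry(client kubernetes.Interface, node *corev1.Node) error {
	backoff := wait.Backoff{
		Steps:    5,                // illustrative values, not the merged ones
		Duration: 10 * time.Second, // roughly matches the ~10s gap in the log below
		Factor:   2,
	}
	var lastErr error
	if err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		lastErr = drain.Drain(client, []*corev1.Node{node}, &drain.DrainOptions{
			Force:              true,
			GracePeriodSeconds: 600,
			IgnoreDaemonsets:   true,
		})
		if lastErr != nil {
			glog.Infof("Draining failed with: %v; retrying...", lastErr)
			return false, nil // treat as transient: try again on the next backoff step
		}
		return true, nil // drained successfully
	}); err != nil {
		return fmt.Errorf("failed to drain node (last error: %v): %v", lastErr, err)
	}
	glog.Info("Node successfully drained")
	return nil
}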

openshift-ci-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files) label on Feb 7, 2019
openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Feb 7, 2019
jlebon (Member, Author) commented Feb 7, 2019

(Not tested yet).

Code context from the review (the drain call in update.go):

Force: true,
GracePeriodSeconds: 600,
IgnoreDaemonsets: true,
err = wait.ExponentialBackoff(wait.Backoff{
Member:
If this errs, it will still go degraded, right?

We need a follow-on item that ensures failure to talk to the k8s API never results in the machine going degraded.

After all, you can only report degraded by talking to the apiserver.

Member Author:

> We need a follow-on item that ensures failure to talk to the k8s API never results in the machine going degraded.

Hmm, are you suggesting retrying forever?

Member:

I guess in theory we could try to handle the case where the desiredConfig changes back to the current value. Basically, go back to the main loop. We'd probably have inner and outer loops to retry operations a few times before bouncing back to the top and seeing if any state changed.

The problem with this, though, is that in the current design we've already written the updated files, so we're somewhat committed... we'd have to roll that back.

Member:

The problem I see now is that we don't distinguish between a permanent failure and a transient failure. Basically, all failures in many code paths result in permanent failure. Any interaction with k8s that fails needs to be treated as transient.
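
One hypothetical way to express that distinction later (a sketch only; nothing in this PR does this) would be to classify apiserver errors with the helpers in k8s.io/apimachinery/pkg/api/errors and treat only those as retryable:

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// isTransientAPIError is a hypothetical classifier: true means "retry",
// false means the caller may treat the failure as permanent.
func isTransientAPIError(err error) bool {
	return apierrors.IsServerTimeout(err) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsServiceUnavailable(err) ||
		apierrors.IsInternalError(err)
}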

Member Author:

We've been slowly working towards addressing this. As Colin hinted, it might also make sense to retry at a higher level.

Member Author:

OK, split this out into #395.

Second review thread on the same drain/backoff code in update.go:

Force: true,
GracePeriodSeconds: 600,
IgnoreDaemonsets: true,
err = wait.ExponentialBackoff(wait.Backoff{
Member:

lgtm, except I'm unsure if we should ever bail out of waiting

Member Author:

Hmm, meaning we should just go ahead with the reboot even if draining failed? Yeah, I think we can do that.
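
A hypothetical shape for that variant (again just a sketch, not what this PR merges) would be to log the exhausted backoff and carry on rather than returning the error; backoff and tryDrain below stand in for the backoff and condition function from the sketch near the top:

// Hypothetical variant only: if draining still fails after all retries,
// log it and proceed with the reboot instead of degrading the node.
if err := wait.ExponentialBackoff(backoff, tryDrain); err != nil {
	glog.Warningf("Drain did not complete after retries: %v; continuing with the update anyway", err)
}
// ...proceed to reboot...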

Commit message:
Don't want to fail on transient errors.

Closes: openshift#393
jlebon (Member, Author) commented Feb 7, 2019

OK, tested this now.
I ran it with:

diff --git a/vendor/github.com/openshift/kubernetes-drain/drain.go b/vendor/github.com/openshift/kubernetes-drain/drain.go
index a6e21ee..4cfe1e8 100644
--- a/vendor/github.com/openshift/kubernetes-drain/drain.go
+++ b/vendor/github.com/openshift/kubernetes-drain/drain.go
@@ -20,6 +20,7 @@ import (
        "errors"
        "fmt"
        "math"
+       "math/rand"
        "sort"
        "strings"
        "time"
@@ -154,6 +155,10 @@ func GetNodes(client typedcorev1.NodeInterface, nodes []string, selector string)
 //
 // ![Workflow](http://kubernetes.io/images/docs/kubectl_drain.svg)
 func Drain(client kubernetes.Interface, nodes []*corev1.Node, options *DrainOptions) (err error) {
+       rand.Seed(time.Now().UnixNano())
+       if rand.Intn(2) == 1 {
+               return errors.New("Failed to be lucky enough");
+       }
        nodeInterface := client.CoreV1().Nodes()
        for _, node := range nodes {
                if err := Cordon(nodeInterface, node, options.Logger); err != nil {

And got:

I0207 21:46:59.459197   28786 update.go:76] Update prepared; draining the node
I0207 21:46:59.463602   28786 update.go:97] Draining failed with: Failed to be lucky enough; retrying...
I0207 21:46:59.463689   28786 daemon.go:223] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"test1-worker-0-ldmf5", UID:"6fe666a4-2b1f-11e9-9db1-664f163f5f0f", APIVersion:"v1", ResourceVersion:"18994", FieldPath:""}): type: 'Normal' reason: 'Drain' Draining node to update config.
I0207 21:47:09.463929   28786 update.go:97] Draining failed with: Failed to be lucky enough; retrying...
I0207 21:47:17.377699   28786 daemon.go:338] Kubelet health running
I0207 21:47:17.378454   28786 daemon.go:365] Kubelet health ok
I0207 21:47:30.088947   28786 update.go:105] Node successfully drained
...

derekwaynecarr (Member)

I don't want to block this PR, but we need to seriously assess the role of degraded in a follow-on.

cgwalters (Member)

/lgtm

openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label on Feb 7, 2019
openshift-merge-robot merged commit 5335dae into openshift:master on Feb 7, 2019
jlebon deleted the pr/master branch on May 1, 2023