
update: Retry Drain() multiple times #394

Merged: 1 commit merged into openshift:master on Feb 7, 2019

Conversation

jlebon (Member) commented Feb 7, 2019

Don't want to fail on transient errors.

Closes: #393
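
As a rough sketch of what this looks like in Go (the function name, the client/node parameters, and the backoff values below are illustrative, not the exact merged code; the Drain signature is the one visible in the vendored drain.go diff further down, and wait.ExponentialBackoff is the helper visible in the review context):

import (
	"fmt"
	"time"

	"github.com/golang/glog"
	drain "github.com/openshift/kubernetes-drain"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// drainNodeWithRetry keeps retrying Drain() on failure instead of treating the
// first error as fatal; only after the backoff is exhausted does the error
// propagate (which is what would eventually mark the node degraded).
func drainNodeWithRetry(client kubernetes.Interface, node *corev1.Node) error {
	backoff := wait.Backoff{
		Steps:    5,                // illustrative values, not the merged ones
		Duration: 10 * time.Second, // roughly matches the ~10s gap in the log below
		Factor:   2,
	}
	var lastErr error
	if err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		lastErr = drain.Drain(client, []*corev1.Node{node}, &drain.DrainOptions{
			Force:              true,
			GracePeriodSeconds: 600,
			IgnoreDaemonsets:   true,
		})
		if lastErr != nil {
			glog.Infof("Draining failed with: %v; retrying...", lastErr)
			return false, nil // treat as transient: try again on the next backoff step
		}
		return true, nil // drained successfully
	}); err != nil {
		return fmt.Errorf("failed to drain node (last error: %v): %v", lastErr, err)
	}
	glog.Info("Node successfully drained")
	return nil
}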

openshift-ci-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files) label on Feb 7, 2019
openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Feb 7, 2019
jlebon (Member, Author) commented Feb 7, 2019

(Not tested yet).

Code context from the review (the drain call in update.go):

Force: true,
GracePeriodSeconds: 600,
IgnoreDaemonsets: true,
err = wait.ExponentialBackoff(wait.Backoff{
Member:
If this errs, it will still go degraded, right?

We need a follow-on item that ensures failure to talk to the k8s API never results in the machine going degraded.

After all, you can only report degraded by talking to the apiserver.

Member Author:

> We need a follow-on item that ensures failure to talk to the k8s API never results in the machine going degraded.

Hmm, are you suggesting retrying forever?

Member:

I guess in theory we could try to handle the case where the desiredConfig changes back to the current value. Basically, go back to the main loop. We'd probably have inner and outer loops to retry operations a few times before bouncing back to the top and seeing if any state changed.

The problem with this, though, is that in the current design we've already written the updated files, so we're somewhat committed... we'd have to roll that back.

Member:

The problem I see now is that we don't distinguish between a permanent failure and a transient failure. Basically, all failures in many code paths result in permanent failure. Any interaction with k8s that fails needs to be treated as transient.
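
One hypothetical way to express that distinction later (a sketch only; nothing in this PR does this) would be to classify apiserver errors with the helpers in k8s.io/apimachinery/pkg/api/errors and treat only those as retryable:

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// isTransientAPIError is a hypothetical classifier: true means "retry",
// false means the caller may treat the failure as permanent.
func isTransientAPIError(err error) bool {
	return apierrors.IsServerTimeout(err) ||
		apierrors.IsTimeout(err) ||
		apierrors.IsTooManyRequests(err) ||
		apierrors.IsServiceUnavailable(err) ||
		apierrors.IsInternalError(err)
}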

Member Author:

We've been slowly working towards addressing this. As Colin hinted, it might also make sense to retry at a higher level.

Member Author:

OK, split this out into #395.

Second review thread on the same drain/backoff code in update.go:

Force: true,
GracePeriodSeconds: 600,
IgnoreDaemonsets: true,
err = wait.ExponentialBackoff(wait.Backoff{
Member:

lgtm, except I'm unsure if we should ever bail out of waiting

Member Author:

Hmm, meaning we should just go ahead with the reboot even if draining failed? Yeah, I think we can do that.
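
A hypothetical shape for that variant (again just a sketch, not what this PR merges) would be to log the exhausted backoff and carry on rather than returning the error; backoff and tryDrain below stand in for the backoff and condition function from the sketch near the top:

// Hypothetical variant only: if draining still fails after all retries,
// log it and proceed with the reboot instead of degrading the node.
if err := wait.ExponentialBackoff(backoff, tryDrain); err != nil {
	glog.Warningf("Drain did not complete after retries: %v; continuing with the update anyway", err)
}
// ...proceed to reboot...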

Commit message:
Don't want to fail on transient errors.

Closes: openshift#393
jlebon (Member, Author) commented Feb 7, 2019

OK, tested this now.
I ran it with:

diff --git a/vendor/github.com/openshift/kubernetes-drain/drain.go b/vendor/github.com/openshift/kubernetes-drain/drain.go
index a6e21ee..4cfe1e8 100644
--- a/vendor/github.com/openshift/kubernetes-drain/drain.go
+++ b/vendor/github.com/openshift/kubernetes-drain/drain.go
@@ -20,6 +20,7 @@ import (
        "errors"
        "fmt"
        "math"
+       "math/rand"
        "sort"
        "strings"
        "time"
@@ -154,6 +155,10 @@ func GetNodes(client typedcorev1.NodeInterface, nodes []string, selector string)
 //
 // ![Workflow](http://kubernetes.io/images/docs/kubectl_drain.svg)
 func Drain(client kubernetes.Interface, nodes []*corev1.Node, options *DrainOptions) (err error) {
+       rand.Seed(time.Now().UnixNano())
+       if rand.Intn(2) == 1 {
+               return errors.New("Failed to be lucky enough");
+       }
        nodeInterface := client.CoreV1().Nodes()
        for _, node := range nodes {
                if err := Cordon(nodeInterface, node, options.Logger); err != nil {

And got:

I0207 21:46:59.459197   28786 update.go:76] Update prepared; draining the node
I0207 21:46:59.463602   28786 update.go:97] Draining failed with: Failed to be lucky enough; retrying...
I0207 21:46:59.463689   28786 daemon.go:223] Event(v1.ObjectReference{Kind:"Node", Namespace:"", Name:"test1-worker-0-ldmf5", UID:"6fe666a4-2b1f-11e9-9db1-664f163f5f0f", APIVersion:"v1", ResourceVersion:"18994", FieldPath:""}): type: 'Normal' reason: 'Drain' Draining node to update config.
I0207 21:47:09.463929   28786 update.go:97] Draining failed with: Failed to be lucky enough; retrying...
I0207 21:47:17.377699   28786 daemon.go:338] Kubelet health running
I0207 21:47:17.378454   28786 daemon.go:365] Kubelet health ok
I0207 21:47:30.088947   28786 update.go:105] Node successfully drained
...

derekwaynecarr (Member)

I don't want to block this PR, but we need to seriously assess the role of degraded in a follow-on.

cgwalters (Member)

/lgtm

openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label on Feb 7, 2019
openshift-merge-robot merged commit 5335dae into openshift:master on Feb 7, 2019
jlebon deleted the pr/master branch on May 1, 2023