update: Retry Drain() multiple times #394
Conversation
(Not tested yet).
Force: true,
GracePeriodSeconds: 600,
IgnoreDaemonsets: true,
err = wait.ExponentialBackoff(wait.Backoff{
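(For context, the change wraps the drain call in wait.ExponentialBackoff from k8s.io/apimachinery. A minimal sketch of that shape — the backoff values and the drainNode helper are made up for illustration, not the PR's actual code:)

```go
package main

import (
	"log"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// drainNode is a hypothetical stand-in for the vendored drain.Drain call.
func drainNode() error { return nil }

func main() {
	backoff := wait.Backoff{
		Duration: 10 * time.Second, // initial delay
		Factor:   2,                // double the delay each step
		Steps:    5,                // give up after 5 attempts
	}
	err := wait.ExponentialBackoff(backoff, func() (bool, error) {
		if derr := drainNode(); derr != nil {
			log.Printf("drain failed, will retry: %v", derr)
			return false, nil // treat as transient and retry
		}
		return true, nil // drained; stop retrying
	})
	if err != nil {
		// wait.ErrWaitTimeout after Steps exhausted attempts
		log.Printf("drain did not succeed: %v", err)
	}
}
```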
If this errs, it will still go degraded, right?

We need a follow-on item that ensures that failure to talk to the k8s API never results in degrading the machine. After all, you can only report degraded by talking to the apiserver.
> We need a follow-on item that ensures that failure to talk to the k8s API never results in degrading the machine.

Hmm, are you suggesting retrying forever?
I guess in theory we could try to handle the case where the desiredConfig changes back to the current value: basically, go back to the mainloop. We'd probably have inner and outer loops, retrying operations a few times before bouncing back to the top to see whether any state changed.

The problem with this, though, is that in the current design we've already written the updated files, so we're somewhat committed; we'd have to roll that back.
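(Roughly the shape being described, with every name here hypothetical:)

```go
package main

import "fmt"

// Hypothetical stand-ins for the operator's real state handling.
var current = "A"

func fetchDesiredConfig() (string, error) { return "B", nil }
func applyUpdate(cfg string) error        { return nil }
func rollbackFiles()                      {}

func sync() {
	for { // outer loop: re-check state from the top
		desired, err := fetchDesiredConfig()
		if err != nil {
			continue // transient: bounce back to the mainloop
		}
		if desired == current {
			return // desiredConfig changed back; nothing left to do
		}
		// Inner loop: retry the operation a few times before
		// going back to the top to see if any state changed.
		var opErr error
		for attempt := 0; attempt < 3; attempt++ {
			if opErr = applyUpdate(desired); opErr == nil {
				break
			}
		}
		if opErr != nil {
			rollbackFiles() // files were already written, so undo first
			continue
		}
		current = desired
		return
	}
}

func main() {
	sync()
	fmt.Println("synced to", current)
}
```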
The problem I see now is that we don't distinguish between a permanent failure and a transient failure; in many code paths, basically every failure ends up treated as permanent. Any interaction with k8s that fails needs to be treated as transient.
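(One way to make that distinction explicit is to classify errors before deciding whether to retry. A sketch using the apierrors helpers from k8s.io/apimachinery — the exact set of cases to treat as transient is an assumption:)

```go
package main

import (
	"errors"
	"fmt"
	"net"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// isTransient reports whether an error looks retryable rather than
// permanent. A sketch; real classification would need more cases.
func isTransient(err error) bool {
	switch {
	case apierrors.IsServerTimeout(err), apierrors.IsTimeout(err),
		apierrors.IsTooManyRequests(err), apierrors.IsServiceUnavailable(err):
		return true // apiserver hiccups: retry
	}
	var nerr net.Error
	if errors.As(err, &nerr) {
		return true // network-level failure reaching the apiserver
	}
	return false // everything else is treated as permanent
}

func main() {
	fmt.Println(isTransient(errors.New("bad config"))) // false
}
```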
We've been slowly working towards addressing this. Like Colin hinted, it might make sense to retry at a higher level too.
OK, split this out into #395.
Force: true,
GracePeriodSeconds: 600,
IgnoreDaemonsets: true,
err = wait.ExponentialBackoff(wait.Backoff{
lgtm, except I'm unsure if we should ever bail out of waiting.
Hmm, meaning we should just go ahead with the reboot even if draining failed? Yeah, I think we can do that.
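(i.e., something like this at the call site — drainWithBackoff and reboot are hypothetical stand-ins for the real steps:)

```go
package main

import (
	"errors"
	"log"
)

// Hypothetical stand-ins for the retried drain and the reboot step.
func drainWithBackoff() error { return errors.New("pods would not evict") }
func reboot()                 { log.Println("rebooting") }

func main() {
	// Don't let a failed drain wedge the update: log it and reboot anyway.
	if err := drainWithBackoff(); err != nil {
		log.Printf("drain never succeeded, rebooting anyway: %v", err)
	}
	reboot()
}
```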
Don't want to fail on transient errors. Closes: openshift#393
OK, tested this now.

diff --git a/vendor/github.com/openshift/kubernetes-drain/drain.go b/vendor/github.com/openshift/kubernetes-drain/drain.go
index a6e21ee..4cfe1e8 100644
--- a/vendor/github.com/openshift/kubernetes-drain/drain.go
+++ b/vendor/github.com/openshift/kubernetes-drain/drain.go
@@ -20,6 +20,7 @@ import (
 	"errors"
 	"fmt"
 	"math"
+	"math/rand"
 	"sort"
 	"strings"
 	"time"
@@ -154,6 +155,10 @@ func GetNodes(client typedcorev1.NodeInterface, nodes []string, selector string)
 //
 // ![Workflow](http://kubernetes.io/images/docs/kubectl_drain.svg)
 func Drain(client kubernetes.Interface, nodes []*corev1.Node, options *DrainOptions) (err error) {
+	rand.Seed(time.Now().UnixNano())
+	if rand.Intn(2) == 1 {
+		return errors.New("Failed to be lucky enough")
+	}
 	nodeInterface := client.CoreV1().Nodes()
 	for _, node := range nodes {
 		if err := Cordon(nodeInterface, node, options.Logger); err != nil {

And got:
I don't want to block this PR, but we need to seriously assess the role of degraded in a follow-on.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, jlebon

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.