Validate cluster N times in rolling-update #8868
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: zetaab. The full list of commands accepted by this bot can be found here. The pull request process is described here.
// TODO should we expose this to the UI?
ValidateTickDuration:    30 * time.Second,
ValidateSuccessDuration: 10 * time.Second,
This was not exposed to the CLI, so as I see it, it's easy to delete.
Could we get details on why things are flapping? Did the kube-apiserver pod not exist when the cluster validated earlier? Do we need to explicitly check for it like we now do for controller-manager?
In my opinion, checking individual pods is maybe not the best option, because the kube-system namespace can contain any number of pods that should be checked. That is why generic retry logic is better than checking individual pods.
Force-pushed from 48b4b95 to 7e35133.
Force-pushed from 2e33b38 to 11eaacd.
@hakman fixed
Cool. Thanks @zetaab.
/retest
As I read the code, only the first instance group with unready nodes gets the effect of the new flag. Subsequent instance groups will continue after a single successful validation, a reduction from the previous behavior.
ValidateCount int32

// ValidateSucceeded is the number of times cluster validation has already succeeded
ValidateSucceeded int32
Why is this public?
klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
time.Sleep(c.ValidateSuccessDuration)
result, err = c.ClusterValidator.Validate()
if err == nil && len(result.Failures) == 0 && c.ValidateCount > 0 {
Why the c.ValidateCount > 0 check? It seems unnecessary.
@@ -430,10 +430,12 @@ func (c *RollingUpdateCluster) validateClusterWithDuration(duration time.Duratio
func (c *RollingUpdateCluster) tryValidateCluster(duration time.Duration) bool {
	result, err := c.ClusterValidator.Validate()

	if err == nil && len(result.Failures) == 0 && c.ValidateSuccessDuration > 0 {
		klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
		time.Sleep(c.ValidateSuccessDuration)
Why did you remove the sleep?
When we return false, it goes back to the previous function, which already has the sleep.
@johngmyers IMO your comment is not accurate; I tested this. All instance groups will loop through ValidateCount.
This is a follow-on to kubernetes#8868; I believe the intent of that was to expose the option to do more (or fewer) retries. We previously had a single retry to prevent flapping; this basically unifies the previous behaviour with the idea of making it configurable.

* validate-count=0 effectively turns off validation.
* validate-count=1 will do a single validation, without flapping detection.
* validate-count>=2 will require N successful validations in a row, waiting ValidateSuccessDuration in between.

A nice side-effect of this is that the tests now explicitly specify ValidateCount=1 instead of setting ValidateSuccessDuration=0, which had the side effect of doing the equivalent of ValidateCount=1.
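The validate-count semantics described in that commit message can be sketched as a simple loop. This is a hypothetical simplification, not the actual kops implementation: the `validator` type and the parameter names are assumptions based on the quoted diffs.

```go
package main

import (
	"fmt"
	"time"
)

// validator is a stand-in for the kops ClusterValidator: it returns the
// number of validation failures, or an error if validation could not run.
type validator func() (int, error)

// tryValidateCluster requires validateCount consecutive successful
// validations, sleeping validateSuccessDuration between attempts.
// validateCount=0 skips validation entirely; validateCount=1 validates
// once with no flap detection.
func tryValidateCluster(validate validator, validateCount int, validateSuccessDuration time.Duration) bool {
	if validateCount == 0 {
		return true // validation disabled
	}
	succeeded := 0
	for succeeded < validateCount {
		failures, err := validate()
		if err != nil || failures > 0 {
			return false // any failure means the caller must start over
		}
		succeeded++
		if succeeded < validateCount {
			// Revalidate after a pause to make sure the cluster does not flap.
			time.Sleep(validateSuccessDuration)
		}
	}
	return true
}

func main() {
	calls := 0
	flaky := func() (int, error) {
		calls++
		if calls == 2 {
			return 1, nil // the second validation sees a failure: a flap
		}
		return 0, nil
	}
	ok := tryValidateCluster(flaky, 2, time.Millisecond)
	fmt.Println(ok) // the flap is caught, so this prints "false"
}
```

With validate-count>=2, a transient failure between the first and Nth validation resets nothing silently: the whole attempt fails, which is exactly the flap detection the original single retry provided.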
We are still seeing many rolling-update errors caused by cluster validation after an instance roll.
Example:
Disclaimer: we run e2e tests quite heavily against kops, doing something like 20-50 rolling updates per day across clusters.
cc @hakman @johngmyers, could you take a look?