Validate cluster N times in rolling-update #8868

zetaab · 2020-04-07T21:37:09Z

We are still seeing many rolling-update failures where cluster validation fails shortly after an instance has been rolled, even though the cluster had just validated.

Example:

I0407 16:41:39.366581 85 instancegroups.go:255] Cluster validated; revalidating in 10s to make sure it does not flap.
I0407 16:42:00.200357 85 instancegroups.go:271] Cluster validated.
master not healthy after update, stopping rolling-update: "cluster \"updateospr-f95a75.k8s.local\" did not pass validation: kube-system pod \"kube-apiserver-master-zone-1-1-1-updateospr-f95a75-k8s-local\" is pending"
I0407 16:42:17.290824 1 batch.go:902] error running kops rolling-update cluster --bastion-interval 2m --instance-group bastions,master-zone-1-1,master-zone-2-1,master-zone-3-1,nodes-z1,nodes-z2,nodes-z3 --validation-timeout 20m --yes

Disclaimer: we run e2e tests against kops quite heavily, roughly 20-50 rolling updates per day across our clusters.

cc @hakman @johngmyers could you take a look at this?

k8s-ci-robot · 2020-04-07T21:37:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: zetaab

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [zetaab]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

zetaab · 2020-04-07T21:37:55Z

cmd/kops/rollingupdatecluster.go

 		// TODO should we expose this to the UI?
-		ValidateTickDuration:    30 * time.Second,
-		ValidateSuccessDuration: 10 * time.Second,


This was never exposed in the CLI, so as I see it, it's easy to delete.

johngmyers · 2020-04-07T23:02:59Z

Could we get details on why things are flapping? Did the kube-apiserver pod not exist when the cluster validated earlier? Do we need to explicitly check for it like we now do for controller-manager?

zetaab · 2020-04-08T04:52:07Z

In my opinion, checking individual pods is not the best option, because the kube-system namespace can contain any number of pods that would need to be checked. That is why generic retry logic is better than checking an individual pod.

zetaab · 2020-04-08T10:56:49Z

@hakman fixed

hakman · 2020-04-08T10:59:29Z

Cool. Thanks @zetaab.
/lgtm

hakman · 2020-04-08T12:08:12Z

/retest

johngmyers

As I read the code, only the first instancegroup with unready nodes will get the effect of the new flag. Subsequent instancegroups will continue after a single successful validation, a reduction from the previous behavior.
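To make the concern above concrete, here is a minimal sketch of the pattern in question. It is not kops code: the `updater` type, its fields, and `passesNeeded` are hypothetical names invented for illustration. It only shows how a success counter that lives on a shared struct behaves differently depending on whether it is reset for each instance group. Whether the merged code matches either variant is disputed later in this thread.

```go
package main

import "fmt"

// updater is hypothetical; it is NOT the kops RollingUpdateCluster type.
type updater struct {
	validateCount int // required number of successful validations
	succeeded     int // successful validations recorded so far
}

// validated records one successful validation and reports whether
// the required count has been reached.
func (u *updater) validated() bool {
	u.succeeded++
	return u.succeeded >= u.validateCount
}

// passesNeeded returns how many successful validations one instance
// group needs before the update may proceed.
func passesNeeded(u *updater, resetPerGroup bool) int {
	if resetPerGroup {
		u.succeeded = 0
	}
	n := 1
	for !u.validated() {
		n++
	}
	return n
}

func main() {
	for _, reset := range []bool{false, true} {
		u := &updater{validateCount: 3}
		first := passesNeeded(u, reset)
		second := passesNeeded(u, reset)
		fmt.Printf("reset=%v first=%d second=%d\n", reset, first, second)
	}
}
```

Without a reset, the second group in this sketch proceeds after a single success; with a reset, every group needs the full count.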

johngmyers · 2020-04-10T03:40:10Z

pkg/instancegroups/rollingupdate.go

+	ValidateCount int32
+
+	// ValidateSucceeded is the amount of times that a cluster validate is succeeded already
+	ValidateSucceeded int32


Why is this public?

johngmyers · 2020-04-10T03:46:18Z

pkg/instancegroups/instancegroups.go

-		klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
-		time.Sleep(c.ValidateSuccessDuration)
-		result, err = c.ClusterValidator.Validate()
+	if err == nil && len(result.Failures) == 0 && c.ValidateCount > 0 {


Why the c.ValidateCount > 0 check? It seems unnecessary.

johngmyers · 2020-04-10T03:47:07Z

pkg/instancegroups/instancegroups.go

@@ -430,10 +430,12 @@ func (c *RollingUpdateCluster) validateClusterWithDuration(duration time.Duratio
 func (c *RollingUpdateCluster) tryValidateCluster(duration time.Duration) bool {
 	result, err := c.ClusterValidator.Validate()

-	if err == nil && len(result.Failures) == 0 && c.ValidateSuccessDuration > 0 {
-		klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
-		time.Sleep(c.ValidateSuccessDuration)


Why did you remove the sleep?

When we return false, control goes back to the calling function, which already sleeps between attempts.

zetaab · 2020-04-10T19:30:08Z

@johngmyers IMO your comment is not correct; I tested this. For every instance group it loops through the full validate count.

This is a follow-on to kubernetes#8868; I believe the intent of that was to expose the option to do more (or fewer) retries. We previously had a single retry to prevent flapping; this basically unifies the previous behaviour with the idea of making it configurable.

* validate-count=0 effectively turns off validation.
* validate-count=1 will do a single validation, without flapping detection.
* validate-count>=2 will require N successful validations in a row, waiting ValidateSuccessDuration in between.

A nice side-effect of this is that the tests now explicitly specify ValidateCount=1 instead of setting ValidateSuccessDuration=0, which had the side effect of doing the equivalent to ValidateCount=1.
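The three cases described above can be sketched roughly like this. It is a simplified illustration under stated assumptions, not the code from either PR: `validator`, `validateNTimes`, and the `maxAttempts` cap are hypothetical, and for brevity the same wait is used after a failure as between successes.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// validator is a hypothetical stand-in for the cluster validator;
// it reports whether one validation pass succeeded.
type validator func() (bool, error)

// validateNTimes returns nil once count validations in a row succeed.
// Any failure resets the streak. count == 0 skips validation entirely.
// It gives up after maxAttempts passes in total.
func validateNTimes(v validator, count, maxAttempts int, wait time.Duration) error {
	if count == 0 {
		return nil // validation effectively turned off
	}
	streak := 0
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		ok, err := v()
		if err == nil && ok {
			streak++
			if streak >= count {
				return nil
			}
		} else {
			streak = 0 // flap detected: start counting again
		}
		time.Sleep(wait)
	}
	return errors.New("cluster did not validate the required number of times in a row")
}

func main() {
	// Simulated results: pass, fail (a flap), then three passes.
	results := []bool{true, false, true, true, true}
	calls := 0
	v := func() (bool, error) {
		ok := results[calls%len(results)]
		calls++
		return ok, nil
	}
	err := validateNTimes(v, 3, 10, time.Millisecond)
	fmt.Println(err, calls)
}
```

With count=1 the function returns on the first success, so no flapping detection takes place, which matches the second bullet.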

k8s-ci-robot added the do-not-merge/work-in-progress, cncf-cla: yes, size/M, and area/rolling-update labels Apr 7, 2020

k8s-ci-robot requested review from joshbranham and robinpercy April 7, 2020 21:37

k8s-ci-robot added the approved label Apr 7, 2020

zetaab force-pushed the feature/validateNtimes branch from 48b4b95 to 7e35133 Compare April 8, 2020 05:03

zetaab changed the title ~~WIP: validate cluster n times in rolling update~~ Validate cluster n times in rolling update Apr 8, 2020

k8s-ci-robot removed the do-not-merge/work-in-progress label Apr 8, 2020

zetaab changed the title ~~Validate cluster n times in rolling update~~ Validate cluster N times in rolling-update Apr 8, 2020

k8s-ci-robot added the area/documentation label Apr 8, 2020

hakman reviewed Apr 8, 2020

zetaab added 2 commits April 8, 2020 13:55

validate cluster n times in rolling update

e1e7979

validationtimes -> validationcount

11eaacd

zetaab force-pushed the feature/validateNtimes branch from 2e33b38 to 11eaacd Compare April 8, 2020 10:55

k8s-ci-robot assigned hakman Apr 8, 2020

k8s-ci-robot added the lgtm label Apr 8, 2020

k8s-ci-robot merged commit 1e11c25 into kubernetes:master Apr 8, 2020

k8s-ci-robot added this to the v1.18 milestone Apr 8, 2020

zetaab deleted the feature/validateNtimes branch April 9, 2020 06:55

johngmyers reviewed Apr 10, 2020

justinsb mentioned this pull request Apr 17, 2020

Rolling-update validation harmonization #8931

Merged
