
promoting drain and validate by setting feature flag to true #3329

Merged

Conversation

@chrislovecnm (Contributor) commented Sep 2, 2017

I am unable to recreate #2407, and frankly, it may be an edge case. We could warn a user if their wait times are low, but that would be another PR.

This PR moves Drain and Validate functionality for rolling-updates into the default user experience, setting the Feature Flag to true.

Per feedback, I am using the node and master interval times for the validation.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 2, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 2, 2017
@gambol99 (Contributor) commented Sep 2, 2017

lgtm ...

@justinsb (Member) commented Sep 2, 2017

Drain and validate are still broken for me with low intervals.

I'll verify and post output.

@chrislovecnm (Contributor, Author)

@justinsb should we force reasonable intervals or tell the user to do cloud only? Drain and validate is not made for low intervals.

Doing an update in production is not something that is super fast, in my opinion.

@chrislovecnm (Contributor, Author) commented Sep 5, 2017

@justinsb I have completed some testing tonight.

  1. Testing with a 2m interval, which is a bit low; it typically takes 3-5 minutes for a node or master to start. This succeeds as expected: https://gist.github.com/chrislovecnm/9cf1d2644851aa04ed40dc23fce92c31
  2. Testing with a 2s interval, which is nuts. This fails as it should: we validate the cluster 8 extra times, and the rolling-update stops. This is expected and documented; see --validate-retries. Here are the results: https://gist.github.com/chrislovecnm/f5dd79d20724d54f787673563cc1ff75

We should warn if a user sets the interval under 3 minutes, but I am unable to recreate the problems you are encountering.

What applications do you have installed? Is that Prometheus I'm seeing?

Can we set up a time to work through this? I have done a lot of testing with this code, and I know other people are using it in production. How do we proceed?

@k8s-github-robot k8s-github-robot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 13, 2017
@justinsb (Member) commented Sep 14, 2017

So I run with a 2-second interval because I don't want to miss any edge cases. Compared to a 2m interval, it's like running 60x more tests!

We discussed this on slack:

  • The --node-interval and --master-interval flags should probably keep their existing meaning, which is the minimum interval to wait between cycling nodes. So if you want to do a slow cycle, you can set them to 1h and run it over the weekend.
  • I don't see a reason to expose the validation / poll interval. I think it's fine to keep this at 2 minutes if you want, though something more like 30 seconds is probably going to be faster. I agree it doesn't need to be 2 seconds - I use that only because it uncovers challenges.
  • I think you want a --validation-timeout setting - i.e. a total time. This lets us adopt more advanced timing strategies in the future, like watching for the instance to be started before starting a poll.

What do you think?
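
(A minimal, hypothetical Go sketch of the --validation-timeout idea discussed above: a fixed poll interval bounded by a total timeout budget. Names such as validateWithTimeout, validateOnce, and pollInterval are illustrative only and are not the code in this PR.)

package main

import (
	"errors"
	"fmt"
	"time"
)

// validateOnce stands in for a single validation pass (e.g. validation.ValidateCluster);
// it is a stub here so the sketch is self-contained.
func validateOnce() error { return errors.New("node not yet ready") }

// validateWithTimeout polls validateOnce every pollInterval until it succeeds
// or the total validationTimeout budget is exhausted.
func validateWithTimeout(validationTimeout, pollInterval time.Duration) error {
	deadline := time.After(validationTimeout)
	ticker := time.NewTicker(pollInterval)
	defer ticker.Stop()

	for {
		select {
		case <-deadline:
			// Total time budget exhausted.
			return fmt.Errorf("cluster did not validate within %v", validationTimeout)
		case <-ticker.C:
			// Poll again; succeed as soon as validation passes.
			if err := validateOnce(); err == nil {
				return nil
			}
		}
	}
}

func main() {
	// Small durations so the sketch finishes quickly when run.
	if err := validateWithTimeout(5*time.Second, time.Second); err != nil {
		fmt.Println(err)
	}
}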

@justinsb (Member)

Another option is to treat --master-interval and --node-interval as the maximum interval between cycling nodes, i.e. the validation timeout. This is nice in terms of compatibility, because you don't need any more flags, and we just proceed once we know the cluster is stable. But on the other hand, setting the minimum interval gives us a pod-disruption-budget style handling, where we say we don't want to cycle the cluster as fast as we can, because the applications need a little longer to be totally healthy (e.g. just because Cassandra is running doesn't mean it isn't doing a repair).

@blakebarnett

I forgot to create an issue with details, but randomly I have seen rolling-update choose to do the nodes first. I'm completely baffled by how this can happen based on the code, but it has definitely happened at least twice for me, once on a production cluster. If I hadn't been paying close attention it probably would have caused a major outage.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 24, 2017
export KOPS_FEATURE_FLAGS="+DrainAndValidateRollingUpdate"
# do not fail if the cluster does not validate
# wait 8 min to create new node, and at least 8 min
# to validate the cluster.
Review comment (Member):

What's going on with these indents? Not a merge blocker, but is this deliberate?

for i := 0; i <= rollingUpdateData.ValidateRetries; i++ {
// ValidateClusterWithDuration runs validation.ValidateCluster until either we get positive result or the timeout expires
func (n *CloudInstanceGroup) ValidateClusterWithDuration(rollingUpdateData *RollingUpdateCluster, instanceGroupList *api.InstanceGroupList, duration time.Duration) error {
// TODO should we expose this to the UI?
Review comment (Member):

Probably once a use-case is demonstrated for doing so, but until then, no :-)

select {
case <-timeout:
// Got a timeout fail with a timeout error
return fmt.Errorf("cluster did not validate within a duation of %q", duration)
Review comment (Member):

typo: duation

If rolling-update does not report that the cluster needs to be rolled you can force the cluster to be
rolled with the force flag. Rolling update drains and validates the cluster by default. A cluster is
deemed validated when all required nodes are running, and all pods in the kube-system namespace are operational.
When a node is deleted rolling-update sleeps the interval for the node type, and the tries for the same period
Review comment (Member):

typo: s/the/then

@justinsb (Member)

LGTM 🎉

return fmt.Errorf("cluster validation failed: %v", err)
func (n *CloudInstanceGroup) tryValidateCluster(rollingUpdateData *RollingUpdateCluster, instanceGroupList *api.InstanceGroupList, duration time.Duration, tickDuration time.Duration) bool {
if _, err := validation.ValidateCluster(rollingUpdateData.ClusterName, instanceGroupList, rollingUpdateData.K8sClient); err != nil {
glog.Infof("Cluster did not validate, will try again in %q util duration %q expires: %v.", tickDuration, duration, err)
Review comment (Member):

typo: util
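
(Read together, the fragments quoted above suggest an outer retry loop bounded by --validate-retries, with each attempt polling the cluster until a per-attempt duration expires. The following is a rough, self-contained Go sketch of that shape; the names and wiring are illustrative and do not reproduce the merged implementation.)

package main

import (
	"errors"
	"fmt"
	"time"
)

// validateCluster is a stub standing in for validation.ValidateCluster.
func validateCluster() error { return errors.New("kube-system pod not ready") }

// tryValidateCluster polls validateCluster every tickDuration until it either
// succeeds or the per-attempt duration elapses, mirroring the signature quoted above.
func tryValidateCluster(duration, tickDuration time.Duration) bool {
	deadline := time.Now().Add(duration)
	for time.Now().Before(deadline) {
		if err := validateCluster(); err == nil {
			return true
		}
		time.Sleep(tickDuration)
	}
	return false
}

func main() {
	const validateRetries = 2 // stands in for rollingUpdateData.ValidateRetries
	for i := 0; i <= validateRetries; i++ {
		if tryValidateCluster(3*time.Second, time.Second) {
			fmt.Println("cluster validated")
			return
		}
		fmt.Printf("validation attempt %d did not succeed\n", i+1)
	}
	fmt.Println("cluster validation failed")
}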

@justinsb (Member)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2017
@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 24, 2017
@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@chrislovecnm (Contributor, Author)

I will file an issue for the typos, and the indenting is removed when the markdown is generated.

@k8s-github-robot

/lgtm cancel //PR changed after LGTM, removing LGTM. @chrislovecnm @justinsb

@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2017
@justinsb (Member)

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 24, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: justinsb

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot

Automatic merge from submit-queue.

@k8s-github-robot k8s-github-robot merged commit ba42020 into kubernetes:master Sep 24, 2017
@chrislovecnm chrislovecnm deleted the promote-drain-validate branch September 24, 2017 04:42