Option to surge during rolling update #8313

johngmyers · 2020-01-11T06:59:42Z

Adds a MaxSurge field to the cluster/instancegroup RollingUpdate struct.

/kind feature
/area rolling-update
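
For readers skimming the thread, here is a minimal sketch of how a cluster-level default and an instance-group override for MaxSurge might resolve. The `RollingUpdate` type below is a simplified stand-in, not the struct from this PR: the plain `*int` field, the `resolveMaxSurge` helper, and the fallback to 0 when nothing is set are all assumptions made for illustration.

```go
package main

import "fmt"

// RollingUpdate is a simplified, hypothetical stand-in for the struct this PR
// extends. A plain *int keeps the sketch dependency-free; the real field type
// may differ.
type RollingUpdate struct {
	MaxSurge *int
}

// resolveMaxSurge returns the instance group's MaxSurge when it is set,
// otherwise the cluster-level default, otherwise 0 (assumed default).
func resolveMaxSurge(cluster, group *RollingUpdate) int {
	if group != nil && group.MaxSurge != nil {
		return *group.MaxSurge
	}
	if cluster != nil && cluster.MaxSurge != nil {
		return *cluster.MaxSurge
	}
	return 0
}

// intPtr is a small helper for building pointer-valued fields.
func intPtr(v int) *int { return &v }

func main() {
	cluster := &RollingUpdate{MaxSurge: intPtr(2)}
	override := &RollingUpdate{MaxSurge: intPtr(0)}

	fmt.Println("inherits cluster default:", resolveMaxSurge(cluster, nil))
	fmt.Println("group override wins:", resolveMaxSurge(cluster, override))
}
```

A nil pointer means "not set", which is what lets an instance group explicitly override a nonzero cluster default with 0.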

johngmyers · 2020-01-11T07:01:01Z

WIP because it depends on #8271 landing first.

johngmyers · 2020-01-11T17:49:39Z

/retest

johngmyers · 2020-01-26T00:07:11Z

/test pull-kops-bazel-test

johngmyers · 2020-01-26T01:04:32Z

/retest

johngmyers · 2020-01-26T03:18:55Z

/retest

hakman · 2020-01-26T04:50:05Z

/retest

johngmyers · 2020-02-15T21:17:41Z

/assign @justinsb

justinsb · 2020-02-28T14:29:47Z

pkg/instancegroups/instancegroups.go


 	if maxConcurrency == 0 {
 		klog.Infof("Rolling updates for InstanceGroup %s are disabled", r.CloudGroup.InstanceGroup.Name)
 		return nil
 	}

+	if r.CloudGroup.InstanceGroup.Spec.Role == api.InstanceGroupRoleMaster && maxSurge != 0 {
+		// Masters are incapable of surging because they rely on registering themselves through


Not sure how chatty it would be if users don't set any values, but a warning might be great here if we're going to override user settings.

johngmyers · 2020-02-29 (edited)

I believe the MaxSurge field's documentation should say: Does not have any effect on instance groups with role "master". (or "...does not apply to...")

I could add an API validation that returns a field.Forbidden if MaxSurge is explicitly set to a non-zero value on an InstanceGroup with role "Master", because that just doesn't make sense.

For the case where the user set a nonzero default MaxSurge on the Cluster but didn't override it at the InstanceGroup level, I believe the best user experience would be for kops to silently do the right thing. We shouldn't make users explicitly override the value on each of their master InstanceGroups just to get rid of log noise.

justinsb · 2020-03-04

Ah - I missed the override point. I agree with either/both of the suggestions; I also don't consider them a blocker to merging this.
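
The two behaviours discussed in this thread can be sketched roughly as follows: reject an explicit non-zero MaxSurge on a master instance group, but quietly treat an inherited cluster default as zero for masters. This is only an illustration. The names `validateGroupMaxSurge`, `effectiveMaxSurge`, `Role`, and `ErrSurgeForbidden` are hypothetical, and a plain Go error stands in for the `field.Forbidden` mentioned above.

```go
package main

import (
	"errors"
	"fmt"
)

// Role is a hypothetical, simplified instance group role.
type Role string

const (
	RoleMaster Role = "Master"
	RoleNode   Role = "Node"
)

// ErrSurgeForbidden stands in for the field.Forbidden validation error
// suggested in the thread.
var ErrSurgeForbidden = errors.New("maxSurge: Forbidden: must be zero on instance groups with role Master")

// validateGroupMaxSurge rejects a MaxSurge that was explicitly set to a
// non-zero value on a master instance group. A nil pointer means the
// instance group did not set the field.
func validateGroupMaxSurge(role Role, groupMaxSurge *int) error {
	if role == RoleMaster && groupMaxSurge != nil && *groupMaxSurge != 0 {
		return ErrSurgeForbidden
	}
	return nil
}

// effectiveMaxSurge silently ignores surging for masters, so a nonzero
// cluster-level default produces neither an error nor log noise.
func effectiveMaxSurge(role Role, resolved int) int {
	if role == RoleMaster {
		return 0
	}
	return resolved
}

func main() {
	three := 3
	fmt.Println("explicit on master:", validateGroupMaxSurge(RoleMaster, &three))
	fmt.Println("inherited on master:", validateGroupMaxSurge(RoleMaster, nil), effectiveMaxSurge(RoleMaster, 3))
	fmt.Println("node group:", validateGroupMaxSurge(RoleNode, &three), effectiveMaxSurge(RoleNode, 3))
}
```

Splitting validation from the effective-value calculation keeps the explicit misconfiguration loud while leaving the inherited case silent, which matches the user experience argued for above.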

justinsb · 2020-02-28T14:36:10Z

I think this looks good - it's a clever idea to track the surge state using a tag on the infrastructure level - that's where we basically came unstuck previously.

I'm going to try to think a little bit more about this today (and we should probably talk about it during office hours), but I'm inclined to merge ... particularly if I can satisfy myself it works on non-AWS also :-)

johngmyers · 2020-02-28T19:23:54Z

Other cloud providers have the option of either tagging/detaching like AWS does or temporarily increasing the desired size of the underlying ASG. When I designed an interface that is agnostic to this implementation choice yet handles all the failure cases, it ended up looking the same as the detach interface in this PR.
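
To make the point about an implementation-agnostic interface concrete, here is a rough sketch in which one detach-style interface is backed either by tagging and detaching or by raising a desired size. Everything here is an in-memory model written for illustration: the `Detacher` interface, both group types, and `detachForSurge` are hypothetical and are not the `fi.Cloud` API or any cloud SDK.

```go
package main

import "fmt"

// Detacher is a hypothetical, minimal version of a detach-style interface.
type Detacher interface {
	DetachInstance(id string) error
}

// tagDetachGroup models the tag-and-detach approach: the instance leaves the
// group and is marked (standing in for an infrastructure-level tag) so that an
// interrupted update can find it again later.
type tagDetachGroup struct {
	members  map[string]bool
	detached map[string]bool
}

func (g *tagDetachGroup) DetachInstance(id string) error {
	if !g.members[id] {
		return fmt.Errorf("instance %q is not in the group", id)
	}
	delete(g.members, id)
	g.detached[id] = true
	return nil
}

// resizeGroup models the alternative: the instance stays in the group, the
// desired size is raised by one, and the instance is recorded as surged.
type resizeGroup struct {
	members map[string]bool
	surged  map[string]bool
	desired int
}

func (g *resizeGroup) DetachInstance(id string) error {
	if !g.members[id] {
		return fmt.Errorf("instance %q is not in the group", id)
	}
	if g.surged[id] {
		return nil // already accounted for; keep the call idempotent
	}
	g.surged[id] = true
	g.desired++
	return nil
}

// detachForSurge detaches at most maxSurge instances and returns the IDs it
// handled. The caller does not need to know which strategy is in use.
func detachForSurge(c Detacher, ids []string, maxSurge int) ([]string, error) {
	var done []string
	for _, id := range ids {
		if len(done) >= maxSurge {
			break
		}
		if err := c.DetachInstance(id); err != nil {
			return done, err
		}
		done = append(done, id)
	}
	return done, nil
}

func main() {
	aws := &tagDetachGroup{
		members:  map[string]bool{"i-1": true, "i-2": true},
		detached: map[string]bool{},
	}
	done, err := detachForSurge(aws, []string{"i-1", "i-2"}, 1)
	fmt.Println(done, err, len(aws.members), len(aws.detached))
}
```

Because the surge state lives with the instances rather than in the updater's memory, a later run can discover instances left over from a failed update, which is the recovery case covered by one of the commits in this PR.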

justinsb · 2020-03-04T01:27:37Z

Thanks @johngmyers - this is really great stuff

/approve
/lgtm

k8s-ci-robot · 2020-03-04T01:29:17Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johngmyers, justinsb

Needs approval from an approver in each of these files:

~~OWNERS~~ [justinsb]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

johngmyers · 2020-03-04T01:48:01Z

/retest

johngmyers · 2020-03-04T02:02:22Z

One of my added tests has a flake. I was able to reproduce it once in 100 runs. Will investigate.

johngmyers · 2020-03-04T02:04:42Z

/hold

johngmyers · 2020-03-04T04:54:50Z

/hold cancel

johngmyers · 2020-03-04T05:42:24Z

/retest

rifelpet · 2020-03-04T17:44:42Z

/lgtm

rifelpet · 2020-03-04T17:46:38Z

It'd be great to have some documentation for this feature as well, perhaps in docs/instance_groups.md?

johngmyers · 2020-03-04T18:26:38Z

@rifelpet see #8673

k8s-ci-robot requested review from drekle and robinpercy January 11, 2020 07:00

johngmyers force-pushed the surge branch from 998a06b to fdbac95 Compare January 11, 2020 19:31

johngmyers force-pushed the surge branch from fdbac95 to 4279bf4 Compare January 25, 2020 23:44

johngmyers added 3 commits January 27, 2020 20:15

Terminate AWS instances through EC2 instead of Autoscaling

640f5f5

Add fi.Cloud.DetachInstance()

cc5b6f4

Detached instances don't count against instancegroup minimums

be12d88

johngmyers force-pushed the surge branch from 4279bf4 to 441deeb Compare January 28, 2020 04:35

johngmyers added 7 commits January 27, 2020 20:44

Add MaxSurge setting to cluster and instancegroup

10f06c5

make apimachinery

c95a43c

make crds

b8e6650

Add MaxSurge to resolveSettings

4ddc58c

Implement MaxSurge happy path

cee662d

Implement recovery from previous failed surge rolling updates

ebfcf5d

Remove code made unnecessary by apimachinery validation

38b7219

johngmyers force-pushed the surge branch from 441deeb to 38b7219 Compare January 28, 2020 04:45

johngmyers changed the title ~~WIP Option to surge during rolling update~~ Option to surge during rolling update Jan 28, 2020

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 28, 2020

johngmyers added 2 commits February 17, 2020 09:17

Fix field name for api validation

53c362d

Merge branch 'master' into surge

9f9b98e

johngmyers force-pushed the surge branch from 2a0d88e to 9f9b98e Compare February 17, 2020 17:17

justinsb reviewed Feb 28, 2020

Address review comments

ed73726

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2020

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 4, 2020

Merge branch 'master' into surge

1b7c513

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2020

Fix flaky test

99100dc

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2020

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 4, 2020

k8s-ci-robot assigned rifelpet Mar 4, 2020

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 4, 2020

k8s-ci-robot merged commit a5dabf5 into kubernetes:master Mar 4, 2020

k8s-ci-robot added this to the v1.18 milestone Mar 4, 2020

johngmyers deleted the surge branch March 4, 2020 18:23

paalkr mentioned this pull request Sep 24, 2020

Surge during rolling update does not work for mixed instances Auto Scaling Groups #9983

Closed
