
Support tainting all nodes needing update during rolling update #8021

Merged: 6 commits merged into kubernetes:master from johngmyers:cordon on Jan 4, 2020

Conversation

@johngmyers (Member) commented Nov 27, 2019

Adds a per-instancegroup option to taint all nodes needing update near the start of a rolling update. The expectation is that this would only be enabled on instancegroups that have autoscaling enabled and whose workloads can tolerate waiting for scale-up if needed.

Extends the InstanceGroup API to support configuration of the rolling update strategy, and extends the Cluster API to support per-cluster defaults for the same settings. The configuration options are behind the new ConfigurableRollingUpdate feature flag, which defaults to off.
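For illustration, a minimal sketch of how a per-instancegroup setting could be resolved against a per-cluster default with a hard-coded fallback. Only the RollingUpdate struct and the resolveSettings name appear in the PR; the ClusterSpec and InstanceGroupSpec wrapper types and the resolution body below are assumptions for the sketch, so the real API types and logic may differ:

```go
package main

import "fmt"

// RollingUpdate mirrors the API type added by this PR.
type RollingUpdate struct {
	TaintAllNeedUpdate *bool `json:"taintAllNeedUpdate,omitempty"`
}

// Hypothetical wrappers standing in for the Cluster and InstanceGroup specs.
type ClusterSpec struct {
	RollingUpdate *RollingUpdate // cluster-wide default
}

type InstanceGroupSpec struct {
	RollingUpdate *RollingUpdate // per-instancegroup override
}

// resolveSettings prefers the instancegroup setting, then the cluster-wide
// default, then false.
func resolveSettings(cluster ClusterSpec, ig InstanceGroupSpec) RollingUpdate {
	resolved := RollingUpdate{}
	for _, src := range []*RollingUpdate{ig.RollingUpdate, cluster.RollingUpdate} {
		if resolved.TaintAllNeedUpdate == nil && src != nil {
			resolved.TaintAllNeedUpdate = src.TaintAllNeedUpdate
		}
	}
	if resolved.TaintAllNeedUpdate == nil {
		f := false
		resolved.TaintAllNeedUpdate = &f
	}
	return resolved
}

func main() {
	t := true
	cluster := ClusterSpec{RollingUpdate: &RollingUpdate{}}                        // cluster leaves the default unset
	ig := InstanceGroupSpec{RollingUpdate: &RollingUpdate{TaintAllNeedUpdate: &t}} // instancegroup opts in
	fmt.Println(*resolveSettings(cluster, ig).TaintAllNeedUpdate)                  // prints true
}
```

Returning a fully defaulted, non-nil value would let call sites dereference settings.TaintAllNeedUpdate directly, as the diff further down does.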

To limit the damage should the new instance specification produce instances that fail validation, in the case where there are no existing instances with the current spec the strategy will first cordon and update a single instance, waiting until its replacement validates before tainting the rest.

Fixes #7958

@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA) Nov 27, 2019
@k8s-ci-robot (Contributor)

Hi @johngmyers. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added the needs-ok-to-test (Indicates a PR that requires an org member to verify it is safe to test) and size/XL (Denotes a PR that changes 500-999 lines, ignoring generated files) labels Nov 27, 2019
@johngmyers (Member Author)

/area rolling-update

@johngmyers (Member Author)

/kind feature

@k8s-ci-robot added the kind/feature (Categorizes issue or PR as related to a new feature) and ok-to-test (Indicates a non-member PR verified by an org member that is safe to test) labels Nov 27, 2019
@zetaab (Member) left a comment:

/ok-to-test

@k8s-ci-robot removed the needs-ok-to-test label (Indicates a PR that requires an org member to verify it is safe to test) Nov 27, 2019
@johngmyers (Member Author)

Per comment in #7902, cordoning will cause the nodes to be taken out of the AWS ELB's group, so this will need to take a gentler approach to making the nodes unschedulable. I would still appreciate feedback on the approach to configuring strategies.

/hold

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) Nov 28, 2019
@johngmyers changed the title from "Support cordoning all nodes needing update during rolling update" to "Support tainting all nodes needing update during rolling update" Nov 28, 2019
@johngmyers (Member Author)

/hold cancel

@k8s-ci-robot removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command) Nov 29, 2019
@granular-ryanbonham self-assigned this Dec 6, 2019
@k8s-ci-robot added the size/XXL label (Denotes a PR that changes 1000+ lines, ignoring generated files) and removed the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) Dec 7, 2019
@johngmyers (Member Author)

/retest

@johngmyers (Member Author)

/test pull-kops-e2e-kubernetes-aws

@johngmyers (Member Author)

/test pull-kops-verify-staticcheck

@johngmyers (Member Author)

/test pull-kops-e2e-kubernetes-aws

@johngmyers (Member Author)

Let me know if I should rebase and clean up the commit stream.


type RollingUpdate struct {
	// TaintAllNeedUpdate taints all nodes in the instance group that need update before draining any of them
	TaintAllNeedUpdate *bool `json:"taintAllNeedUpdate,omitempty"`
Member:

Not 100% sure about this name - should this be "cordon"? Maybe preTaint or taintAllFirst?

Though I do see exactly how you came up with this name ... my suggestions don't specify that we only do instances that we are rolling. We could optimize the name for the common case, where we are rolling all the instances in the group - it's only when an update was interrupted that it won't be all of them (I think!)

What about if we state the intent, rather than the mechanism: avoidRepeatedPodScheduling or something along those lines?

OTOH, maybe this doesn't matter, because I would actually expect we treat unspecified as true once we are happy, because it seems the right strategy (I think!)

@johngmyers (Member Author):

We taint instead of cordon because cordoning can take the node out of rotation of the AWS ELB. Not a good thing to happen to all the nodes of an instancegroup.

I have a pending PR for allowing an external process to nominate nodes for updating, for example if they are older than site policy permits. Also, cluster autoscaler could add new nodes between spec update and rolling update. So interruption of rolling update isn't necessarily the only case where not all nodes in the instancegroup need updating.

It would be unfortunate if someone enabled this on an instancegroup that doesn't have autoscaling (or 100% surging), so I think having some indication to the admin that tainting will happen is desirable.
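To make the cordon-versus-taint distinction concrete, here is a rough sketch (not the PR's code) using the pre-context client-go signatures of this era; the taint key is the constant introduced in the diff below, later renamed during review:

```go
package rollingupdate

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// cordon sets Unschedulable, which (per the discussion of #7902 above) can
// also pull the node out of the AWS ELB's rotation.
func cordon(client kubernetes.Interface, node *corev1.Node) error {
	node.Spec.Unschedulable = true
	_, err := client.CoreV1().Nodes().Update(node)
	return err
}

// taintNoSchedule only blocks new pods from scheduling onto the node; running
// pods and load-balancer membership are untouched, which is why this PR
// taints rather than cordons.
func taintNoSchedule(client kubernetes.Interface, node *corev1.Node) error {
	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "kops.k8s.io/rolling-update",
		Effect: corev1.TaintEffectNoSchedule,
	})
	_, err := client.CoreV1().Nodes().Update(node)
	return err
}
```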

@@ -84,6 +84,8 @@ var (
VSphereCloudProvider = New("VSphereCloudProvider", Bool(false))
// SkipEtcdVersionCheck will bypass the check that etcd-manager is using a supported etcd version
SkipEtcdVersionCheck = New("SkipEtcdVersionCheck", Bool(false))
// ConfigurableRollingUpdate enables the RollingUpdate strategy configuration settings
ConfigurableRollingUpdate = New("ConfigurableRollingUpdate", Bool(false))
Member:

Do we need a featureflag, or can we rely on the idea that if users don't specify the field it won't get activated?

@johngmyers (Member Author):

I think it would be a good idea to have a featureflag until we have the whole set of features in. Then we might consider whether we want to consolidate some of the settings.

settings := resolveSettings(cluster, r.CloudGroup.InstanceGroup)

for uIdx, u := range update {
if featureflag.ConfigurableRollingUpdate.Enabled() && *settings.TaintAllNeedUpdate {
Member:

This looks like it's tainting each node before draining?

I understand why it's in the loop, but see the next comment. It might be easier to use PreferNoSchedule, and then we might not need to wait for the first successful new node.

I do wonder if we should undo the tainting on failure, but as these nodes no longer map to an InstanceGroup, we probably should keep them tainted, reflecting their true state.

@johngmyers (Member Author):

Currently kops doesn't uncordon a node on a failed drain. The only automated way to recover is to do another rolling update to complete the update.

In case the instance spec is reverted, we might add code to ensure that nodes with our taint are considered NotReady and thus updated on a subsequent rolling update. But this feature is likely to be only enabled on instancegroups with cluster autoscaler enabled, so that will eventually remove them for being underused.

}

if noneReady {
// Wait until after one node is deleted and its replacement validates before the mass-cordoning
Member:

Oh - this is clever :-)

It does get more complicated if we start rolling multiple nodes and/or temporarily creating more nodes ("surge upgrades"), but I guess in all cases we just wait for the first.

We can also "soft taint" the nodes ("preferred" rather than "forbidden"), so that the scheduler will still schedule pods back to the old nodes if something goes wrong. This also helps if, for example, something goes wrong with the nodes concurrently and we start losing them. I don't know if we should rely on that or on this "wait for success before tainting" approach?

@johngmyers (Member Author):

Yes, for both MaxUnavailable and MaxSurge, I intend to also have this toe-dipping behavior. You really don't want a whole fleet of dead surged instances. Especially if it takes two or three tries to get a working spec.
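For readers following along, a sketch of the "wait for one" gate being discussed, with hypothetical stand-in types and callbacks (updateOneAndValidate, taintNode); the PR's real logic lives inside RollingUpdateInstanceGroup.RollingUpdate and is more involved:

```go
package rollout

type member struct{ name string }

// cloudGroup is a stand-in for the cloud instance group: Ready holds
// instances already at the current spec, NeedUpdate the ones still to roll.
type cloudGroup struct {
	Ready      []*member
	NeedUpdate []*member
}

func rollOut(group *cloudGroup,
	updateOneAndValidate func(*member) error,
	taintNode func(*member) error) error {

	// If no instance matches the new spec yet, the spec is unproven: replace
	// a single instance and wait for its replacement to validate before the
	// mass tainting, limiting damage from a bad instance specification.
	if len(group.Ready) == 0 && len(group.NeedUpdate) > 0 {
		if err := updateOneAndValidate(group.NeedUpdate[0]); err != nil {
			return err
		}
		group.NeedUpdate = group.NeedUpdate[1:]
	}

	// Only then taint the remaining out-of-date nodes.
	for _, m := range group.NeedUpdate {
		if err := taintNode(m); err != nil {
			return err
		}
	}
	return nil
}
```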

if len(toTaint) > 0 {
	noun := "nodes"
	if len(toTaint) == 1 {
		noun = "node"
Member:

It's normally fine not to worry about this, or just do node(s), but this does look better!


node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
	Key:    rollingUpdateTaintKey,
	Effect: corev1.TaintEffectNoSchedule,
Member:

This is where we could use TaintEffectPreferNoSchedule, instead of the more complex "wait for one" heuristic.

@johngmyers (Member Author):

We could also do both...

The downside of a soft taint is that it would not cause cluster autoscaler to step in and create new instances.

The one thing I haven't figured out is how to get the overprovisioning pods evicted from the tainted nodes, so the pods can perform their intended function. Unfortunately, the requiredDuringSchedulingRequiredDuringExecution anti-affinity isn't implemented. Perhaps the rolling update code needs to search for and preemptively delete them.
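The preemptive-deletion idea could look roughly like the following. This is purely hypothetical and not part of the PR: the priority class name is an assumed site convention for cluster-autoscaler overprovisioning pods, and the client-go calls use the pre-context signatures seen elsewhere in this PR.

```go
package overprovision

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// overprovisioningPriorityClass is an assumed convention for placeholder pods.
const overprovisioningPriorityClass = "overprovisioning"

// evictOverprovisioningPods deletes placeholder pods from a tainted node so
// they go pending again, prompting cluster autoscaler to scale up.
func evictOverprovisioningPods(client kubernetes.Interface, nodeName string) error {
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if pod.Spec.PriorityClassName != overprovisioningPriorityClass {
			continue
		}
		if err := client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, &metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```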

return err
}

_, err = rollingUpdateData.K8sClient.CoreV1().Nodes().Patch(node.Name, types.StrategicMergePatchType, patchBytes)
Member:

I was going to say it would be good to log the patch so we knew CreateTwoWayMergePatch was doing what we thought, but then I saw you have a test - even better 👍
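For reference, a sketch of how such a patch can be built with CreateTwoWayMergePatch and applied with the Patch call shown in the diff; this is illustrative only, not the PR's implementation or its test:

```go
package nodepatch

import (
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
	"k8s.io/client-go/kubernetes"
)

// addTaint computes a strategic merge patch containing only the taints change
// and applies it to the node.
func addTaint(client kubernetes.Interface, node *corev1.Node, taint corev1.Taint) error {
	oldData, err := json.Marshal(node)
	if err != nil {
		return err
	}

	modified := node.DeepCopy()
	modified.Spec.Taints = append(modified.Spec.Taints, taint)

	newData, err := json.Marshal(modified)
	if err != nil {
		return err
	}

	// The resulting patch only mentions spec.taints.
	patchBytes, err := strategicpatch.CreateTwoWayMergePatch(oldData, newData, corev1.Node{})
	if err != nil {
		return err
	}

	_, err = client.CoreV1().Nodes().Patch(node.Name, types.StrategicMergePatchType, patchBytes)
	return err
}
```

A patch limited to spec.taints avoids overwriting unrelated node fields that may have changed since the node was read.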

@justinsb (Member)

This looks really great @johngmyers, thanks so much & sorry about the delays!

A few comments that I've posted above:

  • field naming, as is always the case. I suggested naming to focus on the intent ("avoiding repeated pod bounces") and optimizing for the common case where all the machines in an instancegroup are rolling.
  • I wonder if we should just "soft" taint the old nodes with PreferNoSchedule, and then maybe we don't need the "wait for one" logic. Though I do like that!
  • I don't know if we need the featureflag, if users have to specify the field explicitly. But OTOH I also want to make it the default, and have the field unspecified turn on this behaviour - but I recognize we don't necessarily want to do that on day 1.

The name of the field is the only thing that's not really changeable, so that's the only real blocker in my view. (I know it's annoying because it's probably the least important thing technically). Just to throw out a strawman, I'd be happy with avoidPodRescheduling if that name makes sense for you? But really anything intent based...

@johngmyers (Member Author)

I think the first thing to decide is whether it should be a hard or soft taint.

If it's soft, then there probably isn't a need for a setting at all. We could just do this always, including for masters. With hard, the setting is needed for instancegroups without either cluster autoscaling or 100% surging.

The disadvantage of soft is that it's less effective at avoiding repeated pod rescheduling. Cluster autoscaler won't create new nodes until all the old ones are filled to capacity.

We could have the setting choose between hard/soft instead of hard/none.

@k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files) and removed the size/XL label (Denotes a PR that changes 500-999 lines, ignoring generated files) Dec 31, 2019
@johngmyers (Member Author)

I think it makes sense to proceed with soft tainting for now. There is the possibility of later adding a setting for choosing between soft and hard tainting, should we need it.
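Concretely, the soft variant only changes the taint effect. A sketch under the same assumptions as the earlier snippets, reusing the taint key this PR introduces:

```go
package rollingupdate

import corev1 "k8s.io/api/core/v1"

// softUpdateTaint is the "soft" variant settled on above: PreferNoSchedule
// steers new pods away from out-of-date nodes while still allowing them as a
// fallback if nothing else fits.
func softUpdateTaint() corev1.Taint {
	return corev1.Taint{
		Key:    "kops.k8s.io/rolling-update",
		Effect: corev1.TaintEffectPreferNoSchedule,
	}
}
```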

@johngmyers (Member Author)

/test pull-kops-e2e-kubernetes-aws

@geojaz (Member) left a comment:

I don't see any dealbreakers here. If we can get aligned on our labels and naming, I'm +1.

@@ -34,6 +37,8 @@ import (
"k8s.io/kops/upup/pkg/fi"
)

const rollingUpdateTaintKey = "kops.k8s.io/rolling-update"
Member:

Perhaps .../rolling-update-in-progress?

Member:

I'd use something like /scheduled-for-update

Member:

I'm on board with that.

@johngmyers (Member Author):

Changed

}
}
}
if len(toTaint) > 0 {
Member:

We haven't worried about this yet (as per jsb's comment), but I'd be happy to see this pulled out to a util function somewhere. I'd be happy to use a helper like this as we build out GCE support.
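One possible shape for such a shared helper, sketched here with an assumed name and signature (nothing like this exists in the PR as written), so cloud-agnostic callers could reuse the tainting step:

```go
package nodeutil

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
)

// TaintNodes applies the given taint to every node that does not already
// carry a taint with the same key, returning the first error encountered.
func TaintNodes(client kubernetes.Interface, nodes []*corev1.Node, taint corev1.Taint) error {
	for _, node := range nodes {
		already := false
		for _, t := range node.Spec.Taints {
			if t.Key == taint.Key {
				already = true
				break
			}
		}
		if already {
			continue
		}
		node.Spec.Taints = append(node.Spec.Taints, taint)
		if _, err := client.CoreV1().Nodes().Update(node); err != nil {
			return err
		}
	}
	return nil
}
```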

@@ -221,6 +233,64 @@ func (r *RollingUpdateInstanceGroup) RollingUpdate(rollingUpdateData *RollingUpd
return nil
}

func (r *RollingUpdateInstanceGroup) taintAllNeedUpdate(update []*cloudinstances.CloudInstanceGroupMember, rollingUpdateData *RollingUpdateCluster) error {
Member:

Pretty much the only thing I don't like is the naming here. I would prefer something like taintOutdatedNodes, but I'm not a hard no; it just feels awkward to me.

@@ -137,6 +142,13 @@ func (r *RollingUpdateInstanceGroup) RollingUpdate(rollingUpdateData *RollingUpd
}
}

if !rollingUpdateData.CloudOnly {
err = r.taintAllNeedUpdate(update, rollingUpdateData)
Member:

Tip (that I am a recent convert to): Using if err := foo(); err != nil { return err } avoids problems with variable shadowing (though it really confuses the Go static analysis tools, and it only works when there's a single error return value!)
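A small self-contained illustration of the tip, for readers unfamiliar with the shadowing pitfall it avoids; foo is just a placeholder function:

```go
package errscope

import "errors"

func foo() error { return errors.New("boom") }

func run() error {
	// The scoped form: err exists only inside the if, so it cannot shadow
	// (or be confused with) an outer variable of the same name.
	if err := foo(); err != nil {
		return err
	}

	// Contrast with a nested block where := silently creates a new err,
	// leaving the outer err untouched — the shadowing problem the tip avoids.
	var err error
	if true {
		err := foo() // shadows the outer err
		_ = err
	}
	return err // still nil, possibly not what was intended
}
```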

@johngmyers (Member Author):

Thanks, I'll try it out.

@justinsb (Member) commented Jan 4, 2020

Thanks @johngmyers - this is really a great improvement

/approve
/lgtm

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged) Jan 4, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johngmyers, justinsb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files) Jan 4, 2020
@rifelpet (Member) commented Jan 4, 2020

/retest

@k8s-ci-robot merged commit 5ecf8d9 into kubernetes:master Jan 4, 2020
@k8s-ci-robot added this to the v1.18 milestone Jan 4, 2020
@johngmyers deleted the cordon branch January 4, 2020 17:07
Development

Successfully merging this pull request may close these issues.

KubeCon 2019 User Feedback: Reduce pod bounces during rolling updates.
8 participants