
Avoid waiting on validation during rolling update for inapplicable instance groups #10065

Conversation

@bharath-123 (Contributor) commented Oct 17, 2020:

When an error occurs during cluster validation during a rolling update, the validation phase waits for a certain period of time before continuing.

We can avoid this wait for errors that are inapplicable to the instance group whose rolling update is occurring, e.g. when a node in a different instance group is terminated, producing an error about that group's target size being less than expected. For such errors we do not want the rolling update on the current instance group to wait.

Fixes #10009

Really appreciate the review comments and discussions. Working on this PR so far has been a great experience for me.
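
For readers skimming this thread: a minimal, self-contained sketch of the idea, with simplified stand-in types (the actual kops types and code paths differ).

package main

import "fmt"

// Simplified stand-ins for the kops types; illustrative only.
type InstanceGroup struct{ Name string }

type ValidationFailure struct {
	Message       string
	InstanceGroup *InstanceGroup // nil means the failure is cluster-wide
}

// shouldWaitBeforeRetry reports whether any failure is applicable to the
// group being rolled: cluster-wide failures and failures in the same group
// force the rolling update to keep waiting, while failures confined to
// other groups do not.
func shouldWaitBeforeRetry(failures []*ValidationFailure, group *InstanceGroup) bool {
	for _, f := range failures {
		if f.InstanceGroup == nil || f.InstanceGroup.Name == group.Name {
			return true
		}
	}
	return false
}

func main() {
	updating := &InstanceGroup{Name: "nodes-1"}
	failures := []*ValidationFailure{
		{Message: "target size mismatch", InstanceGroup: &InstanceGroup{Name: "nodes-2"}},
	}
	// Prints false: the only failure belongs to an unrelated instance group.
	fmt.Println(shouldWaitBeforeRetry(failures, updating))
}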

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Oct 17, 2020
@k8s-ci-robot (Contributor):

Welcome @bharath-123!

It looks like this is your first PR to kubernetes/kops 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kops has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor):

Hi @bharath-123. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 17, 2020
@k8s-ci-robot k8s-ci-robot added area/rolling-update size/M Denotes a PR that changes 30-99 lines, ignoring generated files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 17, 2020
@hakman (Member) commented Oct 18, 2020:

/cc @olemarkus @johngmyers

@johngmyers (Member):

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Oct 18, 2020
@johngmyers johngmyers added the hacktoberfest-accepted Accepted contribution for Hacktoberfest label Oct 18, 2020
@bharath-123 (Contributor Author):

Seems like some static checks and format tests are failing. I should have run gofmt and the other static checks before raising the PR. Will do so now.

@johngmyers (Member) left a comment:

You'll need to run make gofmt and fix that duplicate import in order to fix those failed tests.

A useful unit test for rolling update would be to mock in a validation failure for a nodes group that is not selected for update and verify that the rolling update completes regardless.
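
A rough sketch of such a mock validator, reusing the ValidationError fields shown later in this review (the validation import path is assumed; the rolling-update test harness wiring is omitted):

import (
	kopsapi "k8s.io/kops/pkg/apis/kops"
	"k8s.io/kops/pkg/validation"
)

// groupSpecificFailureValidator (hypothetical name) always reports a failure
// attributed to one fixed instance group. Wired into the rolling-update test
// harness, updating any other group should complete despite the failure.
type groupSpecificFailureValidator struct {
	group *kopsapi.InstanceGroup
}

func (v *groupSpecificFailureValidator) Validate() (*validation.ValidationCluster, error) {
	return &validation.ValidationCluster{
		Failures: []*validation.ValidationError{{
			Kind:          "Node",
			Name:          "node-1",
			Message:       "node \"node-1\" is not ready",
			InstanceGroup: v.group,
		}},
	}, nil
}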

The validation unit tests should be updated to assert that the instanceGroup field is filled in appropriately.

In collectPodFailures, failures for pods of priority system-node-critical should be marked as being specific to the pod's node's group. These are typically things like CNI DaemonSets. Pods that are system-cluster-critical, on the other hand, are cluster-level failures.
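
A sketch of that classification, where groupForNode is a hypothetical node-name-to-group lookup rather than an existing kops helper (v1 is k8s.io/api/core/v1):

// classifyPodFailure returns the instance group a failing pod's validation
// error should be attributed to, or nil for cluster-wide failures.
func classifyPodFailure(pod *v1.Pod, groupForNode func(nodeName string) *kopsapi.InstanceGroup) *kopsapi.InstanceGroup {
	switch pod.Spec.PriorityClassName {
	case "system-node-critical":
		// Node-level components such as CNI DaemonSet pods: attribute the
		// failure to the group of the node the pod runs on.
		return groupForNode(pod.Spec.NodeName)
	default:
		// system-cluster-critical pods (and anything else reaching here)
		// stay cluster-level, i.e. a nil InstanceGroup.
		return nil
	}
}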

// TODO: Should we check if we have enough time left before the deadline?
if shouldWaitBeforeRetry(result.Failures, group) {
time.Sleep(c.ValidateTickDuration)
}
Member:

There's a logic error here: when shouldWaitBeforeRetry returns false, instead of letting the validation succeed and proceeding to update the next node, it goes into a fast spin loop.

When all failures are inapplicable it should go through the code block that increments successCount.
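
In other words, a sketch of the intended flow, combining the two snippets quoted in this review (not the final code):

if err == nil && !shouldWaitBeforeRetry(result.Failures, group) {
	// No applicable failures: take the success path instead of spinning.
	successCount++
	if successCount >= validateCount {
		klog.Info("Cluster validated.")
		return nil
	}
	klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
	time.Sleep(c.ValidateSuccessDuration)
	continue
}

// Applicable failures (or a validation error) remain: wait before retrying.
time.Sleep(c.ValidateTickDuration)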

Contributor Author:

One question: even if the failures are inapplicable and we skip the wait period, we definitely want to log the failures, right? I think we would need to change the log messages here in that case. If validation fails and the failures are inapplicable, showing the failure log messages followed by a "cluster validated" message would be very unintuitive for sysadmins.
I am planning to copy this block of code into the else branch taken when shouldWaitBeforeRetry returns false, and change the log messages to indicate that there were failures but they were ignored because they are in a different instance group.

			successCount++
			if successCount >= validateCount {
				klog.Info("Cluster validated.")
				return nil
			}
			klog.Infof("Cluster validated; revalidating in %s to make sure it does not flap.", c.ValidateSuccessDuration)
			time.Sleep(c.ValidateSuccessDuration)
			continue

@@ -154,22 +155,24 @@ func Test_ValidateCloudGroupMissing(t *testing.T) {
Kind: "InstanceGroup",
Name: "node-1",
Message: "InstanceGroup \"node-1\" is missing from the cloud provider",
InstanceGroup: instanceGroup,
Member:

Nit: you could have made a smaller change by using instanceGroups[0].

@bharath-123 (Contributor Author):

/retest

@johngmyers (Member):

If the system-node-critical feature is too complicated, it can be done as a follow-up PR.

@bharath-123 (Contributor Author):

@johngmyers I was not able to figure out how to infer the InstanceGroup for a pod in collectPodFailures. The pod has a NodeName field in its pod spec, but I was not sure how we could infer the InstanceGroup for a pod from that.

@bharath-123 (Contributor Author):

I will follow this up with the unit test and the code changes that you have suggested. Thank you.

@johngmyers (Member):

It would be better to edit the commit stream so that the formatting and staticcheck issues were fixed in their original commits.

I like the separation between the first and second commits. The third commit should be combined with the second as it is part of the same logical change.

@bharath-123 (Contributor Author):

If the system-node-critical feature is too complicated, it can be done as a follow-up PR.

Shouldn't be difficult if we can figure out how to infer the instanceGroup from a pod given its structure.

@bharath-123 (Contributor Author):

It would be better to edit the commit stream so that the formatting and staticcheck issues were fixed in their original commits.

I like the separation between the first and second commits. The third commit should be combined with the second as it is part of the same logical change.

Yes, I agree with this. I will be making a whole bunch of changes now anyway, so I will revise the commit stream.

@johngmyers (Member):

validateNodes could construct a map of nodes to groups. collectPodFailures already constructs a map of IP address to node, so it could then use that map of nodes to groups to go from pod to group.
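
A sketch of that chaining, with hypothetical map shapes (v1 is k8s.io/api/core/v1; kopsapi as imported elsewhere in this PR):

// groupForPod chains the two lookups: pod -> host IP -> node -> group.
// podIPToNode stands for the map collectPodFailures already builds;
// nodeToGroup stands for the node-name-to-group map validateNodes could build.
func groupForPod(pod *v1.Pod,
	podIPToNode map[string]*v1.Node,
	nodeToGroup map[string]*kopsapi.InstanceGroup,
) *kopsapi.InstanceGroup {
	node, ok := podIPToNode[pod.Status.HostIP]
	if !ok {
		return nil // unknown node: leave the failure cluster-wide
	}
	return nodeToGroup[node.Name]
}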

@johngmyers (Member):

The rolling update code should probably remove the inapplicable failures from what it logs.

@@ -25,6 +25,7 @@ import (

"k8s.io/apimachinery/pkg/runtime"
"k8s.io/client-go/tools/pager"
kopsapi "k8s.io/kops/pkg/apis/kops"
Member:

This should already be imported as kops below, I believe

Contributor Author:

Yes @justinsb, will be doing this.

@bharath-123 (Contributor Author):

The rolling update code should probably remove the inapplicable failures from what it logs.

Makes sense. I am working on a different implementation of this feature now.

@bharath-123 force-pushed the feature/instancegroup-specific-validation branch 2 times, most recently from 9fe80ab to b9927ac on October 25, 2020 at 11:56
@bharath-123 (Contributor Author):

@johngmyers @justinsb @hakman @olemarkus @zetaab please have a look at this. This is a complete implementation, with unit tests added for rolling update.

@bharath-123 (Contributor Author):

There are a couple of unused imports and unrelated comments which I had added while trying a different implementation. I'll remove them now.

@bharath-123 force-pushed the feature/instancegroup-specific-validation branch from b9927ac to d4e762d on October 25, 2020 at 12:25
@bharath-123 (Contributor Author):

Gentle ping @johngmyers @justinsb @hakman.

@rifelpet (Member):

@bharath-123 Feel free to remove [WIP] from the PR title if this is ready for review.

@bharath-123 bharath-123 changed the title [WIP] Avoid waiting on validation during rolling update for inapplicable instance groups Avoid waiting on validation during rolling update for inapplicable instance groups Oct 26, 2020
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 26, 2020
@bharath-123 (Contributor Author):

@bharath-123 Feel free to remove [WIP] from the PR title if this is ready for review.

done

@johngmyers (Member) left a comment:

I don't see where in the code failures specific to a master IG cause rolling update to be blocked from proceeding.

From comments it seems you intended to put this logic in the validation code. I think it might be simpler to put it in the rolling update code that examines the returned list of failures.

@@ -445,7 +446,7 @@ func (c *RollingUpdateCluster) validateClusterWithTimeout(validateCount int) err
for {
// Note that we validate at least once before checking the timeout, in case the cluster is healthy with a short timeout
result, err := c.ClusterValidator.Validate()
if err == nil && len(result.Failures) == 0 {
if err == nil && (len(result.Failures) == 0 || isNotRelatedInstanceGroupError(result.Failures, group)) {
Member:

The len(result.Failures) == 0 is redundant.

Contributor Author:

Yes, will remove it.

// TODO: Should we check if we have enough time left before the deadline?
time.Sleep(c.ValidateTickDuration)
}

return fmt.Errorf("cluster did not validate within a duration of %q", c.ValidationTimeout)
}

func isNotRelatedInstanceGroupError(failures []*validation.ValidationError, group *cloudinstances.CloudInstanceGroup) bool {
Member:

The sense of the negation is confusing. They're also "failures"; "errors" are something else. I would recommend inverting the sense of the return value and calling this something like hasFailureRelevantToGroup().
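
Something along these lines (a sketch; comparing groups by name is illustrative):

// hasFailureRelevantToGroup reports whether any validation failure should
// block the rolling update of the given group. A nil InstanceGroup marks a
// cluster-wide failure, which is always relevant.
func hasFailureRelevantToGroup(failures []*validation.ValidationError, group *cloudinstances.CloudInstanceGroup) bool {
	for _, failure := range failures {
		if failure.InstanceGroup == nil {
			return true
		}
		if failure.InstanceGroup.Name == group.InstanceGroup.Name {
			return true
		}
	}
	return false
}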

Contributor Author:

Yes, makes sense. Will do that.

@@ -105,6 +105,30 @@ func (*erroringClusterValidator) Validate() (*validation.ValidationCluster, erro
return nil, errors.New("testing validation error")
}

// simulates failures in a specific node group in the map of instance groups
Member:

Suggested change
// simulates failures in a specific node group in the map of instance groups
// instanceGroupNodeSpecificErrorClusterValidator simulates failures in a specific node group in the map of instance groups.

Comment on lines 52 to 54
// optional field to indicate which instance group this validation error is coming from
// if InstanceGroup is nil, then the validation error is either a cluster wide error
// or an error in the master instancegroup which indicates a cluster wide error
Member:

Suggested change
// optional field to indicate which instance group this validation error is coming from
// if InstanceGroup is nil, then the validation error is either a cluster wide error
// or an error in the master instancegroup which indicates a cluster wide error
// InstanceGroup is an optional field to indicate which instance group this validation error is coming from.
// If nil, then the validation error is either a cluster wide error
// or an error in the master instancegroup which indicates a cluster wide error.

Contributor Author:

Acked these comment suggestions. They are indeed more helpful.

@@ -238,6 +243,9 @@ func (v *ValidationCluster) collectPodFailures(ctx context.Context, client kuber
if pod.Status.Phase == v1.PodSucceeded {
return nil
}

// pod validationErrors usually do not have instanceGroup field as all pod errors caught
// are system critical pods.
Member:

This isn't true: some pod failures are for system-node-critical pods, not system-cluster-critical pods.

Contributor Author:

I inferred this from line 240 containing the following code:

		if priority != "system-cluster-critical" && priority != "system-node-critical" {
			return nil
		}

But I think I misread system-node-critical. I need to brainstorm how to get the InstanceGroup for the pods, which doesn't seem to be straightforward.

Member:

As mentioned previously, this can be deferred to a later PR.

Contributor Author:

Alright. So we can leave the InstanceGroup field as nil for validation errors of kind Pod.

}

func (igErrorValidator *instanceGroupNodeSpecificErrorClusterValidator) Validate() (*validation.ValidationCluster, error) {
instanceGroup := igErrorValidator.Groups[igErrorValidator.NodeGroup].InstanceGroup
Member:

It would be simpler if the instanceGroupNodeSpecificErrorClusterValidator struct only contained an InstanceGroup field.
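
That is, something like this (a sketch of the simplified mock):

// instanceGroupNodeSpecificErrorClusterValidator reports a failure in one
// fixed instance group, carried directly instead of being looked up through
// the Groups/NodeGroup indirection.
type instanceGroupNodeSpecificErrorClusterValidator struct {
	InstanceGroup *kopsapi.InstanceGroup
}

func (v *instanceGroupNodeSpecificErrorClusterValidator) Validate() (*validation.ValidationCluster, error) {
	return &validation.ValidationCluster{
		Failures: []*validation.ValidationError{{
			Kind:          "Node",
			Name:          "node-1",
			Message:       "node \"node-1\" is not ready",
			InstanceGroup: v.InstanceGroup,
		}},
	}, nil
}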

Contributor Author:

Alright, will do that. It makes sense.

func (igErrorValidator *instanceGroupNodeSpecificErrorClusterValidator) Validate() (*validation.ValidationCluster, error) {
instanceGroup := igErrorValidator.Groups[igErrorValidator.NodeGroup].InstanceGroup
if instanceGroup.IsMaster() || instanceGroup.IsBastion() {
instanceGroup = nil
Member:

This is testing the mock, not the code under test.

Contributor Author:

Would it be better if we included the Validate logic in the Validate mock? I thought just mocking the failures would be enough to test this, since we are mostly concerned with the failures returned by Validate() and not how it produces them. Am I misunderstanding something?

Member:

If it were the rolling update code that was responsible for ensuring that failures for IGs of role master still blocked rolling update, then testing that in the rolling update code would make sense. In the current design, where the validation code is responsible for that logic, it should be tested by the validation unit tests, not the rolling update unit tests.

Contributor Author:

Fair enough, this makes sense. I'll push all the logic that determines whether we should wait for validation into the rolling update code rather than the validation code. It will make things much easier and simpler.

@bharath-123 (Contributor Author):

I don't see where in the code failures specific to a master IG cause rolling update to be blocked from proceeding.

From comments it seems you intended to put this logic in the validation code. I think it might be simpler to put it in the rolling update code that examines the returned list of failures.

In the ValidationError struct, if the InstanceGroup field is nil then the failure is a cluster-wide failure, for which rolling update should indeed be blocked. For all validation failures in a master IG, the InstanceGroup field is nil. isNotRelatedInstanceGroupError (which I will rename) returns false if the InstanceGroup field is nil, which causes rolling update to be blocked.

It would definitely be cleaner if we could have the InstanceGroup field filled in for all failures. I mainly did this because it would be a bit difficult to identify the groups for the pods. I'll think more about this and get back.

@johngmyers (Member):

For all validation failures in master IG, the InstanceGroup field is nil.

I did not see code for ensuring the InstanceGroup field is nil for failures in IGs of role master.

@bharath-123 (Contributor Author):

For all validation failures in master IG, the InstanceGroup field is nil.

I did not see code for ensuring the InstanceGroup field is nil for failures in IGs of role master.

Will be pushing all the logic to rolling update code now.

@bharath-123 (Contributor Author):

@johngmyers working on refining this. Just to confirm: for Pods and ComponentStatuses it's not straightforward to get the InstanceGroup, so this will be deferred to a future PR anyway.

For these cases, the rolling update code will consider failures where InstanceGroup == nil to be relevant to any instance group. Would I need to assert these cases in the validate_cluster tests? In the rolling update tests, I'll include a test that fails rolling update when a validation error with a nil InstanceGroup is returned.
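
For illustration, a sketch of the mock for that test case (hypothetical name; the failure carries no InstanceGroup, so the rolling update must treat it as relevant and eventually time out):

// clusterWideFailureValidator reports a failure with a nil InstanceGroup.
type clusterWideFailureValidator struct{}

func (*clusterWideFailureValidator) Validate() (*validation.ValidationCluster, error) {
	return &validation.ValidationCluster{
		Failures: []*validation.ValidationError{{
			Kind:    "Pod",
			Name:    "kube-system/kube-dns",
			Message: "pod \"kube-dns\" is not ready",
			// InstanceGroup deliberately left nil: cluster-wide failure.
		}},
	}, nil
}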

@bharath-123 force-pushed the feature/instancegroup-specific-validation branch from d4e762d to 4016b36 on October 31, 2020 at 13:34
The InstanceGroup field in the ValidationError struct is an optional field meant
to indicate the InstanceGroup that reported the failure. It either holds a pointer
to the instance group which caused the validation error, or is nil, which indicates
that we were unable to determine the instance group to which the failure should be
attributed.

This field is mainly used to identify whether a failure is worth waiting for
when validating a particular instance group.
This commit fixes the unit tests for validate_cluster to reflect the addition of the new
InstanceGroup field in the ValidationError struct.
When unrelated instance groups produce validation errors, the instance group
being updated sees a validation failure and is forced to wait before its rolling update can continue.

This can be avoided, as failures in other node instance groups usually don't affect
the instance group being updated in any way.
The tests create a cluster with two node instance groups plus a master and a bastion instance group.
Only one node instance group requires a rolling update.

instanceGroupNodeSpecificErrorClusterValidator mocks a validation failure for a given node group.
Rolling update should not fail if the cluster validator reports an error in an unrelated instance group.
@bharath-123 force-pushed the feature/instancegroup-specific-validation branch from 4016b36 to 1e18a5d on October 31, 2020 at 13:51
@bharath-123 (Contributor Author):

Gentle ping on this, @johngmyers. Thank you for your help and patience on this PR.

@johngmyers (Member):

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 6, 2020
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bharath-123, johngmyers

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 6, 2020
@k8s-ci-robot k8s-ci-robot merged commit 7b26ec4 into kubernetes:master Nov 6, 2020
@k8s-ci-robot k8s-ci-robot added this to the v1.20 milestone Nov 6, 2020
Labels:
- approved (Indicates a PR has been approved by an approver from all required OWNERS files.)
- area/rolling-update
- cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
- hacktoberfest-accepted (Accepted contribution for Hacktoberfest.)
- lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
- ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
- size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Development

Successfully merging this pull request may close these issues.

Rolling update ignore unready nodes in inapplicable instance groups
6 participants