add scale test for non graceful node shutdown #118848

sonasingh46 · 2023-06-24T12:03:18Z

Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR adds a scale test for non-graceful node shutdown.

Ref:
Blog Link: https://kubernetes.io/blog/2022/05/20/kubernetes-1-24-non-graceful-node-shutdown-alpha/
Feature PR Link: #108486
Issue Link: kubernetes/enhancements#2268

None

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

https://github.com/kubernetes/enhancements/tree/master/keps/sig-storage/2268-non-graceful-shutdown

k8s-ci-robot · 2023-06-24T12:03:27Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

YuikoTakada · 2023-06-27T08:06:42Z

Thank you for the PR.
MinNodeRequired is set to 10. Is this number enough to test scalability?

msau42 · 2023-06-28T04:31:51Z

/ok-to-test
/assign @xing-yang

xing-yang · 2023-07-06T18:52:08Z

/kind test

k8s-ci-robot · 2023-07-06T18:52:10Z

@xing-yang: The label(s) kind/test cannot be applied, because the repository doesn't have them.

In response to this:

/kind test

xing-yang · 2023-07-06T18:54:37Z

/kind feature

test/e2e/storage/non_graceful_node_shutdown_scale.go

YuikoTakada · 2023-07-07T10:50:26Z

Do we need a review from some sig-scalability members?

YuikoTakada · 2023-07-11T08:25:46Z

/assign @gnufied

Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>

k8s-ci-robot · 2023-07-11T12:24:00Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sonasingh46
Once this PR has been reviewed and has the lgtm label, please ask for approval from gnufied. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

test/e2e/storage/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sonasingh46 · 2023-07-11T12:31:07Z

test/e2e/storage/non_graceful_node_shutdown_scale.go

+ 1. Deploys a gce-pd csi driver.
+ 2. Creates a gce-pd csi storage class.
+ 3. Taints 50% of nodes out of 'N' nodes.
+ 4. Creates a stateful set with "N/2" number of replicas.


With the current implementation, the number of pods this test can create is always limited by the number of nodes, because the STS replica pods are spread across the nodes.
For example, with 10 nodes the STS can have only 5 replica pods; the nodes where those pods were scheduled can then be shut down so that the pods move to the other 5 healthy nodes.

While I was working on this, one idea came up:

Say we have 'N' nodes in the k8s cluster.

Create 'K' StatefulSets, each with a replica count of only 1 or 2 (call the replica count 'RC').

Keep at least 'RC' nodes available (these are the nodes that do not go through shutdown in the test).

This way the number of pods is not constrained by the number of nodes.
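
The arithmetic behind this idea can be sketched in a few lines of Go. This is only an illustration of the two sizing rules described above; the function names are hypothetical and are not part of the PR or of the e2e framework.

```go
package main

import "fmt"

// podsSingleStatefulSet models the design under review: one StatefulSet
// whose replicas are spread one per node over half of the N nodes.
func podsSingleStatefulSet(nodes int) int {
	return nodes / 2
}

// podsManyStatefulSets models the proposed design: K StatefulSets with RC
// replicas each, so the pod count no longer depends on the node count.
func podsManyStatefulSets(k, rc int) int {
	return k * rc
}

// minHealthyNodes is the number of nodes that, per the comment above, must
// stay out of the shutdown set: at least RC.
func minHealthyNodes(rc int) int {
	return rc
}

func main() {
	fmt.Println("single STS on 10 nodes:", podsSingleStatefulSet(10))
	fmt.Println("50 STS x 2 replicas:", podsManyStatefulSets(50, 2))
	fmt.Println("healthy nodes needed for RC=2:", minHealthyNodes(2))
}
```

With 10 nodes the single-StatefulSet design tops out at 5 pods, while K and RC can be chosen freely in the proposed design (the values 50 and 2 here are arbitrary).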

cc @xing-yang

xing-yang · 2023-07-11T13:59:05Z

test/e2e/storage/non_graceful_node_shutdown_scale.go

+			framework.Logf("Failed to list node: %v", err)
+		}
+		if len(nodeListCache.Items) < MinNodeRequired {
+			ginkgo.Skip("At least 2 nodes are required to run the test")


"2" should be replaced by MinNodeRequired.

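One way to keep the skip message in step with the constant is to format it from MinNodeRequired rather than hard-coding a number. The following is a stdlib-only sketch: the helper name skipMessage is hypothetical, and the value 10 is taken from the earlier comment in this thread. In the real test the resulting string would presumably be passed to ginkgo.Skip.

```go
package main

import "fmt"

// MinNodeRequired mirrors the constant discussed in the review; the value 10
// comes from the earlier comment in this thread.
const MinNodeRequired = 10

// skipMessage reports whether the test should be skipped for the given node
// count and, if so, returns a message derived from MinNodeRequired.
func skipMessage(found int) (string, bool) {
	if found < MinNodeRequired {
		return fmt.Sprintf("At least %d nodes are required to run the test, found %d",
			MinNodeRequired, found), true
	}
	return "", false
}

func main() {
	if msg, skip := skipMessage(4); skip {
		fmt.Println(msg)
	}
}
```
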
xing-yang · 2023-07-11T14:04:39Z

test/e2e/storage/non_graceful_node_shutdown_scale.go

+				e2enode.RemoveTaintOffNode(ctx, c, oldNodeName, taint)
+			}
+
+			// Verify that pods gets scheduled to the older nodes that was terminated non gracefully and now


gets -> get
was -> were
is -> are

xing-yang · 2023-07-11T14:05:15Z

test/e2e/storage/non_graceful_node_shutdown_scale.go

+	})
+})
+
+// createAndVerifyStatefulDeployment creates a statefulset


createAndVerifyStatefulDeployment -> createAndVerifySTS

xing-yang · 2023-07-11T14:33:36Z

test/e2e/storage/non_graceful_node_shutdown_scale.go

+			createAndVerifySTS(ctx, scName, newSTSName, ns, &newReplicaCount, newPodLabels, c)
+		})
+	})
+})


Do these pods and PVCs get cleaned up at the end of testing?

With the current implementation, they are not. I will add that logic. Also, is the cleanup required?

Yes, typically the resources created for e2e tests should be cleaned up at the end of the tests.

ACK. Making the changes.

test/e2e/storage/non_graceful_node_shutdown_scale.go

gnufied · 2023-07-13T15:52:10Z

test/e2e/storage/non_graceful_node_shutdown_scale.go

+		}
+	})
+
+	ginkgo.Describe("[NonGracefulNodeShutdown] pod that uses a persistent volume via gce pd driver", func() {


This looks like a disruptive test; if it is run in parallel with other tests, it could cause problems for them.

Will mark the test disruptive.

At line 71, the top-level Describe is already marked Disruptive:

utils.SIGDescribe("[Feature:NodeOutOfServiceVolumeDetach] [Scalability] [Disruptive] [LinuxOnly] NonGracefulNodeShutdown", func() {

Do we still need to explicitly put Disruptive in here?

gnufied · 2023-07-13T16:13:26Z

test/e2e/storage/non_graceful_node_shutdown_scale.go

+			}
+
+			// Create STS with 'N/2' replicas ( 'N' = Number of nodes in the k8s cluster ).
+			replicaCount := int32(MinNodeRequired / 2)


Am I reading this right? If there are 10 worker nodes in the cluster, then 5 nodes get tainted, we create an STS with a replicaCount of 5, and then we shut down the kubelet on those nodes and taint them. In that case, how exactly will those pods run on the other set of nodes, since there are no nodes without taints?

Good catch. I think that while reworking the code, I accidentally deleted the piece that removes the taint once the STS pods are running.
Let me fix it.
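
The problem and the fix can be illustrated with a small simulation. This is a simplified model, not the test code: the node type and helper functions are hypothetical, and a single boolean stands in for any NoSchedule-style taint.

```go
package main

import "fmt"

// node is a simplified stand-in for a cluster node; tainted means no new
// pod can be scheduled onto it.
type node struct {
	name    string
	tainted bool
}

// newCluster returns n untainted nodes.
func newCluster(n int) []node {
	nodes := make([]node, n)
	for i := range nodes {
		nodes[i] = node{name: fmt.Sprintf("node-%d", i)}
	}
	return nodes
}

// setTaint sets or clears the taint on every node in the slice.
func setTaint(nodes []node, tainted bool) {
	for i := range nodes {
		nodes[i].tainted = tainted
	}
}

// schedulable counts the nodes that can still accept pods.
func schedulable(nodes []node) int {
	count := 0
	for _, n := range nodes {
		if !n.tainted {
			count++
		}
	}
	return count
}

func main() {
	cluster := newCluster(10)
	reserved, active := cluster[:5], cluster[5:]

	setTaint(reserved, true) // taint 50% so the STS pods land on the other half
	fmt.Println("after initial taint:", schedulable(cluster))

	setTaint(active, true) // shut down and taint the nodes running the pods
	fmt.Println("without removing the first taint:", schedulable(cluster))

	setTaint(reserved, false) // the missing step: untaint the reserved half
	fmt.Println("after removing the first taint:", schedulable(cluster))
}
```

Without the untaint step no node is left for the pods to move to, which is the situation described in the question above.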

gnufied · 2023-07-13T16:13:46Z

test/e2e/storage/non_graceful_node_shutdown_scale.go

+				Effect: v1.TaintEffectNoSchedule,
+			}
+			for _, nodeName := range nodesToBeTainted {
+				e2enode.AddOrUpdateTaintOnNode(ctx, c, nodeName.Name, taintNew)


Do we need to remove these taints eventually?

gnufied

We should mark these tests disruptive.

xing-yang · 2023-07-14T15:54:10Z

/test pull-kubernetes-e2e-gce-cos-alpha-features

Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>

YuikoTakada · 2023-07-25T08:18:25Z

/retest

xing-yang · 2023-07-27T12:09:40Z

/hold
We are not going to merge this test as it depends on a specific cloud provider.
Instead, we are going to add an integration test here: #119478

YuikoTakada · 2023-08-03T01:41:10Z

@xing-yang @sonasingh46 Should we close this PR?

xing-yang · 2023-08-03T01:57:55Z

/close

k8s-ci-robot · 2023-08-03T01:58:01Z

@xing-yang: Closed this PR.

In response to this:

/close

k8s-ci-robot added the needs-priority label Jun 24, 2023

k8s-ci-robot requested review from msau42 and saikat-royc June 24, 2023 12:03

k8s-ci-robot assigned xing-yang Jun 28, 2023

k8s-ci-robot added the ok-to-test label Jun 28, 2023

xing-yang mentioned this pull request Jul 6, 2023

Non-graceful node shutdown kubernetes/enhancements#2268

Closed

k8s-ci-robot added the kind/feature label and removed the do-not-merge/needs-kind label Jul 6, 2023

carlory reviewed Jul 7, 2023

test/e2e/storage/non_graceful_node_shutdown_scale.go

sonasingh46 force-pushed the nongraceful_shutdown_scale_test branch from 6f5077b to 2f88d86 Compare July 10, 2023 14:31

k8s-ci-robot assigned gnufied Jul 11, 2023

add scale test for non graceful node shutdown

e214548

Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>

sonasingh46 force-pushed the nongraceful_shutdown_scale_test branch from 2f88d86 to e214548 Compare July 11, 2023 12:22

sonasingh46 changed the title ~~[WIP] add scale test for non graceful node shutdown~~ add scale test for non graceful node shutdown Jul 11, 2023

k8s-ci-robot removed the do-not-merge/work-in-progress label Jul 11, 2023

sonasingh46 commented Jul 11, 2023

sonasingh46 mentioned this pull request Jul 11, 2023

move non-graceful node shutdown to GA #118228

Merged

xing-yang reviewed Jul 11, 2023

YuikoTakada reviewed Jul 13, 2023

test/e2e/storage/non_graceful_node_shutdown_scale.go

YuikoTakada reviewed Jul 13, 2023

test/e2e/storage/non_graceful_node_shutdown_scale.go

sonasingh46 mentioned this pull request Jul 13, 2023

add scale test for node out-of-service-detach feature kubernetes/test-infra#30083

Closed

gnufied reviewed Jul 13, 2023

incorporate review comments

423fdb4

Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>

k8s-ci-robot added the do-not-merge/hold label Jul 27, 2023

k8s-ci-robot closed this Aug 3, 2023
