add scale test for non graceful node shutdown #118848
Conversation
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thank you for the PR.
/ok-to-test
/kind test
@xing-yang: The label(s) In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/kind feature
Do we need some sig-scalability members' review?
Force-pushed from 6f5077b to 2f88d86
/assign @gnufied
Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>
Force-pushed from 2f88d86 to e214548
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: sonasingh46. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
1. Deploys a gce-pd csi driver.
2. Creates a gce-pd csi storage class.
3. Taints 50% of nodes out of 'N' nodes.
4. Creates a stateful set with "N/2" number of replicas.
With the current implementation, the number of pods we can create for this test is always limited by the number of nodes, because the STS replica pods spread across the nodes.
For example, with 10 nodes the STS can only have 5 replica pods, and then the nodes where these pods got scheduled can be shut down to move them to the other 5 healthy nodes.
While I was working on this, one idea that came up was the following (see the sketch after this comment):
- Say we have 'N' nodes in the k8s cluster.
- Create 'K' statefulsets, each with only 1 or 2 replicas (let the replica count be represented by 'RC').
- Keep at least 'RC' nodes available (these are nodes that do not go through shutdown in the test).
This way the number of pods will not be constrained by the number of nodes.
cc @xing-yang
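A minimal sketch of that idea, assuming helper names similar to the ones in this PR (createAndVerifySTS, scName, ns); the statefulset naming and label scheme below are illustrative only, not the PR's implementation:

```go
import (
	"context"
	"fmt"

	clientset "k8s.io/client-go/kubernetes"
)

// createScaleStatefulSets is an illustrative helper (not part of this PR) showing
// how 'K' statefulsets with a small replica count decouple the pod count from the
// node count: only 'RC' healthy nodes are needed to receive the rescheduled pods.
func createScaleStatefulSets(ctx context.Context, c clientset.Interface, ns, scName string, k int) {
	replicaCount := int32(2) // 'RC'
	for i := 0; i < k; i++ {
		stsName := fmt.Sprintf("scale-sts-%d", i)   // hypothetical naming scheme
		labels := map[string]string{"app": stsName} // one label set per STS so its pods can be tracked
		createAndVerifySTS(ctx, scName, stsName, ns, &replicaCount, labels, c)
	}
}
```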
framework.Logf("Failed to list node: %v", err)
}
if len(nodeListCache.Items) < MinNodeRequired {
ginkgo.Skip("At least 2 nodes are required to run the test")
"2" should be replaced by MinNodeRequired
.
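A minimal sketch of that fix, reusing the MinNodeRequired constant and nodeListCache variable from the quoted code so the skip message cannot drift from the actual threshold:

```go
if len(nodeListCache.Items) < MinNodeRequired {
	// Build the message from the constant instead of hard-coding "2".
	ginkgo.Skip(fmt.Sprintf("At least %d nodes are required to run the test", MinNodeRequired))
}
```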
e2enode.RemoveTaintOffNode(ctx, c, oldNodeName, taint)
}

// Verify that pods gets scheduled to the older nodes that was terminated non gracefully and now
gets -> get
was -> were
is -> are
})
})

// createAndVerifyStatefulDeployment creates a statefulset
createAndVerifyStatefulDeployment -> createAndVerifySTS
createAndVerifySTS(ctx, scName, newSTSName, ns, &newReplicaCount, newPodLabels, c)
})
})
})
Do these pods and PVCs get cleaned up at the end of testing?
As per the current implementation, they do not. I will add that logic. Also, is the cleanup required?
Yes, typically the resources created for e2e tests should be cleaned up at the end of the tests.
ACK. Making the changes.
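A minimal sketch of such cleanup, assuming ginkgo's DeferCleanup and the newSTSName/ns/c names from this PR; the cleanup the author actually added may differ:

```go
// Register cleanup right after the resources are created, so it runs even if
// the test fails partway through.
ginkgo.DeferCleanup(func(ctx context.Context) {
	// Delete the statefulset created by the test.
	err := c.AppsV1().StatefulSets(ns).Delete(ctx, newSTSName, metav1.DeleteOptions{})
	framework.ExpectNoError(err, "failed to delete statefulset %s", newSTSName)

	// Delete the PVCs created from the statefulset's volume claim templates.
	err = c.CoreV1().PersistentVolumeClaims(ns).DeleteCollection(ctx,
		metav1.DeleteOptions{}, metav1.ListOptions{})
	framework.ExpectNoError(err, "failed to delete PVCs in namespace %s", ns)
})
```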
}
})

ginkgo.Describe("[NonGracefulNodeShutdown] pod that uses a persistent volume via gce pd driver", func() {
This looks like a disruptive test, and if run in parallel with other tests it could cause problems for them.
Will mark the test disruptive.
At line no 71, the top-level Describe is marked Disruptive:
utils.SIGDescribe("[Feature:NodeOutOfServiceVolumeDetach] [Scalability] [Disruptive] [LinuxOnly] NonGracefulNodeShutdown", func() {
Do we still need to explicitly put Disruptive in here?
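For context, my understanding (an assumption, not something stated in this PR) is that the e2e runner selects tests by matching regexes such as --ginkgo.skip=\[Disruptive\] against the full spec text, which concatenates all enclosing Describe/It strings, so a tag on the outer Describe already covers the nested specs. A rough sketch:

```go
utils.SIGDescribe("[Feature:NodeOutOfServiceVolumeDetach] [Scalability] [Disruptive] [LinuxOnly] NonGracefulNodeShutdown", func() {
	ginkgo.Describe("pod that uses a persistent volume via gce pd driver", func() {
		// Hypothetical It block: its full spec name includes the outer "[Disruptive]"
		// tag, so jobs that skip disruptive tests skip it without a second tag here.
		ginkgo.It("should reschedule pods after a non graceful node shutdown", func(ctx context.Context) {
			// test body
		})
	})
})
```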
}

// Create STS with 'N/2' replicas ( 'N' = Number of nodes in the k8s cluster ).
replicaCount := int32(MinNodeRequired / 2)
Am I reading this right? If there are 10 worker nodes in the cluster, then 5 nodes get tainted, then we create the STS with a replicaCount of 5, and then we shut down kubelet on those nodes and taint them. In that case, how exactly will those pods run on another set of nodes, since there are no nodes without taints?
Good catch. I think while working on the code I accidentally deleted the piece of code that removes the taint after the STS pods get to Running.
Let me fix it.
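A minimal sketch of the intended flow, assuming the taintNew, nodesToBeTainted, and createAndVerifySTS names from the quoted code (the restored code may differ): taint half of the nodes so the STS lands on the other half, wait for the pods to be Running, then remove the taint so the first half can accept the pods after the non-graceful shutdown.

```go
// 1. Taint half of the nodes so the STS pods are forced onto the other half.
for _, node := range nodesToBeTainted {
	e2enode.AddOrUpdateTaintOnNode(ctx, c, node.Name, taintNew)
}

// 2. Create the STS and wait until all replicas are Running on the untainted nodes.
createAndVerifySTS(ctx, scName, stsName, ns, &replicaCount, podLabels, c)

// 3. Remove the taints, so that when the nodes currently running the pods are shut
//    down non-gracefully, the previously tainted nodes can accept the rescheduled pods.
for _, node := range nodesToBeTainted {
	e2enode.RemoveTaintOffNode(ctx, c, node.Name, taintNew)
}
```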
Effect: v1.TaintEffectNoSchedule,
}
for _, nodeName := range nodesToBeTainted {
e2enode.AddOrUpdateTaintOnNode(ctx, c, nodeName.Name, taintNew)
Do we need to remove these taints eventually?
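One way to guarantee that, a sketch assuming ginkgo's DeferCleanup is available in this suite: register taint removal immediately after adding the taints, so it runs even if the test fails early.

```go
for _, node := range nodesToBeTainted {
	e2enode.AddOrUpdateTaintOnNode(ctx, c, node.Name, taintNew)
}
// Undo the taints at the end of the spec, regardless of how it exits.
ginkgo.DeferCleanup(func(ctx context.Context) {
	for _, node := range nodesToBeTainted {
		e2enode.RemoveTaintOffNode(ctx, c, node.Name, taintNew)
	}
})
```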
We should mark these tests disruptive.
/test pull-kubernetes-e2e-gce-cos-alpha-features
Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>
/retest
/hold
@xing-yang @sonasingh46 close this PR?
/close
@xing-yang: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Signed-off-by: Ashutosh Kumar <sonasingh46@gmail.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a scale test for non-graceful node shutdown.
Ref:
Blog Link: https://kubernetes.io/blog/2022/05/20/kubernetes-1-24-non-graceful-node-shutdown-alpha/
Feature PR Link: #108486
Issue Link : kubernetes/enhancements#2268
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: