[Flaky Test] Fix multiple pods eviction flaky #94958
|
kubernetes/test/e2e/node/taints.go Line 425 in ac7d944
kubernetes/test/e2e/node/taints.go Line 113 in ac7d944
and the tolerationSeconds field that dictates how long the pod will stay bound to the node after the taint is added according to https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/ So this change seems straightforward. /lgtm |
|
@zhouya0: GitHub didn't allow me to request PR reviews from the following users: ScrapCodes. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/cc @liggitt |
|
@oomichi Can you approve this, or let me know who could review this PR? |
|
I'm not sure this is the right fix. Looking at the logs: Seems the second pod never successfully got scheduled to the node. Need to wait for pods to become ready before applying the taint. |
|
Isn't it enough to check that the pod is in the running state before applying the taint?
// Wait for the pods to be in the running state before eviction happens
framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(cs, pod1))
framework.ExpectNoError(e2epod.WaitForPodRunningInNamespace(cs, pod2))
framework.Logf("Pod2 is running on %v. Tainting Node", nodeName)
// 2. Taint the nodes running those pods with a no-execute taint
ginkgo.By("Trying to apply a taint on the Node")
testTaint := getTestTaint()
e2enode.AddOrUpdateTaintOnNode(cs, nodeName, testTaint)
framework.ExpectNodeHasTaint(cs, nodeName, &testTaint)
defer e2enode.RemoveTaintOffNode(cs, nodeName, testTaint)
What is more surprising is, how can a pod be in |
|
We can see both pods go running around the same time (I just include the b2 logs, b1 right below this line) in the kubelet log: So the wait for running is working as intended. Later, after both pods receive the DELETE, the kubelet patches them back to pending. The time between the delete calls for pods b1 and b2 is about 20 seconds on the kubelet:
There are some errors around missing paths and other innocuous things in the kubelet log. It seems that the pods go pending after they are marked for deletion, but the deletion event is delayed quite a bit until all containers are deleted. I'm concerned that if we adjust the timing, we might be masking something else. The pending transition is, I think, another issue and is possibly documented elsewhere. We probably need sig-node to look at the kubelet logs. |
|
@hasheddan @zhouya0 can we get together and seek out someone on @kubernetes/sig-node-bugs to help us out? We need a root cause analysis to understand why the pods remain in a PENDING state as long as they do; see @michaelgugino's commentary in the last comment here. I posted an assist request here: https://kubernetes.slack.com/archives/C0BP8PW9G/p1605598026147600 Many thanks |
@michaelgugino Sorry for the very late response, but can you help us chase the other issue you pointed out above?
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: pacoxu, zhouya0 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Should we merge #94958 instead? It has a comment and a bigger timeout. |
|
@zhouya0 Can you reply to what @SergeyKanzhelev asked above? |
|
@zhouya0: PR needs rebase. |
Seems someone has fixed this. |
|
@zhouya0: Closed this PR. In response to this:
|
What type of PR is this?
/kind flake
What this PR does / why we need it:
As all the error logs show:
I think the time we wait for pod eviction is simply not long enough, since:
kubernetes/test/e2e/node/taints.go
Lines 423 to 426 in ac7d944
We had better wait for 5*additionalWaitPerDeleteSeconds, right?
Which issue(s) this PR fixes:
Fixes #94931
Special notes for your reviewer:
Does this PR introduce a user-facing change?: