[Flaky test] [sig-node] Kubelet should correctly account for terminated pods after restart #105686
Introduced by #105527. Maybe we should adjust the timeout at kubernetes/test/e2e_node/restart_test.go line 176 (commit 9804a83).
See the log below; a pod becoming Succeeded may take tens of seconds.
/triage accepted
Yeah, I only tested this locally and not in CI; it seems that in CI the pods take much longer. Timings just need to be tweaked.
/reopen Still a number of flakes since the attached PR merged; I don't think that fixed it.
@ehashman: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@ehashman https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=Kubelet%20should%20correctly%20account%20for%20terminated%20pods%20after%20restart
So this problem should have been fixed by #106371.
From gubernator: failure cluster f3393094bdefc50e4d1f.
Recent failures: 11/14/2021, 6:36:14 AM, ci-kubernetes-node-kubelet-serial.
In all cases, it's timing out. I'll try upping the startTimeout.
/reopen Still flaking.
@ehashman: Reopened this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Luckily the kubelet log is clear as to what's going on. From https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1461404704922669056/artifacts/n1-standard-2-cos-93-16623-39-21-991516df/kubelet.log (flake https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-node-kubelet-serial/1461404704922669056)
The test should work fine if we drop the CPU request a bit; let me tweak it to make up for the possible extra 100m scheduled. I'm not sure what's causing that (aren't these tests serial? what the heck is getting scheduled?), but we should be resilient to it. Also, https://github.com/kubernetes/kubernetes/pull/105926/files#r752688249 needs to be fixed: after that change, this test will no longer detect the regression it was designed to catch. I will revert it and also fix the resource requirements, and that should fix this test.
/milestone v1.23 Need to fix this before test freeze.
Which jobs are flaking?
node-kubelet-serial
Which tests are flaking?
Kubelet should correctly account for terminated pods after restart
Since when has it been flaking?
Since the test was created; it flakes 10–20% of the time. Not entirely clear why; I haven't looked.
Testgrid link
https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=Kubelet%20should%20correctly%20account%20for%20terminated%20pods%20after%20restart
Reason for failure (if possible)
No response
Anything else we need to know?
/assign
Relevant SIG(s)
/sig node