Failure cluster [6140da15...] e2e Flake: Finished pod is deleted before events can be retrieved #114971
Comments
/cc @aojea @BenTheElder interesting 👀. We should not have tests that depend on Events; they are already documented to be best effort. The test must be changed to assert on something different; this test will always be flaky if it relies on events.
/triage accepted
Events are not guaranteed to be delivered (see kubernetes/test/e2e/node/pods.go, lines 276 to 292 at 2f6c4f5).
We should have another way of asserting; this is always going to be flaky.
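One event-free direction, sketched below under stated assumptions (the function name, the use of restart counts as a proxy, and the client wiring are all illustrative, not the actual fix that was adopted), would be to assert on the pod's own status instead of on Events:

```go
package podassert

import (
	"context"
	"fmt"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// AssertPodFinishedCleanly reads the pod's status directly instead of relying
// on Events, which are documented as best effort. It checks that the pod
// reached Succeeded and that no container was ever restarted. Note: restart
// count is only a proxy; it does not literally count pod sandboxes.
func AssertPodFinishedCleanly(ctx context.Context, c kubernetes.Interface, ns, name string) error {
	pod, err := c.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if pod.Status.Phase != v1.PodSucceeded {
		return fmt.Errorf("pod %s/%s is in phase %s, want Succeeded", ns, name, pod.Status.Phase)
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.RestartCount != 0 {
			return fmt.Errorf("container %s restarted %d times", cs.Name, cs.RestartCount)
		}
	}
	return nil
}
```

Restart count is only an approximation for "no extra sandbox was created", so a real replacement assertion might need kubelet- or CRI-level checks instead; this is just one shape a non-Event assertion could take.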
Agreed, it doesn't sound like events are meant to be used for something that requires 100% reliability. Should this even be an e2e test, or would it be better as a unit test in kubelet? There must be some reason this test was created in the first place. A regression, maybe? Without context, it doesn't seem particularly useful. It seems obvious that an extra sandbox should not be created. Is this even something we need to check? It must have happened at some point, I guess.
This issue is labeled with `triage/accepted` but has not been updated in a while, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
Failure cluster 6140da152b810b30d99b
Error text:
Recent failures:
1/9/2023, 3:08:19 PM pr:pull-kubernetes-e2e-kind
1/9/2023, 12:37:26 PM pr:pull-kubernetes-e2e-kind-ipv6
1/9/2023, 9:57:40 AM pr:pull-kubernetes-e2e-kind-ipv6
1/6/2023, 6:35:47 PM pr:pull-kubernetes-e2e-kind-ipv6
1/6/2023, 4:48:42 PM pr:pull-kubernetes-e2e-kind-ipv6
/kind failing-test
/kind flake
/sig node
Additional Information
It looks like the test does the following (roughly sketched in the snippet below):

1. Creates a pod whose only container runs `/bin/true`, so it terminates almost immediately.
2. Waits for the pod to finish.
3. Retrieves the pod's events and checks that no extra sandbox was created.

When this test flakes/fails, it seems like steps 1 and 2 are successful, but it times out on step 3 because apparently the pod no longer exists.
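A minimal sketch of that suspected flow, written against plain client-go rather than the e2e framework; the pod spec, image, timeouts, and field selector here are illustrative assumptions, not the actual test code in test/e2e/node/pods.go:

```go
// Sketch: create a pod that exits immediately, wait for it to reach a terminal
// phase, then list its Events. If something deletes the pod between steps, the
// event lookup in step 3 is where the flake would surface.
package main

import (
	"context"
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	c := kubernetes.NewForConfigOrDie(cfg)
	ns := "default"
	ctx := context.Background()

	// Step 1: create a pod whose only container runs /bin/true and exits.
	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "true-pod"},
		Spec: v1.PodSpec{
			RestartPolicy: v1.RestartPolicyNever,
			Containers: []v1.Container{{
				Name:    "true",
				Image:   "busybox",
				Command: []string{"/bin/true"},
			}},
		},
	}
	if _, err := c.CoreV1().Pods(ns).Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Step 2: wait for the pod to reach a terminal phase.
	err = wait.PollUntilContextTimeout(ctx, 2*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			p, err := c.CoreV1().Pods(ns).Get(ctx, pod.Name, metav1.GetOptions{})
			if err != nil {
				// If the pod were already deleted here, the test could not proceed.
				return false, err
			}
			return p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed, nil
		})
	if err != nil {
		panic(err)
	}

	// Step 3: list the pod's Events. If pod GC (or anything else) removed the
	// pod first, there may be nothing useful left to assert on.
	selector := fmt.Sprintf("involvedObject.name=%s,involvedObject.namespace=%s", pod.Name, ns)
	events, err := c.CoreV1().Events(ns).List(ctx, metav1.ListOptions{FieldSelector: selector})
	if err != nil {
		panic(err)
	}
	fmt.Printf("got %d events for pod %s\n", len(events.Items), pod.Name)
}
```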
This makes me wonder if, under certain load conditions, the pod GC is cleaning up the terminated pod before step 3 (or even step 2) can do its thing.
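For context, my understanding of the terminated-pod GC rule in kube-controller-manager is roughly: once the number of terminated (Succeeded or Failed) pods exceeds `--terminated-pod-gc-threshold`, the oldest ones beyond the threshold are deleted. A simplified sketch of that rule, not the actual PodGC controller code:

```go
// Simplified sketch of the terminated-pod GC threshold behaviour; the real
// controller does more, this only illustrates the rule discussed above.
package podgc

import (
	"sort"

	v1 "k8s.io/api/core/v1"
)

// terminatedPodsToDelete returns the terminated pods that would be deleted for
// a given --terminated-pod-gc-threshold value: everything beyond the newest
// `threshold` terminated pods, oldest first.
func terminatedPodsToDelete(pods []v1.Pod, threshold int) []v1.Pod {
	var terminated []v1.Pod
	for _, p := range pods {
		if p.Status.Phase == v1.PodSucceeded || p.Status.Phase == v1.PodFailed {
			terminated = append(terminated, p)
		}
	}
	if len(terminated) <= threshold {
		return nil
	}
	// Oldest terminated pods are deleted first.
	sort.Slice(terminated, func(i, j int) bool {
		return terminated[i].CreationTimestamp.Before(&terminated[j].CreationTimestamp)
	})
	return terminated[:len(terminated)-threshold]
}
```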
It looks like the default value for `--terminated-pod-gc-threshold` is 12500. That seems high enough that it would be unlikely for the threshold to be reached (could there really be 12500 completed pods?), but I wonder if something in the e2e tests is lowering it. I noticed this line which sets `TERMINATED_POD_GC_THRESHOLD` to 100, but that is GCE-specific. Is it possible something similar is being set to lower the threshold for the e2e-kind tests? I did a search but didn't see anywhere that it was being set.
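One way to check this on a running kind cluster would be to look at the flags the kube-controller-manager static pod was started with. A diagnostic sketch, assuming the usual `component=kube-controller-manager` label that kubeadm-based clusters (including kind) apply; this is not part of the e2e test itself:

```go
// Diagnostic sketch: print any --terminated-pod-gc-threshold flag found on the
// cluster's kube-controller-manager pod(s).
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	c := kubernetes.NewForConfigOrDie(cfg)

	pods, err := c.CoreV1().Pods("kube-system").List(context.Background(),
		metav1.ListOptions{LabelSelector: "component=kube-controller-manager"})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		for _, ctr := range p.Spec.Containers {
			// The flag may appear in either Command or Args depending on the manifest.
			for _, arg := range append(ctr.Command, ctr.Args...) {
				if strings.Contains(arg, "terminated-pod-gc-threshold") {
					fmt.Printf("%s: %s\n", p.Name, arg)
				}
			}
		}
	}
}
```

If the flag does not show up there, the controller-manager is presumably running with the 12500 default, which would make pod GC a less likely explanation for the flake.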