WIP: attempt to fix timeout flakes #91497
Conversation
@brianpursley: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected; please follow our release note process to remove it. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Force-pushed from ff9ee86 to 58ce667
Force-pushed from 58ce667 to 57d7e39
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: brianpursley. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 57d7e39 to a8e0268
@brianpursley: The following tests failed, say `/retest` to rerun all failed tests:
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/assign @wojtek-t
@brianpursley - do you have any description of the exact problem you faced? (Not that I'm opposed to the approach here.)
@wojtek-t This PR is more of an experiment to try some things and see whether they have any effect on the "timed out waiting for the condition" flakes. You are welcome to look at what I'm doing, but I'm unassigning you, as this is not something I expect to ever merge.

Side note: I have my own cluster now and am able to reproduce some of the flakes locally, so I should not need to use WIP PRs as much; I don't like to needlessly stress the CI. That said, testing on my local machine doesn't reproduce the flakes as reliably. I'll try to unpack what I was thinking below.

NamespacedResourceDeleter

When I added some logging, it looked like timeouts usually occur when deleting namespaces, especially if the namespace contains something storage-related, like PVCs or StatefulSets. A theory I had is that the order in which things are deleted actually matters, and that kicking off all of the deletes at once in goroutines (sketched below, after this comment) would give them all a chance to proceed without being "deadlocked" waiting for a dependent resource to be deleted. Honestly, I haven't been able to prove or find a clear example of the order mattering in a reproducible way, although I have managed to mess up my local cluster in a way that a namespace would never delete; I waited 30+ minutes and ultimately had to force delete it. In that case I wasn't able to pin down the problem, but I plan on trying to reproduce it again locally.

persistent_volumes-local e2e test

This test created 50 pods but never deleted them, so it just relied on namespace cleanup to delete them. Knowing there is a per-node pod limit (110?), I was thinking that this might cause other pod creation to queue up, so I wanted to delete the pods explicitly. Many other e2e tests do their own cleanup, and it is probably a good idea for resource management during e2e runs, but honestly, if Kubernetes were this fragile, that in itself would be a problem, so I really don't think this matters (or at least it shouldn't).

DeleteCollectionWorkers

Changing DeleteCollectionWorkers from 1 to 5 was just something I was trying; I don't think it matters, although I do wonder why it is set to only 1. The corresponding change in the registry store code, fixing the TODO comment, I am making as a separate PR (#91544), although I don't believe it is the culprit of the timeouts; it's more of a cleanup. And I am not proposing to change the default workers from 1 to 5 in that PR.
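To make the first idea concrete, here is a minimal sketch of fanning the per-resource deletes out into goroutines instead of looping over them sequentially, so no single blocked resource type holds up the rest. This is not the actual PR diff; `deleteResource`, `deleteAllConcurrently`, and the resource names are placeholders for whatever the namespace deleter really calls.

```go
package main

import (
	"fmt"
	"sync"
)

// deleteResource stands in for a per-resource deletion call; in the real
// deleter this would issue a delete against the API server for one
// group-version-resource. Hypothetical name and behavior.
func deleteResource(gvr string) error {
	fmt.Println("deleting all", gvr)
	return nil
}

// deleteAllConcurrently kicks off every per-resource delete at once and
// waits for them all, collecting any errors, rather than deleting one
// resource type at a time in a fixed order.
func deleteAllConcurrently(gvrs []string) []error {
	var wg sync.WaitGroup
	errCh := make(chan error, len(gvrs))
	for _, gvr := range gvrs {
		wg.Add(1)
		go func(gvr string) {
			defer wg.Done()
			if err := deleteResource(gvr); err != nil {
				errCh <- fmt.Errorf("%s: %w", gvr, err)
			}
		}(gvr)
	}
	wg.Wait()
	close(errCh)
	var errs []error
	for err := range errCh {
		errs = append(errs, err)
	}
	return errs
}

func main() {
	errs := deleteAllConcurrently([]string{"pods", "persistentvolumeclaims", "statefulsets"})
	fmt.Println("errors:", errs)
}
```

And for the e2e cleanup point, explicit pod deletion might look something like the following, assuming client-go's `DeleteCollection` method; `cleanupTestPods` and the label selector are hypothetical, and the real test would use whatever labels its pods actually carry.

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupTestPods deletes the pods a test created instead of leaving them
// for namespace teardown to find, freeing per-node pod capacity sooner.
func cleanupTestPods(ctx context.Context, cs kubernetes.Interface, ns string) error {
	return cs.CoreV1().Pods(ns).DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: "test=persistent-volumes-local"},
	)
}
```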
@wojtek-t Thanks for xref-ing that issue by the way. What lavalamp is describing might actually be the source of the flakes...
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: