
WIP:attempt to fix timeout flakes #91497

Conversation

brianpursley (Member)

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:

/kind api-change
/kind bug
/kind cleanup
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot (Contributor)

@brianpursley: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added labels do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, size/M, cncf-cla: yes, needs-kind, needs-sig, and needs-priority on May 27, 2020
@k8s-ci-robot added labels area/test, sig/api-machinery, sig/apps, sig/storage, and sig/testing, and removed needs-sig, on May 27, 2020
@brianpursley force-pushed the timeout-flake-investigation branch 2 times, most recently from ff9ee86 to 58ce667, on May 27, 2020 at 19:12
@brianpursley brianpursley changed the title WIP:attempt to fix timeout issues WIP:attempt to fix timeout flakes May 27, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: brianpursley
To complete the pull request process, please assign deads2k
You can assign the PR to them by writing /assign @deads2k in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

@brianpursley: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-integration | a8e0268 | link | /test pull-kubernetes-integration |
| pull-kubernetes-e2e-kind | a8e0268 | link | /test pull-kubernetes-e2e-kind |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cheftako (Member)

/assign @wojtek-t

@wojtek-t (Member)

@brianpursley - do you have an issue describing the exact problem you faced? (not that I'm opposed to the approach here)

@brianpursley (Member, Author)

> @brianpursley - do you have an issue describing the exact problem you faced? (not that I'm opposed to the approach here)

@wojtek-t This PR is more of an experiment to try some things to see if they have any effect on the "timed out waiting for the condition" flakes.

You are welcome to look at what I'm doing, but I'm unassigning you as this is not something I expect to ever merge.
/unassign @wojtek-t

Side note: I have my own cluster now and can reproduce some of the flakes locally, so I shouldn't need to lean on WIP PRs as much; I don't like to needlessly stress the CI. That said, my local machine doesn't reproduce the flakes as reliably as CI does.

I'll try to unpack what I was thinking below:


NamespacedResourceDeleter

My thinking here: when I added some logging, it looked like the timeouts usually occur while deleting namespaces, especially when the namespace contains something storage-related, like PVCs or StatefulSets.

I noticed that NamespacedResourceDeleter ranges through groupVersionResources and performs deleteAllContentForGroupVersionResource one-by-one, but the order in which it deletes things is undefined.

My theory was that the deletion order actually matters, and that kicking off all of the deletes at once in goroutines would give each one a chance to proceed without getting "deadlocked" waiting on a dependent resource to be deleted.
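A minimal sketch of what I mean in Go; deleteOne here stands in for deleteAllContentForGroupVersionResource, and the signature is illustrative rather than the real deleter's:

```go
package deleter

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// deleteAllConcurrently kicks off the per-GVR deletions all at once instead
// of ranging over them one-by-one in undefined map order, so no resource
// kind has to wait behind another to start deleting.
func deleteAllConcurrently(
	gvrs map[schema.GroupVersionResource]struct{},
	namespace string,
	deleteOne func(schema.GroupVersionResource, string) error,
) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for gvr := range gvrs {
		wg.Add(1)
		go func(gvr schema.GroupVersionResource) {
			defer wg.Done()
			if err := deleteOne(gvr, namespace); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(gvr)
	}
	wg.Wait() // every kind gets a chance to make progress concurrently
	return errs
}
```

The real deleter would still need its retry and finalizer-estimate handling on top of this; the point is only that the fan-out removes the implicit ordering.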

Honestly, I haven't been able to prove this or find a clear example of the order mattering in a reproducible way, although I have managed to mess up my local cluster so that a namespace would never delete; I waited 30+ minutes and ultimately had to force-delete it. I wasn't able to pin down the problem in that case, but I plan to try to reproduce it locally again.


persistent_volumes-local e2e test

This test created 50 pods but never deleted them explicitly; it relied on namespace cleanup to remove them. Knowing there is a per-node pod limit (110?), I was thinking this might cause other pod creations to queue up, so I wanted to delete the pods explicitly.

Many other e2e tests do clean up after themselves, and that's probably good resource management during an e2e run, but honestly, if Kubernetes were this fragile, that in itself would be a problem, so I don't think this matters (or at least it shouldn't).
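For what it's worth, the explicit cleanup I mean is just a direct delete of the test's pods rather than leaving them for namespace deletion to reap. A sketch using client-go; the helper name and label selector are illustrative, not from the actual test:

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupTestPods deletes the pods a test created so they stop counting
// against the per-node pod limit, instead of lingering until namespace
// cleanup gets to them.
func cleanupTestPods(ctx context.Context, cs kubernetes.Interface, ns, selector string) error {
	return cs.CoreV1().Pods(ns).DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: selector},
	)
}
```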


DeleteCollectionWorkers

Changing DeleteCollectionWorkers from 1 to 5 was just something I was trying. I don't think it matters, although I do wonder why it is set to only 1.
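For context, this is roughly the fan-out pattern that a DeleteCollectionWorkers-style setting controls; the names below are illustrative, not the actual registry store code. With workers == 1 the whole collection is deleted serially, which is why raising it seemed worth trying:

```go
package registry

import "sync"

// deleteCollection fans the items of a collection out to N workers pulling
// from a shared channel; errors are collected under a mutex.
func deleteCollection(items []string, workers int, deleteItem func(string) error) []error {
	ch := make(chan string)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range ch {
				if err := deleteItem(item); err != nil {
					mu.Lock()
					errs = append(errs, err)
					mu.Unlock()
				}
			}
		}()
	}
	for _, item := range items {
		ch <- item
	}
	close(ch)
	wg.Wait()
	return errs
}
```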

The corresponding change in the registry store code, fixing the TODO comment, I am actually making as a separate PR (#91544), although I don't believe it is the culprit of the timeouts; it's more of a cleanup thing. I am not proposing to change the default workers from 1 to 5 in that PR.

@brianpursley (Member, Author)

@wojtek-t Thanks for xref-ing that issue by the way. What lavalamp is describing might actually be the source of the flakes...

@brianpursley brianpursley deleted the timeout-flake-investigation branch February 2, 2023 01:20