
WIP:attempt to fix timeout flakes #91497

Conversation

brianpursley (Member)

What type of PR is this?

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespace from that line:

/kind api-change
/kind bug
/kind cleanup
/kind deprecation
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot (Contributor)

@brianpursley: Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot added labels do-not-merge/work-in-progress, do-not-merge/release-note-label-needed, size/M, cncf-cla: yes, needs-kind, needs-sig, and needs-priority on May 27, 2020
@k8s-ci-robot added labels area/test, sig/api-machinery, sig/apps, sig/storage, and sig/testing, and removed needs-sig, on May 27, 2020
@brianpursley force-pushed the timeout-flake-investigation branch 2 times, most recently from ff9ee86 to 58ce667, on May 27, 2020 at 19:12
@brianpursley brianpursley changed the title WIP:attempt to fix timeout issues WIP:attempt to fix timeout flakes May 27, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: brianpursley
To complete the pull request process, please assign deads2k
You can assign the PR to them by writing /assign @deads2k in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot (Contributor)

@brianpursley: The following tests failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-integration | a8e0268 | link | /test pull-kubernetes-integration |
| pull-kubernetes-e2e-kind | a8e0268 | link | /test pull-kubernetes-e2e-kind |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@cheftako (Member)

/assign @wojtek-t

@wojtek-t (Member)

@brianpursley - do you have an issue describing the exact problem you faced? (not that I'm opposed to the approach here)

@brianpursley (Member, Author)

> @brianpursley - do you have an issue describing the exact problem you faced? (not that I'm opposed to the approach here)

@wojtek-t This PR is more of an experiment to try some things to see if they have any effect on the "timed out waiting for the condition" flakes.

You are welcome to look at what I'm doing, but I'm unassigning you as this is not something I expect to ever merge.
/unassign @wojtek-t

Side note: I have my own cluster now and can reproduce some of the flakes locally, so I shouldn't need to lean on WIP PRs as much; I don't like to needlessly stress the CI. That said, my local machine doesn't reproduce the flakes as reliably as CI does.

I'll try to unpack what I was thinking below:


NamespacedResourceDeleter

My thinking here: when I added some logging, it looked like the timeouts usually occur while deleting namespaces, especially when the namespace contains something storage-related, like PVCs or StatefulSets.

I noticed that NamespacedResourceDeleter ranges through groupVersionResources and performs deleteAllContentForGroupVersionResource one-by-one, but the order in which it deletes things is undefined.

My theory was that the deletion order actually matters, and that kicking off all of the deletes at once in goroutines would give each one a chance to proceed without getting "deadlocked" waiting on a dependent resource to be deleted.
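A minimal sketch of what I mean in Go; deleteOne here stands in for deleteAllContentForGroupVersionResource, and the signature is illustrative rather than the real deleter's:

```go
package deleter

import (
	"sync"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// deleteAllConcurrently kicks off the per-GVR deletions all at once instead
// of ranging over them one-by-one in undefined map order, so no resource
// kind has to wait behind another to start deleting.
func deleteAllConcurrently(
	gvrs map[schema.GroupVersionResource]struct{},
	namespace string,
	deleteOne func(schema.GroupVersionResource, string) error,
) []error {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for gvr := range gvrs {
		wg.Add(1)
		go func(gvr schema.GroupVersionResource) {
			defer wg.Done()
			if err := deleteOne(gvr, namespace); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
			}
		}(gvr)
	}
	wg.Wait() // every kind gets a chance to make progress concurrently
	return errs
}
```

The real deleter would still need its retry and finalizer-estimate handling on top of this; the point is only that the fan-out removes the implicit ordering.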

Honestly, I haven't been able to prove this or find a clear example of the order mattering in a reproducible way, although I have managed to mess up my local cluster so that a namespace would never delete; I waited 30+ minutes and ultimately had to force-delete it. I wasn't able to pin down the problem in that case, but I plan to try to reproduce it locally again.


persistent_volumes-local e2e test

This test created 50 pods but never deleted them explicitly; it relied on namespace cleanup to remove them. Knowing there is a per-node pod limit (110?), I was thinking this might cause other pod creations to queue up, so I wanted to delete the pods explicitly.

Many other e2e tests do clean up after themselves, and that's probably good resource management during an e2e run, but honestly, if Kubernetes were this fragile, that in itself would be a problem, so I don't think this matters (or at least it shouldn't).
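For what it's worth, the explicit cleanup I mean is just a direct delete of the test's pods rather than leaving them for namespace deletion to reap. A sketch using client-go; the helper name and label selector are illustrative, not from the actual test:

```go
package e2e

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// cleanupTestPods deletes the pods a test created so they stop counting
// against the per-node pod limit, instead of lingering until namespace
// cleanup gets to them.
func cleanupTestPods(ctx context.Context, cs kubernetes.Interface, ns, selector string) error {
	return cs.CoreV1().Pods(ns).DeleteCollection(ctx,
		metav1.DeleteOptions{},
		metav1.ListOptions{LabelSelector: selector},
	)
}
```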


DeleteCollectionWorkers

Changing DeleteCollectionWorkers from 1 to 5 was just something I was trying. I don't think it matters, although I do wonder why it is set to only 1.
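For context, this is roughly the fan-out pattern that a DeleteCollectionWorkers-style setting controls; the names below are illustrative, not the actual registry store code. With workers == 1 the whole collection is deleted serially, which is why raising it seemed worth trying:

```go
package registry

import "sync"

// deleteCollection fans the items of a collection out to N workers pulling
// from a shared channel; errors are collected under a mutex.
func deleteCollection(items []string, workers int, deleteItem func(string) error) []error {
	ch := make(chan string)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range ch {
				if err := deleteItem(item); err != nil {
					mu.Lock()
					errs = append(errs, err)
					mu.Unlock()
				}
			}
		}()
	}
	for _, item := range items {
		ch <- item
	}
	close(ch)
	wg.Wait()
	return errs
}
```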

The corresponding change in the registry store code, fixing the TODO comment, I am actually making as a separate PR (#91544), although I don't believe it is the culprit of the timeouts; it's more of a cleanup thing. I am not proposing to change the default workers from 1 to 5 in that PR.

@brianpursley (Member, Author)

@wojtek-t Thanks for xref-ing that issue by the way. What lavalamp is describing might actually be the source of the flakes...

@brianpursley brianpursley deleted the timeout-flake-investigation branch February 2, 2023 01:20