Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GC TestCustomResourceCascadingDeletion and TestMixedRelationships flakes #50022

Closed
liggitt opened this issue Aug 2, 2017 · 10 comments · Fixed by #50291
Closed

GC TestCustomResourceCascadingDeletion and TestMixedRelationships flakes #50022

liggitt opened this issue Aug 2, 2017 · 10 comments · Fixed by #50291
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.

Comments

@liggitt liggitt added kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. labels Aug 2, 2017
@liggitt liggitt changed the title GC TestCustomResourceCascadingDeletion/TestMixedRelationships flakes GC TestCustomResourceCascadingDeletion and TestMixedRelationships flakes Aug 2, 2017
@ironcladlou
Copy link
Contributor

I was able to reproduce this locally (although it's very hard to trigger). Initial thoughts are that it's different from #49966; in this case, the actual sync is instant (millseconds) but the CRD created in the test fails to appear in discovery calls over the course of repeated attempts every 5 seconds for the remainder of the test. Normally, the type shows up in discovery almost immediately after creation.

@tcharding
Copy link
Contributor

Looks like PR #50023 might be hitting a related flaky test. I get 6 failures in 60 runs from the command

stress ./garbagecollector.test -test.run=TestCustomResourceCascadingDeletion

I'm super new around here so thought it prudent to add the information I have found before attempting to dig deeper.

I followed instructions in flaky-tests.md to build garbagecollector.test from within directory
k8s.io/kubernetes/test/intergration/garbagecollector/

If the test failure logs are of any use to anyone just message and I'll post one of them.

@ironcladlou
Copy link
Contributor

@tcharding Thanks. I've also been able to reproduce locally and am debugging it today. I did have to hack up the integration test setup to get apiserver logs in this context. I've also opened #50035 to explore making the integration test script more configurable with regards to logging.

@ironcladlou
Copy link
Contributor

I was wrong about the discovery aspect; the CRD is coming back from discovery, but the restMapper fails to map it. I also notice we're using DiscoveryInterface.ServerPreferredResources which is not actually implemented with a cache in memCacheClient.

@ironcladlou
Copy link
Contributor

@caesarxuchao okay, now I'm getting somewhere. I made a test in a new branch which exercises discovery and REST mapping without the GC in the picture at all, and have been able to reproduce the flake 1 time out of 40 runs so far with stress.

@ironcladlou
Copy link
Contributor

First problem is pretty clear in hindsight; we reset the restmapper, then get deletable resources using the discovery client directly, then sync based on the resources from discovery. The direct discovery call doesn't use the client's internal cache, and so the restmapper might be out of sync with the subsequent discovery call, causing us to sync with resources not yet represented in the restmapper's cache.

That explains the isolated flake in the test I made, but doesn't yet explain why the restmapper would continue to fail on subsequent sync calls (since the nest restmapper reset would usually bring it up to date with the resources returned from the direct discovery call).

@ironcladlou
Copy link
Contributor

I'm out tomorrow, so a quick update here in case somebody else looks into it...

My branch can reproduce the issue where DeferredDiscoveryRESTMapper sometimes becomes unable to map the CRD resource, even after the type is available from discovery. I've also ruled out the memCacheClient by replacing it with a non-caching implementation like:

type uncachedDiscoveryClient struct {
	discovery.DiscoveryInterface
}

func (_ *uncachedDiscoveryClient) Fresh() bool { return true }
func (_ *uncachedDiscoveryClient) Invalidate() {}

And using that in the setup like:

discoveryClient := &uncachedDiscoveryClient{clientSet.Discovery()}

The issue can still be reproduced that way. Looks like more debugging of DeferredDiscoveryRESTMapper itself is required.

@ironcladlou
Copy link
Contributor

Finally tracked this down. Custom resource kinds generated in the tests are always given a plural suffix of "s". If the singular name ends in "s", the rest mapper assumes the plural form is an "es" suffix, which will never match the generated kind. This explains the intermittent nature of the failure (the randomly generated names often happen to match the assumptions of the rest mapper).

@ironcladlou
Copy link
Contributor

#49948 is the root cause

@tcharding
Copy link
Contributor

Good effort @ironcladlou

monopole pushed a commit to monopole/kubernetes that referenced this issue Aug 8, 2017
Automatic merge from submit-queue

Change test to work around restmapper pluralization bug

Fixes kubernetes#50022

Works around the pluralization bug to unblock the queue.
Once the restmapper bug is fixed in kubernetes#50012, we should add tests specifically for unconventional singular/plural word endings.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants