Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial #111347
Conversation
/assign @smarterclayton @liggitt
This reverts commit c493557.
The test breaks the controllers that depend on APIServices being resolvable, for example the namespace controller, which is heavily used by the e2e framework to clean up the environment.
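To make the failure mode concrete: before it can finalize a deleted namespace, the namespace controller must enumerate every discoverable API group, and a registered-but-never-available aggregated APIService makes that discovery call fail, so deletion just keeps retrying. A toy, self-contained simulation of that dependency (the type names and error text here are illustrative, not the real controller code):

```go
package main

import (
	"errors"
	"fmt"
)

// apiService is a stand-in for a registered APIService and whether its
// backing service is reachable.
type apiService struct {
	name      string
	available bool
}

// discoverAll mimics aggregated discovery: it fails if any registered
// APIService is unavailable.
func discoverAll(services []apiService) error {
	for _, s := range services {
		if !s.available {
			return fmt.Errorf("unable to retrieve the complete list of server APIs: %s: service unavailable", s.name)
		}
	}
	return nil
}

// deleteNamespace mimics the namespace controller's finalization loop:
// it can only finish once discovery succeeds for every group.
func deleteNamespace(services []apiService, maxRetries int) error {
	for i := 0; i < maxRetries; i++ {
		if err := discoverAll(services); err == nil {
			return nil // all groups enumerated; namespace can be finalized
		}
	}
	return errors.New("namespace stuck in Terminating: discovery never succeeded")
}

func main() {
	svcs := []apiService{
		{name: "v1.healthy.example.com", available: true},
		{name: "v1alpha1.wardle.example.com", available: false}, // dangling APIService left by the test
	}
	fmt.Println(deleteNamespace(svcs, 3))
}
```

The point of the sketch is that a single dangling APIService turns every subsequent namespace deletion in the suite into a retry loop, which is exactly the cross-test impact described above.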
/priority critical-urgent
/hold cancel

This solves the problem. Checking the current job https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/111347/pull-kubernetes-e2e-kind/1550422769034858496/artifacts/ , the CSI tests run in less than 10 minutes and there are no more namespace_controller errors.
/test pull-kubernetes-conformance-image-test
/lgtm
/approve

Expected this as a possibility, although I’m surprised at the impact (shouldn’t the normal parallel suite have demonstrated this?)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, liggitt, smarterclayton

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
This solidifies my opinion that we should expand the existing conformance test which uses a working APIService (…
It flaked on the PR that introduced the test (#110237 (comment)), but it is not easy to detect because our e2e tests are very flexible at "absorbing" busy and slow CIs: about 50% of the time the job passed, taking much more time, but passed... @msau42 was the only one who connected the dots between the timeouts on the CSI tests, the namespace controller errors, and this PR. Long history in #111086.
Really points to a lack of synthetic / overall / anomaly detection mechanics in the test suite infra. Wouldn’t be surprised if this would have been caught in an openshift e2e run (due to the extra testing). The 10m thing surprises me though: the test runs once, and once it’s done new namespaces should clear? Or is the namespace controller actually hanging / failing closed? Agree with Jordan that bringing it into the working apiservice test is better.
Also… why is an apiservice that has never “gone green” even visible to clients? |
```diff
 	   deleting a collection of APIServices via a label selector.
 	*/
-	framework.ConformanceIt("should manage the lifecycle of an APIService", func() {
+	ginkgo.It("should manage the lifecycle of an APIService [Serial][Disruptive]", func() {
```
I’m not sure this is any more disruptive than the existing apiservice test. Not sure applying disruptive is necessary here?
yeah, I think that can be open to interpretation, from the description of the tag:

[Disruptive]: If a test may impact workloads that it didn't create, it should be marked as [Disruptive]. Examples of disruptive behavior include, but are not limited to, restarting components or tainting nodes. Any [Disruptive] test is also assumed to qualify for the [Serial] label, but need not be labeled as both. These tests are not run against soak clusters to avoid restarting components.

It clearly impacts workloads, but indirectly, because the namespace controller (and I think the garbage collector too) will not work correctly. Maybe the definition is meant for direct disruptive actions?
technically (at least historically), Serial meant “disruptive, but only during the test” (ie adding and removing firewall rules to test services behavior on a node). It’s not an issue here for now, but I’ll review and see whether it’s truly required.
There's another reason why we should have Disruptive: it lets jobs filter out this test.
As you can see here https://testgrid.k8s.io/sig-network-gce#gci-gce-serial-kube-dns-nodecache , the job does not filter it out and it is affecting other tests.
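The filtering point above comes down to how ginkgo-style skip patterns match against a test's full text: jobs skip tests by regexp (e.g. via a `--ginkgo.skip` flag), so the `[Disruptive]` tag only helps if it appears in the test name. A minimal sketch, where the patterns and flag usage are illustrative rather than any job's actual config:

```go
package main

import (
	"fmt"
	"regexp"
)

// shouldSkip reports whether a test's full ginkgo text matches the job's
// skip pattern. Real e2e jobs pass such a pattern on the command line;
// this standalone function just demonstrates the matching.
func shouldSkip(testText, skipPattern string) bool {
	return regexp.MustCompile(skipPattern).MatchString(testText)
}

func main() {
	skip := `\[Disruptive\]|\[Serial\]`
	for _, name := range []string{
		"should manage the lifecycle of an APIService [Serial][Disruptive]",
		"should manage the lifecycle of an APIService",
	} {
		fmt.Printf("%q -> skipped=%v\n", name, shouldSkip(name, skip))
	}
}
```

With the tags in the name, a serial/non-disruptive job can exclude the test; without them (as in the conformance-promoted version), it runs everywhere and affects its neighbors.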
I'm a bit of an outsider and new to the … In this particular case, though, there may be some lower-hanging fruit. The use of …
I can share some ideas here but they go well beyond the scope of this single problem. I don't know if there's much appetite or capacity to invest in shifting the culture around the e2e suite from "let's build in buffer because it's flaky" to "failures probably signify real code/test issues, not CI/resource issues". I'd like to believe that goal is achievable but concede it may well take a lot of effort!
Generally we don’t buffer for flakiness directly, but we do buffer because the tests are intended to run in a wide range of environments and often in parallel, and no test is truly zero impact to other tests. So in many cases the buffer should be a reasonable amount of time to wait, and bumping the buffer to prevent flakes should be exceptional, but that’s hard to police. Anything we can do to discourage buffering to avoid flakes (via code or culture) is a great angle to pursue.
/triage accepted
/kind bug
/kind failing-test
/kind flake
/kind regression
The test breaks the controllers that depend on APIServices being resolvable, for example the namespace controller, which is heavily used by the e2e framework to clean up the environment, impacting all the other tests running in the environment.
This can be verified by checking the logs in the controller-manager in any of the current jobs:
This is especially critical for the sig-storage CSI tests, which need to delete the namespace as part of the test, causing tests that used to run on the order of ~1 minute to take more than 10 minutes.
Fixes: #111086, #111247