Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial #111347

aojea · 2022-07-22T10:03:00Z

/kind bug
/kind failing-test
/kind flake
/kind regression

none

The test breaks the controllers that depend on api services to be resolvable, per example, the namespace controller, that is heavily used by the e2e framework to clean the environment, impacting all the other test that are running on the environment.

This can be verified by checking the logs in the controller-manager in any of the current jobs:

2022-07-22T07:56:34.495444693Z stderr F E0722 07:56:34.495346       1 namespace_controller.go:162] deletion of namespace kubectl-4138 failed: unable to retrieve the complete list of server APIs: e2e-prmvh.example.com/v1alpha1: the server is currently unable to handle the request
2022-07-22T07:56:35.361838874Z stderr F E0722 07:56:35.361677       1 namespace_controller.go:162] deletion of namespace events-7733 failed: unable to retrieve the complete list of server APIs: e2e-prmvh.example.com/v1alpha1: the server is currently unable to handle the request
2022-07-22T07:56:35.390257453Z stderr F E0722 07:56:35.390170       1 namespace_controller.go:162] deletion of namespace nettest-6099 failed: unable to retrieve the complete list of server APIs: e2e-prmvh.example.com/v1alpha1: the server is currently unable to handle the request
2022-07-22T07:56:35.451344876Z stderr F E0722 07:56:35.451217       1 namespace_controller.go:162] deletion of namespace nettest-2475 failed: unable to retrieve the complete list of server APIs: e2e-prmvh.example.com/v1alpha1: the server is currently unable to handle the request
2022-07-22T07:56:35.459872098Z stderr F E0722 07:56:35.459730       1 namespa

wget https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/111172/pull-kubernetes-e2e-kind/1550377010985963520/artifacts/kind-control-plane/pods/kube-system_kube-controller-manager-kind-control-plane_e9ec38736496c717a237032e25dc47c8/kube-controller-manager/0.log -O- | grep namespace_controller | grep "failed: unable to retrieve the complete list" | wc
   3411  105741 1087294

This is especially critical for the CSI sig-storage tests, that need to delete the namespace as part of the test, causing that test that use to run on the order of ~ 1 min, take more than 10 minutes

Fixes: #111086, #111247

aojea · 2022-07-22T10:03:34Z

/assign @smarterclayton @liggitt
/cc @msau42 @chendave
/hold
to verify that it really is the root cause
/sig testing

This reverts commit c493557.

The test breaks the controllers that depend on api services to be resolvable, per example, the namespace controller, that is heavily used by the e2e framework to clean the environment

aojea · 2022-07-22T10:10:22Z

/priority critical-urgent

aojea · 2022-07-22T10:56:17Z

/hold cancel

This solves the problem, checking current job https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/111347/pull-kubernetes-e2e-kind/1550422769034858496/artifacts/

CSI tests run in less than 10 mins

180.749733466 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod&#39;s predecessor fails
183.569470362 [It] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Generic Ephemeral-volume (default fs) (immediate-binding)] ephemeral should support expansion of pvcs created for ephemeral pvcs
188.197826182 [It] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition
190.293967806 [It] [sig-storage] CSI mock volume Delegate FSGroup to CSI driver [LinuxOnly] should pass FSGroup to CSI driver if it is set in pod and driver supports VOLUME_MOUNT_GROUP
191.070930325 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should implement legacy replacement when the update strategy is OnDelete
191.302821315 [It] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] volumes should store data
192.865364374 [It] [sig-storage] CSI mock volume CSI online volume expansion should expand volume without restarting pod if attach=on, nodeExpansion=on
193.541804979 [It] [sig-storage] CSI mock volume CSI Volume expansion should not expand volume if resizingOnDriver=off, resizingOnSC=on
224.411363925 [It] [sig-storage] CSI mock volume CSI workload information using mock driver contain ephemeral=true when using inline volume
227.605814136 [It] [sig-storage] CSI mock volume CSI workload information using mock driver should be passed when podInfoOnMount=true
230.248551935 [It] [sig-network] Services should be able to up and down services
239.656329671 [It] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails
246.312045839 [It] [sig-node] Probing container should *not* be restarted by liveness probe because startup probe delays it
249.869986279 [It] [sig-node] Probing container should *not* be restarted with a /healthz http liveness probe [NodeConformance] [Conformance]
250.48737233 [It] [sig-node] Probing container should *not* be restarted with a exec &#34;cat /tmp/health&#34; liveness probe [NodeConformance] [Conformance]
251.348979601 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should perform canary updates and phased rolling updates of template modifications [Conformance]
255.405241453 [It] [sig-node] Probing container should *not* be restarted with a tcp:8080 liveness probe [NodeConformance] [Conformance]
258.599663703 [It] [sig-node] Probing container should *not* be restarted with a GRPC liveness probe [NodeConformance]
261.634964249 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance]
264.056242421 [It] [sig-node] Probing container should *not* be restarted with a non-local redirect http liveness probe
278.205636344 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity
412.200752023 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs

and no more namespace_controller errors

cat kube-controller-manager-kind-control-plane_kube-system_kube-controller-manager-2fb959e7764bed8b14d7769854a984f7dd34143862d4454895710ad76035b615.log | grep namespace_controller | grep "failed: unable to retrieve the complete list" | wc
      0       0       0

aojea · 2022-07-22T10:56:32Z

/test pull-kubernetes-conformance-image-test

liggitt · 2022-07-22T11:06:46Z

/lgtm
/approve

smarterclayton · 2022-07-22T14:04:53Z

/approve

Expected this as a possibility, although I’m surprised at this impact (shouldn’t the normal parallel suite have demonstrated this)?

k8s-ci-robot · 2022-07-22T14:05:28Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, liggitt, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/conformance/testdata/OWNERS~~ [smarterclayton]
~~test/e2e/apimachinery/OWNERS~~ [aojea,liggitt,smarterclayton]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

liggitt · 2022-07-22T14:09:40Z

This solidifies my opinion that we should expand the existing conformance test which uses a working APIService (Should be able to support the 1.17 Sample API Server using the current Aggregator) to exercise the additional operations we want to do on an APIService, rather than creating a non-working APIService merely to drive CRUD operations

aojea · 2022-07-22T14:14:07Z

Expected this as a possibility, although I’m surprised at this impact (shouldn’t the normal parallel suite have demonstrated this)?

It flaked on the PR #110237 (comment) that introduced the test, but it is not something easy to detect because our e2e tests are very flexible to "absorb" busy and slows CIs, that a 50% of the time the job passed , taking much time, but passed ... @msau42 was the only one that connected the dots between the timeous on CSI test, the namespace controller errors and this PR

Long history in #111086

aojea · 2022-07-22T14:31:41Z

The problem is more evident here, because the e2e namespace controller tests are directly impacted and fail as soons this test was promoted and started to run in this job

smarterclayton · 2022-07-22T14:32:30Z

Really points to a lack of synthetic / overall / anomaly detection mechanics in the test suite infra. Wouldn’t be surprised if this would have been caught in an openshift e2e run (due to the extra testing.

The 10m thing surprises me though - the test runs once, and once it’s done new namespaces should clear? Or is namespace controller actually hanging / failing closed?

Agree with Jordan that bringing it into the working apiservice test is better

smarterclayton · 2022-07-22T14:33:27Z

Also… why is an apiservice that has never “gone green” even visible to clients?

smarterclayton · 2022-07-22T14:45:06Z

test/e2e/apimachinery/aggregator.go

-		deleting a collection of APIServices via a label selector.
-	*/
-	framework.ConformanceIt("should manage the lifecycle of an APIService", func() {
+	ginkgo.It("should manage the lifecycle of an APIService [Serial][Disruptive]", func() {


I’m not sure this is any more disruptive than the existing apiservice test. Not sure applying disruptive is necessary here?

yeah, I think that can be open for interpretation, from the description of the tag

[Disruptive]: If a test may impact workloads that it didn't create, it should be marked as [Disruptive]. Examples of disruptive behavior include, but are not limited to, restarting components or tainting nodes. Any [Disruptive] test is also assumed to qualify for the [Serial] label, but need not be labeled as both. These tests are not run against soak clusters to avoid restarting components.

it clear impact workloads, but indirectly, because the namespace controller (and I think that the garbage collector controller) will not work correctly , maybe the definition is for direct disruptive actions?

technically (at least historically), Serial meant “disruptive, but only during the test” (ie adding and removing firewall rules to test services behavior on a node). It’s not an issue here for now, but I’ll review and see whether it’s truly required.

There's other reason why we should have Disruptive, and is for job help filtering this test.

As you can see here https://testgrid.k8s.io/sig-network-gce#gci-gce-serial-kube-dns-nodecache , the job is not filtered and is affecting other tests

onsi · 2022-07-22T16:37:52Z

I'm a bit of an outsider and new to the e2e suite. I agree that a suite that starts running more consistently slowly should be flagged for review. This would have helped catch this particular subtle issue sooner. Doing so can be complex and require additional infrastructure to monitor and compare test run speeds (either within the suite, or outside of the suite).

In this particular case, though, there may be some lower-hanging fruit. The use of tryFunc in the CSI cleanup function was obscuring the actual issue. Rather than have the test fail, it - instead - takes (substantially) longer to run. I know there can be a lot of fear-of-change around complex test suites like this - but I'd offer that removing the tryFunc would provided a tighter feedback loop and surface this sort of issue more efficiently. As it stands, the only reason this was eventually caught is because the slower tests eventually (sometimes - hence flaky) added up and caused the entire suite to timeout.

it is not something easy to detect because our e2e tests are very flexible to "absorb" busy and slows CIs, that a 50% of the time the job passed , taking much time, but passed

I can share some ideas here but they go well beyond the scope of this single problem. I don't know if there's much appetite or capacity to invest in shifting the culture around the e2e suite from "let's build in buffer because it's flaky" to "failures probably signify real code/test issues, not CI/resource issues". I'd like to believe that goal is achievable but concede it may well take a lot of effort!

smarterclayton · 2022-07-22T21:09:52Z

Generally we don’t buffer for flakiness directly, but we do buffer because the tests are intended to run in a wide range of environments and often in parallel, and no test is truly zero impact to other tests. So in many cases the buffer should be a reasonable amount of time to wait, and bumping the buffer to prevent flakes should be exceptional, but that’s hard to police.

Anything we can do to discourage buffering to avoid flakes (via code or culture) is a great angle to pursue.

leilajal · 2022-07-26T20:09:26Z

/triage accepted

k8s-ci-robot assigned liggitt Jul 22, 2022

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2022

k8s-ci-robot assigned smarterclayton Jul 22, 2022

k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jul 22, 2022

k8s-ci-robot requested review from chendave and msau42 July 22, 2022 10:03

k8s-ci-robot requested review from caesarxuchao and lavalamp July 22, 2022 10:04

aojea added 2 commits July 22, 2022 12:07

Revert "Promote APIService e2e test to Conformance"

38df6fc

This reverts commit c493557.

Tag test should manage the lifecycle of an APIService as Serial

6f681c8

The test breaks the controllers that depend on api services to be resolvable, per example, the namespace controller, that is heavily used by the e2e framework to clean the environment

aojea force-pushed the aggregator_serial branch from 2b7efe8 to 6f681c8 Compare July 22, 2022 10:09

k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Jul 22, 2022

k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 22, 2022

aojea changed the title ~~Conformance test "manage the lifecycle of an APIService" should run in Serial~~ Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial Jul 22, 2022

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 22, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2022

k8s-ci-robot merged commit 9ec582d into kubernetes:master Jul 22, 2022

k8s-ci-robot added this to the v1.25 milestone Jul 22, 2022

aojea mentioned this pull request Jul 22, 2022

[Umbrella Issue] VERY SLOW tests running on presubmits #111247

Closed

smarterclayton reviewed Jul 22, 2022

View reviewed changes

k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 26, 2022

riaankleinhans mentioned this pull request Aug 3, 2022

Revert "e2e: should manage the lifecycle of an APIService" #111675

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial #111347

Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial #111347

aojea commented Jul 22, 2022 •

edited

aojea commented Jul 22, 2022

aojea commented Jul 22, 2022

aojea commented Jul 22, 2022

aojea commented Jul 22, 2022

liggitt commented Jul 22, 2022

smarterclayton commented Jul 22, 2022

k8s-ci-robot commented Jul 22, 2022

liggitt commented Jul 22, 2022

aojea commented Jul 22, 2022 •

edited

aojea commented Jul 22, 2022 •

edited

smarterclayton commented Jul 22, 2022

smarterclayton commented Jul 22, 2022

smarterclayton Jul 22, 2022

aojea Jul 22, 2022

smarterclayton Jul 22, 2022

aojea Jul 24, 2022

onsi commented Jul 22, 2022

smarterclayton commented Jul 22, 2022 •

edited

leilajal commented Jul 26, 2022

Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial #111347

Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial #111347

Conversation

aojea commented Jul 22, 2022 • edited

aojea commented Jul 22, 2022

aojea commented Jul 22, 2022

aojea commented Jul 22, 2022

aojea commented Jul 22, 2022

liggitt commented Jul 22, 2022

smarterclayton commented Jul 22, 2022

k8s-ci-robot commented Jul 22, 2022

liggitt commented Jul 22, 2022

aojea commented Jul 22, 2022 • edited

aojea commented Jul 22, 2022 • edited

smarterclayton commented Jul 22, 2022

smarterclayton commented Jul 22, 2022

smarterclayton Jul 22, 2022

Choose a reason for hiding this comment

aojea Jul 22, 2022

Choose a reason for hiding this comment

smarterclayton Jul 22, 2022

Choose a reason for hiding this comment

aojea Jul 24, 2022

Choose a reason for hiding this comment

onsi commented Jul 22, 2022

smarterclayton commented Jul 22, 2022 • edited

leilajal commented Jul 26, 2022

aojea commented Jul 22, 2022 •

edited

aojea commented Jul 22, 2022 •

edited

aojea commented Jul 22, 2022 •

edited

smarterclayton commented Jul 22, 2022 •

edited