Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial #111347

Merged
merged 2 commits into from
Jul 22, 2022

Conversation

aojea
Copy link
Member

@aojea aojea commented Jul 22, 2022

/kind bug
/kind failing-test
/kind flake
/kind regression

none

The test breaks the controllers that depend on api services to be resolvable, per example, the namespace controller, that is heavily used by the e2e framework to clean the environment, impacting all the other test that are running on the environment.

This can be verified by checking the logs in the controller-manager in any of the current jobs:

2022-07-22T07:56:34.495444693Z stderr F E0722 07:56:34.495346       1 namespace_controller.go:162] deletion of namespace kubectl-4138 failed: unable to retrieve the complete list of server APIs: e2e-prmvh.example.com/v1alpha1: the server is currently unable to handle the request
2022-07-22T07:56:35.361838874Z stderr F E0722 07:56:35.361677       1 namespace_controller.go:162] deletion of namespace events-7733 failed: unable to retrieve the complete list of server APIs: e2e-prmvh.example.com/v1alpha1: the server is currently unable to handle the request
2022-07-22T07:56:35.390257453Z stderr F E0722 07:56:35.390170       1 namespace_controller.go:162] deletion of namespace nettest-6099 failed: unable to retrieve the complete list of server APIs: e2e-prmvh.example.com/v1alpha1: the server is currently unable to handle the request
2022-07-22T07:56:35.451344876Z stderr F E0722 07:56:35.451217       1 namespace_controller.go:162] deletion of namespace nettest-2475 failed: unable to retrieve the complete list of server APIs: e2e-prmvh.example.com/v1alpha1: the server is currently unable to handle the request
2022-07-22T07:56:35.459872098Z stderr F E0722 07:56:35.459730       1 namespa
wget https://storage.googleapis.com/kubernetes-jenkins/pr-logs/pull/111172/pull-kubernetes-e2e-kind/1550377010985963520/artifacts/kind-control-plane/pods/kube-system_kube-controller-manager-kind-control-plane_e9ec38736496c717a237032e25dc47c8/kube-controller-manager/0.log -O- | grep namespace_controller | grep "failed: unable to retrieve the complete list" | wc
   3411  105741 1087294

This is especially critical for the CSI sig-storage tests, that need to delete the namespace as part of the test, causing that test that use to run on the order of ~ 1 min, take more than 10 minutes

Fixes: #111086, #111247

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. kind/regression Categorizes issue or PR as related to a regression from a prior release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 22, 2022
@aojea
Copy link
Member Author

aojea commented Jul 22, 2022

/assign @smarterclayton @liggitt
/cc @msau42 @chendave
/hold
to verify that it really is the root cause
/sig testing

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2022
@k8s-ci-robot k8s-ci-robot added the sig/testing Categorizes an issue or PR as relevant to SIG Testing. label Jul 22, 2022
@k8s-ci-robot k8s-ci-robot added approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 22, 2022
The test breaks the controllers that depend on api services to be resolvable,
per example, the namespace controller, that is heavily used by the e2e framework to clean the environment
@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. area/conformance Issues or PRs related to kubernetes conformance tests and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jul 22, 2022
@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Jul 22, 2022
@aojea
Copy link
Member Author

aojea commented Jul 22, 2022

/priority critical-urgent

@k8s-ci-robot k8s-ci-robot added priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 22, 2022
@aojea aojea changed the title Conformance test "manage the lifecycle of an APIService" should run in Serial Conformance test "manage the lifecycle of an APIService" is Disruptive and should run in Serial Jul 22, 2022
@aojea
Copy link
Member Author

aojea commented Jul 22, 2022

/hold cancel

This solves the problem, checking current job https://gcsweb.k8s.io/gcs/kubernetes-jenkins/pr-logs/pull/111347/pull-kubernetes-e2e-kind/1550422769034858496/artifacts/

CSI tests run in less than 10 mins

180.749733466 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should not deadlock when a pod's predecessor fails
183.569470362 [It] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Generic Ephemeral-volume (default fs) (immediate-binding)] ephemeral should support expansion of pvcs created for ephemeral pvcs
188.197826182 [It] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Generic Ephemeral-volume (block volmode) (late-binding)] ephemeral should support two pods which have the same volume definition
190.293967806 [It] [sig-storage] CSI mock volume Delegate FSGroup to CSI driver [LinuxOnly] should pass FSGroup to CSI driver if it is set in pod and driver supports VOLUME_MOUNT_GROUP
191.070930325 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should implement legacy replacement when the update strategy is OnDelete
191.302821315 [It] [sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (block volmode)] volumes should store data
192.865364374 [It] [sig-storage] CSI mock volume CSI online volume expansion should expand volume without restarting pod if attach=on, nodeExpansion=on
193.541804979 [It] [sig-storage] CSI mock volume CSI Volume expansion should not expand volume if resizingOnDriver=off, resizingOnSC=on
224.411363925 [It] [sig-storage] CSI mock volume CSI workload information using mock driver contain ephemeral=true when using inline volume
227.605814136 [It] [sig-storage] CSI mock volume CSI workload information using mock driver should be passed when podInfoOnMount=true
230.248551935 [It] [sig-network] Services should be able to up and down services
239.656329671 [It] [sig-node] Pods Extended Pod Container Status should never report container start when an init container fails
246.312045839 [It] [sig-node] Probing container should *not* be restarted by liveness probe because startup probe delays it
249.869986279 [It] [sig-node] Probing container should *not* be restarted with a /healthz http liveness probe [NodeConformance] [Conformance]
250.48737233 [It] [sig-node] Probing container should *not* be restarted with a exec "cat /tmp/health" liveness probe [NodeConformance] [Conformance]
251.348979601 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should perform canary updates and phased rolling updates of template modifications [Conformance]
255.405241453 [It] [sig-node] Probing container should *not* be restarted with a tcp:8080 liveness probe [NodeConformance] [Conformance]
258.599663703 [It] [sig-node] Probing container should *not* be restarted with a GRPC liveness probe [NodeConformance]
261.634964249 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications [Conformance]
264.056242421 [It] [sig-node] Probing container should *not* be restarted with a non-local redirect http liveness probe
278.205636344 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should provide basic identity
412.200752023 [It] [sig-apps] StatefulSet Basic StatefulSet functionality [StatefulSetBasic] should perform rolling updates and roll backs of template modifications with PVCs

and no more namespace_controller errors

cat kube-controller-manager-kind-control-plane_kube-system_kube-controller-manager-2fb959e7764bed8b14d7769854a984f7dd34143862d4454895710ad76035b615.log | grep namespace_controller | grep "failed: unable to retrieve the complete list" | wc
      0       0       0

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 22, 2022
@aojea
Copy link
Member Author

aojea commented Jul 22, 2022

/test pull-kubernetes-conformance-image-test

@liggitt
Copy link
Member

liggitt commented Jul 22, 2022

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 22, 2022
@smarterclayton
Copy link
Contributor

/approve

Expected this as a possibility, although I’m surprised at this impact (shouldn’t the normal parallel suite have demonstrated this)?

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, liggitt, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 22, 2022
@liggitt
Copy link
Member

liggitt commented Jul 22, 2022

This solidifies my opinion that we should expand the existing conformance test which uses a working APIService (Should be able to support the 1.17 Sample API Server using the current Aggregator) to exercise the additional operations we want to do on an APIService, rather than creating a non-working APIService merely to drive CRUD operations

@aojea
Copy link
Member Author

aojea commented Jul 22, 2022

Expected this as a possibility, although I’m surprised at this impact (shouldn’t the normal parallel suite have demonstrated this)?

It flaked on the PR #110237 (comment) that introduced the test, but it is not something easy to detect because our e2e tests are very flexible to "absorb" busy and slows CIs, that a 50% of the time the job passed , taking much time, but passed ... @msau42 was the only one that connected the dots between the timeous on CSI test, the namespace controller errors and this PR

Long history in #111086

@k8s-ci-robot k8s-ci-robot merged commit 9ec582d into kubernetes:master Jul 22, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 22, 2022
@aojea
Copy link
Member Author

aojea commented Jul 22, 2022

The problem is more evident here, because the e2e namespace controller tests are directly impacted and fail as soons this test was promoted and started to run in this job
image

@smarterclayton
Copy link
Contributor

Really points to a lack of synthetic / overall / anomaly detection mechanics in the test suite infra. Wouldn’t be surprised if this would have been caught in an openshift e2e run (due to the extra testing.

The 10m thing surprises me though - the test runs once, and once it’s done new namespaces should clear? Or is namespace controller actually hanging / failing closed?

Agree with Jordan that bringing it into the working apiservice test is better

@smarterclayton
Copy link
Contributor

Also… why is an apiservice that has never “gone green” even visible to clients?

deleting a collection of APIServices via a label selector.
*/
framework.ConformanceIt("should manage the lifecycle of an APIService", func() {
ginkgo.It("should manage the lifecycle of an APIService [Serial][Disruptive]", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not sure this is any more disruptive than the existing apiservice test. Not sure applying disruptive is necessary here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think that can be open for interpretation, from the description of the tag

[Disruptive]: If a test may impact workloads that it didn't create, it should be marked as [Disruptive]. Examples of disruptive behavior include, but are not limited to, restarting components or tainting nodes. Any [Disruptive] test is also assumed to qualify for the [Serial] label, but need not be labeled as both. These tests are not run against soak clusters to avoid restarting components.

it clear impact workloads, but indirectly, because the namespace controller (and I think that the garbage collector controller) will not work correctly , maybe the definition is for direct disruptive actions?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

technically (at least historically), Serial meant “disruptive, but only during the test” (ie adding and removing firewall rules to test services behavior on a node). It’s not an issue here for now, but I’ll review and see whether it’s truly required.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's other reason why we should have Disruptive, and is for job help filtering this test.

As you can see here https://testgrid.k8s.io/sig-network-gce#gci-gce-serial-kube-dns-nodecache , the job is not filtered and is affecting other tests

@onsi
Copy link
Contributor

onsi commented Jul 22, 2022

I'm a bit of an outsider and new to the e2e suite. I agree that a suite that starts running more consistently slowly should be flagged for review. This would have helped catch this particular subtle issue sooner. Doing so can be complex and require additional infrastructure to monitor and compare test run speeds (either within the suite, or outside of the suite).

In this particular case, though, there may be some lower-hanging fruit. The use of tryFunc in the CSI cleanup function was obscuring the actual issue. Rather than have the test fail, it - instead - takes (substantially) longer to run. I know there can be a lot of fear-of-change around complex test suites like this - but I'd offer that removing the tryFunc would provided a tighter feedback loop and surface this sort of issue more efficiently. As it stands, the only reason this was eventually caught is because the slower tests eventually (sometimes - hence flaky) added up and caused the entire suite to timeout.

it is not something easy to detect because our e2e tests are very flexible to "absorb" busy and slows CIs, that a 50% of the time the job passed , taking much time, but passed

I can share some ideas here but they go well beyond the scope of this single problem. I don't know if there's much appetite or capacity to invest in shifting the culture around the e2e suite from "let's build in buffer because it's flaky" to "failures probably signify real code/test issues, not CI/resource issues". I'd like to believe that goal is achievable but concede it may well take a lot of effort!

@smarterclayton
Copy link
Contributor

smarterclayton commented Jul 22, 2022

Generally we don’t buffer for flakiness directly, but we do buffer because the tests are intended to run in a wide range of environments and often in parallel, and no test is truly zero impact to other tests. So in many cases the buffer should be a reasonable amount of time to wait, and bumping the buffer to prevent flakes should be exceptional, but that’s hard to police.

Anything we can do to discourage buffering to avoid flakes (via code or culture) is a great angle to pursue.

@leilajal
Copy link
Contributor

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/conformance Issues or PRs related to kubernetes conformance tests area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. kind/failing-test Categorizes issue or PR as related to a consistently or frequently failing test. kind/flake Categorizes issue or PR as related to a flaky test. kind/regression Categorizes issue or PR as related to a regression from a prior release. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Flake] pull-kubernetes-e2e-kind timeouts and flake
6 participants