
Allow list-resources.sh to continue if a resource fails to list #89664

Merged

Conversation

@spiffxp (Member) commented Mar 30, 2020

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
The cluster/gce/list-resources.sh script is used solely by our CI, specifically by any job that runs kubetest with the --check-leaked-resources flag. Currently, if a single resource fails to list, we fail the entire job.

I think this is too brittle. A review of previous issues on kubernetes/kubernetes related to failures of this script shows that the issues usually resolve themselves, or would have been caught by the before/after diff anyway.

Let's instead allow the script to continue listing all resources, and let kubetest's resource diff fail the job.
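For illustration, here's a minimal bash sketch of the continue-on-failure pattern this PR moves toward (the `list_resource` helper, the resource names, and the gcloud flags are assumptions for illustration, not the script's actual code):

```sh
#!/usr/bin/env bash
# Sketch: warn and keep going when one resource type fails to list,
# instead of letting a single failure abort the whole run.

# PROJECT is assumed to come from the environment, as in CI.
PROJECT="${PROJECT:-my-test-project}"

# Hypothetical helper; the real script's function names and flags differ.
list_resource() {
  local resource="$1"
  echo "[ ${resource} ]"
  if ! gcloud compute "${resource}" list --project="${PROJECT}"; then
    # Previously a failure here would fail the job; now we just note it
    # and move on so the remaining resources still get listed.
    echo "WARNING: failed to list ${resource}" >&2
  fi
}

for resource in instances disks routes firewall-rules; do
  list_resource "${resource}"
done
```

With this shape, a transient API error on one resource type still lets kubetest's before/after diff of the remaining output decide whether the job should fail.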

Special notes for your reviewer:
This will require cherry-picks back to previous release branches for them to benefit. I tried fixing this centrally in kubetest instead, but that wouldn't allow us to fail gracefully and still list the remaining resources if one fails to list.

Does this PR introduce a user-facing change?:

NONE

/priority important-soon

/sig testing
/cc @ixdy @BenTheElder

/sig scalability
/cc @mm4tt
This would be relevant to scalability jobs; e.g., it would have prevented #89573.

/sig release
/cc @droslean
This would be relevant to many of the release-blocking jobs; e.g., it would have prevented #89572.

@k8s-ci-robot k8s-ci-robot added the following labels on Mar 30, 2020: release-note-none, kind/cleanup, size/XS, priority/important-soon, sig/testing, cncf-cla: yes, sig/scalability, sig/release, area/release-eng, sig/cluster-lifecycle, approved
@BenTheElder (Member)

hmm, why bother checking for leaked resources if we're going to ignore it failing?
retries?
not setting this flag?

@ixdy (Member) commented Mar 30, 2020

The only concern would be if things break permanently (not just a transient failure like we usually see) and we never notice because we'd be ignoring the failures.

That said, I'm not sure whether this has actually ever happened, whereas we have known issues of the transient failures causing job failures. This LGTM, though I'll let others chime in.

@spiffxp (Member, author) commented Mar 30, 2020

/retest

> hmm, why bother checking for leaked resources if we're going to ignore it failing?

In practice, when this script has failed, it's been due to some temporary environmental issue which can't be addressed by contributors anyway. So it's not clear to me that failures due to us being unable to list resources are actionable.

> retries?

The script already retries 5 times for each resource.

> not setting this flag?

We care about catching leaked resources; it's caught valid issues in the past (e.g. #88355, #74890, #74417, #70191).
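For context, here's a minimal sketch of what a per-resource retry loop like that can look like (the `list_with_retries` name, the sleep interval, and the gcloud invocation are illustrative assumptions, not the script's exact code):

```sh
#!/usr/bin/env bash
# Sketch: retry a listing command up to 5 times before giving up on
# that one resource, then let the caller decide whether to continue.

list_with_retries() {
  local resource="$1"
  local attempt
  for attempt in 1 2 3 4 5; do
    if gcloud compute "${resource}" list --project="${PROJECT:-my-test-project}"; then
      return 0
    fi
    echo "Attempt ${attempt} to list ${resource} failed; retrying..." >&2
    sleep 2
  done
  echo "Giving up on ${resource} after 5 attempts" >&2
  return 1
}

# Continue to the next resource even if this one ultimately fails.
list_with_retries instances || true
```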

@mm4tt (Contributor) commented Mar 31, 2020

Thanks, @spiffxp!

The recent issues caused a lot of our jobs to turn red even though the underlying tests were passing.
+1 to avoiding that in the future, even at the cost of potentially ignoring some leaked resources for a short time.

@BenTheElder (Member)

> The only concern would be if things break permanently (not just a transient failure like we usually see) and we never notice because we'd be ignoring the failures.

This is my concern ...
Flipping the flag should be easier than PRing a change to the `set +e` everywhere, but it seems everyone else is on board, and code-wise this is not offensive otherwise, sooo

/lgtm
/approve
/hold
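
(Aside for readers unfamiliar with the idiom: `set +e` temporarily disables bash's exit-on-error mode, which is how a script running under `set -e` can survive a failing command. A minimal illustration of the toggle, not this PR's actual diff:)

```sh
#!/usr/bin/env bash
set -e  # strict mode: exit on any error by default

set +e  # temporarily tolerate failures
gcloud compute instances list --project="${PROJECT:-my-test-project}"
rc=$?   # capture the exit code instead of aborting
set -e  # restore strict mode

if [[ ${rc} -ne 0 ]]; then
  echo "Listing failed (rc=${rc}), continuing anyway" >&2
fi
```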

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 31, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spiffxp (Member, author) commented Mar 31, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2020
@spiffxp (Member, author) commented Mar 31, 2020

/retest

@spiffxp (Member, author) commented Apr 1, 2020

/retest

@k8s-ci-robot k8s-ci-robot merged commit 6a552da into kubernetes:master Apr 1, 2020
k8s-ci-robot added a commit that referenced this pull request on Apr 6, 2020: …4-upstream-release-1.17 (Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to)
k8s-ci-robot added a commit that referenced this pull request on Apr 6, 2020: …4-upstream-release-1.18 (Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to)
k8s-ci-robot added a commit that referenced this pull request on Apr 6, 2020: …4-upstream-release-1.16 (Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to)