
Allow list-resources.sh to continue if a resource fails to list #89664

Merged

Conversation

@spiffxp (Member) commented Mar 30, 2020

What type of PR is this?
/kind cleanup

What this PR does / why we need it:
The cluster/gce/list-resources.sh script is used solely by our CI, specifically by any job that runs kubetest with the --check-leaked-resources flag. Currently, if a single resource fails to list, we fail the entire job.

I think this is too brittle. A review of previous issues on kubernetes/kubernetes related to failures of this script shows that the issues usually resolve themselves, or would have been caught by the before/after diff anyway.

Let's instead allow the script to continue listing all resources, and let kubetest's resource diff fail the job.
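For illustration, here's a minimal bash sketch of the continue-on-failure pattern this PR moves toward (the `list_resource` helper, the resource names, and the gcloud flags are assumptions for illustration, not the script's actual code):

```sh
#!/usr/bin/env bash
# Sketch: warn and keep going when one resource type fails to list,
# instead of letting a single failure abort the whole run.

# PROJECT is assumed to come from the environment, as in CI.
PROJECT="${PROJECT:-my-test-project}"

# Hypothetical helper; the real script's function names and flags differ.
list_resource() {
  local resource="$1"
  echo "[ ${resource} ]"
  if ! gcloud compute "${resource}" list --project="${PROJECT}"; then
    # Previously a failure here would fail the job; now we just note it
    # and move on so the remaining resources still get listed.
    echo "WARNING: failed to list ${resource}" >&2
  fi
}

for resource in instances disks routes firewall-rules; do
  list_resource "${resource}"
done
```

With this shape, a transient API error on one resource type still lets kubetest's before/after diff of the remaining output decide whether the job should fail.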

Special notes for your reviewer:
This will require cherry-picks back to previous release branches for them to benefit. I tried fixing this centrally in kubetest instead, but that wouldn't allow us to fail gracefully and still list the remaining resources if one fails to list.

Does this PR introduce a user-facing change?:

NONE

/priority important-soon

/sig testing
/cc @ixdy @BenTheElder

/sig scalability
/cc @mm4tt
This would be relevant to scalability jobs; e.g., it would have prevented #89573.

/sig release
/cc @droslean
This would be relevant to many of the release-blocking jobs; e.g., it would have prevented #89572.

@k8s-ci-robot k8s-ci-robot added the following labels on Mar 30, 2020: release-note-none, kind/cleanup, size/XS, priority/important-soon, sig/testing, cncf-cla: yes, sig/scalability, sig/release, area/release-eng, sig/cluster-lifecycle, approved
@BenTheElder (Member)

hmm, why bother checking for leaked resources if we're going to ignore it failing?
retries?
not setting this flag?

@ixdy (Member) commented Mar 30, 2020

The only concern would be if things break permanently (not just a transient failure like we usually see) and we never notice because we'd be ignoring the failures.

That said, I'm not sure whether this has actually ever happened, whereas we have known issues of the transient failures causing job failures. This LGTM, though I'll let others chime in.

@spiffxp (Member, author) commented Mar 30, 2020

/retest

> hmm, why bother checking for leaked resources if we're going to ignore it failing?

In practice, when this script has failed, it's been due to some temporary environmental issue which can't be addressed by contributors anyway. So it's not clear to me that failures due to us being unable to list resources are actionable.

> retries?

The script already retries 5 times for each resource.

> not setting this flag?

We care about catching leaked resources; it's caught valid issues in the past (e.g. #88355, #74890, #74417, #70191).
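For context, here's a minimal sketch of what a per-resource retry loop like that can look like (the `list_with_retries` name, the sleep interval, and the gcloud invocation are illustrative assumptions, not the script's exact code):

```sh
#!/usr/bin/env bash
# Sketch: retry a listing command up to 5 times before giving up on
# that one resource, then let the caller decide whether to continue.

list_with_retries() {
  local resource="$1"
  local attempt
  for attempt in 1 2 3 4 5; do
    if gcloud compute "${resource}" list --project="${PROJECT:-my-test-project}"; then
      return 0
    fi
    echo "Attempt ${attempt} to list ${resource} failed; retrying..." >&2
    sleep 2
  done
  echo "Giving up on ${resource} after 5 attempts" >&2
  return 1
}

# Continue to the next resource even if this one ultimately fails.
list_with_retries instances || true
```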

@mm4tt (Contributor) commented Mar 31, 2020

Thanks, @spiffxp!

The recent issues caused a lot of our jobs to turn red even though the underlying tests were passing.
+1 to avoiding that in the future, even at the cost of potentially ignoring some leaked resources for a short time.

@BenTheElder (Member)

> The only concern would be if things break permanently (not just a transient failure like we usually see) and we never notice because we'd be ignoring the failures.

This is my concern ...
Flipping the flag should be easier than PRing a change to the `set +e` everywhere, but it seems everyone else is on board, and code-wise this is not offensive otherwise, sooo

/lgtm
/approve
/hold
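
(Aside for readers unfamiliar with the idiom: `set +e` temporarily disables bash's exit-on-error mode, which is how a script running under `set -e` can survive a failing command. A minimal illustration of the toggle, not this PR's actual diff:)

```sh
#!/usr/bin/env bash
set -e  # strict mode: exit on any error by default

set +e  # temporarily tolerate failures
gcloud compute instances list --project="${PROJECT:-my-test-project}"
rc=$?   # capture the exit code instead of aborting
set -e  # restore strict mode

if [[ ${rc} -ne 0 ]]; then
  echo "Listing failed (rc=${rc}), continuing anyway" >&2
fi
```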

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 31, 2020
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@spiffxp (Member, author) commented Mar 31, 2020

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 31, 2020
@spiffxp (Member, author) commented Mar 31, 2020

/retest

@spiffxp (Member, author) commented Apr 1, 2020

/retest

@k8s-ci-robot k8s-ci-robot merged commit 6a552da into kubernetes:master Apr 1, 2020
k8s-ci-robot added a commit that referenced this pull request on Apr 6, 2020: …4-upstream-release-1.17 (Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to)
k8s-ci-robot added a commit that referenced this pull request on Apr 6, 2020: …4-upstream-release-1.18 (Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to)
k8s-ci-robot added a commit that referenced this pull request on Apr 6, 2020: …4-upstream-release-1.16 (Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to)