New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow list-resources.sh to continue if a resource fails to list #89664
Allow list-resources.sh to continue if a resource fails to list #89664
Conversation
The list-resources.sh script is used solely by our CI, specifically kubernetes/test-infra/kubetest with the --check-leaked-resources flag. Currently if a single resource fails to list, we fail the entire job. I think this is too brittle. A review of previous issues on kubernetes/kubernetes that relate to failure of this script shows that the issues usually resolve themselves, or would be caught by the diff of before/after. Let's instead allow the script to continue listing all resources, and let kubetest's resource diff fail the job.
hmm, why bother checking for leaked resources if we're going to ignore it failing? |
The only concern would be if things break permanently (not just a transient failure like we usually see) and we never notice because we'd be ignoring the failures. That said, I'm not sure whether this has actually ever happened, whereas we have known issues of the transient failures causing job failures. This LGTM, though I'll let others chime in. |
/retest
In practice, when this script has failed, it's been due to some temporary environmental issue which can't be addressed by contributors anyway. So it's not clear to me that failures due to us being unable to list resources are actionable.
The script retries 5 times for each resource
We care about catching leaked resources, it's caught valid issues in the past (eg: #88355, #74890, #74417, #70191) |
Thanks, @spiffxp! The recent issues caused a lot of our jobs to turn red even though the underlying tests were passing correctly. |
this is my concern ... /lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: BenTheElder, spiffxp The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
/retest |
1 similar comment
/retest |
…4-upstream-release-1.17 Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to
…4-upstream-release-1.18 Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to
…4-upstream-release-1.16 Automated cherry pick of #89664: Allow list-resources.sh to continue if a resource fails to
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The
cluster/gce/list-resources.sh
script is used solely by our CI, specifically any job usingkubetest
with the--check-leaked-resources
flag. Currently if a single resource fails to list, we fail the entire job.I think this is too brittle. A review of previous issues on kubernetes/kubernetes that relate to failure of this script shows that the issues usually resolve themselves, or would be caught by the diff of before/after.
Let's instead allow the script to continue listing all resources, and let
kubetest
's resource diff fail the job.Special notes for your reviewer:
This will require cherrypick back to previous release branches to be of benefit to them. I tried looking at fixing this centrally in
kubetest
, but that wouldn't allow us to fail gracefully and list other resources if one fails to list.Does this PR introduce a user-facing change?:
/priority important-soon
/sig testing
/cc @ixdy @BenTheElder
/sig scalability
/cc @mm4tt
This would be relevant to scalability jobs, eg: would have prevented #89573
/sig release
/cc @droslean
This would be relevant to many of the release-blocking jobs, eg: would have prevented #89572