Move kubectl wait to informers with a cache to avoid hanging due to objects disappearing from the cluster #110923

Merged
merged 1 commit into from Jul 7, 2022

mpuckett159
Contributor

What type of PR is this?

/kind bug

What this PR does / why we need it:

Copied from #108086. This PR attempts to address regressions caused by that earlier PR and adds tests to ensure the regression is not inadvertently reintroduced.

This moves the kubectl wait functions to informers with cache updates when waiting for resources to reach a specified state. It prevents kubectl wait from hanging when resources disappear, and it prints a descriptive error message when a resource it is waiting on disappears.

Example output:

☸  context: minikube (namespace: default) in ~/tmp 
❯ devkc wait --for=condition=ready pod --all --timeout=15s
pod/nginx-deployment-9456bbbf9-f9k7k condition met
pod/nginx-deployment-9456bbbf9-k2fhr condition met
pod/nginx-deployment-9456bbbf9-l886q condition met
pod/nginx-deployment-9456bbbf9-qsgqh condition met
pod/nginx-deployment-9456bbbf9-sjfdp condition met
pod/nginx-deployment-9456bbbf9-w85zc condition met
Error from server (NotFound): pods "nginx-deployment-9456bbbf9-zs5vq" not found

☸  context: minikube (namespace: default) in ~/tmp 
❯ devkc wait --for=condition=ready pod --all --timeout=15s
pod/nginx-deployment-9456bbbf9-f9k7k condition met
pod/nginx-deployment-9456bbbf9-l886q condition met
pod/nginx-deployment-9456bbbf9-sjfdp condition met
Error from server (NotFound): pods "nginx-deployment-9456bbbf9-k2fhr" not found

☸  context: minikube (namespace: default) in ~/tmp 
❯ devkc wait --for=condition=ready pod --all --timeout=15s
pod/nginx-deployment-9456bbbf9-f9k7k condition met

Which issue(s) this PR fixes:

Fixes kubernetes/kubectl#1120

Special notes for your reviewer:

I had to bump all the timeouts in the tests up to at least 1 second because of how the informer and caching work. From what I can tell in my local testing, this doesn't increase test time significantly (about 2 seconds for me), but just FYI.

Does this PR introduce a user-facing change?

NONE

cleared release note since this is reverted in #110922, original release note was:

kubectl wait no longer hangs on resources that have disappeared from the cluster

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 2, 2022
@mpuckett159
Contributor Author

/triage accepted
/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. triage/accepted Indicates an issue or PR is ready to be actively worked on. area/kubectl sig/cli Categorizes an issue or PR as relevant to SIG CLI. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jul 2, 2022
Contributor

@soltysh soltysh left a comment

This ensures the comment from #108086 (comment) is addressed and adds tests ensuring that we don't break it in the future.

/lgtm
/approve

Contributor

@soltysh soltysh left a comment

/priority backlog

@k8s-ci-robot k8s-ci-robot added priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jul 7, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 7, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mpuckett159, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 7, 2022
@k8s-ci-robot k8s-ci-robot merged commit 9d68640 into kubernetes:master Jul 7, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.25 milestone Jul 7, 2022
@liggitt
Member

liggitt commented Jul 13, 2022

wait unit tests have been significantly flakier since this merged:

https://storage.googleapis.com/k8s-triage/index.html?pr=1&test=kubectl%2Fpkg%2Fcmd%2Fwait

@aleksandra-malinowska
Contributor

This also correlates with the start of #111111 (test times out because kubectl delete never completes).

@mpuckett159
Contributor Author

> This also correlates with the start of #111111 (test times out because kubectl delete never completes).

Note to self for the fix: it looks like the delete code doesn't set the timeout value properly in the waitOptions. It uses 0 to mean "forever", when 0 should mean "check once and report immediately."

@aleksandra-malinowska this may point to an underlying issue with the test itself, however. Could you point me to the specific test code so I can see exactly what is being run when these hangs occur? It sounds like the resource is not being deleted as one would expect, and that this, combined with the timeout issue, is causing the hang. If the test just runs kubectl delete and then ignores the error it returns, the wrong timeout value would definitely be the source of these failures.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/kubectl cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note-none Denotes a PR that doesn't merit a release note. sig/cli Categorizes an issue or PR as relevant to SIG CLI. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.