Fix a TaintBasedEviction integration test flake #84766
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: Huang-Wei. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Why is this not a problem for other integration tests?
It's because the TaintBasedEviction feature is actually implemented by the node lifecycle manager (inside controller-manager), not the scheduler. When applying NoExecute taints, the node lifecycle manager behaves differently depending on its mode (Normal, PartialDisrupted, FullyDisrupted). The flake occurs at the exact step of ensuring that the NoExecute taint is properly applied:
From the log, we can see that the Node intentionally set to NotReady has been added to the taint queue:
However, in fullyDisrupted mode, further actions such as Pod eviction are expected to be suppressed, since evicted Pods would still have nowhere to be placed, so the node is reset to the Reachable state: kubernetes/pkg/controller/nodelifecycle/node_lifecycle_controller.go Lines 1150 to 1161 in 590cbef
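As a rough illustration of the mode switch described above, here is a minimal standalone sketch. It is not the code at the referenced lines: the type `mode`, the functions `classify` and `appliesNoExecuteTaint`, and the `0.55` threshold are all assumptions made for the example. It only shows why a cluster whose nodes are all NotReady stops tainting.

```go
package main

import "fmt"

// mode is a hypothetical stand-in for the controller's internal state;
// the names follow the wording used in this thread, not the real source.
type mode string

const (
	normal           mode = "Normal"
	partialDisrupted mode = "PartialDisrupted"
	fullyDisrupted   mode = "FullyDisrupted"
)

// classify derives a mode from node counts. The threshold parameter is an
// assumed fraction of NotReady nodes above which the cluster counts as
// partially disrupted.
func classify(ready, notReady int, threshold float64) mode {
	switch {
	case ready == 0 && notReady > 0:
		// Every node is NotReady: evicted Pods would have nowhere to go.
		return fullyDisrupted
	case notReady > 0 && float64(notReady)/float64(ready+notReady) >= threshold:
		return partialDisrupted
	default:
		return normal
	}
}

// appliesNoExecuteTaint reports whether, in this sketch, a NotReady node
// would still receive a NoExecute taint under the given mode.
func appliesNoExecuteTaint(m mode) bool {
	return m != fullyDisrupted
}

func main() {
	// One NotReady node and no Ready nodes: tainting is suppressed.
	m := classify(0, 1, 0.55)
	fmt.Println(m, appliesNoExecuteTaint(m))

	// Two Ready nodes keep the cluster out of fullyDisrupted mode.
	m = classify(2, 1, 0.55)
	fmt.Println(m, appliesNoExecuteTaint(m))
}
```

Under these assumptions, a test that marks one node NotReady only sees the taint applied if the remaining nodes are still counted as Ready when the controller evaluates the cluster.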
/lgtm
/retest
Interesting, thank you for (hopefully!) solving this
What type of PR is this?
/kind flake
What this PR does / why we need it:
If all nodes of a cluster become NotReady, internally it enters a "fullyDisrupted" mode, in which some functions are not honored (such as applying taints and enforcing Pod eviction).
In the TaintBasedEviction test, if a subtest can't finish within 5 seconds, it flakes with the following pattern of messages:
This PR ensures each Node has a goroutine reporting heartbeat info regularly.
Which issue(s) this PR fixes:
May fix #83321.
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/sig scheduling
/cc @ahg-g @ravisantoshgudimetla @damemi
/priority important-soon