Fix a TaintBasedEviction integration test flake #84766

Merged
merged 1 commit into from Nov 5, 2019

Huang-Wei
Member

What type of PR is this?

/kind flake

What this PR does / why we need it:

If all nodes of a cluster become NotReady, the node lifecycle controller internally enters a "fullyDisrupted" mode, in which some functions (such as applying taints and enforcing Pod eviction) are not honored.

In the TaintBasedEviction test, if a subtest can't finish in 5 seconds, it flakes with the following pattern of messages:

I1101 17:05:16.059199  105553 node_lifecycle_controller.go:1060] node node-0 hasn't been updated for 5.000511225s. Last Ready is: &NodeCondition{Type:Ready,Status:True,LastHeartbeatTime:0001-01-01 00:00:00 +0000 UTC,LastTransitionTime:0001-01-01 00:00:00 +0000 UTC,Reason:,Message:,}
...
I1101 17:05:16.078489  105553 node_lifecycle_controller.go:1132] Controller detected that all Nodes are not-Ready. Entering master disruption mode.
...

This PR ensures each Node has a goroutine reporting heartbeat info regularly.
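
As a rough illustration of the idea (not the code in this PR), the sketch below starts one goroutine per node that reports a heartbeat immediately and then on every tick until a stop channel is closed. The names `heartbeatRecorder` and `startHeartbeats` are hypothetical; in the real test the report step would update the Node's status through the API server rather than increment a counter.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// heartbeatRecorder is a hypothetical stand-in for the API call that would
// refresh a Node's heartbeat; it only counts heartbeats per node.
type heartbeatRecorder struct {
	mu     sync.Mutex
	counts map[string]int
}

func newHeartbeatRecorder() *heartbeatRecorder {
	return &heartbeatRecorder{counts: make(map[string]int)}
}

// report records one heartbeat for the named node.
func (r *heartbeatRecorder) report(node string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.counts[node]++
}

// count returns how many heartbeats the named node has reported so far.
func (r *heartbeatRecorder) count(node string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	return r.counts[node]
}

// startHeartbeats launches one goroutine per node. Each goroutine reports a
// heartbeat right away and then once per interval until stop is closed.
// The returned WaitGroup lets the caller wait for all goroutines to exit.
func startHeartbeats(nodes []string, interval time.Duration, r *heartbeatRecorder, stop <-chan struct{}) *sync.WaitGroup {
	var wg sync.WaitGroup
	for _, name := range nodes {
		wg.Add(1)
		go func(name string) {
			defer wg.Done()
			ticker := time.NewTicker(interval)
			defer ticker.Stop()
			r.report(name)
			for {
				select {
				case <-ticker.C:
					r.report(name)
				case <-stop:
					return
				}
			}
		}(name)
	}
	return &wg
}

func main() {
	nodes := []string{"node-0", "node-1", "node-2"}
	rec := newHeartbeatRecorder()
	stop := make(chan struct{})

	wg := startHeartbeats(nodes, 10*time.Millisecond, rec, stop)
	time.Sleep(55 * time.Millisecond)
	close(stop)
	wg.Wait()

	for _, n := range nodes {
		fmt.Printf("%s reported at least one heartbeat: %v\n", n, rec.count(n) >= 1)
	}
}
```

Closing the stop channel at the end of a subtest stops every reporter at once, which keeps the goroutines from leaking into the next subtest.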

Which issue(s) this PR fixes:

May fix #83321.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

/sig scheduling
/cc @ahg-g @ravisantoshgudimetla @damemi
/priority important-soon

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 5, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 5, 2019
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Nov 5, 2019
@ahg-g
Member

ahg-g commented Nov 5, 2019

why is this not a problem for other integration tests?

@Huang-Wei
Member Author

why is this not a problem for other integration tests?

It's because the TaintBasedEviction feature is actually implemented by the node lifecycle manager (inside controller-manager), not the scheduler. When applying NoExecute taints, the node lifecycle manager behaves differently in different modes (Normal, PartialDisrupted, FullyDisrupted). The flake occurs at the exact step that ensures the NoExecute taint is properly applied:

I1101 17:05:42.006204  105553 reflector.go:268] k8s.io/client-go/informers/factory.go:134: forcing resync
    --- FAIL: TestTaintBasedEvictions/Taint_based_evictions_for_NodeNotReady_with_no_pod_tolerations (34.97s)
        taint_test.go:770: Failed to taint node in test 1 <node-2>, err: timed out waiting for the condition

From the log, we can see that the Node intentionally set to NotReady has been added to the taint queue:

I1101 17:05:16.078448  105553 node_lifecycle_controller.go:814] Node node-2 is NotReady as of 2019-11-01 17:05:16.078430562 +0000 UTC m=+268.763594109. Adding it to the Taint queue.

However, in fullyDisrupted mode, the controller is expected to suppress further actions like Pod eviction, since evicted Pods would have nowhere to be placed anyway, so the node is reset to the Reachable state:

// We're switching to full disruption mode
if allAreFullyDisrupted {
	klog.V(0).Info("Controller detected that all Nodes are not-Ready. Entering master disruption mode.")
	for i := range nodes {
		if nc.useTaintBasedEvictions {
			_, err := nc.markNodeAsReachable(nodes[i])
			if err != nil {
				klog.Errorf("Failed to remove taints from Node %v", nodes[i].Name)
			}
		} else {
			nc.cancelPodEviction(nodes[i])
		}
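
To make the effect of the modes concrete, here is a simplified model of the behavior described above. It is not the controller's actual code: `classifyZone`, `keepsNoExecuteTaint`, and the 0.55 threshold are assumptions made for illustration. It shows why keeping the other nodes' heartbeats fresh keeps the cluster out of FullyDisrupted mode, so the NoExecute taint on the intentionally NotReady node is not removed.

```go
package main

import "fmt"

// zoneState names the three modes mentioned in the discussion.
type zoneState string

const (
	stateNormal             zoneState = "Normal"
	statePartiallyDisrupted zoneState = "PartialDisrupted"
	stateFullyDisrupted     zoneState = "FullyDisrupted"
)

// classifyZone is a simplified model, not the controller's implementation.
// unhealthyThreshold is an assumed parameter: the fraction of NotReady nodes
// at which the cluster counts as partially disrupted.
func classifyZone(ready, notReady int, unhealthyThreshold float64) zoneState {
	switch {
	case ready == 0 && notReady > 0:
		// No node is Ready at all.
		return stateFullyDisrupted
	case notReady > 2 && float64(notReady)/float64(ready+notReady) >= unhealthyThreshold:
		return statePartiallyDisrupted
	default:
		return stateNormal
	}
}

// keepsNoExecuteTaint reports whether, in this model, a NotReady node keeps
// its NoExecute taint. In FullyDisrupted mode nodes are marked reachable
// again, so the taint is removed.
func keepsNoExecuteTaint(state zoneState) bool {
	return state != stateFullyDisrupted
}

func main() {
	const threshold = 0.55 // assumed value for illustration

	// No heartbeats: all three nodes end up NotReady.
	s := classifyZone(0, 3, threshold)
	fmt.Println(s, keepsNoExecuteTaint(s))

	// Heartbeats for node-0 and node-1: only node-2 is NotReady.
	s = classifyZone(2, 1, threshold)
	fmt.Println(s, keepsNoExecuteTaint(s))
}
```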

@ahg-g
Member

ahg-g commented Nov 5, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 5, 2019
@ahg-g
Member

ahg-g commented Nov 5, 2019

/retest

@k8s-ci-robot k8s-ci-robot merged commit ee309ce into kubernetes:master Nov 5, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Nov 5, 2019
@damemi
Contributor

damemi commented Nov 5, 2019

Interesting, thank you for (hopefully!) solving this

k8s-ci-robot added a commit that referenced this pull request Dec 6, 2019
…856-#84036-#84766-#84883-upstream-release-1.14

Automated cherry pick of #81856: Convert tbe e2e to integration test #84036: Ensure TaintBasedEviction int test not rely on #84766: Fix a TaintBasedEviction integration test flake #84883: Update test logic to simulate NodeReady/False and
k8s-ci-robot added a commit that referenced this pull request Dec 6, 2019
…856-#84036-#84766-#84883-upstream-release-1.15

Automated cherry pick of #81856: Convert tbe e2e to integration test #84036: Ensure TaintBasedEviction int test not rely on #84766: Fix a TaintBasedEviction integration test flake #84883: Update test logic to simulate NodeReady/False and
k8s-ci-robot added a commit that referenced this pull request Dec 6, 2019
…856-#84036-#84766-#84883-upstream-release-1.16

Automated cherry pick of #81856: Convert tbe e2e to integration test #84036: Ensure TaintBasedEviction int test not rely on #84766: Fix a TaintBasedEviction integration test flake #84883: Update test logic to simulate NodeReady/False and
Successfully merging this pull request may close these issues.

TestTaintBasedEvictions is flaky
4 participants