Improve stability and performance of the taint_manager unit tests #113386

Merged

Contributor

@mimowo mimowo commented Oct 27, 2022

What type of PR is this?

/kind flake

What this PR does / why we need it:

It improves the stability and performance of the unit tests in taint_manager_test.go by shortening the constant wait for the controller and using wait.PollImmediate to wait until the expected actions are observed. In the environment used for testing, this reduces execution time from about 24.5s to 7s.

Which issue(s) this PR fixes:

Flaky tests in taint_manager.go.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. sig/apps Categorizes an issue or PR as relevant to SIG Apps. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 27, 2022
@mimowo mimowo changed the title Improve stability if the taint_manager tests Improve stability and performance if the taint_manager tests Oct 27, 2022
@mimowo mimowo changed the title Improve stability and performance if the taint_manager tests Improve stability and performance of the taint_manager tests Oct 27, 2022
@mimowo mimowo changed the title Improve stability and performance of the taint_manager tests Improve stability and performance of the taint_manager unit tests Oct 27, 2022
@mimowo
Contributor Author

mimowo commented Oct 28, 2022

/assign @k82cn

@mimowo
Contributor Author

mimowo commented Nov 9, 2022

/assign @alculquicondor

@@ -39,7 +40,7 @@ import (
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

-var timeForControllerToProgress = 500 * time.Millisecond
+var timeForControllerToProgress = 10 * time.Millisecond
Member

why is this enough? The unit tests might run in a very small container in CI, needing more than 10ms.

Ideally, we should eliminate all sleeps and do Polls

Contributor Author

@mimowo mimowo Nov 9, 2022

It is enough because we poll periodically in the verifyPodActions function. In fact, the tests now pass even if we eliminate the sleep.

However, I would like to keep it: I noticed that if I eliminate the sleep entirely and set expectDelete=false in the asserts, the tests still pass, because the delete has not been executed yet (but would be within a couple of ms). So I think keeping it at 10ms is pragmatic, because we would detect such issues when running locally.

Member

In TestUpdatePod, there is a sleep without a corresponding verify. I think we should add a verify there, and then we can simply eliminate all sleeps. And maybe we can use wait.Poll instead of wait.PollImmediate.

Contributor Author

@mimowo mimowo Nov 10, 2022

Done. However, I am keeping timeForControllerToProgress, reverted to its previous value, as there are 2 tests which check that the controller does not panic after the last update action. I suggest moving the removal of these sleeps into a dedicated issue/PR, as it is not clear to me at the moment what we could wait on in these tests to achieve the same effect as the sleep.

@mimowo mimowo force-pushed the improve-stability-of-taint-manager-tests branch from 40be170 to a34049d Compare November 10, 2022 10:35
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 10, 2022
@mimowo mimowo requested review from alculquicondor and removed request for smarterclayton and k82cn November 10, 2022 12:39
@@ -499,8 +496,7 @@ func TestUpdateNode(t *testing.T) {
controller.recorder = testutil.NewFakeRecorder()
go controller.Run(ctx)
controller.NodeUpdated(item.oldNode, item.newNode)
// wait a bit
time.Sleep(timeForControllerToProgress)
Member

is the constant still needed?

Contributor Author

It is still used in two places. However, I renamed it to timeForControllerToProgressForSanityTesting, as its only purpose now is to wait a little after the last action to check that there is no panic.

@mimowo mimowo force-pushed the improve-stability-of-taint-manager-tests branch from f2459cb to 8391f46 Compare November 10, 2022 13:57
@mimowo
Contributor Author

mimowo commented Nov 10, 2022

This is just a test change, so it should be eligible for 1.26
/milestone v1.26

@k8s-ci-robot
Contributor

@mimowo: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Milestone Maintainers Team and have them propose you as an additional delegate for this responsibility.


@mimowo
Contributor Author

mimowo commented Nov 10, 2022

It's a nice-to-have test improvement which I suggest including in 1.26. @alculquicondor @soltysh WDYT?

Comment on lines 307 to 310
wait.PollImmediate(10*time.Millisecond, time.Second, func() (bool, error) {
scheduledEviction := controller.taintEvictionQueue.GetWorkerUnsafe(item.prevPod.Namespace)
return scheduledEviction != nil, nil
})
Member

this looks rather problematic. It looks like it's trying to check that the item was in the queue. But if the worker picks it up fast enough, it wouldn't be in the queue anymore.

I think we can just get rid of this entirely

Contributor Author

Indeed, the test is tricky, especially the one with taints. I'm going to think a little about how to make this reliable. What do you mean by "get rid of this entirely": revert to the sleep in this place?

Contributor Author

Reverted this bit due to lack of a better fixing idea for now. Still, a couple of other places can already benefit from this change. Let me know what you think.

Member

I meant remove the sleep as well. I don't think it's needed.

Contributor Author

@mimowo mimowo Nov 14, 2022

A sleep in one form or another is needed for this test case:

description: "lengthening toleration shouldn't work",

This test case checks that if an eviction is scheduled for a taint with a 1s toleration, then even if the pod is updated to lengthen the toleration, it still gets evicted after 1s (as initially queued). If there is no sleep, the first eviction is not scheduled and only the second eviction (with a toleration of 100s) is scheduled.

On second thought, I think it makes sense to add explicit waiting for the eviction to be queued. I am adding it only to the two test cases where it is needed. Once scheduled, the eviction remains in the queue for the duration of the toleration (1s).

Also, I renamed the const once more and lowered it to only 20ms. Locally, I measured that it usually takes 10-20ms for the controller to schedule an eviction. I think the constant was set higher because it also played the other role of giving the controller processing time to schedule the eviction or delete the pod. It had to be large in order not to fail on slower or loaded machines. For the purpose of sanity testing at the end of the test, I think 20ms is enough.

@mimowo mimowo force-pushed the improve-stability-of-taint-manager-tests branch 2 times, most recently from 3d43bde to 4f5dbf6 Compare November 10, 2022 16:55
@soltysh
Contributor

soltysh commented Nov 10, 2022

/triage accepted
/priority backlog

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/backlog Higher priority than priority/awaiting-more-evidence. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 10, 2022
@mimowo mimowo force-pushed the improve-stability-of-taint-manager-tests branch from 4f5dbf6 to ab0983f Compare November 10, 2022 17:25
@mimowo mimowo force-pushed the improve-stability-of-taint-manager-tests branch from ab0983f to a910ca5 Compare November 14, 2022 09:11
fakeClientset.ClearActions()
time.Sleep(timeForControllerToProgress)
if item.awaitForScheduledEviction {
nsName := types.NamespacedName{Namespace: item.prevPod.Namespace, Name: item.prevPod.Name}
Member

Can you run this test 100 times to see if it's stable?

Contributor Author

@mimowo mimowo Nov 14, 2022

Sure, pasting output of the last run for the executed command:

go test ./pkg/controller/nodelifecycle/scheduler/ -v -run "TestUpdatePod$" -count=100
(...)
=== RUN   TestUpdatePod
=== RUN   TestUpdatePod/scheduling_onto_tainted_Node_results_in_patch_and_delete_when_PodDisruptionConditions_enabled
I1114 14:48:34.460891 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.460968 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.461094 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN   TestUpdatePod/scheduling_onto_tainted_Node
I1114 14:48:34.471714 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.471812 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.471949 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN   TestUpdatePod/scheduling_onto_tainted_Node_with_toleration
I1114 14:48:34.482521 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.482574 3040701 taint_manager.go:211] "Sending events to api server"
=== RUN   TestUpdatePod/removing_toleration
I1114 14:48:34.493840 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.493886 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.504168 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN   TestUpdatePod/lengthening_toleration_shouldn't_work
I1114 14:48:34.514886 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.515041 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:35.515644 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
--- PASS: TestUpdatePod (1.07s)
    --- PASS: TestUpdatePod/scheduling_onto_tainted_Node_results_in_patch_and_delete_when_PodDisruptionConditions_enabled (0.01s)
    --- PASS: TestUpdatePod/scheduling_onto_tainted_Node (0.01s)
    --- PASS: TestUpdatePod/scheduling_onto_tainted_Node_with_toleration (0.01s)
    --- PASS: TestUpdatePod/removing_toleration (0.02s)
    --- PASS: TestUpdatePod/lengthening_toleration_shouldn't_work (1.01s)
PASS
ok  	k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler	106.365s

Contributor Author

@mimowo mimowo Nov 14, 2022

FYI, in the last test case, which takes ~1s, the awaitForScheduledEviction section takes ~20ms. The remaining 1s is spent in verifyPodActions waiting for the eviction to be processed, which results in the observed delete action.

@mimowo mimowo requested a review from soltysh November 14, 2022 13:48
Member

@alculquicondor alculquicondor left a comment

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 14, 2022
Contributor

@soltysh soltysh left a comment

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, soltysh


@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 14, 2022
@soltysh
Contributor

soltysh commented Nov 14, 2022

This is just tests so it's eligible for 1.26
/milestone v1.26

@k8s-ci-robot k8s-ci-robot added this to the v1.26 milestone Nov 14, 2022
@k8s-ci-robot k8s-ci-robot merged commit c474920 into kubernetes:master Nov 14, 2022
@mimowo mimowo deleted the improve-stability-of-taint-manager-tests branch March 18, 2023 18:41
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/backlog Higher priority than priority/awaiting-more-evidence. release-note-none Denotes a PR that doesn't merit a release note. sig/apps Categorizes an issue or PR as relevant to SIG Apps. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

5 participants