Improve stability and performance of the taint_manager unit tests #113386

mimowo · 2022-10-27T07:56:14Z

What type of PR is this?

/kind flake

What this PR does / why we need it:

It improves the stability and performance of the unit tests in taint_manager_test.go by shortening the constant wait for the controller and using wait.PollImmediate to wait until the expected actions are observed. In the environment used for testing, it reduces execution time from about 24.5s to 7s.

Which issue(s) this PR fixes:

Flaky tests in taint_manager_test.go.

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

mimowo · 2022-10-28T06:54:24Z

/assign @k82cn

mimowo · 2022-11-09T14:20:42Z

/assign @alculquicondor

alculquicondor · 2022-11-09T14:54:26Z

pkg/controller/nodelifecycle/scheduler/taint_manager_test.go

@@ -39,7 +40,7 @@ import (
 	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
 )

-var timeForControllerToProgress = 500 * time.Millisecond
+var timeForControllerToProgress = 10 * time.Millisecond


Why is this enough? The unit tests might run in a very small container in CI and need more than 10ms.

Ideally, we should eliminate all sleeps and use polls.

It is enough because we poll periodically in the verifyPodActions function. In fact, if we eliminate the sleep, the tests will still pass.

However, I want to keep it, because I've noticed that if I eliminate the sleep entirely and set expectDelete=false in the asserts, the tests pass, because the delete has not been executed yet (but would execute within a couple of ms). So I think leaving it at 10ms is pragmatic, because we would detect such issues when running locally.

In TestUpdatePod, there is a sleep without a corresponding verify. I think we should add a verify there, and then we can simply eliminate all sleeps. And maybe we can use wait.Poll instead of wait.PollImmediate.

Done. However, I still keep timeForControllerToProgress, reverted to its previous value, as there are 2 tests which check that the controller does not panic after the last update action. I suggest moving the removal of these sleeps into a dedicated issue/PR, as it is not clear to me at the moment what we could wait on in these tests to achieve the same effect as the sleep.

alculquicondor · 2022-11-10T13:23:47Z

pkg/controller/nodelifecycle/scheduler/taint_manager_test.go

@@ -499,8 +496,7 @@ func TestUpdateNode(t *testing.T) {
 			controller.recorder = testutil.NewFakeRecorder()
 			go controller.Run(ctx)
 			controller.NodeUpdated(item.oldNode, item.newNode)
-			// wait a bit
-			time.Sleep(timeForControllerToProgress)


Is the constant still needed?

It is still used in two places. However, I renamed it to timeForControllerToProgressForSanityTesting, as its only purpose now is to wait a little after the last action to check that there is no panic.

mimowo · 2022-11-10T14:39:26Z

This is just a test change, so it should be eligible for 1.26
/milestone v1.26

k8s-ci-robot · 2022-11-10T14:39:28Z

@mimowo: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Milestone Maintainers Team and have them propose you as an additional delegate for this responsibility.

In response to this:

This is just a test change, so it should be eligible for 1.26
/milestone v1.26

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

mimowo · 2022-11-10T14:42:17Z

It's a nice-to-have test improvement that I suggest including in 1.26. @alculquicondor @soltysh WDYT?

alculquicondor · 2022-11-10T16:22:11Z

pkg/controller/nodelifecycle/scheduler/taint_manager_test.go

+			wait.PollImmediate(10*time.Millisecond, time.Second, func() (bool, error) {
+				scheduledEviction := controller.taintEvictionQueue.GetWorkerUnsafe(item.prevPod.Namespace)
+				return scheduledEviction != nil, nil
+			})


This looks rather problematic. It looks like it's trying to check that the item is in the queue. But if the worker picks it up fast enough, it won't be in the queue anymore.

I think we can just get rid of this entirely.

Indeed, the test is tricky, especially the one with taints. I'm going to think a little about how to make this reliable. What do you mean by "get rid of this entirely"? Revert to the sleep in this place?

Reverted this bit for lack of a better fix for now. Still, a couple of other places can already benefit from this change. Let me know what you think.

I meant remove the sleep as well. I don't think it's needed.

A sleep in one form or another is needed in order to wait for this test case:

kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager_test.go

Line 285 in 8e48df1

description: "lengthening toleration shouldn't work",

This test case checks that if the eviction is scheduled for a taint with a 1s toleration, then even if the pod is updated to lengthen the toleration, it still gets evicted after 1s (as queued initially). If there is no sleep, the first eviction is not scheduled and only the second eviction (with a toleration of 100s) is scheduled.

On second thought, I think it makes sense to explicitly wait for the eviction to be queued. I'm adding this only to the two test cases where it is needed. Once scheduled, the eviction remains in the queue for the duration of the toleration (1s).

Also, I renamed the const once more and lowered it to only 20ms. Measuring locally, it usually takes 10-20ms for the controller to schedule an eviction. I think the constant was set higher because it also played another role: giving the controller processing time to schedule the eviction or delete the pod. It had to be large in order not to fail on slower or loaded machines. For the purpose of sanity testing at the end of the test, I think 20ms is enough.

soltysh · 2022-11-10T17:16:48Z

/triage accepted
/priority backlog

alculquicondor · 2022-11-14T13:41:33Z

pkg/controller/nodelifecycle/scheduler/taint_manager_test.go

-			fakeClientset.ClearActions()
-			time.Sleep(timeForControllerToProgress)
+			if item.awaitForScheduledEviction {
+				nsName := types.NamespacedName{Namespace: item.prevPod.Namespace, Name: item.prevPod.Name}


Can you run these tests 100 times to see if they're stable?

Sure, pasting the output of the last run of the executed command:

go test ./pkg/controller/nodelifecycle/scheduler/ -v -run "TestUpdatePod$" -count=100

(...)
=== RUN TestUpdatePod
=== RUN TestUpdatePod/scheduling_onto_tainted_Node_results_in_patch_and_delete_when_PodDisruptionConditions_enabled
I1114 14:48:34.460891 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.460968 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.461094 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN TestUpdatePod/scheduling_onto_tainted_Node
I1114 14:48:34.471714 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.471812 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.471949 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN TestUpdatePod/scheduling_onto_tainted_Node_with_toleration
I1114 14:48:34.482521 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.482574 3040701 taint_manager.go:211] "Sending events to api server"
=== RUN TestUpdatePod/removing_toleration
I1114 14:48:34.493840 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.493886 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.504168 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN TestUpdatePod/lengthening_toleration_shouldn't_work
I1114 14:48:34.514886 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.515041 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:35.515644 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
--- PASS: TestUpdatePod (1.07s)
    --- PASS: TestUpdatePod/scheduling_onto_tainted_Node_results_in_patch_and_delete_when_PodDisruptionConditions_enabled (0.01s)
    --- PASS: TestUpdatePod/scheduling_onto_tainted_Node (0.01s)
    --- PASS: TestUpdatePod/scheduling_onto_tainted_Node_with_toleration (0.01s)
    --- PASS: TestUpdatePod/removing_toleration (0.02s)
    --- PASS: TestUpdatePod/lengthening_toleration_shouldn't_work (1.01s)
PASS
ok k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler 106.365s

FYI, in the last test case, which takes ~1s, the awaitForScheduledEviction section takes ~20ms. The remaining 1s is spent in verifyPodActions, waiting for the eviction to be processed, which results in the observed delete action.

alculquicondor

/lgtm

soltysh

/approve

k8s-ci-robot · 2022-11-14T14:39:29Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mimowo, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/controller/nodelifecycle/OWNERS~~ [soltysh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

soltysh · 2022-11-14T14:39:32Z

This is just tests so it's eligible for 1.26
/milestone v1.26

k8s-ci-robot requested review from k82cn and smarterclayton October 27, 2022 07:57

mimowo mentioned this pull request Oct 27, 2022

Use SSA to add pod failure conditions #113304

Merged

mimowo changed the title ~~Improve stability if the taint_manager tests~~ Improve stability and performance if the taint_manager tests Oct 27, 2022

mimowo changed the title ~~Improve stability and performance if the taint_manager tests~~ Improve stability and performance of the taint_manager tests Oct 27, 2022

mimowo changed the title ~~Improve stability and performance of the taint_manager tests~~ Improve stability and performance of the taint_manager unit tests Oct 27, 2022

k8s-ci-robot assigned k82cn Oct 28, 2022

k8s-ci-robot assigned alculquicondor Nov 9, 2022

alculquicondor reviewed Nov 9, 2022

mimowo force-pushed the improve-stability-of-taint-manager-tests branch from 40be170 to a34049d Compare November 10, 2022 10:35

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Nov 10, 2022

mimowo requested review from alculquicondor and removed request for smarterclayton and k82cn November 10, 2022 12:39

alculquicondor reviewed Nov 10, 2022

mimowo force-pushed the improve-stability-of-taint-manager-tests branch from f2459cb to 8391f46 Compare November 10, 2022 13:57

mimowo requested a review from alculquicondor November 10, 2022 14:01

alculquicondor reviewed Nov 10, 2022

mimowo force-pushed the improve-stability-of-taint-manager-tests branch 2 times, most recently from 3d43bde to 4f5dbf6 Compare November 10, 2022 16:55

soltysh requested changes Nov 10, 2022

mimowo force-pushed the improve-stability-of-taint-manager-tests branch from 4f5dbf6 to ab0983f Compare November 10, 2022 17:25

mimowo added 2 commits November 13, 2022 19:40

Improve stability if the taint_manager tests

3b5c3ac

Fix race conditions

a910ca5

mimowo force-pushed the improve-stability-of-taint-manager-tests branch from ab0983f to a910ca5 Compare November 14, 2022 09:11

alculquicondor reviewed Nov 14, 2022

mimowo requested a review from soltysh November 14, 2022 13:48

alculquicondor reviewed Nov 14, 2022

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 14, 2022

soltysh approved these changes Nov 14, 2022

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 14, 2022

k8s-ci-robot added this to the v1.26 milestone Nov 14, 2022

k8s-ci-robot merged commit c474920 into kubernetes:master Nov 14, 2022

mimowo mentioned this pull request Nov 14, 2022

Do not use sleep for unit tests in taint_manager_test.go #111140

Closed

mimowo deleted the improve-stability-of-taint-manager-tests branch March 18, 2023 18:41
