Improve stability and performance of the taint_manager unit tests #113386
Conversation
/assign @k82cn
/assign @alculquicondor
@@ -39,7 +40,7 @@ import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

-var timeForControllerToProgress = 500 * time.Millisecond
+var timeForControllerToProgress = 10 * time.Millisecond
why is this enough? The unit tests might run in a very small container in CI, needing more than 10ms.
Ideally, we should eliminate all sleeps and do Polls
It is enough because we do a periodic poll in the verifyPodActions function, so the tests would pass even if we eliminated the sleep entirely.
However, I want to keep it, because I've noticed that if I eliminate the sleep entirely and set expectDelete=false in the asserts, the tests pass, because the delete was not executed yet (but would execute within a couple of ms). So I think leaving it at 10ms is pragmatic, because we would detect such issues when running locally.
in TestUpdatePod, there is a sleep without a corresponding verify. I think we should add some verify there and then we can simply eliminate all sleeps. And maybe we can do wait.Poll instead of wait.PollImmediate
Done. However, I still keep timeForControllerToProgress, reverted to the previous value, as there are 2 tests which check that the controller does not panic after the last update action. I suggest decoupling the removal of these sleeps into a dedicated issue/PR, as it is not clear to me ATM what we could wait on in these tests to effectively do the same as the sleep does.
Force-pushed from 40be170 to a34049d
@@ -499,8 +496,7 @@ func TestUpdateNode(t *testing.T) {
	controller.recorder = testutil.NewFakeRecorder()
	go controller.Run(ctx)
	controller.NodeUpdated(item.oldNode, item.newNode)
-	// wait a bit
-	time.Sleep(timeForControllerToProgress)
is the constant still needed?
It is still used in two places. However, I renamed it to timeForControllerToProgressForSanityTesting, as its only purpose now is to wait a little bit after the last action so that there is no panic.
Force-pushed from f2459cb to 8391f46
This is just a tests change, so it should be eligible for 1.26.
@mimowo: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Milestone Maintainers Team and have them propose you as an additional delegate for this responsibility. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
It's a nice-to-have tests improvement which I suggest including in 1.26. @alculquicondor @soltysh WDYT?
wait.PollImmediate(10*time.Millisecond, time.Second, func() (bool, error) {
	scheduledEviction := controller.taintEvictionQueue.GetWorkerUnsafe(item.prevPod.Namespace)
	return scheduledEviction != nil, nil
})
this looks rather problematic. It looks like it's trying to check that the item was in the queue. But if the worker picks it up fast enough, it wouldn't be in the queue anymore.
I think we can just get rid of this entirely
indeed, the test is tricky, especially the one with taints. I'm going to think a little about how to make this reliable. What do you mean by "get rid of this entirely" - revert to the sleep in this place?
Reverted this bit for lack of a better fix for now. Still, a couple of other places already benefit from this change. Let me know what you think.
I meant remove the sleep as well. I don't think it's needed.
The sleep in one form or another is needed in order to wait for this test case:
description: "lengthening toleration shouldn't work",
This test case checks that if the eviction is scheduled for a taint with a 1s toleration, then even if the pod is updated to lengthen the toleration, it still gets evicted after 1s (as queued initially). If there is no sleep then the first eviction is not scheduled, and only the second eviction (with the 100s toleration) is scheduled.
On second thought, I think it makes sense to add explicit waiting for the eviction to be queued, just adding it to the two test cases where it is needed. The eviction, once scheduled, remains in the queue for the duration of the toleration (1s).
Also, I renamed the const once more and lowered it to only 20ms. I measured locally that it usually takes 10-20ms to schedule an eviction by the controller. I think the constant was set higher because it played the other role of giving the controller processing time to schedule the eviction or delete the pod; it had to be big in order not to fail on slower or loaded machines. For the purpose of sanity testing at the end of the test, I think 20ms is enough.
Force-pushed from 3d43bde to 4f5dbf6
/triage accepted
Force-pushed from 4f5dbf6 to ab0983f
Force-pushed from ab0983f to a910ca5
fakeClientset.ClearActions()
time.Sleep(timeForControllerToProgress)
if item.awaitForScheduledEviction {
	nsName := types.NamespacedName{Namespace: item.prevPod.Namespace, Name: item.prevPod.Name}
Can you run these tests 100 times to see if they're stable?
Sure, pasting output of the last run for the executed command:
go test ./pkg/controller/nodelifecycle/scheduler/ -v -run "TestUpdatePod$" -count=100
(...)
=== RUN TestUpdatePod
=== RUN TestUpdatePod/scheduling_onto_tainted_Node_results_in_patch_and_delete_when_PodDisruptionConditions_enabled
I1114 14:48:34.460891 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.460968 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.461094 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN TestUpdatePod/scheduling_onto_tainted_Node
I1114 14:48:34.471714 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.471812 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.471949 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN TestUpdatePod/scheduling_onto_tainted_Node_with_toleration
I1114 14:48:34.482521 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.482574 3040701 taint_manager.go:211] "Sending events to api server"
=== RUN TestUpdatePod/removing_toleration
I1114 14:48:34.493840 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.493886 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:34.504168 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
=== RUN TestUpdatePod/lengthening_toleration_shouldn't_work
I1114 14:48:34.514886 3040701 taint_manager.go:206] "Starting NoExecuteTaintManager"
I1114 14:48:34.515041 3040701 taint_manager.go:211] "Sending events to api server"
I1114 14:48:35.515644 3040701 taint_manager.go:108] "NoExecuteTaintManager is deleting pod" pod="default/pod1"
--- PASS: TestUpdatePod (1.07s)
--- PASS: TestUpdatePod/scheduling_onto_tainted_Node_results_in_patch_and_delete_when_PodDisruptionConditions_enabled (0.01s)
--- PASS: TestUpdatePod/scheduling_onto_tainted_Node (0.01s)
--- PASS: TestUpdatePod/scheduling_onto_tainted_Node_with_toleration (0.01s)
--- PASS: TestUpdatePod/removing_toleration (0.02s)
--- PASS: TestUpdatePod/lengthening_toleration_shouldn't_work (1.01s)
PASS
ok k8s.io/kubernetes/pkg/controller/nodelifecycle/scheduler 106.365s
FYI, in the last test case, which takes ~1s, the awaitForScheduledEviction section takes ~20ms. The 1s is spent in verifyPodActions waiting for processing of the eviction, which results in the observed delete action.
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mimowo, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
This is just tests, so it's eligible for 1.26.
What type of PR is this?
/kind flake
What this PR does / why we need it:
It improves stability and performance of the unit tests in taint_manager_test.go by shortening the constant wait for the controller and using wait.PollImmediate to wait for the expected actions to be observed. On the environment used for testing it reduces execution time from about 24.5s to 7s.
Which issue(s) this PR fixes:
Flaky tests in taint_manager.go.
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: