-
Notifications
You must be signed in to change notification settings - Fork 38.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix TestUnreservePlugin flake #79371
Conversation
Welcome @alenkacz! |
@@ -581,8 +581,8 @@ func TestUnreservePlugin(t *testing.T) { | |||
if err = waitForPodUnschedulable(cs, pod); err != nil { | |||
t.Errorf("test #%v: Didn't expected the pod to be scheduled. error: %v", i, err) | |||
} | |||
if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled { | |||
t.Errorf("test #%v: Expected the unreserve plugin to be called %d times, was called %d times.", i, pbdPlugin.numPrebindCalled, unresPlugin.numUnreserveCalled) | |||
if unresPlugin.numUnreserveCalled < 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is the right fix.
The problem is that the tests in this file modify some shared state (the members in TesterPlugin above). What we need to do is reset this state for every test. As a quick fix, you can reset pbdPlugin.numPrebindCalled and unresPlugin.numUnreserveCalled to zero at the beginning of the for loop.
But what we should really be doing is refactor this whole file to ensure that each test starts from a clean state to make sure that this doesn't happen again in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does the reset here https://github.com/kubernetes/kubernetes/pull/79371/files#diff-f6b72dfe2b2c2cee201d12156e80366fL596 and the tests don't run in parallel...
I agree with the proposed refactoring, I still feel like this is the right fix at this point :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does the reset here https://github.com/kubernetes/kubernetes/pull/79371/files#diff-f6b72dfe2b2c2cee201d12156e80366fL596
reset() should be called at the beginning, not the end.
and the tests don't run in parallel...
The tests don't run in parallel, but they may run in different order.
I agree with the proposed refactoring, I still feel like this is the right fix at this point :)
I don't think it is, we should expect bind and unreserve plugins to be executed the same number of times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that TestPrebindPlugin does not reset the counter. But a test shouldn't be relying on another, so the right fix is for this test to make sure the state it relies on is reset before it runs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
on a closer look, I think you are right that we can't guarantee that they are equal because the scheduler and the test are running in parallel; what we can guarantee is that prebind counter is at most off by one :)
I still think that we have a bigger problem that the tests modify shared state and rely on each other to make sure the state is reset.
For now, in addition to the modification you did, I would suggest that you also reset the prebind counter in TestPrebindPlugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahg-g what about the race I described? I saw that in logs...
we wait for pod being unschedulable
that happens, the pod goes back to scheduling queue
then we check (at later time) that those two numbers (prebind, unreserve) equal...
but in the meantime, the pod got rescheduled and the prebind was executed, but not the unreserve one (YET)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh you replied in the meantime, good 👍
sure I can do that! And I can work on the refactoring after if no one else is working on that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ahg-g added the reset to the other test. You can re-review
/lgtm |
@@ -581,8 +582,8 @@ func TestUnreservePlugin(t *testing.T) { | |||
if err = waitForPodUnschedulable(cs, pod); err != nil { | |||
t.Errorf("test #%v: Didn't expected the pod to be scheduled. error: %v", i, err) | |||
} | |||
if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled { | |||
t.Errorf("test #%v: Expected the unreserve plugin to be called %d times, was called %d times.", i, pbdPlugin.numPrebindCalled, unresPlugin.numUnreserveCalled) | |||
if unresPlugin.numUnreserveCalled < 1 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can unresPlugin.numUnreserveCalled
be negative? If not, I'd say if unresPlugin.numUnreserveCalled == 0
makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed @Huang-Wei
@ahg-g @Huang-Wei in the meantime, this got merged 40090e8#diff-f6b72dfe2b2c2cee201d12156e80366f so the reset for prebind is already in place. I removed it from this PR. Please re-review |
@alenkacz please squash the commits into one commit and it'd good to be merged then. |
ed4a3ff
to
b84e07e
Compare
@Huang-Wei should be ready |
b84e07e
to
05e733c
Compare
@alenkacz: Those labels are not set on the issue: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/remove-area apiserver cloudprovider code-generation conformance dependency e2e-test-framework kubeadm kubectl kubelet release-eng |
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project. |
/lgtm |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alenkacz, Huang-Wei The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What type of PR is this?
/kind flake
What this PR does / why we need it:
Fixes flakiness in TestUnreservePlugin.
For that particular test we wait for
waitForPodUnschedulable
and then assert thatunresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled
but this assert happens asynchronously and in the meantime the pod could have been already rescheduled and prebind might have been called more times than unreserve.This PR simplifies the test by letting it pass when unreserve was hit AT LEAST ONCE. That should be enough to cover the scenario.
Which issue(s) this PR fixes:
Fixes #79166
Special notes for your reviewer:
Does this PR introduce a user-facing change?:
/sig scheduling
cc. @bsalamat @draveness