Fix TestUnreservePlugin flake #79371

alenkacz · 2019-06-25T14:21:13Z

What type of PR is this?
/kind flake

What this PR does / why we need it:
Fixes flakiness in TestUnreservePlugin.
For that particular test we wait for waitForPodUnschedulable and then assert that unresPlugin.numUnreserveCalled equals pbdPlugin.numPrebindCalled. However, this assertion runs asynchronously with the scheduler: in the meantime the pod could already have been rescheduled, so prebind might have been called more times than unreserve.

This PR simplifies the test by letting it pass when unreserve was hit AT LEAST ONCE. That should be enough to cover the scenario.

Which issue(s) this PR fixes:

Fixes #79166

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

/sig scheduling
cc. @bsalamat @draveness

k8s-ci-robot · 2019-06-25T14:21:14Z

Welcome @alenkacz!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

ahg-g · 2019-06-26T13:29:08Z

test/integration/scheduler/framework_test.go

@@ -581,8 +581,8 @@ func TestUnreservePlugin(t *testing.T) {
 				if err = waitForPodUnschedulable(cs, pod); err != nil {
 					t.Errorf("test #%v: Didn't expected the pod to be scheduled. error: %v", i, err)
 				}
-				if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled {
-					t.Errorf("test #%v: Expected the unreserve plugin to be called %d times, was called %d times.", i, pbdPlugin.numPrebindCalled, unresPlugin.numUnreserveCalled)
+				if unresPlugin.numUnreserveCalled < 1 {


I don't think this is the right fix.

The problem is that the tests in this file modify some shared state (the members in TesterPlugin above). What we need to do is reset this state for every test. As a quick fix, you can reset pbdPlugin.numPrebindCalled and unresPlugin.numUnreserveCalled to zero at the beginning of the for loop.

But what we should really be doing is refactor this whole file to ensure that each test starts from a clean state to make sure that this doesn't happen again in the future.
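A minimal sketch of the quick fix suggested above, assuming a simplified stand-in for the plugin state. The `testerPlugin` type, its `reset` method, and `runCases` are hypothetical names used only for illustration, not the actual code in framework_test.go. It shows that resetting at the top of each iteration makes the counters independent of whatever an earlier test left behind.

```go
package main

import "fmt"

// testerPlugin is a hypothetical stand-in for the shared plugin state
// that the integration tests mutate.
type testerPlugin struct {
	numPrebindCalled   int
	numUnreserveCalled int
}

// reset clears all counters so a test case starts from a clean state.
func (p *testerPlugin) reset() {
	*p = testerPlugin{}
}

// runCases simulates a table-driven test loop. Each iteration resets the
// shared state first, then records one prebind and one unreserve call.
// It returns the unreserve count observed at the end of each iteration.
func runCases(p *testerPlugin, cases int) []int {
	observed := make([]int, 0, cases)
	for i := 0; i < cases; i++ {
		p.reset() // reset at the beginning, not the end

		p.numPrebindCalled++
		p.numUnreserveCalled++

		observed = append(observed, p.numUnreserveCalled)
	}
	return observed
}

func main() {
	// Simulate stale state left over from an earlier test.
	p := &testerPlugin{numPrebindCalled: 7}
	fmt.Println(runCases(p, 3)) // prints "[1 1 1]"
}
```

Resetting at the start rather than the end means a case does not depend on a previous test having cleaned up after itself.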

It does the reset here https://github.com/kubernetes/kubernetes/pull/79371/files#diff-f6b72dfe2b2c2cee201d12156e80366fL596 and the tests don't run in parallel...

I agree with the proposed refactoring, I still feel like this is the right fix at this point :)

> It does the reset here https://github.com/kubernetes/kubernetes/pull/79371/files#diff-f6b72dfe2b2c2cee201d12156e80366fL596

reset() should be called at the beginning, not the end.

> and the tests don't run in parallel...

The tests don't run in parallel, but they may run in a different order.

> I agree with the proposed refactoring, I still feel like this is the right fix at this point :)

I don't think it is; we should expect bind and unreserve plugins to be executed the same number of times.

The problem is that TestPrebindPlugin does not reset the counter. But a test shouldn't be relying on another, so the right fix is for this test to make sure the state it relies on is reset before it runs.

On a closer look, I think you are right that we can't guarantee that they are equal, because the scheduler and the test are running in parallel; what we can guarantee is that the prebind counter is off by at most one :)

I still think that we have a bigger problem that the tests modify shared state and rely on each other to make sure the state is reset.

For now, in addition to the modification you did, I would suggest that you also reset the prebind counter in TestPrebindPlugin.

@ahg-g what about the race I described? I saw that in the logs...

1. we wait for the pod to become unschedulable
2. that happens, and the pod goes back to the scheduling queue
3. then we check (at a later time) that those two numbers (prebind, unreserve) are equal...
4. but in the meantime, the pod got rescheduled and prebind was executed, but not unreserve (YET)
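The sequence above can be replayed deterministically. This is only a sketch: `counters`, `strictCheck`, and `relaxedCheck` are hypothetical names, and the real scheduler runs concurrently with the test rather than in a straight line. It compares the original equality assertion with the relaxed one from this PR at the moment the test reads the counters.

```go
package main

import "fmt"

// counters is a hypothetical stand-in for the two plugin call counters.
type counters struct {
	numPrebindCalled   int
	numUnreserveCalled int
}

// strictCheck mirrors the original assertion: unreserve must have been
// called at least once and exactly as many times as prebind.
func strictCheck(c counters) bool {
	return c.numUnreserveCalled != 0 && c.numUnreserveCalled == c.numPrebindCalled
}

// relaxedCheck mirrors the assertion after this PR: unreserve must have
// been called at least once.
func relaxedCheck(c counters) bool {
	return c.numUnreserveCalled > 0
}

func main() {
	var c counters

	// First scheduling attempt: prebind runs, fails, unreserve runs.
	c.numPrebindCalled++
	c.numUnreserveCalled++

	// The pod is now unschedulable and goes back to the queue, which
	// unblocks the test. Before the test reads the counters, the
	// scheduler retries the pod and prebind runs again; unreserve has
	// not run yet for this second attempt.
	c.numPrebindCalled++

	fmt.Println(strictCheck(c), relaxedCheck(c)) // prints "false true"
}
```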

oh you replied in the meantime, good 👍

sure I can do that! And I can work on the refactoring after if no one else is working on that

@ahg-g added the reset to the other test. You can re-review

ahg-g · 2019-06-26T17:13:31Z

/lgtm

Huang-Wei · 2019-06-27T02:57:23Z

test/integration/scheduler/framework_test.go

@@ -581,8 +582,8 @@ func TestUnreservePlugin(t *testing.T) {
 				if err = waitForPodUnschedulable(cs, pod); err != nil {
 					t.Errorf("test #%v: Didn't expected the pod to be scheduled. error: %v", i, err)
 				}
-				if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled {
-					t.Errorf("test #%v: Expected the unreserve plugin to be called %d times, was called %d times.", i, pbdPlugin.numPrebindCalled, unresPlugin.numUnreserveCalled)
+				if unresPlugin.numUnreserveCalled < 1 {


Can unresPlugin.numUnreserveCalled be negative? If not, I'd say if unresPlugin.numUnreserveCalled == 0 makes more sense.

fixed @Huang-Wei

alenkacz · 2019-06-28T07:05:33Z

@ahg-g @Huang-Wei in the meantime, this got merged 40090e8#diff-f6b72dfe2b2c2cee201d12156e80366f so the reset for prebind is already in place. I removed it from this PR. Please re-review

Huang-Wei · 2019-06-28T07:29:04Z

@alenkacz please squash the commits into one commit; it'd be good to merge then.

alenkacz · 2019-06-30T05:49:31Z

@Huang-Wei should be ready

k8s-ci-robot · 2019-06-30T06:33:44Z

@alenkacz: Those labels are not set on the issue: sig/

In response to this:

/remove-sig api-machinery apps architecture auth cli cloud-provider cluster-lifecycle contributor-experience instrumentation network node storage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

alenkacz · 2019-06-30T06:35:30Z

/remove-area apiserver cloudprovider code-generation conformance dependency e2e-test-framework kubeadm kubectl kubelet release-eng

fejta-bot · 2019-06-30T06:47:40Z

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

Huang-Wei · 2019-06-30T08:49:12Z

/lgtm
/approve

k8s-ci-robot · 2019-06-30T08:49:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alenkacz, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~test/integration/scheduler/OWNERS~~ [Huang-Wei]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested review from Huang-Wei and resouer June 25, 2019 14:22

k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jun 25, 2019

ahg-g reviewed Jun 26, 2019


k8s-ci-robot assigned ahg-g Jun 26, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2019

Huang-Wei reviewed Jun 27, 2019


k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 27, 2019

alenkacz force-pushed the av/unreserve-flake branch from ed4a3ff to b84e07e Compare June 30, 2019 05:07

alenkacz force-pushed the av/unreserve-flake branch from b84e07e to 05e733c Compare June 30, 2019 06:20

k8s-ci-robot removed sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jun 30, 2019

k8s-ci-robot assigned Huang-Wei Jun 30, 2019

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2019

k8s-ci-robot merged commit 28366a1 into kubernetes:master Jun 30, 2019
