Fix TestUnreservePlugin flake #79371

Merged
merged 1 commit into from
Jun 30, 2019

Conversation

alenkacz
Contributor

What type of PR is this?
/kind flake

What this PR does / why we need it:
Fixes flakiness in TestUnreservePlugin.
In this test we wait for waitForPodUnschedulable to return and then assert that unresPlugin.numUnreserveCalled equals pbdPlugin.numPrebindCalled. That check runs asynchronously with the scheduler, so by the time it executes the pod may already have been rescheduled, and prebind may have been called more times than unreserve.

This PR simplifies the test by letting it pass when unreserve was hit AT LEAST ONCE. That should be enough to cover the scenario.
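
To make the difference between the two checks concrete, here is a minimal sketch in Go. The `counters` type and the `strictOK` / `relaxedOK` helpers are hypothetical names introduced only for illustration; they are not part of the test file. `strictOK` mirrors the old condition and `relaxedOK` the one proposed here.

```go
package main

import "fmt"

// counters is a hypothetical stand-in for the two plugin call counters.
type counters struct {
	numPrebindCalled   int
	numUnreserveCalled int
}

// strictOK mirrors the old assertion: unreserve ran at least once
// and exactly as many times as prebind.
func strictOK(c counters) bool {
	return c.numUnreserveCalled != 0 && c.numUnreserveCalled == c.numPrebindCalled
}

// relaxedOK mirrors the new assertion: unreserve ran at least once.
func relaxedOK(c counters) bool {
	return c.numUnreserveCalled >= 1
}

func main() {
	// Counters read after one complete failed scheduling attempt.
	settled := counters{numPrebindCalled: 1, numUnreserveCalled: 1}
	// Counters read while a retry has run prebind but not yet unreserve.
	midRetry := counters{numPrebindCalled: 2, numUnreserveCalled: 1}

	fmt.Println(strictOK(settled), relaxedOK(settled))   // true true
	fmt.Println(strictOK(midRetry), relaxedOK(midRetry)) // false true
}
```

The strict check only holds if the counters are read between scheduling attempts, which the test cannot control; the relaxed check holds in both cases.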

Which issue(s) this PR fixes:

Fixes #79166

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

/sig scheduling
cc. @bsalamat @draveness

@k8s-ci-robot
Contributor

Welcome @alenkacz!

It looks like this is your first PR to kubernetes/kubernetes 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/kubernetes has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. kind/flake Categorizes issue or PR as related to a flaky test. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jun 25, 2019
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Jun 25, 2019
@@ -581,8 +581,8 @@ func TestUnreservePlugin(t *testing.T) {
 		if err = waitForPodUnschedulable(cs, pod); err != nil {
 			t.Errorf("test #%v: Didn't expected the pod to be scheduled. error: %v", i, err)
 		}
-		if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled {
-			t.Errorf("test #%v: Expected the unreserve plugin to be called %d times, was called %d times.", i, pbdPlugin.numPrebindCalled, unresPlugin.numUnreserveCalled)
+		if unresPlugin.numUnreserveCalled < 1 {
Member

I don't think this is the right fix.

The problem is that the tests in this file modify some shared state (the members in TesterPlugin above). What we need to do is reset this state for every test. As a quick fix, you can reset pbdPlugin.numPrebindCalled and unresPlugin.numUnreserveCalled to zero at the beginning of the for loop.

But what we should really be doing is refactor this whole file to ensure that each test starts from a clean state to make sure that this doesn't happen again in the future.

Contributor Author

It does the reset here https://github.com/kubernetes/kubernetes/pull/79371/files#diff-f6b72dfe2b2c2cee201d12156e80366fL596 and the tests don't run in parallel...

I agree with the proposed refactoring, I still feel like this is the right fix at this point :)

Member

> It does the reset here https://github.com/kubernetes/kubernetes/pull/79371/files#diff-f6b72dfe2b2c2cee201d12156e80366fL596

reset() should be called at the beginning, not the end.

> and the tests don't run in parallel...

The tests don't run in parallel, but they may run in a different order.

> I agree with the proposed refactoring, I still feel like this is the right fix at this point :)

I don't think it is, we should expect bind and unreserve plugins to be executed the same number of times.

Member

The problem is that TestPrebindPlugin does not reset the counter. But a test shouldn't be relying on another, so the right fix is for this test to make sure the state it relies on is reset before it runs.

Member

On a closer look, I think you are right that we can't guarantee they are equal, because the scheduler and the test run in parallel; what we can guarantee is that the prebind counter is off by at most one :)

I still think that we have a bigger problem that the tests modify shared state and rely on each other to make sure the state is reset.

For now, in addition to the modification you did, I would suggest that you also reset the prebind counter in TestPrebindPlugin.

Contributor Author

@ahg-g what about the race I described? I saw it in the logs:

we wait for the pod to become unschedulable
that happens, and the pod goes back to the scheduling queue
then we check (at a later time) that those two numbers (prebind, unreserve) are equal
but in the meantime the pod got rescheduled and prebind was executed, but unreserve was not (YET)

Contributor Author

oh you replied in the meantime, good 👍

sure I can do that! And I can work on the refactoring after if no one else is working on that

Contributor Author

@ahg-g added the reset to the other test. You can re-review

@ahg-g
Member

ahg-g commented Jun 26, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 26, 2019
@@ -581,8 +582,8 @@ func TestUnreservePlugin(t *testing.T) {
 		if err = waitForPodUnschedulable(cs, pod); err != nil {
 			t.Errorf("test #%v: Didn't expected the pod to be scheduled. error: %v", i, err)
 		}
-		if unresPlugin.numUnreserveCalled == 0 || unresPlugin.numUnreserveCalled != pbdPlugin.numPrebindCalled {
-			t.Errorf("test #%v: Expected the unreserve plugin to be called %d times, was called %d times.", i, pbdPlugin.numPrebindCalled, unresPlugin.numUnreserveCalled)
+		if unresPlugin.numUnreserveCalled < 1 {
Member

Can unresPlugin.numUnreserveCalled be negative? If not, I'd say if unresPlugin.numUnreserveCalled == 0 makes more sense.

Contributor Author

fixed @Huang-Wei

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 27, 2019
@alenkacz
Contributor Author

@ahg-g @Huang-Wei in the meantime, this got merged 40090e8#diff-f6b72dfe2b2c2cee201d12156e80366f so the reset for prebind is already in place. I removed it from this PR. Please re-review

@Huang-Wei
Member

Huang-Wei commented Jun 28, 2019

@alenkacz please squash the commits into one commit; then it'd be good to merge.

@alenkacz
Contributor Author

@Huang-Wei should be ready

@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. area/apiserver area/cloudprovider area/code-generation area/conformance Issues or PRs related to kubernetes conformance tests area/dependency Issues or PRs related to dependency changes area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 30, 2019
@k8s-ci-robot k8s-ci-robot removed sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/auth Categorizes an issue or PR as relevant to SIG Auth. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. labels Jun 30, 2019
@k8s-ci-robot
Contributor

@alenkacz: Those labels are not set on the issue: sig/

In response to this:

/remove-sig api-machinery apps architecture auth cli cloud-provider cluster-lifecycle contributor-experience instrumentation network node storage

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot removed sig/network Categorizes an issue or PR as relevant to SIG Network. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/storage Categorizes an issue or PR as relevant to SIG Storage. labels Jun 30, 2019
@alenkacz
Contributor Author

/remove-area apiserver cloudprovider code-generation conformance dependency e2e-test-framework kubeadm kubectl kubelet release-eng

@k8s-ci-robot k8s-ci-robot removed area/apiserver area/cloudprovider area/code-generation area/conformance Issues or PRs related to kubernetes conformance tests area/dependency Issues or PRs related to dependency changes area/e2e-test-framework Issues or PRs related to refactoring the kubernetes e2e test framework area/kubeadm area/kubectl area/kubelet area/release-eng Issues or PRs related to the Release Engineering subproject labels Jun 30, 2019
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@Huang-Wei
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 30, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alenkacz, Huang-Wei

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 30, 2019
@k8s-ci-robot k8s-ci-robot merged commit 28366a1 into kubernetes:master Jun 30, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pull-kubernetes-integration#TestUnreservePlugin fails frequently
5 participants