
PD multizone tests are flaky #72378

Closed
msau42 opened this issue Dec 27, 2018 · 4 comments · Fixed by #72410
Assignees: pohly
Labels: kind/failing-test (Categorizes issue or PR as related to a consistently or frequently failing test), sig/storage (Categorizes an issue or PR as relevant to SIG Storage)

Comments

msau42 (Member) commented Dec 27, 2018

Which jobs are failing:
gce/gke multizone/regional jobs

Which test(s) are failing:
Only PD tests

Since when has it been failing:
Around 12/21 with the 16:49 run

Testgrid link:
https://k8s-testgrid.appspot.com/sig-storage#gce-multizone

Reason for failure:
TBD

Anything else we need to know:

/kind failing-test

k8s-ci-robot added the kind/failing-test and needs-sig labels on Dec 27, 2018
msau42 (Member, Author) commented Dec 27, 2018

@kubernetes/sig-storage-test-failures

k8s-ci-robot added the sig/storage label and removed the needs-sig label on Dec 27, 2018
msau42 (Member, Author) commented Dec 27, 2018

The runs started flaking after #70862 merged. cc @pohly @verult

Here's one failing run:
https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-multizone/26745

Pod pod-subpath-test-gcepd-dynamicpv-pfsm failed to schedule because of conflicting node selectors:

I1227 19:03:12.381780       1 factory.go:1102] Unable to schedule volumes-5284/pod-subpath-test-gcepd-dynamicpv-pfsm: no fit: 0/10 nodes are available: 1 node(s) were unschedulable, 3 node(s) had volume node affinity conflict, 6 node(s) didn't match node selector.; waiting

We shouldn't need to set the node selector for PD tests using PVs.

pohly (Contributor) commented Dec 28, 2018

The PR broke the subpath test because the modification to the config struct embedded in the driver info now persists across tests:

if volType == testpatterns.InlineVolume {
	// PD will be created in framework.TestContext.CloudConfig.Zone zone,
	// so pods should be also scheduled there.
	g.driverInfo.Config.ClientNodeSelector = map[string]string{
		kubeletapis.LabelZoneFailureDomain: framework.TestContext.CloudConfig.Zone,
	}
}
A quick fix would be to reset that field in CreateDriver. The long-term fix is the change discussed in #72288.
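For illustration only, here is a minimal, self-contained sketch of that quick-fix idea (hypothetical types and names, not the actual e2e framework code): clear the mutable selector during driver setup so nothing set by a previous test leaks into the next one.

```go
package main

import "fmt"

// testConfig is a hypothetical stand-in for the per-driver test config.
type testConfig struct {
	ClientNodeSelector map[string]string
}

// gcePdDriver here is only a stand-in struct, not the real e2e driver.
type gcePdDriver struct {
	config testConfig
}

// createDriver plays the role of the driver setup hook: resetting
// ClientNodeSelector here guarantees a clean config even if an earlier
// inline-volume test set a zone selector on the shared struct.
func (g *gcePdDriver) createDriver() {
	g.config.ClientNodeSelector = nil
}

func main() {
	d := &gcePdDriver{}

	// Test 1 (inline volume) sets a zone selector on the shared config.
	d.createDriver()
	d.config.ClientNodeSelector = map[string]string{
		"failure-domain.beta.kubernetes.io/zone": "us-central1-a",
	}

	// Test 2 (dynamic PV) starts from a clean config because createDriver resets it.
	d.createDriver()
	fmt.Println(d.config.ClientNodeSelector == nil) // true
}
```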

My preference is to do the quick fix now, then either merge or close PRs in this order:

What I'd like to avoid is having to rebase PR #70992 on top of the long-term solution for issue #72288 - that'll be lots of code conflicts.

pohly (Contributor) commented Dec 28, 2018

/assign

The PR is here: #72410

pohly added a commit to pohly/kubernetes that referenced this issue Dec 28, 2018
PR kubernetes#70862 made each driver responsible for resetting its config, but
as it turned out, one place was missed in that PR: the in-tree gcepd
sets a node selector. Not resetting that caused other tests to fail
randomly depending on test execution order.

Now the test suite resets the config by taking a copy after setting up
the driver and restoring that copy before each test.

Long term the intention is to separate the entire test config from the
static driver info (kubernetes#72288),
but for now resetting the config is the fastest way to fix the test flake.

Fixes: kubernetes#72378
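For readers unfamiliar with the pattern, a minimal, self-contained sketch of the copy/restore approach described in this commit message (hypothetical names, not the real test-suite code): snapshot the config once after driver setup and restore that snapshot before every test.

```go
package main

import "fmt"

// testConfig stands in for the per-driver config that individual tests may mutate.
type testConfig struct {
	ClientNodeSelector map[string]string
}

// deepCopy copies the config so mutations of the live config cannot
// reach the saved snapshot (and vice versa).
func deepCopy(c testConfig) testConfig {
	out := testConfig{ClientNodeSelector: map[string]string{}}
	for k, v := range c.ClientNodeSelector {
		out.ClientNodeSelector[k] = v
	}
	return out
}

func main() {
	// Config as it looks right after setting up the driver.
	live := testConfig{ClientNodeSelector: map[string]string{}}
	saved := deepCopy(live) // snapshot taken once, after driver setup

	// One test mutates the shared config (the gcepd inline-volume case).
	live.ClientNodeSelector["failure-domain.beta.kubernetes.io/zone"] = "us-central1-b"

	// Before the next test runs, the suite restores the snapshot.
	live = deepCopy(saved)
	fmt.Println(len(live.ClientNodeSelector)) // 0: the selector did not persist
}
```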