
[Flaky test] [sig-storage] [Serial] Volume metrics PVController should create bound pv/pvc count metrics for pvc controller after creating both pv and pvc #75275

Closed
mariantalla opened this Issue Mar 12, 2019 · 22 comments

mariantalla (Contributor) commented Mar 12, 2019

Which jobs are flaking:

Which test(s) are flaking:
[sig-storage] [Serial] Volume metrics PVController should create bound pv/pvc count metrics for pvc controller after creating both pv and pvc

Reason for failure:

/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/volume_metrics.go:412
Wrong pv controller metric pv_collector_bound_pvc_count(namespace): wanted map[pv-7445:1], got map[]
Expected
    <map[string]int64 | len:0>: {}
to equal
    <map[string]int64 | len:1>: {"pv-7445": 1}
/go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/volume_metrics.go:355
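
For context, the check that produces this message is a plain map comparison: the test builds the namespace-to-count map it expects and compares it against what it scraped from the controller-manager's pv_collector_bound_pvc_count metric. Below is a minimal standalone sketch of that style of assertion, with made-up data and names; the real check lives in test/e2e/storage/volume_metrics.go.

// Illustrative only; save as a _test.go file to run with go test.
package sketch

import (
    "testing"

    "github.com/onsi/gomega"
)

// TestBoundPVCCountMetric mimics the shape of the failing check: compare the
// scraped metric values against the expected namespace-to-count map.
func TestBoundPVCCountMetric(t *testing.T) {
    g := gomega.NewWithT(t)

    // What the test expects after creating one PV/PVC pair in its namespace.
    expected := map[string]int64{"pv-7445": 1}

    // What was scraped from pv_collector_bound_pvc_count; in the flake this
    // came back empty, which is what produced the failure quoted above.
    scraped := map[string]int64{"pv-7445": 1}

    g.Expect(scraped).To(gomega.Equal(expected))
}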

Anything else we need to know:

  • At the moment, this test fails anywhere from 25% to 40% of the time in the above release-blocking jobs.
  • Triage

/sig storage
/kind flake
/remove-kind failing-test
/priority important-soon
/milestone v1.14

cc @kubernetes/sig-storage-test-failures
cc @smourapina @alejandrox1 @kacole2 @mortent

msau42 (Member) commented Mar 12, 2019

/assign @gnufied

mariantalla moved this from Flakes to Under investigation (prioritized) in 1.15 CI Signal Mar 12, 2019

gnufied (Member) commented Mar 12, 2019

Fabio has volunteered to work on this.

/assign @bertinatto

mariantalla (Contributor, Author) commented Mar 13, 2019

Hey @bertinatto @gnufied, just checking in on this. Any findings?

bertinatto (Member) commented Mar 13, 2019

I've run this test in a loop for around 1.5 hours, but it hasn't flaked.

This is what I think is happening:

  1. The PV controller is running and has some not-yet-synchronized metrics in memory. The standard interval at which metrics are collected is 1m: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#configuration-file
  2. The PV controller is upgraded and restarted
  3. The running test checks the PV controller metrics and finds nothing there (see the sketch below)
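
A rough sketch of why a restart would matter, assuming (for illustration only, this is not the actual PV controller code) that the count were tracked with a plain in-process gauge: the values live only in the running process, so a freshly started process reports nothing until its sync loop repopulates them.

package main

import (
    "fmt"

    "github.com/prometheus/client_golang/prometheus"
)

func main() {
    // Fresh registry, standing in for a newly (re)started controller process.
    reg := prometheus.NewRegistry()

    // Hypothetical stand-in for pv_collector_bound_pvc_count.
    boundPVCCount := prometheus.NewGaugeVec(prometheus.GaugeOpts{
        Name: "pv_collector_bound_pvc_count",
        Help: "Number of bound PVCs, partitioned by namespace.",
    }, []string{"namespace"})
    reg.MustRegister(boundPVCCount)

    // Before the sync loop has run, a scrape sees no series at all, which is
    // exactly a "got map[]" result from the test's point of view.
    families, _ := reg.Gather()
    fmt.Println("metric families before resync:", len(families)) // 0

    // Once the controller resyncs, the gauge is set and the next scrape sees
    // the bound PVC count again.
    boundPVCCount.WithLabelValues("e2e-tests-pv-zrz6h").Set(1)
    families, _ = reg.Gather()
    fmt.Println("metric families after resync:", len(families)) // 1
}
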
mariantalla (Contributor, Author) commented Mar 13, 2019

Thanks @bertinatto. Is this expected? Is the PV controller supposed to persist any in-memory metrics before shutting down, or is it best-effort?

bertinatto (Member) commented Mar 13, 2019

Unfortunately, so far I could neither reproduce the flake nor find any proof that my theory is correct.

@mlmhl, since you created that test, do you have any thoughts on why this flake is happening?

PS: My suspicion comes from the fact that the flake only happens in upgrade jobs, but I couldn't find any proof that it's correct.

msau42 (Member) commented Mar 13, 2019

This test is not run across an upgrade.

The upgrade happens and finishes, and then the test will run.

mlmhl (Contributor) commented Mar 14, 2019

Hi @bertinatto

Wrong pv controller metric pv_collector_bound_pvc_count(namespace): wanted map[pv-7445:1], got map[]

From the error message, we can see that the bound PVC count is incorrect. However, in the test case we check the bound PV count before checking the bound PVC count, so the bound PV count must have been correct while the bound PVC count was not. This is odd.

Can anyone help post more details about the upgrade job?

bertinatto (Member) commented Mar 14, 2019

@mlmhl, I think we do get at least 1 PV/PVC when we wait here, however, the test fails here when we check the number of PVCs again. So the first check succeeds and the second one fails. That's why I suspected the controller was reloading, but as @msau42 pointed out, that's not the case.

Do you think calculateRelativeValues could be misbehaving depending on the input that it's given?
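
For reference, the kind of computation in question is a delta between two snapshots of the same metric. The following is only an illustrative sketch, not the real calculateRelativeValues helper from the e2e framework; the point is that the result depends entirely on what the "before" snapshot already contained.

package main

import "fmt"

// relativeValues is an illustrative before/after delta: after[k] - before[k]
// for every label value seen in either snapshot, with zero deltas dropped.
func relativeValues(before, after map[string]int64) map[string]int64 {
    deltas := map[string]int64{}
    for k, v := range after {
        deltas[k] = v - before[k] // a key missing from "before" reads as 0
    }
    for k, v := range before {
        if _, ok := after[k]; !ok {
            deltas[k] = -v // the series disappeared between snapshots
        }
    }
    for k, v := range deltas {
        if v == 0 {
            delete(deltas, k)
        }
    }
    return deltas
}

func main() {
    // Expected case: nothing bound before the test, one PVC bound during it.
    before := map[string]int64{}
    after := map[string]int64{"some-namespace": 1}
    fmt.Println(relativeValues(before, after)) // map[some-namespace:1]
}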

mlmhl (Contributor) commented Mar 14, 2019

I think we do get at least 1 PV/PVC when we wait here, however, the test fails here when we check the number of PVCs again.

Yeah, I agree with this. So I wonder if we have multiple jobs running in parallel. For example:

  1. Before this test started, some other job created a PVC A, so the original bound PVC count would be 1.
  2. Then this test started and created a PVC B too; meanwhile the other job deleted PVC A, so the updated PVC count is still 1 and the calculateRelativeValues method will return 0 (worked through in the sketch below).
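
To make that arithmetic concrete (with made-up numbers): the bound-PVC count is already 1 when the test takes its "before" snapshot, and it is still 1 when the test re-reads the metric, so a before/after delta comes out as 0 and the expected entry never appears.

package main

import "fmt"

func main() {
    before := int64(1) // PVC A from another job is already bound
    after := int64(1)  // A was deleted while this test's PVC B bound
    fmt.Println("delta:", after-before) // 0 -> reported as "got map[]"
}
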
bertinatto (Member) commented Mar 14, 2019

Yeah, I agree with this. So I wonder if we have multiple jobs running in parallel. For example:

@mlmhl, since this is in the [Serial] test suite, there shouldn't be other jobs running, no?

bertinatto (Member) commented Mar 14, 2019

Quick status:

  • Noticed that the client usually has a different version from the server. I don't think this is a problem, but in any case I tried it and it didn't fail
    • This happened once, so it might have been a hiccup:
I0311 15:08:26.565] Client Version: version.Info{Major:"1", Minor:"13+", GitVersion:"v1.13.5-beta.0.7+6c1e64b94a3e11", GitCommit:"6c1e64b94a3e111199c934c39a0c25bc219ed5f9", GitTreeState:"clean", BuildDate:"2019-03-05T15:08:15Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
I0311 15:08:26.565] Server Version: version.Info{Major:"1", Minor:"15+", GitVersion:"v1.15.0-alpha.0.1138+6ec5a7d3371729", GitCommit:"6ec5a7d3371729a6d922f66ad8703de5e3d6ac95", GitTreeState:"clean", BuildDate:"2019-03-11T14:15:32Z", GoVersion:"go1.12", Compiler:"gc", Platform:"linux/amd64"}
  • Ran the test alone for hours and it didn't flake
  • Started running tests from the whole volume_metrics.go file and a similar flake happened:
    Wrong pv controller metric pv_collector_unbound_pv_count(storage_class): wanted map[], got map[standard:-1]
    Expected
        <map[string]int64 | len:1>: {"standard": -1}
    to equal
        <map[string]int64 | len:0>: {}
bertinatto (Member) commented Mar 14, 2019

We do run serial and disruptive tests together. Looks like this change is fairly recent.

@msau42, do you think this makes sense?

Edit: many jobs run serial and disruptive together, so I'm not sure...

msau42 (Member) commented Mar 14, 2019

I think running serial and disruptive together is fine. We do that for non-upgrade jobs too.

bertinatto (Member) commented Mar 15, 2019

This job from March 11:

https://storage.googleapis.com/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-new-master-upgrade-master/2312/build-log.txt

This is where it fails:

I0311 16:58:35.177] 
I0311 16:58:35.177]     Wrong pv controller metric pv_collector_bound_pvc_count(namespace): wanted map[e2e-tests-pv-zrz6h:1], got map[]
I0311 16:58:35.177]     Expected
I0311 16:58:35.177]         <map[string]int64 | len:0>: {}
I0311 16:58:35.178]     to equal
I0311 16:58:35.178]         <map[string]int64 | len:1>: {
I0311 16:58:35.178]             "e2e-tests-pv-zrz6h": 1,
I0311 16:58:35.178]         }
I0311 16:58:35.178] 

Around 10 seconds earlier, it runs this test:

I0311 16:58:23.604]   Local volume provisioner [Serial]
I0311 16:58:23.605]   /go/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/storage/persistent_volumes-local.go:433
I0311 16:58:23.605]     should create and recreate local persistent volume

That test was removed in January 2019.

So I think this job is running tests from an old tree.

Edit: seems like this job only upgrades the master.

msau42 (Member) commented Mar 15, 2019

The test only upgrades the master, and runs the 1.13 version of the test:

I0311 15:08:44.679] Mar 11 15:08:44.679: INFO: e2e test version: v1.13.5-beta.0.7+6c1e64b94a3e11

msau42 (Member) commented Mar 15, 2019

Are we sure that no other tests were running at the same time? If you search the PV controller logs, do you see other PVs being created during this test's time window?

bertinatto (Member) commented Mar 16, 2019

None of the logs I saw showed other tests running in the same timeframe, but I'll check again.

bertinatto (Member) commented Mar 16, 2019

This flake indeed seems to be caused by a failing upgrade. From the logs:

W0311 15:08:26.309] 2019/03/11 15:08:26 main.go:307: Something went wrong: encountered 1 errors: [error during ./hack/ginkgo-e2e.sh --ginkgo.focus=\[Feature:MasterUpgrade\] --upgrade-target=ci/latest --report-dir=/workspace/_artifacts --disable-log-dump=true --report-prefix=upgrade: exit status 1]

It hasn't happened in almost 3 days, so I think this workaround (later reverted) worked. The proper fix would be the Go 1.12.1 update, which happened recently.

I'll keep an eye on it, but it seems to be resolved.

spiffxp moved this from Under investigation (prioritized) to Open PR-wait for >5 successes before "Resolved" in 1.15 CI Signal Mar 16, 2019

spiffxp (Member) commented Mar 18, 2019

/close
Situation unchanged (EDIT: by this I mean we're still not seeing any failures); calling this resolved.

k8s-ci-robot (Contributor) commented Mar 18, 2019

@spiffxp: Closing this issue.

In response to this:

/close
Situation unchanged, calling this resolved

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

spiffxp moved this from Open PR-wait for >5 successes before "Resolved" to Resolved (week Mar 11) in 1.15 CI Signal Mar 18, 2019

spiffxp moved this from Resolved (week Mar 11) to Resolved (week Mar 18) in 1.15 CI Signal Mar 18, 2019

alejandrox1 moved this from Resolved (week Mar 18) to Resolved (>2 weeks old) in 1.15 CI Signal Apr 19, 2019
