[Flaky test] [release-1.20] pull-kubernetes-e2e-gce-100-performance failing on Prometheus stack timeout #98039

Closed
ehashman opened this issue Jan 13, 2021 · 8 comments · Fixed by kubernetes/test-infra#20476
Labels
kind/failing-test: Categorizes issue or PR as related to a consistently or frequently failing test.
kind/flake: Categorizes issue or PR as related to a flaky test.
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
sig/release: Categorizes an issue or PR as relevant to SIG Release.
sig/scalability: Categorizes an issue or PR as relevant to SIG Scalability.
sig/testing: Categorizes an issue or PR as relevant to SIG Testing.

Comments

@ehashman
Member

Which jobs are flaking:

pull-kubernetes-e2e-gce-100-performance

Seems to only be on the release-1.20 branch?

Which test(s) are flaking:

e2e.go: ClusterLoaderV2

error during /go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --nodes=100 --provider=gce --report-dir=/workspace/_artifacts --testconfig=testing/load/config.yaml --testoverrides=./testing/experiments/enable_restart_count_check.yaml --testoverrides=./testing/experiments/use_simple_latency_query.yaml --testoverrides=./testing/overrides/load_throughput.yaml: exit status 1

Testgrid link:

https://testgrid.k8s.io/presubmits-kubernetes-scalability#pull-kubernetes-e2e-gce-100-performance

Reason for failure:

Further investigation shows the Prometheus stack is timing out:

W0113 01:29:37.787] F0113 01:29:37.770024   99322 clusterloader.go:291] Error while setting up prometheus stack: timed out waiting for the condition 

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/97998/pull-kubernetes-e2e-gce-100-performance/1349479663537229824/#1:build-log.txt%3A1783

more log excerpt
I0112 14:27:23.924] 724229f9f6cfd883c2d537dde566e4de9fdce225 - Sat Jan 9 01:23:06 2021 (Merge +refs/pull/97969/head:refs/pr/97969)
I0112 14:27:23.924] 1fc106ac18331af232daaf93fb952d91c7befd7c - Tue Jan 12 09:12:33 2021 (fixes nil panic for nil delegated auth options)
I0112 14:27:23.924] 6a61eae3621867b8a654901531718650b7e7dd12 - Sat Jan 9 01:23:05 2021 (Merge pull request #97463 from tkashem/automated-cherry-pick-of-#97206-upstream-release-1.20)
I0112 14:27:23.924] a5351e9c70221191d88dd0109dea2ebd14836cc3 - Fri Jan 8 23:41:07 2021 (Merge pull request #97866 from MikeSpreitzer/automated-cherry-pick-of-#97860-upstream-release-1.20)
I0112 14:27:23.925] d8a1dfb21f1f78595d1aaf9985eb58ee50edc940 - Fri Jan 8 18:32:38 2021 (move all variables in sampleAndWaterMarkHistograms::innerSet)
I0112 14:27:23.925] bde721813437d22206b59e4bbe0f8dca2b5209e4 - Fri Jan 8 16:44:54 2021 (Merge pull request #97826 from pacoxu/cherry-pick/nodeport-quota)
I0112 14:27:23.925] c4a3ce0cac9e922486663ea6f598fa0c45342a60 - Fri Jan 8 13:40:54 2021 (Merge pull request #97847 from pacoxu/automated-cherry-pick-of-#97625-upstream-release-1.20)
W0112 14:43:11.697]       "reportingComponent": "",
W0112 14:43:11.697]       "reportingInstance": ""
W0112 14:43:11.697]     }
W0112 14:43:11.697]   ]
W0112 14:43:11.697] }
W0112 14:43:11.697] F0112 14:43:11.682867   99367 clusterloader.go:291] Error while setting up prometheus stack: timed out waiting for the condition 
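
When triaging further runs of this job, the fatal line can be pulled out of a build log mechanically. The following is a minimal Python sketch, assuming the log keeps the shape quoted above: an outer Prow prefix followed by an inner klog header whose first letter is the severity. The function name `fatal_lines` and the regular expression are illustrative only; they are not part of clusterloader2 or Prow.

```python
import re

# Inner klog header as seen in the excerpts above:
# severity letter, MMDD, HH:MM:SS.micros, PID, file:line], message.
# The outer Prow prefix ("W0113 01:29:37.787] ") does not match because
# it has "]" directly after the timestamp instead of a PID.
KLOG_RE = re.compile(
    r"(?P<sev>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<pid>\d+) (?P<src>[^\s\]]+:\d+)\] (?P<msg>.*)"
)


def fatal_lines(log_text):
    """Return (source, message) pairs for klog lines logged at fatal severity."""
    results = []
    for line in log_text.splitlines():
        match = KLOG_RE.search(line)
        if match and match.group("sev") == "F":
            results.append((match.group("src"), match.group("msg").strip()))
    return results


if __name__ == "__main__":
    sample = (
        'W0112 14:43:11.697]       "reportingInstance": ""\n'
        "W0112 14:43:11.697] F0112 14:43:11.682867   99367 clusterloader.go:291] "
        "Error while setting up prometheus stack: timed out waiting for the condition \n"
    )
    for src, msg in fatal_lines(sample):
        print(src, "->", msg)
```

Matching on the inner header rather than the outer one matters here, because Prow tags the whole stderr stream as `W` regardless of the severity clusterloader2 logged.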

Anything else we need to know:

@ehashman added the kind/flake label on Jan 13, 2021
@k8s-ci-robot added the needs-sig and needs-triage labels on Jan 13, 2021
@k8s-ci-robot
Contributor

@ehashman: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ehashman
Member Author

/sig testing
/sig scalability

Just guessing; I haven't looked into the test failure in detail.

@k8s-ci-robot added the sig/testing and sig/scalability labels and removed the needs-sig label on Jan 13, 2021
@ehashman
Member Author

/sig release

@k8s-ci-robot added the sig/release label on Jan 13, 2021
@ehashman
Member Author

/kind failing-test

@k8s-ci-robot added the kind/failing-test label on Jan 13, 2021
@wongma7
Contributor

wongma7 commented Jan 13, 2021

W0113 21:39:11.244] I0113 21:39:11.244089   99722 util.go:93] 113/115 targets are ready, example not ready target: {map[endpoint:kube-controller-manager instance:10.40.0.3:10257 job:master namespace:monitoring service:master] down} 
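
To compare which targets are down across several failed runs, a line like the one above can be parsed into its ready/total counts and the labels of the example target. This is a rough Python sketch, assuming the message keeps exactly this wording and that label values contain no spaces (the `map[...]` text looks like Go's default map formatting). `parse_target_status` is a hypothetical helper name, not something provided by perf-tests.

```python
import re

# Matches "<ready>/<total> targets are ready" and, when present, the
# "example not ready target: {map[k:v ...] <health>}" suffix.
READY_RE = re.compile(
    r"(?P<ready>\d+)/(?P<total>\d+) targets are ready"
    r"(?:, example not ready target: "
    r"\{map\[(?P<labels>[^\]]*)\] (?P<health>\w+)\})?"
)


def parse_target_status(line):
    """Parse a readiness log line; return None if the line does not match."""
    match = READY_RE.search(line)
    if match is None:
        return None
    labels = {}
    if match.group("labels"):
        for pair in match.group("labels").split():
            # Split on the first ":" only, so "instance:10.40.0.3:10257"
            # keeps the port as part of the value.
            key, _, value = pair.partition(":")
            labels[key] = value
    return {
        "ready": int(match.group("ready")),
        "total": int(match.group("total")),
        "labels": labels,
        "health": match.group("health"),
    }


if __name__ == "__main__":
    line = (
        "W0113 21:39:11.244] I0113 21:39:11.244089   99722 util.go:93] "
        "113/115 targets are ready, example not ready target: "
        "{map[endpoint:kube-controller-manager instance:10.40.0.3:10257 "
        "job:master namespace:monitoring service:master] down} "
    )
    print(parse_target_status(line))
```

Note that the log reports only one example target even when more than one is not ready (here 2 of 115), so the parsed labels identify a sample, not the full set.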

@justaugustus
Member

cc: @kubernetes/sig-scalability @kubernetes/release-engineering

@wojtek-t
Member

@jkaniuk, can you please take a look?
It must be somehow related to this being a presubmit job, because the periodic jobs seem to be perfectly green.

@wojtek-t
Member

/assign @jkaniuk
