[Flaky test] [release-1.20] pull-kubernetes-e2e-gce-100-performance failing on Prometheus stack timeout #98039

ehashman · 2021-01-13T23:18:49Z

Which jobs are flaking:

pull-kubernetes-e2e-gce-100-performance

Seems to only be on the release-1.20 branch?

Which test(s) are flaking:

e2e.go: ClusterLoaderV2

error during /go/src/k8s.io/perf-tests/run-e2e.sh cluster-loader2 --nodes=100 --provider=gce --report-dir=/workspace/_artifacts --testconfig=testing/load/config.yaml --testoverrides=./testing/experiments/enable_restart_count_check.yaml --testoverrides=./testing/experiments/use_simple_latency_query.yaml --testoverrides=./testing/overrides/load_throughput.yaml: exit status 1

Testgrid link:

https://testgrid.k8s.io/presubmits-kubernetes-scalability#pull-kubernetes-e2e-gce-100-performance

Reason for failure:

Further investigation shows the Prometheus stack is timing out:

W0113 01:29:37.787] F0113 01:29:37.770024   99322 clusterloader.go:291] Error while setting up prometheus stack: timed out waiting for the condition

https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/97998/pull-kubernetes-e2e-gce-100-performance/1349479663537229824/#1:build-log.txt%3A1783
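
For readers unfamiliar with the message: `timed out waiting for the condition` is the generic timeout error returned by the Kubernetes `wait` polling helpers in `k8s.io/apimachinery`. On its own it only says that a readiness check never returned true before the deadline. Below is a minimal Python sketch of that poll-until-ready pattern. It is not clusterloader2's actual code; `poll_until`, `WaitTimeoutError`, and the interval and timeout arguments are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


class WaitTimeoutError(Exception):
    """Raised when the condition is still false once the deadline has passed."""


def poll_until(
    condition: Callable[[], bool],
    interval_s: float,
    timeout_s: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Call `condition` repeatedly until it returns True or `timeout_s` elapses.

    `sleep` and `clock` are injectable so the loop can be exercised without
    real waiting; both are assumptions of this sketch.
    """
    deadline = clock() + timeout_s
    while True:
        if condition():
            return
        if clock() >= deadline:
            # Same wording as the error in the build log, for recognisability.
            raise WaitTimeoutError("timed out waiting for the condition")
        sleep(interval_s)
```

The error carries no detail about which check failed, so the useful information is in the log lines that precede it.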

More log excerpt:

I0112 14:27:23.924] 724229f9f6cfd883c2d537dde566e4de9fdce225 - Sat Jan 9 01:23:06 2021 (Merge +refs/pull/97969/head:refs/pr/97969)
I0112 14:27:23.924] 1fc106ac18331af232daaf93fb952d91c7befd7c - Tue Jan 12 09:12:33 2021 (fixes nil panic for nil delegated auth options)
I0112 14:27:23.924] 6a61eae3621867b8a654901531718650b7e7dd12 - Sat Jan 9 01:23:05 2021 (Merge pull request #97463 from tkashem/automated-cherry-pick-of-#97206-upstream-release-1.20)
I0112 14:27:23.924] a5351e9c70221191d88dd0109dea2ebd14836cc3 - Fri Jan 8 23:41:07 2021 (Merge pull request #97866 from MikeSpreitzer/automated-cherry-pick-of-#97860-upstream-release-1.20)
I0112 14:27:23.925] d8a1dfb21f1f78595d1aaf9985eb58ee50edc940 - Fri Jan 8 18:32:38 2021 (move all variables in sampleAndWaterMarkHistograms::innerSet)
I0112 14:27:23.925] bde721813437d22206b59e4bbe0f8dca2b5209e4 - Fri Jan 8 16:44:54 2021 (Merge pull request #97826 from pacoxu/cherry-pick/nodeport-quota)
I0112 14:27:23.925] c4a3ce0cac9e922486663ea6f598fa0c45342a60 - Fri Jan 8 13:40:54 2021 (Merge pull request #97847 from pacoxu/automated-cherry-pick-of-#97625-upstream-release-1.20)
W0112 14:43:11.697]       "reportingComponent": "",
W0112 14:43:11.697]       "reportingInstance": ""
W0112 14:43:11.697]     }
W0112 14:43:11.697]   ]
W0112 14:43:11.697] }
W0112 14:43:11.697] F0112 14:43:11.682867   99367 clusterloader.go:291] Error while setting up prometheus stack: timed out waiting for the condition
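
The fatal line follows the klog header layout (`Lmmdd hh:mm:ss.uuuuuu id file:line] message`), wrapped in a shorter outer prefix that appears to come from the job runner. Below is a rough Python sketch for pulling fatal entries out of a downloaded `build-log.txt`, assuming that layout holds. `find_fatal_entries` and the regex are illustrative and not part of any Kubernetes tooling.

```python
import re

# klog-style header: severity letter, mmdd, time with microseconds,
# a process/thread id field, then "file:line] message".
# The outer wrapper prefix only has millisecond precision, so it does not match.
KLOG_RE = re.compile(
    r"(?P<severity>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+(?P<pid>\d+) "
    r"(?P<file>[^\s:]+):(?P<line>\d+)\] (?P<message>.*)"
)


def find_fatal_entries(log_text: str) -> list:
    """Return one dict per log line whose klog severity is F (fatal)."""
    entries = []
    for raw in log_text.splitlines():
        match = KLOG_RE.search(raw)
        if match and match.group("severity") == "F":
            entries.append(match.groupdict())
    return entries
```

Run against the excerpt above, this should return a single entry pointing at `clusterloader.go`, line 291.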

Anything else we need to know:

links to go.k8s.io/triage appreciated: https://storage.googleapis.com/k8s-gubernator/triage/index.html?pr=1&job=pull-kubernetes-e2e-gce-100-performance#ee23545fee29a24e6fcd
links to specific failures in spyglass appreciated


k8s-ci-robot · 2021-01-13T23:18:57Z

@ehashman: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

ehashman · 2021-01-13T23:19:56Z

/sig testing
/sig scalability

just guessing, I haven't looked into the test failure in detail

ehashman · 2021-01-13T23:23:19Z

/sig release

ehashman · 2021-01-13T23:24:21Z

/kind failing-test

wongma7 · 2021-01-13T23:28:31Z

W0113 21:39:11.244] I0113 21:39:11.244089   99722 util.go:93] 113/115 targets are ready, example not ready target: {map[endpoint:kube-controller-manager instance:10.40.0.3:10257 job:master namespace:monitoring service:master] down}
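
To check whether other failed runs are stuck on the same target, the readiness line can be parsed mechanically. Below is a small Python sketch, assuming the line format matches the excerpt above and that label values contain no spaces. `parse_targets_ready` and the regex are hypothetical helpers, not part of perf-tests.

```python
import re

# Matches e.g. "113/115 targets are ready, example not ready target: {map[k:v ...] down}".
# The "example not ready target" suffix is treated as optional.
READY_RE = re.compile(
    r"(?P<ready>\d+)/(?P<total>\d+) targets are ready"
    r"(?:, example not ready target: "
    r"\{map\[(?P<labels>[^\]]*)\] (?P<health>\w+)\})?"
)


def parse_targets_ready(line: str):
    """Extract ready/total counts and the example not-ready target, if present."""
    match = READY_RE.search(line)
    if match is None:
        return None
    labels = {}
    if match.group("labels"):
        for pair in match.group("labels").split():
            # Split on the first colon only, so "instance:10.40.0.3:10257"
            # keeps its port in the value.
            key, _, value = pair.partition(":")
            labels[key] = value
    return {
        "ready": int(match.group("ready")),
        "total": int(match.group("total")),
        "labels": labels,
        "health": match.group("health"),
    }
```

For the line above this gives 113 of 115 targets ready, with the `kube-controller-manager` endpoint at `10.40.0.3:10257` reported as `down`.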

justaugustus · 2021-01-13T23:30:54Z

cc: @kubernetes/sig-scalability @kubernetes/release-engineering

wojtek-t · 2021-01-14T07:23:09Z

@jkaniuk - can you please take a look at it?
It must be somehow related to this being a presubmit, because the periodic jobs seem to be perfectly green.

wojtek-t · 2021-01-14T07:23:17Z

/assign @jkaniuk

ehashman added the kind/flake label Jan 13, 2021

k8s-ci-robot added the needs-sig and needs-triage labels Jan 13, 2021

k8s-ci-robot added the sig/testing and sig/scalability labels and removed the needs-sig label Jan 13, 2021

ehashman mentioned this issue Jan 13, 2021

Automated cherry pick of #97980: Revert "Merge pull request #92817 from kmala/kubelet" #97998

Merged

wongma7 mentioned this issue Jan 13, 2021

Automated cherry pick of #96821: Use volumeHandle as PV name when translating EBS inline #98030

Merged

ehashman mentioned this issue Jan 13, 2021

Automated cherry pick of #94087: node sync at least once #97995

Merged

k8s-ci-robot added the sig/release label Jan 13, 2021

k8s-ci-robot added the kind/failing-test label Jan 13, 2021

gjkim42 mentioned this issue Jan 14, 2021

vendor: update cAdvisor to v0.38.7 #98014

Merged

k8s-ci-robot assigned jkaniuk Jan 14, 2021

tosi3k mentioned this issue Jan 14, 2021

Fix performance presubmit test job to release-1.20 branch of k/k kubernetes/test-infra#20476

Merged

k8s-ci-robot closed this as completed in kubernetes/test-infra#20476 Jan 14, 2021
