test: Verify platform metrics are available #24117
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Ensure there are no regressions.
Force-pushed from ecf3813 to 7cac162
/test e2e-gcp
Sounds reasonable to me.
Restarted the failed GCP test; curious whether any of the metrics made it fail because it specifically runs on GCP?
`cluster_feature_set`: true,

// track installer type
`cluster_installer{type!="",invoker!=""}`: true,
How come we do not check that this value is > 0 as well?
good point
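If the test keeps its expectations as a map from PromQL expression to a bool, as the diff above suggests, one way to address this is to fold the value check into the query itself, so an empty result also catches a zero value. A minimal sketch under that assumption; the `>= 1` expressions and the surrounding wiring are illustrative, not the actual change in this PR:

```go
package main

import "fmt"

func main() {
	// Sketch only: expectations keyed by PromQL expression, value = "must return data".
	// Folding ">= 1" into the expression makes Prometheus drop series that fail the
	// comparison, so "returned something" also implies "value is at least 1".
	tests := map[string]bool{
		// presence-only check, as in the current diff
		`cluster_feature_set`: true,
		// hypothetical stricter form: installer metric must exist AND be >= 1
		`cluster_installer{type!="",invoker!=""} >= 1`: true,
	}
	for query, mustReturnData := range tests {
		fmt.Printf("query %q, expect data: %v\n", query, mustReturnData)
	}
}
```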
The old logic waited for prometheus to come up. That should no longer be necessary as we wait for cluster bringup before e2e tests are run. When a particular query is failing repeatedly, there is no need to print the error multiple times. Also check for unmarshal failures and check the status of the query, and be sure to print a newline.
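As a rough illustration of the "check for unmarshal failures and check the status of the query" part, here is a minimal sketch of decoding a Prometheus HTTP API response and surfacing those failures. It assumes the standard /api/v1/query JSON envelope; the struct and function names are illustrative, not the actual helper in this PR:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// prometheusResponse mirrors the relevant fields of the Prometheus
// /api/v1/query JSON envelope (status plus a result payload).
type prometheusResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string            `json:"resultType"`
		Result     []json.RawMessage `json:"result"`
	} `json:"data"`
	Error string `json:"error,omitempty"`
}

// checkQueryResult surfaces unmarshal failures and non-success statuses
// instead of silently treating them as "no data yet".
func checkQueryResult(raw []byte) error {
	var resp prometheusResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return fmt.Errorf("unable to parse query response: %v", err)
	}
	if resp.Status != "success" {
		return fmt.Errorf("query returned status %q: %s", resp.Status, resp.Error)
	}
	if len(resp.Data.Result) == 0 {
		return fmt.Errorf("query returned no results")
	}
	return nil
}

func main() {
	sample := []byte(`{"status":"success","data":{"resultType":"vector","result":[{"metric":{},"value":[0,"1"]}]}}`)
	fmt.Println(checkQueryResult(sample)) // <nil>
}
```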
Force-pushed from b1efc8a to 10c6be0
Just one little concern from my side: is this asserting a functioning monitoring stack? If not, I suggest creating a dedicated test.
OpenShift e2e requires a cluster monitoring stack (it is a non-optional part), unless you mean something different?
/retest With Michal’s fixes
/retest |
1 similar comment
/retest |
To catch regressions
/retest Please review the full test history for this PR and help us cut down flakes.
2 similar comments
/retest Please review the full test history for this PR and help us cut down flakes.
/retest Please review the full test history for this PR and help us cut down flakes.
Since 10c6be0 (test: Prometheus query test should fail more quickly, 2019-11-09, openshift#24117), we've been failing after only a few seconds of failures, which causes problems like [1]:

  fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:156]: Expected
      <map[string]error | len:1>: {
          "openshift_build_total{phase=\"Complete\"} >= 0": {
              s: "promQL query: openshift_build_total{phase=\"Complete\"} >= 0 had reported incorrect results: model.Vector{}",
          },
      }
  to be empty
  ...
  failed: (1m4s) 2019-11-26T23:05:24 "[Feature:Prometheus][Feature:Builds] Prometheus when installed on the cluster should start and expose a secured proxy and verify build metrics [Suite:openshift/conformance/parallel]"

when we haven't waited long enough for the Prometheus scrape to notice the new builds. Looking at the timing in that job:

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.4/73/build-log.txt | grep 'openshift_build_total.*Complete' | cut -b -50
  STEP: perform prometheus metric query openshift_bu
  Nov 26 23:05:08.772: INFO: Running '/usr/bin/kubec
  Nov 26 23:05:09.480: INFO: stderr: "+ curl -s -k -
  Nov 26 23:05:09.480: INFO: promQL query: openshift
  STEP: perform prometheus metric query openshift_bu
  Nov 26 23:05:10.481: INFO: Running '/usr/bin/kubec
  Nov 26 23:05:11.121: INFO: stderr: "+ curl -s -k -
  STEP: perform prometheus metric query openshift_bu
  Nov 26 23:05:12.122: INFO: Running '/usr/bin/kubec
  Nov 26 23:05:12.751: INFO: stderr: "+ curl -s -k -
  STEP: perform prometheus metric query openshift_bu
  Nov 26 23:05:13.751: INFO: Running '/usr/bin/kubec
  Nov 26 23:05:14.356: INFO: stderr: "+ curl -s -k -
  STEP: perform prometheus metric query openshift_bu
  Nov 26 23:05:15.356: INFO: Running '/usr/bin/kubec
  Nov 26 23:05:15.922: INFO: stderr: "+ curl -s -k -
      "openshift_build_total{phase=\"Complete\"}
          s: "promQL query: openshift_build_tota

so we had five queries over ~7s. With this commit, we'll wait at least 10s between retries, for a minimum duration of 40s between the first and fifth attempt, which should give us long enough to include at least one scrape (scrapes every 30s [2]).

Also rename maxPrometheusQueryRetries to maxPrometheusQueryAttempts, because this count also includes the initial, non-retry attempt.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1777189
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1777189#c3
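A minimal sketch of the retry pacing the commit message describes, using the renamed maxPrometheusQueryAttempts constant; the sleep-constant name, the helper signature, and the toy query runner are illustrative assumptions rather than the actual test code. With 5 attempts and at least 10s between them, the first and fifth attempts span at least 40s, which covers Prometheus's 30s scrape interval:

```go
package main

import (
	"fmt"
	"time"
)

const (
	// maxPrometheusQueryAttempts counts the initial attempt plus retries.
	maxPrometheusQueryAttempts = 5
	// prometheusQueryRetrySleep is an illustrative name for the pause between attempts.
	prometheusQueryRetrySleep = 10 * time.Second
)

// runQueryWithRetries retries a query up to maxPrometheusQueryAttempts times,
// sleeping between attempts so the first and last attempts span at least
// (maxPrometheusQueryAttempts-1) * prometheusQueryRetrySleep = 40s,
// long enough to include at least one 30s Prometheus scrape.
func runQueryWithRetries(query string, run func(string) error) error {
	var err error
	for attempt := 1; attempt <= maxPrometheusQueryAttempts; attempt++ {
		if err = run(query); err == nil {
			return nil
		}
		if attempt < maxPrometheusQueryAttempts {
			time.Sleep(prometheusQueryRetrySleep)
		}
	}
	return fmt.Errorf("query %q failed after %d attempts: %v", query, maxPrometheusQueryAttempts, err)
}

func main() {
	// Toy runner that succeeds on the third attempt, standing in for a real PromQL
	// call; the sleeps are real, so this example takes about 20s to finish.
	calls := 0
	err := runQueryWithRetries(`openshift_build_total{phase="Complete"} >= 0`, func(q string) error {
		calls++
		if calls < 3 {
			return fmt.Errorf("no results yet")
		}
		return nil
	})
	fmt.Println(err) // <nil>
}
```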