test/extended/prometheus: better check for firing alerts #24005
Conversation
Force-pushed from 9c6b37b to 36c3091
Force-pushed from 36c3091 to 76b49b4
/lgtm
/hold Putting on hold as WIP, but feel free to remove.
flaky infra /retest
/lgtm
/test e2e-aws
/retest
1 similar comment
/retest
// Checking for specific alert is done in "should have a Watchdog alert in firing state".
`ALERTS{alertstate="firing"}`: {metricTest{greaterThanEqual: false, value: 2}},
// Checking Watchdog alert state is done in "should have a Watchdog alert in firing state".
`ALERTS{alertname!="Watchdog",alertstate="firing"}`: {metricTest{greaterThanEqual: true, value: 1}},
Will {greaterThanEqual: true, value: 1} output success on 1? I thought that if there is 1 for a particular alert record it should fail; I don't know much about metrics, though.
For example, the alerts output for me are:
ALERTS{alertname="UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing",client="cluster-policy-controller/v0.0.0 (linux/amd64) kubernetes/$Format",code="0",component="apiserver",contentType="application/vnd.kubernetes.protobuf;stream=watch",endpoint="https",group="extensions",instance="10.0.138.54:6443",job="apiserver",namespace="default",resource="daemonsets",scope="cluster",service="kubernetes",severity="warning",verb="WATCH",version="v1beta1"}
1 @1571823124.618 1 @1571823154.618 1 @1571823184.618 1 @1571823214.618 ....
I had to extend the test framework to be able to fail if any result is returned. This is necessary because I think what we have here now is quite unreadable, and some features aren't even used. @s-urbaniak @brancz wdyt about refactoring our tests to just execute PromQL and only check whether anything is returned or not? So instead of doing:
We could just forward
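The refactoring proposed above might look roughly like this. A minimal sketch, assuming a hypothetical `sample` type and `expectEmpty` helper; the real tests use the Prometheus client model and the extended origin test framework:

```go
package main

import "fmt"

// sample stands in for a single series returned by a PromQL query.
// (Hypothetical minimal type for illustration only.)
type sample struct {
	labels map[string]string
	value  float64
}

// expectEmpty reports whether a query result matches the "no data"
// expectation: the test passes only when the query returned nothing.
func expectEmpty(result []sample) bool {
	return len(result) == 0
}

func main() {
	// A firing non-Watchdog alert makes the query return a sample,
	// so the "expect empty" check fails and surfaces the alert.
	firing := []sample{{
		labels: map[string]string{"alertname": "KubePodCrashLooping"},
		value:  1,
	}}
	fmt.Println(expectEmpty(nil))    // no alerts firing
	fmt.Println(expectEmpty(firing)) // something is firing
}
```

The appeal of this shape is that each test is just a query plus a yes/no expectation, rather than a per-query threshold configuration.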
flakes... /retest
I think the CI cluster is broken, second time.
We just saw this being fixed on other builds, so retrying. /retest
/retest
The borked pods (with SDN) were deleted a few minutes back and should be working now.
Let's try again, but it seems to be broken. /retest
/test images
Force-pushed from a3967eb to d52ce10
…lerts This extends the test framework by adding a way to expect no metrics being returned. Additionally it should improve testing that no alerts are firing, apart from Watchdog.
Force-pushed from d52ce10 to 3e0b931
/retest
1 similar comment
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: brancz, bwplotka, paulfantom, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
if tcs[j].nodata && len(metrics) == 0 {
	tcs[j].success = true
	break
}
Can you quickly elaborate why this is needed?
This is a check for the case when no metrics were reported and that was what we expected.
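In other words, the nodata flag short-circuits the usual value comparison. A minimal sketch of that control flow, using a simplified, hypothetical version of the metricTest struct (the real one also carries the greaterThanEqual and value fields used for comparisons):

```go
package main

import "fmt"

// metricTest mirrors the shape of the PR's test cases, simplified:
// nodata marks queries that are expected to return no series at all.
type metricTest struct {
	nodata  bool
	success bool
}

// evaluate marks a test case successful when it expected no data and the
// query indeed returned zero metrics; otherwise the normal value
// comparisons (greaterThanEqual etc.) would run instead.
func evaluate(tc *metricTest, metricCount int) {
	if tc.nodata && metricCount == 0 {
		tc.success = true
		return
	}
	// ...value comparisons against the returned samples would go here...
}

func main() {
	tc := metricTest{nodata: true}
	evaluate(&tc, 0) // query returned no series, as expected
	fmt.Println(tc.success)
}
```

If the query does return series while nodata is set, success stays false and the test fails, which is exactly how a firing non-Watchdog alert gets surfaced.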
/hold cancel
There still seem to be alerts firing while e2e is green: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/625/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2869/artifacts/e2e-aws/metrics/prometheus.tar
Revert #23995 and improve reporting of which alerts are firing.