test/extended/prometheus: better check for firing alerts #24005
Conversation
Force-pushed from 9c6b37b to 36c3091
Force-pushed from 36c3091 to 76b49b4
/lgtm
/hold Putting on hold as WIP, but feel free to remove.
flaky infra /retest
/lgtm
/test e2e-aws
/retest
1 similar comment
/retest
// Checking for specific alert is done in "should have a Watchdog alert in firing state".
`ALERTS{alertstate="firing"}`: {metricTest{greaterThanEqual: false, value: 2}},
// Checking Watchdog alert state is done in "should have a Watchdog alert in firing state".
`ALERTS{alertname!="Watchdog",alertstate="firing"}`: {metricTest{greaterThanEqual: true, value: 1}},
Will {greaterThanEqual: true, value: 1} output success on 1? I thought that if there is 1 for a particular alert record it should fail; I don't know much about metrics, though.
For example, the alerts output for me are:
ALERTS{alertname="UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing",client="cluster-policy-controller/v0.0.0 (linux/amd64) kubernetes/$Format",code="0",component="apiserver",contentType="application/vnd.kubernetes.protobuf;stream=watch",endpoint="https",group="extensions",instance="10.0.138.54:6443",job="apiserver",namespace="default",resource="daemonsets",scope="cluster",service="kubernetes",severity="warning",verb="WATCH",version="v1beta1"}
1 @1571823124.618 1 @1571823154.618 1 @1571823184.618 1 @1571823214.618 ....
I had to extend the test framework to be able to fail if any result is returned. This is necessary because I think what we have here now is quite unreadable, and some features aren't even used. @s-urbaniak @brancz wdyt about refactoring our tests to just execute PromQL and only check whether anything is returned or not? So instead of doing:
We could just forward
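The refactoring proposed above might look roughly like this. A minimal sketch, assuming a hypothetical `sample` type and `expectEmpty` helper; the real tests use the Prometheus client model and the extended origin test framework:

```go
package main

import "fmt"

// sample stands in for a single series returned by a PromQL query.
// (Hypothetical minimal type for illustration only.)
type sample struct {
	labels map[string]string
	value  float64
}

// expectEmpty reports whether a query result matches the "no data"
// expectation: the test passes only when the query returned nothing.
func expectEmpty(result []sample) bool {
	return len(result) == 0
}

func main() {
	// A firing non-Watchdog alert makes the query return a sample,
	// so the "expect empty" check fails and surfaces the alert.
	firing := []sample{{
		labels: map[string]string{"alertname": "KubePodCrashLooping"},
		value:  1,
	}}
	fmt.Println(expectEmpty(nil))    // no alerts firing
	fmt.Println(expectEmpty(firing)) // something is firing
}
```

The appeal of this shape is that each test is just a query plus a yes/no expectation, rather than a per-query threshold configuration.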
flakes... /retest
I think the CI cluster is broken, second time.
We just saw this being fixed on other builds, so retrying. /retest
/retest
The borked pods (with SDN) were deleted a few minutes back and should be working now.
Let's try again, but it seems to be broken. /retest
/test images
Force-pushed from a3967eb to d52ce10
…lerts This extends the test framework by adding a way to expect no metrics being returned. Additionally it should improve testing that no alerts are firing, apart from Watchdog.
Force-pushed from d52ce10 to 3e0b931
/retest
1 similar comment
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: brancz, bwplotka, paulfantom, soltysh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
if tcs[j].nodata && len(metrics) == 0 {
	tcs[j].success = true
	break
}
Can you quickly elaborate why this is needed?
This is a check for the case when no metrics were reported and that was what we expected.
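In other words, the nodata flag short-circuits the usual value comparison. A minimal sketch of that control flow, using a simplified, hypothetical version of the metricTest struct (the real one also carries the greaterThanEqual and value fields used for comparisons):

```go
package main

import "fmt"

// metricTest mirrors the shape of the PR's test cases, simplified:
// nodata marks queries that are expected to return no series at all.
type metricTest struct {
	nodata  bool
	success bool
}

// evaluate marks a test case successful when it expected no data and the
// query indeed returned zero metrics; otherwise the normal value
// comparisons (greaterThanEqual etc.) would run instead.
func evaluate(tc *metricTest, metricCount int) {
	if tc.nodata && metricCount == 0 {
		tc.success = true
		return
	}
	// ...value comparisons against the returned samples would go here...
}

func main() {
	tc := metricTest{nodata: true}
	evaluate(&tc, 0) // query returned no series, as expected
	fmt.Println(tc.success)
}
```

If the query does return series while nodata is set, success stays false and the test fails, which is exactly how a firing non-Watchdog alert gets surfaced.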
/hold cancel
There still seem to be alerts firing while e2e is green: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/625/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2869/artifacts/e2e-aws/metrics/prometheus.tar
Revert #23995 and improve reporting of which alerts are firing.