Revert "test/extended/prometheus: Validate alerting rules" #26499

Merged

Conversation

@stbenjam (Member) commented Oct 4, 2021

Reverts #26476

Many jobs - OVN and vSphere - are failing because of this new test. The test should be changed to a flake, or carry an exception list, while the failures are being fixed.

openshift-ci bot commented Oct 4, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stbenjam
To complete the pull request process, please assign bparees after the PR has been reviewed.
You can assign the PR to them by writing /assign @bparees in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deads2k merged commit 1fd95df into openshift:master on Oct 4, 2021
@deads2k (Contributor) commented Oct 4, 2021

Manually merging the revert because of permafails across numerous jobs (to be listed by Stephen). Before it is un-reverted, the permafails must be eliminated. As with product code, we are biased toward reverts instead of fixes. This is the third such revert in the past week.

@dgrisonnet (Member) commented

I haven't looked at all the failures yet, but I think it would have been better to update the exception list rather than reverting the whole test, since the test already ships with one.
From what I can see, we missed the OVN-specific alerts when adding this test. The challenge with merging such a test is that new alerts breaking our guidelines may land before the test does, resulting in yet another batch of permanent failures. To prevent that, I think we should run the test on all possible jobs so that the infrastructure-specific alerts get allow-listed, and then, if we still see failures once the test is merged, update the allow list with the few alerts that were not caught beforehand. Otherwise, we will go back and forth reverting this test every time an alert lands without having been checked against it first. cc @bison

@deads2k (Contributor) commented Oct 4, 2021

I haven't looked at all the failures yet, but I think it would have been better to update the exception list rather than reverting the whole test, since the test already ships with one.
From what I can see, we missed the OVN-specific alerts when adding this test. The challenge with merging such a test is that new alerts breaking our guidelines may land before the test does, resulting in yet another batch of permanent failures. To prevent that, I think we should run the test on all possible jobs so that the infrastructure-specific alerts get allow-listed, and then, if we still see failures once the test is merged, update the allow list with the few alerts that were not caught beforehand. Otherwise, we will go back and forth reverting this test every time an alert lands without having been checked against it first. cc @bison

The unrevert can add an exception list and get into a working state before it merges. With large areas of CI stuck, the priority is getting back to a functional state; the unrevert plus exceptions can come in a follow-up PR that gets the exception list right before merging.

Otherwise, we will go back and forth reverting this test every time an alert lands without having been checked against it first.

I don't see that being the case. If you want to make a non-enforcing test to find offenders first, that's a very good idea. If you want to run all the jobs before merging, that's another approach. There is no reason to believe this cannot be done in a PR without sacrificing payload promotion and entire platforms of CI jobs.

@simonpasquier (Contributor) commented

Looking at the job failures, they all relate to missing annotations (description, summary, or both). We would have added exceptions had we known about these failures, but unfortunately those jobs weren't exercised in #26476. @deads2k, is there any way this could have been prevented?

For OVN:

Alerting rule "NetworkPodsCrashLooping" (group: cluster-network-operator-ovn.rules) has no 'description' annotation
Alerting rule "NoOvnMasterLeader" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "NoRunningOvnMaster" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "NodeWithoutOVNKubeNodePodRunning" (group: cluster-network-operator-ovn.rules) has no 'description' annotation
Alerting rule "NorthboundStale" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "SouthboundStale" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "V4SubnetAllocationThresholdExceeded" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "V6SubnetAllocationThresholdExceeded" (group: cluster-network-operator-master.rules) has no 'description' annotation

For vSphere:

Alerting rule "VSphereOpenshiftClusterHealthFail" (group: vsphere-problem-detector.rules) has no 'description' annotation, but has a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation
Alerting rule "VSphereOpenshiftClusterHealthFail" (group: vsphere-problem-detector.rules) has no 'summary' annotation
Alerting rule "VSphereOpenshiftNodeHealthFail" (group: vsphere-problem-detector.rules) has no 'description' annotation, but has a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation
Alerting rule "VSphereOpenshiftNodeHealthFail" (group: vsphere-problem-detector.rules) has no 'summary' annotation

@deads2k (Contributor) commented Oct 4, 2021

Looking at the job failures, they all relate to missing annotations (description, summary, or both). We would have added exceptions had we known about these failures, but unfortunately those jobs weren't exercised in #26476. @deads2k, is there any way this could have been prevented?

Yes. You could create a non-failing test to gather data, or you could manually run the vSphere and OVN jobs in the origin repo.

@simonpasquier (Contributor) commented

Ack, we didn't anticipate the side effects and were too confident about the signal we got from CI on that particular PR. Lesson learned for the future...

@bison commented Oct 4, 2021

What's the best way to make a non-failing test? Marking it as a flake means it shows as passing, right? Do we do that, then just manually inspect the logs for runs over the next few days?

@simonpasquier (Contributor) commented

@bison my understanding is that we should use testresult.Flakef() instead of framework.Failf(). Then after it merges, we look for offenders with https://search.ci.openshift.org.

testresult.Flakef("Unexpected alert behavior during test:\n\n%s", strings.Join(flakes.List(), "\n"))
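
(A rough sketch of the flake-instead-of-fail pattern being described, assuming the helpers named in this thread, testresult.Flakef and framework.Failf. The package name, the reportViolations helper, and the testresult import path are assumptions for illustration only, not the actual code that landed in the follow-up PR.)

```go
package prometheus // illustrative package name

import (
	"strings"

	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/kubernetes/test/e2e/framework"

	// Import path assumed for illustration; the real test may import it differently.
	testresult "github.com/openshift/origin/pkg/test/ginkgo/result"
)

// reportViolations is a hypothetical helper: while the test is still gathering
// data, report violations as a flake so jobs keep passing; once the allow list
// is believed complete, flip enforcing to true and fail hard.
func reportViolations(violations []string, enforcing bool) {
	if len(violations) == 0 {
		return
	}
	flakes := sets.NewString(violations...)
	msg := "Unexpected alert behavior during test:\n\n" + strings.Join(flakes.List(), "\n")
	if enforcing {
		// Hard failure: breaks the job and, across many jobs, payload promotion.
		framework.Failf("%s", msg)
		return
	}
	// Non-enforcing: the job still passes, and offenders can be found later
	// via https://search.ci.openshift.org.
	testresult.Flakef("%s", msg)
}
```

The enforcing switch makes the later tightening a one-line change once search.ci shows the allow list is complete.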

@bison commented Oct 5, 2021

Yeah, that makes sense. I just wasn't sure exactly how to search for flakes. Anyway, I'll get a new PR up this morning.

bison pushed a commit to bison/origin that referenced this pull request Oct 5, 2021
The [OpenShift Alerting Consistency][1] enhancement defines a style
guide for the alerts shipped as part of OpenShift.  This adds a test
validating some of the guidelines considered required.

This was originally added in openshift#26476, but was reverted in openshift#26499 due to
failures with OVN and vSphere clusters.  This adds the tests back, but
adds exceptions for the non-compliant alerts as well as marking the
failing tests as flakes for now.  We'll gather data and make the tests
required once we're reasonably sure things are passing with all the
existing alerts.

[1]: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md
@bison commented Oct 5, 2021

New PR: #26504

bison pushed a commit to bison/origin that referenced this pull request Oct 5, 2021
jsafrane pushed a commit to jsafrane/origin that referenced this pull request Nov 11, 2021