Revert "test/extended/prometheus: Validate alerting rules" #26499

Merged

Conversation

@stbenjam (Member) commented Oct 4, 2021

Reverts #26476

Many jobs - OVN and vSphere - are failing because of this new test. The test should be changed to a flake, or carry an exception list, while the failures are being fixed.

openshift-ci bot commented Oct 4, 2021

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stbenjam
To complete the pull request process, please assign bparees after the PR has been reviewed.
You can assign the PR to them by writing /assign @bparees in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@deads2k merged commit 1fd95df into openshift:master on Oct 4, 2021
@deads2k (Contributor) commented Oct 4, 2021

Manually merging the revert because of permafails across numerous jobs (to be listed by Stephen). Before it is un-reverted, the permafails must be eliminated. As with product code, we are biased toward reverts instead of fixes. This is the third such revert in the past week.

@dgrisonnet (Member) commented

I haven't looked at all the failures yet, but I think it would have been better to update the exception list rather than reverting the whole test, since the test already ships with one.
From what I can see, we missed the OVN-specific alerts when adding this test. The challenge with merging such a test is that new alerts breaking our guidelines may land before the test does, resulting in yet another batch of permanent failures. To prevent that, I think we should run the test on all possible jobs so that the infrastructure-specific alerts get allow-listed, and then, if we still see failures once the test is merged, update the allow list with the few alerts that were not caught beforehand. Otherwise, we will go back and forth reverting this test every time an alert lands without having been checked against it first. cc @bison

@deads2k (Contributor) commented Oct 4, 2021

I haven't looked at all the failures yet, but I think it would have been better to update the exception list rather than reverting the whole test, since the test already ships with one.
From what I can see, we missed the OVN-specific alerts when adding this test. The challenge with merging such a test is that new alerts breaking our guidelines may land before the test does, resulting in yet another batch of permanent failures. To prevent that, I think we should run the test on all possible jobs so that the infrastructure-specific alerts get allow-listed, and then, if we still see failures once the test is merged, update the allow list with the few alerts that were not caught beforehand. Otherwise, we will go back and forth reverting this test every time an alert lands without having been checked against it first. cc @bison

The unrevert can add an exception list and get into a working state before it merges. With large areas of CI stuck, the priority is getting back to a functional state; the unrevert plus exceptions can come in a follow-up PR that gets the exception list right before merging.

Otherwise, we will go back and forth reverting this test every time an alert lands without having been checked against it first.

I don't see that being the case. If you want to make a non-enforcing test to find offenders first, that's a very good idea. If you want to run all the jobs before merging, that's another approach. There is no reason to believe this cannot be done in a PR without sacrificing payload promotion and entire platforms of CI jobs.

@simonpasquier (Contributor) commented

Looking at the job failures, they all relate to missing annotations (description, summary, or both). We would have added exceptions had we known about these failures, but unfortunately those jobs weren't exercised in #26476. @deads2k, is there any way this could have been prevented?

For OVN:

Alerting rule "NetworkPodsCrashLooping" (group: cluster-network-operator-ovn.rules) has no 'description' annotation
Alerting rule "NoOvnMasterLeader" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "NoRunningOvnMaster" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "NodeWithoutOVNKubeNodePodRunning" (group: cluster-network-operator-ovn.rules) has no 'description' annotation
Alerting rule "NorthboundStale" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "SouthboundStale" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "V4SubnetAllocationThresholdExceeded" (group: cluster-network-operator-master.rules) has no 'description' annotation
Alerting rule "V6SubnetAllocationThresholdExceeded" (group: cluster-network-operator-master.rules) has no 'description' annotation

For vSphere:

Alerting rule "VSphereOpenshiftClusterHealthFail" (group: vsphere-problem-detector.rules) has no 'description' annotation, but has a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation
Alerting rule "VSphereOpenshiftClusterHealthFail" (group: vsphere-problem-detector.rules) has no 'summary' annotation
Alerting rule "VSphereOpenshiftNodeHealthFail" (group: vsphere-problem-detector.rules) has no 'description' annotation, but has a 'message' annotation. OpenShift alerts must use 'description' -- consider renaming the annotation
Alerting rule "VSphereOpenshiftNodeHealthFail" (group: vsphere-problem-detector.rules) has no 'summary' annotation

@deads2k (Contributor) commented Oct 4, 2021

Looking at the job failures, they all relate to missing annotations (description, summary, or both). We would have added exceptions had we known about these failures, but unfortunately those jobs weren't exercised in #26476. @deads2k, is there any way this could have been prevented?

Yes. You could create a non-failing test to gather data, or you could manually run the vSphere and OVN jobs in the origin repo.

@simonpasquier (Contributor) commented

Ack, we didn't anticipate the side effects and were too confident about the signal we got from CI on that particular PR. Lesson learned for the future...

@bison commented Oct 4, 2021

What's the best way to make a non-failing test? Marking it as a flake means it shows as passing, right? Do we do that, then just manually inspect the logs for runs over the next few days?

@simonpasquier (Contributor) commented

@bison my understanding is that we should use testresult.Flakef() instead of framework.Failf(). Then after it merges, we look for offenders with https://search.ci.openshift.org.

testresult.Flakef("Unexpected alert behavior during test:\n\n%s", strings.Join(flakes.List(), "\n"))
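
(A rough sketch of the flake-instead-of-fail pattern being described, assuming the helpers named in this thread, testresult.Flakef and framework.Failf. The package name, the reportViolations helper, and the testresult import path are assumptions for illustration only, not the actual code that landed in the follow-up PR.)

```go
package prometheus // illustrative package name

import (
	"strings"

	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/kubernetes/test/e2e/framework"

	// Import path assumed for illustration; the real test may import it differently.
	testresult "github.com/openshift/origin/pkg/test/ginkgo/result"
)

// reportViolations is a hypothetical helper: while the test is still gathering
// data, report violations as a flake so jobs keep passing; once the allow list
// is believed complete, flip enforcing to true and fail hard.
func reportViolations(violations []string, enforcing bool) {
	if len(violations) == 0 {
		return
	}
	flakes := sets.NewString(violations...)
	msg := "Unexpected alert behavior during test:\n\n" + strings.Join(flakes.List(), "\n")
	if enforcing {
		// Hard failure: breaks the job and, across many jobs, payload promotion.
		framework.Failf("%s", msg)
		return
	}
	// Non-enforcing: the job still passes, and offenders can be found later
	// via https://search.ci.openshift.org.
	testresult.Flakef("%s", msg)
}
```

The enforcing switch makes the later tightening a one-line change once search.ci shows the allow list is complete.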

@bison commented Oct 5, 2021

Yeah, that makes sense. I just wasn't sure exactly how to search for flakes. Anyway, I'll get a new PR up this morning.

bison pushed a commit to bison/origin that referenced this pull request Oct 5, 2021
The [OpenShift Alerting Consistency][1] enhancement defines a style
guide for the alerts shipped as part of OpenShift.  This adds a test
validating some of the guidelines considered required.

This was originally added in openshift#26476, but was reverted in openshift#26499 due to
failures with OVN and vSphere clusters.  This adds the tests back, but
adds exceptions for the non-compliant alerts as well as marking the
failing tests as flakes for now.  We'll gather data and make the tests
required once we're reasonably sure things are passing with all the
existing alerts.

[1]: https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md
@bison commented Oct 5, 2021

New PR: #26504

bison pushed a commit to bison/origin that referenced this pull request Oct 5, 2021
jsafrane pushed a commit to jsafrane/origin that referenced this pull request Nov 11, 2021