Bug 1992555: Comply with Openshift alerting guidelines #288

Merged
merged 1 commit into from Aug 19, 2021
9 changes: 6 additions & 3 deletions manifests/0000_90_dns-operator_03_prometheusrules.yaml
@@ -19,14 +19,16 @@ spec:
   labels:
     severity: warning
   annotations:
-    message: "{{ $value }} CoreDNS panics observed on {{ $labels.instance }}"
+    summary: CoreDNS panic
+    description: "{{ $value }} CoreDNS panics observed on {{ $labels.instance }}"
Comment on lines -22 to +23
Contributor
Does description supersede message? The alerting-consistency enhancement document doesn't say whether message is still needed or not, so if you don't know the answer, we can ask the document's author for clarification or examples.

Contributor Author
Good point. I used the cluster-etcd-operator as an example, and its alerts do not have 'message'. I will look into the document and ask if need be.

Contributor Author
The documentation says this:

Documentation Required

1. The name of the alerting rule should clearly identify the component impacted by the issue (for example etcdInsufficientMembers instead of InsufficientMembers, MachineConfigDaemonDrainError instead of MCDDrainError). It should be camel case, without whitespace, starting with a capital letter. The first part of the alert name should be the same for all alerts originating from the same component.
2. Alerting rules should have a "severity" label whose value is either info, warning or critical (matching what we have today and staying aside from the discussion whether we want minor or not).
3. Alerting rules should have a description annotation providing details about what is happening and how to resolve the issue.
4. Alerting rules should have a summary annotation providing a high-level description (similar to the first line of a commit message or email subject).
5. If there's a runbook in https://github.com/openshift/runbooks, it should be linked in the runbook_url annotation.

So it looks like summary and description are mandatory. I will ask about message.
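
The mechanically checkable parts of the requirements quoted above can be sketched as a small check over a rule parsed into a dict. This is only an illustration: `check_alert_rule` and the sample dicts are hypothetical names, not part of any OpenShift tooling, and the naming guideline is only partly covered (the component prefix and camel casing need human judgment).

```python
import re

# Severity values allowed by guideline 2.
ALLOWED_SEVERITIES = {"info", "warning", "critical"}


def check_alert_rule(rule):
    """Return a list of guideline violations for one alerting rule (a dict).

    Hypothetical helper for illustration only.
    """
    problems = []
    name = rule.get("alert", "")
    labels = rule.get("labels") or {}
    annotations = rule.get("annotations") or {}

    # Guideline 1 (partial): the name must exist and contain no whitespace.
    if not name or re.search(r"\s", name):
        problems.append("alert name must be non-empty and contain no whitespace")

    # Guideline 2: severity must be one of the allowed values.
    if labels.get("severity") not in ALLOWED_SEVERITIES:
        problems.append("severity label must be info, warning or critical")

    # Guidelines 3 and 4: description and summary annotations are required.
    for key in ("description", "summary"):
        if not annotations.get(key):
            problems.append(f"missing {key} annotation")

    return problems


# The CoreDNSErrorsHigh rule before and after this change.
text = "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."
old_rule = {
    "alert": "CoreDNSErrorsHigh",
    "labels": {"severity": "warning"},
    "annotations": {"message": text},
}
new_rule = {
    "alert": "CoreDNSErrorsHigh",
    "labels": {"severity": "warning"},
    "annotations": {"summary": "CoreDNS serverfail", "description": text},
}

print(check_alert_rule(old_rule))
print(check_alert_rule(new_rule))
```

Under these assumptions, the old rule is flagged for the missing description and summary annotations, and the updated rule passes. Guideline 5 (runbook_url) is left out because it only applies when a runbook exists.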

Contributor Author
From #forum-monitoring: "it hasn't been superseded, but there was a consensus upstream to use summary and description in favor of message in order to have multiple levels of abstraction." This is what upstream recommends: https://github.com/monitoring-mixins/docs#guidelines-for-alert-names-labels-and-annotations. So I think we can drop the message annotation.

 - alert: CoreDNSHealthCheckSlow
   expr: histogram_quantile(.95, sum(rate(coredns_health_request_duration_seconds_bucket[5m])) by (instance, le)) > 10
   for: 5m
   labels:
     severity: warning
   annotations:
-    message: "CoreDNS Health Checks are slowing down (instance {{ $labels.instance }})"
+    summary: CoreDNS health checks
+    description: "CoreDNS Health Checks are slowing down (instance {{ $labels.instance }})"
 - alert: CoreDNSErrorsHigh
   expr: |
     (sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[5m]))
@@ -37,4 +39,5 @@ spec:
   labels:
     severity: warning
   annotations:
-    message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."
+    summary: CoreDNS serverfail
+    description: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."