Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port coredns errors alert #184

Merged
merged 2 commits into from Jul 22, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
14 changes: 13 additions & 1 deletion manifests/0000_90_dns-operator_03_prometheusrules.yaml
Expand Up @@ -10,7 +10,8 @@ spec:
- name: openshift-dns.rules
rules:
- alert: CoreDNSPanicking
expr: coredns_panic_count_total > 0
expr: increase(coredns_panic_count_total[10m]) > 0
for: 5m
labels:
severity: warning
annotations:
Expand All @@ -22,3 +23,14 @@ spec:
severity: warning
annotations:
message: "CoreDNS Health Checks are slowing down (instance {{ $labels.instance }})"
- alert: CoreDNSErrorsHigh
expr: |
(sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
/
sum(rate(coredns_dns_response_rcode_count_total[5m])))
> 0.01
Comment on lines +28 to +31
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to grok what this expr is attempting to do. It's a % of SERFAIL rate change sums over 5 minute increments and alerts if it's over 1%?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yeya24 not really clear what this is trying to do with the sum mixed in. Can this be written with it and simplify the query? If we're trying to say "if the rate of failures increases by more than 1% of total responses over a 5 minute period of time" that would be:

        expr: |
          sum(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"})
            /
          sum(coredns_dns_response_rcode_count_total)
          > 0.01
        for: 5m

The difference is we're not looking at the change in failures over time. I think this is better unless we assume there's some high chance of 1% of responses failing in a 5 min period. The original query is looking at the changes in rates over time. You could exclude sum given the alert is "for: 5m". So if your rate of errors goes up slow enough relative to the rate of total responses you can have an ever increasing number of failures as long as the change over time isn't more than 1% in a 5 minute block.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's some misunderstanding of what the sum function does here. From my understanding, the sum function is just combining the rates for each separate server/zone, to arrive at a single aggregated rate measurement.
So I don't think we would want to exclude sum here. @yeya24 does that sound right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, I left sum in and removed rate. I see risk with sum of rates over time in that subtle cumulative increases in failure rates over time will not trip the alert.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late.

        expr: |
          sum(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"})
            /
          sum(coredns_dns_response_rcode_count_total)
          > 0.01
        for: 5m

This query calculates the failure ratio over all the responses, which is not what I want.

rate restricts the samples in 5min and IMO this makes sense in this context.

for: 5m
labels:
severity: warning
annotations:
message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."