
Port coredns errors alert #184

Merged: 2 commits merged into openshift:master on Jul 22, 2020

Conversation

@yeya24 (Contributor) commented Jul 20, 2020

This PR updates the existing CoreDNSPanicking metric and adds alerts for CoreDNSErrorsHigh. For more context, please check https://coreos.slack.com/archives/CCH60A77E/p1595256269123600

Signed-off-by: Ben Ye <yb532204897@gmail.com>
@RiRa12621

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) Jul 20, 2020
          severity: critical
        annotations:
          message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."
      - alert: CoreDNSErrorsHigh
Contributor

I think having only one CoreDNSErrorsHigh alert would be best, in order to remove duplicate alerts and thus minimize "alert fatigue". As for which threshold to use (0.03 vs 0.01), I don't have a strong opinion. Thoughts?
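For reference, the structure being debated is two rules with the same name at different thresholds, roughly like the sketch below. Only the 0.01/0.03 thresholds, the critical severity, and the message are visible in this review; the rate-based expr is taken from the later diff excerpt, and the rest of the surrounding YAML is an assumption:

      - alert: CoreDNSErrorsHigh
        expr: |
          (sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_response_rcode_count_total[5m])))
          > 0.03
        labels:
          severity: critical
        annotations:
          message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."
      - alert: CoreDNSErrorsHigh
        expr: |
          (sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_response_rcode_count_total[5m])))
          > 0.01
        labels:
          severity: warning
        annotations:
          message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."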

Contributor Author

I prefer to keep the 0.01 one, since I don't think a critical-level alert is really necessary here.

Contributor

Sounds good to me.

Contributor Author

@RiRa12621 Hello, would you mind taking a look at this? Which alert should we use? The critical one or the warning one?


+1 for warning from me
/cc @jewzaam

Signed-off-by: Ben Ye <yb532204897@gmail.com>
@openshift-ci-robot removed the lgtm label (Indicates that a PR is ready to be merged.) Jul 21, 2020
Comment on lines +28 to +31
          (sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_response_rcode_count_total[5m])))
          > 0.01
Member

Trying to grok what this expr is attempting to do. It's a % of SERVFAIL rate change sums over 5-minute increments, and it alerts if it's over 1%?

Contributor Author

Correct

Member

@yeya24 it's not really clear what this is trying to do with the sum mixed in. Can this be written without it to simplify the query? If we're trying to say "if the rate of failures increases by more than 1% of total responses over a 5 minute period of time", that would be:

        expr: |
          sum(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"})
            /
          sum(coredns_dns_response_rcode_count_total)
          > 0.01
        for: 5m

The difference is we're not looking at the change in failures over time. I think this is better unless we assume there's some high chance of 1% of responses failing in a 5-minute period. The original query is looking at the changes in rates over time. You could exclude sum given the alert is "for: 5m". So if your rate of errors goes up slowly enough relative to the rate of total responses, you can have an ever-increasing number of failures as long as the change over time isn't more than 1% in any 5-minute block.
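To make the difference concrete (hypothetical numbers, not from the PR): coredns_dns_response_rcode_count_total is a counter that accumulates since process start, so the ratio of raw sums compares lifetime totals, while the rate-based ratio compares only the increase over the last 5 minutes:

    # Ratio of raw counter sums (lifetime totals, hypothetical values):
    #   sum(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}) = 500
    #   sum(coredns_dns_response_rcode_count_total)                   = 1000000
    #   ratio = 0.0005  -> stays below 0.01 even if every request in the last 5m failed
    #
    # Ratio of 5-minute rates (per-second increase over the last 5 minutes):
    #   sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m])) = 10
    #   sum(rate(coredns_dns_response_rcode_count_total[5m]))                   = 10
    #   ratio = 1.0  -> crosses the 0.01 threshold immediately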

Contributor

I think there's some misunderstanding of what the sum function does here. From my understanding, sum is just combining the rates for each separate server/zone to arrive at a single aggregated rate measurement.
So I don't think we would want to exclude sum here. @yeya24, does that sound right?
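For illustration, a minimal sketch of that aggregation, assuming the counter carries per-server/zone labels as upstream CoreDNS exposes them (the series and values below are hypothetical):

    # Without sum(): one rate per label combination
    rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m])
    #   {server="dns://:5353", zone="."}               0.20
    #   {server="dns://:5353", zone="cluster.local."}  0.05

    # With sum(): the per-series rates collapse into a single aggregated value (0.25)
    sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))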

Member

I agree; I left sum in and removed rate. The risk I see with summing rates over time is that subtle, cumulative increases in failure rates will not trip the alert.

Contributor Author

Sorry for the late reply.

        expr: |
          sum(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"})
            /
          sum(coredns_dns_response_rcode_count_total)
          > 0.01
        for: 5m

This query calculates the failure ratio over all the responses, which is not what I want.

rate restricts the samples to the last 5 minutes, and IMO this makes sense in this context.
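Putting the thread's conclusion together, the merged rule would look roughly like the sketch below; the expr, threshold, severity, and message all come from the discussion above, while the for duration is an assumption:

      - alert: CoreDNSErrorsHigh
        expr: |
          (sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_response_rcode_count_total[5m])))
          > 0.01
        for: 10m  # assumed duration; not fixed anywhere in this thread
        labels:
          severity: warning
        annotations:
          message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."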

@RiRa12621

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) Jul 22, 2020
@sgreene570 (Contributor)

Thanks @yeya24 !
Looks good.
/lgtm

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RiRa12621, sgreene570, yeya24

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jul 22, 2020
@openshift-merge-robot merged commit b428d7b into openshift:master on Jul 22, 2020
Labels: approved, lgtm
6 participants