
Port coredns errors alert #184

Merged: 2 commits merged into openshift:master on Jul 22, 2020

Conversation

@yeya24 (Contributor) commented Jul 20, 2020

This PR updates the existing CoreDNSPanicking metric and adds alerts for CoreDNSErrorsHigh. For more context, please check https://coreos.slack.com/archives/CCH60A77E/p1595256269123600

Signed-off-by: Ben Ye <yb532204897@gmail.com>
@RiRa12621

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) Jul 20, 2020
          severity: critical
        annotations:
          message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."
      - alert: CoreDNSErrorsHigh
Contributor

I think having only one CoreDNSErrorsHigh alert would be best, in order to remove duplicate alerts and thus minimize "alert fatigue". As for which threshold to use (0.03 vs 0.01), I don't have a strong opinion. Thoughts?
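For reference, the structure being debated is two rules with the same name at different thresholds, roughly like the sketch below. Only the 0.01/0.03 thresholds, the critical severity, and the message are visible in this review; the rate-based expr is taken from the later diff excerpt, and the rest of the surrounding YAML is an assumption:

      - alert: CoreDNSErrorsHigh
        expr: |
          (sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_response_rcode_count_total[5m])))
          > 0.03
        labels:
          severity: critical
        annotations:
          message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."
      - alert: CoreDNSErrorsHigh
        expr: |
          (sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_response_rcode_count_total[5m])))
          > 0.01
        labels:
          severity: warning
        annotations:
          message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."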

Contributor Author

I prefer to keep the 0.01 one, since I don't think a critical-level alert is really necessary here.

Contributor

Sounds good to me.

Contributor Author

@RiRa12621 Hello, would you mind taking a look at this? Which alert should we use? The critical one or the warning one?


+1 for warning from me
/cc @jewzaam

Signed-off-by: Ben Ye <yb532204897@gmail.com>
@openshift-ci-robot removed the lgtm label (Indicates that a PR is ready to be merged.) Jul 21, 2020
Comment on lines +28 to +31
          (sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_response_rcode_count_total[5m])))
          > 0.01
Member

Trying to grok what this expr is attempting to do. It's a % of SERVFAIL rate change sums over 5-minute increments, and it alerts if it's over 1%?

Contributor Author

Correct

Member

@yeya24 it's not really clear what this is trying to do with the sum mixed in. Can this be written without it to simplify the query? If we're trying to say "if the rate of failures increases by more than 1% of total responses over a 5 minute period of time", that would be:

        expr: |
          sum(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"})
            /
          sum(coredns_dns_response_rcode_count_total)
          > 0.01
        for: 5m

The difference is we're not looking at the change in failures over time. I think this is better unless we assume there's some high chance of 1% of responses failing in a 5-minute period. The original query is looking at the changes in rates over time. You could exclude sum given the alert is "for: 5m". So if your rate of errors goes up slowly enough relative to the rate of total responses, you can have an ever-increasing number of failures as long as the change over time isn't more than 1% in any 5-minute block.
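To make the difference concrete (hypothetical numbers, not from the PR): coredns_dns_response_rcode_count_total is a counter that accumulates since process start, so the ratio of raw sums compares lifetime totals, while the rate-based ratio compares only the increase over the last 5 minutes:

    # Ratio of raw counter sums (lifetime totals, hypothetical values):
    #   sum(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}) = 500
    #   sum(coredns_dns_response_rcode_count_total)                   = 1000000
    #   ratio = 0.0005  -> stays below 0.01 even if every request in the last 5m failed
    #
    # Ratio of 5-minute rates (per-second increase over the last 5 minutes):
    #   sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m])) = 10
    #   sum(rate(coredns_dns_response_rcode_count_total[5m]))                   = 10
    #   ratio = 1.0  -> crosses the 0.01 threshold immediately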

Contributor

I think there's some misunderstanding of what the sum function does here. From my understanding, sum is just combining the rates for each separate server/zone to arrive at a single aggregated rate measurement.
So I don't think we would want to exclude sum here. @yeya24, does that sound right?
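For illustration, a minimal sketch of that aggregation, assuming the counter carries per-server/zone labels as upstream CoreDNS exposes them (the series and values below are hypothetical):

    # Without sum(): one rate per label combination
    rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m])
    #   {server="dns://:5353", zone="."}               0.20
    #   {server="dns://:5353", zone="cluster.local."}  0.05

    # With sum(): the per-series rates collapse into a single aggregated value (0.25)
    sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))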

Member

I agree; I left sum in and removed rate. The risk I see with summing rates over time is that subtle, cumulative increases in failure rates will not trip the alert.

Contributor Author

Sorry for the late reply.

        expr: |
          sum(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"})
            /
          sum(coredns_dns_response_rcode_count_total)
          > 0.01
        for: 5m

This query calculates the failure ratio over all the responses, which is not what I want.

rate restricts the samples to the last 5 minutes, and IMO this makes sense in this context.
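Putting the thread's conclusion together, the merged rule would look roughly like the sketch below; the expr, threshold, severity, and message all come from the discussion above, while the for duration is an assumption:

      - alert: CoreDNSErrorsHigh
        expr: |
          (sum(rate(coredns_dns_response_rcode_count_total{rcode="SERVFAIL"}[5m]))
            /
          sum(rate(coredns_dns_response_rcode_count_total[5m])))
          > 0.01
        for: 10m  # assumed duration; not fixed anywhere in this thread
        labels:
          severity: warning
        annotations:
          message: "CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests."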

@RiRa12621

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) Jul 22, 2020
@sgreene570 (Contributor)

Thanks @yeya24 !
Looks good.
/lgtm

@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: RiRa12621, sgreene570, yeya24

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jul 22, 2020
@openshift-merge-robot merged commit b428d7b into openshift:master on Jul 22, 2020
Labels: approved, lgtm
6 participants