Bug 1798450: alerts for aggregated API metrics #746
/assign @deads2k
```yaml
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is flapping."
    description: "The availability of the aggregated API {{ $labels.name }}/{{ $labels.namespace }} changes too often; at least 2 changes were seen in the last 5 minutes."
    expr: |
      sum by(name, namespace)(changes(aggregator_unavailable_apiservice[5m])) > 2
```
It deserves an e2e test that would help find good values.
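One lightweight way to explore thresholds before a full e2e test is a `promtool` unit test against synthetic series. The sketch below is illustrative: the rule file name, the alert name `AggregatedAPIFlapping`, and the series values are assumptions, not taken from this PR.

```yaml
# alerts_test.yaml -- run with: promtool test rules alerts_test.yaml
# Assumes the flapping rule above lives in alerts.yaml under an alert
# named AggregatedAPIFlapping (the name is a guess; adjust to the PR).
rule_files:
  - alerts.yaml
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      # An apiservice that toggles availability on every scrape,
      # producing well over 2 changes in any 5m window.
      - series: 'aggregator_unavailable_apiservice{name="v1beta1.metrics.k8s.io",namespace="kube-system"}'
        values: '0 1 0 1 0 1 0 1 0 1 0'
    alert_rule_test:
      - eval_time: 5m
        alertname: AggregatedAPIFlapping
        exp_alerts:
          - exp_labels:
              severity: warning
              name: v1beta1.metrics.k8s.io
              namespace: kube-system
```

Varying the `values` string makes it cheap to see which real-world patterns would (or would not) cross a candidate threshold.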
Also, this is not perfect: frequent changes can go unnoticed if the value happens not to differ between scrapes.
A longer duration might give a better perspective.
Alternatively, we could use `sum by(name, namespace, reason)(changes(aggregator_unavailable_apiservice_count[5m])) > X`.
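Spelled out as a full rule, that alternative might look like the sketch below. The alert name, the threshold of 3, and the `for` duration are placeholders to be tuned, not values from this PR.

```yaml
# Sketch of the counter-based alternative; since the _count metric is a
# counter, changes() here reacts to increments rather than gauge flips.
- alert: AggregatedAPIFlapping
  annotations:
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is flapping."
  expr: |
    sum by(name, namespace, reason)(changes(aggregator_unavailable_apiservice_count[5m])) > 3
  for: 5m
  labels:
    severity: warning
```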
```yaml
    labels:
      severity: warning
  - alert: AggregatedAPIDown
    annotations:
      message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is down."
```
At the moment this metric doesn't work. I suspect the upstream code is broken. In an HA setup, one instance can set its copy of the metric to 1 while the other instance flips its own copy to 0, preventing the first instance from changing it back to 0 (I need to investigate). But something fishy is definitely going on, because I saw some instances report a service as unavailable while it was actually available.
@sttts FYI
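Until the upstream gauge is fixed, one possible mitigation (an assumption on my part, relying on each apiserver replica exposing its own series under a distinct `instance` label, so the replicas' values never overwrite each other in Prometheus) is to require agreement across replicas before firing:

```promql
# Fires only when every scraped replica reports the apiservice as
# unavailable; a single replica flipping its gauge won't trigger it.
min by(name, namespace)(aggregator_unavailable_apiservice) == 1
```

This trades sensitivity for robustness against the flip-flopping described above, so it is only a stopgap, not a substitute for the upstream fix.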
Suppose we fix it upstream; will this alert still be present on downgrade?
```yaml
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors."
    description: "The number of errors ({{ $labels.reason }}) has increased for the aggregated API {{ $labels.name }}/{{ $labels.namespace }} in the last 2 minutes."
    expr: |
      sum by(name, namespace, reason)(increase(aggregator_unavailable_apiservice_count[2m])) > 1
```
This will only report errors that have occurred in the last 2 minutes.
High values might actually indicate frequent changes in availability.
Note that values > 1 indicate errors in the given time frame.
/assign @sttts
Will something clean up the new alerts on downgrade? I am asking because the alert I am going to add will likely fire, as the previous versions don't have openshift/origin#24496.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: p0lyn0mial. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
requires openshift/origin#24496
Backport it.
How far back? I think up to
@p0lyn0mial: This pull request references Bugzilla bug 1798450, which is valid. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
/retest
```yaml
    expr: |
      sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[1m])) > 2
    labels:
      severity: warning
```
@sichvoge we are kind of blindly implementing these alerts, without any rules of thumb about severity and how quickly they should fire. Can you comment on this case: aggregated APIs (like service-catalog, openshift-apiserver, or the metrics apiserver) being down for one minute?
Five minutes is a more normal time range, so I would change both of them to that.
I like to ask myself: if this fires, what can a user do, and what would the rulebook/playbook say? How can they fix this?
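Applying that suggestion, the down alert could be sketched as below. Only the 5-minute window comes from the comment above; the expression and labels are assumptions for illustration, not the final rule.

```yaml
# Sketch: hold the "unavailable" condition for 5m before firing,
# giving transient blips time to recover.
- alert: AggregatedAPIDown
  annotations:
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is down."
  expr: |
    max by(name, namespace)(aggregator_unavailable_apiservice) == 1
  for: 5m
  labels:
    severity: warning
```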
/retest
/retest
@p0lyn0mial: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
moved the alerts
closing as it is replaced by openshift/cluster-monitoring-operator#691
/close
@p0lyn0mial: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Defines alerts for aggregated APIs as requested in https://bugzilla.redhat.com/show_bug.cgi?id=1772564