
Bug 1798450: alerts for aggregated API metrics #746

Closed

Conversation

p0lyn0mial (Contributor)

Defines alerts for aggregated APIs, as requested in https://bugzilla.redhat.com/show_bug.cgi?id=1772564.

@p0lyn0mial (Contributor Author)

/assign @deads2k

openshift-ci-robot added the size/S label (denotes a PR that changes 10-29 lines, ignoring generated files) on Jan 30, 2020.
message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} flapping."
description: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} availability changes too often. Seen at least 2 changes in the last 5 minutes."
expr: |
sum by(name, namespace)(changes(aggregator_unavailable_apiservice[5m])) > 2
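For context, the complete rule would look roughly like the sketch below; only the annotations and expression appear in the diff, so the group wrapper and the alert name AggregatedAPIFlapping are assumptions:

# A minimal sketch of the full alerting rule. The group name and the
# alert name are hypothetical; the annotations and expr come from the diff.
groups:
  - name: aggregated-apis
    rules:
      - alert: AggregatedAPIFlapping  # hypothetical name
        annotations:
          message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is flapping."
          description: "The availability of the aggregated API {{ $labels.name }}/{{ $labels.namespace }} changes too often. Seen more than 2 changes in the last 5 minutes."
        expr: |
          sum by(name, namespace)(changes(aggregator_unavailable_apiservice[5m])) > 2
        labels:
          severity: warning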
p0lyn0mial (Contributor Author):

This deserves an e2e test that would help find good threshold values.
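Short of a full e2e test, a promtool rules unit test could exercise the threshold. This is a sketch on my part: the rule file name, the series labels, and the AggregatedAPIFlapping alert name are all assumptions.

# Run with: promtool test rules <test-file>.yaml
rule_files:
  - aggregated-api-alerts.yaml  # hypothetical rule file

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # A series that flips availability every minute: 4 changes inside a 5m window.
      - series: 'aggregator_unavailable_apiservice{name="v1beta1.metrics.k8s.io",namespace="openshift-monitoring"}'
        values: '0 1 0 1 0 1'
    alert_rule_test:
      - eval_time: 5m
        alertname: AggregatedAPIFlapping  # hypothetical name, as above
        exp_alerts:
          - exp_labels:
              name: v1beta1.metrics.k8s.io
              namespace: openshift-monitoring
              severity: warning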

p0lyn0mial (Contributor Author):

This is also not perfect: frequent changes can go unnoticed, because changes() only sees the values that were actually scraped. For example, a gauge that flips 0 -> 1 -> 0 between two scrapes is stored with the same value at both samples, so the flap is invisible.

p0lyn0mial (Contributor Author):

A longer duration might give a better perspective.

p0lyn0mial (Contributor Author), Jan 31, 2020:

Alternatively, we could use sum by(name, namespace, reason)(changes(aggregator_unavailable_apiservice_count[5m])) > X.
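A sketch of that variant (the concrete threshold standing in for X and the surrounding rule fields are placeholders of mine). Since aggregator_unavailable_apiservice_count is a counter, every unavailability event increments it, so this form does not suffer from the between-scrapes aliasing described above:

- alert: AggregatedAPIFlapping  # hypothetical name, as above
  expr: |
    sum by(name, namespace, reason)(changes(aggregator_unavailable_apiservice_count[5m])) > 2  # 2 stands in for X
  labels:
    severity: warning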

  severity: warning
- alert: AggregatedAPIDown
  annotations:
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is down."
p0lyn0mial (Contributor Author):

At the moment this metric doesn't work; I suspect the upstream code is broken. In an HA setup, one instance can set its copy of the metric to 1 while the other instance flips its own copy to 0, preventing the first instance from ever changing it back to 0 (I need to investigate this). Something fishy is definitely going on, because I saw some instances report a service as unavailable while it was actually available.

@sttts FYI
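If the per-instance copies really do disagree, one defensive formulation (my suggestion, not something settled in this thread) is to take the max across instances, so an APIService counts as unavailable whenever any kube-apiserver reports it as such:

# Sketch: collapse per-instance disagreement before alerting.
# The APIService is treated as down if any instance reports 1.
expr: |
  max by(name, namespace)(aggregator_unavailable_apiservice) > 0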

p0lyn0mial (Contributor Author):

Suppose we fix it upstream: will this alert still be present on downgrade?

message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} have reported errors."
description: "The number of error ({{ $labels.reason }}) has increased for an aggregated API {{ $labels.name }}/{{ $labels.namespace }} in the last 2 minutes."
expr: |
sum by(name, namespace, reason)(increase(aggregator_unavailable_apiservice_count[2m])) > 1
p0lyn0mial (Contributor Author):

This will only report errors that have occurred in the last 2 minutes.

p0lyn0mial (Contributor Author):

High values might actually indicate frequent changes in availability.

p0lyn0mial (Contributor Author):

Note that values > 1 indicate errors in the given time frame. (Since increase() extrapolates the rate over the window, a single counter increment can show up as a non-integer value around 1, which is one reason to alert on > 1 rather than > 0.)

@p0lyn0mial (Contributor Author)

/assign @sttts

@p0lyn0mial (Contributor Author)

Will something clean up the new alerts on downgrade? I am asking because the alert I am going to add will likely fire, since previous versions don't have openshift/origin#24496.

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: p0lyn0mial
To complete the pull request process, please assign deads2k
You can assign the PR to them by writing /assign @deads2k in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@p0lyn0mial (Contributor Author)

Requires openshift/origin#24496.

@deads2k (Contributor), Feb 4, 2020:

Will something clean up the new alerts on downgrade? I am asking because the alert I am going to add will likely fire, since previous versions don't have openshift/origin#24496.

backport it.

@p0lyn0mial (Contributor Author)

backport it

How far back? I think back to 4.2.

p0lyn0mial changed the title from "alerts for aggregated API metrics" to "Bug 1798450: alerts for aggregated API metrics" on Feb 5, 2020.
openshift-ci-robot added the bugzilla/valid-bug label (indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) on Feb 5, 2020.
@openshift-ci-robot

@p0lyn0mial: This pull request references Bugzilla bug 1798450, which is valid.

In response to this:

Bug 1798450: alerts for aggregated API metrics

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@p0lyn0mial (Contributor Author)

/retest

@sttts (Contributor), Feb 5, 2020:

/retest

expr: |
  sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[1m])) > 2
labels:
  severity: warning
Contributor:

@sichvoge we are kind of blindly implementing these alerts, without any rules of thumb about severity and how quickly they should fire. Can you comment on this, in this case about aggregated APIs being down for one minute (like service-catalog, openshift-apiserver, or the metrics apiserver)?

Contributor:

5 minutes is more of a normal time range, so I would change it to that, for both of them.

I like to ask myself: what can a user do if this fires? What would the runbook/playbook say? How can they fix this?
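For the record, widening both windows as suggested would look roughly like this. Which expression belongs to which alert is only partly visible in the diff, so the pairing below is an assumption:

# Sketch: both expressions widened to 5m windows, per the review comment.
# Thresholds unchanged and still to be tuned.
- alert: AggregatedAPIErrors  # hypothetical name
  expr: |
    sum by(name, namespace, reason)(increase(aggregator_unavailable_apiservice_count[5m])) > 1
- alert: AggregatedAPIDown    # name taken from the diff above
  expr: |
    sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[5m])) > 2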

@mfojtik (Contributor), Feb 10, 2020:

/retest

1 similar comment
@p0lyn0mial (Contributor Author)

/retest

@openshift-ci-robot

@p0lyn0mial: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-ci-robot added the needs-rebase label (indicates a PR cannot be merged because it has merge conflicts with HEAD) on Feb 13, 2020.
@p0lyn0mial (Contributor Author)

Moved the alerts to the kubernetes-mixin repo: kubernetes-monitoring/kubernetes-mixin#355.

@p0lyn0mial (Contributor Author)

Closing, as it is replaced by openshift/cluster-monitoring-operator#691.

/close

@openshift-ci-robot

@p0lyn0mial: Closed this PR.

In response to this:

Closing, as it is replaced by openshift/cluster-monitoring-operator#691.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
