Bug 1798450: alerts for aggregated API metrics #746
/assign @deads2k
```yaml
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is flapping."
    description: "The availability of the aggregated API {{ $labels.name }}/{{ $labels.namespace }} changes too often; at least 2 changes were seen in the last 5 minutes."
    expr: |
      sum by(name, namespace)(changes(aggregator_unavailable_apiservice[5m])) > 2
```
It deserves an e2e test that would help find good values.
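One lightweight way to explore thresholds before a full e2e test is a `promtool` unit test against synthetic series. The sketch below is illustrative: the rule file name, the alert name `AggregatedAPIFlapping`, and the series values are assumptions, not taken from this PR.

```yaml
# alerts_test.yaml -- run with: promtool test rules alerts_test.yaml
# Assumes the flapping rule above lives in alerts.yaml under an alert
# named AggregatedAPIFlapping (the name is a guess; adjust to the PR).
rule_files:
  - alerts.yaml
evaluation_interval: 30s
tests:
  - interval: 30s
    input_series:
      # An apiservice that toggles availability on every scrape,
      # producing well over 2 changes in any 5m window.
      - series: 'aggregator_unavailable_apiservice{name="v1beta1.metrics.k8s.io",namespace="kube-system"}'
        values: '0 1 0 1 0 1 0 1 0 1 0'
    alert_rule_test:
      - eval_time: 5m
        alertname: AggregatedAPIFlapping
        exp_alerts:
          - exp_labels:
              severity: warning
              name: v1beta1.metrics.k8s.io
              namespace: kube-system
```

Varying the `values` string makes it cheap to see which real-world patterns would (or would not) cross a candidate threshold.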
Also, this is not perfect: frequent changes can go unnoticed if the value happens not to differ between scrapes.
A longer duration might give a better perspective.
Alternatively, we could use `sum by(name, namespace, reason)(changes(aggregator_unavailable_apiservice_count[5m])) > X`.
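Spelled out as a full rule, that alternative might look like the sketch below. The alert name, the threshold of 3, and the `for` duration are placeholders to be tuned, not values from this PR.

```yaml
# Sketch of the counter-based alternative; since the _count metric is a
# counter, changes() here reacts to increments rather than gauge flips.
- alert: AggregatedAPIFlapping
  annotations:
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is flapping."
  expr: |
    sum by(name, namespace, reason)(changes(aggregator_unavailable_apiservice_count[5m])) > 3
  for: 5m
  labels:
    severity: warning
```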
```yaml
    labels:
      severity: warning
  - alert: AggregatedAPIDown
    annotations:
      message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is down."
```
At the moment this metric doesn't work. I suspect the upstream code is broken. In an HA setup, one instance can set its copy of the metric to 1 while the other instance flips its own copy to 0, preventing the first instance from changing it back to 0 (I need to investigate). But something fishy is definitely going on, because I saw some instances report a service as unavailable while it was actually available.
@sttts FYI
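Until the upstream gauge is fixed, one possible mitigation (an assumption on my part, relying on each apiserver replica exposing its own series under a distinct `instance` label, so the replicas' values never overwrite each other in Prometheus) is to require agreement across replicas before firing:

```promql
# Fires only when every scraped replica reports the apiservice as
# unavailable; a single replica flipping its gauge won't trigger it.
min by(name, namespace)(aggregator_unavailable_apiservice) == 1
```

This trades sensitivity for robustness against the flip-flopping described above, so it is only a stopgap, not a substitute for the upstream fix.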
Suppose we fix it upstream; will this alert still be present on downgrade?
```yaml
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors."
    description: "The number of errors ({{ $labels.reason }}) has increased for the aggregated API {{ $labels.name }}/{{ $labels.namespace }} in the last 2 minutes."
    expr: |
      sum by(name, namespace, reason)(increase(aggregator_unavailable_apiservice_count[2m])) > 1
```
This will only report errors that have occurred in the last 2 minutes.
High values might actually indicate frequent changes in availability.
Note that values > 1 indicate errors in the given time frame.
/assign @sttts
Will something clean up the new alerts on downgrade? I am asking because the alert I am going to add will likely fire, as the previous versions don't have openshift/origin#24496.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: p0lyn0mial. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
requires openshift/origin#24496
Backport it.
How far back? I think up to
@p0lyn0mial: This pull request references Bugzilla bug 1798450, which is valid. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/retest
/retest
```yaml
    expr: |
      sum by(name, namespace)(increase(aggregator_unavailable_apiservice_count[1m])) > 2
    labels:
      severity: warning
```
@sichvoge we are kind of blindly implementing these alerts, without any rules of thumb about severity and how quickly they should fire. Can you comment on this case: aggregated APIs (like service-catalog, openshift-apiserver, or the metrics apiserver) being down for one minute?
Five minutes is a more normal time range, so I would change both of them to that.
I like to ask myself: if this fires, what can a user do, and what would the rulebook/playbook say? How can they fix this?
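Applying that suggestion, the down alert could be sketched as below. Only the 5-minute window comes from the comment above; the expression and labels are assumptions for illustration, not the final rule.

```yaml
# Sketch: hold the "unavailable" condition for 5m before firing,
# giving transient blips time to recover.
- alert: AggregatedAPIDown
  annotations:
    message: "An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is down."
  expr: |
    max by(name, namespace)(aggregator_unavailable_apiservice) == 1
  for: 5m
  labels:
    severity: warning
```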
/retest
/retest
@p0lyn0mial: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
moved the alerts
closing as it is replaced by openshift/cluster-monitoring-operator#691
/close
@p0lyn0mial: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Defines alerts for aggregated APIs as requested in https://bugzilla.redhat.com/show_bug.cgi?id=1772564