Bug 1904503: Add prometheus alerts for vsphere #126
Conversation
/retest
labels:
  severity: warning
annotations:
  message: "Vsphere node health checks are failing on {{ $labels.node }} with {{ $labels.check }}"
Is it OK to alert on each node separately? This may be quite spammy if all nodes are equally bad.
cc @openshift/openshift-team-monitoring
According to the monitoring team:

> then you will receive only one notification containing the description of all the metrics for which the expression is true

E.g. if you have 2 nodes for which `vsphere_node_check_errors == 1`, you'll receive one notification containing 2 firing alerts, along with the message for each failing node.
I think it might be okay. It is not unusual for all nodes in a 100-node cluster to go wrong at once, but the entire class of alerts can be silenced at once, so it should be fine. If we remove the node name from here, the alert will be less useful, because we can't tell which node these alerts are coming from.
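For reference, here is a minimal sketch of the Alertmanager grouping behavior the monitoring team describes; the route and receiver names are assumptions for illustration, not this operator's actual config:

```yaml
# Hypothetical Alertmanager route: alerts that share the same alertname
# are batched, so N failing nodes produce one notification with N alerts,
# each keeping its own {{ $labels.node }} in the message.
route:
  receiver: default
  group_by: ['alertname']   # group by alert name, not by 'node'
  group_wait: 30s           # wait briefly to batch alerts firing together
  group_interval: 5m        # add late arrivals to the existing group
receivers:
- name: default
```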
Force-pushed from 6cc44d2 to 3e06c4f
@gnufied: This pull request references Bugzilla bug 1904503, which is valid. 3 validation(s) were run on this bug
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@gnufied: This pull request references Bugzilla bug 1904503, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Force-pushed from 3e06c4f to 07c1c3a
expr: vsphere_cluster_check_errors == 1
for: 10m
labels:
  severity: critical
The other one is `warning`; do we really want this one critical? I'd start as low as possible, since we don't know how many clusters are going to report the alert after upgrading to 4.7.
Changed this to warning as well.
labels:
  severity: critical
annotations:
  message: "VSpehre cluster health checks are failing with {{ $labels.check }}"
still typo: VSpehre
Err, I fixed the wrong typo earlier. This one is fixed now. Sorry.
labels:
  severity: warning
annotations:
  message: "VSphere node health checks are failing on {{ $labels.node }} with {{ $labels.check }}"
This sounds better to me: "VSphere health check {{ $labels.check }} is failing on node {{ $labels.node }}"
But who am I to comment on others' English style :-)
You win this round. I renamed. :-)
Force-pushed from 07c1c3a to 66c8a4f
Force-pushed from 66c8a4f to 8ec0ecf
/retest
Thank you for the ping! 🎉

Just curious what the type is; I'm a bit worried this will be falsely firing all the time depending on the type of the metric.
- name: vsphere-problem-detector.rules
  rules:
  - alert: VSphereOpenshiftNodeHealthFail
    expr: vsphere_node_check_errors == 1
What type is this metric, a counter or a gauge? A quick search could not find it in the operator.
It's a gauge now, openshift/vsphere-problem-detector#24 (it used to be a counter until yesterday, which was hard to alert on).
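To spell out why the gauge is easier to alert on, here is a hedged sketch; the counter variant in the comments is hypothetical:

```yaml
# As a counter, the value only ever increases, so `== 1` would match just
# once; you'd have to alert on its rate of change instead, e.g.:
#   expr: increase(vsphere_node_check_errors[5m]) > 0
# As a gauge (1 while the check fails, 0 otherwise), the rule in this PR
# can compare the current value directly:
- alert: VSphereOpenshiftNodeHealthFail
  expr: vsphere_node_check_errors == 1
  for: 10m
  labels:
    severity: warning
```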
One small comment to improve the alerting rule: in the current version, a failed scrape by Prometheus would resolve the alert if it was firing previously. To protect against it, you can use `min_over_time()`, like this:

- expr: vsphere_node_check_errors == 1
+ expr: min_over_time(vsphere_node_check_errors[5m]) == 1

See https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0
> (used to be counter yesterday, hard to alert on it)

Nice, thanks!

Agreed with Simon's suggestion, otherwise looks good to me.
Why `min_over_time`? Shouldn't this be `max_over_time`? I would think that if a scrape failed and a value is missing at time t1, it would be replaced with some kind of sentinel value (0? I don't know Prometheus very well, nor how it fills the holes in the data).
If the target is down (`up == 0`), then Prometheus will mark the `vsphere_node_check_errors` metric as stale (meaning it doesn't exist anymore). On the next evaluation of the alerting rule, the result of the rule's expression would be "no data", so Prometheus will consider the alert resolved.
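Putting the suggestion and this explanation together, a sketch of the hardened node rule (the 5m window is taken from the suggestion above; annotations omitted for brevity):

```yaml
- alert: VSphereOpenshiftNodeHealthFail
  # min_over_time() looks back over the last 5m of samples, so one missed
  # scrape (and the resulting staleness gap) no longer turns the expression
  # into "no data" and silently resolves a firing alert.
  expr: min_over_time(vsphere_node_check_errors[5m]) == 1
  for: 10m
  labels:
    severity: warning
```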
/retest
lgtm-ish, waiting for @lilic's approval / additional comments.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jsafrane

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
/retest

Please review the full test history for this PR and help us cut down flakes.
@gnufied: All pull requests linked via external trackers have merged: Bugzilla bug 1904503 has been moved to the MODIFIED state.
Depends on https://github.com/openshift/vsphere-problem-detector/pull/24/files
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1904503
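For context, a sketch of what the full rule group could look like once the review feedback is folded in; the cluster alert's name and final message wording are assumptions, everything else is taken from the diff hunks above:

```yaml
groups:
- name: vsphere-problem-detector.rules
  rules:
  - alert: VSphereOpenshiftNodeHealthFail
    expr: min_over_time(vsphere_node_check_errors[5m]) == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      message: "VSphere health check {{ $labels.check }} is failing on node {{ $labels.node }}"
  # Hypothetical name for the cluster-level alert discussed above:
  - alert: VSphereOpenshiftClusterHealthFail
    expr: min_over_time(vsphere_cluster_check_errors[5m]) == 1
    for: 10m
    labels:
      severity: warning   # downgraded from critical per review
    annotations:
      message: "VSphere cluster health checks are failing with {{ $labels.check }}"
```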