Bug 1904503: Add prometheus alerts for vsphere #126
Conversation
/retest
labels:
  severity: warning
annotations:
  message: "Vsphere node health checks are failing on {{ $labels.node }} with {{ $labels.check }}"
Is it OK to alert on each node separately? This may be quite spammy if all nodes are equally bad.
cc @openshift/openshift-team-monitoring
According to the monitoring team:

> then you will receive only one notification containing the description of all the metrics for which the expression is true

E.g. if you have 2 nodes for which `vsphere_node_check_errors == 1`, you'll receive one notification containing 2 firing alerts, along with the message for each failing node.
I think it might be okay. It is not unusual for all nodes in a 100-node cluster to go wrong at once, but the entire class of alerts can be silenced at once, so it should be fine. If we remove the node name from here, the alert will be less useful, because we can't tell which node these alerts are coming from.
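For reference, here is a minimal sketch of the Alertmanager grouping behavior the monitoring team describes; the route and receiver names are assumptions for illustration, not this operator's actual config:

```yaml
# Hypothetical Alertmanager route: alerts that share the same alertname
# are batched, so N failing nodes produce one notification with N alerts,
# each keeping its own {{ $labels.node }} in the message.
route:
  receiver: default
  group_by: ['alertname']   # group by alert name, not by 'node'
  group_wait: 30s           # wait briefly to batch alerts firing together
  group_interval: 5m        # add late arrivals to the existing group
receivers:
- name: default
```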
Force-pushed from 6cc44d2 to 3e06c4f
@gnufied: This pull request references Bugzilla bug 1904503, which is valid. 3 validation(s) were run on this bug
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@gnufied: This pull request references Bugzilla bug 1904503, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
Force-pushed from 3e06c4f to 07c1c3a
expr: vsphere_cluster_check_errors == 1
for: 10m
labels:
  severity: critical
The other one is `warning`; do we really want this one critical? I'd start as low as possible, since we don't know how many clusters are going to report the alert after upgrading to 4.7.
Changed this to warning as well.
labels:
  severity: critical
annotations:
  message: "VSpehre cluster health checks are failing with {{ $labels.check }}"
still typo: VSpehre
Err, I fixed the wrong typo earlier. This one is fixed now. Sorry.
labels:
  severity: warning
annotations:
  message: "VSphere node health checks are failing on {{ $labels.node }} with {{ $labels.check }}"
This sounds better to me: "VSphere health check {{ $labels.check }} is failing on node {{ $labels.node }}"
But who am I to comment on others' English style :-)
You win this round. I renamed. :-)
Force-pushed from 07c1c3a to 66c8a4f
Force-pushed from 66c8a4f to 8ec0ecf
/retest
Thank you for the ping! 🎉

Just curious what the type is; I'm a bit worried this will be falsely firing all the time depending on the type of the metric.
- name: vsphere-problem-detector.rules
  rules:
  - alert: VSphereOpenshiftNodeHealthFail
    expr: vsphere_node_check_errors == 1
What type is this metric, a counter or a gauge? A quick search could not find it in the operator.
It's a gauge now, openshift/vsphere-problem-detector#24 (it used to be a counter until yesterday, which was hard to alert on).
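To spell out why the gauge is easier to alert on, here is a hedged sketch; the counter variant in the comments is hypothetical:

```yaml
# As a counter, the value only ever increases, so `== 1` would match just
# once; you'd have to alert on its rate of change instead, e.g.:
#   expr: increase(vsphere_node_check_errors[5m]) > 0
# As a gauge (1 while the check fails, 0 otherwise), the rule in this PR
# can compare the current value directly:
- alert: VSphereOpenshiftNodeHealthFail
  expr: vsphere_node_check_errors == 1
  for: 10m
  labels:
    severity: warning
```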
One small comment to improve the alerting rule: in the current version, a failed scrape by Prometheus would resolve the alert if it was firing previously. To protect against it, you can use `min_over_time()`, like this:

- expr: vsphere_node_check_errors == 1
+ expr: min_over_time(vsphere_node_check_errors[5m]) == 1

See https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0
> (used to be counter yesterday, hard to alert on it)

Nice, thanks!

Agreed with Simon's suggestion, otherwise looks good to me.
Why `min_over_time`? Shouldn't this be `max_over_time`? I would think that if a scrape failed and a value is missing at time t1, it would be replaced with some kind of sentinel value (0? I don't know Prometheus very well, nor how it fills the holes in the data).
If the target is down (`up == 0`), then Prometheus will mark the `vsphere_node_check_errors` metric as stale (meaning it doesn't exist anymore). On the next evaluation of the alerting rule, the result of the rule's expression would be "no data", so Prometheus will consider the alert resolved.
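Putting the suggestion and this explanation together, a sketch of the hardened node rule (the 5m window is taken from the suggestion above; annotations omitted for brevity):

```yaml
- alert: VSphereOpenshiftNodeHealthFail
  # min_over_time() looks back over the last 5m of samples, so one missed
  # scrape (and the resulting staleness gap) no longer turns the expression
  # into "no data" and silently resolves a firing alert.
  expr: min_over_time(vsphere_node_check_errors[5m]) == 1
  for: 10m
  labels:
    severity: warning
```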
/retest
lgtm-ish, waiting for @lilic's approval / additional comments.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jsafrane

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest
/retest

Please review the full test history for this PR and help us cut down flakes.
@gnufied: All pull requests linked via external trackers have merged: Bugzilla bug 1904503 has been moved to the MODIFIED state.
Depends on https://github.com/openshift/vsphere-problem-detector/pull/24/files
Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1904503
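For context, a sketch of what the full rule group could look like once the review feedback is folded in; the cluster alert's name and final message wording are assumptions, everything else is taken from the diff hunks above:

```yaml
groups:
- name: vsphere-problem-detector.rules
  rules:
  - alert: VSphereOpenshiftNodeHealthFail
    expr: min_over_time(vsphere_node_check_errors[5m]) == 1
    for: 10m
    labels:
      severity: warning
    annotations:
      message: "VSphere health check {{ $labels.check }} is failing on node {{ $labels.node }}"
  # Hypothetical name for the cluster-level alert discussed above:
  - alert: VSphereOpenshiftClusterHealthFail
    expr: min_over_time(vsphere_cluster_check_errors[5m]) == 1
    for: 10m
    labels:
      severity: warning   # downgraded from critical per review
    annotations:
      message: "VSphere cluster health checks are failing with {{ $labels.check }}"
```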