
Bug 1904503: Add prometheus alerts for vsphere #126

Merged

Conversation

@gnufied (Member) commented Jan 14, 2021

@openshift-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Jan 14, 2021
@jsafrane (Contributor):

/retest

assets/vsphere_problem_detector/12_prometheusrules.yaml (review comment on an outdated diff)
labels:
  severity: warning
annotations:
  message: "Vsphere node health checks are failing on {{ $labels.node }} with {{ $labels.check }}"
Contributor:

Is it OK to alert on each node separately? This may be quite spammy if all nodes are equally bad.

@gnufied (Member Author), Jan 14, 2021:

cc @openshift/openshift-team-monitoring

@gnufied (Member Author), Jan 14, 2021:

According to the monitoring team:

you will receive only one notification containing the description of all the metrics for which the expression is true
e.g. if you have 2 nodes for which vsphere_node_check_errors == 1, you'll receive 1 notification containing 2 firing alerts, along with the message for each failing node

I think it might be okay. It is not unusual for all nodes in a 100-node cluster to go bad at once, but the entire class of alerts can be disabled at once, so it should be fine. If we remove the node name from here, the alert will be less useful, because we won't be able to tell which node it is coming from.
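For context, a minimal sketch of the Alertmanager behavior the monitoring team describes, assuming a route that groups by alert name (illustrative only, not the actual OpenShift monitoring configuration): all firing alerts that share an alertname but differ in their node/check labels are batched into a single notification.

# Illustrative Alertmanager route; grouping by alert name produces one grouped
# notification that still lists each per-node alert with its own labels.
route:
  receiver: default
  group_by: ['alertname']
  group_wait: 30s      # collect alerts that start firing around the same time
  group_interval: 5m   # minimum time between notifications for the same group
receivers:
  - name: default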

@gnufied force-pushed the add-prometheus-alert-vsphere branch 2 times, most recently from 6cc44d2 to 3e06c4f on January 14, 2021 16:45
@gnufied changed the title from "Add prometheus alerts for vsphere" to "Bug 1904503: Add prometheus alerts for vsphere" on Jan 14, 2021
@openshift-ci-robot added the bugzilla/severity-medium label (Referenced Bugzilla bug's severity is medium for the branch this PR is targeting.) on Jan 14, 2021
@openshift-ci-robot:

@gnufied: This pull request references Bugzilla bug 1904503, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1904503: Add prometheus alerts for vsphere

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot:

@gnufied: This pull request references Bugzilla bug 1904503, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1904503: Add prometheus alerts for vsphere

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot added the bugzilla/valid-bug label (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.) on Jan 14, 2021
@gnufied force-pushed the add-prometheus-alert-vsphere branch from 3e06c4f to 07c1c3a on January 14, 2021 16:49
expr: vsphere_cluster_check_errors == 1
for: 10m
labels:
  severity: critical
Contributor:

The other one is warning; do we really want this one to be critical? I'd start as low as possible, since we don't know how many clusters are going to report the alert after upgrading to 4.7.

@gnufied (Member Author):

Changed this to warning as well.

labels:
  severity: critical
annotations:
  message: "VSpehre cluster health checks are failing with {{ $labels.check }}"
Contributor:

still typo: VSpehre

@gnufied (Member Author):

Err, I fixed the wrong typo. Will fix this one, sorry.

labels:
  severity: warning
annotations:
  message: "VSphere node health checks are failing on {{ $labels.node }} with {{ $labels.check }}"
Contributor:

This sounds better to me: "VSphere health check {{ $labels.check }} is failing on node {{ $labels.node }}"

But who am I to comment on others' English style :-)

@gnufied (Member Author):

You win this round. I renamed. :-)

@gnufied force-pushed the add-prometheus-alert-vsphere branch from 07c1c3a to 66c8a4f on January 14, 2021 17:48
@gnufied force-pushed the add-prometheus-alert-vsphere branch from 66c8a4f to 8ec0ecf on January 14, 2021 19:33
@gnufied (Member Author) commented Jan 14, 2021:

/retest

@lilic left a comment:

Thank you for the ping! 🎉

Just curious what the type is; I'm a bit worried this will be falsely firing all the time, depending on the type of the metric.

- name: vsphere-problem-detector.rules
  rules:
    - alert: VSphereOpenshiftNodeHealthFail
      expr: vsphere_node_check_errors == 1
@lilic:

What type is this metric, a counter or a gauge? A quick search could not find it in the operator.

Contributor:

It's a gauge now: openshift/vsphere-problem-detector#24
(it used to be a counter until yesterday, which was hard to alert on).

Simon:

One small comment to improve the alerting rule: in the current version, a failed scrape by Prometheus would resolve the alert if it was firing previously.
To protect against this, you can use min_over_time() like this:

Suggested change:
-      expr: vsphere_node_check_errors == 1
+      expr: min_over_time(vsphere_node_check_errors[5m]) == 1

See https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0
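For reference, a sketch of the full node alert with this suggestion applied; the message wording follows the phrasing suggested earlier in the review, while the for: duration and severity are taken from the rest of the discussion and may differ from the rule that was finally merged:

- name: vsphere-problem-detector.rules
  rules:
    - alert: VSphereOpenshiftNodeHealthFail
      # min_over_time keeps returning data across a failed scrape and only
      # evaluates to 1 when every sample in the 5m window is 1.
      expr: min_over_time(vsphere_node_check_errors[5m]) == 1
      for: 10m
      labels:
        severity: warning
      annotations:
        message: "VSphere health check {{ $labels.check }} is failing on node {{ $labels.node }}"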

@lilic:

(used to be counter yesterday, hard to alert on it)

Nice, thanks!

Agreed with Simon's suggestion; otherwise looks good to me.

@gnufied (Member Author):

Why min_over_time? Shouldn't this be max_over_time? I would think that if a scrape failed and a value is missing at time t1, it would be replaced with some kind of sentinel value (0? I don't know Prometheus very well, or how it fills holes in the data).

Simon:

If the target is down (up == 0), Prometheus marks the vsphere_node_check_errors metric as stale (meaning it doesn't exist anymore). On the next evaluation of the alerting rule, the rule's expression returns no data, so Prometheus considers the alert resolved.
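To make the min_over_time vs. max_over_time question concrete, here is a small worked example (the sample values are illustrative, not from a real cluster):

# Raw samples of vsphere_node_check_errors for one node over the last 5m,
# with one failed scrape in the middle (no sample recorded at t-2m):
#
#   t-4m: 1   t-3m: 1   t-2m: (scrape failed)   t-1m: 1   now: 1
#
# vsphere_node_check_errors == 1
#   returns no data right after the failed scrape (the series is marked stale),
#   so a previously firing alert resolves and then fires again.
#
# min_over_time(vsphere_node_check_errors[5m]) == 1
#   still returns 1: the range vector uses all raw samples in the window,
#   so the alert keeps firing across the scrape failure.
#
# max_over_time(vsphere_node_check_errors[5m]) == 1
#   would also fire if the check reported 1 for only a single scrape in the
#   window, so min_over_time is the stricter "has been failing the whole time"
#   condition.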

@jsafrane (Contributor):

/retest

@jsafrane (Contributor):

lgtm-ish, waiting for @lilic's approval / additional comments.

@jsafrane (Contributor):

/lgtm

@openshift-ci-robot added the lgtm label (Indicates that a PR is ready to be merged.) on Jan 18, 2021
@openshift-ci-robot:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, jsafrane

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jsafrane (Contributor):

/retest

@openshift-bot:

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci-robot:

@gnufied: All pull requests linked via external trackers have merged:

Bugzilla bug 1904503 has been moved to the MODIFIED state.

In response to this:

Bug 1904503: Add prometheus alerts for vsphere

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
