Bug 1809232: prevent AlertmanagerReceiversNotConfigured false-positive #723

Merged: 2 commits merged into openshift:master on Mar 31, 2020

paulfantom

Delaying the alert by 30m prevents the false positive: within that time the Watchdog alert should already have fired and populated at least one of the alertmanager_notifications_total metrics.

/cc @openshift/openshift-team-monitoring

I wonder if we can shorten it to 15m or less, as Watchdog should fire instantly. WDYT @brancz @simonpasquier?
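
To illustrate why a `for` delay suppresses the false positive, here is a minimal Python sketch of how a Prometheus-style `for` clause is generally understood to behave: the alert stays pending until its expression has been true continuously for the configured duration, and any false evaluation resets it. The function name, the 60-second evaluation interval, and both timelines are assumptions made for illustration; this is not the Prometheus implementation.

```python
from typing import List, Optional, Tuple


def alert_states(evaluations: List[Tuple[int, bool]], for_seconds: int) -> List[str]:
    """Return the alert state after each (timestamp_seconds, expr_is_true) evaluation."""
    states: List[str] = []
    active_since: Optional[int] = None  # when the expression first became true
    for timestamp, expr_is_true in evaluations:
        if not expr_is_true:
            active_since = None  # a single false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical timeline: evaluated every 60s for 20 minutes; the routing
# metric reads 0 only during the first 5 minutes (for example, after a restart).
transient = [(t, t < 300) for t in range(0, 1200, 60)]
# Hypothetical timeline: receivers are genuinely not configured.
persistent = [(t, True) for t in range(0, 1200, 60)]

print("firing" in alert_states(transient, for_seconds=0))     # True: no delay, fires at once
print("firing" in alert_states(transient, for_seconds=600))   # False: 10m delay hides the blip
print("firing" in alert_states(persistent, for_seconds=600))  # True: real problem still fires
```

Under these assumptions, the delay only costs time in the genuine case: the persistent timeline still reaches the firing state, just 10 minutes later.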

@openshift-ci-robot openshift-ci-robot requested a review from a team March 25, 2020 09:47
@openshift-ci-robot

@paulfantom: This pull request references Bugzilla bug 1809232, which is invalid:

  • expected the bug to target the "4.5.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1809232: prevent AlertmanagerReceiversNotConfigured false-positive

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Mar 25, 2020
@paulfantom

In this case bugzilla doesn't matter as this targets master, so don't worry about bugzilla/invalid-bug. I just marked it so it will be easier in case we need to backport it.

@lilic lilic left a comment

/lgtm

/hold

as you pinged other folks for opinion.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 25, 2020
@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 25, 2020
@smarterclayton

Will this fire the moment a user logs into the console after an install?

@paulfantom

It will fire 30m after Prometheus is up and running. We can decrease this to a much lower value, but 0m sometimes causes false positives during an upgrade.

@@ -591,6 +591,7 @@ spec:
occur. Check the OpenShift documentation to learn how to configure notifications
with Alertmanager.
expr: cluster:alertmanager_routing_enabled:max == 0
for: 30m

Let's adjust it to 15m then; that should ™️ give the Watchdog alert enough time to fire after upgrades. I would say we can also add a note in Bugzilla for QE to verify that this alert is firing when the cluster first spins up and a user logs in. WDYT?

@lilic

lilic commented Mar 30, 2020

[Image: Screenshot 2020-03-29 at 14 45 59]

We upgraded our clusters again yesterday, and it took 5 minutes in total for the alert to resolve. So anything above 5 minutes should be enough to avoid flaky alerts.
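
As a rough check of that observation, the following Python sketch compares the `for` values discussed in this thread (0m, 10m, 15m, 30m) against the roughly 5-minute window seen during the upgrade. The helper names are hypothetical, and the parser only handles simple single-unit durations such as `10m`, not the full Prometheus duration syntax.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a simple duration such as '10m' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_SECONDS[unit]


def outlasts_transient(for_value: str, transient_seconds: int) -> bool:
    """True if the `for` delay is strictly longer than the transient window."""
    return parse_duration(for_value) > transient_seconds


observed_transient = 5 * 60  # about 5 minutes, per the upgrade described above
candidates = ["0m", "10m", "15m", "30m"]
verdicts = {c: outlasts_transient(c, observed_transient) for c in candidates}
print(verdicts)
```

Every candidate except 0m outlasts the observed window, so the shortest of them gives the fastest notification for a genuinely unconfigured cluster while still covering the upgrade case.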

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2020
@paulfantom

Changed to 10m as this should be enough according to #723 (comment)

/bugzilla refresh
/unhold

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Mar 31, 2020
@openshift-ci-robot

@paulfantom: This pull request references Bugzilla bug 1809232, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.0) matches configured target release for branch (4.5.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

@lilic lilic left a comment

/lgtm

:shipit:

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lilic, paulfantom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit b366675 into openshift:master Mar 31, 2020
@openshift-ci-robot

@paulfantom: All pull requests linked via external trackers have merged: openshift/cluster-monitoring-operator#723. Bugzilla bug 1809232 has been moved to the MODIFIED state.

@paulfantom paulfantom deleted the alert-fix branch April 1, 2020 12:11