Bug 1809232: prevent AlertmanagerReceiversNotConfigured false-positive #723

paulfantom · 2020-03-25T09:47:37Z

Delaying the alert by 30m prevents the false positive: by then the Watchdog alert should already have fired and populated at least one of the alertmanager_notifications_total metrics.

/cc @openshift/openshift-team-monitoring

I wonder if we can shorten it to 15m or less as Watchdog should fire instantly. WDYT @brancz @simonpasquier ?

openshift-ci-robot · 2020-03-25T09:47:40Z

@paulfantom: This pull request references Bugzilla bug 1809232, which is invalid:

expected the bug to target the "4.5.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1809232: prevent AlertmanagerReceiversNotConfigured false-positive

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

paulfantom · 2020-03-25T09:49:06Z

In this case bugzilla doesn't matter as this targets master, so don't worry about bugzilla/invalid-bug. I just marked it so it will be easier in case we need to backport it.

lilic

/lgtm

/hold

since you pinged other folks for their opinion.

smarterclayton · 2020-03-25T17:49:45Z

Will this fire the moment a user logs into the console after an install?

paulfantom · 2020-03-26T08:04:06Z

It will fire 30m after Prometheus is up and running. We can decrease this to a much lower value, but 0m sometimes causes false positives during an upgrade.
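To make the effect of `for` concrete, here is a minimal sketch of the rule written as a plain Prometheus rule group. The alert name and expression come from this PR; the group name is made up for illustration, and the real rule also carries annotations and possibly labels that are left out here.

```yaml
# Sketch only: the group name is hypothetical and not taken from the repository.
groups:
  - name: example-alertmanager-rules
    rules:
      - alert: AlertmanagerReceiversNotConfigured
        # True while the recorded value is 0.
        expr: cluster:alertmanager_routing_enabled:max == 0
        # The alert stays "pending" until expr has been true on every
        # evaluation for this long; only then does it start firing.
        # Without `for`, it fires on the first evaluation where expr is true.
        for: 30m
```

If the expression stops being true before the `for` window has elapsed, the pending state is dropped and the timer starts over the next time it becomes true. That is why a short blip during an upgrade no longer produces a firing alert.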

lilic · 2020-03-27T12:42:35Z

assets/prometheus-k8s/rules.yaml

@@ -591,6 +591,7 @@ spec:
          occur. Check the OpenShift documentation to learn how to configure notifications
          with Alertmanager.
      expr: cluster:alertmanager_routing_enabled:max == 0
+      for: 30m


Let's adjust to 15m then; that should ™️ give the Watchdog alert enough time to fire after upgrades. I would say we can also add a note in Bugzilla for QE to verify that this alert is firing when a cluster first spins up and a user logs in. WDYT?

lilic · 2020-03-30T11:50:40Z

We did another upgrade of our clusters yesterday, and it took 5 minutes in total for the alert to resolve. So anything above 5 minutes should be enough to avoid flaky alerts.

paulfantom · 2020-03-31T12:55:32Z

Changed to 10m as this should be enough according to #723 (comment)

/bugzilla refresh
/unhold
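One way to check a value like this is a promtool unit test. The sketch below assumes the rule has been extracted into a plain `groups:` rule file with `for: 10m`; the file name `alert-rules.yaml` is hypothetical, and as far as I know the PrometheusRule manifest under assets/ is not loadable by promtool as-is. It replays the recording rule at 0 for roughly 5 minutes, as observed during the upgrade above, and expects no firing alert.

```yaml
# Sketch of a promtool unit test. Run with: promtool test rules <this file>
rule_files:
  - alert-rules.yaml   # hypothetical file holding the rule with `for: 10m`
evaluation_interval: 1m

tests:
  # Case 1: the value is 0 from 0m to 5m (upgrade blip), then 1 afterwards.
  - interval: 1m
    input_series:
      - series: 'cluster:alertmanager_routing_enabled:max'
        values: '0+0x5 1+0x20'
    alert_rule_test:
      # During the blip the alert is only pending, so nothing fires.
      - eval_time: 5m
        alertname: AlertmanagerReceiversNotConfigured
        exp_alerts: []
      # The blip ended before the 10m window elapsed, so it never fired.
      - eval_time: 15m
        alertname: AlertmanagerReceiversNotConfigured
        exp_alerts: []

  # Case 2: the value stays at 0, but the 10m window has not elapsed yet.
  - interval: 1m
    input_series:
      - series: 'cluster:alertmanager_routing_enabled:max'
        values: '0+0x9'
    alert_rule_test:
      - eval_time: 9m
        alertname: AlertmanagerReceiversNotConfigured
        exp_alerts: []
```

Asserting the positive case (the alert firing once the value has been 0 for 10m or more) would need `exp_labels` and `exp_annotations` that match the real rule exactly, so it is left out of this sketch.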

openshift-ci-robot · 2020-03-31T12:55:42Z

@paulfantom: This pull request references Bugzilla bug 1809232, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target release (4.5.0) matches configured target release for branch (4.5.0)
bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)


lilic

/lgtm

openshift-ci-robot · 2020-03-31T12:59:24Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lilic, paulfantom

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [lilic,paulfantom]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-bot · 2020-03-31T13:31:13Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-03-31T13:44:14Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot · 2020-03-31T14:54:35Z

@paulfantom: All pull requests linked via external trackers have merged: openshift/cluster-monitoring-operator#723. Bugzilla bug 1809232 has been moved to the MODIFIED state.


openshift-ci-robot requested a review from a team March 25, 2020 09:47

openshift-ci-robot added labels Mar 25, 2020: bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting.), approved (Indicates a PR has been approved by an approver from all required OWNERS files.)

lilic reviewed Mar 25, 2020

openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 25, 2020

openshift-ci-robot assigned lilic Mar 25, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 25, 2020

lilic reviewed Mar 27, 2020

paulfantom added 2 commits March 31, 2020 14:50

jsonnet: delay firing of AlertmanagerReceiversNotConfigured to prevent false-positive

233207d

assets,pkg/manifests: regenerate

6fca059

paulfantom force-pushed the alert-fix branch from fcb2430 to 6fca059 Compare March 31, 2020 12:54

openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2020

lilic approved these changes Mar 31, 2020

openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Mar 31, 2020

openshift-merge-robot merged commit b366675 into openshift:master Mar 31, 2020

paulfantom deleted the alert-fix branch April 1, 2020 12:11
