test/e2e/upgrade/alert: Allow some Watchdog changes #26262

Merged

Conversation

@wking (Member) commented Jun 22, 2021

To avoid deadlocking on hard anti-affinity vs. volume affinity, Prometheus returned to soft anti-affinity and dropped its PDBs (rhbz#1967614). rhbz#1949262 has been re-opened to track making Prometheus highly available again, but until that is addressed, we'll have occasional Prometheus outages like:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success    1h9m50s
  Jun  9 11:57:23.048: Unexpected alerts fired or pending during the upgrade:

  Watchdog alert had missing intervals during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime during upgrade

In that job, Watchdog was down for around 30s, and had two changes (going down and coming back up). It's possible that both Prometheus pods come up on node A, are both drained off onto node B, and are then both drained back onto node A, causing four changes during a standard update. If there is a node C that could also host the pods (e.g. it is in the same availability zone, so the volume could attach), the first drain off A would lead to Prometheus pods on B and C via soft anti-affinity, so three consecutive paired drains seems fairly unlikely. But we also have chained-update jobs and update-and-rollback jobs, where there are multiple compute-pool rolls under the same update monitor, so allowing 8 changes gives us some space for those. If more complicated jobs still trip this check, we can always raise the threshold in the future.

Using a delay (e.g. allowing Watchdog to be down for up to five minutes) is another alternative. But not all CI jobs have persistent volumes configured for Prometheus. For example, openshift/release#15677 mentions that several classes of jobs still lack those persistent volumes.
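
For illustration only, here is a minimal Go sketch of the kind of tolerance described above. It is not the test's actual implementation; `countWatchdogChanges`, the 40-second scrape interval, and the sample timestamps are all invented for the example. It counts a down-plus-up pair of changes for every gap in the Watchdog series and only complains once more than 8 changes accumulate:

```go
package main

import (
	"fmt"
	"time"
)

// countWatchdogChanges is hypothetical: it walks the timestamps at which the
// Watchdog alert was observed and counts a "down" plus an "up" change for
// every gap larger than the expected scrape interval.
func countWatchdogChanges(samples []time.Time, scrapeInterval time.Duration) int {
	changes := 0
	for i := 1; i < len(samples); i++ {
		if samples[i].Sub(samples[i-1]) > scrapeInterval {
			changes += 2 // went down, then came back up
		}
	}
	return changes
}

func main() {
	// Threshold from the discussion above: tolerate up to 8 changes to cover
	// chained-update and update-and-rollback jobs.
	const maxAllowedChanges = 8

	start := time.Now()
	samples := []time.Time{
		start,
		start.Add(30 * time.Second), // normal scrape gap: no change
		start.Add(90 * time.Second), // ~60s gap: Watchdog went down and came back
	}
	if got := countWatchdogChanges(samples, 40*time.Second); got > maxAllowedChanges {
		fmt.Printf("fail: Watchdog had %d changes, more than the allowed %d\n", got, maxAllowedChanges)
	} else {
		fmt.Printf("ok: Watchdog had %d changes, within the allowed %d\n", got, maxAllowedChanges)
	}
}
```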

@smarterclayton (Contributor)

/lgtm

@openshift-ci bot added the lgtm and approved labels Jun 22, 2021

To avoid locking on hard-anti-affinity vs. volume-affinity, Prometheus
returned to soft-anti-affinity and dropped its PDBs [1].  [2] has been
re-opened to track making Prometheus highly-available again, but until
that is addressed, we'll have occasional Prometheus outages like [3]:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success    1h9m50s
    Jun  9 11:57:23.048: Unexpected alerts fired or pending during the upgrade:

    Watchdog alert had missing intervals during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime during upgrade

In that job, Watchdog was down for around 30s, and had two changes
(going down and coming back up).  It's possible that both Prometheus
pods come up on node A, and then are both drained off onto node B, and
then are both drained off back to node A, causing four changes during
a standard update.  If there is a node C that could also host the pods
(e.g. it was in the same availability zone so the volume could
attach), the first drain off A would lead to Proms on B and C via soft
anti-affinity, so having three consecutive paired drains seems fairly
unlikely.  But we also have chained-update jobs and
update-and-rollback jobs, where there are multiple compute-pool rolls
under the same update monitor, so 8 gives us some space for those.  If
we fail more complicated stuff on this, we can always raise the
threshold higher in the future.

Using a delay (e.g. allowing Watchdog to be down for up to five
minutes) is another alternative.  But not all CI jobs have persistent
volumes configured for Prometheus.  For example, [4] mentions that
several classes of jobs still lack those persistent volumes.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1967614
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1949262
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1402565624210657280
[4]: openshift/release#15677
@openshift-ci bot removed the lgtm label Jun 22, 2021
@smarterclayton (Contributor)

/lgtm

@openshift-ci bot added the lgtm label Jun 22, 2021
@openshift-ci bot (Contributor) commented Jun 22, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

16 similar comments
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

27 similar comments
@openshift-merge-robot merged commit 66fd60d into openshift:master Jul 2, 2021
@wking deleted the allow-watchdog-outages branch July 3, 2021 14:08