test/e2e/upgrade/alert: Allow some Watchdog changes #26262

Merged

Conversation

@wking (Member) commented Jun 22, 2021

To avoid deadlocking on hard anti-affinity vs. volume affinity, Prometheus returned to soft anti-affinity and dropped its PDBs (rhbz#1967614). rhbz#1949262 has been re-opened to track making Prometheus highly available again, but until that is addressed, we'll have occasional Prometheus outages like:

disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success    1h9m50s
  Jun  9 11:57:23.048: Unexpected alerts fired or pending during the upgrade:

  Watchdog alert had missing intervals during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime during upgrade

In that job, Watchdog was down for around 30s, and had two changes (going down and coming back up). It's possible that both Prometheus pods come up on node A, are both drained off onto node B, and are then both drained back onto node A, causing four changes during a standard update. If there is a node C that could also host the pods (e.g. it is in the same availability zone, so the volume could attach), the first drain off A would lead to Prometheus pods on B and C via soft anti-affinity, so three consecutive paired drains seems fairly unlikely. But we also have chained-update jobs and update-and-rollback jobs, where there are multiple compute-pool rolls under the same update monitor, so allowing 8 changes gives us some space for those. If more complicated jobs still trip this check, we can always raise the threshold in the future.

Using a delay (e.g. allowing Watchdog to be down for up to five minutes) is another alternative. But not all CI jobs have persistent volumes configured for Prometheus. For example, openshift/release#15677 mentions that several classes of jobs still lack those persistent volumes.
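
For illustration only, here is a minimal Go sketch of the kind of tolerance described above. It is not the test's actual implementation; `countWatchdogChanges`, the 40-second scrape interval, and the sample timestamps are all invented for the example. It counts a down-plus-up pair of changes for every gap in the Watchdog series and only complains once more than 8 changes accumulate:

```go
package main

import (
	"fmt"
	"time"
)

// countWatchdogChanges is hypothetical: it walks the timestamps at which the
// Watchdog alert was observed and counts a "down" plus an "up" change for
// every gap larger than the expected scrape interval.
func countWatchdogChanges(samples []time.Time, scrapeInterval time.Duration) int {
	changes := 0
	for i := 1; i < len(samples); i++ {
		if samples[i].Sub(samples[i-1]) > scrapeInterval {
			changes += 2 // went down, then came back up
		}
	}
	return changes
}

func main() {
	// Threshold from the discussion above: tolerate up to 8 changes to cover
	// chained-update and update-and-rollback jobs.
	const maxAllowedChanges = 8

	start := time.Now()
	samples := []time.Time{
		start,
		start.Add(30 * time.Second), // normal scrape gap: no change
		start.Add(90 * time.Second), // ~60s gap: Watchdog went down and came back
	}
	if got := countWatchdogChanges(samples, 40*time.Second); got > maxAllowedChanges {
		fmt.Printf("fail: Watchdog had %d changes, more than the allowed %d\n", got, maxAllowedChanges)
	} else {
		fmt.Printf("ok: Watchdog had %d changes, within the allowed %d\n", got, maxAllowedChanges)
	}
}
```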

@smarterclayton (Contributor)

/lgtm

@openshift-ci bot added the lgtm and approved labels Jun 22, 2021

To avoid locking on hard-anti-affinity vs. volume-affinity, Prometheus
returned to soft-anti-affinity and dropped its PDBs [1].  [2] has been
re-opened to track making Prometheus highly-available again, but until
that is addressed, we'll have occasional Prometheus outages like [3]:

  disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success    1h9m50s
    Jun  9 11:57:23.048: Unexpected alerts fired or pending during the upgrade:

    Watchdog alert had missing intervals during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime during upgrade

In that job, Watchdog was down for around 30s, and had two changes
(going down and coming back up).  It's possible that both Prometheus
pods come up on node A, and then are both drained off onto node B, and
then are both drained off back to node A, causing four changes during
a standard update.  If there is a node C that could also host the pods
(e.g. it was in the same availability zone so the volume could
attach), the first drain off A would lead to Proms on B and C via soft
anti-affinity, so having three consecutive paired drains seems fairly
unlikely.  But we also have chained-update jobs and
update-and-rollback jobs, where there are multiple compute-pool rolls
under the same update monitor, so 8 gives us some space for those.  If
we fail more complicated stuff on this, we can always raise the
threshold higher in the future.

Using a delay (e.g. allowing Watchdog to be down for up to five
minutes) is another alternative.  But not all CI jobs have persistent
volumes configured for Prometheus.  For example, [4] mentions that
several classes of jobs still lack those persistent volumes.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1967614
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1949262
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1402565624210657280
[4]: openshift/release#15677
@openshift-ci bot removed the lgtm label Jun 22, 2021
@smarterclayton (Contributor)

/lgtm

@openshift-ci bot added the lgtm label Jun 22, 2021
@openshift-ci bot (Contributor) commented Jun 22, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

16 similar comments
@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

27 similar comments
@openshift-merge-robot merged commit 66fd60d into openshift:master Jul 2, 2021
@wking deleted the allow-watchdog-outages branch July 3, 2021 14:08