test/e2e/upgrade/alert: Allow some Watchdog changes #26262
Conversation
Force-pushed from 8a88aed to bfda3c7 (Compare)
/lgtm
To avoid deadlocking on hard anti-affinity vs. volume affinity, Prometheus returned to soft anti-affinity and dropped its PDBs [1]. [2] has been re-opened to track making Prometheus highly available again, but until that is addressed, we'll have occasional Prometheus outages like [3]:

    disruption_tests: [sig-arch] Check if alerts are firing during or after upgrade success	1h9m50s
    Jun 9 11:57:23.048: Unexpected alerts fired or pending during the upgrade:

    Watchdog alert had missing intervals during the run, which may be a sign of a Prometheus outage in violation of the prometheus query SLO of 100% uptime during upgrade

In that job, Watchdog was down for around 30s and had two changes (going down and coming back up). It's possible that both Prometheus pods come up on node A, are then both drained off onto node B, and are then both drained back onto node A, causing four changes during a standard update. If there is a node C that could also host the pods (e.g. it was in the same availability zone, so the volume could attach), the first drain off A would land the Prometheus pods on B and C via soft anti-affinity, so three consecutive paired drains seems fairly unlikely. But we also have chained-update jobs and update-and-rollback jobs, where there are multiple compute-pool rolls under the same update monitor, so 8 gives us some space for those. If more complicated jobs still trip on this check, we can always raise the threshold in the future.

Using a delay (e.g. allowing Watchdog to be down for up to five minutes) is another alternative. But not all CI jobs have persistent volumes configured for Prometheus. For example, [4] mentions that several classes of jobs still lack those persistent volumes.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1967614
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1949262
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-e2e-gcp-upgrade/1402565624210657280
[4]: openshift/release#15677
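The PR itself just tolerates a small number of Watchdog changes in the upgrade alert check. As a rough illustration of that idea (not the actual diff), the sketch below queries Prometheus' HTTP API for changes(ALERTS{alertname="Watchdog",alertstate="firing"}[1h]) and only complains when the count exceeds an assumed allowance of 8. The query window, endpoint URL, and helper names are illustrative assumptions, not code from this PR.

```go
// Hypothetical sketch, not the actual test/e2e/upgrade/alert change: count
// how many times the Watchdog ALERTS series changed over the upgrade window
// and tolerate a small number of changes instead of requiring a perfectly
// continuous series.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"strconv"
	"time"
)

const (
	// Allowance discussed above: roughly two changes per paired drain, with
	// headroom for chained-update and update-and-rollback jobs.
	allowedWatchdogChanges = 8

	// changes() counts value changes in the range vector; for the Watchdog
	// series that is roughly one "went away" plus one "came back" per outage.
	// The 1h window is an assumption for illustration.
	watchdogChangesQuery = `changes(ALERTS{alertname="Watchdog",alertstate="firing"}[1h])`
)

// promVectorResponse models the subset of the Prometheus HTTP API
// (/api/v1/query) response needed for an instant-vector result.
type promVectorResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Value [2]interface{} `json:"value"` // [unix timestamp, string value]
		} `json:"result"`
	} `json:"data"`
}

// watchdogChanges runs the query against promURL and returns the largest
// change count across the returned series (normally there is exactly one).
func watchdogChanges(promURL string) (int, error) {
	q := url.Values{"query": {watchdogChangesQuery}, "time": {fmt.Sprint(time.Now().Unix())}}
	resp, err := http.Get(promURL + "/api/v1/query?" + q.Encode())
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var parsed promVectorResponse
	if err := json.NewDecoder(resp.Body).Decode(&parsed); err != nil {
		return 0, err
	}
	if parsed.Status != "success" {
		return 0, fmt.Errorf("prometheus query failed: %s", parsed.Status)
	}

	maxChanges := 0
	for _, r := range parsed.Data.Result {
		s, ok := r.Value[1].(string)
		if !ok {
			continue
		}
		v, err := strconv.ParseFloat(s, 64)
		if err != nil {
			return 0, err
		}
		if int(v) > maxChanges {
			maxChanges = int(v)
		}
	}
	return maxChanges, nil
}

func main() {
	changes, err := watchdogChanges("http://localhost:9090") // assumed local Prometheus endpoint
	if err != nil {
		fmt.Println("query error:", err)
		return
	}
	if changes > allowedWatchdogChanges {
		fmt.Printf("Watchdog changed %d times, more than the allowed %d: possible Prometheus outage\n", changes, allowedWatchdogChanges)
		return
	}
	fmt.Printf("Watchdog changed %d times, within the allowance of %d\n", changes, allowedWatchdogChanges)
}
```

Counting changes rather than requiring zero gaps matches the reasoning above: each paired drain contributes roughly two changes, so an allowance of 8 covers a few compute-pool rolls without masking a sustained outage.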
Force-pushed from bfda3c7 to 0a3cbf5 (Compare)
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: smarterclayton, wking

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
/retest Please review the full test history for this PR and help us cut down flakes.
16 similar comments
/retest Please review the full test history for this PR and help us cut down flakes.
27 similar comments