Improvements and Additions to Alert Testing Stack #28332
Conversation
New initiative to allow pinning alerts that are simply not allowed to fire, regardless of historical data.
This is confusing, but this commit removes two alert tests:

[sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
- Ginkgo test run Late during e2e; live queries Prometheus for the duration of the run, skips any alerts that have their own test, and flakes if any are found. It cannot fail.
- Flake rate is about 80%.
- This test is being removed in favor of the new one based on observed intervals (see pkg/monitortests/testframework/legacytestframeworkmonitortests/monitortest.go).

[sig-arch] Check if alerts are firing during or after upgrade success
- Monitor test run after upgrade.
- Uses basically the same approach/code as above.
- Flake rate is 97%.
- This test is being removed in favor of the new one based on observed intervals (see pkg/monitortests/testframework/legacytestframeworkmonitortests/monitortest.go).

Removing these two tests dramatically simplifies our stack so we can begin working on resurrecting some form of alert testing in just one spot. The remaining backstop test is:

[sig-trt][invariant] No alerts without an explicit test should be firing/pending more than historically

It is a 100% flake.

NOTE: A similarly named test that can fail and does catch real problems is NOT being removed:

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
These turn out to duplicate tests that already flake if an operator goes Available=false or Degraded=true during a test run (e.g. [bz-DNS] clusteroperator/dns should not change condition/Degraded).
Attempts to be forgiving: scans the entire data file, regardless of NURP, looking to see if we've seen this alert name anywhere before. If not, or if we have but it was first observed less than two weeks ago, fail the test. This approach should help us survive the cutover to new releases.
@dgoodwin: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
	a.allowanceCalculator = alwaysFlake()
	return a
}
The use of this was removed, but I left the alwaysFlake functionality in so it's visible when we start working on more alerts to help SD out.
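For anyone reading along later, here is roughly what that pattern looks like; the names and interfaces below are illustrative, not the actual origin code:

```go
package alerts

import "time"

// AllowedAlertState describes how an overage should be reported.
type AllowedAlertState int

const (
	AllowedAlertStateFail  AllowedAlertState = iota // overage fails the test
	AllowedAlertStateFlake                          // overage only flakes the test
)

// allowanceCalculator decides how long an alert may fire/pend and what
// happens when that budget is exceeded.
type allowanceCalculator interface {
	FailAfter(alertName string) (time.Duration, AllowedAlertState)
}

// alwaysFlake returns a calculator that never fails a run outright: any
// overage is reported as a flake, useful while new alert checks are still
// being proven out.
func alwaysFlake() allowanceCalculator {
	return flakeCalculator{}
}

type flakeCalculator struct{}

func (flakeCalculator) FailAfter(string) (time.Duration, AllowedAlertState) {
	return 0, AllowedAlertStateFlake
}
```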
Do you consider the removed tests to be comparable to
Do you think
Otherwise code looks good to me, test seems to be passing in the jobs. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgoodwin, stbenjam. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
TRT-1235 TRT-934
This PR continues work to resume some form of automated testing on alerts in a couple specific scenarios.
Removed Duplicated Alert Testing Stack
Two tests are being removed in favor of a newer invariant-based test we added some time ago, which works off intervals we generate by querying Prometheus, rather than repeated live Prometheus queries to get the same data. This new style of testing allows us to run the alert tests against a given intervals file, which proved very useful during this PR's development.
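As a rough illustration of the interval-based style (the intervals file name and shape here are assumptions, not the origin format):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// AlertInterval is an assumed shape for one firing/pending alert interval.
type AlertInterval struct {
	AlertName string    `json:"alertName"`
	State     string    `json:"state"` // "firing" or "pending"
	From      time.Time `json:"from"`
	To        time.Time `json:"to"`
}

func main() {
	raw, err := os.ReadFile("intervals.json") // illustrative path
	if err != nil {
		panic(err)
	}
	var intervals []AlertInterval
	if err := json.Unmarshal(raw, &intervals); err != nil {
		panic(err)
	}

	// Sum firing time per alert; a real test would compare these totals
	// against an allowance rather than just printing them.
	firing := map[string]time.Duration{}
	for _, i := range intervals {
		if i.State == "firing" {
			firing[i.AlertName] += i.To.Sub(i.From)
		}
	}
	for name, d := range firing {
		fmt.Printf("%s fired for %s during the run\n", name, d)
	}
}
```

Because the input is just a file, the same check can be re-run locally against intervals captured from any CI job.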
The tests being removed:
[sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
- Ginkgo test run Late during e2e; live queries Prometheus for the duration of the run, skips any alerts that have their own test, and flakes if any are found. It cannot fail.
- Flake rate is about 80%.

[sig-arch] Check if alerts are firing during or after upgrade success
- Monitor test run after upgrade, using basically the same approach/code as above.
- Flake rate is about 97%.
Removing these two tests dramatically simplifies our stack so we can begin
working on resurrecting some form of alert testing in just one spot.
The backstop test which replaces them is:
[sig-trt][invariant] No alerts without an explicit test should be firing/pending more than historically
NOTE: A similarly named test that can fail and does catch real problems is NOT being removed; that one looks for alerts at the specific point in time it runs, i.e. nothing should be firing once our e2e run is done:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
New Test To Ensure New Alerts Don't Fire
Added a test
[sig-trt][invariant] No new alerts should be firing
which checks for any occurrence of a firing non-info alert that we don't know about (not in our data file), or that is in our data file but was first observed less than two weeks ago. This was a feature of the planned design intended to give us a window to catch someone adding a new alert that is already firing in CI. (We don't want things to get worse.) This test is set to flake; I'll move it to a fail once I can see it run for a bit and confirm this is safe.
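A minimal sketch of that check, assuming hypothetical names and data shapes (not the actual origin implementation):

```go
package alerts

import "time"

// HistoricalAlert is an assumed record from the historical alert data file.
type HistoricalAlert struct {
	Name          string
	FirstObserved time.Time
}

// isNewAlert flags a firing, non-info alert that is either missing from the
// data file entirely (for any NURP) or was first observed there less than
// two weeks ago.
func isNewAlert(name, severity string, history map[string]HistoricalAlert, now time.Time) bool {
	if severity == "info" {
		return false // info alerts are ignored by this check
	}
	h, ok := history[name]
	if !ok {
		return true // never seen before anywhere in the data file
	}
	return now.Sub(h.FirstObserved) < 14*24*time.Hour
}
```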
Ignore TelemeterClientFailures Alerts
This alert indicates a problem with an external service; it does not affect other tests, it is pending all the time, and it makes it to firing plenty often. It is not an alert we're concerned with in testing, so after speaking with the monitoring team it is being removed from consideration. (Trevor is actually driving us toward disabling Telemeter in CI clusters.)
Extract More Alert Data into Structured Intervals
Expanded our structured alert intervals to now parse out the alert severity and level.
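Purely as an illustration (the mapping and type names are assumptions, not the origin code), parsing the severity label could feed a level on the interval like this:

```go
package monitorapi

// IntervalLevel is an assumed type for the level carried on an alert interval.
type IntervalLevel string

const (
	Info    IntervalLevel = "Info"
	Warning IntervalLevel = "Warning"
	Error   IntervalLevel = "Error"
)

// alertLevelFromSeverity maps a Prometheus alert's "severity" label onto the
// level stored on the structured interval.
func alertLevelFromSeverity(severity string) IntervalLevel {
	switch severity {
	case "critical":
		return Error
	case "warning":
		return Warning
	default:
		return Info // "info" and anything unrecognized
	}
}
```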
Read More Data from Alerts Query Results
The alerts query_results.json will now include more data, such as first/last observed; we start reading this info out and using it minimally.
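Hypothetically, each record read out of the file could be shaped something like this (field names are assumptions, not the real query_results.json schema):

```go
package alerts

import "time"

// AlertHistoricalRecord is an assumed shape for one row of query_results.json.
type AlertHistoricalRecord struct {
	AlertName     string    `json:"alertName"`
	AlertLevel    string    `json:"alertLevel"`
	FirstObserved time.Time `json:"firstObserved"`
	LastObserved  time.Time `json:"lastObserved"`
}
```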
Fix The 100% Flake Rate on Existing Backstop Test
The test we're keeping, called out above, was a 100% flake, I believe because it did not consider the alerts we know fire on every CI cluster: Watchdog and AlertmanagerReceiversNotConfigured. This test now shares the list of alerts we do not care about with the remaining Prometheus Ginkgo test.
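The shared-list idea, sketched with illustrative package and function names (not the actual origin code):

```go
package allowedalerts

// ignoredAlertNames covers alerts expected on every CI cluster plus alerts we
// have decided not to test, such as TelemeterClientFailures.
var ignoredAlertNames = []string{
	"Watchdog",
	"AlertmanagerReceiversNotConfigured",
	"TelemeterClientFailures",
}

// IsIgnored reports whether both the backstop test and the Prometheus Ginkgo
// test should skip this alert.
func IsIgnored(name string) bool {
	for _, a := range ignoredAlertNames {
		if a == name {
			return true
		}
	}
	return false
}
```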
Next Steps
We have no plan to bring back historical alert data testing, as we believe we cannot possibly keep up with the noise it generates. Right now the focus is on making sure nobody adds new alerts that are already firing in our fleet (and hopefully never seeing that happen, for SD's sake), as well as hunting down other pinned alerts we can work on for SD.