Improvements and Additions to Alert Testing Stack #28332
Conversation
New initiative to allow pinning alerts that are simply not allowed to fire, regardless of historical data.
This is confusing, but this commit removes two alert tests:

[sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
- Ginkgo test run Late during e2e; live queries Prometheus for the duration of the run, skips any alerts that have their own test, and flakes if any are found. It cannot fail.
- Flake rate is about 80%.
- This test is being removed in favor of the new one based on observed intervals (see pkg/monitortests/testframework/legacytestframeworkmonitortests/monitortest.go).

[sig-arch] Check if alerts are firing during or after upgrade success
- Monitor test run after upgrade.
- Uses basically the same approach/code as above.
- Flake rate is 97%.
- This test is being removed in favor of the new one based on observed intervals (see pkg/monitortests/testframework/legacytestframeworkmonitortests/monitortest.go).

Removing these two tests dramatically simplifies our stack so we can begin working on resurrecting some form of alert testing in just one spot. The remaining backstop test is:

[sig-trt][invariant] No alerts without an explicit test should be firing/pending more than historically

It is a 100% flake.

NOTE: A similarly named test that can fail and does catch real problems is NOT being removed:

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
These turn out to duplicate tests that already flake if an operator goes Available=false or Degraded=true during a test run (e.g. [bz-DNS] clusteroperator/dns should not change condition/Degraded).
Attempts to be forgiving: scans the entire data file, regardless of NURP, looking to see if we've seen this alert name anywhere before. If not, or if we have but it was first observed less than two weeks ago, fail the test. This approach should help us survive the cutover to new releases.
@dgoodwin: The following tests failed, say /retest to rerun all failed tests:
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
	a.allowanceCalculator = alwaysFlake()
	return a
}
The use of this was removed, but I left the alwaysFlake functionality in so it's visible when we start working on more alerts to help SD out.
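For anyone reading along later, here is roughly what that pattern looks like; the names and interfaces below are illustrative, not the actual origin code:

```go
package alerts

import "time"

// AllowedAlertState describes how an overage should be reported.
type AllowedAlertState int

const (
	AllowedAlertStateFail  AllowedAlertState = iota // overage fails the test
	AllowedAlertStateFlake                          // overage only flakes the test
)

// allowanceCalculator decides how long an alert may fire/pend and what
// happens when that budget is exceeded.
type allowanceCalculator interface {
	FailAfter(alertName string) (time.Duration, AllowedAlertState)
}

// alwaysFlake returns a calculator that never fails a run outright: any
// overage is reported as a flake, useful while new alert checks are still
// being proven out.
func alwaysFlake() allowanceCalculator {
	return flakeCalculator{}
}

type flakeCalculator struct{}

func (flakeCalculator) FailAfter(string) (time.Duration, AllowedAlertState) {
	return 0, AllowedAlertStateFlake
}
```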
Do you consider the removed tests to be comparable to
Do you think
Otherwise code looks good to me, test seems to be passing in the jobs. /lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dgoodwin, stbenjam. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
TRT-1235 TRT-934
This PR continues work to resume some form of automated testing on alerts in a couple specific scenarios.
Removed Duplicated Alert Testing Stack
Two tests are being removed in favor of a newer invariant-based test we added some time ago, which works off intervals we generate by querying Prometheus, rather than repeated live Prometheus queries to get the same data. This new style of testing allows us to run the alert tests against a given intervals file, which proved very useful during this PR's development.
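As a rough illustration of the interval-based style (the intervals file name and shape here are assumptions, not the origin format):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// AlertInterval is an assumed shape for one firing/pending alert interval.
type AlertInterval struct {
	AlertName string    `json:"alertName"`
	State     string    `json:"state"` // "firing" or "pending"
	From      time.Time `json:"from"`
	To        time.Time `json:"to"`
}

func main() {
	raw, err := os.ReadFile("intervals.json") // illustrative path
	if err != nil {
		panic(err)
	}
	var intervals []AlertInterval
	if err := json.Unmarshal(raw, &intervals); err != nil {
		panic(err)
	}

	// Sum firing time per alert; a real test would compare these totals
	// against an allowance rather than just printing them.
	firing := map[string]time.Duration{}
	for _, i := range intervals {
		if i.State == "firing" {
			firing[i.AlertName] += i.To.Sub(i.From)
		}
	}
	for name, d := range firing {
		fmt.Printf("%s fired for %s during the run\n", name, d)
	}
}
```

Because the input is just a file, the same check can be re-run locally against intervals captured from any CI job.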
The tests being removed:
[sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
- Ginkgo test run Late during e2e; live queries Prometheus for the duration of the run, skips any alerts that have their own test, and flakes if any are found. It cannot fail.
- Flake rate is about 80%.

[sig-arch] Check if alerts are firing during or after upgrade success
- Monitor test run after upgrade, using basically the same approach/code as above.
- Flake rate is about 97%.
Removing these two tests dramatically simplifies our stack so we can begin
working on resurrecting some form of alert testing in just one spot.
The backstop test which replaces them is:
[sig-trt][invariant] No alerts without an explicit test should be firing/pending more than historically
NOTE: A similarly named test that can fail and does catch real problems is NOT being removed; that one looks for alerts at the specific point in time it runs, i.e. nothing should be firing once our e2e run is done:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
New Test To Ensure New Alerts Don't Fire
Added a test
[sig-trt][invariant] No new alerts should be firing
which checks for any occurrence of a firing non-info alert that we don't know about (not in our data file), or that is in our data file but was first observed less than two weeks ago. This was a feature of the planned design intended to give us a window to catch someone adding a new alert that is already firing in CI. (We don't want things to get worse.) This test is set to flake; I'll move it to a fail once I can see it run for a bit and confirm this is safe.
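A minimal sketch of that check, assuming hypothetical names and data shapes (not the actual origin implementation):

```go
package alerts

import "time"

// HistoricalAlert is an assumed record from the historical alert data file.
type HistoricalAlert struct {
	Name          string
	FirstObserved time.Time
}

// isNewAlert flags a firing, non-info alert that is either missing from the
// data file entirely (for any NURP) or was first observed there less than
// two weeks ago.
func isNewAlert(name, severity string, history map[string]HistoricalAlert, now time.Time) bool {
	if severity == "info" {
		return false // info alerts are ignored by this check
	}
	h, ok := history[name]
	if !ok {
		return true // never seen before anywhere in the data file
	}
	return now.Sub(h.FirstObserved) < 14*24*time.Hour
}
```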
Ignore TelemeterClientFailures Alerts
This alert indicates a problem with an external service; it does not affect other tests, it is pending all the time, and it makes it to firing plenty often. It is not an alert we're concerned with in testing, so after speaking with the monitoring team it is being removed from consideration. (Trevor is actually driving us toward disabling Telemeter in CI clusters.)
Extract More Alert Data into Structured Intervals
Expanded our structured alert intervals to now parse out the alert severity and level.
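Purely as an illustration (the mapping and type names are assumptions, not the origin code), parsing the severity label could feed a level on the interval like this:

```go
package monitorapi

// IntervalLevel is an assumed type for the level carried on an alert interval.
type IntervalLevel string

const (
	Info    IntervalLevel = "Info"
	Warning IntervalLevel = "Warning"
	Error   IntervalLevel = "Error"
)

// alertLevelFromSeverity maps a Prometheus alert's "severity" label onto the
// level stored on the structured interval.
func alertLevelFromSeverity(severity string) IntervalLevel {
	switch severity {
	case "critical":
		return Error
	case "warning":
		return Warning
	default:
		return Info // "info" and anything unrecognized
	}
}
```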
Read More Data from Alerts Query Results
The alerts query_results.json will now include more data, such as first/last observed; we start reading this info out and using it minimally.
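Hypothetically, each record read out of the file could be shaped something like this (field names are assumptions, not the real query_results.json schema):

```go
package alerts

import "time"

// AlertHistoricalRecord is an assumed shape for one row of query_results.json.
type AlertHistoricalRecord struct {
	AlertName     string    `json:"alertName"`
	AlertLevel    string    `json:"alertLevel"`
	FirstObserved time.Time `json:"firstObserved"`
	LastObserved  time.Time `json:"lastObserved"`
}
```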
Fix The 100% Flake Rate on Existing Backstop Test
The test we're keeping, called out above, was a 100% flake, I believe because it did not consider the alerts we know fire on every CI cluster: Watchdog and AlertmanagerReceiversNotConfigured. This test now shares the list of alerts we do not care about with the remaining Prometheus Ginkgo test.
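The shared-list idea, sketched with illustrative package and function names (not the actual origin code):

```go
package allowedalerts

// ignoredAlertNames covers alerts expected on every CI cluster plus alerts we
// have decided not to test, such as TelemeterClientFailures.
var ignoredAlertNames = []string{
	"Watchdog",
	"AlertmanagerReceiversNotConfigured",
	"TelemeterClientFailures",
}

// IsIgnored reports whether both the backstop test and the Prometheus Ginkgo
// test should skip this alert.
func IsIgnored(name string) bool {
	for _, a := range ignoredAlertNames {
		if a == name {
			return true
		}
	}
	return false
}
```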
Next Steps
We have no plan to bring back historical alert data testing, as we believe we cannot possibly keep up with the noise it generates. Right now the focus is on making sure nobody adds new alerts that are already firing in our fleet (and hopefully never seeing that happen, for SD's sake), as well as hunting down other pinned alerts we can work on for SD.