Improvements and Additions to Alert Testing Stack #28332

Merged
merged 8 commits into openshift:master on Oct 20, 2023

Conversation

@dgoodwin dgoodwin commented Oct 17, 2023

TRT-1235 TRT-934

This PR continues work to resume some form of automated testing on alerts in a couple of specific scenarios.

Removed Duplicated Alert Testing stack

Two tests are being removed in favor of a newer invariant-based test we added some time ago, which works off intervals generated by querying Prometheus, rather than repeated live Prometheus queries for the same data. This new style of testing allows us to run the alert tests against a given intervals file, which proved very useful during this PR's development.
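As a rough illustration of the intervals-based style (a sketch only; the types and names below are hypothetical, not the actual origin monitor API), the check can replay a saved intervals file and total how long each alert spent firing, with no live Prometheus queries:

```go
// Sketch only: hypothetical types, not the real origin monitor API.
package alertsketch

import "time"

// Interval is a simplified stand-in for a structured alert interval
// produced from a single Prometheus query at the end of the run.
type Interval struct {
	AlertName string
	State     string // "firing" or "pending"
	From, To  time.Time
}

// firingDurations totals the firing time per alert from a saved intervals
// file, so the same check can be re-run offline against any run's artifacts.
func firingDurations(intervals []Interval) map[string]time.Duration {
	totals := map[string]time.Duration{}
	for _, iv := range intervals {
		if iv.State != "firing" {
			continue
		}
		totals[iv.AlertName] += iv.To.Sub(iv.From)
	}
	return totals
}
```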

The tests being removed:

[sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]

  • Ginkgo test run Late during e2e; it live-queries Prometheus for the duration of
    the run, skips any alerts that have their own test, and flakes if any
    are found. It cannot fail.
  • Flake rate is about 80%.

[sig-arch] Check if alerts are firing during or after upgrade success

  • Monitor test run after upgrade.
  • Uses basically the same approach/code as above.
  • Flake rate is 97%.

Removing these two tests dramatically simplifies our stack so we can begin
working on resurrecting some form of alert testing in just one spot.

The backstop test which replaces them is: [sig-trt][invariant] No alerts without an explicit test should be firing/pending more than historically

NOTE: A similarly named test that can fail and does catch real problems
is NOT being removed; that one looks for alerts at the specific point in time it runs, i.e. nothing should be firing once we're done with our e2e run:

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

New Test To Ensure New Alerts Don't Fire

Added a test [sig-trt][invariant] No new alerts should be firing which checks for any occurrence of a firing non-info alert we don't know about (not in our data file), or that is in our data file but is less than two weeks old. This was a feature of the planned design intended to give us a window to catch someone adding a new alert that is already firing in CI. (We don't want things to get worse.)

This test is set to flake; I'll move it to a fail once I can see it run for a bit and confirm this is safe.
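A minimal sketch of that rule, assuming hypothetical names for the data-file lookup and the firing set (not the actual implementation):

```go
// Sketch only: hypothetical names, not the actual implementation.
package alertsketch

import (
	"fmt"
	"time"
)

type historicalAlert struct {
	FirstObserved time.Time
}

// newAlertProblems flags any firing, non-info alert that either has no entry
// in the historical data file or was first observed less than two weeks ago.
func newAlertProblems(firing map[string]string, known map[string]historicalAlert, now time.Time) []string {
	var problems []string
	for name, severity := range firing {
		if severity == "info" {
			continue
		}
		hist, ok := known[name]
		if !ok {
			problems = append(problems, fmt.Sprintf("alert %s is not in the historical data file", name))
			continue
		}
		if now.Sub(hist.FirstObserved) < 14*24*time.Hour {
			problems = append(problems, fmt.Sprintf("alert %s was first observed less than two weeks ago", name))
		}
	}
	return problems
}
```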

Ignore TelemeterClientFailures Alerts

This alert indicates a problem with an external service; it does not affect other tests, it's pending all the time, and it makes it to firing plenty often. It's not an alert we're concerned with in testing, so after speaking with the monitoring team it is being removed from consideration. (Trevor is actually driving us towards disabling telemeter in CI clusters.)

Extract More Alert Data into Structured Intervals

Expanded our structured alert intervals to now parse out the alert severity and level.
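Roughly, this means mapping the Prometheus severity label onto a structured level on the interval; the sketch below uses assumed label and level names, not necessarily the exact ones in the code:

```go
// Sketch only: assumed label and level names.
package alertsketch

type AlertLevel string

const (
	AlertInfo     AlertLevel = "info"
	AlertWarning  AlertLevel = "warning"
	AlertCritical AlertLevel = "critical"
)

// severityAndLevel pulls the raw severity label off an alert and maps it to
// the structured level stored on the interval, defaulting unknowns to info.
func severityAndLevel(labels map[string]string) (string, AlertLevel) {
	severity := labels["severity"]
	switch severity {
	case "critical":
		return severity, AlertCritical
	case "warning":
		return severity, AlertWarning
	default:
		return severity, AlertInfo
	}
}
```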

Read More Data from Alerts Query Results

The alerts query_results.json will now include more data, such as first/last observed; we start reading this info out and using it minimally.
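As a sketch of what consuming that might look like (the field names below are assumptions about the file layout, not the confirmed schema):

```go
// Sketch only: assumed field names for query_results.json records.
package alertsketch

import (
	"encoding/json"
	"os"
	"time"
)

type alertRecord struct {
	AlertName     string    `json:"alertName"`
	FirstObserved time.Time `json:"firstObserved"`
	LastObserved  time.Time `json:"lastObserved"`
}

// loadQueryResults reads the per-alert records, including the new
// first/last observed timestamps, from a query_results.json file.
func loadQueryResults(path string) ([]alertRecord, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var records []alertRecord
	if err := json.Unmarshal(data, &records); err != nil {
		return nil, err
	}
	return records, nil
}
```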

Fix The 100% Flake Rate on Existing Backstop Test

The backstop test we're keeping, called out above, was a 100% flake, I believe because it did not consider the alerts we know fire on every CI cluster: AlertmanagerReceiversNotConfigured and Watchdog. This test now shares the list of alerts we do not care about with the remaining Prometheus Ginkgo test.
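Conceptually the fix is a single shared allowlist consulted by both tests; a hypothetical sketch (names are illustrative only):

```go
// Sketch only: a single shared allowlist both tests consult.
package alertsketch

// allowedAlertNames are alerts expected to fire on every CI cluster or
// otherwise ignored for testing purposes.
var allowedAlertNames = map[string]bool{
	"Watchdog":                           true,
	"AlertmanagerReceiversNotConfigured": true,
	"TelemeterClientFailures":            true, // external service problem, ignored per the monitoring team
}

func isAllowedAlert(name string) bool {
	return allowedAlertNames[name]
}
```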

Next Steps

We have no plan to bring back historical alert data testing as we believe we cannot possibly keep up with the noise it generates. Right now the focus is on making sure nobody adds new alerts that are already firing in our fleet, and hopefully never seeing that happen for SD, as well as hunting down other pinned alerts we can work on for SD.

@dgoodwin dgoodwin changed the title from "alert pin tests" to "Test for specific alerts we want to fail a test on any fire" Oct 17, 2023
@openshift-ci openshift-ci bot requested review from csrwng and deads2k October 17, 2023 14:16
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 17, 2023
@openshift-ci openshift-ci bot added the vendor-update Touching vendor dir or related files label Oct 18, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2023
New initiative to allow pinning alerts that are simply not allowed to
fire, regardless of historical data.
This is confusing but this commit removes two alert tests:

[sig-instrumentation][Late] Alerts shouldn't report any unexpected alerts in firing or pending state [apigroup:config.openshift.io] [Suite:openshift/conformance/parallel]
- Ginkgo test run Late during e2e; it live-queries Prometheus for the duration of
  the run, skips any alerts that have their own test, and flakes if any
  are found. It cannot fail.
- Flake rate is about 80%.
- This test is being removed in favor of the new one based on intervals
  observed. (see
  pkg/monitortests/testframework/legacytestframeworkmonitortests/monitortest.go).

[sig-arch] Check if alerts are firing during or after upgrade success
- Monitor test run after upgrade.
- Uses basically the same approach/code as above.
- Flake rate is 97%.
- This test is being removed in favor of the new one based on intervals
  observed. (see
  pkg/monitortests/testframework/legacytestframeworkmonitortests/monitortest.go).

Removing these two tests dramatically simplifies our stack so we can begin
working on resurrecting some form of alert testing in just one spot.

The remaining backstop test is: [sig-trt][invariant] No alerts without an explicit test should be firing/pending more than historically
It is a 100% flake.

NOTE: A similarly named test that can fail and does catch real problems
is NOT being removed:

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
These turn out to be a duplication of tests that already flake if an operator goes Available=false or Degraded=true in a test run. (e.g. [bz-DNS] clusteroperator/dns should not change condition/Degraded)
Attempts to be forgiving: scan the entire data file, regardless of nurp, looking to see if we've seen this alert name anywhere before. If not, or if we did but first observed is less than two weeks ago, fail the test. This approach should help us survive the cutover to new releases.
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 19, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 19, 2023

@dgoodwin: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Required | Rerun command
ci/prow/e2e-aws-ovn-single-node-serial | d334db8 | false | /test e2e-aws-ovn-single-node-serial
ci/prow/unit | d334db8 | true | /test unit
ci/prow/e2e-aws-ovn-single-node-upgrade | d334db8 | false | /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

a.allowanceCalculator = alwaysFlake()
return a
}

@dgoodwin
Contributor Author

The use of this was removed, but I left the alwaysFlake functionality in so it's visible when we start working on more alerts to help SD out.

@dgoodwin dgoodwin changed the title from "Test for specific alerts we want to fail a test on any fire" to "Improvements and Additions to Alert Testing Stack" Oct 20, 2023
@stbenjam
Member

Do you consider the removed tests to be comparable to [sig-trt][invariant] No alerts without an explicit test should be firing/pending more than historically? If so we may want to track it as a rename. If it's substantially different we can just leave it.

@stbenjam
Member

stbenjam commented Oct 20, 2023

Do you think [sig-trt][invariant] No new alerts should be firing should be treated like panics? Out of the box, I think aggregation allows ~4 failures until we build up history, but we have the ability to make this a never-failing test openshift/ci-tools#3653 (once it fails instead of flakes).

@stbenjam
Member

Otherwise the code looks good to me; the test seems to be passing in the jobs.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 20, 2023
@openshift-ci
Contributor

openshift-ci bot commented Oct 20, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stbenjam
Member

Also, [sig-trt][invariant] No alerts without an explicit test should be firing/pending more than historically fires A LOT, more than 99% of the time. It seems like we need to fix up that test too to allow exceptions, like AlertmanagerReceiversNotConfigured.

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD abb76df and 2 for PR HEAD dc3c6fc in total

@openshift-ci openshift-ci bot merged commit 617bc41 into openshift:master Oct 20, 2023
17 of 23 checks passed