Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing the severity of "missing runbook_url annotation for critical alerts" test case from flaky to failure #27987

Merged

Conversation

vimalk78
Copy link
Contributor

The following critical alerts do not have have runbook_url annotation added as exceptions:

  • OVNKubernetesNorthboundDatabaseClusterIDError
  • OVNKubernetesSouthboundDatabaseClusterIDError
  • OVNKubernetesNorthboundDatabaseLeaderError
  • OVNKubernetesSouthboundDatabaseLeaderError
  • OVNKubernetesNorthboundDatabaseMultipleLeadersError
  • OVNKubernetesSouthboundDatabaseMultipleLeadersError
  • OVNKubernetesNorthdInactive
  • KubeSchedulerDown
  • MultipleDefaultStorageClasses
  • MachineAPIOperatorMetricsCollectionFailing
  • MCDRebootError
  • ExtremelyHighIndividualControlPlaneMemory
  • HAProxyDown
  • ClusterOperatorDown
  • ClusterVersionOperatorDown

The check for runbook_url validity has been separated out into a new flaky test

@vimalk78
Copy link
Contributor Author

should the check for invalid runbook_url have its own list of exceptions?

@vimalk78
Copy link
Contributor Author

/retest

violations := sets.NewString()
runbook := string(alert.Annotations["runbook_url"])

if runbook != "" {
// If there's a 'runbook_url' annotation, make sure it's a
// valid URL and that we can fetch the contents.
if err := helper.ValidateURL(runbook, 10*time.Second); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would separate the test to perform 1) the semantic validation of the URL (e.g. it can be parsed by url.Parse() and the scheme is "http" or "https") and 2) the actual verification that the URL returns a 200 response code. The former can fail the test if it detects an invalid URL and the latter should only flake as the target might be temporarily unavailable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The helper.ValidateURL function uses url.Parse(rawURL) and returns error if it fails to parse.
can i add check for "https" scheme in the same validation function?

https://github.com/openshift/origin/blob/master/test/extended/util/prometheus/helpers.go#L355

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can i add check for "https" scheme in the same validation function?

yes and I would split the actual request to the URL into another function (e.g. QueryURL()).

@vimalk78
Copy link
Contributor Author

/retest

return err
}

if u.Scheme != "https" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"http" should be fine too.

}

if u.Scheme != "https" {
errstr := fmt.Sprintf("ValidateURL(%q) URL scheme is not https", rawURL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
errstr := fmt.Sprintf("ValidateURL(%q) URL scheme is not https", rawURL)
errstr := fmt.Sprintf("%q: URL scheme is invalid, it should be 'http' or 'https'", rawURL)

return fmt.Errorf(errstr)
}
if u.Host == "" {
errstr := fmt.Sprintf("ValidateURL(%q) URL does not have a host part", rawURL)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
errstr := fmt.Sprintf("ValidateURL(%q) URL does not have a host part", rawURL)
errstr := fmt.Sprintf("%q: host should not be empty", rawURL)


if u.Scheme != "https" {
errstr := fmt.Sprintf("ValidateURL(%q) URL scheme is not https", rawURL)
framework.Logf(errstr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to log it here since it's already done by the caller?

}
if u.Host == "" {
errstr := fmt.Sprintf("ValidateURL(%q) URL does not have a host part", rawURL)
framework.Logf(errstr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here

@@ -174,8 +253,6 @@ var _ = g.Describe("[sig-instrumentation][Late] OpenShift alerting rules [apigro
})

if err != nil {
// We are still gathering data on how many alerts need to
// be fixed, so this is marked as a flake for now.
testresult.Flakef(err.Error())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
testresult.Flakef(err.Error())
// We can't fail the test because the runbook URLs might be temporarily unavailable.
// At least we can manually check the CI logs to identify buggy URLs.
testresult.Flakef(err.Error())

}
})

g.It("should have valid runbook URLs", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
g.It("should have valid runbook URLs", func() {
g.It("should link to a valid URL if the runbook_url annotation is defined", func() {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) to be more specific and distinguish from the next test

Suggested change
g.It("should have valid runbook URLs", func() {
g.It("should link to an HTTP(S) location if the runbook_url annotation is defined", func() {

violations := sets.NewString()
runbook_url := string(alert.Annotations["runbook_url"])

if runbook_url != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) easier to read.

Suggested change
if runbook_url != "" {
if runbook_url == "" {
return nil
}

// If there's a 'runbook_url' annotation, make sure that it is a valid URL
if err := helper.ValidateURL(runbook_url); err != nil {
violations.Insert(
fmt.Sprintf("WARNING: Alert %q has an 'runbook_url' annotation: %q which is not valid. Error: %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ForEachAlertingRule() always include the name of the alert.

Suggested change
fmt.Sprintf("WARNING: Alert %q has an 'runbook_url' annotation: %q which is not valid. Error: %v",
fmt.Sprintf("has an 'runbook_url' annotation which is not valid: %v",

violations.Insert(
fmt.Sprintf("WARNING: Alert %q has an invalid 'runbook_url' annotation: %v",
alert.Name, err),
fmt.Sprintf("WARNING: Alert %q has an 'runbook_url' annotation: %q which cannot be fetched. Error: %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment about the alert name already being included

Suggested change
fmt.Sprintf("WARNING: Alert %q has an 'runbook_url' annotation: %q which cannot be fetched. Error: %v",
fmt.Sprintf("has a runbook URL which cannot be fetched: %v",

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last suggestion

}
})

g.It("should have valid runbook URLs", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) to be more specific and distinguish from the next test

Suggested change
g.It("should have valid runbook URLs", func() {
g.It("should link to an HTTP(S) location if the runbook_url annotation is defined", func() {

@simonpasquier
Copy link
Contributor

@vimalk78 you need to regenerate some files to fix ci/prow/verify.

… alerts" test case from flaky to failure

The following critical alerts do not have have runbook_url annotation added as exceptions:
  - OVNKubernetesNorthboundDatabaseClusterIDError
  - OVNKubernetesSouthboundDatabaseClusterIDError
  - OVNKubernetesNorthboundDatabaseLeaderError
  - OVNKubernetesSouthboundDatabaseLeaderError
  - OVNKubernetesNorthboundDatabaseMultipleLeadersError
  - OVNKubernetesSouthboundDatabaseMultipleLeadersError
  - OVNKubernetesNorthdInactive
  - KubeSchedulerDown
  - MultipleDefaultStorageClasses
  - MachineAPIOperatorMetricsCollectionFailing
  - MCDRebootError
  - ExtremelyHighIndividualControlPlaneMemory
  - HAProxyDown
  - ClusterOperatorDown
  - ClusterVersionOperatorDown

The check for runbook_url validity has been separated out into two new tests
  - test which checks for URL validity. This fails the test if URL is invalid
  - test which fetches the URL to check its existance. This check makes test flaky,
    and does not fail the test if a URL cannot be fetched.
@simonpasquier
Copy link
Contributor

/skip

@danielmellado
Copy link

/retest-required

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 28, 2023

@vimalk78: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 681db23 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-agnostic-ovn-cmd 681db23 link false /test e2e-agnostic-ovn-cmd
ci/prow/e2e-metal-ipi-sdn 681db23 link false /test e2e-metal-ipi-sdn
ci/prow/e2e-gcp-ovn-rt-upgrade 681db23 link false /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-single-node-serial 681db23 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn-upgrade 681db23 link false /test e2e-aws-ovn-upgrade
ci/prow/e2e-aws-csi 681db23 link false /test e2e-aws-csi
ci/prow/e2e-metal-ipi-ovn-ipv6 681db23 link false /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 28, 2023
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: simonpasquier, vimalk78

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 28, 2023
@openshift-merge-robot openshift-merge-robot merged commit 1f3163d into openshift:master Jun 28, 2023
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants