
OCPBUGS-3522: Include recent errors when canary checks fail #865

Conversation

@rfredette (Contributor) commented Dec 12, 2022

The ingress canary sets the status condition CanaryChecksSucceeding to false when 5 or more successive checks fail. With this PR, the associated status message will include the error messages from the 3 most recent failures.
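For readers following along, here is a minimal, self-contained Go sketch of the mechanism described above: keep the newest errors as checks fail, then fold them into the condition message once the failure threshold is reached. This is not the PR's actual code; apart from the condition name CanaryChecksSucceeding and the message wording taken from the sample output below, every identifier is hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const (
	failureThreshold = 5 // successive failures before the condition flips to False
	maxReported      = 3 // most recent errors kept for the status message
)

// appendRecent records a failed check's error, keeping only the newest maxReported.
func appendRecent(recent []error, err error) []error {
	recent = append(recent, err)
	if len(recent) > maxReported {
		recent = recent[len(recent)-maxReported:]
	}
	return recent
}

// failureMessage builds the message used when CanaryChecksSucceeding is False.
func failureMessage(recent []error) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Canary route checks for the default ingress controller are failing. Last %d errors:\n", len(recent))
	for _, e := range recent {
		b.WriteString(e.Error())
		b.WriteString("\n")
	}
	return strings.TrimRight(b.String(), "\n")
}

func main() {
	successiveFail := 0
	var recent []error
	for i := 0; i < failureThreshold; i++ { // simulate five failed checks
		successiveFail++
		recent = appendRecent(recent, errors.New("error sending canary HTTP request: remote error: tls: certificate required"))
	}
	if successiveFail >= failureThreshold {
		fmt.Println(failureMessage(recent)) // would become the condition's Message field
	}
}
```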

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 12, 2022
@openshift-ci-robot (Contributor)

@rfredette: This pull request references Jira Issue OCPBUGS-3522, which is invalid:

  • expected the bug to target the "4.13.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

The ingress canary sets the status condition CanaryChecksSucceeding to false when 5 or more successive checks fail. With this PR, the associated status message will include the error messages from the 5 most recent failures

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rfredette (Contributor, Author)

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Dec 12, 2022
@openshift-ci-robot (Contributor)

@rfredette: This pull request references Jira Issue OCPBUGS-3522, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.13.0) matches configured target version for branch (4.13.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @lihongan

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from lihongan December 12, 2022 18:12
@rfredette (Contributor, Author)

Seeing a lot of install failures on the failed CI runs.

/retest

@rfredette (Contributor, Author)

/retest

@rfredette (Contributor, Author)

Sample output:

    Last Transition Time:  2022-12-21T19:45:22Z
    Message:               Canary route checks for the default ingress controller are failing. Last 5 errors:
error sending canary HTTP request to "canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": Get "https://canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": remote error: tls: certificate required
error sending canary HTTP request to "canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": Get "https://canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": remote error: tls: certificate required
error sending canary HTTP request to "canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": Get "https://canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": remote error: tls: certificate required
error sending canary HTTP request to "canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": Get "https://canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": remote error: tls: certificate required
error sending canary HTTP request to "canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": Get "https://canary-openshift-ingress-canary.apps.ci-ln-33z1pqk-72292.origin-ci-int-gce.dev.rhcloud.com": remote error: tls: certificate required
    Reason:  CanaryChecksRepetitiveFailures
    Status:  False
    Type:    CanaryChecksSucceeding

Is this too verbose? I chose to include the last 5 errors since that's the minimum number of failures that causes CanaryChecksSucceeding to report false, but I can tune it to report fewer if it's too much.

@lihongan (Contributor)

@rfredette Can we update the check interval and threshold to make failure detection more sensitive? We often see that authentication/console have already been reporting errors for a while (almost 2 minutes) while ingress still looks good. See this example:

$ oc get co authentication console ingress
NAME             VERSION                                                   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication   4.12.0-0.ci.test-2022-12-22-014836-ci-ln-h6thnl2-latest   False       False         False      113s    OAuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth-openshift.apps.ci-ln-h6thnl2-76ef8.origin-ci-int-aws.dev.rhcloud.com/healthz": EOF
console          4.12.0-0.ci.test-2022-12-22-014836-ci-ln-h6thnl2-latest   False       False         False      106s    RouteHealthAvailable: failed to GET route (https://console-openshift-console.apps.ci-ln-h6thnl2-76ef8.origin-ci-int-aws.dev.rhcloud.com): Get "https://console-openshift-console.apps.ci-ln-h6thnl2-76ef8.origin-ci-int-aws.dev.rhcloud.com": EOF
ingress          4.12.0-0.ci.test-2022-12-22-014836-ci-ln-h6thnl2-latest   True        False         False      39m     

```diff
@@ -230,6 +230,8 @@ func (r *reconciler) startCanaryRoutePolling(stop <-chan struct{}) error {
 	// for status reporting.
 	successiveFail := 0
+
+	errors := []error{}
```

A reviewer (Contributor) commented on this hunk:
Maybe it's better to add a comment for the errors variable?

@lihongan (Contributor)

From my understanding, the last 5 canary check errors will be the same in most cases, so maybe we don't need to list all 5.
In https://issues.redhat.com/browse/OCPBUGS-3522, the expected result is:

Canary route checks for the default ingress controller are failing: ${ERROR_MESSAGE}. ${POSSIBLY_ALSO_MORE_TROUBLESHOOTING_IDEAS?}

So maybe we should focus on how to improve the detailed error message (which can help with troubleshooting).

@candita (Contributor) commented Jan 4, 2023

/assign @gcs278

@gcs278 (Contributor) commented Jan 9, 2023

@rfredette what's the status of this PR? Should I review or are you going to revisit this solution?

@rfredette (Contributor, Author)

@gcs278 It shouldn't change much in terms of the logic, but I'm planning to reduce the number of error messages shown (I think the last 3 should be reasonable), and include some message about checking the canary logs for more info.

I'd say don't worry about the review yet; I'll ping you once I make that change, which should be in the next day or two.

@candita (Contributor) commented Jan 9, 2023

/test e2e-gcp-ovn-serial

@rfredette (Contributor, Author)

can we update the check interval and threshold to make failure detection more sensitive? we often see that authentication/console already reported error for a while (almost 2 minutes) but ingress is still good.

@lihongan I think that's a reasonable ask, but it's out of scope for this particular issue. I think the right way to go about it is to create a story for it in the Maintenance & Debuggability feature on Jira, but let me verify that.

@rfredette (Contributor, Author)

The latest change makes the error message more concise in two ways: first, it deduplicates the error messages and includes the total number of times a given error was seen; second, it only includes 3 unique error messages instead of 5. Because of how I was able to test this, I only got one error message, but here's an example:

  - lastTransitionTime: "2023-01-11T01:20:40Z"
    message: |-
      Canary route checks for the default ingress controller are failing. Last 1 error messages:
      [Seen 7 times] error sending canary HTTP request to "canary-openshift-ingress-canary.apps.ci-ln-h5z9pc2-72292.origin-ci-int-gce.dev.rhcloud.com": Get "https://canary-openshift-ingress-canary.apps.ci-ln-h5z9pc2-72292.origin-ci-int-gce.dev.rhcloud.com": remote error: tls: handshake failure
    reason: CanaryChecksRepetitiveFailures
    status: "False"
    type: CanaryChecksSucceeding

Unfortunately, I can't think of a way to generate a useful comment in addition to the error message, so I think this will have to do for increased debuggability.

@gcs278 I think this is ready for review, so when you get a chance, please take a look.
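To make the deduplication concrete, here is a hypothetical Go sketch that reproduces the [Seen N times] style from the example above. Every identifier in it is invented for illustration; only the message format comes from the sample output.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

const maxUnique = 3 // unique error messages to include in the status

// dedupe collapses identical error messages, preserving first-seen order
// and counting occurrences of each message.
func dedupe(errs []error) ([]string, map[string]int) {
	counts := map[string]int{}
	var order []string
	for _, e := range errs {
		m := e.Error()
		if counts[m] == 0 {
			order = append(order, m)
		}
		counts[m]++
	}
	return order, counts
}

// failureMessage renders at most maxUnique deduplicated messages.
func failureMessage(errs []error) string {
	order, counts := dedupe(errs)
	if len(order) > maxUnique {
		order = order[len(order)-maxUnique:] // keep the most recent unique messages
	}
	var b strings.Builder
	fmt.Fprintf(&b, "Canary route checks for the default ingress controller are failing. Last %d error messages:\n", len(order))
	for _, m := range order {
		fmt.Fprintf(&b, "[Seen %d times] %s\n", counts[m], m)
	}
	return strings.TrimRight(b.String(), "\n")
}

func main() {
	var errs []error
	for i := 0; i < 7; i++ { // the same failure observed seven times
		errs = append(errs, errors.New("error sending canary HTTP request: remote error: tls: handshake failure"))
	}
	fmt.Println(failureMessage(errs))
}
```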

@rfredette (Contributor, Author)

Several jobs got a 503 error during the build; retrying.
/retest

@rfredette (Contributor, Author)

Test failures look unrelated.

/retest-required

@Miciah (Contributor) commented Jan 17, 2023

/assign @alebedev87

@gcs278 (Contributor) left a comment:
No major issues, just some questions.

Two review threads on pkg/operator/controller/canary/controller.go (outdated, resolved).
@openshift-ci openshift-ci bot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jun 24, 2023
@rfredette (Contributor, Author)

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 10, 2023
@candita (Contributor) commented Sep 25, 2023

/retest-required

@openshift-bot (Contributor)

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 25, 2023
@openshift-bot (Contributor)

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jan 24, 2024
@rfredette (Contributor, Author)

/remove-lifecycle rotten

@openshift-ci openshift-ci bot removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Feb 5, 2024
Commit message: When the canary is failing, put up to 3 of the most recent errors into the status message to aid in debugging. If the errors have occurred more than once since the canary was last passing, also include the number of occurrences in the status.

@rfredette force-pushed the ocpbugs-3522-canary-error-message branch from a655264 to 6557c06 on February 8, 2024 at 00:20
@rfredette (Contributor, Author)

/retest

@rfredette (Contributor, Author)

In the latest commit, I added timestamp tracking to the errors and switched the [Seen 3 times] part of the message to the (x3 in 35s) style that is used elsewhere. I also reworked the deduplication code to reduce the number of times it loops over the list of errors.
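And a hypothetical sketch of the single-pass deduplication with timestamp tracking this comment describes, rendering counts in the (x3 in 35s) style. All types and names here are illustrative, not the PR's.

```go
package main

import (
	"fmt"
	"time"
)

// timedError is one recorded canary check failure.
type timedError struct {
	when time.Time
	msg  string
}

// occurrence aggregates duplicates of a single error message.
type occurrence struct {
	count       int
	first, last time.Time
}

// summarize dedupes in a single pass, tracking first/last occurrence times.
func summarize(errs []timedError) ([]string, map[string]*occurrence) {
	agg := map[string]*occurrence{}
	var order []string
	for _, e := range errs {
		o, ok := agg[e.msg]
		if !ok {
			o = &occurrence{first: e.when}
			agg[e.msg] = o
			order = append(order, e.msg)
		}
		o.count++
		o.last = e.when
	}
	return order, agg
}

func main() {
	base := time.Now()
	errs := []timedError{ // the same failure seen three times over 35 seconds
		{base, "remote error: tls: handshake failure"},
		{base.Add(15 * time.Second), "remote error: tls: handshake failure"},
		{base.Add(35 * time.Second), "remote error: tls: handshake failure"},
	}
	order, agg := summarize(errs)
	for _, m := range order {
		o := agg[m]
		// Prints e.g. "(x3 in 35s) remote error: tls: handshake failure".
		fmt.Printf("(x%d in %s) %s\n", o.count, o.last.Sub(o.first).Round(time.Second), m)
	}
}
```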

@rfredette (Contributor, Author)

/retest

@alebedev87 (Contributor)

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 14, 2024
@gcs278 (Contributor) commented Feb 29, 2024

Looks good to me. Since I'm the other reviewer, I'll approve it.
/approve
/lgtm

openshift-ci bot commented Feb 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gcs278

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 29, 2024
@rfredette (Contributor, Author)

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Feb 29, 2024
@openshift-ci-robot (Contributor)

@rfredette: This pull request references Jira Issue OCPBUGS-3522, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @melvinjoseph86

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot (Contributor)

/retest-required

Remaining retests: 0 against base HEAD 5a89361 and 2 for PR HEAD 6557c06 in total

openshift-ci bot commented Mar 1, 2024

@rfredette: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot bot merged commit 34839c7 into openshift:master Mar 1, 2024
14 checks passed
@openshift-ci-robot (Contributor)

@rfredette: Jira Issue OCPBUGS-3522: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-3522 has been moved to the MODIFIED state.

In response to this:

The ingress canary sets the status condition CanaryChecksSucceeding to false when 5 or more successive checks fail. With this PR, the associated status message will include the error messages from the 3 most recent failures.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot (Contributor)

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-cluster-ingress-operator-container-v4.16.0-202403010908.p0.g34839c7.assembly.stream.el9 for distgit ose-cluster-ingress-operator.
All builds following this will include this PR.

@openshift-merge-robot (Contributor)

Fix included in accepted release 4.16.0-0.nightly-2024-03-05-105513
