OCPBUGS-3522: Include recent errors in canary checks fail #865
@rfredette: This pull request references Jira Issue OCPBUGS-3522, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/jira refresh |
@rfredette: This pull request references Jira Issue OCPBUGS-3522, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
seeing a lot of install failures on the failed CI runs. /retest |
/retest |
Sample output:
Is this too verbose? I chose to include the last 5 errors since that's the minimum that will cause |
@rfredette Can we update the check interval and threshold to make failure detection more sensitive? We often see that authentication/console has already been reporting errors for a while (almost 2 minutes) while ingress still reports as good. See example
@@ -230,6 +230,8 @@ func (r *reconciler) startCanaryRoutePolling(stop <-chan struct{}) error {
// for status reporting.
successiveFail := 0

errors := []error{}
Maybe it's better to add a comment for the variable `errors`?
From my understanding, the last 5 canary check errors should be the same in most cases, so maybe we don't need to list 5 errors.
So maybe we should focus on how to improve the detailed error message (which can help with troubleshooting). |
/assign @gcs278 |
@rfredette what's the status of this PR? Should I review or are you going to revisit this solution? |
@gcs278 It shouldn't change much in terms of the logic, but I'm planning to reduce the number of error messages shown (I think the last 3 should be reasonable), and include some message about checking the canary logs for more info. I'd say don't worry about the review yet; I'll ping you once I make that change, which should be in the next day or two |
/test e2e-gcp-ovn-serial |
@lihongan I think that's a reasonable ask, but it's out of scope for this particular issue. I think the right way to go about that is to create a story for it in the Maintenance & Debugability feature on Jira, but let me verify that. |
The latest change makes the error message more concise in 2 ways. First, it deduplicates the error messages and includes the total number of times a given error was seen, and second, it only includes 3 unique error messages, instead of 5. Because of how I was able to test this, I only got one error message, but here's an example:
Unfortunately I can't think of a way to generate some useful comment in addition to the error message, so I think this will have to do for increased debuggability. |
several jobs got a 503 error during the build; retrying. |
Test failures look unrelated. /retest-required |
/assign @alebedev87 |
No major issues, just some questions.
/remove-lifecycle rotten |
/retest-required |
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle rotten |
/remove-lifecycle rotten |
When the canary is failing, put up to 3 of the most recent errors into the status message to aid in debugging. If the errors have occurred more than once since the canary was last passing, also include the number of occurrences in the status.
a655264 to 6557c06
/retest |
In the latest commit, I added timestamp tracking to the errors and switched the |
/retest |
/lgtm |
Looks good to me. Since I'm the other reviewer, I'll approve it. |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: gcs278 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/jira refresh |
@rfredette: This pull request references Jira Issue OCPBUGS-3522, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
@rfredette: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
@rfredette: Jira Issue OCPBUGS-3522: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-3522 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
[ART PR BUILD NOTIFIER] This PR has been included in build ose-cluster-ingress-operator-container-v4.16.0-202403010908.p0.g34839c7.assembly.stream.el9 for distgit ose-cluster-ingress-operator. |
Fix included in accepted release 4.16.0-0.nightly-2024-03-05-105513 |
The ingress canary sets the status condition `CanaryChecksSucceeding` to false when 5 or more successive checks fail. With this PR, the associated status message will include the error messages from the 3 most recent failures.