OCPBUGS-2086: Detect failure in wait-for on transition back to ready #6470
pkg/agent/cluster.go
Outdated
The error message should be something like "Failed to prepare cluster installation".
Unfortunately, returning an err doesn't actually do anything because the return value is not being checked. We need another mechanism that will result in (a) the error message being printed (instead of the warning above), and (b) the process exiting.
That could mean replacing all of the other errors in this function with nil, since they don't do anything, and then actually checking the return value. Or we could add another return value. Or something else.
I changed the function to return an additional value that indicates whether the process should exit. This allows errors to be logged without causing an exit. I've also removed some logs (like the validation one) that were printed every cycle.
When the error occurs the wait-for now exits with:
level=error msg=Bootstrap failed to complete: : bootstrap process returned error: Failed to prepare cluster installation
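For readers following the thread, here is a minimal sketch of the "additional return value" pattern described above. It is not the code from this PR: `checkBootstrap`, `waitFor`, the status strings other than the error message, and the `exitOnErr` name are all hypothetical stand-ins chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// checkBootstrap is a hypothetical stand-in for one polling cycle.
// It returns whether bootstrap is done, whether an error should end
// the process, and the error itself.
func checkBootstrap(status string) (done bool, exitOnErr bool, err error) {
	switch status {
	case "installed":
		return true, false, nil
	case "prepare-failed":
		// Fatal: the caller should stop waiting and exit.
		return false, true, errors.New("Failed to prepare cluster installation")
	case "api-unreachable":
		// Transient: worth logging, but polling should continue.
		return false, false, errors.New("API not reachable yet")
	default:
		return false, false, nil
	}
}

// waitFor polls a fixed list of statuses (an assumption made so the
// sketch stays deterministic) and only returns an error when the
// check asks for an exit.
func waitFor(statuses []string) error {
	for _, s := range statuses {
		done, exitOnErr, err := checkBootstrap(s)
		if err != nil {
			if exitOnErr {
				return fmt.Errorf("bootstrap process returned error: %w", err)
			}
			// Log only when not exiting.
			fmt.Println("info:", err)
			continue
		}
		if done {
			return nil
		}
	}
	return errors.New("timed out waiting for bootstrap")
}

func main() {
	err := waitFor([]string{"api-unreachable", "installing", "prepare-failed"})
	fmt.Println(err)
}
```

The point of the extra boolean is that the caller no longer has to guess which errors are fatal: transient errors are logged and polling continues, while a flagged error is wrapped and returned so the process can exit.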
We also need to handle the case where |
Yes, I was planning on handling that after we got agreement on this first one.
/title OCPBUGS-2086: Detect failure in wait-for on transition back to ready |
Force-pushed from 7e15dae to a57d4b9.
@bfournie: This pull request references Jira Issue OCPBUGS-2086, which is valid. 3 validation(s) were run on this bug
The bug has been updated to refer to the pull request using the external bug tracker.
pkg/agent/waitfor.go
Outdated
I think we should only have this Info log if exit_on_err is false.
Changed to log here if not exiting.
If a cluster moves from "preparing-for-installation" back to "ready", it indicates that a failure has occurred and installation will not continue. Note that, in order to check the status, the return from the validation check is moved to after the handling of the status errors.
Force-pushed from a57d4b9 to 9787b10.
@bfournie: The following tests failed, say
Force-pushed from 666ce0e to 9d4fbc0.
/approve |
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: lranjbar
/lgtm |
@bfournie: Jira Issue OCPBUGS-2086 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.
In openshift#6470 we detected a failure when the cluster state moves back to Ready after an installation has been initiated. This adds an automatic retry when that condition occurs. It will help resolve issues like https://issues.redhat.com/browse/OCPBUGS-3280 and, in general, any problems that cause a cluster prepare failure.