OCPBUGS-2086: Detect failure in wait-for on transition back to ready #6470

Merged: openshift-merge-robot merged 2 commits into openshift:master from bfournie:detect-ready-transition-fail on Oct 25, 2022

Contributor

@bfournie bfournie commented Oct 6, 2022

If a cluster moves from "preparing-for-installation" back to "ready", a failure has occurred and installation will not continue. Note that, in order to check the status, the return from the validation check is moved to after the handling of the status errors.
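
A minimal sketch of the transition check described above, not the code from this PR. The two status strings are the ones quoted in the description; the `transitionTracker` type, its `Observe` method, and the constant names are hypothetical and exist only for illustration.

```go
package main

import "fmt"

// Status strings as quoted in the PR description. The constant names are
// hypothetical.
const (
	statusReady     = "ready"
	statusPreparing = "preparing-for-installation"
)

// transitionTracker (hypothetical) remembers the status seen on the
// previous poll so that a move backwards can be detected.
type transitionTracker struct {
	previous string
}

// Observe records the current status and reports whether the cluster has
// just moved from "preparing-for-installation" back to "ready".
func (t *transitionTracker) Observe(current string) bool {
	failed := t.previous == statusPreparing && current == statusReady
	t.previous = current
	return failed
}

func main() {
	tracker := &transitionTracker{}
	for _, status := range []string{statusReady, statusPreparing, statusReady} {
		if tracker.Observe(status) {
			fmt.Println("Failed to prepare cluster installation")
		}
	}
}
```

The tracker flags only the backwards step itself, so a cluster that starts in "ready" or simply stays in "ready" is not treated as a failure.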

@openshift-ci openshift-ci bot requested review from dhellmann and zaneb October 6, 2022 17:26
Member

The error message should be something like "Failed to prepare cluster installation".

Unfortunately, returning an err doesn't actually do anything because the return value is not being checked. We need another mechanism that will result in (a) the error message being printed (instead of the warning above), and (b) the process exiting.

That could mean replacing all of the other errors in this function with nil, since they don't do anything, and then actually checking the return value. Or we could add another return value. Or something else.

Contributor Author

I changed the function to return an additional value that indicates whether the process should exit. This allows errors to be logged without causing an exit. I've also removed some logs (like the validation) that happen every cycle.

When the error occurs the wait-for now exits with:
level=error msg=Bootstrap failed to complete: : bootstrap process returned error: Failed to prepare cluster installation
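
A rough sketch of the "additional return value" approach, under stated assumptions: `checkCluster` and `waitFor` are hypothetical names, the "installed" status is assumed for the sake of the example, and the real function in the installer has a different shape. Only the two error strings are taken from the log line above.

```go
package main

import (
	"errors"
	"fmt"
)

// checkCluster is a hypothetical stand-in for one status check. It returns
// (exitOnErr, err): an error on its own is informational, while
// exitOnErr == true tells the caller to stop waiting.
func checkCluster(previous, current string) (bool, error) {
	if previous == "preparing-for-installation" && current == "ready" {
		return true, errors.New("Failed to prepare cluster installation")
	}
	if current != "installed" { // "installed" is an assumed final status
		return false, fmt.Errorf("cluster status is %q", current)
	}
	return false, nil
}

// waitFor walks a sequence of observed statuses and stops at the first
// error that is marked as fatal. Non-fatal errors are ignored here.
func waitFor(statuses []string) error {
	previous := ""
	for _, current := range statuses {
		exitOnErr, err := checkCluster(previous, current)
		if err != nil && exitOnErr {
			return fmt.Errorf("bootstrap process returned error: %w", err)
		}
		previous = current
	}
	return nil
}

func main() {
	err := waitFor([]string{"ready", "preparing-for-installation", "ready"})
	fmt.Println(err)
}
```

Separating the error from the exit decision lets the caller keep polling through transient errors while still failing fast on the backwards transition.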

Member

zaneb commented Oct 6, 2022

We also need to handle the case where wait-for bootstrap-complete is run only after assisted has already failed to prepare the installation. See the Jira ticket for more details. That could be in another PR, since it's quite a bit more complicated.

Contributor Author

bfournie commented Oct 6, 2022

> We also need to handle the case where wait-for bootstrap-complete is run only after assisted has already failed to prepare the installation. See the Jira ticket for more details. That could be in another PR, since it's quite a bit more complicated.

Yes, I was planning on handling that after we got agreement on this first one.

Contributor

celebdor commented Oct 6, 2022

/title OCPBUGS-2086: Detect failure in wait-for on transition back to ready

@bfournie bfournie force-pushed the detect-ready-transition-fail branch from 7e15dae to a57d4b9 Compare October 6, 2022 23:06
@bfournie bfournie changed the title from "AGENT-353: Detect failure in wait-for on transition back to ready" to "OCPBUGS-2086: Detect failure in wait-for on transition back to ready" Oct 6, 2022
@openshift-ci-robot openshift-ci-robot added labels Oct 6, 2022:
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
@openshift-ci-robot
Contributor

@bfournie: This pull request references Jira Issue OCPBUGS-2086, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.12.0) matches configured target version for branch (4.12.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Member

I think we should only have this Info log if exit_on_err is false.

Contributor Author

Changed to log here if not exiting.
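
The logging rule agreed on here could look roughly like the following. This is an illustrative sketch only: `handleResult`, the `logInfo` callback, and both error values are hypothetical, and the printed `level=... msg=...` lines merely imitate the log format quoted earlier in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// Example errors, hypothetical and used only by this sketch.
var (
	errNotReady = errors.New("cluster is not ready yet")
	errPrepare  = errors.New("Failed to prepare cluster installation")
)

// handleResult decides what to do with one check result. When the process
// is going to exit, the error is returned so the caller can report it once;
// otherwise it is logged at Info level and swallowed.
func handleResult(exitOnErr bool, err error, logInfo func(string)) error {
	if err == nil {
		return nil
	}
	if exitOnErr {
		// No Info log here: the caller reports the fatal error and exits.
		return err
	}
	logInfo(err.Error())
	return nil
}

func main() {
	logInfo := func(msg string) { fmt.Println("level=info msg=" + msg) }

	_ = handleResult(false, errNotReady, logInfo)
	if err := handleResult(true, errPrepare, logInfo); err != nil {
		fmt.Println("level=error msg=" + err.Error())
	}
}
```

Logging only on the non-exit path avoids printing the same failure twice, once as Info and once as the final error.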

@bfournie bfournie force-pushed the detect-ready-transition-fail branch from a57d4b9 to 9787b10 Compare October 10, 2022 10:49
Contributor

openshift-ci bot commented Oct 10, 2022

@bfournie: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/okd-scos-e2e-gcp-ovn-upgrade | 7e15daee895c84f842344db0676daa45eb1b4a6f | link | false | /test okd-scos-e2e-gcp-ovn-upgrade |
| ci/prow/okd-scos-e2e-vsphere | 7e15daee895c84f842344db0676daa45eb1b4a6f | link | false | /test okd-scos-e2e-vsphere |
| ci/prow/okd-scos-e2e-gcp | 7e15daee895c84f842344db0676daa45eb1b4a6f | link | false | /test okd-scos-e2e-gcp |

@bfournie bfournie force-pushed the detect-ready-transition-fail branch from 666ce0e to 9d4fbc0 Compare October 18, 2022 15:33
@lranjbar
Contributor

/approve

Contributor

openshift-ci bot commented Oct 24, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lranjbar

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 24, 2022
Member

zaneb commented Oct 25, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 25, 2022
@openshift-merge-robot openshift-merge-robot merged commit a2c4d48 into openshift:master Oct 25, 2022
@openshift-ci-robot
Contributor

@bfournie: Jira Issue OCPBUGS-2086 is in an unrecognized state (ON_QA) and will not be moved to the MODIFIED state.

lranjbar added a commit to lranjbar/installer that referenced this pull request Oct 26, 2022
openshift-merge-robot added a commit that referenced this pull request Oct 27, 2022
@bfournie bfournie deleted the detect-ready-transition-fail branch October 28, 2022 17:49
bfournie added a commit to bfournie/installer that referenced this pull request Nov 9, 2022
In openshift#6470 we detected a failure when the cluster state moves
back to Ready after an installation has been initiated. This adds an automatic retry when that condition occurs.
It will help resolve issues like https://issues.redhat.com/browse/OCPBUGS-3280 and, in general, any problems
that cause a cluster prepare failure.
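
As a loose illustration of the retry idea in this follow-up commit (not its actual implementation), the sketch below retries a start operation only when it fails with the prepare error. `installWithRetry`, `errPrepareFailed`, and the `maxAttempts` limit are all assumptions; the source does not say how many retries are made or how they are bounded.

```go
package main

import (
	"errors"
	"fmt"
)

// errPrepareFailed is a hypothetical sentinel for the failure detected when
// the cluster moves back to "ready".
var errPrepareFailed = errors.New("Failed to prepare cluster installation")

// installWithRetry calls start until it succeeds, fails with some other
// error, or maxAttempts is reached. It returns the number of attempts made
// and the last error.
func installWithRetry(start func() error, maxAttempts int) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = start()
		if !errors.Is(err, errPrepareFailed) {
			// Success or an unrelated error: do not retry.
			return attempt, err
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	start := func() error {
		calls++
		if calls < 3 {
			return errPrepareFailed
		}
		return nil
	}
	attempts, err := installWithRetry(start, 5)
	fmt.Println(attempts, err)
}
```

Retrying only on the specific prepare failure keeps unrelated errors visible instead of masking them behind repeated attempts.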
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/installer that referenced this pull request Nov 11, 2022