Bug 1932452: Backport ensure adoption is retried upon failure #127

Conversation

@honza (Member) commented Feb 24, 2021

No description provided.

dhellmann and others added 28 commits February 24, 2021 11:14
It is up to the consumer, not the host, to decide when a host is ready
to have provisioning instructions added. The checks in the Available()
method will be inlined in cluster-api-provider-metal3.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
(cherry picked from commit 095bf63)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Use the API field directly instead of the convenience method.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
(cherry picked from commit 29c784b)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Move public methods of the host type to private functions in the
controller package.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
(cherry picked from commit b0c8c58)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 9b3dbf7)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 177d9ef)
Signed-off-by: Honza Pokorny <honza@redhat.com>
When a call to the provisioner succeeds rather than returning an error
message, that's a good sign, and a reason to not have an error message
set in the Host object. But it doesn't guarantee success: if the
previous failure came at the end of some long-running operation (rather
than outright rejection at the beginning), it could yet fail in exactly
the same way.

Clearing the ErrorType as soon as we start an operation allows us to use
that field to determine whether to force the provisioner to retry in the
presence of an error. (It will only be set if the last thing we did was
record an error, therefore if we see it then we are at the beginning of
a new retry.)

One known issue with this is that if a request to change state in Ironic
results in a 409 Conflict response, this just sets the Dirty flag and is
indistinguishable from success when viewed outside the Ironic
provisioner. If such a conflict occurs, we will effectively skip a retry
attempt, increment the ErrorCount again, and sleep for even longer
before the next attempt.

To ensure that we actually do exponential backoff between retries, leave
the ErrorCount unchanged until the whole operation actually succeeds
(Dirty is no longer true). This is now decoupled from the ClearError()
method that clears the ErrorType and ErrorMessage.
As much as possible, clear the ErrorCount in the host state
machine. The exception is the steady states where we only do power
management: the actionResult types currently lack enough detail for the
state machine to determine when the count should be cleared.

In power management states (Ready/Available, Provisioned, Externally
Provisioned), count the number of errors of any type since the power
state was last successfully changed.

Otherwise, the ErrorCount is cleared when an operation successfully
completes. Successful re-registration or adoption is never sufficient to
clear the error count, except in the registration state.

(cherry picked from commit 562b6ac)
Signed-off-by: Honza Pokorny <honza@redhat.com>
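
A minimal sketch of the separation described above, assuming simplified field names and a hypothetical calculateBackoff helper (illustrative only, not the actual baremetal-operator code): clearError resets the error type and message when a new attempt begins, while ErrorCount is only reset once the operation has fully completed, so the backoff between retries keeps growing until something actually succeeds.

```go
package main

import (
	"fmt"
	"time"
)

type hostStatus struct {
	ErrorType    string
	ErrorMessage string
	ErrorCount   int
}

// clearError marks the start of a new attempt but deliberately leaves
// ErrorCount untouched.
func clearError(st *hostStatus) (dirty bool) {
	dirty = st.ErrorType != "" || st.ErrorMessage != ""
	st.ErrorType = ""
	st.ErrorMessage = ""
	return dirty
}

// clearErrorCount runs only once the operation is no longer dirty, i.e. it
// has completed rather than merely not failing immediately.
func clearErrorCount(st *hostStatus) { st.ErrorCount = 0 }

// calculateBackoff (hypothetical helper) grows exponentially with the number
// of recorded failures, capped to keep the example bounded.
func calculateBackoff(errorCount int) time.Duration {
	if errorCount > 10 {
		errorCount = 10
	}
	return time.Duration(1<<uint(errorCount)) * time.Second
}

func main() {
	st := &hostStatus{ErrorType: "provisioning error", ErrorMessage: "deploy failed", ErrorCount: 3}
	clearError(st) // a new retry begins; the count is untouched
	fmt.Println("next retry after", calculateBackoff(st.ErrorCount))
	clearErrorCount(st) // only after the whole operation succeeds
}
```
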
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739

(cherry picked from commit 84ca573)
Signed-off-by: Honza Pokorny <honza@redhat.com>
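
A rough sketch of the force flag, using a simplified stand-in type rather than the real provisioner interface (the method signature and field below are assumptions): force=true marks a fresh attempt, so the validation is repeated even though the stored credentials are unchanged.

```go
package main

import "fmt"

type provisioner struct {
	lastValidatedCreds string
}

// validateManagementAccess re-runs credential validation when the credentials
// changed or when the caller explicitly forces a new attempt after an error.
func (p *provisioner) validateManagementAccess(creds string, force bool) error {
	if creds == p.lastValidatedCreds && !force {
		// Previously this branch was always taken after a failure, so the
		// validation never ran again until the credentials were edited.
		return nil
	}
	fmt.Println("re-running validation against the BMC")
	p.lastValidatedCreds = creds
	return nil
}

func main() {
	p := &provisioner{lastValidatedCreds: "admin/secret"}
	_ = p.validateManagementAccess("admin/secret", true) // forced retry after an error
}
```
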
If inspection fails at some point before it is actually started in
ironic-inspector, we were just repeatedly retrying it instead of setting
an error. Instead, set an error and retry only after a backoff.

(cherry picked from commit 2d02462)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Error is the state that the Node goes into if deleting (i.e.
deprovisioning) it fails. The only valid action from this state is to
try deleting again, so do that rather than attempt to go straight to
manageable.

(cherry picked from commit 7791d21)
Signed-off-by: Honza Pokorny <honza@redhat.com>
If deprovisioning fails, we were just repeatedly retrying it instead of
setting an error. Instead, set an error and retry only after a backoff.

If we are deprovisioning because of a deletion, give up after 3 retries
(since this may indicate that the host simply doesn't exist any more)
and allow the Host to be deleted.

(cherry picked from commit d7d38ee)
Signed-off-by: Honza Pokorny <honza@redhat.com>
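
A sketch of the retry cap described above, with an assumed constant name and a simplified decision helper (not the actual controller code): when the host is being deleted and deprovisioning has already failed a few times, stop retrying and let the deletion proceed, since the host may no longer exist.

```go
package main

import "fmt"

// maxDeprovisionRetriesOnDelete reflects the "give up after 3 retries"
// behaviour described in the commit message; the name is assumed.
const maxDeprovisionRetriesOnDelete = 3

func handleDeprovisioningFailure(beingDeleted bool, errorCount int) string {
	if beingDeleted && errorCount >= maxDeprovisionRetriesOnDelete {
		return "give up and allow the Host to be deleted"
	}
	return "record the error and retry after a backoff"
}

func main() {
	fmt.Println(handleDeprovisioningFailure(true, 3))
	fmt.Println(handleDeprovisioningFailure(false, 3))
}
```
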
(cherry picked from commit 7c3daef)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Denote instances where we are returning due to a 409 Conflict in Ironic
by calling the retryAfterDelay() function to get a Result.

Note that in some cases this introduces a delay where previously there
was none. Delaying after a 409 response is probably always the right
thing to do.

(cherry picked from commit ac23a91)
Signed-off-by: Honza Pokorny <honza@redhat.com>
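
A hedged sketch of the pattern described above, with simplified stand-ins for the provisioner's Result type and retryAfterDelay(): when Ironic answers 409 Conflict, the code returns a "retry after a delay" result instead of something indistinguishable from success.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Result is a simplified stand-in for the provisioner's result type.
type Result struct {
	Dirty        bool
	RequeueAfter time.Duration
}

// retryAfterDelay builds a Result that asks the controller to come back
// later, making the 409 path explicit rather than looking like success.
func retryAfterDelay(delay time.Duration) (Result, error) {
	return Result{Dirty: true, RequeueAfter: delay}, nil
}

func handleIronicResponse(statusCode int) (Result, error) {
	if statusCode == http.StatusConflict {
		// Ironic is busy with a conflicting operation; back off briefly.
		return retryAfterDelay(10 * time.Second)
	}
	return Result{}, nil
}

func main() {
	res, _ := handleIronicResponse(http.StatusConflict)
	fmt.Printf("requeue after %s\n", res.RequeueAfter)
}
```
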
(cherry picked from commit 5faf361)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit c1e13c5)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 54c8028)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 6857a7f)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 1e6fd34)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Denote instances where we are returning due to data in the Host's Status
structure being updated by calling the hostUpdated() function to get a
Result.

(cherry picked from commit 4b4c29d)
Signed-off-by: Honza Pokorny <honza@redhat.com>
The provisioner interface currently requires provisioners to write to the
host's Status in order to set the ID. This is undesirable. Return the
provisioner's ID for the host from ValidateManagementAccess() and write it
to the Status in the controller.

(cherry picked from commit 50f97a7)
Signed-off-by: Honza Pokorny <honza@redhat.com>
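
A sketch of the interface change described above, using simplified placeholder types rather than the real provisioner package: ValidateManagementAccess returns the provisioner's ID for the host, and only the controller writes it into the Status.

```go
package main

import "fmt"

type Result struct{ Dirty bool }

// provisioner shows the shape of the change: the ID comes back as a return
// value instead of being written into host.Status by the provisioner.
type provisioner interface {
	ValidateManagementAccess(credentialsChanged, force bool) (result Result, provID string, err error)
}

type hostStatus struct {
	Provisioning struct{ ID string }
}

// recordProvisionerID is the controller-side half: persist the returned ID.
func recordProvisionerID(status *hostStatus, provID string) (dirty bool) {
	if provID != "" && status.Provisioning.ID != provID {
		status.Provisioning.ID = provID
		return true
	}
	return false
}

func main() {
	var st hostStatus
	fmt.Println(recordProvisionerID(&st, "node-uuid-1234")) // true: Status needs a write
}
```
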
The provisioner interface currently requires provisioners to write to the
host's Status in order to set the power status. This is undesirable. Return
the power status (if known) for the host from UpdateHardwareState(), rather
than a Result structure from which only the Dirty flag was used. Write the
updated power state to the Status in the controller.

(cherry picked from commit cd60f7d)
Signed-off-by: Honza Pokorny <honza@redhat.com>
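
A sketch of the analogous change for power state, again with simplified types: UpdateHardwareState returns the observed power state (nil when unknown) and the controller applies it to the Status.

```go
package main

import "fmt"

// HardwareState reports the power state if it could be determined.
type HardwareState struct {
	PoweredOn *bool // nil means "unknown"
}

type hostStatus struct{ PoweredOn bool }

// applyHardwareState is the controller-side write of the returned state.
func applyHardwareState(status *hostStatus, hwState HardwareState) (dirty bool) {
	if hwState.PoweredOn != nil && *hwState.PoweredOn != status.PoweredOn {
		status.PoweredOn = *hwState.PoweredOn
		return true
	}
	return false
}

func main() {
	on := true
	st := &hostStatus{PoweredOn: false}
	fmt.Println(applyHardwareState(st, HardwareState{PoweredOn: &on})) // true: power state changed
}
```
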
The fixtureProvisioner object gets recreated for each Reconcile() call, so
it cannot be used to store state that is used to make different things
happen on each reconciliation. Some of this state was instead being stored
in the host Status, which could potentially confound some of the tests:
only the controller should modify the status, and it is the thing we are
trying to test.

Instead, create a Fixture structure that contains persistent state for a
particular Host and the provisioner Factory to create a provisioner that
has access to that state. Stop writing to the Host's Status from the
fixture provisioner.

(cherry picked from commit 1279a47)
Signed-off-by: Honza Pokorny <honza@redhat.com>
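
A sketch of the test arrangement described above, with the types heavily simplified: the Fixture owns the per-host state that must survive across reconciles, and its factory hands each freshly created fixture provisioner a pointer to that shared state instead of stashing anything in the host's Status.

```go
package main

import "fmt"

// Fixture keeps persistent, test-controlled state for one Host.
type Fixture struct {
	Deprovisioned bool
	PoweredOn     bool
}

// fixtureProvisioner is recreated on every Reconcile(), so it only borrows
// the Fixture's state rather than owning any of its own.
type fixtureProvisioner struct {
	state *Fixture
}

// NewProvisioner plays the role of the provisioner factory the controller
// calls on each reconcile.
func (f *Fixture) NewProvisioner() *fixtureProvisioner {
	return &fixtureProvisioner{state: f}
}

func (p *fixtureProvisioner) Deprovision() {
	// Mutate the shared fixture state, never the Host's Status.
	p.state.Deprovisioned = true
}

func main() {
	f := &Fixture{}
	f.NewProvisioner().Deprovision()
	fmt.Println(f.Deprovisioned) // true: the state survives into the next reconcile
}
```
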
There is no longer any reason for provisioners to modify the BareMetalHost
object itself; they should communicate any changes explicitly through
return values from the Provisioner API.

To prevent this starting to happen again in future, pass only a copy of the
BareMetalHost in the provisioner factory, so that no changes made by the
provisioner will ever make their way back to the controller nor be written
to the k8s API.

(cherry picked from commit 0656b7e)
Signed-off-by: Honza Pokorny <honza@redhat.com>
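
A minimal sketch of the defensive copy described above, using a stand-in Host type; the real code relies on the generated DeepCopy() of the BareMetalHost CRD. Because the factory only ever sees a copy, nothing the provisioner writes to it can reach the object the controller persists to the k8s API.

```go
package main

import "fmt"

type Host struct{ Name string }

// DeepCopy mirrors the generated deep-copy helper on Kubernetes API types.
func (h *Host) DeepCopy() *Host {
	c := *h
	return &c
}

type provisioner struct{ host *Host }

// newProvisioner stands in for the provisioner factory.
func newProvisioner(host *Host) *provisioner { return &provisioner{host: host} }

func main() {
	original := &Host{Name: "worker-0"}
	p := newProvisioner(original.DeepCopy()) // the factory gets a copy, never the original
	p.host.Name = "mutated-by-provisioner"
	fmt.Println(original.Name) // still "worker-0"
}
```
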
This function is now called on basically every reconcile, so don't log
information about things that haven't changed.

(cherry picked from commit cbaa73d)
Signed-off-by: Honza Pokorny <honza@redhat.com>
When the provisioner could write to the Host's Status subresource and
expect it to be persisted, we had to always treat a Dirty flag returned
from the provisioner as indicating that the Status was dirty and required a
write. (Oddly this meant the Dirty flag was treated with almost the inverse
of its intended meaning: when it was returned true the Status was rarely
actually dirty, while when it was false the action was complete, which
always requires a write.)

Now that provisioners are no longer allowed to write to the Status, we only
need to request a write when the controller has updated the Status. Add a
new actionUpdate to cover the couple of cases that are not already covered
by actionComplete, and those where ClearError() returns dirty. The
actionUpdate type comprises actionContinue so that type checks for the
latter handle it correctly.

Make actionContinue return false from the Dirty() method, meaning that it
will not cause a write to the k8s API. The actionContinueNoWrite type is
removed, since it would now be identical to actionContinue.

(cherry picked from commit 3102c74)
Signed-off-by: Honza Pokorny <honza@redhat.com>
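
A sketch of the result types described above with heavily simplified definitions (the real types carry more behaviour): actionUpdate embeds actionContinue, so a check for the "continue" behaviour, modelled here as an interface assertion on a promoted marker method, also matches it, while Dirty() is overridden to request a Status write.

```go
package main

import "fmt"

// actionResult is what each state handler returns.
type actionResult interface {
	Dirty() bool
}

// actionContinue: keep going; nothing in the Status needs to be written.
type actionContinue struct{}

func (actionContinue) Dirty() bool { return false }
func (actionContinue) isContinue() {}

// actionUpdate: keep going, and the controller has changed the Status, so a
// write to the k8s API is required. Embedding actionContinue means checks for
// the latter also treat actionUpdate correctly.
type actionUpdate struct {
	actionContinue
}

func (actionUpdate) Dirty() bool { return true }

// actionComplete: the state's action finished, which always needs a write.
type actionComplete struct{}

func (actionComplete) Dirty() bool { return true }

func main() {
	var r actionResult = actionUpdate{}
	_, continues := r.(interface{ isContinue() }) // matches actionUpdate as well
	fmt.Println(continues, r.Dirty())             // true true
}
```
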
All of the action<StateName>() methods of the reconciler are the handlers
used by the host state machine to carry out the action in the current
provisioning state, except for actionRegistering(). actionRegistering() is
called in every state (by the state machine's ensureRegistered() method)
prior to calling the action method for the current state.

Rename actionRegistering() to registerHost() to eliminate the implication
that it is an action specific to the Registering state.

Simplify the return handling by returning nil when validating management
access succeeded and there were no changes required to the Status, since
this is what we want to return from ensureRegistered() (indicating no need
to return early before executing the state's action method). This will
occur both in general operation when there is no re-registration required,
and also potentially at the conclusion of a re-registration where there was
no credentials change and errors had already been cleared (by a previous
invocation that returned actionUpdate).

At the conclusion of a registration where changes need to be written to the
Status, e.g. by clearing a previous error state or updating the known good
credentials, actionComplete is returned. This is later translated to
actionUpdate by ensureRegistered().

This enables that scenario to be distinguished from that of a change
needing to be written while registration is still in progress, which
results in actionUpdate.

This means there is no longer a need in ensureRegistered() to guess when
the registerHost() method might have resulted in a change to the Status.

(cherry picked from commit 98d7131)
Signed-off-by: Honza Pokorny <honza@redhat.com>
If provisioning fails and the node ends up in the deploy failed state,
we try to recover by deprovisioning. That would work better if we
actually handled deploy failed in Deprovision().

(cherry picked from commit a308ddf)
Signed-off-by: Honza Pokorny <honza@redhat.com>
In the deprovisioning state, calling Adopt() on a freshly-reregistered
host will fail if we didn't previously complete provisioning because
there will be no image data stored in the status to pass to the node for
adoption.

Only call Adopt() if the previous provisioning completed and recorded
the requisite image data.

If the node has just been re-registered, it will be in the manageable
state anyway, and thus will still get cleaned before it is next
provisioned.

(cherry picked from commit 24ba8a0)
Signed-off-by: Honza Pokorny <honza@redhat.com>
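
A sketch of the guard described above, with simplified types: Adopt() is only attempted when the Status already records the image from a completed provisioning, so a freshly re-registered host with no image data skips adoption and will be cleaned before it is provisioned again.

```go
package main

import "fmt"

type image struct{ URL string }

type hostStatus struct {
	Image *image // recorded when provisioning last completed
}

// shouldAdopt gates the call to Adopt() on the presence of image data.
func shouldAdopt(status hostStatus) bool {
	return status.Image != nil && status.Image.URL != ""
}

func main() {
	fmt.Println(shouldAdopt(hostStatus{}))                                  // false: provisioning never completed
	fmt.Println(shouldAdopt(hostStatus{Image: &image{URL: "http://img"}})) // true: adopt with the recorded image
}
```
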
Currently when adoption fails, we use RegistrationError but this gets
cleared any time registration succeeds. We need a separate signal for
adoption failure, which allows us to force retry the adoption again.
This adds a new AdoptionError, and ensures that if registration succeeds
we don't clear an AdoptionError.

Co-authored-by: Andrea Fasano <afasano@redhat.com>
(cherry picked from commit a10a2ad)
Signed-off-by: Honza Pokorny <honza@redhat.com>
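
A sketch of the separate error type described above (constant values simplified): adoption failures get their own ErrorType, and the path that clears errors after a successful registration leaves that type alone, so the adoption is forced to retry.

```go
package main

import "fmt"

type ErrorType string

const (
	RegistrationError ErrorType = "registration error"
	AdoptionError     ErrorType = "adoption error"
)

type hostStatus struct {
	ErrorType    ErrorType
	ErrorMessage string
}

// clearRegistrationError runs when registration succeeds; it must not wipe
// out a pending AdoptionError, which is the signal to retry adoption.
func clearRegistrationError(st *hostStatus) {
	if st.ErrorType == AdoptionError {
		return
	}
	st.ErrorType = ""
	st.ErrorMessage = ""
}

func main() {
	st := &hostStatus{ErrorType: AdoptionError, ErrorMessage: "adoption failed"}
	clearRegistrationError(st)
	fmt.Println(st.ErrorType) // still "adoption error", so adoption will be retried
}
```
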
@openshift-ci-robot

@honza: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

Backport ensure adoption is retried upon failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2021
@honza honza changed the title Backport ensure adoption is retried upon failure Bug 1905577: Backport ensure adoption is retried upon failure Feb 24, 2021
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 24, 2021
@openshift-ci-robot

@honza: This pull request references Bugzilla bug 1905577, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.8.0" instead
  • expected Bugzilla bug 1905577 to depend on a bug targeting a release in 4.8.0 and in one of the following states: MODIFIED, VERIFIED, but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1905577: Backport ensure adoption is retried upon failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member)

/bugzilla refresh

@openshift-ci-robot

@stbenjam: This pull request references Bugzilla bug 1905577, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.7.z" instead
  • expected Bugzilla bug 1905577 to depend on a bug targeting a release in 4.8.0 and in one of the following states: MODIFIED, VERIFIED, but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member)

/retitle Bug 1932452: Backport ensure adoption is retried upon failure

1 similar comment
@stbenjam (Member)

/retitle Bug 1932452: Backport ensure adoption is retried upon failure

@openshift-ci-robot openshift-ci-robot changed the title Bug 1905577: Backport ensure adoption is retried upon failure Bug 1932452: Backport ensure adoption is retried upon failure Feb 24, 2021
@openshift-ci-robot

@honza: This pull request references Bugzilla bug 1932452, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.7.z" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1932452: Backport ensure adoption is retried upon failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member)

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 24, 2021
@openshift-ci-robot

@stbenjam: This pull request references Bugzilla bug 1932452, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1905577 is in the state MODIFIED, which is one of the valid states (MODIFIED, VERIFIED)
  • dependent Bugzilla bug 1905577 targets the "4.8.0" release, which is one of the valid target releases: 4.8.0
  • bug has dependents

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@honza (Member, Author) commented Feb 25, 2021

/test e2e-metal-ipi-ovn-ipv6

@stbenjam (Member)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: honza, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stbenjam (Member) commented Mar 5, 2021

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot removed the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Mar 5, 2021
@openshift-ci-robot

@stbenjam: This pull request references Bugzilla bug 1932452, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.z) matches configured target release for branch (4.7.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1905577 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
  • dependent Bugzilla bug 1905577 targets the "4.8.0" release, which is one of the valid target releases: 4.8.0
  • bug has dependents

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Mar 5, 2021
@derekwaynecarr (Member)

(Patch-Manager) approve.

@derekwaynecarr derekwaynecarr added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Mar 5, 2021
@openshift-merge-robot openshift-merge-robot merged commit 8ddca62 into openshift:release-4.7 Mar 5, 2021
@openshift-ci-robot

@honza: All pull requests linked via external trackers have merged:

Bugzilla bug 1932452 has been moved to the MODIFIED state.

In response to this:

Bug 1932452: Backport ensure adoption is retried upon failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
