Bug 1932452: Backport ensure adoption is retried upon failure #127
Conversation
It is up to the consumer, not the host, to decide when a host is ready to have provisioning instructions added. The checks in the Available() method will be inlined in cluster-api-provider-metal3. Signed-off-by: Doug Hellmann <dhellmann@redhat.com> (cherry picked from commit 095bf63) Signed-off-by: Honza Pokorny <honza@redhat.com>
Use the API field directly instead of the convenience method. Signed-off-by: Doug Hellmann <dhellmann@redhat.com> (cherry picked from commit 29c784b) Signed-off-by: Honza Pokorny <honza@redhat.com>
Move public methods of the host type to private functions in the controller package. Signed-off-by: Doug Hellmann <dhellmann@redhat.com> (cherry picked from commit b0c8c58) Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 9b3dbf7) Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 177d9ef) Signed-off-by: Honza Pokorny <honza@redhat.com>
When a call to the provisioner succeeds rather than returning an error message, that's a good sign, and a reason not to have an error message set in the Host object. But it doesn't guarantee success: if the previous failure came at the end of some long-running operation (rather than outright rejection at the beginning), it could yet fail in exactly the same way. Clearing the ErrorType as soon as we start an operation allows us to use that field to determine whether to force the provisioner to retry in the presence of an error. (It will only be set if the last thing we did was record an error, so if we see it then we are at the beginning of a new retry.)

One known issue with this is that if a request to change state in Ironic results in a 409 Conflict response, this just sets the Dirty flag and is indistinguishable from success when viewed from outside the Ironic provisioner. If such a conflict occurs, we will effectively skip a retry attempt, increment the ErrorCount again, and sleep for even longer before the next attempt.

To ensure that we actually do exponential backoff between retries, leave the ErrorCount unchanged until the whole operation actually succeeds (i.e. Dirty is no longer true). This is now decoupled from the ClearError() method that clears the ErrorType and ErrorMessage. As much as possible, do the clearing of ErrorCount in the host state machine. The exception is the steady states where we only do power management: the actionResult types currently lack enough detail to distinguish, from the state machine's point of view, when the count should be cleared.

In power management states (Ready/Available, Provisioned, Externally Provisioned), count the number of errors of any type since the power state was last successfully changed. Otherwise, the ErrorCount is cleared when an operation successfully completes. Successful re-registration or adoption is never sufficient to clear the error count, except in the registration state.
(cherry picked from commit 562b6ac) Signed-off-by: Honza Pokorny <honza@redhat.com>
Once we see an error in the Node, it returns to the 'enroll' [sic] state and we don't have a way of determining whether we have seen and saved that error or not. Previously we always assumed we had not, and didn't retry the validation unless the credentials had changed. Add a force flag to indicate that this is a new attempt and should start again. Fixes metal3-io#739 (cherry picked from commit 84ca573) Signed-off-by: Honza Pokorny <honza@redhat.com>
If inspection fails at some point before it is actually started in ironic-inspector, we were just repeatedly retrying it instead of setting an error. Instead, set an error and retry only after a backoff. (cherry picked from commit 2d02462) Signed-off-by: Honza Pokorny <honza@redhat.com>
Error is the state that the Node goes into if deleting (i.e. deprovisioning) it fails. The only valid action from this state is to try deleting again, so do that rather than attempt to go straight to manageable. (cherry picked from commit 7791d21) Signed-off-by: Honza Pokorny <honza@redhat.com>
If deprovisioning fails, we were just repeatedly retrying it instead of setting an error. Instead, set an error and retry only after a backoff. If we are deprovisioning because of a deletion, give up after 3 retries (since this may indicate that the host simply doesn't exist any more) and allow the Host to be deleted. (cherry picked from commit d7d38ee) Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 7c3daef) Signed-off-by: Honza Pokorny <honza@redhat.com>
Denote instances where we are returning due to a 409 Conflict in Ironic by calling the retryAfterDelay() function to get a Result. Note that in some cases this introduces a delay where previously there was none. Delaying after a 409 response is probably always the right thing to do. (cherry picked from commit ac23a91) Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 5faf361) Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit c1e13c5) Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 54c8028) Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 6857a7f) Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 1e6fd34) Signed-off-by: Honza Pokorny <honza@redhat.com>
Denote instances where we are returning due to data in the Host's Status structure being updated by calling the hostUpdated() function to get a Result. (cherry picked from commit 4b4c29d) Signed-off-by: Honza Pokorny <honza@redhat.com>
The provisioner interface currently requires provisioners to write to the host's Status in order to set the ID. This is undesirable. Return the provisioner's ID for the host from ValidateManagementAccess() and write it to the Status in the controller. (cherry picked from commit 50f97a7) Signed-off-by: Honza Pokorny <honza@redhat.com>
The provisioner interface currently requires provisioners to write to the host's Status in order to set the power status. This is undesirable. Return the power status (if known) for the host from UpdateHardwareState(), rather than a Result structure from which only the Dirty flag was used. Write the updated power state to the Status in the controller. (cherry picked from commit cd60f7d) Signed-off-by: Honza Pokorny <honza@redhat.com>
The fixtureProvisioner object gets recreated for each Reconcile() call, so it cannot be used to store state that is used to make different things happen on each reconciliation. Some of this state was instead being stored in the host Status, which could potentially confound some of the tests: only the controller should modify the status, and it is the thing we are trying to test. Instead, create a Fixture structure that contains persistent state for a particular Host and the provisioner Factory to create a provisioner that has access to that state. Stop writing to the Host's Status from the fixture provisioner. (cherry picked from commit 1279a47) Signed-off-by: Honza Pokorny <honza@redhat.com>
There is no longer any reason for provisioners to modify the BareMetalHost object itself; they should communicate any changes explicitly through return values from the Provisioner API. To prevent this starting to happen again in future, pass only a copy of the BareMetalHost in the provisioner factory, so that no changes made by the provisioner will ever make their way back to the controller nor be written to the k8s API. (cherry picked from commit 0656b7e) Signed-off-by: Honza Pokorny <honza@redhat.com>
This function is now called on basically every reconcile, so don't log information about things that haven't changed. (cherry picked from commit cbaa73d) Signed-off-by: Honza Pokorny <honza@redhat.com>
When the provisioner could write to the Host's Status subresource and expect it to be persisted, we had to always treat a Dirty flag returned from the provisioner as indicating that the Status was dirty and required a write. (Oddly, this meant the Dirty flag was treated with almost the inverse of its intended meaning: when it was returned true the Status was rarely actually dirty, while when it was false the action was complete, which always requires a write.)

Now that provisioners are no longer allowed to write to the Status, we only need to request a write when the controller has updated the Status. Add a new actionUpdate to cover the couple of cases that are not already covered by actionComplete, and those where ClearError() returns dirty. The actionUpdate type comprises actionContinue so that type checks for the latter handle it correctly.

Make actionContinue return false from the Dirty() method, meaning that it will not cause a write to the k8s API. The actionContinueNoWrite type is removed, since it would now be identical to actionContinue.

(cherry picked from commit 3102c74) Signed-off-by: Honza Pokorny <honza@redhat.com>
All of the action<StateName>() methods of the reconciler are the handlers used by the host state machine to carry out the action in the current provisioning state, except for actionRegistering(). actionRegistering() is called in every state (by the state machine's ensureRegistered() method) prior to calling the action method for the current state. Rename actionRegistering() to registerHost() to eliminate the implication that it is an action specific to the Registering state.

Simplify the return handling by returning nil when validating management access succeeded and there were no changes required to the Status, since this is what we want to return from ensureRegistered() (indicating no need to return early before executing the state's action method). This will occur both in general operation when there is no re-registration required, and also potentially at the conclusion of a re-registration where there was no credentials change and errors had already been cleared (by a previous invocation that returned actionUpdate).

At the conclusion of a registration where changes need to be written to the Status, e.g. by clearing a previous error state or updating the known good credentials, actionComplete is returned. This is later translated to actionUpdate by ensureRegistered(). This enables that scenario to be distinguished from that of a change needing to be written while registration is still in progress, which results in actionUpdate. This means there is no longer a need in ensureRegistered() to guess when the registerHost() method might have resulted in a change to the Status.

(cherry picked from commit 98d7131) Signed-off-by: Honza Pokorny <honza@redhat.com>
If provisioning fails and the node ends up in the deploy failed state, we try to recover by deprovisioning. That would work better if we actually handled deploy failed in Deprovision(). (cherry picked from commit a308ddf) Signed-off-by: Honza Pokorny <honza@redhat.com>
In the deprovisioning state, calling Adopt() on a freshly-reregistered host will fail if we didn't previously complete provisioning because there will be no image data stored in the status to pass to the node for adoption. Only call Adopt() if the previous provisioning completed and recorded the requisite image data. If the node has just been re-registered, it will be in the manageable state anyway, and thus will still get cleaned before it is next provisioned. (cherry picked from commit 24ba8a0) Signed-off-by: Honza Pokorny <honza@redhat.com>
Currently when adoption fails, we use RegistrationError but this gets cleared any time registration succeeds. We need a separate signal for adoption failure, which allows us to force retry the adoption again. This adds a new AdoptionError, and ensures that if registration succeeds we don't clear an AdoptionError. Co-authored-by: Andrea Fasano <afasano@redhat.com> (cherry picked from commit a10a2ad) Signed-off-by: Honza Pokorny <honza@redhat.com>
@honza: No Bugzilla bug is referenced in the title of this pull request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@honza: This pull request references Bugzilla bug 1905577, which is invalid:
In response to this:
/bugzilla refresh
@stbenjam: This pull request references Bugzilla bug 1905577, which is invalid:
In response to this:
/retitle Bug 1932452: Backport ensure adoption is retried upon failure
@honza: This pull request references Bugzilla bug 1932452, which is invalid:
In response to this:
/bugzilla refresh |
@stbenjam: This pull request references Bugzilla bug 1932452, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
In response to this:
/test e2e-metal-ipi-ovn-ipv6
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: honza, stbenjam The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/bugzilla refresh
@stbenjam: This pull request references Bugzilla bug 1932452, which is valid. 6 validation(s) were run on this bug
In response to this:
(Patch-Manager) approve.
@honza: All pull requests linked via external trackers have merged: Bugzilla bug 1932452 has been moved to the MODIFIED state. In response to this: