Bug 1932452: Backport ensure adoption is retried upon failure #127

Conversation

@honza (Member) commented Feb 24, 2021

No description provided.

dhellmann and others added 28 commits February 24, 2021 11:14
It is up to the consumer, not the host, to decide when a host is ready
to have provisioning instructions added. The checks in the Available()
method will be inlined in cluster-api-provider-metal3.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
(cherry picked from commit 095bf63)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Use the API field directly instead of the convenience method.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
(cherry picked from commit 29c784b)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Move public methods of the host type to private functions in the
controller package.

Signed-off-by: Doug Hellmann <dhellmann@redhat.com>
(cherry picked from commit b0c8c58)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 9b3dbf7)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 177d9ef)
Signed-off-by: Honza Pokorny <honza@redhat.com>
When a call to the provisioner succeeds rather than returning an error
message, that's a good sign, and a reason to not have an error message
set in the Host object. But it doesn't guarantee success: if the
previous failure came at the end of some long-running operation (rather
than outright rejection at the beginning), it could yet fail in exactly
the same way.

Clearing the ErrorType as soon as we start an operation allows us to use
that field to determine whether to force the provisioner to retry in the
presence of an error. (It will only be set if the last thing we did was
record an error, therefore if we see it then we are at the beginning of
a new retry.)

One known issue with this is that if a request to change state in Ironic
results in a 409 Conflict response, this just sets the Dirty flag and is
indistinguishable from success when viewed outside the Ironic
provisioner. If such a conflict occurs, we will effectively skip a retry
attempt, increment the ErrorCount again, and sleep for even longer
before the next attempt.

To ensure that we actually do exponential backoff between retries, leave
the ErrorCount unchanged until the whole operation actually succeeds
(Dirty is no longer true). This is now decoupled from the ClearError()
method that clears the ErrorType and ErrorMessage.
As much as possible, clear the ErrorCount in the host state
machine. The exception is the steady states where we only do power
management: the actionResult types currently lack enough detail for the
state machine to determine when the count should be cleared.

In power management states (Ready/Available, Provisioned, Externally
Provisioned), count the number of errors of any type since the power
state was last successfully changed.

Otherwise, the ErrorCount is cleared when an operation successfully
completes. Successful re-registration or adoption is never sufficient to
clear the error count, except in the registration state.

(cherry picked from commit 562b6ac)
Signed-off-by: Honza Pokorny <honza@redhat.com>
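
A minimal sketch of the separation described above, assuming simplified field names and a hypothetical calculateBackoff helper (illustrative only, not the actual baremetal-operator code): clearError resets the error type and message when a new attempt begins, while ErrorCount is only reset once the operation has fully completed, so the backoff between retries keeps growing until something actually succeeds.

```go
package main

import (
	"fmt"
	"time"
)

type hostStatus struct {
	ErrorType    string
	ErrorMessage string
	ErrorCount   int
}

// clearError marks the start of a new attempt but deliberately leaves
// ErrorCount untouched.
func clearError(st *hostStatus) (dirty bool) {
	dirty = st.ErrorType != "" || st.ErrorMessage != ""
	st.ErrorType = ""
	st.ErrorMessage = ""
	return dirty
}

// clearErrorCount runs only once the operation is no longer dirty, i.e. it
// has completed rather than merely not failing immediately.
func clearErrorCount(st *hostStatus) { st.ErrorCount = 0 }

// calculateBackoff (hypothetical helper) grows exponentially with the number
// of recorded failures, capped to keep the example bounded.
func calculateBackoff(errorCount int) time.Duration {
	if errorCount > 10 {
		errorCount = 10
	}
	return time.Duration(1<<uint(errorCount)) * time.Second
}

func main() {
	st := &hostStatus{ErrorType: "provisioning error", ErrorMessage: "deploy failed", ErrorCount: 3}
	clearError(st) // a new retry begins; the count is untouched
	fmt.Println("next retry after", calculateBackoff(st.ErrorCount))
	clearErrorCount(st) // only after the whole operation succeeds
}
```
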
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739

(cherry picked from commit 84ca573)
Signed-off-by: Honza Pokorny <honza@redhat.com>
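
A rough sketch of the force flag, using a simplified stand-in type rather than the real provisioner interface (the method signature and field below are assumptions): force=true marks a fresh attempt, so the validation is repeated even though the stored credentials are unchanged.

```go
package main

import "fmt"

type provisioner struct {
	lastValidatedCreds string
}

// validateManagementAccess re-runs credential validation when the credentials
// changed or when the caller explicitly forces a new attempt after an error.
func (p *provisioner) validateManagementAccess(creds string, force bool) error {
	if creds == p.lastValidatedCreds && !force {
		// Previously this branch was always taken after a failure, so the
		// validation never ran again until the credentials were edited.
		return nil
	}
	fmt.Println("re-running validation against the BMC")
	p.lastValidatedCreds = creds
	return nil
}

func main() {
	p := &provisioner{lastValidatedCreds: "admin/secret"}
	_ = p.validateManagementAccess("admin/secret", true) // forced retry after an error
}
```
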
If inspection fails at some point before it is actually started in
ironic-inspector, we were just repeatedly retrying it instead of setting
an error. Instead, set an error and retry only after a backoff.

(cherry picked from commit 2d02462)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Error is the state that the Node goes into if deleting (i.e.
deprovisioning) it fails. The only valid action from this state is to
try deleting again, so do that rather than attempt to go straight to
manageable.

(cherry picked from commit 7791d21)
Signed-off-by: Honza Pokorny <honza@redhat.com>
If deprovisioning fails, we were just repeatedly retrying it instead of
setting an error. Instead, set an error and retry only after a backoff.

If we are deprovisioning because of a deletion, give up after 3 retries
(since this may indicate that the host simply doesn't exist any more)
and allow the Host to be deleted.

(cherry picked from commit d7d38ee)
Signed-off-by: Honza Pokorny <honza@redhat.com>
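
A sketch of the retry cap described above, with an assumed constant name and a simplified decision helper (not the actual controller code): when the host is being deleted and deprovisioning has already failed a few times, stop retrying and let the deletion proceed, since the host may no longer exist.

```go
package main

import "fmt"

// maxDeprovisionRetriesOnDelete reflects the "give up after 3 retries"
// behaviour described in the commit message; the name is assumed.
const maxDeprovisionRetriesOnDelete = 3

func handleDeprovisioningFailure(beingDeleted bool, errorCount int) string {
	if beingDeleted && errorCount >= maxDeprovisionRetriesOnDelete {
		return "give up and allow the Host to be deleted"
	}
	return "record the error and retry after a backoff"
}

func main() {
	fmt.Println(handleDeprovisioningFailure(true, 3))
	fmt.Println(handleDeprovisioningFailure(false, 3))
}
```
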
(cherry picked from commit 7c3daef)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Denote instances where we are returning due to a 409 Conflict in Ironic
by calling the retryAfterDelay() function to get a Result.

Note that in some cases this introduces a delay where previously there
was none. Delaying after a 409 response is probably always the right
thing to do.

(cherry picked from commit ac23a91)
Signed-off-by: Honza Pokorny <honza@redhat.com>
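
A hedged sketch of the pattern described above, with simplified stand-ins for the provisioner's Result type and retryAfterDelay(): when Ironic answers 409 Conflict, the code returns a "retry after a delay" result instead of something indistinguishable from success.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Result is a simplified stand-in for the provisioner's result type.
type Result struct {
	Dirty        bool
	RequeueAfter time.Duration
}

// retryAfterDelay builds a Result that asks the controller to come back
// later, making the 409 path explicit rather than looking like success.
func retryAfterDelay(delay time.Duration) (Result, error) {
	return Result{Dirty: true, RequeueAfter: delay}, nil
}

func handleIronicResponse(statusCode int) (Result, error) {
	if statusCode == http.StatusConflict {
		// Ironic is busy with a conflicting operation; back off briefly.
		return retryAfterDelay(10 * time.Second)
	}
	return Result{}, nil
}

func main() {
	res, _ := handleIronicResponse(http.StatusConflict)
	fmt.Printf("requeue after %s\n", res.RequeueAfter)
}
```
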
(cherry picked from commit 5faf361)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit c1e13c5)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 54c8028)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 6857a7f)
Signed-off-by: Honza Pokorny <honza@redhat.com>
(cherry picked from commit 1e6fd34)
Signed-off-by: Honza Pokorny <honza@redhat.com>
Denote instances where we are returning due to data in the Host's Status
structure being updated by calling the hostUpdated() function to get a
Result.

(cherry picked from commit 4b4c29d)
Signed-off-by: Honza Pokorny <honza@redhat.com>
The provisioner interface currently requires provisioners to write to the
host's Status in order to set the ID. This is undesirable. Return the
provisioner's ID for the host from ValidateManagementAccess() and write it
to the Status in the controller.

(cherry picked from commit 50f97a7)
Signed-off-by: Honza Pokorny <honza@redhat.com>
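
A sketch of the interface change described above, using simplified placeholder types rather than the real provisioner package: ValidateManagementAccess returns the provisioner's ID for the host, and only the controller writes it into the Status.

```go
package main

import "fmt"

type Result struct{ Dirty bool }

// provisioner shows the shape of the change: the ID comes back as a return
// value instead of being written into host.Status by the provisioner.
type provisioner interface {
	ValidateManagementAccess(credentialsChanged, force bool) (result Result, provID string, err error)
}

type hostStatus struct {
	Provisioning struct{ ID string }
}

// recordProvisionerID is the controller-side half: persist the returned ID.
func recordProvisionerID(status *hostStatus, provID string) (dirty bool) {
	if provID != "" && status.Provisioning.ID != provID {
		status.Provisioning.ID = provID
		return true
	}
	return false
}

func main() {
	var st hostStatus
	fmt.Println(recordProvisionerID(&st, "node-uuid-1234")) // true: Status needs a write
}
```
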
The provisioner interface currently requires provisioners to write to the
host's Status in order to set the power status. This is undesirable. Return
the power status (if known) for the host from UpdateHardwareState(), rather
than a Result structure from which only the Dirty flag was used. Write the
updated power state to the Status in the controller.

(cherry picked from commit cd60f7d)
Signed-off-by: Honza Pokorny <honza@redhat.com>
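
A sketch of the analogous change for power state, again with simplified types: UpdateHardwareState returns the observed power state (nil when unknown) and the controller applies it to the Status.

```go
package main

import "fmt"

// HardwareState reports the power state if it could be determined.
type HardwareState struct {
	PoweredOn *bool // nil means "unknown"
}

type hostStatus struct{ PoweredOn bool }

// applyHardwareState is the controller-side write of the returned state.
func applyHardwareState(status *hostStatus, hwState HardwareState) (dirty bool) {
	if hwState.PoweredOn != nil && *hwState.PoweredOn != status.PoweredOn {
		status.PoweredOn = *hwState.PoweredOn
		return true
	}
	return false
}

func main() {
	on := true
	st := &hostStatus{PoweredOn: false}
	fmt.Println(applyHardwareState(st, HardwareState{PoweredOn: &on})) // true: power state changed
}
```
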
The fixtureProvisioner object gets recreated for each Reconcile() call, so
it cannot be used to store state that is used to make different things
happen on each reconciliation. Some of this state was instead being stored
in the host Status, which could potentially confound some of the tests:
only the controller should modify the status, and it is the thing we are
trying to test.

Instead, create a Fixture structure that contains persistent state for a
particular Host and the provisioner Factory to create a provisioner that
has access to that state. Stop writing to the Host's Status from the
fixture provisioner.

(cherry picked from commit 1279a47)
Signed-off-by: Honza Pokorny <honza@redhat.com>
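
A sketch of the test arrangement described above, with the types heavily simplified: the Fixture owns the per-host state that must survive across reconciles, and its factory hands each freshly created fixture provisioner a pointer to that shared state instead of stashing anything in the host's Status.

```go
package main

import "fmt"

// Fixture keeps persistent, test-controlled state for one Host.
type Fixture struct {
	Deprovisioned bool
	PoweredOn     bool
}

// fixtureProvisioner is recreated on every Reconcile(), so it only borrows
// the Fixture's state rather than owning any of its own.
type fixtureProvisioner struct {
	state *Fixture
}

// NewProvisioner plays the role of the provisioner factory the controller
// calls on each reconcile.
func (f *Fixture) NewProvisioner() *fixtureProvisioner {
	return &fixtureProvisioner{state: f}
}

func (p *fixtureProvisioner) Deprovision() {
	// Mutate the shared fixture state, never the Host's Status.
	p.state.Deprovisioned = true
}

func main() {
	f := &Fixture{}
	f.NewProvisioner().Deprovision()
	fmt.Println(f.Deprovisioned) // true: the state survives into the next reconcile
}
```
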
There is no longer any reason for provisioners to modify the BareMetalHost
object itself; they should communicate any changes explicitly through
return values from the Provisioner API.

To prevent this starting to happen again in future, pass only a copy of the
BareMetalHost in the provisioner factory, so that no changes made by the
provisioner will ever make their way back to the controller nor be written
to the k8s API.

(cherry picked from commit 0656b7e)
Signed-off-by: Honza Pokorny <honza@redhat.com>
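
A minimal sketch of the defensive copy described above, using a stand-in Host type; the real code relies on the generated DeepCopy() of the BareMetalHost CRD. Because the factory only ever sees a copy, nothing the provisioner writes to it can reach the object the controller persists to the k8s API.

```go
package main

import "fmt"

type Host struct{ Name string }

// DeepCopy mirrors the generated deep-copy helper on Kubernetes API types.
func (h *Host) DeepCopy() *Host {
	c := *h
	return &c
}

type provisioner struct{ host *Host }

// newProvisioner stands in for the provisioner factory.
func newProvisioner(host *Host) *provisioner { return &provisioner{host: host} }

func main() {
	original := &Host{Name: "worker-0"}
	p := newProvisioner(original.DeepCopy()) // the factory gets a copy, never the original
	p.host.Name = "mutated-by-provisioner"
	fmt.Println(original.Name) // still "worker-0"
}
```
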
This function is now called on basically every reconcile, so don't log
information about things that haven't changed.

(cherry picked from commit cbaa73d)
Signed-off-by: Honza Pokorny <honza@redhat.com>
When the provisioner could write to the Host's Status subresource and
expect it to be persisted, we had to always treat a Dirty flag returned
from the provisioner as indicating that the Status was dirty and required a
write. (Oddly this meant the Dirty flag was treated with almost the inverse
of its intended meaning: when it was returned true the Status was rarely
actually dirty, while when it was false the action was complete, which
always requires a write.)

Now that provisioners are no longer allowed to write to the Status, we only
need to request a write when the controller has updated the Status. Add a
new actionUpdate to cover the couple of cases that are not already covered
by actionComplete, and those where ClearError() returns dirty. The
actionUpdate type comprises actionContinue so that type checks for the
latter handle it correctly.

Make actionContinue return false from the Dirty() method, meaning that it
will not cause a write to the k8s API. The actionContinueNoWrite type is
removed, since it would now be identical to actionContinue.

(cherry picked from commit 3102c74)
Signed-off-by: Honza Pokorny <honza@redhat.com>
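
A sketch of the result types described above with heavily simplified definitions (the real types carry more behaviour): actionUpdate embeds actionContinue, so a check for the "continue" behaviour, modelled here as an interface assertion on a promoted marker method, also matches it, while Dirty() is overridden to request a Status write.

```go
package main

import "fmt"

// actionResult is what each state handler returns.
type actionResult interface {
	Dirty() bool
}

// actionContinue: keep going; nothing in the Status needs to be written.
type actionContinue struct{}

func (actionContinue) Dirty() bool { return false }
func (actionContinue) isContinue() {}

// actionUpdate: keep going, and the controller has changed the Status, so a
// write to the k8s API is required. Embedding actionContinue means checks for
// the latter also treat actionUpdate correctly.
type actionUpdate struct {
	actionContinue
}

func (actionUpdate) Dirty() bool { return true }

// actionComplete: the state's action finished, which always needs a write.
type actionComplete struct{}

func (actionComplete) Dirty() bool { return true }

func main() {
	var r actionResult = actionUpdate{}
	_, continues := r.(interface{ isContinue() }) // matches actionUpdate as well
	fmt.Println(continues, r.Dirty())             // true true
}
```
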
All of the action<StateName>() methods of the reconciler are the handlers
used by the host state machine to carry out the action in the current
provisioning state, except for actionRegistering(). actionRegistering() is
called in every state (by the state machine's ensureRegistered() method)
prior to calling the action method for the current state.

Rename actionRegistering() to registerHost() to eliminate the implication
that it is an action specific to the Registering state.

Simplify the return handling by returning nil when validating management
access succeeded and there were no changes required to the Status, since
this is what we want to return from ensureRegistered() (indicating no need
to return early before executing the state's action method). This will
occur both in general operation when there is no re-registration required,
and also potentially at the conclusion of a re-registration where there was
no credentials change and errors had already been cleared (by a previous
invocation that returned actionUpdate).

At the conclusion of a registration where changes need to be written to the
Status, e.g. by clearing a previous error state or updating the known good
credentials, actionComplete is returned. This is later translated to
actionUpdate by ensureRegistered().

This enables that scenario to be distinguished from that of a change
needing to be written while registration is still in progress, which
results in actionUpdate.

This means there is no longer a need in ensureRegistered() to guess when
the registerHost() method might have resulted in a change to the Status.

(cherry picked from commit 98d7131)
Signed-off-by: Honza Pokorny <honza@redhat.com>
If provisioning fails and the node ends up in the deploy failed state,
we try to recover by deprovisioning. That would work better if we
actually handled deploy failed in Deprovision().

(cherry picked from commit a308ddf)
Signed-off-by: Honza Pokorny <honza@redhat.com>
In the deprovisioning state, calling Adopt() on a freshly-reregistered
host will fail if we didn't previously complete provisioning because
there will be no image data stored in the status to pass to the node for
adoption.

Only call Adopt() if the previous provisioning completed and recorded
the requisite image data.

If the node has just been re-registered, it will be in the manageable
state anyway, and thus will still get cleaned before it is next
provisioned.

(cherry picked from commit 24ba8a0)
Signed-off-by: Honza Pokorny <honza@redhat.com>
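
A sketch of the guard described above, with simplified types: Adopt() is only attempted when the Status already records the image from a completed provisioning, so a freshly re-registered host with no image data skips adoption and will be cleaned before it is provisioned again.

```go
package main

import "fmt"

type image struct{ URL string }

type hostStatus struct {
	Image *image // recorded when provisioning last completed
}

// shouldAdopt gates the call to Adopt() on the presence of image data.
func shouldAdopt(status hostStatus) bool {
	return status.Image != nil && status.Image.URL != ""
}

func main() {
	fmt.Println(shouldAdopt(hostStatus{}))                                  // false: provisioning never completed
	fmt.Println(shouldAdopt(hostStatus{Image: &image{URL: "http://img"}})) // true: adopt with the recorded image
}
```
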
Currently when adoption fails, we use RegistrationError but this gets
cleared any time registration succeeds. We need a separate signal for
adoption failure, which allows us to force retry the adoption again.
This adds a new AdoptionError, and ensures that if registration succeeds
we don't clear an AdoptionError.

Co-authored-by: Andrea Fasano <afasano@redhat.com>
(cherry picked from commit a10a2ad)
Signed-off-by: Honza Pokorny <honza@redhat.com>
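
A sketch of the separate error type described above (constant values simplified): adoption failures get their own ErrorType, and the path that clears errors after a successful registration leaves that type alone, so the adoption is forced to retry.

```go
package main

import "fmt"

type ErrorType string

const (
	RegistrationError ErrorType = "registration error"
	AdoptionError     ErrorType = "adoption error"
)

type hostStatus struct {
	ErrorType    ErrorType
	ErrorMessage string
}

// clearRegistrationError runs when registration succeeds; it must not wipe
// out a pending AdoptionError, which is the signal to retry adoption.
func clearRegistrationError(st *hostStatus) {
	if st.ErrorType == AdoptionError {
		return
	}
	st.ErrorType = ""
	st.ErrorMessage = ""
}

func main() {
	st := &hostStatus{ErrorType: AdoptionError, ErrorMessage: "adoption failed"}
	clearRegistrationError(st)
	fmt.Println(st.ErrorType) // still "adoption error", so adoption will be retried
}
```
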
@openshift-ci-robot

@honza: No Bugzilla bug is referenced in the title of this pull request.
To reference a bug, add 'Bug XXX:' to the title of this pull request and request another bug refresh with /bugzilla refresh.

In response to this:

Backport ensure adoption is retried upon failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 24, 2021
@honza honza changed the title Backport ensure adoption is retried upon failure Bug 1905577: Backport ensure adoption is retried upon failure Feb 24, 2021
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 24, 2021
@openshift-ci-robot

@honza: This pull request references Bugzilla bug 1905577, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.8.0" instead
  • expected Bugzilla bug 1905577 to depend on a bug targeting a release in 4.8.0 and in one of the following states: MODIFIED, VERIFIED, but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1905577: Backport ensure adoption is retried upon failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member)

/bugzilla refresh

@openshift-ci-robot

@stbenjam: This pull request references Bugzilla bug 1905577, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.7.z" instead
  • expected Bugzilla bug 1905577 to depend on a bug targeting a release in 4.8.0 and in one of the following states: MODIFIED, VERIFIED, but no dependents were found

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member)

/retitle Bug 1932452: Backport ensure adoption is retried upon failure

1 similar comment
@stbenjam (Member)

/retitle Bug 1932452: Backport ensure adoption is retried upon failure

@openshift-ci-robot openshift-ci-robot changed the title Bug 1905577: Backport ensure adoption is retried upon failure Bug 1932452: Backport ensure adoption is retried upon failure Feb 24, 2021
@openshift-ci-robot

@honza: This pull request references Bugzilla bug 1932452, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "4.7.z" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1932452: Backport ensure adoption is retried upon failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stbenjam (Member)

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Feb 24, 2021
@openshift-ci-robot

@stbenjam: This pull request references Bugzilla bug 1932452, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1905577 is in the state MODIFIED, which is one of the valid states (MODIFIED, VERIFIED)
  • dependent Bugzilla bug 1905577 targets the "4.8.0" release, which is one of the valid target releases: 4.8.0
  • bug has dependents

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@honza (Member, Author) commented Feb 25, 2021

/test e2e-metal-ipi-ovn-ipv6

@stbenjam (Member)

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 25, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: honza, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@stbenjam (Member) commented Mar 5, 2021

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot removed the bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. label Mar 5, 2021
@openshift-ci-robot

@stbenjam: This pull request references Bugzilla bug 1932452, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.z) matches configured target release for branch (4.7.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1905577 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
  • dependent Bugzilla bug 1905577 targets the "4.8.0" release, which is one of the valid target releases: 4.8.0
  • bug has dependents

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-urgent Referenced Bugzilla bug's severity is urgent for the branch this PR is targeting. label Mar 5, 2021
@derekwaynecarr (Member)

(Patch-Manager) approve.

@derekwaynecarr derekwaynecarr added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Mar 5, 2021
@openshift-merge-robot openshift-merge-robot merged commit 8ddca62 into openshift:release-4.7 Mar 5, 2021
@openshift-ci-robot

@honza: All pull requests linked via external trackers have merged:

Bugzilla bug 1932452 has been moved to the MODIFIED state.

In response to this:

Bug 1932452: Backport ensure adoption is retried upon failure

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
