
Bug 1868104: Make use of errors and Failed phase to handle failed machines #113

Merged
17 commits merged into openshift:master on Dec 3, 2020

Conversation

@zaneb (Member) commented Sep 15, 2020

Split the parts that set the ProviderID on the Machine and Node. These
happen at different times (at the end of the Provisioning phase and at
the end of the Provisioned phase, respectively).
This reduces unnecessary writes to the Node spec on every Update().
Use the ProviderID to get the Host
once it is available (after the Machine reaches the Provisioned phase).

To ensure we don't mistake a different deployment to the same Host (or a
replacement Host with the same name) for our original one, only return
true from Exists() if this Machine is still set as the Host's consumer.

In the Provisioning phase (and for Machines created with an older
version of CAPBM), fall back to reading from the annotation as before.

Fixes openshift#104
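The consumer-reference check described above can be sketched roughly as follows. The types here are minimal stand-ins for illustration, not the real machine-api or baremetal-operator structs:

```go
package main

import "fmt"

// Minimal stand-ins for the real API types (illustrative only).
type ConsumerRef struct {
	Kind, Namespace, Name string
}

type Host struct {
	ConsumerRef *ConsumerRef
}

type Machine struct {
	Namespace, Name string
}

// consumes reports whether the Host still records this Machine as its
// consumer. A replacement Host with the same name, or a Host claimed by
// something else, fails this check, so Exists() would return false for it.
func consumes(h *Host, m *Machine) bool {
	if h == nil || h.ConsumerRef == nil {
		return false
	}
	ref := h.ConsumerRef
	return ref.Kind == "Machine" &&
		ref.Namespace == m.Namespace &&
		ref.Name == m.Name
}

func main() {
	m := &Machine{Namespace: "openshift-machine-api", Name: "worker-0"}
	ours := &Host{ConsumerRef: &ConsumerRef{"Machine", "openshift-machine-api", "worker-0"}}
	other := &Host{ConsumerRef: &ConsumerRef{"Machine", "openshift-machine-api", "worker-1"}}
	fmt.Println(consumes(ours, m), consumes(other, m)) // prints "true false"
}
```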
When creating a Machine, make all changes to the BareMetalHost object in
a single method. Avoid unnecessary writes by not doing an API update
when no changes are required.
Create a new method to remove both the consumer reference and our
finalizer from a Host.

This attempts to do the Right Thing if we find that this Machine no
longer owns the Host: do not clear the consumer reference if something
else is holding it, but still remove our finalizer unless there is a new
consumer that is also a Machine.
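The release rules above can be expressed as a small decision function. This is a sketch with hypothetical types and a made-up `releasePlan` helper, not the actual method added by this PR:

```go
package main

import "fmt"

// Minimal stand-ins for the real API types (illustrative only).
type ConsumerRef struct{ Kind, Namespace, Name string }

type Host struct{ ConsumerRef *ConsumerRef }

type Machine struct{ Namespace, Name string }

// releasePlan decides what to strip from the Host when this Machine goes
// away: clear the consumer reference only if it is still ours, and remove
// our finalizer unless the new consumer is also a Machine (whose own
// delete flow still wants the finalizer in place).
func releasePlan(h *Host, m *Machine) (clearConsumer, removeFinalizer bool) {
	ref := h.ConsumerRef
	ours := ref != nil && ref.Kind == "Machine" &&
		ref.Namespace == m.Namespace && ref.Name == m.Name
	if ours {
		return true, true
	}
	otherMachine := ref != nil && ref.Kind == "Machine"
	return false, !otherMachine
}

func main() {
	m := &Machine{Namespace: "ns", Name: "worker-0"}
	fmt.Println(releasePlan(&Host{ConsumerRef: &ConsumerRef{"Machine", "ns", "worker-0"}}, m)) // true true
	fmt.Println(releasePlan(&Host{ConsumerRef: &ConsumerRef{"Machine", "ns", "worker-1"}}, m)) // false false
	fmt.Println(releasePlan(&Host{ConsumerRef: &ConsumerRef{"Other", "ns", "x"}}, m))          // false true
}
```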
Allow the actuator itself to converge on its steady state before adding
the remediation annotation. This means the test code does not need to be
modified every time changes are made to how the actuator stores data.
The Create() method of the actuator is only called while the Machine is
in the Provisioning phase and Exists() returns false. Currently,
Exists() will start returning true as soon as the host annotation is
present on the Machine, so there is little point doing anything in
Create() after this.

There is no way for Create() to be called after the Machine is linked to
a Node, so there is no point calling handleNodeFinalizer().

Calling updateMachineStatus() actually moves the Machine from the
Provisioning to Provisioned phase; were Exists() to return false in this
phase the Machine would be marked as failed, so it is safer not to call
this lest the Exists() implementation get out of sync.
Call things in the order that we normally expect to have to deal with
them. Requeue after writes so that we are not setting ourselves up for
guaranteed conflicts with later writes.
The ErrorMessage field in the Status is largely managed by the Machine
controller itself, and is cleared when transitioning between phases:

->Provisioning->Provisioned->Running->Deleting
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Sep 15, 2020
@openshift-ci-robot

@zaneb: This pull request references Bugzilla bug 1868104, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "4.7.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1868104: Make use of errors and Failed phase to handle failed machines

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Sep 15, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2020
@zaneb (Member, Author) commented Sep 15, 2020

/hold
This depends on openshift/machine-api-operator#688 to avoid creating Failed hosts that are undeletable by MachineHealthCheck.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 15, 2020
If there are no Hosts available to provision a Machine, we should flag
an InsufficientResourcesMachineError, which "generally refers to ...
running out of physical machines in an on-premise[s] environment".

We continue to retry looking for an available Host at intervals, but in
the meantime the Machine will be in an error state. Once the Machine is
able to acquire a Host for provisioning, clear the error if the Machine
remains in the Provisioning phase (upon reaching the Provisioned phase,
the machine controller will automatically clear it).

The presence of an error means that the MachineSet will target the
Machine for deletion on scale down, in preference to other Machines that
might be happily provisioned.

Fixes openshift#103
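The host-selection flow described above can be sketched as returning an insufficient-resources error when no Host is claimable. The error value and `chooseHost` helper here are illustrative stand-ins; the real error type comes from the machine-api errors package:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the machine-api InsufficientResourcesMachineError, which
// "generally refers to ... running out of physical machines in an
// on-premise[s] environment".
var errInsufficientResources = errors.New("no available BareMetalHost to provision this Machine")

// chooseHost returns the first claimable Host, or the error that puts the
// Machine into an error state until a Host frees up. The controller keeps
// retrying at intervals; once a Host is acquired, the error is cleared.
func chooseHost(available []string) (string, error) {
	if len(available) == 0 {
		return "", errInsufficientResources
	}
	return available[0], nil
}

func main() {
	if _, err := chooseHost(nil); errors.Is(err, errInsufficientResources) {
		fmt.Println("Machine flagged with an insufficient-resources error")
	}
	host, _ := chooseHost([]string{"host-2"})
	fmt.Println("acquired", host)
}
```

Because the MachineSet prefers Machines with errors when scaling down, a Machine stuck in this state is the first to go.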
Both calls to this now need to requeue if a change is made, so just
return the RequeueAfterError from inside the method.
@zaneb (Member, Author) commented Sep 15, 2020

/retest

@dhellmann left a comment

Nice work breaking this down into reviewable commits.

We currently reference the Host only by namespace and name, but this is
insufficient to uniquely identify a provider: if the BareMetalHost were
deleted and another created with the same name then the namespace+name
would match even though it could now refer to a completely different
host (and certainly not one provisioned by this Machine object).

Including the UID in the ProviderID ensures that even if the Machine
were to remain around in the Failed phase, the nodelink controller could
still link a new Machine using the replacement Host (of the same name)
to its Node as it will have a unique ProviderID.

We do not retroactively update existing Machines where the ProviderID
does not include a UID, because the ProviderID of a Node is immutable,
so we would be unable to update a linked Node to match.

In theory, the ProviderID should refer to an individual deployment to a
Host, not just an individual host. However, for now the BareMetalHost
API does not provide an ID for this (in a pinch we could use the
timestamp of when provisioning completed, but that is not its intended
purpose). The mechanism of claiming the Host using the Consumer
reference should ensure that two Machines can't be associated to the
same Host object.
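Incorporating the UID can be sketched as follows. The exact `baremetalhost://` URI scheme shown here is an assumption for illustration; the point is that a replacement Host with the same namespace and name still yields a different ProviderID because its UID differs:

```go
package main

import "fmt"

// providerID builds a ProviderID that is unique per Host *object*, not
// just per host name, by including the BareMetalHost's UID. The scheme
// string is illustrative, not necessarily the format CAPBM uses.
func providerID(namespace, name, uid string) string {
	return fmt.Sprintf("baremetalhost:///%s/%s/%s", namespace, name, uid)
}

func main() {
	old := providerID("openshift-machine-api", "host-0", "1111-aaaa")
	replacement := providerID("openshift-machine-api", "host-0", "2222-bbbb")
	// Same namespace and name, but distinct IDs thanks to the UID.
	fmt.Println(old != replacement) // prints "true"
}
```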
If a Host provisioned by a Machine is deleted, its IP addresses are no
longer valid as they may be assigned to a new Host. In this case, remove
the IP addresses from the Machine. This means that if a new Machine is
created on a new Host that has the same IP address, the nodelink
controller will see only the new Machine associated with the IP and will
thus be able to link it to the Node.

Two Machines sharing a ProviderID would also prevent a Node being
linked, however the incorporation of the Host UID into the ProviderID
ensures that this will change even if the new Host is recreated with the
same name as the old one.

This must be done in the Exists() method, because the Update() method is
not called again once Exists() returns false.
If something external deprovisions the Host, ensure the Machine
transitions to the Failed phase.

We do this by returning false from Exists() if the Host returns to the
Ready state. Also return false while the Host is in the Provisioning
state, so that the Machine remains in the Provisioning phase (rather
than moving to Provisioned as soon as a Host is selected) - *provided*
that the ProviderID is not set yet.
Deleting a Host being used by a Machine is one instance of the problem
of a provider being externally deleted. Responding by deleting the
Machine object destroys the evidence, and is inconsistent with other
CAPI providers.

Instead, ensure the Machine transitions to the Failed phase if the Host
is deleted, just as we do if it is deprovisioned.

We do this by returning false from Exists() if the Host is not in the
Provisioned state. If something starts deprovisioning the Host
(including if the Host is deleted), the Machine will move to the Failed
phase.

This means we no longer need to put a finalizer on the Host to ensure we
have time to initiate the Machine delete before the Host goes away. We
still remove any existing finalizer, as it may have been added by a
previous version of CAPBM.

Fixes openshift#105
Fixes openshift#56
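The Exists() rule described in the last two commit messages reduces to a check on the Host's provisioning state. This is a simplified sketch; `ProvisioningState` and `hostExists` are stand-ins for the real BareMetalHost status field and actuator method:

```go
package main

import "fmt"

// ProvisioningState is a stand-in for the BareMetalHost status field.
type ProvisioningState string

const (
	StateProvisioning ProvisioningState = "provisioning"
	StateProvisioned  ProvisioningState = "provisioned"
	StateReady        ProvisioningState = "ready"
	StateDeleting     ProvisioningState = "deleting"
)

// hostExists mirrors the rule above: only a Host in the Provisioned state
// counts as existing. A Host deprovisioned back to Ready, or deleted
// entirely, makes Exists() return false and pushes the Machine into the
// Failed phase; a Host still provisioning keeps the Machine in the
// Provisioning phase.
func hostExists(hostFound bool, state ProvisioningState) bool {
	return hostFound && state == StateProvisioned
}

func main() {
	fmt.Println(hostExists(true, StateProvisioned))  // running normally
	fmt.Println(hostExists(true, StateProvisioning)) // Machine stays Provisioning
	fmt.Println(hostExists(true, StateReady))        // externally deprovisioned -> Failed
	fmt.Println(hostExists(false, StateDeleting))    // Host deleted -> Failed
}
```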
@dhellmann

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2020
@zaneb (Member, Author) commented Oct 13, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 13, 2020
@openshift-ci-robot

@zaneb: This pull request references Bugzilla bug 1868104, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 13, 2020
@dhellmann

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2020
@zaneb (Member, Author) commented Dec 3, 2020

openshift/machine-api-operator#688 has merged so we are free to merge this now.
/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
@dhellmann

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 9ddb843 into openshift:master Dec 3, 2020
@openshift-ci-robot

@zaneb: All pull requests linked via external trackers have merged:

Bugzilla bug 1868104 has been moved to the MODIFIED state.

In response to this:

Bug 1868104: Make use of errors and Failed phase to handle failed machines

honza pushed a commit to honza/cluster-api-provider-baremetal that referenced this pull request Feb 7, 2022
🐛 Force deletion of BMH attributes when deleting m3m