
Bug 1868104: Make use of errors and Failed phase to handle failed machines #113

Merged
17 commits merged into openshift:master on Dec 3, 2020

Conversation

@zaneb (Member) commented Sep 15, 2020

Split the parts that set the ProviderID on the Machine and Node. These
happen at different times (at the end of the Provisioning phase and at
the end of the Provisioned phase, respectively).
This reduces unnecessary writes to the Node spec on every Update().
Use the ProviderID to get the Host
once it is available (after the Machine reaches the Provisioned phase).

To ensure we don't mistake a different deployment to the same Host (or a
replacement Host with the same name) for our original one, only return
true from Exists() if this Machine is still set as the Host's consumer.

In the Provisioning phase (and for Machines created with an older
version of CAPBM), fall back to reading from the annotation as before.

Fixes openshift#104
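The consumer-reference check described above can be sketched roughly as follows. The types here are minimal stand-ins for illustration, not the real machine-api or baremetal-operator structs:

```go
package main

import "fmt"

// Minimal stand-ins for the real API types (illustrative only).
type ConsumerRef struct {
	Kind, Namespace, Name string
}

type Host struct {
	ConsumerRef *ConsumerRef
}

type Machine struct {
	Namespace, Name string
}

// consumes reports whether the Host still records this Machine as its
// consumer. A replacement Host with the same name, or a Host claimed by
// something else, fails this check, so Exists() would return false for it.
func consumes(h *Host, m *Machine) bool {
	if h == nil || h.ConsumerRef == nil {
		return false
	}
	ref := h.ConsumerRef
	return ref.Kind == "Machine" &&
		ref.Namespace == m.Namespace &&
		ref.Name == m.Name
}

func main() {
	m := &Machine{Namespace: "openshift-machine-api", Name: "worker-0"}
	ours := &Host{ConsumerRef: &ConsumerRef{"Machine", "openshift-machine-api", "worker-0"}}
	other := &Host{ConsumerRef: &ConsumerRef{"Machine", "openshift-machine-api", "worker-1"}}
	fmt.Println(consumes(ours, m), consumes(other, m)) // prints "true false"
}
```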
When creating a Machine, make all changes to the BareMetalHost object in
a single method. Avoid unnecessary writes by not doing an API update
when no changes are required.
Create a new method to remove both the consumer reference and our
finalizer from a Host.

This attempts to do the Right Thing if we find that this Machine no
longer owns the Host: do not clear the consumer reference if something
else is holding it, but still remove our finalizer unless there is a new
consumer that is also a Machine.
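The release rules above can be expressed as a small decision function. This is a sketch with hypothetical types and a made-up `releasePlan` helper, not the actual method added by this PR:

```go
package main

import "fmt"

// Minimal stand-ins for the real API types (illustrative only).
type ConsumerRef struct{ Kind, Namespace, Name string }

type Host struct{ ConsumerRef *ConsumerRef }

type Machine struct{ Namespace, Name string }

// releasePlan decides what to strip from the Host when this Machine goes
// away: clear the consumer reference only if it is still ours, and remove
// our finalizer unless the new consumer is also a Machine (whose own
// delete flow still wants the finalizer in place).
func releasePlan(h *Host, m *Machine) (clearConsumer, removeFinalizer bool) {
	ref := h.ConsumerRef
	ours := ref != nil && ref.Kind == "Machine" &&
		ref.Namespace == m.Namespace && ref.Name == m.Name
	if ours {
		return true, true
	}
	otherMachine := ref != nil && ref.Kind == "Machine"
	return false, !otherMachine
}

func main() {
	m := &Machine{Namespace: "ns", Name: "worker-0"}
	fmt.Println(releasePlan(&Host{ConsumerRef: &ConsumerRef{"Machine", "ns", "worker-0"}}, m)) // true true
	fmt.Println(releasePlan(&Host{ConsumerRef: &ConsumerRef{"Machine", "ns", "worker-1"}}, m)) // false false
	fmt.Println(releasePlan(&Host{ConsumerRef: &ConsumerRef{"Other", "ns", "x"}}, m))          // false true
}
```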
Allow the actuator itself to converge on its steady state before adding
the remediation annotation. This means the test code does not need to be
modified every time changes are made to how the actuator stores data.
The Create() method of the actuator is only called while the Machine is
in the Provisioning phase and Exists() returns false. Currently,
Exists() will start returning true as soon as the host annotation is
present on the Machine, so there is little point doing anything in
Create() after this.

There is no way for Create() to be called after the Machine is linked to
a Node, so there is no point calling handleNodeFinalizer().

Calling updateMachineStatus() actually moves the Machine from the
Provisioning to Provisioned phase; were Exists() to return false in this
phase the Machine would be marked as failed, so it is safer not to call
this lest the Exists() implementation get out of sync.
Call things in the order that we normally expect to have to deal with
them. Requeue after writes so that we are not setting ourselves up for
guaranteed conflicts with later writes.
The ErrorMessage field in the Status is largely managed by the Machine
controller itself, and is cleared when transitioning between phases:

->Provisioning->Provisioned->Running->Deleting
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Sep 15, 2020
@openshift-ci-robot

@zaneb: This pull request references Bugzilla bug 1868104, which is invalid:

  • expected the bug to target the "4.6.0" release, but it targets "4.7.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1868104: Make use of errors and Failed phase to handle failed machines

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Sep 15, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 15, 2020
@zaneb (Member, Author) commented Sep 15, 2020

/hold
This depends on openshift/machine-api-operator#688 to avoid creating Failed hosts that are undeletable by MachineHealthCheck.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 15, 2020
If there are no Hosts available to provision a Machine, we should flag
an InsufficientResourcesMachineError, which "generally refers to ...
running out of physical machines in an on-premise[s] environment".

We continue to retry looking for an available Host at intervals, but in
the meantime the Machine will be in an error state. Once the Machine is
able to acquire a Host for provisioning, clear the error if the Machine
remains in the Provisioning phase (upon reaching the Provisioned phase,
the machine controller will automatically clear it).

The presence of an error means that the MachineSet will target the
Machine for deletion on scale down, in preference to other Machines that
might be happily provisioned.

Fixes openshift#103
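The host-selection flow described above can be sketched as returning an insufficient-resources error when no Host is claimable. The error value and `chooseHost` helper here are illustrative stand-ins; the real error type comes from the machine-api errors package:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the machine-api InsufficientResourcesMachineError, which
// "generally refers to ... running out of physical machines in an
// on-premise[s] environment".
var errInsufficientResources = errors.New("no available BareMetalHost to provision this Machine")

// chooseHost returns the first claimable Host, or the error that puts the
// Machine into an error state until a Host frees up. The controller keeps
// retrying at intervals; once a Host is acquired, the error is cleared.
func chooseHost(available []string) (string, error) {
	if len(available) == 0 {
		return "", errInsufficientResources
	}
	return available[0], nil
}

func main() {
	if _, err := chooseHost(nil); errors.Is(err, errInsufficientResources) {
		fmt.Println("Machine flagged with an insufficient-resources error")
	}
	host, _ := chooseHost([]string{"host-2"})
	fmt.Println("acquired", host)
}
```

Because the MachineSet prefers Machines with errors when scaling down, a Machine stuck in this state is the first to go.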
Both calls to this now need to requeue if a change is made, so just
return the RequeueAfterError from inside the method.
@zaneb (Member, Author) commented Sep 15, 2020

/retest

@dhellmann left a comment

Nice work breaking this down into reviewable commits.

We currently reference the Host only by namespace and name, but this is
insufficient to uniquely identify a provider: if the BareMetalHost were
deleted and another created with the same name then the namespace+name
would match even though it could now refer to a completely different
host (and certainly not one provisioned by this Machine object).

Including the UID in the ProviderID ensures that even if the Machine
were to remain around in the Failed phase, the nodelink controller could
still link a new Machine using the replacement Host (of the same name)
to its Node as it will have a unique ProviderID.

We do not retroactively update existing Machines where the ProviderID
does not include a UID, because the ProviderID of a Node is immutable,
so we would be unable to update a linked Node to match.

In theory, the ProviderID should refer to an individual deployment to a
Host, not just an individual host. However, for now the BareMetalHost
API does not provide an ID for this (in a pinch we could use the
timestamp of when provisioning completed, but that is not its intended
purpose). The mechanism of claiming the Host using the Consumer
reference should ensure that two Machines can't be associated to the
same Host object.
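Incorporating the UID can be sketched as follows. The exact `baremetalhost://` URI scheme shown here is an assumption for illustration; the point is that a replacement Host with the same namespace and name still yields a different ProviderID because its UID differs:

```go
package main

import "fmt"

// providerID builds a ProviderID that is unique per Host *object*, not
// just per host name, by including the BareMetalHost's UID. The scheme
// string is illustrative, not necessarily the format CAPBM uses.
func providerID(namespace, name, uid string) string {
	return fmt.Sprintf("baremetalhost:///%s/%s/%s", namespace, name, uid)
}

func main() {
	old := providerID("openshift-machine-api", "host-0", "1111-aaaa")
	replacement := providerID("openshift-machine-api", "host-0", "2222-bbbb")
	// Same namespace and name, but distinct IDs thanks to the UID.
	fmt.Println(old != replacement) // prints "true"
}
```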
If a Host provisioned by a Machine is deleted, its IP addresses are no
longer valid as they may be assigned to a new Host. In this case, remove
the IP addresses from the Machine. This means that if a new Machine is
created on a new Host that has the same IP address, the nodelink
controller will see only the new Machine associated with the IP and will
thus be able to link it to the Node.

Two Machines sharing a ProviderID would also prevent a Node being
linked, however the incorporation of the Host UID into the ProviderID
ensures that this will change even if the new Host is recreated with the
same name as the old one.

This must be done in the Exists() method, because the Update() method is
not called again once Exists() returns false.
If something external deprovisions the Host, ensure the Machine
transitions to the Failed phase.

We do this by returning false from Exists() if the Host returns to the
Ready state. Also return false while the Host is in the Provisioning
state, so that the Machine remains in the Provisioning phase (rather
than moving to Provisioned as soon as a Host is selected) - *provided*
that the ProviderID is not set yet.
Deleting a Host being used by a Machine is one instance of the problem
of a provider being externally deleted. Responding by deleting the
Machine object destroys the evidence, and is inconsistent with other
CAPI providers.

Instead, ensure the Machine transitions to the Failed phase if the Host
is deleted, just as we do if it is deprovisioned.

We do this by returning false from Exists() if the Host is not in the
Provisioned state. If something starts deprovisioning the Host
(including if the Host is deleted), the Machine will move to the Failed
phase.

This means we no longer need to put a finalizer on the Host to ensure we
have time to initiate the Machine delete before the Host goes away. We
still remove any existing finalizer, as it may have been added by a
previous version of CAPBM.

Fixes openshift#105
Fixes openshift#56
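The Exists() rule described in the last two commit messages reduces to a check on the Host's provisioning state. This is a simplified sketch; `ProvisioningState` and `hostExists` are stand-ins for the real BareMetalHost status field and actuator method:

```go
package main

import "fmt"

// ProvisioningState is a stand-in for the BareMetalHost status field.
type ProvisioningState string

const (
	StateProvisioning ProvisioningState = "provisioning"
	StateProvisioned  ProvisioningState = "provisioned"
	StateReady        ProvisioningState = "ready"
	StateDeleting     ProvisioningState = "deleting"
)

// hostExists mirrors the rule above: only a Host in the Provisioned state
// counts as existing. A Host deprovisioned back to Ready, or deleted
// entirely, makes Exists() return false and pushes the Machine into the
// Failed phase; a Host still provisioning keeps the Machine in the
// Provisioning phase.
func hostExists(hostFound bool, state ProvisioningState) bool {
	return hostFound && state == StateProvisioned
}

func main() {
	fmt.Println(hostExists(true, StateProvisioned))  // running normally
	fmt.Println(hostExists(true, StateProvisioning)) // Machine stays Provisioning
	fmt.Println(hostExists(true, StateReady))        // externally deprovisioned -> Failed
	fmt.Println(hostExists(false, StateDeleting))    // Host deleted -> Failed
}
```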
@dhellmann

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2020
@zaneb (Member, Author) commented Oct 13, 2020

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Oct 13, 2020
@openshift-ci-robot

@zaneb: This pull request references Bugzilla bug 1868104, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST)

In response to this:

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Nov 13, 2020
@dhellmann

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Nov 16, 2020
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2020
@zaneb (Member, Author) commented Dec 3, 2020

openshift/machine-api-operator#688 has merged so we are free to merge this now.
/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
@dhellmann

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 3, 2020
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 9ddb843 into openshift:master Dec 3, 2020
@openshift-ci-robot

@zaneb: All pull requests linked via external trackers have merged:

Bugzilla bug 1868104 has been moved to the MODIFIED state.

In response to this:

Bug 1868104: Make use of errors and Failed phase to handle failed machines

honza pushed a commit to honza/cluster-api-provider-baremetal that referenced this pull request Feb 7, 2022
🐛 Force deletion of BMH attributes when deleting m3m