Bug 1868104: Make use of errors and Failed phase to handle failed machines #113
Conversation
Split the parts that set the ProviderID on the Machine and Node. These happen at different times (at the end of the Provisioning phase and at the end of the Provisioned phase, respectively).
This reduces unnecessary writes to the Node spec on every Update().
Use the ProviderID to get the Host once it is available (after the Machine reaches the Provisioned phase). To ensure we don't mistake a different deployment to the same Host (or a replacement Host with the same name) for our original one, only return true from Exists() if this Machine is still set as the Host's consumer. In the Provisioning phase (and for Machines created with an older version of CAPBM), fall back to reading from the annotation as before. Fixes openshift#104
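The consumer check described above can be sketched as follows. These are simplified, hypothetical types; the real code uses `corev1.ObjectReference` against the BareMetalHost CRD:

```go
package main

import "fmt"

// Simplified stand-ins for the real API objects (assumptions, not the
// actual CAPBM types).
type ObjectRef struct{ Kind, Namespace, Name string }
type Host struct{ ConsumerRef *ObjectRef }
type Machine struct{ Namespace, Name string }

// isConsumer reports whether this Machine is still recorded as the
// Host's consumer, so Exists() does not mistake a replacement
// deployment (or a replacement Host with the same name) for the
// original one.
func isConsumer(h Host, m Machine) bool {
	if h.ConsumerRef == nil {
		return false
	}
	return h.ConsumerRef.Kind == "Machine" &&
		h.ConsumerRef.Namespace == m.Namespace &&
		h.ConsumerRef.Name == m.Name
}

func main() {
	m := Machine{Namespace: "openshift-machine-api", Name: "worker-0"}
	claimed := Host{ConsumerRef: &ObjectRef{Kind: "Machine", Namespace: m.Namespace, Name: m.Name}}
	stolen := Host{ConsumerRef: &ObjectRef{Kind: "Machine", Namespace: m.Namespace, Name: "worker-1"}}
	fmt.Println(isConsumer(claimed, m), isConsumer(stolen, m))
}
```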
When creating a Machine, make all changes to the BareMetalHost object in a single method. Avoid unnecessary writes by not doing an API update when no changes are required.
Create a new method to remove both the consumer reference and our finalizer from a Host. This attempts to do the Right Thing if we find that this Machine no longer owns the Host: do not clear the consumer reference if something else is holding it, but still remove our finalizer unless there is a new consumer that is also a Machine.
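The "Right Thing" decision table above can be expressed as a small pure function. This is a sketch with hypothetical simplified types, not the actual method signature:

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type ObjectRef struct{ Kind, Namespace, Name string }
type Host struct{ ConsumerRef *ObjectRef }
type Machine struct{ Namespace, Name string }

// release decides what the delete path should strip from the Host:
// clear the consumer reference only if we still own it, and remove our
// finalizer unless a different Machine has since claimed the Host.
func release(h Host, m Machine) (clearConsumer, removeFinalizer bool) {
	switch {
	case h.ConsumerRef == nil:
		return false, true // nothing holds the Host; just drop our finalizer
	case h.ConsumerRef.Kind == "Machine" &&
		h.ConsumerRef.Namespace == m.Namespace &&
		h.ConsumerRef.Name == m.Name:
		return true, true // still ours: clear both
	case h.ConsumerRef.Kind == "Machine":
		return false, false // a new Machine owns it; leave the finalizer to it
	default:
		return false, true // held by something that is not a Machine
	}
}

func main() {
	m := Machine{Namespace: "ns", Name: "worker-0"}
	fmt.Println(release(Host{ConsumerRef: &ObjectRef{Kind: "Machine", Namespace: "ns", Name: "worker-0"}}, m))
	fmt.Println(release(Host{ConsumerRef: &ObjectRef{Kind: "Machine", Namespace: "ns", Name: "worker-1"}}, m))
}
```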
Allow the actuator itself to converge on its steady state before adding the remediation annotation. This means the test code does not need to be modified every time changes are made to how the actuator stores data.
The Create() method of the actuator is only called while the Machine is in the Provisioning phase and Exists() returns false. Currently, Exists() will start returning true as soon as the host annotation is present on the Machine, so there is little point doing anything in Create() after this. There is no way for Create() to be called after the Machine is linked to a Node, so no point calling handleNodeFinalizer(). Calling updateMachineStatus() actually moves the Machine from the Provisioning to Provisioned phase; were Exists() to return false in this phase the Machine would be marked as failed, so it is safer not to call this lest the Exists() implementation get out of sync.
Call things in the order that we normally expect to have to deal with them. Requeue after writes so that we are not setting ourselves up for guaranteed conflicts with later writes.
The ErrorMessage field in the Status is largely managed by the Machine controller itself, and is cleared when transitioning between phases: ->Provisioning->Provisioned->Running->Deleting
@zaneb: This pull request references Bugzilla bug 1868104, which is invalid:
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/hold
If there are no Hosts available to provision a Machine, we should flag an InsufficientResourcesMachineError, which "generally refers to ... running out of physical machines in an on-premise[s] environment". We continue to retry looking for an available Host at intervals, but in the meantime the Machine will be in an error state. Once the Machine is able to acquire a Host for provisioning, clear the error if the Machine remains in the Provisioning phase (upon reaching the Provisioned phase, the machine controller will automatically clear it). The presence of an error means that the MachineSet will target the Machine for deletion on scale down, in preference to other Machines that might be happily provisioned. Fixes openshift#103
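The flag-and-clear cycle described above can be sketched like this. The struct and the reason string are assumptions standing in for the real Machine API status fields:

```go
package main

import "fmt"

// Hypothetical, trimmed-down MachineStatus; the real Machine API uses
// typed ErrorReason/ErrorMessage pointer fields on the Machine status.
type MachineStatus struct {
	ErrorReason  *string
	ErrorMessage *string
}

// The reason string is an assumption based on the quoted error class.
const insufficientResources = "InsufficientResources"

// markNoHostsAvailable flags the Machine so that a MachineSet scaling
// down will prefer it over happily provisioned Machines.
func markNoHostsAvailable(s *MachineStatus) {
	reason := insufficientResources
	msg := "no available BareMetalHost found"
	s.ErrorReason, s.ErrorMessage = &reason, &msg
}

// clearError is called once a Host is acquired while the Machine is
// still in the Provisioning phase; on reaching Provisioned, the machine
// controller clears the fields itself.
func clearError(s *MachineStatus) {
	s.ErrorReason, s.ErrorMessage = nil, nil
}

func main() {
	var s MachineStatus
	markNoHostsAvailable(&s)
	fmt.Println(*s.ErrorReason)
	clearError(&s)
	fmt.Println(s.ErrorReason == nil, s.ErrorMessage == nil)
}
```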
Both calls to this now need to requeue if a change is made, so just return the RequeueAfterError from inside the method.
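The write-then-requeue pattern can be sketched as below. `RequeueAfterError` here is a local mock of the machine controller's requeue signal, and `ensureAnnotation` is a hypothetical helper:

```go
package main

import (
	"fmt"
	"time"
)

// RequeueAfterError mimics the machine controller's requeue signal
// (the real type lives in the cluster-api machine actuator packages).
type RequeueAfterError struct{ RequeueAfter time.Duration }

func (e *RequeueAfterError) Error() string {
	return fmt.Sprintf("requeue in %s", e.RequeueAfter)
}

// ensureAnnotation returns the RequeueAfterError from inside the method
// whenever it had to write, so both callers get a fresh read on the
// next reconcile instead of a guaranteed conflict with later writes.
func ensureAnnotation(ann map[string]string, key, val string) error {
	if ann[key] == val {
		return nil // no change needed, no write, no requeue
	}
	ann[key] = val // stands in for the API update
	return &RequeueAfterError{RequeueAfter: 30 * time.Second}
}

func main() {
	ann := map[string]string{}
	fmt.Println(ensureAnnotation(ann, "metal3.io/BareMetalHost", "ns/host-0")) // write: requeue
	fmt.Println(ensureAnnotation(ann, "metal3.io/BareMetalHost", "ns/host-0")) // no-op: <nil>
}
```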
/retest
Nice work breaking this down into reviewable commits.
We currently reference the Host only by namespace and name, but this is insufficient to uniquely identify a provider: if the BareMetalHost were deleted and another created with the same name then the namespace+name would match even though it could now refer to a completely different host (and certainly not one provisioned by this Machine object). Including the UID in the ProviderID ensures that even if the Machine were to remain around in the Failed phase, the nodelink controller could still link a new Machine using the replacement Host (of the same name) to its Node, as it will have a unique ProviderID.

We do not retroactively update existing Machines where the ProviderID does not include a UID, because the ProviderID of a Node is immutable, so we would be unable to update a linked Node to match.

In theory, the ProviderID should refer to an individual deployment to a Host, not just an individual host. However, for now the BareMetalHost API does not provide an ID for this (in a pinch we could use the timestamp of when provisioning completed, but that's not an intended purpose). The mechanism of claiming the Host using the Consumer reference should ensure that two Machines can't be associated with the same Host object.
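The effect of embedding the UID can be sketched as below. The scheme name and layout are illustrative assumptions; check the actual CAPBM code for the exact format it emits:

```go
package main

import "fmt"

// providerID builds an ID that is unique per BareMetalHost object, not
// just per namespace/name, by embedding the Host's UID. The
// "baremetalhost:///" scheme here is an assumption for illustration.
func providerID(namespace, name, uid string) string {
	return fmt.Sprintf("baremetalhost:///%s/%s/%s", namespace, name, uid)
}

func main() {
	old := providerID("openshift-machine-api", "host-0", "11111111-aaaa")
	// A recreated Host keeps the namespace and name but gets a new UID,
	// so the replacement's ProviderID differs from the original's.
	replacement := providerID("openshift-machine-api", "host-0", "22222222-bbbb")
	fmt.Println(old != replacement)
}
```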
If a Host provisioned by a Machine is deleted, its IP addresses are no longer valid as they may be assigned to a new Host. In this case, remove the IP addresses from the Machine. This means that if a new Machine is created on a new Host that has the same IP address, the nodelink controller will see only the new Machine associated with the IP and will thus be able to link it to the Node. Two Machines sharing a ProviderID would also prevent a Node being linked, however the incorporation of the Host UID into the ProviderID ensures that this will change even if the new Host is recreated with the same name as the old one. This must be done in the Exists() method, because the Update() method is not called again once Exists() returns false.
If something external deprovisions the Host, ensure the Machine transitions to the Failed phase. We do this by returning false from Exists() if the Host returns to the Ready state. Also return false while the Host is in the Provisioning state, so that the Machine remains in the Provisioning phase (rather than moving to Provisioned as soon as a Host is selected) - *provided* that the ProviderID is not set yet.
Deleting a Host being used by a Machine is one instance of the problem of a provider being externally deleted. Responding by deleting the Machine object destroys the evidence, and is inconsistent with other CAPI providers. Instead, ensure the Machine transitions to the Failed phase if the Host is deleted, just as we do if it is deprovisioned. We do this by returning false from Exists() if the Host is not in the Provisioned state. If something starts deprovisioning the Host (including if the Host is deleted), the Machine will move to the Failed phase. This means we no longer need to put a finalizer on the Host to ensure we have time to initiate the Machine delete before the Host goes away. We still remove any existing finalizer, as it may have been added by a previous version of CAPBM. Fixes openshift#105 Fixes openshift#56
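The resulting Exists() rule reduces to a simple predicate on the Host's provisioning state. This sketch omits the ProviderID caveat from the previous commit and uses state names modelled on the BareMetalHost provisioning states:

```go
package main

import "fmt"

// exists mirrors the rule described above: the Machine "exists" only
// while its Host is present and fully provisioned. A missing (deleted)
// Host or any other provisioning state makes Exists() return false,
// which sends the Machine to the Failed phase once it has left
// Provisioning.
func exists(hostFound bool, state string) bool {
	return hostFound && state == "provisioned"
}

func main() {
	fmt.Println(exists(true, "provisioned"))  // healthy
	fmt.Println(exists(true, "provisioning")) // still coming up: stay in Provisioning phase
	fmt.Println(exists(true, "ready"))        // deprovisioned externally: fail the Machine
	fmt.Println(exists(false, ""))            // Host deleted: fail the Machine
}
```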
/lgtm
/bugzilla refresh
@zaneb: This pull request references Bugzilla bug 1868104, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug.
In response to this:
/lgtm
openshift/machine-api-operator#688 has merged so we are free to merge this now.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dhellmann, zaneb
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
@zaneb: All pull requests linked via external trackers have merged: Bugzilla bug 1868104 has been moved to the MODIFIED state.
- Set `InsufficientResourcesMachineError` on Machines that are searching (unsuccessfully) for an available host. This ensures that such Machines are the first victims on scale down - Fixes "No indication that a Machine can't acquire a Host" #103.
- Move to the `Failed` phase if the Host is deprovisioned - Fixes "Deleting a Host doesn't immediately affect Machine" #56.
- Remain in the `Provisioning` phase while the Host is provisioning.