No indication that a Machine can't acquire a Host #103

Closed
zaneb opened this issue Aug 27, 2020 · 0 comments · Fixed by #113

zaneb commented Aug 27, 2020

If we create a Machine and there are no Hosts available to provision, the actuator just leaves the Machine in the "Provisioning" phase and keeps looking for a Host. Currently nothing indicates that no Host is available, so the Machine appears to be provisioning normally even though it cannot leave the Provisioning phase until a Host turns up.

This means that if a user e.g. accidentally scales up their MachineSet to a size larger than the number of actual Hosts, scaling down again doesn't just delete the orphaned Machines. By default it picks Machines to delete at random, including provisioned Machines running real workloads. If you scale up by a large enough amount, scaling back down will with high probability delete all of your workloads.

Instead, we should flag an error on the Machine, and clear it once a suitable Host is found. This will make the Machine among the first candidates for deletion when scaling down.

zaneb added a commit to zaneb/cluster-api-provider-baremetal that referenced this issue Sep 12, 2020
If there are no Hosts available to provision a Machine, we should flag
an InsufficientResourcesMachineError, which "generally refers to ...
running out of physical machines in an on-premise[s] environment".

We continue to retry looking for an available Host at intervals, but in
the meantime the Machine will be in an error state.

This means that the MachineSet will target the Machine for deletion on
scale down, in preference to other Machines that might be happily
provisioned.

Fixes openshift#103
zaneb added a commit to zaneb/cluster-api-provider-baremetal that referenced this issue Sep 12, 2020
If there are no Hosts available to provision a Machine, we should flag
an InsufficientResourcesMachineError, which "generally refers to ...
running out of physical machines in an on-premise[s] environment".

We continue to retry looking for an available Host at intervals, but in
the meantime the Machine will be in an error state. As soon as the
Machine is able to acquire a Host for provisioning, clear the error.

This means that the MachineSet will target the Machine for deletion on
scale down, in preference to other Machines that might be happily
provisioned.

Fixes openshift#103
zaneb added a commit to zaneb/cluster-api-provider-baremetal that referenced this issue Sep 14, 2020
If there are no Hosts available to provision a Machine, we should flag
an InsufficientResourcesMachineError, which "generally refers to ...
running out of physical machines in an on-premise[s] environment".

We continue to retry looking for an available Host at intervals, but in
the meantime the Machine will be in an error state. Once the Machine is
able to acquire a Host for provisioning, clear the error if the Machine
remains in the Provisioning phase (upon reaching the Provisioned phase,
the machine controller will automatically clear it).

The presence of an error means that the MachineSet will target the
Machine for deletion on scale down, in preference to other Machines that
might be happily provisioned.

Fixes openshift#103
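
The flow described in this commit message can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the provider's actual actuator: the `machine` type, `reconcile`, `retryInterval`, the error message text, and the string value of the error reason are all hypothetical; only the name `InsufficientResourcesMachineError` and the "Provisioning" phase come from the text above.

```go
package main

import (
	"fmt"
	"time"
)

const (
	phaseProvisioning = "Provisioning"
	// Named after the error quoted in the commit message; the string
	// value used here is an assumption.
	insufficientResourcesMachineError = "InsufficientResources"
	// retryInterval is a hypothetical requeue delay.
	retryInterval = 30 * time.Second
)

// machine is a hypothetical, stripped-down stand-in for a Machine.
type machine struct {
	Phase        string
	HostName     string  // empty until a Host has been acquired
	ErrorReason  *string // nil when no error is flagged
	ErrorMessage *string
}

// reconcile tries to acquire a Host for the Machine. It returns the delay
// before the next attempt, or zero when no retry is needed.
func reconcile(m *machine, availableHosts []string) time.Duration {
	if m.HostName != "" {
		return 0 // already has a Host
	}
	if len(availableHosts) == 0 {
		// No Host available: flag the error, but keep retrying.
		reason := insufficientResourcesMachineError
		msg := "no available Host found"
		m.ErrorReason, m.ErrorMessage = &reason, &msg
		return retryInterval
	}
	m.HostName = availableHosts[0]
	// Clear the error only while still in the Provisioning phase; per the
	// commit message, the machine controller clears it once Provisioned.
	if m.Phase == phaseProvisioning {
		m.ErrorReason, m.ErrorMessage = nil, nil
	}
	return 0
}

func main() {
	m := &machine{Phase: phaseProvisioning}

	delay := reconcile(m, nil)
	fmt.Println("error flagged:", m.ErrorReason != nil, "retry in:", delay)

	delay = reconcile(m, []string{"host-0"})
	fmt.Println("error flagged:", m.ErrorReason != nil, "host:", m.HostName, "retry in:", delay)
}
```

The first call flags the error and asks for a retry; the second call, made once a Host exists, acquires it and clears the error.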
zaneb added a commit to zaneb/cluster-api-provider-baremetal that referenced this issue Sep 15, 2020
If there are no Hosts available to provision a Machine, we should flag
an InsufficientResourcesMachineError, which "generally refers to ...
running out of physical machines in an on-premise[s] environment".

We continue to retry looking for an available Host at intervals, but in
the meantime the Machine will be in an error state. Once the Machine is
able to acquire a Host for provisioning, clear the error if the Machine
remains in the Provisioning phase (upon reaching the Provisioned phase,
the machine controller will automatically clear it).

The presence of an error means that the MachineSet will target the
Machine for deletion on scale down, in preference to other Machines that
might be happily provisioned.

Fixes openshift#103
zaneb changed the title from "Machines that can't find a Host appear in "Provisioning" phase" to "No indication that a Machine can't find a Host" Sep 15, 2020
zaneb changed the title from "No indication that a Machine can't find a Host" to "No indication that a Machine can't acquire a Host" Sep 15, 2020
honza pushed a commit to honza/cluster-api-provider-baremetal that referenced this issue Feb 7, 2022
🏃 Update v1a3 CRDs to add image checksum type and disk format