Mitigate "machine was allocated without proper switch connections" #21

Open
majst01 opened this issue Mar 13, 2020 · 5 comments

@majst01
Contributor

majst01 commented Mar 13, 2020

There are two ways to get into a state where a machine cannot be reached over the network after allocation:

  1. We register a new switch at the metal-api (metal-core starts for the first time and registers), but machines are already in the waiting state (which can happen after a wrong update sequence or a broken switch)
  2. You start a machine that has a blade switch in between (like the t1-small), where LLDP cannot discover the connections to the leaf switches

In both cases, we cannot find out which switches a machine is connected to.

This can lead to the following failure state:

  • You allocate a machine that is not in the switches' machine connections
  • The machine starts to boot
  • The machine will not be enslaved into a VRF
  • The machine will not be reachable from external networks

Can we prevent this state? It is confusing, and the resulting machines are unusable for the user.

For scenario (1), you can recover the switch connections by rebooting the machine, and then everything is fine.


Both problems can be mitigated by an assertion like this: the machine report should fail if fewer than two switches are visible from the machine.
This will cause the report to fail more often, and the t1-small servers will no longer get to the waiting state.
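A minimal sketch of such an assertion, assuming a simplified neighbor model (the `Neighbor` type, its fields, and the `validateRegistration` helper are illustrative, not the actual metal-api data structures):

```go
package main

import (
	"errors"
	"fmt"
)

// Neighbor is an illustrative stand-in for one LLDP neighbor reported for a machine NIC.
type Neighbor struct {
	SwitchID string
	Port     string
}

// validateRegistration rejects a machine report when fewer than two
// distinct switches are visible from the machine.
func validateRegistration(neighbors []Neighbor) error {
	switches := map[string]bool{}
	for _, n := range neighbors {
		if n.SwitchID != "" {
			switches[n.SwitchID] = true
		}
	}
	if len(switches) < 2 {
		return errors.New("machine report rejected: fewer than two switch neighbors visible")
	}
	return nil
}

func main() {
	// A machine behind a blade switch would typically report fewer than two
	// leaf switch neighbors and therefore never reach the waiting state.
	err := validateRegistration([]Neighbor{{SwitchID: "leaf01", Port: "swp1"}})
	fmt.Println(err)
}
```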


To be honest, it is not very likely to get into this state. The last time it happened was because we updated metal-core and metal-api and wiped the rethinkdb. However, it is better for robustness to prevent these states anyway, as they should be easy to prevent.

The problem is that the metal-api does not care whether there are two switch connections to the machine or not; it allows machine allocation even when this condition is not fulfilled. The metal-hammer could report arbitrary switch neighbors to the metal-api, the API would accept them, and when you allocate the machine you would end up with an unusable one. And this is what happened: the "machine connections" got lost because new switches were registered at the API while the machines behind them were already in the waiting state. The metal-api should at least validate whether it is actually able to construct a proper switch configuration before allowing machine allocation.
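A corresponding sketch of an allocation-time guard, again with illustrative names only (the `Machine` entity with a `Connections` slice is an assumption for the sake of the example, not the real metal-api model):

```go
package main

import "fmt"

// Connection is an illustrative stand-in for a stored machine-to-switch connection.
type Connection struct {
	SwitchID string
	NICName  string
}

// Machine is an illustrative stand-in for the stored machine entity.
type Machine struct {
	ID          string
	Connections []Connection
}

// validateAllocatable rejects an allocation request when the machine has
// fewer than two switch connections, instead of silently producing an
// unreachable machine.
func validateAllocatable(m Machine) error {
	if len(m.Connections) < 2 {
		return fmt.Errorf("machine %s cannot be allocated: only %d switch connection(s) known, cannot construct switch configuration", m.ID, len(m.Connections))
	}
	return nil
}

func main() {
	m := Machine{ID: "machine-1", Connections: []Connection{{SwitchID: "leaf01", NICName: "lan0"}}}
	fmt.Println(validateAllocatable(m))
}
```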

--

Ideally, such a machine should not even be able to enter the wait table. This would force a reboot of the machine, which re-reports the connections, and it would prevent a user from allocating such a machine.

@majst01
Contributor Author

majst01 commented Jul 16, 2020

@Gerrit91, was #31 related to this? I can't remember why; maybe @mwindower has some helpful input as well.

@Gerrit91
Contributor

IMHO we should add a validation of the reported registration data and prevent the metal-hammer from entering the wait phase when, for example, the neighbor condition cannot be verified from the metal-api's perspective.

@Gerrit91
Contributor

It was not related to #31.

@mwindower
Contributor

It is related to #31 because connectMachineWithSwitches of the switch service is called during machine registration.
With #31, machine registration with fewer than two connections fails.

@majst01
Contributor Author

majst01 commented Mar 17, 2022

also covered a bit with #256
