Mitigate "machine was allocated without proper switch connections" #21

Open
majst01 opened this issue Mar 13, 2020 · 5 comments

@majst01
Contributor

majst01 commented Mar 13, 2020

There are two ways to get into a state where a machine cannot be reached over the network after allocation:

  1. We register a new switch at the metal-api (metal-core starts for the first time and registers), but machines are already in the waiting state (which can happen after a wrong update sequence or a broken switch)
  2. You start a machine that has a blade switch in between (like the t1-small), where LLDP cannot discover the connections to the leaf switches

In both cases, we cannot find out which switches a machine is connected to.

This can lead to the following failure state:

  • You allocate a machine that is not in the switches' machine connections
  • The machine starts to boot
  • The machine will not be enslaved into a VRF
  • The machine will not be reachable from external networks

Can we prevent this state? It is confusing, and the resulting machines are unusable for the user.

For scenario (1), you can recover the switch connections by rebooting the machine, and then everything is fine.


Both problems can be mitigated by an assertion like this: the machine report should fail if fewer than two switches are visible from the machine.
This will cause the report to fail more often, and the t1-small servers will no longer get to the waiting state.
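A minimal sketch of such an assertion, assuming a simplified neighbor model (the `Neighbor` type, its fields, and the `validateRegistration` helper are illustrative, not the actual metal-api data structures):

```go
package main

import (
	"errors"
	"fmt"
)

// Neighbor is an illustrative stand-in for one LLDP neighbor reported for a machine NIC.
type Neighbor struct {
	SwitchID string
	Port     string
}

// validateRegistration rejects a machine report when fewer than two
// distinct switches are visible from the machine.
func validateRegistration(neighbors []Neighbor) error {
	switches := map[string]bool{}
	for _, n := range neighbors {
		if n.SwitchID != "" {
			switches[n.SwitchID] = true
		}
	}
	if len(switches) < 2 {
		return errors.New("machine report rejected: fewer than two switch neighbors visible")
	}
	return nil
}

func main() {
	// A machine behind a blade switch would typically report fewer than two
	// leaf switch neighbors and therefore never reach the waiting state.
	err := validateRegistration([]Neighbor{{SwitchID: "leaf01", Port: "swp1"}})
	fmt.Println(err)
}
```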


To be honest, it is not very likely to get into this state. The last time it happened was because we updated metal-core and metal-api and wiped the rethinkdb. However, it is better for robustness to prevent these states anyway, as they should be easy to prevent.

The problem is that the metal-api does not care whether there are two switch connections to the machine or not; it allows machine allocation even when this condition is not fulfilled. The metal-hammer could report arbitrary switch neighbors to the metal-api, the API would accept them, and when you allocate the machine you would end up with an unusable one. And this is what happened: the "machine connections" got lost because new switches were registered at the API while the machines behind them were already in the waiting state. The metal-api should at least validate whether it is actually able to construct a proper switch configuration before allowing machine allocation.
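A corresponding sketch of an allocation-time guard, again with illustrative names only (the `Machine` entity with a `Connections` slice is an assumption for the sake of the example, not the real metal-api model):

```go
package main

import "fmt"

// Connection is an illustrative stand-in for a stored machine-to-switch connection.
type Connection struct {
	SwitchID string
	NICName  string
}

// Machine is an illustrative stand-in for the stored machine entity.
type Machine struct {
	ID          string
	Connections []Connection
}

// validateAllocatable rejects an allocation request when the machine has
// fewer than two switch connections, instead of silently producing an
// unreachable machine.
func validateAllocatable(m Machine) error {
	if len(m.Connections) < 2 {
		return fmt.Errorf("machine %s cannot be allocated: only %d switch connection(s) known, cannot construct switch configuration", m.ID, len(m.Connections))
	}
	return nil
}

func main() {
	m := Machine{ID: "machine-1", Connections: []Connection{{SwitchID: "leaf01", NICName: "lan0"}}}
	fmt.Println(validateAllocatable(m))
}
```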

--

Ideally, such a machine should not even be able to enter the wait table. This would force a reboot of the machine, which re-reports the connections, and it would prevent a user from allocating such a machine.

@majst01
Contributor Author

majst01 commented Jul 16, 2020

@Gerrit91, was #31 related to this? I can't remember why; maybe @mwindower has some helpful input as well.

@Gerrit91
Contributor

IMHO we should add a validation of the reported registration data and prevent the metal-hammer from entering the wait phase when, for example, the neighbor condition cannot be verified from the metal-api's perspective.

@Gerrit91
Contributor

It was not related to #31.

@mwindower
Contributor

It is related to #31 because connectMachineWithSwitches of the switch service is called during machine registration.
With #31, machine registration with fewer than two connections fails.

@majst01
Contributor Author

majst01 commented Mar 17, 2022

also covered a bit with #256
