BMH goes to non recoverable error state #708
Comments
It doesn't appear that you're testing with a version that includes #610 (there's no
@zaneb Thanks. It seems that was an old log from before #610 was included. Here is one taken with #610 included, showing the error count. Please have a look; it still gets stuck in the registration error:
OK, yep, this looks like the issue we discussed. This is happening in the actual registering state (i.e. the node has never been registered). The triedCredentials are now set, so we won't try again, even though it isn't the credentials that were wrong. This could also happen in the Ready or Provisioned states if

The other scenario where this could happen is if you change the password to the BMC, and do so by updating the credentials in the Secret first and only then setting the new password on the actual BMC.

/kind bug
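The gating problem described above can be illustrated with a minimal sketch. This is not the operator's actual code: `Credentials`, `Host`, and `shouldRetryRegistration` are hypothetical names, and the model assumes that a retry is attempted only when the current credentials differ from the last ones tried.

```go
package main

import "fmt"

// Credentials, Host and shouldRetryRegistration are hypothetical names used
// only for illustration; they are not the operator's real types.
type Credentials struct {
	Username string
	Password string
}

type Host struct {
	Current Credentials  // credentials currently stored in the Secret
	Tried   *Credentials // last credentials that were attempted, nil if none
}

// shouldRetryRegistration models the problematic gate: once a set of
// credentials has been tried, registration is attempted again only when
// the credentials change.
func shouldRetryRegistration(h Host) bool {
	return h.Tried == nil || *h.Tried != h.Current
}

func main() {
	creds := Credentials{"admin", "secret"}
	h := Host{Current: creds}
	fmt.Println(shouldRetryRegistration(h)) // nothing has been tried yet

	// The first attempt fails because the BMC is temporarily unreachable,
	// but the credentials are still recorded as tried.
	h.Tried = &creds
	fmt.Println(shouldRetryRegistration(h)) // no retry, although the credentials were fine
}
```

Under this assumption, a transient BMC outage leaves the host with nothing that would trigger another attempt, which matches the stuck registration error reported here.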
/cc @kashifest
Previously we only verified that the Node was registered in Ironic when the credentials changed or in a steady state (Ready, Provisioned, ExternallyProvisioned). Simplify the code by always calling actionRegistering() prior to running the state machine.

This ensures that the Node is always created in the Ironic database after a move of the pod, regardless of what state the host is in at the time.

This allows a BMC password to be changed in any order. Previously if the credentials were updated in the Secret prior to changing them on the BMC itself, it would never retry after the new credentials failed.

It also ensures that after an error caused by a temporary inability to contact the BMC, retries still happen. Previously if there had been a registration error, we just gave up because the credentials hadn't changed, on the incorrect assumption that bad credentials were the only way of failing to register. (Although the Host would still be reconciled every 1-2 minutes, it had no way of getting out of this state.)

Fixes metal3-io#708
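The approach in the commit message, running the registration check on every reconcile before the state machine, can be sketched roughly as follows. This is a simplified model under stated assumptions, not the operator's implementation: `Provisioner`, `EnsureRegistered`, `Host`, `reconcile`, and `flakyBMC` are hypothetical, and the state names are plain strings chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Provisioner is a hypothetical stand-in for whatever talks to Ironic.
type Provisioner interface {
	// EnsureRegistered creates or validates the node; it fails if the
	// BMC cannot be reached or the credentials are rejected.
	EnsureRegistered() error
}

// Host is a hypothetical, heavily simplified host record.
type Host struct {
	State        string
	ErrorMessage string
}

// reconcile always performs the registration check first. On failure it
// records the error and returns, so the next reconcile tries again no
// matter which state the host is in or whether the credentials changed.
func reconcile(h *Host, p Provisioner) {
	if err := p.EnsureRegistered(); err != nil {
		h.ErrorMessage = err.Error()
		return
	}
	h.ErrorMessage = ""
	// Only after successful registration does the state machine advance.
	if h.State == "registering" {
		h.State = "inspecting"
	}
}

// flakyBMC simulates a BMC that is unreachable for a few attempts.
type flakyBMC struct{ failuresLeft int }

func (f *flakyBMC) EnsureRegistered() error {
	if f.failuresLeft > 0 {
		f.failuresLeft--
		return errors.New("BMC unreachable")
	}
	return nil
}

func main() {
	h := &Host{State: "registering"}
	bmc := &flakyBMC{failuresLeft: 2}
	for i := 0; i < 3; i++ {
		reconcile(h, bmc)
		fmt.Printf("attempt %d: state=%s error=%q\n", i+1, h.State, h.ErrorMessage)
	}
}
```

In this sketch the host recovers on the third reconcile because nothing prevents the registration check from being repeated once the simulated BMC comes back.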
There is a case where Ironic can't recover BMHs after connectivity with their BMCs has failed for some time.
Ironic sets a bare metal node as failed when it can't connect to the node's BMC, even for a short period of time. Once a BMH is in RegistrationError, it cannot be recovered, which breaks the whole cluster. I have tested this locally: to imitate a loss of BMC connectivity, I deleted the vbmc container and re-created it after a short time (10 seconds) while the BMHs were in the registering state.
The logs are provided below:
I brought this question up earlier in a comment on one of the patches. The answer was that, in the case of a BMC connectivity issue, re-registering is not needed and a retry would be sufficient to get the BMH out of the registration error, and that this was implemented as part of #610. However, based on my local tests, that patch does not cover the case explained here. We have to fix this in order to properly get the BMH out of this error state.