BMH gets stuck in RegistrationError when vbmc is unreachable for a short time #739
Comments
/cc @zaneb @dhellmann @maelk
The problem is in the ironic provisioner. Looking at the Node data from ironic, there's no way to tell if there is a I think there's something similar going with inspection failures. In
What we need is some way of recording that we have seen the error. I now realise that prior to #388 the flip-flopping between the Registering and RegistrationFailed states could have achieved that for registration (oops), although it did not because at that time it only retried if the credentials changed. That wouldn't have helped adoption or inspection anyway.
One option would be to store data in Ironic indicating that we've seen the error. For adoption and inspection we could move the state back to
A better option is probably that upon seeing Dirty=true (which happens on both the initial forced retry and any intermediate states), we should clear the error type/message (so we won't do another forced update) but not the error count (so that wait times between failures will continue to increase). The error count would only get cleared upon completing the action. This seems like a not-too-bad UX.
One issue with this is that we can't currently distinguish between a successful request to change the provision state and a 409 Conflict (both just return Dirty=true with a requeue delay).
/cc @andfasano
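A minimal Go sketch of the bookkeeping proposed in this comment, for illustration only. The `hostStatus` type, its method names, and the backoff formula in `retryDelay` are all hypothetical and are not taken from the baremetal-operator code; the sketch only shows the idea of clearing the error type/message on Dirty=true while keeping the error count until the action completes.

```go
package main

import (
	"fmt"
	"time"
)

// hostStatus is a hypothetical stand-in for the error fields kept on a host.
type hostStatus struct {
	ErrorType    string
	ErrorMessage string
	ErrorCount   int
}

// recordFailure saves a newly observed error and bumps the failure count.
func (s *hostStatus) recordFailure(errType, msg string) {
	s.ErrorType = errType
	s.ErrorMessage = msg
	s.ErrorCount++
}

// needsForcedRetry reports whether a saved error is present, meaning the
// failure has been seen and one forced new attempt should be made.
func (s *hostStatus) needsForcedRetry() bool {
	return s.ErrorType != ""
}

// onDirty models the provisioner returning Dirty=true: the error type and
// message are cleared so no further forced update happens, but the count is
// kept so that wait times between failures continue to grow.
func (s *hostStatus) onDirty() {
	s.ErrorType = ""
	s.ErrorMessage = ""
}

// onActionComplete clears everything, including the count.
func (s *hostStatus) onActionComplete() {
	s.ErrorType = ""
	s.ErrorMessage = ""
	s.ErrorCount = 0
}

// retryDelay is an assumed exponential backoff derived from the error count.
func retryDelay(errorCount int) time.Duration {
	const base = 10 * time.Second
	const maxShift = 6
	if errorCount <= 0 {
		return 0
	}
	shift := errorCount - 1
	if shift > maxShift {
		shift = maxShift
	}
	return base << shift
}

func main() {
	var s hostStatus
	s.recordFailure("registration error", "BMC unreachable")
	fmt.Println("forced retry:", s.needsForcedRetry(), "delay:", retryDelay(s.ErrorCount))
	s.onDirty()
	fmt.Println("forced retry:", s.needsForcedRetry(), "count kept:", s.ErrorCount)
	s.onActionComplete()
	fmt.Println("count after completion:", s.ErrorCount)
}
```

Note that this sketch does not address the open problem mentioned in the comment: a 409 Conflict and a successful state change request would both reach `onDirty` and be treated the same way.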
/cc @kashifest
And I ask myself: why do we retry only when the credentials have changed, and not in other cases, such as the one this issue describes?
Yes, you are right, checked it in my testing environment.
@furkatgofurov7 are the BMO logs complete? I see that registration failure traces for
We're missing the earlier ones (on the last one the retry delay is huge, so there are no later ones), but it doesn't really matter. They'll all look the same.
@andfasano no, that is a truncated version of the logs; OpenStack paste was not able to embed the full log, so I had to truncate it a lot. The node-0 related lines you are seeing come from almost the end of the logs, which are otherwise pretty much jammed with identical-looking lines like the ones below:
Once we see an error in the Node, it returns to the 'enroll' [sic] state and we don't have a way of determining whether we have seen and saved that error or not. Previously we always assumed we had not, and didn't retry the validation unless the credentials had changed. Add a force flag to indicate that this is a new attempt and should start again. Fixes metal3-io#739
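The decision described in this commit message can be sketched roughly as follows. This is an illustrative Go sketch only: `nodeView` and `shouldValidate` are hypothetical names, not the operator's actual types or function signatures, and the only Ironic detail assumed is that a node with a failed validation sits in the `enroll` state with an error recorded.

```go
package main

import "fmt"

// nodeView is a hypothetical, reduced view of an Ironic node.
type nodeView struct {
	ProvisionState string // for example "enroll"
	LastError      string // empty if Ironic reports no error
}

// shouldValidate decides whether to (re)start validation of a node.
// Without the force flag, a node that already shows an error is retried only
// when the credentials have changed, which is the old behaviour the commit
// message describes. The force flag marks a new attempt that should start
// again regardless.
func shouldValidate(n nodeView, credentialsChanged, force bool) bool {
	if n.ProvisionState != "enroll" {
		return false // nothing to validate outside the enroll state
	}
	if n.LastError == "" {
		return true // first attempt, no error recorded yet
	}
	return credentialsChanged || force
}

func main() {
	failed := nodeView{ProvisionState: "enroll", LastError: "BMC unreachable"}
	fmt.Println("old behaviour retries:", shouldValidate(failed, false, false))
	fmt.Println("forced attempt retries:", shouldValidate(failed, false, true))
}
```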
@andfasano I tried to automate manual steps as much as I could, so to reproduce this scenario you can use the following simple steps:
While running the script, watch the BMHs. If the script runs as expected, which is most probable (it is unlikely, but there can be a case when the BMHs' state transition time from
/cc @zaneb
Thanks @furkatgofurov7
/kind bug
Once we see an error in the Node, it returns to the 'enroll' [sic] state and we don't have a way of determining whether we have seen and saved that error or not. Previously we always assumed we had not, and didn't retry the validation unless the credentials had changed. Add a force flag to indicate that this is a new attempt and should start again. Fixes metal3-io#739 (cherry picked from commit 84ca573) Signed-off-by: Honza Pokorny <honza@redhat.com>
We have the following scenario:
While the BMHs are in the registering state, we run a small script that simulates losing the connection to vbmc by simply stopping the vbmc container, sleeping for 10 seconds, and starting vbmc again. When this is simulated, one of the BMHs gets stuck in registering, which in turn results in RegistrationError. Interestingly, BMO retries and handles the error several times (ErrorCount keeps increasing), but the BMH is still not able to get out of RegistrationError.
Below are truncated versions of the logs for:
Any direction/help is much appreciated. Thanks!
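For readers who want to try the outage described above, here is a rough Go sketch of the stop, sleep, start sequence. It is not the script used in this report. The container runtime (`podman`) and the container name (`vbmc`) are assumptions about the environment, and the `--execute` flag and all function names are made up for this example. By default it only prints the commands it would run.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
	"time"
)

// runner executes one command; it is injected so the sequence can be
// exercised without a container runtime.
type runner func(name string, args ...string) error

// execRunner runs the command for real and forwards its output.
func execRunner(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	return cmd.Run()
}

// printRunner only prints the command line (dry run).
func printRunner(name string, args ...string) error {
	fmt.Println(name, strings.Join(args, " "))
	return nil
}

// bounceContainer stops the container, waits for the outage duration, and
// starts the container again.
func bounceContainer(run runner, runtime, container string, outage time.Duration) error {
	if err := run(runtime, "stop", container); err != nil {
		return fmt.Errorf("stopping %s: %w", container, err)
	}
	time.Sleep(outage)
	if err := run(runtime, "start", container); err != nil {
		return fmt.Errorf("starting %s: %w", container, err)
	}
	return nil
}

func main() {
	run := runner(printRunner)
	if len(os.Args) > 1 && os.Args[1] == "--execute" {
		run = execRunner // hypothetical flag: actually touch the container
	}
	// Assumed runtime and container name; adjust to the local setup.
	if err := bounceContainer(run, "podman", "vbmc", 10*time.Second); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```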
Upd: node_0-serial0.log
Upd2: baremetal node list