
BMH gets stuck in RegistrationError when vbmc is not reachable for a short time #739

Closed
furkatgofurov7 opened this issue Dec 7, 2020 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@furkatgofurov7
Member

furkatgofurov7 commented Dec 7, 2020

We have the following scenario:
While the BMHs are in the registering state, we run a small script to simulate losing the connection to vbmc: it simply stops the vbmc container, sleeps for 10 seconds, and starts vbmc again. When this is simulated, one of the BMHs gets stuck in registering, which in turn results in RegistrationError. Interestingly, BMO retries and tries to handle the error several times (ErrorCount keeps increasing), but the BMH is still not able to get out of RegistrationError.
Truncated versions of the logs are provided below:

Any direction/help is much appreciated. Thanks!

Upd: node_0-serial0.log
Upd2: baremetal node list

@furkatgofurov7
Member Author

/cc @zaneb @dhellmann @maelk

@furkatgofurov7
Member Author

This issue is a continuation of #708, after testing BMO with the changes included in #716.

@zaneb
Member

zaneb commented Dec 10, 2020

The problem is in the ironic provisioner. Looking at the Node data from ironic, there's no way to tell whether a last_error is showing because our attempt to move to manageable just timed out, or because we have seen it before, have waited the requisite time, and want to try again. There's no timestamp in the ironic data (and trusting timestamps from other services is questionable anyway). Currently we only try again if the credentials have changed, so in all other cases we'll just keep reporting the same error and incrementing the error count.
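
To make that behaviour concrete, here is a minimal, self-contained sketch (types and names are illustrative, not the actual BMO code): with no timestamp attached to last_error, the only thing that triggers a fresh validation attempt is a credentials change, so a stale transient failure keeps being re-reported.

```go
package main

import "fmt"

// Illustrative stand-ins for the Ironic node data and the provisioner result;
// the real types live in gophercloud and baremetal-operator.
type ironicNode struct {
	ProvisionState string // e.g. "enroll" after a failed verification
	LastError      string // no timestamp, so its age is unknown
}

type provisionResult struct {
	Dirty        bool   // true when we asked Ironic to do something
	ErrorMessage string // reported back to the BareMetalHost status
}

// validateManagementAccess sketches the current behaviour: if the node shows a
// last_error and the credentials have not changed, we just report the same
// error again (the controller bumps ErrorCount) instead of retrying.
func validateManagementAccess(node ironicNode, credentialsChanged bool) provisionResult {
	if node.LastError != "" && !credentialsChanged {
		return provisionResult{ErrorMessage: node.LastError}
	}
	// Otherwise (re)start verification, i.e. ask Ironic to move enroll -> manageable.
	return provisionResult{Dirty: true}
}

func main() {
	n := ironicNode{ProvisionState: "enroll", LastError: "Failed to get power state"}
	fmt.Printf("%+v\n", validateManagementAccess(n, false)) // same error reported forever
}
```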

I think there's something similar going on with inspection failures. In Adopt() I 'fixed' the equivalent problem (#697) by forcing a retry whenever we had an existing error of that type. This doesn't really solve the problem though:

  • If (as I did there) we continue to clear the error when we see an intermediate state (adopting/verifying/inspecting - i.e. when returning Dirty=true) - assuming we don't miss that state as it goes by - then we'll never actually get exponential backoffs, because the error count always gets reset to 0 (see the backoff sketch after this list).
  • If we don't clear the error, then after the first failure we'll retry in a loop inside the ironic provisioner, without ever incrementing the error count again.
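
A small sketch of the backoff concern from the first bullet (the formula is only illustrative; BMO's actual requeue calculation may differ): the delay only grows while ErrorCount keeps increasing, so resetting it to 0 on every intermediate state collapses the backoff to the base delay.

```go
package main

import (
	"fmt"
	"time"
)

// calcBackoff shows the shape of the problem, not BMO's exact formula:
// the requeue delay doubles with each recorded failure, capped at a maximum.
func calcBackoff(errorCount int, base, max time.Duration) time.Duration {
	d := base << uint(errorCount) // base * 2^errorCount
	if d <= 0 || d > max {
		return max
	}
	return d
}

func main() {
	for count := 0; count <= 5; count++ {
		fmt.Println(count, calcBackoff(count, 10*time.Second, 10*time.Minute))
	}
	// If ErrorCount is reset to 0 every time an intermediate state is seen,
	// every retry waits only the 10s base delay and there is no backoff at all.
}
```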

What we need is some way of recording that we have seen the error. I now realise that prior to #388 the flip-flopping between the Registering and RegistrationFailed states could have achieved that for registration (oops), although it did not because at that time it only retried if the credentials changed. That wouldn't have helped adoption or inspection anyway.

One option would be to store data in Ironic indicating that we've seen the error. For adoption and inspection we could move the state back to manageable after seeing the error, but that doesn't work for registration because it fails back to enroll (not my spelling mistake, for the record) where it started from. Can we clear the last_error field? (@dtantsur?) Or store the error count in the extra data field. Since atomic changes between ironic and bmo are not possible, after writing data to Ironic an error writing the updated failure count to the Host would result in an extra immediate retry, but that's not really a disaster as long as it's not systematic.
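
For the "store it in Ironic" option, a hedged sketch of what recording a seen-error marker in the node's extra field could look like with gophercloud's baremetal client (the extra key name is invented for this sketch; this is not existing BMO code):

```go
package provisioner // illustrative package name

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/baremetal/v1/nodes"
)

// markErrorSeen records in the node's extra field that the operator has
// already observed the current last_error, so a later reconcile could tell a
// stale failure from a fresh one. The "/extra/seen_last_error" path is made up
// for illustration; it is not something BMO writes today.
func markErrorSeen(client *gophercloud.ServiceClient, nodeID, lastError string) error {
	_, err := nodes.Update(client, nodeID, nodes.UpdateOpts{
		nodes.UpdateOperation{
			Op:    nodes.AddOp,
			Path:  "/extra/seen_last_error",
			Value: lastError,
		},
	}).Extract()
	if err != nil {
		return fmt.Errorf("failed to record seen error on node %s: %w", nodeID, err)
	}
	return nil
}
```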

A better option is probably that upon seeing Dirty=true (which happens on both the initial forced retry and any intermediate states), we should clear the error type/message (so we won't do another forced update) but not the error count (so that wait times between failures will continue to increase). The error count would only get cleared upon completing the action. This seems like a not-too-bad UX. One issue with this is that we can't currently distinguish between a successful request to change the provision state and a 409 Conflict (both just return Dirty=true with a requeue delay).
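
A minimal sketch of this proposal (field and function names are illustrative, not the real CRD or controller code): Dirty=true clears the error type/message but leaves ErrorCount alone, and only a completed action resets the count.

```go
package main

import "fmt"

// Illustrative stand-in for the relevant BareMetalHost status fields.
type hostStatus struct {
	ErrorType    string
	ErrorMessage string
	ErrorCount   int
}

// onProvisionerResult applies the proposed rule: Dirty=true clears the error
// type/message (so we do not force another retry) but keeps ErrorCount (so the
// requeue delay keeps growing); only a completed action clears the count.
func onProvisionerResult(status *hostStatus, dirty, completed bool, errType, errMsg string) {
	switch {
	case completed:
		*status = hostStatus{} // success clears everything, including the count
	case dirty:
		status.ErrorType = ""
		status.ErrorMessage = ""
		// ErrorCount deliberately left alone.
	case errMsg != "":
		status.ErrorType = errType
		status.ErrorMessage = errMsg
		status.ErrorCount++
	}
}

func main() {
	s := hostStatus{}
	onProvisionerResult(&s, false, false, "registration error", "BMC unreachable") // failure #1
	onProvisionerResult(&s, true, false, "", "")                                   // retry in progress: message cleared, count kept
	onProvisionerResult(&s, false, false, "registration error", "BMC unreachable") // failure #2, longer backoff
	fmt.Printf("%+v\n", s) // ErrorCount is 2, not reset by the intermediate state
}
```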

/cc @andfasano

@furkatgofurov7
Member Author

/cc @kashifest

@furkatgofurov7
Member Author

> Currently we only try again if the credentials have changed, so in all other cases we'll just keep reporting the same error and incrementing the error count.

And I ask myself: why do we only retry when the credentials have changed, and not in other cases, e.g. the one this issue describes?

> but that doesn't work for registration because it fails back to enroll (not my spelling mistake, for the record) where it started from

Yes, you are right, I checked it in my testing environment.

@andfasano
Member

@furkatgofurov7 are the BMO logs complete? I see the registration failure traces for node-0 at the top, but no other traces (I was expecting to find some more, given the high ErrorCount value).

@zaneb
Member

zaneb commented Dec 10, 2020

> are the BMO logs complete? I see the registration failure traces for node-0 at the top, but no other traces (I was expecting to find some more, given the high ErrorCount value).

We're missing the earlier ones (on the last one the retry delay is huge, so there are no later ones), but it doesn't really matter. They'll all look the same.

@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 10, 2020

> @furkatgofurov7 are the BMO logs complete? I see the registration failure traces for node-0 at the top, but no other traces (I was expecting to find some more, given the high ErrorCount value).

@andfasano no, that is a truncated version of the logs; the OpenStack paste service was not able to embed the full log, so I had to truncate it a lot. However, the node-0 entries you are seeing are almost at the end of the logs, and the rest is pretty much jammed with identical-looking lines such as the ones below:

`{"level":"info","ts":1607333100.075303,"logger":"controllers.BareMetalHost","msg":"updating credentials success status fields","baremetalhost":"metal3/node-1","provisioningState":"ready"}
{"level":"info","ts":1607333100.0753107,"logger":"controllers.BareMetalHost","msg":"clearing previous error message","baremetalhost":"metal3/node-1","provisioningState":"ready"}
{"level":"info","ts":1607333100.0753253,"logger":"provisioner.ironic","msg":"updating hardware state","host":"node-1"}
{"level":"info","ts":1607333100.0945454,"logger":"provisioner.ironic","msg":"found existing node by ID","host":"node-1"}
{"level":"info","ts":1607333100.0945992,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"metal3/node-1","provisioningState":"ready","requeue":true,"after":60}
{"level":"info","ts":1607333160.095001,"logger":"provisioner.ironic","msg":"verifying ironic provisioner dependencies","host":"node-1"}
{"level":"info","ts":1607333160.1319568,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"metal3/node-1","provisioningState":"ready","credentials":{"credentials":{"name":"node-1-bmc-secret","namespace":"metal3"},"credentialsVersion":"12122"}}
{"level":"info","ts":1607333160.1320024,"logger":"provisioner.ironic","msg":"validating management access","host":"node-1"}
{"level":"info","ts":1607333160.1659138,"logger":"provisioner.ironic","msg":"found existing node by ID","host":"node-1"}
{"level":"info","ts":1607333160.1659489,"logger":"provisioner.ironic","msg":"current provision state","host":"node-1","lastError":"","current":"manageable","target":""}
{"level":"info","ts":1607333160.165959,"logger":"provisioner.ironic","msg":"have manageable host","host":"node-1"}
{"level":"info","ts":1607333160.1659648,"logger":"controllers.BareMetalHost","msg":"response from validate","baremetalhost":"metal3/node-1","provisioningState":"ready","provResult":{"Dirty":false,"RequeueAfter":0,"ErrorMessage":""}}`

zaneb added a commit to zaneb/baremetal-operator that referenced this issue Dec 11, 2020
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739
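
A rough sketch of what the force flag changes, building on the earlier sketch (the signature is illustrative, not necessarily the exact one in the referenced commit): force=true bypasses the "error already seen" short-circuit, so validation is retried even though the credentials are unchanged and the node still carries the old last_error.

```go
package main

import "fmt"

type ironicNode struct{ LastError string }

type provisionResult struct {
	Dirty        bool
	ErrorMessage string
}

// With a force parameter, the state machine can signal that this is a
// deliberate new attempt (e.g. after the backoff for a saved registration
// error has expired), so the "error already seen" check is skipped.
func validateManagementAccess(node ironicNode, credentialsChanged, force bool) provisionResult {
	if node.LastError != "" && !credentialsChanged && !force {
		return provisionResult{ErrorMessage: node.LastError} // just report it again
	}
	return provisionResult{Dirty: true} // ask Ironic to re-run verification
}

func main() {
	n := ironicNode{LastError: "Failed to get power state"}
	fmt.Printf("without force: %+v\n", validateManagementAccess(n, false, false))
	fmt.Printf("with force:    %+v\n", validateManagementAccess(n, false, true))
}
```
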
@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 11, 2020

@andfasano I tried to automate the manual steps as much as I could, so to reproduce this scenario you can use the following simple steps:

  • Clone [metal3-dev-env](https://github.com/metal3-io/metal3-dev-env.git), export some [env variables](https://github.com/metal3-io/metal3-dev-env/blob/3293407d842d8e89ed422460a8ca24dd159f62bc/README.md#environment-variables) (export EPHEMERAL_CLUSTER=kind and export CONTAINER_RUNTIME=docker are the best choices for this test, since it is based on Docker containers) and run `make` (we should have 2 BMHs in the `Ready` state once this step is done).
  • Run this [custom script](https://pastebin.com/8bUv5Ay1) inside the metal3-dev-env folder, which should hopefully take care of most of the steps in the test.

While running the script, watch the BMHs. If the script runs as expected, which it most probably will (there is an unlikely case where the BMHs transition from Registering to Inspecting so quickly that we cannot stop ==> sleep ==> start the vbmc container exactly while they are in Registering), it should catch the important point: while the BMHs are in the Registering state, it stops ==> sleeps ==> starts the vbmc container. If successful, one of the BMHs should end up stuck in the Registering state and never be able to recover from it.

/cc @zaneb

@andfasano
Member

> @andfasano I tried to automate the manual steps as much as I could, so to reproduce this scenario you can use the following simple steps: […]

Thanks @furkatgofurov7

@fmuyassarov
Member

/kind bug

@metal3-io-bot metal3-io-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 16, 2020
zaneb added a commit to zaneb/baremetal-operator that referenced this issue Jan 11, 2021
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739
honza pushed a commit to honza/baremetal-operator that referenced this issue Feb 24, 2021
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739

(cherry picked from commit 84ca573)
Signed-off-by: Honza Pokorny <honza@redhat.com>