
BMH gets stuck in RegistrationError when vbmc is not reachable for a short time #739

Closed
furkatgofurov7 opened this issue Dec 7, 2020 · 11 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@furkatgofurov7
Member

furkatgofurov7 commented Dec 7, 2020

We have the following scenario:
While the BMHs are in the registering state, we run a small script to simulate losing the connection to vbmc: it simply stops the vbmc container, sleeps for 10 seconds, and starts vbmc again. When this is simulated, one of the BMHs gets stuck in registering, which in turn results in RegistrationError. Interestingly, BMO retries and tries to handle the error several times (ErrorCount keeps increasing), but the BMH is still not able to get out of RegistrationError.
Truncated versions of the logs are provided below:

Any direction/help is much appreciated. Thanks!

Upd: node_0-serial0.log
Upd2: baremetal node list

@furkatgofurov7
Member Author

/cc @zaneb @dhellmann @maelk

@furkatgofurov7
Member Author

This issue is a continuation of #708, after testing BMO with the changes included in #716.

@zaneb
Member

zaneb commented Dec 10, 2020

The problem is in the ironic provisioner. Looking at the Node data from ironic, there's no way to tell whether a last_error is showing because our attempt to move to manageable just timed out, or because we have seen it before, have waited the requisite time, and want to try again. There's no timestamp in the ironic data (and trusting timestamps from other services is questionable anyway). Currently we only try again if the credentials have changed, so in all other cases we'll just keep reporting the same error and incrementing the error count.
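
To make that behaviour concrete, here is a minimal, self-contained sketch (types and names are illustrative, not the actual BMO code): with no timestamp attached to last_error, the only thing that triggers a fresh validation attempt is a credentials change, so a stale transient failure keeps being re-reported.

```go
package main

import "fmt"

// Illustrative stand-ins for the Ironic node data and the provisioner result;
// the real types live in gophercloud and baremetal-operator.
type ironicNode struct {
	ProvisionState string // e.g. "enroll" after a failed verification
	LastError      string // no timestamp, so its age is unknown
}

type provisionResult struct {
	Dirty        bool   // true when we asked Ironic to do something
	ErrorMessage string // reported back to the BareMetalHost status
}

// validateManagementAccess sketches the current behaviour: if the node shows a
// last_error and the credentials have not changed, we just report the same
// error again (the controller bumps ErrorCount) instead of retrying.
func validateManagementAccess(node ironicNode, credentialsChanged bool) provisionResult {
	if node.LastError != "" && !credentialsChanged {
		return provisionResult{ErrorMessage: node.LastError}
	}
	// Otherwise (re)start verification, i.e. ask Ironic to move enroll -> manageable.
	return provisionResult{Dirty: true}
}

func main() {
	n := ironicNode{ProvisionState: "enroll", LastError: "Failed to get power state"}
	fmt.Printf("%+v\n", validateManagementAccess(n, false)) // same error reported forever
}
```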

I think there's something similar going on with inspection failures. In Adopt() I 'fixed' the equivalent problem (#697) by forcing a retry whenever we had an existing error of that type. This doesn't really solve the problem though:

  • If (as I did there) we continue to clear the error when we see an intermediate state (adopting/verifying/inspecting - i.e. when returning Dirty=true) - assuming we don't miss that state as it goes by - then we'll never actually get exponential backoffs, because the error count always gets reset to 0 (see the backoff sketch after this list).
  • If we don't clear the error, then after the first failure we'll retry in a loop inside the ironic provisioner, without ever incrementing the error count again.
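
A small sketch of the backoff concern from the first bullet (the formula is only illustrative; BMO's actual requeue calculation may differ): the delay only grows while ErrorCount keeps increasing, so resetting it to 0 on every intermediate state collapses the backoff to the base delay.

```go
package main

import (
	"fmt"
	"time"
)

// calcBackoff shows the shape of the problem, not BMO's exact formula:
// the requeue delay doubles with each recorded failure, capped at a maximum.
func calcBackoff(errorCount int, base, max time.Duration) time.Duration {
	d := base << uint(errorCount) // base * 2^errorCount
	if d <= 0 || d > max {
		return max
	}
	return d
}

func main() {
	for count := 0; count <= 5; count++ {
		fmt.Println(count, calcBackoff(count, 10*time.Second, 10*time.Minute))
	}
	// If ErrorCount is reset to 0 every time an intermediate state is seen,
	// every retry waits only the 10s base delay and there is no backoff at all.
}
```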

What we need is some way of recording that we have seen the error. I now realise that prior to #388 the flip-flopping between the Registering and RegistrationFailed states could have achieved that for registration (oops), although it did not because at that time it only retried if the credentials changed. That wouldn't have helped adoption or inspection anyway.

One option would be to store data in Ironic indicating that we've seen the error. For adoption and inspection we could move the state back to manageable after seeing the error, but that doesn't work for registration because it fails back to enroll (not my spelling mistake, for the record) where it started from. Can we clear the last_error field? (@dtantsur?) Or store the error count in the extra data field. Since atomic changes between ironic and bmo are not possible, after writing data to Ironic an error writing the updated failure count to the Host would result in an extra immediate retry, but that's not really a disaster as long as it's not systematic.
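
For the "store it in Ironic" option, a hedged sketch of what recording a seen-error marker in the node's extra field could look like with gophercloud's baremetal client (the extra key name is invented for this sketch; this is not existing BMO code):

```go
package provisioner // illustrative package name

import (
	"fmt"

	"github.com/gophercloud/gophercloud"
	"github.com/gophercloud/gophercloud/openstack/baremetal/v1/nodes"
)

// markErrorSeen records in the node's extra field that the operator has
// already observed the current last_error, so a later reconcile could tell a
// stale failure from a fresh one. The "/extra/seen_last_error" path is made up
// for illustration; it is not something BMO writes today.
func markErrorSeen(client *gophercloud.ServiceClient, nodeID, lastError string) error {
	_, err := nodes.Update(client, nodeID, nodes.UpdateOpts{
		nodes.UpdateOperation{
			Op:    nodes.AddOp,
			Path:  "/extra/seen_last_error",
			Value: lastError,
		},
	}).Extract()
	if err != nil {
		return fmt.Errorf("failed to record seen error on node %s: %w", nodeID, err)
	}
	return nil
}
```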

A better option is probably that upon seeing Dirty=true (which happens on both the initial forced retry and any intermediate states), we should clear the error type/message (so we won't do another forced update) but not the error count (so that wait times between failures will continue to increase). The error count would only get cleared upon completing the action. This seems like a not-too-bad UX. One issue with this is that we can't currently distinguish between a successful request to change the provision state and a 409 Conflict (both just return Dirty=true with a requeue delay).
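
A minimal sketch of this proposal (field and function names are illustrative, not the real CRD or controller code): Dirty=true clears the error type/message but leaves ErrorCount alone, and only a completed action resets the count.

```go
package main

import "fmt"

// Illustrative stand-in for the relevant BareMetalHost status fields.
type hostStatus struct {
	ErrorType    string
	ErrorMessage string
	ErrorCount   int
}

// onProvisionerResult applies the proposed rule: Dirty=true clears the error
// type/message (so we do not force another retry) but keeps ErrorCount (so the
// requeue delay keeps growing); only a completed action clears the count.
func onProvisionerResult(status *hostStatus, dirty, completed bool, errType, errMsg string) {
	switch {
	case completed:
		*status = hostStatus{} // success clears everything, including the count
	case dirty:
		status.ErrorType = ""
		status.ErrorMessage = ""
		// ErrorCount deliberately left alone.
	case errMsg != "":
		status.ErrorType = errType
		status.ErrorMessage = errMsg
		status.ErrorCount++
	}
}

func main() {
	s := hostStatus{}
	onProvisionerResult(&s, false, false, "registration error", "BMC unreachable") // failure #1
	onProvisionerResult(&s, true, false, "", "")                                   // retry in progress: message cleared, count kept
	onProvisionerResult(&s, false, false, "registration error", "BMC unreachable") // failure #2, longer backoff
	fmt.Printf("%+v\n", s) // ErrorCount is 2, not reset by the intermediate state
}
```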

/cc @andfasano

@furkatgofurov7
Member Author

/cc @kashifest

@furkatgofurov7
Member Author

> Currently we only try again if the credentials have changed, so in all other cases we'll just keep reporting the same error and incrementing the error count.

And I ask myself: why do we only retry when the credentials have changed, and not in other cases, e.g. the one this issue describes?

> but that doesn't work for registration because it fails back to enroll (not my spelling mistake, for the record) where it started from

Yes, you are right, I checked it in my testing environment.

@andfasano
Member

@furkatgofurov7 are the BMO logs complete? I see the registration failure traces for node-0 at the top, but no other traces (I was expecting to find some more, given the high ErrorCount value).

@zaneb
Member

zaneb commented Dec 10, 2020

> are the BMO logs complete? I see the registration failure traces for node-0 at the top, but no other traces (I was expecting to find some more, given the high ErrorCount value).

We're missing the earlier ones (on the last one the retry delay is huge, so there are no later ones), but it doesn't really matter. They'll all look the same.

@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 10, 2020

> @furkatgofurov7 are the BMO logs complete? I see the registration failure traces for node-0 at the top, but no other traces (I was expecting to find some more, given the high ErrorCount value).

@andfasano no, that is a truncated version of the logs; the OpenStack paste service was not able to embed the full log, so I had to truncate it a lot. However, the node-0 entries you are seeing are almost at the end of the logs, and the rest is pretty much jammed with identical-looking lines such as the ones below:

`{"level":"info","ts":1607333100.075303,"logger":"controllers.BareMetalHost","msg":"updating credentials success status fields","baremetalhost":"metal3/node-1","provisioningState":"ready"}
{"level":"info","ts":1607333100.0753107,"logger":"controllers.BareMetalHost","msg":"clearing previous error message","baremetalhost":"metal3/node-1","provisioningState":"ready"}
{"level":"info","ts":1607333100.0753253,"logger":"provisioner.ironic","msg":"updating hardware state","host":"node-1"}
{"level":"info","ts":1607333100.0945454,"logger":"provisioner.ironic","msg":"found existing node by ID","host":"node-1"}
{"level":"info","ts":1607333100.0945992,"logger":"controllers.BareMetalHost","msg":"done","baremetalhost":"metal3/node-1","provisioningState":"ready","requeue":true,"after":60}
{"level":"info","ts":1607333160.095001,"logger":"provisioner.ironic","msg":"verifying ironic provisioner dependencies","host":"node-1"}
{"level":"info","ts":1607333160.1319568,"logger":"controllers.BareMetalHost","msg":"registering and validating access to management controller","baremetalhost":"metal3/node-1","provisioningState":"ready","credentials":{"credentials":{"name":"node-1-bmc-secret","namespace":"metal3"},"credentialsVersion":"12122"}}
{"level":"info","ts":1607333160.1320024,"logger":"provisioner.ironic","msg":"validating management access","host":"node-1"}
{"level":"info","ts":1607333160.1659138,"logger":"provisioner.ironic","msg":"found existing node by ID","host":"node-1"}
{"level":"info","ts":1607333160.1659489,"logger":"provisioner.ironic","msg":"current provision state","host":"node-1","lastError":"","current":"manageable","target":""}
{"level":"info","ts":1607333160.165959,"logger":"provisioner.ironic","msg":"have manageable host","host":"node-1"}
{"level":"info","ts":1607333160.1659648,"logger":"controllers.BareMetalHost","msg":"response from validate","baremetalhost":"metal3/node-1","provisioningState":"ready","provResult":{"Dirty":false,"RequeueAfter":0,"ErrorMessage":""}}`

zaneb added a commit to zaneb/baremetal-operator that referenced this issue Dec 11, 2020
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739
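
A rough sketch of what the force flag changes, building on the earlier sketch (the signature is illustrative, not necessarily the exact one in the referenced commit): force=true bypasses the "error already seen" short-circuit, so validation is retried even though the credentials are unchanged and the node still carries the old last_error.

```go
package main

import "fmt"

type ironicNode struct{ LastError string }

type provisionResult struct {
	Dirty        bool
	ErrorMessage string
}

// With a force parameter, the state machine can signal that this is a
// deliberate new attempt (e.g. after the backoff for a saved registration
// error has expired), so the "error already seen" check is skipped.
func validateManagementAccess(node ironicNode, credentialsChanged, force bool) provisionResult {
	if node.LastError != "" && !credentialsChanged && !force {
		return provisionResult{ErrorMessage: node.LastError} // just report it again
	}
	return provisionResult{Dirty: true} // ask Ironic to re-run verification
}

func main() {
	n := ironicNode{LastError: "Failed to get power state"}
	fmt.Printf("without force: %+v\n", validateManagementAccess(n, false, false))
	fmt.Printf("with force:    %+v\n", validateManagementAccess(n, false, true))
}
```
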
@furkatgofurov7
Member Author

furkatgofurov7 commented Dec 11, 2020

@andfasano I tried to automate the manual steps as much as I could, so to reproduce this scenario you can use the following simple steps:

  • Clone [metal3-dev-env](https://github.com/metal3-io/metal3-dev-env.git), export some [env variables](https://github.com/metal3-io/metal3-dev-env/blob/3293407d842d8e89ed422460a8ca24dd159f62bc/README.md#environment-variables) (export EPHEMERAL_CLUSTER=kind and export CONTAINER_RUNTIME=docker are the best choices for this test, since it is based on Docker containers) and run `make` (we should have 2 BMHs in the `Ready` state once this step is done).
  • Run this [custom script](https://pastebin.com/8bUv5Ay1) inside the metal3-dev-env folder, which should hopefully take care of most of the steps in the test.

While running the script, watch the BMHs. If the script runs as expected, which it most probably will (there is an unlikely case where the BMHs transition from Registering to Inspecting so quickly that we cannot stop ==> sleep ==> start the vbmc container exactly while they are in Registering), it should catch the important point: while the BMHs are in the Registering state, it stops ==> sleeps ==> starts the vbmc container. If successful, one of the BMHs should end up stuck in the Registering state and never be able to recover from it.

/cc @zaneb

@andfasano
Member

> @andfasano I tried to automate the manual steps as much as I could, so to reproduce this scenario you can use the following simple steps: […]

Thanks @furkatgofurov7

@fmuyassarov
Member

/kind bug

@metal3-io-bot metal3-io-bot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 16, 2020
zaneb added a commit to zaneb/baremetal-operator that referenced this issue Jan 11, 2021
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739
honza pushed a commit to honza/baremetal-operator that referenced this issue Feb 24, 2021
Once we see an error in the Node, it returns to the 'enroll' [sic] state
and we don't have a way of determining whether we have seen and saved
that error or not. Previously we always assumed we had not, and didn't
retry the validation unless the credentials had changed.

Add a force flag to indicate that this is a new attempt and should start
again.

Fixes metal3-io#739

(cherry picked from commit 84ca573)
Signed-off-by: Honza Pokorny <honza@redhat.com>