BMH goes to non recoverable error state #708

Closed
furkatgofurov7 opened this issue Nov 3, 2020 · 5 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@furkatgofurov7
Member

furkatgofurov7 commented Nov 3, 2020

There is a case where BMHs cannot recover after connectivity to their BMCs fails for some time.
Ironic marks a bare metal node as failed when it cannot connect to its BMC, even for a short period. Once a BMH is in the registration error state, it cannot be recovered, which breaks the whole cluster. I tested this locally: to imitate a loss of BMC connectivity, I deleted the vbmc container and re-created it after a short time (10 seconds) while the BMHs were in the registering state.
The logs are below:

```
Status:
  Error Message:  Failed to get power state for node 56055790-ddda-427c-a16a-f10261e52382. Error: Redfish connection failed for node 56055790-ddda-427c-a16a-f10261e52382: Unable to connect to https://10.33.131.4/redfish/v1/. Error: HTTPSConnectionPool(host='10.33.131.4', port=443): Max retries exceeded with url: /redfish/v1/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6982f7a080>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
  Error Type:     registration error
  Good Credentials:
  Hardware Profile:
  Last Updated:      2020-08-18T23:34:15Z
  Operation History:
    Deprovision:
      End:    <nil>
      Start:  <nil>
    Inspect:
      End:    <nil>
      Start:  <nil>
    Provision:
      End:    <nil>
      Start:  <nil>
    Register:
      End:             2020-08-18T23:34:15Z
      Start:           2020-08-18T23:20:47Z
  Operational Status:  error
  Powered On:          false
  Provisioning:
    ID:  56055790-ddda-427c-a16a-f10261e52382
    Image:
      Checksum:
      URL:
    State:       registration error
  Tried Credentials:
    Credentials:
      Name:               ephemeral
      Namespace:          dc325-vpod1-capi
    Credentials Version:  9857
Events:                   <none>
```

I raised this question earlier in a comment on one of the patches. The answer was that, for a BMC connectivity issue, re-registration is not needed and a retry would be sufficient to get the BMH out of the registration error; that was implemented as part of #610. However, based on my local tests, that patch does not cover the case described here. We need to fix this so that a BMH can properly get out of this error state.

@furkatgofurov7
Member Author

@zaneb @dhellmann

@zaneb
Member

zaneb commented Nov 3, 2020

It doesn't appear that you're testing with a version that includes #610 (there's no Error Count field in the output). Have you tried testing with that PR?

@furkatgofurov7
Member Author

@zaneb thanks, it seems that was an old log from before #610. Here is one from a build that includes #610, with the error count present. Please have a look; it still gets stuck in registration error:

```
  status:
    errorCount: 1
    errorMessage: 'Failed to get power state for node c3d1a60a-5425-4dc4-9fba-2dc861ace782.
      Error: Command ''[''ipmitool'', ''-I'', ''lanplus'', ''-H'', ''192.168.111.1'',
      ''-L'', ''ADMINISTRATOR'', ''-p'', ''6230'', ''-U'', ''admin'', ''-R'', ''12'',
      ''-N'', ''5'', ''-f'', ''/tmp/tmp9_fmvykb'', ''power'', ''status'']'' timed
      out after 60 seconds'
    errorType: registration error
    goodCredentials: {}
    hardwareProfile: ""
    lastUpdated: "2020-11-03T14:01:06Z"
    operationHistory:
      deprovision:
        end: null
        start: null
      inspect:
        end: null
        start: null
      provision:
        end: null
        start: null
      register:
        end: null
        start: "2020-11-03T14:00:05Z"
    operationalStatus: error
    poweredOn: false
    provisioning:
      ID: c3d1a60a-5425-4dc4-9fba-2dc861ace782
      image:
        checksum: ""
        url: ""
      state: registering
    triedCredentials:
      credentials:
        name: node-0-bmc-secret
        namespace: metal3
      credentialsVersion: "10468"
```

@zaneb
Member

zaneb commented Nov 3, 2020

OK, yep, this looks like the issue we discussed.

This is happening in the actual registering state (i.e. the node has never been registered). The triedCredentials are now set, so we won't try again, even though it isn't the credentials that were wrong.

This could also happen in the Ready or Provisioned states if Adopt() or ValidateManagementAccess() (respectively) fails. Given that these are the first things called in each of those states, there's a high likelihood that, if the BMC becomes unreachable for some reason, this is the first place it will fail.

The other scenario where this could happen is if you change the password to the BMC, and do so by updating the credentials in the Secret first and only then setting the new password on the actual BMC.
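
The failure mode described above can be sketched in Go. This is a simplified illustration under stated assumptions, not the operator's actual code: the `hostStatus` type, its fields, and both functions are hypothetical names invented for the example. It assumes the retry decision depended only on whether the credentials version differs from the one already tried.

```go
package main

import "fmt"

// hostStatus is a hypothetical, pared-down stand-in for the BMH status.
type hostStatus struct {
	TriedCredentialsVersion string // version of the Secret last used for registration
	RegistrationFailed      bool   // true once a registration error has been recorded
}

// needsRegistrationCredsOnly models the assumption that bad credentials are
// the only reason registration can fail: it retries only when the Secret changed.
func needsRegistrationCredsOnly(s hostStatus, currentVersion string) bool {
	return s.TriedCredentialsVersion != currentVersion
}

// needsRegistrationAlways models retrying regardless of the credentials,
// so a transient BMC outage does not leave the host stuck.
func needsRegistrationAlways(s hostStatus, currentVersion string) bool {
	return true
}

func main() {
	// The BMC was unreachable during the first attempt; the Secret never changed.
	s := hostStatus{TriedCredentialsVersion: "10468", RegistrationFailed: true}
	fmt.Println("credentials-only retry:", needsRegistrationCredsOnly(s, "10468"))
	fmt.Println("unconditional retry:", needsRegistrationAlways(s, "10468"))
}
```

With the credentials-only check, a host whose first attempt failed because of a timeout is never retried until someone edits the Secret, which matches the stuck state seen in the logs.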

/kind bug

@metal3-io-bot added the kind/bug label Nov 3, 2020
@furkatgofurov7
Member Author

/cc @kashifest

zaneb added a commit to zaneb/baremetal-operator that referenced this issue Nov 10, 2020
Previously we only verified that the Node was registered in Ironic when
the credentials changed or in a steady state (Ready, Provisioned,
ExternallyProvisioned). Simplify the code by always calling
actionRegistering() prior to running the state machine.

This ensures that the Node is always created in the ironic database
after a move of the pod, regardless of what state the host is in at the
time.

This allows a BMC password to be changed in any order. Previously if the
credentials were updated in the Secret prior to changing them on the BMC
itself, it would never retry after the new credentials failed.

It also ensures that after an error caused by a temporary inability to
contact the BMC, retries still happen. Previously if there had been a
registration error, we just gave up because the credentials hadn't
changed on the incorrect assumption that bad credentials were the only
way of failing to register. (Although the Host would still be reconciled
every 1-2 minutes, it had no way of getting out of this state).

Fixes metal3-io#708
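
The approach in this commit message, running the registration step before the state machine on every reconcile, might look roughly like the sketch below. It is a toy model, not the code from the commit: `fakeBMC`, `host`, `ensureRegistered`, and `reconcile` are hypothetical names, and the transition from `registering` to `inspecting` is assumed only to show that the host makes progress once the BMC is reachable again.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeBMC is a hypothetical BMC that is unreachable for the first
// failuresLeft calls and reachable afterwards.
type fakeBMC struct {
	failuresLeft int
}

func (b *fakeBMC) validateAccess() error {
	if b.failuresLeft > 0 {
		b.failuresLeft--
		return errors.New("BMC unreachable")
	}
	return nil
}

// host is a hypothetical, minimal host record.
type host struct {
	registered bool
	errorCount int
	state      string
}

// ensureRegistered stands in for the registration step. It records a
// failure and reports false, or clears the error count and reports true.
func ensureRegistered(h *host, bmc *fakeBMC) bool {
	if err := bmc.validateAccess(); err != nil {
		h.registered = false
		h.errorCount++
		return false
	}
	h.registered = true
	h.errorCount = 0
	return true
}

// reconcile runs registration first on every pass and only then
// advances the (greatly simplified) state machine.
func reconcile(h *host, bmc *fakeBMC) {
	if !ensureRegistered(h, bmc) {
		return // try again on the next reconcile
	}
	if h.state == "registering" {
		h.state = "inspecting" // assumed next state, for illustration
	}
}

func main() {
	h := &host{state: "registering"}
	bmc := &fakeBMC{failuresLeft: 2}
	for i := 1; i <= 3; i++ {
		reconcile(h, bmc)
		fmt.Printf("pass %d: state=%s registered=%t errorCount=%d\n",
			i, h.state, h.registered, h.errorCount)
	}
}
```

Because the registration check no longer depends on the credentials having changed, the host leaves the error condition on the first reconcile after the simulated BMC comes back.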