
Handle registration independently of provisioning state #388

Merged
merged 3 commits on Oct 27, 2020

Conversation

@zaneb (Member) commented Jan 14, 2020

We may need to re-register the Host at any point in time (whenever the credentials Secret changes, for a start). Previously we did this by switching the provisioning state back to Registering and then back to the current state when finished.

Since registering and provisioning are effectively orthogonal, separate the registration phase out from the provisioning state. Always ensure the Host registration is current before running the rest of the state machine. To enable this, the RegistrationError state is removed, along with the other error provisioning states. The OperationalStatus and ErrorType fields provide more reliable error reporting than those provisioning states could.
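For illustration only, a minimal, self-contained sketch of the control flow described above; the type and function names are hypothetical stand-ins rather than the actual baremetal-operator API:

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the real BareMetalHost types. This
// only illustrates the idea: registration is checked on every reconcile pass,
// before the handler for the current provisioning state runs, instead of
// bouncing the provisioning state back to Registering.

type provisioningState string

const (
	stateRegistering  provisioningState = "registering"
	stateProvisioning provisioningState = "provisioning"
)

type host struct {
	State           provisioningState
	GoodCredentials string // credentials the host was last registered with
	CurrentSecret   string // credentials currently in the Secret
}

// ensureRegistered re-registers the host whenever the credentials have
// changed, without touching the provisioning state.
func ensureRegistered(h *host) {
	if h.GoodCredentials == h.CurrentSecret {
		return // already registered with the current credentials
	}
	fmt.Println("re-registering host with updated credentials")
	h.GoodCredentials = h.CurrentSecret
}

func reconcile(h *host) {
	// Registration is handled first, independently of the provisioning state...
	ensureRegistered(h)
	// ...then the handler for the current state runs as usual.
	fmt.Println("running handler for state:", h.State)
}

func main() {
	h := &host{State: stateProvisioning, GoodCredentials: "v1", CurrentSecret: "v2"}
	reconcile(h) // re-registers in place, then continues provisioning
}
```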

@metal3-io-bot added the 'approved' label (Indicates a PR has been approved by an approver from all required OWNERS files) on Jan 14, 2020
@zaneb requested a review from dhellmann on January 14, 2020 at 20:44
@metal3-io-bot added the 'size/L' label (Denotes a PR that changes 100-499 lines, ignoring generated files) on Jan 14, 2020
@zaneb (Member, Author) commented Jan 14, 2020

/test-centos-integration

@maelk (Member) commented Jan 15, 2020

This PR does not include the fasttrack fix. Please rebase to fix the current problem in the CI

@zaneb (Member, Author) commented Jan 15, 2020

/test-centos-integration

1 similar comment
@maelk (Member) commented Jan 16, 2020

/test-centos-integration

@dhellmann (Member) commented

/test-centos-integration

@zaneb (Member, Author) commented Feb 13, 2020

/test-integration

@maelk (Member) commented Feb 13, 2020

Please rebase the PR to fix the CI issue

@zaneb (Member, Author) commented May 11, 2020

/test-integration

@zaneb (Member, Author) commented Jul 23, 2020

/test-integration

@zaneb (Member, Author) commented Jul 23, 2020

/test-integration

@zaneb requested a review from dhellmann on July 23, 2020 at 16:55
@zaneb (Member, Author) commented Jul 30, 2020

It would be really nice if we could either merge this or decide that we're not going to merge it, so that I can stop carrying two different state machine implementations around in my head :)

return
}

if hsm.Host.Status.GoodCredentials.Match(*info.bmcCredsSecret) {
Reviewer (Member):

Would this mean that a host that was successfully registered once is not re-registered if the database is wiped after a pod restart? Maybe that's handled the next time the provisioner is used to try to do something with the host?

Elsewhere we say that after a host passes through the registration state it never goes back. I wonder if we need this check at all any more?

zaneb (Member, Author):

Would this mean that a host that was successfully registered once is not re-registered if the database is wiped after a pod restart?

It means that we won't call actionRegistering below.
In the previous code, it meant we wouldn't flip back to the Registering state (which calls actionRegistering).
So the effect is the same, which is not to say that there isn't a bug. I suspect there are a lot of states where we don't re-register the host if the DB is dropped. (The ones that call Adopt() are the only ones where we do AFAIK.)

Elsewhere we say that after a host passes through the registration state it never goes back.

This is more of an accounting thing. We do re-register below if the creds change, by doing the exact stuff that the Registering state does, but we don't change the provisioning state while we do it any more.


recordStateBegin(hsm.Host, metal3v1alpha1.StateRegistering, metav1.Now())
if hsm.Host.Status.ErrorType == metal3v1alpha1.RegistrationError {
if hsm.Host.Status.TriedCredentials.Match(*info.bmcCredsSecret) {
Reviewer (Member):

Maybe we don't need this test any more? Or do we need to wait for #610?

zaneb (Member, Author):

Yeah, I suspect the existence of this code is a mistake - registration could have failed for a reason other than the creds being wrong, so we probably shouldn't force the user to update them before we look again. #610 provides an exponential backoff on failure, which is what we really want.
This is bug-for-bug compatible with how it works now, so any changes should go in a separate PR.
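As background, a generic sketch of the error-count-based exponential backoff being referred to; the actual behaviour added in #610 may differ in constants and capping, so treat this only as an illustration of the concept:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	baseDelay  = 10 * time.Second // assumed base delay, for illustration only
	maxBackoff = 10 * time.Minute // assumed cap, for illustration only
)

// backoffFor returns how long to wait before the next retry, doubling with
// each consecutive failure and capped at maxBackoff.
func backoffFor(errorCount int) time.Duration {
	if errorCount < 1 {
		return 0
	}
	d := time.Duration(float64(baseDelay) * math.Pow(2, float64(errorCount-1)))
	if d > maxBackoff {
		return maxBackoff
	}
	return d
}

func main() {
	for i := 1; i <= 7; i++ {
		fmt.Printf("failure %d: retry after %s\n", i, backoffFor(i))
	}
}
```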

@kashifest (Member) commented Sep 29, 2020:

What will happen in case we have a BMC connectivity issue for a brief moment, the BMH goes to RegistrationError, and then after a while the connectivity is restored? The workflow here won't try to register again, since the change was not related to the BMC credentials but to BMC connectivity. Am I understanding it right?

zaneb (Member, Author):

Yes, but with a caveat: it likely wouldn't have gone into RegistrationError before either - it could easily have been ProvisioningError or PowerManagementError, because we don't know the cause of the error, only what we were trying to do when it happened.

In those scenarios it's fine anyway, because we don't need to re-register (i.e. send new credentials to Ironic) to reconnect; a retry is enough. The theory behind this code is probably that the only circumstances we need to re-register is if the password is changed on the BMC, and in that case it won't work until the creds are updated in k8s anyway so there's no point retrying until they are.

The problem with that logic is that if we do change the credentials and connectivity is interrupted while we are updating them, then we erroneously assume the failure was because the credentials are bad and don't try again. (The same would presumably apply if you change them in the 'wrong' order: in the k8s Secret first and later on the BMC.) If that proves to be the case, it's a (pre-existing) bug.

Reviewer (Member):

In those scenarios it's fine anyway, because we don't need to re-register (i.e. send new credentials to Ironic) to reconnect; a retry is enough. The theory behind this code is probably that the only circumstances we need to re-register is if the password is changed on the BMC, and in that case it won't work until the creds are updated in k8s anyway so there's no point retrying until they are.

Does this mean that in the case of a BMC connectivity issue, no extra logic needs to be added to the host state machine and a retry is already implemented? I am interested in this specific case where only BMC connectivity is lost and nothing is wrong with the credentials. I would like to know your opinion on that specific case, and what needs to be done to get the BMH out of RegistrationError @zaneb? Thanks!

zaneb (Member, Author):

Does this mean that in the case of a BMC connectivity issue, no extra logic needs to be added to the host state machine and a retry is already implemented?

Yes, as of #610 retry is implemented.
(Prior to that we in theory stopped reconciling after a registration error, and waited for some change to the Host or Secret. In practice I suspect that in many cases writes to the status would have triggered an immediate re-reconcile and resultant retry anyway.)

@dhellmann (Member) commented

I would like to freeze go code changes for a few days to try to land #650 without having to rebase it, because rebasing will mean redoing the work from scratch.

/hold

@metal3-io-bot added the 'do-not-merge/hold' label (Indicates that a PR should not merge because someone has issued a /hold command) on Sep 24, 2020
@dhellmann (Member) commented

#655 has merged

/hold cancel

@metal3-io-bot removed the 'do-not-merge/hold' label (Indicates that a PR should not merge because someone has issued a /hold command) on Oct 1, 2020
@dhellmann (Member) commented

This is going to need to be rebased, but the logic in this version looks good.

Changing the provisioning status is not how we actually indicate errors:
we have the operational status for that. The existence of these states
also has the effect of obscuring which state the Host was in (and hence
what we were trying to do) when the error occurred. For example, a
registration error can occur during provisioning.

Remove the RegistrationError, ProvisioningError, and
PowerManagementError states, in line with what the state machine diagram
envisions, but also remove the Error states from the diagram as they are
now orthogonal to the provisioning state.
The registration state of the Host is effectively orthogonal to the provisioning state, so instead of bouncing back to the Registering state whenever the credentials change, ensure the Host is registered with the current credentials prior to handling the current state.

The Registering state remains as a transition state to the rest of the
state machine - the Host will not exit the Registering state until it
has been registered at least once.
Eliminate the unreachable return at the end of the function and use
early exits only for cases where there is still work to continue.
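To make the commit message concrete, here is a simplified sketch of the status layout it describes (field names approximate the status output shown later in this thread, not the exact Go types): the error indication stays orthogonal to the provisioning state, so an error no longer obscures what the Host was doing when it occurred.

```go
package main

import "fmt"

// Simplified illustration of the status layout the commit message describes;
// the real BareMetalHostStatus has more fields, but the point is that the
// error indication (OperationalStatus/ErrorType) is orthogonal to the
// provisioning state.
type hostStatus struct {
	ProvisioningState string // what we were trying to do
	OperationalStatus string // e.g. "OK" or "error"
	ErrorType         string // e.g. "registration error"
	ErrorMessage      string
}

func main() {
	// A registration error that happens during provisioning no longer hides
	// the provisioning state behind a RegistrationError state.
	s := hostStatus{
		ProvisioningState: "provisioning",
		OperationalStatus: "error",
		ErrorType:         "registration error",
		ErrorMessage:      "Redfish connection failed",
	}
	fmt.Printf("%+v\n", s)
}
```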
@zaneb (Member, Author) commented Oct 23, 2020

/test-integration

@zaneb requested a review from dhellmann on October 23, 2020 at 17:57
@dhellmann (Member) left a comment

This looks good to me. I don't want to approve it with 1 reviewer late on a Friday, so I'll leave it for others to review next week.

/approve
/cc @andfasano

info.host.Status.HardwareDetails = details
return actionComplete{}
}
info.host.ClearError()
Reviewer (Member):

It looks like sometimes we clear the error in the controller and sometimes in the state machine. That may be an area to clean up in a future PR.

Reviewer (Member):

ClearError is invoked alongside actionComplete{} or actionContinue (but not vice versa). Could it make sense to have a utility method grouping those calls together, like recordActionFailure?

zaneb (Member, Author):

IIRC I looked into doing the ClearError() in a single place in Reconcile() when we get the actionResult. However, it's not as consistent as you would hope so we have to keep doing it on a case-by-case basis.

The changes in this last commit are just refactoring to make the control flow of this action look like the others, so I wouldn't get too hung up on this particular part. If I'd known this PR was going to take a year I'd have just submitted this commit separately.
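For illustration, a minimal sketch of the kind of helper suggested above, pairing ClearError with the returned action in the spirit of recordActionFailure; the names and types here are hypothetical stand-ins, not the actual controller API:

```go
package main

import "fmt"

// Minimal stand-ins for the controller's action results and host object.
type actionResult interface{ isResult() }

type actionComplete struct{}

func (actionComplete) isResult() {}

type host struct{ errorMessage string }

func (h *host) clearError() { h.errorMessage = "" }

// completeClearingError groups the two calls that are currently made case by
// case in each action function: clear the error, then report completion.
func completeClearingError(h *host) actionResult {
	h.clearError()
	return actionComplete{}
}

func main() {
	h := &host{errorMessage: "registration failed"}
	res := completeClearingError(h)
	fmt.Printf("result=%T error=%q\n", res, h.errorMessage)
}
```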

// registered using the current BMC credentials, so we can move to the
// next state. We will not return to the Registering state, even
// if the credentials change and the Host must be re-registered.
hsm.Host.ClearError()
Reviewer (Member):

See earlier comment about clearing errors.

zaneb (Member, Author):

This is a bit of a special case because we don't call an action*() function from this state any more. I'm not sure it's even necessary, but I wrote this patch a year ago so I might have known the reason then :D

@@ -221,6 +217,20 @@ func TestErrorCountIncreasedWhenProvisionerFails(t *testing.T) {
}
}

func TestErrorCountIncreasedWhenRegistrationFails(t *testing.T) {
Reviewer (Member):

I'm a little surprised at how few changes there are to the tests. Maybe that says more about our test coverage than this change, though.

@metal3-io-bot (Contributor) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, zaneb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@furkatgofurov7 (Member) commented

@zaneb @dhellmann I have a question regarding a specific case; even though I asked it earlier in this PR, I will ask it once again.
We have a situation where Ironic can't recover BMHs when connectivity with the BMCs fails for some time: Ironic sets a bare metal node as failed when it can't connect to its BMC for a short period of time. Once a BMH is in an error state it cannot be recovered, breaking the whole cluster. I tested this locally: to imitate BMC connectivity loss, I deleted the VBMC container and checked the logs, and it is throwing registration_error (see status below):

`Status:
  Error Message:  Failed to get power state for node 56055790-ddda-427c-a16a-f10261e52382. Error: Redfish connection failed for node 56055790-ddda-427c-a16a-f10261e52382: Unable to connect to https://10.33.131.4/redfish/v1/. Error: HTTPSConnectionPool(host='10.33.131.4', port=443): Max retries exceeded with url: /redfish/v1/ (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f6982f7a080>: Failed to establish a new connection: [Errno 110] ETIMEDOUT',))
  Error Type:     registration error
  Good Credentials:
  Hardware Profile:  
  Last Updated:      2020-08-18T23:34:15Z
  Operation History:
    Deprovision:
      End:    <nil>
      Start:  <nil>
    Inspect:
      End:    <nil>
      Start:  <nil>
    Provision:
      End:    <nil>
      Start:  <nil>
    Register:
      End:             2020-08-18T23:34:15Z
      Start:           2020-08-18T23:20:47Z
  Operational Status:  error
  Powered On:          false
  Provisioning:
    ID:  56055790-ddda-427c-a16a-f10261e52382
    Image:
      Checksum:  
      URL:       
    State:       registration error
  Tried Credentials:
    Credentials:
      Name:               ephemeral
      Namespace:          dc325-vpod1-capi
    Credentials Version:  9857
Events:                   <none>` 

I was just wondering: as far as I could tell from the code, we do not handle the BMC losing connectivity and the host going into the registration error state in host_state_machine; maybe we could add that case too? Although @zaneb answered why we do not need to re-register in that case and a simple retry is enough, would you mind explaining what happens if even the retry fails? Simply put, will nodes get out of the error state with the currently implemented behaviour in BMO? I would just like to know your opinions on that. Thanks in advance!
Sorry, I am still new to the repo and trying to get more familiar with different parts of it.

@zaneb (Member, Author) commented Oct 24, 2020

@furkatgofurov7 let's open a separate issue to discuss it. The goal of this PR is to not change any behaviour other than the reported provisioning state, so it's really nothing to do with this and any discussion here is going to get lost.

@dhellmann (Member) commented

/lgtm

Let's get this in and iterate if we need to.

Labels
approved (Indicates a PR has been approved by an approver from all required OWNERS files), lgtm (Indicates that a PR is ready to be merged), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files)
7 participants