Handle registration independently of provisioning state #388
Conversation
/test-centos-integration

This PR does not include the fasttrack fix. Please rebase to fix the current problem in the CI.

/test-centos-integration

/test-centos-integration

/test-centos-integration

/test-integration

Please rebase the PR to fix the CI issue.

/test-integration

/test-integration

/test-integration
It would be really nice if we could either merge this or decide that we're not going to merge it, so that I can stop carrying two different state machine implementations around in my head :)
        return
    }

    if hsm.Host.Status.GoodCredentials.Match(*info.bmcCredsSecret) {
Would this mean that a host that was successfully registered once is not re-registered if the database is wiped after a pod restart? Maybe that's handled the next time the provisioner is used to try to do something with the host?
Elsewhere we say that after a host passes through the registration state it never goes back. I wonder if we need this check at all any more?
> Would this mean that a host that was successfully registered once is not re-registered if the database is wiped after a pod restart?
It means that we won't call actionRegistering below.
In the previous code, it meant we wouldn't flip back to the Registering state (which calls actionRegistering).
So the effect is the same, which is not to say that there isn't a bug. I suspect there are a lot of states where we don't re-register the host if the DB is dropped. (The ones that call Adopt() are the only ones where we do AFAIK.)
> Elsewhere we say that after a host passes through the registration state it never goes back.
This is more of an accounting thing. We do re-register below if the creds change, by doing the exact stuff that the Registering state does, but we don't change the provisioning state while we do it any more.
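For readers outside the review thread, here is a minimal sketch of the check being discussed, using stand-in types: the `GoodCredentials`, `Match`, and `actionRegistering` names come from the code above, while everything else is illustrative, not the actual baremetal-operator implementation.

```go
package main

import "fmt"

// Stand-in types for illustration only; the real structs live in the
// baremetal-operator API and host state machine packages.
type CredentialsSecret struct{ Name, Version string }

func (c CredentialsSecret) Match(other CredentialsSecret) bool {
	return c.Name == other.Name && c.Version == other.Version
}

type HostStatus struct {
	GoodCredentials CredentialsSecret
}

// ensureRegistered sketches the behaviour described above: if the credentials
// that last registered successfully still match the current secret, skip
// re-registration; otherwise do the same work actionRegistering would do,
// without changing the provisioning state.
func ensureRegistered(status *HostStatus, current CredentialsSecret) bool {
	if status.GoodCredentials.Match(current) {
		return false // already registered with these credentials
	}
	// ... re-register with the provisioner here, as actionRegistering does ...
	status.GoodCredentials = current
	return true
}

func main() {
	status := &HostStatus{GoodCredentials: CredentialsSecret{"bmc-secret", "1"}}
	fmt.Println(ensureRegistered(status, CredentialsSecret{"bmc-secret", "1"})) // false: no re-registration
	fmt.Println(ensureRegistered(status, CredentialsSecret{"bmc-secret", "2"})) // true: credentials changed
}
```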
    recordStateBegin(hsm.Host, metal3v1alpha1.StateRegistering, metav1.Now())
    if hsm.Host.Status.ErrorType == metal3v1alpha1.RegistrationError {
        if hsm.Host.Status.TriedCredentials.Match(*info.bmcCredsSecret) {
Maybe we don't need this test any more? Or do we need to wait for #610?
Yeah, I suspect the existence of this code is a mistake - registration could have failed for a reason other than the creds being wrong, so we probably shouldn't force the user to update them before we look again. #610 provides an exponential backoff on failure, which is what we really want.
This is bug-for-bug compatible with how it works now, so any changes should go in a separate PR.
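For context, a generic sketch of the exponential backoff behaviour mentioned above: the delay before the next retry grows with the number of consecutive failures, up to a cap. The base delay, cap, and function name below are illustrative assumptions, not what #610 actually implements.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// backoffDelay returns a retry delay that doubles with each consecutive
// failure and is bounded by maxDelay. The constants are illustrative.
func backoffDelay(errorCount int) time.Duration {
	const base = 10 * time.Second
	const maxDelay = 10 * time.Minute
	if errorCount < 1 {
		return 0
	}
	delay := time.Duration(float64(base) * math.Pow(2, float64(errorCount-1)))
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	for i := 1; i <= 8; i++ {
		fmt.Printf("after %d failure(s): wait %s\n", i, backoffDelay(i))
	}
}
```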
What will happen if we had a BMC connectivity issue for a brief moment, the BMH went to RegistrationError, and then after a while the connectivity was reinstated? The workflow here won't try to register again, since the change in scenario was related to BMC connectivity rather than BMC credentials. Am I understanding it right?
Yes, but with a caveat: it likely wouldn't have gone into RegistrationError before either - it could easily have been ProvisioningError or PowerManagementError, because we don't know the cause of the error, only what we were trying to do when it happened.
In those scenarios it's fine anyway, because we don't need to re-register (i.e. send new credentials to Ironic) to reconnect; a retry is enough. The theory behind this code is probably that the only circumstances we need to re-register is if the password is changed on the BMC, and in that case it won't work until the creds are updated in k8s anyway so there's no point retrying until they are.
The problem with that logic is that if we do change the credentials, and connectivity is interrupted while we are updating them, then we erroneously assume the failure was because the credentials are bad and don't try again. (The same would presumably apply if you change them in the 'wrong' order - in the k8s Secret first and later on the BMC.) If that proves to be the case, it's a (pre-existing) bug.
> In those scenarios it's fine anyway, because we don't need to re-register (i.e. send new credentials to Ironic) to reconnect; a retry is enough. The theory behind this code is probably that the only circumstances we need to re-register is if the password is changed on the BMC, and in that case it won't work until the creds are updated in k8s anyway so there's no point retrying until they are.
Does this mean that in the case of a BMC connectivity issue, no extra logic needs to be added to the host state machine and retry is already implemented? I am interested in this specific case where only BMC connectivity is lost and nothing is wrong with the credentials. I would like to know your opinion on that specific case, and what needs to be done to get the BMH out of RegistrationError @zaneb? Thanks!
> Does this mean that in the case of a BMC connectivity issue, no extra logic needs to be added to the host state machine and retry is already implemented?
Yes, as of #610 retry is implemented.
(Prior to that we in theory stopped reconciling after a registration error, and waited for some change to the Host or Secret. In practice I suspect that in many cases writes to the status would have triggered an immediate re-reconcile and resultant retry anyway.)
I would like to freeze Go code changes for a few days to try to land #650 without having to rebase it, because rebasing will mean redoing the work from scratch. /hold
#655 has merged. /hold cancel
This is going to need to be rebased, but the logic in this version looks good.
Changing the provisioning status is not how we actually indicate errors: we have the operational status for that. The existence of these states also has the effect of obscuring which state the Host was in (and hence what we were trying to do) when the error occurred. For example, a registration error can occur during provisioning. Remove the RegistrationError, ProvisioningError, and PowerManagementError states, in line with what the state machine diagram envisions, but also remove the Error states from the diagram as they are now orthogonal to the provisioning state.
The registration state of the Host is effectively orthogonal to the provisioning state so, instead of bouncing back to the Registering state whenever the credentials change, ensure the Host is registered with the current credentials prior to handling the current state. The Registering state remains as a transition state to the rest of the state machine - the Host will not exit the Registering state until it has been registered at least once.
Eliminate the unreachable return at the end of the function and use early exits only for cases where there is still work to continue.
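A rough sketch may help make that flow concrete. All names below are hypothetical stand-ins rather than the real baremetal-operator code: it only illustrates the shape described in the commit message, where registration is ensured before the provisioning state is handled and Registering remains only as an entry transition.

```go
package main

import "fmt"

// ProvisioningState is a stand-in for the real provisioning state type.
type ProvisioningState string

const (
	StateRegistering ProvisioningState = "registering"
	StateInspecting  ProvisioningState = "inspecting"
)

type Host struct {
	State           ProvisioningState
	RegisteredCreds string // credentials version last registered with the provisioner
}

// reconcile first makes sure registration is current, then handles whatever
// provisioning state the Host is in. Registering is only an entry transition:
// the Host leaves it once it has registered at least once and never returns,
// even if the credentials later change.
func reconcile(h *Host, credsVersion string) {
	if h.RegisteredCreds != credsVersion {
		fmt.Println("re-registering with current credentials")
		h.RegisteredCreds = credsVersion
	}
	if h.State == StateRegistering {
		h.State = StateInspecting
	}
	fmt.Println("handling state:", h.State)
}

func main() {
	h := &Host{State: StateRegistering}
	reconcile(h, "v1") // initial registration, then move out of Registering
	reconcile(h, "v1") // nothing to re-register
	reconcile(h, "v2") // credentials changed: re-register, state unchanged
}
```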
/test-integration |
This looks good to me. I don't want to approve it with 1 reviewer late on a Friday, so I'll leave it for others to review next week.
/approve
/cc @andfasano
        info.host.Status.HardwareDetails = details
        return actionComplete{}
    }
    info.host.ClearError()
It looks like sometimes we clear the error in the controller and sometimes in the state machine. That may be an area to clean up in a future PR.
ClearError is invoked alongside actionComplete{} or actionContinue (but not vice versa). Could it make sense to have a utility method to group those together, like recordActionFailure? See the sketch below.
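A minimal sketch of what such a utility might look like, with hypothetical stand-in types; no helper with this name exists in the code under review.

```go
package main

import "fmt"

// Hypothetical stand-ins for the controller's action results and host status,
// only to illustrate the helper suggested in the comment above.
type actionResult interface{ isActionResult() }

type actionComplete struct{}

func (actionComplete) isActionResult() {}

type host struct{ errorMessage string }

func (h *host) ClearError() { h.errorMessage = "" }

// actionCompleteClearingError pairs ClearError() with returning
// actionComplete{}, so the two always happen together - similar in spirit to
// how recordActionFailure groups the failure-path bookkeeping.
func actionCompleteClearingError(h *host) actionResult {
	h.ClearError()
	return actionComplete{}
}

func main() {
	h := &host{errorMessage: "registration failed"}
	result := actionCompleteClearingError(h)
	fmt.Printf("error=%q result=%T\n", h.errorMessage, result)
}
```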
IIRC I looked into doing the ClearError() in a single place in Reconcile() when we get the actionResult. However, it's not as consistent as you would hope so we have to keep doing it on a case-by-case basis.
The changes in this last commit are just refactoring to make the control flow of this action look like the others, so I wouldn't get too hung up on this particular part. If I'd known this PR was going to take a year I'd have just submitted this commit separately.
    // registered using the current BMC credentials, so we can move to the
    // next state. We will not return to the Registering state, even
    // if the credentials change and the Host must be re-registered.
    hsm.Host.ClearError()
See earlier comment about clearing errors.
This is a bit of a special case because we don't call an action*() function from this state any more. I'm not sure it's even necessary, but I wrote this patch a year ago so I might have known the reason then :D
    @@ -221,6 +217,20 @@ func TestErrorCountIncreasedWhenProvisionerFails(t *testing.T) {
        }
    }

    func TestErrorCountIncreasedWhenRegistrationFails(t *testing.T) {
I'm a little surprised at how few changes there are to the tests. Maybe that says more about our test coverage than this change, though.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dhellmann, zaneb
@zaneb @dhellmann I have a question regarding a specific case; even though I asked it earlier in this PR, I will ask it once again. When I last checked the code, we do not consider the BMC losing connectivity and the host going to the registration error state in host_state_machine; maybe we could add that case too? Although @zaneb answered why we do not need to re-register in that case and a simple retry is enough, would you mind explaining further what happens if even the retry fails? Or, simply put, will nodes get out of the error state with the currently implemented behaviour in BMO? I just would like to know your opinions on that. Thanks in advance!
@furkatgofurov7 let's open a separate issue to discuss it. The goal of this PR is to not change any behaviour other than the reported provisioning state, so it really has nothing to do with this, and any discussion here is going to get lost.
/lgtm Let's get this in and iterate if we need to.
We may need to re-register the Host at any point in time (whenever the credentials secret changes, for a start). Previously we did this by switching the provisioning state back to registering and then back to the current state when finished.
Since registering and provisioning are in effect orthogonal, separate the registration phase out from the provisioning state. Always ensure the Host registration is current before running the rest of the state machine. To enable this, the RegistrationError state is removed, along with the other error provisioning states. The OperationalStatus and ErrorType fields provide more reliable reporting of errors than those provisioning states could.
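A minimal sketch of that reporting split, using stand-in types and a hypothetical setErrorCondition helper rather than the real API: the provisioning state keeps saying what the host was doing, while OperationalStatus and ErrorType say whether and how that is failing.

```go
package main

import "fmt"

// Stand-in types; the field names mirror the description above, but the
// structs and helper here are illustrative assumptions.
type (
	ProvisioningState string
	OperationalStatus string
	ErrorType         string
)

const (
	StateProvisioning ProvisioningState = "provisioning"

	OperationalStatusOK    OperationalStatus = "OK"
	OperationalStatusError OperationalStatus = "error"

	RegistrationError ErrorType = "registration error"
)

type HostStatus struct {
	Provisioning      ProvisioningState
	OperationalStatus OperationalStatus
	ErrorType         ErrorType
	ErrorMessage      string
}

// setErrorCondition records the failure without touching the provisioning
// state, so it stays visible that the host was provisioning when the
// registration problem occurred.
func setErrorCondition(s *HostStatus, errType ErrorType, msg string) {
	s.OperationalStatus = OperationalStatusError
	s.ErrorType = errType
	s.ErrorMessage = msg
}

func main() {
	s := &HostStatus{Provisioning: StateProvisioning, OperationalStatus: OperationalStatusOK}
	setErrorCondition(s, RegistrationError, "BMC credentials rejected")
	fmt.Printf("state=%s operational=%s errorType=%s message=%q\n",
		s.Provisioning, s.OperationalStatus, s.ErrorType, s.ErrorMessage)
}
```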