Move from Available to Preparing if HostFirmwareSettings changed #1075
Conversation
/test-v1b1-integration

/lgtm
```go
// Check if hostFirmwareSettings have been changed
dirty, _, _ := hsm.Reconciler.getHostFirmwareSettings(info)
if dirty {
	hsm.NextState = metal3v1alpha1.StatePreparing
```
@bfournie it would be nice to have some coverage on this part (considering that a state change is a relevant action). As a good starting point for that, you could have a look at the baremetalhost controller tests.
Good point, added a test for getHostFirmwareSettings()
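To make the suggested coverage concrete, here is a minimal, self-contained sketch of a table-driven test for the Available-to-Preparing transition. The types and helper in it (`state`, `fakeStateMachine`, `handleAvailable`) are hypothetical stand-ins, not the actual baremetal-operator state machine or the test that was added:

```go
// A stand-in sketch of the transition under discussion: if the host firmware
// settings are dirty, the state machine should move to Preparing.
package statemachine_test

import "testing"

type state string

const (
	stateAvailable state = "available"
	statePreparing state = "preparing"
)

type fakeStateMachine struct {
	current  state
	hfsDirty bool
}

func (m *fakeStateMachine) handleAvailable() state {
	if m.hfsDirty {
		return statePreparing
	}
	return m.current
}

func TestAvailableToPreparingOnHFSChange(t *testing.T) {
	cases := []struct {
		name     string
		hfsDirty bool
		want     state
	}{
		{"settings changed", true, statePreparing},
		{"settings unchanged", false, stateAvailable},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			m := &fakeStateMachine{current: stateAvailable, hfsDirty: tc.hfsDirty}
			if got := m.handleAvailable(); got != tc.want {
				t.Errorf("handleAvailable() = %q, want %q", got, tc.want)
			}
		})
	}
}
```

A real test would exercise hostStateMachine.handleAvailable with a reconciler whose getHostFirmwareSettings reports dirty settings, but the table-driven shape is the same.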
```diff
@@ -417,6 +417,13 @@ func (hsm *hostStateMachine) handleAvailable(info *reconcileInfo) actionResult {
 	} else if dirty {
 		hsm.NextState = metal3v1alpha1.StatePreparing
 		return actionComplete{}
+	} else {
+		// Check if hostFirmwareSettings have been changed
+		dirty, _, _ := hsm.Reconciler.getHostFirmwareSettings(info)
```
I don't think we should ignore err here.
Fixed
```diff
@@ -417,6 +417,13 @@ func (hsm *hostStateMachine) handleAvailable(info *reconcileInfo) actionResult {
 	} else if dirty {
 		hsm.NextState = metal3v1alpha1.StatePreparing
 		return actionComplete{}
+	} else {
```
nit: no need for an else here.
Fixed
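Taken together, the two review comments (handle the error, drop the redundant else) point at something roughly like the following; this is only a sketch reusing the names from the diff above, not the code that was merged:

```go
// Check if hostFirmwareSettings have been changed; do not discard the error.
hfsDirty, _, err := hsm.Reconciler.getHostFirmwareSettings(info)
if err != nil {
	return actionError{errors.Wrap(err, "could not check host firmware settings")}
}
if hfsDirty {
	hsm.NextState = metal3v1alpha1.StatePreparing
	return actionComplete{}
}
```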
```diff
 	prepareData.ActualFirmwareSettings = hfs.Status.Settings.DeepCopy()
 	prepareData.TargetFirmwareSettings = hfs.Spec.Settings.DeepCopy()
 }

-provResult, started, err := prov.Prepare(prepareData, dirty)
+provResult, started, err := prov.Prepare(prepareData, bmhDirty || hfsDirty)
```
This is a change in behaviour... IIRC the flag causes us to start a new cleaning.
Previously we kept passing true until Prepare returned started=true, at which point we saved the provisioning settings and bmhDirty became false.
It looks to me like hfsDirty will continue to be true even after cleaning is started? I fear this will mess with the error handling.
Good point. I added another condition to the HFS to track whether cleaning has started, so the error can still be returned if Ironic fails the cleaning step. As we discussed, using a condition avoids having to add another field to the CRD.
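As a rough illustration of the approach described here (not the actual change), marking and checking such a condition on the HostFirmwareSettings status could look like the sketch below; the condition type name `CleaningStarted` and the helper functions are hypothetical:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const condCleaningStarted = "CleaningStarted"

// markCleaningStarted records on the HFS status that manual cleaning is
// underway, so a still-dirty spec no longer forces a fresh clean.
func markCleaningStarted(conditions *[]metav1.Condition, generation int64) {
	meta.SetStatusCondition(conditions, metav1.Condition{
		Type:               condCleaningStarted,
		Status:             metav1.ConditionTrue,
		ObservedGeneration: generation,
		Reason:             "CleaningStarted",
		Message:            "manual cleaning was started for the pending firmware settings",
	})
}

// cleaningAlreadyStarted is what the state machine would consult before
// deciding whether a dirty HFS should trigger another clean.
func cleaningAlreadyStarted(conditions []metav1.Condition) bool {
	return meta.IsStatusConditionTrue(conditions, condCleaningStarted)
}

func main() {
	var conditions []metav1.Condition
	markCleaningStarted(&conditions, 1)
	fmt.Println("cleaning already started:", cleaningAlreadyStarted(conditions))
}
```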
force-pushed from 281b7f5 to 61a7fde
/test-v1b1-centos-integration

/hold
force-pushed from 61a7fde to 4e15861
/hold cancel
pkg/provisioner/ironic/ironic.go (outdated)
```diff
@@ -1310,7 +1310,8 @@ func (p *ironicProvisioner) Prepare(data provisioner.PrepareData, unprepared bool ...
 	// When clean failed, we need to clean host provisioning settings.
 	// If unprepared is false, means the settings aren't cleared.
 	// So we can't set the node's state to manageable, until the settings are cleared.
-	if !unprepared {
+	// Also check for late binding resources in which the settings are cleared after cleaning
```
I don't understand what this means.
pkg/provisioner/ironic/ironic.go (outdated)
```diff
@@ -1310,7 +1310,8 @@ func (p *ironicProvisioner) Prepare(data provisioner.PrepareData, unprepared bool ...
 	// When clean failed, we need to clean host provisioning settings.
 	// If unprepared is false, means the settings aren't cleared.
 	// So we can't set the node's state to manageable, until the settings are cleared.
-	if !unprepared {
+	// Also check for late binding resources in which the settings are cleared after cleaning
+	if !unprepared || lateBinding {
```
I think this has the opposite problem: if a previous attempt to change only the HFS (i.e. the BMH settings are not dirty) failed and we recorded that error, then when the controller retries we'll just report this error straight away and never move back to manageable. (The controller will clear the firmware settings from the BMH, meaning unprepared will now be true, but lateBinding will also remain true, so we'll always take this path.)
I'm thinking we could name this new flag force (for consistency with other Provisioner APIs), check if !force here, and pass bmhDirty for force:
prov.Prepare(prepareData, bmhDirty || hfsDirty, bmhDirty)
(this patch is the equivalent of passing bmhDirty && !hfsDirty for force)
I guess my suggestion relies on us being able to distinguish between the existing firmware/RAID settings having been cleared due to an error, and them having been set to nothing due to nothing being specified in the Spec. I don't think this is the case, because otherwise it would trigger manual cleaning on every host.
Yes, the fix to use bmhDirty for force works. I was able to simulate a cleaning failure when changing the HFS settings and get the error back, and I was also able to recover after fixing the situation that caused the cleaning failure:
```
bmh/ostest-worker-1 -n openshift-machine-api
NAME              STATE       CONSUMER                      ONLINE   ERROR   AGE
ostest-worker-1   preparing   ostest-qdqsj-worker-0-cww5k   false            78m

'apply_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: Redfish BIOS apply configuration failed for node 5c2bb81b-6736-43f2-a412-1c021aa76d5a. Error: HTTP PATCH http://192.168.111.1:8000/redfish/v1/Systems/193c3aa9-1836-4575-b52c-b6359c1cd38e/BIOS/Settings returned code 500. Base.1.0.GeneralError: cannot serialize 11 (type int) Extended information: [{'@odata.type': '/redfish/v1/$metadata#Message.1.0.0.Message', 'MessageId': 'Base.1.0.GeneralError'}]
```
Thanks
force-pushed from 4e15861 to 2bd2291
/test-v1b1-centos-integration
I finally figured out what to do with the force flag (see inline).
My remaining worry is that the started flag is going to continue to be false until the HFS reports that the settings are updated. If there is any chance that this controller sees the ironic Node return to manageable state while the HFS is still saying the settings differ (note that we may be reading from an outdated cache), then we will embark on an unnecessary second round of manual cleaning. Is there any mechanism to prevent this?
```diff
 	prepareData.ActualFirmwareSettings = hfs.Status.Settings.DeepCopy()
 	prepareData.TargetFirmwareSettings = hfs.Spec.Settings.DeepCopy()
 }

-provResult, started, err := prov.Prepare(prepareData, dirty)
+provResult, started, err := prov.Prepare(prepareData, bmhDirty || hfsDirty, bmhDirty)
```
I just realised I've been massively overthinking this.
The force flag now works exactly the same as all the other Provisioner APIs that have one. So we can pass the equivalent thing, which would be info.host.Status.ErrorType == metal3v1alpha1.PreparationError.
I think passing bmhDirty has a bug in the case where we are retrying but there is no difference between the Firmware/RAID settings in the BMH Spec and the empty ones in the Status.
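Spelled out at the call site from the diff above, that suggestion would look roughly like this (a sketch using the diff's variable names, not the merged code):

```go
// Force a new round of cleaning only when the previous attempt recorded a
// preparation error, rather than whenever the BMH settings are dirty.
force := info.host.Status.ErrorType == metal3v1alpha1.PreparationError
provResult, started, err := prov.Prepare(prepareData, bmhDirty || hfsDirty, force)
```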
force-pushed from 2bd2291 to b854535
/test-v1b1-centos-integration
```diff
 		info.log.Info("saving host provisioning settings")
 		_, err := saveHostProvisioningSettings(info.host)
 		if err != nil {
 			return actionError{errors.Wrap(err, "could not save the host provisioning settings")}
 		}
 	}

+	updateErrMsg := false
```
This is not entirely correct, because if there wasn't a previous error but we did save the host provisioning settings, then we also need to write. This should probably be something like:
needsUpdate := bmhDirty && started
Ah, I see what you mean; fixed it to restore the original approach.
```diff
@@ -43,7 +43,7 @@ import (
 const (
 	provisionerRetryDelay          = time.Second * 30
 	resourceNotAvailableRetryDelay = time.Second * 30
-	reconcilerRequeueDelay         = time.Minute * 5
+	reconcilerRequeueDelay         = time.Minute * 1
```
Maybe we should consider reducing this only when FirmwareSettingsChangeDetected is true?
That's a really good idea. Changed.
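A standalone sketch of what shortening the requeue interval only while a firmware settings change is pending might look like; the condition type name follows the comment above, and the constants and helper are illustrative rather than the merged code:

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Default requeue interval, matching the original 5 minute value.
	defaultRequeueDelay = time.Minute * 5
	// Shorter interval used only while firmware settings changes are pending.
	hfsChangeRequeueDelay = time.Minute * 1
)

// requeueDelay picks the shorter interval only when the (hypothetical)
// FirmwareSettingsChangeDetected condition is true on the HFS resource.
func requeueDelay(hfsConditions []metav1.Condition) time.Duration {
	if meta.IsStatusConditionTrue(hfsConditions, "FirmwareSettingsChangeDetected") {
		return hfsChangeRequeueDelay
	}
	return defaultRequeueDelay
}

func main() {
	pending := []metav1.Condition{{
		Type:   "FirmwareSettingsChangeDetected",
		Status: metav1.ConditionTrue,
	}}
	fmt.Println(requeueDelay(nil), requeueDelay(pending))
}
```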
force-pushed from b854535 to a913e48
/test-v1b1-centos-integration

/lgtm

/test-v1b1-integration
force-pushed from a913e48 to 0c671f9
Currently, if the host is in the Available state and the HFS settings are changed, it doesn't transition to the Preparing state as it would when the BareMetalHost settings are changed. This change fixes that.
force-pushed from 0c671f9 to cd3340b
/test-v1b1-centos-integration

/test-v1b1-integration

/approve
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, zhouhao3

The full list of commands accepted by this bot can be found here. The pull request process is described here.
Not sure I like the /lgtm
/test-v1b1-centos-integration