
Move from Available to Preparing if HostFirmwareSettings changed #1075

Merged
1 commit merged into metal3-io:main from the hfs-available-to-prepare branch on Feb 4, 2022

Conversation

bfournie (Member):

Currently, if the host is in the Available state and the HostFirmwareSettings (HFS) are changed, it does not transition to the Preparing state as it does when the BareMetalHost settings are changed. This change fixes that.

@metal3-io-bot added the size/M label on Jan 20, 2022.
@bfournie (Member, Author):

/test-v1b1-integration
/test-v1b1-centos-integration

@dtantsur (Member):

/lgtm

@metal3-io-bot added the lgtm label on Jan 21, 2022.
	// Check if hostFirmwareSettings have been changed
	dirty, _, _ := hsm.Reconciler.getHostFirmwareSettings(info)
	if dirty {
		hsm.NextState = metal3v1alpha1.StatePreparing
Member:

@bfournie it would be nice to have some coverage of this part (considering that a state change is a relevant action). As a good starting point, you could have a look at the baremetalhost controller test.

Member Author:

Good point, added a test for getHostFirmwareSettings()

@@ -417,6 +417,13 @@ func (hsm *hostStateMachine) handleAvailable(info *reconcileInfo) actionResult {
 	} else if dirty {
 		hsm.NextState = metal3v1alpha1.StatePreparing
 		return actionComplete{}
+	} else {
+		// Check if hostFirmwareSettings have been changed
+		dirty, _, _ := hsm.Reconciler.getHostFirmwareSettings(info)
Member:

I don't think we should ignore err here.

Member Author:

Fixed

+	} else {
Member:

nit: no need for an else here.

Member Author:

Fixed
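
Putting the two review comments together, the updated check would look roughly like this (a sketch, not the exact merged code; the wrapped error message text is illustrative):

	// Check if hostFirmwareSettings have been changed; handle the error
	// and avoid the redundant else branch.
	hfsDirty, _, err := hsm.Reconciler.getHostFirmwareSettings(info)
	if err != nil {
		return actionError{errors.Wrap(err, "could not determine updated firmware settings")}
	}
	if hfsDirty {
		hsm.NextState = metal3v1alpha1.StatePreparing
		return actionComplete{}
	}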

		prepareData.ActualFirmwareSettings = hfs.Status.Settings.DeepCopy()
		prepareData.TargetFirmwareSettings = hfs.Spec.Settings.DeepCopy()
	}

-	provResult, started, err := prov.Prepare(prepareData, dirty)
+	provResult, started, err := prov.Prepare(prepareData, bmhDirty || hfsDirty)
Member:

This is a change in behaviour... IIRC the flag causes us to start a new cleaning.

Previously, we kept passing true until it returned started=true, at which point we saved the provisioning settings and bmhDirty became false. It looks to me like hfsDirty will remain true even after cleaning has started? I fear this will mess with the error handling.

Member Author:

Good point. I added another condition to the HFS to track whether cleaning has started, which allows the error to be returned if Ironic fails the cleaning step. As we discussed, using a condition avoids needing to add another field to the CRD.
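
As a rough illustration of that approach (the condition type, reason, and message below are hypothetical, not necessarily what the PR uses), the controller could record a condition on the HostFirmwareSettings status when cleaning starts:

	// Hypothetical sketch: mark that manual cleaning has started as a condition on
	// the HostFirmwareSettings resource, using the standard apimachinery helper
	// (meta is k8s.io/apimachinery/pkg/api/meta, metav1 is k8s.io/apimachinery/pkg/apis/meta/v1).
	meta.SetStatusCondition(&hfs.Status.Conditions, metav1.Condition{
		Type:               "UpdateInProgress", // illustrative condition type
		Status:             metav1.ConditionTrue,
		Reason:             "CleaningStarted",
		Message:            "manual cleaning has started for the requested firmware settings",
		ObservedGeneration: hfs.Generation,
	})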

@metal3-io-bot added the size/L label and removed the lgtm and size/M labels on Jan 26, 2022.
@bfournie (Member, Author):

/test-v1b1-centos-integration
/test-v1b1-integration

@bfournie (Member, Author):

/hold

@metal3-io-bot added the do-not-merge/hold label on Jan 26, 2022.
@bfournie (Member, Author):

/hold cancel
/test-v1b1-centos-integration
/test-v1b1-integration

@metal3-io-bot removed the do-not-merge/hold label on Jan 26, 2022.
@@ -1310,7 +1310,8 @@ func (p *ironicProvisioner) Prepare(data provisioner.PrepareData, unprepared boo
 	// When clean failed, we need to clean host provisioning settings.
 	// If unprepared is false, means the settings aren't cleared.
 	// So we can't set the node's state to manageable, until the settings are cleared.
-	if !unprepared {
+	// Also check for late binding resources in which the settings are cleared after cleaning
Member:

I don't understand what this means.

+	// Also check for late binding resources in which the settings are cleared after cleaning
+	if !unprepared || lateBinding {
zaneb (Member), Jan 26, 2022:

I think this has the opposite problem: if a previous attempt to change only the HFS (i.e. the BMH settings are not dirty) failed and we recorded that error, then when the controller retries we'll just report this error straight away and never move back to manageable. (The controller will clear the firmware settings from the BMH, meaning unprepared will now be true, but lateBinding will also remain true, so we'll always take this path.)

I'm thinking we could name this new flag force (for consistency with other Provisioner APIs), make the check here if !force, and pass bmhDirty for force:

prov.Prepare(prepareData, bmhDirty || hfsDirty, bmhDirty)

(this patch is the equivalent of passing bmhDirty && !hfsDirty for force)

Member:

I guess my suggestion relies on us being able to distinguish between the existing firmware/RAID settings having been cleared due to an error, and them having been set to nothing due to nothing being specified in the Spec. I don't think this is the case, because otherwise it would trigger manual cleaning on every host.

Member Author:

Yes, the fix to use bmhDirty for force works. I was able to simulate a cleaning failure when changing the HFS settings and get the error back, and I was also able to recover after fixing the situation that caused the cleaning failure:

bmh/ostest-worker-1 -n openshift-machine-api
NAME              STATE       CONSUMER                      ONLINE   ERROR   AGE
ostest-worker-1   preparing   ostest-qdqsj-worker-0-cww5k   false            78m
 'apply_configuration', 'abortable': False, 'priority': 0}: Redfish exception occurred. Error: Redfish BIOS apply configuration failed for node 5c2bb81b-6736-43f2-a412-1c021aa76d5a. Error: HTTP PATCH http://192.168.111.1:8000/redfish/v1/Systems/193c3aa9-1836-4575-b52c-b6359c1cd38e/BIOS/Settings returned code 500. Base.1.0.GeneralError: cannot serialize 11 (type int) Extended information: [{'@odata.type': '/redfish/v1/$metadata#Message.1.0.0.Message', 'MessageId': 'Base.1.0.GeneralError'}]

Thanks

@bfournie (Member, Author):

/test-v1b1-centos-integration
/test-v1b1-integration

zaneb (Member) left a comment:

I finally figured out what to do with the force flag (see inline).

My remaining worry is that the started flag is going to continue to be false until the HFS reports that the settings are updated. If there is any chance that this controller sees the ironic Node return to manageable state while the HFS is still saying the settings differ (note that we may be reading from an outdated cache), then we will embark on an unnecessary second round of manual cleaning. Is there any mechanism to prevent this?

		prepareData.ActualFirmwareSettings = hfs.Status.Settings.DeepCopy()
		prepareData.TargetFirmwareSettings = hfs.Spec.Settings.DeepCopy()
	}

-	provResult, started, err := prov.Prepare(prepareData, dirty)
+	provResult, started, err := prov.Prepare(prepareData, bmhDirty || hfsDirty, bmhDirty)
Member:

I just realised I've been massively overthinking this.

The force flag now works exactly the same as all the other Provisioner APIs that have one. So we can pass the equivalent thing, which would be info.host.Status.ErrorType == metal3v1alpha1.PreparationError.

I think passing bmhDirty has a bug in the case where we are retrying but there is no difference between the Firmware/RAID settings in the BMH Spec and the empty ones in the Status.
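
In other words, the call would end up looking something like this (a sketch of the suggestion; the merged code may differ):

	provResult, started, err := prov.Prepare(prepareData,
		bmhDirty || hfsDirty,
		info.host.Status.ErrorType == metal3v1alpha1.PreparationError)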

@bfournie (Member, Author):

/test-v1b1-centos-integration
/test-v1b1-integration

		info.log.Info("saving host provisioning settings")
		_, err := saveHostProvisioningSettings(info.host)
		if err != nil {
			return actionError{errors.Wrap(err, "could not save the host provisioning settings")}
		}
	}

	updateErrMsg := false
Member:

This is not entirely correct, because if there wasn't a previous error but we did save the host provisioning settings, then we also need to write. This should probably be something like:

needsUpdate := bmhDirty && started

Member Author:

Ah, I see what you mean; fixed to restore the initial approach.

@@ -43,7 +43,7 @@ import (
 const (
 	provisionerRetryDelay          = time.Second * 30
 	resourceNotAvailableRetryDelay = time.Second * 30
-	reconcilerRequeueDelay         = time.Minute * 5
+	reconcilerRequeueDelay         = time.Minute * 1
Member:

Maybe we should consider reducing this only when FirmwareSettingsChangeDetected is true?

Member Author:

That's a really good idea. Changed.
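
A minimal sketch of that idea (the condition name and the shorter-delay constant are assumptions for illustration; meta is k8s.io/apimachinery/pkg/api/meta, ctrl is sigs.k8s.io/controller-runtime):

	// Requeue sooner only while a firmware settings change is still pending.
	delay := reconcilerRequeueDelay // the longer default, e.g. time.Minute * 5
	if meta.IsStatusConditionTrue(hfs.Status.Conditions, "ChangeDetected") { // condition name assumed
		delay = reconcilerRequeueDelayChangeDetected // hypothetical shorter constant, e.g. time.Minute * 1
	}
	return ctrl.Result{RequeueAfter: delay}, nil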

@bfournie (Member, Author):

/test-v1b1-centos-integration
/test-v1b1-centos-integration

@zaneb (Member) commented Jan 28, 2022:

/lgtm

@metal3-io-bot added the lgtm label on Jan 28, 2022.
@bfournie (Member, Author):

/test-v1b1-integration

@metal3-io-bot removed the lgtm label on Feb 3, 2022.
Commit pushed: Currently, if the host is in the Available state and the HFS settings are changed, it doesn't transition to the Preparing state as it would when the BareMetalHost settings are changed. This change fixes that.
@andfasano (Member):

/test-v1b1-centos-integration
/test-v1b1-centos-integration

@bfournie (Member, Author) commented Feb 3, 2022:

/test-v1b1-integration

@andfasano (Member):

/approve

@metal3-io-bot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano, zhouhao3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@metal3-io-bot added the approved label on Feb 4, 2022.
@dtantsur (Member) commented Feb 4, 2022:

Not sure I like the force vs unprepared... but let it be for now.

/lgtm

@metal3-io-bot added the lgtm label on Feb 4, 2022.
@furkatgofurov7 (Member):

/test-v1b1-centos-integration
/test-v1b1-integration

@metal3-io-bot merged commit 04faa1c into metal3-io:main on Feb 4, 2022.
@bfournie deleted the hfs-available-to-prepare branch on February 4, 2022 at 16:28.
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.