-
Notifications
You must be signed in to change notification settings - Fork 254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🐛 Fix HFC to execute updates #1793
Conversation
We probably want to to cherry-pick this to release-0.6 when done, if HFC doesn't work? |
@tuminoid correct |
Let's schedule the bot then to do the pick: |
@tuminoid: once the present PR merges, I will cherry-pick it on top of release-0.6 in a new PR and assign it to you. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Currently any changes in HostFirmwareComponents won't trigger the firmware update on it. This commit aims to fix the issues we have observed while testing with real hardware: - BMH stuck in preparing and executing the same firmware update multiple times - not being able to define the HostFirmwareComponent manually and have the update applied when the BMH is in preparing. Signed-off-by: Iury Gregory Melo Ferreira <imelofer@redhat.com>
Done, I've tested again in my setup before pushing the changes here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments, but clearly this is a huge improvement so it may be worth merging and following up on those afterwards.
@@ -1164,6 +1164,20 @@ func (r *BareMetalHostReconciler) actionPreparing(prov provisioner.Provisioner, | |||
return recordActionFailure(info, metal3api.PreparationError, provResult.ErrorMessage) | |||
} | |||
|
|||
if hfcDirty && started { | |||
hfcStillDirty, err := r.saveHostFirmwareComponents(prov, info, hfc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This saves the new status when we begin the manual cleaning, but I think I'm right in saying clearHostProvisioningSettings()
does not clear them? So if there's a failure in actually applying this change, we won't retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if there is a failure we shouldn't retry I would say, it can be a bad firmware (Dell for example has firmware separate for each model, if I use the firmware of an R750 in R640 it complains and fails)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or it could be a dropped connection or a failed write.
We shouldn't report that we've done something if we haven't done it.
If the user provides the wrong firmware we should keep trying until they realise and stop doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, unless we invent a way for Ironic to say "this will never work, don't try again", the current trend is to retry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if provResult.ErrorMessage != "" {
if bmhDirty {
info.log.Info("handling cleaning error in controller")
clearHostProvisioningSettings(info.host)
}
if hfcDitry {
clearHostFirmwareComponentsUpdates(hfc)
}
return recordActionFailure(info, metal3api.PreparationError, provResult.ErrorMessage)
}
Something like this? a new function clearHostFirmwareComponentsUpdates
to be used, since I don't think we should change clearHostProvisioningSettings
} | ||
|
||
// Retrieve new information about the firmware components stored in ironic | ||
components, err := prov.GetFirmwareComponents() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is only going to get called once, at the beginning of manual cleaning.
Aren't the Components expected to change in ironic only after manual cleaning is complete?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, this is the problem I'm planning to fix in a separate PR, still trying to figure out how
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Zane is right. But I'm also wondering why the HFC controller does not update the components afterwards.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, good point. I thought at one point we deleted that from the HFC controller, but indeed it's still there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, I've waited some hours to see if the HFC controller would attemp to update, but it didn't..,my best guess is that I need to change the conditions when calling updateHostFirmware
in the Reconcile
for HostFirmwareComponentsReconciler
@@ -1851,15 +1904,6 @@ func (r *BareMetalHostReconciler) getHostFirmwareComponents(info *reconcileInfo) | |||
|
|||
// Check if there are Updates in the Spec that are different than the Status | |||
if meta.IsStatusConditionTrue(hfc.Status.Conditions, string(metal3api.HostFirmwareComponentsChangeDetected)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory we should check that the condition matches the current Generation (example) so we know the data is not out of date.
Not a huge deal though. (Bigger worry is actually that we miss that the Spec has been copied to the Status - which won't bump the generation - which would result in us doing the update again.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will do some tests in the follow-up I'm working on.
/approve |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm fine with a follow-up. Thank you Iury.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dtantsur, zaneb The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@tuminoid: #1793 failed to apply on top of branch "release-0.6":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Currently any changes in HostFirmwareComponents won't trigger the firmware update on it.
This commit aims to fix the issues we have observed while testing with real hardware.
What this PR does / why we need it: metal3-io/metal3-docs#364