
Add proposal for retry interface #171

Closed

Conversation

andfasano
Member

Draft proposal to address the problem discussed originally in metal3-io/baremetal-operator#798

metal3-io-bot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) labels on Mar 17, 2021
andfasano changed the title from "WIP: Add proposal for resume interface" to "Add proposal for resume interface" on Mar 23, 2021
metal3-io-bot removed the do-not-merge/work-in-progress label on Mar 23, 2021
@andfasano
Member Author

cc @s3rj1k @dhellmann

@s3rj1k
Member

s3rj1k commented Mar 24, 2021

@andfasano IMO naming the annotation resume.metal3.io could bring some confusion with the pause annotation.
Someone without deep knowledge of the docs might assume that adding the resume annotation would also remove the pause annotation.

If the two annotations have no intended dependency and do not affect each other, I would recommend renaming resume.metal3.io to something else.

Other than the naming issue, this looks good and should solve the related issue without removing the error context, as you mentioned before.

design/baremetal-operator/resume-interface.md (outdated, resolved review threads)
The annotation must be consumed as soon as possible within the BMO reconcile
loop, and it should be removed as soon as the `Status.ErrorCount` is set to 1.
If the BMH resource is not in an error state (`Status.ErrorCount` already
set to zero) then the annotation must be removed and the event properly logged.
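
For illustration only, the consumption rule quoted above might look roughly like this inside the reconcile loop. This is a sketch, not the actual BMO code: the annotation key, helper name, and import path are assumptions (the naming was still under discussion in this PR).

```go
package annotations

import (
	"context"

	"github.com/go-logr/logr"
	// Assumed import path; it differs across baremetal-operator versions.
	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// retryAnnotation is a hypothetical key; the final name was never settled.
const retryAnnotation = "retry.metal3.io"

// handleRetryAnnotation consumes the annotation early in the reconcile loop:
// it resets Status.ErrorCount to 1 when the host is in error back-off, logs
// and ignores the request otherwise, and always removes the annotation.
func handleRetryAnnotation(ctx context.Context, c client.Client, host *metal3v1alpha1.BareMetalHost, log logr.Logger) error {
	if _, present := host.Annotations[retryAnnotation]; !present {
		return nil
	}
	if host.Status.ErrorCount > 1 {
		host.Status.ErrorCount = 1
		if err := c.Status().Update(ctx, host); err != nil {
			return err
		}
	} else {
		// Not in an error back-off: nothing to reset, just log and consume.
		log.Info("retry annotation ignored, host is not in error back-off", "host", host.Name)
	}
	delete(host.Annotations, retryAnnotation)
	return c.Update(ctx, host)
}
```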
Member

Is there a way to avoid a thundering herd bearing down on ironic if somebody does a bulk apply of this to all BMHs in a system at once?

Member

@zaneb doesn't PROVISIONING_LIMIT solve this?

Member Author

That check is limited to a subset of states, while the annotation trigger in the current proposal is not bound to any particular state. So I don't think we currently have a complete mechanism to prevent that, and moreover this could be a more general problem for any annotation that may request an Ironic action (see for example the inspect annotation, even though in that case it can only be triggered from the Ready state).
If it's OK, I could add this point to the risks paragraph.

andfasano changed the title from "Add proposal for resume interface" to "Add proposal for retry interface" on Mar 25, 2021
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andfasano
To complete the pull request process, please assign fmuyassarov
You can assign the PR to them by writing /assign @fmuyassarov in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This proposal aims to introduce a new declarative API
to forcibly retry the reconciliation of a waiting BareMetalHost resource
(being in error state) when a fix is immediately available, so that it
could be possible to reduce the waiting time required
Member

I'm still not clear why we need a special API for this. Any edit to the host, including adding a temporary label, would trigger reconciliation.

Member

I guess the difference this makes is that if it fails again for a different (perhaps more easily recoverable) reason then the time to retry is shorter.
That said, I must confess to being somewhat sceptical of the value also.

Member Author

Yes, the key point of the proposal is resetting the ErrorCount to 1 - otherwise any edit would be fine for retriggering the reconciliation.

Member

Why do we want to artificially reset the error count? Just to shorten retry intervals if there's another failure?

Member Author

Yes (or another set of failures)

## Alternatives

Reducing the backoff gaps (or the upper limit) could mitigate the problem somewhat, as the
waiting time would be reduced
Member

Why not reduce the upper bound for the backoff?

Member

Another alternative: build a CLI util that resets the error count without making it part of the BMH API.

Member Author

Why not reduce the upper bound for the backoff?

I think it's listed as an alternative. maxBackOffCount could be reduced to cap the max waiting time after the host has reached that number of (failed) iterations. The current gap values can be checked here
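
For context, here is a self-contained sketch of the kind of capped exponential back-off being discussed. The constants, formula, and names are illustrative assumptions, not the actual values or code used by the baremetal-operator:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	baseDelay       = 10 * time.Second // hypothetical base delay
	maxBackOffCount = 10               // hypothetical cap; this is the knob the alternative would lower
)

// backoffFor doubles the delay with every failed iteration until the error
// count reaches maxBackOffCount, after which the delay stays flat.
func backoffFor(errorCount int) time.Duration {
	if errorCount > maxBackOffCount {
		errorCount = maxBackOffCount
	}
	return baseDelay * time.Duration(math.Exp2(float64(errorCount)))
}

func main() {
	for _, n := range []int{1, 5, 10, 20} {
		fmt.Printf("errorCount=%d -> wait %s\n", n, backoffFor(n))
	}
}
```

Lowering maxBackOffCount (or baseDelay) in such a scheme caps the worst-case waiting time without any API change.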

Member

Yes, it is listed as an alternative, but there is no reason given why it was rejected.

Member Author

Maybe I misunderstood the goal of the Alternatives section, but my intention was to list another approach so that it could be discussed. Personally I think it's still a potentially viable alternative, as it does not require explicit user intervention.

@andfasano
Member Author

Hi all,
I'd like to summarise the current situation in order to continue the discussion, if possible, and reach a conclusion. It looks like the consensus is not to change the API with an additional annotation, but rather to improve the retry mechanism where needed or to modify the upper limit to reduce the waiting time. The latter could slightly help in those situations where a resource with a high ErrorCount value receives an update request (triggered by a normal timeout or by an artificial change to the resource) and fails again - this is because ErrorCount is usually reset only on a successful state change. The advantage of such an approach is that it wouldn't require any change to the public API, and a user who wants an immediate re-evaluation could force it with a normal update.

@s3rj1k
Member

s3rj1k commented May 13, 2021

@andfasano If we go this way, maybe we could have some kind of CLI utility to patch up the spec, just in case the fix is needed ASAP?

#171 (comment)

cc: @zaneb

@andfasano
Member Author

@andfasano If we go this way, maybe we could have some kind of CLI utility to patch up the spec, just in case the fix is needed ASAP?

#171 (comment)

cc: @zaneb

As already mentioned here, I think that ErrorCount should be a field managed only by BMO, without any external intervention. If an intervention is required immediately, there is still the possibility of manually forcing an update.

@s3rj1k
Member

s3rj1k commented May 14, 2021

@andfasano Well, this is not community friendly; it will effectively force patching the operator in the field when fixing the error is not otherwise possible.

This is what triggered all of this discussion. If everybody is fine with that, let's close this and hope that nobody gets hurt by a broken production cluster.

@andfasano
Member Author

@andfasano Well, this is not community friendly; it will effectively force patching the operator in the field when fixing the error is not otherwise possible.

This is what triggered all of this discussion. If everybody is fine with that, let's close this and hope that nobody gets hurt by a broken production cluster.

TBH I'm not sure I understand the point exactly, given that I was originally fine with the proposal of lowering the ErrorCount (to 1) via an annotation, and have been actively trying to reach a solution. From a design point of view I simply don't agree with moving such an action outside the operator's scope (the manual update was mentioned just as a current workaround to avoid waiting for the next reconcile loop). I'm also fine with reducing the ErrorCount cap limit as an alternative approach, as it would allow the operator to re-evaluate a resource currently in error slightly more frequently.

@andfasano
Member Author


Just to be clear, after reviewing Zane's comment: if the consensus is to write a very tiny CLI tool that only resets the ErrorCount to 1 (while the error status is kept untouched, as mentioned in the current proposal), just as a sort of "safety valve" for unplanned situations, then it would be OK for me.

@zaneb
Member

zaneb commented May 17, 2021

I'd have thought the field recovery procedure would go like this:

  1. fix whatever external issue is causing the problem
  2. make literally any change to the BMH to retrigger reconciliation
  3. wait to see if it works
  4. if not, go to 1.

Resetting the error count can help if you want to skip step 3, but that hardly seems like an emergency worthy of recompiling a new BMO (which will in most cases take longer than waiting to see if it worked). Nevertheless, while I don't quite see the point, having a CLI tool that just does `if errorCount > 1 { errorCount = 1 }` seems mostly harmless. If somebody wrote that and wanted to submit it upstream I wouldn't object.
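
For reference, such a "safety valve" could be a small standalone tool along these lines. This is only a sketch, not an agreed design: it assumes the BareMetalHost status subresource is patchable with suitable RBAC and that the field is serialised as status.errorCount; the flag names and messages are made up.

```go
package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to the kubeconfig file")
	namespace := flag.String("namespace", "metal3", "namespace of the BareMetalHost")
	name := flag.String("name", "", "name of the BareMetalHost")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{Group: "metal3.io", Version: "v1alpha1", Resource: "baremetalhosts"}
	hosts := dyn.Resource(gvr).Namespace(*namespace)

	bmh, err := hosts.Get(context.TODO(), *name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	count, found, err := unstructured.NestedInt64(bmh.Object, "status", "errorCount")
	if err != nil || !found || count <= 1 {
		fmt.Println("errorCount is already <= 1, nothing to do")
		return
	}

	// "if errorCount > 1 { errorCount = 1 }", applied via the status subresource.
	patch := []byte(`{"status":{"errorCount":1}}`)
	if _, err := hosts.Patch(context.TODO(), *name, types.MergePatchType, patch, metav1.PatchOptions{}, "status"); err != nil {
		panic(err)
	}
	fmt.Printf("errorCount reset from %d to 1\n", count)
}
```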

@s3rj1k
Member

s3rj1k commented May 17, 2021

@zaneb Re: step 3 - I would need to wait ~5h for the reconcile to re-trigger, because by the time steps 1 and 2 are finished, the error back-off has already been triggered.

@dhellmann
Member

@zaneb Re: step 3 - I would need to wait ~5h for the reconcile to re-trigger, because by the time steps 1 and 2 are finished, the error back-off has already been triggered.

The error back-off is only applied when the operator tells controller-runtime to try again after a delay. It is not applied when reconciliation is triggered because the CR is edited. So, step 2 should trigger reconciliation immediately, without waiting for the error back-off. Is that not the behavior you see?
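
To illustrate that distinction, here is a toy controller-runtime reconciler, not the actual BMO one: the helper, the delay formula, and the import path are placeholders. The back-off only delays the retry the controller schedules for itself via RequeueAfter; an edit to the watched CR enqueues a fresh reconcile immediately.

```go
package sketch

import (
	"context"
	"time"

	// Assumed import path; it differs across baremetal-operator versions.
	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type hostReconciler struct {
	client.Client
}

// Reconcile runs on every event for a watched BareMetalHost. Returning
// Result{RequeueAfter: d} only postpones the retry this controller schedules
// for itself; a user edit to the CR triggers a new Reconcile call right away.
func (r *hostReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	host := &metal3v1alpha1.BareMetalHost{}
	if err := r.Get(ctx, req.NamespacedName, host); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if err := doProvisioningStep(ctx, host); err != nil {
		// Error path: back off before the self-scheduled retry.
		delay := time.Duration(host.Status.ErrorCount+1) * time.Minute // placeholder, not BMO's formula
		return ctrl.Result{RequeueAfter: delay}, nil
	}
	return ctrl.Result{}, nil
}

// doProvisioningStep is a hypothetical stand-in for the real reconcile work.
func doProvisioningStep(ctx context.Context, host *metal3v1alpha1.BareMetalHost) error {
	return nil
}
```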

@s3rj1k
Member

s3rj1k commented May 17, 2021

@dhellmann No. Last time I checked, changing something in the BMH CR re-triggers a single reconcile loop that again ends up stuck at the back-off timer.

The only thing that forced a reset of the error count was removing the provisioning image URL, and that triggered deprovisioning.

Deprovisioning a perfectly healthy host (a working kube worker, but a broken BMH object) is a bit of a harsh solution.

So the idea emerged, after discussion in several different threads, to somehow fix the long back-off delay, either by resetting errorCount to 1 or by changing the back-off time.

@dhellmann
Member

@dhellmann No. Last time I checked, changing something in the BMH CR re-triggers a single reconcile loop that again ends up stuck at the back-off timer.

This feels like a bug. If we successfully reconcile the host, we should reset that back-off. @andfasano, what do you think?

So the idea emerged, after discussion in several different threads, to somehow fix the long back-off delay, either by resetting errorCount to 1 or by changing the back-off time.

We need to make the BMO manage that value correctly.

@s3rj1k
Member

s3rj1k commented May 17, 2021

@dhellmann

This feels like a bug. If we successfully reconcile the host, we should reset that back-off.

I do not think that the reconcile was successful; it was just stuck at back-off with no way of re-triggering a successful reconcile (only via deprovisioning).

I do believe that the root cause of that bug was fixed by metal3-io/baremetal-operator@c45ecfb.

But the back-off part is still the same, so this could re-emerge in some other unexpected place.

@zaneb
Member

zaneb commented May 17, 2021

After you trigger the reconcile, if another error happens it will indeed not retry a third time until the delay has expired (which will be a long time if the error count is high). If the problem is not fixed then it doesn't matter if you retry fast or slow because it will always fail - in fact slow is better because it wastes less resources. If this is caused by a bug in BMO, there is no substitute for fixing the bug or at least hacking around in ironic. If it is caused by some external condition (e.g. network unavailable) then that needs to be resolved before trying again.

However, if there is no error on the retry then it will proceed and eventually clear the error and the error count. If this doesn't happen it's a bug.

@andfasano
Member Author

After you trigger the reconcile, if another error happens it will indeed not retry a third time until the delay has expired (which will be a long time if the error count is high). If the problem is not fixed then it doesn't matter if you retry fast or slow because it will always fail - in fact slow is better because it wastes less resources. If this is caused by a bug in BMO, there is no substitute for fixing the bug or at least hacking around in ironic. If it is caused by some external condition (e.g. network unavailable) then that needs to be resolved before trying again.

However, if there is no error on the retry then it will proceed and eventually clear the error and the error count. If this doesn't happen it's a bug.

I think that @zaneb summarized the situation and the expected behavior very precisely - for when BMO reacts to a resource-related event. Given that, if there is a specific execution flow in BMO that doesn't properly reset the ErrorCount (upon success), it's definitely a bug and it should be addressed case by case.

Instead, a potentially interesting scenario addressed by the current proposal (or the CLI) is mainly the case where the fix for a resource (currently in an error state) does not generate an event (for example, a firmware update on the host). In that case it is necessary to wait for the natural timeout of the previously enqueued request for that host or, if you don't want to wait for that period, to artificially generate a new one to force a BMO reaction (falling back to the previously described situation).

@dhellmann
Member

@dhellmann

This feels like a bug. If we successfully reconcile the host, we should reset that back-off.

I do not think that the reconcile was successful; it was just stuck at back-off with no way of re-triggering a successful reconcile (only via deprovisioning).

I do believe that the root cause of that bug was fixed by metal3-io/baremetal-operator@c45ecfb.

But the back-off part is still the same, so this could re-emerge in some other unexpected place.

So the BMO correctly responded to an update event and reconciled. That reconciliation resulted in an error. The error count on the host was not decreased, and the BMO went back to waiting for the error back-off timeout.

That's the behavior I would expect when an error occurs. Did you expect something different?

@s3rj1k
Member

s3rj1k commented May 18, 2021

@dhellmann

That's the behavior I would expect when an error occurs. Did you expect something different?

I expected that baremetal-operator would allow me to try to fix the issue without waiting ~5h for the next reconcile :)

It is not always obvious why an error is occurring, so to try different fixes one would have to try different changes.

In the current situation the only way to re-trigger a single reconcile (without deprovisioning) is to use the pause annotation :)
Seems like a hack.

@zaneb
Member

zaneb commented May 18, 2021

@dhellmann

That's the behavior I would expect when an error occurs. Did you expect something different?

I expected that baremetal-operator would allow me to try to fix the issue without waiting ~5h for the next reconcile :)

But you hadn't fixed the issue yet. That's why it failed again.

Once you have fixed it, you can trigger another reconcile at any time by writing any change at all to the BMH.

You are always free to try to fix the issue at any time. The backoff time is only the amount of time that the baremetal-operator will wait before trying to fix the issue itself - in the dumbest way possible (blindly trying again). The time increases with each failure because you don't want it doing that in a tight loop.

It is not always obvious why an error is occurring, so to try different fixes one would have to try different changes.

That's fair, but you can always retrigger a reconcile after each try. It goes without saying that you are interactively debugging it at this point, so this is always an option and there's no real need to have it automatically retrying on a quicker schedule (and in fact it might be better if it doesn't, because that could interfere with your debugging attempts).

In the current situation the only way to re-trigger a single reconcile (without deprovisioning) is to use the pause annotation :)
Seems like a hack.

You can use any annotation you want to trigger a reconcile. It doesn't have to be one recognised by the baremetal-operator.

Furthermore, by bringing up deprovisioning you are conflating the original issue, which prevented the BareMetalHost running inspection and therefore left it totally unusable, with the provisioned state in which only power management errors can occur. The ErrorCount is cleared on any successful state transition, or any successful power management action in a provisioned/ready state, not only on deprovisioning.

We're going around in circles at this point. The exact nature of the problem in metal3-io/baremetal-operator#798 is lost in the mists of time. The root cause was addressed by metal3-io/baremetal-operator#817. Meanwhile, we have also merged metal3-io/baremetal-operator#840 which fixed a real bug with error handling during inspection that could easily have resulted in very confusing behaviour.

As far as we can establish, the error backoff is now working the way one would expect, and we have no reason to think that retriggering a reconcile would not allow a complete recovery from an error condition that has been corrected. If you see any further issues that are caused by unintended behaviour, please open another bug.

/close

@metal3-io-bot
Contributor

@zaneb: Closed this PR.


@s3rj1k
Member

s3rj1k commented May 18, 2021

@zaneb Fair, probably it is impossible to reproduce that state at this point.

Can we tweak the back-off time to make it not so long?

cc: @andfasano
