
Add proposal for retry interface #171

Closed

Conversation

andfasano
Member

Draft proposal to address the problem discussed originally in metal3-io/baremetal-operator#798

metal3-io-bot added the do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) labels on Mar 17, 2021
andfasano changed the title from "WIP: Add proposal for resume interface" to "Add proposal for resume interface" on Mar 23, 2021
metal3-io-bot removed the do-not-merge/work-in-progress label on Mar 23, 2021
@andfasano
Member Author

cc @s3rj1k @dhellmann

@s3rj1k
Member

s3rj1k commented Mar 24, 2021

@andfasano IMO naming the annotation resume.metal3.io could bring some confusion with the pause annotation.
Someone without deep knowledge of the docs might assume that adding the resume annotation would also remove the pause annotation.

If the two annotations have no intended dependency and do not affect each other, I would recommend renaming resume.metal3.io to something else.

Other than the naming issue, this looks good and should solve the related issue without removing the error context, as you mentioned before.

design/baremetal-operator/resume-interface.md (outdated, resolved review threads)
The annotation must be consumed as soon as possible within the BMO reconcile
loop, and it should be removed as soon as the `Status.ErrorCount` is set to 1.
If the BMH resource is not in an error state (`Status.ErrorCount` already
set to zero) then the annotation must be removed and the event properly logged.
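
For illustration only, the consumption rule quoted above might look roughly like this inside the reconcile loop. This is a sketch, not the actual BMO code: the annotation key, helper name, and import path are assumptions (the naming was still under discussion in this PR).

```go
package annotations

import (
	"context"

	"github.com/go-logr/logr"
	// Assumed import path; it differs across baremetal-operator versions.
	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// retryAnnotation is a hypothetical key; the final name was never settled.
const retryAnnotation = "retry.metal3.io"

// handleRetryAnnotation consumes the annotation early in the reconcile loop:
// it resets Status.ErrorCount to 1 when the host is in error back-off, logs
// and ignores the request otherwise, and always removes the annotation.
func handleRetryAnnotation(ctx context.Context, c client.Client, host *metal3v1alpha1.BareMetalHost, log logr.Logger) error {
	if _, present := host.Annotations[retryAnnotation]; !present {
		return nil
	}
	if host.Status.ErrorCount > 1 {
		host.Status.ErrorCount = 1
		if err := c.Status().Update(ctx, host); err != nil {
			return err
		}
	} else {
		// Not in an error back-off: nothing to reset, just log and consume.
		log.Info("retry annotation ignored, host is not in error back-off", "host", host.Name)
	}
	delete(host.Annotations, retryAnnotation)
	return c.Update(ctx, host)
}
```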
Member

Is there a way to avoid a thundering herd bearing down on ironic if somebody does a bulk apply of this to all BMHs in a system at once?

Member

@zaneb doesn't PROVISIONING_LIMIT solve this?

Member Author

That check is limited to a subset of states, while the annotation trigger in the current proposal is not bound to any particular state. So I don't think we currently have a complete mechanism to prevent that, and moreover this could be a more general problem for any annotation that may request an Ironic action (see for example the inspect annotation, even though in that case it can only be triggered from the Ready state).
If it's OK, I could add this point to the risks paragraph.

andfasano changed the title from "Add proposal for resume interface" to "Add proposal for retry interface" on Mar 25, 2021
@metal3-io-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: andfasano
To complete the pull request process, please assign fmuyassarov
You can assign the PR to them by writing /assign @fmuyassarov in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

This proposal aims to introduce a new declarative API
to forcibly retry the reconciliation of a waiting BareMetalHost resource
(being in error state) when a fix is immediately available, so that it
could be possible to reduce the waiting time required
Member

I'm still not clear why we need a special API for this. Any edit to the host, including adding a temporary label, would trigger reconciliation.

Member

I guess the difference this makes is that if it fails again for a different (perhaps more easily recoverable) reason then the time to retry is shorter.
That said, I must confess to being somewhat sceptical of the value also.

Member Author

Yes, the key point of the proposal is resetting the ErrorCount to 1 - otherwise any edit would be fine for retriggering the reconciliation.

Member

Why do we want to artificially reset the error count? Just to shorten retry intervals if there's another failure?

Member Author

Yes (or another set of failures)

## Alternatives

Reducing the backoff gaps (or the upper limit) could mitigate the problem somewhat, as the
waiting time would be reduced
Member

Why not reduce the upper bound for the backoff?

Member

Another alternative: build a CLI util that resets the error count without making it part of the BMH API.

Member Author

Why not reduce the upper bound for the backoff?

I think it's listed as an alternative. maxBackOffCount could be reduced to cap the max waiting time after the host has reached that number of (failed) iterations. The current gap values can be checked here
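
For context, here is a self-contained sketch of the kind of capped exponential back-off being discussed. The constants, formula, and names are illustrative assumptions, not the actual values or code used by the baremetal-operator:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

const (
	baseDelay       = 10 * time.Second // hypothetical base delay
	maxBackOffCount = 10               // hypothetical cap; this is the knob the alternative would lower
)

// backoffFor doubles the delay with every failed iteration until the error
// count reaches maxBackOffCount, after which the delay stays flat.
func backoffFor(errorCount int) time.Duration {
	if errorCount > maxBackOffCount {
		errorCount = maxBackOffCount
	}
	return baseDelay * time.Duration(math.Exp2(float64(errorCount)))
}

func main() {
	for _, n := range []int{1, 5, 10, 20} {
		fmt.Printf("errorCount=%d -> wait %s\n", n, backoffFor(n))
	}
}
```

Lowering maxBackOffCount (or baseDelay) in such a scheme caps the worst-case waiting time without any API change.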

Member

Yes, it is listed as an alternative, but there is no reason given why it was rejected.

Member Author

Maybe I misunderstood the goal of the Alternatives section, but my intention was to list another approach so that it could be discussed. Personally I think it's still a potentially viable alternative, as it does not require explicit user intervention.

@andfasano
Member Author

Hi all,
I'd like to summarise the current situation in order to continue the discussion, if possible, and reach a conclusion. It looks like the consensus is not to change the API with an additional annotation, but rather to improve the retry mechanism where needed or to modify the upper limit to reduce the waiting time. The latter could slightly help in those situations where a resource with a high ErrorCount value receives an update request (triggered by a normal timeout or by an artificial change to the resource) and fails again - this is because ErrorCount is usually reset only on a successful state change. The advantage of such an approach is that it wouldn't require any change to the public API, and a user who wants an immediate re-evaluation could force it with a normal update.

@s3rj1k
Member

s3rj1k commented May 13, 2021

@andfasano If we go this way, maybe we could have some kind of CLI utility to patch up the spec, just in case the fix is needed ASAP?

#171 (comment)

cc: @zaneb

@andfasano
Member Author

@andfasano If we go this way, maybe we could have some kind of CLI utility to patch up the spec, just in case the fix is needed ASAP?

#171 (comment)

cc: @zaneb

As already mentioned here, I think that ErrorCount should be a field managed only by BMO, without any external intervention. If an intervention is required immediately, there is still the possibility of manually forcing an update.

@s3rj1k
Member

s3rj1k commented May 14, 2021

@andfasano Well, this is not community friendly; it will effectively force patching the operator in the field when fixing the error is not otherwise possible.

This is what triggered all of this discussion. If everybody is fine with that, let's close this and hope that nobody gets hurt by a broken production cluster.

@andfasano
Member Author

@andfasano Well, this is not community friendly; it will effectively force patching the operator in the field when fixing the error is not otherwise possible.

This is what triggered all of this discussion. If everybody is fine with that, let's close this and hope that nobody gets hurt by a broken production cluster.

TBH I'm not sure I understand the point exactly, given that I was originally fine with the proposal of lowering the ErrorCount (to 1) via an annotation, and have been actively trying to reach a solution. From a design point of view I simply don't agree with moving such an action outside the operator's scope (the manual update was mentioned just as a current workaround to avoid waiting for the next reconcile loop). I'm also fine with reducing the ErrorCount cap limit as an alternative approach, as it would allow the operator to re-evaluate a resource currently in error slightly more frequently.

@andfasano
Member Author


Just to be clear, after reviewing Zane's comment: if the consensus is to write a very tiny CLI tool that only resets the ErrorCount to 1 (while the error status is kept untouched, as mentioned in the current proposal), just as a sort of "safety valve" for unplanned situations, then it would be OK for me.

@zaneb
Member

zaneb commented May 17, 2021

I'd have thought the field recovery procedure would go like this:

  1. fix whatever external issue is causing the problem
  2. make literally any change to the BMH to retrigger reconciliation
  3. wait to see if it works
  4. if not, go to 1.

Resetting the error count can help if you want to skip step 3, but that hardly seems like an emergency worthy of recompiling a new BMO (which will in most cases take longer than waiting to see if it worked). Nevertheless, while I don't quite see the point, having a CLI tool that just does `if errorCount > 1 { errorCount = 1 }` seems mostly harmless. If somebody wrote that and wanted to submit it upstream I wouldn't object.
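
For reference, such a "safety valve" could be a small standalone tool along these lines. This is only a sketch, not an agreed design: it assumes the BareMetalHost status subresource is patchable with suitable RBAC and that the field is serialised as status.errorCount; the flag names and messages are made up.

```go
package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := flag.String("kubeconfig", "", "path to the kubeconfig file")
	namespace := flag.String("namespace", "metal3", "namespace of the BareMetalHost")
	name := flag.String("name", "", "name of the BareMetalHost")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{Group: "metal3.io", Version: "v1alpha1", Resource: "baremetalhosts"}
	hosts := dyn.Resource(gvr).Namespace(*namespace)

	bmh, err := hosts.Get(context.TODO(), *name, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	count, found, err := unstructured.NestedInt64(bmh.Object, "status", "errorCount")
	if err != nil || !found || count <= 1 {
		fmt.Println("errorCount is already <= 1, nothing to do")
		return
	}

	// "if errorCount > 1 { errorCount = 1 }", applied via the status subresource.
	patch := []byte(`{"status":{"errorCount":1}}`)
	if _, err := hosts.Patch(context.TODO(), *name, types.MergePatchType, patch, metav1.PatchOptions{}, "status"); err != nil {
		panic(err)
	}
	fmt.Printf("errorCount reset from %d to 1\n", count)
}
```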

@s3rj1k
Member

s3rj1k commented May 17, 2021

@zaneb Re: step 3 - I would need to wait ~5h for the reconcile to re-trigger, because by the time steps 1 and 2 are finished, the error back-off has already been triggered.

@dhellmann
Member

@zaneb Re: step 3 - I would need to wait ~5h for the reconcile to re-trigger, because by the time steps 1 and 2 are finished, the error back-off has already been triggered.

The error back-off is only applied when the operator tells controller-runtime to try again after a delay. It is not applied when reconciliation is triggered because the CR is edited. So, step 2 should trigger reconciliation immediately, without waiting for the error back-off. Is that not the behavior you see?
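
To illustrate that distinction, here is a toy controller-runtime reconciler, not the actual BMO one: the helper, the delay formula, and the import path are placeholders. The back-off only delays the retry the controller schedules for itself via RequeueAfter; an edit to the watched CR enqueues a fresh reconcile immediately.

```go
package sketch

import (
	"context"
	"time"

	// Assumed import path; it differs across baremetal-operator versions.
	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type hostReconciler struct {
	client.Client
}

// Reconcile runs on every event for a watched BareMetalHost. Returning
// Result{RequeueAfter: d} only postpones the retry this controller schedules
// for itself; a user edit to the CR triggers a new Reconcile call right away.
func (r *hostReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	host := &metal3v1alpha1.BareMetalHost{}
	if err := r.Get(ctx, req.NamespacedName, host); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	if err := doProvisioningStep(ctx, host); err != nil {
		// Error path: back off before the self-scheduled retry.
		delay := time.Duration(host.Status.ErrorCount+1) * time.Minute // placeholder, not BMO's formula
		return ctrl.Result{RequeueAfter: delay}, nil
	}
	return ctrl.Result{}, nil
}

// doProvisioningStep is a hypothetical stand-in for the real reconcile work.
func doProvisioningStep(ctx context.Context, host *metal3v1alpha1.BareMetalHost) error {
	return nil
}
```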

@s3rj1k
Member

s3rj1k commented May 17, 2021

@dhellmann No. Last time I checked, changing something in the BMH CR re-triggers a single reconcile loop that again ends up stuck at the back-off timer.

The only thing that forced a reset of the error count was removing the provisioning image URL, and that triggered deprovisioning.

Deprovisioning a perfectly healthy host (a working kube worker, but a broken BMH object) is a bit of a harsh solution.

So the idea emerged, after discussion in several different threads, to somehow fix the long back-off delay, either by resetting errorCount to 1 or by changing the back-off time.

@dhellmann
Member

@dhellmann No. Last time I checked, changing something in the BMH CR re-triggers a single reconcile loop that again ends up stuck at the back-off timer.

This feels like a bug. If we successfully reconcile the host, we should reset that back-off. @andfasano, what do you think?

So the idea emerged, after discussion in several different threads, to somehow fix the long back-off delay, either by resetting errorCount to 1 or by changing the back-off time.

We need to make the BMO manage that value correctly.

@s3rj1k
Member

s3rj1k commented May 17, 2021

@dhellmann

This feels like a bug. If we successfully reconcile the host, we should reset that back-off.

I do not think that the reconcile was successful; it was just stuck at back-off with no way of re-triggering a successful reconcile (only via deprovisioning).

I do believe that the root cause of that bug was fixed by metal3-io/baremetal-operator@c45ecfb.

But the back-off part is still the same, so this could re-emerge in some other unexpected place.

@zaneb
Member

zaneb commented May 17, 2021

After you trigger the reconcile, if another error happens it will indeed not retry a third time until the delay has expired (which will be a long time if the error count is high). If the problem is not fixed then it doesn't matter if you retry fast or slow because it will always fail - in fact slow is better because it wastes less resources. If this is caused by a bug in BMO, there is no substitute for fixing the bug or at least hacking around in ironic. If it is caused by some external condition (e.g. network unavailable) then that needs to be resolved before trying again.

However, if there is no error on the retry then it will proceed and eventually clear the error and the error count. If this doesn't happen it's a bug.

@andfasano
Member Author

After you trigger the reconcile, if another error happens it will indeed not retry a third time until the delay has expired (which will be a long time if the error count is high). If the problem is not fixed then it doesn't matter if you retry fast or slow because it will always fail - in fact slow is better because it wastes less resources. If this is caused by a bug in BMO, there is no substitute for fixing the bug or at least hacking around in ironic. If it is caused by some external condition (e.g. network unavailable) then that needs to be resolved before trying again.

However, if there is no error on the retry then it will proceed and eventually clear the error and the error count. If this doesn't happen it's a bug.

I think that @zaneb summarized the situation and the expected behavior very precisely - for when BMO reacts to a resource-related event. Given that, if there is a specific execution flow in BMO that doesn't properly reset the ErrorCount (upon success), it's definitely a bug and it should be addressed case by case.

Instead, a potentially interesting scenario addressed by the current proposal (or the CLI) is mainly the case where the fix for a resource (currently in an error state) does not generate an event (for example, a firmware update on the host). In that case it is necessary to wait for the natural timeout of the previously enqueued request for that host or, if you don't want to wait for that period, to artificially generate a new one to force a BMO reaction (falling back to the previously described situation).

@dhellmann
Member

@dhellmann

This feels like a bug. If we successfully reconcile the host, we should reset that back-off.

I do not think that the reconcile was successful; it was just stuck at back-off with no way of re-triggering a successful reconcile (only via deprovisioning).

I do believe that the root cause of that bug was fixed by metal3-io/baremetal-operator@c45ecfb.

But the back-off part is still the same, so this could re-emerge in some other unexpected place.

So the BMO correctly responded to an update event and reconciled. That reconciliation resulted in an error. The error count on the host was not decreased, and the BMO went back to waiting for the error back-off timeout.

That's the behavior I would expect when an error occurs. Did you expect something different?

@s3rj1k
Member

s3rj1k commented May 18, 2021

@dhellmann

That's the behavior I would expect when an error occurs. Did you expect something different?

I expected that baremetal-operator would allow me to try to fix the issue without waiting ~5h for the next reconcile :)

It is not always obvious why an error is occurring, so to try different fixes one would have to try different changes.

In the current situation the only way to re-trigger a single reconcile (without deprovisioning) is to use the pause annotation :)
Seems like a hack.

@zaneb
Member

zaneb commented May 18, 2021

@dhellmann

That's the behavior I would expect when an error occurs. Did you expect something different?

I expected that baremetal-operator would allow me to try to fix the issue without waiting ~5h for the next reconcile :)

But you hadn't fixed the issue yet. That's why it failed again.

Once you have fixed it, you can trigger another reconcile at any time by writing any change at all to the BMH.

You are always free to try to fix the issue at any time. The backoff time is only the amount of time that the baremetal-operator will wait before trying to fix the issue itself - in the dumbest way possible (blindly trying again). The time increases with each failure because you don't want it doing that in a tight loop.

It is not always obvious why an error is occurring, so to try different fixes one would have to try different changes.

That's fair, but you can always retrigger a reconcile after each try. It goes without saying that you are interactively debugging it at this point, so this is always an option and there's no real need to have it automatically retrying on a quicker schedule (and in fact it might be better if it doesn't, because that could interfere with your debugging attempts).

In the current situation the only way to re-trigger a single reconcile (without deprovisioning) is to use the pause annotation :)
Seems like a hack.

You can use any annotation you want to trigger a reconcile. It doesn't have to be one recognised by the baremetal-operator.

Furthermore, by bringing up deprovisioning you are conflating the original issue, which prevented the BareMetalHost running inspection and therefore left it totally unusable, with the provisioned state in which only power management errors can occur. The ErrorCount is cleared on any successful state transition, or any successful power management action in a provisioned/ready state, not only on deprovisioning.

We're going around in circles at this point. The exact nature of the problem in metal3-io/baremetal-operator#798 is lost in the mists of time. The root cause was addressed by metal3-io/baremetal-operator#817. Meanwhile, we have also merged metal3-io/baremetal-operator#840 which fixed a real bug with error handling during inspection that could easily have resulted in very confusing behaviour.

As far as we can establish, the error backoff is now working the way one would expect, and we have no reason to think that retriggering a reconcile would not allow a complete recovery from an error condition that has been corrected. If you see any further issues that are caused by unintended behaviour, please open another bug.

/close

@metal3-io-bot
Contributor

@zaneb: Closed this PR.


@s3rj1k
Member

s3rj1k commented May 18, 2021

@zaneb Fair, probably it is impossible to reproduce that state at this point.

Can we tweak the back-off time to make it not so long?

cc: @andfasano
