Add proposal for retry interface #171
Conversation
*(Branch force-pushed: b34b8bf → 026e107, then 026e107 → 147bad5.)*
@andfasano IMO the annotation naming could be clearer. If the two annotations have no intended dependency and do not affect each other, I would recommend renaming. Other than the naming issue, this looks good and should solve the related issue without removing error context, as you mentioned before.
> The annotation must be consumed as soon as possible within the BMO reconcile
> loop, and it should be removed as soon as the `Status.ErrorCount` is set to 1.
> If the BMH resource is not in an error state (`Status.ErrorCount` already
> set to zero) then the annotation must be removed and the event properly logged.
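For illustration, here is a minimal Go sketch of the behaviour described in the quoted text. The annotation key `retry.metal3.io` is a hypothetical placeholder (the excerpt does not fix a name), and the import path assumes the baremetal-operator `apis` module:

```go
package main

import (
	"fmt"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

// Hypothetical annotation key; the proposal excerpt above does not fix one.
const retryAnnotation = "retry.metal3.io"

// consumeRetryAnnotation sketches the quoted behaviour: drop the annotation
// immediately, and reset ErrorCount to 1 only if the host is in an error state.
func consumeRetryAnnotation(host *metal3v1alpha1.BareMetalHost) {
	if _, present := host.Annotations[retryAnnotation]; !present {
		return
	}
	delete(host.Annotations, retryAnnotation) // consume as early as possible
	if host.Status.ErrorCount > 0 {
		host.Status.ErrorCount = 1 // error message/type stay untouched
		return
	}
	// Host not in an error state: the annotation is removed and the event logged.
	fmt.Printf("retry annotation ignored for %s: host not in error state\n", host.Name)
}

func main() {
	host := &metal3v1alpha1.BareMetalHost{}
	host.Name = "node-0"
	host.Annotations = map[string]string{retryAnnotation: ""}
	host.Status.ErrorCount = 7
	consumeRetryAnnotation(host)
	fmt.Println("errorCount:", host.Status.ErrorCount) // prints 1
}
```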
Is there a way to avoid a thundering herd bearing down on ironic if somebody does a bulk apply of this to all BMHs in a system at once?
@zaneb doesn't `PROVISIONING_LIMIT` solve this?
That check is limited to a subset of states, while the annotation trigger in the current proposal is not bound to any state. So I don't think we currently have a complete mechanism to prevent that; moreover, this could be a more general problem for any annotation that may request an Ironic action (see for example the `inspect` annotation, even though in that case it can be triggered only from the `Ready` state).
If it's OK, I could add this point to the risks paragraph.
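To make the concern concrete, here is a hypothetical simplification (not the actual BMO logic) of a limiter consulted only in a subset of states, which annotation-triggered work would bypass:

```go
package main

import (
	"fmt"

	metal3v1alpha1 "github.com/metal3-io/baremetal-operator/apis/metal3.io/v1alpha1"
)

// Hypothetical simplification: the limit is only enforced for some states.
var limitedStates = map[metal3v1alpha1.ProvisioningState]bool{
	metal3v1alpha1.StateInspecting:   true,
	metal3v1alpha1.StateProvisioning: true,
}

func allowedToContactIronic(state metal3v1alpha1.ProvisioningState, inFlight, limit int) bool {
	if !limitedStates[state] {
		// A retry annotation handled here bypasses the limiter entirely, so a
		// bulk apply could hit Ironic all at once: the thundering herd above.
		return true
	}
	return inFlight < limit
}

func main() {
	fmt.Println(allowedToContactIronic(metal3v1alpha1.StateProvisioned, 50, 20))  // true: no limit applies
	fmt.Println(allowedToContactIronic(metal3v1alpha1.StateProvisioning, 50, 20)) // false: capped
}
```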
[APPROVALNOTIFIER] This PR is **NOT APPROVED**

This pull-request has been approved by: andfasano
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
*(Branch force-pushed: 87a83d1 → 66e3af0.)*
> This proposal aims to introduce a new declarative API
> to forcibly retry the reconciliation of a waiting BareMetalHost resource
> (being in an error state) when a fix is immediately available, so that it
> could be possible to reduce the waiting time required
I'm still not clear why we need a special API for this. Any edit to the host, including adding a temporary label, would trigger reconciliation.
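As an aside, a minimal sketch (not project tooling) of what such an edit could look like programmatically; the namespace, host name, and annotation key are placeholders:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Equivalent in spirit to: kubectl annotate bmh node-0 example.org/retrigger=now
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	bmhGVR := schema.GroupVersionResource{
		Group: "metal3.io", Version: "v1alpha1", Resource: "baremetalhosts",
	}
	// Any metadata change works; the annotation key here is arbitrary.
	patch := []byte(`{"metadata":{"annotations":{"example.org/retrigger":"now"}}}`)
	_, err = dyn.Resource(bmhGVR).Namespace("metal3").Patch(
		context.TODO(), "node-0", types.MergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("reconcile retriggered")
}
```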
I guess the difference this makes is that if it fails again for a different (perhaps more easily recoverable) reason then the time to retry is shorter.
That said, I must confess to being somewhat sceptical of the value also.
Yes, the key point of the proposal is resetting the `ErrorCount` to 1; otherwise any edit would be fine for retriggering the reconciliation.
Why do we want to artificially reset the error count? Just to shorten retry intervals if there's another failure?
Yes (or another set of failures)
> ## Alternatives
>
> Reducing the backoff gaps (or upper limit) could mitigate this somewhat, as the
> waiting time will be reduced
Why not reduce the upper bound for the backoff?
Another alternative: build a CLI util that resets the error count without making it part of the BMH API.
> Why not reduce the upper bound for the backoff?

I think it's listed as an alternative. `maxBackOffCount` could be reduced to cap the max waiting time after the host has reached that number of (failed) iterations. Current gap values can be checked here.
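For intuition, an illustrative sketch of a capped exponential back-off; the constants here are assumptions, and the code linked above is authoritative:

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// Assumed cap for illustration; the real value lives in the BMO source.
const maxBackOffCount = 10

// backoff doubles the delay with each failure until the count is capped,
// so lowering maxBackOffCount directly lowers the maximum waiting time.
func backoff(errorCount int) time.Duration {
	if errorCount > maxBackOffCount {
		errorCount = maxBackOffCount
	}
	return time.Duration(math.Exp2(float64(errorCount))) * time.Minute
}

func main() {
	for _, n := range []int{1, 5, 10, 20} {
		fmt.Printf("errorCount=%2d -> wait %v\n", n, backoff(n))
	}
}
```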
Yes, it is listed as an alternative, but there is no reason given about why it was rejected.
Maybe I misunderstood the goal of the `Alternatives` section, but my intention was to list another approach so that it could be discussed. Personally I think it's still a potentially viable alternative, as it does not require explicit user intervention.
Hi all,
@andfasano If we go this way, maybe we could have some kind of CLI utility to patch up the spec, just in case a fix is needed ASAP? cc: @zaneb
As already mentioned here, I think that the `ErrorCount` should be a field managed by BMO only, without any external intervention. If an intervention is required immediately, then there is still the possibility of manually forcing the update.
@andfasano Well, this is not community friendly; it effectively forces patching the operator. This is what triggered all of this discussion. If everybody is fine with that, let's close this and hope that nobody gets hurt by a broken production cluster.
TBH I'm not sure I understand the point exactly, given that I was originally fine with the proposal of lowering the `ErrorCount` (to 1) via annotation, and was actively trying to reach a solution. From a design point of view I simply don't agree with moving such an action outside the operator's scope (the manual update was mentioned just as a current workaround to avoid waiting for the next reconcile loop). I'm also fine with reducing the `ErrorCount` cap limit as an alternative approach, as it could allow the operator to re-evaluate resources currently in error slightly more frequently.
Just to be clear, after reviewing Zane's comment: if the consensus is on writing a very tiny CLI tool that only resets the `ErrorCount` to 1 (with the error status kept untouched, as mentioned in the current proposal), just as a sort of "safety valve" for unplanned situations, then it would be OK for me.
I'd have thought the field recovery procedure would go like this:
Resetting the error count can help if you want to skip step 3, but that hardly seems like an emergency worthy of recompiling a new BMO (which will in most cases take longer than waiting to see if it worked). Nevertheless, while I don't quite see the point, having a CLI tool that just does this would be fine.
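For the record, a sketch of what such a "safety valve" CLI could look like, assuming a merge patch on the `status` subresource is acceptable; the flag names and defaults are made up:

```go
package main

import (
	"context"
	"flag"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	ns := flag.String("namespace", "metal3", "BMH namespace")
	name := flag.String("name", "", "BMH name")
	flag.Parse()

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	gvr := schema.GroupVersionResource{
		Group: "metal3.io", Version: "v1alpha1", Resource: "baremetalhosts",
	}
	// Reset only the count; the error message/type stay untouched, per the proposal.
	patch := []byte(`{"status":{"errorCount":1}}`)
	_, err = dyn.Resource(gvr).Namespace(*ns).Patch(
		context.TODO(), *name, types.MergePatchType, patch,
		metav1.PatchOptions{}, "status")
	if err != nil {
		panic(err)
	}
	fmt.Printf("errorCount reset to 1 on %s/%s\n", *ns, *name)
}
```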
@zaneb The error back-off is only applied when the operator tells controller-runtime to try again after a delay. It is not applied when reconciliation is triggered because the CR is edited. So, step 2 should trigger reconciliation immediately, without waiting for the error back-off. Is that not the behavior you see?
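A simplified controller-runtime sketch of that distinction (not BMO's actual reconciler): the back-off exists only as a `RequeueAfter` returned from `Reconcile`, while an edit to the CR enqueues a fresh event immediately:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// Simplified illustration, not BMO's actual code.
type HostReconciler struct{}

func (r *HostReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	errorCount, actionErr := r.tryAction(ctx, req) // stand-in for the real host logic
	if actionErr != nil {
		// Only this return path applies the error back-off: controller-runtime
		// re-runs Reconcile after the delay. (min needs Go 1.21+.)
		delay := time.Duration(math.Exp2(float64(min(errorCount, 10)))) * time.Minute
		return ctrl.Result{RequeueAfter: delay}, nil
	}
	// An edit to the CR never waits on RequeueAfter: the watch delivers a
	// fresh event and Reconcile is called again right away.
	return ctrl.Result{}, nil
}

func (r *HostReconciler) tryAction(ctx context.Context, req ctrl.Request) (int, error) {
	return 5, errors.New("hypothetical provisioning failure")
}

func main() {
	r := &HostReconciler{}
	res, _ := r.Reconcile(context.Background(), ctrl.Request{})
	fmt.Println("requeue after:", res.RequeueAfter)
}
```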
@dhellmann No. Last time I checked, changing something in the BMH CR re-triggers a single reconcile loop that again gets stuck at the back-off timer at the end. The only thing that forced a reset of the error count was removing the provisioning image URL, and that triggered deprovisioning. Deprovisioning a perfectly healthy host (a working kube worker, but a broken BMH object) is a bit harsh as a solution. So the idea emerged, after discussion in several different threads, to somehow fix the long back-off delay, either by resetting errorCount to 1 or by changing the back-off time.
This feels like a bug. If we successfully reconcile the host, we should reset that back-off. @andfasano, what do you think?
We need to make the BMO manage that value correctly.
I do not think that the reconcile was successful; it was just stuck at back-off with no way of re-triggering a successful reconcile (only via deprovisioning). I do believe that the root cause of that bug was fixed by metal3-io/baremetal-operator@c45ecfb. But the back-off part is still the same, so this could re-emerge in some other unexpected place.
After you trigger the reconcile, if another error happens it will indeed not retry a third time until the delay has expired (which will be a long time if the error count is high). If the problem is not fixed then it doesn't matter whether you retry fast or slow, because it will always fail; in fact slow is better because it wastes fewer resources. If this is caused by a bug in BMO, there is no substitute for fixing the bug, or at least hacking around it in Ironic. If it is caused by some external condition (e.g. the network being unavailable) then that needs to be resolved before trying again. However, if there is no error on the retry then it will proceed and eventually clear the error and the error count. If this doesn't happen, it's a bug.
I think that @zaneb summarized the situation and the expected behavior very precisely for when BMO reacts to a resource-related event. Given that, if there is a specific execution flow in BMO that doesn't properly reset the `ErrorCount` upon success, it's definitely a bug and should be addressed case by case. Instead, a potentially interesting scenario addressed by the current proposal (or the CLI) is mainly when the fix for a resource currently in an error state does not generate an event (for example, a firmware update on the host). In that case it is necessary to wait for the natural timeout of the previously enqueued request for that host or, if you don't want to wait that long, artificially generate a new event to force a BMO reaction (falling back to the previously described situation).
So the BMO correctly responded to an update event and reconciled. That reconciliation resulted in an error. The error count on the host was not decreased, and the BMO went back to waiting for the error back-off timeout. That's the behavior I would expect when an error occurs. Did you expect something different?
I expected that baremetal-operator would allow me to try to fix the issue without waiting ~5h for the next reconcile :) It is not always obvious why an error is occurring, so to try different fixes one has to make different changes. In the current situation, the only way to re-trigger a single reconcile (without deprovisioning) is to use the pause annotation :)
But you hadn't fixed the issue yet. That's why it failed again. Once you have, you can trigger another reconcile at any time by writing any change at all to the BMH. You are always free to try to fix the issue at any time. The backoff time is only the amount of time that the baremetal-operator will wait before trying to fix the issue itself - in the dumbest way possible (blindly trying again). The time increases with each failure because you don't want it doing that in a tight loop.
That's fair, but you can always retrigger a reconcile after each try. It goes without saying that you are interactively debugging it at this point, so this is always an option and there's no real need to have it automatically retrying on a quicker schedule (and in fact it might be better if it doesn't, because that could interfere with your debugging attempts).
You can use any annotation you want to trigger a reconcile. It doesn't have to be one recognised by the baremetal-operator. Furthermore, by bringing up deprovisioning you are conflating the original issue, which prevented the BareMetalHost running inspection and therefore left it totally unusable, with the provisioned state, in which only power management errors can occur. The ErrorCount is cleared on any successful state transition, or any successful power management action in a provisioned/ready state, not only on deprovisioning.

We're going around in circles at this point. The exact nature of the problem in metal3-io/baremetal-operator#798 is lost in the mists of time. The root cause was addressed by metal3-io/baremetal-operator#817. Meanwhile, we have also merged metal3-io/baremetal-operator#840, which fixed a real bug with error handling during inspection that could easily have resulted in very confusing behaviour. As far as we can establish, the error backoff is now working the way one would expect, and we have no reason to think that retriggering a reconcile would not allow a complete recovery from an error condition that has been corrected. If you see any further issues that are caused by unintended behaviour, please open another bug.

/close
@zaneb: Closed this PR. In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@zaneb Fair, it is probably impossible to reproduce that state at this point. Can we tweak the back-off time to make it not so long? cc: @andfasano
Draft proposal to address the problem discussed originally in metal3-io/baremetal-operator#798