Crash the whole operator on unrecoverable errors in watchers/workers #509
What do these changes do?
When a fatal error happens in the operator's watching, queueing, multiplexing, or processing (including API PATCH'ing), stop the whole operator instead of ignoring the error and continuing.
Description
This issue was detected in an incident where a PATCH request failed with HTTP 422 "Unprocessable Entity" (#346). Instead of stopping or slowing down its attempts, the operator kept re-handling the resource at 1-2 attempts per second.
On a wider scope, if anything goes wrong in the top-level processing, i.e. before the handlers are reached (the handlers have their own error handling and backoff intervals), then crash the whole operator and let Kubernetes deal with the broken pod. Also, properly report the exit status: non-zero in case of errors.
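Below is a minimal sketch of this crash-on-failure pattern with plain asyncio, not kopf's actual implementation: the names `watch_resource`, `operator`, and `main`, as well as the resource list, are hypothetical.

```python
import asyncio
import logging
import sys

logger = logging.getLogger("operator")

async def watch_resource(resource: str) -> None:
    # A hypothetical root task; any exception escaping it is unrecoverable.
    raise RuntimeError(f"HTTP 422 while patching {resource}")

async def operator() -> None:
    # Run all root tasks; on the first failure, cancel the siblings and
    # re-raise the error instead of swallowing it.
    tasks = [asyncio.create_task(watch_resource(r)) for r in ("pods", "jobs")]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raises the stored exception, if any

def main() -> int:
    try:
        asyncio.run(operator())
    except Exception:
        logger.exception("Operator is stopped due to an unrecoverable error.")
        return 1  # non-zero exit status: let Kubernetes restart the broken pod
    return 0

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    sys.exit(main())
```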
This does not completely prevent incidents of repeated handling, but it at least slows them down (restarts are not fast).
All in all, this should protect users from the framework's or the operator's misbehaviour in some rare cases. In all other cases, nothing changes for the users.
Note: a separate fix (#351) will add throttling of unrecoverable errors on a per-resource basis, covering the span from approximately where the processing begins until the handlers are reached (this span also includes resource PATCH'ing). For errors from watching up to the point of processing where the throttling begins, the operator will still stop; this is a much narrower scope than processing, and it is covered by this PR.
A sample log for an operator with a failed worker shows the triple-logging: once from every failed worker (there can be more than one), once from the root task of watching a resource (there can be more than one watcher), and once for the whole operator before exiting (by Python), each time with more and more details in the stacktrace.
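For illustration, a condensed sketch of how such triple-logging arises as the exception propagates upward; the log messages and function names here are made up and do not match kopf's actual output:

```python
import asyncio
import logging

logger = logging.getLogger("operator")

async def worker(key: str) -> None:
    try:
        raise RuntimeError("Unprocessable Entity")  # e.g., a failed PATCH call
    except Exception:
        logger.exception(f"Worker for {key!r} has failed.")  # 1st log entry
        raise

async def watch_resource(resource: str) -> None:
    try:
        await worker(f"{resource}/example-object")
    except Exception:
        logger.exception(f"Watcher for {resource!r} has failed.")  # 2nd log entry
        raise

def main() -> None:
    logging.basicConfig(level=logging.INFO)
    try:
        asyncio.run(watch_resource("pods"))
    except Exception:
        # 3rd log entry: the whole operator, right before exiting; the
        # traceback has grown at every level the exception propagated through.
        logger.exception("Operator is stopping.")
        raise SystemExit(1)

if __name__ == "__main__":
    main()
```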
Side-changes:

- Re-raise `asyncio.CancelledError` if it is caught by a leftover broad `except:` block instead of propagating. This is unlikely to happen, but just in case (see the sketch after this list).
- Do not log `functools.partial` objects (processors) with all their arguments: this could eventually lead to some data leaks to the logs.
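A sketch of that re-raise pattern (the function names are illustrative): on Python 3.7 and earlier, `asyncio.CancelledError` inherits from `Exception`, so a broad `except Exception:` clause would silently swallow a task's cancellation unless it is re-raised explicitly.

```python
import asyncio

async def handle_one_event() -> None:
    await asyncio.sleep(1)  # a stand-in for the real event processing

async def worker() -> None:
    while True:
        try:
            await handle_one_event()
        except asyncio.CancelledError:
            # On Python 3.7 and earlier, CancelledError inherits from
            # Exception, so the broad clause below would swallow the task's
            # cancellation; re-raise it explicitly to keep the task stoppable.
            raise
        except Exception:
            pass  # per-event errors are tolerated here; cancellation is not
```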