Crash the whole operator on unrecoverable errors in watchers/workers #509

Merged: 3 commits merged on Sep 4, 2020

Conversation

nolar (Owner) commented on Aug 20, 2020

What do these changes do?

When a fatal error happens in the operator's watching, queueing, multiplexing, or processing (including API PATCH'ing), stop the whole operator instead of ignoring the error and continuing.

Description

This issue was detected in an incident where a PATCH request failed with HTTP 422 "Unprocessable Entity" (#346). Instead of stopping or slowing down, the operator kept retrying the handling at 1-2 attempts per second.

More broadly, if anything goes wrong in the top-level processing, i.e. before the handlers are reached (they have their own error handling and backoff intervals), crash the whole operator and let Kubernetes deal with the broken pod. Also report the exit status properly (non-zero in case of errors).

This does not completely prevent incidents of repeated handling, but it at least slows them down (pod restarts are not fast).

All in all, this should protect users from the framework's or the operator's misbehaviour in some rare cases. In all other cases, nothing changes for the users.
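
To illustrate the overall pattern (a minimal sketch, not kopf's actual code; the names run_root_tasks, watcher, and main are made up for this example): let a root task's failure propagate instead of swallowing it, cancel the remaining tasks, and exit with a non-zero status.

    import asyncio
    import logging
    import sys

    logger = logging.getLogger("operator")

    async def watcher() -> None:
        # Stand-in for a per-resource watching/queueing task; any unexpected
        # exception here is treated as unrecoverable.
        raise Exception("boo!")

    async def run_root_tasks() -> None:
        # Start all root tasks; as soon as one fails, cancel the rest and re-raise
        # so the failure reaches the process exit code.
        tasks = [asyncio.create_task(watcher(), name="watcher of kopfexamples")]
        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
        for task in pending:
            task.cancel()
        await asyncio.gather(*pending, return_exceptions=True)
        for task in done:
            if task.exception() is not None:
                logger.error("Root task %r has failed. The operator will stop.", task.get_name())
                raise task.exception()

    def main() -> int:
        try:
            asyncio.run(run_root_tasks())
        except Exception:
            logger.exception("Operator stopped due to an unrecoverable error.")
            return 1  # non-zero exit status: let Kubernetes restart the pod
        return 0

    if __name__ == "__main__":
        logging.basicConfig(level=logging.DEBUG)
        sys.exit(main())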


Note: A separate fix (#351) will add throttling of unrecoverable errors on a per-resource basis, covering the span from approximately when processing begins until the handlers are reached (this span also includes resource PATCH'ing). For errors raised between watching and the point where throttling begins, the operator will still stop; that is a much narrower scope than processing and is covered by this PR.
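
The throttling itself is not part of this PR; purely to illustrate the idea (a hypothetical sketch, not the implementation planned in #351), a per-resource backoff could look roughly like this:

    import asyncio
    import logging
    from typing import Awaitable, Callable, Dict

    logger = logging.getLogger("operator")

    # Hypothetical per-resource throttle: after an unexpected error, delay further
    # processing of that resource with a growing backoff instead of retrying instantly.
    _error_counts: Dict[str, int] = {}   # resource uid -> consecutive unexpected errors
    BACKOFFS = [1, 10, 60, 600]          # seconds

    async def process_throttled(uid: str, process: Callable[[], Awaitable[None]]) -> None:
        errors = _error_counts.get(uid, 0)
        if errors:
            delay = BACKOFFS[min(errors, len(BACKOFFS)) - 1]
            logger.warning("Throttling %s for %ss after %s unexpected error(s).", uid, delay, errors)
            await asyncio.sleep(delay)
        try:
            await process()
        except Exception:
            _error_counts[uid] = errors + 1
            logger.exception("Unexpected error while processing %s.", uid)
        else:
            _error_counts.pop(uid, None)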


A sample log for an operator with a failed worker. Note the triple logging: once from every failed worker (there can be more than one), once from the root task watching a resource (there can be more than one watcher), and once for the whole operator before exiting (by Python), each time with more details in the stacktrace:

[2020-09-04 01:23:49,564] kopf.reactor.queuein [ERROR   ] Event processing has failed with an unrecoverable error for (Resource(group='zalando.org', version='v1', plural='kopfexamples'), 'a93fdae1-ac70-4598-8e2b-48c6e72e5e32').
Traceback (most recent call last):
  File "/Users/nolar/src/kopf/kopf/reactor/queueing.py", line 213, in worker
    raise Exception("boo!")
Exception: boo!

[2020-09-04 01:23:49,568] kopf.reactor.running [ERROR   ] Root task 'watcher of kopfexamples.zalando.org' is failed: Event processing has failed with an unrecoverable error. This seems to be a framework bug. The operator will stop to prevent damage.
Traceback (most recent call last):
  File "/Users/nolar/.pyenv/versions/kopf38/lib/python3.8/site-packages/aiojobs/_scheduler.py", line 138, in _wait_failed
    await task  # should raise exception
  File "/Users/nolar/src/kopf/kopf/reactor/queueing.py", line 213, in worker
    raise Exception("boo!")
Exception: boo!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/nolar/src/kopf/kopf/reactor/running.py", line 442, in _root_task_checker
    await coro
  File "/Users/nolar/src/kopf/kopf/reactor/queueing.py", line 147, in watcher
    raise RuntimeError("Event processing has failed with an unrecoverable error. "
RuntimeError: Event processing has failed with an unrecoverable error. This seems to be a framework bug. The operator will stop to prevent damage.

[2020-09-04 01:23:49,576] kopf.reactor.running [DEBUG   ] Root task 'daemon killer' is cancelled.
[2020-09-04 01:23:49,576] kopf.reactor.running [DEBUG   ] Root task 'credentials retriever' is cancelled.
[2020-09-04 01:23:49,576] kopf.reactor.running [DEBUG   ] Root task 'poster of events' is cancelled.
[2020-09-04 01:23:49,578] kopf.reactor.running [DEBUG   ] Root tasks are stopped: finishing normally; tasks left: set()
[2020-09-04 01:23:49,578] kopf.reactor.running [DEBUG   ] Hung tasks stopping is skipped: no tasks given.
Traceback (most recent call last):
  File "/Users/nolar/.pyenv/versions/kopf38/lib/python3.8/site-packages/aiojobs/_scheduler.py", line 138, in _wait_failed
    await task  # should raise exception
  File "/Users/nolar/src/kopf/kopf/reactor/queueing.py", line 213, in worker
    raise Exception("boo!")
Exception: boo!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/nolar/.pyenv/versions/kopf38/lib/python3.8/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/Users/nolar/.pyenv/versions/kopf38/lib/python3.8/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/nolar/.pyenv/versions/kopf38/lib/python3.8/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/nolar/.pyenv/versions/kopf38/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/nolar/src/kopf/kopf/cli.py", line 32, in wrapper
    return fn(*args, **kwargs)
  File "/Users/nolar/.pyenv/versions/kopf38/lib/python3.8/site-packages/click/decorators.py", line 73, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/Users/nolar/.pyenv/versions/kopf38/lib/python3.8/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/Users/nolar/src/kopf/kopf/cli.py", line 73, in run
    return running.run(
  File "/Users/nolar/src/kopf/kopf/reactor/running.py", line 94, in run
    loop.run_until_complete(operator(
  File "/Users/nolar/.pyenv/versions/3.8.5/lib/python3.8/asyncio/base_events.py", line 616, in run_until_complete
    return future.result()
  File "/Users/nolar/src/kopf/kopf/reactor/running.py", line 150, in operator
    await run_tasks(operator_tasks, ignored=existing_tasks)
  File "/Users/nolar/src/kopf/kopf/reactor/running.py", line 359, in run_tasks
    await _reraise(root_done | root_cancelled | hung_done | hung_cancelled)
  File "/Users/nolar/src/kopf/kopf/reactor/running.py", line 426, in _reraise
    task.result()  # can raise the regular (non-cancellation) exceptions.
  File "/Users/nolar/src/kopf/kopf/reactor/running.py", line 442, in _root_task_checker
    await coro
  File "/Users/nolar/src/kopf/kopf/reactor/queueing.py", line 147, in watcher
    raise RuntimeError("Event processing has failed with an unrecoverable error. "
RuntimeError: Event processing has failed with an unrecoverable error. This seems to be a framework bug. The operator will stop to prevent damage.

Process finished with exit code 1
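
For reference, the "The above exception was the direct cause of the following exception" sections in the log come from standard Python exception chaining: the watcher re-raises the worker's failure with "raise ... from ...". A minimal reproduction (illustrative only, not the framework's actual call chain):

    import asyncio

    async def worker() -> None:
        raise Exception("boo!")  # the original, unrecoverable failure

    async def watcher() -> None:
        try:
            await worker()
        except Exception as exc:
            # "from exc" preserves the original traceback as the "direct cause",
            # producing the chained stacktraces seen in the log above.
            raise RuntimeError(
                "Event processing has failed with an unrecoverable error. "
                "This seems to be a framework bug. The operator will stop to prevent damage."
            ) from exc

    asyncio.run(watcher())  # terminates with a chained traceback and a non-zero exit code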

Side-changes:

  • Log the daemon-killer's exit reason as "cancelled" (like all other tasks), not as "exited unexpectedly", since no asyncio.CancelledError is raised from inside it.
  • Cover the queue pulling and event batching with this safety net for unexpected errors too, by shifting the except: block to an outer level. Such errors are unlikely there, but just in case.
  • Stop logging the functools.partial objects (processors) with all their bound arguments, as this could eventually leak data into the logs (see the sketch below).
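
A hypothetical before/after for the last item (process_event and its arguments are made up; this is not kopf's actual call site):

    import functools
    import logging

    logger = logging.getLogger("operator")

    def process_event(event: dict, *, settings: dict, token: str) -> None:
        ...

    processor = functools.partial(process_event, settings={"big": "config"}, token="s3cr3t")

    # Before: repr() of the partial dumps every bound argument, including sensitive values.
    logger.debug("Processing via %r", processor)

    # After: log only a short, stable name of the underlying function.
    logger.debug("Processing via %s", getattr(processor.func, "__qualname__", processor.func))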

Issues/PRs

Issues: #346

Related: #331

Replaces: zalando-incubator/kopf#350

Type of changes

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • The code addresses only the mentioned problem, and this problem only
  • I think the code is well written
  • Unit tests for the changes exist
  • Documentation reflects the changes
  • If you provide code modification, please add yourself to CONTRIBUTORS.txt

(Rationale for the processor-logging side-change: these arguments' reprs can be quite lengthy, generating unnecessary traffic to the logging systems/storages. Not to mention that they can also contain sensitive values which should not be logged.)
@nolar force-pushed the crash-on-fatals branch 2 times, most recently from e5c232e to db7145c, on September 3, 2020 at 23:20
Labels: bug, enhancement