Crash the whole operator on unrecoverable errors in watchers/workers #509
What do these changes do?
When a fatal error happens in the operator's watching, queueing, multiplexing, or processing (including API PATCH'ing), stop the whole operator instead of ignoring the error and continuing.
Description
This issue was detected in an incident where a PATCH request failed with HTTP 422 "Unprocessable Entity" (#346). Instead of stopping or slowing down its attempts, the operator kept re-handling the resource at 1-2 attempts per second.
On a wider scope, if anything goes wrong in the top-level processing, i.e. before the handlers are reached (the handlers have their own error handling and backoff intervals), then crash the whole operator and let Kubernetes deal with the broken pod. Also, properly report the exit status: non-zero in case of errors.
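Below is a minimal sketch of this crash-on-failure pattern with plain asyncio, not kopf's actual implementation: the names `watch_resource`, `operator`, and `main`, as well as the resource list, are hypothetical.

```python
import asyncio
import logging
import sys

logger = logging.getLogger("operator")

async def watch_resource(resource: str) -> None:
    # A hypothetical root task; any exception escaping it is unrecoverable.
    raise RuntimeError(f"HTTP 422 while patching {resource}")

async def operator() -> None:
    # Run all root tasks; on the first failure, cancel the siblings and
    # re-raise the error instead of swallowing it.
    tasks = [asyncio.create_task(watch_resource(r)) for r in ("pods", "jobs")]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    for task in done:
        task.result()  # re-raises the stored exception, if any

def main() -> int:
    try:
        asyncio.run(operator())
    except Exception:
        logger.exception("Operator is stopped due to an unrecoverable error.")
        return 1  # non-zero exit status: let Kubernetes restart the broken pod
    return 0

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    sys.exit(main())
```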
This does not completely prevent incidents of repeated handling, but it at least slows them down (restarts are not fast).
All in all, this should protect users from the framework's or the operator's misbehaviour in some rare cases. In all other cases, nothing changes for the users.
Note: a separate fix (#351) will add throttling of unrecoverable errors on a per-resource basis, covering the span from approximately where the processing begins until the handlers are reached (this span also includes resource PATCH'ing). For errors from watching up to the point of processing where the throttling begins, the operator will still stop; this is a much narrower scope than processing, and it is covered by this PR.
A sample log for an operator with a failed worker shows the triple-logging: once from every failed worker (there can be more than one), once from the root task of watching a resource (there can be more than one watcher), and once for the whole operator before exiting (by Python), each time with more and more details in the stacktrace.
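For illustration, a condensed sketch of how such triple-logging arises as the exception propagates upward; the log messages and function names here are made up and do not match kopf's actual output:

```python
import asyncio
import logging

logger = logging.getLogger("operator")

async def worker(key: str) -> None:
    try:
        raise RuntimeError("Unprocessable Entity")  # e.g., a failed PATCH call
    except Exception:
        logger.exception(f"Worker for {key!r} has failed.")  # 1st log entry
        raise

async def watch_resource(resource: str) -> None:
    try:
        await worker(f"{resource}/example-object")
    except Exception:
        logger.exception(f"Watcher for {resource!r} has failed.")  # 2nd log entry
        raise

def main() -> None:
    logging.basicConfig(level=logging.INFO)
    try:
        asyncio.run(watch_resource("pods"))
    except Exception:
        # 3rd log entry: the whole operator, right before exiting; the
        # traceback has grown at every level the exception propagated through.
        logger.exception("Operator is stopping.")
        raise SystemExit(1)

if __name__ == "__main__":
    main()
```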
Side-changes:

- Re-raise `asyncio.CancelledError` if it is caught by a leftover broad `except:` block instead of propagating. This is unlikely to happen, but just in case (see the sketch after this list).
- Do not log `functools.partial` objects (processors) with all their arguments: this could eventually lead to some data leaks to the logs.
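A sketch of that re-raise pattern (the function names are illustrative): on Python 3.7 and earlier, `asyncio.CancelledError` inherits from `Exception`, so a broad `except Exception:` clause would silently swallow a task's cancellation unless it is re-raised explicitly.

```python
import asyncio

async def handle_one_event() -> None:
    await asyncio.sleep(1)  # a stand-in for the real event processing

async def worker() -> None:
    while True:
        try:
            await handle_one_event()
        except asyncio.CancelledError:
            # On Python 3.7 and earlier, CancelledError inherits from
            # Exception, so the broad clause below would swallow the task's
            # cancellation; re-raise it explicitly to keep the task stoppable.
            raise
        except Exception:
            pass  # per-event errors are tolerated here; cancellation is not
```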