Context semantics for cross-actor-task cancellation and overruns #357

Open
wants to merge 29 commits into master from ctx_cancel_semantics_and_overruns
Conversation

goodboy (Owner) commented Apr 14, 2023

Finally!

This is a rigorous rework of our Context and ContextCancelled
semantics to be more like trio.CancelScope
for cross-actor linked tasks!

High level mechanics:

  • if any actor-task requests a remote actor to be cancelled, either by
    calling Actor.cancel() or the internal Actor._cancel_task() (which
    is called by Context.cancel()), that requesting actor's
    uid is set on the target far-end Context._cancel_called_remote and
    on any eventually raised ContextCancelled.canceller: tuple.
  • this allows any actor receiving a ContextCancelled error-msg to
    check if they were the cancellation "requester", in which case the
    error is not raised locally and instead the cancellation is
    "caught" (and silently ignored inside the Portal.open_context()
    block, and possibly the @tractor.context callee func) just like a
    normal trio cancel scope: calling cs.cancel() from inside the scope
    or nursery block it wraps.
  • if any received ContextCancelled.canceller != current_actor().uid,
    then the error is raised, making it explicit that the
    cancellation was not anticipated by the cancel-receiving (actor)
    task and thus can be subsequently handled as an (unanticipated)
    remote cancellation; see the sketch below.
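
As a rough illustration of these semantics from the requester's side, here's a minimal sketch (the sleep_forever callee and actor name are invented for illustration; Context.cancel(), ContextCancelled.canceller and current_actor().uid are the pieces described above):

```python
import trio
import tractor


@tractor.context
async def sleep_forever(ctx: tractor.Context) -> None:
    # remote task which just parks until it's cancelled from afar
    await ctx.started()
    await trio.sleep_forever()


async def main():
    async with tractor.open_nursery() as an:
        portal = await an.start_actor('sleeper', enable_modules=[__name__])
        try:
            async with portal.open_context(sleep_forever) as (ctx, _first):
                # we are the cancel "requester" so the resulting
                # ContextCancelled is absorbed inside this block,
                # just like calling cs.cancel() inside a trio scope.
                await ctx.cancel()
        except tractor.ContextCancelled as err:
            # only hit when some *other* actor was the canceller
            assert err.canceller != tractor.current_actor().uid
            raise
        await portal.cancel_actor()


if __name__ == '__main__':
    trio.run(main)
```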

Also included is new support and a public API for allowing contexts (and
their streams) to "allow overruns" (since backpressure isn't really
possible without a msg/transport protocol extension): enabling the case
where some sender is pushing msgs faster than the other side is
receiving them, without an error being raised on the sender side.

Normally (and by default) the rx side will receive (via a
RemoteActorError msg) and subsequently raise a StreamOverrun that's
been relayed from the sender. The receiver can then decide how to
handle the condition; previously any sent msg delivered that caused an
overrun would be discarded.

Instead we now offer an allow_overruns: bool flag to the
Context.open_stream() API which adds a new mode with the following
functionality (a usage sketch follows the list):

  • if an overrun condition is hit, we spawn a single "overflow drainer"
    task in the Context._scope_nursery: trio.Nursery which runs
    Context._drain_overflows() (not convinced of the naming yet btw 😂).
  • this task has one simple job: it calls await self._send_chan.send(msg)
    for every queued up overrun-condition msg (stored in the new
    Context._overflow_q: deque) in a blocking fashion but without
    blocking the RPC msg loop.
  • when the ._in_overrun: bool is already set we presume the drainer
    task is already running and instead new overflow msgs are queued and
    a warning log msg emitted.
  • on Context exit any existing drainer task is cancelled and lingering
    msgs are discarded but reported in a warning log msg.
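
For reference, a hedged usage sketch of this mode (the producer func, payload and sleep timing are invented for illustration; the allow_overruns flag to Context.open_stream() is the new API described above):

```python
import trio
import tractor


@tractor.context
async def fast_producer(ctx: tractor.Context) -> None:
    # callee pushes msgs much faster than the caller consumes them
    await ctx.started()
    async with ctx.open_stream() as stream:
        for i in range(10_000):
            await stream.send(i)


async def main():
    async with tractor.open_nursery() as an:
        portal = await an.start_actor('producer', enable_modules=[__name__])
        async with portal.open_context(fast_producer) as (ctx, _first):
            # opt in to tolerating overruns: excess msgs get queued
            # (in Context._overflow_q) and drained by a background
            # task instead of triggering a StreamOverrun.
            async with ctx.open_stream(allow_overruns=True) as stream:
                async for msg in stream:
                    await trio.sleep(0.01)  # deliberately slow consumer
        await portal.cancel_actor()


if __name__ == '__main__':
    trio.run(main)
```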

Details of implementation:

  • move all Context related code to a new ._context mod.
  • extend the test suite to include full coverage of all cancel-request
    and overrun (ignoring) cases.
  • add ContextCancelled.canceller: tuple[str, str].
  • expose new Context.cancel_called, .cancelled_caught and
    .cancel_called_remote properties (see the fragment below).
  • add Context._deliver_msg() holding the factored-out contents of what
    was Actor._push_result(), which now calls the former.
  • add a new mk_context() -> Context factory.
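
A small, hedged fragment showing how the newly exposed properties might be inspected after a local cancel request (reusing the portal and sleep_forever names from the first sketch above; the exact property values are an assumption based on the trio.CancelScope analogy):

```python
async with portal.open_context(sleep_forever) as (ctx, _first):
    # request cancellation of the far-end task from this side
    await ctx.cancel()

# after exit, introspection roughly mirrors trio.CancelScope:
assert ctx.cancel_called      # we called ctx.cancel() locally
assert ctx.cancelled_caught   # the resulting ContextCancelled was absorbed
```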

Possibly still todo:

  • is maybe no_overruns or ignore_overruns or something else
    a better var name?
  • better StreamOverrun msg content?
  • make Context.result() return None if the default is never set,
    more like a function that has no return?
  • drop lingering ._debug.breakpoint() comments from
    ._context.py?
  • ensure we don't need the cs.shield = True that was inside
    Context.cancel()?
  • get more confident about dropping the previous
    Context._send_chan.aclose() calls (currently commented out)?
  • not use @dataclass for Context and instead go
    msgspec.Struct?
  • KINDA IMPORTANT maybe we should allow relaying the source
    error traceback to cancelled context tasks which were not the cause of
    their own cancellation?
    • i.e. the ContextCancelled.canceller != current_actor().uid
      case, where it might be handy to know why, e.g., some parent actor
      cancelled some other child that you were attached to, instead of
      just seeing that it was cancelled by someone else?
  • apparently on self-cancels (which i still don't think is fully tested..?) we can get a relayed error that has no .canceller set?
    • see this example from the new remote annotation testing in piker
      (screenshot: 2023-12-21_15-31-04)
      • which, when you inspect the input err to Context._maybe_raise_remote_err()
        (screenshot: 2023-12-21_15-30-53)

goodboy requested a review from guilledk April 14, 2023 21:22
goodboy added the enhancement (New feature or request), messaging (messaging patterns and protocols), streaming, and cancellation (SC teardown semantics and anti-zombie semantics) labels Apr 14, 2023
goodboy force-pushed the ctx_cancel_semantics_and_overruns branch 2 times, most recently from a8eb962 to a4ecc2a May 15, 2023 13:19
goodboy added 23 commits May 15, 2023 10:00
These will verify new changes to the runtime/messaging core which allows
us to adopt an "ignore cancel if requested by us" style handling of
`ContextCancelled` more like how `trio` does with
`trio.Nursery.cancel_scope.cancel()`. We now expect
a `ContextCancelled.canceller: tuple` which is set to the actor uid of
the actor which requested the cancellation which eventually resulted in
the remote error-msg.

Also adds some experimental tweaks to the "backpressure" test which it
turns out is very problematic in coordination with context cancellation
since blocking on the feed mem chan to some task will block the ipc msg
loop and thus handling of cancellation.. More to come to both the test
and core to address this hopefully since right now this test is failing.
To handle remote cancellation this adds `ContextCancelled.canceller:
tuple`, the uid of the cancel-requesting actor, which is expected to be
set by the runtime when servicing any remote cancel request. This makes
it possible for `ContextCancelled` receivers to know whether "their actor
runtime" is the source of the cancellation.

Also add an explicit `RemoteActorError.src_actor_uid` which better
formalizes the notion of "which remote actor" the error originated from.

Both of these new attrs are expected to be packed in the `.msgdata` when
the errors are loaded locally.
Turns out stuff was totally broken in these cases because we're either
closing the underlying mem chan too early or not handling the
"allow_overruns" mode's cancellation correctly..
This adds remote cancellation semantics to our `tractor.Context`
machinery to more closely match that of `trio.CancelScope` but
with operational differences to handle the nature of parallel tasks interoperating
across multiple memory boundaries:

- if an actor task cancels some context it has opened via
  `Context.cancel()`, the remote (scope linked) task will be cancelled
  using the normal `CancelScope` semantics of `trio` meaning the remote
  cancel scope surrounding the far side task is cancelled and
  `trio.Cancelled`s are expected to be raised in that scope as per
  normal `trio` operation, and in the case where no error is raised
  in that remote scope, a `ContextCancelled` error is raised inside the
  runtime machinery and relayed back to the opener/caller side of the
  context.
- if any actor task cancels a full remote actor runtime using
  `Portal.cancel_actor()` the same semantics as above apply except every
  other remote actor task which also has an open context with the actor
  which was cancelled will also be sent a `ContextCancelled` **but**
  with the `.canceller` field set to the uid of the original cancel
  requesting actor (see the sketch just below).
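
(As a hedged sketch of this second case from the point of view of some
unrelated peer task which had its own context open with the cancelled
actor; `peer_portal` and `stream_from_victim` are invented names and the
imports from the earlier sketches are assumed:)

```python
try:
    async with peer_portal.open_context(stream_from_victim) as (ctx, _first):
        await trio.sleep_forever()
except tractor.ContextCancelled as err:
    # the actor on the other end was cancelled out-of-band via
    # `Portal.cancel_actor()` by some *other* actor, so `.canceller`
    # is that requester's uid, not ours, and the error is raised
    # here instead of being absorbed.
    assert err.canceller != tractor.current_actor().uid
```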

This changeset also includes a more "proper" solution to the issue of
"allowing overruns" during streaming without attempting to implement any
form of IPC streaming backpressure. Implementing task-granularity
backpressure cross-process turns out to be more or less impossible
without augmenting our streaming protocol (likely at the cost of
performance). Further, allowing overruns requires special care since
any blocking of the runtime RPC msg loop task can effectively block
control msgs such as cancels and stream terminations.

The implementation details per abstraction layer are as follows.

._streaming.Context:
- add a new constructor factory func `mk_context()` which provides
  a strictly private init-er whilst allowing us to not have to define
  an `.__init__()` on the type def.
- add public `.cancel_called` and `.cancel_called_remote` properties.
- general rename of what was the internal `._backpressure` var to
  `._allow_overruns: bool`.
- move the old contents of `Actor._push_result()` into a new
  `._deliver_msg()` allowing for better encapsulation of per-ctx
  msg handling.
 - always check for received 'error' msgs and process them with the new
   `_maybe_cancel_and_set_remote_error()` **before** any msg delivery to
   the local task, thus guaranteeing error and cancellation handling
   despite any overflow handling.
- add a new `._drain_overflows()` task-method for use with new
  `._allow_overruns: bool = True` mode.
 - add back a `._scope_nursery: trio.Nursery` (allocated in
   `Portal.open_context()`) whose sole purpose is to spawn a single task
   which runs the above method; anything else is an error.
 - augment `._deliver_msg()` to start a task and run the above method
   when operating in the new `._allow_overruns` mode; the task queues overflow msgs and
   attempts to send them to the underlying mem chan using a blocking
   `.send()` call.
 - on context exit, any existing "drainer task" will be cancelled and
   remaining overflow queued msgs are discarded with a warning.
- rename `._error` -> `_remote_error` and set it in a new method
  `_maybe_cancel_and_set_remote_error()` which is called before any msg
  delivery to the local task.
- adjust `.result()` to always call `._maybe_raise_remote_err()` at its
  start such that whenever a `ContextCancelled` arrives we do logic for
  whether or not to immediately raise that error or ignore it due to the
  current actor being the one who requested the cancel, by checking the
  error's `.canceller` field.
 - set the default value of `._result` to be `id(Context())` thus avoiding
   conflict with any `.result()` actually being `False`..

._runtime.Actor:
- augment `.cancel()` and `._cancel_task()` and `.cancel_rpc_tasks()` to
  take a `requesting_uid: tuple` indicating the source actor of every
  cancellation request.
- pass the new `Context._allow_overruns` through `.get_context()`
- call the new `Context._deliver_msg()` from `._push_result()` (since we
  factored out that method's contents).

._runtime._invoke:
- have `TaskStatus.started()` hand back a `Context` (unless an error is
  raised) instead of the cancel scope, to make it easy to set/get state
  on that context for the purposes of cancellation and remote error relay.
- always raise any remote error via `Context._maybe_raise_remote_err()`
  before doing any `ContextCancelled` logic.
- assign any `Context._cancel_called_remote` set by the `requesting_uid`
  cancel methods (mentioned above) to the `ContextCancelled.canceller`.

._runtime.process_messages:
- always pass a `requesting_uid: tuple` to `Actor.cancel()` and
  `._cancel_task()` so that any corresponding `ContextCancelled.canceller`
  can be set inside `._invoke()`.
This actually caught further runtime bugs so it's gud i tried..
Add overrun-ignore enabled / disabled cases and error catching for all
of them. More or less this should cover every possible outcome when
it comes to setting `allow_overruns: bool` i hope XD
Because obviously we probably want to support `allow_overruns` on the
remote callee side as well XD

Only found the bugs fixed in this patch thanks to writing a much
more exhaustive test set for overrun cases B)
Turns out you can get a case where you might be opening multiple
ctx-streams concurrently and during the context opening phase you block
for all contexts to open, but then when you eventually start opening
streams, some slow-to-start context has caused the others to end up in an
overrun state.. so we need to let the caller control whether that's an
error ;)

This also needs a test!
goodboy force-pushed the ctx_cancel_semantics_and_overruns branch from a4ecc2a to ead9e41 May 15, 2023 14:00
goodboy added a commit to pikers/piker that referenced this pull request May 17, 2023
Requires goodboy/tractor#357.
Avoid overruns when doing concurrent live feed init over multiple
brokers.
- `Context._cancel_called_remote` -> `._cancelled_remote` since "called"
  implies the cancellation was "requested" when it could be due to
  another error; the value is the actor uid and it is only set once the
  far end task scope is terminated due to either error or cancel, which
  has nothing to do with *what* caused the cancellation.
- `Actor._cancel_called_remote` -> `._cancel_called_by_remote` which
  emphasizes that this variable is **only set** IFF some remote actor
  **requested that** this actor's runtime be cancelled via
  `Actor.cancel()`.
goodboy added a commit that referenced this pull request Oct 18, 2023
Tests that appropriate `Context` exit state, the relay of
a `ContextCancelled` error and its `.canceller: tuple[str, str]` value
are set when an inter-peer cancellation happens via an "out of band"
request method (in this case using `Portal.cancel_actor()`) and that
cancellation is propagated "horizontally" to other peers. Verify that
any such cancellation scenario which also experiences an "error during
`ContextCancelled` handling" DOES NOT result in that further error being
suppressed and that the user's exception bubbles out of the
`Context.open_context()` block(s) appropriately!

Likely more tests to come as well as some factoring of the teardown
state checks where possible.

Pertains to seriously testing the major work landing in #357.