Context semantics for cross-actor-task cancellation and overruns #357

Open
wants to merge 29 commits into master from ctx_cancel_semantics_and_overruns
Conversation

goodboy (Owner) commented Apr 14, 2023

Finally!

This is a rigorous rework of our Context and ContextCancelled
semantics to be more like trio.CancelScope
for cross-actor linked tasks!

High level mechanics:

  • if any actor-task requests a remote actor to be cancelled, either by
    calling Actor.cancel() or the internal Actor._cancel_task() (which
    is called by Context.cancel()), that requesting actor's
    uid is set on the target far-end Context._cancel_called_remote and
    on any eventually raised ContextCancelled.canceller: tuple.
  • this allows any actor receiving a ContextCancelled error-msg to
    check if they were the cancellation "requester", in which case the
    error is not raised locally and instead the cancellation is
    "caught" (and silently ignored inside the Portal.open_context()
    block, and possibly the @tractor.context callee func) just like a
    normal trio cancel scope: calling cs.cancel() from inside the scope
    or nursery block it wraps.
  • if any received ContextCancelled.canceller != current_actor().uid,
    then the error is raised, making it explicit that the
    cancellation was not anticipated by the cancel-receiving (actor)
    task and thus can be subsequently handled as an (unanticipated)
    remote cancellation; see the sketch below.
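
As a rough illustration of these semantics from the requester's side, here's a minimal sketch (the sleep_forever callee and actor name are invented for illustration; Context.cancel(), ContextCancelled.canceller and current_actor().uid are the pieces described above):

```python
import trio
import tractor


@tractor.context
async def sleep_forever(ctx: tractor.Context) -> None:
    # remote task which just parks until it's cancelled from afar
    await ctx.started()
    await trio.sleep_forever()


async def main():
    async with tractor.open_nursery() as an:
        portal = await an.start_actor('sleeper', enable_modules=[__name__])
        try:
            async with portal.open_context(sleep_forever) as (ctx, _first):
                # we are the cancel "requester" so the resulting
                # ContextCancelled is absorbed inside this block,
                # just like calling cs.cancel() inside a trio scope.
                await ctx.cancel()
        except tractor.ContextCancelled as err:
            # only hit when some *other* actor was the canceller
            assert err.canceller != tractor.current_actor().uid
            raise
        await portal.cancel_actor()


if __name__ == '__main__':
    trio.run(main)
```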

Also included is new support and a public API for allowing contexts (and
their streams) to "allow overruns" (since backpressure isn't really
possible without a msg/transport protocol extension): enabling the case
where some sender is pushing msgs faster than the other side is
receiving them, without an error being raised on the sender side.

Normally (and by default) the rx side will receive (via a
RemoteActorError msg) and subsequently raise a StreamOverrun that's
been relayed from the sender. The receiver can then decide how to
handle the condition; previously any sent msg delivered that caused an
overrun would be discarded.

Instead we now offer an allow_overruns: bool flag to the
Context.open_stream() API which adds a new mode with the following
functionality (a usage sketch follows the list):

  • if an overrun condition is hit, we spawn a single "overflow drainer"
    task in the Context._scope_nursery: trio.Nursery which runs
    Context._drain_overflows() (not convinced of the naming yet btw 😂).
  • this task has one simple job: it calls await self._send_chan.send(msg)
    for every queued up overrun-condition msg (stored in the new
    Context._overflow_q: deque) in a blocking fashion but without
    blocking the RPC msg loop.
  • when the ._in_overrun: bool is already set we presume the drainer
    task is already running and instead new overflow msgs are queued and
    a warning log msg emitted.
  • on Context exit any existing drainer task is cancelled and lingering
    msgs are discarded but reported in a warning log msg.
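
For reference, a hedged usage sketch of this mode (the producer func, payload and sleep timing are invented for illustration; the allow_overruns flag to Context.open_stream() is the new API described above):

```python
import trio
import tractor


@tractor.context
async def fast_producer(ctx: tractor.Context) -> None:
    # callee pushes msgs much faster than the caller consumes them
    await ctx.started()
    async with ctx.open_stream() as stream:
        for i in range(10_000):
            await stream.send(i)


async def main():
    async with tractor.open_nursery() as an:
        portal = await an.start_actor('producer', enable_modules=[__name__])
        async with portal.open_context(fast_producer) as (ctx, _first):
            # opt in to tolerating overruns: excess msgs get queued
            # (in Context._overflow_q) and drained by a background
            # task instead of triggering a StreamOverrun.
            async with ctx.open_stream(allow_overruns=True) as stream:
                async for msg in stream:
                    await trio.sleep(0.01)  # deliberately slow consumer
        await portal.cancel_actor()


if __name__ == '__main__':
    trio.run(main)
```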

Details of implementation:

  • move all Context related code to a new ._context mod.
  • extend the test suite to include full coverage of all cancel-request
    and overrun (ignoring) cases.
  • add ContextCancelled.canceller: tuple[str, str].
  • expose new Context.cancel_called, .cancelled_caught and
    .cancel_called_remote properties (see the fragment below).
  • add Context._deliver_msg() holding the factored-out contents of what
    was Actor._push_result(), which now calls the former.
  • add a new mk_context() -> Context factory.
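
A small, hedged fragment showing how the newly exposed properties might be inspected after a local cancel request (reusing the portal and sleep_forever names from the first sketch above; the exact property values are an assumption based on the trio.CancelScope analogy):

```python
async with portal.open_context(sleep_forever) as (ctx, _first):
    # request cancellation of the far-end task from this side
    await ctx.cancel()

# after exit, introspection roughly mirrors trio.CancelScope:
assert ctx.cancel_called      # we called ctx.cancel() locally
assert ctx.cancelled_caught   # the resulting ContextCancelled was absorbed
```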

Possibly still todo:

  • is maybe no_overruns or ignore_overruns or something else
    a better var name?
  • better StreamOverrun msg content?
  • make Context.result() return None if the default is never set,
    more like a function that has no return?
  • drop lingering ._debug.breakpoint() comments from
    ._context.py?
  • ensure we don't need the cs.shield = True that was inside
    Context.cancel()?
  • get more confident about dropping the previous
    Context._send_chan.aclose() calls (currently commented out)?
  • not use @dataclass for Context and instead go
    msgspec.Struct?
  • KINDA IMPORTANT maybe we should allow relaying the source
    error traceback to cancelled context tasks which were not the cause of
    their own cancellation?
    • i.e. the ContextCancelled.canceller != current_actor().uid
      case, where it might be handy to know why, e.g., some parent actor
      cancelled some other child that you were attached to, instead of
      just seeing that it was cancelled by someone else?
  • apparently on self-cancels (which i still don't think is fully tested..?) we can get a relayed error that has no .canceller set?
    • see this example from the new remote annotation testing in piker
      (screenshot: 2023-12-21_15-31-04)
      • which, when you inspect the input err to Context._maybe_raise_remote_err()
        (screenshot: 2023-12-21_15-30-53)

goodboy requested a review from guilledk April 14, 2023 21:22
goodboy added the enhancement (New feature or request), messaging (messaging patterns and protocols), streaming, and cancellation (SC teardown semantics and anti-zombie semantics) labels Apr 14, 2023
goodboy force-pushed the ctx_cancel_semantics_and_overruns branch 2 times, most recently from a8eb962 to a4ecc2a May 15, 2023 13:19
goodboy added 23 commits May 15, 2023 10:00
These will verify new changes to the runtime/messaging core which allows
us to adopt an "ignore cancel if requested by us" style handling of
`ContextCancelled` more like how `trio` does with
`trio.Nursery.cancel_scope.cancel()`. We now expect
a `ContextCancelled.canceller: tuple` which is set to the actor uid of
the actor which requested the cancellation which eventually resulted in
the remote error-msg.

Also adds some experimental tweaks to the "backpressure" test which it
turns out is very problematic in coordination with context cancellation
since blocking on the feed mem chan to some task will block the ipc msg
loop and thus handling of cancellation.. More to come to both the test
and core to address this hopefully since right now this test is failing.
To handle remote cancellation this adds `ContextCancelled.canceller:
tuple`, the uid of the cancel-requesting actor, which is expected to be
set by the runtime when servicing any remote cancel request. This makes
it possible for `ContextCancelled` receivers to know whether "their actor
runtime" is the source of the cancellation.

Also add an explicit `RemoteActorError.src_actor_uid` which better
formalizes the notion of "which remote actor" the error originated from.

Both of these new attrs are expected to be packed in the `.msgdata` when
the errors are loaded locally.
Turns out stuff was totally broken in these cases because we're either
closing the underlying mem chan too early or not handling the
"allow_overruns" mode's cancellation correctly..
This adds remote cancellation semantics to our `tractor.Context`
machinery to more closely match that of `trio.CancelScope` but
with operational differences to handle the nature of parallel tasks interoperating
across multiple memory boundaries:

- if an actor task cancels some context it has opened via
  `Context.cancel()`, the remote (scope linked) task will be cancelled
  using the normal `CancelScope` semantics of `trio` meaning the remote
  cancel scope surrounding the far side task is cancelled and
  `trio.Cancelled`s are expected to be raised in that scope as per
  normal `trio` operation, and in the case where no error is raised
  in that remote scope, a `ContextCancelled` error is raised inside the
  runtime machinery and relayed back to the opener/caller side of the
  context.
- if any actor task cancels a full remote actor runtime using
  `Portal.cancel_actor()` the same semantics as above apply except every
  other remote actor task which also has an open context with the actor
  which was cancelled will also be sent a `ContextCancelled` **but**
  with the `.canceller` field set to the uid of the original cancel
  requesting actor (see the sketch just below).
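
(As a hedged sketch of this second case from the point of view of some
unrelated peer task which had its own context open with the cancelled
actor; `peer_portal` and `stream_from_victim` are invented names and the
imports from the earlier sketches are assumed:)

```python
try:
    async with peer_portal.open_context(stream_from_victim) as (ctx, _first):
        await trio.sleep_forever()
except tractor.ContextCancelled as err:
    # the actor on the other end was cancelled out-of-band via
    # `Portal.cancel_actor()` by some *other* actor, so `.canceller`
    # is that requester's uid, not ours, and the error is raised
    # here instead of being absorbed.
    assert err.canceller != tractor.current_actor().uid
```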

This changeset also includes a more "proper" solution to the issue of
"allowing overruns" during streaming without attempting to implement any
form of IPC streaming backpressure. Implementing task-granularity
backpressure cross-process turns out to be more or less impossible
without augmenting our streaming protocol (likely at the cost of
performance). Further, allowing overruns requires special care since
any blocking of the runtime RPC msg loop task can effectively block
control msgs such as cancels and stream terminations.

The implementation details per abstraction layer are as follows.

._streaming.Context:
- add a new constructor factory func `mk_context()` which provides
  a strictly private init-er whilst allowing us to not have to define
  an `.__init__()` on the type def.
- add public `.cancel_called` and `.cancel_called_remote` properties.
- general rename of what was the internal `._backpressure` var to
  `._allow_overruns: bool`.
- move the old contents of `Actor._push_result()` into a new
  `._deliver_msg()` allowing for better encapsulation of per-ctx
  msg handling.
 - always check for received 'error' msgs and process them with the new
   `_maybe_cancel_and_set_remote_error()` **before** any msg delivery to
   the local task, thus guaranteeing error and cancellation handling
   despite any overflow handling.
- add a new `._drain_overflows()` task-method for use with new
  `._allow_overruns: bool = True` mode.
 - add back a `._scope_nursery: trio.Nursery` (allocated in
   `Portal.open_context()`) whose sole purpose is to spawn a single task
   which runs the above method; anything else is an error.
 - augment `._deliver_msg()` to start a task and run the above method
   when operating in the new `._allow_overruns` mode; the task queues overflow msgs and
   attempts to send them to the underlying mem chan using a blocking
   `.send()` call.
 - on context exit, any existing "drainer task" will be cancelled and
   remaining overflow queued msgs are discarded with a warning.
- rename `._error` -> `_remote_error` and set it in a new method
  `_maybe_cancel_and_set_remote_error()` which is called before any msg
  delivery to the local task.
- adjust `.result()` to always call `._maybe_raise_remote_err()` at its
  start such that whenever a `ContextCancelled` arrives we do logic for
  whether or not to immediately raise that error or ignore it due to the
  current actor being the one who requested the cancel, by checking the
  error's `.canceller` field.
 - set the default value of `._result` to be `id(Context())` thus avoiding
   conflict with any `.result()` actually being `False`..

._runtime.Actor:
- augment `.cancel()` and `._cancel_task()` and `.cancel_rpc_tasks()` to
  take a `requesting_uid: tuple` indicating the source actor of every
  cancellation request.
- pass the new `Context._allow_overruns` through `.get_context()`
- call the new `Context._deliver_msg()` from `._push_result()` (since we
  factored out that method's contents).

._runtime._invoke:
- have `TaskStatus.started()` hand back a `Context` (unless an error is
  raised) instead of the cancel scope, to make it easy to set/get state
  on that context for the purposes of cancellation and remote error relay.
- always raise any remote error via `Context._maybe_raise_remote_err()`
  before doing any `ContextCancelled` logic.
- assign any `Context._cancel_called_remote` set by the `requesting_uid`
  cancel methods (mentioned above) to the `ContextCancelled.canceller`.

._runtime.process_messages:
- always pass a `requesting_uid: tuple` to `Actor.cancel()` and
  `._cancel_task()` so that any corresponding `ContextCancelled.canceller`
  can be set inside `._invoke()`.
This actually caught further runtime bugs so it's gud i tried..
Add overrun-ignore enabled / disabled cases and error catching for all
of them. More or less this should cover every possible outcome when
it comes to setting `allow_overruns: bool` i hope XD
Because obviously we probably want to support `allow_overruns` on the
remote callee side as well XD

Only found the bugs fixed in this patch thanks to writing a much
more exhaustive test set for overrun cases B)
Turns out you can get a case where you might be opening multiple
ctx-streams concurrently and during the context opening phase you block
for all contexts to open, but then when you eventually start opening
streams, some slow-to-start context has caused the others to end up in an
overrun state.. so we need to let the caller control whether that's an
error ;)

This also needs a test!
goodboy force-pushed the ctx_cancel_semantics_and_overruns branch from a4ecc2a to ead9e41 May 15, 2023 14:00
goodboy added a commit to pikers/piker that referenced this pull request May 17, 2023
Requires goodboy/tractor#357.
Avoid overruns when doing concurrent live feed init over multiple
brokers.
- `Context._cancel_called_remote` -> `._cancelled_remote` since "called"
  implies the cancellation was "requested" when it could be due to
  another error; the value is the actor uid and it is only set once the
  far end task scope is terminated due to either error or cancel, which
  has nothing to do with *what* caused the cancellation.
- `Actor._cancel_called_remote` -> `._cancel_called_by_remote` which
  emphasizes that this variable is **only set** IFF some remote actor
  **requested that** this actor's runtime be cancelled via
  `Actor.cancel()`.
goodboy added a commit that referenced this pull request Oct 18, 2023
Tests that appropriate `Context` exit state, the relay of
a `ContextCancelled` error and its `.canceller: tuple[str, str]` value
are set when an inter-peer cancellation happens via an "out of band"
request method (in this case using `Portal.cancel_actor()`) and that
cancellation is propagated "horizontally" to other peers. Verify that
any such cancellation scenario which also experiences an "error during
`ContextCancelled` handling" DOES NOT result in that further error being
suppressed and that the user's exception bubbles out of the
`Context.open_context()` block(s) appropriately!

Likely more tests to come as well as some factoring of the teardown
state checks where possible.

Pertains to seriously testing the major work landing in #357.