Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Context semantics for cross-actor-task cancellation and overruns #357

Open
wants to merge 29 commits into
base: master
Choose a base branch
from

Commits on May 15, 2023

  1. Configuration menu
    Copy the full SHA
    cfb2bc0 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    d753431 View commit details
    Browse the repository at this point in the history
  3. Tweak context doc str

    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    903537c View commit details
    Browse the repository at this point in the history
  4. Configuration menu
    Copy the full SHA
    b3f9251 View commit details
    Browse the repository at this point in the history
  5. Configuration menu
    Copy the full SHA
    e80e0a5 View commit details
    Browse the repository at this point in the history
  6. Configuration menu
    Copy the full SHA
    8317903 View commit details
    Browse the repository at this point in the history
  7. Configuration menu
    Copy the full SHA
    220b244 View commit details
    Browse the repository at this point in the history
  8. Add new set of context cancellation tests

    These will verify new changes to the runtime/messaging core which allows
    us to adopt an "ignore cancel if requested by us" style handling of
    `ContextCancelled` more like how `trio` does with
    `trio.Nursery.cancel_scope.cancel()`. We now expect
    a `ContextCancelled.canceller: tuple` which is set to the actor uid of
    the actor which requested the cancellation which eventually resulted in
    the remote error-msg.
    
    Also adds some experimental tweaks to the "backpressure" test which it
    turns out is very problematic in coordination with context cancellation
    since blocking on the feed mem chan to some task will block the ipc msg
    loop and thus handling of cancellation.. More to come to both the test
    and core to address this hopefully since right now this test is failing.
    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    71cd445 View commit details
    Browse the repository at this point in the history
  9. Add new remote error introspection attrs

    To handle both remote cancellation this adds `ContextCanceled.canceller:
    tuple` the uid of the cancel requesting actor and is expected to be set
    by the runtime when servicing any remote cancel request. This makes it
    possible for `ContextCancelled` receivers to know whether "their actor
    runtime" is the source of the cancellation.
    
    Also add an explicit `RemoteActor.src_actor_uid` which better formalizes
    the notion of "which remote actor" the error originated from.
    
    Both of these new attrs are expected to be packed in the `.msgdata` when
    the errors are loaded locally.
    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    67f82c6 View commit details
    Browse the repository at this point in the history
  10. Augment test cases for callee-returns-result early

    Turns out stuff was totally broken in these cases because we're either
    closing the underlying mem chan too early or not handling the
    "allow_overruns" mode's cancellation correctly..
    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    03644f5 View commit details
    Browse the repository at this point in the history
  11. Configuration menu
    Copy the full SHA
    f54c415 View commit details
    Browse the repository at this point in the history
  12. Configuration menu
    Copy the full SHA
    90e4101 View commit details
    Browse the repository at this point in the history
  13. Remote Context cancellation semantics rework B)

    This adds remote cancellation semantics to our `tractor.Context`
    machinery to more closely match that of `trio.CancelScope` but
    with operational differences to handle the nature of parallel tasks interoperating
    across multiple memory boundaries:
    
    - if an actor task cancels some context it has opened via
      `Context.cancel()`, the remote (scope linked) task will be cancelled
      using the normal `CancelScope` semantics of `trio` meaning the remote
      cancel scope surrounding the far side task is cancelled and
      `trio.Cancelled`s are expected to be raised in that scope as per
      normal `trio` operation, and in the case where no error is raised
      in that remote scope, a `ContextCancelled` error is raised inside the
      runtime machinery and relayed back to the opener/caller side of the
      context.
    - if any actor task cancels a full remote actor runtime using
      `Portal.cancel_actor()` the same semantics as above apply except every
      other remote actor task which also has an open context with the actor
      which was cancelled will also be sent a `ContextCancelled` **but**
      with the `.canceller` field set to the uid of the original cancel
      requesting actor.
    
    This changeset also includes a more "proper" solution to the issue of
    "allowing overruns" during streaming without attempting to implement any
    form of IPC streaming backpressure. Implementing task-granularity
    backpressure cross-process turns out to be more or less impossible
    without augmenting out streaming protocol (likely at the cost of
    performance). Further allowing overruns requires special care since
    any blocking of the runtime RPC msg loop task effectively can block
    control msgs such as cancels and stream terminations.
    
    The implementation details per abstraction layer are as follows.
    
    ._streaming.Context:
    - add a new contructor factor func `mk_context()` which provides
      a strictly private init-er whilst allowing us to not have to define
      an `.__init__()` on the type def.
    - add public `.cancel_called` and `.cancel_called_remote` properties.
    - general rename of what was the internal `._backpressure` var to
      `._allow_overruns: bool`.
    - move the old contents of `Actor._push_result()` into a new
      `._deliver_msg()` allowing for better encapsulation of per-ctx
      msg handling.
     - always check for received 'error' msgs and process them with the new
       `_maybe_cancel_and_set_remote_error()` **before** any msg delivery to
       the local task, thus guaranteeing error and cancellation handling
       despite any overflow handling.
    - add a new `._drain_overflows()` task-method for use with new
      `._allow_overruns: bool = True` mode.
     - add back a `._scope_nursery: trio.Nursery` (allocated in
       `Portal.open_context()`) who's sole purpose is to spawn a single task
       which runs the above method; anything else is an error.
     - augment `._deliver_msg()` to start a task and run the above method
       when operating in no overrun mode; the task queues overflow msgs and
       attempts to send them to the underlying mem chan using a blocking
       `.send()` call.
     - on context exit, any existing "drainer task" will be cancelled and
       remaining overflow queued msgs are discarded with a warning.
    - rename `._error` -> `_remote_error` and set it in a new method
      `_maybe_cancel_and_set_remote_error()` which is called before
      processing
    - adjust `.result()` to always call `._maybe_raise_remote_err()` at its
      start such that whenever a `ContextCancelled` arrives we do logic for
      whether or not to immediately raise that error or ignore it due to the
      current actor being the one who requested the cancel, by checking the
      error's `.canceller` field.
     - set the default value of `._result` to be `id(Context()` thus avoiding
       conflict with any `.result()` actually being `False`..
    
    ._runtime.Actor:
    - augment `.cancel()` and `._cancel_task()` and `.cancel_rpc_tasks()` to
      take a `requesting_uid: tuple` indicating the source actor of every
      cancellation request.
    - pass through the new `Context._allow_overruns` through `.get_context()`
    - call the new `Context._deliver_msg()` from `._push_result()` (since
      the factoring out that method's contents).
    
    ._runtime._invoke:
    - `TastStatus.started()` back a `Context` (unless an error is raised)
      instead of the cancel scope to make it easy to set/get state on that
      context for the purposes of cancellation and remote error relay.
    - always raise any remote error via `Context._maybe_raise_remote_err()`
      before doing any `ContextCancelled` logic.
    - assign any `Context._cancel_called_remote` set by the `requesting_uid`
      cancel methods (mentioned above) to the `ContextCancelled.canceller`.
    
    ._runtime.process_messages:
    - always pass a `requesting_uid: tuple` to `Actor.cancel()` and
      `._cancel_task` to that any corresponding `ContextCancelled.canceller`
      can be set inside `._invoke()`.
    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    c720260 View commit details
    Browse the repository at this point in the history
  14. Configuration menu
    Copy the full SHA
    6994d20 View commit details
    Browse the repository at this point in the history
  15. Configuration menu
    Copy the full SHA
    6db656f View commit details
    Browse the repository at this point in the history
  16. Configuration menu
    Copy the full SHA
    f1e9c0b View commit details
    Browse the repository at this point in the history
  17. Configuration menu
    Copy the full SHA
    63adf73 View commit details
    Browse the repository at this point in the history
  18. Seriously cover all overrun cases

    This actually caught further runtime bugs so it's gud i tried..
    Add overrun-ignore enabled / disabled cases and error catching for all
    of them. More or less this should cover every possible outcome when
    it comes to setting `allow_overruns: bool` i hope XD
    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    f9911c2 View commit details
    Browse the repository at this point in the history
  19. Set Context._scope_nursery on callee side too

    Because obviously we probably want to support `allow_overruns` on the
    remote callee side as well XD
    
    Only found the bugs fixed in this patch this thanks to writing a much
    more exhaustive test set for overrun cases B)
    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    968f13f View commit details
    Browse the repository at this point in the history
  20. Configuration menu
    Copy the full SHA
    04e4397 View commit details
    Browse the repository at this point in the history
  21. Configuration menu
    Copy the full SHA
    041d7da View commit details
    Browse the repository at this point in the history
  22. Configuration menu
    Copy the full SHA
    20d75ff View commit details
    Browse the repository at this point in the history
  23. Tweak doc string

    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    7293b82 View commit details
    Browse the repository at this point in the history
  24. Configuration menu
    Copy the full SHA
    60791ed View commit details
    Browse the repository at this point in the history
  25. Expose allow_overruns to Portal.open_context()

    Turns out you can get a case where you might be opening multiple
    ctx-streams concurrently and during the context opening phase you block
    for all contexts to open, but then when you eventually start opening
    streams some slow to start context has caused the others become in an
    overrun state.. so we need to let the caller control whether that's an
    error ;)
    
    This also needs a test!
    goodboy committed May 15, 2023
    Configuration menu
    Copy the full SHA
    ead9e41 View commit details
    Browse the repository at this point in the history

Commits on May 19, 2023

  1. Remote cancellation runtime-internal vars renames

    - `Context._cancel_called_remote` -> `._cancelled_remote` since "called"
      implies the cancellation was "requested" when it could be due to
      another error and the actor uid is the value - only set once the far
      end task scope is terminated due to either error or cancel, which has
      nothing to do with *what* caused the cancellation.
    - `Actor._cancel_called_remote` -> `._cancel_called_by_remote` which
      emphasizes that this variable is **only set** IFF some remote actor
      **requested that** this actor's runtime be cancelled via
      `Actor.cancel()`.
    goodboy committed May 19, 2023
    Configuration menu
    Copy the full SHA
    a0276f4 View commit details
    Browse the repository at this point in the history

Commits on May 25, 2023

  1. Configuration menu
    Copy the full SHA
    6495688 View commit details
    Browse the repository at this point in the history

Commits on Jun 14, 2023

  1. Configuration menu
    Copy the full SHA
    17ae449 View commit details
    Browse the repository at this point in the history
  2. Configuration menu
    Copy the full SHA
    7507e26 View commit details
    Browse the repository at this point in the history