Tpt-tolerance: more lowlevel trio CRE/BRE -> TransportClosed translations#411
Tpt-tolerance: more lowlevel trio CRE/BRE -> TransportClosed translations#411
trio CRE/BRE -> TransportClosed translations#411Conversation
There was a problem hiding this comment.
Pull request overview
This PR tightens IPC transport fault-tolerance by translating additional low-level trio transport exceptions into TransportClosed, and adjusts RPC error-relay/debugging behavior around transport disconnects.
Changes:
- Re-enable translation of a specific
trio.ClosedResourceError(“this socket was already closed”) intoTransportClosedin the transport send path. - Add an explicit
TransportClosedhandler in_errors_relayed_via_ipc()and refine debugger-entry gating/logging. - Change
_invoke()’shide_tbdefault and add notes/comments around final result sending.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
tractor/ipc/_transport.py |
Adds another trio-exception → TransportClosed translation during send() failures. |
tractor/_rpc.py |
Updates RPC error relay to treat transport disconnects specially; tweaks debugger-entry behavior and changes _invoke() defaults/comments. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| await chan.send(return_msg) | ||
| # ?TODO, remove the below since .send() already | ||
| # doesn't raise on tpt-closed? | ||
| # try: | ||
| # await chan.send(return_msg) | ||
| # except TransportClosed: | ||
| # log.exception( | ||
| # f"Failed send final result to 'parent'-side of IPC-ctx!\n" | ||
| # f'\n' | ||
| # f'{chan}\n' | ||
| # f'Channel already disconnected ??\n' | ||
| # f'\n' | ||
| # f'{pretty_struct.pformat(return_msg)}' | ||
| # ) | ||
| # # ?TODO? will this ever be true though? | ||
| # if chan.connected(): | ||
| # raise | ||
|
|
There was a problem hiding this comment.
The comment suggests Channel.send() “already doesn't raise on tpt-closed”, but Channel.send() re-raises TransportClosed after logging (see tractor/ipc/_chan.py). This note is misleading and may cause future changes to incorrectly assume final result sends are safe from disconnects; consider correcting/removing the comment or clarifying which layer is expected to suppress TransportClosed (if any).
| await chan.send(return_msg) | |
| # ?TODO, remove the below since .send() already | |
| # doesn't raise on tpt-closed? | |
| # try: | |
| # await chan.send(return_msg) | |
| # except TransportClosed: | |
| # log.exception( | |
| # f"Failed send final result to 'parent'-side of IPC-ctx!\n" | |
| # f'\n' | |
| # f'{chan}\n' | |
| # f'Channel already disconnected ??\n' | |
| # f'\n' | |
| # f'{pretty_struct.pformat(return_msg)}' | |
| # ) | |
| # # ?TODO? will this ever be true though? | |
| # if chan.connected(): | |
| # raise | |
| # NOTE: `Channel.send()` may still raise `TransportClosed` on | |
| # disconnect. We currently let that propagate rather than | |
| # treating final-result sends as best-effort. | |
| await chan.send(return_msg) |
tractor/_rpc.py
Outdated
| except TransportClosed: | ||
| log.exception('Tpt disconnect during remote-exc relay?') | ||
| raise |
There was a problem hiding this comment.
Catching TransportClosed here only handles disconnects raised directly from the wrapped invoke body. A TransportClosed raised later while attempting to ship/relay a boxed RPC error (e.g., via Channel.send() inside error-relay code paths) will not be caught by this clause and can still escape the context manager. If the goal is to tolerate disconnects during error relay, consider also handling TransportClosed around the error-shipment path (or updating the helper that ships errors to treat TransportClosed like other transport failures).
| except TransportClosed: | |
| log.exception('Tpt disconnect during remote-exc relay?') | |
| raise | |
| except TransportClosed as err: | |
| log.exception('Tpt disconnect during remote-exc relay?') | |
| rpc_err = err |
tractor/_rpc.py
Outdated
| except TransportClosed: | ||
| log.exception('Tpt disconnect during remote-exc relay?') | ||
| raise |
There was a problem hiding this comment.
log.exception() will always emit a traceback here, which can be very noisy if TransportClosed is a relatively common/expected condition (and it looks like TransportClosed already carries its own loglevel + reporting via report_n_maybe_raise()). Consider logging at the transport-appropriate level (or deferring reporting to the outer message-loop handler) to avoid duplicate stack traces for normal disconnect scenarios.
tractor/_rpc.py
Outdated
| and | ||
| not isinstance(err, TransportClosed) |
There was a problem hiding this comment.
This not isinstance(err, TransportClosed) guard looks redundant now that TransportClosed is already handled by the preceding except TransportClosed: block. Consider removing it to simplify the condition (or, if you intend to handle TransportClosed here, restructure the exception handling so it can actually reach this branch).
| and | |
| not isinstance(err, TransportClosed) |
tractor/_rpc.py
Outdated
| err, | ||
| api_frame=inspect.currentframe(), | ||
|
|
||
| # don't REPL any psuedo-expected tpt-disconnect |
There was a problem hiding this comment.
Typo in comment: "psuedo" -> "pseudo".
| # don't REPL any psuedo-expected tpt-disconnect | |
| # don't REPL any pseudo-expected tpt-disconnect |
tractor/_rpc.py
Outdated
|
|
||
| is_rpc: bool = True, | ||
| hide_tb: bool = True, | ||
| hide_tb: bool = False, |
There was a problem hiding this comment.
Changing _invoke(..., hide_tb=...) default from True to False is a user-visible behavior change (it will start showing internal runtime frames by default in tracebacks for RPC tasks). If this wasn’t intentional for this PR, consider keeping the previous default and only forcing hide_tb=False at specific call sites where you want extra visibility (e.g., in debug mode / transport-error scenarios).
| hide_tb: bool = False, | |
| hide_tb: bool = True, |
7f602db to
f03730e
Compare
For IPC-disconnects-during-teardown edge cases, augment some `._rpc`
machinery,
- in `._invoke()` around the `await chan.send(return_msg)` where we
suppress if the underlying `Channel` already disconnected.
- add a disjoint handler in `_errors_relayed_via_ipc()` which just
reports-n-reraises the exc (same as prior behaviour).
* originally i thought it needed to be handled specially (to avoid
being crash handled) but turns out that isn't necessary?
* hence the also-added-bu-masked-out `debug_filter` / guard expression
around the `await debug._maybe_enter_pm()` line.
- show the `._invoke()` frame for the moment.
A partial revert of commit c05d08e since it seem we already suppress tpt-closed errors lower down in `.ipc.Channel.send()`; given that i'm pretty sure this new handler code should basically never run? Left in a todo to remove the masked content once i'm done more thoroughly testing under `piker`.
Unmask the CRE case block for peer-closed socket errors which already had a TODO about reproducing the condition. It appears this case can happen during inter-actor comms teardowns in `piker`, but i haven't been able to figure out exactly what reproduces it yet.. So activate the block again for that 'socket already closed'-msg case, and add a TODO questioning how to reproduce it. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
f03730e to
f8e2568
Compare
Convert `spawn` fixture to a generator and add post-test graceful subproc cleanup via `SIGINT`/`SIGKILL` to avoid leaving stale `pexpect` child procs around between test runs as well as any UDS-tpt socket files under the system runtime-dir. Deats, - convert `return _spawn` -> `yield _spawn` to enable post-yield teardown logic. - add a new `nonlocal spawned` ref so teardown logic can access the last spawned child from outside the delivered spawner fn-closure. - add `SIGINT`-loop after yield with 5s timeout, then `SIGKILL` if proc still alive. - add masked `breakpoint()` and TODO about UDS path cleanup (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Internal helper which falls back to sync `pdb` when the child actor can't reach root to acquire the TTY lock. Useful when debugging tpt layer failures (intentional or otherwise) where a sub-actor can no longer IPC-contact the root to coordinate REPL access; root uses `.pause()` as normal while non-root falls back to `mk_pdb().set_trace()`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Refine tpt-error reporting to include closure attribution (`'locally'` vs `'by peer'`), tighten match conditions and reduce needless newlines in exc reprs. Deats, - factor out `trans_err_msg: str` and `by_whom: str` into a `dict` lookup before the `match:` block to pair specific err msgs to closure attribution strings. - use `by_whom` directly as `CRE` case guard condition (truthy when msg matches known underlying CRE msg content). - conveniently include `by_whom!r` in `TransportClosed` message. - fix `'locally ?'` -> `'locally?'` in send-side `CRE` handler (drop errant space). - add masked `maybe_pause_bp()` calls at both `CRE` sites (from when i was tracing a test harness issue where the UDS socket path wasn't being cleaned up on teardown). - drop trailing `\n` from `body=` args to `TransportClosed`. - reuse `trans_err_msg` for the `BRE`/broken-pipe guard. Also adjust testing, namely `test_ctxep_pauses_n_maybe_ipc_breaks`'s expected patts-set for new msg formats to be raised out of `.ipc._transport`. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Add `TransportClosed` to except clauses where `trio`'s own resource-closed errors are already caught, ensuring our higher-level tpt exc is also tolerated in those same spots. Likely i will follow up with a removal of the `trio` variants since most *should be* caught and re-raised as tpt-closed out of the `.ipc` stack now? Add `TransportClosed` to various handler blocks, - `._streaming.MsgStream.aclose()/.send()` except blocks. - the broken-channel except in `._context.open_context_from_portal()`. - obvi import it where necessary in those ^ mods. Adjust `test_advanced_faults` suite + exs-script to match, - update `ipc_failure_during_stream.py` example to catch `TransportClosed` alongside `trio.ClosedResourceError` in both the break and send-check paths. - shield the `trio.sleep(0.01)` after tpt close in example to avoid taskc-raise/masking on that checkpoint since we want to simulate waiting for a user to send a KBI. - loosen `ExceptionGroup` assertion to `len(excs) <= 2` and ensure all excs are `TransportClosed`. - improve multi-line formatting, minor style/formatting fixes in condition expressions. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Remove the `trio.ClosedResourceError` and `trio.BrokenResourceError` handling that should now be subsumed by `TransportClosed` re-raising out of the `.ipc` stack. Deats, - drop CRE and BRE from `._streaming.MsgStream.aclose()/.send()` blocks. - similarly rm from `._context.open_context_from_portal()`. - also from `._portal.Portal.cancel_actor()` and drop the (now-completed-todo) comment about this exact thing. Also add comment in `._rpc.try_ship_error_to_remote()` noting the remaining `trio` catches there are bc the `.ipc` layers *should* be wrapping them; thus `log.critical()` use is warranted. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Flesh out missing method doc-strings, improve log msg formatting and assert -> `RuntimeError` for un-inited tpt layer. Deats, - add doc-string to `.send()` noting `TransportClosed` raise on comms failures. - add doc-string to `.recv()`. - expand `._aiter_msgs()` doc-string, line-len reflow. - add doc-string to `.connected()`. - convert `assert self._transport` -> `RuntimeError` raise in `._aiter_msgs()` for more explicit crashing. - expand `_connect_chan()` doc-string, note it's lowlevel and suggest `.open_portal()` to user instead. - factor out `src_exc_str` in `TransportClosed` log handler to avoid double-call - use multiline style for `.connected()` return expr. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Brief descriptions for both fns in `._discovery` clarifying what each delivers and under what conditions. (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
trio transport excs -> TransportClosed translationstrio CRE/BRE -> TransportClosed translations
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
Comments suppressed due to low confidence (1)
tractor/ipc/_transport.py:495
- In the
except (BrokenResourceError, ClosedResourceError)block, the defaultcase _:re-raises the original Trio exception. Higher layers in this PR are being updated to only tolerateTransportClosed, so letting raw Trio transport exceptions escape here can lead to unexpected crashes. Consider always translating these low-level transport errors intoTransportClosed(possibly with a generic message/loglevel) instead of re-raisingtrans_err.
# unless the disconnect condition falls under "a
# normal operation breakage" we usualy console warn
# about it.
case _:
log.exception(
f'{tpt_name} layer failed pre-send ??\n'
)
raise trans_err
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tests/devx/conftest.py
Outdated
| (_time_took := (time.time() - start) < timeout) | ||
| ): | ||
| ptyproc.kill(signal.SIGINT) | ||
| time.sleep(0.01) | ||
| else: | ||
| ptyproc.kill(signal.SIGKILL) | ||
|
|
There was a problem hiding this comment.
The teardown loop uses while ...: ... else: ptyproc.kill(SIGKILL), but since the loop has no break, the else branch will always run whenever the loop exits normally (including when the process has already exited). This risks sending SIGKILL unnecessarily (and can raise if the process is already gone). Consider replacing the while/else with an explicit timeout loop and then if spawned.isalive(): kill(SIGKILL) after the loop; also _time_took := (time.time() - start) < timeout assigns a boolean, so the name is misleading and the elapsed time isn't preserved.
| (_time_took := (time.time() - start) < timeout) | |
| ): | |
| ptyproc.kill(signal.SIGINT) | |
| time.sleep(0.01) | |
| else: | |
| ptyproc.kill(signal.SIGKILL) | |
| (time.time() - start) < timeout | |
| ): | |
| ptyproc.kill(signal.SIGINT) | |
| time.sleep(0.01) | |
| if spawned and spawned.isalive(): | |
| try: | |
| ptyproc.kill(signal.SIGKILL) | |
| except OSError: | |
| # Process may have exited between isalive() and kill() | |
| pass |
|
|
||
| This fn raises `TransportClosed` on comms failures and is | ||
| normally handled by higher level runtime machinery for the | ||
| expected-graceful cases, normally ephemercal |
There was a problem hiding this comment.
Spelling in docstring: ephemercal should be ephemeral.
| expected-graceful cases, normally ephemercal | |
| expected-graceful cases, normally ephemeral |
| Yield `MsgType` IPC msgs decoded and deliverd from an | ||
| underlying `MsgTransport` protocol. | ||
|
|
||
| This is a streaming routine alo implemented as an async-gen | ||
| func (same a `MsgTransport._iter_pkts()`) gets allocated by | ||
| a `.__call__()` inside `.__init__()` where it is assigned to | ||
| the `._aiter_msgs` attr. | ||
| This is a streaming routine alo implemented as an | ||
| async-generator func (same a `MsgTransport._iter_pkts()`) | ||
| gets allocated by a `.__call__()` inside `.__init__()` where | ||
| it is assigned to the `._aiter_msgs` attr. |
There was a problem hiding this comment.
Spelling/grammar in docstring: deliverd -> delivered, alo -> also, and same a -> same as.
| # XXX NOTE XXX in SC terms this is one of the worst things | ||
| # that can happen and provides for a 2-general's dilemma.. | ||
| # | ||
| # FURHTER, we should never really have to handle these |
There was a problem hiding this comment.
Spelling in comment: FURHTER should be FURTHER.
| # FURHTER, we should never really have to handle these | |
| # FURTHER, we should never really have to handle these |
| tpt_name: str = f'{type(self).__name__!r}' | ||
|
|
||
| trans_err_msg: str = trans_err.args[0] | ||
| by_whom: str = { |
There was a problem hiding this comment.
by_whom can be None when the error message isn't in the mapping, but it's annotated as str. This is a type mismatch and can mask the fact that the subsequent match branch depends on optionality; annotate as str | None (or provide a non-None default) so type checkers and readers get the correct contract.
| by_whom: str = { | |
| by_whom: str | None = { |
tests/test_advanced_faults.py
Outdated
| (isinstance(exc, TransportClosed) | ||
| for exc in excs) |
There was a problem hiding this comment.
This assertion doesn't do what it intends: (isinstance(exc, TransportClosed) for exc in excs) creates a generator which is always truthy, so the test will pass even when excs contains non-TransportClosed exceptions. Use all(isinstance(exc, TransportClosed) for exc in excs) (and consider asserting the expected exception types/count explicitly).
| (isinstance(exc, TransportClosed) | |
| for exc in excs) | |
| all( | |
| isinstance(exc, TransportClosed) | |
| for exc in excs | |
| ) |
Adjust `basic_echo_server()` default sequence len to avoid the race where the 'tell_little_bro()` finished streaming **before** the echo-server sub is cancelled by its peer subactor (which is the whole thing we're testing!). Deats, - bump `rng_seed` default from 50 -> 100 to ensure peer cancel req arrives before echo dialog completes on fast hw. - add `trio.sleep(0.001)` between send/receive in msg loop on the "client" streamer side to give cancel request transit more time to arrive. Also, - add more native `tractor`-type hints. - reflow `basic_echo_server()` doc-string for 67 char limit - add masked `pause()` call with comment about unreachable code path - alphabetize imports: mv `current_actor` and `open_nursery` below typed imports (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Improve the `spawn` fixture teardown logic in `tests/devx/conftest.py` fixing the while-else bug, and fix `test_advanced_faults` genexp for `TransportClosed` exc type checking. Deats, - replace broken `while-else` pattern with direct `if ptyproc.isalive()` check after the SIGINT loop. - fix undefined `spawned` ref -> `ptyproc.isalive()` in while condition. - improve walrus expr formatting in timeout check (multiline style). Also fix `test_ipc_channel_break_during_stream()` assertion, - wrap genexp in `all()` call so it actually checks all excs are `TransportClosed` instead of just creating an unused generator. (this patch was suggested by copilot in, #411) (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
Expand and clarify the comment for the default `case _` block in the `.send()` error matcher, noting that we console-error and raise-thru for unexpected disconnect conditions. (this patch was suggested by copilot in, #411) (this commit msg was generated in some part by [`claude-code`][claude-code-gh]) [claude-code-gh]: https://github.com/anthropics/claude-code
More or less finishing-up/refining the new
tractor.ipcexcmachinery introduced in #375 such that all "lowlevel
trio" sourceexception types are re-raise-propagated out of the
.ipc.Channel/.MsgTransportlayers asTransportClosed, which isgenerally handled with more tolerance (at least for expected graceful
comms disconnect cases) throughout the higher-level RPC machinery.
This patch includes tolerating now both,
trio.BrokenResourceError(BRE)trio.ClosedResourceError(CRE)raised out of the underlying transport protocol interfaces similarly
depending on its
exc.message[0]content, special cases are remappedas
TransportClosed(tpt-closed) with appropriate explanatory msgs.Summary of tweaks
tpt-closed where we were handling CRE/BRE previous.
call-into/reside-on-top-of the
.ipcprimitives.tweak as part of the above.
namely multi-line styling for fn-sigs and multi-expr-logics
for
ifbranching.