Skip to content

Tpt-tolerance: more lowlevel trio CRE/BRE -> TransportClosed translations#411

Merged
goodboy merged 18 commits intomainfrom
tpt_tolerance
Feb 19, 2026
Merged

Tpt-tolerance: more lowlevel trio CRE/BRE -> TransportClosed translations#411
goodboy merged 18 commits intomainfrom
tpt_tolerance

Conversation

@goodboy
Copy link
Owner

@goodboy goodboy commented Feb 12, 2026

More or less finishing-up/refining the new tractor.ipc exc
machinery introduced in #375 such that all "lowlevel trio" source
exception types are re-raise-propagated out of the
.ipc.Channel/.MsgTransport layers as TransportClosed, which is
generally handled with more tolerance (at least for expected graceful
comms disconnect cases) throughout the higher-level RPC machinery.

This patch includes tolerating now both,

  • trio.BrokenResourceError (BRE)
  • trio.ClosedResourceError (CRE)

raised out of the underlying transport protocol interfaces similarly
depending on its exc.message[0] content, special cases are remapped
as TransportClosed (tpt-closed) with appropriate explanatory msgs.


Summary of tweaks

  • adjustments to various test suites to match, namely handling
    tpt-closed where we were handling CRE/BRE previous.
  • removal of CRE/BRE from various internal layers which generally
    call-into/reside-on-top-of the .ipc primitives.
  • various doc-str updates in surrounding mods i needed to read or
    tweak as part of the above.
  • ofc styling adjustements inline with my recent prefs for py code,
    namely multi-line styling for fn-sigs and multi-expr-logics
    for if branching.

Copilot AI review requested due to automatic review settings February 12, 2026 02:46
@goodboy goodboy changed the base branch from main to log_sys_testing February 12, 2026 02:47
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens IPC transport fault-tolerance by translating additional low-level trio transport exceptions into TransportClosed, and adjusts RPC error-relay/debugging behavior around transport disconnects.

Changes:

  • Re-enable translation of a specific trio.ClosedResourceError (“this socket was already closed”) into TransportClosed in the transport send path.
  • Add an explicit TransportClosed handler in _errors_relayed_via_ipc() and refine debugger-entry gating/logging.
  • Change _invoke()’s hide_tb default and add notes/comments around final result sending.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.

File Description
tractor/ipc/_transport.py Adds another trio-exception → TransportClosed translation during send() failures.
tractor/_rpc.py Updates RPC error relay to treat transport disconnects specially; tweaks debugger-entry behavior and changes _invoke() defaults/comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 696 to 713
await chan.send(return_msg)
# ?TODO, remove the below since .send() already
# doesn't raise on tpt-closed?
# try:
# await chan.send(return_msg)
# except TransportClosed:
# log.exception(
# f"Failed send final result to 'parent'-side of IPC-ctx!\n"
# f'\n'
# f'{chan}\n'
# f'Channel already disconnected ??\n'
# f'\n'
# f'{pretty_struct.pformat(return_msg)}'
# )
# # ?TODO? will this ever be true though?
# if chan.connected():
# raise

Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment suggests Channel.send() “already doesn't raise on tpt-closed”, but Channel.send() re-raises TransportClosed after logging (see tractor/ipc/_chan.py). This note is misleading and may cause future changes to incorrectly assume final result sends are safe from disconnects; consider correcting/removing the comment or clarifying which layer is expected to suppress TransportClosed (if any).

Suggested change
await chan.send(return_msg)
# ?TODO, remove the below since .send() already
# doesn't raise on tpt-closed?
# try:
# await chan.send(return_msg)
# except TransportClosed:
# log.exception(
# f"Failed send final result to 'parent'-side of IPC-ctx!\n"
# f'\n'
# f'{chan}\n'
# f'Channel already disconnected ??\n'
# f'\n'
# f'{pretty_struct.pformat(return_msg)}'
# )
# # ?TODO? will this ever be true though?
# if chan.connected():
# raise
# NOTE: `Channel.send()` may still raise `TransportClosed` on
# disconnect. We currently let that propagate rather than
# treating final-result sends as best-effort.
await chan.send(return_msg)

Copilot uses AI. Check for mistakes.
tractor/_rpc.py Outdated
Comment on lines 287 to 289
except TransportClosed:
log.exception('Tpt disconnect during remote-exc relay?')
raise
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Catching TransportClosed here only handles disconnects raised directly from the wrapped invoke body. A TransportClosed raised later while attempting to ship/relay a boxed RPC error (e.g., via Channel.send() inside error-relay code paths) will not be caught by this clause and can still escape the context manager. If the goal is to tolerate disconnects during error relay, consider also handling TransportClosed around the error-shipment path (or updating the helper that ships errors to treat TransportClosed like other transport failures).

Suggested change
except TransportClosed:
log.exception('Tpt disconnect during remote-exc relay?')
raise
except TransportClosed as err:
log.exception('Tpt disconnect during remote-exc relay?')
rpc_err = err

Copilot uses AI. Check for mistakes.
tractor/_rpc.py Outdated
Comment on lines 287 to 289
except TransportClosed:
log.exception('Tpt disconnect during remote-exc relay?')
raise
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log.exception() will always emit a traceback here, which can be very noisy if TransportClosed is a relatively common/expected condition (and it looks like TransportClosed already carries its own loglevel + reporting via report_n_maybe_raise()). Consider logging at the transport-appropriate level (or deferring reporting to the outer message-loop handler) to avoid duplicate stack traces for normal disconnect scenarios.

Copilot uses AI. Check for mistakes.
tractor/_rpc.py Outdated
Comment on lines 327 to 328
and
not isinstance(err, TransportClosed)
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This not isinstance(err, TransportClosed) guard looks redundant now that TransportClosed is already handled by the preceding except TransportClosed: block. Consider removing it to simplify the condition (or, if you intend to handle TransportClosed here, restructure the exception handling so it can actually reach this branch).

Suggested change
and
not isinstance(err, TransportClosed)

Copilot uses AI. Check for mistakes.
tractor/_rpc.py Outdated
err,
api_frame=inspect.currentframe(),

# don't REPL any psuedo-expected tpt-disconnect
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in comment: "psuedo" -> "pseudo".

Suggested change
# don't REPL any psuedo-expected tpt-disconnect
# don't REPL any pseudo-expected tpt-disconnect

Copilot uses AI. Check for mistakes.
tractor/_rpc.py Outdated

is_rpc: bool = True,
hide_tb: bool = True,
hide_tb: bool = False,
Copy link

Copilot AI Feb 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing _invoke(..., hide_tb=...) default from True to False is a user-visible behavior change (it will start showing internal runtime frames by default in tracebacks for RPC tasks). If this wasn’t intentional for this PR, consider keeping the previous default and only forcing hide_tb=False at specific call sites where you want extra visibility (e.g., in debug mode / transport-error scenarios).

Suggested change
hide_tb: bool = False,
hide_tb: bool = True,

Copilot uses AI. Check for mistakes.
Base automatically changed from log_sys_testing to main February 12, 2026 03:13
goodboy and others added 3 commits February 12, 2026 00:51
For IPC-disconnects-during-teardown edge cases, augment some `._rpc`
machinery,
- in `._invoke()` around the `await chan.send(return_msg)` where we
  suppress if the underlying `Channel` already disconnected.
- add a disjoint handler in `_errors_relayed_via_ipc()` which just
  reports-n-reraises the exc (same as prior behaviour).
  * originally i thought it needed to be handled specially (to avoid
    being crash handled) but turns out that isn't necessary?
  * hence the also-added-bu-masked-out `debug_filter` / guard expression
    around the `await debug._maybe_enter_pm()` line.
- show the `._invoke()` frame for the moment.
A partial revert of commit c05d08e since it seem we already
suppress tpt-closed errors lower down in `.ipc.Channel.send()`; given
that i'm pretty sure this new handler code should basically never run?

Left in a todo to remove the masked content once i'm done more
thoroughly testing under `piker`.
Unmask the CRE case block for peer-closed socket errors which already
had a TODO about reproducing the condition. It appears this case can
happen during inter-actor comms teardowns in `piker`, but i haven't been
able to figure out exactly what reproduces it yet..

So activate the block again for that 'socket already closed'-msg case,
and add a TODO questioning how to reproduce it.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Convert `spawn` fixture to a generator and add post-test graceful
subproc cleanup via `SIGINT`/`SIGKILL` to avoid leaving stale `pexpect`
child procs around between test runs as well as any UDS-tpt socket files
under the system runtime-dir.

Deats,
- convert `return _spawn` -> `yield _spawn` to enable
  post-yield teardown logic.
- add a new `nonlocal spawned` ref so teardown logic can access the last
  spawned child from outside the delivered spawner fn-closure.
- add `SIGINT`-loop after yield with 5s timeout, then
  `SIGKILL` if proc still alive.
- add masked `breakpoint()` and TODO about UDS path cleanup

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Internal helper which falls back to sync `pdb` when the
child actor can't reach root to acquire the TTY lock.

Useful when debugging tpt layer failures (intentional or
otherwise) where a sub-actor can no longer IPC-contact the
root to coordinate REPL access; root uses `.pause()` as
normal while non-root falls back to `mk_pdb().set_trace()`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Refine tpt-error reporting to include closure attribution (`'locally'`
vs `'by peer'`), tighten match conditions and reduce needless newlines
in exc reprs.

Deats,
- factor out `trans_err_msg: str` and `by_whom: str` into a `dict`
  lookup before the `match:` block to pair specific err msgs to closure
  attribution strings.
- use `by_whom` directly as `CRE` case guard condition
  (truthy when msg matches known underlying CRE msg content).
- conveniently include `by_whom!r` in `TransportClosed` message.
- fix `'locally ?'` -> `'locally?'` in send-side `CRE`
  handler (drop errant space).
- add masked `maybe_pause_bp()` calls at both `CRE` sites (from when
  i was tracing a test harness issue where the UDS socket path wasn't
  being cleaned up on teardown).
- drop trailing `\n` from `body=` args to `TransportClosed`.
- reuse `trans_err_msg` for the `BRE`/broken-pipe guard.

Also adjust testing, namely `test_ctxep_pauses_n_maybe_ipc_breaks`'s
expected patts-set for new msg formats to be raised out of
`.ipc._transport`.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Add `TransportClosed` to except clauses where `trio`'s own
resource-closed errors are already caught, ensuring our
higher-level tpt exc is also tolerated in those same spots.
Likely i will follow up with a removal of the `trio` variants since most
*should be* caught and re-raised as tpt-closed out of the `.ipc` stack
now?

Add `TransportClosed` to various handler blocks,
- `._streaming.MsgStream.aclose()/.send()` except blocks.
- the broken-channel except in `._context.open_context_from_portal()`.
- obvi import it where necessary in those ^ mods.

Adjust `test_advanced_faults` suite + exs-script to match,
- update `ipc_failure_during_stream.py` example to catch
  `TransportClosed` alongside `trio.ClosedResourceError`
  in both the break and send-check paths.
- shield the `trio.sleep(0.01)` after tpt close in example to avoid
  taskc-raise/masking on that checkpoint since we want to simulate
  waiting for a user to send a KBI.
- loosen `ExceptionGroup` assertion to `len(excs) <= 2` and ensure all
  excs are `TransportClosed`.
- improve multi-line formatting, minor style/formatting fixes in
  condition expressions.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Remove the `trio.ClosedResourceError` and `trio.BrokenResourceError`
handling that should now be subsumed by `TransportClosed` re-raising out
of the `.ipc` stack.

Deats,
- drop CRE and BRE from `._streaming.MsgStream.aclose()/.send()` blocks.
- similarly rm from `._context.open_context_from_portal()`.
- also from `._portal.Portal.cancel_actor()` and drop the
  (now-completed-todo) comment about this exact thing.

Also add comment in `._rpc.try_ship_error_to_remote()` noting the
remaining `trio` catches there are bc the `.ipc` layers *should* be
wrapping them; thus `log.critical()` use is warranted.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Flesh out missing method doc-strings, improve log msg formatting and
assert -> `RuntimeError` for un-inited tpt layer.

Deats,
- add doc-string to `.send()` noting `TransportClosed` raise
  on comms failures.
- add doc-string to `.recv()`.
- expand `._aiter_msgs()` doc-string, line-len reflow.
- add doc-string to `.connected()`.
- convert `assert self._transport` -> `RuntimeError` raise
  in `._aiter_msgs()` for more explicit crashing.
- expand `_connect_chan()` doc-string, note it's lowlevel
  and suggest `.open_portal()` to user instead.
- factor out `src_exc_str` in `TransportClosed` log handler
  to avoid double-call
- use multiline style for `.connected()` return expr.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Brief descriptions for both fns in `._discovery` clarifying
what each delivers and under what conditions.

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy changed the title Tpt tolerance: more lowlevel trio transport excs -> TransportClosed translations Tpt-tolerance: more lowlevel trio CRE/BRE -> TransportClosed translations Feb 19, 2026
@goodboy goodboy requested a review from Copilot February 19, 2026 19:21
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.

Comments suppressed due to low confidence (1)

tractor/ipc/_transport.py:495

  • In the except (BrokenResourceError, ClosedResourceError) block, the default case _: re-raises the original Trio exception. Higher layers in this PR are being updated to only tolerate TransportClosed, so letting raw Trio transport exceptions escape here can lead to unexpected crashes. Consider always translating these low-level transport errors into TransportClosed (possibly with a generic message/loglevel) instead of re-raising trans_err.
                    # unless the disconnect condition falls under "a
                    # normal operation breakage" we usualy console warn
                    # about it.
                    case _:
                        log.exception(
                            f'{tpt_name} layer failed pre-send ??\n'
                        )
                        raise trans_err

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 108 to 114
(_time_took := (time.time() - start) < timeout)
):
ptyproc.kill(signal.SIGINT)
time.sleep(0.01)
else:
ptyproc.kill(signal.SIGKILL)

Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The teardown loop uses while ...: ... else: ptyproc.kill(SIGKILL), but since the loop has no break, the else branch will always run whenever the loop exits normally (including when the process has already exited). This risks sending SIGKILL unnecessarily (and can raise if the process is already gone). Consider replacing the while/else with an explicit timeout loop and then if spawned.isalive(): kill(SIGKILL) after the loop; also _time_took := (time.time() - start) < timeout assigns a boolean, so the name is misleading and the elapsed time isn't preserved.

Suggested change
(_time_took := (time.time() - start) < timeout)
):
ptyproc.kill(signal.SIGINT)
time.sleep(0.01)
else:
ptyproc.kill(signal.SIGKILL)
(time.time() - start) < timeout
):
ptyproc.kill(signal.SIGINT)
time.sleep(0.01)
if spawned and spawned.isalive():
try:
ptyproc.kill(signal.SIGKILL)
except OSError:
# Process may have exited between isalive() and kill()
pass

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another good catch!


This fn raises `TransportClosed` on comms failures and is
normally handled by higher level runtime machinery for the
expected-graceful cases, normally ephemercal
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling in docstring: ephemercal should be ephemeral.

Suggested change
expected-graceful cases, normally ephemercal
expected-graceful cases, normally ephemeral

Copilot uses AI. Check for mistakes.
Comment on lines +432 to +438
Yield `MsgType` IPC msgs decoded and deliverd from an
underlying `MsgTransport` protocol.

This is a streaming routine alo implemented as an async-gen
func (same a `MsgTransport._iter_pkts()`) gets allocated by
a `.__call__()` inside `.__init__()` where it is assigned to
the `._aiter_msgs` attr.
This is a streaming routine alo implemented as an
async-generator func (same a `MsgTransport._iter_pkts()`)
gets allocated by a `.__call__()` inside `.__init__()` where
it is assigned to the `._aiter_msgs` attr.
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling/grammar in docstring: deliverd -> delivered, alo -> also, and same a -> same as.

Copilot uses AI. Check for mistakes.
# XXX NOTE XXX in SC terms this is one of the worst things
# that can happen and provides for a 2-general's dilemma..
#
# FURHTER, we should never really have to handle these
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling in comment: FURHTER should be FURTHER.

Suggested change
# FURHTER, we should never really have to handle these
# FURTHER, we should never really have to handle these

Copilot uses AI. Check for mistakes.
tpt_name: str = f'{type(self).__name__!r}'

trans_err_msg: str = trans_err.args[0]
by_whom: str = {
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by_whom can be None when the error message isn't in the mapping, but it's annotated as str. This is a type mismatch and can mask the fact that the subsequent match branch depends on optionality; annotate as str | None (or provide a non-None default) so type checkers and readers get the correct contract.

Suggested change
by_whom: str = {
by_whom: str | None = {

Copilot uses AI. Check for mistakes.
Comment on lines 258 to 259
(isinstance(exc, TransportClosed)
for exc in excs)
Copy link

Copilot AI Feb 19, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assertion doesn't do what it intends: (isinstance(exc, TransportClosed) for exc in excs) creates a generator which is always truthy, so the test will pass even when excs contains non-TransportClosed exceptions. Use all(isinstance(exc, TransportClosed) for exc in excs) (and consider asserting the expected exception types/count explicitly).

Suggested change
(isinstance(exc, TransportClosed)
for exc in excs)
all(
isinstance(exc, TransportClosed)
for exc in excs
)

Copilot uses AI. Check for mistakes.
Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

woo good catch ma fren!

goodboy and others added 5 commits February 19, 2026 15:24
Adjust `basic_echo_server()` default sequence len to avoid the race
where the 'tell_little_bro()` finished streaming **before** the
echo-server sub is cancelled by its peer subactor (which is the whole
thing we're testing!).

Deats,
- bump `rng_seed` default from 50 -> 100 to ensure peer
  cancel req arrives before echo dialog completes on fast hw.
- add `trio.sleep(0.001)` between send/receive in msg loop on the
  "client" streamer side to give cancel request transit more time to
  arrive.

Also,
- add more native `tractor`-type hints.
- reflow `basic_echo_server()` doc-string for 67 char limit
- add masked `pause()` call with comment about unreachable
  code path
- alphabetize imports: mv `current_actor` and `open_nursery`
  below typed imports

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Improve the `spawn` fixture teardown logic in
`tests/devx/conftest.py` fixing the while-else bug, and fix
`test_advanced_faults` genexp for `TransportClosed` exc type
checking.

Deats,
- replace broken `while-else` pattern with direct
  `if ptyproc.isalive()` check after the SIGINT loop.
- fix undefined `spawned` ref -> `ptyproc.isalive()` in
  while condition.
- improve walrus expr formatting in timeout check (multiline
  style).

Also fix `test_ipc_channel_break_during_stream()` assertion,
- wrap genexp in `all()` call so it actually checks all excs
  are `TransportClosed` instead of just creating an unused
  generator.

(this patch was suggested by copilot in,
 #411)

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
Expand and clarify the comment for the default `case _`
block in the `.send()` error matcher, noting that we
console-error and raise-thru for unexpected disconnect
conditions.

(this patch was suggested by copilot in,
 #411)

(this commit msg was generated in some part by [`claude-code`][claude-code-gh])
[claude-code-gh]: https://github.com/anthropics/claude-code
@goodboy goodboy merged commit 70bb772 into main Feb 19, 2026
2 checks passed
@goodboy goodboy deleted the tpt_tolerance branch February 19, 2026 21:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments