Sigintsaviour ci worked #315

goodboy · 2022-07-11T20:05:33Z

Purely to run CI and see what the diff is with #165 history...

goodboy · 2022-07-11T21:04:13Z

Oof 😂

This gets very close to avoiding any possible hangs to do with tty locking and SIGINT handling minus a special case that will be detailed below.

Summary of implementation changes:

- convert `_mk_pdb()` -> `with _open_pdb() as pdb:` which implicitly handles the `bdb.BdbQuit` case such that debugger teardown hooks are always called.
- rename the handler to `shield_sigint()` and handle a variety of new cases:
  * the root is in debug but hasn't been cancelled -> call `Actor.cancel_soon()`
  * the root is in debug but *has* been cancelled (`Actor.cancel_soon()` already called) -> raise KBI
  * a child is in debug *and* has a task locking the debugger -> ignore SIGINT in child *and* the root actor.
- if the debugger instance is provided to the handler at acquire time, on SIGINT handling completion re-print the last pdb++ REPL output so that the user realizes they are still actively in debug.
- ignore the unlock case where a race condition of "no task" holding the lock causes the `RuntimeError` normally associated with the "wrong task" doing so (not sure if this is a `trio` bug?).
- change debug logs to runtime level.

Unhandled case(s):

- a child is maybe in debug mode but does not itself have any task using the debugger.
  * ToDo: we need a way to decide what to do with "intermediate" child actors who themselves are not in `debug_mode=True` but have children who *are*, such that a SIGINT won't cause cancellation of that child-as-parent-of-another-child **iff** any of their children are in debug mode.
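The handler cases listed above can be read as a small decision table. What follows is a minimal sketch under stated assumptions, not the actual `shield_sigint()` code: `Action`, `decide_sigint_action()` and its boolean flags are hypothetical names used only for illustration, and the fallback of raising KBI when no debugger is active is an assumption.

```python
from enum import Enum, auto


class Action(Enum):
    """Hypothetical outcomes of a SIGINT while debugging."""

    CANCEL_SOON = auto()  # request cancellation, e.g. via Actor.cancel_soon()
    RAISE_KBI = auto()    # let KeyboardInterrupt propagate
    IGNORE = auto()       # swallow the signal so the REPL stays usable


def decide_sigint_action(
    *,
    is_root: bool,
    root_in_debug: bool,
    cancel_requested: bool,
    child_holds_lock: bool,
) -> Action:
    # A child task owns the debugger lock: ignore SIGINT in both the
    # child and the root so the active REPL is not torn down.
    if child_holds_lock:
        return Action.IGNORE

    # The root itself is in debug: the first ctrl-c requests cancellation,
    # a second one (cancel already requested) raises KBI.
    if is_root and root_in_debug:
        return Action.RAISE_KBI if cancel_requested else Action.CANCEL_SOON

    # Assumption: with no debugger active, fall back to normal KBI handling.
    return Action.RAISE_KBI
```

The "intermediate" child case described as unhandled is deliberately not modelled here.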

Using either of `@pdb.hideframe` or `__tracebackhide__` on stdlib methods doesn't seem to work either.. This all seems to have something to do with async generator usage I think ?

None of it worked (you will still see `.__exit__()` frames on debugger entry - you'd think this would have been solved by now but, shrug) so instead wrap the debugger entry-point in a `try:` and put the SIGINT handler restoration inside `MultiActorPdb` teardown hooks. This seems to restore the UX as it was prior while also giving the desired SIGINT override handler behaviour.
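A rough sketch of that shape, assuming a debugger object that owns its own teardown hook. `HookedDebugger` and `enter_debugger()` are hypothetical stand-ins and not the real `MultiActorPdb` API; only `signal.signal()` and `bdb.BdbQuit` are real stdlib names.

```python
import bdb
import signal


class HookedDebugger:
    """Hypothetical stand-in for a pdb subclass with a teardown hook."""

    def __init__(self) -> None:
        self._orig_sigint = None

    def shield_sigint(self, handler) -> None:
        # signal.signal() returns the handler being replaced; keep it
        # so the teardown hook can put it back later.
        self._orig_sigint = signal.signal(signal.SIGINT, handler)

    def teardown(self) -> None:
        # Idempotent: safe to call from several resumption paths.
        if self._orig_sigint is not None:
            signal.signal(signal.SIGINT, self._orig_sigint)
            self._orig_sigint = None


def enter_debugger(dbg: HookedDebugger, handler, repl) -> None:
    """Install the override, run the REPL, tear down on quit."""
    dbg.shield_sigint(handler)
    try:
        # `repl` is expected to call dbg.teardown() itself when the
        # user continues; no context manager wraps this call, so no
        # extra __exit__() frame sits above the user's code.
        repl(dbg)
    except bdb.BdbQuit:
        # Quitting the REPL must also run the teardown hook.
        dbg.teardown()
        raise
```

Note that `signal.signal()` may only be called from the main thread of the main interpreter.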

Finally! I think this may be the root issue we've been seeing in production in a client project. No idea yet why this is happening but the fault-causing sequence seems to be:

- `.open_context()` in a child actor
- enter the debugger via `tractor.breakpoint()`
- continue from that entry via `c` command in REPL
- raise an error just after inside the context task's body

Looking at logging it appears as though the child thinks it has the tty but no input is accepted on the REPL and a further `ctrl-c` results in some teardown but also a further hang where both parent and child become unresponsive..

There's a bug that's triggered in the stdlib without latest `pdb++` installed; add a note for that. Further inside `wait_for_parent_stdin_hijack()` don't `.started()` until the interactor stream has been opened to avoid races when debugging this `._debug.py` module (at the least) since we usually don't want the spawning (parent) task to resume until we know for sure the tty lock has been acquired. Also, drop the random checkpoint we had inside `_breakpoint()`, not sure it was actually adding anything useful since we're (mostly) carefully shielded throughout this func.

There's no point in sending a cancel message to the remote linked task and especially no reason to block waiting on a result from that task if the transport layer is detected to be disconnected. We expect that the transport shouldn't go down at the layer of the message loop (reconnection logic should be handled in the transport layer itself) so if we detect the channel is not connected we don't bother requesting cancels nor waiting on a final result message.

Why?

- if the connection goes down in error the caller side won't have a way to know "how long" it should block to wait for a cancel ack or result and causes a potential hang that may require an additional ctrl-c from the user especially if using the debugger or if the traceback is not seen on console.
- obviously there's no point in waiting for messages when there's no transport to deliver them XD

Further, add some more detailed cancel logging detailing the task and actor ids.
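The guard described above might look roughly like the following. This is only a sketch using `asyncio` for self-containment; `FakeChannel`, `cancel_remote_task()` and the message dict layout are hypothetical and do not reflect the real channel or message API.

```python
import asyncio
import logging
from typing import Optional

log = logging.getLogger(__name__)


class FakeChannel:
    """Hypothetical stand-in for an IPC channel, for this sketch only."""

    def __init__(self, connected: bool, result: str = "cancelled") -> None:
        self._connected = connected
        self._result = result
        self.sent = []

    def connected(self) -> bool:
        return self._connected

    async def send(self, msg: dict) -> None:
        self.sent.append(msg)

    async def recv(self) -> str:
        return self._result


async def cancel_remote_task(chan: FakeChannel, cid: str) -> Optional[str]:
    """Request a remote cancel only while the transport is still up."""
    if not chan.connected():
        # No transport: a cancel request could never be delivered and a
        # result could never arrive, so neither send nor block.
        log.warning("Channel down, skipping cancel of task %s", cid)
        return None

    log.info("Cancelling remote task %s", cid)
    await chan.send({"cmd": "cancel", "cid": cid})
    # Only wait for the final result when it can actually be delivered.
    return await chan.recv()
```

With a disconnected channel the coroutine returns immediately instead of waiting on an ack that cannot arrive, which is the hang the commit message describes.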

The method now returns a `bool` which flags whether the transport died to the caller and allows for reporting a disconnect in the channel-transport handler task. This is something a user will normally want to know about on the caller side especially after seeing a traceback from the peer (if in tree) on console.

A hopefully significant fix here is to always avoid suppressing a SIGINT when the root actor cannot detect an active IPC connection (via a connected channel) to the supposed debug lock holding actor. In that case it is most likely that the actor has either terminated or has lost its connection for debugger control and there is no way the root can verify the lock is in use; thus we choose to allow KBI cancellation. Drop the (by comment) `try`-`finally` block in `_hijack_stdin_for_child()` around the `_acquire_debug_lock()` call since all that logic should now be handled internal to that locking manager. Try to catch a weird error around the `.do_longlist()` method call that seems to sometimes break on py3.10 and latest `pdbpp`.
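As an illustration of that root-side check, here is a minimal sketch. The function name, the `(name, uuid)` tuple used as an actor id, and the set of connected peers are all assumptions made for the example rather than the real runtime structures.

```python
from typing import Optional, Set, Tuple

# Hypothetical actor identifier: (name, uuid).
ActorUid = Tuple[str, str]


def root_should_ignore_sigint(
    lock_holder: Optional[ActorUid],
    connected_peers: Set[ActorUid],
) -> bool:
    """Suppress SIGINT only if the lock holder is verifiably reachable."""
    if lock_holder is None:
        # Nobody claims the debug lock: nothing to protect.
        return False

    # If no connected channel leads to the supposed lock holder, it has
    # most likely terminated or lost debugger control, so the SIGINT is
    # allowed through (KBI cancellation).
    return lock_holder in connected_peers
```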

Ensure that even when `pdb` resumption methods are called during a crash where `trio`'s runtime has already terminated (eg. `Event.set()` will raise) we always revert our sigint handler to the original. Further inside the handler if we hit a case where a child is in debug and (thinks it) has the global pdb lock, if it has no IPC connection to a parent, simply presume tty sync-coordination is now lost and cancel the child immediately.
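The "always revert" guarantee is essentially a `try`/`finally` around whatever release step may fail. A small sketch, assuming a hypothetical `release_and_restore()` helper and a caller-supplied `release` callable (standing in for something like an `Event.set()` call); whether the real code re-raises or swallows the error is not stated, so this version simply lets it propagate.

```python
import signal


def release_and_restore(release, orig_handler) -> None:
    """Run a release step and restore the SIGINT handler no matter what."""
    try:
        # May raise if the async runtime has already shut down.
        release()
    finally:
        # Runs on success and on failure alike, so the process never
        # stays stuck with the debug-mode SIGINT override installed.
        signal.signal(signal.SIGINT, orig_handler)
```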

goodboy · 2022-09-16T00:18:13Z

Legacy testing branch.

goodboy added 29 commits July 27, 2022 11:37

Add WIP while-debugger-active SIGINT ignore handler

0503142

(facepalm) Reraise BdbQuit and discard ownerless lock releases

1e789ec

Make mypy happy

aee00e6

Add a pre-started breakpoint example

aad9d7e

Handle a context cancel? Might be a noop

a8a2110

Fix example name typo

4ea2bc5

Try overriding _GeneratorContextManager.__exit__(); didn't work..

a617631

Using either of `@pdb.hideframe` or `__tracebackhide__` on stdlib methods doesn't seem to work either.. This all seems to have something to do with async generator usage I think ?

Add and use a pdb instance factory

4e6d009

Typing fixes, simplify _set_trace()

5dd8adc

Drop high log level in ctx example

7481982

Drop unneeded backframe traceback hide annotation

4e06b10

Type annot updates

e2169f2

Avoid attr error XD

1163ec5

Pre-declare disconnected flag

df16a0c

Add back in async gen loop

67607a4

Add example that triggers bug #302

ebefd6e

Make example a subpkg for python -m <mod> testing

7ecc48b

Only warn on trio.BrokenResourceErrors from _invoke()

1fd4588

Just warn on IPC breaks

2800100

Log cancels with appropriate level

d1f347c

Always call pdb hook even if tty locking fails

11c1582

Tolerate double .remove()s of stream on portal teardowns

7b40491

goodboy added 6 commits July 27, 2022 11:38

Readme formatting tweaks

dade6a4

Add runtime level msg around channel draining

70e4458

Add spaces before values in log msg

8a70a52

Move pydantic-click hang example to new dir, skip in test suite

ee8ead4

Show full KBI trace for help with CI hangs

7dd72e0

goodboy force-pushed the sigintsaviour_ci_worked branch from cc18c84 to 7dd72e0 Compare July 27, 2022 19:15

goodboy closed this Sep 16, 2022
