Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do we need to detect KBI for debugging? #148

Open
goodboy opened this issue Aug 19, 2020 · 4 comments
Open

Do we need to detect KBI for debugging? #148

goodboy opened this issue Aug 19, 2020 · 4 comments
Assignees
Labels
debugger enhancement New feature or request experiment Exploratory design and testing question Further information is requested

Comments

@goodboy
Copy link
Owner

goodboy commented Aug 19, 2020

I've been running into this problem using the new debugger support from #129:

  • ctrl-c while already in debugger
  • SIGINT during a request for the tty lock in the root-parent from a child process

In some cases this causes awkward crashing with errors other then the expected KBI and in the worst case the program actually hangs since the child causes deadlock trying to acquire the parent's tty lock (and the parent is waiting for the child to terminate - speaking of which this might be a good test as part of #145).

I got some tips in the trio gitter channel.

@oremanj said:

with the current Ctrl+C handling, I don't think it's possible -- you don't know the whole tree is getting cancelled, because the KI starts at whatever-the-main-task-is-doing and propagates upward cancelling one layer at a time. there's a somewhat stalled effort to improve this. you're probably best off implementing your own SIGINT handling

if you want a non-good way to directly detect a propagating KI exception, uh, I guess you can do something like:

task = trio.lowlevel.current_task()
nurseries_outward = task.child_nurseries[::-1]
while task.parent_nursery is not None:
    nurseries_outward.append(task.parent_nursery)
    task = task.parent_nursery.parent_task
for nursery in nurseries_outward:
    if nursery.cancel_scope.cancel_called and any(isinstance(exc, KeyboardInterrupt) for exc in nursery._pending_excs:
        return True
return False

@njsmith followed up with:

one quirk of SIGINT, if you're using subprocesses, is that usually the OS sends separate copies of the signal to all of your subprocesses

To which I suggested separate SIGINT handlers per proc, to which he responsed:

or else disable SIGINT in the child processes and rely on some other mechanism for propagating cancellation across process boundaries

@oremanj also mentioned:

the parent SIGINT will cancel trio.run_process which will deliver the cancellation to the children using its deliver_cancel argument, you can modify that to send them a different signal, send some state via a non-signal mechanism, etc
the default sends SIGTERM, waits, sends SIGKILL. yeah it sounds like you probably just want to block SIGINT in your subprocesses

So I think the solution might be best accomplished via the suggestion of ignoring SIGINT in child procs and using our cancel method call to accomplish that. Also, the deliver_cancel system @oremanj mentioned is on my list to check out.

@goodboy goodboy added enhancement New feature or request question Further information is requested experiment Exploratory design and testing labels Aug 19, 2020
@goodboy
Copy link
Owner Author

goodboy commented Sep 9, 2020

Ok, so figured out what I think is the main bug to do with the #129 stuff.

  • run a program with a await tractor.breakpoint() in it
  • execute the continue command to resume the program
  • use ctrl-c to kill the process tree
  • the debugger (might) get(s) entered again
  • execute continue again to hit the deadlock

This more often then not seems to result in some kind of deadlock where the parent is waiting on the child to terminate but the child is waiting on the lock from the parent somehow, I think.

@goodboy
Copy link
Owner Author

goodboy commented Sep 27, 2020

Alright dug slightly into using deliver_cancel:

@goodboy
Copy link
Owner Author

goodboy commented Sep 27, 2020

Disabled SIGINT in children and all other machinery seems to still work in the test suite; so I guess that's good?

It's still not solving the main issue with debugging where if you ctl-c from the pdbpp prompt and then continue the parent process is somehow getting in a super weird deadlock state with the children where the children aren't being cancelled. Killing the children manually using a system tool seems to then unblock the root actor and it tears down with KeyboardInterrupt. Might try disabling sigint handling while inside the debugger now.

@goodboy goodboy self-assigned this Sep 27, 2020
@goodboy
Copy link
Owner Author

goodboy commented Oct 2, 2020

I think the issue has been resolved by allocating a separate cancel scope for the service nursery scheduled in wait_for_parent_stdin_hijack. By simply cancelling this scope in Actor.cancel() deadlocks are avoided since the child task causing the lock is properly killed now (though i'm slightly confused because the task is scheduled on the service nursery, and that nursery is getting cancelled 🤷).

Ahhh! It's probably that the channel server is blocking for the task to finish and since this tty lock request task is not put in the rpc task set the "all channels are ded" event never comes in because we have a task locked in the parent from the service nursery that's never cancelled. This looks to be the source of the deadlock and should be resolved now 🏄

I'm gonna close this once some tests come in.

The good news is that working through the SIGINT stuff did allow finding some issues with Pdbs handler state mangling which we're now disabling in order to avoid nasty behavior if a user triggers ctl-c while inside the debugger prompt.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
debugger enhancement New feature or request experiment Exploratory design and testing question Further information is requested
Projects
None yet
Development

No branches or pull requests

1 participant