
Deadlock with --dynamic-ui on a large enough context #9926

Closed
stuhood opened this issue May 31, 2020 · 4 comments · Fixed by #9959 or #9969

stuhood commented May 31, 2020

Running:

./v2 --tag=-integration test tests/python/pants_test/::

...reproducibly hangs the UI (and seemingly the whole process). Disabling the UI with --no-dynamic-ui allows progress to be made.

@stuhood stuhood added the bug label May 31, 2020
@stuhood stuhood added this to the 1.29.x milestone May 31, 2020
@gshuflin gshuflin self-assigned this Jun 2, 2020

stuhood commented Jun 2, 2020

Mentioned offline: it's possible that this relates to the cffi/cpython change, which causes us to interact with the GIL more directly in a bunch of places. We can search stack traces for `ensure_gil` and for `parking_lot` (the crate we use for most of our locks) to spot potential conflicts there.

@stuhood stuhood assigned stuhood and unassigned gshuflin Jun 4, 2020

stuhood commented Jun 4, 2020

So, it looks like this had nothing to do with cffi/cpython: just another fun combination of locks!

The deadlock I reproduced was:

  • One thread in WorkunitStore::log_workunit_state, which via our logging mechanism is trying to use the Session to write to stderr.
  • Meanwhile, another thread in Session::maybe_display_render reaches out to request WorkunitStore::heavy_hitters.

Rather than fiddling with lock ordering here, I think the approach least likely to bite us in the long run is to make logging mostly lock-free: logging callbacks enqueue onto a non-blocking queue that the Session consumes, rather than acquiring a lock on the Session itself.
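A minimal sketch of that shape using only `std` (the `LogEntry` type and function names are illustrative, not the actual pants internals): logging callbacks send onto an `mpsc` channel without ever touching a `Session` lock, and whichever thread owns the `Session` drains the channel at a point where it holds no other locks.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

// Illustrative stand-in for a log message destined for the Session's stderr.
struct LogEntry(String);

// A logging callback may fire on any thread; `Sender::send` on an unbounded
// channel never blocks on a Session lock, so it cannot join a lock cycle.
fn log_from_any_thread(tx: Sender<LogEntry>, n: usize) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for i in 0..n {
            tx.send(LogEntry(format!("workunit {} completed", i))).unwrap();
        }
    })
}

// The thread that owns the Session drains pending entries non-blockingly
// while holding no other locks, then writes them to stderr.
fn drain(rx: &Receiver<LogEntry>) -> Vec<String> {
    rx.try_iter().map(|e| e.0).collect()
}

fn main() {
    let (tx, rx) = channel();
    log_from_any_thread(tx, 3).join().unwrap();
    let pending = drain(&rx);
    assert_eq!(pending.len(), 3);
    for line in &pending {
        eprintln!("{}", line);
    }
}
```

Because the producer side never blocks, a logging callback can safely fire from any context, including while other locks are held.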

stuhood added a commit that referenced this issue Jun 4, 2020
### Problem

As described in #9926, we observed a deadlock with:
* One thread in `WorkunitStore::log_workunit_state`, which via our logging mechanism was trying to use the `Session` to write to stderr.
* Another thread in `Session::maybe_display_render` requesting `WorkunitStore::heavy_hitters`.

### Solution

Do not acquire the `Session` lock in a logging callback (which might occur anywhere at all, and could cause other unidentified lock interleaving): instead, enqueue for `Scheduler::execute` on the main thread to write to the `Session`.

### Result

Fixes #9926. The particular deadlock described there is only possible now that we log workunit completions, but if this cherry-picks cleanly to `1.29.x`, we should apply it there as well to prevent any unanticipated interleaving.

[ci skip-jvm-tests]

tdyas commented Jun 4, 2020

I'm still hitting the deadlock even with this fix. Attached is a backtrace captured with gdb:
backtrace.txt

@gshuflin gshuflin reopened this Jun 4, 2020

tdyas commented Jun 4, 2020

Full backtrace from a subsequent run that also deadlocked:
gdb.txt

stuhood added a commit that referenced this issue Jun 5, 2020
stuhood added a commit that referenced this issue Jun 5, 2020
### Problem

As reported on #9926 and shown in https://github.com/pantsbuild/pants/files/4732596/gdb.txt, threads 5 and 25 were deadlocked accessing the `WorkunitStore` (in `complete_workunit` and `poll_session_workunits`, respectively) and the GIL (to get `Node` information, and via the `poll_session_workunits` method in `interface.rs`, respectively).

### Solution

Release the GIL before interacting with the `WorkunitStore` in `poll_session_workunits`. I'd like to refactor `interface.rs` to make this kind of issue more challenging to trigger, but this fix is self-contained. 
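The ABBA shape of that deadlock, and the shape of the fix, can be sketched with plain mutexes (here `gil` merely stands in for the Python GIL and `store` for the `WorkunitStore` lock; none of these names are the actual pants types):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Before the fix: one thread held `store` and then wanted `gil` (thread 5,
// completing a workunit), while another held `gil` and then wanted `store`
// (thread 25, in poll_session_workunits) -- a classic ABBA deadlock.

// The fixed shape: release the GIL-equivalent *before* touching the store,
// so this path never holds both locks at once.
fn poll_session_workunits(gil: &Mutex<()>, store: &Mutex<Vec<&'static str>>) -> usize {
    {
        let _gil = gil.lock().unwrap();
        // ...extract anything needed from Python objects here...
    } // GIL-equivalent released at end of scope.
    store.lock().unwrap().len()
}

// The other side: holds the store, then briefly needs the GIL. This is now
// safe because no poller waits on `store` while holding `gil`.
fn complete_workunit(gil: &Mutex<()>, store: &Mutex<Vec<&'static str>>) {
    let mut s = store.lock().unwrap();
    let _gil = gil.lock().unwrap();
    s.push("workunit");
}

fn main() {
    let gil = Arc::new(Mutex::new(()));
    let store = Arc::new(Mutex::new(Vec::new()));
    let (g, s) = (Arc::clone(&gil), Arc::clone(&store));
    thread::spawn(move || complete_workunit(&g, &s)).join().unwrap();
    assert_eq!(poll_session_workunits(&gil, &store), 1);
    println!("no deadlock: {} workunit(s) recorded", poll_session_workunits(&gil, &store));
}
```

The general rule the patch applies: if two locks must ever be held together, every path must acquire them in the same order; otherwise, release one before taking the other.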

### Result

Fixes #9926 some more.

[ci skip-jvm-tests]