Improve compatibility with tokio #1367

Open
devrandom opened this issue Mar 17, 2022 · 6 comments

@devrandom (Member) commented Mar 17, 2022

Background

The VLS project uses gRPC (via tokio/tonic) to connect to a remote signer.

Symptom

If the gRPC client is created in the default tokio runtime, a deadlock may result.

Discussion

lightning-net-tokio calls into synchronous PeerManager code. This code takes a lock on the PeerManager.peers Mutex, which is held across any calls to the signer.

The ldk-sample calls PeerManager.get_peer_node_ids in do_connect_peer, which also takes a lock on that Mutex.

The above sync code can stall the I/O reactor, so the signer gRPC call never completes and the peers Mutex is never released.

A possible solution is to always call task::block_in_place, or perhaps Handle::spawn_blocking, when calling any Mutex-using sync code from inside lightning-net-tokio and ldk-sample; a sketch follows.
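A minimal sketch of the block_in_place variant, with a generic Mutex standing in for PeerManager.peers (the function and types here are illustrative assumptions, not actual lightning-net-tokio code):

```rust
use std::sync::{Arc, Mutex};

// Run sync, lock-taking work via block_in_place: the current worker's
// queued tasks migrate to other threads, so the I/O reactor is not
// stalled while the closure blocks. Requires the multi-threaded runtime.
async fn call_sync_safely(state: Arc<Mutex<Vec<u8>>>) -> usize {
    tokio::task::block_in_place(move || {
        let guard = state.lock().unwrap();
        // ... any sync work that may block while holding the lock ...
        guard.len()
    })
}
```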

@devrandom (Member, Author)

On further thought, a more general description of the issue is "holding an std Mutex across an await that performs tokio IO".

I believe that BitcoindClient in the LDK sample suffers from this.

The deadlock scenario is:

  • thread 1: locks mutex, then starts IO
  • thread 2: has the IO reactor, tries to lock mutex

Thread 2 is blocked on the mutex, and therefore the IO reactor is stalled. Thread 1 is stalled waiting for IO to complete. No progress can be made - deadlock.
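For concreteness, a minimal sketch of the two sides (illustrative code, with a tokio sleep standing in for the IO):

```rust
use std::sync::{Arc, Mutex};
use std::time::Duration;

// Thread 1: locks the mutex, then starts tokio IO while holding the guard.
async fn thread_one(shared: Arc<Mutex<u64>>) {
    let guard = shared.lock().unwrap();               // locks mutex...
    tokio::time::sleep(Duration::from_secs(1)).await; // ...then starts IO
    drop(guard);                                      // released only after IO
}

// Thread 2: if this blocks the thread currently driving the IO reactor,
// the sleep above never completes and `guard` is never dropped.
fn thread_two(shared: Arc<Mutex<u64>>) {
    let _guard = shared.lock().unwrap();
}
```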

@TheBlueMatt (Collaborator)

Right, I believe the issue here is in thread 1: we are in a sync context, take a sync mutex, and then move to an async context. In general we should never go from a sync context to (blocking on) an async context.
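A sketch of that anti-pattern (illustrative code, not anything in LDK; the sleep stands in for the signer I/O):

```rust
use std::sync::Mutex;
use std::time::Duration;
use tokio::runtime::Handle;

// Sync code takes a std mutex, then blocks its thread on an async
// operation. If this thread is (or starves) the one driving that
// operation's I/O, the call never finishes and the lock is never released.
fn sync_blocks_on_async(lock: &Mutex<u64>, handle: &Handle) {
    let _guard = lock.lock().unwrap(); // sync context: lock held
    handle.block_on(async {
        // blocking on an async context while holding the lock
        tokio::time::sleep(Duration::from_millis(10)).await
    });
} // lock released only here
```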

@devrandom (Member, Author)

Another variation of this is likely the cause of L2-Technology/sensei#73:

  • thread 1: takes a lock on a sync mutex, then spawns an LDK user task (e.g. persistence) onto thread 2 and blocks waiting for the task
  • thread 2: dispatches an LDK task, which tries to lock same mutex

Thread 1 can't make progress because the task it is waiting for is starved on thread 2. Thread 2 can't make progress because of the sync lock wait.

My previous conclusion was that it's never safe to spawn user tasks on the runtime LDK uses. That's why we use the following code to spawn lnrod user tasks in a separate runtime:

https://gitlab.com/lightning-signer/lnrod/-/blob/3e155d2ccecf45212a1bda74e7581a10f0e7ffd5/src/signer/vls.rs#L227-232
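A hedged approximation of that pattern (illustrative code, not the actual lnrod source): user tasks get their own runtime on a dedicated thread, so a task LDK blocks on can never be starved by the runtime LDK itself is running on.

```rust
use std::thread;
use tokio::runtime::Runtime;

// Build a dedicated runtime for user tasks (persistence, signing, ...)
// and return a Handle for spawning onto it from LDK callbacks.
fn user_task_handle() -> tokio::runtime::Handle {
    let rt = Runtime::new().expect("user runtime");
    let handle = rt.handle().clone();
    // Park the runtime on its own thread; tasks spawned via `handle`
    // are driven here, independently of the runtime LDK uses.
    thread::spawn(move || rt.block_on(std::future::pending::<()>()));
    handle
}
```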

@TheBlueMatt (Collaborator)

I don't quite understand how that is a deadlock - futures aren't tied to a thread, so a third thread should have no problem working on the future from thread 1, at least if I understand tokio correctly.

@devrandom (Member, Author)

I think tokio can't move tasks between threads in general - it can steal work in some cases as an optimization, but not in every case.

@TheBlueMatt (Collaborator)

Ah! The tokio Discord agrees: there's nothing preventing work from being stolen, but tokio doesn't steal aggressively; if other threads are sleeping, they aren't going to wake up to steal work.

mariocynicys added a commit to mariocynicys/rust-teos that referenced this issue Aug 20, 2022
This fixes a deadlock issue that appears with too many concurrent requests. This solution was taken from devrandom's issue: lightningdevkit/rust-lightning#1367
Another solution that would probably work is for lightning-net-tokio to wrap our sync calls with `tokio::task::spawn_blocking(|| sync_call()).await.unwrap()`, but it's not clear how that would affect performance for users with no async code requirements.

Note that peer_manager_tests no longer need to be annotated with `flavor = "multi_thread"`; that's because the runtime we block_on (inside lightning-net-tokio) is the artificial runtime created in `api::lightning::server`, which is already "multi_thread".
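A sketch of that spawn_blocking alternative (illustrative code; the shared counter stands in for the real sync call):

```rust
use std::sync::{Arc, Mutex};

// Hop the sync, lock-taking call onto tokio's blocking thread pool so it
// cannot occupy a reactor worker thread while it holds the lock.
async fn wrapped_sync_call(shared: Arc<Mutex<u64>>) -> u64 {
    tokio::task::spawn_blocking(move || {
        let guard = shared.lock().unwrap(); // the lock-taking sync work
        *guard
    })
    .await
    .expect("blocking task panicked")
}
```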
mariocynicys added commits with the same message to mariocynicys/rust-teos that referenced this issue on Feb 5, 2023.