Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Window's notify thread is stuck hanging #287

Closed
Awfa opened this issue Mar 17, 2021 · 6 comments
Closed

Window's notify thread is stuck hanging #287

Awfa opened this issue Mar 17, 2021 · 6 comments
Labels
A-bug B-debouncer debouncer related

Comments

@Awfa
Copy link

Awfa commented Mar 17, 2021

System details

  • OS/Platform name and version: Windows 10
  • Rust version (if building from source): rustc --version: rustc 1.50.0
  • Notify version (or commit hash if building from git): 4.0.15

What you did (as detailed as you can)

We created a bunch of ReadDirectoryChangesWatcher thru notify::watcher. After we dropped the ReadDirectoryChangesWatchers, we expected the watcher to drop the given mspc::Sender.

What you expected

Sometimes, the given mspc::Sender wouldn't be dropped by the ReadDirectoryChangesWatcher after the ReadDirectoryChangesWatcher was dropped.

What happened

I tracked it down to a race condition in the notify crate (4.0.15) under src/debounce/timer.rs.
A particular interleaving of operations could leave the ScheduleWorker thread alive while the WatchTimer thread has dropped.

Below is a simplified model of WatchTimer and ScheduleTimer showing the ScheduleTimer hanging after WatchTimer drops:

use std::sync::{atomic::AtomicBool, Arc, Condvar, Mutex};

fn main() {
    // We will use a channel to enforce an interleaving that shows the thread hang
    let (rendezvous_sender, rendezvous_receiver) = std::sync::mpsc::sync_channel::<()>(0);

    // Below is a modeling of WatchTimer::new spawning ScheduleWorker
    let condvar = Arc::new(Condvar::new());
    let stopped = Arc::new(AtomicBool::new(false));
    let thread = spawn_schedule_worker(condvar.clone(), stopped.clone(), rendezvous_receiver);

    // Interleaving sync point [1]
    // Makes it so we pass the `stopped` AtomicBool check in the thread
    rendezvous_sender.send(()).unwrap();

    // Below is a modeling of WatchTimer::drop
    stopped.store(true, std::sync::atomic::Ordering::SeqCst);
    condvar.notify_one();

    // Interleaving sync point [2]
    // Makes it so we sent a notify on the condvar before the thread waits on it
    rendezvous_sender.send(()).unwrap();

    // Ensure the thread is terminated
    thread.join().unwrap();
}

fn spawn_schedule_worker(
    condvar: Arc<Condvar>,
    stopped: Arc<AtomicBool>,
    rendezvous_receiver: std::sync::mpsc::Receiver<()>,
) -> std::thread::JoinHandle<()> {
    std::thread::spawn(move || {
        let m = Mutex::new(());

        let mut g = m.lock().unwrap();

        loop {
            if stopped.load(std::sync::atomic::Ordering::SeqCst) {
                println!("Leaving loop via stopped bool");
                break;
            }

            // Interleaving sync point [1]
            // Makes it so we pass the `stopped` AtomicBool check in the thread
            rendezvous_receiver.recv().unwrap();

            // Interleaving sync point [2]
            // Makes it so we sent a notify on the condvar before the thread waits on it
            rendezvous_receiver.recv().unwrap();

            println!("Waiting for condvar");
            g = condvar.wait(g).unwrap();
            println!("Waking from condvar");
        }
    })
}

The creation of condvar, stopped, and thread are analogous to WatchTimer::new and storing true in stopped and using condvar.notify_one() is analogous to WatchTimer::drop.

A thread interleaving is enforced via a rendezvous channel. Running this above rust program should show a program hang after "Waiting for condvar" is printed.

@0xpr03 0xpr03 added A-bug B-debouncer debouncer related labels Mar 17, 2021
@0xpr03
Copy link
Member

0xpr03 commented Mar 17, 2021

Thanks for this detailed and in-depth report. I'll have a look later on, probably not today. Unfortunately this debouncer of v4 is also completely removed in v5.

Awfa added a commit to Awfa/notify that referenced this issue Mar 17, 2021
@Awfa
Copy link
Author

Awfa commented Mar 17, 2021

Interestingly, after I fixed this issue - the ScheduleWorker wasn't dropping still in some test runs. I found out that the WatchTimer object was being leaked somehow and tracked it down to the APC queue somehow not being fully completed. This second issue could theoretically be present in the current v5 as well (I didn't check).

When the APC queue is not fully cleared, the ReadDirectoryRequest object is leaked since this isn't called.

I have my fix for the first issue mentioned in this post, as well as a stop-gap fix for the APC queue here. I don't know the root-cause of the APC queue issue though.

@Awfa
Copy link
Author

Awfa commented Mar 18, 2021

Found the reason for the second issue - in stop_watch, the call to synchapi::WaitForSingleObjectEx(ws.complete_sem, INFINITE, TRUE); could leave the apc queue with work left to do - the fix I came up with is to actually wait for the WatchState's completion semaphore to be in a signal state.

Both fixes are here

I believe v5 is also affected by the second problem here

@0xpr03
Copy link
Member

0xpr03 commented Mar 18, 2021

Thanks for all the work! I think we should integrate this fix as a direct PR against try_v4 (for v4 fixes) and main (v5) if you like. Otherwise let me know and I'll make a PR based on your fork.

Edit: This could also explain some spurious CI problems before I started maintain this and ultimately led to a removal of most tests and the old debouncer code of v4 in v5.

Awfa added a commit to Awfa/notify that referenced this issue Mar 18, 2021
@Awfa Awfa mentioned this issue Mar 18, 2021
@Awfa
Copy link
Author

Awfa commented Mar 18, 2021

How about I do the PR for v4 and you can do it for v5?

I was motivated to fix this problem when our tests using notify had flakiness in our CI as well :)

@0xpr03
Copy link
Member

0xpr03 commented Mar 18, 2021

Sounds good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-bug B-debouncer debouncer related
Projects
None yet
Development

No branches or pull requests

2 participants