Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test sync_wait_tests.cpp /"multiple threads" hangs on mac osx #98

Closed
yfinkelstein opened this issue Dec 23, 2018 · 12 comments
Closed

test sync_wait_tests.cpp /"multiple threads" hangs on mac osx #98

yfinkelstein opened this issue Dec 23, 2018 · 12 comments

Comments

@yfinkelstein
Copy link

yfinkelstein commented Dec 23, 2018

output on the console after ctrl-c under lldb:

...
sync_wait(task<T>)
sync_wait(shared_task<T>)
multiple threads

Process 12959 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x00007fff7296a7de libsystem_kernel.dylib`__psynch_cvwait + 10
libsystem_kernel.dylib`__psynch_cvwait:
->  0x7fff7296a7de <+10>: jae    0x7fff7296a7e8            ; <+20>
    0x7fff7296a7e0 <+12>: movq   %rax, %rdi
    0x7fff7296a7e3 <+15>: jmp    0x7fff729673b7            ; cerror_nocancel
    0x7fff7296a7e8 <+20>: retq
Target 0: (cppcoro_tests) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7296a7de libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff72a24593 libsystem_pthread.dylib`_pthread_cond_wait + 724
    frame #2: 0x00007fff6ff03bda libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #3: 0x000000010029c331 cppcoro_tests`void std::__1::condition_variable::wait<cppcoro::detail::lightweight_manual_reset_event::wait()::$_0>(this=0x00007ffeefbfd900, __lk=0x00007ffeefbfd800, __pred=(anonymous class) @ 0x00007ffeefbfd7a8)::$_0) at __mutex_base:375
    frame #4: 0x000000010029c24f cppcoro_tests`cppcoro::detail::lightweight_manual_reset_event::wait(this=0x00007ffeefbfd8c0) at lightweight_manual_reset_event.cpp:234
    frame #5: 0x00000001001ad241 cppcoro_tests`cppcoro::awaitable_traits<cppcoro::task<int>&&, void>::await_result_t cppcoro::sync_wait<cppcoro::task<int> >(awaitable=0x00007ffeefbfd970) at sync_wait.hpp:32
    frame #6: 0x00000001001a9d49 cppcoro_tests`_DOCTEST_ANON_FUNC_8() at sync_wait_tests.cpp:72
    frame #7: 0x000000010012a72d cppcoro_tests`doctest::Context::run(this=0x00007ffeefbff108) at doctest.h:5651
    frame #8: 0x000000010012b4b6 cppcoro_tests`main(argc=1, argv=0x00007ffeefbff148) at doctest.h:5759
    frame #9: 0x00007fff7282ded9 libdyld.dylib`start + 1
    frame #10: 0x00007fff7282ded9 libdyld.dylib`start + 1
(lldb)```
@yfinkelstein
Copy link
Author

The test just hangs for ever.

$ clang -v
clang version 7.0.0 (tags/RELEASE_700/final)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /usr/local/opt/llvm/bin```

@yfinkelstein
Copy link
Author

yfinkelstein commented Dec 23, 2018

I spent a few min to debug this. I added

		printf("before schedule: %d\n", value);
		co_await tp.schedule();
		printf("after schedule: %d\n", value);

output, and changed the number of times the loop runs:
with 10 - success
with 100 - success
with 1000 - success
with 10'000 - hangs or fails with libc++abi.dylib: terminating with uncaught exception of type std::__1::system_error: condition_variable wait failed: Invalid argument

@yfinkelstein
Copy link
Author

yfinkelstein commented Dec 23, 2018

Actually, after running the same test multiple times and observing the printed value it became clear that the number of iterations is not directly relevant for the outcome. The test simply has some sort of a race condition. Some time it passes, some time is hangs after 7 iterations, some time after 89. Ex:

before schedule: 204
after schedule: 204
before schedule: 205
after schedule: 205
before schedule: 206
after schedule: 206
before schedule: 207
after schedule: 207
before schedule: 208
after schedule: 208
before schedule: 209
after schedule: 209
before schedule: 210
after schedule: 210
before schedule: 211
after schedule: 211
before schedule: 212
after schedule: 212
before schedule: 213
after schedule: 213
before schedule: 214
after schedule: 214
before schedule: 215
after schedule: 215
before schedule: 216
after schedule: 216
before schedule: 217
after schedule: 217
before schedule: 218
after schedule: 218
before schedule: 219
after schedule: 219
before schedule: 220
after schedule: 220
before schedule: 221
after schedule: 221
before schedule: 222
after schedule: 222
before schedule: 223
after schedule: 223
before schedule: 224
after schedule: 224


<hang>

@yfinkelstein
Copy link
Author

So the thread in the static thread pool and the main thread are clashing when accessing the task state somewhere.

@yfinkelstein
Copy link
Author

Thread #2:

  * frame #0: 0x00007fff7296a7de libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff72a24593 libsystem_pthread.dylib`_pthread_cond_wait + 724
    frame #2: 0x00007fff6ff03bda libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #3: 0x0000000100290a2e cppcoro_tests`cppcoro::auto_reset_event::wait(this=0x00000001017000f0) at auto_reset_event.cpp:91
    frame #4: 0x00000001002a00b9 cppcoro_tests`cppcoro::static_thread_pool::thread_state::sleep_until_woken(this=0x00000001017000c8) at static_thread_pool.cpp:76
    frame #5: 0x000000010029ebfd cppcoro_tests`cppcoro::static_thread_pool::run_worker_thread(this=0x00007ffeefbfda50, threadIndex=0) at static_thread_pool.cpp:485
    frame #6: 0x00000001002a548e cppcoro_tests`cppcoro::static_thread_pool::static_thread_pool(this=0x0000000101700058)::$_0::operator()() const at static_thread_pool.cpp:346
    frame #7: 0x00000001002a5231 cppcoro_tests`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0> >(void*) [inlined] decltype(__f=0x0000000101700058)::$_0>(fp)(std::__1::forward<>(fp0))) std::__1::__invoke<cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0>(cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0&&) at type_traits:4345
    frame #8: 0x00000001002a5220 cppcoro_tests`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0> >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0>(__t=size=2)::$_0>&, std::__1::__tuple_indices<>) at thread:342
    frame #9: 0x00000001002a51fc cppcoro_tests`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, cppcoro::static_thread_pool::static_thread_pool(unsigned int)::$_0> >(__vp=0x0000000101700050) at thread:352
    frame #10: 0x00007fff72a21305 libsystem_pthread.dylib`_pthread_body + 126
    frame #11: 0x00007fff72a2426f libsystem_pthread.dylib`_pthread_start + 70
    frame #12: 0x00007fff72a20415 libsystem_pthread.dylib`thread_start + 13```

@yfinkelstein
Copy link
Author

yfinkelstein commented Dec 23, 2018

Thread #1 (main):

* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff7296a7de libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff72a24593 libsystem_pthread.dylib`_pthread_cond_wait + 724
    frame #2: 0x00007fff6ff03bda libc++.1.dylib`std::__1::condition_variable::wait(std::__1::unique_lock<std::__1::mutex>&) + 18
    frame #3: 0x000000010029d211 cppcoro_tests`void std::__1::condition_variable::wait<cppcoro::detail::lightweight_manual_reset_event::wait()::$_0>(this=0x00007ffeefbfd900, __lk=0x00007ffeefbfd800, __pred=(anonymous class) @ 0x00007ffeefbfd7a8)::$_0) at __mutex_base:375
    frame #4: 0x000000010029d12f cppcoro_tests`cppcoro::detail::lightweight_manual_reset_event::wait(this=0x00007ffeefbfd8c0) at lightweight_manual_reset_event.cpp:234
    frame #5: 0x00000001001add61 cppcoro_tests`cppcoro::awaitable_traits<cppcoro::task<int>&&, void>::await_result_t cppcoro::sync_wait<cppcoro::task<int> >(awaitable=0x00007ffeefbfd970) at sync_wait.hpp:32
    frame #6: 0x00000001001aa869 cppcoro_tests`_DOCTEST_ANON_FUNC_8() at sync_wait_tests.cpp:74
    frame #7: 0x000000010012b24d cppcoro_tests`doctest::Context::run(this=0x00007ffeefbff108) at doctest.h:5651
    frame #8: 0x000000010012bfd6 cppcoro_tests`main(argc=1, argv=0x00007ffeefbff148) at doctest.h:5759
    frame #9: 0x00007fff7282ded9 libdyld.dylib`start + 1
    frame #10: 0x00007fff7282ded9 libdyld.dylib`start + 1

@lewissbaker
Copy link
Owner

I suspect there is a bug in the lightweight_manual_reset_event::set() method:

void cppcoro::detail::lightweight_manual_reset_event::set() noexcept
{
	{
		std::lock_guard<std::mutex> lock(m_mutex);
		m_isSet = true;
	}
	m_cv.notify_all();
}

I think by having the call to notify_all() outside of the critical section it may potentially miss waking up a a thread that is blocked inside .wait().

Can you try moving this call inside the scope of the lock_guard and see if that fixes the hang?

@lewissbaker
Copy link
Owner

I've made this change in 0a768bf.

@yfinkelstein Let me know if you still see this issue with latest master.

@wilevers
Copy link

wilevers commented Mar 9, 2019

I suspect there is a bug in the lightweight_manual_reset_event::set() method:

void cppcoro::detail::lightweight_manual_reset_event::set() noexcept
{
	{
		std::lock_guard<std::mutex> lock(m_mutex);
		m_isSet = true;
	}
	m_cv.notify_all();
}

I think by having the call to notify_all() outside of the critical section it may potentially miss waking up a a thread that is blocked inside .wait().

Can you try moving this call inside the scope of the lock_guard and see if that fixes the hang?

This should not matter. See this old discussion thread: https://groups.google.com/forum/?hl=ky#!msg/comp.programming.threads/wEUgPq541v8/ZByyyS8acqMJ

@wilevers
Copy link

I wrote:

This should not matter. See this old discussion thread: https://groups.google.com/forum/?hl=ky#!msg/comp.programming.threads/wEUgPq541v8/ZByyyS8acqMJ

Looking at your comment on commit 0a768bf again, I was wrong. Obviously, if setting m_isSet to true implies permission for another thread to destroy the condition variable, then it must be guaranteed that the new value of m_isSet is not observable yet when notify_all() is called. Sorry for the noise!

@lewissbaker
Copy link
Owner

@wilevers Your original comment was valid based on a response to my comment on this thread. Indeed, whether or not the call to notify_all() was done inside or outside of the critical section shouldn't cause a missed wake-up.

But, yes, there was another problem which was that it's possible the cv.notify_all() call could be running concurrently with cv's destructor or possibly even concurrently with the constructor of a different condition_variable that was being constructed in the same place.

This seems to be the most likely cause for the condition_variable wait failed: Invalid argument error that @yfinkelstein was seeing. And could plausibly also have been responsible for the hangs.

I'm marking this issue as closed but feel free to reopen if you are still seeing it.

@wilevers
Copy link

But, yes, there was another problem which was that it's possible the cv.notify_all() call could be running concurrently with cv's destructor or possibly even concurrently with the constructor of a different condition_variable that was being constructed in the same place.

This seems to be the most likely cause for the condition_variable wait failed: Invalid argument error that @yfinkelstein was seeing. And could plausibly also have been responsible for the hangs.

I'm marking this issue as closed but feel free to reopen if you are still seeing it.

Excellent. Thank you, and again, sorry for the noise...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants