ThreadSanitizer: deadlock in signal handler #1540

Closed
tavplubix opened this issue Jun 13, 2022 · 13 comments

@tavplubix

Hi, it seems our stress tests have found a deadlock in TSan while testing ClickHouse. The stack trace is:

Thread 238 (Thread 0x7fa07287e700 (LWP 925)):
#0  0x000000000b2949aa in __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) ()
#1  0x000000000b295a4a in __sanitizer::Semaphore::Wait() ()
#2  0x000000000b31a23f in __tsan::SlotLock(__tsan::ThreadState*) ()
#3  0x000000000b32a2c3 in __tsan::Acquire(__tsan::ThreadState*, unsigned long, unsigned long) ()
#4  0x000000000b2bd7b8 in __tsan::CallUserSignalHandler(__tsan::ThreadState*, bool, bool, int, __sanitizer::__sanitizer_siginfo*, void*) ()
#5  0x000000000b2bdcd4 in sighandler(int, __sanitizer::__sanitizer_siginfo*, void*) ()
#6  <signal handler called>
#7  0x000000000b33205e in __tsan::MetaMap::FreeRange(__tsan::Processor*, unsigned long, unsigned long, bool) ()
#8  0x000000000b331d0e in __tsan::MetaMap::FreeBlock(__tsan::Processor*, unsigned long, bool) ()
#9  0x000000000b31334f in __tsan::OnUserFree(__tsan::ThreadState*, unsigned long, unsigned long, bool) ()
#10 0x000000000b313172 in __tsan::user_free(__tsan::ThreadState*, unsigned long, void*, bool) ()
#11 0x000000000b2b5df5 in free ()
#12 0x00007fa19dd67a75 in _dl_deallocate_tls () from /lib64/ld-linux-x86-64.so.2
#13 0x00007fa19dd03242 in free_stacks () from /lib/x86_64-linux-gnu/libpthread.so.0
#14 0x00007fa19dd04522 in __free_tcb () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fa19dd05b29 in __pthread_clockjoin_ex () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x000000000b2b72de in pthread_join ()
#17 0x000000001edc0d66 in Poco::ThreadImpl::joinImpl (this=this@entry=0x7b5400457938, milliseconds=milliseconds@entry=10000) at ../contrib/poco/Foundation/src/Thread_POSIX.cpp:247
#18 0x000000001edc1dc1 in Poco::Thread::tryJoin (this=0x7b5400457938, milliseconds=10000) at ../contrib/poco/Foundation/src/Thread.cpp:153
#19 0x000000001edc3b93 in Poco::PooledThread::release (this=0x7b5400457900) at ../contrib/poco/Foundation/src/ThreadPool.cpp:179
#20 0x000000001edc53d1 in Poco::ThreadPool::housekeep (this=this@entry=0x7ffc8fc36030) at ../contrib/poco/Foundation/src/ThreadPool.cpp:435
#21 0x000000001edc5aeb in Poco::ThreadPool::getThread (this=this@entry=0x7ffc8fc36030) at ../contrib/poco/Foundation/src/ThreadPool.cpp:446
#22 0x000000001edc5fc8 in Poco::ThreadPool::startWithPriority (this=0x7ffc8fc36030, priority=Poco::Thread::PRIO_NORMAL, target=..., name=...) at ../contrib/poco/Foundation/src/ThreadPool.cpp:365
#23 0x000000001eb7898f in Poco::Net::TCPServerDispatcher::enqueue (this=0x7b4000001600, socket=...) at ../contrib/poco/Net/src/TCPServerDispatcher.cpp:152
#24 0x000000001eb774a8 in Poco::Net::TCPServer::run (this=<optimized out>) at ../contrib/poco/Net/src/TCPServer.cpp:148
#25 0x000000001edc20f0 in Poco::(anonymous namespace)::RunnableHolder::run (this=<optimized out>) at ../contrib/poco/Foundation/src/Thread.cpp:55
#26 0x000000001edc082c in Poco::ThreadImpl::runnableEntry (pThread=0x7b4400019168) at ../contrib/poco/Foundation/src/Thread_POSIX.cpp:345
#27 0x000000000b2b6f19 in __tsan_thread_start_func ()
#28 0x00007fa19dd04609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#29 0x00007fa19dc29133 in clone () from /lib/x86_64-linux-gnu/libc.so.6

I looked through the LLVM code, and it looks like this thread is trying to lock the same mutex a second time:
the first time during the free() call: https://github.com/llvm/llvm-project/blob/12e137ab24dae51553433e00ff96e28a14d5b1f5/compiler-rt/lib/tsan/rtl/tsan_mman.cpp#L272
and then in __tsan::Acquire, called from __tsan::CallUserSignalHandler: https://github.com/llvm/llvm-project/blob/f831d6fc800ccf22c1c09888fce3e3c8ebc2c992/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp#L448
If I understand correctly, SlotLocker locks thr->slot->mtx, which is a __sanitizer::Mutex, and that mutex is not recursive.
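
For intuition, the failure shape is the classic self-deadlock of taking a non-recursive lock inside a signal handler. A minimal standalone sketch (my own illustration, not TSan code):

#include <csignal>
#include <mutex>

std::mutex slot_mtx;  // stand-in for thr->slot->mtx (not recursive)

void handler(int)
{
    // Locking a std::mutex in a signal handler is not async-signal-safe;
    // that is the point here: the interrupted code below already owns
    // slot_mtx, so this blocks forever.
    std::lock_guard<std::mutex> lock(slot_mtx);
}

int main()
{
    std::signal(SIGPROF, handler);
    std::lock_guard<std::mutex> lock(slot_mtx);  // first lock (like OnUserFree)
    std::raise(SIGPROF);  // handler runs on this same thread and never returns
}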

The binary was built with Clang 14.0.5 (build log).

@dvyukov
Contributor

dvyukov commented Jun 13, 2022

This part is problematic:

#5  0x000000000b2bdcd4 in sighandler(int, __sanitizer::__sanitizer_siginfo*, void*) ()
#6  <signal handler called>
#7  0x000000000b33205e in __tsan::MetaMap::FreeRange(__tsan::Processor*, unsigned long, unsigned long, bool) ()

What signal is it?
If it's a SIGSEGV/SIGBUS happening inside the TSan runtime, then that's the root cause. Even if it didn't deadlock, it would still crash.

@tavplubix
Author

What signal is it?

I'm not sure, but it's probably SIGPROF from our ThreadFuzzer (it randomly sends SIGPROF to some threads and sleeps or calls sched_yield in the signal handler). I doubt it's SIGSEGV or SIGBUS, because we attach gdb to the server process before running stress tests, and I don't see such signals in the gdb output. Also, there are a lot of threads with DB::ThreadFuzzer::signalHandler in their stack traces: log with stacktraces of all threads
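
Conceptually the fuzzer's handler does something like the following (a simplified, hypothetical sketch; the real DB::ThreadFuzzer in ClickHouse is more elaborate):

#include <cstdlib>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

// Sketch of a SIGPROF-based thread fuzzer: perturb thread scheduling
// by yielding or sleeping inside the signal handler.
void fuzzerSignalHandler(int)
{
    if (std::rand() % 2)    // std::rand is not async-signal-safe;
        sched_yield();      // acceptable for a sketch only
    else
        usleep(1000);       // stall the interrupted thread ~1ms
}

void installThreadFuzzer()
{
    struct sigaction sa = {};
    sa.sa_handler = fuzzerSignalHandler;
    sigaction(SIGPROF, &sa, nullptr);
    // A separate fuzzing thread then periodically sends the signal:
    //   pthread_kill(target_thread, SIGPROF);
}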

@dvyukov
Contributor

dvyukov commented Jun 13, 2022

For SIGPROF, sighandler wouldn't call CallUserSignalHandler:

#4  0x000000000b2bd7b8 in __tsan::CallUserSignalHandler(__tsan::ThreadState*, bool, bool, int, __sanitizer::__sanitizer_siginfo*, void*) ()
#5  0x000000000b2bdcd4 in sighandler(int, __sanitizer::__sanitizer_siginfo*, void*) ()

see:
https://github.com/llvm/llvm-project/blob/5724231af279d2acf7a6f7b44fe9089456c37777/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp#L2072-L2077
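
In other words, the dispatch in sighandler is roughly the following (a simplified paraphrase with stub types; names are approximate, see the linked source for the real code):

#include <atomic>
#include <csignal>

struct ThreadSignalContext {
    std::atomic<int> in_blocking_func{0};
};

bool IsSyncSignal(int sig) {
    // Synchronous signals cannot be deferred: the faulting instruction
    // would just re-execute.
    return sig == SIGSEGV || sig == SIGBUS || sig == SIGILL || sig == SIGFPE;
}

void CallUserSignalHandler(ThreadSignalContext *, int) { /* ... */ }
void QueuePendingSignal(ThreadSignalContext *, int)    { /* ... */ }

void sighandler(int sig, ThreadSignalContext *sctx) {
    if (IsSyncSignal(sig) ||
        sctx->in_blocking_func.load(std::memory_order_relaxed)) {
        // Deliver now: either the signal cannot wait, or we are inside
        // a blocking call and deferring could delay it indefinitely.
        CallUserSignalHandler(sctx, sig);
        return;
    }
    // Otherwise remember the signal; it is processed later, at a point
    // where the runtime holds no internal mutexes.
    QueuePendingSignal(sctx, sig);
}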

@tavplubix
Author

I'm not familiar with the LLVM/TSan codebase, but it looks like it may call CallUserSignalHandler even for SIGPROF while executing a blocking function (i.e., if sctx->in_blocking_func is true):
https://github.com/llvm/llvm-project/blob/8246b2e156568c31e71e16cbaf4c14d316e7c06e/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp#L2088-L2096

And according to the stack trace, it was in a blocking function:

#16 0x000000000b2b72de in pthread_join ()

And it looks like it had set in_blocking_func to 1:
pthread_join - https://github.com/llvm/llvm-project/blob/8246b2e156568c31e71e16cbaf4c14d316e7c06e/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp#L1082
BLOCK_REAL and BlockingCall - https://github.com/llvm/llvm-project/blob/8246b2e156568c31e71e16cbaf4c14d316e7c06e/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp#L320-L332
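
For reference, BlockingCall is essentially the following RAII guard (a simplified, self-contained paraphrase of the linked code with stub types, not the exact source). The annotated disassembly below shows the same ctor/dtor pair around the real pthread_join call:

#include <atomic>

struct ThreadSignalContext {
    std::atomic<int> in_blocking_func{0};
};
struct ThreadState {
    ThreadSignalContext sctx;
    std::atomic<int> pending_signals{0};
};

void ProcessPendingSignals(ThreadState *) { /* deliver queued signals */ }

struct BlockingCall {
    explicit BlockingCall(ThreadState *thr) : ctx(&thr->sctx) {
        for (;;) {
            // Announce that we are about to enter a blocking libc call.
            ctx->in_blocking_func.store(1, std::memory_order_relaxed);
            if (thr->pending_signals.load(std::memory_order_relaxed) == 0)
                break;
            // A signal arrived before the flag was visible: clear the
            // flag, drain the queue, and retry.
            ctx->in_blocking_func.store(0, std::memory_order_relaxed);
            ProcessPendingSignals(thr);
        }
    }
    ~BlockingCall() {
        ctx->in_blocking_func.store(0, std::memory_order_relaxed);
    }
    ThreadSignalContext *ctx;
};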

(gdb) disas 0x000000000b2b72de
Dump of assembler code for function pthread_join:
...
   0x000000000b2b729d <+173>:	jmp    0xb2b72c0 <pthread_join+208>      <== BlockingCall ctor
   0x000000000b2b729f <+175>:	nop
   0x000000000b2b72a0 <+176>:	movq   $0x0,0x8(%r13)
   0x000000000b2b72a8 <+184>:	cmpl   $0x0,0x30(%rbx)
   0x000000000b2b72ac <+188>:	je     0xb2b72c0 <pthread_join+208>
   0x000000000b2b72ae <+190>:	mov    %rbx,%rdi
   0x000000000b2b72b1 <+193>:	call   0xb2bd5a0 <_ZN6__tsan25ProcessPendingSignalsImplEPNS_11ThreadStateE>
   0x000000000b2b72b6 <+198>:	cs nopw 0x0(%rax,%rax,1)
   0x000000000b2b72c0 <+208>:	movq   $0x1,0x8(%r13)          <== atomic_store(&ctx->in_blocking_func, 1, memory_order_relaxed)
   0x000000000b2b72c8 <+216>:	cmpl   $0x0,0x30(%rbx)
   0x000000000b2b72cc <+220>:	jne    0xb2b72a0 <pthread_join+176>
   0x000000000b2b72ce <+222>:	addl   $0x1,0x8(%rbx)
   0x000000000b2b72d2 <+226>:	mov    %r15,%rdi
   0x000000000b2b72d5 <+229>:	mov    %r14,%rsi
   0x000000000b2b72d8 <+232>:	call   *0x17d25ba2(%rip)        # 0x22fdce80 <_ZN14__interception17real_pthread_joinE>
   0x000000000b2b72de <+238>:	mov    %eax,%ebp               <== address from the stacktrace (so thread is in real_pthread_join)
   0x000000000b2b72e0 <+240>:	addl   $0xffffffff,0x8(%rbx)   <== ~BlockingCall()
   0x000000000b2b72e4 <+244>:	movq   $0x0,0x8(%r13)          <== atomic_store(&ctx->in_blocking_func, 0, memory_order_relaxed);
   0x000000000b2b72ec <+252>:	mov    %rbx,%rdi
   0x000000000b2b72ef <+255>:	call   0xb31cc70 <_ZN6__tsan15ThreadIgnoreEndEPNS_11ThreadStateE>

Binary: link

Am I missing something?

@tavplubix
Author

tavplubix commented Jun 14, 2022

Also, we noticed this deadlock after upgrading from clang-13 to clang-14. As far as I can see from git blame, SlotLocker was added to OnUserFree and Acquire in llvm/llvm-project@b332134, and this change has been present since llvmorg-14.0.0-rc1.

@dvyukov
Contributor

dvyukov commented Jun 14, 2022

You are right. This is a bug around in_blocking_func.
I think this was fixed in llvm/llvm-project@9ca2afd and then regressed in llvm/llvm-project@ce37210.

@dvyukov
Contributor

dvyukov commented Jun 14, 2022

Here is a test and additional CHECKs that detect this situation more reliably:
dvyukov/llvm-project@d33f845
But I don't have a solution yet.

dvyukov added a commit to dvyukov/llvm-project that referenced this issue Jun 14, 2022
@dvyukov
Contributor

dvyukov commented Jun 15, 2022

Sent https://reviews.llvm.org/D127845 with a fix.

@tavplubix
Author

Hi, has the fix been merged? We hit the same issue again (this time with clang-15). I did not see the changes from https://reviews.llvm.org/D127845 in the main branch of https://github.com/llvm/llvm-project, so it seems the patch was forgotten.

@dvyukov
Contributor

dvyukov commented Sep 30, 2022

@melver
Contributor

melver commented Sep 30, 2022

Wait, what happened here? Wasn't this merged as: dvyukov/llvm-project@fdb73f7

@dvyukov
Contributor

dvyukov commented Sep 30, 2022

That's my fork of the repo.

@melver
Contributor

melver commented Sep 30, 2022

Ah, that's me not looking at the repo properly, and GitHub also closed this issue based on that commit...

@melver melver reopened this Sep 30, 2022
azat added a commit to azat/ClickHouse that referenced this issue May 8, 2023
Since the initial issue got resolved [1].

  [1]: google/sanitizers#1540

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
veselypeta pushed a commit to veselypeta/cherillvm that referenced this issue May 22, 2024
We set in_blocking_func around some blocking C functions so that we don't
delay signals indefinitely (if in_blocking_func is set, we deliver signals
synchronously).

However, pthread_join is blocking but also calls munmap/free to free thread
resources. If we are inside the munmap/free interceptors called from
pthread_join and deliver a signal synchronously, it can lead to deadlocks
and crashes, since we re-enter the runtime and try to lock the same mutexes
or use the same per-thread data structures.

If we re-enter the runtime via an interceptor when in_blocking_func is set,
temporarily reset in_blocking_func around the interceptor and restore it
when we return from the recursive interceptor.

Also, move in_blocking_func from ThreadSignalContext to ThreadContext
so that we can CHECK that it's not set in the SlotLocker ctor.

Fixes google/sanitizers#1540

Reviewed By: melver

Differential Revision: https://reviews.llvm.org/D127845
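
The approach this commit message describes can be sketched as a scope guard around interceptor re-entry (hypothetical names, paraphrased from the message above; not the exact D127845 code):

#include <atomic>

// If an interceptor (e.g. the free() called internally by pthread_join)
// re-enters the runtime while in_blocking_func is set, clear the flag
// for the duration of the interceptor and restore it on return, so no
// signal is delivered synchronously while runtime mutexes may be held.
struct ThreadState {
    std::atomic<int> in_blocking_func{0};
};

struct ScopedBlockingFuncReset {
    explicit ScopedBlockingFuncReset(ThreadState *thr)
        : thr(thr),
          was_set(thr->in_blocking_func.load(std::memory_order_relaxed)) {
        if (was_set)
            thr->in_blocking_func.store(0, std::memory_order_relaxed);
    }
    ~ScopedBlockingFuncReset() {
        if (was_set)
            thr->in_blocking_func.store(1, std::memory_order_relaxed);
    }
    ThreadState *thr;
    bool was_set;
};

With the flag reachable from the thread's context, the SlotLocker constructor can then CHECK that in_blocking_func is not set, which is what the additional CHECKs in the test commit detect.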