ThreadSanitizer: deadlock in signal handler #1540

Closed
tavplubix opened this issue Jun 13, 2022 · 13 comments

@tavplubix

Hi, it seems our stress tests have found a deadlock in TSan while testing ClickHouse. The stack trace is:

Thread 238 (Thread 0x7fa07287e700 (LWP 925)):
#0  0x000000000b2949aa in __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) ()
#1  0x000000000b295a4a in __sanitizer::Semaphore::Wait() ()
#2  0x000000000b31a23f in __tsan::SlotLock(__tsan::ThreadState*) ()
#3  0x000000000b32a2c3 in __tsan::Acquire(__tsan::ThreadState*, unsigned long, unsigned long) ()
#4  0x000000000b2bd7b8 in __tsan::CallUserSignalHandler(__tsan::ThreadState*, bool, bool, int, __sanitizer::__sanitizer_siginfo*, void*) ()
#5  0x000000000b2bdcd4 in sighandler(int, __sanitizer::__sanitizer_siginfo*, void*) ()
#6  <signal handler called>
#7  0x000000000b33205e in __tsan::MetaMap::FreeRange(__tsan::Processor*, unsigned long, unsigned long, bool) ()
#8  0x000000000b331d0e in __tsan::MetaMap::FreeBlock(__tsan::Processor*, unsigned long, bool) ()
#9  0x000000000b31334f in __tsan::OnUserFree(__tsan::ThreadState*, unsigned long, unsigned long, bool) ()
#10 0x000000000b313172 in __tsan::user_free(__tsan::ThreadState*, unsigned long, void*, bool) ()
#11 0x000000000b2b5df5 in free ()
#12 0x00007fa19dd67a75 in _dl_deallocate_tls () from /lib64/ld-linux-x86-64.so.2
#13 0x00007fa19dd03242 in free_stacks () from /lib/x86_64-linux-gnu/libpthread.so.0
#14 0x00007fa19dd04522 in __free_tcb () from /lib/x86_64-linux-gnu/libpthread.so.0
#15 0x00007fa19dd05b29 in __pthread_clockjoin_ex () from /lib/x86_64-linux-gnu/libpthread.so.0
#16 0x000000000b2b72de in pthread_join ()
#17 0x000000001edc0d66 in Poco::ThreadImpl::joinImpl (this=this@entry=0x7b5400457938, milliseconds=milliseconds@entry=10000) at ../contrib/poco/Foundation/src/Thread_POSIX.cpp:247
#18 0x000000001edc1dc1 in Poco::Thread::tryJoin (this=0x7b5400457938, milliseconds=10000) at ../contrib/poco/Foundation/src/Thread.cpp:153
#19 0x000000001edc3b93 in Poco::PooledThread::release (this=0x7b5400457900) at ../contrib/poco/Foundation/src/ThreadPool.cpp:179
#20 0x000000001edc53d1 in Poco::ThreadPool::housekeep (this=this@entry=0x7ffc8fc36030) at ../contrib/poco/Foundation/src/ThreadPool.cpp:435
#21 0x000000001edc5aeb in Poco::ThreadPool::getThread (this=this@entry=0x7ffc8fc36030) at ../contrib/poco/Foundation/src/ThreadPool.cpp:446
#22 0x000000001edc5fc8 in Poco::ThreadPool::startWithPriority (this=0x7ffc8fc36030, priority=Poco::Thread::PRIO_NORMAL, target=..., name=...) at ../contrib/poco/Foundation/src/ThreadPool.cpp:365
#23 0x000000001eb7898f in Poco::Net::TCPServerDispatcher::enqueue (this=0x7b4000001600, socket=...) at ../contrib/poco/Net/src/TCPServerDispatcher.cpp:152
#24 0x000000001eb774a8 in Poco::Net::TCPServer::run (this=<optimized out>) at ../contrib/poco/Net/src/TCPServer.cpp:148
#25 0x000000001edc20f0 in Poco::(anonymous namespace)::RunnableHolder::run (this=<optimized out>) at ../contrib/poco/Foundation/src/Thread.cpp:55
#26 0x000000001edc082c in Poco::ThreadImpl::runnableEntry (pThread=0x7b4400019168) at ../contrib/poco/Foundation/src/Thread_POSIX.cpp:345
#27 0x000000000b2b6f19 in __tsan_thread_start_func ()
#28 0x00007fa19dd04609 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#29 0x00007fa19dc29133 in clone () from /lib/x86_64-linux-gnu/libc.so.6

I looked through the LLVM code, and it looks like this thread is trying to lock the same mutex a second time:
the first time during the free() call: https://github.com/llvm/llvm-project/blob/12e137ab24dae51553433e00ff96e28a14d5b1f5/compiler-rt/lib/tsan/rtl/tsan_mman.cpp#L272
and then in __tsan::Acquire, called from __tsan::CallUserSignalHandler: https://github.com/llvm/llvm-project/blob/f831d6fc800ccf22c1c09888fce3e3c8ebc2c992/compiler-rt/lib/tsan/rtl/tsan_rtl_mutex.cpp#L448
If I understand correctly, SlotLocker locks thr->slot->mtx, which is a __sanitizer::Mutex, and that mutex is not recursive.
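
For intuition, the failure shape is the classic self-deadlock of taking a non-recursive lock inside a signal handler. A minimal standalone sketch (my own illustration, not TSan code):

#include <csignal>
#include <mutex>

std::mutex slot_mtx;  // stand-in for thr->slot->mtx (not recursive)

void handler(int)
{
    // Locking a std::mutex in a signal handler is not async-signal-safe;
    // that is the point here: the interrupted code below already owns
    // slot_mtx, so this blocks forever.
    std::lock_guard<std::mutex> lock(slot_mtx);
}

int main()
{
    std::signal(SIGPROF, handler);
    std::lock_guard<std::mutex> lock(slot_mtx);  // first lock (like OnUserFree)
    std::raise(SIGPROF);  // handler runs on this same thread and never returns
}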

The binary was built with Clang 14.0.5 (build log).

@dvyukov
Contributor

dvyukov commented Jun 13, 2022

This part is problematic:

#5  0x000000000b2bdcd4 in sighandler(int, __sanitizer::__sanitizer_siginfo*, void*) ()
#6  <signal handler called>
#7  0x000000000b33205e in __tsan::MetaMap::FreeRange(__tsan::Processor*, unsigned long, unsigned long, bool) ()

What signal is it?
If it's a SIGSEGV/SIGBUS happening inside the TSan runtime, then that's the root cause. Even if it didn't deadlock, it would still crash.

@tavplubix
Author

What signal is it?

I'm not sure, but it's probably SIGPROF from our ThreadFuzzer (it randomly sends SIGPROF to some threads and sleeps or calls sched_yield in the signal handler). I doubt it's SIGSEGV or SIGBUS, because we attach gdb to the server process before running stress tests, and I don't see such signals in the gdb output. Also, there are a lot of threads with DB::ThreadFuzzer::signalHandler in their stack traces: log with stacktraces of all threads
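
Conceptually the fuzzer's handler does something like the following (a simplified, hypothetical sketch; the real DB::ThreadFuzzer in ClickHouse is more elaborate):

#include <cstdlib>
#include <sched.h>
#include <signal.h>
#include <unistd.h>

// Sketch of a SIGPROF-based thread fuzzer: perturb thread scheduling
// by yielding or sleeping inside the signal handler.
void fuzzerSignalHandler(int)
{
    if (std::rand() % 2)    // std::rand is not async-signal-safe;
        sched_yield();      // acceptable for a sketch only
    else
        usleep(1000);       // stall the interrupted thread ~1ms
}

void installThreadFuzzer()
{
    struct sigaction sa = {};
    sa.sa_handler = fuzzerSignalHandler;
    sigaction(SIGPROF, &sa, nullptr);
    // A separate fuzzing thread then periodically sends the signal:
    //   pthread_kill(target_thread, SIGPROF);
}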

@dvyukov
Contributor

dvyukov commented Jun 13, 2022

For SIGPROF, sighandler wouldn't call CallUserSignalHandler:

#4  0x000000000b2bd7b8 in __tsan::CallUserSignalHandler(__tsan::ThreadState*, bool, bool, int, __sanitizer::__sanitizer_siginfo*, void*) ()
#5  0x000000000b2bdcd4 in sighandler(int, __sanitizer::__sanitizer_siginfo*, void*) ()

see:
https://github.com/llvm/llvm-project/blob/5724231af279d2acf7a6f7b44fe9089456c37777/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp#L2072-L2077
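
In other words, the dispatch in sighandler is roughly the following (a simplified paraphrase with stub types; names are approximate, see the linked source for the real code):

#include <atomic>
#include <csignal>

struct ThreadSignalContext {
    std::atomic<int> in_blocking_func{0};
};

bool IsSyncSignal(int sig) {
    // Synchronous signals cannot be deferred: the faulting instruction
    // would just re-execute.
    return sig == SIGSEGV || sig == SIGBUS || sig == SIGILL || sig == SIGFPE;
}

void CallUserSignalHandler(ThreadSignalContext *, int) { /* ... */ }
void QueuePendingSignal(ThreadSignalContext *, int)    { /* ... */ }

void sighandler(int sig, ThreadSignalContext *sctx) {
    if (IsSyncSignal(sig) ||
        sctx->in_blocking_func.load(std::memory_order_relaxed)) {
        // Deliver now: either the signal cannot wait, or we are inside
        // a blocking call and deferring could delay it indefinitely.
        CallUserSignalHandler(sctx, sig);
        return;
    }
    // Otherwise remember the signal; it is processed later, at a point
    // where the runtime holds no internal mutexes.
    QueuePendingSignal(sctx, sig);
}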

@tavplubix
Author

I'm not familiar with the LLVM/TSan codebase, but it looks like it may call CallUserSignalHandler even for SIGPROF while executing a blocking function (i.e., if sctx->in_blocking_func is true):
https://github.com/llvm/llvm-project/blob/8246b2e156568c31e71e16cbaf4c14d316e7c06e/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp#L2088-L2096

And according to the stack trace, it was in a blocking function:

#16 0x000000000b2b72de in pthread_join ()

And it looks like it had set in_blocking_func to 1:
pthread_join - https://github.com/llvm/llvm-project/blob/8246b2e156568c31e71e16cbaf4c14d316e7c06e/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp#L1082
BLOCK_REAL and BlockingCall - https://github.com/llvm/llvm-project/blob/8246b2e156568c31e71e16cbaf4c14d316e7c06e/compiler-rt/lib/tsan/rtl/tsan_interceptors_posix.cpp#L320-L332
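
For reference, BlockingCall is essentially the following RAII guard (a simplified, self-contained paraphrase of the linked code with stub types, not the exact source). The annotated disassembly below shows the same ctor/dtor pair around the real pthread_join call:

#include <atomic>

struct ThreadSignalContext {
    std::atomic<int> in_blocking_func{0};
};
struct ThreadState {
    ThreadSignalContext sctx;
    std::atomic<int> pending_signals{0};
};

void ProcessPendingSignals(ThreadState *) { /* deliver queued signals */ }

struct BlockingCall {
    explicit BlockingCall(ThreadState *thr) : ctx(&thr->sctx) {
        for (;;) {
            // Announce that we are about to enter a blocking libc call.
            ctx->in_blocking_func.store(1, std::memory_order_relaxed);
            if (thr->pending_signals.load(std::memory_order_relaxed) == 0)
                break;
            // A signal arrived before the flag was visible: clear the
            // flag, drain the queue, and retry.
            ctx->in_blocking_func.store(0, std::memory_order_relaxed);
            ProcessPendingSignals(thr);
        }
    }
    ~BlockingCall() {
        ctx->in_blocking_func.store(0, std::memory_order_relaxed);
    }
    ThreadSignalContext *ctx;
};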

(gdb) disas 0x000000000b2b72de
Dump of assembler code for function pthread_join:
...
   0x000000000b2b729d <+173>:	jmp    0xb2b72c0 <pthread_join+208>      <== BlockingCall ctor
   0x000000000b2b729f <+175>:	nop
   0x000000000b2b72a0 <+176>:	movq   $0x0,0x8(%r13)
   0x000000000b2b72a8 <+184>:	cmpl   $0x0,0x30(%rbx)
   0x000000000b2b72ac <+188>:	je     0xb2b72c0 <pthread_join+208>
   0x000000000b2b72ae <+190>:	mov    %rbx,%rdi
   0x000000000b2b72b1 <+193>:	call   0xb2bd5a0 <_ZN6__tsan25ProcessPendingSignalsImplEPNS_11ThreadStateE>
   0x000000000b2b72b6 <+198>:	cs nopw 0x0(%rax,%rax,1)
   0x000000000b2b72c0 <+208>:	movq   $0x1,0x8(%r13)          <== atomic_store(&ctx->in_blocking_func, 1, memory_order_relaxed)
   0x000000000b2b72c8 <+216>:	cmpl   $0x0,0x30(%rbx)
   0x000000000b2b72cc <+220>:	jne    0xb2b72a0 <pthread_join+176>
   0x000000000b2b72ce <+222>:	addl   $0x1,0x8(%rbx)
   0x000000000b2b72d2 <+226>:	mov    %r15,%rdi
   0x000000000b2b72d5 <+229>:	mov    %r14,%rsi
   0x000000000b2b72d8 <+232>:	call   *0x17d25ba2(%rip)        # 0x22fdce80 <_ZN14__interception17real_pthread_joinE>
   0x000000000b2b72de <+238>:	mov    %eax,%ebp               <== address from the stacktrace (so thread is in real_pthread_join)
   0x000000000b2b72e0 <+240>:	addl   $0xffffffff,0x8(%rbx)   <== ~BlockingCall()
   0x000000000b2b72e4 <+244>:	movq   $0x0,0x8(%r13)          <== atomic_store(&ctx->in_blocking_func, 0, memory_order_relaxed);
   0x000000000b2b72ec <+252>:	mov    %rbx,%rdi
   0x000000000b2b72ef <+255>:	call   0xb31cc70 <_ZN6__tsan15ThreadIgnoreEndEPNS_11ThreadStateE>

Binary: link

Am I missing something?

@tavplubix
Author

tavplubix commented Jun 14, 2022

Also, we noticed this deadlock after upgrading from clang-13 to clang-14. As far as I can see from git blame, SlotLocker was added to OnUserFree and Acquire in llvm/llvm-project@b332134, and this change has been present since llvmorg-14.0.0-rc1.

@dvyukov
Contributor

dvyukov commented Jun 14, 2022

You are right. This is a bug around in_blocking_func.
I think this was fixed in llvm/llvm-project@9ca2afd and then regressed in llvm/llvm-project@ce37210.

@dvyukov
Contributor

dvyukov commented Jun 14, 2022

Here is a test and additional CHECKs that detect this situation more reliably:
dvyukov/llvm-project@d33f845
But I don't have a solution yet.

dvyukov added a commit to dvyukov/llvm-project that referenced this issue Jun 14, 2022
@dvyukov
Contributor

dvyukov commented Jun 15, 2022

Sent https://reviews.llvm.org/D127845 with a fix.

@tavplubix
Author

Hi, has the fix been merged? We hit the same issue again (this time with clang-15). I did not see the changes from https://reviews.llvm.org/D127845 in the main branch of https://github.com/llvm/llvm-project, so it seems the patch was forgotten.

@dvyukov
Contributor

dvyukov commented Sep 30, 2022

@melver
Contributor

melver commented Sep 30, 2022

Wait, what happened here? Wasn't this merged as: dvyukov/llvm-project@fdb73f7

@dvyukov
Contributor

dvyukov commented Sep 30, 2022

That's my fork of the repo.

@melver
Contributor

melver commented Sep 30, 2022

Ah, that's me not looking at the repo properly, and GitHub also closed this issue based on that commit...

@melver melver reopened this Sep 30, 2022
azat added a commit to azat/ClickHouse that referenced this issue May 8, 2023
Since the initial issue got resolved [1].

  [1]: google/sanitizers#1540

Signed-off-by: Azat Khuzhin <a.khuzhin@semrush.com>
veselypeta pushed a commit to veselypeta/cherillvm that referenced this issue May 22, 2024
We set in_blocking_func around some blocking C functions so that we don't
delay signals indefinitely (if in_blocking_func is set, we deliver signals
synchronously).

However, pthread_join is blocking but also calls munmap/free to free thread
resources. If we are inside the munmap/free interceptors called from
pthread_join and deliver a signal synchronously, it can lead to deadlocks
and crashes, since we re-enter the runtime and try to lock the same mutexes
or use the same per-thread data structures.

If we re-enter the runtime via an interceptor when in_blocking_func is set,
temporarily reset in_blocking_func around the interceptor and restore it
when we return from the recursive interceptor.

Also, move in_blocking_func from ThreadSignalContext to ThreadContext
so that we can CHECK that it's not set in the SlotLocker ctor.

Fixes google/sanitizers#1540

Reviewed By: melver

Differential Revision: https://reviews.llvm.org/D127845
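
The approach this commit message describes can be sketched as a scope guard around interceptor re-entry (hypothetical names, paraphrased from the message above; not the exact D127845 code):

#include <atomic>

// If an interceptor (e.g. the free() called internally by pthread_join)
// re-enters the runtime while in_blocking_func is set, clear the flag
// for the duration of the interceptor and restore it on return, so no
// signal is delivered synchronously while runtime mutexes may be held.
struct ThreadState {
    std::atomic<int> in_blocking_func{0};
};

struct ScopedBlockingFuncReset {
    explicit ScopedBlockingFuncReset(ThreadState *thr)
        : thr(thr),
          was_set(thr->in_blocking_func.load(std::memory_order_relaxed)) {
        if (was_set)
            thr->in_blocking_func.store(0, std::memory_order_relaxed);
    }
    ~ScopedBlockingFuncReset() {
        if (was_set)
            thr->in_blocking_func.store(1, std::memory_order_relaxed);
    }
    ThreadState *thr;
    bool was_set;
};

With the flag reachable from the thread's context, the SlotLocker constructor can then CHECK that in_blocking_func is not set, which is what the additional CHECKs in the test commit detect.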