Description
Bugzilla Link | 27986 |
Version | 3.8 |
OS | Linux |
Attachments | Simple example to reproduce the issue |
Reporter | LLVM Bugzilla Contributor |
CC | @vitalybuka |
Extended Description
Using Address Sanitizer can cause the program to deadlock on allocations when the following conditions are met:
- Run the application on a 32-bit x86 Linux platform (I don't know whether a multilib build compiled with -m32 would also reproduce this).
- Have the threads in the application use the SCHED_FIFO scheduling policy.
- Vary the priority of the threads.
- Force all the threads in the application to use the same CPU.
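For reference, the last three conditions can be set up roughly as follows. This is only a sketch, not the code from the attached example: the helper names `make_fifo_attr` and `pin_to_cpu` are made up for illustration, and it assumes Linux with glibc. Creating a thread with these attributes normally needs root (or an equivalent real-time privilege).

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: initialise `attr` so that a thread created with it
// runs under SCHED_FIFO at the given priority. Returns 0 on success or an
// errno-style error code.
int make_fifo_attr(pthread_attr_t* attr, int priority) {
    int rc = pthread_attr_init(attr);
    if (rc != 0) return rc;
    // Without PTHREAD_EXPLICIT_SCHED the new thread inherits the creator's
    // policy and the settings below are ignored.
    rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (rc != 0) return rc;
    // Set the policy before the priority, since the valid priority range
    // depends on the policy.
    rc = pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    if (rc != 0) return rc;
    sched_param param{};
    param.sched_priority = priority;
    return pthread_attr_setschedparam(attr, &param);
}

// Hypothetical helper: pin the calling thread to one CPU. Threads created
// afterwards inherit the affinity mask. Returns 0 on success, -1 on error.
int pin_to_cpu(int cpu) {
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(cpu, &cpus);
    return sched_setaffinity(0, sizeof(cpus), &cpus);
}
```

Calling `pin_to_cpu(0)` once in `main` and then creating each thread with a different priority passed to `make_fifo_attr` gives the setup described in the list above.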
The reason this seems to happen is that SizeClassAllocator32 is using a spin lock to guard some internal data. Spin locks behave quite badly when they interact with SCHED_FIFO threads, especially when those SCHED_FIFO threads can't migrate CPUs.
Take this hypothetical example:
- Thread 1 has high priority, thread 2 has low priority.
- Thread 1 goes to sleep
- Thread 2 decides to allocate, so it will take the spin lock.
- The kernel interrupts Thread 2 in order to run some SCHED_OTHER processes. Thread 2 still holds the spin lock, as it was interrupted before it was finished.
- While the other process was running, Thread 1 finished its timed sleep (so it gets scheduled).
- Thread 1 is running now and decides to allocate. It tries to take the spin lock, but thread 2 still owns it.
- Thread 1 calls sched_yield() after spinning for a while (that's how the sanitizers' spin lock is implemented). However, thread 1 still has higher priority than thread 2, so the kernel immediately schedules it to run again.
- "Deadlock" has occurred, as thread 1 will keep spinning on the lock and thread 2 can never run because it's lower priority than thread 1.
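To make the scenario concrete, here is a minimal sketch of a spin lock that falls back to sched_yield(). It is not the sanitizer source: the class name `YieldingSpinLock` and the spin threshold of 10 are assumptions for illustration. It only mirrors the behaviour visible in the trace (LockSlow() ending up in internal_sched_yield()).

```cpp
#include <atomic>
#include <sched.h>

// Illustrative only: a spin lock whose slow path yields the CPU.
class YieldingSpinLock {
 public:
  // Returns true if the lock was free and is now held by the caller.
  bool try_lock() {
    return !locked_.exchange(true, std::memory_order_acquire);
  }

  void lock() {
    if (try_lock()) return;
    for (int i = 0;; ++i) {
      // After a few busy iterations (threshold is an assumption), yield.
      // Under SCHED_FIFO, sched_yield() only lets threads of the same
      // priority run; if the caller is the highest-priority runnable
      // thread on this CPU, it is picked again immediately.
      if (i >= 10) sched_yield();
      if (!locked_.load(std::memory_order_relaxed) && try_lock()) return;
    }
  }

  void unlock() { locked_.store(false, std::memory_order_release); }

 private:
  std::atomic<bool> locked_{false};
};
```

With this kind of lock, a high-priority SCHED_FIFO waiter never gives the CPU to a lower-priority holder pinned to the same CPU, so the loop in `lock()` never exits.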
The hang can be seen a bit more clearly in a stack trace of the provided example application. Once the program stops printing the "Alive" messages, you can have GDB interrupt the program and see these two threads (or something similar):
Thread 6 (Thread 0xab4feb40 (LWP 1520)):
#0 0xb7fdad91 in __kernel_vsyscall ()
#1 0xb7cdf217 in syscall () from /usr/lib/libc.so.6
#2 0x08118679 in __sanitizer::internal_sched_yield() ()
#3 0x0806466b in __sanitizer::StaticSpinMutex::LockSlow() ()
#4 0x08064834 in __sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>::AllocateBatch(__sanitizer::AllocatorStats*, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >, unsigned long) ()
#5 0x08064c32 in __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >::Refill(__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>, unsigned long) ()
#6 0x08067760 in __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
#7 0x0806378b in __asan::asan_memalign(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType) ()
#8 0x0812f543 in operator new(unsigned int) ()
#9 0x08132205 in dumb_thread (arg=0xbffffa60) at asan_fifo.cpp:26
#10 0x0806e7bf in asan_thread_start(void*) ()
#11 0xb7de42f1 in start_thread () from /usr/lib/libpthread.so.0
#12 0xb7ce37ce in clone () from /usr/lib/libc.so.6
Thread 5 (Thread 0xabcffb40 (LWP 1519)):
#0 0x08064863 in __sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>::AllocateBatch(__sanitizer::AllocatorStats*, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >, unsigned long) ()
#1 0x08064c32 in __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >::Refill(__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>, unsigned long) ()
#2 0x08067760 in __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
#3 0x0806378b in __asan::asan_memalign(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType) ()
#4 0x0812f543 in operator new(unsigned int) ()
#5 0x08132205 in dumb_thread (arg=0xbffffa5c) at asan_fifo.cpp:26
#6 0x0806e7bf in asan_thread_start(void*) ()
#7 0xb7de42f1 in start_thread () from /usr/lib/libpth
Even if I allow the program to resume and then interrupt it again, these threads don't appear to make any forward progress.
The fix (or at least one fix I can think of) is to not use spin locks, or at the very least to have the spin lock devolve into a blocking lock after a certain number of tries.
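A rough sketch of the second option, assuming a plain std::mutex is acceptable as the blocking fallback. The class name `SpinThenBlockLock` and the limit of 100 tries are made up, and this is not a patch against the sanitizer runtime (which may not be able to use std::mutex internally).

```cpp
#include <mutex>

// Illustrative only: try to take the lock a bounded number of times, then
// fall back to a blocking acquire so the waiter sleeps instead of spinning.
class SpinThenBlockLock {
 public:
  void lock() {
    for (int i = 0; i < kMaxSpins; ++i) {
      if (mu_.try_lock()) return;  // Fast path: got it while spinning.
    }
    // Slow path: block in the kernel. A sleeping high-priority waiter no
    // longer occupies the CPU, so a lower-priority holder can run and
    // release the lock.
    mu_.lock();
  }

  bool try_lock() { return mu_.try_lock(); }
  void unlock() { mu_.unlock(); }

 private:
  static constexpr int kMaxSpins = 100;  // Assumed value; needs tuning.
  std::mutex mu_;
};
```

Because the class provides lock(), try_lock() and unlock(), it can be used with std::lock_guard like any other mutex type.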
Note that the provided example must be run as root (to have permission to create SCHED_FIFO threads), and that running it will likely slow one CPU on your system down to a crawl, so I recommend running it in a VM. You might also have to tweak some of the numbers to reproduce the issue on your system. After running for a few seconds to a minute, you should see the 'Alive' messages stop. I compiled and tested this in a 32-bit VM of ArchLinux with both Clang 3.8 and GCC 6.1.1. I compiled with 'clang++ asan_fifo.cpp -o test -fsanitize=address -pthread'.
Also, I understand that the example is a bit convoluted. It's a slimmed-down version of a much larger real-world application, in which it normally takes several days of constant running for this bug to manifest itself.