Address Sanitizer deadlocks when used by SCHED_FIFO threads on x86 (not 64) when affined to a single CPU #28360

|  |  |
| --- | --- |
| Bugzilla Link | [27986](https://llvm.org/bz27986) |
| Version | 3.8 |
| OS | Linux |
| Attachments | [Simple example to reproduce the issue](https://user-images.githubusercontent.com/60944935/143753627-4e782730-7380-4c4c-aa36-1777a03220da.gz) |
| Reporter | LLVM Bugzilla Contributor |
| CC | @vitalybuka |

## Extended Description 


Using Address Sanitizer can cause the program to deadlock on allocations when the following conditions are met:

1. Run the application on a 32-bit x86 platform on Linux (I don't know whether a multilib build compiled with -m32 on a 64-bit host would also reproduce this).
2. Have the threads in the application use the SCHED_FIFO scheduling policy.
3. Vary the priority of the threads.
4. Force all the threads in the application to use the same CPU.
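As a rough illustration of conditions 2 to 4, the sketch below shows one way a thread could pin itself to a single CPU and switch to SCHED_FIFO using the standard Linux pthread calls. The helper names `pin_self_to_cpu` and `make_self_fifo` are hypothetical and are not taken from the attached reproducer, which may set things up differently.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cassert>
#include <cerrno>

// Hypothetical helper: restrict the calling thread to a single CPU.
// Returns 0 on success or an errno value on failure.
int pin_self_to_cpu(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// Hypothetical helper: switch the calling thread to SCHED_FIFO at the
// given priority. Returns 0 on success or an errno value on failure
// (typically EPERM when the process lacks the required privileges).
int make_self_fifo(int priority) {
    sched_param param{};
    param.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}
```

Each worker thread would call `pin_self_to_cpu` with the same CPU number and `make_self_fifo` with a different priority between `sched_get_priority_min(SCHED_FIFO)` and `sched_get_priority_max(SCHED_FIFO)`.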

The reason this seems to happen is that SizeClassAllocator32 is using a spin lock to guard some internal data. Spin locks behave quite badly when they interact with SCHED_FIFO threads, especially when those SCHED_FIFO threads can't migrate CPUs.

Take this hypothetical example:
1. Thread 1 has high priority, thread 2 has low priority.
2. Thread 1 goes to sleep
3. Thread 2 decides to allocate, so it will take the spin lock.
4. The kernel interrupts Thread 2 in order to run some SCHED_OTHER processes. Thread 2 still holds the spin lock, as it was interrupted before it was finished.
5. While the other process was running, Thread 1 finished its timed sleep (so it gets scheduled).
6. Thread 1 is running now and decides to allocate. It tries to take the spin lock, but thread 2 still owns it.
7. Thread 1 calls sched_yield() after a while (as that's how the spin lock for the sanitizers is implemented). However, thread 1 still has higher priority than thread 2, so the kernel immediately schedules it to run again.
8. "Deadlock" has occurred, as thread 1 will keep spinning on the lock and thread 2 can never run because it's lower priority than thread 1.

This can be seen a bit more clearly in a stack trace of the provided example application. Once the program stops printing the "Alive" messages, you can have GDB interrupt the program and see these two threads (or something similar):

Thread 6 (Thread 0xab4feb40 (LWP 1520)):
#0  0xb7fdad91 in __kernel_vsyscall ()
#1  0xb7cdf217 in syscall () from /usr/lib/libc.so.6
#2  0x08118679 in __sanitizer::internal_sched_yield() ()
#3  0x0806466b in __sanitizer::StaticSpinMutex::LockSlow() ()
#4  0x08064834 in __sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>::AllocateBatch(__sanitizer::AllocatorStats*, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >*, unsigned long) ()
#5  0x08064c32 in __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >::Refill(__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>*, unsigned long) ()
#6  0x08067760 in __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
#7  0x0806378b in __asan::asan_memalign(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType) ()
#8  0x0812f543 in operator new(unsigned int) ()
#9  0x08132205 in dumb_thread (arg=0xbffffa60) at asan_fifo.cpp:26
#10 0x0806e7bf in asan_thread_start(void*) ()
#11 0xb7de42f1 in start_thread () from /usr/lib/libpthread.so.0
#12 0xb7ce37ce in clone () from /usr/lib/libc.so.6

Thread 5 (Thread 0xabcffb40 (LWP 1519)):
#0  0x08064863 in __sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>::AllocateBatch(__sanitizer::AllocatorStats*, __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >*, unsigned long) ()
#1  0x08064c32 in __sanitizer::SizeClassAllocatorLocalCache<__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback> >::Refill(__sanitizer::SizeClassAllocator32<0ul, 4294967296ull, 16ul, __sanitizer::SizeClassMap<17ul, 64ul, 14ul>, 20ul, __sanitizer::FlatByteMap<4096ull>, __asan::AsanMapUnmapCallback>*, unsigned long) ()
#2  0x08067760 in __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
#3  0x0806378b in __asan::asan_memalign(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType) ()
#4  0x0812f543 in operator new(unsigned int) ()
#5  0x08132205 in dumb_thread (arg=0xbffffa5c) at asan_fifo.cpp:26
#6  0x0806e7bf in asan_thread_start(void*) ()
#7  0xb7de42f1 in start_thread () from /usr/lib/libpth

Even if I allow the program to resume and then interrupt it again, these threads don't appear to make any forward progress.

The fix (or at least one fix I can think of) is to not use spin locks, or at the very least to have the spin lock fall back to a blocking lock after a certain number of tries.

Note that the provided example must be run as root (to have permission to create SCHED_FIFO threads), and that running it will likely slow one CPU on your system down to a crawl, so I recommend running it in a VM. You might also have to tweak some of the numbers to reproduce the problem on your system. After running for a few seconds to a minute, you should see the 'Alive' messages stop. I compiled and tested this in a 32-bit Arch Linux VM with both Clang 3.8 and GCC 6.1.1. I compiled with 'clang++ asan_fifo.cpp -o test -fsanitize=address -pthread'.

Also I understand that the example is a bit convoluted. It's a slimmed down version of a real-world application that is much larger, and it takes several days of constant running for this bug to normally manifest itself.
