## Shattering Illusions in Lock-Free Worlds Compiler and Hardware behaviors in OSes and VMs Marc Blanchou July 31, 2013 #### Introduction #### **Developer:** "Compiler/hardware, that's not the code I wrote for my driver?!" #### **Compiler/hardware:** "But your code is **correctly synchronized** right? So you should not care — you do not want to actually execute this horror you just wrote—**trust us**" July 31, 2013 BH USA 2013 Definitions 2 / 37 #### Introduction - What is it about? - Race conditions introduced by the compiler or the hardware in lock-free sections (in OSes and VMs among others) - Why should you care? - You don't realize how messy lock-free code can be - You want to find these bugs more easily - You want to know more about the different layers involved in these types of race conditions ## Agenda - Definitions - Lock-free programming - Memory models - Optimizations - Compiler, hardware and races - Reordering issues - Double-fetch (TOCTTOU) issues - Other issues - How to find these bugs? - Solutions? #### Locks? - Locks were initially created because of the difficulty of writing correct multi-threaded code - They more or less allow developers (and researchers) to not care too much about memory models and various compiler and hardware optimizations A multiprocessing system on a single computer involves problems similar to those of a distributed system because of the unpredictable order in which certain events can occur. ... We have found that problems often arise because people are not fully aware of this fact and its implications. — Leslie Lamport 1978 ## Lock-free programming - What is it? - Threads never waiting on each others - No more deadlocks - No livelocks or theorical scheduling issues - Usually cheaper and scale better than locks - Always completes operations in time (critical) - Usually only used for a few sections of applications - Way harder to get right than using locks - What for? - Various OSes and VM operations - Multimedia and financial apps, some databases etc. ### The (very) obvious - What a compiler/hardware knows - Memory operations within the thread - What a compiler/hardware does not know - Shared memory locations - Solution - Let the compiler/hardware know - Appropriate memory barriers and atomic operations - BUT you can't make assumptions about memory models anymore - Be careful with compiler/hardware specific code - Can break on newer or different hardware/compilers #### Cache and Sequential Consistency - Cache coherence - No data lost or written before being transferred from the cache to the target memory - Sequential Consistency (or illusion of) - Order of memory operations specified by a program - No memory reordering should be visible - People usually write code that needs SC - This is language and hardware dependent ### **Compiler Optimizations** - If not told otherwise, the compiler can do any optimization it wants to, as long the compiled code acts as if it would run on a single threaded machine. - What about Profile Guided Optimization (PGO) or code obfuscation? ### Memory models #### Source code ``` int gl_value, gl_random_number; int gl_is_updated; void thread1() { gl_value = gl_random_number + 42 gl_is_updated = 1; } int thread2() { if (gl_is_updated) { return gl_value; } } ``` Software memory model #### Compiler ``` Peephole optimizations Constant folding Strength reduction Null sequences Combine Operations Algebraic Laws Special Case Instructions Address Mode Operations Local optimizations Global optimizations Loop optimizations fission/distribution fusion/combining interchange/permutation inversion loop-invariant code motion parallelization reversal scheduling skewing software pipelining splitting/peeling tiling/blocking vectorization unrolling unswitching sectioning Interprocedural, whole-program or link-time optimization Machine code optimization ``` #### Machine code #### Processor / cache #### Hardware memory model Weak Strong (closer to SC) ARM IA-64 (Itanium) X86 - 64 PowerPC ## What can go wrong? #### Developer: "Compiler/harware, but I swear to synchronize properly!" #### **Compiler/hardware:** "Fine... We'll do our best so that you can't tell we modified the program you wrote." July 31, 2013 BH USA 2013 Definitions 12 / 37 # I'll synchronize, I promise - Definitions - Compilers, hardware and races - Reordering issues - Double-fetch (TOCTTOU) issues - Other issues - How to find these bugs? - Solutions July 31, 2013 ## Lock-free and reordering - Reordering can happen at compile time as well as at runtime (hardware). - We did not need to care with locks before - What does it mean for the developer? - Atomic operations - Appropriate memory barriers ### **Atomicity** - C/C++ operations are NOT presumed atomic - But some native types can be if they are aligned - C++11: atomic<> - RMW (read-modify-write) operations - CAS (compare-and-swap) | Non-atomic | Atomic | |----------------------------------------------------------------------------------|---------------------------------| | g_value++; // g_value = g_value + 1; | InterlockedIncrement(&g_value); | | <pre>int* rValue = (int*)([aligned_ptr] + 3); *rValue = 42; // not aligned</pre> | | ## Compiler reordering G++ 4.8 with no optimization flags ``` 7 int gl value, gl random number; .zero 8 int gl is updated; 8 thread1(): 9 .LFB0: 10 void thread1() 11 mov rbp, rsp 12 gl value = gl random number + 42; mov eax, DWORD PTR gl random number[rip] 13 add eax, 42 14 mov DWORD PTR gl value[rip], eax 15 gl is updated = 1; mov DWORD PTR gl is updated[rip], 1 16 pop rbp 17 ret ``` G++ 4.8 with -O3 ``` int gl_value, gl_random_number; int gl_is_updated; void thread1() gl_value = gl_random_number + 42; gl_is_updated = 1; gl_is_updated = 1; nov DWORD PTR gl_is_updated[rip], 1 add eax, 42 mov DWORD PTR gl_value[rip], eax ret ``` ## Compiler reordering G++ 4.8 and –O3 ``` 1 int gl value, gl random number; 1 .Ltext0: 2 int gl is updated; 2 thread1(): mov eax, DWORD PTR gl random number[rip] 4 void thread1() 5 { mov DWORD PTR gl is updated[rip], 1 gl value = gl random number + 42; add eax, 42 mov DWORD PTR gl value[rip], eax gl is updated = 1; 8 } ret 9 .LFE0: 10 int thread2() 10 thread2(): 11 { 11 .LFB1: if (gl is updated) mov eax, DWORD PTR gl is updated[rip] return gl value; 13 eax, eax 14 ie .L4 mov eax, DWORD PTR gl value[rip] 15 16 17 .L4: ``` ### Compiler barriers ``` int gl_value, gl_random_number; int gl_is_updated; void thread1() gl_value = gl_random_number + 42; asm_volatile("" ::: "memory"); gl_is_updated = 1; } Ltext0: thread1(): mov_eax, DWORD PTR gl_random_number[rip] add eax, 42 mov_DWORD PTR gl_value[rip], eax mov_DWORD PTR gl_is_updated[rip], 1 ret gl_is_updated = 1; } Ltext0: thread1(): add eax, 42 mov_DWORD PTR gl_is_updated[rip], 1 ret gl_is_updated = 1; } Ltext0: thread1(): Ltext0: thread1(): add eax, 42 mov_DWORD PTR gl_is_updated[rip], 1 ret gl_is_updated = 1; } Ltext0: thread1(): LIEXTO: LIEXTO: LIEXTO: Thread1(): Add eax, 42 mov_DWORD PTR gl_is_updated[rip], 1 Ret Selection = 1. Se ``` Prevent compiler reordering\* ``` int gl_value, gl_random_number; int gl_is_updated; #define SIMPLE_BARRIER() asm volatile("" ::: "memory") void thread1() gl_value = gl_random_number + 42; SIMPLE_BARRIER(); gl_is_updated = 1; } int thread2() if (gl_is_updated) { SIMPLE_BARRIER(); return_gl_value; } ``` ``` 1 .Ltext0: 2 thread1(): mov eax, DWORD PTR gl random number[rip] add eax, 42 mov DWORD PTR gl value[rip], eax mov DWORD PTR gl is updated[rip], 1 .LFE0: 10 thread2(): 11 .LFB1: mov eax, DWORD PTR gl is updated[rip] test eax, eax 14 15 mov eax, DWORD PTR gl value[rip] 16 17 .L4: rep; ret ``` \*note: this does not act as a hardware barrier #### Preventing compile-time reordering - Compiler barriers - VC++ specific - Interlocked operations - Volatile not atomic, VC++ specific implementation (/volatile:ms) - ReadWriteBarrier() but better use atomic<> - Use the /kernel flag! - GCC - asm volatile (""::: "memory") - Use specific memory barriers defines depending on the kernel - C++11 atomic types - Avoid relaxed atomic! - Volatile - Java: Full barrier (CPU+compiler) (!= C/C++ volatile) - Avoid volatile in C/C++ for synchronization - Implied - CPU fences - Some function calls (containing barriers or "unknown" functions) but they can be inlined - Use \_\_declspec(noinline) for VC++ or \_\_attribute\_\_((noinline)) for gcc ## Real-time reordering - Only visible on multicore or multiprocessor - ONE CPU guarantees - Dependent memory accesses are in order - Overlapping load and store will appear ordered ## Real-time reordering - ONE CPU does NOT Guarantee - Overlapping memory accesses are not merged or discarded - Independent load and store are issued in the order given - Even on x86-64 (strong memory model) - An independent load (read) can be reordered with older stores ``` thread1(): mov DWORD PTR gl is updated[rip], 1 Store Can be reordered by the CPU mov eax, DWORD PTR gl value[rip] ``` - Non-SC load and store instruction: mov - SC load instruction: mov - SC store instruction: xchg (or mfence + mov) #### Hardware barriers - C++11 *αtomic*<> types (apart from relaxed atomic) - GCC: volatile("[instruction]"::: "memory") - And the various defines (mb(), rmb(), wmb() etc.) - VC++ - MemoryBarrier() (full memory barrier, compiler+CPU) - Interlocked operations - Volatile in Java ( ≠ C/C++ volatile) ## Reordering at runtime Lots of different types of barriers depending on the CPU | C/C++11 Operation | x86 implementation | |-------------------|------------------------------------------------------| | Load Relaxed: | MOV (from memory) | | Load Consume: | MOV (from memory) | | Load Acquire: | MOV (from memory) | | Load Seq_Cst: | MOV (from memory) | | Store Relaxed: | MOV (into memory) | | Store Release: | MOV (into memory) | | Store Seq Cst: | (LOCK) XCHG // alternative: MOV (into memory),MFENCE | | Consume Fence: | <ignore></ignore> | | Acquire Fence: | <ignore></ignore> | | Release Fence: | <ignore></ignore> | | Acq_Rel Fence: | <ignore></ignore> | | Seq_Cst Fence: | MFENCE | | C/C++11 Operation | ARM implementation | |---------------------------|------------------------------------------------------------------------------------------------| | Load Relaxed: | ldr | | Load Consume: | ldr + preserve dependencies until next kill_dependency OR ldr; teq; beq; isb | | | OR<br> dr; dmb | | Load Acquire: | ldr; teq; beq; isb OR ldr; dmb | | Load Seq Cst: | ldr; dmb | | Store Relaxed: | str | | Store Release: | dmb; str | | Store Seq Cst: | dmb; str; dmb | | Cmpxchg Relaxed (32 bit): | loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, mewval, [rptr]; teq | | Cmpxchg Acquire (32 bit): | loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, mewval, [rptr]; teq | | Cmpxchg Release (32 bit): | dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, mewval, [rptr | | Cmpxchg AcqRel (32 bit): | dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, mewval, [rptr | | Cmpxchg SeqCst (32 bit): | dmb; _loop: ldrex roldval, [rptr]; mov rres, 0; teq roldval, rold; strexeq rres, mewval, [rptr | | Acquire Fence: | dmb | | Release Fence: | dmb | | AcqRel Fence: | dmb | | SeqCst Fence: | dmb | ### Other potential issues - Speculative register promotion - Write condition write - Adjacent field overwrites - Branch predictions - Merging loops or inverting nested loops - ABA problem? - Conditional "locks" - etc. ## Example | Thread 1 | Thread 2 | |-------------------------------|------------------------------------------------------------| | g_value =;<br>gl_done = true; | <pre>while (!gl_done) { [] } local_data = g_value;</pre> | ## Example Register promotion and reordering | | Thread 1 | Thread 2 | | |--------------------|-----------------------------|----------------------------------------|-------| | | g_value =; gl_done = true; | <b>while (!</b> gl_done) { [] | | | | , 3 = . | } | | | Potential CPU and | d | local_data = g_value; | | | compiler reorderir | ng | | | | | | Potential compiler optimiza | ation | | | | <pre>register int tmp = gl_done;</pre> | | | | | while (!tmp) { | | | | | [] | | | | | } | | | chat | | local_data = q_value; | | #### Classic double-fetch or TOCTTOU - Classic issue that can lead to privilege escalation - Kernel (local privilege escalation, userland->kernel) - Hypervisor (guest->host, VM breakout?) - Example - Two memory reads in kernel space from a user-writable address - Kernel fetches the location once, verifies and validates the data - -> Attacker modifies the memory in user space - Kernel fetches the attacker-controlled value a second time and uses it #### Classic double-fetch or TOCTTOU ``` void called_by_user(void* pUserSpaceMemory, [...]) { // kernel mode [...] try { [....] data_struct* p_data_struct = (data_struct*) pUserSpaceMemory; // attacker controlled [...] Captured twice ProbeForWrite(p_data_struct->buffer, p_data_struct->len, An attacker can sizeof(UCHAR)); change the address of [...] the buffer after the check RtlCopyMemory(p_data_struct->buffer, p_DATA, p_data_struct->len); [...] ``` #### More secure? ### Potential compiler optimization? ``` void called_by_user(void* pUserSpaceMemory, [...]) { // kernel mode [...] try { [...] captured_user_data = *(data_struct*) pUserSpaceMemory; [...] ProbeForWrite(captured_user_data.buffer, captured_user_data.len, sizeof(UCHAR)); ProbeForWrite(((data_struct*)pUserSpaceMemory)->buffer, ((data_struct*)pUserSpaceMemory)->len, sizeof(UCHAR)); [...] RtlCopyMemory(captured user data.buffer, p_SOME_DATA, captured_user_data.len); RtlCopyMemory(((data_struct*)pUserSpaceMemory)->buffer), p_DATA, ((data_struct*)pUserSpaceMemory)->len)); ``` Still captured twice? The compiler may not see why you need the local storage (which just adds instructions) and could optimize away ## Potential compiler bug? ``` void called_by_user(void* pUserSpaceMemory, [...]) { // kernel mode [...] try { [...] captured_user_data = *(volatile data_struct*) pUserSpaceMemory; ProbeForWrite(captured_user_data.buffer, Force volatile semantic. captured_user_data.len, force capture (legal) sizeof(UCHAR)); (/volatile:ms) ProbeForWrite(((data_struct*)pUserSpaceMemory)->buffer, Use copy from user(...) ((data_struct*)pUserSpaceMemory)->len, on Linux sizeof(UCHAR)); [...] Still captured RtlCopyMemory(captured user data.buffer, twice? p_SOME_DATA, captured_user_data.len); RtlCopyMemory(((data_struct*)pUserSpaceMemory)->buffer), p_DATA, ((data_struct*)pUserSpaceMemory)->len)); ``` #### Compilers and CPUs can have issues - Especially with lock-free code - These bugs are more frequent than you may think, and can impact a lot of code – that may never be recompiled - Compilers may not always follow the standard ### How to find these bugs - Blackbox - Determine which compiler was used - Any type of bug known to be introduced by the compiler? - Look for specific instructions (hardware barriers) and see what it is supposed to protect (in weak memory models: is the right instruction used?) - ThreadSanitizer (TSAN) - Linux/Mac based on Valgrind, based on PIN for Windows - Memory access pattern analysis? - See Bochspwn (M. Jurczyk and G. Coldwin) #### Whitebox and solutions - Thoroughly review code that: - Should not be optimized in any way - Where shared memory is accessed/written - Test cases and fuzzing - You are not only testing your code but the compiler/CPU too - Using ThreadSanitizer (TSAN) or Helgrind - Disabling optimizations has limits - Compare an test against CPUs with weaker memory models - Equivalence checking - Using different compilers (could be very difficult, though) - Temporary mitigation for the user: sandbox with one CPU ## Lock-free programming is hard - It can create lots of issues that easily go unnoticed - And even with valid code due to compiler/CPU issues "The fences in the current [C++] standard may be the most experts-only construct we have in the language" — Hans Boehm "It's easy to write lock-free code that appears to work, but it's very difficult to write lock-free code that is correct and performs well. Even good magazines and refereed journals have published a substantial amount of lock-free code that was actually broken in subtle ways and needed correction." — Herb Sutter #### Thank You - Marc Blanchou - Principal Security Consultant at iSEC Partners - marc@isecpartners.com - References - Papers/articles from: - Hans Bohem, Leslie Lamport, Herb Sutter, Vance Morrison, Jeff Preshing, David Howells, Paul E McKenney, Intel Corporation, Andrei Alexandrescu, Linus Torvalds, Petru Marginean, Tian Tian #### **UK Offices** Manchester - Head Office Cheltenham Edinburgh Leatherhead London Thame #### **European Offices** Amsterdam - Netherlands Munich – Germany Zurich - Switzerland #### **North American Offices** San Francisco Atlanta New York Seattle **Australian Offices** Sydney July 31, 2013 BH USA 2013 37 / 37