## **arm** Research

# The Semantics of Transactions and Weak Memory in x86, Power, ARM, and C++

Nathan Chong, Tyler Sorensen, John Wickerson

June 2018



Tyler Sorensen



John Wickerson



#### Transactions

Weak Memory

William W. Collier



Synthesis Lectures on Distributed Computing Theory

## Transactions

The promise of scalable performance without programmer pain

# Weak Memory

William W. Collier



#### **Transactions**

The promise of scalable performance without programmer pain

# REASONING

#### Weak Memory

Making sense of microarchitecture that breaks programmer intuition



#### **Contributions**

Clarify interplay between transactions and weak memory

for x86, Power, Armv8, and C++ using axiomatic semantics and automated tool support

Resulting in the discovery of

Unsoundness of lock elision wrt an Armv8 spinlock impl. (this talk)

Ambiguity in Power TM specification Proposed simplification to C++ TM specification ... (*more in paper*)





TM-aware Memalloy [Wickerson et al., POPL 2017]





# Axiomatic Armv8 model with TM

+

Lock elision

+

Armv8 spinlock impl.

**\** 

TM-aware Memalloy [Wickerson et al., POPL 2017]



# Axiomatic Armv8 model with TM

+

Lock elision

+

Armv8 spinlock impl.

TM-aware Memalloy [Wickerson et al., POPL 2017]





Counterexample



```
// Initially, v == 0
P0
lock() | lock()
v := v + 2 | v := 1
unlock() | unlock()
// Can v == 2?
// A violation of mutual exclusion
```



```
// Initially, v == 0
                               Serialise:
                                 // v == 0
P0
                                 v := v + 2
lock()
          | lock()
                                 v := 1
v := v + 2 | v := 1
                                 // y == 1
unlock() | unlock()
// Can v == 2?
```

// A violation of mutual exclusion



```
// Initially, v == 0
                                Serialise:
                                  // v == 0
P0
                                  \vee := 1
lock()
          | lock()
                                  v := v + 2
v := v + 2 | v := 1
                                  // v == 3
unlock() | unlock()
// Can v == 2?
```

// A violation of mutual exclusion



```
lock()
                      lock()
v := v + 2
                      v := 1
unlock()
                      unlock()
```



```
lock() | lock()
```

#### Lock elision [Rajwar and Goodman, MICRO 2001]



lock() W5,[X0] LDR W5, W5, #2 ADD

STR

unlock()

W5,[X0]

lock() MOV W7,#1 STR W7, [X0] unlock()



Addr of v

```
lock()
                       tx {
                       if (lock taken)
                          txabort()
      W5,[X0]
LDR
                       MOV W7,#1
                       STR W7, [X0]
      W5, W5, #2
ADD
      W5,[X0]
STR
```

arm Research

unlock()

Addr of v

lock() **TXBEGIN** LDR W6, [X1] Addr of v CBZ W6, Crit Lock addr **TXABORT** Crit: W5,[X0] LDR MOV W7,#1 W5, W5, #2 STR W7, [X0] ADD W5,[X0] STR unlock() **TXEND** 



Addr of v

Lock addr

Hypothetical, but representative TM instructions

Compare-Branch-on-Zero Jump to Crit if lock is free; otherwise abort to fail-handler (omitted)

TXBEGIN

LDR W6, [X1]

CBZ W6, Crit

**TXABORT** 

Crit:

MOV W7,#1

STR W7, [X0]

**TXEND** 



```
lock()
                                      TXBEGIN
                                      LDR W6, [X1]
Addr of v
                                      CBZ W6, Crit
Lock addr
                                      TXABORT
                                   Crit:
                    W5, [X0]
             LDR
                                      MOV W7,#1
                                      STR W7, [X0]
                    W5, W5, #2
             ADD
                    W5,[X0]
             STR
             unlock()
                                      TXEND
```





Lock addr

lock()
<crit>
unlock()

Loop: LDAXR W2, [X1] W2, Loop **CBNZ** MOV W3,#1 W4,W3,[X1] STXR W4, Loop CBNZ <crit> WZR,[X1]

Atomically update lock from free to taken



Loop: LDAXR W2, [X1] Lock addr CBNZ W2, Loop Excl load/store pair. MOV W3,#1 ~ Compare-and-swap W4,W3,[X1]STXR W4, Loop CBNZ Excl status <crit> (success) WZR, [X1]



Store-excl succeeds if it is the *immediate coherence successor* of the write read-from by the load-excl [Sarkar et al., PLDI 2012]

STLR WZR, [X1]



Store-excl succeeds if it is the *immediate coherence successor* of the write read-from by the load-excl [Sarkar et al., PLDI 2012]



STLR WZR, [X1]



Store-excl succeeds if it is the *immediate coherence successor* of the write read-from by the load-excl [Sarkar et al., PLDI 2012]



STLR WZR, [X1]



Store-excl succeeds if it is the *immediate coherence successor* of the write read-from by the load-excl [Sarkar et al., PLDI 2012]



STLR WZR, [X1]



Lock addr

```
Loop:
                LDAXR W2, [X1]
                CBNZ
                       W2, Loop
                                        Spin if lock
Compare
                                        taken
                MOV
                       W3,#1
and Branch
                STXR
                       W4, W3, [X1]
on Non-Zero
                CBNZ
                        W4, Loop
                                       Spin if excl
                                       update
                <crit>
                                       failed
                       WZR, [X1]
```



Lock addr

```
Loop:
               LDAXR W2, [X1]
               CBNZ
                     W2,Loop
Unlock by
               MOV
                      W3,#1
writing 0;
               STXR W4, W3, [X1]
WZR = zero
               CBNZ
                      W4, Loop
register
               <crit>
                      WZR, [X1]
```



Loop: LDAXR W2, [X1] Lock addr CBNZ W2, Loop MOV W3,#1 Acquire/Release ~ "half barriers" STXR W4,W3,[X1] RCsc [Gharachorloo CBNZ W4, Loop et al, ISCA 1990] <crit> WZR, [X1] **arm** Research © 2017 Arm Limited

Read-acquire *ordered-before* any program-order successor

Any program-order predecessor ordered-before a write-release

[Arm arch. reference manual, B2.3]





Addr of v

Lock addr

```
lock()
                        TXBEGIN
                        LDR W6, [X1]
                        CBZ W6, Crit
                        TXABORT
                     Crit:
      W5, [X0]
LDR
                        MOV W7,#1
                        STR W7, [X0]
      W5, W5, #2
ADD
      W5,[X0]
STR
unlock()
                        TXEND
```



Addr of v

Lock addr

```
TXBEGIN
Loop:
  LDAXR W2, [X1]
                           LDR W6, [X1]
  CBNZ
                           CBZ W6, Crit
         W2, Loop
  MOV
                           TXABORT
         W3,#1
  STXR
         W4, W3, [X1]
                        Crit:
  CBNZ
         W4, Loop
         W5, [X0]
  LDR
                           MOV W7,#1
                           STR W7, [X0]
  ADD
         W5, W5, #2
         W5, [X0]
  STR
         WZR, [X1]
  STLR
                           TXEND
```



A program combining transactions and weak memory

Can v == 2 (violate mutual exclusion)?

```
TXBEGIN
           Loop:
                                      LDR W6, [X1]
              LDAXR W2, [X1]
              CBNZ
                     W2, Loop
                                       CBZ W6, Crit
lock = 0
              MOV
                     W3,#1
                                      TXABORT
                     W4, W3, [X1]
              STXR
                                    Crit:
              CBNZ
                     W4, Loop
                     W5, [X0]
                                       MOV W7,#1
              LDR
                                      STR W7, [X0]
              ADD
                     W5, W5, #2
                     W5, [X0]
              STR
                     WZR, [X1]
                                       TXEND
```

```
Loop:
                                       TXBEGIN
                                       LDR W6, [X1]
            1 LDAXR W2, [X1]
              CBNZ
                     W2, Loop
                                       CBZ W6, Crit
lock = 0
              MOV
                     W3,#1
                                       TXABORT
                     W4, W3, [X1]
              STXR
W2 = 0
                                    Crit:
              CBNZ
                     W4, Loop
                     W5, [X0]
                                       MOV W7,#1
              LDR
                                      STR W7, [X0]
              ADD
                     W5, W5, #2
                     W5, [X0]
              STR
                     WZR, [X1]
                                       TXEND
```

```
Loop:
                                       TXBEGIN
                                       LDR W6, [X1]
            1 LDAXR W2, [X1]
              CBNZ
                     W2, Loop
                                       CBZ W6, Crit
lock = 0
              MOV
                     W3,#1
                                       TXABORT
                     W4, W3, [X1]
              STXR
W2 = 0
              CBNZ
                     W4, Loop
                                    Crit:
            2 LDR
                     W5, [X0]
                                       MOV W7, #1
                                       STR W7, [X0]
                     W5, W5, #2
              ADD
W5 = 0
                     W5, [X0]
              STR
                     WZR, [X1]
                                       TXEND
```









It is "the [Arm] architecture's intention to allow store exclusives to promise success/failure very early"

Armv8 flat operational model [Pulte et al., POPL 2018]



```
Loop:
                                       TXBEGIN
                                       LDR W6, [X1]
            1 LDAXR W2, [X1]
              CBNZ
                     W2, Loop
                                       CBZ W6, Crit
lock = 0
              MOV
                     W3,#1
                                       TXABORT
                     W4, W3, [X1]
              STXR
W2 = 0
              CBNZ
                     W4, Loop
                                    Crit:
            2 LDR
                     W5, [X0]
                                       MOV W7, #1
                                       STR W7, [X0]
                     W5, W5, #2
              ADD
W5 = 0
                     W5, [X0]
              STR
                     WZR, [X1]
                                       TXEND
```

# lock = W2 = 0W5 = 0W6 = 0

```
Loop:
1 LDAXR W2, [X1]
         W2, Loop
  CBNZ
         W3,#1
         W4, W3, [X1]
  CBNZ
          W4, Loop
2 LDR
         W5, [X0]
         W5, W5, #2
  ADD
         W5, [X0]
  STR
         WZR, [X1]
```

**TXBEGIN** LDR W6, [X1] CBZ W6, Crit **TXABORT** Crit: MOV W7,#1 STR W7, [X0] TXEND

# lock = 1W2 = 0W3 = 1W4 = 0W5 = 0W6 = 0

```
Loop:
1 LDAXR W2, [X1]
  CBNZ
         W2, Loop
  MOV
         W3,#1
         W4, W3, [X1]
4 STXR
         W4, Loop
  CBNZ
         W5, [X0]
2 LDR
         W5, W5, #2
  ADD
         W5, [X0]
  STR
         WZR, [X1]
```

```
TXBEGIN
  LDR W6, [X1]
  CBZ W6, Crit
  TXABORT
Crit:
  MOV W7,#1
  STR W7, [X0]
  TXEND
```

W7 = 1

#### Loop: CBNZ lock = 0MOV 4 STXR W2 = 0CBNZ W3 = 1LDR W4 = 0ADD $W5 = 2^4$ 5 STR W6 = 0

```
1 LDAXR W2, [X1]
         W2, Loop
         W3,#1
         W4, W3, [X1]
         W4, Loop
         W5,[X0]
         W5, W5, #2
         W5, [X0]
         WZR, [X1]
```

```
TXBEGIN
  LDR W6, [X1]
  CBZ W6, Crit
  TXABORT
Crit:
  MOV W7,#1
  STR W7, [X0]
```

**TXEND** 

This Armv8 acquire-exclusive spinlock is safe, individually

Elided locks using only transactions are safe, individually

The combination is unsound



"[the] correctness [of lock elision] is guaranteed without any dependence on memory ordering"

[Rajwar and Goodman, MICRO 2001]





### A seven(teen) year-old counterexample

2001: Rajwar and Goodman introduce lock elision

2011: Acquire-release introduced to Armv8

2018: Lock elision counterexample



```
TXBEGIN
Replace excls
               Loop:
with v8.1 AL
                 LDAXR W2, [X1]
                                            LDR W6, [X1]
(acq-rel)
                 CBNZ
                                            CBZ W6, Crit
                         W2, Loop
atomic
                 MOV
                         W3,#1
                                            TXABORT
                 STXR
                         W4, W3, [X1]
                                          Crit:
                 CBNZ
                         W4, Loop
                  LDR
                         W5, [X0]
                                            MOV W7,#1
Insert DMB
                                            STR W7, [X0]
                 ADD
                         W5, W5, #2
between
                         W5, [X0]
                 STR
lock() and
                         WZR, [X1]
                                            TXEND
                 STLR
critical region
```

## Key ideas and related work

Axiomatic framework and tools [Alglave et al., TOPLAS 2014]

Memalloy tool for automatically comparing memory models [Wickerson et al., POPL 2017]

Litmus test minimality [Lustig et al., ASPLOS 2017]

→ Automated tool support for empirical testing and bounded verification

"Transactions in Relaxed Memory Architectures", [Dongol et al., POPL 2018]

https://bit.ly/2xJvbcT



#### Our paper

TM extensions of x86, Power, Armv8, and C++ axiomatic memory models

Formal models backed by automated tooling for

Synthesis of minimal tests for empirical testing Bounded verification of TM-related transformations and properties

Resulting in the discovery of

Unsoundness of lock elision wrt an Armv8 spinlock impl. Ambiguity in Power TM specification Proposed simplification to C++ TM specification ... (more in paper)



#### References

Alglave et al., "Herding Cats: Modelling, Simulation, Testing, and Data-mining for Weak Memory", TOPLAS 2014

Arm, "ARMv8 Architecture Reference Manual"

Cain et al., "Robust Architectural Support for Transactional Memory in the Power Architecture", ISCA 2013

Dongol et al., "Transactions in Relaxed Memory Architectures", POPL 2018

Gharachorloo et al., "Memory consistency and event ordering in scalable shared-memory multiprocessors", ISCA 1990

Lustig et al., "Automated Synthesis of Comprehensive Memory Model Litmus Test Suites", ASPLOS 2017

Pulte et al., "Simplifying ARM Concurrency: Multicopy-atomic Axiomatic and Operational Models for ARMv8", POPL 2018

Rajwar and Goodman, "Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution", MICRO 2001

Sarkar et al., "Synchronising C/C++ and POWER", PLDI 2012

Wickerson et al., "Automatically Comparing Memory Consistency Models", POPL 2017



#### Acknowledgements

We are grateful to Stephan Diestelhorst, Matt Horsnell, and Grigorios Magklis for extensive discussions of TM in general, and how it might interact with the Armv8 architecture memory model, to Nizamudheen Ahmed and Vishwanath HV for RTL testing, and to Peter Sewell for letting us access his Power machine.

We thank the following people for their insightful comments on various drafts of this work: Mark Batty, Andrea Cerone, George Constantinides, Stephen Dolan, Alastair Donaldson, Brijesh Dongol, Hugues Evrard, Shaked Flur, Graham Hazel, Radha Jagadeesan, Jan Kończak, Dominic Mulligan, Christopher Pulte, Alastair Reid, James Riely, the anonymous PLDI reviewers, and our shepherd, Julian Dolby.

This work was supported by an Imperial College Research Fellowship and the EPSRC (EP/K034448/1).

