

Submitted by **Florian Schrögendorfer** 

Submitted at
Institute for Formal
Models and Verification

Supervisor **Armin Biere** 

August 2021

# **Bounded Model Checking of Lockless Programs**



Master Thesis
to obtain the academic degree of
Master of Science
in the Master's Program
Computer Science

JOHANNES KEPLER UNIVERSITY LINZ

Altenbergerstraße 69 4040 Linz, Österreich www.jku.at DVR 0093696

# **Statutory Declaration**

I hereby declare that the thesis submitted is my own unaided work, that I have not used other than the sources indicated, and that all direct and indirect sources are acknowledged as references.

This printed thesis is identical with the electronic version submitted.

Linz, 24.8.2021

# Contents

| 1  | Introduction                   |    |  |  |  |  |  |
|----|--------------------------------|----|--|--|--|--|--|
| 2  | Machine Model                  | 6  |  |  |  |  |  |
| 3  | Simulation                     | 12 |  |  |  |  |  |
| 4  | Encoding Scheme                | 18 |  |  |  |  |  |
| 5  | SMT-LIB Encoding               | 26 |  |  |  |  |  |
| 6  | Functional Next State Logic    | 48 |  |  |  |  |  |
| 7  | Relational Next State Logic    |    |  |  |  |  |  |
| 8  | BTOR2                          | 69 |  |  |  |  |  |
| 9  | Testing and Debugging          | 81 |  |  |  |  |  |
| 10 | Experiments                    | 82 |  |  |  |  |  |
| 11 | Conclusion                     | 94 |  |  |  |  |  |
| Ap | opendices A Intel Litmus Tests |    |  |  |  |  |  |

# List of Figures

| 1  | System Architecture with Store Buffers |
|----|----------------------------------------|
| 2  | Store Buffer Litmus Test Interleavings |
| 3  | ConcuBinE's Various Modes of Operation |
| 4  | Abstract Machine Model 6               |
| 5  | Program Syntax                         |
| 6  | Trace Syntax                           |
| 7  | Memory Map Syntax                      |
| 8  | Intel Litmus Test Encoder Graph        |
| 9  | AMD Litmus Test Encoder Graph          |
| 10 | Intel Litmus Test Solver Graph         |
| 11 | AMD Litmus Test Solver Graph           |
| 12 | Faulty Counter Encoder Graph 89        |
| 13 | Faulty Counter Solver Graph            |
| 14 | Valid Counter Encoder Graph            |
| 15 | Valid Counter Solver Graph             |

# List of Tables

| 1    | Memory Ordering on Different Architectures                        |
|------|-------------------------------------------------------------------|
| 2    | State Variables                                                   |
| 3    | Transition Variables                                              |
| 4    | Helper Variables                                                  |
| 5    | Frame Axioms                                                      |
| 6    | Store Buffer Litmus Test Programs and Activation Variables . 27   |
| 7    | SMT-LIB Expression Generator Functions                            |
| 8    | Intel Litmus Test Encoder Statistics                              |
| 9    | AMD Litmus Test Encoder Statistics 83                             |
| 10   | Intel Litmus Test Solver Statistics                               |
| 11   | AMD Litmus Test Solver Statistics                                 |
| 12   | Faulty Counter Encoder Statistics                                 |
| 13   | Faulty Counter Solver Statistics                                  |
| 14   | Valid Counter Encoder Statistics                                  |
| 15   | Valid Counter Solver Statistics                                   |
| A.1  | Stores Are Not Reordered with Other Stores                        |
| A.2  | Stores Are Not Reordered with Older Loads                         |
| A.3  | Loads May be Reordered with Older Stores                          |
| A.4  | Loads Are not Reordered with Older Stores to the Same Location 99 |
| A.5  | Intra-Processor Forwarding is Allowed                             |
| A.6  | Stores Are Transitively Visible                                   |
| A.7  | Stores Are Seen in a Consistent Order by Other Threads 102        |
| A.8  | Locked Instructions Have a Total Order                            |
| A.9  | Loads Are not Reordered with Locks                                |
| A.10 | Stores Are not Reordered with Locks                               |
| B.1  | Stores Are Not Reordered with Other Stores 106                    |
| B.2  | Stores Are Not Reordered with Older Loads 107                     |
| B.3  | Stores Can Be Arbitrarily Delayed                                 |
| B.4  | Loads May be Reordered with Older Stores                          |
| B.5  | Sequential Consistency                                            |
| B.6  | Stores Are Seen in a Consistent Order by Other Threads 109        |
| B.7  | Stores Are Transitively Visible                                   |
| B.8  | Intra-Processor Forwarding is Allowed                             |
| B.9  | Global Visibility                                                 |

# 1 Introduction

Correctness of computer systems is critical in today's information age and although software verification made considerable progress in the last decade, it is still an ongoing research topic. Testing alone however is not sufficient to validate parallel programs communicating via shared memory on modern multiprocessor hardware, as fatal race conditions can have extremely low probabilities of occurrence. The situation becomes even worse due to counter-intuitive behaviour introduced by certain hardware optimizations. Without further knowledge about the underlying architecture, inexperienced developers of concurrent low-level systems code like lockless data structures, operating system kernels, synchronization primitives, compilers, and so on might assume sequential consistency [1]. Unfortunately, none of the major hardware architectures follows this rather restrictive memory ordering model and allow reordering of memory access operations in various ways. This opens the door for hard to find bugs caused by unexpected behaviour and requires a deep understanding of the target architecture's memory ordering habits, accompanied by careful use of memory barrier operations to ensure consistency.

|                                     | Alpha    | ARM | Itanium | MIPS     | POWER | SPARC-TSO | x86 | zSystems |
|-------------------------------------|----------|-----|---------|----------|-------|-----------|-----|----------|
| Loads Reordered after Loads/Stores? | <b>✓</b> | 1   | 1       | 1        | 1     |           |     |          |
| Stores Reordered after Stores?      | 1        | 1   | 1       | <b>✓</b> | 1     |           |     |          |
| Stores Reordered after Loads?       | 1        | 1   | 1       | 1        | 1     | 1         | 1   | <b>✓</b> |
| Atomic Reordered with Loads/Stores? | 1        | 1   |         | 1        | 1     |           |     |          |

Table 1: Memory Ordering on Different Architectures

Table 1 shows a rough overview of the memory ordering models used by different architectures. Even the most restrictive – like the x86 processors in our desktop computers – allow loads to be reordered with an earlier store to a different location, thus breaking sequential consistency. The reason for this particular reordering is a widely used optimization technique commonly referred to as *store buffer*.

While caches improve subsequent repeated loads of a specific variable (after an initial cache miss), it is also necessary to accommodate frequent concurrent stores from multiple processors to a set of shared variables. In cache-coherent systems, if the caches hold multiple copies of a given variable, all the copies of that variable must have the same value. Each store will therefore invalidate all copies of the old value, hence slowing down the computation [2].



Figure 1: System Architecture with Store Buffers

To remedy this situation, modern processors come equipped with store buffers, as shown in Figure 1. When a variable is stored, the new value is placed in the processor's store buffer, which can proceed immediately without having to wait for the store to do something about all the old values of that variable residing in other processors' caches. In order to guarantee that each processor at least sees its own operations in program order, it first examines its store buffer before consulting the cache when loading a variable in a process called *store forwarding*. Since each store buffer is allowed to voluntarily flush its contents back to memory, the visibility of the particular

stores to other processors might be arbitrarily delayed, leading to the aforementioned memory misordering illustrated by the store buffer litmus test shown in Listing 1.

```
1 #include <assert.h>
 2 #include <pthread.h>
 4 #define ACCESS(x) (*(volatile typeof(x) *) &(x))
 5 #define READ(x) ({typeof(x) TMP = ACCESS(x); TMP;})
 6 #define WRITE(x,v) ({ACCESS(x) = (v);})
 8 static int w0 = 0;
 9 static int w1 = 0;
11 static int r0 = 0;
12 static int r1 = 0;
14 static void * P0 (void * p)
15
      WRITE(w0, 1);
16
17
     r0 = READ(w1);
      return p;
18
19 }
21 static void * P1 (void * p)
22
   {
23
     WRITE(w1, 1);
     r1 = READ(w0);
24
25
      return p;
26 }
28
   int main ()
29
30
     pthread_t t[2];
     pthread_create (t + 0, 0, P0, 0);
31
32
      pthread_create (t + 1, 0, P1, 0);
33
      pthread_join (t[0], 0);
     pthread_join (t[1], 0);
34
      assert(r0 + r1);
35
      return 0;
36
37 }
```

Listing 1: Store Buffer Litmus Test

One might assume that the assertion on line 35 never triggers, because any possible interleaving should guarantee that at least one WRITE must have happened before a subsequent READ by the other thread in this symmetric example. This intuition however breaks down with the addition of store buffers. Consider the interleavings depicted in Figure 2. After initializing



Figure 2: Store Buffer Litmus Test Interleavings

the shared variables with zero, writes to  $w_0$  and  $w_1$  will be placed into the respective processor's store buffer. If neither of the store buffers has been flushed prior to the subsequent reads  $r_0 = w_1$  and  $r_1 = w_0$ , both processors must resort to the variable's initial value in their caches, hence triggering the assertion. This willingness to reorder can be validated by running the store buffer litmus test given in Listing 1. On our Intel i7-3770K this counterintuitive reordering happened 177169 out of 1000000 test runs, while the perfectly legal outcome of  $r_0 + r_1 = 2$  on the other hand only occurred 82 times.

Programmers must therefore make careful use of memory barriers, forcing the store buffer to be flushed back to memory after writing to a shared variable whenever necessary. Unfortunately, many hardware vendors don't supply concrete formal specifications programmers can rely on. For example, the memory ordering models of Intel's [3] as well as AMD's [4] x86 implementations are specified in an informal prose together with a set of litmus tests, having the potential for leading to widespread confusion.

In this thesis we tried to address this issue by means of bounded model checking [5] based on the architectural view of a simple virtual machine model using store buffers, reassambling the memory ordering habits of the industry's leading x86 implementations. In order to evaluate our approach we developed ConcuBinE<sup>1</sup> (short for **Concu**rrent **Bin**ary **E**valuator), a tool for simulating and automated reasoning about random memory access sequences by concurrent programs. It offers three modes of execution:

- simulate: simulates the execution of given programs and returns the corresponding execution trace in an easy to read format.
- solve: automatically generates an encoding of given programs for a specific upper bound and evaluates the resulting bounded model checking problem with the help of state-of-the-art SMT solvers. If the problem is satisfiable, the corresponding execution trace (model) will be returned.
- replay: reevaluates a given trace via simulation. Used to validate the traces found by the solve mode.

Figure 3 shows a basic overview of its usage, where red nodes symbolize in and output files, blue nodes the mode of opration and yellow nodes external software (SMT solvers).



Figure 3: ConcuBinE's Various Modes of Operation

<sup>1</sup>https://github.com/phlo/concubine

# 2 Machine Model

We will start by defining a minimal virtual machine model of a multiprocessor system as observed by assembly programs. To keep the state space of the resulting model checking problems as small as possible, it is based on a 16 bit architecture, using only a minimal set of registers and a radically reduced instruction set.



Figure 4: Abstract Machine Model

A schematic overview is illustrated in Figure 4, where dashed lines depict components and data paths solely used by our *compare and swap* mechanism. At the top of the figure are an arbitrary number of logical processors, each running a single abstract thread containing:

- accu: a single 16 bit accumulator register
- mem: a special purpose 16 bit register, storing the expected value required by a unary *compare and swap* instruction
- a single element *store buffer*, consisting of:

 full: a one bit flag register, signaling that it contains a value and may be flushed

adr: a 16 bit address registerval: a 16 bit value register

All threads are directly connected to the machine's shared memory, referred to as heap, which will be uninitialized unless any eventual input data. Caches are abstracted as they don't influence the memory subsystem's basic behaviour if they are coherent. In terms of memory ordering, the addition of a *store buffer* allows stores to be reordered after loads, making our model consistent with Intel's or AMD's x86 memory ordering models [3, 4].

## Scheduling

At each step in time a single thread either executes an instruction or flushes its store buffer back to memory. Scheduling is generally performed non-deterministically under the following constraints.

- 1. A thread can execute a read, modify or control operation at any time.
- 2. A thread can voluntarily flush its store buffer to memory only if it is full.
- 3. A thread can execute a write or barrier operation only if its store buffer is empty.

We argue that this interleaving semantic is sufficient for simulating the effects of real hardware, due to the execution of instructions and store buffer flushes being seen as independent. A similar argument applies regarding preemption, assuming the presence of at least as many physical processors as running threads. Since preemptive operating system kernels contain memory barriers when switching contexts, our thread centric view remains valid and preemption is just hidden behind a store buffer flush.

#### Instructions

Our machine uses a radically reduced instruction set that contains only the most substantial operations. Instructions are stored separately for each thread and are therefore not contained in memory. This abstraction allows the program counter to address instructions by their index, starting from zero. To simplify the definition of operational semantics, instructions are labelled using the following attributes:

- modify Modifies a register's content.
- read Reads from memory using *store forwarding*: if full is set and adr equals the given target address, the value contained in val is read instead of the corresponding shared memory location.
- write Writes to the store buffer by setting full to true, adr to the given target address and val to the value contained in accu.
- barrier Blocks execution if the store buffer is full (full is set).
- **atomic** Multiple operations performed as a single, indivisible instruction.
- **control** Modifies the order in which instructions are executed.

Due to the single register architecture, all instructions have at most one operand. The following list shall give an overview of our machine's instruction set.

### Memory

Two addressing modes are supported: direct and indirect denoted by square brackets (e.g. LOAD [arg]).

LOAD arg modify, read

Loads the value found at address arg into accu.

STORE arg write

Stores the value found in accu at address arg.

FENCE barrier

Memory barrier.

#### Arithmetic

Since the interpretation of a value's signedness is left to the programmer, we only included the most basic arithmetic instructions, which are equivalent for unsigned and two's compliment representations. Division is therefore left out due to its different implementations for signed and unsigned integers.

ADD arg modify, read

Adds the value found at address arg to accu.

ADDI arg modify

Adds the immediate value arg to accu.

SUB arg modify, read

Subtracts the value found at address arg from accu.

SUBI arg modify

Subtracts the immediate value arg from accu.

MUL arg modify, read

Multiplies accu with the value found at address arg.

MULI arg modify

Multiplies accu with the immediate value arg.

Control Flow

JMP arg control

Jumps to the statement at arg unconditionally.

JZ arg control

Jumps to the statement at arg if accu is zero.

JNZ arg control

Jumps to the statement at arg if accu is non-zero.

JS arg control

Jumps to the statement at arg if accu is negative according to two's compliment (most significant bit is set).

JNS arg control

Jumps to the statement at arg if accu is zero or positive according to two's compliment (most significant bit is unset).

JNZNS arg control

Jumps to the statement at arg if accu is positive according to two's compliment (non-zero and most significant bit is unset).

#### Atomic

We included a basic atomic *compare and swap* operation, commonly used to implement synchronization primitives like semaphores and mutexes, as well as lockless data structures. Since this particular command requires at least two operands (the target address and expected previous value), it had to be split up into two instructions due to our machine's unary input language.

MEM arg modify, read

Loads the value from address arg into accu and mem as the expectation during a latter *compare and swap* operation.

CAS arg modify, read, barrier, atomic

Atomically compares the expected value in mem to the actual value found at address arg and only writes the value found in accu back to address adr if they are equal. Acts like a memory barrier.

#### Termination

Explicitly stopping a single thread or terminating the whole machine can be achieved by the following commands.

HALT barrier, control

Stops the current thread.

EXIT arg control

Stops the machine with exit code arg.

#### Meta

The following high-level meta instruction, mimicking a reverse semaphore (and therefore not available on real hardware), shall simplify the implementation of so called *checker threads* used to programmatically validate machine states at runtime.

```
CHECK arg control
```

Synchronize on checkpoint arg (rendezvous). Suspends execution until all threads, containing a call to checkpoint arg, reached the corresponding CHECK statement.

## **Programs**

Each thread is programmed using an assembly style language, defined by the following syntax.

```
 \langle int \rangle ::= \text{ an integer number } \\ \langle label \rangle ::= \text{ a sequence of printable characters without #} \\ \langle string \rangle ::= \text{ a sequence of whitespace and printable characters without # and $\setminus \langle comment \rangle ::= \# \langle string \rangle \\ \langle nullary \rangle ::= \text{ FENCE } | \text{ HALT } \\ \langle unary \rangle ::= \text{ ADDI } | \text{ SUBI } | \text{ MULI } | \text{ EXIT } | \text{ CHECK } \\ \langle memory \rangle ::= \text{ LOAD } | \text{ STORE } | \text{ ADD } | \text{ SUB } | \text{ MUL } | \text{ CMP } | \text{ MEM } | \text{ CAS } \\ \langle jump \rangle ::= \text{ JMP } | \text{ JZ } | \text{ JNZ } | \text{ JNS } | \text{ JNZNS } \\ \langle instruction \rangle ::= \langle nullary \rangle \\ | \langle unary \rangle \langle int \rangle \\ | \langle memory \rangle (\langle int \rangle | [\langle int \rangle]) \\ | \langle jump \rangle (\langle int \rangle | \langle label \rangle) \\ \langle statement \rangle ::= \langle label \rangle : \langle instruction \rangle | \langle instruction \rangle \\ \langle line \rangle ::= \langle statement \rangle | \langle statement \rangle \langle comment \rangle | \langle comment \rangle \\ \langle program \rangle ::= \langle line \rangle | \langle line \rangle \setminus n \langle program \rangle \\ \end{cases}
```

Figure 5: Program Syntax

If the final statement in a given program is not an EXIT instruction or unconditional JMP, an additional HALT is inserted implicitly.

# 3 Simulation

Execution can be simulated using the simulate mode, implementing a virtual machine for the previously defined model. It takes an arbitrary number of program files, each being run on a separate thread and produces an execution trace using a pseudo random number generator to determine the schedule.

| ADDI 1  | ADDI 1  |
|---------|---------|
| STORE 0 | STORE 1 |
| LOAD 1  | LOAD O  |
| CHECK O | CHECK O |

Listing 2: thread.0.asm Listing 3: thread.1.asm

An implementation of the store buffer litmus test for our machine is given in Listings 2 and 3. The following command line can now be used to simulate its execution.

#### \$ concubine simulate thread.0.asm thread.1.asm

After the simulation has been finished, the execution trace will be saved to the current directory as sim.trace per default.

#### Traces

The execution traces are simple text files, capturing the basic environment and machine state transitions. A trace of the previous simulation could look as follows.

```
thread.0.asm
thread.1.asm
#
  tid pc
                               mem adr val full
            cmd
                   arg accu
                                                    heap
0
       0
            ADDI
                   1
                        0
                               0
                                    0
                                         0
                                              0
                                                     {}
                                                               # 0
                                              0
                                                     {}
1
       0
            ADDI
                   1
                        0
                               0
                                    0
                                         0
                                                               # 1
                                    0
                                              0
                                                     {}
                                                               # 2
0
       1
            STORE 0
                        1
                               0
                                         0
            STORE 1
                               0
                                    0
                                         0
                                              0
                                                     {}
                                                              # 3
1
       1
                                    0
                                        1
                                              1
0
       2
            FLUSH -
                               0
                                                     {}
       2
            FLUSH -
                               0
                                    1
                                         1
                                              1
                                                     \{(0,1)\} # 5
1
                        1
0
       2
            LOAD
                        1
                               0
                                    0
                                         1
                                              0
                                                     \{(1,1)\} # 6
                   1
                                    1
1
       2
            LOAD
                   0
                        1
                               0
                                         1
                                                     {}
```

| 0 | 3 | CHECK C | ) 1 | 0 | 0 | 1 | 0 | {} | # 8  |
|---|---|---------|-----|---|---|---|---|----|------|
| 1 | 3 | CHECK C | 1   | 0 | 1 | 1 | 0 | {} | # 9  |
| 0 | 4 | HALT -  | . 1 | 0 | 0 | 1 | 0 | {} | # 10 |
| 1 | 4 | HALT -  | . 1 | 0 | 1 | 1 | 0 | {} | # 11 |

Listing 4: Simple Output Trace

At the beginning is a list of paths to the programs involved in creating the specific trace, followed by a delimiter. The rest of the file contains the actual execution steps, one per line, starting with the scheduled thread's identifier, followed by details about the state prior to the current statement's execution, consisting of:

- program counter of the instruction about to be executed
- instruction symbol, or FLUSH if a thread is flushing its store buffer
- instruction argument, or for nullary instructions
- accumulator register
- CAS memory register
- store buffer address register
- store buffer value register
- store buffer full flag
- memory cell updated in the previous step, represented as a single pair in set notation {(index,value)}, or an empty set {} if the memory state didn't change

The result of the instruction's execution will be visible, the next time the specific thread is scheduled. A better visualization of a particular thread's state transitions can be achieved by only considering e.g. thread 0 in the trace given in Listing 4:

| # t | id pc | cmd   | arg | accu | mem | adr | val | full | heap    |   |    |
|-----|-------|-------|-----|------|-----|-----|-----|------|---------|---|----|
| 0   | 0     | ADDI  | 1   | 0    | 0   | 0   | 0   | 0    | {}      | # | 0  |
| 0   | 1     | STORE | 0   | 1    | 0   | 0   | 0   | 0    | {}      | # | 2  |
| 0   | 2     | FLUSH | -   | 1    | 0   | 0   | 1   | 1    | {}      | # | 4  |
| 0   | 2     | LOAD  | 1   | 1    | 0   | 0   | 1   | 0    | {(1,1)} | # | 6  |
| 0   | 3     | CHECK | 0   | 1    | 0   | 0   | 1   | 0    | {}      | # | 8  |
| 0   | 4     | HALT  | -   | 1    | 0   | 0   | 1   | 0    | {}      | # | 10 |

A more formal definition of the trace file syntax is given below.

```
\langle int \rangle ::= an integer number
\langle path \rangle ::= a unix file path
\langle label \rangle ::= a sequence of printable characters without #
\langle string \rangle ::= a sequence of whitespace and printable characters without # and \n
\langle comment \rangle ::= \# \langle string \rangle
\langle program \rangle ::= \langle path \rangle \mid \langle path \rangle \langle comment \rangle \mid \langle comment \rangle
\langle programs \rangle ::= \langle program \rangle \mid \langle program \rangle \setminus n \langle programs \rangle
\langle mmap \rangle ::= \langle path \rangle \mid \langle path \rangle \langle comment \rangle
\langle header \rangle ::= \langle programs \rangle \setminus n . \mid \langle programs \rangle \setminus n . \langle mmap \rangle
\langle pc \rangle ::= \langle int \rangle \mid \langle label \rangle
\langle op \rangle ::= FLUSH
         LOAD | STORE | FENCE
         ADD | ADDI | SUB | SUBI | MUL | MULI
         CMP | JMP | JZ | JNZ | JS | JNS | JNZNS
         MEM | CAS
         HALT | EXIT | CHECK
\langle arg \rangle ::= \langle int \rangle \mid [\langle int \rangle] \mid \langle label \rangle \mid -
\langle bool \rangle ::= 0 \mid 1
\langle heap \rangle ::= \{\} \mid \{(\langle adr \rangle, \langle val \rangle)\}
\langle state \rangle ::= \langle int \rangle \langle pc \rangle \langle op \rangle \langle arg \rangle \langle int \rangle \langle int \rangle \langle int \rangle \langle int \rangle \langle bool \rangle \langle heap \rangle
\langle step \rangle ::= \langle state \rangle \mid \langle state \rangle \langle comment \rangle \mid \langle comment \rangle
\langle steps \rangle ::= \langle state \rangle \mid \langle state \rangle \setminus n \langle steps \rangle
\langle trace \rangle ::= \langle header \rangle \setminus n \langle steps \rangle
```

Figure 6: Trace Syntax

# Memory Maps

Access to uninitialized memory cells is captured in so called  $memory maps^2$  again a simple text file format, containing one address value pair per line, delimited by a whitespace.

```
thread.0.asm
thread.1.asm
. sim.mmap
# tid pc cmd arg accu mem adr val full heap
```

<sup>&</sup>lt;sup>2</sup>naming inspired by, but not to be confused with memory mapped I/O

```
0
            ADDI
                                                        {}
                                                                  # 0
       0
                    1
                         0
                                 0
                                      0
                                           0
                                                0
1
       0
                                      0
                                                0
                                                        {}
                                                                  # 1
            ADDI
                    1
                         0
                                 0
                                           0
0
                                 0
                                      0
                                           0
                                                0
                                                        {}
                                                                  # 2
       1
            STORE 0
                         1
0
       2
            LOAD
                    1
                                 0
                                      0
                                           1
                                                1
                                                        {}
                                                                  # 3
                         1
0
       3
            CHECK O
                         57647 0
                                      0
                                           1
                                                1
                                                        {}
                                                                  # 4
       1
                                                        {}
                                                                  # 5
1
            STORE 1
                         1
                                 0
                                      0
                                           0
                                                0
       2
                                                        {}
1
            LOAD
                    0
                                 0
                                      1
                                                1
                                                                  # 6
                                           1
                                                        {}
       3
            CHECK O
                                                1
                                                                  # 7
1
                         34446 0
                                      1
                                           1
                         57647 0
                                                        {}
0
       4
            FLUSH -
                                      0
                                           1
                                                1
                                                                  #8
1
       4
            FLUSH -
                         34446 0
                                      1
                                           1
                                                1
                                                        \{(0,1)\} # 9
0
       4
            HALT
                         57647 0
                                      0
                                           1
                                                0
                                                        \{(1,1)\} # 10
1
       4
            HALT
                         34446 0
                                      1
                                           1
                                                0
                                                        {}
                                                                  # 11
```

Listing 5: Output Trace Accessing Uninitialized Memory

Consider another random simulation, given in Listing 5. In this simulation, both threads are reading uninitialized memory locations at addresses 0 and 1 respectively. To ensure reproducibility, a memory map file sim.mmap is stored together with the trace file sim.trace per default and is referenced just after the delimiter at the end of the program list in the trace file's header.

```
0 34446
1 57647
```

Listing 6: Output Memory Map of the Trace in Listing 5

Memory maps can also be used to initialize shared memory with input data by supplying the -m command line parameter, followed by a path to an existing memory map file. For completeness, the syntax of such trace files is given below.

```
\begin{split} &\langle int \rangle ::= \text{ an integer number} \\ &\langle string \rangle ::= \text{ a sequence of whitespace and printable characters without # and $\backslash$ $$ $\langle comment \rangle ::= # \langle string \rangle$ $$ $\langle cell \rangle ::= \langle int \rangle \langle int \rangle | \langle int \rangle \langle comment \rangle | \langle comment \rangle$ $$ $$ $\langle cell \rangle ::= \langle cell \rangle | \langle cell \rangle \backslash n \langle mmap \rangle$ $$
```

Figure 7: Memory Map Syntax

## Finding Specific Machine States

It also possible to search for specific machine states via simulation. The command line switch -c simulates the given programs exhaustively until a trace yielding an exit code other than zero is encountered.

```
CHECK O
ADD 0
ADD 1
JZ error
EXIT 0
error: EXIT 1
```

Listing 7: checker.asm

To find traces exhibiting a problematic reordering of writes in our store buffer litmus test example, we make use of a so called *checker thread*, given in Listing 7, that examines the machine state at a specific point in time by using the CHECK instruction and stops execution with an EXIT 1 if none of the store buffers has been flushed, leading to both memory locations still being zero.

```
0 0
1 0
```

Listing 8: init.mmap

This requires that the relevant memory cells are set to zero in order to prevent uninitialized reads from influencing the checking procedure and is achieved by using the memory map given in Listing 8. The search can now be started by issuing the following command line:

After a couple of simulations, a trace like the one shown in Listing 9, where none of the store buffers has been flushed prior to the checker thread's ADD instructions, will be found.

```
checker.asm
thread.0.asm
thread.1.asm
. sim.mmap
# tid pc
                     arg
                            accu
                                  mem adr val full
                                                        heap
              cmd
1
       0
                                                        {}
              ADDI
                     1
                            0
                                   0
                                        0
                                            0
                                                 0
                                                             # 0
2
       0
              ADDI
                            0
                                   0
                                        0
                                            0
                                                 0
                                                        {}
                                                             # 1
                                                        {}
1
              STORE 0
                                   0
                                                             # 2
       1
                            1
                                        0
                                            0
                                                 0
2
                                                        {}
       1
              STORE 1
                                   0
                                        0
                                            0
                                                 0
                                                             # 3
                            1
1
       2
              LOAD
                                   0
                                                        {}
                                                             # 4
                     1
                            1
                                        0
                                            1
                                                 1
2
       2
              LOAD
                     0
                            1
                                   0
                                        1
                                            1
                                                 1
                                                        {}
                                                             # 5
                                                             # 6
1
       3
              CHECK 0
                            0
                                   0
                                        0
                                                        {}
                                            1
                                                 1
2
              CHECK 0
                                                             # 7
       3
                            0
                                   0
                                        1
                                            1
                                                 1
                                                        {}
0
       0
              CHECK 0
                            0
                                   0
                                                        {}
                                                             # 8
                                        0
                                            0
                                                 0
0
       1
                                                        {}
                                                             # 9
              ADD
                            0
                                   0
                                        0
                                            0
                                                 0
0
       2
              ADD
                                            0
                                                        {}
                                                             # 10
                     1
                            0
                                   0
                                        0
                                                 0
0
       3
              JΖ
                     error 0
                                   0
                                        0
                                            0
                                                 0
                                                        {}
                                                             # 11
0
       error EXIT
                     1
                            0
                                   0
                                        0
                                            0
                                                 0
                                                        {}
                                                             # 12
```

Listing 9: Sequentially Inconsistent Trace

# 4 Encoding Scheme

The main part of this work is implemented in the solve mode, allowing verification of concurrent software running on our abstract machine model by the means of bounded model checking [5]. It takes an arbitrary number of programs plus the upper bound  $\mathcal{K}$  as input and encodes them into a finite state machine, expressed as an SMT formula where each transition translates to the execution of a single thread. SMT-LIB [6] and the novel BTOR2 [7] word level model checking format can be generated, using the theories of arrays, uninterpreted functions and bit-vectors. To simplify the definition of bad states directly in the program code, the possibility of encountering an exit code greater than zero is checked for per default, but custom properties may be defined in a separate file and added with the -c command line parameter. The resulting formula is then evaluated by a state-of-the-art SMT solver. If it is satisfiable, the corresponding execution trace is extracted from a generated model and stored for later inspection.

#### Formal Model

Let  $\mathcal{B}^n$  be the fixed size bit-vector sort of width n and  $\mathcal{A}^n$  the array sort with index and element sorts  $\mathcal{B}^n$ . The following variables are used to encode the machine state at a particular step  $k \leq \mathcal{K}$ , where k = 0 is the initial state.

| $\overline{{	t heap}^k}$      | $\in \mathcal{A}^{16}$ | shared memory                                       |
|-------------------------------|------------------------|-----------------------------------------------------|
| $\mathtt{exit}^k$             | $\in \mathcal{B}^1$    | exit flag                                           |
| $\mathtt{exit}\text{-}code^k$ | $\in \mathcal{B}^{16}$ | exit code                                           |
| $accu_t^k$                    | $\in \mathcal{B}^{16}$ | accumulator register of thread $t$                  |
| $\mathtt{mem}_t^k$            | $\in \mathcal{B}^{16}$ | CAS memory register of thread $t$                   |
| $\mathtt{adr}_t^k$            | $\in \mathcal{B}^{16}$ | store buffer address register of thread $t$         |
| $\mathtt{val}_t^k$            | $\in \mathcal{B}^{16}$ | store buffer value register of thread $t$           |
| $\mathtt{full}_t^k$           | $\in \mathcal{B}^1$    | store buffer full flag of thread $t$                |
| $\mathtt{stmt}^k_{t,pc}$      | $\in \mathcal{B}^1$    | activation flag for statement at $pc$ of thread $t$ |
| $\mathtt{block}^k_{id,t}$     | $\in \mathcal{B}^1$    | block flag for checkpoint $id$ of thread $t$        |
| $\mathtt{halt}_t^k$           | $\in \mathcal{B}^1$    | halt flag of thread $t$                             |

Table 2: State Variables

Shared memory states are modelled using the array variables  $\mathtt{heap}^k$  in combination with the functions  $\mathtt{read}^k \colon \mathcal{B}^{16} \mapsto \mathcal{B}^{16}$  for retrieving an element and  $\mathtt{write}^k \colon \mathcal{B}^{16} \times \mathcal{B}^{16} \mapsto \mathcal{A}^{16}$  returning an updated version of the shared memory state array with the given element set to a specific value. Register states of a thread t are determined by the bit-vector variables  $\mathtt{accu}_t^k$ ,  $\mathtt{mem}_t^k$ ,  $\mathtt{adr}_t^k$ ,  $\mathtt{val}_t^k$  and the flag  $\mathtt{full}_t^k$ , signalling that the store buffer is full. To reduce the formula's complexity, program flow is modelled without an explicit program counter ("pc"). Instead, an activation flag  $\mathtt{stmt}_{t,pc}^k$  is added for every statement in the program of thread t, expressing that it is about to execute the statement at pc. Blocking a thread t while it is waiting for all other threads to reach a checkpoint id is achieved by a flag  $\mathtt{block}_t^k$ . Similarly, the flag  $\mathtt{halt}_t^k$  indicates that thread t executed a HALT instruction and is therefore also prevented from being scheduled. Termination is captured by the flag  $\mathtt{exit}^k$  and bit-vector variable  $\mathtt{exit}$ -code $^k$ .

All states are initially set to zero, except the initial statement activation flag  $\mathtt{stmt}_{t,0}^0$  of every thread t and the shared memory state array  $\mathtt{heap}^0$ , which may be initialized with input data according to a given memory map, but is assumed to be uninitialized in general.

Machine state transitions of the form  $s_k \to^k s_{k+1}$  are encoded by the following free variables.

| $\mathtt{thread}_t^k$ | $\in \mathcal{B}^1$ | thread $t$ is scheduled to execute an instruction in step $k$ |
|-----------------------|---------------------|---------------------------------------------------------------|
| $\mathtt{flush}_t^k$  | $\in \mathcal{B}^1$ | thread $t$ flushes its store buffer in step $k$               |

Table 3: Transition Variables

To simplify the definition of successor states, the following helper variables capture frequently used conditions.

| $exec^k_{t,pc}$         | $\in \mathcal{B}^1$ | thread $t$ is executing the statement at $pc$ in step $k$ |
|-------------------------|---------------------|-----------------------------------------------------------|
| $\mathtt{check}_{id}^k$ | $\in \mathcal{B}^1$ | all threads reached checkpoint $id$ in step $k$           |

Table 4: Helper Variables

The actual execution of a specific statement is encoded by  $\mathsf{exec}_{t,pc}^k$  and is defined as a conjunction of the corresponding statement and thread activation variables.

$$extsf{exec}_{t,pc}^k = extsf{stmt}_{t,pc}^k \wedge extsf{thread}_t^k$$

Furthermore, we use  $\operatorname{check}_{id}^k$  to signal that all threads containing a call to checkpoint id (given in the set  $C_{id} = \{t \mid t \text{ contains CHECK } id\}$ ) have synchronized.

$$\mathtt{check}_{id}^k = igwedge_{t \in C_{id}} \mathtt{block}_{id,t}^k$$

#### Scheduling

Non-deterministic scheduling of at most one thread per step is realized by a boolean cardinality constraint over all transition variables and the exit flag  $\mathtt{exit}^k$  to ensure satisfiability if the machine terminates in a step  $k < \mathcal{K}$ . Let  $\leq_n^1(x_1,\ldots,x_n)$  be a predicate expressing that at most one out of n variables is allowed to be true. The intuitive way of encoding  $\leq_n^1(x_1,\ldots,x_n)$  is by excluding all combinations of two variables being simultaneously true.

$$\bigwedge_{1 \le i < j \le n} (\neg x_i \lor \neg x_j)$$

This naïve approach, however, consists of  $\binom{n}{2}$  binary clauses. A more compact formulation, based on a sequential counter circuit computing partial sums  $s_i = \sum_{j=1}^i x_j$  for increasing values of i up to the final i = n, is presented as  $\operatorname{LT}^{n,1}_{\operatorname{SEQ}}$  in [8] and defined as follows.

$$(\neg x_1 \lor s_1) \land (\neg x_n \lor \neg s_{n-1}) \bigwedge_{1 < i < n} ((\neg x_i \lor s_i) \land (\neg s_{i-1} \lor s_i) \land (\neg x_i \lor \neg s_{i-1}))$$

 $LT_{SEQ}^{n,1}$  is superior to the naïve encoding with regard to the number of clauses for all n > 5, as it only requires 3n-4 binary clauses and n-1 additional auxiliary variables. The actual definition of the at most one constraint predicate  $\leq_n^1(x_1,\ldots,x_n)$  is therefore determined by the number of threads involved. Since we have to include two times the number of threads plus one variables in the constraint, the naïve encoding is only used for up to two threads and  $LT_{SEQ}^{n,1}$  otherwise. An at most one constraint alone is not sufficient, as we need exactly one transition variable or the exit flag to be true in every step. This is because our generated formula should not be trivially satisfiable by never scheduling a single thread. Thus, we define the exactly one constraint predicate  $= \frac{1}{n}(x_1,\ldots,x_n)$  by simply adding a disjunction over all variables.

$$(x_1 \vee \ldots \vee x_n) \wedge \leq_n^1 (x_1, \ldots, x_n)$$

If we redeclare n as the number of threads, non-deterministic scheduling of a single thread in step k can now be encoded by the following constraint.

$$=_{2n+1}^1(\mathtt{thread}_0^k,\ldots,\mathtt{thread}_{n-1}^k,\mathtt{flush}_0^k,\ldots,\mathtt{flush}_{n-1}^k,\mathtt{exit}^k)$$

This cardinality constraint is further influenced by explicitly disabling transitions from certain states that are prohibited by our machine model. Flushing an empty store buffer of a thread t can be prevented by a simple relational constraint.

$$\mathtt{flush}_t^k \implies \mathtt{full}_t^k$$

In case thread t containing a write, execution of any barrier operation has to be delayed while the store buffer is full. Let  $F_t$  be a set of statements requiring an empty store buffer and ite:  $\mathcal{B}^1 \times \mathcal{B}^n \times \mathcal{B}^n \to \mathcal{B}^n$  be a functional if-then-else, returning  $a \in \mathcal{B}^n$  if  $x \in \mathcal{B}^1$  is true, else  $b \in \mathcal{B}^n$ .

$$ite(x, a, b) = \begin{cases} a \text{ if } x \text{ is } true \\ b \text{ otherwise} \end{cases}$$

Since both store buffer related constraints mainly depend on mutually exclusive values of  $full_t^k$ , we are able to encode them in a single expression.

$$\mathtt{ite}(\mathtt{full}_t^k, (\bigvee_{pc \in F_t} \mathtt{stmt}_{t,pc}^k \implies \neg\mathtt{thread}_t^k), \neg\mathtt{flush}_t^k)$$

Blocking a thread t, while it is waiting for all other threads reaching a checkpoint id is implied by a conjunction of the corresponding block flag  $block_{id,t}^k$  and synchronization variable  $check_{id}^k$ .

$$\operatorname{block}_{id,t}^k \wedge \neg \operatorname{check}_{id}^k \Longrightarrow \neg \operatorname{thread}_t^k$$

Finally, if a thread t has halted, it must also be stopped from being scheduled.

$$\mathtt{halt}^k_t \Longrightarrow \lnot\mathtt{thread}^k_t$$

The purpose of this conditions is to restrict transitions which do not correspond to a valid execution. Since these constraints are defined relationally, special care must be taken in order to prevent a violation of the cardinality constraint, causing the formula to be unsatisfiable if all included variables are falsified simultaneously. Considering the previously defined constraints, the only way this situation might occur is if a deadlock is introduced by an unfortunate interleaving of different CHECK instructions, resulting in each thread waiting for another. Let T be the set of threads involved and C the set of checkpoint IDs, then the condition causing unsatisfiability can be summarized as follows.

$$\begin{split} \exists\, k \leq \mathcal{K} : \neg \mathtt{exit}^k \\ \land \\ \forall\, t \in T : \neg \mathtt{full}_t^k \land \neg \mathtt{halt}_t^k \\ \land \\ \exists\, id \in C : \mathtt{block}_{id,t}^k \land \neg \mathtt{check}_{id}^k \end{split}$$

#### Memory Access

Due store forwarding, memory access can not be expressed as simple array lookup, but is encoded by a separate function  $load_t^k: \mathcal{B}^{16} \to \mathcal{B}^{16}$  for loading the shared memory element at address  $adr \in \mathcal{B}^{16}$  with store forwarding from thread t.

$$\label{eq:load_t} \begin{split} \mathsf{load}_t^k(adr) &= \mathsf{ite}(\mathsf{full}_t^k \wedge \mathsf{adr}_t^k = adr, \\ & \mathsf{val}_t^k, \\ & \mathsf{read}^k(adr)) \end{split}$$

In case of indirect addressing,  $load_t^k$  is redefined to prevent dependency on certain features of the target language or the use of additional auxiliary variables.

$$load_t^k(adr) = ite(full_t^k, (1)$$

$$ite(adr_t^k = adr, (2)$$

$$ite(val_t^k = adr, (3)$$

$$val_t^k, (4)$$

$$read^{k}(val_{t}^{k})), (5)$$

$$\mathtt{ite}(\mathtt{adr}_t^k = \mathtt{read}^k(adr), \tag{6}$$

$$\operatorname{val}_t^k,$$
 (7)

$$read^k(read^k(adr)))),$$
 (8)

$$read^{k}(read^{k}(adr))) (9)$$

First, we check if the store buffer contains an entry and store forwarding might apply (1). If it is empty, the requested value has to be directly retrieved from memory (9). Otherwise, if the store buffer contains an entry for the given address (2), we must further check if it is equal to the effective address (3) to determine if either both (4), or just the given address can be forwarded (5). Finally, in case there is no entry for the given address, the effective address is fetched from memory and compared to the store buffer address (6) to see if it might be contained (7), or nothing can be forwarded after all (8).

#### Frame Axioms

Successor states of transitions for all possible types of operations, determined by  $\mathtt{flush}_t^k$  and  $\mathtt{exec}_{t,pc}^k$ , are defined by the frame axioms in the table below. All state variables, not explicitly altered are assumed to be unchanged in

the next step, except the block flags  $\mathtt{block}_{id,t}^{k+1}$ , which are reset if all threads synchronized upon checkpoint id.

$$\mathtt{block}_{id,t}^{k+1} = \mathtt{ite}(\mathtt{check}_{id}^k, \mathit{false}, \mathtt{block}_{id,t}^k)$$

To further simplify the definition of axioms, two additional functions are introduced:  $\mathtt{msb}: \mathcal{B}^{16} \to \mathcal{B}^1$  for retrieving the most significant bit of a given bit-vector and  $\mathtt{effective}^k: \mathcal{B}^{16} \to \mathcal{B}^{16}$  for transparently selecting the effective address during STORE or CAS instructions.

$$\texttt{effective}^k(adr) = \begin{cases} \texttt{read}^k(adr) \text{ if indirect} \\ adr \text{ otherwise} \end{cases}$$

| FLUSH     | $\mathtt{heap}^{k+1}$          | $= \mathtt{write}^k(\mathtt{adr}_t^k, \mathtt{val}_t^k)$ |
|-----------|--------------------------------|----------------------------------------------------------|
| rlusn     | $\mathtt{full}^{k+1}$          | = false                                                  |
|           | $\mathtt{accu}_t^{k+1}$        | $= \mathtt{load}_t^k(\mathtt{arg})$                      |
| LOAD arg  | $\mathtt{stmt}_{t,pc}^{k+1}$   | = false                                                  |
|           | $\mathtt{stmt}_{t,pc+1}^{k+1}$ | = true                                                   |
|           | $\mathtt{adr}_t^{k+1}$         | $= {	t effective}^k({	t arg})$                           |
| STORE arg | $\mathtt{val}_t^{k+1}$         | $=\mathtt{accu}_t^k$                                     |
| STORE arg | $\mathtt{stmt}_{t,pc}^{k+1}$   | = false                                                  |
|           | $\mathtt{stmt}_{t,pc+1}^{k+1}$ | = true                                                   |
| FENCE     | $\mathtt{stmt}_{t,pc}^{k+1}$   | = false                                                  |
| FENCE     | $\mathtt{stmt}_{t,pc+1}^{k+1}$ | = true                                                   |
|           | $\mathtt{accu}_t^{k+1}$        | $= \mathtt{accu}_t^k + \mathtt{load}_t^k(\mathtt{arg})$  |
| ADD arg   | $\mathtt{stmt}_{t,pc}^{k+1}$   | =false                                                   |
|           | $\mathtt{stmt}_{t,pc+1}^{k+1}$ | = true                                                   |
|           | $\mathtt{accu}_t^{k+1}$        | $=\mathtt{accu}_t^k+\mathtt{arg}$                        |
| ADDI arg  | $\mathtt{stmt}_{t,pc}^{k+1}$   | = false                                                  |
|           | $\mathtt{stmt}_{t,pc+1}^{k+1}$ | = true                                                   |

|          | $\mathtt{accu}_t^{k+1}$                | $= \mathtt{accu}_t^k - \mathtt{load}_t^k(\mathtt{arg})$                            |
|----------|----------------------------------------|------------------------------------------------------------------------------------|
| SUB arg  | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                            |
|          | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | = true                                                                             |
|          | $\mathtt{accu}_t^{k+1}$                | $=\mathtt{accu}_t^k-\mathtt{arg}$                                                  |
| SUBI arg | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                            |
|          | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | = true                                                                             |
|          | $\mathtt{accu}_t^{k+1}$                | $= \mathtt{accu}_t^k * \mathtt{load}_t^k(\mathtt{arg})$                            |
| MUL arg  | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                            |
|          | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | = true                                                                             |
|          | $\mathtt{accu}_t^{k+1}$                | $= \mathtt{accu}_t^k * \mathtt{arg}$                                               |
| MULI arg | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                            |
|          | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | = true                                                                             |
|          | $\mathtt{accu}_t^{k+1}$                | $= \mathtt{accu}_t^k - \mathtt{load}_t^k(\mathtt{arg})$                            |
| CMP arg  | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                            |
|          | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | = true                                                                             |
| JMP arg  | $\mathtt{stmt}_{t,pc}^{k+1}$           | $= \mathtt{ite}(pc \neq arg, false, true)$                                         |
| Jin arg  | $\mathtt{stmt}_{t,\mathtt{arg}}^{k+1}$ | $= \mathtt{ite}(pc \neq arg, true, false)$                                         |
|          | $\mathtt{stmt}_{t,pc}^{k+1}$           | $= \mathtt{ite}(pc \neq arg, false, \lnot \mathtt{accu}_t^k)$                      |
| JZ arg   | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | $=\mathtt{accu}_t^k$                                                               |
|          | $\mathtt{stmt}_{t,\mathtt{arg}}^{k+1}$ | $= \neg accu_t^k$                                                                  |
|          | $\mathtt{stmt}_{t,pc}^{k+1}$           | $= \mathtt{ite}(\mathit{pc} \neq \mathit{arg}, \mathit{false}, \mathtt{accu}_t^k)$ |
| JNZ arg  | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | $= \neg accu_t^k$                                                                  |
|          | $\mathtt{stmt}_{t,\mathtt{arg}}^{k+1}$ | $=\mathtt{accu}_t^k$                                                               |
|          | $\mathtt{stmt}_{t,pc}^{k+1}$           | $= \mathtt{ite}(pc \neq arg, false, \mathtt{msb}(\mathtt{accu}_t^k))$              |
| JS arg   | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | $= \neg \mathtt{msb}(\mathtt{accu}_t^k)$                                           |
|          | $\mathtt{stmt}_{t,\mathtt{arg}}^{k+1}$ | $= \mathtt{msb}(\mathtt{accu}_t^k)$                                                |
| -        |                                        |                                                                                    |

|           | $\mathtt{stmt}_{t,pc}^{k+1}$           | $= \mathtt{ite}(pc \neq arg, false, \neg \mathtt{msb}(\mathtt{accu}_t^k))$                                   |  |  |
|-----------|----------------------------------------|--------------------------------------------------------------------------------------------------------------|--|--|
| JNS arg   | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | $= \mathtt{msb}(\mathtt{accu}_t^k)$                                                                          |  |  |
|           | $\mathtt{stmt}_{t,\mathtt{arg}}^{k+1}$ | $= \neg \mathtt{msb}(\mathtt{accu}_t^k)$                                                                     |  |  |
|           | $\mathtt{stmt}_{t,pc}^{k+1}$           | $= \mathtt{ite}(pc \neq arg, \mathit{false}, \mathtt{accu}_t^k \wedge \neg \mathtt{msb}(\mathtt{accu}_t^k))$ |  |  |
| JNZNS arg | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | $= \neg \mathtt{accu}_t^k \vee \mathtt{msb}(\mathtt{accu}_t^k)$                                              |  |  |
|           | $\mathtt{stmt}_{t,\mathtt{arg}}^{k+1}$ | $= \mathtt{accu}_t^k \wedge \neg \mathtt{msb}(\mathtt{accu}_t^k)$                                            |  |  |
|           | $\mathtt{accu}_t^{k+1}$                | $= load_t^k(arg)$                                                                                            |  |  |
| MEM amm   | $\mathtt{mem}_t^{k+1}$                 | $= \mathtt{load}_t^k(\mathtt{arg})$                                                                          |  |  |
| MEM arg   | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                                                      |  |  |
|           | $\mathtt{stmt}_{t,pc+1}^{k+k}$         | = true                                                                                                       |  |  |
| ite       |                                        | $	ext{ite}(\mathtt{mem}_t^k = \mathtt{read}^k(\mathtt{effective}^k(\mathtt{arg})),$                          |  |  |
|           | $\mathtt{heap}^{k+1}$                  | $= \qquad \mathtt{write}^k(\mathtt{effective}^k(\mathtt{arg}),\mathtt{accu}_t^k),$                           |  |  |
|           |                                        | $\mathtt{heap}^k)$                                                                                           |  |  |
|           | $\mathtt{accu}_t^{k+1}$                | $= \mathtt{ite}(\mathtt{mem}_t^k = \mathtt{read}^k(\mathtt{effective}^k(\mathtt{arg})), 1, 0)$               |  |  |
| CAS arg   | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                                                      |  |  |
|           | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | = true                                                                                                       |  |  |
| HALT      | $exit_t^{k+1}$                         | $= \bigwedge_{i=0}^{n-1} \mathtt{halt}_i^{k+1}$                                                              |  |  |
| HALI      | $\mathtt{halt}_t^{k+1}$                | = true                                                                                                       |  |  |
|           | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                                                      |  |  |
|           | $\mathtt{exit}_t^{k+1}$                | = true                                                                                                       |  |  |
| EXIT arg  | $\mathtt{exit}	ext{-}\mathtt{code}^k$  | = arg                                                                                                        |  |  |
|           | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                                                      |  |  |
|           | $\mathtt{stmt}_{t,pc}^{k+1}$           | = false                                                                                                      |  |  |
| CHECK arg | $\mathtt{stmt}_{t,pc+1}^{k+1}$         | = true                                                                                                       |  |  |
|           | ${\tt block}^{k+1}_{arg,t}$            | = true                                                                                                       |  |  |
|           |                                        |                                                                                                              |  |  |

Table 5: Frame Axioms

# 5 SMT-LIB Encoding

We will now outline a simplistic C++17 implementation for encoding our bounded model checking problems in the SMT-LIB [6] format according to the previously defined scheme and use the store buffer litmus test as an exemplary input for demonstration. SMT-LIB is the result of an ongoing international initiative with the goal of developing a rigorously standardized language for describing SMT theories as well as formulas and is supported by nearly all SMT solvers. Based on a Lisp like syntax, it covers a wide variety of theories like the theory of integers, fixed size bit-vectors, arrays and many more. In order to apply specialized and more efficient satisfiability techniques, theories are grouped into so called logics and every formula must explicitly specify the logic it is based on. See www.smt-lib.org for a full list of theories and logics available. In our case QF\_AUFBV was used, allowing closed quantifier-free formulas over the theory of bit-vectors and bit-vector arrays extended with free sort and function symbols.

SMT-LIB also offers incremental solving through push and pop commands. Because of our encoding's iterative nature, it would have been the best choice in order to keep the size of our formulas as small as possible. Unfortunately, incremental solving just started to be supported by the main SMT solvers at the time we began development and therefore chose to manually unroll our bounded model checking problem for every step  $k \leq \mathcal{K}$ .

# Types

To differentiate among particular machine states, we start by introducing:

```
enum State
{
  heap,
  accu,
  mem,
  adr,
  val,
  full,
  stmt,
  block,
  halt,
  exit,
  exit_code
};
```

Each available instruction is represented by its own type, derived from an abstract class Instruction, capturing arguments and abolishing the need for lengthy case splits by utilizing dynamic dispatch of virtual member functions for taking the appropriate action at runtime.

```
struct Instruction
{
    uint arg;
    bool indirect;

    Instruction (uint a, bool i = false) : arg(a), indirect(i) {};

    virtual std::string encode (uint k, uint t, State state) = 0;
};
```

#### Globals

Commonly used variables are given as globals to keep things simple and function signatures as small as possible.

| Thread 0 |                         | Thread 1 |                         | Thread 2      |                         |
|----------|-------------------------|----------|-------------------------|---------------|-------------------------|
| ADDI 1   | $\mathtt{stmt}_{0,0}^k$ | ADDI 1   | $\mathtt{stmt}_{1,0}^k$ |               |                         |
| STORE 0  | $\mathtt{stmt}_{0,1}^k$ | STORE 1  | $\mathtt{stmt}_{1,1}^k$ |               |                         |
| LOAD 1   | $\mathtt{stmt}_{0,2}^k$ | LOAD O   | $\mathtt{stmt}_{1,2}^k$ |               |                         |
| CHECK O  | $\mathtt{stmt}_{0,3}^k$ | CHECK O  | $\mathtt{stmt}_{1,3}^k$ | CHECK 0       | $\mathtt{stmt}_{2,0}^k$ |
| HALT     | $\mathtt{stmt}_{0,4}^k$ | HALT     | $\mathtt{stmt}_{1,4}^k$ | ADD O         | $\mathtt{stmt}_{2,1}^k$ |
|          |                         |          |                         | ADD 1         | $\mathtt{stmt}_{2,2}^k$ |
|          |                         |          |                         | JZ error      | $\mathtt{stmt}_{2,3}^k$ |
|          |                         |          |                         | EXIT O        | $\mathtt{stmt}_{2,4}^k$ |
|          |                         |          |                         | error: EXIT 1 | $\mathtt{stmt}_{2,5}^k$ |

Table 6: Store Buffer Litmus Test Programs and Activation Variables

Input programs are expressed as lists of instruction pointers and initialized according to our store buffer litmus test, given in Table 6.

```
std::vector<std::vector<Instruction *>> programs = {
    new Addi(1),
    new Store(0),
    new Load(1),
    new Check(0),
    new Halt()
  },
    new Addi(1),
    new Store(1),
    new Load(0),
    new Check(0),
    new Halt()
  },
  {
    new Check(0),
    new Add(0),
    new Add(1),
    new Jz(5),
    new Exit(0),
    new Exit(1)
  }
};
```

The initial memory layout in our example, given by the memory map in Listing 8, is captured by the following variable.

```
std::map<uint, uint> mmap = {{0, 0}, {1, 0}};
```

Last but not least, we define the upper bound K of the resulting model checking problem as the final input variable.

```
uint bound = 17;
```

The resulting formula is stored in a string stream to increase readability by the use of compound statements.

```
std::ostringstream formula;
```

To simplify the encoding process, the following utility variables are generated during a preprocessing step by analyzing the input programs. Although they might seem a bit complex, we omit further details on how they are created as they just identify certain types of instructions and initialize them according to our store buffer litmus test example for completeness.

• Program counters of instructions modifying a particular state.

```
std::map<State, std::vector<std::vector<uint>>> updates = {
   {State::accu, {{0, 2},
                                             // accu_0 \leftarrow ADDI 1, LOAD 0
                           {0, 2},
                                             // accu_1 \leftarrow ADDI 1, LOAD 1
                           {1, 2}}},
                                           // accu_2 \leftarrow ADD 0, ADD 1
   {State::adr,
                         {{1},
                                             // \text{ adr}_0 \leftarrow \text{STORE } 0
                                             // adr_1 \leftarrow STORE 1
                           {1},
                                             // adr_2 \leftarrow \emptyset
                           {}}},
   {State::val,
                                             // \text{ val}_0 \leftarrow \text{STORE } 0
                         {{1}},
                                             // \text{ val}_1 \leftarrow \text{STORE 1}
                           {1},
                                             // \text{ val}_2 \leftarrow \emptyset
                           {}}},
   {State::full, {{1},
                                             // \text{ full}_0 \leftarrow \text{STORE 0}
                                             // full<sub>1</sub> \leftarrow STORE 1
                           {1},
                           {}}},
                                            // \text{ full}_2 \leftarrow \varnothing
   {State::halt, {{4},
                                             // halt<sub>0</sub> \leftarrow HALT
                           {4},
                                             // halt_1 \leftarrow \texttt{HALT}
                           {}}},
                                             // halt_2 \leftarrow \varnothing
                                             // exit \leftarrow \emptyset
   {State::exit, {{}},
                                             // exit \leftarrow \emptyset
                           {},
                           {4, 5}}}}; // exit ← EXIT 0, EXIT 1
```

• Predecessors of each statement.

```
std::vector<std::vector<uint>>> predecessors = {
  // thread 0
  {{},
              // ADDI 1
                                \leftarrow \varnothing
              // STORE 0
    {0},
                                \leftarrow ADDI 1
              // LOAD 1
    {1},
                                \leftarrow STORE 0
    {2},
              // CHECK 0
                                \leftarrow LOAD 1
    {3}},
              // HALT
                                \leftarrow CHECK 0
  // thread 1
  {{},
              // ADDI 1
                                \leftarrow \varnothing
             // STORE 1
    {0},
                                \leftarrow ADDI 1
    {1},
             // LOAD 0
                                \leftarrow STORE 1
              // CHECK 0
    {2},
                                \leftarrow LOAD O
    {3}}, // HALT
                                \leftarrow CHECK O
  // thread 2
              // CHECK 0
  {{},
                                \leftarrow \varnothing
    {0},
              // ADD 0
                                \leftarrow CHECK O
    {1},
             // ADD 1
                                \leftarrow ADD 0
    {2},
              // JZ error \leftarrow ADD 1
              // EXIT 0
                                \leftarrow JZ error
    {3},
    \{3\}\}\}; // error: EXIT 1 \leftarrow JZ error
```

• Program counters of memory barriers.

```
std::vector<std::vector<uint>> barriers = {
    {1, 4}, // thread 0: STORE 0, HALT
    {1, 4}, // thread 1: STORE 1, HALT
    {}}; // thread 2: Ø
```

• Program counters of CHECK instructions per checkpoint and thread.

#### **SMT-LIB Generator Functions**

Frequently used SMT-LIB expressions are generated by a set of functions, based on the following variadic function template as the basic expression generator.

```
template <class ... T>
std::string expr (const std::string & op, const T & ... args)
{
    std::string e = '(' + op;
    (((e += ' ') += args), ...);
    return e += ')';
}
```

Generator functions for SMT-LIB commands supporting a variable number of arguments by being defined as either :left-assoc, :right-assoc or :chainable also include two additional overloads: a variadic version for an arbitrary but fixed number of arguments, using the same template definition as the basic expression generator expr and another for handling a varying number of arguments at runtime, based on the following template accepting an arbitrary STL container. In order to emit syntactically correct SMT-LIB expressions, both handle calls containing only a single argument by directly returning it.

```
template <template <class, class ...> class C>
std::string expr (const std::string & op, const C<std::string> & args)
{
    std::string e = '(' + op;
    for (const auto & a : args)
        (e += ' ') += a;
    return e += ')';
}
```

Table 7 shows all generator function names and the corresponding SMT-LIB command they are creating.

| declare_bool  | declare boolean           | (declare-fun) |  |
|---------------|---------------------------|---------------|--|
| declare_bv    | declare bit-vector        |               |  |
| declare_array | declare array             |               |  |
| assertion     | assertion                 | (assert)      |  |
| lnot          | negation                  | (not)         |  |
| land          | conjunction               | (and)         |  |
| lor           | disjunction               | (or)          |  |
| lxor          | exclusive disjunction     | (xor)         |  |
| imply         | implication               | (=>)          |  |
| equal         | equivalence               | (=)           |  |
| ite           | functional if-then-else   | (ite)         |  |
| bvadd         | bit-vector addition       | (bvadd)       |  |
| bvsub         | bit-vector subtraction    | (bvsub)       |  |
| bvmul         | bit-vector multiplication | (bvmul)       |  |
| select        | array read                | (select)      |  |
| store         | array store               | (store)       |  |
| extract       | bit-vector extraction     | (extract)     |  |

Table 7: SMT-LIB Expression Generator Functions

Finally, we introduce a function **std::string** consth (**uint** val) for generating hexadecimal bit-vector constants and a helper function to simplify variable assignments.

```
std::string assign (std::string var, std::string val)
{
  return assertion(equal(var, val));
}
```

### **Common Encoding Functions**

We will now give a top-down overview of the actual encoding process implemented in src/encoder\_smtlib.cc and start with the main encoding function defined below.

```
void encode ()
{
  formula << "(set-logic QF_AUFBV)\n";

for (uint k = 0; k <= bound; k++)
  {
    declare_states(k);
    declare_transitions(k);
    define_transitions (k);

    if (k)
        define_states(k);
    else
        init_states();
    }
}</pre>
```

After setting the appropriate logic (QF\_AUFBV), each individual step k is encoded iteratively, starting with the declaration of state variables.

```
void declare_states (uint k)
{
    // thread states
    declare_accu(k);
    declare_mem(k);
    declare_adr(k);
    declare_val(k);
    declare_full(k);
    declare_stmt(k);
    declare_block(k);
    declare_halt(k);

    // machine states
    declare_heap(k);
    declare_exit(k);

if (!k) declare_exit_code();
}
```

Similarly to the generation of SMT-LIB commands, the following variadic function template serves as the basis for generating symbols, matching our established variable naming scheme.

```
template <class ... T>
std::string var (const std::string & name, const T & ... attributes)
{
   return (((name += '_') += std::to_string(attributes)), ...);
}
```

This allows us to define a generator function for each variable as a simple wrapper, like the one of our accumulator register variables  $accu_t^k$  given below.

```
std::string accu (uint k, uint t) { return var("accu", k, t); }
```

The declaration of accumulator register states for each thread t in step k is now appended to the formula by using previously defined generator functions.

```
void declare_accu (uint k)
{
   for (uint t = 0; t < programs.size(); t++)
      formula << declare_bv(accu(k, t)) << '\n';
}

Example: declare_accu(0)

(declare-fun accu_0_0 () (_ BitVec 16))
(declare-fun accu_0_1 () (_ BitVec 16))
(declare-fun accu_0_2 () (_ BitVec 16))</pre>
```

While the declarator functions of the remaining register states  $\operatorname{mem}_t^k$ ,  $\operatorname{adr}_t^k$ ,  $\operatorname{val}_t^k$  and  $\operatorname{full}_t^k$  are defined almost identically, except for the latter changing the sort to Bool, an additional nested loop is needed to declare an activation variable for each statement in a thread's program.

33

```
void declare_stmt (uint k)
{
  for (uint t = 0; t < programs.size(); t++)
    for (uint pc = 0; pc < programs[t].size(); pc++)
        formula << declare_bool(stmt(k, t, pc)) << '\n';
}</pre>
```

```
Example: declare_stmt(0)
(declare-fun stmt_0_0_0 () Bool)
(declare-fun stmt_0_0_1 () Bool)
(declare-fun stmt_0_0_2 () Bool)
(declare-fun stmt_0_0_3 () Bool)
(declare-fun stmt_0_0_4 () Bool)
(declare-fun stmt_0_1_0 () Bool)
(declare-fun stmt_0_1_1 () Bool)
(declare-fun stmt_0_1_2 () Bool)
(declare-fun stmt_0_1_3 () Bool)
(declare-fun stmt_0_1_4 () Bool)
(declare-fun stmt_0_2_0 () Bool)
(declare-fun stmt_0_2_1 () Bool)
(declare-fun stmt_0_2_2 () Bool)
(declare-fun stmt_0_2_3 () Bool)
(declare-fun stmt_0_2_4 () Bool)
(declare-fun stmt_0_2_5 () Bool)
```

In order to avoid repeated iteration of input programs, the precomputed utility variables are used to indicate the presence of certain instructions and other control flow related information. Declaration of block flags  $block_{id,t}^k$  can therefore be simplified by relying on the checkpoints map, containing the list of threads eventually waiting for a specific checkpoint.

```
void declare_block (uint k)
{
  for (const auto & [id, threads] : checkpoints)
    for (const auto & [t, _] : threads)
       formula << declare_bool(block(k, id, t)) << '\n';
}

Example: declare_block(0)

(declare-fun block_0_0_0 () Bool)
(declare-fun block_0_0_1 () Bool)
(declare-fun block_0_0_2 () Bool)</pre>
```

Halt flags  $halt_t^k$  are again declared by the appropriate generator function.

```
void declare_halt (uint k)
{
  for (uint t = 0; t < programs.size(); t++)
    formula << declare_bool(halt(k, t)) << '\n';
}

Example: declare_halt(0)
(declare-fun halt_0_0 () Bool)
(declare-fun halt_0_1 () Bool)
(declare-fun halt_0_2 () Bool)</pre>
```

Machine states  $\mathsf{heap}^k$ ,  $\mathsf{exit}^k$  and  $\mathsf{exit}\text{-}\mathsf{code}^k$  are declared next. In contrast to our basic encoding scheme, only a single bit-vector variable representing the machine's final exit code is used and declared during the initial step. We omit the definition of the corresponding declarator functions as they just add a single variable with the appropriate sort.

```
Example: declare_heap(0)
(declare-fun heap_0 () (Array (_ BitVec 16) (_ BitVec 16)))

Example: declare_exit(0).
(declare-fun exit_0 () Bool)

Example: declare_exit_code()
(declare-fun exit-code () (_ BitVec 16))

Variable declaration is concluded by adding the required transition variables.

void declare_transitions (uint k)
{
    declare_thread(k);
    declare_flush(k);
    declare_exec(k);
    declare_check(k);
```

We skip further details about their declarator functions, as they are based on the same principles used for register states and statement activation variables, except  $\mathtt{check}_{id}^k$  which also refers to the identifiers stored in the  $\mathtt{checkpoints}$  map.

}

```
void declare_check (uint k)
{
  for (const auto & [id, _] : checkpoints)
    formula << declare_bool(check(k, id)) << '\n';
}

Example: declare_check(0)
(declare-fun check_0_0 () Bool)</pre>
```

With all variables being declared, the actual encoding process is started by defining helper variables and scheduling constraints according to our machine model.

```
void define_transitions (uint k)
{
  define_exec(k);
  define_check(k);
  define_cardinality_constraints(k);
  define_store_buffer_constraints(k);
  define_checkpoint_constraints(k);
  define_halt_constraints(k);
}
```

Execution variables  $exec_{t,pc}^k$  are defined as a conjunction of the corresponding statement and thread activation variables.

```
(assert (= exec_0_1_0 (and stmt_0_1_0 thread_0_1)))
(assert (= exec_0_1_1 (and stmt_0_1_1 thread_0_1)))
(assert (= exec_0_1_2 (and stmt_0_1_2 thread_0_1)))
(assert (= exec_0_1_3 (and stmt_0_1_3 thread_0_1)))
(assert (= exec_0_1_4 (and stmt_0_1_4 thread_0_1)))
(assert (= exec_0_2_0 (and stmt_0_2_0 thread_0_2)))
(assert (= exec_0_2_1 (and stmt_0_2_1 thread_0_2)))
(assert (= exec_0_2_2 (and stmt_0_2_2 thread_0_2)))
(assert (= exec_0_2_3 (and stmt_0_2_3 thread_0_2)))
(assert (= exec_0_2_4 (and stmt_0_2_4 thread_0_2)))
(assert (= exec_0_2_5 (and stmt_0_2_5 thread_0_2)))
```

Definition of checkpoint variables  $\mathtt{check}_{id}^k$  is again based on the threads stored in  $\mathtt{checkpoints}$  to generate the list of corresponding block variables  $\mathtt{block}_{id,t}^k$  for creating the required conjunction.

```
void define_check (uint k)
{
    for (const auto & [id, threads] : checkpoints)
        {
        std::vector<std::string> args;
        for (const auto & [t, _] : threads)
            args.push_back(block(k, id, t));
        formula << assign(check(k, id), land(args)) << '\n';
    }
}

Example: define_check(0)

(assert (= check_0_0 (and block_0_0_0 block_0_0_1 block_0_0_2)))</pre>
```

We continue with the cardinality constraint, based on the following functions implementing the different *at-most-one* predicates.

```
void cardinality_naive (const std::vector<std::string> & vars)
{
  for (auto i = vars.begin(); i != vars.end(); ++i)
    for (auto j = i + 1; j != vars.end(); ++j)
        formula << assertion(lor(lnot(*i), lnot(*j))) << '\n';
}</pre>
```

```
void cardinality_sequential (const std::vector<std::string> & vars)
  std::vector<std::string> auxs;
  const auto end = --vars.end();
  for (auto i = vars.begin(); i != end; ++i)
    formula << declare_bool(auxs.emplace_back(*i + "_aux")) << '\n';</pre>
  auto var = vars.begin();
  auto aux = auxs.begin();
  formula << assertion(lor(lnot(*var), *aux)) << '\n';</pre>
  while (++var != end)
    {
      const std::string & prev = *aux++;
      formula << assertion(lor(lnot(*var), *aux)) << '\n'</pre>
               << assertion(lor(lnot(prev), *aux)) << '\n'
               << assertion(lor(lnot(*var), lnot(prev))) << '\n';
  formula << assertion(lor(lnot(*var), lnot(*aux))) << '\n';</pre>
}
```

The required *exactly-one* constraint can now be defined by collecting the relevant variables and selecting the appropriate *at-most-one* predicate after adding a disjunction expressing that *at-least-one* needs to be true.

```
void define_cardinality_constraints (uint k)
{
   std::vector<std::string> vars;
   for (uint t = 0; t < programs.size(); t++)
      {
      vars.push_back(thread(k, t));
      vars.push_back(flush(k, t));
   }
   vars.push_back(exit(k));
   // >= 1 constraint
   formula << assertion(lor(vars)) << '\n';
   // <= 1 constraint
   if (vars.size() > 5)
      cardinality_sequential(vars);
   else
      cardinality_naive(vars);
}
```

```
Example: define_cardinality_constraints(0)
(assert (or thread_0_0 flush_0_0
            thread_0_1 flush_0_1
            thread_0_2 flush_0_2
            exit_0))
(declare-fun thread_0_0_aux () Bool)
(declare-fun flush_0_0_aux () Bool)
(declare-fun thread_0_1_aux () Bool)
(declare-fun flush_0_1_aux () Bool)
(declare-fun thread_0_2_aux () Bool)
(declare-fun flush_0_2_aux () Bool)
(assert (or (not thread_0_0) thread_0_0_aux))
(assert (or (not flush_0_0) flush_0_0_aux))
(assert (or (not thread_0_0_aux) flush_0_0_aux))
(assert (or (not flush_0_0) (not thread_0_0_aux)))
(assert (or (not thread_0_1) thread_0_1_aux))
(assert (or (not flush_0_0_aux) thread_0_1_aux))
(assert (or (not thread_0_1) (not flush_0_0_aux)))
(assert (or (not flush_0_1) flush_0_1_aux))
(assert (or (not thread_0_1_aux) flush_0_1_aux))
(assert (or (not flush_0_1) (not thread_0_1_aux)))
(assert (or (not thread_0_2) thread_0_2_aux))
(assert (or (not flush_0_1_aux) thread_0_2_aux))
(assert (or (not thread_0_2) (not flush_0_1_aux)))
(assert (or (not flush_0_2) flush_0_2_aux))
(assert (or (not thread_0_2_aux) flush_0_2_aux))
(assert (or (not flush_0_2) (not thread_0_2_aux)))
(assert (or (not exit_0) (not flush_0_2_aux)))
```

Since a program might not even contain a single memory barrier, different store buffer related constraints are added for each thread. Beside the common restriction of flushing an empty store buffer, we resort to barriers containing the program counters of barrier instructions for generating the list of corresponding statement activation variables required to additionally delay their execution while the store buffer is full.

```
void define_store_buffer_constraints (uint k)
  for (uint t = 0; t < programs.size(); t++)</pre>
    if (barriers[t].empty())
      formula << assertion(imply(flush(k, t), full(k, t))) << '\n';</pre>
    else
        std::vector<std::string> stmts;
        for (uint pc : barriers[t])
           stmts.push_back(stmt(k, t, pc));
        formula << assertion(ite(full(k, t),</pre>
                                    imply(lor(stmts),
                                           lnot(thread(k, t))),
                                    lnot(flush(k, t)))) << '\n';</pre>
      }
}
Example: define_store_buffer_constraints(0)
(assert (ite full_0_0
              (=> (or stmt_0_0_1 stmt_0_0_4) (not thread_0_0))
              (not flush_0_0)))
(assert (ite full_0_1
              (=> (or stmt_0_1_1 stmt_0_1_4) (not thread_0_1))
              (not flush_0_1)))
(assert (=> flush_0_2 full_0_2))
Checkpoint constraints are defined by using checkpoints to generate the
relevant block_{id,t}^k and check_{id}^k variables for explicitly disabling a thread's
activation while it is waiting for all other participants to synchronize upon
checkpoint id.
void define_checkpoint_constraints (uint k)
  for (const auto & [id, threads] : checkpoints)
    for (const auto & [t, _] : threads)
      formula << assertion(imply(land(block(k, id, t),</pre>
                                         lnot(check(k, id))),
                                    lnot(thread(k, t)))) << '\n';</pre>
}
```

```
(assert (=> (and block_0_0_0 (not check_0_0)) (not thread_0_0)))
(assert (=> (and block_0_0_1 (not check_0_0)) (not thread_0_1)))
(assert (=> (and block_0_0_2 (not check_0_0)) (not thread_0_2)))
The definition of scheduling constraints is concluded by preventing halted
threads from being executed.
void define_halt_constraints (uint k)
  for (uint t = 0; t < programs.size(); t++)</pre>
    formula << assertion(imply(halt(k, t),</pre>
                                 lnot(thread(k, t)))) << '\n';</pre>
}
Example: define_halt_constraints(0)
(assert (=> halt_0_0 (not thread_0_0)))
(assert (=> halt_0_1 (not thread_0_1)))
(assert (=> halt_0_2 (not thread_0_2)))
Encoding of each particular step ends with the definition of machine states,
starting with their initialization in the first step k=0.
void init_states ()
  init_accu();
  init_mem();
  init_adr();
  init_val();
  init_full();
  init_stmt();
  init_block();
  init_halt();
  init_heap();
  init_exit();
}
We begin by setting each thread's initial accumulator register state to zero.
void init_accu ()
  for (uint t = 0; t < programs.size(); t++)</pre>
    formula << assign(accu(0, t), consth(0)) << '\n';</pre>
}
```

Example: define\_checkpoint\_constraints(0)

```
Example: init_accu()

(assert (= accu_0_0 #x0000))
(assert (= accu_0_1 #x0000))
(assert (= accu_0_2 #x0000))
```

Similar to their declaration, we skip the initialization functions of the remaining register states as they are again defined almost identical and continue with the activation of every thread's initial statement  $\mathfrak{stmt}_{t,0}^0$ .

```
void init_stmt ()
  for (uint t = 0; t < programs.size(); t++)</pre>
    for (uint pc = 0; t < programs[t].size(); t++)</pre>
      formula << assertion(pc ? lnot(stmt(0, t, pc))</pre>
                                : stmt(0, t, pc)) << '\n';
}
Example: init_stmt()
(assert stmt_0_0_0)
(assert (not stmt_0_0_1))
(assert (not stmt_0_0_2))
(assert (not stmt_0_0_3))
(assert (not stmt_0_0_4))
(assert stmt_0_1_0)
(assert (not stmt_0_1_1))
(assert (not stmt_0_1_2))
(assert (not stmt_0_1_3))
(assert (not stmt_0_1_4))
(assert stmt_0_2_0)
(assert (not stmt_0_2_1))
(assert (not stmt_0_2_2))
(assert (not stmt_0_2_3))
(assert (not stmt_0_2_4))
(assert (not stmt_0_2_5))
```

Block flags  $block_{id,t}^0$  are initially disabled by relying on the entries given in the checkpoints map.

```
void init_block ()
  for (const auto & [id, threads] : checkpoints)
    for (const auto & [t, _] : threads)
      formula << assertion(lnot(block(0, id, t))) << '\n';</pre>
}
Example: init_block()
(assert (not block_0_0_0))
(assert (not block_0_0_1))
(assert (not block_0_0_2))
Halt flags \mathtt{halt}_t^0 of each thread are initialized in a similar manner.
void init_halt ()
  for (uint t = 0; t < programs.size(); t++)</pre>
    formula << assertion(lnot(halt(0, t))) << '\n';</pre>
}
Example: init_halt()
(assert (not halt_0_0))
(assert (not halt_0_1))
(assert (not halt_0_2))
Although the machine's memory is considered to be uninitialized in general,
input data is assigned according to the entries of a given memory map.
void init_heap ()
  for (const auto & [adr, val] : mmap)
    formula << assign(select(heap(0), consth(adr)),</pre>
                        consth(val)) << '\n';</pre>
}
Example: init_heap()
(assert (= (select heap_0 #x0000) #x0000))
(assert (= (select heap_0 #x0001) #x0000))
```

Finally, the initial exit flag exit<sub>0</sub> is disabled to enforce execution.

```
formula << assertion(lnot(exit(0))) << '\n';</pre>
}
Example: init_exit()
(assert (not exit_0))
Memory access according to the previously defined load_t^k is implemented by
the following helper function.
std::string load (uint k, uint t, uint adr, bool indirect = false)
  std::string address = consth(adr);
  std::string adr_var = adr(k, t);
  std::string val_var = val(k, t);
  std::string full_var = full(k, t);
  std::string heap_var = heap(k);
  if (indirect)
    return
      ite(full_var,
           ite(equal(adr_var, address),
               ite(equal(val_var, address),
                   val_var,
                   select(heap_var, val_var)),
               ite(equal(adr_var, select(heap_var, address)),
                   select(heap_var, select(heap_var, address)))),
           select(heap_var, select(heap_var, address)));
  else
    return
      ite(land(full_var, equal(adr_var, address)),
           val_var,
           select(heap_var, address));
}
```

void init\_exit()

Common encoding functions are concluded by introducing implementations for Instruction::encode(uint, uint, State), returning the successor of a given state after executing the specific instruction expect for the statement activation variables  $\mathtt{stmt}_{t,pc}^k$ , which are handeled seperately and only jump conditions being generated this way. Implementations for FENCE, JMP and HALT instructions are omitted, as they just return an empty  $\mathtt{std}$ ::string.

```
LOAD arg
std::string Load::encode (uint k, uint t, State state)
  return load(k, t, arg, indirect);
}
STORE arg
std::string Store::encode (uint k, uint t, State state)
  switch (state)
    case State::adr: return indirect ? load(k, arg) : consth(arg);
    case State::val: return accu(k, t);
}
ADD arg
std::string Add::encode (uint k, uint t, State state)
  return bvadd(accu(k, t), load(k, arg, indirect));
}
ADDI arg
std::string Addi::encode (uint k, uint t, State state)
  return bvadd(accu(k, t), consth(arg));
}
SUB arg
std::string Sub::encode (uint k, uint t, State state)
  return bvsub(accu(k, t), load(k, arg, indirect));
SUBI arg
std::string Subi::encode (uint k, uint t, State state)
  return bvsub(accu(k, t), consth(arg));
}
```

```
MUL arg
std::string Mul::encode (uint k, uint t, State state)
  return bvmul(accu(k, t), load(k, arg, indirect));
}
MULI arg
std::string Muli::encode (uint k, uint t, State state)
  return bvmul(accu(k, t), consth(arg));
CMP arg
std::string Cmp::encode (uint k, uint t, State state)
  return bvsub(accu(k, t), load(k, arg, indirect));
}
JZ arg
std::string Jz::encode (uint k, uint t, State state)
  return equal(accu(k, t), consth(0));
}
JNZ arg
std::string Jnz::encode (uint k, uint t, State state)
  return lnot(equal(accu(k, t), consth(0)));
}
JS arg
std::string Js::encode (uint k, uint t, State state)
  return equal("#b1", extract("15", "15", accu(k, t)));
}
```

```
JNS arg
std::string Jns::encode (uint k, uint t, State state)
  return equal("#b0", extract("15", "15", accu(k, t)));
}
JNZNS arg
std::string Jnzns::encode (uint k, uint t, State state)
  return land(lnot(equal(accu(k, t), consth(0))),
              equal("#b0", extract("15", "15", accu(k, t))));
}
MEM arg
std::string Mem::encode (uint k, uint t, State state)
  return load(k, arg, indirect);
}
CAS arg
std::string Cas::encode (uint k, uint t, State state)
  auto heap_var = heap(k);
  auto address = indirect ? select(heap_var, consth(arg))
                           : consth(arg);
  auto condition = equal(mem(k, t), select(heap_var, address));
  switch (state)
    {
    case State::accu: return ite(condition, consth(1), consth(0));
    case State::heap: return ite(condition,
                                 store(heap_var,
                                       address,
                                       accu(k, t)),
                                 heap_var);
    }
}
EXIT arg
std::string Exit::encode (uint k, uint t, State state)
  return consth(arg);
}
```

# 6 Functional Next State Logic

In principle there are two ways to encode state updates: a functional, relying on ite cascades to perform *static single assignments* for determining the successor of each state explicitly and a relational, based on implying the next state for each possible transition. While the relational version is easier to encode since it is closer to our model's semantics and results in a simpler structure, the functional encoding has the benefit of being more compact and also turned out to be more efficient for SMT solving. We will now introduce our functional state update scheme, generated by using the -e smtlib command line parameter.

```
void define_states (uint k)
{
    define_accu(k);
    define_mem(k);
    define_adr(k);
    define_val(k);
    define_full(k);
    define_stmt(k);
    define_block(k);
    define_heap(k);
    define_exit(k);

if (k == bound)
    define_exit_code();
}
```

Starting with the accumulator registers  $\mathtt{accu}_t^k$ , each update expression is initialized with its predecessor  $\mathtt{accu}_t^{k-1}$ , forming the base case of preserving the previous state. Any program statement altering the accumulator then extends the expression by embedding its current value in the *else* branch of an ite, using the corresponding instruction's encode implementation to set the successor state depending on the execution variable  $\mathtt{exec}_{t,pc}^{k-1}$ .

```
void define_accu (uint k)
  for (uint t = 0; t < programs.size(); t++)</pre>
      std::string next = accu(k - 1, t);
      const auto & stmts = updates[State::accu][t];
      for (auto pc = stmts.rbegin(); pc != stmts.rend(); ++pc)
        next = ite(exec(k - 1, t, *pc),
                    program[t][*pc]->encode(k - 1, t, State::accu),
                    next);
      formula << assign(accu(k, t), next) << '\n';</pre>
}
Example: define_accu(1)
(assert (= accu_1_0
           (ite exec_0_0_0
                 (bvadd accu_0_0 #x0001)
                 (ite exec_0_0_2
                      (ite (and full_0_0 (= adr_0_0 #x0001))
                           val_0_0
                           (select heap_0 #x0001))
                      accu_0_0))))
(assert (= accu_1_1 ...; same as before but reading address 0
(assert (= accu_1_2
           (ite exec_0_2_1
                 (bvadd accu_0_2
                        (ite (and full_0_2 (= adr_0_2 #x0000))
                             val_0_2
                              (select heap_0 #x0000)))
                 (ite exec_0_2_2
                      (bvadd accu_0_2
                              (ite (and full_0_2
                                        (= adr_0_2 #x0001))
                                  val_0_2
                                   (select heap_0 #x0001)))
                      accu_0_2))))
```

We omit the implementation of state update functions for the remaining register states  $mem_t^k$ ,  $adr_t^k$  and  $val_t^k$  as they only differ in using the appropriate State entries and variable generators, but include their output to show the state updates generated for our demo example.

```
Example: define_mem(1)

(assert (= mem_1_0 mem_0_0))
(assert (= mem_1_1 mem_0_1))
(assert (= mem_1_2 mem_0_2))

Example: define_adr(1)
(assert (= adr_1_0 (ite exec_0_0_1 #x0000 adr_0_0)))
(assert (= adr_1_1 (ite exec_0_1_1 #x0001 adr_0_1)))
(assert (= adr_1_2 adr_0_2))

Example: define_val(1)
(assert (= val_1_0 (ite exec_0_0_1 accu_0_0 val_0_0)))
(assert (= val_1_1 (ite exec_0_1_1 accu_0_1 val_0_1)))
(assert (= val_1_2 val_0_2))
```

Store buffer full flags  $\mathtt{full}_t^k$  are defined by a single  $\mathtt{ite}$ , either falsifying the state in case the store buffer has been flushed, or assigning a conjunction over all execution variables  $\mathtt{exec}_{t,pc}^{k-1}$  of STORE instructions to set it if an entry was added, as well as the predecessor  $\mathtt{full}_t^{k-1}$  for preserving its state otherwise.

```
void define_full (uint k)
  for (uint t = 0; t < programs.size(); t++)</pre>
      std::vector<std::string> args {full(k - 1, t)};
      for (uint pc : updates[State::full][t])
        args.push_back(exec(k - 1, t, pc));
      formula << assign(full(k, t),</pre>
                         ite(flush(k - 1, t),
                              "false",
                              lor(args))) << '\n';
    }
}
Example: define_full(1)
(assert (= full_1_0 (ite flush_0_0
                              (or full_0_0 exec_0_0_1))))
(assert (= full_1_1 (ite flush_0_1
                              (or full_0_1 exec_0_1_1))))
(assert (= full_1_2 (ite flush_0_2 false full_0_2)))
```

In order to correctly model symbolic program counters correctly, individual state updates must ensure that at most one statement is activated in any step. We do so by initializing the update expression of each statement activation variable  $\mathsf{stmt}_{t,pc}^k$  with a conjunction containing its previous activation and the negation of the corresponding execution variable  $\mathsf{exec}_{t,pc}^{k-1}$  for explicitly deactivating the statement if it has been executed while preserving its state otherwise. Similarly to defining the successor of register states, each preceding program statement then embeds the expression's current value in the else branch of an ite, activating the statement depending on the predecessor's execution. Special care must be taken in case of the predecessor being a conditional jump, identified by  $\mathsf{is\_jump}$ , to either activate the target using a conjunction of its execution variable and condition, generated by the relevant  $\mathsf{encode}$  implementation, or the next statement by negating the condition in case of a failed jump.

```
void define_stmt (uint k)
{
 for (uint t = 0; t < programs.size(); t++)</pre>
    for (uint pc = 0; pc < programs[t].size(); pc++)</pre>
      {
        // statement reactivation
        std::string next = land(stmt(k - 1, t, pc),
                               lnot(exec(k - 1, t, pc)));
        // activation by predecessor
        const auto & stmts = predecessors[t][pc];
        for (auto pre = stmts.rbegin(); pre != stmts.rend(); ++pre)
          {
            std::string val = exec(k - 1, t, *pre);
            Instruction * op = programs[t][*pre];
            // add condition if predecessor is a jump
            if (is_jump(op))
              {
                 std::string cond = op->encode(k - 1, t, State::stmt);
                 val = land(val, op->arg == pc ? cond : lnot(cond));
            next = ite(stmt(k - 1, t, *pre), val, next);
        formula << assign(stmt(k, t, pc), next) << '\n';</pre>
      }
}
Example: define_stmt(1)
```

```
(assert (= stmt_1_0_0 (and stmt_0_0_0 (not exec_0_0_0))))
(assert (= stmt_1_0_1 (ite stmt_0_0_0
                           exec_0_0_0
                           (and stmt_0_0_1 (not exec_0_0_1)))))
(assert (= stmt_1_0_2 (ite stmt_0_0_1
                           exec_0_0_1
                           (and stmt_0_0_2 (not exec_0_0_2)))))
(assert (= stmt_1_0_3 (ite stmt_0_0_2
                           exec_0_0_2
                           (and stmt_0_0_3 (not exec_0_0_3)))))
(assert (= stmt_1_0_4 (ite stmt_0_0_3
                           exec_0_0_3
                           (and stmt_0_0_4 (not exec_0_0_4)))))
(assert (= stmt_1_1_0 (and stmt_0_1_0 (not exec_0_1_0))))
(assert (= stmt_1_1_1 (ite stmt_0_1_0
                           exec_0_1_0
                           (and stmt_0_1_1 (not exec_0_1_1)))))
(assert (= stmt_1_1_2 (ite stmt_0_1_1
                           exec_0_1_1
                           (and stmt_0_1_2 (not exec_0_1_2)))))
(assert (= stmt_1_1_3 (ite stmt_0_1_2
                           exec_0_1_2
                           (and stmt_0_1_3 (not exec_0_1_3)))))
(assert (= stmt_1_1_4 (ite stmt_0_1_3
                           exec_0_1_3
                           (and stmt_0_1_4 (not exec_0_1_4)))))
(assert (= stmt_1_2_0 (and stmt_0_2_0 (not exec_0_2_0))))
(assert (= stmt_1_2_1 (ite stmt_0_2_0
                           exec_0_2_0
                           (and stmt_0_2_1 (not exec_0_2_1)))))
(assert (= stmt_1_2_2 (ite stmt_0_2_1
                           exec_0_2_1
                           (and stmt_0_2_2 (not exec_0_2_2)))))
(assert (= stmt_1_2_3 (ite stmt_0_2_2
                           exec_0_2_2
                           (and stmt_0_2_3 (not exec_0_2_3)))))
(assert (= stmt_1_2_4 (ite stmt_0_2_3
                           (and exec_0_2_3 (not (= accu_0_2
                                                    #x0000)))
                           (and stmt_0_2_4 (not exec_0_2_4)))))
(assert (= stmt_1_2_5 (ite stmt_0_2_3
                           (and exec_0_2_3 (= accu_0_2 #x0000))
                           (and stmt_0_2_5 (not exec_0_2_5)))))
```

Block flags  $\operatorname{block}_{id,t}^k$  are defined by a single ite, unblocking the thread by resetting the state if all threads synchronized upon the corresponding checkpoint id in the previous step. Otherwise, we assign a disjunction including all execution variables  $\operatorname{exec}_{t,pc}^{k-1}$  of related CHECK id instructions enabling the state as well as the predecessor  $\operatorname{block}_{id,t}^{k-1}$  to presere its value.

```
void define_block (uint k)
  for (const auto & [id, threads] : checkpoints)
    for (const auto & [t, stmts] : threads)
      {
        std::vector<std::string> args {block(k - 1, id, t)};
        for (uint pc : stmts)
          args.push_back(exec(k - 1, t, pc));
        formula << assign(block(k, id, t),</pre>
                           ite(check(k - 1, id),
                               "false",
                               lor(args))) << '\n';
      }
}
Example: define_block(1)
(assert (= block_1_0_0 (ite check_0_0
                              (or block_0_0_0 exec_0_0_3))))
(assert (= block_1_0_1 (ite check_0_0
                             false
                              (or block_0_0_1 exec_0_1_3))))
(assert (= block_1_0_2 (ite check_0_0
                             false
                              (or block_0_0_2 exec_0_2_0))))
```

Since halt flags  $\mathtt{halt}_t^k$  cannot be reset once a thread has been stopped, they are defined by a simple disjunction including the execution variables  $\mathtt{exec}_{t,pc}^{k-1}$  of <code>HALT</code> instructions and the previous state  $\mathtt{halt}_t^{k-1}$  to preserve their value.

```
void define_halt (uint k)
{
    for (uint t = 0; t < programs.size(); t++)
        {
        std::vector<std::string> args {halt(k - 1, t)};
        for (uint pc : updates[State::halt][t])
            args.push_back(exec(k - 1, t, pc));
        formula << assign(halt(k, t), lor(args)) << '\n';
        }
}

Example: define_halt(1)

(assert (= halt_1_0 (or halt_0_0 exec_0_0_4)))
(assert (= halt_1_1 (or halt_0_1 exec_0_1_4)))
(assert (= halt_1_2 halt_0_2))</pre>
```

Definition of our shared memory array  $\mathtt{heap}^k$  again starts by initializing the update expression with its previous state  $\mathtt{heap}^{k-1}$ . Each thread then extends the expression by embedding it in an ite for every atomic write, modifying the array according to the instruction's encode implementation, followed by a final if-then-else to store the store buffer entry in case it was flushed.

```
void define_heap (uint k)
  std::string next = heap(k - 1);
  for (int t = programs.size() - 1; t >= 0; t--)
    {
      const auto & stmts = updates[State::heap][t];
      for (auto pc = stmts.rbegin(); pc != stmts.rend(); ++pc)
        next = ite(exec(k - 1, t, *pc),
                   programs[t][*pc].encode(k - 1, t, State::heap),
                   next);
      next = ite(flush(k - 1, t),
                 store(heap(k - 1),
                        adr(k - 1, t),
                        val(k - 1, t)),
                 next);
  formula << assign(heap(k, t), next) << '\n';</pre>
}
```

In order to enable the exit flag  $\mathtt{exit}^k$  according to our termination criteria, we define it as a disjunction containing the execution variables  $\mathtt{exec}_{t,pc}^{k-1}$  of every <code>EXIT</code> statement, a conjunction over the halt variables  $\mathtt{halt}_t^k$  of threads containing a call to <code>HALT</code> for stopping the machine if no more threads are running and the previous state  $\mathtt{exit}^{k-1}$  to preserve its value.

```
void define_exit(uint k)
 std::vector<std::string> halts;
 for (const auto & stmts : updates[State::halt])
    for (uint pc : stmts)
      halts.push_back(halt(k, t));
 std::vector<std::string> args {exit(k - 1), land(halts)};
 for (const auto & stmts : updates[State::exit])
    for (uint pc : stmts)
      args.push_back(exec(k - 1, t, pc));
 formula << assign(exit(k), lor(args)) << '\n';</pre>
}
Example: define_exit(1)
(assert (= exit_1
           (or exit_0
                (and halt_1_0 halt_1_1)
               exec_0_2_4
               exec_0_2_5)))
```

Our functional encoding scheme is concluded by setting the machine's exit code during the final step. Since it is unique in every execution, explicit unrolling allows us to reduce the total number of variables by introducing only a single state exit-code, defined by an expression initialized with zero and embedded in an ite for each step  $k \in [0, bound]$  and EXIT statement, assigning the corresponding exit code in case of its execution.

```
void define_exit_code ()
  std::string next = consth(0);
  for (uint k = 0; k <= bound; k++)</pre>
    for (const auto & stmts : updates[State::exit])
      for (uint pc : stmts)
        next = ite(exec(k, t, pc),
                    programs[t][pc].encode(k, t, State::exit),
                    next);
  formula << assign(exit_code, next) << '\n';</pre>
}
(assert (= exit-code
           (ite exec_17_2_5
                 #x0001
                 (ite exec_17_2_4
                      #x0000
                      (ite exec_0_2_5
                           #x0001
                            (ite exec_0_2_4
                                 #x0000
                                 #x0000))))))
```

# 7 Relational Next State Logic

Since state of the art SAT and SMT solvers are able to exploit certain structures in the problem formulation, the relative hardness of different semantically equivalent encodings may vary quite drastically. To highlight these differences, we included another variant using a relational next state logic, generated with the -e smtlib-relational command line parameter, based on implying the successors for each possible transition. Although this flat encoding scheme tends to work well with large combinatorial problems and small domains, our experiments show that it requires considerably more time to solve compared to our functional approach.

```
void define_states (uint k)
{
  for (uint t = 0; t < programs.size(); t++)
      {
      imply_thread_executed(k, t);
      imply_thread_not_executed(k, t);
      imply_thread_flushed(k, t);
    }
  imply_machine_exited(k);
}</pre>
```

We begin by defining variadic template functions for simplifying the generation of frequently reoccurring expressions, used to restore a state's previous value and additionally reset flags depending on a given condition.

Implying a common set of states that might be influenced by executing a certain thread results in most of them remaining unchanged. In order to reduce the effort involved in implementing the encoding function of each instruction, we introduce the following map, using the previously defined template functions to initialize it with the corresponding state-preserving expressions, thus requiring only the entries of states which are actually altered to be replaced and provide an implicit std::string conversion operator returning a conjunction to define all successors at once.

```
struct Next : std::map<State, std::string>
  Next (k, t)
    {
      (*this) [State::accu] = restore(&accu, k, t);
      (*this)[State::mem] = restore(&mem, k, t);
      (*this) [State::adr] = restore(&adr, k, t);
      (*this) [State::val] = restore(&val, k, t);
      (*this) [State::full] = restore(&full, k, t);
      (*this)[State::halt] = restore(&halt, k, t);
      (*this) [State::heap] = restore(&heap, k);
      (*this)[State::exit] = lnot(exit(k, t));
      std::vector<std::string> block_vars;
      for (const auto & [id, threads] : checkpoints)
        if (threads.find(t) != threads.end())
          block_vars.push_back(reset(&block,
                                       &check,
                                       k, id, t));
      if (!block_vars.empty())
        (*this) [State::block] = land(block_vars);
    }
  operator std::string () const
      std::vector<std::string> args;
      for (const auto & [_, expr] : *this)
        if (!expr.empty())
          args.push_back(expr);
      return land(args);
};
```

Since the statement activation variables  $\mathtt{stmt}_{t,pc}^k$  of each thread must be fully defined for every possible transition, we include helper functions to generate the required conjunctions.

In case of jump instructions, another overload either activates the target or successor statement, depending on the corresponding condition.

Generating a common set of successor states for every possible instruction is again based on dynamically dispatched virtual member functions of concrete Instruction class instantiations. Therefore, we introduce the following overload including the current program counter for activating the next statement.

```
virtual std::string Instruction::encode (uint t, uint k, uint pc) = 0;
```

By using the previously defined helper constructs, the actual implementations can be simplified quite drastically and even though they therefore almost look the same, we include the full definitions for completeness.

```
LOAD arg
```

```
std::string Load::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
STORE arg
std::string Store::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::adr] = equal(adr(k, t), encode(k - 1, t, State::adr));
  next[State::val] = equal(val(k, t), encode(k - 1, t, State::val));
  next[State::full] = full(k, t);
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
FENCE
std::string Fence::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
ADD arg
std::string Add::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
```

```
std::string Addi::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
SUB arg
std::string Sub::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
SUBI arg
std::string Subi::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
MUL arg
std::string Mul::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
```

ADDI arg

```
MULI arg
std::string Muli::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
CMP arg
std::string Cmp::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
JMP arg
std::string Jmp::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, arg);
  return next;
}
JZ arg
std::string Jz::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, pc, this);
  return next;
}
```

```
JNZ arg
std::string Jnz::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, pc, this);
  return next;
}
JS arg
std::string Js::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, pc, this);
  return next;
}
JNS arg
std::string Jns::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, pc, this);
  return next;
}
JNZNS arg
std::string Jnzns::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, pc, this);
  return next;
}
```

```
MEM arg
std::string Mem::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::mem] = equal(mem(k, t), encode(k - 1, t, State::mem));
  next[State::stmt] = activate(k, t, pc + 1);
  return next;
}
CAS arg
std::string Cas::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::accu] = equal(accu(k, t),
                             encode(k - 1, t, State::accu));
  next[State::stmt] = activate(k, t, pc + 1);
  next[State::heap] = equal(heap(k),
                             encode(k - 1, t, State::heap));
  return next;
}
CHECK arg
std::string Check::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, pc + 1);
  std::vector<std::string> blocks;
  for (const auto & [id, threads] : checkpoints)
    if (threads.find(t) != threads.end())
      blocks.push_back(id == op.arg
                          ? block(k, id, t)
                          : reset(&block, &check, k, id, t));
  next[State::block] = land(blocks);
```

Since a thread might wait for different checkpoints, the encode function has to enable the corresponding blocking flag  $block_{id,t}^k$  while preserving or resetting any other.

return state;

}

#### HALT

```
std::string Halt::encode (uint t, uint k, uint pc)
  Next next(k, t);
  next[State::stmt] = activate(k, t, -1);
  if (programs.size() > 1)
      std::vector<std::string> args;
      for (uint thread = 0; thread < programs.size(); thread++)</pre>
        if (thread != t)
          args.push_back(halt(k, thread));
      next[State::halt] =
        land(halt(k, t),
              ite(land(args),
                  land(exit(k), equal(exit_code, consth(0))),
                  lnot(exit(k)));
    }
  else
    next[State::halt] = land(halt(k, t),
                              exit(k),
                              equal(exit_code, consth(0)));
  state.erase(state.find(State::exit));
  return state;
}
```

Stopping the machine if all threads halted is achieved by an ite, enabling the exit flag  $exit^k$  and assigning the default exit code if all halt flags  $halt^k_t$  are set. Otherwise, the exit flag is disabled to continue execution. Since the exit flag is now already contained in the State::halt entry, we have to erase State::exit to prevent conflicts due to its falsification.

#### **EXIT** arg

With all utility constructs and instruction related encoding functions defined, we are now able to imply the next states for all transitions, starting with the execution of any given statement.

```
void imply_thread_executed (uint k, uint t)
  for (uint pc = 0; pc < programs[t].size(); pc++)</pre>
    formula << assertion(imply(exec(k - 1, t, pc),</pre>
                                programs[t][pc]->encode(k, t, pc)))
            << '\n';
}
Example: imply_thread_executed(1, 0)
(assert (=> exec_0_0_0 ; ADDI 1
            (and (= accu_1_0 (bvadd accu_0_0 #x0001))
                  (= mem_1_0 mem_0_0)
                  (= adr_1_0 adr_0_0)
                  (= val_1_0 val_0_0)
                  (= full_1_0 full_0_0)
                  (and (not stmt_1_0_0)
                            stmt_1_0_1
                       (not stmt_1_0_2)
                       (not stmt_1_0_3)
                       (not stmt_1_0_4))
                  (= block_1_0_0 (ite check_0_0 false block_0_0_0))
                  (= halt_1_0 halt_0_0)
                  (= heap_1 heap_0)
                  (not exit_1))))
(assert (=> exec_0_0_1 ; STORE 0
            (and (= accu_1_0 accu_0_0)
                  (= mem_1_0 mem_0_0)
                  (= adr_1_0 #x0000)
                  (= val_1_0 accu_0_0)
                  full_1_0
                  (and (not stmt_1_0_0)
                       (not stmt_1_0_1)
                       stmt_1_0_2
                       (not stmt_1_0_3)
                       (not stmt_1_0_4))
                  (= block_1_0_0 (ite check_0_0 false block_0_0_0))
                  (= halt_1_0 halt_0_0)
                  (= heap_1 heap_0)
                  (not exit_1)))) ...
```

In case a thread was not executed in the previous step, all local states must be preserved except the store buffer full flag  $\mathtt{full}_t^k$  and block flags  $\mathtt{block}_{id,t}^k$ , as they might be reset by a store buffer flush, or due to being released from a checkpoint if another thread executed the corresponds CHECK instruction.

```
void imply_thread_not_executed (uint k, uint t)
  Next next(k, t);
  next[State::full] = reset(&full, &flush, k, t);
  std::vector<std::string> stmts;
  for (uint pc = 0; pc < programs.size(); pc++)</pre>
    stmts.push_back(restore(&stmt, k, t, pc));
  next[State::stmt] = land(stmts);
  next.erase(next.find(State::heap));
  next.erase(next.find(State::exit));
  formula << assertion(imply(lnot(thread(k - 1, t)), next)) << '\n';</pre>
}
Example: imply_thread_not_executed(1, 0)
(assert (=> (not thread_0_0))
            (and (= accu_1_0 accu_0_0)
                  (= mem_1_0 mem_0_0)
                  (= adr_1_0 adr_0_0)
                  (= val_1_0 val_0_0)
                  (= full_1_0 (ite flush_0_0 false full_0_0))
                  (and (= stmt_1_0_0 stmt_0_0_0)
                       (= stmt_1_0_1 stmt_0_0_1)
                       (= stmt_1_0_2 stmt_0_0_2)
                       (= stmt_1_0_3 stmt_0_0_3)
                       (= stmt_1_0_4 stmt_0_0_4))
                  (= block_1_0_0 (ite check_0_0 false block_0_0_0))
                  (= halt_1_0 halt_0_0))))
```

With all local thread states already being set, a store buffer flush just needs to write the heap array  $heap^k$  and disable the exit flag  $exit^k$  to continue execution.

Finally, to keep the model consistent if the machine terminated in a step k < bound, the heap array  $heap^{k-1}$  must be preserved. On the other hand, if the machine is still running in the final step, we need to set the default exit code as it would otherwise remain undefined.

## 8 BTOR2

We also included the possibility to generate the word level model checking format BTOR2 [7], using the -e btor2 command line parameter. It provides a sequential extension for specifying states in combination with their transition functions, which are automatically unrolled via symbolic substitution by the accompanying bounded model checker BtorMC. In contrast to the linear growth of our SMT-LIB encodings, the generated formulas size therefore remains constant for any given bound. We omit further details about the encoding process, as it is more or less identical to the previous using our functional next state logic, but a bit more tedious due to the line based syntax. Instead, we review the encoding of our store buffer litmus test example to highlight the pros and cons of BTOR2.

BTOR2 formulas consist of nodes, one per line, each prefixed with a strictly increasing numeric identifier serving as the input to subsequent nodes. By requiring every node to be defined before being used, the resulting direct acyclic graph is easy to parse and since every part of the expression may be referenced at a later point, no additional auxiliary variables are needed. See [7] for a detailed format description and the list of available nodes.

The encoding of our store buffer litmus test example given in Table 6 starts by defining the required boolean, bit-vector and array sorts.

```
1 sort bitvec 1 \mathcal{B}^1 2 sort bitvec 16 \mathcal{B}^{16} 3 sort array 2 2 \mathcal{A}^{16}
```

Next, a special set of input nodes is used to define the required boolean

```
4 zero 1
5 one 1
```

and bit-vector constants.

```
6 zero 2
7 one 2
8 constd 2 5
9 constd 2 17
```

In order to initialize our heap array with the contents of this example's memory map given in Listing 8, an uninitialized array state is declared and the corresponding elements written accordingly. By omitting its transition function, it is treated as a primary input only used during initialization and can safely be ignored in later steps.

```
10 state 3 mmap  
11 write 3 10 6 6 mmap[0] \leftarrow 0  
12 write 3 11 7 6 mmap[1] \leftarrow 0
```

Since our actual machine states are used in the definition of constraints and transition functions, they must be declared in advance.

• Accumulator registers:

```
13 state 2 accu_0
14 state 2 accu_1
15 state 2 accu_2
```

• CAS memory registers:

```
16 state 2 mem<sub>0</sub>
17 state 2 mem<sub>1</sub>
18 state 2 mem<sub>2</sub>
```

• Store buffer address registers:

```
19 state 2 adr_0
20 state 2 adr_1
21 state 2 adr_2
```

• Store buffer value registers:

```
22 state 2 val_0
23 state 2 val_1
24 state 2 val_2
```

• Store buffer full flags:

```
25 state 1 full_0
26 state 1 full_1
27 state 1 full_2
```

• Statement activation flags:

```
28 state 1 stmt_{0,0}
29 state 1 stmt_{0,1}
30 state 1 stmt_{0,2}
31 state 1 stmt_{0,3}
32 state 1 stmt_{0,4}
```

```
33 state 1 stmt<sub>1,0</sub>
34 state 1 stmt<sub>1,1</sub>
35 state 1 stmt<sub>1,2</sub>
36 state 1 stmt<sub>1,3</sub>
37 state 1 stmt<sub>1,4</sub>

38 state 1 stmt<sub>2,0</sub>
39 state 1 stmt<sub>2,1</sub>
40 state 1 stmt<sub>2,2</sub>
41 state 1 stmt<sub>2,3</sub>
42 state 1 stmt<sub>2,4</sub>
43 state 1 stmt<sub>2,5</sub>
```

• Block flags:

• Halt flags:

```
47 state 1 halt_0
48 state 1 halt_1
49 state 1 halt_2
```

• Shared memory:

50 state 3 heap

• Exit flag:

```
51 state 1 exit
```

• Exit code:

52 state 2 exit-code

Same applies to the free transition variables  $\mathsf{thread}_t$  and  $\mathsf{flush}_t$  serving as the model's input.

```
53 input 1 thread<sub>0</sub>
54 input 1 thread<sub>1</sub>
55 input 1 thread<sub>2</sub>

56 input 1 flush<sub>0</sub>
57 input 1 flush<sub>1</sub>
58 input 1 flush<sub>2</sub>
```

In contrast to the SMT-LIB encoding, no additional declarations are needed for the introduction of helper variables and our execution variables  $\mathsf{exec}_{t,pc}$  therefore defined as simple and nodes.

```
59 and 1 28 53
                                                          \operatorname{stmt}_{0,0} \wedge \operatorname{thread}_0 \equiv \operatorname{exec}_{0,0}
60 and 1 29 53
                                                          \operatorname{stmt}_{0,1} \wedge \operatorname{thread}_0 \equiv \operatorname{exec}_{0,1}
61 and 1 30 53
                                                          \operatorname{stmt}_{0,2} \wedge \operatorname{thread}_0 \equiv \operatorname{exec}_{0,2}
62 and 1 31 53
                                                          \operatorname{stmt}_{0.3} \wedge \operatorname{thread}_0 \equiv \operatorname{exec}_{0.3}
63 and 1 32 53
                                                          \mathtt{stmt}_{0,4} \wedge \mathtt{thread}_0 \equiv \mathtt{exec}_{0,4}
64 and 1 33 54
                                                          \operatorname{stmt}_{1,0} \wedge \operatorname{thread}_1 \equiv \operatorname{exec}_{1,0}
65 and 1 34 54
                                                          \operatorname{stmt}_{1,1} \wedge \operatorname{thread}_1 \equiv \operatorname{exec}_{1,1}
66 and 1 35 54
                                                          \mathtt{stmt}_{1,2} \wedge \mathtt{thread}_1 \equiv \mathtt{exec}_{1,2}
67 and 1 36 54
                                                          \operatorname{stmt}_{1,3} \wedge \operatorname{thread}_1 \equiv \operatorname{exec}_{1,3}
68 and 1 37 54
                                                          \mathtt{stmt}_{1,4} \wedge \mathtt{thread}_1 \equiv \mathtt{exec}_{1,4}
69 and 1 38 55
                                                          \operatorname{stmt}_{2,0} \wedge \operatorname{thread}_2 \equiv \operatorname{exec}_{2,0}
70 and 1 39 55
                                                          \operatorname{stmt}_{2,1} \wedge \operatorname{thread}_2 \equiv \operatorname{exec}_{2,1}
71 and 1 40 55
                                                          \operatorname{stmt}_{2,2} \wedge \operatorname{thread}_2 \equiv \operatorname{exec}_{2,2}
72 and 1 41 55
                                                          \operatorname{stmt}_{2,3} \wedge \operatorname{thread}_2 \equiv \operatorname{exec}_{2,3}
73 and 1 42 55
                                                          \operatorname{stmt}_{2,4} \wedge \operatorname{thread}_2 \equiv \operatorname{exec}_{2,4}
74 and 1 43 55
                                                          \mathtt{stmt}_{2.5} \land \mathtt{thread}_2 \equiv \mathtt{exec}_{2.5}
```

The fixed arity of BTOR2 operators imposes a minor inconvenience if the same connective has to be applied over a larger number of variables. For example, a cascade of and nodes is required to build the conjunction over all block flags  $block_{0,t}$  representing the checkpoint variable  $check_0$ .

```
75 and 1 44 45 block_{0,0} \wedge block_{0,1}
76 and 1 46 75 block_{0,2} \equiv check_0
```

Similarly, the disjunction serving as the *at-least-one* predicate of our *exactly-one* constraint is defined by a cascade of or nodes

and added as an invariant through the keyword constraint.

```
83 constraint 82
```

Definition of the remaining at-most-one predicate starts with the declaration of auxiliary input variables required by Carsten Sinz's sequential counter constraint  $LT_{SEQ}^{7,1}$  [8],

```
84 input 1 thread_0^{aux}
85 input 1 thread_0^{aux}
86 input 1 thread_0^{aux}
87 input 1 flush_0^{aux}
88 input 1 flush_1^{aux}
89 input 1 flush_0^{aux}
```

followed by the corresponding set of clauses.

```
90 or 1 -53 84
                                                  \neg \mathtt{thread}_0 \lor \mathtt{thread}_0^{\mathtt{aux}}
                                                  \neg \texttt{thread}_1 \lor \texttt{thread}_1^{\texttt{aux}}
91 or 1 -54 85
                                                  \neg \mathtt{thread}_0^{\mathtt{aux}} \lor \mathtt{thread}_1^{\mathtt{aux}}
92 or 1 -84 85
                                            \neg \mathsf{thread}_0 \lor \mathsf{thread}_1
\neg \mathsf{thread}_1 \lor \neg \mathsf{thread}_0^{\mathsf{aux}}
\neg \mathsf{thread}_2 \lor \mathsf{thread}_2^{\mathsf{aux}}
93 or 1 -54 -84
94 or 1 -55 86
                                                \neg \mathtt{thread}_1^{\mathtt{aux}} \lor \mathtt{thread}_2^{\mathtt{aux}}
95 or 1 -85 86
                                                  \neg \mathtt{thread}_2 \vee \neg \mathtt{thread}_1^{\mathtt{aux}}
96 or 1 -55 -85
                                                  \neg \mathtt{flush}_0 \lor \mathtt{flush}_0^{\mathtt{aux}}
97 or 1 -56 87
                                                  \neg \mathtt{thread}_2^{\mathtt{aux}} \lor \mathtt{flush}_0^{\mathtt{aux}}
98 or 1 -86 87
                                                  \neg \mathtt{flush}_0 \lor \mathtt{thread}_2^{\mathtt{aux}}
99 or 1 -56 -86
100 or 1 -57 88
                                                  \neg \mathtt{flush}_1 \lor \mathtt{flush}_0^{\mathtt{aux}}
                                                  \neg \mathtt{flush}_0^{\mathtt{aux}} \lor \mathtt{flush}_1^{\mathtt{aux}}
101 or 1 -87 88
                                                  \neg flush_1 \lor \neg flush_0^{aux}
102 or 1 -57 -87
                                                  \neg flush_2 \lor flush_1^{aux}
103 or 1 -58 89
                                                  \neg \mathtt{flush}^{\mathtt{aux}}_1 \lor \mathtt{flush}^{\mathtt{aux}}_2
104 or 1 -88 89
                                                  \neg flush_2 \lor \neg flush_1^{aux}
105 or 1 -58 -88
106 or 1 -51 -89
                                                   \neg \text{exit} \lor \neg \text{flush}_2^{\text{aux}}
```

The *exactly-one* constraint can now be completed by building a conjunction over aforementioned clauses

```
107 and 1 90 91
                                                           (\neg \mathtt{thread}_0 \lor \mathtt{thread}_0^{\mathtt{aux}}) \land (\neg \mathtt{thread}_1 \lor \mathtt{thread}_1^{\mathtt{aux}})
                                                          {}^{\downarrow} \wedge (\lnot \mathtt{thread}_0^{\mathtt{aux}} \lor \mathtt{thread}_1^{\mathtt{aux}})
108 and 1 92 107
109 and 1 93 108
                                                          \downarrow \land (\neg \mathsf{thread}_1 \lor \neg \mathsf{thread}_0^{\mathsf{aux}})
110 and 1 94 109
                                                          \downarrow \land (\neg \mathsf{thread}_2 \lor \mathsf{thread}_2^{\mathsf{aux}})
                                                          {} \mathrel{\,\,\,}{} \mathrel{\,\,}{} \mathrel{\,\,}{} \wedge (\lnot {\sf thread}_1^{{\sf aux}} \lor {\sf thread}_2^{{\sf aux}})
111 and 1 95 110
                                                          {} \mathrel{\,\,\,}{} \mathrel{\,\,}{} \mathrel{\,\,}{} \wedge (\lnot {\sf thread}_2 \lor \lnot {\sf thread}_1^{{\sf aux}})
112 and 1 96 111
                                                          \rightarrow \land (\neg \mathtt{flush}_0 \lor \mathtt{flush}_0^{\mathtt{aux}})
113 and 1 97 112
113 and 1 97 112
114 and 1 98 113
115 and 1 99 114
                                                         {}_{\vdash} \wedge (\lnot \mathtt{thread}_2^{\mathtt{aux}} \lor \mathtt{flush}_0^{\mathtt{aux}})
                                                          \downarrow \land (\neg flush_0 \lor thread_2^{aux})
                                                          \downarrow \land (\neg flush_1 \lor flush_0^{aux})
116 and 1 100 115
```

and adding yet another invariant.

#### 123 constraint 122

This cardinality constraint is then again further influenced by prohibiting certain transitions like flushing an empty store buffer or executing barrier instructions if it is full,

```
124 or 1 29 32 \operatorname{stmt}_{0,1} \vee \operatorname{stmt}_{0,4}
125 implies 1 124 -53 \downarrow \implies \neg \operatorname{thread}_0
126 ite 1 25 125 -56 \operatorname{ite}(\operatorname{full}_0, \, \downarrow, \neg \operatorname{flush}_0)
127 constraint 126

128 or 1 34 37 \operatorname{stmt}_{1,1} \vee \operatorname{stmt}_{1,4}
129 implies 1 128 -54 \downarrow \implies \neg \operatorname{thread}_1
130 ite 1 26 129 -57 \operatorname{ite}(\operatorname{full}_1, \, \downarrow, \neg \operatorname{flush}_1)
131 constraint 130

132 implies 1 58 27 \operatorname{flush}_2 \implies \operatorname{full}_2
133 constraint 132
```

executing a thread while it is waiting for a checkpoint,

```
134 and 1 44 -76 block_{0,0} \land \neg \text{check}_0
135 implies 1 134 -53 \lor \implies \neg \text{thread}_0
136 constraint 135

137 and 1 45 -76 block_{0,1} \land \neg \text{check}_0
138 implies 1 137 -54 \lor \implies \neg \text{thread}_1
139 constraint 138

140 and 1 46 -76 block_{0,2} \land \neg \text{check}_0
141 implies 1 140 -55 \lor \implies \neg \text{thread}_2
142 constraint 141
```

and stopping halted threads from being scheduled.

```
143 implies 1 47 -53 halt_0 \Longrightarrow \neg thread_0
144 constraint 143

145 implies 1 48 -54 halt_1 \Longrightarrow \neg thread_1
146 constraint 145

147 implies 1 49 -55 halt_2 \Longrightarrow \neg thread_2
148 constraint 147
```

As mentioned earlier, the main advantage of the BTOR2 format lies in the ability to define states in combination with their transition function. In our case, definition of machine states therefore follows a pretty basic pattern: after initializing the state with the keyword <code>init</code>, its successor is defined through the appropriate set of nodes and the resulting expression registered as the given state's transition function by using the corresponding keyword <code>next</code>.

#### • Accumulator registers:

```
accu_0^0 \leftarrow 0
149 init 2 13 6
150 add 2 13 7
                                accu_0 + 1
151 ite 2 59 150 13
                                ite(exec_{0,0}, \, \downarrow, accu_0) \equiv ADDI_0
152 read 2 50 7
                                read(1)
153 eq 1 19 7
                                adr_0 = 1
154 and 1 25 153
                                \vdash \land \mathtt{full}_0
155 ite 2 154 22 152 ite(, val_0, read(1))
156 ite 2 61 155 151 ite(exec_{0.2}, \, \, \downarrow \, , \, ADDI_0)
157 next 2 13 156
                                accu'_0 \leftarrow 4
158 init 2 14 6
                                accu_1^0 \leftarrow 0
159 add 2 14 7
                                accu_1 + 1
                                \mathtt{ite}(\mathtt{exec}_{1.0}, \, {} \, {}, \mathtt{accu}_1) \equiv \mathtt{ADDI}_1
160 ite 2 64 159 14
161 read 2 50 6
                                read(0)
162 eq 1 20 6
                                adr_1 = 0
163 and 1 26 162
                                \downarrow \land full_1
164 ite 2 163 23 161 ite(\dark, val<sub>1</sub>, read(0))
165 ite 2 66 164 160 ite(exec_{1,2}, \, \downarrow, \, ADDI_1)
166 next 2 14 165
                                \mathtt{accu}_1' \leftarrow \mathord{\downarrow}
                                accu_2^0 \leftarrow 0
167 init 2 15 6
168 eq 1 21 6
                                adr_2 = 0
169 and 1 27 168
                                \downarrow \land full_2
170 ite 2 169 24 161 ite(, val_2, read(0))
```

```
171 add 2 15 170 \qquad \qquad \downarrow + accu_2

172 ite 2 70 171 15 ite(exec_{2,1}, \, \downarrow \, , accu_2) \equiv \text{ADD}

173 eq 1 21 7 adr_2 = 1

174 and 1 27 173 \qquad \qquad \downarrow \land full_2

175 ite 2 174 24 152 ite(\, \downarrow \, , val_2, read(1))

176 add 2 15 175 \qquad \downarrow + accu_2

177 ite 2 71 176 172 ite(exec_{2,2}, \, \downarrow \, , \text{ADD})

178 next 2 15 177 accu_2' \leftarrow \downarrow
```

## • CAS memory registers:

```
179 init 2 16 6 \operatorname{mem}_0^0 \leftarrow 0
180 next 2 16 16 \operatorname{mem}_0^0 \leftarrow \operatorname{mem}_0

181 init 2 17 6 \operatorname{mem}_1^0 \leftarrow 0
182 next 2 17 17 \operatorname{mem}_1^0 \leftarrow \operatorname{mem}_1

183 init 2 18 6 \operatorname{mem}_2^0 \leftarrow 0
184 next 2 18 18 \operatorname{mem}_2^0 \leftarrow \operatorname{mem}_2
```

#### • Store buffer address registers:

185 init 2 19 6 
$$\operatorname{adr}_0^0 \leftarrow 0$$
  
186 ite 2 60 6 19  $\operatorname{ite}(\operatorname{exec}_{0,1}, 0, \operatorname{adr}_0)$   
187 next 2 19 186  $\operatorname{adr}_0' \leftarrow \downarrow$   
188 init 2 20 6  $\operatorname{adr}_1^0 \leftarrow 0$   
189 ite 2 65 7 20  $\operatorname{ite}(\operatorname{exec}_{1,1}, 0, \operatorname{adr}_1)$   
190 next 2 20 189  $\operatorname{adr}_1' \leftarrow \downarrow$   
191 init 2 21 6  $\operatorname{adr}_2' \leftarrow 0$   
192 next 2 21 21  $\operatorname{adr}_2' \leftarrow \operatorname{adr}_2$ 

#### • Store buffer value registers:

193 init 2 22 6 
$$\operatorname{val}_0^0 \leftarrow 0$$
194 ite 2 60 13 22  $\operatorname{ite}(\operatorname{exec}_{0,1}, \operatorname{accu}_0, \operatorname{val}_0)$ 
195 next 2 22 194  $\operatorname{val}_0' \leftarrow \downarrow$ 

196 init 2 23 6  $\operatorname{val}_1^0 \leftarrow 0$ 
197 ite 2 65 14 23  $\operatorname{ite}(\operatorname{exec}_{1,1}, \operatorname{accu}_1, \operatorname{val}_1)$ 
198 next 2 23 197  $\operatorname{val}_1' \leftarrow \downarrow$ 

199 init 2 24 6  $\operatorname{val}_1^0 \leftarrow 0$ 
200 next 2 24 24  $\operatorname{val}_2' \leftarrow \operatorname{val}_2$ 

#### • Store buffer full flags:

```
full_0^0 \leftarrow 0
201 init 1 25 4
202 or 1 60 25
                                 exec_{0.1} \lor full_0
203 ite 1 56 4 202
                                 ite(flush_0, 0, \downarrow)
204 next 1 25 203
                                 full'_0 \leftarrow \emptyset
                                 full_1^0 \leftarrow 0
205 init 1 26 4
206 or 1 65 26
                                 exec_{1,1} \lor full_1
207 ite 1 57 4 206
                                 ite(flush_1, 0, \downarrow)
208 next 1 26 207
                                 full'_1 \leftarrow \emptyset
                                 full_2^0 \leftarrow 0
209 init 1 27 4
210 ite 1 58 4 27
                                 ite(flush_2, 0, full_2)
                                 \mathtt{full}_2' \leftarrow \mathord{\downarrow}
211 next 1 27 210
```

#### • Statement activation flags:

```
\mathtt{stmt}_{0,0}^0 \leftarrow 1
212 init 1 28 5
213 and 1 28 -59
                                            \mathsf{stmt}_{0,0} \land \neg \mathsf{exec}_{0,0}
214 next 1 28 213
                                            \operatorname{stmt}'_{0,0} \leftarrow \emptyset
                                           \mathtt{stmt}_{0.1}^0 \leftarrow 0
215 init 1 29 4
216 and 1 29 -60
                                            \mathsf{stmt}_{0,1} \land \neg \mathsf{exec}_{0,1}
217 ite 1 28 59 216
                                            ite(stmt_{0,0}, exec_{0,0}, \triangleleft)
218 next 1 29 217
                                            \mathtt{stmt}'_{0.1} \leftarrow \forall
                                           \mathtt{stmt}_{0.2}^0 \leftarrow 0
219 init 1 30 4
220 and 1 30 -61
                                            \operatorname{stmt}_{0,2} \wedge \neg \operatorname{exec}_{0,2}
                                            ite(stmt_{0,1}, exec_{0,1}, \triangleleft)
221 ite 1 29 60 220
222 next 1 30 221
                                            \operatorname{stmt}'_{0,2} \leftarrow \emptyset
                                            \mathtt{stmt}_{0.3}^0 \leftarrow 0
223 init 1 31 4
224 and 1 31 -62
                                            \mathsf{stmt}_{0.3} \land \neg \mathsf{exec}_{0.3}
225 ite 1 30 61 224
                                            ite(stmt_{0,2}, exec_{0,2}, \checkmark)
226 next 1 31 225
                                            \operatorname{stmt}'_{0.3} \leftarrow A
                                           \mathtt{stmt}_{0.4}^0 \leftarrow 0
227 init 1 32 4
228 and 1 32 -63
                                            \operatorname{stmt}_{0,4} \wedge \neg \operatorname{exec}_{0,4}
                                            ite(stmt_{0,3}, exec_{0,3}, \triangleleft)
229 ite 1 31 62
230 next 1 32 229
                                            \mathtt{stmt}'_{0.4} \leftarrow 4
                                           \mathtt{stmt}_{1,0}^0 \leftarrow 1
231 init 1 33 5
```

```
232 and 1 33 -64
                                              \operatorname{stmt}_{1,0} \wedge \neg \operatorname{exec}_{1,0}
233 next 1 33 232
                                              \mathtt{stmt}_{1.0}' \leftarrow \forall
                                              \mathtt{stmt}_{1,1}^0 \leftarrow 0
234 init 1 34 4
235 and 1 34 -65
                                              \operatorname{stmt}_{1,1} \wedge \neg \operatorname{exec}_{1,1}
236 ite 1 33 64 235
                                              ite(stmt_{1,0}, exec_{1,0}, \triangleleft)
                                              \mathtt{stmt}'_{1.1} \leftarrow \forall
237 next 1 34 236
                                             \mathtt{stmt}_{1,2}^0 \leftarrow 0
238 init 1 35 4
239 and 1 35 -66
                                              \mathsf{stmt}_{1,2} \land \neg \mathsf{exec}_{1,2}
240 ite 1 34 65 239
                                              ite(stmt_{1,1}, exec_{1,1}, \triangleleft)
241 next 1 35 240
                                              \operatorname{stmt}_{1,2}' \leftarrow \emptyset
                                              \mathtt{stmt}_{1.3}^0 \leftarrow 0
242 init 1 36 4
243 and 1 36 -67
                                              \mathsf{stmt}_{1,3} \land \neg \mathsf{exec}_{1,3}
244 ite 1 35 66 243
                                              ite(stmt_{1,2}, exec_{1,2}, \triangleleft)
                                              \mathtt{stmt}_{1.3}' \leftarrow \forall
245 next 1 36 244
                                             \mathtt{stmt}_{1.4}^0 \leftarrow 0
246 init 1 37 4
                                              \operatorname{stmt}_{1.4} \wedge \neg \operatorname{exec}_{0.4}
247 and 1 37 -68
                                              \mathtt{ite}(\mathtt{stmt}_{1,3},\mathtt{exec}_{1,3},\, \mathord{\dashv}\,)
248 ite 1 36 67 247
249 next 1 37 248
                                              \operatorname{stmt}_{1.4}^{\prime} \leftarrow \exists
                                              \operatorname{stmt}_{2,0}^0 \leftarrow 1
250 init 1 38 5
251 and 1 38 -69
                                              \operatorname{stmt}_{2,0} \wedge \neg \operatorname{exec}_{2,0}
                                              \operatorname{stmt}_{2,0}' \leftarrow \emptyset
252 next 1 38 251
                                              \mathtt{stmt}_{2,1}^0 \leftarrow 0
253 init 1 39 4
254 and 1 39 -70
                                              \operatorname{stmt}_{2,1} \wedge \neg \operatorname{exec}_{2,1}
255 ite 1 38 69 254
                                              ite(stmt_{2,0}, exec_{2,0}, \triangleleft)
                                              \mathtt{stmt}_{2,1}' \leftarrow \mathbf{4}
256 next 1 39 255
                                              \mathtt{stmt}_{2.2}^0 \leftarrow 0
257 init 1 40 4
258 and 1 40 -71
                                              \operatorname{stmt}_{2,2} \wedge \neg \operatorname{exec}_{2,2}
259 ite 1 39 70 258
                                              ite(stmt_{2,1}, exec_{2,1}, \triangleleft)
260 next 1 40 259
                                              \operatorname{stmt}_{2.2}' \leftarrow \exists
                                              \mathtt{stmt}_{2.3}^0 \leftarrow 0
261 init 1 41 4
262 and 1 41 -72
                                              \operatorname{stmt}_{2,3} \wedge \neg \operatorname{exec}_{2,3}
263 ite 1 40 71 262
                                              ite(stmt_{2,2}, exec_{2,2}, \triangleleft)
264 next 1 41 263
                                              \operatorname{stmt}_{2.3}' \leftarrow A
```

```
\mathtt{stmt}^0_{2,4} \leftarrow 0
265 init 1 42 4
266 and 1\ 42\ -73
                                            \mathtt{stmt}_{2,4} \wedge \neg \mathtt{exec}_{2,4}
                                             accu_2 = 0
267 eq 1 15 6
268 and 1 72 -267
                                             exec_{2,3} \land accu_2 \neq 0
269 ite 1 41 268 266 \mathtt{ite}(\mathtt{stmt}_{2,3}, \, {} \, {} \, {}_{\!4}\,, \mathtt{stmt}_{2,4} \wedge \neg \mathtt{exec}_{2,4})
                                            \mathtt{stmt}_{2,4}' \leftarrow \forall
270 next 1 42 269
                                            \mathtt{stmt}^0_{2.5} \leftarrow 0
271 init 1 43 4
272 and 143 - 74
                                             \mathtt{stmt}_{2,5} \wedge \neg \mathtt{exec}_{2,5}
273 and 1 72 267
                                             \mathtt{exec}_{2,3} \wedge \mathtt{accu}_2 = 0
274 ite 1 41 273 272 \mathtt{ite}(\mathtt{stmt}_{2,3}, \, {} \, {} \, {}, \mathtt{stmt}_{2,5} \wedge \neg \mathtt{exec}_{2,5})
275 next 1 43 274
                                            \mathtt{stmt}_{2.5}' \leftarrow \forall
```

## • Block flags:

| 276 i | nit 1 44 4    | $block_{0,0}^0 \leftarrow 0$                    |
|-------|---------------|-------------------------------------------------|
| 277 o | r 1 62 44     | $exec_{0,3} \lor block_{0,0}$                   |
| 278 i | te 1 76 4 277 | $ite(check_0,0,4)$                              |
| 279 n | ext 1 44 278  | $\mathtt{block}_{0,0}' \leftarrow \forall$      |
|       |               | _                                               |
| 280 i | nit 1 45 4    | $block_{0,1}^0 \leftarrow 0$                    |
| 281 o | r 1 67 45     | $\mathtt{exec}_{1,3} \lor \mathtt{block}_{0,1}$ |
| 282 i | te 1 76 4 281 | $ite(check_0, 0, \triangleleft)$                |
| 283 n | ext 1 45 282  | $block'_{0,1} \leftarrow \downarrow$            |
|       |               |                                                 |
| 284 i | nit 1 46 4    | ${\tt block}_{0,2}^0 \leftarrow 0$              |
| 285 o | r 1 69 46     | $exec_{2,0} \lor block_{0,2}$                   |
| 286 i | te 1 76 4 285 | $ite(check_0, 0, \triangleleft)$                |
| 287 n | ext 1 46 286  | $\mathtt{block}'_{0,2} \leftarrow \forall$      |
|       |               |                                                 |

#### • Halt flags:

| 288 | init | 1 47 4   | $\mathtt{halt}_0^0 \leftarrow 0$                            |
|-----|------|----------|-------------------------------------------------------------|
| 289 | or 1 | 63 47    | $\mathtt{exec}_{0,4} \lor \mathtt{halt}_0$                  |
| 290 | next | 1 47 289 | $\mathtt{halt}_0' \leftarrow \forall$                       |
|     |      |          |                                                             |
| 291 | init | 1 48 4   | $\mathtt{halt}_1^0 \leftarrow 0$                            |
| 292 | or 1 | 68 48    | $exec_{1,4} \lor halt_1$                                    |
| 293 | next | 1 48 292 | $\mathtt{halt}_1' \leftarrow \forall$                       |
|     |      |          | -                                                           |
| 294 | init | 1 49 4   | $\mathtt{halt}_2^0 \leftarrow 0$                            |
| 295 | next | 1 49 49  | $\mathtt{halt}_2^{\bar{\prime}} \leftarrow \mathtt{halt}_2$ |

#### • Shared memory:

#### • Exit flag:

#### • Exit code:

```
311 init 2 52 6 exit-code^0 \leftarrow 0

312 ite 2 73 6 52 ite(exec_{2,4}, 0, exit-code)

313 ite 2 74 7 312 ite(exec_{2,5}, 1, \ \ \ )

314 next 2 52 313 exit-code' \leftarrow \ \ \ \ \
```

Finally, we declare an exit code greater than zero as the only bad state.

```
315 neq 1 6 52
316 bad 315
```

# 9 Testing and Debugging

The basic functionality of our toolchain is validated by a total of 800 test cases. While fixing C++17 related errors was relatively trivial due to the broad range of available tools, debugging the generated SMT encodings is a bit more tedious since cause and effect have to be determined manually by inspecting rather comprehensive models and formulas.

In order to automatically validate the correctness of our encodings, we included a replay mode for reevaluating the execution sequence of a given trace by the virtual machine used for simulation. If a mismatch is found, it returns the first step in which the results start to differ and prints the corresponding states of both execution traces for comparison. This mode also simplified the encodings' development significantly, since the condensed error trace helps to narrow down the potentially flawed SMT expressions.

Of course, only satisfiable formulas may be replayed. If the outcome of a valid execution sequence is accidentally unsatisfiable and therefore yields no trace, we still have to manually inspect the formula. Unfortunately, no SMT solver supported the extraction of an *unsatisfiable core* at the time we developed our encodings, but this feature starts to get implemented in recent solvers like Bitwuzla [9]. However, without any additional constraints, the only reason our formulas can get unsatisfiable is due to a violation of the cardinality constraint and investigations may therefore be concentrated on expressions containing transition variables.

Another frequent cause for unintended results lies in the nature of bounded model checking. Before expecting an error in the encoding, it is highly advisable to make sure that the targeted states are indeed reachable by the chosen bound. For example, when a model for the CAS variant of our concurrent counter experiments (described in the next section) was found during our initial trials, we immediately started to panic and questioned our encoding. Since such situations already commonly occurred during development, we remembered the following proverb which emerged during that time and soon realized that the problem indeed was yet another insufficient bound.

If you ask yourself - "How can this be?" - maybe just the bound's too wee!

# 10 Experiments

To asses performance related aspects of our encodings, we conducted a series of experiments, using the following versions of supported SMT solvers.

- Boolector 3.2.1<sup>3</sup> (including BtorMC)
- CVC4 1.8<sup>4</sup>
- $Z3 4.8.9^5$

We recorded the resulting formula sizes in terms of the number of generated expression, as well as the runtimes for encoding and solving each particular instance. All tests were performed on a cluster of Intel Xeon® E5-2620 v4 nodes, with CPU runtime and memory usage limited to 86400 seconds (24 hours) and 8 GB respectively.

#### Litmus Tests

Instead of a rigorous formal description, the memory ordering principles of Intel's [3] and AMD's [4] x86 implementations are defined by example through a set of so called "litmus tests". In order to show conformance with the memory ordering model of our virtual machine, the test suites have been ported to ConcuBinE and details are available in Appendices A and B.

|    |       | btor2 |      | functional |      | relational |      |
|----|-------|-------|------|------------|------|------------|------|
| №  | bound | time  | size | time       | size | time       | size |
| 1  | 9     | 0.01  | 166  | 6.38       | 2051 | 10.18      | 2404 |
| 2  | 10    | 0.69  | 176  | 7.04       | 2508 | 9.51       | 2950 |
| 3  | 10    | 0.75  | 174  | 6.61       | 2408 | 7.79       | 2850 |
| 4  | 5     | 0.32  | 93   | 1.95       | 627  | 2.41       | 724  |
| 5  | 12    | 0.98  | 194  | 9.62       | 3228 | 12.71      | 3950 |
| 6  | 13    | 0.76  | 256  | 12.62      | 4889 | 16.51      | 5723 |
| 7  | 14    | 0.94  | 322  | 16.15      | 6770 | 21.15      | 7780 |
| 8  | 12    | 0.93  | 334  | 12.83      | 6000 | 18.24      | 6890 |
| 9  | 8     | 0.01  | 186  | 4.71       | 2058 | 7.09       | 2428 |
| 10 | 8     | 0.28  | 173  | 3.40       | 1938 | 5.72       | 2252 |

Table 8: Intel litmus test encoding times in milliseconds and formula sizes in terms of the number of expressions.

<sup>3</sup>https://github.com/Boolector/boolector

<sup>4</sup>https://github.com/CVC4/CVC4
5https://github.com/Z3Prover/z3

Bounds and encoder statistics of both test suites are given in Tables 8 and 9. Comparing the sizes of generated formulas, visualized in Figures 8 and 9, clearly shows the main advantage of using BTOR2: compact problem instances and reduced computational complexity of its encoding process due to being automatically unrolled for any given bound via symbolic substitution by BtorMC at runtime. Furthermore, the additional redundancy introduced by the relational next state logic causes a quite substantial gap between the size of our SMT-LIB encoding variants.



Figure 8: Intel litmus test formula sizes in terms of the number of expressions.

|                    |       | btor2 |      | funct | functional |       | relational |  |
|--------------------|-------|-------|------|-------|------------|-------|------------|--|
| $N_{\overline{0}}$ | bound | time  | size | time  | size       | time  | size       |  |
| 1                  | 9     | 0.05  | 166  | 4.79  | 2051       | 6.99  | 2404       |  |
| 2                  | 10    | 0.06  | 176  | 6.76  | 2508       | 8.73  | 2950       |  |
| 3                  | 16    | 0.01  | 206  | 12.54 | 4536       | 18.64 | 5818       |  |
| 4                  | 10    | 0.01  | 174  | 7.96  | 2408       | 8.75  | 2850       |  |
| 5                  | 12    | 0.05  | 188  | 8.93  | 3108       | 11.94 | 3830       |  |
| 6                  | 14    | 1.47  | 322  | 16.36 | 6770       | 21.21 | 7780       |  |
| 7                  | 13    | 0.01  | 256  | 12.39 | 4889       | 14.74 | 5723       |  |
| 8                  | 12    | 0.61  | 196  | 9.07  | 3348       | 12.10 | 4070       |  |
| 9                  | 14    | 0.66  | 210  | 11.59 | 4168       | 16.63 | 5290       |  |

Table 9: AMD litmus test encoding times in milliseconds and formula sizes in terms of the number of expressions.



Figure 9: AMD litmus test formula sizes in terms of the number of expressions.

All litmus tests passed and the claim, that the encodings of our virtual machine model conform with the memory ordering principles of both major x86 vendors therefore validated by example. The corresponding solving times, given in Tables 10 and 11, reveal a dramatic difference between the tested solvers and encoding variants, even forcing us to resort to a logarithmic scale for plotting runtimes as shown in Figures 10 and 11 to get a meaningful graphical representation. Our functional encoding (BTOR2 and SMT-LIB) turned out to be the fastest, with BtorMC on top, closely followed by Boolector and Z3. As expected, the relational SMT-LIB approach is somewhat slower, but still beats CVC4 for any encoding variant when using Boolector or Z3. After reporting this observation to members of CVC4's development team it was confirmed that the latest release still uses a customized version of MiniSAT<sup>6</sup> as SAT backend for QF\_AUFBV formulas, which seems to be the main reason for the performance gap in comparison to Boolector (incorporating CaDiCaL<sup>7</sup> 1.0.3) and Z3.

<sup>6</sup>https://github.com/niklasso/minisat

<sup>&</sup>lt;sup>7</sup>https://github.com/arminbiere/cadical

|    | BtorMC | Bool       | ector      | Z          | 3          | CV         | C4         |
|----|--------|------------|------------|------------|------------|------------|------------|
| Nº | btor2  | functional | relational | functional | relational | functional | relational |
| 1  | 0.01   | 0.01       | 0.06       | 0.02       | 0.15       | 0.05       | 0.22       |
| 2  | 0.01   | 0.02       | 0.05       | 0.03       | 0.29       | 0.12       | 0.27       |
| 3  | 0.02   | 0.05       | 1.79       | 0.04       | 0.20       | 0.80       | 1.29       |
| 4  | 0.01   | 0.01       | 0.01       | 0.02       | 0.03       | 0.02       | 0.03       |
| 5  | 0.05   | 0.08       | 2.54       | 0.11       | 0.38       | 4.87       | 2.22       |
| 6  | 0.03   | 0.05       | 0.68       | 0.07       | 1.13       | 0.63       | 2.91       |
| 7  | 0.11   | 0.16       | 3.86       | 0.45       | 8.10       | 12225.10   | 35213.50   |
| 8  | 0.05   | 0.07       | 10.80      | 0.18       | 4.49       | 17415.50   | 26895.40   |
| 9  | 0.01   | 0.03       | 0.64       | 0.03       | 0.25       | 0.36       | 12.42      |
| 10 | 0.01   | 0.01       | 0.15       | 0.03       | 0.13       | 0.15       | 0.22       |

Table 10: Intel litmus test solving times in seconds.



Figure 10: Intel litmus test solving times in seconds.

|                    | BtorMC | Bool       | ector      | Z          | 3          | CV         | C4         |
|--------------------|--------|------------|------------|------------|------------|------------|------------|
| $N_{\overline{0}}$ | btor2  | functional | relational | functional | relational | functional | relational |
| 1                  | 0.01   | 0.01       | 0.03       | 0.01       | 0.14       | 0.05       | 0.22       |
| 2                  | 0.01   | 0.03       | 0.06       | 0.03       | 0.26       | 0.11       | 0.27       |
| 3                  | 0.08   | 0.10       | 5.71       | 0.09       | 1.08       | 1.82       | 4.67       |
| 4                  | 0.01   | 0.02       | 2.64       | 0.04       | 0.22       | 0.21       | 0.32       |
| 5                  | 0.05   | 0.06       | 1.58       | 0.15       | 0.90       | 1.90       | 127.06     |
| 6                  | 0.10   | 0.16       | 3.86       | 0.45       | 8.12       | 12213.70   | 35524.30   |
| 7                  | 0.05   | 0.07       | 0.71       | 0.09       | 1.05       | 0.58       | 2.84       |
| 8                  | 0.04   | 0.10       | 3.03       | 0.07       | 0.35       | 2.46       | 4.06       |
| 9                  | 0.10   | 0.10       | 2.11       | 0.46       | 1.16       | 4.28       | 3064.73    |

Table 11: AMD litmus test solving times in seconds.



Figure 11: AMD litmus test solving times in seconds.

#### Concurrent Counter

To show the scalability of our approach, we also use a parametrized concurrent counter (inspired by Paul McKenney [2]), where m threads increment a shared variable n times. Two flavours were tested: a faulty version using STORE to compare runtimes of satisfiable instances and a valid one relying on CAS, resulting in unsatisfiable runs. Both were executed for any combination of  $2 \le m \le 4$  threads and  $2 \le n \le 4$  increments respectively, forming a set of parameter configurations large enough to yield representative results while still being solvable within the given time and memory limits.

```
inc: LOAD 0
ADDI 1
STORE 0
LOAD <adr>
SUBI 1
STORE <adr>
JNZ inc
CHECK 0
HALT
```

Listing 10: Faulty Counter Template

Listing 10 shows the faulty counter template. Each thread t starts by loading, incrementing and storing the shared counter at address 0. Next, the local counter (limiting the number of iterations) is loaded, decremented and simply stored at the address specified by the template parameter  $\langle adr \rangle$ , which is replaced by t+10 for each thread  $0 \le t < m$ . If this local counter (initialized with n) remains greater than zero, the counting process is restarted by jumping back to the initial program statement. Otherwise, the threads synchronize upon checkpoint 0, signaling that counting has finished and the result should be evaluated. The checker thread given in Listing 11 then compares the shared counter at address 0 to the expected value m\*n supplied by yet another template parameter  $\langle sum \rangle$  and exits 1 if they differed.

```
CHECK 0
ADDI <sum>
CMP 0
JNZ error
EXIT 0
error: EXIT 1
```

Listing 11: Checker Template

In this example, relying on STORE to write a shared variable leads to an obvious data-race, even without the additional inconsistency introduced by the store buffer. Since an exit code greater than zero will serve as the bad state in the resulting model checking problem, we now must choose a proper bound such that the program will be fully executed and the checker thread's exit statements can be reached. We therefore have to find the longest possible trace through the program, its worst-case execution time (WCET) so to say. For our faulty counter, the process is straightforward and it can simply be determined by adding n times the number of steps required by each iteration of the counting loop inc (7 instructions + 2 flushes) to the 2 instructions left for completing the counter thread's execution. The minimum required bound can now be computed by multiplying the result with the number of participating threads m and adding the 5 steps needed by the checker thread, leading to the following equation.

$$m*(9*n+2)+5$$

For example, m = n = 2 would require a bound of 2 \* (9 \* 2 + 2) + 5 = 45.

|     |       | btor2 |      | functional |        | relational |        |
|-----|-------|-------|------|------------|--------|------------|--------|
| m n | bound | time  | size | time       | size   | time       | size   |
| 2 2 | 45    | 1.26  | 396  | 54.26      | 24783  | 80.01      | 33288  |
| 2 3 | 63    | 0.60  | 396  | 112.04     | 34539  | 111.30     | 46446  |
| 2 4 | 81    | 0.01  | 396  | 97.39      | 44295  | 146.15     | 59604  |
| 3 2 | 65    | 0.83  | 547  | 105.60     | 50284  | 166.88     | 67704  |
| 3 3 | 92    | 1.62  | 547  | 187.94     | 70939  | 228.75     | 95595  |
| 3 4 | 119   | 1.80  | 546  | 131.82     | 91594  | 188.94     | 123486 |
| 4 2 | 85    | 1.92  | 664  | 111.56     | 76363  | 171.92     | 105858 |
| 4 3 | 121   | 0.01  | 663  | 160.77     | 108439 | 277.84     | 150426 |
| 4 4 | 157   | 2.07  | 664  | 208.47     | 140515 | 313.55     | 194994 |

Table 12: Faulty counter encoding times in milliseconds and formula sizes in terms of the number of expressions for increasing values of m and n.

Table 12 shows the bounds and encoder statistics of our faulty counter experiments. While encoder runtimes are rather negligible, Figure 12 again highlights the dramatic difference in size between BTOR2 and SMT-LIB encoding variants for increasing values of m and n. As expected, the number of participating threads m is the main driving factor for increasing the problem size. The off-by-one differences in the size of the BTOR2 encodings for an equal number of threads m can be explained by the reuse of constants necessary to specify local counter addresses and the expected final value.



Figure 12: Faulty counter formula sizes in terms of the number of expressions for increasing values of m and n.

The time it took the supported solvers to find faulty traces for our faulty counter experiments are listed in Table 13 and visualized in Figure 13. Only BtorMC, Boolector using the functional and Z3 using the relational next state logic were able to find a solution for all experiments. Surprisingly, Z3 in combination with the functional next state logic performed much worse than expected and was just able to solve as many instances as Boolector using the relational variant. CVC4 however could not even solve a single instance within the given time and memory limits, again most likely due to the SAT solver used.

|     | BtorMC Boolector |            | polector Z3 |            | 3          | CVC4       |            |
|-----|------------------|------------|-------------|------------|------------|------------|------------|
| m n | btor2            | functional | relational  | functional | relational | functional | relational |
| 2 2 | 0.51             | 0.82       | 1838.45     | 82.94      | 19.76      | -          | -          |
| 2 3 | 8.41             | 10.69      | 2173.17     | 460.48     | 54.97      | -          | -          |
| 2 4 | 21.61            | 29.18      | 2367.87     | 1775.55    | 90.35      | -          | -          |
| 3 2 | 15.73            | 23.73      | 13990.70    | 4527.69    | 277.55     | -          | -          |
| 3 3 | 80.74            | 82.19      | 66764.00    | 13899.70   | 781.81     | -          | -          |
| 3 4 | 171.25           | 165.11     | -           | -          | 3308.92    | -          | -          |
| 4 2 | 157.16           | 158.09     | -           | -          | 1555.43    | -          | -          |
| 4 3 | 434.82           | 493.33     | -           | -          | 7060.05    | -          | -          |
| 4 4 | 1023.00          | 1037.32    | -           | -          | 35431.10   | -          | -          |

Table 13: Faulty counter solving times in seconds for increasing values of m and n.



Figure 13: Faulty counter solving times in seconds for increasing values of m and n.

Finally, we also experimented with a valid version of the previous concurrent counter example, given in Listing 12, which resorts to *compare-and-swap* for writing a shared variable in a predictable manner. Instead of simply reading the shared counter's value, we now use MEM to additionally memorize it for the latter application of CAS. Since CAS might fail due to an intermediate alteration of the shared counter variable by another thread, we must prevent an erroneous subsequent decrement of the thread's local counter variable by embedding it in a nested loop (labelled cas) and repeat the increment until it succeeds. Everything else remains the same, including the checker thread.

Listing 12: Valid Counter Template

Determining the required bound is a bit more complex, since we also have to account for the maximum number of failed CAS instructions to ensure that all possible traces are included in the search space. This worst case can occur if the participating counter threads are scheduled one after the other,

$$\mathtt{thread}_0^0 o \ldots o \mathtt{thread}_{m-1}^{m-1} o \mathtt{thread}_0^m o \ldots o \mathtt{thread}_{m-1}^{2m-1} o \ldots$$

causing all but one (the first) to fail in each iteration of the counting loop inc. Consider the following example trace for m=2 and n=1, where counting to m\*n=2 may require up to three cas loop iterations in total.

| Thread | CAS Status | Local Counter |
|--------|------------|---------------|
| 0      | done       | 0             |
| _1     | failed     | 1             |
| 1      | done       | 0             |

If we now examine the influence by an increased number of threads (m = 3) and n = 1, one notices that the maximum number of cas iterations follows a triangular number:  $m * \frac{m+1}{2} = 6$ .

| Thread | <b>CAS</b> Status | Local Counter |
|--------|-------------------|---------------|
| 0      | done              | 0             |
| 1      | failed            | 1             |
| 2      | failed            | 1             |
| 1      | done              | 1             |
| 2      | failed            | 0             |
| 2      | done              | 1             |

Since increasing the local increments n directly influences the number of cas iterations, we just need to multiply the previously identified triangular number by n to get the maximum:  $n*m*\frac{m+1}{2}$ . If we now consider the number of steps required by each iteration of the cas loop (4), the ones left to complete the counting loop inc (5) and the 2 instructions remaining in the counter thread's program, the final bound can be computed by the following equation.

$$m*(4*n*m*\frac{m+1}{2}+5*n+2)+5$$

For example, the valid counter experiment for m=n=2 already requires a bound of  $(4*2*2*\frac{2+1}{2}+5*2+2)*2+5=77$ .

Bounds and encoder statistics for our valid counter experiments are given in Table 14 and formula sizes visualized in Figure 14. While the addition of a single instruction only slightly increases the formula sizes of our BTOR2 encoding, the SMT-LIB variants exhibit a significant blowup due to the larger bound required by the introduction of a nested loop. Naturally, encoder runtimes are also affected, but remain acceptable as all experiments could be encoded in less than a second.

|     |       | btor2 |      | btor2 functional |        | relational |         |
|-----|-------|-------|------|------------------|--------|------------|---------|
| m n | bound | time  | size | time             | size   | time       | size    |
| 2 2 | 77    | 0.01  | 424  | 70.08            | 46453  | 98.68      | 63470   |
| 2 3 | 111   | 1.33  | 424  | 96.43            | 66785  | 67.35      | 91316   |
| 2 4 | 145   | 0.16  | 424  | 120.35           | 87117  | 87.92      | 119162  |
| 3 2 | 185   | 0.51  | 589  | 101.87           | 157645 | 156.29     | 216105  |
| 3 3 | 272   | 0.51  | 589  | 146.43           | 231508 | 231.16     | 317460  |
| 3 4 | 359   | 0.51  | 588  | 189.60           | 305371 | 304.00     | 418815  |
| 4 2 | 373   | 0.60  | 720  | 256.03           | 374775 | 391.62     | 528078  |
| 4 3 | 553   | 0.61  | 719  | 369.42           | 555315 | 607.02     | 782598  |
| 4 4 | 733   | 0.62  | 720  | 498.63           | 735855 | 774.74     | 1037118 |

Table 14: Valid counter encoding times in milliseconds and formula sizes in terms of the number of expressions for increasing values of m and n.



Figure 14: Valid counter formula sizes in terms of the number of expressions for increasing values of m and n.

Solving times of this experiment's unsatisfiable runs are shown in Table 15 and visualized in Figure 15. Unfortunately, only 3 out of 9 experiments could be solved by BtorMC and Boolector within the given time and memory limits. This puts the practicability of our approach in question, since further increasing the program size, number of threads and bound required by even the tiniest real world example would generate problem instances which most likely won't be solvable in a reasonable amount of time, possibly due to symmetric execution parts. Thus, these examples probably meet some kind of symmetry reduction or use a form of partial order reduction.

|     | BtorMC   | Bool       | ector      | Z          | 3          | CV         | C4         |
|-----|----------|------------|------------|------------|------------|------------|------------|
| m n | btor2    | functional | relational | functional | relational | functional | relational |
| 2 2 | 2105.44  | 893.10     | 15283.50   | -          | -          | -          | -          |
| 2 3 | 10188.30 | 5706.20    | -          | -          | -          | -          | -          |
| 2 4 | 56972.70 | 25617.10   | -          | -          | -          | -          | -          |
| 3 2 | -        | -          | -          | -          | -          | -          | -          |
| 3 3 | -        | -          | -          | -          | -          | -          | -          |
| 3 4 | -        | -          | -          | -          | -          | -          | -          |
| 4 2 | -        | -          | -          | -          | -          | -          | -          |
| 4 3 | -        | -          | -          | -          | -          | -          | -          |
| 4 4 | -        | ı          | -          | -          | 1          | ı          | -          |

Table 15: Valid counter solving times in seconds for increasing values of m and n.



Figure 15: Valid counter solving times in seconds for increasing values of m and n.

## 11 Conclusion

In this thesis we explored the potential of bounded model checking for verifying concurrent programs including the target architecture's memory ordering habits. Due to the complexity of actual processors, we devised a highly simplified virtual machine model and showed that the addition of store buffers is sufficient to imitate the behaviour of current x86 implementations.

After designing a formal framework for the execution of arbitrary programs on this virtual machine model, encodings of the corresponding model checking problems in two different SMT formats were developed. We then implemented ConcuBinE, a tool for automating the encoding and evaluation of generated SMT formulas using state-of-the-art solvers.

Even though experiments show the basic feasibility of our approach, the huge runtimes for even small examples puts its practicability in question. However, we still see it as an opportunity to integrate a similar checking procedure into compiler toolchains, automatically verifying the binaries of critical software with regards to the target architecture's memory ordering model and other specifics.

Future work could therefore involve optimizing the encoding, devising a method for specifying assertions, developing a procedure to additionally compute the WCET for automatically determining the upper bound and applying the presented principles to a real architecture like Intel's or AMD's x86 processors.

## References

- [1] Leslie Lamport. How to Make a Correct Multiprocess Program Execute Correctly on a Multiprocessor. *IEEE Transactions on Computers*, 46(7):779–782, 1997.
- [2] Paul E. McKenney. Is Parallel Programming Hard, And, If So, What Can You Do About It? (v2017.01.02a). CoRR, abs/1701.00854, 2017.
- [3] Intel Corporation. Intel 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, rev. 63. Technical report, October 2019.
- [4] Advanced Micro Devices. AMD64 Architecture Programmer's Manual Volume 2: System Programming, rev 3.32. Technical report, October 2019.
- [5] Armin Biere, Alessandro Cimatti, Edmund M. Clarke, Ofer Strichman, and Yunshan Zhu. Bounded Model Checking. *Advances in computers*, 58(11):117–148, 2003.
- [6] Clark Barrett, Pascal Fontaine, and Cesare Tinelli. The SMT-LIB Standard: Version 2.6. Technical report, Department of Computer Science, The University of Iowa, 2017. Available at www.SMT-LIB.org.
- [7] Aina Niemetz, Mathias Preiner, Clifford Wolf, and Armin Biere. Btor2, BtorMC and Boolector 3.0. In *CAV* (1), volume 10981 of *Lecture Notes in Computer Science*, pages 587–595. Springer, 2018.
- [8] Carsten Sinz. Towards an Optimal CNF Encoding of Boolean Cardinality Constraints. In *Proc. of the 11th Intl. Conf. on Principles and Practice of Constraint Programming (CP 2005)*, pages 827–831, Sitges, Spain, October 2005.
- [9] Aina Niemetz and Mathias Preiner. Bitwuzla at the SMT-COMP 2020. CoRR, abs/2006.01621, 2020.

# Appendices

#### A Intel Litmus Tests

Intel's memory ordering litmus tests as seen in [3, Section 8.2.3] ported to our virtual machine model.

#### Neither Loads Nor Stores Are Reordered with Like Operations

The Intel-64 memory ordering model allows neither loads nor stores to be reordered with the same kind of operation. That is, it ensures that loads are seen in program order and that stores are seen in program order. This is illustrated by the following example:

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   |          |
| STORE 0  | MEM 1    |
| STORE 1  | LOAD O   |

| Initial                                   | Not Allowed 🗡                                   |
|-------------------------------------------|-------------------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{mem}_1 = 1 \wedge \mathtt{accu}_1 = 0$ |

Example A.1: Stores Are Not Reordered with Other Stores [3, Example 8-1]

The disallowed return values could be exhibited only if thread 0's two stores are reordered (with the two loads occurring between them) or if thread 1's two loads are reordered (with the two stores occurring between them).

If  $mem_1 = 1$ , the store to heap[1] occurs before the load from heap[1]. Because the Intel-64 memory ordering model does not allow stores to be reordered, the earlier store to heap[0] occurs before the load from heap[1]. Because the Intel-64 memory ordering model does not allow loads to be reordered, the store to heap[0] also occurs before the later load from heap[0]. Thus  $accu_1 = 1$ .

#### Stores Are Not Reordered with Earlier Loads

The Intel-64 memory ordering model ensures that a store by a thread may not occur before a previous load by the same thread. This is illustrated by the following example:

| Thread 0 | Thread 1 |
|----------|----------|
| MEM O    | MEM 1    |
| ADDI 1   | ADDI 1   |
| STORE 1  | STORE 0  |

| Initial                                   | Not Allowed 🗡                         |
|-------------------------------------------|---------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{mem}_0 = \mathtt{mem}_1 = 1$ |

Example A.2: Stores Are Not Reordered with Older Loads [3, Example 8-2]

Assume  $mem_0 = 1$ .

- Because  $mem_0 = 1$ , thread 1's store to heap[0] occurs before thread 0's load from heap[0].
- Because the Intel-64 memory ordering model prevents each store from being reordered with the earlier load by the same thread, thread 1's load from heap[1] occurs before its store to heap[0].
- Similarly, thread 0's load from heap[0] occurs before its store to heap[1].
- Thus, thread 1's load from heap[1] occurs before thread 0's store to heap[1], implying  $mem_1 = 0$ .

#### Loads May Be Reordered with Earlier Stores to Different Locations

The Intel-64 memory ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.

The fact that a load may be reordered with an earlier store to a different location is illustrated by the following example:

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   | ADDI 1   |
| STORE 0  | STORE 1  |
| LOAD 1   | LOAD O   |

| Initial                                   | Allowed   ✓                             |  |
|-------------------------------------------|-----------------------------------------|--|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{accu}_0 = \mathtt{accu}_1 = 0$ |  |

Example A.3: Loads May be Reordered with Older Stores [3, Example 8-3]

At each thread, the load and the store are to different locations and hence may be reordered. Any interleaving of the operations is thus allowed. One such interleaving has the two loads occurring before the two stores. This would result in each load returning value 0.

#### Loads Are Not Reordered with Older Stores to the Same Location

The Intel-64 memory ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.

The fact that a load may not be reordered with an earlier store to the same location is illustrated by the following example:

| Thread 0 |  |
|----------|--|
| ADDI 1   |  |
| STORE 0  |  |
| LOAD O   |  |

| Initial                | Not Allowed X         |
|------------------------|-----------------------|
| $\mathtt{heap}[0] = 0$ | $\mathtt{accu}_0 = 0$ |

Example A.4: Loads Are not Reordered with Older Stores to the Same Location [3, Example 8-4]

The Intel-64 memory ordering model does not allow the load to be reordered with the earlier store because the accesses are to the same location. Therefore,  $accu_0 = 1$  must hold.

#### Intra-Processor Forwarding Is Allowed

The memory ordering model allows concurrent stores by two threads to be seen in different orders by those two threads; specifically, each thread may perceive its own store occurring before that of the other. This is illustrated by the following example:

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   | ADDI 1   |
| STORE 0  | STORE 1  |
| LOAD O   | LOAD 1   |
| LOAD 1   | LOAD O   |

| Initial                                   | Allowed   ✓                             |
|-------------------------------------------|-----------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{accu}_0 = \mathtt{accu}_1 = 0$ |

Example A.5: Intra-Processor Forwarding is Allowed [3, Example 8-5]

The memory ordering model imposes no constraints on the order in which the two stores appear to execute by the two threads. This fact allows thread 0 to see its store before seeing thread 1's, while thread 1 sees its store before seeing thread 0's. (Each thread is self consistent.) This allows  $accu_0 = accu_1 = 0$ .

In practice, the reordering in this example can arise as a result of store-buffer forwarding. While a store is temporarily held in a thread's store buffer, it can satisfy the thread's own loads but is not visible to (and cannot satisfy) loads by other threads.

#### Stores Are Transitively Visible

The memory ordering model ensures transitive visibility of stores; stores that are causally related appear to all threads to occur in an order consistent with the causality relation. This is illustrated by the following example:

| Thread 0 | Thread 1 | Thread 2 |
|----------|----------|----------|
| ADDI 1   | MEM O    |          |
| STORE 0  | JNZ 3    |          |
|          | ADDI 1   |          |
|          | STORE 1  | MEM 1    |
|          |          | LOAD O   |

| Initial                                   | Not Allowed X                                                   |
|-------------------------------------------|-----------------------------------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{mem}_1 = \mathtt{mem}_2 = 1 \land \mathtt{accu}_2 = 0$ |

Example A.6: Stores Are Transitively Visible [3, Example 8-6]

Assume that  $mem_1 = 1$  and  $mem_2 = 1$ .

- Because  $mem_1 = 1$ , thread 0's store occurs before thread 1's load.
- Because the memory ordering model prevents a store from being reordered with an earlier load (see [3, Section 8.2.3.3]), thread 1's load occurs before its store. Thus, thread 0's store causally precedes thread 1's store.
- Because thread 0's store causally precedes thread 1's store, the memory ordering model ensures that thread 0's store appears to occur before thread 1's store from the point of view of all threads.
- Because  $mem_2 = 1$ , thread 1's store occurs before thread 2's load.
- Because the Intel-64 memory ordering model prevents loads from being reordered (see [3, Section 8.2.3.2]), thread 2's load occur in order.
- The above items imply that thread 0's store to heap[0] occurs before thread 2's load from heap[0]. This implies that  $accu_2 = 1$ .

#### Stores Are Seen in a Consistent Order by Other Threads

As noted in [3, Section 8.2.3.5], the memory ordering model allows stores by two threads to be seen in different orders by those two threads. However, any two stores must appear to execute in the same order to all threads other than those performing the stores. This is illustrated by the following example:

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
|----------|----------|----------|----------|
| ADDI 1   | ADDI 1   | MEM O    | MEM 1    |
| STORE 0  | STORE 1  | LOAD 1   | LOAD O   |

| Initial               | Not Allowed 🗡                                                                      |
|-----------------------|------------------------------------------------------------------------------------|
| heap[0] = heap[1] = 0 | $\mathtt{mem}_2 = \mathtt{mem}_3 = 1 \wedge \mathtt{accu}_2 = \mathtt{accu}_3 = 0$ |

Example A.7: Stores Are Seen in a Consistent Order by Other Threads [3, Example 8-7]

By the principles discussed in [3, Section 8.2.3.2],

- thread 2's first and second load cannot be reordered,
- thread 3's first and second load cannot be reordered.
- If  $mem_2 = 1$  and  $accu_2 = 0$ , thread 0's store appears to precede thread 1's store with respect to thread 2.
- Similarly,  $mem_3 = 1$  and  $accu_3 = 0$  imply that thread 1's store appears to precede thread 0's store with respect to thread 1.

Because the memory ordering model ensures that any two stores appear to execute in the same order to all threads (other than those performing the stores), this set of return values is not allowed.

#### Locked Instructions Have a Total Order

The memory ordering model ensures that all threads agree on a single execution order of all locked instructions. This is illustrated by the following example:

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
|----------|----------|----------|----------|
| ADDI 1   | ADDI 1   |          |          |
| CAS 0    | CAS 1    |          |          |
|          |          | MEM O    | MEM 1    |
|          |          | LOAD 1   | LOAD O   |

| Initial               | Not Allowed 🗡                                                                     |
|-----------------------|-----------------------------------------------------------------------------------|
| heap[0] = heap[1] = 0 | $\mathtt{mem}_2 = \mathtt{mem}_3 = 1 \land \mathtt{accu}_2 = \mathtt{accu}_3 = 0$ |

Example A.8: Locked Instructions Have a Total Order [3, Example 8-8]

Thread 2 and thread 3 must agree on the order of the two executions of CAS. Without loss of generality, suppose that thread 0's CAS occurs first.

- If  $mem_3 = 1$ , thread 1's CAS into heap[1] occurs before thread 3's load from heap[1].
- Because the Intel-64 memory ordering model prevents loads from being reordered (see [3, Section 8.2.3.2]), thread 3's loads occur in order and, therefore, thread 1's CAS occurs before thread 3's load from heap[0].
- Since thread 0's CAS into heap[0] occurs before thread 1's CAS (by assumption), it occurs before thread 3's load from heap[0]. Thus,  $accu_3 = 1$ .

A similar argument (referring instead to thread 2's loads) applies if thread 1's CAS occurs before thread 0's CAS.

#### Loads Are Not Reordered with Locked Instructions

The memory ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later. The examples in this section illustrate only cases in which a locked instruction is executed before a load or a store. The reader should note that reordering is prevented also if the locked instruction is executed after a load or a store.

The first example illustrates that loads may not be reordered with earlier locked instructions:

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   | ADDI 1   |
| CAS 0    | CAS 1    |
| LOAD 1   | LOAD O   |

| Initial                                   | Not Allowed X                           |
|-------------------------------------------|-----------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{accu}_0 = \mathtt{accu}_1 = 0$ |

Example A.9: Loads Are not Reordered with Locks [3, Example 8-9]

As explained in [3, Section 8.2.3.8], there is a total order of the executions of locked instructions. Without loss of generality, suppose that thread 0's CAS occurs first.

Because the Intel-64 memory ordering model prevents thread 1's load from being reordered with its earlier CAS, thread 0's CAS occurs before thread 1's load. This implies  $accu_1 = 1$ .

A similar argument (referring instead to thread 2's accesses) applies if thread 1's CAS occurs before thread 0's CAS.

#### Stores Are Not Reordered with Locked Instructions

The memory ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later. The examples in this section illustrate only cases in which a locked instruction is executed before a load or a store. The reader should note that reordering is prevented also if the locked instruction is executed after a load or a store.

The second example illustrates that a store may not be reordered with an earlier locked instruction:

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   |          |
| CAS 0    | MEM 1    |
| STORE 1  | LOAD O   |

| Initial                                   | Not Allowed X                                  |
|-------------------------------------------|------------------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{accu}_1 = 0 \land \mathtt{mem}_1 = 1$ |

Example A.10: Stores Are not Reordered with Locks [3, Example 8-10]

#### Assume $mem_1 = 1$ .

- Because  $mem_1 = 1$ , thread 0's store to heap[1] occurs before thread 1's load from heap[1].
- Because the memory ordering model prevents a store from being reordered with an earlier locked instruction, thread 0's CAS into heap[0] occurs before its store to heap[1].
- Thus, thread 0's CAS into heap[0] occurs before thread 1's load from heap[1].
- Because the memory ordering model prevents loads from being reordered (see [3, Section 8.2.3.2]), thread 1's loads occur in order and, therefore, thread 1's CAS into heap[0] occurs before thread 1's load from heap[0]. Thus, accu<sub>1</sub> = 1.

## B AMD Litmus Tests

AMD's memory ordering litmus tests as seen in [4, Section 7.2] ported to our virtual machine model.

#### Neither Loads Nor Stores Are Reordered with Like Operations

Successive stores from a single thread are committed to system memory and visible to other threads in program order. A store by a thread cannot be committed to memory before a read appearing earlier in the program has captured its targeted data from memory. In other words, stores from a thread cannot be reordered to occur prior to a load preceding it in program order.

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   |          |
| STORE 0  | MEM 1    |
| STORE 1  | LOAD O   |

| Initial                                   | Not Allowed 🗡                                  |
|-------------------------------------------|------------------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{mem}_1 = 1 \land \mathtt{accu}_1 = 0$ |

Example B.1: Stores Are Not Reordered with Other Stores [4, Example 1]

LOAD 0 cannot read 0 when LOAD 1 reads 1.

#### Stores Are Not Reordered with Earlier Loads

Successive stores from a single thread are committed to system memory and visible to other threads in program order. A store by a thread cannot be committed to memory before a read appearing earlier in the program has captured its targeted data from memory. In other words, stores from a thread cannot be reordered to occur prior to a load preceding it in program order.

| Thread 0 | Thread 1 |
|----------|----------|
| MEM O    | MEM 1    |
| ADDI 1   | ADDI 1   |
| STORE 1  | STORE 0  |

| Initial                                   | Not Allowed 🗡                         |
|-------------------------------------------|---------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{mem}_0 = \mathtt{mem}_1 = 1$ |

Example B.2: Stores Are Not Reordered with Older Loads [4, Example 2]

LOAD 0 and LOAD 1 cannot both read 1.

#### Stores Can Be Arbitrarily Delayed

Stores from a thread appear to be committed to the memory system in program order; however, stores can be delayed arbitrarily by store buffering while the thread continues operation. Therefore, stores from a thread may not appear to be sequentially consistent.

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   | ADDI 1   |
| STORE 0  | STORE 1  |
| ADDI 1   | ADDI 1   |
| STORE 0  | STORE 1  |
| LOAD O   | LOAD 1   |

| Allowed   ✓                             |  |
|-----------------------------------------|--|
| $\mathtt{accu}_0 = \mathtt{accu}_1 = 1$ |  |

Example B.3: Stores Can Be Arbitrarily Delayed [4, Example 3]

Both LOAD 0 and LOAD 1 may read 1.

#### Loads May Be Reordered with Earlier Stores to Different Locations

Non-overlapping Loads may pass stores.

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   | ADDI 1   |
| STORE 0  | STORE 1  |
| LOAD 1   | LOAD O   |

| Initial                                   | Allowed ✓                               |
|-------------------------------------------|-----------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{accu}_0 = \mathtt{accu}_1 = 0$ |

Example B.4: Loads May be Reordered with Older Stores [4, Example 4]

All combinations of values (00, 01, 10, and 11) may be observed by threads 0 and 1.

#### Sequential Consistency

Where sequential consistency is needed (for example in Dekker's algorithm for mutual exclusion), a FENCE instruction should be used between the store and the subsequent load, or an atomic instruction, such as CAS, should be used for the store.

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   | ADDI 1   |
| STORE 0  | STORE 1  |
| FENCE    | FENCE    |
| LOAD 1   | LOAD O   |

| Initial                                   | Not Allowed X                           |
|-------------------------------------------|-----------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{accu}_0 = \mathtt{accu}_1 = 0$ |

Example B.5: Sequential Consistency [4, Example 5]

LOAD 0 and LOAD 1 cannot both read 0.

#### Stores Are Seen in a Consistent Order by Other Threads

Stores to different locations in memory observed from two (or more) other threads will appear in the same order to all observers. Behavior such as that is shown in this code example:

| Thread 0 | Thread 1 | Thread 2 | Thread 3 |
|----------|----------|----------|----------|
| ADDI 1   | ADDI 1   | MEM O    | MEM 1    |
| STORE 0  | STORE 1  | LOAD 1   | LOAD O   |

| Initial                                   | Not Allowed 🗡                                                                      |
|-------------------------------------------|------------------------------------------------------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{mem}_2 = \mathtt{mem}_3 = 1 \wedge \mathtt{accu}_2 = \mathtt{accu}_3 = 0$ |

Example B.6: Stores Are Seen in a Consistent Order by Other Threads [4, Example 6]

Thread 2 seeing STORE 0 from thread 0 before STORE 1 from thread 1, while thread 3 sees STORE 1 from thread 1 before STORE 0 from thread 0, is not allowed.

#### Stores Are Transitively Visible

Dependent stores between different threads appear to occur in program order, as shown in the code example below.

| Thread 0 | Thread 1 | Thread 2 |
|----------|----------|----------|
| ADDI 1   | MEM O    |          |
| STORE 0  | JNZ 3    |          |
|          | ADDI 1   |          |
|          | STORE 1  | MEM 1    |
|          |          | LOAD O   |

| Initial                                   | Not Allowed X                                                   |
|-------------------------------------------|-----------------------------------------------------------------|
| $\mathtt{heap}[0] = \mathtt{heap}[1] = 0$ | $\mathtt{mem}_1 = \mathtt{mem}_2 = 1 \land \mathtt{accu}_2 = 0$ |

Example B.7: Stores Are Transitively Visible [4, Example 7]

If thread 1 reads a value from heap[0] (written by thread 0) before carrying out a store to heap[1], and if thread 2 reads the updated value from heap[1], a subsequent read of heap[0] must also be the updated value.

### Intra-Processor Forwarding Is Allowed

The local visibility (within a thread) for a memory operation may differ from the global visibility (from another thread). Using a data bypass, a local load can read the result of a local store in a store buffer, before the store becomes globally visible. Program order is still maintained when using such bypasses.

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   | ADDI 1   |
| STORE 0  | STORE 1  |
| MEM O    | MEM 1    |
| LOAD 1   | LOAD O   |

| Initial               | Allowed ✓                                                                          |
|-----------------------|------------------------------------------------------------------------------------|
| heap[0] = heap[1] = 0 | $\mathtt{mem}_0 = \mathtt{mem}_1 = 1 \wedge \mathtt{accu}_0 = \mathtt{accu}_1 = 0$ |

Example B.8: Intra-Processor Forwarding is Allowed [4, Example 8]

LOAD 0 in thread 0 can read 1 using the data bypass, while LOAD 0 in thread 1 can read 0. Similarly, LOAD 1 in thread 1 can read 1 while LOAD 1 in thread 0 can read 0. Therefore, the result  $mem_0 = 1$ ,  $accu_0 = 0$ ,  $mem_1 = 1$  and  $accu_1 = 0$  may occur. There are no constraints on the relative order of when the STORE 0 of thread 0 is visible to thread 1 relative to when the STORE 1 of thread 1 is visible to thread 0.

#### Global Visibility

If a very strong memory ordering model is required that does not allow local store-load bypasses, a FENCE instruction or an atomic instruction such as CAS should be used between the store and the subsequent load. This enforces a memory ordering stronger than total store ordering.

| Thread 0 | Thread 1 |
|----------|----------|
| ADDI 1   | ADDI 1   |
| STORE 0  | STORE 1  |
| FENCE    | FENCE    |
| MEM O    | MEM 1    |
| LOAD 1   | LOAD O   |

| Initial               | Not Allowed 🗡                                                                      |
|-----------------------|------------------------------------------------------------------------------------|
| heap[0] = heap[1] = 0 | $\mathtt{mem}_0 = \mathtt{mem}_1 = 1 \wedge \mathtt{accu}_0 = \mathtt{accu}_1 = 0$ |

Example B.9: Global Visibility [4, Example 9]

In this example, the FENCE instruction ensures that any buffered stores are globally visible before the loads are allowed to execute, so the result  $\mathtt{mem}_0 = 1$ ,  $\mathtt{accu}_0 = 0$ ,  $\mathtt{mem}_1 = 1$  and  $\mathtt{accu}_1 = 0$  will not occur.