
Dgraph transactions violated causal consistency #8146

@siliunobi


We found transactional causal consistency (TCC) anomalies in a Dgraph cluster. According to https://dgraph.io/docs/design-concepts/consistency-model/ , Dgraph provides snapshot isolation, which is stronger than TCC, so these anomalies should not occur.

TCC Definition
The causal consistency we checked is defined in the SOSP’11 paper “Don’t settle for eventual: scalable causal consistency for wide-area storage with COPS” (Section 3.1), as well as in the POPL’17 paper "On verifying causal consistency".

Experimental Setup
In our experiment we set up the cluster on a local machine with 3 server nodes. Here is the configuration information:

Dgraph Version == 21.03.2

server_node = 3
client_num(session_num) = 2
client_stub_num = 1
txn_per_session = 10
operation_per_txn = 10
key_number = 20
key_distribution = uniform

We are using a simple table schema, just containing key-value pairs, e.g., key=10, value=5. Keys are initialized to value 0. Note that, for each write on a key, the value (generated by the workload generator) is unique.
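For readers who want a feel for this kind of workload, here is a minimal sketch of a generator that follows the parameters listed above. It is not the code in data_generation.py: the function name `generate_workload`, the `read_ratio` parameter, and the tuple layout are assumptions made for illustration only.

```python
import itertools
import random

# Parameters copied from the setup above.
SESSION_NUM = 2
TXN_PER_SESSION = 10
OPERATION_PER_TXN = 10
KEY_NUMBER = 20


def generate_workload(seed=0, read_ratio=0.5):
    """Return {session_id: [txn, ...]}; each txn is a list of (op, key, value).

    read_ratio is a hypothetical knob; the issue does not state the read/write mix.
    """
    rng = random.Random(seed)
    next_value = itertools.count(1)  # 0 is reserved for the initial value of every key
    workload = {}
    for session in range(SESSION_NUM):
        txns = []
        for _ in range(TXN_PER_SESSION):
            ops = []
            for _ in range(OPERATION_PER_TXN):
                key = rng.randrange(KEY_NUMBER)  # uniform key distribution
                if rng.random() < read_ratio:
                    ops.append(("r", key, None))  # the value read is only known at run time
                else:
                    ops.append(("w", key, next(next_value)))  # globally unique written value
            txns.append(ops)
        workload[session] = txns
    return workload
```

Because every written value is unique, a later read of (key, value) identifies exactly one writing transaction, which is what makes the write-read analysis below possible.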

Anomaly Found
One anomaly was found on five transactions from two sessions, where r(A,B) / w(A,B) denotes a read / write of value B on key A:

[Image: the transactions involved in the anomaly, with their read/write operations]

txn4 and txn6 have a “write-read” order on key 19, denoted by txn4 ->wr txn6; txn13 and txn6 have a write-read order on key 0, i.e., txn13 ->wr txn6. A “wr” order means that one transaction writes a value to a key and another transaction reads that same value from the same key; the writing transaction must therefore happen before the reading transaction.
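As an illustration, write-read edges can be recovered from a history roughly as follows. This is a sketch, not the checker in run_verification.py; the function name `wr_edges`, the history layout, and the concrete values 41 and 130 are hypothetical (the actual values are in the linked dataset).

```python
def wr_edges(history):
    """history: {txn_id: [(op, key, value), ...]} -> set of (writer, reader, key)."""
    # Written values are unique, so (key, value) identifies a single writer.
    writer_of = {}
    for txn, ops in history.items():
        for op, key, value in ops:
            if op == "w":
                writer_of[(key, value)] = txn

    edges = set()
    for txn, ops in history.items():
        for op, key, value in ops:
            if op == "r":
                writer = writer_of.get((key, value))
                # Skip reads of the initial value and reads of a txn's own writes.
                if writer is not None and writer != txn:
                    edges.add((writer, txn, key))
    return edges


# Hypothetical values standing in for the real ones in the dataset.
history = {
    "txn4": [("w", 19, 41)],
    "txn13": [("w", 0, 130)],
    "txn6": [("r", 19, 41), ("r", 0, 130)],
}
print(sorted(wr_edges(history)))
# [('txn13', 'txn6', 0), ('txn4', 'txn6', 19)]
```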

To satisfy transactional causal consistency, txn13 must be ordered before txn4, because we already know txn4 ->wr txn6 and txn13 ->wr txn6, and txn6 read the value of key 19 written by txn4. Thus we have the commit order txn13 ->co txn4 (see the bottom for the definition of commit order). However, txn4 can reach txn13 via txn4 ->so txn5 ->wr txn11 ->so txn12 ->so txn13, where “so” denotes session order. Hence, there is a cycle between txn13 and txn4, which violates TCC.
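The cycle described above can be checked mechanically. The sketch below encodes only the edges named in this report and runs a plain depth-first search for a cycle; `find_cycle` is a hypothetical helper written for illustration and is not the run_oopsla_graph checker.

```python
def find_cycle(edges):
    """edges: iterable of (src, dst, label). Return one cycle as a node list, or None."""
    graph = {}
    for src, dst, _label in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for nxt in graph[node]:
            if color[nxt] == GREY:  # back edge: nxt is already on the current path
                return stack[stack.index(nxt):] + [nxt]
            if color[nxt] == WHITE:
                found = visit(nxt)
                if found:
                    return found
        stack.pop()
        color[node] = BLACK
        return None

    for node in graph:
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Edges taken from the description above (so = session order, wr = write-read,
# co = the commit order inferred from the two wr edges into txn6).
edges = [
    ("txn4", "txn5", "so"),
    ("txn5", "txn11", "wr"),
    ("txn11", "txn12", "so"),
    ("txn12", "txn13", "so"),
    ("txn4", "txn6", "wr"),
    ("txn13", "txn6", "wr"),
    ("txn13", "txn4", "co"),
]
print(find_cycle(edges))
# ['txn4', 'txn5', 'txn11', 'txn12', 'txn13', 'txn4']
```

Dropping the inferred co edge leaves the graph acyclic, which shows that the violation rests on that inferred ordering combined with the so/wr path from txn4 to txn13.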

The dataset is given at https://github.com/20211202na/dgraph_data/blob/main/data.txt

Reproducibility

  1. We use github.com/20211202na/dgraph_data/blob/main/data_generation.py to deal with the client-side logic (e.g., generating txn workloads)

  2. For each run, “histories” of txns (values read and/or written) for all clients are also collected and printed.

  3. We then invoke the checker (the function run_oopsla_graph) in github.com/20211202na/dgraph_data/blob/main/run_verification.py to verify if the history violates transactional causal consistency.

Note that, since we are doing random testing (e.g., workloads are generated probabilistically), it’s hard to reproduce a specific anomaly. However, we observed 5 to 8 violating histories per 100 histories with the setup posted. We believe that anomalies will manifest given a sufficiently large number of runs.
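As a rough guide to how many runs "sufficiently large" might mean, the following back-of-the-envelope sketch assumes that runs are independent and that each history violates TCC with a fixed probability between 0.05 and 0.08 (the observed rate). Both assumptions are ours, and `runs_needed` is a hypothetical helper.

```python
import math


def runs_needed(p_violation, confidence=0.99):
    """Smallest n with 1 - (1 - p)^n >= confidence, assuming independent runs."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_violation))


print(runs_needed(0.05))  # 90
print(runs_needed(0.08))  # 56
```

Under these assumptions, somewhere between about 56 and 90 runs should produce at least one violating history with 99% probability.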


Labels: investigate (requires further investigation), priority/P0 (critical issue that requires immediate attention)
