
Leaderless Log Protocol

A formally verified distributed append-only log protocol with concurrent writers, compaction, and readers — built on the coordination-delegated pattern, where stateless workers offload all coordination to an external linearizable store.

What is the Leaderless Log Protocol?

The Leaderless Log Protocol implements a distributed append-only log where any writer can append without being elected leader. There is no Raft, no Paxos, no leader election. Instead, writers are stateless processes that delegate ordering and index management to an external linearizable coordination store:

  • Monotonic offsets are assigned via the store's AtomicIncrement (a sequence counter)
  • Log index is maintained as linearizable key-value entries with atomic CAS
  • Compaction rewrites ranges of WAL entries into compacted files via a 3-step CAS update
  • Fencing prevents stale writers from corrupting the log after a takeover
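A compressed sketch of how the first two primitives combine with fencing on the append path; the `CoordStore` class, the `epoch` key, and the method names are invented for illustration and are not the repo's API:

```python
# Illustrative in-memory stand-in for the linearizable coordination store.
# AtomicIncrement hands out offsets; CAS publishes index entries; an epoch
# key (hypothetical here) models fencing of stale writers.
class CoordStore:
    def __init__(self):
        self.seq = 0   # sequence counter backing AtomicIncrement
        self.kv = {}   # linearizable key-value entries

    def atomic_increment(self):
        self.seq += 1
        return self.seq

    def cas(self, key, expected, new):
        """Compare-and-swap: update only if the current value matches."""
        if self.kv.get(key) == expected:
            self.kv[key] = new
            return True
        return False

def append(store, record, writer_epoch):
    """Writer path: take a monotonic offset, then publish the index entry,
    refusing to publish if the writer has been fenced (stale epoch)."""
    off = store.atomic_increment()
    if store.kv["epoch"] != writer_epoch:
        return None  # fenced: stale writer rejected before touching the index
    store.cas(("entry", off), None, record)
    return off

store = CoordStore()
store.kv["epoch"] = 1
print(append(store, b"r1", writer_epoch=1))  # 1
store.kv["epoch"] = 2                        # takeover fences epoch-1 writers
print(append(store, b"r2", writer_epoch=1))  # None
```

Note the gap this leaves at offset 2: the counter is consumed even when publication is refused, which is why readers must tolerate offsets with no covering entry.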

The protocol models three interacting sub-protocols — writer append path, compaction index update, and reader path — and is verified in both TLA+ and Fizzbee against safety and liveness properties (see Expected Results).

Full canonical specification: 1-leaderless-log-protocol.md.

Production Use: Ursa

The Leaderless Log Protocol formalized in this repo is the protocol used in Ursa, the lakehouse-native streaming engine from StreamNative. The formal spec here is the distilled, model-checked core of the protocol that Ursa runs in production.

Reference Implementation: S3-Queue

S3-Queue is a distributed message queue built entirely on S3-compatible object storage — no external coordination store required. It instantiates the Leaderless Log Protocol with S3 conditional writes as the coordination primitive.

| Protocol       | Leaderless Log |
|----------------|----------------|
| System SPEC    | examples/s3-queue/SPEC.md — 1,100+ lines covering the full design |
| Implementation | Rust CLI (s3q) — generated by a coding agent from the SPEC alone |
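The conditional-write primitive S3-Queue builds on can be modeled in a few lines; `ConditionalStore` and its method name are hypothetical stand-ins for S3's `PutObject` with `If-None-Match: *`:

```python
# Rough model (illustrative names) of create-if-absent, the coordination
# primitive S3 exposes through conditional writes.
class ConditionalStore:
    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key, value):
        """Succeed only if the key does not already exist; a real S3
        backend would answer 412 Precondition Failed on conflict."""
        if key in self._objects:
            return False
        self._objects[key] = value
        return True

s3 = ConditionalStore()
print(s3.put_if_absent("log/00000001", b"rec"))  # True: first writer wins
print(s3.put_if_absent("log/00000001", b"dup"))  # False: concurrent writer loses
```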

Try it yourself — tell your favorite coding agent:

Implement S3-Queue according to the following spec: https://github.com/lakestream-io/leaderless-log-protocol/blob/main/examples/s3-queue/SPEC.md

See examples/ for the full list of examples.

The Coordination-Delegated Pattern

The Leaderless Log Protocol is one instance of a broader architectural pattern: coordination-delegated distributed systems. In this pattern, worker nodes carry no consensus logic — they are stateless processes that offload all coordination decisions (ordering, locking, leader election, failure detection) to an external linearizable store. The store becomes the single source of truth for shared state, while workers focus purely on data-path operations.

┌───────────┐   ┌───────────┐   ┌───────────┐
│ Worker A  │   │ Worker B  │   │ Worker C  │
│(stateless)│   │(stateless)│   │(stateless)│
└─────┬─────┘   └─────┬─────┘   └─────┬─────┘
      │               │               │
      ▼               ▼               ▼
┌─────────────────────────────────────────────┐
│     Coordination Store (linearizable)       │
│  CAS · AtomicIncrement · Ephemeral keys     │
└─────────────────────────────────────────────┘

Unlike embedded consensus (Raft, Paxos), where every node participates in the consensus protocol, coordination-delegated systems separate the coordination plane from the data plane entirely. See 0-coordination-delegated-pattern.md for the full treatment of the pattern, its primitives, and its limitations.
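As one concrete example of the primitives above, ephemeral keys tie shared state to a worker's liveness, giving the store-side failure detection that workers themselves never implement. The class and method names below are illustrative, not taken from the repo:

```python
# Illustrative sketch: ephemeral keys bound to a session lease. When the
# lease lapses, the store deletes every key the session owned, so other
# workers observe the failure without running any consensus logic.
class CoordinationStore:
    def __init__(self):
        self.keys = {}            # key -> (value, owning session)
        self.live_sessions = set()

    def put_ephemeral(self, key, value, session_id):
        self.live_sessions.add(session_id)
        self.keys[key] = (value, session_id)

    def session_expired(self, session_id):
        """Store-side cleanup when a worker's lease lapses."""
        self.live_sessions.discard(session_id)
        self.keys = {k: v for k, v in self.keys.items() if v[1] != session_id}

store = CoordinationStore()
store.put_ephemeral("workers/w1", "alive", session_id="s1")
store.session_expired("s1")
print("workers/w1" in store.keys)  # False: the key vanished with the session
```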

Protocol Family

This repo formalizes a family of coordination-delegated protocols. The Leaderless Log Protocol is the headliner; additional protocols share the same pattern and coordination primitives.

| Layer | Document | Description |
|-------|----------|-------------|
| 0 | 0-coordination-delegated-pattern.md | The foundational pattern: abstract coordination primitives, properties, limitations, and reference implementations |
| 1 | 1-leaderless-log-protocol.md | Leaderless Log — distributed append-only log with concurrent writers, compaction, and readers |
| 2 | 2-coordination-delegated-task-claiming.md | Task Claiming — distributed task claiming with ephemeral locks and crash recovery |

Protocols 1 and 2 are independent and composable. Each instantiates the Layer 0 pattern with a specific subset of coordination primitives.

From Protocol to Implementation

This is where formally verified protocols pay off: you can write a system SPEC that instantiates a protocol for a specific system, hand that SPEC to a coding agent, and get a correct implementation in one shot.

┌───────────────────┐     ┌──────────────────────┐     ┌─────────────┐     ┌──────────────────┐
│  Protocol SPEC    │     │  System SPEC         │     │   Coding    │     │   Working        │
│  (formally        │  →  │  (language-agnostic, │  →  │   Agent     │  →  │   Implementation │
│   verified)       │     │   self-contained)    │     │             │     │                  │
└───────────────────┘     └──────────────────────┘     └─────────────┘     └──────────────────┘

The protocol SPEC guarantees correctness properties (safety, liveness) via model checking. The system SPEC translates those guarantees into a concrete design — domain model, state machines, coordination primitives, pseudocode, error handling, and a test matrix. Together, they give a coding agent everything it needs to produce a correct implementation without back-and-forth.

Directory Structure

├── 0-coordination-delegated-pattern.md      -- Layer 0: The pattern
├── 1-leaderless-log-protocol.md             -- Layer 1: Canonical spec (Leaderless Log)
├── 2-coordination-delegated-task-claiming.md -- Layer 2: Canonical spec (Task Claiming)
├── README.md                                -- This file
├── tlaplus/
│   ├── LeaderlessLog.tla                    -- TLA+ implementation of Protocol 1
│   ├── LeaderlessLog.cfg                    -- Small TLC config (fast, ~10^5 states)
│   ├── LeaderlessLog-medium.cfg             -- Medium TLC config (thorough, ~10^6-7 states)
│   ├── TaskClaiming.tla                     -- TLA+ implementation of Protocol 2
│   ├── TaskClaiming.cfg                     -- Small TLC config
│   └── TaskClaiming-medium.cfg              -- Medium TLC config
├── fizzbee/
│   ├── LeaderlessLog.fizz                   -- Fizzbee implementation of Protocol 1
│   └── TaskClaiming.fizz                    -- Fizzbee implementation of Protocol 2
└── examples/
    └── s3-queue/                            -- Example: S3-Queue (spec + Rust impl)

Spec-First Approach

The canonical SPEC docs (Markdown) define each protocol precisely. Both TLA+ and Fizzbee are independent implementations of the same spec. The spec doc is the source of truth.

  • If TLA+ and Fizzbee disagree on a property verdict, the spec doc resolves which is wrong
  • Every state variable, action, and property in the spec has a corresponding construct in both TLA+ and Fizzbee

Running the Models

TLA+ (TLC Model Checker)

Prerequisites: Java 17+ and TLA+ tools (tla2tools.jar). Quick setup: make tlaplus-install

Leaderless Log Protocol:

# Small config — fast checking (~10^5 states)
cd tlaplus
java -jar tla2tools.jar -config LeaderlessLog.cfg LeaderlessLog.tla

# Medium config — thorough checking (~10^6-7 states)
java -jar tla2tools.jar -config LeaderlessLog-medium.cfg LeaderlessLog.tla

Task Claiming Protocol:

# Small config
cd tlaplus
java -jar tla2tools.jar -config TaskClaiming.cfg TaskClaiming.tla

# Medium config
java -jar tla2tools.jar -config TaskClaiming-medium.cfg TaskClaiming.tla

TLC Config Parameters

Leaderless Log

| Parameter | Small | Medium | Description |
|-----------|-------|--------|-------------|
| Writers | {w1, w2} | {w1, w2, w3} | Writer processes |
| MaxOffset | 5 | 8 | Offset upper bound |
| CompactRangeStart | 1 | 1 | Compaction range start |
| CompactRangeEnd | 3 | 5 | Compaction range end |
| MaxBatch | 2 | 3 | Max records per write batch |

The MaxBatch parameter controls multi-record entry modeling:

  • MaxBatch = 1: Dense index (every offset has an entry) — equivalent to the original 1:1 model
  • MaxBatch > 1: Sparse index — writes non-deterministically choose batch sizes in 1..MaxBatch, producing entries that cover multiple offsets

Note: With MaxBatch > 1, some write sequences produce entries that don't align with CompactRangeEnd. For example, with MaxBatch=2 and CompactRangeEnd=3: writes of [1,2],[3,4] produce entries at offsets 2 and 4 — offset 3 has no entry boundary. The CompactStart guard naturally handles this by only firing when all entries are fully contained in the range.
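The boundary check described in the note can be sketched as follows, assuming entries are contiguous from offset 1 and keyed by their last covered offset (a simplification of the spec's CompactStart guard; the function names are invented here):

```python
# Hypothetical sketch of the MaxBatch alignment issue. Each write covers a
# batch of consecutive offsets; the index records one entry per batch.

def entry_end_offsets(batch_sizes):
    """Last offset covered by each batch, for writes starting at offset 1."""
    ends, next_offset = [], 1
    for size in batch_sizes:
        next_offset += size
        ends.append(next_offset - 1)
    return ends

def compact_can_start(ends, range_end):
    """Simplified guard: with contiguous entries from offset 1, every entry
    is fully inside [1, range_end] iff some entry ends exactly at range_end."""
    return any(e == range_end for e in ends)

# MaxBatch=2, CompactRangeEnd=3: batches [2, 2] put entries at offsets 2 and 4,
# so offset 3 falls mid-entry and the guard does not fire.
print(entry_end_offsets([2, 2]))            # [2, 4]
print(compact_can_start([2, 4], 3))         # False
# Batches [2, 1] give entries at offsets 2 and 3: the boundary aligns.
print(compact_can_start([2, 3], 3))         # True
```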

Task Claiming

| Parameter | Small | Medium | Description |
|-----------|-------|--------|-------------|
| Workers | {w1, w2, w3} | {w1, w2, w3, w4} | Worker processes |
| Tasks | {t1, t2} | {t1, t2, t3} | Tasks |
| MaxFailures | 2 | 3 | Max failures before DLQ |

Fizzbee

# Install Fizzbee CLI locally (one-time setup)
make fizzbee-install

# Run Fizzbee model checker on all specs
make fizzbee

# Or run individual specs directly
cd fizzbee && ~/.local/fizzbee/fizz LeaderlessLog.fizz
cd fizzbee && ~/.local/fizzbee/fizz TaskClaiming.fizz

You can also upload .fizz files to fizzbee.io for browser-based checking.

Expected Results

Protocol 1: Leaderless Log

| Property | Type | TLA+ | Fizzbee | Notes |
|----------|------|------|---------|-------|
| TypeOK | Invariant | PASS | — | TLA+ only |
| MonotonicOffsets | Safety | PASS | PASS | SequenceCounterPositive in TLA+ |
| FencedRejectsAppends | Safety | PASS | — | Uses ENABLED; omitted from Fizzbee |
| CompactionPreservesData | Safety | PASS | PASS | |
| NoPhantomEntries | Safety | PASS | PASS | |
| CursorConsistency | Safety | PASS | PASS | |
| NoOverlappingRanges | Safety | PASS | PASS | WAL-WAL only; WAL-COMPACTED overlap allowed |
| NoReaderError | Safety | PASS | PASS | |
| SequentialCompactionSafety | Safety | PASS | PASS | Compositionality: round 2 preserves round 1 |
| AppendProgress | Liveness | PASS | — | Requires fairness (TLA+ only) |
| CompactionCompletes | Liveness | PASS | — | Requires fairness (TLA+ only) |
| ReaderEventuallySucceeds | Liveness | PASS | — | Vacuously true; retained as regression guard |

All properties pass. The ReadEntry action correctly models that both WAL and COMPACTED entries are readable — compaction reorganizes data into compacted files but does not delete it. A reader implementation dispatches on entry file type to route reads to the appropriate storage backend.
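A minimal sketch of that dispatch, with hypothetical entry fields and in-memory dicts standing in for the two storage backends:

```python
# Illustrative reader dispatch: the index entry's file type decides which
# backend serves the read. Field names and backend shapes are assumptions.
def read_entry(entry, wal, compacted):
    backends = {"WAL": wal, "COMPACTED": compacted}
    store = backends.get(entry["type"])
    if store is None:
        raise ValueError(f"unknown entry type: {entry['type']}")
    return store[entry["file"]]

wal = {"wal-007": b"fresh record"}
compacted = {"c-001": b"compacted record"}
print(read_entry({"type": "WAL", "file": "wal-007"}, wal, compacted))        # b'fresh record'
print(read_entry({"type": "COMPACTED", "file": "c-001"}, wal, compacted))    # b'compacted record'
```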

Protocol 2: Task Claiming

| Property | Type | Expected |
|----------|------|----------|
| MutualExclusion | Safety | PASS |
| NoDoubleExecution | Safety | PASS |
| DLQOnlyAfterMaxFailures | Safety | PASS |
| LockConsistency | Safety | PASS |
| NoOrphanExecution | Safety | PASS |
| TaskCompletion | Liveness | PASS |
| CrashRecovery | Liveness | PASS |
| NoStarvation | Liveness | PASS |

All properties should pass for Protocol 2.
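MutualExclusion rests on the claim step being a single compare-and-swap on the task's lock key; a toy sketch of that step, with names invented here rather than taken from the spec:

```python
# Illustrative try-lock via CAS: claim a task only if its lock key is unowned.
# Because the store is linearizable, at most one of two racing workers sees
# the key as None, which is exactly the MutualExclusion property.
def try_lock(locks, task, worker):
    """CAS(locks[task], expected=None, new=worker)."""
    if locks.get(task) is None:
        locks[task] = worker
        return True
    return False

locks = {}
print(try_lock(locks, "t1", "w1"))  # True: w1 claims t1
print(try_lock(locks, "t1", "w2"))  # False: w2 loses the race
```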

Spec-to-Implementation Mapping

Protocol 1: Leaderless Log

| Spec Section | TLA+ Construct | Fizzbee Construct |
|--------------|----------------|-------------------|
| State: logIndex | VARIABLE logIndex | logIndex = {} |
| State: sequenceCounter | VARIABLE sequenceCounter | sequenceCounter = 1 |
| State: logState | VARIABLE logState | logState = OPEN |
| State: compactionCursor | VARIABLE compactionCursor | compactionCursor = 1 |
| State: writerState | VARIABLE writerState | writerState = {} |
| State: writerBatchSize | VARIABLE writerBatchSize | writerBatchSize = {} |
| State: compactorState | VARIABLE compactorState | compactorState = "IDLE" |
| Action 1: StartAppend | StartAppend(w) | action StartAppend |
| Action 2: WALWrite | WALWriteSuccess(w) / WALWriteFail(w) | action WALWrite (either/or) |
| Action 3: AssignOffset | AssignOffset(w) / AssignOffsetFenced(w) | action AssignOffset |
| Action 4: AppendComplete | AppendComplete(w) | action AppendComplete |
| Actions 5-9: Compaction | CompactStart through CompactReset | action CompactStart through action CompactReset |
| Actions 10-11: Fencing | FenceLog / UnfenceLog | action FenceLog / action UnfenceLog |
| Action 12: ReadEntry | ReadEntry(w, off) | action ReadEntry |
| Helper: HasCoveringEntry | HasCoveringEntry(off) | hasCoveringEntry(off) |
| Helper: CoveringEntryOffset | CoveringEntryOffset(off) | coveringEntryOffset(off) |
| Safety S2 | MonotonicOffsets | always assertion MonotonicOffsets |
| Safety S3 | FencedRejectsAppends | always assertion FencedRejectsAppends |
| Safety S4 | CompactionPreservesData | always assertion CompactionPreservesData |
| Safety S7 | NoOverlappingRanges | always assertion NoOverlappingRanges |
| Safety S8 | NoReaderError | always assertion NoReaderError |
| Safety S9 | SequentialCompactionSafety | always assertion SequentialCompactionSafety |
| Liveness L3 | ReaderEventuallySucceeds | — (liveness requires fairness; TLA+ only) |

Protocol 2: Task Claiming

| Spec Section | TLA+ Construct | Fizzbee Construct |
|--------------|----------------|-------------------|
| State: taskStatus | VARIABLE taskStatus | taskStatus = {} |
| State: lockState | VARIABLE lockState | lockState = {} |
| State: workerAlive | VARIABLE workerAlive | workerAlive = {} |
| State: workerState | VARIABLE workerState | workerState = {} |
| Action 1: ScanTasks | ScanTasks(w, t) | action ScanTasks |
| Action 2: TryLockTask | TryLockSuccess(w) / TryLockFail(w) | action TryLockSuccess / action TryLockFail |
| Action 3: ExecuteTask | ExecuteTaskSuccess(w) / ExecuteTaskFailure(w) | action ExecuteTask (either/or) |
| Action 4: UnlockTask | UnlockTask(w) | action UnlockTask |
| Actions 5-7: Crash/Recovery | WorkerCrash / SessionExpiry / WorkerRecover | Same action names |
| Safety S1 | MutualExclusion | always assertion MutualExclusion |
| Safety S2 | NoDoubleExecution | always assertion NoDoubleExecution |
| Safety S4 | LockConsistency | always assertion LockConsistency |
| Liveness L1 | TaskCompletion | always eventually assertion TaskCompletion |
| Liveness L2 | CrashRecovery | always eventually assertion CrashRecovery |

Cross-Implementation Validation

After running both TLA+ and Fizzbee on both protocols:

  1. Confirm that property verdicts agree between TLA+ and Fizzbee for every property modeled in both
  2. All safety properties must pass in both protocols and both tools
  3. All liveness properties must pass wherever they are checked (some liveness properties are TLA+-only; see Expected Results)

If results disagree, consult the canonical spec doc to determine which implementation has the bug.

License

This project is licensed under the Apache License, Version 2.0 — see the LICENSE file for the full text.

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
