A formally verified distributed append-only log protocol with concurrent writers, compaction, and readers — built on the coordination-delegated pattern, where stateless workers offload all coordination to an external linearizable store.
The Leaderless Log Protocol implements a distributed append-only log where any writer can append without being elected leader. There is no Raft, no Paxos, no leader election. Instead, writers are stateless processes that delegate ordering and index management to an external linearizable coordination store:
- Monotonic offsets are assigned via the store's `AtomicIncrement` (a sequence counter)
- Log index is maintained as linearizable key-value entries with atomic CAS
- Compaction rewrites ranges of WAL entries into compacted files via a 3-step CAS update
- Fencing prevents stale writers from corrupting the log after a takeover
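The shape of this append path can be sketched in a few lines. The snippet below is illustrative only — a plain dict stands in for the external linearizable store, and the CAS is a bare assert; see 1-leaderless-log-protocol.md for the canonical protocol:

```python
# Illustrative sketch only: a dict stands in for the external linearizable
# store. See 1-leaderless-log-protocol.md for the canonical append path.
store = {"seq": 0, "logState": "OPEN", "index": {}}
wal = []  # stand-in for durable WAL storage

def append(records):
    if store["logState"] != "OPEN":      # fencing: stale writers are rejected
        raise RuntimeError("log is fenced")
    wal.append(records)                  # 1. durable WAL write (data plane)
    wal_loc = len(wal) - 1
    store["seq"] += len(records)         # 2. AtomicIncrement assigns offsets
    end = store["seq"]
    start = end - len(records) + 1
    # 3. CAS-publish the index entry, keyed by its last offset; the key
    #    must not already exist (CAS from "absent" to the new entry)
    assert end not in store["index"]
    store["index"][end] = {"range": (start, end), "type": "WAL", "loc": wal_loc}
    return (start, end)

print(append(["a", "b"]))  # (1, 2)
print(append(["c"]))       # (3, 3)
```

Note how no writer ever talks to another writer — ordering falls out of the store's sequence counter alone.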
The protocol models three interacting sub-protocols — writer append path, compaction index update, and reader path — and is verified in both TLA+ and Fizzbee against safety and liveness properties (see Expected Results).
Full canonical specification: 1-leaderless-log-protocol.md.
The Leaderless Log Protocol formalized in this repo is the protocol used in Ursa, the lakehouse-native streaming engine from StreamNative. The formal spec here is the distilled, model-checked core of the protocol that Ursa runs in production.
- VLDB 2025 paper — Ursa: A Lakehouse-Native Streaming System: https://www.vldb.org/pvldb/vol18/p5184-guo.pdf
- Ursa for Kafka (engineering blog): https://streamnative.io/blog/ursa-for-kafka-native-apache-kafka-service-on-lakestream
S3-Queue is a distributed message queue built entirely on S3-compatible object storage — no external coordination store required. It instantiates the Leaderless Log Protocol with S3 conditional writes as the coordination primitive.
| Protocol | Leaderless Log |
|---|---|
| System SPEC | `examples/s3-queue/SPEC.md` — 1,100+ lines covering the full design |
| Implementation | Rust CLI (`s3q`) — generated by a coding agent from the SPEC alone |
Try it yourself — tell your favorite coding agent:
Implement S3-Queue according to the following spec: https://github.com/lakestream-io/leaderless-log-protocol/blob/main/examples/s3-queue/SPEC.md
See examples/ for the full list of examples.
The Leaderless Log Protocol is one instance of a broader architectural pattern: coordination-delegated distributed systems. In this pattern, worker nodes carry no consensus logic — they are stateless processes that offload all coordination decisions (ordering, locking, leader election, failure detection) to an external linearizable store. The store becomes the single source of truth for shared state, while workers focus purely on data-path operations.
┌───────────┐   ┌───────────┐   ┌───────────┐
│ Worker A  │   │ Worker B  │   │ Worker C  │
│(stateless)│   │(stateless)│   │(stateless)│
└─────┬─────┘   └─────┬─────┘   └─────┬─────┘
      │               │               │
      ▼               ▼               ▼
┌──────────────────────────────────────────┐
│ Coordination Store (linearizable) │
│ CAS · AtomicIncrement · Ephemeral keys │
└──────────────────────────────────────────┘
Unlike embedded consensus (Raft, Paxos), where every node participates in the consensus protocol, coordination-delegated systems separate the coordination plane from the data plane entirely. See 0-coordination-delegated-pattern.md for the full treatment of the pattern, its primitives, and its limitations.
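The API surface of such a store is small. The following is a toy in-memory model of the three primitives in the diagram above — illustrative only; real deployments use a linearizable store (for example etcd, ZooKeeper, or S3 conditional writes as in the S3-Queue example), and method names here are hypothetical:

```python
class CoordinationStore:
    """Toy in-memory model of the coordination primitives. Not actually
    linearizable across processes -- it illustrates the API shape only."""

    def __init__(self):
        self.data = {}
        self.ephemeral = {}  # key -> owning session

    def compare_and_set(self, key, expected, new):
        """Atomically write `new` iff the current value equals `expected`."""
        if self.data.get(key) != expected:
            return False
        self.data[key] = new
        return True

    def atomic_increment(self, key, by=1):
        """Atomically bump a counter, returning the new value."""
        self.data[key] = self.data.get(key, 0) + by
        return self.data[key]

    def put_ephemeral(self, key, value, session):
        """Create a key tied to a session; expiry deletes it automatically."""
        if key in self.ephemeral:
            return False
        self.data[key] = value
        self.ephemeral[key] = session
        return True

    def expire_session(self, session):
        """Session expiry (e.g., worker crash): its ephemeral keys vanish."""
        for key in [k for k, s in self.ephemeral.items() if s == session]:
            del self.ephemeral[key]
            del self.data[key]
```

A log writer would call `atomic_increment` for offsets and `compare_and_set` to publish index entries; a task-claiming worker would hold its lock as an ephemeral key.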
This repo formalizes a family of coordination-delegated protocols. The Leaderless Log Protocol is the headliner; additional protocols share the same pattern and coordination primitives.
| Layer | Document | Description |
|---|---|---|
| 0 | 0-coordination-delegated-pattern.md | The foundational pattern: abstract coordination primitives, properties, limitations, and reference implementations |
| 1 | 1-leaderless-log-protocol.md | Leaderless Log — distributed append-only log with concurrent writers, compaction, and readers |
| 2 | 2-coordination-delegated-task-claiming.md | Task Claiming — distributed task claiming with ephemeral locks and crash recovery |
Protocols 1 and 2 are independent and composable. Each instantiates the Layer 0 pattern with a specific subset of coordination primitives.
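As a rough sketch of Protocol 2's claim/execute/crash cycle — the names and in-memory state below are hypothetical stand-ins for the store's ephemeral-lock primitives, not the spec's actual API:

```python
# Hypothetical sketch of Task Claiming (Protocol 2); real workers do this
# against the coordination store's ephemeral keys, not local dicts.
tasks = {"t1": "PENDING", "t2": "PENDING"}
locks = {}                    # task -> worker holding the ephemeral lock
failures = {"t1": 0, "t2": 0}
MAX_FAILURES = 2              # cf. the MaxFailures model parameter

def try_claim(worker, task):
    # TryLockTask: atomic create of an ephemeral lock key
    if task in locks or tasks[task] != "PENDING":
        return False
    locks[task] = worker
    return True

def execute(worker, task, succeed):
    assert locks.get(task) == worker      # MutualExclusion holds here
    if succeed:
        tasks[task] = "DONE"
    else:
        failures[task] += 1
        if failures[task] >= MAX_FAILURES:
            tasks[task] = "DLQ"           # dead-letter after max failures
    del locks[task]                       # UnlockTask

def crash(worker):
    # SessionExpiry: the crashed worker's ephemeral locks vanish,
    # leaving its tasks claimable again (CrashRecovery)
    for task in [t for t, w in locks.items() if w == worker]:
        del locks[task]

try_claim("w1", "t1")
crash("w1")                   # w1 dies mid-task; its lock is released
assert try_claim("w2", "t1")  # another worker can take over
```

The crash path is the interesting part: because the lock is ephemeral, no explicit failure detector or leader is needed to reassign the task.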
The real value of formally verified protocols: you can write a system SPEC that instantiates a protocol for a specific system, hand that SPEC to a coding agent, and get a correct implementation in one shot.
┌───────────────┐     ┌─────────────────────┐     ┌────────┐     ┌────────────────┐
│ Protocol SPEC │     │ System SPEC         │     │ Coding │     │ Working        │
│ (formally     │  →  │ (language-agnostic, │  →  │ Agent  │  →  │ Implementation │
│  verified)    │     │  self-contained)    │     │        │     │                │
└───────────────┘     └─────────────────────┘     └────────┘     └────────────────┘
The protocol SPEC guarantees correctness properties (safety, liveness) via model checking. The system SPEC translates those guarantees into a concrete design — domain model, state machines, coordination primitives, pseudocode, error handling, and a test matrix. Together, they give a coding agent everything it needs to produce a correct implementation without back-and-forth.
├── 0-coordination-delegated-pattern.md -- Layer 0: The pattern
├── 1-leaderless-log-protocol.md -- Layer 1: Canonical spec (Leaderless Log)
├── 2-coordination-delegated-task-claiming.md -- Layer 2: Canonical spec (Task Claiming)
├── README.md -- This file
├── tlaplus/
│ ├── LeaderlessLog.tla -- TLA+ implementation of Protocol 1
│ ├── LeaderlessLog.cfg -- Small TLC config (fast, ~10^5 states)
│ ├── LeaderlessLog-medium.cfg -- Medium TLC config (thorough, ~10^6-7 states)
│ ├── TaskClaiming.tla -- TLA+ implementation of Protocol 2
│ ├── TaskClaiming.cfg -- Small TLC config
│ └── TaskClaiming-medium.cfg -- Medium TLC config
├── fizzbee/
│ ├── LeaderlessLog.fizz -- Fizzbee implementation of Protocol 1
│ └── TaskClaiming.fizz -- Fizzbee implementation of Protocol 2
└── examples/
└── s3-queue/ -- Example: S3-Queue (spec + Rust impl)
The canonical SPEC docs (Markdown) define each protocol precisely. Both TLA+ and Fizzbee are independent implementations of the same spec. The spec doc is the source of truth.
- If TLA+ and Fizzbee disagree on a property verdict, the spec doc resolves which is wrong
- Every state variable, action, and property in the spec has a corresponding construct in both TLA+ and Fizzbee
Prerequisites: Java 17+ and the TLA+ tools (`tla2tools.jar`). Quick setup: `make tlaplus-install`
Leaderless Log Protocol:
# Small config — fast checking (~10^5 states)
cd tlaplus
java -jar tla2tools.jar -config LeaderlessLog.cfg LeaderlessLog.tla
# Medium config — thorough checking (~10^6-7 states)
java -jar tla2tools.jar -config LeaderlessLog-medium.cfg LeaderlessLog.tla

Task Claiming Protocol:
# Small config
cd tlaplus
java -jar tla2tools.jar -config TaskClaiming.cfg TaskClaiming.tla
# Medium config
java -jar tla2tools.jar -config TaskClaiming-medium.cfg TaskClaiming.tla

| Parameter | Small | Medium | Description |
|---|---|---|---|
| `Writers` | `{w1, w2}` | `{w1, w2, w3}` | Writer processes |
| `MaxOffset` | 5 | 8 | Offset upper bound |
| `CompactRangeStart` | 1 | 1 | Compaction range start |
| `CompactRangeEnd` | 3 | 5 | Compaction range end |
| `MaxBatch` | 2 | 3 | Max records per write batch |
The `MaxBatch` parameter controls multi-record entry modeling:

- `MaxBatch = 1`: Dense index (every offset has an entry) — equivalent to the original 1:1 model
- `MaxBatch > 1`: Sparse index — writes non-deterministically choose batch sizes in `1..MaxBatch`, producing entries that cover multiple offsets
> **Note:** With `MaxBatch > 1`, some write sequences produce entries that don't align with `CompactRangeEnd`. For example, with `MaxBatch = 2` and `CompactRangeEnd = 3`, writes of `[1,2]` and `[3,4]` produce entries at offsets 2 and 4 — offset 3 has no entry boundary. The `CompactStart` guard naturally handles this by only firing when all entries are fully contained in the range.
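The sparse-index lookup and the containment guard can be made concrete with a small sketch — function names here are illustrative (the spec's helpers are `HasCoveringEntry`/`CoveringEntryOffset`), and the index is the two-entry example from the note above:

```python
# Sparse index from writes [1,2] then [3,4] with MaxBatch = 2: each entry
# is keyed by its last offset and covers a contiguous offset range.
index = {2: (1, 2), 4: (3, 4)}   # last offset -> (start, end)

def covering_entry_offset(off):
    """Return the key of the entry covering `off`, or None."""
    for key, (start, end) in sorted(index.items()):
        if start <= off <= end:
            return key
    return None

def range_fully_covered(range_start, range_end):
    """CompactStart-style guard: every entry that intersects the range
    must be fully contained in it."""
    for start, end in index.values():
        intersects = start <= range_end and end >= range_start
        contained = range_start <= start and end <= range_end
        if intersects and not contained:
            return False
    return True

print(covering_entry_offset(3))   # 4 -- the (3, 4) entry covers offset 3
print(range_fully_covered(1, 3))  # False -- entry (3, 4) straddles the range end
print(range_fully_covered(1, 4))  # True
```

With the small config's `CompactRangeEnd = 3`, the guard simply stays disabled for this write sequence, so no unsound compaction is ever attempted.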
| Parameter | Small | Medium | Description |
|---|---|---|---|
| `Workers` | `{w1, w2, w3}` | `{w1, w2, w3, w4}` | Worker processes |
| `Tasks` | `{t1, t2}` | `{t1, t2, t3}` | Tasks |
| `MaxFailures` | 2 | 3 | Max failures before DLQ |
# Install Fizzbee CLI locally (one-time setup)
make fizzbee-install
# Run Fizzbee model checker on all specs
make fizzbee
# Or run individual specs directly
cd fizzbee && ~/.local/fizzbee/fizz LeaderlessLog.fizz
cd fizzbee && ~/.local/fizzbee/fizz TaskClaiming.fizz

You can also upload `.fizz` files to fizzbee.io for browser-based checking.
| Property | Type | TLA+ | Fizzbee | Notes |
|---|---|---|---|---|
| `TypeOK` | Invariant | PASS | — | TLA+ only |
| `MonotonicOffsets` | Safety | PASS | PASS | `SequenceCounterPositive` in TLA+ |
| `FencedRejectsAppends` | Safety | PASS | — | Uses `ENABLED`; omitted from Fizzbee |
| `CompactionPreservesData` | Safety | PASS | PASS | |
| `NoPhantomEntries` | Safety | PASS | PASS | |
| `CursorConsistency` | Safety | PASS | PASS | |
| `NoOverlappingRanges` | Safety | PASS | PASS | WAL-WAL only; WAL-COMPACTED overlap allowed |
| `NoReaderError` | Safety | PASS | PASS | |
| `SequentialCompactionSafety` | Safety | PASS | PASS | Compositionality: round 2 preserves round 1 |
| `AppendProgress` | Liveness | PASS | — | Requires fairness (TLA+ only) |
| `CompactionCompletes` | Liveness | PASS | — | Requires fairness (TLA+ only) |
| `ReaderEventuallySucceeds` | Liveness | PASS | — | Vacuously true; retained as regression guard |
All properties pass. The ReadEntry action correctly models that both WAL and COMPACTED entries are readable — compaction reorganizes data into compacted files but does not delete it. A reader implementation dispatches on entry file type to route reads to the appropriate storage backend.
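That dispatch can be sketched as follows — the entry shape, backend mapping, and sample data are made up for illustration, not taken from the `s3q` implementation:

```python
# Illustrative reader dispatch: the covering entry's file type routes the
# read to WAL storage or compacted-file storage.
def read(off, index, wal_backend, compacted_backend):
    entry = next((e for e in index.values()
                  if e["start"] <= off <= e["end"]), None)
    if entry is None:
        raise KeyError(f"no entry covers offset {off}")
    backend = wal_backend if entry["type"] == "WAL" else compacted_backend
    return backend[entry["loc"]][off - entry["start"]]

# Offsets 1-2 were compacted; offset 3 is still a WAL entry.
index = {
    2: {"start": 1, "end": 2, "type": "COMPACTED", "loc": "c0"},
    3: {"start": 3, "end": 3, "type": "WAL", "loc": "w0"},
}
wal_files = {"w0": ["rec3"]}
compacted_files = {"c0": ["rec1", "rec2"]}

print(read(1, index, wal_files, compacted_files))  # rec1 (from compacted file)
print(read(3, index, wal_files, compacted_files))  # rec3 (still in the WAL)
```

Either way the record is returned — mirroring the `NoReaderError` property: compaction changes where data lives, never whether it is readable.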
| Property | Type | Expected |
|---|---|---|
| `MutualExclusion` | Safety | PASS |
| `NoDoubleExecution` | Safety | PASS |
| `DLQOnlyAfterMaxFailures` | Safety | PASS |
| `LockConsistency` | Safety | PASS |
| `NoOrphanExecution` | Safety | PASS |
| `TaskCompletion` | Liveness | PASS |
| `CrashRecovery` | Liveness | PASS |
| `NoStarvation` | Liveness | PASS |
All properties should pass for Protocol 2.
| Spec Section | TLA+ Construct | Fizzbee Construct |
|---|---|---|
| State: `logIndex` | `VARIABLE logIndex` | `logIndex = {}` |
| State: `sequenceCounter` | `VARIABLE sequenceCounter` | `sequenceCounter = 1` |
| State: `logState` | `VARIABLE logState` | `logState = OPEN` |
| State: `compactionCursor` | `VARIABLE compactionCursor` | `compactionCursor = 1` |
| State: `writerState` | `VARIABLE writerState` | `writerState = {}` |
| State: `writerBatchSize` | `VARIABLE writerBatchSize` | `writerBatchSize = {}` |
| State: `compactorState` | `VARIABLE compactorState` | `compactorState = "IDLE"` |
| Action 1: StartAppend | `StartAppend(w)` | `action StartAppend` |
| Action 2: WALWrite | `WALWriteSuccess(w)` / `WALWriteFail(w)` | `action WALWrite` (either/or) |
| Action 3: AssignOffset | `AssignOffset(w)` / `AssignOffsetFenced(w)` | `action AssignOffset` |
| Action 4: AppendComplete | `AppendComplete(w)` | `action AppendComplete` |
| Actions 5-9: Compaction | `CompactStart` through `CompactReset` | `action CompactStart` through `action CompactReset` |
| Actions 10-11: Fencing | `FenceLog` / `UnfenceLog` | `action FenceLog` / `action UnfenceLog` |
| Action 12: ReadEntry | `ReadEntry(w, off)` | `action ReadEntry` |
| Helper: `HasCoveringEntry` | `HasCoveringEntry(off)` | `hasCoveringEntry(off)` |
| Helper: `CoveringEntryOffset` | `CoveringEntryOffset(off)` | `coveringEntryOffset(off)` |
| Safety S2 | `MonotonicOffsets` | `always assertion MonotonicOffsets` |
| Safety S3 | `FencedRejectsAppends` | `always assertion FencedRejectsAppends` |
| Safety S4 | `CompactionPreservesData` | `always assertion CompactionPreservesData` |
| Safety S7 | `NoOverlappingRanges` | `always assertion NoOverlappingRanges` |
| Safety S8 | `NoReaderError` | `always assertion NoReaderError` |
| Safety S9 | `SequentialCompactionSafety` | `always assertion SequentialCompactionSafety` |
| Liveness L3 | `ReaderEventuallySucceeds` | — (liveness requires fairness; TLA+ only) |
| Spec Section | TLA+ Construct | Fizzbee Construct |
|---|---|---|
| State: `taskStatus` | `VARIABLE taskStatus` | `taskStatus = {}` |
| State: `lockState` | `VARIABLE lockState` | `lockState = {}` |
| State: `workerAlive` | `VARIABLE workerAlive` | `workerAlive = {}` |
| State: `workerState` | `VARIABLE workerState` | `workerState = {}` |
| Action 1: ScanTasks | `ScanTasks(w, t)` | `action ScanTasks` |
| Action 2: TryLockTask | `TryLockSuccess(w)` / `TryLockFail(w)` | `action TryLockSuccess` / `action TryLockFail` |
| Action 3: ExecuteTask | `ExecuteTaskSuccess(w)` / `ExecuteTaskFailure(w)` | `action ExecuteTask` (either/or) |
| Action 4: UnlockTask | `UnlockTask(w)` | `action UnlockTask` |
| Actions 5-7: Crash/Recovery | `WorkerCrash` / `SessionExpiry` / `WorkerRecover` | Same action names |
| Safety S1 | `MutualExclusion` | `always assertion MutualExclusion` |
| Safety S2 | `NoDoubleExecution` | `always assertion NoDoubleExecution` |
| Safety S4 | `LockConsistency` | `always assertion LockConsistency` |
| Liveness L1 | `TaskCompletion` | `always eventually assertion TaskCompletion` |
| Liveness L2 | `CrashRecovery` | `always eventually assertion CrashRecovery` |
After running both TLA+ and Fizzbee on both protocols:
- Confirm all property verdicts agree between TLA+ and Fizzbee
- All safety properties must pass in both protocols and both tools
- All liveness properties must pass in both protocols and both tools
If results disagree, consult the canonical spec doc to determine which implementation has the bug.
This project is licensed under the Apache License, Version 2.0 — see the LICENSE file for the full text.
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.