-
Notifications
You must be signed in to change notification settings - Fork 2
IO Persistence Engine
codeQ keeps every task, every result, every queue position, and every secondary index in a single embedded log-structured merge-tree. A broker that loses durability after a crash is a broker that loses jobs, so the persistence engine is the load-bearing wall of the system. This page describes that engine the way an operator needs to understand it: what shape data takes on disk, how a write crosses from a Go function call into a durable file, and which knobs trade latency for throughput.
The engine is implemented as a thin wrapper over Pebble, the LSM key-value library CockroachDB extracted from RocksDB and rewrote in Go. Pebble is a means, not the message. What matters is the contract: ordered byte keys, atomic batch writes, point reads and range scans, and a write-ahead log that survives kill -9. The wrapper lives at internal/repository/pebble/db.go:83-101 and exposes that contract to the rest of codeQ through a handful of methods — Get, Has, Set, Delete, Batch, CommitBatch, Iter — that the repositories build queues, leases, idempotency tables, and TTL reapers on top of.
A log-structured merge-tree is a write-optimised data structure that turns random updates into sequential disk traffic. Every write lands first in an in-memory sorted map called the memtable. When the memtable fills, the engine seals it, makes it immutable, and starts a fresh one for new writes. The sealed memtable flushes to disk as an immutable file — a sorted string table, or SST — that lives at level zero of the tree. Background compaction threads then merge level-zero SSTs down into level one, level one into level two, and so on through level six. Each level holds roughly ten times the data of the level above it, so the lower levels are fewer in number but vastly larger in size.
Reads search top-down: first the active memtable, then the immutable memtable being flushed, then every level-zero SST (they overlap in key range so the engine has to check each one), then a single SST per lower level (the engine knows which one because lower-level SSTs partition the key space). Bloom filters attached to every SST short-circuit the search for keys that aren't present: a negative answer from a level's bloom skips that level's disk read entirely. The engine in codeQ enables bloom filters at every level with ten bits per key, which is the textbook trade-off — about one percent false-positive rate, ten bits of RAM per indexed key. The configuration sits at internal/repository/pebble/db.go:147-153.
The block cache lives above the SST files. codeQ sizes it at 256 MiB by default (internal/repository/pebble/db.go:147); that's enough to keep the hot working set of a single-node broker resident even under sustained read load. Operators running heavier read workloads can lift the cache to several gigabytes; the engine's memory pressure stays predictable because the cache is the dominant consumer outside the memtable itself.
The disk story is the write-ahead log. Every batch the broker commits is appended to the WAL before the memtable mutation is visible. On crash, the engine replays the WAL on startup, reconstructs the memtable, and resumes. WAL writes are sequential, which is why an LSM tree can sustain six-digit write rates per second on commodity SSDs: the bottleneck is the page cache and the fsync, not the seek time. Compactions, by contrast, are slower and run in the background; they trade CPU and disk bandwidth for read performance over time. A healthy engine has compactions keeping pace with ingest — falling behind is the only operational pathology that breaks the LSM model.
A repository call such as taskRepository.Save(task) builds a *pebbledb.Batch, calls batch.Set for every key the operation touches — the task body, the pending-queue index, the idempotency entry, the TTL reaper index — and then hands the batch to db.CommitBatch(b). From the caller's point of view this is a synchronous, atomic write: either every key in the batch lands or none of them do, and the function returns the commit error if any.
What happens behind that call is the interesting part. CommitBatch does not itself talk to the engine. It submits the batch to a single coalescer goroutine over a buffered channel and blocks on a per-request done channel (internal/repository/pebble/db.go:322-339). The coalescer pops the first request, then opportunistically drains the channel up to maxMergeBatch=64 more requests that have already queued. Every drained batch is merged into a single Pebble batch with merged.Apply(other, nil); the operations are appended verbatim. Once the merge stops, the coalescer issues exactly one merged.Commit(pebbledb.NoSync) and fans the resulting error out to every submitter (internal/repository/pebble/db.go:351-401).
The reason this design exists is mutex contention inside the engine. Pebble serialises commits through an internal structure called the commit pipeline, which holds a single mutex while it appends to the WAL, advances the sequence number, and links the batch into the memtable. Phase-zero profiling on codeQ pinned this mutex at 96 percent of total mutex time and 44 percent of block time at 26,000 requests per second. Every Commit call was queueing for the same lock. Merging sixty-four small batches into one big batch collapses sixty-four lock acquisitions into one. The constants are documented inline at internal/repository/pebble/db.go:117-129; the full discussion of why this matters lives on IO Group Commit Coalescer.
When a replicator is attached — i.e., codeQ is running with raft enabled — the coalescer is bypassed and every write flows through repl.Replicate(batch.Repr()) instead. The raft layer has its own apply coalescer with a similar shape but a higher batch ceiling. The decision lives in the same CommitBatch (internal/repository/pebble/db.go:322-331); the raft side is documented on IO Raft Replication.
Every byte of broker state lives under the codeq/ namespace so the underlying engine can be shared with unrelated data without prefix collisions. Within that namespace the keys are structured so that range scans return data in the right order without needing a separate index. The layout is documented in internal/repository/pebble/keys.go:25-59.
The flat keys are straightforward — codeq/tasks/<id> for a task body, codeq/results/<id> for a result record, codeq/idempo/<key> for an idempotency mapping, codeq/lease/<id> for a worker lease entry. The interesting structure is in the queue prefixes. A pending entry takes the shape codeq/q/<cmd>/<tenant>/pending/<prio_be1>/<seq_be8>/<id>. The priority is a single big-endian byte; the sequence number is a monotonic uint64 packed big-endian. Because the bytes sort lexicographically, a range scan over codeq/q/<cmd>/<tenant>/pending/ yields tasks in priority order first, then in insertion order within each priority bucket. The repository does not need a secondary index, a heap, or a sort step — the key encoding does the work.
Delayed entries follow the same pattern with a unix-second score in place of the priority/seq pair: codeq/q/<cmd>/<tenant>/delayed/<score_be8>/<id>. The reaper sweeps these with a half-open range scan of [delayed/, delayed/<now+1>) (internal/repository/pebble/keys.go:155-163). TTL keys are codeq/ttl/<expire_unix_be8>/<id> and follow the same convention — the reaper pops the lowest score first, no secondary structure needed.
The empty tenant case is encoded as the literal _ so that splitting on / never loses an empty position (internal/repository/pebble/keys.go:58-66). Commands are lowercased before being used as a key segment.
The engine commits with pebbledb.NoSync by default. NoSync writes the WAL but does not call fsync(2) afterwards — the bytes reach the OS page cache, not the platter. A clean shutdown still flushes; a kill -9 flushes; a kernel panic or power loss does not. For task queue semantics this is usually the right call: the producer's HTTP retry on failure plus the broker's idempotency table cover the small window where the very last in-flight batches might be lost. Throughput, in exchange, is dramatically higher.
Operators who need strict durability flip the FsyncOnCommit knob in the engine options (internal/repository/pebble/db.go:133-138). The trade-off is not subtle. The table below sketches the orders of magnitude observed on the codeQ bench harness; exact numbers will vary by disk, but the shape is universal.
| Setting | Per-commit cost | Throughput ceiling | What survives |
|---|---|---|---|
| NoSync (default) | ~5 µs | 76,000 tasks/s (single node) | Process crash, panic, clean shutdown |
| Sync (fsync) | ~1-3 ms | a few thousand tasks/s | Process crash, panic, power loss, kernel panic |
The 76,000 tasks-per-second figure comes from the single-node profile_full_cycle benchmark at internal/bench/profile_full_cycle_test.go; the workload is a full enqueue + claim + complete cycle, so the actual write rate on the engine is roughly triple that. With fsync enabled the same workload drops to a few thousand cycles per second because each commit waits for the platter. There is no middle ground in Pebble itself; the choice is binary at commit time. What sits between them in practice is the group-commit coalescer, which amortises whichever cost the operator chose across many submitters. With fsync on, sixty-four batches still share one fsync, which makes durable mode usable for workloads that genuinely require it.
When the engine opens, two recovery passes run before the broker accepts traffic. The first is Pebble's own WAL replay, which reconstructs the active memtable from the log on disk. The second is codeQ-specific: recoverSeq scans every pending-queue key and seeds the in-memory sequence counter to one past the highest seq it finds (internal/repository/pebble/db.go:200-230). This is what prevents a restart from issuing duplicate sequence numbers and breaking the FIFO-within-priority ordering of the pending queue. The cost is linear in the pending queue size, which for the workloads codeQ targets is well below the WAL replay time anyway.
The lease table goes through its own recovery in TaskRepository.recoverLeases at startup — workers that held leases at crash time get them re-instated in memory so the lease reaper can expire them on schedule. The TTL reaper picks up where it left off because its index keys are already on disk; no in-memory state to rebuild.
The engine holds an exclusive file lock on its directory. Two codeQ processes pointing at the same data directory will fail to open the second one with a clear error message. This is intentional — the LSM tree is single-writer by design, and the replication path (raft) is how multiple processes share state, not file locking.
The earlier broker design used a network KV with its own server process. The persistence engine described here is embedded — it runs in the same process as the broker, on the same node, with no network hop and no separate daemon. The cost is operational: one codeQ process per data directory, so horizontal scaling at the data layer happens through clustering (replication and sharding), not through pointing many brokers at one shared store. The benefit is performance and operational simplicity. There is no second daemon to deploy, no second healthcheck to wire up, no second restart loop to monitor. The broker's healthcheck doubles as the engine's healthcheck.
The two corollaries are worth stating explicitly. First, pub/sub no longer crosses the network — the SubscriptionRepository uses in-process channels, so events do not propagate to other broker instances. Cross-broker propagation, where needed, happens through the raft log. Second, TTL is no longer a server-side feature; codeQ runs its own background reapers — one for leases, one for TTL keys — that range-scan the appropriate index and delete expired entries on a tick. The reaper interval and concurrency are configurable.
The engine exposes a handful of metrics on the Prometheus endpoint at :9091. The most useful for diagnosing write-path problems are the LSM amplification ratios (write amplification and read amplification per level), the memtable size and flush count, the compaction queue depth, and the WAL fsync histogram. A healthy broker shows the memtable cycling steadily, level-zero SST count staying below the compaction trigger, and compactions completing well within their interval budget. A pathological broker shows level-zero SST count climbing, compactions falling behind, and write stalls appearing in the bench output. The remedy is almost always to give the engine more disk bandwidth or to reduce the write rate; tuning the level multipliers is a last resort and usually not necessary at codeQ's working scales.
For request-level visibility the broker's /metrics endpoint also exposes per-repository operation histograms — task.save, task.claim, result.save — that let the operator correlate engine pressure with the HTTP-level workload that caused it. The interplay between the coalescer, the engine, and the surrounding repositories is covered on Concepts Persistence Architecture and the operational specifics of monitoring lag and queue depth live on Monitoring Persistence.
The group-commit coalescer is described in depth on IO Group Commit Coalescer. The raft replication layer that sits on top of the engine in cluster mode lives on IO Raft Replication, and the multiplexed transport that ties raft groups across nodes is on IO Mux Transport. The repository surfaces that build queues, idempotency, leases, and TTL on top of the engine are covered on Concepts Queues, Concepts Idempotency, and Concepts Leases. The benchmark methodology that produced the 76,000 tasks-per-second figure is on Benchmarks Single Node.
Source: github.com/osvaldoandrade/codeq.
- Overview
- Tasks and Results
- Queue Model
- Sharding
- Leases and Ownership
- Multi-Tenancy
- Authentication and Authorization
- Persistence Engine
- Consensus and Replication
- Cluster-Level Failover
- Deployment Modes
- Architecture Overview