Skip to content

v0.3.0 — Levels, Compaction, and a Frozen Format

Pre-release
Pre-release

Choose a tag to compare

@jamesgober jamesgober released this 07 Jun 05:17
· 8 commits to main since this release

lsm-db v0.3.0 — Levels, Compaction, and a Frozen Format

The real engine. v0.3.0 turns the single-run foundation into a proper
log-structured merge store: flushes append immutable sorted runs, reads merge
across the memtable and every run, and a background thread compacts runs into one
when they accumulate — concurrent with reads and writes. A manifest makes the run
set crash-recoverable, and the on-disk format is now frozen for the 1.x
series
.

The Tier-1 API is unchanged; the only public addition is configuration.

What is lsm-db?

A log-structured merge-tree storage engine for Rust — the write path that powers
RocksDB, LevelDB, Cassandra, and ScyllaDB, packaged as a small, audited library.
It is the storage layer the portfolio's database crates (txn-db, Hive DB) build
on, so the durability and read/write contract is implemented and tested once.

What's new in 0.3.0

Multiple runs and a newest-wins merge

Each flush now writes a new immutable run instead of rewriting one. Point reads
consult the memtable, then each run from newest to oldest; range
scans merge the memtable and every run into one
ascending stream, with the newest source winning per key and tombstones resolved
away. The merge surfaces a corruption error rather than silently truncating if a
run's block fails its checksum.

# fn main() -> Result<(), Box<dyn std::error::Error>> {
# let db = lsm_db::Lsm::open(tempfile::tempdir()?.path())?;
db.put(b"k", b"first")?;
db.flush()?;          // run 1
db.put(b"k", b"second")?;
db.flush()?;          // run 2 — newer
assert_eq!(db.get(b"k")?, Some(b"second".to_vec())); // newest wins
# Ok(())
# }

Background compaction

A dedicated thread merges the runs into one when their count reaches the
configured trigger (default four). The expensive merge runs with no lock
held
; only the final swap takes the engine lock, so compaction never blocks
reads or writes for the duration of the merge. A run superseded by compaction is
reference counted — its file is deleted only once the last reader still holding it
has finished. Dropping the Lsm stops and joins the compactor.

use lsm_db::{Lsm, LsmConfig};
# fn main() -> Result<(), Box<dyn std::error::Error>> {
// Compact whenever six runs accumulate.
let db = Lsm::open_with(
    tempfile::tempdir()?.path(),
    LsmConfig::new().compaction_trigger(6),
)?;
# let _ = db;
# Ok(())
# }

A frozen, block-structured on-disk format

Runs are now block-structured: data blocks (~4 KiB) with a block index, a
per-block CRC32C verified on read, and a per-index CRC32C in the footer.
Tombstones are stored on disk so a deletion can mask older runs until compaction
resolves it. Opening a run reads only the footer and index; values are read one
block at a time with a single positioned read.

The byte layout is frozen for the 1.x series and specified in
docs/SSTABLE_FORMAT.md.

Crash recovery via a manifest

A MANIFEST file records the live runs in recency order and the next sequence
number, rewritten atomically on every flush and compaction. It is the single
source of truth: on open, the engine opens exactly the runs it names and reclaims
any temporary file or run file it does not — the orphans a crash mid-flush or
mid-compaction leaves behind. Because both runs and the manifest are installed by
atomic rename, recovery always lands on a consistent state.

Testing

  • Property test: a randomized put/delete workload against a small,
    frequently-compacting engine matches a BTreeMap model exactly — no live key
    lost or duplicated through compaction.
  • Concurrency stress: four writer threads and a scanning reader against an
    engine with a low compaction trigger; the final state is exact and scans are
    always strictly increasing (no torn or duplicated keys).
  • Crash recovery: ungraceful exit, stale temporary file, orphan run, missing
    run, and corrupted block are all handled correctly.
  • loom model: exhaustive interleavings of a reader versus a compaction swap
    confirm a reader never observes a torn merge, for both a live value and a
    deletion.

Counts at this tag:

  • Default features: 54 unit + 4 integration + 2 compaction + 6 recovery +
    3 property + 23 doctests.
  • loom (under RUSTFLAGS="--cfg loom"): 2 model checks.

All green on stable and MSRV (1.85) across Linux, macOS, and Windows; cargo fmt,
cargo clippy -D warnings, cargo doc -D warnings, cargo deny check, and
cargo audit clean. Zero unsafe (#![forbid(unsafe_code)]).

What's next

  • 0.4.0 — Durability + crash recovery. A wal-db-backed write path under the
    durability feature: log before the memtable insert, and replay the log on
    open so no acknowledged write is lost across a crash.

Installation

[dependencies]
lsm-db = "0.3"

MSRV: Rust 1.85 (2024 edition).

Documentation


Full diff: v0.2.0...v0.3.0.
Changelog: CHANGELOG.md.