v0.5.0 — Bloom filters & feature freeze
Pre-release
The engine is feature-complete. v0.5.0 adds optional per-run bloom filters
that let a point read skip any run that cannot contain the key — a negative
lookup across many runs now reads no data blocks at all — and declares the
feature freeze. From here to 1.0 the work is optimization (0.6) and hardening
with the API frozen (0.7), not new surface.
What is lsm-db?
A log-structured merge-tree storage engine for Rust — the write path that powers
RocksDB, LevelDB, Cassandra, and ScyllaDB, packaged as a small, audited library.
It is the storage layer the portfolio's database crates (txn-db, Hive DB) build
on, so the durability and read/write contract is implemented and tested once.
What's new in 0.5.0
Bloom-filtered point reads (bloom feature)
[dependencies]
lsm-db = { version = "0.5", features = ["bloom"] }

With the feature enabled, every sorted run carries a bloom filter over its keys.
A point lookup that misses the memtable checks each run's filter first and skips
the run entirely when the filter rejects the key — no data block is read. Filters
never produce false negatives, so a skip is always safe; a false positive merely
falls through to a normal, correct lookup. The public API is identical with or
without the feature, and it is a zero-cost no-op when off.
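
The skip logic described above can be sketched in a few dozen lines. This is a toy model under stated assumptions, not lsm-db's implementation: `Bloom`, `Run`, `point_get`, the filter sizing (1024 bits, 4 probes), and the use of `DefaultHasher` are all hypothetical, and a `BTreeMap` plus a counter stands in for on-disk data blocks.

```rust
use std::cell::Cell;
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// `k` probe positions in a bit array of length `m`, derived by double hashing.
fn probes(key: &[u8], m: usize, k: u64) -> Vec<usize> {
    let hash_with = |seed: u64| {
        let mut h = DefaultHasher::new();
        seed.hash(&mut h);
        key.hash(&mut h);
        h.finish()
    };
    let (a, b) = (hash_with(1), hash_with(2) | 1);
    (0..k)
        .map(|i| (a.wrapping_add(i.wrapping_mul(b)) % (m as u64)) as usize)
        .collect()
}

/// Toy bloom filter (hypothetical; the real crate delegates to a library).
struct Bloom {
    bits: Vec<bool>,
    k: u64,
}

impl Bloom {
    fn new(num_bits: usize, k: u64) -> Self {
        Bloom { bits: vec![false; num_bits], k }
    }
    fn insert(&mut self, key: &[u8]) {
        for p in probes(key, self.bits.len(), self.k) {
            self.bits[p] = true;
        }
    }
    /// `false` means "definitely absent"; `true` means "possibly present".
    fn may_contain(&self, key: &[u8]) -> bool {
        probes(key, self.bits.len(), self.k).into_iter().all(|p| self.bits[p])
    }
}

/// Stand-in for a sorted run: a map plus a counter of simulated block reads.
struct Run {
    filter: Bloom,
    data: BTreeMap<Vec<u8>, Vec<u8>>,
    block_reads: Cell<usize>,
}

impl Run {
    fn new(entries: &[(&str, &str)]) -> Self {
        let mut filter = Bloom::new(1024, 4);
        let mut data = BTreeMap::new();
        for (k, v) in entries {
            filter.insert(k.as_bytes());
            data.insert(k.as_bytes().to_vec(), v.as_bytes().to_vec());
        }
        Run { filter, data, block_reads: Cell::new(0) }
    }
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        if !self.filter.may_contain(key) {
            return None; // filter rejects the key: skip the run, read nothing
        }
        // Filter says "maybe": fall through to a normal lookup.
        self.block_reads.set(self.block_reads.get() + 1);
        self.data.get(key).map(|v| v.as_slice())
    }
}

/// Consult runs in order; the first run that holds the key wins.
fn point_get<'a>(runs: &'a [Run], key: &[u8]) -> Option<&'a [u8]> {
    runs.iter().find_map(|run| run.get(key))
}

fn total_block_reads(runs: &[Run]) -> usize {
    runs.iter().map(|r| r.block_reads.get()).sum()
}

fn main() {
    // 16 runs, one key each, mirroring the shape of the benchmark.
    let runs: Vec<Run> = (0..16)
        .map(|i| {
            let k = format!("key-{}", i);
            Run::new(&[(k.as_str(), "v")])
        })
        .collect();
    // A miss returns None; with filters this sparse a false positive is
    // vanishingly unlikely, so no simulated block is read.
    assert_eq!(point_get(&runs, b"absent"), None);
    assert_eq!(total_block_reads(&runs), 0);
    // A hit reads only the run that actually holds the key.
    assert_eq!(point_get(&runs, b"key-7"), Some(&b"v"[..]));
    println!("block reads: {}", total_block_reads(&runs));
}
```

Inserted keys always pass `may_contain`, because `insert` sets exactly the bits that `may_contain` later checks. That is why skipping a run on a rejection can never hide a stored key.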
The win is on negative lookups across many runs. A benchmark of misses over 16
runs:
| negative lookup | time |
|---|---|
| without bloom | ~280 µs |
| with bloom | ~3 µs |
A deterministic, CI-enforced test asserts that a negative lookup reads zero
data blocks under the feature.
Sidecar persistence keeps the frozen format intact
The on-disk run format has been frozen for the 1.x series since 0.3, so the
filter is not embedded in the run. It lives in a sidecar file
(<run>.sst.bloom, encoded with postcard), written before the manifest commit
so any run the manifest names is guaranteed to have its filter, loaded when the
run is reopened, and removed with the run during compaction. A sidecar is a pure
acceleration hint: if it is missing or corrupt, the run is consulted directly
with identical results, and orphan sidecars from a crashed compaction are
reclaimed on open.
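
As a rough illustration of the "pure acceleration hint" behaviour, the sketch below loads a sidecar into an `Option` and treats every failure as "no filter", then removes sidecars whose run is not live. Everything here is an assumption for illustration: the function names are hypothetical, and a 4-byte toy checksum stands in for the postcard encoding the crate actually uses.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// `<run>.sst` -> `<run>.sst.bloom`
fn sidecar_path(run: &Path) -> PathBuf {
    let mut name = run.as_os_str().to_os_string();
    name.push(".bloom");
    PathBuf::from(name)
}

/// Toy integrity check (hypothetical stand-in for the real encoding).
fn checksum(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0u32, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u32))
}

/// Written before the manifest commit, so a committed run has its filter.
fn write_sidecar(run: &Path, filter_bits: &[u8]) -> io::Result<()> {
    let mut buf = checksum(filter_bits).to_le_bytes().to_vec();
    buf.extend_from_slice(filter_bits);
    fs::write(sidecar_path(run), buf)
}

/// `None` on a missing, truncated, or corrupt sidecar; the caller then
/// consults the run directly, which is slower but gives the same answer.
fn load_sidecar(run: &Path) -> Option<Vec<u8>> {
    let buf = fs::read(sidecar_path(run)).ok()?;
    if buf.len() < 4 {
        return None;
    }
    let stored = u32::from_le_bytes([buf[0], buf[1], buf[2], buf[3]]);
    let body = &buf[4..];
    (stored == checksum(body)).then(|| body.to_vec())
}

/// On open: delete every `*.bloom` file whose run is not in the live set.
fn reclaim_orphans(dir: &Path, live_runs: &HashSet<PathBuf>) -> io::Result<usize> {
    let mut removed = 0;
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        let is_sidecar = path.extension().is_some_and(|e| e == "bloom");
        if is_sidecar && !live_runs.contains(&path.with_extension("")) {
            fs::remove_file(&path)?;
            removed += 1;
        }
    }
    Ok(removed)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("sidecar-sketch-{}", std::process::id()));
    fs::create_dir_all(&dir)?;
    let run = dir.join("000001.sst");
    fs::write(&run, b"run bytes")?;

    // Round trip.
    write_sidecar(&run, &[0xAA, 0xFF])?;
    assert_eq!(load_sidecar(&run), Some(vec![0xAA, 0xFF]));

    // Corrupt the payload: the load degrades to None instead of failing.
    let mut bytes = fs::read(sidecar_path(&run))?;
    *bytes.last_mut().unwrap() ^= 0x01;
    fs::write(sidecar_path(&run), bytes)?;
    assert_eq!(load_sidecar(&run), None);

    // A sidecar without a live run (e.g. left by a crashed compaction).
    write_sidecar(&dir.join("000002.sst"), &[1])?;
    let mut live = HashSet::new();
    live.insert(run.clone());
    assert_eq!(reclaim_orphans(&dir, &live)?, 1);
    assert!(sidecar_path(&run).exists());

    fs::remove_dir_all(&dir)
}
```

Returning `Option` rather than `Result` from the load path is the design point: a bad sidecar is never an error the caller has to handle, only a missed optimization.
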
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Open a database in a throwaway directory (uses the `tempfile` crate).
    let dir = tempfile::tempdir()?;
    let db = lsm_db::Lsm::open(dir.path())?;
    db.put(b"present", b"1")?;
    db.flush()?;
    // With `bloom`, this miss is answered from the filter: no data block is read.
    assert_eq!(db.get(b"absent")?, None);
    Ok(())
}

Feature freeze
With bloom filters in place the engine is feature-complete. The public surface is
frozen against new features; 0.6 is optimization and 0.7 is hardening with the
API formally frozen.
Testing
- A deterministic, CI-enforced check that a negative lookup reads zero data
blocks under bloom (and a positive lookup reads at least one).
- Sidecar round-trip, missing-sidecar and corrupt-sidecar graceful-degradation,
orphan-sidecar reclamation, and sidecar/compaction lifecycle tests.
- A negative-lookup benchmark across 16 runs.
- All prior suites continue to pass with the feature on: compaction property
test, concurrent-writer stress, manifest crash recovery, durability recovery,
and the loom read-versus-compaction model.
Counts at this tag:
- Default features: 54 unit + 4 integration + 2 compaction + 6 recovery +
3 property + 23 doctests.
- --all-features: 67 unit + 6 bloom + 8 durability + 4 integration +
2 compaction + 6 recovery + 3 property + 23 doctests.
- loom (under RUSTFLAGS="--cfg loom"): 2 model checks.
All green on stable and MSRV (1.87) across Windows and Linux (WSL2); cargo fmt,
cargo clippy -D warnings, cargo doc -D warnings, cargo deny check, and
cargo audit clean. Zero unsafe (#![forbid(unsafe_code)]).
Breaking changes
None. The bloom feature and its dependencies (bloom-lib, postcard) are
additive.
A note on the pluggable comparator
The roadmap listed a pluggable comparator for this phase. It is dropped from
the 1.0 scope: it would require threading a generic comparator parameter
through every public type (Lsm<C>, Scan, …), which conflicts with the
crate's simplified-API mandate. Lexicographic byte ordering covers the common
case: callers encode keys so that their bytes sort in the desired order, as in
sled and redb. The decision is recorded in dev/ROADMAP.md.
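
To make "encode keys to sort" concrete, here is a small sketch of two common order-preserving encodings: big-endian for unsigned integers, and big-endian with the sign bit flipped for signed ones. The helper names are hypothetical and not part of lsm-db; the sketch only relies on the engine comparing keys as raw bytes.

```rust
/// Big-endian bytes of a u64 compare the same way the numbers do.
fn encode_u64(n: u64) -> [u8; 8] {
    n.to_be_bytes()
}

/// For i64, flip the sign bit first so negatives sort before positives.
fn encode_i64(n: i64) -> [u8; 8] {
    ((n as u64) ^ (1u64 << 63)).to_be_bytes()
}

fn decode_i64(bytes: [u8; 8]) -> i64 {
    (u64::from_be_bytes(bytes) ^ (1u64 << 63)) as i64
}

fn main() {
    // Little-endian does NOT preserve order: 1 sorts after 256 as bytes.
    assert!(1u64.to_le_bytes() > 256u64.to_le_bytes());
    // Big-endian does.
    assert!(encode_u64(1) < encode_u64(256));

    // Sort encoded signed keys as plain byte strings, then decode.
    let mut keys: Vec<[u8; 8]> = [3i64, -1, 0, i64::MIN, i64::MAX, -200]
        .iter()
        .map(|&n| encode_i64(n))
        .collect();
    keys.sort(); // lexicographic byte comparison
    let decoded: Vec<i64> = keys.into_iter().map(decode_i64).collect();
    assert_eq!(decoded, vec![i64::MIN, -200, -1, 0, 3, i64::MAX]);
    println!("{:?}", decoded);
}
```

Composite keys follow the same idea: concatenate fixed-width encoded fields in priority order, and the byte comparison sorts by the first field, then the second.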
What's next
- 0.6.0: Optimization. Profile the flush, compaction, and read paths;
block cache for hot run blocks; batched group commit on the durable path; lazy
scan streaming. Comparative benchmarks vs sled/redb.
- 0.7.0: Hardening + API freeze. Adversarial and corrupted-input tests,
edge cases (disk-full, huge values), cross-platform re-verification; the API
formally frozen.
Installation
[dependencies]
lsm-db = "0.5"
# Or, for crash-safe writes and/or bloom-filtered point reads (replaces the line above):
lsm-db = { version = "0.5", features = ["durability", "bloom"] }

MSRV: Rust 1.87 (2024 edition).
Documentation
Full diff: v0.4.0...v0.5.0.
Changelog: CHANGELOG.md.