Skip to content

v0.4.0 — Partitioned placement and online rebalancing

Choose a tag to compare

@incognick incognick released this 16 Jun 01:16
· 71 commits to main since this release
Immutable release. Only release title and notes can be modified.

The fourth Hamster release: partitioned placement and online rebalancing — a cluster can finally grow, shrink, and reshape its data plane after data exists, and a continuous scrubber keeps that data healthy on its own.

Dev preview. The data-plane membership is no longer frozen once data exists — that was v0.3's headline limitation, and it's gone. But the S3 write path still commits only on the Raft leader (a non-leader answers 503 SlowDown, clients retry elsewhere), multipart and server-side copy are still not on the cluster path, a single-node serve store still cannot become a cluster in place, and on-disk/on-wire formats may still change between v0 releases (they are additively versioned, but nothing is frozen yet). Please don't trust real data to it.

What's in v0.4

  • Partitioned placement, made real. Placement now resolves from a stored, versioned cluster layout — a committed record every node reads identically, so the same object lands on the same nodes regardless of transient membership views. The placement function is failure-domain-aware (shards spread across zones, then hosts, then nodes — node-distinct by construction, the hard floor) and capacity-weighted (a larger node holds proportionally more, via deterministic integer weighted rendezvous — no floats, so every node computes the identical mapping even across CPU architectures). Equal-weight, single-zone clusters are unchanged, byte for byte.
  • Online rebalancing — grow, drain, replace, remove, shrink. A layout change relocates where shards live, so it opens a transition: reads dual-read each shard from its new or old home, so existing objects never go unreadable, while repair migrates shards old→new (a copy, not a reconstruct). One layout change at a time.
    • Grow — add a node (cluster run -token …); existing data migrates onto it at its current width. Then cluster optimize re-encodes data up to the larger cluster's profile, spreading older objects across the new nodes (explicit, never automatic).
    • Drain / undrain (cluster drain <node>) — new writes steer off the node and its shards migrate away; reversible.
    • Replace (cluster run -token … -replaces <old>) — swaps a fresh node in for an existing one at the same cluster size: same profile, a pure migrate, no re-encode. Works even on a tight cluster where every node is load-bearing.
    • Remove (cluster remove <node>) — evicts a drained, empty node for good; its ID is tombstoned, so a return needs a fresh join. Gated so durability is never traded for a shrink.
    • Shrink — draining a node past a storage-profile boundary re-encodes every object down to the smaller profile, with a [y/N] that prints the real per-step trade (e.g. 6→5 keeps two-failure tolerance and costs only efficiency; 5→4 drops tolerance to one).
  • A continuous background scrubber. Loop-owned, seam-clock-paced, leader-gated: it walks the keyspace object-by-object, hashing every shard against its replicated checksum and rebuilding bitrot or lost shards before any read trips over them — no operator action. The same self-healing as the on-demand repair sweep, now always-on, yielding to operator and transition sweeps through a shared single-flight guard.
  • Node liveness. A passive per-node detector, fed by every data-plane outcome (PUT, GET, repair), lets a PUT skip a known-down node up front instead of paying its retransmit timeout — but only while enough nodes stay up to meet the durability floor, so the skip changes latency, never durability (repair rebuilds the skipped shard). cluster status reports each member's STATE — up, draining, or down — alongside its host/zone and capacity weight.
  • Metadata durable on every replica. Each node's metadata now lives in a per-replica BadgerDB store that is the boot source of truth (only the un-applied WAL tail is replayed), with the Raft log's full-dump snapshots as the rebuild fallback if that store is lost or corrupt.
  • One cluster port. The join/status protocol shares the peer transport's port, split off by ALPN — a node binds one cluster address, not two.

How it's verified

  • The deterministic simulation harness drives the full Raft metadata plane plus the data plane on every node, with durability checked by decoding shard files off the surviving disks. Repair and transition schedules cover an emptied node healed, two-shard bitrot rebuilt without any read, beyond-tolerance loss reported honestly, crash-mid-sweep convergence, a transition migrated shard-by-shard to a new placement that still reads with the drained node crashed, a downsize re-encoding 4+2→3+2 that reads at the new profile, and an upsize optimize widening 2+1→4+2 after growth that then survives two crashed nodes. The persister path (load, skip-don't-reapply, snapshot-install reset, WAL rebuild) is exercised under the same harness.
  • A deep multi-node end-to-end suite runs real processes over loopback mTLS: a reusable cluster harness; ~100 objects of varied sizes with paged, prefix, and delimiter listings and ranged GETs; the operations isolated and back-to-back, asserting the one-layout-op-at-a-time guard; an operation running under continuous load with a zero-error contract; the full profile ladder (2+1 / 3+2 / 4+2, each proven to reconstruct after losing m nodes); growth + optimize; downsize; and a shard bitrotted on disk and watched self-heal by the scrubber.
  • The race detector (now split so the race-meaningful packages gate while the real-process cluster suite is best-effort) and the v0.1 compatibility suite (aws CLI, rclone, restic, s3cmd) keep passing.

Binaries below are static (CGO_ENABLED=0), version-stamped (hamster version), with SHA-256 checksums in SHA256SUMS. Next up, v0.5: object versioning — the metadata has modeled every key as an ordered version list since v0.1, so the API to expose it needs no schema change.