observability and reliability: prometheus counters, crash recovery, TLS tests by kacy · Pull Request #314 · kacy/ember

kacy · 2026-02-26T18:40:21Z

summary

two commits covering the remaining launch-readiness items:

prometheus metrics: the ember_keys_expired_total and ember_keys_evicted_total metrics were incorrectly published as gauges. they're replaced by proper counters, ember_expired_keys_total and ember_evicted_keys_total, which increment by the delta observed on each polling interval. this matches the prometheus _total naming convention and enables rate queries like rate(ember_expired_keys_total[5m]). also adds ember_replication_connected_replicas and ember_replication_max_lag_records gauges so operators can alert on replication health without polling INFO replication.
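a minimal sketch of the delta-tracking idea, not the code from this PR: DeltaTracker is a hypothetical name, and a plain u64 stands in for the prometheus counter. the reset handling (treating a smaller polled value as a fresh start) is an assumption about how a stats reset might be handled.

```rust
/// Turns a polled cumulative total into per-interval increments.
/// `DeltaTracker` is a hypothetical name used only for this sketch.
#[derive(Default)]
struct DeltaTracker {
    last_seen: u64,
}

impl DeltaTracker {
    /// Returns how much to add to the counter for this poll.
    fn delta(&mut self, current: u64) -> u64 {
        let d = if current >= self.last_seen {
            current - self.last_seen
        } else {
            // Assumption: the source total was reset, so count from zero again.
            current
        };
        self.last_seen = current;
        d
    }
}

fn main() {
    let mut tracker = DeltaTracker::default();
    let mut counter: u64 = 0; // stand-in for a monotonic prometheus counter

    // Cumulative totals as they might be read on successive polling intervals.
    for polled in [0u64, 5, 5, 12] {
        counter += tracker.delta(polled);
    }
    println!("counter = {counter}");
}
```

because the counter only ever receives non-negative increments, it stays monotonic even when the underlying total is reset, which is what rate() expects.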

crash recovery test — sigkill_crash_recovery writes 50 keys with appendfsync always, drops the server immediately (no sleep, no graceful shutdown — Child::kill() is SIGKILL on unix), restarts with the same data directory, and asserts all 50 keys are present. validates the fundamental appendfsync always guarantee under worst-case conditions.
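the kill step can be sketched in isolation. this is an illustration under assumptions, not the test itself: `sleep 30` stands in for the ember server binary, kill_and_report_signal is a hypothetical helper, and the example is unix-only because it inspects the terminating signal.

```rust
use std::os::unix::process::ExitStatusExt;
use std::process::{Command, Stdio};

/// Spawns `program`, kills it right away, and returns the signal that ended it.
/// Hypothetical helper for illustration; the real test targets the server binary.
fn kill_and_report_signal(program: &str, args: &[&str]) -> std::io::Result<Option<i32>> {
    let mut child = Command::new(program)
        .args(args)
        .stdout(Stdio::null())
        .spawn()?;
    child.kill()?; // sends SIGKILL on unix: no handlers run, no graceful shutdown
    let status = child.wait()?; // reap the process so it does not linger as a zombie
    Ok(status.signal())
}

fn main() -> std::io::Result<()> {
    // `sleep 30` is a stand-in for a long-running server process.
    let signal = kill_and_report_signal("sleep", &["30"])?;
    // SIGKILL is signal number 9 on Linux and macOS.
    println!("terminated by signal: {signal:?}");
    Ok(())
}
```

in the actual test the interesting part comes after this step: restarting against the same data directory and checking that every acknowledged write is still readable.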

TLS integration tests — two new tests in tls.rs:

- tls_basic_commands starts a server with a self-signed cert generated by rcgen, connects via tokio-rustls using the cert as a pinned root, and verifies PING / SET / GET work correctly over TLS
- plain_tcp_rejected_on_tls_port verifies that a plain TCP connection to the TLS port never receives a valid PONG
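the assertion behind plain_tcp_rejected_on_tls_port can be sketched with the standard library alone. everything here is a stand-in: spawn_stub is a hypothetical one-shot server, and the alert bytes only approximate what a TLS listener might send back when it receives non-TLS input (it may also just close the connection).

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Sends an inline PING over plain TCP and reports whether `+PONG\r\n` came back.
fn plain_ping_gets_pong(addr: SocketAddr) -> bool {
    let Ok(mut stream) = TcpStream::connect(addr) else {
        return false;
    };
    let _ = stream.set_read_timeout(Some(Duration::from_secs(2)));
    if stream.write_all(b"PING\r\n").is_err() {
        return false;
    }
    let mut buf = Vec::new();
    let mut chunk = [0u8; 64];
    // Read until EOF, an error or timeout, or enough bytes to decide.
    while buf.len() < 7 {
        match stream.read(&mut chunk) {
            Ok(0) | Err(_) => break,
            Ok(n) => buf.extend_from_slice(&chunk[..n]),
        }
    }
    buf.starts_with(b"+PONG\r\n")
}

/// Hypothetical stub: accepts one connection, reads the request, writes `reply`, closes.
fn spawn_stub(reply: &'static [u8]) -> SocketAddr {
    let listener = TcpListener::bind("127.0.0.1:0").expect("bind");
    let addr = listener.local_addr().expect("local addr");
    thread::spawn(move || {
        if let Ok((mut conn, _)) = listener.accept() {
            let mut req = [0u8; 64];
            let _ = conn.read(&mut req);
            let _ = conn.write_all(reply);
        }
    });
    addr
}

fn main() {
    // Bytes shaped like a fatal TLS alert record (assumed reply to non-TLS input).
    let tls_like = spawn_stub(&[0x15, 0x03, 0x03, 0x00, 0x02, 0x02, 0x32]);
    let plain = spawn_stub(b"+PONG\r\n");
    println!("tls-like port answered PONG: {}", plain_ping_gets_pong(tls_like));
    println!("plain port answered PONG: {}", plain_ping_gets_pong(plain));
}
```

checking for the absence of a valid PONG, rather than for a specific error, keeps the test independent of how the TLS stack chooses to reject the connection.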

what was tested

- cargo test -p ember-server: 143 unit tests, all pass
- cargo check -p ember-integration-tests: clean
- cargo build -p ember-server: clean build
- new integration tests compile and type-check against the actual binary CLI surface

design considerations

for the replication lag metric, a _seconds gauge would require recording wall-clock timestamps per write — significant added complexity. ember_replication_max_lag_records (record count behind) is an honest and actionable metric: if it's non-zero and not converging, the replica is falling behind. operators can correlate with throughput to estimate time-to-catch-up.
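a rough sketch of the lag arithmetic described above, under assumptions: the function names and signatures are illustrative rather than the ReplicaTracker API, and the catch-up estimate assumes steady write and apply rates supplied by the operator.

```rust
/// Records each replica is behind: write_offset - acked_offset, floored at zero.
fn replica_lags(write_offset: u64, acked_offsets: &[u64]) -> Vec<u64> {
    acked_offsets
        .iter()
        .map(|acked| write_offset.saturating_sub(*acked))
        .collect()
}

/// Value a max-lag gauge would report; 0 when there are no replicas.
fn max_lag_records(write_offset: u64, acked_offsets: &[u64]) -> u64 {
    replica_lags(write_offset, acked_offsets)
        .into_iter()
        .max()
        .unwrap_or(0)
}

/// Estimated seconds to catch up, or None if the replica is not converging.
/// Both rates are assumed to be steady, operator-supplied numbers.
fn estimate_catch_up_secs(
    lag_records: u64,
    primary_writes_per_sec: f64,
    replica_applies_per_sec: f64,
) -> Option<f64> {
    if lag_records == 0 {
        return Some(0.0);
    }
    let net = replica_applies_per_sec - primary_writes_per_sec;
    if net <= 0.0 {
        return None; // lag is flat or growing
    }
    Some(lag_records as f64 / net)
}

fn main() {
    let lag = max_lag_records(1000, &[1000, 940, 700]);
    println!("max lag: {lag} records");
    println!("catch-up: {:?} s", estimate_catch_up_secs(lag, 500.0, 650.0));
}
```

with a primary at offset 1000 and replicas acked at 1000, 940, and 700, the gauge would read 300; at a net gain of 150 records per second that lag clears in about 2 seconds.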

adds redis-compatible keyspace notifications via the notify-keyspace-events config parameter. when enabled, ember publishes to:
- __keyspace@0__:<key> with the event name as the message
- __keyevent@0__:<event> with the key name as the message

changes:
- ember-core: expire_sample now collects expired key names into a buffer; run_expiration_cycle returns Vec<String> of expired keys for callers
- engine: new expired_tx broadcast channel in EngineConfig; shards broadcast expired key names to server without touching the GET/SET hot path
- keyspace_notifications: flag parser (K/E/g/$/l/z/h/s/x/A), notify helper
- config: notify-keyspace-events added to MUTABLE_PARAMS and EmberConfig
- server: keyspace_event_flags AtomicU32 on ServerContext; background task subscribes to expired-key broadcast and fires __keyevent@0__:expired
- execute: notify_write helper (guarded by a single atomic load, so near-zero overhead when disabled) wired into SET, DEL, EXPIRE, HSET, LPUSH, RPUSH, ZADD, SADD
- config set: updates keyspace_event_flags atomically at runtime
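a sketch of how the flag parser and the atomic guard might fit together. the bit layout, constant names, and function signatures are assumptions for illustration, not ember's implementation; in particular, A is treated here as an alias for only the event classes listed in this commit.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Assumed bit layout; the real values may differ.
const KEYSPACE: u32 = 1 << 0; // K
const KEYEVENT: u32 = 1 << 1; // E
const GENERIC: u32 = 1 << 2; // g
const STRING: u32 = 1 << 3; // $
const LIST: u32 = 1 << 4; // l
const ZSET: u32 = 1 << 5; // z
const HASH: u32 = 1 << 6; // h
const SET: u32 = 1 << 7; // s
const EXPIRED: u32 = 1 << 8; // x
// Assumption: A covers exactly the classes named in this commit.
const ALL: u32 = GENERIC | STRING | LIST | ZSET | HASH | SET | EXPIRED;

/// Parses a notify-keyspace-events string; returns the offending char on error.
fn parse_flags(spec: &str) -> Result<u32, char> {
    let mut flags = 0;
    for c in spec.chars() {
        flags |= match c {
            'K' => KEYSPACE,
            'E' => KEYEVENT,
            'g' => GENERIC,
            '$' => STRING,
            'l' => LIST,
            'z' => ZSET,
            'h' => HASH,
            's' => SET,
            'x' => EXPIRED,
            'A' => ALL,
            other => return Err(other),
        };
    }
    Ok(flags)
}

/// Returns (channel, message) pairs to publish for one event.
/// A single atomic load decides whether any work is done at all.
fn notifications(flags: &AtomicU32, class: u32, event: &str, key: &str) -> Vec<(String, String)> {
    let f = flags.load(Ordering::Relaxed);
    if f & class == 0 {
        return Vec::new(); // disabled, or this event class is not selected
    }
    let mut out = Vec::new();
    if f & KEYSPACE != 0 {
        out.push((format!("__keyspace@0__:{key}"), event.to_string()));
    }
    if f & KEYEVENT != 0 {
        out.push((format!("__keyevent@0__:{event}"), key.to_string()));
    }
    out
}

fn main() {
    let flags = AtomicU32::new(parse_flags("KEx").expect("valid flags"));
    for (channel, message) in notifications(&flags, EXPIRED, "expired", "session:1") {
        println!("{channel} <- {message}");
    }
}
```

storing the parsed flags in one AtomicU32 lets config set swap the whole configuration with a single store while command handlers pay only one load on the write path.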

…on lag
- replace gauge-based ember_keys_expired_total / ember_keys_evicted_total with monotonic counters (ember_expired_keys_total / ember_evicted_keys_total) by tracking deltas between polling intervals; semantics now match the prometheus _total convention
- add ember_replication_connected_replicas and ember_replication_max_lag_records gauges; the lag is computed from ReplicaTracker::replica_lags(), which returns write_offset - acked_offset per replica
- add ReplicaTracker::replica_lags() helper to expose per-replica offsets

persistence:
- sigkill_crash_recovery: writes 50 keys with appendfsync=always, kills the server immediately (no sleep, no graceful shutdown), restarts and verifies all 50 keys are intact; validates that fsync-per-write guarantees survive worst-case SIGKILL

tls:
- tls_basic_commands: generates a self-signed cert via rcgen at test time, starts the server with --tls-port / --tls-cert-file / --tls-key-file, connects with tokio-rustls using the cert as a pinned root, exercises PING / SET / GET over the TLS transport
- plain_tcp_rejected_on_tls_port: verifies that a plain TCP connection to the TLS port never receives a valid PONG (TLS handshake must succeed before RESP3 commands are processed)

helpers:
- add tls_cert_file / tls_key_file to ServerOptions; TestServer gains a tls_port field populated when TLS args are present

kacy added 3 commits February 26, 2026 13:26

kacy merged commit 166937b into main Feb 26, 2026
4 of 7 checks passed

kacy deleted the feat/prometheus-metrics-and-reliability-tests branch February 26, 2026 18:40
