observability and reliability: prometheus counters, crash recovery, TLS tests#314

Merged
kacy merged 3 commits into main from feat/prometheus-metrics-and-reliability-tests
Feb 26, 2026

@kacy kacy commented Feb 26, 2026

summary

two commits covering the remaining launch-readiness items:

prometheus metrics: ember_keys_expired_total and ember_keys_evicted_total were incorrectly published as gauges. they are replaced by proper counters, ember_expired_keys_total and ember_evicted_keys_total, which increment by the delta observed on each polling interval. this matches the prometheus _total naming convention and enables rate queries like rate(ember_expired_keys_total[5m]). also adds ember_replication_connected_replicas and ember_replication_max_lag_records gauges so operators can alert on replication health without poking INFO replication.
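
a minimal sketch of the delta-per-poll idea described above, not the code in this PR. `DeltaTracker` and `Counter` are hypothetical names, and `Counter` is a stand-in for whatever metrics handle the server uses. the reset handling (treating a smaller total as a fresh start) is an assumption about how a restart of the underlying stat might be handled.

```rust
/// Turns a cumulative total sampled on each poll into increments that can be
/// fed to a monotonic counter. (Hypothetical helper, for illustration only.)
#[derive(Debug, Default)]
pub struct DeltaTracker {
    last_seen: u64,
}

impl DeltaTracker {
    /// Returns how much `current_total` grew since the previous poll.
    /// If the total went backwards (assumed to mean the source stat was
    /// reset), the new value itself is reported as the increment, so the
    /// exported counter never decreases.
    pub fn observe(&mut self, current_total: u64) -> u64 {
        let delta = if current_total >= self.last_seen {
            current_total - self.last_seen
        } else {
            current_total
        };
        self.last_seen = current_total;
        delta
    }
}

/// Stand-in for a prometheus-style counter: it can only be incremented.
#[derive(Debug, Default)]
pub struct Counter {
    value: u64,
}

impl Counter {
    pub fn increment(&mut self, by: u64) {
        self.value += by;
    }

    pub fn get(&self) -> u64 {
        self.value
    }
}
```

on each polling interval the caller would do something like `counter.increment(tracker.observe(stats_total))`. a gauge would instead overwrite the value, which is what makes rate() over it misleading.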

crash recovery test: sigkill_crash_recovery writes 50 keys with appendfsync always, kills the server immediately (no sleep, no graceful shutdown; Child::kill() sends SIGKILL on unix), restarts it with the same data directory, and asserts all 50 keys are present. this validates the fundamental appendfsync always guarantee under worst-case conditions.
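
for readers unfamiliar with why fsync-per-write should survive an abrupt kill, here is a small std-only sketch of the contract. it is not ember's AOF format or code: the `key value` line format and the function names `open_log`, `append_durable` and `replay` are made up for illustration, and dropping a file handle is of course a much weaker event than SIGKILL.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, BufRead, BufReader, Write};
use std::path::Path;

/// Opens (or creates) an append-only log file.
pub fn open_log(path: &Path) -> io::Result<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

/// Appends one `key value` record and calls `sync_all` before returning, so a
/// successful return means the record has been handed to stable storage.
/// This mirrors the promise that an "always fsync" policy makes.
pub fn append_durable(log: &mut File, key: &str, value: &str) -> io::Result<()> {
    writeln!(log, "{key} {value}")?;
    log.sync_all()
}

/// Replays the log the way a restarted process might, returning every
/// well-formed record in write order.
pub fn replay(path: &Path) -> io::Result<Vec<(String, String)>> {
    let reader = BufReader::new(File::open(path)?);
    let mut records = Vec::new();
    for line in reader.lines() {
        let line = line?;
        // Skip lines without a separator, e.g. a torn final write.
        if let Some((key, value)) = line.split_once(' ') {
            records.push((key.to_string(), value.to_string()));
        }
    }
    Ok(records)
}
```

the point of the integration test is that every write acknowledged before the kill went through a path like `append_durable`, so replay after restart must see all of them.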

TLS integration tests — two new tests in tls.rs:

  • tls_basic_commands starts a server with a self-signed cert generated by rcgen, connects via tokio-rustls using the cert as a pinned root, and verifies PING / SET / GET work correctly over TLS
  • plain_tcp_rejected_on_tls_port verifies that a plain TCP connection to the TLS port never receives a valid PONG
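
the "never receives a valid PONG" check from the second test can be sketched with std only. this is an illustration under assumptions, not the code in tls.rs: `is_valid_pong` and `probe_plain_tcp` are hypothetical helpers, and the probe assumes the client sends PING as a RESP array.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true only if `response` begins with the RESP simple-string reply
/// `+PONG\r\n`. Anything else (TLS alert bytes, an empty read, garbage)
/// counts as "no valid PONG".
pub fn is_valid_pong(response: &[u8]) -> bool {
    response.starts_with(b"+PONG\r\n")
}

/// Sends a RESP-encoded PING over plain TCP and collects whatever bytes come
/// back before the peer closes the connection or the read times out.
pub fn probe_plain_tcp(addr: SocketAddr, timeout: Duration) -> io::Result<Vec<u8>> {
    let mut stream = TcpStream::connect_timeout(&addr, timeout)?;
    stream.set_read_timeout(Some(timeout))?;
    stream.write_all(b"*1\r\n$4\r\nPING\r\n")?;

    let mut received = Vec::new();
    let mut buf = [0u8; 512];
    loop {
        match stream.read(&mut buf) {
            Ok(0) => break, // peer closed the connection
            Ok(n) => received.extend_from_slice(&buf[..n]),
            Err(_) => break, // timeout or reset: treated as no reply
        }
    }
    Ok(received)
}
```

a test along these lines would assert `!is_valid_pong(&probe_plain_tcp(tls_addr, timeout)?)`. a server that only speaks TLS on that port would typically answer with a handshake failure or just close, never with `+PONG`.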

what was tested

  • cargo test -p ember-server — 143 unit tests, all pass
  • cargo check -p ember-integration-tests — clean
  • cargo build -p ember-server — clean build
  • new integration tests compile and type-check against the actual binary CLI surface

design considerations

for the replication lag metric, a _seconds gauge would require recording wall-clock timestamps per write — significant added complexity. ember_replication_max_lag_records (record count behind) is an honest and actionable metric: if it's non-zero and not converging, the replica is falling behind. operators can correlate with throughput to estimate time-to-catch-up.
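
to make the metric concrete, here is a small sketch of the arithmetic, assuming a per-replica pair of offsets. `ReplicaOffsets` and the free functions below are hypothetical. the PR itself only says that `ReplicaTracker::replica_lags()` returns write_offset - acked_offset per replica. the catch-up estimate is a back-of-envelope formula an operator might apply, not something the server computes.

```rust
/// Offsets tracked for one replica (hypothetical shape, for illustration).
#[derive(Debug, Clone, Copy)]
pub struct ReplicaOffsets {
    pub write_offset: u64,
    pub acked_offset: u64,
}

/// Lag in records for each replica: write_offset - acked_offset.
/// `saturating_sub` guards against a replica acking ahead of a stale
/// primary offset.
pub fn replica_lags(replicas: &[ReplicaOffsets]) -> Vec<u64> {
    replicas
        .iter()
        .map(|r| r.write_offset.saturating_sub(r.acked_offset))
        .collect()
}

/// Value for a max-lag gauge: the worst lag across replicas, or 0 when no
/// replicas are connected.
pub fn max_lag_records(replicas: &[ReplicaOffsets]) -> u64 {
    replica_lags(replicas).into_iter().max().unwrap_or(0)
}

/// Rough time-to-catch-up in seconds, given the primary's write rate and the
/// replica's apply rate in records per second. Returns `None` when the
/// replica is not converging (apply rate does not exceed write rate).
pub fn estimated_catch_up_secs(lag_records: u64, write_rate: f64, apply_rate: f64) -> Option<f64> {
    if lag_records == 0 {
        return Some(0.0);
    }
    let net_rate = apply_rate - write_rate;
    if net_rate <= 0.0 {
        return None;
    }
    Some(lag_records as f64 / net_rate)
}
```

for example, a replica 60 records behind that applies 30 records/s while the primary writes 10 records/s would need roughly 60 / (30 - 10) = 3 seconds to catch up.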

kacy added 3 commits February 26, 2026 13:26
adds redis-compatible keyspace notifications via the notify-keyspace-events
config parameter. when enabled, ember publishes to:
  - __keyspace@0__:<key> with the event name as the message
  - __keyevent@0__:<event> with the key name as the message

changes:
- ember-core: expire_sample now collects expired key names into a buffer;
  run_expiration_cycle returns Vec<String> of expired keys for callers
- engine: new expired_tx broadcast channel in EngineConfig; shards broadcast
  expired key names to server without touching the GET/SET hot path
- keyspace_notifications: flag parser (K/E/g/$/l/z/h/s/x/A), notify helper
- config: notify-keyspace-events added to MUTABLE_PARAMS and EmberConfig
- server: keyspace_event_flags AtomicU32 on ServerContext; background task
  subscribes to expired-key broadcast and fires __keyevent@0__:expired
- execute: notify_write helper (single atomic load guard — zero overhead when
  disabled) wired into SET, DEL, EXPIRE, HSET, LPUSH, RPUSH, ZADD, SADD
- config set: updates keyspace_event_flags atomically at runtime
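
a rough sketch of how the flag parser, the atomic guard and the two channel names could fit together. this is not the code from the commit: the constant names, bit layout, `parse_flags` and `notifications` are all hypothetical, and treating `A` as "every event class listed above" is an assumption modelled on the redis convention.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical bit layout, one bit per flag character.
pub const FLAG_KEYSPACE: u32 = 1 << 0; // K
pub const FLAG_KEYEVENT: u32 = 1 << 1; // E
pub const FLAG_GENERIC: u32 = 1 << 2; // g
pub const FLAG_STRING: u32 = 1 << 3; // $
pub const FLAG_LIST: u32 = 1 << 4; // l
pub const FLAG_ZSET: u32 = 1 << 5; // z
pub const FLAG_HASH: u32 = 1 << 6; // h
pub const FLAG_SET: u32 = 1 << 7; // s
pub const FLAG_EXPIRED: u32 = 1 << 8; // x
/// A: assumed here to be an alias for all event classes above.
pub const FLAG_ALL_CLASSES: u32 = FLAG_GENERIC
    | FLAG_STRING
    | FLAG_LIST
    | FLAG_ZSET
    | FLAG_HASH
    | FLAG_SET
    | FLAG_EXPIRED;

/// Parses a notify-keyspace-events string into a bitmask.
/// Returns the offending character on an unknown flag.
pub fn parse_flags(spec: &str) -> Result<u32, char> {
    let mut flags = 0u32;
    for c in spec.chars() {
        flags |= match c {
            'K' => FLAG_KEYSPACE,
            'E' => FLAG_KEYEVENT,
            'g' => FLAG_GENERIC,
            '$' => FLAG_STRING,
            'l' => FLAG_LIST,
            'z' => FLAG_ZSET,
            'h' => FLAG_HASH,
            's' => FLAG_SET,
            'x' => FLAG_EXPIRED,
            'A' => FLAG_ALL_CLASSES,
            other => return Err(other),
        };
    }
    Ok(flags)
}

/// Builds the (channel, message) pairs to publish for one event.
/// A single relaxed atomic load decides whether any work happens at all.
pub fn notifications(flags: &AtomicU32, class: u32, event: &str, key: &str) -> Vec<(String, String)> {
    let enabled = flags.load(Ordering::Relaxed);
    if enabled == 0 || (enabled & class) == 0 {
        return Vec::new();
    }
    let mut out = Vec::new();
    if (enabled & FLAG_KEYSPACE) != 0 {
        out.push((format!("__keyspace@0__:{key}"), event.to_string()));
    }
    if (enabled & FLAG_KEYEVENT) != 0 {
        out.push((format!("__keyevent@0__:{event}"), key.to_string()));
    }
    out
}
```

with the flags set from "KEx", an expired key `session:1` would yield a message `expired` on `__keyspace@0__:session:1` and a message `session:1` on `__keyevent@0__:expired`. with the flags at 0 the function returns after the single load.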
…on lag

- replace gauge-based ember_keys_expired_total / ember_keys_evicted_total
  with monotonic counters (ember_expired_keys_total / ember_evicted_keys_total)
  by tracking deltas between polling intervals — semantics now match the
  prometheus _total convention
- add ember_replication_connected_replicas and ember_replication_max_lag_records
  gauges; the lag is computed from ReplicaTracker.replica_lags() which returns
  write_offset - acked_offset per replica
- add ReplicaTracker::replica_lags() helper to expose per-replica offsets

persistence:
- sigkill_crash_recovery — writes 50 keys with appendfsync=always, kills
  the server immediately (no sleep, no graceful shutdown), restarts and
  verifies all 50 keys are intact; validates that fsync-per-write guarantees
  survive worst-case SIGKILL

tls:
- tls_basic_commands — generates a self-signed cert via rcgen at test time,
  starts the server with --tls-port / --tls-cert-file / --tls-key-file,
  connects with tokio-rustls using the cert as a pinned root, exercises
  PING / SET / GET over the TLS transport
- plain_tcp_rejected_on_tls_port — verifies that a plain TCP connection
  to the TLS port never receives a valid PONG (TLS handshake must succeed
  before RESP3 commands are processed)

helpers: add tls_cert_file / tls_key_file to ServerOptions; TestServer
gains a tls_port field populated when TLS args are present
@kacy kacy merged commit 166937b into main Feb 26, 2026
4 of 7 checks passed
@kacy kacy deleted the feat/prometheus-metrics-and-reliability-tests branch February 26, 2026 18:40