
Observability: app-level metrics, dashboards, alerts #460

@intendednull

Description

Tracking issue for the performance/observability work scoped out of the deployment-infra research (see branch claude/research-deployment-infrastructure-V4ilv). This doesn't block the deploy spec; it's captured here so we can pick it up after deploy lands.

Goal

See when relay/workers are overwhelmed, view resource usage over time, alert before users notice. Host-level CPU/RAM graphs are not enough — we need app-level signals (peer churn, sync queue depth, gossip throughput, SQLite write latency).

Proposed shape

Decoupled from host choice. Two viable stacks:

  • A. Managed: Grafana Cloud free tier (10k series, 50GB logs, 14d retention) + Grafana Alloy agent on each NixOS box. No infra to babysit. Lock-in low (Prometheus + OTel are open formats).
  • B. Self-hosted: VictoriaMetrics + Grafana + Loki + Alertmanager on the same VM via NixOS modules. ~+€4/mo RAM cost. Full sovereignty, no series cap.

Pragmatic path: start with A, migrate to B if/when the free tier hits its limits. The same /metrics endpoint feeds both, so no app code changes.

Instrumentation tasks

  • Add metrics + metrics-exporter-prometheus (or opentelemetry-rust) to the relay, replay, and storage binaries
  • Expose a Prometheus /metrics endpoint per binary (see the exporter sketch after this list)
  • Wire tracing-opentelemetry so existing tracing spans become traces for free
  • node_exporter via services.prometheus.exporters.node on each host
  • Ship journalctl → Alloy → Loki using the tracing-subscriber JSON formatter (see the logging sketch after this list)
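
A minimal sketch of the first two tasks, assuming the metrics 0.23+ macro API and metrics-exporter-prometheus with its default HTTP listener; the port and metric names here are placeholders, not settled conventions:

```rust
use std::time::Instant;

use metrics::{counter, describe_counter, describe_histogram, histogram};
use metrics_exporter_prometheus::PrometheusBuilder;

fn main() {
    // Serves http://0.0.0.0:9091/metrics. install() spawns its own runtime
    // thread when no Tokio runtime is active, so it works in any binary.
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9091))
        .install()
        .expect("failed to install Prometheus exporter");

    // Optional help text, included in the Prometheus exposition output.
    describe_counter!("relay_gossip_messages_total", "Gossip messages relayed");
    describe_histogram!("storage_sqlite_write_seconds", "SQLite write latency");

    // Recording from the hot path is a one-liner per signal.
    counter!("relay_gossip_messages_total").increment(1);

    let start = Instant::now();
    // ... perform a SQLite write here ...
    histogram!("storage_sqlite_write_seconds").record(start.elapsed().as_secs_f64());
}
```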
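
The logging side, sketched with tracing-subscriber (json + env-filter features); the commented line marks where the tracing-opentelemetry layer would slot in once an OTLP exporter is wired up:

```rust
use tracing_subscriber::{fmt, prelude::*, EnvFilter};

fn main() {
    tracing_subscriber::registry()
        // RUST_LOG-style filtering, e.g. RUST_LOG=relay=debug,info.
        .with(EnvFilter::from_default_env())
        // JSON lines on stdout: journald captures them as-is, and Alloy
        // can ship them to Loki with fields already structured.
        .with(fmt::layer().json().with_current_span(true))
        // .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();

    tracing::info!(peers = 12, "relay started");
}
```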

Key metrics to add

  • relay: connected peers, bytes in/out, gossip msgs/s, dial failures, ws upgrade rate
  • replay: events/s applied, sync queue depth, peers served, head-set size
  • storage: SQLite write latency p50/p99, db size, events backfilled/s
  • common (actor framework): mailbox depth, channel lag, panic count
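
A sketch of this metric surface using the metrics crate macros; every name, label key, and call site below is a proposal, not a settled convention:

```rust
use std::time::Duration;

use metrics::{counter, gauge, histogram};

/// relay: called from the connection-management loop (hypothetical call site).
fn record_relay_sample(connected_peers: usize, bytes_in: u64, gossip_msgs: u64) {
    gauge!("relay_connected_peers").set(connected_peers as f64);
    counter!("relay_bytes_total", "direction" => "in").increment(bytes_in);
    counter!("relay_gossip_messages_total").increment(gossip_msgs);
}

/// storage: p50/p99 are derived from this histogram on the Prometheus side,
/// so the app only records raw durations.
fn record_sqlite_write(elapsed: Duration) {
    histogram!("storage_sqlite_write_seconds").record(elapsed.as_secs_f64());
}

/// common: one bounded "actor" label per actor type, never per instance.
fn record_mailbox_depth(actor: &'static str, depth: usize) {
    gauge!("actor_mailbox_depth", "actor" => actor).set(depth as f64);
}
```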

SLOs to alert on

  • relay: peer dial p95 < 2s; ws connect error rate < 1%; CPU < 70% sustained
  • replay: sync lag (head event age) < 10s; mailbox depth < 1000
  • storage: write p99 < 100ms; disk free > 20%
  • host: load1 < cores * 1.5; mem free > 15%; swap < 50%

Alerts → Discord/Slack webhook.

Out of scope here

  • Synthetic uptime checks (Better Stack / Uptime Kuma) — separate concern
  • User-facing analytics — separate concern
  • Tracing sampling strategy — defer until traces volume justifies

Risks / caveats

  • Cardinality discipline: one series per peer ID would blow past the Grafana Cloud free tier instantly. We need labelling rules before shipping (see the bounded-label sketch after this list).
  • Privacy: Willow is a privacy-focused P2P project. Logs/metrics shipped to a managed vendor must scrub PII (peer IDs, IPs, content hashes). The self-hosted route avoids this entirely.
  • Instrumentation cost: roughly 1–2 days to wire the metrics crate through the actor framework and add per-actor counters.
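
One possible cardinality guard, sketched with the metrics crate; PeerId and the transport-based bucketing rule are hypothetical placeholders. The same trick serves the privacy point: the raw peer ID never leaves the process.

```rust
use metrics::counter;

// Hypothetical peer identity type.
struct PeerId([u8; 32]);

/// Collapse an unbounded ID space into a small fixed label set, so the
/// series count stays flat no matter how many peers churn through.
fn transport_class(peer: &PeerId) -> &'static str {
    // Placeholder rule; a real one would classify by transport or region.
    if peer.0[0] & 1 == 0 { "ws" } else { "quic" }
}

fn record_dial_failure(peer: &PeerId) {
    // Two series total, not one per peer, and the raw ID is never exported.
    counter!("relay_dial_failures_total", "transport" => transport_class(peer))
        .increment(1);
}
```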

Dependencies

Wait for the deployment-infra spec (docs/specs/2026-04-28-...-design.md, TBD) to land first so the observability stack can target the same NixOS modules.
