
Observability: app-level metrics, dashboards, alerts #460

@intendednull

Description

Tracking issue for the performance/observability work scoped out of the deployment-infra research (see branch claude/research-deployment-infrastructure-V4ilv). This doesn't block the deploy spec; it's captured here so we can pick it up after deploy lands.

Goal

See when relay/workers are overwhelmed, view resource usage over time, alert before users notice. Host-level CPU/RAM graphs are not enough — we need app-level signals (peer churn, sync queue depth, gossip throughput, SQLite write latency).

Proposed shape

Decoupled from host choice. Two viable stacks:

  • A. Managed: Grafana Cloud free tier (10k series, 50GB logs, 14d retention) + Grafana Alloy agent on each NixOS box. No infra to babysit. Lock-in low (Prometheus + OTel are open formats).
  • B. Self-hosted: VictoriaMetrics + Grafana + Loki + Alertmanager on the same VM via NixOS modules. ~+€4/mo RAM cost. Full sovereignty, no series cap.

Pragmatic path: start with A, migrate to B if/when the free tier hits its limits. The same /metrics endpoint feeds both, so no app code changes.

Instrumentation tasks

  • Add metrics + metrics-exporter-prometheus (or opentelemetry-rust) to the relay, replay, and storage binaries
  • Expose a Prometheus /metrics endpoint per binary (see the exporter sketch after this list)
  • Wire tracing-opentelemetry so existing tracing spans become traces for free
  • node_exporter via services.prometheus.exporters.node on each host
  • Ship journalctl → Alloy → Loki using the tracing-subscriber JSON formatter (see the logging sketch after this list)
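
A minimal sketch of the first two tasks, assuming the metrics 0.23+ macro API and metrics-exporter-prometheus with its default HTTP listener; the port and metric names here are placeholders, not settled conventions:

```rust
use std::time::Instant;

use metrics::{counter, describe_counter, describe_histogram, histogram};
use metrics_exporter_prometheus::PrometheusBuilder;

fn main() {
    // Serves http://0.0.0.0:9091/metrics. install() spawns its own runtime
    // thread when no Tokio runtime is active, so it works in any binary.
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9091))
        .install()
        .expect("failed to install Prometheus exporter");

    // Optional help text, included in the Prometheus exposition output.
    describe_counter!("relay_gossip_messages_total", "Gossip messages relayed");
    describe_histogram!("storage_sqlite_write_seconds", "SQLite write latency");

    // Recording from the hot path is a one-liner per signal.
    counter!("relay_gossip_messages_total").increment(1);

    let start = Instant::now();
    // ... perform a SQLite write here ...
    histogram!("storage_sqlite_write_seconds").record(start.elapsed().as_secs_f64());
}
```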
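
The logging side, sketched with tracing-subscriber (json + env-filter features); the commented line marks where the tracing-opentelemetry layer would slot in once an OTLP exporter is wired up:

```rust
use tracing_subscriber::{fmt, prelude::*, EnvFilter};

fn main() {
    tracing_subscriber::registry()
        // RUST_LOG-style filtering, e.g. RUST_LOG=relay=debug,info.
        .with(EnvFilter::from_default_env())
        // JSON lines on stdout: journald captures them as-is, and Alloy
        // can ship them to Loki with fields already structured.
        .with(fmt::layer().json().with_current_span(true))
        // .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();

    tracing::info!(peers = 12, "relay started");
}
```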

Key metrics to add

  • relay: connected peers, bytes in/out, gossip msgs/s, dial failures, ws upgrade rate
  • replay: events/s applied, sync queue depth, peers served, head-set size
  • storage: SQLite write latency p50/p99, db size, events backfilled/s
  • common (actor framework): mailbox depth, channel lag, panic count
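
A sketch of this metric surface using the metrics crate macros; every name, label key, and call site below is a proposal, not a settled convention:

```rust
use std::time::Duration;

use metrics::{counter, gauge, histogram};

/// relay: called from the connection-management loop (hypothetical call site).
fn record_relay_sample(connected_peers: usize, bytes_in: u64, gossip_msgs: u64) {
    gauge!("relay_connected_peers").set(connected_peers as f64);
    counter!("relay_bytes_total", "direction" => "in").increment(bytes_in);
    counter!("relay_gossip_messages_total").increment(gossip_msgs);
}

/// storage: p50/p99 are derived from this histogram on the Prometheus side,
/// so the app only records raw durations.
fn record_sqlite_write(elapsed: Duration) {
    histogram!("storage_sqlite_write_seconds").record(elapsed.as_secs_f64());
}

/// common: one bounded "actor" label per actor type, never per instance.
fn record_mailbox_depth(actor: &'static str, depth: usize) {
    gauge!("actor_mailbox_depth", "actor" => actor).set(depth as f64);
}
```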

SLOs to alert on

  • relay: peer dial p95 < 2s; ws connect error rate < 1%; CPU < 70% sustained
  • replay: sync lag (head event age) < 10s; mailbox depth < 1000
  • storage: write p99 < 100ms; disk free > 20%
  • host: load1 < cores * 1.5; mem free > 15%; swap < 50%

Alerts → Discord/Slack webhook.

Out of scope here

  • Synthetic uptime checks (Better Stack / Uptime Kuma) — separate concern
  • User-facing analytics — separate concern
  • Tracing sampling strategy — defer until traces volume justifies

Risks / caveats

  • Cardinality discipline: one series per peer ID would blow past the Grafana Cloud free tier instantly. We need labelling rules before shipping (see the bounded-label sketch after this list).
  • Privacy: Willow is a privacy-focused P2P project. Logs/metrics shipped to a managed vendor must scrub PII (peer IDs, IPs, content hashes). The self-hosted route avoids this entirely.
  • Instrumentation cost: roughly 1–2 days to wire the metrics crate through the actor framework and add per-actor counters.
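
One possible cardinality guard, sketched with the metrics crate; PeerId and the transport-based bucketing rule are hypothetical placeholders. The same trick serves the privacy point: the raw peer ID never leaves the process.

```rust
use metrics::counter;

// Hypothetical peer identity type.
struct PeerId([u8; 32]);

/// Collapse an unbounded ID space into a small fixed label set, so the
/// series count stays flat no matter how many peers churn through.
fn transport_class(peer: &PeerId) -> &'static str {
    // Placeholder rule; a real one would classify by transport or region.
    if peer.0[0] & 1 == 0 { "ws" } else { "quic" }
}

fn record_dial_failure(peer: &PeerId) {
    // Two series total, not one per peer, and the raw ID is never exported.
    counter!("relay_dial_failures_total", "transport" => transport_class(peer))
        .increment(1);
}
```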

Dependencies

Wait for the deployment-infra spec (docs/specs/2026-04-28-...-design.md, TBD) to land first so the observability stack can target the same NixOS modules.
