Tracking issue for the performance/observability work scoped out of the deployment-infra research (see branch claude/research-deployment-infrastructure-V4ilv). Not blocking the deploy spec — captured here so we can pick it up after deploy lands.
## Goal
See when relay/workers are overwhelmed, view resource usage over time, alert before users notice. Host-level CPU/RAM graphs are not enough — we need app-level signals (peer churn, sync queue depth, gossip throughput, SQLite write latency).
## Proposed shape
Decoupled from host choice. Two viable stacks:
- A. Managed: Grafana Cloud free tier (10k series, 50GB logs, 14d retention) + Grafana Alloy agent on each NixOS box. No infra to babysit. Lock-in low (Prometheus + OTel are open formats).
- B. Self-hosted: VictoriaMetrics + Grafana + Loki + Alertmanager on the same VM via NixOS modules. ~+€4/mo RAM cost. Full sovereignty, no series cap.
Pragmatic path: start A, migrate to B if/when free tier hits limits. Same `/metrics` endpoint feeds both — no app code change.
## Instrumentation tasks
- Add `metrics` + `metrics-exporter-prometheus` (or `opentelemetry-rust`) to relay, replay, storage binaries (see the sketch after this list)
- Expose a `/metrics` Prometheus endpoint per binary
- Wire `tracing-opentelemetry` so existing `tracing` spans become traces for free
- Enable `node_exporter` via `services.prometheus.exporters.node` on each host
- Ship `journalctl` → Alloy → Loki (use the `tracing-subscriber` JSON formatter)
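A minimal sketch of what the wiring could look like, assuming the `metrics` 0.23-style macro API, `metrics-exporter-prometheus`, and `tracing-subscriber` built with its `json` feature; the metric names are illustrative placeholders, not a decided naming scheme:

```rust
use metrics::{counter, gauge, histogram};
use metrics_exporter_prometheus::PrometheusBuilder;

fn main() {
    // JSON log lines to stdout -> journald -> Alloy -> Loki.
    // Requires tracing-subscriber with the "json" feature enabled.
    tracing_subscriber::fmt().json().init();

    // Install a global Prometheus recorder and serve /metrics on :9000.
    // install() spawns a background runtime if none is already running.
    PrometheusBuilder::new()
        .with_http_listener(([0, 0, 0, 0], 9000))
        .install()
        .expect("failed to install Prometheus exporter");

    // Illustrative signals, one per metric kind; names are placeholders.
    gauge!("relay_connected_peers").set(17.0);
    counter!("relay_dial_failures_total").increment(1);
    histogram!("storage_sqlite_write_seconds").record(0.012);
}
```

The same endpoint then feeds either stack: Alloy scrapes it for Grafana Cloud, or a local VictoriaMetrics scrapes it in the self-hosted setup.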
## Key metrics to add
- relay: connected peers, bytes in/out, gossip msgs/s, dial failures, ws upgrade rate
- replay: events/s applied, sync queue depth, peers served, head-set size
- storage: SQLite write latency p50/p99, db size, events backfilled/s
- common (actor framework): mailbox depth, channel lag, panic count
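For the actor-framework row, a hedged sketch of what per-actor instrumentation could look like; the helper names and the `actor` label scheme are assumptions for illustration, not an existing API:

```rust
use metrics::{counter, gauge};

// Hypothetical helpers; label by actor *type* (a small fixed set),
// never by actor instance or peer, to keep series counts bounded.
pub fn record_mailbox_depth(actor: &'static str, depth: usize) {
    gauge!("actor_mailbox_depth", "actor" => actor).set(depth as f64);
}

pub fn record_actor_panic(actor: &'static str) {
    counter!("actor_panics_total", "actor" => actor).increment(1);
}
```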
## SLOs to alert on
- relay: peer dial p95 < 2s; ws connect error rate < 1%; CPU < 70% sustained
- replay: sync lag (head event age) < 10s; mailbox depth < 1000
- storage: write p99 < 100ms; disk free > 20%
- host: load1 < cores * 1.5; mem free > 15%; swap < 50%
Alerts → Discord/Slack webhook.
## Out of scope here
- Synthetic uptime checks (Better Stack / Uptime Kuma) — separate concern
- User-facing analytics — separate concern
- Tracing sampling strategy — defer until trace volume justifies it
## Risks / caveats
- Cardinality discipline: one series per peer ID would blow past the Grafana Cloud free tier instantly. Need labelling rules before shipping (see the sketch after this list).
- Privacy: Willow is a privacy-focused P2P project. Logs/metrics shipped to a managed vendor must scrub PII (peer IDs, IPs, content hashes). Self-host route avoids this entirely.
- Instrumentation cost: ~1–2 days to wire the `metrics` crate through the actor framework and add per-actor counters.
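To make the cardinality point concrete, a sketch of the labelling discipline under the same assumed `metrics` API (metric and label names are illustrative):

```rust
use metrics::counter;

fn on_bytes_received(peer_id: &str, n: u64) {
    // BAD: "peer_id" as a label means one series per peer: unbounded
    // cardinality, and it ships peer identifiers to the vendor.
    // counter!("relay_bytes_in_total", "peer_id" => peer_id.to_string()).increment(n);

    // GOOD: aggregate across peers; labels come from small fixed sets.
    let _ = peer_id; // peer identity deliberately stays out of labels
    counter!("relay_bytes_in_total", "direction" => "inbound").increment(n);
}
```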
## Dependencies
Wait for the deployment-infra spec (`docs/specs/2026-04-28-...-design.md`, TBD) to land first so the obs stack can target the same NixOS modules.