Observability: Prometheus/Grafana/Alertmanager/Blackbox + alerts + systemd/SPOF hardening #178

@anneoneone

Description

Goal

Add monitoring, alerting, and kill the single-VPS SPOF (Phase 4 of #164).

Tasks

  • Prometheus + Grafana + Alertmanager + Blackbox. Drop in markuslindenberg/icecast_exporter (icecast_listeners on :9146) for history with near-zero code. Bind all admin/metrics ports to 127.0.0.1 — never expose publicly. Don't scrape per-client endpoints (cardinality blow-up).
  • Alerts: Blackbox probe_success == 0 for 5m on the public stream URL (relay-down, independent of server metrics) + absent(<listeners metric>) == 1. CRITICAL: do NOT use "or on() vector(0)" inside an alert rule: vector(0) is always present and < 1, so it fires permanently; that trick is for Grafana panels only. Keep "stream-down" and "zero-listeners" as two separate alerts, or gate zero-listeners on a "show-scheduled" signal. Add recording-failure alerts: r2_upload_failures_total > 0, recording byte-rate stalls while the stream is up, ffmpeg/tee restarts.
  • Dead-air detection: byte/listener metrics stay nominal during silence — add a loudness/silence probe (Liquidsoap blank.strip covers the audio path; consider an EBU-R128 tap for alerting).
  • Reliability hardening: run the relay under systemd Restart=always + hard memory cap (cgroup MemoryMax).
  • Kill the single-VPS SPOF: put Cloudflare/a CDN or a second relay in front of the public mount (higher-leverage than the relay choice itself). Scope by the peak-listener answer from Phase 0.
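For the dead-air task above, a minimal sketch of a silence check, assuming the stream has already been decoded to raw signed 16-bit little-endian mono PCM by some external tool. It uses a plain RMS level, not EBU R128 loudness, so treat it as a placeholder for the real probe. The names rms_dbfs and is_dead_air, the -50 dBFS threshold, and the three-window rule are hypothetical and not part of any existing tooling.

```python
import math
import struct

# Assumed threshold; tune it against real programme material.
SILENCE_THRESHOLD_DBFS = -50.0


def rms_dbfs(pcm: bytes) -> float:
    """RMS level in dBFS of signed 16-bit little-endian mono PCM."""
    count = len(pcm) // 2
    if count == 0:
        return float("-inf")
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    if mean_square == 0:
        return float("-inf")  # digital silence
    return 10.0 * math.log10(mean_square / (32768.0 ** 2))


def is_dead_air(windows, threshold_dbfs=SILENCE_THRESHOLD_DBFS,
                min_silent_windows=3):
    """True when the most recent windows are all below the threshold.

    Requiring several consecutive quiet windows avoids flagging
    short pauses between tracks or in speech.
    """
    recent = list(windows)[-min_silent_windows:]
    if len(recent) < min_silent_windows:
        return False
    return all(rms_dbfs(w) < threshold_dbfs for w in recent)
```

The result of is_dead_air could be exported as a 0/1 gauge and alerted on separately from "stream-down", since byte and listener metrics stay nominal during silence.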

Acceptance

Stopping the stream fires exactly one "stream-down" alert (not a permanently-firing one); a missing recording fires an alert; relay auto-restarts under memory pressure; public mount is fronted by a CDN/second relay.
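To make the "not a permanently-firing one" criterion concrete, here is a simplified Python model of how an alert expression is evaluated. It is a sketch of PromQL semantics, not Prometheus code: an instant vector is a dict keyed by label set, and an alert is active whenever the expression returns at least one sample, whatever its value. It models one reading of the warning, "probe_success == 0 or on() vector(0)". The "for: 5m" hold time is not modelled, and the instance label value is hypothetical.

```python
from typing import Dict, Tuple

# Instant vector: {label set: value}; a label set is a tuple of (name, value).
Labels = Tuple[Tuple[str, str], ...]
Vector = Dict[Labels, float]


def vector(value: float) -> Vector:
    """Model of vector(s): a single sample with no labels."""
    return {(): value}


def eq_filter(v: Vector, value: float) -> Vector:
    """Model of `v == value`: keep only the samples equal to value."""
    return {k: x for k, x in v.items() if x == value}


def or_on_empty(lhs: Vector, rhs: Vector) -> Vector:
    """Model of `lhs or on() rhs`.

    With on() all series share the empty match key, so rhs only
    contributes when lhs has no samples at all.
    """
    return dict(lhs) if lhs else dict(rhs)


def absent(v: Vector) -> Vector:
    """Model of absent(v): one sample of value 1 only if v is empty."""
    return {} if v else {(): 1.0}


def fires(result: Vector) -> bool:
    """An alert is active if the expression returns any sample."""
    return len(result) > 0


def bad_rule(probe: Vector) -> Vector:
    # probe_success == 0 or on() vector(0)
    return or_on_empty(eq_filter(probe, 0.0), vector(0.0))


def good_rule(probe: Vector) -> Vector:
    # probe_success == 0
    return eq_filter(probe, 0.0)


stream_up: Vector = {(("instance", "public-stream"),): 1.0}    # hypothetical label
stream_down: Vector = {(("instance", "public-stream"),): 0.0}
```

In this model bad_rule returns a sample for both stream_up and stream_down, so it is always firing, while good_rule returns a sample only for stream_down. absent() covers the remaining case where the metric disappears entirely.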

Depends on

Phase-0 spike (peak listeners) and #164 Phase-2 stack + telemetry ticket.


Parent: #164 (Phase 4).

Metadata

Assignees

No one assigned

    Labels

    project::Infrastructure (Area: hosting, networking, server comms)
    type::ci (Layer: CI/CD, Docker, deploy & Python automation)
