chore(alerting): alert hygiene + SLO burn-rate alerts + VaultSealed #39
Merged
Every alert now links to the relevant Operations Manual section so responders have a consistent first stop. No expr/severity/threshold changes; all existing alerts already had severity, summary, and description, so this is purely an annotation-hygiene pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Assisted-by: Claude:Opus-4.8 [claude-code]
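As an illustration of the annotation this hygiene pass adds, here is a minimal sketch of one rule with a `runbook_url`. The `expr`, `for`, and `severity` values are the ones this PR reports for `VaultSealed`; the summary/description wording and the URL (host and anchor) are placeholders assumed for the example, not the values in the repository.

```yaml
# Sketch only: one alert rule as it might look after the hygiene pass.
- alert: VaultSealed
  expr: vault_core_unsealed == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: Vault is sealed                           # wording assumed
    description: Vault reports it is sealed.           # wording assumed
    # Hypothetical URL: the real link targets an existing anchor in docs/OPERATIONS.md.
    runbook_url: https://example.com/docs/OPERATIONS.md#vault-lifecycle
```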
Add recording rules for the Loki 5xx request-error ratio over 5m/30m/1h/2h/6h/1d/3d and three multi-window multi-burn-rate alerts (14.4x fast page, 6x medium page, 1x slow ticket) against a 99.9% availability SLO, per the Google SRE workbook. Loki is the only platform service that exports a scraped per-status-code HTTP counter, so it is the only defensible request-based SLI today. A gateway availability SLO is deferred: NGINX Gateway Fabric's OSS data plane exposes no per-status-code metric and is not yet scraped.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Assisted-by: Claude:Opus-4.8 [claude-code]
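To make the burn-rate arithmetic concrete: a 99.9% SLO leaves an error budget of 0.001, so the 14.4x fast-burn threshold is an error ratio of 14.4 × 0.001 = 0.0144, required over both the long and the short window. The fragment below is a rough sketch of how one recording-rule pair and the fast alert could be written. The recording-rule names, the group name, and the `status_code=~"5.."` selector are assumptions made for the example and may not match `rules-slo-loki.yaml`.

```yaml
# Sketch only: the groups section of a PrometheusRule spec.
groups:
  - name: slo-loki-sketch                          # hypothetical group name
    rules:
      # 5xx error ratio over the short window (rule name is an assumption).
      - record: loki:request_error_ratio:rate5m
        expr: |
          sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m]))
          /
          sum(rate(loki_request_duration_seconds_count[5m]))
      # Same ratio over the long window.
      - record: loki:request_error_ratio:rate1h
        expr: |
          sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[1h]))
          /
          sum(rate(loki_request_duration_seconds_count[1h]))
      # Fast burn: both windows must exceed 14.4x the 0.001 error budget.
      - alert: LokiErrorBudgetBurnFast
        expr: |
          loki:request_error_ratio:rate1h > (14.4 * 0.001)
          and
          loki:request_error_ratio:rate5m > (14.4 * 0.001)
        labels:
          severity: critical
```

The medium and slow alerts would follow the same shape with thresholds of 6 × 0.001 and 1 × 0.001 over their own window pairs. Requiring the short window as well lets the alert clear soon after the error rate recovers.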
Summary
Workstream C of the alerting/SLO uplift: alerting hygiene across all PrometheusRules, a multi-window multi-burn-rate SLO for Loki, and verification of the VaultSealed alert + routing. Scope is alert rules + Alertmanager config only — the kube-prometheus-stack chart revision (Workstream B) is untouched.
Alert inventory (pre-change)
Finding: every alert already carried a `severity` label and `summary`/`description` annotations. The only systematic gap was a missing `runbook_url`.
Hygiene fixes
- Added `runbook_url` to all 21 existing alerts, each pointing at the relevant section of `docs/OPERATIONS.md` (TLS, Vault lifecycle, PostgreSQL ops, observability quick-refs, or the troubleshooting symptom→fix table). Used existing doc anchors only; no fabricated per-alert pages.
Removals / merges
- `LokiRequestErrors` (flat 10% per-route 5xx) overlaps in spirit with the new Loki fast-burn alert but is a per-route symptom alert, not budget-aware; kept as a complementary fast symptom signal.
- `LonghornInstanceManagerRecreated` uses `for: 0m` (intentionally eager), documented in-file as catching the instance-manager cascade upstream; kept.
- `ArgoCDAppNotSynced` can be chatty during GitOps churn but is gated by `for: 15m`; kept.
New SLO / burn-rate alerts
New file `rules-slo-loki.yaml`:
- `LokiErrorBudgetBurnFast`: 14.4x, 1h+5m windows, critical (page)
- `LokiErrorBudgetBurnMedium`: 6x, 6h+30m windows, critical (page)
- `LokiErrorBudgetBurnSlow`: 1x, 3d+6h windows, warning (ticket)
Loki is the only platform service exporting a per-status-code HTTP counter that Prometheus actually scrapes (`loki_request_duration_seconds_count`), so it is the only defensible request-based SLI today.
SLO follow-ups (not fabricated here)
- […] `:9113` endpoint. Follow-up: add a ServiceMonitor and, if status-coded SLIs are needed, evaluate the Plus data plane or an alternative request-status source.
- […] `defaultRules.kubeApiserverSlos: true`), not duplicated.
VaultSealed status
`VaultSealed` (`vault_core_unsealed == 0`, critical, `for: 1m`) already exists in `rules-platform-stability.yaml` and the `vault-metrics` ServiceMonitor (`/v1/sys/metrics`) is already on `main`, so the alert is live, not gated. This PR only adds its `runbook_url`. The layered Vault detection (`VaultNotReady`/`VaultDegraded` from kube-state-metrics, `VaultSealed` from Vault telemetry) is intact.
Routing coverage
Alertmanager config (managed via the `alertmanager-config` ExternalSecret) routes all severities to the single `discord` receiver, with `Watchdog`/`InfoInhibitor` sent to `null` and a critical→warning inhibit rule (`equal: [alertname, namespace]`). New `critical`/`warning` SLO alerts are therefore covered by the default route; no routing change is required.
Manual test step (not performed, no live cluster): after merge + sync, fire a synthetic alert to confirm Discord delivery, e.g. apply a temporary always-firing PrometheusRule (`expr: vector(1)`, `severity: warning`) and verify it reaches the Discord channel, then delete it.
Validation
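For the manual Discord delivery test described above, a throwaway rule could look roughly like the following. Only `expr: vector(1)` and `severity: warning` come from this PR; the resource name, namespace, group name, alert name, and annotation text are hypothetical and should be adapted to the cluster.

```yaml
# Sketch only: temporary always-firing rule for a delivery test. Delete it after verifying.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: synthetic-discord-test        # hypothetical name
  namespace: monitoring               # assumed namespace
spec:
  groups:
    - name: synthetic-test            # hypothetical group name
      rules:
        - alert: SyntheticDiscordTest # hypothetical alert name
          expr: vector(1)             # always returns a sample, so the alert always fires
          labels:
            severity: warning
          annotations:
            summary: Synthetic alert to verify Discord routing   # wording assumed
```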
- […] `yaml.safe_load_all`). yamllint/kubeconform/promtool not available locally; CI will validate.
- `ruleSelectorNilUsesHelmValues: false` in values.yaml means the operator selects all PrometheusRules, so the new file needs no `release` label.
Relates to JDWLABS-54 (Workstream C)
🤖 Generated with Claude Code