chore(alerting): alert hygiene + SLO burn-rate alerts + VaultSealed by jdwillmsen · Pull Request #39 · jdwlabs/platform

jdwillmsen · 2026-06-17T18:43:21Z

Summary

Workstream C of the alerting/SLO uplift: alerting hygiene across all PrometheusRules, a multi-window multi-burn-rate SLO for Loki, and verification of the VaultSealed alert + routing. Scope is alert rules + Alertmanager config only — the kube-prometheus-stack chart revision (Workstream B) is untouched.

Alert inventory (pre-change)

File	Alerts	Severity	summary/description	runbook_url
rules-cert-manager	CertificateExpiringSoon, CertificateExpired, CertManagerNotReady	yes	yes	missing
rules-longhorn	LonghornVolumeUsageHigh, LonghornVolumeFaulted, LonghornNodeStorageLow	yes	yes	missing
rules-cnpg	CNPGClusterDown, CNPGReplicationLag, CNPGLongRunningTransaction	yes	yes	missing
rules-loki	LokiRequestErrors, LokiIngestionRateDropped	yes	yes	missing
rules-argocd	ArgoCDAppNotSynced, ArgoCDAppUnhealthy, ArgoCDSyncFailing	yes	yes	missing
rules-platform-stability	LonghornInstanceManagerRecreated, LonghornManagerOOMKilled, VaultNotReady, VaultDegraded, VaultSealed, NodeMemoryHigh	yes	yes	missing
storage-capacity-rules	PVCUsageHigh	yes	yes	missing

Finding: every alert already carried a severity label and summary/description annotations. The only systematic gap was a missing runbook_url.

Hygiene fixes

Added a runbook_url to all 21 existing alerts, each pointing at the relevant section of docs/OPERATIONS.md (TLS, Vault lifecycle, PostgreSQL ops, observability quick-refs, or the troubleshooting symptom→fix table). Used existing doc anchors only — no fabricated per-alert pages.
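The hygiene pass above can be repeated mechanically. The following is a minimal sketch, assuming the rule files have already been parsed into Python dicts shaped like a PrometheusRule `spec.groups` list. The function name `missing_hygiene_fields`, the alert name `ExampleAlert`, and the sample data are hypothetical and not taken from the repository.

```python
from typing import Dict, List

# Fields every alerting rule is expected to carry after this PR.
REQUIRED_LABELS = ("severity",)
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook_url")


def missing_hygiene_fields(groups: List[dict]) -> Dict[str, List[str]]:
    """Map each alert name to the required labels/annotations it lacks."""
    gaps: Dict[str, List[str]] = {}
    for group in groups:
        for rule in group.get("rules", []):
            name = rule.get("alert")
            if name is None:
                continue  # recording rules have no "alert" key; skip them
            labels = rule.get("labels") or {}
            annotations = rule.get("annotations") or {}
            missing = [f"labels.{k}" for k in REQUIRED_LABELS if not labels.get(k)]
            missing += [
                f"annotations.{k}" for k in REQUIRED_ANNOTATIONS if not annotations.get(k)
            ]
            if missing:
                gaps[name] = missing
    return gaps


# Hypothetical sample mirroring the pre-change state: severity, summary and
# description present, runbook_url absent.
sample_groups = [
    {
        "name": "example",
        "rules": [
            {
                "alert": "ExampleAlert",
                "expr": "vector(1)",  # placeholder expression
                "labels": {"severity": "warning"},
                "annotations": {"summary": "s", "description": "d"},
            },
            {"record": "example:recording:rule", "expr": "vector(1)"},
        ],
    }
]

print(missing_hygiene_fields(sample_groups))
```

An empty result would indicate that every alerting rule in the parsed groups carries all four fields.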

Removals / merges

None. No dead, never-firing, or duplicate alerts were found. Notes on the closest calls:
- LokiRequestErrors (flat 10% per-route 5xx) overlaps in spirit with the new Loki fast-burn alert but is a per-route symptom alert, not budget-aware — kept as a complementary fast symptom signal.
- LonghornInstanceManagerRecreated uses for: 0m (intentionally eager) — documented in-file as catching the instance-manager cascade upstream; kept.
- ArgoCDAppNotSynced can be chatty during GitOps churn but is gated by for: 15m; kept.

New SLO / burn-rate alerts

New file rules-slo-loki.yaml:

Recording rules for the Loki 5xx error ratio over 5m/30m/1h/2h/6h/1d/3d.
Three multi-window multi-burn-rate alerts against a 99.9% availability SLO (Google SRE workbook, table 5-8):
- LokiErrorBudgetBurnFast — 14.4x, 1h+5m windows, critical (page)
- LokiErrorBudgetBurnMedium — 6x, 6h+30m windows, critical (page)
- LokiErrorBudgetBurnSlow — 1x, 3d+6h windows, warning (ticket)
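The arithmetic behind the three tiers can be sketched as follows. This is an illustration, not the contents of rules-slo-loki.yaml: the 30-day SLO period is an assumption (the PR does not state the period; it is the one the SRE workbook table uses), and the helper names are hypothetical.

```python
SLO_TARGET = 0.999           # 99.9% availability, from the PR description
SLO_PERIOD_HOURS = 30 * 24   # ASSUMPTION: 30-day SLO period
ERROR_BUDGET = 1 - SLO_TARGET

# (alert name, burn rate, long window in hours), taken from the list above
TIERS = [
    ("LokiErrorBudgetBurnFast", 14.4, 1),
    ("LokiErrorBudgetBurnMedium", 6.0, 6),
    ("LokiErrorBudgetBurnSlow", 1.0, 72),
]


def error_ratio_threshold(burn_rate: float) -> float:
    """Error ratio above which the budget burns at `burn_rate` times the sustainable pace."""
    return burn_rate * ERROR_BUDGET


def budget_consumed(burn_rate: float, long_window_hours: float) -> float:
    """Fraction of the period's error budget spent if the burn lasts the whole long window."""
    return burn_rate * long_window_hours / SLO_PERIOD_HOURS


def should_fire(long_ratio: float, short_ratio: float, burn_rate: float) -> bool:
    """Multi-window condition: both the long and the short window must exceed the threshold."""
    threshold = error_ratio_threshold(burn_rate)
    return long_ratio > threshold and short_ratio > threshold


for name, rate, hours in TIERS:
    print(
        f"{name}: error ratio > {error_ratio_threshold(rate):.4f}, "
        f"budget consumed over long window = {budget_consumed(rate, hours):.0%}"
    )
```

Requiring the short window as well means an alert stops firing soon after the error ratio recovers, instead of lingering until the long window drains.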

Loki is the only platform service exporting a per-status-code HTTP counter that Prometheus actually scrapes (loki_request_duration_seconds_count), so it is the only defensible request-based SLI today.
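For orientation, a request-based error ratio over this counter might be built along the following lines. This is a sketch only: the `status_code` label name and the recording-rule naming scheme are assumptions, and the actual expressions in rules-slo-loki.yaml may differ.

```python
METRIC = "loki_request_duration_seconds_count"
WINDOWS = ["5m", "30m", "1h", "2h", "6h", "1d", "3d"]  # windows listed in the PR


def error_ratio_expr(window: str) -> str:
    """PromQL for the share of requests answered with a 5xx over `window`."""
    # ASSUMPTION: the counter carries a `status_code` label.
    errors = f'sum(rate({METRIC}{{status_code=~"5.."}}[{window}]))'
    total = f"sum(rate({METRIC}[{window}]))"
    return f"{errors} / {total}"


def record_name(window: str) -> str:
    """Hypothetical recording-rule name; not taken from the repository."""
    return f"loki:request_errors:ratio_rate{window}"


for w in WINDOWS:
    print(f"{record_name(w)} = {error_ratio_expr(w)}")
```

Precomputing one ratio per window keeps the alert expressions short and avoids re-evaluating long range queries on every alert evaluation.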

SLO follow-ups (not fabricated here)

Gateway/ingress availability SLO is deferred. NGINX Gateway Fabric runs the OSS data plane, which exposes no per-status-code request metric (status codes require NGINX Plus). It also currently has no ServiceMonitor scraping its :9113 endpoint. Follow-up: add a ServiceMonitor and, if status-coded SLIs are needed, evaluate the Plus data plane or an alternative request-status source.
kube-apiserver SLOs are already provided by the chart (defaultRules.kubeApiserverSlos: true) — not duplicated.

VaultSealed status

VaultSealed (vault_core_unsealed == 0, critical, for: 1m) already exists in rules-platform-stability.yaml and the vault-metrics ServiceMonitor (/v1/sys/metrics) is already on main, so the alert is live, not gated. This PR only adds its runbook_url. The layered Vault detection (VaultNotReady/VaultDegraded from kube-state-metrics, VaultSealed from Vault telemetry) is intact.

Routing coverage

Alertmanager config (managed via the alertmanager-config ExternalSecret) routes all severities to the single discord receiver, with Watchdog/InfoInhibitor sent to null and a critical→warning inhibit rule (equal: [alertname, namespace]). New critical/warning SLO alerts are therefore covered by the default route — no routing change required.
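The routing and inhibition behaviour described above can be modelled in a few lines. This is a simplified sketch of the described configuration, not Alertmanager's implementation; the function names, the sample namespace, and the label dicts are hypothetical.

```python
from typing import Dict, Iterable, Sequence

NULL_ROUTED = {"Watchdog", "InfoInhibitor"}  # sent to the null receiver per the PR


def receiver_for(labels: Dict[str, str]) -> str:
    """Pick the receiver: heartbeat/info alerts go to null, everything else to discord."""
    return "null" if labels.get("alertname") in NULL_ROUTED else "discord"


def is_inhibited(
    target: Dict[str, str],
    firing: Iterable[Dict[str, str]],
    equal: Sequence[str] = ("alertname", "namespace"),
) -> bool:
    """True if a firing critical alert mutes this warning alert.

    Models the critical->warning inhibit rule: the `equal` labels must match
    between the critical (source) and the warning (target) alert.
    """
    if target.get("severity") != "warning":
        return False
    return any(
        src.get("severity") == "critical"
        and all(src.get(k) == target.get(k) for k in equal)
        for src in firing
    )


# Hypothetical label sets for two of the new SLO alerts.
fast = {"alertname": "LokiErrorBudgetBurnFast", "namespace": "loki", "severity": "critical"}
slow = {"alertname": "LokiErrorBudgetBurnSlow", "namespace": "loki", "severity": "warning"}

print(receiver_for(fast), receiver_for(slow))
print(is_inhibited(slow, [fast]))
```

Because `alertname` is part of `equal`, a firing fast-burn alert would not mute the slow-burn warning under this model: the two alerts have different names, so both would reach Discord.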

Manual test step (not performed — no live cluster): after merge + sync, fire a synthetic alert to confirm Discord delivery, e.g. apply a temporary always-firing PrometheusRule (expr: vector(1), severity: warning) and verify it reaches the Discord channel, then delete it.
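One way to produce such a temporary rule is to generate the manifest as JSON, which kubectl accepts alongside YAML. This is a sketch under assumptions: the resource name, alert name, group name, and the `monitoring` namespace are all hypothetical and should be adapted to the cluster.

```python
import json


def synthetic_alert_rule(
    name: str = "synthetic-discord-test",  # hypothetical resource name
    namespace: str = "monitoring",         # ASSUMPTION: namespace watched by the operator
) -> dict:
    """Build an always-firing PrometheusRule for a one-off delivery test."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": "synthetic-test",
                    "rules": [
                        {
                            "alert": "SyntheticDiscordTest",
                            "expr": "vector(1)",  # always returns a sample, so the alert always fires
                            "labels": {"severity": "warning"},
                            "annotations": {
                                "summary": "Synthetic alert for routing verification",
                                "description": "Temporary rule; delete after confirming delivery.",
                            },
                        }
                    ],
                }
            ]
        },
    }


print(json.dumps(synthetic_alert_rule(), indent=2))
```

The printed manifest could be piped to `kubectl apply -f -` and removed afterwards with `kubectl delete prometheusrule` using the same name and namespace.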

Validation

All changed and new YAML files parse cleanly with yaml.safe_load_all, which confirms syntax only. yamllint, kubeconform, and promtool were not available locally, so schema and rule-expression validation is left to CI. Because ruleSelectorNilUsesHelmValues is set to false in values.yaml, the operator selects all PrometheusRules, so the new file needs no release label.

Relates to JDWLABS-54 (Workstream C)

🤖 Generated with Claude Code

Every alert now links to the relevant Operations Manual section so responders have a consistent first stop. No expr/severity/threshold changes; all existing alerts already had severity, summary, and description, so this is purely an annotation-hygiene pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Assisted-by: Claude:Opus-4.8 [claude-code]

Add recording rules for the Loki 5xx request-error ratio over 5m/30m/1h/2h/6h/1d/3d and three multi-window multi-burn-rate alerts (14.4x fast page, 6x medium page, 1x slow ticket) against a 99.9% availability SLO, per the Google SRE workbook. Loki is the only platform service that exports a scraped per-status-code HTTP counter, so it is the only defensible request-based SLI today. A gateway availability SLO is deferred: NGINX Gateway Fabric's OSS data plane exposes no per-status-code metric and is not yet scraped.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Assisted-by: Claude:Opus-4.8 [claude-code]

jdwillmsen and others added 2 commits June 17, 2026 13:41

jdwillmsen marked this pull request as ready for review June 18, 2026 04:15

jdwillmsen merged commit 562a31e into main Jun 18, 2026
5 checks passed

jdwillmsen deleted the chore/alerting-hygiene-slos branch June 18, 2026 05:27

jdwillmsen merged 2 commits into main from chore/alerting-hygiene-slos
