From b3f58e890f37ffc9d137b121ad0521f554b29023 Mon Sep 17 00:00:00 2001
From: Fabian Fischer
Date: Fri, 26 Aug 2022 15:56:45 +0200
Subject: [PATCH 1/2] Add best practices for writing Prometheus Alerts

---
 docs/modules/ROOT/nav.adoc                    |   1 +
 .../commodore-components/alerts.adoc          | 132 ++++++++++++++++++
 2 files changed, 133 insertions(+)
 create mode 100644 docs/modules/ROOT/pages/explanations/commodore-components/alerts.adoc

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 783bcc7d..fe5cea64 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -96,6 +96,7 @@ include::steward:ROOT:partial$nav-reference.adoc[]
 ** xref:explanations/commodore-components/helm-charts.adoc[Using Helm charts]
 ** xref:explanations/commodore-components/parameters-logic.adoc[Conditionals in the parameters hierarchy]
 ** xref:explanations/commodore-components/crds.adoc[Custom Resource Defintions]
+** xref:explanations/commodore-components/alerts.adoc[Writing Prometheus Alert Rules]
 * xref:explanations/commodore-packages.adoc[Commodore Packages Best Practices]
 * xref:explanations/jsonnet.adoc[Jsonnet Best Practices]
 * xref:explanations/component_template_sync.adoc[Keep components in sync]
diff --git a/docs/modules/ROOT/pages/explanations/commodore-components/alerts.adoc b/docs/modules/ROOT/pages/explanations/commodore-components/alerts.adoc
new file mode 100644
index 00000000..13c15fc7
--- /dev/null
+++ b/docs/modules/ROOT/pages/explanations/commodore-components/alerts.adoc
@@ -0,0 +1,132 @@
+= Providing Prometheus Alert Rules
+
+When writing a component that manages a critical piece of infrastructure, you should provide alerts that notify the operator if it fails.
+Writing good alerts and runbooks is difficult.
+This document collects some best practices that have worked for us so far.
+
+== Writing Alert Rules
+
+In nearly all cases you can provide Prometheus alert rules through the https://prometheus-operator.dev/docs/operator/api/#monitoring.coreos.com/v1.PrometheusRule[PrometheusRule CRD].
+This definition is then picked up by the responsible monitoring component.
+
+For OpenShift clusters this generally means labeling the namespace with `openshift.io/cluster-monitoring: 'true'`; for clusters with Rancher monitoring it means labeling the namespace with `SYNMonitoring: 'main'`.
+
+Keep the following guidelines in mind when writing alert rules; a minimal example rule is sketched after the list.
+
+* *Alerts need to be actionable*
++
+Try to imagine what you would do if you received this alert.
+If the answer is "I don't know" or "wait and see if it resolves itself", you probably shouldn't emit this alert.
+
+* *Label your alerts*
++
+Label your alerts so that they can be routed effectively.
+At the very least, add the labels `syn: 'true'` and `syn_component: 'COMPONENT_NAME'` to indicate that the alert is managed by a Project Syn component, as well as a `severity` label.
+
+* *Assess severity*
++
+How critical is this alert?
+We generally differentiate between three severity levels.
++
+`info` is for alerts that don't need urgent intervention.
+These are things that someone should look into, but it can usually wait up to a few days.
+Info alerts could often also just be part of a dashboard.
++
+`warning` is for alerts that should be looked at as soon as possible, but can usually wait until regular office hours.
++
+`critical` is for alerts that need immediate attention, even outside office hours.
++
+Carefully decide which category your alert belongs in and add the appropriate `severity` label.
+But keep in mind that if all alerts are critical, none of them are.
+
+* *Make alerts tunable*
++
+You most likely won't be able to write a perfect alert out of the box.
+It will either be too noisy, not sensitive enough, or in some other way not relevant for the user.
+With that in mind, give the user a way to tune your alert.
++
+At the very least, provide ways to selectively enable or disable individual alerts.
+It's considered best practice to let the user override the whole alert specification if they wish.
+However, it's a good idea to also provide more convenient parameters for configuration that often needs to be adapted, such as alert labels or alert-specific settings like a list of relevant namespaces.
++
+Try to imagine what a user might need to change and make tuning it as easy as possible.
+
+* *Provide a runbook*
++
+You should always provide a link to a runbook in a `runbook_url` annotation.
+See the section below on writing good runbooks.
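+
+To illustrate these guidelines, a minimal sketch of such a rule could look like the following.
+The component name, namespace, alert name, expression, and runbook URL are placeholders; adapt them to your component.
+
+[source,yaml]
+----
+apiVersion: monitoring.coreos.com/v1
+kind: PrometheusRule
+metadata:
+  name: mycomponent-alerts      # placeholder name
+  namespace: syn-mycomponent    # placeholder namespace
+spec:
+  groups:
+    - name: mycomponent.alerts
+      rules:
+        - alert: MyComponentDown                # hypothetical alert name
+          expr: up{job="mycomponent"} == 0      # placeholder expression
+          for: 5m
+          labels:
+            syn: 'true'
+            syn_component: mycomponent
+            severity: critical
+          annotations:
+            summary: mycomponent has been unreachable for 5 minutes.
+            # Point this at your component's published runbook.
+            runbook_url: https://example.com/mycomponent/runbooks/MyComponentDown.html
+----
+
+The `syn`, `syn_component`, and `severity` labels allow the alert to be routed, and the `runbook_url` annotation gives whoever receives the alert a place to start.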
+
+Following these guidelines, you should get a usable alert.
+There are still some pitfalls when writing Prometheus alerts, but there are also many guides to help you write them.
+You can look at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/[the official documenation] or check out how https://blog.cloudflare.com/monitoring-our-monitoring/[Cloudflare writes alert rules].
+
+[WARNING]
+====
+When installing third-party software, there are often upstream alerts.
+It's a good idea to reuse these alerts, but the best practices still apply.
+
+Don't blindly include all upstream alerts.
+Check if they're actionable, add labels, make them tunable, and provide a runbook, even if you didn't write the alert yourself.
+====
+
+== Writing Runbooks
+
+Every alert rule should have a runbook.
+The runbook is the first place a user looks to get information on the alert and how to debug it.
+
+It should tell the reader:
+
+* *What does this alert mean?*
++
+Tell the reader why they got the alert.
+What exactly doesn't work as it should?
+Maybe also tell the user how the alert was measured and whether there might be false positives.
+* *What's the impact?*
++
+Who and what's affected?
+How fast should the reader react?
+The alert labels should already give an impression of how critical the alert is, but try to be more explicit in the runbook.
+* *How do I diagnose this?*
++
+Provide some input on how to debug this.
+Where might the reader get the relevant events or logs?
+How can he narrow down the possible root causes.
+* *How may I mitigate the issue?*
++
+List some possible mitigation strategies or ways to resolve this alert for good.
++
+NOTE: Ideally, you shouldn't alert on issues that could be fixed automatically.
+If there is one clear way to resolve this alert, check whether you could resolve it automatically instead.
+* *How do I tune the alert?*
++
+Maybe this alert wasn't actionable, or maybe it was raised far too late.
+Give the reader options to tune the alert to make it less noisy or more sensitive.
+
+Whenever possible, try to provide code snippets and precise instructions.
+If the reader got a critical alert, they don't have the time or nerves to build the `jq` query they need right now or to find out exactly which controller is responsible for this CRD.
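+
+For example, if your component exposes parameters for tuning its alerts, the runbook can include a ready-to-paste configuration snippet such as the following sketch.
+The component name, alert name, and parameter layout are purely illustrative; use whatever tuning options your component actually provides.
+
+[source,yaml]
+----
+# Hypothetical hierarchy snippet for tuning the alert.
+parameters:
+  mycomponent:
+    alerts:
+      MyComponentDown:
+        enabled: true        # set to false to disable the alert entirely
+        for: 15m             # wait longer before firing to reduce noise
+        labels:
+          severity: warning  # don't page outside office hours
+----
+
+Such a snippet is far more helpful to someone who was just paged than a prose description of which parameter to change.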
+
+It's considered best practice to put all your runbooks at `docs/modules/ROOT/pages/runbooks/ALERTNAME.adoc`, but there might be good reasons to deviate from this.
+Just make sure to adjust the runbook links as necessary.
+
+Finally, a runbook doesn't have to be perfect.
+Maybe you don't really know how this might fail or how to debug it, or maybe you simply don't have the resources right now to write a comprehensive runbook.
+Add one anyway.
+Any input can be valuable when debugging an alert, and at the very least there is now a basis to improve on as we learn more.
+
+[IMPORTANT]
+.Removing or Renaming Alert Rules
+====
+Sometimes alerts become obsolete.
+Maybe the system can now resolve the issue automatically, or the responsible part simply doesn't exist anymore.
+
+However, you need to make sure that you *never* break a runbook link.
+There might be people using older releases of your component, and their runbook links should still lead to valid runbooks.
+
+* Don't remove runbook remarks when they become obsolete, but add a note that they're only relevant for older versions.
+* Don't remove runbooks; simply remove them from the navigation instead.
+* If you rename an alert or move the runbook, use https://docs.antora.org/antora/latest/page/page-aliases/[page aliases] to keep old links valid.
+
+If you follow these three rules, runbook links should always stay valid.
+====

From b3abb327fb7d2f50c876e70edc6f1e429c917f79 Mon Sep 17 00:00:00 2001
From: Fabian Fischer <10788152+glrf@users.noreply.github.com>
Date: Wed, 31 Aug 2022 15:24:24 +0200
Subject: [PATCH 2/2] Fix typo in alerts best practices
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Co-authored-by: Sebastian Widmer
Co-authored-by: Christian Häusler <794584+corvus-ch@users.noreply.github.com>
---
 .../ROOT/pages/explanations/commodore-components/alerts.adoc | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/modules/ROOT/pages/explanations/commodore-components/alerts.adoc b/docs/modules/ROOT/pages/explanations/commodore-components/alerts.adoc
index 13c15fc7..79fe535e 100644
--- a/docs/modules/ROOT/pages/explanations/commodore-components/alerts.adoc
+++ b/docs/modules/ROOT/pages/explanations/commodore-components/alerts.adoc
@@ -59,7 +59,7 @@ See the section below on writing good runbooks.
 Following these guidelines, you should get a usable alert.
 There are still some pitfalls when writing Prometheus alerts, but there are also many guides to help you write them.
-You can look at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/[the official documenation] or check out how https://blog.cloudflare.com/monitoring-our-monitoring/[Cloudflare writes alert rules].
+You can look at https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/[the official documentation] or check out how https://blog.cloudflare.com/monitoring-our-monitoring/[Cloudflare writes alert rules].
 
 [WARNING]
 ====
 When installing third-party software, there are often upstream alerts.
 It's a good idea to reuse these alerts, but the best practices still apply.
@@ -91,7 +91,7 @@ The alert labels should already give an impression of how critical the alert is
 +
 Provide some input on how to debug this.
 Where might the reader get the relevant events or logs?
-How can he narrow down the possible root causes.
+How to narrow down the possible root causes?
 * *How may I mitigate the issue?*
 +
 List some possible mitigation strategies or ways to resolve this alert for good.