
Prometheus Monitoring Mixin for Prometheus itself. #4474

Merged: 29 commits, Jun 28, 2019

Commits
ee1427f  Prometheus monitoring mixin for Prometheus itself.  (tomwilkie, May 9, 2018)
e8a8ce5  Basic Prometheus dashboard.  (tomwilkie, Aug 7, 2018)
266ba18  Remove PromScrapeFailed alert.  (tomwilkie, Aug 7, 2018)
50861d5  Alert if more than 1% of alerts fail for a given integration.  (tomwilkie, Aug 7, 2018)
5fd712b  copypasta.  (tomwilkie, Aug 7, 2018)
dfbdf8d  Add a basic readme with link to the mixin docs.  (tomwilkie, Nov 16, 2018)
8f42192  Add Prometheus alerts from kube-prometheus, remove the alertmanager a…  (tomwilkie, Nov 19, 2018)
638204c  Typo  (tomwilkie, Nov 19, 2018)
e248ffb  Add alert for WAL remote write falling behind.  (tomwilkie, Feb 12, 2019)
b615069  Update metric names.  (tomwilkie, Mar 1, 2019)
38a9bbb  Loosen off PrometheusRemoteWriteBehind alert.  (tomwilkie, Mar 4, 2019)
5639aaf  Merge branch 'master' into mixin  (beorn7, Jun 17, 2019)
a5762f3  Add dashboard for remote write to prometheus-mixin.  (cstyan, Jun 17, 2019)
e248f4d  Merge pull request #5601 from cstyan/callum-mixin-rw-dashboard  (beorn7, Jun 18, 2019)
498d31e  Merge pull request #5681 from prometheus/beorn7/mixin  (beorn7, Jun 19, 2019)
e943803  Add .gitignore file  (beorn7, Jun 26, 2019)
ddfabda  Add Makefile and suitable jsonnet files  (beorn7, Jun 26, 2019)
5c04ef3  Make README.md immediately useful  (beorn7, Jun 26, 2019)
d45e8a0  Adjust to jsonnet v0.13  (beorn7, Jun 26, 2019)
d5845ad  Fix formatting  (beorn7, Jun 26, 2019)
23c0320  Fixed indentation  (beorn7, Jun 26, 2019)
e34af6d  Address various comments from the review  (beorn7, Jun 26, 2019)
613cb54  Add a "work in progress" disclaimer.  (beorn7, Jun 26, 2019)
1336a28  Use a config variable for the Prometheus name  (beorn7, Jun 27, 2019)
ded0705  Update remote repo for grafana-builder dependency  (beorn7, Jun 27, 2019)
7a25a25  Sync with alerts from kube-prometheus  (beorn7, Jun 27, 2019)
5270753  Remove/improve unused variables and weird doc comments  (beorn7, Jun 28, 2019)
9a21779  Protect gauge-based alerts against failed scrapes  (beorn7, Jun 28, 2019)
4825585  Tweak tenses  (beorn7, Jun 28, 2019)

.gitignore
@@ -0,0 +1,4 @@
*.yaml
dashboards_out
vendor
jsonnetfile.lock.json
Makefile
@@ -0,0 +1,25 @@
JSONNET_FMT := jsonnetfmt -n 2 --max-blank-lines 2 --string-style s --comment-style s

all: fmt prometheus_alerts.yaml dashboards_out lint

fmt:
	find . -name 'vendor' -prune -o -name '*.libsonnet' -print -o -name '*.jsonnet' -print | \
		xargs -n 1 -- $(JSONNET_FMT) -i

prometheus_alerts.yaml: mixin.libsonnet config.libsonnet alerts.libsonnet
	jsonnet -S alerts.jsonnet > $@

dashboards_out: mixin.libsonnet config.libsonnet dashboards.libsonnet
	@mkdir -p dashboards_out
	jsonnet -J vendor -m dashboards_out dashboards.jsonnet

lint: prometheus_alerts.yaml
	find . -name 'vendor' -prune -o -name '*.libsonnet' -print -o -name '*.jsonnet' -print | \
		while read f; do \
			$(JSONNET_FMT) "$$f" | diff -u "$$f" -; \
		done

	promtool check rules prometheus_alerts.yaml

clean:
	rm -rf dashboards_out prometheus_alerts.yaml
README.md
@@ -0,0 +1,36 @@
# Prometheus Mixin

_This is work in progress. We aim for it to become a good role model for alerts
and dashboards eventually, but it is not quite there yet._

The Prometheus Mixin is a set of configurable, reusable, and extensible alerts
and dashboards for Prometheus.

To use them, you need to have `jsonnet` (v0.13+) and `jb` installed. If you
have a working Go development environment, it's easiest to run the following:
```bash
$ go get github.com/google/go-jsonnet/cmd/jsonnet
$ go get github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb
```

_Note: The make targets `lint` and `fmt` need the `jsonnetfmt` binary, which is
currently not included in the Go implementation of `jsonnet`. For the time
being, you have to install the [C++ version of
jsonnetfmt](https://github.com/google/jsonnet) if you want to use `make lint`
or `make fmt`._

Next, install the dependencies by running the following command in this
directory:
```bash
$ jb install
```

You can then build a `prometheus_alerts.yaml` with the alerts and a directory
`dashboards_out` with the Grafana dashboard JSON files:
```bash
$ make prometheus_alerts.yaml
$ make dashboards_out
```

For more advanced uses of mixins, see https://github.com/monitoring-mixins/docs.
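
The mixin is meant to be customized via jsonnet rather than by editing the files directly. As a minimal sketch (the file name `my_alerts.jsonnet` and the selector value are made up for illustration), a consumer can override fields of `_config` from `config.libsonnet` and render the alerts the same way `alerts.jsonnet` does:
```jsonnet
// my_alerts.jsonnet (hypothetical consumer file): render the mixin's
// alerts with a custom job selector instead of the default one.
local mixin = (import 'mixin.libsonnet') + {
  _config+:: {
    prometheusSelector: 'job="prometheus-k8s"',  // example value
  },
};

std.manifestYamlDoc(mixin.prometheusAlerts)
```
Rendering it with `jsonnet -S my_alerts.jsonnet > prometheus_alerts.yaml` mirrors the `prometheus_alerts.yaml` target in the Makefile.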

alerts.jsonnet
@@ -0,0 +1 @@
std.manifestYamlDoc((import 'mixin.libsonnet').prometheusAlerts)
alerts.libsonnet
@@ -0,0 +1,260 @@
{
prometheusAlerts+:: {
groups+: [
{
name: 'prometheus',
rules: [
{
alert: 'PrometheusBadConfig',
expr: |||
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_config_last_reload_successful{%(prometheusSelector)s}[5m]) == 0
||| % $._config,
'for': '10m',
labels: {
severity: 'critical',
},
annotations: {
summary: 'Failed Prometheus configuration reload.',
description: 'Prometheus %(prometheusName)s has failed to reload its configuration.' % $._config,
},
},
{
alert: 'PrometheusNotificationQueueRunningFull',
expr: |||
# Without min_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
predict_linear(prometheus_notifications_queue_length{%(prometheusSelector)s}[5m], 60 * 30)
>
min_over_time(prometheus_notifications_queue_capacity{%(prometheusSelector)s}[5m])
)
||| % $._config,
'for': '15m',
labels: {
severity: 'warning',
This conversation was marked as resolved by tomwilkie.

brian-brazil (Member), Aug 7, 2018:
Ticket and page are our standard severities.

brancz (Member), Aug 7, 2018:
Hmm, looking at all the sets of alerts out there, my feeling is that warning/critical is more commonly used. I remember we've had discussions and did not really come to an agreement. Maybe these should be configurable and default to one or the other?

tomwilkie (Author, Member), Aug 7, 2018:
Agree with @brancz; we never reached consensus on this. Most definitions I've seen and used stick with warning/critical. With jsonnet it is trivial to map these to ticket/page, but we do need to pick a default. One for the dev summit?

tomwilkie (Author, Member), Aug 7, 2018:
I've put it on the agenda.

brian-brazil (Member), Aug 7, 2018:
My recollection is that we went with ticket/page in the end, and our examples should use those.

The issue with warning/critical is that they're poorly defined and subject to semantic drift over time. For example, many companies have hundreds to thousands of active "critical" alerts at any time that no one is working on. What we really want to indicate is: should this alert wake someone up in the middle of the night, or can it wait until morning?

My understanding is that if someone wants to override these, they can. That's the whole point of this approach.

brian-brazil (Member), Aug 7, 2018:
Looking through our existing docs, "page" is what we use. No other values are present.

tomwilkie (Author, Member), Aug 7, 2018:
There are good arguments in both directions; I'm not super opinionated about this.

But let's not use this PR to discuss this particular issue, as it can be divisive; either link to a previous discussion where this was decided, start a new discussion on the -dev list, or use the dev summit.

I'm happy to hold this PR until this is decided.

brian-brazil (Member), Aug 7, 2018:
I can't seem to find it in my email. As it stands, our existing usage is "page" in the docs, so if that's to be changed, that would need discussion.

tomwilkie (Author, Member), Nov 16, 2018:
We discussed this at the dev summit and settled on critical and warning:

https://docs.google.com/document/d/1-C5PycocOZEVIPrmM1hn8fBelShqtqiAmFptoG4yK70/edit
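
As an illustration of the point above that jsonnet makes it trivial to remap severities on the consumer side, a rough sketch (not part of this PR; it assumes the `prometheusAlerts`/`groups`/`rules` layout of this mixin and that every rule carries a `severity` label of `critical` or `warning`):
```jsonnet
// Hypothetical consumer-side remapping of critical/warning to page/ticket.
local mixin = import 'mixin.libsonnet';
local severityMap = { critical: 'page', warning: 'ticket' };

mixin {
  prometheusAlerts+:: {
    groups: [
      g {
        rules: [
          // Rewrite the severity label of each rule, leaving everything else intact.
          r { labels+: { severity: severityMap[r.labels.severity] } }
          for r in g.rules
        ],
      }
      for g in super.groups
    ],
  },
}
```
The resulting object can then be rendered exactly the way `alerts.jsonnet` renders the original.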

},
annotations: {
summary: 'Prometheus alert notification queue predicted to run full in less than 30m.',
description: 'Alert notification queue of Prometheus %(prometheusName)s is running full.' % $._config,
},
},
{
alert: 'PrometheusErrorSendingAlertsToSomeAlertmanagers',
expr: |||
(
rate(prometheus_notifications_errors_total{%(prometheusSelector)s}[5m])
/
rate(prometheus_notifications_sent_total{%(prometheusSelector)s}[5m])
)
* 100
> 1
||| % $._config,
'for': '15m',
labels: {
severity: 'warning',
},
annotations: {
summary: 'Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager.',
description: '{{ printf "%%.1f" $value }}%% errors while sending alerts from Prometheus %(prometheusName)s to Alertmanager {{$labels.alertmanager}}.' % $._config,
},
},
{
alert: 'PrometheusErrorSendingAlertsToAnyAlertmanager',
expr: |||
min without(alertmanager) (
rate(prometheus_notifications_errors_total{%(prometheusSelector)s}[5m])
/
rate(prometheus_notifications_sent_total{%(prometheusSelector)s}[5m])
)
* 100
> 3
||| % $._config,
'for': '15m',
labels: {
severity: 'critical',
},
annotations: {
summary: 'Prometheus encounters more than 3% errors sending alerts to any Alertmanager.',
description: '{{ printf "%%.1f" $value }}%% minimum errors while sending alerts from Prometheus %(prometheusName)s to any Alertmanager.' % $._config,
},
},
{
alert: 'PrometheusNotConnectedToAlertmanagers',
expr: |||
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
max_over_time(prometheus_notifications_alertmanagers_discovered{%(prometheusSelector)s}[5m]) < 1
||| % $._config,
'for': '10m',
labels: {
severity: 'warning',

brancz (Member), Jun 28, 2019:
This could potentially result in not being alerted at all (and this alert is also likely to be hard to send at all). Should this be critical, or do we expect users to have a dead-man's-switch type alerting setup? If so, we should probably at least mention it somewhere.

beorn7 (Member), Jun 28, 2019:
"Production-ready" meta-monitoring should not be done by a Prometheus server monitoring itself. Thus, this alert will fire if the meta-monitoring Prometheus detects other Prometheus servers without any discovered Alertmanagers.

Of course, the meta-monitoring Prometheus itself still needs to be able to send alerts. That's where a dead-man's-switch-like setup comes into play. In general, I think the meta-monitoring Prometheus is a good place to test the whole alerting chain with a dead-man's-switch-like setup. But I believe discussing that aspect of meta-monitoring is out of scope for this mixin example.
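
For reference, a dead-man's-switch style check usually boils down to an always-firing alert whose absence is what pages; a minimal sketch in the style of the rules in this file (not part of this PR; the alert name and severity are illustrative):
```jsonnet
{
  // Fires unconditionally; an external system (e.g. a paging provider)
  // alerts when this notification stops arriving, proving the whole
  // Prometheus -> Alertmanager -> receiver chain works end to end.
  alert: 'AlwaysFiring',
  expr: 'vector(1)',
  labels: {
    severity: 'none',
  },
  annotations: {
    summary: 'Always-firing alert used to verify the alerting pipeline.',
  },
}
```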

},
annotations: {
summary: 'Prometheus is not connected to any Alertmanagers.',
description: 'Prometheus %(prometheusName)s is not connected to any Alertmanagers.' % $._config,
},
},
{
alert: 'PrometheusTSDBReloadsFailing',
expr: |||
increase(prometheus_tsdb_reloads_failures_total{%(prometheusSelector)s}[3h]) > 0
||| % $._config,
'for': '4h',
labels: {
severity: 'warning',
},
annotations: {
summary: 'Prometheus has issues reloading blocks from disk.',
description: 'Prometheus %(prometheusName)s has detected {{$value | humanize}} reload failures over the last 3h.' % $._config,
},
},
{
alert: 'PrometheusTSDBCompactionsFailing',
expr: |||
increase(prometheus_tsdb_compactions_failed_total{%(prometheusSelector)s}[3h]) > 0
||| % $._config,
'for': '4h',
labels: {
severity: 'warning',
},
annotations: {
summary: 'Prometheus has issues compacting blocks.',
description: 'Prometheus %(prometheusName)s has detected {{$value | humanize}} compaction failures over the last 3h.' % $._config,
},
},
{
alert: 'PrometheusTSDBWALCorruptions',
expr: |||
increase(tsdb_wal_corruptions_total{%(prometheusSelector)s}[3h]) > 0
||| % $._config,
'for': '4h',

brian-brazil (Member), Nov 19, 2018:
This doesn't really jibe with the expression you're using; I'd expect a much shorter for clause.

beorn7 (Member), Jun 26, 2019:
Is there any reason to have a for clause at all if we want to alert on any WAL corruption that has ever occurred?

On the other hand, once the corrupted WAL is obsolete and no corruptions have happened in the current cycle, this alert is no longer actionable.

How about an increase over a 3h window with a 4h for clause? We'll get alerted if at least two consecutive cycles have a WAL corruption, and the alert will stop firing once WALs are written cleanly again.

brian-brazil (Member), Jun 26, 2019:
It's always wise to have some for clause, in case of weirdness. I'd use something like 5-10m.

The alert isn't actionable one way or the other.

beorn7 (Member), Jun 26, 2019:
"Weirdness" doesn't sound like a good reason to have a for clause.

This is a warning. If I regularly get WAL corruptions, I should start to wonder about a bug or a broken disk. I should also avoid restarting the server while the WAL is corrupted.

labels: {
severity: 'warning',
},
annotations: {
summary: 'Prometheus is detecting WAL corruptions.',
description: 'Prometheus %(prometheusName)s has detected {{$value | humanize}} corruptions of the write-ahead log (WAL) over the last 3h.' % $._config,
},
},
{
alert: 'PrometheusNotIngestingSamples',
expr: |||
rate(prometheus_tsdb_head_samples_appended_total{%(prometheusSelector)s}[5m]) <= 0
||| % $._config,
'for': '10m',
labels: {
severity: 'warning',
},
annotations: {
summary: 'Prometheus is not ingesting samples.',
description: 'Prometheus %(prometheusName)s is not ingesting samples.' % $._config,
},
},
{
alert: 'PrometheusDuplicateTimestamps',
expr: |||
rate(prometheus_target_scrapes_sample_duplicate_timestamp_total{%(prometheusSelector)s}[5m]) > 0
||| % $._config,
'for': '10m',
labels: {
severity: 'warning',
},
annotations: {
summary: 'Prometheus is dropping samples with duplicate timestamps.',
description: 'Prometheus %(prometheusName)s is dropping {{$value | humanize}} samples/s with different values but duplicated timestamp.' % $._config,
},
},
{
alert: 'PrometheusOutOfOrderTimestamps',
expr: |||
rate(prometheus_target_scrapes_sample_out_of_order_total{%(prometheusSelector)s}[5m]) > 0
||| % $._config,
'for': '10m',
labels: {
severity: 'warning',
},
annotations: {
summary: 'Prometheus drops samples with out-of-order timestamps.',
description: 'Prometheus %(prometheusName)s is dropping {{$value | humanize}} samples/s with timestamps arriving out of order.' % $._config,
},
},
{
alert: 'PrometheusRemoteStorageFailures',
expr: |||
(
rate(prometheus_remote_storage_failed_samples_total{%(prometheusSelector)s}[5m])
/
(
rate(prometheus_remote_storage_failed_samples_total{%(prometheusSelector)s}[5m])
+
rate(prometheus_remote_storage_succeeded_samples_total{%(prometheusSelector)s}[5m])
)
)
* 100
> 1
||| % $._config,
'for': '15m',
labels: {
severity: 'critical',
},
annotations: {
summary: 'Prometheus fails to send samples to remote storage.',
This conversation was marked as resolved by beorn7.

brian-brazil (Member), Jun 28, 2019:
is failing

description: 'Prometheus %(prometheusName)s failed to send {{ printf "%%.1f" $value }}%% of the samples to queue {{$labels.queue}}.' % $._config,
},
},
{
alert: 'PrometheusRemoteWriteBehind',
expr: |||
# Without max_over_time, failed scrapes could create false negatives, see
# https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
(
max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds{%(prometheusSelector)s}[5m])
- on(job, instance) group_right
max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds{%(prometheusSelector)s}[5m])
)
> 120
||| % $._config,
'for': '15m',
labels: {
severity: 'critical',
},
annotations: {
summary: 'Prometheus remote write is behind.',
description: 'Prometheus %(prometheusName)s remote write is {{ printf "%%.1f" $value }}s behind for queue {{$labels.queue}}.' % $._config,
},
},
{
alert: 'PrometheusRuleFailures',
expr: |||
increase(prometheus_rule_evaluation_failures_total{%(prometheusSelector)s}[5m]) > 0
||| % $._config,
'for': '15m',
labels: {
severity: 'critical',
},
annotations: {
summary: 'Prometheus is failing rule evaluations.',
description: 'Prometheus %(prometheusName)s has failed to evaluate {{ printf "%%.0f" $value }} rules in the last 5m.' % $._config,
},
},
{
alert: 'PrometheusMissingRuleEvaluations',
expr: |||
increase(prometheus_rule_group_iterations_missed_total{%(prometheusSelector)s}[5m]) > 0
||| % $._config,
'for': '15m',
labels: {
severity: 'warning',
},
annotations: {
summary: 'Prometheus is missing rule evaluations due to slow rule group evaluation.',
description: 'Prometheus %(prometheusName)s has missed {{ printf "%%.0f" $value }} rule group evaluations in the last 5m.' % $._config,
},
},
],
},
],
},
}
config.libsonnet
@@ -0,0 +1,16 @@
{
  _config+:: {
    // prometheusSelector is inserted as part of the label selector in
    // PromQL queries to identify metrics collected from Prometheus
    // servers.
    prometheusSelector: 'job="prometheus"',

    // prometheusName is inserted into annotations to name the Prometheus
    // instance affected by the alert.
    prometheusName: '{{$labels.instance}}',
    // If you run Prometheus on Kubernetes with the Prometheus
    // Operator, you can make use of the configured target labels for
    // nicer naming:
    // prometheusName: '{{$labels.namespace}}/{{$labels.pod}}'
  },
}
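
For reference, these values are spliced into the PromQL expressions in `alerts.libsonnet` via jsonnet's `%` string formatting; a small illustration (the `example` field is made up):
```jsonnet
// Demonstrates how the config above ends up inside a PromQL expression.
local config = (import 'config.libsonnet')._config;

{
  // 'up{%(prometheusSelector)s} == 0' % config  ->  'up{job="prometheus"} == 0'
  example: 'up{%(prometheusSelector)s} == 0' % config,
}
```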
dashboards.jsonnet
@@ -0,0 +1,6 @@
local dashboards = (import 'mixin.libsonnet').dashboards;

{
  [name]: dashboards[name]
  for name in std.objectFields(dashboards)
}
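
The comprehension above emits one top-level field per dashboard, and `jsonnet -m dashboards_out` (as used in the Makefile) writes each field to its own file. A rough sketch of how a consumer might add an extra dashboard alongside the mixin's (the file name is made up and the dashboard body is only a skeleton):
```jsonnet
// Combine the mixin's dashboards with a hand-written one, then emit all
// of them as separate output files via `jsonnet -m`.
local dashboards = (import 'mixin.libsonnet').dashboards + {
  'my-extra-dashboard.json': {
    title: 'My extra dashboard',
    panels: [],
  },
};

{
  [name]: dashboards[name]
  for name in std.objectFields(dashboards)
}
```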