Conversation
        severity: 'critical',
      },
      annotations: {
        message: 'Alertmanager has not found all other members of the cluster.',
Won't this page when an AM is down? That seems a bit spammy.
    {
      alert: 'AlertmanagerMembersInconsistent',
      expr: |||
        alertmanager_cluster_members{%(alertmanagerSelector)s}
indentation is the wrong way around here, and others also need fixing.
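For illustration, a consistently indented version of that rule (operands indented, the comparison operator outdented) might look roughly like the sketch below; the full expression and the alertmanagerClusterLabels config key are assumptions, not copied from this diff:

    {
      alert: 'AlertmanagerMembersInconsistent',
      // Sketch only: each instance's view of the cluster size should match the
      // member count reported across the whole cluster.
      expr: |||
          max_over_time(alertmanager_cluster_members{%(alertmanagerSelector)s}[5m])
        < on (%(alertmanagerClusterLabels)s) group_left
          count by (%(alertmanagerClusterLabels)s) (max_over_time(alertmanager_cluster_members{%(alertmanagerSelector)s}[5m]))
      ||| % $._config,
      'for': '15m',
      labels: {
        severity: 'critical',
      },
    },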
|
@tomwilkie first of all thanks for pushing this forward. Might it make sense to add a |
|
@beorn7 can you add this to your TODO list please? |
|
I have done that long ago. Just that there are a lot of items on that list… |
|
@beorn7 @tomwilkie I'm ok to tackle this one if you don't mind... |
|
Yes, please go ahead. I have this on my long todo list, but way too far down. |
|
@simonpasquier I have some practical reasons to work on this. Will push a few commits today and tomorrow. |
Oh, I was going to say the exact same thing. Happy to try & review. |
|
OK, got all my piled-up ideas implemented. I think this is meaty enough now to get merged into the main branch. Reviewers, please have another look. @Duologic this might be of special interest to you. We could try this out from this branch. |
      },
      annotations: {
        summary: 'All Alertmanager instances within the same cluster are down.',
        description: 'Each Alertmanager instance within the %(alertmanagerClusterName)s cluster has been up for less than {{ $value | humanizePercentage }} of the last 5m.' % $._config,
Shouldn't we alert earlier? If they are all down, it's too late to get the notification :) As a reminder, it is regularly said that one Alertmanager cluster is enough company-wide.
Right. Let me rephrase this to alert if half of the instances are down.
In any case, a proper alerting setup should also have an end-to-end test with an always firing alert, which, if it doesn't call in anymore, will trigger a secondary alert via a service like Dead Man's Snitch.
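For illustration only, such an always-firing rule in the mixin's jsonnet style might look roughly like the sketch below; the alert name, severity, and annotation are assumptions rather than anything proposed in this PR:

    {
      alert: 'Watchdog',
      // Sketch: fires unconditionally so that a dead man's switch service
      // (e.g. Dead Man's Snitch) can page if notifications stop arriving.
      expr: 'vector(1)',
      labels: {
        severity: 'none',
      },
      annotations: {
        summary: 'Always-firing alert used to verify the end-to-end alerting pipeline.',
      },
    },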
OK, done. The alerts fire now if half or more of the cluster is affected, which means 2 or more with a cluster of 3 or 4 instances, which means that you need two failed instances (rather than only one) before getting a page.
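For reference, the "half or more of the cluster is down" condition could be sketched along these lines; the use of up with avg_over_time and the alertmanagerClusterLabels grouping key are assumptions for illustration, not necessarily the exact expression in this PR:

    {
      alert: 'AlertmanagerClusterDown',
      // Sketch: count instances that were up for less than half of the last 5m
      // and compare that against the total number of instances in the cluster.
      expr: |||
        (
          count by (%(alertmanagerClusterLabels)s) (
            avg_over_time(up{%(alertmanagerSelector)s}[5m]) < 0.5
          )
        /
          count by (%(alertmanagerClusterLabels)s) (
            up{%(alertmanagerSelector)s}
          )
        ) >= 0.5
      ||| % $._config,
      'for': '5m',
      labels: {
        severity: 'critical',
      },
    },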
|
This is now deployed to our production environment. Will let you know if any insights come out of that. |
|
All looks good so far. @simonpasquier I think this should be considered for merging. Could you do a final review? |
|
@simonpasquier are you available to review this? Or alternatively, would you be fine delegating review to someone else? |
|
@simonpasquier or perhaps @brancz @tomwilkie @gouthamve @roidelapluie ? I guess this is of some interest to all of you. I'd really like to get any feedback on this and then merge this finally. |
tomwilkie
left a comment
LGTM from me, with a minor nit.
|
Can you add the CI like we have in Prometheus? |
|
Great feedback. Thank you very much. I'll incorporate it tomorrow (hopefully…). |
|
lgtm 👍 |
Signed-off-by: beorn7 <beorn@grafana.com>
|
Sorry all for the delay, had busy weeks recently. Finally addressed all the comments. Getting |
|
Mmh Makefile.common knows how to download a compiled version of promtool. |
I couldn't find that. Could you explain in more detail please? |
|
We have a $(PROMU) target that downloads promu, you do not need to go get it. Running make promu in the pipeline should work, instead of go get. |
|
But we don't need PROMU, we need promtool. And getting promu to then compile promtool seems even worse than what I'm doing now. At the very least, I want to keep the Makefile simple because it is not the Makefile of Alertmanager, it's just a simple Makefile for users of the mixin. They really shouldn't get in touch with Promu. |
|
If you know a way to put this all into the CircleCI config without bloating the makefile, please let me know. |
|
Sorry, I confused promu and promtool.
It looks like you could also take a look at https://github.com/monitoring-mixins/mixtool
|
That's already in here. Am I missing anything? |
|
@roidelapluie do you have any merge-blocking concerns left? I'd like to get this in if there is nothing substantially wrong with it. We can still improve CI details later. |
|
Now I see that |
Signed-off-by: beorn7 <beorn@grafana.com>
|
Thanks, everyone. Merging now (and will create a new PR with a new feature momentarily, I just didn't want to load up this here even more). Because we have accumulated 20 commits in this PR, I will make an exception from my usual practice and squash them. |
Reference to upstream PR here: prometheus/alertmanager#1629 Signed-off-by: Matthias Riegler <matthias.riegler@taotesting.com>
Alerts are from https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/ and prometheus/prometheus#4474.
Signed-off-by: Tom Wilkie tom.wilkie@gmail.com