Skip to content

Beginnings of an Alertmanager mixin.#1629

Merged
beorn7 merged 20 commits intomasterfrom
mixin
Dec 3, 2020
Merged

Beginnings of an Alertmanager mixin.#1629
beorn7 merged 20 commits intomasterfrom
mixin

Conversation

@tomwilkie
Copy link
Member

@tomwilkie tomwilkie commented Nov 19, 2018

severity: 'critical',
},
annotations:{
message: 'Alertmanager has not found all other members of the cluster.',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this page when an AM is down? That seems a bit spammy.

{
alert:'AlertmanagerMembersInconsistent',
expr: |||
alertmanager_cluster_members{%(alertmanagerSelector)s}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indentation is the wrong way around here, and others also need fixing.

@mxinden
Copy link
Member

mxinden commented Nov 20, 2018

@tomwilkie first of all thanks for pushing this forward.

Might it make sense to add a config.libsonnet like in prometheus/prometheus#4474?

@tomwilkie
Copy link
Member Author

@beorn7 can you add this to your TODO list please?

@beorn7
Copy link
Member

beorn7 commented Nov 17, 2019

I have done that long ago. Just that there are a lot of items on that list…

@simonpasquier
Copy link
Member

@beorn7 @tomwilkie I'm ok to tackle this one if you don't mind...

@beorn7
Copy link
Member

beorn7 commented Feb 28, 2020

Yes, please go ahead. I have this on my long todo list, but way too far down.
Any help appreciated.

@beorn7
Copy link
Member

beorn7 commented Oct 26, 2020

@simonpasquier I have some practical reasons to work on this. Will push a few commits today and tomorrow.

@beorn7 beorn7 removed the stale label Oct 26, 2020
@roidelapluie
Copy link
Member

@simonpasquier I have some practical reasons to work on this. Will push a few commits today and tomorrow.

Oh, I was going to say the exact same thing. Happy to try & review.

@beorn7
Copy link
Member

beorn7 commented Oct 28, 2020

OK, got all my piled-up ideas implemented. I think this is meaty enough now to get merged into the main branch.

Reviewers, please have another look.

@Duologic this might be of special interest to you. We could try this out from this branch.

},
annotations: {
summary: 'All Alertmanager instances within the same cluster are down.',
description: 'Each Alertmanager instances within the %(alertmanagerClusterName)s cluster has been up for less than {{ $value | humanizePercentage }} of the last 5m.' % $._config,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should'nt we alert earlier? if they are all down it's too late to get the notification :) as a reminder it is regularly said that one alertmanager cluster is enough company-wide

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Let me rephrase this to alert if half of the instances are down.

In any case, a proper alerting setup should also have an end-to-end test with an always firing alert, which, if it doesn't call in anymore, will trigger a secondary alert via a service like Dead Man's Snitch.

Copy link
Member

@beorn7 beorn7 Oct 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, done. The alerts fire now if half or more of the cluster is affected, which means 2 or more with a cluster of 3 or 4 instances, which means that you need two failed instances (rather than only one) before getting a page.

@beorn7
Copy link
Member

beorn7 commented Oct 30, 2020

This is now deployed to our production environment. Will let you know if any insights come out of that.

@beorn7
Copy link
Member

beorn7 commented Nov 2, 2020

All looks good so far.

@simonpasquier I think this should be considered for merging. Could you do a final review?

@beorn7
Copy link
Member

beorn7 commented Nov 5, 2020

@simonpasquier are you available to review this? Or alternatively, would you be fine delegating review to someone else?

@beorn7
Copy link
Member

beorn7 commented Nov 12, 2020

@simonpasquier or perhaps @brancz @tomwilkie @gouthamve @roidelapluie ? I guess this is of some interest to all of you.

I'd really like to get any feedback on this and then merge this finally.

Copy link
Member Author

@tomwilkie tomwilkie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from me, with a minor nit.

@roidelapluie
Copy link
Member

Can you add the CI like we have in Prometheus?

@beorn7
Copy link
Member

beorn7 commented Nov 16, 2020

Great feedback. Thank you very much. I'll incorporate it tomorrow (hopefully…).

@brancz
Copy link
Member

brancz commented Nov 26, 2020

lgtm 👍

Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
Signed-off-by: beorn7 <beorn@grafana.com>
@beorn7
Copy link
Member

beorn7 commented Dec 2, 2020

Sorry all for the delay, had busy weeks recently. Finally addressed all the comments.

Getting promtool to build within CI was PITA because of that dreaded clash between semantic versioning we use for Prometheus releases and how Go modules insists on using it on the code level.

@roidelapluie
Copy link
Member

Mmh Makefile.common knows how to download a compiled version of promtool.

@beorn7
Copy link
Member

beorn7 commented Dec 2, 2020

Makefile.common knows how to download a compiled version of promtool.

I couldn't find that. Could you explain in more detail please?

@roidelapluie
Copy link
Member

roidelapluie commented Dec 2, 2020

We have a $(PROMU) target that downloads promu, you do not need to go get it. Running make promu in the pipeline should work, instead of go get.

@beorn7
Copy link
Member

beorn7 commented Dec 2, 2020

But we don't need PROMU, we need promtool.

And getting promu to then compile promtool seems even worse than what I'm doing now.

At the very least, I want to keep the Makefile simple because it is not the Makefile of Alertmanager, it's just a simple Makefile for users of the mixin. They really shouldn't get in touch with Promu.

@beorn7
Copy link
Member

beorn7 commented Dec 2, 2020

If you know a way to put this all into the CircleCI config without bloating the makefile, please let me know.

@roidelapluie
Copy link
Member

roidelapluie commented Dec 2, 2020 via email

@beorn7
Copy link
Member

beorn7 commented Dec 3, 2020

It looks like you could look at https://github.com/monitoring-mixins/mixtool too

That's already in here. Am I missing anything?

@beorn7
Copy link
Member

beorn7 commented Dec 3, 2020

@roidelapluie do you have any merge-blocking concerns left? I'd like to get this in if there is nothing substantially wrong with it. We can still improve CI details later.

@beorn7
Copy link
Member

beorn7 commented Dec 3, 2020

Now I see that mixtool seems to include the promtool check rules capability. That's not documented, and I had a short look at the code and couldn't see it, but it's in there. So I can rip out promtool altogether.

Signed-off-by: beorn7 <beorn@grafana.com>
Copy link
Member

@roidelapluie roidelapluie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!!

@beorn7
Copy link
Member

beorn7 commented Dec 3, 2020

Thanks, everyone. Merging now (and will create a new PR with a new feature momentarily, I just didn't want to load up this here even more).

Because we have accumulated 20 commits in this PR, I will make an exception from my usual practice and squash them.

@beorn7 beorn7 merged commit 6c5dee0 into master Dec 3, 2020
@beorn7 beorn7 deleted the mixin branch December 3, 2020 14:58
xvzf pushed a commit to xvzf/grafana-jsonnet-libs that referenced this pull request Dec 7, 2020
Reference to upstream PR here: prometheus/alertmanager#1629

Signed-off-by: Matthias Riegler <matthias.riegler@taotesting.com>
Duologic pushed a commit to grafana/jsonnet-libs that referenced this pull request Dec 7, 2020
Reference to upstream PR here: prometheus/alertmanager#1629

Signed-off-by: Matthias Riegler <matthias.riegler@taotesting.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants