Conversation
        severity: 'critical',
      },
      annotations: {
        message: 'Alertmanager has not found all other members of the cluster.',
Won't this page when an AM is down? That seems a bit spammy.
    {
      alert: 'AlertmanagerMembersInconsistent',
      expr: |||
        alertmanager_cluster_members{%(alertmanagerSelector)s}
indentation is the wrong way around here, and others also need fixing.
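For illustration, a consistently indented version of that rule (operands indented, the comparison operator outdented) might look roughly like the sketch below; the full expression and the alertmanagerClusterLabels config key are assumptions, not copied from this diff:

    {
      alert: 'AlertmanagerMembersInconsistent',
      // Sketch only: each instance's view of the cluster size should match the
      // member count reported across the whole cluster.
      expr: |||
          max_over_time(alertmanager_cluster_members{%(alertmanagerSelector)s}[5m])
        < on (%(alertmanagerClusterLabels)s) group_left
          count by (%(alertmanagerClusterLabels)s) (max_over_time(alertmanager_cluster_members{%(alertmanagerSelector)s}[5m]))
      ||| % $._config,
      'for': '15m',
      labels: {
        severity: 'critical',
      },
    },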
|
@tomwilkie first of all thanks for pushing this forward. Might it make sense to add a |
|
@beorn7 can you add this to your TODO list please? |
|
I have done that long ago. Just that there are a lot of items on that list… |
|
@beorn7 @tomwilkie I'm ok to tackle this one if you don't mind... |
|
Yes, please go ahead. I have this on my long todo list, but way too far down. |
|
@simonpasquier I have some practical reasons to work on this. Will push a few commits today and tomorrow. |
Oh, I was going to say the exact same thing. Happy to try & review. |
|
OK, got all my piled-up ideas implemented. I think this is meaty enough now to get merged into the main branch. Reviewers, please have another look. @Duologic this might be of special interest to you. We could try this out from this branch. |
      },
      annotations: {
        summary: 'All Alertmanager instances within the same cluster are down.',
        description: 'Each Alertmanager instance within the %(alertmanagerClusterName)s cluster has been up for less than {{ $value | humanizePercentage }} of the last 5m.' % $._config,
Shouldn't we alert earlier? If they are all down, it's too late to get the notification :) As a reminder, it is regularly said that one Alertmanager cluster is enough company-wide.
Right. Let me rephrase this to alert if half of the instances are down.
In any case, a proper alerting setup should also have an end-to-end test with an always firing alert, which, if it doesn't call in anymore, will trigger a secondary alert via a service like Dead Man's Snitch.
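For illustration only, such an always-firing rule in the mixin's jsonnet style might look roughly like the sketch below; the alert name, severity, and annotation are assumptions rather than anything proposed in this PR:

    {
      alert: 'Watchdog',
      // Sketch: fires unconditionally so that a dead man's switch service
      // (e.g. Dead Man's Snitch) can page if notifications stop arriving.
      expr: 'vector(1)',
      labels: {
        severity: 'none',
      },
      annotations: {
        summary: 'Always-firing alert used to verify the end-to-end alerting pipeline.',
      },
    },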
OK, done. The alerts fire now if half or more of the cluster is affected, which means 2 or more with a cluster of 3 or 4 instances, which means that you need two failed instances (rather than only one) before getting a page.
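For reference, the "half or more of the cluster is down" condition could be sketched along these lines; the use of up with avg_over_time and the alertmanagerClusterLabels grouping key are assumptions for illustration, not necessarily the exact expression in this PR:

    {
      alert: 'AlertmanagerClusterDown',
      // Sketch: count instances that were up for less than half of the last 5m
      // and compare that against the total number of instances in the cluster.
      expr: |||
        (
          count by (%(alertmanagerClusterLabels)s) (
            avg_over_time(up{%(alertmanagerSelector)s}[5m]) < 0.5
          )
        /
          count by (%(alertmanagerClusterLabels)s) (
            up{%(alertmanagerSelector)s}
          )
        ) >= 0.5
      ||| % $._config,
      'for': '5m',
      labels: {
        severity: 'critical',
      },
    },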
|
This is now deployed to our production environment. Will let you know if any insights come out of that. |
|
All looks good so far. @simonpasquier I think this should be considered for merging. Could you do a final review? |
|
@simonpasquier are you available to review this? Or alternatively, would you be fine delegating review to someone else? |
|
@simonpasquier or perhaps @brancz @tomwilkie @gouthamve @roidelapluie ? I guess this is of some interest to all of you. I'd really like to get any feedback on this and then merge this finally. |
tomwilkie
left a comment
LGTM from me, with a minor nit.
|
Can you add the CI like we have in Prometheus? |
|
Great feedback. Thank you very much. I'll incorporate it tomorrow (hopefully…). |
|
lgtm 👍 |
Signed-off-by: beorn7 <beorn@grafana.com>
|
Sorry all for the delay, had busy weeks recently. Finally addressed all the comments. Getting |
|
Mmh Makefile.common knows how to download a compiled version of promtool. |
I couldn't find that. Could you explain in more detail please? |
|
We have a $(PROMU) target that downloads promu, you do not need to go get it. Running make promu in the pipeline should work, instead of go get. |
|
But we don't need PROMU, we need promtool. And getting promu to then compile promtool seems even worse than what I'm doing now. At the very least, I want to keep the Makefile simple because it is not the Makefile of Alertmanager, it's just a simple Makefile for users of the mixin. They really shouldn't get in touch with Promu. |
|
If you know a way to put this all into the CircleCI config without bloating the makefile, please let me know. |
|
Sorry, I confused promu and promtool.
It looks like you could also take a look at https://github.com/monitoring-mixins/mixtool
|
That's already in here. Am I missing anything? |
|
@roidelapluie do you have any merge-blocking concerns left? I'd like to get this in if there is nothing substantially wrong with it. We can still improve CI details later. |
|
Now I see that |
Signed-off-by: beorn7 <beorn@grafana.com>
|
Thanks, everyone. Merging now (and will create a new PR with a new feature momentarily, I just didn't want to load up this here even more). Because we have accumulated 20 commits in this PR, I will make an exception from my usual practice and squash them. |
Reference to upstream PR here: prometheus/alertmanager#1629 Signed-off-by: Matthias Riegler <matthias.riegler@taotesting.com>
Alerts are from https://github.com/coreos/prometheus-operator/blob/master/contrib/kube-prometheus/jsonnet/kube-prometheus/alerts/ and prometheus/prometheus#4474.
Signed-off-by: Tom Wilkie tom.wilkie@gmail.com