Description
tl;dr: I think it would be awesome if Alertmanager logged a rich, contextual message whenever it sent a notification, one that could be used for auditing, downstream analytics, etc. The current best practice of sending every alert to a webhook receiver would work, but it has drawbacks that I think are significant enough to warrant adding some additional logging.
My Wish 🧞
I run a centralized Alertmanager cluster that sends alerts to hundreds of different destinations, teams, and people. I would like a way to track how many alerts are sent to each destination and integration, along with the metadata of each alert, so that we can track it over time.
The Current Best Practice
From reading the docs and #560, the generally agreed-upon practice seems to be to set up a route that matches all alerts directly below the root route and send a webhook to something that does what you're looking for.
This would work fairly well for tracking alerts that have good metadata in labels about where they're going, but there are a few things about the approach that I'm not super excited about:
- You wouldn't have any way of knowing where the alert will actually be delivered (which integration, channel, etc.). So counting things by actual destination (in the unfortunate world where many routes lead to similar destinations) isn't possible. Notifications delivered per integration can be counted in metrics, but correlating that with inconsistent alert labels would be difficult.
- Adding a route that matches every alert for auditing means that an alert with no other matching route below the root will no longer fall back to the root route. That is exactly what I use the root route for today: catching badly defined user alerts.
- (This is probably not a real issue.) It adds a non-zero amount of overhead for Alertmanager and the cluster, which have to track and send the extra webhook notifications. They're computers, so it probably isn't a big deal, but it is definitely extra work.
I'll definitely acknowledge that we are not using Alertmanager the way it is meant to be used. If our routing tree were a lot cleaner, and the labels on all of our alerts were more consistent, the webhook approach would be less problematic. But the path of least resistance for me (and hopefully to the benefit of the community) is to see if I can make the change in Alertmanager; organizations and people are difficult :)
Proposal
I took a quick look at the code. My initial thought was:
- Adding to either the `Integration` or `Notifier` interface to expose some rich metadata about the notification destination that could be logged at notify time
- Grabbing the `CommonLabels` or `GroupLabels` from one of the sent alerts
- Adding a very rich structured log that contains everything you want to know about the alert and the destination
I'd be willing to take on the work, but I'll defer to the expertise of the folks here on whether this is even a reasonable thing to do and, if so, on the approach.
Thank you so much for taking the time to read my request, and thank you for all you do maintaining Prometheus and Alertmanager!