Add metric for permanently failed notifications #2383
Conversation
@SuperQ could you please take a look?
	r.metrics.numFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
	r.metrics.numTotalFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
	if !retry {
		r.metrics.numPermanentlyFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
The metric needs to be incremented before L701 too (when the context is done).
You're right. Fixed, thanks.
For readability it might be better to extract the existing code into a private method and wrap the new instrumentation around it:

func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	r.metrics.numNotifications.Inc()
	ctx, alerts, err := r.exec(ctx, l, alerts...)
	if err != nil {
		r.metrics.numFailedNotifications.Inc()
	}
	return ctx, alerts, err
}

func (r RetryStage) exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
	...
}
	}, []string{"integration"}),
	numPermanentlyFailedNotifications: prometheus.NewCounterVec(prometheus.CounterOpts{
		Namespace: "alertmanager",
		Name:      "notifications_permanently_failed_total",
I'm not totally convinced by the metric name. We should also count the number of tries.
I'm wondering if we shouldn't use the existing alertmanager_notifications_failed_total to count notifications that permanently failed to be delivered, and introduce alertmanager_notification_requests_total/alertmanager_notification_requests_failed_total metrics to count notification request attempts and failures.
@simonpasquier thanks for the update. I tried to be non-intrusive and to keep the existing meaning of notifications_failed_total. I agree that your suggested metrics have much clearer intent given their names.
Should I introduce them and delete the old metrics, or just add the new ones?
I think the existing metrics have their own merit and shouldn't be dropped. I'd advise renaming the current alertmanager_notifications_total and alertmanager_notifications_failed_total metrics to alertmanager_notification_requests_total and alertmanager_notification_requests_failed_total, and using alertmanager_notifications_total and alertmanager_notifications_failed_total to count actual notification results.
Signed-off-by: Max Neverov <neverov.max@gmail.com>
@simonpasquier could you please have another look?
Fixes: #2361
Signed-off-by: Max Neverov <neverov.max@gmail.com>