
Add metric for permanently failed notifications #2383

Merged: simonpasquier merged 1 commit into prometheus:master from mneverov:failed_permanently_metric on Nov 6, 2020.

Conversation

mneverov (Contributor) commented on Oct 4, 2020:

Fixes: #2361

Signed-off-by: Max Neverov neverov.max@gmail.com

mneverov (Contributor, author) commented on Oct 4, 2020:

@SuperQ could you please take a look?

Review thread on notify/notify.go (outdated):
r.metrics.numFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
r.metrics.numTotalFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
if !retry {
r.metrics.numPermanentlyFailedNotifications.WithLabelValues(r.integration.Name()).Inc()
Member commented:
the metric needs to be incremented before L701 too (when the context is done).

mneverov (Contributor, author) replied:
You're right. Fixed, thanks.

Member commented:
For readability it might be better to extract the existing code into a private method and wrap the new instrumentation around:

func (r RetryStage) Exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
    r.metrics.numNotifications.Inc()
    ctx, alerts, err := r.exec(ctx, l, alerts...)
    if err != nil {
        r.metrics.numFailedNotifications.Inc()
    }
    return ctx, alerts, err
}

func (r RetryStage) exec(ctx context.Context, l log.Logger, alerts ...*types.Alert) (context.Context, []*types.Alert, error) {
...
}


Review thread on notify/notify.go (outdated):
}, []string{"integration"}),
numPermanentlyFailedNotifications: prometheus.NewCounterVec(prometheus.CounterOpts{
Namespace: "alertmanager",
Name: "notifications_permanently_failed_total",
simonpasquier (Member) commented:
Not totally convinced by the metric name. We should also count the number of tries.
I'm wondering if we shouldn't use the existing alertmanager_notifications_failed_total to count notifications that couldn't be delivered permanently and introduce alertmanager_notification_requests_total/alertmanager_notification_requests_failed_total metrics to count notification request attempts and failures.

mneverov (Contributor, author) replied:
@simonpasquier thanks for the update. I tried to be non-intrusive and keep the existing meaning for the notifications_failed_total. I agree that your suggested metrics have much clearer intent given these names.
Should I introduce them and delete the old metric or just add the new ones?

simonpasquier (Member) replied:

I think that the existing metrics have their own merit and shouldn't be dropped. I'd advise renaming the current alertmanager_notifications_total and alertmanager_notifications_failed_total metrics to alertmanager_notification_requests_total and alertmanager_notification_requests_failed_total, and using alertmanager_notifications_total and alertmanager_notifications_failed_total to count actual notification results.

Signed-off-by: Max Neverov <neverov.max@gmail.com>
mneverov force-pushed the failed_permanently_metric branch from 4efde55 to 0749478 on October 17, 2020.
mneverov (Contributor, author) commented:
@simonpasquier could you please have another look?

simonpasquier (Member) left a review:

👍 thanks!

simonpasquier merged commit c39b787 into prometheus:master on Nov 6, 2020.

Development

Successfully merging this pull request may close this issue: Separate metrics for temporary and permanently failed notifications (#2361).