add metric for alertrule template rendering failure #4634

Closed
juliantaylor opened this Issue Sep 19, 2018 · 2 comments

juliantaylor commented Sep 19, 2018

Alert rules can contain templated values that are filled in from metric labels, for example in summaries.
Rendering such a template can fail because of a typo in the template. When that happens, no alert is sent to the Alertmanager, which can be a major problem.

E.g. forgetting the .Labels causes:

Sep 19 13:11:47 po-prometheus-live-bs01 prometheus[23155]: level=warn ts=2018-09-19T11:11:47.902437202Z caller=alerting.go:220 component="rule manager" alert=k8s_puppet_inconsistent_environment msg="Expanding alert template failed" err="error executing template __alert_k8s_puppet_inconsistent_environment: template: __alert_k8s_puppet_inconsistent_environment:1:68: executing \"__alert_k8s_puppet_inconsistent_environment\" at <.group>: can't evaluate field group in type struct { Labels map[string]string; Value float64 }" data="unsupported value type"
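
The failure above can be reproduced with Go's standard text/template package alone. The following is a minimal sketch, not Prometheus's own code: the alertData type only mirrors the struct named in the log message, and the expand helper, the template name and the sample label value are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// alertData mirrors the struct type named in the log message
// (Labels map[string]string; Value float64). It is a stand-in,
// not the type Prometheus uses internally.
type alertData struct {
	Labels map[string]string
	Value  float64
}

// expand is a hypothetical helper: it parses text and executes it
// against a fixed sample alertData value.
func expand(text string) (string, error) {
	tmpl, err := template.New("__alert_example").Parse(text)
	if err != nil {
		return "", err
	}
	data := alertData{Labels: map[string]string{"group": "web"}, Value: 1}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// ".group" parses fine but fails at execution time, because the
	// struct has no field named group.
	_, err := expand("group {{ .group }} is inconsistent")
	fmt.Println("typo:", err)

	// Going through .Labels looks the key up in the map instead.
	out, err := expand("group {{ .Labels.group }} is inconsistent")
	fmt.Println("fixed:", out, err)
}
```

Note that the typo is only detected when the template is executed, not when it is parsed, which is why it surfaces at rule evaluation time.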

There should be a metric that is incremented on rule template rendering failures, similar to prometheus_rule_evaluation_failures_total and prometheus_notifications_errors_total, so that one can alert on these failures.

We are currently using Prometheus 2.3.2.

Contributor

mucahitkurt commented Oct 16, 2018

@simonpasquier I would like to help with this issue. I think a metric like prometheus_template_expand_failures_total should be added and incremented when a parse error occurs in the Expand() and ExpandHTML() methods.

Member

simonpasquier commented Oct 16, 2018

@mucahitkurt sure, you're more than welcome! You're mostly correct. Maybe we can limit the counter to Expand(), as ExpandHTML() is only used for rendering web pages, not alerts.
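
The idea discussed here, a counter that is incremented whenever Expand() fails, could look roughly like the sketch below. It is an illustration under stated assumptions, not the change that was made in Prometheus: it uses only the Go standard library, a plain atomic integer stands in for a real Prometheus counter, and the Expand signature, the expandFailures variable and the failureCount helper are hypothetical names.

```go
package main

import (
	"bytes"
	"fmt"
	"sync/atomic"
	"text/template"
)

// expandFailures stands in for a counter metric such as the proposed
// prometheus_template_expand_failures_total. A real implementation
// would register a counter with the Prometheus client library.
var expandFailures int64

// failureCount returns the current number of failed expansions.
func failureCount() int64 {
	return atomic.LoadInt64(&expandFailures)
}

// Expand renders text against data. Every parse or execution error
// increments the failure counter before being returned to the caller.
func Expand(name, text string, data interface{}) (string, error) {
	tmpl, err := template.New(name).Parse(text)
	if err != nil {
		atomic.AddInt64(&expandFailures, 1)
		return "", fmt.Errorf("error parsing template %s: %w", name, err)
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, data); err != nil {
		atomic.AddInt64(&expandFailures, 1)
		return "", fmt.Errorf("error executing template %s: %w", name, err)
	}
	return buf.String(), nil
}

func main() {
	data := struct {
		Labels map[string]string
		Value  float64
	}{Labels: map[string]string{"group": "web"}, Value: 1}

	// One failing and one succeeding expansion: only the first
	// should be counted.
	_, _ = Expand("bad", "{{ .group }}", data)
	_, _ = Expand("good", "{{ .Labels.group }}", data)
	fmt.Println("failures:", failureCount())
}
```

Only the alert-facing Expand() path is counted in this sketch, in line with the suggestion above to leave ExpandHTML() out.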

mucahitkurt added a commit to mucahitkurt/prometheus that referenced this issue Oct 16, 2018

template expand failures counter metric is added to count alert templ…
…ate expanding errors prometheus#4634

Signed-off-by: Mucahit Kurt <mucahitkurt@gmail.com>