Provide more information about recording rules. #3797

Closed
iksaif opened this Issue Feb 5, 2018 · 10 comments


iksaif commented Feb 5, 2018

This is something that has been asked by quite a few of our users:

  • When was this recording rule last executed (or when will it execute next)?
  • How much time did it take to execute last time? (Present in Prometheus 2.1; also applies to targets.)
  • How many time series did it generate? (Also applies to targets.)

Basically, when their instance suddenly has a lot more time series or slows down, they try to understand what is causing that. We already have some per-target information; I think it would be great to add some of this metadata to recording rules too.

scrape.target.Target already has lastError and lastScrape; we could add lastNumMetrics and lastScrapeTime.
Rules (both alerting and recording rules) recently got GetEvaluationTime; we could add GetLastEvaluationTime() and GetLastOutputMetricsNum() (and an input equivalent)?
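For illustration, a rough sketch of what the proposed accessors might look like on the rules side. The new method names are the hypothetical ones from above, not an existing Prometheus API, and the signature of the existing GetEvaluationTime is approximated:

```go
// Rough sketch only: GetLastEvaluationTime, GetLastOutputMetricsNum and
// GetLastInputMetricsNum are the hypothetical additions proposed above,
// not part of the current rules.Rule interface.
package rules

import "time"

type Rule interface {
	Name() string

	// Already available in 2.1: how long the last evaluation took
	// (return type assumed here).
	GetEvaluationTime() time.Duration

	// Proposed: when the rule was last evaluated.
	GetLastEvaluationTime() time.Time
	// Proposed: how many series the last evaluation produced / read.
	GetLastOutputMetricsNum() int
	GetLastInputMetricsNum() int
}
```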

If both objects export very similar things, we could even unify the UI (and API) for them.

Is there any work in progress regarding this? Would you be opposed to us submitting a patch in this direction? Thanks!

brian-brazil commented Feb 5, 2018

Have you tried using 2.1? It has most of these features.

iksaif commented Feb 5, 2018

I've seen that 2.1 does have the evaluation time, and that is great! But AFAIK:

  • Targets do not yet have number of ingested metrics and scrape time
  • Rules do not yet have size of input/output and last evaluation time

Are these things that could be added?

brian-brazil commented Feb 5, 2018

> Targets do not yet have number of ingested metrics and scrape time

They've had that for a long time, scrape_samples_scraped and scrape_duration_seconds.

> Rules do not yet have size of input/output and last evaluation time.

Why do you want the last evaluation time?

iksaif commented Feb 5, 2018

> They've had that for a long time, scrape_samples_scraped and scrape_duration_seconds.

Very good point; it would be nice to have them somewhere in the target UI too. But at least that's already something we can play with now.
I completely missed those because they appear in /federate but not on /metrics (and when monitoring Prometheus itself, we didn't think about pulling additional series from /federate).

> Why do you want the last evaluation time?

When you see that the output is missing, having both the input/output size and the evaluation time lets you quickly understand why you don't have your new values (either the input is empty or the evaluation is late). Basically, this highlights the fact that you might be running late (because you have too many rules, not enough CPU, etc.). The UI could even display part of what is in prometheus_rule_evaluation_.

brian-brazil commented Feb 6, 2018

> you can super quickly understand why you don't have your new values

The intended way to spot that is to check whether the last evaluation duration is greater than the interval; both are on /metrics.
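A minimal sketch of that check from the outside, assuming the rule-group metrics exposed by recent Prometheus versions (prometheus_rule_group_last_duration_seconds and prometheus_rule_group_interval_seconds; the exact names may differ in 2.1). It scrapes Prometheus' own /metrics endpoint and flags any group whose last evaluation took longer than its interval:

```go
package main

import (
	"fmt"
	"log"
	"net/http"

	dto "github.com/prometheus/client_model/go"
	"github.com/prometheus/common/expfmt"
)

// labelKey flattens a metric's label pairs into a comparable string
// (rule group metrics are keyed by a single rule_group label).
func labelKey(labels []*dto.LabelPair) string {
	key := ""
	for _, l := range labels {
		key += l.GetName() + "=" + l.GetValue() + ";"
	}
	return key
}

func main() {
	// Scrape Prometheus' own /metrics endpoint (address is an example).
	resp, err := http.Get("http://localhost:9090/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(resp.Body)
	if err != nil {
		log.Fatal(err)
	}

	// Metric names assumed from recent Prometheus versions.
	last := families["prometheus_rule_group_last_duration_seconds"]
	interval := families["prometheus_rule_group_interval_seconds"]
	if last == nil || interval == nil {
		log.Fatal("rule group metrics not found (names may differ in this Prometheus version)")
	}

	// Index the configured interval of each rule group by its label set.
	intervals := map[string]float64{}
	for _, m := range interval.Metric {
		intervals[labelKey(m.Label)] = m.GetGauge().GetValue()
	}

	// Flag any group whose last evaluation took longer than its interval.
	for _, m := range last.Metric {
		key := labelKey(m.Label)
		if iv, ok := intervals[key]; ok && m.GetGauge().GetValue() > iv {
			fmt.Printf("rule group %s is falling behind: last evaluation %.2fs > interval %.2fs\n",
				key, m.GetGauge().GetValue(), iv)
		}
	}
}
```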

iksaif commented Feb 6, 2018

OK, fair enough. Do you think it would still be worth it to put a small red/green indicator somewhere on the rules page to show that the last evaluation duration is greater than the interval?

Also, given your rationale, I'm not sure why we keep the last target scrape time in the target UI (which is essentially the same thing; people use it at startup to know whether their target has been scraped yet). I guess one could argue that scrape failures happen way more often than evaluation failures.

And would you be OK with adding input/output set sizes per rule somewhere (either only in the UI, or as a metric like the ones for targets)? I'd like to have it in the UI, but if we do that, it might also make sense to put the same number in the target UI.

brian-brazil commented Feb 6, 2018

It's a brand-new feature; I think it's a bit early to consider adding more to the UI.

Evaluation failure would mean Prometheus is broken, while scrape failures are normal.

@gouthamve was looking at adding some instrumentation around that; thus far the duration seems to catch most of the expensive rules.

iksaif commented Feb 6, 2018

Fine, let's wait for it to settle down a little bit and add more things later. I'll build what is missing in additional Grafana dashboards for now.

brian-brazil commented Mar 8, 2018

Closing for now.
