New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prometheus configuration reloads slowly #4301
Comments
Do you have rule groups taking 2.5 minutes to evaluate? |
I have a single rule group with 5 recording rules and 12 alerting rules. Is there a way that I could hook into Prometheus to get a gauge of how long it is taking to evaluate that rule group? |
If you look at the Rules Status page it'll tell you. |
@lyonssp probably a terrible idea, but if nothing else works you can background it. there were some ideas how to speed up the reloads, but at first glance it looked like it will cause many race conditions. |
@brian-brazil my rules group is reporting taking 440ms to evaluate |
@krasi-georgiev In a continuous deployment context, the problem is that the deployment job would appear to have succeeded in the event of a failure to reload |
There's something odd going on then. How are you measuring the 2.5 minutes, and can you share logs? |
@brian-brazil I'm not measuring strictly -- more eyeballing it. I can guarantee the time is more than 90s though, as that is my failure threshold. I'll get you some details about the process and some logs as well. |
First thing's first -- here are some stats on timing reloads with a local curl request:
I didn't execute those curls consecutively -- I left a bit of time between executions. It definitely highlights some inconsistency in the behavior. |
Here are some logs from the server where I executed the above reloads b9ed65e46f5d12625ea54a0ef356eb25baf49f2e259fdb0c9f2f9ebb8206605d-json.log |
EC2 discovery error and relative timeout, from the logs: |
Hi, Thanks, |
I am working on a PR that will speed up the reloading quite a bit. The problem is that stopping scraping loops for running target is now executed in serial and I am updating the code to stop these in parallel. Will link the pr when ready. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Bug Report
I'm reloading Prometheus configurations through a continuous deployment process using ansible. I'm finding that configurations take such a long time to reload that what seem like sensible timeout thresholds are consistently passed by the reload process.
What are some factors that could be causing the significant reload times?
What did you do?
curl -X POST localhost:9090/-/reload
What did you expect to see?
Expected Prometheus to reload my configuration in < 90s
What did you see instead? Under which circumstances?
Prometheus seems to take ~2.5 minutes to reload the configuration.
Environment
Prometheus is deployed using docker with the configuration mounted from the file system to
/etc/prometheus/prometheus.yml. I am loading a new file to the local filesystem and almost immediately triggering a reload of the prometheus configuration via lifecycle APIs.
Containers aren't rebooted frequently.
System information:
Linux 4.4.0-1057-aws x86_64
Prometheus version:
prometheus, version 2.2.1
build date: 20180314-14:15:45
go version: go1.10
No relevant logs available
The text was updated successfully, but these errors were encountered: