Prometheus configuration reloads slowly #4301

Closed
lyonssp opened this issue Jun 21, 2018 · 14 comments · Fixed by #4526

@lyonssp

lyonssp commented Jun 21, 2018

Bug Report

I'm reloading Prometheus configurations through a continuous deployment process using Ansible. I'm finding that configuration reloads take so long that what seem like sensible timeout thresholds are consistently exceeded by the reload process.

What are some factors that could be causing the significant reload times?

What did you do?
curl -X POST localhost:9090/-/reload

What did you expect to see?
Expected Prometheus to reload my configuration in < 90s

What did you see instead? Under which circumstances?
Prometheus seems to take ~2.5 minutes to reload the configuration.

Environment

Prometheus is deployed using Docker with the configuration mounted from the host filesystem to
/etc/prometheus/prometheus.yml. I am writing a new file to the local filesystem and almost immediately triggering a reload of the Prometheus configuration via the lifecycle API.

Containers aren't rebooted frequently.
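
In case it helps, the deploy step boils down to roughly the following (a simplified sketch; the real copy is done by an Ansible task, and prometheus.yml.new is just a placeholder name):

# copy the rendered config into the path mounted into the container
cp prometheus.yml.new /etc/prometheus/prometheus.yml

# trigger the reload; --max-time makes the step fail instead of hanging
curl --max-time 90 -X POST localhost:9090/-/reload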

  • System information:
    Linux 4.4.0-1057-aws x86_64

  • Prometheus version:

prometheus, version 2.2.1
build date: 20180314-14:15:45
go version: go1.10

  • Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 30s

alerting:
  alertmanagers:
  - file_sd_configs:
    - files:
      - /etc/prometheus/alertmanagers.yml

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: us-east-1 
      port: 9100
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9100
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9100
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9090
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9090
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9090
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    kubernetes_sd_configs:
    - role: pod
      api_server: <redacted>
    bearer_token: <redacted>
    scheme: https
    relabel_configs: <redacted>
 
  - job_name: '<redacted>'
    kubernetes_sd_configs:
    - role: pod
      api_server: <redacted>
    bearer_token: <redacted>
    scheme: https
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    kubernetes_sd_configs:
    - role: pod
      api_server: <redacted>
    bearer_token: <redacted>
    scheme: https
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    kubernetes_sd_configs:
    - role: pod
      api_server: <redacted>
    bearer_token: <redacted>
    scheme: https
    relabel_configs: <redacted>
 
  - job_name: '<redacted>'
    static_configs:
    - targets: ['<redacted>']

  - job_name: '<redacted>'
    static_configs:
    - targets: ['<redacted>']
  • Logs:
    No relevant logs available
@brian-brazil
Contributor

Do you have rule groups taking 2.5 minutes to evaluate?

@lyonssp
Author

lyonssp commented Jun 21, 2018

I have a single rule group with 5 recording rules and 12 alerting rules. Is there a way that I could hook into Prometheus to get a gauge of how long it is taking to evaluate that rule group?

@brian-brazil
Contributor

If you look at the Rules Status page it'll tell you.
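
Depending on the version, the same number should also be exposed as a metric on Prometheus itself, so you can watch it without the UI (a quick sketch; prometheus_rule_group_last_duration_seconds is the gauge name I'd expect here):

# per-group duration of the last rule evaluation, in seconds
curl -s localhost:9090/metrics | grep prometheus_rule_group_last_duration_seconds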

@krasi-georgiev
Contributor

@lyonssp probably a terrible idea, but if nothing else works you can background it.
curl -X POST localhost:9090/-/reload &

There were some ideas for how to speed up the reloads, but at first glance it looked like they would cause many race conditions.

@lyonssp
Author

lyonssp commented Jun 22, 2018

@brian-brazil my rule group reports taking 440ms to evaluate

@lyonssp
Author

lyonssp commented Jun 22, 2018

@krasi-georgiev In a continuous deployment context, the problem is that the deployment job would appear to have succeeded even when the reload failed
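
If backgrounding turns out to be the only option, the closest thing to failure detection I can think of is checking the config reload metric afterwards -- a rough sketch, assuming the metric name is right for this version and with an arbitrary 180s grace period:

# fire the reload without blocking the deploy step
curl -X POST localhost:9090/-/reload &

# later, fail the job if the new config was not applied cleanly;
# prometheus_config_last_reload_successful is 1 on success, 0 on failure
sleep 180
curl -s localhost:9090/metrics \
  | grep -q '^prometheus_config_last_reload_successful 1' || exit 1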

@brian-brazil
Contributor

There's something odd going on then. How are you measuring the 2.5 minutes, and can you share logs?

@lyonssp
Author

lyonssp commented Jun 22, 2018

@brian-brazil I'm not measuring it strictly -- more eyeballing it. I can guarantee the time is more than 90s, though, as that is my failure threshold. I'll get you some details about the process and some logs as well.

@lyonssp
Author

lyonssp commented Jun 22, 2018

First things first -- here are some stats from timing reloads with a local curl request:

$ time curl -X POST localhost:9090/-/reload
real	1m44.133s
user	0m0.000s
sys	0m0.008s

$ time curl -X POST localhost:9090/-/reload
real	3m24.448s
user	0m0.008s
sys	0m0.004s

$ time curl -X POST localhost:9090/-/reload
real	2m0.908s
user	0m0.000s
sys	0m0.008s

I didn't execute those curls consecutively -- I left a bit of time between executions. It definitely highlights some inconsistency in the behavior.
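
(For what it's worth, curl can also report the elapsed time directly, which is a slightly cleaner way to capture these numbers than shell time:)

# print only the total request time for the reload call
curl -o /dev/null -s -w 'reload took %{time_total}s\n' -X POST localhost:9090/-/reload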

@lyonssp
Author

lyonssp commented Jun 22, 2018

Here are some logs from the server where I executed the above reloads

b9ed65e46f5d12625ea54a0ef356eb25baf49f2e259fdb0c9f2f9ebb8206605d-json.log

@maurorappa

EC2 discovery errors and the related timeouts, from the logs:
"discovery manager scrape" discovery=ec2 msg="Refresh failed" err="could not describe instances: EC2RoleRequestError: no EC2 instance role found"

@tejaswiniVadlamudi

Hi,
We are seeing the same problem. Is this fixed in the latest version? Are there any updates on this issue?

Thanks,
Teja

@krasi-georgiev
Contributor

I am working on a PR that will speed up the reloading quite a bit. The problem is that stopping the scrape loops for running targets is currently done serially, and I am updating the code to stop them in parallel.

Will link the PR when ready.

@lock

lock bot commented Mar 25, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 25, 2019