Prometheus configuration reloads slowly #4301

Closed
lyonssp opened this issue Jun 21, 2018 · 14 comments · Fixed by #4526

@lyonssp

lyonssp commented Jun 21, 2018

Bug Report

I'm reloading Prometheus configurations through a continuous deployment process using Ansible. I'm finding that configuration reloads take so long that what seem like sensible timeout thresholds are consistently exceeded by the reload process.

What are some factors that could be causing the significant reload times?

What did you do?
curl -X POST localhost:9090/-/reload

What did you expect to see?
Expected Prometheus to reload my configuration in < 90s

What did you see instead? Under which circumstances?
Prometheus seems to take ~2.5 minutes to reload the configuration.

Environment

Prometheus is deployed using Docker with the configuration mounted from the host filesystem to
/etc/prometheus/prometheus.yml. I am writing a new file to the local filesystem and almost immediately triggering a reload of the Prometheus configuration via the lifecycle API.

Containers aren't rebooted frequently.
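
In case it helps, the deploy step boils down to roughly the following (a simplified sketch; the real copy is done by an Ansible task, and prometheus.yml.new is just a placeholder name):

# copy the rendered config into the path mounted into the container
cp prometheus.yml.new /etc/prometheus/prometheus.yml

# trigger the reload; --max-time makes the step fail instead of hanging
curl --max-time 90 -X POST localhost:9090/-/reload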

  • System information:
    Linux 4.4.0-1057-aws x86_64

  • Prometheus version:

prometheus, version 2.2.1
build date: 20180314-14:15:45
go version: go1.10

  • Prometheus configuration file:
global:
  scrape_interval: 30s
  scrape_timeout: 30s

alerting:
  alertmanagers:
  - file_sd_configs:
    - files:
      - /etc/prometheus/alertmanagers.yml

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: us-east-1 
      port: 9100
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9100
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9100
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9090
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9090
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    ec2_sd_configs:
    - region: <redacted>
      port: 9090
      access_key: <redacted>
      secret_key: <redacted>
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    kubernetes_sd_configs:
    - role: pod
      api_server: <redacted>
    bearer_token: <redacted>
    scheme: https
    relabel_configs: <redacted>
 
  - job_name: '<redacted>'
    kubernetes_sd_configs:
    - role: pod
      api_server: <redacted>
    bearer_token: <redacted>
    scheme: https
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    kubernetes_sd_configs:
    - role: pod
      api_server: <redacted>
    bearer_token: <redacted>
    scheme: https
    relabel_configs: <redacted>

  - job_name: '<redacted>'
    kubernetes_sd_configs:
    - role: pod
      api_server: <redacted>
    bearer_token: <redacted>
    scheme: https
    relabel_configs: <redacted>
 
  - job_name: '<redacted>'
    static_configs:
    - targets: ['<redacted>']

  - job_name: '<redacted>'
    static_configs:
    - targets: ['<redacted>']
  • Logs:
    No relevant logs available
@brian-brazil
Contributor

Do you have rule groups taking 2.5 minutes to evaluate?

@lyonssp
Author

lyonssp commented Jun 21, 2018

I have a single rule group with 5 recording rules and 12 alerting rules. Is there a way that I could hook into Prometheus to get a gauge of how long it is taking to evaluate that rule group?

@brian-brazil
Contributor

If you look at the Rules Status page it'll tell you.
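
Depending on the version, the same number should also be exposed as a metric on Prometheus itself, so you can watch it without the UI (a quick sketch; prometheus_rule_group_last_duration_seconds is the gauge name I'd expect here):

# per-group duration of the last rule evaluation, in seconds
curl -s localhost:9090/metrics | grep prometheus_rule_group_last_duration_seconds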

@krasi-georgiev
Contributor

@lyonssp probably a terrible idea, but if nothing else works you can background it.
curl -X POST localhost:9090/-/reload &

There were some ideas for how to speed up the reloads, but at first glance it looked like they would cause many race conditions.

@lyonssp
Author

lyonssp commented Jun 22, 2018

@brian-brazil my rule group reports taking 440ms to evaluate

@lyonssp
Author

lyonssp commented Jun 22, 2018

@krasi-georgiev In a continuous deployment context, the problem is that the deployment job would appear to have succeeded even when the reload failed
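
If backgrounding turns out to be the only option, the closest thing to failure detection I can think of is checking the config reload metric afterwards -- a rough sketch, assuming the metric name is right for this version and with an arbitrary 180s grace period:

# fire the reload without blocking the deploy step
curl -X POST localhost:9090/-/reload &

# later, fail the job if the new config was not applied cleanly;
# prometheus_config_last_reload_successful is 1 on success, 0 on failure
sleep 180
curl -s localhost:9090/metrics \
  | grep -q '^prometheus_config_last_reload_successful 1' || exit 1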

@brian-brazil
Contributor

There's something odd going on then. How are you measuring the 2.5 minutes, and can you share logs?

@lyonssp
Author

lyonssp commented Jun 22, 2018

@brian-brazil I'm not measuring it strictly -- more eyeballing it. I can guarantee the time is more than 90s, though, as that is my failure threshold. I'll get you some details about the process and some logs as well.

@lyonssp
Author

lyonssp commented Jun 22, 2018

First things first -- here are some stats from timing reloads with a local curl request:

$ time curl -X POST localhost:9090/-/reload
real	1m44.133s
user	0m0.000s
sys	0m0.008s

$ time curl -X POST localhost:9090/-/reload
real	3m24.448s
user	0m0.008s
sys	0m0.004s

$ time curl -X POST localhost:9090/-/reload
real	2m0.908s
user	0m0.000s
sys	0m0.008s

I didn't execute those curls consecutively -- I left a bit of time between executions. It definitely highlights some inconsistency in the behavior.
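
(For what it's worth, curl can also report the elapsed time directly, which is a slightly cleaner way to capture these numbers than shell time:)

# print only the total request time for the reload call
curl -o /dev/null -s -w 'reload took %{time_total}s\n' -X POST localhost:9090/-/reload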

@lyonssp
Author

lyonssp commented Jun 22, 2018

Here are some logs from the server where I executed the above reloads

b9ed65e46f5d12625ea54a0ef356eb25baf49f2e259fdb0c9f2f9ebb8206605d-json.log

@maurorappa

EC2 discovery errors and the related timeouts, from the logs:
"discovery manager scrape" discovery=ec2 msg="Refresh failed" err="could not describe instances: EC2RoleRequestError: no EC2 instance role found"

@tejaswiniVadlamudi

Hi,
We are seeing the same problem. Is this fixed in the latest version? Are there any updates on this issue?

Thanks,
Teja

@krasi-georgiev
Contributor

I am working on a PR that will speed up the reloading quite a bit. The problem is that stopping the scrape loops for running targets is currently done serially, and I am updating the code to stop them in parallel.

Will link the PR when ready.

@lock

lock bot commented Mar 25, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 25, 2019