Prometheus not returning all metrics #2258

Closed
simonaws opened this Issue Dec 6, 2016 · 8 comments

simonaws commented Dec 6, 2016

What did you do?
We are running Prometheus and using the Blackbox exporter to ICMP-ping 60+ targets and to probe 10 web API targets over HTTP.

What did you expect to see?
count(probe_success{job="blackbox"}) returns a different count every time it is executed in the Prometheus UI. It seems that Prometheus is not returning all scrapes.

Should we expect count(probe_success{job="blackbox"}) to equal the number of targets in our config file for that job?

What did you see instead? Under which circumstances?
We want the query count(probe_success{job="blackbox_icmp"} == 1) to return the number of targets that are up and count(probe_success{job="blackbox_icmp"} == 0) to return the number of targets that are down, but these queries return different answers every time.
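
As a rough sketch, assuming the job label blackbox_icmp from the config below, queries along these lines should give consistent up/down counts once scrapes are stable:

  # Number of probe targets currently succeeding (probe_success is 0 or 1 per target):
  sum(probe_success{job="blackbox_icmp"})

  # Number of probe targets currently failing (returns no data, not 0, when none match):
  count(probe_success{job="blackbox_icmp"} == 0)

  # Total number of targets Prometheus is scraping for the job, to cross-check the two above:
  count(up{job="blackbox_icmp"})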

Environment

  • System information:
    Linux 3.10.0-327.36.1.el7.x86_64 x86_64

  • Prometheus version:
    prom/prometheus:v1.2.1

  • Prometheus configuration file:

  
rule_files:
  - '/prometheus/alert.rules'

scrape_configs:

  ###############
  - job_name: 'prometheus'
    scrape_interval: 20s
    static_configs:
      - targets: ['prometheus:9090']
    relabel_configs:
      - source_labels: [__metrics_path__]
        regex: .*
        replacement:   '/prometheus/metrics'
        target_label:  __metrics_path__

  #############
  - job_name: 'blackbox_icmp'
    scrape_interval: 20s
    metrics_path: /probe
    params:
      module: [icmp] 
    file_sd_configs:
    - files: ['targets/blackbox-ping.json']
      refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)
        target_label: __param_target
        replacement: ${1}
      - target_label: __address__
        replacement: blackbox:9115


  ###############
  - job_name: 'blackbox_http'
    scrape_interval: 20s
    metrics_path: /probe
    file_sd_configs:
    - files: ['targets/blackbox-http.json']
      refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__]
        regex: (.*)
        target_label: __param_target
        replacement: ${1}
      - source_labels: [__blackbox_module]
        target_label: __param_module
      - target_label: __address__
        replacement: blackbox:9115

  ###############
  - job_name: 'node-exporter'
    scrape_interval: 1m
    file_sd_configs:
    - files: ['targets/node-exporter.json']
      refresh_interval: 1m
    relabel_configs:
      - source_labels: [__address__, __node_exporter_port]
        separator:     ':'
        target_label: __address__
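
For reference, the file_sd target files named above (for example targets/blackbox-ping.json) follow the standard static-target JSON format; the hostnames and label below are placeholders rather than values from this setup:

  [
    {
      "targets": ["host-1.example.com", "host-2.example.com"],
      "labels": {"env": "prod"}
    }
  ]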


brian-brazil commented Dec 6, 2016

Yes, that is the result that would be expected. What is up showing for that job?

Also you don't need to set refresh_interval, as that's just a fallback in case inotify doesn't work.

simonaws commented Dec 6, 2016

Hi Brian

up{job="blackbox_icmp"} in the Prometheus UI returns a different number of results every time: anywhere from around 10 series up to some larger number that is still below the number of targets in the targets file.

Thanks for the advice on the refresh_interval.

brian-brazil commented Dec 6, 2016

That sounds like your Prometheus is overloaded and getting throttled. Are there messages about throttling in the logs?
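
As a rough sketch for checking this on a Prometheus 1.x server, assuming its own metrics are being scraped (exact self-monitoring metric names can vary between 1.x releases):

  # 1 while local storage is in "rushed mode", i.e. persistence is falling behind:
  prometheus_local_storage_rushed_mode

  # Persistence urgency score between 0 and 1; values close to 1 mean ingestion gets throttled:
  prometheus_local_storage_persistence_urgency_score

  # Per-target scrape duration, to spot probes running close to the 20s scrape interval:
  scrape_duration_seconds{job="blackbox_icmp"}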

simonaws commented Dec 6, 2016

I don't see any mention of throttling in the logs, nor any errors; Prometheus is only logging checkpointing. I will look into the reasons for Prometheus throttling in other issues here. I presume I should start by reducing the number of targets and increasing the scrape intervals.

time="2016-12-06T14:20:49Z" level=info msg="Starting prometheus (version=1.2.1, branch=master, revision=dd66f2e94b2b662804b9aa1b6a50587b990ba8b7)" source="main.go:75"
time="2016-12-06T14:20:49Z" level=info msg="Build context (go=go1.7.1, user=root@fd9b0daff6bd, date=20161010-15:58:23)" source="main.go:76"
time="2016-12-06T14:20:49Z" level=info msg="Loading configuration file /prometheus/prometheus.yml" source="main.go:247"
time="2016-12-06T14:20:49Z" level=info msg="Loading series map and head chunks..." source="storage.go:354"
time="2016-12-06T14:20:49Z" level=info msg="29454 series loaded." source="storage.go:359"
time="2016-12-06T14:20:49Z" level=info msg="Listening on :9090" source="web.go:240"
time="2016-12-06T14:20:49Z" level=info msg="Starting target manager..." source="targetmanager.go:76"
time="2016-12-06T14:25:49Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-06T14:25:50Z" level=info msg="Done checkpointing in-memory metrics and chunks in 534.052024ms." source="persistence.go:573"
time="2016-12-06T14:30:50Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-12-06T14:30:50Z" level=info msg="Done checkpointing in-memory metrics and chunks in 537.371609ms." source="persistence.go:573"

brian-brazil commented Dec 6, 2016

In that case the number of targets in those files is probably changing quite a lot. If you keep those files unchanged for a while, do you still see the issue?

simonaws commented Dec 7, 2016

This is solved. The client machine displaying the Prometheus web UI had a system clock that was ahead of the Linux server hosting Prometheus. When executing the up query in the Prometheus UI, the request timestamps were slightly in the future, so not all metrics were returned because they had not all been scraped yet.

The HTTP request from the Prometheus UI is below. I checked the Unix time on the Linux server and it was behind the client. Once I synced the clocks on the client and server, up metrics for all targets were returned every time.

/prometheus/api/v1/query?query=up&time=1481108856.537&_=1481105072244
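
As a quick sketch for spotting this kind of skew, the standard PromQL time() function returns the evaluation timestamp sent with the query rather than the server's current time, so it can be compared directly against the server clock:

  # Compare this value with `date +%s` on the Prometheus host; a noticeable gap
  # means the browser and server clocks disagree:
  time()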

Thanks for your help on this, Brian.

brian-brazil commented Dec 7, 2016

That's a bit odd; you should still have seen the data from the previous scrape.

grobie closed this Mar 5, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
