Reported data and Prometheus data do not match #3828

Closed
pecigonzalo opened this Issue Feb 12, 2018 · 9 comments

pecigonzalo commented Feb 12, 2018

What did you do?
I'm checking metrics from node-exporter, but the metrics reported by Prometheus do not match the metrics on node-exporter (across all nodes).

What did you expect to see?
Example metrics from 10.10.0.235

root@ip-10-10-0-235:~# curl localhost:19997/metrics | grep node_load
# TYPE node_load1 gauge
node_load1 0.31
# HELP node_load15 15m load average.
# TYPE node_load15 gauge
node_load15 0.46
# HELP node_load5 5m load average.
# TYPE node_load5 gauge
node_load5 0.49
root@ip-10-10-0-235:~# 

What did you see instead? Under which circumstances?
Load is reported with a value of 2.88; this happens across all monitored targets:

Element Value
node_load1{instance="i-someid58",instance_ip="10.10.2.143",job="node-exporter"} 2.34
node_load1{instance="i-someid9d",instance_ip="10.10.0.220",job="node-exporter"} 2.46
node_load1{instance="i-someid6e",instance_ip="10.10.1.119",job="node-exporter"} 2.34
node_load1{instance="i-someiddd",instance_ip="10.10.0.235",job="node-exporter"} 2.5
node_load1{instance="i-someidce",instance_ip="10.11.0.83",job="node-exporter"} 2.46
node_load1{instance="i-someid60",instance_ip="10.11.1.103",job="node-exporter"} 2.34

This is not the case on my other cluster, which runs the same version of Prometheus with the same configuration. The only thing I can think of is data corruption.

Environment

  • System information:
    Host: Linux 4.9.0-5-amd64 x86_64
    Docker: Docker version 17.09.0-ce, build afdb6d4
    Network: Weave 2.0.4
    Storage: EFS

  • Prometheus version:

prometheus, version 1.8.2 (branch: HEAD, revision: 5211b96d4d1291c3dd1a569f711d3b301b635ecb)
  build user:       root@1412e937e4ad
  build date:       20171104-16:09:14
  go version:       go1.9.2

This is running in a Docker container.

  • Prometheus configuration file:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    monitor: prometheus-ecs
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
    scheme: http
    timeout: 10s
rule_files:
- /etc/prometheus/alert.rules_es
- /etc/prometheus/alert.rules_ecs
- /etc/prometheus/alert.rules_nodes
scrape_configs:
- job_name: prometheus
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
  - names:
    - prometheus.weave.local
    refresh_interval: 30s
    type: A
    port: 9090
- job_name: netdata-exporter
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /api/v1/allmetrics?format=prometheus
  scheme: http
  ec2_sd_configs:
  - region: eu-central-1
    profile: default-tf-stage-AmazonECSContainer
    refresh_interval: 1m
    port: 19999
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Environment]
    separator: ;
    regex: stage
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_instance_id]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_ec2_private_ip]
    separator: ;
    regex: (.*)
    target_label: instance_ip
    replacement: $1
    action: replace
- job_name: cadvisor
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - region: eu-central-1
    profile: default-tf-stage-AmazonECSContainer
    refresh_interval: 1m
    port: 19998
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Environment]
    separator: ;
    regex: stage
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_instance_id]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_ec2_private_ip]
    separator: ;
    regex: (.*)
    target_label: instance_ip
    replacement: $1
    action: replace
- job_name: node-exporter
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  ec2_sd_configs:
  - region: eu-central-1
    profile: default-tf-stage-AmazonECSContainer
    refresh_interval: 1m
    port: 19997
  relabel_configs:
  - source_labels: [__meta_ec2_tag_Environment]
    separator: ;
    regex: stage
    replacement: $1
    action: keep
  - source_labels: [__meta_ec2_instance_id]
    separator: ;
    regex: (.*)
    target_label: instance
    replacement: $1
    action: replace
  - source_labels: [__meta_ec2_private_ip]
    separator: ;
    regex: (.*)
    target_label: instance_ip
    replacement: $1
    action: replace
- job_name: elasticsearch-exporter
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
  - names:
    - elasticsearch_exporter.weave.local
    refresh_interval: 30s
    type: A
    port: 9108
- job_name: ecs-exporter
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
  - names:
    - ecs_exporter.weave.local
    refresh_interval: 30s
    type: A
    port: 9222
  • Executing command:
"Cmd": [
                "/bin/prometheus",
                "-config.file=/etc/prometheus/prometheus.yml",
                "-storage.local.path=/prometheus",
                "-storage.local.retention=168h",
                "-storage.local.target-heap-size=2147483648",
                "-storage.local.series-file-shrink-ratio=0.5",
                "-web.console.libraries=/etc/prometheus/console_libraries",
                "-web.console.templates=/etc/prometheus/consoles",
                "-alertmanager.url=http://alertmanager:9093"
            ],

Logs:
No errors or relevant logs, just the regular checkpointing messages:

time="2018-02-12T11:53:38Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:633" 
time="2018-02-12T11:53:45Z" level=info msg="Done checkpointing in-memory metrics and chunks in 7.808982441s." source="persistence.go:665" 

brian-brazil (Member) commented Feb 12, 2018

I don't see anything odd here.

Also, NFS is not supported.

pecigonzalo (Author) commented Feb 12, 2018

I know NFS is not supported (that is why I mentioned it's on EFS, in case this is due to corruption caused by it), but at a glance it does not seem to be an NFS issue. If it is data corruption caused by the storage layer, then I'll just close the issue, but is that the case?

Sorry, how is it that nothing is odd? node-exporter reports 0.31 and Prometheus shows 2.5 for load.

brian-brazil (Member) commented Feb 12, 2018

Metrics can be different at different times.

pecigonzalo (Author) commented Feb 12, 2018

This is at the same time, not a different time; it's also the same for node_load5 and node_load15.

brian-brazil (Member) commented Feb 12, 2018

Can you confirm with strace that the value coming back from the node exporter when Prometheus scrapes it is as expected?

pecigonzalo (Author) commented Feb 12, 2018

I can do that, but I don't really understand whether you want me to strace Prometheus or node-exporter.

Do you have some example on how to go about this?

brian-brazil (Member) commented Feb 12, 2018

You'd want to strace the node exporter. You'd run strace ./node_exporter.

tcpdump could also work here.
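
A sketch of what that might look like (the port and process name come from this issue; the exact flags may need adjusting for your setup):

# Attach to the running node_exporter and watch what it writes on the wire
strace -f -e trace=network -s 8192 -p "$(pidof node_exporter)" 2>&1 | grep node_load1

# Or capture the scrape traffic on the exporter port instead
tcpdump -i any -A 'tcp port 19997'

Either way, the interesting part is the node_load1 value inside the HTTP response sent back to Prometheus.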

pecigonzalo (Author) commented Feb 13, 2018

Sure thing. I thought maybe there was some special usage to get clearer info.

EDIT:
That helped clear up the problem; I'll close this.
It was a routing problem. Thank you for your help and time.
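
For anyone hitting the same symptom, a minimal check along these lines (using the addresses from this report) shows whether the scrape is being answered by the wrong host:

# From the Prometheus host: what does the scrape target actually return?
curl -s http://10.10.0.235:19997/metrics | grep '^node_load'
# On the node itself, for comparison
curl -s http://localhost:19997/metrics | grep '^node_load'
# If these clearly come from different machines, the scrape is being routed to
# (or answered by) a different host than intended.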

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
