Data loss in grafana dashboard on prometheus safe reload #4519

ijonsnow opened this Issue Aug 20, 2018 · 9 comments

ijonsnow commented Aug 20, 2018

Proposal

Use case. Why is this important?
We see data point loss in Grafana dashboards and in the Prometheus dashboard after a normal reload of Prometheus.

Note: we did not have this issue with Prometheus 1.7 or 1.8.

Bug Report

What did you do?
Reloaded Prometheus using the approaches below; after the reload there is data loss in the Prometheus dashboard / Grafana.

curl -X POST http://localhost:9090/-/reload
sudo service prometheus reload
kill -HUP 31763
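
Whether the reload itself was applied can also be checked from Prometheus's own metrics; a minimal check, assuming the server listens on localhost:9090 with the lifecycle endpoint enabled:

# trigger a reload via the lifecycle endpoint (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# 1 = last reload succeeded, 0 = it failed
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_config_last_reload_successful'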

What did you expect to see?
A reload of Prometheus that picks up the new configs or rules, with no gap in the collected data.

What did you see instead? Under which circumstances?
Data point loss (gaps) in Grafana after the reload.

Environment
Ubuntu 14.04

  • System information:
    $uname -srm
    Linux 3.13.0-135-generic x86_64

  • Prometheus version:
    prometheus, version 2.1.0 (branch: HEAD, revision: 85f23d8)
    build user: root@6e784304d3ff
    build date: 20180119-12:01:23
    go version: go1.9.2

  • Alertmanager version:

    N/A (Alertmanager is not relevant to this issue)

  • Prometheus configuration file:

global:
  scrape_interval:     1m 
  evaluation_interval: 1m 
  external_labels:
    datacenter: node
    metrics: node

rule_files:
 - /etc/prometheus/rules/*.rules

scrape_configs:
  - job_name: 'ec2-details'
    ec2_sd_configs:
      - region: region
        port: port_number
        refresh_interval: 300s
    relabel_configs:
      - source_labels: [__meta_ec2_tag_env]
        regex: prod
        action: keep
      - source_labels: [__meta_ec2_instance_state]
        regex: running
        action: keep
      - source_labels: [__meta_ec2_tag_Name]
        target_label: instance
      - source_labels: [__meta_ec2_tag_pod]
        target_label: pod
      - source_labels: [__meta_ec2_tag_service]
        target_label: service
      - source_labels: [__meta_ec2_instance_type]
        target_label: instance_type
      - source_labels: [__meta_ec2_instance_id]
        target_label: instance_id
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(ALERTS|node_cpu|node_disk_bytes_read|node_disk_bytes_written|node_disk_io_time_ms|node_disk_read_time_ms|node_disk_write_time_ms|node_disk_reads_completed|node_disk_writes_completed|node_filefd_allocated|node_filefd_maximum|node_filesystem_avail|node_filesystem_free|node_filesystem_size|node_load1|node_load15|node_load5|node_memory_Active|node_memory_Buffers|node_memory_Cached|node_memory_Inactive|node_memory_MemFree|node_memory_MemTotal|node_memory_SwapCached|node_memory_SwapFree|node_memory_SwapTotal|node_network_receive_bytes|node_network_receive_drop|node_network_receive_errs|node_network_receive_packets|node_network_transmit_bytes|node_network_transmit_drop|node_network_transmit_errs|node_network_transmit_packets|node_procs_blocked|node_procs_running)'
        action: keep

remote_read:
 - url: "http://influx-endpoint/read"
   remote_timeout: 5m
remote_write:
 - url: "http://influx-endpoint/write"

  • Alertmanager configuration file:
    N/A, since this is a Prometheus-related issue

  • Logs:
    I have been experiencing this issue since upgrading Prometheus from 1.7 to 2.1; I also tried 2.3 and the issue persists.

simonpasquier commented Aug 21, 2018

edited the description to fix the formatting

This looks similar to #4301. How many targets are there in Prometheus? Can you measure how long it takes for Prometheus to reload (see #4301 (comment))?
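
For a rough measurement (assuming the lifecycle endpoint is enabled, as in your setup), timing the reload call should be enough, since the endpoint only returns once the reload has been processed:

time curl -X POST http://localhost:9090/-/reload
# the completion time of the last successful reload is also exported as a metric
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_config_last_reload_success_timestamp_seconds'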

ijonsnow commented Aug 23, 2018

@simonpasquier there is only one target, as mentioned in the prometheus.yml config.
This is the minimal config we have, and even with it a Prometheus reload causes data loss in the Grafana dashboard.

simonpasquier commented Aug 23, 2018

Please run with --log.level=debug, do a few reloads and share the logs.

ijonsnow commented Aug 25, 2018

Command

/opt/prometheus/prometheus-2.3/prometheus --web.enable-admin-api --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/libprometheus/db --web.enable-lifecycle --web.console.templates=/opt/prometheus/prometheus-2.3/consoles --web.console.libraries=/opt/prometheus/prometheus-2.3/console_libraries --log.level=debug

Log output

level=info ts=2018-08-25T03:08:00.18181658Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=HEAD, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-08-25T03:08:00.181884078Z caller=main.go:223 build_context="(go=go1.10.3, user=root@5258e0bd9cc1, date=20180712-14:02:52)"
level=info ts=2018-08-25T03:08:00.181904121Z caller=main.go:224 host_details="(Linux 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 prometheus-rules (none))"
level=info ts=2018-08-25T03:08:00.181918594Z caller=main.go:225 fd_limits="(soft=65536, hard=65536)"
level=info ts=2018-08-25T03:08:00.18263059Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-08-25T03:08:00.182697145Z caller=web.go:415 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-08-25T03:08:00.182845821Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1533816000000 maxt=1533880800000 ulid=01CMHA548347C30CZM34YRA42C
level=info ts=2018-08-25T03:08:00.18291819Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1533880800000 maxt=1533945600000 ulid=01CMK7YP78XJD40NRN6P8V2RKY
level=info ts=2018-08-25T03:08:00.182985987Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1533945600000 maxt=1534010400000 ulid=01CMN5R7J36ZF94XM35V1WQFT0
level=info ts=2018-08-25T03:08:00.183132427Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534010400000 maxt=1534075200000 ulid=01CMR556NHQD329S2TP7N1RHCM
level=info ts=2018-08-25T03:08:00.183220465Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534075200000 maxt=1534140000000 ulid=01CMS89Z7NBPM4KV8EW5TYRZ31
level=info ts=2018-08-25T03:08:00.183305135Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534140000000 maxt=1534204800000 ulid=01CMVS7CJQA682FTZCPXVCCY6R
level=info ts=2018-08-25T03:08:00.18336503Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534204800000 maxt=1534269600000 ulid=01CMX3VCWAMSZJN0XFCXDYZBR7
level=info ts=2018-08-25T03:08:00.18340975Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534269600000 maxt=1534334400000 ulid=01CMZ1MX7MCDT536HWVSAP37BS
level=info ts=2018-08-25T03:08:00.183455463Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534334400000 maxt=1534399200000 ulid=01CN0ZEFEX45HFJZJT98FFTPDY
level=info ts=2018-08-25T03:08:00.183500289Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534399200000 maxt=1534464000000 ulid=01CN2X7X29QVPY2E6R6F9ZND8A
level=info ts=2018-08-25T03:08:00.183543944Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534464000000 maxt=1534528800000 ulid=01CN4V1JH196WQ74NZB7GSFJ7A
level=info ts=2018-08-25T03:08:00.183592899Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534528800000 maxt=1534593600000 ulid=01CN6RTZN56GEFPGA1X09Z5646
level=info ts=2018-08-25T03:08:00.18363719Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534593600000 maxt=1534658400000 ulid=01CN8PMMM2WTVZAWDCMY183XXC
level=info ts=2018-08-25T03:08:00.183686636Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534658400000 maxt=1534723200000 ulid=01CNAME573MZ98BDQ68WND61N9
level=info ts=2018-08-25T03:08:00.183730321Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534723200000 maxt=1534788000000 ulid=01CNCJ7SMDQ6PGJEBE74MFHV24
level=info ts=2018-08-25T03:08:00.183772343Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534788000000 maxt=1534852800000 ulid=01CNEG1CV9DZYPF4VEN5AHSX9R
level=info ts=2018-08-25T03:08:00.18381497Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534852800000 maxt=1534917600000 ulid=01CNGDTVA2YW0W2GQR7VF81PXC
level=info ts=2018-08-25T03:08:00.183856958Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534917600000 maxt=1534982400000 ulid=01CNJBM8BCC3173V5JM6Y5EYGM
level=info ts=2018-08-25T03:08:00.183905285Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534982400000 maxt=1535047200000 ulid=01CNM9DY9KZHSGX9PAVNJHZPGQ
level=info ts=2018-08-25T03:08:00.183959356Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535047200000 maxt=1535112000000 ulid=01CNP77KNHA3014QG541K1KDZ6
level=info ts=2018-08-25T03:08:00.184006129Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535112000000 maxt=1535133600000 ulid=01CNPVSZBMMW1RZ8J3W79WXFS2
level=info ts=2018-08-25T03:08:00.184046065Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535155200000 maxt=1535162400000 ulid=01CNQGCD0VAVB0HJ6FXETZJQAN
level=info ts=2018-08-25T03:08:00.184085264Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535133600000 maxt=1535155200000 ulid=01CNQGD2EATRKC4EG9EV6EZE15

simonpasquier commented Aug 27, 2018

@ijonsnow I'd like to see more logs, in particular the lines with msg="Loading configuration file" followed by msg="Completed loading of configuration file".

ijonsnow commented Aug 27, 2018

Below are the log lines found in prometheus.log

$grep "Loading configuration file" prometheus/prometheus.log
level=info ts=2018-08-25T03:08:16.45321681Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-08-25T07:59:05.109245539Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-08-25T08:14:18.883806656Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-08-25T08:16:08.953236351Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-08-27T08:15:37.164911927Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-08-27T08:35:42.799243745Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml

simonpasquier commented Aug 27, 2018

Can you paste the full logs on the pastebin site of your choice? That will make troubleshooting easier.

ijonsnow commented Aug 29, 2018

@simonpasquier
A sample federated-cluster Prometheus config and log lines are given below. The issue persists not only with this setup but also with the configuration described in the initial report. The log lines given in the previous comments and here (except the last post) are the full output generated during a restart.

prometheus.yml file

global:
  scrape_interval:     1m # By default, scrape targets every 15 seconds.
  evaluation_interval: 1m # By default, scrape targets every 15 seconds.
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093
    scheme: http
    timeout: 10s

rule_files:
 - /etc/prometheus/rules/prometheus1/*.rules
 - /etc/prometheus/rules/prometheus2/*.rules
 - /etc/prometheus/rules/prometheus3/*.rules 


scrape_configs:
  - job_name: prometheus1-federated
    scrape_interval: 1m
    scrape_timeout: 1m
    metrics_path: /federate
    scheme: http
    honor_labels: true
    params:
      match[]:
       - '{__name__=~".+"}'
    static_configs:
     - targets:
       - 'prometheus1:9090'

  - job_name: prometheus2-federated
    scrape_interval: 1m
    scrape_timeout: 1m
    metrics_path: /federate
    scheme: http
    honor_labels: true
    params:
      match[]:
       - '{__name__=~".+"}'
    static_configs:
     - targets:
       - 'prometheus2:9090'

  - job_name: prometheus3-federated
    scrape_interval: 1m
    scrape_timeout: 1m
    metrics_path: /federate
    scheme: http
    honor_labels: true
    params:
      match[]:
       - '{__name__=~".+"}'
    static_configs:
     - targets:
       - 'prometheus3:9090'

remote_read:
 - url: "http://influx/read"
   remote_timeout: 5m

Below are the log lines from a restart of the Prometheus service.

level=warn ts=2018-08-29T08:00:40.950122817Z caller=main.go:377 msg="Received SIGTERM, exiting gracefully..."
level=info ts=2018-08-29T08:00:40.950193933Z caller=main.go:402 msg="Stopping scrape discovery manager..."
level=info ts=2018-08-29T08:00:40.950217952Z caller=main.go:416 msg="Stopping notify discovery manager..."
level=info ts=2018-08-29T08:00:40.950228178Z caller=main.go:438 msg="Stopping scrape manager..."
level=info ts=2018-08-29T08:00:40.950236194Z caller=main.go:398 msg="Scrape discovery manager stopped"
level=info ts=2018-08-29T08:00:40.950252657Z caller=main.go:412 msg="Notify discovery manager stopped"
level=debug ts=2018-08-29T08:00:40.950452946Z caller=scrape.go:703 component="scrape manager" scrape_pool=prometheus3-federated target="http://prometheus3:9090/federate?match%5B%5D=%7B__name__%3D~%22.%2B%22%7D" msg="Scrape failed" err="context canceled"
level=info ts=2018-08-29T08:00:41.006388328Z caller=manager.go:464 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-08-29T08:00:41.006520489Z caller=manager.go:470 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-08-29T08:00:41.006560862Z caller=main.go:432 msg="Scrape manager stopped"
level=info ts=2018-08-29T08:00:41.017061148Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2018-08-29T08:00:41.017097706Z caller=main.go:587 msg="Notifier manager stopped"
level=info ts=2018-08-29T08:00:41.017256236Z caller=main.go:599 msg="See you next time!"
level=info ts=2018-08-29T08:00:41.356769317Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=HEAD, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
level=info ts=2018-08-29T08:00:41.356831467Z caller=main.go:223 build_context="(go=go1.10.3, user=root@5258e0bd9cc1, date=20180712-14:02:52)"
level=info ts=2018-08-29T08:00:41.356851232Z caller=main.go:224 host_details="(Linux 3.13.0-135-generic #184-Ubuntu SMP Wed Oct 18 11:55:51 UTC 2017 x86_64 prometheus-rules (none))"
level=info ts=2018-08-29T08:00:41.356866903Z caller=main.go:225 fd_limits="(soft=65536, hard=65536)"
level=info ts=2018-08-29T08:00:41.357519355Z caller=main.go:533 msg="Starting TSDB ..."
level=info ts=2018-08-29T08:00:41.357559318Z caller=web.go:415 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-08-29T08:00:41.357789209Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534204800000 maxt=1534269600000 ulid=01CMX3VCWAMSZJN0XFCXDYZBR7
level=info ts=2018-08-29T08:00:41.357886546Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534269600000 maxt=1534334400000 ulid=01CMZ1MX7MCDT536HWVSAP37BS
level=info ts=2018-08-29T08:00:41.357966914Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534334400000 maxt=1534399200000 ulid=01CN0ZEFEX45HFJZJT98FFTPDY
level=info ts=2018-08-29T08:00:41.358051303Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534399200000 maxt=1534464000000 ulid=01CN2X7X29QVPY2E6R6F9ZND8A
level=info ts=2018-08-29T08:00:41.358139261Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534464000000 maxt=1534528800000 ulid=01CN4V1JH196WQ74NZB7GSFJ7A
level=info ts=2018-08-29T08:00:41.358227324Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534528800000 maxt=1534593600000 ulid=01CN6RTZN56GEFPGA1X09Z5646
level=info ts=2018-08-29T08:00:41.358291118Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534593600000 maxt=1534658400000 ulid=01CN8PMMM2WTVZAWDCMY183XXC
level=info ts=2018-08-29T08:00:41.35833751Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534658400000 maxt=1534723200000 ulid=01CNAME573MZ98BDQ68WND61N9
level=info ts=2018-08-29T08:00:41.358380801Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534723200000 maxt=1534788000000 ulid=01CNCJ7SMDQ6PGJEBE74MFHV24
level=info ts=2018-08-29T08:00:41.358422842Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534788000000 maxt=1534852800000 ulid=01CNEG1CV9DZYPF4VEN5AHSX9R
level=info ts=2018-08-29T08:00:41.358484802Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534852800000 maxt=1534917600000 ulid=01CNGDTVA2YW0W2GQR7VF81PXC
level=info ts=2018-08-29T08:00:41.358529926Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534917600000 maxt=1534982400000 ulid=01CNJBM8BCC3173V5JM6Y5EYGM
level=info ts=2018-08-29T08:00:41.358572719Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534982400000 maxt=1535047200000 ulid=01CNM9DY9KZHSGX9PAVNJHZPGQ
level=info ts=2018-08-29T08:00:41.358615725Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535047200000 maxt=1535112000000 ulid=01CNP77KNHA3014QG541K1KDZ6
level=info ts=2018-08-29T08:00:41.358662606Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535112000000 maxt=1535176800000 ulid=01CNR512CK1DW7W6ANHBVYGX57
level=info ts=2018-08-29T08:00:41.358710399Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535176800000 maxt=1535241600000 ulid=01CNT2TBVC9HKVZ6C8HHCD938N
level=info ts=2018-08-29T08:00:41.358755085Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535241600000 maxt=1535306400000 ulid=01CNW0M3TGGJPFHT50R9ABVJEY
level=info ts=2018-08-29T08:00:41.358797883Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535306400000 maxt=1535371200000 ulid=01CNXYDQT8ES2B20XDFPSD29MD
level=info ts=2018-08-29T08:00:41.35884045Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535371200000 maxt=1535436000000 ulid=01CNZW75PP9JQEFRCHFHC68T68
level=info ts=2018-08-29T08:00:41.358878557Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535500800000 maxt=1535508000000 ulid=01CP1SZ90T01RVGA39JC4J74QH
level=info ts=2018-08-29T08:00:41.358923725Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535436000000 maxt=1535500800000 ulid=01CP1T0G7841MKYXKN94Y4Q5CT
level=info ts=2018-08-29T08:00:41.358953187Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535508000000 maxt=1535515200000 ulid=01CP20V08SKCXM6WQFFAA6PGJ7
level=info ts=2018-08-29T08:00:41.358981669Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535515200000 maxt=1535522400000 ulid=01CP27PQGTXNSDSJK9BSMKRZQS

Note:
In the configs and log lines I have removed the DNS entries for security reasons. Wherever you see prometheus1, prometheus2, prometheus3, they stand in for real DNS entries.

simonpasquier commented Aug 29, 2018

Reload and restart are different operations: a restart involves a sanity check of the TSDB that can take some time to complete, and scraping won't start before the TSDB is ready. #4519 (comment) shows a restart, so gaps in the metric collection wouldn't be unexpected.
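
A quick way to tell the two apart from the outside is to check the server's own process start time, which a reload leaves unchanged but a restart resets (a simple check against the local /metrics endpoint):

# reset by a restart, unchanged by a configuration reload
curl -s http://localhost:9090/metrics | grep '^process_start_time_seconds'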
Also federating all metrics from one Prometheus server to another isn't a recommended practice (see https://prometheus.io/docs/prometheus/latest/federation/ and https://www.robustperception.io/federation-what-is-it-good-for).
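
For example, rather than federating everything with {__name__=~".+"}, the match[] parameter can be limited to aggregated series produced by recording rules (a hypothetical selector, to be adapted to your rule names):

# federate only job-level aggregation rules instead of every raw series
curl -G -s 'http://prometheus1:9090/federate' --data-urlencode 'match[]={__name__=~"job:.*"}'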

Let's go back to your initial report: please provide the full logs from the system on which you first saw the issue.
