Data lost after restart #5258

Closed
fgrzadkowski opened this Issue Feb 22, 2019 · 4 comments

fgrzadkowski commented Feb 22, 2019

Bug Report

What did you do?
Prometheus is running as a systemd service. After a restart it loses some metrics from the last 1-2 hours.

What did you expect to see?

No metric samples should be lost. This is true for most time series, except some that I'm interested in (they're exported by custom_exporter; see the config below).

What did you see instead? Under which circumstances?

Metrics were lost.

Environment

  • System information:
$ uname -srm
Linux 4.4.0-131-generic x86_64
  • Prometheus version:
$ prometheus --version
prometheus, version 2.7.1 (branch: HEAD, revision: 62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)
  build user:       root@f9f82868fc43
  build date:       20190131-11:16:59
  go version:       go1.11.5
  • Prometheus configuration file:
$ cat /etc/prometheus/prometheus.yml 
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  external_labels:
    x: 'XX'

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/alerting_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node_exporter'
    static_configs:
    - targets: ['localhost:9100']

  - job_name: 'custom_exporter'
    static_configs:
    - targets: ['localhost:8000']

  - job_name: 'dummy_exporter'
    static_configs:
    - targets: ['localhost:9200']

alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets: ['alertmanager']
  • Prometheus service config:
$ sudo cat /etc/systemd/system/prometheus.service
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
Restart=always
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.external-url=http://eddie-albert:9090 \
    --web.enable-lifecycle \
    --log.level=debug

[Install]
WantedBy=multi-user.target
  • Logs from the restart:
Feb 22 16:24:27 eddie0001 systemd[1]: Stopping Prometheus...
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=warn ts=2019-02-22T15:24:27.653562331Z caller=main.go:464 msg="Received SIGTERM, exiting gracefully..."
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.653652827Z caller=main.go:489 msg="Stopping scrape discovery manager..."
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.653683824Z caller=main.go:503 msg="Stopping notify discovery manager..."
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.653700911Z caller=main.go:525 msg="Stopping scrape manager..."
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.653725679Z caller=main.go:485 msg="Scrape discovery manager stopped"
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.653745708Z caller=main.go:499 msg="Notify discovery manager stopped"
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.653793943Z caller=manager.go:736 component="rule manager" msg="Stopping rule manager..."
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.653812981Z caller=manager.go:742 component="rule manager" msg="Rule manager stopped"
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.653906645Z caller=main.go:519 msg="Scrape manager stopped"
Feb 22 16:24:27 eddie0001 prometheus[13920]: level=info ts=2019-02-22T15:24:27.656635443Z caller=notifier.go:521 component=notifier msg="Stopping notification manager..."
Feb 22 16:24:27 eddie0001 systemd[1]: Stopped Prometheus.
Feb 22 16:24:27 eddie0001 systemd[1]: Started Prometheus.
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.72016907Z caller=main.go:302 msg="Starting Prometheus" version="(version=2.7.1, branch=HEAD, revision=62e591f928ddf6b3468308b7ac1de1c63aa7fcf3)"
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.720245402Z caller=main.go:303 build_context="(go=go1.11.5, user=root@f9f82868fc43, date=20190131-11:16:59)"
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.720285625Z caller=main.go:304 host_details="(Linux 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 eddie0001 (none))"
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.720320351Z caller=main.go:305 fd_limits="(soft=1024, hard=4096)"
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.7203501Z caller=main.go:306 vm_limits="(soft=unlimited, hard=unlimited)"
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.721234359Z caller=main.go:620 msg="Starting TSDB ..."
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.721277764Z caller=web.go:416 component=web msg="Start listening for connections" address=0.0.0.0:9090
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.721673412Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1549993250740 maxt=1550016000000 ulid=01D3JKKQ92C5C64317672HVM3E
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.721783914Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550016000000 maxt=1550080800000 ulid=01D3MAHHCQD1ATNJRKQ33QTE7Z
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.721872127Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550080800000 maxt=1550145600000 ulid=01D3P8B2Q0N0QM94JABW7N42C8
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.721957396Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550145600000 maxt=1550210400000 ulid=01D3R64KTZTXZXQ3K507KSFVZK
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722035003Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550210400000 maxt=1550275200000 ulid=01D3T3Y532CE1TD200RYK74VPN
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722109798Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550275200000 maxt=1550340000000 ulid=01D3W1QPB8X34EP5M42C5X9GB9
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722188691Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550340000000 maxt=1550404800000 ulid=01D3XZH7J7ZTEF3TR1MP8DTRB4
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722262999Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550404800000 maxt=1550469600000 ulid=01D3ZXARDM84YH2R3096XNFRSW
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722341673Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550469600000 maxt=1550534400000 ulid=01D41V4A3B0NYW4GFA3YWAQ1MN
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722418629Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550534400000 maxt=1550599200000 ulid=01D43RXVCPVJJZ1A7QGMX71Y40
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722491205Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550599200000 maxt=1550664000000 ulid=01D45PQCKPDYDF4PZTVMW2YPGT
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722568342Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550664000000 maxt=1550728800000 ulid=01D47MGXV3BC1GANZC3RZQC70C
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722640618Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550728800000 maxt=1550793600000 ulid=01D49JAF36NGESNFZ7VQRS07CZ
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722704747Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550793600000 maxt=1550815200000 ulid=01D4A6XMS7V4E9EFBGH4PCD42A
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722753935Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550836800000 maxt=1550844000000 ulid=01D4AVGTF8GJY04Q20H7H89W1P
Feb 22 16:24:27 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:27.722819269Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1550815200000 maxt=1550836800000 ulid=01D4AVGTH47JR8ES9G6GC7Y0AK
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=warn ts=2019-02-22T15:24:30.585822221Z caller=head.go:440 component=tsdb msg="unknown series references" count=46969
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:30.59328862Z caller=main.go:635 msg="TSDB started"
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:30.593338246Z caller=main.go:695 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.59401404Z caller=manager.go:217 component="discovery manager scrape" msg="Starting provider" provider=string/0 subs=[prometheus]
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594043647Z caller=manager.go:217 component="discovery manager scrape" msg="Starting provider" provider=string/1 subs=[node_exporter]
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594056403Z caller=manager.go:217 component="discovery manager scrape" msg="Starting provider" provider=string/2 subs=[hardware_ros_monitor]
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594066108Z caller=manager.go:217 component="discovery manager scrape" msg="Starting provider" provider=string/3 subs=[dummy_exporter]
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594083673Z caller=manager.go:217 component="discovery manager notify" msg="Starting provider" provider=string/0 subs=[5aa2f45ac0d8a7879c8e834f97683fcc]
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594082104Z caller=manager.go:235 component="discovery manager scrape" msg="discoverer channel closed" provider=string/0
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594123267Z caller=manager.go:235 component="discovery manager scrape" msg="discoverer channel closed" provider=string/1
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594136985Z caller=manager.go:235 component="discovery manager scrape" msg="discoverer channel closed" provider=string/3
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594129372Z caller=manager.go:235 component="discovery manager scrape" msg="discoverer channel closed" provider=string/2
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=debug ts=2019-02-22T15:24:30.594211285Z caller=manager.go:235 component="discovery manager notify" msg="discoverer channel closed" provider=string/0
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:30.594361821Z caller=main.go:722 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
Feb 22 16:24:30 eddie0001 prometheus[5115]: level=info ts=2019-02-22T15:24:30.59437505Z caller=main.go:589 msg="Server is ready to receive web requests."

fgrzadkowski commented Feb 25, 2019

To add more context, I wrote a dummy exporter based on this example: a simple counter that increments every second and has some labels. A minimal sketch of it follows.
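
Assuming the official prometheus_client Python library (the metric name, label, and port here are illustrative rather than my exact code):

import time

from prometheus_client import Counter, start_http_server

# A labelled counter that is incremented once per second.
DUMMY_TICKS = Counter(
    'dummy_ticks_total',
    'Counter that increments every second.',
    ['shard'],
)

if __name__ == '__main__':
    # Port matches the dummy_exporter target (localhost:9200) in the config above.
    start_http_server(9200)
    while True:
        DUMMY_TICKS.labels(shard='0').inc()
        time.sleep(1)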

For this dummy target I never lose data, no matter how and when I restart the Prometheus server or the exporter. To be honest, I don't see any difference between dummy_exporter and custom_exporter (in reality the name is different) from the scraping point of view.

Any suggestions on how to debug this further?

simonpasquier commented Feb 26, 2019

You probably want to increase the RestartSec option for the service.
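
For instance, the [Service] section of the unit file above could gain something like this (the 5-second value is illustrative, not a recommendation):

[Service]
Restart=always
# Wait 5 seconds before restarting instead of restarting immediately,
# so the old process can finish its graceful shutdown first.
RestartSec=5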

I'm closing it for now. If you have further questions, please use our user mailing list, which you can also search.

fgrzadkowski commented Feb 28, 2019

@simonpasquier That's not it. What I'm observing happens even if I stop Prometheus and then start it again after a few minutes. I always lose data, and only for those few metrics.
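
Roughly, the sequence is (service name as in the unit file above):

$ sudo systemctl stop prometheus
$ sleep 300   # wait a few minutes
$ sudo systemctl start prometheus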

simonpasquier commented Mar 5, 2019

It looks similar to #3632. Unfortunately, no progress has been made on that so far.
