
Data loss after restart #4028

Closed
mattouille opened this Issue Mar 30, 2018 · 11 comments

mattouille commented Mar 30, 2018

What did you do?

Use systemd to restart prometheus (systemctl restart prometheus)
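For reference, a restart cycle like this can be captured and checked in the journal. This is a sketch, assuming the systemd unit is named `prometheus`:

```shell
# Restart the service, then look for the shutdown/startup markers
# Prometheus logs around a clean restart.
sudo systemctl restart prometheus
journalctl -u prometheus --since "5 minutes ago" \
  | grep -E 'SIGTERM|See you next time|TSDB started'
```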

What did you expect to see?

Prometheus restarts and data is retained

What did you see instead? Under which circumstances?

Prometheus restarts and initially shows all the data collected since it was first started; then, after a few seconds, the data is cut down to roughly the last five minutes.

Environment

  • System information:

    Linux 4.4.0-87-generic x86_64

  • Prometheus version:

prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c)
build user: root@149e5b3f0829
build date: 20180314-14:15:45
go version: go1.10

  • Alertmanager version:
    alertmanager, version 0.14.0 (branch: HEAD, revision: 30af4d051b37ce817ea7e35b56c57a0e2ec9dbb0)
    build user: root@37b6a49ebba9
    build date: 20180213-08:16:42
    go version: go1.9.2
  • Prometheus configuration file:
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - localhost:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    scrape_interval: 5s
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node_exporter'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'rabbitmq_prod'
    scrape_interval: 5s
    metrics_path: '/api/metrics'
    static_configs:
      - targets: ['ip:15672']
  • Alertmanager configuration file:

N/A

  • Logs:
Mar 29 18:50:08 freedon alertmanager[30189]: level=info ts=2018-03-29T23:50:08.880362478Z caller=nflog.go:293 component=nflog msg="Running maintenance"
Mar 29 18:50:08 freedon alertmanager[30189]: level=info ts=2018-03-29T23:50:08.881315927Z caller=nflog.go:295 component=nflog msg="Maintenance done" duration=1.076145ms size=0
Mar 29 18:50:08 freedon alertmanager[30189]: level=info ts=2018-03-29T23:50:08.880813406Z caller=silence.go:269 component=silences msg="Running maintenance"
Mar 29 18:50:08 freedon alertmanager[30189]: level=info ts=2018-03-29T23:50:08.881742678Z caller=silence.go:271 component=silences msg="Maintenance done" duration=940.664µs size=0
Mar 29 19:05:08 freedon alertmanager[30189]: level=info ts=2018-03-30T00:05:08.88013978Z caller=nflog.go:293 component=nflog msg="Running maintenance"
Mar 29 19:05:08 freedon alertmanager[30189]: level=info ts=2018-03-30T00:05:08.880439453Z caller=silence.go:269 component=silences msg="Running maintenance"
Mar 29 19:05:08 freedon alertmanager[30189]: level=info ts=2018-03-30T00:05:08.881013506Z caller=nflog.go:295 component=nflog msg="Maintenance done" duration=881.166µs size=0
Mar 29 19:05:08 freedon alertmanager[30189]: level=info ts=2018-03-30T00:05:08.881112047Z caller=silence.go:271 component=silences msg="Maintenance done" duration=677.061µs size=0
Mar 29 19:09:11 freedon systemd[1]: Stopping Prometheus...
Mar 29 19:09:11 freedon prometheus[30305]: level=warn ts=2018-03-30T00:09:11.479577566Z caller=main.go:374 msg="Received SIGTERM, exiting gracefully..."
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.479678608Z caller=main.go:398 msg="Stopping scrape discovery manager..."
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.479705708Z caller=main.go:411 msg="Stopping notify discovery manager..."
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.479724774Z caller=main.go:432 msg="Stopping scrape manager..."
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.479750723Z caller=main.go:407 msg="Notify discovery manager stopped"
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.479768921Z caller=main.go:394 msg="Scrape discovery manager stopped"
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.47983884Z caller=main.go:426 msg="Scrape manager stopped"
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.480757861Z caller=manager.go:460 component="rule manager" msg="Stopping rule manager..."
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.480798552Z caller=manager.go:466 component="rule manager" msg="Rule manager stopped"
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.480824321Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.480847878Z caller=main.go:573 msg="Notifier manager stopped"
Mar 29 19:09:11 freedon prometheus[30305]: level=info ts=2018-03-30T00:09:11.481171437Z caller=main.go:584 msg="See you next time!"
Mar 29 19:09:11 freedon systemd[1]: Stopped Prometheus.
Mar 29 19:09:11 freedon systemd[1]: Started Prometheus.
Mar 29 19:09:11 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:11.624833188Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
Mar 29 19:09:11 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:11.624895856Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
Mar 29 19:09:11 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:11.62492176Z caller=main.go:222 host_details="(Linux 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 freedon (none))"
Mar 29 19:09:11 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:11.624944313Z caller=main.go:223 fd_limits="(soft=1024, hard=4096)"
Mar 29 19:09:11 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:11.628139207Z caller=main.go:504 msg="Starting TSDB ..."
Mar 29 19:09:11 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:11.62824636Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
Mar 29 19:09:13 freedon grafana-server[15874]: t=2018-03-29T19:09:13-0500 lvl=info msg="Request Completed" logger=context userId=2 orgId=1 uname=mattouille method=GET path=/api/datasources/proxy/1/api/v1/query_range status=503 remote_addr=192.168.1.57 time_ms=2 size=19 referer="http://grafana.dc.capstonemetering.net:3000/d/SoRzVrziz/rabbitmq?orgId=1&refresh=5s&from=now-3h&to=now&panelId=4&tab=general"
Mar 29 19:09:13 freedon grafana-server[15874]: t=2018-03-29T19:09:13-0500 lvl=info msg="Request Completed" logger=context userId=2 orgId=1 uname=mattouille method=GET path=/api/datasources/proxy/1/api/v1/query_range status=503 remote_addr=192.168.1.57 time_ms=3 size=19 referer="http://grafana.dc.capstonemetering.net:3000/d/SoRzVrziz/rabbitmq?orgId=1&refresh=5s&from=now-3h&to=now&panelId=4&tab=general"
Mar 29 19:09:13 freedon grafana-server[15874]: t=2018-03-29T19:09:13-0500 lvl=info msg="Request Completed" logger=context userId=2 orgId=1 uname=mattouille method=GET path=/api/datasources/proxy/1/api/v1/query_range status=503 remote_addr=192.168.1.57 time_ms=3 size=19 referer="http://grafana.dc.capstonemetering.net:3000/d/SoRzVrziz/rabbitmq?orgId=1&refresh=5s&from=now-3h&to=now&panelId=4&tab=general"
Mar 29 19:09:14 freedon grafana-server[15874]: t=2018-03-29T19:09:14-0500 lvl=info msg="Request Completed" logger=context userId=2 orgId=1 uname=mattouille method=GET path=/api/datasources/proxy/1/api/v1/query_range status=503 remote_addr=192.168.1.57 time_ms=5 size=19 referer="http://grafana.dc.capstonemetering.net:3000/d/SoRzVrziz/rabbitmq?orgId=1&refresh=5s&from=now-3h&to=now&panelId=4&tab=general"
Mar 29 19:09:14 freedon grafana-server[15874]: t=2018-03-29T19:09:14-0500 lvl=info msg="Request Completed" logger=context userId=2 orgId=1 uname=mattouille method=GET path=/api/datasources/proxy/1/api/v1/query_range status=503 remote_addr=192.168.1.57 time_ms=2 size=19 referer="http://grafana.dc.capstonemetering.net:3000/d/SoRzVrziz/rabbitmq?orgId=1&refresh=5s&from=now-3h&to=now&panelId=4&tab=general"
Mar 29 19:09:14 freedon grafana-server[15874]: t=2018-03-29T19:09:14-0500 lvl=info msg="Request Completed" logger=context userId=2 orgId=1 uname=mattouille method=GET path=/api/datasources/proxy/1/api/v1/query_range status=503 remote_addr=192.168.1.57 time_ms=3 size=19 referer="http://grafana.dc.capstonemetering.net:3000/d/SoRzVrziz/rabbitmq?orgId=1&refresh=5s&from=now-3h&to=now&panelId=4&tab=general"
Mar 29 19:09:15 freedon grafana-server[15874]: t=2018-03-29T19:09:15-0500 lvl=info msg="Request Completed" logger=context userId=2 orgId=1 uname=mattouille method=GET path=/api/datasources/proxy/1/api/v1/query_range status=503 remote_addr=192.168.1.57 time_ms=1 size=19 referer="http://grafana.dc.capstonemetering.net:3000/d/SoRzVrziz/rabbitmq?orgId=1&refresh=5s&from=now-3h&to=now"
Mar 29 19:09:15 freedon grafana-server[15874]: message repeated 2 times: [ t=2018-03-29T19:09:15-0500 lvl=info msg="Request Completed" logger=context userId=2 orgId=1 uname=mattouille method=GET path=/api/datasources/proxy/1/api/v1/query_range status=503 remote_addr=192.168.1.57 time_ms=1 size=19 referer="http://grafana.dc.capstonemetering.net:3000/d/SoRzVrziz/rabbitmq?orgId=1&refresh=5s&from=now-3h&to=now"]
Mar 29 19:09:16 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:16.97288941Z caller=main.go:514 msg="TSDB started"
Mar 29 19:09:16 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:16.973002448Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Mar 29 19:09:16 freedon prometheus[5157]: level=info ts=2018-03-30T00:09:16.974342869Z caller=main.go:491 msg="Server is ready to receive web requests."

As far as I can see, the logs show no errors, just this very odd behavior. I've now deleted my WAL and am trying to isolate the issue further.

mattouille commented Mar 30, 2018

From what I've observed, this primarily seems to happen after about two hours of running and collecting data, but I'm going to try to pin that down tonight.

mattouille commented Mar 30, 2018

As long as the server wasn't restarted, it retained its data. As soon as I ran systemctl restart prometheus, it dropped all but roughly the last two hours of data.

Mar 29 21:50:08 freedon alertmanager[30189]: level=info ts=2018-03-30T02:50:08.880162131Z caller=nflog.go:293 component=nflog msg="Running maintenance"
Mar 29 21:50:08 freedon alertmanager[30189]: level=info ts=2018-03-30T02:50:08.880504986Z caller=silence.go:269 component=silences msg="Running maintenance"
Mar 29 21:50:08 freedon alertmanager[30189]: level=info ts=2018-03-30T02:50:08.880969031Z caller=nflog.go:295 component=nflog msg="Maintenance done" duration=817.021µs size=0
Mar 29 21:50:08 freedon alertmanager[30189]: level=info ts=2018-03-30T02:50:08.88120478Z caller=silence.go:271 component=silences msg="Maintenance done" duration=704.504µs size=0
Mar 29 21:54:33 freedon systemd[1]: Started Session 357 of user matt.
Mar 29 21:54:46 freedon systemd[1]: Stopping Prometheus...
Mar 29 21:54:46 freedon prometheus[9731]: level=warn ts=2018-03-30T02:54:46.859656691Z caller=main.go:374 msg="Received SIGTERM, exiting gracefully..."
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.859742325Z caller=main.go:398 msg="Stopping scrape discovery manager..."
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.859772719Z caller=main.go:411 msg="Stopping notify discovery manager..."
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.859791452Z caller=main.go:432 msg="Stopping scrape manager..."
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.859827153Z caller=main.go:407 msg="Notify discovery manager stopped"
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.859817932Z caller=main.go:394 msg="Scrape discovery manager stopped"
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.859855917Z caller=main.go:426 msg="Scrape manager stopped"
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.860017482Z caller=manager.go:460 component="rule manager" msg="Stopping rule manager..."
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.86004536Z caller=manager.go:466 component="rule manager" msg="Rule manager stopped"
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.860063316Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.860085889Z caller=main.go:573 msg="Notifier manager stopped"
Mar 29 21:54:46 freedon prometheus[9731]: level=info ts=2018-03-30T02:54:46.860489722Z caller=main.go:584 msg="See you next time!"
Mar 29 21:54:46 freedon systemd[1]: Stopped Prometheus.
Mar 29 21:54:46 freedon systemd[1]: Started Prometheus.
Mar 29 21:54:46 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:46.973098863Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
Mar 29 21:54:46 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:46.973183053Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
Mar 29 21:54:46 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:46.973220426Z caller=main.go:222 host_details="(Linux 4.4.0-87-generic #110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017 x86_64 freedon (none))"
Mar 29 21:54:46 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:46.97325355Z caller=main.go:223 fd_limits="(soft=1024, hard=4096)"
Mar 29 21:54:46 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:46.976446141Z caller=main.go:504 msg="Starting TSDB ..."
Mar 29 21:54:46 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:46.976489915Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
Mar 29 21:54:47 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:47.365207343Z caller=main.go:514 msg="TSDB started"
Mar 29 21:54:47 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:47.36529438Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
Mar 29 21:54:47 freedon prometheus[14849]: level=info ts=2018-03-30T02:54:47.366442726Z caller=main.go:491 msg="Server is ready to receive web requests."

I think one of the worst things about this is that it produces zero errors.

gouthamve commented Mar 30, 2018

Hmm, could you share the systemd service file for Prometheus too?

mattouille commented Mar 30, 2018

@gouthamve

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
  --log.level=debug \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/var/lib/prometheus/ \
  --storage.tsdb.retention=15d \
  --web.console.templates=/etc/prometheus/consoles \
  --web.console.libraries=/etc/prometheus/console_libraries
[Install]
WantedBy=multi-user.target
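One thing worth checking with a unit like this is that the running process actually received the storage flags you expect; a mismatched or changed --storage.tsdb.path between restarts would look exactly like lost data. A sketch:

```shell
# Show the storage-related flags systemd passed to the running process.
systemctl show -p ExecStart prometheus | tr ' ' '\n' | grep -- --storage.tsdb
```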
mattouille commented Mar 30, 2018

From a brief look at my Prometheus UI, it looks like storage.tsdb.min-block-duration (which defaults to 2h) lines up with my loss timeline, which is about every two hours.
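The head block's time range can be checked directly against that theory. This is a sketch, assuming the prometheus_tsdb_head_min_time metric (milliseconds since epoch) is exposed by your build and jq is installed:

```shell
# Ask Prometheus for the oldest timestamp still held in the in-memory
# head block; if this jumps forward by ~2h across a restart, it matches
# the min-block-duration timeline.
curl -s 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_min_time' \
  | jq -r '.data.result[0].value[1]'
```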

mattouille commented Mar 30, 2018

I added a discovery folder to /var/lib/prometheus, and now I'm seeing some new entries:

├── 01C9TV9FHB13HS6W401734KYDX
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── 01C9VPRCBEWRY7P62S96ZRFSDT
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── 01C9VPRCFGSJWX23NEHVHX30V1
│   ├── chunks
│   │   └── 000001
│   ├── index
│   ├── meta.json
│   └── tombstones
├── discovery
│   ├── node
│   └── snmp
├── lock
└── wal
    ├── 000001
    └── 000021

Those top-level folders with random-looking names never existed before, so that's new, and discovery has two new subfolders. Before, all that existed was lock and wal, with some files inside wal.
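Those randomly named directories are persisted TSDB blocks, and each one's meta.json records the time window it covers. A sketch for inspecting them (assumes jq is available; minTime/maxTime are Unix milliseconds):

```shell
# Print each block's ULID and the time window it covers,
# converted from Unix milliseconds to ISO-8601.
for meta in /var/lib/prometheus/*/meta.json; do
  jq -r '"\(.ulid): \((.minTime/1000|floor|todate)) -> \((.maxTime/1000|floor|todate))"' "$meta"
done
```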

mattouille commented Mar 30, 2018

Permissions for the TSDB data directory:

drwxr-xr-x  7 prometheus prometheus 4.0K Mar 30 12:00 .
drwxr-xr-x 48 root       root       4.0K Mar 27 15:45 ..
drwxr-xr-x  3 prometheus prometheus 4.0K Mar 30 10:00 01C9VPRCBEWRY7P62S96ZRFSDT
drwxr-xr-x  3 prometheus prometheus 4.0K Mar 30 12:00 01C9VXM3JGDV5SV1364W6GC8ZZ
drwxr-xr-x  3 prometheus prometheus 4.0K Mar 30 12:00 01C9VXM3PAAMXA4WTCKBNMRC2S
drwxr-xr-x  4 prometheus prometheus 4.0K Mar 29 22:11 discovery
-rw-------  1 prometheus prometheus    6 Mar 29 23:01 lock
drwxr-xr-x  2 prometheus prometheus 4.0K Mar 30 12:00 wal
matt@freedon:/var/lib/prometheus$ ls -lah wal
total 513M
drwxr-xr-x 2 prometheus prometheus 4.0K Mar 30 12:00 .
drwxr-xr-x 7 prometheus prometheus 4.0K Mar 30 12:00 ..
-rw-r--r-- 1 prometheus prometheus 288K Mar 30 12:00 000001
-rw-r--r-- 1 prometheus prometheus 256M Mar 30 11:46 000021
-rw-r--r-- 1 prometheus prometheus 256M Mar 30 12:33 000022
mattouille commented Mar 30, 2018

I've tried restarting the server after not having restarted it for 12 hours, and it restarted fine, retaining all data. I'm not sure what led up to this.

mattouille commented Mar 30, 2018

I'll close this issue for now; still, it definitely would have been nice to find a root cause.

@mattouille mattouille closed this Mar 30, 2018

TroubleshooteRJ commented Aug 7, 2018

Hey mattouille, did you find the root cause? I'm scared to restart Prometheus after my change to increase the disk size.
#4475

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 22, 2019
