prometheus is not stable after restoring a snapshot. #4643
Comments
twforeman commented Nov 6, 2018

This is still happening on version 2.4.3. After about four hours the logs start complaining about data compaction errors caused by timestamp issues, and a few hours later Prometheus locks up. This really needs to be addressed.
Could you also add the logs showing the block overlap errors, or if possible the full log?
krasi-georgiev added the component/local storage label Nov 6, 2018
twforeman commented Nov 6, 2018

I pasted the logs in the issue I raised in the tsdb sub-project, but I'll paste them here too:
krasi-georgiev referenced this issue Nov 30, 2018 in a merged pull request: no overlapping on compaction when an existing block is not within default boundaries. #461
krasi-georgiev closed this in prometheus/tsdb#461 Dec 4, 2018
krasi-georgiev added a commit to prometheus/tsdb that referenced this issue Dec 4, 2018
danielfm commented Dec 17, 2018

Just to let you know, I also saw this when trying to restore a snapshot created from a v2.4.3 Prometheus server onto a v2.5.0 one. The logs looked similar:

I'm now running v2.6.0-rc.1, which includes the fix from prometheus/tsdb#461, and have seen no issues so far.
@danielfm nice, thanks for the update.
ryanisroot commented Sep 20, 2018
Proposal
Use case. Why is this important?
Snapshots should be restorable.
Bug Report
What did you do?
Triggered a snapshot, tarred it up, then tried to restore it to an identical server.
The snapshot is triggered with http://localhost:9090/api/v1/admin/tsdb/snapshot?skip_head=false
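For reference, a minimal Go sketch of that snapshot call, assuming Prometheus is listening on localhost:9090 and was started with --web.enable-admin-api. The endpoint is documented as a POST, and the response shape in the comment below is my understanding rather than something copied from this report:

```go
// trigger_snapshot.go: fire the TSDB snapshot endpoint used in this report.
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// skip_head=false includes the in-memory head block, matching the URL above.
	resp, err := http.Post(
		"http://localhost:9090/api/v1/admin/tsdb/snapshot?skip_head=false",
		"application/json", nil)
	if err != nil {
		log.Fatalf("snapshot request failed: %v", err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatalf("reading response: %v", err)
	}
	// On success the body should look roughly like:
	//   {"status":"success","data":{"name":"<snapshot directory name>"}}
	// and the snapshot itself lands under <data-dir>/snapshots/<name>.
	fmt.Printf("%s: %s\n", resp.Status, body)
}
```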
What did you expect to see?
What did you see instead? Under which circumstances?
After a few hours of running with the snapshot and continuing to collect data, Prometheus stops working and refuses to start again.
I restore the snapshot on the server, point the data directory at the restored directory, and restart (a minimal sketch of this restore step follows below). The data from the snapshot appears immediately, as expected.
Prometheus continues to collect data from my servers, again as expected.
At some point (it has happened within 10 minutes and has also taken several hours) Prometheus goes down and refuses to start again until the data directory is cleared, after which it starts up and collects data as usual.
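For completeness, a minimal Go sketch of the restore step described above. The paths are hypothetical placeholders, it assumes Prometheus is stopped and the snapshot tarball has already been unpacked, and os.Rename only works when source and destination are on the same filesystem:

```go
// restore_snapshot.go: swap a restored snapshot in as the Prometheus data directory.
package main

import (
	"log"
	"os"
)

func main() {
	const (
		dataDir     = "/var/lib/prometheus/data"         // hypothetical --storage.tsdb.path
		snapshotDir = "/var/lib/prometheus/restore/snap" // unpacked snapshot contents
		backupDir   = "/var/lib/prometheus/data.old"
	)

	// Move the existing data directory aside rather than deleting it.
	if _, err := os.Stat(dataDir); err == nil {
		if err := os.Rename(dataDir, backupDir); err != nil {
			log.Fatalf("backing up old data dir: %v", err)
		}
	}

	// A snapshot already has the block layout Prometheus expects, so it can
	// simply become the new data directory (or be pointed at directly via
	// --storage.tsdb.path, as described in the report).
	if err := os.Rename(snapshotDir, dataDir); err != nil {
		log.Fatalf("moving snapshot into place: %v", err)
	}

	log.Println("snapshot restored; restart Prometheus with --storage.tsdb.path set to", dataDir)
}
```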
Environment
Linux 3.10.0-862.11.6.el7.x86_64 x86_64
prometheus, version 2.4.1 (branch: HEAD, revision: ce6716f)
build user: root@00c7c051c350
build date: 20180919-13:43:51
go version: go1.10.3
The behaviour was the same on 2.3.1; the issue was supposed to be fixed in 2.3.2. I have also tested with both of those versions.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    monitor: 'codelab-monitor'
rule_files:
scrape_configs:
  - scrape_interval: "15s"
    static_configs:
Sep 20 19:54:57 ip-10-129-1-159 systemd[1]: Started Prometheus Server.
Sep 20 19:54:57 ip-10-129-1-159 systemd[1]: Starting Prometheus Server...
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.106842205Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.1, branch=HEAD, revision=ce6716fe90ed67cb91cf8cf38a5de951853dcc2b)"
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.106907045Z caller=main.go:239 build_context="(go=go1.10.3, user=root@00c7c051c350, date=20180919-13:43:51)"
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.106927531Z caller=main.go:240 host_details="(Linux 3.10.0-862.11.6.el7.x86_64 #1 SMP Tue Aug 14 21:49:04 UTC 2018 x86_64 ip-10-129-1-159 (none))"
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.106943953Z caller=main.go:241 fd_limits="(soft=1024, hard=4096)"
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.106959272Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.107735716Z caller=main.go:554 msg="Starting TSDB ..."
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108220117Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537387200000 maxt=1537401600000 ulid=01CQTNJXCR80T94ZSFFNJEXZ9P
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108324649Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537401600000 maxt=1537423200000 ulid=01CQV3ACZG0WQ6GW8C2W1J1KR3
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108396602Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537444800000 maxt=1537452000000 ulid=01CQVQXCRXGRKGV4D315BV1VKQ
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108498694Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537423200000 maxt=1537444800000 ulid=01CQVQXGXR82004DT3MQKPAGH2
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108561545Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537452000000 maxt=1537457680887 ulid=01CQVSWWVGGNQFX2YK9GB615EZ
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.10867296Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537387200000 maxt=1537444800000 ulid=01CQW0QNYQEXMESMNJN0TH66MZ
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108739453Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537452000000 maxt=1537459200000 ulid=01CQW46FFVPC1WDDEHPDF91MFW
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108793681Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537452000000 maxt=1537459200000 ulid=01CQW46GKFVJB8AKQDT1B0MZEF
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108845858Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537452000000 maxt=1537459200000 ulid=01CQW46JPG1S32TNR4TW876253
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108899296Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537452000000 maxt=1537459200000 ulid=01CQW46PQY8RP1YMQYJRY7JB37
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.108949738Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1537452000000 maxt=1537459200000 ulid=01CQW46YPBVM98180J133YC97B
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.111017973Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
Sep 20 19:54:57 ip-10-129-1-159 prometheus[11264]: level=info ts=2018-09-20T19:54:57.12833645Z caller=main.go:423 msg="Stopping scrape discovery manager..."
Sep 20 19:54:57 ip-10-129-1-159 systemd[1]: prometheus.service: main process exited, code=exited, status=1/FAILURE
Sep 20 19:54:57 ip-10-129-1-159 systemd[1]: Unit prometheus.service entered failed state.
Sep 20 19:54:57 ip-10-129-1-159 systemd[1]: prometheus.service failed.