Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prometheus does not start: Opening storage failed invalid block sequence #4324

Closed
apsega opened this Issue Jun 29, 2018 · 12 comments

Comments

Projects
None yet
4 participants
@apsega
Copy link

apsega commented Jun 29, 2018

Bug Report

After restarting Prometheus, it couldn't manage to startup (that's quite sad with 1.5TB local data). It has been running on 2.3 version for a week now, and upgrade to 2.3.1 didn't help.

We have HA Prometheus setup and the other instance (v2.1.0) is running just perfectly fine.

Environment

  • System information:

Prometheus is running in Docker (Docker version 1.13.0, build 49bf474) on CentOS (7.4.1708).

Linux 3.10.0-514.6.1.el7.x86_64 x86_64
  • Prometheus version:
version=2.3.1, branch=HEAD, revision=188ca45bd85ce843071e768d855722a9d9dabe03
  • Logs:
level=info ts=2018-06-29T05:42:47.874516727Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.1, branch=HEAD, revision=188ca45bd85ce843071e768d855722a9d9dabe03)"
level=info ts=2018-06-29T05:42:47.874611131Z caller=main.go:223 build_context="(go=go1.10.3, user=root@82ef94f1b8f7, date=20180619-15:56:22)"
level=info ts=2018-06-29T05:42:47.87463482Z caller=main.go:224 host_details="(Linux 3.10.0-514.6.1.el7.x86_64 #1 SMP Wed Jan 18 13:06:36 UTC 2017 x86_64 4da5295cfc49 (none))"
level=info ts=2018-06-29T05:42:47.874664035Z caller=main.go:225 fd_limits="(soft=65536, hard=65536)"
level=info ts=2018-06-29T05:42:47.87602246Z caller=web.go:415 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-06-29T05:42:47.875990629Z caller=main.go:514 msg="Starting TSDB ..."
level=info ts=2018-06-29T05:42:47.876915963Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1524873600000 maxt=1524938400000 ulid=01CC6T4QA264XXY06Y9FGJA6Y6
level=info ts=2018-06-29T05:42:47.87698551Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1524938400000 maxt=1525003200000 ulid=01CC8QXXBDJEW4KE1B0XTB9GN5
level=info ts=2018-06-29T05:42:47.877031253Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1525003200000 maxt=1525068000000 ulid=01CCANQJPRY3C2T2ZJX8JGG6T7
level=info ts=2018-06-29T05:42:47.877091724Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1525068000000 maxt=1525262400000 ulid=01CCGF8JYFBPYQY3MKMG5D07MH
level=info ts=2018-06-29T05:42:47.877148247Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1525262400000 maxt=1525456800000 ulid=01CCP8MXJDJ6X4DXDPAPAYYZZ6
level=info ts=2018-06-29T05:42:47.87720284Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1525456800000 maxt=1525651200000 ulid=01CCW215PSGPS4XG9YNAH1XZ9R
level=info ts=2018-06-29T05:42:47.877257559Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1525651200000 maxt=1525845600000 ulid=01CD1VEH75RQCP1DZQB3Q6DEXQ
level=info ts=2018-06-29T05:42:47.877312083Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1525845600000 maxt=1526040000000 ulid=01CD7MV6ZMNAGV2AY5XXSGDHQC
level=info ts=2018-06-29T05:42:47.877366476Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1526040000000 maxt=1526234400000 ulid=01CDDE7KTFE65TCXNK9GP279BQ
level=info ts=2018-06-29T05:42:47.877421121Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1526234400000 maxt=1526428800000 ulid=01CDK7M8KPGRTKABFK8SPB342V
level=info ts=2018-06-29T05:42:47.87749242Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1526428800000 maxt=1526623200000 ulid=01CDS11974525DAHSYYC3GMJXG
level=info ts=2018-06-29T05:42:47.877558161Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1526623200000 maxt=1526817600000 ulid=01CDYTDPYYSAW8YVQH0CVNVC9F
level=info ts=2018-06-29T05:42:47.877622592Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1526817600000 maxt=1527012000000 ulid=01CE4KTTBVM8ND1KM0BD07K8DC
level=info ts=2018-06-29T05:42:47.877684055Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527012000000 maxt=1527206400000 ulid=01CEAD77M8BDW92NXFZ2E07E9T
level=info ts=2018-06-29T05:42:47.877744944Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527206400000 maxt=1527400800000 ulid=01CEG6KRKRNR8Z670BPHWFQ1NB
level=info ts=2018-06-29T05:42:47.877806268Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527400800000 maxt=1527595200000 ulid=01CEP00V2778QM9EA1WP3ADGHZ
level=info ts=2018-06-29T05:42:47.877866166Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527595200000 maxt=1527789600000 ulid=01CEVSDBRPEANB1N5H251D698Z
level=info ts=2018-06-29T05:42:47.877939432Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527789600000 maxt=1527984000000 ulid=01CF1JTRAEPGGBDA4ZCGZZ2ZP4
level=info ts=2018-06-29T05:42:47.878000643Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527984000000 maxt=1528178400000 ulid=01CF7C7CVPCCS6VN3A5Q2TY4VG
level=info ts=2018-06-29T05:42:47.878062404Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528178400000 maxt=1528372800000 ulid=01CFD5KHYBW2N7CBB1R7R6DYE3
level=info ts=2018-06-29T05:42:47.878123757Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528372800000 maxt=1528567200000 ulid=01CFJZ5NX4D0EB8VDHPVWSY9JT
level=info ts=2018-06-29T05:42:47.878193538Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528567200000 maxt=1528761600000 ulid=01CFRRJJAT0PBKK6Z6HSFX51ZD
level=info ts=2018-06-29T05:42:47.878261655Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528761600000 maxt=1528956000000 ulid=01CFYRSMJX7TJPGHQRF7RM7GDS
level=info ts=2018-06-29T05:42:47.878325809Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528956000000 maxt=1529150400000 ulid=01CG4J4SKHVGTGQDNFRBVJSCHD
level=info ts=2018-06-29T05:42:47.878386708Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529150400000 maxt=1529344800000 ulid=01CGABHZCXX219YYV1GNGH4T3P
level=info ts=2018-06-29T05:42:47.878461579Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529344800000 maxt=1529539200000 ulid=01CGG4ZHR1Y9G3CM5D16RJMPBB
level=info ts=2018-06-29T05:42:47.878528372Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529539200000 maxt=1529733600000 ulid=01CGNYC3YDTWASN55F8XF4Z22R
level=info ts=2018-06-29T05:42:47.878589781Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529733600000 maxt=1529928000000 ulid=01CGVQRM4G5GSE74CMG3ARWD2D
level=info ts=2018-06-29T05:42:47.878653241Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529928000000 maxt=1530122400000 ulid=01CH1H57YV29TJF7HHHCQ5PNRJ
level=info ts=2018-06-29T05:42:47.878696161Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530122400000 maxt=1530144000000 ulid=01CH25EFB1HFA8KWPH4D31WFV8
level=info ts=2018-06-29T05:42:47.878738139Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530144000000 maxt=1530165600000 ulid=01CH2T1CRDR48BE57SJJ3MWCN1
level=info ts=2018-06-29T05:42:47.878778122Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530187200000 maxt=1530194400000 ulid=01CH3EH0DNEBYGQ928FJ69JN7T
level=info ts=2018-06-29T05:42:47.878819258Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530165600000 maxt=1530187200000 ulid=01CH3EN9XXC5JSSXNTERE5NVM4
level=info ts=2018-06-29T05:42:47.878874239Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530122400000 maxt=1530187200000 ulid=01CH3F08F417H0GGM056PF3FKS
level=info ts=2018-06-29T05:42:52.251098703Z caller=main.go:402 msg="Stopping scrape discovery manager..."
level=info ts=2018-06-29T05:42:52.251154781Z caller=main.go:416 msg="Stopping notify discovery manager..."
level=info ts=2018-06-29T05:42:52.251165314Z caller=main.go:438 msg="Stopping scrape manager..."
level=info ts=2018-06-29T05:42:52.251185958Z caller=manager.go:464 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-06-29T05:42:52.251198231Z caller=manager.go:470 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-06-29T05:42:52.251208823Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2018-06-29T05:42:52.251223998Z caller=main.go:588 msg="Notifier manager stopped"
level=info ts=2018-06-29T05:42:52.251207307Z caller=main.go:432 msg="Scrape manager stopped"
level=info ts=2018-06-29T05:42:52.251224054Z caller=main.go:412 msg="Notify discovery manager stopped"
level=info ts=2018-06-29T05:42:52.251247707Z caller=main.go:398 msg="Scrape discovery manager stopped"
level=error ts=2018-06-29T05:42:52.251604683Z caller=main.go:597 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1530122400000, maxt: 1530144000000, range: 6h0m0s, blocks: 2]: <ulid: 01CH25EFB1HFA8KWPH4D31WFV8, mint: 1530122400000, maxt: 1530144000000, range: 6h0m0s>, <ulid: 01CH3F08F417H0GGM056PF3FKS, mint: 1530122400000, maxt: 1530187200000, range: 18h0m0s>\n[mint: 1530144000000, maxt: 1530165600000, range: 6h0m0s, blocks: 2]: <ulid: 01CH3F08F417H0GGM056PF3FKS, mint: 1530122400000, maxt: 1530187200000, range: 18h0m0s>, <ulid: 01CH2T1CRDR48BE57SJJ3MWCN1, mint: 1530144000000, maxt: 1530165600000, range: 6h0m0s>\n[mint: 1530165600000, maxt: 1530187200000, range: 6h0m0s, blocks: 2]: <ulid: 01CH3F08F417H0GGM056PF3FKS, mint: 1530122400000, maxt: 1530187200000, range: 18h0m0s>, <ulid: 01CH3EN9XXC5JSSXNTERE5NVM4, mint: 1530165600000, maxt: 1530187200000, range: 6h0m0s>"
level=info ts=2018-06-29T05:42:52.251663561Z caller=main.go:599 msg="See you next time!"

Is there a way to start Prometheus by skipping broken blocks?

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Jun 29, 2018

Can you share the steps to reproduce this?
did it crash before this or was it OOM killed?

The logs before this happened would be useful.

as a workaround I have attached a tool I have been working on which should allow you to delete the overlapping blocks and allow a normal startup
It is still WIP and probably broken in many places so use at your own risk!
tsdb scan -h /path/to/data

tsdbScan.tar.gz

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Jun 29, 2018

Looks like #4302

@apsega

This comment has been minimized.

Copy link
Author

apsega commented Jun 29, 2018

From the metrics, it seems that Prometheus got very hungry for memory (host has 32GB RAM):

screen shot 2018-06-29 at 12 43 50 pm

Unfortunately I do not have the logs before the event as I've done the upgrade.

Can you share the source code of tsdbScan? I'm not feeling that lucky running something blackbox in my environment :)

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Jul 3, 2018

it is still WIP, but here is the code
prometheus/tsdb#320

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Jul 3, 2018

if you can it will be even better to compile from that PR as it includes more recent code.
again it might be totally broken so run only as a last resort.

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Jul 5, 2018

Fabian did few changes in the tsdb that should prevent this so I would say it is safe to close and will revisit if it happens again in the next release which is coming soon.

@Hashfyre

This comment has been minimized.

Copy link

Hashfyre commented Feb 4, 2019

Running into this issue again

level=info ts=2019-02-04T13:01:25.694250286Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.3, branch=HEAD, revision=167a4b4e73a8eca8df648d2d2043e21bdb9a7449)"
level=info ts=2019-02-04T13:01:25.694299087Z caller=main.go:239 build_context="(go=go1.11.1, user=root@1e42b46043e9, date=20181004-08:42:02)"
level=info ts=2019-02-04T13:01:25.694316525Z caller=main.go:240 host_details="(Linux 4.4.0-1054-aws #63-Ubuntu SMP Wed Mar 28 19:42:42 UTC 2018 x86_64 prometheus-kube-prometheus-0 (none))"
level=info ts=2019-02-04T13:01:25.694335436Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2019-02-04T13:01:25.694351359Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-02-04T13:01:25.694825418Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2019-02-04T13:01:25.694856556Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-02-04T13:01:25.696252804Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1547964000000 maxt=1547971200000 ulid=01D1N7SWPVK1V3PF7P3M2S0DN3
level=info ts=2019-02-04T13:01:25.69630669Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1547971200000 maxt=1547978400000 ulid=01D1NENKYTXCG25J6XVBJ93AYA
level=info ts=2019-02-04T13:01:25.696340578Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1547978400000 maxt=1547985600000 ulid=01D1NNHB6TFFMP6R54GKEG5SND
level=info ts=2019-02-04T13:01:25.696370836Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1547985600000 maxt=1547992800000 ulid=01D1NWD2G7TTT8EB69FW1A419Y
level=info ts=2019-02-04T13:01:25.696399703Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1547992800000 maxt=1548000000000 ulid=01D1P38SRA374254F04YST2JV8
level=info ts=2019-02-04T13:01:25.696428285Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1548000000000 maxt=1548007200000 ulid=01D1PA4GXPWN9WMB9HEY1AHSCF
level=info ts=2019-02-04T13:01:25.696456944Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1548007200000 maxt=1548014400000 ulid=01D1PH0869D9C3GBSMX1A4DTDS
level=info ts=2019-02-04T13:01:25.696487497Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1548014400000 maxt=1548021600000 ulid=01D1PQVZE9CJ3Z2P2DZZHX15RX
level=info ts=2019-02-04T13:01:25.696516275Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1548021600000 maxt=1548028800000 ulid=01D1PYQPP9ZP9W5HBEKEH2ZM2Q
level=info ts=2019-02-04T13:01:25.696544626Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1548028800000 maxt=1548036000000 ulid=01D1Q5KDY92GY4PSP2R5GH5W37

using: kube-prometheus/jsonnet
prometheus version: prometheus:v2.4.3
k8s versions:

Client Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.2", GitCommit:"17c77c7898218073f14c8d573582e8d2313dc740", GitTreeState:"clean", BuildDate:"2018-10-30T21:39:38Z", GoVersion:"go1.11.1", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"12", GitVersion:"v1.12.1", GitCommit:"4ed3216f3ec431b140b1d899130a69fc671678f4", GitTreeState:"clean", BuildDate:"2018-10-05T16:36:14Z", GoVersion:"go1.10.4", Compiler:"gc", Platform:"linux/amd64"}
```
@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Feb 4, 2019

@Hashfyre please open a new issue with more details.
Your logs don't show any errors.

@Hashfyre

This comment has been minimized.

Copy link

Hashfyre commented Feb 4, 2019

cc @krasi-georgiev let me know if I can provide more context.
It does not, but prometheus fails to start and hangs on infinitely, since we are running using kube-prometheus, the readyness checks fail and the pod gets restarted as healthcheck on :9090 fails

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Feb 4, 2019

This could be due to many reasons - large WAL file could be one of them(it could take few minutes to read a large WAL file.).
If this happens on 2.7 please open a new issue and we can troubleshoot when you add more details.
Thanks.

@Hashfyre

This comment has been minimized.

Copy link

Hashfyre commented Feb 4, 2019

@krasi-georgiev our issue seems to mimick #4003
as of now.

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Feb 4, 2019

unless you spend the time to open a new issue including all relevant details don't think anyone would spend the time for a proper troubleshooting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.