Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash during startup #4598

Closed
haraldschilly opened this Issue Sep 12, 2018 · 8 comments

Comments

Projects
None yet
4 participants
@haraldschilly
Copy link

haraldschilly commented Sep 12, 2018

What did you do?

starting prometheus, with some data back from version 2.3.1. then I tried 2.4.0 rc0, which did crash, and today I saw this release and it also crashed.

I fear my problem report isn't helping at all, though, and I can't really share the data files. I've deleted them and so far it looks good. In case this happens again I'll amend this with more information.

What did you expect to see?

log continues to run until it is ready to receive requests ...

What did you see instead? Under which circumstances?

sudden stop and crash

Environment

  • System information:

    4.15.0-1009-gcp x86_64

      it's the official prometheus docker image from quay.io, at version 2.4.0
    
  • Prometheus version:

    /prometheus $ prometheus --version
    prometheus, version 2.4.0 (branch: HEAD, revision: 068eaa5)
    build user: root@d84c15ea5e93
    build date: 20180911-10:46:37
    go version: go1.10.3

  • Logs:

level=info ts=2018-09-12T07:22:17.706304482Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535443800000 maxt=1535444400
000 ulid=01CQ53MSQA9RAS5V3885H8HVHG                                                                                                         
level=info ts=2018-09-12T07:22:17.706336172Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535443800000 maxt=1535444400
000 ulid=01CQ53V9F4F17F4GWQP3METJKG                                                                                                         
level=info ts=2018-09-12T07:22:17.706061915Z caller=web.go:440 component=web msg="router prefix" prefix=/prometheus                         
level=info ts=2018-09-12T07:22:17.706370646Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535443800000 maxt=1535444400
000 ulid=01CQ541SSQGJB9EDDPGH2CN5T2                                                                                                         
level=info ts=2018-09-12T07:22:17.706409572Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535443800000 maxt=1535444400
000 ulid=01CQ548JQKP22FGAWCV8Y7PS4Z                                                                                                         
level=info ts=2018-09-12T07:22:17.706446167Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535443800000 maxt=1535444400
000 ulid=01CQ54FEHM9K47JG89W301HC19                                                                                                         
level=info ts=2018-09-12T07:22:17.706495488Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535443800000 maxt=1535444400
000 ulid=01CQ54P6PADX6ACS5QKSQ01JWW                                                                                                         
level=info ts=2018-09-12T07:22:17.706524329Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1535443800000 maxt=1535444400
000 ulid=01CQ54WZ3HP4VCAN9NHXMZST0B                                                                                                         
level=info ts=2018-09-12T07:22:19.037072124Z caller=main.go:423 msg="Stopping scrape discovery manager..."                                  
level=info ts=2018-09-12T07:22:19.037126169Z caller=main.go:437 msg="Stopping notify discovery manager..."                                  
level=info ts=2018-09-12T07:22:19.037135719Z caller=main.go:459 msg="Stopping scrape manager..."                                            
level=info ts=2018-09-12T07:22:19.037146759Z caller=main.go:433 msg="Notify discovery manager stopped"                                      
level=info ts=2018-09-12T07:22:19.037170849Z caller=main.go:419 msg="Scrape discovery manager stopped"                                      
level=info ts=2018-09-12T07:22:19.037335637Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."                  
level=info ts=2018-09-12T07:22:19.037423951Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"                      
level=info ts=2018-09-12T07:22:19.037311076Z caller=main.go:453 msg="Scrape manager stopped"                                                
level=info ts=2018-09-12T07:22:19.03744385Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."                
level=info ts=2018-09-12T07:22:19.037474862Z caller=main.go:608 msg="Notifier manager stopped"                                              
level=error ts=2018-09-12T07:22:19.03761488Z caller=main.go:617 err="opening storage failed: invalid block sequence: block time ranges overl
ap: [mint: 1535443800000, maxt: 1535444400000, range: 10m0s, blocks: 19]: <ulid: 01CQ51N0KKW3HFCKWBR30GZKDK, mint: 1535443800000, maxt: 1535
444400000, range: 10m0s>, <ulid: 01CQ51S03TH1S8W088S284VYNJ, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ51VY4V94KSM
07Y4D85TABW, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ5223J6VAY1YS4Y6YJXWSYC, mint: 1535443800000, maxt: 15354444
00000, range: 10m0s>, <ulid: 01CQ525JFAHERZ5B2BTKVVVRWJ, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ529MSAA79JZA8X2
FFGXXVB, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ52EQETE2AGZE05HDEEXX1T, mint: 1535443800000, maxt: 153544440000
0, range: 10m0s>, <ulid: 01CQ52N29BVYVPPTENWMY1GZMX, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ52V9C8Q780F772XB4NG
3BF, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ531NA61E3QHAX92YG5MV3V, mint: 1535443800000, maxt: 1535444400000, r
ange: 10m0s>, <ulid: 01CQ5380PRAEHXYR3H0WAC6JA0, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ53EEPH357D2SK4D6G6MEQ2,
 mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ53MSQA9RAS5V3885H8HVHG, mint: 1535443800000, maxt: 1535444400000, range
: 10m0s>, <ulid: 01CQ53V9F4F17F4GWQP3METJKG, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ541SSQGJB9EDDPGH2CN5T2, min
t: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ548JQKP22FGAWCV8Y7PS4Z, mint: 1535443800000, maxt: 1535444400000, range: 10
m0s>, <ulid: 01CQ54FEHM9K47JG89W301HC19, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ54P6PADX6ACS5QKSQ01JWW, mint: 1
535443800000, maxt: 1535444400000, range: 10m0s>, <ulid: 01CQ54WZ3HP4VCAN9NHXMZST0B, mint: 1535443800000, maxt: 1535444400000, range: 10m0s>
"                                                                                                                                           
@simonpasquier

This comment has been minimized.

Copy link
Member

simonpasquier commented Sep 12, 2018

Can you share the flags used to start Prometheus?

It looks like you've set storage.tsdb.min-block-duration and/or storage.tsdb.max-block-duration since the logs show that your block interval is 10 minutes.

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Oct 1, 2018

@haraldschilly

This comment has been minimized.

Copy link
Author

haraldschilly commented Oct 1, 2018

We downgraded back to version 2.3.2. However, yes, the block durations aren't the defaults.
Right now it's

--storage.tsdb.min-block-duration=10m
--storage.tsdb.max-block-duration=1h

and well, previously, the max block duration was 2h. Maybe there is some mixup when the max block duration parameter is different from the one on disk? I didn't spend any time in testing this for reporting something reproducible. Hence I'm not sure how valuable this info is.

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Oct 1, 2018

yeah without steps to reproduce It would be time consuming to troubleshoot this so I wouldn't have the time to take this road.

the logs show that all overlapping blocks have the same time ranges so it probably went into some sort of a dead loop maybe caused by some edge case with your settings and the upgrade path.

If you don't think we can figure out the steps to reproduce now, my suggestion is close the issue and revisit if it happens again.

@haraldschilly

This comment has been minimized.

Copy link
Author

haraldschilly commented Oct 1, 2018

Yes, I fully understand :-)

Just as an additional datapoint, before I close this, I experimented around with the older version 2.3.2 (not the one I was originally reporting) and also changing these min/max intervals from 2h → 1h and 10m → 5m. This is the last line before the crash, which happened during startup in the first couple of minutes:

These mint are almost everywhere the same. Is that what you mean? There were one or two OOM kills before that, and hence some datafile might not have cleaned up properly enough as it should be.

So, all I want to add is that whatever is going on is not specific to 2.4.0

level=info ts=2018-10-01T21:29:27.058502454Z caller=main.go:432 msg="Scrape manager stopped"
level=error ts=2018-10-01T21:29:27.058781064Z caller=main.go:596 err="Opening storage failed invalid block sequence: block time ranges overla
p: [mint: 1538424000000, maxt: 1538424300000, range: 5m0s, blocks: 28]: <ulid: 01CRRQTB6R6K7Z89C6B9T2QGPY, mint: 1538424000000, maxt: 1538424
600000, range: 10m0s>, <ulid: 01CRRMYP0YH42HHK3AYFB5TYC1, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRN0PP5JFGBXR93P
Q7H8BN6, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRN36FQVTXKRQB4FF6X3NDQ, mint: 1538424000000, maxt: 1538424600000
, range: 10m0s>, <ulid: 01CRRN6K31YKF2PAATPSB2K0Y4, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRN9ZBTE9T81GEC87B3GH$
K, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRNDERSG19BVWFC694PWW9Z, mint: 1538424000000, maxt: 1538424600000, ran$
e: 10m0s>, <ulid: 01CRRNGWBDMJ6SCEHW5DKBY0WF, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRPMM4H96RQ728JJJBK7V8E, mi$
t: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRNQRA1S5BPCJQS92MPB6S7, mint: 1538424000000, maxt: 1538424600000, range: 10$
0s>, <ulid: 01CRRNV7X5QBFGZ6PM0D6EHBWS, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRNYTQAZDBD0EE0W6X7PM15, mint: 15$
8424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRP29V65S9KK7DGAEJHD2ZW, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, $
ulid: 01CRRP5RVWD5XYMGVF1W2PV0VS, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRP9APY6R3VA2NF3W1DFV7V, mint: 15384240$
0000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRPD48829DSZAC74M8TR859, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid:
01CRRPGWM5QZWMA05STEGMY0SF, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRNMADDSFT19J2C3Z180YQ4, mint: 1538424000000,
maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRPR7GZ17K74NPP5RHGE2NP, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRR$
VX02AJ559EN6SX4MHSBE, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRPZKQ6BT5YKD0CGQ995PZN, mint: 1538424000000, maxt:
1538424600000, range: 10m0s>, <ulid: 01CRRQ39MW735XKK244XKJ3K77, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRQ720EN$
ZY44M5FEKYQXS8, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRQAT6TFPPAHX8EY2EF1ZSJ, mint: 1538424000000, maxt: 15384$
4600000, range: 10m0s>, <ulid: 01CRRQENEHXBDYJZ8YDMZZGY6K, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRQJDHAW4PR6QP$
Q9M6VXCD, mint: 1538424000000, maxt: 1538424600000, range: 10m0s>, <ulid: 01CRRQPA66DB73XHBN26KAE3KD, mint: 1538424000000, maxt: 15384246000$
0, range: 10m0s>, <ulid: 01CRRKWB2Y2GRCVSC176BEYR4Z, mint: 1538424000000, maxt: 1538424300000, range: 5m0s>"
@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Oct 1, 2018

there was one specific fix that prevent crash loops after OOM so chances are this should be fine in 2.4

@Hashfyre

This comment has been minimized.

Copy link

Hashfyre commented Mar 2, 2019

This is happening even in 2.5.0

level=info ts=2019-03-02T13:49:56.082840042Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551484800000 maxt=1551492000000 ulid=01D4Y5G5W8Y8KDH01KR175HA4C
level=info ts=2019-03-02T13:49:56.083309721Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551492000000 maxt=1551499200000 ulid=01D4YCBX48XFHFQAERZNC9E4JZ
level=info ts=2019-03-02T13:49:56.083793592Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551499200000 maxt=1551506400000 ulid=01D4YK7MC85KQSS46MVA9JJ40A
level=info ts=2019-03-02T13:49:56.084290175Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551513600000 ulid=01D4YT3BM8P5W5VQWP7E1EG1H1
level=info ts=2019-03-02T13:49:56.084764443Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551513600000 maxt=1551520800000 ulid=01D4Z0Z2W8HVCAM08M98W4DP90
level=info ts=2019-03-02T13:49:56.085222429Z caller=repair.go:35 component=tsdb msg="found healthy block" mint=1551520800000 maxt=1551528000000 ulid=01D4Z7TT48FBSE0VQPDD7HPPEW
@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Mar 2, 2019

many changes have been done since 2.5, can you please try with 2.7 and if it happens again please open a new issue with steps to replicate.
Thanks

@prometheus prometheus locked as resolved and limited conversation to collaborators Mar 2, 2019

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.