Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Opening storage failed invalid block sequence: block time ranges overlap after updating from 2.3.1 to 2.3.2 (potential corruption?) #4388

Open
ankon opened this Issue Jul 16, 2018 · 7 comments

Comments

Projects
None yet
4 participants
@ankon
Copy link

ankon commented Jul 16, 2018

Bug Report

What did you do?

Looking into an issue with Prometheus failing due to a disk-full situation. I resized the partition (with Prometheus not running as far as I can tell), and then also noticed Prometheus 2.3.2 with TSDB fixes being available, so I added that in for good measure.

What did you expect to see?

Prometheus starting up again, potentially with reports of lost data due to the disk full problems.

What did you see instead? Under which circumstances?

Prometheus fails to start, reporting:

level=error ts=2018-07-16T11:11:47.886999926Z caller=main.go:596 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1531540800000, maxt: 1531548000000, range: 2h0m0s, blocks: 2]: <ulid: 01CJH2QN6AM789GKHEMYJBFNVS, mint: 1531526400000, maxt: 1531548000000, range: 6h0m0s>, <ulid: 01CJH1KXT0B5HRJQFBAH3BM7SJ, mint: 1531540800000, maxt: 1531548000000, range: 2h0m0s>\n[mint: 1531526400000, maxt: 1531533600000, range: 2h0m0s, blocks: 2]: <ulid: 01CJH1GB8RE458KJYGX7HS09DY, mint: 1531526400000, maxt: 1531533600000, range: 2h0m0s>, <ulid: 01CJH2QN6AM789GKHEMYJBFNVS, mint: 1531526400000, maxt: 1531548000000, range: 6h0m0s>\n[mint: 1531533600000, maxt: 1531540800000, range: 2h0m0s, blocks: 2]: <ulid: 01CJH2QN6AM789GKHEMYJBFNVS, mint: 1531526400000, maxt: 1531548000000, range: 6h0m0s>, <ulid: 01CJH1JP2BJHGM4YY69G123GDV, mint: 1531533600000, maxt: 1531540800000, range: 2h0m0s>"

This looks like it was introduced with prometheus/tsdb#347, which was pulled into Prometheus 2.3.2.

If I were to guess: things crashed mid-merging?

More data about these chunks:

core@ip-10-0-0-105 /mnt $ ls -ldtr 01CJH2QN6AM789GKHEMYJBFNVS  01CJH1KXT0B5HRJQFBAH3BM7SJ  01CJH1GB8RE458KJYGX7HS09DY 01CJH1JP2BJHGM4YY69G123GDV                                                                   
drwxrwsr-x. 3 nobody 2000 4096 Jul 16 07:58 01CJH1GB8RE458KJYGX7HS09DY
drwxrwsr-x. 3 nobody 2000 4096 Jul 16 07:58 01CJH1JP2BJHGM4YY69G123GDV
drwxrwsr-x. 3 nobody 2000 4096 Jul 16 07:59 01CJH1KXT0B5HRJQFBAH3BM7SJ
drwxrwsr-x. 3 nobody 2000 4096 Jul 16 08:19 01CJH2QN6AM789GKHEMYJBFNVS
core@ip-10-0-0-105 /mnt $ cat {01CJH2QN6AM789GKHEMYJBFNVS,01CJH1KXT0B5HRJQFBAH3BM7SJ,01CJH1GB8RE458KJYGX7HS09DY,01CJH1JP2BJHGM4YY69G123GDV}/meta.json                                                              
{
	"ulid": "01CJH2QN6AM789GKHEMYJBFNVS",
	"minTime": 1531526400000,
	"maxTime": 1531548000000,
	"stats": {
		"numSamples": 79359608,
		"numSeries": 738422,
		"numChunks": 738422
	},
	"compaction": {
		"level": 2,
		"sources": [
			"01CJH1GB8RE458KJYGX7HS09DY",
			"01CJH1JP2BJHGM4YY69G123GDV",
			"01CJH1KXT0B5HRJQFBAH3BM7SJ"
		]
	},
	"version": 1
}
{
	"ulid": "01CJH1KXT0B5HRJQFBAH3BM7SJ",
	"minTime": 1531540800000,
	"maxTime": 1531548000000,
	"stats": {},
	"compaction": {
		"level": 1,
		"sources": [
			"01CJH1KXT0B5HRJQFBAH3BM7SJ"
		]
	},
	"version": 1
}
{
	"ulid": "01CJH1GB8RE458KJYGX7HS09DY",
	"minTime": 1531526400000,
	"maxTime": 1531533600000,
	"stats": {
		"numSamples": 79359608,
		"numSeries": 738422,
		"numChunks": 738422
	},
	"compaction": {
		"level": 1,
		"sources": [
			"01CJH1GB8RE458KJYGX7HS09DY"
		]
	},
	"version": 1
}
{
	"ulid": "01CJH1JP2BJHGM4YY69G123GDV",
	"minTime": 1531533600000,
	"maxTime": 1531540800000,
	"stats": {},
	"compaction": {
		"level": 1,
		"sources": [
			"01CJH1JP2BJHGM4YY69G123GDV"
		]
	},
	"version": 1
}

Environment

Prometheus 2.3.2 running on 2.3.1 data in Kubernetes 1.5 on AWS (data directory is on EBS).

Disk was full, so data loss is expected.

  • Logs:
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.856464129Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.2, branch=HEAD, revision=71af5e29e815795e9dd14742ee7725682fa14b7b)"
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.856551324Z caller=main.go:223 build_context="(go=go1.10.3, user=root@5258e0bd9cc1, date=20180712-14:02:52)"
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.856578956Z caller=main.go:224 host_details="(Linux 4.7.3-coreos-r3 #1 SMP Thu Feb 23 02:16:16 UTC 2017 x86_64 prometheus-1049297295-tvzr2 (none))"
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.856672893Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.861356668Z caller=main.go:533 msg="Starting TSDB ..."
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.861462971Z caller=web.go:415 component=web msg="Start listening for connections" address=0.0.0.0:9090
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.864592633Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529928000000 maxt=1530122400000 ulid=01CH1H0CBCD9Z84N9CQKR7M2GD
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.866239981Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530122400000 maxt=1530316800000 ulid=01CH7A8Z96AQV0JNR05C650SAK
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.867005889Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530316800000 maxt=1530511200000 ulid=01CHD3NCS2AK99D4W86EQPFF6P
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.867590448Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530511200000 maxt=1530705600000 ulid=01CHJX26ZVHA5685RY3X42W5GN
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.868173154Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530705600000 maxt=1530900000000 ulid=01CHRPESVQRZ914EKGRJ2BC8QY
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.86878508Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530900000000 maxt=1531094400000 ulid=01CHYFVKZKMVCYY1M0BQB811GN
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.869489208Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531094400000 maxt=1531288800000 ulid=01CJ4981MBD1FY705BS8R13DFB
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.870222048Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531288800000 maxt=1531353600000 ulid=01CJ6701NMR4HT0RK2KWHFQKE2
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.870781844Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531353600000 maxt=1531418400000 ulid=01CJ84V94T77CFBG9T4VMW2WE7
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.871460001Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531418400000 maxt=1531483200000 ulid=01CJA2MK1VEPTK536B2CVGTSKP
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.874716788Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531483200000 maxt=1531504800000 ulid=01CJAQ5RCX27PR4V0F648Z3788
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.875331734Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531526400000 maxt=1531533600000 ulid=01CJH1GB8RE458KJYGX7HS09DY
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.875969965Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531533600000 maxt=1531540800000 ulid=01CJH1JP2BJHGM4YY69G123GDV
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.87759591Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531540800000 maxt=1531548000000 ulid=01CJH1KXT0B5HRJQFBAH3BM7SJ
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.878068132Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531548000000 maxt=1531555200000 ulid=01CJH1N4FSNT558A8AR35DW0QR
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.878573836Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531555200000 maxt=1531562400000 ulid=01CJH1PHGFZDZEVSEVVNGW9E4N
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.879252678Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531562400000 maxt=1531569600000 ulid=01CJH1R438CSAF387YT7NZ4C7X
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.879706675Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531569600000 maxt=1531576800000 ulid=01CJH1SGAW2NHWE7B93A53ZVCJ
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.880210755Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531576800000 maxt=1531584000000 ulid=01CJH1V1QS3EBWABPER0CHXJZQ
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.880721565Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531584000000 maxt=1531591200000 ulid=01CJH1WD6W1EBSDPRFQ258NC64
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.881204346Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531591200000 maxt=1531598400000 ulid=01CJH1XRW1E7GHPB1XWZWB67CJ
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.881653736Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531598400000 maxt=1531605600000 ulid=01CJH1Z3M6958RN64E5QGFMB1M
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.882082157Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531605600000 maxt=1531612800000 ulid=01CJH20C201MCCE9686SV1GWRR
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.882517979Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531612800000 maxt=1531620000000 ulid=01CJH21NR4645A9P4RJ9S07M0T
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.882951551Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531620000000 maxt=1531627200000 ulid=01CJH230MHS1JBE341EDMB6HCD
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.883426565Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531627200000 maxt=1531634400000 ulid=01CJH24D57JW195DF3NY6XHSZN
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.883925942Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531634400000 maxt=1531641600000 ulid=01CJH25QHRCH9K80P1ZFKY1JQG
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.884371249Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531641600000 maxt=1531648800000 ulid=01CJH26YX4Y9DVYF31AWHV1KP8
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.884893077Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531648800000 maxt=1531656000000 ulid=01CJH28B15MVFCMBV183BDKP5K
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.885329824Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531656000000 maxt=1531663200000 ulid=01CJH29KEXD5KAB27WNM3CG7TA
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.8857801Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531663200000 maxt=1531670400000 ulid=01CJH2B4S5DWAWQZMVHVZ4MRVY
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.886245695Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531670400000 maxt=1531677600000 ulid=01CJH2CH7ECR5M90W37N1B4XC2
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.886704687Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531677600000 maxt=1531684800000 ulid=01CJH2DX5460BP99SACT0HGNWS
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.887140542Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531684800000 maxt=1531692000000 ulid=01CJH2FANY1XD8G39E23V73VVW
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.887568906Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531692000000 maxt=1531699200000 ulid=01CJH2GQCY4QX6JYTBJ8Q390KR
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.888013924Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531699200000 maxt=1531706400000 ulid=01CJH2J2J59HYRRP9ME4EK8G24
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.888521165Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531706400000 maxt=1531713600000 ulid=01CJH2KD4TPMDSTX1H8AZ71BEX
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.889141955Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531713600000 maxt=1531720800000 ulid=01CJH2MPTWHR7R44D7KCZ35HRH
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.889677064Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531504800000 maxt=1531526400000 ulid=01CJH2P1AQM72CX777YE0ZQ37A
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:44.890189275Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1531526400000 maxt=1531548000000 ulid=01CJH2QN6AM789GKHEMYJBFNVS
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886274756Z caller=main.go:402 msg="Stopping scrape discovery manager..."
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.88633018Z caller=main.go:416 msg="Stopping notify discovery manager..."
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886345428Z caller=main.go:438 msg="Stopping scrape manager..."
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886360225Z caller=main.go:412 msg="Notify discovery manager stopped"
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886405787Z caller=main.go:398 msg="Scrape discovery manager stopped"
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886435794Z caller=main.go:432 msg="Scrape manager stopped"
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886459055Z caller=manager.go:464 component="rule manager" msg="Stopping rule manager..."
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886489642Z caller=manager.go:470 component="rule manager" msg="Rule manager stopped"
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886507479Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
prometheus-1049297295-tvzr2 prometheus level=info ts=2018-07-16T11:11:47.886528797Z caller=main.go:587 msg="Notifier manager stopped"
prometheus-1049297295-tvzr2 prometheus level=error ts=2018-07-16T11:11:47.886999926Z caller=main.go:596 err="Opening storage failed invalid block sequence: block time ranges overlap: [mint: 1531540800000, maxt: 1531548000000, range: 2h0m0s, blocks: 2]: <ulid: 01CJH2QN6AM789GKHEMYJBFNVS, mint: 1531526400000, maxt: 1531548000000, range: 6h0m0s>, <ulid: 01CJH1KXT0B5HRJQFBAH3BM7SJ, mint: 1531540800000, maxt: 1531548000000, range: 2h0m0s>\n[mint: 1531526400000, maxt: 1531533600000, range: 2h0m0s, blocks: 2]: <ulid: 01CJH1GB8RE458KJYGX7HS09DY, mint: 1531526400000, maxt: 1531533600000, range: 2h0m0s>, <ulid: 01CJH2QN6AM789GKHEMYJBFNVS, mint: 1531526400000, maxt: 1531548000000, range: 6h0m0s>\n[mint: 1531533600000, maxt: 1531540800000, range: 2h0m0s, blocks: 2]: <ulid: 01CJH2QN6AM789GKHEMYJBFNVS, mint: 1531526400000, maxt: 1531548000000, range: 6h0m0s>, <ulid: 01CJH1JP2BJHGM4YY69G123GDV, mint: 1531533600000, maxt: 1531540800000, range: 2h0m0s>"
@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Jul 16, 2018

Can you clarify if the blocks it's complaining about were generated by 2.3.1 or 2.3.2?

@ankon

This comment has been minimized.

Copy link
Author

ankon commented Jul 16, 2018

The blocks it is complaining about should have been generated by 2.3.1. I only had 2.3.2 attempt to start up once, which failed in the described manner. Looking at the timestamps (ls output is UTC, logs are in GMT+2, sorry for that): The files are older than the log, so I'm fairly sure here.


I've now deleted the 2h files and only kept the 6h file. This allowed Prometheus to start up.

@brian-brazil

This comment has been minimized.

Copy link
Member

brian-brazil commented Jul 16, 2018

I believe this was the issue fixed by 2.3.2, so this is as expected. 2.3.2 can no longer produce such blocks.

@ankon

This comment has been minimized.

Copy link
Author

ankon commented Jul 16, 2018

Fair enough :)

Would it make sense to consider this is worthy of a "release note", something like: "If you see Prometheus failing to start up with [...], then you should manually clean out the blocks such that they don't overlap. Take into account the meta.json information, and use the aggregated blocks if possible"?

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Jul 16, 2018

Fabian added some code in 2.3.2 that would handle crash loops, but if the crash loop happened in 2.3.1 than the blocks wouldn't have the metadata needed to recover from this state.

For such cases the only option would be to use the tsdb scan tool
to remove the broken blocks generated by a version prior to 2.3.2.

@ntindall

This comment has been minimized.

Copy link

ntindall commented Aug 9, 2018

@krasi-georgiev can you point me to the source code / documentation for the tsdb scan tool?

I don't think it is this, is it? https://github.com/prometheus/tsdb/blob/master/cmd/tsdb/main.go

@krasi-georgiev

This comment has been minimized.

Copy link
Member

krasi-georgiev commented Aug 11, 2018

prometheus/tsdb#320

it is still WIP

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.