tsdb Panic when reading chunk #4128

Closed
karlhungus opened this Issue May 1, 2018 · 17 comments

karlhungus commented May 1, 2018

Bug Report

What did you do?
Ran prometheus

What did you expect to see?
Prometheus to run.

What did you see instead? Under which circumstances?

Panic on startup: runtime error: index out of range (looks similar to prometheus/tsdb#251)

Environment

  • System information:
/prometheus $ uname -srm
Linux 4.15.3-1.el7.elrepo.x86_64 x86_64
  • Prometheus version:
/prometheus $ prometheus --version
prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10
  • Logs:
level=info ts=2018-05-01T12:35:26.455261083Z caller=main.go:220 msg="Starting Prometheus" version="(version=2.2.1, branch=HEAD, revision=bc6058c81272a8d938c05e75607371284236aadc)"
level=info ts=2018-05-01T12:35:26.45538143Z caller=main.go:221 build_context="(go=go1.10, user=root@149e5b3f0829, date=20180314-14:15:45)"
level=info ts=2018-05-01T12:35:26.455414873Z caller=main.go:222 host_details="(Linux 4.15.3-1.el7.elrepo.x86_64 #1 SMP Mon Feb 12 06:46:25 EST 2018 x86_64 prometheus-59bf4d7cf-sxsk5 (none))"
level=info ts=2018-05-01T12:35:26.455427991Z caller=main.go:223 fd_limits="(soft=65536, hard=65536)"
level=info ts=2018-05-01T12:35:26.459927519Z caller=web.go:382 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-05-01T12:35:26.45947935Z caller=main.go:504 msg="Starting TSDB ..."
level=error ts=2018-05-01T12:36:48.215666256Z caller=wal.go:275 component=tsdb msg="WAL corruption detected; truncating" err="unexpected CRC32 checksum 6cafd16a, want 0" file=/data/prometheus/wal/033372 pos=75965705
level=info ts=2018-05-01T12:36:48.950404399Z caller=main.go:514 msg="TSDB started"
level=info ts=2018-05-01T12:36:48.950585155Z caller=main.go:588 msg="Loading configuration file" filename=/etc/prometheus/config.yaml
level=info ts=2018-05-01T12:36:48.96267101Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-01T12:36:48.965331005Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-01T12:36:48.966139809Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-01T12:36:48.966845893Z caller=kubernetes.go:191 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2018-05-01T12:36:48.96989835Z caller=main.go:491 msg="Server is ready to receive web requests."
level=info ts=2018-05-01T12:37:48.996924066Z caller=compact.go:393 component=tsdb msg="compact blocks" count=4 mint=1524960000000 maxt=1524981600000
panic: runtime error: index out of range

goroutine 403 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.(*Reader).Chunk(0xc75cb30c40, 0x13e, 0x1c56c80, 0xc5b252e0f0, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks/chunks.go:365 +0x3f8
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionSeriesSet).Next(0xc75bbc0dc0, 0xc763d9c770)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:691 +0x223
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.newCompactionMerger(0x1c47a80, 0xc75bbc0dc0, 0x1c47a80, 0xc75bbc0e60, 0xc7687230a0, 0x1c51880, 0xc7687230a0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:733 +0x8f
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc420d93220, 0xc75cb30e80, 0x4, 0x4, 0xc763d9c540, 0x1c56cc0, 0xc768766000, 0x1c3e680, 0xc7671fe640, 0x0, ...)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:537 +0x5bf
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).write(0xc420d93220, 0x7ffc588f58e3, 0x10, 0xc763d9c540, 0xc75cb30e80, 0x4, 0x4, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:441 +0x5dd
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).Compact(0xc420d93220, 0x7ffc588f58e3, 0x10, 0xc75cb30b40, 0x4, 0x4, 0x3023447db51b6301, 0x6431175cbef51e81, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:339 +0x693
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).compact(0xc420c7a0c0, 0x0, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:419 +0x400
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).run(0xc420c7a0c0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:279 +0x2cc
created by github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.Open
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:243 +0x5ba

karlhungus (Author) commented May 1, 2018

Apologies if this is in the wrong repo. Any advice on recovering from this error without removing all tsdb files would be appreciated.

krasi-georgiev (Member) commented May 1, 2018

I had a quick look and it seems there is some corruption in the chunk, so it fails to read the encoding.

Is this easy to reproduce?

karlhungus (Author) commented May 1, 2018

Is this easy to reproduce?

No, it happened in a 1 TB TSDB (which I eventually deleted), and it could really be the result of any failure (disk or otherwise). So I don't think the problem is that things get corrupt; the problem is that tsdb can't work around it (or I can't tell it to work around it). I'd much prefer to lose a bit of data than have to start fresh with everything.

I'm cool with closing this if this is too broad to be a bug.

krasi-georgiev (Member) commented May 1, 2018

In most cases I have seen, only one block is causing the issue, so if you delete that block's directory you lose its date range, but Prometheus should still read the remaining blocks.
By looking at the logs and checking the meta.json file in each block folder, you should be able to tell which directory is the offending one.

@gouthamve suggested that we should probably add this as a cleanup command to the tsdb tool, so if I find time I will try to implement it.

Closing for now, but feel free to reopen if you can provide more info on how to reproduce.
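
For reference, a minimal Go sketch of the block-identification step described above: it walks the data directory and prints each block's meta.json time range, which can then be matched against the mint/maxt values of the failing compaction in the logs. The program and its blockMeta struct are illustrative and hypothetical, not part of Prometheus or the tsdb tool; the ulid/minTime/maxTime field names follow the meta.json format. It needs a reasonably recent Go toolchain (os.ReadFile, time.UnixMilli).

// find_block.go: hypothetical helper, not part of Prometheus or the tsdb tool.
// Prints each block directory together with the time range recorded in its
// meta.json, so the range can be compared with the mint/maxt of the failing
// compaction reported in the logs.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// blockMeta mirrors only the meta.json fields needed here.
type blockMeta struct {
	ULID    string `json:"ulid"`
	MinTime int64  `json:"minTime"` // milliseconds since epoch
	MaxTime int64  `json:"maxTime"`
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: find_block <data-dir>")
		os.Exit(1)
	}
	dataDir := os.Args[1] // e.g. /data/prometheus
	metas, err := filepath.Glob(filepath.Join(dataDir, "*", "meta.json"))
	if err != nil {
		panic(err)
	}
	for _, path := range metas {
		raw, err := os.ReadFile(path)
		if err != nil {
			fmt.Printf("%s: %v\n", path, err)
			continue
		}
		var m blockMeta
		if err := json.Unmarshal(raw, &m); err != nil {
			fmt.Printf("%s: %v\n", path, err)
			continue
		}
		fmt.Printf("%s  %s  mint=%d (%s)  maxt=%d (%s)\n",
			filepath.Dir(path), m.ULID,
			m.MinTime, time.UnixMilli(m.MinTime).UTC().Format(time.RFC3339),
			m.MaxTime, time.UnixMilli(m.MaxTime).UTC().Format(time.RFC3339))
	}
}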

sr commented May 4, 2018

@krasi-georgiev In the future, how can we collect more useful debugging data to help diagnose corruption issues? Unfortunately our tsdb contains private information and we are not able to share it, unless there's some way we can anonymize labels.

krasi-georgiev (Member) commented May 4, 2018

That is a good idea, I will think about what we can expose and how to make it useful.

By the way, I already opened the PR for scanning and deleting corrupted data and am just waiting for some input from the other maintainers before I start:

prometheus/tsdb#320

gouthamve (Member) commented May 7, 2018

One thing that surprised me is how the checksums failed to catch this. I think something more than a simple corruption is going on here. And if it is indeed a corruption, I would like the querier to catch it and return an error rather than panic.

Re-opening to track; this will be closed when fixed.

gouthamve reopened this May 7, 2018
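
A minimal sketch of the failure mode under discussion, assuming a simplified chunk layout of varint length, one encoding byte, chunk data, then a trailing CRC32 (an illustration only, not the actual chunks.(*Reader).Chunk code): when the segment file is truncated, the reader falls off the end of the byte slice while still locating the chunk, so the per-chunk checksum is never reached.

// Illustration only -- not the tsdb implementation. A chunk reference points
// at an offset inside a segment file; the reader expects <varint length>
// <encoding byte> <data> <crc32> at that offset. If the file was truncated by
// a crash or disk fault, the reads below fall past the end of the byte slice
// and Go panics ("index out of range" or "slice bounds out of range",
// depending on where the file ends) before the trailing CRC32 is reached.
package main

import (
	"encoding/binary"
	"fmt"
)

func readChunk(segment []byte, offset int) []byte {
	length, n := binary.Uvarint(segment[offset:])
	enc := segment[offset+n] // panics here if the segment ends right after the length
	fmt.Println("chunk encoding:", enc, "declared length:", length)
	return segment[offset+n+1 : offset+n+1+int(length)] // or here, if the data is cut short
}

func main() {
	// A segment that declares 100 bytes of chunk data but was cut short.
	truncated := []byte{100}
	readChunk(truncated, 0)
}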

bonan commented Jun 7, 2018

I'm seeing the same issue in both 2.2.1 and 2.3.0:

prometheus, version 2.2.1 (branch: HEAD, revision: bc6058c81272a8d938c05e75607371284236aadc)
  build user:       root@149e5b3f0829
  build date:       20180314-14:15:45
  go version:       go1.10
-- Logs begin at Thu 2018-06-07 13:44:14 UTC, end at Thu 2018-06-07 16:29:07 UTC. --
level=info ts=2018-06-07T15:00:00.485343904Z caller=compact.go:393 component=tsdb msg="compact blocks" count=1 mint=1528372800000 maxt=1528380000000
level=info ts=2018-06-07T15:00:01.027407263Z caller=head.go:348 component=tsdb msg="head GC completed" duration=16.657809ms
level=info ts=2018-06-07T15:00:01.082731112Z caller=head.go:357 component=tsdb msg="WAL truncation completed" duration=55.186079ms
level=info ts=2018-06-07T15:00:01.141441745Z caller=compact.go:393 component=tsdb msg="compact blocks" count=3 mint=1528351200000 maxt=1528372800000
level=info ts=2018-06-07T15:00:01.926333494Z caller=compact.go:393 component=tsdb msg="compact blocks" count=3 mint=1528308000000 maxt=1528372800000
level=info ts=2018-06-07T15:00:03.57978523Z caller=compact.go:393 component=tsdb msg="compact blocks" count=3 mint=1528178400000 maxt=1528372800000
panic: runtime error: index out of range
goroutine 277 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.(*Reader).Chunk(0xc423c76080, 0x2113006, 0x1c56c80, 0xc422088600, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks/chunks.go:365 +0x3f8
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionSeriesSet).Next(0xc424398000, 0xc421e2fb70)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:691 +0x223
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionMerger).Next(0xc4202d7a40, 0xc426c2a000)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:785 +0x46b
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionMerger).Next(0xc4202d7ab0, 0x15b5)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:773 +0x7f7
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc42015f540, 0xc423c76200, 0x3, 0x4, 0xc4202d77a0, 0x1c56cc0, 0xc426290000, 0x1c3e680, 0xc421d06a00, 0x0, ...)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:554 +0xb62
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).write(0xc42015f540, 0x7ffc87dd0f27, 0xb, 0xc4202d77a0, 0xc423c76200, 0x3, 0x4, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:441 +0x5dd
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).Compact(0xc42015f540, 0x7ffc87dd0f27, 0xb, 0xc423c76040, 0x3, 0x4, 0xdc9a7b03c3da6301, 0x6fe58eff5c2bd7e6, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:339 +0x693
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).compact(0xc4200ac180, 0x1, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:419 +0x400
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).run(0xc4200ac180)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:279 +0x2cc
created by github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.Open
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:243 +0x5ba

After updating to 2.3.0 (and running with debug log):

prometheus, version 2.3.0 (branch: HEAD, revision: 290d71791a507a5057b9a099c9d48703d86dc941)
  build user:       root@d539e167976a
  build date:       20180607-08:46:54
  go version:       go1.10.2
level=info ts=2018-06-07T16:55:59.948403059Z caller=main.go:222 msg="Starting Prometheus" version="(version=2.3.0, branch=HEAD, revision=290d71791a507a5057b9a099c9d48703d86dc941)"
level=info ts=2018-06-07T16:55:59.948573817Z caller=main.go:223 build_context="(go=go1.10.2, user=root@d539e167976a, date=20180607-08:46:54)"
level=info ts=2018-06-07T16:55:59.948643933Z caller=main.go:224 host_details="(Linux 4.13.0-37-generic #42~16.04.1-Ubuntu SMP Wed Mar 7 16:03:28 UTC 2018 x86_64 af3f50766d0d (none))"
level=info ts=2018-06-07T16:55:59.948716437Z caller=main.go:225 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-06-07T16:55:59.954446942Z caller=web.go:426 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-06-07T16:55:59.954461917Z caller=main.go:514 msg="Starting TSDB ..."
level=info ts=2018-06-07T16:55:59.956396679Z caller=web.go:478 component=web msg="router prefix" prefix=/prometheus
level=info ts=2018-06-07T16:55:59.956548849Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1510963200000 maxt=1511654400000 ulid=01BZV48VWFBHKZAMK620J772QQ
level=info ts=2018-06-07T16:55:59.95975985Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1511654400000 maxt=1516903200000 ulid=01C4QB6SE6B85VWNVNTSAHKWMN
level=info ts=2018-06-07T16:55:59.961946787Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1516903200000 maxt=1522152000000 ulid=01C9KZKY0XQMYPWH6R80JDZK7T
level=info ts=2018-06-07T16:55:59.964069945Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1522152000000 maxt=1527400800000 ulid=01CEGD8QHTF9K8F5825018E4VW
level=info ts=2018-06-07T16:55:59.964626111Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527400800000 maxt=1527984000000 ulid=01CF1SCY67E74X3016N9JS3HZS
level=info ts=2018-06-07T16:55:59.964958328Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527984000000 maxt=1528178400000 ulid=01CF7JSDMBXQCGYYBBP8TNMJTP
level=info ts=2018-06-07T16:55:59.96525028Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528178400000 maxt=1528243200000 ulid=01CF9GJXCH259TBHSHA8VQ2C5X
level=info ts=2018-06-07T16:55:59.965534207Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528243200000 maxt=1528308000000 ulid=01CFBECE07AQWSVF1F9GBP2ZMM
level=info ts=2018-06-07T16:55:59.966122625Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528372800000 maxt=1528380000000 ulid=01CFDC5XV51TZRNF3YWG8HF5FH
level=info ts=2018-06-07T16:55:59.966423978Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528308000000 maxt=1528372800000 ulid=01CFDC5Z869GY9JBK061DBM4B2
level=info ts=2018-06-07T16:56:04.288798178Z caller=main.go:524 msg="TSDB started"
level=info ts=2018-06-07T16:56:04.289014727Z caller=main.go:603 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2018-06-07T16:56:04.296028834Z caller=main.go:500 msg="Server is ready to receive web requests."
level=debug ts=2018-06-07T16:56:10.807616966Z caller=scrape.go:703 component="scrape manager" scrape_pool=network target=http://172.17.0.1:9130/metrics msg="Scrape failed" err="server returned HTTP status 500 Internal Server Error"
panic: runtime error: index out of range

goroutine 45 [running]:
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks.(*Reader).Chunk(0xc4262acf00, 0x2113006, 0x1df2e80, 0xc424da4960, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/chunks/chunks.go:365 +0x3f8
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionSeriesSet).Next(0xc421304be0, 0xc42051a870)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:710 +0x223
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionMerger).Next(0xc42030e000, 0xc423efa000)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:799 +0x46b
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*compactionMerger).Next(0xc42030e070, 0x15b5)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:787 +0x7f7
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).populateBlock(0xc420509c70, 0xc4262ad0c0, 0x3, 0x4, 0xc42033fea0, 0x1df2ec0, 0xc42563f7c0, 0x1dd9a40, 0xc42017b180, 0x0, ...)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:573 +0xb62
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).write(0xc420509c70, 0x7ffe26298f0a, 0xb, 0xc42033fea0, 0xc4262ad0c0, 0x3, 0x4, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:459 +0x3a9
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*LeveledCompactor).Compact(0xc420509c70, 0x7ffe26298f0a, 0xb, 0xc4262acec0, 0x3, 0x4, 0xfd803b242edb6301, 0x540af2e51e8907e8, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/compact.go:345 +0x7e9
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).compact(0xc4202ac3c0, 0x0, 0x0, 0x0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:415 +0x400
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*DB).run(0xc4202ac3c0)
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:275 +0x2cc
created by github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.Open
        /go/src/github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb/db.go:239 +0x593

krasi-georgiev (Member) commented Jun 7, 2018

@bonan thanks for the report.
What are the steps to reproduce this?

simonpasquier (Member) commented Jun 8, 2018

@krasi-georgiev see prometheus/tsdb#344. As I said in the PR, I'm not sure how the issue can be triggered except if the chunk is corrupted?

bonan commented Jun 8, 2018

A reboot and a forced fsck seem to have solved the problem for me, so I can't reproduce it any more.

fabxc (Member) commented Jun 27, 2018

@simonpasquier is this fixed with prometheus/tsdb#344?

krasi-georgiev (Member) commented Jun 27, 2018

One thing that surprised me is how the checksums failed to catch this. I think something more than a simple corruption is going on here. And if it is indeed a corruption, I would like the querier to catch it, error rather than panic.

@gouthamve can you point out where you think this should be caught?

fabxc (Member) commented Jun 28, 2018

I don't think there's an issue. This is indeed possibly some corruption. But if the corruption causes the file to be too short, we don't even get to a stage where we could validate a checksum. This was simply a bug where we didn't handle an error condition properly.
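
The shape of that kind of fix, sketched below under the same simplified layout as the earlier illustration (this is not the actual prometheus/tsdb#344 change): bounds-check the segment against the offset and the declared length, and return an error the compactor or querier can surface instead of panicking.

// Sketch of the defensive variant, using the same simplified layout as the
// earlier illustration; not the actual prometheus/tsdb#344 change.
package main

import (
	"encoding/binary"
	"fmt"
)

func readChunkSafe(segment []byte, offset int) ([]byte, error) {
	if offset >= len(segment) {
		return nil, fmt.Errorf("chunk offset %d beyond segment size %d", offset, len(segment))
	}
	length, n := binary.Uvarint(segment[offset:])
	if n <= 0 {
		return nil, fmt.Errorf("reading chunk length at offset %d failed", offset)
	}
	end := offset + n + 1 + int(length) // encoding byte plus data; CRC32 would follow
	if end > len(segment) {
		return nil, fmt.Errorf("segment truncated: chunk needs %d bytes, segment has %d", end, len(segment))
	}
	return segment[offset+n+1 : end], nil
}

func main() {
	truncated := []byte{100}
	if _, err := readChunkSafe(truncated, 0); err != nil {
		fmt.Println("error instead of panic:", err)
	}
}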

krasi-georgiev (Member) commented Jun 28, 2018

Yes, this was my understanding as well; I just didn't know if Goutham had anything else in mind.

I will update tsdb and close this when the fix is merged.

krasi-georgiev (Member) commented Jun 28, 2018

Let's consider this fixed; it will be included in the next release.
Please reopen if you find that this doesn't fix the bug.
Thanks again for the report!

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019
