
Migrated to 2.4.0 - Crash during startup corrupted WAL "opening storage failed: read WAL: repair corrupted WAL: cannot handle error" #4603

Closed
Eric-Fontana-Bose opened this issue Sep 13, 2018 · 36 comments

Eric-Fontana-Bose commented Sep 13, 2018

Crashing on startup after moving to 2.4.0

Bug Report

What did you do?

  • Prometheus version:

    2.4.0

  • Prometheus configuration file:

 args:
      - '--config.file=/config/prometheus.yml'
      - '--storage.tsdb.path=/data'
      - '--storage.tsdb.retention=90d'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
      - '--web.route-prefix=/'
      - '--web.external-url=https://ingress-platform.live-aws-useast1.bose.io/dev/core-operations/%(environment)s/prometheus2/'
      - '--web.enable-admin-api'
      - '--web.enable-lifecycle'
      - '--storage.tsdb.no-lockfile'

  • Logs:
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444128311Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.0, branch=HEAD, revision=068eaa5dbfce6c08f3d05d3d3c0bfd96267cfed2)"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444208601Z caller=main.go:239 build_context="(go=go1.10.3, user=root@d84c15ea5e93, date=20180911-10:46:37)"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444232038Z caller=main.go:240 host_details="(Linux 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 prometheus2-78df54755f-sxhzd (none))"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444252099Z caller=main.go:241 fd_limits="(soft=65536, hard=65536)"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.444290623Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.445174596Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.445163061Z caller=main.go:554 msg="Starting TSDB ..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.445737537Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1524139200000 maxt=1525651200000 ulid=01CCWFHCSNJ6HDYBZP7JYC2YPZ
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446129786Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1525651200000 maxt=1527400800000 ulid=01CEGD9PPRP7CBCMSNM5Q5S46W
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446318403Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527400800000 maxt=1527984000000 ulid=01CF1SE7KXPXWSJ2156J1BMECP
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446489737Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527984000000 maxt=1528567200000 ulid=01CFK5NR55KXHPDEMXJ43VGMJG
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446638706Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528567200000 maxt=1529150400000 ulid=01CG4HW6Y1FF16HR11EC178479
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.446874633Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529150400000 maxt=1530900000000 ulid=01CHRPM0GA88M75MTVNN3HMNRY
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447141029Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530900000000 maxt=1532649600000 ulid=01CKCV9K8G6E6218842MS5HW4N
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447412757Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1532649600000 maxt=1534399200000 ulid=01CN0ZW1A6T0TVS39DZR2K826Y
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447714569Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534399200000 maxt=1536148800000 ulid=01CPN4FN43S4MQ2HDFKRH9QVS3
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447888122Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536148800000 maxt=1536732000000 ulid=01CQ6GPX7TPKZNV0G4820MRDYC
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.447960779Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536753600000 ulid=01CQ74W71A3JATT9K1ETJDXG43
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.448045939Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536775200000 maxt=1536782400000 ulid=01CQ7SAT4XVJDD56G7MC93H05T
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:27:22.448129211Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536753600000 maxt=1536775200000 ulid=01CQ7SF6MJRRF8D02GMD16EPWH
prometheus2-78df54755f-sxhzd prometheus2 level=warn ts=2018-09-13T10:30:41.128060768Z caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 48 at 78834898: unexpected checksum ae7d1bfa, expected 512fafa7"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672805885Z caller=main.go:423 msg="Stopping scrape discovery manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672857035Z caller=main.go:437 msg="Stopping notify discovery manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672868276Z caller=main.go:459 msg="Stopping scrape manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672878612Z caller=main.go:433 msg="Notify discovery manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672908483Z caller=main.go:419 msg="Scrape discovery manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672929891Z caller=main.go:453 msg="Scrape manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.672961275Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.673005956Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.673039451Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
prometheus2-78df54755f-sxhzd prometheus2 level=info ts=2018-09-13T10:30:43.673056203Z caller=main.go:608 msg="Notifier manager stopped"
prometheus2-78df54755f-sxhzd prometheus2 level=error ts=2018-09-13T10:30:43.6731401Z caller=main.go:617 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error"
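For anyone triaging a similar report: in the 2.4 WAL layout, the segment number in the error maps to a zero-padded numeric file name under the WAL directory (compare the `00000024` file mentioned later in this thread). A minimal sketch, with the path assumed from the `--storage.tsdb.path=/data` flag above:

```shell
# "corruption in segment 48" in the log above maps to a zero-padded file name.
# WAL_DIR is an assumption based on --storage.tsdb.path=/data.
WAL_DIR="/data/wal"
SEGMENT=48
SEGMENT_FILE=$(printf "%s/%08d" "$WAL_DIR" "$SEGMENT")
echo "corrupted segment file: $SEGMENT_FILE"
```

The byte offset in the error ("at 78834898") is an offset within that segment file, not within the whole WAL.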

Eric-Fontana-Bose changed the title from "Migrated to 2.4.0 - Crash during startup corrupted WAL" to "Migrated to 2.4.0 - Crash during startup corrupted WAL \"opening storage failed: read WAL: repair corrupted WAL: cannot handle error\"" Sep 13, 2018

codesome (Member) commented Sep 14, 2018

What was the version before migrating to 2.4.0?

krasi-georgiev (Member) commented Sep 14, 2018

@Eric-Fontana-Bose thanks for the report.

Can you send me the corrupted WAL privately to kgeorgi at redhat.com, so I can reproduce it locally and find the culprit?

Eric-Fontana-Bose (Author) commented Sep 14, 2018

Is this the contents of the /data dir? Where is it?

krasi-georgiev (Member) commented Sep 14, 2018

Yes, in /data/wal.

krasi-georgiev (Member) commented Sep 17, 2018

Did you find it? It should be some files named in a format like 0000001, 0000002 (the exact numbering is not important).

lostick commented Sep 18, 2018

@krasi-georgiev Seeing a similar issue since we bumped from 2.3.2 to 2.4.0:

level=warn ts=2018-09-18T09:50:31.815202492Z caller=scrape.go:819 component="scrape manager" scrape_pool=datacenter-prometheus-xxx target="https://prometheus-xxxx:443/federate?match%5B%5D=%7Bjob%3D~%22.%2B%22%7D" msg="appending scrape report failed" err="write to WAL: log samples: write /prometheus/wal/00000024: cannot allocate memory"

I've sent you the corrupted 00000024 WAL segment by email in case you want to have a look.

krasi-georgiev (Member) commented Sep 18, 2018

@lostick I haven't received it. Maybe it is too big to send over email; try uploading it somewhere and sending me an invitation.

By the way, why do you think the issue is the same? The logs indicate that the system doesn't have enough memory.
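For large WALs, one option (paths are assumptions based on the original report's `--storage.tsdb.path=/data`) is to compress the whole segment directory before uploading it:

```shell
# Bundle the WAL directory for sharing; WAL_DIR is an assumption,
# override it to match your --storage.tsdb.path.
WAL_DIR=${WAL_DIR:-/data/wal}
ARCHIVE="/tmp/wal-backup.tar.gz"
if [ -d "$WAL_DIR" ]; then
  tar -czf "$ARCHIVE" -C "$(dirname "$WAL_DIR")" "$(basename "$WAL_DIR")"
  ls -lh "$ARCHIVE"
else
  echo "WAL dir $WAL_DIR not found; set WAL_DIR first"
fi
```

Segment files compress well, which can bring the archive under typical file-sharing limits.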

lostick commented Sep 18, 2018

@krasi-georgiev Yes, I'm unable to send it; Red Hat rejects the file (whether it's attached or sent as a WeTransfer link).

> btw why do you think the issue is the same?
> The logs indicate that the system doesn't have enough memory.

We had no memory issues when previously running 2.3.x, so I'd be more inclined to suspect an issue with the latest WAL implementation.

krasi-georgiev (Member) commented Sep 18, 2018

> @krasi-georgiev yes i'm unable to send it, redhat rejects the file (whether it's attached or as a wetransfer link).

That is weird; we don't have any restrictions AFAIK. You can send it to krasi.root at gmail.com.

Anyway, although the issue is with the WAL, it is different from the problem in this issue. Could you please open a new issue with steps to reproduce, so we don't mix the two and I can look into it?

2.4 had a new WAL implementation, so it is probably using more memory, causing this problem on your system. You can increase the memory and that should solve the problem, but once you open the issue I will also check whether there is anything we can do about it.

What is strange is that I ran a benchmark before adding the new WAL and it actually showed reduced memory usage:
#4471
http://prombench.prometheus.io/grafana/d/7gmLoNDmz/prombench?orgId=1&var-RuleGroup=All&var-pr-number=4471&from=1534048844837&to=1534202667746
So I guess the test was wrong, or something weird is going on.

VR6Pete commented Sep 18, 2018

I am also seeing this issue and as a result have lost one of my servers in my environment.

Is there anything I can do to assist resolution of the issue?

krasi-georgiev (Member) commented Sep 18, 2018

@VR6Pete open an issue with steps to reproduce, and ping me 😜

krasi-georgiev (Member) commented Sep 18, 2018

Or, if you are sure it is the same as the one originally reported here, please add steps to reproduce and send me the WAL.

VR6Pete commented Sep 18, 2018

How do I determine which is the corrupted WAL file? There is nothing in the logs to state which one is at fault.

caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 573 at 64205313: unexpected checksum 5327a8, expected 970642dd"

caller=main.go:617 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error"

Happy to open another issue, or to send you the WAL file directly, as I feel it is related to this issue.

Eric-Fontana-Bose (Author) commented Sep 18, 2018

I'm trying to get the WAL, but the issue is I can't get into the pod long enough to grab it; Prometheus crashes and K8s restarts it.

lostick commented Sep 18, 2018

> 2.4 had a new WAL implementation so probably it is using more memory causing this problem on your system. You can increase the memory and that should solve the problem [...]

Thanks, we have already doubled the memory, with no luck so far, as usage keeps growing.
I'll investigate some more and will open a separate issue if needed 👍

krasi-georgiev (Member) commented Sep 18, 2018

@VR6Pete send me everything from the data/wal folder.

@Eric-Fontana-Bose don't you have the data mounted on a persistent volume? If not, that is what you should do to keep it across restarts.

stefancrain commented Sep 18, 2018

We've also run into a similar issue. This morning we went to restart the Prometheus container and were faced with the same error.

Prometheus version:

  • 2.4.0

Previous version:

  • 2.3.2

Prometheus configuration file:

--web.enable-admin-api
--web.enable-lifecycle
--web.external-url=https://abc.com/prometheus
--config.file=/prom-conf/prometheus.yml
--storage.tsdb.retention=360d
--storage.tsdb.path=/prometheus
--web.console.libraries=/etc/prometheus/console_libraries
--web.console.templates=/etc/prometheus/consoles

Logs:

9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.920216527Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.0, branch=HEAD, revision=068eaa5dbfce6c08f3d05d3d3c0bfd96267cfed2)"
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.920315627Z caller=main.go:239 build_context="(go=go1.10.3, user=root@d84c15ea5e93, date=20180911-10:46:37)"
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.920344127Z caller=main.go:240 host_details="(Linux 3.10.0-862.9.1.el7.x86_64 #1 SMP Mon Jul 16 16:29:36 UTC 2018 x86_64 7086d4aee0ff (none))"
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.920371827Z caller=main.go:241 fd_limits="(soft=65536, hard=65536)"
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.920394428Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.921264833Z caller=main.go:554 msg="Starting TSDB ..."
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.921319433Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.921730336Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1526234400000 maxt=1526817600000 ulid=01CDZ1A6J3KQGQVMVY6FHFJVCN
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.921848336Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1526817600000 maxt=1527012000000 ulid=01CE4TH3XWSC9J0WDY624R22KG
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.921946237Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527012000000 maxt=1527206400000 ulid=01CEARAH5VGSC55SJBC5N7D2A2
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.922054238Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527206400000 maxt=1527400800000 ulid=01CEGDARTZ2G3BF01BY8GE70P0
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.922087038Z caller=web.go:440 component=web msg="router prefix" prefix=/prometheus
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.922143038Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527400800000 maxt=1527465600000 ulid=01CEJB47T7084TQJ2DHK7DKJZS
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.922590641Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527465600000 maxt=1527530400000 ulid=01CEM8W40CFSS22DGXMFP7ZRSK
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.922817642Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527530400000 maxt=1527595200000 ulid=01CEPSD3N69QC4C33P50HW0N98
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.922956243Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527595200000 maxt=1527789600000 ulid=01CEW02HGN9JDERDVWTTD2A30V
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.923071144Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527789600000 maxt=1527984000000 ulid=01CF1SGJZ68T3W4YGAY1ERB1WE
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.923599547Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1527984000000 maxt=1528567200000 ulid=01CFK5RT7C78ASNGZ7K7DK9G78
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.923783748Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1528567200000 maxt=1529150400000 ulid=01CG4HX1739MKSHR8XVY0P2AVG
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.924201851Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1529150400000 maxt=1530900000000 ulid=01CK49Q2VMSEA860D4QKAJQQ0K
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.924471552Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1530900000000 maxt=1532649600000 ulid=01CKCV800AJR23XMMRKNDWEAEQ
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.924744854Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1532649600000 maxt=1534399200000 ulid=01CN0ZTKTVQV346C9T24GE3HQA
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.924995955Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1534399200000 maxt=1536148800000 ulid=01CPN4TTC99T3CB8HD2CB6RQ6K
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.925132556Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536148800000 maxt=1536732000000 ulid=01CQ6GEPYN92JG75D0C2R8TDBP
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.925251057Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536732000000 maxt=1536926400000 ulid=01CQC9K306JS8R98X1GKG2WWP7
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.925415458Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536926400000 maxt=1537120800000 ulid=01CQJ2ZYKFH94YK7908AZ8ZGFQ
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.925686259Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1537120800000 maxt=1537185600000 ulid=01CQM0RPPB7VER4WQ0K6N0PSGC
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.92581216Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1537250400000 maxt=1537257600000 ulid=01CQNYGRY87WPCQP1GEKTR5H4B
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.925901461Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1537185600000 maxt=1537250400000 ulid=01CQNYJ7NWNAXE6QHX6ZWHMBCG
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.925969661Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1537257600000 maxt=1537264800000 ulid=01CQP5CG66RRY9FJ5GGFVBPYX0
9/18/2018 10:32:22 AMlevel=info ts=2018-09-18T14:32:22.926036262Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1537264800000 maxt=1537272000000 ulid=01CQPC87E8ENY32Y8BK2520QH4
9/18/2018 10:33:19 AMlevel=warn ts=2018-09-18T14:33:19.569651101Z caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 129 at 38398874: unexpected checksum b5ce4fe3, expected 20682715"
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717710181Z caller=main.go:423 msg="Stopping scrape discovery manager..."
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717761782Z caller=main.go:437 msg="Stopping notify discovery manager..."
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717778782Z caller=main.go:459 msg="Stopping scrape manager..."
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717793882Z caller=main.go:433 msg="Notify discovery manager stopped"
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717816682Z caller=main.go:419 msg="Scrape discovery manager stopped"
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717833882Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717888082Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717836282Z caller=main.go:453 msg="Scrape manager stopped"
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717919783Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
9/18/2018 10:33:19 AMlevel=info ts=2018-09-18T14:33:19.717947983Z caller=main.go:608 msg="Notifier manager stopped"
9/18/2018 10:33:19 AMlevel=error ts=2018-09-18T14:33:19.718090484Z caller=main.go:617 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error"

Directory:

# ls -lah /prometheus
total 160
drwxrwxr-x   29 172      root        4.0K Sep 18 13:00 .
drwxr-xr-x    1 root     root          45 Sep 18 14:21 ..
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CDZ1A6J3KQGQVMVY6FHFJVCN
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CE4TH3XWSC9J0WDY624R22KG
drwxr-xr-x    3 root     root          33 May 25 04:14 01CEAR3FMZBVAHTAQT9TPPV202.tmp
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CEARAH5VGSC55SJBC5N7D2A2
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CEGDARTZ2G3BF01BY8GE70P0
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CEJB47T7084TQJ2DHK7DKJZS
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CEM8W40CFSS22DGXMFP7ZRSK
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CEPSD3N69QC4C33P50HW0N98
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CEW02HGN9JDERDVWTTD2A30V
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CF1SGJZ68T3W4YGAY1ERB1WE
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CFK5RT7C78ASNGZ7K7DK9G78
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CG4HX1739MKSHR8XVY0P2AVG
drwxr-xr-x    3 root     root          33 Jun 30 03:02 01CH7A92M7FT5V0FREWPVXK2ZZ.tmp
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CK49Q2VMSEA860D4QKAJQQ0K
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CKCV800AJR23XMMRKNDWEAEQ
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CN0ZTKTVQV346C9T24GE3HQA
drwxr-xr-x    3 root     root          33 Aug 23 21:00 01CNM9DMVG4ZPSWMN9V2K6A0HB.tmp
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CPN4TTC99T3CB8HD2CB6RQ6K
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CQ6GEPYN92JG75D0C2R8TDBP
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CQC9K306JS8R98X1GKG2WWP7
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CQJ2ZYKFH94YK7908AZ8ZGFQ
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CQM0RPPB7VER4WQ0K6N0PSGC
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CQNYGRY87WPCQP1GEKTR5H4B
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CQNYJ7NWNAXE6QHX6ZWHMBCG
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CQP5CG66RRY9FJ5GGFVBPYX0
drwxr-xr-x    3 root     root          68 Sep 18 13:28 01CQPC87E8ENY32Y8BK2520QH4
-rwxrwxr-x    1 172      root         272 Apr 26 20:33 index.html
-rw-r--r--    1 root     root           0 Jun  8 19:43 lock
-rwxrwxr-x    1 172      root      138.6K Apr 26 20:32 metrics
-rw-r--r--    1 root     root        4.4K May 15 18:40 status
drwxr-xr-x    3 root     root        4.0K Sep 18 14:00 wal
krasi-georgiev (Member) commented Sep 18, 2018

@stefancrain please send me the WAL files.

Pinging @fabxc, who can probably fix this a lot quicker with his magic ;)

stefancrain commented Sep 18, 2018

@krasi-georgiev Sent. Please reach out if there is anything else I can do to help.

sc250024 commented Sep 18, 2018

@krasi-georgiev Same issue here, using a kube-prometheus-based installation. There weren't any memory problems with the pod. Will send you the /wal directory.

Previous Prometheus version: 2.3.2

level=info ts=2018-09-18T17:42:07.493615331Z caller=main.go:238 msg="Starting Prometheus" version="(version=2.4.0, branch=HEAD, revision=068eaa5dbfce6c08f3d05d3d3c0bfd96267cfed2)"
level=info ts=2018-09-18T17:42:07.493697559Z caller=main.go:239 build_context="(go=go1.10.3, user=root@d84c15ea5e93, date=20180911-10:46:37)"
level=info ts=2018-09-18T17:42:07.493720216Z caller=main.go:240 host_details="(Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u4 (2018-08-21) x86_64 prometheus-kube-prometheus-0 (none))"
level=info ts=2018-09-18T17:42:07.493737558Z caller=main.go:241 fd_limits="(soft=1048576, hard=1048576)"
level=info ts=2018-09-18T17:42:07.493837443Z caller=main.go:242 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2018-09-18T17:42:07.494483966Z caller=main.go:554 msg="Starting TSDB ..."
level=info ts=2018-09-18T17:42:07.494591411Z caller=web.go:397 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2018-09-18T17:42:07.494970164Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536912000000 maxt=1536926400000 ulid=01CQCGD593YEGMW0DJEBQR14ZR
level=info ts=2018-09-18T17:42:07.495115091Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536926400000 maxt=1536991200000 ulid=01CQE7B3HZZ7CE4MMT8RHHK25M
level=info ts=2018-09-18T17:42:07.495276547Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1536991200000 maxt=1537012800000 ulid=01CQEVY34FSJGKZ0MTG6N1GN02
level=info ts=2018-09-18T17:42:07.495377333Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1537034400000 maxt=1537041600000 ulid=01CQFGGZ51K72F1J90NVPRV73W
level=info ts=2018-09-18T17:42:07.495464753Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1537012800000 maxt=1537034400000 ulid=01CQFGH72V2ATA6Q458G87BJNW
level=info ts=2018-09-18T17:42:07.495591005Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1537041600000 maxt=1537048800000 ulid=01CQFQCPD9QT6T09F8N8KH3QGN
level=warn ts=2018-09-18T17:42:27.446517864Z caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 79 at 86179260: unexpected checksum fa1b22e5, expected 9df7f0d2"
level=info ts=2018-09-18T17:42:27.506001854Z caller=main.go:423 msg="Stopping scrape discovery manager..."
level=info ts=2018-09-18T17:42:27.50605812Z caller=main.go:437 msg="Stopping notify discovery manager..."
level=info ts=2018-09-18T17:42:27.506068979Z caller=main.go:459 msg="Stopping scrape manager..."
level=info ts=2018-09-18T17:42:27.506079338Z caller=main.go:433 msg="Notify discovery manager stopped"
level=info ts=2018-09-18T17:42:27.506164661Z caller=main.go:419 msg="Scrape discovery manager stopped"
level=info ts=2018-09-18T17:42:27.506179016Z caller=main.go:453 msg="Scrape manager stopped"
level=info ts=2018-09-18T17:42:27.50625795Z caller=manager.go:638 component="rule manager" msg="Stopping rule manager..."
level=info ts=2018-09-18T17:42:27.506311855Z caller=manager.go:644 component="rule manager" msg="Rule manager stopped"
level=info ts=2018-09-18T17:42:27.506418152Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2018-09-18T17:42:27.506448819Z caller=main.go:608 msg="Notifier manager stopped"
level=error ts=2018-09-18T17:42:27.506588933Z caller=main.go:617 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error"

Config file

--web.console.templates=/etc/prometheus/consoles
--web.console.libraries=/etc/prometheus/console_libraries
--config.file=/etc/prometheus/config_out/prometheus.env.yaml
--storage.tsdb.path=/prometheus
--storage.tsdb.retention=90d
--web.enable-lifecycle
--storage.tsdb.no-lockfile
--web.external-url=http://prometheus-prod.company.com/
--web.route-prefix=/
stefancrain commented Sep 18, 2018

This is causing a production monitoring outage for us. Could you suggest a fix to get back online? Would deleting the contents of the wal folder and restarting resolve this?

gouthamve (Member) commented Sep 18, 2018

@stefancrain Sorry that's the case. Deleting the WAL and restarting will fix it.

Now that we have an offending WAL, we'll have a fix quite soon, hopefully by tomorrow. It would also help if you could forward the mail you sent krasi to me as well, at (gouthamve [at] gmail.com).
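A cautious sketch of that recovery (paths are assumptions; stop Prometheus first). Moving the WAL aside instead of deleting it keeps the corrupted segments available for debugging:

```shell
# Move the WAL aside rather than deleting it outright; Prometheus
# recreates an empty WAL on the next start. DATA_DIR is an assumption,
# override it to match your --storage.tsdb.path.
DATA_DIR=${DATA_DIR:-/data}
if [ -d "$DATA_DIR/wal" ]; then
  mv "$DATA_DIR/wal" "$DATA_DIR/wal.corrupt.$(date +%s)"
fi
# then restart Prometheus
```

Note: the not-yet-compacted head data in the WAL (up to roughly 2 hours of samples) is lost this way; the persisted blocks are untouched.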

stefancrain commented Sep 18, 2018

@gouthamve Deleting the WAL files allowed Prometheus to come back online. Thanks for that!

gouthamve (Member) commented Sep 19, 2018

Hi, thanks for sending the WALs everyone, it made finding and fixing the issue very easy. The fix is out here: prometheus/tsdb#389 and will make it into the 2.4.1 release hopefully today.

Eric-Fontana-Bose (Author) commented Sep 20, 2018

This did NOT make it into 2.4.1. When will it get fixed? We're busted.

lostick commented Sep 20, 2018

They will probably get a fix for the fix in 2.4.2 (see #4636)

gouthamve (Member) commented Sep 20, 2018

Hi, sorry; due to an error on my side it did not make it in. We are in the process of releasing 2.4.2, but due to other issues we've discovered on Windows, we had to delay it until they are fixed. The PR is out and reviewed, and we hope the release of 2.4.2 will happen within the next 24 hours.

Sorry for the experience you had.

See prometheus/tsdb#392, which is the fix for #4635.

stefancrain commented Sep 21, 2018

2.4.2 looks to have been released ~7 hours ago with the fix for this issue. Our team has upgraded and will report back if we run into this issue again.

sumkincpp commented Sep 21, 2018

Hit the same problem on 2.4.0; 2.4.2 repaired the DB and fixed the issue by repairing the corrupted block:

 caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 2 at 32238072: unexpected checksum 4cbbf0b2, expected c50d6eb7"
Place1 commented Oct 7, 2018

I've hit this as well on prom/prometheus:v2.4.3

level=error ts=2018-10-07T02:39:01.980632051Z caller=main.go:617 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error: invalid record type 255"
gouthamve (Member) commented Oct 7, 2018

Hi @Place1, I've opened a new issue for it here: #4705

In the meantime, while we fix it, you can delete the contents of the WAL directory to unblock Prometheus. Please note that some data will be lost (up to 2 hours).

lxkaka commented Nov 26, 2018

Same issue with version 2.5.0.

krasi-georgiev (Member) commented Nov 26, 2018

@lxkaka you need to open a new issue with some details: the error message and how to replicate it.
Thanks.

lxkaka commented Nov 26, 2018

@krasi-georgiev OK, for now I deleted the WAL directory to resolve this issue.

anisimovyuriy commented Mar 11, 2019

We have the same issue in our dev environment after we added multiple Consul integrations; many of the services fail to scrape because of service-side issues. Deleting the WAL directory won't help in our case, as when we do, the same issue reappears within a day...
Any help, please?

krasi-georgiev (Member) commented Mar 11, 2019

Locking this issue now, as each of these reports will need a new issue with more details for proper troubleshooting.

prometheus locked and limited conversation to collaborators Mar 11, 2019
