WAL Corruption - recover.go does not recover #5369

Closed
kiefersmith opened this Issue Mar 15, 2019 · 14 comments

kiefersmith commented Mar 15, 2019

Bug Report

What did you do?

I noticed Grafana had lost its connection to Prometheus. Looking at the Prometheus logs (below), there appears to be corruption in the WAL. This error made Prometheus' TSDB completely unavailable.

What did you expect to see?

No errors.

What did you see instead? Under which circumstances?

I saw an error in Grafana: tsdb.HandleRequest() error ... dial tcp <prom_ip_addr>:9090: connect: connection refused

Errors in Prom logs.

Environment

  • System information:

    Linux 4.19.12-1.el7.elrepo.x86_64 x86_64

  • Prometheus version:

    prometheus, version 2.3.3 (branch: release-2.3.3, revision: c94ba49)
    build user: timo@umbra
    build date: 20180814-13:05:03
    go version: go1.10.3

  • Prometheus configuration file:
    NULL

  • Logs:

INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 12:56:01, next run: 2019-03-15 13:01:01)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:01:01, next run: 2019-03-15 13:06:01)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:06:01, next run: 2019-03-15 13:11:01)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:11:02, next run: 2019-03-15 13:16:02)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:16:02, next run: 2019-03-15 13:21:02)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:21:02, next run: 2019-03-15 13:26:02)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:26:02, next run: 2019-03-15 13:31:02)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:31:03, next run: 2019-03-15 13:36:03)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:36:03, next run: 2019-03-15 13:41:03)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:41:03, next run: 2019-03-15 13:46:03)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:46:03, next run: 2019-03-15 13:51:03)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:51:04, next run: 2019-03-15 13:56:04)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 13:56:04, next run: 2019-03-15 14:01:04)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:01:04, next run: 2019-03-15 14:06:04)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:06:05, next run: 2019-03-15 14:11:05)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:11:05, next run: 2019-03-15 14:16:05)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:16:05, next run: 2019-03-15 14:21:05)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:21:05, next run: 2019-03-15 14:26:05)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:26:06, next run: 2019-03-15 14:31:06)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:31:06, next run: 2019-03-15 14:36:06)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:36:06, next run: 2019-03-15 14:41:06)
INFO:schedule:Running job Every 5 minutes do discover_mesos_slaves() (last run: 2019-03-15 14:41:06, next run: 2019-03-15 14:46:06)
level=error ts=2019-03-15T14:46:33.706574386Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.798892788Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.799231821Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.842408336Z caller=engine.go:508 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.844569477Z caller=engine.go:508 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.85853192Z caller=engine.go:508 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.915564881Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.91581997Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.925904698Z caller=engine.go:508 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:33.931537797Z caller=engine.go:508 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:38.651556616Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:38.655657559Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:43.661907353Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:48.648798956Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:53.64350617Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:46:58.654623039Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:03.658527431Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:08.700502992Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:13.671682199Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:18.671546164Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:47.286772473Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:47.30382954Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:47.305759426Z caller=engine.go:508 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:50.818849113Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
level=error ts=2019-03-15T14:47:50.92907986Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"
2019-03-15 14:47:58,522 INFO exited: prometheus (terminated by SIGKILL; not expected)
2019-03-15 14:47:59,538 INFO spawned: 'prometheus' with pid 37
level=info ts=2019-03-15T14:47:59.558285294Z caller=main.go:230 msg="Starting Prometheus" version="(version=2.3.3, branch=release-2.3.3, revision=c94ba498aebb649e86eba3d4a2f1d08a23187cb2)"
level=info ts=2019-03-15T14:47:59.558342258Z caller=main.go:231 build_context="(go=go1.10.3, user=timo@umbra, date=20180814-13:05:03)"
level=info ts=2019-03-15T14:47:59.55837162Z caller=main.go:232 host_details="(Linux 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 11e6c6c94a75 (none))"
level=info ts=2019-03-15T14:47:59.558394364Z caller=main.go:233 fd_limits="(soft=65536, hard=65536)"
level=info ts=2019-03-15T14:47:59.558412278Z caller=main.go:234 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-03-15T14:47:59.559243347Z caller=web.go:414 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-03-15T14:47:59.559230995Z caller=main.go:545 msg="Starting TSDB ..."
level=info ts=2019-03-15T14:47:59.568604921Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1549951200000 maxt=1550145600000 ulid=01D3P8BC04R89CAK604Y1RDZ14
level=info ts=2019-03-15T14:47:59.576836794Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550145600000 maxt=1550340000000 ulid=01D3W1QXTETA3VKG94A0NDEM9R
level=info ts=2019-03-15T14:47:59.577350293Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550340000000 maxt=1550534400000 ulid=01D41V4FXQ46ZD1MJZ9B0553JD
level=info ts=2019-03-15T14:47:59.585488515Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550534400000 maxt=1550728800000 ulid=01D47MHC8N4H4CXX9A2GSMW7NW
level=info ts=2019-03-15T14:47:59.586015235Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550728800000 maxt=1550923200000 ulid=01D4DDXRS29ZF5TNCD8MW71XNE
level=info ts=2019-03-15T14:47:59.59517528Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550923200000 maxt=1551117600000 ulid=01D4K7AE7A4QRD9TSPW5KPEM41
level=info ts=2019-03-15T14:47:59.628816429Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551117600000 maxt=1551312000000 ulid=01D4S0Q1XN6B6Y53QCHYXSCHE1
level=info ts=2019-03-15T14:47:59.636570806Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551312000000 maxt=1551506400000 ulid=01D4YT3JZJRYBA5WC1593RBD8P
level=info ts=2019-03-15T14:47:59.637136467Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551700800000 ulid=01D54KG7DSAT4SRVTWDPFZ9R7R
level=info ts=2019-03-15T14:47:59.63753834Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551700800000 maxt=1551895200000 ulid=01D5ACWZZV8VWSAHQ1PBH13W3T
level=info ts=2019-03-15T14:47:59.638050257Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551895200000 maxt=1552089600000 ulid=01D5G69HY96P6X2C3CN4XMG4J0
level=info ts=2019-03-15T14:47:59.638561277Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552089600000 maxt=1552284000000 ulid=01D5NZP2KXK3JBHF5PHHFM25F6
level=info ts=2019-03-15T14:47:59.643668409Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552284000000 maxt=1552478400000 ulid=01D5VS2WWM933JT4GPFVJ3VY0B
level=info ts=2019-03-15T14:47:59.656647997Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552478400000 maxt=1552543200000 ulid=01D5XPW7Q9PW67Z92MCX709CGG
level=info ts=2019-03-15T14:47:59.670694639Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552543200000 maxt=1552608000000 ulid=01D5ZMNPABE55821SQDX8DWJPG
level=info ts=2019-03-15T14:47:59.67116454Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552629600000 maxt=1552636800000 ulid=01D6098QNSJTP1NDCD9CJ39QB7
level=info ts=2019-03-15T14:47:59.671841501Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552608000000 maxt=1552629600000 ulid=01D6098VNK1PDBSA4ZPHMR515A
level=info ts=2019-03-15T14:47:59.672339904Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552636800000 maxt=1552644000000 ulid=01D60G4EXSN6ENP0SJSA59SKXP
level=info ts=2019-03-15T14:47:59.67936303Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552644000000 maxt=1552651200000 ulid=01D60Q065T5DR1YG15XY78XEVD
2019-03-15 14:48:09,686 INFO success: prometheus entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
level=warn ts=2019-03-15T14:48:34.37403836Z caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 4378 at 120789296: unexpected checksum 7c9a8e94, expected d91bf761"
level=info ts=2019-03-15T14:48:34.3985243Z caller=main.go:414 msg="Stopping scrape discovery manager..."
level=info ts=2019-03-15T14:48:34.398554639Z caller=main.go:428 msg="Stopping notify discovery manager..."
level=info ts=2019-03-15T14:48:34.398562715Z caller=main.go:450 msg="Stopping scrape manager..."
level=info ts=2019-03-15T14:48:34.398571097Z caller=main.go:424 msg="Notify discovery manager stopped"
level=info ts=2019-03-15T14:48:34.398594458Z caller=main.go:410 msg="Scrape discovery manager stopped"
level=info ts=2019-03-15T14:48:34.39861124Z caller=manager.go:634 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-03-15T14:48:34.398621771Z caller=manager.go:640 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-03-15T14:48:34.398634673Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2019-03-15T14:48:34.398616339Z caller=main.go:444 msg="Scrape manager stopped"
level=info ts=2019-03-15T14:48:34.398646271Z caller=main.go:599 msg="Notifier manager stopped"
level=error ts=2019-03-15T14:48:34.398869931Z caller=main.go:608 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error"
2019-03-15 14:48:34,427 INFO exited: prometheus (exit status 1; not expected)
2019-03-15 14:48:35,429 INFO spawned: 'prometheus' with pid 61
level=info ts=2019-03-15T14:48:35.449388547Z caller=main.go:230 msg="Starting Prometheus" version="(version=2.3.3, branch=release-2.3.3, revision=c94ba498aebb649e86eba3d4a2f1d08a23187cb2)"
level=info ts=2019-03-15T14:48:35.449446505Z caller=main.go:231 build_context="(go=go1.10.3, user=timo@umbra, date=20180814-13:05:03)"
level=info ts=2019-03-15T14:48:35.449485317Z caller=main.go:232 host_details="(Linux 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 11e6c6c94a75 (none))"
level=info ts=2019-03-15T14:48:35.449504781Z caller=main.go:233 fd_limits="(soft=65536, hard=65536)"
level=info ts=2019-03-15T14:48:35.449517858Z caller=main.go:234 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-03-15T14:48:35.450467726Z caller=web.go:414 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-03-15T14:48:35.450444996Z caller=main.go:545 msg="Starting TSDB ..."
level=info ts=2019-03-15T14:48:35.45079697Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1549951200000 maxt=1550145600000 ulid=01D3P8BC04R89CAK604Y1RDZ14
level=info ts=2019-03-15T14:48:35.450887713Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550145600000 maxt=1550340000000 ulid=01D3W1QXTETA3VKG94A0NDEM9R
level=info ts=2019-03-15T14:48:35.450972038Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550340000000 maxt=1550534400000 ulid=01D41V4FXQ46ZD1MJZ9B0553JD
level=info ts=2019-03-15T14:48:35.451078216Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550534400000 maxt=1550728800000 ulid=01D47MHC8N4H4CXX9A2GSMW7NW
level=info ts=2019-03-15T14:48:35.451164174Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550728800000 maxt=1550923200000 ulid=01D4DDXRS29ZF5TNCD8MW71XNE
level=info ts=2019-03-15T14:48:35.451247493Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1550923200000 maxt=1551117600000 ulid=01D4K7AE7A4QRD9TSPW5KPEM41
level=info ts=2019-03-15T14:48:35.451333886Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551117600000 maxt=1551312000000 ulid=01D4S0Q1XN6B6Y53QCHYXSCHE1
level=info ts=2019-03-15T14:48:35.451416932Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551312000000 maxt=1551506400000 ulid=01D4YT3JZJRYBA5WC1593RBD8P
level=info ts=2019-03-15T14:48:35.451511414Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551700800000 ulid=01D54KG7DSAT4SRVTWDPFZ9R7R
level=info ts=2019-03-15T14:48:35.451587038Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551700800000 maxt=1551895200000 ulid=01D5ACWZZV8VWSAHQ1PBH13W3T
level=info ts=2019-03-15T14:48:35.451675749Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1551895200000 maxt=1552089600000 ulid=01D5G69HY96P6X2C3CN4XMG4J0
level=info ts=2019-03-15T14:48:35.451785409Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552089600000 maxt=1552284000000 ulid=01D5NZP2KXK3JBHF5PHHFM25F6
level=info ts=2019-03-15T14:48:35.451881568Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552284000000 maxt=1552478400000 ulid=01D5VS2WWM933JT4GPFVJ3VY0B
level=info ts=2019-03-15T14:48:35.451970285Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552478400000 maxt=1552543200000 ulid=01D5XPW7Q9PW67Z92MCX709CGG
level=info ts=2019-03-15T14:48:35.452053606Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552543200000 maxt=1552608000000 ulid=01D5ZMNPABE55821SQDX8DWJPG
level=info ts=2019-03-15T14:48:35.452123356Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552629600000 maxt=1552636800000 ulid=01D6098QNSJTP1NDCD9CJ39QB7
level=info ts=2019-03-15T14:48:35.452202374Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552608000000 maxt=1552629600000 ulid=01D6098VNK1PDBSA4ZPHMR515A
level=info ts=2019-03-15T14:48:35.452258937Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552636800000 maxt=1552644000000 ulid=01D60G4EXSN6ENP0SJSA59SKXP
level=info ts=2019-03-15T14:48:35.452318107Z caller=repair.go:39 component=tsdb msg="found healthy block" mint=1552644000000 maxt=1552651200000 ulid=01D60Q065T5DR1YG15XY78XEVD
2019-03-15 14:48:45,468 INFO success: prometheus entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
level=warn ts=2019-03-15T14:48:45.704848994Z caller=head.go:415 component=tsdb msg="encountered WAL error, attempting repair" err="read records: corruption in segment 4378 at 120789296: unexpected checksum 7c9a8e94, expected d91bf761"
level=info ts=2019-03-15T14:48:45.727345649Z caller=main.go:414 msg="Stopping scrape discovery manager..."
level=info ts=2019-03-15T14:48:45.727383414Z caller=main.go:428 msg="Stopping notify discovery manager..."
level=info ts=2019-03-15T14:48:45.727391523Z caller=main.go:450 msg="Stopping scrape manager..."
level=info ts=2019-03-15T14:48:45.727399554Z caller=main.go:424 msg="Notify discovery manager stopped"
level=info ts=2019-03-15T14:48:45.727427038Z caller=manager.go:634 component="rule manager" msg="Stopping rule manager..."
level=info ts=2019-03-15T14:48:45.727437637Z caller=manager.go:640 component="rule manager" msg="Rule manager stopped"
level=info ts=2019-03-15T14:48:45.727459164Z caller=notifier.go:512 component=notifier msg="Stopping notification manager..."
level=info ts=2019-03-15T14:48:45.727471074Z caller=main.go:599 msg="Notifier manager stopped"
level=info ts=2019-03-15T14:48:45.727429963Z caller=main.go:410 msg="Scrape discovery manager stopped"
level=info ts=2019-03-15T14:48:45.727442867Z caller=main.go:444 msg="Scrape manager stopped"
level=error ts=2019-03-15T14:48:45.727693683Z caller=main.go:608 err="opening storage failed: read WAL: repair corrupted WAL: cannot handle error"
2019-03-15 14:48:45,762 INFO exited: prometheus (exit status 1; not expected)

The repair.go loop continued until I deleted the contents of the WAL directory (as suggested in other issues). This happened on our Dev platform, so the impact was small, but deleting the WAL would be much less acceptable on our production-grade platforms.

How can I go about establishing the root cause, and what is the best course of action in this situation (ideally one that avoids deleting the whole WAL directory)?

Thank you!

Related: #4705, #4603
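
For what it's worth, the error in the logs pinpoints the damage ("corruption in segment 4378 at 120789296"), and the WAL is a directory of individually numbered segment files, so one less destructive option than wiping the whole directory may be to remove only the corrupt segment and everything after it, at the cost of the most recent not-yet-compacted samples; data already compacted into blocks is unaffected. A rough sketch in bash, not an official recovery procedure (the /prometheus/wal path is taken from the logs; stop Prometheus and take a backup first):

# stop Prometheus first, then back up the WAL before touching it
cp -a /prometheus/wal /prometheus/wal.bak

# remove the corrupt segment (4378, per the error) and all later ones;
# 10# forces base-10 so zero-padded names aren't read as octal
cd /prometheus/wal
for f in 0*; do
  if [ "$((10#$f))" -ge 4378 ]; then rm -v -- "$f"; fi
done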

brian-brazil (Member) commented Mar 15, 2019

Can you reproduce with a newer version? It should delete everything after that point automatically.
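
For context, the automatic handling in newer versions keeps everything before the corruption point and discards everything from it onward. A much-simplified sketch of that strategy; the real implementation in the tsdb wal package rewrites the intact records of the corrupt segment into a fresh file rather than truncating in place, and the segment/offset values here are simply the ones from the log above:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strconv"
)

// repairWAL keeps data before the corruption and drops everything after it.
func repairWAL(dir string, corruptSeg int, offset int64) error {
    entries, err := os.ReadDir(dir)
    if err != nil {
        return err
    }
    for _, e := range entries {
        n, err := strconv.Atoi(e.Name())
        if err != nil {
            continue // not a numbered segment file
        }
        switch {
        case n > corruptSeg:
            // Later segments may depend on the lost records; drop them.
            if err := os.Remove(filepath.Join(dir, e.Name())); err != nil {
                return err
            }
        case n == corruptSeg:
            // Keep only the readable prefix of the corrupt segment.
            if err := os.Truncate(filepath.Join(dir, e.Name()), offset); err != nil {
                return err
            }
        }
    }
    return nil
}

func main() {
    // Segment 4378, offset 120789296, as reported in the log above.
    if err := repairWAL("/prometheus/wal", 4378, 120789296); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}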

kiefersmith (Author) commented Mar 15, 2019

I don't know what caused the error, really. Any idea where this line comes from?

level=error ts=2019-03-15T14:46:33.798892788Z caller=engine.go:530 component="query engine" msg="error expanding series set" err="context canceled"

Upgrading to a newer version sounds like a good idea.

brian-brazil (Member) commented Mar 15, 2019

That's unrelated; it's the shutdown stopping running queries.
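
In other words, the err="context canceled" lines are just the symptom of in-flight queries being aborted: query evaluation runs under a context that gets cancelled on shutdown, and whatever work is in progress returns context.Canceled. A toy illustration (the function name is made up, after the log message, not Prometheus' actual code):

package main

import (
    "context"
    "fmt"
    "time"
)

// expandSeriesSet stands in for a query-engine step that checks its
// context between units of work.
func expandSeriesSet(ctx context.Context) error {
    for i := 0; i < 10; i++ {
        select {
        case <-ctx.Done():
            return ctx.Err() // context.Canceled after shutdown begins
        case <-time.After(50 * time.Millisecond):
            // ...process one batch of series...
        }
    }
    return nil
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    go func() {
        time.Sleep(120 * time.Millisecond)
        cancel() // simulate shutdown cancelling running queries
    }()
    if err := expandSeriesSet(ctx); err != nil {
        fmt.Println("error expanding series set:", err) // prints: context canceled
    }
}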

kiefersmith (Author) commented Mar 15, 2019

Any ideas about the SIGKILL, then? That was not a manual action.

brian-brazil (Member) commented Mar 15, 2019

That would have come from outside Prometheus.

kiefersmith (Author) commented Mar 15, 2019

OK. I'll try to reproduce with both the new and old versions of Prometheus and see what I come up with. I suspect it has to do with an overload of the query engine.

krasi-georgiev (Member) commented Mar 28, 2019

@kiefersmith we've had a few WAL fixes in recent versions. Did you manage to test the latest version?

kiefersmith (Author) commented Mar 28, 2019

I was able to replicate the crash scenario: it seems sufficiently large queries coming from Grafana will hit the rate limit and overwhelm the query engine.

We use a custom version of Prometheus, so I have not yet been able to test the latest version (I still need to put together the required PR). I may be able to get to it next week. 🙂
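
In the meantime, if heavy dashboard queries are the trigger, the query engine's limits can be tightened so expensive queries fail fast instead of wedging the process. These flags exist in 2.x releases (the values below are examples, not recommendations; --query.max-samples was added around 2.5):

# bound query runtime, concurrency, and per-query sample count
prometheus \
  --query.timeout=1m \
  --query.max-concurrency=10 \
  --query.max-samples=5000000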

krasi-georgiev (Member) commented Mar 28, 2019

OK, I'd appreciate an update; if the bug is still there, I'll try to replicate it and work on a fix.

krasi-georgiev (Member) commented Apr 2, 2019

@kiefersmith did you have the chance to test the latest 2.8 Prometheus release?

kiefersmith (Author) commented Apr 2, 2019

I did test out 2.8, but could not replicate the exact issue. I tried writing garbage to the WAL and deleting lines, but that ended up being a bit too destructive.
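
For anyone else trying to reproduce this: a more controlled way to inject that kind of damage is to flip a few bytes inside an existing segment while Prometheus is stopped, which yields exactly a checksum mismatch on the next startup. A sketch (destructive, test data only; the segment path is an example):

# overwrite 16 bytes in the middle of a WAL segment with random data
dd if=/dev/urandom of=/prometheus/wal/00000098 \
   bs=1 count=16 seek=4096 conv=notrunc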

kiefersmith (Author) commented Apr 2, 2019

Will try sending large queries to the query API later today and see if that turns up anything interesting.

kiefersmith (Author) commented Apr 2, 2019

It would seem that the latest version is indeed better at recovering from failure! Feel free to call this one closed. I think it's supervisord that sends the SIGKILL when I lock up Prometheus (see the note after the log below). Let me know if there's any more information you'd like.

2019-04-02 13:47:34,543 INFO spawned: 'prometheus' with pid 34
level=warn ts=2019-04-02T13:47:34.571064495Z caller=main.go:274 deprecation_notice="'storage.tsdb.retention' flag is deprecated use 'storage.tsdb.retention.time' instead."
level=info ts=2019-04-02T13:47:34.571120257Z caller=main.go:321 msg="Starting Prometheus" version="(version=2.8.0, branch=HEAD, revision=59369491cfdfe8dcb325723d6d28a837887a07b9)"
level=info ts=2019-04-02T13:47:34.571139426Z caller=main.go:322 build_context="(go=go1.11.5, user=root@4c4d5c29b71f, date=20190312-07:46:58)"
level=info ts=2019-04-02T13:47:34.571158717Z caller=main.go:323 host_details="(Linux 4.19.12-1.el7.elrepo.x86_64 #1 SMP Fri Dec 21 11:06:36 EST 2018 x86_64 7ddf10cfa5b4 (none))"
level=info ts=2019-04-02T13:47:34.571178075Z caller=main.go:324 fd_limits="(soft=65536, hard=65536)"
level=info ts=2019-04-02T13:47:34.571195438Z caller=main.go:325 vm_limits="(soft=unlimited, hard=unlimited)"
level=info ts=2019-04-02T13:47:34.571828103Z caller=main.go:640 msg="Starting TSDB ..."
level=info ts=2019-04-02T13:47:34.571856743Z caller=web.go:418 component=web msg="Start listening for connections" address=0.0.0.0:9090
level=info ts=2019-04-02T13:47:34.572450035Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551506400000 maxt=1551700800000 ulid=01D54KG7DSAT4SRVTWDPFZ9R7R
level=info ts=2019-04-02T13:47:34.572736547Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551700800000 maxt=1551895200000 ulid=01D5ACWZZV8VWSAHQ1PBH13W3T
level=info ts=2019-04-02T13:47:34.573029599Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1551895200000 maxt=1552089600000 ulid=01D5G69HY96P6X2C3CN4XMG4J0
level=info ts=2019-04-02T13:47:34.587993996Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552089600000 maxt=1552284000000 ulid=01D5NZP2KXK3JBHF5PHHFM25F6
level=info ts=2019-04-02T13:47:34.609958319Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552284000000 maxt=1552478400000 ulid=01D5VS2WWM933JT4GPFVJ3VY0B
level=info ts=2019-04-02T13:47:34.629675798Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552478400000 maxt=1552672800000 ulid=01D61JFBNFY4MTDF576TWKVWNJ
level=info ts=2019-04-02T13:47:34.676690543Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552672800000 maxt=1552867200000 ulid=01D67BVXB9M8MA04FKRH5SHJG1
level=info ts=2019-04-02T13:47:34.71006658Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1552867200000 maxt=1553061600000 ulid=01D6D58R4RQJ8WSJ96ZFBSF89K
level=info ts=2019-04-02T13:47:34.710351903Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1553061600000 maxt=1553256000000 ulid=01D6JYNBZA20M9QW25Q3B45R3P
level=info ts=2019-04-02T13:47:34.710627405Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1553256000000 maxt=1553450400000 ulid=01D6RR1T23R222NYW9D1AXYNGY
level=info ts=2019-04-02T13:47:34.710877244Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1553450400000 maxt=1553644800000 ulid=01D6YHEDDE0K15V26RBHQWVB0A
level=info ts=2019-04-02T13:47:34.711224314Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1553644800000 maxt=1553839200000 ulid=01D74AVBYBFWBEJFWQ6T0S4E61
level=info ts=2019-04-02T13:47:34.711464805Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1553839200000 maxt=1554033600000 ulid=01D7A47K5A2MRESME68ZDBP4HM
level=info ts=2019-04-02T13:47:34.711745268Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1554033600000 maxt=1554098400000 ulid=01D7C215A8QBC7CDEFVESGZ6RK
level=info ts=2019-04-02T13:47:34.729709848Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1554098400000 maxt=1554163200000 ulid=01D7DZTK9HCVQ78KFWJ74KC9YB
level=info ts=2019-04-02T13:47:34.729973325Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1554184800000 maxt=1554192000000 ulid=01D7EMDP8GB5ZK7VVZ8E49V0SJ
level=info ts=2019-04-02T13:47:34.730209308Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1554163200000 maxt=1554184800000 ulid=01D7EMDRQPKCP030RK6KMTX35V
level=info ts=2019-04-02T13:47:34.730552165Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1554192000000 maxt=1554199200000 ulid=01D7EV9DGGX5PF67YWSXE0GFZW
level=info ts=2019-04-02T13:47:34.730751308Z caller=repair.go:48 component=tsdb msg="found healthy block" mint=1554199200000 maxt=1554206400000 ulid=01D7F254RPQYA5FA1KY1VFVEJ0
level=warn ts=2019-04-02T13:47:34.743074528Z caller=wal.go:116 component=tsdb msg="last page of the wal is torn, filling it with zeros" segment=/prometheus/wal/00000098
2019-04-02 13:47:44,751 INFO success: prometheus entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
level=info ts=2019-04-02T13:47:49.106257782Z caller=main.go:655 msg="TSDB started"
level=info ts=2019-04-02T13:47:49.106320298Z caller=main.go:724 msg="Loading configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-04-02T13:47:49.107238101Z caller=main.go:751 msg="Completed loading of configuration file" filename=/etc/prometheus/prometheus.yml
level=info ts=2019-04-02T13:47:49.107265647Z caller=main.go:609 msg="Server is ready to receive web requests."
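
On the SIGKILL theory: supervisord first sends its configured stop signal and only escalates to SIGKILL if the process hasn't exited within stopwaitsecs, so a wedged Prometheus would be killed that way during a stop or restart. That said, a SIGKILL that supervisord logs as "not expected" can also come from the kernel OOM killer, which is worth checking in dmesg. A sketch of the relevant supervisord knobs (hypothetical program section and command path; Prometheus can take a while to shut down cleanly, so generous timeouts help):

[program:prometheus]
command=/bin/prometheus --config.file=/etc/prometheus/prometheus.yml
stopsignal=TERM      ; graceful shutdown signal, sent first
stopwaitsecs=300     ; how long to wait before escalating to SIGKILL
startsecs=10         ; matches the "stayed up for > than 10 seconds" log
autorestart=unexpected
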
krasi-georgiev (Member) commented Apr 2, 2019

Thanks for the update. Feel free to reopen if you can replicate it, and we'll continue the troubleshooting.
