
"The storage is now inconsistent. Restart Prometheus ASAP to initiate recovery." When restarting #2509

Closed
nicklan opened this Issue Mar 17, 2017 · 4 comments

nicklan commented Mar 17, 2017

What did you do?
Shut down Prometheus.

What did you expect to see?
A clean shutdown and restart.

What did you see instead? Under which circumstances?
The following error message:

time="2017-03-17T18:29:58Z" level=error msg="The storage is now inconsistent. Restart Prometheus ASAP to initiate recovery." error="error in method hasArchivedMetric(a64b08e58585b0fd): leveldb: closed" source="persistence.go:399"

Environment

  • System information:
    Linux 4.3.3-coreos x86_64
    AWS m4.10xlarge

  • Prometheus version:

prometheus, version 1.5.2 (branch: master, revision: bd1182d29f462c39544f94cc822830e1c64cf55b)
  build user:       root@1a01c5f68840
  build date:       20170210-16:23:28
  go version:       go1.7.5
  • Logs:
time="2017-03-17T18:27:30Z" level=warning msg="Received SIGTERM, exiting gracefully..." source="main.go:230" 
time="2017-03-17T18:27:30Z" level=info msg="See you next time!" source="main.go:237" 
time="2017-03-17T18:27:30Z" level=info msg="Stopping target manager..." source="targetmanager.go:75" 
time="2017-03-17T18:27:35Z" level=info msg="Stopping rule manager..." source="manager.go:374" 
time="2017-03-17T18:27:35Z" level=info msg="Rule manager stopped." source="manager.go:380" 
time="2017-03-17T18:27:35Z" level=info msg="Stopping notification handler..." source="notifier.go:369" 
time="2017-03-17T18:27:35Z" level=info msg="Stopping local storage..." source="storage.go:396" 
time="2017-03-17T18:27:35Z" level=info msg="Stopping maintenance loop..." source="storage.go:398" 
time="2017-03-17T18:28:21Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1m48.97614954s." source="persistence.go:639" 
time="2017-03-17T18:28:21Z" level=info msg="Maintenance loop stopped." source="storage.go:1259" 
time="2017-03-17T18:28:21Z" level=info msg="Stopping series quarantining..." source="storage.go:402" 
time="2017-03-17T18:28:21Z" level=info msg="Series quarantining stopped." source="storage.go:1701" 
time="2017-03-17T18:28:21Z" level=info msg="Stopping chunk eviction..." source="storage.go:406" 
time="2017-03-17T18:28:21Z" level=info msg="Chunk eviction stopped." source="storage.go:1079" 
time="2017-03-17T18:28:21Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:612" 
time="2017-03-17T18:29:45Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1m24.051053986s." source="persistence.go:639" 
time="2017-03-17T18:29:45Z" level=info msg="Checkpointing fingerprint mappings..." source="persistence.go:1480" 
time="2017-03-17T18:29:45Z" level=info msg="Done checkpointing fingerprint mappings in 85.264922ms." source="persistence.go:1503" 
time="2017-03-17T18:29:58Z" level=error msg="The storage is now inconsistent. Restart Prometheus ASAP to initiate recovery." error="error in method hasArchivedMetric(a64b08e58585b0fd): leveldb: closed" source="persistence.go:399" 
time="2017-03-17T18:29:58Z" level=info msg="Local storage stopped." source="storage.go:421" 

beorn7 commented Mar 20, 2017

This looks like a problem in one of the LevelDB indices.

If you restart the server and let it go through crash recovery again, those problems might get fixed. However, some LevelDB corruptions cannot be repaired that way.

Once #2210 is fixed, your chances will be even better.

There are also tools available on the internet that can repair a LevelDB offline (in a cold state).

It might also be more practical in this case to purge your whole data directory and start from blank storage.
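
For reference, below is a minimal, hedged sketch of the kind of offline ("cold") repair mentioned above, using goleveldb, the Go LevelDB library behind the Prometheus 1.x indices. The index directory passed on the command line (for example archived_fingerprint_to_metric inside the storage path) is an assumption about the 1.x on-disk layout; verify it against your own data directory, back it up first, and make sure Prometheus is stopped before running anything like this.

package main

import (
	"log"
	"os"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <path-to-leveldb-index-dir>", os.Args[0])
	}
	// Assumed layout: e.g. <storage.local.path>/archived_fingerprint_to_metric.
	path := os.Args[1]

	// RecoverFile rebuilds the database manifest from whatever table files
	// are still readable, discarding entries it cannot recover.
	db, err := leveldb.RecoverFile(path, nil)
	if err != nil {
		log.Fatalf("recovery of %s failed: %v", path, err)
	}
	defer db.Close()

	log.Printf("recovered %s; restart Prometheus so its own crash recovery can run", path)
}

If the index is damaged beyond what RecoverFile can salvage, deleting the data directory and starting from blank storage, as suggested above, remains the fallback.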

jkemp101 commented Apr 26, 2017

We just experienced the same scenario, running v1.5.0 on Kubernetes/AWS with the database on an EBS volume. We had received some high-CPU alerts caused by Prometheus a little earlier in the day, so I'm not sure whether Prometheus was already having trouble with the database. During shutdown it logged the same set of messages, and when restarted it logged the following during crash recovery:

time="2017-04-25T22:27:45Z" level=warning msg="Fingerprint ddb8f7da9dc068d0 assumed archived but couldn't be found in archived index." source="crashrecovery.go:388"

I was under a time constraint and couldn't wait for recovery to finish to see whether it would be repaired, so I restarted Prometheus with a fresh database.

brian-brazil commented Apr 26, 2017

Please upgrade to 1.5.2 or 1.6.1 ASAP. That's a known bug in 1.5.0.

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 23, 2019
