Prometheus fills up disk after restart #2542

Closed
michaeljs1990 opened this Issue Mar 28, 2017 · 3 comments

michaeljs1990 commented Mar 28, 2017

I restarted Prometheus after seeing that the main Prometheus box had entered rushed mode with an urgency score of 1 for about a minute, then dropped back down to our configured 0.8. Immediately after that, prometheus_local_storage_indexing_queue_length started growing and reached roughly 16k before the restart. After the restart, 600GB of disk filled up and the logs below appeared.

Environment

  • System information:

    Linux 3.13.0-105-generic x86_64

  • Prometheus version:

    prometheus, version 1.0.0 (branch: v1.0.0-marathon-auth, revision: 710c7da)
    build user: root@967a46ea24e5
    build date: 20160728-18:51:38
    go version: go1.6.2

  • Logs:

time="2017-03-28T06:25:53Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=9 source="scrape.go:470"
time="2017-03-28T06:25:53Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=1 source="scrape.go:470"
time="2017-03-28T06:25:53Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=10 source="scrape.go:470"
time="2017-03-28T06:25:56Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=10 source="scrape.go:470"
time="2017-03-28T06:25:56Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=5 source="scrape.go:470"
time="2017-03-28T06:25:56Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=19 source="scrape.go:470"
....... (goes on forever and never tries to persist to disk until I restarted and then) ......
time="2017-03-28T14:47:27Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=1181523 maxChunksToPersist=2000000 maxToleratedMemChunks=3520000 memoryChunks=3520214 source="storage.go:707"
time="2017-03-28T14:48:20Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1m10.062012775s." source="persistence.go:563"
time="2017-03-28T14:48:20Z" level=warning msg="Storage has entered rushed mode." chunksToPersist=1182625 maxChunksToPersist=2000000 maxMemoryChunks=3200000 memoryChunks=3520297 source="storage.go:1404" urgencyScore=1
time="2017-03-28T14:48:22Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=8 source="scrape.go:470"
time="2017-03-28T14:48:31Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=9 source="scrape.go:470"
time="2017-03-28T14:48:51Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=7 source="scrape.go:470"
..... (then a ton of these)
time="2017-03-28T14:49:03Z" level=warning msg="Series quarantined." fingerprint=648fe85b5ccf479b metric=node_cpu{collins_contact="Foundations", collins_nodeclass="anode", collins_pool="SUBMITQUEUE", collins_primary_role="BUILD", collins_secondary_role="SUBMITQUEUE", cpu="cpu9", host="lolnotarealhost.com", instance="109.102.192.174:9100", job="hosts-targets", mode="idle"} reason="write /data/64/8fe85b5ccf479b.db.tmp: no space left on device" source="storage.go:1443"

I am unsure why so much disk was used. It also looks like a deadlock occurred: Prometheus left rushed mode and everything returned to normal after the one-minute spike, yet the indexing queue just kept growing.
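
For reference, here is my back-of-the-envelope reading of the "Storage has entered rushed mode" line above. It is only a sketch, assuming the urgency score is roughly the larger of the persistence-backlog ratio and the memory-chunk overshoot within the 10% tolerance band, capped at 1; the constants are copied straight from the log:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	// Numbers copied from the "Storage has entered rushed mode" line above.
	chunksToPersist := 1182625.0
	maxChunksToPersist := 2000000.0
	memoryChunks := 3520297.0
	maxMemoryChunks := 3200000.0
	maxToleratedMemChunks := 3520000.0 // 1.1 * maxMemoryChunks

	// Persistence backlog on its own: well below the 0.8 rushed-mode threshold.
	persistScore := chunksToPersist / maxChunksToPersist

	// Memory pressure: how far memoryChunks exceeds the configured maximum,
	// relative to the 10% tolerance band above it.
	memScore := 0.0
	if memoryChunks > maxMemoryChunks {
		memScore = (memoryChunks - maxMemoryChunks) / (maxToleratedMemChunks - maxMemoryChunks)
	}

	score := math.Min(math.Max(persistScore, memScore), 1)
	fmt.Printf("persistScore=%.2f memScore=%.2f urgencyScore=%.2f\n",
		persistScore, memScore, score)
	// Prints: persistScore=0.59 memScore=1.00 urgencyScore=1.00
}
```

If that reading is right, the persistence backlog alone would only give ~0.59, so it is the memory-chunk ceiling that pinned the score at 1.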

beorn7 (Member) commented Apr 2, 2017

If series are quarantined during normal operation, they end up in the orphaned directory (a sub-directory of the data directory). The same happens if files cannot be repaired during crash recovery. Thus, you might have a lot of data accumulated in that directory. (Unless you plan some kind of forensics on the data, you can simply delete those files.)
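
For example, a throwaway sketch like the following shows how much space has piled up there. The /data/orphaned path is an assumption taken from the quarantine log line above; point it at your actual -storage.local.path:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// Assumed location: the quarantine log line writes under /data, so the
	// orphaned files should be under /data/orphaned. Adjust this to your
	// actual -storage.local.path.
	orphaned := "/data/orphaned"

	var total int64
	err := filepath.Walk(orphaned, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk:", err)
		os.Exit(1)
	}
	fmt.Printf("%s holds %.1f GiB\n", orphaned, float64(total)/(1<<30))

	// If you don't need the files for forensics, cleanup is simply:
	// os.RemoveAll(orphaned)
}
```

(Running du -sh on the directory tells you the same thing.)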

A possible explanation for that "temporary deadlock" might be the following: A busy Prometheus server is stressing your disk quite a bit. Also, an SSD device that is almost full (like in your case) degrades dramatically in performance. If those two things come together, your SSD might lock up for minutes. (On Linux, dmesg will show warnings about file operations hanging for a long time.) This lock-up heavily affects LevelDB, which we use for indexing, which would explain why the indexing queue grew so much.
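
If you want to check for that, something like this quick sketch (or simply grepping dmesg by hand) surfaces the kernel's hung-task warnings. The exact "blocked for more than" wording is an assumption and may differ between kernels:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// Dump the kernel ring buffer and keep only hung-task warnings. The
	// "blocked for more than" wording is the usual kernel message for stalled
	// I/O, but treat the exact string as an assumption for your kernel.
	out, err := exec.Command("dmesg").CombinedOutput()
	if err != nil {
		fmt.Println("dmesg:", err)
		return
	}
	for _, line := range strings.Split(string(out), "\n") {
		if strings.Contains(line, "blocked for more than") {
			fmt.Println(line)
		}
	}
}
```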

This doesn't look like a bug, just an operational issue. It makes the most sense to discuss problems like this on the prometheus-users mailing list rather than in a GitHub issue. That way, more people are available to help you, and others can benefit more easily from the presented solution.

beorn7 closed this Apr 2, 2017

michaeljs1990 commented Apr 3, 2017

Thanks for the response. I would almost guarantee this is an operational issue. I'll try my luck on the mailing list, thanks.

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
