
Make logging about suspended ingestion less spammy (and more meaningful). #1269

Closed
filippog opened this Issue Dec 18, 2015 · 7 comments

filippog commented Dec 18, 2015

Hi!
I'm trying Prometheus 0.16.1 with Go 1.5.1 from Debian stretch for evaluation purposes. After a restart, if chunks are detected to need persistence, there seems to be ping-pong between log messages, resulting in spam until the chunks waiting for persistence fall under the watermark:

time="2015-12-18T09:29:15Z" level=info msg="Starting target manager..." source="targetmanager.go:114" 
time="2015-12-18T09:29:15Z" level=info msg="Listening on :9090" source="web.go:218" 
time="2015-12-18T09:29:15Z" level=warning msg="2391974 chunks waiting for persistence (93% of the allowed maximum 2545728). Storage is now in graceful degradation mode. Series files are not synced anymore if following the adaptive strategy. Checkpoints are not performed more often than every 5m0s. Series maintenance happens as frequently as possible." source="storage.go:1156" 
time="2015-12-18T09:29:26Z" level=warning msg="2545728 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="2545728 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="2545729 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
[...]
time="2015-12-18T09:29:26Z" level=warning msg="2545728 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:29:26Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:29:26Z" level=warning msg="2545729 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:29:26Z" level=warning msg="2545728 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="2545729 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
[...]
time="2015-12-18T09:29:26Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:29:26Z" level=warning msg="2545729 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="2545728 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="2545729 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:29:26Z" level=warning msg="2545730 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 
time="2015-12-18T09:29:26Z" level=warning msg="2545730 chunks waiting for persistence, sample ingestion suspended." source="storage.go:544" 

Sometimes large bursts of "Sample ingestion resumed." messages would appear too:

time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
time="2015-12-18T09:43:33Z" level=warning msg="Sample ingestion resumed." source="storage.go:548" 
[...]

fabxc commented Dec 18, 2015

Storage logging can be very noisy. There is a high number of operations, and the storage can go back and forth between different stages, as you described. Unfortunately, there is no one-size-fits-all solution to improve the situation.

Generally, though, it has not reached a degree where it would inhibit the server's operation.



filippog commented Dec 18, 2015

Thanks for the quick reply @fabxc! Yeah, it hasn't generally been a problem during normal server operation. I suppose that's because the number of chunks to persist is normally not very close to maxChunksToPersist. Would it make sense to log based on percentChunksToPersistForDegradation instead? Or to rate-limit, since log writing would be serialized anyway, I think.

Thanks!


beorn7 commented Dec 18, 2015

The reason for the many log messages is that every target logs it separately. While that's mildly annoying, it's not really a problem. And it mirrors the fact that every target is suspended and resumed separately.

Also, take into account that suspension of ingestion is a highly irregular state. If it happens regularly, you are either overloading your server or you need to tweak flags. It's definitely a sign you should be very worried about. From that perspective, "spamming" the logs is not the worst thing that can happen.

@beorn7 beorn7 added the enhancement label Dec 18, 2015


filippog commented Dec 18, 2015

Thanks @beorn7! That prompted me to look, and indeed prometheus_local_storage_chunks_to_persist / prometheus_local_storage_max_chunks_to_persist has been hovering around 80%. I've increased max chunks to persist to 100% of the chunks to keep in memory (it was 50%) and set an alert on that too. Unless there's another way of telling whether storage is in degraded mode?
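For reference, the alert I set is essentially on this expression (a sketch using the metric names above; 0.8 matches the degradation threshold discussed below, but the exact threshold and rule syntax depend on your version):

```
prometheus_local_storage_chunks_to_persist
  / prometheus_local_storage_max_chunks_to_persist
  > 0.8
```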


beorn7 commented Dec 18, 2015

Observing prometheus_local_storage_chunks_to_persist / prometheus_local_storage_max_chunks_to_persist is the right way. 80% is the threshold where it goes into the so-called degraded mode. That sounds worse than it is: it simply means that certain safeguards for data consistency are treated in a more relaxed way (e.g. not syncing after writes to series files, i.e. giving the OS disk cache a chance to improve write speed). Given the many ways filesystems "cheat" anyway to improve performance, the loss in data safety in "degraded" mode is not that big. So hovering around 80% is probably OK; it merely means you "need" degraded mode for sufficient speed. The real problem is if it peaks at 100%, because then you get ingestion suspension, and you really don't want that.

More RAM helps to boost performance here, so increase -storage.local.memory-chunks as much as your available RAM allows, and then feel free to set -storage.local.max-chunks-to-persist to the same value. The drawback of that is bigger checkpoint sizes, but that's much preferred over suspended ingestion.
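Concretely, setting both flags together might look like this on the command line (the flag names are the ones mentioned above; the value of 4194304 is purely illustrative and should be sized to the host's available RAM):

```shell
# Illustrative values only -- size -storage.local.memory-chunks to fit in RAM,
# then set -storage.local.max-chunks-to-persist to the same value.
prometheus \
  -storage.local.memory-chunks=4194304 \
  -storage.local.max-chunks-to-persist=4194304
```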

@beorn7 beorn7 changed the title log spam while persisting chunks on startup Make logging about suspended ingestion less spammy (and more meaningful). Dec 18, 2015


beorn7 commented Feb 3, 2016

This is fixed in master.

@beorn7 beorn7 closed this Feb 3, 2016


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
