High disk utilization once a day #2547

Closed
shini4i opened this Issue Mar 30, 2017 · 2 comments


shini4i commented Mar 30, 2017

  • System information:

    Linux 3.16.0-4-amd64 x86_64

  • Prometheus version:

    prometheus, version 1.5.2 (branch: master, revision: bd1182d)
    build user: root@a8af9200f95d
    build date: 20170210-14:41:22
    go version: go1.7.5

We recently started having a strange problem. Every day from 06:25 UTC the Prometheus process starts eating all available disk throughput: IOPS go from ~1k to ~2.5-3k, which is well above our IOPS limit on AWS. What is even worse, the spike takes longer every day.

What can cause such behaviour? Can it be related to retention? None of the settings we tested helps; the only thing that helps is starting Prometheus on a clean data directory. I don't remember having this problem on older versions.

The server has 16 CPUs and 30 GB of RAM.
The data directory takes 219 GB.

Prometheus is started with the following flags:

    -config.file /etc/prometheus/prometheus.yml
    -storage.local.path /opt/prometheus
    -web.console.templates /etc/prometheus/consoles
    -web.console.libraries /etc/prometheus/console_libraries
    -alertmanager.url=http://localhost:9093
    -storage.local.memory-chunks 4194304
    -storage.local.max-chunks-to-persist 2097152
    -log.format=logger:syslog?appname=prometheus&local=7
    -storage.local.series-file-shrink-ratio=0.3
    -storage.local.series-sync-strategy=never
    -storage.local.checkpoint-interval=15m
    -storage.local.retention=360h

beorn7 commented Apr 2, 2017

On average, v1.5 will use less disk I/O than previous versions (because the -storage.local.series-file-shrink-ratio flag finally has an effect – this has been confirmed on a large number of servers at SoundCloud). However, if your time series grow quite uniformly, that can lead to sudden bursts of disk activity when many series files reach the 30% shrinking point at the same time.
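
If you want to confirm that series-file maintenance is what drives the spike, one quick sanity check is to snapshot Prometheus's own local-storage metrics shortly before and during the 06:25 window. This is only a sketch – the exact metric names vary a bit between 1.x releases, so treat the grep pattern as an approximation:

    # compare snapshots taken before and during the spike; the 1.x storage
    # engine exports its self-monitoring metrics under this prefix
    curl -s http://localhost:9090/metrics | grep '^prometheus_local_storage_'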

You can go back to the old behavior by setting -storage.local.series-file-shrink-ratio=0, which means Prometheus will rewrite a series file whenever at least one whole chunk is beyond the retention time. That's kind of abusive towards the disk: HDDs are more or less fine with it, but SSDs will burn more quickly through their limited lifetime, and they might even degrade in performance.
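
Concretely, that would just mean changing this one flag in the startup line you posted, with everything else unchanged:

    # pre-1.5 behaviour: rewrite a series file as soon as any whole chunk
    # is beyond the retention time (cheapest on disk space, heaviest on I/O)
    -storage.local.series-file-shrink-ratio=0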

Another common source of excessive disk I/O is checkpointing. If you are fine with the increased data loss in case of a crash, set -storage.local.checkpoint-interval to something longer than the 5m default (like 15m or 30m). On SSDs, it's also highly recommended to set -storage.local.checkpoint-dirty-series-limit to something like a million.
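
As a sketch, with your current flags that would look like the lines below; the values are just the examples mentioned above, not tuned recommendations:

    # checkpoint less frequently, trading some crash safety for less I/O
    -storage.local.checkpoint-interval=30m
    -storage.local.checkpoint-dirty-series-limit=1000000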

What you can try in any case is -storage.local.chunk-encoding-version=2. That gives you roughly 3x better compression (and thus potentially 3x less disk I/O). The only cost is increased query time for certain types of queries, see https://prometheus.io/blog/2016/05/08/when-to-use-varbit-chunks/ .
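
For completeness, that is again a single-flag change. If I remember the 1.x behaviour correctly, only newly written chunks get the new encoding, so the effect builds up gradually:

    # use varbit (type 2) chunk encoding for newly created chunks
    -storage.local.chunk-encoding-version=2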

I'm closing this issue now as it doesn't look like a bug. If you continue to have problems, I suggest continuing the discussion on the prometheus-users mailing list. That way, more people are available to help you, and others can benefit more easily from the solutions presented. Should things turn out to be a bug after all, we can re-open this GitHub issue.

beorn7 closed this Apr 2, 2017

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
