
Rushed mode after decreasing retention time #1740

Closed
Andor opened this Issue Jun 14, 2016 · 5 comments


Andor commented Jun 14, 2016

history

I had ~1M time series with a retention time of 1440h (about 470GB of data).
I then decreased the retention time to 360h.

After applying this change, Prometheus is stuck in rushed mode and never scrapes localhost.

prometheus: time="2016-06-14T16:01:37Z" level=info msg="Storage does not need throttling anymore." chunksToPersist=5217499 maxChunksToPersist=5242880 maxToleratedMemChunks=11534336 memoryChunks=6418755 source="storage.go:714"
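
For reference, here is how I read the counters in that line; the 1.1 tolerance factor is only inferred from maxToleratedMemChunks divided by my configured memory-chunks, not taken from the source:

package main

import "fmt"

func main() {
    // Values from the log line above.
    chunksToPersist := 5217499
    maxChunksToPersist := 5242880 // -storage.local.max-chunks-to-persist
    memoryChunks := 6418755
    maxToleratedMemChunks := 11534336 // apparently 1.1 * -storage.local.memory-chunks

    // Throttling seems to end once both counters are back under their limits.
    fmt.Println("persist queue under limit:", chunksToPersist < maxChunksToPersist) // true, barely
    fmt.Println("memory chunks under limit:", memoryChunks < maxToleratedMemChunks) // true
    fmt.Printf("tolerance factor: %.1f\n", float64(maxToleratedMemChunks)/10485760) // 1.1
}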

Checkpointing also takes a very long time:

prometheus: time="2016-06-14T16:02:46Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
prometheus: time="2016-06-14T16:09:58Z" level=info msg="Done checkpointing in-memory metrics and chunks in 7m11.638723479s." source="persistence.go:563"

According to atop, Prometheus is constantly reading from and writing to disk, at ~100% disk utilization.
After 15 minutes of running Prometheus I see enormous write values:

  PID     TID   RUID       EUID        THR   SYSCPU   USRCPU    VGROW    RGROW    RDDSK    WRDSK   ST   EXC   S   CPUNR    CPU   CMD       1/26
61715       -   root       root         31    9m23s    2h19m    36.6G    33.8G   310.1M    15.6G   N-     -   S      17     0%   prometheus
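
To put that WRDSK value in perspective (assuming the 15.6G is GiB written by the process during that 15-minute window):

package main

import "fmt"

func main() {
    // atop shows ~15.6 GiB written by the prometheus process in ~15 minutes.
    writtenGiB := 15.6
    windowSeconds := 15.0 * 60.0

    rateMiBps := writtenGiB * 1024 / windowSeconds
    fmt.Printf("average write rate: ~%.1f MiB/s\n", rateMiBps) // ~17.7 MiB/s
}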

I'm using a hardware RAID10 of spinning SAS disks.

current config

global:
  scrape_interval: "10s"
  scrape_timeout: "09s"
  evaluation_interval: "15s"
scrape_configs:
  - job_name: prometheus
    target_groups:
      - targets:
          - "localhost:9090"

running options:

before:

-storage.local.index-cache-size.fingerprint-to-metric=268435456
-storage.local.index-cache-size.fingerprint-to-timerange=268435456
-storage.local.index-cache-size.label-name-to-label-values=268435456
-storage.local.index-cache-size.label-pair-to-fingerprints=268435456
-storage.local.max-chunks-to-persist=5242880
-storage.local.memory-chunks=10485760
-storage.local.retention=1440h

after:

-storage.local.index-cache-size.fingerprint-to-metric=268435456
-storage.local.index-cache-size.fingerprint-to-timerange=268435456
-storage.local.index-cache-size.label-name-to-label-values=268435456
-storage.local.index-cache-size.label-pair-to-fingerprints=268435456
-storage.local.max-chunks-to-persist=5242880
-storage.local.memory-chunks=10485760
-storage.local.retention=360h
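
For context, a rough sketch of what the two chunk limits above translate to in bytes, assuming the fixed 1024-byte chunks of Prometheus's local storage (per-series bookkeeping and the index caches come on top of this):

package main

import "fmt"

func main() {
    const chunkSizeBytes = 1024.0 // assumed fixed chunk size of the local storage

    memoryChunks := 10485760.0      // -storage.local.memory-chunks
    maxChunksToPersist := 5242880.0 // -storage.local.max-chunks-to-persist

    // Raw chunk payload these limits allow to sit in memory / wait for persistence.
    fmt.Printf("memory-chunks:         ~%.0f GiB of chunk data\n", memoryChunks*chunkSizeBytes/(1<<30))       // ~10 GiB
    fmt.Printf("max-chunks-to-persist: ~%.0f GiB of chunk data\n", maxChunksToPersist*chunkSizeBytes/(1<<30)) // ~5 GiB
}

The chunk payload alone would thus account for up to ~10 GiB of the resident memory shown by atop, with most of the rest presumably being per-series bookkeeping, the four 256MiB index caches configured above, and Go runtime overhead.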

prometheus version

I'm using Prometheus 0.19.1 with custom service discovery via etcd:

prometheus, version 0.19.1 (branch: discovery_etcd, revision: 25bb5af)
  build user:       andor@t450
  build date:       20160527-11:13:38
  go version:       go1.6.1

Andor commented Jun 15, 2016

After two hours of heavy disk and CPU activity, Prometheus left rushed mode and the checkpoint time slowly went down:

prometheus: time="2016-06-14T18:04:12Z" level=info msg="Storage has left rushed mode." chunksToPersist=3670008 maxChunksToPersist=5242880 maxMemoryChunks=10485760 memoryChunks=6731280 source="storage.go:1390" urgencyScore=0.6999984741210937
prometheus: time="2016-06-14T18:04:12Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
prometheus: time="2016-06-14T18:08:32Z" level=info msg="Done checkpointing in-memory metrics and chunks in 4m20.062222697s." source="persistence.go:563"
prometheus: time="2016-06-14T18:13:42Z" level=info msg="Done checkpointing in-memory metrics and chunks in 3m42.604901474s." source="persistence.go:563"
...
prometheus: time="2016-06-15T05:20:05Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:539"
prometheus: time="2016-06-15T05:21:06Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1m0.436781974s." source="persistence.go:563"

The config file is still the same, with only one target.


beorn7 commented Jun 15, 2016

This is all "working as intended". Even if you have only one current target, you have a lot of older data in your storage. If you reduce the retention time by a lot, Prometheus will essentially rewrite your whole data into smaller files. That requires a lot of disk i/o. "Rushed mode" is a means to work through the backlog, so it's good and not bad that Prometheus switches into rushed mode. What's bad is if it throttles ingestion. Apparently, that happened in your case, but the log line you posted above tells us that throttling has just ended. So you might have only suffered a small amount of throttling. Your checkpointing time seems OK to me, too. You had 5217499 chunks to persist. Those are the ones that have to be checkpointed. Your checkpoint file will be about 6GiB in size, so your checkpointing write speed is ~15MiB/s.
Once the truncation of data was done, persistence caught up, and fewer chunks had to be checkpointed.
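
A back-of-the-envelope version of that estimate, assuming roughly 1KiB of chunk data per chunk plus per-series metadata (hence the "about 6GiB"):

package main

import (
    "fmt"
    "time"
)

func main() {
    const chunkSizeBytes = 1024.0 // assumed ~1 KiB of chunk data per chunk

    chunksToPersist := 5217499.0                         // from the first log line
    checkpointDuration := 7*time.Minute + 12*time.Second // "7m11.6s", rounded up

    payloadGiB := chunksToPersist * chunkSizeBytes / (1 << 30) // ~5 GiB of raw chunk data
    checkpointGiB := 6.0                                       // ~6 GiB once series metadata is added

    fmt.Printf("raw chunk payload:     ~%.1f GiB\n", payloadGiB)
    fmt.Printf("checkpoint write rate: ~%.0f MiB/s\n", checkpointGiB*1024/checkpointDuration.Seconds()) // ~14-15 MiB/s
}

That kind of sustained rate, combined with the concurrent random chunk I/O from persisting and truncating series files, lines up with a spinning-disk RAID10 running at ~100% utilization.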

beorn7 closed this Jun 15, 2016


Andor commented Jun 15, 2016

The problem is: I have the debug log level enabled and I still don't know what's going on with Prometheus :(


beorn7 commented Jun 15, 2016

What are you missing?


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 24, 2019
