prometheus skips scrapes due to throttling #2222

Closed
onorua opened this Issue Nov 27, 2016 · 12 comments

onorua commented Nov 27, 2016

What did you do?
Our Prometheus got updated to version 1.4, which did not work well (#2220), so we rolled back to version 1.3.1, and then this issue started to happen.

What did you expect to see?
Scrapes not being skipped every 2 hours or so.

What did you see instead? Under which circumstances?
[screenshot from 2016-11-26, 6:32 pm, showing the scrape gaps]

Environment
We run Prometheus with the following flags:

/bin/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -storage.local.retention=193h -storage.local.chunk-encoding-version=1 -storage.local.memory-chunks=3145728 -alertmanager.url=http://alertmanager:9093

The config follows; we don't do anything fancy. It is just a DigitalOcean machine that had been working fine for about a month until that upgrade happened.

  • System information:
    Linux 4.4.0-45-generic x86_64

  • Prometheus version:

prometheus, version 1.3.1 (branch: master, revision: be476954e80349cb7ec3ba6a3247cd712189dfcb)
  build user:       root@37f0aa346b26
  build date:       20161104-20:24:03
  go version:       go1.7.3
  • Prometheus configuration file:
# my global config
global:
  scrape_interval:     30s # Scrape targets every 30 seconds.
  evaluation_interval: 30s # Evaluate rules every 30 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'cloud'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alert.rules"
  # - "first.rules"
  # - "second.rules"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  # Auto-discovery via consul tags by Registrator
  - job_name: 'consul-services'

    consul_sd_configs:
    - server: 'consul:8500'
      scheme: 'http'

    relabel_configs:
    - source_labels: [ '__meta_consul_tags' ]
      action: keep
      regex: .*exporter.*
    - source_labels: [ '__meta_consul_address', '__meta_consul_service_port' ]
      action: replace
      regex: (.+)(?::\d+);(\d+)
      replacement: $1:$2
      target_label: __address__
    - source_labels: [ '__meta_consul_service' ]
      action: keep
      regex: (.+)
      replacement: $1
      target_label: __name__
    - source_labels: [ '__meta_consul_service_id' ]
      action: replace
      regex: (.*):(.*):(.*)
      replacement: $2
      target_label: container_name
    - source_labels: [ '__meta_consul_service_id' ]
      action: replace
      regex: (.*):(.*):(.*)
      replacement: $1
      target_label: host_name
    - source_labels: [ '__meta_consul_service_id' ]
      action: replace
      regex: .*-(\w*)-(\w*)-(\d*):(.*):(.*)
      replacement: $1
      target_label: colo
  • Logs:
time="2016-11-27T00:58:34Z" level=warning msg="Storage has entered rushed mode." chunksToPersist=447275 maxChunksToPersist=524288 maxMemoryChunks=3145728 memoryChunks=3169371 source="storage.go:1607" urgencyScore=0.8531093597412109
time="2016-11-27T00:59:56Z" level=info msg="Completed maintenance sweep through 1616945 in-memory fingerprints in 54m39.960566695s." source="storage.go:1180"
time="2016-11-27T01:03:34Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-11-27T01:03:35Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=524468 maxChunksToPersist=524288 maxToleratedMemChunks=3460300 memoryChunks=3152841 source="storage.go:908"
time="2016-11-27T01:04:53Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1m18.770319321s." source="persistence.go:573"
time="2016-11-27T01:09:53Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-11-27T01:11:14Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1m21.390220948s." source="persistence.go:573"
time="2016-11-27T01:16:14Z" level=info msg="Checkpointing in-memory metrics and chunks..." source="persistence.go:549"
time="2016-11-27T01:17:35Z" level=info msg="Done checkpointing in-memory metrics and chunks in 1m21.080615741s." source="persistence.go:573"
time="2016-11-27T01:22:15Z" level=info msg="Storage does not need throttling anymore." chunksToPersist=510456 maxChunksToPersist=524288 maxToleratedMemChunks=3460300 memoryChunks=3145729 source="storage.go:916"

I'll capture /debug/pprof/goroutine?debug=2 as soon as I see this happening while I'm online, but as it is the holidays, that is getting tricky.
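For reference, a minimal way to capture that dump from the command line (a sketch, assuming Prometheus listens on its default port 9090; adjust host and port to your setup):

curl -s 'http://localhost:9090/debug/pprof/goroutine?debug=2' > goroutine-dump.txt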

fabxc (Member) commented Nov 27, 2016

Did you recently increase the load on Prometheus, i.e. a higher number of targets?
You already mentioned in #2220 that the storage is checkpointing very frequently, and here the logs show that the storage is throttled, which causes scrapes to be skipped. That's intended behavior.

It's very unlikely that this has anything to do with 1.4, even less likely that the downgrade caused a previously working version to not work anymore.

Can you read through the storage documentation and verify whether you have to adjust any of the flags for your setup?

onorua (Author) commented Nov 27, 2016

We have not increased the load; let me show you the graph:
[screenshot from 2016-11-27, 8:26 am, showing the load graph]

As you can see, Prometheus was previously working fine. Then it got updated, and we naively waited a couple of hours for it to come back into service; since I/O was high, we thought it was rebuilding some indexes or something like that. Then we rolled back to the previous version, and we got these periodic scrape misses.
We have a DigitalOcean instance with 32 GB of RAM and 12 cores to monitor 240 machines; we run cAdvisor, node_exporter, and one exporter of our own with 2 metrics. That number of nodes should not cause scrape skips, I believe, but I would be more than thankful for any advice on how to tune our storage configuration to prevent the skips.

onorua (Author) commented Nov 27, 2016

Debug info that one of our colleagues managed to capture today:
prometh_gap_201611271107MSK.txt

Let me know if I can help you with anything else, or whether we are doing something really wrong.

beorn7 (Member) commented Nov 27, 2016

time="2016-11-27T01:03:35Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=524468 maxChunksToPersist=524288 maxToleratedMemChunks=3460300 memoryChunks=3152841 source="storage.go:908"

One thing that line tells me is that you have configured a decent amount of memoryChunks (~3M) but you have left the -storage.local.max-chunks-to-persist at its default value of 524288. That doesn't leave a lot of headroom. I'd try to set it to ~1.5M.
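For example (a sketch, not from the original comment; 1572864 is ~1.5M, i.e. 1.5 × 1024 × 1024, appended to the flags quoted above):

/bin/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -storage.local.retention=193h -storage.local.chunk-encoding-version=1 -storage.local.memory-chunks=3145728 -storage.local.max-chunks-to-persist=1572864 -alertmanager.url=http://alertmanager:9093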

It could well be that your server was already quite close to getting into persistence problems. After the restart, it then had to catch up on things like deleting old samples. There is a bit of hysteresis here, so it might never recover to the healthy state even with the same load. But as said, with the setting suggested above, your server should be able to handle much more.

I'm really sorry that flag tweaking is some kind of dark art right now. I'm working on making this all automagic, but that's not trivial to do.

onorua (Author) commented Nov 27, 2016

My current command-line arguments look like this:

      - '-config.file=/etc/prometheus/prometheus.yml'
      - '-storage.local.path=/prometheus'
      - '-storage.local.retention=193h'
      - '-storage.local.chunk-encoding-version=1'
      - '-storage.local.memory-chunks=3145728'
      - '-storage.local.checkpoint-dirty-series-limit=25000'
      - '-storage.local.max-chunks-to-persist=1572864'

We need around 24 hours to check whether Prometheus behaves better.

onorua (Author) commented Nov 29, 2016

We had 24 hours of normal operation and I believe the issue is solved. Thank you for the help.

onorua closed this Nov 29, 2016

onorua (Author) commented Nov 30, 2016

That is kind of weird: we could sustain fewer than 300 nodes with the mentioned config, but the problem came back as soon as we reached 300+ nodes.
I've read the tuning manual and watched the video, but it seems I'm failing to make it run stably.
I've noticed that scrape duration is huge when we have these drops on the graphs; can this be a cause of the problem, or is it a result of the skipped scrapes?
What metrics should I monitor on our Prometheus server to debug the problem in our case? Any help would be highly appreciated.

onorua reopened this Nov 30, 2016

beorn7 (Member) commented Nov 30, 2016

The log message about the throttling is usually quite helpful.

Also, all the prometheus_local_storage_.* metrics. You can get them via
curl -s http://your.prometheus.example.org/metrics | grep ^prometheus_local_storage
and paste the output here.
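As a sketch (not part of the original comment), the grep can be narrowed to the metrics most relevant to throttling; the names below all appear in the output pasted in the following comment:

curl -s http://your.prometheus.example.org/metrics | grep -E 'prometheus_local_storage_(persistence_urgency_score|rushed_mode|memory_chunks|memory_series|chunks_to_persist|max_chunks_to_persist)'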

onorua (Author) commented Nov 30, 2016

  • Throttling messages:
time="2016-11-30T05:49:23Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=1317835 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3460511 source="storage.go:908"
time="2016-11-30T05:53:15Z" level=info msg="Storage does not need throttling anymore." chunksToPersist=1289184 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3431151 source="storage.go:916"
time="2016-11-30T06:19:21Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=1316551 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3460398 source="storage.go:908"
time="2016-11-30T06:36:23Z" level=info msg="Storage does not need throttling anymore." chunksToPersist=1310534 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3450881 source="storage.go:916"
time="2016-11-30T06:41:43Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=1320310 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3460564 source="storage.go:908"
time="2016-11-30T06:43:17Z" level=info msg="Storage does not need throttling anymore." chunksToPersist=1301062 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3442308 source="storage.go:916"
time="2016-11-30T09:34:36Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=1319804 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3460356 source="storage.go:908"
time="2016-11-30T09:51:24Z" level=info msg="Storage does not need throttling anymore." chunksToPersist=1293578 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3431311 source="storage.go:916"
time="2016-11-30T11:23:08Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=1318918 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3460505 source="storage.go:908"
time="2016-11-30T11:24:15Z" level=info msg="Storage does not need throttling anymore." chunksToPersist=1295727 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3435150 source="storage.go:916"
  • Metrics:
prometheus_local_storage_checkpoint_duration_seconds 123.275170229
prometheus_local_storage_chunk_ops_total{type="clone"} 60483
prometheus_local_storage_chunk_ops_total{type="create"} 1.1473596e+07
prometheus_local_storage_chunk_ops_total{type="drop"} 7.530293e+06
prometheus_local_storage_chunk_ops_total{type="evict"} 1.1677451e+07
prometheus_local_storage_chunk_ops_total{type="load"} 3.57118e+06
prometheus_local_storage_chunk_ops_total{type="persist"} 8.10646e+06
prometheus_local_storage_chunk_ops_total{type="pin"} 1.1908332e+07
prometheus_local_storage_chunk_ops_total{type="transcode"} 1.4984969e+07
prometheus_local_storage_chunk_ops_total{type="unpin"} 1.1908332e+07
prometheus_local_storage_chunkdesc_ops_total{type="evict"} 7.50168e+06
prometheus_local_storage_chunkdesc_ops_total{type="load"} 310686
prometheus_local_storage_chunks_to_persist 1.230643e+06
prometheus_local_storage_fingerprint_mappings_total 0
prometheus_local_storage_inconsistencies_total 0
prometheus_local_storage_indexing_batch_duration_seconds{quantile="0.5"} 3.798416359
prometheus_local_storage_indexing_batch_duration_seconds{quantile="0.9"} 12.884763244
prometheus_local_storage_indexing_batch_duration_seconds{quantile="0.99"} 13.215276112
prometheus_local_storage_indexing_batch_duration_seconds_sum 20039.817416888982
prometheus_local_storage_indexing_batch_duration_seconds_count 1356
prometheus_local_storage_indexing_batch_sizes{quantile="0.5"} 45
prometheus_local_storage_indexing_batch_sizes{quantile="0.9"} 71
prometheus_local_storage_indexing_batch_sizes{quantile="0.99"} 82
prometheus_local_storage_indexing_batch_sizes_sum 666838
prometheus_local_storage_indexing_batch_sizes_count 1357
prometheus_local_storage_indexing_queue_capacity 16384
prometheus_local_storage_indexing_queue_length 18
prometheus_local_storage_ingested_samples_total 1.9500165e+09
prometheus_local_storage_maintain_series_duration_seconds{location="archived",quantile="0.5"} 0.001570654
prometheus_local_storage_maintain_series_duration_seconds{location="archived",quantile="0.9"} 0.005445075000000001
prometheus_local_storage_maintain_series_duration_seconds{location="archived",quantile="0.99"} 0.029498315
prometheus_local_storage_maintain_series_duration_seconds_sum{location="archived"} 1722.4060555070143
prometheus_local_storage_maintain_series_duration_seconds_count{location="archived"} 881459
prometheus_local_storage_maintain_series_duration_seconds{location="memory",quantile="0.5"} 0.001015929
prometheus_local_storage_maintain_series_duration_seconds{location="memory",quantile="0.9"} 0.003267751
prometheus_local_storage_maintain_series_duration_seconds{location="memory",quantile="0.99"} 0.014057316
prometheus_local_storage_maintain_series_duration_seconds_sum{location="memory"} 19067.43040654318
prometheus_local_storage_maintain_series_duration_seconds_count{location="memory"} 8.306398e+06
prometheus_local_storage_max_chunks_to_persist 2.572864e+06
prometheus_local_storage_memory_chunkdescs 2.8827217e+07
prometheus_local_storage_memory_chunks 3.367325e+06
prometheus_local_storage_memory_series 2.144199e+06
prometheus_local_storage_non_existent_series_matches_total 193
prometheus_local_storage_out_of_order_samples_total{reason="multiple_values_for_timestamp"} 0
prometheus_local_storage_out_of_order_samples_total{reason="timestamp_out_of_order"} 0
prometheus_local_storage_persist_errors_total 0
prometheus_local_storage_persistence_urgency_score 0.7044378916422533
prometheus_local_storage_rushed_mode 1
prometheus_local_storage_series_ops_total{type="archive"} 20057
prometheus_local_storage_series_ops_total{type="create"} 1179
prometheus_local_storage_series_ops_total{type="maintenance_in_archive"} 880642
prometheus_local_storage_series_ops_total{type="maintenance_in_memory"} 8.306398e+06
prometheus_local_storage_series_ops_total{type="purge_from_archive"} 665650
prometheus_local_storage_series_ops_total{type="purge_from_memory"} 27
prometheus_local_storage_series_ops_total{type="purge_on_request"} 0
prometheus_local_storage_series_ops_total{type="quarantine_completed"} 0
prometheus_local_storage_series_ops_total{type="quarantine_dropped"} 0
prometheus_local_storage_series_ops_total{type="quarantine_failed"} 0
prometheus_local_storage_series_ops_total{type="unarchive"} 11389
prometheus_local_storage_started_dirty 0

onorua changed the title from "prometheus skips scrapes every ~2 hours for 10-20 minutes" to "prometheus crash on load" on Dec 1, 2016

onorua changed the title from "prometheus crash on load" to "prometheus skips scrapes due to throttling" on Dec 1, 2016

beorn7 (Member) commented Dec 1, 2016

The throttling happens now because you have too many chunks in memory:

time="2016-11-30T11:23:08Z" level=error msg="Storage needs throttling. Scrapes and rule evaluations will be skipped." chunksToPersist=1318918 maxChunksToPersist=2572864 maxToleratedMemChunks=3460300 memoryChunks=3460505 source="storage.go:908"

Check memoryChunks vs. maxToleratedMemChunks. You don't have a lot of chunksToPersist, so persistence is going great, and there is not much to gain by tweaking it. However, you have more than 2M time series in memory:

prometheus_local_storage_memory_series 2.144199e+06

Each time series in memory needs at least one chunk. Probably most of them are actively being appended to, so their one chunk is still open and cannot be persisted. Thus, you have about 2.1M open chunks (which cannot be persisted) plus 1.3M chunks that can be persisted (and this is actively happening, but you cannot just persist them in one go, see the linked talk; 1.3M chunks waiting for persistence is a reasonable number if you have 2.1M time series in memory). In sum, you have 3.4M chunks that are currently in memory and cannot be evicted. That's more than you have configured (-storage.local.memory-chunks=3145728). The only thing Prometheus can do in that situation is to stop ingestion until enough chunks have been persisted and subsequently evicted from memory.
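A rough way to watch this accounting from the metrics themselves (a sketch; the storage layer's internal bookkeeping differs slightly): if this sum approaches the configured -storage.local.memory-chunks value (3145728 here), throttling is imminent.

# approximate number of non-evictable chunks: one open chunk per in-memory series plus the persistence queue
prometheus_local_storage_memory_series + prometheus_local_storage_chunks_to_persist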

The straightforward remedy is to increase -storage.local.memory-chunks. That requires more RAM, though. If you don't have more RAM, you have to reduce the number of time series (scrape fewer targets, or drop metrics you don't need with metric_relabel_configs, as sketched below).
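A minimal sketch of the second option, added under the existing 'consul-services' scrape config (the regex is a hypothetical placeholder for metric names you don't need, not something suggested in the thread):

    metric_relabel_configs:
    - source_labels: [ '__name__' ]
      action: drop
      regex: 'container_tasks_state|container_memory_failcnt'  # hypothetical examples of unneeded metrics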

onorua (Author) commented Dec 1, 2016

Yes, that is the conclusion I've come to as well: we need more Prometheus servers :)
We have 4 servers now, each dealing with its own shard. We will review the metrics we don't use and remove them. Thank you for the suggestions!

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators on Mar 23, 2019
