
Ingestion stops, probably due to deadlocked series maintenance #1459

Closed
guanglinlv opened this Issue Mar 4, 2016 · 34 comments

guanglinlv commented Mar 4, 2016

Hi folks,

I updated Prometheus from 0.16.2 to 0.17.0, trying to reuse the old configuration and data, but I get an error on the status page and I can't get any samples.

My configuration is very simple:

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    scrape_timeout: 10s

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    target_groups:
      - targets: ['localhost:9090']

  - job_name: 'overwritten-default'
    scrape_interval: 5s
    scrape_timeout: 10s
    consul_sd_configs:
      - server: 'consul server'

    relabel_configs:
      - source_labels: ['__meta_consul_service_id']
        regex:         '(.*)'
        target_label:  'job'
        replacement:   '$1'
        action:        'replace'
      - source_labels: ['__meta_consul_service_address','__meta_consul_service_port']
        separator:     ';'
        regex:         '(.*);(.*)'
        target_label:  '__address__'
        replacement:   '$1:$2'
        action:        'replace'
      - source_labels: ['__meta_consul_service_id']
        regex:         '^prometheus_.*'
        action:        'keep'

There are no useful debug logs or hints. What am I missing?

Thanks.

brian-brazil (Member) commented Mar 4, 2016

Can you paste the error message you're getting about going into rushed mode?

guanglinlv (Author) commented Mar 4, 2016

@brian-brazil
Actually, I can't find any error message in the Prometheus log at debug level. I see the message "scrape skipped due to throttled ingestion" on the localhost:9090/status page.

Is there any more debug information I can provide?

BTW, everything is fine when I roll back to version 0.16.2.

Thanks.

brian-brazil (Member) commented Mar 4, 2016

You should be seeing this info log:

beorn7 (Member) commented Mar 4, 2016

Also interesting would be the command line flags you have started your server with.

And the output of

curl http://your-prometheus-server:9090/metrics | grep '^prometheus_local_storage'

Alfred-concur commented Mar 4, 2016

Experiencing the same errors. I will run the command suggested above and post the output here for review.

Alfred-concur commented Mar 4, 2016

Observed that when refreshing the status page, the last scrape time for at least one target jumps between 150 and 800 seconds, and it jumps around between targets every time I refresh.

Alfred-concur commented Mar 4, 2016

FYI, my configuration YAML file uses the default syntax for everything, with only targets added.

Alfred-concur commented Mar 4, 2016

[root@seapr1zenaa001 prometheus-0.17.0rc2.linux-amd64]# curl http://localhost:9090/metrics | grep '^prometheus_local_storage'
prometheus_local_storage_checkpoint_duration_milliseconds 407.337785
prometheus_local_storage_chunk_ops_total{type="create"} 590741
prometheus_local_storage_chunk_ops_total{type="persist"} 55755
prometheus_local_storage_chunk_ops_total{type="transcode"} 599573
prometheus_local_storage_chunkdesc_ops_total{type="evict"} 619
prometheus_local_storage_chunks_to_persist 524290
prometheus_local_storage_fingerprint_mappings_total 0
prometheus_local_storage_inconsistencies_total 0
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.5"} NaN
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.9"} NaN
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.99"} NaN
prometheus_local_storage_indexing_batch_duration_milliseconds_sum 3.920845
prometheus_local_storage_indexing_batch_duration_milliseconds_count 3
prometheus_local_storage_indexing_batch_sizes{quantile="0.5"} NaN
prometheus_local_storage_indexing_batch_sizes{quantile="0.9"} NaN
prometheus_local_storage_indexing_batch_sizes{quantile="0.99"} NaN
prometheus_local_storage_indexing_batch_sizes_sum 34
prometheus_local_storage_indexing_batch_sizes_count 3
prometheus_local_storage_indexing_queue_capacity 16384
prometheus_local_storage_indexing_queue_length 0
prometheus_local_storage_ingested_samples_total 2.32768159e+08
prometheus_local_storage_invalid_preload_requests_total 0
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.5"} NaN
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.9"} NaN
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.99"} NaN
prometheus_local_storage_maintain_series_duration_milliseconds_sum{location="memory"} 20337.46568399991
prometheus_local_storage_maintain_series_duration_milliseconds_count{location="memory"} 8772
prometheus_local_storage_max_chunks_to_persist 524288
prometheus_local_storage_memory_chunkdescs 622841
prometheus_local_storage_memory_chunks 590741
prometheus_local_storage_memory_series 10705
prometheus_local_storage_out_of_order_samples_total 0
prometheus_local_storage_persist_errors_total 0
prometheus_local_storage_persistence_urgency_score 0.10509681701660156
prometheus_local_storage_rushed_mode 0
prometheus_local_storage_series_ops_total{type="archive"} 5
prometheus_local_storage_series_ops_total{type="create"} 34
prometheus_local_storage_series_ops_total{type="maintenance_in_memory"} 8772

beorn7 (Member) commented Mar 4, 2016

As said, command line flags are also needed.

But I can see most of what I need from the above. There is definitely something wrong because you have
524290 chunks to persist, which is above the limit of 524288. Still the urgency score is only 0.1. Investigating...

beorn7 (Member) commented Mar 4, 2016

It actually looks as if the series maintenance goroutine is deadlocked. In that case, chunks are neither persisted nor is the urgency score updated. That would also explain why the prometheus_local_storage_maintain_series_duration_milliseconds quantiles are NaN although maintenance has happened in the past: the last maintenance runs were probably more than 10m ago.
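A quick way to check for a stalled maintenance loop from the outside, using only metrics already quoted in this thread (the host name is a placeholder), is to sample the maintenance counter twice, a few minutes apart:

curl -s http://your-prometheus-server:9090/metrics | grep 'series_ops_total{type="maintenance_in_memory"}'

If the counter has not increased between the two samples, the series maintenance loop is not making progress.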

beorn7 added the bug label and removed the question label on Mar 4, 2016

beorn7 changed the title from "scrape skipped due to throttled ingestion" to "Ingestion stops, probably due to deadlocked series maintenance" on Mar 4, 2016

Alfred-concur commented Mar 4, 2016

I have restarted the Prometheus server and started Prometheus and node-exporter for the local host. Currently everything is working fine; I will monitor for recurrence of the issue. When Prometheus came up, there were a lot of WARN messages regarding data recovery.

A small sample:

WARN[0009] Recovered metric prometheus_notifications_latency_seconds{instance="localhost:9090", job="prometheus", quantile="0.9"}, fingerprint ff83bffa6f97fd9e: all 345 chunks recovered from series file. source=crashrecovery.go:319
WARN[0009] Recovered metric container_network_transmit_packets_dropped_total{id="/system.slice/sshd-keygen.service", instance="dfwdr1kbnwrkr01:4194", interface="flannel.1", job="prometheus"}, fingerprint ff4a37aed954485f: all 27 chunks recovered from series file. source=crashrecovery.go:319
WARN[0009] Recovered metric node_exporter_scrape_duration_seconds{collector="conntrack", instance="dfwdr1kbnwrkr02:9100", job="prometheus", quantile="0.5", result="success"}, fingerprint ff50b786dd113b4e: all 231 chunks recovered from series file. source=crashrecovery.go:319
WARN[0009] Recovered metric container_memory_failures_total{id="/system.slice/usr-share-oem.mount", instance="dfwdr1kbnmstr01:4194", job="prometheus", scope="hierarchy", type="pgmajfault"}, fingerprint ff2437ea11fd7288: all 23 chunks recovered from series file. source=crashrecovery.go:319
WARN[0009] Recovered metric container_cpu_user_seconds_total{id="/docker-daemon", instance="dfwdr1kbnwrkr01:4194", job="prometheus"}, fingerprint ff867d748def5bd1: all 199 chunks recovered from series file. source=crashrecovery.go:319

beorn7 (Member) commented Mar 4, 2016

Yeah, that's the crash recovery, and it's normal if you don't have a clean shutdown.

It looks like the series maintenance stops under certain circumstances. Unfortunately, I couldn't reproduce the problem on any of our 48 production Prometheus servers that run 0.17.0. Every single one of them is maintaining its series just fine...

beorn7 self-assigned this on Mar 4, 2016

Alfred-concur commented Mar 4, 2016

Understood ... I will keep an eye on Prometheus and report any further problems related to this issue.

Thanks for your help ..

beorn7 (Member) commented Mar 4, 2016

In case it gets into that state again, a goroutine dump would be great. Then we could see which goroutine is deadlocked, if any. You get it with

curl http://your-prometheus-server:9090/debug/pprof/goroutine?debug=2

Another explanation would be if your server is stuck writing a checkpoint file, e.g. because the underlying disk is very slow or blocked. (Perhaps that could happen on Amazon or other cloud providers if you are running out of IOPS quota?)
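If a slow or blocked disk is the suspect, a rough check (assuming the sysstat package is installed; device names will differ) is to watch per-device utilisation while a checkpoint is being written:

# print extended per-device statistics once per second, five times
iostat -x 1 5

A device sitting near 100% utilisation (or with very high await) while Prometheus is checkpointing would point at slow I/O rather than at a deadlock inside Prometheus.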

Alfred-concur commented Mar 4, 2016

This is my first time working with Prometheus, and the first time my company has looked into using it, so it is all new to me. I am using a CentOS 6.5 server running on VMware; the guest is hosted on a dedicated server with only 6 other guests, and the storage is local SSD. I would hope the SSDs are fast enough :)

guanglinlv (Author) commented Mar 7, 2016

Hi @beorn7, sorry for the late reply. Here is my situation:

  1. my command line with 0.16.2

    ./prometheus -config.file=./prometheus.yml -storage.local.path=/tmp/prom_data -storage.local.memory-chunks=10000 -log.level=debug -storage.remote.opentsdb-url=http://<opentsdb>:4242 -alertmanager.url=http://localhost:9093
    
  2. It had been running for more than a week with version 0.16.2.

  3. Scrape targets are discovered via Consul; there are actually more than fifty targets.

  4. I updated my Prometheus as follows:

- kill -SIGTERM `pgrep prometheus`
- Run the prometheus-0.17 binary with the same command line and configuration.
- There is no recovery process; the scrape-skip error occurs immediately.
- Roll back to 0.16 and everything looks OK.

It is reproducible on my side. I can post any debug information you need here when I get back to my lab later.

thanks.

guanglinlv (Author) commented Mar 7, 2016

@beorn7

  • prometheus command line:

    ./prometheus -config.file=./prometheus.yml -storage.local.path=/tmp/prom_data -storage.local.memory-chunks=10000 -log.level=debug -storage.remote.opentsdb-url=http://<opentsdb>:4242 -alertmanager.url=http://localhost:9093

  • prometheus_local_storage metrics:

    curl http://10.63.121.65:9090/metrics | grep '^prometheus_local_storage'

    prometheus_local_storage_checkpoint_duration_milliseconds 0
    prometheus_local_storage_chunk_ops_total{type="create"} 88705
    prometheus_local_storage_chunk_ops_total{type="evict"} 11319
    prometheus_local_storage_chunk_ops_total{type="load"} 11238
    prometheus_local_storage_chunk_ops_total{type="persist"} 85
    prometheus_local_storage_chunk_ops_total{type="pin"} 25144
    prometheus_local_storage_chunk_ops_total{type="transcode"} 6463
    prometheus_local_storage_chunk_ops_total{type="unpin"} 25144
    prometheus_local_storage_chunkdesc_ops_total{type="evict"} 207
    prometheus_local_storage_chunkdesc_ops_total{type="load"} 77
    prometheus_local_storage_chunks_to_persist 78041
    prometheus_local_storage_fingerprint_mappings_total 0
    prometheus_local_storage_inconsistencies_total 0
    prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.5"} NaN
    prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.9"} NaN
    prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.99"} NaN
    prometheus_local_storage_indexing_batch_duration_milliseconds_sum 0
    prometheus_local_storage_indexing_batch_duration_milliseconds_count 0
    prometheus_local_storage_indexing_batch_sizes{quantile="0.5"} NaN
    prometheus_local_storage_indexing_batch_sizes{quantile="0.9"} NaN
    prometheus_local_storage_indexing_batch_sizes{quantile="0.99"} NaN
    prometheus_local_storage_indexing_batch_sizes_sum 0
    prometheus_local_storage_indexing_batch_sizes_count 0
    prometheus_local_storage_indexing_queue_capacity 16384
    prometheus_local_storage_indexing_queue_length 0
    prometheus_local_storage_ingested_samples_total 10722
    prometheus_local_storage_invalid_preload_requests_total 0
    prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.5"} 16.843054
    prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.9"} 20.851371
    prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.99"} 28.510981
    prometheus_local_storage_maintain_series_duration_milliseconds_sum{location="memory"} 321.30428399999994
    prometheus_local_storage_maintain_series_duration_milliseconds_count{location="memory"} 15
    prometheus_local_storage_max_chunks_to_persist 524288
    prometheus_local_storage_memory_chunkdescs 785842
    prometheus_local_storage_memory_chunks 88624
    prometheus_local_storage_memory_series 10586
    prometheus_local_storage_out_of_order_samples_total 0
    prometheus_local_storage_persist_errors_total 0
    prometheus_local_storage_persistence_urgency_score 0.14886093139648438
    prometheus_local_storage_rushed_mode 0
    prometheus_local_storage_series_ops_total{type="maintenance_in_memory"} 15
    prometheus_local_storage_series_ops_total{type="unarchive"} 36
    

thanks.

beorn7 (Member) commented Mar 7, 2016

@guanglinlv You have configured a maximum of 10000 memory chunks, but in reality, you have 88624. 0.16.x would just keep going in that case (and most likely OOM, assuming that people tailor their -storage.local.memory-chunks flag value to their available RAM). 0.17 tries to defend against excessive memory usage by throttling ingestion.

As your machine is apparently happy with 88624 memory chunks (and since you have 10586 active time series, i.e. more than you would allow chunks in memory), set -storage.local.memory-chunks to something like 100,000 and try again. This should not require more than 500MiB of RAM.
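Concretely, that would be the same invocation as before with only that one flag raised (a sketch based on @guanglinlv's command line above; the value is just the suggested ballpark):

./prometheus -config.file=./prometheus.yml -storage.local.path=/tmp/prom_data -storage.local.memory-chunks=100000 -log.level=debug -storage.remote.opentsdb-url=http://<opentsdb>:4242 -alertmanager.url=http://localhost:9093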

guanglinlv (Author) commented Mar 8, 2016

@beorn7 Yeah, 0.17 is running now. Thank you.

Another problem is that instant vector selectors no longer work with regex-match expressions.

For example:

node_cpu{mode=~"us"} returns no data in 0.17; =~ seems to have lost its effect.

Were there any changes to this in 0.17?

Thanks.

guanglinlv (Author) commented Mar 8, 2016

That's all right, I have checked the changelog for 0.17:

  • [CHANGE] Regular expression matchers in PromQL are anchored

This change confuses me; the query language is one of Prometheus' most popular features for me. Why did you need to anchor it?

Thanks.

grobie (Member) commented Mar 8, 2016

@guanglinlv See #996. You'll also find some other issues about the anchoring behavior. It is believed to help people do the right thing.
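In practical terms, since 0.17.0 a regex matcher has to match the entire label value, so a pattern that previously worked as a substring match needs explicit wildcards. A sketch using the selector from earlier in this thread:

node_cpu{mode=~"us"}       # 0.17+: only matches a label value that is exactly "us"
node_cpu{mode=~".*us.*"}   # reproduces the old substring behaviour (matches "user", etc.)
node_cpu{mode="user"}      # or use an equality matcher when you want one exact value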

guanglinlv (Author) commented Mar 8, 2016

@grobie OK, I agree that exact matching should help people do the right thing, but I think making it optional would be better than full anchoring. It will break some use cases, for example:

node_cpu{instance=~"xx|yy",mode="user"} lets me get data points for both the xx and yy instances, which is very useful for comparing different instances. How can I do that if regular expressions are anchored?

Thanks.

grobie (Member) commented Mar 8, 2016

@guanglinlv We went way off-topic here; I extracted your question into a new issue, #1470. Let's move the discussion there.

beorn7 (Member) commented Mar 8, 2016

To not lose track: The issue still remaining here is the one reported by @Alfred-concur. @Alfred-concur, if it happens again, it would be extremely useful to get the goroutine dump as described.

Alfred-concur commented Mar 8, 2016

Cool - thank you ....

ArtemChekunov commented Mar 24, 2016

I have a similar problem: a lot of "scrape skipped due to throttled ingestion" errors on the status page.
The goroutine dump is here: https://gist.github.com/sc0rp1us/5eccbc7f5e33dc4687fe

iotop -botqqq --iter=1 | grep prom | grep -v grep
06:39:29 14632 be/4 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % prometheus -config.file=/etc/prometheus.yml -query.staleness-delta 30s -storage.local.path=/srv/prometheus/ -storage.local.retention=360h0m0s -alertmanager.url=http://localhost:9093
du -hs /srv/prometheus/
109G    /srv/prometheus/
beorn7 (Member) commented Mar 24, 2016

@sc0rp1us Please send the storage metrics, too

curl http://your-prometheus-server:9090/metrics | grep '^prometheus_local_storage'

ArtemChekunov commented Mar 24, 2016

prometheus_local_storage_checkpoint_duration_milliseconds 27036.055987
prometheus_local_storage_chunk_ops_total{type="clone"} 20805
prometheus_local_storage_chunk_ops_total{type="create"} 1.1458e+06
prometheus_local_storage_chunk_ops_total{type="drop"} 506375
prometheus_local_storage_chunk_ops_total{type="evict"} 91938
prometheus_local_storage_chunk_ops_total{type="load"} 4710
prometheus_local_storage_chunk_ops_total{type="persist"} 484852
prometheus_local_storage_chunk_ops_total{type="pin"} 1.786422e+06
prometheus_local_storage_chunk_ops_total{type="transcode"} 1.190708e+06
prometheus_local_storage_chunk_ops_total{type="unpin"} 1.786422e+06
prometheus_local_storage_chunkdesc_ops_total{type="evict"} 3.272045e+06
prometheus_local_storage_chunkdesc_ops_total{type="load"} 2.881968e+06
prometheus_local_storage_chunks_to_persist 524768
prometheus_local_storage_fingerprint_mappings_total 0
prometheus_local_storage_inconsistencies_total 0
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.5"} 21.617701
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.9"} 36.254423
prometheus_local_storage_indexing_batch_duration_milliseconds{quantile="0.99"} 36.254423
prometheus_local_storage_indexing_batch_duration_milliseconds_sum 4840.978521000004
prometheus_local_storage_indexing_batch_duration_milliseconds_count 156
prometheus_local_storage_indexing_batch_sizes{quantile="0.5"} 1
prometheus_local_storage_indexing_batch_sizes{quantile="0.9"} 1
prometheus_local_storage_indexing_batch_sizes{quantile="0.99"} 1
prometheus_local_storage_indexing_batch_sizes_sum 147262
prometheus_local_storage_indexing_batch_sizes_count 156
prometheus_local_storage_indexing_queue_capacity 16384
prometheus_local_storage_indexing_queue_length 0
prometheus_local_storage_ingested_samples_total 9.9791132e+07
prometheus_local_storage_invalid_preload_requests_total 0
prometheus_local_storage_maintain_series_duration_milliseconds{location="archived",quantile="0.5"} 40.516838
prometheus_local_storage_maintain_series_duration_milliseconds{location="archived",quantile="0.9"} 135.572496
prometheus_local_storage_maintain_series_duration_milliseconds{location="archived",quantile="0.99"} 1571.727343
prometheus_local_storage_maintain_series_duration_milliseconds_sum{location="archived"} 162800.1146930002
prometheus_local_storage_maintain_series_duration_milliseconds_count{location="archived"} 1370
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.5"} 41.137639
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.9"} 76.504416
prometheus_local_storage_maintain_series_duration_milliseconds{location="memory",quantile="0.99"} 1480.461253
prometheus_local_storage_maintain_series_duration_milliseconds_sum{location="memory"} 7.099646124857001e+06
prometheus_local_storage_maintain_series_duration_milliseconds_count{location="memory"} 90693
prometheus_local_storage_max_chunks_to_persist 524288
prometheus_local_storage_memory_chunkdescs 8.787405e+06
prometheus_local_storage_memory_chunks 1.048576e+06
prometheus_local_storage_memory_series 136243
prometheus_local_storage_out_of_order_samples_total 0
prometheus_local_storage_persist_errors_total 0
prometheus_local_storage_persistence_urgency_score 1
prometheus_local_storage_rushed_mode 1
prometheus_local_storage_series_ops_total{type="archive"} 23
prometheus_local_storage_series_ops_total{type="create"} 53
prometheus_local_storage_series_ops_total{type="maintenance_in_archive"} 1061
prometheus_local_storage_series_ops_total{type="maintenance_in_memory"} 90693
prometheus_local_storage_series_ops_total{type="purge_from_archive"} 143
prometheus_local_storage_series_ops_total{type="unarchive"} 1231
beorn7 (Member) commented Mar 24, 2016

@sc0rp1us This doesn't look like a deadlock. It looks more like your disk cannot keep up with the ingestion speed. On what kind of disk are you running?
If you have more than 4GiB RAM available, you could try to tweak flags to make use of more RAM, see http://prometheus.io/docs/operating/storage/
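For example, taking the command line from the iotop output above and adding the two storage sizing flags (the values here are purely illustrative and need to be sized to the RAM that is actually available, as described in the storage documentation):

prometheus -config.file=/etc/prometheus.yml -query.staleness-delta 30s -storage.local.path=/srv/prometheus/ -storage.local.retention=360h0m0s -storage.local.memory-chunks=3000000 -storage.local.max-chunks-to-persist=1500000 -alertmanager.url=http://localhost:9093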

fabxc (Member) commented Apr 6, 2016

@beorn7 @sc0rp1us any updates? If this is a bug I would be keen to resolve it for 0.18.0.

ArtemChekunov commented Apr 6, 2016

Nope. I increased the following values:

-storage.local.memory-chunks=6666666 -storage.local.max-chunks-to-persist=3333332

because I have enough RAM, and I ordered a new server with SSDs.

At the moment it works OK.

fabxc (Member) commented Apr 6, 2016

Okay, seems that it was what @beorn7 suspected then.
Thanks for responding so quickly!

fabxc closed this on Apr 6, 2016

beorn7 (Member) commented Apr 6, 2016

FTR: The deadlock as seen by @Alfred-concur was in fact a real issue, not just a configuration snafu. But without the goroutine dump, we have no idea where to start looking.

So as long as it doesn't reoccur, we can leave this closed.
