prometheus 2.3.0 become OOM when consul is unavailable #4253
Comments
Does the same happen without the Consul scrape config being listed? The consul log messages could be circumstantial.
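For context, the kind of Consul scrape config being asked about would look roughly like this sketch (the job name, server address, and empty services list are illustrative placeholders, not taken from this issue):

```yaml
scrape_configs:
  - job_name: 'consul-services'      # hypothetical job name
    consul_sd_configs:
      - server: 'localhost:8500'     # placeholder Consul agent address
        services: []                 # empty list discovers all services
```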
looking
So, I can't reproduce it. Looks like the stack trace is pointing at
@iksaif I did a test:

- Environment 1
- Environment 2
- Environment 3
- Result
I don't know whether this change in Consul makes Prometheus OOM.
We used 1.1.0 and I wasn't able to reproduce this issue. Could you try to get a heap profile with
@iksaif This is the heap pprof output.
@Wing924 this doesn't show high RAM usage. Can you take the profile at the time it replicates the problem, or just before it is OOM-killed?
@krasi-georgiev Sorry, I've uploaded the new pprof here.
This happened after I upgraded the Consul server cluster from 1.0.1 to 1.1.0.
That is weird, as the high memory usage points to federation. Is this behaving normally in Prometheus 2.2? Can you strip it down to the most minimal config that replicates the bug, so I can also try it locally?
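As a starting point for such a minimal reproduction, a stripped-down federation scrape config could look like this sketch (the target address and the single `up` matcher are placeholders, not taken from this issue):

```yaml
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - 'up'                                # single cheap selector
    static_configs:
      - targets: ['source-prometheus:9090']   # placeholder source server
```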
krasi-georgiev added the component/service discovery and kind/bug labels on Jun 13, 2018
@krasi-georgiev Half of my Prometheus servers are 2.2.1 and the others are 2.3.0. It happened on both.
I can't run the test again, because this incident would stop the whole monitoring system.
@Wing924 thanks, much appreciated. Ping me with the results and I will try to replicate with the minimal config as well.
Yes, that looks like it. There's a massive federation request being processed here. Can you share the configuration of the Prometheus sending it? |
@brian-brazil

```yaml
global:
  scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
scrape_configs:
  - job_name: 'federate'
    honor_labels: true
    params:
      'match[]':
        - 'up{}'
        - 'netdata_info{}'
        - '{job="netdata", __name__=~"netdata_(apache|cpu_cpu|disk|ipv4_(sockstat_tcp_sockets|tcpsock)|memcached|nginx|rabbitmq|redis|springboot|system_(cpu|io|ipv4|load|ram|swap)|tomcat|users|web_log)_.*"}'
        - '{env=~".+", job!="netdata"}'
    relabel_configs:
      - source_labels: [metrics_path]
        target_label: __metrics_path__
      - regex: metrics_path
        action: labeldrop
    file_sd_configs:
      - files:
          - /etc/prometheus/conf.d/worker_groups.yml
```
So that's going to pull in basically an entire Prometheus worth of data via federation. I'm a bit confused as to why the sampleRing is so big though. |
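If the broad selectors are the cause, one possible mitigation (a sketch, not a confirmed fix for this issue) is to federate only specific or pre-aggregated series rather than catch-all matchers like `{env=~".+"}`; the job selector and recording-rule naming convention below are assumptions:

```yaml
params:
  'match[]':
    - 'up{job="netdata"}'      # a specific job instead of every target
    - '{__name__=~"job:.*"}'   # only recording-rule aggregates (naming convention assumed)
```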
This looks like it was #4254. |
brian-brazil closed this on Jun 22, 2018
lock bot commented on Mar 22, 2019:
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.


Wing924 commented Jun 12, 2018
Bug Report
What did you do?
I use Consul SD in Prometheus.
I upgraded the Consul servers, but the failover failed and the Consul cluster was down for a few minutes.
What did you expect to see?
Prometheus running as usual.
What did you see instead? Under which circumstances?
Prometheus ran out of memory.
Environment
System information:
Linux 3.10.0-693.2.2.el7.x86_64 x86_64
Prometheus version:
prometheus, version 2.3.0 (branch: HEAD, revision: 290d717)
build user: root@d539e167976a
build date: 20180607-08:46:54
go version: go1.10.2
Prometheus configuration file: