Prometheus 2.0 reload configuration issues #3707
Comments
One more case of low reload performance with Prometheus 2.x. Our Prometheus is deployed in OpenShift and has about 60 targets monitored via the snmp, blackbox, and other exporters. When we try to update the Prometheus configuration from the external network, we get a 504 error (30-second timeout). When we trigger the reload from the internal network, we get a response reporting a successful update after 90-120 seconds. Prometheus started on a local PC:
Prometheus configuration:
Chevron94 changed the title from "Prometheus 2.0 unable to reload configuration if remote storage is down" to "Prometheus 2.0 reload configuration issues" on Jan 23, 2018.
For the case with 2000 targets, from a trace (Prometheus 2.0).
Chevron94 closed this on Jan 29, 2018.
Chevron94 reopened this on Jan 29, 2018.
cc @tomwilkie
tomwilkie added the component/remote storage label on Jan 29, 2018.
I think this is a dupe of #2972.
Sorry, just seen the latest update; this looks more retrieval-related.
tomwilkie removed the component/remote storage label on Jan 29, 2018.
@tomwilkie I did a bit of tracing as well, and the blocking happens in prometheus/storage/remote/queue_manager.go, lines 274 to 283 in 2dda577. I will continue digging, but some more input would be useful.
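For context, here is a minimal sketch of the blocking pattern that the trace points at. This is not the code at that commit; the names and retry logic are simplified assumptions. The idea: each shard goroutine keeps retrying sends while the remote is unreachable, so closing the queues and waiting on the WaitGroup never returns.

```go
// Sketch only: close-then-wait shutdown plus a retry loop that keeps the
// shard goroutine alive while the remote write endpoint is down.
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

type shards struct {
	queues []chan string // one queue of serialized samples per shard
	wg     sync.WaitGroup
}

func (s *shards) start() {
	for _, q := range s.queues {
		s.wg.Add(1)
		go s.runShard(q)
	}
}

func (s *shards) runShard(q chan string) {
	defer s.wg.Done()
	for sample := range q {
		// Retry until the send succeeds; while the remote storage is
		// unreachable this loop never makes progress, so the goroutine
		// never drains its queue and never exits.
		for {
			if err := send(sample); err == nil {
				break
			}
			time.Sleep(time.Second)
		}
	}
}

// stop closes the queues and waits for all shards to drain; this is the
// step that blocks a config reload when the remote endpoint is down.
func (s *shards) stop() {
	for _, q := range s.queues {
		close(q)
	}
	s.wg.Wait()
}

func send(sample string) error {
	return errors.New("no route to host") // simulate the remote storage being down
}

func main() {
	s := &shards{queues: []chan string{make(chan string, 1)}}
	s.start()
	s.queues[0] <- "sample"

	done := make(chan struct{})
	go func() {
		s.stop()
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("stop() returned")
	case <-time.After(3 * time.Second):
		fmt.Println("stop() still blocked after 3s - this mirrors the hung reload")
	}
}
```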
It's definitely the case that the remote write endpoint being down can block shutdown. This issue seemed to imply that retrieval could too. If that's not the case, close this as a dupe of #2972 and I'll do a fix.
OK, I will wait for your fix and will continue digging if it doesn't resolve the problem.
tomwilkie referenced this issue on Jan 31, 2018: Only give remote queues 1 minute to flush samples on shutdown. #3773 (closed)
vitaly-m commented on Jan 31, 2018:
Hi, unfortunately we have this issue not only when the remote storage is unavailable, but also when there are many targets (configured using file_sd_config).
Yes, that is also a problem and should be addressed in #3762.
@tomwilkie, @krasi-georgiev
The guarantee we offer upstream remote write endpoints is in-order sample delivery; without the wait group, an old shard can continue flushing after a new shard starts, delivering samples out of order.
What if we made stopping the shards "force"-able?

```go
func (s *shards) stop(force bool) {
	for _, shard := range s.queues {
		close(shard)
	}
	if !force {
		s.wg.Wait()
	}
}
```
If you skip the wait, samples will be sent out of order. If you want to force the shards to stop, you need #3773.
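For illustration, here is a rough sketch of the timeout-bounded shutdown idea that #3773 describes. The names (stopWithTimeout), the deadline parameter, and the demo values are assumptions for the sketch, not the actual patch: wait on the WaitGroup in a helper goroutine and give up after a deadline, so a dead remote endpoint can only delay a reload or shutdown by a bounded amount.

```go
// Sketch only: bound the time spent waiting for shards to flush on shutdown.
// stopWithTimeout is an illustrative name, not the Prometheus API.
package main

import (
	"log"
	"sync"
	"time"
)

// stopWithTimeout closes the shard queues and waits for the senders to drain,
// but gives up after the deadline so a reload or shutdown cannot block forever.
func stopWithTimeout(queues []chan string, wg *sync.WaitGroup, deadline time.Duration) {
	for _, q := range queues {
		close(q)
	}

	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		log.Println("remote queues flushed cleanly")
	case <-time.After(deadline):
		log.Println("failed to flush all samples on shutdown; giving up")
	}
}

func main() {
	var wg sync.WaitGroup
	q := make(chan string, 1)
	wg.Add(1)
	go func() {
		defer wg.Done()
		for range q {
			time.Sleep(time.Hour) // simulate a send that never succeeds while the remote is down
		}
	}()
	q <- "sample"

	// #3773/#4187 talk about a one-minute deadline; a short one here keeps the demo quick.
	stopWithTimeout([]chan string{q}, &wg, 2*time.Second)
}
```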
tomwilkie referenced this issue on May 23, 2018: Only give remote queues 1 minute to flush samples on shutdown. #4187 (merged)
Thanks, will look at it.
tomwilkie closed this in #4187 on May 29, 2018.

Chevron94 opened this issue on Jan 19, 2018:
What did you do?
Prometheus sends samples to our remote storage. At one point, when the remote storage went down, we saw this in the logs:
level=warn ts=2018-01-19T08:29:08.516631616Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=1 err="Post <storage_url>: getsockopt: no route to host"
If we try to reload the Prometheus configuration by sending a POST to <prometheus_url>/-/reload while the remote storage is down, the request never finishes. It looks like Prometheus keeps trying to send data to the storage over and over before reloading the configuration.
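For reference, a small sketch of how the reload can be triggered and the hang observed. The URL is a placeholder and the client timeout is only there so the probe itself does not wait forever; note that on Prometheus 2.x the /-/reload endpoint must be enabled with the --web.enable-lifecycle flag.

```go
// Sketch only: trigger a Prometheus config reload and report how long it takes.
package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	client := &http.Client{Timeout: 5 * time.Minute} // avoid hanging indefinitely
	start := time.Now()

	// Placeholder URL; substitute the real Prometheus address.
	resp, err := client.Post("http://prometheus.example.com:9090/-/reload", "", nil)
	if err != nil {
		log.Fatalf("reload did not complete: %v", err)
	}
	defer resp.Body.Close()

	log.Printf("reload returned %s after %s", resp.Status, time.Since(start))
}
```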
What did you expect to see?
Prometheus should reload the configuration.
What did you see instead? Under which circumstances?
The configuration was not reloaded; there was no response to the POST <prometheus_url>/-/reload request.
Environment
System information:
Linux 3.10.0-514.26.2.el7.x86_64 x86_64
Prometheus version:
2.0.0
Prometheus configuration file: