prometheus 2.4.3 not increasing shards to match remote write capacity #4771
Comments
I just noticed that 2.4.3 seems to have changed the defaults for the remote write queue configuration. Bumping my 2.4.3 instance to a higher capacity and retry count seems to fix my issue (though it also lingers at 2 shards for longer, which isn't quite enough either). I'm not sure if the documentation is out of date, or if this is a bug in prometheus. As an aside, is there a way to hint the number of shards prometheus should use at startup? It takes about 5 minutes to get to a stable value now, and took 3 hours to get there with the default values.
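For reference, the tuning I'm describing sits under remote_write's queue_config. A rough sketch of what I changed (the endpoint URL is a placeholder and the numbers are illustrative rather than recommended values):

```yaml
# Rough sketch of the queue tuning described above.
# The URL is a placeholder and the values are illustrative, not the shipped defaults.
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"
    queue_config:
      capacity: 100000      # samples buffered per shard before dropping (bumped up)
      max_retries: 10       # how many times a failed batch send is retried (bumped up)
```

A larger per-shard capacity mostly trades memory for headroom while the shards catch up, which is presumably why the defaults moved the other way.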
This comment has been minimized.
This comment has been minimized.
It was, and has been fixed by #4715. It will appear correctly once v2.5.0 is out.
Thanks. So my original issue comes down to a lack of tuning of the remote storage queue on my part, then. I'm happy to close out the issue on that basis, or leave it open if we believe the default behavior should be better.
I think we can close this issue. The rationale for changing the defaults was detailed here: the new defaults minimize the risk of Prometheus being OOM-killed when the remote write endpoint is down.
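As a rough illustration (the numbers here are made up for the arithmetic, not the actual defaults): since the queue buffers up to `capacity` samples per shard, an instance that scales out to 1,000 shards with a per-shard capacity of 100,000 could hold 1,000 × 100,000 = 100 million samples in memory while the endpoint is unreachable, whereas a per-shard capacity of 10,000 caps that at 10 million.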
BertHartm closed this Oct 25, 2018
ntindall commented Jan 3, 2019
@brian-brazil @tomwilkie @gouthamve - we should document changes to default configuration options in the release notes. I don't see anything about it in the release notes, and we just got a bit burned by this when upgrading.
The change was documented in the changelog in the v2.4.0 section. |
BertHartm commented Oct 22, 2018
Bug Report
What did you do?
I upgraded from prometheus 2.2.1 to 2.4.3 while using the remote write endpoint.
What did you expect to see?
I was using 3 remote_write shards before and not dropping metrics. I expected that to continue. I write about 30k metrics/s, and the scrape interval is 15s.
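For context, the remote_write block itself is essentially stock. A minimal sketch of the setup (the URL is a placeholder for our m3coordinator endpoint, not the real config):

```yaml
# Minimal sketch of the remote_write setup described above; the URL is a placeholder.
remote_write:
  - url: "http://m3coordinator:7201/api/v1/prom/remote/write"
    # no queue_config overrides, so whichever defaults ship with the running version apply
```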
What did you see instead? Under which circumstances?
The number of remote storage shards goes to 3 on the new version by default, but it is dropping about 40k samples per minute. I would expect it to increase the number of shards to compensate, but that doesn't seem to happen. Taking down my remote_write endpoint causes the number of shards to increase, which resolves the issue until a restart.
Environment
Linux 4.15.0-1023-azure x86_64
prometheus, version 2.4.3 (branch: HEAD, revision: 167a4b4)
build user: root@1e42b46043e9
build date: 20181004-08:42:02
go version: go1.11.1
Then I see lots of "remote storage queue full" messages in the logs. If I briefly take down m3coordinator, the shard count goes up and I see fewer of the "remote storage queue full" messages afterwards.
It seems like the dropped samples are happening roughly every minute (we have a few scrape jobs that run every minute), and `max_over_time(prometheus_remote_storage_queue_length[1m])` is only showing values around 40k sometimes, usually sub-10k. Also, 5 shards doesn't completely solve the dropped samples, only mostly.
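In case it helps anyone else debugging this, these are the queries I've been watching alongside the one above (the dropped/failed metric names are what I believe this version exposes for remote storage; treat them as an assumption if you're on a different release):

```
# Samples dropped by the remote write queue, per second (metric name assumed for this release).
rate(prometheus_remote_storage_dropped_samples_total[5m])

# Samples that failed to send to the remote endpoint, per second (metric name assumed).
rate(prometheus_remote_storage_failed_samples_total[5m])

# Peak queue length over the last minute, as quoted above.
max_over_time(prometheus_remote_storage_queue_length[1m])
```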