Remote storage write client should retry on certain errors, to deal with temporary issues #2512
Comments
The "class" of errors on which we retry should be network errors and HTTP 5xx. We shouldn't retry on 4xx, as the remote storage might not be able to accept the batch for some reason (rate limits, exceeding label lengths or cardinality, etc.). Any thoughts? I can probably code this up this week.
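For illustration, a minimal sketch (in Go, with hypothetical names; not the actual Prometheus code) of the classification being proposed: retry on network-level failures and HTTP 5xx, give up on 4xx.

```go
package remote

import (
	"net"
	"net/http"
)

// retriable is a hypothetical helper: it reports whether a failed send
// should be retried. Network-level failures and HTTP 5xx responses are
// treated as temporary; 4xx responses (rate limits, label-length or
// cardinality violations, ...) are treated as permanent, so the batch
// is dropped rather than retried.
func retriable(err error, resp *http.Response) bool {
	if _, ok := err.(net.Error); ok {
		return true // timeouts, connection refused, DNS failures, ...
	}
	return resp != nil && resp.StatusCode >= 500 && resp.StatusCode < 600
}
```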
tomwilkie referenced this issue on Mar 20, 2017: don't lose data during a prom->cortex network disruption #263 (closed)
Sounds reasonable. We've done things like this in the Alertmanager as well, where notifications are retried on certain errors from the providers.
This seems fine to me as long as the amount of memory this can fill up is very limited (causing an OOM is the big danger, and the samples in the remote storage code are in a representation that uses nearly the maximum amount of memory) and it does not result in backpressure to the scrapers.
Yeah, we limit it to 50,000,000 samples (500 shards * 100 samples per batch * 1000 batches) - https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L44. On most systems it will be a lot less, as it won't scale up to 500 shards; 500 shards is for 1 million samples/s.
It will never: https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L249
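As a back-of-the-envelope check on that worst case (the constant names here are illustrative, not the ones actually used in queue_manager.go):

```go
package remote

// Illustrative constants matching the figures quoted above; the real
// values live in storage/remote/queue_manager.go.
const (
	maxShards         = 500  // sized for roughly 1 million samples/s
	maxSamplesPerSend = 100  // samples per remote-write batch
	queueCapacity     = 1000 // batches buffered per shard

	// Worst case with every shard's queue full:
	// 500 * 100 * 1000 = 50,000,000 samples in memory.
	maxBufferedSamples = maxShards * maxSamplesPerSend * queueCapacity
)
```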
Right, that's how the scraper checks beforehand whether it should throttle itself, but the property that […]
I see - I do intend to keep it like that. The dynamic sharding should ensure the queues never get too full, and if they do you've probably got bigger problems...
Great.
tomwilkie referenced this issue on Apr 1, 2017: Remote writes: retry on recoverable errors. #2552 (merged)
brian-brazil changed the title from "Remote storage client should retry on certain errors, to deal with temporary issues" to "Remote storage write client should retry on certain errors, to deal with temporary issues" on Apr 5, 2017
juliusv closed this in #2552 on Apr 6, 2017
lock bot commented Mar 23, 2019
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
tomwilkie commented Mar 20, 2017
We should minimise the chance of dropping samples when sending them to remote storage by retrying on a certain class of errors.
We currently just drop the batch on the floor: https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L282
Ideally we would retry (not re-queue, as we want to preserve the ordering guarantees). We should cap the retries at a configurable limit, and ensure this doesn't interact badly with the dynamic re-sharding behaviour by including the retries in the observed batch latency.
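A minimal sketch of what that could look like (function and parameter names are hypothetical, not the eventual implementation in #2552): the batch is retried in place so ordering is preserved, the number of attempts is capped, and the returned duration covers the retries so the re-sharding logic sees the true batch latency.

```go
package remote

import "time"

// maxRetries and the backoff are illustrative; in practice they would be
// configurable.
const maxRetries = 3

// sendWithRetries retries the same batch in place (no re-queueing, so
// ordering is preserved), capping the number of attempts and backing off
// between them. It returns the total elapsed time, so that the latency
// observed by the dynamic re-sharding logic includes time spent retrying.
func sendWithRetries(send func() error, retriable func(error) bool) time.Duration {
	begin := time.Now()
	backoff := time.Second
	for attempt := 0; attempt < maxRetries; attempt++ {
		err := send()
		if err == nil || !retriable(err) {
			break
		}
		time.Sleep(backoff)
		backoff *= 2 // simple exponential backoff
	}
	return time.Since(begin)
}
```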