Remote storage write client should retry on certain errors, to deal with temporary issues #2512

Closed
tomwilkie opened this Issue Mar 20, 2017 · 8 comments


tomwilkie commented Mar 20, 2017

We should minimise the chance of dropping samples when sending them to remote storage by retrying on a certain class of errors.

We currently just drop the batch on the floor: https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L282

Ideally we would retry (not re-queue, as we want to preserve the ordering guarantees). We should cap the retries at a configurable limit, and ensure this doesn't interact badly with the dynamic re-sharding behaviour by including the retries in the observed batch latency.
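For illustration, a minimal sketch of capped in-place retries, with the retries counted towards the observed batch latency used for re-sharding; the function names and signatures here are hypothetical, not the actual queue_manager.go API:

```go
package remote

import "time"

// sendWithRetry is a sketch of retrying one batch in place (rather than
// re-queuing it, so ordering is preserved), with the number of retries
// capped and the time spent retrying included in the latency observation
// used for re-sharding decisions. All names here are illustrative.
func sendWithRetry(send func() error, maxRetries int, backoff time.Duration, observe func(time.Duration)) error {
	start := time.Now()
	defer func() { observe(time.Since(start)) }()

	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = send(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	// Retries exhausted: give up on this batch so later batches still go
	// out in order.
	return err
}
```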


tomwilkie commented Mar 20, 2017

The "class" of errors on which we retry should be network errors and http 5xx. We shouldn't retry on 4xx, as the remote storage might not be able to accept the batch for some reason (rate limits, exceeding label lengths or cardinality etc).

Any thoughts? I can probably code this up this week.


brancz commented Mar 20, 2017

Sounds reasonable. We've done something similar in the Alertmanager, where notifications are retried on certain errors from the providers.


juliusv commented Mar 21, 2017

This seems fine to me as long as the amount of memory this can fill up is very limited (causing an OOM is the big danger, and the samples in the remote storage code are in a representation that uses close to the maximum amount of memory) and it does not result in backpressure to the scrapers.


tomwilkie commented Mar 21, 2017

the amount of memory this can fill up is very limited

Yeah, we limit it to 50,000,000 samples (500 shards * 100 samples per batch * 1000 batches) - https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L44

On most systems it will be a lot less, as it won't scale up to 500 shards; 500 shards corresponds to roughly 1 million samples/s.
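Spelled out with illustrative constants (the real names and defaults live in storage/remote/queue_manager.go):

```go
package remote

// Illustrative constants for the worst-case bound quoted above; the actual
// names and defaults are defined in storage/remote/queue_manager.go.
const (
	maxShards         = 500  // only reached at roughly 1 million samples/s
	maxSamplesPerSend = 100  // samples per batch
	queueCapacity     = 1000 // batches buffered per shard
)

// 500 * 100 * 1000 = 50,000,000 samples buffered in the worst case.
const maxBufferedSamples = maxShards * maxSamplesPerSend * queueCapacity
```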

it does not result in backpressure to the scrapers.

It will never: https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L249


juliusv commented Mar 21, 2017

it does not result in backpressure to the scrapers.

It will never:
https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L249

Right, that's how the scraper checks beforehand whether it should throttle itself, but the property that Append() never blocks should be maintained too. Right now it ensures that by simply throwing samples away when the shard's queue is full (https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L239), so that obviously needs to stay that way as well. You probably intend to do that anyway and just do the retrying within the sharded queues, up to their capacity.
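For illustration, the non-blocking behaviour described here boils down to a select with a default branch; a minimal sketch with hypothetical types, not the actual queue_manager.go code:

```go
package remote

// sample and shard are stand-ins for the real types; names are illustrative.
type sample struct {
	value     float64
	timestamp int64
}

type shard struct {
	queue chan sample // bounded per-shard buffer
}

// enqueue attempts a non-blocking send. When the shard's queue is full the
// sample is dropped, so the scrape path is never blocked by remote storage.
func (s *shard) enqueue(smp sample) bool {
	select {
	case s.queue <- smp:
		return true
	default:
		return false // queue full: drop rather than apply backpressure
	}
}
```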


tomwilkie commented Mar 21, 2017

I see - I do intend to keep it like that. The dynamic sharding should ensure the queues never fill up, and if they do you've probably got bigger problems...


juliusv commented Mar 21, 2017

Great.

brian-brazil changed the title from "Remote storage client should retry on certain errors, to deal with temporary issues" to "Remote storage write client should retry on certain errors, to deal with temporary issues" Apr 5, 2017

juliusv closed this in #2552 Apr 6, 2017


lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited the conversation to collaborators Mar 23, 2019
