
Allow users to tweak remote queue parameters #2445

Closed
tomwilkie opened this Issue Feb 23, 2017 · 7 comments

tomwilkie commented Feb 23, 2017

See https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L118

We should at least expose these in the config file - right now we're limiting users to 10 shards, 100 samples/batch - so for a service with 100ms latency, that's only 10k samples/s.
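For illustration, exposing them could look something like the sketch below - the field and YAML names here are guesses for the sake of the example, not the actual queue_manager.go parameters:

```go
package config

import "time"

// QueueConfig is an illustrative sketch of what could be exposed in the
// config file; field names, YAML keys and comments are assumptions, not
// the real queue_manager.go configuration.
type QueueConfig struct {
	MaxShards         int           `yaml:"max_shards"`           // cap on parallel senders
	MaxSamplesPerSend int           `yaml:"max_samples_per_send"` // samples per remote write request
	QueueCapacity     int           `yaml:"queue_capacity"`       // samples buffered per shard
	BatchSendDeadline time.Duration `yaml:"batch_send_deadline"`  // flush partial batches after this long
}

// Throughput ceiling from the numbers above:
// 10 shards * 100 samples/batch / 0.1s per request = 10,000 samples/s.
```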

We also discussed making this dynamic back in #1931 - so now is probably a good time to discuss that. I guess users want to bound how long data sits in the queue, and to bound the maximum parallelism (so as not to overwhelm their Prometheus) and the batch size? From there we could minimise parallelism / maximise batch 'fullness'?

Thoughts?

brian-brazil commented Feb 23, 2017

The primary goal is to not take down the Prometheus, while also not doing things that'd be difficult to deal with on the other end (e.g. 1k shards for everyone would eat unnecessary resources on both ends for the vast majority of users).

I propose we have something that is effectively a memory limit on the Prometheus side (which may be tied automatically to throughput) and try to automatically run a sufficient number of workers/shards to manage that.
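As a back-of-the-envelope sketch of that idea - deriving a shard count from a memory budget instead of exposing shards directly - something like the following; every name and number here is made up for illustration, not a real implementation:

```go
package remote

// desiredShardsForBudget sketches the "memory limit drives shard count"
// suggestion: work out how many samples fit in the budget and size the
// shard pool from that, capped at a maximum. All parameters are
// illustrative assumptions.
func desiredShardsForBudget(memoryBudgetBytes, bytesPerSample, samplesPerShardQueue, maxShards int) int {
	samplesWeCanBuffer := memoryBudgetBytes / bytesPerSample
	shards := samplesWeCanBuffer / samplesPerShardQueue
	if shards < 1 {
		shards = 1
	}
	if shards > maxShards {
		shards = maxShards
	}
	return shards
}
```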

tomwilkie commented Feb 23, 2017

automatically run a sufficient number of workers/shards to manage that.

With a cap, right? As you say, we almost never want 1k shards.

There also has to be something to optimise for - otherwise we'd always just run max shards. For example, consider:

  • 100 max shards
  • a batch size of 100
  • a low sample rate (100/s)

If we just ran 100 shards, then the user would have to wait 100s to fill a batch up and flush it. If we ran fewer shards (for 100/s, one shard) we could give them a latency of 1s.
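The arithmetic spelled out as a toy calculation (not code from the queue manager):

```go
package main

import "fmt"

func main() {
	const (
		batchSize  = 100.0 // samples per batch
		sampleRate = 100.0 // total samples/s arriving
	)
	for _, shards := range []float64{100, 10, 1} {
		perShardRate := sampleRate / shards    // samples/s seen by each shard
		fillTime := batchSize / perShardRate   // seconds to fill one batch
		fmt.Printf("%3.0f shards -> %5.0fs to fill a batch\n", shards, fillTime)
	}
}
```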

If we capped the max buffer size, we could do something like this: every time the buffer fills up and we drop a sample, double the number of shards, up to the max. For the negative pressure, every time a shard has to wait more than some target number of seconds to fill a batch, reduce the number of shards. And then add some hysteresis.
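A rough sketch of that heuristic - names, thresholds and structure are all made up for illustration, this isn't code from any branch:

```go
package remote

import "time"

// reshardHeuristic sketches the double-on-drop / shrink-on-slow-batch idea
// described above. Everything here is an illustrative assumption.
type reshardHeuristic struct {
	numShards     int
	maxShards     int
	targetFill    time.Duration // how long we're willing to wait to fill a batch
	lastReshard   time.Time
	reshardPeriod time.Duration // hysteresis: don't reshard more often than this
}

// onSampleDropped applies positive pressure: the buffer overflowed, so double
// the shard count, up to the cap.
func (h *reshardHeuristic) onSampleDropped(now time.Time) {
	if now.Sub(h.lastReshard) < h.reshardPeriod {
		return // hysteresis
	}
	if h.numShards*2 <= h.maxShards {
		h.numShards *= 2
	} else {
		h.numShards = h.maxShards
	}
	h.lastReshard = now
}

// onBatchFilled applies negative pressure: if a shard waited longer than the
// target to fill a batch, shed a shard.
func (h *reshardHeuristic) onBatchFilled(now time.Time, waited time.Duration) {
	if now.Sub(h.lastReshard) < h.reshardPeriod || waited <= h.targetFill {
		return
	}
	if h.numShards > 1 {
		h.numShards--
	}
	h.lastReshard = now
}
```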

It all sounds a little complicated though...

brian-brazil commented Feb 23, 2017

Something along those lines sounds about right.

I suggest we hardcode a maximum delay somewhere in the 1-10s range.

tomwilkie commented Feb 23, 2017

SGTM. I'll let this sit for a few days before I start anything - I'm hoping someone knows of a fancy algorithm that will stop this oscillating, etc. Something like a PID controller.
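For reference, the textbook form of a PID controller looks something like this; the gains and the choice of error signal (e.g. queue length vs. a setpoint) are assumptions, and this is a generic sketch rather than anything tied to the queue manager:

```go
package remote

// pid is a textbook PID controller sketch. Tuning of kp/ki/kd and the
// definition of the error signal are left open; all illustrative.
type pid struct {
	kp, ki, kd float64
	integral   float64
	prevErr    float64
}

// next returns the control output for the current error, with dt in seconds.
func (c *pid) next(err, dt float64) float64 {
	c.integral += err * dt
	derivative := (err - c.prevErr) / dt
	c.prevErr = err
	return c.kp*err + c.ki*c.integral + c.kd*derivative
}
```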

tomwilkie commented Mar 7, 2017

I've coded up a quick experiment using a PID to control resharding:

master...tomwilkie:remote-write-pid-sharding

TL;DR: the PID interacts badly with the queuing. I think a simpler approach might work better, so I'm going to try that tomorrow.

tomwilkie commented Mar 8, 2017

I've got rid of the PID and just done a proportional feedback loop:

master...tomwilkie:remote-write-sharding

It works much better, but there's still some oscillation. I'm going to add some of the concepts from the PID controller back in; hopefully that will solve the problem.
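For anyone following along, a proportional-only loop has roughly this shape - again just a sketch with made-up names and thresholds, not the code in the branch:

```go
package remote

// desiredShards is a proportional-feedback sketch: scale the shard count by
// the ratio of the incoming sample rate to what the current shards are
// managing to send, and only act when the change is large enough (crude
// hysteresis). The 30% dead band and all names are illustrative assumptions.
func desiredShards(current int, samplesInRate, samplesOutRate float64, maxShards int) int {
	if samplesOutRate <= 0 {
		return current
	}
	desired := float64(current) * samplesInRate / samplesOutRate
	// Ignore small deviations so we don't oscillate on noise.
	if desired > float64(current)*0.7 && desired < float64(current)*1.3 {
		return current
	}
	n := int(desired + 0.5)
	if n < 1 {
		n = 1
	}
	if n > maxShards {
		n = maxShards
	}
	return n
}
```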

lock bot commented Mar 23, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 23, 2019
