Allow users to tweak remote queue parameters #2445
Comments
The primary goal is not to take down Prometheus, while also not doing things that'd be difficult to deal with on the other end (e.g. 1k shards for everyone would eat unnecessary resources on both ends for the vast majority of users). I propose we have something that is effectively a memory limit on the Prometheus side (which may be tied automatically to throughput) and try to automatically run a sufficient number of workers/shards to manage that.
With a cap, right? As you say, we almost never want 1k shards. There also has to be something to optimise for - otherwise we'd always just run max shards. For example, consider a user sending 100 samples/s with a batch size of 100:

If we just ran 100 shards, the user would have to wait 100s to fill a batch up and flush. If we ran fewer shards (for 100/s, one shard), we could give them a latency of 1s. If we capped the max buffer size, we could do something like: every time the buffer fills up and we drop a sample, double the number of shards, up to max shards. For the negative pressure, we could do something like: every time a shard has to wait more than the target number of seconds to fill a batch, reduce the number of shards. And then add some hysteresis. It all sounds a little complicated though...
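For illustration, here is a minimal Go sketch of that buffer-overflow / batch-wait heuristic, using a simple cooldown as the hysteresis. All names and thresholds here are hypothetical, not Prometheus code:

```go
package remote

import "time"

const (
	minShards       = 1
	maxShards       = 100
	targetBatchWait = 5 * time.Second  // how long a shard may wait to fill a batch
	reshardCooldown = 30 * time.Second // hysteresis: limit how often we reshard
)

type resharder struct {
	numShards   int
	lastReshard time.Time
}

// onSampleDropped is called when the buffer fills up and a sample is dropped:
// double the shard count, up to maxShards.
func (r *resharder) onSampleDropped() {
	if time.Since(r.lastReshard) < reshardCooldown {
		return
	}
	r.numShards *= 2
	if r.numShards > maxShards {
		r.numShards = maxShards
	}
	r.lastReshard = time.Now()
}

// onBatchWait is called with the time a shard waited to fill its last batch:
// if the wait exceeded the target, shed a shard to keep batches full.
func (r *resharder) onBatchWait(waited time.Duration) {
	if waited <= targetBatchWait || time.Since(r.lastReshard) < reshardCooldown {
		return
	}
	if r.numShards > minShards {
		r.numShards--
		r.lastReshard = time.Now()
	}
}
```

Doubling on overflow reacts quickly to bursts, while shrinking one shard at a time keeps the downward adjustments gentle.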
Something along those lines sounds about right. I suggest we hardcode a maximum delay somewhere in the 1-10s range.
SGTM. I'll let this sit for a few days before I start anything - I'm hoping someone knows of a fancy algorithm that will stop this oscillating, etc. Something like a PID controller.
tomwilkie referenced this issue on Feb 28, 2017: 0-value data in cortex-ui but not plain prom #304 (closed)
I've coded up a quick experiment using a PID to control resharding: master...tomwilkie:remote-write-pid-sharding

TL;DR: the PID interacts badly with the queuing. I think a simpler approach might work better, so I'm going to try that tomorrow.
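For readers unfamiliar with the idea, a textbook PID controller looks roughly like the Go sketch below. This is a generic illustration, not the code in the linked branch; the error term fed to it would be something like the gap between a target and the observed queue latency or send rate.

```go
package remote

// pid is a textbook proportional-integral-derivative controller.
type pid struct {
	kp, ki, kd float64 // proportional, integral, and derivative gains
	integral   float64 // accumulated error
	prevErr    float64 // error from the previous step
}

// next returns the control output (e.g. a change in shard count) for the
// current error and the time step dt, in seconds.
func (c *pid) next(err, dt float64) float64 {
	c.integral += err * dt
	derivative := (err - c.prevErr) / dt
	c.prevErr = err
	return c.kp*err + c.ki*c.integral + c.kd*derivative
}
```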
tomwilkie referenced this issue on Mar 8, 2017: Scheme for automatically handling remote queue backlog / failures / scaling #1933 (closed)
I've got rid of the PID and just done a proportional feedback loop: master...tomwilkie:remote-write-sharding

It works much better, but there's still some oscillation. I'm going to add some of the concepts from the PID controller back in; hopefully that will solve the problem.
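A hedged sketch of a proportional approach along these lines (again illustrative, not the code in the linked branch): size the shard count to the ratio of the observed inbound sample rate to the rate a single shard can send, and ignore small changes to damp the oscillation.

```go
package remote

import "math"

// desiredShards sizes the shard count to the ratio of the observed inbound
// sample rate to the per-shard send rate, clamped to [min, max]. Changes
// smaller than ~30% of the current count are ignored as hysteresis; the
// numbers here are illustrative only.
func desiredShards(samplesInPerSec, samplesOutPerShardPerSec float64, current, min, max int) int {
	if samplesOutPerShardPerSec <= 0 {
		return current
	}
	desired := int(math.Ceil(samplesInPerSec / samplesOutPerShardPerSec))
	if math.Abs(float64(desired-current)) < 0.3*float64(current) {
		return current // don't flap on small changes
	}
	if desired < min {
		return min
	}
	if desired > max {
		return max
	}
	return desired
}
```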
tomwilkie referenced this issue on Mar 13, 2017: Dynamically reshard the QueueManager based on observed load. #2494 (merged)
juliusv closed this in #2494 on Mar 20, 2017
lock bot commented on Mar 23, 2019: This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.
tomwilkie commented Feb 23, 2017
See https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L118
We should at least expose these in the config file - right now we're limiting users to 10 shards and 100 samples per batch, so for a service with 100ms latency that's only 10k samples/s (10 shards × 100 samples ÷ 0.1s per request).
We also discussed back in #1931 making this dynamic, so now is probably a good time to discuss that. I guess users probably want to bound how long data sits in the queue, bound the maximum parallelism (so as not to overwhelm their Prometheus), and bound the batch size? From there we could minimise parallelism / maximise batch 'fullness'?
Thoughts?
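As a strawman for the "expose these in the config file" part, the tunable knobs might look something like the struct below; the field names and semantics are illustrative, not the shipped Prometheus configuration.

```go
package remote

import "time"

// QueueConfig is a sketch of the knobs a user might want to tune for the
// remote-write queue: how much data may be buffered, how parallel sending
// is allowed to be, and how batches are assembled.
type QueueConfig struct {
	MaxShards         int           // upper bound on send parallelism
	Capacity          int           // buffered samples per shard
	MaxSamplesPerSend int           // batch size per request
	BatchSendDeadline time.Duration // flush a partial batch after this long
}
```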