Remote storage ordering #1931
Conversation
👍, like in the downstream copy of this (tomwilkie#72 (comment)). I agree that it seems like @brian-brazil's concerns are orthogonal to the parallelization/ordering you're doing here, but would like him to confirm.
I see; what's the issue number? My aim was not to address all existing problems with this code in this PR, although I'd love to discuss / implement solutions to the problems you describe in the not-too-distant future, especially as we're likely to be the ones that hit them!
I don't follow; why can't the 'retrieval' Prometheus alert on this problem? The metrics expose dropped samples and remote latencies (I also happened to clean up the metrics in this PR - mind taking a look?)
There isn't one, but it's a known problem. If we were just talking about long-term storage, some flags would be okay, but that's not the case with distributed storage.
In the Frankenstein architecture there's nothing scraping the Prometheus instance having the problem.
Why so? I'm not against making this dynamic per se, but I don't see it as a blocker for this PR - this is no worse than what exists, and I've factored out the config so we can make it configurable in the next few days. I'd be happy to make it configurable in this PR, and then, by running a bunch of these, we can learn how to make them more dynamic where appropriate.
Interestingly, we have the retrieval scrapers set up to scrape themselves, so we could indeed detect this problem upstream / service-side. And we're also planning to implement alerts for when we don't see any incoming samples for some duration. I think we're now straying off topic for this PR, though.
With long-term storage, it doesn't matter if the long-term storage doesn't work, as you still have all the data in your local Prometheus and almost all your alerts and graphs will be working off that. With distributed storage, if the data can't get to/from the remote storage, you're toast.
Only if the data is making it to the distributed storage.
This would present as some samples coming through, but not all. Practically speaking, it'd top out at some upper bound determined by RTT, so gradual growth could get you into this situation.
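To make that bound concrete (the numbers here are purely illustrative, not from the PR): a sender that ships batches of 100 samples synchronously over a link with a 50 ms round trip can move at most 100 / 0.05 ≈ 2,000 samples/s per serial queue, so once ingestion grows past that rate the excess is dropped and only a fraction of samples arrives downstream.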
@brian-brazil Can we agree that this is an orthogonal problem which can be fixed independently and that this PR doesn't materially make anything worse with regards to that?
Okay, but this needs to be handled automatically.
Agreed, filed #1933 for the follow-up.
	BatchSendDeadline time.Duration // Maximum time a sample will wait in buffer.
}

var defaultConfig = StorageQueueManagerConfig{
One last nit: if something is called defaultConfig, I would expect that the constructor takes a pointer to a config and, if it's nil, uses the default config automatically, rather than having every caller pass it in.
👍 done, PTAL
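For readers following along, here is a minimal sketch of the pattern being requested. This is not the actual Prometheus code; the field names other than BatchSendDeadline and the default values are assumptions for illustration.

```go
package remote

import "time"

// StorageQueueManagerConfig holds the queue tunables discussed above.
// QueueCapacity and Shards are illustrative names, not the real fields.
type StorageQueueManagerConfig struct {
	QueueCapacity     int           // Per-shard queue capacity.
	Shards            int           // Number of queues flushed in parallel.
	BatchSendDeadline time.Duration // Maximum time a sample will wait in buffer.
}

var defaultConfig = StorageQueueManagerConfig{
	QueueCapacity:     100000,
	Shards:            10,
	BatchSendDeadline: 5 * time.Second,
}

// StorageQueueManager is reduced to just its config for this sketch.
type StorageQueueManager struct {
	cfg StorageQueueManagerConfig
}

// NewStorageQueueManager falls back to defaultConfig when cfg is nil,
// so callers no longer have to pass the default in explicitly.
func NewStorageQueueManager(cfg *StorageQueueManagerConfig) *StorageQueueManager {
	if cfg == nil {
		c := defaultConfig
		cfg = &c
	}
	return &StorageQueueManager{cfg: *cfg}
}
```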
…n both samples and batches, in a consistent fashion. Also, report total queue capacity of all queues, i.e. capacity * shards.
Force-pushed from ec6f7d0 to d41d913.
By splitting the single queue into multiple queues and flushing each individual queue serially (and all queues in parallel), we can guarantee to preserve the order of timestamps in samples sent to downstream systems.
Also, rationalise the metrics exported by the queue manager.
Fixes #1843
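A minimal sketch of the ordering guarantee the description above relies on, under the assumption that samples are routed to shards by a hash of their series identity. The type and function names, the batching details, and the omission of BatchSendDeadline handling are all simplifications for illustration, not the PR's actual code.

```go
package remote

import (
	"hash/fnv"
	"sync"
)

// Sample is a simplified stand-in for a metric sample; the real type
// carries full label sets rather than a single metric name.
type Sample struct {
	Metric    string
	Timestamp int64
	Value     float64
}

// shardedQueue preserves per-series timestamp order: each series always
// hashes to the same shard, and each shard is drained by exactly one
// goroutine, so samples for a series are sent in arrival order while
// the shards themselves flush in parallel.
type shardedQueue struct {
	shards []chan Sample
	send   func(batch []Sample) // Remote write call; assumed to be provided.
	wg     sync.WaitGroup
}

func newShardedQueue(numShards, capacity int, send func([]Sample)) *shardedQueue {
	q := &shardedQueue{
		shards: make([]chan Sample, numShards),
		send:   send,
	}
	for i := range q.shards {
		q.shards[i] = make(chan Sample, capacity)
		q.wg.Add(1)
		go q.runShard(q.shards[i]) // All shards flush in parallel.
	}
	return q
}

// Append routes a sample to the shard owning its series.
func (q *shardedQueue) Append(s Sample) {
	h := fnv.New32a()
	h.Write([]byte(s.Metric))
	q.shards[int(h.Sum32()%uint32(len(q.shards)))] <- s
}

// runShard drains one queue serially, so ordering within it is preserved.
// BatchSendDeadline-style deadline flushing is omitted for brevity.
func (q *shardedQueue) runShard(ch chan Sample) {
	defer q.wg.Done()
	const maxBatch = 100
	batch := make([]Sample, 0, maxBatch)
	for s := range ch {
		batch = append(batch, s)
		if len(batch) == maxBatch {
			q.send(batch)
			batch = batch[:0]
		}
	}
	if len(batch) > 0 {
		q.send(batch)
	}
}

// Stop closes the shards and waits for in-flight batches to flush.
func (q *shardedQueue) Stop() {
	for _, ch := range q.shards {
		close(ch)
	}
	q.wg.Wait()
}
```

The key property is that ordering is only guaranteed per series, which is what downstream systems need; total capacity across the manager is then capacity * shards, matching the metric change noted in the force-pushed commit message.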