Remote queue manager should deal with blips in upstream latency #2648

Closed
tomwilkie opened this Issue Apr 21, 2017 · 6 comments

tomwilkie commented Apr 21, 2017

Currently the queue is per-shard and only buffers 100k samples; if you have a remote target that is usually very fast but periodically slow, one shard can easily overwhelm its queue.

We could consider having a globally fixed length queue that gets partitioned by the number of shards. Or we could make resharding happen when the queue is overwhelmed. Or we could just accept dropping samples in this case.
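
A minimal sketch (hypothetical names, not the Prometheus source) of the failure mode described above: each shard owns a fixed-size buffer, so a latency blip at the remote end fills that shard's queue and later samples have nowhere to go.

```go
package main

import "fmt"

// sample and shard are illustrative types only.
type sample struct {
	ts    int64
	value float64
}

type shard struct {
	queue   chan sample // fixed capacity, e.g. the 100k samples mentioned above
	dropped int
}

// enqueue never blocks: once the buffer fills during a latency blip, the only
// remaining options are back-pressure, resharding, or dropping the sample.
func (s *shard) enqueue(smp sample) {
	select {
	case s.queue <- smp:
	default:
		s.dropped++
	}
}

func main() {
	s := &shard{queue: make(chan sample, 2)} // tiny buffer to make the effect visible
	for i := 0; i < 5; i++ {
		s.enqueue(sample{ts: int64(i), value: 1})
	}
	fmt.Println("dropped:", s.dropped) // 3 of 5 samples dropped once the buffer filled
}
```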

danielshiplett commented Apr 21, 2017

More notes on what I was seeing. The time to call my remote write endpoint was typically 80ms at the 0.999 quantile, 20ms at 0.99, and 5ms at 0.9. Occasionally I would see blips where my 0.999 quantile was over 1s.

Since I switched to using Undertow as my embedded server, I'm seeing far fewer blips to 1s. I've also added more scrape targets, and I would say that Prometheus is now actively adjusting shard counts. I'm seeing no dropped records. Whether this is due to the fewer blips, the change to Undertow, or the more active sharding in Prometheus, I have no idea.

I'll report some more next week after this has had a chance to soak in for a while.

danielshiplett commented May 4, 2017

So things are generally running better now. I have 500,000 time series and see about 300 remote write requests per second (almost 2 million samples per minute). I still see occasional blips, but they occur much less frequently. My shard count seems to bounce around between 4 and 6.

However, I've also noticed that when I do a configuration reload (POST to /-/reload), my shard count gets reset to 1. Does Prometheus not save this information between runs? When it gets reset to 1, I lose write requests again until it works its shard count back up.

juliusv commented May 4, 2017

tomwilkie commented May 4, 2017

poblahblahblah commented May 29, 2018

It looks like the queue configuration was addressed in https://sourcegraph.com/github.com/prometheus/prometheus/-/commit/454b6611458d01398678baca372a57973df4215f and then documented in #4126
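
For context, a sketch of what that configuration surface looks like; the values below are examples only, not defaults, and the exact field set varies by Prometheus version (the documentation referenced above is authoritative).

```yaml
# Illustrative remote_write queue tuning; field names follow the Prometheus
# queue_config documentation, but the values here are examples only.
remote_write:
  - url: "http://remote-endpoint:9201/write"
    queue_config:
      capacity: 10000            # samples buffered per shard
      min_shards: 1
      max_shards: 200
      max_samples_per_send: 2000
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5s
```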

gouthamve added a commit to gouthamve/prometheus that referenced this issue Jul 18, 2018

Reload remote write only on config change
Right now it is quite disruptive, forgetting the old shard number and
starting from 1, which causes Prometheus to drop samples.

Related: prometheus#2648

Signed-off-by: Goutham Veeramachaneni <cs14btech11014@iith.ac.in>
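
A simplified sketch of the idea in that commit (hypothetical types, not the actual Prometheus code): on reload, keep the queue managers whose remote_write configuration is unchanged so their learned shard counts survive, and only restart the ones that actually changed.

```go
package main

import (
	"fmt"
	"reflect"
)

// Hypothetical, simplified types; not the real Prometheus remote-write code.
type remoteWriteConfig struct {
	URL           string
	QueueCapacity int
}

type queueManager struct {
	cfg    remoteWriteConfig
	shards int // learned by the resharding loop
}

// applyConfig keeps queue managers whose config is unchanged so their learned
// shard counts survive a reload, and only restarts the ones that changed.
func applyConfig(running []*queueManager, newCfgs []remoteWriteConfig) []*queueManager {
	var out []*queueManager
	for i, nc := range newCfgs {
		if i < len(running) && reflect.DeepEqual(running[i].cfg, nc) {
			out = append(out, running[i]) // unchanged: shard count preserved
			continue
		}
		out = append(out, &queueManager{cfg: nc, shards: 1}) // changed or new: re-learn from 1
	}
	return out
}

func main() {
	qm := &queueManager{cfg: remoteWriteConfig{URL: "http://remote:9201/write"}, shards: 6}
	after := applyConfig([]*queueManager{qm}, []remoteWriteConfig{{URL: "http://remote:9201/write"}})
	fmt.Println(after[0].shards) // 6: an unrelated reload no longer resets shards to 1
}
```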

tomwilkie commented Mar 4, 2019

I'd say we can close this: the new WAL-based remote_write code in 2.8 includes much more proactive resharding that responds to upstream latency variations very quickly.
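
To make that concrete, here is a rough sketch of the kind of calculation a latency-aware resharder performs; the parameter names and the exact formula are illustrative, not the 2.8 implementation.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// desiredShards estimates how much upstream time each sample costs and scales
// the shard count so send capacity keeps up with the inbound sample rate.
func desiredShards(inboundRate, samplesSent float64, sendDuration time.Duration, minShards, maxShards int) int {
	if samplesSent <= 0 {
		return minShards
	}
	timePerSample := sendDuration.Seconds() / samplesSent // upstream seconds spent per sample
	n := int(math.Ceil(timePerSample * inboundRate))      // shards needed to keep up
	if n < minShards {
		n = minShards
	}
	if n > maxShards {
		n = maxShards
	}
	return n
}

func main() {
	// e.g. 30k samples/s coming in; the remote end recently spent 2s of wall
	// time accepting 20k samples, so each shard can push roughly 10k samples/s.
	fmt.Println(desiredShards(30000, 20000, 2*time.Second, 1, 200)) // 3
}
```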

tomwilkie closed this Mar 4, 2019
