Remote queue manager should deal with blips in upstream latency #2648
Comments
danielshiplett commented Apr 21, 2017
More notes on what I was seeing. The time to call my remote write was typically 80ms (0.999), 20ms (0.99), and 5ms (0.9). Occasionally I would see blips where my 0.999 quantile was over 1s. Since I switched to using Undertow as my embedded server, I'm seeing far fewer blips to 1s. I've also added more scrape targets, and I would say that Prometheus is now actively adjusting shard counts. I'm seeing no dropped records. Whether this is due to the fewer blips, the change to Undertow, or the more active sharding in Prometheus, I have no idea. I'll report some more next week after this has had a chance to soak in for a while.
danielshiplett commented May 4, 2017
So things are generally running better now. I have 500,000 timeseries and I see about 300 remote write requests per second (almost 2 million samples per minute). I still see occasional blips, but they are occurring much less frequently. My shards seem to bounce around between 4 and 6. However, I've also noticed that when I do a configuration reload (POST to /-/reload), my shard count gets reset to 1. Does Prometheus not save this information between runs? When it gets reset to 1, I go through the process of losing more write requests until it works its shards back up again.
Julius Volz commented May 4, 2017
Yeah, config reloads are still fairly intrusive for the remote write path, as that all gets completely flushed/reloaded on config reload. See also this comment:
https://github.com/prometheus/prometheus/blob/master/storage/remote/write.go#L36-L37
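To make the reset concrete, here is a minimal Go sketch of that behaviour (not the actual Prometheus code; the type and function names are invented for illustration): on reload the remote-write queue is stopped and rebuilt, so whatever shard count it had converged on is discarded and it starts again from one.

```go
package main

import "fmt"

// queueManager stands in for the remote-write queue: it fans samples out
// across a number of parallel sender shards that it grows or shrinks based
// on observed throughput.
type queueManager struct {
	shards int
}

// newQueueManager returns a fresh queue. It always starts at one shard and
// has to re-learn the right count from live traffic.
func newQueueManager() *queueManager {
	return &queueManager{shards: 1}
}

func (q *queueManager) stop() {
	// Flush and stop all sender shards (omitted in this sketch).
}

// applyConfig mimics what a config reload does to the remote-write path:
// the old queue is stopped and a brand-new one is created, discarding the
// shard count the old queue had scaled up to.
func applyConfig(old *queueManager) *queueManager {
	old.stop()
	return newQueueManager()
}

func main() {
	q := newQueueManager()
	q.shards = 6 // pretend the queue scaled up to 6 shards under load
	q = applyConfig(q)
	fmt.Println("shards after reload:", q.shards) // prints 1
}
```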
tomwilkie commented
We should indeed optimise this bit; it shouldn't be too hard.
poblahblahblah commented May 29, 2018
It looks like the queue configuration was addressed in https://sourcegraph.com/github.com/prometheus/prometheus/-/commit/454b6611458d01398678baca372a57973df4215f and then documented in #4126.
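For reference, the knobs that change exposed can be sketched as a Go struct like the one below. This is an illustrative approximation only: the field names, types, and example values are paraphrased rather than copied from the Prometheus source, so check the linked commit and the #4126 documentation for the authoritative list and defaults.

```go
package main

import (
	"fmt"
	"time"
)

// queueConfig approximates the per-remote-write queue settings that became
// configurable. Names and values here are illustrative, not authoritative.
type queueConfig struct {
	Capacity          int           // samples buffered per shard before sends block or samples are dropped
	MaxShards         int           // upper bound the resharding logic may scale up to
	MaxSamplesPerSend int           // batch size for each remote write request
	BatchSendDeadline time.Duration // send a partial batch after waiting this long
	MinBackoff        time.Duration // initial retry backoff on failed sends
	MaxBackoff        time.Duration // cap on the retry backoff
}

func main() {
	cfg := queueConfig{
		Capacity:          10000,
		MaxShards:         100,
		MaxSamplesPerSend: 100,
		BatchSendDeadline: 5 * time.Second,
		MinBackoff:        30 * time.Millisecond,
		MaxBackoff:        100 * time.Millisecond,
	}
	fmt.Printf("%+v\n", cfg)
}
```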
gouthamve added a commit to gouthamve/prometheus that referenced this issue Jul 18, 2018
I'd say we can close this: the new WAL-based remote_write code in 2.8 includes much more proactive resharding that responds to upstream latency variations very quickly.
tomwilkie commented Apr 21, 2017
Currently the queue is per shard and only buffers 100k samples; if you have a remote target that is usually very fast but periodically slow, a single shard's queue can easily be overwhelmed.
We could consider having a globally fixed-length queue that gets partitioned across the shards. Or we could make resharding happen when the queue is overwhelmed. Or we could just accept dropping samples in this case.
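The proactive resharding that eventually resolved this can be sketched roughly as below. This is a simplified illustration, not the actual Prometheus algorithm: it only captures the core idea of deriving the shard count from the observed inbound sample rate and the per-batch send latency, so that a latency blip upstream quickly translates into more parallel senders instead of an overwhelmed per-shard queue.

```go
package main

import (
	"fmt"
	"math"
)

// desiredShards estimates how many parallel senders are needed to keep up:
// the rate at which samples arrive, divided by the rate a single shard can
// sustain given the currently observed per-batch send latency.
//
//   inRate         - samples/s being handed to the remote-write queue
//   sendLatency    - observed duration (seconds) of one remote write request
//   samplesPerSend - batch size used for each request
func desiredShards(inRate, sendLatency, samplesPerSend float64, maxShards int) int {
	if sendLatency <= 0 || samplesPerSend <= 0 {
		return 1
	}
	perShardRate := samplesPerSend / sendLatency // samples/s one shard can push
	shards := int(math.Ceil(inRate / perShardRate))
	if shards < 1 {
		shards = 1
	}
	if shards > maxShards {
		shards = maxShards
	}
	return shards
}

func main() {
	// Healthy upstream: 100-sample batches at 80ms easily keep up with 5000 samples/s.
	fmt.Println(desiredShards(5000, 0.080, 100, 1000)) // 4
	// A latency blip to 1s per batch: the same traffic now needs far more shards.
	fmt.Println(desiredShards(5000, 1.0, 100, 1000)) // 50
}
```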