Add internal event queuing and flushing #227

Merged
merged 2 commits into prometheus:master from event-queuing on Jun 7, 2019

Conversation

claytono
Contributor

@claytono claytono commented May 30, 2019

At high traffic levels, the locking around sending on channels can cause a large amount of blocking and CPU usage. This adds an event queue mechanism so that events are queued for a short period of time and flushed in batches to the main exporter goroutine periodically.

The default is to flush every 1000 events, or every 200ms, whichever happens first.
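For context, a minimal sketch of this batching approach in Go, assuming illustrative names (`eventQueue`, `flushThreshold`, `flushInterval`, `out`) rather than the PR's actual identifiers: events accumulate in a local buffer and are handed to the exporter goroutine in one channel send per batch, either when the buffer reaches the threshold or when a ticker fires.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Event stands in for the exporter's statsd event type.
type Event interface{}

type eventQueue struct {
	mu             sync.Mutex
	buf            []Event
	flushThreshold int           // flush once this many events are queued (e.g. 1000)
	flushInterval  time.Duration // flush at least this often (e.g. 200ms)
	out            chan []Event  // consumed by the main exporter goroutine
}

func newEventQueue(threshold int, interval time.Duration, out chan []Event) *eventQueue {
	q := &eventQueue{
		buf:            make([]Event, 0, threshold),
		flushThreshold: threshold,
		flushInterval:  interval,
		out:            out,
	}
	// Time-based flush so low-traffic periods are not delayed indefinitely.
	go func() {
		t := time.NewTicker(interval)
		for range t.C {
			q.flush()
		}
	}()
	return q
}

// queue appends an event and flushes once the batch size is reached,
// replacing one channel send per event with one send per batch.
func (q *eventQueue) queue(e Event) {
	q.mu.Lock()
	q.buf = append(q.buf, e)
	full := len(q.buf) >= q.flushThreshold
	q.mu.Unlock()
	if full {
		q.flush()
	}
}

// flush hands the accumulated batch to the exporter in a single send.
func (q *eventQueue) flush() {
	q.mu.Lock()
	if len(q.buf) == 0 {
		q.mu.Unlock()
		return
	}
	batch := q.buf
	q.buf = make([]Event, 0, q.flushThreshold)
	q.mu.Unlock()
	q.out <- batch
}

func main() {
	out := make(chan []Event, 16)
	q := newEventQueue(1000, 200*time.Millisecond, out)
	for i := 0; i < 2500; i++ {
		q.queue(i)
	}
	fmt.Println(len(<-out), len(<-out)) // sizes of the first two flushed batches; the remainder flushes on the ticker
}
```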

@claytono claytono force-pushed the event-queuing branch 2 times, most recently from a9ff5d3 to d16e73e on May 30, 2019 at 11:56
@claytono claytono marked this pull request as ready for review May 30, 2019 12:04
@matthiasr
Contributor

interesting, this somewhat relates to the discussion we had around measuring UDP packet drops here #196 (comment)

what would happen with the different listeners if the queue overflows?

@claytono
Contributor Author

I skimmed through that a while back, and I like the idea in general, but I haven't actually read the PR. If nothing else, I think being able to have metrics for dropped packets has value. Another approach to getting those metrics might be to switch the front end listeners to do non-blocking sends on the event queue. That would allow dropping events when the queue is full, and there could be a metric for that. That might be a simpler change than adding another stage to the processing pipeline.

I think the behavior regarding queue overflows stays mostly the same. The event queue that the front end listeners push events to now has a bigger buffer, and there is the small internal buffer, but eventually both of those will fill and it will block on trying to send on that channel.
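A minimal sketch of that non-blocking send idea, assuming a hypothetical dropped-events counter (the metric name and `send` helper below are illustrative, not part of the exporter):

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

// Event stands in for the exporter's statsd event type.
type Event interface{}

// eventsDropped counts events discarded when the queue is full.
// The metric name here is hypothetical.
var eventsDropped = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "statsd_exporter_events_dropped_total",
	Help: "Events dropped because the event queue was full.",
})

// send performs a non-blocking send: if the channel buffer is full,
// the event is dropped and counted instead of blocking the listener.
func send(events chan<- Event, e Event) {
	select {
	case events <- e:
	default:
		eventsDropped.Inc()
	}
}

func main() {
	prometheus.MustRegister(eventsDropped)
	events := make(chan Event, 1)
	send(events, "a") // fits in the buffer
	send(events, "b") // buffer full: dropped and counted rather than blocking
	fmt.Println("queued:", len(events))
}
```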

@matthiasr
Contributor

Another approach to getting those metrics might be to switch the front end listeners to do non-blocking sends on the event queue

That is my preferred approach as well, because it's platform independent. Does it make sense to add it here?

At high traffic levels, the locking around sending on channels can cause
a large amount of blocking and CPU usage.  This adds an event queue
mechanism so that events are queued for a short period of time, and
flushed in batches to the main exporter goroutine periodically.

The default is to flush every 1000 events, or every 200ms, whichever
happens first.

Signed-off-by: Clayton O'Neill <claytono@github.com>
@claytono
Contributor Author

claytono commented Jun 4, 2019

If you're ok with it, I'd prefer to address that in a future PR. I'm not sure if I'll have the time to incorporate that into this PR in the short term.

Contributor

@matthiasr matthiasr left a comment


Looks good, my only concern is about visibility into what's happening with the queue. With only the flush counter, you can only roughly infer when it ran full, because the flush rate climbs above its baseline. I think it would be useful to get an earlier warning, and to be able to see when the queue is consistently nearly full, or when there are short spikes over the batch size that could be handled in one batch with a slightly bigger queue size.
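One way to get that visibility, sketched below, is to observe the size of each flushed batch in a histogram: a distribution piling up at the batch-size limit means the queue is routinely running full (and a slightly larger queue might coalesce bursts into fewer batches), while mostly small observations mean flushes are timer-driven. The metric name and bucket layout are illustrative only, not what the exporter actually ships.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

// flushedBatchSize records how many events each flush carried.
// Name and buckets are illustrative, not the exporter's actual metric.
var flushedBatchSize = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "statsd_exporter_event_queue_flushed_events",
	Help:    "Number of events sent per queue flush.",
	Buckets: prometheus.ExponentialBuckets(8, 2, 9), // 8 .. 2048
})

// observeFlush would be called from the queue's flush path with the
// size of the batch that was just handed to the exporter goroutine.
func observeFlush(batchLen int) {
	flushedBatchSize.Observe(float64(batchLen))
}

func main() {
	prometheus.MustRegister(flushedBatchSize)
	observeFlush(1000) // a flush that hit the batch-size limit
	observeFlush(37)   // a timer-driven flush well under the limit
}
```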

telemetry.go (review thread, outdated and resolved)
Co-Authored-By: Matthias Rampke <mr@soundcloud.com>
Signed-off-by: Clayton O'Neill <claytono@github.com>
@claytono
Contributor Author

claytono commented Jun 7, 2019

@matthiasr I think that makes sense. I'd like to get that metric in place as well, but I'm tied up with another project at the moment. Your comment about the metric naming makes sense, and I've updated the PR to incorporate it.

@matthiasr matthiasr merged commit 5832aa9 into prometheus:master Jun 7, 2019
matthiasr pushed a commit that referenced this pull request Jun 13, 2019
and add the missing changelog entry for #227

Signed-off-by: Matthias Rampke <mr@soundcloud.com>