
Flush full batches instead of silently dropping #245

Closed

Conversation

@samschlegel (Contributor) commented Oct 5, 2020

Currently, if the export rate is higher than the buffer can hold for the given interval, we just silently lose spans, which doesn't seem great. This changes the worker logic to flush when the buffer is full.

I believe this will have some conflicts with master due to the async changes, so it will probably need some rework, but I wanted to throw up a PR since I've had this sitting in a branch for a while.
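For illustration, here is a minimal sketch of the flush-when-full idea, not the actual worker code in this PR: a batch worker that still exports on the scheduled interval, but also exports immediately once the buffer reaches its configured maximum instead of dropping new spans. The channel, timer, and type names are stand-ins, not the SDK's real types.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time;

// Stand-in for the SDK's span data type.
struct SpanData;

async fn export(batch: Vec<SpanData>) {
    // In the real processor this would call the configured exporter.
    let _ = batch;
}

async fn worker(mut rx: mpsc::Receiver<SpanData>, max_queue_size: usize, delay: Duration) {
    let mut buffer: Vec<SpanData> = Vec::with_capacity(max_queue_size);
    let mut ticker = time::interval(delay);

    loop {
        tokio::select! {
            // Scheduled flush: export whatever has accumulated so far.
            _ = ticker.tick() => {
                if !buffer.is_empty() {
                    export(std::mem::take(&mut buffer)).await;
                }
            }
            // New span: instead of dropping when full, flush the full batch first.
            maybe_span = rx.recv() => match maybe_span {
                Some(span) => {
                    buffer.push(span);
                    if buffer.len() >= max_queue_size {
                        export(std::mem::take(&mut buffer)).await;
                    }
                }
                None => {
                    // Channel closed: final flush and exit.
                    if !buffer.is_empty() {
                        export(std::mem::take(&mut buffer)).await;
                    }
                    break;
                }
            },
        }
    }
}
```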

@samschlegel requested a review from a team as a code owner on October 5, 2020 23:34
@linux-foundation-easycla bot commented Oct 5, 2020

CLA Check
The committers are authorized under a signed CLA.

@jtescher (Member) commented Oct 6, 2020

@samschlegel looks good; it should be easy to rebase, or you can just apply your changes here now.

@jtescher (Member) commented Oct 6, 2020

There may be other questions around backpressure here if the intent is for spans never to be dropped for load-shedding purposes. If spans are being produced faster than the exporter can send them, where should they buffer, and should they buffer until OOM, or should there be a cap somewhere at which spans start dropping? Currently, for example, you could set your max_queue_size very high and you shouldn't drop any spans, but if the queue did happen to hit that high limit, spans would start dropping rather than the process OOMing and crashing.
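To make the tradeoff concrete, here is a hedged, self-contained illustration (not the crate's actual plumbing) of the options with a bounded tokio channel: drop on full, apply backpressure, or buffer without bound. The capacity and the counter mentioned in the comments are placeholders.

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
struct SpanData;

#[tokio::main]
async fn main() {
    // A cap comparable to max_queue_size: beyond this, something has to give.
    let (tx, mut rx) = mpsc::channel::<SpanData>(2_048);

    // Option 1: load shedding. If the queue is full, drop the span (ideally
    // counting it, which is what the metrics discussion below is about).
    if let Err(mpsc::error::TrySendError::Full(_span)) = tx.try_send(SpanData) {
        // increment a spans_dropped counter here
    }

    // Option 2: backpressure. Await until the exporter drains the queue,
    // slowing the producer down instead of losing data (at the risk of stalls).
    let _ = tx.send(SpanData).await;

    // Option 3 (implicit): an unbounded queue never drops, but under sustained
    // overload it grows until the process runs out of memory.
    let _ = rx.recv().await;
}
```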

@samschlegel (Contributor, Author) commented:

Yeah, I was thinking that as well right after I posted this. Perhaps what would be better instead is to add metrics so that we can track the buffer size and how many spans have been dropped. As is, it's pretty silent, and in our services we hit that default queue size pretty quickly (the default queue size of 2048 with a 5 s flush interval works out to ~410 spans/sec), and I imagine others probably would as well.

@jtescher (Member) commented Oct 6, 2020

@samschlegel yeah, there are definitely folks interested in having more metrics awareness on the trace side generally; see open-telemetry/opentelemetry-specification#381 for example. There is also now a global error handler; it's currently used for errors in metrics but could (and should) be extended to trace errors, which could be a reasonable way to get a hook for app-specific behavior.
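For reference, a sketch of hooking the global error handler mentioned above; the exact signature and the error variants it receives have shifted across opentelemetry crate versions, so treat this as an assumption about the shape of the hook rather than the API as it stood at the time of this PR.

```rust
use opentelemetry::global;

fn main() {
    // Install an app-specific handler for internal SDK errors; it can only
    // be set once, hence the Result.
    global::set_error_handler(|err| {
        // Log, count, or forward the error however the application prefers.
        eprintln!("OpenTelemetry error: {err}");
    })
    .expect("error handler should only be set once");
}
```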

@samschlegel (Contributor, Author) commented:

open-telemetry/opentelemetry-specification#381 seems to be more about deriving metrics from traces, not instrumenting the internals of the SDK itself? What I'd want is for the span processor to export internal metrics for the queue length that we could hook up through the normal OpenTelemetry metrics pipeline and export to Prometheus.

The metrics I'm thinking about are current_queue_length, max_queue_length, and spans_dropped. We could use the first two as an early-warning/alerting signal, and the third to know how much we've lost.
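As a rough sketch of what such instrumentation could look like inside a processor: the counter names come from the comment above, while the struct and methods are hypothetical and would still need to be bridged into the OpenTelemetry metrics pipeline (e.g. for Prometheus).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical internal counters a span processor could expose.
#[derive(Default)]
struct ProcessorStats {
    current_queue_length: AtomicUsize,
    spans_dropped: AtomicUsize,
}

impl ProcessorStats {
    /// Called when a span is enqueued; returns false if it had to be dropped.
    fn on_enqueue(&self, max_queue_length: usize) -> bool {
        let len = self.current_queue_length.load(Ordering::Relaxed);
        if len >= max_queue_length {
            // spans_dropped tells us how much we've lost...
            self.spans_dropped.fetch_add(1, Ordering::Relaxed);
            return false;
        }
        // ...while current_queue_length vs. max_queue_length is the
        // early-warning signal for alerting before drops start.
        self.current_queue_length.fetch_add(1, Ordering::Relaxed);
        true
    }

    /// Called after a batch is handed to the exporter.
    fn on_export(&self, batch_size: usize) {
        self.current_queue_length.fetch_sub(batch_size, Ordering::Relaxed);
    }
}
```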

@jtescher (Member) commented:

@samschlegel that seems reasonable, but likely outside the scope of this PR. Should we close this and open an issue for span processor metrics?

@jtescher (Member) commented:

Closing this for now, can follow up with a metrics approach in the future.
