
Flush full batches instead of silently dropping #245

Closed

Conversation

@samschlegel (Contributor) commented Oct 5, 2020

Currently, if the export rate is higher than the buffer can hold for the given interval, we just silently lose spans, which doesn't seem great. This changes the worker logic to flush when the buffer is full.

I believe this will have some conflicts with master due to the async changes, so it will probably need some rework, but I wanted to throw up a PR since I've had this sitting in a branch for a while.
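For illustration, here is a minimal sketch of the flush-when-full idea, not the actual worker code in this PR: a batch worker that still exports on the scheduled interval, but also exports immediately once the buffer reaches its configured maximum instead of dropping new spans. The channel, timer, and type names are stand-ins, not the SDK's real types.

```rust
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time;

// Stand-in for the SDK's span data type.
struct SpanData;

async fn export(batch: Vec<SpanData>) {
    // In the real processor this would call the configured exporter.
    let _ = batch;
}

async fn worker(mut rx: mpsc::Receiver<SpanData>, max_queue_size: usize, delay: Duration) {
    let mut buffer: Vec<SpanData> = Vec::with_capacity(max_queue_size);
    let mut ticker = time::interval(delay);

    loop {
        tokio::select! {
            // Scheduled flush: export whatever has accumulated so far.
            _ = ticker.tick() => {
                if !buffer.is_empty() {
                    export(std::mem::take(&mut buffer)).await;
                }
            }
            // New span: instead of dropping when full, flush the full batch first.
            maybe_span = rx.recv() => match maybe_span {
                Some(span) => {
                    buffer.push(span);
                    if buffer.len() >= max_queue_size {
                        export(std::mem::take(&mut buffer)).await;
                    }
                }
                None => {
                    // Channel closed: final flush and exit.
                    if !buffer.is_empty() {
                        export(std::mem::take(&mut buffer)).await;
                    }
                    break;
                }
            },
        }
    }
}
```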

@samschlegel requested a review from a team as a code owner on October 5, 2020 23:34
@linux-foundation-easycla bot commented Oct 5, 2020

CLA Check
The committers are authorized under a signed CLA.

@jtescher (Member) commented Oct 6, 2020

@samschlegel looks good; it should be easy to rebase, or you can just apply your changes here now.

@jtescher (Member) commented Oct 6, 2020

There may be other questions around backpressure here if the intent is for spans never to be dropped for load-shedding purposes. If spans are being produced faster than the exporter can send them, where should they buffer, and should they buffer until OOM, or should there be a cap somewhere at which spans start dropping? Currently, for example, you could set your max_queue_size very high and you shouldn't drop any spans, but if the queue did happen to hit that high limit, spans would start dropping rather than the process OOMing and crashing.
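To make the tradeoff concrete, here is a hedged, self-contained illustration (not the crate's actual plumbing) of the options with a bounded tokio channel: drop on full, apply backpressure, or buffer without bound. The capacity and the counter mentioned in the comments are placeholders.

```rust
use tokio::sync::mpsc;

#[derive(Debug)]
struct SpanData;

#[tokio::main]
async fn main() {
    // A cap comparable to max_queue_size: beyond this, something has to give.
    let (tx, mut rx) = mpsc::channel::<SpanData>(2_048);

    // Option 1: load shedding. If the queue is full, drop the span (ideally
    // counting it, which is what the metrics discussion below is about).
    if let Err(mpsc::error::TrySendError::Full(_span)) = tx.try_send(SpanData) {
        // increment a spans_dropped counter here
    }

    // Option 2: backpressure. Await until the exporter drains the queue,
    // slowing the producer down instead of losing data (at the risk of stalls).
    let _ = tx.send(SpanData).await;

    // Option 3 (implicit): an unbounded queue never drops, but under sustained
    // overload it grows until the process runs out of memory.
    let _ = rx.recv().await;
}
```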

@samschlegel (Contributor, Author) commented:

Yeah, I was thinking that as well right after I posted this. Perhaps what would be better instead is to add metrics so that we can track the buffer size and how many spans have been dropped. As is, it's pretty silent, and in our services we hit that default queue size pretty quickly (the default queue size of 2048 with a 5 s flush interval works out to ~410 spans/sec), and I imagine others probably would as well.

@jtescher (Member) commented Oct 6, 2020

@samschlegel yeah, there are definitely folks interested in having more metrics awareness on the trace side generally; see open-telemetry/opentelemetry-specification#381 for example. There is also now a global error handler; it's currently used for errors in metrics but could (and should) be extended to trace errors, which could be a reasonable way to get a hook for app-specific behavior.
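For reference, a sketch of hooking the global error handler mentioned above; the exact signature and the error variants it receives have shifted across opentelemetry crate versions, so treat this as an assumption about the shape of the hook rather than the API as it stood at the time of this PR.

```rust
use opentelemetry::global;

fn main() {
    // Install an app-specific handler for internal SDK errors; it can only
    // be set once, hence the Result.
    global::set_error_handler(|err| {
        // Log, count, or forward the error however the application prefers.
        eprintln!("OpenTelemetry error: {err}");
    })
    .expect("error handler should only be set once");
}
```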

@samschlegel (Contributor, Author) commented:

open-telemetry/opentelemetry-specification#381 seems to be more about deriving metrics from traces, not instrumenting the internals of the SDK itself? What I'd want is for the span processor to export internal metrics for the queue length that we could hook up through the normal OpenTelemetry metrics pipeline and export to Prometheus.

The metrics I'm thinking about are current_queue_length, max_queue_length, and spans_dropped. We could use the first two as an early-warning/alerting signal, and the third to know how much we've lost.
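As a rough sketch of what such instrumentation could look like inside a processor: the counter names come from the comment above, while the struct and methods are hypothetical and would still need to be bridged into the OpenTelemetry metrics pipeline (e.g. for Prometheus).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical internal counters a span processor could expose.
#[derive(Default)]
struct ProcessorStats {
    current_queue_length: AtomicUsize,
    spans_dropped: AtomicUsize,
}

impl ProcessorStats {
    /// Called when a span is enqueued; returns false if it had to be dropped.
    fn on_enqueue(&self, max_queue_length: usize) -> bool {
        let len = self.current_queue_length.load(Ordering::Relaxed);
        if len >= max_queue_length {
            // spans_dropped tells us how much we've lost...
            self.spans_dropped.fetch_add(1, Ordering::Relaxed);
            return false;
        }
        // ...while current_queue_length vs. max_queue_length is the
        // early-warning signal for alerting before drops start.
        self.current_queue_length.fetch_add(1, Ordering::Relaxed);
        true
    }

    /// Called after a batch is handed to the exporter.
    fn on_export(&self, batch_size: usize) {
        self.current_queue_length.fetch_sub(batch_size, Ordering::Relaxed);
    }
}
```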

@jtescher (Member) commented:

@samschlegel that seems reasonable, but likely outside the scope of this PR. Should we close this and open an issue for span processor metrics?

@jtescher (Member) commented:

Closing this for now, can follow up with a metrics approach in the future.
