Allow batch span processor to export data as soon as a batch is filled #494
Conversation
Force-pushed from a84d003 to e6dd0d6
Why is an extra task needed to solve this problem? Could we just kick off an export if the buffer fills up without spawning an extra task to do it? |
My intention was to keep filling the next batch while the first one is being exported, so we could keep the exporter exporting without waiting for the buffer to be filled. |
Interesting idea! Sorry, I haven't had time to fully read over the change, will try to review tomorrow |
Is that actually a problem we've observed in practice? The design does sound sane to me, I'm just wondering if we have some proof the extra complexity is actually warranted. |
Wouldn't we get a simpler API by doing what is most appropriate in each occasion without exposing further configuration to the user? |
The batch processor spec only mentions dropping spans once max queue size is hit, which seems to be the conservative approach (i.e. you always know what the max spans are that any app can produce over a given period of time for capacity planning), but it also doesn't explicitly disallow sending early. |
The overhead introduced by this method is that you potentially send more requests. Users who may not care that much about span dropping may find the application generating more network IO. But it's debatable which method we should use as the default. |
The problem was mainly that we have a fixed interval between exports. Thus, if there is a spike in span volume, chances are those spans are going to be dropped. So this should be mitigated once we push on batch filled. The alternative to having two separate tasks is having only one task and sizing the channel between the application task and the background task accordingly. I guess the biggest concern I have is that we re-create the delay future on each loop; not sure how much overhead that brings to each iteration. |
At least in tokio, delay futures can be reset, IIRC. |
At least in our case, it's not the application to choose what to drop or what to keep. |
That's true. Did a simple benchmark to see how much the impact is: we end a span every 10 ms. Based on the time needed to export a batch … |
I agree that the usual use case is to run a collector as a sidecar or a dedicated service to process the spans before sending them to the backend. So maybe we want to make this PR's mode the default instead |
I think this is the right direction, there may be some ways to simplify but overall I think the two future approach makes sense to be able to independently make progress on both tasks. I think I'd go with the dropping behavior by default as it seems to be what the spec wants, but let people opt in to this behavior if they prefer. |
* delay vs. interval: Tokio's interval completes immediately, so in the loop the interval will always complete once we restart the loop. It's also more intuitive to use delay, as the batch processor restarts the delay/interval whenever it exports.
* Removed the `Unpin` restriction on `Runtime::Interval`
* Separated the two different exports into two functions
… job. Most of the errors that could happen come from channels; they usually indicate that the other side of the channel has been dropped.
Force-pushed from 3126f48 to e3c2c91
Additional data point to consider: exporting early when a batch is filled is the default behaviour of the batching processor in the OpenTelemetry Collector (see here). |
Should we create an issue in the spec to clarify the behavior of the batch span processor? |
@TommyCpp yeah, could be useful to open an issue. This also brings up the question of batch parallelization, i.e. if we can export between the specified intervals as well, is the number of batches capped? Or can we just keep creating more in-flight batches as needed? |
Same question for force flush I suppose? if you both end a span and call force flush 3 times do you have 3 batches in flight being exported? or do you have 1 in flight and subsequent force flushes have no effect until the batch has been exported? |
Looks like the go batch exporter uses mutexes, so single batch in flight at any time, and exports when full. |
An implementation of @LukeMathWalker's suggestion to match the go/otlp behavior would look like d9a0015 I believe. That version always exports when the batch limit is hit, max queue size is enforced at the initial channel send rather than in the processor loop, and there is a single in-flight batch at any point. Some pros with this approach are:
Cons:
|
@jtescher I think this should help users with a high volume of spans. I agree we should keep the implementation aligned with the other SDKs. If users find that multiple in-flight batches would help, we can always add that later. Feel free to close this PR and open another one based on your implementation if we decide to go that way |
I think the design sounds like an improvement. The code in … |
@TommyCpp right, will clean that up a bit then get a PR 👍 |
resolves #468
Problem
When generating spans at high speed, users often experience span loss because the buffer reaches its maximum size. Currently, the workaround is to encourage users to use a larger buffer and a shorter delay. This PR proposes a new export mode for the batch span processor, which allows it to export spans as soon as a batch is filled.
Proposed solution
We spawn two tasks to handle the export of spans. One is the `collect task`, which collects spans and determines whether it's time to export them. The other is the `export task`, which exports a batch of spans.

The `collect task` should be non-blocking unless the message it receives is a shutdown or force push. When a span is sent to the `collect task`, it pushes the span into the buffer. The `collect task` sends a batch to the `export task` in two cases: when the batch's size reaches `max_export_size`, or when the pre-defined `scheduled_delay` elapses. The `collect task` restarts the `scheduled_delay` timer after each export.

The `export task` polls batches of spans from the channel between it and the `collect task`. The channel between them serves as a temporary buffer for the exporter. The channel length is `max_queue_size`. Note that the element type of this channel is `Vec<SpanData>`, so the buffer can hold up to `max_queue_size * batch_size` spans.
Note that even in this mode, users can still experience span loss if spans are generated faster than the exporter can handle. Specifically, the `collect task` will drop spans if the channel between it and the `export task` has been filled.
Diagram
Open questions

- … between the `collect task` and the `export task`?