Jaeger Exporter Performance #1254
Conversation
activity.TagObjects,
ref jaegerTags,
ProcessActivityTagRef);
activity.EnumerateTagValues(ref jaegerTags);
@alanwest FYI I updated this for the new extension. It is a slightly easier API to use, and a bit faster because it doesn't hit the concurrent dictionary each time.
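For context, the shape of that extension is a struct callback passed by ref, which avoids delegate and closure allocations per tag. A minimal sketch of the idea; the interface and method names here are illustrative, not the repo's exact internal API:

```csharp
using System.Collections.Generic;
using System.Diagnostics;

// Illustrative names only; the real shared extension in the repo may differ.
internal interface IActivityTagEnumerator
{
    // Return false to stop enumerating early.
    bool ForEach(KeyValuePair<string, object> tag);
}

internal static class ActivityTagExtensions
{
    // Walks activity.TagObjects, invoking the callback on the caller's
    // struct state. Passing the state by ref lets a mutable struct collect
    // results with no boxing and no per-call allocations.
    public static void EnumerateTagValues<TState>(this Activity activity, ref TState state)
        where TState : struct, IActivityTagEnumerator
    {
        foreach (KeyValuePair<string, object> tag in activity.TagObjects)
        {
            if (!state.ForEach(tag))
            {
                break;
            }
        }
    }
}
```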
{
    foreach (var batch in this.CurrentBatches)
    {
        var task = this.thriftClient.WriteBatchAsync(batch.Value, CancellationToken.None);
Just because I know someone will ask 😄
The thriftClient instance Jaeger is using is hooked up to a UdpClient. The API is all async but ultimately writes to a socket buffer and completes synchronously. To save some perf I skipped awaiting the task. The #if DEBUG checks in here are just in case that ever changes.
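Pieced together, the pattern reads roughly like this (a sketch; `thriftClient` and `batch` stand in for the exporter's actual fields):

```csharp
// Fire the write without awaiting: the UDP-backed transport copies the
// payload into a socket buffer and returns an already-completed Task.
var task = this.thriftClient.WriteBatchAsync(batch.Value, CancellationToken.None);
#if DEBUG
// Guard so a future truly-async transport fails loudly in debug builds
// instead of silently losing exceptions from the unawaited task.
if (task.Status != TaskStatus.RanToCompletion)
{
    throw new System.InvalidOperationException(
        "WriteBatchAsync was expected to complete synchronously.");
}
#endif
```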
@@ -155,13 +119,25 @@ private Activity CreateTestActivity()
        ActivityTraceFlags.Recorded)),
    };

    return activitySource.StartActivity(
This was returning null before because there was nothing sampling the source, which caused the benchmarks to blow up. ActivitySource API scoffs at the pit of success! 🕳️
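The usual way to make StartActivity return a non-null Activity in a benchmark is to register a listener that samples everything; a minimal sketch:

```csharp
using System.Diagnostics;

// Without a registered listener that samples the source,
// ActivitySource.StartActivity returns null.
ActivitySource.AddActivityListener(new ActivityListener
{
    ShouldListenTo = source => true,
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
});
```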
Codecov Report
@@ Coverage Diff @@
## master #1254 +/- ##
==========================================
+ Coverage 79.01% 79.02% +0.01%
==========================================
Files 216 215 -1
Lines 6270 6169 -101
==========================================
- Hits 4954 4875 -79
+ Misses 1316 1294 -22
{
    spanProcess = new Process(spanServiceName, this.Process.Tags);
    spanProcess.Message = this.BuildThriftMessage(spanProcess).ToArray();
    this.processCache.Add(spanServiceName, spanProcess);
Curious - is there a need to put an upper bound on this (something like a splay tree with a max size), or are we pretty confident that there won't be too many unique service names?
Not sure. Kind of related to #1235. I feel like it would be a special/rare case where you would be connecting to an ever-growing list of services. Some kind of gateway thing perhaps?
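If an upper bound ever became necessary, even a simple cap (rather than a full splay tree or LRU) would stop the pathological case. A hypothetical sketch, not code from this PR:

```csharp
using System.Collections.Generic;

// Hypothetical size-capped cache for per-service Process entries; once the
// cap is hit, new service names fall back to building the message each time
// instead of growing the dictionary without bound.
internal sealed class BoundedProcessCache<TValue>
{
    private const int MaxEntries = 1024; // assumed cap, not from the PR
    private readonly Dictionary<string, TValue> cache = new Dictionary<string, TValue>();

    public bool TryGet(string serviceName, out TValue value)
        => this.cache.TryGetValue(serviceName, out value);

    // Returns false when the cache is full and the entry was not stored.
    public bool TryAdd(string serviceName, TValue value)
    {
        if (this.cache.Count >= MaxEntries)
        {
            return false;
        }

        this.cache[serviceName] = value;
        return true;
    }
}
```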
Looks solid.
One idea for consideration: the batching/grouping logic saves bandwidth by sending the service name only once for a set of spans that share it, but it comes at the cost of extra dictionary maintenance, GC overhead, and more complex code. I wonder whether it's a break-even deal.
@reyang I've been thinking about your comment. It's definitely an interesting idea. One thing to note: it's not just the service name we send in that batch header; a Process can also have tags. Here's where the message is constructed: opentelemetry-dotnet/src/OpenTelemetry.Exporter.Jaeger/Process.cs Lines 70 to 118 in 768eaa3
There's probably a sweet spot where the process data is small enough that we should duplicate it on every span? We cache the bytes for those messages, so it would be an easy check to add; I'm just not sure how to figure out the size threshold. I just made a version that sends a Jaeger Batch (process + span) for every Activity:
Before:
After:
Much less memory used up in the process, but more bytes written to the socket (not tracked in the benchmark). It's also a bit slower, which was unexpected; not sure if that is statistically significant. Could it be the result of copying many more small buffers to the socket? Where do you think we should go from here?
This could be tested by replacing the actual socket syscall with a "drop-on-the-floor" dummy function?
The current perf results don't give a strong enough reason to make the change. Where is the perf test code (the version with the "process + span" change)? I'd like to take a look.
Sorry, my statement was misleading. The benchmark isn't actually writing to a socket; it writes to a "black hole" transport which just drops the bytes. I should have said: "Could be the result of copying many more small buffers to the stream?" That is more accurate to what is actually going on, I think.
Check out this branch: https://github.com/CodeBlanch/opentelemetry-dotnet/tree/jaeger-process-batch Luckily I stashed the changes when I was doing this.
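For anyone wanting to reproduce the setup, a "black hole" sink can be as simple as a Stream that discards writes; a sketch of the idea (the benchmark's real transport type lives in the exporter's vendored Thrift code):

```csharp
using System;
using System.IO;

// A "black hole" sink for benchmarking: accepts writes and drops the bytes,
// so the benchmark measures serialization cost rather than socket I/O.
internal sealed class BlackHoleStream : Stream
{
    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => 0;
    public override long Position { get => 0; set { } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        // Intentionally empty: every byte is discarded.
    }

    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
}
```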
{
    var task = this.thriftClient.WriteBatchAsync(batch.Value, CancellationToken.None);
#if DEBUG
    if (task.Status != TaskStatus.RanToCompletion)
This sounds more like a Debug.Assert?
Updated on #1262.
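For reference, the Debug.Assert form suggested here collapses the #if DEBUG block into one line that the compiler strips from release builds (a sketch against the snippet above):

```csharp
using System.Diagnostics;

var task = this.thriftClient.WriteBatchAsync(batch.Value, CancellationToken.None);

// Debug.Assert is conditionally compiled on the DEBUG symbol, so the check
// disappears from release builds without a hand-rolled #if/#endif.
Debug.Assert(
    task.Status == TaskStatus.RanToCompletion,
    "WriteBatchAsync was expected to complete synchronously.");
```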
Changes
Performance before:
Performance after:
CHANGELOG.md updated for non-trivial changes