-
Notifications
You must be signed in to change notification settings - Fork 19
[improvement] Tracer spans are inexpensive for unsampled operations #180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
1fc38fb to
49c6b99
Compare
49c6b99 to
f08617b
Compare
Added final fastStartSpan overload
Doesn't change performance.
schlosna
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 the 4x speedup and 4x reduction in allocations for common UNDECIDED case and 10x improvements for DO_NOT_SAMPLE are excellent 😄
| /** | ||
| * Like {@link #startSpan(String, String, SpanType)}, but does not return an {@link OpenSpan}. | ||
| */ | ||
| abstract void fastStartSpan(String operation, String parentSpanId, SpanType type); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps what's warranted here is just a brand new tracing implementation? I don't think continuing to hack the current implementation even further.
I'm also perplexed by why this would be the primary target for perf improvement, we do lots of expensive processing throughout the stack of calls across our services, and while Tracing is a constant overhead it's likely dwarfed by serde costs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps what's warranted here is just a brand new tracing implementation?
Sure, there are several problems we could solve in a new tracing implementation, I would be happy to help with that project when I have cycles, but it's not something that I can drive at the moment.
I'm also perplexed by why this would be the primary target for perf improvement, we do lots of expensive processing throughout the stack of calls across our services, and while Tracing is a constant overhead it's likely dwarfed by serde costs.
That is certainly true of the standard conjure service stack, however this change is not targeting problems with conjure services. We have an internal application with unusually high throughput and low latency requirements, built on our service library (for configuration, status, logging, metrics, etc) where interoperability with the standard stack allows us to avoid reinventing these components.
This sounded lower than expected, the benchmark was set to 10% sample rate instead of 1% for the |
| * not thread-safe and is intended to be used in a thread-local context. | ||
| */ | ||
| public final class Trace { | ||
| public abstract class Trace { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% sure, but I think this might be an ABI break.
A quick search internally suggested that nobody is actually importing the com.palantir.tracing.Trace class - perhaps we could even make it package-private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s exposed by Tracer.getAndClearTrace, however all methods are package private — it is only available as a marker
|
|
||
| abstract Optional<OpenSpan> top(); | ||
|
|
||
| abstract Optional<OpenSpan> pop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the contract of these two methods a bit confusing... normally an Optional.empty means an element was not present, but in this case the UnsampledTrace always returns Optional.empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct, this PR introduces a subtle change in behavior where completing an unsampled span (even using the slow complete span method) no longer returns a present value, because we know it is not recorded (in fact, sampling an unsampled span would be a bug).
Our tracing API is missing clean isolation between producers and consumers -- if I am creating spans, I don't need to know whether or not they are sampled, or what the span IDs/timing looks like, just how to start and stop them. Conceptually there's no reason to consume these values because that functionality is a concern of the SpanObserver.
The contract is defined by
tracing-java/tracing/src/main/java/com/palantir/tracing/Tracer.java
Lines 182 to 188 in 42aafc8
| /** | |
| * Completes and returns the current span (if it exists) and notifies all {@link #observers subscribers} about the | |
| * completed span. | |
| * If the return value is not used, prefer {@link Tracer#fastCompleteSpan()}. | |
| */ | |
| @CheckReturnValue | |
| public static Optional<Span> completeSpan() { |
There's a lot going on in that comment: "returns the current span (if it exists)". If we're not sampled, we don't really have a current span because we didn't record one. We have to track how many we would have created in the sampled case, but it does not violate the contract to return an empty value. This is reinforced by the next statement: "notifies all subscribers about the completed span", because subscribers are only notified if spans are sampled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, this is a very surprising PR. It is very much a break in behavior, and excusing it on a possible implied reading of the javadoc is not reasonable. This was done as a minor version bump (although given our current tooling a major version bump wouldn't have made a huge difference).
Conceptually there's no reason to consume these values because that functionality is a concern of the SpanObserver.
If that were true we would only ever be using fastCompleteSpan(). As that isn't true, this is not an implementation detail but an API contract.
|
|
||
| @Override | ||
| Optional<OpenSpan> pop() { | ||
| validateDepth(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm in the Sampled class, calling pop() repeatedly on an empty trace just returns Optional.empty over and over again
In this case though, it will throw. Should we make these behave the same? or is this difference important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pop validates that the value is positive, this should never throw unless we write a bug. I can add a comment explaining that, or remove the validation.
|
|
||
| @Override | ||
| boolean isEmpty() { | ||
| validateDepth(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory this call should be safe to delete right? As pop() is the only method that can decrement depth?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, validateDepth only exists to make sure I didn't write a bug :-)
We already have it in most cases, no reason to look up from the threadlocal.
|
👍 |
Before this PR
Operations which are fast in the common case were infeasible to trace because regardless of the trace rate, span creation performance was worse than the operation being measured.
After this PR
Added
Tracer.fastStartSpanwhich can effectively no-op for unsampled operations.