-
Notifications
You must be signed in to change notification settings - Fork 19
Trace thread state is removed when root spans are completed #22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Trace thread state is removed when root spans are completed #22
Conversation
Multiple root spans are not allowed, traceId state should not leak between multiple root spans.
|
@schlosna fysa |
| /** Returns the globally unique identifier for this thread's trace. */ | ||
| public static String getTraceId() { | ||
| return currentTrace.get().getTraceId(); | ||
| return Preconditions.checkNotNull(currentTrace.get(), "There is no root span").getTraceId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I considered returning a newly generated ID when no trace is available, but that could potentially result in bad data if we initialize a new span assuming that traceId will be maintained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Possibly want to use the safelogging Preconditions here so these messages will definitely not get redacted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that would be ideal, but we don't currently depend on safe-logging. I'd prefer not to add deps in this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we expose a public static boolean hasTraceId() to allow checking if tracing is currently enabled?
Once this PR goes through we can obviously clean up https://github.com/palantir/tritium/blob/develop/tritium-tracing/src/main/java/com/palantir/tritium/tracing/TracingInvocationEventHandler.java#L71 as getAndClearTrace will now do the right thing and clear state, but there may be cases where we want to know if there is a root trace ID already set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to a hasTraceId or a new method to return Optional<String>... otherwise you can't eyeball I method like this and be confident it's safe:
@Override
public HttpConnection create(URL url) throws IOException {
HttpConnection connection = delegate.create(url);
String traceId = Tracer.getTraceId();
connection.setRequestProperty(TraceHttpHeaders.TRACE_ID, traceId);
return connection;
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added.
I searched for usages of getTraceId, and everything that I found called it after creating a new span, so it should be safe, though there may be cases that I'm not aware of. For what it's worth, anything that does throw here is also responsible for leaking traceIds.
| MDC.put(Tracers.TRACE_ID_KEY, trace.getTraceId()); | ||
| return trace; | ||
| }); | ||
| private static final ThreadLocal<Trace> currentTrace = new ThreadLocal<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK so I guess this is the key change - this guy now uses null to represent the idea that we're not in a trace?
I guess we just have to be super careful now everywhere that we use currentTrace.get() that it might actually be null!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct, otherwise trace thread state leaks outside of root spans, which causes multiple root spans unless code directly creates a new traceId.
| .filter(openSpan -> currentTrace.get().isObservable()) | ||
| .map(openSpan -> toSpan(openSpan, metadata)) | ||
| .ifPresent(Tracer::notifyObservers); | ||
| if (isTraceObservable()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why put this behind the isTraceObservable check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ack, this is a bug. fix incoming
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
I think this makes sense directionally, but I want to be super careful changing this stuff because it's been stable for quite a long time and it underpins pretty much every one of our services! |
| return trace; | ||
| } | ||
|
|
||
| private static void clearCurrentTrace() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the only two places this can be called are:
public Trace getAndClearTrace()- when a call to
fastCompleteSpan()->popCurrentSpan()(pops the last remaining span)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is correct
schlosna
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supportive of this PR as it will allow removing the workarounds added in palantir/tritium#134
Have a few questions/nits below
| Optional<OpenSpan> prevState = trace.top(); | ||
| if (prevState.isPresent()) { | ||
| spanBuilder.parentSpanId(prevState.get().getSpanId()); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could simplify
| } | |
| trace.top().ifPresent(openSpan -> spanBuilder.parentSpanId(openSpan.getSpanId())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opted to avoid the lambda allocation since this is a relatively hot path. We do a lot elsewhere, so I could go either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense
| if (trace != null) { | ||
| boolean observable = isTraceObservable(); | ||
| Optional<OpenSpan> span = popCurrentSpan(); | ||
| if (observable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we reuse the trace to avoid an additional thread local get?
| if (observable) { | |
| if (isObservable(trace)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call, I think we want trace.isObservable() rather than a new method unless I'm misunderstanding.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the call sites should have a separate null check on the trace, so I think this will be okay.
| notifyObservers(span); | ||
| } | ||
| }); | ||
| if (maybeSpan.isPresent() && observable) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we reuse the trace to avoid additional thread local get and simplify?
| if (maybeSpan.isPresent() && observable) { | |
| if (isObservable(trace)) { | |
| maybeSpan.ifPresent(Tracer::notifyObservers); | |
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
| public static boolean isTraceObservable() { | ||
| return currentTrace.get().isObservable(); | ||
| Trace trace = currentTrace.get(); | ||
| return trace != null && trace.isObservable(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to above comments, could add isObservable(Trace) to avoid additional thread local lookups
| return trace != null && trace.isObservable(); | |
| return isObservable(currentTrace.get()); | |
| } | |
| private static boolean isObservable(Trace trace) { | |
| return trace != null && trace.isObservable(); | |
| } |
| /** Returns the globally unique identifier for this thread's trace. */ | ||
| public static String getTraceId() { | ||
| return currentTrace.get().getTraceId(); | ||
| return Preconditions.checkNotNull(currentTrace.get(), "There is no root span").getTraceId(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we expose a public static boolean hasTraceId() to allow checking if tracing is currently enabled?
Once this PR goes through we can obviously clean up https://github.com/palantir/tritium/blob/develop/tritium-tracing/src/main/java/com/palantir/tritium/tracing/TracingInvocationEventHandler.java#L71 as getAndClearTrace will now do the right thing and clear state, but there may be cases where we want to know if there is a root trace ID already set.
I think that would be helpful, but I would prefer to propose that in a separate change. |
|
I don't see a test for the changed behavior, could you add one, please? |
|
Was tested (though admittedly indirectly) by this assertion: Added a better test for this specific change. |
iamdanfox
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, I think this change looks good. It forces users to be a bit more precise rather than just assuming a root span will be invented.
I've also done a quick search internally for any users of the getTraceId method and it seems that very few people use it directly.
This version removes the need for the traceId thread state cleanup code. See: palantir/tracing-java#22
This version removes the need for the traceId thread state cleanup code. See: palantir/tracing-java#22
Multiple root spans are not allowed, traceId state should not
leak between multiple root spans.