Add call tracing #115
Conversation
force-pushed from 0000191 to f7cfd0b
@Value.Style(visibility = Value.Style.ImplementationVisibility.PACKAGE)
public abstract class TraceState {

    private static final Random RANDOM = new Random();
May want to consider ThreadLocalRandom here; per its Javadoc:
When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention.
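A minimal sketch of the suggested swap; the `TraceIds` helper is hypothetical, not code from this PR:

```java
import java.util.concurrent.ThreadLocalRandom;

public final class TraceIds {
    private TraceIds() {}

    // A shared Random funnels every thread through one seed's CAS loop;
    // ThreadLocalRandom keeps per-thread state and avoids that contention.
    static long randomId() {
        return ThreadLocalRandom.current().nextLong();
    }
}
```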
do we still need this now that #74 is merged?
Yes, this provides a per-invocation tracking system which will supply the same id to code that wants it, traversing possibly many system boundaries. (See Zipkin, for instance, as a viewer of this kind of data.)
Ah I see. Ok cool. Though are there existing libraries that do this for us? Took a quick look around and would something like https://github.com/openzipkin/brave work for us? (maybe it's too much overhead and that's why we're doing it ourselves here?)
Latest changeset moves to Zipkin/OpenTracing semantics; things left to do here:
force-pushed from 88f30f1 to 583c2f1
String traceId = safeGetOnlyElement(response.headers().get(Traces.Headers.TRACE_ID), null);
String spanId = safeGetOnlyElement(response.headers().get(Traces.Headers.SPAN_ID), null);
Optional<TraceState> trace = Traces.getTrace();
if (traceId != null && spanId != null && trace.isPresent()) {
we could consider always popping the top of the trace here; in theory, every single request should be matched by a decode call.
👍
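Sketching the "always pop" idea: if every outgoing request pushes a span and the response decoder pops unconditionally, the per-thread stack cannot leak entries. All names here are hypothetical, not this PR's implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class TraceStack {
    private static final ThreadLocal<Deque<String>> SPANS =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Called when a request is issued.
    static void push(String spanId) {
        SPANS.get().push(spanId);
    }

    // Called from the response decoder; returns null if nothing was pushed.
    static String pop() {
        Deque<String> stack = SPANS.get();
        return stack.isEmpty() ? null : stack.pop();
    }
}
```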
force-pushed from ed394f0 to 5c0decf
Dave, meant to ask, do you know if the IDs need to be 64-bit or if that's just common practice? If it's common practice, how do others typically get them to be unique enough?
force-pushed from 5c0decf to cef3b73
Every implementation I have seen uses a variation of Twitter's Snowflake algorithm. - marc
Interesting, it looks like that's a central service rather than an algorithm. With the fauxflake thing you linked I'm not sure what to seed as the service identifier.
It can be anything, really, or a config value specified at startup time. - marc
That's not super satisfying, I'll look at what we can use.
I like time stamp : MAC address : service-specific id : counter. - marc
Maybe some variation of service name+version : MAC address : time stamp? Assuming that the MAC address is accessible, that we don't collide multiple copies of services on the same host, and that we can trust our NTP servers. - marc
I think the challenge is turning that into a 64-bit number?
The sad thing is the form you're describing is what UUIDs are supposed to be anyway.
Yeah I know, but type 1 UUIDs are so confusing! - marc
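For reference, a minimal sketch of the Snowflake-style composition discussed above, assuming the common 41-bit millisecond timestamp / 10-bit node id / 12-bit counter split. The class, the custom epoch, and the node-id source (a config value or MAC-derived number, per the discussion) are illustrative, not part of this PR:

```java
final class SnowflakeIds {
    private static final long EPOCH = 1463529600000L; // arbitrary custom epoch

    private final long nodeId; // 10 bits: 0..1023, the "service identifier"
    private long lastTimestamp = -1L;
    private long sequence = 0L; // 12 bits: 0..4095 per millisecond

    SnowflakeIds(long nodeId) {
        this.nodeId = nodeId & 0x3FF;
    }

    // Note: does not handle the clock moving backwards.
    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;
            if (sequence == 0) { // counter exhausted; spin until the next ms
                while ((now = System.currentTimeMillis()) <= lastTimestamp) {}
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << 22) | (nodeId << 12) | sequence;
    }
}
```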
public abstract long getStartTimeMs();

public abstract long getDurationNs();
does ns resolution make sense if the start time is in ms?
Accurate timing on the JVM needs to use the nanosecond counter on the CPU, but wall-clock time is only really available in milliseconds. I'm happy to convert the nanoseconds to milliseconds, but I figured I'd save the math operation and leave that to a subscriber. (Don't have a strong opinion, just explaining the rationale.)
ok, that sounds fine
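A small sketch of the split described above: wall-clock start from `System.currentTimeMillis()`, duration from the monotonic `System.nanoTime()` counter, with the ns-to-ms conversion left to the consumer. `doTracedWork()` is a placeholder:

```java
import java.util.concurrent.TimeUnit;

final class SpanTiming {
    static void example() {
        long startTimeMs = System.currentTimeMillis(); // wall clock, ms resolution
        long startNs = System.nanoTime();              // monotonic, for elapsed time

        doTracedWork();

        long durationNs = System.nanoTime() - startNs;
        // A subscriber that wants milliseconds can do the conversion itself:
        long durationMs = TimeUnit.NANOSECONDS.toMillis(durationNs);
        System.out.printf("start=%dms duration=%dns (%dms)%n",
                startTimeMs, durationNs, durationMs);
    }

    private static void doTracedWork() { /* placeholder for the traced operation */ }
}
```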
force-pushed from a1460f3 to 986c1dd
@uschi2000 ready for another look, LMK if you have additional feedback
For the tracing, how about using https://github.com/openzipkin/brave? I am about to prototype this for Phoenix.
Could you point me to some details on how to make a span internal to a server with Brave? Poking around the readmes and code, this seems entirely non-obvious to me, and it's an important part of the tracing framework.
Separately, as long as the HTTP parts of the implementation match the Zipkin APIs, it shouldn't matter how the internals work.
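To make "match the Zipkin APIs" concrete at the HTTP layer: Zipkin propagates trace context via its B3 headers, so any tracer that reads and writes them can interoperate. A hedged sketch; the helper class is hypothetical, and this PR's own header constants live in `Traces.Headers` and may use different names:

```java
import java.net.HttpURLConnection;

final class B3Headers {
    // traceId/spanId/parentId are hex-encoded 64-bit ids.
    static void propagate(HttpURLConnection conn, String traceId, String spanId, String parentId) {
        conn.setRequestProperty("X-B3-TraceId", traceId);
        conn.setRequestProperty("X-B3-SpanId", spanId);
        if (parentId != null) { // root spans have no parent
            conn.setRequestProperty("X-B3-ParentSpanId", parentId);
        }
    }
}
```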
Maybe I'm misreading the code, but that doesn't seem to allow nesting of recorded local spans?
Haven't dug into this in enough detail to answer, sorry.
Generally, the Brave code-base looks very solid (and has no external dependencies).
It looks like it'd be a pretty substantive change to how Brave stores state (I've been poking around its code for quite a bit now) to make this work, and it would be enough of an API break that it'd likely mean a major version bump for it. We've found a lot of value in other internal initiatives having this particular feature, though maybe it's less interesting when services are granular enough. At any rate, I can make two proposals:
The upside of option #1 is immediate integration with Zipkin. A higher priority for us than that, though, is a tracing-aware Dropwizard logging addition, so we can get trace and span ids into every log statement we emit, including in request logs.
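One way such a logging addition could look, sketched with SLF4J's MDC, which Dropwizard's logback configuration can reference via `%X{traceId}` in a format pattern. The helper and the MDC key name are assumptions, not part of this PR:

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

final class TraceLogging {
    private static final Logger log = LoggerFactory.getLogger(TraceLogging.class);

    // Run work with the trace id bound to the logging context; a pattern
    // containing %X{traceId} then includes it in every line logged inside.
    static void withTraceId(String traceId, Runnable work) {
        MDC.put("traceId", traceId);
        try {
            log.info("trace bound"); // carries traceId via the MDC
            work.run();
        } finally {
            MDC.remove("traceId");
        }
    }
}
```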
I am going to be digging into this a bit more over the course of next week, but my understanding was that they definitely had some solution for in-process tracing (ThreadLocal-based) and a solution for crossing threadpool boundaries.
I am definitely going to be prototyping this for Phoenix with Brave, simply because that is the only thing I can use right now (we have an upgrade to http-remoting in the works). I want to see Zipkin-style traces coming out of Phoenix: adding trace-aware log messages is useful, but right now Phoenix doesn't actually do that much logging, so it's lower priority.
Crossing thread pool boundaries is simple enough and would be a small addition here (a sketch follows below); the challenge is maintaining an internal stack of spans so that you can instrument internals in a detailed way. Again, I'm not that familiar with Brave, so I'm definitely open to the possibility that I'm missing something in its capabilities or implementation.
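For illustration, a sketch of that small addition: capture the trace when work is submitted and restore it on the worker thread. `Traces.getTrace()` appears in this PR's diff; `Traces.setTrace(...)` and `Traces.clearTrace()` are assumed names for illustration:

```java
import java.util.Optional;
import java.util.concurrent.Callable;

final class TracedCallable<V> implements Callable<V> {
    private final Callable<V> delegate;
    private final Optional<TraceState> trace = Traces.getTrace(); // captured on the submitting thread

    TracedCallable(Callable<V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public V call() throws Exception {
        if (trace.isPresent()) {
            Traces.setTrace(trace.get()); // hypothetical setter: restore on the worker thread
        }
        try {
            return delegate.call();
        } finally {
            Traces.clearTrace(); // hypothetical cleanup so worker-thread state doesn't leak
        }
    }
}
```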
Should we set a deadline for the Brave experimentation and punt on this PR?
A bit more digging around: I hooked Brave into a servlet filter and then invoked LocalTracer twice:
Gave me 2 traces:
So the traces got linked correctly, but it looks to me like it lost the intermediate this-is-a-test span and only recorded this-is-a-test1, so no stacking, as per what @markelliot is saying. I'll continue poking around tomorrow, maybe I am using it wrong.
Looking at it a bit more, the ServerTracer, ClientTracer, and LocalTracer all operate on ServerClientAndLocalSpanState, which is an interface and can be replaced. Not sure how many things it would break if we replaced its ThreadLocalServerClientAndLocalSpanState with what you had implemented here, and not sure how valuable it is to use this library if we need to replace this part of it, or how constraining it is.
Ok, I think I have a better understanding of the library. The core is ServerClientAndLocalSpanState: this is basically the per-thread storage for the current server, local, and client spans; it only stores a single span per category. Then you have ServerTracer -> LocalTracer -> ClientTracer. This means that whenever LocalTracer wants to start a new span, it will look up the current server trace from ServerClientAndLocalSpanState, use that as its parent, and set the new span into ServerClientAndLocalSpanState. Similarly for ClientTracer: it will use the current local trace as its parent.

Therefore, if you want nesting in local tracing, you just need another layer that restores the current local trace after a subtrace returns (a sketch follows below); ServerClientAndLocalSpanState exposes the methods that set the spans for server, local, and client.

I am going to go with Brave for my little experiment (in fact I have almost finished integrating it), since adding this small layer for nested local tracing (which I will not need just yet) should be fairly trivial, and I gain a lot of the out-of-the-box integrations that Brave comes with. Not sure how we'd like to proceed? I suggest:
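A minimal sketch of that restore-after-subtrace layer. The `Span` and `LocalSpanState` types here are stand-ins in the spirit of Brave's ServerClientAndLocalSpanState accessors, not Brave's actual API:

```java
import java.util.concurrent.Callable;

final class NestedLocalTrace {
    interface Span {} // stand-in for the tracer's span type

    interface LocalSpanState {
        Span getCurrentLocalSpan();
        void setCurrentLocalSpan(Span span);
    }

    // Run work in a child span, then restore the parent as the current
    // local span so subsequent sub-spans nest under the right parent.
    static <V> V inSpan(LocalSpanState state, Span child, Callable<V> work) throws Exception {
        Span parent = state.getCurrentLocalSpan();
        state.setCurrentLocalSpan(child);
        try {
            return work.call();
        } finally {
            state.setCurrentLocalSpan(parent);
        }
    }
}
```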
Also, how does this all compare to Chronicle, and what should we use?
Chronicle and the implementation here are nearly identical (though that's not so surprising since I'm the original author of Chronicle). We would've used it except its dependency tree makes it hard to open source; I looked at the impl a bit as I put together this setup. I think internal spans have been extremely useful in the past, and I would highly value continuing to have them. Replacing the guts of Brave seems pretty daunting; the setup here would give us an API-compatible system, though none of the Zipkin connectivity, so the trade is likely a hard one.
Looks like there is interest from Brave in supporting this: openzipkin/brave#166
All remoting calls will now emit standard Zipkin-style tracing headers for traceId and spanId. Additionally, applications including the tracing library can start traces inside the application using Traces.deriveTrace(String) and emit a span with a completion call to Traces.complete(). (Traces must happen within a single thread; work dispatched to executors may not track correctly given current implementation details.)

Servers can interpret, load, and automatically continue passing the same trace identifier by installing the TraceInheritingFilter Jersey resource. Calls that did not include a trace will emit a newly identified trace as a result of installing the filter.
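A usage sketch of the API described above; `Traces.deriveTrace(String)` and `Traces.complete()` are the calls named in this PR, while the operation name and `generateReport()` are placeholders:

```java
final class TracedOperation {
    void run() {
        Traces.deriveTrace("computeReport"); // open a span on the current trace
        try {
            generateReport();
        } finally {
            Traces.complete(); // emit the span, even if the work throws
        }
    }

    private void generateReport() { /* application work */ }
}
```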