Add call tracing #115

markelliot · 2016-05-02T17:52:30Z

All remoting calls will now emit standard Zipkin-style tracing headers for traceId and spanId.

Additionally, applications that include the tracing library can start and stop traces
inside the application: call Traces.deriveTrace(String) to start one and
Traces.complete() to finish it and emit a span. (A trace must stay within a single
thread; work dispatched to executors may not be tracked correctly given current
implementation details.)
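
To illustrate the start/complete pairing described above, here is a minimal sketch of a per-thread span stack. SketchTraces and its nested types are hypothetical stand-ins, not the classes in this PR; only the deriveTrace/complete method names are borrowed from the description, and the thread-local stack is an assumption about how such an API could be backed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.UUID;

/** Hypothetical stand-in for the PR's Traces class; not the real implementation. */
final class SketchTraces {
    /** A finished span: trace id, operation name, and measured duration. */
    static final class CompletedSpan {
        final String traceId;
        final String operation;
        final long durationNs;

        CompletedSpan(String traceId, String operation, long durationNs) {
            this.traceId = traceId;
            this.operation = operation;
            this.durationNs = durationNs;
        }
    }

    private static final class OpenSpan {
        final String traceId;
        final String operation;
        final long startNs;

        OpenSpan(String traceId, String operation, long startNs) {
            this.traceId = traceId;
            this.operation = operation;
            this.startNs = startNs;
        }
    }

    // One stack per thread: a task running on another thread sees an empty stack.
    private static final ThreadLocal<Deque<OpenSpan>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    /** Opens a span; it joins the enclosing trace if one is open, else starts a new trace. */
    static void deriveTrace(String operation) {
        Deque<OpenSpan> stack = STACK.get();
        String traceId = stack.isEmpty() ? UUID.randomUUID().toString() : stack.peek().traceId;
        stack.push(new OpenSpan(traceId, operation, System.nanoTime()));
    }

    /** Closes the most recently opened span; throws NoSuchElementException if none is open. */
    static CompletedSpan complete() {
        OpenSpan open = STACK.get().pop();
        return new CompletedSpan(open.traceId, open.operation, System.nanoTime() - open.startNs);
    }
}
```

Each deriveTrace call should be matched by a complete call on the same thread, typically in a finally block, so the stack unwinds even when the traced code throws.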

Servers that install the TraceInheritingFilter Jersey resource will read any
incoming trace identifier and automatically continue passing it on outgoing
calls. With the filter installed, calls that arrive without a trace are assigned
a newly identified one.
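
The inherit-or-create decision such a filter makes can be sketched without any Jersey dependency. TraceHeaderSketch is a hypothetical helper, not the TraceInheritingFilter from this PR. The header names are Zipkin's B3 names; that the PR uses exactly these names is an assumption.

```java
import java.util.Map;
import java.util.UUID;

/** Hypothetical helper showing the inherit-or-create decision; not the PR's filter. */
final class TraceHeaderSketch {
    // Zipkin's B3 header names. Whether the PR uses exactly these is an assumption.
    static final String TRACE_ID = "X-B3-TraceId";
    static final String SPAN_ID = "X-B3-SpanId";

    /** Returns the caller's trace id when present, otherwise a freshly generated one. */
    static String inheritOrCreateTraceId(Map<String, String> requestHeaders) {
        String incoming = requestHeaders.get(TRACE_ID);
        if (incoming != null && !incoming.isEmpty()) {
            return incoming;
        }
        return UUID.randomUUID().toString();
    }
}
```

HTTP header names are case-insensitive, so a caller would want to pass a case-insensitive map (for example a TreeMap built with String.CASE_INSENSITIVE_ORDER).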

schlosna · 2016-05-02T18:15:45Z

tracing/src/main/java/com/palantir/tracing/TraceState.java

+@Value.Style(visibility = Value.Style.ImplementationVisibility.PACKAGE)
+public abstract class TraceState {
+
+    private static final Random RANDOM = new Random();


May want to consider ThreadLocalRandom as:

When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention.
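
A minimal sketch of the suggested change, assuming the Random is only used to produce identifiers. SpanIds is a hypothetical helper name, not a class in this PR.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; shows ThreadLocalRandom in place of a shared static Random. */
final class SpanIds {
    /** Returns a random 64-bit value using the calling thread's own generator. */
    static long nextId() {
        // current() must be called on the thread that uses it; do not cache it in a field.
        return ThreadLocalRandom.current().nextLong();
    }

    /** Renders an id as 16 lowercase hex digits, zero-padded. */
    static String toHex(long id) {
        return String.format("%016x", id);
    }
}
```

ThreadLocalRandom is not cryptographically secure, which is usually acceptable for trace identifiers.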

pnepywoda · 2016-05-03T01:12:45Z

do we still need this now that #74 is merged?

markelliot · 2016-05-03T06:09:25Z

Yes, this provides a per-invocation tracking system which will supply the same id to code that wants it traversing possibly many system boundaries. (See Zipkin, for instance, as a viewer of this kind of data.)

pnepywoda · 2016-05-03T18:26:11Z

Ah I see. Ok cool. Though are there existing libraries that do this for us? Took a quick look around and would something like https://github.com/openzipkin/brave work for us? (maybe it's too much overhead and that's why we're doing it ourselves here?)

markelliot · 2016-05-16T16:50:37Z

Latest changeset moves to Zipkin/OpenTracing semantics. Things left to do here:

generate 64-bit IDs instead of UUIDs
figure out how to complete spans on Feign/Retrofit returns (which may or may not matter, since we now automatically complete spans when the server responds)
figure out where to send completion events

markelliot · 2016-05-17T10:18:23Z

http-clients/src/main/java/feign/TraceResponseDecoder.java

+        String traceId = safeGetOnlyElement(response.headers().get(Traces.Headers.TRACE_ID), null);
+        String spanId = safeGetOnlyElement(response.headers().get(Traces.Headers.SPAN_ID), null);
+        Optional<TraceState> trace = Traces.getTrace();
+        if (traceId != null && spanId != null && trace.isPresent()) {


We could consider always popping the top of the trace here; in theory, every single request should be matched by a decode call.

schlosna · 2016-05-18T03:10:00Z

👍

markelliot · 2016-05-18T15:20:27Z

Dave, meant to ask, do you know if the IDs need to be 64bit or if that’s just common practice? If it’s common practice, how do others typically get them to be unique enough?

splittingfield · 2016-05-18T16:00:29Z

Every implementation I have seen uses a variation of Twitter's Snowflake algorithm.

marc


markelliot · 2016-05-18T16:19:56Z

Interesting, it looks like that's a central service rather than an algorithm. With the fauxflake thing you linked I'm not sure what to seed as the service identifier.

splittingfield · 2016-05-18T16:30:07Z

It can be anything really or a config value specified at startup time.

marc


markelliot · 2016-05-18T16:45:43Z

That's not super satisfying, I'll look at what we can use.

splittingfield · 2016-05-18T17:23:23Z

I like time stamp: MAC address: service specific id : counter (from

marc


splittingfield · 2016-05-18T17:42:04Z

Maybe some variation of service name+version:MAC address: time stamp?

Assuming that the MAC address is accessible, that we don't collide multiple copies of a service on the same host, and that we can trust outer NTP servers.

marc


markelliot · 2016-05-18T17:44:32Z

I think the challenge is turning that into a 64-bit number? The sad thing is that the form you're describing is what UUIDs are supposed to be anyway.
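
One way to fit a timestamp, a node identifier, and a counter into 64 bits is the bit layout popularized by Snowflake: 41 bits of milliseconds since a custom epoch, 10 bits of node id, and 12 bits of per-millisecond sequence. The sketch below is an illustration under those assumptions, not code from this PR. SnowflakeSketch, the chosen epoch, and how the node id is derived (a configured value, a hash of the MAC address, and so on) are all hypothetical.

```java
/** Hypothetical Snowflake-style generator: 41-bit timestamp, 10-bit node, 12-bit sequence. */
final class SnowflakeSketch {
    private static final int NODE_BITS = 10;
    private static final int SEQUENCE_BITS = 12;
    private static final int SEQUENCE_MASK = (1 << SEQUENCE_BITS) - 1;
    private static final long EPOCH_MS = 1451606400000L; // 2016-01-01T00:00:00Z, chosen arbitrarily

    private final int nodeId;
    private long lastMs = -1;
    private int sequence = 0;

    SnowflakeSketch(int nodeId) {
        if (nodeId < 0 || nodeId >= (1 << NODE_BITS)) {
            throw new IllegalArgumentException("nodeId must fit in " + NODE_BITS + " bits");
        }
        this.nodeId = nodeId;
    }

    /** Produces the next id for the given wall-clock time (passed in to keep the sketch testable). */
    synchronized long nextId(long nowMs) {
        if (nowMs < lastMs) {
            throw new IllegalStateException("clock moved backwards");
        }
        if (nowMs == lastMs) {
            sequence = (sequence + 1) & SEQUENCE_MASK;
            if (sequence == 0) {
                // A production generator would wait for the next millisecond instead.
                throw new IllegalStateException("sequence exhausted for this millisecond");
            }
        } else {
            sequence = 0;
            lastMs = nowMs;
        }
        return pack(nowMs, nodeId, sequence);
    }

    static long pack(long timestampMs, int nodeId, int sequence) {
        return ((timestampMs - EPOCH_MS) << (NODE_BITS + SEQUENCE_BITS))
                | ((long) nodeId << SEQUENCE_BITS)
                | sequence;
    }

    static long timestampOf(long id) {
        return (id >>> (NODE_BITS + SEQUENCE_BITS)) + EPOCH_MS;
    }

    static int nodeOf(long id) {
        return (int) ((id >>> SEQUENCE_BITS) & ((1 << NODE_BITS) - 1));
    }

    static int sequenceOf(long id) {
        return (int) (id & SEQUENCE_MASK);
    }
}
```

Uniqueness then rests on two things raised in this thread: node ids must not collide between running instances, and the clock must not run backwards.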

splittingfield · 2016-05-18T17:45:17Z

Yeah I know, but type 1 uuids are so confusing!

marc


uschi2000 · 2016-05-19T00:45:24Z

tracing/src/main/java/com/palantir/tracing/Span.java

+
+    public abstract long getStartTimeMs();
+
+    public abstract long getDurationNs();


does ns resolution make sense if the start time is in ms?

markelliot · 2016-05-19

Accurate timing on the JVM needs to use the nanosecond counter on the CPU, but wall clock time is only really available in milliseconds. I'm happy to convert the nanoseconds to milliseconds but I figured I'd save the math operation and leave that to a subscriber. (Don't have a strong opinion, just explaining rationale.)

uschi2000 · 2016-05-19

ok, that sounds fine
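
The split between a millisecond start time and a nanosecond duration can be sketched as follows. SpanTimer is a hypothetical helper, not part of this PR; it pairs System.currentTimeMillis() for the wall-clock start with System.nanoTime() for the elapsed interval, and shows the conversion a subscriber might apply.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper pairing a wall-clock start with a monotonic duration. */
final class SpanTimer {
    // Wall-clock time: comparable across machines, but only millisecond resolution.
    private final long startTimeMs = System.currentTimeMillis();
    // Monotonic counter: meaningful only as a difference within this JVM.
    private final long startNanos = System.nanoTime();

    long getStartTimeMs() {
        return startTimeMs;
    }

    /** Nanoseconds elapsed since this timer was created. */
    long elapsedNs() {
        return System.nanoTime() - startNanos;
    }

    /** The conversion left to a subscriber; truncates toward zero. */
    static long toMillis(long durationNs) {
        return TimeUnit.NANOSECONDS.toMillis(durationNs);
    }
}
```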


markelliot · 2016-05-22T20:16:28Z

@uschi2000 ready for another look, LMK if you have additional feedback

jkozlowski · 2016-06-02T07:59:18Z

For the tracing, how about using https://github.com/openzipkin/brave? I am about to prototype this for Phoenix.

markelliot · 2016-06-02T13:30:56Z

Could you point me to some details on how to make a span internal to a server with Brave? Poking around the readmes and code this seems entirely non-obvious to me, and is an important part of the tracing framework. Separately, as long as the HTTP parts of the implementation match the Zipkin APIs, it shouldn't matter how the internals work.

uschi2000 · 2016-06-02T14:03:21Z

Like this?
https://github.com/openzipkin/brave/blob/master/brave-core/src/main/java/com/github/kristofa/brave/LocalTracer.java


markelliot · 2016-06-02T14:07:07Z

Maybe I'm misreading the code, but that doesn't seem to allow nesting of recorded local spans?

uschi2000 · 2016-06-02T14:43:58Z

Haven't dug in enough detail to answer this either, sorry.


uschi2000 · 2016-06-02T14:45:33Z

Generally, the Brave code-base looks very solid (and has no external
dependencies), so I think we should consider it, even if it's currently
lacking that feature. (We could always push for adding it there.)


markelliot · 2016-06-02T15:01:08Z

It looks like it'd be a pretty substantive change to how it stores state (been poking around code for quite a bit now) to make it work, and would be enough of an API break that it'd likely mean a major version bump for it.

We've found a lot of value from other internal initiatives having this particular feature, though maybe it's less interesting when services are granular enough.

At any rate, I can make two proposals:

1. You take a stab at an alternative PR that adds Brave everywhere necessary (I'm not sure I have time at this point to retry this with another library)
2. We take this PR and advise against anyone invoking "Traces" directly for internal instrumentation so that we can re-evaluate later

Upside for option #1 is immediate integration with Zipkin. I think a higher priority for us than that, though, is a tracing-aware Dropwizard logging addition, so we can get trace and span ids into every log statement we emit, including in request logs.

jkozlowski · 2016-06-02T15:15:09Z

I am going to be digging into this a bit more over the course of next week, but my understanding was that they definitely had some solution for inter-process tracing (ThreadLocal based) and a solution for crossing threadpool boundaries.

jkozlowski · 2016-06-02T15:17:06Z

I am definitely going to be prototyping this for Phoenix with Brave, simply because that is the only thing I can use right now (we have an upgrade to http-remoting in the works). I want to see Zipkin-style traces coming out of Phoenix: adding trace-aware log messages is useful, but right now Phoenix doesn't actually do that much logging, so it's lower priority.

markelliot · 2016-06-02T15:17:49Z

Crossing thread pool boundaries is simple enough and would be a small addition here; the challenge is maintaining an internal stack of spans so that you can instrument internally in a detailed way. Again, not that familiar with Brave, so definitely open to the possibility I'm missing something in its capability or implementation.
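
For the thread pool part, a common approach is to capture the caller's trace context when a task is submitted and reinstate it on the worker thread. The sketch below assumes the context is a single trace id held in a ThreadLocal. TraceContextSketch is a hypothetical name and is not part of this PR or of Brave.

```java
/** Hypothetical thread-local trace context with a wrapper for tasks run on other threads. */
final class TraceContextSketch {
    private static final ThreadLocal<String> CURRENT_TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) {
        CURRENT_TRACE_ID.set(traceId);
    }

    static String get() {
        return CURRENT_TRACE_ID.get();
    }

    /** Captures the caller's trace id now and reinstates it on whichever thread runs the task. */
    static Runnable wrap(Runnable task) {
        final String captured = CURRENT_TRACE_ID.get();
        return () -> {
            String previous = CURRENT_TRACE_ID.get();
            CURRENT_TRACE_ID.set(captured);
            try {
                task.run();
            } finally {
                // Restore so the id does not leak into the pooled thread's next task.
                CURRENT_TRACE_ID.set(previous);
            }
        };
    }
}
```

A caller would submit `TraceContextSketch.wrap(task)` to the executor instead of `task`. Carrying a whole stack of open spans across threads takes more care, which is the harder part noted above.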

uschi2000 · 2016-06-02T15:40:41Z

Should we set a deadline for the Brave experimentation and punt on this PR
until then? Would suggest end of next week.


jkozlowski · 2016-06-02T17:46:48Z

A bit more digging around: I hooked brave into a servlet filter and then invoked LocalTracer twice:

    public PhoenixVersion getVersion() {
        brave.localTracer().startNewSpan("this-is-a-test", "text");
        try {
            Thread.sleep(1000);
            secondLevel();
        } catch (InterruptedException e) {
            throw Throwables.propagate(e);
        } finally {
            brave.localTracer().finishSpan();
        }
    }

    private void secondLevel() {
        brave.localTracer().startNewSpan("this-is-a-test1", "text1");
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            throw Throwables.propagate(e);
        } finally {
            brave.localTracer().finishSpan();
        }
    }

Gave me 2 traces:

{"traceId":"0000000000000001","name":"text1","id":"7672bbb60ba5b369","parentId":"0000000000000002","timestamp":1464889183460000,"duration":2004831,"annotations":[],"binaryAnnotations":[{"key":"lc","value":"this-is-a-test1","endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"}}]}

{"traceId":"0000000000000001","name":"get","id":"0000000000000002","timestamp":1464889182435000,"duration":3150000,"annotations":[{"endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"},"timestamp":1464889182435000,"value":"sr"},{"endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"},"timestamp":1464889185585000,"value":"ss"}],"binaryAnnotations":[{"key":"http.status_code","value":"200","endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"}},{"key":"http.url","value":"/phoenix/api/v0/version","endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"}}]}

So the traces got linked correctly, but it looks to me like it lost the intermediate this-is-a-test and only recorded this-is-a-test1 - so no stacking as per what @markelliot is saying. I'll continue poking around tomorrow, maybe I am using it wrong.

jkozlowski · 2016-06-03T07:43:12Z

Looking at it a bit more, ServerTracer, ClientTracer and LocalTracer all operate on ServerClientAndLocalSpanState, which is an interface and can be replaced. Not sure how many things it would break if we replaced its ThreadLocalServerClientAndLocalSpanState with what you had implemented here. Not sure how valuable it is to use this library if we need to replace this part of it, or how constraining that would be.

jkozlowski · 2016-06-03T09:24:38Z

Ok I think I have a better understanding of the library:

The core is ServerClientAndLocalSpanState: this is basically the per-thread storage for the current server, local and client spans: it only stores a single span per each category.

Then you have ServerTracer -> LocalTracer -> ClientTracer. This means that whenever LocalTracer wants to start a new span, it will look up the current server trace from ServerClientAndLocalSpanState, use that as its parent, and set the new span into ServerClientAndLocalSpanState. Similarly for ClientTracer: it will use the current local trace as its parent.

Therefore, if you want to have nesting in local tracing, you just need to have another layer that restores the current local trace after a subtrace returns: the ServerClientAndLocalSpanState has basically methods that set the spans for server, local and client.
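
The restoring layer described above can be sketched as a save, set, run, restore wrapper. LocalSpanHolder and SimpleSpanHolder are hypothetical stand-ins for a single-slot span state, with spans reduced to plain strings; they are not Brave's interfaces, and real code would also need to start and finish the span through the tracer.

```java
/** Hypothetical single-slot holder for the current local span; not Brave's interface. */
interface LocalSpanHolder {
    String getCurrentLocalSpan();

    void setCurrentLocalSpan(String span);
}

/** Trivial in-memory holder, used only to exercise the sketch. */
final class SimpleSpanHolder implements LocalSpanHolder {
    private String current;

    @Override
    public String getCurrentLocalSpan() {
        return current;
    }

    @Override
    public void setCurrentLocalSpan(String span) {
        this.current = span;
    }
}

final class NestedLocalTracing {
    /** Runs body with spanName as the current local span, then restores the previous one. */
    static void withLocalSpan(LocalSpanHolder state, String spanName, Runnable body) {
        String parent = state.getCurrentLocalSpan(); // null at the outermost level
        state.setCurrentLocalSpan(spanName);
        try {
            body.run();
        } finally {
            // Without this restore, the outer span is lost once the inner one finishes.
            state.setCurrentLocalSpan(parent);
        }
    }
}
```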

I am going to go with Brave for my little experiment (in fact I have almost finished integrating it), since adding this small layer for nested local tracing (which I will not want just yet) should be fairly trivial and I gain a lot of the out of the box integrations that Brave comes with.

Not sure how we'd like to proceed? I suggest:

Amend the PR to use Brave and hook everything up with a caveat that LocalTracer should not be used for multiple layers of tracing.
Contribute back the filters you created and integrations for retrofit etc. and then pull in the newest version.
Once we need the local tracing amended, implement it here first and test drive it on a few projects and see how they feel about contributing this back.

jkozlowski · 2016-06-03T10:35:53Z

Also how does this all compare to Chronicle and what should we use?

markelliot · 2016-06-03T23:44:01Z

Chronicle and the implementation here are nearly identical (though that’s not so surprising since I’m the original author of Chronicle). We would’ve used it except its dependency tree makes it hard to open source – I looked at the impl a bit as I put together this setup.

I think internal spans have been extremely useful in the past, and would highly value continuing to have them.

Replacing the guts of Brave seems pretty daunting – the setup here would give us an API-compatible system though none of the Zipkin connectivity, so the trade is likely a hard one.

jkozlowski · 2016-06-06T08:41:00Z

Looks like there is interest from brave to support this: openzipkin/brave#166

markelliot force-pushed the feature/tracing branch from 0000191 to f7cfd0b Compare May 2, 2016 17:54

markelliot assigned uschi2000 May 2, 2016

schlosna reviewed May 2, 2016

markelliot force-pushed the feature/tracing branch 2 times, most recently from 88f30f1 to 583c2f1 Compare May 17, 2016 09:17

markelliot reviewed May 17, 2016

markelliot force-pushed the feature/tracing branch from ed394f0 to 5c0decf Compare May 18, 2016 15:13

markelliot force-pushed the feature/tracing branch from 5c0decf to cef3b73 Compare May 18, 2016 15:29

uschi2000 reviewed May 19, 2016

Update for PR comments

986c1dd

markelliot force-pushed the feature/tracing branch from a1460f3 to 986c1dd Compare May 19, 2016 17:19

markelliot added 3 commits May 19, 2016 13:21

Use standard OSS header

0a3a9d7

Update for PR comments

ad830bb

Undo immutables upgrade (which is causing build issues with latest Se…

2eb6153

…rializableError class)

schlosna mentioned this pull request May 25, 2016

As a developer/operator I want to be able to see what queries Atlas is running palantir/atlasdb#495

Closed

markelliot closed this Jun 22, 2016

markelliot deleted the feature/tracing branch June 22, 2016 02:51

schlosna mentioned this pull request Sep 21, 2016

Initialize Brave Zipkin tracing for Dropwizard servers and JAX RS clients #235

Closed

schlosna added a commit to schlosna/http-remoting that referenced this pull request Oct 11, 2016

WIP, merge of palantir#115

d432f1b

schlosna mentioned this pull request Oct 11, 2016

Brave tracer v2 #244

Closed

markelliot mentioned this pull request Mar 28, 2017

Remove Brave tracing #253

Merged

