Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add call tracing #115

Closed
wants to merge 13 commits into from
Closed

Add call tracing #115

wants to merge 13 commits into from

Conversation

markelliot
Copy link
Contributor

@markelliot markelliot commented May 2, 2016

All remoting calls will now emit standard Zipkin-style tracing headers for traceId and spanId.

Additionally, applications including the tracing library can start and stop traces
inside the application using Traces.deriveTrace(String) and a completion call
to emit a span Traces.complete(). (Traces must happen within a single
thread, work dispatched to executors may not track correctly given current
implementation details.)

Servers can interpret, load, and will automatically continue passing
the same trace identifier by installing the TraceInheritingFilter
Jersey resource. Calls that did not include a trace will emit an identified trace
as a result of installing the filter.

@Value.Style(visibility = Value.Style.ImplementationVisibility.PACKAGE)
public abstract class TraceState {

private static final Random RANDOM = new Random();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May want to consider ThreadLocalRandom as:

When applicable, use of ThreadLocalRandom rather than shared Random objects in concurrent programs will typically encounter much less overhead and contention.

@pnepywoda
Copy link
Contributor

do we still need this now that #74 is merged?

@markelliot
Copy link
Contributor Author

Yes, this provides a per-invocation tracking system which will supply the same id to code that wants it traversing possibly many system boundaries. (See Zipkin, for instance, as a viewer of this kind of data.)

@pnepywoda
Copy link
Contributor

Ah I see. Ok cool. Though are there existing libraries that do this for us? Took a quick look around and would something like https://github.com/openzipkin/brave work for us? (maybe it's too much overhead and that's why we're doing it ourselves here?)

@markelliot
Copy link
Contributor Author

Latest changeset moves to Zipkin/OpenTracing semantics, things left to do here:

  • generate 64bit IDs instead of UUIDs
  • figure out how to complete spans on Feign/Retrofit returns (which may or may not matter since we now automatically complete spans when the server responds)
  • figure out where to send completion events

@markelliot markelliot force-pushed the feature/tracing branch 2 times, most recently from 88f30f1 to 583c2f1 Compare May 17, 2016 09:17
String traceId = safeGetOnlyElement(response.headers().get(Traces.Headers.TRACE_ID), null);
String spanId = safeGetOnlyElement(response.headers().get(Traces.Headers.SPAN_ID), null);
Optional<TraceState> trace = Traces.getTrace();
if (traceId != null && spanId != null && trace.isPresent()) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could consider always popping the top of the trace here, in theory every single request should be matched by a decode call.

@schlosna
Copy link
Contributor

👍

@markelliot
Copy link
Contributor Author

Dave, meant to ask, do you know if the IDs need to be 64bit or if that’s just common practice? If it’s common practice, how do others typically get them to be unique enough?

@splittingfield
Copy link
Contributor

Every implementation I have seen uses a variation of twitters snowflake algorithm.

marc

On May 18, 2016, at 11:20 AM, Mark Elliot <notifications@github.commailto:notifications@github.com> wrote:

Dave, meant to ask, do you know if the IDs need to be 64bit or if that’s just common practice? If it’s common practice, how do others typically get them to be unique enough?


You are receiving this because you commented.
Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_palantir_http-2Dremoting_pull_115-23issuecomment-2D220060911&d=DQMFaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=jrWvj6_tsYoXkAuwsua7SN7dyBWoSoJdaocZd8kysVs&m=ke0Nhk2N6Or67A-QApBwBbgx8M2luq4iaEaDNHD8IPM&s=85nvqwuSnA-GAhS-NKK8XcWOmLMauVV17o1zGPamhjM&e=

@markelliot
Copy link
Contributor Author

Interesting, it looks like that's a central service rather than an algorithm. With the fauxflake thing you linked I'm not sure what to seed as the service identifier.

@splittingfield
Copy link
Contributor

It can be anything really or a config value specified at startup time.

marc

On May 18, 2016, at 12:20 PM, Mark Elliot <notifications@github.commailto:notifications@github.com> wrote:

Interesting, it looks like that's a central service rather than an algorithm. With the fauxflake thing you linked I'm not sure what to seed as the service identifier.


You are receiving this because you commented.
Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_palantir_http-2Dremoting_pull_115-23issuecomment-2D220080022&d=DQMCaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=jrWvj6_tsYoXkAuwsua7SN7dyBWoSoJdaocZd8kysVs&m=rclNz4WJHuHrX5BFxB0oyDb9GhS4sFEQie8TmIsUOS8&s=yiGDWQyeqAJqD9ml0_gITjn5bzlci2JHTnfwZDDKnMY&e=

@markelliot
Copy link
Contributor Author

markelliot commented May 18, 2016 via email

@splittingfield
Copy link
Contributor

I like time stamp: MAC address: service specific id : counter (from

marc

On May 18, 2016, at 12:45 PM, Mark Elliot <notifications@github.commailto:notifications@github.com> wrote:

That's not super satisfying, I'll look at what we can use.


You are receiving this because you commented.
Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_palantir_http-2Dremoting_pull_115-23issuecomment-2D220087669&d=DQMCaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=jrWvj6_tsYoXkAuwsua7SN7dyBWoSoJdaocZd8kysVs&m=XSnz1nL8ebBBuYR7298WXc8Eif2Sbpu4vUQLE7ECQ04&s=zJkDYdsv5SQyrth7Ll7dwfNsgGGeCfZt6sG-TA0b63I&e=

@splittingfield
Copy link
Contributor

Maybe some variation of service name+version:MAC address: time stamp?

Assuming that MAC address is accessible and that we don't collide multiple copies services on the same host and we can trust outer ntp servers.

marc

On May 18, 2016, at 12:45 PM, Mark Elliot <notifications@github.commailto:notifications@github.com> wrote:

That's not super satisfying, I'll look at what we can use.


You are receiving this because you commented.
Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_palantir_http-2Dremoting_pull_115-23issuecomment-2D220087669&d=DQMCaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=jrWvj6_tsYoXkAuwsua7SN7dyBWoSoJdaocZd8kysVs&m=XSnz1nL8ebBBuYR7298WXc8Eif2Sbpu4vUQLE7ECQ04&s=zJkDYdsv5SQyrth7Ll7dwfNsgGGeCfZt6sG-TA0b63I&e=

@markelliot
Copy link
Contributor Author

markelliot commented May 18, 2016 via email

@splittingfield
Copy link
Contributor

Yeah I know, but type 1 uuids are so confusing!

marc

On May 18, 2016, at 1:44 PM, Mark Elliot <notifications@github.commailto:notifications@github.com> wrote:

I think the challenge is turning that into a 64bit number?

The sad thing is the form you're describing what UUIDs are supposed to be anyway.


You are receiving this because you commented.
Reply to this email directly or view it on GitHubhttps://urldefense.proofpoint.com/v2/url?u=https-3A__github.com_palantir_http-2Dremoting_pull_115-23issuecomment-2D220104063&d=DQMCaQ&c=izlc9mHr637UR4lpLEZLFFS3Vn2UXBrZ4tFb6oOnmz8&r=jrWvj6_tsYoXkAuwsua7SN7dyBWoSoJdaocZd8kysVs&m=vk9DMcgd3_es-cDhC94suGtrviHEyj_c2R3eEFBVU0E&s=A3eV2yTup0OgHilSfTC_XXEJBvxm6FWBPfk74N0lM5E&e=


public abstract long getStartTimeMs();

public abstract long getDurationNs();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does ns resolution make sense if the start time is in ms?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, that sounds fine

On Wed, May 18, 2016 at 6:35 PM Mark Elliot notifications@github.com
wrote:

In tracing/src/main/java/com/palantir/tracing/Span.java
#115 (comment):

+@JsonSerialize(as = ImmutableSpan.class)
+@Value.Immutable
+@Value.Style(visibility = Value.Style.ImplementationVisibility.PACKAGE)
+public abstract class Span {
+

  • public abstract String getTraceId();
  • public abstract Optional getParentSpanId();
  • public abstract String getSpanId();
  • public abstract String getOperation();
  • public abstract long getStartTimeMs();
  • public abstract long getDurationNs();

Accurate timing on the JVM needs to use the nanosecond counter on the CPU,
but wall clock time is only really available in milliseconds. I'm happy to
convert the nanoseconds to milliseconds but I figured I'd save the math
operation and leave that to a subscriber. (Don't have a strong opinion,
just explaining rationale.)


You are receiving this because you were assigned.

Reply to this email directly or view it on GitHub
https://github.com/palantir/http-remoting/pull/115/files/cef3b7359acf8d47d417137799a5f5547358e14e#r63811121

@markelliot
Copy link
Contributor Author

@uschi2000 ready for another look, LMK if you have additional feedbakc

@jkozlowski
Copy link
Contributor

For the tracing, how about using https://github.com/openzipkin/brave? I am about to prototype this for Phoenix.

@markelliot
Copy link
Contributor Author

markelliot commented Jun 2, 2016 via email

@uschi2000
Copy link
Contributor

Like this?
https://github.com/openzipkin/brave/blob/master/brave-core/src/main/java/com/github/kristofa/brave/LocalTracer.java

On Thu, Jun 2, 2016 at 3:31 PM Mark Elliot notifications@github.com wrote:

Could you point me to some details on how to make a span internal to a
server with Brave? Poking around the readmes and code this seems entirely
non-obvious to me, and is an important part of the tracing framework.

Separately, as long as the HTTP parts of the implementation match the
Zipkin APIs, it shouldn't matter how the internals work.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#115 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AGOdwcrFx1QBps1n9isw4MSgRwzQkkRIks5qHtsRgaJpZM4IVjpL
.

@markelliot
Copy link
Contributor Author

Maybe I'm misreading the code, but that doesn't seem to allow nesting of recorded local spans?

@uschi2000
Copy link
Contributor

Haven't dug in enough detail to answer this either, sorry.

On Thu, Jun 2, 2016 at 4:07 PM Mark Elliot notifications@github.com wrote:

Maybe I'm misreading the code, but that doesn't seem to allow nesting of
recorded local spans?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#115 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AGOdwSVk2zaaxSvkY6kZ-L-_Z429tYOPks5qHuOMgaJpZM4IVjpL
.

@uschi2000
Copy link
Contributor

Generally, the Brave code-base looks very solid (and has no external
dependencies), so I think we should consider it, even if it's currently
lacking that feature. (We could always push for adding it there.)

On Thu, Jun 2, 2016 at 4:43 PM Robert Fink rf@robertfink.de wrote:

Haven't dug in enough detail to answer this either, sorry.

On Thu, Jun 2, 2016 at 4:07 PM Mark Elliot notifications@github.com
wrote:

Maybe I'm misreading the code, but that doesn't seem to allow nesting of
recorded local spans?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#115 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AGOdwSVk2zaaxSvkY6kZ-L-_Z429tYOPks5qHuOMgaJpZM4IVjpL
.

@markelliot
Copy link
Contributor Author

It looks like it'd be a pretty substantive change to how it stores state (been poking around code for quite a bit now) to make it work, and would be enough of an API break that it'd likely mean a major version bump for it.

We've found a lot of value from other internal initiatives having this particular feature, though maybe it's less interesting when services are granular enough.

At any rate, I can make two proposals:

  1. You take a stab at an alternative PR that adds Brave everywhere necessary (I'm not sure I have time at this point to retry this with another library)
  2. We take this PR and advise against anyone invoking "Traces" directly for internal instrumentation so that we can re-evaluate later

Upside for option #1 is immediate integration with Zipkin. I think higher priority for us than that, though, is a tracing-aware dropwizard logging addition, though, so we can get trace and span ids into every log statement we emit, including in request logs.

@jkozlowski
Copy link
Contributor

I am going to be digging into this a bit more over the course of next week, but my understanding was that the definitely had some solution for inter-process tracing (ThreadLocal based) and they had a solution for crossing threadpool boundaries.

@jkozlowski
Copy link
Contributor

I am definitely going to be prototyping this for Phoenix with Brave, simply because that is the only thing I can use right now (we have upgrade to http-remoting in the works). I want to see Zipkin style traces coming out of Phoenix: adding trace aware log messages is useful, but right now Phoenix doesn't actually do much that much logging, so it's lower priority.

@markelliot
Copy link
Contributor Author

Crossing thread pool boundaries is simple enough and would be a small addition here, the challenge is maintaining an internal stack of spans so that you can instrument internally in a detailed way. Again, not that familiar with Brave, so definitely open to the possibility I'm missing something in its capability or implementation.

@uschi2000
Copy link
Contributor

Should we set a deadline for the Brave experimentation and punt on this PR
until then? Would suggest end of next week.

On Thu, Jun 2, 2016 at 5:18 PM Mark Elliot notifications@github.com wrote:

Crossing thread pool boundaries is simple enough and would be a small
addition here, the challenge is maintaining an internal stack of spans so
that you can instrument internally in a detailed way. Again, not that
familiar with Brave, so definitely open to the possibility I'm missing
something in its capability or implementation.


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#115 (comment),
or mute the thread
https://github.com/notifications/unsubscribe/AGOdwTvO-7TmO7wWzay69aGQ-6f7rC-Sks5qHvQegaJpZM4IVjpL
.

@jkozlowski
Copy link
Contributor

jkozlowski commented Jun 2, 2016

A bit more digging around: I hooked brave into a servlet filter and then invoked LocalTracer twice:

    public PhoenixVersion getVersion() {
        brave.localTracer().startNewSpan("this-is-a-test", "text");
        try {
            Thread.sleep(1000);
            secondLevel();
        } catch (InterruptedException e) {
            throw Throwables.propagate(e);
        } finally {
            brave.localTracer().finishSpan();
        }
    }

    private void secondLevel() {
        brave.localTracer().startNewSpan("this-is-a-test1", "text1");
        try {
            Thread.sleep(2000);
        } catch (InterruptedException e) {
            throw Throwables.propagate(e);
        } finally {
            brave.localTracer().finishSpan();
        }
    }

Gave me 2 traces:

{"traceId":"0000000000000001","name":"text1","id":"7672bbb60ba5b369","parentId":"0000000000000002","timestamp":1464889183460000,"duration":2004831,"annotations":[],"binaryAnnotations":[{"key":"lc","value":"this-is-a-test1","endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"}}]}
{"traceId":"0000000000000001","name":"get","id":"0000000000000002","timestamp":1464889182435000,"duration":3150000,"annotations":[{"endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"},"timestamp":1464889182435000,"value":"sr"},{"endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"},"timestamp":1464889185585000,"value":"ss"}],"binaryAnnotations":[{"key":"http.status_code","value":"200","endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"}},{"key":"http.url","value":"/phoenix/api/v0/version","endpoint":{"serviceName":"braveservletinterceptorintegration","ipv4":"10.203.68.159"}}]}

So the traces got linked correctly, but it looks to me like it lost the intermediate this-is-a-test and only recorded this-is-a-test1 - so no stacking as per what @markelliot is saying. I'll continue poking around tomorrow, maybe I am using it wrong.

@jkozlowski
Copy link
Contributor

Looking at it a bit more, all the ServerTracer, ClientTracer and LocalTracer operate on ServerClientAndLocalSpanState which is an interface and can be replaced. Not sure how many things it would break if we replaced it's ThreadLocalServerClientAndLocalSpanState with what you had implemented here. Not sure how valuable it is to use this library if we need to replace this part of it and how constraining it is.

@jkozlowski
Copy link
Contributor

Ok I think I have a better understanding of the library:

The core is ServerClientAndLocalSpanState: this is basically the per-thread storage for the current server, local and client spans: it only stores a single span per each category.

Then you have ServerTracer -> LocalTracer -> ClientTracer. This means that whenever LocalTracer wants to start a new span, it will lookup the current server trace from ServerClientAndLocalSpanState and use that as it's parent and set the new span into ServerClientAndLocalSpanState. Similarly, for ClientTracer: it will use the current local trace as it's parent.

Therefore, if you want to have nesting in local tracing, you just need to have another layer that restores the current local trace after a subtrace returns: the ServerClientAndLocalSpanState has basically methods that set the spans for server, local and client.

I am going to go with Brave for my little experiment (in fact I have almost finished integrating it), since adding this small layer for nested local tracing (which I will not want just yet) should be fairly trivial and I gain a lot of the out of the box integrations that Brave comes with.

Now sure how we'd like to proceed? I suggest:

  • Amend the PR to use Brave and hook everything up with a caveat that LocalTracer should not be used for multiple layers of tracing.
  • Contribute back the filters you created and integrations for retrofit etc. and then pull in the newest version.
  • Once we need the local tracing amended, implement it here first and test drive it on a few projects and see how they feel about contributing this back.

@jkozlowski
Copy link
Contributor

jkozlowski commented Jun 3, 2016

Also how does this all compare to Chronicle and what should we use?

@markelliot
Copy link
Contributor Author

Chronicle and the implementation here are nearly identical (though that’s not so surprising since I’m the original author of Chronicle). We would’ve used it except its dependency tree makes it hard to open source – I looked at the impl a bit as I put together this setup.

I think internal spans have been extremely useful in the past, and would highly value continuing to have them.

Replacing the guts of Brave seems pretty daunting – the setup here would give us an API-compatible system though none of the Zipkin connectivity, so the trade is likely a hard one.

@jkozlowski
Copy link
Contributor

Looks like there is interest from brave to support this: openzipkin/brave#166

@markelliot markelliot closed this Jun 22, 2016
@markelliot markelliot deleted the feature/tracing branch June 22, 2016 02:51
schlosna added a commit to schlosna/http-remoting that referenced this pull request Oct 11, 2016
@schlosna schlosna mentioned this pull request Oct 11, 2016
@markelliot markelliot mentioned this pull request Mar 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants