
Allow client to support brave/zipkin distributed tracing in a non-intrusive way #69

Merged (6 commits, Oct 13, 2020)

Conversation

m50d
Contributor

@m50d m50d commented Sep 29, 2020

Based on #50 and the discussion in #57 . Not polished - wanted to discuss whether the architecture makes sense first.

Generalise the LoggingContext that's returned from ProcessorContextImpl to a TracingContext.

Allow it to be customised by a TracingContextFactory.

Define a brave-specific implementation that reads the zipkin trace information from the Kafka record headers in a separate brave module.
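For readers skimming the thread, the proposed shape can be sketched roughly like this (purely illustrative: the interface and class names mirror the ones mentioned above, but the signatures here are hypothetical, not the final API; the brave module would implement the factory by reading the zipkin/B3 headers off the Kafka record):

```java
import java.util.Map;

// Hypothetical: an open tracing scope tied to one record's processing.
interface TracingContext extends AutoCloseable {
    @Override
    void close();
}

// Hypothetical: pluggable factory. The default would return a plain
// logging context, while a brave-backed implementation would extract
// the zipkin trace information from the record headers and join it.
interface TracingContextFactory {
    TracingContext newContext(Map<String, String> recordHeaders);
}

// A trivial default standing in for the logging-only behaviour.
class NoopTracingContextFactory implements TracingContextFactory {
    @Override
    public TracingContext newContext(Map<String, String> recordHeaders) {
        return () -> { /* nothing to clean up */ };
    }
}
```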

@CLAassistant

CLAassistant commented Sep 29, 2020

CLA assistant check
All committers have signed the CLA.

@kawamuray
Member

kawamuray commented Sep 30, 2020

Thanks for the PR!

I've taken a brief look at the patch and it looks good overall (as you mentioned, you correctly reflected my opinion from #57 :)).

One major point I'd like to discuss before starting to look into the details is the scope of the tracing.
Currently the tracing context is instantiated just before DecatonProcessor#process starts, which tells us the time taken from produce until right before the process, and the time taken to complete processing.
However, as I wrote in #57, when investigating a processing latency issue we typically want to know the exact time the record was delivered to the consumer, to distinguish between kafka message delivery latency (a matter of the kafka clients and broker), decaton-internal latency (queuing and waiting for preceding tasks' completion), and processing time.
So I think we should support a somewhat finer granularity of measurement: at least one point at consume time (when the Consumer returns the record) and one right before the process.
For that, I think the tracing should be configured on SubscriptionBuilder rather than on ProcessorsBuilder.

Also, some users who implement DecatonProcessor in an async way (using DeferredCompletion) may want to measure the time until they "complete" the task, which seems not to be possible with the current usage of brave.

Can you consider this point?
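The measurement points being asked for here could be made concrete with a sketch like this (a hypothetical helper, not decaton code, just to show how the three latencies would be separated):

```java
// Hypothetical sketch of the proposed measurement points: one mark when
// the Consumer returns the record and one right before
// DecatonProcessor#process runs, so decaton-internal queuing time becomes
// visible separately from processing time.
class RecordTimings {
    long consumedAtNanos;   // Consumer#poll returned the record
    long processStartNanos; // just before DecatonProcessor#process
    long processEndNanos;   // process (or its completion) finished

    // decaton-internal: queuing and waiting for preceding tasks
    long queuedNanos() {
        return processStartNanos - consumedAtNanos;
    }

    // time spent actually processing the task
    long processingNanos() {
        return processEndNanos - processStartNanos;
    }
}
```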

@m50d
Contributor Author

m50d commented Oct 1, 2020

So I think we should support a somewhat finer granularity of measurement: at least one point at consume time (when the Consumer returns the record) and one right before the process.

Makes sense. In that case we also need to propagate the context from the ProcessorSubscription's polling thread to the ProcessorUnit's executor. I'm not sure how decoupled this will actually be, but I'll try to make an interface that makes sense.

Also, some users who implement DecatonProcessor in an async way (using DeferredCompletion) may want to measure the time until they "complete" the task, which seems not to be possible with the current usage of brave.

The processor implementation can start its own spans, which will have the current span as a parent (since it's in context on the current thread). So I don't think we need to do anything at the decaton level here (and indeed I don't think we need to start a span around the process call either: if the processor wants this to be a separate span it can start one itself). That said, I do think it makes more sense to close the span when the process result is completed rather than when the call to push returns.

Will rework.
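Propagating the context from the polling thread to the executor could be sketched like this (a hypothetical holder built on a plain ThreadLocal; in a real brave-based implementation, brave's CurrentTraceContext utilities would play this role):

```java
// Hypothetical context carrier: captured on the polling thread,
// restored on the executor thread for the duration of the task.
class TraceContextHolder {
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Wrap a task so it runs under the context active at wrap time.
    static Runnable wrap(Runnable task) {
        String captured = CURRENT.get(); // snapshot on the polling thread
        return () -> {
            String previous = CURRENT.get();
            CURRENT.set(captured);       // restore on the worker thread
            try {
                task.run();
            } finally {
                CURRENT.set(previous);   // don't leak across pooled tasks
            }
        };
    }
}
```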

@kawamuray
Member

it makes more sense to close the span when the process result is completed rather than when the call to push returns.

Indeed. Decaton already has two concepts for tracking a task's processing duration, process time and complete time, and they're exposed independently in metrics. Process time corresponds to the time the process method takes until it returns, and complete time corresponds to the time until the task's offset is marked as completed, so complete time includes process time. I think spans can follow the same structure: a span measuring the task's complete time (starting right before the process call, finishing when the process result is marked as completed) with one or more children, the first of which is "process time", the duration of the process method call.
As you said, users can add one or more child spans under the "complete time" span or under the "process time" span, depending on whether their work executes synchronously in the process method or not, so they're likely to get a view on the zipkin UI with no gaps between spans if they design their own spans correctly.

@m50d
Contributor Author

m50d commented Oct 2, 2020

I think spans can follow the same structure: a span measuring the task's complete time (starting right before the process call, finishing when the process result is marked as completed) with one or more children, the first of which is "process time", the duration of the process method call.

As you said, users can add one or more child spans under the "complete time" span or under the "process time" span, depending on whether their work executes synchronously in the process method or not, so they're likely to get a view on the zipkin UI with no gaps between spans if they design their own spans correctly.

I can see that many users will want that kind of structure, but I think for some users two separate spans might be more complex than necessary. So I don't think we actually need to provide the "process time" span at the decaton framework level: if a user wants a span that covers the execution of their process method, they can define a process method that starts a (child) span at the start and closes it just before returning (or some more complex structure of child spans, some or all of which may be async). But if the user prefers to see a single span per record for simplicity, then I'd like to support that use case as well.
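The user-managed span being described might look like this (with a stand-in MiniTracer instead of brave's real Tracer, purely to keep the sketch self-contained; the names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tracer, standing in for a real tracing library,
// just to illustrate the shape of a user-managed span around process().
class MiniTracer {
    final List<String> finished = new ArrayList<>();

    AutoCloseable startSpan(String name) {
        return () -> finished.add(name); // closing the span records it
    }
}

// What a user's process method could look like if they want their own
// per-record span instead of the framework providing one.
class UserProcessor {
    final MiniTracer tracer;

    UserProcessor(MiniTracer tracer) {
        this.tracer = tracer;
    }

    void process(String task) throws Exception {
        try (AutoCloseable span = tracer.startSpan("process " + task)) {
            // ... actual processing, possibly kicking off async child spans ...
        }
    }
}
```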

@kawamuray
Member

if a user wants a span that covers the execution of their process method, they can define a process method that starts a (child) span at the start and closes it just before returning (or some more complex structure of child spans, some or all of which may be async)

Yeah, that's true. At the same time, one of the key benefits a library provides is a good default that saves many users from writing the same boilerplate.

As you said, especially for sync process implementations, having two parts (process time and complete time) is almost pointless since they both show exactly the same duration (I guess two identical bars stacked on the zipkin UI?).
At the same time, we've seen many users use deferred completion (async process) with relatively lightweight synchronous work in the process method, in which case I guess they'd want another span measuring the synchronous part of process.

So a good default depends on how harmful it is for sync-type process users to have both process and complete measurements.

I can see that many users will want that kind of structure, but I think for some users two separate spans might be more complex than necessary.

Can you explain a bit more about how it's bad for users to have two spans? Is it about two overlapping bars on the UI?

@m50d
Contributor Author

m50d commented Oct 6, 2020

Can you explain a bit more about how it's bad for users to have two spans? Is it about two overlapping bars on the UI?

Yes, and it also increases the storage volume (which in our case has been a significant concern). Maybe an annotation within the span would be the right way to represent this in Zipkin.

At the same time, we've seen many users use deferred completion (async process) with relatively lightweight synchronous work in the process method, in which case I guess they'd want another span measuring the synchronous part of process.

At least in the zipkin case, such a user already has to add some boilerplate to propagate the context onto that async task, so I don't think it's any extra overhead for them to start a new span at the same time.

I had wondered about implementing a parent class (TracedDecatonProcessor or some such) that starts the spans in the right places. But I think that would only be worthwhile for tracing implementations that automatically propagate the context across async calls, of which the only one I know about is Kamon.
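The parent-class idea could be sketched as a simple decorator (hypothetical names and a deliberately simplified span API; as noted, this is mainly worthwhile when the tracer propagates context across async calls automatically):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for DecatonProcessor, to keep the sketch self-contained.
interface SimpleProcessor {
    void process(String task) throws Exception;
}

// Hypothetical decorator in the spirit of the TracedDecatonProcessor idea:
// wraps a delegate processor and opens/closes a span around each call.
class TracedProcessor implements SimpleProcessor {
    final SimpleProcessor delegate;
    final List<String> spanLog = new ArrayList<>(); // stand-in for real spans

    TracedProcessor(SimpleProcessor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void process(String task) throws Exception {
        spanLog.add("start:" + task);      // open span before delegating
        try {
            delegate.process(task);
        } finally {
            spanLog.add("finish:" + task); // close span when process returns
        }
    }
}
```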

@kawamuray
Member

At least in the zipkin case, such a user already has to add some boilerplate to propagate the context onto that async task, so I don't think it's any extra overhead for them to start a new span at the same time.

Okay. Let's go with a single span for the process (complete) part for now. We can easily add another if we get demand for it :)

Member

@kawamuray kawamuray left a comment


Thanks for redesigning the interface, looks better :) just a few more points.

Member

@ocadaruma ocadaruma left a comment


Added a minor comment, but almost LGTM except for kawamuray's points.

@kawamuray
Member

Except for those minor points, this now looks good overall to me.
Please check the comments once again, since there are some comments folded by GitHub that still aren't addressed (like an unused import).

@m50d
Contributor Author

m50d commented Oct 12, 2020

Hi,

I don't understand the Travis build failure. Since I couldn't reproduce it locally I've rebased the branch to the minimum change that causes the failure. It looks like just adding the trace ID header in ProcessorTestSuite causes these two test failures. Are either of you able to reproduce them? Or is there any way to see what's happening on Travis in more detail?

Many thanks,
Mickey

@kawamuray
Member

should be #56 ...

it shouldn't keep failing forever though; 6733994 apparently got it to pass.
Once you finalize your PR, please just check the test results. If it's difficult to make it pass within 1 or 2 retries (due to those flaky tests) we'll just proceed to merge it, disregarding them for now.

@m50d
Contributor Author

m50d commented Oct 12, 2020

should be #56 ...

it shouldn't keep failing forever though; 6733994 apparently got it to pass.
Once you finalize your PR, please just check the test results. If it's difficult to make it pass within 1 or 2 retries (due to those flaky tests) we'll just proceed to merge it, disregarding them for now.

Ah, I see. Thanks.
I'm happy with the PR (and I think I've addressed all of your and @ocadaruma's comments?).
It looks like I can't re-run the Travis build myself, since I don't have write access to this repository?

@m50d m50d requested a review from kawamuray October 12, 2020 08:07
Member

@ocadaruma ocadaruma left a comment


LGTM

Member

@kawamuray kawamuray left a comment


LGTM 👍

Thanks for contributing a great feature!

All CI failures are caused by #56 and are not related to this patch.

@kawamuray kawamuray merged commit bee6810 into line:master Oct 13, 2020
@m50d m50d deleted the tracing branch October 13, 2020 04:55
@Yang-33 Yang-33 mentioned this pull request Feb 13, 2023