
Allow client to support brave/zipkin distributed tracing in a non-intrusive way #69

Merged (6 commits, Oct 13, 2020)

Conversation

m50d
Contributor

@m50d m50d commented Sep 29, 2020

Based on #50 and the discussion in #57 . Not polished - wanted to discuss whether the architecture makes sense first.

Generalise the LoggingContext that's returned from ProcessorContextImpl to a TracingContext.

Allow it to be customised by a TracingContextFactory.

Define a brave-specific implementation that reads the zipkin trace information from the Kafka record headers in a separate brave module.
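For readers skimming the thread, the proposed shape can be sketched roughly like this (purely illustrative: the interface and class names mirror the ones mentioned above, but the signatures here are hypothetical, not the final API; the brave module would implement the factory by reading the zipkin/B3 headers off the Kafka record):

```java
import java.util.Map;

// Hypothetical: an open tracing scope tied to one record's processing.
interface TracingContext extends AutoCloseable {
    @Override
    void close();
}

// Hypothetical: pluggable factory. The default would return a plain
// logging context, while a brave-backed implementation would extract
// the zipkin trace information from the record headers and join it.
interface TracingContextFactory {
    TracingContext newContext(Map<String, String> recordHeaders);
}

// A trivial default standing in for the logging-only behaviour.
class NoopTracingContextFactory implements TracingContextFactory {
    @Override
    public TracingContext newContext(Map<String, String> recordHeaders) {
        return () -> { /* nothing to clean up */ };
    }
}
```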

@CLAassistant

CLAassistant commented Sep 29, 2020

CLA assistant check
All committers have signed the CLA.

@kawamuray
Member

kawamuray commented Sep 30, 2020

Thanks for the PR!

I've taken a brief look at the patch and it looks good overall (as you mentioned, you correctly reflected my opinion from #57 :)).

One major point I'd like to discuss before starting to look into the details is the scope of the tracing.
Currently the tracing context is instantiated just before DecatonProcessor#process starts, which tells us the time taken from produce until right before the process, and the time taken to complete processing.
However, as I wrote in #57, when investigating a processing latency issue we typically want to know the exact time the record was delivered to the consumer, to distinguish between kafka message delivery latency (a matter of the kafka clients and broker), decaton-internal latency (queuing and waiting for preceding tasks' completion), and processing time.
So I think we should support a somewhat finer granularity of measurement: at least one point at consume time (when the Consumer returns the record) and one right before the process.
For that, I think the tracing should be configured on SubscriptionBuilder rather than on ProcessorsBuilder.

Also, some users who implement DecatonProcessor in an async way (using DeferredCompletion) may want to measure the time until they "complete" the task, which seems not to be possible with the current usage of brave.

Can you consider this point?
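The measurement points being asked for here could be made concrete with a sketch like this (a hypothetical helper, not decaton code, just to show how the three latencies would be separated):

```java
// Hypothetical sketch of the proposed measurement points: one mark when
// the Consumer returns the record and one right before
// DecatonProcessor#process runs, so decaton-internal queuing time becomes
// visible separately from processing time.
class RecordTimings {
    long consumedAtNanos;   // Consumer#poll returned the record
    long processStartNanos; // just before DecatonProcessor#process
    long processEndNanos;   // process (or its completion) finished

    // decaton-internal: queuing and waiting for preceding tasks
    long queuedNanos() {
        return processStartNanos - consumedAtNanos;
    }

    // time spent actually processing the task
    long processingNanos() {
        return processEndNanos - processStartNanos;
    }
}
```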

@m50d
Contributor Author

m50d commented Oct 1, 2020

So I think we should support a somewhat finer granularity of measurement: at least one point at consume time (when the Consumer returns the record) and one right before the process.

Makes sense. In that case we also need to propagate the context from the ProcessorSubscription's polling thread to the ProcessorUnit's executor. I'm not sure how decoupled this will actually be, but I'll try to make an interface that makes sense.

Also, some users who implement DecatonProcessor in an async way (using DeferredCompletion) may want to measure the time until they "complete" the task, which seems not to be possible with the current usage of brave.

The processor implementation can start its own spans, which will have the current span as a parent (since it's in context on the current thread). So I don't think we need to do anything at the decaton level here (and indeed I don't think we need to start a span around the process call either: if the processor wants this to be a separate span it can start one itself). That said, I do think it makes more sense to close the span when the process result is completed rather than when the call to push returns.

Will rework.
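Propagating the context from the polling thread to the executor could be sketched like this (a hypothetical holder built on a plain ThreadLocal; in a real brave-based implementation, brave's CurrentTraceContext utilities would play this role):

```java
// Hypothetical context carrier: captured on the polling thread,
// restored on the executor thread for the duration of the task.
class TraceContextHolder {
    static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    // Wrap a task so it runs under the context active at wrap time.
    static Runnable wrap(Runnable task) {
        String captured = CURRENT.get(); // snapshot on the polling thread
        return () -> {
            String previous = CURRENT.get();
            CURRENT.set(captured);       // restore on the worker thread
            try {
                task.run();
            } finally {
                CURRENT.set(previous);   // don't leak across pooled tasks
            }
        };
    }
}
```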

@kawamuray
Member

it makes more sense to close the span when the process result is completed rather than when the call to push returns.

Indeed. Decaton already has two concepts for tracking a task's processing duration, process time and complete time, and they're exposed independently in metrics. Process time corresponds to the time the process method takes until it returns, and complete time corresponds to the time until the task's offset is marked as completed, so complete time includes process time. I think spans can follow the same structure: a span measuring the task's complete time (starting right before the process call, finishing when the process result is marked as completed) with one or more children, the first of which is "process time", the duration of the process method call.
As you said, users can add one or more child spans under the "complete time" span or under the "process time" span, depending on whether their work executes synchronously in the process method or not, so they're likely to get a view on the zipkin UI with no gaps between spans if they design their own spans correctly.

@m50d
Contributor Author

m50d commented Oct 2, 2020

I think spans can follow the same structure: a span measuring the task's complete time (starting right before the process call, finishing when the process result is marked as completed) with one or more children, the first of which is "process time", the duration of the process method call.

As you said, users can add one or more child spans under the "complete time" span or under the "process time" span, depending on whether their work executes synchronously in the process method or not, so they're likely to get a view on the zipkin UI with no gaps between spans if they design their own spans correctly.

I can see that many users will want that kind of structure, but I think for some users two separate spans might be more complex than necessary. So I don't think we actually need to provide the "process time" span at the decaton framework level: if a user wants a span that covers the execution of their process method, they can define a process method that starts a (child) span at the start and closes it just before returning (or some more complex structure of child spans, some or all of which may be async). But if the user prefers to see a single span per record for simplicity, then I'd like to support that use case as well.
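The user-managed span being described might look like this (with a stand-in MiniTracer instead of brave's real Tracer, purely to keep the sketch self-contained; the names are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tracer, standing in for a real tracing library,
// just to illustrate the shape of a user-managed span around process().
class MiniTracer {
    final List<String> finished = new ArrayList<>();

    AutoCloseable startSpan(String name) {
        return () -> finished.add(name); // closing the span records it
    }
}

// What a user's process method could look like if they want their own
// per-record span instead of the framework providing one.
class UserProcessor {
    final MiniTracer tracer;

    UserProcessor(MiniTracer tracer) {
        this.tracer = tracer;
    }

    void process(String task) throws Exception {
        try (AutoCloseable span = tracer.startSpan("process " + task)) {
            // ... actual processing, possibly kicking off async child spans ...
        }
    }
}
```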

@kawamuray
Member

if a user wants a span that covers the execution of their process method, they can define a process method that starts a (child) span at the start and closes it just before returning (or some more complex structure of child spans, some or all of which may be async)

Yeah, that's true. At the same time, one of the key benefits a library provides is a good default that saves many users from writing the same boilerplate.

As you said, especially for sync process implementations, having two parts (process time and complete time) is almost pointless since they both show exactly the same duration (I guess two identical bars stacked on the zipkin UI?).
At the same time, we've seen many users use deferred completion (async process) with relatively lightweight synchronous work in the process method, in which case I guess they'd want another span measuring the synchronous part of process.

So a good default depends on how harmful it is for sync-type process users to have both process and complete measurements.

I can see that many users will want that kind of structure, but I think for some users two separate spans might be more complex than necessary.

Can you explain a bit more about how it's bad for users to have two spans? Is it about two overlapping bars on the UI?

@m50d
Contributor Author

m50d commented Oct 6, 2020

Can you explain a bit more about how it's bad for users to have two spans? Is it about two overlapping bars on the UI?

Yes, and it also increases the storage volume (which in our case has been a significant concern). Maybe an annotation within the span would be the right way to represent this in Zipkin.

At the same time, we've seen many users use deferred completion (async process) with relatively lightweight synchronous work in the process method, in which case I guess they'd want another span measuring the synchronous part of process.

At least in the zipkin case, such a user already has to add some boilerplate to propagate the context onto that async task, so I don't think it's any extra overhead for them to start a new span at the same time.

I had wondered about implementing a parent class (TracedDecatonProcessor or some such) that starts the spans in the right places. But I think that would only be worthwhile for tracing implementations that automatically propagate the context across async calls, of which the only one I know about is Kamon.
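The parent-class idea could be sketched as a simple decorator (hypothetical names and a deliberately simplified span API; as noted, this is mainly worthwhile when the tracer propagates context across async calls automatically):

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for DecatonProcessor, to keep the sketch self-contained.
interface SimpleProcessor {
    void process(String task) throws Exception;
}

// Hypothetical decorator in the spirit of the TracedDecatonProcessor idea:
// wraps a delegate processor and opens/closes a span around each call.
class TracedProcessor implements SimpleProcessor {
    final SimpleProcessor delegate;
    final List<String> spanLog = new ArrayList<>(); // stand-in for real spans

    TracedProcessor(SimpleProcessor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void process(String task) throws Exception {
        spanLog.add("start:" + task);      // open span before delegating
        try {
            delegate.process(task);
        } finally {
            spanLog.add("finish:" + task); // close span when process returns
        }
    }
}
```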

@kawamuray
Member

At least in the zipkin case, such a user already has to add some boilerplate to propagate the context onto that async task, so I don't think it's any extra overhead for them to start a new span at the same time.

Okay. Let's go with a single span for the process (complete) part for now. We can easily add another if we get demand for it :)

Member

@kawamuray kawamuray left a comment


Thanks for redesigning the interface, looks better :) just a few more points.

Member

@ocadaruma ocadaruma left a comment


Added a minor comment, but almost LGTM except for kawamuray's points.

@kawamuray
Member

Except for those minor points, this now looks good overall to me.
Please check the comments once again, since there are some comments folded by GitHub that still aren't addressed (like an unused import).

@m50d
Contributor Author

m50d commented Oct 12, 2020

Hi,

I don't understand the Travis build failure. Since I couldn't reproduce it locally I've rebased the branch to the minimum change that causes the failure. It looks like just adding the trace ID header in ProcessorTestSuite causes these two test failures. Are either of you able to reproduce them? Or is there any way to see what's happening on Travis in more detail?

Many thanks,
Mickey

@kawamuray
Member

should be #56 ...

it shouldn't keep failing forever though; 6733994 apparently got it to pass.
Once you finalize your PR, please just check the test results. If it's difficult to make it pass within 1 or 2 retries (due to those flaky tests) we'll just proceed to merge it, disregarding them for now.

@m50d
Contributor Author

m50d commented Oct 12, 2020

should be #56 ...

it shouldn't keep failing forever though; 6733994 apparently got it to pass.
Once you finalize your PR, please just check the test results. If it's difficult to make it pass within 1 or 2 retries (due to those flaky tests) we'll just proceed to merge it, disregarding them for now.

Ah, I see. Thanks.
I'm happy with the PR (and I think I've addressed all of your and @ocadaruma's comments?).
It looks like I can't re-run the Travis build myself, since I don't have write access to this repository?

@m50d m50d requested a review from kawamuray October 12, 2020 08:07
Member

@ocadaruma ocadaruma left a comment


LGTM

Member

@kawamuray kawamuray left a comment


LGTM 👍

Thanks for contributing a great feature!

All CI failures are caused by #56 and are not related to this patch.

@kawamuray kawamuray merged commit bee6810 into line:master Oct 13, 2020
@m50d m50d deleted the tracing branch October 13, 2020 04:55
@Yang-33 Yang-33 mentioned this pull request Feb 13, 2023