Skip to content
This repository has been archived by the owner on May 23, 2023. It is now read-only.

Add clarification for Span under multithreading scenarios. #120

Conversation

carlosalberto
Copy link
Contributor

Multithreading Span usage.

Current State: Draft
Author: carlosalberto

In the OpenTracing specification, there is no mention of Span behavior under multithreaded applications, nor clarification of any thread-safety requirement. This proposal is thus intended to be explicit about this specific scenario.

Background

Different tracing systems have historically taken different decisions regarding this, with many of them making Spans (and their operations) thread-safe. Yet, it has been shown lately that a few frameworks have decided to not make Spans thread-safe, and instead provide their own ways to handle this scenario. This could in the near future create confusion among both users and tracing authors, as well as incompatibilities among implementations.

Specification Changes

The Span interface adds one or more paragraph detailing the expected behavior of the objects under multithreading scenarios, specially when a single operation may take place through different threads (maybe using a thread pool). Span objects should then either:

  • Being thread-safe in all its operations (logs, tags, baggage).
  • Guaranteed to work fine and correctly under the described scenarios.

Use Cases

Application using a thread pool

Multithreaded applications using a thread pool, with the user creating a Span for a given operation, which would be executed as a series of callbacks, each one running on a different, unknown thread.

For many frameworks, this could be approached with the Span being passed between the threads/callbacks during its execution, to add logs, set tags or modify its baggage.

Risk Assessment

The following risks have been identified:

  • This change may affect tracing systems that have taken for granted that there is no need to provide any form of thread-safety, as they would need to update or change its own API or code.

  • The clarification may stay too vague or be insufficient in the long term, regarding the frameworks that decide to not provide thread-safety. Thus a proper, clear addition needs to happen for this case.

@carlosalberto carlosalberto changed the title Add a RFC for Span under multithreading scenarios. Add clarification for Span under multithreading scenarios. May 30, 2018
Copy link
Member

@wu-sheng wu-sheng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If SkyWalking is going to support span across thread, I hope could be some explicit way to declare it. Otherwise, the auto instrumentation agent can't prepare for that. It is major difference between SkyWalking and manual instrumentation, we must propagate context in ThreadLocal(Java), because no one will do that, and we don't want people to do so.


The `Span` interface adds one or more paragraph detailing the expected behavior of the objects under multithreading scenarios, specially when a single operation may take place through different threads (maybe using a thread pool). `Span` objects should then either:

* Being thread-safe in all its operations (logs, tags, baggage).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be a performance concern for span's OPs. Many existed APM/tracers expected span don't face multiple threads, that is also based on this.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without nitpicking on specific wording, I am in favor of clarifying that span has to be thread-safe

@yurishkuro
Copy link
Member

@wu-sheng It was my understanding that the skywalking agent code is not using OpenTracing, it is coded directly against skyWalking spans. The only part that is OpenTracing compatible is if users interact with the spans directly from the application, via Tracer. Would that not provide sufficient indication that you need to expose a thread-safe version of the span if accessed via OT API, while the internal API used by the agent can be single-threaded?

@wu-sheng
Copy link
Member

It was my understanding that the skywalking agent code is not using OpenTracing, it is coded directly against skyWalking spans. The only part that is OpenTracing compatible is if users interact with the spans directly from the application, via Tracer. Would that not provide sufficient indication that you need to expose a thread-safe version of the span if accessed via OT API, while the internal API used by the agent can be single-threaded?

@yurishkuro Yes, for the bridge, we can do that. But, my concern is more about that. You could just consider this as the disadvantage of auto-instrumentation agents.

Java program also runs in a single thread(exclude distributed), except for thread pool or job assignment modules, so nearly all Java trace agent(.NET likely too), store context in Thread(ThreadLocal for Java), and also maintain an active span stack in context for help the plugin(instrumentation) developer easier to process parent/child relationship. New span's parent is also the last active span.

Such as: A->B->C, when C created, B is its parent span as auto and default. If you finish C, then create D, then D is also B's child too. All these are automatic, that is core of auto instrumentation for many languages.

So you can find out, when span(like C) can cross thread, the stack structure fails, because, you can't guarantee the C has been finished before D created, but you may expect that.

That is a deep conflict, and also why I argue the across thread span all the time.

Although, I think you understand, SkyWalking always did instrumentation in auto way, so we don't relied on OT instrumentation very much. If we consider a limited support, such as providing a way to create some span manually, then SkyWalking can say, we only support in that way.

@yurishkuro
Copy link
Member

ScopeManager also keeps spans in thread local in a stack-like manner.

@carlosalberto I'm curious, with the async frameworks using ScopeManager, has anyone looked at what happens with that stack? In Jaeger's previous context manager impl we had a clear() method that would always be called by the server endpoint, but ScopeManager doesn't have it.

@felixbarny
Copy link

Could you elaborate on the "Application using a thread pool" use case? Why don't you have one Span per thread/callback?

Ensuring visibility across threads is, of course, important. Especially when the spans are reported asynchronously in a different thread than in the one they have been recorded. But that can be assured using cheaper mechanisms, like memory barriers. Not quite sure why requiring full thread safety is needed. Did you do some benchmarks to get a feel for the costs of this? This should go into the risk assessment IMHO.

@wu-sheng
Copy link
Member

Why don't you have one Span per thread/callback?

SkyWalking did this too.

Did you do some benchmarks to get a feel for the costs of this?

Please consider the benchmark scenario should be collecting more than 6000 calls per second, each call has at least 15 spans. That is at least what SkyWalking need and promise to support in product use.

That is also my performance concern comes.

@carlosalberto
Copy link
Contributor Author

carlosalberto commented May 31, 2018

Why don't you have one Span per thread/callback?

Sure, this is the route taking by some applications or tracing designs (a new Span per thread) - but some other times you will end up creating a Span in a thread, pass it to next thread (maybe through a callback), one at a time, and finish it when the last point is reached.

So the point of this PR is precisely to get this discussion going, and get some feedback ;)

Please consider the benchmark scenario should be collecting more than 6000 calls per second, each call has at least 15 spans.

I know other tracing systems (Jaeger, LightStep for a start) do thread-safe Spans. Wondering about the performance scenarios (if any, and if so, which ones, etc).

@carlosalberto
Copy link
Contributor Author

Hey @yurishkuro

with the async frameworks using ScopeManager, has anyone looked at what happens with that stack? In Jaeger's previous context manager impl we had a clear() method that would always be called by the server endpoint

Oh, there was some basic experimentation with Akka (from my side at least), which ended up me being re-introducing the AutoFinishScopeManager intoopentracing-util. That being said, maybe it's time to re-check it (the clear() method would make lots of sense to me, btw).

@yurishkuro
Copy link
Member

@carlosalberto would be useful to run Akka etc. examples under profiler to check for memory leaks, because as far as I can tell a naive way of starting non-auto-finish span/scope might leave the scope on the stack forever.

@carlosalberto
Copy link
Contributor Author

Hey all!

We would like to have a Cross-Language group call on July 13th (next week) and discuss this item as part of it. Please let me know if you guys can make it. I'm specially interested, if possible, in having @wu-sheng and @felixbarny if possible ;)

We will also be probably discussing it over Gitter this week, prior to the call ;)

@tedsuo
Copy link
Member

tedsuo commented Jul 12, 2018

@felixbarny I agree that "Thread safety" should be clarified.

In this case, it should just mean "safety," in the sense that it exceptions will not occur, memory will not be corrupted, and functionality will continue to work. There is no need to guarantee any ordering of operations between threads, etc. In practice there could be more efficient safety guarantees than locks.

@yurishkuro yurishkuro closed this May 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants